Master Linux server maintenance by using SMART monitoring and Prometheus. Prevent downtime with automated alerts for disk failure prevention.
I remember the exact moment I stopped trusting my hard drives. It was 3:00 AM on a Tuesday, and a RAID array in a production database server decided to drop two drives simultaneously. We didn't have proactive monitoring, so the first sign of trouble wasn't a warning—it was an application timeout. Since then, I’ve made Linux server maintenance a non-negotiable part of my infrastructure stack, specifically focusing on disk health before the kernel starts throwing I/O errors.
If you aren't tracking the S.M.A.R.T. status of your drives, you’re basically flying blind. Relying on "it worked yesterday" is a recipe for a bad weekend.
Most modern drives perform self-tests and keep internal logs of their health. We use smartmontools to extract this data. It’s the industry standard for a reason: it’s lightweight, reliable, and provides the raw data needed to predict failure.
We don't just want to know if a drive is "OK." We want to know if the Reallocated Sector Count is climbing or if the drive is reporting thermal issues.
First, ensure smartmontools is installed on your host. On Debian/Ubuntu systems, it’s a quick apt install smartmontools. You’ll need to enable the smartd daemon to poll your hardware regularly.
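Once the package is installed, a quick way to eyeball a single attribute such as the Reallocated Sector Count is to filter the attribute table by hand. The helper below is a minimal sketch: the function name `smart_attr_raw` is my own, and it assumes the classic ATA attribute table printed by `smartctl -A`, where the raw value is the tenth column.

```shell
# smart_attr_raw NAME
# Read `smartctl -A` output on stdin and print the raw value (column 10 of
# the ATA attribute table) for the attribute called NAME.
# Prints nothing if the attribute is not present.
smart_attr_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Typical use on a real host (needs root and smartmontools):
#   sudo smartctl -A /dev/sda | smart_attr_raw Reallocated_Sector_Ct
```

Saving that number once a day and comparing it with the previous value is usually enough to tell whether the count is climbing. NVMe and SCSI devices print a different layout, so this filter would not apply to them as written.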
Edit /etc/smartd.conf to include your drives. A simple config looks like this:
    /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03)
    /dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03)
This tells smartd to monitor all SMART attributes (-a), enable automatic offline data collection (-o on) and attribute autosave (-S on), and run a short self-test daily at 2 AM and a long self-test every Saturday at 3 AM.
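The -s argument is an extended regular expression that smartd matches against a 12-character string of the form T/MM/DD/d/HH (test type, month, day, weekday with 1 = Monday, hour). Before deploying a schedule I find it useful to sanity-check the regex against a few timestamps. This is only a sketch that imitates the matching with grep; `schedule_fires` is a hypothetical helper, not part of smartmontools.

```shell
# schedule_fires REGEX TYPE MONTH DAY WEEKDAY HOUR
# Build the "T/MM/DD/d/HH" string for one moment in time and test whether
# the whole string matches REGEX (grep -x forces a full-line match).
# Weekday is 1 (Monday) through 7 (Sunday); all numbers are zero-padded.
schedule_fires() {
    regex=$1
    shift
    printf '%s/%s/%s/%s/%s\n' "$1" "$2" "$3" "$4" "$5" | grep -Eqx -- "$regex"
}

# Example: would a long test start on Saturday 18 January at 03:00?
#   schedule_fires '(S/../.././02|L/../../6/03)' L 01 18 6 03 && echo yes
```

The function returns success when the schedule would fire and failure otherwise, so it drops straight into an if statement or a small loop over the hours of a week.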
Once smartd is running, we need to get that data into our observability pipeline. This is where Node Exporter shines. If you're already using it for standard metrics, adding SMART data is seamless.
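You’ll need a collector that actually exposes SMART data. In my setup that means starting Node Exporter with the --collector.smartctl flag. Not every build ships a collector by that name, so check node_exporter --help first; if yours doesn’t, the usual alternatives are the textfile collector fed by a smartctl script, or the standalone smartctl_exporter.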
    # Example systemd service override (drop-in)
    [Service]
    # Clear the packaged ExecStart before setting a new one
    ExecStart=
    ExecStart=/usr/local/bin/node_exporter --collector.smartctl
When this flag is active, Node Exporter executes smartctl commands and exports the data as Prometheus metrics. If you’re curious about how this fits into a wider monitoring strategy, I’ve previously written about GitOps-Driven Observability: Implementing SLO-Based Alerting with OpenSLO, which helps turn these raw metrics into meaningful service objectives.
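If your exporter build has no SMART collector at all, one fallback is Node Exporter's textfile collector: a root cron job writes a small .prom file and Node Exporter serves it. The sketch below shows only the formatting step. `health_metric` is a hypothetical helper, the metric name is borrowed from this article purely for illustration, and the output directory in the comment is an assumed path that you would pass to --collector.textfile.directory.

```shell
# health_metric DEVICE
# Read `smartctl -H` output on stdin and print one Prometheus sample:
# 1 if the drive reports PASSED (ATA/NVMe wording) or OK (SCSI wording),
# 0 for anything else, including unreadable or missing output.
health_metric() {
    if grep -Eq 'test result: PASSED|SMART Health Status: OK'; then
        value=1
    else
        value=0
    fi
    printf 'smartctl_device_smart_healthy{device="%s"} %s\n' "$1" "$value"
}

# Typical use from a root cron job (directory is an assumption):
#   dir=/var/lib/node_exporter/textfile
#   smartctl -H /dev/sda | health_metric /dev/sda > "$dir/smart.prom.tmp" \
#     && mv "$dir/smart.prom.tmp" "$dir/smart.prom"
```

Writing to a temporary file and renaming it keeps Node Exporter from scraping a half-written file. Treating unreadable output as 0 is a deliberately pessimistic choice: a drive that stops answering SMART queries is worth a look anyway.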
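Having metrics in Prometheus is useless if you aren't alerting on them. I prefer to alert on the smartctl_device_smart_healthy metric. The exact metric name depends on the exporter and version you run, so confirm it against your /metrics output before copying the rule.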
If this value drops to 0, you have a problem. However, don't wait for a total failure. Alert on the "Pre-fail" attributes too. Create an alerting rule in your alert.rules file:
    groups:
      - name: disk_health
        rules:
          - alert: DiskHealthDegraded
            expr: smartctl_device_smart_healthy == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Disk failure imminent on {{ $labels.instance }}"
              description: "SMART health check failed for device {{ $labels.device }}."
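That rule only covers the drive's overall verdict. For the Pre-fail attributes, the underlying check is simple: an attribute counts as failed when its normalized VALUE is at or below its THRESH. Here is a rough shell cross-check I would run by hand on a suspicious host; `prefail_at_risk` is a hypothetical helper and it assumes the ATA attribute table layout from `smartctl -A`.

```shell
# prefail_at_risk
# Read `smartctl -A` output on stdin and print the name of every Pre-fail
# attribute whose normalized VALUE (column 4) is at or below its THRESH
# (column 6), or that the drive already flags in WHEN_FAILED (column 9).
prefail_at_risk() {
    awk '$7 == "Pre-fail" && (($4 + 0) <= ($6 + 0) || $9 != "-") { print $2 }'
}

# Typical use on a real host (needs root and smartmontools):
#   sudo smartctl -A /dev/sda | prefail_at_risk
```

Empty output means no Pre-fail attribute has crossed its threshold. It says nothing about a raw counter that is still rising toward it, which is why tracking the raw values over time remains worthwhile.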
We once tried a custom bash script to email us when smartctl output contained the word "FAIL." It worked until the drive controller reset and the script missed the log rotation. Switching to the Prometheus ecosystem for disk failure prevention gave us a unified view. It’s significantly more reliable than stitching together disparate scripts.
I've learned that hardware is messy. Sometimes a drive reports a "failed" status during a heavy write load due to a temporary thermal spike, then clears itself. This is why I set the for: 5m duration in my alerts—it filters out the noise.
If you are managing VPS instances, you might also find Uptime Kuma Self-Hosted Monitoring: A Simple Guide for VPS Health useful for keeping an eye on the high-level service status, but for actual hardware longevity, you need the granularity that only smartmontools provides.
I’m still not perfect at this. I recently had a drive that showed zero SMART errors but developed bad sectors that caused the filesystem to remount as read-only. SMART isn't a silver bullet; it’s a warning system.
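Because SMART stayed silent in that case, the read-only remount itself was the only signal. A small check along these lines could catch it; this is a sketch under assumptions: `ro_mounts` is a hypothetical helper, and the filesystem types it looks at (ext4, xfs, btrfs) are my guess at what matters on a typical server.

```shell
# ro_mounts
# Read /proc/mounts-style lines on stdin and print the mount point of every
# ext4/xfs/btrfs filesystem whose option list contains the exact option "ro".
# Options are split on commas so that "errors=remount-ro" does not match.
ro_mounts() {
    awk '$3 ~ /^(ext4|xfs|btrfs)$/ {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }'
}

# Typical use on a real host:
#   ro_mounts < /proc/mounts
```

Any output here on a filesystem you expect to be writable deserves the same urgency as a failed SMART check.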
Always maintain backups. Use this setup to catch the failures that give advance warning, but treat your data as if the hardware could vanish at any second. If you’re looking to scale this, I’d suggest looking at how you handle node-level automation, as Kubernetes Cluster API: Automating Node Upgrades with CAPI can help you cycle out nodes that report hardware degradation before they cause a production outage.
What’s your strategy for handling drives that pass SMART tests but still fail? I'm still experimenting with ZFS scrub logs to catch those edge cases.
Q: Does running smartctl frequently hurt the drive?
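A: No. Querying health and attributes with smartctl only reads the drive's internal health log, which is a low-impact operation and far less stressful than a full file system scan. Self-tests are different: a long test reads the whole surface and adds real load, which is why they are scheduled for quiet hours.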
Q: Can I use this on NVMe drives?
A: Yes, smartmontools supports NVMe. The metrics might look slightly different (e.g., smartctl_nvme_critical_warning), but the logic remains the same.
Q: Do I need to be root to run these commands?
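A: Yes, smartctl requires elevated privileges. Give the exporter user a narrowly scoped sudoers rule for smartctl, or run the collection step as root. Membership in the disk group alone is often not enough, because the SMART passthrough commands typically need raw I/O capabilities in addition to device file access.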