Up till now, most of my posts have centred on identifying files that are already broken, or on data rescue attempts. But is there no way to notice the iceberg approaching before we hit it? All modern hard disk drives support a monitoring system known as S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology, which is clearly a backronym). S.M.A.R.T. is part of your drive's firmware: it keeps records of things that go wrong, or are abnormal, during the drive's operation. It also keeps statistics on the sorts of things that may indicate future problems - operating temperature, how many hours the drive has been powered on, that sort of thing. In Linux, we can access this data with the smartmontools package.
sudo apt-get install smartmontools
We can check what drives it can work with using the --scan option:-
$ sudo smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
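If you want to feed that list into a script, the device path is simply the first word on each line. Here's a minimal sketch of pulling the paths out; scanned_devices is a name I made up for illustration, and it deliberately ignores the -d type column, which some controllers may need passed back to smartctl.

```shell
# Hypothetical helper: print only the device paths from `smartctl --scan`
# output supplied on stdin. Lines that don't start with a path are skipped.
scanned_devices() {
    awk '$1 ~ /^\// { print $1 }'
}

# Intended use (needs root and real drives, so not run here):
#   sudo smartctl --scan | scanned_devices
```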
And dump a huge amount of statistics with the --all option:-
sudo smartctl --all /dev/sda
What you get back will vary considerably based on your drive's manufacturer and model, and may not contain much if the drive has never run any self-tests. So how do we do those tests, and can we get a simpler "good or bad" estimate from the data?
Work smarter, not harder
We absolutely can. The simplest way to go about this is to directly tell the drive to go and do a self-test for a while, and then check back later to see if it's finished. These self-tests can be run while a drive is in normal operation, although frequent disk I/O will interrupt things and slow the process down - run it when you aren't expecting much activity. The 'short' test should take no more than 10 minutes or so, and the 'long' test can run for several hours depending on disk size. Only one test can be run at a time (per drive, so if you have multiple drives feel free to run the tests in parallel).
$ sudo smartctl -t long /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-24-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 92 minutes for test to complete.
Test will complete after Sun Jul 20 15:19:58 2014
Use smartctl -X to abort test.
(Substitute 'short' for 'long' to run the shorter test, naturally)
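Since the one-test-at-a-time limit is per drive, kicking off a test on several drives is just a loop. This is a rough sketch rather than anything polished: start_selftests and run_smartctl are names I invented, and the wrapper exists only so the sudo call can be swapped out for a dry run.

```shell
# Hypothetical wrapper around the real command, kept separate so it can be
# redefined (for example to `echo`) when you only want to see what would run.
run_smartctl() {
    sudo smartctl "$@"
}

# Start the same kind of self-test ('short' or 'long') on every drive given.
# Each drive runs its own test, so these proceed in parallel on the hardware.
start_selftests() {
    kind=$1
    shift
    for dev in "$@"; do
        run_smartctl -t "$kind" "$dev"
    done
}

# Intended use (needs root and real drives, so not run here):
#   start_selftests short /dev/sda /dev/sdb /dev/sdc
```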
After the test is complete, we can check the drive's logs to see how it went:-
$ sudo smartctl --log selftest /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-24-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     21504         -
# 2  Short offline       Completed without error       00%     21501         -
# 3  Extended offline    Aborted by host               90%     21501         -
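The most recent test is entry # 1, so a script only really needs to look at that line. Below is a small sketch of such a check; latest_selftest_ok is a hypothetical name, and it assumes the log layout shown above, which may differ on other drives or smartctl versions.

```shell
# Hypothetical helper: read a `smartctl --log selftest` dump on stdin and
# succeed only if the newest entry (the line starting "# 1") reports
# "Completed without error". A missing entry counts as failure.
latest_selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}

# Intended use (needs root and a real drive, so not run here):
#   sudo smartctl --log selftest /dev/sda | latest_selftest_ok && echo "last test was clean"
```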
"Completed without error", that's what I like to see. But overall, how is the drive doing? If we need a simple judgement on the health of the drive, and can't make sense of all the stats --all gives us, we can use the --health option.
$ sudo smartctl --health /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-24-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   061   044   045    Old_age   Always   In_the_past 39 (Min/Max 30/43)
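Passed. Excellent. It's also nice that smartctl points out that one of the statistics has caught its eye as being a little abnormal: Airflow_Temperature_Cel. The WHEN_FAILED column says In_the_past because the worst normalised value on record (044) dipped just below the threshold (045), while the current value (061) is comfortably back above it. So at one point the drive was getting a little hot, or perhaps dust was obstructing good airflow.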
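For scripts, smartctl also reports its verdict through its exit status, which the man page documents as a bitmask. Here's a sketch that spells out which bits are set; explain_smartctl_status is my own hypothetical helper, and the descriptions are my paraphrases of smartctl(8) rather than its exact wording. On a drive like mine I'd expect bit 5 (a value of 32), though I haven't shown that above.

```shell
# Hypothetical helper: describe each bit set in a smartctl exit status.
# Bit meanings are paraphrased from the EXIT STATUS section of smartctl(8).
explain_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems flagged"
        return 0
    fi
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed" \
        "a SMART command failed or a checksum was bad" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes at or below threshold in the past" \
        "error log contains errors" \
        "self-test log contains errors"
    do
        # Test one bit at a time, lowest first.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}

# Intended use (needs root and a real drive, so not run here):
#   sudo smartctl --health /dev/sda; explain_smartctl_status $?
```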
It's important to note that while checking S.M.A.R.T. statistics and running self-tests regularly are good habits, they don't prevent a disk from suddenly failing one night with no prior warning. Keep regular backups of everything, and test the data and the backups for any signs of degradation. I Totally Intend To Create And Then Blog About My Backup Solution, but there's really no excuse for not having it done yesterday.