20 July 2014

Get S.M.A.R.T.

As followers of this blog may know, I've been having a cacophony of hardware problems lately. Most of them revolve around that one inevitability of packing more and more data into tinier and tinier spaces: hard disk corruption. I've been busy moving my vital data onto an older machine of mine and setting it up to host all my source code, so now is a great time to get paranoid about disk integrity.

Predictable Failure

Up till now, most of my posts have centred around identifying files that are already broken, or data rescue attempts. But is there no way to notice the iceberg approaching before we hit it? Practically all modern hard disk drives support a monitoring system known as S.M.A.R.T. It stands for Self-Monitoring, Analysis and Reporting Technology, which is clearly a backronym. Anyway, S.M.A.R.T. lives in the firmware of your drives and keeps records of things that go wrong, or are abnormal, during the drive's operation. It also keeps statistics on the sorts of things that may indicate future problems - operating temperature, how many hours the drive has been powered on, that sort of thing. In Linux, we can access this data with the smartmontools package.

$ sudo apt-get install smartmontools

We can check what drives it can work with using the --scan option:-

$ sudo smartctl --scan

/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
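
If you want to feed that list into a script, the device path is just the first word on each line. Here's a minimal sketch; list_smart_devices is a name I've made up for illustration, and it assumes the scan output looks like the lines above.

```shell
# Print only the device paths from `smartctl --scan` output read on stdin.
# Each scan line is assumed to look like:
#   /dev/sda -d scsi # /dev/sda, SCSI device
list_smart_devices() {
    awk '/^\/dev\// { print $1 }'
}

# Example use (needs root and smartmontools installed):
#   sudo smartctl --scan | list_smart_devices
```

That should print one device path per line, ready to be looped over.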

We can also dump a huge amount of statistics with the --all option:-

$ sudo smartctl --all /dev/sda

What you get back will vary considerably based on your drive's manufacturer and model, and may not contain much if the drive has never run any self-tests. So how do we do those tests, and can we get a simpler "good or bad" estimate from the data?

Work smarter, not harder

We absolutely can. The simplest way to go about this is to directly tell the drive to go and do a self-test for a while, and then check back later to see if it's finished. These self-tests can be run while a drive is in normal operation, although frequent disk I/O will interrupt things and slow the process down - run it when you aren't expecting much activity. The 'short' test should take no more than 10 minutes or so, and the 'long' test can run for several hours depending on disk size. Only one test can be run at a time (per drive, so if you have multiple drives feel free to run the tests in parallel).

$ sudo smartctl -t long /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-24-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 92 minutes for test to complete.
Test will complete after Sun Jul 20 15:19:58 2014

Use smartctl -X to abort test.

(Substitute 'short' for 'long' to run the shorter test, naturally.)
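
Rather than reading the estimate off the screen, a script can pick it out of that output and sleep until the test should be done. This is only a sketch: selftest_wait_minutes is a hypothetical helper, and it assumes the "Please wait N minutes" line is worded exactly as in the output above, which other smartctl versions may not guarantee.

```shell
# Read `smartctl -t ...` output on stdin and print the estimated wait in
# minutes. Assumes the line is worded as in the sample output:
#   Please wait 92 minutes for test to complete.
selftest_wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes for test to complete\.$/\1/p'
}

# Example use (needs root; starts a real self-test on the drive):
#   minutes=$(sudo smartctl -t short /dev/sda | selftest_wait_minutes)
#   sleep "$((minutes * 60))"
```

If the line isn't found, the function prints nothing, so it's worth checking for an empty result before sleeping on it.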

After the test is complete, we can check the drive's logs to see how it went:-

$ sudo smartctl --log selftest /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-24-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     21504         -
# 2  Short offline       Completed without error       00%     21501         -
# 3  Extended offline    Aborted by host               90%     21501         -
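
That table is easy enough to read by eye, but a script can pull out the verdict too. A rough sketch, assuming the newest test is always entry "# 1" and the columns are separated by two or more spaces, as in the output above (latest_selftest_status is a made-up name):

```shell
# Read `smartctl --log selftest` output on stdin and print the Status
# column of the most recent entry (the row starting with "# 1").
# Columns are assumed to be separated by runs of two or more spaces.
latest_selftest_status() {
    awk -F '  +' '/^# 1 / { print $3 }'
}

# Example use (needs root):
#   sudo smartctl --log selftest /dev/sda | latest_selftest_status
```

Comparing the result against the string "Completed without error" gives a simple pass/fail for the last test.
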
"Completed without error", that's what I like to see. But overall, how is the drive doing? If we need a simple judgement on the health of the drive, and can't make sense of all the stats --all gives us, we can use the --health option.

$ sudo smartctl --health /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-24-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022   061   044   045    Old_age   Always   In_the_past 39 (Min/Max 30/43)
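
Passed. Excellent. It's also nice that smartctl points out that one of the statistics has caught its eye as being a little abnormal: Airflow_Temperature_Cel. For these normalised values, lower is worse. The worst value ever recorded (044) dipped just below the threshold (045), which is why the attribute is marked as having failed "In_the_past", even though the current value (061) is comfortably clear. So at one point the drive was getting a little hot, or perhaps dust was obstructing good airflow.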
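
The same comparison can be done on the full attribute table. Here's a sketch that lists every attribute whose WORST value has been at or below its THRESH; attrs_below_threshold is a hypothetical name, the column positions are taken from the header line above, and treating a threshold of 000 as "no threshold" is my own assumption.

```shell
# Read a smartctl attribute table on stdin and print each attribute whose
# WORST value is at or below a non-zero THRESH.
# Column layout assumed from the header shown above:
#   ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
attrs_below_threshold() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && ($6 + 0) > 0 && ($5 + 0) <= ($6 + 0) {
        print $2, "worst=" $5, "thresh=" $6
    }'
}

# Example use (needs root):
#   sudo smartctl --attributes /dev/sda | attrs_below_threshold
```

No output means nothing in the table has ever crossed its threshold.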

It's important to note that while checking S.M.A.R.T. statistics and running self-tests regularly are good habits, they won't stop a disk from suddenly failing one night with no prior warning. Keep regular backups of everything, and test both the data and the backups for any signs of degradation. I Totally Intend To Create And Then Blog About My Backup Solution, but there's really no excuse for not having it done yesterday.
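
For regular, unattended checks, smartctl's exit status is handier than its text output: it's a bitmask with one bit per kind of problem. Below is a sketch that turns it into words. describe_smartctl_status is a name I've invented, and the bit meanings are paraphrased from the EXIT STATUS section of the smartctl man page, so double-check them against your installed version.

```shell
# Print one line per problem bit set in a smartctl exit status.
# Returns 0 if the status is 0 (nothing reported), 1 otherwise.
# Bit meanings paraphrased from the smartctl man page (EXIT STATUS).
describe_smartctl_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no problems reported"
        return 0
    fi
    if [ $((status & 1)) -ne 0 ]; then echo "command line did not parse"; fi
    if [ $((status & 2)) -ne 0 ]; then echo "device open failed"; fi
    if [ $((status & 4)) -ne 0 ]; then echo "a SMART command failed"; fi
    if [ $((status & 8)) -ne 0 ]; then echo "health check returned DISK FAILING"; fi
    if [ $((status & 16)) -ne 0 ]; then echo "prefail attribute at or below threshold"; fi
    if [ $((status & 32)) -ne 0 ]; then echo "attribute was at or below threshold in the past"; fi
    if [ $((status & 64)) -ne 0 ]; then echo "error log contains errors"; fi
    if [ $((status & 128)) -ne 0 ]; then echo "self-test log contains errors"; fi
    return 1
}

# Example use (needs root):
#   sudo smartctl --health /dev/sda >/dev/null
#   describe_smartctl_status "$?"
```

Dropped into a nightly job, anything other than "no problems reported" is a cue to go and look at the drive properly.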
