How to check a disk after a RAID error
Once in a while my 3ware SATA RAID controller reports a hard disk error. Often the disk can be checked manually and turns out to be still usable. This is a short walkthrough of how to check the drive.
Warning: Do not simply copy and paste from here. Make sure you have a backup and take care to verify the correct controller/unit/disk IDs.
Situation: The RAID controller (I have 3ware 8506s) detected a read error on one drive and already replaced it with the spare drive. The resulting configuration (p1 is the supposedly failed drive):
# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u1    RAID-5    OK             -       -       64K     298.099   ON     -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u1     149.05 GB   312581808     S0D4J1KP125549
p1     OK               -      189.92 GB   398297088     B4217WMH
p2     OK               u1     149.05 GB   312581808     V30DSKAG
p3     OK               u1     152.67 GB   320173056     Y45SBB9E
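To double-check which port is no longer part of a unit, the port table can be filtered mechanically. This is only a minimal sketch: `unassigned_ports` is a hypothetical helper name, and it assumes the port table has the column layout shown above (port, status, unit as the first three fields).

```shell
# Hypothetical helper: reads `tw_cli /c0 show` output on stdin and prints
# every port whose Unit column is "-", i.e. not assigned to any unit.
# Assumes the column order Port, Status, Unit shown in the output above.
unassigned_ports() {
    awk '$1 ~ /^p[0-9]+$/ && $3 == "-" { print $1 }'
}
```

It would be used as `tw_cli /c0 show | unassigned_ports`, which for the configuration above should name only p1. Treat it as a cross-check, not a replacement for reading the table yourself.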
It is always possible to use smartctl for further info and to run a self-test on the impaired drive:
smartctl -a --device=3ware,1 /dev/twe0
smartctl --test=long --device=3ware,1 /dev/twe0
The info (-a) contains the test result, which will look like the following table. Entry #3 shows a test with read failure while later tests #1 and #2 are without errors:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13775         -
# 2  Extended offline    Completed without error       00%     13775         -
# 3  Short offline       Completed: read failure       60%     13772         9518783
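When the log gets longer, it can help to show only the entries that did not complete cleanly. A minimal sketch, assuming the log lines start with `#` followed by the test number as in the table above; `failed_selftests` is a hypothetical helper name, not part of smartmontools.

```shell
# Hypothetical helper: reads `smartctl -a` output on stdin and prints the
# self-test log entries whose status is anything other than
# "Completed without error" (read failures, aborted tests, and so on).
failed_selftests() {
    awk '/^# *[0-9]+ / && !/Completed without error/ { print }'
}
```

It would be used as `smartctl -a --device=3ware,1 /dev/twe0 | failed_selftests`. For the log above it should print only entry #3; empty output means every logged test completed without error.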
Another approach is to use dd for testing. To make the drive “OS-accessible” one can set its type to JBOD:
# tw_cli /c0 add type=jbod disk=1
Creating new unit on controller /c0 ...  Done. The new unit is /c0/u0.
The assigned device name shows up in the system log:
twed1: <Unit 0, JBOD, Normal> on twe0
twed1: 194481MB (398297088 sectors)
Now read the whole disk:
# dd if=/dev/twed1 of=/dev/null bs=1M
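Because a full read can take hours, it may be convenient to let the shell report the result. The following is a sketch that relies on dd exiting with a non-zero status when a read fails; `read_check` is a hypothetical helper name, and the block size is written in bytes only to stay portable across dd implementations.

```shell
# Hypothetical helper: reads the given device (or file) from start to end
# and reports whether dd finished without an error.
# dd's own statistics are discarded; only its exit status is used.
read_check() {
    if dd if="$1" of=/dev/null bs=1048576 2>/dev/null; then
        echo "read OK: $1"
    else
        echo "read FAILED: $1"
        return 1
    fi
}
```

It would be called as `read_check /dev/twed1`. To see which sector failed, run dd directly as shown above or check the system log, since this wrapper hides dd's messages.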
On a RAID system one can also write to the whole disk instead of reading it, because the controller has already rebuilt the array onto the spare and the data on this drive is no longer needed. A write should also “fix” bad sectors, meaning the drive’s controller will internally map them to spare sectors. The original data will be lost, but later reads and writes will work correctly. This is what happened to the drive with the status log above (between #3 and #2).
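A write pass could look like the sketch below. It is destructive by design and assumes the target really is the JBOD unit created for the suspect drive. `zero_fill` and its optional block count are hypothetical names introduced here so the function can be tried on a scratch file first.

```shell
# DESTRUCTIVE: overwrites the target with zeros, starting at its beginning.
# Hypothetical helper: $1 is the device or file, $2 an optional number of
# 1 MiB blocks (an assumption added for trial runs on a scratch file).
# Without $2, dd keeps writing until the target is full and then stops
# with an end-of-device style error, which is expected for a whole disk.
zero_fill() {
    target=$1
    count=${2:-}
    if [ -n "$count" ]; then
        dd if=/dev/zero of="$target" bs=1048576 count="$count"
    else
        dd if=/dev/zero of="$target" bs=1048576
    fi
}
```

A trial run such as `zero_fill /tmp/scratch.img 2` writes a 2 MiB file; `zero_fill /dev/twed1` would wipe the whole drive. Since the whole-disk case ends with an error by design, a read pass afterwards is a more meaningful check than dd's exit status.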
When dd completes without errors one can remove the JBOD assignment and add the disk again as a spare drive:
# tw_cli /c0/u0 del
# tw_cli /c0 add type=spare disk=1
Last note: So far I have only had to do this on RAID drives where I could simply change the type to JBOD and overwrite the whole disk. For tips on how to recover sectors inside a live file system, the Bad block HOWTO for smartmontools is a good place to start.