How to check a disk after a RAID error

Once in a while my 3ware SATA RAID controller reports a hard disk error. Often the disk can be checked manually and turns out to still be usable. This is a short walkthrough of how to check the drive.

Warning: Do not simply copy&paste from here. Make sure you have a backup and take care to verify the correct controller/unit/disk IDs.

Situation: The RAID controller (I have 3ware 8506s) detected a read error on one drive and has already replaced it with the spare drive. The resulting configuration looks like this (p1 is the supposedly failed drive):

# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache AVrfy
------------------------------------------------------------------------------
u1    RAID-5    OK             -       -       64K     298.099   ON     -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u1     149.05 GB   312581808     S0D4J1KP125549
p1     OK               -      189.92 GB   398297088     B4217WMH
p2     OK               u1     149.05 GB   312581808     V30DSKAG
p3     OK               u1     152.67 GB   320173056     Y45SBB9E

You can always use smartctl to get further info about the impaired drive and to run a self-test on it:

# smartctl -a --device=3ware,1 /dev/twe0
# smartctl --test=long --device=3ware,1 /dev/twe0

The info (-a) contains the self-test log, which looks like the following table. The log lists the most recent test first: entry #3 shows a test that ended with a read failure, while the later tests #1 and #2 completed without errors:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13775         -
# 2  Extended offline    Completed without error       00%     13775         -
# 3  Short offline       Completed: read failure       60%     13772         9518783
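
To print just the self-test log instead of the full -a output, smartctl can be asked for that section directly (same 3ware device syntax as above):

# smartctl -l selftest --device=3ware,1 /dev/twe0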

Another approach is to use dd for testing. To make the drive “OS-accessible” one can set its type to JBOD:

# tw_cli /c0 add type=jbod disk=1
Creating new unit on controller /c0 ...  Done. The new unit is /c0/u0.

The assigned device name shows up in the system log:

twed1: <Unit 0, JBOD, Normal> on twe0
twed1: 194481MB (398297088 sectors)
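
If those messages have already scrolled by, the kernel message buffer still has them:

# dmesg | grep twed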

And now:

# dd if=/dev/twed1 of=/dev/null bs=1M
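
Instead of (or before) scanning the whole disk, one can also probe the exact area the self-test complained about. A minimal sketch, assuming 512-byte sectors and the LBA 9518783 reported in the SMART log above, reading a few sectors around that spot:

# dd if=/dev/twed1 of=/dev/null bs=512 skip=9518780 count=8

If dd reports an I/O error here, that is the bad sector. For the full-disk run, FreeBSD's dd (the twed device name suggests FreeBSD) prints its progress when you press Ctrl+T; GNU dd accepts status=progress.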

Since the disk is part of a RAID setup and the array no longer depends on its contents, one can also write the whole disk instead of reading it. A write should also “fix” bad sectors, meaning the drive’s controller will internally remap them to spare sectors. The original data on those sectors is lost, but later reads and writes will work correctly. This is what happened to the drive with the status log above (between tests #3 and #2).
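
A minimal sketch of such a write pass, assuming the JBOD unit is still accessible as /dev/twed1 (this irrevocably overwrites everything on the disk):

# dd if=/dev/zero of=/dev/twed1 bs=1M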

When dd completes without errors, one can remove the JBOD assignment and add the disk again as a spare drive:

# tw_cli /c0/u0 del
# tw_cli /c0 add type=spare disk=1
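
To verify the result, list the configuration again; the port should now show up assigned as a spare:

# tw_cli /c0 show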

Last note: so far I have only had to do this on RAID drives, where I could simply change the type to JBOD and overwrite the whole disk. For tips on how to recover sectors inside a live file system, the Bad block HOWTO for smartmontools is a good place to start.
