How to check a disk after a RAID error

Once in a while my 3ware SATA RAID controller reports a hard disk error. Often the disk can be checked manually and turns out to still be usable. This is a short walkthrough of how to check the drive.

Warning: Do not simply copy&paste from here. Make sure you have a backup and take care to verify the correct controller/unit/disk IDs.

Situation: The RAID controller (I have 3ware 8506s) detected a read error on one drive and has already replaced it with the spare drive. The resulting configuration looks like this (p1 is the supposedly failed drive):

# tw_cli /c0 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache AVrfy
------------------------------------------------------------------------------
u1    RAID-5    OK             -       -       64K     298.099   ON     -

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u1     149.05 GB   312581808     S0D4J1KP125549
p1     OK               -      189.92 GB   398297088     B4217WMH
p2     OK               u1     149.05 GB   312581808     V30DSKAG
p3     OK               u1     152.67 GB   320173056     Y45SBB9E

You can always use smartctl to get further info about the impaired drive and to run a self-test on it:

# smartctl -a --device=3ware,1 /dev/twe0
# smartctl --test=long --device=3ware,1 /dev/twe0

The info (-a) contains the self-test log, which looks like the following table. The log lists the most recent test first: entry #3 shows a test that ended with a read failure, while the later tests #1 and #2 completed without errors:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13775         -
# 2  Extended offline    Completed without error       00%     13775         -
# 3  Short offline       Completed: read failure       60%     13772         9518783
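
To print just the self-test log instead of the full -a output, smartctl can be asked for that section directly (same 3ware device syntax as above):

# smartctl -l selftest --device=3ware,1 /dev/twe0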

Another approach is to use dd for testing. To make the drive “OS-accessible” one can set its type to JBOD:

# tw_cli /c0 add type=jbod disk=1
Creating new unit on controller /c0 ...  Done. The new unit is /c0/u0.

The assigned device name shows up in the system log:

twed1: <Unit 0, JBOD, Normal> on twe0
twed1: 194481MB (398297088 sectors)
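
If those messages have already scrolled by, the kernel message buffer still has them:

# dmesg | grep twed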

And now:

# dd if=/dev/twed1 of=/dev/null bs=1M
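
Instead of (or before) scanning the whole disk, one can also probe the exact area the self-test complained about. A minimal sketch, assuming 512-byte sectors and the LBA 9518783 reported in the SMART log above, reading a few sectors around that spot:

# dd if=/dev/twed1 of=/dev/null bs=512 skip=9518780 count=8

If dd reports an I/O error here, that is the bad sector. For the full-disk run, FreeBSD's dd (the twed device name suggests FreeBSD) prints its progress when you press Ctrl+T; GNU dd accepts status=progress.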

Since the disk is part of a RAID setup and the array no longer depends on its contents, one can also write the whole disk instead of reading it. A write should also “fix” bad sectors, meaning the drive’s controller will internally remap them to spare sectors. The original data on those sectors is lost, but later reads and writes will work correctly. This is what happened to the drive with the status log above (between tests #3 and #2).
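
A minimal sketch of such a write pass, assuming the JBOD unit is still accessible as /dev/twed1 (this irrevocably overwrites everything on the disk):

# dd if=/dev/zero of=/dev/twed1 bs=1M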

When dd completes without errors, one can remove the JBOD assignment and add the disk again as a spare drive:

# tw_cli /c0/u0 del
# tw_cli /c0 add type=spare disk=1
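
To verify the result, list the configuration again; the port should now show up assigned as a spare:

# tw_cli /c0 show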

Last note: so far I have only had to do this on RAID drives, where I could simply change the type to JBOD and overwrite the whole disk. For tips on how to recover sectors inside a live file system, the Bad block HOWTO for smartmontools is a good place to start.
