How to Replace a Failed Disk

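Steps to replace a failed hard drive in a software RAID 1 (mdadm) configuration. Start by checking the array status; a failed member is flagged (F), and [U_] means only one of the two members is up: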

[root@myServer ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
128384 blocks [2/1] [U_]
md1 : active raid1 sdb2[0] sda2[2](F)
16779776 blocks [2/1] [U_]
md2 : active raid1 sdb3[0] sda3[2](F)
139379840 blocks [2/1] [U_]
unused devices: <none>
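The (F) flags in the output above can also be picked out mechanically. The following is a minimal sketch, not part of the original procedure; the function name failed_members is hypothetical, and it assumes the array lines have the same layout as the mdstat output shown above.

```shell
#!/bin/sh
# List RAID members flagged (F) in mdstat-formatted text.
# Reads the file given as $1, defaulting to /proc/mdstat.
# failed_members is a hypothetical helper name, not an mdadm tool.
failed_members() {
    awk '
        # Array lines look like: md0 : active raid1 sdb1[0] sda1[2](F)
        $2 == ":" && $3 == "active" {
            for (i = 5; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    member = $i
                    sub(/\[.*$/, "", member)   # strip the "[2](F)" suffix
                    print $1, member
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling failed_members with no argument reads /proc/mdstat and prints one "array member" pair per line, for example "md0 sda1".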

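To see which disk has failed, query each disk with smartctl (for example, smartctl -a /dev/sda). A failed disk typically cannot answer the query at all. The following demonstrates what a failed disk looks like:

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

For comparison, the good disk (/dev/sdb) returns a full report: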

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3160815AS
Serial Number:    9RA6DZP8
Firmware Version: 4.AAB
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Sep 8 15:50:48 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error.
    Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 430) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    No Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 54) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       44
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       239527837
  9 Power_On_Hours          0x0032   043   043   000    Old_age   Always       -       50308
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   075   061   045    Old_age   Always       -       25 (Min/Max 22/31)
194 Temperature_Celsius     0x0022   025   039   000    Old_age   Always       -       25 (0 22 0 0 0)
195 Hardware_ECC_Recovered  0x001a   064   057   000    Old_age   Always       -       155109699
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error  00%        13097            -
# 2  Extended offline    Completed without error  00%        4345             -
# 3  Short offline       Completed without error  00%        0                -

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing

Selective self-test flags (0x0):

 After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

So /dev/sda has failed.

Here are the steps:

Take note of the GOOD disk's serial number, so that you leave that disk in place when you swap drives:

Serial Number: 9RA6DZP8
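To avoid copying the serial number by hand, it can be filtered out of the smartctl identity output. This is a small sketch under the assumption that the output contains a "Serial Number:" line as in the report above; disk_serial is a hypothetical helper name.

```shell
#!/bin/sh
# Print the serial number found in smartctl identity output on stdin.
# disk_serial is a hypothetical helper; it only filters text.
disk_serial() {
    # Keep only the value after "Serial Number:", first match wins.
    sed -n 's/^Serial Number:[[:space:]]*//p' | head -n 1
}
```

Typical use would be smartctl -i /dev/sdb | disk_serial, run against the good disk before shutting down, so the value can be compared with the label on the physical drive.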

Mark each partition of the failed disk as faulty, then remove it from its array:

[root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0

[root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1

[root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2

[root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1

[root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2

[root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3
mdadm: hot removed /dev/sda3
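With more than a few arrays, typing each command by hand gets error-prone. The sketch below only prints the mdadm commands for review and runs nothing. The function name print_fail_remove and the array:partition argument format are assumptions made for this illustration. It emits fail then remove per array rather than all fails first, which should have the same effect.

```shell
#!/bin/sh
# Print (do not run) the mdadm commands that fail and remove every
# partition of one disk.
# Arguments: disk device, then array:partition-number pairs.
# print_fail_remove is a hypothetical helper name.
print_fail_remove() {
    disk="$1"; shift
    for pair in "$@"; do
        md="${pair%%:*}"      # text before the colon, e.g. md0
        part="${pair##*:}"    # text after the colon, e.g. 1
        echo "mdadm --manage /dev/$md --fail $disk$part"
        echo "mdadm --manage /dev/$md --remove $disk$part"
    done
}
```

For the layout on this page the call would be print_fail_remove /dev/sda md0:1 md1:2 md2:3. Review the printed lines before pasting them into a root shell.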

Make sure grub is installed on the good disk and that grub.conf is updated:

[root@myServer ~]# grub-install /dev/sdb
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not. If any of the lines is incorrect,
fix it and re-run the script `grub-install'.

# this device map was generated by anaconda
(hd0)     /dev/sda
(hd1)     /dev/sdb

Take note of which hd device corresponds to the good disk, i.e. hd1 in this case.
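The lookup can also be done with a one-line filter over the device map. This is a sketch only; grub_name_for is a hypothetical helper name, and it assumes the two-column "(hdN) /dev/sdX" layout printed by grub-install above.

```shell
#!/bin/sh
# Print the GRUB device name (e.g. hd1) mapped to a given disk.
# $1 = disk device, $2 = device map file (default /boot/grub/device.map).
# grub_name_for is a hypothetical helper name.
grub_name_for() {
    awk -v dev="$1" '
        # Lines look like: (hd1)     /dev/sdb
        $2 == dev {
            gsub(/[()]/, "", $1)   # drop the surrounding parentheses
            print $1
        }
    ' "${2:-/boot/grub/device.map}"
}
```

On this server, grub_name_for /dev/sdb would be expected to print hd1, matching the map shown above.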

[root@myServer ~]# vim /boot/grub/menu.lst

Add fallback=1 right after default=0.

Go to the bottom section, where you should find the kernel stanzas. Copy the first stanza and paste the copy before the first existing stanza, then replace root (hd0,0) with root (hd1,0) in the copy. The result should look like this:

[...]
title CentOS (2.6.18-128.el5)

       root (hd1,0)
       kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00
       initrd /initrd-2.6.18-128.el5.img

title CentOS (2.6.18-128.el5)

       root (hd0,0)
       kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/
       initrd /initrd-2.6.18-128.el5.img

Save and quit. Then back up the current initramfs, rebuild it, and power off the machine:

[root@myServer ~]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak

[root@myServer ~]# mkinitrd /boot/initramfs-$(uname -r).img $(uname -r)

[root@myServer ~]# init 0

Swap the bad drive with the new drive and boot the machine.

Once it's booted:

Check the device names with cat /proc/mdstat and/or fdisk -l. The newly installed drive on myServer was named /dev/sda.

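Load the RAID kernel modules, then copy the partition table from the good disk (/dev/sdb) to the new disk (/dev/sda) and check the result:

[root@myServer ~]# modprobe raid1

[root@myServer ~]# modprobe linear

[root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda

[root@myServer ~]# sfdisk -l    # sanity check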
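The sanity check can be made less visual by comparing the two partition dumps directly. The sketch below is an illustration under assumptions: partition_layout is a hypothetical helper name, the disks use sdX-style names, and the dump has one "/dev/sdXN : ..." line per partition as sfdisk -d prints for MBR disks.

```shell
#!/bin/sh
# Reduce an "sfdisk -d" dump on stdin to its partition lines, replacing
# the device name with a bare partition number (p1, p2, ...) so that
# dumps from two different disks can be compared as plain text.
# partition_layout is a hypothetical helper name.
partition_layout() {
    sed -n 's|^/dev/[a-z]*\([0-9][0-9]*\) *:|p\1:|p'
}
```

A comparison could then look like [ "$(sfdisk -d /dev/sdb | partition_layout)" = "$(sfdisk -d /dev/sda | partition_layout)" ] && echo "layouts match". Header lines that name the device are dropped by the filter, so only start, size, and type are compared.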

[root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1
mdadm: added /dev/sda1

[root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2
mdadm: added /dev/sda2

[root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3
mdadm: added /dev/sda3

[root@myServer ~]# cat /proc/mdstat    # sanity check
Personalities : [raid1] [linear]
md0 : active raid1 sda1[1] sdb1[0]
      128384 blocks [2/2] [UU]

md1 : active raid1 sda2[2] sdb2[0]
      16779776 blocks [2/1] [U_]
      [>....................]  recovery =  3.2% (548864/16779776) finish=8.8min speed=30492K/sec

md2 : active raid1 sda3[2] sdb3[0]
      139379840 blocks [2/1] [U_]
        resync=DELAYED

unused devices: <none>
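In the output above md1 is still recovering and md2 is waiting its turn, so the rebuild is only finished once every array shows [UU]. The following sketch lists the arrays that are still degraded. degraded_arrays is a hypothetical helper name, and it assumes each array line is followed by a status line ending in a field such as [UU] or [U_].

```shell
#!/bin/sh
# List arrays whose status field still contains "_" (a missing member).
# Reads the file given as $1, defaulting to /proc/mdstat.
# degraded_arrays is a hypothetical helper name.
degraded_arrays() {
    awk '
        # Remember the array name from lines like: md1 : active raid1 ...
        $2 == ":" && $3 == "active" { name = $1; next }
        # The following status line ends with a field such as [UU] or [U_].
        name != "" && $NF ~ /^\[[U_]+\]$/ {
            if ($NF ~ /_/) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

When degraded_arrays prints nothing, all arrays report every member as up. Until then it can be re-run periodically while the resync proceeds.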

That's it! :)
