How to Replace a Failed Disk

Steps to replace a failed hard drive in a RAID 1 configuration managed by mdadm:

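The following shows what a failed disk looks like. In /proc/mdstat the failed members are flagged (F) and each array reports [2/1] [U_], meaning only one of its two members is up. smartctl can no longer query the failed drive, while the good drive still reports normally: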

 [root@myServer ~]# cat /proc/mdstat 

 Personalities : [raid1] 

 md0 : active raid1 sdb1[0] sda1[2](F) 

 128384 blocks [2/1] [U_]  

 md1 : active raid1 sdb2[0] sda2[2](F) 

 16779776 blocks [2/1] [U_] 

 md2 : active raid1 sdb3[0] sda3[2](F) 

 139379840 blocks [2/1] [U_] 
    
 unused devices: <none>

 [root@myServer ~]# smartctl -a /dev/sda   
              
 smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)   

 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net   

 Short INQUIRY response, skip product id   

 A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 [root@myServer ~]# smartctl -a /dev/sdb  

 smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)   

 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net   

 === START OF INFORMATION SECTION ===    

 Model Family:     Seagate Barracuda 7200.10    

 Device Model:     ST3160815AS    

 Serial Number:    9RA6DZP8     

 Firmware Version: 4.AAB    

 User Capacity:    160,041,885,696 bytes [160 GB]   

 Sector Size:      512 bytes logical/physical   

 Device is:        In smartctl database [for details use: -P show]   

 ATA Version is:   7   

 ATA Standard is:  Exact ATA specification draft version not indicated   

 Local Time is:    Mon Sep  8 15:50:48 2014 PDT  

 SMART support is: Available - device has SMART capability.   

 SMART support is: Enabled   

 === START OF READ SMART DATA SECTION ===   
 
 SMART overall-health self-assessment test result: PASSED

smartctl prints a lot more than this; the rest is omitted here.

/dev/sda no longer answers SMART commands and all of its partitions are marked (F), so /dev/sda has clearly failed.
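The check above can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name degraded_arrays is hypothetical, and it assumes the /proc/mdstat layout shown above, where the last field of each "blocks" line is the member-status string such as [U_].

```shell
# List md arrays whose member-status field (e.g. [U_]) shows a missing member.
# Reads mdstat-format text from the file given as $1 (default: /proc/mdstat).
degraded_arrays() {
    awk '
        # An array header line looks like: md0 : active raid1 sdb1[0] sda1[2](F)
        $1 ~ /^md[0-9]+$/ && $2 == ":" { name = $1 }
        # The status line ends with [UU], [U_], ...; "_" marks a missing member
        /blocks/ && $NF ~ /^\[.*_.*\]$/ { print name }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no argument reads the live /proc/mdstat; an empty result means no array is reporting a missing member.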

Here are the steps:

Take note of the serial number of the GOOD disk (/dev/sdb), so that you leave that one in the machine when you pull the failed drive:

 Serial Number:    9RA6DZP8
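To avoid copying the serial number by hand, it can be pulled out of the smartctl output. This is only a sketch: the helper name serial_from_smartctl is hypothetical, and it assumes the "Serial Number:" line format shown in the output above.

```shell
# Print the serial number found in smartctl output read from stdin.
# Matches the label case-insensitively and stops at the first hit.
serial_from_smartctl() {
    awk 'tolower($0) ~ /serial number:/ { print $NF; exit }'
}
```

Typical use would be `smartctl -i /dev/sdb | serial_from_smartctl`, then comparing the result with the label printed on the physical drive before removing anything.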

Mark each partition of the failed disk as faulty, then remove it from its RAID array:

 [root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1   

 mdadm: set /dev/sda1 faulty in /dev/md0   

 [root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2   
 
 mdadm: set /dev/sda2 faulty in /dev/md1   

 [root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3   

 mdadm: set /dev/sda3 faulty in /dev/md2   

 [root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1   

 mdadm: hot removed /dev/sda1   

 [root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2   

 mdadm: hot removed /dev/sda2   

 [root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3   

 mdadm: hot removed /dev/sda3
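The six commands above follow one pattern per array, so they can be generated instead of typed. The sketch below only prints the commands so they can be reviewed first; the function name print_fail_remove and the array:partition argument format are assumptions made for this example.

```shell
# Print (do not run) the mdadm commands that mark each member faulty
# and then remove it. Arguments are array:partition pairs, e.g. md0:sda1.
print_fail_remove() {
    # First pass: mark every listed member as failed
    for pair in "$@"; do
        md=${pair%%:*}      # text before the colon, e.g. md0
        part=${pair#*:}     # text after the colon, e.g. sda1
        echo "mdadm --manage /dev/$md --fail /dev/$part"
    done
    # Second pass: hot-remove the members that were just failed
    for pair in "$@"; do
        md=${pair%%:*}
        part=${pair#*:}
        echo "mdadm --manage /dev/$md --remove /dev/$part"
    done
}
```

For the layout in this article the call would be `print_fail_remove md0:sda1 md1:sda2 md2:sda3`. Once the printed commands look right, they can be pasted into a root shell.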

Make sure GRUB is installed on the good disk and that the GRUB configuration (menu.lst) is updated:

 [root@myServer ~]# grub-install /dev/sdb   

 Installation finished. No error reported.   

 This is the contents of the device map /boot/grub/device.map.   

 Check if this is correct or not. 

 If any of the lines is incorrect, fix it and re-run the script `grub-install'.   

 This device map was generated by anaconda   

 (hd0)     /dev/sda   

 (hd1)     /dev/sdb

Take note of which GRUB disk corresponds to the good disk, i.e. (hd1) for /dev/sdb in this case.
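The lookup can also be done with a small helper. This is a sketch under the assumption that the device map uses the two-column "(hdN) /dev/sdX" format shown above; the name grub_disk_for is hypothetical.

```shell
# Print the GRUB disk name (e.g. hd1) mapped to a Linux device (e.g. /dev/sdb).
# $1 = device to look up, $2 = device map file (default: /boot/grub/device.map).
grub_disk_for() {
    awk -v dev="$1" '
        # Lines look like: (hd1)     /dev/sdb
        $2 == dev { gsub(/[()]/, "", $1); print $1 }
    ' "${2:-/boot/grub/device.map}"
}
```

On the machine in this article, `grub_disk_for /dev/sdb` would be expected to print hd1, which is the value that goes into the root (hd1,0) line.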

 [root@myServer ~]# vim /boot/grub/menu.lst  

 Add fallback=1 right after default=0  

 Go to the bottom section where you should find some kernel stanzas.   

 Copy the first of them and paste the stanza before the first existing stanza; replace root (hd0,0) with root (hd1,0)  

 Should look like this:  

   [...]   

   title CentOS (2.6.18-128.el5)  

   root (hd1,0)  

   kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00  

   initrd /initrd-2.6.18-128.el5.img  

   title CentOS (2.6.18-128.el5)  

   root (hd0,0)  

   kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/  

   initrd /initrd-2.6.18-128.el5.img

Save and quit
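The fallback=1 part of that edit can be previewed without opening an editor. The following sketch prints a modified copy of the file to stdout and leaves the original untouched; the function name add_fallback is hypothetical, and it assumes the file contains a plain default=0 line as described above. Duplicating the kernel stanza is still left to a manual edit.

```shell
# Print a copy of a GRUB legacy config with "fallback=1" added right after
# "default=0". The input file ($1) is not modified.
add_fallback() {
    # If a fallback line is already present, print the file unchanged
    if grep -q '^fallback=' "$1"; then
        cat "$1"
        return 0
    fi
    awk '
        { print }
        # Insert only once, after the first default=0 line
        /^default=0[ \t]*$/ && !done { print "fallback=1"; done = 1 }
    ' "$1"
}
```

One cautious way to use it is to write the result to a scratch file, compare it to the original with diff, and only then copy it into place.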

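Back up the current initrd, rebuild it, then shut the machine down. On this el5 system the image is named initrd-<version>.img, matching the initrd lines in menu.lst:

 [root@myServer ~]# mv /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak   

 [root@myServer ~]# mkinitrd /boot/initrd-$(uname -r).img $(uname -r)   

 [root@myServer ~]# init 0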

Swap the bad drive with the new drive and boot the machine.

Once it's booted:

Check the device names with cat /proc/mdstat and/or fdisk -l.
The newly installed drive on myServer was named /dev/sda.

Load the RAID kernel modules:

 [root@myServer ~]# modprobe raid1   

 [root@myServer ~]# modprobe linear

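Copy the partition table from the good disk (/dev/sdb) to the new disk (/dev/sda). Double-check the direction: the disk on the right-hand side of the pipe is overwritten.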

 [root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda   

 [root@myServer ~]# sfdisk -l   # sanity check
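The sanity check can be made stricter by comparing the two partition dumps directly. This is a sketch, not part of the original procedure: the helper names partition_layout and same_layout are hypothetical, it assumes sdX-style device names as used in this article, and it assumes dump lines of the form "/dev/sdb1 : start= ..., size= ..., Id=fd" as printed by sfdisk -d.

```shell
# Print only the partition lines of an sfdisk dump file, with the disk name
# stripped so that /dev/sda1 and /dev/sdb1 both become "1".
partition_layout() {
    grep 'start=' "$1" | sed 's#^/dev/[a-z]*##'
}

# Succeed (exit 0) when two dump files describe the same partition layout.
same_layout() {
    [ "$(partition_layout "$1")" = "$(partition_layout "$2")" ]
}
```

For example, save the dumps with `sfdisk -d /dev/sdb > /tmp/sdb.dump` and `sfdisk -d /dev/sda > /tmp/sda.dump` (the /tmp paths are just placeholders), then run `same_layout /tmp/sdb.dump /tmp/sda.dump && echo layouts match`.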

Add the new disk's partitions to the RAID arrays:

 [root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1   

 mdadm: added /dev/sda1   

 [root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2   

 mdadm: added /dev/sda2  

 [root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3   

 mdadm: added /dev/sda3

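Sanity check. The arrays rebuild one at a time: here md0 is already back in sync ([UU]), md1 is recovering, and md2 is waiting its turn (resync=DELAYED):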

 [root@myServer ~]# cat /proc/mdstat  

 Personalities : [raid1] [linear]    

 md0 : active raid1 sda1[1] sdb1[0]   

 128384 blocks [2/2] [UU]   
      
 md1 : active raid1 sda2[2] sdb2[0]   

 16779776 blocks [2/1] [U_]   

 [>....................]  recovery =  3.2% (548864/16779776) finish=8.8min speed=30492K/sec   

 md2 : active raid1 sda3[2] sdb3[0]   

 139379840 blocks [2/1] [U_]   

 resync=DELAYED   
     
 unused devices: <none>
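While the rebuild runs, the progress lines can be pulled out of /proc/mdstat. A minimal sketch, assuming the recovery line format shown above; the function name rebuild_progress is hypothetical.

```shell
# Print "<array> <percent>" for every array that is currently rebuilding.
# Reads mdstat-format text from the file given as $1 (default: /proc/mdstat).
rebuild_progress() {
    awk '
        # Remember which array the following lines belong to
        $1 ~ /^md[0-9]+$/ && $2 == ":" { name = $1 }
        # Progress lines contain a field such as 3.2%
        /recovery|resync|reshape/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9.]+%$/) print name, $i
        }
    ' "${1:-/proc/mdstat}"
}
```

Against the output above this prints `md1 3.2%`; md2 is not listed because a delayed resync has no percentage yet. The replacement is finished once the function prints nothing and every array shows [UU].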

That's it! :)
