How to Replace a Failed Disk: Difference between revisions

Latest revision as of 22:49, 25 January 2021

Steps to replace a failed hard drive in a RAID 1 configuration:

The following demonstrates what a failed disk looks like:

 [root@myServer ~]# cat /proc/mdstat 

 Personalities : [raid1] 

 md0 : active raid1 sdb1[0] sda1[2](F) 

 128384 blocks [2/1] [U_]  

 md1 : active raid1 sdb2[0] sda2[2](F) 

 16779776 blocks [2/1] [U_] 

 md2 : active raid1 sdb3[0] sda3[2](F) 

 139379840 blocks [2/1] [U_] 
    
 unused devices: <none>

 [root@myServer ~]# smartctl -a /dev/sda   
              
 smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)   

 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net   

 Short INQUIRY response, skip product id   

 A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 [root@myServer ~]# smartctl -a /dev/sdb  

 smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)   

 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net   

 === START OF INFORMATION SECTION ===    

 Model Family:     Seagate Barracuda 7200.10    

 Device Model:     ST3160815AS    

 Serial Number:    9RA6DZP8     

 Firmware Version: 4.AAB    

 User Capacity:    160,041,885,696 bytes [160 GB]   

 Sector Size:      512 bytes logical/physical   

 Device is:        In smartctl database [for details use: -P show]   

 ATA Version is:   7   

 ATA Standard is:  Exact ATA specification draft version not indicated   

 Local Time is:    Mon Sep  8 15:50:48 2014 PDT  

 SMART support is: Available - device has SMART capability.   

 SMART support is: Enabled   

 === START OF READ SMART DATA SECTION ===   
 
 SMART overall-health self-assessment test result: PASSED

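smartctl prints much more than this; the rest is omitted here.

So /dev/sda has clearly failed: every array marks its sda partition as faulty with (F), and the disk no longer answers SMART commands.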
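With several arrays it is easy to miss an (F) flag by eye. The following is a minimal sketch, not part of the original procedure, that lists every member flagged as faulty. The function name failed_members is made up for illustration, and it assumes the /proc/mdstat layout shown above.

```shell
# Print "array member" for every RAID member flagged (F) in mdstat-style input.
# Reads /proc/mdstat by default; pass another file (or - for stdin) instead.
# failed_members is a hypothetical helper name, not an mdadm tool.
failed_members() {
    awk '/^md[0-9]+ :/ {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)   # strip the "[2](F)" suffix
                print $1, dev
            }
    }' "${1:-/proc/mdstat}"
}
```

Run against the output above, it would report sda1, sda2 and sda3 next to their arrays.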

Note the serial number of the GOOD disk so that you leave that one in place when swapping drives:

 Serial Number:    9RA6DZP8
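To avoid copying the serial by hand, it can be pulled out of the smartctl output. This is only a sketch: serial_of is a hypothetical helper name, and it assumes the "Serial Number:" line format shown above with a serial that contains no spaces.

```shell
# Print the serial number found in smartctl-style output read from stdin.
# serial_of is a hypothetical helper name.
serial_of() {
    awk '/^Serial Number:/ { print $3 }'
}

# On a live system (as root) this would be used as:
#   smartctl -i /dev/sdb | serial_of
```

The printed value can then be compared with the label on the physical drive before pulling anything.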

Mark the failed disk as faulty and remove it from the RAID arrays:

 [root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1   

 mdadm: set /dev/sda1 faulty in /dev/md0   

 [root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2   
 
 mdadm: set /dev/sda2 faulty in /dev/md1   

 [root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3   

 mdadm: set /dev/sda3 faulty in /dev/md2   

 [root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1   

 mdadm: hot removed /dev/sda1   

 [root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2   

 mdadm: hot removed /dev/sda2   

 [root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3   

 mdadm: hot removed /dev/sda3
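The six commands above follow a pattern, so they can be generated rather than typed. The sketch below only prints the commands so they can be reviewed first. It assumes the layout on this server, where partition N+1 of the disk belongs to /dev/mdN. The function name fail_and_remove_cmds is made up for illustration.

```shell
# Print the mdadm commands that first fail, then remove, partitions 1..count
# of a disk from md0..md(count-1). Nothing is executed here.
# Assumes partition N+1 belongs to /dev/mdN, as on the server in this guide.
fail_and_remove_cmds() {
    disk=$1     # e.g. /dev/sda
    count=$2    # number of arrays, 3 on this server
    for action in fail remove; do
        i=0
        while [ "$i" -lt "$count" ]; do
            echo "mdadm --manage /dev/md$i --$action $disk$((i + 1))"
            i=$((i + 1))
        done
    done
}

# Review the output first, then run it as root:
#   fail_and_remove_cmds /dev/sda 3
#   fail_and_remove_cmds /dev/sda 3 | sh
```

Printing before executing keeps the destructive step deliberate, which matters when the wrong device name would take the good disk out of the array.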

Make sure grub is installed on the good disk and that grub.conf is updated:

 [root@myServer ~]# grub-install /dev/sdb   

 Installation finished. No error reported.   

 This is the contents of the device map /boot/grub/device.map.   

 Check if this is correct or not. 

 If any of the lines is incorrect, fix it and re-run the script `grub-install'.   

 This device map was generated by anaconda   

 (hd0)     /dev/sda   

 (hd1)     /dev/sdb

Take note of which hd device corresponds to the good disk, i.e. hd1 in this case.

 [root@myServer ~]# vim /boot/grub/menu.lst  

 Add fallback=1 right after default=0  

 Go to the bottom section where you should find some kernel stanzas.   

 Copy the first of them and paste the stanza before the first existing stanza; replace root (hd0,0) with root (hd1,0)  

 Should look like this:  

   [...]   

   title CentOS (2.6.18-128.el5)  

   root (hd1,0)  

   kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00  

   initrd /initrd-2.6.18-128.el5.img  

   title CentOS (2.6.18-128.el5)  

   root (hd0,0)  

   kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/  

   initrd /initrd-2.6.18-128.el5.img

Save and quit

 [root@myServer ~]# mv /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak   

 [root@myServer ~]# mkinitrd /boot/initrd-$(uname -r).img $(uname -r)   

 [root@myServer ~]# init 0

Swap the bad drive with the new drive and boot the machine.

Once it's booted:

Check the device names with cat /proc/mdstat and/or fdisk -l.
The newly installed drive on myServer was named /dev/sda.

 [root@myServer ~]# modprobe raid1   

 [root@myServer ~]# modprobe linear

Copy the partitions from one disk to the other:

 [root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda   

 [root@myServer ~]# sfdisk -l   # sanity check
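Reading two partition listings side by side is error-prone, so the sanity check can be automated by comparing the sfdisk -d dumps of both disks after replacing the device names with a placeholder. This is a sketch only: same_layout is a hypothetical helper, and the dump file paths in the usage comment are arbitrary.

```shell
# Succeed (exit 0) when two sfdisk -d dumps describe the same layout once
# the device names are replaced by a common placeholder.
# $1, $2: device names; $3, $4: files holding their sfdisk -d dumps.
# same_layout is a hypothetical helper name.
same_layout() {
    a=$(sed "s|$1|DISK|g" "$3")
    b=$(sed "s|$2|DISK|g" "$4")
    [ "$a" = "$b" ]
}

# On a live system (as root) this would be used as:
#   sfdisk -d /dev/sdb > /tmp/sdb.dump
#   sfdisk -d /dev/sda > /tmp/sda.dump
#   same_layout /dev/sdb /dev/sda /tmp/sdb.dump /tmp/sda.dump && echo "layouts match"
```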

Add the new disk's partitions to the RAID arrays:

 [root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1   

 mdadm: added /dev/sda1   

 [root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2   

 mdadm: added /dev/sda2  

 [root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3   

 mdadm: added /dev/sda3

Sanity check:

 [root@myServer ~]# cat /proc/mdstat  

 Personalities : [raid1] [linear]    

 md0 : active raid1 sda1[1] sdb1[0]   

 128384 blocks [2/2] [UU]   
      
 md1 : active raid1 sda2[2] sdb2[0]   

 16779776 blocks [2/1] [U_]   

 [>....................]  recovery =  3.2% (548864/16779776) finish=8.8min speed=30492K/sec   

 md2 : active raid1 sda3[2] sdb3[0]   

 139379840 blocks [2/1] [U_]   

 resync=DELAYED   
     
 unused devices: <none>
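In the output above md0 is already in sync, md1 is rebuilding and md2 is waiting its turn. A small check like the one below can tell when every array is back to full strength. It is a sketch that assumes the status field format shown above ([UU] for healthy, an underscore for each missing member). The name all_synced is made up for illustration.

```shell
# Succeed (exit 0) only when no array in the mdstat-style input shows a
# missing member, i.e. no status field such as [U_] or [_U].
# Reads /proc/mdstat by default; pass another file (or - for stdin) instead.
# all_synced is a hypothetical helper name.
all_synced() {
    ! grep -q '\[[U_]*_[U_]*]' "${1:-/proc/mdstat}"
}

# Example: poll once a minute until the rebuild has finished.
#   until all_synced; do sleep 60; done; echo "all arrays in sync"
```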

That's it! :)
