How to Replace a Failed Disk

Revision as of 12:54, 9 September 2014

Steps to replace a failed hard drive in a RAID 1 configuration:

The following demonstrates what a failed disk looks like:

 [root@myServer ~]# cat /proc/mdstat 
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
128384 blocks [2/1] [U_]
md1 : active raid1 sdb2[0] sda2[2](F)
16779776 blocks [2/1] [U_]
md2 : active raid1 sdb3[0] sda3[2](F)
139379840 blocks [2/1] [U_]
unused devices: <none>
 [root@myServer ~]# smartctl -a /dev/sda   
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


 [root@myServer ~]# smartctl -a /dev/sdb  
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST3160815AS
Serial Number: 9RA6DZP8
Firmware Version: 4.AAB
User Capacity: 160,041,885,696 bytes [160 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Sep 8 15:50:48 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

smartctl prints a lot more than this; the rest of the output has been trimmed here.

So /dev/sda has clearly failed.

Here are the steps:

Take note of the GOOD disk's serial number, so you leave that one in place when swapping drives:

 Serial Number:    9RA6DZP8   

Mark the failed disk's partitions as faulty and remove them from their RAID arrays:

 [root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1   
mdadm: set /dev/sda1 faulty in /dev/md0
[root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1
[root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2
[root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3
mdadm: hot removed /dev/sda3
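The fail/remove sequence above is the same pair of commands repeated per array. As a sketch, assuming the layout in this example (mdN paired with partition N+1 of the failed disk), a small loop can generate the commands as a dry run for inspection before you actually execute them:

```shell
# Dry-run sketch: build the mdadm fail/remove commands for each
# partition of the failed disk. Assumes md0..md2 map to partitions
# 1..3, as on myServer. Inspect the printed commands before running.
DISK=sda
cmds=""
i=1
for md in md0 md1 md2; do
  cmds="$cmds
mdadm --manage /dev/$md --fail /dev/${DISK}${i}
mdadm --manage /dev/$md --remove /dev/${DISK}${i}"
  i=$((i+1))
done
echo "$cmds"
```

This only prints the commands; run them yourself (or pipe to sh) once you have confirmed the device names match your system.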

Make sure grub is installed on the good disk and that grub.conf is updated:

 [root@myServer ~]# grub-install /dev/sdb   
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not.
If any of the lines is incorrect, fix it and re-run the script `grub-install'.
This device map was generated by anaconda
(hd0) /dev/sda
(hd1) /dev/sdb

Take note of which hd device corresponds to the good disk, i.e. hd1 in this case.

 [root@myServer ~]# vim /boot/grub/menu.lst  
Add fallback=1 right after default=0.
Go to the bottom section, where you should find some kernel stanzas.
Copy the first stanza and paste it before the existing ones; in the copy, replace root (hd0,0) with root (hd1,0).
It should look like this:
[...]
title CentOS (2.6.18-128.el5)
root (hd1,0)
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00
initrd /initrd-2.6.18-128.el5.img
title CentOS (2.6.18-128.el5)
root (hd0,0)
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/
initrd /initrd-2.6.18-128.el5.img

Save and quit

 [root@myServer ~]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak   
[root@myServer ~]# mkinitrd /boot/initramfs-$(uname -r).img $(uname -r)
[root@myServer ~]# init 0

Swap the bad drive with the new drive and boot the machine.

Once it's booted:

Check the device names with cat /proc/mdstat and/or fdisk -l.
The newly installed drive on myServer was named /dev/sda.

 [root@myServer ~]# modprobe raid1   
[root@myServer ~]# modprobe linear

Copy the partitions from one disk to the other:

 [root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda   
[root@myServer ~]# sfdisk -l => sanity check
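For a stricter sanity check than eyeballing sfdisk -l, you can compare the two dumps with the device names masked out. The sketch below runs on canned sample strings so it touches no real disks; on the live system the same comparison would be a diff of `sfdisk -d /dev/sdb` against `sfdisk -d /dev/sda` with the device prefixes substituted:

```shell
# Sketch with sample data: after the copy, the sfdisk dumps should be
# identical apart from the device name. Hypothetical live equivalent:
#   diff <(sfdisk -d /dev/sdb | sed 's|/dev/sdb|DISK|') \
#        <(sfdisk -d /dev/sda | sed 's|/dev/sda|DISK|')
src='/dev/sdb1 : start=63, size=256977, Id=fd'
dst='/dev/sda1 : start=63, size=256977, Id=fd'
if [ "$(echo "$src" | sed 's|/dev/sdb|DISK|')" = \
     "$(echo "$dst" | sed 's|/dev/sda|DISK|')" ]; then
  result="partition tables match"
else
  result="MISMATCH - do not proceed"
fi
echo "$result"
```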

Add the new disk to the raid array:

 [root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1   
mdadm: added /dev/sda1
[root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2
mdadm: added /dev/sda2
[root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3
mdadm: added /dev/sda3

Sanity check:

 [root@myServer ~]# cat /proc/mdstat  
Personalities : [raid1] [linear]
md0 : active raid1 sda1[1] sdb1[0]
128384 blocks [2/2] [UU]
md1 : active raid1 sda2[2] sdb2[0]
16779776 blocks [2/1] [U_]
[>....................] recovery = 3.2% (548864/16779776) finish=8.8min speed=30492K/sec
md2 : active raid1 sda3[2] sdb3[0]
139379840 blocks [2/1] [U_]
resync=DELAYED
unused devices: <none>
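To read /proc/mdstat at a glance: [UU] means both halves of the mirror are active, while [U_] means the array is degraded or still rebuilding. A small sketch, run here against a canned sample line from this example (point it at the real /proc/mdstat on the server):

```shell
# Sketch: detect a degraded/rebuilding mirror from mdstat-style output.
# Live use would simply be: grep '\[U_\]' /proc/mdstat
mdstat='md1 : active raid1 sda2[2] sdb2[0]
      16779776 blocks [2/1] [U_]'
if echo "$mdstat" | grep -q '\[U_\]'; then
  status="degraded or rebuilding"
else
  status="healthy"
fi
echo "$status"
```

Once recovery finishes, every array should show [2/2] [UU], like md0 above.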

That's it! :)