How to Replace a Failed Disk

Steps to fix a failed hard drive in a RAID 1 configuration:

The following demonstrates what a failed disk looks like:

[root@myServer ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
      128384 blocks [2/1] [U_]

md1 : active raid1 sdb2[0] sda2[2](F)
      16779776 blocks [2/1] [U_]

md2 : active raid1 sdb3[0] sda3[2](F)
      139379840 blocks [2/1] [U_]

unused devices: <none>
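In that output, (F) marks the failed member and [2/1] [U_] means only one of the two mirrors is still active. For a per-array view you can also run mdadm --detail, which reports the array state as degraded and flags the faulty member:

[root@myServer ~]# mdadm --detail /dev/md0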

[root@myServer ~]# smartctl -a /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


[root@myServer ~]# smartctl -a /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST3160815AS
Serial Number: 9RA6DZP8
Firmware Version: 4.AAB
User Capacity: 160,041,885,696 bytes [160 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Sep 8 15:50:48 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

There is a lot more that gets printed, but I cut it out.

So /dev/sda has clearly failed.

Here are the steps:

Take note of the GOOD disk's serial number so that drive stays in place during the swap:
Serial Number: 9RA6DZP8
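To double-check which physical drive carries that serial, you can print each disk's identity info (a sketch; the failed disk may not answer, as seen above):

for d in /dev/sda /dev/sdb; do
    echo "== $d =="
    smartctl -i $d | grep 'Serial Number'
done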

Mark and remove failed disk from raid:

[root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0

[root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1

[root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2

[root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1

[root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2

[root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3
mdadm: hot removed /dev/sda3
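Since the md-to-partition mapping here is one-to-one (md0/sda1, md1/sda2, md2/sda3), the six commands above can also be scripted as a loop (a sketch under that assumption):

for i in 0 1 2; do
    mdadm --manage /dev/md$i --fail /dev/sda$((i+1))
    mdadm --manage /dev/md$i --remove /dev/sda$((i+1))
done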

Make sure grub is installed on the good disk and that grub.conf is updated:


[root@myServer ~]# grub-install /dev/sdb
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not. If any of the lines is incorrect,
fix it and re-run the script `grub-install'.

# this device map was generated by anaconda
(hd0)    /dev/sda
(hd1)    /dev/sdb

Take note of which hd device corresponds to the good disk, i.e. hd1 in this case.
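To confirm that GRUB stage1 really landed in the MBR of the good disk, you can dump its first sector and look for the embedded GRUB marker (a sketch; applies to GRUB legacy):

[root@myServer ~]# dd if=/dev/sdb bs=512 count=1 2>/dev/null | strings | grep GRUB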

[root@myServer ~]# vim /boot/grub/menu.lst

Add fallback=1 right after default=0 (if the first entry fails to boot, GRUB will fall back to the second).
Go to the bottom section, where you should find some kernel stanzas.
Copy the first of them and paste it above the first existing stanza; in the copy, replace root (hd0,0) with root (hd1,0).
It should look like this:

[...]
title CentOS (2.6.18-128.el5)
        root (hd1,0)
        kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.18-128.el5.img
title CentOS (2.6.18-128.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.18-128.el5.img
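With this layout GRUB boots the first stanza off the good disk (hd1) and, thanks to fallback, tries the second stanza if that fails. The top of the file should then begin along these lines (any other header lines, e.g. timeout, stay as they were):

default=0
fallback=1
[...]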

Save and quit

Back up the current initrd and rebuild it so the RAID modules are available at boot (CentOS 5 names these images initrd-&lt;version&gt;.img, matching the menu.lst entries above):

[root@myServer ~]# mv /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak

[root@myServer ~]# mkinitrd /boot/initrd-$(uname -r).img $(uname -r)
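To sanity-check that the rebuilt image contains the RAID driver, you can list its contents (a sketch; CentOS 5 initrds are gzip-compressed cpio archives):

[root@myServer ~]# zcat /boot/initrd-$(uname -r).img | cpio -it | grep -i raid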

[root@myServer ~]# init 0

Swap the bad drive with the new drive and boot the machine.

Once it's booted:

Check the device names with cat /proc/mdstat and/or fdisk -l. The newly installed drive on myServer was named /dev/sda.

[root@myServer ~]# modprobe raid1

[root@myServer ~]# modprobe linear

Copy the partition table from the good disk to the new one:

[root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda

[root@myServer ~]# sfdisk -l => sanity check

[root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1
mdadm: added /dev/sda1

[root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2
mdadm: added /dev/sda2

[root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3
mdadm: added /dev/sda3
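As with the removal step, the adds can be written as a loop when the mapping is one-to-one (same assumption as before):

for i in 0 1 2; do
    mdadm --manage /dev/md$i --add /dev/sda$((i+1))
done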

[root@myServer ~]# cat /proc/mdstat => Sanity check
Personalities : [raid1] [linear]
md0 : active raid1 sda1[1] sdb1[0]
      128384 blocks [2/2] [UU]

md1 : active raid1 sda2[2] sdb2[0]
      16779776 blocks [2/1] [U_]
      [>....................]  recovery =  3.2% (548864/16779776) finish=8.8min speed=30492K/sec

md2 : active raid1 sda3[2] sdb3[0]
      139379840 blocks [2/1] [U_]
      resync=DELAYED

unused devices: <none>
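The rebuild runs in the background; you can watch it until every array shows [2/2] [UU]:

[root@myServer ~]# watch -n 5 cat /proc/mdstat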


That's it! :)