How to Replace a Failed Disk
Latest revision as of 22:49, 25 January 2021
Steps to replace a failed hard drive in a RAID 1 configuration:
The following demonstrates what a failed disk looks like:
[root@myServer ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
128384 blocks [2/1] [U_]
md1 : active raid1 sdb2[0] sda2[2](F)
16779776 blocks [2/1] [U_]
md2 : active raid1 sdb3[0] sda3[2](F)
139379840 blocks [2/1] [U_]
unused devices: <none>
[root@myServer ~]# smartctl -a /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
[root@myServer ~]# smartctl -a /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST3160815AS
Serial Number: 9RA6DZP8
Firmware Version: 4.AAB
User Capacity: 160,041,885,696 bytes [160 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Sep 8 15:50:48 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
smartctl prints a lot more than this, but I cut it out.
/dev/sda no longer answers SMART commands while /dev/sdb passes, so /dev/sda has clearly failed.
Take note of the GOOD disk's serial number so that disk stays in the machine when the failed one is swapped out:
Serial Number: 9RA6DZP8
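The mdstat check above can also be scripted. The following is a minimal sketch, assuming the /proc/mdstat layout shown above, in which a failed member carries an (F) suffix. The function name failed_members is hypothetical and not part of mdadm.

```shell
#!/bin/sh
# failed_members: hypothetical helper, not part of mdadm.
# Prints "<array> <member>" for every member flagged (F) in mdstat-style text.
# $1 = file to read (defaults to /proc/mdstat).
failed_members() {
    awk '/^md[0-9]+ :/ {
        # Fields 5 and up are the member devices, e.g. "sda1[2](F)".
        for (i = 5; i <= NF; i++) {
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)   # strip "[2](F)" to leave "sda1"
                print $1, dev
            }
        }
    }' "${1:-/proc/mdstat}"
}
```

Called without arguments it reads the live /proc/mdstat; passing a saved copy of the file makes it easy to try out on another machine.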
Mark the failed disk's partitions as faulty and remove them from each RAID array:
[root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1
[root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2
[root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3
mdadm: hot removed /dev/sda3
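The six commands above follow one pattern per array, so they can be wrapped in a small loop. This is only a sketch: the function names run, drop_member and drop_all are hypothetical, the array/partition pairs are the ones from this example, and the script prints the commands instead of running them unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Hypothetical wrapper around the mdadm calls shown above.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# drop_member ARRAY PARTITION: mark the partition faulty, then remove it.
drop_member() {
    run mdadm --manage "$1" --fail "$2" &&
    run mdadm --manage "$1" --remove "$2"
}

# Pairs taken from the example above; adjust them to the actual layout.
drop_all() {
    drop_member /dev/md0 /dev/sda1 &&
    drop_member /dev/md1 /dev/sda2 &&
    drop_member /dev/md2 /dev/sda3
}
```

Running drop_all as-is only lists what would be done, which gives a chance to confirm that every partition belongs to the failed disk before setting DRY_RUN=0.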
Make sure grub is installed on the good disk and that grub.conf is updated:
[root@myServer ~]# grub-install /dev/sdb
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not.
If any of the lines is incorrect, fix it and re-run the script `grub-install'.
This device map was generated by anaconda
(hd0) /dev/sda
(hd1) /dev/sdb
Take note of which hd device corresponds to the good disk, i.e. hd1 in this case.
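The lookup can be done without reading the map by eye. A minimal sketch, assuming the two-column device.map format printed above; grub_name_for is a hypothetical name.

```shell
#!/bin/sh
# grub_name_for: hypothetical helper that looks up a disk in a GRUB legacy
# device map and prints its GRUB name without the parentheses.
# $1 = disk (e.g. /dev/sdb), $2 = device map (defaults to /boot/grub/device.map).
grub_name_for() {
    awk -v disk="$1" '$2 == disk {
        name = $1
        gsub(/[()]/, "", name)   # "(hd1)" becomes "hd1"
        print name
    }' "${2:-/boot/grub/device.map}"
}
```

With the map shown above, grub_name_for /dev/sdb would print hd1.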
[root@myServer ~]# vim /boot/grub/menu.lst
Add fallback=1 right after default=0
Go to the bottom section, where you should find the kernel stanzas.
Copy the first stanza and paste the copy before it, so the copy becomes the first stanza; in the copy, replace root (hd0,0) with root (hd1,0).
It should look like this:
[...]
title CentOS (2.6.18-128.el5)
root (hd1,0)
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00
initrd /initrd-2.6.18-128.el5.img
title CentOS (2.6.18-128.el5)
root (hd0,0)
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/
initrd /initrd-2.6.18-128.el5.img
Save and quit
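Before rebooting, it may be worth confirming that both edits made it into the file. The following sketch only greps for them; menu_has_fallback is a hypothetical name, and hd1 is hard-coded on the assumption that the good disk is hd1, as in this example.

```shell
#!/bin/sh
# menu_has_fallback: hypothetical check for the two edits described above.
# $1 = GRUB legacy config (defaults to /boot/grub/menu.lst).
# Succeeds only if fallback=1 is set and some stanza boots from (hd1,0).
menu_has_fallback() {
    conf=${1:-/boot/grub/menu.lst}
    grep -Eq '^[[:space:]]*fallback=1[[:space:]]*$' "$conf" &&
    grep -Eq '^[[:space:]]*root[[:space:]]+\(hd1,0\)' "$conf"
}
```

A typical use would be menu_has_fallback && echo "menu.lst looks OK". It does not verify the order of the stanzas, so a quick look at the file is still needed.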
[root@myServer ~]# mv /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak
[root@myServer ~]# mkinitrd /boot/initrd-$(uname -r).img $(uname -r)
[root@myServer ~]# init 0
Swap the bad drive with the new drive and boot the machine.
Once it's booted:
Check the device names with cat /proc/mdstat and/or fdisk -l.
The newly installed drive on myServer was named /dev/sda.
[root@myServer ~]# modprobe raid1
[root@myServer ~]# modprobe linear
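Copy the partition table from the good disk (/dev/sdb) to the new disk (/dev/sda). Double-check the direction, because the target disk's partition table is overwritten:
[root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda
[root@myServer ~]# sfdisk -l    # sanity check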
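The sanity check can be made mechanical by comparing sfdisk -d dumps of both disks. This is a sketch under the assumption that each partition line in the dump starts with the disk name, as in "/dev/sdb1 : start= ..."; layout_of and same_layout are hypothetical names.

```shell
#!/bin/sh
# layout_of DUMP DISK: print the partition lines of an "sfdisk -d" dump with
# the disk name stripped, so that dumps of two disks can be compared.
layout_of() {
    grep "^$2" "$1" | sed "s|^$2||"
}

# same_layout DUMP_A DISK_A DUMP_B DISK_B: succeed if both layouts match.
same_layout() {
    [ "$(layout_of "$1" "$2")" = "$(layout_of "$3" "$4")" ]
}

# Example use (as root), with dump file names chosen for illustration:
#   sfdisk -d /dev/sdb > /tmp/sdb.dump
#   sfdisk -d /dev/sda > /tmp/sda.dump
#   same_layout /tmp/sdb.dump /dev/sdb /tmp/sda.dump /dev/sda && echo "layouts match"
```

Only the partition lines are compared, so header lines that legitimately differ between the two dumps do not cause a false mismatch.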
Add the new disk's partitions to the RAID arrays:
[root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1
mdadm: added /dev/sda1
[root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2
mdadm: added /dev/sda2
[root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3
mdadm: added /dev/sda3
Sanity check:
[root@myServer ~]# cat /proc/mdstat
Personalities : [raid1] [linear]
md0 : active raid1 sda1[1] sdb1[0]
128384 blocks [2/2] [UU]
md1 : active raid1 sda2[2] sdb2[0]
16779776 blocks [2/1] [U_]
[>....................] recovery = 3.2% (548864/16779776) finish=8.8min speed=30492K/sec
md2 : active raid1 sda3[2] sdb3[0]
139379840 blocks [2/1] [U_]
resync=DELAYED
unused devices: <none>
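To avoid re-running cat by hand until the rebuild finishes, the status field can be polled. A minimal sketch, assuming the mdstat format shown above, where a degraded array has "_" in its final [U_] field; degraded_arrays and wait_for_rebuild are hypothetical names and the 60-second interval is arbitrary.

```shell
#!/bin/sh
# degraded_arrays: hypothetical helper; prints each md array whose status
# field still contains "_" (a missing or rebuilding member).
# $1 = file to read (defaults to /proc/mdstat).
degraded_arrays() {
    awk '/^md[0-9]+ :/ { name = $1 }
         / blocks / && $NF ~ /_/ { print name }' "${1:-/proc/mdstat}"
}

# wait_for_rebuild: poll once a minute until no array is degraded.
wait_for_rebuild() {
    while [ -n "$(degraded_arrays)" ]; do
        sleep 60
    done
}
```

For the output above, degraded_arrays would list md1 and md2; once it prints nothing, every array is back to [UU].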
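Here md0 has already finished, md1 is rebuilding, and md2 is queued (resync=DELAYED). Once every array shows [2/2] [UU], that's it! :)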