How to Replace a Failed Disk
Steps to replace a failed hard drive in a RAID 1 (mdadm software RAID) configuration:
[root@myServer ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
128384 blocks [2/1] [U_]
md1 : active raid1 sdb2[0] sda2[2](F)
16779776 blocks [2/1] [U_]
md2 : active raid1 sdb3[0] sda3[2](F)
139379840 blocks [2/1] [U_]
unused devices: <none>
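With several arrays, the (F) flags can be picked out of the mdstat text automatically. The following is only a minimal sketch: the function name list_failed_members is made up for this example, and it assumes the member list sits on the same line as the array name, as in the output above.

```shell
# list_failed_members (hypothetical helper): read /proc/mdstat-style text on
# stdin and print "<array> <member>" for every member flagged (F).
list_failed_members() {
    awk '
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    member = $i
                    sub(/\[[0-9]+\].*$/, "", member)   # "sda1[2](F)" -> "sda1"
                    print $1, member
                }
            }
        }
    '
}

# Typical use on a live system:
#   list_failed_members < /proc/mdstat
```

Fed the output shown above, this should print one line per array, each naming an sda partition.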
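In the output above, (F) marks a faulty member and [U_] shows that one of the two mirrors is missing.
To confirm which disk has failed, run smartctl against each disk. The first command below shows what a failed disk looks like; the second shows the healthy disk: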
[root@myServer ~]# smartctl -a /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
[root@myServer ~]# smartctl -a /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST3160815AS
Serial Number:    9RA6DZP8
Firmware Version: 4.AAB
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Sep  8 15:50:48 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  54) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       44
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       239527837
  9 Power_On_Hours          0x0032   043   043   000    Old_age   Always       -       50308
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       44
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   075   061   045    Old_age   Always       -       25 (Min/Max 22/31)
194 Temperature_Celsius     0x0022   025   039   000    Old_age   Always       -       25 (0 22 0 0 0)
195 Hardware_ECC_Recovered  0x001a   064   057   000    Old_age   Always       -       155109699
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%            13097  -
# 2  Extended offline    Completed without error        00%             4345  -
# 3  Short offline       Completed without error        00%                0  -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
/dev/sda no longer answers SMART commands, while /dev/sdb passes its health check, so /dev/sda is the failed disk.
Here are the steps:
Note the serial number of the GOOD disk so that it stays in the machine when the failed one is pulled: Serial Number: 9RA6DZP8
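The serial number can also be pulled out of the smartctl output rather than copied by eye. This is a small sketch, assuming the "Serial Number:" line format shown in the information section above; the function name disk_serial is made up for this example.

```shell
# disk_serial (hypothetical helper): read smartctl information output on stdin
# and print the value of the first "Serial Number:" line.
disk_serial() {
    awk -F': *' '/^Serial Number:/ { print $2; exit }'
}

# Typical use on the good disk:
#   smartctl -i /dev/sdb | disk_serial
```

Writing the result on a label or in a ticket before powering off makes it harder to pull the wrong drive.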
Mark each partition of the failed disk as faulty, then remove it from its array:
[root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1
[root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2
[root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3
mdadm: hot removed /dev/sda3
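On a machine with more arrays, the fail/remove pairs can be generated from the mdstat text instead of typed by hand. The sketch below only prints the commands (a dry run) so they can be reviewed before anything is executed; the function name plan_disk_removal is made up for this example, and it assumes partition names of the form sdaN.

```shell
# plan_disk_removal (hypothetical helper): read /proc/mdstat-style text on
# stdin and print the mdadm --fail / --remove commands for every partition
# of the given disk. Nothing is executed.
plan_disk_removal() {
    disk=$1   # disk name without the /dev/ prefix, e.g. "sda"
    awk -v disk="$disk" '
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                sub(/\[.*$/, "", member)            # "sda1[2](F)" -> "sda1"
                if (member ~ ("^" disk "[0-9]+$")) {
                    printf "mdadm --manage /dev/%s --fail /dev/%s\n", $1, member
                    printf "mdadm --manage /dev/%s --remove /dev/%s\n", $1, member
                }
            }
        }
    '
}

# Typical use:
#   plan_disk_removal sda < /proc/mdstat
```

The output pairs each --fail with its --remove per array rather than running all the --fail commands first; either order should work, since each array is handled independently.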
Make sure GRUB is installed on the good disk and that its configuration is updated:
[root@myServer ~]# grub-install /dev/sdb
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not. If any of the lines is incorrect,
fix it and re-run the script `grub-install'.
# this device map was generated by anaconda
(hd0)     /dev/sda
(hd1)     /dev/sdb
Note which hd device corresponds to the good disk, i.e. hd1 in this case.
[root@myServer ~]# vim /boot/grub/menu.lst
Add fallback=1 right after default=0.
Go to the bottom section, where you should find some kernel stanzas. Copy the first of them and paste the copy before the first existing stanza; in the copy, replace root (hd0,0) with root (hd1,0).
It should look like this:
[...]
title CentOS (2.6.18-128.el5)
        root (hd1,0)
        kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00
        initrd /initrd-2.6.18-128.el5.img
title CentOS (2.6.18-128.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/
        initrd /initrd-2.6.18-128.el5.img
Save and quit
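Before shutting down, it can be worth confirming that the fallback line actually made it into the file. A minimal sketch, assuming the entry is written either as fallback=1 or fallback 1; the function name has_fallback is made up for this example.

```shell
# has_fallback (hypothetical helper): succeed if the given GRUB config file
# contains a "fallback" entry followed by a number.
has_fallback() {
    grep -Eq '^[[:space:]]*fallback[[:space:]=]+[0-9]+' "$1"
}

# Typical use:
#   has_fallback /boot/grub/menu.lst && echo "fallback entry present"
```

This only checks that the line exists; it does not verify that the stanza it points to uses the right root device.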
Back up the current initrd, rebuild it, and power the machine off:
[root@myServer ~]# mv /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak
[root@myServer ~]# mkinitrd /boot/initrd-$(uname -r).img $(uname -r)
[root@myServer ~]# init 0
Swap the bad drive with the new drive and boot the machine.
Once it's booted:
Check the device names with cat /proc/mdstat and/or fdisk -l. The newly installed drive on myServer was named /dev/sda.
[root@myServer ~]# modprobe raid1
[root@myServer ~]# modprobe linear
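Copy the partition table from the good disk (/dev/sdb) to the new disk (/dev/sda). Double-check the direction, since reversing it would overwrite the good disk's partition table:
[root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda
[root@myServer ~]# sfdisk -l    # sanity check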
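The sanity check can be made more mechanical by diffing the two sfdisk dumps once the device names are neutralised. This is a sketch only: the function name normalize_sfdisk_dump and the temporary file names are made up for this example, and it assumes the dumps differ only in the device name, which may not hold on newer sfdisk versions that also print per-disk identifiers.

```shell
# normalize_sfdisk_dump (hypothetical helper): read "sfdisk -d" output on
# stdin and replace /dev/<disk> with the placeholder DISK so that dumps of
# two different disks can be compared.
normalize_sfdisk_dump() {
    sed "s|/dev/$1|DISK|g"
}

# Typical use:
#   sfdisk -d /dev/sdb | normalize_sfdisk_dump sdb > /tmp/good.dump
#   sfdisk -d /dev/sda | normalize_sfdisk_dump sda > /tmp/new.dump
#   diff /tmp/good.dump /tmp/new.dump && echo "partition layouts match"
```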
Add the new disk's partitions to each array:
[root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1
mdadm: added /dev/sda1
[root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2
mdadm: added /dev/sda2
[root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3
mdadm: added /dev/sda3
[root@myServer ~]# cat /proc/mdstat    # sanity check
Personalities : [raid1] [linear]
md0 : active raid1 sda1[1] sdb1[0]
128384 blocks [2/2] [UU]
md1 : active raid1 sda2[2] sdb2[0]
16779776 blocks [2/1] [U_]
[>....................] recovery = 3.2% (548864/16779776) finish=8.8min speed=30492K/sec
md2 : active raid1 sda3[2] sdb3[0]
139379840 blocks [2/1] [U_]
resync=DELAYED
unused devices: <none>
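The rebuild runs one array at a time, so it can take a while before everything shows [UU]. The sketch below lists the arrays that are still degraded, which makes it easy to poll until the list is empty; the function name list_degraded_arrays is made up for this example, and it assumes the "blocks" line follows the array line as in the output above.

```shell
# list_degraded_arrays (hypothetical helper): read /proc/mdstat-style text on
# stdin and print the name of every array whose status field (e.g. [U_])
# contains an underscore, meaning a member is missing or still rebuilding.
list_degraded_arrays() {
    awk '
        $2 == ":" && $1 ~ /^md/ { name = $1; next }
        name != "" && /blocks/ {
            if ($0 ~ /\[[U_]*_[U_]*\]/) print name
            name = ""
        }
    '
}

# Typical use: wait until every array is back to [UU].
#   while [ -n "$(list_degraded_arrays < /proc/mdstat)" ]; do sleep 60; done
```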
Once recovery finishes, every array should show [2/2] [UU]. That's it! :)