How to Replace a Failed Disk

From DISI

Latest revision as of 22:49, 25 January 2021

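Steps to replace a failed hard drive in a RAID 1 configuration:

The following shows what a failed disk looks like. In /proc/mdstat the failed members are flagged (F) and each array reports [2/1] [U_], meaning only one of its two members is up: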

 [root@myServer ~]# cat /proc/mdstat
 Personalities : [raid1]
 md0 : active raid1 sdb1[0] sda1[2](F)
       128384 blocks [2/1] [U_]
 md1 : active raid1 sdb2[0] sda2[2](F)
       16779776 blocks [2/1] [U_]
 md2 : active raid1 sdb3[0] sda3[2](F)
       139379840 blocks [2/1] [U_]
 unused devices: <none>
 [root@myServer ~]# smartctl -a /dev/sda
 smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
 Short INQUIRY response, skip product id
 A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


 [root@myServer ~]# smartctl -a /dev/sdb
 smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
 === START OF INFORMATION SECTION ===
 Model Family:     Seagate Barracuda 7200.10
 Device Model:     ST3160815AS
 Serial Number:    9RA6DZP8
 Firmware Version: 4.AAB
 User Capacity:    160,041,885,696 bytes [160 GB]
 Sector Size:      512 bytes logical/physical
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   7
 ATA Standard is:  Exact ATA specification draft version not indicated
 Local Time is:    Mon Sep  8 15:50:48 2014 PDT
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled
 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED

smartctl prints considerably more than this; the rest is omitted here.

/dev/sda no longer answers SMART commands while /dev/sdb passes its health check, so /dev/sda has clearly failed.
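
The check above can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name failed_members is hypothetical, and it assumes the /proc/mdstat layout shown above, where faulty members carry an (F) suffix.

```shell
# List array members that md has flagged as faulty, one "array member" pair per line.
# Reads /proc/mdstat-style text on stdin. failed_members is a hypothetical helper name.
failed_members() {
    awk '/^md[0-9]+ :/ {
        # Fields 5 and up are the member devices, e.g. sda1[2](F)
        for (i = 5; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)   # strip the [2](F) suffix
                print $1, dev
            }
    }'
}

# On a live system: failed_members < /proc/mdstat
# Here it is fed the sample output from this page instead.
failed_members <<'EOF'
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
      128384 blocks [2/1] [U_]
md1 : active raid1 sdb2[0] sda2[2](F)
      16779776 blocks [2/1] [U_]
md2 : active raid1 sdb3[0] sda3[2](F)
      139379840 blocks [2/1] [U_]
unused devices: <none>
EOF
```

For the sample input this prints one line per failed partition (md0 sda1, md1 sda2, md2 sda3); on a healthy system it prints nothing.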

Note the serial number of the GOOD disk, so that it is the one left in the machine when the failed disk is pulled:

 Serial Number:    9RA6DZP8   
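
To avoid copying the serial number by hand, it can be filtered out of the smartctl output. This is a small sketch under the assumption that the output contains a "Serial Number:" line as in the listing above; disk_serial is a hypothetical helper name.

```shell
# Print the serial number found in smartctl output read on stdin.
# disk_serial is a hypothetical helper name.
disk_serial() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}

# On a live system (as root): smartctl -i /dev/sdb | disk_serial
# Here it is fed an excerpt of the output shown on this page.
disk_serial <<'EOF'
Device Model:     ST3160815AS
Serial Number:    9RA6DZP8
Firmware Version: 4.AAB
EOF
```

The serial number is also printed on the drive label, which is what makes it useful for telling the two disks apart once the case is open.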

Mark the failed disk's partitions as faulty and remove them from the RAID arrays:

 [root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1
 mdadm: set /dev/sda1 faulty in /dev/md0
 [root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2
 mdadm: set /dev/sda2 faulty in /dev/md1
 [root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3
 mdadm: set /dev/sda3 faulty in /dev/md2
 [root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1
 mdadm: hot removed /dev/sda1
 [root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2
 mdadm: hot removed /dev/sda2
 [root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3
 mdadm: hot removed /dev/sda3
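
With more than a few arrays, typing each command gets tedious. The sketch below only prints the mdadm commands for review and does not run them. plan_removal is a hypothetical helper name, and the prefix match on the disk name is a simplifying assumption (it would also match a disk called sdaa, for instance). Unlike the session above, it pairs each --fail with its --remove instead of issuing all the fails first.

```shell
# Print (do not run) the mdadm commands that fail and remove every partition
# of one disk. $1 is the bare disk name, e.g. sda; mdstat text comes from stdin.
# plan_removal is a hypothetical helper name.
plan_removal() {
    awk -v disk="$1" '/^md[0-9]+ :/ {
        for (i = 5; i <= NF; i++) {
            part = $i
            sub(/\[.*/, "", part)            # sda1[2](F) becomes sda1
            if (index(part, disk) == 1) {    # partition belongs to the disk
                printf "mdadm --manage /dev/%s --fail /dev/%s\n", $1, part
                printf "mdadm --manage /dev/%s --remove /dev/%s\n", $1, part
            }
        }
    }'
}

# On a live system: plan_removal sda < /proc/mdstat
plan_removal sda <<'EOF'
md0 : active raid1 sdb1[0] sda1[2](F)
      128384 blocks [2/1] [U_]
md1 : active raid1 sdb2[0] sda2[2](F)
      16779776 blocks [2/1] [U_]
EOF
```

Printing the commands first gives a chance to confirm that only partitions of the failed disk are listed before anything is executed.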

Make sure GRUB is installed on the good disk and that grub.conf is updated:

 [root@myServer ~]# grub-install /dev/sdb
 Installation finished. No error reported.
 This is the contents of the device map /boot/grub/device.map.
 Check if this is correct or not.
 If any of the lines is incorrect, fix it and re-run the script `grub-install'.
 This device map was generated by anaconda
 (hd0)     /dev/sda
 (hd1)     /dev/sdb

Take note of which hd device corresponds to the good disk, i.e. hd1 in this case.
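
The lookup can be done mechanically as well. This is a sketch that assumes the two-column device.map format printed above; grub_name_for is a hypothetical helper name.

```shell
# Print the GRUB name (e.g. hd1) that device.map assigns to a disk.
# $1 is the device path; device.map text comes from stdin.
# grub_name_for is a hypothetical helper name.
grub_name_for() {
    awk -v dev="$1" '$2 == dev { gsub(/[()]/, "", $1); print $1 }'
}

# On a live system: grub_name_for /dev/sdb < /boot/grub/device.map
grub_name_for /dev/sdb <<'EOF'
# this device map was generated by anaconda
(hd0)     /dev/sda
(hd1)     /dev/sdb
EOF
```

For the device map shown on this page the result is hd1, which is the name used in the menu.lst stanza for the good disk.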

 [root@myServer ~]# vim /boot/grub/menu.lst

Add fallback=1 right after default=0, so that GRUB tries the second stanza if the first one fails to boot.
Go to the bottom section, where you should find some kernel stanzas.
Copy the first stanza and paste the copy above it; in the copy, replace root (hd0,0) with root (hd1,0).
It should look like this:

 [...]
 title CentOS (2.6.18-128.el5)
         root (hd1,0)
         kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00
         initrd /initrd-2.6.18-128.el5.img
 title CentOS (2.6.18-128.el5)
         root (hd0,0)
         kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/
         initrd /initrd-2.6.18-128.el5.img

Save and quit
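
The fallback line can also be added without an editor. The following is only a sketch: add_fallback is a hypothetical helper name, it assumes the file contains a line reading exactly default=0, and it leaves the stanza copy described above to be done by hand. It writes the result to stdout, so the original file is not touched.

```shell
# Emit menu.lst text with "fallback=1" inserted after the "default=0" line,
# unless a fallback line is already present. Reads stdin, writes stdout.
# add_fallback is a hypothetical helper name.
add_fallback() {
    awk '
        /^fallback=/ { seen = 1 }      # remember an existing fallback line
        { lines[NR] = $0 }             # buffer the whole file
        END {
            for (i = 1; i <= NR; i++) {
                print lines[i]
                if (!seen && lines[i] ~ /^default=0[ \t]*$/)
                    print "fallback=1"
            }
        }'
}

# On a live system, write to a scratch file and review it before copying it back:
#   add_fallback < /boot/grub/menu.lst > /tmp/menu.lst.new
add_fallback <<'EOF'
default=0
timeout=5
title CentOS (2.6.18-128.el5)
EOF
```

Because the function skips files that already contain a fallback line, running it twice does not add a second entry.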

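Back up the current initial ramdisk, rebuild it, and shut the machine down. mkinitrd writes to the path given as its first argument, so that name should match the initrd line of the stanza being booted (the example stanzas above use initrd-2.6.18-128.el5.img):

 [root@myServer ~]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
 [root@myServer ~]# mkinitrd /boot/initramfs-$(uname -r).img $(uname -r)
 [root@myServer ~]# init 0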

Swap the bad drive with the new drive and boot the machine.

Once it's booted:

Check the device names with cat /proc/mdstat and/or fdisk -l.
The newly installed drive on myServer was named /dev/sda.

 [root@myServer ~]# modprobe raid1
 [root@myServer ~]# modprobe linear

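Copy the partition table from the good disk (/dev/sdb) to the new disk (/dev/sda). Double-check the direction, since the reverse would overwrite the good disk's partition table: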

 [root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda
 [root@myServer ~]# sfdisk -l    # sanity check
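
The sanity check can be made stricter by comparing the two dumps instead of reading them by eye. This is a sketch with hypothetical helper names (normalize_dump, same_layout); it assumes that the only expected difference between the two sfdisk -d dumps is the disk name itself.

```shell
# Replace the disk name in sfdisk -d output (stdin) with a placeholder,
# so that dumps taken from two different disks become comparable.
# $1 is the device path, e.g. /dev/sdb. Hypothetical helper name.
normalize_dump() {
    sed "s|$1|DISK|g"
}

# Succeed if two saved dumps describe the same partition layout.
# $1, $2: device paths; $3, $4: files holding their sfdisk -d output.
same_layout() {
    [ "$(normalize_dump "$1" < "$3")" = "$(normalize_dump "$2" < "$4")" ]
}

# On a live system (as root), something along these lines:
#   sfdisk -d /dev/sdb > /tmp/sdb.dump
#   sfdisk -d /dev/sda > /tmp/sda.dump
#   same_layout /dev/sdb /dev/sda /tmp/sdb.dump /tmp/sda.dump && echo "layouts match"
```

If the layouts differ, the partitions added to the arrays in the next step may be too small, so it is worth resolving any mismatch first.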

Add the new disk's partitions to the RAID arrays:

 [root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1
 mdadm: added /dev/sda1
 [root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2
 mdadm: added /dev/sda2
 [root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3
 mdadm: added /dev/sda3

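Sanity check. Each array should list both members again; arrays that are still rebuilding show [U_] with a recovery progress bar or resync=DELAYED until they finish: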

 [root@myServer ~]# cat /proc/mdstat
 Personalities : [raid1] [linear]
 md0 : active raid1 sda1[1] sdb1[0]
       128384 blocks [2/2] [UU]
 md1 : active raid1 sda2[2] sdb2[0]
       16779776 blocks [2/1] [U_]
       [>....................]  recovery =  3.2% (548864/16779776) finish=8.8min speed=30492K/sec
 md2 : active raid1 sda3[2] sdb3[0]
       139379840 blocks [2/1] [U_]
       resync=DELAYED
 unused devices: <none>
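
The rebuild runs in the background and can take a while. A small sketch for checking whether it has finished follows; all_in_sync is a hypothetical helper name, and the test relies on the [UU] / [U_] status field format shown above.

```shell
# Succeed only if no array in the mdstat text on stdin has a missing member,
# i.e. no "_" appears inside a status field such as [U_].
# all_in_sync is a hypothetical helper name.
all_in_sync() {
    ! grep -Eq '\[[U_]*_[U_]*\]'
}

# On a live system, poll until the rebuild is done:
#   until all_in_sync < /proc/mdstat; do sleep 60; done
if all_in_sync <<'EOF'
md0 : active raid1 sda1[1] sdb1[0]
      128384 blocks [2/2] [UU]
md1 : active raid1 sda2[2] sdb2[0]
      16779776 blocks [2/1] [U_]
EOF
then
    echo "all arrays in sync"
else
    echo "still rebuilding"
fi
```

With the sample input, md1 is still degraded, so the check fails and the else branch runs. The progress bar line does not trigger a false match because it contains characters other than U and _ between the brackets.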

That's it! :)