Replacing failed disk on Server

Latest revision as of 18:20, 17 May 2022

How to check if Disk failed

Check for the light on disk

ZFS machines

Blue => Normal

Red => Fail

If the light is not working, identify the disk by its vdev as follows:

Log into the machine

$ zpool status
.
.
scsi-35002538f31801401  ONLINE       0     0     0
scsi-35002538f31801628  FAULTED     22     0     0  too many errors <<< faulty disk
zfs-cd3cd912951df815    ONLINE       0     0     0
.
.

Identify vdev

If the faulty disk's identifier starts with scsi-****:
$ ls -l /dev/disk/by-id | grep <id>
If the faulty disk's identifier starts with zfs-****:
$ ls -l /dev/disk/by-partlabel | grep <id>
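
The two steps above (spot the faulty row in zpool status, then pick the matching /dev/disk directory) can be sketched in Python. This is only an illustration, not part of any script on our machines: the function names, the sample text and the set of "bad" states are assumptions.

```python
# Sample rows modeled on the zpool status excerpt above
# (columns: NAME STATE READ WRITE CKSUM [message]).
SAMPLE = """\
scsi-35002538f31801401  ONLINE       0     0     0
scsi-35002538f31801628  FAULTED     22     0     0  too many errors
zfs-cd3cd912951df815    ONLINE       0     0     0
"""

# Assumption: device states treated as unhealthy in this sketch.
BAD_STATES = {"FAULTED", "DEGRADED", "UNAVAIL", "OFFLINE", "REMOVED"}


def faulty_devices(status_text):
    """Return (name, state) for every row whose STATE column is unhealthy.

    Note: a degraded raidz/mirror group row would be reported too,
    since it has the same column layout as a disk row.
    """
    found = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[1] in BAD_STATES:
            found.append((fields[0], fields[1]))
    return found


def lookup_dir(identifier):
    """Pick the directory to grep in, following the rule described above."""
    if identifier.startswith("scsi-"):
        return "/dev/disk/by-id"
    if identifier.startswith("zfs-"):
        return "/dev/disk/by-partlabel"
    raise ValueError("unrecognized identifier prefix: " + identifier)
```

For the sample above, faulty_devices(SAMPLE) yields the scsi-35002538f31801628 row, and lookup_dir() on that name points at /dev/disk/by-id.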

Locate the disk physically

Flash the red light on the disk:
$ sudo ledctl locate=/dev/<vdev>
Turn off the light on the disk:
$ sudo ledctl locate_off=/dev/<vdev>

Others

Solid Yellow => Fail

Blinking Yellow => Predictive Failure (going to fail soon)

Green => Normal

Replace disk instruction

Determine which machine the disk belongs to.
Press the red button on the disk to turn it off.
Gently pull the disk out a little (NOT all the way) and wait 10 seconds until it stops spinning before pulling it all the way out.
Find a replacement disk with the same specs.
Carefully unscrew the disk from the disk holder (if the replacement already comes with the same disk holder, you don't have to).

Auto-check Disk Machines Python Script

On gimel5, a Python script runs every day at 12 am via crontab under s_jjg.

The file is located at: /nfs/home/jjg/python_scripts/check_for_failed_disks.py

This script SSHes into the machines below and runs a command to list the status of their disks. It does not cover Cluster 0 machines.

machines: abacus, n-9-22, tsadi, lamed, qof, zayin, n-1-30, n-1-109, n-1-113, shin
data pools: db2, db3, db5, db4, ex1, ex2, ex3, ex4, ex5, ex6, ex7, ex8, ex9, exa, exb, exc, exd, db

If any of the listed machines reports that a disk has failed, the script emails the sysadmins.

Example output:

  pool: db2
state: ONLINE
 pool: db3
state: ONLINE
 pool: db5
state: ONLINE
 pool: db4
state: ONLINE
 pool: ex1
state: ONLINE
 pool: ex2
state: ONLINE
 pool: ex3
state: ONLINE
 pool: ex4
state: ONLINE
 pool: ex5
state: ONLINE
 pool: ex6
state: ONLINE
 pool: ex7
state: ONLINE
 pool: ex8
state: ONLINE
 pool: ex9
state: ONLINE
 pool: exa
state: ONLINE
 pool: exb
state: ONLINE
 pool: exc
state: ONLINE
 pool: exd
state: ONLINE
----------------------------------------------------------------------------
pool: db2
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type
8:0      35 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:1      10 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:2      18 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:3      12 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:4      16 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:5      11 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:6      32 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:7      13 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:8      41 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:9      33 Onln   0 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -    
8:10     20 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:11     27 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:12     23 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:13     25 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:14     14 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:15     42 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:16     19 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:17     39 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:18     40 Onln   0 3.637 TB SAS  HDD N   N  512B MB4000JEFNC      U  -    
8:19     29 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:20     26 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:21     36 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:22     34 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
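
The core of such a daily check can be sketched as follows. This is not the contents of check_for_failed_disks.py; it is a minimal illustration that assumes the check boils down to reading the pool:/state: pairs shown in the example output and alerting on anything that is not ONLINE. The function names and the host name "example-host" are hypothetical.

```python
# Sample text in the same shape as the example output above.
SAMPLE = """\
  pool: db2
 state: ONLINE
  pool: ex1
 state: DEGRADED
"""


def pool_states(status_text):
    """Map pool name -> state from 'pool:' / 'state:' line pairs."""
    states = {}
    current = None
    for line in status_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a 'key: value' line
        if key == "pool":
            current = value.strip()
        elif key == "state" and current is not None:
            states[current] = value.strip()
            current = None
    return states


def alert_text(host, states):
    """Return a message for the sysadmins, or None if every pool is ONLINE."""
    bad = sorted(name for name, state in states.items() if state != "ONLINE")
    if not bad:
        return None
    lines = ["Disk check on {}: pools needing attention".format(host)]
    lines += ["  {}: {}".format(name, states[name]) for name in bad]
    return "\n".join(lines)
```

In a real script the text would come from running the status command over SSH on each machine, and a non-None result would be mailed out; both steps are left out here because the actual mechanism is not described on this page.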

How to check if a disk has failed or is installed correctly

On Cluster 0's machines

1. Log into gimel as root

$ ssh root@sgehead1.bkslab.org

2. Log in as root to the machine that you identified earlier

$ ssh root@<machine_name>
Example: RAIDs 3, 6 and 7 belong to nfshead2

3. Run this command

$ /opt/compaq/hpacucli/bld/hpacucli ctrl all show config

Output Example:
Smart Array P800 in Slot 1                (sn: PAFGF0N9SXQ0MX)
  array A (SATA, Unused Space: 0 MB)
     logicaldrive 1 (5.5 TB, RAID 1+0, OK)
     physicaldrive 1E:1:1 (port 1E:box 1:bay 1, SATA, 1 TB, OK)
     physicaldrive 1E:1:2 (port 1E:box 1:bay 2, SATA, 1 TB, OK)
     physicaldrive 1E:1:3 (port 1E:box 1:bay 3, SATA, 1 TB, OK)
     physicaldrive 1E:1:4 (port 1E:box 1:bay 4, SATA, 1 TB, OK)
     physicaldrive 1E:1:5 (port 1E:box 1:bay 5, SATA, 1 TB, OK)
     physicaldrive 1E:1:6 (port 1E:box 1:bay 6, SATA, 1 TB, OK)
     physicaldrive 1E:1:7 (port 1E:box 1:bay 7, SATA, 1 TB, OK)
     physicaldrive 1E:1:8 (port 1E:box 1:bay 8, SATA, 1 TB, OK)
     physicaldrive 1E:1:9 (port 1E:box 1:bay 9, SATA, 1 TB, OK)
     physicaldrive 1E:1:10 (port 1E:box 1:bay 10, SATA, 1 TB, OK)
     physicaldrive 1E:1:11 (port 1E:box 1:bay 11, SATA, 1 TB, OK)
     physicaldrive 1E:1:12 (port 1E:box 1:bay 12, SATA, 1 TB, OK)
  array B (SATA, Unused Space: 0 MB)
     logicaldrive 2 (5.5 TB, RAID 1+0, OK)
     physicaldrive 2E:1:1 (port 2E:box 1:bay 1, SATA, 1 TB, OK)
     physicaldrive 2E:1:2 (port 2E:box 1:bay 2, SATA, 1 TB, Predictive Failure)
     physicaldrive 2E:1:3 (port 2E:box 1:bay 3, SATA, 1 TB, OK)
     physicaldrive 2E:1:4 (port 2E:box 1:bay 4, SATA, 1 TB, OK)
     physicaldrive 2E:1:5 (port 2E:box 1:bay 5, SATA, 1 TB, OK)
     physicaldrive 2E:1:6 (port 2E:box 1:bay 6, SATA, 1 TB, OK)
     physicaldrive 2E:1:7 (port 2E:box 1:bay 7, SATA, 1 TB, OK)
     physicaldrive 2E:1:8 (port 2E:box 1:bay 8, SATA, 1 TB, OK)
     physicaldrive 2E:1:9 (port 2E:box 1:bay 9, SATA, 1 TB, OK)
     physicaldrive 2E:1:10 (port 2E:box 1:bay 10, SATA, 1 TB, OK)
     physicaldrive 2E:1:11 (port 2E:box 1:bay 11, SATA, 1 TB, OK)
     physicaldrive 2E:1:12 (port 2E:box 1:bay 12, SATA, 1 TB, OK)
  array C (SATA, Unused Space: 0 MB)
     logicaldrive 3 (5.5 TB, RAID 1+0, Ready for Rebuild)
     physicaldrive 2E:2:1 (port 2E:box 2:bay 1, SATA, 1 TB, OK)
     physicaldrive 2E:2:2 (port 2E:box 2:bay 2, SATA, 1 TB, OK)
     physicaldrive 2E:2:3 (port 2E:box 2:bay 3, SATA, 1 TB, OK)
     physicaldrive 2E:2:4 (port 2E:box 2:bay 4, SATA, 1 TB, OK)
     physicaldrive 2E:2:5 (port 2E:box 2:bay 5, SATA, 1 TB, OK)
     physicaldrive 2E:2:6 (port 2E:box 2:bay 6, SATA, 1 TB, OK)
     physicaldrive 2E:2:7 (port 2E:box 2:bay 7, SATA, 1 TB, OK)
     physicaldrive 2E:2:8 (port 2E:box 2:bay 8, SATA, 1 TB, OK)
     physicaldrive 2E:2:9 (port 2E:box 2:bay 9, SATA, 1 TB, OK)
     physicaldrive 2E:2:10 (port 2E:box 2:bay 10, SATA, 1 TB, OK)
     physicaldrive 2E:2:11 (port 2E:box 2:bay 11, SATA, 1 TB, OK)
     physicaldrive 2E:2:12 (port 2E:box 2:bay 12, SATA, 1 TB, OK)
  Expander 243 (WWID: 50014380031A4B00, Port: 1E, Box: 1)
  Expander 245 (WWID: 5001438005396E00, Port: 2E, Box: 2)
  Expander 246 (WWID: 500143800460A600, Port: 2E, Box: 1)
  Expander 248 (WWID: 50014380055E913F)
  Enclosure SEP (Vendor ID HP, Model MSA60) 241 (WWID: 50014380031A4B25, Port: 1E, Box: 1)
  Enclosure SEP (Vendor ID HP, Model MSA60) 242 (WWID: 5001438005396E25, Port: 2E, Box: 2)
  Enclosure SEP (Vendor ID HP, Model MSA60) 244 (WWID: 500143800460A625, Port: 2E, Box: 1)
  SEP (Vendor ID HP, Model P800) 247 (WWID: 50014380055E913E)
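
When the output is this long, it is easy to miss the one line that is not OK. A small Python sketch for filtering it is shown below. It assumes the status is always the last comma-separated item inside the parentheses, as in the example output; the function name is hypothetical.

```python
import re

# Lines of interest look like:
#   logicaldrive 3 (5.5 TB, RAID 1+0, Ready for Rebuild)
#   physicaldrive 2E:1:2 (port 2E:box 1:bay 2, SATA, 1 TB, Predictive Failure)
DRIVE_LINE = re.compile(r"^\s*(logicaldrive|physicaldrive)\s+(\S+)\s+\((.*)\)\s*$")

SAMPLE = """\
  array B (SATA, Unused Space: 0 MB)
     logicaldrive 2 (5.5 TB, RAID 1+0, OK)
     physicaldrive 2E:1:1 (port 2E:box 1:bay 1, SATA, 1 TB, OK)
     physicaldrive 2E:1:2 (port 2E:box 1:bay 2, SATA, 1 TB, Predictive Failure)
  array C (SATA, Unused Space: 0 MB)
     logicaldrive 3 (5.5 TB, RAID 1+0, Ready for Rebuild)
     physicaldrive 2E:2:1 (port 2E:box 2:bay 1, SATA, 1 TB, OK)
"""


def drives_needing_attention(config_text):
    """Return (kind, id, status) for every drive whose status is not OK."""
    problems = []
    for line in config_text.splitlines():
        match = DRIVE_LINE.match(line)
        if not match:
            continue  # array headers, expanders, enclosures, etc.
        kind, ident, details = match.groups()
        status = details.split(",")[-1].strip()
        if status != "OK":
            problems.append((kind, ident, status))
    return problems
```

On the sample this reports physicaldrive 2E:1:2 (Predictive Failure) and logicaldrive 3 (Ready for Rebuild), the same two lines that stand out in the example output.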

On shin

As root

/opt/MegaRAID/storcli/storcli64 /c0 /eall /sall show all

Drive /c0/e8/s18 :
================

-----------------------------------------------------------------------------
EID:Slt DID State  DG     Size Intf Med SED PI SeSz Model            Sp Type
-----------------------------------------------------------------------------
8:18     24 Failed  0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
-----------------------------------------------------------------------------

EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
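
The drive table rows can be screened with a few lines of Python. This is a sketch under assumptions: it relies on the column order shown above (EID:Slt, DID, State, ...), and the set of states treated as a problem is a guess based on the legend plus the "Failed" row. Hot spares and unconfigured-good drives are deliberately not flagged.

```python
import re

# First column of a drive row is "<enclosure>:<slot>", e.g. "8:18".
SLOT = re.compile(r"^\d+:\d+$")

# Assumption: states worth a closer look in this sketch.
PROBLEM_STATES = {"Failed", "UBad", "Offln"}

SAMPLE = """\
EID:Slt DID State  DG     Size Intf Med SED PI SeSz Model            Sp Type
8:17     39 Onln    0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:18     24 Failed  0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
"""


def drive_states(table_text):
    """Return (slot, state) for every drive row in a storcli table."""
    rows = []
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and SLOT.match(fields[0]):
            rows.append((fields[0], fields[2]))
    return rows


def problem_drives(table_text):
    """Return only the rows whose state is in PROBLEM_STATES."""
    return [row for row in drive_states(table_text) if row[1] in PROBLEM_STATES]
```

The slot value maps directly onto the storcli path, so 8:18 corresponds to /c0/e8/s18 in the detailed output.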


Drive /c0/e8/s18 - Detailed Information :
=======================================

Drive /c0/e8/s18 State :
======================
Shield Counter = 0
Media Error Count = 0
Other Error Count = 16
BBM Error Count = 0
Drive Temperature =  32C (89.60 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e8/s18 Device attributes :
==================================
SN = Z1Z2S2TL0000C4216E9V
Manufacturer Id = SEAGATE 
Model Number = ST4000NM0023    
NAND Vendor = NA
WWN = 5000C50057DB2A28
Firmware Revision = 0003
Firmware Release Number = 03290003
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = Port 0 - 3 & Port 4 - 7

On ZFS machines

$ zpool status

For instructions on how to identify and replace a failed disk on a ZFS system, read here.

On Any Raid1 Configurations

Steps to fix a hard drive failure that is in a raid 1 configuration:

The following demonstrates what a failed disk looks like:

 [root@myServer ~]# cat /proc/mdstat 

 Personalities : [raid1] 

 md0 : active raid1 sdb1[0] sda1[2](F) 

 128384 blocks [2/1] [U_]  

 md1 : active raid1 sdb2[0] sda2[2](F) 

 16779776 blocks [2/1] [U_] 

 md2 : active raid1 sdb3[0] sda3[2](F) 

 139379840 blocks [2/1] [U_] 
    
 unused devices: <none>

 [root@myServer ~]# smartctl -a /dev/sda   
              
 smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)   

 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net   

 Short INQUIRY response, skip product id   

 A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 [root@myServer ~]# smartctl -a /dev/sdb  

 smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)   

 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net   

 === START OF INFORMATION SECTION ===    

 Model Family:     Seagate Barracuda 7200.10    

 Device Model:     ST3160815AS    

 Serial Number:    9RA6DZP8     

 Firmware Version: 4.AAB    

 User Capacity:    160,041,885,696 bytes [160 GB]   

 Sector Size:      512 bytes logical/physical   

 Device is:        In smartctl database [for details use: -P show]   

 ATA Version is:   7   

 ATA Standard is:  Exact ATA specification draft version not indicated   

 Local Time is:    Mon Sep  8 15:50:48 2014 PDT  

 SMART support is: Available - device has SMART capability.   

 SMART support is: Enabled   

 === START OF READ SMART DATA SECTION ===   
 
 SMART overall-health self-assessment test result: PASSED

There is a lot more that gets printed, but I cut it out.

So /dev/sda has clearly failed.
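
The same conclusion can be drawn mechanically from /proc/mdstat: members marked (F) have failed, and an underscore in the [U_] map means the array is degraded. The Python sketch below assumes the layout shown above; the function name is hypothetical.

```python
import re

ARRAY = re.compile(r"^\s*(md\d+)\s*:\s*(.*)$")
MEMBER = re.compile(r"(\w+)\[\d+\](\(F\))?")        # e.g. sda1[2](F)
HEALTH = re.compile(r"\[\d+/\d+\]\s+\[([U_]+)\]")   # e.g. [2/1] [U_]

SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
128384 blocks [2/1] [U_]
md1 : active raid1 sdb2[0] sda2[2](F)
16779776 blocks [2/1] [U_]
unused devices: <none>
"""


def parse_mdstat(text):
    """Map array name -> {'failed': [members], 'degraded': bool}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        array = ARRAY.match(line)
        if array:
            current = array.group(1)
            failed = [name for name, flag in MEMBER.findall(array.group(2)) if flag]
            arrays[current] = {"failed": failed, "degraded": False}
            continue
        health = HEALTH.search(line)
        if health and current is not None:
            arrays[current]["degraded"] = "_" in health.group(1)
    return arrays
```

On a live machine the input would be the contents of /proc/mdstat, for example open("/proc/mdstat").read().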

Take note of the GOOD disk's serial number so that you leave that one in when you swap the drives:

 Serial Number:    9RA6DZP8

Mark and remove failed disk from raid:

 [root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1   

 mdadm: set /dev/sda1 faulty in /dev/md0   

 [root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2   
 
 mdadm: set /dev/sda2 faulty in /dev/md1   

 [root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3   

 mdadm: set /dev/sda3 faulty in /dev/md2   

 [root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1   

 mdadm: hot removed /dev/sda1   

 [root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2   

 mdadm: hot removed /dev/sda2   

 [root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3   

 mdadm: hot removed /dev/sda3
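
With several arrays it is easy to mistype one of the six commands. As a sketch, the list can be generated from the array-to-partition pairs and printed for review before running anything. The helper name is hypothetical; the mdadm options are the ones used above.

```python
def removal_commands(pairs):
    """Build the mdadm commands in the same order as above.

    pairs: list of (array_device, failed_partition) tuples.
    All members are marked failed first, then hot-removed.
    """
    fail = [["mdadm", "--manage", array, "--fail", part] for array, part in pairs]
    remove = [["mdadm", "--manage", array, "--remove", part] for array, part in pairs]
    return fail + remove


# The layout from this example: three arrays, each with one partition on sda.
PAIRS = [("/dev/md0", "/dev/sda1"), ("/dev/md1", "/dev/sda2"), ("/dev/md2", "/dev/sda3")]

for command in removal_commands(PAIRS):
    print(" ".join(command))
```

The script only prints the commands; double-check the partition names against /proc/mdstat before running them as root.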

Make sure grub is installed on the good disk and that grub.conf is updated:

 [root@myServer ~]# grub-install /dev/sdb   

 Installation finished. No error reported.   

 This is the contents of the device map /boot/grub/device.map.   

 Check if this is correct or not. 

 If any of the lines is incorrect, fix it and re-run the script `grub-install'.   

 This device map was generated by anaconda   

 (hd0)     /dev/sda   

 (hd1)     /dev/sdb

Take note of which hd device corresponds to the good disk, i.e. hd1 in this case.

 [root@myServer ~]# vim /boot/grub/menu.lst  

 Add fallback=1 right after default=0  

 Go to the bottom section where you should find some kernel stanzas.   

 Copy the first of them and paste the stanza before the first existing stanza; replace root (hd0,0) with root (hd1,0)  

 Should look like this:  

   [...]   

   title CentOS (2.6.18-128.el5)  

   root (hd1,0)  

   kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00  

   initrd /initrd-2.6.18-128.el5.img  

   title CentOS (2.6.18-128.el5)  

   root (hd0,0)  

   kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/  

   initrd /initrd-2.6.18-128.el5.img

Save and quit

 [root@myServer ~]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak   

 [root@myServer ~]# mkinitrd /boot/initramfs-$(uname -r).img $(uname -r)   

 [root@myServer ~]# init 0

Swap the bad drive with the new drive and boot the machine.

Once it's booted:

Check the device names with cat /proc/mdstat and/or fdisk -l.
The newly installed drive on myServer was named /dev/sda.

 [root@myServer ~]# modprobe raid1   

 [root@myServer ~]# modprobe linear

Copy the partitions from one disk to the other:

 [root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda   

 [root@myServer ~]# sfdisk -l   # sanity check

Add the new disk to the raid array:

 [root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1   

 mdadm: added /dev/sda1   

 [root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2   

 mdadm: added /dev/sda2  

 [root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3   

 mdadm: added /dev/sda3

Sanity check:

 [root@myServer ~]# cat /proc/mdstat  

 Personalities : [raid1] [linear]    

 md0 : active raid1 sda1[1] sdb1[0]   

 128384 blocks [2/2] [UU]   
      
 md1 : active raid1 sda2[2] sdb2[0]   

 16779776 blocks [2/1] [U_]   

 [>....................]  recovery =  3.2% (548864/16779776) finish=8.8min speed=30492K/sec   

 md2 : active raid1 sda3[2] sdb3[0]   

 139379840 blocks [2/1] [U_]   

 resync=DELAYED   
     
 unused devices: <none>
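
To keep an eye on the rebuild without reading the whole file each time, the progress line can be pulled out with a short Python sketch. It assumes the "recovery = N% ... finish=Mmin" format shown above; arrays that are waiting (resync=DELAYED) or already in sync are simply not reported. The function name is hypothetical.

```python
import re

ARRAY = re.compile(r"^\s*(md\d+)\s*:")
PROGRESS = re.compile(r"(recovery|resync)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min")

SAMPLE = """\
md0 : active raid1 sda1[1] sdb1[0]
128384 blocks [2/2] [UU]
md1 : active raid1 sda2[2] sdb2[0]
16779776 blocks [2/1] [U_]
[>....................]  recovery =  3.2% (548864/16779776) finish=8.8min speed=30492K/sec
md2 : active raid1 sda3[2] sdb3[0]
139379840 blocks [2/1] [U_]
resync=DELAYED
"""


def rebuild_progress(text):
    """Map array name -> (operation, percent_done, minutes_left)."""
    progress = {}
    current = None
    for line in text.splitlines():
        array = ARRAY.match(line)
        if array:
            current = array.group(1)
            continue
        found = PROGRESS.search(line)
        if found and current is not None:
            progress[current] = (
                found.group(1),
                float(found.group(2)),
                float(found.group(3)),
            )
    return progress
```

Once the result is empty and every array shows [UU] in /proc/mdstat, the rebuild is complete.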

That's it! :)
