Replacing failed disk on Server

From DISI

How to check if a disk has failed

Check for the light on disk

ZFS machines

Blue => Normal

Red => Fail

If the light is not working, identify the disk by its vdev as follows:

Log into the machine

$ zpool status
.
.
scsi-35002538f31801401  ONLINE       0     0     0
scsi-35002538f31801628  FAULTED     22     0     0  too many errors <<< faulty disk
zfs-cd3cd912951df815    ONLINE       0     0     0
.
.

Identify vdev

If faulty disk's identifier starts with scsi-**** 
$ ls -l /dev/disk/by-id | grep <id>
If faulty disk's identifier starts with zfs-****
$ ls -l /dev/disk/by-partlabel | grep <id>
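
The branch between by-id and by-partlabel can be captured in a tiny helper. This is only a sketch — resolve_dir is our name, not a standard tool:

```shell
# Sketch: choose the lookup directory from the vdev identifier's prefix.
# resolve_dir is a hypothetical helper, not a standard command.
resolve_dir() {
  case "$1" in
    scsi-*) echo /dev/disk/by-id ;;
    zfs-*)  echo /dev/disk/by-partlabel ;;
    *)      echo "unrecognized identifier: $1" >&2; return 1 ;;
  esac
}

# The lookup then becomes:
#   ls -l "$(resolve_dir <id>)" | grep <id>
```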

Locate the disk physically

To flash the red light on a disk:
$ sudo ledctl locate=/dev/<vdev>
To turn the light off again:
$ sudo ledctl locate_off=/dev/<vdev>
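
If you toggle locate LEDs often, the two calls can be wrapped in one helper. A sketch only — locate_led is our name, and the DRYRUN switch (our addition) lets you preview the command before running it as root:

```shell
# Sketch: wrapper around ledctl. locate_led is a hypothetical helper name.
# Set DRYRUN=1 to print the command instead of executing it (ledctl needs root).
locate_led() {   # usage: locate_led on|off /dev/<vdev>
  local action=locate
  [ "$1" = off ] && action=locate_off
  local cmd="ledctl $action=$2"
  if [ "${DRYRUN:-0}" = 1 ]; then echo "$cmd"; else sudo $cmd; fi
}

DRYRUN=1 locate_led on /dev/sdq    # prints: ledctl locate=/dev/sdq
```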

Others

Solid Yellow => Fail

Blinking Yellow => Predictive Failure (going to fail soon)

Green => Normal

Replace disk instruction

  • Determine which machine the disk belongs to.
  • Press the red button on the disk to turn it off.
  • Gently pull the disk partway out (NOT all the way), then wait about 10 seconds for it to stop spinning before pulling it all the way out.
  • Find a replacement disk with the same specs.
  • Carefully unscrew the disk from its holder (if the replacement already comes in an identical holder, you can skip this step).

Auto-check Disk Machines Python Script

In gimel5, there is a Python script that runs every day at 12 am via crontab under the s_jjg user.

The file is located at: /nfs/home/jjg/python_scripts/check_for_failed_disks.py

This script SSHes into the machines below and runs a command that lists the status of their disks (cluster 0 machines are not included).

machines: abacus, n-9-22, tsadi, lamed, qof, zayin, n-1-30, n-1-109, n-1-113, shin
data pools: db2, db3, db5, db4, ex1, ex2, ex3, ex4, ex5, ex6, ex7, ex8, ex9, exa, exb, exc, exd, db

If any of the listed machines reports a failed disk, the script will email the sysadmins.
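
The core of such a check is just scanning `zpool status` output for pools whose state is not ONLINE. A minimal sketch (the check_pools name is ours; the SSH and email parts of the real script are omitted):

```shell
# Sketch of the core check: flag any pool whose state is not ONLINE in
# `zpool status` output read from stdin. The real script also SSHes into
# each machine and emails the sysadmins; that part is omitted here.
check_pools() {
  awk '/^ *pool:/  { pool = $2 }
       /^ *state:/ && $2 != "ONLINE" { print pool ": " $2 }'
}

# Example against canned output:
printf ' pool: db2\n state: ONLINE\n pool: ex1\n state: DEGRADED\n' | check_pools
```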

Example output:

  pool: db2
state: ONLINE
 pool: db3
state: ONLINE
 pool: db5
state: ONLINE
 pool: db4
state: ONLINE
 pool: ex1
state: ONLINE
 pool: ex2
state: ONLINE
 pool: ex3
state: ONLINE
 pool: ex4
state: ONLINE
 pool: ex5
state: ONLINE
 pool: ex6
state: ONLINE
 pool: ex7
state: ONLINE
 pool: ex8
state: ONLINE
 pool: ex9
state: ONLINE
 pool: exa
state: ONLINE
 pool: exb
state: ONLINE
 pool: exc
state: ONLINE
 pool: exd
state: ONLINE
----------------------------------------------------------------------------
pool: db2
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type
8:0      35 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:1      10 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:2      18 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:3      12 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:4      16 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:5      11 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:6      32 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:7      13 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:8      41 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:9      33 Onln   0 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -    
8:10     20 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:11     27 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:12     23 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:13     25 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:14     14 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:15     42 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:16     19 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:17     39 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:18     40 Onln   0 3.637 TB SAS  HDD N   N  512B MB4000JEFNC      U  -    
8:19     29 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:20     26 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:21     36 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
8:22     34 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -

How to check if a disk has failed or is installed correctly

On cluster 0 machines

1. Log into gimel as root

$ ssh root@sgehead1.bkslab.org

2. Log in as root to the machine you identified earlier

$ ssh root@<machine_name>
Example: RAID 3, 6, 7 belong to nfshead2

3. Run this command

$ /opt/compaq/hpacucli/bld/hpacucli ctrl all show config
Output Example:
Smart Array P800 in Slot 1                (sn: PAFGF0N9SXQ0MX)
  array A (SATA, Unused Space: 0 MB)
     logicaldrive 1 (5.5 TB, RAID 1+0, OK)
     physicaldrive 1E:1:1 (port 1E:box 1:bay 1, SATA, 1 TB, OK)
     physicaldrive 1E:1:2 (port 1E:box 1:bay 2, SATA, 1 TB, OK)
     physicaldrive 1E:1:3 (port 1E:box 1:bay 3, SATA, 1 TB, OK)
     physicaldrive 1E:1:4 (port 1E:box 1:bay 4, SATA, 1 TB, OK)
     physicaldrive 1E:1:5 (port 1E:box 1:bay 5, SATA, 1 TB, OK)
     physicaldrive 1E:1:6 (port 1E:box 1:bay 6, SATA, 1 TB, OK)
     physicaldrive 1E:1:7 (port 1E:box 1:bay 7, SATA, 1 TB, OK)
     physicaldrive 1E:1:8 (port 1E:box 1:bay 8, SATA, 1 TB, OK)
     physicaldrive 1E:1:9 (port 1E:box 1:bay 9, SATA, 1 TB, OK)
     physicaldrive 1E:1:10 (port 1E:box 1:bay 10, SATA, 1 TB, OK)
     physicaldrive 1E:1:11 (port 1E:box 1:bay 11, SATA, 1 TB, OK)
     physicaldrive 1E:1:12 (port 1E:box 1:bay 12, SATA, 1 TB, OK)
  array B (SATA, Unused Space: 0 MB)
     logicaldrive 2 (5.5 TB, RAID 1+0, OK)
     physicaldrive 2E:1:1 (port 2E:box 1:bay 1, SATA, 1 TB, OK)
     physicaldrive 2E:1:2 (port 2E:box 1:bay 2, SATA, 1 TB, Predictive Failure)
     physicaldrive 2E:1:3 (port 2E:box 1:bay 3, SATA, 1 TB, OK)
     physicaldrive 2E:1:4 (port 2E:box 1:bay 4, SATA, 1 TB, OK)
     physicaldrive 2E:1:5 (port 2E:box 1:bay 5, SATA, 1 TB, OK)
     physicaldrive 2E:1:6 (port 2E:box 1:bay 6, SATA, 1 TB, OK)
     physicaldrive 2E:1:7 (port 2E:box 1:bay 7, SATA, 1 TB, OK)
     physicaldrive 2E:1:8 (port 2E:box 1:bay 8, SATA, 1 TB, OK)
     physicaldrive 2E:1:9 (port 2E:box 1:bay 9, SATA, 1 TB, OK)
     physicaldrive 2E:1:10 (port 2E:box 1:bay 10, SATA, 1 TB, OK)
     physicaldrive 2E:1:11 (port 2E:box 1:bay 11, SATA, 1 TB, OK)
     physicaldrive 2E:1:12 (port 2E:box 1:bay 12, SATA, 1 TB, OK)
  array C (SATA, Unused Space: 0 MB)
     logicaldrive 3 (5.5 TB, RAID 1+0, Ready for Rebuild)
     physicaldrive 2E:2:1 (port 2E:box 2:bay 1, SATA, 1 TB, OK)
     physicaldrive 2E:2:2 (port 2E:box 2:bay 2, SATA, 1 TB, OK)
     physicaldrive 2E:2:3 (port 2E:box 2:bay 3, SATA, 1 TB, OK)
     physicaldrive 2E:2:4 (port 2E:box 2:bay 4, SATA, 1 TB, OK)
     physicaldrive 2E:2:5 (port 2E:box 2:bay 5, SATA, 1 TB, OK)
     physicaldrive 2E:2:6 (port 2E:box 2:bay 6, SATA, 1 TB, OK)
     physicaldrive 2E:2:7 (port 2E:box 2:bay 7, SATA, 1 TB, OK)
     physicaldrive 2E:2:8 (port 2E:box 2:bay 8, SATA, 1 TB, OK)
     physicaldrive 2E:2:9 (port 2E:box 2:bay 9, SATA, 1 TB, OK)
     physicaldrive 2E:2:10 (port 2E:box 2:bay 10, SATA, 1 TB, OK)
     physicaldrive 2E:2:11 (port 2E:box 2:bay 11, SATA, 1 TB, OK)
     physicaldrive 2E:2:12 (port 2E:box 2:bay 12, SATA, 1 TB, OK)
  Expander 243 (WWID: 50014380031A4B00, Port: 1E, Box: 1)
  Expander 245 (WWID: 5001438005396E00, Port: 2E, Box: 2)
  Expander 246 (WWID: 500143800460A600, Port: 2E, Box: 1)
  Expander 248 (WWID: 50014380055E913F)
  Enclosure SEP (Vendor ID HP, Model MSA60) 241 (WWID: 50014380031A4B25, Port: 1E, Box: 1)
  Enclosure SEP (Vendor ID HP, Model MSA60) 242 (WWID: 5001438005396E25, Port: 2E, Box: 2)
  Enclosure SEP (Vendor ID HP, Model MSA60) 244 (WWID: 500143800460A625, Port: 2E, Box: 1)
  SEP (Vendor ID HP, Model P800) 247 (WWID: 50014380055E913E)
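
To scan a dump like the one above for trouble, it is usually enough to filter out the drives that report "OK". A grep sketch (flag_bad_drives is our name):

```shell
# Sketch: keep only the physical drives whose status is NOT "OK" in an
# `hpacucli ctrl all show config` dump read from stdin.
flag_bad_drives() { grep physicaldrive | grep -v ', OK)'; }

# Example with two lines lifted from a dump:
printf '%s\n' \
  'physicaldrive 2E:1:1 (port 2E:box 1:bay 1, SATA, 1 TB, OK)' \
  'physicaldrive 2E:1:2 (port 2E:box 1:bay 2, SATA, 1 TB, Predictive Failure)' \
  | flag_bad_drives
```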

On shin

As root

/opt/MegaRAID/storcli/storcli64 /c0 /eall /sall show all
Drive /c0/e8/s18 :
 ================

 -----------------------------------------------------------------------------
EID:Slt DID State  DG     Size Intf Med SED PI SeSz Model            Sp Type 
-----------------------------------------------------------------------------
8:18     24 Failed  0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -    
-----------------------------------------------------------------------------

EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded


Drive /c0/e8/s18 - Detailed Information :
=======================================

Drive /c0/e8/s18 State :
======================
Shield Counter = 0
Media Error Count = 0
Other Error Count = 16
BBM Error Count = 0
Drive Temperature =  32C (89.60 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e8/s18 Device attributes :
==================================
SN = Z1Z2S2TL0000C4216E9V
Manufacturer Id = SEAGATE 
Model Number = ST4000NM0023    
NAND Vendor = NA
WWN = 5000C50057DB2A28
Firmware Revision = 0003
Firmware Release Number = 03290003
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = Port 0 - 3 & Port 4 - 7 
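
In the storcli drive table, the third whitespace-separated column is the State field, so unhealthy drives can be picked out with a one-line awk filter. A sketch (flag_storcli_drives is our name; the state abbreviations are taken from the legend above):

```shell
# Sketch: pull drives whose State column is not healthy out of a storcli
# drive table read from stdin (third field is State).
flag_storcli_drives() { awk '$3 ~ /Failed|Offln|UBad/ { print "check drive in slot", $1 }'; }

# Example with two rows from a drive table:
printf '%s\n' \
  '8:17     39 Onln    0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -' \
  '8:18     24 Failed  0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -' \
  | flag_storcli_drives
```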

On ZFS machines

$ zpool status

For instructions on how to identify and replace a failed disk on a ZFS system, see the dedicated ZFS disk replacement page on this wiki.

On Any Raid1 Configurations

Steps to fix a failed hard drive in a RAID 1 configuration:

The following demonstrates what a failed disk looks like:

 [root@myServer ~]# cat /proc/mdstat 
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
128384 blocks [2/1] [U_]
md1 : active raid1 sdb2[0] sda2[2](F)
16779776 blocks [2/1] [U_]
md2 : active raid1 sdb3[0] sda3[2](F)
139379840 blocks [2/1] [U_]
unused devices: <none>
 [root@myServer ~]# smartctl -a /dev/sda   
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


 [root@myServer ~]# smartctl -a /dev/sdb  
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST3160815AS
Serial Number: 9RA6DZP8
Firmware Version: 4.AAB
User Capacity: 160,041,885,696 bytes [160 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Sep 8 15:50:48 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

There is a lot more that gets printed, but I cut it out.

So /dev/sda has clearly failed.
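
In /proc/mdstat the failed members are the ones tagged with a trailing (F). A sketch that extracts just those (failed_members is our name):

```shell
# Sketch: list md array members flagged as failed -- /proc/mdstat marks
# them with a trailing "(F)".
failed_members() { grep -o '[a-z]\{1,\}[0-9]\{1,\}\[[0-9]\{1,\}\](F)'; }

# Example against the mdstat line shown above:
printf 'md0 : active raid1 sdb1[0] sda1[2](F)\n' | failed_members
```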

Take note of the GOOD disk's serial number, so that you leave that one in place when swapping drives:

 Serial Number:    9RA6DZP8   

Mark the failed disk's partitions as faulty and remove them from the arrays:

 [root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1   
mdadm: set /dev/sda1 faulty in /dev/md0
[root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1
[root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2
[root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3
mdadm: hot removed /dev/sda3
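
The six mdadm calls above follow a pattern (partition N of the bad disk belongs to the Nth array here), so they can be generated by a loop. A sketch only — drop_disk is our name, the md-to-partition pairing is specific to this example layout, and the DRYRUN switch (our addition) prints the commands instead of running them:

```shell
# Sketch: fail and remove every partition of a dying disk across its md
# arrays. drop_disk is a hypothetical helper; the assumption that the
# Nth listed array holds partition N matches the example above only.
# Set DRYRUN=1 to print the commands instead of executing them.
drop_disk() {
  local disk="$1"; shift
  local i=0
  for md in "$@"; do
    i=$((i + 1))
    for op in --fail --remove; do
      cmd="mdadm --manage $md $op ${disk}${i}"
      if [ "${DRYRUN:-0}" = 1 ]; then echo "$cmd"; else $cmd; fi
    done
  done
}

DRYRUN=1 drop_disk /dev/sda /dev/md0 /dev/md1 /dev/md2
```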

Make sure grub is installed on the good disk and that grub.conf is updated:

 [root@myServer ~]# grub-install /dev/sdb   
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not.
If any of the lines is incorrect, fix it and re-run the script `grub-install'.
This device map was generated by anaconda
(hd0) /dev/sda
(hd1) /dev/sdb

Take note of which hd device corresponds to the good disk, i.e. hd1 in this case.

 [root@myServer ~]# vim /boot/grub/menu.lst  
Add fallback=1 right after default=0.
Go to the bottom section, where you should find some kernel stanzas.
Copy the first stanza and paste it before the first existing one, replacing root (hd0,0) with root (hd1,0).
It should look like this:
[...]
title CentOS (2.6.18-128.el5)
root (hd1,0)
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00
initrd /initrd-2.6.18-128.el5.img
title CentOS (2.6.18-128.el5)
root (hd0,0)
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/
initrd /initrd-2.6.18-128.el5.img

Save and quit

 [root@myServer ~]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak   
[root@myServer ~]# mkinitrd /boot/initramfs-$(uname -r).img $(uname -r)
[root@myServer ~]# init 0

Swap the bad drive with the new drive and boot the machine.

Once it's booted:

Check the device names with cat /proc/mdstat and/or fdisk -l.
The newly installed drive on myServer was named /dev/sda.

 [root@myServer ~]# modprobe raid1   
[root@myServer ~]# modprobe linear

Copy the partitions from one disk to the other:

 [root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda   
[root@myServer ~]# sfdisk -l => sanity check

Add the new disk to the raid array:

 [root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1   
mdadm: added /dev/sda1
[root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2
mdadm: added /dev/sda2
[root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3
mdadm: added /dev/sda3

Sanity check:

 [root@myServer ~]# cat /proc/mdstat  
Personalities : [raid1] [linear]
md0 : active raid1 sda1[1] sdb1[0]
128384 blocks [2/2] [UU]
md1 : active raid1 sda2[2] sdb2[0]
16779776 blocks [2/1] [U_]
[>....................] recovery = 3.2% (548864/16779776) finish=8.8min speed=30492K/sec
md2 : active raid1 sda3[2] sdb3[0]
139379840 blocks [2/1] [U_]
resync=DELAYED
unused devices: <none>

That's it! :)