Replacing failed disk on Server
Latest revision as of 18:20, 17 May 2022
How to check if Disk failed
Check for the light on disk
ZFS machines
Blue => Normal
Red => Fail
If the light is not working, identify the disk by its vdev as follows:
Log into the machine
$ zpool status
.
.
scsi-35002538f31801401 ONLINE 0 0 0
scsi-35002538f31801628 FAULTED 22 0 0 too many errors <<< faulty disk
zfs-cd3cd912951df815 ONLINE 0 0 0
.
.
Identify vdev
If faulty disk's identifier starts with scsi-****
$ ls -l /dev/disk/by-id | grep <id>
If faulty disk's identifier starts with zfs-****
$ ls -l /dev/disk/by-partlabel | grep <id>
Locate the disk physically
Flashing red light on disk
$ sudo ledctl locate=/dev/<vdev>
Turn off light on disk
$ sudo ledctl locate_off=/dev/<vdev>
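The identification steps above can be sketched in Python. This is only an illustration, not part of any existing tooling: the function name `faulty_vdevs` and the set of states treated as unhealthy are assumptions, and the input is expected to be the text printed by `zpool status`.

```python
# States other than ONLINE that zpool status can report for a device.
# Which of these should trigger a replacement is an assumption here.
BAD_STATES = {"FAULTED", "DEGRADED", "UNAVAIL", "OFFLINE", "REMOVED"}


def faulty_vdevs(zpool_status_text):
    """Return (identifier, state, lookup_dir) for each unhealthy disk line."""
    results = []
    for line in zpool_status_text.splitlines():
        fields = line.split()
        # Device lines look like: NAME STATE READ WRITE CKSUM [message]
        if len(fields) < 5 or fields[1] not in BAD_STATES:
            continue
        ident = fields[0]
        if ident.startswith("scsi-"):
            lookup_dir = "/dev/disk/by-id"
        elif ident.startswith("zfs-"):
            lookup_dir = "/dev/disk/by-partlabel"
        else:
            continue  # pool and raidz lines, or other identifier styles
        results.append((ident, fields[1], lookup_dir))
    return results


if __name__ == "__main__":
    sample = (
        "scsi-35002538f31801401 ONLINE 0 0 0\n"
        "scsi-35002538f31801628 FAULTED 22 0 0 too many errors\n"
        "zfs-cd3cd912951df815 ONLINE 0 0 0\n"
    )
    for ident, state, lookup_dir in faulty_vdevs(sample):
        print(f"{ident} is {state}; look it up with: ls -l {lookup_dir} | grep {ident}")
```

The sketch only reports which directory to search; running `ledctl` is left as a manual step so that a parsing mistake cannot light up the wrong disk.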
Others
Solid Yellow => Fail
Blinking Yellow => Predictive Failure (going to fail soon)
Green => Normal
Disk replacement instructions
- Determine which machine the disk belongs to.
- Press the red button on the disk to turn it off.
- Gently pull the disk out a little (NOT all the way) and wait about 10 seconds until it stops spinning before pulling it all the way out.
- Find a replacement disk with the same specs.
- Carefully unscrew the disk from the disk holder (if the replacement comes with the same disk holder, you can skip this).
Auto-check Disk Machines Python Script
On gimel5, a Python script runs every day at 12am through crontab under s_jjg.
The file is located at: /nfs/home/jjg/python_scripts/check_for_failed_disks.py
This script ssh-es into the machines below and runs a command to list the status of disks. (Does not include cluster 0 machines)
machines: abacus, n-9-22, tsadi, lamed, qof, zayin, n-1-30, n-1-109, n-1-113, shin
data pools: db2, db3, db5, db4, ex1, ex2, ex3, ex4, ex5, ex6, ex7, ex8, ex9, exa, exb, exc, exd, db
If any of the listed machines reports that a disk has failed, the script will email the sysadmins.
Example output:
pool: db2
state: ONLINE
pool: db3
state: ONLINE
pool: db5
state: ONLINE
pool: db4
state: ONLINE
pool: ex1
state: ONLINE
pool: ex2
state: ONLINE
pool: ex3
state: ONLINE
pool: ex4
state: ONLINE
pool: ex5
state: ONLINE
pool: ex6
state: ONLINE
pool: ex7
state: ONLINE
pool: ex8
state: ONLINE
pool: ex9
state: ONLINE
pool: exa
state: ONLINE
pool: exb
state: ONLINE
pool: exc
state: ONLINE
pool: exd
state: ONLINE
----------------------------------------------------------------------------
pool: db2
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type
8:0      35 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:1      10 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:2      18 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:3      12 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:4      16 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:5      11 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:6      32 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:7      13 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:8      41 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:9      33 Onln   0 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -
8:10     20 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:11     27 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:12     23 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:13     25 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:14     14 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:15     42 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:16     19 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:17     39 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:18     40 Onln   0 3.637 TB SAS  HDD N   N  512B MB4000JEFNC      U  -
8:19     29 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:20     26 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:21     36 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
8:22     34 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0023     U  -
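The pool/state part of this report is easy to check programmatically. The following is a minimal sketch of how such a check could look; it is not the contents of check_for_failed_disks.py, and the function name `unhealthy_pools` is made up for illustration.

```python
def unhealthy_pools(report_text):
    """Map pool name -> state for every pool whose state is not ONLINE."""
    bad = {}
    current_pool = None
    for line in report_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # separator lines and anything without a colon
        key, value = key.strip(), value.strip()
        if key == "pool":
            current_pool = value
        elif key == "state" and current_pool is not None:
            if value != "ONLINE":
                bad[current_pool] = value
            current_pool = None
    return bad


if __name__ == "__main__":
    sample = "pool: db2\nstate: ONLINE\npool: db3\nstate: DEGRADED\n"
    for pool, state in unhealthy_pools(sample).items():
        print(f"pool {pool} is {state}")
```

Rows from the controller table (for example `8:0 35 Onln ...`) contain a colon too, but their key is neither `pool` nor `state`, so they are ignored by this function.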
How to check if a disk has failed or is installed correctly
On Cluster 0 machines
1. Log into gimel as root
$ ssh root@sgehead1.bkslab.org
2. Log in as root to the machine that you identified earlier
$ ssh root@<machine_name>
Example: RAID 3,6,7 belongs to nfshead2
3. Run this command
$ /opt/compaq/hpacucli/bld/hpacucli ctrl all show config
Output Example:
Smart Array P800 in Slot 1 (sn: PAFGF0N9SXQ0MX)
   array A (SATA, Unused Space: 0 MB)
      logicaldrive 1 (5.5 TB, RAID 1+0, OK)
      physicaldrive 1E:1:1 (port 1E:box 1:bay 1, SATA, 1 TB, OK)
      physicaldrive 1E:1:2 (port 1E:box 1:bay 2, SATA, 1 TB, OK)
      physicaldrive 1E:1:3 (port 1E:box 1:bay 3, SATA, 1 TB, OK)
      physicaldrive 1E:1:4 (port 1E:box 1:bay 4, SATA, 1 TB, OK)
      physicaldrive 1E:1:5 (port 1E:box 1:bay 5, SATA, 1 TB, OK)
      physicaldrive 1E:1:6 (port 1E:box 1:bay 6, SATA, 1 TB, OK)
      physicaldrive 1E:1:7 (port 1E:box 1:bay 7, SATA, 1 TB, OK)
      physicaldrive 1E:1:8 (port 1E:box 1:bay 8, SATA, 1 TB, OK)
      physicaldrive 1E:1:9 (port 1E:box 1:bay 9, SATA, 1 TB, OK)
      physicaldrive 1E:1:10 (port 1E:box 1:bay 10, SATA, 1 TB, OK)
      physicaldrive 1E:1:11 (port 1E:box 1:bay 11, SATA, 1 TB, OK)
      physicaldrive 1E:1:12 (port 1E:box 1:bay 12, SATA, 1 TB, OK)
   array B (SATA, Unused Space: 0 MB)
      logicaldrive 2 (5.5 TB, RAID 1+0, OK)
      physicaldrive 2E:1:1 (port 2E:box 1:bay 1, SATA, 1 TB, OK)
      physicaldrive 2E:1:2 (port 2E:box 1:bay 2, SATA, 1 TB, Predictive Failure)
      physicaldrive 2E:1:3 (port 2E:box 1:bay 3, SATA, 1 TB, OK)
      physicaldrive 2E:1:4 (port 2E:box 1:bay 4, SATA, 1 TB, OK)
      physicaldrive 2E:1:5 (port 2E:box 1:bay 5, SATA, 1 TB, OK)
      physicaldrive 2E:1:6 (port 2E:box 1:bay 6, SATA, 1 TB, OK)
      physicaldrive 2E:1:7 (port 2E:box 1:bay 7, SATA, 1 TB, OK)
      physicaldrive 2E:1:8 (port 2E:box 1:bay 8, SATA, 1 TB, OK)
      physicaldrive 2E:1:9 (port 2E:box 1:bay 9, SATA, 1 TB, OK)
      physicaldrive 2E:1:10 (port 2E:box 1:bay 10, SATA, 1 TB, OK)
      physicaldrive 2E:1:11 (port 2E:box 1:bay 11, SATA, 1 TB, OK)
      physicaldrive 2E:1:12 (port 2E:box 1:bay 12, SATA, 1 TB, OK)
   array C (SATA, Unused Space: 0 MB)
      logicaldrive 3 (5.5 TB, RAID 1+0, Ready for Rebuild)
      physicaldrive 2E:2:1 (port 2E:box 2:bay 1, SATA, 1 TB, OK)
      physicaldrive 2E:2:2 (port 2E:box 2:bay 2, SATA, 1 TB, OK)
      physicaldrive 2E:2:3 (port 2E:box 2:bay 3, SATA, 1 TB, OK)
      physicaldrive 2E:2:4 (port 2E:box 2:bay 4, SATA, 1 TB, OK)
      physicaldrive 2E:2:5 (port 2E:box 2:bay 5, SATA, 1 TB, OK)
      physicaldrive 2E:2:6 (port 2E:box 2:bay 6, SATA, 1 TB, OK)
      physicaldrive 2E:2:7 (port 2E:box 2:bay 7, SATA, 1 TB, OK)
      physicaldrive 2E:2:8 (port 2E:box 2:bay 8, SATA, 1 TB, OK)
      physicaldrive 2E:2:9 (port 2E:box 2:bay 9, SATA, 1 TB, OK)
      physicaldrive 2E:2:10 (port 2E:box 2:bay 10, SATA, 1 TB, OK)
      physicaldrive 2E:2:11 (port 2E:box 2:bay 11, SATA, 1 TB, OK)
      physicaldrive 2E:2:12 (port 2E:box 2:bay 12, SATA, 1 TB, OK)
   Expander 243 (WWID: 50014380031A4B00, Port: 1E, Box: 1)
   Expander 245 (WWID: 5001438005396E00, Port: 2E, Box: 2)
   Expander 246 (WWID: 500143800460A600, Port: 2E, Box: 1)
   Expander 248 (WWID: 50014380055E913F)
   Enclosure SEP (Vendor ID HP, Model MSA60) 241 (WWID: 50014380031A4B25, Port: 1E, Box: 1)
   Enclosure SEP (Vendor ID HP, Model MSA60) 242 (WWID: 5001438005396E25, Port: 2E, Box: 2)
   Enclosure SEP (Vendor ID HP, Model MSA60) 244 (WWID: 500143800460A625, Port: 2E, Box: 1)
   SEP (Vendor ID HP, Model P800) 247 (WWID: 50014380055E913E)
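With 36 drives in the listing, the one bad entry is easy to miss by eye. As a rough sketch (the helper name `drives_needing_attention` is hypothetical, and it assumes the drive status is always the last comma-separated field inside the parentheses, as in the example output), the `physicaldrive` entries can be filtered like this:

```python
import re

# Matches entries such as:
#   physicaldrive 2E:1:2 (port 2E:box 1:bay 2, SATA, 1 TB, Predictive Failure)
DRIVE_RE = re.compile(r"physicaldrive\s+(\S+)\s+\(([^)]*)\)")


def drives_needing_attention(config_text):
    """Return (drive_id, status) for every physical drive whose status is not OK."""
    flagged = []
    for drive_id, details in DRIVE_RE.findall(config_text):
        status = details.split(",")[-1].strip()
        if status != "OK":
            flagged.append((drive_id, status))
    return flagged


if __name__ == "__main__":
    sample = (
        "physicaldrive 2E:1:1 (port 2E:box 1:bay 1, SATA, 1 TB, OK)\n"
        "physicaldrive 2E:1:2 (port 2E:box 1:bay 2, SATA, 1 TB, Predictive Failure)\n"
    )
    for drive_id, status in drives_needing_attention(sample):
        print(f"{drive_id}: {status}")
```

The drive id encodes port, box, and bay, which is what you need to find the disk in the enclosure. Logical drive states such as "Ready for Rebuild" are not covered by this sketch.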
On shin
As root
/opt/MegaRAID/storcli/storcli64 /c0 /eall /sall show all
Drive /c0/e8/s18 :
================
-----------------------------------------------------------------------------
EID:Slt DID State  DG     Size Intf Med SED PI SeSz Model        Sp Type
-----------------------------------------------------------------------------
8:18     24 Failed  0 3.637 TB SAS  HDD N   N  512B ST4000NM0023 U  -
-----------------------------------------------------------------------------
EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
Drive /c0/e8/s18 - Detailed Information :
=======================================
Drive /c0/e8/s18 State :
======================
Shield Counter = 0
Media Error Count = 0
Other Error Count = 16
BBM Error Count = 0
Drive Temperature = 32C (89.60 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No
Drive /c0/e8/s18 Device attributes :
==================================
SN = Z1Z2S2TL0000C4216E9V
Manufacturer Id = SEAGATE
Model Number = ST4000NM0023
NAND Vendor = NA
WWN = 5000C50057DB2A28
Firmware Revision = 0003
Firmware Release Number = 03290003
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = Port 0 - 3 & Port 4 - 7
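The `show all` output is long, and the only part needed to spot a failure is the State column of each `EID:Slt` row. A small sketch that pulls out the rows that are not `Onln` follows; the function name `non_online_drives` is an assumption, and so is the choice to flag every state other than `Onln` (hot spares such as `GHS` or `DHS` would be reported too, so adjust as needed).

```python
import re

# A drive row starts with "EID:Slt DID State", e.g. "8:18 24 Failed 0 3.637 TB ..."
ROW_RE = re.compile(r"^\s*(\d+):(\d+)\s+(\d+)\s+(\S+)", re.MULTILINE)


def non_online_drives(storcli_text):
    """Return ("EID:Slt", state) for every drive row whose state is not Onln."""
    flagged = []
    for eid, slot, _did, state in ROW_RE.findall(storcli_text):
        if state != "Onln":
            flagged.append((f"{eid}:{slot}", state))
    return flagged


if __name__ == "__main__":
    sample = (
        "EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type\n"
        "8:17 39 Onln 0 3.637 TB SAS HDD N N 512B ST4000NM0023 U -\n"
        "8:18 24 Failed 0 3.637 TB SAS HDD N N 512B ST4000NM0023 U -\n"
    )
    for location, state in non_online_drives(sample):
        print(f"drive at enclosure:slot {location} is {state}")
```

The enclosure and slot numbers returned here map directly to the `/c0/eN/sM` path used in storcli commands.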
On ZFS machines
$ zpool status
For instructions on how to identify and replace a failed disk on a ZFS system, read here.
On Any Raid1 Configurations
Steps to fix a hard drive failure that is in a raid 1 configuration:
The following demonstrates what a failed disk looks like:
[root@myServer ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[0] sda1[2](F)
128384 blocks [2/1] [U_]
md1 : active raid1 sdb2[0] sda2[2](F)
16779776 blocks [2/1] [U_]
md2 : active raid1 sdb3[0] sda3[2](F)
139379840 blocks [2/1] [U_]
unused devices: <none>
[root@myServer ~]# smartctl -a /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
[root@myServer ~]# smartctl -a /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.18-371.1.2.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST3160815AS
Serial Number: 9RA6DZP8
Firmware Version: 4.AAB
User Capacity: 160,041,885,696 bytes [160 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Sep 8 15:50:48 2014 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
There is a lot more that gets printed, but I cut it out.
So /dev/sda has clearly failed.
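The `(F)` markers in `/proc/mdstat` are what identify the failed members. As an illustrative sketch (the function name `failed_members` is made up, and it assumes the array lines look like the example above), they can be collected per array like this:

```python
import re


def failed_members(mdstat_text):
    """Map md array name -> list of member partitions flagged with (F)."""
    failed = {}
    for line in mdstat_text.splitlines():
        match = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if not match:
            continue
        # Members look like "sda1[2](F)" when failed and "sdb1[0]" when healthy.
        members = re.findall(r"(\w+)\[\d+\]\(F\)", match.group(2))
        if members:
            failed[match.group(1)] = members
    return failed


if __name__ == "__main__":
    sample = (
        "Personalities : [raid1]\n"
        "md0 : active raid1 sdb1[0] sda1[2](F)\n"
        "      128384 blocks [2/1] [U_]\n"
    )
    print(failed_members(sample))
```

On a live machine the input would be the contents of `/proc/mdstat`; the smartctl check on the underlying disk is still worth doing by hand before pulling anything.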
Take note of the GOOD disk serial number so I leave that one in when I replace it:
Serial Number: 9RA6DZP8
Mark and remove failed disk from raid:
[root@myServer ~]# mdadm --manage /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
[root@myServer ~]# mdadm --manage /dev/md1 --fail /dev/sda2
mdadm: set /dev/sda2 faulty in /dev/md1
[root@myServer ~]# mdadm --manage /dev/md2 --fail /dev/sda3
mdadm: set /dev/sda3 faulty in /dev/md2
[root@myServer ~]# mdadm --manage /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1
[root@myServer ~]# mdadm --manage /dev/md1 --remove /dev/sda2
mdadm: hot removed /dev/sda2
[root@myServer ~]# mdadm --manage /dev/md2 --remove /dev/sda3
mdadm: hot removed /dev/sda3
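The fail-then-remove sequence is the same for every array, so the command list can be generated rather than typed. The sketch below only builds the command strings and does not run anything; the function name `removal_commands` and the input mapping (array name to failed partitions) are assumptions for illustration.

```python
def removal_commands(failed):
    """Build the mdadm commands to mark members faulty and then remove them.

    `failed` maps an array name such as "md0" to its failed partitions,
    for example {"md0": ["sda1"]}. All --fail commands come first, then
    all --remove commands, mirroring the order used above.
    """
    commands = []
    for action in ("--fail", "--remove"):
        for array, members in sorted(failed.items()):
            for partition in members:
                commands.append(
                    f"mdadm --manage /dev/{array} {action} /dev/{partition}"
                )
    return commands


if __name__ == "__main__":
    plan = removal_commands({"md0": ["sda1"], "md1": ["sda2"], "md2": ["sda3"]})
    print("\n".join(plan))
```

Printing the plan first and pasting it in by hand keeps a human check between the parsing and the destructive step.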
Make sure grub is installed on the good disk and that grub.conf is updated:
[root@myServer ~]# grub-install /dev/sdb
Installation finished. No error reported.
This is the contents of the device map /boot/grub/device.map.
Check if this is correct or not.
If any of the lines is incorrect, fix it and re-run the script `grub-install'.
This device map was generated by anaconda
(hd0) /dev/sda
(hd1) /dev/sdb
Take note of which hd device corresponds to the good disk, i.e. hd1 in this case.
[root@myServer ~]# vim /boot/grub/menu.lst
Add fallback=1 right after default=0
Go to the bottom section where you should find some kernel stanzas.
Copy the first of them and paste the stanza before the first existing stanza; replace root (hd0,0) with root (hd1,0)
Should look like this:
[...]
title CentOS (2.6.18-128.el5)
root (hd1,0)
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00
initrd /initrd-2.6.18-128.el5.img
title CentOS (2.6.18-128.el5)
root (hd0,0)
kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/
initrd /initrd-2.6.18-128.el5.img
Save and quit
[root@myServer ~]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
[root@myServer ~]# mkinitrd /boot/initramfs-$(uname -r).img $(uname -r)
[root@myServer ~]# init 0
Swap the bad drive with the new drive and boot the machine.
Once it's booted:
Check the device names with cat /proc/mdstat and/or fdisk -l.
The newly installed drive on myServer was named /dev/sda.
[root@myServer ~]# modprobe raid1
[root@myServer ~]# modprobe linear
Copy the partitions from one disk to the other:
[root@myServer ~]# sfdisk -d /dev/sdb | sfdisk --force /dev/sda
[root@myServer ~]# sfdisk -l => sanity check
Add the new disk to the raid array:
[root@myServer ~]# mdadm --manage /dev/md0 --add /dev/sda1
mdadm: added /dev/sda1
[root@myServer ~]# mdadm --manage /dev/md1 --add /dev/sda2
mdadm: added /dev/sda2
[root@myServer ~]# mdadm --manage /dev/md2 --add /dev/sda3
mdadm: added /dev/sda3
Sanity check:
[root@myServer ~]# cat /proc/mdstat
Personalities : [raid1] [linear]
md0 : active raid1 sda1[1] sdb1[0]
128384 blocks [2/2] [UU]
md1 : active raid1 sda2[2] sdb2[0]
16779776 blocks [2/1] [U_]
[>....................] recovery = 3.2% (548864/16779776) finish=8.8min speed=30492K/sec
md2 : active raid1 sda3[2] sdb3[0]
139379840 blocks [2/1] [U_]
resync=DELAYED
unused devices: <none>
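To follow the rebuild without reading the whole file each time, the sanity check can be reduced to one line per array. This is a minimal sketch with a hypothetical function name, `rebuild_status`, and it assumes the `[n/m] [UU]` and `recovery = x%` formats shown in the output above.

```python
import re


def rebuild_status(mdstat_text):
    """Map array name -> {"healthy": bool, "recovery": float or None}."""
    status = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            status[current] = {"healthy": False, "recovery": None}
            continue
        if current is None:
            continue
        # "[2/2] [UU]" means all members are up; "_" marks a missing member.
        flags = re.search(r"\[\d+/\d+\]\s+\[([U_]+)\]", line)
        if flags:
            status[current]["healthy"] = "_" not in flags.group(1)
        progress = re.search(r"recovery\s*=\s*([\d.]+)%", line)
        if progress:
            status[current]["recovery"] = float(progress.group(1))
    return status


if __name__ == "__main__":
    sample = (
        "md0 : active raid1 sda1[1] sdb1[0]\n"
        "      128384 blocks [2/2] [UU]\n"
        "md1 : active raid1 sda2[2] sdb2[0]\n"
        "      16779776 blocks [2/1] [U_]\n"
        "      [>....................]  recovery =  3.2% (548864/16779776)\n"
    )
    for array, info in rebuild_status(sample).items():
        print(array, info)
```

The rebuild is finished when every array reports healthy; arrays shown as `resync=DELAYED` simply stay unhealthy with no recovery figure until their turn comes.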
That's it! :)