Zfs: Difference between revisions

From DISI
== Introduction ==
ZFS (Zettabyte File System) is a combined file system and volume manager. We use it for redundant data storage and software RAID setups.

== Installation ==
#Install CentOS 7. Make sure it has access to the internet.
#Create a Foreman Entry [https://foreman.ucsf.bkslab.org/hosts here]
#Become root and run these commands. These will install all the necessary packages plus ZFS and also enable all the necessary firewall rules.
#*<source>#!/bin/bash

yum install epel-release -y
yum update -y
yum install puppet -y
yum install sssd -y
yum install nss-pam-ldapd -y
yum install oddjob-mkhomedir -y
systemctl start oddjobd
systemctl enable oddjobd
puppet agent -t
yum install https://zfsonlinux.org/epel/zfs-release.el7_9.noarch.rpm -y
yum install zfs -y

systemctl start nfs
systemctl enable nfs
systemctl start zfs.target
systemctl enable zfs.target
systemctl start zfs-import-cache.service
systemctl enable zfs-import-cache.service
firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=mountd
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --reload
</source>

== Beginning ZFS instances ==
There are only two commands to interact with ZFS.
 zpool: used to create a ZFS vdev (virtual device). vdevs are composed of physical devices.
 zfs: used to create/interact with a ZFS dataset. ZFS datasets are akin to logical volumes.


 # zpool creation syntax
 zpool create <poolname> <vdev(s)>
 # Create a zpool of six raidz2 vdevs, each with six drives. Includes two SSDs to be used as a mirrored SLOG and one SSD as an L2ARC read cache. (example command was run on qof)
 zpool create ex9 raidz2 sda sdb sdc sdd sde sdf raidz2 sdg sdh sdi sdj sdk sdl raidz2 sdm sdn sdo sdp sdq sdr raidz2 sds sdt sdu sdv sdw sdx raidz2 sdy sdz sdaa sdab sdac sdad raidz2 sdae sdaf sdag sdah sdai sdaj log mirror ata-INTEL_SSDSC2KG480G7_BTYM740603E0480BGN ata-INTEL_SSDSC2KG480G7_BTYM7406019K480BGN cache ata-INTEL_SSDSC2KG480G7_BTYM740602GN480BGN
 [root@qof ~]# zpool status
   pool: ex9
  state: ONLINE
   scan: none requested
 config:
   NAME                                            STATE     READ WRITE CKSUM
   ex9                                             ONLINE       0     0     0
     raidz2-0                                      ONLINE       0     0     0
       sda                                         ONLINE       0     0     0
       sdb                                         ONLINE       0     0     0
       sdc                                         ONLINE       0     0     0
       sdd                                         ONLINE       0     0     0
       sde                                         ONLINE       0     0     0
       sdf                                         ONLINE       0     0     0
     raidz2-1                                      ONLINE       0     0     0
       sdg                                         ONLINE       0     0     0
       sdh                                         ONLINE       0     0     0
       sdi                                         ONLINE       0     0     0
       sdj                                         ONLINE       0     0     0
       sdk                                         ONLINE       0     0     0
       sdl                                         ONLINE       0     0     0
     raidz2-2                                      ONLINE       0     0     0
       sdm                                         ONLINE       0     0     0
       sdn                                         ONLINE       0     0     0
       sdo                                         ONLINE       0     0     0
       sdp                                         ONLINE       0     0     0
       sdq                                         ONLINE       0     0     0
       sdr                                         ONLINE       0     0     0
     raidz2-3                                      ONLINE       0     0     0
       sds                                         ONLINE       0     0     0
       sdt                                         ONLINE       0     0     0
       sdu                                         ONLINE       0     0     0
       sdv                                         ONLINE       0     0     0
       sdw                                         ONLINE       0     0     0
       sdx                                         ONLINE       0     0     0
     raidz2-4                                      ONLINE       0     0     0
       sdy                                         ONLINE       0     0     0
       sdz                                         ONLINE       0     0     0
       sdaa                                        ONLINE       0     0     0
       sdab                                        ONLINE       0     0     0
       sdac                                        ONLINE       0     0     0
       sdad                                        ONLINE       0     0     0
     raidz2-5                                      ONLINE       0     0     0
       sdae                                        ONLINE       0     0     0
       sdaf                                        ONLINE       0     0     0
       sdag                                        ONLINE       0     0     0
       sdah                                        ONLINE       0     0     0
       sdai                                        ONLINE       0     0     0
       sdaj                                        ONLINE       0     0     0
   logs
     mirror-6                                      ONLINE       0     0     0
       ata-INTEL_SSDSC2KG480G7_BTYM740603E0480BGN  ONLINE       0     0     0
       ata-INTEL_SSDSC2KG480G7_BTYM7406019K480BGN  ONLINE       0     0     0
   cache
     ata-INTEL_SSDSC2KG480G7_BTYM740602GN480BGN    ONLINE       0     0     0

Adding a zfs filesystem:

Using qof as an example, I will create a child filesystem under ex9 named archive that will be mounted under /export/ex9/archive. This archive will be used to back up user data.

 -bash-4.2$ zfs list
 NAME          USED  AVAIL  REFER  MOUNTPOINT
 ex9          2.39T   249T  2.39T  /export/ex9
 -bash-4.2$ sudo zfs create -o mountpoint=/export/ex9/archive ex9/archive
 -bash-4.2$ zfs list
 NAME          USED  AVAIL  REFER  MOUNTPOINT
 ex9          2.39T   249T  2.39T  /export/ex9
 ex9/archive   192K   249T   192K  /export/ex9/archive

== How to Create the Zpool ==
#First determine which drives are the SSDs, because they will be the log and cache of the zfs filesystem. Use the ones that you didn't use for the OS.
#*<source>lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE</source>
#*2 x 240GB SSDs will be configured to be mirrors of each other for logging and 1 x 480GB SSD will be for cache.
#Let's now create the pool:
#*<source>
zpool create <options> <name-of-pool> raidz2 <hdd-1> <hdd-2> <hdd-3> <hdd-4> <hdd-5> <hdd-6>\
                                      raidz2 <hdd-7> <hdd-8> <hdd-9> <hdd-10> <hdd-11> <hdd-12>\
                                      raidz2 <hdd-13> <hdd-14> <hdd-15> <hdd-16> <hdd-17> <hdd-18>\
                                      raidz2 <hdd-19> <hdd-20> <hdd-21> <hdd-22> <hdd-23> <hdd-24>\
                                      raidz2 <hdd-25> <hdd-26> <hdd-27> <hdd-28> <hdd-29> <hdd-30>\
                                      raidz2 <hdd-31> <hdd-32> <hdd-33> <hdd-34> <hdd-35> <hdd-36>\
                                      log mirror <ssd-1> <ssd-2>\
                                      cache <ssd-3>
</source>
#Here is an example of how to create a zpool:
#*<source>zpool create -f exj raidz2 sdf sdg sdh sdi sdj sdk\
              raidz2 sdl sdm sdn sdo sdp sdq\
              raidz2 sdr sds sdt sdu sdv sdw\
              raidz2 sdx sdy sdz sdaa sdab sdac\
              raidz2 sdad sdae sdaf sdag sdah sdai\
              raidz2 sdaj sdak sdal sdam sdan sdao\
              log mirror sdc sdd\
              cache sde
</source>
#Once the zpool has been created, double check by using one of these commands
#*<source>
zfs list
zpool status
</source>
#Here comes a weird part. Locating broken disks is hard with the 'sd*' naming convention, since the names can change with every reboot. Therefore, we re-import the pool so that it refers to the disks by their IDs.
#*<source>
zpool export <zpool-name>
zpool import -d /dev/disk/by-id -aN
</source>
#Now, mount the zpool into a directory in the machine. We usually create a new directory called '/export/<zpool-name>'.
#*<source>zfs set mountpoint=/export/<zpool-name> <zpool-name></source>
#For example
#*<source>zfs set mountpoint=/export/exa exa</source>
#Double check that the zpool mounted to the directory by checking the disk space in that directory
#*<source>df -h /export/<zpool-name></source>
#Lastly, reboot the machine to see if the zpool will mount automatically. If it doesn't, once it turns on, run
#*<source>modprobe zfs
zpool import -a
</source>
#Then reboot again and it should remount itself.


== Adding L2ARC Read Cache to a zpool==
# Look for available SSDs in /dev/disk/by-id/
# Choose an available SSD to use for read cache. Then decide which pool you want to put the cache on.
 Syntax: zpool add <zpool name> <cache/log> <path to disk>
 $ sudo zpool add ex6 cache /dev/disk/by-id/ata-INTEL_SSDSC2KG480G7_BTYM72830AV6480BGN

== Exporting the Zpool to the Cluster ==
#Add the rules on where to export the zpool
#*<source>vim /etc/exports</source>
#Then, add this inside. Replace <> with the respective information.
#*<source>
/export/<zpool-name>     10.20.0.0/16(rw,async,fsid=<unused-id>,no_subtree_check) \
                         169.230.26.0/24(rw,async,fsid=<unused-id>,no_subtree_check) \
                         169.230.90.0/24(rw,async,fsid=<unused-id>,no_subtree_check) \
                         169.230.91.0/24(rw,async,fsid=<unused-id>,no_subtree_check) \
                         169.230.92.0/24(rw,async,fsid=<unused-id>,no_subtree_check)
</source>
#For example
#*<source>
/export/exl     10.20.0.0/16(rw,async,fsid=547,no_subtree_check) \
                169.230.26.0/24(rw,async,fsid=547,no_subtree_check) \
                169.230.90.0/24(rw,async,fsid=547,no_subtree_check) \
                169.230.91.0/24(rw,async,fsid=547,no_subtree_check) \
                169.230.92.0/24(rw,async,fsid=547,no_subtree_check)
</source>
#Export the rules
#*<source>exportfs -a</source>
#In another machine, check that the rules are applied
#*<source>showmount -e <machine-name></source>
#Then follow  [[PuppetTricks#Adding_new_mount_point_to_Puppet|this guide]] to add this new machine to the puppet module


== Tuning ZFS options ==
   # stores extended attributes as system attributes to improve performance
   $ zfs set xattr=sa <zfs dataset name>
    
   # Turn on ZFS lz4 compression.  Use this for compressible datasets, such as many files with text
   $ zfs set compression=lz4 <zfs dataset name>
    
   # Turn off access time for improved disk performance (so that the OS doesn't write a new time every time a file is accessed)
   $ zfs set atime=off <zfs dataset name>
    
   NOTE: ZFS performance degrades tremendously when the zpool is over 80% used.  To avoid this, I have set a quota to 80% of the 248TB in qof/nfs-ex9.
   # To set a quota of 200TB on ZFS dataset:
   $ zfs set quota=200T <zfs dataset>
    
   # To remove a quota from a ZFS dataset:
   $ zfs set quota=none <zfs dataset>


By default, ZFS pools/mounts do not have ACLs active.
  # to activate access control lists on a zpool
  $ sudo zfs set acltype=posixacl <pool name>


== Checking Disk Health and Integrity ==
Print a brief summary of all pools:
 zpool list

Print a detailed status of each disk and status of pool:
 zpool status

Clear read errors on disk, if not anything serious:
 zpool clear <pool_name>

Check data integrity, traverses all the data in the pool once and verifies that all blocks can be read:
 zpool scrub <pool_name>

To stop scrub:
 zpool scrub -s <pool_name>

== situation ==
 zpool status
 zfs list
 zfs get all
== mount after reboot ==
 zfs set mountpoint=/export/db2 db2
== when you put in a new disk ==
 fdisk -l
to see what is new
 sudo zpool create -f /srv/db3 raidz2 /dev/sdaa  /dev/sdab  /dev/sdac  /dev/sdad  /dev/sdae  /dev/sdaf  /dev/sdag  /dev/sdah  /dev/sdai  /dev/sdaj  /dev/sdak  /dev/sdal
 sudo zpool add -f /srv/db3 raidz2  /dev/sdam  /dev/sdan  /dev/sdao  /dev/sdap  /dev/sdaq  /dev/sdar  /dev/sdas  /dev/sdat  /dev/sdau  /dev/sdav  /dev/sdaw  /dev/sdax
 zfs unmount db3
 zfs mount db3
= latest =
 zpool create -f db3 raidz2  /dev/sdy /dev/sdz  /dev/sdaa  /dev/sdab  /dev/sdac  /dev/sdad  /dev/sdae  /dev/sdaf  /dev/sdag  /dev/sdah  /dev/sdai  /dev/sdaj
 zpool add -f db3 raidz2 /dev/sdak  /dev/sdal  /dev/sdam  /dev/sdan  /dev/sdao  /dev/sdap  /dev/sdaq  /dev/sdar  /dev/sdas  /dev/sdat  /dev/sdau  /dev/sdav
 zpool create -f db4 raidz2 /dev/sdax /dev/sday /dev/sdaz /dev/sdba  /dev/sdbb  /dev/sdbc  /dev/sdbd  /dev/sdbe  /dev/sdbf  /dev/sdbg  /dev/sdbh  /dev/sdbi
 zpool add -f db4 raidz2 /dev/sdbj /dev/sdbk /dev/sdbl /dev/sdbm /dev/sdbn /dev/sdbo /dev/sdbp /dev/sdbq /dev/sdbr /dev/sdbs /dev/sdbt /dev/sdbu
= Fri Jan 19 2018 =
 zpool create -f db5 raidz2 /dev/sdbw /dev/sdbx /dev/sdby /dev/sdbz /dev/sdca  /dev/sdcb  /dev/sdcc  /dev/sdcd  /dev/sdce  /dev/sdcf  /dev/sdcg  /dev/sdch
 zpool add -f db5 raidz2 /dev/sdci /dev/sdcj /dev/sdck /dev/sdcl /dev/sdcm /dev/sdcn /dev/sdco /dev/sdcp /dev/sdcq /dev/sdcr /dev/sdcs /dev/sdct
 zfs mount db5
= Wed Jan 24 2018 =
On tsadi
 zpool create -f ex1 mirror /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae
 zpool add -f ex1 mirror /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj
 zpool create -f ex2 mirror /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
 zpool add -f ex2 /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo
 zpool create -f ex3 mirror /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt
 zpool add -f ex3 mirror /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy
 zpool create -f ex4 mirror /dev/sdz /dev/sdak /dev/sdal
 zpool add -f ex4 mirror /dev/sdam /dev/sdan /dev/sdao

On tsadi
 zpool create -f ex1 mirror /dev/sdaa /dev/sdab mirror /dev/sdac /dev/sdad mirror /dev/sdae /dev/sdaf mirror /dev/sdag /dev/sdah mirror  /dev/sdai /dev/sdaj
 zpool create -f ex2 mirror  /dev/sdf /dev/sdg mirror /dev/sdh /dev/sdi mirror /dev/sdj /dev/sdk mirror /dev/sdl /dev/sdm mirror /dev/sdn /dev/sdo
 zpool create -f ex3 mirror /dev/sdp /dev/sdq mirror /dev/sdr /dev/sds mirror /dev/sdt /dev/sdu mirror /dev/sdv /dev/sdw mirror /dev/sdx /dev/sdy
 zpool create -f ex4 mirror /dev/sdz /dev/sdak /dev/sdal  mirror /dev/sdam mirror /dev/sdan /dev/sdao

On lamed
 zpool create -f ex5 mirror /dev/sdaa /dev/sdab mirror /dev/sdac /dev/sdad mirror /dev/sdae /dev/sdaf mirror /dev/sdag /dev/sdah mirror  /dev/sdai /dev/sdaj
 zpool create -f ex6 mirror  /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh mirror /dev/sdi /dev/sdj
 zpool create -f ex7 mirror  /dev/sdk /dev/sdl mirror /dev/sdm /dev/sdn mirror /dev/sdo /dev/sdp mirror /dev/sdq /dev/sdr mirror /dev/sds /dev/sdt
 zpool create -f ex8 mirror /dev/sdu /dev/sdv mirror /dev/sdw /dev/sdx mirror /dev/sdy /dev/sdz

 zfs mount
== recovery from accidental pool destruction ==
  umount /mnt /mnt2


NOTE:  If you destroyed your zpool with command 'zpool destroy', you can use the command 'zpool import' to view destroyed pools and recover the pool by doing 'zpool import <zpool name>'.
[[Category:Curator]]
 
= Troubleshooting =
 
== Panic!! The disk is full and I can't remove files! ==
 
If the ZFS pool somehow gets completely filled up, to the point that "rm" no longer works, don't panic. The disk may seem full, but ZFS actually keeps a little bit of space free for internal operations. (ZFS is copy-on-write, so even deleting a file has to write new metadata first, which is why "rm" can fail on a full pool.) The amount of space reserved for this purpose is determined by the "spa_slop_shift" module parameter: the reserve is roughly the pool size divided by 2^spa_slop_shift. You can find the value of this parameter in /sys/module/zfs/parameters/spa_slop_shift
 
If you can't find the value of this parameter, you can calculate it from the ALLOC and FREE columns given by "zpool list", like so:
<nowiki>
spa_slop_shift = floor(log2(ALLOC/FREE))</nowiki>
 
To free up a little bit more space on disk, you can increase the value of this parameter by one. Here's a real-world example of this, from when /nfs/exh filled up.
<nowiki>
[root@nfs-exh ~]# df -h /nfs/exh
Filesystem          Size  Used Avail Use% Mounted on
nfs-exh:/export/exh  349T  349T  0B  100% /mnt/nfs/exh
 
[root@nfs-exh ~]# zpool list
NAME  SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG    CAP  DEDUP    HEALTH  ALTROOT
exh    524T  524T  95.9G        -        -    91%    99%  1.00x    ONLINE  -
 
[root@nfs-exh ~]# cat /sys/module/zfs/parameters/spa_slop_shift
12
 
[root@nfs-exh ~]# echo 13 > /sys/module/zfs/parameters/spa_slop_shift
 
[root@nfs-exh ~]# df -h /nfs/exh
Filesystem          Size  Used Avail Use% Mounted on
nfs-exh:/export/exh  349T  349T  22G 100% /mnt/nfs/exh</nowiki>
 
Once you've gotten a foothold on the disk and made some space, you should revert the spa_slop_shift parameter back to its original value.
 
== zpool destroy : Failed to unmount <device> - device busy ==
 
The help text will advise you to check lsof or fuser, but really what you need to do is stop the nfs service
 
  systemctl stop nfs
  umount /export/ex*
  zpool destroy ...
  zpool create ...
  zpool ...
  ...
  systemctl start nfs
 
== zpool missing after reboot ==
This happens when zfs-import-cache fails to start at boot time.
# check
$ systemctl status zfs-import-cache.service
# enable at boot time
$ systemctl enable zfs-import-cache.service
 
== Example: Fixing degraded pool, replacing faulted disk ==
On Feb 22, 2019, one of nfs-ex9's disks became faulty. 
 
-bash-4.2$ '''zpool status'''
pool: ex9
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
  scan: scrub canceled on Fri Feb 22 11:31:25 2019
config:
          raidz2-5                                      DEGRADED    0    0    0
sdae                                        ONLINE      0    0    0
sdaf                                        ONLINE      0    0    0
sdag                                        ONLINE      0    0    0
sdah                                        FAULTED    18    0    0  too many errors
sdai                                        ONLINE      0    0    0
sdaj                                        ONLINE      0    0    0
 
 
I did the following:
 
-bash-4.2$ '''sudo zpool offline ex9 sdb'''
 
Then I went to the server room and saw that disk 1 still had a red light due to the fault.  I pulled the disk out and inserted a fresh one of the same brand, a Seagate Exos X12.  The server detected the new disk and set the disk name as /dev/sdb, just like the one I had just pulled out.  Finally, I ran the following command.
 
-bash-4.2$ '''sudo zpool replace ex9 /dev/sdah'''
-bash-4.2$ '''zpool status'''
  pool: ex9
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Mar 19 14:06:33 2019
1.37G scanned out of 51.8T at 127M/s, 118h33m to go
37.9M resilvered, 0.00% done
.
.
.
  raidz2-5                                      DEGRADED    0    0    0
    sdae                                        ONLINE      0    0    0
    sdaf                                        ONLINE      0    0    0
    sdag                                        ONLINE      0    0    0
    replacing-3                                DEGRADED    0    0    0
      old                                      FAULTED    18    0    0  too many errors
      sdah                                      ONLINE      0    0    0  (resilvering)
    sdai                                        ONLINE      0    0    0
    sdaj                                        ONLINE      0    0    0
 
Resilvering is the process of a disk being rebuilt from its parity group.  Once it is finished, you should be good to go again.
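
With dozens of disks in a pool, the one faulted line is easy to miss in the status output. The following is a minimal sketch, not part of ZFS: a hypothetical helper named list_unhealthy_devices that reads 'zpool status' text on standard input and prints only the vdevs and disks whose state is not ONLINE. It assumes the column layout shown in the example above (name, state, then the READ/WRITE/CKSUM counters).

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not a ZFS command).
# Reads "zpool status" output on stdin and prints "<name> <state>" for every
# device row whose state column is an unhealthy state.
# Rows need at least 5 fields (name, state, READ, WRITE, CKSUM), which skips
# the "state: DEGRADED" summary line at the top of the output.
list_unhealthy_devices() {
    awk 'NF >= 5 && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
        print $1, $2
    }'
}
```

Usage: zpool status ex9 | list_unhealthy_devices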
=== Replace disk by disk IDs ===
 
For zayin/nfs-exa, some of the disks are named by id instead of the vdev-id. It is recommended to use id instead of vdev-id as vdev-id can change after reboot.
raidz2-4                  DEGRADED    0    0    0
scsi-35000c500a7da67cb  ONLINE      0    0    0
scsi-35000c500a7daa34f  ONLINE      0    0    0
scsi-35000c500a7db39db  FAULTED      0    0    0  too many errors
scsi-35000c500a7da6b97  ONLINE      0    0    0
scsi-35000c500a7da265b  ONLINE      0    0    0
scsi-35000c500a7da740f  ONLINE      0    0    0
 
In this case, we have to determine the id of the newly inserted disk with dmesg. Look for log lines that mention a new disk
$ dmesg -T | tail
[7819794.080935] scsi 0:0:40:0: Power-on or device reset occurred
[7819794.099111] sd 0:0:40:0: Attached scsi generic sg8 type 0
[7819794.100978] sd 0:0:40:0: [sdi] Spinning up disk...
[7819795.103622] ......................ready
[7819817.123255] sd 0:0:40:0: [sdi] 31251759104 512-byte logical blocks: (16.0 TB/14.5 TiB)
[7819817.123263] sd 0:0:40:0: [sdi] 4096-byte physical blocks
[7819817.128478] sd 0:0:40:0: [sdi] Write Protect is off
[7819817.128486] sd 0:0:40:0: [sdi] Mode Sense: df 00 10 08
[7819817.130308] sd 0:0:40:0: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA
[7819817.165231] sd 0:0:40:0: [sdi] Attached SCSI disk
 
Check that the disk is properly recognized. The new disk should be at the bottom of the list and should not have any partitions
$ fdisk -l
 
$ cd /dev/disk/by-id
$ ls -ltr | grep sdi
lrwxrwxrwx. 1 root root  9 Feb  7 13:29 scsi-35000c500d7947833 -> ../../sdi
 
Once the id is determined, start the resilvering process
$ zpool replace exa  scsi-35000c500a7db39db scsi-35000c500d7947833
# scsi-35000c500a7db39db is the id of the failed disk obtained from zpool status
# scsi-35000c500d7947833 is the id of the new replacement disk determined above
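
The 'ls -ltr | grep sdi' lookup above can also match partition links or unrelated names that merely contain the string. As a hedged sketch, the hypothetical helper below (find_by_id_names, not a standard tool) prints only the by-id links whose target is exactly the given kernel device. The optional second argument is an assumption added so the directory can be overridden; it defaults to /dev/disk/by-id.

```shell
#!/bin/sh
# Hypothetical helper: print every link name in a by-id directory that points
# at the given kernel device name (for example "sdi").
# $1 = kernel device name, $2 = directory to search (default /dev/disk/by-id)
find_by_id_names() {
    dev=$1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Only symbolic links are of interest.
        [ -L "$link" ] || continue
        target=$(readlink "$link")
        # Compare the last path component, so "../../sdi" matches "sdi"
        # but "../../sdi1" (a partition) does not.
        if [ "${target##*/}" = "$dev" ]; then
            printf '%s\n' "${link##*/}"
        fi
    done
}
```

Usage: find_by_id_names sdi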
 
== Disk LED light ==
=== Identify failed disk by LED light ===
By disk_id
# turn light off
$ ledctl locate_off=/dev/disk/by-id/<disk_id>
# turn light on
$ ledctl locate=/dev/disk/by-id/<disk_id>
Example
$ ledctl locate_off=/dev/disk/by-id/scsi-35000c500a7d8137f
$ ledctl locate=/dev/disk/by-id/scsi-35000c500a7d8137f
$ ledctl locate=/dev/disk/by-partlabel/zfs-c34473d19032c002
By vdev
# turn light off
$ ledctl locate_off=/dev/<vdev>
# turn light on
$ ledctl locate=/dev/<vdev>
Example
$ ledctl locate_off=/dev/sdaf
$ ledctl locate=/dev/sdaf
 
=== Reset light from LED light glitch ===
For qof/nfs-ex9, we had an issue with the disk LED for /dev/sdah still showing up red despite the resilvering occurring.  To return the disk LED to a normal status, issue the following command:
$ '''sudo ledctl normal=/dev/<disk vdev id>'''
Example: $ '''sudo ledctl normal=/dev/sdah'''
 
or, for zayin/nfs-exa, where disks are identified by id:
$ '''sudo ledctl normal=/dev/disk/by-id/<disk id>'''
Example: $ '''sudo ledctl normal=/dev/disk/by-id/scsi-35000c500a7db39db'''
 
== Check if pool is compressed ==
<source>
zfs get all | grep compression
</source>
 
== Check User Usage in ZFS ==
<source>
zfs userspace <pool name>
</source>
 
 
 
 
 
[[Category:Curator]][[Category:Sysadmin]]

Latest revision as of 01:49, 20 November 2024

Introduction

ZFS (Zettabyte File System) is a combined file system and volume manager. We use it for redundant data storage and software RAID setups.

Installation

  1. Install CentOS 7. Make sure it has access to the internet.
  2. Create a Foreman entry at https://foreman.ucsf.bkslab.org/hosts
  3. Become root and run these commands. These will install all the necessary packages plus ZFS and also enable all the necessary firewall rules.
    • #!/bin/bash
      
      yum install epel-release -y
      yum update -y
      yum install puppet -y
      yum install sssd -y
      yum install nss-pam-ldapd -y
      yum install oddjob-mkhomedir -y
      systemctl start oddjobd
      systemctl enable oddjobd
      puppet agent -t
      yum install https://zfsonlinux.org/epel/zfs-release.el7_9.noarch.rpm -y
      yum install zfs -y
      
      systemctl start nfs
      systemctl enable nfs
      systemctl start zfs.target
      systemctl enable zfs.target
      systemctl start zfs-import-cache.service
      systemctl enable zfs-import-cache.service
      firewall-cmd --permanent --add-service=nfs
      firewall-cmd --permanent --add-service=mountd
      firewall-cmd --permanent --add-service=rpc-bind
      firewall-cmd --reload

How to Create the Zpool

  1. First determine which drives are the SSDs, because they will be the log and cache of the zfs filesystem. Use the ones that you didn't use for the OS.
    • lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE
    • 2 x 240GB SSDs will be configured to be mirrors of each other for logging and 1 x 480GB SSD will be for cache.
  2. Let's now create the pool:
    • zpool create <options> <name-of-pool> raidz2 <hdd-1> <hdd-2> <hdd-3> <hdd-4> <hdd-5> <hdd-6>\
                                            raidz2 <hdd-7> <hdd-8> <hdd-9> <hdd-10> <hdd-11> <hdd-12>\
                                            raidz2 <hdd-13> <hdd-14> <hdd-15> <hdd-16> <hdd-17> <hdd-18>\
                                            raidz2 <hdd-19> <hdd-20> <hdd-21> <hdd-22> <hdd-23> <hdd-24>\
                                            raidz2 <hdd-25> <hdd-26> <hdd-27> <hdd-28> <hdd-29> <hdd-30>\
                                            raidz2 <hdd-31> <hdd-32> <hdd-33> <hdd-34> <hdd-35> <hdd-36>\
                                            log mirror <ssd-1> <ssd-2>\
                                            cache <ssd-3>
  3. Here is an example of how to create a zpool:
    • zpool create -f exj raidz2 sdf sdg sdh sdi sdj sdk\
      		              raidz2 sdl sdm sdn sdo sdp sdq\
      		              raidz2 sdr sds sdt sdu sdv sdw\
      		              raidz2 sdx sdy sdz sdaa sdab sdac\
      		              raidz2 sdad sdae sdaf sdag sdah sdai\
      		              raidz2 sdaj sdak sdal sdam sdan sdao\
      		              log mirror sdc sdd\
      		              cache sde
  4. Once the zpool has been created, double check by using one of these commands
    • zfs list
      zpool status
  5. Here comes a weird part. Locating broken disks is hard with the 'sd*' naming convention, since the names can change with every reboot. Therefore, we re-import the pool so that it refers to the disks by their IDs.
    • zpool export <zpool-name>
      zpool import -d /dev/disk/by-id -aN
  6. Now, mount the zpool into a directory in the machine. We usually create a new directory called '/export/<zpool-name>'.
    • zfs set mountpoint=/export/<zpool-name> <zpool-name>
  7. For example
    • zfs set mountpoint=/export/exa exa
  8. Double check that the zpool mounted to the directory by checking the disk space in that directory
    • df -h /export/<zpool-name>
  9. Lastly, reboot the machine to see if the zpool will mount automatically. If it doesn't, once it turns on, run
    • modprobe zfs
      zpool import -a
  10. Then reboot again and it should remount itself.
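
The create command in steps 2 and 3 follows a fixed pattern: the hard drives are split into groups of six, and each group is prefixed with raidz2. A minimal sketch of that pattern, assuming a hypothetical helper named build_zpool_create (not part of ZFS) that only prints the command so it can be reviewed before it is run by hand. The log and cache devices are left out of the sketch and would still be appended manually.

```shell
#!/bin/sh
# Hypothetical helper: print a "zpool create" command that groups the given
# disks into raidz2 vdevs of WIDTH disks each. Nothing is executed.
# $1 = pool name, $2 = disks per raidz2 vdev, remaining args = disks
build_zpool_create() {
    pool=$1
    width=$2
    shift 2
    if [ "$width" -lt 1 ] || [ $# -eq 0 ] || [ $(( $# % width )) -ne 0 ]; then
        echo "need a non-zero disk count that is a multiple of $width, got $#" >&2
        return 1
    fi
    cmd="zpool create $pool"
    i=0
    for disk in "$@"; do
        # Start a new raidz2 vdev every WIDTH disks.
        if [ $(( i % width )) -eq 0 ]; then
            cmd="$cmd raidz2"
        fi
        cmd="$cmd $disk"
        i=$(( i + 1 ))
    done
    printf '%s\n' "$cmd"
}
```

Usage: build_zpool_create exj 6 sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq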

Exporting the Zpool to the Cluster

  1. Add the rules on where to export the zpool
    • vim /etc/exports
  2. Then, add this inside. Replace <> with the respective information.
    • /export/<zpool-name>     10.20.0.0/16(rw,async,fsid=<unused-id>,no_subtree_check) \
                               169.230.26.0/24(rw,async,fsid=<unused-id>,no_subtree_check) \
                               169.230.90.0/24(rw,async,fsid=<unused-id>,no_subtree_check) \
                               169.230.91.0/24(rw,async,fsid=<unused-id>,no_subtree_check) \
                               169.230.92.0/24(rw,async,fsid=<unused-id>,no_subtree_check)
  3. For example
    • /export/exl     10.20.0.0/16(rw,async,fsid=547,no_subtree_check) \
                      169.230.26.0/24(rw,async,fsid=547,no_subtree_check) \
                      169.230.90.0/24(rw,async,fsid=547,no_subtree_check) \
                      169.230.91.0/24(rw,async,fsid=547,no_subtree_check) \
                      169.230.92.0/24(rw,async,fsid=547,no_subtree_check)
  4. Export the rules
    • exportfs -a
  5. In another machine, check that the rules are applied
    • showmount -e <machine-name>
  6. Then follow this guide to add this new machine to the puppet module
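
Every subnet in the /etc/exports entry above gets the same option string, and only the path and fsid change from pool to pool. A small sketch that generates such an entry, assuming a hypothetical helper named make_exports_entry. It writes the entry on a single line, with clients separated by spaces, instead of using backslash continuations.

```shell
#!/bin/sh
# Hypothetical helper: print one /etc/exports entry.
# $1 = exported path, $2 = fsid, remaining args = client subnets
make_exports_entry() {
    path=$1
    fsid=$2
    shift 2
    # Same options as in the examples above, applied to every subnet.
    opts="rw,async,fsid=${fsid},no_subtree_check"
    line=$path
    for net in "$@"; do
        line="$line ${net}(${opts})"
    done
    printf '%s\n' "$line"
}
```

Usage: make_exports_entry /export/exl 547 10.20.0.0/16 169.230.26.0/24 169.230.90.0/24 169.230.91.0/24 169.230.92.0/24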

Tuning ZFS options

 # stores extended attributes as system attributes to improve performance
 $ zfs set xattr=sa <zfs dataset name>
 
 # Turn on ZFS lz4 compression.  Use this for compressible dataset such as many files with text 
 $ zfs set compression=lz4 <zfs dataset name> 
 
 # Turn off access time for improved disk performance (so that the OS doesn't write a new time every time a file is accessed)
 $ zfs set atime=off <zfs dataset name>
 NOTE: ZFS performance degrades tremendously when the zpool is over 80% used.  To avoid this, I have set a quota to 80% of the 248TB in qof/nfs-ex9.
 # To set a quota of 200TB on ZFS dataset:
 $ zfs set quota=200T <zfs dataset>
 # To remove a quota from a ZFS dataset:
 $ zfs set quota=none <zfs dataset>
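
The 80% rule from the note above is simple integer arithmetic. A sketch, assuming a hypothetical helper named quota_80_percent that takes the pool size in whole terabytes and rounds down; for the 248TB pool it gives 198T, slightly below the rounded 200T used above.

```shell
#!/bin/sh
# Hypothetical helper: print a quota value equal to 80% of a pool size that is
# given in whole terabytes, rounded down, in the form accepted by "zfs set quota=".
quota_80_percent() {
    size_tb=$1
    printf '%sT\n' $(( size_tb * 80 / 100 ))
}
```

Usage: zfs set quota="$(quota_80_percent 248)" <zfs dataset>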

By default, ZFS pools/mounts do not have ACLs active.

 # to activate access control lists on a zpool
 $ sudo zfs set acltype=posixacl <pool name>

Checking Disk Health and Integrity

Print a brief summary of all pools:

zpool list

Print a detailed status of each disk and status of pool:

zpool status

Clear read errors on disk, if not anything serious:

zpool clear <pool_name>

Check data integrity, traverses all the data in the pool once and verifies that all blocks can be read:

zpool scrub <pool_name>

To stop scrub:

zpool scrub -s <pool_name>

recovery from accidental pool destruction

umount /mnt /mnt2
mdadm -S /dev/md125 /dev/md126 /dev/md127
sfdisk -d /dev/sda > sda.sfdisk
sfdisk -d /dev/sdb > sdb.sfdisk
sfdisk /dev/sda < sdb.sfdisk
mdadm --detail /dev/md127
mdadm -A -R /dev/md127 /dev/sdb2 /dev/sda2
mdadm /dev/md127 -a /dev/sda2
mdadm --detail /dev/md127
echo check > /sys/block/md127/md/sync_action
cat /proc/mdstat
mdadm --detail /dev/md126
mdadm -A -R /dev/md126 /dev/sdb3 /dev/sda3
mdadm /dev/md126 -a /dev/sda3
mdadm --detail /dev/md126
echo check > /sys/block/md126/md/sync_action
cat /proc/mdstat

Also switched the bios to boot from hd2 instead of hd1 (or something)

  • Recreate zpool with correct drives
  • Point an instance of photorec at each of the wiped drives, set to recover files of the following types: .gz, .solv (custom definition)


NOTE: If you destroyed your zpool with command 'zpool destroy', you can use the command 'zpool import' to view destroyed pools and recover the pool by doing 'zpool import <zpool name>'.

Troubleshooting

Panic!! The disk is full and I can't remove files!

If the ZFS pool somehow gets completely filled up, to the point that "rm" no longer works, don't panic. The disk may seem full, but ZFS actually keeps a little bit of space free for internal operations. (ZFS is copy-on-write, so even deleting a file has to write new metadata first, which is why "rm" can fail on a full pool.) The amount of space reserved for this purpose is determined by the "spa_slop_shift" module parameter: the reserve is roughly the pool size divided by 2^spa_slop_shift. You can find the value of this parameter in /sys/module/zfs/parameters/spa_slop_shift

If you can't find the value of this parameter, you can calculate it from the ALLOC and FREE columns given by "zpool list", like so:

spa_slop_shift = floor(log2(ALLOC/FREE))
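
The formula can be evaluated with integer shell arithmetic. This is a rough sketch under stated assumptions: the function name `estimate_slop_shift` is hypothetical, both arguments are byte counts (for instance from `zpool list -Hp -o alloc,free <pool_name>`), and ALLOC is at least as large as FREE. The byte counts in the sample call are illustrative conversions of 524T and 95.9G, assuming binary units.

```shell
# estimate_slop_shift ALLOC_BYTES FREE_BYTES
# Prints floor(log2(ALLOC/FREE)) using integer arithmetic only.
estimate_slop_shift() {
    ess_alloc=$1
    ess_free=$2
    if [ "$ess_free" -le 0 ]; then
        echo "FREE must be greater than zero" >&2
        return 1
    fi
    ess_ratio=$(( ess_alloc / ess_free ))
    ess_shift=0
    # Halve the ratio until it drops below 2; the number of
    # halvings equals floor(log2(ratio)) for any ratio >= 1.
    while [ "$ess_ratio" -ge 2 ]; do
        ess_ratio=$(( ess_ratio / 2 ))
        ess_shift=$(( ess_shift + 1 ))
    done
    echo "$ess_shift"
}

# Illustrative values: 524 TiB allocated, about 95.9 GiB free.
estimate_slop_shift 576144092954624 102971840922   # prints 12
```

The result is only an estimate. It is meaningful when the pool is full, so that FREE is essentially the reserved space.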

To free up a little bit more space on disk, you can increase the value of this parameter by one, which roughly halves the reserve. Here's a real-world example of this, from when /nfs/exh filled up.

[root@nfs-exh ~]# df -h /nfs/exh
Filesystem           Size  Used Avail Use% Mounted on
nfs-exh:/export/exh  349T  349T   0B  100% /mnt/nfs/exh

[root@nfs-exh ~]# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
exh    524T   524T  95.9G        -         -    91%    99%  1.00x    ONLINE  -

[root@nfs-exh ~]# cat /sys/module/zfs/parameters/spa_slop_shift
12

[root@nfs-exh ~]# echo 13 > /sys/module/zfs/parameters/spa_slop_shift

[root@nfs-exh ~]# df -h /nfs/exh
Filesystem           Size  Used Avail Use% Mounted on
nfs-exh:/export/exh  349T  349T   22G 100% /mnt/nfs/exh

Once you've gotten a foothold on the disk and made some space, you should revert the spa_slop_shift parameter back to its original value.
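
To make the revert harder to forget, the raise-and-revert sequence above can be wrapped in two small helpers that remember the original value. This is only a sketch: the function names and the backup location `/root/spa_slop_shift.orig` are hypothetical choices, not something the original procedure used. Both paths can be overridden through the `SLOP_PARAM` and `SLOP_BACKUP` variables, which also makes it possible to try the logic on an ordinary file.

```shell
# Module parameter file and a (hypothetical) place to remember its original value.
SLOP_PARAM=${SLOP_PARAM:-/sys/module/zfs/parameters/spa_slop_shift}
SLOP_BACKUP=${SLOP_BACKUP:-/root/spa_slop_shift.orig}

# Raise spa_slop_shift by one, saving the original value on first use.
slop_raise() {
    sr_current=$(cat "$SLOP_PARAM") || return 1
    # Keep the first saved value so repeated raises still revert to the true original.
    [ -f "$SLOP_BACKUP" ] || echo "$sr_current" > "$SLOP_BACKUP"
    echo $(( sr_current + 1 )) > "$SLOP_PARAM"
}

# Restore the saved value and remove the backup file.
slop_revert() {
    if [ ! -f "$SLOP_BACKUP" ]; then
        echo "no saved spa_slop_shift value found" >&2
        return 1
    fi
    cat "$SLOP_BACKUP" > "$SLOP_PARAM" && rm -f "$SLOP_BACKUP"
}
```

Run `slop_raise` as root, delete enough files to get back below full, then run `slop_revert`.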

zpool destroy : Failed to unmount <device> - device busy

The help text will advise you to check lsof or fuser, but really what you need to do is stop the NFS service:

 systemctl stop nfs
 umount /export/ex*
 zpool destroy ...
 zpool create ...
 zpool ...
 ...
 systemctl start nfs

zpool missing after reboot

This happens when zfs-import-cache.service fails to start at boot time.

# check
$ systemctl status zfs-import-cache.service
# enable at boot time
$ systemctl enable zfs-import-cache.service

Example: Fixing degraded pool, replacing faulted disk

On Feb 22, 2019, one of nfs-ex9's disks became faulty.

-bash-4.2$ zpool status
pool: ex9
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: scrub canceled on Fri Feb 22 11:31:25 2019
config:
  raidz2-5                                      DEGRADED     0     0     0
    sdae                                        ONLINE       0     0     0
    sdaf                                        ONLINE       0     0     0
    sdag                                        ONLINE       0     0     0
    sdah                                        FAULTED     18     0     0  too many errors
    sdai                                        ONLINE       0     0     0
    sdaj                                        ONLINE       0     0     0
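
In a long status listing, the member that needs attention can be picked out with a small filter. This is a minimal sketch and not part of the original incident notes: the function name `list_bad_devices` is hypothetical, and the set of states it matches (FAULTED, UNAVAIL, OFFLINE, REMOVED) is an assumption about which states are worth flagging.

```shell
# list_bad_devices: reads `zpool status` output on stdin and prints the
# name of every device whose STATE column shows a failed or missing disk.
# DEGRADED is deliberately left out so that only leaf devices are listed,
# not the raidz group or pool that contains them.
list_bad_devices() {
    awk '$2 ~ /^(FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ { print $1 }'
}
```

Typical use would be `zpool status ex9 | list_bad_devices`, which for the output above prints sdah.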


I did the following:

-bash-4.2$ sudo zpool offline ex9 sdah

Then I went to the server room and saw that disk 1 still had a red light due to the fault. I pulled the disk out and inserted a fresh one of the same brand, a Seagate Exos X12. The server detected the new disk and gave it the name /dev/sdah, the same as the one I had just pulled out. Finally, I ran the following command:

-bash-4.2$ sudo zpool replace ex9 /dev/sdah
-bash-4.2$ zpool status
 pool: ex9
state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Tue Mar 19 14:06:33 2019
1.37G scanned out of 51.8T at 127M/s, 118h33m to go
37.9M resilvered, 0.00% done
.
.
.
	  raidz2-5                                      DEGRADED     0     0     0
   sdae                                        ONLINE       0     0     0
   sdaf                                        ONLINE       0     0     0
   sdag                                        ONLINE       0     0     0
   replacing-3                                 DEGRADED     0     0     0
     old                                       FAULTED     18     0     0  too many errors
     sdah                                      ONLINE       0     0     0  (resilvering)
   sdai                                        ONLINE       0     0     0
   sdaj                                        ONLINE       0     0     0

Resilvering is the process of a disk being rebuilt from its parity group. Once it is finished, you should be good to go again.

Replace disk by disk ids

For zayin/nfs-exa, some of the disks are named by id instead of vdev-id (the /dev/sdX name). Using the id is recommended because the vdev-id can change after a reboot.

raidz2-4                  DEGRADED     0     0     0
scsi-35000c500a7da67cb  ONLINE       0     0     0
scsi-35000c500a7daa34f  ONLINE       0     0     0
scsi-35000c500a7db39db  FAULTED      0     0     0  too many errors
scsi-35000c500a7da6b97  ONLINE       0     0     0
scsi-35000c500a7da265b  ONLINE       0     0     0
scsi-35000c500a7da740f  ONLINE       0     0     0

In this case, we have to determine the id of the newly inserted disk with dmesg. Look for log lines that mention a new disk:

$ dmesg -T | tail
[7819794.080935] scsi 0:0:40:0: Power-on or device reset occurred
[7819794.099111] sd 0:0:40:0: Attached scsi generic sg8 type 0
[7819794.100978] sd 0:0:40:0: [sdi] Spinning up disk...
[7819795.103622] ......................ready
[7819817.123255] sd 0:0:40:0: [sdi] 31251759104 512-byte logical blocks: (16.0 TB/14.5 TiB)
[7819817.123263] sd 0:0:40:0: [sdi] 4096-byte physical blocks
[7819817.128478] sd 0:0:40:0: [sdi] Write Protect is off
[7819817.128486] sd 0:0:40:0: [sdi] Mode Sense: df 00 10 08
[7819817.130308] sd 0:0:40:0: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA
[7819817.165231] sd 0:0:40:0: [sdi] Attached SCSI disk

Check that the disk is properly recognized. The new disk should be at the bottom and should not have any partitions:

$ fdisk -l
$ cd /dev/disk/by-id
$ ls -ltr | grep sdi
lrwxrwxrwx. 1 root root  9 Feb  7 13:29 scsi-35000c500d7947833 -> ../../sdi
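
The same lookup can be done without reading the listing by eye. Below is a small sketch: the function name `by_id_for` is hypothetical, and it assumes the replacement disk shows up under a `scsi-*` link as in the listing above. The optional second argument only exists so the function can be pointed at a different directory.

```shell
# by_id_for KERNEL_NAME [DIR]
# Prints the scsi-* link name(s) in DIR (default /dev/disk/by-id)
# whose symlink target is the given kernel device name, e.g. sdi.
by_id_for() {
    bif_dev=$1
    bif_dir=${2:-/dev/disk/by-id}
    for bif_link in "$bif_dir"/scsi-*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$bif_link" ] || continue
        if [ "$(basename "$(readlink "$bif_link")")" = "$bif_dev" ]; then
            basename "$bif_link"
        fi
    done
}
```

For the disk above, `by_id_for sdi` would print scsi-35000c500d7947833. Partition links such as `-part1` point at sdi1 rather than sdi, so they are not matched.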

Once the name is determined, start the resilvering process:

$ zpool replace exa scsi-35000c500a7db39db scsi-35000c500d7947833
# scsi-35000c500a7db39db is the id of the failed disk obtained from zpool status
# scsi-35000c500d7947833 is the id of the new replacement disk determined above

Disk LED light

Identify failed disk by LED light

By disk_id

# turn light off
$ ledctl locate_off=/dev/disk/by-id/<disk_id> 
# turn light on
$ ledctl locate=/dev/disk/by-id/<disk_id>
Example
$ ledctl locate_off=/dev/disk/by-id/scsi-35000c500a7d8137f 
$ ledctl locate=/dev/disk/by-id/scsi-35000c500a7d8137f 
$ ledctl locate=/dev/disk/by-partlabel/zfs-c34473d19032c002

By vdev

# turn light off
$ ledctl locate_off=/dev/<vdev>
# turn light on
$ ledctl locate=/dev/<vdev>
Example 
$ ledctl locate_off=/dev/sdaf
$ ledctl locate=/dev/sdaf

Reset light from LED light glitch

For qof/nfs-ex9, we had an issue with the disk LED for /dev/sdah still showing up red despite the resilvering occurring. To return the disk LED to a normal status, issue the following command:

$ sudo ledctl normal=/dev/<disk vdev id>
Example: $ sudo ledctl normal=/dev/sdah

Or, for zayin/nfs-exa, where disks are identified by id:

$ sudo ledctl normal=/dev/disk/by-id/<disk id>
Example: $ sudo ledctl normal=/dev/disk/by-id/scsi-35000c500a7db39db

Check if pool is compressed

zfs get all | grep compression

Check User Usage in ZFS

zfs userspace <pool name>
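
To see the heaviest users first, the usage list can be sorted and trimmed. This is a rough sketch with a hypothetical function name, `top_users`. It assumes tab-separated "name, bytes used" input, such as `zfs userspace -H -p -o name,used <pool name>` should produce.

```shell
# top_users [N]
# Reads "name<TAB>bytes" lines on stdin and prints the N largest
# consumers (default 5) with usage converted to GiB.
top_users() {
    tu_n=${1:-5}
    tu_tab=$(printf '\t')
    sort -t "$tu_tab" -k 2,2nr | head -n "$tu_n" |
        awk -F '\t' '{ printf "%s\t%.1f GiB\n", $1, $2 / 1073741824 }'
}
```

Example use: `zfs userspace -H -p -o name,used <pool name> | top_users 10`.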