ZFS
Introduction
ZFS (Zettabyte File System) is a combined file system and volume manager. We use it for redundant data storage, and its raidz levels are one of the best alternatives to traditional RAID setups.
Installation
- Install CentOS 7. Make sure it has access to internet.
- Create a Foreman Entry here
- Become root and run these commands. These will install all the necessary packages plus ZFS and also enable all the firewall rules necessary.
#!/bin/bash
yum install epel-release -y
yum update -y
yum install puppet -y
yum install sssd -y
yum install nss-pam-ldapd -y
yum install oddjob-mkhomedir -y
systemctl start oddjobd
systemctl enable oddjobd
puppet agent -t
yum install https://zfsonlinux.org/epel/zfs-release.el7_9.noarch.rpm -y
yum install zfs -y
systemctl start nfs
systemctl enable nfs
systemctl start zfs.target
systemctl enable zfs.target
systemctl start zfs-import-cache.service
systemctl enable zfs-import-cache.service
firewall-cmd --permanent --add-service=nfs
firewall-cmd --permanent --add-service=mountd
firewall-cmd --permanent --add-service=rpc-bind
firewall-cmd --reload
How to Create the Zpool
- First determine which drives are the SSDs, because they will become the log and cache devices of the ZFS pool. Use the SSDs that are not already used for the OS.
lsblk -o NAME,SIZE,SERIAL,LABEL,FSTYPE
- 2 x 240GB SSDs will be configured to be mirrors of each other for logging and 1 x 480GB SSD will be for cache.
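As a rough aid for picking out the SSDs, the sketch below filters lsblk output on its ROTA column, which is 0 for non-rotational devices. The helper name list_ssds is hypothetical, and the approach assumes the kernel reports ROTA correctly, which is not always true behind some RAID controllers or HBAs, so treat the result as a hint and confirm against the model and serial numbers.

```shell
# Hypothetical helper: reads "NAME ROTA SIZE" rows on stdin and prints
# the name and size of every device whose ROTA flag is 0 (non-rotational).
list_ssds() {
    awk '$2 == 0 { print $1, $3 }'
}

# Possible usage on the server (whole disks only, no header line):
#   lsblk -d -n -o NAME,ROTA,SIZE | list_ssds
```

Disks that are already used for the OS will also show up in this list and still have to be excluded by hand.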
- Let's now create the pool:
zpool create <options> <name-of-pool> raidz2 <hdd-1> <hdd-2> <hdd-3> <hdd-4> <hdd-5> <hdd-6> \
    raidz2 <hdd-7> <hdd-8> <hdd-9> <hdd-10> <hdd-11> <hdd-12> \
    raidz2 <hdd-13> <hdd-14> <hdd-15> <hdd-16> <hdd-17> <hdd-18> \
    raidz2 <hdd-19> <hdd-20> <hdd-21> <hdd-22> <hdd-23> <hdd-24> \
    raidz2 <hdd-25> <hdd-26> <hdd-27> <hdd-28> <hdd-29> <hdd-30> \
    raidz2 <hdd-31> <hdd-32> <hdd-33> <hdd-34> <hdd-35> <hdd-36> \
    log mirror <ssd-1> <ssd-2> cache <ssd-3>
- Here is an example of how to create a zpool:
zpool create -f exj raidz2 sdf sdg sdh sdi sdj sdk \
    raidz2 sdl sdm sdn sdo sdp sdq \
    raidz2 sdr sds sdt sdu sdv sdw \
    raidz2 sdx sdy sdz sdaa sdab sdac \
    raidz2 sdad sdae sdaf sdag sdah sdai \
    raidz2 sdaj sdak sdal sdam sdan sdao \
    log mirror sdc sdd \
    cache sde
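Typing 36 device names by hand is error-prone, so here is a minimal sketch of a helper that only prints the data-vdev part of such a command, grouping the disks into raidz2 vdevs of six as in the layout above. The function name print_zpool_create is hypothetical. It does not run zpool itself and does not add the log and cache devices, which still have to be appended by hand.

```shell
# Hypothetical helper: print (do not run) a zpool create command that
# groups the given disks into raidz2 vdevs of six disks each.
# Usage: print_zpool_create <pool> <disk>...
print_zpool_create() {
    pool=$1
    shift
    # Refuse disk counts that do not fill whole six-disk vdevs.
    if [ $# -eq 0 ] || [ $(( $# % 6 )) -ne 0 ]; then
        echo "disk count must be a non-zero multiple of 6" >&2
        return 1
    fi
    cmd="zpool create $pool"
    i=0
    for disk in "$@"; do
        # Start a new raidz2 vdev every six disks.
        if [ $(( i % 6 )) -eq 0 ]; then
            cmd="$cmd raidz2"
        fi
        cmd="$cmd $disk"
        i=$(( i + 1 ))
    done
    printf '%s\n' "$cmd"
}

# Possible usage: review the printed command before running it yourself.
#   print_zpool_create exj sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq
```

Printing instead of executing keeps the destructive step manual, so the command can be reviewed first.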
- Once the zpool has been created, double check by using one of these commands:
zfs list
zpool status
- Here comes a weird part. Locating broken disks is hard with the 'sd*' naming convention, since those names can change with every reboot. Therefore, we re-import the pool so that it refers to its disks by their ids.
zpool export <zpool-name>
zpool import -d /dev/disk/by-id -aN
- Now, mount the zpool into a directory in the machine. We usually create a new directory called '/export/<zpool-name>'.
zfs set mountpoint=/export/<zpool-name> <zpool-name>
- For example
zfs set mountpoint=/export/exa exa
- Double check that the zpool mounted to the directory by checking the disk space in that directory
df -h /export/<zpool-name>
- Lastly, reboot the machine to see if the zpool mounts automatically. If it doesn't, run the following once the machine is back up:
modprobe zfs
zpool import -a
- Then reboot again and it should remount itself.
Exporting the Zpool to the Cluster
- Add the rules on where to export the zpool
vim /etc/exports
- Then, add this inside. Replace <> with the respective information.
/export/<zpool-name> 10.20.0.0/16(rw,async,fsid=<unused-id>,no_subtree_check) \
    169.230.26.0/24(rw,async,fsid=<unused-id>,no_subtree_check) \
    169.230.90.0/24(rw,async,fsid=<unused-id>,no_subtree_check) \
    169.230.91.0/24(rw,async,fsid=<unused-id>,no_subtree_check) \
    169.230.92.0/24(rw,async,fsid=<unused-id>,no_subtree_check)
- For example
/export/exl 10.20.0.0/16(rw,async,fsid=547,no_subtree_check) \
    169.230.26.0/24(rw,async,fsid=547,no_subtree_check) \
    169.230.90.0/24(rw,async,fsid=547,no_subtree_check) \
    169.230.91.0/24(rw,async,fsid=547,no_subtree_check) \
    169.230.92.0/24(rw,async,fsid=547,no_subtree_check)
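Because the same option string is repeated for every network, the entry can also be generated. The sketch below uses a hypothetical helper, print_export_line, that takes the export path, the fsid and a list of networks and prints the entry on a single line. The options are the ones used above; nothing is written to /etc/exports.

```shell
# Hypothetical helper: print one /etc/exports entry with identical options
# for every client network.
# Usage: print_export_line <path> <fsid> <network>...
print_export_line() {
    path=$1
    fsid=$2
    shift 2
    line=$path
    for net in "$@"; do
        line="$line ${net}(rw,async,fsid=${fsid},no_subtree_check)"
    done
    printf '%s\n' "$line"
}

# Possible usage: inspect the output, then paste it into /etc/exports.
#   print_export_line /export/exl 547 10.20.0.0/16 169.230.26.0/24
```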
- Export the rules
exportfs -a
- In another machine, check that the rules are applied
showmount -e <machine-name>
- Then follow this guide to add this new machine to the puppet module
Adding L2ARC Read Cache to a zpool
# Look for available SSDs in /dev/disk/by-id/
# Choose an available SSD to use for read cache. Then decide which pool you want to put the cache on.
# Syntax: zpool add <zpool-name> <cache/log> <path to disk>
$ sudo zpool add ex6 cache /dev/disk/by-id/ata-INTEL_SSDSC2KG480G7_BTYM72830AV6480BGN
Tuning ZFS options
# Store extended attributes as system attributes to improve performance
$ zfs set xattr=sa <zfs dataset name>
# Turn on ZFS lz4 compression. Use this for compressible datasets, such as ones with many text files
$ zfs set compression=lz4 <zfs dataset name>
# Turn off access time for improved disk performance (so that the OS doesn't write a new time every time a file is accessed)
$ zfs set atime=off <zfs dataset name>
NOTE: ZFS performance degrades tremendously when the zpool is over 80% used. To avoid this, I have set a quota to 80% of the 248TB in qof/nfs-ex9.
# To set a quota of 200TB on ZFS dataset:
$ zfs set quota=200T <zfs dataset>
# To remove a quota from a ZFS dataset:
$ zfs set quota=none <zfs dataset>
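The 80% figure can be worked out rather than estimated. The sketch below is a hypothetical helper, quota_bytes, that takes a total capacity in bytes and an optional percentage (default 80) and prints the quota in bytes using integer arithmetic. How the total is measured is left to you.

```shell
# Hypothetical helper: print <percent>% (default 80) of a capacity in bytes.
# Usage: quota_bytes <total-bytes> [percent]
quota_bytes() {
    total=$1
    percent=${2:-80}
    # Divide first so that very large totals cannot overflow.
    echo $(( total / 100 * percent ))
}

# Possible usage, assuming total_bytes holds the dataset capacity:
#   zfs set quota="$(quota_bytes "$total_bytes")" <zfs dataset>
```

For 248TB this comes to roughly 198TB, which is close to the 200T value used above.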
By default, ZFS pools/mounts do not have ACLs active.
# to activate access control lists on a zpool
$ sudo zfs set acltype=posixacl <pool name>
Checking Disk Health and Integrity
Print a brief summary of all pools:
zpool list
Print a detailed status of each disk and status of pool:
zpool status
Clear the error counters on a pool's disks, if the errors are not serious:
zpool clear <pool_name>
Check data integrity. A scrub traverses all the data in the pool once and verifies that all blocks can be read:
zpool scrub <pool_name>
To stop scrub:
zpool scrub -s <pool_name>
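On pools with dozens of disks, the one unhealthy device is easy to miss in the zpool status output. The sketch below is a hypothetical filter, unhealthy_devices, that reads that output on stdin and prints every config row whose state is not ONLINE. It assumes the usual row layout of name, state and three error counters.

```shell
# Hypothetical filter: print "name state" for every vdev or disk row in
# `zpool status` output whose state is not ONLINE.
unhealthy_devices() {
    # Config rows have at least five fields and an error counter in field 3;
    # this skips the "state: DEGRADED" summary line, which has only two.
    awk 'NF >= 5 && $3 ~ /^[0-9]/ &&
         $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ { print $1, $2 }'
}

# Possible usage:
#   zpool status | unhealthy_devices
```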
mount after reboot
zfs set mountpoint=/export/db2 db2
recovery from accidental pool destruction
umount /mnt /mnt2
mdadm -S /dev/md125 /dev/md126 /dev/md127
# Dump both partition tables to files, then write sdb's table onto sda
sfdisk -d /dev/sda > sda.sfdisk
sfdisk -d /dev/sdb > sdb.sfdisk
sfdisk /dev/sda < sdb.sfdisk
mdadm --detail /dev/md127
mdadm -A -R /dev/md127 /dev/sdb2 /dev/sda2
mdadm /dev/md127 -a /dev/sda2
mdadm --detail /dev/md127
echo check > /sys/block/md127/md/sync_action
cat /proc/mdstat
mdadm --detail /dev/md126
mdadm -A -R /dev/md126 /dev/sdb3 /dev/sda3
mdadm /dev/md126 -a /dev/sda3
mdadm --detail /dev/md126
echo check > /sys/block/md126/md/sync_action
cat /proc/mdstat
Also switched the BIOS to boot from hd2 instead of hd1 (or something).
- Recreate zpool with correct drives
- Point an instance of photorec at each of the wiped drives, set to recover files of the following types: .gz, .solv (custom definition)
NOTE: If you destroyed your zpool with the command 'zpool destroy', you can use 'zpool import -D' to list destroyed pools and recover one with 'zpool import -D <zpool name>'.
Troubleshooting
Panic!! The disk is full and I can't remove files!
If the zfs pool somehow gets completely filled up, to the point that "rm" no longer works, don't panic. The disk may seem full, but ZFS actually keeps a little bit of space free for internal operations (why rm is not one of these, I don't know). The amount of space reserved for this purpose is determined by the "spa_slop_shift" module parameter. You can find the value of this parameter at /sys/module/zfs/parameters/spa_slop_shift
If you can't find the value of this parameter, you can calculate it from the ALLOC and FREE columns given by "zpool list", like so:
spa_slop_shift = floor(log2(ALLOC/FREE))
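The estimate presumably works because, on a full pool, the remaining FREE space is roughly the reserved slop space, which is the pool size divided by 2^spa_slop_shift. A minimal sketch of the same arithmetic follows. The helper name estimate_slop_shift is hypothetical, and it expects ALLOC and FREE in bytes (for example from `zpool list -Hp -o alloc,free <pool_name>`, assuming your ZFS version supports those flags). It counts halvings instead of calling a logarithm so that exact powers of two are not thrown off by rounding.

```shell
# Hypothetical helper: floor(log2(ALLOC/FREE)), with both values in bytes.
# Usage: estimate_slop_shift <alloc-bytes> <free-bytes>
estimate_slop_shift() {
    awk -v alloc="$1" -v free="$2" 'BEGIN {
        if (free <= 0) exit 1          # avoid dividing by zero
        ratio = alloc / free
        n = 0
        # Count how many times the ratio can be halved before it drops below 2.
        while (ratio >= 2) { ratio /= 2; n++ }
        print n
    }'
}

# Possible usage, with numbers matching a pool that shows 524T ALLOC and 95.9G FREE:
#   estimate_slop_shift 576144092954624 102971840922
```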
To free up a little bit more space on disk, you can increase the value of this parameter by one. Here's a real-world example of this, from when /nfs/exh filled up.
[root@nfs-exh ~]# df -h /nfs/exh
Filesystem           Size  Used Avail Use% Mounted on
nfs-exh:/export/exh  349T  349T    0B 100% /mnt/nfs/exh
[root@nfs-exh ~]# zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
exh    524T   524T  95.9G        -         -    91%    99%  1.00x    ONLINE  -
[root@nfs-exh ~]# cat /sys/module/zfs/parameters/spa_slop_shift
12
[root@nfs-exh ~]# echo 13 > /sys/module/zfs/parameters/spa_slop_shift
[root@nfs-exh ~]# df -h /nfs/exh
Filesystem           Size  Used Avail Use% Mounted on
nfs-exh:/export/exh  349T  349T   22G 100% /mnt/nfs/exh
Once you've gotten a foothold on the disk and made some space, you should revert the spa_slop_shift parameter back to its original value.
zpool destroy : Failed to unmount <device> - device busy
The help text will advise you to check lsof or fuser, but really what you need to do is stop the nfs service:
systemctl stop nfs
umount /export/ex*
zpool destroy ...
zpool create ...
zpool ...
...
systemctl start nfs
zpool missing after reboot
This happens when zfs-import-cache fails to start at boot time.
# check
$ systemctl status zfs-import-cache.service
# enable at boot time
$ systemctl enable zfs-import-cache.service
Example: Fixing degraded pool, replacing faulted disk
On Feb 22, 2019, one of nfs-ex9's disks became faulty.
-bash-4.2$ zpool status
  pool: ex9
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub canceled on Fri Feb 22 11:31:25 2019
config:
        raidz2-5  DEGRADED     0     0     0
          sdae    ONLINE       0     0     0
          sdaf    ONLINE       0     0     0
          sdag    ONLINE       0     0     0
          sdah    FAULTED     18     0     0  too many errors
          sdai    ONLINE       0     0     0
          sdaj    ONLINE       0     0     0
I did the following:
-bash-4.2$ sudo zpool offline ex9 sdah
Then I went to the server room and saw that disk 1 still had a red light due to the fault. I pulled the disk out and inserted a fresh one of the same brand, a Seagate Exos X12. The server detected the new disk and gave it the same name as the one I had just pulled out, /dev/sdah. Finally, I ran the following command.
-bash-4.2$ sudo zpool replace ex9 /dev/sdah
-bash-4.2$ zpool status
  pool: ex9
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Mar 19 14:06:33 2019
        1.37G scanned out of 51.8T at 127M/s, 118h33m to go
        37.9M resilvered, 0.00% done
.
.
.
        raidz2-5         DEGRADED     0     0     0
          sdae           ONLINE       0     0     0
          sdaf           ONLINE       0     0     0
          sdag           ONLINE       0     0     0
          replacing-3    DEGRADED     0     0     0
            old          FAULTED     18     0     0  too many errors
            sdah         ONLINE       0     0     0  (resilvering)
          sdai           ONLINE       0     0     0
          sdaj           ONLINE       0     0     0
Resilvering is the process of a disk being rebuilt from its parity group. Once it is finished, you should be good to go again.
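A resilver of this size runs for days, so it can help to pull out just the progress figure. The sketch below is a hypothetical filter, resilver_percent, that reads zpool status output on stdin and prints the percentage from the "resilvered, N% done" line. It assumes that wording, which may differ between ZFS versions.

```shell
# Hypothetical filter: print the resilver progress percentage found in
# `zpool status` output, without the trailing % sign.
resilver_percent() {
    awk '/resilvered/ {
        for (i = 1; i <= NF; i++) {
            # The progress figure is the first field that ends in a percent sign.
            if ($i ~ /%$/) { sub(/%$/, "", $i); print $i; exit }
        }
    }'
}

# Possible usage:
#   zpool status ex9 | resilver_percent
```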
Replace disk by disk ids
For zayin/nfs-exa, some of the disks are named by id instead of by vdev-id (the sd* name). Using the id is recommended, as the vdev-id can change after a reboot.
raidz2-4                  DEGRADED     0     0     0
  scsi-35000c500a7da67cb  ONLINE       0     0     0
  scsi-35000c500a7daa34f  ONLINE       0     0     0
  scsi-35000c500a7db39db  FAULTED      0     0     0  too many errors
  scsi-35000c500a7da6b97  ONLINE       0     0     0
  scsi-35000c500a7da265b  ONLINE       0     0     0
  scsi-35000c500a7da740f  ONLINE       0     0     0
In this case, we have to determine the id of the disk that was just inserted, using dmesg. Look for log entries that mention a new disk:
$ dmesg -T | tail
[7819794.080935] scsi 0:0:40:0: Power-on or device reset occurred
[7819794.099111] sd 0:0:40:0: Attached scsi generic sg8 type 0
[7819794.100978] sd 0:0:40:0: [sdi] Spinning up disk...
[7819795.103622] ......................ready
[7819817.123255] sd 0:0:40:0: [sdi] 31251759104 512-byte logical blocks: (16.0 TB/14.5 TiB)
[7819817.123263] sd 0:0:40:0: [sdi] 4096-byte physical blocks
[7819817.128478] sd 0:0:40:0: [sdi] Write Protect is off
[7819817.128486] sd 0:0:40:0: [sdi] Mode Sense: df 00 10 08
[7819817.130308] sd 0:0:40:0: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA
[7819817.165231] sd 0:0:40:0: [sdi] Attached SCSI disk
Check that the disk is properly recognized. The new disk should be at the bottom of the list and should not have any partitions:
$ fdisk -l
$ cd /dev/disk/by-id
$ ls -ltr | grep sdi
lrwxrwxrwx. 1 root root 9 Feb 7 13:29 scsi-35000c500d7947833 -> ../../sdi
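The same lookup can be scripted by resolving each symlink instead of grepping the ls output, which avoids accidental matches such as sdi also matching sdia. This is only a sketch: the helper name find_disk_ids is hypothetical, and it assumes GNU `readlink -f` is available, as it is on CentOS 7.

```shell
# Hypothetical helper: print every name in a by-id directory whose symlink
# resolves to the given kernel device name (e.g. sdi).
# Usage: find_disk_ids <device-name> [by-id-directory]
find_disk_ids() {
    dev=$1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        target=$(readlink -f "$link")
        if [ "$(basename "$target")" = "$dev" ]; then
            basename "$link"
        fi
    done
}

# Possible usage:
#   find_disk_ids sdi
```

A disk usually has several ids (for example scsi-* and wwn-*), so more than one line may be printed. Pick the one that matches the naming style already used in zpool status.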
Once the name is determined, start the resilvering process:
# scsi-35000c500a7db39db is the id of the failed disk obtained from zpool status
# scsi-35000c500d7947833 is the id of the new replacement disk determined above
$ zpool replace exa scsi-35000c500a7db39db scsi-35000c500d7947833
Disk LED light
Identify failed disk by LED light
By disk_id
# turn light off
$ ledctl locate_off=/dev/disk/by-id/<disk_id>
# turn light on
$ ledctl locate=/dev/disk/by-id/<disk_id>
Example:
$ ledctl locate_off=/dev/disk/by-id/scsi-35000c500a7d8137f
$ ledctl locate=/dev/disk/by-id/scsi-35000c500a7d8137f
$ ledctl locate=/dev/disk/by-partlabel/zfs-c34473d19032c002
By vdev
# turn light off
$ ledctl locate_off=/dev/<vdev>
# turn light on
$ ledctl locate=/dev/<vdev>
Example:
$ ledctl locate_off=/dev/sdaf
$ ledctl locate=/dev/sdaf
Reset light from LED light glitch
For qof/nfs-ex9, we had an issue with the disk LED for /dev/sdah still showing up red despite the resilvering occurring. To return the disk LED to a normal status, issue the following command:
$ sudo ledctl normal=/dev/<disk vdev id>
Example:
$ sudo ledctl normal=/dev/sdah
Or, for zayin/nfs-exa, where disks are identified by id:
$ sudo ledctl normal=/dev/disk/by-id/<disk id>
Example:
$ sudo ledctl normal=/dev/disk/by-id/scsi-35000c500a7db39db