Zfs
ZFS - Zettabyte Filesystem
ZFS packages installation
https://www.symmcom.com/docs/how-tos/storages/how-to-install-zfs-on-centos-7
The ZFS RPM package must match the CentOS version installed on the machine.
- Check CentOS version
$ cat /etc/centos-release
CentOS Linux release 7.8.2003 (Core)
- Install the zfs-release package. In this case, you need the package for CentOS 7.8.
$ yum install http://download.zfsonlinux.org/epel/zfs-release.el7_8.noarch.rpm
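The package URL above can be derived from the release string rather than typed by hand. This is only a sketch: it assumes the el<major>_<minor> file naming seen for 7.8 also applies to other releases, which is not verified here, and zfs_release_url is a made-up helper name.

```shell
#!/bin/sh
# Build the zfs-release RPM URL from a CentOS release string.
# Assumption: the el<major>_<minor> naming used for 7.8 holds for other releases.
zfs_release_url() {
    # $1: a line such as "CentOS Linux release 7.8.2003 (Core)"
    ver=$(printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1_\2/p')
    [ -n "$ver" ] || return 1    # no version found in the string
    printf 'http://download.zfsonlinux.org/epel/zfs-release.el%s.noarch.rpm\n' "$ver"
}

# Example with the release string shown above.
# On a real host, pass "$(cat /etc/centos-release)" instead.
zfs_release_url "CentOS Linux release 7.8.2003 (Core)"
```

Check that the printed URL actually exists before handing it to yum install.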
- Edit /etc/yum.repos.d/zfs.repo
The ZFS package that we want to install is zfs-kmod
$ vim /etc/yum.repos.d/zfs.repo
There are 2 items to change:

[zfs]
name=ZFS on Linux for EL7 - dkms
baseurl=http://download.zfsonlinux.org/epel/7.8/$basearch/
enabled=1 -> change to 0
metadata_expire=7d
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux

[zfs-kmod]
name=ZFS on Linux for EL7 - kmod
baseurl=http://download.zfsonlinux.org/epel/7.8/kmod/$basearch/
enabled=0 -> change to 1
metadata_expire=7d
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux
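The two edits to zfs.repo can also be scripted. The sketch below is a filter that reads a repo file on stdin and writes the changed version to stdout, so nothing is modified in place. It assumes the flags are written exactly as "enabled=N" with no spaces, as in the file shown above; switch_to_kmod is a made-up name.

```shell
#!/bin/sh
# Flip the "enabled" flags: [zfs] -> 0, [zfs-kmod] -> 1. Other sections are untouched.
# Assumption: flags are written as "enabled=N" without surrounding spaces.
switch_to_kmod() {
    awk '
        /^\[/ { section = $0 }    # remember which [section] we are in
        /^enabled=/ && section == "[zfs]"      { print "enabled=0"; next }
        /^enabled=/ && section == "[zfs-kmod]" { print "enabled=1"; next }
        { print }
    '
}

# Small demonstration on an inline sample instead of the real file.
printf '[zfs]\nenabled=1\n[zfs-kmod]\nenabled=0\n' | switch_to_kmod
```

On a real host, run it against a copy of /etc/yum.repos.d/zfs.repo and diff the result before replacing the original.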
Beginning ZFS instances
There are only two commands to interact with ZFS.
zpool: used to create and manage ZFS storage pools. A pool is built from vdevs (virtual devices), and vdevs are composed of physical devices.
zfs: used to create and interact with ZFS datasets. ZFS datasets are akin to logical volumes.
# zpool creation syntax
zpool create <poolname> <vdev(s)>

# Create a zpool of six raidz2 vdevs, each with six drives. Includes two SSDs to be used as a mirrored SLOG and one SSD as an L2ARC read cache. (example command was run on qof)
zpool create ex9 \
    raidz2 sda sdb sdc sdd sde sdf \
    raidz2 sdg sdh sdi sdj sdk sdl \
    raidz2 sdm sdn sdo sdp sdq sdr \
    raidz2 sds sdt sdu sdv sdw sdx \
    raidz2 sdy sdz sdaa sdab sdac sdad \
    raidz2 sdae sdaf sdag sdah sdai sdaj \
    log mirror ata-INTEL_SSDSC2KG480G7_BTYM740603E0480BGN ata-INTEL_SSDSC2KG480G7_BTYM7406019K480BGN \
    cache ata-INTEL_SSDSC2KG480G7_BTYM740602GN480BGN

[root@qof ~]# zpool status
  pool: ex9
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        ex9           ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            sda       ONLINE       0     0     0
            sdb       ONLINE       0     0     0
            sdc       ONLINE       0     0     0
            sdd       ONLINE       0     0     0
            sde       ONLINE       0     0     0
            sdf       ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            sdg       ONLINE       0     0     0
            sdh       ONLINE       0     0     0
            sdi       ONLINE       0     0     0
            sdj       ONLINE       0     0     0
            sdk       ONLINE       0     0     0
            sdl       ONLINE       0     0     0
          raidz2-2    ONLINE       0     0     0
            sdm       ONLINE       0     0     0
            sdn       ONLINE       0     0     0
            sdo       ONLINE       0     0     0
            sdp       ONLINE       0     0     0
            sdq       ONLINE       0     0     0
            sdr       ONLINE       0     0     0
          raidz2-3    ONLINE       0     0     0
            sds       ONLINE       0     0     0
            sdt       ONLINE       0     0     0
            sdu       ONLINE       0     0     0
            sdv       ONLINE       0     0     0
            sdw       ONLINE       0     0     0
            sdx       ONLINE       0     0     0
          raidz2-4    ONLINE       0     0     0
            sdy       ONLINE       0     0     0
            sdz       ONLINE       0     0     0
            sdaa      ONLINE       0     0     0
            sdab      ONLINE       0     0     0
            sdac      ONLINE       0     0     0
            sdad      ONLINE       0     0     0
          raidz2-5    ONLINE       0     0     0
            sdae      ONLINE       0     0     0
            sdaf      ONLINE       0     0     0
            sdag      ONLINE       0     0     0
            sdah      ONLINE       0     0     0
            sdai      ONLINE       0     0     0
            sdaj      ONLINE       0     0     0
        logs
          mirror-6    ONLINE       0     0     0
            ata-INTEL_SSDSC2KG480G7_BTYM740603E0480BGN  ONLINE       0     0     0
            ata-INTEL_SSDSC2KG480G7_BTYM7406019K480BGN  ONLINE       0     0     0
        cache
          ata-INTEL_SSDSC2KG480G7_BTYM740602GN480BGN  ONLINE       0     0     0
Adding a zfs filesystem:
Using qof as an example, I will create a child filesystem under ex9 named archive that will be mounted under /export/ex9/archive. This archive will be used to back up user data.
-bash-4.2$ zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
ex9   2.39T   249T  2.39T  /export/ex9
-bash-4.2$ sudo zfs create -o mountpoint=/export/ex9/archive ex9/archive
-bash-4.2$ zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
ex9          2.39T   249T  2.39T  /export/ex9
ex9/archive   192K   249T   192K  /export/ex9/archive
Adding L2ARC Read Cache to a zpool
# Look for available SSDs in /dev/disk/by-id/
# Choose an available SSD to use for read cache. Then decide which pool you want to put the cache on.
# Syntax: zpool add <zpool name> <cache/log> <path to disk>
$ sudo zpool add ex6 cache /dev/disk/by-id/ata-INTEL_SSDSC2KG480G7_BTYM72830AV6480BGN
Tuning ZFS options
# Store extended attributes as system attributes to improve performance
$ zfs set xattr=sa <zfs dataset name>
# Turn on ZFS lz4 compression. Use this for compressible datasets, such as ones holding many text files
$ zfs set compression=lz4 <zfs dataset name>
# Turn off access time updates for improved disk performance (so that the OS doesn't write a new time every time a file is accessed)
$ zfs set atime=off <zfs dataset name>
NOTE: ZFS performance degrades tremendously when the zpool is over 80% used. To avoid this, I have set a quota of 80% of the 248TB in qof/nfs-ex9.
# To set a quota of 200TB on a ZFS dataset:
$ zfs set quota=200T <zfs dataset>
# To remove a quota from a ZFS dataset:
$ zfs set quota=none <zfs dataset>
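The 80% figure can be worked out with shell arithmetic. The sketch below only prints the quota command and does not run it. quota_cmd is a made-up helper, and the result is rounded down to whole terabytes.

```shell
#!/bin/sh
# Print the zfs quota command for a percentage of a pool's size (whole TB, rounded down).
quota_cmd() {
    # $1: dataset name, $2: pool size in TB, $3: percentage (defaults to 80)
    pct=${3:-80}
    echo "zfs set quota=$(( $2 * pct / 100 ))T $1"
}

# 80% of the 248TB pool mentioned above
quota_cmd ex9 248    # prints: zfs set quota=198T ex9
```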
By default, ZFS pools/mounts do not have ACLs active.
# To activate access control lists on a zpool
$ sudo zfs set acltype=posixacl <pool name>
situation
zpool status
zfs list
zfs get all
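To spot pools that are getting too full, the output of "zpool list -H -o name,capacity" (one pool per line, name and percent used) can be filtered. The sketch below reads that output from stdin, so the example runs on sample text instead of a live pool. over_threshold is a made-up name and the sample numbers are invented for illustration.

```shell
#!/bin/sh
# Print the names of pools whose used capacity is at or above a threshold.
over_threshold() {
    # stdin: lines of "<pool> <capacity>%"; $1: threshold in percent (defaults to 80)
    awk -v limit="${1:-80}" '{
        cap = $2
        sub(/%/, "", cap)                     # strip the percent sign if present
        if (cap + 0 >= limit + 0) print $1
    }'
}

# On a live host: zpool list -H -o name,capacity | over_threshold 80
# Sample input with invented numbers:
printf 'ex9\t1%%\nex6\t85%%\n' | over_threshold 80    # prints: ex6
```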
mount after reboot
zfs set mountpoint=/export/db2 db2
when you put in a new disk
fdisk -l
to see what is new
sudo zpool create -f /srv/db3 raidz2 /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj /dev/sdak /dev/sdal
sudo zpool add -f /srv/db3 raidz2 /dev/sdam /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas /dev/sdat /dev/sdau /dev/sdav /dev/sdaw /dev/sdax
zfs unmount db3
zfs mount db3
latest
zpool create -f db3 raidz2 /dev/sdy /dev/sdz /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj
zpool add -f db3 raidz2 /dev/sdak /dev/sdal /dev/sdam /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas /dev/sdat /dev/sdau /dev/sdav

zpool create -f db4 raidz2 /dev/sdax /dev/sday /dev/sdaz /dev/sdba /dev/sdbb /dev/sdbc /dev/sdbd /dev/sdbe /dev/sdbf /dev/sdbg /dev/sdbh /dev/sdbi
zpool add -f db4 raidz2 /dev/sdbj /dev/sdbk /dev/sdbl /dev/sdbm /dev/sdbn /dev/sdbo /dev/sdbp /dev/sdbq /dev/sdbr /dev/sdbs /dev/sdbt /dev/sdbu
Fri Jan 19 2018
zpool create -f db5 raidz2 /dev/sdbw /dev/sdbx /dev/sdby /dev/sdbz /dev/sdca /dev/sdcb /dev/sdcc /dev/sdcd /dev/sdce /dev/sdcf /dev/sdcg /dev/sdch
zpool add -f db5 raidz2 /dev/sdci /dev/sdcj /dev/sdck /dev/sdcl /dev/sdcm /dev/sdcn /dev/sdco /dev/sdcp /dev/sdcq /dev/sdcr /dev/sdcs /dev/sdct
zfs mount db5
Wed Jan 24 2018
On tsadi
zpool create -f ex1 mirror /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae
zpool add -f ex1 mirror /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj
zpool create -f ex2 mirror /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj
zpool add -f ex2 /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo
zpool create -f ex3 mirror /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt
zpool add -f ex3 mirror /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy
zpool create -f ex4 mirror /dev/sdz /dev/sdak /dev/sdal
zpool add -f ex4 mirror /dev/sdam /dev/sdan /dev/sdao
On tsadi
zpool create -f ex1 mirror /dev/sdaa /dev/sdab mirror /dev/sdac /dev/sdad mirror /dev/sdae /dev/sdaf mirror /dev/sdag /dev/sdah mirror /dev/sdai /dev/sdaj
zpool create -f ex2 mirror /dev/sdf /dev/sdg mirror /dev/sdh /dev/sdi mirror /dev/sdj /dev/sdk mirror /dev/sdl /dev/sdm mirror /dev/sdn /dev/sdo
zpool create -f ex3 mirror /dev/sdp /dev/sdq mirror /dev/sdr /dev/sds mirror /dev/sdt /dev/sdu mirror /dev/sdv /dev/sdw mirror /dev/sdx /dev/sdy
zpool create -f ex4 mirror /dev/sdz /dev/sdak mirror /dev/sdal /dev/sdam mirror /dev/sdan /dev/sdao
On lamed
zpool create -f ex5 mirror /dev/sdaa /dev/sdab mirror /dev/sdac /dev/sdad mirror /dev/sdae /dev/sdaf mirror /dev/sdag /dev/sdah mirror /dev/sdai /dev/sdaj
zpool create -f ex6 mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd mirror /dev/sde /dev/sdf mirror /dev/sdg /dev/sdh mirror /dev/sdi /dev/sdj
zpool create -f ex7 mirror /dev/sdk /dev/sdl mirror /dev/sdm /dev/sdn mirror /dev/sdo /dev/sdp mirror /dev/sdq /dev/sdr mirror /dev/sds /dev/sdt
zpool create -f ex8 mirror /dev/sdu /dev/sdv mirror /dev/sdw /dev/sdx mirror /dev/sdy /dev/sdz
Sun Jan 19 2020
on mem2, sql system, note sda and sdc are system disks
zpool create -f sql1 raidz2 /dev/sdb /dev/sdc /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm
zpool add -f sql1 raidz2 /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx
transform db4 on n-9-22 from z2 to z0
zpool destroy db4
zpool create -f db4 raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy
zfs mount
recovery from accidental pool destruction
umount /mnt /mnt2
mdadm -S /dev/md125 /dev/md126 /dev/md127

sfdisk -d /dev/sda > sda.sfdisk
sfdisk -d /dev/sdb > sdb.sfdisk
sfdisk /dev/sda < sdb.sfdisk

mdadm --detail /dev/md127
mdadm -A -R /dev/md127 /dev/sdb2 /dev/sda2
mdadm /dev/md127 -a /dev/sda2
mdadm --detail /dev/md127
echo check > /sys/block/md127/md/sync_action
cat /proc/mdstat

mdadm --detail /dev/md126
mdadm -A -R /dev/md126 /dev/sdb3 /dev/sda3
mdadm /dev/md126 -a /dev/sda3
mdadm --detail /dev/md126
echo check > /sys/block/md126/md/sync_action
cat /proc/mdstat
Also switched the bios to boot from hd2 instead of hd1 (or something)
- Recreate zpool with correct drives
- Point an instance of photorec at each of the wiped drives, set to recover files of the following types: .gz, .solv (custom definition)
NOTE: If you destroyed your zpool with the command 'zpool destroy', you can use 'zpool import -D' to view destroyed pools and recover a pool with 'zpool import -D <zpool name>'.
Thu Apr 16, 2020
We destroyed the old db2 on abacus. We put in 20 new 7.68 TB disks and 2 new 2.5 TB disks.
zpool create -f /scratch /dev/sdc /dev/sdd
zpool create -f /srv/db2 raidz2 /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn
zpool add -f /srv/db2 raidz2 /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx
OLD:
sudo zpool create -f /srv/db3 raidz2 /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj /dev/sdak /dev/sdal
sudo zpool add -f /srv/db3 raidz2 /dev/sdam /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas /dev/sdat /dev/sdau /dev/sdav /dev/sdaw /dev/sdax
Mon Apr 20 2020
zpool create -f db2 raidz2 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn
zpool add -f db2 raidz2 /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz

zpool create -f db3 raidz2 /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj /dev/sdak /dev/sdal
zpool add -f db3 raidz2 /dev/sdam /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas /dev/sdat /dev/sdau /dev/sdav /dev/sdaw /dev/sdax

zpool create -f db5 raidz2 /dev/sday /dev/sdaz /dev/sdba /dev/sdbb /dev/sdbc /dev/sdbd /dev/sdbe /dev/sdbf /dev/sdbg /dev/sdbh /dev/sdbi /dev/sdbj
zpool add -f db5 raidz2 /dev/sdbk /dev/sdbl /dev/sdbm /dev/sdbn /dev/sdbo /dev/sdbp /dev/sdbq /dev/sdbr /dev/sdbs /dev/sdbt /dev/sdbu /dev/sdbv
Tue Apr 21 2020
Ben's commands:
fdisk -l 2>/dev/null | grep -o "zfs.*" > disk_ids
split -n 3 disk_ids disk_id_
db2_disks=`cat disk_id_aa`
db3_disks=`cat disk_id_ab`
db5_disks=`cat disk_id_ac`
zpool create -f db2 raidz2 $db2_disks
zpool create -f db3 raidz2 $db3_disks
zpool create -f db5 raidz2 $db5_disks
reboot
Amended commands, Apr 22, based on advice from john that vdevs should be limited to 12 disks each:
fdisk -l 2>/dev/null | grep -o "zfs.*" > disk_ids
split -n 6 disk_ids disk_id_

db2_disks_1=`cat disk_id_aa`
db2_disks_2=`cat disk_id_ab`

db3_disks_1=`cat disk_id_ac`
db3_disks_2=`cat disk_id_ad`

db5_disks_1=`cat disk_id_ae`
db5_disks_2=`cat disk_id_af`

zpool create -f db2 raidz2 $db2_disks_1
zpool add -f db2 raidz2 $db2_disks_2
zpool create -f db3 raidz2 $db3_disks_1
zpool add -f db3 raidz2 $db3_disks_2
zpool create -f db5 raidz2 $db5_disks_1
zpool add -f db5 raidz2 $db5_disks_2
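The same grouping can be done without intermediate files by chunking a disk list into fixed-width raidz2 vdevs and issuing one zpool create with several vdev groups, as in the ex9 example earlier. The sketch below only prints the command so it can be reviewed first. plan_pool is a made-up name, and the demo uses a width of 4 to keep the output short; use 12 to follow the advice above.

```shell
#!/bin/sh
# Print (do not run) a zpool create command with raidz2 vdevs of a fixed width.
plan_pool() {
    # $1: pool name, $2: disks per vdev, remaining arguments: disk names
    pool=$1; width=$2; shift 2
    # Refuse to build a short final vdev.
    [ $(( $# % width )) -eq 0 ] || { echo "disk count is not a multiple of $width" >&2; return 1; }
    cmd="zpool create $pool"
    i=0
    for disk in "$@"; do
        # Start a new raidz2 group every $width disks.
        [ $(( i % width )) -eq 0 ] && cmd="$cmd raidz2"
        cmd="$cmd $disk"
        i=$(( i + 1 ))
    done
    echo "$cmd"
}

plan_pool db2 4 sdc sdd sde sdf sdg sdh sdi sdj
# prints: zpool create db2 raidz2 sdc sdd sde sdf raidz2 sdg sdh sdi sdj
```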
Mon Jul 20 2020
zpool create -f exb raidz2 sdc sdd sde sdf sdg sdh raidz2 sdi sdj sdk sdl sdm sdn raidz2 sdo sdp sdq sdr sds sdt raidz2 sdu sdv sdw sdx sdy sdz raidz2 sdaa sdab sdac sdad sdae sdaf raidz2 sdag sdah sdai sdaj sdak sdal log mirror sdam sdan cache sdao
zpool destroy : Failed to unmount <device> - device busy
The help text will advise you to check lsof or fuser, but really what you need to do is stop the nfs service
systemctl stop nfs
zpool destroy ...
zpool create ...
zpool ...
...
systemctl start nfs
Example: Fixing degraded pool, replacing faulted disk
On Feb 22, 2019, one of nfs-ex9's disks became faulty.
-bash-4.2$ zpool status
  pool: ex9
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub canceled on Fri Feb 22 11:31:25 2019
config:

          raidz2-5    DEGRADED     0     0     0
            sdae      ONLINE       0     0     0
            sdaf      ONLINE       0     0     0
            sdag      ONLINE       0     0     0
            sdah      FAULTED     18     0     0  too many errors
            sdai      ONLINE       0     0     0
            sdaj      ONLINE       0     0     0
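On a pool with dozens of disks, the faulted device is easy to miss in the status output. The sketch below filters status text on stdin for lines whose state column is FAULTED. faulted_devices is a made-up name, and the sample lines are taken from the output above.

```shell
#!/bin/sh
# Print the names of devices whose STATE column is FAULTED.
faulted_devices() {
    awk '$2 == "FAULTED" { print $1 }'
}

# On a live host: zpool status ex9 | faulted_devices
printf '%s\n' \
    'sdag ONLINE 0 0 0' \
    'sdah FAULTED 18 0 0 too many errors' \
    'sdai ONLINE 0 0 0' | faulted_devices    # prints: sdah
```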
I did the following:
-bash-4.2$ sudo zpool offline ex9 sdah
Then I went to the server room and saw that the faulted disk still had a red light. I pulled the disk out and inserted a fresh one of the same brand, a Seagate Exos X12. The server detected the new disk and named it /dev/sdah, just like the one I had just pulled out. Finally, I ran the following command.
-bash-4.2$ sudo zpool replace ex9 /dev/sdah
-bash-4.2$ zpool status
  pool: ex9
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Mar 19 14:06:33 2019
        1.37G scanned out of 51.8T at 127M/s, 118h33m to go
        37.9M resilvered, 0.00% done
. . .
          raidz2-5         DEGRADED     0     0     0
            sdae           ONLINE       0     0     0
            sdaf           ONLINE       0     0     0
            sdag           ONLINE       0     0     0
            replacing-3    DEGRADED     0     0     0
              old          FAULTED     18     0     0  too many errors
              sdah         ONLINE       0     0     0  (resilvering)
            sdai           ONLINE       0     0     0
            sdaj           ONLINE       0     0     0
Resilvering is the process of a disk being rebuilt from its parity group. Once it is finished, you should be good to go again.
For qof/nfs-ex9, we had an issue with the disk LED for /dev/sdah still showing up red despite the resilvering occurring. To return the disk LED to a normal status, issue the following command:
$ sudo ledctl normal=/dev/<disk id>
Example:
$ sudo ledctl normal=/dev/sdah
For zayin/nfs-exa, some of the disks are named by id instead of the vdev-id.
          raidz2-4                    DEGRADED     0     0     0
            scsi-35000c500a7da67cb    ONLINE       0     0     0
            scsi-35000c500a7daa34f    ONLINE       0     0     0
            scsi-35000c500a7db39db    FAULTED      0     0     0  too many errors
            scsi-35000c500a7da6b97    ONLINE       0     0     0
            scsi-35000c500a7da265b    ONLINE       0     0     0
            scsi-35000c500a7da740f    ONLINE       0     0     0
In this case, we have to determine the vdev name of the newly inserted disk with dmesg. Look for log lines that mention a new disk.
$ dmesg | tail
[14663327.192519] sd 0:0:38:0: [sdad] Spinning up disk...
[14663327.192756] sd 0:0:38:0: Attached scsi generic sg27 type 0
[14663328.193173] ........................ready
[14663352.681625] sd 0:0:38:0: [sdad] 27344764928 512-byte logical blocks: (14.0 TB/12.7 TiB)
[14663352.681627] sd 0:0:38:0: [sdad] 4096-byte physical blocks
[14663352.687268] sd 0:0:38:0: [sdad] Write Protect is off
[14663352.687273] sd 0:0:38:0: [sdad] Mode Sense: db 00 10 08
[14663352.690847] sd 0:0:38:0: [sdad] Write cache: enabled, read cache: enabled, supports DPO and FUA
[14663352.732297] sd 0:0:38:0: [sdad] Attached SCSI disk
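The device name can be pulled out of the dmesg text instead of read by eye. The sketch below keeps only the "Attached SCSI disk" lines and prints the name in brackets. new_disk_names is a made-up name, and the pattern assumes kernel messages worded like the ones shown above.

```shell
#!/bin/sh
# Print the sdX name from each "Attached SCSI disk" line on stdin.
new_disk_names() {
    sed -n 's/.*\[\(sd[a-z][a-z]*\)\] Attached SCSI disk.*/\1/p'
}

# On a live host: dmesg | tail -n 50 | new_disk_names
printf '%s\n' \
    '[14663352.687268] sd 0:0:38:0: [sdad] Write Protect is off' \
    '[14663352.732297] sd 0:0:38:0: [sdad] Attached SCSI disk' | new_disk_names    # prints: sdad
```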
Once the name is determined, start the resilvering process.
$ zpool replace exa scsi-35000c500a7db39db sdad
# scsi-35000c500a7db39db is the id of the failed disk obtained from zpool status
# sdad is the vdev-id of the new replacement disk determined above