ZFS Performance Focused Parameters
February 20, 2018
We’ve recently gotten some significantly larger storage systems, and after running some 50T pools with basically all the defaults it felt like time to dig into the common options used to chase performance. The intended use for these systems is ultimately CIFS/NFS targets for scientists running simulations that generate small (1M) to large (100G) files. I’m not being rigorous or offering any benchmarks here, just digging into documented performance parameters and explaining the rationale.
Alignment Shift #
Configuring ashift correctly is important because partial sector writes incur a penalty: the sector must be read into a buffer before it can be written. Alignment shift needs to be set properly for each vdev in a pool. You should query your drives to determine whether they are advanced format (4K or 8K sectors) and set the ashift parameter to 12 or 13 respectively (2^12 = 4096, 2^13 = 8192).
cat /sys/class/block/sdX/queue/physical_block_size
cat /sys/class/block/sdX/queue/logical_block_size
You can find many reports of significant performance impact from people who’ve ignored setting ashift or forcibly set it to 0. There is an overhead to setting this parameter: you will see a reduction in raw available storage from each device that is advanced format.
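Once a pool exists you can sanity-check which ashift each vdev actually got with something like the following, which reads the cached pool configuration (assuming the example pool name tank used later in this post):
zdb -C tank | grep ashift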
Spares and Autoreplace #
If you add spares to the pool you can tell ZFS to automatically resilver onto those spares. There is some discussion of how this works in ZoL. As resilvering is a significantly time-costly operation (at least until the performance improvements land in 0.8.0), autoreplace can save you a ton of time getting a system back on-line. You do, however, run the risk of a resilver being triggered automatically by an event that you might not classify as a full device failure.
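Autoreplace is a pool property, so it can also be toggled on an existing pool; a minimal example, again assuming a pool named tank:
zpool set autoreplace=on tank
zpool get autoreplace tank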
RAIDZ(1,2,3), Mirrored RAIDZ(1,2,3), or Striped Mirrors (RAID10) #
There are some good write ups that get into comparative performance, but the major items to consider here are:
- what are your growth requirements?
- what is your tolerance for resilvering time?
With growth, you have to anticipate your ability to provision more vdevs for a system. It is possible to have heterogeneous vdevs within a pool (though I’m unsure of the use case for this, as ZFS will favor the larger vdev and your pool will become imbalanced); generally you’d grow a pool by adding vdevs. If we assume homogeneous vdevs, then the vdev size is your growth increment. Choosing 12-drive vdevs means that growth requires you to purchase 12 drives. In a professional environment this isn’t a terrible pressure, but in personal use I’ve rarely exceeded 8-drive vdevs because it’s a real strain to purchase 8 more drives whenever I want to grow or replace the pool.
With resilvering, you have to consider how tolerant your ecosystem is to downtime. It can take a non-trivial amount of time for a large raidz(1,2,3) vdev to resilver after a device failure. If you’re not prepared to tolerate 36-48 hours of degraded performance or downtime, you should not be wading into raidz.
For these large systems we’re initially provisioning 12 drives with two hot spares. We’re unwilling to tolerate downtime in excess of a single day, and we generally plan to grow in 2-12 drive increments. Based on this we’re going to create a striped mirror, as it offers us the best read performance, the best scrub/resilver performance, and the simplest growth strategy.
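For a rough sense of what the two ends of that trade-off look like at creation time, here is a sketch of both layouts for the same 12 disks (device paths are placeholders, and the actual command we use is in the Final Result section):
# one wide raidz2 vdev: best usable capacity, slowest resilver
zpool create tank raidz2 \
  /dev/mapper/disk0 /dev/mapper/disk1 /dev/mapper/disk2 /dev/mapper/disk3 \
  /dev/mapper/disk4 /dev/mapper/disk5 /dev/mapper/disk6 /dev/mapper/disk7 \
  /dev/mapper/disk8 /dev/mapper/disk9 /dev/mapper/disk10 /dev/mapper/disk11
# six striped mirrors: half the raw capacity, best read and resilver performance
zpool create tank \
  mirror /dev/mapper/disk0 /dev/mapper/disk1 \
  mirror /dev/mapper/disk2 /dev/mapper/disk3 \
  mirror /dev/mapper/disk4 /dev/mapper/disk5 \
  mirror /dev/mapper/disk6 /dev/mapper/disk7 \
  mirror /dev/mapper/disk8 /dev/mapper/disk9 \
  mirror /dev/mapper/disk10 /dev/mapper/disk11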
Compression #
Compression is a sort of “free lunch” in ZFS. There are some discussions online where compression complicates reasoning about how parity is distributed in raidz schemes, but I’ve not found a strong case for avoiding compression in general; instead you find a deluge of recommendations for it. LZ4 is the default compression algorithm in OpenZFS and it is incredibly fast.
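Enabling it and checking what it buys you is cheap; compression only applies to newly written data, and the compressratio property reports the achieved ratio:
zfs set compression=lz4 tank
zfs get compression,compressratio tank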
Normalization #
This property indicates whether a file system should perform a unicode normalization of file names whenever two file names are compared, and which normalization algorithm should be used. There are performance benefits to be gained when ZFS is doing comparisons that involve unicode equivalence. This is a less-documented parameter that shows up often in discussion. The mailing list for ZoL has some discussion about it, however I’ve been unable to dig up anything better than Richard Laager’s reply. It appears that most people are selecting formD, so we’ll throw our lot in with that as the more heavily tested pathway.
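One caveat worth flagging: normalization can only be set when a dataset is created and can’t be changed afterwards, so it has to be passed at zpool create or zfs create time. A minimal sketch with a hypothetical dataset:
zfs create -o normalization=formD tank/simulations
zfs get normalization tank/simulations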
Prevent Mounting of the Root Dataset #
You generally never want to use the root dataset to store anything, because some dataset options can never be disabled once set. If you enable such an option on the root dataset and you have files there, you’d have to destroy the entire pool to clear it, instead of destroying just a dataset. You can prevent the root dataset from being mounted by specifying -O canmount=off. If the canmount property is set to off, the file system cannot be mounted using the zfs mount or zfs mount -a commands. Setting this property to off is similar to setting the mountpoint property to none, except that the dataset still has a normal mountpoint property that can be inherited.
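With canmount=off on the root dataset, all actual data lives in child datasets; a quick sketch with hypothetical dataset names:
zfs create tank/data
zfs create tank/home
zfs list -o name,canmount,mountpoint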
Extended File Attributes #
ZFS can greatly benefit from setting the xattr attribute to sa. There is some discussion here of its implementation in ZoL. Further discussion here with a good quote:
The issue here appears to be that – under the hood – Linux doesn’t have a competent extended attribute engine yet. It’s using getfactl/getxattr/setfacl/setattr; the native chown/chmod utilities on Linux don’t even support extended attributes yet. Andreas Gruenbacher – the author of the utilities – clearly did a great job implementing something that doesn’t have great support in the kernel.
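Setting it is a one-liner and is inherited by child datasets; as far as I can tell it only affects newly created extended attributes, so existing ones stay in the old directory-based format until rewritten:
zfs set xattr=sa tank
zfs get xattr tank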
Access Time Updates #
Some people turn atime off; we will instead leverage relatime to make ZFS behave more like ext4/xfs, where access time is only updated if the modified or changed time changes.
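Note that relatime only has an effect while atime itself is on; on an existing pool that would look something like:
zfs set atime=on tank
zfs set relatime=on tank
zfs get atime,relatime tank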
Recordsize #
Most of the discussion for recordsize focuses on tuning for database interactions. We’re providing this storage to users who will be writing primarily large files. The idea with recordsize is that you want to set it as high as possible while simultaneously avoiding two things:
- read-modify-write cycles: where ZFS has to write a chunk that is smaller than its recordsize. For example, if your ZFS recordsize is set to 128k and InnoDB wants to write 16k, ZFS will have to read the entire 128k, modify some 16k within that 128k, then write back the entire 128k.
- write amplification: when you’re doing read-modify-write you’re writing far more than you intended to; for the above example it’s a multiple of eight.
There is a good discussion about this with Allan Jude. recordsize should really be set per dataset, as datasets should be created for specific purposes. There is a post here, from which I used the following to survey the file size distribution of an existing dataset:
find . -type f -print0 \
| xargs -0 ls -l \
| awk '{ n=int(log($5)/log(2)); \
if (n<10) n=10; \
size[n]++ } \
END { for (i in size) printf("%d %d\n", 2^i, size[i]) }' \
| sort -n \
| awk 'function human(x) { x[1]/=1024; \
if (x[1]>=1024) { x[2]++; \
human(x) } } \
{ a[1]=$1; \
a[2]=0; \
human(a); \
printf("%3d%s: %6d\n", a[1],substr("kMGTEPYZ",a[2]+1,1),$2) }'
Which produced this for a dataset:
1k: 77526
2k: 26490
4k: 26354
8k: 35760
16k: 15681
32k: 12206
64k: 8606
128k: 8740
256k: 12421
512k: 19919
1M: 15813
2M: 10342
4M: 13070
8M: 7604
16M: 2981
32M: 988
64M: 1062
128M: 560
256M: 711
512M: 498
1G: 107
2G: 17
4G: 17
8G: 6
16G: 3
32G: 4
64G: 2
In this case I’m going to leave recordsize at the default 128k and pray that compression saves me.
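If a particular dataset does turn out to be dominated by large files, bumping recordsize on just that dataset is straightforward; it only affects newly written files, and values above 128k assume the large_blocks pool feature is enabled (the dataset name here is hypothetical):
zfs set recordsize=1M tank/simulations
zfs get recordsize tank/simulations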
Final Result #
You specify pool options with -o and root dataset options with -O:
zpool create -o ashift=12 -o autoreplace=on -O canmount=off -O mountpoint=/tank -O normalization=formD -O compression=lz4 -O xattr=sa -O relatime=on tank mirror /dev/mapper/disk0 /dev/mapper/disk1
Expanding that out:
zpool create \
-o ashift=12 \
-o autoreplace=on \
-O canmount=off \
-O compression=lz4 \
-O normalization=formD \
-O mountpoint=/tank \
-O xattr=sa \
-O relatime=on \
tank mirror /dev/mapper/disk0 /dev/mapper/disk1
Adding another mirror device:
zpool add -o ashift=12 tank mirror /dev/mapper/disk2 /dev/mapper/disk3
Adding a hot spare:
zpool add -o ashift=12 tank spare /dev/mapper/disk4
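Finally, a quick sanity check that the layout and properties came out as intended:
zpool status tank
zpool get autoreplace tank
zfs get compression,normalization,xattr,relatime,canmount tank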