I’ve had a chance to revisit ZFS lately and decided to take some more notes. One highlight of this round is revalidating a quick performance test against ext4. The test was run on a Google Compute Engine n1-highmem-8 (8 vCPU, 52GB RAM) with 2x375G local SSDs attached via NVMe and 4x500G standard persistent disks, using the Ubuntu 16.04 cloud image.
- For ext4, I simply created a RAID10 software RAID volume.
- For ZFS, I used a RAID10 pool with the local SSDs as a separate SLOG.
- innodb_flush_method=O_DSYNC
- innodb_doublewrite=0
- The tests are IO-bound, with only a 1G buffer pool and 40G+ of InnoDB data in total.
- sysbench read/write, index-only update, and read-only tests were run against Percona Server 5.7.19.
Commands can be found in this gist, including the MySQL Sandbox command used to create the test instances; a rough sketch of the disk layout follows.
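As a reference point, here is a minimal sketch of the two layouts. The device names (/dev/sdb through /dev/sde for the persistent disks, /dev/nvme0n1 and /dev/nvme0n2 for the local SSDs), the pool name data, and the mirrored SLOG are assumptions on my part; the exact commands are in the gist.

    # ext4: software RAID10 across the four persistent disks (hypothetical device names)
    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mkfs.ext4 /dev/md0

    # ZFS: RAID10-style pool (two mirrored pairs) with the local SSDs as the SLOG
    # (mirroring the SLOG is an assumption; the gist may stripe the two SSDs instead)
    zpool create -o ashift=12 data \
        mirror /dev/sdb /dev/sdc \
        mirror /dev/sdd /dev/sde \
        log mirror /dev/nvme0n1 /dev/nvme0n2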
Let me explain the ZFS parameters first (a sketch of how they could be applied follows the list):
- compression=lz4 We use compression unless there is a compelling reason not to. It is faster to compress/decompress and read/write the smaller result than to read/write the same data uncompressed from disk.
- recordsize=16k/128k 16k for the data files, matching the InnoDB page size, and 128k for the redo logs, whose writes can be bigger depending on how much data gets flushed from the log buffer.
- ashift=12 Most modern storage devices use 4K sectors even though they sometimes advertise 512B. Google's storage uses 4K based on my tests.
- logbias=throughput Bias synchronous writes away from the separate ZIL (SLOG) and toward the pool disks; this gets better as the number of pool disks increases and improves throughput. The other option is latency, for when the latency of synchronous writes is more important. Supposedly there are no writes to the SLOG when using throughput, but that is not what I observed in this test, which needs further verification.
- atime=off Disables access-time updates, reducing metadata writes on the MySQL data files and unnecessary churn in the Adaptive Replacement Cache (ARC).
- sync=standard This instructs ZFS to sync to disk only when the application asks for it. For example, our test uses innodb_flush_method=O_DSYNC, which means redo log writes are always synchronous while the data files are flushed to disk when MySQL calls fsync().
- primarycache=metadata Only metadata is cached in the ARC. We also did not use an L2ARC; it would have helped with the read-only tests, but I wanted to see how far both configurations could drive the physical disks.
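For illustration, here is a minimal sketch of how these properties could be applied, assuming the pool data from the sketch above with one dataset for the InnoDB data files and one for the redo logs (hypothetical dataset names; the exact layout is in the gist):

    # Hypothetical dataset names; 16k records for data files, 128k for redo logs
    zfs create -o recordsize=16k -o compression=lz4 -o atime=off \
        -o logbias=throughput -o primarycache=metadata -o sync=standard \
        data/mysql
    zfs create -o recordsize=128k -o compression=lz4 -o atime=off \
        -o logbias=throughput -o primarycache=metadata -o sync=standard \
        data/mysql-logs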
The results:
+----------------------------------------------------------------------------+
| EXT4                                                                        |
+-----------------------------+---------------+---------------+--------------+
|                             | read_write    | update_index  | read_only    |
+-----------------------------+---------------+---------------+--------------+
| SQL statistics:             |               |               |              |
|   queries performed:        |               |               |              |
|     read:                   | 390488        | 0             | 665266       |
|     write:                  | 111564        | 210258        | 0            |
|     other:                  | 55783         | 0             | 95038        |
|     total:                  | 557835        | 210258        | 760304       |
|   transactions:             |               |               |              |
|     total:                  | 27891         | 210258        | 47519        |
|     tps:                    | 15.49         | 116.81        | 26.4         |
|     qps:                    | 309.84        | 116.81        | 422.33       |
|                             |               |               |              |
| Latency (ms):               |               |               |              |
|   min:                      | 57.5          | 0.73          | 2.75         |
|   avg:                      | 516.36        | 68.49         | 303.06       |
|   max:                      | 1943.78       | 624.89        | 794.55       |
|   95th percentile:          | 746.32        | 150.29        | 475.79       |
+-----------------------------+---------------+---------------+--------------+
| ZFS                                                                         |
+-----------------------------+---------------+---------------+--------------+
|                             | read_write    | update_index  | read_only    |
+-----------------------------+---------------+---------------+--------------+
| SQL statistics:             |               |               |              |
|   queries performed:        |               |               |              |
|     read:                   | 642236        | 0             | 1148574      |
|     write:                  | 183472        | 294325        | 0            |
|     other:                  | 91741         | 0             | 164082       |
|     total:                  | 917449        | 294325        | 1312656      |
|   transactions:             |               |               |              |
|     total:                  | 45867         | 294325        | 82041        |
|     tps:                    | 25.48         | 163.51        | 45.57        |
|     qps:                    | 509.6         | 163.51        | 729.17       |
|                             |               |               |              |
| Latency (ms):               |               |               |              |
|   min:                      | 46.81         | 2.61          | 2.95         |
|   avg:                      | 313.97        | 48.92         | 175.53       |
|   max:                      | 1045.32       | 332.37        | 589.83       |
|   95th percentile:          | 458.96        | 116.8         | 287.38       |
+-----------------------------+---------------+---------------+--------------+
The ZFS numbers show between 1.5x and 2x better QPS and TPS, plus better overall latency. While this is good and all, I am not entirely sure whether O_DSYNC is the best flush method on Google's storage. I would assume that GCP persistent disks are POSIX compliant, and if that is the case, I’d also want to test O_DIRECT.
For now, the potential is really exciting:
- With ZFS compression, depending on how compressible your data is, you can fit more data on the same hardware configuration compared to ext4.
- Take advantage of ZFS snapshots for backups, especially for large datasets, and transfer them directly between nodes.
- Support for encryption (pending upstream merge, another blog post) with the same or better performance than LUKS/dm-crypt.
GCP has support for disk encryption and snapshots, so ZFS may seem redundant. However, ZFS snapshots have advantages when you use multiple disks, and they can be sent directly to another node (think PXC SST for large datasets, another blog post :)). I also intend to redo these tests when I get my hands on the right hardware.
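A minimal sketch of what that could look like, assuming a dataset named data/mysql and a peer host node2 reachable over SSH (hypothetical names, not from the gist):

    # Take a snapshot and stream it to the other node
    zfs snapshot data/mysql@backup1
    zfs send data/mysql@backup1 | ssh node2 zfs receive data/mysql
    # Later, send only the changes since the previous snapshot
    zfs snapshot data/mysql@backup2
    zfs send -i data/mysql@backup1 data/mysql@backup2 | ssh node2 zfs receive -F data/mysql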
What’s next?
- Test with O_DIRECT.
- How much impact does logbias=latency have with the NVMe SLOG?
- Will ZFS + snapshots on large datasets be better with PXC?
- Measure encryption performance against LUKS.