O_DSYNC Flush Method for MySQL on Google Cloud
In my previous article, I tested ZFS on Google Compute Engine using persistent disks and local SSDs. ZFS showed better performance with the help of a fast dedicated SLOG on a local SSD with the NVMe driver, though the cloud, and Google Cloud in particular, may not be the best platform for these tests.
Fast forward to today: I finally had the chance to test O_DSYNC for MySQL on Google Cloud, building on that previous article. Using the same instance class and specs (n1-highmem-8, 4x500GB standard persistent disks, RAID10 with mdadm, ext4), I ran the tests once with innodb_flush_method=O_DSYNC and once with innodb_flush_method=O_DIRECT, and the results are quite surprising.
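To make the difference between the two sync styles concrete, here is a minimal Python sketch that times small appends to a file opened with a synchronous flag against plain writes that are each followed by fsync(). It rests on my reading of the MySQL 5.7 documentation, which says the O_DSYNC flush method opens the log files with a sync flag while O_DIRECT relies on fsync(). The function names, file names, block size and write count are all made up for illustration. It is not how InnoDB does its IO and it is not the sysbench benchmark used in this article.

```python
import os
import tempfile
import time

# Assumption: fall back to O_SYNC, then to no flag, where O_DSYNC is missing.
SYNC_FLAG = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))


def timed_writes(path, extra_flags, fsync_each, count=50, size=512):
    """Append `count` blocks of `size` bytes; return per-write latencies in seconds."""
    block = b"x" * size
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    fd = os.open(path, flags, 0o600)
    latencies = []
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, block)
            if fsync_each:
                # Explicit flush after every write (the fsync() style).
                os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


def compare(directory, count=50):
    """Return the average latency in seconds for both styles (hypothetical helper)."""
    open_sync = timed_writes(os.path.join(directory, "open_sync.log"), SYNC_FLAG, False, count)
    write_fsync = timed_writes(os.path.join(directory, "write_fsync.log"), 0, True, count)
    return {
        "open_sync": sum(open_sync) / len(open_sync),
        "write_fsync": sum(write_fsync) / len(write_fsync),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, avg in compare(tmp).items():
            print(f"{name}: {avg * 1e6:.1f} us per write")
```

Running it on the actual data volume rather than a temporary directory would be closer to the setup here, but the numbers only give a rough feel for how the storage layer treats each kind of flush.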
The results show O_DSYNC having much lower latency than O_DIRECT. As a result, the actual throughput is also significantly higher.
However, the question is whether O_DSYNC is safe: is the persistent disk POSIX compliant, and does it honor the SYNC flag when set? Based on a dated paper, and assuming GFS is the underlying system, Google persistent disk might not be the safe option (“[GFS] does not implement a standard API such as POSIX”). I could not find any other definitive literature to answer this question. Another thing to note is that, according to Google’s documentation, a write is also replicated to 3 other locations, which further complicates the question of how SYNC and DIRECT behave underneath.
Without backing evidence, the only other way to confirm this is to look at how Google Cloud SQL for MySQL actually implements its writes. In the default configuration of a Cloud SQL instance running 5.7.14, innodb_flush_method is set to O_DIRECT, and this value is not configurable. So I ran the same benchmarks as before, except this time I had to use tables of 60 million rows because the buffer pool size on Google Cloud SQL is also not modifiable. The total InnoDB dataset size was about 110GB, with a buffer pool of 38GB from the instance class. MySQL config and sysbench command template can be found here.
From the results above, just by looking at the averages and 95th percentiles, we can reasonably conclude that Google itself does not recommend and/or use O_DSYNC for database workloads, even though it provides better IO latency and throughput.
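For readers who want to run the same comparison on their own numbers, here is a small sketch of the two statistics mentioned above. The latency samples are made up for illustration and are not measurements from these benchmarks. The percentile uses the simple nearest-rank definition, which may differ slightly from how sysbench derives its 95th percentile.

```python
import math


def mean(samples):
    """Arithmetic mean of a non-empty list of latencies."""
    return sum(samples) / len(samples)


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, with one slow outlier.
latencies_ms = [4.1, 3.9, 4.3, 4.0, 25.0, 4.2, 3.8, 4.4, 4.1, 4.0]

if __name__ == "__main__":
    print(f"avg: {mean(latencies_ms):.2f} ms")
    print(f"p50: {percentile(latencies_ms, 50)} ms")
    print(f"p95: {percentile(latencies_ms, 95)} ms")
```

A single slow request barely moves the median but shows up in both the average and the 95th percentile, which is why the two are worth reading together.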
That being said, one potential use case for O_DSYNC on Google Cloud is Galera/Percona XtraDB Cluster based deployments, where per-node durability is not the top priority, especially for IO-bound workloads.