Abstract
Persistent Memory (PM) is a new class of device that provides faster access than conventional storage devices such as SSDs. Among the several methods available for accessing files on PM, the combination of filesystem direct access (DAX) and mmap() is used to take advantage of its native abilities: the buffer cache is bypassed and PM is accessed at byte granularity. To make sure that stored data is persisted, the data in the CPU cache must be synchronized to PM. One option is the msync() system call; another is pmem_persist(). DAX-enabled filesystems implement msync() so that it synchronizes data with CPU cache-flush instructions such as CLWB or CLFLUSHOPT. The alternative, pmem_persist(), is included in the Persistent Memory Development Kit and issues those CPU instructions directly from user space. msync() has the advantage of requiring no modification to legacy applications, while pmem_persist() is faster because it incurs no context switches. There are legacy applications that use mmap() and msync() to gain high performance even though their original targets are traditional storage devices; they should be able to take advantage of PM's abilities without any code modifications. Kyoto Cabinet, a library of routines for managing a database, is one example [1].

This paper reports a case in which msync() resulted in tremendously slow performance on a DAX-enabled ext4 filesystem. The throughput of a 4KiB random store followed by a 4KiB msync() was 1.1k IOPS, while 192k IOPS can be obtained with pmem_persist(). The file size was 100MiB in both cases and the whole region was mapped. Moreover, a SET operation on a Kyoto Cabinet database with OAUTOSYNC enabled takes 4.87ms when the database is stored on PM, but 0.77ms when it is stored on an SSD; the time spent in msync() is 3.00ms and 0.33ms, respectively. The PM was Intel Optane DC Persistent Memory and the Linux kernel was 5.4.0.

This performance loss is a side effect of the 2MiB hugepage. mmap() tries to allocate 2MiB pages instead of 4KiB pages when a region larger than 2MiB is requested. Since a single dirty flag is associated with the whole 2MiB page, the kernel has to flush all 2MiB at once even though msync() requests only a 4KiB flush. Although hugepage support has several advantages, such as fewer page faults and smaller page tables, this behavior hinders legacy applications from using PM without code modifications. The results below show how large this hugepage side effect is; they were obtained with a kernel in which CONFIG_FS_DAX_PMD is disabled. The throughput of a 4KiB random store followed by a 4KiB msync() increases to 174k IOPS. In addition, the latency of a Kyoto Cabinet SET operation decreases to 0.30ms and the time spent in msync() decreases to 0.03ms.

Although disabling hugepages yields better performance, it cannot be the best solution, since hugepages offer several advantages. So, I will conclude the abstract by mentioning a method that would avoid the performance loss even with hugepages. Since the root cause is that one dirty flag manages the 2MiB page as a whole, keeping 512 dirty flags, one per 4KiB, would work. This would reduce the amount of wasted cache-line flushing to nearly zero when msync() requests the flushing of a small region. In addition, some optimization may be required for large flush requests, since flushing 512 4KiB pages would take longer than flushing one 2MiB page.
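The following is a minimal sketch, in C, of the 4KiB-store-plus-4KiB-flush measurement described above, assuming a pre-created 100MiB file on a DAX-mounted ext4 filesystem. The file path, iteration count, and omission of timing code are assumptions for illustration, not the exact harness behind the reported numbers. Building with -DUSE_PMEM_PERSIST -lpmem selects the PMDK pmem_persist() path; otherwise the msync() path is used.

    /*
     * Sketch of the benchmark: 4KiB random store followed by a 4KiB flush.
     * Assumes /mnt/pmem/testfile is a pre-created 100MiB file on a
     * DAX-enabled ext4 mount (path and iteration count are placeholders).
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #ifdef USE_PMEM_PERSIST
    #include <libpmem.h>
    #endif

    #define FILE_SIZE  (100ULL << 20)   /* 100MiB, as in the experiment */
    #define BLOCK_SIZE 4096ULL
    #define N_OPS      100000

    int main(void)
    {
        int fd = open("/mnt/pmem/testfile", O_RDWR);
        if (fd < 0)
            return 1;

        char *base = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED)
            return 1;

        for (long i = 0; i < N_OPS; i++) {
            /* Dirty one randomly chosen, 4KiB-aligned block. */
            uint64_t off = ((uint64_t)rand() % (FILE_SIZE / BLOCK_SIZE)) * BLOCK_SIZE;
            memset(base + off, (int)i, BLOCK_SIZE);

    #ifdef USE_PMEM_PERSIST
            /* User-space flush: CLWB/CLFLUSHOPT plus a fence, no system call. */
            pmem_persist(base + off, BLOCK_SIZE);
    #else
            /* Kernel path: on DAX ext4 with 2MiB hugepages, this flushes the
             * whole dirty hugepage even though only 4KiB is requested. */
            msync(base + off, BLOCK_SIZE, MS_SYNC);
    #endif
        }

        munmap(base, FILE_SIZE);
        close(fd);
        return 0;
    }

On a hugepage-backed DAX mapping, the msync() variant pays the 2MiB flush described above, while the pmem_persist() variant flushes only the cache lines it actually wrote, which is consistent with the gap between 1.1k and 192k IOPS.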
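The per-4KiB dirty tracking proposed above can be illustrated with the following conceptual sketch. It is a user-space model only, not existing kernel code: the names subpage_dirty_map, mark_dirty, sync_range, and flush_range are hypothetical, and flush_range() merely stands in for the CLWB/CLFLUSHOPT-plus-fence sequence the kernel would issue.

    /*
     * Conceptual illustration of per-4KiB dirty tracking inside a 2MiB
     * hugepage: 512 dirty bits instead of a single dirty flag.
     */
    #include <stddef.h>
    #include <stdint.h>

    #define HUGEPAGE_SIZE (2ULL << 20)
    #define SUBPAGE_SIZE  4096ULL
    #define SUBPAGES      (HUGEPAGE_SIZE / SUBPAGE_SIZE)   /* 512 */

    /* 512 dirty bits (64 bytes) per 2MiB hugepage instead of one flag. */
    struct subpage_dirty_map {
        uint64_t bits[SUBPAGES / 64];
    };

    /* On a write, mark only the touched 4KiB sub-page as dirty. */
    static inline void mark_dirty(struct subpage_dirty_map *m, uint64_t off)
    {
        uint64_t idx = off / SUBPAGE_SIZE;
        m->bits[idx / 64] |= 1ULL << (idx % 64);
    }

    /* Placeholder for flushing the cache lines covering [addr, addr + len). */
    static void flush_range(void *addr, size_t len) { (void)addr; (void)len; }

    /* msync()-style sync: flush only dirty sub-pages overlapping the request,
     * so a 4KiB request no longer flushes the whole 2MiB page. */
    static void sync_range(struct subpage_dirty_map *m, char *hugepage_base,
                           uint64_t off, uint64_t len)
    {
        uint64_t first = off / SUBPAGE_SIZE;
        uint64_t last  = (off + len - 1) / SUBPAGE_SIZE;

        for (uint64_t idx = first; idx <= last; idx++) {
            if (m->bits[idx / 64] & (1ULL << (idx % 64))) {
                flush_range(hugepage_base + idx * SUBPAGE_SIZE, SUBPAGE_SIZE);
                m->bits[idx / 64] &= ~(1ULL << (idx % 64));
            }
        }
    }

    int main(void)
    {
        static char hugepage[HUGEPAGE_SIZE];   /* stand-in for a PM-backed 2MiB page */
        struct subpage_dirty_map map = { { 0 } };

        mark_dirty(&map, 5 * SUBPAGE_SIZE);                          /* dirty one 4KiB block */
        sync_range(&map, hugepage, 5 * SUBPAGE_SIZE, SUBPAGE_SIZE);  /* flush just that block */
        return 0;
    }

A 512-bit map costs only 64 bytes per 2MiB page. For a flush request covering most of the hugepage, the kernel could still fall back to flushing the whole page, which relates to the optimization for large flush requests mentioned above.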