Getting around the Transparent Huge Pages trap in Linux


July 17, 2015

A couple of weeks ago, an interesting issue came up in the field. The symptom was that the resident set size (RSS) of the Volt Active Data process would keep growing when the database was idle. The symptom only manifested itself on Red Hat Enterprise Linux (RHEL) 6.6 in KVM. There was no client workload, not even reads or statistics polling. The RSS would grow up to the virtual size and stop. Throughout the process, the virtual size of the Volt Active Data instance did not change. In one instance, we saw the RSS grow to 50GB for a Volt Active Data instance that normally used 21GB. The version of Volt Active Data used was 4.9.3.

Our initial response was to look for memory leaks in Volt Active Data. We run valgrind daily on the native code that manages data storage and processing. It was clear the leak was not there, which matched the observation that the cluster was idle.

That left us with only the memory used by the Java front end. Memory allocated in Java is used for bookkeeping or for buffers backing features such as command logging and database replication. Volt Active Data allocates memory in Java in three ways: on-heap ByteBuffers, direct ByteBuffers, and native buffers via JNI. Tracking code built into the Java allocation path records these allocations and barks loudly if a buffer is about to be garbage collected without having been discarded explicitly. Our nightly tests run with this tracking enabled, so a dangling, undiscarded buffer is highly unlikely. The other possibility was unused buffers being hoarded accidentally. To make sure this was not happening, we took a heap dump of a Volt Active Data process and searched for all buffers that were live when the dump was taken; nothing suspicious stood out. We also used SystemTap scripts to ensure all Java unsafe buffer allocations were freed properly. In addition, on-heap and direct ByteBuffers are capped by the Java max heap and max direct memory settings, so unless the JVM itself is broken, it is impossible to leak them beyond those limits.

The initial test program that exhibited the problem was a 50-minute Volt Active Data test application. Every time we tuned environmental settings or put debug code into Volt Active Data, it took another 50-minute test to check if the problem persisted. The process was very time consuming. After several trials, we shortened the reproducer down to an approximately 10-minute run of the Volt Active Data test application, with the database replication feature turned on.

After some investigation, we noticed that the buffers used by database replication (DR) were always only partially filled because of the low transaction throughput. The default size for the DR buffers was 256KB, but only a few kilobytes of each buffer were actually used. Based on these findings, we tested the idea of reducing the default buffer size to 4KB and zero-filling the buffers so that every page was mapped in up front. The result was promising: we no longer saw the RSS grow while the database was idle.
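The change amounted to something like the sketch below. The helper name and the constant are ours, purely for illustration, not Volt Active Data's actual allocation code; the point is simply the smaller default combined with touching every page at allocation time.

#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the mitigation we tested: a smaller default DR
 * buffer plus an eager zero-fill so every backing page is mapped when the
 * buffer is allocated instead of being left untouched. */
#define DR_BUFFER_SIZE 4096

static void *allocate_dr_buffer(void) {
  void *buf = malloc(DR_BUFFER_SIZE);
  if (buf != NULL) {
    memset(buf, 0, DR_BUFFER_SIZE);  /* touch every page up front */
  }
  return buf;
}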

Since we did not change any memory management code other than reducing the default size and zero-filling the buffers, it was clear that the RSS growth did not originate from leaks. Something more subtle in the environment was causing the problem.

With this in mind, we tried to mimic Volt Active Data’s memory management pattern in a standalone C program, hoping it would reproduce the unexpected RSS growth outside of Volt Active Data. It turned out that a C program of roughly 25 lines was enough to reproduce the memory growth.

#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define BUFFER_SIZE 262144      /* 256KB, the default DR buffer size */
#define MAX_LIVE_BUFFERS 1000
#define APPROX_RUNTIME 300      /* five-minute quiesced period */

int main() {
  int i;
  void* buffers[MAX_LIVE_BUFFERS];

  printf("Allocating %d %dKB buffers\n", MAX_LIVE_BUFFERS, BUFFER_SIZE / 1024);

  /* Allocate the buffers but never write to them. */
  for (i = 0; i < MAX_LIVE_BUFFERS; i++) {
    buffers[i] = malloc(BUFFER_SIZE);
  }

  printf("Starting quiesced period...will sleep for %d seconds\n", APPROX_RUNTIME);

  /* Sit idle; no user-land code touches the buffers during this time. */
  sleep(APPROX_RUNTIME);

  for (i = 0; i < MAX_LIVE_BUFFERS; i++) {
    free(buffers[i]);
  }

  return 0;
}

The program allocates 1000 fixed-size 256KB buffers and goes to sleep for five minutes without touching a single bit in them. Normally, the allocated memory would be reflected in the virtual size of the process, but physical pages would not be mapped until they were written to.

However, when we ran it on the same RHEL 6.6 on KVM setup, the RSS grew by hundreds of megabytes during the quiesced period. It grew to the virtual size and stopped there. The page fault count of the process did not change during the quiesced period, so no user-land code could have caused the physical pages to be mapped.
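To watch this from inside the reproducer, a helper along the following lines can be called periodically in place of the single sleep(). It is an illustration we are adding here, not part of the original program: it prints the virtual size and resident set size from /proc/self/status along with the fault counters from getrusage(), which stay flat even while the RSS climbs.

#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Illustrative helper: sample VmSize/VmRSS and the page-fault counters of
 * the current process so the RSS growth can be observed during the sleep. */
static void print_memory_stats(void) {
  char line[256];
  FILE *status = fopen("/proc/self/status", "r");
  if (status != NULL) {
    while (fgets(line, sizeof(line), status) != NULL) {
      if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0) {
        fputs(line, stdout);  /* virtual size and resident set size */
      }
    }
    fclose(status);
  }

  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) == 0) {
    /* Fault counts do not move during the quiesced period. */
    printf("minflt: %ld majflt: %ld\n", usage.ru_minflt, usage.ru_majflt);
  }
}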

Eventually, watching the /proc/buddyinfo output while the reproducer was running shed light on the issue. During the quiesced period, several 4MB pages were consumed every 10 seconds, and the timing corresponded with the increments in the RSS of the process. This breakthrough led us to think of Transparent Huge Pages (THP), which has been turned on by default since RHEL 6. Surprisingly, the RSS growth went away after disabling THP.
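For reference, the pattern is easy to spot with something along the lines of:

watch -n 10 cat /proc/buddyinfo

The 10-second cadence is also suggestive: it matches khugepaged's default scan interval (scan_sleep_millisecs, 10000 milliseconds on stock kernels).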

THP was designed to reduce translation lookaside buffer (TLB) misses and increase application performance. However, the downsides of THP have long been written about (TokuDB, Oracle, SAP, Cloudera, MongoDB, NuoDB), including intensive CPU usage and memory fragmentation. What was different in our case was that unused buffers were paged in implicitly. The THP documentation briefly mentions this:

“In certain cases when hugepages are enabled system wide, application may end up allocating more memory resources. An application may mmap a large region but only touch 1 byte of it, in that case a 2M page might be allocated instead of a 4k page for no good. This is why it’s possible to disable hugepages system-wide and to only have them inside MADV_HUGEPAGE madvise regions.”

The problem was also discussed in detail in this LWN article. When the THP kernel daemon khugepaged coalesces smaller pages, it allocates a zero-filled 2MB huge page and remaps the smaller pages into the huge page. If the smaller pages are never paged in, the remapping implicitly pages them in because of the zero-filling, which in turn causes the full 2MB to be accounted for in the resident size. In the worst case, a single 4KB page could be migrated to a huge page, using 512 times more memory. See here for an example.

The quick solution was to turn THP off completely, or to enable it only for programs that explicitly opt in.
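For the opt-in (or opt-out) route, a program can mark individual regions with madvise(2). The following is a minimal sketch of the opt-out direction, our illustration rather than code from Volt Active Data; under the "madvise" setting, the opposite flag, MADV_HUGEPAGE, is how a program opts a region in.

#define _GNU_SOURCE           /* for MAP_ANONYMOUS and the MADV_* THP flags */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define BUFFER_SIZE 262144

int main() {
  /* madvise() wants a page-aligned region, so use mmap() rather than malloc(). */
  void *buf = mmap(NULL, BUFFER_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED) {
    perror("mmap");
    return EXIT_FAILURE;
  }

  /* Ask the kernel to leave this range out of THP collapsing. */
  if (madvise(buf, BUFFER_SIZE, MADV_NOHUGEPAGE) != 0) {
    perror("madvise");  /* requires a kernel built with THP support */
  }

  /* ... use the buffer as usual; khugepaged will skip this range ... */

  munmap(buf, BUFFER_SIZE);
  return 0;
}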

THP is not turned on by default in Ubuntu or RHEL 5.x, so there is no problem there. If you want to make sure it is not enabled, run the following command:

cat /sys/kernel/mm/transparent_hugepage/enabled

The output should be either "always madvise [never]" or "always [madvise] never".
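To turn THP off at runtime, writing "never" to the same knobs as root is the usual approach. Note that some RHEL 6 kernels expose these knobs under /sys/kernel/mm/redhat_transparent_hugepage instead, and the change does not survive a reboot unless it is also applied at boot time.

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag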

High performance applications like Volt Active Data push hard on the operating system and the hardware. Volt Active Data is engineered to work well in most environments; nevertheless, problems can emerge anywhere in the stack, sometimes in the least expected places. We hope our findings can provide hints for other developers who have faced or are facing similar performance problems. If you experience unexpectedly intensive CPU usage, memory fragmentation, or even resident set size creep without new allocations or page faults, it may be worth trying to turn THP off.
