12 February 2008 - 3:43Memory management
Over the past week or two, I read three semi-related papers on the general theme of memory management in applications: one was on the memory overhead of various application design choices, another was on the cost of garbage collection, and the third was an older but interesting paper on custom memory allocators. For my own research, I read a lot of papers from OSDI/PPOPP/SOSP/ASPLOS/Usenix/HotOS (and others with topical overlap like ICDCS, HPDC, SC, PODC, SIGCOMM), but I sprinkle in pleasure reading from other areas of interest — that’s not to say that I don’t actually like reading OS and distributed systems papers; it’s just nice to get topical diversity. In particular, I tend to select a lot of papers that are related to Programming Languages and Compilers as well as Information Security. Under the broad PLC umbrella, I tend to like papers related to language runtime/library implementation, compiler implementation and functional programming as well as issues of software engineering (in the broad sense of runtime/language issues to make programmers productive, less-error prone, etc.). Coincidentally, all three of these papers appeared at OOPSLA.
“The Causes of Bloat, The Limits of Health”
The first paper I mentioned is “The Causes of Bloat, The Limits of Health” (Nick Mitchell and Gary Sevitsky) from OOPSLA ‘07. I found this paper interesting for a few reasons, but it shows how application design choices in mapping a data model onto concrete data structures may lead to chronically wasteful memory usage where the amount of real data is overwhelmed by collection metadata/overhead (like pointers). The paper explores these issues in the context of Java and Java applications; the basic premise is applicable in any language, but things like Java’s object headers exacerbate the problem when compared to a comparable data structure in C. A key metric here, collection health, is a comparison of actual application data to overhead imposed by object headers, collection metadata, pointers, etc. One example they provide is putting Java Strings in collections:
Observe that a String must have at least 140 characters in order to achieve an S < 1.2 (i.e. no more than 80% actual data). When Strings are placed in a standard Java HashSet, they must have at least 270 characters to achieve this level of health. On the flip side, placing 10-character Strings into a HashSet will result in an S of no less than 3.7 (i.e. no more than 27% actual data), no matter how many Strings are placed into the HashSet.
They define “good health” for a data structure as having at least an 80% real data (actually, less than 20% overhead/data ratio). Even for a single string to qualify, it needs to at least 140 characters, because a String is actually fairly complicated under the hood: a String is an Object and has an object header and some fields (a cached hash code, a length and potentially an offset into the backing array), as well as a pointer to a char[], which also has a header (containing the length and other VM bookkeeping information). Ignoring the String fields, they put the overhead at 28 bytes per String, and they note that, in some cases, object headers could be up-to 20 bytes a piece.
Some of the applications they tested had less than 20% actual data due to inappropriate choice of Java collections or container objects. It’s not just about choosing the wrong data structure or having the default collection size be too big, it’s also about choosing the wrong object “containers” to hold values (decisions that may seem inconsequential). Java doesn’t have tuples (a huge gripe of mine), and one application needed to hash a pair of items, an int and some other Object. The developer chose java.util.Arrays$ArrayList, an inner-class of java.util.Arrays used in the asList method, because it was convenient to specify literal values by making a simple call like this: Arrays.asList(new Object[] {1, obj}). You couldn’t necessarily just put the Object array in the collection directly because array equality is object identity rather than content equality (i.e. it just compares the addresses of the arrays). Anyway, using Arrays$ArrayList causes a lot of overhead because the ArrayList is itself an Object with attributes and it also keeps a separate backing array (which has its own header overhead). Since the backing array is an Object array, storing an int actually requires boxing the primitive int into an Integer object wrapper, which is even more overhead, and every little bit adds up when you multiply it by the number of these Arrays$ArrayList objects you will create. If the author had instead created a Pair class with an int field and an Object field, it would reduce the memory usage of all of these (int, Object) pairs in this application to about a third of the original size (1.7MB versus 4.9MB).
This kind of study makes you think a little more carefully about choices you might take for granted or overheads you might dismiss as negligible. I know in the past I’ve done things like the above scenario with ArrayList (although not using that exact class) — but simply taking an existing Java class already present in the API that is close to what I want rather than making my own trivial two element wrapper class to work around the lack of tuples. If it’s a one-off it won’t be a big deal, but if it goes in a collection with many like objects, eventually the overhead may become unreasonable. This paper also reminded me of the sometimes significant size overhead of Java objects compared to plain data, and that’s good to keep in mind. Now don’t get me wrong, I’m not one of those systems programmers who likes to beat his chest and loudly proclaim that Java is slow and for n00bs and that we should all be coding everything in hand-tuned assembly language. I spend a lot of my time coding in C, and when I end up doing stuff in Java, it’s often a breath of fresh air (except when I need unsigned integers). Even in a high-level language, though, I personally like knowing what’s going on underneath.
“Quantifying the Performance of Garbage Collection vs. Explicit Memory Management”
The second paper is “Quantifying the Performance of Garbage Collection vs. Explicit Memory Management” (Matthew Hertz and Emery D. Berger) from OOPSLA ‘05. A post on Lambda The Ultimate a few months ago referenced this paper. I had heard the conclusion of the paper cited before, but I’d never actually read the paper so I put it in my reading queue. The authors note that the garbage collection process visits more pages of memory than an application would using explicit allocation. This is bad for locality and increases pressure at many levels of the memory hierarchy. In the presence of physical memory pressure and demand-paged virtual memory, it causes significantly more paging overhead which is very expensive (it can cause an order of magnitude performance degradation). In addition, the authors note that you need a larger heap to achieve performance parity with manual allocation because smaller heaps increase (internal) memory pressure, which leads to more frequent collections, which has baseline overhead and of course leads to visiting a lot of pages. With 5x as much memory, the garbage collector will generally equal or surpass manually allocation, and 3x memory gives an average 17% performance penalty over manual allocation, while lower factors degrade significantly.
Again, I’m not one of those “macho programmers” who likes to complain about garbage collection being too expensive and never appropriate. I think managed languages and garbage collection are really a good thing and even though it’s often more expensive, I don’t think it’s too expensive for most applications. And with the imminent rise in concurrency and parallel programming, the usefulness of garbage collection is even greater; manual storage management in the face of concurrency is often even more painful — a lot of novel concurrent data structures just assume the presence of garbage collection. There’s always programming features that we eventually take for granted which people gripe about being too expensive at one time: people complained about the overhead of operating systems (versus running applications on base metal), and later it was compilers/high-level languages that were “too expensive” (versus assembly language). The situation with garbage collection may not be exactly analogous, but current trends seem to indicate that it’ll be a given in due time.
“Reconsidering Custom Memory Allocation”
The last paper is “Reconsidering Custom Memory Allocation” (Emery D. Berger, Benjamin G. Zorn and Kathryn S. McKinley). The authors tested custom memory allocators versus the general purpose Lea allocator, which is just a really good general purpose allocator created by the amazing Doug Lea (who is also one of the people responsible for the fantastic java.util.concurrent package added via JSR 166). The Lea allocator is also used in glibc in a modified form (ptmalloc/ptmalloc2/ptmalloc3, which is basically a Lea allocator enhanced for multi-threaded allocation). The authors test various applications that use custom allocators and found that custom allocators were rarely worth it. In their conclusion, they state:
Despite the widespread belief that custom allocators should be used in order to improve performance, we come to a different conclusion. In this paper, we examine eight benchmarks using custom memory allocators, including the Apache web server and several applications from the SPECint2000 benchmark suite. We find that the Lea allocator is as fast as or even faster than most custom allocators. The exceptions are region-based allocators, which often outperform general-purpose allocation.
The fact that the Lea allocator outperforms most custom allocators isn’t a coincidence, it was an explicit design goal. Doug Lea says on his malloc webpage:
I soon realized that building a special allocator for each new class that tended to be dynamically allocated and heavily used was not a good strategy when building kinds of general-purpose programming support classes I was writing at the time. (From 1986 to 1991, I was the the primary author of libg++ , the GNU C++ library.) A broader solution was needed — to write an allocator that was good enough under normal C++ and C loads so that programmers would not be tempted to write special-purpose allocators except under very special conditions.
I’d say this paper shows that he largely succeeded. But I think the paper’s conclusion is just a classic lesson in optimization. Knuth famously stated, “premature optimization is the root of all evil.” This isn’t to necessarily say that the developers of applications like Apache and gcc were oblivious and guilty of premature optimization. The authors note that this may be a factor of general purpose allocators getting better as well as program evolution. At one time, gcc’s runtime was dominated by parsing, which benefited from the custom allocator; now optimization is where more cycles are spent and so the game has changed. In any event, the take-home message is about future practice; don’t rush in to custom memory allocators just because you think they’re faster. Most applications won’t benefit, and you’ll save yourself a lot of trouble.
When I was more of a novice, I was tempted to perform premature optimization and it rarely paid off in terms of the time investment. Now I’m much more concerned with getting the structure of the system so that it doesn’t impose high overhead in general — a holistic view of the entire system rather than micro-optimizing specific operations. After a while you just get a feel for how you can structure a system so that cumulative overhead is avoided. Obviously part of that is in using the right data structures and algorithms (from an asymptotic complexity standpoint, but also considering constant factors when they matter), but also thinking about things like how data flows through the system, so you can avoid unnecessary copies. After the general structure is there, you profile and you can always micro-optimize hot paths. If you micro-optimize while the structure is still in flux, Murphy’s Law dictates that whatever part you optimize will change to foil you. Anyway, this post is getting quite long, so I just wanted to mention some high-performance, drop-in replacement (general-purpose) malloc implementations:
- ptmalloc3 — newer and faster than ptmalloc2, which is in glibc.
- Hoard — fast allocator developed as a research project at UMass Amherst under Prof. Emery Berger (a co-author of two of the papers I mentioned)
- tcmalloc — Google’s thread caching malloc. At one time it was the fastest out there, but now various others may be equally competitive or faster.
- nedmalloc — supposedly faster than Hoard, ptmalloc2 and tcmalloc.
- libumem — Solaris’s umem allocator, made portable.
No Comments | Tags: Research Content