7 March 2010 - 17:30Storage and scale

In 2008, I worked at IBM Research Almaden on a project called Panache. I was recently pleased to find that an in-depth paper on Panache appears in the proceedings of FAST ‘10 (PDF).

The main impetus for my post was not specifically on Panache, however. Over the years I’ve been thinking about the evolution of storage as the scale grows, and it comes up in conversation maybe once every six months. Most recently, it came up during a job interview when talking about some of the projects I’d worked on. I made a note to myself to post something on this.

The growth of storage solutions tends to follow a certain trajectory. In the following I’m going to be very loose with definitions and paint with a broad brush. True backup and redundancy for availability are different, but here I’m just covering redundancy for availability:

  • Single drive / JBOD — home users tend to start out with one hard drive or a collection of disks; there’s no redundancy here so they may just manually sync to another drive or system to guard against failure (or the most popular form of failure contingency: nothing).
  • Single-host RAID / NAS — after home users’ storage needs increase in scale, they may move to a RAID solution which adds some disk-level redundancy. RAID isn’t a backup solution (you can still delete or corrupt your own files by accident), but it helps give you some redundancy in the face of drive failure. Maybe these drive arrays are in a home-network visible NAS exporting storage via NFS or CIFS or maybe they’re just in one main system where you use the storage locally. Small enterprises may use this kind of NAS solution too. Crude scaling “up” may be achieved by having several different NAS appliances serving separate sets of files.
  • SAN / “Enterprise” storage — after you reach a certain level of scale, you want to eliminate the single point of failure imposed by the NAS / single host with a RAID setup. You’re protected from a drive failure, but the host could still fail, the network connection to the host could fail, the drive controller could fail, etc. At this point you tend to move to a storage area network (SAN) setup, often from a company like NetApp, IBM or Panasas. The idea behind a SAN is that your machines connect to your disks like switched local networking. You may have 16 machines and 128 disks, and all machines can talk to any disk. If a machine fails or a controller fails, the rest of the machines can still talk to the same disks just fine. Generally there is some redundancy at the IO controller level, too. These are typically coordinated by a shared-disk filesystem like GPFS or OCFS2.
    In a large enterprise, it is impractical for all machines that need to access the storage to have a direct SAN connection, so in this case you may have a small number of storage-connected NAS hosts that export the same coherent underlying shared-disk filesystem via NFS or CIFS. Then clients can access any exporting host via a standard network filesystem protocol, and any NAS host can die without the entire filesystem becoming inaccessible. Redundancy may be achieved by layering the filesystem over redundant disk arrays in the SAN, or by doing parity / Reed Solomon coding within the filesystem on individual disks (I talked about “Declustered RAID” in an earlier post), or by doing file-level and metadata-level replication.
  • Google Filesystem — “Enterprise” storage based on SANs is way more expensive per unit of storage than commodity disks, and there is a limit to how far you can scale “out” the number of hosts with access to the storage and scale “up” the amount of storage. Once you get to the level of companies like Google or Amazon, you are already going to have sophisticated internal software for managing failures when dealing with your own computation. At this level of scale, it makes sense to build your own custom data storage solution using the same failure-handling principles with unshared, commodity disks. Google filesystem (GFS) is layered over a bunch of commodity servers with local non-redundant storage. The storage is aggregated into a coherent distributed filesystem and data is replicated at the “chunk” level (a large sub-unit of a file). This will cost a lot less for more storage and you can build in all kinds of special considerations for your own uses (e.g. MapReduce works in concert with GFS to try to process data locally where it is stored).

As I mentioned above, I’m really glossing over a lot here and speaking in generalities. Some things I’ve glossed over:

  • In reality there are very large scale systems that do use SANs (i.e. HPC clusters for scientific simulation) rather than local storage and custom software. Batch, shared-nothing crunching via MapReduce has different data requirements than tightly coupled parallel MPI jobs. The final stage of “scale” listed above is really about what you need to do with all of that data.
  • GFS/HDFS aren’t replacements for traditional POSIX filesystems — they are accessed through a library, they don’t provide the same semantics (GFS has atomic appends which may append the data more than once), and they were designed to be used in a specific context for batch data analysis in concert with MapReduce/Hadoop. Given the current state of development, you probably wouldn’t archive important datasets within HDFS. The finer details of Google’s own strategies are private, but given the intended uses of BigTable, it’s probable that a lot of data is actually archived within GFS on top of BigTable. This makes sense if the intended uses of the datasets are all within the framework of MapReduce — jobs that can read out of BigTable.
  • Large internet companies with these specialized storage solutions still probably use SAN-based systems for some purposes. It may not make sense, for example, to store source repositories in GFS or BigTable or to try to export data in GFS like an NFS share.
  • There are, of course, many other things that don’t fit so neatly into these categories. For example, a while back I wrote a post where I talked about distributed peer-to-peer filesystems that run over wide-area networks (often built on DHTs).

As an aside, it’s interesting how there is an somewhat parallel scaling trajectory to databases as well. You start off with stock relational databases, then you tend to shard and denormalize, then from there you may loosen semantics further, interpose memcached, try to do custom transactional replication, etc.; if that can’t keep up, you may eventually progress to your own eventually consistent key-value store or document DB or non-relational tabular data storage solution or any other number of interesting variants. But I will save that for some future post.

No Comments | Tags: Research Content

4 November 2009 - 2:10Flash and the storage hierarchy (again)

Earlier this year in a post titled “Disk is the new disk”, I ruminated a bit on the implications of the changing storage hierarchy. One of the frequently discussed changes on the horizon is the introduction of flash memory (or other solid state, non-volatile memories like phase-change memory, which I mentioned in that post) into the storage hierarchy. One question I didn’t address at the time is, “do these solid state non-volatile memories belong as external or internal memory?” That is to say, “do these belong on the memory bus or IO bus?” With flash memory, the latencies and writing semantics (i.e., individual bytes can’t be arbitrarily modified; entire blocks are erased at once) tend to naturally place flash memory devices on the IO bus as external memory. However, other technologies may be more appropriate to expose as directly CPU-addressable memory. A recent paper at SOSP ‘09 explores this in the context of Phase-Change Memory (although the techniques are applicable to any directly-addressable BPRAM — byte-addressable persistent memory): “Better I/O Through Byte-Addressable, Persistent Memory”. Unlike flash, PCM can be read and written in a byte-addressable manner, and access times are fast enough to merit putting PCM on the CPU’s memory bus.

The paper introduces BPFS, a filesystem for BPRAM which exploits various properties of directly-addressable non-volatile memory to improve performance and durability over traditional filesystems. It’s interesting how a lot of the very thorny problems of filesystem consistency associated with block-based disk interfaces are solved elegantly by applying traditional in-memory style atomic updates to data structures in BPRAM. Traditional filesystems use relatively hairy techniques like soft-updates or journaling to try to ensure that persistent data structures are updated in a way that preserves metadata consistency even if the system fails in the middle of a write. With byte-modifiable data structures, you can use atomic update techniques similar to those used in lock-free data structures (e.g. do modifications on a private copy of data and then atomically “publish” it by setting a pointer field pointing to the private copy — and make sure to put architecture-appropriate fences so reordering doesn’t bite you!*). When I started reading the paper, two classic systems paper quickly came to mind: “Lightweight Recoverable Virtual Memory” and “Free Transactions with Rio Vista”. Rio Vista owes a lot to RVM, but one of the memorable things about Rio Vista is that it uses battery-backed RAM to perform atomic and durable transactions on persistent data structures. Like BPRAM, the persistent but directly-modifiable nature of battery-backed DRAM really simplifies many aspects of the system. Naturally, when I got to the end of the paper, I found that they cited Rio Vista in their related work. Anyway, it’s good to see that one of the co-authors is a former Georgia Tech classmate, Derrick Coetzee.

* One of the most interesting parts of the paper is their “epoch” mechanism — while fences work fine for volatile data structures, they don’t provide strong enough semantics when you have persistent memory. Memory fences aren’t strong enough because they just affect a CPU’s view of memory, not the actual contents of DRAM. A CPU’s view of memory is basically DRAM plus “diffs” of more recent data at various caching levels. With BPRAM, if the power goes out, the diffs in cache disappear and then the persistent data left may not be consistent by itself. You have to make stronger guarantees about when persistent data gets written to maintain proper stored data structure consistency.

Linux and flash
Although the SOSP paper is recent and on my research radar, it is not what prompted this post. The original impetus came from my recent viewing of the Linuxcon 2009 roundtable discussion. This roundtable gained some Slashdot notoriety because Linus made a comment about the kernel being “huge and bloated,” due to its expanding feature set and icache footprint. During the Q&A session, someone asked a question regarding flash RAM. The question was predicated on the assumption that flash will transition to directly addressable (internal) memory and was asking whether, since flash cells have limited lifetimes, would Linux eventually integrate code to deal with directly accessible memory failing. Ted Ts’o sort of dismissed the question and said that he believed that the right place to address failure properties are in the hardware (and he didn’t necessarily agree that flash would move to directly addressable memory). Currently, flash exposed with a “disk drive” interface — i.e., flash that looks like a fast hard drive — handles failure and wear leveling underneath the storage interface.

This reminded me, however, that there is another class of flash support in Linux that is commonly misunderstood. Linux has a MTD (Memory Technology Devices) subsystem which supports “bare flash” devices. These devices are more common on embedded systems, and basically “bare flash” is exposed flash memory that doesn’t look/act like a standard hard drive. The software above gets to access the real flash blocks and has to handle wear leveling, dealing with bad blocks and also dealing with write/erase semantics (things that the firmware would do in a hard drive-like flash disk). MTD devices don’t act like block devices and actually expose three operations: read, write, and erase. There’s a whole class of Linux filesystems built to run on top of these “bare flash” devices: YAFFS, JFFS2, LogFS, and UBIFS, and they’ve also factored out some of the common functionality used in a lot of these filesystems into a separate UBI (Unsorted Block Images) layer. But I see a lot of misunderstanding of the point of these filesystems. Many casual Linux users or observers think that these filesystems are flash-optimized regular filesystems to be used on top of hard-drive like flash devices (or CF/SD/etc.). Anyway, I see this mistake made a lot on forums and such so it came to mind after the Linux roundtable question.

No Comments | Tags: Linux, Research Content

27 July 2009 - 21:37“C++ is a superpower”

A few months ago, Bjarne Stroustrup came to Georgia Tech to give a talk sponsored by the Georgia Tech chapter of the ACM. Bjarne’s talk was about two hours, and it was mostly a broad overview of the history of C++: its significance, its evolution, Bjarne’s broad language design philosophy and the future directions he envisions for C++. Before the presentation started, I talked to Bjarne one-on-one for a bit about the most interesting aspect of the upcoming C++ 0x standard to me, concurrency-related features and the language memory model. I also asked him about his thoughts on garbage collection and shared-state concurrency versus message-passing. On the latter point, like me, he believes that the future of concurrent programming (at least in the near to foreseeable future) involves the use of a large arsenal of techniques for different situations rather than one dominant paradigm. It’s a pragmatic view and ostensibly supported by the fact that this is not the first “boom cycle” for parallel programming. Arguably shared-state concurrency doesn’t scale very easily to large systems, but shared-state concurrency within finer-grained components is manageable today; using actor-style message-passing between coarser grained components with shared-state concurrency in limited places (perhaps for high-performance primitives or highly coupled functionality) seems like an analogue to the way we use inline assembly or intrinsics for limited sections of performance-critical library routines (like memcpy or BLAS primitives) and everywhere else use more manageable high-level languages.

Garbage Collection
One topic where I didn’t fully agree with Bjarne was the issue of garbage collection. Personally, I think the addition of hooks to support garbage collection to C++0x is an essential decision for enabling effective concurrency programming. Storage management in single-threaded code is drudge work, but when concurrency is involved, it suddenly becomes a much thornier problem. People find themselves building their own slapdash garbage collection-like facilities to manage the lifetime of data objects that may be used in many different threads. These user-built facilities are tricky and often induce extra contention. Ultimately many concurrent algorithms, particularly lock-free ones, rely on the existence of a garbage collector (to prevent certain instances of ABA problems, for example*). Sure there are classes of techniques for lock-free reference counting (e.g. Detlefs et al.), but it seems like the modern mainstream languages with GC like Java and C# are gaining powerful high-level concurrent programming utilities a lot quicker than C and C++, and I suspect the higher difficulty of storage management is one influencing factor (obviously, the lack of a memory model and portability concerns are other major issues). Anyway, I mentioned to Bjarne that I thought the addition of garbage collection support was good because a lot of concurrent algorithms rely on it, and he said something to the effect of it being a crutch or somewhat lazy to rely on garbage collection just to avoid complicating concurrent algorithms with intricate reference counting. I don’t want to misrepresent his opinion, but that’s the general gist I got from his comment.

As far as I’m concerned, garbage collection is one of those largely inevitable evolutions of higher-level programming that is begrudged by some but eventually taken as a given. Some complained about overhead induced by operating systems when computers tended to run single applications; the same thing happened with the first compiled languages (versus hand-written assembly). Later it was VMs for managed languages or GC. Eventually, however, I think these will be taken for granted in mainstream programming. Java has already gone a long way to making garbage collection more of a necessary and mainstream feature. Now, certainly there are some gripes about the Java throwing out the baby with the bath water in not providing something as useful as explicitly scoped destruction to guarantee cleanup of certain related non-memory resources (like open files or sockets), and providing too weak semantics for finalize, but this isn’t a fundamental flaw of GC. Anyway, that’s tangential, but my view is that GC is nearly essential in new concurrent era, and I’m glad that C++0x is taking a first step in the right direction on that front.

Memory Model and Atomics
The rigorous memory model is, of course, also essential for great concurrent programming, and well known experts like Hans Boehm (and many others) have put a lot of work into that. I mentioned to Bjarne that I was also pleased that they are offering atomics and even support for relaxed consistency reads and writes useful in certain low-level concurrent algorithms. I told him that I thought that the inclusion of selective relaxed consistency, while definitely addressing only a limited audience of very skilled programmers, is really vindicated in light of Java’s experience — after initially resisting exposing such features, the developers of Java have come to the same conclusion that you need to expose these operations to allow the efficient implementation of certain kinds of concurrent data structures and algorithms. See Doug Lea’s work on the Java 7 Fences API and his announcement and rationale to the concurrency-interest list. He states:

The basic JSR133 JMM scheme provides three kinds of underlying ordering constaints, that are tied to variable declarations — read-volatile, write-volatile, and write-final (aka write-release, aka publish, aka lazy-set), or for stand-alone AtomicX objects.

This turns out not to mesh very well with the development of core concurrent algorithms (like the ones I tend to implement), where you often need these special flavors of reads and writes on an occasional basis, which is currently either impossible or insensible to achieve.

In the recent C++0x standard, this kind of usage was addressed by allowing per-usage modes on new read and write methods. (And further, supporting more than the three modes considered here, but I don’t think we need to introduce more for Java.)

So, after years of resisting the idea, my current conclusion is that we need to stop wishing for a miraculous solution to lack of call-by-ref, and instead allow developers to roll their own out of the raw ingredients — fences.

In his actual talk, Bjarne only briefly mentioned C++0x concurrency features and the memory model (because there was so much ground to cover), but after noting that “C++ 0x will have atomics,” he quipped that the new feature makes C++ a “superpower.” Besides the obvious pun, some people (myself included) would liken C++ to a nuclear weapon for a different reason — namely that it tends to provide many experts-only features with extremely intricate rules and many hidden pitfalls. This was illustrated perfectly to me by Bartosz Milewski’s attempt to implement a well-known synchronization algorithm (Peterson’s algorithm for shared-memory, two party mutual exclusion) using relaxed consistency atomics. Just by trying to implement Peterson’s algorithm, he ended up finding that the draft standard semantics were subtly broken, and it took Hans Boehm’s intervention and the modification of the standard just to get it to work. In his post titled The Inscrutable C++ Memory Model, he recounts just how complicated it is:

I had no idea what I was getting myself into when attempting to reason about C++ weak atomics. The theory behind them is so complex that it’s borderline unusable. It took three people (Anthony, Hans, and me) and a modification to the Standard to complete the proof of a relatively simple algorithm.

All the people involved are in the target population of expert users for relaxed consistency atomics, so the byzantine complexity illustrated here (i.e. “so complex that it’s borderline unusable”) doesn’t bode well for 99.9% of the C++ using population. Sometimes I’m afraid that some bold and foolhardy programmers won’t be able to resist the temptation of “optimizing” by using features like this — the programming equivalent of playing Russian Roulette with all chambers loaded. Of course, I still support their inclusion, since the Java experience shows that they are needed for a small subset of people like Doug Lea to write kickass foundational concurrent library functionality.

Anyway, to bring this post full circle, earlier I mentioned that shared-state concurrency can be tricky to apply to large-scale systems. One of the reasons is thematically similar to the “experts only” problem we just discussed — current abstractions for shared-state concurrency are very complex and tricky to get right. The STM people point out that locks don’t compose — in other words, to build on existing pieces and maintain invariants, you often have to “invite” other parties to participate in your own internal locking protocols, basically foiling true abstraction or encapsulation of concerns. Just how tricky are threads and locks? Well, Edward A. Lee’s “The Trouble With Threads” recounts the following anecdote:

A part of the Ptolemy Project experiment was to see whether effective software engineering practices could be developed for an academic research setting. We developed a process that included a code maturity rating system (with four levels, red, yellow, green, and blue), design reviews, code reviews, nightly builds, regression tests, and automated code coverage metrics [43]. The portion of the kernel that ensured a consistent view of the program structure was written in early 2000, design reviewed to yellow, and code reviewed to green. The reviewers included concurrency experts, not just inexperienced graduate students (Christopher Hylands (now Brooks), Bart Kienhuis, John Reekie, and myself were all reviewers). We wrote regression tests that achieved 100 percent code coverage. The nightly build and regression tests ran on a two processor SMP machine, which exhibited different thread behavior than the development machines, which all had a single processor. The Ptolemy II system itself began to be widely used, and every use of the system exercised this code. No problems were observed until the code deadlocked on April 26, 2004, four years later.

In closing, I just want to say I enjoyed Bjarne’s talk a lot, and I find him to be extremely pragmatic in technical matters. He’s very frank about C++’s purpose, the fact that it’s evolving and that it has certain warts (and also many things that are better than other languages). In other words, he’s not the Steve Jobs of C++, pretending it’s eternally perfect and consistent. There wasn’t much time for many questions, but the most amusing question to me was what can only be described as an “Oprah” question — “How do you feel about having your software running on so many systems?” Hah.

* See “ABA Prevention Using Single-Word Instructions” by Maged Michael and “DCAS is not a Silver Bullet for Nonblocking Algorithm Design” by Doherty et al. for some examples of ABA problems solved by GC and instances where GC alone is insufficient. However, regardless of whether GC solves the ABA problem entirely, GC tends to make concurrent algorithms significantly simpler in many other ways.

3 Comments | Tags: Research Content

8 May 2009 - 22:40Followup on academic publishing in systems

I follow Ed Felten’s blog, Freedom To Tinker (which is actually now a blog for many people at Princeton’s Center for Information Technology Policy) — it has good coverage on issues like electronic voting and intellectual property. Dan Wallach, a well known security (among other things) researcher at Rice, published an interesting post titled “Acceptance rates at security conferences” assessing the state of academic CS conferences in the area of security. He points out that the conferences are getting increasingly competitive with an ever growing field of researchers and a relatively fixed number of conference venues; he notes that this will lead to certain “structural problems” in the research community and discusses potential options.

He also points to Matt Welsh’s thoughts on similar issues in the systems community:

This is of particular interest to me since, as a PhD student, I am an academic systems researcher. Dan Wallach summarizes Matt first post as follows:

He argues that there’s a real disparity between the top programs / labs and everybody else and that it’s worthwhile to take steps to fix this. (I’ll argue that security conferences don’t seem to have this particular problem.) He also points out what I think is the deeper problem, which is that hotshot grad students must get themselves a long list of publications to have a crack at a decent faculty job. This was emphatically not the case ten years ago.

I definitely see what Matt is talking about in the systems community. For example, for a large subset of lower-level systems work*, SOSP and OSDI are a sort of gold standard in publication venues. Each conference is held every two years (alternating years between the two venues), so each year 20-30 papers will be accepted total (for reference, OSDI ‘08 accepted 26/193 and SOSP ‘07 accepted 25/131). Given the size of the systems community, that doesn’t give much leeway for up-and-coming researchers, but a publication in such a venue is virtually required to be competitive academically — as Matt describes it, a publication in these venues is “a highly prized commodity, and one that is becoming increasingly valued over time.” Matt says:

Several of us on the hiring committee were amazed at how many freshly-minted Ph.D.s were coming out with multiple papers in places like SOSP, OSDI, and NSDI. Clearly there is a lot more weight placed on getting those papers accepted than there used to be. … Somewhere along the way the ante has been upped considerably.

I notice this too. For example, Georgia Tech’s College of Computing (where I am finishing my PhD) was ranked in the top 10 graduate programs in CS (#9) by US News and World Report in 2008. For systems research specifically, we were also ranked in US News and World Report’s top 10 (#10) in 2008. Now, of course US News and World Report’s rankings are contentious and reducing the work of a whole bunch of different researchers in CS to a single dimensional ordinal representing the whole program is very subjective, but the point is only to say that our program can be considered competitive in the universe of CS graduate programs.

But if you look at our publishing track record in these two prized venues, we’re virtually unrepresented. If you look at the OSDI proceedings, you will see that a paper from Georgia Tech has never been accepted there (my 2008 submission was rejected, although it did get decently positive reviews), and we have two SOSP papers — one in 97 which was collaborative with Microsoft Research and involved only students and no Georgia Tech faculty (which makes me wonder if it was related to an internship) and one in 2007 which was a collaboration between a student in the Electrical and Computer Engineering and a professor in the College of Computing. Compare this college-wide record with that of Haryadi Gunawi, an excellent faculty candidate interviewing at Georgia Tech this year. In his career as a PhD student, he had 2 OSDI and 3 SOSP publications (plus publications in top venues in other areas, like PLDI, ISCA and FAST). As a student, he has amassed significantly more publications in these prized venues than our whole College of Computing can claim**. And other students from his advisor(s) have similarly impressive CVs. Look at the students of many other “rockstar” systems researchers and you’ll see the same pattern; we had a parade of great faculty candidates with similarly strong records.

So what am I supposed to make of this? I get a deep sense of cynicism when I see trends like this over many years. Matt says, “I don’t have hard data to back this up, but it feels that it is increasingly rare to see papers from anything other than ‘top ten’ universities and labs in the top venues.” I would go a step further and say that there’s a certain “clique” (or “cabal” if you want something sinister) of key researchers who facilitate virtually all of the publications in these venues. If you are a student of one of these researchers, or a nth-generation student (e.g. a student of a faculty member who once was a student of…), you know how to do work that appeals to the program committee and present it in the proper way — if you don’t have the right perspective on these fine points of taste, your chances are grim.*** As a student, if your advisor is a big name, you can have a paper in these top venues every year. If you don’t, you have very bleak academic job prospects. Now I’m definitely not trying to diminish Haryadi’s impressive accomplishments, and his research is very exciting. But I get the sense that there’s a very strong dis-proportionality in academic publishing in systems that is a lot worse than most other areas in computer science.

A comment to Matt’s first post and the end of Dan’s post also pointed me to another relevant article. In May’s CACM Viewpoints, Ken Birman and Fred Schneider wrote an interesting critique of the state of systems conferences titled “Program Committee Overload in Systems” (here’s a free pdf from Fred Schneider — the same content but without the fancy formatting of the CACM hardcopy). This CACM article seems like a follow-up and expansion of an earlier work of Ken’s I’ve blogged about (titled “Overcoming Challenges of Maturity”).

Anyway, I’m glad that some well-respected systems researchers are being vocal about these issues. It’s definitely good to know I’m not the only one with gripes; I’ve been somewhat cynical about this for a while, but since I have very little clout it helps to find a few senior systems researchers with some common concerns.


* Yes, I understand that “lower-level” is a matter of perspective. To my electrical and computer engineering colleagues, things like hypervisors and operating systems count as “high-level” “end-user” programming.

** If you look at DBLP, you will find a good bit more from current College of Computing faculty, but I’m counting publications where the author is at Georgia Tech when the publication is made (i.e. the author’s affiliation at the time of the publication).

*** Even presented well, good work on certain kinds of systems topics just doesn’t seem to be interesting to the PCs of these top conferences (the Europeans have been irked by this for years — leading to the establishment of EuroSys).

No Comments | Tags: Rants, Research Content

4 February 2009 - 1:43Disk is the new disk

Recently, statements of the form “X is the new Y” where X and Y are members of {disk, tape, flash, RAM} have become a popular meme in computing. And when I say recently, I mean mid-2008 when I first started this post. I just haven’t been very motivated to write blog posts lately as I’ve been working on my dissertation; consequently, I don’t have the overwhelming desire to do more writing in my spare time. Anyway, here are some examples of those statements:

Despite the fact that this post has languished as a half-completed draft, the various trends that incited this meme remain: the local storage landscape/memory hierarchy of computers is changing. There are really several major trends. The most recent (and hyped) trend concerns changes in the external memory hierarchy due to the advent of cheap solid state memory, most notably flash (as well as some not-quite-viable but hot research areas like phase-change memory).

When I say “external,” I mean storage after main memory in the hierarchy, which traditionally consisted of disk followed by tape. External implies that the storage is not directly addressable by the processor in the same way as main memory. It also often implies that the storage isn’t random-access or word addressed — hard disks typically have to read and write in whole sectors, and random access is too slow to be used regularly. There are whole classes of “external memory algorithms” (“external sorts” are popular, for example) and “external memory data structures” designed specifically to perform well with the constraints of sequentially-accessed external storage. For instance, consider sorting a dataset so large that it won’t fit in main memory all at once. In memory, Quicksort is often used. But when the dataset is resident on disk, it would perform very poorly if applied directly since it is heavily dependent on random access. It is much better to use an algorithm that can do most of the work in sequential chunks, operating like a multi-pass Mergesort. You’d probably try to sort the data in smaller chunks that do fit in main memory (using any fast regular sorting algorithm), and then merge the sorted pieces sequentially.

Anyway, what are the trends that lead to these “X is the new Y” declarations? 1) the advent of flash; 2) growth in main memory sizes; and 3) the ever growing disparity between latency and bandwidth. So how do all of these fit together? Well, let’s break it down in a few ways — I’ll tackle the above bulleted items in order (mostly).

A straightforward way to interpret “Flash is the new disk, disk is the new tape” is to assume that flash will be interposed in the storage architecture before hard disks, thus “shifting down” disk to tape’s old position and tape to… well, I guess it becomes the new “stone tablet.” But “disk is the new tape” fits in more than one way: disk bandwidth often grows with disk size (improved density, etc.), but seek time remains very hard to improve due to physics. The way disks are now, random seeks are possible but so expensive as to make them worthless for many use-cases. In this way, disk is moving towards the properties of tape, which isn’t seen as a “random access” media, even though seeks are possible. To achieve good performance, it makes sense to treat disks as if they are sequential access-only media. Doug Cutting, creator of Hadoop, gave a presentation at OSCON in 2007 where he hammers home this point. MapReduce and Hadoop need to process large datasets from disk, so they must operate sequentially using linear sort and merge techniques. He does some back-of-the-envelope calculations and illustrates the differences between a random-access and streaming strategy using some fixed disk parameters (see slide 5): to update 1% of a terabyte database using an STR-limited sort and merge strategy takes ~1 day versus ~100-1000 days with a seek-limited B-tree index strategy.

This fact about disks is part of a more general trend succinctly articulated by David Patterson in an article titled “Latency lags bandwidth” appearing in the the October 2004 Communications of the ACM. Bandwidth increases way faster than latency in most areas (networks, disks and RAM), and Patterson gives six reasons and trend data to illustrate the point quite clearly. He also briefly analyzes the implications on software design.

How is “RAM [becoming] the new disk?” Well, this alludes to the fact that main memory sizes are getting so large, we can often keep relatively large working datasets entirely in memory. Coupled with high-speed networking, we can use a collection of memory-rich machines to store very large datasets. For a long time now, it has been faster to access the memory of another machine via the network than access a local hard disk. “Cooperative Caching: Using Remote Client Memory to Improve File System Performance” was published in 1994 (OSDI ‘94):

Where fetching data from remote memory over an older network might be only three times faster than getting the data from remote disk, remote memory may now be accessed ten to twenty times as quickly as disk, increasing the payoff for cooperative caching.

The fact that RAM sizes have grown large enough to fit relatively large datasets in memory has fed the widespread use and development of “in-memory” front-end caches such as memcached and Microsoft’s Velocity, as well as in-memory databases such as TimesTen. The cheapness of massive amounts of RAM has led to various specialty RAM-based disk devices where RAM is literally given a storage interface, such as Gigabyte’s i-RAM or the new ACard’s ANS-9010 Serial ATA RAM disk. The latter is backed-up to flash storage by pressing a button on the front of the unit.

So going back to disks, what about “Disk is the new RAM?” Without context, it would seem to counter the implication that “Disk is the new tape,” but the sentiment is really the same. Really the thesis of the article is the same as Doug Cutting’s point about Hadoop: you can compute relatively fast if you can work at the sequential transfer rate speed of the disk in a batch processing manner. From the article:

Caveat 1. Disk latency is much more limiting than disk bandwidth. Therefore, despite the fact that RAM stands for random access memory, we would almost never use the “new RAM” (disk) in random-access mode. The old-fashioned RAM already serves as our random-access cache

In this case, they leverage many aggregated disks in parallel to get a large amount of disk bandwidth and architect the computation to work efficiently on disks. They accomplish this by using techniques from external memory algorithms and by eliminating random-access data structures in favor of sequential ones.

Finally, to bring us full-circle, what about flash being anointed the “new disk?” Well, it’s probably too early to tell what it will ultimately mean in the external storage landscape. If it does replace disk as the “first line” after main memory, we will have to rethink various system assumptions made — filesystems, demand paging code, etc. generally make certain assumptions about the cost of operations which need to be re-evaluated and tweaked to the properties of whatever media we are using as first-line backing store.

1 Comment | Tags: Research Content

8 November 2008 - 14:57Burton Smith: Reinventing Computing

Yes, I know I haven’t posted in a while, but I’ve been busy this semester: I finished my PhD thesis proposal a few weeks ago and I’m trying to prepare for a defense within a year or so.

Reinventing Computing
Anyway, recently I attended a talk by Microsoft Technical Fellow and computing architecture and HPC guru, Burton Smith. The talk was very interesting though not specifically because of the subject matter (which is familiar), but because of the broad perspective. The premise is the same thing I’ve been hearing in every other talk for the past few years (which I’ve ranted about before) — namely, parallelism is going mainstream and we have to deal with it. Most people then follow with their specific sales pitch, but Burton followed with a broad overview of many different views of the issue and a wide range of different techniques and technologies.

One thing I really liked about his overview is that he shares my philosophy of “pragmatism over dogmatism.” He discussed potential alternatives and stated that there is a need to utilize a variety of different solutions, even those that are often seen as mutually opposing philosophies: for example, he said he believed that we need both message passing AND shared state concurrency, mutable state with transactions AND immutable functional objects, declarative programming AND imperative programming, etc. This is in contract to presentations where you hear that approach X is the “right” way forward. While it’s conceptually appealing to say that one approach is uniformly better, real world constraints often limit practical applicability — we probably need a tool chest, not just one really fancy hammer.

One thing he mentioned was the concept of viewing resource allocation in multi-core systems as a “2D bin packing” problem. He showed a view of traditional CPU scheduling as a one-dimensional problem of time-multiplexing single runnable kernel threads over each single processor (with a small number of processors total). He then showed the alternate view as a two dimensional problem of assigning chunks of processors over time to applications (i.e. time is the X-axis and processors form the Y-axis). This is reminiscent of gang scheduling or co-scheduling, except the internal scheduling of work within an application with a chunk of processors would be handled at the user level and kernel level time-slicing may not occur in a standard manner anymore. This reminded me of several pieces of current (and classical) related work.

One idea is the following: “why time slice at all on massively multi-core systems?” If you have 256 processors, just dynamically assign chunks of them like spatial multiplexing of memory.

Corey
Corey, a research OS for many-core systems, will be presented at OSDI this year (OSDI 08 program) and follows this principle — “Corey allocates physical cores to applications rather than presenting a time-shared virtual processor abstraction.” Another related concept is that Corey also allows the allocation of dedicated kernel cores for running the kernel, so kernel calls are handled “via fast shared-memory IPC rather than slow traps.” I remember Rik Farrow suggested this same idea in a 2006 Google Tech Talk titled “Security is Broken”. Corey is organized like an earlier OS called an Exokernel (and in fact M. Frans Kaashoek and his PDOS group are involved in both). Like the Exokernel, Corey delegates scheduling policy of an allocated set of cores to the “library operating systems.” This basically amounts to the same thing as user-level scheduling in a normal OS, since the library OS generally runs within the address space of an application (the authors note that the library OS doesn’t need to further isolate itself from the application because the exokernel doesn’t trust the library OS).

User-level Scheduling
One thing that the idea of user-level scheduling reminds me of is the infamous two-level scheduling of Scheduler Activations or Solaris M:N threading. The goals were similar: in theory, at the user (application) level, you can make better scheduling decisions via custom scheduling policies or just better information, and scheduling is also cheaper. In practice, the extra complexity didn’t pay off and the two levels of scheduling (at kernel and user) level often interacted in negative ways. Subtle interference between decisions made at different levels could cause significant and often unexpected performance issues, and to really take advantage of it, you needed to make sure that both levels of scheduling were not working at cross purposes — to do that really requires propagating the application level scheduling decision information to the kernel level scheduler too, which is messy and complicated.

So although, at first glance, the phrase “user-level scheduling” appearing on the slides brought the aforementioned black eye to my mind, I think the situation in the case of Corey and the kind of system Burton is proposing will be different because it’s really not the same kind of combination. In these scenarios, we have a lot more cores and the kernel level “scheduling” is at much longer time scales — instead of time slicing, it’s easier to think of it like allocating physical page frames of memory to processes’ address spaces. Of course, the allocated number of cores is less “transparent” than virtual memory, but model of holding on to a resource is more similar than time slicing over a small number of processors.

Concurrency Runtime
Another project that this brought to mind is Microsoft’s Concurrency Runtime (not to be confused with Microsoft’s similarly named Concurrency and Coordination Runtime). Not only is it related to the idea of user-mode domain-specific scheduling for multi-core applications, but it is also designed to allow different concurrency solutions to work together (thus supporting Burton’s view of utilizing many solutions with different strengths, potentially in the same program). The idea behind the Concurrency Runtime is providing a common resource management framework to allow various concurrency solutions to interoperate and “play nice” together. One problem with current solutions likes OpenMP or Intel TBB, etc. is that they all think they “own the machine.” If you want to use multiple solutions together, they interact poorly because they are all oblivious of each other. The concurrency runtime provides a user-mode common resource management framework underneath the various concurrency solutions which can arbitrate between these different requests. It also provides a bunch of richer primitives for building these solutions (i.e. higher level concepts of tasks, groups, events, thread pools, etc.), but I can’t find too much documentation on it so far (most of the references are in the form of interviews and presentations).

Snake Oil
At the beginning of this post, when I was talking about how Burton Smith’s talk and approach are different than what you usually hear, it reminded me of a recent post by Sun engineer Bryan Cantrill (an outspoken, frequently provocative* fellow — as an aside, his chapter in Beautiful Code was one of the most enjoyable to me, coming from a systems background). Anyway, as I mentioned, most talks follow the common “concurrency is here, we must deal with it” introduction with a sales pitch for a specific tool. One hot area currently is transactional memory, which can be implemented in hardware or software, and I’ve seen and read a lot of papers on this in the past few years. Anyway, Bryan recently posted a scathing post about transactional memory titled “Concurrency’s Shysters”. I think this kind of critique comes as an inevitable backlash against over-hyped/newly in vogue solutions and the dogmatic selling of some technique as a the fix for what ails you. Of course, for the sake of marketing, it’s very hard to present a nuanced view and still be convincing, so maybe it’s out of necessity, but it’s off-putting for people who feel like the presentation of a technology is unbalanced and unrealistic.

* Bryan Cantrill is infamous for the “Have you ever kissed a girl?” Usenet quip in historical Solaris v. Linux performance wars, and more recently for dumping on “Dreaming in Code” in a Google Tech Talk on DTrace. That’s random trivia, but I quite enjoy watching strongly opinionated and outspoken technical people duke it out.

BTW, if you like following all of the various concurrency solutions (which are popping up fast and furious), you might want to check out Alex Miller’s Concurrency feed.

No Comments | Tags: Research Content

19 May 2008 - 0:12In Silicon Valley for the summer

Yes, I know the average time between my blog posts is quite long, but I tend to post longer posts infrequently rather than daily brain-dumps (or anything limiting towards Twitter). Anyway, I’ve been preparing a conference paper as well as getting ready to leave for San Jose, CA. I’ll be working for IBM Research at their Almaden Research Center on a GPFS-related project, Panache. See “Panache: a parallel WAN cache for clustered filesystems” in ACM SIGOPS Operating Systems Review from January 2008 for a basic idea. I’ve never been to Silicon Valley, so I’m excited to see the area.

I’ve spent most of my past seven or so months working on my thesis proposal and preparing a conference submission. On the topic of CS conferences in my area (Systems), I wanted to highlight a USENIX-sposored meta-workshop I found serendipitously — WOWCS ‘08: Workshop on Organizing Workshops, Conferences, and Symposia for Computer Systems. The WOWCS 08 PC and accepted authors is a list of seasoned veterans in systems research. Given that, and the improvement-based focus of the venue, many papers detail a lot of what some people see as “broken” in current systems academic venues (reviews, PC meetings, etc.).

One paper in particular that somewhat confirmed some disheartening truths about the nature of conference and workshop reviews is Overcoming Challenges of Maturity by Ken Birman — Ken is an ACM Fellow, well known for his work in systems and networking. Some of his gripes are from his experience chairing SOSP (in 2005), which is one of the most prestigious (and oldest) systems venues. Ken said,

Overwhelmed by the huge numbers of submissions, most PCs have turned to multi-round processes in which the first-round reviews are farmed out, often to students who may do an erratic reviewing job.

Most of us are learning to write papers in a manner calculated to appear to those beleaguered first-round reviewers. To get into SOSP or SIGCOMM a paper has to survive two thresholds: it must get past the two randomly selected students, and then must get past the six or so PC members who are most knowledgeable about the topic.

a PC chair today assigns some paper to PC member X, who then randomly hands it to students Y and Z, producing completely random reviews from people who have never been a part of the community and who are naturally inclined to be overly critical and to overly favor work in their own areas of interest: our mature researchers have long since shed these flaws of youth.

The above content encourages somewhat cynical views of publishing in such academic venues. Ken points out that the quality of the top conferences isn’t really diminished by these schizophrenic, semi-random first round eliminations because there are enough good papers remaining to fill the program. But it’s still somewhat unfair and very frustrating to authors. He also said,

Who hasn’t had papers that were rejected in the first round of reviews at a top conference, with just two reviews, one or both of which seemed almost completely clueless? Who hasn’t expressed anger at the system? Here are two little “factoids” to illustrate the depth of the issue: when I sent out the SOSP reviews, we discovered that in one case, a rejected paper had missed the initial cut on the basis of a review that was clearly written about some other paper.

Anyway, I’ve had pretty good experiences so far with my current primary research work, but some projects I’ve collaborated on as a secondary/advisory participant have experienced treatment like that (“completely clueless” reviews). Sometimes it could be chalked up to clarity issues in the paper, but other times it left me wondering if certain reviewers actually read the paper or just the abstract. Oh well… it helps to know that even highly-regarded, established researchers experience this sort of thing too.

2 Comments | Tags: Research Content

3 April 2008 - 0:14Memory ordering and memory models

Along with a variety of interesting papers, I’ve seen a few nice Google Tech Talks on the topic of processor memory ordering and language-level memory models recently. With the relatively recent resurgence of interest and attempts to “mainstream” parallel/concurrent programming, it is increasingly important to get these things right. Processor memory ordering guarantees (sometimes called a processor’s/architecture’s memory model) are generally relevant to people like me writing systems-level software (e.g. programming in C or assembly implementing operating systems, higher-level language runtimes and compilers, etc.). In theory, if you are doing user-level programming in C with something like pthreads (or win32 threads, OpenMP, etc.) and have a “race free” program (according to the threading specification), processor memory ordering should not be directly exposed to you. Even though Alpha and PowerPC have weaker memory ordering guarantees than x86, it is the implementation’s responsibility to issue appropriate memory fence operations, locked operations, etc., to make sure that the thread primitives perform as expected. If you want to use lock-free algorithms or atomic operations, you have to be aware of such architecture-specific things and do this manually.

Some higher-level languages — most notably Java — have defined language-level memory models: Java’s memory model defines how threads interact through memory and precisely what it means for a program to be data race free / well synchronized. It provides guarantees that a compiler/JIT won’t perform optimizations that break race free code. In a language like C without a defined memory model, these things can be tricky and surprising because it is unclear what a data race means when considering the mapping from language level statements to actual machine code. A programmer may have two independent threads concurrently assigning values to two different variables. If these variables are small (chars, for example) and stored adjacently in the same machine word, this may be a data race on some architectures. Since these kinds of decisions are often left unspecified and up to the compiler (or possibly the linker), there is no guaranteed way to write portable and robust code. Additionally, the compiler may introduce extra stores or perform other optimizations that, while fine for single-threaded code, introduce races into otherwise race-free code. People like me, who (try to) make use of lock-free algorithms and atomic operations, end up with code that may be quite fragile to even minor compiler optimization changes. More on this later.

One other nice thing about Java’s memory model is that it also constrains the language implementation on what can happen to incorrectly synchronized code; languages like C and C++ often say that the result of illegal code (e.g. modifying a variable twice without an intervening sequence point) is undefined — and undefined behavior can allow anything at all to happen. As one example, Java specifies that the implementation cannot introduce values into improperly synchronized code that appear “out of thin air.”

The videos are as follows:

  • IA Memory Ordering — Richard Hudson explains Intel’s newly clarified memory ordering semantics for x86.
  • Advanced Topics in Programming Languages: The Java Memory Model — Jeremy Manson describes the current Java memory model as revised by JSR-133 and Java thread semantics. The talk covers basics about model ordering guarantees, the meaning of locking/synchronization primitives and volatile, as well as common pitfalls.
  • Getting C++ Threads Right — Hans Boehm, the well-known programming languages/compilers researcher, talks about the effort to provide better threads support in the upcoming C++ standard (C++0x). More importantly, he talks about the general problems plaguing implementation of correct multi-threaded programs in languages like C and C++.
  • Towards a Memory Model for C++ — Not a Google Tech Talk, but another talk by Hans Boehm very similar to that above.

C++ Threads / Memory Model
Hans Boehm covered similar ground in his 2005 PLDI paper, “Threads Cannot be Implemented as a Library”. Some people felt the paper was trivial or hyping a non-problem by saying that the compiler and language have to provide a few extra guarantees and can’t be completely oblivious (I recall the LTU discussion — one commenter said, “Ayone who was paying attention already knew that”). But he does bring up a whole set of issues which is becoming increasingly important. Concurrent programming is already more difficult than regular sequential programming. On top of that, you have poorly specified semantics or broken implementations which actually cause problems. It’s hard enough already to get your part right without worrying about the compiler breaking things behind your back or the underlying language implementation not properly obeying the specification. His concerns aren’t just theoretical “cleanliness” issues, either.

Part of what prompted this post is the recent thread on LKML: “Is gcc thread-unsafe?” This is a perfect example of the problem that Boehm alludes to with regard to compiler optimization. The sample code in the gcc thread:

int trylock() {
  int res;

  res = pthread_mutex_trylock(&mutex);
  if (res == 0)
    ++acquires_count;

   return res;
}

That code attempts to acquire the mutex and increments acquires_count only if it succeeded in locking the mutex. With -O1, gcc 4.3 generated code that always reads and writes the acquires_count variable (load, conditional add, store) regardless of whether the mutex is obtained. Typical language lawyers pointed out that the C standard allows this optimization, even though it introduces a fairly nasty race condition. Some of the gcc developers tend to take a very defensive language lawyer stance, vigorously defending things that are technically permitted but practically useless. Standard C says nothing of threads and imposes very little on the compiler in this regard.

The Java Memory Model
The Java Memory Model is quite useful today, but it wasn’t always perfect. Java 5 incorporated JSR 133: Java Memory Model and Thread Specification Revision — a revision making some substantial changes to the original Java Memory Model, which was regarded as “broken.” I was still in high school and hadn’t even taken my first CS course when William Pugh (also known for the invention of the skip list, a wonderful data structure that works well in concurrent situations), published a paper titled The Java Memory Model is Fatally Flawed, a revised version of Fixing the Java Memory Model from a year earlier. I learned about the controversy a year or two later, and I was finishing my Masters by the time the new, fixed memory model was finalized and adopted. Although it seems somewhat strange that people tolerated and obviously wrote multi-threaded Java with a “fatally flawed” memory model for so long, the situation is roughly analogous to the current situation with C/C++ and threads: generally stuff works, and compilers/runtimes often do the “right thing” anyway, but we’d rather have stronger guarantees in this area.

Double-checked locking, a trick typically used to avoid synchronization on lazily-initialized fields, is one of the things broken by Java’s old memory model. Here is an example:

public Object getA() {
  if(a == null) {
    synchronized(this) {
      if(a == null)
        a = new Object();
    }
  }
  return a;
}

Despite the fact that the idiom was commonly-used within some Java libraries, it was technically incorrect and there was no satisfactory way to fix it (using thread-local storage made it possible, but that was a hack and often too expensive to be worth it). Under the current Java Memory Model, double-checked locking works if the field is made volatile. Under the old memory model, volatile was not very useful because volatile reads and writes could be reordered with respect to regular reads and writes, so you couldn’t use a write to a volatile value to, for instance, indicate to another thread that an object was initialized (because the write to signal the initialization might be reordered to before the actual initialization). The new memory model prevents this reordering. Since there are a lot of resources available on the web about the Java Memory Model, I won’t say too much more except that Java has been somewhat of a trailblazer in the area of really providing clear and useful semantics for multi-threaded programs.

William Pugh has a fairly comprehensive page with Java Memory Model resources and Doug Lea has a JSR-133 Cookbook for Compiler Writers and a tutorial titled Synchronization and the Java Memory Model (the latter is an excerpt from Doug’s nice Concurrent Programming in Java book).

IA Memory Ordering
x86 (IA32) has used, in the past, fairly strong memory ordering semantics called “processor ordering.” Intel has clarified/more robustly specified the memory ordering semantics in a document titled Intel® 64 Architecture Memory Ordering White Paper. The newly clarified memory ordering semantics apply to both 32-bit and 64-bit x86 and are described as “Total Lock Ordering + Causal Consistency.” Locked instructions are totally ordered across all processors, and memory obeys causal consistency, which is a fairly natural concept and provides publication safety. These memory ordering semantics don’t apply to all memory types (I/O would be different), but these semantics are what would be encountered in regular user code dealing with standard heap or stack memory. Precisely knowing the target architecture’s memory consistency guarantees is critical to correct and efficient implementation of higher-level language memory models and systems software (and lock-free code). AMD also has their own memory ordering reference which is very close or identical to Intel’s.
A long time ago, I read a paper titled Memory Ordering in Modern Microprocessors by Paul McKenney which explains the memory ordering guarantees provided by many modern microprocessor architectures (in the context of Linux’s memory barrier primitives). It is interesting just how weak Alpha’s (and to some extent PowerPC’s) ordering guarantees can be. Alpha seems to allow just about everything short of the processor simply making up bogus values for memory reads.

No Comments | Tags: Research Content

12 February 2008 - 3:43Memory management

Over the past week or two, I read three semi-related papers on the general theme of memory management in applications: one was on the memory overhead of various application design choices, another was on the cost of garbage collection, and the third was an older but interesting paper on custom memory allocators. For my own research, I read a lot of papers from OSDI/PPOPP/SOSP/ASPLOS/Usenix/HotOS (and others with topical overlap like ICDCS, HPDC, SC, PODC, SIGCOMM), but I sprinkle in pleasure reading from other areas of interest — that’s not to say that I don’t actually like reading OS and distributed systems papers; it’s just nice to get topical diversity. In particular, I tend to select a lot of papers that are related to Programming Languages and Compilers as well as Information Security. Under the broad PLC umbrella, I tend to like papers related to language runtime/library implementation, compiler implementation and functional programming as well as issues of software engineering (in the broad sense of runtime/language issues to make programmers productive, less-error prone, etc.). Coincidentally, all three of these papers appeared at OOPSLA.

“The Causes of Bloat, The Limits of Health”

The first paper I mentioned is “The Causes of Bloat, The Limits of Health” (Nick Mitchell and Gary Sevitsky) from OOPSLA ‘07. I found this paper interesting for a few reasons, but it shows how application design choices in mapping a data model onto concrete data structures may lead to chronically wasteful memory usage where the amount of real data is overwhelmed by collection metadata/overhead (like pointers). The paper explores these issues in the context of Java and Java applications; the basic premise is applicable in any language, but things like Java’s object headers exacerbate the problem when compared to a comparable data structure in C. A key metric here, collection health, is a comparison of actual application data to overhead imposed by object headers, collection metadata, pointers, etc. One example they provide is putting Java Strings in collections:

Observe that a String must have at least 140 characters in order to achieve an S < 1.2 (i.e. no more than 80% actual data). When Strings are placed in a standard Java HashSet, they must have at least 270 characters to achieve this level of health. On the flip side, placing 10-character Strings into a HashSet will result in an S of no less than 3.7 (i.e. no more than 27% actual data), no matter how many Strings are placed into the HashSet.

They define “good health” for a data structure as having at least an 80% real data (actually, less than 20% overhead/data ratio). Even for a single string to qualify, it needs to be at least 140 characters, because a String is actually fairly complicated under the hood: a String is an Object and has an object header and some fields (a cached hash code, a length and potentially an offset into the backing array), as well as a pointer to a char[], which also has a header (containing the length and other VM bookkeeping information). Ignoring the String fields, they put the overhead at 28 bytes per String, and they note that, in some cases, object headers could be up-to 20 bytes a piece.

Some of the applications they tested had less than 20% actual data due to inappropriate choice of Java collections or container objects. It’s not just about choosing the wrong data structure or having the default collection size be too big, it’s also about choosing the wrong object “containers” to hold values (decisions that may seem inconsequential). Java doesn’t have tuples (a huge gripe of mine), and one application needed to hash a pair of items, an int and some other Object. The developer chose java.util.Arrays$ArrayList, an inner-class of java.util.Arrays used in the asList method, because it was convenient to specify literal values by making a simple call like this: Arrays.asList(new Object[] {1, obj}). You couldn’t necessarily just put the Object array in the collection directly because array equality is object identity rather than content equality (i.e. it just compares the addresses of the arrays). Anyway, using Arrays$ArrayList causes a lot of overhead because the ArrayList is itself an Object with attributes and it also keeps a separate backing array (which has its own header overhead). Since the backing array is an Object array, storing an int actually requires boxing the primitive int into an Integer object wrapper, which is even more overhead, and every little bit adds up when you multiply it by the number of these Arrays$ArrayList objects you will create. If the author had instead created a Pair class with an int field and an Object field, it would reduce the memory usage of all of these (int, Object) pairs in this application to about a third of the original size (1.7MB versus 4.9MB).

This kind of study makes you think a little more carefully about choices you might take for granted or overheads you might dismiss as negligible. I know in the past I’ve done things like the above scenario with ArrayList (although not using that exact class) — but simply taking an existing Java class already present in the API that is close to what I want rather than making my own trivial two element wrapper class to work around the lack of tuples. If it’s a one-off it won’t be a big deal, but if it goes in a collection with many like objects, eventually the overhead may become unreasonable. This paper also reminded me of the sometimes significant size overhead of Java objects compared to plain data, and that’s good to keep in mind. Now don’t get me wrong, I’m not one of those systems programmers who likes to beat his chest and loudly proclaim that Java is slow and for n00bs and that we should all be coding everything in hand-tuned assembly language. I spend a lot of my time coding in C, and when I end up doing stuff in Java, it’s often a breath of fresh air (except when I need unsigned integers). Even in a high-level language, though, I personally like knowing what’s going on underneath.

“Quantifying the Performance of Garbage Collection vs. Explicit Memory Management”

The second paper is “Quantifying the Performance of Garbage Collection vs. Explicit Memory Management” (Matthew Hertz and Emery D. Berger) from OOPSLA ‘05. A post on Lambda The Ultimate a few months ago referenced this paper. I had heard the conclusion of the paper cited before, but I’d never actually read the paper so I put it in my reading queue. The authors note that the garbage collection process visits more pages of memory than an application would using explicit allocation. This is bad for locality and increases pressure at many levels of the memory hierarchy. In the presence of physical memory pressure and demand-paged virtual memory, it causes significantly more paging overhead which is very expensive (it can cause an order of magnitude performance degradation). In addition, the authors note that you need a larger heap to achieve performance parity with manual allocation because smaller heaps increase (internal) memory pressure, which leads to more frequent collections, which has baseline overhead and of course leads to visiting a lot of pages. With 5x as much memory, the garbage collector will generally equal or surpass manually allocation, and 3x memory gives an average 17% performance penalty over manual allocation, while lower factors degrade significantly.

Again, I’m not one of those “macho programmers” who likes to complain about garbage collection being too expensive and never appropriate. I think managed languages and garbage collection are really a good thing and even though it’s often more expensive, I don’t think it’s too expensive for most applications. And with the imminent rise in concurrency and parallel programming, the usefulness of garbage collection is even greater; manual storage management in the face of concurrency is often even more painful — a lot of novel concurrent data structures just assume the presence of garbage collection. There’s always programming features that we eventually take for granted which people gripe about being too expensive at one time: people complained about the overhead of operating systems (versus running applications on base metal), and later it was compilers/high-level languages that were “too expensive” (versus assembly language). The situation with garbage collection may not be exactly analogous, but current trends seem to indicate that it’ll be a given in due time.

“Reconsidering Custom Memory Allocation”

The last paper is “Reconsidering Custom Memory Allocation” (Emery D. Berger, Benjamin G. Zorn and Kathryn S. McKinley). The authors tested custom memory allocators versus the general purpose Lea allocator, which is just a really good general purpose allocator created by the amazing Doug Lea (who is also one of the people responsible for the fantastic java.util.concurrent package added via JSR 166). The Lea allocator is also used in glibc in a modified form (ptmalloc/ptmalloc2/ptmalloc3, which is basically a Lea allocator enhanced for multi-threaded allocation). The authors test various applications that use custom allocators and found that custom allocators were rarely worth it. In their conclusion, they state:

Despite the widespread belief that custom allocators should be used in order to improve performance, we come to a different conclusion. In this paper, we examine eight benchmarks using custom memory allocators, including the Apache web server and several applications from the SPECint2000 benchmark suite. We find that the Lea allocator is as fast as or even faster than most custom allocators. The exceptions are region-based allocators, which often outperform general-purpose allocation.

The fact that the Lea allocator outperforms most custom allocators isn’t a coincidence, it was an explicit design goal. Doug Lea says on his malloc webpage:

I soon realized that building a special allocator for each new class that tended to be dynamically allocated and heavily used was not a good strategy when building kinds of general-purpose programming support classes I was writing at the time. (From 1986 to 1991, I was the the primary author of libg++ , the GNU C++ library.) A broader solution was needed — to write an allocator that was good enough under normal C++ and C loads so that programmers would not be tempted to write special-purpose allocators except under very special conditions.

I’d say this paper shows that he largely succeeded. But I think the paper’s conclusion is just a classic lesson in optimization. Knuth famously stated, “premature optimization is the root of all evil.” This isn’t to necessarily say that the developers of applications like Apache and gcc were oblivious and guilty of premature optimization. The authors note that this may be a factor of general purpose allocators getting better as well as program evolution. At one time, gcc’s runtime was dominated by parsing, which benefited from the custom allocator; now optimization is where more cycles are spent and so the game has changed. In any event, the take-home message is about future practice; don’t rush in to custom memory allocators just because you think they’re faster. Most applications won’t benefit, and you’ll save yourself a lot of trouble.

When I was more of a novice, I was tempted to perform premature optimization and it rarely paid off in terms of the time investment. Now I’m much more concerned with getting the structure of the system so that it doesn’t impose high overhead in general — a holistic view of the entire system rather than micro-optimizing specific operations. After a while you just get a feel for how you can structure a system so that cumulative overhead is avoided. Obviously part of that is in using the right data structures and algorithms (from an asymptotic complexity standpoint, but also considering constant factors when they matter), but also thinking about things like how data flows through the system, so you can avoid unnecessary copies. After the general structure is there, you profile and you can always micro-optimize hot paths. If you micro-optimize while the structure is still in flux, Murphy’s Law dictates that whatever part you optimize will change to foil you. Anyway, this post is getting quite long, so I just wanted to mention some high-performance, drop-in replacement (general-purpose) malloc implementations:

  • ptmalloc3 — newer and faster than ptmalloc2, which is in glibc.
  • Hoard — fast allocator developed as a research project at UMass Amherst under Prof. Emery Berger (a co-author of two of the papers I mentioned)
  • tcmalloc — Google’s thread caching malloc. At one time it was the fastest out there, but now various others may be equally competitive or faster.
  • nedmalloc — supposedly faster than Hoard, ptmalloc2 and tcmalloc.
  • libumem — Solaris’s umem allocator, made portable.

No Comments | Tags: Research Content

24 January 2008 - 3:44More on layers and coupling

After my last (quite long) post, I was thinking about filesystem layering and coupling/interfaces between independent components. I wanted to post three mostly unrelated ideas on the same general theme. I’ll post the two most related to the previous post now and make the third a future post:

Changing traditional storage layering for distribution
In very large distributed filesystems (getting into the multi petabyte range), traditional RAID is often too weak and constrained for redundancy. By traditional RAID, I mean RAID that is implemented in hardware or software and is effectively invisible to the file system (sits below the block level). Say you had thousands of multi-disk RAID-5 arrays. In this situation, the probability that you’ll lose an entire array at some point is probably going to be make people nervous, particularly the kind of people who would have such massive storage systems. Depending on the scale, you could try to manage with hot spares and round-the-clock IT staff or increase the redundancy by going to RAID-6 or RAID-1, but you still have dangerous and constraining locality in your redundancy. This is more important when distribution is involved: what if you lose a controller or network connection or some other local aspect that takes an entire array effectively offline (or an entire chassis/rack or entire SAN or even an entire datacenter)?

A presentation titled “Storage Challenges for Petascale Systems” given by Dilip D. Kandlur, Director of IBM’s Storage Systems Research talks about these challenges in the context of petaflop systems with tens or even hundreds of petabytes of storage. These systems might have 100k-150k disk drives! The presentation notes:

RAID-5 is dead at petascale; even RAID-6 may not be sufficient to prevent data loss
Simulations of file system size, drive MTBF, failure probability distribution show 4%-28% chance of data loss over five-year lifetime for 8+2P code.

The probability of failure is unacceptably high even with the double parity of RAID-6, but triple parity gives you several orders of magnitude lower mean time to data loss. With these challenges in mind, GPFS is adding software RAID to support such stronger RAID codes not typically supported by RAID controller hardware (triple parity). In addition, it will support what is called “declustered RAID” (see Parity Declustering for Continuous Operation in Redundant Disk Arrays) which significantly improves load balancing during rebuild (see slides 11-13 for a great visual depiction of the way declustered RAID works). See also “The Challenges of Storage System Growth” a presentation by Denis Serenyi of Symantec, which covered some related issues.

The stronger software RAID helps at one level and allows you to using striping for throughput, but it doesn’t really deal with the location-based redundancy. The context of the previous presentation is primarily extremely large but single-site HPC systems, so it doesn’t discuss this issue much, but when you have a distributed system spanning many locations you need to consider it. In principle you could span multiple datacenters with your RAID layout, but that wouldn’t work very well from a performance perspective. RAID works best with symmetric (and predictable) latencies between devices; moreover, it’s unnecessary if you just want to deal with failure, because you can handle it much more robustly at a higher layer. Most large distributed systems provide for redundancy at the level of larger storage granules: files, objects (if you’re using an object-store based system), or perhaps larger blocks or “chunks.” For example, the Google File System (GFS, not to be confused with Red Hat’s Global File System, another distributed filesystem also named GFS) stores files as a series of 64MB chunks and each chunk is replicated. The paper notes the importance of replicating chunks on different racks:

We must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.

Ceph, a recent distributed filesystem (see Ceph: A Scalable, High-Performance Distributed File System in OSDI ‘06), uses an underlying object store model and replicates at the level of objects:

In contrast to systems like Lustre [4], which assume one can construct sufficiently reliable OSDs using mechanisms like RAID or fail-over on a SAN, we assume that in a petabyte or exabyte system failure will be the norm rather than the exception, and at any point in time several OSDs are likely to be inoperable. To maintain system availability and ensure data safety in a scalable fashion, RADOS manages its own replication of data using a variant of primary-copy replication [2], while taking steps to minimize the impact on performance. Data is replicated in terms of placement groups, each of which is mapped to an ordered list of n OSDs (for n-way replication).

In Ceph, the data for a traditional file at the filesystem-level may consist of many underlying objects in the object store (a file is striped across objects named by combining an inode and a stripe number), so this is similar to replicating at a “chunk” or large block level. GPFS, in addition to the planned lower level declustered/striped strong parity strategy, already has file data and metadata replication (which can be controlled on a per-file basis). These features are actually a part of a rich set of ILM (Information Lifecycle Management) features that allow you to define different policies for various data on the same filesystem in a SQL-like declarative language. For example, you can create a policy that a certain directory subtree should be stored on a pool of faster disks and have a specific, higher replication factor than other files. Or you could make a policy to have the system gradually decrease the replication factor of files that haven’t been accessed in a long time, finally migrating it to offline, external storage after a certain threshold.

The various DHT-based filesystems/storage systems (CFS, Ivy, PAST, Pastiche, OceanStore, etc.) mentioned in my previous post also replicate pieces of files or entire files on multiple nodes. These systems are designed for more widely distributed and dynamic environments so they have to deal with things like significant node churn (nodes not being powered on/connected all the time or leaving the system permanently); it is critical to adopt a replication strategy that is easy to maintain in such circumstances. In such systems is it also accepted that some files may be temporarily unavailable or lost permanently due to a loss of all replicas — most files will be fine, but you don’t necessarily set replication parameters for losing a given file to the same low probability of failure as RAID type strategies. Note the difference in the nature of redundancy and the reasons for doing so: losing a piece of a file might be bad for the user of a file, but the rest of the filesystem is fine. In the case of RAID, where a coherent filesystem’s data and metadata are striped indiscriminately across several disks, losing all replicas of a block could mean that the filesystem’s metadata is damaged, which could cause serious problems.

Anyway, I just think it’s interesting to note how the traditional storage layering evolves in the face of distribution and large datasets. With local disks and filesystems, people tend to put replication below everything and provide a replicated block device. When distribution is involved, it becomes more flexible to think of replication in the context of filesystem entities like files or chunks.

A violation of layering by DHash
My last post got quite long so I didn’t remember to include every interesting footnote and piece of trivia. One interesting “layering violation” for efficiency in the related work I listed is in DHash, the distributed block storage layer built on top of Chord. The authors of the CFS SOSP paper note:

DHash has its own implementation of the Chord lookup algorithm, but relies on the Chord layer to maintain the routing tables. Integrating block lookup into DHash increases its efficiency. If DHash instead called the Chord find successor routine, it would be awkward for DHash to check each server along the lookup path for cached copies of the desired block. It would also cost an unneeded round trip time, since both Chord and DHash would end up separately contacting the block’s successor server.

That’s obviously a case in which duplicating code and violating layering is a good tradeoff, since it eliminates a costly network round trip. Also, since DHash and Chord are maintained by the same entity, it is unlikely to be particularly painful. However, it does make me wonder if Chord’s dead simple interface is just too austere. The only function it provides is the ability to find the successor for a node, which is too basic for DHash (at least without compromising performance). Most competing DHT solutions (e.g. Pastry, Tapestry, CAN) didn’t separate the hash lookup primitive into a separate externally-distinguished artifact. I really like the idea of separating the hashing/routing from storage policy, and the single successor primitive is appealing for its simplicity, but perhaps the interface at the split should have been richer.

No Comments | Tags: Research Content