18 January 2008 - 23:38

ZFS hype?

Over the past year and a half or so, there’s been a lot of hype surrounding Sun’s ZFS (originally the “Zettabyte File System”). After the initial release, the “buzz” has come back in waves, peaking once with the initial porting of ZFS to FreeBSD (announced, merged), and later reappearing with (false) rumors of ZFS becoming the default filesystem in Mac OS X 10.5. This month another wave started with ZFS code and binaries for OS X being made available. Sun itself feeds the hype by touting ZFS as “the last word in file systems.” I’ve also been following Oracle’s btrfs (“Butter FS”), Matthew Dillon’s (of DragonFlyBSD) HAMMER and ext4, which are all sort of taking feature cues from ZFS. The option for checksumming should have been common in filesystems long before now, so I’m glad to see it is finally becoming a mainstream feature. Cheaper snapshotting/transactional support is also nice, but that’s not as rare.

Personally, I don’t understand the reason for the large hype over ZFS, particularly in the context of OS X (more on that later). Now, it seems to be a fairly impressive engineering effort, but it is concentrating on an artifact that is somewhat pedestrian by now: a purely local filesystem. ZFS seems to be a good but still incremental improvement in local filesystem capabilities (with some unorthodox choices, but more on that later): they took functionality that was previously available in different storage layers and increased coupling to improve performance and flexibility. Don’t get me wrong; there are certainly great things about ZFS, but in my book “the last word in file systems” would at least have to be distributed. From my perspective, the bulk of “cutting-edge” research in filesystems over the last decade has been on distributed or cluster filesystems. Of course, my research is generally in distributed systems and I worked on a distributed/parallel filesystem (IBM’s GPFS) this past summer, so I’m not an impartial bystander, but I think it’s non-controversial to say that storage transparently interfacing with the network is important now and will only become more important in the future. Now, Sun has indicated that they are going to use ZFS as a local storage backend for Lustre (since it currently uses ext3, this would be a big improvement), but that’s slightly different. Alternatively, maybe if ZFS were like the eternal vaporware relational filesystem WinFS, I could see the hype being justified, particularly from end-users.

One of the reasons I’m baffled about the hype of ZFS on FreeBSD/OpenBSD/Mac OS X is that the ports are not stable yet and there are some general ZFS issues that would limit its wide use. First of all, it needs a lot of memory and can panic or deadlock if it runs out of memory. It can really only run reliably on 64-bit machines (because it tends to exhaust kernel resources on 32-bit machines), but Sun is upfront about this. It also seems to be somewhat finicky and to require manual tuning to get good performance (and sometimes just to avoid crashing). A post to the FreeBSD mailing list titled “ZFS Honesty” summarizes the issues nicely:

But let’s also be honest about ZFS in the 64-bit world. There is ample evidence that ZFS basically wants to grow unbounded in proportion to the workload that you give it. Indeed, even Sun recommends basically throwing more RAM at most problems. Again, tuning is often needed, and I think it’s fair to say that it can’t be expected to work on arbitrary workloads out of the box.

A followup added:

I guess what makes me mad about ZFS is that it’s all-or-nothing; either it works, or it crashes. It doesn’t automatically recognize limits and make adjustments or sacrifices when it reaches those limits, it just crashes. Wanting multiple gigabytes of RAM for caching in order to optimize performance is great, but crashing when it doesn’t get those multiple gigabytes of RAM is not so great, and it leaves a bad taste in my mouth about ZFS in general.

Anyway, I’m not trying to dump on ZFS for these problems, because some are related to the BSD port and manual tuning is not unreasonable in high-end storage applications. The thing that gets me is that the hype is generally among user groups where such constraints would not be appropriate. For example, all of the buzz about OS X getting ZFS (and possibly being the default filesystem): based on the information above, it does not fit into the “Mac ethos” of “it just works.” Sure, it may get there one day, but why did a lot of people get all worked up about the availability of OS X binaries when they probably won’t be ready for general use for a long time? I guess it’s pre-excitement.

Now, as for ZFS’s unorthodox design choices: the designers essentially decided to collapse (or induce a tighter coupling between) many storage layers, including volume management and striping/RAID, and make them a more integrated part of the filesystem. In fact, Linux kernel developer Andrew Morton famously called ZFS a “rampant layering violation” (ZFS developer Jeff Bonwick replies to Morton’s comment here). Normally you have a separate filesystem-agnostic volume manager (like FreeBSD’s Vinum and Linux’s LVM/LVM2), and potentially RAID below that. The presence of those layers, however, effectively virtualizes some aspects of the underlying disks and may hide information crucial for making layout decisions that affect performance. In the proceedings of HotOS X (2005), this point is well articulated in Lex Stein’s “Stupid File Systems Are Better.” In that paper, the author argues that a simple (“stupid”) filesystem that does random block layout has more uniformly good performance than filesystems with sophisticated layout policies (because the policies make assumptions about the underlying layout which may be completely invalidated by striping or other issues). Sun took the opposite approach: instead of making their filesystem stupid, Sun removed the layers between the disks and the filesystem, thus giving the filesystem more information to make layout decisions. Portability is another good reason not to rely on layering; if you need to ensure that certain volume management features are available on all platforms, you may have to “bring your own.” I don’t think that was a major factor in Sun’s ZFS, however, because I believe it was meant to be a compelling reason to use Solaris.
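
To make the striping point a bit more concrete, here is a minimal sketch of how a plain RAID-0 style layer might map the logical block numbers a filesystem sees onto physical disks. This is only an illustration under assumed parameters: the function name, the four-disk array and the two-block stripe unit are all made up, and it is not taken from Stein’s paper or from any real volume manager.

```python
def stripe_location(logical_block, num_disks, stripe_blocks):
    """Map a logical block to (disk index, block offset on that disk).

    Assumes simple RAID-0 striping: consecutive stripe units of
    `stripe_blocks` blocks are dealt round-robin across `num_disks` disks.
    """
    stripe_index, within = divmod(logical_block, stripe_blocks)
    disk = stripe_index % num_disks
    offset = (stripe_index // num_disks) * stripe_blocks + within
    return disk, offset


# Eight logically contiguous blocks on a hypothetical 4-disk array
# with a stripe unit of 2 blocks.
for block in range(8):
    disk, offset = stripe_location(block, num_disks=4, stripe_blocks=2)
    print(f"logical {block} -> disk {disk}, offset {offset}")
```

Under these assumptions, a file that the filesystem carefully placed in eight “adjacent” blocks ends up spread over all four disks, two blocks at a time. A layout policy that reasons about adjacency or seek distance in terms of logical block numbers is therefore reasoning about a disk that does not exist, unless the striping layer tells it otherwise.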

I’m somewhat ambivalent about the decision to couple layers, because there are good arguments both for and against. Sun has invested significant effort in engineering complex artifacts where the added complexity ultimately didn’t pay off, like Solaris M:N threading, which was an impressive effort but was ultimately ditched for simpler 1:1 user/kernel thread designs (Sun’s whitepaper explaining the Solaris 9 switch from M:N to 1:1 threading)*. An influential systems paper, “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism” (appearing in SOSP and later in TOCS), argued for the two-level, M:N approach, but also noted it was critical to share information between the user-level scheduler and the kernel-level thread scheduler. In this case, strict layering with both schedulers oblivious to each other would lead to poor performance decisions (and possibly deadlock).

But, in the realm of filesystems, there are also good arguments for clean layering (particularly once distribution is involved). Chandu Thekkath of Microsoft Research is a very strong advocate of structuring filesystems/storage abstractions in clean simple layers. For example, Frangipani, a distributed filesystem, is built on top of a virtual distributed disk, Petal. In the Frangipani paper, the authors praise the layered approach for making the filesystem very simple and quick to develop. In addition, the layering itself provides parallelism and the simplicity of the implementation allows it to be quite fast, despite the fact that information sharing in non-layered implementations may open up potential optimization opportunities. A more recent paper of his describes Boxwood, which is another distributed storage mechanism. Instead of a filesystem-like interface, however, it provides either distributed, replicated block-like storage or persistent data structures (also distributed and replicated). It again argues convincingly about the benefits of designing a system with clean and simple layers rather than complex tight coupling.

More widely distributed/peer-to-peer storage systems like CFS, described in “Wide-area cooperative storage with CFS”, are commonly built in several layers. CFS is a read-only filesystem built on top of a distributed block storage system, DHash, which is itself built upon Chord, a peer-to-peer overlay network middleware (basically like half of a DHT: hashing/routing without storage; DHash provides storage). DHash and Chord were later used to implement Ivy, a peer-to-peer read-write filesystem. Similarly, both PAST, a peer-to-peer data publishing/archival system (immutable data; not a read-write filesystem), and Pastiche, a cooperative-storage based backup system, are layered on Pastry, a feature-rich peer-to-peer routing/DHT-like system. OceanStore, another wide-area distributed storage system, was itself built upon Tapestry, another peer-to-peer DHT/overlay network.
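
The routing/storage split is easy to sketch. What follows is a toy, single-process illustration of the layering idea only, not the actual Chord or DHash protocols: real Chord routes between nodes using finger tables, whereas this sketch just keeps a sorted list of node IDs, and the class names `Ring` and `BlockStore` are hypothetical.

```python
import hashlib
from bisect import bisect_left


def ring_id(data: bytes) -> int:
    """Hash bytes onto the identifier ring (SHA-1, as a 160-bit integer)."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


class Ring:
    """Routing layer only: maps a key to the responsible node, stores nothing."""

    def __init__(self, node_names):
        self.by_id = {ring_id(name.encode()): name for name in node_names}
        self.ids = sorted(self.by_id)

    def lookup(self, key: int) -> str:
        # The responsible node is the key's successor on the ring,
        # wrapping around to the smallest ID past the largest one.
        i = bisect_left(self.ids, key) % len(self.ids)
        return self.by_id[self.ids[i]]


class BlockStore:
    """Storage layer: content-addressed blocks, placed via the routing layer."""

    def __init__(self, ring: Ring):
        self.ring = ring
        self.nodes = {}  # node name -> {key: block}, standing in for remote nodes

    def put(self, block: bytes) -> int:
        key = ring_id(block)
        self.nodes.setdefault(self.ring.lookup(key), {})[key] = block
        return key

    def get(self, key: int) -> bytes:
        block = self.nodes[self.ring.lookup(key)][key]
        assert ring_id(block) == key  # content hash doubles as an integrity check
        return block


store = BlockStore(Ring(["node-a", "node-b", "node-c"]))
key = store.put(b"some file block")
print(store.get(key))
```

The storage layer never needs to know how lookups are performed; it only calls `lookup`. That narrow interface is what makes it plausible to reuse the lower layers under a different filesystem, as was done with Ivy.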

Anyway, now I’m just rambling through tangentially related work in distributed filesystems, but I guess the question is whether the added complexity of ZFS will pay off versus something like btrfs or HAMMER on top of a good volume manager and RAID. I guess time will tell, but I’m sympathetic to arguments for both alternatives: on one hand, increasing coupling between layers allows you to optimize, but decreasing coupling may make each layer simpler. Given finite development time/effort, it’s easier to perfect and optimize simple artifacts than complex ones. As for my previous examples, one might say they aren’t directly relevant in that distribution nearly always suggests a clean layered design because the complexity is just too high otherwise, whereas a local filesystem with locally attached disks may benefit from cooperation between layers because they do similar things and you gain performance and flexibility within the confines of a specific filesystem. On the other hand, separate layers give you more horizontal flexibility (i.e. if you definitely need to use something other than ZFS): for example, Linux’s LVM2 supports snapshots on many filesystems at the volume manager level, but they’re not as flexible or fast as ZFS.
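
For a rough idea of what a filesystem-agnostic snapshot at the block layer looks like, here is a minimal copy-on-first-write sketch. It is a toy model in the general spirit of volume-manager snapshots, not LVM2’s actual implementation, and the `Volume` class and its methods are hypothetical names.

```python
class Volume:
    """Toy block device with copy-on-first-write snapshots."""

    def __init__(self, num_blocks, block_size=4):
        self.blocks = [bytes(block_size)] * num_blocks
        self.snapshots = []

    def snapshot(self):
        # A snapshot starts empty; old block contents are saved lazily.
        snap = {}  # block number -> contents at snapshot time
        self.snapshots.append(snap)
        return snap

    def write(self, n, data):
        # Before overwriting, preserve the old block in every snapshot
        # that has not already saved its own copy of block n.
        for snap in self.snapshots:
            snap.setdefault(n, self.blocks[n])
        self.blocks[n] = data

    def read(self, n, snap=None):
        # Snapshot reads fall through to the live volume for unchanged blocks.
        if snap is not None and n in snap:
            return snap[n]
        return self.blocks[n]


vol = Volume(num_blocks=4)
vol.write(0, b"old!")
snap = vol.snapshot()
vol.write(0, b"new!")
print(vol.read(0), vol.read(0, snap))
```

The volume knows nothing about files, which is why this style of snapshot works under any filesystem. In this sketch it is also why every write to the live volume does extra work for each active snapshot, which hints at one way a block-level approach can end up slower than a design where the filesystem itself tracks which blocks are shared.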

Anyway, I guess the whole impetus behind this post was that I’m bothered by the level of hype I’m seeing in certain circles over ZFS (and the marketing label “the last word in file systems” doesn’t help). Sure, it seems like impressive engineering, but nothing particularly groundbreaking or revolutionary. I’ll be interested to see how it ultimately turns out in competing against various other filesystems in development.

* Incidentally, FreeBSD, one of the last major holdouts with M:N threading (along with NetBSD), is also purportedly switching to 1:1 threading by making the 1:1 libthr the default threading library in the very-soon-to-be-released FreeBSD 7.

6 Comments | Tags: Research Content

Comments:

  1. Hi David,

    That was a very interesting and informative post.

    I am myself working on Lustre and ZFS, and I would like to make a point that ZFS is in fact structured in layers.
    The difference is that they are not the same “dumb” layers as the typical disk/striping|raid/volume management/filesystem layers which have a simple block-based interface. In ZFS the layers have a richer interface and are a bit more sophisticated.

    You can see an explanation of the different layers of ZFS here: http://opensolaris.org/os/community/zfs/source/

    However, it is true that a couple of these layers are tightly coupled with one another. But just because Sun calls ZFS the aggregation of those layers, it does not mean that you couldn’t easily develop another filesystem on top of the DMU (or that you couldn’t develop another cache mechanism besides the ARC, or another name-value interface besides ZAPs, or another journaling mechanism besides the ZIL, etc..).

    A good way to make that clear is that for example, Lustre (and pNFS too) will be interfacing with the DMU directly, instead of interfacing with the ZPL layer which is what makes ZFS a filesystem.

  2. Joe Uhl says:
    19 Jan 2008 - 9:06

    There’s often recurring mention of ZFS on the PostgreSQL performance list, I think primarily because Sun builds very large disk enclosures so it ends up being a target for higher-grade databases. The general theme seems to be that performance is wonderful once it has been heavily tuned to the DB workload but nothing special until then.

    In my limited experience we’ve never needed more than Linux, a good RAID controller with battery-backed cache, and lots of disks in RAID 10 to get performance.

    There’s a presentation benchmarking Postgres and MySQL on FreeBSD 7 demonstrating the improved performance of both as they scale to 8 cores with various threading models on that platform.

    http://people.freebsd.org/~kris/scaling/7.0%20Preview.pdf

  3. What I’d really like to see is ptrace ported to the Linux kernel. Not sure if anyone at Sun would be interested in making this happen.

  4. Thanks for the comments and additional info guys.

    Ricardo:
    I didn’t mean to imply that ZFS isn’t layered internally (or is just a big monolithic messy blob), but as I noted and you also mentioned, I was referring more to the coupling between layers and Sun’s choice to re-implement functionality traditionally performed at lower levels of the storage stack (some layers of which Solaris already provided separately) and do it differently. If you re-implement the lower levels in conjunction with the ZFS implementation, it gives you a chance to make the interfaces (as you also said) richer and better suited to ZFS.

    But their design choices affect several properties of the layers you get: the level of specificity/generality, potential information sharing, and stability of interfaces. Is the structure internal and subject to change or will certain layers become first class components with crystallized interfaces that could have many different clients? How specialized are some of the facilities to ZFS (e.g. could pooled storage replace Solaris’s old volume manager, and if so, why wouldn’t they push in that direction)?

    Maybe the issue is more of how Sun pitches it than anything else: they could say, “the traditional interfaces to lower level storage layers like volume managers and RAID hide too much information from the clients above. We propose a new abstraction, the ‘pooled storage manager,’ which performs similar duties as these layers but has a richer, more flexible interface that can benefit the pieces above.” But they haven’t quite pitched it like that. It’s not a question of whether one could implement other stuff on top of the pooled storage layer; the question is whether Sun has committed to keeping that interface stable for this purpose, or if it might change with future ZFS changes. Does Sun intend for the only clients of the pooled storage layer or the data management unit to be ZFS and Lustre/ZFS (i.e. Sun maintained projects)?

  5. AG:
    I’m assuming you mean DTrace, Sun’s excellent dynamic instrumentation framework?

  6. Hi David,

    I believe you are right. Sun does not pitch these layers as something that could be used by other consumers, nor does Sun intend to keep the interfaces stable, at least on a short- to mid-term basis.

    To answer your question, as far as I know Sun’s intention is to have ZFS (and also ZVols), Lustre and pNFS as consumers of the pooled storage layer.

    Those are the projects that I know are being actively developed, but who knows what else will come along in the future :)
