24 January 2008 - 3:44More on layers and coupling

After my last (quite long) post, I was thinking about filesystem layering and coupling/interfaces between independent components. I wanted to post three mostly unrelated ideas on the same general theme. I’ll post the two most related to the previous post now and make the third a future post:

Changing traditional storage layering for distribution
In very large distributed filesystems (getting into the multi petabyte range), traditional RAID is often too weak and constrained for redundancy. By traditional RAID, I mean RAID that is implemented in hardware or software and is effectively invisible to the file system (sits below the block level). Say you had thousands of multi-disk RAID-5 arrays. In this situation, the probability that you’ll lose an entire array at some point is probably going to be make people nervous, particularly the kind of people who would have such massive storage systems. Depending on the scale, you could try to manage with hot spares and round-the-clock IT staff or increase the redundancy by going to RAID-6 or RAID-1, but you still have dangerous and constraining locality in your redundancy. This is more important when distribution is involved: what if you lose a controller or network connection or some other local aspect that takes an entire array effectively offline (or an entire chassis/rack or entire SAN or even an entire datacenter)?

A presentation titled “Storage Challenges for Petascale Systems” given by Dilip D. Kandlur, Director of IBM’s Storage Systems Research talks about these challenges in the context of petaflop systems with tens or even hundreds of petabytes of storage. These systems might have 100k-150k disk drives! The presentation notes:

RAID-5 is dead at petascale; even RAID-6 may not be sufficient to prevent data loss
Simulations of file system size, drive MTBF, failure probability distribution show 4%-28% chance of data loss over five-year lifetime for 8+2P code.

The probability of failure is unacceptably high even with the double parity of RAID-6, but triple parity gives you several orders of magnitude lower mean time to data loss. With these challenges in mind, GPFS is adding software RAID to support such stronger RAID codes not typically supported by RAID controller hardware (triple parity). In addition, it will support what is called “declustered RAID” (see Parity Declustering for Continuous Operation in Redundant Disk Arrays) which significantly improves load balancing during rebuild (see slides 11-13 for a great visual depiction of the way declustered RAID works). See also “The Challenges of Storage System Growth” a presentation by Denis Serenyi of Symantec, which covered some related issues.

The stronger software RAID helps at one level and allows you to using striping for throughput, but it doesn’t really deal with the location-based redundancy. The context of the previous presentation is primarily extremely large but single-site HPC systems, so it doesn’t discuss this issue much, but when you have a distributed system spanning many locations you need to consider it. In principle you could span multiple datacenters with your RAID layout, but that wouldn’t work very well from a performance perspective. RAID works best with symmetric (and predictable) latencies between devices; moreover, it’s unnecessary if you just want to deal with failure, because you can handle it much more robustly at a higher layer. Most large distributed systems provide for redundancy at the level of larger storage granules: files, objects (if you’re using an object-store based system), or perhaps larger blocks or “chunks.” For example, the Google File System (GFS, not to be confused with Red Hat’s Global File System, another distributed filesystem also named GFS) stores files as a series of 64MB chunks and each chunk is replicated. The paper notes the importance of replicating chunks on different racks:

We must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.

Ceph, a recent distributed filesystem (see Ceph: A Scalable, High-Performance Distributed File System in OSDI ’06), uses an underlying object store model and replicates at the level of objects:

In contrast to systems like Lustre [4], which assume one can construct sufficiently reliable OSDs using mechanisms like RAID or fail-over on a SAN, we assume that in a petabyte or exabyte system failure will be the norm rather than the exception, and at any point in time several OSDs are likely to be inoperable. To maintain system availability and ensure data safety in a scalable fashion, RADOS manages its own replication of data using a variant of primary-copy replication [2], while taking steps to minimize the impact on performance. Data is replicated in terms of placement groups, each of which is mapped to an ordered list of n OSDs (for n-way replication).

In Ceph, the data for a traditional file at the filesystem-level may consist of many underlying objects in the object store (a file is striped across objects named by combining an inode and a stripe number), so this is similar to replicating at a “chunk” or large block level. GPFS, in addition to the planned lower level declustered/striped strong parity strategy, already has file data and metadata replication (which can be controlled on a per-file basis). These features are actually a part of a rich set of ILM (Information Lifecycle Management) features that allow you to define different policies for various data on the same filesystem in a SQL-like declarative language. For example, you can create a policy that a certain directory subtree should be stored on a pool of faster disks and have a specific, higher replication factor than other files. Or you could make a policy to have the system gradually decrease the replication factor of files that haven’t been accessed in a long time, finally migrating it to offline, external storage after a certain threshold.

The various DHT-based filesystems/storage systems (CFS, Ivy, PAST, Pastiche, OceanStore, etc.) mentioned in my previous post also replicate pieces of files or entire files on multiple nodes. These systems are designed for more widely distributed and dynamic environments so they have to deal with things like significant node churn (nodes not being powered on/connected all the time or leaving the system permanently); it is critical to adopt a replication strategy that is easy to maintain in such circumstances. In such systems is it also accepted that some files may be temporarily unavailable or lost permanently due to a loss of all replicas — most files will be fine, but you don’t necessarily set replication parameters for losing a given file to the same low probability of failure as RAID type strategies. Note the difference in the nature of redundancy and the reasons for doing so: losing a piece of a file might be bad for the user of a file, but the rest of the filesystem is fine. In the case of RAID, where a coherent filesystem’s data and metadata are striped indiscriminately across several disks, losing all replicas of a block could mean that the filesystem’s metadata is damaged, which could cause serious problems.

Anyway, I just think it’s interesting to note how the traditional storage layering evolves in the face of distribution and large datasets. With local disks and filesystems, people tend to put replication below everything and provide a replicated block device. When distribution is involved, it becomes more flexible to think of replication in the context of filesystem entities like files or chunks.

A violation of layering by DHash
My last post got quite long so I didn’t remember to include every interesting footnote and piece of trivia. One interesting “layering violation” for efficiency in the related work I listed is in DHash, the distributed block storage layer built on top of Chord. The authors of the CFS SOSP paper note:

DHash has its own implementation of the Chord lookup algorithm, but relies on the Chord layer to maintain the routing tables. Integrating block lookup into DHash increases its efficiency. If DHash instead called the Chord find successor routine, it would be awkward for DHash to check each server along the lookup path for cached copies of the desired block. It would also cost an unneeded round trip time, since both Chord and DHash would end up separately contacting the block’s successor server.

That’s obviously a case in which duplicating code and violating layering is a good tradeoff, since it eliminates a costly network round trip. Also, since DHash and Chord are maintained by the same entity, it is unlikely to be particularly painful. However, it does make me wonder if Chord’s dead simple interface is just too austere. The only function it provides is the ability to find the successor for a node, which is too basic for DHash (at least without compromising performance). Most competing DHT solutions (e.g. Pastry, Tapestry, CAN) didn’t separate the hash lookup primitive into a separate externally-distinguished artifact. I really like the idea of separating the hashing/routing from storage policy, and the single successor primitive is appealing for its simplicity, but perhaps the interface at the split should have been richer.

No Comments | Tags: Research Content

Add a Comment