7 March 2010 - 17:30Storage and scale

In 2008, I worked at IBM Research Almaden on a project called Panache. I was recently pleased to find that an in-depth paper on Panache appears in the proceedings of FAST ’10 (PDF).

The main impetus for my post was not specifically on Panache, however. Over the years I’ve been thinking about the evolution of storage as the scale grows, and it comes up in conversation maybe once every six months. Most recently, it came up during a job interview when talking about some of the projects I’d worked on. I made a note to myself to post something on this.

The growth of storage solutions tends to follow a certain trajectory. In the following I’m going to be very loose with definitions and paint with a broad brush. True backup and redundancy for availability are different, but here I’m just covering redundancy for availability:

  • Single drive / JBOD — home users tend to start out with one hard drive or a collection of disks; there’s no redundancy here so they may just manually sync to another drive or system to guard against failure (or the most popular form of failure contingency: nothing).
  • Single-host RAID / NAS — after home users’ storage needs increase in scale, they may move to a RAID solution which adds some disk-level redundancy. RAID isn’t a backup solution (you can still delete or corrupt your own files by accident), but it helps give you some redundancy in the face of drive failure. Maybe these drive arrays are in a home-network visible NAS exporting storage via NFS or CIFS or maybe they’re just in one main system where you use the storage locally. Small enterprises may use this kind of NAS solution too. Crude scaling “up” may be achieved by having several different NAS appliances serving separate sets of files.
  • SAN / “Enterprise” storage — after you reach a certain level of scale, you want to eliminate the single point of failure imposed by the NAS / single host with a RAID setup. You’re protected from a drive failure, but the host could still fail, the network connection to the host could fail, the drive controller could fail, etc. At this point you tend to move to a storage area network (SAN) setup, often from a company like NetApp, IBM or Panasas. The idea behind a SAN is that your machines connect to your disks like switched local networking. You may have 16 machines and 128 disks, and all machines can talk to any disk. If a machine fails or a controller fails, the rest of the machines can still talk to the same disks just fine. Generally there is some redundancy at the IO controller level, too. These are typically coordinated by a shared-disk filesystem like GPFS or OCFS2.
    In a large enterprise, it is impractical for all machines that need to access the storage to have a direct SAN connection, so in this case you may have a small number of storage-connected NAS hosts that export the same coherent underlying shared-disk filesystem via NFS or CIFS. Then clients can access any exporting host via a standard network filesystem protocol, and any NAS host can die without the entire filesystem becoming inaccessible. Redundancy may be achieved by layering the filesystem over redundant disk arrays in the SAN, or by doing parity / Reed Solomon coding within the filesystem on individual disks (I talked about “Declustered RAID” in an earlier post), or by doing file-level and metadata-level replication.
  • Google Filesystem — “Enterprise” storage based on SANs is way more expensive per unit of storage than commodity disks, and there is a limit to how far you can scale “out” the number of hosts with access to the storage and scale “up” the amount of storage. Once you get to the level of companies like Google or Amazon, you are already going to have sophisticated internal software for managing failures when dealing with your own computation. At this level of scale, it makes sense to build your own custom data storage solution using the same failure-handling principles with unshared, commodity disks. Google filesystem (GFS) is layered over a bunch of commodity servers with local non-redundant storage. The storage is aggregated into a coherent distributed filesystem and data is replicated at the “chunk” level (a large sub-unit of a file). This will cost a lot less for more storage and you can build in all kinds of special considerations for your own uses (e.g. MapReduce works in concert with GFS to try to process data locally where it is stored).

As I mentioned above, I’m really glossing over a lot here and speaking in generalities. Some things I’ve glossed over:

  • In reality there are very large scale systems that do use SANs (i.e. HPC clusters for scientific simulation) rather than local storage and custom software. Batch, shared-nothing crunching via MapReduce has different data requirements than tightly coupled parallel MPI jobs. The final stage of “scale” listed above is really about what you need to do with all of that data.
  • GFS/HDFS aren’t replacements for traditional POSIX filesystems — they are accessed through a library, they don’t provide the same semantics (GFS has atomic appends which may append the data more than once), and they were designed to be used in a specific context for batch data analysis in concert with MapReduce/Hadoop. Given the current state of development, you probably wouldn’t archive important datasets within HDFS. The finer details of Google’s own strategies are private, but given the intended uses of BigTable, it’s probable that a lot of data is actually archived within GFS on top of BigTable. This makes sense if the intended uses of the datasets are all within the framework of MapReduce — jobs that can read out of BigTable.
  • Large internet companies with these specialized storage solutions still probably use SAN-based systems for some purposes. It may not make sense, for example, to store source repositories in GFS or BigTable or to try to export data in GFS like an NFS share.
  • There are, of course, many other things that don’t fit so neatly into these categories. For example, a while back I wrote a post where I talked about distributed peer-to-peer filesystems that run over wide-area networks (often built on DHTs).

As an aside, it’s interesting how there is an somewhat parallel scaling trajectory to databases as well. You start off with stock relational databases, then you tend to shard and denormalize, then from there you may loosen semantics further, interpose memcached, try to do custom transactional replication, etc.; if that can’t keep up, you may eventually progress to your own eventually consistent key-value store or document DB or non-relational tabular data storage solution or any other number of interesting variants. But I will save that for some future post.

1 Comment | Tags: Research Content