27 January 2008 - 15:46
Automated mirror selection for LUG PXE installs

I’ve posted several times about the PXE-based install server I created and maintain for our local LUG’s installfests. We can network boot installers for all of the different Linux distros we support as well as various BSDs (and a GParted live image). The various Linux net-install images require users to specify a network mirror from which to retrieve installation files. For some distros (RHEL, Ubuntu, Fedora), I maintain local mirrors on the PXE server to speed installations (and in the case of RHEL, because there are no public network mirrors; our school has a site license that affords every student a legitimate copy, so we can only do RHEL installs for attendees who are students). For most other distros, we use our school’s local mirror, gtlib.gatech.edu.

For the first few installfests where we used the PXE server, I printed out several copies of a sheet with the various network mirrors for different distros. It would be something like this:

  • Red Hat Enterprise Linux 5.1: http://10.0.0.2/rhel5.1/i386/
  • Ubuntu 7.10 (gutsy): http://10.0.0.2/pub/ubuntu/
  • Fedora 8: http://10.0.0.2/fc8/i386/os/
  • Debian testing: http://www.gtlib.gatech.edu/pub/debian/
  • openSUSE 10.3: http://128.61.111.11/pub/opensuse/distribution/10.3/repo/oss/
  • Mandriva 2008.0: http://www.gtlib.gatech.edu/pub/mandrake/official/2008.0/i586/
  • Gentoo 2007.0: http://10.0.0.2/gentoo (stage3 tarballs)

Now, obviously it was a minor pain to fill in that information repeatedly for each install, so I decided to automate it. For Red Hat-based distros, I used a kickstart configuration file to specify the mirror, and for Debian-derived distros I used the installer “preseed” mechanism. Mandriva and openSUSE also provide features to do the same. Here’s how I set up each distro to automatically find the mirrors:


Red Hat Enterprise Linux / Fedora
For each distro version and architecture combination, create a kickstart configuration file available via HTTP. Here’s an example for Fedora 8 i386:

interactive
network --bootproto dhcp --noipv6
url --url http://10.0.0.2/fc8/i386/os/
firstboot --enable

The “interactive” and “firstboot --enable” parts are important because the installer assumes that you are doing a semi- or entirely automated installation if you use a kickstart configuration file. Without the “firstboot” line, it won’t give you the first boot system configuration dialogs where you create a non-root user and configure sound, video, etc. I wanted these installs to be basically identical to a manual install except with the network mirror pre-selected.
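Since one kickstart file is needed per version and architecture, generating them with a small script avoids copy-and-paste mistakes. The following is only a sketch: it assumes the x86_64 tree follows the same URL layout as the i386 example, and WEBROOT is a hypothetical document root that must be adjusted to wherever the server at 10.0.0.2 actually serves /fc8/ from.

```shell
#!/bin/sh
# Print a Fedora 8 kickstart file for one architecture.
# Assumption: every architecture lives under http://10.0.0.2/fc8/<arch>/os/,
# as the i386 example does.
make_ks() {
    arch="$1"
    cat <<EOF
interactive
network --bootproto dhcp --noipv6
url --url http://10.0.0.2/fc8/${arch}/os/
firstboot --enable
EOF
}

# WEBROOT is a hypothetical document root, not taken from the real server.
WEBROOT="${WEBROOT:-/var/www/html}"

# Write ks.cfg for each architecture next to its install tree.
write_all() {
    for arch in i386 x86_64; do
        mkdir -p "$WEBROOT/fc8/$arch" &&
            make_ks "$arch" > "$WEBROOT/fc8/$arch/ks.cfg"
    done
}
```

Calling write_all as a user who can write to the web root produces one ks.cfg per architecture; make_ks on its own just prints the file so it can be inspected first.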

Now, all you have to do to use the kickstart config is edit your pxelinux configuration file. Append “ks=url_to_file” to the kernel boot parameters of each entry. E.g.:

LABEL fedora8_x86_64
kernel fedora/8/x86_64/vmlinuz
append initrd=fedora/8/x86_64/initrd.img ks=http://10.0.0.2/fc8/x86_64/ks.cfg


Debian / Ubuntu
The Debian installer (which is also used for Ubuntu network and alternate installs) allows you to “preseed” answers to all installer prompts. For each distro, you only need a single file for all version and architecture combinations assuming they all use the same mirror site (versions and architecture are not explicit in the mirror URL). Create a preseed configuration file like this:

d-i mirror/protocol string http
d-i mirror/country string enter information manually
d-i mirror/http/hostname string 10.0.0.2
d-i mirror/http/directory string /pub/ubuntu/
d-i mirror/http/proxy string
d-i apt-setup/security_host string

After creating the preseed file, simply append “preseed/url=url_to_file” to the kernel boot parameters of each entry. E.g.:

LABEL ubuntu_gutsy_i386
kernel ubuntu/gutsy/i386/ubuntu-installer/i386/linux
append vga=normal initrd=ubuntu/gutsy/i386/ubuntu-installer/i386/initrd.gz preseed/url=http://10.0.0.2/ubuntu/lug.cfg --

One problem with using a local mirror for Ubuntu (10.0.0.2/pub/ubuntu) is that we have to go back and change the /etc/apt/sources.list file to point to a public mirror after install; otherwise, after leaving the installfest, a user would be trying to use a non-existent Ubuntu mirror on RFC1918 IP space. I also used the preseed configuration file to automatically replace the sources.list file after installation. The “preseed/late_command” option allows you to run commands just before the install finishes (the root of the new system is in /target at this point). Here is the slightly hackish entry I use to fix the sources.list:

d-i preseed/late_command string cd /target/tmp ; wget http://10.0.0.2/ubuntu/fix_sources.sh ; cd /target ; sh tmp/fix_sources.sh

The fix_sources.sh simply replaces the entries in sources.list to point to a public mirror.
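The real fix_sources.sh is not reproduced here, but a script of this kind could look roughly like the sketch below. It assumes the installer environment provides sed and mv, and the public mirror URL (archive.ubuntu.com) is an assumed choice; any public Ubuntu mirror would work. Because the late_command above runs the script from /target, the default path is relative. SOURCES_LIST is a hypothetical override used only to try the script elsewhere.

```shell
#!/bin/sh
# Hypothetical sketch of fix_sources.sh; the actual script is not shown in this post.
# LOCAL is the installfest mirror set in the preseed file.
# PUBLIC is an assumed replacement mirror.
LOCAL='http://10.0.0.2/pub/ubuntu'
PUBLIC='http://archive.ubuntu.com/ubuntu'

# Read a sources.list on stdin and write the corrected version to stdout.
rewrite_sources() {
    # Escape the dots so they match literally in the sed pattern.
    pattern=$(printf '%s' "$LOCAL" | sed 's/\./\\./g')
    sed "s|$pattern|$PUBLIC|g"
}

# The script is run with /target as the working directory, hence the relative path.
LIST="${SOURCES_LIST:-etc/apt/sources.list}"
if [ -f "$LIST" ]; then
    rewrite_sources < "$LIST" > "$LIST.new" && mv "$LIST.new" "$LIST"
fi
```

Keeping the rewrite in a stdin-to-stdout function makes it easy to check against a sample sources.list before trusting it during an install.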

One other trick I used with the Ubuntu installer is to disable the supremely annoying “Automatic Keyboard layout detection” mechanism. It prompts you “Yes/No,” but the default is Yes, so many people select it unwittingly. The result is a long and irritating process of pressing various keys on the keyboard, which could have been avoided in one second by simply selecting “American English” (99.9% of the time) from the keyboard layout menu. If you append “console-setup/ask_detect=false” to the kernel parameters of the installer image, it will go directly to the keyboard layout menu as if you had selected “No” to keyboard autodetection.


openSUSE
The openSUSE installer supports setting the installation mirror source by passing “install=url_to_repository” as a kernel parameter to the installer. As with Debian, the mirror path does not change between architectures, but it does change between versions. For example:

LABEL opensuse10.3_i386
kernel opensuse/10.3/i386/linux
append initrd=opensuse/10.3/i386/initrd splash=silent showopts install=http://128.61.111.11/pub/opensuse/distribution/10.3/repo/oss

The URL provided uses a numeric IP address rather than a hostname because the openSUSE installer doesn’t handle DNS resolution (at least last time I checked; it may have been fixed in the meantime).


Mandriva
Mandriva supports mirror selection through the “automatic=config_list” kernel parameter, where config_list is a comma-separated list of key/value pairs in the form “key:value”. To set the mirror, one could specify the string as follows: “automatic=method:http,network:dhcp,server:mirror_hostname,directory:mirror_path”. For example, here is an entry:

LABEL mandriva2008.0_i586
kernel mandriva/2008.0/i586/vmlinuz
append initrd=mandriva/2008.0/i586/all.rdz vga=788 splash=silent automatic=method:http,network:dhcp,server:www.gtlib.gatech.edu,directory:/pub/mandrake/official/2008.0/i586

The full directory above is “/pub/mandrake/official/2008.0/i586,” but fixed-width “code” entries don’t word-wrap without putting extra spaces. Note that Mandriva’s mirror URLs also include both architecture and distro version.

Tags: PXE-related

24 January 2008 - 3:44
More on layers and coupling

After my last (quite long) post, I was thinking about filesystem layering and coupling/interfaces between independent components. I wanted to post three mostly unrelated ideas on the same general theme. I’ll post the two most related to the previous post now and make the third a future post:

Changing traditional storage layering for distribution
In very large distributed filesystems (getting into the multi-petabyte range), traditional RAID is often too weak and constrained for redundancy. By traditional RAID, I mean RAID that is implemented in hardware or software and is effectively invisible to the file system (sits below the block level). Say you had thousands of multi-disk RAID-5 arrays. In this situation, the probability that you’ll lose an entire array at some point is probably going to make people nervous, particularly the kind of people who would have such massive storage systems. Depending on the scale, you could try to manage with hot spares and round-the-clock IT staff or increase the redundancy by going to RAID-6 or RAID-1, but you still have dangerous and constraining locality in your redundancy. This is more important when distribution is involved: what if you lose a controller or network connection or some other local aspect that takes an entire array effectively offline (or an entire chassis/rack or entire SAN or even an entire datacenter)?

A presentation titled “Storage Challenges for Petascale Systems,” given by Dilip D. Kandlur, Director of IBM’s Storage Systems Research, talks about these challenges in the context of petaflop systems with tens or even hundreds of petabytes of storage. These systems might have 100k-150k disk drives! The presentation notes:

RAID-5 is dead at petascale; even RAID-6 may not be sufficient to prevent data loss
Simulations of file system size, drive MTBF, failure probability distribution show 4%-28% chance of data loss over five-year lifetime for 8+2P code.

The probability of failure is unacceptably high even with the double parity of RAID-6, but triple parity gives you a mean time to data loss that is several orders of magnitude higher. With these challenges in mind, GPFS is adding software RAID to support stronger RAID codes (such as triple parity) that are not typically supported by RAID controller hardware. In addition, it will support what is called “declustered RAID” (see Parity Declustering for Continuous Operation in Redundant Disk Arrays), which significantly improves load balancing during rebuild (see slides 11-13 for a great visual depiction of the way declustered RAID works). See also “The Challenges of Storage System Growth,” a presentation by Denis Serenyi of Symantec, which covered some related issues.
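To get a rough feel for why one more parity disk helps so much, consider the classic textbook approximation for mean time to data loss (MTTDL). This is not the simulation model used in the IBM presentation. It assumes independent, exponentially distributed drive failures, an array of N drives that tolerates m simultaneous failures, a per-drive mean time to failure MTTF, and a mean time to repair MTTR. The example numbers (N = 11, MTTF = 500,000 hours, MTTR = 24 hours) are illustrative assumptions only.

```latex
% Approximate MTTDL for an N-drive array tolerating m failures
% (m = 1: RAID-5, m = 2: RAID-6, m = 3: triple parity).
\mathrm{MTTDL}_m \approx
  \frac{\mathrm{MTTF}^{\,m+1}}{\mathrm{MTTR}^{\,m}\,\prod_{i=0}^{m}(N-i)}

% Gain from adding one more tolerated failure at fixed N:
\frac{\mathrm{MTTDL}_{m+1}}{\mathrm{MTTDL}_m}
  = \frac{\mathrm{MTTF}}{\mathrm{MTTR}\,(N-m-1)}

% Illustrative numbers, going from m = 2 to m = 3 with N = 11:
\frac{\mathrm{MTTDL}_3}{\mathrm{MTTDL}_2}
  = \frac{500{,}000}{24 \times 8} \approx 2.6 \times 10^{3}
```

Under these assumptions each extra parity disk buys roughly three orders of magnitude. The idealized model ignores correlated failures and unrecoverable read errors during rebuild, so real systems fare worse than the formula suggests.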

The stronger software RAID helps at one level and allows you to use striping for throughput, but it doesn’t really deal with the location-based redundancy. The context of the previous presentation is primarily extremely large but single-site HPC systems, so it doesn’t discuss this issue much, but when you have a distributed system spanning many locations you need to consider it. In principle you could span multiple datacenters with your RAID layout, but that wouldn’t work very well from a performance perspective. RAID works best with symmetric (and predictable) latencies between devices; moreover, it’s unnecessary if you just want to deal with failure, because you can handle it much more robustly at a higher layer. Most large distributed systems provide for redundancy at the level of larger storage granules: files, objects (if you’re using an object-store based system), or perhaps larger blocks or “chunks.” For example, the Google File System (GFS, not to be confused with Red Hat’s Global File System, another distributed filesystem also named GFS) stores files as a series of 64MB chunks and each chunk is replicated. The paper notes the importance of replicating chunks on different racks:

We must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.

Ceph, a recent distributed filesystem (see Ceph: A Scalable, High-Performance Distributed File System in OSDI ‘06), uses an underlying object store model and replicates at the level of objects:

In contrast to systems like Lustre [4], which assume one can construct sufficiently reliable OSDs using mechanisms like RAID or fail-over on a SAN, we assume that in a petabyte or exabyte system failure will be the norm rather than the exception, and at any point in time several OSDs are likely to be inoperable. To maintain system availability and ensure data safety in a scalable fashion, RADOS manages its own replication of data using a variant of primary-copy replication [2], while taking steps to minimize the impact on performance. Data is replicated in terms of placement groups, each of which is mapped to an ordered list of n OSDs (for n-way replication).

In Ceph, the data for a traditional file at the filesystem level may consist of many underlying objects in the object store (a file is striped across objects named by combining an inode and a stripe number), so this is similar to replicating at a “chunk” or large block level. GPFS, in addition to the planned lower-level declustered/striped strong parity strategy, already has file data and metadata replication (which can be controlled on a per-file basis). These features are actually a part of a rich set of ILM (Information Lifecycle Management) features that allow you to define different policies for various data on the same filesystem in a SQL-like declarative language. For example, you can create a policy that a certain directory subtree should be stored on a pool of faster disks and have a specific, higher replication factor than other files. Or you could make a policy to have the system gradually decrease the replication factor of files that haven’t been accessed in a long time, finally migrating them to offline, external storage after a certain threshold.

The various DHT-based filesystems/storage systems (CFS, Ivy, PAST, Pastiche, OceanStore, etc.) mentioned in my previous post also replicate pieces of files or entire files on multiple nodes. These systems are designed for more widely distributed and dynamic environments, so they have to deal with things like significant node churn (nodes not being powered on/connected all the time or leaving the system permanently); it is critical to adopt a replication strategy that is easy to maintain in such circumstances. In such systems it is also accepted that some files may be temporarily unavailable or lost permanently due to a loss of all replicas. Most files will be fine, but you don’t necessarily tune the replication parameters so that the probability of losing a given file is as low as it would be with RAID-type strategies. Note the difference in the nature of redundancy and the reasons for doing so: losing a piece of a file might be bad for the user of a file, but the rest of the filesystem is fine. In the case of RAID, where a coherent filesystem’s data and metadata are striped indiscriminately across several disks, losing all replicas of a block could mean that the filesystem’s metadata is damaged, which could cause serious problems.

Anyway, I just think it’s interesting to note how the traditional storage layering evolves in the face of distribution and large datasets. With local disks and filesystems, people tend to put replication below everything and provide a replicated block device. When distribution is involved, it becomes more flexible to think of replication in the context of filesystem entities like files or chunks.

A violation of layering by DHash
My last post got quite long so I didn’t remember to include every interesting footnote and piece of trivia. One interesting “layering violation” for efficiency in the related work I listed is in DHash, the distributed block storage layer built on top of Chord. The authors of the CFS SOSP paper note:

DHash has its own implementation of the Chord lookup algorithm, but relies on the Chord layer to maintain the routing tables. Integrating block lookup into DHash increases its efficiency. If DHash instead called the Chord find successor routine, it would be awkward for DHash to check each server along the lookup path for cached copies of the desired block. It would also cost an unneeded round trip time, since both Chord and DHash would end up separately contacting the block’s successor server.

That’s obviously a case in which duplicating code and violating layering is a good tradeoff, since it eliminates a costly network round trip. Also, since DHash and Chord are maintained by the same entity, it is unlikely to be particularly painful. However, it does make me wonder if Chord’s dead simple interface is just too austere. The only function it provides is the ability to find the successor node for a given identifier, which is too basic for DHash (at least without compromising performance). Most competing DHT solutions (e.g. Pastry, Tapestry, CAN) didn’t separate the hash lookup primitive into a separate externally-distinguished artifact. I really like the idea of separating the hashing/routing from storage policy, and the single successor primitive is appealing for its simplicity, but perhaps the interface at the split should have been richer.

Tags: Research Content

18 January 2008 - 23:38
ZFS hype?

Over the past year and a half or so, there’s been a lot of hype surrounding Sun’s ZFS (originally the “Zettabyte File System”). After the initial release, the “buzz” has come back in waves, peaking once with the initial porting of ZFS to FreeBSD (announced, merged), and later reappearing with (false) rumors of ZFS becoming the default filesystem in Mac OS X 10.5. This month another wave started with ZFS code and binaries for OS X being made available. Sun itself feeds the hype by touting ZFS as “the last word in file systems.” I’ve also been following Oracle’s btrfs (“Butter FS”), HAMMER by Matthew Dillon (of DragonFlyBSD), and ext4, which are all sort of taking feature cues from ZFS. The option for checksumming should have been common in filesystems long before now, so I’m glad to see it is finally becoming a mainstream feature. Cheaper snapshotting/transactional support is also nice, but that’s not as rare.

Personally, I don’t understand the reason for the large hype over ZFS, particularly in the context of OS X (more on that later). Now, it seems to be a fairly impressive engineering effort, but it is concentrating on an artifact that is somewhat pedestrian by now: a purely local filesystem. ZFS seems to be a good but still incremental improvement in local filesystem capabilities (with some unorthodox choices, but more on that later) — they took functionality that was previously available in different storage layers and increased coupling to improve performance and flexibility. Don’t get me wrong; there are certainly great things about ZFS, but in my book “the last word in file systems” would at least have to be distributed. From my perspective, the bulk of “cutting-edge” research in filesystems over the last decade has been on distributed or cluster filesystems. Of course, my research is generally in distributed systems and I worked on a distributed/parallel filesystem (IBM’s GPFS) this past summer, so I’m not an impartial bystander, but I think it’s non-controversial to say that storage transparently interfacing with the network is important now and will only become more important in the future. Now, Sun has indicated that they are going to use ZFS as a local storage backend for Lustre (since it currently uses ext3, this would be a big improvement), but that’s slightly different. Alternately, maybe if ZFS was like the eternal vaporware relational filesystem WinFS, I could see the hype being justified, particularly from end-users.

One of the reasons I’m baffled about the hype of ZFS on FreeBSD/OpenBSD/Mac OS X is that the port is not stable yet and there are some general ZFS issues that would limit its wide use. First of all, it needs a lot of memory and can panic or deadlock if it runs out of memory. It can really only run reliably on 64-bit machines (because it tends to exhaust kernel resources on 32-bit machines), but Sun is upfront about this. It also seems to be somewhat finicky and to require manual tuning to get good performance (and sometimes just to avoid crashing). A post to the FreeBSD mailing list titled “ZFS Honesty” summarizes the issues nicely:

But let’s also be honest about ZFS in the 64-bit world. There is ample evidence that ZFS basically wants to grow unbounded in proportion to the workload that you give it. Indeed, even Sun recommends basically throwing more RAM at most problems. Again, tuning is often needed, and I think it’s fair to say that it can’t be expected to work on arbitrary workloads out of the box.

A followup added:

I guess what makes me mad about ZFS is that it’s all-or-nothing; either it works, or it crashes. It doesn’t automatically recognize limits and make adjustments or sacrifices when it reaches those limits, it just crashes. Wanting multiple gigabytes of RAM for caching in order to optimize performance is great, but crashing when it doesn’t get those multiple gigabytes of RAM is not so great, and it leaves a bad taste in my mouth about ZFS in general.

Anyway, I’m not trying to dump on ZFS for these problems, because some are related to the BSD port and manual tuning is not unreasonable in high-end storage applications. The thing that gets me is that the hype is generally among user groups where such constraints would not be appropriate. For example, all of the buzz about OS X getting ZFS (and possibly being the default filesystem): based on the information above, it does not fit into the “Mac ethos” of “it just works.” Sure, it may get there one day, but why did a lot of people get all worked up about the availability of OS X binaries when they probably won’t be ready for general use for a long time? I guess it’s pre-excitement.

Now, as for ZFS’s unorthodox design choices: the designers essentially decided to collapse (or induce a tighter coupling between) many storage layers, including volume management and striping/RAID, and make them a more integrated part of the filesystem. In fact, Linux kernel developer Andrew Morton famously called ZFS a “rampant layering violation” (ZFS developer Jeff Bonwick replies to Morton’s comment here). Normally you have a separate filesystem-agnostic volume manager (like FreeBSD’s Vinum and Linux’s LVM/LVM2), and potentially RAID below that. The presence of those layers, however, effectively virtualizes some aspect of the underlying disks and may hide information crucial for making layout decisions impacting performance. In the proceedings of HotOS X (2005), this point is well articulated in Lex Stein’s “Stupid File Systems Are Better.” In that paper, the author argues that a simple (“stupid”) filesystem that does random block layout has more uniformly good performance than filesystems with sophisticated layout policies (because the policies make assumptions about the underlying layout which may be completely invalidated by striping or other issues). Sun took the opposite approach: instead of making their filesystem stupid, Sun removed the layers between the disks and the filesystem, thus giving the filesystem more information to make layout decisions. Portability is another good reason not to rely on layering; if you need to ensure that certain volume management features are available on all platforms, you may have to “bring your own.” I don’t think that was a major factor in Sun’s ZFS, however, because I believe it was meant to be a compelling reason to use Solaris.

I’m somewhat ambivalent about the decision to couple layers, because there are good arguments both for and against. Sun has invested significant effort in engineering complex artifacts where the added complexity ultimately didn’t pay off, like Solaris M:N threading, which was an impressive effort but was ultimately ditched for simpler 1:1 user/kernel thread designs (Sun’s whitepaper explaining the Solaris 9 switch from M:N to 1:1 threading)*. An influential systems paper “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism” (appearing in SOSP and later in TOCS) argued for the two-level, M:N approach, but also noted it was critical to share information between the user-level scheduler and the kernel level thread scheduler. In this case, strict layering with both schedulers oblivious to each other would lead to poor performance decisions (and possibly deadlock).

But, in the realm of filesystems, there are also good arguments for clean layering (particularly once distribution is involved). Chandu Thekkath of Microsoft Research is a very strong advocate of structuring filesystems/storage abstractions in clean simple layers. For example, Frangipani, a distributed filesystem, is built on top of a virtual distributed disk, Petal. In the Frangipani paper, the authors praise the layered approach for making the filesystem very simple and quick to develop. In addition, the layering itself provides parallelism and the simplicity of the implementation allows it to be quite fast, despite the fact that information sharing in non-layered implementations may open up potential optimization opportunities. A more recent paper of his describes Boxwood, which is another distributed storage mechanism. Instead of a filesystem-like interface, however, it provides either distributed, replicated block-like storage or persistent data structures (also distributed and replicated). It again argues convincingly about the benefits of designing a system with clean and simple layers rather than complex tight coupling.

More widely distributed/peer-to-peer storage systems like CFS, described in “Wide-area cooperative storage with CFS”, are commonly built in several layers. CFS is a read-only filesystem built on top of a distributed block storage system DHash, which is itself built upon Chord, a peer-to-peer overlay network middleware (basically like half of a DHT — hashing/routing without storage; DHash provides storage). DHash and Chord were later used to implement Ivy, a peer-to-peer read-write filesystem. Similarly, both PAST, a peer-to-peer data publishing/archival system (immutable data; not a read-write filesystem) and Pastiche, a cooperative-storage based backup system, are layered on Pastry, a feature rich peer-to-peer routing/DHT like system. OceanStore, another wide-area distributed storage system, was itself built upon Tapestry, another peer-to-peer DHT/overlay network.

Anyway, now I’m just rambling through tangentially related work in distributed filesystems, but I guess the question is whether the added complexity of ZFS will pay off versus something like btrfs or HAMMER on top of a good volume manager and RAID. I guess time will tell, but I’m sympathetic to arguments for both alternatives: on one hand, increasing coupling between layers allows you to optimize, but decreasing coupling may make each layer simpler. Given finite development time/effort, it’s easier to perfect and optimize simple artifacts than complex ones. As for my previous examples, one might say they aren’t directly relevant in that distribution nearly always suggests a clean layered design because the complexity is just too high otherwise, whereas a local filesystem with locally attached disks may benefit from cooperation between layers because they do similar things and you gain performance and flexibility within the confines of a specific filesystem. On the other hand, separate layers give you more horizontal flexibility (i.e. if you definitely need to use something other than ZFS): for example, Linux’s LVM2 supports snapshots on many filesystems at the volume manager level, but they’re not as flexible or fast as those of ZFS.

Anyway, I guess the whole impetus behind this post was that I’m bothered by the level of hype I’m seeing in certain circles over ZFS (and the marketing label “the last word in file systems” doesn’t help). Sure, it seems like impressive engineering, but nothing particularly groundbreaking or revolutionary. I’ll be interested to see how it ultimately turns out in competing against various other filesystems in development.

* Incidentally, FreeBSD, one of the last major holdouts with M:N threading (along with NetBSD), is also purportedly switching to 1:1 threading by making the 1:1 libthr the default threading library in the very-soon-to-be-released FreeBSD 7.

Tags: Research Content

2 January 2008 - 1:22
More pxelinux tricks: GParted LivePXE and PXE-booting DOS CDs

Well, it’s been a while since my last post because I’ve been busy writing a bunch and didn’t feel much like writing a blog entry in addition. Anyway, since my last post, LUG@GT held another InstallFest. This time I decided to add a PXE bootable GParted live distro so we could also repartition without involving extra optical media. In order to do that, I started from the base of the GParted LiveCD (which is also suitable for a LiveUSB version). The GParted LiveCD is based on Gentoo, so preparing a PXE bootable image is similar to how I must prepare the Gentoo installer for PXE booting. Since this is the most complicated image to prepare for PXE booting (relative to Ubuntu, Debian, Fedora, RHEL, OpenSUSE and Mandriva, which are the other distros we offer), I will first start with the instructions on making a Gentoo install PXE-bootable.

Read more…

Tags: PXE-related