<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>/dev/rant</title>
	<atom:link href="http://www.thegibson.org/blog/feed" rel="self" type="application/rss+xml" />
	<link>http://www.thegibson.org/blog</link>
	<description>Technology-related rantings of David Hilley</description>
	<pubDate>Mon, 10 Nov 2008 23:56:13 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Burton Smith: Reinventing Computing</title>
		<link>http://www.thegibson.org/blog/archives/32</link>
		<comments>http://www.thegibson.org/blog/archives/32#comments</comments>
		<pubDate>Sat, 08 Nov 2008 19:57:37 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
		
		<category><![CDATA[Research Content]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=32</guid>
		<description><![CDATA[Yes, I know I haven&#8217;t posted in a while, but I&#8217;ve been busy this semester: I finished my PhD thesis proposal a few weeks ago and I&#8217;m trying to prepare for a defense within a year or so.
Reinventing Computing
Anyway, recently I attended a talk by Microsoft Technical Fellow and computing architecture and HPC guru, Burton [...]]]></description>
			<content:encoded><![CDATA[<p>Yes, I know I haven&#8217;t posted in a while, but I&#8217;ve been busy this semester: I finished my PhD thesis proposal a few weeks ago and I&#8217;m trying to prepare for a defense within a year or so.</p>
<p><b>Reinventing Computing</b><br />
Anyway, recently I attended a talk by Microsoft Technical Fellow and computing architecture and HPC guru, <a href="http://www.microsoft.com/presspass/exec/techfellow/Smith/default.mspx">Burton Smith</a>.  The talk was very interesting though not specifically because of the subject matter (which is familiar), but because of the broad perspective.  The premise is the same thing I&#8217;ve been hearing in every other talk for the past few years (which I&#8217;ve <a>ranted about before</a>) &#8212; namely, parallelism is going mainstream and we have to deal with it.  Most people then follow with their specific sales pitch, but Burton followed with a broad overview of many different views of the issue and a wide range of different techniques and technologies.   </p>
<p>One thing I really liked about his overview is that he shares my philosophy of &#8220;pragmatism over dogmatism.&#8221;  He discussed potential alternatives and stated that there is a need to utilize a variety of different solutions, even those that are often seen as mutually opposing philosophies: for example, he said he believed that we need both message passing AND shared state concurrency, mutable state with transactions AND immutable functional objects, declarative programming AND imperative programming, etc.  This is in contrast to presentations where you hear that approach X is the &#8220;right&#8221; way forward.  While it&#8217;s conceptually appealing to say that one approach is uniformly better, real world constraints often limit practical applicability &#8212; we probably need a tool chest, not just one really fancy hammer.</p>
<p>One thing he mentioned was the concept of viewing resource allocation in multi-core systems as a &#8220;2D bin packing&#8221; problem.  He showed a view of traditional CPU scheduling as a one-dimensional problem of time-multiplexing single runnable kernel threads over each single processor (with a small number of processors total).  He then showed the alternate view as a two dimensional problem of assigning chunks of processors over time to applications (i.e. time is the X-axis and processors form the Y-axis).  This is reminiscent of <a href="http://en.wikipedia.org/wiki/Gang_scheduling">gang scheduling</a> or <a href="http://en.wikipedia.org/wiki/Coscheduling">co-scheduling</a>, except the internal scheduling of work within an application with a chunk of processors would be handled at the user level and kernel level time-slicing may not occur in a standard manner anymore.  This reminded me of several pieces of current (and classical) related work.  </p>
<p>One idea is the following: &#8220;why time slice at all on massively multi-core systems?&#8221;  If you have 256 processors, just dynamically assign chunks of them like spatial multiplexing of memory.  </p>
<p><b>Corey</b><br />
<a href="http://pdos.csail.mit.edu/papers/corey:osdi08.pdf">Corey</a>, a research OS for many-core systems, will be presented at OSDI this year (<a href="http://www.usenix.org/events/osdi08/tech/">OSDI 08 program</a>) and follows this principle &#8212; &#8220;Corey allocates physical cores to applications rather than presenting a time-shared virtual processor abstraction.&#8221;  Another related concept is that Corey also allows the allocation of dedicated <i>kernel cores</i> for running the kernel, so kernel calls are handled &#8220;via fast shared-memory IPC rather than slow traps.&#8221;  I remember <a href="http://rikfarrow.com">Rik Farrow</a> suggested this same idea in a 2006 Google Tech Talk titled  <a href="http://www.youtube.com/watch?v=RLd8kPT9Dzg">&#8220;Security is Broken&#8221;</a>.  Corey is organized like an earlier OS called an <a href="http://portal.acm.org/citation.cfm?id=224076">Exokernel</a> (and in fact <a href="http://pdos.csail.mit.edu/~kaashoek/">M. Frans Kaashoek</a> and his <a href="http://pdos.csail.mit.edu/">PDOS</a> group are involved in both).  Like the Exokernel, Corey delegates scheduling policy of an allocated set of cores to the &#8220;library operating systems.&#8221;  This basically amounts to the same thing as user-level scheduling in a normal OS, since the library OS generally runs within the address space of an application (the authors note that the library OS doesn&#8217;t need to further isolate itself from the application because the exokernel doesn&#8217;t trust the library OS).</p>
<p><b>User-level Scheduling</b><br />
One thing that the idea of user-level scheduling reminds me of is the infamous two-level scheduling of <a href="http://www.cs.washington.edu/homes/tom/pubs/sched_act.html">Scheduler Activations</a> or <a href="http://citeseer.ist.psu.edu/106541.html">Solaris M:N threading</a>.  The goals were similar: in theory, at the user (application) level, you can make better scheduling decisions via custom scheduling policies or just better information, and scheduling is also cheaper.  In practice, the extra complexity didn&#8217;t pay off and the two levels of scheduling (kernel and user) often interacted in negative ways.  Subtle interference between decisions made at different levels could cause significant and often unexpected performance issues, and to really take advantage of it, you needed to make sure that both levels of scheduling were not working at cross purposes &#8212; to do that really requires propagating the application level scheduling decision information to the kernel level scheduler too, which is messy and complicated.  </p>
<p>So although, at first glance, the phrase &#8220;user-level scheduling&#8221; appearing on the slides brought the aforementioned black eye to my mind, I think the situation in the case of Corey and the kind of system Burton is proposing will be different because it&#8217;s really not the same kind of combination.  In these scenarios, we have a lot more cores and the kernel level &#8220;scheduling&#8221; is at much longer time scales &#8212; instead of time slicing, it&#8217;s easier to think of it like allocating physical page frames of memory to processes&#8217; address spaces.  Of course, the allocated number of cores is less &#8220;transparent&#8221; than virtual memory, but the model of holding on to a resource is closer to memory allocation than to time slicing over a small number of processors.</p>
<p><b>Concurrency Runtime</b><br />
Another project that this brought to mind is Microsoft&#8217;s <a href="http://channel9.msdn.com/posts/Charles/The-Concurrency-Runtime-Fine-Grained-Parallelism-for-C/">Concurrency Runtime</a> (not to be confused with Microsoft&#8217;s similarly named <a href="http://msdn.microsoft.com/en-us/magazine/cc163556.aspx">Concurrency and Coordination Runtime</a>).  Not only is it related to the idea of user-mode domain-specific scheduling for multi-core applications, but it is also designed to allow different concurrency solutions to work together (thus supporting Burton&#8217;s view of utilizing many solutions with different strengths, potentially in the same program).  The idea behind the Concurrency Runtime is providing a common resource management framework to allow various concurrency solutions to interoperate and &#8220;play nice&#8221; together.  One problem with current solutions like OpenMP or Intel TBB is that they all think they &#8220;own the machine.&#8221;  If you want to use multiple solutions together, they interact poorly because they are all oblivious to each other. The Concurrency Runtime provides a user-mode common resource management framework underneath the various concurrency solutions which can arbitrate between these different requests.  It also provides a bunch of richer primitives for building these solutions (i.e. higher level concepts of tasks, groups, events, thread pools, etc.), but I can&#8217;t find too much documentation on it so far (most of the references are in the form of interviews and presentations).</p>
<p><b>Snake Oil</b><br />
At the beginning of this post, when I was talking about how Burton Smith&#8217;s talk and approach are different from what you usually hear, it reminded me of a recent post by Sun engineer <a href="http://blogs.sun.com/bmc/">Bryan Cantrill</a> (an outspoken, frequently provocative* fellow &#8212; as an aside, his chapter in <a href="http://oreilly.com/catalog/9780596510046/toc.html">Beautiful Code</a> was one of the most enjoyable to me, coming from a systems background).  Anyway, as I mentioned, most talks follow the common &#8220;concurrency is here, we must deal with it&#8221; introduction with a sales pitch for a specific tool.  One hot area currently is <i>transactional memory</i>, which can be implemented in <a href="http://www.sosp2007.org/papers/sosp056-rossbach.pdf">hardware</a> or <a href="http://www.haskell.org/haskellwiki/Software_transactional_memory">software</a>, and I&#8217;ve seen and read a lot of papers on this in the past few years.  Anyway, Bryan recently posted a scathing post about transactional memory titled <a href="http://blogs.sun.com/bmc/entry/concurrency_s_shysters">&#8220;Concurrency&#8217;s Shysters&#8221;</a>.  I think this kind of critique comes as an inevitable backlash against over-hyped/newly in vogue solutions and the dogmatic selling of some technique as <i>the</i> fix for what ails you.  Of course, for the sake of marketing, it&#8217;s very hard to present a nuanced view and still be convincing, so maybe it&#8217;s out of necessity, but it&#8217;s off-putting for people who feel like the presentation of a technology is unbalanced and unrealistic.</p>
<p>*  Bryan Cantrill is infamous for the <a href="http://cryptnet.net/mirrors/texts/kissedagirl.html">&#8220;Have you ever kissed a girl?&#8221;</a> Usenet quip in historical Solaris v. Linux performance wars, and more recently for <a href="http://blogs.sun.com/bmc/entry/on_i_dreaming_in_code">dumping on &#8220;Dreaming in Code&#8221;</a> in a <a href="http://www.youtube.com/watch?v=6chLw2aodYQ">Google Tech Talk on DTrace</a>.  That&#8217;s random trivia, but I quite enjoy watching strongly opinionated and outspoken technical people duke it out.</p>
<p>BTW, if you like following all of the various concurrency solutions (which are popping up fast and furious), you might want to check out <a href="http://tech.puredanger.com">Alex Miller&#8217;s</a> <a href="http://concurrency.tumblr.com/">Concurrency</a> feed.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/32/feed</wfw:commentRss>
		</item>
		<item>
		<title>In Silicon Valley for the summer</title>
		<link>http://www.thegibson.org/blog/archives/27</link>
		<comments>http://www.thegibson.org/blog/archives/27#comments</comments>
		<pubDate>Mon, 19 May 2008 05:12:10 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
		
		<category><![CDATA[Research Content]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=27</guid>
		<description><![CDATA[Yes, I know the average time between my blog posts is quite long, but I tend to post longer posts infrequently rather than daily brain-dumps (or anything limiting towards Twitter).  Anyway, I&#8217;ve been preparing a conference paper as well as getting ready to leave for San Jose, CA.  I&#8217;ll be working for IBM [...]]]></description>
			<content:encoded><![CDATA[<p>Yes, I know the average time between my blog posts is quite long, but I tend to post longer posts infrequently rather than daily brain-dumps (or anything limiting towards Twitter).  Anyway, I&#8217;ve been preparing a conference paper as well as getting ready to leave for San Jose, CA.  I&#8217;ll be working for IBM Research at their <a href="http://www.almaden.ibm.com/">Almaden Research Center</a> on a GPFS-related project, Panache.  See <a href="http://portal.acm.org/citation.cfm?id=1341312.1341322">&#8220;Panache: a parallel WAN cache for clustered filesystems&#8221;</a> in ACM SIGOPS Operating Systems Review from January 2008 for a basic idea.  I&#8217;ve never been to Silicon Valley, so I&#8217;m excited to see the area.</p>
<p>I&#8217;ve spent most of my past seven or so months working on my thesis proposal and preparing a conference submission.  On the topic of CS conferences in my area (Systems), I wanted to highlight a USENIX-sponsored meta-workshop I found serendipitously &#8212; <a href="http://www.usenix.org/event/wowcs08/">WOWCS &#8217;08: Workshop on Organizing Workshops, Conferences, and Symposia for Computer Systems</a>.  The WOWCS 08 PC and accepted authors form a list of seasoned veterans in systems research.  Given that, and the improvement-based focus of the venue, many papers detail a lot of what some people see as &#8220;broken&#8221; in current systems academic venues (reviews, PC meetings, etc.).  </p>
<p>One paper in particular that somewhat confirmed some disheartening truths about the nature of conference and workshop reviews is <a href="http://www.usenix.org/event/wowcs08/tech/full_papers/birman/birman_html/">Overcoming Challenges of Maturity</a> by Ken Birman &#8212; Ken is an ACM Fellow, well known for his work in systems and networking.  Some of his gripes are from his experience chairing SOSP (in 2005), which is one of the most prestigious (and oldest) systems venues.  Ken said,</p>
<blockquote><p>
Overwhelmed by the huge numbers of submissions, most PCs have turned to multi-round processes in which the first-round reviews are farmed out, often to students who may do an erratic reviewing job.<br />
&#8230;<br />
Most of us are learning to write papers in a manner calculated to appear to those beleaguered first-round reviewers.  To get into SOSP or SIGCOMM a paper has to survive two thresholds: it must get past the two randomly selected students, and then must get past the six or so PC members who are most knowledgeable about the topic.<br />
&#8230;<br />
a PC chair today assigns some paper to PC member X, who then randomly hands it to students Y and Z, producing completely random reviews from people who have never been a part of the community and who are naturally inclined to be overly critical and to overly favor work in their own areas of interest: our mature researchers have long since shed these flaws of youth. </p></blockquote>
<p>The above content encourages somewhat cynical views of publishing in such academic venues.  Ken points out that the quality of the top conferences isn&#8217;t really diminished by these schizophrenic, semi-random first round eliminations because there are enough good papers remaining to fill the program.  But it&#8217;s still somewhat unfair and very frustrating to authors.  He also said,</p>
<blockquote><p> Who hasn’t had papers that were rejected in the first round of reviews at a top conference, with just two reviews, one or both of which seemed almost completely clueless?  Who hasn’t expressed anger at the system?  Here are two little “factoids” to illustrate the depth of the issue: when I sent out the SOSP reviews, we discovered that in one case, a rejected paper had missed the initial cut on the basis of a review that was clearly written about some other paper.</p></blockquote>
<p>Anyway, I&#8217;ve had pretty good experiences so far with my current primary research work, but some projects I&#8217;ve collaborated on as a secondary/advisory participant have experienced treatment like that (&#8220;completely clueless&#8221; reviews).  Sometimes it could be chalked up to clarity issues in the paper, but other times it left me wondering if certain reviewers actually read the paper or just the abstract.  Oh well&#8230; it helps to know that even highly-regarded, established researchers experience this sort of thing too.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/27/feed</wfw:commentRss>
		</item>
		<item>
		<title>Memory ordering and memory models</title>
		<link>http://www.thegibson.org/blog/archives/23</link>
		<comments>http://www.thegibson.org/blog/archives/23#comments</comments>
		<pubDate>Thu, 03 Apr 2008 05:14:11 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
		
		<category><![CDATA[Research Content]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=23</guid>
		<description><![CDATA[Along with a variety of interesting papers, I&#8217;ve seen a few nice Google Tech Talks on the topic of processor memory ordering and language-level memory models recently.  With the relatively recent resurgence of interest and attempts to &#8220;mainstream&#8221; parallel/concurrent programming, it is increasingly important to get these things right.  Processor memory ordering guarantees [...]]]></description>
			<content:encoded><![CDATA[<p>Along with a variety of interesting papers, I&#8217;ve seen a few nice <a href="http://research.google.com/video.html">Google Tech Talks</a> on the topic of processor memory ordering and language-level memory models recently.  With the relatively recent resurgence of interest and attempts to &#8220;mainstream&#8221; parallel/concurrent programming, it is increasingly important to get these things right.  Processor memory ordering guarantees (sometimes called a processor&#8217;s/architecture&#8217;s memory model) are generally relevant to people like me writing systems-level software (e.g. programming in C or assembly implementing operating systems, higher-level language runtimes and compilers, etc.).  In theory, if you are doing user-level programming in C with something like pthreads (or win32 threads, OpenMP, etc.) and have a &#8220;race free&#8221; program (according to the threading specification), processor memory ordering should not be directly exposed to you.  Even though Alpha and PowerPC have weaker memory ordering guarantees than x86, it is the implementation&#8217;s responsibility to issue appropriate memory fence operations, locked operations, etc., to make sure that the thread primitives perform as expected.  If you want to use lock-free algorithms or atomic operations, you have to be aware of such architecture-specific things and do this manually.</p>
<p>Some higher-level languages &#8212; most notably Java &#8212; have defined language-level memory models: Java&#8217;s memory model defines how threads interact through memory and precisely what it means for a program to be data race free / well synchronized.  It provides guarantees that a compiler/JIT won&#8217;t perform optimizations that break race free code. In a language like C without a defined memory model, these things can be tricky and surprising because it is unclear what a data race means when considering the mapping from language level statements to actual machine code.  A programmer may have two independent threads concurrently assigning values to two different variables.  If these variables are small (chars, for example) and stored adjacently in the same machine word, this may be a data race on some architectures.  Since these kinds of decisions are often left unspecified and up to the compiler (or possibly the linker), there is no guaranteed way to write portable and robust code.  Additionally, the compiler may introduce extra stores or perform other optimizations that, while fine for single-threaded code, introduce races into otherwise race-free code.  People like me, who (try to) make use of lock-free algorithms and atomic operations, end up with code that may be quite fragile to even minor compiler optimization changes.  More on this later.  </p>
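<p>The adjacent-chars case can be replayed by hand. The sketch below assumes a hypothetical machine that can only load and store whole 32-bit words, so updating one byte becomes a load / modify / store of the containing word. It runs in a single thread and simply executes one unlucky interleaving in order; all of the function names are invented for the example.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Replace byte `index` (0..3) of a 32-bit word value, leaving the other bytes alone. */
static uint32_t with_byte(uint32_t word, unsigned index, uint8_t value)
{
    unsigned shift = 8u * index;
    assert(index < 4);
    return (word & ~((uint32_t)0xFF << shift)) | ((uint32_t)value << shift);
}

/* Read byte `index` (0..3) back out of a 32-bit word value. */
static uint8_t byte_of(uint32_t word, unsigned index)
{
    assert(index < 4);
    return (uint8_t)(word >> (8u * index));
}

/*
 * Replays one bad interleaving by hand, in a single thread.
 * "Thread A" sets byte 0 and "thread B" sets byte 1 of the same word,
 * each as a load / modify / store of the whole word.
 * Returns the final value of the shared word.
 */
static uint32_t replay_lost_update(void)
{
    uint32_t shared = 0;

    uint32_t a_copy = shared;               /* A loads the word            */
    uint32_t b_copy = shared;               /* B loads the word            */
    b_copy = with_byte(b_copy, 1, 0xBB);
    shared = b_copy;                        /* B stores: byte 1 is 0xBB    */
    a_copy = with_byte(a_copy, 0, 0xAA);
    shared = a_copy;                        /* A stores: byte 1 is 0 again */

    return shared;
}
```

<p>Neither "thread" ever touched the other one's variable in the source, yet B's write is gone from the final word.</p>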
<p>One other nice thing about Java&#8217;s memory model is that it also constrains the language implementation on what can happen to incorrectly synchronized code; languages like C and C++ often say that the result of illegal code (e.g. modifying a variable twice without an intervening sequence point) is undefined &#8212; and undefined behavior can allow anything at all to happen.  As one example, Java specifies that the implementation cannot introduce values into improperly synchronized code that appear &#8220;out of thin air.&#8221;</p>
<p>The videos are as follows:</p>
<ul>
<li> <a href="http://www.youtube.com/watch?v=WUfvvFD5tAA">IA Memory Ordering</a> &#8212;  Richard Hudson explains Intel&#8217;s newly clarified memory ordering semantics for x86.
<li> <a href="http://www.youtube.com/watch?v=1FX4zco0ziY">Advanced Topics in Programming Languages: The Java Memory Model</a> &#8212; Jeremy Manson describes the current Java memory model as revised by JSR-133 and Java thread semantics.  The talk covers basics about model ordering guarantees, the meaning of locking/synchronization primitives and volatile, as well as common pitfalls.
<li> <a href="http://www.youtube.com/watch?v=mrvAqvtWYb4">Getting C++ Threads Right</a> &#8212; Hans Boehm, the well-known programming languages/compilers researcher, talks about the effort to provide better threads support in the upcoming C++ standard (C++0x). More importantly, he talks about the general problems plaguing implementation of correct multi-threaded programs in languages like C and C++.
<li> <a href="http://irbseminars.intel-research.net/">Towards a Memory Model for C++</a> &#8212; Not a Google Tech Talk, but another talk by Hans Boehm very similar to that above.
</ul>
<p><b>C++ Threads / Memory Model</b><br />
Hans Boehm covered similar ground in his 2005 PLDI paper, <a href="http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf">&#8220;Threads Cannot be Implemented as a Library&#8221;</a>.  Some people felt the paper was trivial or hyping a non-problem by saying that the compiler and language have to provide a few extra guarantees and can&#8217;t be completely oblivious (I recall the <a href="http://lambda-the-ultimate.org/node/950">LTU discussion</a> &#8212; one commenter said, &#8220;Ayone who was paying attention already knew that&#8221;).  But he does bring up a whole set of issues which is becoming increasingly important.  Concurrent programming is already more difficult than regular sequential programming.  On top of that, you have poorly specified semantics or broken implementations which actually cause problems.  It&#8217;s hard enough already to get your part right without worrying about the compiler breaking things behind your back or the underlying language implementation not properly obeying the specification.  His concerns aren&#8217;t just theoretical &#8220;cleanliness&#8221; issues, either.</p>
<p>Part of what prompted this post is the recent thread on LKML: <a href="http://lkml.org/lkml/2007/10/24/673">&#8220;Is gcc thread-unsafe?&#8221;</a> This is a perfect example of the problem that Boehm alludes to with regard to compiler optimization.  The sample code in the gcc thread:</p>
<pre>
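#include &lt;pthread.h&gt;

/* Declarations assumed here; the excerpt relies on these two globals. */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int acquires_count = 0;  /* meant to change only while mutex is held */

int trylock() {
  int res;

  res = pthread_mutex_trylock(&amp;mutex);
  if (res == 0)  /* 0 means the lock was acquired */
    ++acquires_count;

  return res;
}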
</pre>
<p>That code attempts to acquire the mutex and increments <code>acquires_count</code> only if it succeeded in locking the mutex.  With -O1, gcc 4.3 generated code that always reads and writes the <code>acquires_count</code> variable (load, conditional add, store) regardless of whether the mutex is obtained.  Typical language lawyers pointed out that the C standard allows this optimization, even though it introduces a fairly nasty race condition.  Some of the gcc developers tend to take a very defensive language lawyer stance, vigorously defending things that are technically permitted but practically useless. Standard C says nothing of threads and imposes very little on the compiler in this regard.</p>
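<p>Written back out as C, the generated code behaves roughly like the second function below. This is a hand-written reconstruction from the load / conditional add / store description, not actual gcc output, and it takes the trylock result as a parameter so that it runs without threads. <code>count_as_written</code>, <code>count_as_transformed</code> and the checking helper are hypothetical names.</p>

```c
#include <assert.h>

/* Hypothetical stand-in for the shared counter in the example above. */
static int acquires_count = 0;

/* What the source asks for: touch the counter only when res == 0. */
static void count_as_written(int res)
{
    if (res == 0)
        ++acquires_count;
}

/* Branch-free form: load, conditional add, unconditional store. */
static void count_as_transformed(int res)
{
    int tmp = acquires_count;   /* load happens even without the lock  */
    tmp += (res == 0);          /* add 1 only if the lock was acquired */
    acquires_count = tmp;       /* store happens even without the lock */
}

/* In a single thread the two forms cannot be told apart. */
static void check_single_threaded_equivalence(void)
{
    acquires_count = 0;
    count_as_written(0);        /* lock acquired */
    count_as_written(16);       /* lock busy (any nonzero result) */
    int expected = acquires_count;

    acquires_count = 0;
    count_as_transformed(0);
    count_as_transformed(16);
    assert(acquires_count == expected);
}
```

<p>The trouble only shows up with a second thread: if a thread that does hold the mutex increments <code>acquires_count</code> between the load and the store in the transformed version, the unconditional store writes the stale value back and that increment is lost.</p>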
<p><b>The Java Memory Model</b><br />
The Java Memory Model is quite useful today, but it wasn&#8217;t always perfect.  Java 5 incorporated <a href="http://jcp.org/en/jsr/detail?id=133">JSR 133: Java Memory Model and Thread Specification Revision</a> &#8212; a revision making some substantial changes to the original Java Memory Model, which was regarded as &#8220;broken.&#8221;  I was still in high school and hadn&#8217;t even taken my first CS course when William Pugh (also known for the invention of the skip list, a wonderful data structure that works well in concurrent situations), published a paper titled <a href="http://citeseer.ist.psu.edu/pugh00java.html">The Java Memory Model is Fatally Flawed</a>, a revised version of <a href="http://www.cs.umd.edu/~pugh/jmm.pdf">Fixing the Java Memory Model</a> from a year earlier.  I learned about the controversy a year or two later, and I was finishing my Masters by the time the new, fixed memory model was finalized and adopted.  Although it seems somewhat strange that people tolerated and obviously wrote multi-threaded Java with a &#8220;fatally flawed&#8221; memory model for so long, the situation is roughly analogous to the current situation with C/C++ and threads: generally stuff works, and compilers/runtimes often do the &#8220;right thing&#8221; anyway, but we&#8217;d rather have stronger guarantees in this area.</p>
<p><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html">Double-checked locking</a>, a trick typically used to avoid synchronization on lazily-initialized fields, is one of the things broken by Java&#8217;s old memory model.  Here is an example:</p>
<pre>
public Object getA() {
  if(a == null) {
    synchronized(this) {
      if(a == null)
        a = new Object();
    }
  }
  return a;
}
</pre>
<p>Despite the fact that the idiom was commonly-used within some Java libraries, it was technically incorrect and there was no satisfactory way to fix it (using thread-local storage made it possible, but that was a hack and often too expensive to be worth it).  Under the current Java Memory Model, double-checked locking works if the field is made <code>volatile</code>.  Under the old memory model, <code>volatile</code> was not very useful because volatile reads and writes could be reordered with respect to regular reads and writes, so you couldn&#8217;t use a write to a volatile value to, for instance, indicate to another thread that an object was initialized (because the write to signal the initialization might be reordered to before the actual initialization).  The new memory model prevents this reordering. Since there are a lot of resources available on the web about the Java Memory Model, I won&#8217;t say too much more except that Java has been somewhat of a trailblazer in the area of really providing clear and useful semantics for multi-threaded programs.</p>
<p>William Pugh has a fairly comprehensive page with <a href="http://www.cs.umd.edu/~pugh/java/memoryModel/">Java Memory Model resources</a> and Doug Lea has a <a href="http://gee.cs.oswego.edu/dl/jmm/cookbook.html">JSR-133 Cookbook for Compiler Writers</a> and a tutorial titled <a href="http://gee.cs.oswego.edu/dl/cpj/jmm.html">Synchronization and the Java Memory Model</a> (the latter is an excerpt from Doug&#8217;s nice <a href="http://gee.cs.oswego.edu/dl/cpj/index.html">Concurrent Programming in Java</a> book).</p>
<p><b>IA Memory Ordering</b><br />
x86 (IA32) has used, in the past, fairly strong memory ordering semantics called &#8220;processor ordering.&#8221;  Intel has clarified/more robustly specified the memory ordering semantics in a document titled <a href="http://www.intel.com/products/processor/manuals/">Intel® 64 Architecture Memory Ordering White Paper</a>.  The newly clarified memory ordering semantics apply to both 32-bit and 64-bit x86 and are described as &#8220;Total Lock Ordering + Causal Consistency.&#8221;  Locked instructions are totally ordered across all processors, and memory obeys <a href="http://en.wikipedia.org/wiki/Causal_consistency">causal consistency</a>, which is a fairly natural concept and provides publication safety. These memory ordering semantics don&#8217;t apply to all memory types (I/O would be different), but these semantics are what would be encountered in regular user code dealing with standard heap or stack memory.  Precisely knowing the target architecture&#8217;s memory consistency guarantees is critical to correct and efficient implementation of higher-level language memory models and systems software (and lock-free code).  AMD also has their own memory ordering reference which is very close or identical to Intel&#8217;s.<br />
A long time ago, I read a paper titled <a href="http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf">Memory Ordering in Modern Microprocessors</a> by Paul McKenney which explains the memory ordering guarantees provided by many modern microprocessor architectures (in the context of Linux&#8217;s memory barrier primitives).  It is interesting just how weak Alpha&#8217;s (and to some extent PowerPC&#8217;s) ordering guarantees can be.  Alpha seems to allow just about everything short of the processor simply making up bogus values for memory reads.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/23/feed</wfw:commentRss>
		</item>
		<item>
		<title>The perils of numerical algorithms</title>
		<link>http://www.thegibson.org/blog/archives/19</link>
		<comments>http://www.thegibson.org/blog/archives/19#comments</comments>
		<pubDate>Wed, 12 Mar 2008 07:17:31 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/archives/19</guid>
		<description><![CDATA[Recently I needed to generate binomially-distributed integers on demand for a micro-benchmark.  Actually, I started with the idea of generating normally distributed numbers and rushed headfirst into reading about how to do that without thinking about the fact that I really needed a discrete distribution; later I switched distributions, but I found out it [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I needed to generate binomially-distributed integers on demand for a micro-benchmark.  Actually, I started with the idea of generating normally distributed numbers and rushed headfirst into reading about how to do that without thinking about the fact that I really needed a discrete distribution; later I switched distributions, but I found out it is quite easy to generate normally distributed numbers.  Starting with the primitive of rand_r or drand48  (or the GNU drand48_r), we can get uniformly distributed random numbers on the interval [0, 1).  If you aren't too picky about issues of numerical stability or speed, generating normally distributed numbers from uniformly distributed numbers is simple using the basic rectangular form of the <a href="http://en.wikipedia.org/wiki/Box-Muller_transform">Box-Muller transform</a>.  Assuming u_1 and u_2 are the uniform random numbers:<br />
<img src='http://www.thegibson.org/blog/wp-content/uploads/2008/03/boxmuller11.png' alt='boxmuller11.png' /><br />
n_1 and n_2 are normally distributed.  So that's nifty and really easy, but I then remembered that I needed to generate integers in the range [0, N], and the normal distribution is defined over the entire real line.  I could just chop the tails off of each side of the distribution and create a <a href="http://en.wikipedia.org/wiki/Truncated_normal_distribution">truncated normal distribution</a>, but it seemed like a better idea to just directly go with a discrete distribution.  </p>
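<p>As a rough illustration, here is a minimal Python sketch of the basic (sine/cosine) form of the transform, assuming the formula pictured above is the standard one.  The names <code>box_muller</code> and <code>normal_pair</code> are made up for this example, and Python's <code>random.Random</code> stands in for drand48 as the uniform source.</p>
<pre>
```python
import math
import random


def box_muller(u1, u2):
    """Basic Box-Muller: two uniforms in, two standard normals out."""
    # u1 must be strictly positive because log(0) is undefined.
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)


def normal_pair(rng):
    # rng.random() is uniform on [0, 1); 1 - u maps it onto (0, 1] so log() is safe.
    return box_muller(1.0 - rng.random(), rng.random())


rng = random.Random(12345)
samples = [x for _ in range(50000) for x in normal_pair(rng)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, var)  # should land close to 0 and 1 respectively
```
</pre>
<p>Flipping the first uniform to 1 - u is one simple way to keep the logarithm away from zero; it is a choice made for this sketch, not something the transform itself prescribes.</p>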
<p>The <a>Binomial distribution</a> with p=0.5 over [0, N] looks approximately normal when N becomes very large, so I decided to generate binomially-distributed integers.  Even when there isn&#8217;t a simple, closed-form transform from uniform random numbers to another distribution, it seemed like you could just generate the cumulative distribution function; then you generate a uniform random number u_1 and binary search for the smallest index i where cdf(i) &gt; u_1 (basically inverting the cdf).  So my plan was to generate the probability mass function straight from the definition and then sum it to get the cdf.  With 128-bit floating point and care in implementing the binomial coefficient (i.e. don&#8217;t compute the factorials directly and divide; evaluate it one non-canceled term at a time), this worked fine for my tests in the range of a few thousand.  However, I quickly ran into trouble since N=72,000 in the real test.  The definition of the pmf is:<br />
<img src='http://www.thegibson.org/blog/wp-content/uploads/2008/03/binomial1.png' alt='Binomial pmf' /><br />
Obviously you can see where this is going.  That definition works somewhat well for smaller integers, but once you have quantities like 0.5^(72000), you can easily get underflows and overflows.  I tried some term rearrangement, but I still ended up with a mass function that was +Inf in the middle and 0 everywhere else, which is completely worthless.  Not wanting to spend a lot of time implementing this, I figured <a href="http://www.r-project.org/">GNU R</a> &#8212; the excellent statistical computing language/environment &#8212; could probably do this.  So I ended up using R to generate the cdf (with the <code>pbinom</code> function), and wrote the results out to a file which I read in when I needed the cdf.  </p>
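<p>For comparison, one common way around the underflow is to evaluate each pmf term in log space and exponentiate only at the end.  The Python sketch below does that with <code>math.lgamma</code> and then inverts the cdf by binary search.  This is an illustration under my own assumptions, not what R's <code>pbinom</code> does and not what the benchmark used; the function names are hypothetical.</p>
<pre>
```python
import bisect
import math
import random


def binomial_cdf(n, p=0.5):
    """Return the list of P(X at most k) for k = 0..n, X ~ Binomial(n, p)."""
    log_p, log_q = math.log(p), math.log1p(-p)
    log_fact_n = math.lgamma(n + 1)
    cdf, total = [], 0.0
    for k in range(n + 1):
        # log C(n, k) via lgamma, so no huge factorial is ever formed.
        log_pmf = (log_fact_n - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * log_p + (n - k) * log_q)
        total += math.exp(log_pmf)  # far-tail terms underflow harmlessly to 0.0
        cdf.append(total)
    return cdf


def sample(cdf, rng):
    # Smallest k with cdf[k] > u; min() guards against rounding in the last entry.
    u = rng.random()
    return min(bisect.bisect_right(cdf, u), len(cdf) - 1)


cdf = binomial_cdf(72000)
rng = random.Random(1)
draws = [sample(cdf, rng) for _ in range(5)]
print(cdf[-1], draws)
```
</pre>
<p>Plain double precision is enough here because only the logarithms get large; the probabilities themselves are either representable or negligibly small.</p>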
<p>Later I looked into how R actually implements <code>pbinom</code>, and it delegates the problem to the <a href="http://en.wikipedia.org/wiki/Beta_distribution">Beta distribution</a> cdf implementation, <code>pbeta</code>.  The work in <code>pbeta</code> is all done by a function called <code>bratio</code>, which is the interesting part.  The <code>bratio</code> function is in <code>toms708.c</code>.  A comment at the top explains the name:
<pre>
/*      ALGORITHM 708, COLLECTED ALGORITHMS FROM ACM.
 *      This work published in  Transactions On Mathematical Software,
 *      vol. 18, no. 3, September 1992, pp. 360-373z.
 */</pre>
<p>Sure enough, you can locate <a href="http://portal.acm.org/citation.cfm?id=131776&amp;dl=GUIDE&amp;dl=ACM">&#8220;Algorithm 708; significant digit computation of the incomplete beta function ratios&#8221;</a> in ACM&#8217;s digital library.  The <code>bratio</code> function computes the <a href="http://en.wikipedia.org/wiki/Beta_function">incomplete beta function</a>, and <code>toms708.c</code> is almost 2300 lines of fairly scary-looking numerics code (lots of gotos and labels, and a fair number of magic-looking constants).  The original code from ACM TOMS was in Fortran, and the top of the source file says &#8220;Based on C translation of ACM TOMS 708.&#8221;  They don&#8217;t explicitly say whether the translation was manual or whether machine translation was involved at any point, but I found several comments above variable declarations that say &#8220;System generated locals&#8221;.  Googling that phrase brings up references to <code>f2c</code>, the Fortran to C translator.  That probably also explains the abundance of labels with names corresponding to line numbers in the original Fortran source.</p>
<p>Anyway, I just thought this was interesting.  Numerical algorithms are fascinating: they can take a lot of care to get right and often require complex tricks to avoid over/under-flows and maintain numerical stability.  Just working with floating point is perilous on its own, because there are things like cancellation errors, non-associativity of operations, non-intuitive notions of equality, etc.  This is one area that wasn&#8217;t well covered in my undergraduate classes (although it might have been covered in graphics classes, which I never took).  In the introductory systems class, they discussed floating point representations and had us encode/decode some numbers from scientific notation into <a href="http://en.wikipedia.org/wiki/IEEE_floating-point_standard">IEEE 754</a>, but no class really went into the pragmatics and pitfalls of using floating point numbers in mathematical algorithms.  I learned about that mostly a) from reading the classic report, <a href="http://docs.sun.com/source/806-3568/ncg_goldberg.html">&#8220;What Every Computer Scientist Should Know About Floating-Point Arithmetic&#8221;</a> and later b) from working at the Federal Reserve in Economic Research, helping economists make use of distributed computing with their simulations and such.  </p>
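<p>A few of these pitfalls are easy to reproduce.  The short Python session below (Python floats are IEEE 754 doubles) shows non-associativity, surprising equality, a small addend being absorbed by a large one, and accumulated rounding error in a naive sum.  The variable names are only for illustration.</p>
<pre>
```python
import math

# Non-associativity: the grouping changes the rounded result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)                 # False

# Non-intuitive equality: compare with a tolerance instead of ==.
print(0.1 + 0.2 == 0.3)              # False
print(math.isclose(0.1 + 0.2, 0.3))  # True

# Absorption: near 1e16 adjacent doubles are 2 apart, so adding 1.0 is lost.
big = 1e16
print((big + 1.0) - big)             # 0.0

# Accumulating error: a naive sum drifts, a compensated sum does not.
values = [0.1] * 10
print(sum(values) == 1.0)            # False
print(math.fsum(values) == 1.0)      # True
```
</pre>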
<p>Through my experience at the Fed, I gained a lot of insight into just how much effort it takes to do correct numerical algorithms, let alone fast ones.  Of course, most of the economists relied on highly-tuned primitive libraries like <a href="http://www.vni.com/products/imsl/">IMSL</a>, but it still takes work to make sure your own calculations (built on top of those primitives) don&#8217;t introduce accumulating error.  Fortran and Matlab were, by far, the most popular languages, with a few economists choosing C or Mathematica.  Anyway, this is a deep and interesting area.</p>
<p>As an aside, it&#8217;s not too hard to generate Zipf-distributed numbers using a technique called &#8220;Rejection-Inversion&#8221; (it doesn&#8217;t require generating the cdf and searching).  Take a look at <a href="http://sdm.lbl.gov/fastbit/">FastBit</a>, which has an implementation in <code>twister.h</code>:
<pre>
/// Discrete Zipf distribution: p(k) is proportional to (v+k)^(-a) where a
/// &gt; 1, k &gt;= 0.  It uses the rejection-inversion algorithm of W. Hormann
/// and G. Derflinger.  The values generated are in the range of [0, imax]
/// (inclusive, both ends are included).
</pre>
<p>The referenced paper is <a href="http://portal.acm.org/citation.cfm?id=235029&amp;dl=ACM&amp;coll=GUIDE">&#8220;Rejection-inversion to generate variates from monotone discrete distributions&#8221;</a>; the code to generate Zipf-distributed integers is only about 20 lines of C++, and it&#8217;s actually understandable.  </p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/19/feed</wfw:commentRss>
		</item>
		<item>
		<title>Memory management</title>
		<link>http://www.thegibson.org/blog/archives/17</link>
		<comments>http://www.thegibson.org/blog/archives/17#comments</comments>
		<pubDate>Tue, 12 Feb 2008 08:43:53 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
		
		<category><![CDATA[Research Content]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/archives/17</guid>
		<description><![CDATA[Over the past week or two, I read three semi-related papers on the general theme of memory management in applications: one was on the memory overhead of various application design choices, another was on the cost of garbage collection, and the third was an older but interesting paper on custom memory allocators.  For my [...]]]></description>
			<content:encoded><![CDATA[<p>Over the past week or two, I read three semi-related papers on the general theme of memory management in applications: one was on the memory overhead of various application design choices, another was on the cost of garbage collection, and the third was an older but interesting paper on custom memory allocators.  For my own research, I read a lot of papers from OSDI/PPOPP/SOSP/ASPLOS/Usenix/HotOS (and others with topical overlap like ICDCS, HPDC, SC, PODC, SIGCOMM), but I sprinkle in pleasure reading from other areas of interest &#8212; that&#8217;s not to say that I don&#8217;t actually like reading OS and distributed systems papers; it&#8217;s just nice to get topical diversity.  In particular, I tend to select a lot of papers that are related to Programming Languages and Compilers as well as Information Security.  Under the broad PLC umbrella, I tend to like papers related to language runtime/library implementation, compiler implementation and functional programming as well as issues of software engineering (in the broad sense of runtime/language issues to make programmers productive, less error-prone, etc.).  Coincidentally, all three of these papers appeared at OOPSLA.</p>
<p><b>&#8220;The Causes of Bloat, The Limits of Health&#8221;</b></p>
<p>The first paper I mentioned is &#8220;<a href="http://domino.research.ibm.com/comm/research_people.nsf/pages/nickmitchell.pubs.html/$FILE/oopsla2007-bloat.pdf">The Causes of Bloat, The Limits of Health</a>&#8221; (Nick Mitchell and Gary Sevitsky) from OOPSLA &#8216;07.  I found this paper interesting for a few reasons, chiefly because it shows how application design choices in mapping a data model onto concrete data structures may lead to chronically wasteful memory usage where the amount of real data is overwhelmed by collection metadata/overhead (like pointers).  The paper explores these issues in the context of Java and Java applications; the basic premise is applicable in any language, but things like Java&#8217;s object headers exacerbate the problem when compared to a comparable data structure in C.  A key metric here, collection health, is a comparison of actual application data to overhead imposed by object headers, collection metadata, pointers, etc.  One example they provide is putting Java Strings in collections: </p>
<blockquote><p> Observe that a String must have at least 140 characters in order to achieve an S &lt; 1.2 (i.e. no more than 80% actual data). When Strings are placed in a standard Java HashSet, they must have at least 270 characters to achieve this level of health. On the flip side, placing 10-character Strings into a HashSet will result in an S of no less than 3.7 (i.e. no more than 27% actual data), no matter how many Strings are placed into the HashSet.</p></blockquote>
<p>They define &#8220;good health&#8221; for a data structure as having at least 80% real data (actually, less than a 20% overhead/data ratio).  Even for a single string to qualify, it needs to be at least 140 characters, because a String is actually fairly complicated under the hood: a String is an Object and has an object header and some fields (a cached hash code, a length and potentially an offset into the backing array), as well as a pointer to a char[], which also has a header (containing the length and other VM bookkeeping information).  Ignoring the String fields, they put the overhead at 28 bytes per String, and they note that, in some cases, object headers could be up to 20 bytes apiece. </p>
<p>Some of the applications they tested had less than 20% actual data due to inappropriate choice of Java collections or container objects.  It&#8217;s not just about choosing the wrong data structure or having the default collection size be too big, it&#8217;s also about choosing the wrong object &#8220;containers&#8221; to hold values (decisions that may seem inconsequential).  Java doesn&#8217;t have tuples (a huge gripe of mine), and one application needed to hash a pair of items, an int and some other Object.  The developer chose java.util.Arrays$ArrayList, an inner-class of java.util.Arrays used in the asList method, because it was convenient to specify literal values by making a simple call like this: Arrays.asList(new Object[] {1, obj}).  You couldn&#8217;t necessarily just put the Object array in the collection directly because array equality is object identity rather than content equality (i.e. it just compares the addresses of the arrays).  Anyway, using Arrays$ArrayList causes a lot of overhead because the ArrayList is itself an Object with attributes and it also keeps a separate backing array (which has its own header overhead).  Since the backing array is an Object array, storing an int actually requires boxing the primitive int into an Integer object wrapper, which is even more overhead, and every little bit adds up when you multiply it by the number of these Arrays$ArrayList objects you will create.  If the author had instead created a Pair class with an int field and an Object field, it would reduce the memory usage of all of these (int, Object) pairs in this application to about a third of the original size (1.7MB versus 4.9MB).  </p>
<p>This kind of study makes you think a little more carefully about choices you might take for granted or overheads you might dismiss as negligible.  I know in the past I&#8217;ve done things like the above scenario with ArrayList (although not using that exact class): simply taking an existing Java class already present in the API that is close to what I want rather than making my own trivial two-element wrapper class to work around the lack of tuples.  If it&#8217;s a one-off it won&#8217;t be a big deal, but if it goes in a collection with many like objects, eventually the overhead may become unreasonable.  This paper also reminded me of the sometimes significant size overhead of Java objects compared to plain data, and that&#8217;s good to keep in mind.  Now don&#8217;t get me wrong, I&#8217;m not one of those systems programmers who likes to beat his chest and loudly proclaim that Java is slow and for n00bs and that we should all be coding everything in hand-tuned assembly language.  I spend a lot of my time coding in C, and when I end up doing stuff in Java, it&#8217;s often a breath of fresh air (except when I need unsigned integers).  Even in a high-level language, though, I personally like knowing what&#8217;s going on underneath.  </p>
<p><b>&#8220;Quantifying the Performance of Garbage Collection vs. Explicit Memory Management&#8221;</b></p>
<p>The second paper is &#8220;<a href="http://citeseer.ist.psu.edu/hertz05quantifying.html">Quantifying the Performance of Garbage Collection vs. Explicit Memory Management</a>&#8221; (Matthew Hertz and Emery D. Berger) from OOPSLA &#8216;05.  <a href="http://lambda-the-ultimate.org/node/2552">A post</a> on <a href="http://lambda-the-ultimate.org/">Lambda The Ultimate</a> a few months ago referenced this paper.  I had heard the conclusion of the paper cited before, but I&#8217;d never actually read the paper so I put it in my reading queue.  The authors note that the garbage collection process visits more pages of memory than an application would using explicit allocation.  This is bad for locality and increases pressure at many levels of the memory hierarchy.  In the presence of physical memory pressure and demand-paged virtual memory, it causes significantly more paging overhead, which is very expensive (it can cause an order of magnitude performance degradation).  In addition, the authors note that you need a larger heap to achieve performance parity with manual allocation because smaller heaps increase (internal) memory pressure, which leads to more frequent collections, which have baseline overhead and of course lead to visiting a lot of pages.  With 5x as much memory, the garbage collector will generally equal or surpass manual allocation, and 3x memory gives an average 17% performance penalty over manual allocation, while lower factors degrade significantly.  </p>
<p>Again, I&#8217;m not one of those &#8220;macho programmers&#8221; who likes to complain about garbage collection being too expensive and never appropriate.  I think managed languages and garbage collection are really a good thing and even though it&#8217;s often <i>more</i> expensive, I don&#8217;t think it&#8217;s too expensive for most applications.  And with the imminent rise in concurrency and parallel programming, the usefulness of garbage collection is even greater; manual storage management in the face of concurrency is often even more painful &#8212; a lot of novel concurrent data structures just assume the presence of garbage collection.  There are always programming features we eventually take for granted that people once griped about being too expensive: people complained about the overhead of operating systems (versus running applications on bare metal), and later it was compilers/high-level languages that were &#8220;too expensive&#8221; (versus assembly language).  The situation with garbage collection may not be exactly analogous, but current trends seem to indicate that it&#8217;ll be a given in due time.</p>
<p><b>&#8220;Reconsidering Custom Memory Allocation&#8221;</b></p>
<p>The last paper is &#8220;<a href="http://citeseer.ist.psu.edu/berger01reconsidering.html">Reconsidering Custom Memory Allocation</a>&#8221; (Emery D. Berger, Benjamin G. Zorn and Kathryn S. McKinley).  The authors tested custom memory allocators versus the general purpose <a href="http://g.oswego.edu/dl/html/malloc.html">Lea allocator</a>, which is just a really good general purpose allocator created by the amazing <a href="http://g.oswego.edu/">Doug Lea</a> (who is also one of the people responsible for the fantastic <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html">java.util.concurrent</a> package added via <a href="http://www.jcp.org/en/jsr/detail?id=166">JSR 166</a>).  The Lea allocator is also used in glibc in a modified form (<a href="http://www.malloc.de/en/">ptmalloc/ptmalloc2/ptmalloc3</a>, which is basically a Lea allocator enhanced for multi-threaded allocation).  The authors tested various applications that use custom allocators and found that custom allocators were rarely worth it.  In their conclusion, they state:</p>
<blockquote><p>Despite the widespread belief that custom allocators should be used in order to improve performance, we come to a different conclusion. In this paper, we examine eight benchmarks using custom memory allocators, including the Apache web server and several applications from the SPECint2000 benchmark suite. We find that the Lea allocator is as fast as or even faster than most custom allocators. The exceptions are region-based allocators, which often outperform general-purpose allocation.</p></blockquote>
<p>The fact that the Lea allocator outperforms most custom allocators isn&#8217;t a coincidence; it was an explicit design goal.  Doug Lea says on his malloc webpage:</p>
<blockquote><p>I soon realized that building a special allocator for each new class that tended to be dynamically allocated and heavily used was not a good strategy when building kinds of general-purpose programming support classes I was writing at the time. (From 1986 to 1991, I was the primary author of libg++, the GNU C++ library.) A broader solution was needed &#8212; to write an allocator that was good enough under normal C++ and C loads so that programmers would not be tempted to write special-purpose allocators except under very special conditions.
</p></blockquote>
<p>I&#8217;d say this paper shows that he largely succeeded.  But I think the paper&#8217;s conclusion is just a classic lesson in optimization.  Knuth famously stated, &#8220;premature optimization is the root of all evil.&#8221;  This isn&#8217;t to necessarily say that the developers of applications like Apache and gcc were oblivious and guilty of premature optimization.  The authors note that this may be a factor of general purpose allocators getting better as well as program evolution.  At one time, gcc&#8217;s runtime was dominated by parsing, which benefited from the custom allocator; now optimization is where more cycles are spent and so the game has changed.  In any event, the take-home message is about future practice: don&#8217;t rush into custom memory allocators just because you think they&#8217;re faster.  Most applications won&#8217;t benefit, and you&#8217;ll save yourself a lot of trouble.  </p>
<p>When I was more of a novice, I was tempted to perform premature optimization and it rarely paid off in terms of the time investment.  Now I&#8217;m much more concerned with getting the structure of the system right so that it doesn&#8217;t impose high overhead in general &#8212; a holistic view of the entire system rather than micro-optimizing specific operations.  After a while you just get a feel for how you can structure a system so that cumulative overhead is avoided.  Obviously part of that is in using the right data structures and algorithms (from an asymptotic complexity standpoint, but also considering constant factors when they matter), but it also means thinking about things like how data flows through the system, so you can avoid unnecessary copies.  After the general structure is there, you profile and you can always micro-optimize hot paths.  If you micro-optimize while the structure is still in flux, Murphy&#8217;s Law dictates that whatever part you optimize will change to foil you.  Anyway, this post is getting quite long, so I just wanted to mention some high-performance, drop-in replacement (general-purpose) malloc implementations:</p>
<ul>
<li><a href="http://www.malloc.de/en/">ptmalloc3</a> &#8212; newer and faster than ptmalloc2, which is in glibc.  </li>
<li><a href="http://www.hoard.org/">Hoard</a> &#8212; fast allocator developed as a research project at UMass Amherst under Prof. Emery Berger (a co-author of two of the papers I mentioned)  </li>
<li><a href="http://goog-perftools.sourceforge.net/doc/tcmalloc.html">tcmalloc</a> &#8212; Google&#8217;s thread caching malloc.  At one time it was the fastest out there, but now various others may be equally competitive or faster.</li>
<li><a href="http://www.nedprod.com/programs/portable/nedmalloc/index.html">nedmalloc</a> &#8212; supposedly faster than Hoard, ptmalloc2 and tcmalloc.</li>
<li><a href="https://labs.omniti.com/trac/portableumem">libumem</a> &#8212; Solaris&#8217;s umem allocator, made portable.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/17/feed</wfw:commentRss>
		</item>
		<item>
		<title>Automated mirror selection for LUG PXE installs</title>
		<link>http://www.thegibson.org/blog/archives/16</link>
		<comments>http://www.thegibson.org/blog/archives/16#comments</comments>
		<pubDate>Sun, 27 Jan 2008 20:46:08 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
		
		<category><![CDATA[PXE-related]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/archives/16</guid>
		<description><![CDATA[I&#8217;ve posted several times about the PXE-based install server I created/maintain for our local LUG&#8217;s installfests.  We can network boot installers for all of the different Linux distros we support as well as various BSDs (and a GParted live image).  The various Linux net-install images require users to specify a network mirror to [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve posted several times about the PXE-based install server I created/maintain for our <a href="http://lugatgt.org/">local LUG</a>&#8217;s installfests.  We can network boot installers for all of the different Linux distros we support as well as <a href="http://www.thegibson.org/blog/archives/10">various BSDs</a> (and <a href="http://www.thegibson.org/blog/archives/13">a GParted live image</a>).  The various Linux net-install images require users to specify a network mirror to retrieve installation files.  For some distros (RHEL, Ubuntu, Fedora), I maintain local mirrors on the PXE server to speed installations (and in the case of RHEL, because there are no public network mirrors &#8212; our school has a site-license that affords every student a legitimate copy, so we can only do RHEL installs for attendees who are students).  For most other distros, we use our school&#8217;s local mirror <a href="http://www.gtlib.gatech.edu/">gtlib.gatech.edu</a>.  </p>
<p>For the first few installfests where we used the PXE server, I printed out several copies of a sheet with the various network mirrors for different distros.  It would be something like this:</p>
<ul>
<li>Red Hat Enterprise Linux 5.1 &#8212; http://10.0.0.2/rhel5.1/i386/
<li>Ubuntu 7.10 (gutsy) &#8212; http://10.0.0.2/pub/ubuntu/
<li>Fedora 8 &#8212;  http://10.0.0.2/fc8/i386/os/
<li>Debian testing &#8212; http://www.gtlib.gatech.edu/pub/debian/
<li>openSUSE 10.3 &#8212; http://128.61.111.11/pub/opensuse/distribution/10.3/repo/oss/
<li>Mandriva 2008.0 &#8212; http://www.gtlib.gatech.edu/pub/mandrake/official/2008.0/i586/
<li>Gentoo 2007.0 &#8212; http://10.0.0.2/gentoo (stage3 tarballs)
</ul>
<p>Now, obviously it was a minor pain to fill in that information repeatedly for each install, so I decided to automate it.  For RedHat-based distros, I used a kickstart configuration file to specify the mirror, and for Debian-derived distros I used the installer &#8220;preseed&#8221; mechanism.  Mandriva and openSUSE also provide features to do the same.  Here&#8217;s how I set up each distro to automatically find the mirrors:</p>
<p><strong></strong><br />
<strong>Red Hat Enterprise Linux / Fedora</strong><br />
For each distro version and architecture combination, create a kickstart configuration file available via HTTP.  Here&#8217;s an example for Fedora 8 i386:</p>
<p><code>interactive<br />
network --bootproto dhcp --noipv6<br />
url --url http://10.0.0.2/fc8/i386/os/<br />
firstboot --enable</code></p>
<p>The &#8220;interactive&#8221; and &#8220;firstboot --enable&#8221; parts are important because the installer assumes that you are doing a semi- or entirely automated installation if you use a kickstart configuration file.  Without the &#8220;firstboot&#8221; line, it won&#8217;t give you the first boot system configuration dialogs where you create a non-root user and configure sound, video, etc.  I wanted these installs to be basically identical to a manual install except with the network mirror pre-selected.</p>
<p>Now, all you have to do to use the kickstart config is edit your pxelinux configuration file.  Append &#8220;ks=<em>url_to_file</em>&#8221; lines to the kernel boot parameters of each entry.  E.g.:</p>
<p><code>LABEL fedora8_x86_64<br />
        kernel fedora/8/x86_64/vmlinuz<br />
        append initrd=fedora/8/x86_64/initrd.img ks=http://10.0.0.2/fc8/x86_64/ks.cfg</code></p>
<p><strong></strong><br />
<strong>Debian / Ubuntu</strong><br />
The Debian installer (which is also used for Ubuntu network and alternate installs) allows you to &#8220;preseed&#8221; answers to all installer prompts.  For each distro, you only need a single file for all version and architecture combinations assuming they all use the same mirror site (versions and architecture are not explicit in the mirror URL).  Create a preseed configuration file like this:</p>
<p><code>d-i mirror/protocol string http<br />
d-i mirror/country string enter information manually<br />
d-i mirror/http/hostname string 10.0.0.2<br />
d-i mirror/http/directory string /pub/ubuntu/<br />
d-i mirror/http/proxy string<br />
d-i apt-setup/security_host string</code></p>
<p>After creating the preseed file, simply append &#8220;preseed/url=<em>url_to_file</em>&#8221; to the kernel boot parameters of each entry.  E.g.:</p>
<p><code>LABEL ubuntu_gutsy_i386<br />
        kernel ubuntu/gutsy/i386/ubuntu-installer/i386/linux<br />
        append vga=normal initrd=ubuntu/gutsy/i386/ubuntu-installer/i386/initrd.gz preseed/url=http://10.0.0.2/ubuntu/lug.cfg --</code></p>
<p>One problem with using a local mirror for Ubuntu (10.0.0.2/pub/ubuntu) is that we have to go back and change the /etc/apt/sources.list file to point to a public mirror after install &#8212; otherwise, after leaving the installfest, a user would be trying to use a non-existent Ubuntu mirror on RFC1918 IP space.  I also used the preseed configuration file to automatically replace the sources.list file after installation.  The &#8220;preseed/late_command&#8221; option allows you to run commands just before the install finishes (the root of the new system is in /target at this point).  Here is the slightly hackish entry I use to fix the sources.list:</p>
<p><code>d-i preseed/late_command string cd /target/tmp ; wget http://10.0.0.2/ubuntu/fix_sources.sh ; cd /target ; sh tmp/fix_sources.sh</code></p>
<p>The fix_sources.sh simply replaces the entries in sources.list to point to a public mirror. </p>
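<p>The script itself is a shell script and is not reproduced in this post.  Purely to illustrate the kind of rewrite involved, here is a small Python sketch of the same idea; the function name <code>rewrite_sources</code>, the sample entries and the choice of archive.ubuntu.com as the public mirror are all assumptions made for the example.</p>
<pre>
```python
def rewrite_sources(text, local_prefix, public_prefix):
    """Swap the installfest mirror URL for a public one in sources.list text."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Only touch active "deb" / "deb-src" entries; leave comments and blanks alone.
        if stripped.startswith(("deb ", "deb-src ")):
            line = line.replace(local_prefix, public_prefix)
        out.append(line)
    return "\n".join(out) + "\n"


sample = (
    "# installed from the LUG mirror\n"
    "deb http://10.0.0.2/pub/ubuntu/ gutsy main restricted\n"
    "deb-src http://10.0.0.2/pub/ubuntu/ gutsy main restricted\n"
)
fixed = rewrite_sources(sample,
                        "http://10.0.0.2/pub/ubuntu/",
                        "http://archive.ubuntu.com/ubuntu/")
print(fixed)
```
</pre>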
<p>One other trick I did with the Ubuntu installer is disable the supremely annoying &#8220;Automatic Keyboard layout detection&#8221; mechanism.  It prompts you &#8220;Yes/No,&#8221; but the default is Yes, so many people select it unwittingly.  The result is a long and irritating process of pressing various keys on the keyboard which could have been solved in 1 second by simply selecting &#8220;American English&#8221; (99.9% of the time) from the keyboard layout menu.  If you append &#8220;console-setup/ask_detect=false&#8221; to the kernel parameters to the installer image, it will go directly to the keyboard layout menu as if you selected &#8220;No&#8221; to keyboard autodetection.</p>
<p><strong></strong><br />
<strong>openSUSE</strong><br />
The openSUSE installer supports setting the installation mirror source by passing &#8220;install=<em>url_to_repository</em>&#8221; as a kernel parameter to the installer.  As with Debian, the mirror path does not change with different architectures, but it does change between versions.  For example:</p>
<p><code>LABEL opensuse10.3_i386<br />
        kernel opensuse/10.3/i386/linux<br />
        append initrd=opensuse/10.3/i386/initrd splash=silent showopts install=http://128.61.111.11/pub/opensuse/distribution/10.3/repo/oss</code></p>
<p>The url provided uses a numeric IP address rather than a hostname because the openSUSE installer doesn&#8217;t handle DNS resolution (at least last time I checked; it may have been fixed in the meantime).</p>
<p><strong></strong><br />
<strong>Mandriva</strong><br />
Mandriva supports mirror selection through the &#8220;automatic=<em>config_list</em>&#8221; kernel parameter, where <em>config_list</em> is a list of comma-separated key/value pairs in the form of &#8220;key:value.&#8221;  To set the mirror, one could specify the string as follows: &#8220;automatic=method:http,network:dhcp,server:<em>mirror_hostname</em>,directory:<em>mirror_path</em>.&#8221;  For example, here is an entry:</p>
<p><code>LABEL mandriva2008.0_i586<br />
         kernel mandriva/2008.0/i586/vmlinuz<br />
         append initrd=mandriva/2008.0/i586/all.rdz vga=788 splash=silent automatic=method:http,network:dhcp,server:www.gtlib.gatech.edu,directory:/pub/mandrake/official/2008.0/i586</code></p>
<p>The full directory above is &#8220;/pub/mandrake/official/2008.0/i586,&#8221; but fixed-width &#8220;code&#8221; entries don&#8217;t word-wrap without putting extra spaces.  Note that Mandriva&#8217;s mirror URLs also include both architecture and distro version.</p>
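<p>Since the &#8220;automatic=&#8221; string is just a comma-separated list of key:value pairs, it is easy to build from its parts when there are many entries to maintain.  The Python helpers below are a hypothetical sketch (the names <code>automatic_param</code> and <code>pxe_entry</code> are invented here); they only reproduce the text format shown above.</p>
<pre>
```python
def automatic_param(server, directory, method="http", network="dhcp"):
    """Build Mandriva's automatic=key:value,... kernel parameter."""
    pairs = [("method", method), ("network", network),
             ("server", server), ("directory", directory)]
    return "automatic=" + ",".join(f"{key}:{value}" for key, value in pairs)


def pxe_entry(label, kernel, initrd, extra):
    """Render one pxelinux LABEL stanza with extra kernel parameters appended."""
    return (f"LABEL {label}\n"
            f"        kernel {kernel}\n"
            f"        append initrd={initrd} {extra}\n")


param = automatic_param("www.gtlib.gatech.edu",
                        "/pub/mandrake/official/2008.0/i586")
entry = pxe_entry("mandriva2008.0_i586",
                  "mandriva/2008.0/i586/vmlinuz",
                  "mandriva/2008.0/i586/all.rdz",
                  "vga=788 splash=silent " + param)
print(entry)
```
</pre>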
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/16/feed</wfw:commentRss>
		</item>
		<item>
		<title>More on layers and coupling</title>
		<link>http://www.thegibson.org/blog/archives/15</link>
		<comments>http://www.thegibson.org/blog/archives/15#comments</comments>
		<pubDate>Thu, 24 Jan 2008 08:44:47 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
		
		<category><![CDATA[Research Content]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/archives/15</guid>
		<description><![CDATA[After my last (quite long) post, I was thinking about filesystem layering and coupling/interfaces between independent components.  I wanted to post three mostly unrelated ideas on the same general theme.  I&#8217;ll post the two most related to the previous post now and make the third a future post:
Changing traditional storage layering for distribution
In [...]]]></description>
			<content:encoded><![CDATA[<p>After my last (quite long) post, I was thinking about filesystem layering and coupling/interfaces between independent components.  I wanted to post three mostly unrelated ideas on the same general theme.  I&#8217;ll post the two most related to the previous post now and make the third a future post:</p>
<p><b>Changing traditional storage layering for distribution</b><br />
In very large distributed filesystems (getting into the multi-petabyte range), traditional RAID is often too weak and constrained for redundancy.  By traditional RAID, I mean RAID that is implemented in hardware or software and is effectively invisible to the file system (sits below the block level). Say you had thousands of multi-disk RAID-5 arrays.  In this situation, the probability that you&#8217;ll lose an entire array at some point is probably going to make people nervous, particularly the kind of people who would have such massive storage systems.  Depending on the scale, you could try to manage with hot spares and round-the-clock IT staff or increase the redundancy by going to RAID-6 or RAID-1, but you still have dangerous and constraining locality in your redundancy.  This is more important when distribution is involved: what if you lose a controller or network connection or some other local aspect that takes an entire array effectively offline (or an entire chassis/rack or entire SAN or even an entire datacenter)?</p>
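<p>To get a feel for why the numbers make people nervous, here is a back-of-the-envelope sketch in Python.  It assumes arrays fail independently (which ignores exactly the kind of correlated, local failures mentioned above) and uses a made-up per-array loss probability, so treat the output as an illustration rather than a prediction.</p>
<pre>
def prob_any_array_lost(num_arrays, per_array_loss_prob):
    """Probability that at least one of num_arrays independent arrays is lost."""
    # Complement of "every array survives".
    return 1.0 - (1.0 - per_array_loss_prob) ** num_arrays


# Illustrative assumption: each array has a 0.1% chance of being lost
# over the lifetime of the system.
for count in (1, 100, 1000, 5000):
    print(count, round(prob_any_array_lost(count, 0.001), 3))
</pre>
<p>With these assumed numbers, the chance of losing at least one array climbs from 0.1% for a single array to roughly 63% for a thousand of them.</p>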
<p>A presentation titled &#8220;<a href="http://www.dtc.umn.edu/disc/resources/KandlurISW5.pdf">Storage Challenges for Petascale Systems</a>&#8221; given by Dilip D. Kandlur, Director of IBM&#8217;s Storage Systems Research, talks about these challenges in the context of petaflop systems with tens or even hundreds of petabytes of storage.  These systems might have 100k-150k disk drives!  The presentation notes:</p>
<blockquote><p>
<b>RAID-5 is dead at petascale; even RAID-6 may not be sufficient to prevent data loss</b><br />
Simulations of file system size, drive MTBF, failure probability distribution show 4%-28% chance of data loss over five-year lifetime for 8+2P code.
</p></blockquote>
<p>The probability of failure is unacceptably high even with the double parity of RAID-6, but triple parity gives you several orders of magnitude lower mean time to data loss.  With these challenges in mind, <a href="http://www.ibm.com/systems/clusters/software/gpfs.html">GPFS</a> is adding software RAID to support stronger RAID codes (such as triple parity) that are not typically supported by RAID controller hardware.  In addition, it will support what is called &#8220;declustered RAID&#8221; (see <a href="http://citeseer.ist.psu.edu/49298.html">Parity Declustering for Continuous Operation in Redundant Disk Arrays</a>), which significantly improves load balancing during rebuild (see slides 11-13 for a great visual depiction of the way declustered RAID works).  See also &#8220;<a href="http://www.ists.dartmouth.edu/serenyi.pdf">The Challenges of Storage System Growth</a>,&#8221; a presentation by Denis Serenyi of Symantec, which covered some related issues.</p>
<p>The stronger software RAID helps at one level and allows you to use striping for throughput, but it doesn&#8217;t really deal with the location-based redundancy.  The context of the previous presentation is primarily extremely large but single-site HPC systems, so it doesn&#8217;t discuss this issue much, but when you have a distributed system spanning many locations you need to consider it.  In principle you could span multiple datacenters with your RAID layout, but that wouldn&#8217;t work very well from a performance perspective.  RAID works best with symmetric (and predictable) latencies between devices; moreover, it&#8217;s unnecessary if you just want to deal with failure, because you can handle it much more robustly at a higher layer.  Most large distributed systems provide for redundancy at the level of larger storage granules: files, objects (if you&#8217;re using an object-store based system), or perhaps larger blocks or &#8220;chunks.&#8221;  For example, the <a href="http://labs.google.com/papers/gfs.html">Google File System</a> (GFS, not to be confused with Red Hat&#8217;s <a href="http://www.redhat.com/gfs/">Global File System</a>, another distributed filesystem also named GFS) stores files as a series of 64MB chunks and each chunk is replicated.  The paper notes the importance of replicating chunks on different racks:</p>
<blockquote><p>We must also spread chunk replicas across racks. This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.</p></blockquote>
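<p>The rack constraint in that quote can be sketched in a few lines of Python.  This is not GFS&#8217;s actual placement policy (which weighs other factors as well); the function name, the cluster layout and the server names are all hypothetical, and the only thing the sketch shows is &#8220;no two replicas of a chunk on the same rack.&#8221;</p>
<pre>
import random


def place_replicas(servers_by_rack, num_replicas, rng=random):
    """Pick num_replicas chunkservers, each from a different rack."""
    # sample() picks distinct racks and raises ValueError if there are
    # fewer racks than requested replicas.
    racks = rng.sample(sorted(servers_by_rack), num_replicas)
    # Then pick one server within each chosen rack.
    return [(rack, rng.choice(servers_by_rack[rack])) for rack in racks]


# Hypothetical cluster: three racks with a few chunkservers each.
cluster = {
    "rack1": ["cs-a", "cs-b"],
    "rack2": ["cs-c"],
    "rack3": ["cs-d", "cs-e"],
}
print(place_replicas(cluster, 3))
</pre>
<p>Losing any single rack then takes out at most one replica of a given chunk, which is the property the authors are after.</p>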
<p>Ceph, a recent distributed filesystem (see <a href="http://www.usenix.org/events/osdi06/tech/weil.html">Ceph: A Scalable, High-Performance Distributed File System</a> in OSDI &#8217;06), uses an underlying object store model and replicates at the level of objects:</p>
<blockquote><p>In contrast to systems like Lustre [4], which assume one can construct sufficiently reliable OSDs using mechanisms like RAID or fail-over on a SAN, we assume that in a petabyte or exabyte system failure will be the norm rather than the exception, and at any point in time several OSDs are likely to be inoperable. To maintain system availability and ensure data safety in a scalable fashion, RADOS manages its own replication of data using a variant of primary-copy replication [2], while taking steps to minimize the impact on performance.  Data is replicated in terms of placement groups, each of which is mapped to an ordered list of n OSDs (for <i>n</i>-way replication).</p></blockquote>
<p>In Ceph, the data for a traditional file at the filesystem-level may consist of many underlying objects in the object store (a file is striped across objects named by combining an inode and a stripe number), so this is similar to replicating at a &#8220;chunk&#8221; or large block level.  GPFS, in addition to the planned lower level declustered/striped strong parity strategy, already has file data and metadata replication (which can be controlled on a per-file basis).  These features are actually a part of a rich set of ILM (Information Lifecycle Management) features that allow you to define different policies for various data on the same filesystem in a SQL-like declarative language.  For example, you can create a policy that a certain directory subtree should be stored on a pool of faster disks and have a specific, higher replication factor than other files.  Or you could make a policy to have the system gradually decrease the replication factor of files that haven&#8217;t been accessed in a long time, finally migrating them to offline, external storage after a certain threshold.</p>
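<p>To make the &#8220;inode plus stripe number&#8221; idea concrete, here is a simplified Python sketch that maps a byte offset in a file to an object name.  The object size and the name format are my own assumptions for illustration, and real Ceph striping has more parameters than this; the sketch only shows how a file decomposes into independently replicated objects.</p>
<pre>
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size in bytes; purely illustrative


def object_for_offset(inode, offset, object_size=OBJECT_SIZE):
    """Return (object_name, offset_within_object) for a byte offset in a file."""
    stripe_number, within = divmod(offset, object_size)
    # Hypothetical naming scheme: hex inode, a dot, zero-padded hex stripe number.
    return "{:x}.{:08x}".format(inode, stripe_number), within


print(object_for_offset(0x1234, 0))
print(object_for_offset(0x1234, OBJECT_SIZE + 5))
</pre>
<p>Because any client can compute the name from the inode and the offset, there is no per-file block list to look up; replication then happens per object rather than per disk.</p>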
<p>The various DHT-based filesystems/storage systems (<a href="http://citeseer.ist.psu.edu/dabek01widearea.html">CFS</a>, <a href="http://pdos.csail.mit.edu/ivy/">Ivy</a>, <a href="http://research.microsoft.com/~antr/PAST/">PAST</a>, <a href="http://citeseer.ist.psu.edu/cox02pastiche.html">Pastiche</a>, <a href="http://oceanstore.cs.berkeley.edu/">OceanStore</a>, etc.) mentioned in my previous post also replicate pieces of files or entire files on multiple nodes.  These systems are designed for more widely distributed and dynamic environments so they have to deal with things like significant node churn (nodes not being powered on/connected all the time or leaving the system permanently); it is critical to adopt a replication strategy that is easy to maintain in such circumstances.  In such systems it is also accepted that some files may be temporarily unavailable or lost permanently due to a loss of all replicas &#8212; most files will be fine, but you don&#8217;t necessarily set replication parameters for losing a given file to the same low probability of failure as RAID type strategies.   Note the difference in the nature of redundancy and the reasons for doing so: losing a piece of a file might be bad for the user of a file, but the rest of the filesystem is fine.  In the case of RAID, where a coherent filesystem&#8217;s data and metadata are striped indiscriminately across several disks, losing all replicas of a block could mean that the filesystem&#8217;s metadata is damaged, which could cause serious problems.  </p>
<p>Anyway, I just think it&#8217;s interesting to note how the traditional storage layering evolves in the face of distribution and large datasets.  With local disks and filesystems, people tend to put replication below everything and provide a replicated block device.  When distribution is involved, it becomes more flexible to think of replication in the context of filesystem entities like files or chunks.  </p>
<p><b>A violation of layering by DHash</b><br />
My last post got quite long so I didn&#8217;t remember to include every interesting footnote and piece of trivia.  One interesting &#8220;layering violation&#8221; for efficiency in the related work I listed is in DHash, the distributed block storage layer built on top of Chord.  The authors of the <a href="http://citeseer.ist.psu.edu/dabek01widearea.html">CFS SOSP paper</a> note:</p>
<blockquote><p>   DHash has its own implementation of the Chord lookup algorithm, but relies on the Chord layer to maintain the routing tables.  Integrating block lookup into DHash increases its efficiency. If DHash instead called the Chord find successor routine, it would be awkward for DHash to check each server along the lookup path for cached copies of the desired block. It would also cost an unneeded round trip time, since both Chord and DHash would end up separately contacting the block&#8217;s successor server.
</p></blockquote>
<p>That&#8217;s obviously a case in which duplicating code and violating layering is a good tradeoff, since it eliminates a costly network round trip.  Also, since DHash and Chord are maintained by the same entity, it is unlikely to be particularly painful.  However, it does make me wonder if Chord&#8217;s dead simple interface is just too austere.  The only function it provides is the ability to find the successor node for a given key, which is too basic for DHash (at least without compromising performance).  Most competing DHT solutions (e.g. <a href="http://research.microsoft.com/~antr/Pastry/default.htm">Pastry</a>, <a href="http://citeseer.ist.psu.edu/zhao04tapestry.html">Tapestry</a>, <a href="http://citeseer.ist.psu.edu/ratnasamy01scalable.html">CAN</a>) didn&#8217;t split the hash lookup primitive out into a separate, externally distinguished artifact.  I really like the idea of separating the hashing/routing from storage policy, and the single <i>successor</i> primitive is appealing for its simplicity, but perhaps the interface at the split should have been richer.  </p>
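<p>For readers who haven&#8217;t seen it, the <i>successor</i> primitive is easy to sketch in Python.  This is a toy with global knowledge of every node on the ring; real Chord resolves the same question in a distributed fashion by hopping between nodes, which is where the lookup path and the extra round trip come from.  The 16-bit ring size and the function names are my own choices for illustration.</p>
<pre>
import bisect
import hashlib

RING_BITS = 16  # small identifier space for illustration only


def ring_id(name):
    """Hash a string onto the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)


def successor(node_ids, key_id):
    """Return the first node ID at or after key_id, wrapping around the ring."""
    ordered = sorted(node_ids)
    index = bisect.bisect_left(ordered, key_id)
    # Past the largest ID, the ring wraps around to the smallest one.
    return ordered[index % len(ordered)]


nodes = [ring_id("node-{}".format(i)) for i in range(8)]
print(successor(nodes, ring_id("some-block")))
</pre>
<p>Note that the caller only gets the final answer back.  Nothing in this interface exposes the intermediate nodes a lookup passes through, which is exactly what DHash wants to see in order to check for cached copies along the way.</p>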
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/15/feed</wfw:commentRss>
		</item>
		<item>
		<title>ZFS hype?</title>
		<link>http://www.thegibson.org/blog/archives/14</link>
		<comments>http://www.thegibson.org/blog/archives/14#comments</comments>
		<pubDate>Sat, 19 Jan 2008 04:38:05 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
		
		<category><![CDATA[Research Content]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/archives/14</guid>
		<description><![CDATA[Over the past year and a half or so, there&#8217;s been a lot of hype surrounding Sun&#8217;s ZFS (originally the &#8220;Zettabyte File System&#8221;).  After the initial release, the &#8220;buzz&#8221; has come back in waves, peaking once with the initial porting of ZFS to FreeBSD (announced, merged), and later reappearing with (false) rumors of ZFS [...]]]></description>
			<content:encoded><![CDATA[<p>Over the past year and a half or so, there&#8217;s been a lot of hype surrounding <a href="http://www.sun.com/2004-0914/feature/">Sun&#8217;s ZFS</a> (originally the &#8220;Zettabyte File System&#8221;).  After the initial release, the &#8220;buzz&#8221; has come back in waves, peaking once with the initial porting of ZFS to FreeBSD (<a href="http://lists.freebsd.org/pipermail/freebsd-current/2006-August/065306.html">announced</a>, <a href="http://lists.freebsd.org/pipermail/freebsd-current/2007-April/070544.html">merged</a>), and later reappearing with (false) <a href="http://www.macrumors.com/2007/06/06/zfs-to-become-default-file-system-in-leopard/">rumors</a> of ZFS becoming the default filesystem in Mac OS X 10.5.  This month another wave started with <a href="http://trac.macosforge.org/projects/zfs/wiki/?p=6">ZFS code and binaries for OS X</a> being made available.  Sun itself feeds the hype by touting ZFS as &#8220;the last word in file systems.&#8221;  I&#8217;ve also been following Oracle&#8217;s <a href="http://oss.oracle.com/projects/btrfs/">btrfs</a> (&#8220;Butter FS&#8221;), Matthew Dillon&#8217;s (of <a href="http://www.dragonflybsd.org/index.shtml">DragonFlyBSD</a>) <a href="http://leaf.dragonflybsd.org/mailarchive/kernel/2007-10/msg00006.html">HAMMER</a> and <a href="http://www.bullopensource.org/ext4/">ext4</a>, which are all sort of taking feature cues from ZFS.  The option for checksumming should have been common in filesystems long before now, so I&#8217;m glad to see it is finally becoming a mainstream feature.  Cheaper snapshotting/transactional support is also nice, but that&#8217;s not as rare.</p>
<p>Personally, I don&#8217;t understand the reason for the large hype over ZFS, particularly in the context of OS X (more on that later).  Now, it seems to be a fairly impressive engineering effort, but it is concentrating on an artifact that is somewhat pedestrian by now: a purely local filesystem.  ZFS seems to be a good but still incremental improvement in local filesystem capabilities (with some unorthodox choices, but more on that later) &#8212; they took functionality that was previously available in different storage layers and increased coupling to improve performance and flexibility.  Don&#8217;t get me wrong; there are certainly great things about ZFS, but in my book &#8220;the last word in file systems&#8221; would at least have to be distributed.   From my perspective, the bulk of &#8220;cutting-edge&#8221; research in filesystems over the last decade has been on distributed or cluster filesystems.  Of course, my research is generally in distributed systems and I worked on a distributed/parallel filesystem (IBM&#8217;s <a href="http://www.ibm.com/systems/clusters/software/gpfs.html">GPFS</a>) this past summer, so I&#8217;m not an impartial bystander, but I think it&#8217;s non-controversial to say that storage transparently interfacing with the network is important now and will only become more important in the future.  Now, Sun has indicated that they are going to use ZFS as a local storage backend for <a href="http://wiki.lustre.org/">Lustre</a> (since it currently uses ext3, this would be a big improvement), but that&#8217;s slightly different.  Alternately, maybe if ZFS was like the eternal vaporware relational filesystem <a href="http://en.wikipedia.org/wiki/WinFS">WinFS</a>, I could see the hype being justified, particularly from end-users.</p>
<p>One of the reasons I&#8217;m baffled about the hype of ZFS on FreeBSD/OpenBSD/Mac OS X is that the port is not stable yet and there are some general ZFS issues that would limit its wide use.  First of all, it needs a lot of memory and can panic or deadlock if it runs out of memory.  It can really only run reliably on 64-bit machines (because it tends to exhaust kernel resources on 32-bit machines), but Sun is upfront about this. It also seems to be somewhat finicky and require manual tuning to get good performance (and sometimes just to keep it from crashing).  A post to the FreeBSD mailing list titled <a href="http://lists.freebsd.org/pipermail/freebsd-current/2008-January/081853.html">&#8220;ZFS Honesty&#8221;</a> summarizes the issues nicely:</p>
<blockquote><p>But let&#8217;s also be honest about ZFS in the 64-bit world.  There is ample evidence that ZFS basically wants to grow unbounded in proportion to the workload that you give it.  Indeed, even Sun recommends basically throwing more RAM at most problems.  Again, tuning is often needed, and I think it&#8217;s fair to say that it can&#8217;t be expected to work on arbitrary workloads out of the box.</p></blockquote>
<p>A <a href="http://lists.freebsd.org/pipermail/freebsd-current/2008-January/081862.html">followup</a> added:</p>
<blockquote><p>I guess what makes me mad about ZFS is that it&#8217;s all-or-nothing; either it works, or it crashes.  It doesn&#8217;t automatically recognize limits and make adjustments or sacrifices when it reaches those limits, it just crashes.  Wanting multiple gigabytes of RAM for caching in order to optimize performance is great, but crashing when it doesn&#8217;t get those multiple gigabytes of RAM is not so great, and it leaves a bad taste in my mouth about ZFS in general.</p></blockquote>
<p>Anyway, I&#8217;m not trying to dump on ZFS for these problems, because some are related to the BSD port and manual tuning is not unreasonable in high-end storage applications.  The thing that gets me is that the hype is generally among user groups where such constraints would not be appropriate.  For example, all of the buzz about OS X getting ZFS (and possibly being the default filesystem): based on the information above, it does not fit into the &#8220;Mac ethos&#8221; of &#8220;it just works.&#8221;  Sure, it may get there one day, but why did a lot of people get all worked up about the availability of OS X binaries when they probably won&#8217;t be ready for general use for a long time?  I guess it&#8217;s pre-excitement.</p>
<p>Now, as for ZFS&#8217;s unorthodox design choices: the designers essentially decided to collapse (or induce a tighter coupling between) many storage layers, including volume management and striping/RAID, and make them a more integrated part of the filesystem.  In fact, Linux kernel developer Andrew Morton famously called ZFS a <a href="http://lkml.org/lkml/2006/6/9/389">&#8220;rampant layering violation&#8221;</a> (ZFS developer Jeff Bonwick replies to Morton&#8217;s comment <a href="http://blogs.sun.com/bonwick/entry/rampant_layering_violation">here</a>).  Normally you have a separate filesystem-agnostic volume manager (like FreeBSD&#8217;s <a href="http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/vinum-vinum.html">Vinum</a> and Linux&#8217;s LVM/<a href="http://sourceware.org/lvm2/">LVM2</a>), and potentially RAID below that.  The presence of those layers, however, effectively virtualizes some aspect of the underlying disks and may hide information crucial for making layout decisions impacting performance.  In the proceedings of HotOS X (2005), this point is well articulated in Lex Stein&#8217;s &#8220;<a href="http://www.eecs.harvard.edu/~stein/PAPERS/hotosx-html/">Stupid File Systems Are Better</a>.&#8221;  In that paper, the author argues that a simple (&#8220;stupid&#8221;) filesystem that does random block layout has more uniformly good performance than filesystems with sophisticated layout policies (because the policies make assumptions about the underlying layout which may be completely invalidated by striping or other issues).  Sun took the opposite approach: instead of making their filesystem stupid, Sun removed the layers between the disks and the filesystem, thus giving the filesystem more information to make layout decisions.  
Portability is another good reason not to rely on layering; if you need to ensure that certain volume management features are available on all platforms, you may have to &#8220;bring your own.&#8221;  I don&#8217;t think that was a major factor in Sun&#8217;s ZFS, however, because I believe it was meant to be a compelling reason to use Solaris.</p>
<p>I&#8217;m somewhat ambivalent about the decision to couple layers, because there are good arguments both for and against.  Sun has invested significant effort in engineering complex artifacts where the added complexity ultimately didn&#8217;t pay off, like <a href="http://citeseer.ist.psu.edu/106541.html">Solaris M:N threading</a>, which was an impressive effort but was ultimately ditched for simpler 1:1 user/kernel thread designs (<a href="http://wwws.sun.com/software/whitepapers/solaris9/multithread.pdf">Sun&#8217;s whitepaper</a> explaining the Solaris 9 switch from M:N to 1:1 threading)*. An influential systems paper <a href="http://www.cs.washington.edu/homes/tom/pubs/sched_act.html">&#8220;Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism&#8221;</a> (appearing in SOSP and later in TOCS) argued for the two-level, M:N approach, but also noted it was critical to share information between the user-level scheduler and the kernel level thread scheduler.  In this case, strict layering with both schedulers oblivious to each other would lead to poor performance decisions (and possibly deadlock).</p>
<p>But, in the realm of filesystems, there are also good arguments for clean layering (particularly once distribution is involved).  <a href="http://research.microsoft.com/~thekkath/">Chandu Thekkath</a> of Microsoft Research is a very strong advocate of structuring filesystems/storage abstractions in clean simple layers.  For example, <a href="http://citeseer.ist.psu.edu/thekkath97frangipani.html">Frangipani</a>, a distributed filesystem, is built on top of a virtual distributed disk, <a href="http://citeseer.ist.psu.edu/lee96petal.html">Petal</a>.  In the Frangipani paper, the authors praise the layered approach for making the filesystem very simple and quick to develop.  In addition, the layering itself provides parallelism and the simplicity of the implementation allows it to be quite fast, despite the fact that information sharing in non-layered implementations may open up potential optimization opportunities.  A more recent paper of his describes <a href="http://citeseer.ist.psu.edu/maccormick04boxwood.html">Boxwood</a>, which is another distributed storage mechanism.  Instead of a filesystem-like interface, however, it provides either distributed, replicated block-like storage or persistent data structures (also distributed and replicated).  It again argues convincingly about the benefits of designing a system with clean and simple layers rather than complex tight coupling.   </p>
<p>More widely distributed/peer-to-peer storage systems like CFS, described in <a href="http://citeseer.ist.psu.edu/dabek01widearea.html">&#8220;Wide-area cooperative storage with CFS&#8221;</a>, are commonly built in several layers.  CFS is a read-only filesystem built on top of a distributed block storage system DHash, which is itself built upon <a href="http://pdos.csail.mit.edu/papers/chord:sigcomm01/">Chord</a>, a peer-to-peer overlay network middleware (basically like half of a DHT &#8212; hashing/routing without storage; DHash provides storage).  DHash and Chord were later used to implement <a href="http://pdos.csail.mit.edu/ivy/">Ivy</a>, a peer-to-peer read-write filesystem.  Similarly, both <a href="http://research.microsoft.com/~antr/PAST/">PAST</a>, a peer-to-peer data publishing/archival system (immutable data; not a read-write filesystem) and <a href="http://citeseer.ist.psu.edu/cox02pastiche.html">Pastiche</a>, a cooperative-storage based backup system, are layered on <a href="http://research.microsoft.com/~antr/Pastry/default.htm">Pastry</a>, a feature-rich peer-to-peer routing/DHT-like system.  <a href="http://oceanstore.cs.berkeley.edu/">OceanStore</a>, another wide-area distributed storage system, was itself built upon <a href="http://citeseer.ist.psu.edu/zhao04tapestry.html">Tapestry</a>, another peer-to-peer DHT/overlay network.</p>
<p>Anyway, now I&#8217;m just rambling through tangentially related work in distributed filesystems, but I guess the question is whether the added complexity of ZFS will pay off versus something like btrfs or HAMMER on top of a good volume manager and RAID.  I guess time will tell, but I&#8217;m sympathetic to arguments for both alternatives: on one hand, increasing coupling between layers allows you to optimize, but decreasing coupling may make each layer simpler.  Given finite development time/effort, it&#8217;s easier to perfect and optimize simple artifacts than complex ones.  As for my previous examples, one might say they aren&#8217;t directly relevant in that distribution nearly always suggests a clean layered design because the complexity is just too high otherwise, whereas a local filesystem with locally attached disks may benefit from cooperation between layers because they do similar things and you gain performance and flexibility within the confines of a specific filesystem.  On the other hand, separate layers give you more horizontal flexibility (i.e. if you definitely need to use something other than ZFS): for example, Linux&#8217;s LVM2 supports snapshots on many filesystems at the volume manager level, but they&#8217;re not as flexible or fast as ZFS.  </p>
<p>Anyway, I guess the whole impetus behind this post was that I&#8217;m bothered by the level of hype I&#8217;m seeing in certain circles over ZFS (and the marketing label &#8220;the last word in file systems&#8221; doesn&#8217;t help).  Sure, it seems like impressive engineering, but nothing particularly groundbreaking or revolutionary.  I&#8217;ll be interested to see how it ultimately turns out in competing against various other filesystems in development.</p>
<p>* Incidentally, FreeBSD, one of the last major holdouts with M:N threading (along with NetBSD), is also purportedly switching to 1:1 threading by making the 1:1 libthr the default threading library in very-soon-to-be-released FreeBSD 7.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/14/feed</wfw:commentRss>
		</item>
		<item>
		<title>More pxelinux tricks: GParted LivePXE and PXE-booting DOS CDs</title>
		<link>http://www.thegibson.org/blog/archives/13</link>
		<comments>http://www.thegibson.org/blog/archives/13#comments</comments>
		<pubDate>Wed, 02 Jan 2008 06:22:49 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
		
		<category><![CDATA[PXE-related]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/archives/13</guid>
		<description><![CDATA[Well, it&#8217;s been a while since my last post because I&#8217;ve been busy writing a bunch and didn&#8217;t feel much like writing a blog entry in addition.  Anyway, since my last post, LUG@GT held another InstallFest.  This time I decided to add a PXE bootable GParted live distro so we could also repartition [...]]]></description>
			<content:encoded><![CDATA[<p>Well, it&#8217;s been a while since my last post because I&#8217;ve been busy writing a bunch and didn&#8217;t feel much like writing a blog entry in addition.  Anyway, since my last post, <a href="http://lugatgt.org">LUG@GT</a> held another InstallFest.  This time I decided to add a PXE bootable <a href="http://gparted.sourceforge.net/">GParted</a> live distro so we could also repartition without involving extra optical media.  In order to do that, I started from the base of the <a href="http://gparted-livecd.tuxfamily.org/">GParted LiveCD</a> (which is also suitable for a LiveUSB version).  The GParted LiveCD is based on Gentoo, so preparing a PXE bootable image is similar to how I must prepare the Gentoo installer for PXE booting.  Since this is the most complicated image to prepare for PXE booting (relative to Ubuntu, Debian, Fedora, RHEL, OpenSUSE and Mandriva, which are the other distros we offer), I will first start with the instructions on making a Gentoo install PXE-bootable.</p>
<p><span id="more-13"></span><br />
<strong>Gentoo PXE Install</strong></p>
<ul>
<li>Grab a minimal install ISO.   For this example, I&#8217;ll use install-x86-minimal-2007.0-r1.iso.</li>
<li>Mount the iso and copy the following files: image.squashfs, isolinux/gentoo, isolinux/gentoo.igz (and isolinux/isolinux.cfg for reference)</li>
<li>Make a temporary directory and unpack gentoo.igz: <code>mkdir tmp; cd tmp; zcat ../gentoo.igz | cpio -idv</code></li>
<li>Make a mnt/cdrom subdirectory (mkdir -p mnt/cdrom) and copy the image.squashfs into it</li>
<li>Patch the init script so that it looks for the squashfs image in the right place.  Here&#8217;s my <a href="http://www.thegibson.org/blog/files/gentoo/gentoo_init.patch">gentoo_init.patch</a>.</li>
<li>Repack the gentoo.igz file: <code>find * | cpio --quiet --dereference -o -H newc | gzip -9 &gt; ../gentoo.igz</code></li>
</ul>
<p>Now, add Gentoo entries to your pxelinux config following the isolinux.cfg for reference, but add <code>real_root=/</code> to the kernel parameters (the <code>append</code> line), assuming you are using my patch.</p>
<p><code> </code><br />
<strong>GParted Live PXE</strong><br />
So the process for making the GParted LiveCD PXE bootable is similar.  Once I have performed the above modifications on the .igz file (to pack the squashfs right into the initrd), I could take the iso/syslinux entries for the GParted boot options and add them directly to my pxelinux boot menu.  Instead, I decided to make a two-level menu with a self-contained GParted disk image.  In other words, you select GParted from the PXE boot options and it loads a disk image which itself boots into syslinux, prompting for GParted-specific options. This is slightly more complicated, but it separates the maintenance of GParted-specific options from the pxelinux.cfg file.  One could accomplish this kind of two-level separation in other ways, too, but I&#8217;ll describe what I did.</p>
<ul>
<li>Using syslinux&#8217;s mkdiskimage tool, make a disk image to contain the files: <code>mkdiskimage disk.img 7 255 63</code> (make sure it is big enough to hold everything)</li>
<li>Mount the FAT filesystem part of the disk image: <code>mount -o loop,offset=32256 disk.img files</code></li>
<li>Copy the appropriate files from the LiveUSB stuff to the image (gparted kernel, gparted.igz, properly edited syslinux.cfg, options.msg, boot.msg, splash.lss, etc.)</li>
<li>Unmount the loopback image: <code>umount disk.img</code></li>
<li>Run syslinux on the filesystem: <code>syslinux -o 32256 disk.img</code></li>
</ul>
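<p>The magic numbers in those commands come straight from the disk geometry.  Here is a small Python sketch of the arithmetic; it assumes 512-byte sectors and that the partition starts on the second track (sector 63), which is consistent with the offset used above.</p>
<pre>
SECTOR_SIZE = 512  # bytes per sector


def geometry_bytes(cylinders, heads, sectors_per_track):
    """Total size in bytes of a disk with the given C/H/S geometry."""
    return cylinders * heads * sectors_per_track * SECTOR_SIZE


def first_partition_offset(sectors_per_track):
    """Byte offset of a partition that starts on the second track."""
    return sectors_per_track * SECTOR_SIZE


print(geometry_bytes(7, 255, 63))   # size of the whole image
print(first_partition_offset(63))   # value for offset= and syslinux -o
</pre>
<p>For the 7/255/63 geometry this gives an image of 57,576,960 bytes (about 55 MiB) and an offset of 32256, so if the files don&#8217;t fit, bump the cylinder count and keep the same offset.</p>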
<p>Now I can copy this disk image somewhere and add an entry to my pxelinux.cfg similar to this:</p>
<pre>label gparted
  kernel memdisk
  append initrd=gparted.img c=7 h=255 s=63 noedd</pre>
<p><code> </code><br />
If it doesn&#8217;t work, make sure after copying the files to the loopback image that you don&#8217;t have errors due to going &#8220;out of bounds&#8221; on the image.  I&#8217;ve found that when the size of the files is close to the size of the filesystem, and I copy and recopy certain files, I&#8217;ll get error messages like the following:<br />
<code><br />
kernel: loop1: rw=1, want=112452, limit=112392<br />
kernel: lost page write due to I/O error on loop1<br />
</code><br />
I find that starting over with a fresh filesystem and copying the files only once fixes this.<br />
<code> </code><br />
<strong>Bootable DOS CD</strong><br />
Finally, I also wanted to make a bootable DOS CD PXE bootable, so I tried a couple of methods before finding one that worked well.  I won&#8217;t name the CD because it is probably against the terms of the license to do something like this, but it&#8217;s just convenient to not have to have media handy for personal uses.  I used some utilities from the <a href="http://mtools.linux.lu/">mtools</a> suite as well as the <a href="http://freshmeat.net/projects/geteltorito/">geteltorito</a> tool which is included in Debian&#8217;s genisoimage package.</p>
<ul>
<li>Extract the El Torito bootable image from the CD: <code>geteltorito -o eltorito.img cd.iso</code></li>
<li>Grab the DOS boot block from the El Torito image: <code>dd if=eltorito.img of=bootblock.img count=512 bs=1</code></li>
<li>Set up mtools to create a loopback disk image.  For this, I edited my ~/.mtoolsrc to contain <code>drive x: file="/tmp/floppyimage"</code></li>
<li>Use mformat to format a disk image big enough to hold the relevant files and give it the boot block: <code>mformat -C -t 160 -s 36 -h 2 -B bootblock.img x:</code></li>
<li>After that, I had to edit autoexec.bat and other CD startup files so that they referred to files in the right places.  This is the non-automatic part of the process.  Previously, the system would boot with A: being the El Torito image, load mscdex.exe and then refer to Y: for files on the CD.  When PXE booting, all of the relevant files are on A: and mscdex doesn&#8217;t need to be loaded.</li>
<li>After editing the proper files, copy them to the disk image: <code>mcopy -s * x:</code></li>
</ul>
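<p>Because the mformat geometry fixes the size of the image, it can help to compare that size against the files before running mcopy.  This is only a rough sketch: <code>image_capacity</code> and <code>total_size</code> are hypothetical helpers, 512-byte sectors are assumed, and the raw capacity ignores FAT overhead, so leave some headroom:</p>

```shell
# Hypothetical helper: raw capacity in bytes for a geometry given as
# tracks, heads, sectors per track (assumes 512-byte sectors).
image_capacity() {
  echo $(( $1 * $2 * $3 * 512 ))
}

# Hypothetical helper: combined size in bytes of the files passed as arguments.
total_size() {
  total=0
  for f in "$@"; do
    size=$(cat "$f" | wc -c)
    total=$(( total + size ))
  done
  echo "$total"
}

# Geometry from the mformat step above: -t 160 -h 2 -s 36
capacity=$(image_capacity 160 2 36)
echo "raw capacity: $capacity bytes"

# Example use, with whatever files you plan to copy:
#   needed=$(total_size autoexec.bat config.sys command.com)
#   [ "$needed" -lt "$capacity" ] || echo "files will not fit"
```

<p>For the 160/2/36 geometry the raw capacity works out to 5898240 bytes.</p>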
<p>In the last step, make sure to get the relevant DOS system and startup files.  The CD I was modifying used PC-DOS, so I needed ibmio.com, ibmdos.com and command.com as well as startup files like autoexec.bat and config.sys.  After the disk image was made, I could simply add an entry to pxelinux.cfg like the following:</p>
<pre>label dosboot
  kernel memdisk
  append initrd=dosimage.img.gz c=160 h=2 s=36 floppy noedd</pre>
<p><code> </code><br />
In this case, I gzipped the image.  For gparted, I didn&#8217;t gzip the outer image because the contents were already mostly compressed (e.g. the igz and kernel image).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/13/feed</wfw:commentRss>
		</item>
		<item>
		<title>Systems research is not sexy</title>
		<link>http://www.thegibson.org/blog/archives/12</link>
		<comments>http://www.thegibson.org/blog/archives/12#comments</comments>
		<pubDate>Tue, 06 Nov 2007 06:25:32 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
		
		<category><![CDATA[Research Content]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/archives/12</guid>
		<description><![CDATA[Sometimes I envy my fellow CS grad students doing research in areas like computer vision, graphics, robotics, infovis, etc. because their demos and final products are usually sexier and are much more accessible to non-computer scientists and non-technical people.  That sort of stuff makes the news and just plain looks cool to bystanders. Distributed [...]]]></description>
			<content:encoded><![CDATA[<p>Sometimes I envy my fellow CS grad students doing research in areas like computer vision, graphics, robotics, infovis, etc. because their demos and final products are usually sexier and are much more accessible to non-computer scientists and non-technical people.  That sort of stuff makes the news and just plain looks cool to bystanders. Distributed programming middleware, filesystems, operating systems research and a lot of other systems research topics just don&#8217;t usually lead to very sexy demos, at least not the infrastructure &#8212; the applications built on top may provide cool demos, but people tend to take the infrastructure for granted.  Other computer scientists may (hopefully) appreciate good systems research, but it&#8217;s just not sexy.</p>
<p>Look at stuff that gets published in SIGGRAPH, ICCV, ACM Multimedia and similar venues.  Compare with venues like SOSP, OSDI, Usenix, etc.  In terms of the ability to appeal to non-technical people, I think systems work is soundly beaten.  Now, of course popular appeal is not the point of these venues, and I don&#8217;t think it&#8217;s something to &#8220;fix&#8221;; I&#8217;m just using them as representative samples of their respective subdisciplines.  I guess what makes the output of some disciplines more accessible to outsiders is their connection to the real world (the parts that people interact with, at least).  Graphics and information visualization deal with visual output, and computer vision deals with visual input. Robots navigate in and manipulate the real world.  Systems work, by contrast, builds software that interfaces with either computer hardware or other layers of software.  In that respect, middleware is sort of the ultimate &#8220;boring&#8221; and unappreciated artifact.</p>
<p>On the topic of cool graphics demos, here are a few that immediately come to mind:</p>
<ul>
<li> <a href="http://graphics.stanford.edu/papers/dual_photography/">Dual Photography</a> from SIGGRAPH 2005 &#8212; check out the demo where they reconstruct the face of a playing card with its back to the camera</li>
<li> <a href="http://phototour.cs.washington.edu/">Photo Tourism</a> from SIGGRAPH 2006 &#8212; Microsoft Live Labs has turned this into Photosynth, and they have been doing demos of it quite a bit for the past year or so.   Blaise Aguera y Arcas gave a <a href="http://www.ted.com/index.php/talks/view/id/129">MS Photosynth and Seadragon demo</a> as a TED talk.</li>
<li> <a href="http://www.thegibson.org/blog/wp-admin/Scene%20Completion%20Using%20Millions%20of%20Photographs">Scene Completion Using Millions of Photographs</a> from SIGGRAPH 2007 &#8212; Applying &#8220;Google-sized&#8221; data sets to hard vision problems.  Alexei Efros, the faculty advisor of this project, gave a Google Tech talk recently titled <a href="http://video.google.com/videoplay?docid=-8639996003880499413">&#8220;Using Data to &#8216;Brute Force&#8217; Hard Problems in Vision&#8221;</a> talking about the results of this paper and related data-heavy efforts.  Again, the talk has lots of cool, compelling visual examples of the research.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/12/feed</wfw:commentRss>
		</item>
	</channel>
</rss>
