<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>/dev/rant</title>
	<atom:link href="http://www.thegibson.org/blog/feed" rel="self" type="application/rss+xml" />
	<link>http://www.thegibson.org/blog</link>
	<description>Technology-related rantings of David Hilley</description>
	<lastBuildDate>Wed, 09 Feb 2011 06:52:03 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Flaky RAM bit flips</title>
		<link>http://www.thegibson.org/blog/archives/847</link>
		<comments>http://www.thegibson.org/blog/archives/847#comments</comments>
		<pubDate>Thu, 01 Jul 2010 06:48:02 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=847</guid>
		<description><![CDATA[A few days ago, a Ksplice blog entry titled &#8220;Attack of the Cosmic Rays!&#8221; described the process of tracking down a program error caused by a flipped bit in RAM. Basically, a code page (an ELF text section page) of the &#8220;expr&#8221; binary was sitting in buffer cache in RAM and was affected by a [...]]]></description>
			<content:encoded><![CDATA[<p>A few days ago, a <a href="http://blog.ksplice.com">Ksplice</a> blog entry titled <a href="http://blog.ksplice.com/2010/06/attack-of-the-cosmic-rays/">&#8220;Attack of the Cosmic Rays!&#8221;</a> described the process of tracking down a program error caused by a flipped bit in RAM.  Basically, a code page (an ELF text section page) of the &#8220;expr&#8221; binary was sitting in buffer cache in RAM and was affected by a flipped bit.  This changed bit caused the program to segfault consistently, and the problem was solved by manually clearing the OS buffer cache (causing the expr binary to be reloaded from disk).</p>
<p><b>My own experience</b><br />
This reminded me of a similar incident that occurred when I was an undergraduate in college taking my first non-introductory systems class.  We had to write a multi-threaded web server, and I had been banging on the project pretty regularly for at least a week.  I had gotten to the point where I was confident my code was stable and ready to be turned in when the program suddenly started segfaulting frequently and without much perturbation.  Opening the core file in gdb showed that segfaults were happening when an obviously invalid pointer was being dereferenced, which made no sense.  I was reasonably certain that there was no way I could have missed a frequent, obvious segfault problem, so I was a bit perplexed.  Then, while I was trying to debug the issue, other applications started to crash, like emacs, gaim and xterm.  At that point I was actually relieved that it was probably a hardware issue.  I rebooted into memtest86 and thousands of memory errors were detected within minutes.  I took out each DIMM and retested and eventually narrowed it down to a single module, and continued using my system with half the RAM until the memory was RMA&#8217;d.  </p>
<p><b>NULL with one bit flipped!</b><br />
So suddenly what I was seeing in the debugger made sense.  All of the invalid pointers were powers of 2 and unusual addresses like 0x00000008.  After I ran memtest, it clicked &#8212; on x86 (and many platforms) a C NULL pointer&#8217;s runtime representation is the address with all 0 bits.  If one of those bits gets flipped, the standard NULL pointer checks fail and it will probably end up getting dereferenced like a valid pointer shortly thereafter.  </p>
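<p>As a rough illustration of why the bad pointers looked the way they did, here is a small sketch (in Python rather than the C of the original project; <code>flip_bit</code> and <code>is_power_of_two</code> are names made up for this example).  Flipping any single bit of an all-zero word yields a power of two, and the result no longer compares equal to NULL.</p>
<pre>
NULL = 0  # assumption: NULL is the all-zero bit pattern, as on x86

def flip_bit(value, bit):
    """Return value with one bit inverted, as a single-bit RAM error would."""
    return value ^ (2 ** bit)

def is_power_of_two(n):
    """True when exactly one bit is set."""
    return n != 0 and bin(n).count("1") == 1

# Every possible single-bit corruption of a 32-bit NULL pointer.
corrupted = [flip_bit(NULL, bit) for bit in range(32)]

for ptr in corrupted:
    assert ptr != NULL            # an "if (ptr != NULL)" guard is fooled
    assert is_power_of_two(ptr)   # which is why the addresses looked so odd

print(hex(corrupted[3]))  # bit 3 flipped
</pre>
<p>Flipping bit 3 gives 0x8, the kind of address that showed up in gdb.</p>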
<p><b>ECC and more advanced things</b><br />
I&#8217;m not an expert on DRAM hardware, but it&#8217;s possible that the bit error rate is not decreasing as fast as RAM capacity is increasing.*  As more and more data is being exclusively kept in RAM (like giant memcached clusters), it is possible bit-errors could become increasingly problematic.  ECC has generally been considered essential in servers (a <a href="http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf">paper</a> from Google and University of Toronto in SIGMETRICS &#8217;09 on this is interesting).  And I make sure any machines with large storage have ECC RAM (especially ones with software RAID, since so much data manipulation happens in host memory).  But now high-end machines are getting much more advanced memory error correction capabilities.  For example, IBM&#8217;s xSeries servers have features they call <a href="http://www.redbooks.ibm.com/abstracts/tips0259.html">&#8220;Active Memory&#8221;</a>, including memory mirroring (like RAID-1 for RAM), <a href="http://en.wikipedia.org/wiki/Chipkill">Chipkill</a> &#8212; a very strong ECC implementation capable of handling multiple bit errors even from the same memory module, and memory scrubbing.  The latter is very neat &#8212; basically the RAM equivalent of RAID scrubbing, which is an important process for maintaining disk arrays.  Scrubbing detects latent errors before a failure; otherwise they will be detected during rebuild when it may be impossible to recover.</p>
<p>* This problem can affect non-volatile storage, too (the error rate on large media is decreasing slightly slower than storage density is growing; that means single parity redundancy is increasingly likely to encounter an error during rebuild as disk sizes increase.  See <a href="http://queue.acm.org/detail.cfm?id=1670144">&#8220;Triple-Parity RAID and Beyond&#8221;</a> for a bit on that issue as well as the more important throughput to capacity ratio consideration).     </p>
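<p>To make the ECC idea above a bit more concrete, here is a toy single-error-correcting Hamming(7,4) code in Python.  This is a textbook construction picked purely for illustration, not what any particular DIMM or Chipkill implementation does (real ECC memory protects much wider words and usually detects double-bit errors as well), and the function names are made up.</p>
<pre>
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (data bits, error position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # the syndrome names the flipped bit
    if position:
        c[position - 1] ^= 1          # repair it in place
    return [c[2], c[4], c[5], c[6]], position

word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # a single bit flips while the word sits in RAM
recovered, position = hamming74_correct(stored)
print(recovered == word, position)
</pre>
<p>This prints <code>True 5</code>: the flipped bit is located and repaired.  A scrubber, in this simplified picture, is just a loop that runs the correction step over all of memory periodically and writes the repaired words back before a second error lands in the same word.</p>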
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/847/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Storage and scale</title>
		<link>http://www.thegibson.org/blog/archives/778</link>
		<comments>http://www.thegibson.org/blog/archives/778#comments</comments>
		<pubDate>Sun, 07 Mar 2010 22:30:22 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
				<category><![CDATA[Research Content]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=778</guid>
		<description><![CDATA[In 2008, I worked at IBM Research Almaden on a project called Panache. I was recently pleased to find that an in-depth paper on Panache appears in the proceedings of FAST &#8217;10 (PDF). The main impetus for my post was not specifically on Panache, however. Over the years I&#8217;ve been thinking about the evolution of [...]]]></description>
			<content:encoded><![CDATA[<p>In 2008, I <a href="http://www.thegibson.org/blog/archives/27">worked</a> at IBM Research Almaden on a project called Panache.  I was recently pleased to find that an in-depth paper on Panache appears in the <a href="http://www.usenix.org/events/fast10/tech/">proceedings</a> of FAST &#8217;10 (<a href="http://www.usenix.org/events/fast10/tech/full_papers/eshel.pdf">PDF</a>).  </p>
<p>The main impetus for my post was not specifically on Panache, however.  Over the years I&#8217;ve been thinking about the evolution of storage as the scale grows, and it comes up in conversation maybe once every six months.  Most recently, it came up during a job interview when talking about some of the projects I&#8217;d worked on.  I made a note to myself to post something on this.</p>
<p>The growth of storage solutions tends to follow a certain trajectory.  In the following I&#8217;m going to be very loose with definitions and paint with a broad brush.  True backup and redundancy for availability are different, but here I&#8217;m just covering redundancy for availability:</p>
<ul>
<li> Single drive / JBOD &#8212; home users tend to start out with one hard drive or a collection of disks; there&#8217;s no redundancy here so they may just manually sync to another drive or system to guard against failure (or the most popular form of failure contingency: nothing).
</li>
<li> Single-host RAID / NAS &#8212; after home users&#8217; storage needs increase in scale, they may move to a RAID solution which adds some disk-level redundancy.  RAID isn&#8217;t a backup solution (you can still delete or corrupt your own files by accident), but it helps give you some redundancy in the face of drive failure.  Maybe these drive arrays are in a home-network visible NAS exporting storage via NFS or CIFS or maybe they&#8217;re just in one main system where you use the storage locally.  Small enterprises may use this kind of NAS solution too.   Crude scaling &#8220;up&#8221; may be achieved by having several different NAS appliances serving separate sets of files.
</li>
<li> SAN / &#8220;Enterprise&#8221; storage &#8212; after you reach a certain level of scale, you want to eliminate the single point of failure imposed by the NAS / single host with a RAID setup.  You&#8217;re protected from a drive failure, but the host could still fail, the network connection to the host could fail, the drive controller could fail, etc.  At this point you tend to move to a storage area network (SAN) setup, often from a company like NetApp, IBM or Panasas.  The idea behind a SAN is that your machines connect to your disks like switched local networking.  You may have 16 machines and 128 disks, and all machines can talk to any disk.  If a machine fails or a controller fails, the rest of the machines can still talk to the same disks just fine.  Generally there is some redundancy at the IO controller level, too.  These are typically coordinated by a shared-disk filesystem like <a href="http://en.wikipedia.org/wiki/IBM_General_Parallel_File_System">GPFS</a> or <a href="http://en.wikipedia.org/wiki/OCFS">OCFS2</a>.<br />
In a large enterprise, it is impractical for all machines that need to access the storage to have a direct SAN connection, so in this case you may have a small number of storage-connected NAS hosts that export the same coherent underlying shared-disk filesystem via NFS or CIFS.  Then clients can access any exporting host via a standard network filesystem protocol, and any NAS host can die without the entire filesystem becoming inaccessible.  Redundancy may be achieved by layering the filesystem over redundant disk arrays in the SAN, or by doing parity / Reed Solomon coding within the filesystem on individual disks (I talked about <a href="http://www.thegibson.org/blog/archives/15">&#8220;Declustered RAID&#8221;</a> in an earlier post), or by doing file-level and metadata-level replication.</li>
<li> Google Filesystem &#8212; &#8220;Enterprise&#8221; storage based on SANs is way more expensive per unit of storage than commodity disks, and there is a limit to how far you can scale &#8220;out&#8221; the number of hosts with access to the storage and scale &#8220;up&#8221; the amount of storage.   Once you get to the level of companies like Google or Amazon, you are already going to have sophisticated internal software for managing failures when dealing with your own computation.  At this level of scale, it makes sense to build your own custom data storage solution using the same failure-handling principles with unshared, commodity disks.  <a href="http://en.wikipedia.org/wiki/Google_File_System">Google filesystem (GFS)</a> is layered over a bunch of commodity servers with local non-redundant storage.  The storage is aggregated into a coherent distributed filesystem and data is replicated at the &#8220;chunk&#8221; level (a large sub-unit of a file).  This will cost a lot less for more storage and you can build in all kinds of special considerations for your own uses (e.g. MapReduce works in concert with GFS to try to process data locally where it is stored).
</li>
</ul>
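<p>The chunk-level replication in the last stage can be sketched in a few lines of Python.  This is only a toy model: the round-robin placement policy, the function names, and the sizes are all assumptions made for the example, and a real system like GFS makes placement decisions with far more information.</p>
<pre>
def place_chunks(file_size, chunk_size, servers, replicas=3):
    """Map each chunk index to the list of servers holding a copy."""
    n_chunks = -(-file_size // chunk_size)  # ceiling division
    return {
        i: [servers[(i + r) % len(servers)] for r in range(replicas)]
        for i in range(n_chunks)
    }

def still_readable(placement, failed):
    """True if every chunk has at least one replica on a live server."""
    return all(
        any(server not in failed for server in holders)
        for holders in placement.values()
    )

servers = ["s0", "s1", "s2", "s3", "s4"]
placement = place_chunks(file_size=200, chunk_size=64, servers=servers)
print(placement[0])
print(still_readable(placement, failed={"s0", "s1"}))
print(still_readable(placement, failed={"s0", "s1", "s2"}))
</pre>
<p>With three replicas per chunk, any two of the five servers can die and the file is still readable; losing all three holders of chunk 0 is what finally makes it unavailable.  No individual server needs redundant storage of its own.</p>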
<p>As I mentioned above, I&#8217;m really glossing over a lot here and speaking in generalities.  Some things I&#8217;ve glossed over:</p>
<ul>
<li>In reality there are very large scale systems that do use SANs (e.g. HPC clusters for scientific simulation) rather than local storage and custom software.  Batch, shared-nothing crunching via MapReduce has different data requirements than tightly coupled parallel MPI jobs. The final stage of &#8220;scale&#8221; listed above is really about what you need to do with all of that data. </li>
<li>GFS/HDFS aren&#8217;t replacements for traditional POSIX filesystems &#8212; they are accessed through a library, they don&#8217;t provide the same semantics (GFS has atomic appends which may append the data more than once), and they were designed to be used in a specific context for batch data analysis in concert with MapReduce/Hadoop.  Given the current state of development, you probably wouldn&#8217;t <i>archive</i> important datasets within HDFS.  The finer details of Google&#8217;s own strategies are private, but given the intended uses of BigTable, it&#8217;s probable that a lot of data is actually archived within GFS on top of BigTable.  This makes sense if the intended uses of the datasets are all within the framework of MapReduce &#8212; jobs that can read out of BigTable.  </li>
<li>Large internet companies with these specialized storage solutions still probably use SAN-based systems for some purposes.  It may not make sense, for example, to store source repositories in GFS or BigTable or to try to export data in GFS like an NFS share.</li>
<li>There are, of course, many other things that don&#8217;t fit so neatly into these categories.  For example, a while back I wrote a <a href="http://www.thegibson.org/blog/archives/14">post</a> where I talked about distributed peer-to-peer filesystems that run over wide-area networks (often built on DHTs). </li>
</ul>
<p>As an aside, it&#8217;s interesting how there is a somewhat parallel scaling trajectory to databases as well.  You start off with stock relational databases, then you tend to shard and denormalize, then from there you may loosen semantics further, interpose memcached, try to do custom transactional replication, etc.; if that can&#8217;t keep up, you may eventually progress to your own eventually consistent key-value store or document DB or non-relational tabular data storage solution or any other number of interesting variants.  But I will save that for some future post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/778/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Traffic starvation</title>
		<link>http://www.thegibson.org/blog/archives/736</link>
		<comments>http://www.thegibson.org/blog/archives/736#comments</comments>
		<pubDate>Wed, 09 Dec 2009 06:46:43 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
				<category><![CDATA[Rants]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=736</guid>
		<description><![CDATA[One of my first posts on this blog was titled, &#8220;Traffic Deadlock&#8221;. Deadlock is a classic pitfall when dealing with mutual exclusion, and another pitfall is &#8220;starvation.&#8221; Starvation occurs when a process cannot acquire a shared resource (and thus cannot make progress). Typical implementations of locks (like pthread&#8217;s mutexes) do not preserve FIFO ordering of [...]]]></description>
			<content:encoded><![CDATA[<p>One of my first posts on this blog was titled, <a href="http://www.thegibson.org/blog/archives/6">&#8220;Traffic Deadlock&#8221;</a>.  Deadlock is a classic pitfall when dealing with mutual exclusion, and another pitfall is &#8220;starvation.&#8221;  Starvation occurs when a process cannot acquire a shared resource (and thus cannot make progress).  Typical implementations of locks (like pthread&#8217;s mutexes) do not preserve FIFO ordering of acquisition requests, so the locks are not fair.  If you have three threads &#8212; A, B, and C &#8212; and they all want access to the same shared resource, B might starve as follows: A holds the lock and B blocks trying to acquire it.  Right as A releases the lock, C also tries to acquire it and wins over B because the lock does not enforce fairness.  Then B is still waiting on the lock while C uses it.  Right as C releases the lock, A tries to acquire the lock and wins over B.  This could continue indefinitely before B gets the lock.  In practice B&#8217;s &#8220;bad luck&#8221; usually doesn&#8217;t happen for too long, because thread scheduling and fluctuations in computation time perturb the system enough to make this rare.  So fair FIFO locks are usually only used when you need lower variance and more predictability (since there is more overhead associated with them).</p>
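<p>For comparison, a fair FIFO lock can be sketched as a &#8220;ticket lock&#8221;: every thread takes a number on arrival and waits until that number is served.  The Python below is an illustrative sketch, not how pthread mutexes or any real library implement fairness; <code>TicketLock</code> and <code>demo_fifo_order</code> are made-up names, and the demo peeks at a private counter only to make the arrival order deterministic.</p>
<pre>
import threading
import time

class TicketLock:
    """A fair lock: threads acquire in the order they asked."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0   # next number to hand out
        self._now_serving = 0   # number currently allowed in

    def acquire(self):
        with self._cond:
            my_ticket = self._next_ticket
            self._next_ticket += 1
            while self._now_serving != my_ticket:
                self._cond.wait()

    def release(self):
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()

def demo_fifo_order():
    """The main thread holds the lock while B, C and A queue up, in that order."""
    lock = TicketLock()
    order = []

    def worker(name):
        lock.acquire()
        order.append(name)
        lock.release()

    lock.acquire()  # main takes ticket 0
    threads = []
    for tickets_issued, name in enumerate("BCA", start=2):
        t = threading.Thread(target=worker, args=(name,))
        t.start()
        # Wait until this worker has actually taken its ticket.
        while lock._next_ticket != tickets_issued:
            time.sleep(0.001)
        threads.append(t)
    lock.release()
    for t in threads:
        t.join()
    return order

print(demo_fifo_order())
</pre>
<p>B asked first, so B gets the lock first no matter how the scheduler behaves; the run prints <code>['B', 'C', 'A']</code>.  The price is the extra bookkeeping and the wake-everyone-and-recheck pattern, which is the overhead mentioned above.</p>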
<p>But sometimes there are structural asymmetries in the system that make starvation more likely and fairness more important.  For example, reader-writer locks make it easier to starve writers if more readers can always enter while the lock is in reader mode (because writers must wait until all readers are done before acquiring).  This brings me to the traffic scenario below: </p>
<p align="center">
<img src="http://www.thegibson.org/blog/wp-content/uploads/2009/12/starvation1.png" alt="Traffic Starvation" width="320" height="388" class="size-full wp-image-742" />
</p>
<p>A popular drive-through (Chick-fil-a) gets backed up at meal times.  The driveway in the picture above is the drive-through entrance.  Now, once the driveway gets backed up to the road, new cars cannot enter until the line moves forward some.  A car turning right into the entrance must wait if the line is backed up too far, but a car turning left across traffic must wait for both 1) no immediate oncoming traffic and 2) space in the line.  Since the left turning car must yield to all right turns, you get an interesting starvation phenomenon: the line backs up so no car can enter and the left turning car waits.  The line moves forward some so now there is room, but the traffic in the other lane continues to move and blocks the turn.  Eventually a right turning car fills the space in the line, so when the next break in traffic occurs the left turning car still cannot complete the turn.  All of the right turning cars occupy any freed space in line and starve out the left turning ones.</p>
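<p>The drive-through scenario is easy to turn into a toy simulation.  Everything in this Python sketch is invented for illustration (the function name, the fixed arrival periods, the single left-turning car); it only models the priority rule described above, where right-turning cars claim any free space before the left-turning car gets a chance.</p>
<pre>
def left_turn_entry_time(steps, slot_every, right_every, gap_every):
    """Return the step at which the left-turning car enters, or None."""
    free_slots = 0
    right_waiting = 0
    for t in range(1, steps + 1):
        if t % right_every == 0:
            right_waiting += 1      # another right-turning car arrives
        if t % slot_every == 0:
            free_slots += 1         # the line moves forward one car
        # Right turns have the right of way and fill free space first.
        while free_slots and right_waiting:
            free_slots -= 1
            right_waiting -= 1
        gap_in_traffic = (t % gap_every == 0)
        if free_slots and gap_in_traffic:
            return t                # left turn finally completes
    return None

# Right-turning cars arrive faster than the line drains: starvation.
print(left_turn_entry_time(1000, slot_every=3, right_every=2, gap_every=5))
# No right-turning cars at all: the left turn only needs a gap in traffic.
print(left_turn_entry_time(1000, slot_every=3, right_every=10**6, gap_every=5))
</pre>
<p>The first call prints <code>None</code> (the left-turning car never gets in during the simulated period) and the second prints <code>5</code>.  Same mechanism as a reader-preferring reader-writer lock: the party that needs two conditions to hold at once loses to the party that needs only one.</p>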
<p>This was a frustration I encountered first hand, and it reminded me of <a href="http://www.slashfood.com/2009/11/27/drive-thru-wait-times-getting-longer/">an article</a> I read a few weeks ago (on <a href="http://www.slashfood.com/">Slashfood</a>) about increasingly unreasonable fast food drive-through wait times and the traffic problems a poorly-placed drive-through can cause. </p>
<blockquote><p>&#8220;In Peabody, Mass., the opening of the first New England Sonic in August drew excited customers from across the region anticipating their first extra-long chili-cheese Coney hot dog. In the first month, it wasn&#8217;t unusual for customers to wait in their cars for up to four hours.&#8221;</p></blockquote>
<p>Not exactly &#8220;fast food,&#8221; is it?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/736/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Chrome extension: edit Gmail textarea in an external editor</title>
		<link>http://www.thegibson.org/blog/archives/689</link>
		<comments>http://www.thegibson.org/blog/archives/689#comments</comments>
		<pubDate>Sun, 08 Nov 2009 04:32:22 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=689</guid>
		<description><![CDATA[A month or two ago, I switched to using Chrome daily builds (on Debian) as my primary browser. As with most people switching to Chrome, I love the stability and speed, but there are a few extensions from Firefox I sorely miss. A major one is mozex, which allows you to edit textareas in an [...]]]></description>
			<content:encoded><![CDATA[<p>A month or two ago, I switched to using Chrome daily builds (on Debian) as my primary browser.  As with most people switching to Chrome, I love the stability and speed, but there are a few extensions from Firefox I sorely miss.  A major one is <a href="http://mozex.mozdev.org/">mozex</a>, which allows you to edit textareas in an external editor (It&#8217;s All Text also does this).  I used this for editing Gmail.  I used Gnus as my primary mail and news client for years, and one of the big sticking points to switching to webmail was the anemic textarea editing interface.  I&#8217;m just used to all of the trappings of Emacs &#8212; dict-mode, fill-paragraph, the keybindings, remembrance agent, etc.  </p>
<p>Anyway, I decided to see if I could whip up a quick and dirty Chrome extension to substitute for mozex.  And when I say quick and dirty, I mean it &#8212; this code would make a Perl-happy sysadmin blush.  But I hacked something up in a few hours, and I thought I&#8217;d put it out there since I find it useful.  Maybe someone else can run with it and make it more complete like mozex.  With a few one-line edits, this extension can be made to allow the editing of virtually any textarea in any external editor.  I limited it to Gmail because I don&#8217;t know if there is the possibility of resource leaks or other weird side-effects from running it more extensively.  Also, it&#8217;s hard-coded to use Emacs, but you could just as easily use Gvim (have it spawn an xterm with an editor or use emacs-client or whatever).</p>
<p>I&#8217;ve put the extension source files up here:<br />
<del datetime="2009-12-06T07:05:54+00:00">http://www.thegibson.org/blog/files/emacs_chrome</del> </p>
<p><del datetime="2009-12-12T16:38:59+00:00">Edit: see <a href="http://www.thegibson.org/blog/files/emacs_chrome/sync_ver">http://www.thegibson.org/blog/files/emacs_chrome/sync_ver</a> &#8212; the original only works right on platforms like Linux due to a serendipitous interaction (see comments).</del><br />
Edit2: see <a href="http://www.thegibson.org/blog/files/emacs_chrome/ba_sync_ver/">http://www.thegibson.org/blog/files/emacs_chrome/ba_sync_ver/</a> &#8212; this version uses a Chrome &#8220;browser action&#8221; rather than a &#8220;page action.&#8221;  The &#8220;browser action&#8221; extension type is better suited for this.  Note that some of the text below is no longer applicable to this new and simpler version, so see comments.  I&#8217;ll try to make a new post soon with the updated info.</p>
<p>The extension comes with a Python script (you&#8217;ll need Python 2.6 to run it).  The pycl.py script is a little web server running on port 9292.  Run the web server and leave it in the background.  Then add the extension to Chrome (go to chrome://extensions).  Once you add the extension, you can load up Gmail and either compose or reply to a plain text message (not HTML).  Then use the &#8220;compose in a new window&#8221; icon to open a new window, and you&#8217;ll see an Emacs 23 icon appear in the &#8220;page action&#8221; area (the upper right-hand corner of the address bar, where the SSL lock icon goes).  Clicking the icon sends the contents of the text area to the Python web server, which writes it to a temporary file (using tempfile.NamedTemporaryFile) and spawns an editor.  Edit your text; when you&#8217;re done, save and exit.  In the meantime, the background script will start polling the web server to see if the editor has terminated.  When the editor finishes, the file contents are read and sent back to Chrome, which modifies the textarea with the new contents.  </p>
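<p>The server side of that round trip boils down to something like the following sketch.  To be clear, this is not the actual pycl.py: it is written for Python 3 rather than 2.6, it leaves out the HTTP layer entirely, and the function name and polling interval are made up.  It just shows the write, spawn, poll, read-back sequence described above.</p>
<pre>
import os
import subprocess
import sys
import tempfile
import time

def edit_in_external_editor(text, editor_cmd, poll_interval=0.05):
    """Write text to a temp file, run the editor on it, return the result."""
    # delete=False so the file survives being closed before the editor opens it
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(text)
        path = f.name
    try:
        proc = subprocess.Popen(list(editor_cmd) + [path])
        # The extension polls the server; here we poll the process directly.
        while proc.poll() is None:
            time.sleep(poll_interval)
        with open(path) as f:
            return f.read()
    finally:
        os.unlink(path)

# Stand-in "editor" for demonstration: appends a word to the file and exits.
fake_editor = [sys.executable, "-c",
               "import sys; open(sys.argv[1], 'a').write(' edited')"]
print(edit_in_external_editor("hello", fake_editor))
</pre>
<p>Swapping <code>fake_editor</code> for <code>["/usr/bin/emacs"]</code> gives the real behavior: the call returns the file contents once you save and exit.</p>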
<p>Anyway, it has several major limitations &#8212; first and foremost, it only works consistently if you compose or reply in a new window (click that little diagonal arrow icon in the upper right hand corner of the message frame).  Sometimes it works on the main Gmail page, but not always.  Due to the hooks it uses, if you open the new window while viewing the message, and then click reply, you&#8217;ll have to type something in the textarea or un-focus and re-focus the window before it&#8217;ll &#8220;notice&#8221; the existence of the new textarea.  This is a limitation of the way the extension content script is finding the textarea in the page.  Maybe the fact that it only works consistently in a new window has to do with Gmail&#8217;s crazy dynamic DOM manipulation.  Or maybe &#8212; perhaps even more likely &#8212; I&#8217;m missing something obvious.  I&#8217;m sure someone with more Javascript / DOM experience could probably fix this.  I&#8217;ve never used Javascript in any context before, so it was an interesting experience.  I basically looked at two Chrome example extensions &#8212; 1) the RSS feed subscription and 2) the Gmail checker &#8212; and munged their code and techniques together.</p>
<p><b>Changing editors</b><br />
To change editors, just edit pycl.py and modify the line:</p>
<pre>p = subprocess.Popen(["/usr/bin/emacs", f.name])</pre>
<p><code></code><br />
<b>Making it work on arbitrary textareas</b><br />
If you want to change the extension to work on more than just Gmail, first edit manifest.json and change the &#8220;matches&#8221; line:</p>
<pre>"matches": ["http://mail.google.com/*", "https://mail.google.com/*", "file://*/*"],</pre>
<hr />
<pre>"matches": ["http://*/*", "https://*/*", "file://*/*"],</pre>
<p>
The second matches example works on all http and https websites.</p>
<p>Then, in the ta_find.js file, change the getElementsByName(&#8216;body&#8217;); line:</p>
<pre>  result = document.getElementsByName('body');</pre>
<hr />
<pre>  result = document.getElementsByTagName('textarea');</pre>
<p>
The second will find any textarea on the page (one limitation of the current incarnation is that it will only deal with the first textarea, however).</p>
<p><code></code><br />
<b>Making it not suck</b><br />
This is my first foray into in-browser Javascript programming.  I&#8217;m sure there&#8217;s a lot of missing exception handling in background.html when dealing with the XMLHttpRequest objects.  There might be resource leaks too.  And there are corner cases where it won&#8217;t correctly get the textarea contents to send to the external editor (like if you only use mouse cut and paste to change the contents around without any keypresses).  Also, in the Python script, there&#8217;s not much error handling either, and using a dedicated private directory (like mozex) rather than tempfile.NamedTemporaryFile is probably advisable on shared systems.  Also, it&#8217;d be great if some Javascript wizard could help figure out why it only works with Gmail when you open a new window.  Ideally I would have liked to add a context menu item to all textareas, but I figured it was easier to use existing Chrome extensions as templates and do it with the &#8220;page action&#8221; mechanism.</p>
<p><code></code><br />
<b>Thoughts on Javascript</b><br />
Looking at the way that the Gmail checker and the RSS feed subscription work, I see an extensive use of closures and continuation-passing style asynchronous programming.  Given JavaScript&#8217;s reputation, this is definitely a lot nicer and cleaner than what I expected.  Steve Yegge has also pointed out several times that JavaScript is a nicer language than people actually give it credit for, and now I know firsthand.  </p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/689/feed</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Flash and the storage hierarchy (again)</title>
		<link>http://www.thegibson.org/blog/archives/560</link>
		<comments>http://www.thegibson.org/blog/archives/560#comments</comments>
		<pubDate>Wed, 04 Nov 2009 07:10:21 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[Research Content]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=560</guid>
		<description><![CDATA[Earlier this year in a post titled &#8220;Disk is the new disk&#8221;, I ruminated a bit on the implications of the changing storage hierarchy. One of the frequently discussed changes on the horizon is the introduction of flash memory (or other solid state, non-volatile memories like phase-change memory, which I mentioned in that post) into [...]]]></description>
			<content:encoded><![CDATA[<p>Earlier this year in a post titled <a href="http://www.thegibson.org/blog/archives/28">&#8220;Disk is the new disk&#8221;</a>, I ruminated a bit on the implications of the changing storage hierarchy.  One of the frequently discussed changes on the horizon is the introduction of flash memory (or other solid state, non-volatile memories like phase-change memory, which I mentioned in that post) into the storage hierarchy.  One question I didn&#8217;t address at the time is, &#8220;do these solid state non-volatile memories belong as external or internal memory?&#8221;  That is to say, &#8220;do these belong on the memory bus or IO bus?&#8221;  With flash memory, the latencies and writing semantics (i.e., individual bytes can&#8217;t be arbitrarily modified; entire blocks are erased at once) tend to naturally place flash memory devices on the IO bus as external memory.  However, other technologies may be more appropriate to expose as directly CPU-addressable memory.  A recent paper at SOSP &#8217;09 explores this in the context of Phase-Change Memory (although the techniques are applicable to any directly-addressable BPRAM &#8212; byte-addressable persistent memory): <a href="http://research.microsoft.com/apps/pubs/default.aspx?id=81175">&#8220;Better I/O Through Byte-Addressable, Persistent Memory&#8221;</a>.  Unlike flash, PCM can be read and written in a byte-addressable manner, and access times are fast enough to merit putting PCM on the CPU&#8217;s memory bus.</p>
<p>The paper introduces BPFS, a filesystem for BPRAM which exploits various properties of directly-addressable non-volatile memory to improve performance and durability over traditional filesystems.  It&#8217;s interesting how a lot of the very thorny problems of filesystem consistency associated with block-based disk interfaces are solved elegantly by applying traditional in-memory style atomic updates to data structures in BPRAM.  Traditional filesystems use relatively hairy techniques like soft-updates or journaling to try to ensure that persistent data structures are updated in a way that preserves metadata consistency even if the system fails in the middle of a write.  With byte-modifiable data structures, you can use atomic update techniques similar to those used in lock-free data structures (e.g. do modifications on a private copy of data and then atomically &#8220;publish&#8221; it by setting a pointer field pointing to the private copy &#8212; and make sure to put architecture-appropriate fences so reordering doesn&#8217;t bite you!*).  When I started reading the paper, two classic systems papers quickly came to mind:  <a href="http://www.cl.cam.ac.uk/~smh22/docs/lrvm-tocs94.pdf">&#8220;Lightweight Recoverable Virtual Memory&#8221;</a> and <a href="http://www.eecs.umich.edu/~pmchen/papers/lowell97.pdf">&#8220;Free Transactions with Rio Vista&#8221;</a>.  Rio Vista owes a lot to RVM, but one of the memorable things about Rio Vista is that it uses battery-backed RAM to perform atomic and durable transactions on persistent data structures.  Like BPRAM, the persistent but directly-modifiable nature of battery-backed DRAM really simplifies many aspects of the system.  Naturally, when I got to the end of the paper, I found that they cited Rio Vista in their related work.  Anyway, it&#8217;s good to see that one of the co-authors is a former Georgia Tech classmate, Derrick Coetzee.</p>
<p>* One of the most interesting parts of the paper is their &#8220;epoch&#8221; mechanism &#8212; while fences work fine for volatile data structures, they don&#8217;t provide strong enough semantics when you have persistent memory.  Memory fences aren&#8217;t strong enough because they just affect a CPU&#8217;s view of memory, not the actual contents of DRAM.  A CPU&#8217;s view of memory is basically DRAM plus &#8220;diffs&#8221; of more recent data at various caching levels.  With BPRAM, if the power goes out, the diffs in cache disappear and then the persistent data left may not be consistent by itself.  You have to make stronger guarantees about when persistent data gets written to maintain proper stored data structure consistency.</p>
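<p>The copy-then-publish idea has a familiar file-level analogue. It is not what BPFS does, but it may make the footnote more concrete: build a private copy, flush it, then rename it over the old file. A minimal shell sketch, assuming a hypothetical helper named <code>atomic_publish</code>; the <code>sync</code> call stands in for the ordering guarantee discussed above (the data must be durable before the name that publishes it):</p>

```shell
# atomic_publish FILE: replace FILE with the data read from stdin.
# Hypothetical helper for illustration; a file-level analogy, not BPFS itself.
atomic_publish() {
    target=$1
    # Build the private copy next to the target so the rename stays on one filesystem.
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if cat > "$tmp"; then
        # Flush the private copy to stable storage before publishing it.
        sync
        # A same-filesystem mv is a rename(2): readers see the old or new file, never a mix.
        mv -f "$tmp" "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

<p>Usage would look like <code>generate_report | atomic_publish report.txt</code>, where <code>generate_report</code> is any command of your own.</p>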
<p><b>Linux and flash</b><br />
Although the SOSP paper is recent and on my research radar, it is not what prompted this post.  The original impetus came from my recent viewing of the <a href="http://www.techcast.com/events/linuxcon/roundtable/">Linuxcon 2009 roundtable discussion</a>.  This roundtable gained <a href="http://lwn.net/Articles/354835/">some Slashdot notoriety</a> because Linus made a comment about the kernel being &#8220;huge and bloated,&#8221; due to its expanding feature set and icache footprint.  During the Q&amp;A session, someone asked a question regarding flash RAM.  The question was predicated on the assumption that flash will transition to directly addressable (internal) memory and asked whether, since flash cells have limited lifetimes, Linux would eventually integrate code to deal with directly accessible memory failing.  Ted Ts&#8217;o sort of dismissed the question and said that he believed that the right place to address failure properties is in the hardware (and he didn&#8217;t necessarily agree that flash would move to directly addressable memory).  Currently, flash exposed with a &#8220;disk drive&#8221; interface &#8212; i.e., flash that looks like a fast hard drive &#8212; handles failure and wear leveling underneath the storage interface.</p>
<p>This reminded me, however, that there is another class of flash support in Linux that is commonly misunderstood.  Linux has an MTD (Memory Technology Devices) subsystem which supports &#8220;bare flash&#8221; devices.  These devices are more common on embedded systems, and basically &#8220;bare flash&#8221; is exposed flash memory that doesn&#8217;t look/act like a standard hard drive.  The software above gets to access the real flash blocks and has to handle wear leveling, dealing with bad blocks and also dealing with write/erase semantics (things that the firmware would do in a hard drive-like flash disk).  MTD devices don&#8217;t act like block devices and actually expose three operations: read, write, and erase. There&#8217;s a whole class of Linux filesystems built to run on top of these &#8220;bare flash&#8221; devices: YAFFS, JFFS2, LogFS, and UBIFS, and they&#8217;ve also factored out some of the common functionality used in a lot of these filesystems into a separate UBI (Unsorted Block Images) layer.  But I see a lot of misunderstanding of the point of these filesystems.  Many casual Linux users or observers think that these filesystems are flash-optimized regular filesystems to be used on top of hard-drive like flash devices (or CF/SD/etc.), when they are really designed for bare flash.  Anyway, I see this mistake made a lot on forums and such so it came to mind after the Linux roundtable question.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/560/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SSH tips and tricks</title>
		<link>http://www.thegibson.org/blog/archives/572</link>
		<comments>http://www.thegibson.org/blog/archives/572#comments</comments>
		<pubDate>Thu, 29 Oct 2009 04:57:12 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
				<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=572</guid>
		<description><![CDATA[Today, I gave an &#8220;SSH tips and tricks&#8221; presentation for our local LUG. Practically every Linux hobbyist knows about basic ssh, scp (and sftp), but OpenSSH has a lot of rich, hidden functionality &#8212; like a SOCKS proxy, connection sharing and even layer-2 and -3 VPN functionality. Also, there&#8217;s a lot of little useful bits [...]]]></description>
			<content:encoded><![CDATA[<p>Today, I gave an &#8220;SSH tips and tricks&#8221; presentation for our <a href="http://www.lugatgt.org/">local LUG</a>.  Practically every Linux hobbyist knows about basic ssh, scp (and sftp), but OpenSSH has a lot of rich, hidden functionality &#8212; like a SOCKS proxy, connection sharing and even layer-2 and -3 VPN functionality.  Also, there&#8217;s a lot of little useful bits of functionality like ssh-keygen -R, selectable ciphers and ssh -t, which people are often unaware of.  So the goal of my presentation was to go over some lesser known (or more advanced) useful features of OpenSSH.  I&#8217;m posting the presentation here since I post Linux-related stuff on my blog.</p>
<p>This presentation is updated from &#8220;<a href="http://lugatgt.org/articles/ssh_tips/" title="SSH Tips and Tricks">SSH Tips and Tricks</a>&#8221;, given on Wed. Feb 28th, 2007 by <a href="mailto:ben@benmcmillan.com">Benjamin McMillan</a> and <a href="mailto:davidhi@cc.gatech.edu">David Hilley</a>.<br />
 </p>
<p>New things compared to last time:</p>
<ul>
<li>
ssh -o StrictHostKeyChecking=no
</li>
<li>
ssh-keygen -R &amp; HashKnownHosts
</li>
<li>
ssh Restricted Shells (rssh, scponly, etc.)
</li>
<li>
ssh keychain
</li>
<li>ssh pseudo TTY allocation</li>
<li>forwarding bind addresses</li>
<li>
PAM &amp; ssh
</li>
<li>
rsync &amp; ssh ciphers
</li>
<li>compression</li>
<li>
parallel ssh tools
</li>
<li>
fuse sshfs</li>
</ul>
<p>In this presentation, I will skip the &#8220;using ssh&#8221; basics.</p>
<h2>SSH Config File</h2>
<p>The ssh config file isn&#8217;t particularly advanced, but many of the later tips benefit from config file customization.  Typing your username for every host is a pain.  In addition, ssh has a large variety of special options (e.g., compression, agent forwarding, host-specific private keys, etc.) that may differ on a per-host basis.  In your ~/.ssh directory, create a file named config to set host aliases and options.  Example ~/.ssh/config:</p>
<hr />
<pre>Host feynman
  User foobar
  ForwardAgent yes 

Host hawking 192.168.1.1 router
  Hostname hawking
  User root
  ForwardAgent yes
  Port 222 

Host *.cc.gatech.edu
  User bmcm
  ForwardAgent yes
  Compression yes
</pre>
<hr />
Run <code>man ssh_config</code> to view documentation about ssh config files.<br />
<code><br />
</code></p>
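<p>One detail that matters once a config file like the one above grows: for each option, ssh uses the first value it finds, so specific <code>Host</code> blocks belong at the top and a catch-all <code>Host *</code> at the bottom. A minimal sketch; the defaults chosen under <code>Host *</code> are just assumptions for illustration:</p>

```
# ~/.ssh/config (fragment): specific hosts first, catch-all last.
Host feynman
  User foobar

# Applies to every host, but only fills in options not already set above.
Host *
  Compression yes
  ServerAliveInterval 60
```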
<h2>Change ssh cipher for rsync</h2>
<p>Ever rsync over ssh to CPU-limited devices (e.g., VIA CPUs, ARM, Atom, contended shared servers, etc.)?</p>
<hr /><code>$&gt; rsync -e 'ssh -c blowfish' -P -a -v files remotehost:path</code></p>
<hr />
<ul>
<li>arcfour and blowfish are typically &#8220;cheaper&#8221; than 3DES or AES; best one depends on CPU/architecture</li>
<li>&#8220;none&#8221; is an SSHv1-only option; don&#8217;t use it with password authentication across a public network &#8212; your password will be sent in plain text (keys are okay)</li>
<li>Note: If you use connection sharing and already have a connection open, the cipher request will be ignored.</li>
</ul>
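<p>Since the best cipher depends on what both ends support, here is a rough sketch of picking one from a preference list. <code>pick_cipher</code> is a hypothetical helper, and the usage lines assume an OpenSSH new enough to have <code>ssh -Q cipher</code> (6.3 or later); the server still has to accept whatever the client offers:</p>

```shell
# pick_cipher "PREFERRED..." "AVAILABLE...": print the first preferred cipher
# that also appears in the available list; fail if none match.
# Hypothetical helper for illustration.
pick_cipher() {
    for want in $1; do
        for have in $2; do
            if [ "$want" = "$have" ]; then
                echo "$want"
                return 0
            fi
        done
    done
    return 1
}

# Usage (not run here; needs a reachable host):
#   cipher=$(pick_cipher "arcfour blowfish-cbc aes128-ctr" "$(ssh -Q cipher)")
#   rsync -e "ssh -c $cipher" -P -a -v files remotehost:path
```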
<p><code><br />
</code></p>
<h2>SSH Compression</h2>
<p>ssh features built-in compression:</p>
<hr /><code>$&gt; ssh -C user@hostname</code></p>
<hr />This is particularly helpful for X forwarding over WAN connections or even multi-segment LANs (combine with -Y for trusted X11 forwarding).  It can make the difference between unusable and tolerable forwarding.<br />
<code><br />
</code></p>
<h2>SSH Pseudo TTY Allocation</h2>
<p>ssh -t forces pseudo-tty allocation.  By default, when you ssh without a command to execute (just to log in), a pseudo tty is allocated.  When you specify a command to execute, ssh does not allocate a pty.  Forcing it is necessary if you want to run something that&#8217;s not just plain text output, like screen or top.</p>
<hr /><code>$&gt; ssh -t user@host1 htop</code></p>
<hr />Why wouldn&#8217;t you want a pseudo-tty?  Well, consider piping binary data:</p>
<hr /><code>$&gt; ssh user@host1 "cat file" | diff same_file_copy -</code></p>
<hr />You don&#8217;t want a pseudo tty in that case.  It will mangle your data by interpreting escapes and such (unless you want to over-complicate things with uuencode).<br />
<code><br />
</code></p>
<h2>SSH Host Keys</h2>
<p>When you connect to a new host via ssh, the host key is stored in ~/.ssh/known_hosts, establishing the host&#8217;s identity.</p>
<h3>StrictHostKeyChecking option</h3>
<p>When you first connect to an unknown host via ssh, it asks you to confirm:</p>
<hr /><code>$&gt; ssh star2.cc.gt.atl.ga.us<br />
The authenticity of host 'star2.cc.gt.atl.ga.us (143.215.129.169)' can't be established.<br />
RSA key fingerprint is fe:82:da:4d:76:f7:fa:b4:40:6f:7d:3e:1b:b3:01:bb.<br />
Are you sure you want to continue connecting (yes/no)?<br />
</code></p>
<hr />
<br />
This explicit confirmation is good for security, but sometimes you want to script commands via ssh and don&#8217;t want to be prompted.  Personal example: 52 node cluster without shared home directories; do it the naive way, and you&#8217;d need ~(N^2) different confirmations.  Even if you do it on one host and then copy the known_hosts file to the other nodes, it is annoying to go through all of the prompts.  Solution:</p>
<hr /><code>$&gt; ssh star2.cc.gt.atl.ga.us -o StrictHostKeyChecking=no<br />
Warning: Permanently added 'star2.cc.gt.atl.ga.us,143.215.129.169' (RSA) to the list of known hosts.</code></p>
<hr />
<br />
On a cluster, you could do something like this to pre-seed the files:</p>
<hr /><code>$&gt; seq 1 52 | xargs -I@ ssh -o StrictHostKeyChecking=no rohan@.cc.gatech.edu uptime</code></p>
<hr />
</p>
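<p>Another way to pre-seed the file, sketched here as an alternative rather than what I used at the time, is <code>ssh-keyscan</code>. It trusts whatever key each host presents, so it is no more secure than <code>StrictHostKeyChecking=no</code>. <code>cluster_hosts</code> is a hypothetical helper that just prints the node names:</p>

```shell
# cluster_hosts PREFIX DOMAIN COUNT: print PREFIX1.DOMAIN ... PREFIXn.DOMAIN.
# Hypothetical helper for illustration.
cluster_hosts() {
    i=1
    while [ "$i" -le "$3" ]; do
        echo "$1$i.$2"
        i=$((i + 1))
    done
}

# Usage (not run here; needs network access to the nodes):
#   -H hashes the hostnames, -f - reads the host list from stdin.
#   cluster_hosts rohan cc.gatech.edu 52 | ssh-keyscan -H -f - >> ~/.ssh/known_hosts
```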
<h3>HashKnownHosts option</h3>
<p>Default in many cases; hashes hostnames in ~/.ssh/known_hosts.  Entries look like the following (top is hashing on, bottom is hashing off):</p>
<hr />
<pre>|1|QxEMrKqPTNuBtHIEYSbztjaOkF8=|Y1387YZhibtug9rr4ZVXenyRXb4= ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAg...</pre>
<hr />
<pre>boar3.almaden.ibm.com,9.1.112.221 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAg...</pre>
<hr />
Some people prefer to disable it by setting <code>HashKnownHosts no</code> in ~/.ssh/config.  Hashing breaks tab-completion based on ~/.ssh/known_hosts (but not via other means &#8212; like parsing ~/.ssh/config &#8212; so it may not be a big deal).<br />
<br />
Want to convert an old, non-hashed ~/.ssh/known_hosts file to hashed?  Run <code>ssh-keygen -H</code><br />
</p>
<h3>ssh-keygen -R &amp; -F</h3>
<p>When a host&#8217;s key changes, ssh will make you remove the offending key from ~/.ssh/known_hosts before it will let you connect (unless you force it to ignore).  Some people edit this file by hand (why?), but ssh-keygen has specific options for this:</p>
<ul>
<li><code>ssh-keygen -R host </code> &#8212; removes the entry for given host</li>
<li><code>ssh-keygen -F host </code> &#8212; finds entries for a host</li>
</ul>
<p>This works whether or not ~/.ssh/known_hosts is hashed, so don&#8217;t waste your time editing things by hand.<br />
</p>
<h3>View a host&#8217;s key fingerprint?</h3>
<p>Use ssh-keygen: <code>ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub</code> (or dsa).  Add -v to get an ASCII art depiction:</p>
<hr />
<pre>$&gt; ssh-keygen -v -l -f /etc/ssh/ssh_host_rsa_key.pub
+--[ RSA 2048]----+
|    oo  ..       |
|   . ..o. .      |
|    . o. +       |
|     . .= .      |
|      .oS+       |
|      ..oo..     |
|         o+ o    |
|          E+ .   |
|           =o    |
+-----------------+</pre>
<hr />
<p><code><br />
</code></p>
<h2>SSH Keys</h2>
<p>ssh keys allow you to log in to remote machines with a public/private key pair.  In your ~/.ssh directory you will use ssh-keygen to generate a key pair, and you will have to distribute the public half of the key pair to remote machines.  You keep the private half of the key pair private (as if that isn&#8217;t obvious).<br />
</p>
<h3>Passphrases</h3>
<p>A passphrase is a password that unlocks the generated key. It may be blank, but we suggest only making it blank if you have other security measures (one of those is detailed below).  A stolen but passphrase-protected private key will not immediately compromise your accounts, because the thief still needs the passphrase to use it.<br />
</p>
<h3>Generating a key</h3>
<hr /><code>$&gt; ssh-keygen -t rsa<br />
Enter file in which to save the key (/home/user/.ssh/id_rsa):<br />
Enter passphrase (empty for no passphrase): (enter passphrase)<br />
Enter same passphrase again: (enter passphrase)<br />
Your identification has been saved in /home/user/.ssh/id_rsa.<br />
Your public key has been saved in /home/user/.ssh/id_rsa.pub.<br />
The key fingerprint is:<br />
2c:3f:a4:be:46:23:47:19:f7:dc:74:9b:69:24:4a:44 user@mydesktop </code></p>
<hr />
Add -b to change the number of bits and -C to add a descriptive comment.<br />
</p>
<h3>Distributing the key</h3>
<hr /><code>$&gt; cat .ssh/id_rsa.pub | ssh user@server "mkdir -p ~/.ssh/ &amp;&amp; cat - &gt;&gt; ~/.ssh/authorized_keys"</code></p>
<hr />
The above command appends your public key to the list of authorized keys on a given host.  Now when you login, instead of asking you for a password, ssh will ask you for the key&#8217;s passphrase (or it will just let you in, if your passphrase is blank). </p>
<p>Note: Make sure ~/.ssh/authorized_keys, ~/.ssh/ and ~/ do not have go+w perms or sshd will ignore your public keys.  &#8220;<code>StrictModes no</code>&#8221; in sshd_config will disable this check, but that is not a good idea.   Think about it: if ~/.ssh/authorized_keys is writable by another user, they can throw in a newly generated public key and log in as you.  If ~/.ssh is writable, they can rename or delete ~/.ssh/authorized_keys and make a new one with their key.  If ~/ is writable, they can rename the whole ~/.ssh directory and make a new one to use (and so forth).<br />
</p>
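<p>If key logins are mysteriously ignored, the permissions above are the first thing to check. A small sketch of that check, assuming a hypothetical helper called <code>ssh_perms_ok</code>; it only looks at the group/other write bits, which is a simplification of what sshd really verifies (ownership matters too):</p>

```shell
# ssh_perms_ok HOME_DIR: succeed only if HOME_DIR, HOME_DIR/.ssh and
# HOME_DIR/.ssh/authorized_keys are not writable by group or other.
# Hypothetical helper; missing paths are silently skipped.
ssh_perms_ok() {
    for p in "$1" "$1/.ssh" "$1/.ssh/authorized_keys"; do
        mode=$(ls -ld "$p" 2>/dev/null | cut -c1-10)   # e.g. drwxr-xr-x
        case "$mode" in
            ?????w*|????????w*)   # char 6 = group write, char 9 = other write
                echo "too permissive: $p ($mode)" >&2
                return 1 ;;
        esac
    done
}

# Typical fix: chmod go-w ~ ~/.ssh ~/.ssh/authorized_keys
```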
<h3>Finer-grained Key Control</h3>
<p>You can limit the power of keys to run certain commands, to only connect from certain hosts, and disallow things like port forwarding, X forwarding, etc.<br />
<br />
Concrete example:<br />
I use a restricted ssh key to run fetchmail.  My mailhome is baobab.cc.gatech.edu, and I&#8217;ve set up a special entry in my ~/.ssh/config for email checking:</p>
<hr /><code>Host email<br />
Hostname baobab.cc.gatech.edu<br />
User davidhi<br />
IdentityFile ~/.ssh/mail_check<br />
LogLevel QUIET </code></p>
<hr />
<br />
The mail_check key is a special password-less key. In my ~/.ssh/authorized_keys file, I have the following prepended before the public key:</p>
<hr /><code>command="/usr/sbin/imapd",no-X11-forwarding,no-port-forwarding,no-agent-forwarding,no-pty ssh-dss A...</code></p>
<hr />
This restricts connections made via the key to only running imapd and disallows X11 forwarding, port forwarding, agent forwarding and pty access. This means that the password-less key is locked down (assuming imapd is not compromised) [0].</p>
<p>List of things:</p>
<ul>
<li><code>no-agent-forwarding</code></li>
<li><code>no-port-forwarding</code></li>
<li><code>no-pty</code></li>
<li><code>no-X11-forwarding</code></li>
<li><code>command="command"</code></li>
<li><code>environment="NAME=value"</code></li>
<li><code>from="pattern-list"</code></li>
<li><code>permitopen="host:port"</code></li>
<li><code>tunnel="n"</code></li>
</ul>
<p>
Notes:</p>
<ul>
<li>&#8216;command&#8217; sets a command to run. Adding no-pty makes the command 8-bit clean for two way data transfer. In command, one can use $SSH_ORIGINAL_COMMAND to refer to the command the user wants to execute. Using the shell expression &#8220;${SSH_ORIGINAL_COMMAND:-}&#8221; handles the case in which no command is specified gracefully [1].</li>
<li>&#8216;environment&#8217; modifies the environment variables.</li>
<li>&#8216;from&#8217; restricts based on the remote host (see ssh patterns).</li>
<li>&#8216;permitopen&#8217; allows -L forwarding for specified host/port combinations only.</li>
<li>&#8216;tunnel&#8217; forces a specific tun device number on the server for VPN tunnels (see ssh -w).</li>
</ul>
<p>
[0] Not sure of the original source of this since I&#8217;ve been using it since at least 2003. This might be it, though: http://mah.everybody.org/docs/mail/fetchmail_check<br />
[1] Hint from: http://www.oreilly.com/catalog/sshtdg/chapter/ch11.html<br />
<code><br />
</code></p>
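<p>To make the <code>$SSH_ORIGINAL_COMMAND</code> note concrete, here is a sketch of a forced-command wrapper that only lets a key run a short whitelist. The function name <code>gate</code>, the script path in the comment and the allowed commands are all assumptions for illustration, not something from the original talk:</p>

```shell
# Hypothetical forced-command wrapper. Installed as a script (say
# /home/user/bin/ssh-gate, ending with a call to gate), it would be referenced
# in ~/.ssh/authorized_keys as: command="/home/user/bin/ssh-gate",no-pty ssh-rsa A...
gate() {
    # ${SSH_ORIGINAL_COMMAND:-} expands to "" when the client sent no command.
    case "${SSH_ORIGINAL_COMMAND:-}" in
        date)   date ;;
        uptime) uptime ;;
        "")     echo "interactive logins are disabled for this key" >&2
                return 1 ;;
        *)      echo "command not allowed: $SSH_ORIGINAL_COMMAND" >&2
                return 1 ;;
    esac
}
```

<p>Matching the whole string against fixed alternatives, rather than passing it to a shell, is the point of the design: nothing the client sends is ever evaluated.</p>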
<h2>SSH Agent &amp; Single Sign-on</h2>
<p>Wouldn&#8217;t it be convenient to enter a key&#8217;s passphrase only once, and have it unlock your key for the entire terminal session?  <code>ssh-agent</code> is a background service that keeps track of your unlocked keys. It can manage multiple keys, and you can add or remove keys or identities dynamically.  When you use ssh to login to a server with key authentication, ssh will ask your agent if it already has the key unlocked.  That way, you don&#8217;t have to type your password every time.<br />
</p>
<h3>Starting the agent</h3>
<p>Sometimes your distribution&#8217;s X server will start up the agent automatically. If it doesn&#8217;t, you can either start it manually each time, or add it to an environment appropriate startup script. For Gnome, you might put this in <code>/etc/X11/gdm/PreSession/Default</code>.</p>
<hr /><code>$&gt; eval `ssh-agent -s`</code></p>
<hr />The above command will start up the agent, and set some environment variables which ssh will later use to connect to this agent.<br />
<br />
Note: the above command is for Bourne-style shells.  Use ssh-agent -c for csh-like variants.<br />
</p>
<h3>Adding an identity to the agent</h3>
<p>Before things will start working, you must add your key(s) to the agent. You can do this automatically when you login to Gnome by adding it to your Gnome session (Startup Programs).  There are several X11-based ssh-agent password prompting programs (ssh-askpass, ksshaskpass, etc.).</p>
<hr /><code>$&gt; ssh-add</code></p>
<hr />
By default, the above command will try to add ~/.ssh/id_rsa and ~/.ssh/id_dsa, which are the most popular key filenames. If you have a different filename, just provide the name of the key as a command-line argument to ssh-add.  If your passphrase is not blank, it will prompt you for the passphrase, but this will be the only time you have to enter it (until the agent stops or you manually remove the identity from the agent). Confirm that the agent has the key by running:</p>
<hr /><code>$&gt; ssh-add -l</code></p>
<hr />Agent connections can also be forwarded automatically via ssh to achieve single sign-on.<br />
</p>
<h3>Keychain</h3>
<p>Keychain is a front-end for ssh-agent allowing easy, system-wide sharing of ssh-agent (rather than per-login).  Add something like this to your profile / bash_profile / preferred login script:</p>
<hr /><code><font color="#808080">### START-Keychain ###<br />
# Let re-use ssh-agent and/or gpg-agent between logins </font><br />
/usr/bin/keychain <font color="#007800">$HOME</font>/.ssh/id_dsa<br />
<font color="#7a0874">source</font> <font color="#007800">$HOME</font>/.keychain/<font color="#007800">$HOSTNAME</font>-sh<br />
<font color="#808080">### End-Keychain ###</font></code></p>
<hr />
If set up properly, you&#8217;ll only need to enter your key password once per boot.<br />
</p>
<h3>PAM SSH</h3>
<p>Single sign-on via PAM.  Add <a href="http://pam-ssh.sourceforge.net/">pam-ssh</a> to XDM&#8217;s pam hooks (or GDM or whatever) and your X login will also unlock your private key via ssh-agent.<br />
<code><br />
</code></p>
<h2>Port Forwarding</h2>
<p>Tunnel traffic through your ssh connection to access non-public / local networks.<br />
</p>
<h3>Local Forwarding: -L</h3>
<p>Say you want to access services local to a private network (not publicly exposed), but from outside you can only ssh to a host on that network. For example, you want to access an internal website running on <i>webhost</i>. You can&#8217;t access it from a web browser at home normally, but ssh local port forwarding will let you tunnel the request through the ssh connection to it:</p>
<hr /><code>$&gt; ssh -L 80:webhost:80 sshhost</code></p>
<hr />This will bind the local port 80 (the first 80) to a tunnel, with the other end pointing to port 80 (the 2nd 80) of the webhost. You may then access the website by pointing your browser to http://localhost (because localhost port 80 forwards to webhost port 80).  Port forwarding can be used to access any TCP service, not just websites (see below for an email example).  In fact, dynamic forwards (see below) are often better for websites.<br />
<br />
Notes:</p>
<ul>
<li>Binding to local ports &lt; 1024 requires root privileges, so as a regular user pick a local port &gt;= 1024.  For example: <code>ssh -L 8080:webhost:80 sshhost</code> will bind local port 8080 to tunnel to webhost port 80 through the ssh connection to sshhost.  Then you&#8217;d just point your browser at http://localhost:8080</li>
<li>With many websites, virtual hosts are used and the tunneled request won&#8217;t have the right virtual host name (i.e., if you point your browser at http://localhost, it&#8217;ll use localhost as the virtual hostname rather than webhost).  You can fix that by editing /etc/hosts and creating a temporary mapping, but dynamic forwards (see below) are often more convenient.</li>
</ul>
<p>
Here is an example for accessing pop3 via a tunnel.  The command below sets up the forwarding and then leaves the connection open in the background without a shell.</p>
<hr /><code>$&gt; ssh -L 9999:mailserver:110 shellserver -N -f</code></p>
<hr />After that, you can setup your mail app to use localhost:9999 instead.<br />
Also, as illustrated in the last example, you may use -N -f:</p>
<ul>
<li>-N tells the client not to execute anything remotely</li>
<li>-f tells ssh to background</li>
</ul>
<p></p>
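<p>The same local forwards can live in ~/.ssh/config so you don&#8217;t have to retype them. A sketch using the hosts from the examples above; the alias <code>intranet</code> is made up:</p>

```
# ~/.ssh/config (fragment); "intranet" is a hypothetical alias.
Host intranet
  Hostname sshhost
  # LocalForward [bind_address:]port host:hostport
  LocalForward 8080 webhost:80
  LocalForward 9999 mailserver:110
```

<p>With that in place, <code>ssh -N -f intranet</code> should bring up both tunnels at once.</p>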
<h3>Remote Forwarding: -R</h3>
<p>Local forwarding lets you tunnel requests from a local port through the remote host you&#8217;ve ssh&#8217;d in to.  What if you want to go in the other direction &#8212; bind a port on the host you&#8217;re ssh&#8217;d in to and send it back to the host you ssh&#8217;d from?  Why would you want to do this?  Say you have a desktop at work, and there&#8217;s no way for you to ssh directly to your work desktop because it doesn&#8217;t have a public IP (or incoming ports are firewalled off).  ssh from your work desktop back to your home network and use a remote forward.  Here is an example ~/.ssh/config:</p>
<hr /><code>Host home-with-tunnel<br />
Hostname homemachine.com<br />
RemoteForward 2222 localhost:22<br />
User joe</code></p>
<hr />
Now, when you ssh to &#8216;home-with-tunnel&#8217;, it will set up a remote forward from port 2222 on your home machine to your work desktop&#8217;s port 22.  At home, ssh to localhost over port 2222 (using the -p option) and the request will be forwarded through the existing ssh connection, granting you access to an otherwise unreachable host.  Note that this example also illustrates how to set up forwards using your ~/.ssh/config file.  Sometimes setting up a remote forward can get confusing, but the localhost in the RemoteForward entry above refers to the machine you are ssh-ing <i>from</i>.<br />
<br />
Note: the above scenario establishing potentially unauthorized outside access to a work machine may be against the Network Usage Policies of your place of employment.  Use with caution.<br />
</p>
<h3>Dynamic Forwarding (SOCKS Proxy): -D</h3>
<p>Use -D to create a SOCKS proxy.</p>
<hr /><code>$&gt; ssh -D 8080 helsinki.cc.gatech.edu</code></p>
<hr />
Now set your web browser to use localhost:8080 as a SOCKS 5 proxy. For Firefox, set network.proxy.socks_remote_dns = true (in about:config) and DNS resolving will occur via the proxy, too.  That means you can resolve internal network DNS hostnames.  SOCKS proxies are very flexible.  Instead of forwarding a specific remote TCP port to a local TCP port, -D runs a flexible proxy server which can be used to tunnel arbitrary requests with the right support.  Also check out the very useful <a href="http://tsocks.sourceforge.net/"><code>tsocks</code></a> wrapper tool.<br />
</p>
<h3>Bind Addresses</h3>
<p>All flavors of forwarding &#8212; local, remote and dynamic &#8212; can optionally take &#8220;bind addresses&#8221; so you can expose forwarded ports externally.  Normally forwarded ports are only listening on loopback interfaces, but there may be a legitimate reason to publicly expose a port forward.  Example:</p>
<hr />
<pre>$&gt; ssh -L 2222:localhost:22 localhost  # establish a port 2222 forward to my own port 22
$&gt; ssh -p 2222 `resolveip -s localhost`
The authenticity of host '[127.0.0.1]:2222 ([127.0.0.1]:2222)' can't be established.
...
$&gt; ssh -p 2222 `resolveip -s $HOSTNAME`
ssh: connect to host 143.215.128.82 port 2222: Connection refused</pre>
<hr />
<pre>$&gt; ssh -L '*:2222:localhost:22' localhost
$&gt; ssh -p 2222 `resolveip -s $HOSTNAME`
The authenticity of host '[143.215.128.82]:2222 ([143.215.128.82]:2222)' can't be established.</pre>
<hr />
The * bind address tells ssh to listen on all interfaces.  Remember to protect the * from shell expansion with quotes!  You could also put a specific hostname or IP there.<br />
<code><br />
</code></p>
<h2>Connection Sharing (ControlMaster)</h2>
<p>Use ControlMaster auto to speed up remote filename tab completion, sshfs, or other ssh operations to the same host.  New connections to a host will use an already established session (if present), which makes operations a lot faster.<br />
<br />
Add to ~/.ssh/config:</p>
<hr /><code>Host *<br />
ControlMaster auto<br />
ControlPath ~/.ssh/.sock_%r@%h:%p </code></p>
<hr />
<br />
Then open a master connection in the background and check that the control socket exists:</p>
<hr /><code>$&gt; ssh -Nf helsinki.cc.gatech.edu<br />
$&gt; ls ~/.ssh -la | grep sock<br />
.sock_davidhi@helsinki.cc.gatech.edu:22</code></p>
<hr />
<code><br />
</code></p>
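<p>For reference, the socket name is just the <code>ControlPath</code> template with its tokens filled in (%r remote user, %h host, %p port). A tiny sketch, assuming a hypothetical helper named <code>control_socket</code>, which can be handy when hunting for a stale socket:</p>

```shell
# control_socket USER HOST PORT: print the path that
# "ControlPath ~/.ssh/.sock_%r@%h:%p" expands to.
# Hypothetical helper for illustration.
control_socket() {
    echo "$HOME/.ssh/.sock_$1@$2:$3"
}

# Usage (not run here; needs a live master connection):
#   ssh -O check helsinki.cc.gatech.edu   # is a master running?
#   ssh -O exit  helsinki.cc.gatech.edu   # ask the master to shut down
```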
<h2>Parallel SSH Tools</h2>
<p>Good for managing clusters of identical machines (or just all of your own internal systems).</p>
<ul>
<li><a href="http://freshmeat.net/projects/pssh/" title="pssh (parallel-ssh)">pssh (parallel-ssh)</a></li>
<li><a href="http://freshmeat.net/projects/clusterssh/" title="ClusterSSH">Cluster SSH</a></li>
<li><a href="http://sourceforge.net/projects/mssh/" title="mssh">mssh</a></li>
</ul>
<p></p>
<hr />
<pre>
$&gt; parallel-ssh -l root -h hosts.txt -i -- uptime
<font color="#3D85C6">[1]</font> 18:45:25 <font color="#38761D">[SUCCESS]</font> capricorn.cc.gt.atl.ga.us 22
 18:45:25 up 1 day, 4:20, 0 users, load average: 0.00, 0.00, 0.00
<font color="#3D85C6">[2]</font> 18:45:25 <font color="#38761D">[SUCCESS]</font> leo.cc.gt.atl.ga.us 22
 18:45:25 up 1 day, 4:17, 0 users, load average: 0.00, 0.00, 0.00
<font color="#3D85C6">[3]</font> 18:45:25 <font color="#38761D">[SUCCESS]</font> scorpio.cc.gt.atl.ga.us 22
 18:45:25 up 1 day, 4:13, 0 users, load average: 0.00, 0.00, 0.00
<font color="#3D85C6">[4]</font> 18:45:25 <font color="#38761D">[SUCCESS]</font> libra.cc.gt.atl.ga.us 22
 18:45:25 up 1 day, 4:12, 0 users, load average: 0.00, 0.00, 0.00
</pre>
<hr />
<code><br />
</code></p>
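<p>If none of those tools are installed, plain <code>xargs</code> gets you surprisingly far. A rough sketch, not a replacement for pssh: <code>fanout</code> and the <code>SSH_BIN</code> override are hypothetical names, and <code>-P</code> is a GNU/BSD extension rather than POSIX. Output from different hosts arrives in whatever order they finish:</p>

```shell
# fanout HOSTS_FILE COMMAND...: run COMMAND on every host listed in HOSTS_FILE
# (one per line), up to eight at a time.
# SSH_BIN is a hypothetical override hook (e.g. SSH_BIN=echo for a dry run).
fanout() {
    hosts_file=$1
    shift
    # -P 8 : at most eight concurrent processes
    # -I{} : substitute each input line (a hostname) for {}
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    xargs -P 8 -I{} "${SSH_BIN:-ssh}" -o BatchMode=yes {} "$@" < "$hosts_file"
}

# Usage (not run here): fanout hosts.txt uptime
```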
<h2>Restricted SSH Shells</h2>
<p>Sometimes you want users to be able to ssh to a machine but not perform arbitrary shell functions.  A system administrator could rely on restricted keys to do certain things, but restricted shells offer more options.<br />
</p>
<h3>ChrootDirectory option</h3>
<p>Relatively recent addition to OpenSSH: </p>
<hr /><code>This commit adds a chroot(2) facility to sshd, controlled by a new sshd_config(5) option "ChrootDirectory". This can be used to "jail" users into a limited view of the filesystem, such as their home directory, rather than letting them see the full filesystem.</code><br />
<hr />
</p>
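<p>A common way to use it is an sftp-only jail. A minimal sketch, assuming a hypothetical group called <code>sftponly</code>; note that sshd insists the chroot directory and every path component above it be owned by root and not writable by group or other:</p>

```
# /etc/ssh/sshd_config (fragment); the group name is an assumption.
Match Group sftponly
    ChrootDirectory %h
    ForceCommand internal-sftp
    AllowTcpForwarding no
    X11Forwarding no
```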
<h3>rssh</h3>
<p>Restricted shell for sftp, scp, rsync and cvs.  Doesn&#8217;t support unison or svn.<br />
</p>
<h3>scponly</h3>
<p>Restricted shell for sftp, scp, rsync, unison and svn.  Doesn&#8217;t support cvs.<br />
<code><br />
</code></p>
<h2>SSH Escape Sequences</h2>
<p>Escape sequences are only recognized after a newline and are initiated with a tilde (~) unless you modify it with the -e flag.  Hit  ~? on a running ssh session to see a list of escapes:</p>
<hr />
<pre>Supported escape sequences:
~. - terminate connection
~B - send a BREAK to the remote system
~C - open a command line
~R - Request rekey (SSH protocol 2 only)
~^Z - suspend ssh
~# - list forwarded connections
~&amp; - background ssh (when waiting for connections to terminate)
~? - this message
~~ - send the escape character by typing it twice
(Note that escapes are only recognized immediately after newline.) </pre>
<hr />~. and ~# are particularly useful.<br />
<code><br />
</code></p>
<h2>VPN Tunneling</h2>
<p>Did you know that ssh can do layer 2 and 3 VPN tunneling? Check out ssh -w. Example from manpage:</p>
<hr /><code>$&gt; ssh -f -w 0:1 192.168.1.15 true<br />
$&gt; ifconfig tun0 10.0.50.1 10.0.99.1 netmask 255.255.255.252</code><br />
<hr />
<code><br />
</code></p>
<h2>FUSE SSHfs</h2>
<p>Expose a remote host&#8217;s files like an NFS mount but via ssh:</p>
<hr /><code>$&gt; sshfs davidhi@killerbee2.cc.gatech.edu:/net/hu17/davidhi ~/cc_home -oCiphers=arcfour</code></p>
<hr />
<br />
That&#8217;s all folks.  Thanks!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/572/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CentOS / RHEL repo madness</title>
		<link>http://www.thegibson.org/blog/archives/531</link>
		<comments>http://www.thegibson.org/blog/archives/531#comments</comments>
		<pubDate>Wed, 30 Sep 2009 20:21:37 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
				<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=531</guid>
		<description><![CDATA[Our school&#8217;s local LUG had an InstallFest this past weekend. I encountered and solved a problem with RHEL/CentOS and extra repositories with conflicting dependencies for media players like vlc and mplayer which seems to be rather common, so I decided to document it here. I&#8217;m a Debian user, so I&#8217;m used to having a very [...]]]></description>
			<content:encoded><![CDATA[<p>Our school&#8217;s <a href="http://www.lugatgt.org">local LUG</a> had an InstallFest this past weekend.  I encountered and solved a problem with RHEL/CentOS and extra repositories with conflicting dependencies for media players like vlc and mplayer which seems to be rather common, so I decided to document it here.  </p>
<p>I&#8217;m a Debian user, so I&#8217;m used to having a very large set of packages in the base repositories.  On a typical Debian desktop system of mine, I may have the base repos with main, contrib and non-free plus debian-multimedia.org (and backports.org and volatile.debian.org on servers running stable).  On RHEL, the set of base packages is much smaller; CentOS (with its CentOSPlus repo) has a little bit more but basically the same issues.  So people tend to add a bunch of other repositories like the <a href="http://fedoraproject.org/wiki/EPEL">EPEL</a>, <a href="http://rpmforge.net">rpmforge</a>, <a href="http://rpmfusion.org/">RPM Fusion</a>, or individual repos included in those umbrellas like Dries, DAG, Livna, etc.  With the proliferation of repositories comes duplicated effort and problems with mutual compatibility.</p>
<p>For example, consider loading a RHEL 5.4 system from scratch and adding the EPEL and rpmforge repos:</p>
<ul>
<li>Install EPEL:<code>   rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-3.noarch.rpm</code></li>
<li>Install rpmforge: <code>   rpm -Uvh http://packages.sw.be/rpmforge-release/rpmforge-release-0.3.6-1.el5.rf.`uname -m`.rpm        </code>  [0]</li>
</ul>
<p>On CentOS I would now add yum-priorities (<code>yum install yum-priorities</code>) and set relative priorities for the different repos.  But either way, let&#8217;s say you now try to install vlc with <code>yum install vlc</code> and it explodes with the following:</p>
<pre>
vlc-0.9.9a-3.el5.rf.x86_64 from rpmforge has depsolving problems
  --&gt; Missing Dependency: libcucul.so.0()(64bit) is needed by package vlc-0.9.9a-3.el5.rf.x86_64 (rpmforge)
vlc-0.9.9a-3.el5.rf.x86_64 from rpmforge has depsolving problems
  --&gt; Missing Dependency: libdvdread.so.3()(64bit) is needed by package vlc-0.9.9a-3.el5.rf.x86_64 (rpmforge)
Error: Missing Dependency: libdvdread.so.3()(64bit) is needed by package vlc-0.9.9a-3.el5.rf.x86_64 (rpmforge)
Error: Missing Dependency: libcucul.so.0()(64bit) is needed by package vlc-0.9.9a-3.el5.rf.x86_64 (rpmforge)
</pre>
<p></p>
<p>Why is this happening?  Well, for the <code>libcucul.so.0</code> dependency, both the EPEL and rpmforge ship a libcaca package, but the two builds are not interchangeable (there&#8217;s a similar problem with libdvdread).  Usually I trust the EPEL more (and give it a better priority) since it is more &#8220;official&#8221; and Red Hat-endorsed (and has high quality packaging standards), but in this case we need to satisfy all of the dependencies from rpmforge.  So, I use the following command:<br />
<code>yum --disablerepo='epel' install vlc</code></p>
<p>For the system at the InstallFest, it was even more complex because it had the EPEL, rpmforge and RPM Fusion and some packages from each.  I had to disable all but rpmforge and install libcaca and caca-utils and then disable RPM Fusion to properly install vlc (and first I had to remove the libcaca package that was pulled in from the EPEL).  yum was developed in part to resolve &#8220;rpm dependency hell&#8221;, but these warring packaging factions are causing the same problems to be exposed again to the user through yum, which is frustrating.</p>
<p>[0] Right now packages.sw.be seems to be broken in some places so you can grab from say http://mirror.cpsc.ucalgary.ca/mirror/dag/redhat/el5/en/`uname -m`/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.`uname -m`.rpm</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/531/feed</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Supermicro X8DT3&#8242;s onboard LSI controller in Linux</title>
		<link>http://www.thegibson.org/blog/archives/500</link>
		<comments>http://www.thegibson.org/blog/archives/500#comments</comments>
		<pubDate>Mon, 28 Sep 2009 19:39:56 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
				<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=500</guid>
		<description><![CDATA[So we got a new storage box in our lab with a Supermicro X8DT3 motherboard and a bunch of 15k SAS drives. The motherboard comes with an integrated LSI 1068E SAS controller and lspci identifies it as an &#8220;LSI Logic / Symbios Logic MegaRAID SAS 8208ELP/8208ELP.&#8221; The primary user of the system had a bit [...]]]></description>
			<content:encoded><![CDATA[<p>So we got a new storage box in our lab with a Supermicro X8DT3 motherboard and a bunch of 15k SAS drives.  The motherboard comes with an integrated LSI 1068E SAS controller and <code>lspci</code> identifies it as an &#8220;LSI Logic / Symbios Logic MegaRAID SAS 8208ELP/8208ELP.&#8221;  The primary user of the system had a bit of trouble getting the damn thing to work in Linux, though, so I investigated the issue.</p>
<p>So apparently there are a few different drivers that <i>could</i> support the card, and they all have their quirks.  Some people have had success with the open-source and in-kernel <code>megaraid</code> driver, and LSI provides some <a href="http://www.lsi.com/storage_home/products_home/standard_product_ics/sas_ics/lsisas1068e/">1068E</a> Linux drivers on their site.  Their drivers are the semi-closed <code>mptsas</code> drivers, but they do provide the source and dkms support, so the drivers can be recompiled against different kernels (provided the kernels haven&#8217;t changed to the point where stuff breaks).  </p>
<p>Our problem was that the <code>mptsas</code> driver was compiling and loading fine but we weren&#8217;t seeing any disks.  It turns out the X8DT3&#8217;s integrated controller has two modes controlled by a hardware jumper: SR (Software RAID) mode and IT (Initiator-Target) mode.  SR mode is the default, and that is where the card exports logical arrays to the host OS, but the <code>mptsas</code> driver doesn&#8217;t seem to work in that mode.  In IT mode, the card just exports individual drives, and the <code>mptsas</code> driver works fine.  However, the primary user didn&#8217;t want individual drives; he wanted the logically configured arrays (yes, I&#8217;m aware that Linux md/software RAID could have probably done just as good a job as the LSI controller&#8217;s SR mode, but I wanted to see if I could get it to work in SR mode).</p>
<p>Anyway, a <a href="http://blogaristoo.lqx.net/index.php/2009/01/14/sassy-lip-from-the-lsi-1068e">blog post</a> that informed me about the IT/SR mode distinctions had a pointer to Supermicro&#8217;s FTP site, and there I found some promising &#8220;SR&#8221; drivers.  I had seen references to a <code>megasr</code> closed source driver in some forum posts, but I couldn&#8217;t find it from LSI (at least not a RHEL5u3 version).  Well, the <code>megasr</code> is apparently what is on Supermicro&#8217;s FTP site.  I navigated to <a href="ftp://ftp.supermicro.com/driver/SAS/LSI/1064_1068/SR/Driver/Linux/">ftp://ftp.supermicro.com/driver/SAS/LSI/1064_1068/SR/Driver/Linux/</a> and grabbed a disk image for the version of RHEL 5 we&#8217;re using (luckily we hadn&#8217;t updated to the newly released 5.4, because there doesn&#8217;t seem to be an image for it).  I grabbed the .img file and saw that it was a floppy image using the <code>file</code> command.  I mounted it loopback and extracted the compiled <code>megasr.ko</code> using the following process:</p>
<ul>
<li> <code>mount -o loop megasr-13.10.0708.2009-1-rhel50-u3-all.img /mnt/</code></li>
<li> <code>zcat /mnt/modules.cgz | cpio -idv</code></li>
</ul>
<p>The last command will extract the modules to the current directory (making subdirectories for various kernel versions and architectures).  The .img file is compiled for use with the RHEL installer, hence the floppy image format and the modules packed as a gzipped cpio archive.  We had a working install on a standalone SATA disk and the SAS drives off of the controller were just going to be used for data storage, so we just needed to add the kernel module after the system was already installed.</p>
<p>So I removed the LSI-provided modules (with <code>rpm -e</code> to remove the <code>mptlinux</code> dkms package), copied the appropriate <code>megasr.ko</code> to /lib/modules/`uname -r`/extra, ran <code>depmod</code>, rebooted and everything was finally working in SR mode.  Luckily we were using RHEL &#8212; it looks like people who are using distros other than RHEL or SLES are out of luck, because the vendor doesn&#8217;t seem to provide source to recompile the module against vanilla (or other) Linux kernels.</p>
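<p>Picking the right <code>megasr.ko</code> out of the extracted tree can be scripted.  This is only a sketch: it assumes the kernel release string appears as a directory name somewhere in the extracted tree, and the function name is made up for illustration.</p>
<pre>
# Print the first megasr.ko under TREE whose path contains a directory
# named after the given kernel release.
# Usage: find_megasr TREE KERNEL_RELEASE
find_megasr() {
    find "$1" -type f -name megasr.ko -path "*/$2/*" | head -n 1
}

# Example, run as root from the directory where the cpio archive was unpacked:
#   module=$(find_megasr . "$(uname -r)")
#   cp "$module" "/lib/modules/$(uname -r)/extra/"
#   depmod -a
</pre>
<p>If the function prints nothing, there is no prebuilt module for the running kernel in that image, which is the situation described above for kernels the vendor hasn&#8217;t built for.</p>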
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/500/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mount LVM-based volumes from loopback full disk images</title>
		<link>http://www.thegibson.org/blog/archives/467</link>
		<comments>http://www.thegibson.org/blog/archives/467#comments</comments>
		<pubDate>Sat, 01 Aug 2009 18:42:00 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
				<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=467</guid>
		<description><![CDATA[Recently I needed to extract some files from the root partition in a full disk backup image taken with dd. I didn&#8217;t notice when I took the disk image, but the disk only contained two primary partitions: /boot and an LVM physical volume containing the rest of the partitions as LVM logical volumes. I don&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I needed to extract some files from the root partition in a full disk backup image taken with dd.  I didn&#8217;t notice when I took the disk image, but the disk only contained two primary partitions: /boot and an LVM physical volume containing the rest of the partitions as LVM logical volumes.  I don&#8217;t work with LVM much manually, so I had to look up the commands to get it to find physical volumes and activate volume groups.  Here&#8217;s the full process of mounting LVM logical volumes from a full disk image:</p>
<ul>
<li>There are two ways to get to the LVM partition on this disk, and I&#8217;ll cover both: 1) the manual offset finding way and 2) the easy way.  </li>
<li>First, the easy way: make sure the loopback module is inserted with the <code>max_part</code> parameter, which causes the automatic creation of loopback subdevices for individual partitions.  An easy way to make sure is to remove it and re-insert with the right parameter: <code>modprobe -r loop &amp;&amp; modprobe loop max_part=63</code></li>
<li>Next, attach the whole disk image to a loop device: <code>losetup /dev/loop0 sda.img</code>.  Now you should see <code>/dev/loop0p1</code>, <code>/dev/loop0p2</code>, etc. for all of the individual partitions.  That&#8217;s all there is to the easy way; you can go straight to the LVM steps below. </li>
<li>If you can&#8217;t do the easy method, mount the whole disk image loopback to look at the partition offsets: <code>losetup /dev/loop0 sda.img</code></li>
<li>Now /dev/loop0 looks just like the original block device, so check out the partition table sector offsets with fdisk: <code>fdisk -u -l /dev/loop0</code>.  I use sector offsets (-u flag) rather than cylinders because they are easier to work with and some partitions may not fall on cylinder boundaries.  My image shows something like this:</li>
<pre>
Disk /dev/loop0: 250 GB, 250056737280 bytes
255 heads, 63 sectors/track, 30401 cylinders, total 488392065 sectors
Units = sectors of 1 * 512 = 512 bytes

     Device Boot      Start         End      Blocks   Id  System
/dev/loop0p1   *          63      401624      200781   83  Linux
/dev/loop0p2          401625   488392064   243987187   8e  Linux LVM
</pre>
<li>Now, detach the whole disk image from /dev/loop0 with <code>losetup -d /dev/loop0</code> and re-attach /dev/loop0 to just the LVM partition by giving the partition&#8217;s byte offset.  Here, the second partition starts at sector 401625, and each sector is 512 bytes, so the offset is 401625 * 512 = 205632000 bytes.  Run <code>losetup -o 205632000 /dev/loop0 sda.img</code></li>
</ul>
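<p>The offset arithmetic from the manual method can be done in the shell rather than by hand.  This is a minimal sketch that assumes the start sector and sector size from the <code>fdisk</code> output above; the variable names are only for illustration.</p>
<pre>
# Values copied from the fdisk -u -l output for the LVM partition.
start_sector=401625
sector_size=512

# losetup -o expects a byte offset, so multiply sectors by bytes per sector.
offset=$((start_sector * sector_size))
echo "$offset"    # prints 205632000

# As root, the result could then be passed to losetup:
#   losetup -o "$offset" /dev/loop0 sda.img
</pre>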
<p>Now whichever method you used, you have a loopback device with the LVM physical volume partition: <code>/dev/loop0p?</code> (2 in my case) if you used the easy way, or <code>/dev/loop0</code> if you used the manual offset method.  Get LVM to recognize the physical volume and activate the volume groups:</p>
<ul>
<li>Tell LVM to scan for new physical volumes: <code>lvm pvscan</code></li>
<li>Activate the volume groups: <code>lvm vgchange -ay</code> (it will print something like <code>2 logical volume(s) in volume group "VolGroup00" now active</code>).</li>
<li>Now you can finally mount the LVM logical volumes.  Run <code>lvm lvs</code> to list the logical volumes.  Each should appear in /dev/mapper, typically with the device name (volume group name)-(logical volume name), like VolGroup00-LogVol00.</li>
<li>When you are done, unmount all logical volumes and deactivate the volume groups with <code>lvm vgchange -an</code>.  Now you can reclaim the loopback device by using <code>losetup -d /dev/loop0</code>.</li>
</ul>
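<p>One wrinkle when building the /dev/mapper path by hand: device-mapper doubles any hyphen that appears inside a volume group or logical volume name, so that the single hyphen stays unambiguous as the separator.  A rough sketch follows (the function name and the <code>/mnt</code> mount point are made up for illustration):</p>
<pre>
# Build the /dev/mapper path for a volume group and logical volume.
# Hyphens inside either name are doubled by device-mapper.
mapper_path() {
    vg=$(printf '%s' "$1" | sed 's/-/--/g')
    lv=$(printf '%s' "$2" | sed 's/-/--/g')
    printf '/dev/mapper/%s-%s\n' "$vg" "$lv"
}

mapper_path VolGroup00 LogVol00    # prints /dev/mapper/VolGroup00-LogVol00
mapper_path my-vg root             # prints /dev/mapper/my--vg-root

# As root, mounting read-only is a safer default for a backup image:
#   mount -o ro "$(mapper_path VolGroup00 LogVol00)" /mnt
</pre>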
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/467/feed</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>&#8220;C++ is a superpower&#8221;</title>
		<link>http://www.thegibson.org/blog/archives/303</link>
		<comments>http://www.thegibson.org/blog/archives/303#comments</comments>
		<pubDate>Tue, 28 Jul 2009 02:37:49 +0000</pubDate>
		<dc:creator>davidhi</dc:creator>
				<category><![CDATA[Research Content]]></category>

		<guid isPermaLink="false">http://www.thegibson.org/blog/?p=303</guid>
		<description><![CDATA[A few months ago, Bjarne Stroustrup came to Georgia Tech to give a talk sponsored by the Georgia Tech chapter of the ACM. Bjarne&#8217;s talk was about two hours, and it was mostly a broad overview of the history of C++: its significance, its evolution, Bjarne&#8217;s broad language design philosophy and the future directions he [...]]]></description>
			<content:encoded><![CDATA[<p>A few months ago, Bjarne Stroustrup came to Georgia Tech to give a talk sponsored by the Georgia Tech chapter of the ACM.  Bjarne&#8217;s talk was about two hours, and it was mostly a broad overview of the history of C++: its significance, its evolution, Bjarne&#8217;s broad language design philosophy and the future directions he envisions for C++.  Before the presentation started, I talked to Bjarne one-on-one for a bit about the aspect of the upcoming C++0x standard that interests me most: the concurrency-related features and the language memory model.  I also asked him about his thoughts on garbage collection and shared-state concurrency versus message-passing.  On the latter point, like me, he believes that the future of concurrent programming (at least for the foreseeable future) involves the use of a large arsenal of techniques for different situations rather than one dominant paradigm.  It&#8217;s a pragmatic view and ostensibly supported by the fact that this is not the first &#8220;boom cycle&#8221; for parallel programming.  Arguably shared-state concurrency doesn&#8217;t scale very easily to large systems, but shared-state concurrency within finer-grained components is manageable today.  Using actor-style message-passing between coarser-grained components, with shared-state concurrency in limited places (perhaps for high-performance primitives or highly coupled functionality), seems like an analogue to the way we use inline assembly or intrinsics for limited sections of performance-critical library routines (like memcpy or BLAS primitives) and use more manageable high-level languages everywhere else.  </p>
<p><strong>Garbage Collection</strong><br />
One topic where I didn&#8217;t fully agree with Bjarne was the issue of garbage collection.  Personally, I think the addition of hooks to support garbage collection to C++0x is an essential decision for enabling effective concurrent programming.  Storage management in single-threaded code is drudge work, but when concurrency is involved, it suddenly becomes a much thornier problem.  People find themselves building their own slapdash garbage collection-like facilities to manage the lifetime of data objects that may be used in many different threads.  These user-built facilities are tricky and often induce extra contention.  Ultimately many concurrent algorithms, particularly lock-free ones, rely on the existence of a garbage collector (to prevent certain instances of ABA problems, for example*).  Sure, there are classes of techniques for lock-free reference counting (e.g. <a href="http://portal.acm.org/citation.cfm?id=384016">Detlefs et al.</a>), but it seems like the modern mainstream languages with GC like Java and C# are gaining powerful high-level concurrent programming utilities a lot quicker than C and C++, and I suspect the higher difficulty of storage management is one influencing factor (obviously, the lack of a memory model and portability concerns are other major issues).  Anyway, I mentioned to Bjarne that I thought the addition of garbage collection support was good because a lot of concurrent algorithms rely on it, and he said something to the effect of it being a crutch or somewhat lazy to rely on garbage collection just to avoid complicating concurrent algorithms with intricate reference counting.  I don&#8217;t want to misrepresent his opinion, but that&#8217;s the general gist I got from his comment.</p>
<p>As far as I&#8217;m concerned, garbage collection is one of those largely inevitable evolutions of higher-level programming that is begrudged by some but eventually taken as a given.  Some complained about overhead induced by operating systems when computers tended to run single applications; the same thing happened with the first compiled languages (versus hand-written assembly).  Later it was VMs for managed languages or GC.  Eventually, however, I think these will be taken for granted in mainstream programming.  Java has already gone a long way to making garbage collection more of a necessary and mainstream feature.  Now, certainly there are some gripes about Java throwing out the baby with the bath water in not providing something as useful as explicitly scoped destruction to guarantee cleanup of certain related non-memory resources (like open files or sockets), and providing too weak semantics for finalize, but this isn&#8217;t a fundamental flaw of GC.  Anyway, that&#8217;s tangential, but my view is that GC is nearly essential in the new concurrent era, and I&#8217;m glad that C++0x is taking a first step in the right direction on that front.  </p>
<p><strong>Memory Model and Atomics</strong><br />
The rigorous memory model is, of course, also essential for great concurrent programming, and well known experts like Hans Boehm (and many others) have put a lot of work into that.  I mentioned to Bjarne that I was also pleased that they are offering atomics and even support for relaxed consistency reads and writes useful in certain low-level concurrent algorithms.   I told him that I thought that the inclusion of selective relaxed consistency, while definitely addressing only a limited audience of very skilled programmers, is really vindicated in light of Java&#8217;s experience &#8212; after initially resisting exposing such features, the developers of Java have come to the same conclusion that you need to expose these operations to allow the efficient implementation of certain kinds of concurrent data structures and algorithms.  See Doug Lea&#8217;s work on the Java 7 <a href="http://gee.cs.oswego.edu/dl/jsr166/dist/docs/java/util/concurrent/atomic/Fences.html">Fences API</a> and his <a href="http://cs.oswego.edu/pipermail/concurrency-interest/2009-January/005743.html">announcement and rationale</a> to the concurrency-interest list.  He states:</p>
<blockquote><p>The basic JSR133 JMM scheme provides three kinds of underlying ordering constaints, that are tied to variable declarations &#8212; read-volatile, write-volatile, and write-final (aka write-release, aka publish, aka lazy-set), or for stand-alone AtomicX objects.</p>
<p>This turns out not to mesh very well with the development of core concurrent algorithms (like the ones I tend to implement), where you often need these special flavors of reads and writes on an occasional basis, which is currently either impossible or insensible to achieve.<br />
&#8230;<br />
In the recent C++0x standard, this kind of usage was addressed by allowing per-usage modes on new read and write methods. (And further, supporting more than the three modes considered here, but I don&#8217;t think we need to introduce more for Java.)<br />
&#8230;<br />
So, after years of resisting the idea, my current conclusion is that we need to stop wishing for a miraculous solution to lack of call-by-ref, and instead allow developers to roll their own out of the raw ingredients &#8212; fences.
</p></blockquote>
<p>In his actual talk, Bjarne only briefly mentioned C++0x concurrency features and the memory model (because there was so much ground to cover), but after noting that &#8220;C++ 0x will have atomics,&#8221; he quipped that the new feature makes C++ a &#8220;superpower.&#8221;  Besides the obvious pun, some people (myself included) would liken C++ to a nuclear weapon for a different reason &#8212; namely that it tends to provide many experts-only features with extremely intricate rules and many hidden pitfalls.  This was illustrated perfectly to me by Bartosz Milewski&#8217;s attempt to implement a well-known synchronization algorithm (<a href="http://en.wikipedia.org/wiki/Peterson%27s_algorithm">Peterson&#8217;s algorithm</a> for shared-memory, two party mutual exclusion) using relaxed consistency atomics. Just by trying to implement Peterson&#8217;s algorithm, he ended up finding that the draft standard semantics were subtly broken, and it took Hans Boehm&#8217;s intervention and the modification of the standard just to get it to work.  In his post titled <a href="http://bartoszmilewski.wordpress.com/2008/12/23/the-inscrutable-c-memory-model/">The Inscrutable C++ Memory Model</a>, he recounts just how complicated it is:</p>
<blockquote><p>I had no idea what I was getting myself into when attempting to reason about C++ weak atomics. The theory behind them is so complex that it’s borderline unusable. It took three people (Anthony, Hans, and me) and a modification to the Standard to complete the proof of a relatively simple algorithm.</p></blockquote>
<p>All the people involved are in the target population of expert users for relaxed consistency atomics, so the byzantine complexity illustrated here (i.e. &#8220;so complex that it’s borderline unusable&#8221;) doesn&#8217;t bode well for 99.9% of the C++ using population.  Sometimes I&#8217;m afraid that some bold and foolhardy programmers won&#8217;t be able to resist the temptation of &#8220;optimizing&#8221; by using features like this &#8212; the programming equivalent of playing Russian Roulette with all chambers loaded.   Of course, I still support their inclusion, since the Java experience shows that they are needed for a small subset of people like Doug Lea to write kickass foundational concurrent library functionality.  </p>
<p>Anyway, to bring this post full circle, earlier I mentioned that shared-state concurrency can be tricky to apply to large-scale systems.  One of the reasons is thematically similar to the &#8220;experts only&#8221; problem we just discussed &#8212; current abstractions for shared-state concurrency are very complex and tricky to get right.  The STM people point out that locks don&#8217;t compose &#8212; in other words, to build on existing pieces and maintain invariants, you often have to &#8220;invite&#8221; other parties to participate in your own internal locking protocols, basically foiling true abstraction or encapsulation of concerns.  Just how tricky are threads and locks?  Well, Edward A. Lee&#8217;s &#8220;<a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf">The Trouble With Threads</a>&#8221; recounts the following anecdote:</p>
<blockquote><p>A part of the Ptolemy Project experiment was to see whether effective software engineering practices could be developed for an academic research setting. We developed a process that included a code maturity rating system (with four levels, red, yellow, green, and blue), design reviews, code reviews, nightly builds, regression tests, and automated code coverage metrics [43]. The portion of the kernel that ensured a consistent view of the program structure was written in early 2000, design reviewed to yellow, and code reviewed to green. The reviewers included concurrency experts, not just inexperienced graduate students (Christopher Hylands (now Brooks), Bart Kienhuis, John Reekie, and myself were all reviewers). We wrote regression tests that achieved 100 percent code coverage. The nightly build and regression tests ran on a two processor SMP machine, which exhibited different thread behavior than the development machines, which all had a single processor. The Ptolemy II system itself began to be widely used, and every use of the system exercised this code. No problems were observed until the code deadlocked on April 26, 2004, four years later.
</p></blockquote>
<p>In closing, I just want to say I enjoyed Bjarne&#8217;s talk a lot, and I find him to be extremely pragmatic in technical matters.  He&#8217;s very frank about C++&#8217;s purpose, the fact that it&#8217;s evolving and that it has certain warts (and also many things that are better than other languages).  In other words, he&#8217;s not the Steve Jobs of C++, pretending it&#8217;s eternally perfect and consistent.  There wasn&#8217;t much time for many questions, but the most amusing question to me was what can only be described as an &#8220;Oprah&#8221; question &#8212; &#8220;How do you feel about having your software running on so many systems?&#8221;  Hah.</p>
<p>* See <a href="http://www.research.ibm.com/people/m/michael/RC23089.pdf">&#8220;ABA Prevention Using Single-Word Instructions&#8221;</a> by Maged Michael and <a href="http://research.sun.com/scalable/pubs/SPAA04.pdf">&#8220;DCAS is not a Silver Bullet for Nonblocking Algorithm Design&#8221;</a> by Doherty et al. for some examples of ABA problems solved by GC and instances where GC alone is insufficient.  However, regardless of whether GC solves the ABA problem entirely, GC tends to make concurrent algorithms significantly simpler in many other ways.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thegibson.org/blog/archives/303/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

