3 April 2008 - 0:14Memory ordering and memory models

Along with a variety of interesting papers, I’ve seen a few nice Google Tech Talks on the topic of processor memory ordering and language-level memory models recently. With the relatively recent resurgence of interest and attempts to “mainstream” parallel/concurrent programming, it is increasingly important to get these things right. Processor memory ordering guarantees (sometimes called a processor’s/architecture’s memory model) are generally relevant to people like me writing systems-level software (e.g. programming in C or assembly implementing operating systems, higher-level language runtimes and compilers, etc.). In theory, if you are doing user-level programming in C with something like pthreads (or win32 threads, OpenMP, etc.) and have a “race free” program (according to the threading specification), processor memory ordering should not be directly exposed to you. Even though Alpha and PowerPC have weaker memory ordering guarantees than x86, it is the implementation’s responsibility to issue appropriate memory fence operations, locked operations, etc., to make sure that the thread primitives perform as expected. If you want to use lock-free algorithms or atomic operations, you have to be aware of such architecture-specific things and do this manually.

Some higher-level languages — most notably Java — have defined language-level memory models: Java’s memory model defines how threads interact through memory and precisely what it means for a program to be data race free / well synchronized. It provides guarantees that a compiler/JIT won’t perform optimizations that break race free code. In a language like C without a defined memory model, these things can be tricky and surprising because it is unclear what a data race means when considering the mapping from language level statements to actual machine code. A programmer may have two independent threads concurrently assigning values to two different variables. If these variables are small (chars, for example) and stored adjacently in the same machine word, this may be a data race on some architectures. Since these kinds of decisions are often left unspecified and up to the compiler (or possibly the linker), there is no guaranteed way to write portable and robust code. Additionally, the compiler may introduce extra stores or perform other optimizations that, while fine for single-threaded code, introduce races into otherwise race-free code. People like me, who (try to) make use of lock-free algorithms and atomic operations, end up with code that may be quite fragile to even minor compiler optimization changes. More on this later.

One other nice thing about Java’s memory model is that it also constrains the language implementation on what can happen to incorrectly synchronized code; languages like C and C++ often say that the result of illegal code (e.g. modifying a variable twice without an intervening sequence point) is undefined — and undefined behavior can allow anything at all to happen. As one example, Java specifies that the implementation cannot introduce values into improperly synchronized code that appear “out of thin air.”

The videos are as follows:

  • IA Memory Ordering — Richard Hudson explains Intel’s newly clarified memory ordering semantics for x86.
  • Advanced Topics in Programming Languages: The Java Memory Model — Jeremy Manson describes the current Java memory model as revised by JSR-133 and Java thread semantics. The talk covers basics about model ordering guarantees, the meaning of locking/synchronization primitives and volatile, as well as common pitfalls.
  • Getting C++ Threads Right — Hans Boehm, the well-known programming languages/compilers researcher, talks about the effort to provide better threads support in the upcoming C++ standard (C++0x). More importantly, he talks about the general problems plaguing implementation of correct multi-threaded programs in languages like C and C++.
  • Towards a Memory Model for C++ — Not a Google Tech Talk, but another talk by Hans Boehm very similar to that above.

C++ Threads / Memory Model
Hans Boehm covered similar ground in his 2005 PLDI paper, “Threads Cannot be Implemented as a Library”. Some people felt the paper was trivial or hyping a non-problem by saying that the compiler and language have to provide a few extra guarantees and can’t be completely oblivious (I recall the LTU discussion — one commenter said, “Ayone who was paying attention already knew that”). But he does bring up a whole set of issues which is becoming increasingly important. Concurrent programming is already more difficult than regular sequential programming. On top of that, you have poorly specified semantics or broken implementations which actually cause problems. It’s hard enough already to get your part right without worrying about the compiler breaking things behind your back or the underlying language implementation not properly obeying the specification. His concerns aren’t just theoretical “cleanliness” issues, either.

Part of what prompted this post is the recent thread on LKML: “Is gcc thread-unsafe?” This is a perfect example of the problem that Boehm alludes to with regard to compiler optimization. The sample code in the gcc thread:

int trylock() {
  int res;

  res = pthread_mutex_trylock(&mutex);
  if (res == 0)
    ++acquires_count;

   return res;
}

That code attempts to acquire the mutex and increments acquires_count only if it succeeded in locking the mutex. With -O1, gcc 4.3 generated code that always reads and writes the acquires_count variable (load, conditional add, store) regardless of whether the mutex is obtained. Typical language lawyers pointed out that the C standard allows this optimization, even though it introduces a fairly nasty race condition. Some of the gcc developers tend to take a very defensive language lawyer stance, vigorously defending things that are technically permitted but practically useless. Standard C says nothing of threads and imposes very little on the compiler in this regard.

The Java Memory Model
The Java Memory Model is quite useful today, but it wasn’t always perfect. Java 5 incorporated JSR 133: Java Memory Model and Thread Specification Revision — a revision making some substantial changes to the original Java Memory Model, which was regarded as “broken.” I was still in high school and hadn’t even taken my first CS course when William Pugh (also known for the invention of the skip list, a wonderful data structure that works well in concurrent situations), published a paper titled The Java Memory Model is Fatally Flawed, a revised version of Fixing the Java Memory Model from a year earlier. I learned about the controversy a year or two later, and I was finishing my Masters by the time the new, fixed memory model was finalized and adopted. Although it seems somewhat strange that people tolerated and obviously wrote multi-threaded Java with a “fatally flawed” memory model for so long, the situation is roughly analogous to the current situation with C/C++ and threads: generally stuff works, and compilers/runtimes often do the “right thing” anyway, but we’d rather have stronger guarantees in this area.

Double-checked locking, a trick typically used to avoid synchronization on lazily-initialized fields, is one of the things broken by Java’s old memory model. Here is an example:

public Object getA() {
  if(a == null) {
    synchronized(this) {
      if(a == null)
        a = new Object();
    }
  }
  return a;
}

Despite the fact that the idiom was commonly-used within some Java libraries, it was technically incorrect and there was no satisfactory way to fix it (using thread-local storage made it possible, but that was a hack and often too expensive to be worth it). Under the current Java Memory Model, double-checked locking works if the field is made volatile. Under the old memory model, volatile was not very useful because volatile reads and writes could be reordered with respect to regular reads and writes, so you couldn’t use a write to a volatile value to, for instance, indicate to another thread that an object was initialized (because the write to signal the initialization might be reordered to before the actual initialization). The new memory model prevents this reordering. Since there are a lot of resources available on the web about the Java Memory Model, I won’t say too much more except that Java has been somewhat of a trailblazer in the area of really providing clear and useful semantics for multi-threaded programs.

William Pugh has a fairly comprehensive page with Java Memory Model resources and Doug Lea has a JSR-133 Cookbook for Compiler Writers and a tutorial titled Synchronization and the Java Memory Model (the latter is an excerpt from Doug’s nice Concurrent Programming in Java book).

IA Memory Ordering
x86 (IA32) has used, in the past, fairly strong memory ordering semantics called “processor ordering.” Intel has clarified/more robustly specified the memory ordering semantics in a document titled Intel® 64 Architecture Memory Ordering White Paper. The newly clarified memory ordering semantics apply to both 32-bit and 64-bit x86 and are described as “Total Lock Ordering + Causal Consistency.” Locked instructions are totally ordered across all processors, and memory obeys causal consistency, which is a fairly natural concept and provides publication safety. These memory ordering semantics don’t apply to all memory types (I/O would be different), but these semantics are what would be encountered in regular user code dealing with standard heap or stack memory. Precisely knowing the target architecture’s memory consistency guarantees is critical to correct and efficient implementation of higher-level language memory models and systems software (and lock-free code). AMD also has their own memory ordering reference which is very close or identical to Intel’s.
A long time ago, I read a paper titled Memory Ordering in Modern Microprocessors by Paul McKenney which explains the memory ordering guarantees provided by many modern microprocessor architectures (in the context of Linux’s memory barrier primitives). It is interesting just how weak Alpha’s (and to some extent PowerPC’s) ordering guarantees can be. Alpha seems to allow just about everything short of the processor simply making up bogus values for memory reads.

No Comments | Tags: Research Content