12 March 2008 - 2:17The perils of numerical algorithms
Recently I needed to generate binomially-distributed integers on demand for a micro-benchmark. Actually, I started with the idea of generating normally distributed numbers and rushed headfirst into reading about how to do that without thinking about the fact that I really needed a discrete distribution; later I switched distributions, but I found out it is quite easy to generate normally distributed numbers. Starting with the primitive of rand_r or drand48 (or the GNU drand48_r), we can get uniformly distributed random numbers on the interval [0, 1). If you aren't too picky about issues of numerical stability or speed, generating normally distributed numbers with uniformly distributed numbers is simple using the basic rectangular form of the Box-Muller transform. Assuming u_1 and u_2 are the uniform random numbers:

n_1 and n_2 are normally distributed. So that's nifty and really easy, but I then remembered that I needed to generate integers in the range [0, N], and the normal distribution is defined over the entire real line. I could just chop the tails off of each side of the distribution and create a truncated normal distribution, but it seemed like a better idea to just directly go with a discrete distribution.
The Binomial distribution with p=0.5 over [0, N] looks approximately normal when N becomes very large, so I decided to generate binomially-distributed integers. Even when there isn’t a simple, closed-form transform from uniformly random integers to another distribution, it seemed like you could just generate the cumulative distribution function; then you generate a uniform random number u_1 and then binary search for the greatest index i where cdf(i) <= u_1 (basically inverting the cdf). So my plan was to generate the probability mass function straight from the definition and then sum it to get the cdf. With 128-bit floating point and care in implementing the binomial coefficient (i.e. don’t perform factorials directly and divide, evaluate it a non-canceled term at a time), this worked fine for my tests the range of a few thousand. However, I quickly ran into trouble since N=72,000 in the real test. The definition of the pmf is:

Obviously you can see where this is going. That definition works somewhat well for smaller integers, but once you have quantities like 0.5^(72000), you can easily get underflows and overflows. I tried some term rearrangement, but I still ended up with a mass function that was +Inf in the middle and 0 everywhere else, which is completely worthless. Not wanting to spend a lot of time implementing this, I figured GNU R — the excellent statistical computing language/environment — could probably do this. So I ended up using R to generate the cdf (with the pbinom function), and wrote the results out to a file which I read in when I needed the cdf.
Later I looked into how R actually implements pbinom, and it delegates the problem to the Beta distribution cdf implementation, pbeta. The work in pbeta is all done by a function called bratio, which is the interesting part. The bratio function is in toms708.c. A comment at the top explains the name:
/* ALGORITHM 708, COLLECTED ALGORITHMS FROM ACM. * This work published in Transactions On Mathematical Software, * vol. 18, no. 3, September 1992, pp. 360-373z. */
Sure enough, you can locate “Algorithm 708; significant digit computation of the incomplete beta function ratios” in ACM’s digital library. The the bratio function computes the Incomplete beta function, and toms708.c is almost 2300 lines of fairly scary looking numerics code (lots of gotos and labels, and a fair number of magic looking constants). The original code from ACM TOMS was in Fortran, and the top of the source file says “Based on C translation of ACM TOMS 708.” They don’t explicitly say whether it was manual or if machine translation was involved at any point, but I found several comments above some variable declarations that says “System generated locals”. Googling on that phrase brings up references to f2c, the Fortran to C translator. That probably also explains the abundance of labels with names corresponding to line numbers in the original Fortran source.
Anyway, I just thought this was interesting. Numerical algorithms are fascinating: they can take a lot of care to get right and often require complex tricks to avoid over/under-flows and maintain numerical stability. Just working with floating point is perilous on its own, because there are things like cancellation errors, non-associativity of operations, non-intuitive notions of equality, etc. This is one area that wasn’t well covered in my undergraduate classes (although it might have been covered in graphics classes, which I never took). In the introductory systems class, they discussed floating point representations and had us encode/decode some numbers from scientific notation into IEEE 754, but no class really went into the pragmatics and pitfalls of using floating point numbers in mathematical algorithms. I learned about that mostly a) from reading the classic report, “What Every Computer Scientist Should Know About Floating-Point Arithmetic” and later; b) from working at the Federal Reserve in Economic Research, helping economists make use of distributed computing with their simulations and such.
Through my experience at the Fed, I gained a lot of insight into just how much effort it takes to do correct numerical algorithms, let alone fast ones. Of course, most of the economists relied on highly-tuned primitive libraries like IMSL, but it still takes work to make sure your own calculations (built on top of those primitives) don’t introduce accumulating error. Fortran and Matlab were, by far, the most popular languages, with a few economists choosing C or Mathematica. Anyway, this is a deep and interesting area.
As an aside, it’s not too hard to generate Zipf distributed numbers using a technique called “Rejection-Inversion” (it doesn’t require generating the cdf and searching). Take a look at FastBit which has an implementation in twister.h
/// Discrete Zipf distribution: p(k) is proportional to (v+k)^(-a) where a /// > 1, k >= 0. It uses the rejection-inversion algorithm of W. Hormann /// and G. Derflinger. The values generated are in the range of [0, imax] /// (inclusive, both ends are included).
The referenced paper is “Rejection-inversion to generate variates from monotone discrete distributions”; the code to generate Zipf-distributed integers is only about 20 lines of C++, and it’s actually understandable.
No Comments | Tags: Uncategorized