Some days ago, there was some discussion in #lisp about how to program a random variable with nontrivial probability distribution. That is, given values x1,…,xn, and probabilities p1,…,pn, how to write a function that randomly returns one of those values with the corresponding probability each time it is called.
It turns out this problem has a beautiful, simple, and efficient solution called the alias method (R. A. Kronmal and A. V. Peterson. On the alias method for generating random variables from a discrete distribution. The American Statistician, 33:214–218, 1979.), that requires only one call to the random number generator. While I haven’t read any complete explanation of it (only the one in R. A. Kronmal, A. V. Peterson, The alias and alias-rejection-mixture methods for generating random variables from probability distributions. Proc. 1979 Winter Simulation Conference (ed. H.J. Highland et al.) 269–280, 1979, which isn’t complete), I think it is highly unlikely that it is different from the method I describe below.
An Common Lisp implementation can be found here.
The alias method
If you choose one real number between 0 and 1 at random with a uniform distribution, the probability that it lies between a and b, with
0≤a<b≤1, is b-a. So the obvious solution to our problem is to put intervals Ii in [0,1) of length pi next to each other. Each time we want a value of our random variable, we choose a value v in [0,1) at random with a uniform distribution, and see in which interval it falls. If it falls in Ik, the value is xk. The problem is that finding the interval is costly.
The fundamental idea behind the alias method is that this approach still works if each Ii was subdivided in nonoverlaping, not necessarily contiguous intervals Ji1, Ji2,…,Jiki in [0,1) such that the sum of their lengths is pi. This can be used to rearange the intervals in such a way that it is easy to know in what interval a given point lies, and to which value the interval corresponds. To do this, one divides [0,1) in n intervals Lk of equal length, and then packs each with at most two intervals belonging to different xk. Thus, finding out in what interval a point falls amounts to finding out, first, in which Lk it falls, which can be obtained by division and truncation, and then finding out which one of the two possibilities that are left holds.
To get this redistribution right, pack L1 with an interval whose length is less or equal than 1/n, and, if necessary, take away what is needed to fill up L1 from one whose length is greater than 1/n. Now, filling up the rest of the Li is very close to the initial problem, but with one value (and one interval) less.