Last fall, I ran across this post at Andrew Gelman's blog, which links to an article by Cook, Gelman, and Rubin. The article presents a methodology for testing MCMC samplers, and this bit from the article's abstract really captured my attention:
The validation method is shown to find errors in software when they exist and, moreover, the validation output can be informative about the nature and location of such errors.
Intriguing! For quite a while, I had been frustrated by my inability to test my MCMC code at a level I was really comfortable with. When I worked as a software developer, I relied heavily on unit testing to validate my code. This was the approach I tried to take when I began writing MCMC samplers at the start of my PhD. But as I quickly learned, traditional unit testing isn't much help for this type of code.
First of all, the reason I'm writing this software is that I don't have an answer to a question (e.g., what is the distribution of this parameter given the data I have). This sort of flies in the face of the whole "write the unit test first" paradigm for coding—if I knew what the code was supposed to do, I wouldn't need to write the code at all. Another reason is that coding errors that are very easy to make, such as multiplying a vector by the upper instead of lower Cholesky root, can be very difficult to identify by simply examining the output of an MCMC sampler. Another way to put this is to say that most mistakes are made at the integration, rather than the unit level.
The standard solution to the problem of not knowing whether one's code does what it is supposed to do is indeed a type of integration testing. One typically chooses a set of parameter values, simulates data according to the likelihood function, and then runs the sampling code to "recover" the parameters. These samples are then compared to the original ("true") parameter values in order to measure how well the sampler works.
There are a few problems with this approach as it is typically employed in marketing and economics. For one, the simulation is often conducted using just a single set of "true" parameter values, even though bugs in MCMC samplers often go unnoticed as long as the parameters fall within a particular range of "nice" values (and if you're only using one set of parameters, they are almost certainly "nice").
A second problem has to do with the tests used to compare the samples to their true values. Researchers in marketing and economics often report the means and standard deviations of the sampled parameters (sometimes 95% credible intervals), implying some kind of interval test. But these interval tests will almost always fail to discover coding problems that result in samples that are, despite being unbiased with respect to the mean, over- or under-dispersed relative to the true posterior distribution. These tests can also fail to pick up coding errors cause small but persistent biases, unintended skewness, other higher-moment badness, etc.
Cook, Gelman, and Rubin's approach solves both of these problems, incorporating multiple simulations (using different "true" parameter values) and a more reliable measure of the sampler's performance. The basic idea is the following:
The statistics generated in step 4 have some nice properties. Specifically, the statistic for any given parameter is i.i.d. Uniform(0,1) across simulations. The paper provides some clever transformations of these uniformly distributed statistics that lead to the procedure spitting out a p-value for each parameter. The authors suggest transforming these p-values into folded-standard-normal variates, and then plotting them as strip charts or histograms (each plot containing a group of related parameters, e.g. all elements of a vector). It turns out that we're all quite good at catching deviations from normality with the unaided eye.
I've used this method twice, and have a few practical observations:
To wrap up: this method is remarkably effective, and has changed the way I write MCMC samplers. Most importantly, it has allowed me to attain a much higher level of confidence in my code than I was able to using the standard approach.comments powered by Disqus