My paper with Ron Shachar, When Kerry Met Sally: Politics and Perception in the Demand for Movies, has been accepted for publication at Management Science and is available for download via Articles in Advance. I've also filmed a short video in which I talk about the main results, and co-authored (with Imogen Moore) a brief summary article.
The paper presents a predictive model of local demand for movies with two unique features: First, arguing that consumers' political tendencies have unutilized predictive power for marketing models, we allow consumers' heterogeneity to depend on their voting tendencies. Basically, we take voting data and use it in much the same way we would demographic data. Second, instead of using the commonly used genre classifications (e.g. the ones used by IMDB) to characterize movies, we estimate latent movie attributes. These attributes are not determined a priori by industry professionals, but rather reflect consumers' perceptions, as revealed by their movie-going behavior.
We find two things. First, consumers' preferences are related to their political tendencies (for example, counties that voted for congressional Republicans prefer movies starring young, white, female actors over those starring African-American, male actors), and accounting for these tendencies improves the predictive power of the model. Second, the perceived attributes we estimate provide new insights into consumers' preferences for movies. For example, one of these attributes is the movie’s degree of seriousness. Together, the two improvements we propose have a meaningful impact on forecasting error, decreasing it by 12.6 percent.
I'm really happy with this paper, as I think each of the two contributions can have a meaningful impact on practice. The success of the perceived attributes in predicting demand for movies supports the idea that the movie industry's narrow classification of movies into binary categories for drama, comedy, action, etc., may not be very meaningful to consumers. Instead, classifications describing cast demographics, or the degree of emotional intensity, may be closer to how consumers think about movies.
But perhaps the most exciting result is the success of political data in explaining consumer heterogeneity across different local markets. We found that political data outperformed a wide range of demographic data regarding race, income, family size, and myriad other variables. Indeed, when we estimated models with both political and demographic data, it was the political data that explained preferences. Political data are updated far more frequently than demographic data and are almost as easy to obtain. My hope is that other marketing researchers will start to use these data to explain consumer heterogeneity for other goods.
Stan is an application framework for building Monte Carlo samplers. Superficially, it looks a lot like BUGS (or JAGS): you define a model using an R-like syntax and then...magic!...samples appear. But that's where the similarities end.
For starters, with Stan, there's an intermediate step in which the Stan "compiler" spits out a C++ file based on your model; you must then compile this source file and link it to the Stan libraries. (The documentation is quite clear about how to do this.) So Stan's output is another program—your program—which you can then run to produce (magically) samples.
Another difference between Stan and BUGS lies in how the samples are generated. BUGS relies on Gibbs sampling, whereas Stan uses self-tuning Hamiltonian Monte Carlo sampling (the so-called no U-turn sampler of Hoffman and Gelman). Hamiltonian MC is similar to another technique I've used—Riemann manifold Langevin MC—in that it uses the gradient of the log posterior density to "aim" the sampler to higher density regions. Samples tend to take more computation time to generate, but make up for this loss with lower autocorrelation.
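To make the role of the gradient concrete, here is a minimal sketch of plain Hamiltonian Monte Carlo for a one-dimensional target. It is not the no U-turn sampler that Stan uses: the step size and trajectory length are fixed by hand rather than self-tuned, the gradient is hand-coded rather than obtained by automatic differentiation, and the standard normal target and all function names are assumptions chosen only for illustration.

```python
import math
import random


def leapfrog(x, p, grad_log_post, step, n_steps):
    """Approximate Hamiltonian dynamics with the leapfrog integrator."""
    p = p + 0.5 * step * grad_log_post(x)        # half step for the momentum
    for i in range(n_steps):
        x = x + step * p                         # full step for the position
        if i < n_steps - 1:
            p = p + step * grad_log_post(x)      # full step for the momentum
    p = p + 0.5 * step * grad_log_post(x)        # closing half step
    return x, p


def hmc(log_post, grad_log_post, x0, n_draws, step=0.2, n_steps=8, seed=1):
    """Plain HMC with a fixed step size and trajectory length (illustrative)."""
    rng = random.Random(seed)
    x, draws = x0, []
    for _ in range(n_draws):
        p = rng.gauss(0.0, 1.0)                  # refresh the momentum
        x_new, p_new = leapfrog(x, p, grad_log_post, step, n_steps)
        # Hamiltonian = negative log posterior + kinetic energy
        h_old = -log_post(x) + 0.5 * p * p
        h_new = -log_post(x_new) + 0.5 * p_new * p_new
        # Metropolis correction for the integrator's discretization error
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        draws.append(x)
    return draws


# Hypothetical target: a standard normal "posterior"
def log_post(x):
    return -0.5 * x * x


def grad_log_post(x):
    return -x


draws = hmc(log_post, grad_log_post, x0=3.0, n_draws=5000)
```

Each proposal follows the gradient for several steps before the accept/reject decision, which is where both the extra computation per sample and the lower autocorrelation come from.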
Stan uses automatic differentiation to calculate the value of the gradient at any given values of the model parameters, which is the same approach I took in my job market paper. This isn't an easy thing to code up by hand, in part because the best templated library for matrix operations (Eigen) doesn't work very well with automatic differentiation (yet). The Stan team decided to work around this problem by implementing their own automatic differentiation library. Basically, Stan rolls all of my favorite tools into one nicely integrated package.
I've been using Stan to prototype a model comprising a finite mixture of Poisson distributions. So far it's working really nicely. I can make rather drastic changes to the model in a short amount of time, and because the sampler runs rather quickly with the small amount of data I'm using for the prototype, I can see how these changes impact fit. I'm eager to see how far I can push this tool as I add more data...
One last point—Stan claims to provide an API so that you can access the log-probability functions (and their gradients) directly from your own code. So far these features do not appear to have been documented, and looking through the code, I can see that such integration will require an investment in understanding how the whole shebang works. Hopefully the upcoming release of Stan 2.0 will address this shortcoming.
In hierarchical Bayesian models, there's often a high degree of correlation between parameters
at different levels of the model. To give an example, if I have the following hierarchical model:
then \(\eta_j\) and \(\beta\) might be strongly correlated.
The correlation, of course, makes MCMC sampling inefficient. This parameterization is often called 'centered'
because the distribution of \(\eta_j\) is said to be 'centered' over \(x_j \beta\). A common solution is to
'un-center' the model. A typical reparameterization would be:
Sometimes the un-centered version of the model is better than the centered, but that's not always the case. A nice paper by
Yaming Yu and Xiao-Li Meng (JCGS, 2011) addresses this
problem. Their method builds upon the intuition that we might do well to sample from both parameterizations of the model; that is, to
alternately draw \(\eta_j\) from the centered and un-centered models within each MCMC iteration.
Yu and Meng's method combines the second and third steps above, so that we have, at each iteration, the following procedure:
What's nice (i.e. convenient) is that the distribution \(\tilde\eta_j | \eta_j, \beta\) is usually degenerate. In this example, \(\tilde\eta_j = \eta_j - x_j\beta\).
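The interweaving idea can be sketched on a toy model. Everything specific here is an assumption made for illustration rather than the model from the post: a normal likelihood \(y_j \sim N(\eta_j, 1)\), a centered prior \(\eta_j \sim N(x_j\beta, V)\) with known \(V\), a flat prior on \(\beta\), and the function name `asis_sampler`. Within each iteration, \(\beta\) is drawn once given the centered \(\eta_j\) and once more given the un-centered \(\tilde\eta_j = \eta_j - x_j\beta\).

```python
import math
import random


def asis_sampler(y, x, V, n_iter, seed=1):
    """Interweave centered and un-centered draws of beta in the toy model
    y_j ~ N(eta_j, 1), eta_j ~ N(x_j * beta, V), flat prior on beta."""
    rng = random.Random(seed)
    sxx = sum(xj * xj for xj in x)
    eta_var = 1.0 / (1.0 + 1.0 / V)      # conditional variance of eta_j
    beta, draws = 0.0, []
    for _ in range(n_iter):
        # 1. Draw eta_j | beta, y_j under the centered parameterization
        eta = [rng.gauss(eta_var * (yj + xj * beta / V), math.sqrt(eta_var))
               for yj, xj in zip(y, x)]
        # 2. Draw beta | eta (centered): a regression of eta on x
        b_mean = sum(xj * ej for xj, ej in zip(x, eta)) / sxx
        beta = rng.gauss(b_mean, math.sqrt(V / sxx))
        # 3. Degenerate move to the un-centered parameterization
        eta_tilde = [ej - xj * beta for ej, xj in zip(eta, x)]
        # 4. Draw beta | eta_tilde, y (un-centered): regress y - eta_tilde on x
        b_mean = sum(xj * (yj - tj) for xj, yj, tj in zip(x, y, eta_tilde)) / sxx
        beta = rng.gauss(b_mean, math.sqrt(1.0 / sxx))
        # eta is redrawn in step 1 of the next iteration, so it is not rebuilt here
        draws.append(beta)
    return draws
```

In this toy case the marginal posterior of \(\beta\) is available in closed form, \(N\big(\sum_j x_j y_j / \sum_j x_j^2,\ (1+V)/\sum_j x_j^2\big)\), which gives a simple check on the draws. A realistic model would replace each conditional draw with whatever step the model requires.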
I've implemented their method and it seems to do a good job breaking some of the correlation between parameters. Their
paper includes a number of theoretical results showing how awesome this procedure is (e.g., it's never worse than the
less efficient of the two parameterizations, and it can converge in cases where neither parameterization by itself will).
Update: I continue to be amazed at this technique. I plan to use this as my default sampling scheme for hierarchical models.
A Key Word History of Marketing Science (co-authored with Carl Mela and Yiting Deng) has just been accepted at (where else?) Marketing Science. The idea behind the paper is to use the key words authors have assigned to papers published over the past 30 years in Marketing Science (the journal) to track changes in marketing science (the field). By “key words” we mean the small set of descriptors assigned to papers when they are published, similar to the JEL classifications used for papers in econ journals. At Marketing Science, however, authors are free to make up their own key words.
Our approach to inference in this paper is almost entirely model-free. That is, for most of the analysis, we plot raw data and then draw our inferences directly, without defining a formal model. (We do bring in multivariate analysis, namely cluster analysis and multidimensional scaling, but the bulk of our analysis is based on the data themselves.) I like this approach, especially for a descriptive paper like this.
Two of my favorite plots from the paper are Figures 4 and 5. In Figure 4, we show new key words appearing each year; some of them sticking around in future years, but most never appearing again. It's clear that the rate at which new key words appear in the journal is growing (we discuss reasons for this in the paper).
Figure 4: Emergence and popularity of new key words by year. Rows represent individual key words. Circle size is proportional to key word share in each year. Colored bands group key words together by the year they first appeared.
But even though the rate of new key words is increasing, the likelihood of hitting on a new key word that sticks around for years to come is decreasing, suggesting the field is maturing. This is shown in Figure 5.
Figure 5: Emergence, persistence, and decline of highly popular key words, grouped in three-year periods. Words are ordered by their popularity in the year they first broke into the top 10. Circle size represents popularity within each year.
One of the recurring themes in the paper is the dominance of game theory as a research paradigm in marketing. Game theory first entered the top 10 (in terms of share of papers using that key word) in the mid-90's and has been extremely popular ever since. Interestingly, in our cluster analysis, we find game theory doesn't group well with other key words. Rather, it seems to function as a uniting framework that brings together a wide range of topics and research methods. Or as Carl puts it, it is "the one ring to rule them all."
Sauron with the latest issue of Marketing Science.
Last fall, I ran across this post at Andrew Gelman's blog, which links to
an article by Cook, Gelman, and Rubin. The article presents a methodology for testing MCMC samplers, and this bit from the article's abstract really captured my attention:
The validation method is shown to find errors in software when they exist and, moreover, the validation output can be informative about the nature and location of such errors.
Intriguing! For quite a while, I had been frustrated by my inability to test my MCMC code at a level I was really comfortable with. When I worked as a software developer, I relied heavily on unit testing to validate my code. This was the approach I tried to take when I began writing MCMC samplers at the start of my PhD. But as I quickly learned, traditional unit testing isn't much help for this type of code.
First of all, the reason I'm writing this software is that I don't have an answer to a question (e.g., what is the distribution of this parameter given the data I have). This sort of flies in the face of the whole "write the unit test first" paradigm for coding—if I knew what the code was supposed to do, I wouldn't need to write the code at all. Another reason is that coding errors that are very easy to make, such as multiplying a vector by the upper instead of lower Cholesky root, can be very difficult to identify by simply examining the output of an MCMC sampler. Another way to put this is to say that most mistakes are made at the integration, rather than the unit level.
The Standard Approach to Testing MCMC Samplers
The standard solution to the problem of not knowing whether one's code does what it is supposed to do is indeed a type of integration testing. One typically chooses a set of parameter values, simulates data according to the likelihood function, and then runs the sampling code to "recover" the parameters. These samples are then compared to the original ("true") parameter values in order to measure how well the sampler works.
There are a few problems with this approach as it is typically employed in marketing and economics. For one, the simulation is often conducted using just a single set of "true" parameter values, even though bugs in MCMC samplers often go unnoticed as long as the parameters fall within a particular range of "nice" values (and if you're only using one set of parameters, they are almost certainly "nice").
A second problem has to do with the tests used to compare the samples to their true values. Researchers in marketing and economics often report the means and standard deviations of the sampled parameters (sometimes 95% credible intervals), implying some kind of interval test. But these interval tests will almost always fail to discover coding problems that result in samples that are, despite being unbiased with respect to the mean, over- or under-dispersed relative to the true posterior distribution. These tests can also fail to pick up coding errors that cause small but persistent biases, unintended skewness, other higher-moment badness, etc.
A Better Approach
Cook, Gelman, and Rubin's approach solves both of these problems, incorporating multiple simulations (using different "true" parameter values) and a more reliable measure of the sampler's performance. The basic idea is the following:
1. Simulate parameters from their prior distributions
2. Simulate data from the likelihood, conditional on the simulated parameters
3. Run the MCMC sampler
4. Compare the true parameter values to the samples. Generate a statistic for each parameter representing the proportion of samples that are greater than the true value
5. Repeat many times
The statistics generated in step 4 have some nice properties. Specifically, if the sampler is correct, the statistic for any given parameter is i.i.d. Uniform(0,1) across simulations. The paper provides some clever transformations of these uniformly distributed statistics that lead to the procedure spitting out a p-value for each parameter. The authors suggest transforming these p-values into folded-standard-normal variates, and then plotting them as strip charts or histograms (each plot containing a group of related parameters, e.g. all elements of a vector). It turns out that we're all quite good at catching deviations from normality with the unaided eye.
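The procedure can be sketched on a model small enough to check by hand. The setup below is an assumption made for illustration, not the code used in the post: a single parameter with prior \(\theta \sim N(0, 1)\) and data \(y_i \mid \theta \sim N(\theta, 1)\), with exact posterior draws standing in for the MCMC sampler. The `sd_scale` argument is a hypothetical knob that mimics a bug producing under-dispersed samples, and the small offset in the quantile is a convenience to keep it strictly inside (0, 1).

```python
import random
from statistics import NormalDist


def posterior_draws(y, n_draws, rng, sd_scale=1.0):
    """Stand-in for an MCMC sampler: exact posterior draws for
    theta ~ N(0, 1), y_i | theta ~ N(theta, 1). sd_scale != 1 mimics a bug."""
    n = len(y)
    mean = sum(y) / (n + 1)
    sd = sd_scale * (1.0 / (n + 1)) ** 0.5
    return [rng.gauss(mean, sd) for _ in range(n_draws)]


def validate(sampler, n_reps=200, n_obs=10, n_draws=200, seed=1):
    """Run the simulate-and-compare loop and summarize the quantiles."""
    rng = random.Random(seed)
    inv_cdf = NormalDist().inv_cdf
    quantiles = []
    for _ in range(n_reps):
        theta = rng.gauss(0.0, 1.0)                         # draw from the prior
        y = [rng.gauss(theta, 1.0) for _ in range(n_obs)]   # simulate the data
        draws = sampler(y, n_draws, rng)                    # run the "sampler"
        above = sum(d > theta for d in draws)               # samples above truth
        quantiles.append((above + 0.5) / (n_draws + 1))     # kept inside (0, 1)
    # With a correct sampler the quantiles are roughly Uniform(0, 1), so the
    # sum of squared normal scores is roughly chi-squared with n_reps d.f.
    chi2_stat = sum(inv_cdf(q) ** 2 for q in quantiles)
    return quantiles, chi2_stat


def buggy_sampler(y, n_draws, rng):
    """Same sampler with the posterior standard deviation halved."""
    return posterior_draws(y, n_draws, rng, sd_scale=0.5)


quantiles_ok, stat_ok = validate(posterior_draws)
quantiles_bad, stat_bad = validate(buggy_sampler)
```

With the correct sampler the statistic should land near `n_reps`. The under-dispersed version pushes the quantiles toward 0 and 1 and inflates the statistic, even though its samples are still centered on the right posterior mean.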
I've used this method twice, and have a few practical observations:
- The method is quite sensitive. I was working on a Metropolis-Hastings step as part of a bigger sampling routine, and made a very minor error in the calculation of the prior density for a very small subset of parameters. I could see that the test statistics were shifted just slightly away from zero.
- The method cannot tell the difference between a coding error and a lack of identification. I consider this a strength in some ways, as I find it quite difficult to reason through model identification. But it is also a weakness to the extent that you might not know whether to look for bugs in your code or in your model.
- Simulations are independent, and can be run in parallel on multi-core computers. MCMC samplers are slow in part because each parameter needs to condition on all the others, forcing the sampler to move sequentially through most of the parameters (although you can exploit conditional independence to run some steps in parallel). On an 8-core system, I can run eight simulations in parallel—easily enough to catch any problems.
To wrap up: this method is remarkably effective, and has changed the way I write MCMC samplers. Most importantly, it has allowed me to attain a much higher level of confidence in my code than I was able to using the standard approach.
I've started my new job as an Assistant Professor in the Department of Marketing Management at the Rotterdam School of Management, Erasmus University (my updated contact information can be found here).
Late last year, I was named one of three finalists in the 2012 ISMS Doctoral Dissertation Proposal Competition. This was an amazing honor and a genuine surprise to me—the sort of thing you hope you might win, but don't really expect to. The awards (i.e., cash money) were disbursed last year, but the awards (i.e., plaques) were distributed this month at the Marketing Science conference in Boston. Here are a few snapshots of me at the awards ceremony.
Vithala Rao presents the award
Back at the Table
With John Hauser