On a number of occasions, I've spoken to prospective clients and observed others making the following mistake. They've set up an A/B test or bandit system using relatively standard statistical methodology. Unfortunately, their sample size is prohibitively low. In order to "fix" this, they tweak the methodology until the variable N in their statistical technique becomes sufficiently large. In this (somewhat technical) post, I'm going to explain in detail why this simply doesn't work. There is no free lunch and there are no free samples.
Classical Hypothesis Testing
Many classical hypothesis tests are based on the Central Limit Theorem, and for simplicity that's what I'll focus on.
A classical hypothesis test works in the following way. Consider an experiment generating a sequence of statistics $@ X_1, X_2, \ldots, X_N $@. For simplicity, let's suppose that the mean of $@ X $@ is zero. We'll typically generate a test statistic $@ S_N $@ as follows:

$@ S_N = \frac{1}{N} \sum_{i=1}^N X_i $@
Now suppose for simplicity that the sequence $@ X_1, X_2, \ldots, X_N $@ consists of Independent and Identically Distributed (IID) random variables. This means that $@ X_1 $@ and $@ X_2 $@ bear no relation to each other, except for being drawn from the same distribution. Then provided $@ N $@ is sufficiently large, the test statistic is approximately normally distributed with mean $@ E[X] $@ and variance $@ \sigma^2/N $@, where $@ \sigma^2 $@ is the variance of $@ X $@.
Since this test statistic is (approximately) normally distributed, we can then compute p-values in a straightforward way. Suppose we did an experiment and computed $@ S_N $@ empirically. Then the p-value of the test is the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one we saw:

$@ p = P(|S| \geq |S_N|) = 2\left(1 - \Phi\left( \frac{|S_N| \sqrt{N}}{\sigma} \right)\right) $@

where $@ \Phi $@ is the standard normal CDF.
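For concreteness, here is a minimal sketch of this computation in Python, assuming we have an array of IID samples and a known (or estimated) $@ \sigma $@; the numbers are illustrative only:

```python
import numpy as np
from scipy.stats import norm

def p_value(samples, sigma):
    """Two-tailed p-value for the null hypothesis E[X] = 0,
    using the CLT normal approximation."""
    n = len(samples)
    s_n = np.mean(samples)              # the test statistic S_N
    z = abs(s_n) * np.sqrt(n) / sigma   # standardized statistic
    return 2 * (1 - norm.cdf(z))

# Example: 500 samples drawn under the null (true mean zero).
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)
print(p_value(x, sigma=1.0))
```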
Let's be concrete now and stick to a one-tailed test, and suppose we want to measure a particular effect size $@ E $@. Suppose further that we want the p-value to be smaller than 0.05. Then we need:

$@ \frac{E \sqrt{N}}{\sigma} > \Phi^{-1}(0.95) \approx 1.65 $@
Or:

$@ N > \frac{\sigma^2}{E^2} \left( \Phi^{-1}(0.95) \right)^2 $@
The key point here is that $@ N > C \sigma^2 / E^2 $@ - the constant $@ C = \left(\Phi^{-1}(0.95)\right)^2 \approx 2.7 $@ doesn't matter that much for the purpose of this discussion; what matters is how $@ N $@ scales with $@ \sigma $@ and $@ E $@.
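To make the scaling concrete, here is a small sketch of the minimum sample size implied by the one-tailed threshold above. The effect sizes and $@ \sigma $@ are invented for illustration:

```python
from scipy.stats import norm

def required_n(effect_size, sigma, alpha=0.05):
    """Smallest N at which an observed effect of `effect_size` clears
    a one-tailed z-test at level alpha, i.e. N > C * sigma^2 / E^2."""
    c = norm.ppf(1 - alpha) ** 2   # the constant C above, roughly 2.7 for alpha = 0.05
    return c * sigma**2 / effect_size**2

for e in [0.1, 0.03, 0.01]:
    print(e, required_n(effect_size=e, sigma=0.5))
```

Halving the effect size you want to detect quadruples the required $@ N $@ - that's the relationship people are tempted to game.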
Examples of a hypothetical free lunch
Before getting into this, I want to emphasize that the examples I describe are real examples, not straw men. One of them is an example that I found on the web, and which a VC-funded company has apparently sold to a number of enterprises. The other is something a client of mine wanted to do before I persuaded him not to.
Tracking sessions rather than users
Yesterday I drew attention to a post by Dynamic Yield explaining their bandit techniques. In short, to get around some technical difficulties relating to delayed observations, Dynamic Yield decided to track user sessions rather than users.
One nice side effect of this is that it will increase the sample size. If a user visits a site using Dynamic Yield 3 times, then the number of samples $@ N $@ will increase by 3. In contrast, if Dynamic Yield tracked users, $@ N $@ would increase only by 1.
Unfortunately, we've now just violated the assumptions of our test. Our samples are no longer IID. The problem is that the repeated visits generated by a single user are correlated with each other. Some users have a high propensity for making a purchase, while others have a low propensity.
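Here is a toy simulation of what goes wrong (my own illustration, not Dynamic Yield's methodology): each user gets a persistent propensity, every session inherits it, and we run the naive IID z-test on sessions. Under the null hypothesis the test should reject about 5% of the time; because the session-level samples are correlated, it rejects far more often:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def naive_session_pvalue(n_users=200, sessions_per_user=5,
                         user_sd=1.0, noise_sd=1.0):
    """Simulate session-level data under the null (true effect = 0) and
    compute the z-test p-value that (wrongly) treats sessions as IID."""
    propensity = rng.normal(0.0, user_sd, size=n_users)   # one persistent value per user
    sessions = rng.normal(np.repeat(propensity, sessions_per_user), noise_sd)
    n = len(sessions)                                      # N counts sessions, not users
    z = sessions.mean() * np.sqrt(n) / sessions.std(ddof=1)
    return 2 * (1 - norm.cdf(abs(z)))

false_positive_rate = np.mean([naive_session_pvalue() < 0.05 for _ in range(2000)])
print(f"false positive rate: {false_positive_rate:.2f}  (nominal: 0.05)")
```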
Two sided markets
A fictitious client of mine runs a prostitution site. Whenever he successfully introduces a prostitute to a customer, he receives a cut of the revenue. He wants to A/B test various changes on either side of the market in order to increase the number of matches.
Suppose he has $@ N_h $@ hookers and $@ N_j $@ johns. In his marketplace, $@ N_h $@ is relatively low. $@ N_j $@ is not that large either, but it is somewhat larger than $@ N_h $@. He then came up with a brilliant idea to increase his sample size: rather than grouping by hooker, he would instead group by (hooker, john) pair. This yields, in principle, $@ N_h \times N_j = N $@ samples - we are bound to be statistically significant with this many. But is it valid?
As before, we've violated the IID assumption. Some individual hooker may be highly selective, and this high selectivity will introduce correlations across all possible matches involving this hooker.
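A quick toy model (my own invented numbers, not any real marketplace) makes this visible: give each hooker a persistent propensity shared by all of her pairs, and look at how much the grand mean over all $@ N_h \times N_j $@ pairs actually fluctuates. Multiplying $@ N_j $@ by ten barely changes it, because the hooker-level noise never averages out:

```python
import numpy as np

rng = np.random.default_rng(7)

def grand_mean(n_h, n_j, hooker_sd=1.0, noise_sd=1.0):
    """Mean outcome over all (hooker, john) pairs when each hooker
    has a persistent propensity shared by all of her pairs."""
    propensity = rng.normal(0.0, hooker_sd, size=(n_h, 1))         # one value per hooker
    outcomes = propensity + rng.normal(0.0, noise_sd, size=(n_h, n_j))
    return outcomes.mean()

for n_j in [10, 100]:
    means = [grand_mean(n_h=20, n_j=n_j) for _ in range(3000)]
    print(f"N_j = {n_j:3d}: std of the grand mean = {np.std(means):.3f}")
```

The spread of the grand mean is governed by $@ N_h $@, not by $@ N_h \times N_j $@.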
Hypothesis testing of non-independent variables
So we've introduced a bit of correlation into our test procedure. Is everything lost?
Actually no. The Central Limit Theorem can still be shown to hold in certain cases, specifically the case of weak mixing. Examples like the ones above, where a single user is counted repeatedly in the test, satisfy the criteria for weak mixing.
In the case of weak mixing, we merely need to replace the $@ \sigma $@ described above with a new version, $@ \hat{\sigma} $@:

$@ \hat{\sigma}^2 = \sigma^2 + 2 \sum_{j=2}^{N} E[X_1 X_j] $@
The sum term represents the correction due to the fact that a single user may be involved in multiple cases. So what is the effect of these terms?
Suppose the correlation terms $@ E[X_1 X_j] = \delta $@ for some $@ \delta > 0 $@, provided the samples $@ X_1 $@ and $@ X_j $@ are generated by the same user, and zero otherwise. For example, the two samples might represent two sessions generated by the same visitor, or two (hooker, john) pairs involving the same hooker. In the session tracking case, if each user generates roughly $@ s $@ sessions, we then find:

$@ \hat{\sigma}^2 \approx \sigma^2 + 2(s-1)\delta $@
In the prostitution market, where each sample is correlated with the $@ N_j - 1 $@ other pairs involving the same hooker, we find:

$@ \hat{\sigma}^2 \approx \sigma^2 + 2(N_j - 1)\delta $@
Going back to our original relationship, we now need:

$@ N > C \frac{\hat{\sigma}^2}{E^2} $@
So for the visitor/session case, where $@ N = s N_{user} $@, we find:

$@ s N_{user} > C \frac{\sigma^2 + 2(s-1)\delta}{E^2} $@
Rearranging yields:

$@ N_{user} > \frac{C}{E^2} \left( \frac{\sigma^2}{s} + 2\left(1 - \frac{1}{s}\right)\delta \right) \approx \frac{C}{E^2} \left( \frac{\sigma^2}{s} + 2\delta \right) $@
Oops! It looks like once we plug the correct variance into the formula, we still need $@ N_{user} = O(E^{-2}) $@.
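Plugging numbers into this bound shows the plateau directly. The values of $@ \sigma $@, $@ \delta $@, $@ E $@ and $@ s $@ below are invented for illustration:

```python
from scipy.stats import norm

def required_users(effect_size, sigma, delta, sessions_per_user, alpha=0.05):
    """Users needed once within-user correlation is accounted for:
    N_user > (C / E^2) * (sigma^2 / s + 2 * (1 - 1/s) * delta)."""
    c = norm.ppf(1 - alpha) ** 2
    s = sessions_per_user
    return (c / effect_size**2) * (sigma**2 / s + 2 * (1 - 1 / s) * delta)

# The naive hope: more sessions per user means far fewer users needed.
for s in [1, 5, 50]:
    print(s, round(required_users(effect_size=0.02, sigma=0.5, delta=0.1,
                                  sessions_per_user=s)))
```

No matter how many sessions each user generates, the $@ 2\delta $@ term puts a floor under the number of users you need.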
What about for the prostitution market?

$@ N_h N_j > C \frac{\sigma^2 + 2(N_j - 1)\delta}{E^2} $@
Or:

$@ N_h > \frac{C}{E^2} \left( \frac{\sigma^2}{N_j} + 2\left(1 - \frac{1}{N_j}\right)\delta \right) \approx \frac{C}{E^2} \left( \frac{\sigma^2}{N_j} + 2\delta \right) $@
Again, the number of prostitutes in the market is the limiting factor - $@ N_h = O(E^{-2}) $@.
This approach can still be useful if $@ 2\delta \ll \sigma^2 $@, and in many cases it will be. But that's an important condition which needs to be checked. It cannot simply be assumed.
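One way to check it, sketched below, is to estimate $@ \delta $@ from historical data as the average product of (mean-centered) samples sharing a user, and compare $@ 2\delta $@ to the overall variance. The data layout - one user id per sample - is an assumption of mine, not a prescription:

```python
import numpy as np
from itertools import combinations
from collections import defaultdict

def estimate_delta(values, user_ids):
    """Average product E[X_i X_j] over pairs of mean-centered samples
    that share a user - a crude estimate of delta."""
    values = np.asarray(values, dtype=float) - np.mean(values)
    by_user = defaultdict(list)
    for v, u in zip(values, user_ids):
        by_user[u].append(v)
    products = [a * b
                for samples in by_user.values()
                for a, b in combinations(samples, 2)]
    return np.mean(products) if products else 0.0

# Toy check on simulated data where sessions from one user share a propensity.
rng = np.random.default_rng(1)
users = np.repeat(np.arange(300), 4)                       # 300 users, 4 sessions each
x = rng.normal(np.repeat(rng.normal(0, 0.6, 300), 4), 1.0)
delta, sigma2 = estimate_delta(x, users), np.var(x, ddof=1)
print(f"2*delta = {2 * delta:.2f} vs sigma^2 = {sigma2:.2f}")
```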
No free samples
Statistics is full of pitfalls. For the practicing statistician, one of the biggest pitfalls we run into is our own cleverness. It's easy to play games with the basic quantities in our tests, and with our experiment design, in order to make $@ N $@ go up. But information isn't free, no matter how the data is counted.