Why common balance tests over-indicate imbalance in randomized evaluations, and what to do about it

Featuring research by Jason Kerwin, Olivier Sterck, and Nada Rostom.

In an August 2024 working paper, J-PAL affiliated professor Jason Kerwin (University of Washington), J-PAL invited researcher Olivier Sterck (University of Antwerp), and J-PAL MENA alumna Nada Rostom (University of Antwerp) present a comprehensive analysis of how standard balance tests used in randomized evaluations indicate imbalance too often, over-rejecting the null hypothesis that the treatment and comparison groups are, on average, balanced on observable characteristics. We spoke with the authors to learn more about their research process, what they found, and what they recommend researchers do to more effectively test for balance across treatment groups after randomization. 


Can you describe the general problem you’re trying to address? 
One of the most compelling arguments in favor of randomized evaluations is that random assignment to treatment and comparison groups ensures that individual characteristics (e.g., age, income, gender) are balanced, i.e., uncorrelated with treatment assignment. But imbalance across groups can still arise, either organically due to random chance or as a result of study complications such as improper randomization or treatment assignment violations by study participants. To confirm that groups are comparable, researchers often conduct balance tests. However, the methods we generally use too often indicate that balance is a problem when it is not.

The most common way that researchers conduct balance tests is by checking balance on a series of baseline variables individually (pairwise t-tests). A concern with this method is that there is no clear rule for how many imbalanced variables are too many. Researchers may also examine overall balance, which is a better approach for detecting whether the intended randomization procedure was somehow violated, by regressing the treatment indicator on the full list of baseline variables and reporting the p-value of an omnibus F-test of joint orthogonality, a single statistic that indicates whether there is overall imbalance. We demonstrate that both of these approaches can often indicate imbalance when groups are actually balanced.
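To make these two approaches concrete, here is a minimal sketch in Stata (ours, not the paper's replication code), assuming a binary treatment indicator T and baseline variables x1 through x5:

* pairwise approach: one t-test of means per baseline variable
foreach x of varlist x1 x2 x3 x4 x5 {
    ttest `x', by(T)
}

* omnibus approach: regress the treatment indicator on all baseline
* variables at once; the F-test of joint orthogonality (and its
* sampling-based p-value) is reported in the regression header
regress T x1 x2 x3 x4 x5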

What are the consequences of this problem? 
We don’t fully understand the scope of the consequences. Personally, we’ve seen this crop up in our own research. It’s common when running randomized evaluations to find that the number of unbalanced pairwise t-tests is about what you’d expect by chance, but that the omnibus test indicates significant overall imbalance. So you might infer that there is a balance problem when there actually isn’t. One thing we presume is happening is that some researchers simply don’t report the overall test statistic when it indicates imbalance but the pairwise tests look fine.

However, there are three larger consequences we think might result from this problem. First, researchers might be inclined to re-randomize after seeing this reported imbalance. The issue here is that most research teams don’t correctly account for this (often ad hoc) decision in their analysis, leading to problems with statistical inference. The second and bigger problem is that some people might be abandoning projects where they find overall imbalance even though there is no true imbalance. This unfortunately means that the time, effort, and resources invested in interesting and important research projects go to waste, and ultimately the field has less evidence than it should. 

The third consequence is that randomized evaluations that incorrectly indicate imbalance might be harder to publish or be seen as lower quality by clearinghouses or meta-analyses. In turn, that might affect the overall literature, possibly causing researchers and policymakers to put more weight on results from non-randomized studies that may be more subject to bias. 

How did this research come about? What sparked your initial interest in investigating balance test usage?
Having encountered test results that indicate overall imbalance a number of times when running randomized evaluations, I (Jason) began developing an intuition that there was something wrong with the tests that economists often use (joint orthogonality F-tests). This issue came up during another evaluation that I (Olivier) was conducting with Jason, so I spent some time running simulations to verify this intuition. I then brought on my graduate student, Nada, who expanded on those simulations, and this evolved into the paper that the three of us wrote together.

What would you suggest that researchers conducting balance tests do instead? 
We recommend assessing overall balance using omnibus F-tests of joint orthogonality with randomization inference p-values, which we show more accurately identify imbalances.

Standard tests, which over-indicate imbalance, use sampling-based inference, where the source of uncertainty is assumed to come from the fact that the study sample is randomly drawn from a larger population. In randomization inference, we assume that the uncertainty in our estimates arises from which exact units (e.g., people, schools) are assigned to treatment or control; if the study were repeated, the same sample would be drawn but different units would get assigned to treatment. This is a more intuitive source of uncertainty for balance tests, since experiments in economics often work with every unit in the sampling frame. Using randomization inference for balance tests compares the imbalances in our study to the possible imbalances we could have gotten from different random assignments. 
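To see what this means in practice, here is a minimal sketch (ours, not the authors' code) of a randomization inference p-value for the omnibus F-test, again assuming a binary treatment T and baseline variables x1 through x5; the ritest command shown below automates and generalizes this logic:

* observed F-statistic from regressing treatment on the covariates
set seed 12345
quietly regress T x1 x2 x3 x4 x5
scalar F_obs = e(F)

* re-randomize treatment 500 times, holding the covariates fixed,
* and count how often the permuted F-statistic is at least as
* large as the observed one
local reps = 500
local extreme = 0
gen double T_perm = .
forvalues r = 1/`reps' {
    * randomly permute the treatment indicator across units
    mata: st_store(., "T_perm", jumble(st_data(., "T")))
    quietly regress T_perm x1 x2 x3 x4 x5
    if (e(F) >= scalar(F_obs)) local ++extreme
}
display "randomization inference p-value = " `extreme'/`reps'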

Our preferred test can easily be implemented using the ritest package in Stata; the code is very straightforward. Future versions of our paper will include replication files and sample code to facilitate using this improved test.

Sample code:

* install the ritest package if needed: ssc install ritest

* define the list of baseline balance variables here
local list_x x1 x2 x3 x4 x5

* for independently randomized experiments: permute the treatment
* indicator T and compare the observed F-statistic to its
* randomization distribution
ritest T e(F), reps(500): reg T `list_x'

* for cluster-randomized experiments: permute treatment at the
* cluster level and cluster the standard errors to match
ritest T e(F), reps(500) cluster(cluster_id): reg T `list_x', cluster(cluster_id)
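With either command, the randomization inference p-value is the share of the 500 re-randomizations that produce an F-statistic at least as large as the one actually observed; a small p-value therefore signals more imbalance than different random assignments of the same sample would plausibly generate.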
