This resource is intended for researchers who are designing and assessing the feasibility of a randomized evaluation with an implementing partner. We outline key principles, provide guidance on identifying inputs for calculations, and walk through a process for incorporating power calculations into study design. We assume some background in statistics and a basic understanding of the purpose of power calculations. We provide links to additional resources and sample code for performing power calculations at the end of the document. Readers interested in a more comprehensive discussion of the intuition and process of conducting calculations as well as sample code may refer to our longer power calculations resource.
1) Do power calculations early
The benefit of doing any power calculation early on—even if rough—can be large.
2) The hardest part is choosing a reasonable minimum detectable effect (MDE).
There is no universal rule of thumb for determining a "good" MDE; it depends on what is meaningful to the parties involved, weighed against the opportunity cost of doing the research.
3) Power calculations are a rough guide, not an exact science.
Power calculations are most useful for assessing an order of magnitude rather than pinning down an exact sample size. Some degree of refinement, such as using covariates to soak up residual variance or redoing calculations on a more complete dataset, can be valuable. But the exact ex post values of the inputs to power will necessarily differ from ex ante estimates, so continued fine-tuning based on ex ante estimates quickly hits diminishing returns.
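For example, a quick sketch can show how much covariate adjustment matters: if baseline covariates explain a share R² of the outcome variance, the residual standard deviation shrinks and the required sample size falls roughly in proportion to (1 − R²). The Python sketch below uses hypothetical numbers and the standard two-sample normal approximation; it is illustrative, not a substitute for calculations with real data.

```python
# A minimal sketch, with hypothetical numbers, of how covariate adjustment
# affects required sample size: if baseline covariates explain a share r2 of
# outcome variance, the residual SD shrinks and the required sample falls
# roughly in proportion to (1 - r2). Uses the standard two-sample formula.

from scipy import stats

alpha, power = 0.05, 0.80
mde = 0.20      # minimum detectable effect, in standard deviation units
sd = 1.0        # outcome SD (normalized)

z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)

for r2 in [0.0, 0.3, 0.5]:
    residual_sd = sd * (1 - r2) ** 0.5
    n_per_arm = 2 * (z * residual_sd / mde) ** 2
    print(f"Covariates explaining R^2 = {r2:.1f}: about {n_per_arm:,.0f} per arm")
```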
4) The first stage has an outsized effect on power.
The strength of the first stage (taking into account factors like rates of take-up and compliance) is commonly under-appreciated in calculating power or required sample size. Overly optimistic assumptions for the first stage can lead to a severely underpowered second stage. For instance, to be powered to detect the same effect size with 25% take-up, we would need to offer treatment to 16 times more people and provide treatment to 4 times more people (assuming equal numbers of treatment and control) than if we had 100% take-up (McKenzie 2011).1
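As a rough check on those numbers, the sketch below uses the standard two-sample normal approximation with hypothetical values (an effect of 0.20 standard deviations among those actually treated, 80 percent power, 5 percent significance, no take-up in the control group, equal-sized arms). Under these assumptions it reproduces the pattern noted above: with 25% take-up, roughly 16 times as many people must be offered treatment, and roughly 4 times as many are actually treated, relative to 100% take-up.

```python
# A back-of-envelope sketch of the take-up point above, with hypothetical
# numbers: the intention-to-treat effect is diluted by the take-up rate, so
# the required sample size grows with the square of (1 / take-up). Assumes
# no take-up in the control group and equal-sized arms.

from scipy import stats

alpha, power = 0.05, 0.80
effect_on_treated = 0.20   # hypothetical effect on those actually treated (SD units)

z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)

for takeup in [1.00, 0.50, 0.25]:
    itt_effect = takeup * effect_on_treated     # intention-to-treat effect
    n_per_arm = 2 * (z / itt_effect) ** 2       # outcome variance normalized to 1
    print(f"Take-up {takeup:.0%}: offer to {n_per_arm:,.0f} per arm; "
          f"about {takeup * n_per_arm:,.0f} actually treated")
```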
Given those key principles, we now provide more details on each step of the process, including gathering the information needed, introducing the concept of statistical power to partners, running "back-of-envelope" calculations, deciding whether to proceed, refining calculations, and ultimately deciding whether to run a research study.
The implementing partner may be a key source of input for power calculations. Some inputs—such as the maximum sample size, a policy- or program-relevant MDE, and a feasible unit of observation—can only be found out by discussing these parameters with a partner. Rough estimates of other inputs—such as mean and variance of key outcomes, take-up rates and intra-cluster correlation—can be found in previous research or publicly-available data. If readily available, data or summary statistics from the partner’s current operations, or from the data source(s) that will be used in the final analysis, may be preferable.
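Once rough values for these inputs are in hand, a back-of-envelope calculation can be only a few lines of code. The Python sketch below uses entirely hypothetical placeholder values and the standard design-effect adjustment for clustering; it is meant to illustrate how the inputs fit together, not to stand in for a full calculation with partner data.

```python
# A minimal back-of-envelope sketch combining the kinds of inputs described
# above. Every number here is a hypothetical placeholder to be replaced with
# partner data or estimates from prior studies.

from scipy import stats

alpha, power = 0.05, 0.80
sd_outcome = 250.0     # SD of the outcome (e.g., monthly income), from partner data
mde = 25.0             # policy-relevant minimum detectable effect, agreed with partner
takeup = 0.60          # expected take-up among those offered the program
icc = 0.05             # intra-cluster correlation, from prior studies or public data
cluster_size = 20      # observations per cluster (e.g., per clinic or village)

z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)

# Dilute the detectable effect by take-up, then apply the usual two-sample formula.
n_per_arm = 2 * (z * sd_outcome / (mde * takeup)) ** 2

# Inflate for clustering using the standard design effect, 1 + (m - 1) * ICC.
n_per_arm *= 1 + (cluster_size - 1) * icc

print(f"Roughly {n_per_arm:,.0f} individuals per arm "
      f"({n_per_arm / cluster_size:,.0f} clusters per arm)")
```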
After running initial calculations, set aside time for a call or meeting with the research partner to discuss the calculation results and decide together whether it makes sense to proceed with the study.
If the study looks promising but initial calculations leave it unclear whether it will be sufficiently powered, research teams can iterate on the details of the study design with the research partner.3 During this stage, there are two key situations where refinements may be particularly helpful.
There are diminishing marginal returns to refining power calculations. If initial calculations were satisfactory, refinements may be minimal or may not be necessary at all. However, if the following points were not considered in initial calculations, they should be considered before making final design decisions:
After refining power calculations, you may jointly decide that the study is not feasible and discontinue discussions. Alternatively, if the research team is satisfied that the study would be adequately powered, and the research partner is satisfied that the chosen MDE is meaningful to them, you may jointly decide to take the leap and launch the study.4
If P(x = 1) = p, then var(x) = p × (1 − p).
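For a binary outcome or input (such as a take-up rate), this identity means a proportion alone pins down a rough variance to plug into a calculation. A minimal sketch with hypothetical numbers:

```python
# Illustration of the binary-variance identity above: for a binary outcome with
# control-group proportion p, the variance p * (1 - p) can stand in for the
# outcome variance in a rough sample-size calculation. Numbers are hypothetical.

from scipy import stats

alpha, power = 0.05, 0.80
p_control = 0.40                         # hypothetical control-group proportion
mde = 0.05                               # detect a 5 percentage-point change
variance = p_control * (1 - p_control)   # p * (1 - p)

z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
n_per_arm = 2 * variance * (z / mde) ** 2
print(f"About {n_per_arm:,.0f} per arm")
```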
Before beginning conversations with a partner about ingredients of power, implications of power calculations, or program changes necessary to achieve a certain level of power, take the time to explain the concept of statistical power. After establishing a common understanding around what power is and why it is important, talk through the ingredients in more detail. Talking points and resources to introduce non-technical partners to statistical power are presented at the end of this resource. The following points often come up in conversations about study design and are worth clarifying early:
Early discussions with partners about power can help set researchers up for a successful partnership later on—both because understanding the reasons for design decisions can help increase partners’ investment in the success of the study, and because a better understanding of power can enable partners to flag potential threats to design that may arise during implementation.
The power of an evaluation reflects how likely we are to detect meaningful changes in an outcome of interest brought about by a program. Most studies aim for power of 80 percent or higher. Power of 80 percent means that, if the program truly has an effect of the specified size, there is still a 20 percent chance the study will fail to detect it. The sample size needed to achieve sufficient power varies from case to case.
Say, for instance, we are studying the impact of a job training program on participants’ income. We set the MDE at 10%, which means we are powered to detect a 10% (or more) increase in participants’ income due to the program. Imagine that the actual effect of the program is lower than 10%—for example, a 7% increase in income. This might still provide a substantial improvement in quality of life for participants, may more than pay for the cost of delivering the training, and may be exciting for policymakers and funders. But because our MDE is 10%, our study may not be able to distinguish this 7% increase from zero (i.e., we may not find a statistically significant result). Instead, we may conclude that the program had no detectable effect.
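To see how this can play out numerically, the sketch below sizes a hypothetical study for 80 percent power against the 10% MDE and then computes its power against a true 7% effect. It assumes, purely for illustration, that a 10% income increase corresponds to 0.20 standard deviations of the outcome.

```python
# A sketch of the job-training example above, with hypothetical numbers:
# a study sized for 80% power against a 10% income increase has much lower
# power if the true effect is only 7%. Assumes a 10% increase equals 0.20 SD.

from scipy import stats

alpha = 0.05
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(0.80)

mde = 0.20          # 10% increase, expressed in SD units (assumption)
true_effect = 0.14  # a 7% increase, in the same units

# Sample size per arm chosen so the study has 80% power against the MDE.
n_per_arm = 2 * ((z_alpha + z_beta) / mde) ** 2

# Power against the smaller, true effect with that same sample size.
se = (2 / n_per_arm) ** 0.5
power_true = 1 - stats.norm.cdf(z_alpha - true_effect / se)

print(f"n per arm: {n_per_arm:.0f}; power against a 7% effect: {power_true:.0%}")
```

Under these assumptions, a study with 80 percent power against a 10% increase has only about 50 percent power against a 7% increase, so a meaningful effect could easily go undetected.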
All else equal, a larger sample allows us to detect smaller true effects. We need to agree on an acceptable effect size and ensure that you are aware of what we will and will not be able to learn from the results.
Budgetary, program, and timing constraints may create pressure to conduct an “underpowered” evaluation—but there are risks to doing so. An underpowered evaluation may consume substantial time and monetary resources while providing little useful information or, worse, may tarnish the reputation of a (potentially effective) program. When a study with insufficient power does not find a statistically significant result, we say that we found no evidence of an effect, but this does not mean that we found evidence of no effect. However, funders, media, and the general public can easily conflate “finding no evidence of an effect” with a “finding of no effect.” As a result, inconclusive findings can damage the reputation of an organization or program nearly as much as conclusive findings of no effect.
Last updated March 2021.
These resources are a collaborative effort. If you notice a bug or have a suggestion for additional content, please fill out this form.
Thanks to Maya Duru, Amy Finkelstein, Noreen Giga, Kenya Heard, Rohit Naimpally, and Anja Sautmann for their thoughtful contributions. Caroline Garau copy-edited this document. This work was made possible by support from the Alfred P. Sloan Foundation and Arnold Ventures.
Rachel Glennerster’s lecture, "Sampling and Sample Size" [video recording] provides an introduction to the concept of statistical power.
The book Running Randomized Evaluations: A Practical Guide (Glennerster and Takavarasha 2013) includes a detailed chapter on statistical power and its ingredients. The companion website, runningres.com, includes data and sample exercises for power.
EGAP’s "10 things you need to know about statistical power" is an accessible guide that provides both information on what power calculations are and why they are important, and practical guidance on implementing them (Coppock 2013).
The section "Power calculations: how big a sample size do I need?" in the World Bank’s e-book Impact Evaluation in Practice (Gertler, Martinez, Premand, Rawlings and Vermeersch 2010) provides an introduction to the concept, and works through examples of power calculations for different study designs.
The chapter “Sample Size and Power Calculations” from Data Analysis Using Regression and Multilevel/Hierarchical Models provides an in-depth technical overview of considerations related to power (Gelman and Hill 2006).
The blog post “Did you do your power calculations in standard deviations? Do them again…” provides further information about MDE in terms of standard deviations and in absolute terms (Ozler 2016).
The blog post “What is success, anyhow?” discusses considerations related to decision-relevant effect sizes in more detail (Goldstein 2011).
The blog post “Power Calculations 101: Dealing with Incomplete Take-up” provides information on incomplete take-up and power, as well as a detailed description of the effect of the first stage on power (McKenzie 2011).
The Stata blog has a helpful post on calculating power using Monte Carlo simulations (Huber 2019).