Lessons for assessing power and feasibility from studies of health care delivery

Contributors
Laura Ruiz, Jesse Gubb
Last updated
In Collaboration With
MIT Roybal Center

Introduction

This resource highlights key lessons for designing well-powered randomized evaluations, drawing on evidence from health care delivery studies funded and implemented by J-PAL North America.1 Determining whether a study is sufficiently powered to detect effects is an important decision to make at the outset of a project, because it helps determine whether the project is worth pursuing. Underpowered studies carry real risks: for example, the lack of a statistically significant result may be interpreted as evidence that a program is ineffective, when in fact the study was simply too small to detect an effect. Although designing well-powered studies is critical in all domains, health care delivery presents particular challenges. Direct effects on health outcomes are often difficult to measure, and the question of for whom an intervention is effective is especially important because differential effects can have a dramatic impact on the type of care people receive. Health care delivery settings also present opportunities, in that implementing partners bring complementary expertise that can help address these challenges.

 

Key takeaways

  • Communicate with partners to:
    • Choose outcomes that align with the program’s theory of change
    • Gather data for power calculations
    • Select meaningful minimum detectable effect sizes
    • Assess whether a study is worth pursuing
  • Anticipate challenges when measuring health outcomes:
    • Plan to collect primary data from sufficiently large samples
    • Choose health outcomes that can be impacted during the study period and consider prevalence in the sample
    • Consider whether results, particularly null results, will be informative
  • Think carefully about subgroup analysis and heterogeneous effects:
    • Set conservative expectations about subgroup comparisons, which may be underpowered or yield false positives
    • Calculate power for the smallest comparison
    • Stratify to improve power
    • Choose subgroups based on theory

 

The value of talking with partners early (and often)

Power calculations should be conducted as early as possible. Starting conversations about power with implementing partners early not only provides insight into the feasibility of the study, but also allows the partner to be involved in the research process and to offer necessary inputs like data. Estimating power helps researchers and implementing partners understand each other’s perspectives and build a common understanding of the possibilities and constraints of a research project. Implementing partners can be incredibly useful in making key design decisions that affect power.

Decide what outcomes to measure based on theory of change

An intervention may affect many aspects of health and its impact could be measured by many outcomes. Researchers will need to decide what outcomes to prioritize to ensure adequate power and a feasible study. Determining what outcomes to prioritize collaboratively with implementing partners helps ensure that outcomes are theory-driven and decision-relevant while maximizing statistical power. 

  • In Health Care Hotspotting, an evaluation of the Camden Coalition of Healthcare Providers’ care coordination program for high-cost, high-need patients, researchers and the Camden Coalition chose hospital readmission rates as their primary outcome. This choice was not straightforward, since the program may have had additional benefits in other domains, but readmissions aligned with the program’s theory of change, were measurable in administrative data, and ensured adequate power despite the limitations imposed by a relatively small sample. Although there was also interest in measuring whether the intervention could reduce hospital spending, this outcome was not chosen as a primary outcome2 because there was not sufficient power and because cost containment was less central to the program’s mission. 

Provide data and understand constraints

While it is possible to conduct power calculations without data by making simplifying assumptions, partners can provide invaluable information for initial power calculations. Data from partners, including historical program data and statistics, the maximum available sample size (based on their understanding of the target population), and take-up rates gleaned from existing program implementation, can be used to estimate power within the context of a study. Program data may be preferable to data drawn from national surveys or other sources because it comes from the context in which an evaluation is likely to operate. However, previous participants may differ from study participants, so researchers should always assess sensitivity to assumptions made in power calculations. (See this resource for more information on testing sensitivity to assumptions). 

  • A randomized evaluation of Geisinger Health’s Fresh Food Farmacy program for patients with diabetes used statistics from current clients as baseline data for the power calculations in its pre-analysis plan. This allowed researchers to use baseline values of the outcomes (HbA1c, weight, hospital visits) that were more likely to reflect the population participating in the study. It also allowed researchers to investigate how controlling for lagged outcomes would improve power. Without collaboration from implementing partners, these measures would be hard to approximate from other sources. (A sketch after this list illustrates how such inputs feed into a minimum detectable effect calculation.)
  • In Health Care Hotspotting, the Camden Coalition provided historical electronic medical record (EMR) data, which gave researchers inputs for the control mean and standard deviation necessary to calculate power for hospital readmissions. The actual study population, however, had twice the assumed readmission rate (60% instead of 30%), which led to a slight decrease in power relative to expectations. This experience emphasizes the importance of assessing sensitivity. 
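To make these inputs concrete, here is a minimal sketch (in Python) of how a control-group standard deviation and a baseline-to-follow-up correlation, the kinds of statistics a partner can supply, enter a minimum detectable effect calculation for a continuous outcome such as HbA1c. The numbers are illustrative placeholders, not values from the studies above.

```python
from scipy.stats import norm

def mde_continuous(sd, n_per_arm, alpha=0.05, power=0.80, r_lagged=0.0):
    """Minimum detectable effect for a two-arm comparison of means
    (normal approximation). Controlling for a lagged outcome with
    correlation r_lagged shrinks residual variance by (1 - r_lagged**2)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    residual_sd = sd * (1 - r_lagged ** 2) ** 0.5
    return z * residual_sd * (2 / n_per_arm) ** 0.5

# Illustrative inputs a partner might supply from program data (not the studies' values)
print(mde_continuous(sd=1.5, n_per_arm=250))                 # no baseline covariate
print(mde_continuous(sd=1.5, n_per_arm=250, r_lagged=0.7))   # controlling for baseline HbA1c
```

Controlling for a strongly predictive lagged outcome shrinks the residual variance, which is why partner data on baseline outcomes can meaningfully reduce the detectable effect.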

Determine reasonable effect sizes

Choosing reasonable effect sizes is a critical component of calculating power and often a challenge. In addition to consulting the academic literature, conversations with partners can help determine what level of precision will be decision-relevant. The minimum detectable effect (MDE) required for an evaluation should be determined in its particular context, where the partner’s own cost-benefit assessment or a more standardized benchmark can guide the discussion. The risks of interpreting underpowered evaluations often fall on the implementing partner, since lack of power increases the likelihood that an intervention is wrongly seen as ineffective, so choosing a minimum detectable effect should be a joint decision between the research team and the partner. 

  • In Health Care Hotspotting, researchers were powered to detect a 9.6 percentage point effect on the primary outcome of hospital readmissions. While smaller effects were potentially clinically meaningful, the research team and their partners determined that this detectable effect size would still be informative because it would rule out the much larger 15-45% reductions in readmissions found in previous evaluations of similar programs delivered to a different population (Finkelstein et al. 2020).
  • In Fresh Food Farmacy, researchers produced target minimum detectable effects (see Table 2) in collaboration with their implementing partners and demonstrated that the study sample size would allow them to detect effects below these thresholds.
  • Implementing partner judgment is critical when deciding to pursue an evaluation that can only measure large effects. Researchers proceeded with an evaluation of legal assistance for preventing evictions, which compared an intensive legal representation program to a limited, much less expensive, and more scalable program, based on an understanding that the full representation program would only be valuable if it could produce larger effects than the limited program. Estimating potentially small effects with precision was not of interest, given the cost of the program and the immense, unmet need for services.

The section "Calculate minimum detectable effects and consider prevalence of relevant conditions" below covers additional considerations, such as the prevalence of a condition in the study sample, that also factor into choosing reasonable effect sizes.

Assess feasibility and make contingency plans

Early conversations about power calculations can help set clear expectations on parameters for study feasibility and define protocols for addressing future implementation challenges. Sometimes early conversations lead to a decision not to pursue a study. Talking points for non-technical conversations about power are included in this resource. 

  • During a long-term, large-scale evaluation of the Nurse Family Partnership in South Carolina, early power conversations proved to be helpful when making decisions in response to new constraints. Initially the study had hoped to enroll 6,000 women over a period of four years, but recruitment for the study was cut short due to Covid-19. Instead, the study enrolled 5,655 women, 94% of what was originally targeted. Since the initial power calculations anticipated a 95% participation rate and used conservative assumptions, the decision to stop enrollment due to the pandemic was made without concern that it would jeopardize power for the evaluation. Given the long timeline of the study and turnover in personnel, it was important to revisit conversations about power and involve everyone in the decision to halt enrollment. 
  • A recent study measuring the impact of online social support on nurse burnout was halted due to lack of power to measure outcomes. In the proposed study design, the main outcome was staff turnover. Given a sample of approximately 25,000 nurses, power calculations estimated a minimum detectable effect of 1.5 percentage points, which represents an 8 to 9 percent reduction in staff turnover. When planning for implementation, however, randomization was only possible at the nursing unit level. Given a new sample of approximately 800 nursing units, power calculations estimated a minimum detectable effect of 7 to 10 percentage points, roughly a 50 percent reduction in turnover. These new power calculations also assumed complete take-up, which did not seem feasible to the research team. With the partner’s support, other outcomes of interest were explored, but these had low rates of prevalence in administrative data. Early conversations about study power surfaced these constraints and prevented embarking on an infeasible research study. (The sketch after this list illustrates why unit-level randomization inflates the minimum detectable effect.)
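The sketch below illustrates, under simplified assumptions, why randomizing at the unit level inflates the minimum detectable effect. The baseline turnover rate and intracluster correlation (ICC) are illustrative guesses, not the study’s inputs; the study’s own unit-level calculations, using partner data and its own assumptions, implied the much larger 7 to 10 percentage point MDE described above.

```python
from scipy.stats import norm

def mde_proportion(p0, n_per_arm, alpha=0.05, power=0.80, deff=1.0):
    """Minimum detectable effect for comparing two proportions (normal
    approximation). deff is the design effect, which inflates variance
    when randomization happens at the cluster (nursing unit) level."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * (deff * 2 * p0 * (1 - p0) / n_per_arm) ** 0.5

# Individual-level randomization: ~25,000 nurses, baseline turnover assumed at 17%
print(mde_proportion(p0=0.17, n_per_arm=12_500))              # ~0.013, i.e., about 1.5 pp

# Cluster randomization: ~800 units averaging ~31 nurses each; ICC of 0.05 is illustrative
cluster_size, icc = 31, 0.05
deff = 1 + (cluster_size - 1) * icc                            # design effect of 2.5
print(mde_proportion(p0=0.17, n_per_arm=12_500, deff=deff))   # MDE roughly doubles
```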

Challenges of measuring health outcomes

Data collection and required sample sizes are two fundamental challenges to studying impacts on health outcomes. Health outcomes other than mortality are rarely available in administrative data, or, if they are, may be differentially measured for treatment and control groups.3 For example, consider researchers who wish to measure blood pressure but rely on measurements taken at medical appointments. Only those who choose to go to the doctor appear in the data. If the intervention increases the number of appointments attended, the treatment group will be better represented in the data than the control group, biasing any comparison of measured outcomes. Measuring health outcomes therefore usually requires researchers to conduct costly primary data collection to avoid bias resulting from differential coverage in administrative data.4 
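A stylized simulation of this mechanism, with made-up numbers: blood pressure has no true treatment effect, but treatment raises the chance of attending an appointment and patients with higher blood pressure are more likely to attend, so a naive comparison of measured patients shows a spurious difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
treat = rng.integers(0, 2, n)

# True systolic blood pressure: identical in both arms (no treatment effect)
bp = rng.normal(140, 15, n)

# Appointment attendance rises with treatment assignment and with blood pressure
p_attend = 0.3 + 0.2 * treat + 0.002 * (bp - 140)
attend = rng.random(n) < p_attend

# Naive comparison using only patients who appear in the (appointment-based) data
naive_diff = bp[attend & (treat == 1)].mean() - bp[attend & (treat == 0)].mean()
print(naive_diff)  # negative (treatment looks protective) even though the true effect is zero
```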

Required sample sizes also pose challenges to research teams studying health care interventions. Because many factors influence health, health care delivery interventions may be presumed to have small effects. This requires large sample sizes or lengthy follow-up periods. Additionally, health impacts are likely to be downstream of health care delivery or behavioral interventions that may have limited take up, further reducing power.5 As a result of these challenges, many evaluations of health care delivery interventions measure inputs into health — for example, receiving preventive services like a flu shot (Alsan, Garrick, and Graziani 2019) or medication adherence — instead of, or in addition to, measuring downstream health outcomes. These health inputs can often be measured in administrative data and directly attributed to the intervention.

Despite these challenges, researchers may wish to demonstrate actual health impacts of an intervention when feasible. This section introduces key steps in the process of designing a health outcome measurement strategy illustrated by two cases, the Oregon Health Insurance Experiment (OHIE) and a study examining a workplace wellness program. These cases highlight thoughtful approaches and challenges to estimating precise effects and include examples for interpreting null results.

Choose measurable outcomes plausibly affected by the intervention in the time frame allotted

Not all aspects of health will be expected to change as a result of a particular intervention, especially within a study timeframe. For instance, preventing major cardiovascular events like heart attacks may be the goal of an intervention, but it is more feasible to measure blood pressure and cholesterol, both of which are risk factors for heart attacks, than to measure heart attacks, which might not occur for years or at all. What to measure should be determined by a program’s theory of change, an understanding of the study population, and the length of follow up. 

  • In the OHIE, researchers measured the effects of Medicaid6 on hypertension, high cholesterol, diabetes, and depression. These outcomes were chosen because they were “important contributors to morbidity and mortality, feasible to measure, prevalent in the low-income population of the study, and plausibly modifiable by effective treatment within two years” (Baicker et al, 2013). For insurance to lead to observable improvements in these outcomes, receiving Medicaid would need to increase health care utilization, lead to accurate diagnoses, and generate effective treatment plans that are followed by patients. Any “slippage” in this theory of change, such as not taking prescribed medication, would limit the ability to observe effects. Diabetes, for example, was not as prevalent as expected in the sample, reducing power relative to initial estimates.
  • In an evaluation of a workplace wellness program, researchers measured effects of a wellness program on cholesterol, blood glucose, blood pressure, and BMI, because they were all elements of health plausibly improved by the wellness program within the study period. 

Make sample size decisions and a plan to collect data

Measuring health outcomes will likely require researchers to collect data themselves, in person, with the help of medical professionals. In this situation, data collection logistics and costs will limit sample size, so the costs of data collection must be balanced against the sample needed to draw informative conclusions. In both case studies, researchers restricted their data collection efforts in terms of geography and sample size. In addition, clinical measurements were kept relatively simple, involving survey questions, blood tests7, and easily portable instruments. Both strategies addressed cost and logistical constraints.

  • In the OHIE, clinical measures were only collected in the Portland area, with 20,745 people receiving health screenings, despite the intervention being statewide with a study sample size of over 80,000.
  • In workplace wellness, researchers collected clinical data from all 20 treatment sites but only 20 of the available 140 control sites in the study.

Calculate minimum detectable effects and consider prevalence of relevant conditions

In the health care context, researchers should also consider the prevalence of individuals who are at risk for a health outcome and for whom the intervention could plausibly address that outcome. In other words, researchers must understand the number of potential compliers with that aspect of treatment, where a complier receives that element of the intervention only when assigned to treatment, in contrast to other participants who would always or never receive it. 

Interventions like health insurance or workplace wellness programs are broad and multifaceted, but measurable outcomes may be relevant only for small subsets of the population. Consider, for example, blood sugar and diabetes. We should only expect HbA1c (a blood sugar measure) to change as a result of the intervention for those who have diabetes or are at risk for it. The theory of change for reducing HbA1c through insurance requires going to a provider, being assessed, and being prescribed either a behavioral modification or a medication. If most people do not have high blood sugar and are therefore never told by their provider to reduce it, we should not expect the intervention to affect HbA1c. If this complier group is small relative to the study sample, the intervention will be poorly targeted, and this will reduce power in the same way that low take-up or other forms of noncompliance do. 

Suppose we have a sample of 1,000 people, 10 percent of whom have high HbA1c and another 10 percent of whom have near-high levels that would prompt their provider to offer treatment. Initial power calculations with a control mean of 10 percent produce an MDE of about 6 percentage points (for a binary high/low outcome). However, once we account for the fact that only 20 percent of the sample can be affected, detecting this overall effect requires a direct effect for compliers of roughly 30 percentage points (the 6 percentage point MDE divided by 0.20). This is substantially larger and can render a seemingly well-powered study infeasible. 
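Here is a minimal sketch of the arithmetic above, under the same assumptions (1,000 people split evenly across arms, a 10 percent control mean for the binary high-HbA1c outcome, and 20 percent of the sample able to respond to treatment):

```python
from scipy.stats import norm

def mde_proportion(p0, n_per_arm, alpha=0.05, power=0.80):
    """Minimum detectable effect for comparing two proportions (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * (2 * p0 * (1 - p0) / n_per_arm) ** 0.5

mde_overall = mde_proportion(p0=0.10, n_per_arm=500)   # ~0.05-0.06: about 6 percentage points
share_affected = 0.20                                   # only 20% of the sample can respond
mde_among_compliers = mde_overall / share_affected      # ~0.27: roughly 30 percentage points
print(mde_overall, mde_among_compliers)
```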

In cases where data from the study sample are not yet available, broad-based survey or administrative data can be used to estimate population-level (i.e., control group) outcome means and standard deviations and the prevalence of relevant conditions, as well as to investigate whether other variables are useful predictors of chosen outcomes that could be included in the analysis to improve precision. As always, the feasibility of a study should be assessed under a wide range of possible scenarios. 
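One simple way to stress-test feasibility is to repeat the calculation over a grid of plausible values rather than a single best guess. The sketch below varies the assumed control mean and the share of the sample that can respond; all values are illustrative.

```python
from scipy.stats import norm

def mde_proportion(p0, n_per_arm, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * (2 * p0 * (1 - p0) / n_per_arm) ** 0.5

# Sensitivity of the complier-level MDE to the assumed control mean and complier share
for p0 in (0.05, 0.10, 0.20):
    for share in (0.10, 0.20, 0.50):
        mde = mde_proportion(p0, n_per_arm=500) / share
        print(f"control mean {p0:.2f} | complier share {share:.2f} | complier MDE {mde:.2f}")
```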

  • In the OHIE, researchers noted that they overestimated the prevalence of certain conditions in their sample. Ex post, they found that only 5.1% had diabetes and 16.3% had high blood pressure. This limited the effective sample size in which one might expect an effect to be observed, effectively reducing take-up and statistical power relative to expectations. One solution when relevant sample sizes are smaller than expected is to restrict analysis to subgroups expected ex ante to have larger effects, such as those with a preexisting condition, for whom the intervention may be better targeted. 
  • In the workplace wellness study, the researchers note that they used data from the National Health and Nutrition Examination Survey, weighted to the demographics of the study population, to estimate control group statistics. These estimates proved quite accurate, resulting in the expected power to detect effects. 

In both cases, researchers proceeded with analysis while acknowledging the limits of their ability to detect effects.

Ensure a high response rate

Response rates have to be factored into the anticipated sample size. Because sample size is a critical input for determining sufficient power, designing data collection methods that prevent attrition is an important strategy for maintaining sample size throughout the duration of a study. Concerns about attrition grow when data collection requires in-person contact and medical procedures.8

  • To ensure a high response rate in Oregon, researchers devoted significant resources to identifying and collecting data from respondents. This included several forms of initial contact, a tracking team devoted to locating respondents with out-of-date contact information, flexible data collection (interviews were done via several methods and health screenings could be performed in a clinic or at home), intensive follow up dedicated to a random subset of non-respondents, and significant compensation for participation ($30 for the interview, an additional $20 for the dried blood spot collection, and $25 for travel if the interview was at a clinic site). These efforts resulted in an effective response rate of 73%. Dried blood spots (a minimally invasive method of collecting biometric data) and short forms of typical diagnostic questionnaires were used to reduce the burden on respondents. These methods are detailed in the supplement to Baicker et al. 2013.
  • In workplace wellness, health surveys and biometric data collection were done on site at workplaces and employees received a $50 gift card for participation. Some employees received an additional $150. Participation rates in the wellness program among current employees were just above 40%. However, participation in data collection was much lower — about 18% — primarily because less than half of participants ever employed during the study period were employed during data collection periods. Participants had to be employed at that point in the study, present on those particular days, and be willing to participate in order to be included in data collection. 

Understand your results and what you can and cannot rule out 

Both the OHIE and workplace wellness analyses produced null results on health outcomes, but not all null results are created equal.

  • The OHIE results differed across outcome measures. There were significant improvements only in the rate of depression, which was also the most prevalent of the four conditions examined. There were no detectable effects on diabetes or blood pressure, but what could be concluded in each of these domains differed. Medicaid’s effect on diabetes was imprecise, with a wide confidence interval that included both no effect and large positive effects, including the effect one might plausibly expect: the results could not rule out the effect implied by estimating how many people had diabetes, saw a doctor as a result of getting insurance, received medication, and how effective that medication is at reducing HbA1c (based on clinical trial results). This is not strong evidence of no effect. In contrast, Medicaid’s null effect on blood pressure could rule out much larger estimates from prior quasi-experimental work, because the confidence interval did not include them (Baicker et al. 2013, The Oregon Experiment: Effects of Medicaid on Clinical Outcomes). 
  • The effects on health shown in Workplace Wellness were all nearly zero. Given that the estimates were null across a variety of outcome measures, and that another randomized evaluation of a large-scale workplace wellness program found similar results, it is reasonable to conclude that the workplace wellness program did not meaningfully affect health within the study period. 

 

Does health insurance not affect health? 

New research has since demonstrated health impacts of insurance, but the small effect sizes emphasize why large sample sizes are needed. The IRS partnered with researchers to randomly send letters to taxpayers who had paid a tax penalty for lacking health insurance coverage, encouraging them to enroll. They sent letters to 3.9 million out of 4.5 million potential recipients. The letters were effective at increasing health insurance coverage and at reducing mortality, but the mortality effects were small: among middle-aged adults (45-64 years old), there was a 0.06 percentage point decline in mortality, about one fewer death for every 1,600 letters. 

 

Being powered for subgroup analysis and heterogeneous effects

Policymakers, implementing partners, and researchers may be interested in for whom a program works best, not only in average treatment effects for a study population. However, subgroup analysis is often underpowered and increases the risk of false positive results due to the larger number of hypotheses being tested. Care must be taken to balance the desire for subgroup analysis and the need for sufficient power.

Talk to partners and set expectations about subgroups

Set conservative expectations with partners before analysis begins about which subgroup analyses may be feasible and what can be learned from them. Underpowered comparisons and large numbers of comparisons should be treated with caution, as the likelihood of Type I errors (false positives) will be high. False positives arise both from multiple hypothesis testing and because underpowered estimates that do reach statistical significance tend to overestimate the true effect.
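As a quick illustration of why many comparisons raise the risk of false positives, the calculation below (assuming independent tests for simplicity) shows the chance of at least one spurious “significant” result when no subgroup has a true effect, and the per-test threshold a Bonferroni-style correction would impose.

```python
alpha = 0.05          # significance level for each test
n_subgroups = 10      # number of subgroup comparisons

# Probability of at least one false positive if no subgroup has a true effect
familywise_error = 1 - (1 - alpha) ** n_subgroups
print(familywise_error)        # ~0.40 with 10 independent tests

# A Bonferroni-style correction tests each subgroup at alpha / n_subgroups instead
print(alpha / n_subgroups)     # 0.005 per test
```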

Conduct power calculations for the smallest relevant comparison

If a study is well-powered for an average treatment effect, it may not be powered to detect effects within subgroups. Power calculations should be done using the sample size of the smallest relevant subgroup. Examining heterogeneous treatment effects (i.e., determining whether effects within subgroups differ from each other) requires an even larger sample to be adequately powered.9
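A minimal sketch of how the detectable effect grows as the relevant comparison shrinks, using the same normal-approximation formula as in the earlier sketches and an illustrative outcome standard deviation:

```python
from scipy.stats import norm

def mde_continuous(sd, n_per_arm, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sd * (2 / n_per_arm) ** 0.5

sd, n_total = 1.0, 2_000                        # illustrative values
mde_full = mde_continuous(sd, n_total // 2)     # average treatment effect, full sample
mde_half = mde_continuous(sd, n_total // 4)     # effect within a subgroup half the size
print(mde_full, mde_half)                       # the subgroup MDE is ~1.4x larger

# Detecting a *difference* between two such subgroup effects is harder still: the
# standard error of the interaction is sqrt(2) times the subgroup standard error,
# and interactions are often expected to be smaller than main effects, which is
# where the "16 times the sample size" rule of thumb comes from.
```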

Stratify to improve power

Stratifying on variables that are strong predictors of the outcome can improve power by guaranteeing that these variables are balanced between treatment and control groups. Stratifying by subgroups may improve power for subgroup analyses. However, stratifying on variables that are not highly correlated with the outcome may reduce statistical power by reducing the degrees of freedom in the analysis.10 
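Below is a minimal sketch of stratified (blocked) random assignment, assigning treatment separately within each stratum so that the arms are balanced on a prognostic baseline variable; the variable names are hypothetical.

```python
import numpy as np

def stratified_assignment(strata, treat_share=0.5, seed=0):
    """Randomly assign treatment within each stratum so the arms are balanced
    on the stratification variable."""
    rng = np.random.default_rng(seed)
    strata = np.asarray(strata)
    assignment = np.zeros(len(strata), dtype=int)
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        rng.shuffle(idx)
        n_treat = int(round(treat_share * len(idx)))
        assignment[idx[:n_treat]] = 1
    return assignment

# Example: stratify on a baseline diagnosis indicator (1 = existing diagnosis)
baseline_diagnosis = np.random.default_rng(1).integers(0, 2, 1_000)
treatment = stratified_assignment(baseline_diagnosis)
```

Including stratum fixed effects in the subsequent analysis then captures the variance explained by the stratification variable.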

Ground subgroups in theory

Choose subgroups based on the theory of change of the program being evaluated. These might be groups where you expect to find different (e.g., larger or smaller) effects or where results are particularly important. Prespecifying and limiting the number of subgroups that will be considered guards against specification searching (p-hacking) and keeps the analysis focused on theoretically motivated subgroups. It is also helpful to pre-specify any adjustments for multiple hypothesis testing. 

When examining differential effects among subgroups is of interest but it is not possible to pre-specify relevant groups, machine learning techniques may allow researchers to flexibly identify data-driven subgroups. This approach still requires researchers to assign substantive meaning to the resulting subgroups, which may not always be apparent.11 

  • In the OHIE, researchers prespecified subgroups for which effects on clinical outcomes might have been stronger: older individuals, those with an existing diagnosis of hypertension, high cholesterol, or diabetes, and those with a heart attack or congestive heart failure. Even so, the researchers found no significant improvements in these particular dimensions of physical health over the study period.
  • In an Evaluation of Voluntary Counseling and Testing (VCT) in Malawi that explored the effect of a home based HIV testing and counseling intervention on risky sexual behaviors and schooling investments, researchers identified several relevant subgroups for which effects might differ: HIV-positive status, HIV-negative status, HIV-positive status with no prior belief of HIV infection, and HIV-negative status with prior belief of HIV infection. Though the program had no overall effect on risky sexual behaviors or test scores, there were significant effects within groups where test results corrected prior beliefs. Those who had an HIV-positive status but did not have a prior belief of HIV infection engaged in more dangerous sexual behaviors, and those who were surprised by a negative test experienced a significant improvement in achievement test scores (Baird et al 2014).
  • An evaluation that used discounts and counseling strategies to incentivize the use of long-term contraceptives in Cameroon used causal forests to identify subgroups that were more likely to be persuaded by price discounts. Causal forests are a machine learning technique for finding an optimal way to split a sample into groups. Using this approach, the researchers found that clients strongly affected by discounts were younger, more likely to be students, and more highly educated. Overall, discounts increased the use of contraceptives by 50%, with larger effects for adolescents (Athey et al. 2021). Researchers pre-specified this approach without having to identify the actual subgroups in advance.
     

Acknowledgments: Thanks to Amy Finkelstein, Berk Özler, Jacob Goldin, James Greiner, Joseph J Doyle, Katherine Baicker, Maggie McConnell, Marcella Alsan, Rebecca Dizon-Ross, Zirui Song and all the researchers included in this resource for their thoughtful contributions. Thanks to Jesse Gubb and Laura Ruiz-Gaona for their insightful edits and guidance, as well as to Amanda Buechele who copy-edited this document. Creation of this resource was supported by the National Institute On Aging of the National Institutes of Health under Award Number P30AG064190. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
 

Footnotes

1. For more information on statistical power and how to perform power calculations, see the Power Calculations resource; for additional technical background and sample code, see the Quick Guide to Power Calculations; for practical tips on conducting power calculations at the start of a project and additional intuition behind the components of a well-powered study, see Six Rules of Thumb for Determining Sample Size and Statistical Power.

2. The practice of designating a primary outcome is common in health care research and is described in the checklist for publishing in medical journals.

3. The challenge of differential coverage in administrative data is discussed at length in the resource on using administrative data for randomized evaluations.

4. The exception would be events, like births or deaths, that are guaranteed to appear in administrative data unaffected by post-treatment selection bias. In an evaluation of the Nurse Family Partnership in South Carolina, researchers were able to measure adverse birth outcomes like preterm birth and low birth weight from vital statistics records.

5. Program take-up has an outsized effect on power. A study with 50% take-up requires four times the sample size to be equally powered as one with 100% take-up, because the required sample size is inversely proportional to the square of the take-up rate. See Power Calculations 101: Dealing with Incomplete Take-up (McKenzie 2011) for a more complete illustration of the effect of the first stage of an intervention on power.

6. Health insurance that the government provides, either for free or at a very low cost, to qualifying low-income individuals.

7. In the OHIE, five blood spots were collected and then dried for analysis.

8. More information about how to increase response rates can be found in the resource on increasing response rates of mail surveys and mailings.

9. More on the algebraic explanation can be found in the post "You need 16 times the sample size to estimate an interaction than to estimate a main effect".

10. One would typically include stratum fixed effects in the analysis of stratified data. The potential loss of power is of greater concern in small samples, where the number of parameters is large relative to the sample size.

11. More on using machine learning techniques for pre-specifying subgroups can be found in the blog post "What's new in the analysis of heterogeneous treatment effects?"
