J-PAL promotes the publication of de-identified data from randomized evaluations. This resource provides guidance on doing so in the form of a checklist for preparing data for submission. It also includes sample informed consent language and other considerations during project planning and before publication, and provides a description of trusted digital repositories that can host data. This guide is intended to be read alongside the accompanying data de-identification resource.
Over the last decade, the number of funders, journals, and research organizations that have adopted data-sharing policies has increased considerably. When the American Economic Association adopted its first policy in 2005, it was among the first publishers of academic journals in the social sciences to require the publication of data alongside the research paper. Today, data publication is required by most top academic journals in economics and the social sciences. Similarly, foundations and government institutions, such as the Bill and Melinda Gates Foundation, the National Science Foundation, and the National Institutes of Health, have data publication policies. J-PAL, as both a funder and an organization that conducts research, adopted a data publication policy in 2015 that applies to all research projects that we fund or implement.
This guide and the accompanying guide on de-identification are intended to help research teams think about the steps involved in publishing research data. They draw on J-PAL’s experience of publishing research data from randomized evaluations in the social sciences for more than a decade.
Increasing the availability of research data benefits researchers, policy partners who supported the studies, students who learn from using the data, and, importantly, the people from whom the data was collected. Data sharing can provide many benefits and opportunities to the research community, including:
We are still in the early stages of data availability in social science research. Gertler, Galiani, and Romero (2018) found that only a small minority of empirical papers published in the top nine journals in economics in May 2016 contained all the necessary materials (raw data, estimation data, cleaning code, and analysis/estimation code) to successfully reproduce the results of the original study. J-PAL’s goal is to make research data from randomized evaluations widely available and accessible.
Before publishing your data, you should ensure that you have the legal, regulatory, and ethical authority to do so. Some questions to ask at the outset of the data publication process include:
Addressing these questions will help determine what data to publish and where.
All studies that collect survey data from individual subjects should go through an informed consent procedure; see more in the IRB resource and Define intake and consent process. The informed consent procedure should include language that allows for the publication of de-identified data. As with all parts of the consent procedure, this language should be concise and clear while avoiding jargon or technical terms that study participants may not understand. Before collecting data, researchers should review their consent process regarding data sharing.
Sample consent form language that could be used (subject to approval by the IRB of record):
No one outside the survey team can directly connect your personal details, like your name, address, and cell phone number, with anything you say in this survey. Your survey responses and personal details will be stored in secure international computer storage. Your personal details will be encrypted and password-protected to prevent unauthorized access. Before we share the study with anyone other than the research team, your personal details will be separated from your survey responses. We do this to prevent anyone outside the research team from linking your survey responses back to you.
The Inter-university Consortium for Political and Social Research (ICPSR) has developed a set of recommendations for researchers to think about when drafting informed consent clauses. Remember that informed consent and data-sharing disclosure requirements may vary by host institution and legal jurisdiction in which the data is collected; if in doubt, consult with the IRB of record.
In addition to the considerations listed above, publishing and sharing administrative data requires permission from the data provider, who has the final say over what data can be published. Data that is provided by a third party often falls under additional regulatory authorities.
Much of what a research team can do with administrative data is controlled by the data use agreement (DUA) signed with the data provider at the beginning of the study. It is thus important to have a conversation about data-sharing with the data provider at the beginning of the study so that plans for publishing data can be added to the DUA at the outset. If the DUA does not clearly regulate data publication, then it is critical to consult with the data provider, and any research partners or implementing organizations, to determine what data can be published.
Data providers might have concerns about sharing their data because such data (for example, private health information, personal financial information, or records of criminal activity) is often governed by strict regulatory regimes. They may be concerned about the privacy of participants or other potential effects if certain information is released. For example, businesses may be concerned about competitors using the data to gain commercial advantage, while government agencies may be concerned about political sensitivities, such as the release of data on spending practices or total case numbers.
It is important to discuss with data providers that data can be made available in a variety of formats. Restricted data archives, discussed in further detail below, provide locations where more sensitive and regulated data could be made available.
If a data provider has agreed to make data from the study publicly available, additional considerations include:
Published research data should be stored in a trusted digital repository to ensure long-term access to the files and documentation. A trusted digital repository is defined as a data repository “whose mission is to provide reliable, long-term access to managed digital resources to its designated community, now and in the future” (RLG 2002).
A trusted digital repository is committed to maintaining the repository in perpetuity, ensures minimal to zero data loss or data rot, and allows for version control. For example, the Harvard Dataverse allows users to view exactly what has changed between the originally published version and any subsequent published version. Users can also access each of these versions and see the changes made to that particular dataset. A trusted repository also assigns each dataset and its associated program code a unique identifier (e.g., a “digital object identifier” (DOI)) to enable citation, cataloging, and search. This identifier is designed to persist even if URLs, or the website itself, change. Most digital repositories capture metadata about the published research materials. This allows future researchers to quickly explore and understand the data without having to download the data and run the code.
Below are some common trusted digital repositories used by researchers in the social sciences, along with an idea of the type of data you can find in them:
While posting data on personal websites (even if hosted by the university) technically makes it public, the lifespan of personal websites is much shorter than that of a trusted digital repository. Furthermore, data posted on a personal website is more difficult to search for, cite, and explore. New tools like Google Dataset Search systematically harvest metadata from trusted repositories, making datasets hosted on them much more findable and usable. In addition, researchers can still make their data available on their personal websites by using a permanent identifier (DOI) to link to where the data is stored.
Some repositories have set up archives for particularly sensitive data. In addition to openICPSR, a free public access repository, ICPSR has also developed more secure repository options that range from secure downloads to physical on-site storage. More information on ICPSR’s restricted repositories can be found on their website.
This section provides a checklist of steps and best practices for researchers who wish to publish their research data. The checklist draws from the World Bank’s Microdata Catalog submission checklist, and further information can also be found in ICPSR’s Guide to Social Science Data Preparation and Archiving.
After ensuring you have the authority to publish your data, you can prepare the data for publication. The process has two purposes: first, to ensure the dataset is clean and comprehensible to new users, and second, to make sure that the privacy of research subjects is protected.
The best way to publish a set of files related to a research project is to save them in a clear file and folder structure and then compress the entire set of folders into a single archive, such as a zip file. The folder structure might look something like the following:
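As a minimal sketch of this packaging step in Python, the snippet below creates an illustrative folder skeleton and compresses it with the standard library's `shutil.make_archive`. The folder names (`data`, `code`, `documentation`, `output`), the readme placeholder text, and the project name are assumptions made for illustration, not a layout prescribed by this guide.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical folder names; adapt them to your own project.
SUBFOLDERS = ["data", "code", "documentation", "output"]


def build_package(root: Path, project_name: str) -> Path:
    """Create the folder skeleton under `root`, zip it, and return the zip path."""
    project_dir = root / project_name
    for name in SUBFOLDERS:
        (project_dir / name).mkdir(parents=True, exist_ok=True)

    # Placeholder readme at the top level of the package (hypothetical content).
    readme = project_dir / "README.txt"
    readme.write_text("Describe the files in this package here.\n", encoding="utf-8")

    # make_archive appends ".zip" to base_name and returns the archive path.
    archive = shutil.make_archive(
        base_name=str(root / project_name),
        format="zip",
        root_dir=str(root),
        base_dir=project_name,
    )
    return Path(archive)


if __name__ == "__main__":
    # Build the package in a temporary directory so the demo leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = build_package(Path(tmp), "replication_package")
        print(zip_path.name)
```

Passing `base_dir` keeps the project folder as the top-level entry inside the archive, so users who unzip it get a single, self-contained directory.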
Steps to prepare each of these files are described next. Recognizing that preparing data for publication can be a time-consuming and involved process, we differentiate between steps that are absolutely essential, important (steps that we strongly recommend, as they facilitate re-use of the data), and suggested (additional steps that facilitate data re-use further but are less essential).
Important: Ensure there is no overlap between datasets. For example, if you have panel data on daily electricity consumption and a survey that was conducted only once, do not publish both a merged dataset (the daily electricity consumption merged with the survey data) and the two datasets separately. Publish either just the merged file or the two separate datasets. If one dataset is much larger than the other, it is best to keep them separate. Avoiding overlap reduces the risk of inconsistencies and saves computer storage and memory when working with the data.
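One rough way to spot such overlap before publishing is to compare the variable lists of the files you plan to release. The sketch below is only an illustration: the file names, variable names, and the choice of `household_id` and `date` as identifier columns are all hypothetical, and a shared column name is a prompt for manual review rather than proof of duplication.

```python
from itertools import combinations


def overlapping_columns(
    datasets: dict[str, list[str]], id_columns: set[str]
) -> dict[tuple[str, str], set[str]]:
    """Return the non-identifier columns shared by each pair of datasets."""
    overlaps = {}
    for (name_a, cols_a), (name_b, cols_b) in combinations(datasets.items(), 2):
        # Identifier columns are expected to repeat across files, so ignore them.
        shared = (set(cols_a) & set(cols_b)) - id_columns
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical file and variable names, for illustration only.
datasets = {
    "daily_consumption.csv": ["household_id", "date", "kwh"],
    "baseline_survey.csv": ["household_id", "hh_size", "income_bracket"],
    "merged.csv": ["household_id", "date", "kwh", "hh_size", "income_bracket"],
}

for pair, shared in overlapping_columns(datasets, {"household_id", "date"}).items():
    print(pair, sorted(shared))
```

In this illustration the merged file shares substantive variables with both source files, which signals that either the merged file or the two separate files should be dropped from the package.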
Additional checks for data:
If including output files with your published data, we suggest self-explanatory naming of files (e.g., table 1, table 2, etc.; or main tables, robustness checks, appendix tables, etc.), matching the corresponding publication.
A readme file is essential. The readme file should be in an open, platform-independent format such as plain (ASCII) text, Markdown, or PDF and should, at a minimum, include the following:
For an example of a published readme document, see Vilhuber et al. (2020).
Trusted data repositories allow for version control, which is especially useful for long-term research data management where metadata and files are updated over time. Versioning tracks any changes to metadata or files (e.g., uploading a new file or adding or editing metadata) made after you have published your dataset. In most repositories, a new DOI will be issued for each version. When updating versions of published data, it is important to include a note documenting the changes made from the previous version.
A trusted repository will automatically document changes such as metadata changes and the addition or deletion of files, but it will not document the specific changes made inside the files. It is therefore useful to record these file-specific changes in the readme file. For example:
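To make sure the readme note covers every file that changed, one option is to compare checksums of the files in the old and new versions of the package. The sketch below uses Python's standard `hashlib` module; the function names and the "added / removed / modified" wording are assumptions made for illustration. It only identifies which files differ, so a description of what changed inside each file still has to be written by hand.

```python
import hashlib
import tempfile
from pathlib import Path


def checksums(folder: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its contents."""
    return {
        path.relative_to(folder).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def describe_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List added, removed, and modified files as readme-ready lines."""
    lines = []
    for name in sorted(set(old) | set(new)):
        if name not in old:
            lines.append(f"added: {name}")
        elif name not in new:
            lines.append(f"removed: {name}")
        elif old[name] != new[name]:
            lines.append(f"modified: {name}")
    return lines


if __name__ == "__main__":
    # Tiny demonstration with two hypothetical versions of a package.
    with tempfile.TemporaryDirectory() as tmp:
        v1, v2 = Path(tmp) / "v1", Path(tmp) / "v2"
        v1.mkdir()
        v2.mkdir()
        (v1 / "survey.csv").write_text("id,age\n1,34\n", encoding="utf-8")
        (v2 / "survey.csv").write_text("id,age\n1,35\n", encoding="utf-8")
        (v2 / "codebook.txt").write_text("age: respondent age\n", encoding="utf-8")
        for line in describe_changes(checksums(v1), checksums(v2)):
            print(line)
```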
The act of removing a dataset from a digital repository where it has been published is known as deaccessioning, and most repositories have a process for it. Deaccessioning is an important component of version control: you can deaccession one version of a dataset or the entire dataset, and the citation and metadata remain available even though access to the data itself is removed. Deaccessioning could be required for a variety of reasons. For instance, if a research team inadvertently publishes the wrong dataset, incomplete data, or code with an error, they might want to deaccession the dataset and upload the correct one that accompanies their study. Because the citation and metadata persist, other researchers can tell that a version of the dataset existed previously if they come across it in another study.
Last updated October 2023.
We thank Jack Cavanagh, Shawn Cole, Mary-Alice Doyle, Laura Feeney, William Parienté, Karl Rubio, and Lars Vilhuber for helpful comments. Any errors are our own.
Berkeley Initiative for Transparency in the Social Sciences, Replication
Center for Open Science, Transparency and Openness Promotion Guidelines
Dataverse, Big Data Support
ICPSR, Guide to Social Science Data Preparation and Archiving
World Bank, Checklist: Microdata Catalog Submission