This document provides a primer on basic data security themes, offers context on elements of data security that are particularly relevant for randomized evaluations using individual-level administrative and/or survey data, and gives guidance for describing data security procedures to an Institutional Review Board (IRB) or in an application for data use. This is not an exhaustive or definitive guide to data security; it does not replace the guidance of professional data security experts, nor does it supersede any external data security requirements.
Data security is critical to protecting confidential data, respecting the privacy of research subjects, and complying with applicable protocols and requirements. Even seemingly de-identified data may be re-identified if enough unique characteristics are included.1 Additionally, the information revealed in this process could be damaging in unexpected ways. For example, computer scientist Arvind Narayanan successfully re-identified a public-use de-identified data set from Netflix. Through this, he was able to infer viewers’ political preferences and other potentially sensitive information (Narayanan and Shmatikov 2008).
Many research universities provide support and guidance for data security through their IT departments and through dedicated IT staff in their academic departments. Researchers should consult with their home institution’s IT staff in setting up data security measures, as the IT department may have recommendations and support for specific security software.
In addition to working with data security experts, researchers should acquire a working knowledge of data security issues to ensure the smooth integration of security measures into their research workflow and adherence to the applicable data security protocols. Researchers should also ensure that their research assistants, students, implementing partners, and data providers have a basic understanding of data security protocols.
Data-security measures should be calibrated to the risk of harm of a data breach and incorporate any requirements imposed by the data provider. Harvard University’s classification system for data sensitivity and corresponding requirements for data security illustrate how this calibration may function in practice.2
A data security breach can result in serious consequences for research subjects, the researcher’s home institution, and the researcher. Research subjects may suffer unintentional disclosure of sensitive identified information, which may expose them to identity theft, embarrassment, and financial, emotional, or other harms. Both the researcher’s home institution and the researcher may suffer reputational damage and may have more difficulty obtaining sensitive data in the future. A breach will likely trigger additional compliance requirements, including reporting the data breach to the Institutional Review Board (IRB), and, in certain circumstances, to each individual whose data was compromised. The data provider may require additional security protections or terminate access to the data. There may, in some cases, be financial and/or criminal liability to the data provider and/or the research subjects.3
If data security protocols are not adhered to, data may be disclosed through email, device loss, file-sharing software such as Google Drive, Box, or Dropbox, or improper erasure of files from hardware that has been recycled, donated, or disposed of. All hardware that comes into contact with study data should remain protected, including laptops, desktops, external hard drives, USB flash drives, mobile phones, and tablets. Theft or a cyber-attack may target either a researcher's specific data set or the researcher's home institution more generally, inadvertently sweeping up the researcher's data set in the course of the attack. Sensitive data must be protected from all of these threats.
Minimizing the research team’s contact with sensitive, individually identifiable data may substantially reduce the potential harm caused by a data breach and the required data security measures that need to be put in place. This will often simplify and accelerate the research data flow.
Reduce the data security threat level a priori by acquiring and handling only the minimum amount of sensitive data needed for the research study. Researchers may, for example, request that the data provider or a trusted third party link particularly sensitive individualized data to individual treatment status and outcome measures, so that the researchers themselves do not need to handle and store the sensitive data.
Data poses the most risk when sensitive or confidential information is linked directly to identifiable individuals, so identifiers should be split from the rest of the data as early as possible. Once separated, the “identifiers” data set and the “analysis” data set should be stored, analyzed, and transmitted separately.4 The identifiers should remain encrypted at all times, and the two data sets should be rejoined only if necessary to adjust the data matching technique. Tables 1, 2, and 3 illustrate this separation. J-PAL hosts programs for searching for PII in Stata and in R on a GitHub repository.
Table 1: Original data set containing both direct identifiers (PII) and study variables

Name | SSN | DOB | Income | State | Diabetic? |
---|---|---|---|---|---|
Jane Doe | 123-45-6789 | 5/1/50 | $50,000 | FL | Y |
John Smith | 987-65-4321 | 7/1/75 | $43,000 | FL | N |
Bob Doe | 888-67-1234 | 1/1/82 | $65,000 | GA | N |
Adam Jones | 333-22-1111 | 8/23/87 | $43,000 | FL | Y |
Table 2: Identifiers data set (crosswalk linking PII to Study IDs)

Name | SSN | Study ID |
---|---|---|
Jane Doe | 123-45-6789 | 1 |
John Smith | 987-65-4321 | 2 |
Bob Doe | 888-67-1234 | 3 |
Adam Jones | 333-22-1111 | 4 |
Table 3: De-identified analysis data set

Study ID | Income | State | Diabetic? |
---|---|---|---|
1 | $50,000 | FL | Y |
2 | $43,000 | FL | N |
3 | $65,000 | GA | N |
4 | $43,000 | FL | Y |
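The separation shown in Tables 1, 2, and 3 can be sketched in code. The following is a minimal Python example (the companion programs mentioned in this guide are written in Stata and R; the records, field names, and output file names here are hypothetical and abbreviated):

```python
import csv
import random

# Hypothetical combined records, mirroring an abbreviated Table 1.
records = [
    {"Name": "Jane Doe", "SSN": "123-45-6789", "Income": 50000, "State": "FL"},
    {"Name": "John Smith", "SSN": "987-65-4321", "Income": 43000, "State": "FL"},
    {"Name": "Bob Doe", "SSN": "888-67-1234", "Income": 65000, "State": "GA"},
]

PII_FIELDS = {"Name", "SSN"}  # direct identifiers to strip from the analysis file

# Assign Study IDs by sorting on a random draw, then numbering sequentially.
shuffled = sorted(records, key=lambda r: random.random())
for study_id, rec in enumerate(shuffled, start=1):
    rec["Study ID"] = study_id

# Identifiers/crosswalk data (Table 2): PII plus Study ID only.
identifiers = [{k: r[k] for k in ["Name", "SSN", "Study ID"]} for r in shuffled]

# Analysis data (Table 3): Study ID plus non-PII fields only.
analysis = [{k: v for k, v in r.items() if k not in PII_FIELDS} for r in shuffled]

# Write the two files separately; the identifiers file should then be
# encrypted and stored apart from the analysis file.
with open("identifiers.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["Name", "SSN", "Study ID"])
    w.writeheader()
    w.writerows(identifiers)

with open("analysis.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["Study ID", "Income", "State"])
    w.writeheader()
    w.writerows(analysis)
```

In practice the two output files should never travel together: each is stored, transmitted, and backed up through a separate channel, and the identifiers file is encrypted at rest.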
Paper-based surveys should also be designed so that PII is removable. Refer to Figure 1 for a mock-up of survey design. All direct identifiers such as name or Social Security Number and contact information such as address or telephone number should appear on a separate cover sheet. The cover sheet and any consent form should be separated from the main questionnaire as soon as possible – ideally within 24 hours. As described below, participants are assigned Study IDs (listed on both sheets) so that the research team can match and re-identify the data if needed. A crosswalk document containing the Study ID link between these two sections should be stored in a separate location from both survey halves to ensure confidentiality.
This ID should be created by a random process, such as numbering a list after sorting the data on a random number, or through a random number generator. The ID should not be derived from identifying information (for example, a name, initials, date of birth, or Social Security Number) or from the original sort order of the data.
One method for creating Study IDs is to generate a random number for each observation, sort the data by that number, and assign sequential IDs in the sorted order.
J-PAL and IPA’s randomization exercise in Stata includes the creation of Study IDs using this process.
Depending on how the Study ID is created, it may be essential to maintain a secure crosswalk (i.e., mapping/decoding) between the Study ID and PII. This crosswalk should be guarded both to ensure confidentiality, and to insure against data loss. Innovations for Poverty Action (IPA) has a publicly-available Stata program on GitHub that automates the separation of PII from other data and the creation of a crosswalk.
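The PII-search programs mentioned above work by flagging variables whose names match a list of PII-related keywords. A minimal Python sketch of the same idea follows (the keyword list here is illustrative only; the actual Stata and R programs use far more complete lists and also inspect variable labels):

```python
import re

# Illustrative keyword list for flagging likely PII column names.
PII_PATTERNS = [
    r"name", r"ssn", r"social", r"dob", r"birth", r"address",
    r"phone", r"email", r"gps", r"latitude", r"longitude",
]

def flag_pii_columns(columns):
    """Return the column names whose labels match a PII keyword."""
    flagged = []
    for col in columns:
        if any(re.search(p, col, re.IGNORECASE) for p in PII_PATTERNS):
            flagged.append(col)
    return flagged

columns = ["study_id", "respondent_name", "income", "phone_number", "state"]
print(flag_pii_columns(columns))  # flags respondent_name and phone_number
```

Flagged columns still require human review: keyword matching produces false positives (e.g., a variable named "brand_name") and misses identifiers with uninformative names.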
Researchers have many options for secure data storage and access. Relevant considerations for choosing among these options include: the sensitivity of the data, applicable compliance requirements, the research team’s technical expertise, internet connectivity, and access to IT expertise and support.
Institutions may offer space on a server or provide a location to host a server. Storing data on such a server may be preferable to relying on laptops or desktops and cloud storage to maintain data. Depending on the institution, the IT department may be able to provide secure remote access for off-campus users, automated secure backups of data, and encryption.
Access to these servers is typically automatic when connected over an official institutional internet connection. Off-site access requires the use of a Virtual Private Network (VPN). This may provide additional layers of security by encrypting all network connections and requiring two or more types of authentication (e.g., a password and a code sent via text message). Data will still need to be encrypted at both endpoints – i.e., the server or files on the server must be encrypted, and any data transferred to or from the server to another server or hard drive must be encrypted at those points.
Additional features may be available upon request. For example, IT or data managers may be able to grant access permissions to specific users for specific files or folders on the server. This level of control enables teams to share general access to a folder while limiting access to identified data to a specific subset of the team. Seek out your IT department's official recommendations regarding passwords and permissions, such as the IS&T Policies for MIT projects.
Data must be protected both when at rest and in transit between the data provider, research team members, and partners. Data that are encrypted while at rest on a whole-disk encrypted laptop, or on a secure server, will not necessarily be protected while being transmitted. The options presented below may vary in their level of security.
Unsafe transmission methods include sending unencrypted files as email attachments, sharing unencrypted files through consumer file-sharing services such as Google Drive, Box, or Dropbox, and physically transporting unencrypted storage devices.
Safer transmission methods include encrypting files before they are transferred, transferring files over encrypted channels such as SFTP or a VPN connection to a secure server, and using institutionally approved secure file transfer services.
Many research partners, such as service providers, survey enumerators, and holders of administrative data, have had minimal prior exposure to data security or data sharing protocols. It is best practice to develop a data sharing and security protocol with these partners and to guide them in understanding their role in data security. All partners handling or transmitting data should be informed of and trained on the data collection, storage, and transfer policies agreed upon for the study. Request that partners notify the research team before sharing any data to ensure compliance with the data protocol, and communicate with partners by referencing Study ID numbers rather than PII. Consider developing standard operating procedures for detecting and responding to breaches of the agreed-upon method for sharing data, such as a partner sharing data in a non-secure way or unauthorized data being disclosed to researchers or partners; documented procedures allow staff to respond quickly in the event of a breach. A standard operating procedures document should include what constitutes a breach, whom to notify (for example, the principal investigator, the IRB, and the data provider), and the immediate steps for containing the breach.
There are several simple steps researchers and their staff can take to ensure their machines remain secure and to minimize possible weak points. These steps include keeping operating systems and software up to date, enabling whole-disk encryption, running antivirus and firewall software, and locking devices whenever they are unattended.
IT departments generally have recommended software to help secure personal devices and may be able to assist with updating this software or may push automatic updates.
Strong passwords are essential to ensuring data security. A different password should be used for each high-value account. For example, the passwords for Dropbox, email, institutional servers, and encrypted files should all be different.
The National Institute of Standards and Technology (NIST) published revised guidelines for passwords in 2017. These guidelines and the underlying rationale are explained in more approachable language in a NIST staff blog post.
Strong passwords may be difficult to remember. When using some software, a forgotten password is completely irretrievable and means the loss of all project data.
An unencrypted, password-protected Excel file of passwords is not a secure way to store or share passwords. Passwords should never be shared using the same mechanism as file transfer, nor should they be shared over the phone.
Password storage systems such as Bitwarden and 1Password offer a secure way to create, manage, and store passwords online. Through their web and mobile applications, notes and passwords can also be securely shared with specified teammates. A hard copy of a password list, locked in a safe, is another secure option.
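The NIST guidelines mentioned above favor long, memorable passphrases over complex composition rules. A minimal Python sketch using the standard library `secrets` module follows; the short word list is illustrative only, and a real generator should draw from a large published list such as the EFF diceware list of 7,776 words:

```python
import secrets

# Illustrative word list only; a real passphrase generator should use a
# large published list (e.g., the EFF diceware list).
WORDS = ["correct", "horse", "battery", "staple", "orbit", "lantern",
         "meadow", "quartz", "violet", "saddle", "timber", "harbor"]

def generate_passphrase(n_words=5, separator="-"):
    """Generate a random passphrase using a cryptographically secure RNG."""
    return separator.join(secrets.choice(WORDS) for _ in range(n_words))

def generate_token(n_bytes=16):
    """Generate a random URL-safe token, e.g., for a service account."""
    return secrets.token_urlsafe(n_bytes)

print(generate_passphrase())
print(generate_token())
```

The `secrets` module is preferred over `random` for any security-sensitive value because it draws from the operating system's cryptographically secure random source.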
In addition to securing against outside threats, preventing data loss is an essential component of data security. Data and crosswalks between study IDs and PII should be backed up regularly in at least two separate locations, and passwords must not be forgotten.
Cloud-based backup tools such as CrashPlan and Carbonite offer a range of options for data backups and may offer additional packages to back up data for longer periods of time to protect against the unintentional erasure of data. Cloud-based storage tools such as Box, Dropbox, and Google Drive offer packages to back up data for several months or more, and may protect against unintentional erasure if it is noticed within the retention period; however, these storage tools are not true backup tools, as they do not keep deleted files indefinitely. Institutional servers may also have data backup plans, and device-level backup plans are also available. Backing up data to an encrypted external hard drive (stored in a separate location from daily computers) is an option for low-connectivity environments.
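A simple backup routine can also be scripted. The sketch below is a minimal Python example of copying a file to multiple backup locations under timestamped names so that older backups are never overwritten; the function name and directory layout are hypothetical, and it is no substitute for a managed backup service:

```python
import os
import shutil
from datetime import datetime, timezone

def back_up(source, destinations):
    """Copy a file to each backup directory under a timestamped name,
    so that earlier backups are never overwritten."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    base = os.path.basename(source)
    copies = []
    for dest_dir in destinations:
        os.makedirs(dest_dir, exist_ok=True)
        target = os.path.join(dest_dir, f"{stamp}_{base}")
        shutil.copy2(source, target)  # copy2 preserves file metadata
        copies.append(target)
    return copies
```

For example, `back_up("analysis.csv", ["/mnt/server_backup", "/media/encrypted_drive"])` (hypothetical paths) would write one timestamped copy to each location, satisfying the two-location guideline above; files containing PII should be encrypted before being backed up.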
The IRB or data provider may dictate whether and when data must be retained or destroyed. PII linkages should be erased when they are no longer needed. Simply moving files to the “recycle bin” and emptying the bin is not sufficient to thoroughly erase sensitive data. There are several software options for securely erasing files; for example, MIT maintains recommendations for removing sensitive data. Some IT departments may offer support for secure removal and disposal services. For paper-based surveys, J-PAL recommends that hard copies of PII cover sheets, questionnaires, and study ID crosswalks be destroyed within 3-5 years of the end of a project (or as committed in the IRB protocol).10
The data provider will need to be confident that all files have been securely removed and no additional copies have been retained. In order to document data erasure, some researchers have taken screenshots of the removal process.
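One common erasure technique on conventional hard drives is overwriting a file's contents before deleting it. The Python sketch below illustrates the idea; the function name is hypothetical, and note that on SSDs, journaling file systems, and cloud-synced folders the original blocks may survive overwriting, so institution-approved erasure tools should be preferred for sensitive data:

```python
import os
import secrets

def overwrite_and_delete(path, passes=3):
    """Overwrite a file with random bytes before deleting it.

    Caution: not reliable on SSDs, journaling file systems, or
    cloud-synced folders, where the original blocks may persist.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(secrets.token_bytes(size))  # random data, not zeros
            f.flush()
            os.fsync(f.fileno())  # force the overwrite to disk
    os.remove(path)
```

Dedicated tools (and hardware destruction for the most sensitive media) remain the safer choice; this sketch only illustrates why emptying the recycle bin is insufficient.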
Data Use Agreements (DUAs) and IRBs often require researchers to provide a description of their data security and destruction procedures, and some may include specific requirements on these processes dependent on the sensitivity of requested data.11 This section provides examples of descriptions of data management plans drawn from approved DUAs. This language is provided for informational purposes only; it is not necessarily comprehensive nor feasible in all environments. Please refer to your university’s (and/or department’s) research support center, libraries, or IT department for detailed protocols, potential templates, and descriptions of what is feasible, required, and sufficient at your location. Additional external resources on describing data security plans are in the "Resources for data security" section at the bottom of this page.
Secure data storage: the Department maintains a Unix/Linux-based research computing environment for its students and faculty members. The research computing systems utilize enterprise-level hardware and are managed by a dedicated staff of IT professionals. The Department leverages additional resources provided by the institution centrally, such as network infrastructure and professional co-location services in institutional datacenters. Department IT staff fully support private research servers purchased by individual faculty members. This support includes account management, security patching, software installation, and host monitoring. Secure servers will be utilized for the purposes of processing and analyzing data. All computations and analytical work will be performed exclusively on these servers. File-based permissions will be set to restrict data access to the research team.
All project data will be stored on a network attached storage (NAS) device. A dedicated volume will be created on the NAS for exclusive storage of all data related to this research project. Data on this volume will be served using the NFSv4 protocol and restricted to authorized hosts and users using IP-based host lists and institutional credentials. Data on this volume will be accessible only to authenticated users on the project servers described below. Data is backed-up to a secondary NAS device which is accessible only by IT personnel.
All network traffic is encrypted using the SSH2 protocol. A VPN provides an additional level of encryption/access restriction for off-campus connections. All server logins require two forms of authentication, a password and an SSH key pair. SSH Inactivity Timeout is used as the session timeout protocol.
Last updated May 2023.
These resources are a collaborative effort. If you notice a bug or have a suggestion for additional content, please fill out this form.
This document was originally a part of the “Using administrative data for randomized evaluations” resource, originally produced in 2015 by J-PAL North America. We are grateful to the original contributors to that work, as well as Patrick McNeal, and James Turitto for their insightful feedback and advice. Chloe Lesieur copyedited this document and Laurie Messenger formatted the guide, tables, and figures. This work was made possible by support from the Alfred P. Sloan Foundation and the Laura and John Arnold Foundation. Any errors are our own.
Please send comments, questions, or feedback to na_resources@povertyactionlab.org.
This document is intended for informational purposes only. Any information related to the law contained herein is intended to convey a general understanding and not to provide specific legal advice. Use of this information does not create an attorney-client relationship between you and MIT. Any information provided in this document should not be used as a substitute for competent legal advice from a licensed professional attorney applied to your circumstances.
Harvard University’s Research Data Security Policy classifies data according to five levels of sensitivity and defines data security requirements that correspond to each sensitivity level.
US Department of Health & Human Services’ Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
The National Institutes of Health’s “How Can Covered Entities Use and Disclose Protected Health Information for Research and Comply with the Privacy Rule?”
45 CFR 164.514 – Describes the HIPAA standard for de-identification of protected health information (original text).