Introduction to Clinical Data
Clinical data is collected either during patient care or as part of a clinical trial program. Funding agencies, publishers, and research communities encourage researchers to share data while respecting Institutional Review Board (IRB) and federal restrictions against disclosing identifiers of human subjects.
You should take initial steps to de-identify data for:
- Protecting data during research projects
- Preparing data for vetted collaborators, or for restricted-access or public-access data repositories
Clinical Data Terminology
Personal Identifiers
Private information that subjects expect not to be made public and that is linked to information associated with a unique individual.
PII: Personally Identifiable Information (NIST SP 800-122)
- Any information maintained by an agency…used to distinguish or trace an individual’s identity
- Any other information that is linked or linkable to an individual
PHI: Protected Health Information
- Created or received by a health care provider
- Relating to physical or mental health of an individual or provision of care (past, present, or future) and (i) that identifies or (ii) could be used to identify the individual. (HIPAA's Privacy Rule)
Types of Identifying Information
Identifying information is classified as one of two types: direct and indirect.
Direct Identifiers
HIPAA lists 18 typical direct identifiers for PHI as part of the standards for patient protection used by US Health and Human Services.
- Names
- All geographic subdivisions smaller than state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if: the geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and the initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people are changed to 000
- All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
- Telephone numbers
- Fax numbers
- Email addresses
- Social Security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate or license numbers
- Vehicle identifiers and serial numbers, including license plate numbers
- Device identifiers and serial numbers
- Web Universal Resource Locators (URLs)
- Internet Protocol (IP) addresses
- Biometric identifiers, including finger and voice prints
- Full-face photographs and any comparable images - photographs are not limited to images of the face
- Any other unique identifying number, characteristic, or code that could uniquely identify the individual
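The geographic, date, and age rules in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the function names are hypothetical, and the set of small-population ZIP prefixes is a placeholder that would need to be derived from current Census data.

```python
from datetime import date

# ASSUMPTION: placeholder set of 3-digit ZIP prefixes whose combined
# population is 20,000 or fewer. The real list must come from Census data.
SMALL_POPULATION_PREFIXES = {"036", "059"}


def generalize_zip(zip_code: str) -> str:
    """Keep only the first three ZIP digits, or "000" for small-population areas."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in SMALL_POPULATION_PREFIXES else prefix


def generalize_date(d: date) -> int:
    """Reduce a date tied to an individual to its year only.

    Note: for people older than 89, even the year must be withheld if it
    reveals that age; this helper does not check for that case.
    """
    return d.year


def generalize_age(age: int) -> str:
    """Report all ages over 89 as a single "90+" category."""
    return "90+" if age > 89 else str(age)
```

For example, `generalize_zip("02115")` keeps only `"021"`, and `generalize_age(93)` collapses into the `"90+"` category.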
Indirect Identifiers
Information that can be combined with other information to potentially identify a specific individual.
- Place of medical treatment or doctor's name
- Gender
- Rare disease or treatment
- Sensitive data like illicit drug use or other "risky behaviors"
- Place of birth
- Socioeconomic data, like workplace, occupation, annual income, education, etc.
- General geographic indicators, like postal code of residence
- Household and family composition
- Ethnicity
- Birth year or age
- Verbatim responses or transcripts
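One way to see how indirect identifiers combine is to count how many records share each combination of values. The sketch below is an assumption-laden illustration (the function name, field names, and threshold `k` are hypothetical, and this group-size check is a common disclosure-risk heuristic rather than something HIPAA prescribes): combinations shared by very few records are the ones most likely to single someone out.

```python
from collections import Counter


def small_groups(records, fields, k=5):
    """Return combinations of indirect identifiers shared by fewer than k records.

    records: list of dicts, one per subject
    fields:  names of the indirect-identifier columns to combine (hypothetical)
    k:       minimum acceptable group size (an assumed threshold)
    """
    counts = Counter(tuple(rec[f] for f in fields) for rec in records)
    return {combo: n for combo, n in counts.items() if n < k}
```

A combination returned by `small_groups` (for instance, a single record with a given gender, birth year, and 3-digit ZIP) is a candidate for generalizing or suppressing before the data is shared.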
Anonymization
A broader term that encompasses two types of tasks for reducing the disclosure risk of identifiers: masking and de-identification.
Masking
- Alter direct identifiers so that the original is no longer usable for analysis.
- Delete items like Social Security numbers and replace identifiers with pseudonyms or randomized codes.
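The masking steps above might look like the following sketch. The field names (`name`, `ssn`, `subject_code`) and the `P-` code format are hypothetical choices for illustration; the linking key it returns would need to be stored separately from the masked data, under whatever controls your IRB protocol requires.

```python
import secrets


def mask_records(records, id_field="name", drop_fields=("ssn",)):
    """Return masked copies of records plus the key linking codes to originals.

    The same original identifier always maps to the same random code, so
    records for one subject stay linked after masking.
    """
    key = {}       # original identifier -> random code (keep this secret)
    used = set()   # codes already issued, to guarantee uniqueness
    masked = []
    for rec in records:
        original = rec[id_field]
        if original not in key:
            code = "P-" + secrets.token_hex(4)
            while code in used:
                code = "P-" + secrets.token_hex(4)
            used.add(code)
            key[original] = code
        # Copy every field except the identifier and the fields to delete.
        clean = {
            k: v for k, v in rec.items()
            if k != id_field and k not in drop_fields
        }
        clean["subject_code"] = key[original]
        masked.append(clean)
    return masked, key
```

Because the codes are random rather than derived from the identifier, they cannot be reversed without the key.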
De-identification
- Minimal distortion of data so that they retain utility for analysis, while adequately protecting privacy.
- Methods include generalizing data elements, such as replacing age with range values; or more advanced statistical techniques, such as suppression of outlier values, grouped averaging or record swapping.
- Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule
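Two of the simpler methods mentioned above, generalizing a value into a range and suppressing outliers, can be sketched as follows. The function names, the default bin width, and the cutoff values are illustrative assumptions; appropriate bins and thresholds depend on the dataset and should be chosen with disclosure risk in mind.

```python
def age_to_range(age: int, width: int = 5) -> str:
    """Replace an exact age with a non-overlapping range of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def suppress_outliers(values, lower, upper):
    """Replace values outside [lower, upper] with None (suppressed)."""
    return [v if lower <= v <= upper else None for v in values]
```

With the default width, `age_to_range(52)` yields `"50-54"`. Wider bins give more protection at the cost of analytic detail.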
HIPAA's Privacy Rule
The Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule protects individuals' medical records and other personal health information.
Limited Data Sets (LDS) (§164.514(e))
- Remove or anonymize HIPAA direct identifiers and "facial" identifiers.
- Certain dates, geographic location to ZIP code level, and birth dates may remain.
- Indirect identifiers may also remain if not easily removed.
"Safe Harbor" Anonymization Level (§164.514(b))
- Remove the 18 direct identifiers, truncate ZIP codes to 3 digits, and reduce dates to year only.
- Alter indirect identifiers to sufficiently limit "actual knowledge" of data that could, alone or in combination with other information, re-identify a data subject.
"Expert determination" Statistically De-identified Datasets (§164.514(b)(1))
- Remove or mask all direct and indirect identifiers.
- Statistical techniques can be applied to make the remaining risk "very small."
- A trained statistical professional should be consulted to help prepare datasets and to assess and mitigate disclosure risk.
REDCap
Secure web application for data capture for research studies.
REDCap is a free, secure, web-based application designed to support data capture for research studies. The system was developed by a multi-institutional consortium initiated at Vanderbilt University. Data collection is customized for each study or clinical trial by the research team with guidance from Harvard Catalyst EDC Support Staff. REDCap is designed to comply with HIPAA regulations.
REDCap is a mature, secure web application for building and managing online surveys and databases.
- Design your own survey electronically
- Share data securely with research staff and external collaborators
- Built-in tools for viewing EPIC data and for limited de-identification
Available Harvard Licenses
- Harvard Medical School (HMS REDCap is not a HIPAA-compliant service)
- Harvard School of Public Health
- Additional Harvard Affiliated Institutions
Clinical Research Datasets
Clinical research data may be available through national or discipline-specific organizations. Level of access is likely restricted but available through proper channels. Proprietary research data may also be available through individual use agreements.
- Biologic Specimen and Data Repository Information Coordinating Center (NHLBI): Listing of studies with resources available for searching and request via BioLINCC.
- Biomedical Translational Research Information System (BTRIS): Research data available to the NIH intramural community only.
- Clinical Data Study Request: Clinical trials data. Partners include pharmaceutical companies.
- NIMH Clinical Trials: Limited Access Datasets.
- Harvard Health Care Policy (HCP) - Center for Healthcare Data Analytics (CHDA): Medicare and Medicaid health care data for research purposes. Restricted to collaborations with HCP faculty. Details on current data available through the HCP Hub CHDA on the Harvard Intranet.
- Harvard Medical School Department of Biomedical Informatics (DBMI) Data Portal: Growing catalog of DBMI resources, including datasets.
Additional Readings
Is there such a thing as being truly anonymous?
- Harvard Professor Re-Identifies Anonymous Volunteers in DNA Study (Tanner, 2013)
- Only You, Your Doctor, and Many Others May Know (Latanya Sweeney, 2015)
- Re-identification of Patients in Maine and Vermont Statewide Hospital Data (Ji Su Yoo et al. 2018)
- Genome Hackers Show No One’s Data is Anonymous Anymore (Megan Molteni, 2018)
- Test it yourself!
Accessible Version
The Levels of De-identification Actions Table
The table describes three levels of de-identification action, each building on the one before:
- Personal & collaborator use: Remove personal identifiers not needed for analysis and replace them with codes, e.g., names with pseudonyms, per HIPAA's 'Safe Harbor' limited datasets rule.
- Restricted access repository: Do all of the above, and broaden or mask direct and indirect identifiers when possible, e.g., change specific values to ranges, such as changing 52 to 50-55, per HIPAA's 'Safe Harbor' limited datasets rule.
- Public access repository: Do all of the above, and find and remove or mask all potential indirect or inferential identifiers. Apply advanced statistical de-identification techniques if needed, and seek professional assistance and disclosure risk review, per HIPAA's expert determination 'de-identified' standard.