"It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data."
— Hadley Wickham, Chief Scientist at RStudio and Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University
Creation: Working with two-dimensional data
Spreadsheets are a way to store, view, analyze, and alter two-dimensional data.
Best practices for creating datasets:
- All data should be labeled.
- Each experimental subject should have a unique study ID.
- Data should be in rectangular format (flat files).
- Rows should represent the appropriate unit of analysis.
- Columns should represent the unique attributes of the rows.
- Data files should contain the same number of columns in each row. Problems arise when data are missing in the middle of a row.
- Data should be atomic within each column. Discrete data should not be combined into a single column.
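As a sketch of the practices above, the flat file below gives each subject a unique study ID, keeps every column atomic, and writes the same number of columns in every row, recording missing values explicitly rather than omitting them. All field names and values here are invented for illustration:

```python
import csv
import io

# Hypothetical study data: one row per subject (the unit of analysis),
# a unique study_id, and one atomic attribute per column.
FIELDS = ["study_id", "age", "height_cm", "weight_kg"]
rows = [
    {"study_id": "S001", "age": 34, "height_cm": 170, "weight_kg": 65},
    # Missing value: the column is still present, just empty.
    {"study_id": "S002", "age": 41, "height_cm": 182, "weight_kg": None},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="")
writer.writeheader()
writer.writerows(rows)
flat_file = buffer.getvalue()
```

Because every row is written against the same fixed field list, the resulting file is rectangular even when data are missing in the middle of a row.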
Formatting: Tidy data principles
Tidy data has become the standard format for the sciences because it allows people to easily turn a data table into graphs, analyses, and insight. Dr. Hadley Wickham coined the term “tidy data” to minimize the effort involved in preparing data for visualization and statistical modeling.
A “tidy dataset” has the following structure:
- Each variable forms a column
- Each observation forms a row
- Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)
[Figure] Tidy data, from Data Science with R by Garrett Grolemund: each variable is placed in its own column, each observation in its own row, and each value in its own cell.
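The three principles above can be sketched with a hypothetical study in which each participant's blood pressure is recorded at two visits. The wide table below stores one column per visit, so each row holds two observations; reshaping it so that each participant-visit pair gets its own row makes it tidy:

```python
# Hypothetical wide-format table: one row per participant, one column per visit.
wide = [
    {"id": "P01", "visit_1": 120, "visit_2": 118},
    {"id": "P02", "visit_1": 135, "visit_2": 130},
]

# Tidy (long) form: one row per observation (participant-visit),
# with "visit" promoted to a variable of its own.
# (str.removeprefix requires Python 3.9+.)
tidy = [
    {"id": row["id"], "visit": key.removeprefix("visit_"), "bp": row[key]}
    for row in wide
    for key in ("visit_1", "visit_2")
]
```

This wide-to-long reshape is what `pandas.melt` does in Python and `tidyr::pivot_longer` does in R.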
Validation: Review your data
Validation helps ensure that data is collected correctly.
Best practices for validating your datasets:
- Program valid ranges for inputting data into fields when applicable.
- Apply data formatting to fields in advance to prevent the risk of inaccurate automatic formatting.
- Prevent the entry of leading and/or trailing spaces or other characters that may interfere with data analysis.
- Plan for “other” data responses.
- Plan for “prefer not to answer”.
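A minimal sketch of range validation for incoming records; the field names and valid ranges in the schema are assumptions for illustration:

```python
def validate_row(row, schema):
    """Return a list of problems found in one data row.

    schema maps each required field to an inclusive (low, high) range.
    """
    problems = []
    for field, (low, high) in schema.items():
        value = row.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems

# Hypothetical valid ranges for two numeric fields.
SCHEMA = {"age": (0, 120), "systolic_bp": (60, 250)}
```

In practice the same ranges are best enforced at the point of entry, for example with spreadsheet data-validation rules or an electronic data capture tool such as REDCap.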
Standardization ensures the data is internally consistent: every data element you collect has the same kind and format throughout. It also helps minimize data collection and analysis errors and prevents inconsistencies.
Best practices for standardizing your data:
- Data should be coded harmoniously.
- Standardize free text into categorical data.
- Treat date and time consistently. Choose one format and employ that standard throughout (e.g., ISO 8601: YYYY-MM-DD).
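The date rule above can be sketched as a small normalizer that converts assorted inputs to ISO 8601. The list of input formats is an assumption about what the raw data might contain and would need to match your actual sources:

```python
from datetime import datetime

# Formats we assume might appear in the raw data (illustrative).
KNOWN_FORMATS = ["%m/%d/%Y", "%d-%b-%Y", "%Y-%m-%d"]

def to_iso8601(raw):
    """Normalize a date string to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            # .strip() also removes leading/trailing spaces that
            # would otherwise interfere with analysis.
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")
```

Raising on unrecognized input, rather than guessing, keeps ambiguous dates (is 03/04 March 4 or April 3?) from silently corrupting the dataset.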
Find more guidance on File Naming Conventions.
Cleaning: Make your data easier to work with
Before performing analysis of your data, review the datasets for inaccuracies, inconsistencies, or sensitive data. Cleaning your data allows you to identify outliers or errors before you compile your results.
Best practices for cleaning your data:
- Check for outliers. Ensure all data elements are in the correct formats and ranges.
- Check for missing data. Ensure there are no data items or records that are missing, creating null elements. Code missing data appropriately.
- Ensure that your data does not contain Protected Health Information (PHI). HIPAA requires that researchers protect the privacy and confidentiality of their patients. No individually identifiable health information should be included in your datasets.
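The outlier and missing-data checks can be sketched with the standard library. The z-score threshold and sample values below are illustrative assumptions; real datasets often warrant more robust methods (e.g., median-based rules), since a single extreme value inflates the mean and standard deviation:

```python
import statistics

def missing_indices(values):
    """Indices of missing (None) entries."""
    return [i for i, v in enumerate(values) if v is None]

def flag_outliers(values, z=2.0):
    """Indices of values more than z standard deviations from the mean.

    Missing entries are ignored when computing the mean and SD.
    """
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)
    return [i for i, v in enumerate(values)
            if v is not None and abs(v - mean) > z * sd]
```

Flagged values should be reviewed, not silently dropped: an outlier may be a data-entry error, but it may also be a real, important observation.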
Documentation: Ensure dataset metadata
Before analyzing or sharing your data, ensure that you have appropriate documentation. Appropriate documentation facilitates the understanding, analysis, sharing, and reuse of your data.
Best practices for documenting your data:
- Data should be stored with appropriate metadata.
- Create and use a data dictionary and README files.
- Save data as machine-readable ASCII or Unicode files.
- Adopt appropriate file naming practices to accommodate multiple versions of data files.
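As a sketch of the data-dictionary practice, a skeleton can be generated from the data itself (the field names here are hypothetical). A real data dictionary would add what automation cannot infer: units, valid ranges, and a prose description of each variable:

```python
def data_dictionary(rows):
    """Summarize each column: inferred type, missing count, and an example value."""
    summary = {}
    for row in rows:
        for field, value in row.items():
            entry = summary.setdefault(
                field, {"type": None, "missing": 0, "example": None}
            )
            if value is None:
                entry["missing"] += 1
            else:
                entry["type"] = entry["type"] or type(value).__name__
                if entry["example"] is None:
                    entry["example"] = value
    return summary
```

Saving this summary alongside the dataset as a README or dictionary file gives future users (including you) a map of what each column contains.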
See more guidance on Documentation and Metadata.