"It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data."
— Hadley Wickham, Chief Scientist at RStudio and Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University
Creation: Working with two-dimensional data
Spreadsheets are a way to store, view, analyze, and alter two-dimensional data.
Best practices for creating datasets:
- All data should be labeled.
- Each experimental subject should have a unique study ID.
- Data should be in rectangular format (flat files).
- Rows should represent the appropriate unit of analysis.
- Columns should represent the unique attributes of the rows.
- Data files should contain the same number of columns in each row. Problems arise when data are missing in the middle of a row.
- Data should be atomic within each column. Discrete data should not be combined into a single column.
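As a sketch of the practices above, the flat file below gives each subject a unique study ID, keeps every column atomic, and writes the same number of columns in every row, recording missing values explicitly rather than omitting them. All field names and values here are invented for illustration:

```python
import csv
import io

# Hypothetical study data: one row per subject (the unit of analysis),
# a unique study_id, and one atomic attribute per column.
FIELDS = ["study_id", "age", "height_cm", "weight_kg"]
rows = [
    {"study_id": "S001", "age": 34, "height_cm": 170, "weight_kg": 65},
    # Missing value: the column is still present, just empty.
    {"study_id": "S002", "age": 41, "height_cm": 182, "weight_kg": None},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="")
writer.writeheader()
writer.writerows(rows)
flat_file = buffer.getvalue()
```

Because every row is written against the same fixed field list, the resulting file is rectangular even when data are missing in the middle of a row.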
Formatting: Tidy data principles
Tidy data has become the standard format for the sciences because it allows people to easily turn a data table into graphs, analyses, and insight. Dr. Hadley Wickham coined the term “tidy data” to minimize the effort involved in preparing data for visualization and statistical modeling.
A “tidy dataset” has the following structure:
- Each variable forms a column
- Each observation forms a row
- Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)
[Figure] Tidy data, from Data Science with R by Garrett Grolemund: each variable is placed in its own column, each observation in its own row, and each value in its own cell.
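The three principles above can be sketched with a hypothetical study in which each participant's blood pressure is recorded at two visits. The wide table below stores one column per visit, so each row holds two observations; reshaping it so that each participant-visit pair gets its own row makes it tidy:

```python
# Hypothetical wide-format table: one row per participant, one column per visit.
wide = [
    {"id": "P01", "visit_1": 120, "visit_2": 118},
    {"id": "P02", "visit_1": 135, "visit_2": 130},
]

# Tidy (long) form: one row per observation (participant-visit),
# with "visit" promoted to a variable of its own.
# (str.removeprefix requires Python 3.9+.)
tidy = [
    {"id": row["id"], "visit": key.removeprefix("visit_"), "bp": row[key]}
    for row in wide
    for key in ("visit_1", "visit_2")
]
```

This wide-to-long reshape is what `pandas.melt` does in Python and `tidyr::pivot_longer` does in R.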
Validation: Review your data
Validation helps ensure that data is collected correctly.
Best practices for validating your datasets:
- Program valid ranges for inputting data into fields when applicable.
- Apply data formatting to fields in advance to prevent the risk of inaccurate automatic formatting.
- Prevent the entry of leading and/or trailing spaces or other characters that may interfere with data analysis.
- Plan for “other” data responses.
- Plan for “prefer not to answer”.
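A minimal sketch of range validation for incoming records; the field names and valid ranges in the schema are assumptions for illustration:

```python
def validate_row(row, schema):
    """Return a list of problems found in one data row.

    schema maps each required field to an inclusive (low, high) range.
    """
    problems = []
    for field, (low, high) in schema.items():
        value = row.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems

# Hypothetical valid ranges for two numeric fields.
SCHEMA = {"age": (0, 120), "systolic_bp": (60, 250)}
```

In practice the same ranges are best enforced at the point of entry, for example with spreadsheet data-validation rules or an electronic data capture tool such as REDCap.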
Standardization ensures the data is internally consistent: every data element you collect has the same kind and format throughout. It also helps minimize data collection and analysis errors and prevents inconsistencies.
Best practices for standardizing your data:
- Data should be coded harmoniously.
- Standardize free text into categorical data.
- Treat date and time consistently. Choose one format and employ that standard throughout (e.g., ISO 8601: YYYY-MM-DD).
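The date rule above can be sketched as a small normalizer that converts assorted inputs to ISO 8601. The list of input formats is an assumption about what the raw data might contain and would need to match your actual sources:

```python
from datetime import datetime

# Formats we assume might appear in the raw data (illustrative).
KNOWN_FORMATS = ["%m/%d/%Y", "%d-%b-%Y", "%Y-%m-%d"]

def to_iso8601(raw):
    """Normalize a date string to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            # .strip() also removes leading/trailing spaces that
            # would otherwise interfere with analysis.
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")
```

Raising on unrecognized input, rather than guessing, keeps ambiguous dates (is 03/04 March 4 or April 3?) from silently corrupting the dataset.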
Find more guidance on File Naming Conventions.
Cleaning: Make your data easier to work with
Before performing analysis of your data, review the datasets for inaccuracies, inconsistencies, or sensitive data. Cleaning your data allows you to identify outliers or errors before you compile your results.
Best practices for cleaning your data:
- Check for outliers. Ensure all data elements are in the correct formats and ranges.
- Check for missing data. Ensure there are no data items or records that are missing, creating null elements. Code missing data appropriately.
- Ensure that your data does not contain Protected Health Information (PHI). HIPAA requires that researchers protect the privacy and confidentiality of their patients. No individually identifiable health information should be included in your datasets.
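The outlier and missing-data checks can be sketched with the standard library. The z-score threshold and sample values below are illustrative assumptions; real datasets often warrant more robust methods (e.g., median-based rules), since a single extreme value inflates the mean and standard deviation:

```python
import statistics

def missing_indices(values):
    """Indices of missing (None) entries."""
    return [i for i, v in enumerate(values) if v is None]

def flag_outliers(values, z=2.0):
    """Indices of values more than z standard deviations from the mean.

    Missing entries are ignored when computing the mean and SD.
    """
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)
    return [i for i, v in enumerate(values)
            if v is not None and abs(v - mean) > z * sd]
```

Flagged values should be reviewed, not silently dropped: an outlier may be a data-entry error, but it may also be a real, important observation.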
Documentation: Ensure dataset metadata
Before analyzing or sharing your data, ensure that you have appropriate documentation. Appropriate documentation facilitates the understanding, analysis, sharing, and reuse of your data.
Best practices for documenting your data:
- Data should be stored with appropriate metadata.
- Create and use a data dictionary and README files.
- Save data as machine-readable ASCII or Unicode files.
- Adopt appropriate file naming practices to accommodate multiple versions of data files.
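As a sketch of the data-dictionary practice, a skeleton can be generated from the data itself (the field names here are hypothetical). A real data dictionary would add what automation cannot infer: units, valid ranges, and a prose description of each variable:

```python
def data_dictionary(rows):
    """Summarize each column: inferred type, missing count, and an example value."""
    summary = {}
    for row in rows:
        for field, value in row.items():
            entry = summary.setdefault(
                field, {"type": None, "missing": 0, "example": None}
            )
            if value is None:
                entry["missing"] += 1
            else:
                entry["type"] = entry["type"] or type(value).__name__
                if entry["example"] is None:
                    entry["example"] = value
    return summary
```

Saving this summary alongside the dataset as a README or dictionary file gives future users (including you) a map of what each column contains.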
See more guidance on Documentation and Metadata.