Harvard Dataverse

Harvard Dataverse is a general-purpose data repository built on open-source software that is intended for sharing and facilitating citation of research data. It is under continuous development by Harvard Library, Harvard University Information Technology (HUIT), and the Harvard Institute for Quantitative Social Science (IQSS). Several other institutions have made use of this open-source software project to develop independent Dataverse installations around the world.

Harvard Dataverse is open to researchers from any discipline, both inside and outside the Harvard community, who can use it to share, archive, cite, access, and explore research data.

Compare Harvard Dataverse to other options in the Harvard Biomedical Repository Matrix.

Please contact us if you have any questions or suggestions about the content of this page. Last updated: 2023-11-13

Features & Specifications

Data Size and Format

File Size Limit: Files uploaded through the browser-based upload function cannot exceed 2.5 GB each. However, Harvard Dataverse is willing to work with Harvard researchers who have larger files.

Dataset Size Limit: 1 TB per researcher. Harvard Dataverse will work with Harvard researchers who have larger datasets (>1 TB).
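
As a rough illustration of the limits above, the following sketch checks a set of files before upload. It assumes decimal units (1 GB = 10^9 bytes), which this page does not specify, and the function and constant names are hypothetical rather than part of any Dataverse tool.

```python
import os

# Assumption: limits use decimal units (1 GB = 10**9 bytes); the page does not say.
BROWSER_UPLOAD_LIMIT_BYTES = int(2.5 * 10**9)  # 2.5 GB per file via browser upload
RESEARCHER_QUOTA_BYTES = 10**12                # 1 TB per researcher


def check_sizes(sizes_by_name):
    """Return (oversized_files, total_bytes, within_quota) for a {name: size} mapping."""
    oversized = sorted(
        name for name, size in sizes_by_name.items()
        if size > BROWSER_UPLOAD_LIMIT_BYTES
    )
    total = sum(sizes_by_name.values())
    return oversized, total, total <= RESEARCHER_QUOTA_BYTES


def sizes_from_paths(paths):
    """Collect on-disk sizes for local files (hypothetical helper)."""
    return {path: os.path.getsize(path) for path in paths}
```

Files reported as oversized would need a route other than the browser upload, and a total above the quota is a cue to contact the repository before depositing.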

Data Types and Formats Hosted: All file formats are accepted, including tabular files, non-tabular files, and compressed zip bundles. A file hierarchy feature preserves the directory structure of zipped bundles.

Data Licensing

Waiver: Harvard Dataverse strongly encourages use of a Creative Commons Zero (CC0) waiver for all public datasets, but dataset owners can specify other terms of use and restrict access to data.

Data Attribution and Citation Tools

Within Harvard Dataverse, specific programs or projects can create nested dataverses (collections). Each nested dataverse can itself contain further nested dataverses or one or more datasets. Harvard Dataverse assigns a DOI to each dataset and data file within a dataverse.
Dataset authors can identify themselves and other types of data contributors using the following types of unique IDs: ORCID, ISNI, LCNA, VIAF, GND, DAI, ResearcherID, Scopus ID.
When substantive changes are made to the metadata or files of a published dataset, a new version number is assigned to the existing dataset citation; the DOI remains constant. Users can choose whether substantial metadata changes should result in a “major version” change.
Minor version changes do not affect the version number in the existing citation, but the minor version number appears in the “versions” tab of the dataset page (*.*). Any deletion, addition, or replacement of data files results in a major version change, which is displayed in the citation (v1 → v2) and in the “versions” tab of the dataset and file landing pages.
Whenever a dataset is edited (metadata or files), the resulting draft version must be published before the changes become visible in the published dataset. Researchers can export dataset citation files in several formats (EndNote XML, BibTeX, RIS) to manage citations in LaTeX, EndNote, Zotero, and more. Web browser plugins (e.g. the Zotero and EndNote plugins) can also extract dataset citation information from dataset pages.
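
The versioning rules described above can be sketched as a small function. This is an illustration only, not Dataverse's implementation: the function names are hypothetical, and resetting the minor number to 0 after a major bump is an assumption not stated on this page.

```python
def next_version(major, minor, files_changed, metadata_changed,
                 treat_metadata_as_major=False):
    """Return the (major, minor) version after publishing an edited draft."""
    # Any file addition, deletion, or replacement forces a major version change;
    # metadata edits do so only when the user opts in.
    if files_changed or (metadata_changed and treat_metadata_as_major):
        return major + 1, 0  # assumption: minor resets to 0
    if metadata_changed:
        return major, minor + 1
    return major, minor


def citation_version(major, minor):
    """Only the major version is shown in the citation, e.g. 'v2'."""
    return f"v{major}"


# A metadata-only edit to version 1.0 yields 1.1; the citation still shows v1.
version = next_version(1, 0, files_changed=False, metadata_changed=True)
label = citation_version(*version)
```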

User Access Controls

Option to Share: Harvard Dataverse allows draft, unpublished, and published (public) datasets. For draft and unpublished datasets, a variety of tiers of access can be assigned to different registered users.

The Harvard Dataverse Repository offers open access, restricted, and embargo options for all files, along with the ability to apply standard licenses and add custom terms of data access. Depositors can monitor access to their files and request or require that data requestors provide information before downloading data, such as who the requestors are and how they intend to use the data.

Data Access Tools

Search

Without logging in, users can browse a Dataverse installation; search for Dataverse collections, datasets, and files; view dataset descriptions and files for published datasets; and subset, analyze, and visualize published data files (restricted and unrestricted).
Data descriptors and metadata: At the dataset level, Harvard Dataverse offers several different metadata templates appropriate for datasets from different disciplines, and the life sciences metadata template adheres to the ISA-TAB specification.
Additional free-form keyword fields are provided. These dataset-level metadata are searchable, but depositors cannot add their own detailed file-level metadata.
Dataverse extracts variable-level metadata from ingested tabular files, extracts metadata from FITS files, and makes that file-level metadata searchable.
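
Search is also available programmatically through the Dataverse Search API. The sketch below builds a search URL and, optionally, fetches results. The `/api/search` endpoint and its `q`, `type`, and `per_page` parameters follow the Dataverse API Guide as understood here; the function names are hypothetical, and the response layout assumed in `search` should be checked against the current guide.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://dataverse.harvard.edu"


def build_search_url(query, result_type="dataset", per_page=10):
    """Build a Search API URL; result_type may be 'dataverse', 'dataset', or 'file'."""
    params = {"q": query, "type": result_type, "per_page": per_page}
    return f"{BASE_URL}/api/search?{urlencode(params)}"


def search(query, **kwargs):
    """Fetch search results (network call; assumes a JSON body with data.items)."""
    with urllib.request.urlopen(build_search_url(query, **kwargs), timeout=30) as resp:
        payload = json.load(resp)
    return payload.get("data", {}).get("items", [])


url = build_search_url("climate change")
```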

Download

In addition to individual file downloads, Harvard Dataverse has multiple APIs for programmatic data and metadata access, as described in the Dataverse API Guide.
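
A minimal sketch of programmatic access follows. It uses the native dataset endpoint and the Data Access API endpoint as documented in the Dataverse API Guide, to the best of this editor's knowledge. The DOI `doi:10.7910/DVN/EXAMPLE`, the file ID, and the function names are placeholders, and restricted files would additionally require an API token, which is not shown.

```python
import shutil
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://dataverse.harvard.edu"


def dataset_metadata_url(doi):
    """URL for a dataset's JSON metadata, addressed by persistent ID."""
    query = urlencode({"persistentId": doi}, safe=":/")
    return f"{BASE_URL}/api/datasets/:persistentId/?{query}"


def datafile_download_url(file_id):
    """URL for downloading one file by its numeric database ID."""
    return f"{BASE_URL}/api/access/datafile/{int(file_id)}"


def download_file(file_id, destination):
    """Stream a public file to disk (network call; not run here)."""
    with urllib.request.urlopen(datafile_download_url(file_id), timeout=60) as resp, \
            open(destination, "wb") as out:
        shutil.copyfileobj(resp, out)


# Placeholder identifiers for illustration only.
meta_url = dataset_metadata_url("doi:10.7910/DVN/EXAMPLE")
file_url = datafile_download_url(42)
```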

Proprietary File Format Access

Ingested tabular files are converted to a non-proprietary, tab-delimited format, which allows some proprietary files to be downloaded in that format as well as in other formats. See Tabular Data File Ingest.
The Dataverse Software can read SPSS files from versions 7 to 22, with limitations, since SPSS does not openly publish the specifications of its proprietary file formats.
Stata is the best supported format for tabular data ingest since documentation is freely and easily available to developers.
Only the newer XLSX Excel files are supported. However, if an Excel file has multiple sheets, only the first sheet of the file will be ingested.
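
Once an ingested file has been downloaded in tabular form, it can be read with standard tools. The sketch below assumes the download is tab-delimited text with a header row; the sample data and the function name are invented for illustration.

```python
import csv
import io


def read_tab_delimited(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


# Invented sample standing in for a downloaded tabular file.
sample = "id\tage\tcountry\n1\t34\tUS\n2\t51\tCA\n"
rows = read_tab_delimited(sample)
```

All values arrive as strings, so numeric columns need explicit conversion before analysis.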

Data Analysis

Harvard Dataverse includes external tools that provide additional features that are not part of the Dataverse Software itself, such as data file previews, visualization, and curation.
Data Explorer provides a UI that displays the variables of tabular data files and allows users to search, chart, and conduct cross tabulation analysis.
File Preview is a set of tools that display the content of files - including audio, html, hypothes.is annotations, images, PDF, text, video, tabular data, spreadsheets, GeoJSON, zip, and NcML files - allowing them to be viewed without downloading.
Whole Tale is a platform for the creation of reproducible research packages that allows users to launch containerized interactive analysis environments based on popular tools such as Jupyter and RStudio. Using this integration, Dataverse users can launch Jupyter and RStudio environments to analyze published datasets.
Binder allows you to spin up custom computing environments in the cloud (including Jupyter notebooks) with the files from your dataset.
A data curation tool provides a GUI for curating data by adding labels, groups, weights, and other details to assist with informed reuse.

Cost

Harvard Dataverse Repository is free for all researchers worldwide (up to 1 TB).
Harvard Dataverse provides free consultation and paid curation services to help collection managers develop their collections and ensure FAIR deposit of content.

Other Features

Pros

Can share data with collaborators or the public. Open content can be accessed directly via the UI or API, and restricted content can be requested using a "request access" feature if enabled by the data depositor.
Assigns a DOI to every published dataset. The repository records file downloads and views and makes this information available to depositors. Depositors who create collections can also ask or require downloaders to provide information about their data re-use.
Uses standard-compliant metadata to ensure that dataset metadata can be mapped easily to common metadata schemas to make data more preservable and interoperable.
Provides a mechanism by which a journal's editors and reviewers can have anonymous access to a dataset or dataverse before it is made public. See Private URLs.

Cons

Does not support sensitive data sharing.
Repository staff do not administer access to restricted data.