Identify research data

Research data are defined by the OECD as "factual records (numerical scores, textual records, images and sounds) used as primary sources for scientific research, and that are commonly accepted in the scientific community as necessary to validate research findings" (OECD, 2007, p. 18).

Research data can be produced in many formats and using a wide range of methodologies. Almost all disciplines and research fields produce research data: mathematics, anthropology, computer science, the humanities, law, etc.

Examples of research data:

  • Documents (text, Word), spreadsheets, slides
  • Photographs, films
  • Surveys, transcripts, codebooks
  • Samples, genomic sequences
  • Laboratory notebooks, field notebooks
  • Audio or video recordings
  • Computer code, algorithms, models, scripts
  • Methodologies and workflows
  • Bibliographies

Given this variety, it can be difficult to identify one's research data accurately. The pyramid below, proposed by Andorfer (2015), helps clarify the role of research data in the scientific research process, especially in the social sciences and humanities:



The following typologies can serve as guides for identifying the research data of a project.

By format: digital versus physical

Data can take a physical (analog, material) form or a digital form:

  • Physical data: manuscripts, field notebooks, etc.
  • Natively digital data: data generated by laboratory instruments, online questionnaires, images, etc.
  • Non-natively digital data: digitized documents, photographs of artworks, etc.

Given the ubiquity of digital technology in work and research, we tend to think of the digital format automatically when we talk about data.

By method of production

The University of Bristol identifies five categories of research data, based on their method of production and reproducibility:

  • Observational:

Observational data are captured in real time in a specific context. They are generally unique and therefore irreplaceable.

Examples: neuroimaging, survey data, field recordings, sample data

  • Experimental:

These data are produced with laboratory instruments or standardized methods. They are potentially reproducible, but reproducing them requires a significant investment of time and money.

Examples: gene sequences, chromatograms

  • Simulation/models:

These data are generated by models or simulations. The model itself is often more important than the data it outputs.

Examples: climate models, economic models

  • Derived/compiled:

These data are the product of processing or combining raw data.

Examples: data mining, compiled databases

Concrete example: UNSCdeb8 Database

  • References:

These data are corpora of reference content in a field, generally published and curated.

Examples: gene databases, archival collections, historical image databases

The Pôle de l'Information Scientifique et Technique of the École des Ponts ParisTech also includes in this typology computer code, which it considers a category in its own right.