Creating metadata
Metadata, literally "data about data", is information that describes the basic characteristics of a data item or dataset, regardless of the medium (physical or digital).
For example:
- Its author(s)
- Its content
- Date of creation
- The place of capture/production
- The reason why the data was generated
- How the data was created
- etc.
These different elements are called metadata fields.
The role of metadata is therefore to place research data in the context of its creation and use, making the data easier to understand, process, and potentially reuse, whether by its creators or by others. Metadata should be as complete as possible, follow the standards and conventions of the relevant discipline, and be machine-readable.
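As a minimal sketch of what "machine-readable" can mean in practice, the kinds of fields listed above can be stored as key-value pairs and serialized to JSON. The field names and values below are hypothetical, chosen only for illustration, and do not follow any particular standard.

```python
import json

# Hypothetical record: field names and values are illustrative only
# and do not follow any particular metadata standard.
record = {
    "authors": ["Jane Doe", "John Smith"],
    "content": "Hourly air temperature measurements",
    "date_created": "2021-05-14",
    "place": "Geneva, Switzerland",
    "purpose": "Calibration of a low-cost weather station",
    "method": "Automatic logging with a digital thermometer",
}


def to_json(metadata: dict) -> str:
    """Serialize a metadata record to a human- and machine-readable JSON string."""
    return json.dumps(metadata, indent=2, ensure_ascii=False, sort_keys=True)


def from_json(text: str) -> dict:
    """Load a metadata record back from its JSON representation."""
    return json.loads(text)


if __name__ == "__main__":
    print(to_json(record))
```

Storing such a record in a file next to the data (a so-called sidecar file) is one simple way to keep the description and the dataset together.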
Typology
Metadata can be classified into several main families. Several typologies coexist; one example is the typology proposed by the Australian National Data Service (ANDS), which distinguishes six families of metadata.
Standards / Disciplinary metadata schemas
Deciding exactly which metadata to record is difficult, because the choice depends heavily on how the data was produced and how it will be used. For this reason, various initiatives have created templates listing the elements needed to describe data in a given discipline or for a specific purpose. These templates are called metadata standards.
Examples of standards:
- Dublin Core: a general-purpose standard consisting, in its initial version, of 15 elements. It is used to describe a wide range of physical and digital resources, such as books, web pages, and images.
- Darwin Core: a standard derived from Dublin Core and developed for the specific needs of biodiversity informatics, where it is used to describe biodiversity information and make it easier to share.
- Data Documentation Initiative (DDI): an international standard for describing data produced in the social, behavioral, economic, and health sciences. DDI standards make it possible to document and discover data and to make it interoperable. The specifications and tools are available on the DDI website.
- Digital Imaging and Communications in Medicine (DICOM): an international standard, recognized as ISO 12052, for medical images and their related information. It defines formats for medical images that can be exchanged with the data and quality necessary for clinical use.
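To make the first example in this list more concrete, the following sketch builds a small record using the 15 elements of the original Dublin Core element set and writes it as XML. The wrapper element `metadata`, the function names, and the sample values are assumptions made for illustration; real repositories usually prescribe their own container format.

```python
import xml.etree.ElementTree as ET

# Namespace of the original 15-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_dc_record(fields: dict) -> ET.Element:
    """Build an XML element tree from a dict of Dublin Core fields."""
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"Not Dublin Core elements: {sorted(unknown)}")
    # "metadata" is a hypothetical wrapper element, not part of the standard.
    root = ET.Element("metadata")
    for name in DC_ELEMENTS:  # keep a stable element order
        if name in fields:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = fields[name]
    return root


def dc_to_xml(fields: dict) -> str:
    """Serialize Dublin Core fields to an XML string using the 'dc' prefix."""
    ET.register_namespace("dc", DC_NS)
    return ET.tostring(build_dc_record(fields), encoding="unicode")


if __name__ == "__main__":
    print(dc_to_xml({"title": "Temperature log", "date": "2021-05-14"}))
```

All elements are optional in this sketch, which mirrors the permissive nature of a bare element list; constraints on which fields are required and how values are written are discussed further on.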
Many academic disciplines have formalized specific metadata standards adapted to the needs of their communities and the reuse of their data.
On its website, the Digital Curation Centre (DCC) offers a page gathering these standards with general information for each of them, tools to implement them and use cases of data repositories that currently use them.
The FAIRsharing initiative also provides a summary table of metadata standards.
When a list of metadata fields is given a defined structure and its values are constrained in format or in the options allowed, it becomes a metadata schema.
Metadata schemas thus provide lists of elements to be filled in, whether mandatory or optional, together with the precise syntax to use. For example, a schema may require dates to be written as 2021-05-14 or 20210514.
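The difference between a plain list of fields and a schema can be sketched in a few lines of code. The example below assumes a hypothetical three-field schema (the field names, the rules, and the `validate` function are all invented for illustration) in which each field is marked mandatory or optional and dates must follow one of the two formats mentioned above.

```python
import re
from datetime import datetime

# Accept only zero-padded dates written as 2021-05-14 or 20210514.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}|\d{8}")


def is_valid_date(value) -> bool:
    """Check both the syntax and the calendar validity of a date string."""
    if not isinstance(value, str) or not DATE_PATTERN.fullmatch(value):
        return False
    fmt = "%Y-%m-%d" if "-" in value else "%Y%m%d"
    try:
        datetime.strptime(value, fmt)
    except ValueError:  # e.g. 2021-02-30 does not exist
        return False
    return True


def is_nonempty_text(value) -> bool:
    return isinstance(value, str) and value.strip() != ""


# Hypothetical schema: each field has a status and a syntax rule.
SCHEMA = {
    "title": {"mandatory": True, "check": is_nonempty_text},
    "date_created": {"mandatory": True, "check": is_valid_date},
    "keywords": {"mandatory": False, "check": lambda v: isinstance(v, list)},
}


def validate(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    for name, rule in SCHEMA.items():
        if name not in record:
            if rule["mandatory"]:
                errors.append(f"missing mandatory field: {name}")
            continue
        if not rule["check"](record[name]):
            errors.append(f"invalid value for field: {name}")
    for name in record:
        if name not in SCHEMA:
            errors.append(f"unknown field: {name}")
    return errors


if __name__ == "__main__":
    print(validate({"title": "Temperature log", "date_created": "14/05/2021"}))
```

Real schemas are normally published in a dedicated schema language with existing validators, so a hand-written check like this one is only meant to show the principle.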
Example of a schema:
- DataCite schema: a list of fields selected for their ability to identify a resource accurately and consistently for citation and retrieval purposes. The fields are classified into three categories: mandatory, recommended, and optional. Full documentation, including recommended usage instructions, is available on the DataCite website.
When and how to create metadata
As with research data management as a whole, metadata should be created as early as possible and updated throughout the project. This avoids an excessive workload at the end of the project, when the research data is archived.
Metadata can be created manually or by relying on software or platforms to facilitate or automate this process. These platforms can be general or discipline-specific.
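As a rough illustration of the automated route, some technical metadata can be read directly from a file without any manual input. The sketch below uses only the Python standard library; the function name and the choice of fields are assumptions made for this example, and descriptive fields such as the author or the purpose of the data would still have to be added by hand.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def extract_file_metadata(path) -> dict:
    """Collect basic technical metadata that can be derived from a file itself."""
    p = Path(path)
    stat = p.stat()
    # Guess the format from the file extension; None if it is not recognized.
    mime_type, _ = mimetypes.guess_type(p.name)
    return {
        "file_name": p.name,
        "size_bytes": stat.st_size,
        # Last modification time, expressed in UTC in ISO 8601 form.
        "last_modified": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "format": mime_type or "unknown",
        # Checksum that lets others verify the file has not changed.
        # Reads the whole file into memory, so suited to small files only.
        "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
    }


if __name__ == "__main__":
    import sys

    for file_path in sys.argv[1:]:
        print(extract_file_metadata(file_path))
```

Running such a script whenever new files are added to a project is one way to spread the work over the course of the project rather than leaving it until archiving.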
The Digital Curation Centre has compiled a list of these tools.