Section : 4 Metadata and metadata standards | Research Data Management - An Online Introduction

4.1 Introduction to Metadata or “why metadata is important”

Metadata and metadata standards are important for structuring and organising your data. Metadata is data that contains information about other data. Data does not necessarily have to be digital data; it can also be real objects that are provided with descriptive metadata and thus provide better information about this object. The following practical examples show how relevant detailed documentation using metadata can be:

**Scenario 1:**

You have carried out various measurements in your research project. The research data and results fit your hypothesis exactly. You are very proud! You remember all the settings and parameters very clearly. You have also written down some of them. Due to unfortunate circumstances, you cannot continue working on them for the next few weeks... You come back and realise with horror that you can no longer place much of what you had in your head correctly. You would never have thought that! You try to put everything into the right order. Do you succeed? You discuss the measurement series in your working group. One colleague is not convinced; he has different results. You start doubting. Actually, you are sure; but only actually. Over the next few days, you spend a lot of time repeating some measurements. Now you are quite sure that your results are correct. You document everything in detail to be able to present it convincingly at the next working group meeting. Wouldn't it have been less time-consuming and nerve-wracking if you had created detailed documentation straight away?

**Scenario 2:**

You only realise shortly before your first major publication that research data from an earlier sub-project could be relevant for it. You actually put that project aside three years ago. Is the research data so well documented that you can use it for the publication?

**Scenario 3:**

You have published successfully and have been widely cited. Now someone publicly questions your results and approach. Are you able to substantiate your findings?

In all of these scenarios, documentation with the help of metadata helps. It pays off, at the latest, when you compile your results and research data for your doctorate, habilitation or next publication, or when you hand projects over to successors and new colleagues. Complete and correct metadata are an important contribution to good scientific practice! Metadata are key for finding, searching, reading and interpreting research data and, figuratively speaking, are a kind of “instruction leaflet” for the actual data.

After completing this chapter, you will be able to...

...recognise metadata and the benefits of metadata
...name important categories of metadata
...name selected metadata standards
...create your own metadata
...describe your research data via metadata so that your research data can be used in the future

4.2 When and why do I create metadata?

Metadata ensures that research data can continue to be used today and in the future, even if the people involved in the original experiments have died or have moved on to other research priorities and can no longer provide detailed information about them. Without metadata, such research data is often worthless because it is incoherent and incomprehensible.

In order to assign metadata correctly and to be able to continue to use your data correctly and in an orderly manner, it is best to document metadata right from the start of the research project. However, metadata must be created at the latest when your research data is to be deposited in a repository, published, or archived for the long term.

Often, however, it is no longer possible to create certain metadata retrospectively. This can be the case, for example, in a long project when it is necessary to explain the provenance (origin) of the data precisely for others.

4.3 What do metadata look like?

Metadata always have a certain internal structure, even though the actual application can take different forms (e.g. from a simple text document to a table form to a very formalised form as an XML file that follows a certain metadata standard). The structure itself depends on the described data (for example, use of headers and legends in Excel spreadsheets versus a formalised description of a literary work in an OPAC), the intended use and the standards used. Generally speaking, metadata describe (digital) objects in a formalised and structured way. Such digital objects also include research data. In our application, metadata describe your own research project and related research data in a formalised and structured way.

It is useful, though not strictly necessary, for metadata to be readable by machines as well as by humans, so that research data can be processed automatically. “Machines” here primarily means computers, so one can speak more precisely of computer readability. To achieve this, the metadata must be available in a machine-readable format. Research-specific standards in the markup language XML (Extensible Markup Language) are often used for this, but other formats such as JSON (JavaScript Object Notation) are also in use. When submitting (research data) publications, in most cases there is the option of entering the metadata directly into a prefabricated online form. A detailed knowledge of XML, JSON or other formats is therefore not necessarily required when creating metadata for your own project, but it can contribute to understanding how the research data is processed.

Computer readability is an essential point and becomes important, for example, when related research data are to be found by keyword search or compared with each other. A machine-readable file can be created using special programmes. In the section “How do I create my metadata” you will be introduced to appropriate programmes.

If you are not familiar with the creation of machine-readable metadata files, you should save the metadata for your research data in a form that you can create. For example, a simple text file can be created using the integrated editor of your operating system, in which each line contains information. When doing so, consider which information is important for traceability (e.g. creator of the data, date of creation/experiment, structure of individual experimental set-ups, etc.). The categories depend on the type, scope and structure of the research data. A transfer into a machine-readable form is still possible with proper and comprehensible documentation at the end of a project or a section of the project.
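The later transfer of such a line-by-line text file into a machine-readable form can be sketched as follows. This is only a minimal illustration, assuming that each line follows a `key: value` pattern; the function name `parse_metadata_lines`, the category names and all values are hypothetical and not prescribed by any standard.

```python
import json


def parse_metadata_lines(text):
    """Turn 'key: value' lines into a dictionary; skip blank lines and comments."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so values may contain further colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"Line without 'key: value' structure: {line!r}")
        record[key.strip()] = value.strip()
    return record


# Hypothetical notes as they might be typed into a simple text editor.
notes = """
# Notes for measurement series 3
creator: Jane Doe
date: 2023-05-17
instrument: spectrometer A, slit width 0.5 mm
"""

record = parse_metadata_lines(notes)
print(json.dumps(record, indent=2))
```

Keeping to one consistent pattern per line from the start is what makes such a conversion straightforward at the end of a project.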

Examples of metadata

In the following, a few examples will show what metadata can look like.

Fig. 4.1: Entry of a work in an online library catalogue (source: https://ubmr.hds.hebis.de/Record/HEB060886269?lng=en)

Figure 4.1 shows a book title as an entry in an online library catalogue in a form that you, as a member of a university, have probably seen many times before. It should be noted at this point that metadata is not a new development and does not only play a major role in the digital age, but has already been used before, for example, in the creation of card catalogues in libraries for locating books. The information listed in Figure 4.1 is also nothing more than metadata that can be processed by a processing system and read by users to obtain information about a particular book. They learn about the title, the author(s), the volume, the year of publication, the language, etc.

Although the data from the example above is probably very different from your research data, it illustrates very well the way metadata is collected. If metadata for research data were written in the way shown here, namely in a kind of two-column table, with one column containing the category (e.g. title) and another column containing the actual information (here “King Oedipus”), this information would in any case be helpful for a later researcher to understand the data. However, it would not yet lead to computer systems being able to process this data automatically.

If you have no experience at all with the creation of computer-readable metadata, it is worthwhile, as already mentioned, to use such a tabular list of all relevant data in a file (e.g. .docx, .xlsx, .txt, etc.) at the beginning of a research project and to keep it current, in order to have this data at hand for a possible later submission. Also stick to a sensible versioning concept in order to make changes in the data traceable in the course of the project (see Chapter 8).

Fig. 4.2: Machine-readable example metadata according to the Dublin Core Metadata Element Set (created by Henrike Becker in the project "Fokus")

Figure 4.2 shows part of a machine-readable metadata record written in the markup language XML according to the conventions of the Dublin Core Metadata Element Set, which was first published by the Dublin Core Metadata Initiative in 1995 (more on this in section 4.4 – “What are metadata standards?”). How this can be recognised is explained below.

Everything written in blue in Figure 4.2 are elements, everything written in black is the content of these elements. A simpler understanding of this relationship can be obtained by looking at Figure 4.1: The left column contains the type of information or category (e.g., “title”, “author”, etc.), the right column shows the actual information within this category (e.g., “King Oedipus”, “Sophocles”, etc.). The relationship between the element and the content of the element is analogous, with the type of information/category representing the elements (blue font in Figure 4.2) and the actual information representing the content of the elements (black font in Figure 4.2).

A fundamental difference, however, is the structure: element names are always enclosed in less-than and greater-than signs <...>. In addition, there is an opening and a closing element for each category. The opening element begins with the less-than sign < and always stands before the actual information. The closing element is recognisable by the forward slash / after the less-than sign < and always comes after the actual information of the respective category. These opening and closing elements thus enclose the information content, which is easily recognisable in Figure 4.2. The name of the category is located between the less-than and greater-than signs (e.g., “title”, “creator”). The information written in black between <dc:creator> and </dc:creator> thus identifies the author of the respective document or data, for example. In the case of Figure 4.2, this would be “Henrike Becker”.

At this point, the other elements shown in Figure 4.2 should be briefly explained. The <dc:title> element contains the title under which the document or research dataset was published. Systems that read and display titles from a database often use the content of this element. <dc:subject> can occur several times and always contains a subject of the content in the form of keywords that serve as a basis for searching. The second <dc:subject> element in Figure 4.2 contains a very long specification of a subject (i.e. not only keywords), which should be avoided in order to achieve better search results. The <dc:description> element gives a short summary of the content. In the case of text publications, the table of contents can also be placed there. Multiple entries are also possible for this element.

<dc:date> contains a date, usually the date of publication. If possible, the date should be written according to DIN ISO 8601 as YYYY-MM-DD for better findability. Within this element, sub-elements (so-called child elements) can be placed, which give more precise information about the date, such as whether it is the date of creation, the date of the last change or the date of publication. The <dc:identifier> element is mandatory and occurs only once in a metadata record. The persistent identifier it contains is assigned only once worldwide and uniquely identifies the document or research dataset. More information on persistent identifiers can be found in the following section “Which categories are important” as well as in the section “Findable” of Chapter 5.

The two letters with the colon dc: that precede the actual element name creator etc. in the elements show that the elements come from the Dublin Core Metadata Element Set mentioned at the beginning. Further information on why these two letters should or often even have to be written in front of them is explained in more detail in section 4.4 – “What are metadata standards?”
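The element structure described above can be reproduced with a few lines of Python using the standard library module `xml.etree.ElementTree`. The following is a minimal sketch, not the record from Figure 4.2: the wrapper element `metadata`, the DOI and all content values are hypothetical. The only fixed value is the namespace address `http://purl.org/dc/elements/1.1/`, to which the prefix `dc` is bound.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set; "dc" is the usual prefix.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Create one opening/closing element pair per (name, content) entry."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, content in fields:
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = content
    return root


fields = [
    ("title", "Example measurement series"),
    ("creator", "Jane Doe"),
    ("subject", "metadata"),                  # dc:subject may occur several times
    ("subject", "research data management"),
    ("date", "2023-05-17"),                   # written as YYYY-MM-DD
    ("identifier", "doi:10.1234/example"),    # hypothetical identifier
]

xml_text = ET.tostring(build_record(fields), encoding="unicode")
print(xml_text)

# Reading the record back: the prefix is resolved through the namespace mapping.
parsed = ET.fromstring(xml_text)
subjects = [el.text for el in parsed.findall("dc:subject", {"dc": DC_NS})]
print(subjects)
```

The printed XML shows the same pattern as the figure: each piece of content sits between an opening element such as `<dc:title>` and its closing counterpart `</dc:title>`.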

And now it's your turn. In the table shown, what is data and what is metadata?

Fig. 4.3: Data and metadata of an Excel table

There are very many different categories that can and often must be described by metadata. Depending on the field and research data, these categories can differ greatly, but some are considered standard categories for all disciplines.

One category that should be present in the metadata at the latest in the case of a citable publication is the “persistent identifier” mentioned in the previous section. An identifier is used for permanent and unmistakable identification. The DOI (Digital Object Identifier) is well-known and frequently used. A DOI is assigned by official registries, such as DataCite. Metadata are linked to the document and the research data via a DOI. Research data can be cited via a DOI.

Furthermore, the metadata should indicate who the author of the data is. In the case of research groups, all those involved in the work or who may have rights to the research data should be named. The latter may, of course, include companies that may have contributed to the funding of the research. Always make sure that the names are complete and unambiguous. If a researcher ID (e.g. ORCID) is available, this should be mentioned.

The research topic should be described in as much detail as necessary. In view of the findability of the research data, it can also be useful to mention keywords that can then be used in a digital database search to achieve better results.

Furthermore, for the traceability of the research data, clear information is needed for parameters such as place / time / temperature / social setting, ... and any other conditions that make sense for the data. This also includes instruments and devices used with their exact configurations.

If specific software was used to create the research data, the name of the software must also be mentioned in the metadata. Of course, this also includes naming the software version used, as this makes it easier for researchers to understand later why this data can no longer be opened in the case of very old data.
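The categories discussed so far (identifier, creators with researcher ID, keywords, conditions, software and version) can be collected in a simple structured record. The following sketch is one possible layout, not a metadata standard: the class and field names, the DOI and the software name are hypothetical, and the ORCID shown is the sample ID that ORCID publishes for the fictitious researcher Josiah Carberry.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Creator:
    name: str
    orcid: Optional[str] = None  # researcher ID, if available


@dataclass
class DatasetMetadata:
    identifier: str                                   # persistent identifier, e.g. a DOI
    title: str
    creators: list                                    # list of Creator objects
    keywords: list = field(default_factory=list)
    conditions: dict = field(default_factory=dict)    # place, time, temperature, ...
    software: dict = field(default_factory=dict)      # software name -> version

    def missing_categories(self):
        """Return the names of core categories that are still empty."""
        core = ("identifier", "title", "creators", "keywords", "software")
        return [name for name in core if not getattr(self, name)]


record = DatasetMetadata(
    identifier="doi:10.1234/example",                 # hypothetical DOI
    title="Example measurement series",
    creators=[Creator("Josiah Carberry", orcid="0000-0002-1825-0097")],
    conditions={"place": "Lab 2", "temperature_celsius": 21.5},
    software={"ExampleAnalyzer": "2.4.1"},            # hypothetical software and version
)

print(record.missing_categories())   # the keywords have not been filled in yet
print(asdict(record)["creators"])
```

A check such as `missing_categories` is a cheap way to notice gaps while the project is still running, rather than at submission time.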

Some metadata requirements are always the same. This also applies to the categories just listed, which are very generic. For such cases, there are subject-independent metadata standards, including the already introduced Dublin Core Metadata Element Set. Other requirements can differ greatly between different disciplines. Therefore, there are subject-specific standards that cover these requirements. You can read more about this in the next section 4.4 – “What are metadata standards?”.

Figure 4.4 shows different categories of metadata that may prove useful with regard to research data.

Fig. 4.4: Listing of sample categories (Created by Henrike Becker in the project "Fokus")

4.4 What are metadata standards and why are they important?

One very important aspect of metadata already mentioned at the beginning is its readability for humans and machines. The large number of different metadata needed to describe research data can become a problem in view of the additional large number of different scientific communities, each with their own needs. On the one hand, there is metadata that is necessary across scientific fields (e.g., name of author, title, date of creation, etc.), but on the other hand there is also subject-specific metadata that depends on the research area or even the research subject.

Imagine that research group 1 has created a lot of research data over several experiments of the same kind with different room temperatures. Research group 2 has conducted the same experiment with the same substances at the same room temperature and different levels of oxygen in the air and has also created research data. Research group 1 refers to the parameter “room temperature” as “rtemp” in their metadata, but research group 2 only refers to it as “temp”. How do the researchers of research group 1 and how does a computer system know that the value “temp” of research group 2 is the value “rtemp” of research group 1? It’s just not easily possible and thus reduces the usefulness of the data.

So how can it be ensured that both research groups use the same vocabulary when describing their metadata, so that in the end it is not only readable but also interpretable? For such cases, metadata standards have been and are being developed by various research communities to ensure that all researchers in a scientific discipline use the same descriptive vocabulary. This ensures interoperability between research data, which plays a crucial role in expanding knowledge when working with data (for more information on “interoperability” see Chapter 5).
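The “rtemp” versus “temp” problem can be made concrete with a small sketch. Without an agreed vocabulary, someone has to write and maintain a translation table by hand for every pair of groups; with a shared standard, this table becomes unnecessary. The group labels, the shared term names and the values below are purely hypothetical.

```python
# Hand-written translation table from group-specific names to a shared vocabulary.
CROSSWALK = {
    "group1": {"rtemp": "room_temperature_celsius"},
    "group2": {"temp": "room_temperature_celsius", "o2": "oxygen_fraction"},
}


def harmonise(record, source):
    """Rename the keys of a metadata record to the shared vocabulary.

    Keys without an entry in the table are kept unchanged.
    """
    mapping = CROSSWALK[source]
    return {mapping.get(key, key): value for key, value in record.items()}


record_group1 = harmonise({"rtemp": 21.5}, "group1")
record_group2 = harmonise({"temp": 21.5, "o2": 0.19}, "group2")

# Only after harmonisation can the two records be compared automatically.
print(record_group1["room_temperature_celsius"] == record_group2["room_temperature_celsius"])
```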

Metadata standards thus enable a uniform design of metadata. They are a formal definition, based on the conventions of a research community, about how metadata should be collected and recorded. Despite this claim, metadata standards do not represent a static collection of rules for collecting metadata. They are dynamic and adaptable to individual needs. This is particularly necessary because research data in projects with new research methods can be very project-specific and therefore the demands on their metadata are just as strongly project-specific.

The following table lists some examples of metadata standards from different disciplines. If your discipline is not listed, the listing of the Digital Curation Centre (DCC) can usually provide information on which standards are applicable to your field of science.

Academic discipline	Name of the standard(s)
interdisciplinary	DataCite Schema, Dublin Core, MARC21, RADAR
Humanities	EAD, TEI P5, TEI Lex0
Earth Sciences	AgMES, CSDGM, ISO 19115
Climate science	CF Conventions
Arts & Cultural Studies	CDWA, MIDAS-Heritage
Natural sciences	CIF, CSMD, Darwin Core, EML, ICAT Schema
X-ray, neutron, and muon research	NeXus
Social and economic sciences	DDI

Tab. 4.1: Some metadata standards sorted by scientific discipline

Cross-disciplinary standards are metadata standards that describe objects in a general way. The Dublin Core standard, partially described above, is one of these standards.

The “EAD” standard is used to describe archival finding aids. “TEI P5” provides standards for annotating texts and manuscripts. “TEI Lex0” is a newly developed standard based on “TEI P5” for describing lexicographic data.

“AgMES” is used to describe information from the agricultural sector. “CSDGM” is a standard for the description of digital spatial data, which is still in use but will be replaced by the “ISO 19115” standard in the long term. The Federal Geographic Data Committee (FGDC), the developer of the “CSDGM” standard, therefore encourages all interested parties to use the “ISO 19115” standard for the description of digital spatial data. The “CF Conventions” provide metadata for the description of climate and weather information.

The “CDWA” standard provides facilities for describing art, architecture and other cultural works. “MIDAS-Heritage” is a standard for describing cultural heritage. This includes buildings, monuments, excavation sites, shipwrecks, battlefields, artefacts, etc.

“CIF” provides standards for research in crystallography. “CSMD” provides descriptive capabilities for scientific studies in disciplines that perform systematic experimental analyses on substances (e.g. materials science, chemistry, biochemistry). The “ICAT Schema” is based on “CSMD” and serves the same purpose but offers even more precise description possibilities. “Darwin Core” is used to describe biodiversity, for example records of living organisms. “EML” is a standard used exclusively in the field of ecology.

The “DDI” standard is used to describe data collected through surveys or other observational research methods in the social and economic sciences as well as in behavioural research.

Some publishers have their own metadata standards that must be taken into account when publishing. It is best to check the specific features at the beginning of your project, when you already have a journal in mind for publication. Some research data archives also have their own metadata standards, e.g. GenBank.

4.5 What are controlled vocabularies and authority files? What are they used for?

As you have seen so far, metadata standards define the categories with which data can be described in more detail. On the one hand, these include interdisciplinary categories such as title, author, date of publication, type of study, etc., but on the other hand they also include subject-specific categories such as substance temperature in chemistry or materials science. However, there is no definition or control of how you fill the respective categories with information.

What date format do you use? Is the temperature given in Celsius or Fahrenheit and with “°” or “degrees”? Is it a “survey” or a “questionnaire”? These questions seem superficial at first glance, but predefined and uniform terms and formats are closely related to machine processing, the search results and linkage with other research data. For example, if the date format does not correspond to the format a search system works with, the research data with the incompatible format will not be found. If questionnaires are searched for, but the term “survey” is used in the metadata, it is not certain that the associated research data will also be found.
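How uniform formats can be enforced before metadata are stored is sketched below. The list of accepted input formats is an assumption for this example; in particular, dates written with slashes are read as day/month/year here, which would be wrong for US-style dates. This kind of ambiguity is exactly why a single agreed format such as YYYY-MM-DD is preferable.

```python
from datetime import datetime

# Input formats this sketch accepts (an assumption, not an exhaustive list).
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def to_iso_date(text):
    """Convert a date string in one of the known formats to YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")


def fahrenheit_to_celsius(value):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (value - 32) * 5 / 9


print(to_iso_date("17.05.2023"))
print(to_iso_date("17/05/2023"))
print(fahrenheit_to_celsius(212))
```

Both date strings end up in the same form, so a search system that expects YYYY-MM-DD would find both datasets.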

For the purpose of linguistic standardisation in the description of metadata, so-called controlled vocabularies have been developed. In the simplest form, these can be pure word lists that regulate the use of language in the description of metadata, but also complex, structured thesauri. Thesauri are word networks that contain words and their semantic relations to other words. This makes it possible, among other things, to unambiguously resolve polysemous (ambiguous) terms.
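In its simplest form, a controlled vocabulary can be pictured as a lookup from every permitted spelling to one preferred term. The word list below is hypothetical and only serves the example; whether “questionnaire” and “survey” should really be treated as equivalent is a decision for the respective discipline.

```python
# Hypothetical word list: every accepted variant points to one preferred term.
PREFERRED_TERMS = {
    "survey": "survey",
    "questionnaire": "survey",
    "poll": "survey",
    "interview": "interview",
}


def normalise_term(term):
    """Return the preferred term; reject anything outside the vocabulary."""
    key = term.strip().lower()
    if key not in PREFERRED_TERMS:
        raise KeyError(f"Term not in controlled vocabulary: {term!r}")
    return PREFERRED_TERMS[key]


def matches(query, keywords):
    """Check whether a search term and any stored keyword share a preferred term."""
    wanted = normalise_term(query)
    return any(normalise_term(keyword) == wanted for keyword in keywords)


# A search for "questionnaire" now finds data that was described as "Survey".
print(matches("questionnaire", ["Survey", "interview"]))
```

A full thesaurus goes further than this flat list by also recording relations such as broader, narrower and related terms.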

As a researcher or research group, how can you ensure the use of consistent terms and formats? As an individual in a scientific discipline, it is worth asking about controlled vocabularies within that discipline at the beginning of a research project. A simple search on the internet is usually enough. Even in a research group with a research project lasting several years, a controlled vocabulary should be searched for before the project begins and before the first analysis. If none can be found, it is worthwhile, depending on the number of researchers involved in the project and the number of sites involved, to create an internal project document for the uniform coordination of the terms and technical terms used, which should be used in the respective metadata categories.

In addition to controlled vocabularies, there are also a large number of authority files which, in addition to uniform naming, make a large number of entities uniquely referenceable. ORCID, short for Open Researcher and Contributor ID, which identifies academic and scientific authors via a unique code, has already been mentioned above. The specification of such an ID distinguishes any frequently occurring and therefore ambiguous names and should hence be used as a matter of preference.
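An ORCID consists of 16 characters, the last of which is a check character calculated from the first 15 digits (ISO 7064 MOD 11-2). This makes it possible to catch typing errors before an ID is written into the metadata. The following is a minimal sketch of that check; the function names are hypothetical, and the check only tests the form of the ID, not whether it is actually registered. The ID used is the sample ORCID published for the fictitious researcher Josiah Carberry.

```python
def orcid_check_character(first_15_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in first_15_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check length, characters and check character of an ORCID like 0000-0002-1825-0097."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # sample ID of the fictitious Josiah Carberry
print(is_valid_orcid("0000-0002-1825-0098"))  # last character altered
```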

Probably the best-known authority file in Germany is the Gemeinsame Normdatei (GND, Eng.: Integrated Authority File), which is maintained by the Deutsche Nationalbibliothek (DNB, Eng.: German National Library), among others. It describes not only persons, but also “corporate bodies, conferences, geographies, subject terms and works related to cultural and scientific collections” (Gemeinsame Normdatei (GND), 2019, About the GND). Each entity in the GND is given its own GND ID, which uniquely references that entity. For example, the poet “Sophocles” has the ID 118615688 in the GND. This ID can be used to reference Sophocles unambiguously in metadata with reference to the GND.

GeoNames is an online encyclopaedia of places, also called a gazetteer. It contains all countries and more than 11 million place names that are assigned a unique ID. This makes it possible, for example, to directly distinguish between places with the same name without knowing the officially assigned municipality code (in Germany, the postcode). For example, Manchester in the UK (2643123), Manchester in the state of New Hampshire in the US (5089178) and Manchester in the state of Connecticut in the US (4838174) can be clearly distinguished.

In general, find out about specific requirements as soon as you know where you want to store or publish your research data. Once you know these requirements, you can create your own metadata. When referring to specific commonly known entities, always try to use a unique ID, specifying the thesaurus used.
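Storing an ID together with the name of the authority file it comes from can be sketched as follows. The IDs are the ones mentioned in this chapter; the class name `AuthorityReference` and the web address patterns are assumptions made for this illustration and should be checked against the documentation of the respective service.

```python
from dataclasses import dataclass

# Assumed address patterns for building a link from scheme and ID.
URI_PATTERNS = {
    "GND": "https://d-nb.info/gnd/{id}",
    "GeoNames": "https://www.geonames.org/{id}",
    "ORCID": "https://orcid.org/{id}",
}


@dataclass(frozen=True)
class AuthorityReference:
    label: str    # human-readable name, may be ambiguous
    scheme: str   # authority file the ID belongs to
    id: str       # unique ID within that authority file

    def uri(self):
        """Build a web address from the scheme and the ID."""
        return URI_PATTERNS[self.scheme].format(id=self.id)


sophocles = AuthorityReference("Sophocles", "GND", "118615688")
manchester_uk = AuthorityReference("Manchester", "GeoNames", "2643123")
manchester_nh = AuthorityReference("Manchester", "GeoNames", "5089178")

print(sophocles.uri())
# Same label, different IDs: the two places remain distinguishable.
print(manchester_uk.label == manchester_nh.label, manchester_uk == manchester_nh)
```

Without the scheme, a bare number such as 2643123 would be meaningless to a later reader, which is why both parts are kept together.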

If you want to know whether a controlled vocabulary or ontology already exists for your scientific discipline or a specific subject area, you can carry out a search at BARTOC, the “Basic Register of Thesauri, Ontologies & Classifications” as a first step.

4.6 How do I create my metadata?

Metadata can be created manually or with the help of programmes. Programmes, also for subject-specific metadata, are available on the internet, many of them for free. Nevertheless, first find out for your institution whether experience has already been gained and whether licences are available for proprietary software commonly used in your research area. The following list of programmes for creating metadata is only a selection and does not claim to be complete.

If you have no experience at all with metadata standards, the editor integrated in Windows can be used to create metadata for the time being. This makes sense in order to have key data on the respective examinations at all and to be able to call them up later. It is best to save the individual text files in individual folders per examination.

Programmes with simple graphical user interfaces are not available for all metadata standards. Therefore, if you want or need to work directly with an existing XML metadata standard, you should either use the free editors Notepad++ or Atom or the paid software oXygen, if licences are available at your institution. All three editors offer better usage and display options to make content and element labels visible separately. For example, as in Figure 4.2, elements are displayed in blue and the actual content in black.

The open source online tool CEDAR Workbench allows online templates based on metadata standards to be created via a graphical user interface, filled in and also shared with other users. At the same time, templates created by other users can also be used for one's own research. All you need to do is register free of charge.

The tool Annotare is suitable for annotating biomedical research and results. It works according to the MIAME (Minimum Information About a Microarray Experiment) quality standard for microarrays and generates data in MAGE-TAB format (MicroArray Gene Expression Tabular). The metadata are entered into simple input fields in the programme. Precise knowledge of the metadata standard is therefore not necessarily required.

The ISA framework is suitable for describing biomedical research, but also for experiments in the life sciences and environmental research. It is open source and consists of several programmes that can help in the management of experiments, from planning and execution to the final description. You can start with the ISA Creator, which is used to create files in the ISA-TAB format. This format is explicitly required, for example, by the journal Scientific Data of the Nature publishing house.

To create metadata in the metadata standard EML, the programme MetacatUI should be used. It allows data and metadata to be stored in a single file, which facilitates archiving. It is also directly linked to the Knowledge Network for Biocomplexity (KNB), an international subject-specific repository for ecological and environmental research. Data can thus be uploaded directly to the repository and made available for others to use.

The programme CatMDEdit is suitable for metadata collection in the geosciences according to ISO 19115. The metadata created are also compliant with the Dublin Core standard.

Which programmes are suitable for your metadata depends very much on the type of research data and your wishes for use. It is therefore worthwhile to talk to other researchers in advance to find the best way to create metadata for yourself. Creating metadata manually in an editor without a metadata standard as a basis is the easiest and fastest method for beginners, but familiarising yourself with the metadata standard relevant to you and searching for programmes that use this standard can have an advantage in terms of automatic processing of the data and later publication. At the very least, the use of a simple, subject-independent metadata standard such as Dublin Core should be considered.
