Section: 8 Data storage and archiving | Research Data Management - An Online Introduction

8.1 Introduction and learning objectives

The following chapter provides a closer look at the fourth stage of the research data lifecycle: archiving and storing data.

After completing this chapter, you will be able to...

...assess the risks of careless handling of data.
...apply strategies for a secure backup.
...name the requirements for (long-term) archiving.
...recognise the advantages and disadvantages of relevant file formats.
...understand the benefit of special precautions that need to be taken to archive and make data available for the long term.

8.2 Storage media and locations: advantages and disadvantages

As already noted in Chapter 7, research data should be saved regularly, and progress and changes should be marked and well documented via versions if possible.

Saving should be done on different media. When deciding on a medium, you should consider the following factors according to Ludwig/Enke (2013, p. 33):

Size of the dataset
Number of datasets
Frequency of data access

Storage media have different properties, which means that there are sometimes considerable differences in protection against data loss and unauthorised access depending on the medium. The following is a compact overview of the properties, advantages and risks of the most common storage media and locations:

Storage medium / location	Advantages	Disadvantages
Own PC	Full personal control over security and backups	Everything that happens to the PC also happens to the backup; possibly a lack of resources and know-how to configure the backups and check their quality; individual solutions are time-consuming, costly and inefficient compared with a solution for the whole working group
Mobile storage medium (e.g. CD, DVD, USB stick, external hard drive)	Easy to transport; can be stored in a lockable cabinet or safe	Particularly easy to lose or steal and therefore extremely insecure; content is not protected in case of loss unless it was encrypted beforehand; sensitive to temperature, air quality and humidity; external hard drives are particularly prone to shock damage and wear
Institutional storage locations (e.g. server of your university)	Backup of the data is ensured; professional implementation and maintenance; storage in accordance with the institution's data protection policy; data protection regulated via access rights; can be used worldwide for mobile working	Speed depends on the network; access to backups possibly delayed by official channels; it may be unclear which security criteria and security strategies are applied; possibly associated with higher costs
External storage locations (e.g. cloud services of external companies)	Easy to use and manage; professionally maintained; can be used worldwide for mobile working	Depending on the provider, the connection may be insecure; dependent on internet access; upload and download can take a long time; access to backups possibly delayed; unclear which security criteria and security strategies are applied and whether they comply with the specifications for sensitive data; many institutions have issued special regulations for the use of such services

Tab. 8.1: Advantages and disadvantages of different storage media and locations

CDs and DVDs are optical media. They should always be stored in suitable containers at about 30-50 % humidity and at a stable temperature between -9 °C and 23 °C. Magnetic storage media such as hard disks and tapes are likewise very prone to wear (Corti et al., 2014, p. 87).

The use of free cloud storage services such as Dropbox, OneDrive or Google Drive should be avoided. Because the servers of these providers are located in the US, US law applies to the data and to your privacy. This must be viewed critically, especially in light of the USA PATRIOT Act of 2001: the data is not protected against all unwanted access by third parties, and you cannot control what happens to it.

Frankfurt UAS offers the use of Nextcloud as a safe alternative to all university members (with the exception of students) with a valid CIT-Account.

Nextcloud is an open source solution for storing files (file hosting). Functionally, it is comparable to Dropbox, Google Drive or other sync-and-share services. However, all files remain stored on the university's servers. There are five gigabytes available per user for file storage. The files can be synchronized with local storage via a client or accessed at nextcloud.frankfurt-university.de. For more information, visit the Nextcloud Knowledge Base on Confluence.

Non-digital media should not be forgotten either. Much data is handwritten or printed on paper-based materials (e.g. photos). Here, sunlight, acid or fingerprints in particular contribute to quick wear. If data is stored on paper, according to Corti et al. (2014, p. 87) you should...

...use acid-free paper.
...use folders and boxes.
...use stainless steel paper clips.

You should also scan the data so that it is available in a digital format. If necessary, the digital data can later be converted back into a physical format, for example by printing. The PDF/A format is particularly suitable for digitisation. Not all documents can be converted to PDF/A without problems, but there are free tools that check PDF/A conformity. If PDF/A is not suitable for your data, simply scan it as a regular PDF.

It should also be noted that at least two people should have access to the data in order to guarantee the availability of the data even in case of illness or absence.

8.3 Data security and encryption

As can be seen from the previous list of advantages and disadvantages of different storage locations and media, the question is not only where you should store data, but also how you store it. You can contribute to the security and safety of your (sensitive) data by, for example, storing your storage media in a separate, lockable room or cabinet and securing notebooks against theft with a lock. If you have to log in to an account first to view the data, it can also make sense to use a two-step verification process, preferably via a physical authentication key (e.g. YubiKey). However, find out beforehand whether the server you are logging in to also supports one of the protocols offered by the authentication key.

However, physical protection is not enough; your data must also be protected digitally. Encryption is an important means of achieving data security, and encryption software can help you secure both individual files and entire storage locations. Note that special precautions must be taken when dealing with sensitive data. According to Corti et al. (2014, p. 88), data security operates on three levels to prevent unauthorised access, unwanted changes, destruction and disclosure of data:

Physical security	Restrict access / entry to buildings; include hardcopy material; transport / move sensitive data only in exceptional cases
Network security	Do not store sensitive data on external servers; keep the firewall up to date and update it regularly
Information and computer security	Protect computers with passwords and firewalls; provide surge protection by using UPS (uninterruptible power supply) devices; protect files with passwords; set access rights to files; encrypt restricted-access data; obtain confidentiality declarations from data users; do not transmit unencrypted data via email; Google Docs, Dropbox etc. are not always appropriate; if data is to be destroyed, destroy it correctly (see Section 8.4)

Tab. 8.2: Three levels of data security

8.4 Data destruction

Data destruction is closely linked to data security. First find out from the HRZ (University Computer Centre) responsible for your university which services are provided to ensure professional data destruction.

Anyone who has already had to make use of data recovery or has carried it out themselves knows that simply deleting the data does not destroy it permanently. This means that the data can be recovered by unauthorised persons. So how do you destroy data correctly? First of all, the answer to this question depends on the type of storage medium chosen.

Even reformatting a hard disk does not delete data completely. Only the references to the files are removed, so the data can no longer be found without recovery software but is still present on the disk. To delete data permanently, it must therefore be overwritten before formatting, and the data medium must undergo a full (deep) format. Programmes like Eraser, WipeFile or Permanent Eraser can help you with this. If the hard disk is to be taken out of service and the data is very sensitive, you should have it destroyed by a company that specialises in the destruction of data media.
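The idea of overwriting data before deleting it can be illustrated with a minimal Python sketch. The function names `overwrite_file` and `overwrite_and_delete` are hypothetical and not taken from any of the tools named above. The sketch only replaces the visible contents of a single file with random bytes. On SSDs, USB flash media, journaling or copy-on-write file systems and synchronised folders, older copies of the data may survive, so this is not a substitute for dedicated erasure software or physical destruction.

```python
import os


def overwrite_file(path, passes=1, chunk_size=1024 * 1024):
    """Overwrite the contents of a file in place with random bytes.

    Illustrative only: the storage device or file system may still keep
    older copies of the data elsewhere.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as handle:
        for _ in range(passes):
            handle.seek(0)
            remaining = size
            while remaining > 0:
                block = os.urandom(min(chunk_size, remaining))
                handle.write(block)
                remaining -= len(block)
            # Ask the operating system to write the new bytes to the device.
            handle.flush()
            os.fsync(handle.fileno())


def overwrite_and_delete(path, passes=1):
    """Overwrite a file and then remove its directory entry."""
    overwrite_file(path, passes=passes)
    os.remove(path)
```

A call such as `overwrite_and_delete("interview_raw.csv")` (a made-up file name) would first replace the file's bytes and only then delete it, which is the order described above.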

The easiest way to erase data on USB sticks is to destroy them physically. This also applies to external hard drives, CDs/DVDs and non-digital data. In DIN 66399, published in 2012, the German Institute for Standardisation (DIN) defined three protection classes and seven security levels for the destruction of data carriers, depending on the type of carrier. The standard stipulates that the higher the protection class and security level of the data, the smaller the residual particles must be after shredding in relation to the original data carrier, so that the physical data carrier can no longer be reassembled. This requires machines that in most cases are owned only by companies specialising in data destruction.

8.5 Backup

In contrast to these measures for deleting data permanently and safely, data can also be lost unintentionally. To protect against accidental deletion or destruction, you should back up your data regularly.

A backup copy should always be created in a planned and structured way on a storage medium that is separate from the infrastructure you normally use. Data should be backed up as regularly as possible so that it can be restored with as little effort as possible. Before you back up your data, however, you should clarify some organisational questions:

Are there already ongoing backup plans? What do they look like?
How often should a backup be made of what?
Where should the backups be stored?
How should the backups be saved? (e.g. labelling, sorting, file format)
Which backup tools can help?
How is sensitive data handled?

It is recommended to use an automated routine. Data that is currently being worked on should be backed up daily if possible. It is also advisable not to overwrite the previous day's backup, as keeping older copies allows you to trace errors or undo changes that were made by mistake. In addition, a complete backup should be made weekly. The principles of the 3-2-1 backup rule are useful here (see figure 8.1).

3 copies, 2 different storage media, 1 decentralized storage

Fig. 8.1: The 3-2-1 backup rule (CC-BY SA, Andre Pietsch)

A decentralised storage location refers to the institutional and external storage locations listed in Table 8.1. You should always prefer an institutional decentralised storage location.
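A daily backup that does not overwrite earlier versions could be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a replacement for the backup service of your institution: the function name `make_dated_backup` and the folder naming scheme `<name>_<YYYY-MM-DD>` are hypothetical, and the sketch always copies the complete folder rather than only the changed files.

```python
import shutil
from datetime import date
from pathlib import Path


def make_dated_backup(source_dir, backup_root, day=None):
    """Copy source_dir to backup_root/<name>_<YYYY-MM-DD>.

    Earlier backups are never overwritten. If a backup for the given day
    already exists, it is left untouched and its path is returned.
    """
    source_dir, backup_root = Path(source_dir), Path(backup_root)
    day = day or date.today()
    target = backup_root / f"{source_dir.name}_{day.isoformat()}"
    if target.exists():
        return target  # keep the existing copy for that day intact
    backup_root.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source_dir, target)
    return target
```

To follow the 3-2-1 rule, `backup_root` would have to point to a different storage medium than the working copy, and a further copy would have to be kept at a decentralised location. The function could be run once a day by a scheduler such as cron or the Windows Task Scheduler.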

The backup, and in particular the recovery of data from it, should be tested at the beginning as well as at regular intervals. Most institutions offer an automatic solution in which all data is stored exclusively on backed-up drives provided by the university computer centres. This professionalisation ensures that backups are not forgotten and that the backup system does not have to be configured individually.

In addition, you can use checksums to verify your backups after they have been created. To do this, generate MD5 or SHA1 checksums for both the original files and the backup files. The utility “File Checksum Integrity Verifier” (FCIV), provided by Microsoft, helps you do this. If the checksums of the original data and the backup are identical, so is the data. In this way, you can check the integrity of your data and detect errors that occurred during copying. Incidentally, if you publish software, it is customary to include the checksum of the installation file (“*.exe”) with the download so that users can check beforehand that it is the original installation file and not, for example, a file infected with a virus.
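As an alternative to a dedicated utility, the checksum comparison can be sketched with Python's standard `hashlib` module. The function names `file_checksum` and `verify_backup` are hypothetical. SHA-256 is used as the default here as an assumption of this sketch; passing `algorithm="md5"` or `algorithm="sha1"` would correspond to the algorithms named above.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_dir, backup_dir, algorithm="sha256"):
    """Compare every file under original_dir with its copy in backup_dir.

    Returns the relative paths of files that are missing from the backup
    or whose checksums differ. An empty list means all files match.
    """
    original_dir, backup_dir = Path(original_dir), Path(backup_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file():
            problems.append(relative.as_posix())
        elif file_checksum(source, algorithm) != file_checksum(copy, algorithm):
            problems.append(relative.as_posix())
    return problems
```

Calling `verify_backup("project", "backups/project_copy")` (made-up folder names) after each backup run gives a quick list of files that should be copied again. Note that the sketch does not report files that exist only in the backup.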

8.6 Data archiving

Besides data storage, data archiving is another necessary step in the research data life cycle. While data storage primarily involves the storage of data during the ongoing work process in the project period, as covered in the previous sections of this chapter, data archiving is concerned with how the data can be made available in as reusable a way as possible after the project has been completed. Often a distinction is made between data storage in a repository and data archiving in the sense of long-term archiving (LTA for short). However, in many places, including the DFG's “Guidelines for Safeguarding Good Research Practice” from 2019 (“Guideline 17: Archiving”), both terms are used equivalently. When we speak of preservation or data retention in the following, we mean the storage of data in a research data repository. However, when data archiving is mentioned, long-term archiving is meant. The differences between the two variants are the subject of this section.

Data storage in a research data repository is usually accompanied by publication of the data produced. Access to such publications can and, in the case of sensitive data such as personal data, must be restricted. In accordance with good scientific practice, repositories must ensure that the published research data are stored and made available for at least ten years. After that time, availability is no longer necessarily guaranteed, but it is usually continued. If the operator decides to remove data from the repository after this minimum retention period, the reference to the metadata must remain available. Repositories are usually divided into three different types: institutional repositories, subject repositories and interdisciplinary or generic repositories. A fourth, more specific variant is the software repository, in which software or source code can be published. These are usually designed for one programming language each (e.g. PyPI for Python).

Institutional repositories are repositories provided by institutions, most of which are state-recognised. These may include universities, museums, research institutions or other institutions that have an interest in making research results or other documents of scientific importance available to the public. The DFG's “Guidelines for Safeguarding Good Research Practice” (2019) officially require that the research data on which a scientific work is based be kept at least “at the institution where the data were produced or in cross-location repositories” (DFG 2019, p. 20). Before publishing your data, also be aware of the requirements for long-term storage imposed by your research institution's research data guideline or policy. Contact the research data officer at your university or research institution early on to discuss how and where you can publish the data in accordance with good scientific practice. Even if you have already published your data in a journal, it is often possible to publish it at your institution as well. Ask the publisher or check your contract.

In addition to publishing in your institutional repository, you can also publish your data in a subject-specific repository. Publishing in a renowned subject-specific repository in particular can greatly contribute to enhancing your scientific reputation. To find out whether a suitable subject-specific repository is available for your research area, it is worthwhile to search via the repository index re3data.

If there is no suitable repository, the last option is to publish in a large interdisciplinary repository. Free options include Zenodo, which is funded by the European Commission, and the internationally used figshare. If your university is a member of Dryad, you can also publish there free of charge. RADAR offers a fee-based service for publishing data in Germany. The most frequently used option in Europe is probably Zenodo. When publishing on Zenodo, make sure that you also assign your research data to one or more communities, which add a degree of subject specificity to this generic service.

Regardless of where you end up publishing your data, always make sure to include a descriptive “metadata file” in addition to the data, describing the data and setting out the context of the data collection (see Chapter 4). When choosing your preferred repository, also look to see if it is certified in any way (e.g. CoreTrustSeal). Whether a repository is certified can be checked at re3data.

With today's rapidly evolving digital possibilities, the older data become, the more likely it is that this data can no longer be opened, read, or understood in the future. There are several reasons for this: The necessary hardware and/or software is missing, or scientific methods have changed so much that data is now collected in other ways with other parameters. Modern computers and notebooks, for example, now almost always do without a CD or DVD drive, which means that these storage media can no longer be widely used. Long-term archiving therefore aims to ensure the long-term use of data over an unspecified period of time beyond the limits of media wear and technical innovations. This includes both the provision of the technical infrastructure and organisational measures. In doing so, LTA pursues the preservation of the authenticity, integrity, accessibility, and comprehensibility of the data.

To enable long-term archiving of data, it is important that the data are provided with meta-information relevant to LTA, such as the collection method used, the hardware of the system used to collect the data, software, coding, metadata standards including version, and possibly a migration history (see Chapter 4). In addition, the datasets should comply with the FAIR principles as far as possible (see Chapter 5). This includes storing data preferably in non-proprietary, openly documented formats and avoiding proprietary ones. Open formats need to be migrated less often and are characterised by a longer lifespan and wider dissemination. Also make sure that the files to be archived are unencrypted, patent-free and uncompressed. In principle, file formats can be converted losslessly, lossily, or in a way that preserves only the meaning. Lossless conversion is usually preferable, as all information is retained. However, if smaller file sizes are preferred, information losses must often be accepted. For example, if you convert a WAV audio file to MP3, information is lost through compression and the sound quality decreases, but the resulting file is smaller. The following table gives a first basic overview of which formats are suitable and which are rather unsuitable for a certain data type:

Data type	Recommended formats	Less suitable formats	Unsuitable formats
Audio	*.flac / *.wav	*.mp3
Computer-aided Design (CAD)	*.dwg / *.dxf / *.x3d / *.x3db / *.x3dv
Databases	*.sql / *.xml	*.accdb	*.mdb
Raster graphics & images	*.dng / *.jp2 (lossless compression) / *.jpg2 (lossless compression) / *.png / *.tif (uncompressed)	*.bmp / *.gif / *.jp2 (lossy compression) / *.jpeg / *.jpg / *.jpg2 (lossy compression) / *.tif (compressed)	*.psd
Raw data and workspace		*.cdf (NetCDF) / *.h5 / *.hdf5 / *.he5 / *.mat (since version 7.3) / *.nc (NetCDF)	*.mat (binary) / *.rdata
Spreadsheets	*.csv / *.tsv / *.tab	*.odc / *.odf / *.odg / *.odm / *.odt / *.xlsx	*.xls / *.xlsb
Statistical data	*.por	*.sav (IBM®SPSS)
Text	*.txt / *.pdf (PDF/A) / *.rtf / *.tex / *.xml	*.docx / *.odf / *.pdf	*.doc
Vector graphics	*.svg / *.svgz		*.ait / *.cdr / *.eps / *.indd / *.psd
Video¹	*.mkv	*.avi / *.mp4 / *.mpeg / *.mpg	*.mov / *.wmv

Tab. 8.3: Recommended and non-recommended data formats by file type
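To get a first impression of how well the files in a project folder match such recommendations, the table can be turned into a simple lookup. The following Python sketch is illustrative only: the names `FORMAT_RATINGS`, `rate_file` and `files_needing_review` are hypothetical, and the dictionary contains only a subset of Table 8.3. Formats whose rating depends on how the file was written (e.g. *.pdf, *.tif, *.jp2, *.mat) are deliberately left out, because the file extension alone does not reveal their suitability.

```python
from pathlib import Path

# Subset of Table 8.3, keyed by lower-case file extension.
FORMAT_RATINGS = {
    ".flac": "recommended", ".wav": "recommended", ".csv": "recommended",
    ".tsv": "recommended", ".txt": "recommended", ".png": "recommended",
    ".svg": "recommended", ".mkv": "recommended",
    ".mp3": "less suitable", ".xlsx": "less suitable", ".docx": "less suitable",
    ".jpg": "less suitable", ".jpeg": "less suitable", ".mp4": "less suitable",
    ".xls": "unsuitable", ".doc": "unsuitable", ".psd": "unsuitable",
    ".mdb": "unsuitable", ".mov": "unsuitable", ".wmv": "unsuitable",
}


def rate_file(path):
    """Return the rating for a file based on its extension, or 'not listed'."""
    return FORMAT_RATINGS.get(Path(path).suffix.lower(), "not listed")


def files_needing_review(directory):
    """Return (relative path, rating) pairs for all files not rated 'recommended'."""
    directory = Path(directory)
    results = []
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            rating = rate_file(path)
            if rating != "recommended":
                results.append((path.relative_to(directory).as_posix(), rating))
    return results
```

Files reported as "not listed" are not necessarily problematic. They simply need a manual check against the full table or against the guidelines of the repository in question.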

The listing of a format in the columns “less suitable” or “unsuitable formats” does not mean that you cannot use it at all if you want to store your data in the long term. The aim is rather to raise your awareness of questions of long-term availability from the start. Be clear about which advantages and disadvantages each format offers. You can find an extended overview at forschungsdaten.info. If you want to delve further, you will find what you are looking for on the website of NESTOR, the German competence network for long-term archiving and long-term availability of digital resources. Under NESTOR - Topic you will find current short articles from the field, e.g. on TIFF or PDF formats. If you put these and other overviews side by side, you will notice that their recommendations on file formats differ, because there is not yet enough experience in this field. If you are uncertain about formats, another good approach is to ask a specialised data centre or a research data network, if one exists. This is even more advisable if you want to store your data there. You may then find that your data will be accepted even if the chosen format is not the first choice from an LTA perspective. Operators of repositories and research data centres work closely with researchers and always try to find a way of dealing with formats that are widely used in the respective fields, e.g. Excel files. As an example, you can take a look at the guidelines of the Association for Research Data Education.

In order to be able to decide for yourself which formats are suitable for your project, there are a number of criteria that you should consider when making your selection (according to Harvey/Weatherburn 2018: 131):

Extent of dissemination of the data format
Dependence on other technologies
Public accessibility of the file format specifications
Transparency of the file format
Metadata support
Reusability/Interoperability
Robustness/complexity/profitability
Stability
Rights that can complicate data storage

LTA currently uses two strategies for long-term data preservation: emulation and migration. Emulation means that a current system imitates another, often older, system in as many aspects as possible. Programmes that do this are called emulators. A prominent example is DOSBox, which emulates an old MS-DOS system, including almost all of its functionality, on current computers and thus makes it possible to run software written for MS-DOS that would most likely no longer run on a current system.

Migration or data migration means the transfer of data to another system or another data carrier. In the area of LTA, the aim is to ensure that the data can still be read and viewed on the target system. For this, the data must not be inseparably linked to the data carrier on which they were originally collected. Remember that metadata must also be migrated!

When choosing a suitable storage location for long-term archiving, you should consider the following points:

Technical requirements – The service provider should have a data conversion, migration and/or emulation strategy. In addition, a readability check of the files and a virus check should be carried out at regular intervals. All steps should be documented.
Seals for trustworthy long-term archives – Various seals have been developed to assess whether a long-term archive is trustworthy, e.g. the nestor seal, which was developed on the basis of DIN 31644 “Criteria for trustworthy digital long-term archives”, ISO 16363 or the CoreTrustSeal.
Costs – The operation of servers as well as the implementation of technical standards are associated with costs, which is why some service providers charge for their services. The price depends above all on the amount of data.
Making the data accessible – Before choosing the storage location, you should decide whether the data should be accessible or only stored.
Service provider longevity – Economic and political factors influence the longevity of service providers.

In summary: the information on LTA given here is mainly of theoretical value to you and offers only limited guidance for action. Publishing in a certified repository is a sound choice. Above all, make sure that you publish at a trustworthy institution and ask this institution in advance about its options or plans for LTA. You can use the aspects listed here to formulate questions for the institution. This should create sufficient preconditions for LTA.
