Welcome to Mimir. This book explains a method of handling data production for smaller research laboratories in order to produce robust, annotated and machine-friendly data that follows the FAIR principles, without any particularly expensive equipment, subscriptions or personnel.
It explains ways to:
- Manage samples in day-to-day work;
- Write clear, complete and concise analysis protocols;
- Write relevant metadata for all kinds of data produced while working;
- Handle long-term storage of produced data;
- Write data handling policies for a (small-ish) research group.
This book is aimed at smaller laboratories without an institutional open-data, FAIR-data or open-science support framework, so that they may start producing more collaborative, robust and cleaner data for both themselves and the wider public.
The book is divided into sections and is meant to be read both from top to bottom and as a handbook, depending on your needs:
- The Background section contains information useful to get started if you have no idea what FAIR means, why you should produce FAIR data and other required background knowledge.
- The Before experiments section outlines things that should be handled before starting experiments, both in general (i.e. systems and policies that should be in place) and specifically (i.e. what steps should be performed before every experiment).
- The At the bench section outlines what should be done or kept in mind while working specifically at the bench.
- The At the PC section covers the practices that should be implemented while handling data after the measurements on the samples have been completed.
Who is this book for?
This book is aimed both at researchers that generate or handle data, and at their supervisors, who want or need to set up proper data management guidelines for their own laboratory.
If you have ever:
- Lost data in some hard drive somewhere;
- Forgotten what a file was all about;
- Been unable to read the weird file format of the files left by your collaborators a year ago;
- Panicked at the amount of data you have to sift through to write a report or paper;
then you might find Mimir to be useful.
What is FAIR data?
In 2016, an article was published outlining "The FAIR guiding principles for scientific data management and stewardship".
In the paper, the many authors outline what FAIR data is, why it was invented, and why it is useful.
It's a highly recommended read: if you have the time, please stop reading this book and read that article first (if you cannot click links, copy and paste this DOI instead: https://doi.org/10.1038/sdata.2016.18).
If you read the paper, you can skip this next section. If you didn't, I'll highlight some key concepts here. With FAIR data we mean data that is:
- Findable: you can actually find FAIR data. This is true both for your locally-saved data (so that you can no longer say "Oh, yes, it's in a hard drive somewhere...") and for online datasets uploaded to data repositories. It covers data labelling: when you get a hold of FAIR data, it should be clear what the data is all about, including:
- What it is describing, meaning the thing that was measured;
- How it was generated, meaning the instrument or technique used on the thing that was measured;
- Who generated the data and when it was generated;
- Why the data was originally gathered;
Such a large amount of metadata is usually stored in a separate file that is kept alongside the data that it is describing.
This point also covers making your metadata and data machine-readable. Think about searching through documentation: it's done by a machine, not a human. If metadata is readable by a machine, you can run powerful searches and locate the proverbial needle (the data that you need) in the haystack (the huge amount of data available online).
- Accessible: you can easily obtain, or know how to obtain, FAIR data. This might seem trivial, but not all data is born equal. Some data is sensitive, for example data pertaining to the health of patients or to their sexual orientation, political views or other opinions that should be kept private. Such data should still be findable, but it might require additional levels of protection regarding how it can be accessed. This point focuses on having a clear way to access the data, not on the data being openly accessible by default: the ways of getting the data should be clearly stated (e.g. asking for permission, signing waivers, etc...), but they need not be open.
This point also covers the actual method of getting the data. For example, data might not be accessible if the only way to obtain it is to send a carrier pigeon with a USB strapped to its leg to the recipient. The way to get the data needs to be standardized and widely available to all. Luckily, the internet exists, so we can leverage its protocols to transport data from point A to point B.
- Interoperable: you can easily understand FAIR data once you have it. If you download a dataset only to discover that it is written in ancient cuneiform, you might not be able to understand it (unless you are an archaeologist, maybe). Data should be in a format that is understandable by all, not behind a paywall (did you ever consider that if you don't have the money to use Excel, you might not be able to read Excel datasets?), and that is explained in detail (what does the kplm column mean?) so that everyone might understand the data even many years from now.
- Reusable: you can easily use FAIR data for purposes other than the one it was originally made for. This point partially overlaps with Interoperability, but covers specifically the act of reusing the data.
If you know in detail how the data was sourced, where it was sourced from, when it was taken and how it was recorded in the file, then you are able to use the data with confidence for other purposes.
This is the ultimate goal and power of FAIR data. FAIR data is added with confidence to the global knowledge base and can be reused many times by different people for different purposes, including by laboratories and individuals that might not have had the means to gather it in the first place but that have creative uses for it. This is also why the European Union is pushing for the creation of FAIR data guidelines and FAIR data objects, so that companies might reuse the data to generate additional profit. This is a bleak prospect and a controversial point of view (should academia directly support capitalism? How do we economically reward the authors of the data?), but it is outside of the scope of Mimir.
To learn more about the specifics of FAIR data and the characteristics it should have, read the publication and the GO FAIR website.
The process of making old data FAIR is known as FAIRification. Mimir is not a guide on FAIRification: it will not tell you how to take an experiment that was run a year ago and make it FAIR after the fact (I would argue that this is impossible).
Mimir tries to teach you how to make your data FAIR by design. It tries to decouple the process of interpreting the data from the step of creating it. With the method described here, your protocols will generate FAIR data by default. In a second step, you will use these FAIR objects for data analysis, writing papers, and sharing them with the public.
Why is FAIR important?
To be short and philosophical: the global knowledge pool benefits from FAIR data. To be longer and less philosophical: the rest of this page.
FAIR data is useful for both the researcher that makes it and the public at large. For the researcher, it allows them to:
- Produce higher-quality data;
- Avoid losing data due to poor storage techniques;
- Avoid having to delete old data due to forgetting what it was about;
- Be faster when composing methods for papers;
- Immediately fulfill data sharing requirements when publishing;
- Share and pool together data with members of the same group in an easier way;
- Have faster data analysis, and outsource data analysis to third parties in a much easier way.
- Comply with new requirements for funding from the European Union and other funding bodies, as they are increasingly asking applicants to provide guidelines on how data will be treated FAIRly during the project.
- Find and collaborate with people who are interested in your data or generate data similar to yours.
For the public at large, FAIR data makes it possible to:
- Repurpose data for new endeavours without needing to recollect it, which is potentially expensive and time consuming;
- Run large meta-analysis studies in an easier, faster way;
- Check the rigour of data analyses, ameliorating the reproducibility crisis;
- Allow researchers with less funding to still generate knowledge through data repurposing.
As most public research is funded with public money (i.e. by the state), we as scientists also have a responsibility to use this money to produce high-quality, reusable data that the public - who paid for it - can potentially access and reuse.
Are there downsides to following FAIR practices? Well, you need time and some effort to set up a working method that makes you create FAIR(-er) data, and you then need to follow the new guidelines in your day-to-day work.
This book is designed to make these efforts as painless and as smooth as possible, especially if you do not have supporting infrastructure to begin with.
What is in it for me?
You might still not be convinced that the game is worth the candle. Your method for handling data might be perfectly fine for you: you've never lost any data, and you can understand your files well.
However, consider that you do not live in a bubble. You eventually need to:
- Share your work with others;
- Use others' data;
- Combine different datasets of your own work;
- Handle other people's data for, e.g., quality check;
- Find new people who are as interested in your experimental problem as you are.
If everyone produced FAIR data, all of these steps would be extremely easy to carry out and, in some cases, automatic.
Once you embrace FAIR data, you contribute to, and can tap into, the large amount of data present online, to your own reputation and benefit. Working with FAIR data is also more efficient: data analysis is faster with such data, and writing papers becomes a breeze.
If all of this is still not enough to convince you, consider that there are efforts to move away from bibliometric indexes as the measure of a researcher's output and the basis of their evaluation. These efforts, like COARA, will most likely replace such measures with things like FAIR quantification, research integrity, etc... Being ready for these changes will undoubtedly give you an edge in the future.
What is metadata?
μετά, in Greek, means "after" or "beyond", and as an adjective means "transcending". In modern English, meta- as a prefix also means "self-referential". Meta-data is self-referential data, or data about data.
note
Immediately, we hit an interesting point: if metadata is data about data, it is, itself, more data. It's important to keep in mind that everything that you can say for data you can also say for metadata. This includes the potential for data about metadata, or meta-metadata, as we will see below. Sometimes I will use the form (meta)data to refer to both data and metadata.
Imagine that I come up to you, and, without a word, hand you a piece of paper. You read it:
1.57, 1.42, 1.50, 1.55, 1.52, 1.48, 1.60, 1.54, 1.38, 1.48
You are probably confused at this point (I have, of course, immediately disappeared in an ominous way). What do these numbers mean?
Let's take a moment to consider the numbers by themselves. We say that these numbers are data. The definition of "data" is very complicated, but let's simplify it here and think about data as a representation (be it numbers, words, images, etc...) of some aspect of reality. This helps, somewhat. The numbers represent something about the real world, but what?
Enter, metadata. I hand you another piece of paper, which tells you that these numbers are people's heights.
This helps your understanding: 1.57 is the representation of my height, in some units.
However, a lot of questions might have come up in your mind:
- What is this unit of measure?
- What was used to measure the heights?
- Who made the measurements?
- Who were the people that got measured?
All of this information - this metadata - is essential to understand the data. Without it, the numbers above are just that: a series of meaningless numbers. There is no data without metadata, and they are tightly linked together.
You will have handled metadata before: even just the titles of your files are metadata: they describe what is inside the files.
A large part of FAIR data is writing meaningful, standardized and practical metadata alongside your data. The machine or human that reads your metadata can then quickly understand what the data is all about, down to its fine details. It follows that - unlike my pieces of paper - data thusly described is generally useful.
Representing metadata
Metadata is represented in much the same way as regular data. Consider again the numbers above. When working with metadata it's useful to give each data point an identifier (ID):
A     B     C     D     E     F     G     H     I     J
1.57, 1.42, 1.50, 1.55, 1.52, 1.48, 1.60, 1.54, 1.38, 1.48
Once we can univocally refer to each data point, we can create a new representation, a new piece of data, with the metadata for each point:
subject_id gender
A Male
B Female
C Male
D Female
E Male
F Male
G Female
H Female
I Male
J Female
We can use the identifiers to cross-reference the data and metadata, so that we know that the person with identifier "A" is 1.57 meters tall and has a gender of "Male". We will see that identifiers are very important in many aspects of data handling: they are the "handles" that we can grip to move, link, and break apart data. Note that identifiers are metadata themselves. They are described in a later section.
For tabular data such as this, where values are arranged in rows and columns, each column of data is often described by a row of metadata.
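To make this concrete, here is a minimal sketch (in Python, with made-up variable names) of how a program can use the shared identifiers to cross-reference the data with its metadata:

# Data: each height measurement is keyed by its identifier.
heights = {"A": 1.57, "B": 1.42, "C": 1.50, "D": 1.55, "E": 1.52}

# Metadata: the same identifiers describe each subject.
subjects = {
    "A": {"gender": "Male"},
    "B": {"gender": "Female"},
    "C": {"gender": "Male"},
    "D": {"gender": "Female"},
    "E": {"gender": "Male"},
}

# The shared IDs are the "handles" linking the two representations.
for subject_id, height in heights.items():
    gender = subjects[subject_id]["gender"]
    print(f"{subject_id}: {height} m, {gender}")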
Down the metadata rabbit hole
We can go one step further: metadata about metadata. This "meta-metadata" might refer to the definitions of the metadata. For instance, is the gender above what was assigned at birth (i.e. "biological gender") or what the subject reported as the gender they identify with?
Such a clarification is metadata about metadata and it's just as important. This example was very simple, but imagine more complex metadata, especially if it's referring to highly technical features, such as tumor grading (what does "grade III" really mean?), classifications, debated definitions (e.g. "gender") and other non-self-explanatory fields.
It's important to keep in mind when writing metadata that other people (including yourself in the future) need to be able to read and understand the metadata as much as they need to read and understand the data itself. Additionally, it would be nice to coordinate in some way, so that when we say "gender", everyone gives to that word the same meaning. This also allows the development of automated tools that can understand what "gender" means without ambiguity, so that the data can be used correctly.
Machine readability
Recall the FAIR principles: to be Findable, it should be possible for a computer program to search for and find our data. This is primarily done by searching through the metadata, not the data. For instance, if we had a metadata field for "organism", one could look up all experiments done on some specific organism with a simple search. But pause and think about how such a field should be structured. Would terms such as "human" and "gorilla" be OK? Or should it be the full taxonomic name, like "homo Sapiens", instead? If so, in which form ("h. Sapiens", "Sapiens, homo", etc...)? Consider that to a program, the terms "h. Sapiens" and "homo Sapiens" are as different as "gorilla" and "fish". To make our data machine-readable, the form of our metadata must be standardized in some way.
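To see why this matters to a machine, consider this toy sketch (in Python; the records are invented for the example). A naive, exact-match search silently misses records whose "organism" field is spelled differently:

records = [
    {"id": "exp_01", "organism": "Homo sapiens"},
    {"id": "exp_02", "organism": "h. Sapiens"},      # the same species, written differently
    {"id": "exp_03", "organism": "Gorilla gorilla"},
]

# An exact-match search finds only one of the two human experiments.
hits = [record["id"] for record in records if record["organism"] == "Homo sapiens"]
print(hits)  # ['exp_01'] - to the program, 'h. Sapiens' is a completely different term

Only if everyone writes the field in the same, agreed-upon form can such a search be trusted.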
Metadata ontologies
A way to standardize the meaning of metadata is through the usage of Ontologies. They hold all of the definitions of what the (meta)data actually means, as well as the structure it should take. Following such a metadata ontology is essential to be understood by yourself, by others and, chiefly, by computer programs, which require no ambiguity and a lot of standardization when it comes to interpreting (meta)data.
At the time of writing (start of 2024), it's still not clear what ontology to use for what data at a global level. There are some attempts at creating such an ontology (such as researchobject.org), and there are many entries for specific kinds of data (many of which are deposited in FAIRsharing.org), but - as you might imagine - the task of making everyone agree on what ontology to use is arduous, and it will take some time before a recognized, followed standard for most data types is found and widely adopted.
For the purposes of Mimir, we will sidestep the problem altogether. In the absence of a standard for data and metadata structure, type and quality, it's still possible to create local standards to follow. If well designed, such metadata can be (automatically) ported to new metadata formats if and when they emerge. We will therefore be "selfish": your data will be FAIR primarily for yourself, and it will be internally consistent to some quality standard of your choosing. If a global (meta)data ontology does eventually emerge, it is much easier to migrate locally-FAIR data to this standard than non-FAIR data (indeed, annotating old non-FAIR data with new metadata might even be impossible).
Metadata formats
We said that metadata is data, so it follows that all the formats of data can also be found for metadata. However, metadata has some common characteristics that narrow the possible types and formats that it can take:
- Metadata is usually made up of text. Think of the labels for the example above.
- Metadata should be both human- and machine-readable, although machine readability is the more important of the two.
- In particular, being easily machine-readable means using file formats that can be easily read by anyone. More on file formats later.
- Metadata should be retrievable independently of the data it describes, due to the Accessibility principle (in particular, the A2 point: "Metadata are accessible even when the data is no longer available").
- The structure of metadata should be shared by all data that has to be searched through together.
- This is important! Say that you need to search through all the data that your research group/department/university generates. If every measurement were annotated with a different format, what program could reliably search through them all?
This leaves a few formats that are textual, encapsulated and easily machine readable as likely candidates to store metadata:
- Comma separated values (csv) and, less preferably, its flavours (such as tsv);
- JavaScript Object Notation (JSON);
- Other formats that can be easily transpiled into JSON, such as YAML or TOML;
- Resource Description Framework (RDF), a very powerful network-like format mainly used for metadata on the web, but complicated to explain and not easily used by most applications.
RDF, JSON and CSV files are most fitting for metadata. This book is not a manual for formats, but some more information is available in the Formats, serialization and deserialization section.
The JSON format is convenient when there are many fields with little data in each field. CSV is instead more capable of capturing a lot of information for a few fields, for instance clinical metadata of the patients from which the real data is sourced, with identifiers linking them together (as we saw above). RDF is usually the most complex of the three, usually used for meta-metadata and generally high-level metadata. It can, however, be applied at any level, with the proper ontology.
note
Since RDF often requires a surrounding framework (such as loading some RDF-schema or OWL on the web at a fixed IRI), we will focus mostly on schemaless JSON or CSV metadata, although it might not be optimal.
For example, we might have the following JSON to represent an ecological measurement:
{
  "deposited_in": "https://doi.org/00000.00000.0",
  "author": {
    "full_name": "John Snow Doe",
    "orcid": "0000-0000-0000-0001",
    "email": "john.doe@biology_department.org"
  },
  "title": "Number of species in fields",
  "description": "The number of plant species in different fields from expert examination was measured by sampling ten random one-meter squares per field.",
  "objects": [
    {
      "file_name": "my_data.csv",
      "file_format": "csv",
      "column_description": {
        "observation_id": "The unique ID of the observation",
        "field_size": "The size of the field, in meters squared",
        "species_number": "The total number of different species found in the field",
        "species_density": "The average number of different species per squared meter",
        "field_longitude": "The longitude of the centroid of the field",
        "field_latitude": "The latitude of the field",
        "field_altitude": "The altitude of the field, in meters above sea level"
      }
    }
  ]
}
Take a moment to consider if you think that this metadata is complete. Do you understand precisely what was done? Do you think the metadata is lacking? Or maybe you think it has too much information. Think about the structure of the metadata. Can such a template be reused in other, perhaps largely different experiments? What would you change to make it more flexible?
Here are some comments on this metadata example which you may or may not have considered:
- Who did John Doe work for when they collected the data?
- This information might be relevant to a sociologist, for example, or to check for any conflicts of interest.
- When was this data collected?
- With the fast changes in climate, when the measurement was done is tremendously important;
- A variable is mentioned in the description ("ten random one-meter squares") but it is not encoded in the rest of the metadata as a findable field;
- Perhaps someone might be interested in looking for the effects of changing the number of sampled areas, or their size, in measures such as this one.
- The species_density field is redundant if we added the above measure as (meta)data.
- Where is the file described in the metadata located?
- If we obtained ONLY this metadata file, we would not be able to locate the file anywhere.
- Is "Snow" a second name or part of the surname of the author? Is "Doe" the
given name or the surname?
- While this might be easy to spot for names from your culture, imagine the same situation for names from cultures unknown to you.
- The author ID, as included here, is very helpful for determining who this author actually is.
Some comments might seem pedantic, but remember that if these details are not recorded at the moment of data creation, they might be lost forever, making the data useless if they are ever needed.
While it's impossible to record every potential variable that is relevant for the measurement (a large chunk of statistics would not exist if that was the case), it is important to take the time to think of comments like the ones above before depositing the data (and forgetting about it).
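One practical way to catch omissions like these before depositing is to check the metadata programmatically. Below is a minimal, hypothetical sketch of such a check in Python: the list of required fields is an assumption of this example (your local standard would define its own), not part of any official FAIR specification.

import json

# Fields that we locally decided every metadata file must contain.
REQUIRED_FIELDS = ["author", "affiliation", "title", "description",
                   "date_collected", "objects"]

with open("metadata.json", encoding="utf-8") as handle:
    metadata = json.load(handle)

missing = [field for field in REQUIRED_FIELDS if field not in metadata]
if missing:
    print("Metadata is incomplete. Missing fields: " + ", ".join(missing))
else:
    print("All required fields are present.")

A check like this would have flagged, for instance, the missing collection date in the example above.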
Identifiers
We will use identifiers throughout the book, so it's worthwhile to make sure that some concepts are very clear from the get-go.
An identifier, or "ID" for short, is an alphanumeric series of characters (like a word) that refers to some item. This item might be anything: a column in a spreadsheet, an idea, a textual note, a fossilized bone in a museum drawer, etc...
An ID is said to be unique if, in some context (e.g. a museum), that identifier refers to one, and only one, item. For instance, in the Natural Sciences museum next to my department, the item with identifier "mammoth-bone-12" is a bone from a Mammoth, and only that particular bone from that particular mammoth.
We say that an ID is globally unique if, in addition to being locally unique as in the previous example, the ID refers to one item in the whole world. For instance, another museum in Munich might have the same ID "mammoth-bone-12" for one of their own Mammoth bones. Therefore, the ID "mammoth-bone-12" is not globally unique.
An example of a globally unique identifier is the URL of this webpage: https://mrhedmad.github.io/mimir/identifiers.html is an alphanumeric string that identifies this file on the whole internet. There cannot be, on the internet, another page with the same identifier. Indeed, every webpage has a globally unique identifier attached to it, and it's always visible in your browser. If the ID were not globally unique, the web infrastructure would not be able to reliably find the one page that the URL refers to.
Resolvable identifiers
Say that I build a robot that, given an identifier for an item in a museum, will go and fetch it for me. For it to work reliably, we would need to use a unique identifier for the whole search space that the robot works in, in this case the whole museum. So, if I type in "mammoth-bone-12" in the museum next door, I'd obtain a mammoth bone on my desk.
The fact that we have the robot confers to the identifiers the property of being "resolvable": I have the ability to "use" the identifier (through the robot) to obtain the item that the identifier refers to. This is very neat - in some way, the identifier becomes the item.
You might see where I am going with this. The URLs of web pages are resolvable: we can use them, through the internet infrastructure, to obtain the items that they refer to (the web pages). In the strictest definition, an ID is resolvable if we can use it, together with some infrastructure, to obtain the item it refers to or some additional information about that same item (oftentimes its metadata).
Similarly, the path of a file in a filesystem (e.g. "C:\users\person\Desktop\my_file.xlsx") is a (locally) unique identifier to that specific file in that computer that is resolvable by using it in conjunction with some program (like File Explorer).
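As a toy illustration of what "resolving" means, here is a sketch of a local resolver in Python: a catalogue that maps identifiers to locations, plus a function that uses it. The identifiers and paths are invented for the example.

from pathlib import Path

# The "infrastructure": a catalogue mapping IDs to where the items live.
CATALOGUE = {
    "mammoth-bone-12": Path("museum/drawer_4/mammoth-bone-12.tiff"),
    "plate_2-well_230": Path("experiments/plate_2/well_230.csv"),
}

def resolve(identifier: str) -> Path:
    """Use the catalogue to turn an identifier into the item it refers to."""
    return CATALOGUE[identifier]

print(resolve("plate_2-well_230"))  # experiments/plate_2/well_230.csv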
What identifiers identify
Now that we know what an identifier is, let's talk about what the item that it identifies actually is. As we said, the "item" might be a real-world item, a file (like a webpage), a concept (a word is a kind of identifier, don't you think?), etc...
In the context of data management, and Mimir in particular, identifiers are used to refer to things like experiments, protocols, people, real-world items (like vials, culture plates, flasks, samples), and basically anything that we need to describe.
Sometimes, we need identifiers for multiple levels of granularity. For example, a culture plate is made up of different culture wells, each with space for one sample. Oftentimes, we need to refer both to the plate as a whole ("the plate has been in the incubator for 10 hours") and to each individual sample. So, we could say that the plate with ID "AAAA" holds the samples with IDs "001", "002", ..., "348": the ID "AAAA" refers to all 348 samples, each of which has a more specific identifier. In situations such as these it's useful to reflect the hierarchy in the identifiers themselves. Let's take the plate example again: it would be much more convenient in practical terms to refer to sample 230 using an identifier like "plate_2-well_230", so we know from the get-go which plate the ID refers to. This also lets us re-use the suffix "well_230" in other plates while still having unique identifiers.
Again, this is exactly what the internet and your filesystem are doing: the URL "https://en.wikipedia.org/wiki/Microplate" is hierarchical in nature: the specific item (the webpage named "Microplate") is under the larger group of "wiki", which is itself in the larger container "en.wikipedia.org". You can see the same in the filesystem path: "C:\users\person\Desktop\my_file.xlsx" is a series of "containers" that get smaller and smaller until the final file is uniquely identified.
You can see that this method of using identifiers is very logical, and quite easy to use.
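Because hierarchical identifiers have a regular structure, a program can build them and take them apart mechanically. A small Python sketch (the separator and the field names are just one possible local convention, not a standard):

def make_id(plate: int, well: int) -> str:
    # The hierarchy goes from the largest container to the smallest item.
    return f"plate_{plate}-well_{well}"

def parse_id(identifier: str) -> dict:
    plate_part, well_part = identifier.split("-")
    return {"plate": plate_part.removeprefix("plate_"),
            "well": well_part.removeprefix("well_")}

print(make_id(2, 230))               # plate_2-well_230
print(parse_id("plate_2-well_230"))  # {'plate': '2', 'well': '230'}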
If you're smart - and you are - you might have noticed that this "globally" is not very well defined. This is because "globally" as used here is context-dependent. In the widest sense of the word, it means "universally" unique: this ID is never used to refer to anything other than this one item in the whole universe. However, it's often unfair to use such a definition: if everyone in the world agrees that some ID refers to one, and one only, item, but Jane in accounting reuses it in their own catalogue to mean something else, we wouldn't care that much, and the ID would still be - for all practical purposes - globally unique. Imagine this "globally" to be flexible: for instance, we have one, large, integrated internet. Every URL in the whole internet is unique, and since we have just the one internet, we say that they are "globally unique" even though they might not be "universally unique" in the strict sense.
"en.wikipedia.org" is yet again an hierarchical identifier: the largest domain is ".org", with all organization sites. In that container, "wikipedia" is one such site. Inside the "wikipedia.org", "en." is the English version of wikipedia.
Formats, serialization and deserialization
This section covers very technical concepts in a very non-technical way. If you know something about computer science and how files are read and manipulated by machines, you can safely skip this section. If you are not familiar with how that works, read on.
How do machines represent information? With "information" I mean everything a computer can handle: the words on the webpage you are reading now, the numbers in spreadsheets, the pixels of a chart, etc... You probably know that computers "think" in binary, with ones and zeroes - bits. This is true: the hardware that does the computations required to display the pixels on the screen can only work with 0s and 1s. A lot of work, research and effort goes into the manipulation of these long lists of zeroes and ones to obtain the pictures on the screens of all devices. For the rest of the book, and to understand what "machine readability" is, it's useful to have an eagle-eye overview of some of this work: in particular, we will talk about serialization, deserialization and file formats.
Serialization and Deserialization
Physically, files are long strings of zeroes and ones saved in the hard-drives of computers. For example:
0110100001100101011011000110110001101111
might be a small piece of a file. Of course, we cannot work with this: we see files as collections of letters, words, tables, images, not obscure lists of bits.
Enter - serialization. With "serialization" we mean a process that takes data in one form and generates some new object that we can better understand.
For instance, ASCII is a pre-defined table that converts series of bits into letters: the series of bits 01101000 corresponds to the letter "h". Using the ASCII table, I know that the bits above can be serialized to letters, obtaining the word "hello". In technical terms, the bits were serialized to "hello".
The opposite of serialization is deserialization: converting a human-readable object into a form that is better understood by computers or that can be better stored on disk.
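A hands-on way to see both directions is to let a program do the work; the short Python sketch below uses the exact bit string shown above.

bits = "0110100001100101011011000110110001101111"

# Serialization: group the bits into bytes and decode them with the ASCII table.
raw_bytes = int(bits, 2).to_bytes(len(bits) // 8, byteorder="big")
word = raw_bytes.decode("ascii")
print(word)  # hello

# Deserialization: turn the human-readable word back into bits for storage.
back_to_bits = "".join(f"{byte:08b}" for byte in word.encode("ascii"))
print(back_to_bits == bits)  # True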
File formats
We said before that we serialized the bits into the word "hello" through the use of the ASCII table. Technically, we say that the series of bits - in other words, the file - was in the ASCII format, or "encoding". When we serialize something, we use the proper method to move from a lower-level representation (i.e. bits) to a higher-level one (i.e. words).
The extension of a file (the .txt after some_file.txt) is useful for knowing at a glance the format of the file it is attached to: in this case, a text file (which can use ASCII, for instance).
Note that there is no obligation for the extension to be truthful.
I can rename a picture from cat.png to cat.txt with no issues: the file is still a picture of a cat. However, by default the computer will now try to open cat.txt as a text file, and will probably fail (or show the weird, random text characters it obtains by applying the serialization rules for text to the image).
There can be more than one layer of file formats, not just from bits to words. For instance, take a webpage (.html). I will not delve into the technicalities of webpages, but suffice to say that a single webpage is just a text file. To read it, the computer serializes the bits to words, just like we saw before. However, the resulting file is in yet another format: HTML. HTML files are just text files with a specific structure. Here's an example:
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
Why the file is structured this way is not important right now: simply notice that the patterns in the file are somewhat regular (notice the indentation, the tags with specific names, etc.).
The format of this file, HTML, is another layer of encoding: we can serialize this file into another one, more accessible to humans, through additional programs: in this case, the web browser.
We can think of serialization as the "decoding" of a secret message that was "encoded" (deserialized) with the ASCII table. This is why we use the term "encoding" in this context.
If you right click on a webpage and choose "show source", you will see the (very complicated) text file underlying the webpage that you see.
Nobody would call this step "serialization", but rather "rendering" or something along those lines. However, I wanted to be clearer on what serialization actually does, and how it is linked with file formats.
Protocols, experiments and projects
In this page I define some terminology related to protocols, experiments and projects used in the rest of the book.
You are probably familiar with experimental protocols. They are textual files with (detailed) instructions on how to carry out some experimental procedure. The traditional protocol refers to the whole procedure of carrying out an experiment: from sample collection, to preparation, to measurement, to the interpretation of the results. However, for the sake of this book, we will use slightly different definitions for all of these terms:
- An experiment is a series of protocols to follow to obtain one or more measurements, usually - but not necessarily - from real-world objects.
- A protocol is a detailed procedure to obtain some output (be it an item, a measurement, or a combination thereof) from zero or more inputs. It might contain some generic instructions, and it may be flexible in some aspects.
- A realization of an experiment is the same experiment, with its protocols, stripped of any potential variability: every choice was made and is now "set in stone".
- A run of a realization is an actual, real event that took place when the steps of a realized experiment were carried out. As life is not perfect, each run might be slightly different because of variables out of our control, be they unforeseen circumstances (something is broken) or intrinsic variables (the day the experiment was performed, or who performed it).
- Each run (generally) produces one or more measurements. The collection of these measurements is our data.
- A project is a collection of measurements which are ultimately interpreted together towards some goal, like the rejection of a hypothesis.
Using protocols
In the previous section, we defined what projects, experiments and protocols are. In this section I will walk you through an example of such a structure, and how it's implemented in the day-to-day.
Imagine a project to "Have tea with the others". To attain the goal of this project, you would need to run the experiments "Prepare biscuits" and "Brew tea". For the "Brew tea" experiment, we might have more than one protocol, such as "Wash cups" (which takes dirty cups and makes them clean), "Heat up water" (which takes water and creates hot water), and "Infuse water" (which takes hot water and tea leaves to make hot tea). These protocols might change slightly depending on the fine details of the experiment: for example, with the same "Heat up water" protocol we might decide on different temperatures, or on different kinds of tea for "Infuse water".
When we are ready to use the "Brew tea" experiment, we need to realize it. Every variable that was flexible in the various protocols (such as the kind of tea in the "Infuse water" protocol) must be "locked in" at this stage. The resulting realized experiment has to be deposited in some way, such as online. In this case, the experiment might be realized into "Brew lemongrass black tea" by setting the water temperature to 90 degrees Celsius, using lemongrass and black tea leaves during the brewing process and brewing for 5 minutes.
Once we have a realized experiment, we can run it. During the run something unexpected might happen, so we need to record it. For example, John ran "Brew lemongrass black tea" on the third of May, 2024 (we cannot ever re-create the third of May, 2024), but the electric kettle was broken so he had to use a saucepan to heat the water (in contrast to what the realized experiment detailed should be done). The fact that John ran the experiment on that specific day is a variable out of our control: someone has to run the experiment and it must be done in some day, of course. The broken kettle is technically in our control (we could have aborted the experiment at that point), but accidents happen and samples, reagents and time might be expensive, so we must be prepared to record such incidents as we run our experiments.
You can see that the actual thing that produces real data is the run. Every other level in this hierarchy produces variables: our metadata. The metadata is recorded both in the realized experiment and in the notes that we make during the run regarding unexpected or intrinsic variables.
This is the core of the Mimir method. When we think of a new project, we can:
- Define the goal of the project and give it a name;
- Define what experiments we need to run to achieve the project's goal;
- Define the protocols that make up the experiments that we need and compose them, or reuse old ones;
- Create realizations of the experiments (if we cannot use an old one);
- Run the realizations, writing down what variables we encounter during the run that we did not codify already in the previous steps;
It's possible that it's much easier to draft the protocols needed in a realization of the experiment rather than the more generic, "abstracted" protocols that are readily reused. If this is the case, take the time to consider your realized protocol and break it up later, "taking out" all the variables that might reasonably be changed often. This can also be done at a later time.
Once we have the data produced by the run, we just need to label it and upload the text of the realized experiment plus all of the unexpected variables that we have encountered during the run.
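As an illustration, the notes from John's run could be captured in a small, machine-readable record kept next to the measurements. The exact field names below are just one possible layout for such a record, not a prescribed template:

import json

run_record = {
    "run_id": "tea_brewing-run-1",  # hypothetical identifier for this run
    "realized_experiment": "Brew lemongrass black tea",
    "operator": "John",
    "date": "2024-05-03",
    "deviations": [
        "Electric kettle was broken; water was heated in a saucepan instead.",
    ],
}

# Save the record alongside the measurements produced by the run.
with open("run_1_metadata.json", "w", encoding="utf-8") as handle:
    json.dump(run_record, handle, indent=2)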
Sample Labelling
Following the protocol hierarchy we can define what labels on vials, items and files should look like:
ID of the potential file
Experiment ID actual ERI run extension (for files)
▲ ▲ ▲
│ │ │
┌───────┐ ┌────┴─────┐ ┌───────┐ ┌────┴────┐ ┌────┐ ┌───┴──┐
│PROJECT├──┤EXPERIMENT├──┤ E.R.I ├──┤REPLICATE├──┤leaf├──┤[.ext]│
└───┬───┘ └──────────┘ └───┬───┘ └─────────┘ └─┬──┘ └──────┘
│ │ │
▼ ▼ ▼
Project ID Experiment Unique sample-specific
Realization string of text
ID
For example, assume we are studying how tea was brewed in ancient Asia. We have found several tea leaves in a tomb and we are exploring the taste of different tea-brewing techniques.
The overarching project is called Ancient Tea, and we decide that its ID will be ANTEA. It's possible to choose any string of text for a project ID, but it must be universally unique. If we do another tea-related project in the future, we must call it something different from ANTEA (and possibly something that is not easily confused with the original, like ANTEA2).
We think that we need an experiment for taste-testing brewed tea, so we define the experiment with the ID tea_taste for it.
We then define the protocols that we need:
- Wash liquid containers;
- Heat up liquid: heat up some liquid to some temperature using the electric kettle;
- Brew tea: place tea leaves in hot water and brew them for some time;
- Taste substance: use the standard questionnaire to profile the taste of some substance.
We can now realize the experiment for this specific tea-related case:
- Wash cups: to take dirty cups and wash them;
- Heat up water: to take water and cups, and make cups with hot water inside, heating the water up to 85 degrees Celsius so as to not destroy the fragile leaves;
- Brew tea: brew for 5 minutes while stirring every minute.
- Taste hot tea: to take brewed tea and taste-test it, and record the taste. We need an additional section of the questionnaire for the specific bitterness of the tea.
This is the realization of the experiment. In the realized experiment we have defined all the variables that are kept purposefully vague in the generic experiment.
We assign an ID to the realization too, making sure that the combination of EXPERIMENT plus E.R.I. is unique. For example, we can use tea_taste plus ancient, creating the string tea_taste-ancient for this realization of the protocol.
When we actually do the run, we give it an ID. Since this is the first time that the tea_taste-ancient experiment is run in the context of the ANTEA project, we give it the ID of 1.
At this point, the full ID for this run is ANTEA-tea_taste-ancient-1.
Labelling single data points
Each experimental run may have one or more associated things, such as samples, vials, measurements, etc... It's important to label each item in the experimental run with a specific leaf, a string with information about it.
How to structure leaves is less strictly defined as we need to allow for
as much flexibility as possible.
However, it's a good idea to continue with the principle of hierarchy.
To continue with the above example, say that we have brewed five different cups (cup_1 through cup_5) by taking aliquots of different tea leaves (black, green). Each cup is then independently tested by five different people (taster_a, taster_b, taster_c, etc...).
We might have leaf-ids that look like this:
black-cup_1-taster_a
black-cup_1-taster_b
black-cup_1-taster_c
...
black-cup_2-taster_a
black-cup_2-taster_b
...
green-cup_1-taster_a
...
green-cup_5-taster_e
The IDs go from the highest level ("type of tea"), to intermediate ("cup number"), to the finest ("taster ID").
Each measurement (in this case "how good the tea was") is associated with the full ID. For example, taster_e has rated 5/5 the tea from cup 2 made from black tea leaves.
Thus we record the score like this:
ID,score
black-cup_2-taster_e,5
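Since the leaf identifiers follow a regular hierarchy, they can also be generated (and later parsed) mechanically instead of being typed by hand. A minimal Python sketch for the example above:

# Highest level to finest: type of tea, cup number, taster ID.
teas = ["black", "green"]
cups = [f"cup_{number}" for number in range(1, 6)]
tasters = [f"taster_{letter}" for letter in "abcde"]

leaf_ids = [f"{tea}-{cup}-{taster}"
            for tea in teas for cup in cups for taster in tasters]

print(len(leaf_ids))  # 50 unique identifiers
print(leaf_ids[0])    # black-cup_1-taster_a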
What to include in the leaf
The above "leaf" is made up of metadata. However, how much metadata should be included in the leaf is up to the experimenter. A leaf identifier should be primarily useful for the human experimenter that is performing the run to know at a glance what is in this vial, what this measurement is about, and/or things useful for housekeeping, such as when the vial was collected or stored.
In any case, all metadata about the experiment should be saved in separate metadata files, as detailed in What is metadata?. This means that the leaf might be as short and as obscure as a single number, or as long as a string containing all metadata variables: as long as it is unique for the specific thing it is describing, it is not really important what form it takes.
Choosing file formats
note
This section talks about which file formats to use. If you don't know what file formats are, or need a refresher, you can read the Formats, Serialization and Deserialization section. This section assumes you know everything covered in that section.
The choice of file formats is important. We must choose formats that are future-proof and accessible, meaning:
- They have an encoding that is widely popular and that can be interpreted easily in the future;
- They can be read by non-proprietary (i.e. free and open-source) software, so that there are as few barriers as possible to their usage;
- They are easily human-readable with as little serialization as possible.
These considerations generally limit our possible file formats:
Numeric data
Tables of numbers should be saved as comma separated values, or .csv.
A csv file looks like this:
header_1,header_2,header_3
value,value,value
value,value,value
"a value with a comma, inside",value,value
Each line is a row in the table, and the first one is often reserved for the heading of the table.
Note that some programs use the first column as row names: avoid this, instead saving the row names as a proper column (with a useful name, like sample_names).
A tsv or tab-separated values file is similar to a csv file, but uses tabs instead of commas. It looks like the text below when opened with a text editor:
header_1 header_2 header_3
value value value
value value value
a value with a comma, inside value value
There is no obvious advantage or benefit to choosing tsv rather than csv or vice-versa, but the csv format is generally more common.
For this reason, the csv format should be preferred over tsv.
important
TSV files are often saved with the extension .txt (as in "plain text").
This is an old convention. Please save TSV files with the .tsv extension, not .txt.
If you can, change the extensions from .txt to .tsv, or even better, convert the files to CSV and save them as such.
A csv file has the MIME type text/csv.
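If you have old tab-separated .txt tables to convert, Python's standard csv module can rewrite them safely, quoting any values that contain commas. A minimal sketch, assuming a file named old_table.txt:

import csv

with open("old_table.txt", newline="", encoding="utf-8") as source, \
     open("table.csv", "w", newline="", encoding="utf-8") as destination:
    reader = csv.reader(source, delimiter="\t")  # read tab-separated rows
    writer = csv.writer(destination)             # write comma-separated rows
    writer.writerows(reader)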
Formats to avoid
Do not use the following formats, whenever possible:
- Excel (.xlsx and .xls, application/vnd.ms-excel);
- They rely on a proprietary software and are hard to read programmatically.
- Data should never contain data analysis steps, so nothing that cannot be represented in pure csv should be included in the format.
Images
To store images, the Tagged Image File Format (TIFF) should be used.
Multiple images in the same series (e.g. for a time-lapse) should be saved in the same TIFF file as a stack of images in the correct order.
If the TIFF file format cannot be used, an alternative is Portable Network Graphics (PNG, .png).
TIFF images should adhere to the "basic TIFF" specification, with no extensions to the format. This is mostly a technical detail, but it is important for the long-term accessibility of the image data.
TIFF images have the MIME type image/tiff, and PNG images have image/png.
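As an example of how such a stack can be assembled, the Pillow library (one option among many) can append frames into a single multi-page TIFF; the file names below are hypothetical:

from PIL import Image

# Time-lapse frames, listed in the correct temporal order.
frame_names = ["timepoint_0.png", "timepoint_1.png", "timepoint_2.png"]
frames = [Image.open(name) for name in frame_names]

# Save all frames as a single multi-page TIFF stack.
frames[0].save("timelapse_stack.tiff", save_all=True, append_images=frames[1:])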
Formats to avoid
Do not use the following formats, whenever possible:
- JPEG (.jpg, image/jpeg);
- JPEG images are compressed with a lossy compression, meaning that some image data is lost when the JPEG file is saved. JPEG images also do not allow for transparency.
Documentation
All documents (i.e. forms, certificates, and other bureaucratic items) that are exclusively meant for human consumption may be saved as .pdf files.
All other documentation should be saved as plain-text (.txt) or, at most, as markdown (.md) files.
Plain text is both humanly readable and machine readable. If additional formatting is required (e.g. titles, bold/italics, links, etc...), markdown is a broadly applicable way to add text formatting while still essentially writing plain text files.
Slides and presentations should be saved as .pdf.
If videos or animations must be included in the presentation, prefer the OpenDocument presentation format (.odp) instead of PowerPoint-like formats (e.g. .ppt or .pptx).
warning
Be careful that ODF files, and in some cases PDF files, may distort or wrongly encode some features of the presentation if converting directly from PPT to ODF.
Always check the presentations before depositing.
Formats to avoid
- Do not use PowerPoint formats (.ppt, .pptx), as they are proprietary.
Multimedia
Audio files should preferably be saved as raw Waveform Audio File Format (WAV, .wav).
WAV files are uncompressed and simple to read, and may be processed by most audio-processing software.
As a (better) alternative, the Broadcast Wave Format (BWF, confusingly also .wav) can be used, although consumer software to convert to and from it is relatively rare.
Yet another alternative is the Free Lossless Audio Codec (FLAC, .flac).
Video files should be saved as either MP4 (.mp4) or WebM (.webm), the latter being one of the few file formats distributed with a permissive license (BSD) by Google, and therefore not proprietary.
Be careful of the compression of the output video files, which in some cases may be lossy.
Formats to avoid
- Do not use .gif files, as they have very inefficient compression that generally does not reduce file size much but causes very bad quality losses. Additionally, they do not support partial (alpha) transparency.
Further considerations
Compressing
Files should be compressed if larger than a few megabytes. Image files in particular should always be compressed before archiving.
The recommended compression scheme is gzip (with the .gz file extension).
Each file should be compressed on its own, but groups of tightly-related files may be bundled together as a tarball (with the .tar file extension) first, and then compressed with gzip (resulting in a .tar.gz file).
If you do, consider creating uncompressed metadata describing the contents of the tarball.
Other possible compression formats are 7Zip (.7z) or ZIP (.zip).
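Both operations are also easy to script with Python's standard library, if you prefer not to do them by hand; a minimal sketch with hypothetical file names:

import gzip
import shutil
import tarfile

# Compress a single file, producing my_data.csv.gz.
with open("my_data.csv", "rb") as source, gzip.open("my_data.csv.gz", "wb") as destination:
    shutil.copyfileobj(source, destination)

# Bundle a group of tightly-related files into a single .tar.gz archive.
with tarfile.open("experiment_run_1.tar.gz", "w:gz") as archive:
    for name in ["my_data.csv", "metadata.json"]:
        archive.add(name)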
Plain text encoding
All plain-text files (e.g. .txt, .csv and source code) should be UTF-8 encoded.
In the modern day, this is the default for most computers.
Another compatible encoding that may be used for simple files is ASCII, but UTF-8 is still preferable.
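When scripts read or write plain-text files, it is worth stating the encoding explicitly rather than relying on the system default; a short Python sketch:

# State the encoding explicitly when writing...
with open("notes.txt", "w", encoding="utf-8") as handle:
    handle.write("Field size measured in m²\n")

# ...and when reading, so the file is interpreted the same way on every machine.
with open("notes.txt", encoding="utf-8") as handle:
    print(handle.read())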
For developers
It might be useful to check file formats programmatically. A way to do this is with the JHOVE tool, which extracts metadata from many different file types. It can also check whether a file is well-formed and valid.
Another possibility is FIDO, available as a Python library and command-line tool.
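For a very lightweight check, less thorough than JHOVE or FIDO (which inspect the actual file contents), Python's standard mimetypes module can at least flag files whose extension does not match the formats you decided to allow; the allowed set below is just an example:

import mimetypes
from pathlib import Path

ALLOWED = {"text/csv", "text/plain", "image/tiff", "image/png", "application/pdf"}

for path in Path("my_dataset").rglob("*"):
    if path.is_file():
        mime_type, _ = mimetypes.guess_type(path.name)
        if mime_type not in ALLOWED:
            print(f"Check {path}: guessed MIME type is {mime_type}")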
Specific cases
Some files have specific formats and require special care. If you have such files, keep in mind the archival considerations above when choosing a format, even if it is not among those shown here. If you can, contact a data expert for advice.
File formats reference
Got a file? Not sure what file format to use? This is a quick guide to the preferred formats. It has the same information as the file formats page but in a more condensed fashion.
- Numbers
  - Use comma separated values (.csv).
  - Avoid Excel files (.xlsx, .xls).
- Images
  - Use TIFF (.tiff), or PNG (.png) if TIFF cannot be used.
  - Avoid JPEG (.jpg).
- Text Documents
  - Use plain text (.txt) or markdown (.md).
  - Use .pdf or .odp for presentations.
  - For documents that machines would never need to read, you may use .pdf.
  - Avoid using Word-like files, .docx and .doc.
- Multimedia
  - Use WAV (.wav) or FLAC (.flac) for audio.
  - Use MP4 (.mp4) or WebM (.webm) for video.
  - Avoid .gif files.