Choosing file formats
note
This section talks about which file formats to use. If you don't know what file formats are, or need a refresher, you can read the Formats, Serialization and Deserialization section. This section assumes you know everything covered in that section.
The choice of file formats is important. We must choose formats that are future-proof and accessible, meaning:
- They use a widely adopted encoding that can be interpreted easily in the future;
- They can be read by non-proprietary (i.e. free and open-source) software, so that there are as few barriers as possible to their use;
- They are easily human-readable with as little serialization as possible.
These considerations generally limit our possible file formats:
Numeric data
Tables of numbers should be saved as comma-separated values (.csv) files.
A csv file looks like this:
header_1,header_2,header_3
value,value,value
value,value,value
"a value with a comma, inside",value,value
Each line is a row in the table, and the first one is often reserved for the heading of the table.
Note that some programs use the first column as row names: avoid this, and instead save the row names as a proper column (with a useful name, like sample_names).
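As an illustration of the advice above, here is a minimal sketch that writes a table to CSV with the row names stored as an ordinary column. The function name `table_to_csv` and the example data are hypothetical; only Python's standard `csv` module is assumed.

```python
import csv
import io


def table_to_csv(row_names, columns, rows, name_column="sample_names"):
    """Serialize a table to CSV text, storing row names as a regular column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The header row: the row-name column comes first, with an explicit name.
    writer.writerow([name_column, *columns])
    for name, row in zip(row_names, rows):
        writer.writerow([name, *row])
    return buffer.getvalue()


# Hypothetical example data; values containing commas are quoted automatically.
text = table_to_csv(
    row_names=["sample_a", "sample_b"],
    columns=["weight_g", "comment"],
    rows=[[1.5, "ok"], [2.0, "a value with a comma, inside"]],
)
print(text)
```

The writer quotes only the fields that need it, so the value containing a comma is wrapped in double quotes, as in the example file above.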
A tsv, or tab-separated values, file is similar to a csv file, but uses tabs instead of commas. It looks like the text below when opened with a text editor:
header_1 header_2 header_3
value value value
value value value
a value with a comma, inside value value
There is no obvious advantage in choosing tsv over csv or vice versa, but the csv format is generally more common.
For this reason, the csv format should be preferred over tsv.
important
TSV files are often saved with the extension .txt (as in "plain text").
This is an old convention. Please save TSV files with the .tsv extension, not .txt.
If you can, change the extensions from .txt to .tsv, or even better convert them to CSV and save them as such.
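A conversion from a tab-separated file to CSV could be sketched as follows. The helper name `tsv_to_csv` is hypothetical, and the sketch assumes the input is UTF-8 encoded and really is tab-separated; check the result before discarding the original.

```python
import csv
from pathlib import Path


def tsv_to_csv(source, destination=None):
    """Convert a tab-separated file to CSV and return the path of the new file."""
    source = Path(source)
    # By default, write next to the source file with a .csv extension.
    destination = Path(destination) if destination else source.with_suffix(".csv")
    with source.open(newline="", encoding="utf-8") as src, \
            destination.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)  # quotes fields that contain commas
        writer.writerows(reader)
    return destination
```

For example, `tsv_to_csv("table.txt")` would write `table.csv` in the same folder.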
A csv file has the MIME type text/csv.
Formats to avoid
Whenever possible, do not use the following formats:
- Excel (.xlsx and .xls, application/vnd.ms-excel);
  - They rely on proprietary software and are hard to read programmatically.
  - Data files should never contain data analysis steps, so nothing that cannot be represented in pure csv should be included in the file.
Images
To store images, the Tagged Image File Format (TIFF) should be used.
Multiple images in the same series (e.g. for a time-lapse) should be saved in the same TIFF file as a stack of images in the correct order.
If the TIFF file format cannot be used, an alternative is Portable Network Graphics (PNG, .png).
TIFF images should adhere to the "Baseline TIFF" specification, with no extensions to the format. This is mostly a technical detail, but it is important for the long-term accessibility of the image data.
TIFF images have the MIME type image/tiff, and PNG images have image/png.
Formats to avoid
Whenever possible, do not use the following formats:
- JPEG (.jpg, image/jpeg);
  - JPEG images use lossy compression, meaning that some image data is lost when the JPEG file is saved. JPEG images also do not allow for transparency.
Documentation
All documents (i.e. forms, certificates, and other bureaucratic items) that are exclusively meant for human consumption may be saved as .pdf files.
All other documentation should be saved as plain-text (.txt) or at most as markdown (.md) files.
Plain text is both human-readable and machine-readable. If additional formatting is required (e.g. titles, bold/italics, links, etc.), markdown is a broadly applicable way to add text formatting while still essentially writing plain text files.
Slides and presentations should be saved as .pdf.
If videos or animations must be included in the presentation, prefer the Open Document Format (ODF; .odp for presentations) instead of PowerPoint-like formats (e.g. .ppt or .pptx).
warning
Be careful: converting a presentation directly from PPT to ODF, and in some cases to PDF, may distort or wrongly encode some of its features.
Always check the presentations before depositing.
Formats to avoid
- Do not use PowerPoint formats (.ppt, .pptx), as they are proprietary.
Multimedia
Audio files should preferably be saved as raw Waveform Audio File Format (WAV, .wav).
WAV files are uncompressed and simple to read, and may be processed by most audio-processing software.
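To illustrate how simple uncompressed WAV files are to read, the sketch below uses Python's standard `wave` module to report the basic properties of a file. The function name `describe_wav` is hypothetical, and the sketch assumes an uncompressed (PCM) WAV file.

```python
import wave


def describe_wav(path):
    """Return basic properties of an uncompressed (PCM) WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": rate,
            "duration_s": frames / rate,
        }
```

Recording these properties alongside the file can help future readers interpret the audio data.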
As a (better) alternative, the Broadcast Wave Format (BWF, confusingly also .wav) can be used, although consumer software to convert to and from it is relatively rare.
Yet another alternative is the Free Lossless Audio Codec (FLAC, .flac).
Video files should be saved as either MP4 (.mp4) or WebM (.webm), the latter being one of the few file formats distributed under a permissive license (BSD) by Google, and therefore not proprietary.
Be careful of the compression of the output video files, which in some cases may be lossy.
Formats to avoid
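- Do not use .gif files: they are limited to a palette of 256 colors per frame, which generally causes very bad quality losses, and their compression is inefficient and does not reduce file size much. Additionally, they support only fully transparent or fully opaque pixels, not partial transparency.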
Further considerations
Compressing
Files should be compressed if larger than a few megabytes. Image files in particular should always be compressed before archiving.
The recommended compression scheme is gzip (with the .gz file extension).
Each file should be compressed on its own, but groups of tightly-related files may first be bundled together as a tarball (with the .tar file extension), and then compressed with gzip (resulting in a .tar.gz file).
If you do, consider creating uncompressed metadata describing the contents of the tarball.
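Both approaches can be sketched with Python's standard library. The function names `gzip_file` and `make_tarball`, as well as the `.manifest.json` naming and layout of the metadata file, are assumptions made for illustration, not an established convention.

```python
import gzip
import json
import shutil
import tarfile
from pathlib import Path


def gzip_file(path):
    """Compress a single file to <name>.gz next to it and return the new path."""
    path = Path(path)
    target = path.with_name(path.name + ".gz")
    with path.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


def make_tarball(paths, archive):
    """Bundle related files into a .tar.gz and write an uncompressed manifest."""
    archive = Path(archive)
    paths = [Path(p) for p in paths]
    with tarfile.open(archive, "w:gz") as tar:
        for p in paths:
            tar.add(p, arcname=p.name)  # store files without their folder path
    # Hypothetical manifest layout: file names and sizes, readable without unpacking.
    manifest = archive.with_name(archive.name + ".manifest.json")
    contents = [{"name": p.name, "bytes": p.stat().st_size} for p in paths]
    manifest.write_text(
        json.dumps({"archive": archive.name, "files": contents}, indent=2),
        encoding="utf-8",
    )
    return manifest
```

The manifest stays uncompressed on purpose, so that the contents of the tarball can be inspected without extracting it.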
Other possible compression formats are 7-Zip (.7z) or ZIP (.zip).
Plain text encoding
All plain-text files (e.g. .txt, .csv and source code) should be UTF-8 encoded.
Nowadays, this is the default for most computers.
Another compatible encoding that may be used for simple files is ASCII, but UTF-8 is still preferable.
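One way to check whether a file is valid UTF-8 is simply to try decoding it. The sketch below does this and, if needed, re-encodes a file from a known legacy encoding. The function names are hypothetical, and the source encoding must be known in advance: it cannot be reliably guessed from the bytes alone.

```python
from pathlib import Path


def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_to_utf8(path, source_encoding):
    """Rewrite a file in place as UTF-8, given its current (known) encoding."""
    path = Path(path)
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
```

Since ASCII is a subset of UTF-8, pure ASCII files pass this check unchanged.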
For developers
It might be useful to check file formats programmatically. One way to do this is with the JHOVE tool, which extracts metadata from many different file types. It can also check whether a file is well-formed and valid.
Another possibility is FIDO, available as a Python library and command-line tool.
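For a quick first check without installing either tool, a file's leading "magic" bytes can be compared against known signatures. The sketch below is not a substitute for JHOVE or FIDO: it covers only a handful of the formats mentioned in this section, does not validate the rest of the file, and the function name `sniff_mime` is hypothetical.

```python
# Leading byte signatures for some of the formats discussed in this section.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    (b"\x1f\x8b", "application/gzip"),
    (b"fLaC", "audio/flac"),
]


def sniff_mime(path):
    """Guess a MIME type from a file's first bytes, or return None if unknown."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None
```

A mismatch between the sniffed type and the file extension (e.g. a .tiff file that is really a JPEG) is a sign that the file should be inspected more closely.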
Specific cases
Some files have specific formats, and require special care. If you have such files, consider the archival aspects above and choose a format other than those shown here. If you can, contact a data expert for advice.