Kerblam! is a Rust command line tool to manage the execution of scientific data analysis, where having reproducible results and sharing the executed pipelines is important. It makes it easy to write multiple analysis pipelines and select what data is analysed.
With Kerblam! your analyses will be less bloated, more organized, and more reproducible.
Kerblam! is Free and Open Source Software, hosted on Github at MrHedmad/kerblam. The code is licensed under the MIT License.
Use the sidebar to jump to a specific section. If you have never used Kerblam! before, you can read the documentation from start to finish to learn all there is to know about Kerblam! by clicking on the arrows on the side of the page.
Kerblam! is very opinionated. To read more about why these choices were made, you can read the Kerblam! philosophy.
About
This page aggregates a series of meta information about Kerblam!.
License
The project is licensed under the MIT License. See the MIT entry on choosealicense.com for a plain-language summary of the license.
Citing
If you want or need to cite Kerblam!, provide a link to the Github repository or use the following Zenodo DOI: doi.org/10.5281/zenodo.10664806.
Naming
This project is named after the fictitious online shop/delivery company in S11E07 of Doctor Who. Kerblam! might be referred to as Kerblam!, Kerblam or Kerb!am, interchangeably, although Kerblam! is preferred. The Kerblam! logo is written in the Kwark Font by tup wanders.
About this book
This book is rendered by mdbook, and is written as a series of markdown files. Its source code is available in the Kerblam! repo under the ./docs/ folder.
The book hosted online always refers to the latest Kerblam! release. If you are looking for older or newer versions of this book, you should read the markdown files directly on Github, where you can select which tag to view from the top bar, or clone the repository locally, checkout to the commit you like, and rebuild from source. If you're interested, read the development guide to learn more.
Installation
You have a few options when installing Kerblam!.
Requirements
Currently, Kerblam! only supports macOS (both Intel and Apple silicon) and GNU/Linux. Other Unix/Linux systems may work, but are untested.
It also uses binaries that it assumes are already installed and visible from your $PATH:
- GNU make: gnu.org/software/make
- git: git-scm.com
- Docker (as docker) and/or Podman (as podman): docker.com and/or podman.io
- tar: gnu.org/software/tar
- bash: gnu.org/software/bash
If you can use git, make, tar, bash and docker or podman from your CLI, you're good to go!
Most, if not all, of these tools come pre-packaged in most Linux distributions. Check your repositories for them.
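As a quick sanity check, you can ask your shell whether each tool is on the $PATH (this is plain shell, not a Kerblam! feature; remember that only one of docker and podman is needed):

```shell
# Report any of the required tools that are missing from the PATH
for tool in git make tar bash docker podman; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```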
Pre-compiled binary (recommended)
You can find and download a Kerblam! binary for your operating system in the releases tab.
There are also helpful scripts that automatically download the correct version for your specific operating system, thanks to cargo-dist.
You can always install or update to the latest version with:
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/MrHedmad/kerblam/releases/latest/download/kerblam-installer.sh | sh
Be warned that the above command executes a script downloaded from the internet. If you'd like, you can follow the fetched URL above to download the same installer script and inspect it before you run it.
Install from source
If you want to install the latest version from source, install Rust and cargo, then run:
cargo install kerblam
If you wish to instead use the latest development version, run:
cargo install --git https://github.com/MrHedmad/kerblam.git
The main branch should always compile on supported platforms with the above command. If it does not, please open an issue.
Adding the Kerblam! badge
You can add a Kerblam! badge to the README of your project to show that you use Kerblam! Just copy the following code and add it to the README:
![Kerblam!](https://img.shields.io/badge/Kerblam!-v0.5.1-blue?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAC0AAAAtCAMAAAANxBKoAAABlVBMVEUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADW1tYNDHwcnNLKFBQgIB/ExMS1tbWMjIufDQ3S0tLOzs6srKyioqJRUVFSS0o0MjIBARqPj48MC3pqaWkIB2MtLS1ybm3U1NS6uroXirqpqamYmJiSkpIPZ4yHh4eFhIV8fHwLWnuBe3kMC3cLCnIHBlwGBlgFBU8EBEVPRkICAi4ADRa+EhIAAAwJCQmJiYnQ0NDKysoZkMK2trYWhLOjo6MTeKMTd6KgoKCbm5uKiIaAgIAPDHhubm4JT20KCW0KCWoIS2cHBUxBQUEEAz9IQT4DAz0DKTpFPTgCAjcCASoBASAXFxcgGRa5ERG1ERGzEBCpDw+hDg4fFA2WDAyLCgouAQFaWloFO1MBHStWBATnwMkoAAAAK3RSTlMA7zRmHcOuDQYK52IwJtWZiXJWQgXw39q2jYBgE/j2187JubKjoJNLSvmSt94WZwAAAvlJREFUSMeF1GdXGkEUgOGliIgIorFH0+u7JBIChEgJamyJvWt6783eS8rvzszAusACvp88x4d7hsvsaqdU57h8oQnobGmtb6xMzwbOkV9jJdvWBRwf7e9uLyzs7B3+o7487miC+AjcvZ3rkNZyttolbKxPv2fyPVrKYKcPhp7oIpPv0FkGN5N5rmd7afAFKH0MH99DihrTK2j3RTICF/Pt0trPUr9AxXyXpkJ3xu6o97tgQJDQm+Xlt6E8vs+FfNrg6kQ1pOuREVSPoydf9YjLpg14gMW1X0IInGZ+9PWr0Xl+R43pxzgM3NgCiekvqfE50hFdT7Ly8Jbo2R/xWYNTl8Ptwk6lgsHUD+Ji2NMlBFZ8ntzZRziXW5kLZsaDom/0yH/G+CSkapS3CvfFCWTxJZgMyqbYVLtLMmzoVywrHaPrrNJX4IHCDyCmF+nXhHXRkzhtCncY+PMig3pu0FfzJG900RBNarTTxrTCEwne69miGV5k8cPst3wOHSfrmJmcCH6Y42NEzzXIX8EFXmFE/q4ZXJrKW4VsY13uzqivF74OD39CbT/0HV/1yQW9Xn8e1O0w+WAG0VJS4P4Mzc7CK+2B7jt6XtFYMhl7Kv4YWMKnsJkXZiW3NgQXxTEKamM2fL8EjzwGv1srykZveBULj6bBZX2Bwbs03cXTQ3HAb9FOGNsS4wt5fw9zv0q9oZo54Gf4UQ95PLbJj/E1HFZ9DRgTuMecPgjfUqlF7Jo1B9wX+JFxmMh7mAoGv9B1pkg2tDoVl7i3G8mjH1mUN3PaspJaqM1NH/sJq2L6QJzEZ4FTCRosuKomdxjYSofDs8DcRPZh8hQd5IbE3qt1ih+MveuVeP2DxOMJAlphgSs1mt3GVWO6yMNGUDZDi1uzJLDNqxbZDLab3mqQB5mExtLYrtU45L10qlfMeSbVQ91eFlfRmnclZyR2VcB5y7pOYhouuSvg2rxHCZG/HHZnsVkVtg7NmkdirS6LzbztTq1EPo9dXRWxqtP7D+wL5neoEOq/AAAAAElFTkSuQmCC&link=https%3A%2F%2Fgithub.com%2FMrHedmad%2Fkerblam)
The above link is very long - this is because the Kerblam! logo is baked in as a base64 image. You can update the badge's version by directly editing the link (e.g. manually change v0.5.1 to v0.4.0).
Quickstart
Welcome to Kerblam! This introductory chapter gives you a general overview of Kerblam!: what it does and how it does it.
Kerblam! is a project manager. It helps you write clean, concise data analysis pipelines, and takes care of chores for you.
Every Kerblam! project has a kerblam.toml file in its root. When Kerblam! looks for files, it does so relative to the position of the kerblam.toml file, and in specific, pre-determined folders. This helps you keep everything in its place, so that others who are unfamiliar with your project can understand it if they ever need to look at it. These folders, relative to where the kerblam.toml file is, are:
- ./data/: Where all the project's data is saved. Intermediate data files are specifically saved here.
- ./data/in/: Input data files are saved and should be looked for in here.
- ./data/out/: Output data files are saved and should be looked for in here.
- ./src/: Code you want to be executed should be saved here.
- ./src/pipes/: Makefiles and bash build scripts should be saved here. They have to be written as if they were saved in ./.
- ./src/dockerfiles/: Container build scripts should be saved here.
Any sub-folder of one of these specific folders (with the exception of src/pipes and src/dockerfiles) contains the same type of files as the parent directory. For instance, data/in/fastq is treated by Kerblam! as if it contains input data, just as the data/in directory is.
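kerblam new (covered later) creates this layout for you, but as a sketch, scaffolding it by hand amounts to:

```shell
# Scaffold the standard Kerblam! folder layout by hand
mkdir -p data/in data/out src/pipes src/dockerfiles
touch kerblam.toml
```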
You can configure almost all of these paths in the kerblam.toml file, if you so desire. This is mostly done for compatibility with non-Kerblam! projects. New projects that wish to use Kerblam! are strongly encouraged to follow the standard folder structure, however.
The rest of these docs are written as if you are using the standard folder structure. If you are not, don't worry! All Kerblam! commands respect your choices in the kerblam.toml file.
If you want to convert an existing project to use Kerblam!, you can take a look at the kerblam.toml section of the documentation to learn how to configure these paths.
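As a sketch of what such an override might look like - note that the key names below are illustrative, not authoritative; check the kerblam.toml section of this documentation for the real ones:

```toml
# HYPOTHETICAL key names -- consult the kerblam.toml
# specification before using path overrides.
[data.paths]
input = "my_input_folder"
output = "results"
```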
If you follow this standard (or you write proper configuration), you can use Kerblam! to do a bunch of things:
- You can run pipelines written in make or arbitrary shell files in src/pipes/ as if you ran them from the root directory of your project by simply using kerblam run <pipe>;
- You can wrap your pipelines in docker containers by just writing new dockerfiles in src/dockerfiles, with essentially just the installation of the dependencies, letting Kerblam! take care of the rest;
- If you have wrapped up pipelines, you can export them for later execution (or to send them to a reviewer) with kerblam package <pipe>, without needing to edit your dockerfiles;
- If you have a package from someone else, you can run it with kerblam replay;
- You can fetch remote data from the internet with kerblam data fetch, see how much disk space your project's data is using with kerblam data, and safely clean up all the files that are not needed to re-run your project with kerblam data clean;
- You can show others your work by packing up the data with kerblam data pack and sharing the .tar.gz file around;
- And more!
The rest of this tutorial walks you through every feature.
I hope you enjoy Kerblam! and that it makes your projects easier to understand, run and reproduce!
If you like Kerblam!, please consider leaving a star on Github. Thank you for supporting Kerblam!
Creating new projects - kerblam new
You can quickly create new Kerblam! projects by using kerblam new. Go to a directory where you want to store the new project and run kerblam new test-project.
Kerblam! asks you some setup questions:
- If you want to use Python;
- If you want to use R;
- If you want to use pre-commit;
- If you have a Github account, and would like to set the origin of your repository to github.com.
Say 'yes' to all of these questions to follow along. Kerblam! will then:
- Create the project directory,
- make a new git repository,
- create the kerblam.toml file,
- create all the default project directories,
- make an empty .pre-commit-config file for you,
- create a venv environment, as well as the requirements.txt and requirements-dev.txt files (if you opted to use Python),
- and set up the .gitignore file with appropriate ignores.
Kerblam! will NOT make an initial commit for you! You still need to do that manually once you've finished setting up.
You can now start working in your new project; simply cd test-project.
Akin to git, Kerblam! will look in parent directories for a kerblam.toml file and run there if you call it from a project sub-folder. Efficient!
Pipelines
Kerblam! is first and foremost a pipeline runner.
Say that you have a script in ./src/calc_sum.py. It takes an input .csv file, processes it, and outputs a new .csv file, using stdin and stdout. You have an input.csv file that you'd like to process with calc_sum.py.
You could write a shell script or a makefile with the command to run.
We'll refer to these scripts as "pipes".
Here's an example makefile pipe:
./data/out/output.csv: ./data/in/input.csv ./src/calc_sum.py
	cat $< | ./src/calc_sum.py > $@
You'd generally place this file in the root of the repository and run make to execute it.
This is perfectly fine for projects with a relatively simple structure and just one execution pipeline.
Imagine however that you have to change your pipeline to run two different
jobs which share a lot of code and input data but have slightly (or dramatically)
different execution.
You might modify your pipe to accept if statements, use environment variables, or perhaps write many of them and run them separately.
In any case, having a single file that has the job of running all the different
pipelines is hard, adds complexity and makes managing the different execution
scripts harder than it needs to be.
Kerblam! manages your pipes for you.
You can write different makefiles and/or shell files for different types of runs of your project and save them in ./src/pipes/.
When you kerblam run, Kerblam! looks into that folder, finds (by name) the makefiles that you've written, and brings them to the top level of the project (e.g. ./) for execution.
In this way, you can write your pipelines as if they were in the root of
the repository, cutting down on a lot of boilerplate paths.
For instance, you could have written a ./src/pipes/process_csv.makefile for the previous step, and you could invoke it with kerblam run process_csv.
You could then write more makefiles or shell files for other tasks and run
them similarly, keeping them all neatly separated from the rest of the code.
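As a sketch, the same step written as a shell pipe instead of a makefile could look like this. To keep the example self-contained and runnable, an awk one-liner stands in for the hypothetical calc_sum.py, and a toy input is created first (a real project would already have its input data in place):

```shell
#!/usr/bin/env bash
# ./src/pipes/process_csv.sh -- written as if it runs from the project root
set -euo pipefail

# Toy setup so this sketch runs standalone
mkdir -p ./data/in ./data/out
printf 'a,b\n1,2\n3,4\n' > ./data/in/input.csv

# The actual "pipe": read the input, append a row-wise sum column
awk -F, 'NR==1 {print $0 ",sum"; next} {print $0 "," $1+$2}' \
  ./data/in/input.csv > ./data/out/output.csv
```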
The next sections outline the specifics of how Kerblam! executes pipes.
Executing code - kerblam run
The kerblam run command is used to run pipelines. Kerblam! looks for files with the .makefile extension for makefiles and the .sh extension for shell files in the pipelines directory (by default src/pipes/).
It automatically uses the proper execution strategy based on what extension
the file is saved as.
Shell files are always executed in bash. You can use anything that is installed on your system this way, e.g. snakemake or nextflow.
Make has a special execution policy to allow it to work with as little boilerplate as possible. You can read more on Make in the GNU Make book.
kerblam run supports the following flags:
- --profile <profile>: Execute this pipeline with a profile. Read more about profiles in the section below.
- --desc (-d): Show the description of the pipeline, then exit.
- --local (-l): Skip running in a container, if a container is available, preferring a local run.
In short, kerblam run does something similar to this:
- Move your pipe.sh or pipe.makefile file to the root of the project, under the name executor;
- Launch make -f executor or bash executor for you.
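The steps above can be sketched in plain shell (a toy example: kerblam run additionally handles cleanup, containers, and profiles on top of this):

```shell
# Create a minimal makefile pipe so the sketch is self-contained
mkdir -p src/pipes
printf 'all:\n\t@echo "pipe ran"\n' > src/pipes/my_pipe.makefile

# Roughly what `kerblam run my_pipe` does:
cp src/pipes/my_pipe.makefile ./executor  # 1. bring the pipe to the root
make -f executor                          # 2. run it from the root
rm ./executor                             # 3. clean up afterwards
```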
This is why pipelines are written as if they are executed in the root of the project, because they are.
Data Profiles - Running the same pipelines on different data
You can run your same pipelines, as-is, on different data thanks to data profiles.
By default, Kerblam! will use your untouched ./data/in/ folder when executing pipes.
If you want the same pipes to run on different sets of input data, Kerblam! can
temporarily swap out your real data with this 'substitute' data during execution.
For example, say a process_csv.makefile requires an input ./data/in/input.csv file. However, you might want to run the same pipe on another, different_input.csv file.
You could copy and paste the first pipe and change the paths from the first file to the alternative one. However, you would then have to maintain two essentially identical pipelines, and you are prone to introducing errors while you modify them (what if you forget to change one reference to the original file?).
You can use kerblam to do the same, but in an easy, declarative, and less error-prone way.
Define in your kerblam.toml file a new section under data.profiles:
# You can use any ASCII name in place of 'alternate'.
[data.profiles.alternate]
# The quotes are important!
"input.csv" = "different_input.csv"
You can then run the same makefile with the new data with:
kerblam run process_csv --profile alternate
Paths under every profile section are relative to the input data directory, by default data/in.
Under the hood, Kerblam! will:
- Rename input.csv to input.csv.original;
- Move different_input.csv to input.csv;
- Run the analysis as normal;
- When the run ends (it finishes, it crashes, or you kill it), undo both actions: it moves different_input.csv back to its original place and renames input.csv.original back to input.csv.
This effectively causes the makefile to run with different input data.
Careful: the output data will (most likely) be saved under the same file names as a "normal" run! Kerblam! does not look into where the output files are saved or what they are saved as. If you really want to, use the KERBLAM_PROFILE environment variable described below and change the output paths accordingly.
Profiles are most commonly useful to run the pipelines on test data that is faster to process or that produces pre-defined outputs. For example, you could define something similar to:
[data.profiles.test]
"input.csv" = "test_input.csv"
"configs/config_file.yaml" = "configs/test_config_file.yaml"
And execute your test run with kerblam run pipe --profile test.
The profiles feature is used so commonly for test data that Kerblam! will automatically make a test profile for you, swapping every input file in the ./data/in folder named test_xxx with its "regular" counterpart xxx. For example, the profile above is redundant!
If you write a [data.profiles.test] profile yourself, Kerblam! will not modify it in any way, effectively disabling the automatic test profile feature.
Kerblam! tries its best to clean up after itself (e.g. undo profiles, delete temporary files, etc.) when you use kerblam run, even if the pipe fails, and even if you kill your pipe with CTRL-C.
If your pipeline is unresponsive to a CTRL-C, pressing it twice (two interrupt signals in a row) will kill Kerblam! instead, leaving the child process to be cleaned up by the OS and the (eventual) profile not cleaned up. This is to allow you to stop whatever Kerblam! or the pipe is doing in case of emergency.
Kerblam! will run the pipelines with the environment variable KERBLAM_PROFILE set to the name of the active profile. In this way, you can detect from inside the pipeline whether you are in a profile or not. This is useful if you want to keep the outputs of different profiles separate, for instance.
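For example, a shell pipe could route its outputs into a per-profile folder (a minimal sketch; KERBLAM_PROFILE is unset in a profile-less run, so we fall back to a default):

```shell
# Pick an output folder based on the active profile, if any
out_dir="./data/out/${KERBLAM_PROFILE:-default}"
mkdir -p "$out_dir"
echo "writing results to $out_dir"
```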
Containerized Execution of Pipelines
Kerblam! can ergonomically run pipelines inside containers for you, making it easier to be reproducible.
If Kerblam! finds a container recipe (such as a Dockerfile) with the same name as one of your pipes in the ./src/dockerfiles/ folder (e.g. ./src/dockerfiles/process_csv.dockerfile for the ./src/pipes/process_csv.makefile pipe), it will use it automatically when you execute the pipeline (e.g. kerblam run process_csv), running the pipeline inside a container.
Specifically, it will do something similar to this:
- Copy the pipeline to the root of the directory (as it does normally when you launch kerblam run), as ./executor;
- Run docker build -f ./src/dockerfiles/process_csv.dockerfile --tag process_csv_kerblam_runtime . to build the container;
- Run docker run --rm -it -v ./data:/data --entrypoint make process_csv_kerblam_runtime -f /executor.
This last command runs the container, telling it to execute make with the target file -f /executor. Note that this is not exactly what Kerblam! does - it has additional features to correctly mount your paths, capture stdin and stdout, etc.
If your docker container has a COPY . . directive, you can effectively have Kerblam! run your projects in docker environments, so you can tweak your dependencies and tooling (which might be different from your dev environment) and execute in a protected, reproducible environment.
Kerblam! will build the container images without moving the recipes around (this is what the -f flag does). The .dockerignore in the build context (next to the kerblam.toml) is shared by all pipes. See the 'using a dockerignore' section of the Docker documentation for more.
You can write dockerfiles for both make and sh pipes. Kerblam! automatically configures the correct entrypoint and arguments to run the pipe in the container.
Read the "writing dockerfiles for Kerblam!" section to learn more about how to write dockerfiles that work nicely with Kerblam! (spoiler: it's easier than writing canonical dockerfiles!).
For example, you can have the following Dockerfile:
# ./src/dockerfiles/process_csv.dockerfile
FROM ubuntu:latest
RUN apt-get update && apt-get install -y python3 python3-pip && \
    pip install pandas
COPY . .
and this dockerignore file:
# ./src/dockerfiles/.dockerignore
.git
data
venv
and simply run kerblam run process_csv to build the container and run your code inside it.
If you run kerblam run without a pipeline (or with a non-existent pipeline), you will get the list of available pipelines. You can see at a glance which pipelines have an associated dockerfile, as they are prepended with a little whale (🐋):

Error: No runtime specified. Available runtimes:
🐋◾ my_pipeline :: Generate the output data in a docker container
◾◾ local_pipeline :: Run some code locally
Default dockerfile
Kerblam! will look for a default.dockerfile if it cannot find a container recipe for the specific pipe (e.g. pipe.dockerfile), and use that instead. You can use this to write a generic dockerfile that works for your simplest pipelines. The whale (🐋) emoji in the list of pipes will be replaced by a fish (🐟) for pipes that use the default container, so you can identify them easily:
Error: No runtime specified. Available runtimes:
🐋◾ my_pipeline :: Generate the output data in a docker container
🐟◾ another :: Run in the default container
Switching backends
Kerblam! runs containers with Docker by default, but you can tell it to use Podman instead by setting the execution > backend option in your kerblam.toml:
[execution]
backend = "podman" # by default "docker"
Podman is slightly harder to set up, but has a few benefits: mainly, it does not have to run in root mode, and it is a FOSS program. For 90% of use cases, you can use podman instead of docker and it will work exactly the same.
Podman and Docker images are interchangeable, so you can use Podman with Docker Hub with no issues.
Setting the container working directory
Kerblam! does not parse your dockerfile or add any magic to the calls that it makes based on heuristics. This means that if you wish to save your code not in the root of the container, you must tell kerblam! about it.
For instance, this recipe copies the contents of the analysis into a folder called /app:
COPY . /app/
This one does the same by using the WORKDIR directive:
WORKDIR /app
COPY . .
If you change the working directory, let Kerblam! know by setting the execution > workdir option in kerblam.toml:
[execution]
workdir = "/app"
In this way, Kerblam! will run the containers with the proper paths. Note that this option applies to ALL containers managed by Kerblam! There is currently no way to configure a different working directory for each specific dockerfile.
Writing Dockerfiles for Kerblam!
When you write dockerfiles for use with Kerblam! there are a few things you should keep in mind:
- Kerblam! will automatically set the proper entrypoints for you;
- The build context of the dockerfile will always be the place where the kerblam.toml file is;
- Kerblam! will not ignore any file for you;
- The behaviour of kerblam package is slightly different from kerblam run, in that the context of kerblam package is an isolated "restarted" project, as if kerblam data clean --yes was run on it, while the context of kerblam run is the current project, as-is.
This means a few things:
COPY directives are executed in the root of the repository
This is exactly what you want, usually. It makes it possible to copy the whole project over to the container with just COPY . ..
The data directory is excluded from packages
If you have a COPY . . directive in the dockerfile, it will behave differently when you kerblam run versus when you kerblam package.
When you run kerblam package, Kerblam! will create a temporary build context with no input data. This is what you want: Kerblam! needs to separately package your (precious) input data on the side, and copy into the container only code and other execution-specific files.
In a run, the current local project directory is used as-is as the build context. This means that the data directory will be copied over. At the same time, Kerblam! will also mount the same directory to the running container, so the copied files will be "overwritten" by the live mountpoint while the container is running.
This generally means that copying the whole data directory is useless in a run, and that it cannot be done during packaging.
Therefore, a best practice is to ignore the contents of the data folders in the .dockerignore file. This makes no difference while packaging containers, but a big difference when running them, as docker skips copying the useless data files. To do this in a standard Kerblam! project, simply add this to your .dockerignore:
# Ignore the intermediate/output directory
data
You might also want to add any files that you know are not useful in the docker environment, such as local python virtual environments.
Your dockerfiles can be very small
Since the configuration is handled by Kerblam!, the main reason to write dockerfiles is to install dependencies.
This makes your dockerfiles generally very small:
FROM ubuntu:latest
RUN apt-get update && apt-get install -y # a list of packages
COPY . .
You might also be interested in the article 'best practices while writing dockerfiles' by Docker.
Docker images are named based on the pipeline name
If you run kerblam run my_pipeline twice, the same container is built to run the pipeline both times, meaning that caching will make your execution quite fast if you place the COPY . . directive near the bottom of the dockerfile.
This way, you can essentially work exclusively in docker and never install anything locally.
Kerblam! will name the containers for the pipelines as <pipeline name>_kerblam_runtime.
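To make the most of this caching, keep the slow dependency layers above the COPY . . line, so only the last layer is rebuilt when your code changes. A minimal sketch (the base image and package are just placeholders):

```dockerfile
FROM python:3.12-slim
# Dependency layers first: cached across runs as long as they don't change
RUN pip install --no-cache-dir pandas
# Code last: only this layer is rebuilt when your project files change
COPY . .
```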
Describing pipelines
If you execute kerblam run without specifying a pipe (or you try to run a pipe that does not exist), you will get a message like this:

Error: no runtime specified. Available runtimes:
◾◾ process_csv
🐋◾ save_plots
◾◾ generate_metrics
The whale emoji (🐋) represents pipes that have an associated Docker container.
If you wish, you can add additional information to this list by writing a section in the makefile/shellfile itself. Using the same example as above:
#? Calculate the sums of the input metrics
#?
#? The script takes the input metrics, then calculates the row-wise sums.
#? These are useful since we can refer to this calculation later.
./data/out/output.csv: ./data/in/input.csv ./src/calc_sum.py
	cat $< | ./src/calc_sum.py > $@
If you add this block of lines starting with #?, Kerblam! will use them as descriptions (note that the space after the ? is important!), and it will treat them as markdown.
The first paragraph of text (#? lines not separated by an empty #? line) will be the title of the pipeline. Try to keep this short and to the point. The rest of the lines will be the long description.
Kerblam! will parse all lines starting with #?, although it's preferable to only have a single contiguous description block in each file.
The output of kerblam run will now read:
Error: no runtime specified. Available runtimes:
◾📜 process_csv :: Calculate the sums of the input metrics
🐋◾ save_plots
◾◾ generate_metrics
The scroll (📜) emoji appears when Kerblam! notices a long description. You can show the full description for such pipes with kerblam run process_csv --desc.
With pipeline docstrings, you can have a record of what the pipeline does for both yourself and others who review your work.
You cannot write docstrings inside docker containers.¹

¹ You actually can. I can't stop you. But Kerblam! ignores them.
Packaging pipelines for later
The kerblam package command is one of the most useful features of Kerblam! It allows you to package everything needed to execute a pipeline into a docker container and export it for later execution.
You must have a matching dockerfile for every pipeline that you want to package, or Kerblam! won't know what to package your pipeline into.
For example, say that you have a process pipe that uses make to run, and requires both a remotely-downloaded remote.txt file and a local-only precious.txt file.
If you execute:
kerblam package process --tag my_process_package
Kerblam! will:
- Create a temporary build context;
- Copy all non-data files to the temporary context;
- Build the specified dockerfile as normal, but using this temporary context;
- Create a new Dockerfile that:
  - Inherits from the image built before;
  - Copies the Kerblam! executable to the root of the container;
  - Configures the default execution command to something suitable for execution (just like kerblam run does, but "baked in");
- Build the docker container and tag it with my_process_package;
- Export all precious data, the kerblam.toml, and the --tag of the container to a process.kerblam.tar tarball.
If you don't specify a --tag, Kerblam! will name the result <pipe>_exec. The --tag parameter is a docker tag. You can specify a remote repository and push it with docker push ... as you would normally do.
After Kerblam! packages your project, you can re-run the analysis with kerblam replay by using the process.kerblam.tar file:
kerblam replay process.kerblam.tar ./replay_directory
Kerblam! reads the .kerblam.tar file, recreates the execution environment from it by unpacking the packed data, and executes the exported docker container with the proper mountpoints (as described in the kerblam.toml file). In the container, Kerblam! fetches remote files (i.e. runs kerblam data fetch) and then the pipeline is triggered via kerblam run.
Since the output folder is attached to the output directory on disk, the final output of the pipeline is saved locally. These packages are meant to make pipelines reproducible in the long term. For day-to-day runs, kerblam run is much faster. The responsibility of having the resulting docker container work in the long term is up to you, not Kerblam! For most cases, just having kerblam run work is enough for the resulting package made by kerblam package to work, but depending on your dockerfiles this might not be the case.
Kerblam! does not test the resulting package - it's up to you to do that.
It's best to try your packaged pipeline once before shipping it off.
However, even a broken kerblam package is still useful! You can always enter the container later with --entrypoint bash and interactively work inside it, manually fixing any issues that time or a wrong setup might have introduced.
Kerblam! respects your choices of execution options when it packages, changing backend or working directory as you'd expect. See the kerblam.toml specification to learn more.
Managing Data
Kerblam! has a bunch of utilities to help you manage the local data for your project. If you follow open science guidelines, chances are that a lot of your data is FAIR, and you can fetch it remotely.
Kerblam! is perfect to work with such data. The next tutorial sections outline what Kerblam! can do to help you work with data.
Remember that Kerblam! recognizes what data is what by the location where you save the data in. If you need a refresher, read this section of the book.
kerblam data will give you an overview of the status of local data:
> kerblam data
./data  500 KiB [2]
├── in  1.2 MiB [8]
└── out 823 KiB [2]
──────────────────────
Total   2.5 MiB [12]
├── cleanup 2.3 MiB [9] (92.0%)
└── remote  1.0 MiB [5]
! There are 3 undownloaded files.
The first lines highlight the size and number of files (e.g. 500 KiB across 2 files) in the ./data (intermediate), ./data/in (input), and ./data/out (output) folders.
The total size of all the files in the ./data/ folder is then broken down by category: the Total data size, how much data can be removed with kerblam data clean or kerblam data pack, and how many files are specified to be downloaded but are not yet present locally.
Fetching remote data
If you define the data.remote section in kerblam.toml, you can have Kerblam! automatically fetch remote data for you:
[data.remote]
# This follows the form "url_to_download" = "save_as_file"
"https://raw.githubusercontent.com/MrHedmad/kerblam/main/README.md" = "some_readme.md"
When you run kerblam data fetch, Kerblam! will attempt to download some_readme.md by following the URL you provided, and save it in the input data directory (e.g. data/in).
Most importantly, some_readme.md is treated as a file that is remotely available and therefore locally expendable for the sake of saving disk space (see the data clean and data pack commands).
You can specify any number of URLs and file names in `[data.remote]`, one for each file that you wish to be downloaded.
The download directory for all fetched data is your input directory, so if you specify `some/nested/dir/file.txt`, Kerblam! will save the file in `./data/in/some/nested/dir/file.txt`.
This also means that if you write an absolute path (e.g. `/some_file.txt`), Kerblam! will take the path literally, trying to create `some_file.txt` in the root of the filesystem (and most likely failing to do so). It will, however, warn you before acting that it is about to do something potentially unwanted, giving you the chance to abort.
Package and distribute data
Say that you wish to send your whole data folder to a colleague for inspection. You could `tar -czvf exported_data.tar.gz ./data/` and send everything, but you might want to pick only the output and the non-remotely-available inputs, leaving the re-download of the (potentially bulky) remote data to your colleague. It is also widely known that remembering `tar` commands is impossible.
If you run `kerblam data pack`, you can do just that. Kerblam! will create an `exported_data.tar.gz` file and save it locally with the non-remotely-available `./data/in` files and the files in `./data/out`. You can also pass the `--cleanup` flag to delete these files after packing.
You can then share the data pack with others.
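To make the idea concrete, here is a rough shell sketch of what `kerblam data pack` automates. This is not Kerblam!'s actual implementation, and every file name below is invented for illustration:

```shell
# Pretend project: one precious input (not remotely available) and one output.
mkdir -p data/in data/out
printf 'precious\n' > data/in/measurements.csv
printf 'result\n'   > data/out/summary.txt

# Bundle only those files, leaving remote inputs to be re-fetched
# by the recipient.
tar -czf exported_data.tar.gz data/in/measurements.csv data/out/summary.txt

# Inspect the archive's contents.
tar -tzf exported_data.tar.gz
```

The real command also consults `kerblam.toml` to decide which inputs count as remotely available.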
Cleanup data
If you want to clean up your data (perhaps you have finished your work, and would like to save some disk space), you can run `kerblam data clean`.
Kerblam! will remove:
- All temporary files in `./data/`;
- All output files in `./data/out`;
- All input files that can be downloaded remotely in `./data/in`;
- All empty (even nested) folders in `./data/` and `./data/out`.

This essentially only leaves on disk the input data that cannot be retrieved remotely.
Kerblam! will consider as "remotely available" the files that are present in the `data.remote` section of `kerblam.toml`.
See this chapter of the book to learn more about remote data.
If you wish to preserve the remote data (perhaps you merely want to "reset" the pipelines and start again quickly), you can use the `--keep-remote` flag to do so.
If you want to preserve the empty folders left behind after cleaning, pass the `--keep-dirs` flag to do just that.
Kerblam! will ask for your confirmation before deleting the files. If you're feeling bold, skip it with the `--yes` flag.
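Conceptually, the cleanup resembles this shell sketch. It is not Kerblam!'s real implementation, the file names are invented, and the real command also consults `[data.remote]` in `kerblam.toml` to know which inputs are expendable:

```shell
# Pretend project state before cleaning.
mkdir -p data/out data/in/nested
printf 'temp\n' > data/scratch.tmp        # temporary/intermediate file
printf 'out\n'  > data/out/result.txt     # output file
printf 'keep\n' > data/in/precious.csv    # precious input: must survive

# Remove temporary and output files...
rm -f data/scratch.tmp data/out/result.txt
# ...and drop empty (even nested) folders, which --keep-dirs would preserve.
find data -type d -empty -delete

ls data/in   # only precious.csv is left
```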
Other utilities
Kerblam! has a few other utilities to deal with the most tedious steps of working with projects.
`kerblam ignore` - Add items to your `.gitignore` quickly
Oops! You forgot to add your preferred language to your `.gitignore`. You now need to google for the template `.gitignore`, open the file and copy-paste it in.
With Kerblam! you can do that in just one command. For example, `kerblam ignore Rust` will fetch `Rust.gitignore` from the Github gitignore repository and append it to your `.gitignore` for you.
Be careful that this command is case sensitive (e.g. `Rust` works, `rust` does not).
You can also add specific files or folders this way: `kerblam ignore ./src/something_useless.txt`. Kerblam! will add the proper pattern to the `.gitignore` file to filter out that specific file.
The optional `--compress` flag makes Kerblam! check the `.gitignore` file for duplicated entries, retaining only one copy of each pattern. This also cleans up comments and whitespace in a sensible way.
The `--compress` flag thus fixes ignoring the same thing twice: `kerblam ignore Rust && kerblam ignore Rust --compress` is the same as running `kerblam ignore Rust` just once.
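The deduplication can be pictured with a small shell sketch. This is a crude approximation of `--compress` (the real flag also tidies comments and whitespace), and the file name is invented:

```shell
# A .gitignore-like file with a duplicated pattern.
printf 'target/\n*.log\ntarget/\n' > demo_gitignore

# Keep only the first occurrence of each line.
awk '!seen[$0]++' demo_gitignore > demo_gitignore.tmp
mv demo_gitignore.tmp demo_gitignore

cat demo_gitignore   # target/ and *.log, each appearing once
```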
Getting help
You can get help with Kerblam! via a number of channels:
- Encountered a bug? Open an issue on Github!
- Have a question? Either open an issue or send me an email;
- Have a suggestion? Open an issue!
Thank you so much for giving Kerblam! a go.
Usage examples
There are a bunch of examples in the MrHedmad/kerblam-examples repository, ready for your perusal.
The latest development version of Kerblam! is tested against these examples, so you can be sure they are as fresh as they can be.
The Kerblam.toml file
The `kerblam.toml` file is the control center of Kerblam!: all of its configuration is found there. Here is what fields are available, and what they do.
Extra fields not listed here are silently ignored - this means that you must be careful of typos! The fields are annotated, where possible, with their default values.
```toml
[meta] # Metadata regarding kerblam!
version = "0.4.0"
# Kerblam! will check this version and give you a warning
# if you are not running the same executable.
# To save you headaches!

# The [data] section has options regarding... well, data.
[data.paths]
input = "./data/in"
output = "./data/out"
intermediate = "./data"

[data.profiles] # Specify profiles here
profile_name = {
    "original_name" = "profile_name",
    "other_name" = "other_profile_name"
}
# Or, alternatively
[data.profiles.profile_name]
"original_name" = "profile_name"
"other_name" = "other_profile_name"
# Any number of profiles can be specified, but stick to just one of these
# two methods of defining them.

[data.remote] # Specify how to fetch remote data
"url_to_fetch" = "file_to_save_to"
# There can be any number of "url" = "file" entries here.
# Files are saved inside `[data.paths.input]`

[code] # Where to look for containers and pipes
env_dir = "./src/dockerfiles"
pipes_dir = "./src/pipes"

[execution] # How to execute the pipelines
backend = "docker" # or "podman", the backend to use to build and run containers
workdir = "/" # The working directory inside all built containers
```
Note that this is not meant to be valid TOML, just a reference. Don't expect to copy-paste it and obtain a valid Kerblam! configuration.
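For contrast, a minimal configuration that is valid TOML might look something like this (the version number, URL and backend choice are purely illustrative):

```toml
[meta]
version = "0.4.0"

[data.remote]
"https://example.com/genes.csv" = "genes.csv"

[execution]
backend = "podman"
```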
Contributing to Kerblam!
Thank you for wanting to contribute!
The developer guide changes more often than this book, so you can read it directly on Github.
The Kerblam! philosophy
Hello! This is the maintainer. This article covers the design principles behind how Kerblam! functions. It is both targeted at myself - to remind me why I did what I did - and to anyone who is interested in the topic of managing data analysis projects.
Reading this is not at all necessary to start using Kerblam!. Perhaps you want to read the tutorial instead.
I am an advocate of open science, open software and of sharing your work as soon and as openly as possible. I also believe that documenting your code is even more important than the code itself. Keep this in mind when reading this article, as it is strongly opinionated.
The first time I use an acronym I'll try to make it bold italics so you can have an easier time finding it if you forget what it means. However, I try to keep acronyms to a minimum.
Introduction
After three years doing bioinformatics work as my actual job, I think I have come across many of the different types of projects that one encounters as a bioinformatician:
- You need to analyse some data, either directly from someone or from some online repository. This requires the usage of both pre-established tools and new code and/or some configuration.
  - For example, someone in your research group performed RNA-Seq, and you are tasked with the data analysis.
- You wish to create a new tool/pipeline/method of analysis and apply it to some data to both test its performance and/or functionality, before releasing the software package to the public.
The first point is data analysis. The second point is software development. Both require writing software, but they are not exactly the same.
You'd generally work on point 2 like a generalist programmer would. In terms of how you work, there are many different workflow mental schemas that you can choose from, each with its following, pros, and cons. Simply search for coding workflow to find a plethora of different styles, methods and types of way you can use to manage what to do and when while you code.
In any case, while working with a specific programming language, you usually have only one possible way to lay out your files. A Python project uses a quite specific structure: you have a `pyproject.toml`/`setup.py`, a module directory1... Similarly, when you work on a Rust project, you use `cargo`, and therefore have a `Cargo.toml` file, a `/src` directory...
The topic of structuring the code itself is even deeper, with different ways to think of your coding problem: object oriented vs functional vs procedural, monolithic vs microservices, etcetera, but it's out of the scope of this piece.
At its core, software is a collection of text files written in a way that the computer can understand. The process of laying out these files in a logical way in the filesystem is what I mean when I say project layout (PL). A project layout system (PLS) is a pre-established way to layout these files. Kerblam! is a tool that can help you with general tasks if you follow the Kerblam! project layout system.
There are also project management systems, that are tasked with managing what has to be done while writing code. They are not the subject of this piece, however.
Since we are talking about code, there are a few characteristics in common between all code-centric projects:
- The changes between different versions of the text files are important. We need to be able to go back to a previous version if we need to. This can be due to a number of things: if we realize that we changed something that we shouldn't have, if we just want to see a previous version of the code, or if we need to run a previous version of the program for reproducibility purposes.
- Code must be documented to be useful. While it is often sufficient to read a piece of code to understand what it does, the why is often unclear. This is even more important when creating new tools: a tool without clear documentation is unusable, and an unusable tool might as well not exist.
- Often, code has to be edited by multiple people simultaneously. It's important to have a way to coordinate between people as you add your edits in.
- Code layout is often driven by convention or by requirements of build systems/ interpreters/external tools that need to read your code. Each language is unique under this point.
From these initial observations we can start to think about a generic PLS. Version control takes care of - well - version control and is essential for collaboration. Version control generally does not affect the PL meaningfully. However, version control often does not work well with large files, especially binary files.
Design principle A: We must use a version control system.
Design principle B: Big binary blobs bad2!
I'm very proud of this pun. Please don't take it from me.
I assume that the reader knows how vital version control is when writing software. In case that you do not, I want to briefly outline why you'd want to use a version control system in your work:
- It takes care of tracking what you did on your project;
- You can quickly turn back time if you mess up and change something that should not have been changed;
- It allows you to collaborate both within your own team (if any) and with the public (in the case of open-source codebases). Collaboration is nigh impossible without a version control system;
- It allows you to categorize and compartmentalize your work, so you can keep track of every different project neatly;
- It makes the analysis (or tool) accessible - and if you are careful, also reproducible - to others, which is an essential part of the scientific process.

These are just some of the advantages you get when using a version control system. One of the most popular version control systems is `git`. With `git`, you can progressively add changes to code over time, with `git` taking care of recording what you did and managing different versions made by others. If you are not familiar with version control systems, and specifically with `git`, I suggest you stop reading and look up the git user manual.
Design principle A makes it so that the basic unit of our PLS is the repository. Our project therefore is a repository of code.
As we said, documentation is important. It should be versioned together with the code, as that is what it is describing and it should change at the same pace.
Design principle C: Documentation is good. We should do more of that.
Code is read more times than it is written; therefore, it's important for a PLS to be logical and obvious. To be logical, one should categorize files based on their content, and logically arrange them in a way that makes sense when you or a stranger looks through them. To be obvious, the categorization and the choice of folder and file names should make sense at a glance (e.g. the `scripts` directory is for scripts, not for data).

Design principle D: Be logical, obvious and predictable.
Scientific computing needs to be reproduced by others. The best kind of reproducibility is computational reproducibility, by which the same output is generated given the same input. There are a lot of things that you can do while writing code to achieve computational reproducibility, but one of the main contributors to reproducibility is still containerization.
Additionally, being easily reproducible is - in my mind - as important as being reproducible to begin with. The easier it is to reproduce your work, the more "morally upright" you will be in the eyes of the reader. This has a lot of benefits, of course, with the main one being that you are more resilient to backlash in the inevitable case that you commit an error.
Design principle E: Be (easily) reproducible.
Structuring data analysis
While structuring single programs is relatively straightforward, doing the same for a data analysis project is less set in stone. However, given the design principles that we have established in the previous section, we can try to find a way to fulfill all of them for the broadest scope of application possible.
To design such a system, it's important to find what are the points in common between all types of data analysis projects. In essence, a data analysis project encompasses:
- Input data that must be analysed in order to answer some question.
- Output data that is created as a result of analysing the input data.
- Code that analyses that data.
- It is often the case that data analysis requires many different external tools, each with its own set of requirements. These sum with the requirements of your own code and scripts.
"Data analysis" code is not "tool" code: it usually uses more than one programming language, it is not monolithic (i.e. it builds up to just one "thing") and can differ wildly in structure (from just one script, to external tools, to complex pieces of code that run many steps of the analysis).
This complexity results in a plethora of different ways to structure the code and the data during the project.
I will not say that the Kerblam! way is the one-and-only, cover-all way to structure your project, but I will say that it is a sensible default.
Kerblam!
The kerblam! way to structure a project is based on the design principles that we have seen, the characteristics of all data analysis project and some additional fundamental observations, which I list below:
1. All projects deal with input and output data.
2. Some projects have intermediate data that can be stored to speed up the execution, but that can be regenerated if lost (or if the pipeline changes).
3. Some projects generate temporary data that is needed during the pipeline but becomes obsolete when the execution ends.
4. Projects may deal with very large data files.
5. Projects may use different programming languages.
6. Projects, especially exploratory data analyses, require a record of all the trials that were made during the exploratory phase. Often, one last execution is the final one, with the resulting output the presented one.

Having these in mind, we can start to outline how Kerblam! deals with each of them.
Data
Points 1, 2, 3 and 4 deal with data.
A Kerblam! project has a dedicated `data` directory, as you'd expect. However, Kerblam! actually differentiates between the different data types. Other than input, output, temporary and intermediate data, Kerblam! also considers:
- Remote data is data that can be downloaded at runtime from a (static) remote source.
- Input data that is not remote is called precious, since it cannot be substituted if it is lost.
- All data that is not precious is fragile, since it can be deleted with little repercussion (i.e. you can just re-download it or re-run the pipeline to obtain it again).
Practically, data can be input/output/temp/intermediate, either fragile or precious and either local or remote.
To make the distinction between these different data types we could either keep a separate configuration that points at each file (a git-like system), or we specify directories where each type of file will be stored.
Kerblam! takes both of these approaches.
The distinction between input/output/temp/intermediate data is given by directories.
It's up to the user to save each file in the appropriate directory.
The distinction between remote and local files is however given by a config file, `kerblam.toml`, so that Kerblam! can fetch the remote files for you on demand3.
Fragile and precious data can just be computed from knowing the other two variables.
Two birds with one stone, or so they say.
The only data that needs to be manually shared with others is precious data. Everything else can be downloaded or regenerated by the code. This means that the only data that needs to be committed to version control is the precious one. If you strive to keep precious data to a minimum - as should already be the case - analysis code can be kept tiny, size-wise. This makes Kerblam! compliant with principle B4 and makes it easier (or in some cases possible) to be compliant with principle A5.
Execution
Points 5 and 6 are generally covered by pipeline managers. A pipeline manager, like snakemake or nextflow, executes code in a controlled way in order to obtain output files. While both of these were made with data analysis in mind, they are both very powerful and very "complex"6 and unwieldy for most projects.
Kerblam! supports simple shell scripts (which in theory can be used to run anything, even pipeline managers like nextflow or snakemake) and makefiles natively. `make` is a quite old GNU utility that is mainly used to build packages and create compiled C/C++ projects. However, it supports and manages the creation of any file with any creation recipe. It is easy to learn and quick to write, and sits at the perfect spot for most analyses, between a simple shell script and a full-fledged pipeline manager.
Kerblam! considers these executable scripts and makefiles as "pipes", where each pipe can be executed to obtain some output. Each pipe should call external tools and internal code. If code is structured following the unix philosophy, each different piece of code ("program") can be reused in the different pipelines and interlocked with one another inside pipelines.

With these considerations, point 6 can be addressed by making different pipes with sensible names and saving them in version control. Point 5 is easy if each program is independent of the others and developed in its own folder. Kerblam! appoints the `./src` directory to contain the program code (e.g. scripts, directories with programs, etc...) and the `./src/pipes` directory to contain shell scripts and makefile pipelines.
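As a toy illustration of the idea (not taken from a real project; every file name below is invented), a pipe can be as small as a shell script that turns files in `data/in` into files in `data/out`:

```shell
# A minimal "pipe": read an input file, produce an output file.
# In a real project this could live at, say, ./src/pipes/count_genes.sh.
set -eu

mkdir -p data/in data/out
printf 'gene_a\ngene_b\n' > data/in/genes.txt   # stand-in input data

# The actual "analysis" step: count the input records.
wc -l < data/in/genes.txt > data/out/gene_count.txt
```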
These steps fulfill the design principle D7: Makefiles and shell scripts are easy to read, and having separate folders for pipelines and actual code that runs makes it easy to know what is what. Having the rest of the code be sensibly managed is up to the programmer.
Principle E8 can be messed up very easily, and the reproducibility crisis is a symptom of this. A very common way to make any analysis reproducible is to package the execution environment into containers, executable bundles that can be configured to do basically anything in an isolated, controlled environment.
Kerblam! projects leverage docker containers to make the analysis as easily reproducible as possible. Using docker for the most basic tasks is relatively straightforward:
- Start with an image;
- Add dependencies;
- Copy the current environment;
- Setup the proper entrypoint;
- Execute the container with a directory mounted to the local file system in order to extract the output files as needed.
Kerblam! automatically detects dockerfiles in the `./src/dockerfiles` directory and builds and executes the containers following this simple schema. To give as much freedom as possible to the user, Kerblam! does not edit or check these dockerfiles; it just executes them in the proper environment and with the correct mounting points.
The output of a locally-run pipeline cannot be trusted as it is not reproducible. Having Kerblam! natively run all pipelines in containers allows development runs to be exactly the same as the output runs when development ends.
To be compliant with principle D7, it must be clear which dockerfile is needed for which pipeline, which can be challenging. Kerblam! requires that pipes and their respective dockerfiles have the same name.
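As a sketch of the five steps above, a dockerfile that Kerblam! could pick up might look like the following. It would be saved under `./src/dockerfiles` with the same name as its pipe; the base image, dependencies and pipe script here are all invented for illustration:

```dockerfile
# 1. Start with an image (illustrative base).
FROM python:3.12-slim
# 2. Add dependencies (illustrative).
RUN pip install --no-cache-dir pandas
# 3. Copy the current environment.
COPY . .
# 4. Set up the proper entrypoint (a hypothetical pipe script).
ENTRYPOINT ["bash", "src/pipes/count_genes.sh"]
# 5. At run time, a directory is mounted to the local filesystem
#    so that output files can be extracted as needed.
```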
Documentation
Documentation is essential, as we said in principle C9. However, documentation is for humans, and it is generally well established how to lay out the documentation files in a repository:
- Add a `README` file.
- Add a `LICENSE`, so it's clear how others may use your code.
- Create a `/docs` folder with other documentation, such as `CONTRIBUTING` guides, tutorials and generally human-readable text needed to understand your project.
There is little that an automated tool can do to help with documentation. There are plenty of guides online that deal with the task of documenting a project, so I will not cover it further.
Python packaging is a bit weird, since there are so many packaging engines that create Python packages. Most online guides use `setuptools`, but modern Python (as of Dec 2023) now works with the `build` script and a `pyproject.toml` file, which supports different build engines. See this PEP for more info.
I cannot find a good adjective other than "complex". These tools are not hard to use, or particularly difficult to learn, but they do have an initial learning curve. The thing that I want to highlight is that they are so formal, and require such careful specification of inputs, outputs, channels and pipelines, that they become a bit unwieldy to use as a default. For large projects with many moving parts and a lot of computing (e.g. the need to run in a cluster), using programs such as these can be very important and useful. However, bringing a tank to a fist fight could be a bit too much.
Big binary blobs bad.
We must use a version control system.
Be logical, obvious and predictable.
Be (easily) reproducible.
Documentation is good. We should do more of that.