

important

The Data Stewardship Knowledgebase is under construction. Expect empty pages, warning signs, and hammers and nails left on the floor. It also might change drastically without notice.

The DSK aims to be a handbook of useful resources for both current Data Stewards handling data and future Data Stewards who are just approaching the subject. To this end, it has a few main goals:

  • Define what data stewardship is, and provide insight on what meaningful data stewardship should look like in different contexts, with particular emphasis on public research;
  • Aggregate in an orderly way the resources found scattered on the internet, as data management can be a diffuse topic touching many aspects in many different contexts;
  • Integrate information from other websites with additional context and, if needed, create new resources to fill gaps in publicly available knowledge;
  • Define lists of best practices and methods, and provide ways to find and define such methods, in a wide array of contexts;
  • Provide practical guides and how-tos for common or recurring problems in data stewardship and management in different contexts;
  • Promote principles of meaningful data stewardship in many research contexts, and provide teaching material that Data Stewards and other interested people can use to promote such principles to a wide audience;
  • Promote the critical evaluation of the philosophy of science and of how research groups and institutions do science, through the collection of useful resources and teaching materials.

The DSK is structured in four broad categories of interest: Open Science, Computer Science Toolbox, Policy and Legal Issues and Stewarding the Data Lifetime. They are described below, so that you may be aware of the overarching structure of the DSK.

Open Science

The profession of Data Steward, and the concept of meaningful, useful data stewardship for the benefit of the community, are the culmination of years of Open Science philosophy. This section aims to explore the aspects of Open Science, in particular in the context of data management. It covers topics such as:

  • What Open Science is;
  • Why Open Science is the right direction for researchers and research institutions to take;
  • What could go wrong if Open Science is implemented badly;
  • What Data Stewards do in the context of Open Science;
  • How to efficiently teach Open Science concepts to others;
  • Why data and data stewardship matter so much for Open Science;
  • Why a third party (like a researcher) might be interested in implementing Open Science and Data Stewardship policies.

Computer Science Toolbox

In the modern day, data is almost always manipulated digitally in some form. Even physical objects might be listed in a digital index, or scanned and digitized altogether. For this reason, a Data Steward needs some computer science knowledge and a toolbox of digital hammers and wrenches that are useful when dealing with digital data. This section covers topics such as:

  • What digital data is;
  • How digital data is encoded, transmitted and shared with others;
  • What formats are available to save data in;
  • What metadata is and which formats are available to represent it;
  • What data infrastructures are and how to manage them (as potential administrators);
  • Technologies to manipulate, reshape, fuse and split data;
  • Determination of costs related to data management (e.g. storage and computing power);
  • Knowledge of relevant tools that can be used to obtain, reshape, reuse, manipulate and share data throughout a research project.

Policy and Legal Issues

The administration of data, especially personal data, may be subject (or should be subjected) to laws. This section aims to aggregate such concepts and make a data steward both aware of them and capable of dealing with them. It covers topics such as:

  • National and International privacy laws regarding personal data;
  • Legal issues when reusing others' code and data;
  • Ethical concerns of releasing, reusing and otherwise manipulating data;
  • Determining the ethical and legal risks related to handling specific types of data;
  • How to give recognition when reusing a piece of data produced by others;
  • Creation of effective Open Science policies and plans of action for groups and organizations;
  • Fulfilling Open Science/Data Stewardship requirements for funding bodies that require them (e.g. DMPs);
  • The soft skills required for effective management and administration of an organization interested in implementing data stewardship practices.

Stewarding the data lifetime

The most expansive and heterogeneous section, "Stewarding the data lifetime" deals with the philosophical, practical and technical aspects of data stewardship, from the planning of data collection, to the manipulation of fresh data, to its potential deletion or archival. This section is heavily context-specific: ideas that apply to data in biological science might not be relevant to architectural studies, and vice versa. This section covers many topics; some examples include:

  • How to plan data collection, even at large scales and with many data collection partners;
  • Determining when, where and how to store newly created data;
  • Defining and measuring data quality for specific data types in specific contexts;
  • Designing and implementing data curation procedures, from collection to archival;
  • Solving the discard problem and defining methods and formats for long-term and very-long-term preservation of archived data;
  • Determining the best methods of reusing published data to limit unnecessary expenditure, with particular regard to ascertaining data quality and usefulness for the purpose.

Contributing

Thank you for wanting to contribute! Before contributing, please read the contributing guide in the GitHub repository of the project.

After you are familiar with how to contribute, you can use the edit icon in the top-right of each page to edit that page directly on GitHub and open a pull request with your change.

All contributions are treasured. You can find a list of all contributors on the contributors page. Thank you to all these wonderful people!

Core competence of Data Stewards

Data stewardship and the position of Data Steward (DS) are relatively recent (~2017). Therefore, the "core competences" of DSs - what DSs do and what they know - are still being considered.

In this report, the FAIRsFAIR consortium has analysed job offerings and other similar resources and generated a competence framework for DSs.

These competences are reported here, with some modifications, from the above document.

note

Further work on this page will link core competences to relevant pages in the Data Stewardship knowledgebase.

Data Management

"Data Management" is an umbrella term covering all aspects of working with data, similar to "data handling". Many of these concepts also fall under the broad term "🔰 data curation".

  • Develop and implement strategies for:
    • Data collection;
    • Data storage;
    • Data preservation;
    • Ensuring data is compliant with FAIR principles.
  • Create Data Management Plans and data governance policies that are aligned with best practices in the field.
  • Know and use relevant data and metadata types and formats, as well as use and develop common standards for data and metadata.
  • Be familiar with, develop and use metadata management tools.
  • Ensure recording of data provenance, including creation and manipulation, also through data publishing.
  • Develop and implement strategies for long-term data archival, including:
    • Develop data archival policies which comply with open science principles, open access policies and best practices for interoperability;
    • Archival of metadata, with specific emphasis on data provenance;
    • Policies for long term data accessibility and assurance of data integrity;
    • Estimation of long-term data archival costs.
  • Develop policies and methods to measure data quality and ensure compliance with community standards, also in coordination with data owners;
  • Develop, implement and supervise policies on data protection, especially when sharing data, including:
    • Compliance with data privacy laws such as the GDPR;
    • Ethical issues;
    • Address legal issues if necessary;
    • Digital data security and integrity, with reference to malicious data access, theft and tampering;
  • Collaborate with other Data Stewards and manage a team of Data Stewards;
  • Coordinate data-related activities between departments and between departments and external collaborators in accordance with local and foreign data policies;
  • Define domain-specific data management requirements, and supervise their development, also in collaboration with other departments.
  • Coordinate and supervise data acquisition.
  • Develop policies for the implementation of open science principles, including FAIR data;
  • Define, develop and supervise required infrastructure for data management and archival;
  • Provide tools, guidance and training to other experts that deal with data (e.g. researchers).

Data Engineering

"Data engineering" encompasses actual technologies that deal with data: collecting, analysing, transferring, storing and sharing it.

  • Be familiar with modern computer science technologies, specifically to:
    • Design and implement data analytics applications;
    • Design and develop experiments, processes and infrastructure for data handling during the whole data lifecycle, including:
      • Data collection;
      • Data storage;
      • Data cleaning (munging);
      • Data analysis;
      • Data visualization;
      • Data archival;
  • Develop and prototype specialised data handling procedures for specific needs.
  • Develop and manage infrastructure for data handling and analysis, with emphasis on big data, data streaming and batch processing, while ensuring provenance and FAIRness.
  • Develop, deploy and operate data infrastructure, including data storage, while following data management policies, with specific attention to the implementation of FAIR principles.
  • Apply data security mechanisms throughout the data lifecycle, including designing and implementing data access policies for different stakeholders.
  • Design, build and operate SQL and NoSQL databases, with particular attention to data models (structure), consistent metadata, data vocabularies and data accessibility.
  • Develop and implement policies and methodologies for data reuse, interoperability and integration of local (i.e. of the organization) and external data.

Research methods and Project management

Data stewards need to work closely with researchers and other experts before, during and after research projects. It is therefore important to have competences in research management and, more broadly, project management. Some of these concepts might seem obvious and broad to people with a research background, but this might not be the case for people from other backgrounds.

  • Create new knowledge (i.e. concepts, understandings, relationships and capabilities) through the scientific method based on scientific facts and data;
  • Discover new approaches to achieve research goals, also through the reuse of available (FAIR) data and software.
  • Use available domain-related knowledge to generate novel sound hypotheses;
  • Inspect and periodically audit the research process, with specific regard to quality (i.e. integrity, soundness and usefulness), openness and inclusivity.
  • Design, develop and supervise data-driven projects, which include:
    • Project planning;
    • Experimental design, also in conjunction with experts in Data Science and data infrastructure, and with other data stewards;
    • Data collection;
    • Data handling.

Domain-specific competences

Each research domain works with wildly different data types, formats and sources. This means that each domain requires a different set of competences. This section tries to outline the contexts in which this domain-specific knowledge has to be taken into account.

  • Use and adapt general Data Science methods to domain-specific issues, such as:
    • Data types;
    • Data presentations;
    • Organizational roles and relations;
  • Analyse, collect and assess data to achieve organizational goals, such as quality assurance of the organizational system;
  • Identify and monitor performance indicators to identify and assess potential organizational challenges and needs. Specify data models, transparency policies and handling procedures for such performance indicators.
  • Monitor and analyse indicators to identify current trends and potential future developments in local adoption of policies, methods, tools and other areas related to data management, FAIR implementation and open science. Ensure transparency of the process;
  • Coordinate organization-level activities between different domains related to data management, provenance and analytics, with particular focus on data FAIRness throughout the data lifecycle.

Emoji Key

Many links are tagged with emojis. Here's what they mean:

  • Type of content:
    • ➰ > A link to another page in the DSK.
    • 💬 > Opinion piece, presentation, blog post or other content by an individual or organization.
    • 📰 > News article, editorial or piece by a journalist.
    • 🏢 > Official communication from an organization, oftentimes an institutional (i.e. Government-backed) organization.
    • 🧑‍⚖️ > Text of a law or other binding document currently active in one or more countries. Associate the ❌ emoji if the law is no longer in effect.
    • 📑 > Published research article or review in a canonical peer-reviewed journal, or similar (e.g. ongoing open peer review).
    • 📄 > Preprint in a preprint server.
    • 📃 > Poster or other vignette.
    • 📕 > Book or long-form report.
    • 💁 > Presentation to a meeting, conference, etc...
    • 📝 > Official agreement, treatise or manifesto of purpose with no legally binding effects published by an organization or group of organizations.
    • 🔨 > Tool, practical resource, checklist or handbook.
  • Format of content:
    • The default format is a simple webpage (HTML), and has no associated emoji.
    • 🔻 > PDF (.pdf).
    • 🔸 > Presentation (e.g. .pptx).
    • ▶️ > Video or other multimedia formats.
  • Language:
    • The default language is English, and has no associated emoji.
    • 🇮🇹 > Italian.
    • 🇫🇷 > French.
    • 🇪🇸 > Spanish.
  • Accessibility:
    • The default accessibility is unrestricted (e.g. an Open Access paper), and has no associated emoji. Such resources should be freely perusable without any expense by the user (other than a computer, electricity and a web connection, obviously).
    • 🔒 > This resource is paywalled, requires a login or is not publicly and freely available due to other reasons.
    • 🔐 > This resource requires a login or registration in order to provide its services, but it is otherwise free to use or read.
  • Content quality or usability:
    • 🔰 > Easy to use, understand or in general a beginner-friendly resource.
    • ⭐ > This resource is particularly important or fundamental for a topic.
    • ❌ > Retracted, false or misleading information.
  • Other:
    • 🍪 > This website requires the usage of cookies.
    • 📥 > This link immediately downloads a file to the user's computer.
    • ⚫ > This link has been screened, but no other emoji tags apply.

Not all links are fully tagged. Please consider contributing if you find an error or an omission.
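
The tagging convention above can also be read mechanically. As a minimal sketch, assuming that tags appear as space-separated emojis at the start of a link label (some entries place text before the tags, which this sketch does not handle), the Python below splits a label into its title and decoded tags. The `EMOJI_KEY` subset, the `DEFAULTS` mapping and the `parse_link_label` helper are hypothetical names introduced only for illustration; they are not part of the DSK.

```python
# Hypothetical subset of the emoji key: emoji -> (category, meaning).
# The complete key is the list given on this page.
EMOJI_KEY = {
    "🏢": ("type", "official communication"),
    "📑": ("type", "peer-reviewed article"),
    "📕": ("type", "book or long-form report"),
    "🔨": ("type", "tool or handbook"),
    "🔻": ("format", "PDF"),
    "🇮🇹": ("language", "Italian"),
    "🔒": ("accessibility", "restricted"),
    "🔰": ("quality", "beginner-friendly"),
    "📥": ("other", "immediate download"),
}

# Defaults stated in the key for categories that carry no emoji.
DEFAULTS = {
    "format": "HTML",
    "language": "English",
    "accessibility": "unrestricted",
}


def parse_link_label(label):
    """Return (title, tags) for a label such as '🏢 🔻 Some title'."""
    tokens = label.split()
    tags = {}
    index = 0
    # Consume leading tokens for as long as they are known emoji tags.
    while index < len(tokens) and tokens[index] in EMOJI_KEY:
        category, meaning = EMOJI_KEY[tokens[index]]
        tags.setdefault(category, []).append(meaning)
        index += 1
    # Apply the documented defaults where no tag was given.
    for category, default in DEFAULTS.items():
        tags.setdefault(category, [default])
    title = " ".join(tokens[index:])
    return title, tags


title, tags = parse_link_label("🏢 🔻 📥 UNESCO - Open Science Outlook 1")
print(title)
print(tags)
```

Keeping the key in a single mapping means that a new emoji only needs one new entry, and untagged links fall back to the defaults described in the key.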

Open Science

The profession of Data Steward, and the concept of meaningful, useful data stewardship for the benefit of the community, are the culmination of years of Open Science philosophy. This section aims to explore the aspects of Open Science, in particular in the context of data management. It covers topics such as:

  • What Open Science is;
  • Why Open Science is the right direction for researchers and research institutions to take;
  • What could go wrong if Open Science is implemented badly;
  • What Data Stewards do in the context of Open Science;
  • How to efficiently teach Open Science concepts to others;
  • Why data and data stewardship matter so much for Open Science;
  • Why a third party (like a researcher) might be interested in implementing Open Science and Data Stewardship policies.

What is Open Science?

definition

Open science is a set of principles and practices that aim to make scientific research from all fields accessible to everyone for the benefits of scientists and society as a whole. Open science is about making sure not only that scientific knowledge is accessible but also that the production of that knowledge itself is inclusive, equitable and sustainable.

  • 🏢 ⭐ UNESCO definition of Open Science
  • UNESCO 🔻 🏢 Recommendations for Open Science
  • 🏢 🔻 📥 ⭐ Strategic Research and Innovation Agenda: Critical success factors for Open Science in Europe.
    • See section 1.3 for the definition of Open Science and some historical facts.
  • 🏢 🔻 📕 UNESCO - Open Science Outlook 1.
    • This is a very long document (74 pages), on the status of Open Science in 2023, but has a section of "Key Messages" that summarize its message. These include the benefits of Open Science, how to achieve its goals, how it has grown and what it needs to grow further.
  • 🏢 🔻 📥 Horizon Europe's application template with a section on Open Science practices
    • Under the methodology section, the grant specifies that applicants should "Describe how appropriate open science practices are implemented as an integral part of the proposed methodology. Show how the choice of practices and their implementation are adapted to the nature of your work, in a way that will increase the chances of the project delivering on its objectives [e.g. 1 page]. If you believe that none of these practices are appropriate for your project, please provide a justification here."
  • 💬 🇮🇹 Elena Giglia - Open Science è una necessità, non una noia burocratica
    • An overview article about Open Science, scholarly publishing and the importance of making research accessible to everyone, also in light of the Covid-19 pandemic.
  • 💬 💁 🔻 Dr. Jon Tennant - Open Science is just Good Science
    • Tennant touches on what Open Science is, its benefits, and how to put it in practice.

History of Open Science

This section covers the history of Open Science, from its inception, to crucial events in its history, to the current day.

The Open Movement in Europe

Open Science has strong backing from the European Commission:

Open Science and Covid-19

The Covid-19 pandemic has highlighted the importance of Open Science. This section includes resources that discuss how Open Science has helped in the fight against Covid-19 and how it went wrong in some cases.

Open Science Organizations

This page collects some information about open science organizations together with a brief description, their motives and goals, and the services they offer.

cOAlition S

cOAlition S is an organization built around "Plan S", a commitment to make all articles on publicly funded research Open Access, effective immediately. You can read more on the cOAlition S about page and on 📝 Plan S.

  • The 🔻 🏢 Coalition S preamble is the founding document of the coalition, with all considerations made when creating it plus its goals.

COARA

COARA, the Coalition for Advancing Research Assessment, is an organization striving to reform the methods for research assessment in accordance with ➰ Open Science principles.

In particular, they aim to find methods to reward all types of research outputs, not only publications and patents.

COARA is a coordinated group effort divided into 📰 COARA National Chapters and 📰 COARA Working groups. The COARA website is the access point to all resources for the COARA initiative.

COARA and the force behind it have produced some changes:

Alternative metric sources, detached from canonical publishers and publishing in general, are crucial for COARA. Here are a few tools and resources built to that end:

Miscellaneous resources on the reform of research evaluation:

Scientific Communication

This section deals with scientific communication. In particular, it focuses on the role of publishers, how the publishing industry has changed over the years, and what new opportunities are available for researchers in the modern era.

The case of Elsevier

These resources discuss the publisher Elsevier in particular, as a case study.

  • 💬 Publisher control of all scholarly infrastructure
    • How publishing groups have started to control all aspects of research output: from planning research questions, to literature review, to data collection, to peer review, to publication, to dissemination.
  • 📑 Jefferson Pooley - Surveillance Publishing
    • "This essay develops the idea of surveillance publishing, with special attention to the example of Elsevier. A scholarly publisher can be defined as a surveillance publisher if it derives a substantial proportion of its revenue from prediction products, fueled by data extracted from researcher behavior."
  • Navigating Risk in vendor data privacy practices, an analysis of Elsevier's ScienceDirect
  • 📝 SPARC's 2021 Update
    • SPARC is "a non-profit advocacy organization that supports systems for research and education that are open by default and equitable by design." (https://sparcopen.org/who-we-are/). This document "[...] suggests organizational changes in academic institutions to both (1) manage increasing strategic and ethical challenges and (2) deploy hammers and analyze data to better understand the needs and protect the interests of individuals and communities."
    • 📝 📥 🔻 Direct PDF Link
  • 📰 💬 Sci-hub, Elsevier and Wiley declare war on research communities in India

Alternatives to traditional publishing

Open Access

This section includes resources specifically about Open Access.

  • 🏢 Berlin declaration on Open Access
    • The founding document of the Open Access movement, it delineates the requirement to move away from paywalled content in the era of the internet towards Open Access. It defines what Open Access is, and how to support the transition to the open paradigm.
  • 🍪 ScienceOpen - Open Access Survey results
    • A survey of 60 researchers about Open Access. The low number of respondents makes the results not very reliable.
    • The sampling strategy is also unclear. This may have been a convenience sample of people who participated in a ScienceOpen event, which would make the results not generalizable.
  • 📑 Shift academic culture through publication, an article discussing how exploitative publishers are a problem, in particular how they discriminate against poorer researchers.
  • European Commission - 🏢 🔻 Study of scientific publishing in Europe (2024), on the state of scientific publishing in Europe, including publishing costs.
  • 🇫🇷 🏢 Barometer of Open Science, data on the progressive shift to open publishing practices in France.
  • DOAJ - 🔨 Open Access Journal repository
  • Open Science Cafè - 🇮🇹 💁 Attività europee per l'open access

Sherpa helps authors decide where to publish, including services that compile what their rights are after publication. See 🔨 About Sherpa for an overview:

  • 🔨 Sherpa Romeo: what are the archiving policies of different journal publishers? An author can go here to learn how to open up their articles, even when publishing in a closed-access journal.
  • 🔨 Sherpa Juliet: what are the publishing requirements of funding agencies? Authors can check the publishing requirements based on who funds their research.
  • 🔨 Sherpa Fact: combining data from Romeo and Juliet, it shows if journals are compliant with best publishing practices.

Some universities provide open access publishing services. An example is 🇮🇹 Sirio, for the University of Turin.

So-called "hybrid journals" provide both open access and closed access articles. They are 🏢 generally regarded as bad for open access.

Preprints

A Preprint is an article ready to be sent for peer review. Such versions of the articles 📑 usually differ little from their peer-reviewed counterparts, and are therefore a valid open alternative to reading regular articles.

The coronavirus pandemic required immediate action. Preprints were essential for this, as they provided immediate knowledge to the public.

Talking points

This section includes resources that discuss the importance of Open Science to a wider audience, including anecdotes, examples, stories from researchers, comics, etc. They can be useful to introduce Open Science during talks, presentations and conferences.

Publishers can be very protective of published data, as it makes them a lot of money. See, for instance, the case of ResearchGate v publishers, ResearchGate bows to publishers and ResearchGate announcement on the topic.

Reproducibility Crisis

The reproducibility crisis we are experiencing in many research areas has highlighted the importance of Open Science. This section includes resources that discuss the reproducibility crisis and how Open Science can help alleviate it.

Computer Science toolbox

In the modern day, data is almost always manipulated digitally in some form. Even physical objects might be listed in a digital index, or scanned and digitized altogether. For this reason, a Data Steward needs some computer science knowledge and a toolbox of digital hammers and wrenches that are useful when dealing with digital data. This section covers topics such as:

  • What digital data is;
  • How digital data is encoded, transmitted and shared with others;
  • What formats are available to save data in;
  • What metadata is and which formats are available to represent it;
  • What data infrastructures are and how to manage them (as potential administrators);
  • Technologies to manipulate, reshape, fuse and split data;
  • Determination of costs related to data management (e.g. storage and computing power);
  • Knowledge of relevant tools that can be used to obtain, reshape, reuse, manipulate and share data throughout a research project.

important

This section is heavily under construction.

Basics of computer science

  • Files and filesystems
  • Basics of the internet and shared computing

Programming languages

  • What are programming languages?
  • Python

Data Structures, serialization and storage

  • Basic data structures and types
  • Serialization and Deserialization
  • Compression

AI and Machine Learning

Policy and legal issues

The administration of data, especially personal data, may be subject (or should be subjected) to laws. This section aims to aggregate such concepts and make a data steward both aware of them and capable of dealing with them. It covers topics such as:

  • National and International privacy laws regarding personal data;
  • Legal issues when reusing others' code and data;
  • Ethical concerns of releasing, reusing and otherwise manipulating data;
  • Determining the ethical and legal risks related to handling specific types of data;
  • How to give recognition when reusing a piece of data produced by others;
  • Creation of effective Open Science policies and plans of action for groups and organizations;
  • Fulfilling Open Science/Data Stewardship requirements for funding bodies that require them (e.g. DMPs);
  • The soft skills required for effective management and administration of an organization interested in implementing data stewardship practices.

Intellectual Property Rights

Stewarding the data lifetime

The most expansive and heterogeneous section, "Stewarding the data lifetime" deals with the philosophical, practical and technical aspects of data stewardship, from the planning of data collection, to the manipulation of fresh data, to its potential deletion or archival. This section is heavily context-specific: ideas that apply to data in biological science might not be relevant to architectural studies, and vice versa. This section covers many topics; some examples include:

  • How to plan data collection, even at large scales and with many data collection partners;
  • Determining when, where and how to store newly created data;
  • Defining and measuring data quality for specific data types in specific contexts;
  • Designing and implementing data curation procedures, from collection to archival;
  • Solving the discard problem and defining methods and formats for long-term and very-long-term preservation of archived data;
  • Determining the best methods of reusing published data to limit unnecessary expenditure, with particular regard to ascertaining data quality and usefulness for the purpose.

Contributors to the Data Stewardship Knowledgebase

Meaningful contributors to the project will be listed here.

List of maintainers

This is a list of currently active maintainers for the Data Stewardship Knowledgebase, in no particular order. They are responsible for reviewing and merging pull requests, as well as generally maintaining the repository and administering the public spaces of the project:

  • MrHedmad - E-mail luca.visentin (at) unito.it, Discord @MrHedmad.

All contributors

This is a list of all contributors to the project. Thanks to all these amazing people!