Verified commit b1d0940a authored by Laurent Modolo
data_mangement: fix typo based on D. Jost comments

parent 99a48709
Pipeline #301 passed with stage
in 18 seconds
......@@ -22,7 +22,7 @@ Guide of good practice}
This document is a summary of the information that you can find in the \href{https://biowiki.biologie.ens-lyon.fr/}{biowiki} and \href{https://lbmc.gitbiopages.ens-lyon.fr/hub/good_practices/good_practices.html}{guide of good practices} of the LBMC.
Nowadays, numerical data are at the core of the scientific activities and we often worry about their management and safe-keeping.
You will find in this guide a list of storage facilities that you have access to, as a member of the LBMC and guideline on how to use these facilities.
You will find in this guide a list of the storage facilities that you have access to as a member of the LBMC, and guidelines on how to use these facilities.
Not all data are equal. For example, some data need to be shared while others need to be accessible only to one user or even encrypted.
In this document, we are going to first classify data according to their size and nature:
......@@ -37,20 +37,20 @@ The {\bf experimental data} category can be seen as quite open.
In the data {\bf backup} community, we often further categorize {\bf experimental data} as:
\begin{itemize}
\item {\bf hot}: data on which you are working, you want a rapid access to them
\item {\bf warm}: data on which you may be working, you want an easy access to them
\item {\bf cold}: data on which you will not be working in a foreseeable future, you don't care if it takes some time to retrieve them.
\item {\bf hot}: data on which you are currently working; you want rapid access to them
\item {\bf warm}: data on which you may be working; you want easy access to them
\item {\bf cold}: data on which you will not be working in the foreseeable future; you don't care if it takes some time to retrieve them.
\end{itemize}
The {\bf hot} to {\bf cold} categorization is closely related to the money and energy cost of the underlying storage facilities (the colder the cheaper).
For all of the above categories, we need to discriminate between {\bf backed-up data} and {\bf archived data}.
The data that you are working on can have none to multiple {\bf backup}. An increase in the number of {\bf backup} will increase the resilience and the physical cost of the storage of your data, but also management time spent to update all the copies.
Data that will not change in the future can be {\bf archived}. In this case the data need to be deposited in an archive facility along with the correct {\bf metadata}, where it will get a unique identifier and will stay accessible {\it forever} (which requires a potentially large number of multi-site {\bf backup}).
The data that you are working on can have anywhere from zero to multiple {\bf backups}. An increase in the number of {\bf backups} will increase the resilience and the physical cost of the storage of your data, but also the management time spent to update all the copies.
Data that will not change in the future can be {\bf archived}. In this case the data need to be deposited in an archive facility along with the correct {\bf metadata}, where they will get a unique identifier and will stay accessible {\it forever} (which may require a potentially large number of multi-site {\bf backups}).
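
Part of the management time mentioned above goes into checking that every copy still matches the data you are working on. The following Python sketch illustrates one possible way to do this by comparing SHA-256 checksums. The function names \texttt{sha256\_of} and \texttt{stale\_copies} are hypothetical and not part of any LBMC tool, and the sketch assumes that all copies are reachable as ordinary files.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Reading in chunks keeps memory use low for large data files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stale_copies(original: Path, copies: list[Path]) -> list[Path]:
    """Return the copies that are missing or whose content differs from the original."""
    reference = sha256_of(original)
    return [
        copy for copy in copies
        if not copy.is_file() or sha256_of(copy) != reference
    ]
```

An empty list returned by \texttt{stale\_copies} suggests that all the listed copies are up to date; any path in the result points to a copy that should be refreshed.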
The \href{https://ec.europa.eu/research/participants/docs/h2020-funding-guide/cross-cutting-issues/open-access-data-management/data-management_en.htm}{H2020 recommendations to make research data findable, accessible, interoperable and reusable ({\bf FAIR})} encourage the use of data management plans to structure these metadata.
Data Management Plans or {\bf DMP}s) are a key element of good data management. A {\bf DMP} describes the data management life cycle for the data to be collected, processed and/or generated. As part of making research data {\bf FAIR}, a {\bf DMP} should include information on:
Data Management Plans (or {\bf DMP}s) are a key element of good data management. A {\bf DMP} describes the data management life cycle for the data to be collected, processed and/or generated. As part of making research data {\bf FAIR}, a {\bf DMP} should include information on:
\begin{itemize}
\item the handling of research data during \& after the end of the project
\item what data will be collected, processed and/or generated
......@@ -59,7 +59,7 @@ Data Management Plans or {\bf DMP}s) are a key element of good data management.
\item how data will be curated \& preserved (including after the end of the project).
\end{itemize}
The {\bf DMP} needs to be updated over the course of the project whenever significant changes arise, such as (but not limited to): new data, changes in consortium policies or changes in consortium composition and external factors.
The {\bf DMP} may need to be updated over the course of the project whenever significant changes arise, such as (but not limited to): new data, changes in consortium policies or changes in consortium composition and external factors.
We will now go over the solutions that you have access to, to store, {\bf backup}, and {\bf archive} your {\bf documents}, {\bf codes} and {\bf experimental data}.
......@@ -71,12 +71,12 @@ There are several solutions to {\bf backup} and share your {\bf documents}:
If your computer is correctly configured, you can \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=backup_auto_help}{make daily {\bf backups} of your {\bf documents}} on a wired connection.
Different snapshots of your {\bf documents} will stay accessible and you can restore your {\bf documents} at any of these snapshots.
Note that some \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=backup_auto_help}{type of files} are excluded from these {\bf backup}.
Note that some \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=backup_auto_help}{types of files} are excluded from these {\bf backups}.
\subsection{Data Backup and Synchronization Tools}
{\bf Backup and synchronization tools} allows you to continuously synchronize a list of the folder with a remove server.
In addition to providing you with a {\bf backup} of these folders, you can also easily share some of them with other users or between different computers (and increase the number of {\bf backup}).
{\bf Backup and synchronization tools} allow you to continuously synchronize a list of folders with a remote server.
In addition to providing a {\bf backup} of these folders, these tools let you easily share some of them with other users or between different computers (and thus increase the number of {\bf backups}).
You also have a small history of the last modifications, from which you can restore a given file to an earlier version.
\begin{itemize}
......@@ -86,13 +86,13 @@ You also have a small history of the last modifications where you can restore a
For both services, the data stored can be considered as heavily {\bf backed up} (the data should not be lost on their end).
Other famous USA companies also provide similar services, but despite the Safe Harbor and Privacy Shield, the \href{https://curia.europa.eu/juris/document/document.jsf?text=&docid=228677&pageIndex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=9791227}{court of justice of the EU ruled on July 16 2020, that the USA privacy law cannot be made compatible with EU privacy law.} Therefore, you should not use these services for work (or at least heavily encrypt the content stored on them).
Other well-known US companies also provide similar services, but despite the Safe Harbor and Privacy Shield, the \href{https://curia.europa.eu/juris/document/document.jsf?text=&docid=228677&pageIndex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=9791227}{Court of Justice of the EU ruled on July 16, 2020 that US privacy law cannot be made compatible with EU privacy law.} Therefore, you should not use these services for work (or at least you should heavily encrypt the content stored on them).
\subsection{Shared Network Volumes}
A shared network volume is seen by your computer as an external hard disk that is only available if your computer is connected to the corresponding network.
Even though it looks like an external hard disk, a shared network volume does not offer the same level of accessibility as local storage.
Shared network volume performances and availability can vary depending on to the load of the network which speeds will always be slower than the speed of your local storage.
The performance and availability of a shared network volume can vary depending on the load of the network, whose speed will always be lower than that of your local storage.
On the ENS network, you have access to the \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA} network volume.
Your \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA} space is accessible only to the members of your team and only from the ENS network (not through the VPN).
......@@ -109,7 +109,7 @@ You team can buy more storage, to add to \href{https://biowiki.biologie.ens-lyon
Most of the human bioinformatic work will result in the production of lines of {\bf code} or text. While these data are important, their size is often quite small and they should be copied to other places as often as possible.
Your documentation is also a valuable set of files.
\href{https://git-scm.com/}{Git} is nowadays the reference system to store {\bf code} and it's history of modification. It can be seen as the numerical equivalent of a cahier de laboratoire. You can even \href{https://docs.gitlab.com/ee/user/project/repository/gpg_signed_commits/}{numerically sign your contributions}.
\href{https://git-scm.com/}{Git} is nowadays the reference system to store {\bf code} and its history of modifications. It can be seen as the digital equivalent of a laboratory notebook ({\it cahier de laboratoire}). You can even \href{https://docs.gitlab.com/ee/user/project/repository/gpg_signed_commits/}{digitally sign your contributions}.
\subsection{\href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/gitlab}{Gitbio}}
......@@ -121,7 +121,7 @@ When using a version control system (see Section 3 of the \href{https://lbmc.git
\subsection{Code archive}
The EU and the CNRS and various French ministry support the \href{https://www.softwareheritage.org}{softwareheritage project}, which can make {\bf automatic archive} of git {\bf code} repositories.
The EU, the CNRS and various French ministries support the \href{https://www.softwareheritage.org}{Software Heritage project}, which can make {\bf automatic archives} of git {\bf code} repositories.
Upon publication of your work, you can therefore add your git repository to the \href{https://www.softwareheritage.org}{Software Heritage project} to {\bf archive} it.
......@@ -129,14 +129,14 @@ Upon publication of your work, you can therefore add your git repository to the
In this section we will present some rules to manage your project data. Given the size of current experimental data sets, one must find the balance between securing the data for one's project and avoiding the needless replication of gigabytes of data.
From the time spent to get the material, to the cost of the reagents and acquisition, your data are precious. Moreover for reproducibility concern you should always keep a raw version of your data to go back to. Those two points mean that you must make an {\bf archive} of your raw data as soon as possible (the external hard or thumb drive on which you can get them doesn’t count).
From the time spent to get the materials, to the cost of the reagents and acquisition, your data are precious. Moreover, for reproducibility you should always keep a raw version of your data to go back to. Those two points mean that you must make an {\bf archive} of your raw data as soon as possible (the external hard drive or thumb drive on which you received them doesn’t count).
When you receive data, it’s also always important to document them. Write a simple \texttt{description.txt} file in the same folder that describes your data and how they were generated. These metadata are important for archiving and indexing your data. There are numerous conventions for metadata terms that you can follow, like the \href{http://dublincore.org/documents/dcmi-terms/}{Dublin Core}. Metadata will also be useful for the people who are going to reuse your data (in a meta-analysis for example) and to cite them.
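
As an illustration, the following Python sketch writes such a \texttt{description.txt} file using a handful of Dublin Core term names (\texttt{title}, \texttt{creator}, \texttt{created}, \texttt{description}, \texttt{format}). The choice of these five terms, the \texttt{key: value} layout and the function name \texttt{write\_description} are assumptions made for this example, not an LBMC convention.

```python
from pathlib import Path

# A small, arbitrary selection of Dublin Core term names (assumption:
# your project may need more terms, or different ones).
DUBLIN_CORE_TERMS = ("title", "creator", "created", "description", "format")


def write_description(folder: Path, metadata: dict[str, str]) -> Path:
    """Write a description.txt file in `folder`, one 'term: value' line per term."""
    missing = [term for term in DUBLIN_CORE_TERMS if term not in metadata]
    if missing:
        raise ValueError(f"missing metadata terms: {', '.join(missing)}")
    lines = [f"{term}: {metadata[term]}" for term in DUBLIN_CORE_TERMS]
    target = folder / "description.txt"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```

Refusing to write the file when a term is missing is a deliberate choice in this sketch: an incomplete description is easy to overlook once the data have been stored.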
\subsection{Public Archives}
Public archives like the \href{https://www.ebi.ac.uk/submission/}{EBI} (EU) or the \href{https://www.ncbi.nlm.nih.gov/home/submit-wizard/}{NCBI} (USA) are free to use for academic purposes.
These institutions propose different services for different types of data for example for the \href{https://www.ebi.ac.uk/submission/}{ebi} (UE) you have:
These institutions offer different services for different types of data. For example, at the \href{https://www.ebi.ac.uk/submission/}{EBI} (EU) you have:
\begin{itemize}
\item \href{https://www.ebi.ac.uk/ena/browser/home}{ENA (the European Nucleotide Archive)} to store raw sequencing data, sequence assembly information and functional annotation
......@@ -145,9 +145,9 @@ These institutions propose different services for different types of data for ex
\end{itemize}
Once your raw data are deposited in a public archive, you can consider that they have a level of {\bf backup} that you cannot reasonably reach and that they are safe.
The archiving procedure request metadata information on the author of the data and the nature of the data. Filling the forms of the archiving procedure is akin to writing a {\bf DMP} with an infinite lifetime for the data.
The archiving procedure requests metadata information on the author of the data and on the nature of the data. Filling in the forms of the archiving procedure is akin to writing a {\bf DMP} with an infinite lifetime for the data.
Public archives propose an embargo time system during which your dataset will stay private. This you will get an automatic alert before the end of the embargo and you will be able to renew it as many times as you need.
Public archives offer an embargo system during which your dataset will stay private. You will get an automatic alert before the end of the embargo and you will be able to renew it as many times as you need.
Therefore, you should systematically archive your raw data.
\begin{itemize}
......@@ -171,8 +171,8 @@ The \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA} sto
\subsection{\href{http://www.ens-lyon.fr/PSMN/}{PSMN}:}
The PSMN (Pôle Scientifique de Modélisation Numérique) is the preferred high-performance computing (HPC) center that the LBMC has access to. LBMC members have access to a storage volume in the PSMN facilities, accessible once connected \href{http://www.ens-lyon.fr/PSMN/doku.php?id=contact:forms:inscription}{with a PSMN account}.
The access to these families is \href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/PSMN#section-0}{preferentialy done by the command line with {\bf ssh}} but can also be done \href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/PSMN#section-10}{with a graphical interface like Filezilla}.
You can \href{http://www.ens-lyon.fr/PSMN/doku.php?id=contact:forms:accueil}{request a training course to Cerasela Iliana Calugaru} to learn to use these resources.
Access to these volumes is \href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/PSMN#section-0}{preferably done from the command line with {\bf ssh}} but can also be done \href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/PSMN#section-10}{with a graphical interface like Filezilla}.
You can \href{http://www.ens-lyon.fr/PSMN/doku.php?id=contact:forms:accueil}{request a training course from Cerasela Iliana Calugaru} to learn how to use these resources.
A copy of your data can be placed in your PSMN team folder \texttt{/Xnfs/site/lbmcdb/team\_name}, with up to 600~TB of storage for the biology department.
You can contact \href{mailto:helene.polveche@ens-lyon.fr}{Helene Polveche} or \href{mailto:laurent.modolo@ens-lyon.fr}{Laurent Modolo} if you need help with this procedure. This will also facilitate the access to your data for the people working on your project if they use the PSMN computing facilities.
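
As a rough illustration of the copy step, the Python sketch below only builds an \texttt{rsync} command line that targets the team folder; it does not run it. The function name \texttt{build\_rsync\_command} and the \texttt{project} sub-folder are hypothetical, and the \texttt{login} and \texttt{host} values are placeholders: the actual PSMN host name is not given in this guide and must come from your PSMN account information.

```python
from pathlib import PurePosixPath

# Team storage root on the PSMN, as given in this guide.
PSMN_TEAM_ROOT = PurePosixPath("/Xnfs/site/lbmcdb")


def build_rsync_command(
    source_dir: str, login: str, host: str, team_name: str, project: str
) -> list[str]:
    """Build (but do not run) an rsync command copying source_dir to the team folder.

    `login` and `host` are placeholders supplied by the caller; `project` is a
    hypothetical sub-folder used to keep data sets apart.
    """
    destination = PSMN_TEAM_ROOT / team_name / project
    return [
        "rsync",
        "--archive",   # preserve permissions, timestamps and directory tree
        "--verbose",   # list the transferred files
        "--checksum",  # compare file content rather than size and date
        f"{source_dir.rstrip('/')}/",  # trailing slash: copy the content of the folder
        f"{login}@{host}:{destination}/",
    ]
```

The resulting list can be passed to \texttt{subprocess.run(command, check=True)} once the placeholders have been replaced by real values; since \texttt{rsync} runs over {\bf ssh} here, this matches the command-line access described above.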
......@@ -180,7 +180,7 @@ You can contact \href{mailto:helene.polveche@ens-lyon.fr}{Helene Polveche} or \h
\subsection{\href{https://cc.in2p3.fr/}{CCIN2P3}:}
The \href{https://cc.in2p3.fr/}{CCIN2P3} (Centre de Calcul de l’Institut national de physique nucléaire et de physique des particules) give access to a percentage of its resources to the Biologies. In addition to the computing resources, you can also make long-term backup of your data in this center.
The \href{https://cc.in2p3.fr/}{CCIN2P3} (Centre de Calcul de l’Institut national de physique nucléaire et de physique des particules) gives access to a percentage of its resources to biologists. In addition to the computing resources, you can also make long-term backups of your data in this center.
With a \href{http://www.ens-lyon.fr/PSMN/}{PSMN} account, you can make long-term {\bf backups} of your data there.
The \href{https://cc.in2p3.fr/}{CCIN2P3} does not know you and does not provide archiving services; therefore, you must write a {\bf DMP} to define information such as the owner of the data, their nature and their lifetime.
......