@@ -24,7 +24,8 @@ This document is a summary of the information that you can find in the \href{htt
...
@@ -24,7 +24,8 @@ This document is a summary of the information that you can find in the \href{htt
Nowadays, digital data are at the core of scientific activities and we often worry about their management and safe-keeping.
Nowadays, digital data are at the core of scientific activities and we often worry about their management and safe-keeping.
You will find in this guide a list of the storage facilities that you have access to as a member of the LBMC, and guidelines on how to use these facilities.
You will find in this guide a list of the storage facilities that you have access to as a member of the LBMC, and guidelines on how to use these facilities.
Not all data are equal. In this document, we are going to first classify data according to their size and nature:
Not all data are equal. For example, some data need to be shared while others need to be accessible only to one user or even encrypted.
In this document, we are going to first classify data according to their size and nature:
\begin{itemize}
\begin{itemize}
\item{\bf documents}: small files
\item{\bf documents}: small files
...
@@ -32,7 +33,9 @@ All data are not equal. In this document, we are going to first classify data ac
...
@@ -32,7 +33,9 @@ All data are not equal. In this document, we are going to first classify data ac
\item{\bf experimental data}: small to huge files
\item{\bf experimental data}: small to huge files
\end{itemize}
\end{itemize}
The {\bf experimental data} category can seem quite open. In the data backup community, we often further categorize data as:
The {\bf experimental data} category can seem quite open.
In the data {\bf backup} community, we often further categorize {\bf experimental data} as:
\begin{itemize}
\begin{itemize}
\item{\bf hot}: data that you are working on and need rapid access to
\item{\bf hot}: data that you are working on and need rapid access to
...
@@ -44,24 +47,36 @@ The {\bf hot} to {\bf cold} categorization is closely related to the money and e
...
@@ -44,24 +47,36 @@ The {\bf hot} to {\bf cold} categorization is closely related to the money and e
For all of the above categories, we need to distinguish between {\bf backed-up data} and {\bf archived data}.
For all of the above categories, we need to distinguish between {\bf backed-up data} and {\bf archived data}.
The data that you are working on can have anywhere from zero to several {\bf backups}. Increasing the number of {\bf backups} increases the resilience of your data, but also the physical cost of their storage and the management time spent updating all the copies.
The data that you are working on can have anywhere from zero to several {\bf backups}. Increasing the number of {\bf backups} increases the resilience of your data, but also the physical cost of their storage and the management time spent updating all the copies.
Data that will not change in the future can be {\bf archived}. In this case, the data need to be deposited in an archive facility along with the correct metadata, where they will get a unique identifier and stay accessible {\it forever}.
Data that will not change in the future can be {\bf archived}. In this case, the data need to be deposited in an archive facility along with the correct {\bf metadata}, where they will get a unique identifier and stay accessible {\it forever} (which requires a potentially large number of multi-site {\bf backups}).
The \href{https://ec.europa.eu/research/participants/docs/h2020-funding-guide/cross-cutting-issues/open-access-data-management/data-management_en.htm}{H2020 recommendations to make research data findable, accessible, interoperable and reusable ({\bf FAIR})} encourage the use of data management plans to structure these metadata.
Data Management Plans ({\bf DMP}s) are a key element of good data management. A {\bf DMP} describes the data management life cycle for the data to be collected, processed and/or generated. As part of making research data findable, accessible, interoperable and re-usable ({\bf FAIR}), a {\bf DMP} should include information on:
\begin{itemize}
\item the handling of research data during \& after the end of the project
\item what data will be collected, processed and/or generated
\item which methodology \& standards will be applied
\item whether data will be shared/made open access
\item how data will be curated \& preserved (including after the end of the project).
\end{itemize}
The {\bf DMP} needs to be updated over the course of the project whenever significant changes arise, such as (but not limited to): new data, changes in consortium policies or changes in consortium composition and external factors.
Finally, some data need to be shared while others need to be accessible only to one user or even encrypted.
\section{Documents}
\section{Documents}
There are several solutions to back up and share your documents:
There are several solutions to back up and share your {\bf documents}:
\subsection{Automatic backup for workstations}
\subsection{Automatic backup for workstations}
If your computer is correctly configured, you can \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=backup_auto_help}{make daily backups of your documents} on a wired connection.
If your computer is correctly configured, you can \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=backup_auto_help}{make daily {\bf backups} of your {\bf documents}} on a wired connection.
Different snapshots of your documents will stay accessible and you can restore your documents at any of these snapshots.
Different snapshots of your {\bf documents} will stay accessible and you can restore your {\bf documents} to any of these snapshots.
Note that some \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=backup_auto_help}{types of files} are excluded from these backups.
Note that some \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=backup_auto_help}{types of files} are excluded from these {\bf backups}.
\subsection{Data Backup and Synchronization Tools}
\subsection{Data Backup and Synchronization Tools}
Backup and synchronization tools allow you to continuously synchronize a list of folders with a remote server.
{\bf Backup and synchronization tools} allow you to continuously synchronize a list of folders with a remote server.
In addition to providing you with a backup of these folders, they also let you easily share some of them with other users or between different computers (which increases the number of backups).
In addition to providing you with a {\bf backup} of these folders, they also let you easily share some of them with other users or between different computers (which increases the number of {\bf backups}).
You also have a short history of the last modifications, from which you can restore a given file to an earlier version.
You also have a short history of the last modifications, from which you can restore a given file to an earlier version.
\begin{itemize}
\begin{itemize}
...
@@ -69,9 +84,9 @@ You also have a small history of the last modifications where you can restore a
...
@@ -69,9 +84,9 @@ You also have a small history of the last modifications where you can restore a
\item The EU provides a synchronization service called \href{https://b2drop.eudat.eu}{b2drop} (20 GB), which should be accessible to all members of the LBMC.
\item The EU provides a synchronization service called \href{https://b2drop.eudat.eu}{b2drop} (20 GB), which should be accessible to all members of the LBMC.
\end{itemize}
\end{itemize}
For both services, the data stored can be considered as heavily backed up (the data should not be lost on their end).
For both services, the data stored can be considered as heavily {\bf backed up} (the data should not be lost on their end).
Other famous USA companies also provide similar services, but despite the Safe Harbor and Privacy Shield, the \href{https://curia.europa.eu/juris/document/document.jsf?text=&docid=228677&pageIndex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=9791227}{court of justice of the EU ruled in July 16 2020, that the USA privacy law cannot be made compatible with EU privacy law.} Therefore, you should not use these services for work (or at least heavily encrypt the content stored on them).
Other well-known US companies also provide similar services, but despite the Safe Harbor and Privacy Shield frameworks, the \href{https://curia.europa.eu/juris/document/document.jsf?text=&docid=228677&pageIndex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=9791227}{Court of Justice of the EU ruled on July 16, 2020 that US privacy law cannot be made compatible with EU privacy law.} Therefore, you should not use these services for work (or at least heavily encrypt the content stored on them).
\subsection{Shared Network Volumes}
\subsection{Shared Network Volumes}
...
@@ -85,28 +100,28 @@ Your \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA} sp
...
@@ -85,28 +100,28 @@ Your \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA} sp
The \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA} storage space is managed by the ENS DSI and allows you to store raw data, directly from scientific platforms. Each team has access to two folders:
The \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA} storage space is managed by the ENS DSI and allows you to store raw data, directly from scientific platforms. Each team has access to two folders:
\begin{itemize}
\begin{itemize}
\item{\bf nameofteam/}: 2 TB, with daily snapshots on another server in the SLING room
\item{\bf nameofteam/}: 2 TB, with daily snapshots on another server in the SLING room
\item{\bf nameofteam2/}: 15 TB, backed up monthly by Stéphane
\item{\bf nameofteam2/}: 15 TB, {\bf backed up} monthly by Stéphane
\end{itemize}
\end{itemize}
Your team can buy more storage to add to \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA}.
Your team can buy more storage to add to \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA}.
\section{Codes}
\section{Codes}
Most of the human bioinformatic work will result in the production of lines of code or text. While important, such data are often quite small and should be copied to other places as often as possible.
Most of the human bioinformatic work will result in the production of lines of {\bf code} or text. While important, such data are often quite small and should be copied to other places as often as possible.
Your documentation is also a valuable set of files.
Your documentation is also a valuable set of files.
\href{https://git-scm.com/}{Git} is nowadays the reference system to store code and its history of modifications. It can be seen as the digital equivalent of a laboratory notebook. You can even \href{https://docs.gitlab.com/ee/user/project/repository/gpg_signed_commits/}{digitally sign your contributions}.
\href{https://git-scm.com/}{Git} is nowadays the reference system to store {\bf code} and its history of modifications. It can be seen as the digital equivalent of a laboratory notebook. You can even \href{https://docs.gitlab.com/ee/user/project/repository/gpg_signed_commits/}{digitally sign your contributions}.
All LBMC members have access to the \href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/gitlab}{Gitbio} server to back up and share their code.
All LBMC members have access to the \href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/gitlab}{Gitbio} server to back up and share their {\bf code}.
Using \texttt{git} means that a copy of these files exists at least on your computer (and the computer of every collaborator in the project), on the gitbio server and on the backup of the gitbio server (updated every 24h). The details of the code and documentation management within your project are described in the \texttt{src} and \texttt{doc} paragraphs of Section 1 of the \href{https://lbmc.gitbiopages.ens-lyon.fr/hub/good_practices/good_practices.html}{guide of good practices}.
Using \texttt{git} means that a copy of these files exists at least on your computer (and the computer of every collaborator in the project), on the gitbio server and on the {\bf backup} of the gitbio server (updated every 24h). The details of the {\bf code} and documentation management within your project are described in the \texttt{src} and \texttt{doc} paragraphs of Section 1 of the \href{https://lbmc.gitbiopages.ens-lyon.fr/hub/good_practices/good_practices.html}{guide of good practices}.
When using a version control system (see Section 3 of the \href{https://lbmc.gitbiopages.ens-lyon.fr/hub/good_practices/good_practices.html}{guide of good practices}), making regular pushes to the LBMC gitbio server will not only save you time when dealing with different versions of your project but also keep a copy of your code on the server.
When using a version control system (see Section 3 of the \href{https://lbmc.gitbiopages.ens-lyon.fr/hub/good_practices/good_practices.html}{guide of good practices}), making regular pushes to the LBMC gitbio server will not only save you time when dealing with different versions of your project but also keep a copy of your {\bf code} on the server.
\subsection{Code archive}
\subsection{Code archive}
The EU, the CNRS and various French ministries support the \href{https://www.softwareheritage.org}{softwareheritage project}, which can make {\bf automatic archives} of git code repositories.
The EU, the CNRS and various French ministries support the \href{https://www.softwareheritage.org}{softwareheritage project}, which can make {\bf automatic archives} of git {\bf code} repositories.
Upon publication of your work, you can therefore add your git repository to the \href{https://www.softwareheritage.org}{softwareheritage project} to {\bf archive} it.
Upon publication of your work, you can therefore add your git repository to the \href{https://www.softwareheritage.org}{softwareheritage project} to {\bf archive} it.
...
@@ -122,8 +137,9 @@ When you receive data, it’s also always important to document them. Write a si
...
@@ -122,8 +137,9 @@ When you receive data, it’s also always important to document them. Write a si
Public archives like \href{https://www.ebi.ac.uk/submission/}{ebi} (EU) or \href{https://www.ncbi.nlm.nih.gov/home/submit-wizard/}{ncbi} (USA) are free to use for academic purposes.
Public archives like \href{https://www.ebi.ac.uk/submission/}{ebi} (EU) or \href{https://www.ncbi.nlm.nih.gov/home/submit-wizard/}{ncbi} (USA) are free to use for academic purposes.
Once your raw data are deposited in a public archive, you can consider that they have a level of {\bf backup} that you cannot reasonably reach and that they are safe.
Once your raw data are deposited in a public archive, you can consider that they have a level of {\bf backup} that you cannot reasonably reach and that they are safe.
The archiving procedure requests metadata about the author of the data and the nature of the data. Filling in the forms of the archiving procedure is akin to writing a {\bf DMP} with an infinite lifetime for the data.
Moreover, public archives offer an embargo system during which your dataset will stay private. You will get an automatic alert before the end of the embargo and you will be able to renew it as many times as you need.
Public archives offer an embargo system during which your dataset will stay private. You will get an automatic alert before the end of the embargo and you will be able to renew it as many times as you need.
Therefore, you should systematically archive your raw data.
Therefore, you should systematically archive your raw data.
\begin{itemize}
\begin{itemize}
...
@@ -141,25 +157,29 @@ Moreover, your team can buy more storage if needed.
...
@@ -141,25 +157,29 @@ Moreover, your team can buy more storage if needed.
The \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA} storage space is managed by the ENS DSI and allows you to store raw data, directly from scientific platforms. Each team has access to two folders:
The \href{https://biowiki.biologie.ens-lyon.fr/doku.php?id=biodata}{BIODATA} storage space is managed by the ENS DSI and allows you to store raw data, directly from scientific platforms. Each team has access to two folders:
\begin{itemize}
\begin{itemize}
\item{\bf nameofteam/}: 2 TB, with daily snapshots on another server in the SLING room
\item{\bf nameofteam/}: 2 TB, with daily snapshots on another server in the SLING room
\item{\bf nameofteam2/}: 15 TB, backed up monthly by Stéphane
\item{\bf nameofteam2/}: 15 TB, {\bf backed up} monthly by Stéphane
The PSMN (Pôle Scientifique de Modélisation Numérique) is the preferred high-performance computing (HPC) center that the LBMC has access to. LBMC members have access to a storage volume in the PSMN facilities, accessible once connected \href{http://www.ens-lyon.fr/PSMN/doku.php?id=contact:forms:inscription}{with a PSMN account}.
The PSMN (Pôle Scientifique de Modélisation Numérique) is the preferred high-performance computing (HPC) center that the LBMC has access to. LBMC members have access to a storage volume in the PSMN facilities, accessible once connected \href{http://www.ens-lyon.fr/PSMN/doku.php?id=contact:forms:inscription}{with a PSMN account}.
Access to these facilities is \href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/PSMN#section-0}{preferentially done from the command line with {\bf ssh}} but can also be done \href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/PSMN#section-10}{with a graphical interface like Filezilla}.
Access to these facilities is \href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/PSMN#section-0}{preferentially done from the command line with {\bf ssh}} but can also be done \href{http://www.ens-lyon.fr/LBMC/intranet/services-communs/pole-bioinformatique/ressources/PSMN#section-10}{with a graphical interface like Filezilla}.
You can \href{http://www.ens-lyon.fr/PSMN/doku.php?id=contact:forms:accueil}{request a training course from Cerasela Iliana Calugaru} to learn how to use these resources.
A copy of your data can be placed in your PSMN team folder \texttt{/Xnfs/site/lbmcdb/team\_name}, with up to 600 TB of storage.
A copy of your data can be placed in your PSMN team folder \texttt{/Xnfs/site/lbmcdb/team\_name}, with up to 600 TB of storage for the biology department.
You can contact \href{mailto:helene.polveche@ens-lyon.fr}{Helene Polveche} or \href{mailto:laurent.modolo@ens-lyon.fr}{Laurent Modolo} if you need help with this procedure. This will also facilitate the access to your data for the people working on your project if they use the PSMN computing facilities.
You can contact \href{mailto:helene.polveche@ens-lyon.fr}{Helene Polveche} or \href{mailto:laurent.modolo@ens-lyon.fr}{Laurent Modolo} if you need help with this procedure. This will also facilitate the access to your data for the people working on your project if they use the PSMN computing facilities.
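Once a copy has been placed in the team folder, it can be worth checking that it matches the original. The following is a minimal sketch in Python 3, not an LBMC or PSMN tool: the function names \texttt{sha256\_of}, \texttt{manifest} and \texttt{compare\_copies} are hypothetical, and the paths in the usage note are placeholders. It compares the SHA-256 checksums of the files found in two folders.

\begin{verbatim}
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read chunk by chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(original, copy):
    """Return (missing, different) for the files of the original folder.

    missing: files of the original that are absent from the copy.
    different: files present in both whose content differs.
    """
    src, dst = manifest(original), manifest(copy)
    missing = sorted(set(src) - set(dst))
    different = sorted(f for f in src if f in dst and src[f] != dst[f])
    return missing, different
\end{verbatim}

For example, \texttt{compare\_copies("my\_data", "/Xnfs/site/lbmcdb/team\_name/my\_data")} returns two lists, which are both empty when every file of the original folder is present and identical in the copy. Files that exist only in the copy are not reported by this sketch.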
The \href{https://cc.in2p3.fr/}{CCIN2P3} (Centre de Calcul de l’Institut national de physique nucléaire et de physique des particules) gives access to a percentage of its resources to the Biologies.
The \href{https://cc.in2p3.fr/}{CCIN2P3} (Centre de Calcul de l’Institut national de physique nucléaire et de physique des particules) gives access to a percentage of its resources to the Biologies. In addition to the computing resources, you can also make long-term backups of your data in this center.
From the \href{http://www.ens-lyon.fr/PSMN/}{PSMN} you can make a {\bf backup} of your data there.
With a \href{http://www.ens-lyon.fr/PSMN/}{PSMN} account, you can make long-term {\bf backups} of your data there.
You will need to contact
The \href{https://cc.in2p3.fr/}{CCIN2P3} does not know you and does not provide archiving services; therefore, you must write a {\bf DMP} to define information such as the owner of the data, their nature and their lifetime.
The first step is to write a {\bf DMP}. You will need to create an \href{https://dmp.opidor.fr/}{account on dmp.opidor.fr}, where you can find a \href{https://dmp.opidor.fr/plans?plan%5Bfunder%5D%5Bid%5D=274&plan%5Borg%5D%5Bid%5D=4&plan%5Btemplate_id%5D=682}{{\bf DMP} template for the CCIN2P3}.
You can then contact the \href{http://www.ens-lyon.fr/PSMN/doku.php?id=contact:forms:accueil}{PSMN staff} to send this {\bf DMP} to the \href{https://cc.in2p3.fr/}{CCIN2P3}.
Once this {\bf DMP} is validated by the \href{https://cc.in2p3.fr/}{CCIN2P3} staff, you will be able to upload your data from the PSMN.