Commit e7a6bbee authored by Laurent Modolo

data_mangement: first version of the document

parent 1c5dabe3
all: public/good_practices.html \
	public/data_management.html \
	public/github-pandoc.css

public/github-pandoc.css: github-pandoc.css
	cp github-pandoc.css public/github-pandoc.css

public/data_management.html: data_management.tex bibliography.bib github-pandoc.css
	pandoc data_management.tex --bibliography=bibliography.bib -c github-pandoc.css --citeproc --toc --standalone --number-sections -o public/data_management.html

public/good_practices.html: good_practices.tex bibliography.bib github-pandoc.css
	pandoc good_practices.tex --bibliography=bibliography.bib -c github-pandoc.css --citeproc --toc --standalone --number-sections -o public/good_practices.html
colorlinks=false,% hyperlinks will be black
linkcolor=blue,% hyperlink text will be blue
pdfborderstyle={/S/U/W 1}% border style will be underline of width 1pt
\title{Biocomputing at LBMC\\
Guide of good practice}
\author{Laurent Modolo}
This document is a summary of the information that you can find in the \href{}{biowiki} and \href{}{guide of good practices} of the LBMC.
Nowadays, digital data are at the core of scientific activity, and we often worry about their management and safekeeping.
You will find in this guide a list of the storage facilities that you have access to as a member of the LBMC, and guidelines on how to use them.
Not all data are equal. In this document, we first classify data according to their size and nature:
\begin{itemize}
\item {\bf documents}: small files
\item {\bf codes}: small files with complex history
\item {\bf experimental data}: small to huge files
\end{itemize}
The {\bf experimental data} category can seem quite open. In the data backup community, data are often further categorized as:
\begin{itemize}
\item {\bf hot}: data you are currently working on, for which you want fast access
\item {\bf warm}: data you may work on, for which you want easy access
\item {\bf cold}: data you will not work on in the foreseeable future, for which you do not mind if retrieval takes some time
\end{itemize}
The {\bf hot} to {\bf cold} categorization is closely related to the financial and energy cost of the underlying storage facilities (the colder, the cheaper).
For all of the above categories, we need to distinguish between {\bf backed-up data} and {\bf archived data}.
The data that you are working on can have zero to several {\bf backups}. Increasing the number of {\bf backups} increases the resilience of your data, but also the physical cost of their storage and the management time spent keeping all the copies up to date.
Data that will not change in the future can be {\bf archived}. In this case, the data need to be deposited in an archive facility along with the correct metadata, where they will get a unique identifier and stay accessible {\it forever}.
Finally, some data need to be shared while others need to be accessible only to one user or even encrypted.
There are several solutions to back up and share your documents:
\subsection{Automatic backup for workstations}
If your computer is correctly configured, you can \href{}{make daily backups of your documents} on a wired connection.
Different snapshots of your documents will stay accessible, and you can restore your documents to any of these snapshots.
Note that some \href{}{types of files} are excluded from these backups.
\subsection{Data Backup and Synchronization Tools}
Backup and synchronization tools allow you to continuously synchronize a list of folders with a remote server.
In addition to providing you with a backup of these folders, they let you easily share some of them with other users or between different computers (which increases the number of backups).
You also have a short history of the latest modifications, from which you can restore a given file to an earlier version.
\begin{itemize}
\item The CNRS provides a synchronization service called \href{}{MyCore} (100 GB), which should be accessible to all members of the LBMC.
\item The EU provides a synchronization service called \href{}{b2drop} (20 GB), which should be accessible to all members of the LBMC.
\end{itemize}
For both services, the stored data can be considered heavily backed up (the data should not be lost on their end).
Other well-known US companies provide similar services but, despite the Safe Harbor and Privacy Shield agreements, the \href{}{Court of Justice of the EU ruled on July 16, 2020 that US privacy law cannot be made compatible with EU privacy law.} Therefore, you should not use these services for work (or you should at least heavily encrypt the content stored on them).
\subsection{Shared Network Volumes}
A shared network volume is seen by your computer as an external hard disk that is only available when your computer is connected to the corresponding network.
Even if they look like external hard disks, shared network volumes do not offer the same level of accessibility as local storage.
Their performance and availability vary with the load of the network, whose speed will always be lower than that of your local storage.
On the ENS network, you have access to the \href{}{BIODATA} network volume.
Your \href{}{BIODATA} space is only accessible to the members of your team and only from the ENS network (not through the VPN).
The \href{}{BIODATA} storage space is managed by the ENS DSI and allows you to store raw data directly from scientific platforms. Each team has access to two folders:
\begin{itemize}
\item {\bf nameofteam/}: 2 TB, with daily snapshots on another server in the SLING room
\item {\bf nameofteam2/}: 15 TB, backed up monthly by Stéphane
\end{itemize}
Your team can buy more storage to add to \href{}{BIODATA}.
Most human bioinformatics work results in the production of lines of code or text. While important, such data are often quite small and should be copied to other places as often as possible.
Your documentation is also a valuable set of files.
\href{}{Git} is nowadays the reference system to store code and its history of modifications. It can be seen as the digital equivalent of a laboratory notebook. You can even \href{}{digitally sign your contributions}.
All LBMC members have access to the \href{}{Gitbio} server to back up and share their code.
Using \texttt{git} means that a copy of these files exists at least on your computer (and on the computer of every collaborator in the project), on the gitbio server, and on the backup of the gitbio server (updated every 24h). The details of code and documentation management within your project are developed in the \texttt{src} and \texttt{doc} paragraphs of Section 1 of the \href{}{guide of good practices}.
When using a version control system (see Section 3 of the \href{}{guide of good practices}), making regular pushes to the LBMC gitbio server will not only save you time when dealing with different versions of your project, but also save a copy of your code on the server.
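As a minimal sketch of this workflow, the commands below create a repository, record a first commit, and push it to a remote. A local bare repository stands in for the gitbio server here, and the file names, user name, and e-mail address are placeholders; in practice the remote URL would be the one given for your project on the gitbio server.

\begin{verbatim}
# Stand-in for the remote server: a local bare repository (assumption).
remote_dir=$(mktemp -d)
git init --quiet --bare "$remote_dir/project.git"

# Create the working repository for the project.
work_dir=$(mktemp -d)
cd "$work_dir"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.org"   # placeholder identity

# Add some code and documentation, then record them in the history.
mkdir -p src doc
echo 'echo "hello"' > src/analysis.sh
echo 'Project notes' > doc/README.txt
git add src doc
git commit --quiet -m "Add first analysis script and notes"

# Declare the remote and push the current branch to it.
git remote add origin "$remote_dir/project.git"
git push --quiet origin HEAD
\end{verbatim}

Pushing after each work session keeps the copy on the server, and therefore its daily backup, up to date.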
\subsection{Code archive}
The EU, the CNRS, and various French ministries support the \href{}{Software Heritage project}, which can {\bf automatically archive} git code repositories.
Upon publication of your work, you can therefore add your git repository to the \href{}{Software Heritage project} to {\bf archive} it.
\section{Experimental Data}
In this section, we present some rules to manage your project data. Given the size of current experimental data sets, one must find a balance between securing the data for a project and avoiding the needless replication of gigabytes of data.
From the time spent obtaining the material to the cost of the reagents and of the acquisition, your data are precious. Moreover, for reproducibility, you should always keep a raw version of your data to go back to. These two points mean that you must make an {\bf archive} of your raw data as soon as possible (the external hard drive or thumb drive on which you received them doesn’t count).
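One way to check that a copy of your raw data is still identical to the files you received is to record a checksum for each file before copying them. The sketch below assumes that GNU \texttt{sha256sum} is available; the directory and the two small files are placeholders standing in for real raw data.

\begin{verbatim}
# Placeholder directory and files standing in for raw data.
data_dir=$(mktemp -d)
printf 'first placeholder file\n'  > "$data_dir/sample_1.raw"
printf 'second placeholder file\n' > "$data_dir/sample_2.raw"
cd "$data_dir"

# Record one checksum per raw file, next to the data.
sha256sum sample_*.raw > checksums.sha256

# Later, or on the copy: verify that every file still matches.
# --quiet only prints the files that fail the check.
sha256sum --check --quiet checksums.sha256
\end{verbatim}

Keeping \texttt{checksums.sha256} with the data lets anyone rerun the last command on any copy.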
When you receive data, it’s also important to document them. Write a simple \texttt{description.txt} file in the same folder that describes your data and how they were generated. These metadata are important to archive and index your data. There are numerous conventions for metadata terms that you can follow, like the \href{}{Dublin Core}. Metadata will also be useful to the people who are going to reuse your data (in a meta-analysis, for example) and cite them.
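As an illustration, a \texttt{description.txt} file could be written as follows. The field names are borrowed from the Dublin Core element set; the one-field-per-line layout and all the values are assumptions made for this example, not a required format.

\begin{verbatim}
# Write a small metadata file next to the data (all values are placeholders).
cat > description.txt <<'EOF'
title: Placeholder title of the dataset
creator: Jane Doe (placeholder)
date: 2020-01-01
type: Dataset
format: fastq.gz
description: Placeholder text. Describe the samples, the protocol,
  the platform used and how the files were generated.
EOF
\end{verbatim}

A plain text file like this one stays readable without any specific software and can be copied as-is into the metadata form of an archive.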
\subsection{Public Archives}
Public archives like \href{}{EBI} (EU) or \href{}{NCBI} (USA) are free to use for academic purposes.
Once your raw data are deposited in a public archive, you can consider that they are safe and have a level of {\bf backup} that you cannot reasonably reach yourself.
Moreover, public archives offer an embargo system during which your dataset stays private. You will get an automatic alert before the end of the embargo, and you will be able to renew it as many times as you need.
Therefore, you should systematically archive your raw data.
\begin{itemize}
\item Once a dataset is archived, it will never be deleted.
\item These archives support a wide array of data types.
\item The embargo can be extended for as long as you want.
\item You will get a reminder when the end of the embargo is near. Thus your precious data won't go public inadvertently.
\end{itemize}
For many kinds of raw data, the storage available on \href{}{BIODATA} could be enough to have a backup.
Moreover, your team can buy more storage if needed.
The \href{}{BIODATA} storage space is managed by the ENS DSI and allows you to store raw data directly from scientific platforms. Each team has access to two folders:
\begin{itemize}
\item {\bf nameofteam/}: 2 TB, with daily snapshots on another server in the SLING room
\item {\bf nameofteam2/}: 15 TB, backed up monthly by Stéphane
\end{itemize}
The PSMN (Pôle Scientifique de Modélisation Numérique) is the preferred high-performance computing (HPC) center that the LBMC has access to. LBMC members have access to a storage volume in the PSMN facilities, which is accessible once connected \href{}{with a PSMN account}.
Access to these facilities is \href{}{preferably done from the command line with {\bf ssh}}, but can also be done \href{}{with a graphical interface like Filezilla}.
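For command-line access, an entry in your \texttt{\textasciitilde/.ssh/config} file can shorten the connection command. The fragment below is only a sketch: the alias \texttt{psmn} is arbitrary, and the host name and login are placeholders to replace with the values given with your PSMN account.

\begin{verbatim}
# ~/.ssh/config
# "psmn" is a local alias; HostName and User are placeholders.
Host psmn
    HostName login-node.example.org
    User your_psmn_login
\end{verbatim}

With such an entry, \texttt{ssh psmn} opens a session without retyping the full address, and the same alias can be used by file transfer tools that rely on \texttt{ssh}.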
A copy of your data can be placed in your PSMN team folder \texttt{/Xnfs/site/lbmcdb/team\_name}, with up to 600 TB of storage.
You can contact \href{}{Helene Polveche} or \href{}{Laurent Modolo} if you need help with this procedure. This will also make your data easier to access for the people working on your project if they use the PSMN computing facilities.
The \href{}{CCIN2P3} (Centre de Calcul de l’Institut national de physique nucléaire et de physique des particules) gives the biology community access to a percentage of its resources.
From the \href{}{PSMN} you can make a {\bf backup} of your data there.
You will need to contact