Commit adf6f46d, authored 5 years ago by Laurent Modolo
add README.md
parent 67523501
README.md: 536 additions, 0 deletions
Introduction
============
We spend an increasing amount of time building and using software.
However, most of us are never taught how to do this correctly and
efficiently. The resulting problems are multiple and easily avoidable.
This document summarizes a set of good practices in bioinformatics.
The [Project organization](#sec:project.organisation) section presents the organization of your working folder for a given bioinformatics project. The [Data Management](#sec:data.managment) section lists the resources available to manage and secure data in your project. The [Versioning](#sec:versioning) section presents the `git` code versioning system and some examples of how to use it. Finally, the [Coding](#sec:coding) section enumerates some rules to follow when you write code. These rules will ease the reproducibility of your analysis and collaborative development for your project. These good practices were compiled from different, often overlapping, sources listed in the References of this document.
Project organization[\[sec:project.organisation\]]{#sec:project.organisation label="sec:project.organisation"} (mandatory)
==========================================================================================================================
The first step of a bioinformatics project is to plan its structure. In this section we present a guide for organizing your project, which should cover the requirements of most bioinformatics projects. Following this structure will facilitate collaboration with other bioinformaticians in the LBMC, and with your future self. You are strongly encouraged to follow it and to enforce its policies in your team.

The project must have the following structure:

    project_name/
        bin/
        data/
        doc/
        results/
        src/
        tests/
        CITATION
        CONTRIBUTING
        README
        LICENSE
        todo.txt

You can get a template of this organization on the following `git` repository: [url\_to\_come/barebone.git](url_to_come/barebone.git).
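Until the template repository is available, the layout above can be created with a few lines of Python. This is only a minimal sketch: the function name `create_project` and the two constants are hypothetical, and the file list uses the `LICENSE` spelling of the corresponding heading in this document.

```python
from pathlib import Path

PROJECT_FOLDERS = ("bin", "data", "doc", "results", "src", "tests")
PROJECT_FILES = ("CITATION", "CONTRIBUTING", "README", "LICENSE", "todo.txt")


def create_project(root):
    """Create the folders and empty text files of the project layout."""
    root = Path(root)
    for folder in PROJECT_FOLDERS:
        # exist_ok keeps the function safe to run on an existing project
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in PROJECT_FILES:
        # touch() creates the file if it is missing and never erases its content
        (root / name).touch(exist_ok=True)
    return root
```

Calling `create_project("project_name")` twice is harmless, so the same function can also be used to complete a partially organized project.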
Text files at the root of the project directory
-----------------------------------------------
#### The `README`
file must contain information about your project, such as the project title, a short description and contact information for the person leading the project. You should also provide some examples of how to run tasks so that your work can be reproduced. This includes the dependencies that need to be installed.
#### The `CONTRIBUTING`
file points out to visitors ways they can help, the tests they can run
and the guidelines that the project adheres to.
#### The `LICENSE`
file must contain the license you wish your work to be published under. Lack of an explicit license implies that the author is keeping all rights and others are not allowed to re-use or modify the material.

For source code, the CEA, CNRS and Inria advise using the [CeCILL license](http://www.cecill.info/licences.en.html), which is an open French license. For documents you can use a [Creative Commons license](https://creativecommons.org/licenses/); an equivalent exists for data with the [Open Data Commons licenses](https://opendatacommons.org/).
#### The `CITATION`
file must contain information about how to cite the project as a whole, and where to find and how to cite any data sets, code, figures or documents in the project.

You can use a reputable DOI-issuing repository such as [figshare](https://figshare.com/), [datadryad](http://datadryad.org/) or [zenodo](https://zenodo.org/) to facilitate this step.
#### The `todo.txt`
file. If you don't use tools like *issues* in GitLab, you can maintain a to-do list with a clear description of its items, so they make sense to newcomers. This will also help you keep track of the progress of the work and of the timetable.
`data` folder
-------------
A general rule for data management is to have a single authoritative representation of every piece of data in the system.

The `data` folder must contain only the raw data for your project. No script may write to it (except the ones that fetch the data in the first place). This point is crucial for the reproducibility of your work: one must be able to go back to the first step of your analysis and replay it step by step. Another advantage of keeping the raw data untampered is that it gives you more freedom to experiment with your analysis pipeline, without side effects between the different strategies.

Data files in this folder, and in general, should carry some metadata such as a time stamp and a few biologically meaningful keywords. We advise you to use the following naming convention: `20xx-12-31-informative_name.file`. An informative file name can, for example, combine the species, lineage, replicate and sequencing technology. With this format, sorting the files by name also sorts them by date, and the most important metadata is kept in the file name. When possible, use open file formats, which are easier to handle with standard tools and help to promote open science.
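The naming convention can be applied by a small helper, so that every script builds file names the same way. The sketch below is one possible implementation; the function name `dated_file_name`, the lower-casing of keywords and the underscore separator are assumptions, not rules of this guide.

```python
from datetime import date


def dated_file_name(day, keywords, extension):
    """Build a name such as 2017-12-31-mouse_rep1_illumina.fastq."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # ISO dates (YYYY-MM-DD) sort chronologically when sorted as text
    informative_name = "_".join(word.lower() for word in keywords)
    return f"{day.isoformat()}-{informative_name}.{extension}"
```

For example, `dated_file_name(date(2017, 12, 31), ["mouse", "rep1", "illumina"], "fastq")` returns `2017-12-31-mouse_rep1_illumina.fastq`.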
When writing scripts or code, it is important to be able to test them. You can create a `data/examples` folder that contains small toy data sets to test your scripts or software as described in the `README` file. In addition to letting others validate your work, it also enables you to check that new modifications didn't break anything, hence saving you a lot of time (see the [Coding](#sec:coding) section for more information on testing).
`src` folder
------------
If you are developing a new tool, its source code must be in `src`. If you are developing or using an analysis pipeline, we advise you to put the functional part of your code in a `src/func` folder, and the pipelines or scripts that contain the commands calling those functions in a `src/pipe` folder. If you are developing a web tool, you can create the `src/model`, `src/view` and `src/controler` folders if you use the MVC structure.

If you are using online tools, documenting every link used and the value of every field filled in can be tedious. Instead, you have access to a palette of commands like `wget` or `curl` to automate your requests. Most online tools also provide APIs (and associated documentation) that facilitate command line interaction with them.
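The same kind of request can be scripted from a pipeline written in Python. The following sketch only uses the standard library; the function name `fetch` is hypothetical, and the URL and destination are whatever your project needs.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def fetch(url, destination):
    """Download url to destination and return the destination path."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    # stream the response to disk instead of loading it all in memory
    with urlopen(url) as response, open(destination, "wb") as output:
        shutil.copyfileobj(response, output)
    return destination
```

Keeping such a call in a script under `src` records both the address used and the place where the file was saved.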
The goal of this folder is, first, to let the computer do the work and, second, to automate every step of your analysis. By saving commands in a file, it is easier to re-use them and to build tools that automate workflows. This means that others will be able to run the same analysis on their own data, that you will have a publication-ready pipeline at the end of your project, and that your work can easily be integrated into another project in the future.
`doc` folder
------------
You like to write stuff? Put it in the `doc` folder. Even if you don't like to write, write anyway about what you are doing and put it in the `doc` folder. The `doc` folder contains the documents associated with the project. This includes files for the manuscripts and the documentation of your source code. Thorough documentation adds huge value to your project, as others will find it easier to understand and reuse your work, and to cite it instead of starting something new.

We advise you to keep an electronic lab notebook in a `doc/reports` subfolder to track your experiments and to describe your workflow. This notebook can easily be generated using tools like `knitr` or `Sweave`. Those tools can call code or functions from the `src/func` folder to compute results and generate figures. You can use a Makefile to automate the generation of documents in the `doc` folder. You can learn about Makefiles [here](http://bellet.info/creatis/unix/makefile.html).

This guide should be placed in the `doc` folder.
`results` folder
----------------
Every generated result or temporary file must go to the `results` folder. This also means that the entirety of the `results` folder can be regenerated from the `data`, `bin` and `src` folders. If this is not the case for a given result file, delete it and write the necessary code in `src` to regenerate it.

We advise you to use the same naming convention in the `results` folder as in the `data` folder. It is easy to load a file whose name has a variable part (date and time) by using wildcard characters like "`*`". Adding time stamps to your result files will help you track down errors in your analysis.
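As an illustration of the wildcard, the sketch below retrieves the most recent version of a dated result file. The function name `latest_result` is hypothetical, and the code assumes that file names start with an ISO date as in the naming convention of the `data` folder.

```python
from pathlib import Path


def latest_result(results_dir, informative_name):
    """Return the most recent file named <date>-<informative_name>, or None."""
    # "*" stands for the variable date part of the file name
    matches = sorted(Path(results_dir).glob(f"*-{informative_name}"))
    # ISO-dated names sort chronologically, so the last match is the newest
    return matches[-1] if matches else None
```

Returning `None` when nothing matches lets the calling script decide whether the result has to be regenerated.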
Even if we don't enforce a backup policy for the `results` folder, keep in mind that computation time is not free and that days or weeks of computations, even if easily reproducible (by following the guidelines of this document), are valuable. Moreover, keeping intermediate files so that you can restart an analysis at any point can save you a lot of time. It's up to you to discriminate between valuable final or intermediate results that could ease the review of your work and temporary files that only consume space. You can use a `results/tmp` folder to make this distinction.
`bin` folder
------------
The `bin` folder, which historically contains compiled binary files, must also contain third-party scripts and software. You should be able to fill this folder from the information contained in the dependencies section of the `README` file. The binaries compiled from your work can be recompiled, and the third-party material can be retrieved from the internet or other sources. This folder can also be filled automatically, if necessary, by executing the content of the `src` folder.
`tests` folder
--------------
The `tests` folder must contain a list of test files that can be executed to test your code. This is explained in more detail in the [Coding](#sec:coding) section, under test-driven development.
Data Management[\[sec:data.managment\]]{#sec:data.managment label="sec:data.managment"}
=======================================================================================
In this section we present some rules to manage your project data. Given the size of current NGS data sets, you must find a balance between securing the data for your project and avoiding the needless replication of gigabytes of data.

The `data` folder paragraph of the [Project organization](#sec:project.organisation) section focuses on data management within your project and on the naming convention for your data files. In this section, we focus on replicating your data across multiple sites in order to secure them. Your code and documentation are also valuable sets of data. The details of code and documentation management within your project are developed in the `src` and `doc` paragraphs of the [Project organization](#sec:project.organisation) section. In this section we also present advice on keeping them safe.

From the time spent obtaining the material to the cost of the reagents and of the sequencing, your data are precious. Moreover, for reproducibility, you should always keep a raw version of your data to go back to. These two points mean that you must back up your raw data as soon as possible (the external hard drive or thumb drive on which you received them doesn't count). When you receive data, it is also important to document them. Write a simple `description.txt` file in the same folder that describes your data and how they were generated. These metadata are important to archive and index your data. There are numerous conventions for metadata terms that you can follow, like the [Dublin Core](http://dublincore.org/documents/dcmi-terms/). Metadata will also be useful to the people who reuse your data (in meta-analyses, for example) and need to cite them.
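A `description.txt` file can be written by hand, but generating it from a script makes it harder to forget a field. The sketch below writes one `term: value` line per metadata term. The function name `write_description`, the file layout and the choice of required terms are assumptions of this example; the term names (`title`, `creator`, `date`, `description`) are borrowed from the Dublin Core vocabulary.

```python
from pathlib import Path


def write_description(data_dir, metadata):
    """Write a description.txt made of "term: value" lines in data_dir."""
    # hypothetical minimal set of terms, named after Dublin Core terms
    required = ("title", "creator", "date", "description")
    missing = [term for term in required if not metadata.get(term)]
    if missing:
        raise ValueError(f"missing metadata terms: {', '.join(missing)}")
    lines = [f"{term}: {value}" for term, value in metadata.items()]
    description_file = Path(data_dir) / "description.txt"
    description_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return description_file
```

Refusing to write an incomplete description is a deliberate choice here: a missing field is much easier to fill in on the day the data arrive than months later.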
Public archives (mandatory)
---------------------------
Public archives like [EBI](https://www.ebi.ac.uk/submission/) (EU) or [NCBI](https://www.ncbi.nlm.nih.gov/home/submit-wizard/) (USA) are free to use for academic purposes. Public archives offer an embargo period during which your dataset stays private. Therefore, you should use them as soon as you get your raw data.

- Once a dataset is archived, it will never be deleted, so it's the easiest way to safeguard your data.
- These archives support a wide array of data types, so yours should fit.
- The embargo can be extended as far as you want, so you can take your time to publish.
- You will get a reminder when the end of the embargo is near, so you won't inadvertently make your precious data public.

As soon as you obtain your data and fill the `data` folder, you should deposit your data on [EBI](https://www.ebi.ac.uk/submission/) (EU) or [NCBI](https://www.ncbi.nlm.nih.gov/home/submit-wizard/) (USA).
[PSMN](http://www.ens-lyon.fr/PSMN/):
-------------------------------------
The PSMN (Pôle Scientifique de Modélisation Numérique) is the preferred high-performance computing (HPC) center that the LBMC has access to. The LBMC has a storage volume in the PSMN facilities, accessible, once connected, from the `/Xnfs/site/lbmcdb/` path.

A second copy of the raw data can be placed in your PSMN team folder `/Xnfs/site/lbmcdb/team_name`. You can contact [Helene Polveche](mailto:helene.polveche@ens-lyon.fr) or [Laurent Modolo](mailto:laurent.modolo@ens-lyon.fr) if you need help with this procedure. This will also facilitate access to your data for the people working on your project if they use the PSMN computing facilities.

The above solutions are not mutually exclusive and are designed to host large volumes of data.
Code safety
-----------
Most human work in bioinformatics results in lines of code or text. Such data are important yet usually small, so they should be copied to other places as often as possible.

When using a version control system (see the [Versioning](#sec:versioning) section), making regular pushes to the LBMC GitLab server will not only save you time when dealing with different versions of your project but also keep a copy of your code on the server. You can also make instantaneous or daily backups in your home directory at the PSMN. Within the laboratory you can also use the Silexe server or the future storage server of the biology department. Finally, the CNRS provides a synchronization service called [MyCore](https://mycore.cnrs.fr/) to synchronize folders on their servers.
Versioning[\[sec:versioning\]]{#sec:versioning label="sec:versioning"} (mandatory)
==================================================================================
Biologists keep their lab journal up to date so their future self or other people can check on and reproduce their work. In bioinformatics, versioning can be seen as a bioinformatic journal where you can comment on the addition of new functions to your project. This also means that you can go back to any point of this journal to revert your code to an earlier state.

Moreover, where a lab journal is linear, you can start new paths (branches) to try new ideas and test new features. Your main working branch will be left undisturbed. With versioning software, you can even make progress on different branches at the same time. Successful branches can then be merged back into the main branch to include new working and tested functionalities.

The strength of a code versioning system is that it does all of the above transparently. You don't have to keep different versions of your files; that is the versioning software's job. When you go to another branch or time point, your working directory is changed to match the state of the files at that point. If you jump back, the files are restored to the state you came from.

The flexibility of the version control software to jump to a given time point of your project relies on the granularity of those time points. Therefore, you should try to make incremental changes to your project and record them with the version control software as often as possible. This will also help you comply with the recommendations of the [Coding](#sec:coding) section.
Installing `git`
----------------
We chose `git` as the version control software to use at the LBMC. `git` can easily be installed on most operating systems with the following instructions.

On Linux you can type:

    # on debian/ubuntu (as root or with sudo)
    apt-get install git
    # on redhat/centOS (as root or with sudo)
    yum install git

To install `git` on macOS with [homebrew](https://brew.sh/), you can type:

    brew install git

When using `git` for the first time, you need to give it your identity so it can attribute your entries (commits) to you. To do that you can use the commands:

    git config --global user.name "first_name last_name"
    git config --global user.email first_name.last_name@ens-lyon.fr
Using git
---------
To start recording your bioinformatic journal, simply go to your project directory and use the command:

    git init

Then you can record the state of a given file or a list of files with the commands:

    git add file_a
    git add file_b
    git commit -m "creation of file a and b"

Each new commit creates a new entry in your bioinformatic journal with the current state of your project. If you missed something or made an error, you can easily amend the last commit with:

    # git rm also deletes file_b from your working directory
    git rm file_b
    git add file_c
    git commit --amend

This will open your favorite text editor to let you edit the commit message and amend it. At any time, to see the status of your repository, you can use the command:

    git status
One strength of `git` is its decentralized structure. This means that you can keep your own journal on your computer without the need to push your changes to a central repository. This also means that there are powerful tools in `git` to merge differences between different repositories of the same project. To facilitate collaborative work (like your superior checking on your progress), you can use a central shared repository. One such instance is available at [url\_to\_comes](url_to_comes) for the LBMC.

To push your local repository to the LBMC GitLab server you can use these commands:

    git remote add origin url_to_comes
    git push -u origin master

The full documentation of every command and possibility of `git` is well beyond the scope of this document. However, you can find complete and well-written documentation on the website [git-scm.com](https://git-scm.com/book/en/v2). There is also a huge community around `git`, so most of your problems with it should find their answer online or in the LBMC. Also, don't forget to attend the `git` training organized in the LBMC!
Coding[\[sec:coding\]]{#sec:coding label="sec:coding"}
======================================================
In this section we introduce some concepts and rules to follow and implement. The goal of this section is to help you write better code and scripts in your projects, with validated and reproducible results.
Write programs for people, not computers
----------------------------------------
The first goal to follow in your project is to write code for people and not for computers. We are limited, and there is only so much information that we can keep in mind at the same time. Thus a program should not require its readers to hold more than a handful of facts in memory at once.

This means that you should use very simple control flow constructs. Split logical units of code into functions. No function should exceed about 60 lines of code, with one line per statement and one line per declaration. A function is a logical unit that is understandable and verifiable as a unit. Simple control flows are easier to verify and result in improved code clarity.

Keep the number of parameters of your functions small. It will be easier to track them throughout the function, and it will help you debug and test it. Also try to keep the number of in-memory objects modified by a function small. If you need to modify lots of items, write more functions. The validity of parameters must be checked inside each function. Also check the return values of the functions you call.
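As an illustration of these two checks, here is a minimal sketch with a short function that validates its single parameter and a caller that verifies the returned value. The function names `gc_content` and `describe_gc`, and the set of accepted bases, are hypothetical choices made for this example.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    # check the validity of the parameter inside the function
    if not isinstance(sequence, str):
        raise TypeError("sequence must be a string")
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    unknown_bases = set(sequence) - set("ACGTN")
    if unknown_bases:
        raise ValueError(f"unexpected bases: {sorted(unknown_bases)}")
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def describe_gc(sequence):
    """Return a short label for the GC content of a sequence."""
    fraction = gc_content(sequence)
    # check the returned value before using it
    if not 0.0 <= fraction <= 1.0:
        raise RuntimeError(f"GC fraction out of range: {fraction}")
    return "GC-rich" if fraction > 0.5 else "not GC-rich"
```

Both functions take one parameter and modify nothing outside their own scope, which keeps them easy to read and to test.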
Don't keep more than one copy of a block of code in your project. If you need the same lines of code at different points of your program, turn them into a function and call it. This keeps your code small and avoids the problem of maintaining different versions of the same code. If you are using an object-oriented language, use inheritance.
Apply a naming convention
-------------------------
Define a naming convention for your variables, functions, objects, and files at the start of your project and keep to it. We advise you to use lower-case characters separated by underscores for variable and function names. Object template (class) names should start with an upper-case character to differentiate them from functions. We also advise you to configure your editor to use two space characters instead of the tabulation character (called soft tabs).
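A short sketch of what these naming rules look like in practice is given below. All names (`ReadCounter`, `add_read`, `total_reads`) are hypothetical; the indentation follows the usual Python tooling defaults rather than any rule of this guide.

```python
class ReadCounter:
    """Count reads per sample (class names start with an upper-case letter)."""

    def __init__(self):
        self.read_counts = {}

    def add_read(self, sample_name):
        """Record one read for sample_name."""
        self.read_counts[sample_name] = self.read_counts.get(sample_name, 0) + 1


def total_reads(read_counter):
    """Sum the reads over all samples (function names are lower-case)."""
    return sum(read_counter.read_counts.values())
```

With this convention, `ReadCounter()` is immediately recognizable as the creation of an object and `total_reads(...)` as a function call.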
A brief summary of those rules should be written in the `CONTRIBUTING` file of your project. This kind of naming convention will allow you to use informative variable and function names. Informative names clarify your code for you and others. This encourages collaborative work and helps you debug or refactor your code.

Modern editors can also provide add-ons that automatically check your code to see if it follows the coding conventions that are widely recognized for a language. Those add-ons call upon software like the `lintr` package for `R`, `g++` for `C/C++` or `pep8` for `python`. Another advantage of using these tools is that they help keep your code valid as the language evolves.
Iterative development and continuous integration
------------------------------------------------
Aim for a short development cycle from the design of your code to its testing. The design will be simpler, and your code will evolve through the addition of small new functional units.

With a short development cycle, the addition of new functionalities or improvements results in small independent changes to your code. Those small changes are easier to track with a version control system and can be published daily or several times a day. To enforce this policy, you should try to make incremental changes to your project. This means working in small steps with frequent feedback and course correction rather than trying to plan months of work in advance.

To achieve this development rhythm you need to apply another rule: don't optimize prematurely. Write code that is simple, clear and works. Keep in mind that source code can be rewritten at any time. You can later rewrite the suboptimal sections of your code. If you followed the previous points and the following section on testing, this will involve small changes with minimal side effects. You can make those changes in a new branch while keeping a working (if suboptimal) main branch.
Test-driven development
-----------------------
The value of your code resides in its number of working lines and not in its total number of lines. Thus each modification must be tested. The easiest way to do that is to design and write the tests before the new functionalities. Instead of writing your own test framework (which would itself need to be tested), use testing libraries like the [`testthat` package for `R`](http://r-pkgs.had.co.nz/tests.html) or the [`unittest`](https://docs.python.org/3/library/unittest.html) module for python to facilitate the addition of tests to your code. Test-driven development will also provide you with a complete set of tests to check for side effects in the unmodified parts of your code after a modification.

There are different kinds of tests that you can use, like unit tests or integration tests.

Unit tests are simple tests that check a functionality or a module. When you write a function, you first write one or more unit tests that check whether the return value of your function corresponds to what you expect. Then you can write your function and test it. Unit tests are beneficial at many points of code development:

- Before: they force you to detail your code requirements.
- Writing: they keep you from over-coding; when all the test cases pass, you are done.
- Refactoring: everything keeps working when you improve your code.
- Maintaining: instead of reconsidering everything you can just say: "no sir, the code still passes all our tests".
- Working with others: you can test that the additions from your work don't break other developers' tests.
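The unit-test workflow described above can be sketched with the `unittest` module. In this example the test class states the requirements and would be written first; the function `reverse_complement` is a hypothetical piece of code under test, shown here in its final state.

```python
import unittest


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    try:
        return "".join(complement[base] for base in reversed(sequence.upper()))
    except KeyError as error:
        raise ValueError(f"unexpected base: {error.args[0]}") from None


class TestReverseComplement(unittest.TestCase):
    """Unit tests written from the requirements, before the function body."""

    def test_known_sequence(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")

    def test_lower_case_input(self):
        self.assertEqual(reverse_complement("aacg"), "CGTT")

    def test_empty_sequence(self):
        self.assertEqual(reverse_complement(""), "")

    def test_invalid_base(self):
        with self.assertRaises(ValueError):
            reverse_complement("ATXG")
```

If the test class is saved in a file of the `tests` folder whose name starts with `test_`, all the tests of the project can be run with `python -m unittest discover -s tests`.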
Integration tests are one level of complexity above unit tests. They check the assembly of the elementary components of your code. Integration tests can be used with the content of your `data/examples` folder to check, after each step of your pipeline, that you get the expected results.
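An integration test can be sketched in the same way. The two pipeline steps below (`filter_short_reads` and `count_reads`) and the toy file are hypothetical; the temporary file stands in for a small data set that would normally live in `data/examples`.

```python
import tempfile
import unittest
from pathlib import Path


def filter_short_reads(reads, min_length):
    """Pipeline step 1: keep the reads of at least min_length bases."""
    return [read for read in reads if len(read) >= min_length]


def count_reads(reads):
    """Pipeline step 2: count the occurrences of each read."""
    counts = {}
    for read in reads:
        counts[read] = counts.get(read, 0) + 1
    return counts


def run_pipeline(input_file, min_length):
    """Chain the steps on a file containing one read per line."""
    reads = Path(input_file).read_text().split()
    return count_reads(filter_short_reads(reads, min_length))


class TestPipeline(unittest.TestCase):
    """Integration test: run the assembled steps on a toy data set."""

    def test_toy_reads(self):
        with tempfile.TemporaryDirectory() as examples_dir:
            # stand-in for a small file kept in data/examples
            toy_file = Path(examples_dir) / "toy_reads.txt"
            toy_file.write_text("ACGT\nAC\nACGT\nGGGTT\n")
            self.assertEqual(
                run_pipeline(toy_file, min_length=4),
                {"ACGT": 2, "GGGTT": 1},
            )
```

Each step keeps its own unit tests; this test only checks that the steps still fit together and give the expected result on known input.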
[\[sec:bibliography\]]{#sec:bibliography label="sec:bibliography"}