Introduction
============

We spend an increasing amount of time building and using software.
However, most of us are never taught how to do this correctly and
efficiently, and the resulting problems are numerous and easily avoidable.
This document summarizes a set of good practices for bioinformatics.
The *Project organization* section presents how to organize the working
folder of a given bioinformatics project. The *Data management* section
lists the resources available to manage and secure the data in your
project. The *Versioning* section presents the `git` version control
system and some examples of how to use it. Finally, the *Coding* section
lists some rules to follow when you write code. These rules will ease the
reproducibility of your analyses and collaborative development in your
project. These good practices were compiled from different, often
overlapping, sources listed in the References at the end of this document.

Project organization (mandatory)
================================

The first step of a bioinformatics project is to plan the structure of
the project. Following this structure will facilitate collaboration with
other bioinformaticians in the LBMC, or with your future self. This
section presents a guide for your project organization that should cover
the requirements of most bioinformatics projects. You are strongly
encouraged to follow it and to enforce its policies in your team.

The project must have the following structure:

    project_name/
      bin/
      data/
      doc/
      results/
      src/
      tests/
      CITATION
      CONTRIBUTING
      README
      LICENSE
      todo.txt

You can get a template of this organization from the following `git`
repository: `url_to_come/barebone.git`.

Text files at the root of the project directory
-----------------------------------------------

#### The `README`

This file must contain general information on your project, such as the
project title, a short description and contact information for the
project leader. You should also provide some examples of how to run tasks
so that others can reproduce your work, including the dependencies that
need to be installed.

#### The `CONTRIBUTING`

This file points out to visitors the ways they can help, the tests they
can run and the guidelines the project adheres to.

#### The `LICENSE`

This file must contain the license under which you wish your work to be
published. The lack of an explicit license implies that the author keeps
all rights and that others are not allowed to reuse or modify the
material.

For source code, the CEA, CNRS and Inria advise using the [CeCILL
license](http://www.cecill.info/licences.en.html), an open French
license.
For documents, you can use a [Creative Commons
license](https://creativecommons.org/licenses/); an equivalent exists for
data with the [Open Data Commons
licenses](https://opendatacommons.org/).

#### The `CITATION`

This file must contain information about how to cite the project as a
whole, and where to find and how to cite any data sets, code, figures or
documents in the project.

You can use a reputable DOI-issuing repository such as
[figshare](https://figshare.com/), [datadryad](http://datadryad.org/) or
[zenodo](https://zenodo.org/) to facilitate this step.

#### The `todo.txt`

If you don't use tools like *issues* in GitLab, you can maintain in this
file a to-do list with a clear description of its items, so they make
sense to newcomers. This will also help you keep track of the work
progress and timetable.

`data` folder
-------------

A general rule for data management is to have a single authoritative
representation of every piece of data in the system.

The `data` folder must contain only the raw data of your project. No
script must write into it (except the ones used to get the data in the
first place). This point is crucial for the reproducibility of your work:
one must be able to go back to the first step of your analysis and replay
it step by step. Another advantage of keeping the raw data untouched is
that it gives you more freedom to experiment with your analysis pipeline,
without side effects between the different strategies.

Data files in this folder, and in general, should carry some metadata in
their names, like a time stamp and a few biologically meaningful
keywords. We advise you to use the following naming convention:
`20xx-12-31-informative_name.file`. An informative file name can, for
example, combine the species, lineage, replicate and sequencing
technology. With this format, sorting the files by name will also sort
them by date, and the most important metadata will be kept in the file
name. When possible, use open file formats: they are easier to handle
with standard tools and help promote open science.

When writing scripts or code, it's important to be able to test them. You
can create a `data/examples` folder that contains small toy data sets to
test your scripts or software, as described in the `README` file. In
addition to giving others the possibility to validate your work, it also
enables you to check that new modifications didn't break anything, hence
saving you a lot of time (see the *Coding* section for more information
on testing).

`src` folder
------------

If you are developing a new tool, its source code must be in `src`. If
you are developing or using an analysis pipeline, we advise you to put
the functional part of your code in a `src/func` folder, and the pipeline
or scripts that call those functions in a `src/pipe` folder. If you are
developing a web tool, you can create `src/model`, `src/view` and
`src/controller` folders if you use the MVC structure.

If you are using online tools, documenting every link used and the value
of every field filled in can be tedious. Instead, you have access to
commands like `wget` or `curl` to automate your requests. Most online
tools also provide APIs (and associated documentation) that facilitate
command-line interaction with them.
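For example, a gene annotation that you would otherwise look up in a web
form can be retrieved from the command line; a minimal sketch, assuming
the [Ensembl REST API](https://rest.ensembl.org/) (adapt the URL to the
service you actually use):

    # query an online resource from the command line instead of a web form,
    # and store the answer following the naming convention of this guide
    curl -s 'https://rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA2?content-type=application/json' \
        > "results/$(date +%F)-brca2_lookup.json"
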
The goal of this folder is, first, to let the computer do the work and,
second, to automate every step of your analysis. By saving commands in a
file, it's easier to reuse them and to build tools to automate workflows.
This means that others will be able to run the same analysis on their own
data, that you will have a publication-ready pipeline at the end of your
project, and that your work can easily be integrated into other projects
in the future.

`doc` folder
------------

Do you like to write things down? Put them in the `doc` folder. Even if
you don't like to write, write about what you are doing anyway and put it
in the `doc` folder. The `doc` folder contains the documents associated
with the project. This includes files for the manuscripts and the
documentation of your source code. Thorough documentation adds huge value
to your project, as others will find it easier to understand and reuse
your work, and to cite it instead of starting something new.

We advise you to keep an electronic lab notebook in a `doc/reports`
subfolder to track your experiments and describe your workflow. This
notebook can easily be generated using tools like `knitr` or `Sweave`.
Those tools can call code or functions from the `src/func` folder to
compute results and generate figures. You can use a Makefile to automate
the generation of the documents in the `doc` folder. You can learn about
Makefiles [here](http://bellet.info/creatis/unix/makefile.html).

This guide should be placed in the `doc` folder.

`results` folder
----------------

Every generated result or temporary file must go to the `results` folder.
This also means that the entire `results` folder can be regenerated from
the `data`, `bin` and `src` folders. If this is not the case for a given
result file, delete it and write the necessary code in `src` to
regenerate it.

We advise you to use the same naming convention in the `results` folder
as in the `data` folder. Files whose names have a variable part (date and
time) are easy to load with wildcard characters like `*`. Adding time
stamps to your result files will also help you track down errors in your
analysis.

Even if we don't enforce a backup policy for the `results` folder, keep
in mind that computation time is not free and that days or weeks of
computation, even if easily reproducible (by following the guidelines of
this document), are valuable. Moreover, keeping intermediate files to be
able to restart an analysis at any point can save you a lot of time. It's
up to you to discriminate between valuable final or intermediate results
that could ease the reviewing process of your work, and temporary files
that only consume space. You can use a `results/tmp` folder to make this
distinction.

`bin` folder
------------

The `bin` folder, which historically contains compiled binary files, must
also contain third-party scripts and software. You should be able to fill
this folder from the information contained in the dependencies section of
the `README` file: the compiled files of your own work can be recompiled,
and the third-party material can be retrieved again from the internet or
other sources. This folder can also be filled automatically, if
necessary, by executing the content of the `src` folder.

`tests` folder
--------------

The `tests` folder must contain the test files that can be executed to
test your code. This is explained in more detail in the *Coding* section
on test-driven development.
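For example, if the tests are written with one of the standard frameworks
mentioned in the *Coding* section, they can be launched from the project
root with commands such as these (a sketch; adapt to the language you
use):

    # run R tests written with the testthat package
    Rscript -e 'testthat::test_dir("tests")'
    # or run Python tests written with the unittest module
    python3 -m unittest discover -s tests
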
Data management
===============

In this section we present some rules to manage the data of your project.
Given the size of current NGS data sets, one must find a balance between
securing the data of the project and avoiding the needless replication of
gigabytes of data.

The `data` folder paragraph of the *Project organization* section focuses
on data management within your project and on the naming conventions for
your data files. Here, we focus on replicating your data on multiple
sites in order to secure them. Your code and documentation are also
valuable sets of data; their management within your project is covered in
the `src` and `doc` paragraphs of the *Project organization* section, and
advice on keeping them safe is given below.

From the time spent to get the material to the cost of the reagents and
of the sequencing, your data are precious. Moreover, for reproducibility
concerns, you should always keep a raw version of your data to go back
to. These two points mean that you must make a backup of your raw data as
soon as possible (the external hard drive or thumb drive on which you
received them doesn't count). When you receive data, it's also important
to document them: write a simple `description.txt` file in the same
folder that describes your data and how they were generated. These
metadata are important to archive and index your data. There are numerous
conventions for metadata terms that you can follow, like the [Dublin
Core](http://dublincore.org/documents/dcmi-terms/). Metadata will also be
useful for the people who are going to reuse your data (in meta-analyses,
for example) and to cite them.

Public archives (mandatory)
---------------------------

Public archives like [EBI](https://www.ebi.ac.uk/submission/) (EU) or
[NCBI](https://www.ncbi.nlm.nih.gov/home/submit-wizard/) (USA) are free
to use for academic purposes. Public archives offer an embargo period
during which your data set stays private. Therefore, you should use them
as soon as you get your raw data:

- Once a data set is archived, it will never be deleted, so it's the
  easiest way to safeguard your data.

- These archives support a wide array of data types, so yours should fit
  in.

- The embargo can be extended for as long as you want, so you can take
  your time to publish.

- You will get a reminder when the end of the embargo is near, so you
  won't inadvertently make your precious data public.

As soon as you obtain your data and fill the `data` folder, you should
deposit your data at [EBI](https://www.ebi.ac.uk/submission/) (EU) or
[NCBI](https://www.ncbi.nlm.nih.gov/home/submit-wizard/) (USA).

[PSMN](http://www.ens-lyon.fr/PSMN/)
------------------------------------

The PSMN (Pôle Scientifique de Modélisation Numérique) is the main
high-performance computing (HPC) center the LBMC has access to. The LBMC
has access to a storage volume in the PSMN facilities, reachable, once
connected, under the `/Xnfs/site/lbmcdb/` path.

A second copy of the raw data can be placed in your PSMN team folder
`/Xnfs/site/lbmcdb/team_name`.
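For example, from a machine that can reach the PSMN storage, a sketch
could look like this (`your_login`, `psmn_host` and the project path are
placeholders to adapt):

    # copy the raw data to the team storage, without deleting anything on the remote side
    rsync -av data/ your_login@psmn_host:/Xnfs/site/lbmcdb/team_name/project_name/data/
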
You can contact [Helene
Polveche](mailto:helene.polveche@ens-lyon.fr) or [Laurent
Modolo](mailto:laurent.modolo@ens-lyon.fr) if you need help with this
procedure. This will also facilitate access to your data for the people
working on your project if they use the PSMN computing facilities.

These solutions are not mutually exclusive, and all of them are designed
to host large volumes of data.

Code safety
-----------

Most bioinformatic work results in the production of lines of code or
text. While important, such data are often quite small and should be
copied to other places as often as possible.

When using a version control system (see the *Versioning* section),
pushing regularly to the LBMC GitLab server will not only save you time
when dealing with the different versions of your project, but also save a
copy of your code on the server. You can also make instantaneous or daily
backups in your home directory at the PSMN. Within the laboratory, you
can also use the Silexe server or the future storage server of the
biology department. Finally, the CNRS provides a synchronization service
called [MyCore](https://mycore.cnrs.fr/) to synchronize folders on its
servers.

Versioning (mandatory)
======================

Biologists keep their lab journal up to date so that their future self or
other people can check on and reproduce their work. In bioinformatics,
versioning can be seen as a journal in which you comment on the addition
of new functionalities to your project. This also means that you can go
back to any point of this journal and revert your code to an earlier
state.

Moreover, where a lab journal is linear, you can start new paths
(branches) to try new ideas and test new features while your main working
branch is left undisturbed (see the example below). With version control
software, you can even make progress on different branches at the same
time. Successful branches can then be merged back into the main branch to
include new, working and tested, functionalities.

The strength of a version control system is to do all of the above
transparently. You don't have to keep different versions of your files;
that's the versioning software's job. By moving to another branch or time
point, your working directory is changed to match the state of the files
at that point. If you jump back, the files are changed back to the state
they had where you came from.

The flexibility with which the version control software can jump to a
given time point of your project depends on the granularity of those time
points. Therefore, you should try to make incremental changes to your
project and record them with the version control software as often as
possible. This will also help you comply with the recommendations of the
*Coding* section.
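
As a preview of the commands detailed in the next subsections, a
branch-and-merge cycle as described above could look like this sketch
(the branch and file names are only examples):

    # start a new branch to try an idea without disturbing the main branch
    git checkout -b normalization_test
    # work on this branch, then record the changes
    git add src/func/normalization.R
    git commit -m "try a new normalization strategy"
    # once the branch works and is tested, merge it back into the main branch
    git checkout master
    git merge normalization_test
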
Installing `git`
----------------

We chose `git` as the version control software to use at the LBMC. `git`
can easily be installed on most operating systems with the following
instructions.

On Linux you can type:

    # on debian/ubuntu
    apt-get install git
    # on redhat/centOS
    yum install git

To install `git` on macOS with [homebrew](https://brew.sh/), you can
type:

    brew install git

When using `git` for the first time, you need to give it your identity so
it can sign your entries (commits). To do that you can use the commands:

    git config --global user.name "first_name last_name"
    git config --global user.email first_name.last_name@ens-lyon.fr

Using git
---------

To start recording your bioinformatic journal, you simply need to place
yourself in your project directory and use the command:

    git init

Then you can record the status of a given file or list of files with the
commands:

    git add file_a
    git add file_b
    git commit -m "creation of file a and b"

Each new commit creates a new entry in your bioinformatic journal with
the current status of your project. If you missed something or made an
error, you can easily amend the last commit with:

    # remove file_b from the commit but keep it on disk
    git rm --cached file_b
    git add file_c
    git commit --amend

This will open your favorite text editor to let you edit the commit
message and amend it. At any time, to see the status of your repository,
you can use the command:

    git status

One strength of `git` is its decentralized structure. This means that you
can keep your own journal on your computer without the need to push your
changes to a central repository. It also means that `git` provides
powerful tools to merge differences between different repositories of the
same project. To facilitate collaborative work (like your supervisor
checking on your progress), you can use a central shared repository. One
such instance is available at `url_to_comes` for the LBMC.

To push your local repository to the LBMC GitLab server you can use these
commands:

    git remote add origin url_to_comes
    git push -u origin master

The full documentation of every `git` command and possibility is well
beyond the scope of this document. However, you can find complete and
well-written documentation on the website
[git-scm.com](https://git-scm.com/book/en/v2). There is also a huge
community around `git`, so most of your problems with it should already
have an answer online or in the LBMC. Also, don't forget to attend the
`git` training sessions organized at the LBMC!

Coding
======

In this section we introduce some concepts and rules to follow and
implement. The goal of this section is to help you write better code and
scripts in your projects, with validated and reproducible results.

Write programs for people, not computers
----------------------------------------

The first goal to follow in your project is to write code for people, not
for computers. We are limited, and there is only so much information that
we can keep in mind at the same time. Thus, a program should not require
its readers to hold more than a handful of facts in memory at once.

This means that you should use very simple control flow constructs and
split logical units of code into functions. No function should exceed
about 60 lines of code, with one line per statement and one line per
declaration. A function is a logical unit that is understandable and
verifiable as a unit. Simple control flows are easier to verify and
result in improved code clarity.
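As an illustration, a minimal sketch of such a small, verifiable unit,
written here as a shell function (`samtools` and the file names are only
examples):

    # count the reads mapped in a BAM file: one small task, one small function
    count_mapped_reads() {
        local bam="$1"
        # check the validity of the parameter before doing any work
        [ -f "$bam" ] || { echo "missing file: $bam" >&2; return 1; }
        samtools view -c -F 4 "$bam"
    }

    # check the return value when calling the function
    count_mapped_reads "results/2018-01-15-sample1.bam" || echo "counting failed" >&2
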
Keep the number of parameters of your functions small: it will be easier
to track them throughout the function, and it will also help you when
debugging and testing it. Also try to keep the number of objects modified
by your function small; if you need to modify lots of items, write more
functions. The validity of the parameters must be checked inside each
function, and you should also check the return values of your functions.

Don't keep more than one copy of a block of code in your project. If you
need the same block of lines at different points of your program, turn it
into a function and call it. This will keep your code small and avoid the
problem of maintaining different versions of the same code. If you are
using an object-oriented language, use inheritance.

Apply a naming convention
-------------------------

Define a naming convention for your variables, functions, objects and
files at the start of your project, and keep to it. We advise you to use
lower-case characters separated by underscores for variable and function
names. Object template names should start with an upper-case character to
differentiate them from functions. We also advise you to configure your
editor to use two space characters instead of the tabulation character
(so-called soft tabs).

A brief summary of those rules should be written in the `CONTRIBUTING`
file of your project. Such a naming convention will allow you to use
informative variable and function names. Informative names will clarify
your code for you and others, which encourages collaborative work and
helps when debugging or refactoring your code.

Modern editors also provide add-ons that automatically check your code
syntax against coding conventions that are widely recognized for a
language. Those add-ons call upon software like the `lintr` package for
`R`, `g++` for `C/C++` or `pep8` for `python`. Another advantage of using
these tools is to ensure that your code will remain valid as the language
evolves.

Iterative development and continuous integration
-------------------------------------------------

Aim for a short development cycle from the design of your code to its
testing. The design will be simpler, and your code will evolve by the
addition of small new functional units.

With a short development cycle, the addition of new functionalities or
improvements results in small, independent changes to your code. Those
small changes will be easier to track with a version control system and
can be published daily, or many times a day. To enforce this policy, you
should try to make incremental changes to your project. This means
working in small steps with frequent feedback and course correction,
rather than trying to plan months of work in advance.

To achieve this development rhythm you need to apply another rule: don't
optimize prematurely. Write code that is simple, clear and works. Keep in
mind that source code can be rewritten at any point in time; you can
later rewrite the suboptimal sections of your code. If you followed the
previous points and the following section on testing, this will involve
small changes with minimal side effects. You can make those changes in a
new branch while keeping a working (if suboptimal) main branch.
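
In practice, one such short cycle can be as simple as the following
sketch (the file name and test command are placeholders; the tests
themselves are covered in the next subsection):

    # one short development cycle: a small change, the tests, a commit
    $EDITOR src/func/counting.R                # make one small, focused change
    Rscript -e 'testthat::test_dir("tests")'   # check that nothing is broken
    git add src/func/counting.R
    git commit -m "handle empty count tables"
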
Test-driven development
-----------------------

The value of your code resides in its number of working lines, not in its
total number of lines. Thus each modification must be tested. The easiest
way to do that is to write the tests before coding the new
functionalities. Instead of writing your own testing framework (which
would itself need to be tested), use testing libraries like the
[`testthat` package for `R`](http://r-pkgs.had.co.nz/tests.html) or the
[`unittest`](https://docs.python.org/3/library/unittest.html) module for
Python to facilitate the addition of tests to your code. Using
test-driven development will also provide you with a complete set of
tests to check for side effects in the unmodified parts of your code
after a modification.

There are different kinds of tests that you can use, like unit tests or
integration tests.

Unit tests are simple tests that check a functionality or a module. When
you write a function, you first write one or more unit tests that check
whether the return value of your function corresponds to what you expect.
Then you can write your function and test it. Unit tests are beneficial
at many points of the code development:

- Before coding: they force you to detail your code requirements.

- While writing: they keep you from over-coding; when all the test cases
  pass, you are done.

- When refactoring: everything keeps working while you improve your code.

- When maintaining: instead of reconsidering everything, you can just
  say: "no sir, the code still passes all our tests".

- When working with others: you can check that your additions don't break
  other developers' tests.

Integration tests are one level of complexity above unit tests. They aim
at checking the assembly of elementary components of your code.
Integration tests can be used with the content of your `data/examples`
folder to check, after each step of your pipeline, that you get the
expected results.
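A minimal sketch of such an integration test, using the toy data set from
the `data/examples` folder (the pipeline script and file names are
hypothetical):

    # run the whole pipeline on the toy data set and compare with the expected output
    bash src/pipe/quantification.sh data/examples/toy_reads.fastq results/tmp/toy_counts.tsv
    diff results/tmp/toy_counts.tsv data/examples/toy_counts_expected.tsv \
        && echo "integration test passed"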