---
title: "Using Git and Gitlab to back up your scripts and keep track of your changes"
subtitle: "BIBS Workshop"
author: "Carine Rey - BIBS team - CIRI"
date: " February 2024"
date-format: "[Last session:] MMMM, YYYY"
---

This intro is a compilation of various tutorials available on the web:
- https://www.miximum.fr/blog/enfin-comprendre-git/
- https://www.git-tower.com/learn/git/ebook/en/command-line/
- https://linogaliana.gitlab.io/collaboratif/git.html
- https://www.book.utilitr.org/git.html
- https://thinkr.fr/travailler-avec-git-via-rstudio-et-versionner-son-code/
- https://www.bioinformatics.babraham.ac.uk/training/RStudio_GitHub/Initial_setup.html
::: callout-tip
## Objectives
The aim is to teach you the basic concepts and commands so that you can work independently when:
- using git via RStudio with your GitLab repository
- looking up git features
- **and above all** solving your git problems (because yes, you will have some!)
:::
# Why should you use Git?
## Without git
* Chaotic file management across:
    * different versions of a file (V1, V2, final, final_ok, ...)
    * different system backups (PC, backup disks, ...)
    * different users ...
* no trace of why changes were made
* impossible to return to an earlier version
This comic strip should bring back memories for everyone:
<img src="http://phdcomics.com/comics/archive/phd101212s.gif" alt="" width="400"/>
## With git
* Preservation and archiving of your project
* Clear history of changes
* Efficient collaborative working:
    * Work in parallel and easily merge files
    * Keep track of who did what
# Git can be seen as a time machine
Git lets you write the history of your project, alone or with others, using snapshots, which can be seen as pictures of a folder and of the files in it that you wish to track.
This folder is the *repository*. Each time you want to freeze the state of the *repository*, to take a snapshot, you do what's called a *commit*.
Each *commit* records a certain amount of information:
* the modifications made
* who made the modifications
* when the modifications were made
* why the modifications were made, description of the modifications (via a *commit* message).
Over the course of *commits*, you'll build up a *history* that can be consulted. The main history, which contains the "clean" version of your *repository*, is located on the *master* branch.
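To make this concrete, here is a minimal command-line sketch showing what a *commit* records. It works in a temporary scratch folder, and the file name, the identity ("Demo User") and the commit message are placeholders chosen for the illustration, not part of the workshop material.

```bash
# Scratch repository in a temporary folder (nothing here touches your real projects)
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
# Placeholder identity, set for this scratch repository only
git config user.name "Demo User"
git config user.email "demo@example.org"

# Create a file and take a first snapshot of it
echo 'print("hello")' > analysis.R
git add analysis.R
git commit --quiet -m "Add first analysis script"

# Who, when and why, followed by a summary of what changed
git log -1 --stat --format='%an <%ae>%n%ad%n%s'

# The history: one line per commit
git log --oneline
```

In RStudio, the same information is available from the *History* button of the Git panel.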
You can also create parallel *branches* from a *commit*. For example, you could create an additional branch to do something "just to see" and abandon your idea, or keep your modifications and merge them with the *master* branch via a *merge*. But either way, you'll have kept track of them.
<img src="figures/git-branches-merge.png" alt="" width="400"/>
In this course, we won't be using any additional branches, but you should know that they do exist, and that they make this tool, git, so powerful and indispensable for collaborative projects.
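For the curious, the "just to see" scenario could look like the sketch below on the command line. It runs in a temporary scratch repository; the branch name `just-to-see`, the file name and the identity are hypothetical. The name of the main branch is read from git rather than assumed, since it may be *master* or *main* depending on your git settings.

```bash
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
# Placeholder identity for this scratch repository only
git config user.name "Demo User"
git config user.email "demo@example.org"

echo "step 1" > notes.txt
git add notes.txt
git commit --quiet -m "Start the notes"
# Name of the main branch ("master" or "main", depending on your git settings)
main_branch="$(git symbolic-ref --short HEAD)"

# Create a parallel branch and commit an experiment on it
git checkout --quiet -b just-to-see
echo "risky idea" >> notes.txt
git commit --quiet -am "Try a risky idea"

# Back on the main branch, the experiment is not visible...
git checkout --quiet "$main_branch"
cat notes.txt

# ...until the branch is merged
git merge --quiet just-to-see
git log --oneline
```

If the idea turns out to be a bad one, you simply do not merge the branch: the main history stays untouched.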
## Collaborate or archive your code via a remote repository (e.g. gitlab/github)
Git lets you back up your versioned project on a remote server; this remote copy is called the *remote*. Your *remote* can be on GitHub (the best known) or on a self-hosted GitLab (as here at the ENS).
To retrieve a project from a *remote*, the first time, you *clone* it; as the name suggests, you *clone* the project, making a copy of it that you retrieve locally, on your machine. When you make *commits* to your local project, you can send them to the *remote* by making a *push*. Other people connected to the *remote* will perform a *pull* to retrieve your *commit*.
In this way, the local version (on your computer) and the remote version (on the *remote*) of your project stay synchronized, provided you *push* and *pull* regularly.
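The *clone* / *push* / *pull* cycle can be tried without any server. In the sketch below, a bare repository on the local disk stands in for the GitLab *remote* (an assumption made so that the example runs anywhere), and the two users "alice" and "bob" are hypothetical.

```bash
work="$(mktemp -d)"
# Stand-in for the GitLab remote: a bare repository on the local disk
git init --quiet --bare "$work/remote.git"

# First time: clone the remote (git warns that the repository is empty, which is expected)
git clone --quiet "$work/remote.git" "$work/alice"
cd "$work/alice"
git config user.name "Alice Demo"
git config user.email "alice@example.org"
echo "x <- 1" > script.R
git add script.R
git commit --quiet -m "Add script"
# Send the new commit to the remote
git push --quiet -u origin HEAD

# A collaborator clones the project...
git clone --quiet "$work/remote.git" "$work/bob"

# ...the first user pushes one more commit...
echo "y <- 2" >> script.R
git commit --quiet -am "Add y"
git push --quiet

# ...and the collaborator retrieves it with a pull
cd "$work/bob"
git pull --quiet
cat script.R
```

With a real project, the only difference is the address given to `git clone`, which is the SSH URL copied from GitLab.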
## How do you write your story?
The three most common manipulations are shown in the diagram below:
- *pull*: retrieve the latest version of the files from the remote repository
- *commit*: validate my changes with a message explaining them
- *push*: transmit validated changes to the remote repository
<img src="figures/push_pull_Drees.png" alt="" width="400"/>
To be more precise, there is an additional step to be taken before validating your modifications (i.e. making a *commit*): indexing your modifications.
In fact, git allows you to manage modifications in subtle ways and not take into account all the modifications in your workspace (*working directory*).
Only indexed modifications, those you have added to your *staging area* via the *stage* command (`git add` on the command line), will be saved in your *commit*.
To summarize:
1 - First you make changes to your files; at this point these changes are not yet saved in the repository.
<img src="figures/git-stages0.png" alt="" width="400"/>
2 - Use the *stage* command to select the modifications you're going to include in the next *commit* and place them in the *staging area*.
<img src="figures/git-stages1.png" alt="" width="400"/>
3 - Then use the *commit* command to save the selected changes in the *staging area*.
<img src="figures/git-stages2.png" alt="" width="400"/>
These steps can be carried out on the command line, but there are also graphical tools for this, and most editors (IDEs) such as RStudio or Visual Studio Code have plug-ins to make life easier.
<img src="figures/git-stages.png" alt="" width="400"/>
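On the command line, the three steps above could be sketched as follows. The example uses a temporary scratch repository, and the file names and identity are placeholders. Note that only the staged file ends up in the *commit*.

```bash
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
# Placeholder identity for this scratch repository only
git config user.name "Demo User"
git config user.email "demo@example.org"

# 1 - Two changes in the working directory, none saved in the repository yet
echo "clean code" > script.R
echo "scratch notes" > draft.txt
git status --short            # both files are listed as untracked (??)

# 2 - Stage only the change that should go into the next commit
git add script.R
git status --short            # script.R is now staged (A), draft.txt is still untracked

# 3 - Commit what is in the staging area
git commit --quiet -m "Add cleaned script"
git status --short            # only draft.txt remains, still untracked
```

Ticking a box in the *Staged* column of RStudio's Git panel corresponds to step 2, and the *Commit* button to step 3.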
## Summary of key commands
- clone: retrieve the *repository* from the *remote* for the first time
- stage: select the changes that will be added to the next *commit*
- commit: a frozen moment in the life of your project
- push: send new *commits* to the *remote*
- pull: retrieve new *commits* locally from the *remote*
- checkout: jump back in time to a *commit*.
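As an illustration of the "time machine" idea, the sketch below uses *checkout* to visit an earlier *commit* and then come back to the present. It runs in a temporary scratch repository with placeholder file names and identity.

```bash
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
# Placeholder identity for this scratch repository only
git config user.name "Demo User"
git config user.email "demo@example.org"

echo "version 1" > report.txt
git add report.txt
git commit --quiet -m "First version"
first_commit="$(git rev-parse HEAD)"          # identifier of the first commit
branch="$(git symbolic-ref --short HEAD)"     # name of the current branch

echo "version 2" > report.txt
git commit --quiet -am "Second version"

# Jump back to the first commit: the files return to their earlier state
git checkout --quiet "$first_commit"
cat report.txt                                # version 1

# Return to the present (the tip of the branch)
git checkout --quiet "$branch"
cat report.txt                                # version 2
```

Nothing is lost while visiting the past: the second version is still stored in the history and reappears as soon as you return to the branch.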
You can get a more global view of your environment with this diagram:
<img src="figures/basic-remote-workflow.png" alt="" width="400"/>
Now that we've covered the basics, it's time to give it a try!
# It's time to give it a try: Initialize your Git project on Gitlab and use Rstudio to manage it locally
Git itself is a command-line tool, but here we will use it through another tool that provides a graphical interface.
Several tools are available; we will use **RStudio**. It is perhaps not the best, but it has the great advantage of not adding a new tool, as it is already widely used for R programming. For advanced users, most script editors offer a git add-on that provides the same functions.
## Linking GitLab and your local machine through RStudio
### Install tools
- **git** : https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
- Windows users: we recommend that you select the Git Bash and Git GUI components in step 3.
<img src="https://gitlab.cirad.fr/cirad/documentation/-/wikis/uploads/dbdc699677639995106fb84b45d07255/git-rstudio-1.png" alt="" width="400"/>
For more details for Windows users: https://gitlab.cirad.fr/cirad/documentation/-/wikis/Installation-de-Git-sur-Windows
- **rstudio** : https://posit.co/download/rstudio-desktop/
### Create an account on [ENS's Gitlab](https://gitbio.ens-lyon.fr/)
If you don't have an account yet:
- Go to the site and try to connect via **SSO Ens de Lyon**, this will redirect you to the CAS in order to connect with your ENS identifiers.
- You will then be blocked, which is normal. Carine will receive an account request and will be able to validate it.
- Send an e-mail to Carine (carine.rey@ens-lyon.fr) specifying your group name.

### Complete your email address in your gitbio account on [ENS's Gitlab](https://gitbio.ens-lyon.fr/)
- At your first connection, you have to enter your email address in your account settings; please use your ENS email address. An email will automatically be sent to that address.
- Go to your ENS webmail (https://webmail.ens-lyon.fr/); you should have received an email titled **Confirmation instructions | GITBIO-NEW**.
- Confirm your email address and come back to <https://gitbio.ens-lyon.fr/>. You should now have access to your dashboard.
### Create a new *repository* on the [ENS's Gitlab](https://gitbio.ens-lyon.fr/) in your team directory.
- The GitLab is **shared between the different laboratories** of the ENS de Lyon (CIRI, LBMC, IGFL, RDP). **To keep it tidy**, it is organized by lab and then by team, which is why each team has its own **group** within the laboratory **group**. Within each team, I recommend creating a group for each user, so that everyone has their own space.
- Inside a group, you can create a **project** which will contain your scripts.
- A project can be public or private. I recommend keeping your private projects in your own group; a public project associated with a publication can be placed at the root of your team group.
- Each user can belong to a group and have different rights on this group.
- Click on the lab (Menu -> groups -> your groups) (https://gitbio.ens-lyon.fr/CIRI/)
- Then create a project by clicking on (**Create new project**)
- Select Create from blank project (**Create blank project**)
- Give your project a name (Ideally, your project name should be in lower case, without periods, spaces or underscores, and should not begin with a number, e.g. htlv_rnaseq).
- Leave **Initialize repository with a README** selected.
- Click on **Create project**
### Create your ssh key pair (using Rstudio) to allow your local machine and Gitlab to communicate.
We need to enable your local machine to connect to GitLab, so we're going to use an ssh key pair, which is more secure and easier to use than a login/password. We will use RStudio to create it, but you can also do it on the command line.
If you already have an ssh key pair, it's still a good idea to make a new pair of files specifically for connecting to gitlab.
We're going to use Rstudio to generate this pair of keys (which are in fact simply 2 files, one called "**private key**" and the second "**public key**").
The private key can be symbolized by a padlock key and the public key by the lock of this padlock.
We're going to create these two files directly via Rstudio on your virtual machine.
In Rstudio:
1- click on "Tools > Global Options...> Git/SVN".
2- click on "Create ssh Key ...".
3- a window will open, enter a passphrase (=a password to secure the use of your private key) and validate.
<img src="figures/create_ssh_key_from_rstudio.png" alt="" width="400"/>
4- Then click on "Close".
5 - Access the public key (=content of the id_ed25519.pub file) by clicking on "View public key".
<img src="figures/create_ssh_key_from_rstudio_get_public_key_fleche.png" alt="" width="400"/>
6- Copy your public key
<img src="figures/create_ssh_key_from_rstudio_copy_public_key.png" alt="" width="400"/>
7- In your Gitlab profile, top left, click on your avatar then on **Preferences** then **ssh keys** (in the panel on the left). You should arrive on this page: <https://gitbio.ens-lyon.fr/-/profile/keys>
<img src="figures/preference_ssh_key_gitbio.png" alt="" width="600"/>
8- Add your new public key
*You could also create these files on the command line via the terminal. You can find help in the gitlab documentation, or on the Internet if you need to do it again (e.g. https://happygitwithr.com/ssh-keys.html#create-an-ssh-key-pair).*
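For reference, a command-line equivalent could look like the sketch below. It is a demonstration only: the key is written to a temporary folder instead of `~/.ssh`, the label "gitbio demo key" is a placeholder, and the passphrase is left empty so that the example runs unattended. For a real key, choose a real passphrase.

```bash
# Demo only: the key pair is written to a temporary folder, not to ~/.ssh
keydir="$(mktemp -d)"

# -t: key type, -C: free-text label, -f: output file, -N: passphrase
# (empty here so that the demo runs unattended; use a real passphrase for an actual key)
ssh-keygen -q -t ed25519 -C "gitbio demo key" -f "$keydir/id_ed25519" -N ""

# Two files are created: the private key and the public key (.pub)
ls "$keydir"

# The content of the .pub file is the text to paste into GitLab
cat "$keydir/id_ed25519.pub"
```

Only the public key is ever copied elsewhere; the private key stays on your machine.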
To check that everything is OK, you can type in the **terminal** (via RStudio):
```
ssh -T git@gitbio.ens-lyon.fr
```
<img src="figures/test_ssh_keys.png" alt="" width="600"/>
- Answer "yes" to the question (on the first connection, ssh asks whether you trust the server)
- Enter your passphrase (it won't be displayed, that's normal.)
The answer should be:
```
Welcome to GitLab, @your_login!
```
<img src="figures/create_ssh_key_from_rstudio_check_key.png" alt="" width="400"/>
### Configuring git in Rstudio
Finally, you need to declare your identity: in the RStudio terminal (not the R console), enter your name and email address so that each of your *commits* is linked to you:
```
git config --global user.name "your_username"
git config --global user.email "your_mail@mail.com"
```
<img src="figures/config_git_rstudio.png" alt="" width="400"/>
If you forget to do this, you will get this message:
<img src="figures/config_user.png" alt="" width="400"/>
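To check which identity git will use, the values can be read back with `git config --get`. The sketch below deliberately works in a temporary scratch repository and sets the identity without `--global`, so that it does not modify your real configuration; the name and email are placeholders.

```bash
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Without --global, the setting applies to this repository only
git config user.name "Demo User"
git config user.email "demo@example.org"

# Read the values back
git config --get user.name
git config --get user.email

# Show which configuration file each value comes from
git config --show-origin --get-regexp '^user\.'
```

In a real project, running the two `--get` commands is enough to confirm that the `--global` settings above were taken into account.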
### Clone your empty repository and create an R project in Rstudio
To associate this Git *repository* with an R project via RStudio, you need to make a clone:
* On Gitlab: click on Code and copy the URL (SSH protocol)
<img src="figures/git_clone_gitlab.png" alt="" width="400"/>
* In RStudio now click on: File > New Project... > Version Control > Git,
- enter the URL/SSH address of the repository you've copied, the name of the R project (ideally the same as Git)
- enter the folder in which to place it **(~/mydatalocal)**,
- click on Create Project and finally enter your passphrase.
<img src="figures/git_rstudio_newproj_gitlab.png" alt="" width="400"/>
In this newly created RStudio project, you'll see the git tab in the top right-hand corner.
<img src="figures/git_rstudio_gittab.png" alt="" width="400"/>
## Using git commands in Rstudio
RStudio's Git panel shows the status of your project's files and folders in real time:
<img src="figures/git_etat_fichier.png" alt="" width="400"/>
* A new file will be associated with an orange icon containing a **?**
* This new file will be associated with a green icon containing an **A** once you've checked it (in the 'staged' column).
* A modified file will be associated with a blue icon containing an **M**
* A deleted file will be associated with a red icon containing a **D**
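The same statuses can be read on the command line with `git status --short`, which prints a two-character code in front of each file. The sketch below reproduces the four situations listed above in a temporary scratch repository; the file names and identity are placeholders.

```bash
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
# Placeholder identity for this scratch repository only
git config user.name "Demo User"
git config user.email "demo@example.org"

# Two files that are already part of the history
echo "a" > modified.R
echo "b" > deleted.R
git add modified.R deleted.R
git commit --quiet -m "Initial files"

echo "new" > untracked.R        # new file, not staged
echo "new" > added.R
git add added.R                 # new file, staged
echo "change" >> modified.R     # tracked file, modified
rm deleted.R                    # tracked file, deleted

git status --short
```

In the output, `??` marks the new untracked file, `A` the staged new file, `M` the modified file and `D` the deleted file, which mirrors the icons of the Git panel.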
## Configuring files to be synchronized or not using a .gitignore file
You don't need to synchronize all the files in your project. Only those you check will be associated with commits. It is therefore possible to explicitly ask Git not to monitor a particular file: this is the role of the *.gitignore* file at the root of your project. This is a text file that accepts wildcard (glob) patterns such as `*.Rproj`, which let you define rules matching several files at once:
<img src="figures/git_gitignore.png" alt="" width="400"/>
By default, when creating the Rstudio project, a .gitignore file is added containing the following lines:
```
.Rproj.user
.Rhistory
.RData
.Ruserdata
```
This means that the Rstudio project configuration files are not tracked.
In general, we don't want to track changes to raw data or results. Tracking may be acceptable for small metadata files **but NEVER for big raw data files!**
1. Add the following lines to the *.gitignore* file:
```
data/
results/
*.Rproj
```
2. Index (stage) the change by ticking the box in the *Staged* column next to the *.gitignore* file.
3. Then commit with an explicit message.
4. Then *push* the modifications to synchronize your local modifications with the *remote*.
5. View changes on gitlab
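If you want to verify that the rules behave as expected, `git check-ignore -v` reports which line of the *.gitignore* file matches a given path. The sketch below recreates the *.gitignore* described above in a temporary scratch repository; all file names inside `data/`, `results/` and `src/` are hypothetical.

```bash
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# The default RStudio rules plus the three lines added above
printf '%s\n' '.Rproj.user' '.Rhistory' '.RData' '.Ruserdata' \
              'data/' 'results/' '*.Rproj' > .gitignore

# Hypothetical project content
mkdir -p data results src
echo "ACGT" > data/reads.fastq
echo "p=0.05" > results/table.csv
echo "x <- 1" > src/01_analysis.R
touch myproject.Rproj

# Which rule ignores each file? (nothing is printed for files that are not ignored)
git check-ignore -v data/reads.fastq results/table.csv myproject.Rproj src/01_analysis.R

# Only .gitignore and src/ are left as candidates for a commit
git status --short
```

The script in `src/` does not appear in the output of `git check-ignore`, which confirms that it will still be offered for staging.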
::: callout-important
# Never commit data into version control repositories
Why you should never commit data to Git:
Data should never be committed into your Git repositories. This is because git was designed to version small files of source code; committing data, a different category of things from source code, into your repositories will first and foremost lead to repository size bloat. Also, committing data into repositories means the data get shipped alongside the source code to anybody who has access to the source code. This might not necessarily be in-line with organizational practices.
:::
## Organizing your working directory
It's a good idea to put all your project-related files in the same folder:
* raw data
* scripts
* results
* project documentation,
* ...
To help you find your way around and avoid mixing up files or accidentally deleting them, we recommend that you separate the different types of data into sub-folders.
For example, your working directory might look like this:
```
project_name/
├── README.md # overview of the project
├── data/ # data files used in the project
├── results/ # results of the analysis (data, tables, figures)
├── src/ # contains all code in the project
│ └── ...
└── doc/ # documentation for your project
└── ...
```
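A layout like this one can be created in one go from the terminal. The sketch below builds it inside a temporary folder that stands in for your cloned project; in practice you would run the `mkdir` and `touch` commands from the root of your own project.

```bash
# Stand-in for your cloned project folder
project="$(mktemp -d)/project_name"

# Create the sub-folders (-p also creates the parent folder if needed)
mkdir -p "$project/data" "$project/results" "$project/src" "$project/doc"

# Create an empty README.md at the root
touch "$project/README.md"

# List the result
ls -1A "$project"
```

Keep in mind that git tracks files, not folders: an empty folder does not appear in the Git panel until it contains at least one file that is not ignored.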
In addition, for ease of use and reproducibility, you need to add a file, often called **README.md**, to the root of your folder, which will contain all the information you need to get started with the project.
This is also the file that will be visible on your project's home page on Gitlab.
This way, when someone wants to (re)work on the project, they can open the file, and they'll know where to go to see and understand what's been done.
This person could be a collaborator, your manager or simply yourself 6 months later.
### The README.md
In concrete terms, the README.md file is a text file written in markdown (hence the .md extension).
Markdown is a language that allows you to encode the formatting of plain text simply and easily.
For example, a # means that the following sentence is a title, ## , a subtitle, ###, a sub-subtitle.
You can browse the various tags here: [https://www.markdownguide.org/basic-syntax/](https://www.markdownguide.org/basic-syntax/)
This makes it possible to write text without wasting time on formatting, keeping the file "light" and, above all, readable for everyone.
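As a starting point, a README.md skeleton could be written from the terminal as sketched below. The section names and their content are only suggestions made up for this example, and the file is written to a temporary folder so that the example does not overwrite an existing README.md.

```bash
# Stand-in for the root of your project
project="$(mktemp -d)"

# Write a small markdown skeleton: one title, three subtitles, one sub-subtitle
cat > "$project/README.md" <<'EOF'
# Project title (placeholder)

## Aim

One or two sentences describing the question addressed.

## Folder organization

- `data/`: raw data (not tracked by git)
- `results/`: outputs of the analysis (not tracked by git)
- `src/`: scripts

## How to run

### Step 1

Describe the first script to run.
EOF

cat "$project/README.md"
```

You can of course write the same content directly in the RStudio editor; the point is only to show how the `#`, `##` and `###` markers structure the file.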
On your project's Gitlab page, you'll find your formatted **README.md** file.
Don't forget to add to your README.md as you go along, so you don't forget anything.
You can think of it as your laboratory notebook or your laboratory report.
At the end of the course, the quality of your README.md will be particularly important in the evaluation.
### Creating your project architecture
1. Create your README.md
2. Start completing it
3. Index, commit, push...
4. Create data, results and scripts folders
5. Create a placeholder script "00_fake.R" in the scripts directory.
6. Index, commit, push...
7. Update the .gitignore file
8. Index, commit, push...
# Time to work on your data
Import the scripts of your own project into GitLab.
# Resolve conflicts
Some resources to resolve conflicts:
- https://docs.gitlab.com/ee/user/project/merge_requests/conflicts.html
- https://www.simplilearn.com/tutorials/git-tutorial/merge-conflicts-in-git
If you cannot resolve the conflicts:
- rename your local project to "backup_project_name"
- clone your project again from the *remote*
- copy your latest changes from the backup project into the newly cloned project
- index, commit and push
- use this new project
- delete the backup folder when you're sure you've retrieved all your changes
<img src="figures/git_clone_again.jpg" alt="" width="600"/>
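The fallback procedure above could be sketched on the command line as follows. To keep the example runnable anywhere, a bare repository on the local disk stands in for the GitLab *remote*, and the first half of the script only prepares a project with unsaved local changes; file names and identity are placeholders.

```bash
work="$(mktemp -d)"

# --- Setup for the demo: a stand-in remote and a first clone with local changes ---
git init --quiet --bare "$work/remote.git"
git clone --quiet "$work/remote.git" "$work/project_name"
cd "$work/project_name"
git config user.name "Demo User"
git config user.email "demo@example.org"
echo "v1" > script.R
git add script.R
git commit --quiet -m "Add script"
git push --quiet -u origin HEAD
echo "v2, my latest changes" > script.R    # local work we do not want to lose

# --- The fallback itself ---
cd "$work"
mv project_name backup_project_name                     # rename the local project
git clone --quiet "$work/remote.git" project_name       # clone again from the remote
cp backup_project_name/script.R project_name/script.R   # copy your latest changes
cd project_name
git config user.name "Demo User"
git config user.email "demo@example.org"
git add script.R                                        # index, commit and push
git commit --quiet -m "Recover latest changes"
git push --quiet

# Once everything has been checked, the backup can be deleted:
# rm -rf "$work/backup_project_name"
```

The deletion is left commented out on purpose: remove the backup only after checking on GitLab that your changes have arrived.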
# Collaboration with multiple people on the same project: the use of multiple branches
See:
- https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell
- https://cupofcode.blog/intro-to-git/
------------------------------------------------------------------------
These materials have been developed by members of the BIBS team of the CIRI (<https://ciri.ens-lyon.fr/>). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- Some materials used in these lessons were derived or adapted from work made available by the Harvard Chan Bioinformatics Core (HBC) (<https://github.com/hbctraining>) under the Creative Commons Attribution license (CC BY 4.0).
------------------------------------------------------------------------