Ghislain Durif authored

Testing Data Version Control (DVC)

Install

pip install dvc

Usage

  1. Init a git repository and do some stuff
git init
git add ...
git commit ...
git remote add ...
  1. Init a DVC repository and record it in git
dvc init
git add .dvc/.gitignore .dvc/config .dvcignore
git commit -m "init DVC repository"
  1. Generate some random data file:
src/script.sh        # generate data/datafile.dat
  1. Add the data to DVC repository:
dvc add data/datafile.dat
100% Adding...|███████████████████████████████████████|1/1 [00:00,  4.63file/s]
                                                                                                                       
To track the changes with git, run:

	git add data/datafile.dat.dvc data/.gitignore

To enable auto staging, run:

	dvc config core.autostage true
  1. Record the reference to the data in git (but not the data file itself):
git add data/datafile.dat.dvc data/.gitignore
git commit -m "data generation v1"
  1. Modify the script and regenerate the data:
# edit src/script.sh
git add src/script.sh
git commit -m "modification of data generation"
src/script.sh        # regenerate data/datafile.dat

Important:

  • git does not "see" the modification (git status reports nothing), because data/datafile.dat is listed in data/.gitignore
  • the data modification can be verified with dvc status
dvc status
data/datafile.dat.dvc:                                                                                                                                                      
	changed outs:
		modified:           data/datafile.dat
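
To check why git stays silent, one can ask git whether it ignores a given path. The sketch below wraps git check-ignore in a small helper; the name git_visibility is hypothetical and only used for illustration.

```shell
# Hypothetical helper (not part of git or DVC): report whether git ignores a path.
git_visibility() {
    # `git check-ignore -q` exits with 0 when the path matches an ignore rule
    if git check-ignore -q "$1"; then
        echo "ignored by git: $1"
    else
        echo "visible to git: $1"
    fi
}
```

In this repository, git_visibility data/datafile.dat should report the data file as ignored, while git_visibility data/datafile.dat.dvc should report the .dvc file as visible, since only the latter is tracked by git.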
  1. Record the new version of the data:
dvc add data/datafile.dat

Now git "sees" the modification:

git status
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   data/datafile.dat.dvc

no changes added to commit (use "git add" and/or "git commit -a")

Record modification in git:

git add data/datafile.dat.dvc
git commit -m "new version of generated data"
  1. Add local DVC remote:
dvc remote add -d local ../dvc-testing-remote

Note: the -d option sets the new remote as the default one.

  1. Push to local DVC remote:
dvc push
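
After a push, it can be useful to confirm that the remote holds everything the local cache does. A minimal sketch, assuming the --cloud and -r options of dvc status behave as documented; the helper name push_and_verify is hypothetical.

```shell
# Hypothetical helper: push to a named DVC remote, then compare cache and remote.
push_and_verify() {
    local remote="$1"
    # `dvc status --cloud` compares the local cache with the remote storage
    dvc push -r "$remote" && dvc status --cloud -r "$remote"
}
```

For the remote defined above, this would be called as push_and_verify local.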

Data version management

  1. Check the current data version:
cat data/datafile.dat.dvc
outs:
- md5: a0c027223a771d1bb1519e5e5aaaf82c
  size: 204800000
  hash: md5
  path: datafile.dat

Note: we can verify that the hash stored in the data/datafile.dat.dvc file corresponds to the actual data/datafile.dat file:

md5sum data/datafile.dat
a0c027223a771d1bb1519e5e5aaaf82c  data/datafile.dat
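
The manual comparison above can be scripted. The following is a minimal sketch, assuming a .dvc file that tracks a single file (not a directory) and is laid out like the one shown above; the function name dvc_hash_matches is hypothetical.

```shell
# Hypothetical helper: succeed if the md5 recorded in a .dvc file matches
# the md5 of the data file next to it (same path without the .dvc suffix).
dvc_hash_matches() {
    local dvc_file="$1"
    local data_file="${dvc_file%.dvc}"
    local recorded actual
    # First "- md5: <hash>" entry of the .dvc file
    recorded=$(awk '$1 == "-" && $2 == "md5:" { print $3; exit }' "$dvc_file")
    # md5sum prints "<hash>  <path>"; keep only the hash
    actual=$(md5sum "$data_file" | awk '{ print $1 }')
    [ -n "$recorded" ] && [ "$recorded" = "$actual" ]
}
```

Usage: dvc_hash_matches data/datafile.dat.dvc && echo "data matches recorded version".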
  1. Switch to a previous git commit:
git log --all --graph --oneline --decorate
* 7fc8f5c (HEAD -> main, origin/main) new version of data
* 32aa7c7 modification of data generation
* 949f9ed add dvc local remote
* c8ad6a0 data generation v1
* 2885f16 init dvc repos
* fe58e05 script that generate some data
* b2f5373 init DVC testing repos
git checkout c8ad6a0
  1. Verify version of data:
cat data/datafile.dat.dvc
outs:
- md5: 8c0a82ed58e6152f9b134ba8d272dd42
  size: 102400000
  hash: md5
  path: datafile.dat

Note: at this point, the data file on disk does not match the checked-out commit: the hash stored in data/datafile.dat.dvc differs from the hash of the actual data/datafile.dat file.

md5sum data/datafile.dat
a0c027223a771d1bb1519e5e5aaaf82c  data/datafile.dat
  1. Switch to the corresponding version of the data file:
dvc checkout
M       data/datafile.dat
  1. Verify the version of the data file:
md5sum data/datafile.dat
8c0a82ed58e6152f9b134ba8d272dd42  data/datafile.dat
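
Since switching versions always pairs a git checkout with a dvc checkout, the two steps can be combined. A minimal sketch; the helper name checkout_data_version is hypothetical.

```shell
# Hypothetical helper: check out a git revision, then bring the DVC-tracked
# data in line with the .dvc files of that revision.
checkout_data_version() {
    local rev="$1"
    # Only run `dvc checkout` if the git checkout succeeded
    git checkout "$rev" && dvc checkout
}
```

Usage: checkout_data_version c8ad6a0 to return to the first data version, or checkout_data_version main to come back to the latest one.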

SSH remote

More details here: https://dvc.org/doc/user-guide/data-management/remote-storage/ssh

  1. Requirements:
pip install dvc-ssh
  1. Add an SSH remote to the DVC repository:
dvc remote add psmn_ssh ssh://gdurif@psmn-local/home/gdurif/work/dvc-testing-remote
  1. List DVC remotes:
dvc remote list
local	/home/drg/work/dev/tmp/test_data_version_control/dvc-testing-remote
psmn_ssh	ssh://gdurif@psmn-local/home/gdurif/work/dvc-testing-remote
  1. Record new remote in git:
git diff
diff --git a/.dvc/config b/.dvc/config
index 60e9772..a1cadea 100644
--- a/.dvc/config
+++ b/.dvc/config
@@ -3,3 +3,5 @@
     remote = local
 ['remote "local"']
     url = ../../dvc-testing-remote
+['remote "psmn_ssh"']
+    url = ssh://gdurif@psmn-local/home/gdurif/work/dvc-testing-remote
git add .dvc/config
git commit -m "new DVC remote"
  1. Push to a given remote:
dvc push -r psmn_ssh
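
With several remotes configured, pushing to each of them in turn can be scripted. The sketch below assumes that dvc remote list prints one remote per line with the name in the first column, as in the listing above; the helper name push_to_all_remotes is hypothetical.

```shell
# Hypothetical helper: push the DVC cache to every configured remote.
push_to_all_remotes() {
    local name
    # Keep only the first column (the remote name) of `dvc remote list`
    for name in $(dvc remote list | awk '{ print $1 }'); do
        # Stop at the first remote that fails
        dvc push -r "$name" || return 1
    done
}
```

Calling push_to_all_remotes from the repository root would push to local and then to psmn_ssh.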

Tips

Disable analytics reporting (locally in a repository):

dvc config core.analytics false

Note: add the --global option to apply the setting globally, as with git.