Testing Data Version Control (DVC)
Install
pip install dvc
Usage
- Init a git repository and do some stuff
git init
git add ...
git commit ...
git remote add ...
- Init a DVC repository and record it in git
dvc init
git add .dvc/.gitignore .dvc/config .dvcignore
git commit -m "init DVC repository"
- Generate some random data file:
scr/script.sh # generate data/datafile.dat
- Add the data to DVC repository:
dvc add data/datafile.dat
100% Adding...|███████████████████████████████████████|1/1 [00:00, 4.63file/s]
To track the changes with git, run:
git add data/datafile.dat.dvc data/.gitignore
To enable auto staging, run:
dvc config core.autostage true
- Record the reference to the data in git (but not the data file itself):
git add data/datafile.dat.dvc data/.gitignore
git commit -m "data generation v1"
- Modify the script and regenerate the data:
# edit scr/script.sh
git add scr/script.sh
git commit -m "modification of data generation"
scr/script.sh # generate data/datafile.dat
Important:
- git does not "see" the modification (git status will say nothing)
- data modification can be verified with dvc status:
dvc status
data/datafile.dat.dvc:
changed outs:
modified: data/datafile.dat
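git stays silent because dvc add listed the data file in data/.gitignore (the file staged earlier together with data/datafile.dat.dvc). The sketch below reproduces that behaviour in a throwaway repository using git only. It assumes the ignore entry is "/datafile.dat"; the temporary directory, the demo identity and the variable names are made up for illustration.

```shell
# Sketch: show why git reports nothing when an ignored data file changes.
# Assumption: data/.gitignore holds the entry "/datafile.dat".
demo_dir=$(mktemp -d)                      # throwaway repository, not the real project
git init -q "$demo_dir"
mkdir "$demo_dir/data"
echo "/datafile.dat" > "$demo_dir/data/.gitignore"
echo "v1" > "$demo_dir/data/datafile.dat"

# Commit only the ignore file, as done for the real repository
git -C "$demo_dir" add data/.gitignore
git -C "$demo_dir" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "ignore data file"

# Modify the ignored data file, then ask git what it sees
echo "v2" > "$demo_dir/data/datafile.dat"
changes=$(git -C "$demo_dir" status --porcelain)
ignore_rule=$(git -C "$demo_dir" check-ignore -v data/datafile.dat)

echo "git status output: '${changes}'"     # empty string: nothing reported
echo "ignore rule: ${ignore_rule}"         # the .gitignore line hiding the file
```

In the real repository, git check-ignore -v data/datafile.dat should point at the same kind of rule.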
- Record the new version of the data:
dvc add data/datafile.dat
Now git "sees" the modification:
git status
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
(use "git push" to publish your local commits)
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: data/datafile.dat.dvc
no changes added to commit (use "git add" and/or "git commit -a")
Record modification in git:
git add data/datafile.dat.dvc
git commit -m "new version of generated data"
- Add local DVC remote:
dvc remote add -d local ../dvc-testing-remote
Note: the -d option sets the new remote as the default one.
- Push to local DVC remote:
dvc push
Data version management
- Current data version
cat data/datafile.dat.dvc
outs:
- md5: a0c027223a771d1bb1519e5e5aaaf82c
size: 204800000
hash: md5
path: datafile.dat
Note: we can verify that the hash stored in the data/datafile.dat.dvc file corresponds to the actual data/datafile.dat file:
md5sum data/datafile.dat
a0c027223a771d1bb1519e5e5aaaf82c data/datafile.dat
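The manual comparison above can be wrapped in a small helper. This is a minimal sketch, assuming GNU md5sum and awk are available and that the pointer file has a single "md5:" entry as shown above. check_dvc_hash is a hypothetical function name, not a DVC command, and the demo files are created in a temporary directory.

```shell
# Sketch: compare the md5 recorded in a .dvc pointer file with the data file's md5.
# check_dvc_hash is a hypothetical helper, not part of DVC.
check_dvc_hash() {
    pointer="$1"    # e.g. data/datafile.dat.dvc
    datafile="$2"   # e.g. data/datafile.dat
    # The "- md5: <hash>" line is the only one containing "md5:"
    recorded=$(awk '/md5:/ {print $NF; exit}' "$pointer")
    actual=$(md5sum "$datafile" | awk '{print $1}')
    if [ "$recorded" = "$actual" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

# Self-contained demo on a mock data file and a mock pointer file
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/datafile.dat"
cat > "$workdir/datafile.dat.dvc" <<EOF
outs:
- md5: $(md5sum "$workdir/datafile.dat" | awk '{print $1}')
  hash: md5
  path: datafile.dat
EOF

first_check=$(check_dvc_hash "$workdir/datafile.dat.dvc" "$workdir/datafile.dat")
printf 'changed\n' >> "$workdir/datafile.dat"   # data no longer matches the pointer
second_check=$(check_dvc_hash "$workdir/datafile.dat.dvc" "$workdir/datafile.dat")

echo "$first_check"
echo "$second_check"
```

In the repository itself, the call would be check_dvc_hash data/datafile.dat.dvc data/datafile.dat.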
- Switch to a previous git commit:
git log --all --graph --oneline --decorate
* 7fc8f5c (HEAD -> main, origin/main) new version of data
* 32aa7c7 modification of data generation
* 949f9ed add dvc local remote
* c8ad6a0 data generation v1
* 2885f16 init dvc repos
* fe58e05 script that generate some data
* b2f5373 init DVC testing repos
git checkout c8ad6a0
- Verify version of data:
cat data/datafile.dat.dvc
outs:
- md5: 8c0a82ed58e6152f9b134ba8d272dd42
size: 102400000
hash: md5
path: datafile.dat
Note: at this point, the data file does not match the checked-out version: the hash stored in the data/datafile.dat.dvc file differs from the hash of the actual data/datafile.dat file:
md5sum data/datafile.dat
a0c027223a771d1bb1519e5e5aaaf82c data/datafile.dat
- Switch to corresponding version of data file:
dvc checkout
M data/datafile.dat
- Verify the version of the data file:
md5sum data/datafile.dat
8c0a82ed58e6152f9b134ba8d272dd42 data/datafile.dat
SSH remote
More details here: https://dvc.org/doc/user-guide/data-management/remote-storage/ssh
- Requirements:
pip install dvc-ssh
- Add an SSH remote to the DVC repository:
dvc remote add psmn_ssh ssh://gdurif@psmn-local/home/gdurif/work/dvc-testing-remote
- List DVC remotes:
dvc remote list
local /home/drg/work/dev/tmp/test_data_version_control/dvc-testing-remote
psmn_ssh ssh://gdurif@psmn-local/home/gdurif/work/dvc-testing-remote
- Record new remote in git:
git diff
diff --git a/.dvc/config b/.dvc/config
index 60e9772..a1cadea 100644
--- a/.dvc/config
+++ b/.dvc/config
@@ -3,3 +3,5 @@
remote = local
['remote "local"']
url = ../../dvc-testing-remote
+['remote "psmn_ssh"']
+ url = ssh://gdurif@psmn-local/home/gdurif/work/dvc-testing-remote
git add .dvc/config
git commit -m "new DVC remote"
- Push to a given remote:
dvc push -r psmn_ssh
Tips
Disable analytics reporting (locally in a repository):
dvc config core.analytics false
Note: add the --global option for global configuration, as with git.