Skip to main content

An MLOps tool to manage data files and pipelines on top of Git

Project description

# Xvc

This is the Python port of Xvc. Xvc's goal is let users perform all ML operations regarding data, files, models and pipelines from the command line, and version all of these on top of Git. Xvc.py extends this goal by allowing all commands to be run from Python shells, code and Jupyter notebooks.

You can install xvc to your system or virtual environment with uncommenting the line below.

# !pip install xvc

Import Xvc like any other Python module.

import xvc

This notebook is created with the following version. Please test your issue with the latest version before creating any bug reports.

xvc.version()
'v0.6.3-46-g71fc283-modified'

Python wrappings for xvc command uses Xvc class.

from xvc import Xvc

Tracking files

Let's reset the directory we're in and create a directory tree. These are the data files we want to track. Note that xvc-test-helper is a cli application to make life easier for Xvc tests. If you want to reproduce this notebook, you can install it with cargo install xvc-test-helper

!chmod -R +w .xvc .xvcignore .git .gitignore test-data/
!rm -rf .xvc .xvcignore .git .gitignore test-data/
!xvc-test-helper create-directory-tree --directories 3 --files 3 --root test-data --seed 20240101
chmod: .xvc: No such file or directory
chmod: .git: No such file or directory
chmod: .gitignore: No such file or directory
chmod: test-data/: No such file or directory

Above command created a directory structure as this.

!tree
.
├── Debug.ipynb
├── Readme.ipynb
├── default-pipeline.json
├── dr-iq-scores.csv
├── generate_data.py
├── random_names_iq_scores.csv
├── requirements.txt
├── test-data
│   ├── dir-0001
│   │   ├── file-0001.bin
│   │   ├── file-0002.bin
│   │   └── file-0003.bin
│   ├── dir-0002
│   │   ├── file-0001.bin
│   │   ├── file-0002.bin
│   │   └── file-0003.bin
│   └── dir-0003
│       ├── file-0001.bin
│       ├── file-0002.bin
│       └── file-0003.bin
└── visualize.py

5 directories, 17 files

Now we can initialize Xvc repository. We first init Git in the directory. Xvc tracks binary files with their metadata (size, timestamp and digest) and this metadata is kept in text files. You can use Xvc without Git, but adding versioning to your data is usually a good idea. Xvc manages Git operations on the text files it creates automatically by default, so it won't add any extra commands to the workflow.

!git init
Initialized empty Git repository in /Users/iex/github.com/iesahin/xvc-notebooks/Readme/.git/

Now we can init the Xvc repository

First we create an Xvc instance. You can specify the workdir and this instance will work on that directory. For now, only one instance of Xvc can run in a Python process. This limitation may be removed in future releases.

xvc_test_data = Xvc()

Let's initialize Xvc in the directory. This will create a .xvc directory, will put several initial files and also create a .gitignore entry for those elements that shouldn't be tracked by Git.

xvc_test_data.init()
''

We can see which Xvc repository a directory belongs with root command. By default, it returns the relative directory but you can set absolute flag to get see the absolute dir.

xvc_test_data.root(absolute=True)
/Users/iex/github.com/iesahin/xvc-notebooks/Readme





''

Let's add a set of files to Xvc.

xvc_test_data.file().track("test-data/dir-0001/")
''

If there is no output, it (usually) means the command is successful. If you want more output, you can set verbosity level when creating Xvc object.

Let's get the list of files in the workspace.

xvc_test_data.file().list()
'FX         142 2024-06-15 16:11:50          2f4850fa visualize.py\nFX        2003 2024-06-16 09:49:47          41e16be7 test-data/dir-0003/file-0003.bin\nFX        2002 2024-06-16 09:49:47          27f0efd0 test-data/dir-0003/file-0002.bin\nFX        2001 2024-06-16 09:49:47          66de5084 test-data/dir-0003/file-0001.bin\nDX         160 2024-06-16 09:49:47                   test-data/dir-0003\nFX        2003 2024-06-16 09:49:47          41e16be7 test-data/dir-0002/file-0003.bin\nFX        2002 2024-06-16 09:49:47          27f0efd0 test-data/dir-0002/file-0002.bin\nFX        2001 2024-06-16 09:49:47          66de5084 test-data/dir-0002/file-0001.bin\nDX         160 2024-06-16 09:49:47                   test-data/dir-0002\nFC        2003 2024-06-16 09:49:47 41e16be7 41e16be7 test-data/dir-0001/file-0003.bin\nFC        2002 2024-06-16 09:49:47 27f0efd0 27f0efd0 test-data/dir-0001/file-0002.bin\nFC        2001 2024-06-16 09:49:47 66de5084 66de5084 test-data/dir-0001/file-0001.bin\nDX         160 2024-06-16 09:49:48                   test-data/dir-0001\nFX          85 2024-06-16 09:49:48          30842230 test-data/.gitignore\nDX         192 2024-06-16 09:49:48                   test-data\nFX          14 2024-06-15 16:11:50          2b56f46f requirements.txt\nFX       19021 2024-06-16 09:44:16          95dff305 random_names_iq_scores.csv\nFX         629 2024-06-15 16:11:50          1eecb98e generate_data.py\nFX           8 2024-06-16 09:46:37          af1349b9 dr-iq-scores.csv\nFX        2183 2024-06-16 09:46:46          35763b4e default-pipeline.json\nFX       85102 2024-06-16 09:48:12          396cd3fd Readme.ipynb\nFX       30380 2024-06-10 08:25:47          a4864bca Debug.ipynb\nTotal #: 22 Workspace Size:      156254 Cached Size:        6006\n'

Note that Xvc commands return Python strings as output. You can print and process the output as any other string.

xvc_test_data_list = xvc_test_data.file().list() 
print(xvc_test_data_list)
FX         142 2024-06-15 16:11:50          2f4850fa visualize.py
FX        2003 2024-06-16 09:49:47          41e16be7 test-data/dir-0003/file-0003.bin
FX        2002 2024-06-16 09:49:47          27f0efd0 test-data/dir-0003/file-0002.bin
FX        2001 2024-06-16 09:49:47          66de5084 test-data/dir-0003/file-0001.bin
DX         160 2024-06-16 09:49:47                   test-data/dir-0003
FX        2003 2024-06-16 09:49:47          41e16be7 test-data/dir-0002/file-0003.bin
FX        2002 2024-06-16 09:49:47          27f0efd0 test-data/dir-0002/file-0002.bin
FX        2001 2024-06-16 09:49:47          66de5084 test-data/dir-0002/file-0001.bin
DX         160 2024-06-16 09:49:47                   test-data/dir-0002
FC        2003 2024-06-16 09:49:47 41e16be7 41e16be7 test-data/dir-0001/file-0003.bin
FC        2002 2024-06-16 09:49:47 27f0efd0 27f0efd0 test-data/dir-0001/file-0002.bin
FC        2001 2024-06-16 09:49:47 66de5084 66de5084 test-data/dir-0001/file-0001.bin
DX         160 2024-06-16 09:49:48                   test-data/dir-0001
FX          85 2024-06-16 09:49:48          30842230 test-data/.gitignore
DX         192 2024-06-16 09:49:48                   test-data
FX          14 2024-06-15 16:11:50          2b56f46f requirements.txt
FX       19021 2024-06-16 09:44:16          95dff305 random_names_iq_scores.csv
FX         629 2024-06-15 16:11:50          1eecb98e generate_data.py
FX           8 2024-06-16 09:46:37          af1349b9 dr-iq-scores.csv
FX        2183 2024-06-16 09:46:46          35763b4e default-pipeline.json
FX       85102 2024-06-16 09:48:12          396cd3fd Readme.ipynb
FX       30380 2024-06-10 08:25:47          a4864bca Debug.ipynb
Total #: 22 Workspace Size:      156254 Cached Size:        6006

xvc.file().list() returns the list of all files and directories recursively. Its default formatting shows whether a file F is tracked and commit (C) and its hash values in the workspace and the cache. You can, for example, see files not tracked by processing the output string.

[l for l in xvc_test_data_list.split("\n") if l.startswith("FX")]
['FX         142 2024-06-15 16:11:50          2f4850fa visualize.py',
 'FX        2003 2024-06-16 09:49:47          41e16be7 test-data/dir-0003/file-0003.bin',
 'FX        2002 2024-06-16 09:49:47          27f0efd0 test-data/dir-0003/file-0002.bin',
 'FX        2001 2024-06-16 09:49:47          66de5084 test-data/dir-0003/file-0001.bin',
 'FX        2003 2024-06-16 09:49:47          41e16be7 test-data/dir-0002/file-0003.bin',
 'FX        2002 2024-06-16 09:49:47          27f0efd0 test-data/dir-0002/file-0002.bin',
 'FX        2001 2024-06-16 09:49:47          66de5084 test-data/dir-0002/file-0001.bin',
 'FX          85 2024-06-16 09:49:48          30842230 test-data/.gitignore',
 'FX          14 2024-06-15 16:11:50          2b56f46f requirements.txt',
 'FX       19021 2024-06-16 09:44:16          95dff305 random_names_iq_scores.csv',
 'FX         629 2024-06-15 16:11:50          1eecb98e generate_data.py',
 'FX           8 2024-06-16 09:46:37          af1349b9 dr-iq-scores.csv',
 'FX        2183 2024-06-16 09:46:46          35763b4e default-pipeline.json',
 'FX       85102 2024-06-16 09:48:12          396cd3fd Readme.ipynb',
 'FX       30380 2024-06-10 08:25:47          a4864bca Debug.ipynb']

It's also possible to specify the format string to file().list() with format argument. For example the following shows the size of all workspace files. You can see all format elements in the xvc file list reference.

print(xvc_test_data.file().list(format="{{name}}: {{asz}}"))
visualize.py:         142
test-data/dir-0003/file-0003.bin:        2003
test-data/dir-0003/file-0002.bin:        2002
test-data/dir-0003/file-0001.bin:        2001
test-data/dir-0003:         160
test-data/dir-0002/file-0003.bin:        2003
test-data/dir-0002/file-0002.bin:        2002
test-data/dir-0002/file-0001.bin:        2001
test-data/dir-0002:         160
test-data/dir-0001/file-0003.bin:        2003
test-data/dir-0001/file-0002.bin:        2002
test-data/dir-0001/file-0001.bin:        2001
test-data/dir-0001:         160
test-data/.gitignore:          85
test-data:         192
requirements.txt:          14
random_names_iq_scores.csv:       19021
generate_data.py:         629
dr-iq-scores.csv:           8
default-pipeline.json:        2183
Readme.ipynb:       85102
Debug.ipynb:       30380
Total #: 22 Workspace Size:      156254 Cached Size:        6006

By default Xvc adds files to the cache and copies them to the workspace. For data files that you only need to read, you can specify to use symlinks to track the files. This way, there will only one copy of the file in the cache and a read-only symlink will be placed in the workspace.

xvc_test_data.file().track("test-data/dir-0002/", recheck_method="symlink")
''
print(xvc_test_data.file().list("test-data/dir-0002"))
SS         131 2024-06-16 09:49:51 41e16be7          test-data/dir-0002/file-0003.bin
SS         131 2024-06-16 09:49:51 27f0efd0          test-data/dir-0002/file-0002.bin
SS         131 2024-06-16 09:49:51 66de5084          test-data/dir-0002/file-0001.bin
Total #: 3 Workspace Size:         393 Cached Size:        6006

Note that the workspace size and cached size are different. The listed workspace size is only the size of symbolic links. Also the initials (symlink as recheck type and symlink in the workspace) is different the files in test-data/dir-0001/

!ls -l test-data/dir-000[12]/
test-data/dir-0001/:
.rw-rw-rw- iex staff 2.0 KB Sun Jun 16 12:49:47 2024  file-0001.bin
.rw-rw-rw- iex staff 2.0 KB Sun Jun 16 12:49:47 2024  file-0002.bin
.rw-rw-rw- iex staff 2.0 KB Sun Jun 16 12:49:47 2024  file-0003.bin

test-data/dir-0002/:
lrwxr-xr-x iex staff 131 B Sun Jun 16 12:49:51 2024  file-0001.bin ⇒ /Users/iex/github.com/iesahin/xvc-notebooks/Readme/.xvc/b3/66d/e50/84c7bbab22aa25db35e7e28c16953764cc9036d1782a58444eb3b9394b/0.bin
lrwxr-xr-x iex staff 131 B Sun Jun 16 12:49:51 2024  file-0002.bin ⇒ /Users/iex/github.com/iesahin/xvc-notebooks/Readme/.xvc/b3/27f/0ef/d00cd3ab2c9d9cd753b0c3a3bd10b79bf0ffc7719fac44218e8130a071/0.bin
lrwxr-xr-x iex staff 131 B Sun Jun 16 12:49:51 2024  file-0003.bin ⇒ /Users/iex/github.com/iesahin/xvc-notebooks/Readme/.xvc/b3/41e/16b/e7387698c97b703e7607c649498594840cd8bf37295ef430de22ac72c1/0.bin

Xvc allows to track files with hardlinks and reflinks (where supported) as well. You can keep your model files as copies and data files as links. Files in the cache are read-only, so if you don't change their permissions manually, your code cannot write on them.

You can also change their recheck (checkout) methods after you add them to Xvc. So, if you change your mind and want to start to track test-data/dir-0001 as hardlinks, you can do so.

xvc_test_data.file().recheck("test-data/dir-0001/", recheck_method="hardlink")
''
print(xvc_test_data.file().list("test-data/dir-0001"))
FH        2003 2024-06-16 09:49:47 41e16be7 41e16be7 test-data/dir-0001/file-0003.bin
FH        2002 2024-06-16 09:49:47 27f0efd0 27f0efd0 test-data/dir-0001/file-0002.bin
FH        2001 2024-06-16 09:49:47 66de5084 66de5084 test-data/dir-0001/file-0001.bin
Total #: 3 Workspace Size:        6006 Cached Size:        6006

Note that the filesystem doesn't make a distinction between hardlinks and files, they all have the same attributes. So the file sizes are identical between in the workspace and the cache. In the second column you can see that these are tracked as Hardlinks by Xvc.

If the files are tracked by Xvc, they are also normally in the cache. You can delete them and get them back from the cache when they are required.

!rm -rf test-data/dir-0002/
!tree
.
├── Debug.ipynb
├── Readme.ipynb
├── default-pipeline.json
├── dr-iq-scores.csv
├── generate_data.py
├── random_names_iq_scores.csv
├── requirements.txt
├── test-data
│   ├── dir-0001
│   │   ├── file-0001.bin
│   │   ├── file-0002.bin
│   │   └── file-0003.bin
│   └── dir-0003
│       ├── file-0001.bin
│       ├── file-0002.bin
│       └── file-0003.bin
└── visualize.py

4 directories, 14 files
xvc_test_data.file().recheck("test-data/dir-0002/")
''
print(xvc_test_data.file().list("test-data/dir-0002"))
SS         131 2024-06-16 09:49:53 41e16be7          test-data/dir-0002/file-0003.bin
SS         131 2024-06-16 09:49:53 27f0efd0          test-data/dir-0002/file-0002.bin
SS         131 2024-06-16 09:49:53 66de5084          test-data/dir-0002/file-0001.bin
Total #: 3 Workspace Size:         393 Cached Size:        6006

Note that recheck (checkout) methods for files are also tracked and you don't have to specify them every time.

Sharing files

Xvc can copy cached files to S3 compatible servers like MinIO or Google Cloud Storage; locally accessible paths like NFS or external drives; SSH servers with Rsync and via external commands like rclone or s5cmd. For the purposes of this introduction, we are going to use S3.

Note that Xvc doesn't store any credentials and rely on environment variables or existing SSH connections.

Let's start by configuring the S3 connection that we'll use.

xvc_test_data.storage().new_s3(name="backup", bucket_name="xvc-test", region="eu-central-1", storage_prefix="xvc-storage")
''

You can now send the files you track to this storage.

xvc_test_data.file().send(storage="backup")
''

When you (or someone else) want to access these files later, you can clone the Git repository and get the files from the storage.

$ git clone https://example.com/my-machine-learning-project
Cloning into 'my-machine-learning-project'...
xvc_test_data.file().bring("test-data/", storage="backup")
''

You can also share the files with others for a period if your storage type supports signed URLs. The default period is 24 hours.

xvc_test_data.file().share("test-data/dir-0001/file-0001.bin", storage="backup", period="1h")
'https://xvc-test.s3.eu-central-1.amazonaws.com/xvc-test/xvc-storage/3c7f8146-df25-4c22-b0e6-bb20d600acaf/b3/66d/e50/84c7bbab22aa25db35e7e28c16953764cc9036d1782a58444eb3b9394b/0.bin?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOXNDTB55QE3GOEQ%2F20240616%2Feu-central-1%2Fs3%2Faws4_request&X-Amz-Date=20240616T094957Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=d82f59453c95abc9421bac2a12605373708b678c6a1e48d8b2734eb34bd22942'

This approach ensures convenient access to files from the shared storage when needed. You can share the URL with anyone and they will be able to download your file within this period.

You don't have to reconfigure the storage after cloning, but you need to have valid credentials as environment variables to access the storage.

Xvc never stores any credentials.

Pipelines

If you have commands that depend on data or code elements, you can configure a pipeline.

For this example, we'll use a Python script to generate a data set with random names with random IQ scores.

The script uses the Faker library and this library must be available where you run the pipeline. To make it repeatable, we start the pipeline by adding a step that creates the virtual environment and installs dependencies.

pipeline = xvc_test_data.pipeline()
pipeline.step().new(step_name="install-deps", command="python3 -m pip install -r requirements.txt")
''

We'll make install-deps step to depend on requirements.txt file, so when the requirements file changes it will make the step run.

pipeline.step().dependency(step_name="install-deps", file="requirements.txt")
''

Xvc allows to create dependencies between pipeline steps. Dependent steps wait for dependencies to finish successfully.

Now we create a step to run the script and make install-deps step a dependency of it.


pipeline.step().new(step_name="generate-data", command="python3 generate_data.py")
pipeline.step().dependency(step_name="generate-data", step="install-deps")
''

After you define the pipeline, you can run it by:

print(pipeline.run())
[OUT] [install-deps] Requirement already satisfied: faker in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from -r requirements.txt (line 1)) (25.8.0)
Requirement already satisfied: seaborn in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from -r requirements.txt (line 2)) (0.13.2)
Requirement already satisfied: python-dateutil>=2.4 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from faker->-r requirements.txt (line 1)) (2.9.0.post0)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from seaborn->-r requirements.txt (line 2)) (1.26.4)
Requirement already satisfied: pandas>=1.2 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from seaborn->-r requirements.txt (line 2)) (2.2.2)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from seaborn->-r requirements.txt (line 2)) (3.9.0)
Requirement already satisfied: contourpy>=1.0.1 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn->-r requirements.txt (line 2)) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn->-r requirements.txt (line 2)) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn->-r requirements.txt (line 2)) (4.53.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn->-r requirements.txt (line 2)) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn->-r requirements.txt (line 2)) (24.0)
Requirement already satisfied: pillow>=8 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn->-r requirements.txt (line 2)) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn->-r requirements.txt (line 2)) (3.1.2)
Requirement already satisfied: pytz>=2020.1 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from pandas>=1.2->seaborn->-r requirements.txt (line 2)) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from pandas>=1.2->seaborn->-r requirements.txt (line 2)) (2024.1)
Requirement already satisfied: six>=1.5 in /Users/iex/github.com/iesahin/xvc.py/.venv/lib/python3.12/site-packages (from python-dateutil>=2.4->faker->-r requirements.txt (line 1)) (1.16.0)
 [DONE] install-deps (python3 -m pip install -r requirements.txt)[OUT] [generate-data] CSV file generated successfully.
 [DONE] generate-data (python3 generate_data.py)

Xvc allows many kinds of dependencies, like files, groups of files and directories defined by globs, regular expression searches in files, line ranges in files, hyper-parameters defined in YAML, JSON or TOML files HTTP URLs, shell command outputs, and other steps.

🔅 The rest of this document introduces some of these dependencies. Please do not panic if you're not into regexes or hyperparameters.

Suppose you're only interested in the IQ scores of those with Dr. in front of their names and how they differ from the rest in the dataset we created. Let's create a regex search dependency to the data file that will show all doctors IQ scores.

pipeline.step().new(step_name="dr-iq", command='echo "${XVC_REGEX_ADDED_ITEMS}" >> dr-iq-scores.csv')
pipeline.step().dependency(step_name="dr-iq", regex_items=r'random_names_iq_scores.csv:/^Dr\..*')
''

The first line specifies a command, when run writes ${XVC_REGEX_ADDED_ITEMS} environment variable to dr-iq-scores.csv file. The second line specifies the dependency which will also populate the $[XVC_REGEX_ADDED_ITEMS] environment variable in the command.

Some dependency types like regex items, line items and glob items inject environment variables in the commands they are a dependency. For example, if you have two million files specified with a glob, but want to run a script only on the added files after the last run, you can use these environment variables.

When you run the pipeline again, a file named dr-iq-scores.csv will be created. Note that, as requirements.txt didn't change install-deps step and its dependent generate-data steps didn't run.

print(pipeline.run())
[DONE] dr-iq (echo "${XVC_REGEX_ADDED_ITEMS}" >> dr-iq-scores.csv)

We are using this feature to get lines starting with Dr. from the file and write them to another file. When the file changes, e.g. another record matching the dependency regex added to the random_names_iq_scores.csv file, it will also be added to dr-iq-scores.csv file.

!zsh -cl 'echo "Dr. Albert Einstein,144" >> random_names_iq_scores.csv'

print(pipeline.run())
[DONE] dr-iq (echo "${XVC_REGEX_ADDED_ITEMS}" >> dr-iq-scores.csv)

Now we want to add a another command that draws a fancy histogram from dr-iq-scores.csv. As this new step must wait dr-iq-scores.csv file to be ready, we'll define dr-iq-scores.csv as an output of dr-iq step and set the file as a dependency to this new visualize step.

pipeline.step().output(step_name="dr-iq", output_file="dr-iq-scores.csv")

pipeline.step().new(step_name="visualize", command='python3 visualize.py')
pipeline.step().dependency(step_name="visualize", file="dr-iq-scores.csv")

print(pipeline.run())
[DONE] dr-iq (echo "${XVC_REGEX_ADDED_ITEMS}" >> dr-iq-scores.csv)[ERROR] Step visualize finished UNSUCCESSFULLY with command python3 visualize.py

You can get the pipeline in Graphviz DOT or mermaid format to convert to an image.

print(pipeline.dag(format="mermaid"))
flowchart TD
    n0["install-deps"]
    n1["requirements.txt"] --> n0
    n2["generate-data"]
    n0["install-deps"] --> n2
    n3["dr-iq"]
    n4["visualize"]
    n5["dr-iq-scores.csv"] --> n4
flowchart TD
    n0["install-deps"]
    n1["requirements.txt"] --> n0
    n2["generate-data"]
    n0["install-deps"] --> n2
    n3["dr-iq"]
    n4["visualize"]
    n5["dr-iq-scores.csv"] --> n4

You can also export and import the pipeline to JSON to edit in your editor.

pipeline.export(file="default-pipeline.json")
''
!cat default-pipeline.json
───────┬────────────────────────────────────────────────────────────────────────
       │ File: default-pipeline.json
───────┼────────────────────────────────────────────────────────────────────────
   1   │ {
   2   │   "name": "default",
   3   │   "steps": [
   4   │     {
   5   │       "command": "python3 -m pip install -r requirements.txt",
   6   │       "dependencies": [
   7   │         {
   8   │           "File": {
   9   │             "content_digest": {
  10   │               "algorithm": "Blake3",
  11   │               "digest": [
  12   │                 43,
  13   │                 86,
  14   │                 244,
  15   │                 111,
  16   │                 13,
  17   │                 243,
  18   │                 28,
  19   │                 110,
  20   │                 140,
  21   │                 213,
  22   │                 105,
  23   │                 20,
  24   │                 239,
  25   │                 62,
  26   │                 73,
  27   │                 75,
  28   │                 13,
  29   │                 146,
  30   │                 82,
  31   │                 17,
  32   │                 148,
  33   │                 152,
  34   │                 66,
  35   │                 86,
  36   │                 154,
  37   │                 230,
  38   │                 154,
  39   │                 246,
  40   │                 213,
  41   │                 214,
  42   │                 40,
  43   │                 119
  44   │               ]
  45   │             },
  46   │             "path": "requirements.txt",
  47   │             "xvc_metadata": {
  48   │               "file_type": "File",
  49   │               "modified": {
  50   │                 "nanos_since_epoch": 431372139,
  51   │                 "secs_since_epoch": 1718467910
  52   │               },
  53   │               "size": 14
  54   │             }
  55   │           }
  56   │         }
  57   │       ],
  58   │       "invalidate": "ByDependencies",
  59   │       "name": "install-deps",
  60   │       "outputs": []
  61   │     },
  62   │     {
  63   │       "command": "python3 generate_data.py",
  64   │       "dependencies": [
  65   │         {
  66   │           "Step": {
  67   │             "name": "install-deps"
  68   │           }
  69   │         }
  70   │       ],
  71   │       "invalidate": "ByDependencies",
  72   │       "name": "generate-data",
  73   │       "outputs": []
  74   │     },
  75   │     {
  76   │       "command": "echo \"${XVC_REGEX_ADDED_ITEMS}\" >> dr-iq-scores.csv
       │ ",
  77   │       "dependencies": [],
  78   │       "invalidate": "ByDependencies",
  79   │       "name": "dr-iq",
  80   │       "outputs": []
  81   │     },
  82   │     {
  83   │       "command": "python3 visualize.py",
  84   │       "dependencies": [
  85   │         {
  86   │           "File": {
  87   │             "content_digest": null,
  88   │             "path": "dr-iq-scores.csv",
  89   │             "xvc_metadata": null
  90   │           }
  91   │         }
  92   │       ],
  93   │       "invalidate": "ByDependencies",
  94   │       "name": "visualize",
  95   │       "outputs": []
  96   │     }
  97   │   ],
  98   │   "version": 1,
  99   │   "workdir": ""
 100   │ }
───────┴────────────────────────────────────────────────────────────────────────

You can edit the file (in Jupyter lab or in your editor) to change commands, add new dependencies, etc. and import it back to Xvc with import_pipeline method. You can create copies of pipelines by either editing the pipeline name in the file or supplying the name with name argument.

🔅 import_pipeline is xvc pipeline import in command line. Renamed because import is a reserved word.

pipeline.import_pipeline(file="default-pipeline.json", name="another-pipeline")
''

Now you can list all pipelines and run another-pipeline if you want to.

print(pipeline.list())
+------------------+---------+
| Name             | Run Dir |
+============================+
| default          |         |
|------------------+---------|
| another-pipeline |         |
+------------------+---------+

Please create an issue or discussion for any other kinds of dependencies that you'd like to be included.

I'm planning to add data label and annotations tracking), experiments tracking), model tracking), encrypted cache, server to control all commands from a web interface, and more as my time permits.

Please check docs.xvc.dev for documentation.

🤟 Big Thanks

xvc stands on the following (giant) crates:

  • trycmd is used to run all example commands in this file, reference, and how-to documentation at every PR. It makes sure that the documentation is always up-to-date and shown commands work as described. We start development by writing documentation and implementing them thanks to trycmd.

  • serde allows all data structures to be stored in text files. Special thanks from xvc-ecs for serializing components in an ECS with a single line of code.

  • Xvc processes files in parallel with pipelines and parallel iterators thanks to crossbeam and rayon.

  • Thanks to strum, Xvc uses enums extensively and converts almost everything to typed values from strings.

  • Xvc has a deep CLI that has subcommands of subcommands (e.g. xvc storage new s3), and all these work with minimum bugs thanks to clap.

  • Xvc uses rust-s3 to connect to S3 and compatible storage services. It employs excellent tokio for fast async Rust. These cloud storage features can be turned off thanks to Rust conditional compilation.

  • Without implementations of BLAKE3, BLAKE2, SHA-2 and SHA-3 from Rust crypto crate, Xvc couldn't detect file changes so fast.

  • Many thanks to small and well built crates, reflink, relative-path, path-absolutize, glob for file system and glob handling.

  • Thanks to sad_machine for providing a State Machine implementation that I used in xvc pipeline run. A DAG composed of State Machines made running pipeline steps in parallel with a clean separation of process states.

  • Thanks to thiserror and anyhow for making error handling a breeze. These two crates make me feel I'm doing something good for the humanity when handling errors.

  • Xvc is split into many crates and owes this organization to cargo workspaces.

And, biggest thanks to Rust designers, developers and contributors. Although I can't see myself expert to appreciate it all, it's a fabulous language and environment to work with.

🚁 Support

  • You can use Discussions to ask questions. I'll answer as much as possible. Thank you.
  • I don't follow any other sites regularly. You can also reach me at emre@xvc.dev

👐 Contributing

  • Star this repo. I feel very happy for every star and send my best wishes to you. That's a certain win to spend your two seconds for me. Thanks.
  • Use xvc. Tell me how it works for you, read the documentation, report bugs, discuss features.
  • Please note that, I don't accept large code PRs. Please open an issue to discuss your idea and write/modify a reference page before sending a PR. I'm happy to discuss and help you to implement your idea. Also, it may require a copyright transfer to me, as there may be cases which I provide the code in other licenses.

📜 License

Xvc and these Python bindings are licensed under the GNU GPL 3.0 License. If you want to use the code in your project with other licenses, please contact me.

🌦️ Future and Maintenance

I'm using Xvc daily and I'm happy with it. Tracking all my files with Git via arbitrary servers and cloud providers is something I always need. I'm happy to improve and maintain it as long as I use it.

Given that I'm working on this for the last two years for pure technical bliss, you can expect me to work on it more.

⚠️ Disclaimer

This software is fresh and ambitious. Although I use it and test it close to real-world conditions, it didn't go under the test of time. Xvc can eat your files and spit them into the eternal void! Please take backups.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xvc-0.6.17.tar.gz (86.1 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

xvc-0.6.17-cp37-abi3-win_amd64.whl (7.3 MB view details)

Uploaded CPython 3.7+Windows x86-64

xvc-0.6.17-cp37-abi3-win32.whl (6.7 MB view details)

Uploaded CPython 3.7+Windows x86

xvc-0.6.17-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.9 MB view details)

Uploaded CPython 3.7+manylinux: glibc 2.17+ x86-64

xvc-0.6.17-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl (11.0 MB view details)

Uploaded CPython 3.7+manylinux: glibc 2.17+ ppc64le

xvc-0.6.17-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (10.7 MB view details)

Uploaded CPython 3.7+manylinux: glibc 2.17+ ARM64

xvc-0.6.17-cp37-abi3-macosx_11_0_arm64.whl (7.8 MB view details)

Uploaded CPython 3.7+macOS 11.0+ ARM64

xvc-0.6.17-cp37-abi3-macosx_10_12_x86_64.whl (8.1 MB view details)

Uploaded CPython 3.7+macOS 10.12+ x86-64

File details

Details for the file xvc-0.6.17.tar.gz.

File metadata

  • Download URL: xvc-0.6.17.tar.gz
  • Upload date:
  • Size: 86.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for xvc-0.6.17.tar.gz
Algorithm Hash digest
SHA256 0035cb2e02557227854e54c67e41f4ced416c5f6c6323f02210d7d6cc2e564e6
MD5 faaef18dcd3560224e3206a56eed8be9
BLAKE2b-256 8ad1e78946ccb745458186c0b066e338853c19a72b82fcdba8989f11b0b4b513

See more details on using hashes here.

File details

Details for the file xvc-0.6.17-cp37-abi3-win_amd64.whl.

File metadata

  • Download URL: xvc-0.6.17-cp37-abi3-win_amd64.whl
  • Upload date:
  • Size: 7.3 MB
  • Tags: CPython 3.7+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for xvc-0.6.17-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 c48bb29e6645cd478d72523f08c4137421a5e34e83d1f546440b8b8b89737b2f
MD5 19acafc8907982f243bbbbddd6892068
BLAKE2b-256 8a557c9282caa4ebae55cf19440d50e17343d3a9b3f27c724ab0ce8c9280aaeb

See more details on using hashes here.

File details

Details for the file xvc-0.6.17-cp37-abi3-win32.whl.

File metadata

  • Download URL: xvc-0.6.17-cp37-abi3-win32.whl
  • Upload date:
  • Size: 6.7 MB
  • Tags: CPython 3.7+, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.8.3

File hashes

Hashes for xvc-0.6.17-cp37-abi3-win32.whl
Algorithm Hash digest
SHA256 c1c89844b71af8f2d4b2c973b04422217b85ca6a9bab7666474eac1f64fb6148
MD5 846f43c16cceb371f153d50c44a0462d
BLAKE2b-256 7759686c2e9e890b0f53048a4867fb55e5defc552e80c1b47e4f635126f8692a

See more details on using hashes here.

File details

Details for the file xvc-0.6.17-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for xvc-0.6.17-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f83eb08fb1dd64b62dcf1b221a2a01dba6eea86e8f4204ddf3b23f582702d390
MD5 fa66ba06b404680429f1313c1850d439
BLAKE2b-256 be62400c01ee09b2b9abdfb359d50666e9eea18e3b7d3bb2d5afeb9f3301607a

See more details on using hashes here.

File details

Details for the file xvc-0.6.17-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl.

File metadata

File hashes

Hashes for xvc-0.6.17-cp37-abi3-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl
Algorithm Hash digest
SHA256 4d1b075abf8d88c774c1ffd9c7aba1f3010ba0b15df2ebce394e805400fd99d2
MD5 62ab06be5cbd5f1567cca794554cded5
BLAKE2b-256 22eaf74485b3b6e8669c203ccdcc9ce95172b51399ef5695d71a244db3fbec68

See more details on using hashes here.

File details

Details for the file xvc-0.6.17-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for xvc-0.6.17-cp37-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 42b3a215b21fc10888e7e0e26155f33aed9ca082791ea459815a787e86e25526
MD5 4693151307e1d16cdb6316bfe00c2444
BLAKE2b-256 7c368faab0bc732a70834df9547b850dce44505509bde927f710283b92686d7b

See more details on using hashes here.

File details

Details for the file xvc-0.6.17-cp37-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for xvc-0.6.17-cp37-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 888f6e3bbaacf7a35912961136864fc9e8fcb800d56e6a1ee567fb7995ffee8c
MD5 4b764bfa9576e207c973ed7d1d727e7b
BLAKE2b-256 0bf1b49669955186736641532194e7d7483090f070ef10a4f0b48256731a5791

See more details on using hashes here.

File details

Details for the file xvc-0.6.17-cp37-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for xvc-0.6.17-cp37-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 3bb521d81eaf920608ef20a7d2b46727db14488741164982cd05c5644079b559
MD5 120735de45fe367d679cbfccd420aea1
BLAKE2b-256 20a5285463caf125bf696e8baa35efa18426b444af0b5d0a573a0d7a55006c60

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page