A metadata management package based on filesystem mirroring.
fsmirror
Installation
pip install fsmirror
Functionality
fsmirror mirrors project filesystems for metadata tracking. It can be useful to have a direct path mapping between the code that generates data and the location, in a filesystem or object store, of the data and artifacts that code produces.
Example
Code lives at:
project/etl/my_etl_task.py::LiftDataTask
The associated fsmirror output path:
project/etl/my_etl_task/LiftDataTask/out.parquet
The associated fsmirror S3 output path:
s3://my.bucket/project/etl/my_etl_task/LiftDataTask/out.parquet
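The mapping in the example above can be sketched in a few lines of Python. This only illustrates the path-mirroring idea and is not fsmirror's implementation; `mirror_path` and its parameters are hypothetical names chosen for this sketch.

```python
from pathlib import PurePosixPath


def mirror_path(source_file, object_name, output_name="out",
                output_format="parquet", base=""):
    """Map a code location to a mirrored artifact path (hypothetical helper)."""
    # Drop the ".py" suffix so the module becomes a directory component.
    module_dir = PurePosixPath(source_file).with_suffix("")
    relative = module_dir / object_name / f"{output_name}.{output_format}"
    # Prepend an optional base such as "s3://my.bucket".
    return f"{base.rstrip('/')}/{relative}" if base else str(relative)


local_path = mirror_path("project/etl/my_etl_task.py", "LiftDataTask")
s3_path = mirror_path("project/etl/my_etl_task.py", "LiftDataTask",
                      base="s3://my.bucket")
print(local_path)
print(s3_path)
```

Keeping the base separate from the relative path means the same code location yields the same relative layout on a local disk and in an object store.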
Usage
- Create a configuration file like the one in examples/example_config.yml
- Set the config path:
export FSMIRROR_CONFIG_PATH=/your/project/path/config.yml
The config file should look like the example:
# artifacts
storage:
  # local, s3, gcs, blob
  provider: s3
  # root file path, bucket, etc.
  tenant: test.bucket
  # prefix - if 'MIRROR' will mirror filesystem
  namespace: MIRROR

# Each mirror should be a subdirectory
# within your project for example your
# orchestrator codebase lives at the
# following path:
#
# /opt/orchestrator
#
# To mirror this subdirectory we would
# add an "orchestrator" mirror as is
# done below
mirrors:
  fsmirror:
    # directory or subdirectory to split on
    root: fsmirror
    prefix: MIRROR
    output_name: out
    output_format: parquet
  aipipeline:
    root: aipipeline
    prefix: MIRROR
    output_name: out
    output_format: pkl
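Before handing a config to fsmirror, it can help to check that the parsed structure has the shape shown above. The sketch below is not part of fsmirror: `check_config` is a hypothetical helper, and the required keys are assumed to be exactly those that appear in the example file.

```python
# Assumption: these key sets are taken from the example config, not from fsmirror itself.
REQUIRED_STORAGE_KEYS = {"provider", "tenant", "namespace"}
REQUIRED_MIRROR_KEYS = {"root", "prefix", "output_name", "output_format"}


def check_config(config):
    """Return a list of problems found in a parsed config dict (hypothetical helper)."""
    problems = []
    missing = REQUIRED_STORAGE_KEYS - set(config.get("storage", {}))
    if missing:
        problems.append(f"storage is missing: {sorted(missing)}")
    mirrors = config.get("mirrors", {})
    if not mirrors:
        problems.append("no mirrors defined")
    for name, mirror in mirrors.items():
        missing = REQUIRED_MIRROR_KEYS - set(mirror)
        if missing:
            problems.append(f"mirror '{name}' is missing: {sorted(missing)}")
    return problems


example = {
    "storage": {"provider": "s3", "tenant": "test.bucket", "namespace": "MIRROR"},
    "mirrors": {
        "fsmirror": {"root": "fsmirror", "prefix": "MIRROR",
                     "output_name": "out", "output_format": "parquet"},
    },
}
print(check_config(example))  # an empty list means no problems were found
```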
Use fsmirror to manage where artifacts are stored. The following pseudocode shows how it is intended to be used:
>>> from test_mirror import SomeTask, some_task
>>> from fsmirror import FSMirror, load_config
>>> load_config()
{'storage': {'provider': 's3', 'tenant': 'test.bucket', 'namespace': 'MIRROR'}, 'mirrors': {'fsmirror': {'root': 'fsmirror', 'prefix': 'MIRROR', 'output_name': 'out', 'output_format': 'parquet'}, 'aipipeline': {'root': 'aipipeline', 'prefix': 'MIRROR', 'output_name': 'out', 'output_format': 'pkl'}}}
>>> config = load_config()
>>> fm = FSMirror(config=config, mirror='fsmirror')
>>> fm.mirror_relative(some_task)
'fsmirror/tests/test_mirror/20240227160221/some_task'
>>> fm.mirror_relative(some_task, with_id=False)
'fsmirror/tests/test_mirror/some_task'
>>> fm.mirror_full(some_task)
's3://test.bucket/fsmirror/tests/test_mirror/20240227160221/some_task'
>>> fm.mirror_full_output(some_task)
's3://test.bucket/fsmirror/tests/test_mirror/20240227160221/some_task/out.parquet'
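The four results above share a common structure: a relative path, an optional run identifier, a storage prefix, and an output file name. The sketch below rebuilds those strings from their parts to make that structure explicit. It is not fsmirror's code: `build_paths` is a hypothetical function, and reading `20240227160221` as a `%Y%m%d%H%M%S` timestamp is an assumption based on its shape.

```python
from datetime import datetime


def build_paths(tenant, module_path, name, run_time,
                output_name="out", output_format="parquet", scheme="s3"):
    """Rebuild the example path strings from their parts (hypothetical helper)."""
    # Assumption: the run id is a second-resolution timestamp.
    run_id = run_time.strftime("%Y%m%d%H%M%S")
    relative = f"{module_path}/{run_id}/{name}"
    full = f"{scheme}://{tenant}/{relative}"
    return {
        "relative": relative,
        "relative_no_id": f"{module_path}/{name}",
        "full": full,
        "full_output": f"{full}/{output_name}.{output_format}",
    }


paths = build_paths("test.bucket", "fsmirror/tests/test_mirror", "some_task",
                    datetime(2024, 2, 27, 16, 2, 21))
for key, value in paths.items():
    print(key, value)
```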