A metadata management package based on filesystem mirroring.

fsmirror

Installation

pip install fsmirror

Functionality

Mirror project filesystems for metadata tracking. It can be useful to have a direct path mirror between the code that generates data and the location, in a filesystem or object store, where the data and artifacts it generates are stored.

Example

code lives at:
project/etl/my_etl_task.py::LiftDataTask
fsmirror output for associated:
project/etl/my_etl_task/LiftDataTask/out.parquet
fsmirror s3 output for associated:
s3://my.bucket/project/etl/my_etl_task/LiftDataTask/out.parquet
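
The mapping above can be sketched in a few lines of Python. This is only an illustration of the idea, not fsmirror's implementation: the helper name `mirror_paths` and its parameters are hypothetical, and it assumes each task gets its own directory holding a single output file.

```python
from pathlib import PurePosixPath


def mirror_paths(code_ref, bucket, output_name="out", output_format="parquet"):
    """Map 'path/to/module.py::ObjectName' to local and S3 mirror paths.

    Hypothetical helper for illustration; not part of the fsmirror API.
    """
    file_part, _, object_name = code_ref.partition("::")
    # Drop the ".py" suffix so the module becomes a directory in the mirror.
    module_dir = PurePosixPath(file_part).with_suffix("")
    local = module_dir / object_name / f"{output_name}.{output_format}"
    return str(local), f"s3://{bucket}/{local}"


local_path, s3_path = mirror_paths(
    "project/etl/my_etl_task.py::LiftDataTask", bucket="my.bucket"
)
print(local_path)
print(s3_path)
```

Because the output location is derived purely from where the code lives, moving or renaming a task changes where its artifacts land, which is the property that makes the mirror usable for metadata tracking.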

Usage

  • Create a configuration file like the one in examples/example_config.yml
  • Set the config path:
export FSMIRROR_CONFIG_PATH=/your/project/path/config.yml

The config file should look like the example:

# artifacts
storage:
  # local, s3, gcs, blob
  provider: s3
  # root file path, bucket, etc.
  tenant: test.bucket
  # prefix - if 'MIRROR' will mirror filesystem
  namespace: MIRROR


# Each mirror should be a subdirectory
# within your project. For example, if
# your orchestrator codebase lives at
#
# /opt/orchestrator
#
# you would mirror it by adding an
# "orchestrator" mirror, in the same way
# as the mirrors defined below.
mirrors:
  fsmirror:
    # directory or subdirectory to split on
    root: fsmirror
    prefix: MIRROR
    output_name: out
    output_format: parquet

  aipipeline:
    root: aipipeline
    prefix: MIRROR
    output_name: out
    output_format: pkl
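
Once parsed, a config like the one above is a nested dictionary with a `storage` section and one entry per mirror. The following is a minimal sketch of a sanity check on that dictionary. The helper `check_config` is hypothetical and not part of fsmirror, and treating every key shown in the example as required is an assumption.

```python
# Assumption: every key that appears in the example config is required.
REQUIRED_STORAGE_KEYS = {"provider", "tenant", "namespace"}
REQUIRED_MIRROR_KEYS = {"root", "prefix", "output_name", "output_format"}
# Providers listed in the example config's comment.
KNOWN_PROVIDERS = {"local", "s3", "gcs", "blob"}


def check_config(config):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration; not part of the fsmirror API.
    """
    problems = []
    storage = config.get("storage", {})
    for key in sorted(REQUIRED_STORAGE_KEYS - storage.keys()):
        problems.append(f"storage: missing '{key}'")
    provider = storage.get("provider")
    if provider is not None and provider not in KNOWN_PROVIDERS:
        problems.append(f"storage: unknown provider '{provider}'")
    mirrors = config.get("mirrors", {})
    if not mirrors:
        problems.append("mirrors: at least one mirror is needed")
    for name, mirror in mirrors.items():
        for key in sorted(REQUIRED_MIRROR_KEYS - mirror.keys()):
            problems.append(f"mirrors.{name}: missing '{key}'")
    return problems


example_config = {
    "storage": {"provider": "s3", "tenant": "test.bucket", "namespace": "MIRROR"},
    "mirrors": {
        "fsmirror": {"root": "fsmirror", "prefix": "MIRROR",
                     "output_name": "out", "output_format": "parquet"},
        "aipipeline": {"root": "aipipeline", "prefix": "MIRROR",
                       "output_name": "out", "output_format": "pkl"},
    },
}
broken_config = {"storage": {"provider": "ftp"}, "mirrors": {"etl": {"root": "etl"}}}

print(check_config(example_config))
print(check_config(broken_config))
```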

Use fsmirror to manage where artifacts are stored. The following pseudocode is an example of how it should be used:

>>> from test_mirror import SomeTask, some_task
>>> from fsmirror import FSMirror, load_config
>>> load_config()
{'storage': {'provider': 's3', 'tenant': 'test.bucket', 'namespace': 'MIRROR'}, 'mirrors': {'fsmirror': {'root': 'fsmirror', 'prefix': 'MIRROR', 'output_name': 'out', 'output_format': 'parquet'}, 'aipipeline': {'root': 'aipipeline', 'prefix': 'MIRROR', 'output_name': 'out', 'output_format': 'pkl'}}}
>>> config = load_config()
>>> fm = FSMirror(config=config, mirror='fsmirror')
>>> fm.mirror_relative(some_task)
'fsmirror/tests/test_mirror/20240227160221/some_task'
>>> fm.mirror_relative(some_task, with_id=False)
'fsmirror/tests/test_mirror/some_task'
>>> fm.mirror_full(some_task)
's3://test.bucket/fsmirror/tests/test_mirror/20240227160221/some_task'
>>> fm.mirror_full_output(some_task)
's3://test.bucket/fsmirror/tests/test_mirror/20240227160221/some_task/out.parquet'
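
To make the shape of these paths concrete, here is a rough sketch of how such strings could be assembled from a source file location. It is not fsmirror's actual code: the functions `relative_mirror`, `new_run_id`, and `full_mirror` are hypothetical, the example source path `/opt/fsmirror/tests/test_mirror.py` is made up, and the `%Y%m%d%H%M%S` timestamp format is only inferred from the id `20240227160221` in the output above.

```python
from datetime import datetime
from pathlib import PurePosixPath


def new_run_id(now=None):
    """Timestamp id; the format is inferred from the example output."""
    return (now or datetime.now()).strftime("%Y%m%d%H%M%S")


def relative_mirror(source_file, root, name, run_id=None):
    """Build '<root>/.../<module>[/<run_id>]/<name>' from a source file path.

    Assumption: the path is split at the first directory named `root`.
    """
    parts = PurePosixPath(source_file).with_suffix("").parts
    if root not in parts:
        raise ValueError(f"{root!r} is not a directory in {source_file!r}")
    mirrored = list(parts[parts.index(root):])
    if run_id is not None:
        mirrored.append(run_id)
    mirrored.append(name)
    return "/".join(mirrored)


def full_mirror(provider, tenant, relative):
    """Prefix a relative mirror path with the storage provider and tenant."""
    return f"{provider}://{tenant}/{relative}"


run_id = new_run_id(datetime(2024, 2, 27, 16, 2, 21))
rel = relative_mirror("/opt/fsmirror/tests/test_mirror.py", "fsmirror",
                      "some_task", run_id=run_id)
rel_no_id = relative_mirror("/opt/fsmirror/tests/test_mirror.py", "fsmirror",
                            "some_task")
full = full_mirror("s3", "test.bucket", rel)
print(rel)
print(rel_no_id)
print(full + "/out.parquet")
```

Keeping the run id as its own path segment means that outputs from separate runs of the same task sit side by side under the task's module directory, while omitting it gives a stable location that later runs overwrite.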


