articat: data artifact catalog
Project description
articat
Minimal metadata catalog to store and retrieve metadata about data artifacts.
Getting started
At a high level, articat is simply a key-value store. Value being the Artifact metadata. Key a.k.a. "Artifact Spec" being:
- globally unique
id
- optional timestamp:
partition
- optional arbitrary string:
version
To publish a file system Artifact (FSArtifact
):
from articat import FSArtifact
from pathlib import Path
from datetime import date
# Apart from being a metadata containers, Artifact classes have optional
# convenience methods to help in data publishing flow:
with FSArtifact.partitioned("foo", partition=date(1643, 1, 4)) as fsa:
# To create a new Artifact, always use `with` statement, and
# either `partitioned` or `versioned` methods. Use:
# * `partitioned(...)`, for Artifacts with explicit `datetime` partition
# * `versioned(...)`, for Artifacts with explicit `str` version
# Next we produce some local data, this could be a Spark job,
# ML model etc.
data_path = Path("/tmp/data")
data_path.write_text("42")
# Now let's stage that data, temporary and final data directories/buckets
# are configurable (see below)
fsa.stage(data_path)
# Additionally let's provide some description, here we could also
# save some extra arbitrary metadata like model metrics, hyperparameters etc.
fsa.metadata.description = "Answer to the Ultimate Question of Life, the Universe, and Everything"
To retrieve the metadata about the Artifact above:
from articat.fs_artifact import FSArtifact
from datetime import date
from pathlib import Path
# To retrieve the metadata, use Artifact object, and `fetch` method:
fsa = FSArtifact.partitioned("foo", partition=date(1643, 1, 4)).fetch()
fsa.id # "foo"
fsa.created # <CREATION-TIMESTAMP>
fsa.partition # <CREATION-TIMESTAMP>
fsa.metadata.description # "Answer to the Ultimate Question of Life, the Universe, and Everything"
fsa.main_dir # Data directory, this is where the data was stored after staging
Path(fsa.joinpath("data")).read_text() # 42
Features
- store and retrieve metadata about your data artifacts
- no long running services (low maintenance)
- data publishing utils builtin
- IO/data format agnostic
- immutable metadata
- development mode
Artifact flavours
Currently available Artifact flavours:
FSArtifact
: metadata/utils for files or objects (supports: local FS, GCS, S3 and more)BQArtifact
: metadata/utils for BigQuery tablesNotebookArtifact
: metadata/utils for Jupyter Notebooks
Development mode
To ease development of Artifacts, articat supports development/dev mode.
Development Artifact can be indicated by dev
parameter (preferred), or
_dev
prefix in the Artifact id
. Dev mode supports:
- overwriting Artifact metadata
- configure separate locations (e.g.
dev_prefix
forFSArtifact
), with potentially different retention periods etc
Backend
local
: mostly for testing/demo, metadata is stored locally (configurable, default:~/.config/articat/local
)gcp_datastore
: metadata is stored in the Google Cloud Datastore
Configuration
articat configuration can be provided in the API, or configuration files. By default configuration
is loaded from ~/.config/articat/articat.cfg
and articat.cfg
in current working directory. You
can also point at the configuration file via environment variable ARTICAT_CONFIG
.
You use local
mode without configuration file. Available options:
[main]
# local or gcp_datastore, default: local
# mode =
# local DB directory, default: ~/.config/articat/local
# local_db_dir =
[fs]
# temporary directory/prefix
# tmp_prefix =
# development data directory/prefix
# dev_prefix =
# production data directory/prefix
# prod_prefix =
[gcp]
# GCP project
# project =
[bq]
# development data BigQuery dataset
# dev_dataset =
# production data BigQuery dataset
# prod_dataset =
Our/example setup
Below you can see a diagram of our setup, Articat is just one piece of our system, and solves a specific problem. This should give you an idea where it might fit into your environment:
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file articat-0.1.16.tar.gz
.
File metadata
- Download URL: articat-0.1.16.tar.gz
- Upload date:
- Size: 43.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b6480cc1c0d41bf72b7225737ffe0cdebad0bae7349b531deba14f2b2a0b7e83 |
|
MD5 | 77375ea9a603b0ab5fcc71d4a98bfc6e |
|
BLAKE2b-256 | ba16bb49e6fc33b3b4e04a965cd2fe7beb519d1b7cb36caef383dbb58c7b20d0 |
File details
Details for the file articat-0.1.16-py3-none-any.whl
.
File metadata
- Download URL: articat-0.1.16-py3-none-any.whl
- Upload date:
- Size: 52.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | a56f68685f56515162e8dbfb1e08d628d55b8980979e5461dbbf0830913049f9 |
|
MD5 | 000bbc5101e079eadfb4debad63103f7 |
|
BLAKE2b-256 | fa7b00700d6438ad9a97c0b0ea3cfac5fc8fe16578395a19444baa1c00da4cab |