SDK and CLI for capturing data-science lineage and persisting DAG snapshots to Walacor.

Maintainers

captrus


Walacor Data Tracking

A schema-first framework to track, version, and store the full lineage of data transformations — from raw ingestion to final model output — using Walacor as a backend snapshot store.

✨ Why this exists

Reproducibility – Every transformation, parameter, and artifact is captured in a graph you can replay.
Auditability – Snapshots are immutable, version-controlled, and timestamped.
Collaboration – Team members see the same lineage and can compare or branch workflows.
Extensibility – Strict JSON-schemas keep today’s pipelines clean while allowing tomorrow’s to evolve safely.

🏗️ Core Concepts

| Concept | Stored as | Purpose |
| --- | --- | --- |
| Transform Node | `transform_node` | One operation (e.g., “fit model”, “clean text”). |
| Transform Edge | `transform_edge` | Dependency between two nodes. |
| Project Metadata | `project_metadata` | Run-level info (owner, description, timestamps). |

Immutable Snapshots: once a DAG is written to Walacor, it cannot be changed; only a new snapshot (with a higher SV or run ID) can supersede it.
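
To make the three concepts and the immutability rule concrete, here is a minimal sketch using frozen dataclasses. It is an illustration only, not the package's actual schema: the class names, the field names (`node_id`, `params`, `source`, `target`, `version`), and the `supersede` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class TransformNode:
    """One operation in the DAG (field names are illustrative)."""
    node_id: str
    operation: str
    params: tuple = ()


@dataclass(frozen=True)
class TransformEdge:
    """Dependency: `target` consumes the output of `source`."""
    source: str
    target: str


@dataclass(frozen=True)
class Snapshot:
    """An immutable DAG snapshot; a change yields a new, higher version."""
    version: int
    nodes: tuple
    edges: tuple

    def supersede(self, nodes, edges):
        # Never mutate in place: return a new snapshot with version + 1.
        return Snapshot(self.version + 1, tuple(nodes), tuple(edges))


clean = TransformNode("n1", "clean text")
fit = TransformNode("n2", "fit model", params=(("alpha", 0.1),))
v1 = Snapshot(1, (clean, fit), (TransformEdge("n1", "n2"),))

try:
    v1.version = 2              # frozen dataclass: assignment is rejected
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# Extending the DAG produces a second snapshot and leaves v1 untouched.
v2 = v1.supersede(
    v1.nodes + (TransformNode("n3", "evaluate"),),
    v1.edges + (TransformEdge("n2", "n3"),),
)
```

The point of the sketch is the shape of the rule: an existing snapshot rejects writes, and any change is expressed as a new snapshot with a higher version number.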

🚀 Getting Started

1. Install the SDKs

pip install walacor-data-tracker

Make sure you're using Python 3.10+ and have internet access to reach the Walacor API.

2. Initialize the Tracking Components

To begin capturing your data lineage:

Start the Tracker – This manages the session and records operations.
Attach an Adapter – For example, use PandasAdapter to automatically track DataFrame transformations.
Add Writers – Choose where to send the output:
- Console output for quick inspection
- WalacorWriter to persist snapshots to the Walacor backend

Once set up, your transformation history will be automatically recorded and can be exported or persisted.

🧪 Example Use Cases

Track changes in a machine learning pipeline
Visualize column-level transformations in pandas
Record versions of a dataset as it’s cleaned and merged
Keep an auditable log of automated workflows

🧪 Minimal Example

Here's how simple it is to start tracking transformations:

import pandas as pd
from walacor_data_tracker import Tracker, PandasAdapter
from walacor_data_tracker.writers import ConsoleWriter
from walacor_data_tracker.writers.walacor import WalacorWriter

# 1️⃣  Start tracking
tracker = Tracker().start()
PandasAdapter().start(tracker)        # auto-captures every DataFrame op
ConsoleWriter()                       # (optional) print lineage to stdout

# 2️⃣  Open a Walacor run in one line
wal_writer = WalacorWriter(
    "https://your-walacor-url/api",    # server
    "your-username",                   # login
    "your-password",
    project_name="MyProject",
    pipeline_name="daily_sales_pipeline",   # ⇢ opens a new run right away
)

# 3️⃣  Do your normal pandas work
df = pd.DataFrame({"id": [1, 2], "value": [100, 200]})
df2 = df.assign(double=df.value * 2)
df3 = df2.rename(columns={"value": "v"})

# 4️⃣  Finish the run and stop tracking
wal_writer.close(status="finished")   # marks the run "finished" in Walacor
tracker.stop()

print("Walacor run UID:", wal_writer._run_uid)   # UID of the run you just wrote

💡 The PandasAdapter automatically tracks operations like .assign(), .rename(), .merge(), etc., so you can work with pandas as usual — but with versioned lineage behind the scenes.
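
In real use you will probably not want the server URL and credentials written into the script. One possible approach, sketched below, is to read them from environment variables. The helper `walacor_settings_from_env` and the variable names `WALACOR_URL`, `WALACOR_USERNAME`, and `WALACOR_PASSWORD` are hypothetical names chosen for this example; the package does not define them.

```python
import os


def walacor_settings_from_env(env=os.environ):
    """Collect connection settings from (hypothetical) environment variables."""
    required = ("WALACOR_URL", "WALACOR_USERNAME", "WALACOR_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of a confusing login error.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in required)


# Usage with the writer from the example above (not executed here):
# server, username, password = walacor_settings_from_env()
# wal_writer = WalacorWriter(server, username, password,
#                            project_name="MyProject",
#                            pipeline_name="daily_sales_pipeline")
```

Passing a plain dict as `env` makes the helper easy to exercise in tests without touching the real environment.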

🛠️ Pandas operations automatically tracked

The current release wraps the pandas DataFrame API methods below. Whenever you call any of them, a `transform_node` is emitted, parameters are captured, and lineage is updated, with no extra code required:

| Category | Supported `DataFrame` methods |
| --- | --- |
| Structural copies / reshaping | `copy`, `reset_index`, `set_axis`, `pivot_table`, `melt`, `explode` |
| Column creation / update | `assign`, `insert`, `__setitem__` (`df["col"] = …`) |
| Cleaning & NA handling | `fillna`, `dropna`, `replace` |
| Column rename / re-order | `rename`, `reindex`, `sort_values` |
| Joins & merges | `merge`, `join` |
| Type & dtype changes | `astype` |

ℹ️ These map directly to the constant in PandasAdapter:

_DF_METHODS = [
    "copy", "pivot_table", "reset_index", "__setitem__",
    "fillna", "dropna", "replace", "rename", "assign",
    "merge", "join", "set_axis", "insert", "astype",
    "sort_values", "reindex", "explode", "melt",
]

Missing your favourite method?

Pull requests are welcome! Add the method name to _DF_METHODS, ensure the wrapper captures a meaningful snapshot, and open a PR. We’ll review and merge updates that keep to the schema-first philosophy.
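
If you are wondering what "wrapping a method" means in practice, the general technique can be sketched as follows. This is not the code in `PandasAdapter`; it is a simplified illustration in which `Table`, `TRACKED_METHODS`, `events`, and `track` are all made-up names, and a tiny stand-in class replaces the DataFrame so the sketch runs without pandas.

```python
import functools

TRACKED_METHODS = ["rename", "assign"]   # stand-in for a list like _DF_METHODS
events = []                              # captured operation records


class Table:
    """Tiny stand-in for a DataFrame so the sketch runs without pandas."""

    def __init__(self, columns):
        self.columns = list(columns)

    def rename(self, columns):
        return Table(columns.get(c, c) for c in self.columns)

    def assign(self, **new_columns):
        return Table(self.columns + list(new_columns))


def track(cls, method_name):
    """Replace cls.method_name with a wrapper that records each call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        # Record the operation name and its parameters after it succeeds.
        events.append({"operation": method_name, "args": args, "kwargs": kwargs})
        return result

    setattr(cls, method_name, wrapper)


for name in TRACKED_METHODS:
    track(Table, name)

t = Table(["id", "value"]).assign(double=None).rename({"value": "v"})
```

After this runs, `events` holds one record per tracked call, in call order. Supporting another method in a design like this amounts to adding its name to the list, which is why the contribution path above is short.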

🔍 Helper API — query your lineage

| Helper | Purpose | Key parameters | Returns |
| --- | --- | --- | --- |
| `get_projects()` | List every Walacor-tracked project. | (none) | `[{uid, project_name, description, user_tag}]` |
| `get_pipelines()` | List the names of all pipelines ever executed (across projects). | (none) | `["daily_etl", "train_model", ...]` |
| `get_pipelines_for_project(project_name, *, user_tag=None)` | Pipelines that belong to one project. | `project_name`: required; `user_tag`: optional filter, useful when runs come from multiple laptops/branches | `["sales_pipeline", …]` |
| `get_runs(project_name, *, pipeline_name=None, user_tag=None)` | History of executions (“runs”). | `project_name`: required; `pipeline_name`: limit to one pipeline; `user_tag`: optional | `[{"UID","status","pipeline_name",…}, …]` |
| `get_nodes(project_name, *, pipeline_name=None, run_uid=None, user_tag=None)` | Raw `transform_node` rows (operations). | Same filters as above; pick one of `pipeline_name` or `run_uid`. Omitting both returns every node in the project. | List of node dicts with `operation`, `shape`, `params_json`, … |
| `get_dag(project_name, *, pipeline_name=None, run_uid=None, user_tag=None)` | Convenient “everything I need for a graph”. | Same filter rules. | `{"nodes": [...], "edges": [...]}` where edges come from `transform_edge`. |
| `get_projects_with_pipelines()` | High-level catalogue: each project, its pipelines and run-counts. | (none) | `[ { "project_name": "Proj", "pipelines":[{"name":"etl","runs":7}] }, … ]` |

Parameter rules at a glance

| Filter combo | What you get |
| --- | --- |
| `project_name` only | all nodes / all edges in the project |
| `project_name + pipeline_name` | all runs & nodes for that pipeline |
| `project_name + run_uid` | nodes/edges of one specific run |
| `user_tag` | optional extra filter on any of the above |
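
The filter rules can be pictured as a plain function over a list of node dicts. The sketch below is an in-memory illustration, not how the helpers query Walacor: `select_nodes` is a hypothetical name, and the dict keys `project_name`, `pipeline_name`, `run_uid`, and `user_tag` are assumed for the example.

```python
def select_nodes(nodes, project_name, *, pipeline_name=None, run_uid=None, user_tag=None):
    """Apply the filter combinations from the table to a list of node dicts."""
    selected = []
    for node in nodes:
        if node["project_name"] != project_name:
            continue                      # project_name is always required
        if run_uid is not None and node["run_uid"] != run_uid:
            continue                      # narrow to one specific run
        if pipeline_name is not None and node["pipeline_name"] != pipeline_name:
            continue                      # narrow to one pipeline
        if user_tag is not None and node.get("user_tag") != user_tag:
            continue                      # optional extra filter
        selected.append(node)
    return selected


nodes = [
    {"project_name": "ML_Proj", "pipeline_name": "train_model", "run_uid": "r1", "operation": "fillna"},
    {"project_name": "ML_Proj", "pipeline_name": "train_model", "run_uid": "r2", "operation": "merge"},
    {"project_name": "ML_Proj", "pipeline_name": "daily_etl", "run_uid": "r3", "operation": "astype"},
    {"project_name": "Other", "pipeline_name": "daily_etl", "run_uid": "r4", "operation": "copy"},
]

whole_project = select_nodes(nodes, "ML_Proj")
one_pipeline = select_nodes(nodes, "ML_Proj", pipeline_name="train_model")
one_run = select_nodes(nodes, "ML_Proj", run_uid="r2")
```

With this sample data, `whole_project` has three nodes, `one_pipeline` has two, and `one_run` has one.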

Example usage

# 1️⃣ list all runs of “train_model” in “ML_Proj”
runs = wal_writer.get_runs("ML_Proj", pipeline_name="train_model")
first_run = runs[0]["UID"]

# 2️⃣ pull the DAG for that first run
dag = wal_writer.get_dag("ML_Proj", run_uid=first_run)

# 3️⃣ quick print
for n in dag["nodes"]:
    print(n["operation"], n["shape"])

These helpers leverage the official Walacor Python SDK, so every call hits Walacor’s fast summary view and transparently re-uses the writer’s authenticated session—no extra login or handshake required.
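
Once you have a `{"nodes": [...], "edges": [...]}` dict, a common next step is to list the operations in dependency order. The sketch below does that with the standard library's `graphlib`. The sample data is invented, and the key names `UID`, `parent`, and `child` are assumptions; check the actual keys in your `transform_node` and `transform_edge` rows before adapting it.

```python
from graphlib import TopologicalSorter

# Invented sample in the {"nodes": [...], "edges": [...]} shape.
dag = {
    "nodes": [
        {"UID": "a", "operation": "DataFrame"},
        {"UID": "b", "operation": "assign"},
        {"UID": "c", "operation": "rename"},
    ],
    "edges": [
        {"parent": "a", "child": "b"},
        {"parent": "b", "child": "c"},
    ],
}


def execution_order(dag):
    """Return operation names so that every parent precedes its children."""
    sorter = TopologicalSorter({node["UID"]: set() for node in dag["nodes"]})
    for edge in dag["edges"]:
        sorter.add(edge["child"], edge["parent"])   # child depends on parent
    by_uid = {node["UID"]: node for node in dag["nodes"]}
    return [by_uid[uid]["operation"] for uid in sorter.static_order()]


order = execution_order(dag)
```

`graphlib` raises `CycleError` if the edges contain a cycle, which doubles as a cheap sanity check that the data really is a DAG.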

🤝 Contributing

Fork → feature branch → PR.
Run `pre-commit run --all-files`.
Add/Update unit tests and schema definitions.
Keep the README & docs in sync.

📄 License

Release history

- 0.0.5 (this version): Jul 16, 2025
- 0.0.4: Jun 28, 2025
- 0.0.3: Jun 28, 2025
- 0.0.2: Jun 28, 2025
- 0.0.1: Jun 28, 2025

Download files

Source Distribution

walacor_data_tracker-0.0.5.tar.gz (30.0 kB view details)

Uploaded Jul 16, 2025 Source

Built Distribution

walacor_data_tracker-0.0.5-py3-none-any.whl (26.5 kB view details)

Uploaded Jul 16, 2025 Python 3

File details

Details for the file walacor_data_tracker-0.0.5.tar.gz.

File metadata

Download URL: walacor_data_tracker-0.0.5.tar.gz
Upload date: Jul 16, 2025
Size: 30.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for walacor_data_tracker-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`12141773444a8d146dfb2f30cecbe5dbab6bdb5b371ac975b0e60b9f62c1bf97`
MD5	`3675ba7e8afee7759fc9675912114b8b`
BLAKE2b-256	`2d9fcbc2d0d152f2f60e2f137e4845936d005e23c2f0b72555a499f63bd64231`
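
To check a downloaded file against the SHA256 digest listed above, something like the following sketch should work. `sha256_of` is a helper name made up for this example, and the commented-out check assumes the sdist was saved to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


EXPECTED = "12141773444a8d146dfb2f30cecbe5dbab6bdb5b371ac975b0e60b9f62c1bf97"

# After downloading the sdist into the current directory (not executed here):
# assert sha256_of("walacor_data_tracker-0.0.5.tar.gz") == EXPECTED
```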

Provenance

The following attestation bundles were made for walacor_data_tracker-0.0.5.tar.gz:

Publisher: release.yaml on walacor/walacor-data-tracker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: walacor_data_tracker-0.0.5.tar.gz
- Subject digest: 12141773444a8d146dfb2f30cecbe5dbab6bdb5b371ac975b0e60b9f62c1bf97
- Sigstore transparency entry: 278658834
- Sigstore integration time: Jul 16, 2025
Source repository:
- Permalink: walacor/walacor-data-tracker@d570a51cca0a30bcef5b2305158e79cac83c3d8b
- Branch / Tag: refs/tags/0.0.5
- Owner: https://github.com/walacor
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@d570a51cca0a30bcef5b2305158e79cac83c3d8b
- Trigger Event: push

File details

Details for the file walacor_data_tracker-0.0.5-py3-none-any.whl.

File metadata

Download URL: walacor_data_tracker-0.0.5-py3-none-any.whl
Upload date: Jul 16, 2025
Size: 26.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for walacor_data_tracker-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e443499949712136cc2cf9e6580b37620b3e986205563e502588189d1f28416a`
MD5	`c76e9e077e6e2adac4cd944e5bbc7eff`
BLAKE2b-256	`6c13b7d674b30c9913abea1e60c9e4c458f9947822187ef9d16acb9b59753edc`

Provenance

The following attestation bundles were made for walacor_data_tracker-0.0.5-py3-none-any.whl:

Publisher: release.yaml on walacor/walacor-data-tracker

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: walacor_data_tracker-0.0.5-py3-none-any.whl
- Subject digest: e443499949712136cc2cf9e6580b37620b3e986205563e502588189d1f28416a
- Sigstore transparency entry: 278658845
- Sigstore integration time: Jul 16, 2025
Source repository:
- Permalink: walacor/walacor-data-tracker@d570a51cca0a30bcef5b2305158e79cac83c3d8b
- Branch / Tag: refs/tags/0.0.5
- Owner: https://github.com/walacor
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@d570a51cca0a30bcef5b2305158e79cac83c3d8b
- Trigger Event: push
