
SDK and CLI for capturing data-science lineage and persisting DAG snapshots to Walacor.


Walacor Data Tracking



A schema-first framework to track, version, and store the full lineage of data transformations — from raw ingestion to final model output — using Walacor as a backend snapshot store.


✨ Why this exists

  • Reproducibility – Every transformation, parameter, and artifact is captured in a graph you can replay.
  • Auditability – Snapshots are immutable, version-controlled, and timestamped.
  • Collaboration – Team members see the same lineage and can compare or branch workflows.
  • Extensibility – Strict JSON schemas keep today’s pipelines clean while allowing tomorrow’s to evolve safely.

🏗️ Core Concepts

| Concept | Stored as | Purpose |
|---|---|---|
| Transform Node | transform_node | One operation (e.g., “fit model”, “clean text”). |
| Transform Edge | transform_edge | Dependency between two nodes. |
| Project Metadata | project_metadata | Run-level info (owner, description, timestamps). |

Immutable Snapshots: Once a DAG is written to Walacor, it cannot be mutated; only a new snapshot (with a higher SV or run ID) can supersede it.
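To make the node/edge/snapshot idea concrete, here is a minimal in-memory sketch. The class and field names (`TransformNode`, `TransformEdge`, `Snapshot`, `run_id`, `supersede`) are hypothetical and are not taken from the package's schemas; the only point is that a snapshot is never edited in place, and a change produces a new snapshot with a higher run ID.

```python
from dataclasses import dataclass

# Hypothetical mirror of two of the record types above; the field names are
# illustrative and do not come from the walacor_data_tracker schemas.
@dataclass(frozen=True)
class TransformNode:
    node_id: str
    operation: str              # e.g. "clean text", "fit model"

@dataclass(frozen=True)
class TransformEdge:
    source_id: str              # upstream node
    target_id: str              # downstream node that depends on it

@dataclass(frozen=True)
class Snapshot:
    run_id: int
    nodes: tuple[TransformNode, ...]
    edges: tuple[TransformEdge, ...]

def supersede(old: Snapshot, node: TransformNode, edge: TransformEdge) -> Snapshot:
    """Return a new snapshot with a higher run_id; the old one is left untouched."""
    return Snapshot(
        run_id=old.run_id + 1,
        nodes=old.nodes + (node,),
        edges=old.edges + (edge,),
    )

raw = TransformNode("n1", "load csv")
clean = TransformNode("n2", "clean text")
v1 = Snapshot(run_id=1, nodes=(raw,), edges=())
v2 = supersede(v1, clean, TransformEdge("n1", "n2"))
```

Because the dataclasses are frozen, assigning to `v1.run_id` raises an error, which is the in-memory analogue of the immutability rule described above.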


🚀 Getting Started

1. Install the SDK

pip install walacor-data-tracker

Make sure you're using Python 3.10+ and have internet access to reach the Walacor API.

2. Initialize the Tracking Components

To begin capturing your data lineage:

  • Start the Tracker – This manages the session and records operations.

  • Attach an Adapter – For example, use PandasAdapter to automatically track DataFrame transformations.

  • Add Writers – Choose where to send the output:

    • Console output for quick inspection
    • WalacorWriter to persist snapshots to the Walacor backend

Once set up, your transformation history will be automatically recorded and can be exported or persisted.


🧪 Example Use Cases

  • Track changes in a machine learning pipeline
  • Visualize column-level transformations in pandas
  • Record versions of a dataset as it’s cleaned and merged
  • Keep an auditable log of automated workflows



🧪 Minimal Example

Here's how simple it is to start tracking transformations:

import pandas as pd
from walacor_data_tracker import Tracker, PandasAdapter
from walacor_data_tracker.writers import ConsoleWriter
from walacor_data_tracker.writers.walacor import WalacorWriter

# 1️⃣  Start tracking
tracker = Tracker().start()
PandasAdapter().start(tracker)        # auto-captures every DataFrame op
ConsoleWriter()                       # (optional) prints lineage to stdout

# 2️⃣  Open a Walacor run in one line
wal_writer = WalacorWriter(
    "https://your-walacor-url/api",    # server
    "your-username",                   # login
    "your-password",
    project_name="MyProject",
    pipeline_name="daily_sales_pipeline",   # ⇢ opens a new run right away
)

# 3️⃣  Do your normal pandas work
df = pd.DataFrame({"id": [1, 2], "value": [100, 200]})
df2 = df.assign(double=df.value * 2)
df3 = df2.rename(columns={"value": "v"})

# 4️⃣  Finish the run and stop tracking
wal_writer.close(status="finished")   # marks the run "finished" in Walacor
tracker.stop()

print("Walacor run UID:", wal_writer._run_uid)   # UID of the run you just wrote

💡 The PandasAdapter automatically tracks operations like .assign(), .rename(), .merge(), etc., so you can work with pandas as usual — but with versioned lineage behind the scenes.
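The example above uses placeholder strings for the server, username, and password. One common way to keep real values out of source control is to read them from environment variables. The sketch below assumes three hypothetical variable names (`WALACOR_URL`, `WALACOR_USERNAME`, `WALACOR_PASSWORD`); they are not defined by the package, so use whatever names fit your deployment.

```python
import os

def walacor_settings_from_env(env=os.environ):
    """Collect connection settings from environment variables.

    The variable names are hypothetical, not something the library reads itself.
    """
    required = ("WALACOR_URL", "WALACOR_USERNAME", "WALACOR_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a readable message instead of a later login error.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "server": env["WALACOR_URL"],
        "username": env["WALACOR_USERNAME"],
        "password": env["WALACOR_PASSWORD"],
    }
```

The three values can then be passed to `WalacorWriter` in the same positional order as in the example (server, username, password).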



🛠️ Pandas operations automatically tracked

The current release wraps the pandas DataFrame API methods below. Whenever you call any of them, a transform_node is emitted, parameters are captured, and lineage is updated, with zero extra code required:

| Category | Supported DataFrame methods |
|---|---|
| Structural copies / reshaping | copy, reset_index, set_axis, pivot_table, melt, explode |
| Column creation / update | assign, insert, __setitem__ (df["col"] = …) |
| Cleaning & NA handling | fillna, dropna, replace |
| Column rename / re-order | rename, reindex, sort_values |
| Joins & merges | merge, join |
| Type & dtype changes | astype |

ℹ️ These map directly to the constant in PandasAdapter:

_DF_METHODS = [
    "copy", "pivot_table", "reset_index", "__setitem__",
    "fillna", "dropna", "replace", "rename", "assign",
    "merge", "join", "set_axis", "insert", "astype",
    "sort_values", "reindex", "explode", "melt",
]

Missing your favourite method?

Pull requests are welcome! Add the method name to _DF_METHODS, ensure the wrapper captures a meaningful snapshot, and open a PR. We’ll review and merge updates that keep to the schema-first philosophy.
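For contributors who want a feel for the general technique, the sketch below shows one way a list of method names can drive automatic wrapping. It is not the adapter's actual implementation: `Table`, `TRACKED_METHODS`, `events`, and `wrap_methods` are hypothetical names, and a toy class stands in for the DataFrame so the example runs without pandas.

```python
import functools

class Table:
    """Toy stand-in for a DataFrame so the sketch runs without pandas."""
    def __init__(self, columns):
        self.columns = list(columns)

    def rename(self, mapping):
        return Table(mapping.get(c, c) for c in self.columns)

    def drop(self, name):
        return Table(c for c in self.columns if c != name)

TRACKED_METHODS = ["rename"]      # plays the role that _DF_METHODS plays above
events = []                       # stands in for emitted transform_node records

def wrap_methods(cls, names):
    """Replace each named method with a wrapper that records the call."""
    for name in names:
        original = getattr(cls, name)

        def make_wrapper(name, original):
            @functools.wraps(original)
            def wrapper(self, *args, **kwargs):
                result = original(self, *args, **kwargs)
                events.append({"operation": name, "args": args, "kwargs": kwargs})
                return result
            return wrapper

        setattr(cls, name, make_wrapper(name, original))

wrap_methods(Table, TRACKED_METHODS)

t = Table(["id", "value"]).rename({"value": "v"})   # recorded
t = t.drop("id")                                    # not tracked, nothing recorded
```

In this toy version, tracking `drop` as well only requires adding its name to `TRACKED_METHODS` before calling `wrap_methods`, which mirrors the "add the method name to the list" workflow described above.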


🔍 Helper API — query your lineage

| Helper | Purpose | Key parameters | Returns |
|---|---|---|---|
| get_projects() | List every Walacor-tracked project. | (none) | [{uid, project_name, description, user_tag}] |
| get_pipelines() | List the names of all pipelines ever executed (across projects). | (none) | ["daily_etl", "train_model", ...] |
| get_pipelines_for_project(project_name, *, user_tag=None) | Pipelines that belong to one project. | project_name: required; user_tag: filter if you store multiple laptops/branches | ["sales_pipeline", …] |
| get_runs(project_name, *, pipeline_name=None, user_tag=None) | History of executions (“runs”). | project_name: required; pipeline_name: limit to one pipeline; user_tag: optional | [{"UID","status","pipeline_name",…}, …] |
| get_nodes(project_name, *, pipeline_name=None, run_uid=None, user_tag=None) | Raw transform_node rows (operations). | Same filters as above; pick one of pipeline_name or run_uid. Omitting both returns every node in the project. | List of node dicts with operation, shape, params_json, … |
| get_dag(project_name, *, pipeline_name=None, run_uid=None, user_tag=None) | Convenient “everything I need for a graph”. | Same filter rules. | {"nodes": [...], "edges": [...]} where edges come from transform_edge. |
| get_projects_with_pipelines() | High-level catalogue: each project, its pipelines and run-counts. | (none) | [ { "project_name": "Proj", "pipelines":[{"name":"etl","runs":7}] }, … ] |

Parameter rules at a glance

| Filter combo | What you get |
|---|---|
| project_name only | all nodes / all edges in the project |
| project_name + pipeline_name | all runs & nodes for that pipeline |
| project_name + run_uid | nodes/edges of one specific run |
| user_tag | optional extra filter on any of the above |
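The filter rules can be read as a chain of optional conditions. The sketch below applies them to an in-memory list of dicts; it illustrates the rules only and does not reproduce how the helpers query Walacor. The function name `select_nodes` and the dict keys are assumptions made for this example.

```python
def select_nodes(nodes, project_name, *, pipeline_name=None, run_uid=None, user_tag=None):
    """Filter node dicts by project, then by whichever optional filters are set.

    The key names used here are assumptions for this sketch, not the stored schema.
    The helpers expect only one of pipeline_name or run_uid; this sketch simply
    applies every filter that is not None.
    """
    selected = []
    for node in nodes:
        if node["project_name"] != project_name:
            continue
        if pipeline_name is not None and node["pipeline_name"] != pipeline_name:
            continue
        if run_uid is not None and node["run_uid"] != run_uid:
            continue
        if user_tag is not None and node.get("user_tag") != user_tag:
            continue
        selected.append(node)
    return selected

rows = [
    {"project_name": "ML_Proj", "pipeline_name": "train_model", "run_uid": "r1", "user_tag": "laptop"},
    {"project_name": "ML_Proj", "pipeline_name": "train_model", "run_uid": "r2", "user_tag": "ci"},
    {"project_name": "ML_Proj", "pipeline_name": "etl", "run_uid": "r3", "user_tag": "ci"},
    {"project_name": "Other", "pipeline_name": "etl", "run_uid": "r4", "user_tag": "ci"},
]

whole_project = select_nodes(rows, "ML_Proj")
one_pipeline = select_nodes(rows, "ML_Proj", pipeline_name="train_model")
one_run = select_nodes(rows, "ML_Proj", run_uid="r2")
```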

Example usage

# 1️⃣ list all runs of “train_model” in “ML_Proj”
runs = wal_writer.get_runs("ML_Proj", pipeline_name="train_model")
first_run = runs[0]["UID"]

# 2️⃣ pull the DAG for that first run
dag = wal_writer.get_dag("ML_Proj", run_uid=first_run)

# 3️⃣ quick print
for n in dag["nodes"]:
    print(n["operation"], n["shape"])

These helpers leverage the official Walacor Python SDK, so every call hits Walacor’s fast summary view and transparently re-uses the writer’s authenticated session—no extra login or handshake required.
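Once you have a `{"nodes": [...], "edges": [...]}` result, a common next step is to order the operations so that each one appears after its dependencies. The sketch below does this with the standard library's `graphlib`. The key names `"uid"`, `"source"`, and `"target"` are hypothetical, since the exact row layout is not shown here; adjust them to the keys that `get_dag()` actually returns.

```python
from graphlib import TopologicalSorter

def execution_order(dag):
    """Return node ids so that every node appears after the nodes it depends on.

    Assumes node dicts carry a "uid" key and edge dicts carry "source" and
    "target" keys; these names are hypothetical.
    """
    sorter = TopologicalSorter()
    for node in dag["nodes"]:
        sorter.add(node["uid"])                      # register isolated nodes too
    for edge in dag["edges"]:
        sorter.add(edge["target"], edge["source"])   # target depends on source
    return list(sorter.static_order())

# Hand-written sample in the assumed shape, not real Walacor output.
sample_dag = {
    "nodes": [
        {"uid": "a", "operation": "DataFrame"},
        {"uid": "b", "operation": "assign"},
        {"uid": "c", "operation": "rename"},
    ],
    "edges": [
        {"source": "a", "target": "b"},
        {"source": "b", "target": "c"},
    ],
}
order = execution_order(sample_dag)
```

`TopologicalSorter` raises `graphlib.CycleError` if the edges contain a cycle, which is a cheap sanity check that the stored graph really is a DAG.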


🤝 Contributing

  1. Fork → feature branch → PR.
  2. Run pre-commit run --all-files.
  3. Add/Update unit tests and schema definitions.
  4. Keep the README & docs in sync.

📄 License

Apache 2.0 © 2025 Walacor & Contributors.
