Datalake functionality for Mindtrace

Mindtrace Datalake

The Mindtrace Datalake is the canonical data layer for Mindtrace. It sits on mindtrace.database (structured records) and mindtrace.registry (object storage and mounts) and exposes a unified model for assets, collections, annotations, and immutable dataset versions.

Start here

Happy path (HAPPY_PATH.md): local stack, direct upload, dataset sync vs replication, and operational caveats.
Docker (Mongo + MinIO + DatalakeService): docker/datalake/README.md at the repository root.

What you can do today

Area	Role
`DatalakeService`	HTTP/MCP-facing API over `AsyncDatalake` (typed tasks, FastAPI).
Objects & uploads	Put bytes in storage (`objects.put` or upload-session flow), then reference them from canonical records.
Canonical model	Assets, collections, datums, dataset versions, annotations — persisted in Mongo, payloads in configured mounts.
Dataset sync	Export/import dataset version bundles between lakes (`dataset_versions.export`, `import_prepare`, `import_commit`).
Replication	Metadata-first mirroring and payload lifecycle (`replication.*` tasks — upsert, hydrate, reconcile, status, reclaim).

Sync vs replication (short):

Dataset sync — move a named, versioned dataset snapshot as an import/export bundle. Dataset-centric.
Replication — mirror assets across lakes with a metadata-first pipeline and optional hydration/reclaim. Asset-centric.

Same-lake metadata_only transfer policies are supported where implemented; cross-lake metadata_only import is intentionally rejected until unresolved-placeholder semantics exist. See the happy path and GitHub issues for detail.
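
The rule above can be pictured as a small guard. This is a minimal sketch only: `validate_import_policy` and the plain-string lake identifiers are hypothetical names made up for illustration, not part of mindtrace.datalake, and the real check may live elsewhere and behave differently. Only the `metadata_only` policy name comes from the text.

```python
def validate_import_policy(source_lake: str, target_lake: str, policy: str) -> None:
    """Hypothetical guard: reject metadata_only imports that cross lakes."""
    if policy == "metadata_only" and source_lake != target_lake:
        # Without the payloads, the target lake would hold records whose
        # storage references cannot be resolved yet.
        raise ValueError(
            "cross-lake metadata_only import is rejected: "
            "payloads would be left as unresolved placeholders"
        )


# Same lake: accepted, the function simply returns.
validate_import_policy("lake-a", "lake-a", "metadata_only")

# Different lakes: rejected with a ValueError.
try:
    validate_import_policy("lake-a", "lake-b", "metadata_only")
except ValueError as exc:
    print(exc)
```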

Relationship to other Mindtrace modules

mindtrace.database — persistence for canonical documents.
mindtrace.registry — mounts, stores, and StorageRef resolution.
mindtrace.jobs / mindtrace.cluster — execution and orchestration consume datalake data; they should not define the canonical schema.

flowchart TD
    DB[database module]
    REG[registry module]

    DB --> DL[datalake module]
    REG --> DL

    JOBS[jobs module] --> CL[cluster module]
    DL --> CL

DataVault (`AsyncDataVault` / `DataVault`)

DataVault is a small facade that exposes save(alias, payload, …) and load(alias): it creates or links assets, registers aliases, and reads objects through the same registry stack as AsyncDatalake.

In-process: AsyncDataVault(async_datalake) or DataVault(datalake) (or pass an explicit LocalAsyncDataVaultBackend / LocalDataVaultBackend).
Remote HTTP/MCP: cm = DatalakeService.connect(url="http://…"), then DataVault(cm) or AsyncDataVault(cm). The facade recognizes the service client and uses DatalakeServiceDataVaultBackend / DatalakeServiceAsyncDataVaultBackend automatically. You can still pass those backends explicitly if you prefer.

When the lake is running in Docker (Mongo + MinIO + DatalakeService), see docker/datalake/README.md for a copy-paste sample against http://localhost:8080.
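
To make the facade idea concrete, here is a self-contained sketch of the pattern, assuming nothing about the real implementation. `VaultBackend`, `InMemoryBackend`, and `ToyVault` are hypothetical stand-ins written for this example; they are not the mindtrace classes, and the real save accepts more arguments than shown here.

```python
from typing import Dict, Protocol


class VaultBackend(Protocol):
    """Hypothetical backend interface; not the mindtrace backend classes."""

    def save(self, alias: str, payload: bytes) -> None: ...
    def load(self, alias: str) -> bytes: ...


class InMemoryBackend:
    """Stand-in for a local or service-backed backend."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}  # object key -> payload
        self._aliases: Dict[str, str] = {}    # alias -> object key

    def save(self, alias: str, payload: bytes) -> None:
        key = f"obj-{len(self._objects)}"
        self._objects[key] = payload
        self._aliases[alias] = key            # register the alias

    def load(self, alias: str) -> bytes:
        return self._objects[self._aliases[alias]]


class ToyVault:
    """Facade: callers see only save/load, whatever the backend is."""

    def __init__(self, backend: VaultBackend) -> None:
        self._backend = backend

    def save(self, alias: str, payload: bytes) -> None:
        self._backend.save(alias, payload)

    def load(self, alias: str) -> bytes:
        return self._backend.load(alias)


vault = ToyVault(InMemoryBackend())
vault.save("samples/first", b"hello")
print(vault.load("samples/first"))  # b'hello'
```

Swapping `InMemoryBackend` for another object with the same two methods leaves the calling code unchanged, which is the property the in-process and remote modes described above rely on.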

Datalake service (`DatalakeService`)

The package provides DatalakeService, which wraps AsyncDatalake with the Mindtrace Service layer (FastAPI + MCP). Initialization can be lazy; live processes may enable startup initialization and background helpers (for example upload-session reconciliation).

Example (adjust host/port and Mongo URIs for your environment):

from mindtrace.datalake import DatalakeService

service = DatalakeService.launch(
    host="localhost",
    port=8080,
    mongo_db_uri="mongodb://localhost:27017",
    mongo_db_name="mindtrace",
)
# Use async handlers or the service’s app/routes per your deployment.

Task families (overview)

The service exposes, among others, these task families:

health, summary, mounts
objects.* — put/get/head/copy, upload session create/complete
assets.*, assets.get_by_alias, aliases.add, collections.*, collection_items.*, asset_retentions.*
annotation_*, datums.*
dataset_versions.* — CRUD, resolve, export, import_prepare, import_commit
replication.* — upsert_batch, hydrate_asset_payload, reconcile, mark_local_delete_eligible, delete_local_payload, reclaim_verified_payloads, status

Exact wire format and paths depend on how the shared Service framework exposes tasks; treat names above as the stable task identifiers.
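
Since the names are the stable part, client code can reasonably key on them. The sketch below groups a handful of identifiers by family. The fully qualified "family.task" spelling is an assumption inferred from the list above, and `group_by_family` is a hypothetical helper, not something the package provides.

```python
from collections import defaultdict

# Illustrative identifiers assembled from the families listed above;
# the full "family.task" spelling is an assumption.
TASKS = [
    "health",
    "objects.put",
    "dataset_versions.export",
    "dataset_versions.import_prepare",
    "dataset_versions.import_commit",
    "replication.upsert_batch",
    "replication.reconcile",
    "replication.status",
]


def group_by_family(names):
    """Group task identifiers by the part before the first dot."""
    families = defaultdict(list)
    for name in names:
        family, _, task = name.partition(".")
        # Names without a dot (e.g. "health") form their own family.
        families[family].append(task or name)
    return dict(families)


for family, tasks in group_by_family(TASKS).items():
    print(f"{family}: {', '.join(tasks)}")
```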

Storage model

Structured records live in the database layer; large payloads live in registry-backed storage. Mounts can target local disk, S3-compatible endpoints (including MinIO), GCS, etc., via Mount and store configuration.
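
The split between records and payloads can be sketched as follows. This is an illustration under stated assumptions: `ToyStorageRef`, `ToyAssetRecord`, the `MOUNTS` table, and `resolve` are hypothetical, and the real StorageRef and Mount types in mindtrace.registry may carry different fields.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ToyStorageRef:
    """Hypothetical stand-in: logical location = mount name + key."""
    mount: str
    key: str


@dataclass(frozen=True)
class ToyAssetRecord:
    """Structured record; it holds a reference, never the payload bytes."""
    asset_id: str
    media_type: str
    storage_ref: ToyStorageRef


# Hypothetical mount table: mount name -> base location.
MOUNTS: Dict[str, str] = {
    "local": "file:///var/lib/lake",
    "minio": "s3://lake-bucket",
}


def resolve(ref: ToyStorageRef, mounts: Dict[str, str]) -> str:
    """Turn a logical reference into a concrete URI for one deployment."""
    base = mounts[ref.mount]  # KeyError if the mount is unknown
    return f"{base.rstrip('/')}/{ref.key.lstrip('/')}"


record = ToyAssetRecord("asset-1", "image/png", ToyStorageRef("minio", "images/cat.png"))
print(resolve(record.storage_ref, MOUNTS))  # s3://lake-bucket/images/cat.png
```

Because the record stores only the mount name and key, repointing a mount at a different endpoint changes where payloads are read from without rewriting any records.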

Design reference (V3 direction)

The datalake is evolving from earlier internal versions toward a fuller V3 canonical model. The sections below summarize that direction; they are not an exhaustive API spec.

Implementation status (historical labels)

V1 — older mtrix-era datalake (packaging and loading).
V2 — current mindtrace.datalake center of gravity (Datum, queries, etc.).
V3 — design direction: clearer entities, registry mounts, service-oriented access.

Canonical V3 concepts

Collection, CollectionItem, AssetRetention
StorageRef, Asset
Annotation schema/set/record model
Datum, DatasetVersion
DatasetBuilder (helper for constructing new versions — not the same as a persisted version record)
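
The distinction between a builder and a persisted version can be shown with a toy model. Everything here is hypothetical: `ToyDatasetVersion` and `ToyDatasetBuilder` are illustration-only classes, and the real DatasetBuilder and DatasetVersion have their own fields and methods.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class ToyDatasetVersion:
    """Immutable snapshot: a name, a version label, and a fixed manifest."""
    name: str
    version: str
    datum_ids: Tuple[str, ...]


@dataclass
class ToyDatasetBuilder:
    """Mutable helper; nothing is a version until build() is called."""
    name: str
    _datum_ids: List[str] = field(default_factory=list)

    def add(self, datum_id: str) -> "ToyDatasetBuilder":
        self._datum_ids.append(datum_id)
        return self

    def build(self, version: str) -> ToyDatasetVersion:
        # Copy into a tuple so later builder changes cannot leak in.
        return ToyDatasetVersion(self.name, version, tuple(self._datum_ids))


builder = ToyDatasetBuilder("demo").add("datum-1").add("datum-2")
v1 = builder.build("1")
builder.add("datum-3")   # changes the builder, not v1
v2 = builder.build("2")
print(v1.datum_ids)
print(v2.datum_ids)
```

The first version keeps its two-item manifest after the builder grows, which is the behaviour an immutable dataset version is meant to guarantee.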

Entity relationships (conceptual)

erDiagram
    STORAGE_REF ||--|| ASSET : "locates"
    ASSET ||--o{ COLLECTION_ITEM : "included by"
    COLLECTION ||--o{ COLLECTION_ITEM : "contains"
    ASSET ||--o{ ASSET_RETENTION : "retained by"
    COLLECTION ||--o{ ASSET_RETENTION : "may import/pin"
    ASSET ||--o{ DATUM : "used by role refs"
    DATASET_VERSION ||--o{ DATUM : "manifest contains"
    DATASET_VERSION ||--o{ ANNOTATION_SET : "may include"
    ANNOTATION_SET ||--o{ ANNOTATION_RECORD : "contains"
    DATUM ||--o{ ANNOTATION_RECORD : "annotated by"
    ANNOTATION_SOURCE ||--o{ ANNOTATION_RECORD : "source for"

Annotations

V3 aims for first-class annotation types (classification, bbox, mask, keypoint, etc.) with provenance. See docs/datalake-v3-proposal.md in the repository for the full proposal.

Design principles

Canonical data should outlive individual workflows.
Storage location should be separate from logical identity.
Datasets should be immutable views over reusable entities.
Annotations should be structured, queryable, and provenance-aware.
Collections should not imply destructive ownership of shared assets.
Execution systems integrate with the datalake; they do not define its schema.

Built-in Pascal VOC importer

The package includes an importer for Pascal VOC 2012 (splits, one image per asset, segmentation via class masks, etc.). This is one way to load a benchmark into the canonical model; it is separate from service upload/sync/replication flows.

CLI

mindtrace-datalake-import-pascal-voc \
  --mongo-db-uri "mongodb://mindtrace:mindtrace@localhost:27017" \
  --mongo-db-name "mindtrace" \
  --root-dir "./data/pascal-voc" \
  --split train \
  --dataset-name "pascal-voc-2012-train" \
  --download

Or:

python -m mindtrace.datalake.importers.pascal_voc \
  --mongo-db-uri "mongodb://mindtrace:mindtrace@localhost:27017" \
  --mongo-db-name "mindtrace" \
  --root-dir "./data/pascal-voc" \
  --split train \
  --dataset-name "pascal-voc-2012-train" \
  --download

Python

from mindtrace.datalake import Datalake, PascalVocImportConfig, import_pascal_voc

with Datalake.create(
    mongo_db_uri="mongodb://mindtrace:mindtrace@localhost:27017",
    mongo_db_name="mindtrace",
) as datalake:
    summary = import_pascal_voc(
        datalake,
        PascalVocImportConfig(
            root_dir="./data/pascal-voc",
            split="train",
            dataset_name="pascal-voc-2012-train",
            download=True,
        ),
    )
    print(summary)

Importer notes: the importer reuses an already-downloaded dataset tree when one is present, overwrites on conflict for its own writes, and fails if the target DatasetVersion already exists.
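
The last two notes can be read as the following toy semantics. This sketch works on plain dictionaries and does not call the real importer; `toy_import` and its arguments are hypothetical, and the order in which the actual importer performs its checks is an assumption.

```python
from typing import Dict, Iterable, Tuple


def toy_import(
    assets: Dict[str, bytes],
    versions: Dict[Tuple[str, str], Tuple[str, ...]],
    items: Iterable[Tuple[str, bytes]],
    dataset_name: str,
    version: str,
) -> None:
    """Hypothetical importer core working on plain dictionaries."""
    key = (dataset_name, version)
    if key in versions:
        # Versions are immutable, so an existing target is an error.
        raise ValueError(f"dataset version {key} already exists")
    ids = []
    for asset_id, payload in items:
        assets[asset_id] = payload  # overwrite on conflict
        ids.append(asset_id)
    versions[key] = tuple(ids)


assets, versions = {}, {}
toy_import(assets, versions, [("img-1", b"a")], "voc-train", "1")
toy_import(assets, versions, [("img-1", b"b")], "voc-train", "2")  # overwrites img-1
print(assets["img-1"])  # b'b'
```

Importing into ("voc-train", "2") a second time would raise ValueError before any asset is written.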

Jobs and cluster

Jobs should own execution lifecycle; cluster orchestration should resolve datalake inputs/outputs. Task output schemas are not canonical datalake schemas — persist results as datalake entities (e.g. annotation sets/records) when they represent durable data.

What this README is not

This file is an entry point, not a full API reference. For deeper V3 discussion see docs/datalake-v3-proposal.md. For a practical walkthrough of today’s features, use HAPPY_PATH.md.

Release history

0.11.0 (this version): May 1, 2026
0.10.1: Apr 16, 2026
0.10.0: Mar 27, 2026
0.9.4: Mar 11, 2026
0.9.0: Feb 27, 2026
0.8.0: Jan 30, 2026
0.7.1: Jan 8, 2026
0.7.0: Dec 19, 2025
0.6.0: Nov 28, 2025
0.5.0: Oct 31, 2025
0.4.0: Sep 26, 2025
0.3.0: Aug 29, 2025
0.2.0: Jul 25, 2025
0.1.0: Jul 11, 2025
