Datalake functionality for Mindtrace
Project description
Mindtrace Datalake
The Mindtrace Datalake is the canonical data layer for Mindtrace. It sits on mindtrace.database (structured records) and mindtrace.registry (object storage and mounts) and exposes a unified model for assets, collections, annotations, and immutable dataset versions.
Start here
- Happy path — local stack, direct upload, dataset sync vs replication, and operational caveats.
- Docker (Mongo + MinIO +
DatalakeService) — docker/datalake/README.md at the repository root.
What you can do today
| Area | Role |
|---|---|
DatalakeService |
HTTP/MCP-facing API over AsyncDatalake (typed tasks, FastAPI). |
| Objects & uploads | Put bytes in storage (objects.put or upload-session flow), then reference them from canonical records. |
| Canonical model | Assets, collections, datums, dataset versions, annotations — persisted in Mongo, payloads in configured mounts. |
| Dataset sync | Export/import dataset version bundles between lakes (dataset_versions.export, import_prepare, import_commit). |
| Replication | Metadata-first mirroring and payload lifecycle (replication.* tasks — upsert, hydrate, reconcile, status, reclaim). |
Sync vs replication (short):
- Dataset sync — move a named, versioned dataset snapshot as an import/export bundle. Dataset-centric.
- Replication — mirror assets across lakes with a metadata-first pipeline and optional hydration/reclaim. Asset-centric.
Same-lake metadata_only transfer policies are supported where implemented; cross-lake metadata_only import is intentionally rejected until unresolved-placeholder semantics exist. See the happy path and GitHub issues for detail.
Relationship to other Mindtrace modules
mindtrace.database— persistence for canonical documents.mindtrace.registry— mounts, stores, andStorageRefresolution.mindtrace.jobs/mindtrace.cluster— execution and orchestration consume datalake data; they should not define the canonical schema.
flowchart TD
DB[database module]
REG[registry module]
DB --> DL[datalake module]
REG --> DL
JOBS[jobs module] --> CL[cluster module]
DL --> CL
DataVault (AsyncDataVault / DataVault)
DataVault is a small facade over save(alias, payload, …) and load(alias): it creates/links assets, registers aliases, and reads objects through the same registry stack as AsyncDatalake.
- In-process:
AsyncDataVault(async_datalake)orDataVault(datalake)(or pass an explicitLocalAsyncDataVaultBackend/LocalDataVaultBackend). - Remote HTTP/MCP:
cm = DatalakeService.connect(url="http://…"), thenDataVault(cm)orAsyncDataVault(cm). The facade recognizes the service client and usesDatalakeServiceDataVaultBackend/DatalakeServiceAsyncDataVaultBackendautomatically. You can still pass those backends explicitly if you prefer.
When the lake is running in Docker (Mongo + MinIO + DatalakeService), see docker/datalake/README.md for a copy-paste sample against http://localhost:8080.
Datalake service (DatalakeService)
The package provides DatalakeService, which wraps AsyncDatalake with the Mindtrace Service layer (FastAPI + MCP). Initialization can be lazy; live processes may enable startup initialization and background helpers (for example upload-session reconciliation).
Example (adjust host/port and Mongo URIs for your environment):
from mindtrace.datalake import DatalakeService
service = DatalakeService.launch(
host="localhost",
port=8080,
mongo_db_uri="mongodb://localhost:27017",
mongo_db_name="mindtrace",
)
# Use async handlers or the service’s app/routes per your deployment.
Task families (overview)
Includes, among others:
health,summary,mountsobjects.*— put/get/head/copy, upload session create/completeassets.*,assets.get_by_alias,aliases.add,collections.*,collection_items.*,asset_retentions.*annotation_*,datums.*dataset_versions.*— CRUD, resolve, export, import_prepare, import_commitreplication.*— upsert_batch, hydrate_asset_payload, reconcile, mark_local_delete_eligible, delete_local_payload, reclaim_verified_payloads, status
Exact wire format and paths depend on how the shared Service framework exposes tasks; treat names above as the stable task identifiers.
Storage model
Structured records live in the database layer; large payloads live in registry-backed storage. Mounts can target local disk, S3-compatible endpoints (including MinIO), GCS, etc., via Mount and store configuration.
Design reference (V3 direction)
The datalake is evolving from earlier internal versions toward a fuller V3 canonical model. The sections below summarize that direction; they are not an exhaustive API spec.
Implementation status (historical labels)
- V1 — older
mtrix-era datalake (packaging and loading). - V2 — current
mindtrace.datalakecenter of gravity (Datum, queries, etc.). - V3 — design direction: clearer entities, registry mounts, service-oriented access.
Canonical V3 concepts
- Collection, CollectionItem, AssetRetention
- StorageRef, Asset
- Annotation schema/set/record model
- Datum, DatasetVersion
- DatasetBuilder (helper for constructing new versions — not the same as a persisted version record)
Entity relationships (conceptual)
erDiagram
STORAGE_REF ||--|| ASSET : "locates"
ASSET ||--o{ COLLECTION_ITEM : "included by"
COLLECTION ||--o{ COLLECTION_ITEM : "contains"
ASSET ||--o{ ASSET_RETENTION : "retained by"
COLLECTION ||--o{ ASSET_RETENTION : "may import/pin"
ASSET ||--o{ DATUM : "used by role refs"
DATASET_VERSION ||--o{ DATUM : "manifest contains"
DATASET_VERSION ||--o{ ANNOTATION_SET : "may include"
ANNOTATION_SET ||--o{ ANNOTATION_RECORD : "contains"
DATUM ||--o{ ANNOTATION_RECORD : "annotated by"
ANNOTATION_SOURCE ||--o{ ANNOTATION_RECORD : "source for"
Annotations
V3 aims for first-class annotation types (classification, bbox, mask, keypoint, etc.) with provenance. See docs/datalake-v3-proposal.md in the repository for the full proposal.
Design principles
- Canonical data should outlive individual workflows.
- Storage location should be separate from logical identity.
- Datasets should be immutable views over reusable entities.
- Annotations should be structured, queryable, and provenance-aware.
- Collections should not imply destructive ownership of shared assets.
- Execution systems integrate with the datalake; they do not define its schema.
Built-in Pascal VOC importer
The package includes an importer for Pascal VOC 2012 (splits, one image per asset, segmentation via class masks, etc.). This is one way to load a benchmark into the canonical model; it is separate from service upload/sync/replication flows.
CLI
mindtrace-datalake-import-pascal-voc \
--mongo-db-uri "mongodb://mindtrace:mindtrace@localhost:27017" \
--mongo-db-name "mindtrace" \
--root-dir "./data/pascal-voc" \
--split train \
--dataset-name "pascal-voc-2012-train" \
--download
Or:
python -m mindtrace.datalake.importers.pascal_voc \
--mongo-db-uri "mongodb://mindtrace:mindtrace@localhost:27017" \
--mongo-db-name "mindtrace" \
--root-dir "./data/pascal-voc" \
--split train \
--dataset-name "pascal-voc-2012-train" \
--download
Python
from mindtrace.datalake import Datalake, PascalVocImportConfig, import_pascal_voc
with Datalake.create(
mongo_db_uri="mongodb://mindtrace:mindtrace@localhost:27017",
mongo_db_name="mindtrace",
) as datalake:
summary = import_pascal_voc(
datalake,
PascalVocImportConfig(
root_dir="./data/pascal-voc",
split="train",
dataset_name="pascal-voc-2012-train",
download=True,
),
)
print(summary)
Importer notes: reuses downloaded trees when present; overwrite-on-conflict for importer writes; fails if the target DatasetVersion already exists.
Jobs and cluster
Jobs should own execution lifecycle; cluster orchestration should resolve datalake inputs/outputs. Task output schemas are not canonical datalake schemas — persist results as datalake entities (e.g. annotation sets/records) when they represent durable data.
What this README is not
This file is an entry point, not a full API reference. For deeper V3 discussion see docs/datalake-v3-proposal.md. For a practical walkthrough of today’s features, use HAPPY_PATH.md.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file mindtrace_datalake-0.11.0.tar.gz.
File metadata
- Download URL: mindtrace_datalake-0.11.0.tar.gz
- Upload date:
- Size: 102.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.8 {"installer":{"name":"uv","version":"0.11.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
dae0654112942e6b5016a8ddf1a7676f2de252bf22ea24af204e22485f980098
|
|
| MD5 |
157d510b00e319e00269e41d6fa3eb90
|
|
| BLAKE2b-256 |
2776f8de62f86fbc962b3cf2d3cc3c11b915c72ccb8a33c849cfdde8e4bd2f2e
|
File details
Details for the file mindtrace_datalake-0.11.0-py3-none-any.whl.
File metadata
- Download URL: mindtrace_datalake-0.11.0-py3-none-any.whl
- Upload date:
- Size: 108.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.8 {"installer":{"name":"uv","version":"0.11.8","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
062ad05cf7579a6c959e6772190921869301ae68de5f1b0075762b6dba64d8ef
|
|
| MD5 |
5ebad487b91e6deb4c02149f06027947
|
|
| BLAKE2b-256 |
c64245b4e3c5d24fd13f8f05351322fabaa3343d0e4690d44132fdb0adfac21b
|