Vectorization and dataset preparation utilities for Adri Agents.
Indexer Package Overview
Indexer captures the whole lifecycle of turning raw SAP ECC exports into a production-ready vector store. The package bundles two complementary CLIs—prepare_datasets.py and vectorize.py—plus shared libraries that handle schema management, data validation, partition manifests, resume metadata, and drop/remediation flows.
Lifecycle in Three Acts
- Prepare – `prepare_datasets.py` stubs configs, sanitises CSVs, deduplicates rows, and writes manifest-tracked partitions. It detects schema changes, versions them, and marks older partitions stale.
- Index – `vectorize.py` streams validated partitions into Chroma (local or cloud), manages resume state, enforces batch token limits, and keeps partition-level persistence for incremental replays.
- Remediate – Both tools support drop planning and application so you can retire stale data, purge partitions, or re-ingest after schema shifts without losing traceability.
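To make the Prepare step's schema handling concrete, here is a minimal sketch of how a schema change could bump a version and mark older partitions stale. It is an illustration only: the manifest layout, the function names `schema_fingerprint` and `register_partition`, and the `Material` model with its columns are all assumptions, not the package's actual format or API.

```python
import hashlib
import json


def schema_fingerprint(columns):
    """Return a stable hash of an ordered column list (hypothetical helper)."""
    payload = json.dumps(list(columns), separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def register_partition(manifest, model, partition_name, columns):
    """Record a partition; on a schema change, bump the version and mark older partitions stale."""
    entry = manifest.setdefault(
        model, {"schema_version": 0, "fingerprint": None, "partitions": []}
    )
    fingerprint = schema_fingerprint(columns)
    if fingerprint != entry["fingerprint"]:
        # First run or changed schema: existing partitions no longer match.
        for partition in entry["partitions"]:
            partition["stale"] = True
        entry["schema_version"] += 1
        entry["fingerprint"] = fingerprint
    entry["partitions"].append(
        {"name": partition_name, "schema_version": entry["schema_version"], "stale": False}
    )
    return entry


# Assumed example data: two partitions under one schema, then a third with an extra column.
manifest = {}
register_partition(manifest, "Material", "p0001", ["MATNR", "MAKTX"])
register_partition(manifest, "Material", "p0002", ["MATNR", "MAKTX"])
register_partition(manifest, "Material", "p0003", ["MATNR", "MAKTX", "SPRAS"])
```

After the third call, the first two partitions are flagged stale and the new one carries schema version 2. That flag is the kind of signal a drop plan could then act on.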
Feature Highlights
- 🔄 Schema-aware manifest with versioning, stale partition tracking, and per-model metadata.
- 🧠 Resume-friendly ingestion (per model and per partition) that records file offsets and row indices.
- 🩹 CSV repair safeguards; `prepare_datasets.py` stitches newline-fractured rows around a configured malformed column before deciding to drop them.
- ⚡ Digest caches keep reruns fast by persisting per-partition row hashes alongside the CSVs.
- 🧵 Parallel partition runs fan out manifest indexing via `--parallel-partitions` when you want multiple partitions embedding at once.
- 🗂️ Pluggable collection strategies let Chroma Cloud index each partition into its own collection while local runs stick with a single name.
- 🚮 Stale cleanup parity – when partitions map to standalone collections, `--delete-stale` drops those collections wholesale before rebuilding.
- 🎯 E2E sampling mode (`--e2e-test-run`) indexes random rows per CSV and writes an audit log so you can validate pipelines without processing millions of records.
- 🧩 Pluggable model registries; pass `--model <module:REGISTRY>` to target ECC defaults or your own knowledge domain.
- ☁️ Chroma transport flexibility; switch between persistent client and HTTP/Cloud with a couple of flags.
- 🧹 Drop + remediation tooling that mirrors the migration workflow and keeps audit history.
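The CSV repair highlight is easier to picture with an example. The sketch below shows one way newline-fractured rows could be stitched around a known malformed column; it is not the package's implementation. The function name `stitch_rows`, its parameters, and the sample rows are hypothetical, and it assumes the malformed column is not the last column and that fragments are rejoined with a single space.

```python
import csv


def stitch_rows(lines, expected_fields, malformed_index, delimiter=","):
    """Rejoin physical lines that a stray newline split inside one known column.

    Returns (repaired_rows, dropped_fragments).
    """
    repaired, dropped = [], []
    buffer = []
    for line in lines:
        buffer.append(line.rstrip("\r\n"))
        candidate = " ".join(buffer)
        fields = next(csv.reader([candidate], delimiter=delimiter))
        if len(fields) == expected_fields:
            # The row is complete, either untouched or successfully stitched.
            repaired.append(fields)
            buffer = []
        elif len(fields) != malformed_index + 1:
            # The break is not inside the configured column, so give up on it.
            dropped.append(candidate)
            buffer = []
        # Otherwise the row stops inside the malformed column: keep buffering.
    if buffer:
        dropped.append(" ".join(buffer))
    return repaired, dropped


# Assumed sample: the second logical row was split inside column index 1.
sample = ["1,ok,A\n", "2,broken\n", "text,B\n", "3,fine,C\n", "oops\n"]
repaired, dropped = stitch_rows(sample, expected_fields=3, malformed_index=1)
```

Here `repaired` holds three rows, with the middle one rejoined as `["2", "broken text", "B"]`, and `dropped` holds the unrecoverable `"oops"` fragment. One limitation of this simplified approach: a fragment that never finds its continuation takes the next physical line down with it, so a production version would need a recovery step.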
Quick Start
```shell
# 1) Describe your registry target once
export MODEL_REGISTRY_TARGET="kb.std.ecc_6_0_ehp_7.registry:MODEL_REGISTRY"

# 2) Scaffold a dataset config
prepare_datasets.py new-config ecc-foundation --model "$MODEL_REGISTRY_TARGET"

# 3) Produce partitions (idempotent; MVCC-friendly)
prepare_datasets.py \
  --model "$MODEL_REGISTRY_TARGET" \
  --config configs/ecc_foundation.json \
  --output-root build/partitions

# 4) Index into Chroma and resume safely across runs
vectorize.py index \
  --model "$MODEL_REGISTRY_TARGET" \
  --partition-manifest build/partitions/manifest.json \
  --partition-out-dir build/vector \
  --collection ecc-std \
  --resume
```
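The `--model` value follows a `<module>:<ATTRIBUTE>` shape. As a rough illustration of how such a target can be resolved in Python, the sketch below uses the standard `importlib` module. The function name `load_registry` is hypothetical, and this is an assumption about the general technique, not a description of how the CLIs actually parse the flag.

```python
import importlib


def load_registry(target):
    """Resolve a '<module>:<ATTRIBUTE>' string to the object it names."""
    module_name, separator, attribute_name = target.partition(":")
    if not separator or not module_name or not attribute_name:
        raise ValueError(f"expected '<module>:<ATTRIBUTE>', got {target!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute_name)


# A standard-library target stands in for a real registry module here.
ordered_dict_class = load_registry("collections:OrderedDict")
```

With a custom knowledge domain, the same call would point at your own importable module, provided it is on `PYTHONPATH` when the tool runs.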
Looking for deeper dives? Check out DOC.md for a narrative walkthrough and FAQ.md for operational tips.