Vectorization and dataset preparation utilities for Adri Agents.

Indexer Package Overview

Indexer covers the full lifecycle of turning raw SAP ECC exports into a production-ready vector store. The package bundles two complementary CLIs, prepare_datasets.py and vectorize.py, plus shared libraries that handle schema management, data validation, partition manifests, resume metadata, and drop/remediation flows.

Lifecycle in Three Acts

Prepare – prepare_datasets.py stubs configs, sanitises CSVs, deduplicates rows, and writes manifest-tracked partitions. It detects schema changes, versions them, and marks older partitions stale.
Index – vectorize.py streams validated partitions into Chroma (local or cloud), manages resume state, enforces batch token limits, and keeps partition-level persistence for incremental replays.
Remediate – Both tools support drop planning and application so you can retire stale data, purge partitions, or re-ingest after schema shifts without losing traceability.
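
To make the Prepare step's schema handling concrete, here is a minimal sketch of how a manifest could version a schema and mark older partitions stale. The manifest layout and the names schema_fingerprint and register_schema are assumptions made for illustration; they are not taken from idxr's actual code.

```python
import hashlib
import json


def schema_fingerprint(columns):
    """Hash an ordered list of (name, type) pairs into a stable fingerprint."""
    payload = json.dumps(list(columns), separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def register_schema(manifest, model, columns):
    """Record the schema for a model (hypothetical manifest layout).

    If the fingerprint changed, bump the version and flag every partition
    written under an older version as stale. Returns the current version.
    """
    entry = manifest.setdefault(
        model, {"version": 0, "fingerprint": None, "partitions": []}
    )
    fingerprint = schema_fingerprint(columns)
    if entry["fingerprint"] == fingerprint:
        return entry["version"]  # unchanged schema: nothing to do

    entry["version"] += 1
    entry["fingerprint"] = fingerprint
    for partition in entry["partitions"]:
        if partition["schema_version"] < entry["version"]:
            partition["stale"] = True
    return entry["version"]
```

Stale partitions stay listed in the manifest rather than being deleted, which is one way to keep the traceability that the Remediate step relies on.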

Feature Highlights

🔄 Schema-aware manifest with versioning, stale partition tracking, and per-model metadata.
🧠 Resume-friendly ingestion (per model and per partition) that records file offsets and row indices.
🩹 CSV repair safeguards; prepare_datasets.py stitches newline-fractured rows around a configured malformed column before deciding to drop them.
⚡ Digest caches keep reruns fast by persisting per-partition row hashes alongside the CSVs.
🧵 Parallel partition runs fan out manifest indexing via --parallel-partitions when you want several partitions embedded at once.
🗂️ Pluggable collection strategies let Chroma Cloud index each partition into its own collection while local runs stick with a single name.
🚮 Stale cleanup parity – when partitions map to standalone collections, --delete-stale drops those collections wholesale before rebuilding.
🎯 E2E sampling mode (--e2e-test-run) indexes random rows per CSV and writes an audit log so you can validate pipelines without processing millions of records.
🧩 Pluggable model registries; pass --model <module:REGISTRY> to target ECC defaults or your own knowledge domain.
☁️ Chroma transport flexibility; switch between persistent client and HTTP/Cloud with a couple of flags.
🧹 Drop + remediation tooling that mirrors the migration workflow and keeps audit history.
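
As an illustration of the digest-cache idea in the list above, the sketch below hashes each CSV row, stores the digests in a sidecar file next to the partition, and skips rows it has already seen on a rerun. The sidecar format (a JSON list) and the function names are assumptions for this example; idxr's real cache layout is not documented here.

```python
import csv
import hashlib
import io
import json
from pathlib import Path


def row_digest(row):
    """SHA-256 over the row's fields, joined with a separator unlikely to occur in data."""
    joined = "\x1f".join(row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def load_digests(cache_path):
    """Read previously stored digests; an absent cache means nothing was seen yet."""
    path = Path(cache_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def new_rows(csv_text, cache_path):
    """Return only rows whose digest is not in the cache, then update the cache."""
    seen = load_digests(cache_path)
    fresh = []
    for row in csv.reader(io.StringIO(csv_text)):
        digest = row_digest(row)
        if digest in seen:
            continue  # duplicate within this run or already processed earlier
        seen.add(digest)
        fresh.append(row)
    Path(cache_path).write_text(json.dumps(sorted(seen)), encoding="utf-8")
    return fresh
```

Because the cache persists between runs, a second pass over the same export only pays for hashing, not for re-writing or re-embedding rows.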

Quick Start

# 1) Describe your registry target once
export MODEL_REGISTRY_TARGET="kb.std.ecc_6_0_ehp_7.registry:MODEL_REGISTRY"

# 2) Scaffold a dataset config
prepare_datasets.py new-config ecc-foundation --model "$MODEL_REGISTRY_TARGET"

# 3) Produce partitions (idempotent; MVCC-friendly)
prepare_datasets.py \
  --model "$MODEL_REGISTRY_TARGET" \
  --config configs/ecc_foundation.json \
  --output-root build/partitions

# 4) Index into Chroma and resume safely across runs
vectorize.py index \
  --model "$MODEL_REGISTRY_TARGET" \
  --partition-manifest build/partitions/manifest.json \
  --partition-out-dir build/vector \
  --collection ecc-std \
  --resume
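
The --model value above follows a module:ATTRIBUTE pattern. A common way to resolve such a target in Python is with importlib, sketched below; load_registry is a hypothetical helper and this is not necessarily how idxr loads registries internally.

```python
import importlib


def load_registry(target):
    """Resolve a 'package.module:ATTRIBUTE' string to the named object."""
    module_name, sep, attr = target.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:ATTRIBUTE', got {target!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

With a helper like this, a target such as "kb.std.ecc_6_0_ehp_7.registry:MODEL_REGISTRY" would import the module and return its MODEL_REGISTRY attribute, provided the module is importable from the current environment.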

Looking for deeper dives? Check out DOC.md for a narrative walkthrough and FAQ.md for operational tips.

Release history

1.8.0 (Nov 23, 2025)
1.7.0 (Nov 22, 2025)
1.5.1 (Nov 22, 2025)
1.5.0 (Nov 22, 2025)
0.2.0 (Oct 31, 2025)
0.1.0 (Oct 31, 2025), this version

Download files

Source Distribution

idxr-0.1.0.tar.gz (3.4 kB)

Uploaded Oct 31, 2025 Source

Built Distribution

idxr-0.1.0-py3-none-any.whl (3.2 kB)

Uploaded Oct 31, 2025 Python 3

File details

Details for the file idxr-0.1.0.tar.gz.

File metadata

Download URL: idxr-0.1.0.tar.gz
Upload date: Oct 31, 2025
Size: 3.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for idxr-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`173c9b70a8e7031395bcb000b74a22d202efeb03cc7f5d40b54c2c8d06296266`
MD5	`aa14c8aaf17ece955dcd269631dbcdaa`
BLAKE2b-256	`837f86ca67320a5699c6fbbc42adcd2c5b0f1ed44952dda9b74bb55eef659b4e`

File details

Details for the file idxr-0.1.0-py3-none-any.whl.

File metadata

Download URL: idxr-0.1.0-py3-none-any.whl
Upload date: Oct 31, 2025
Size: 3.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for idxr-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c456231d97304a32218d6f68ed955d72802833661074ca5b446766df4554d46d`
MD5	`526cfc228b10fa66c6d8a703376bb00f`
BLAKE2b-256	`9ac23420a90ecb0a7917820c81cbbe91a30f445197e73e3e1f0563b623a66bc8`
