Vectorization and dataset preparation utilities for Adri Agents.

Indexer Package Overview

Indexer covers the whole lifecycle of turning raw SAP ECC exports into a production-ready vector store. The package bundles two complementary CLIs (prepare_datasets.py and vectorize.py) plus shared libraries that handle schema management, data validation, partition manifests, resume metadata, and drop/remediation flows.

Lifecycle in Three Acts

  1. Prepare – prepare_datasets.py stubs configs, sanitises CSVs, deduplicates rows, and writes manifest-tracked partitions. It detects schema changes, versions them, and marks older partitions stale.
  2. Index – vectorize.py streams validated partitions into Chroma (local or cloud), manages resume state, enforces batch token limits, and keeps partition-level persistence for incremental replays.
  3. Remediate – Both tools support drop planning and application so you can retire stale data, purge partitions, or re-ingest after schema shifts without losing traceability.
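
The schema-change handling described in the Prepare step can be pictured with a minimal sketch. This is not the package's actual code: the manifest layout, the function names (schema_fingerprint, register_partition), and the field names are all hypothetical, chosen only to illustrate how a change of schema could bump a version and mark older partitions stale.

```python
import hashlib
import json


def schema_fingerprint(columns):
    """Hash an ordered list of (name, type) pairs into a short, stable id."""
    payload = json.dumps(list(columns), separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def register_partition(manifest, name, columns):
    """Add a partition; on a schema change, bump the version and mark older partitions stale."""
    fingerprint = schema_fingerprint(columns)
    if fingerprint != manifest.get("schema_fingerprint"):
        manifest["schema_version"] = manifest.get("schema_version", 0) + 1
        manifest["schema_fingerprint"] = fingerprint
        for partition in manifest.setdefault("partitions", []):
            partition["stale"] = True  # written under the previous schema
    manifest.setdefault("partitions", []).append(
        {"name": name, "schema_version": manifest["schema_version"], "stale": False}
    )
    return manifest


# Hypothetical manifest: two partitions share a schema, the third adds a column.
manifest = {}
register_partition(manifest, "p0001", [("MATNR", "str"), ("MAKTX", "str")])
register_partition(manifest, "p0002", [("MATNR", "str"), ("MAKTX", "str")])
register_partition(manifest, "p0003", [("MATNR", "str"), ("MAKTX", "str"), ("MTART", "str")])
```

After the third call the manifest is at schema version 2, and the first two partitions carry the stale flag, which is the kind of state a later drop or re-ingest step could act on.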

Feature Highlights

  • 🔄 Schema-aware manifest with versioning, stale partition tracking, and per-model metadata.
  • 🧠 Resume-friendly ingestion (per model and per partition) that records file offsets and row indices.
  • 🩹 CSV repair safeguards; prepare_datasets.py stitches newline-fractured rows around a configured malformed column before deciding to drop them.
  • Digest caches keep reruns fast by persisting per-partition row hashes alongside the CSVs.
  • 🧵 Parallel partition runs fan out manifest indexing via --parallel-partitions when you want multiple partitions embedding at once.
  • 🗂️ Pluggable collection strategies let Chroma Cloud index each partition into its own collection while local runs stick with a single name.
  • 🚮 Stale cleanup parity – when partitions map to standalone collections, --delete-stale drops those collections wholesale before rebuilding.
  • 🎯 E2E sampling mode (--e2e-test-run) indexes random rows per CSV and writes an audit log so you can validate pipelines without processing millions of records.
  • 🧩 Pluggable model registries; pass --model <module:REGISTRY> to target ECC defaults or your own knowledge domain.
  • ☁️ Chroma transport flexibility; switch between persistent client and HTTP/Cloud with a couple of flags.
  • 🧹 Drop + remediation tooling that mirrors the migration workflow and keeps audit history.
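
To make the CSV repair idea above concrete, here is a simplified sketch of stitching newline-fractured rows. It is an illustration under stated assumptions, not what prepare_datasets.py does internally: the function name stitch_rows is hypothetical, fragments are joined with a single space, and repair is driven purely by the expected field count rather than by a configured malformed column.

```python
import csv
import io


def stitch_rows(lines, expected_fields, delimiter=","):
    """Merge consecutive short lines until they form a full-width row.

    Returns (rows, dropped): repaired rows as lists of fields, and raw
    fragments that could not be repaired.
    """
    rows, dropped, pending = [], [], ""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines entirely
        candidate = f"{pending} {line}" if pending else line
        fields = next(csv.reader(io.StringIO(candidate), delimiter=delimiter))
        if len(fields) == expected_fields:
            rows.append(fields)
            pending = ""
        elif len(fields) < expected_fields:
            pending = candidate  # still short: wait for the next fragment
        else:
            dropped.append(candidate)  # too wide: cannot be repaired, drop it
            pending = ""
    if pending:
        dropped.append(pending)  # trailing fragment never completed
    return rows, dropped


raw_lines = [
    "1001,Hex nut,EA",
    "1002,Steel bolt",  # row broken by a stray newline in the description
    "M8,EA",
    "1003,Washer,EA,extra",  # too many fields
]
rows, dropped = stitch_rows(raw_lines, expected_fields=3)
```

Here the second and third lines are stitched back into one three-field row, while the over-wide last line ends up in dropped. A real repair pass would likely need to know which column is prone to embedded newlines, as the feature description suggests.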

Quick Start

# 1) Describe your registry target once
export MODEL_REGISTRY_TARGET="kb.std.ecc_6_0_ehp_7.registry:MODEL_REGISTRY"

# 2) Scaffold a dataset config
prepare_datasets.py new-config ecc-foundation --model "$MODEL_REGISTRY_TARGET"

# 3) Produce partitions (idempotent; MVCC-friendly)
prepare_datasets.py \
  --model "$MODEL_REGISTRY_TARGET" \
  --config configs/ecc_foundation.json \
  --output-root build/partitions

# 4) Index into Chroma and resume safely across runs
vectorize.py index \
  --model "$MODEL_REGISTRY_TARGET" \
  --partition-manifest build/partitions/manifest.json \
  --partition-out-dir build/vector \
  --collection ecc-std \
  --resume
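
What resuming safely across runs could look like under the hood is sketched below. This is a hedged illustration only: the JSON state file, the function names (load_state, save_state, index_partition), and the fail_after switch used to simulate an interruption are all hypothetical, and a plain list stands in for the embedding and Chroma upsert step.

```python
import json
import os
import tempfile


def load_state(path):
    """Return saved resume state, or an empty dict on the first run."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    return {}


def save_state(path, state):
    """Write state via a temp file and rename so a crash cannot leave it half-written."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def index_partition(name, rows, state_path, batch_size, sink, fail_after=None):
    """Send rows to sink in batches, checkpointing the next row index after each batch."""
    state = load_state(state_path)
    start = state.get(name, 0)  # first row that has not been indexed yet
    for begin in range(start, len(rows), batch_size):
        if fail_after is not None and begin >= fail_after:
            raise RuntimeError("simulated interruption")
        batch = rows[begin:begin + batch_size]
        sink.extend(batch)  # stand-in for an embedding + upsert call
        state[name] = begin + len(batch)
        save_state(state_path, state)


rows = [f"row-{i}" for i in range(10)]
sink = []
with tempfile.TemporaryDirectory() as workdir:
    state_path = os.path.join(workdir, "resume.json")
    try:
        index_partition("p0001", rows, state_path, 4, sink, fail_after=4)
    except RuntimeError:
        pass  # the first run stops after one batch
    index_partition("p0001", rows, state_path, 4, sink)  # the second run picks up at row 4
    final_state = load_state(state_path)
```

Checkpointing after each batch means a rerun repeats at most one batch of work, and with an idempotent upsert even that repeat would be harmless.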

Looking for deeper dives? Check out DOC.md for a narrative walkthrough and FAQ.md for operational tips.
