Make directories AI-ready, not just files — turn a directory into a portable knowledge space.

indx

Make directories AI-ready — not just files.

Point indx at a folder and get back a knowledge space: structure, folder lineage, file-to-file relationships, and semantic metadata that AI agents and RAG systems can actually reason over.


A parser turns one PDF into clean text. indx turns an entire folder into a knowledge space — and keeps the things parsers throw away: the folder a file lived in, the documents beside it, the report it continues, the contract it references, and what kind of document it is.

The thesis: most real knowledge doesn't live in a single file — it lives in the arrangement of files. indx keeps that map and hands it to your agent.

indx composes file parsers (Docling, Unstructured, LlamaParse, MarkItDown, …) rather than replacing them, then layers on what they discard. Every major component — parser, LLM, VLM, embedder, vector store, output — is a swappable, typed slot with a sensible default. Open-source · Python · CLI + SDK · Apache-2.0.

See it in one command: indx demo

No data of your own, no extra dependencies, no API keys. After a plain pip install, indx demo builds, inspects, and queries a bundled sample corpus fully offline, all in a single command:

pip install indx
indx demo
$ indx demo
indx demo — building a sample 'team handbook' knowledge space…

stage: walk
stage: parse
stage: chunk
stage: relate
stage: enrich
stage: embed-pack
✓ 7 docs · 7 chunks · 19 relations → /tmp/indx-demo-XXXX/demo (0.01s)
  components: parser=plaintext llm=none embedder=hash store=jsonl format=.indx

/tmp/indx-demo-XXXX/demo  schema=1 indx=0.0.1
  documents=7 chunks=7 relations=19 embeddings=7 embedding=hash/256
       Types                    Relations
  type       count        type         count
  markdown       6        references      14
  text           1        sibling          5

sample query (keyword/lexical, offline): how do I onboard?
  score  source                      text
  0.121  engineering/code-review.md  # Code Review  Code review keeps our codebase…
  0.098  people/remote-work.md       # Remote Work Policy  Acme Robotics is remote-…
  0.095  handbook/welcome.md         # Welcome to Acme Robotics  This is the Acme …

✓ that's the whole flow — built offline with keyword/lexical retrieval, no API key.
  run it on your own folder: indx ./your-docs --out ./ai-ready.indx --offline

(A trimmed, ANSI-stripped transcript of a real indx demo run.)

Now point it at your own folder:

indx ./docs --out ./ai-ready.indx --offline   # index a folder, fully offline, zero extra deps
indx inspect ./ai-ready.indx                   # structure, type histogram, relation sample
indx query   ./ai-ready.indx "how do I onboard?"
indx ask     ./ai-ready.indx "what is the leave policy?"   # answer with cited sources
indx app                                       # visual configure → build → inspect → query in the browser

A home knowledge base — one permanent DB, no path juggling

indx keeps a single persistent personal space under ~/.indx/ (override with $INDX_HOME). Build, add, query, and ask default to it when you pass no path — so it works like a personal notebook you can keep adding to:

indx add  ./notes/standup.md           # append a file to the home DB (no path = home)
indx add  ./reports                     # add a whole folder incrementally
indx query "what did we decide about retention?"   # query home, no archive argument
indx ask   "summarize this week's standups"        # answer with citations, offline-friendly
indx home stats                         # counts for the home space
indx home path                          # print the home dir

Incremental CRUD works on any .indx too — indx add <space> <path>, indx rm <space> <doc|path>, indx update <space> <path> mutate a sealed archive in place without a full rebuild.

Filter what gets indexed, run only the stages you need, compose spaces

# Conditional import: index only what matches (globs · extensions · size · depth · count)
indx ./repo --out ./code.indx --offline --ext md --ext py \
  --exclude '**/_drafts/**' --max-size 2mb --max-depth 3 --dry-run

# Granular stages: stop after chunking (no embeddings yet), or inspect one member
indx ./docs --out ./docs.indx --offline --through chunk
indx inspect ./docs.indx --part documents --json

# indx of indx: one parent space that federates query/inspect across child archives
indx compose ./all.indx --add ./eng.indx --add ./design.indx
indx query   ./all.indx "onboarding checklist"      # hits drawn from every child, globally ranked
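
To make "globally ranked" concrete, here is a minimal sketch of merging per-child hit lists into one ranking. It is not indx's implementation: Hit, federated_top_k, and the sample paths and scores are hypothetical, and the sketch assumes scores from different child archives are directly comparable.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit; not the indx SDK type."""
    child: str
    source: str
    score: float


def federated_top_k(per_child: Dict[str, List[Hit]], k: int = 3) -> List[Hit]:
    """Merge hits from every child archive and keep the k best overall."""
    all_hits = (hit for hits in per_child.values() for hit in hits)
    # Higher score first; ties are broken by (child, source) so output is stable.
    return heapq.nsmallest(k, all_hits, key=lambda h: (-h.score, h.child, h.source))


# Made-up sample data standing in for two child archives.
per_child = {
    "eng.indx": [Hit("eng.indx", "onboarding/setup.md", 0.42),
                 Hit("eng.indx", "code-review.md", 0.10)],
    "design.indx": [Hit("design.indx", "onboarding/tools.md", 0.37),
                    Hit("design.indx", "brand.md", 0.05)],
}

top = federated_top_k(per_child, k=3)
for hit in top:
    print(f"{hit.score:.2f}  {hit.child}  {hit.source}")
```

If child archives were built with different embedders, raw scores may not be comparable, and a real federation layer would likely need to normalize them first.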

Why it matters: a chunk that remembers everything

A flat parse → split → embed → store pipeline gives you orphaned text fragments. indx gives you chunks that carry their whole context with them. Here's a single chunk as it appears in the readable index.json:

{
  "id": "chunk_0481",
  "doc_id": "doc_0007",
  "position": 12,
  "text": "Enterprise data is retained for 90 days…",
  "prev_id": "chunk_0480",
  "next_id": "chunk_0482",
  "source":   { "path": "policies/data/retention.pdf", "folder": "policies/data", "type": "policy" },
  "metadata": { "topics": ["retention", "compliance"], "summary": "90-day retention rule…", "tags": ["data-retention", "gdpr"] },
  "relations": [ { "src": "chunk_0481", "dst": "legal/gdpr.md", "type": "references", "score": 1.0 } ]
}

It knows where it came from (source), what it's about (metadata), what sits next to it (prev_id / next_id), and what it points to (relations). An agent can filter by location or type, expand the context window around a hit, and follow knowledge instead of just matching it. Ids are deterministic, so a knowledge space is diffable and reproducible.
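
As an illustration of "expand the context window around a hit", the sketch below walks prev_id / next_id links over plain dicts shaped like the chunk above. It does not use the indx SDK: the chunks mapping, expand, and in_folder are hypothetical helpers, and all chunk text other than the 90-day sentence is made-up sample data.

```python
from typing import Dict, List

POLICY = {"folder": "policies/data", "type": "policy"}

# Hypothetical in-memory index: chunk id -> chunk dict (a subset of the fields shown above).
chunks: Dict[str, dict] = {
    "chunk_0480": {"id": "chunk_0480", "text": "(sample) Scope of this policy.",
                   "prev_id": None, "next_id": "chunk_0481", "source": POLICY},
    "chunk_0481": {"id": "chunk_0481", "text": "Enterprise data is retained for 90 days…",
                   "prev_id": "chunk_0480", "next_id": "chunk_0482", "source": POLICY},
    "chunk_0482": {"id": "chunk_0482", "text": "(sample) Exceptions require approval.",
                   "prev_id": "chunk_0481", "next_id": None, "source": POLICY},
    "chunk_0900": {"id": "chunk_0900", "text": "(sample) Welcome!",
                   "prev_id": None, "next_id": None,
                   "source": {"folder": "handbook", "type": "markdown"}},
}


def _walk(start_id: str, link: str, limit: int) -> List[dict]:
    """Follow one link field ('prev_id' or 'next_id') for at most `limit` steps."""
    out: List[dict] = []
    cur = chunks[start_id].get(link)
    while cur is not None and cur in chunks and len(out) < limit:
        out.append(chunks[cur])
        cur = chunks[cur].get(link)
    return out


def expand(chunk_id: str, before: int = 1, after: int = 1) -> List[dict]:
    """Return the hit plus its neighbours, in reading order."""
    left = _walk(chunk_id, "prev_id", before)
    right = _walk(chunk_id, "next_id", after)
    return left[::-1] + [chunks[chunk_id]] + right


def in_folder(folder: str, doc_type: str) -> List[str]:
    """Filter chunk ids by source location and document type."""
    return [c["id"] for c in chunks.values()
            if c["source"]["folder"] == folder and c["source"]["type"] == doc_type]


window = expand("chunk_0481")
print(" ".join(c["text"] for c in window))
print(in_folder("policies/data", "policy"))
```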

How it works

indx is a pipeline of six ordered, individually-replaceable stages that share one mutable SpaceContext:

01 Walk → 02 Parse → 03 Chunk → 04 Relate → 05 Enrich → 06 Embed+Pack

The pipeline is a list you control: insert a stage (say, PII redaction before enrichment), swap one, or drop one (skip Enrich when no LLM is available) without touching its neighbors. The whole model is symmetric across the CLI and SDK:

from indx import DirectoryPipeline, KnowledgeSpace

# Build (default stack is cloud-backed; needs OPENAI_API_KEY)
space = DirectoryPipeline().run("./docs", "./ai-ready")

print(space.stats)                          # counts, timings, components used
for doc in space.documents(type="contract"):
    print(doc.path, doc.topics, doc.summary)

# Re-load the portable archive anywhere — no re-processing
space = KnowledgeSpace.load("./ai-ready/handbook.indx")
hits  = space.search("data retention", k=5)
answer = space.ask("how long is data retained?")     # extractive offline, or LLM-synthesized
print(answer.answer, answer.sources)

# Incremental CRUD — append / re-ingest / delete without a full rebuild, then reseal
space.add("./notes/new-policy.md")
space.update("./notes/new-policy.md")
space.remove("./notes/new-policy.md")
space.save("./ai-ready/handbook.indx")

# Selective load and filtered build
from indx import WalkFilter
docs   = KnowledgeSpace.load_part("./ai-ready/handbook.indx", "documents")  # one member only
filtered = DirectoryPipeline(filter=WalkFilter(ext=[".md"], max_size="2mb"))

The four core objects you need to know — KnowledgeSpace, Document, Chunk, Relation — are explained in Core concepts.
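
The "pipeline is a list" idea from this section can be sketched in isolation. The code below is not indx's internals: Context, make_stage, and redact_pii are hypothetical stand-ins, and the real stage signature and SpaceContext fields may differ. It only shows how inserting a stage before enrichment leaves the neighbouring stages untouched.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Context:
    """Hypothetical stand-in for the shared mutable context (not indx's SpaceContext)."""
    texts: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


Stage = Callable[[Context], None]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(ctx: Context) -> None:
        ctx.log.append(name)
    stage.__name__ = name
    return stage


def redact_pii(ctx: Context) -> None:
    """Illustrative extra stage: mask e-mail addresses before enrichment sees them."""
    email = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    ctx.texts = {path: email.sub("[redacted]", text) for path, text in ctx.texts.items()}
    ctx.log.append("redact_pii")


stages: List[Stage] = [make_stage(n) for n in
                       ("walk", "parse", "chunk", "relate", "enrich", "embed_pack")]

# Insert the redaction stage immediately before "enrich".
enrich_at = next(i for i, s in enumerate(stages) if s.__name__ == "enrich")
stages.insert(enrich_at, redact_pii)

ctx = Context(texts={"a.md": "contact jane@example.com today"})
for stage in stages:
    stage(ctx)  # every stage mutates the same shared context

print(ctx.log)
print(ctx.texts["a.md"])
```

Dropping a stage works the same way in this sketch: filter the list by name (for example, remove "enrich" when no LLM is available) before running it.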

Bring your own stack — no lock-in

Every slot is a typed interface with a default and zero lock-in. Mix and match by name:

Slot     | Default (cloud)               | Offline core  | Other built-ins
---------|-------------------------------|---------------|----------------
Parser   | docling                       | plaintext     | unstructured · llamaparse · markitdown · textract · docintel · docai
LLM      | openai:gpt-5-mini             | none          | ollama · anthropic · litellm · vllm · azure · bedrock · vertex
VLM      | none                          | none          | gpt4o · qwen-vl · local · bedrock · azure · vertex
Embedder | openai:text-embedding-3-small | hash          | bge-m3 · e5 · cohere · bedrock · azure · vertex · litellm
Store    | qdrant                        | jsonl (no DB) | pgvector · chroma · lancedb · s3vectors · opensearch · azure-search · bigquery · vertex-vector
Output   | .indx archive                 | .indx / jsonl | langchain · llamaindex

indx ./docs --out ./ai-ready --offline             # zero-dependency core: plaintext → hash → jsonl → .indx
indx ./docs --out ./ai-ready --store chroma        # override a single slot; everything else keeps its default

Three managed cloud profiles wire every slot to one vendor with a single install and a single flag:

pip install "indx[aws]"   && indx ./docs --out ./out --aws     # Textract → Bedrock → Titan → S3 Vectors
pip install "indx[azure]" && indx ./docs --out ./out --azure   # Doc Intelligence → Azure OpenAI → AI Search
pip install "indx[gcp]"   && indx ./docs --out ./out --gcp     # Document AI → Gemini → gemini-embedding → BigQuery

About the offline core: the hash embedder is a deterministic hashing trick, so offline query is keyword/lexical retrieval, not semantic vector search. True semantic search needs a real embedder extra (e.g. bge or openai). Likewise, the offline enrich step derives metadata (type, topics, tags, summary) locally, with no LLM call; LLM/VLM enrichment is opt-in via the cloud/local extras. The default stack (that is, without --offline) is cloud-backed: install it with pip install "indx[cloud]" and set OPENAI_API_KEY.
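
To show why a hashing-trick embedder behaves lexically, here is a minimal, generic sketch of feature hashing. It is an assumption-laden illustration, not indx's hash embedder: the tokenizer, the SHA-256 bucketing, the sign bit, and the names hash_embed and cosine are all hypothetical, and only the 256 dimensions echo the "hash/256" line in the demo transcript.

```python
import hashlib
import math
import re
from typing import List

DIM = 256  # echoes "embedding=hash/256" in the demo output; the scheme itself is assumed


def tokens(text: str) -> List[str]:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hash_embed(text: str, dim: int = DIM) -> List[float]:
    """Map each token to a signed bucket via a stable hash, then L2-normalize."""
    vec = [0.0] * dim
    for tok in tokens(text):
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


query = hash_embed("onboard new hires")
same_words = hash_embed("how to onboard new hires")
synonyms = hash_embed("welcoming recent joiners")

print(round(cosine(query, same_words), 3))  # shared tokens land in shared buckets
print(round(cosine(query, synonyms), 3))    # no shared tokens, so any score is collision noise
```

Because buckets depend only on the token's bytes, the same text always yields the same vector, and two texts score well only when they share words. That matches the lexical behavior described above; synonyms contribute nothing unless a real embedder is swapped in.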

Plug a knowledge space into an AI agent

A .indx archive is a portable knowledge space — carry it like a USB drive and plug it into any agent framework in one line:

from indx.agent import connect

kb = connect("ai-ready/handbook.indx")   # load the "USB drive"
tools = kb.openai()                       # OpenAI Agents SDK …or .langchain() / .pydantic_ai() / .claude()

Or serve it to any MCP client — Claude Desktop, Cursor, or the TypeScript Mastra framework — with no Python glue on the client side:

pip install "indx[agent]"            # all framework adapters + the MCP server
indx mcp ai-ready/handbook.indx      # serve indx_search / indx_overview / indx_get_document

Every connector exposes the same three read-only tools — search, overview, get-document — built on the same retrieval path as the CLI. See the AI agents guide.

Who it's for

  • RAG / agent engineers who want grounded context with relationships, not flat chunk soup.
  • Enterprise & air-gapped platform teams that need fully local, auditable, reproducible ingestion across large on-prem document estates — no byte leaves the network.
  • OSS developers & integrators who want a composable, no-lock-in library they can extend with their own parser, store, or output.
  • Researchers turning archives of papers, datasets, and notes into a navigable, citable, shareable knowledge graph.

indx is a build-time knowledge layer, not a runtime framework. It produces the portable archive that LangChain, LlamaIndex, agents, and vector DBs consume — use it with them, not instead of them. See the comparison.

Status

Alpha (0.0.1). The zero-dependency core path (plaintext → hash → jsonl → .indx) runs end to end and is fully air-gapped; reach it with indx demo or --offline. The optional cloud/local backends (docling, openai, ollama, bge-m3, qdrant, plus the managed AWS/Azure/GCP profiles, …) are implemented and selected through the registry: install the matching extra (e.g. pip install "indx[cloud]") and provide credentials to switch a slot onto it. The .indx format is at schema_version "2" (readers still load "1" archives; the children reference list is additive). Public APIs may still shift before 1.0; see the CHANGELOG and the docs.

Documentation

Full documentation — quickstart, concepts, the pipeline & stages, guides, and the complete CLI/SDK reference — lives at docs.indx.jp.

Development

python -m venv .venv && . .venv/bin/activate
pip install -e ".[dev]"
nox -s tests          # fast offline suite: unit + corpus
nox -l                # list every session (integration / docker / airgap / live / record-fixtures)

Contributions are welcome — see CONTRIBUTING.md, and Adding a backend to author a new slot implementation.

License

Apache-2.0.
