Document ingestion pipeline — transforms documents into a queryable knowledge store
Ingestible
Turn documents into token-efficient, searchable knowledge stores for AI.
Instead of dumping 90,000 tokens into an LLM context window, Ingestible gives your AI a structured map of the document and hybrid search across three indexes — so each query costs ~1,000-2,000 tokens instead.
513-page book: 92,598 tokens full → ~1,317 tokens per query (99% reduction)
55-page paper: 4,975 tokens full → ~585 tokens per query (88% reduction)
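The reduction percentages follow directly from the token counts quoted above. A quick check in Python (the helper name `reduction_pct` is only for illustration and is not part of Ingestible):

```python
def reduction_pct(full_tokens: int, per_query_tokens: int) -> float:
    """Percentage of tokens saved per query versus sending the full document."""
    return 100.0 * (1 - per_query_tokens / full_tokens)

# Token counts taken from the two examples above.
book = reduction_pct(92_598, 1_317)
paper = reduction_pct(4_975, 585)
print(f"book: {book:.1f}%  paper: {paper:.1f}%")  # book: 98.6%  paper: 88.2%
```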
Install
Option 1: pip (recommended)
pip install ingestible # base install (~50MB) — uses API embeddings
To use local embeddings (no API keys needed, runs fully offline):
pip install ingestible[local] # adds torch + sentence-transformers + ChromaDB (~2GB)
Which to choose? Use `ingestible` if you have an OpenAI/Cohere/Voyage API key. Use `ingestible[local]` if you want zero cloud dependencies.
Optional extras (combine with comma: pip install ingestible[local,audio,cloud]):
| Extra | What it adds |
|---|---|
| `local` | Local embeddings: sentence-transformers, ChromaDB, torch |
| `pgvector` | PostgreSQL pgvector backend: psycopg, pgvector |
| `gemini` | Google Gemini LLM + embeddings: google-genai |
| `audio` | Audio/video transcription: faster-whisper |
| `cloud` | S3, GCS, Azure Blob connectors: boto3, google-cloud-storage, azure-storage-blob |
| `mcp` | MCP server for AI agent integration |
| `watch` | File watcher: watchdog |
| `cohere` | Cohere embedding provider |
| `voyage` | Voyage embedding provider |
Option 2: Docker
# Pull and run (includes all dependencies, ready to go)
docker run -d \
-p 8081:8081 \
-v ingestible-data:/app/data \
ghcr.io/simplyliz/ingestible:latest
The API and web UI are at http://localhost:8081. Data persists via the Docker volume.
With environment variables:
docker run -d \
-p 8081:8081 \
-v ingestible-data:/app/data \
-e INGEST_ANTHROPIC_API_KEY=sk-ant-... \
ghcr.io/simplyliz/ingestible:latest
Or with docker-compose (clone the repo first for .env.example):
cp .env.example .env # edit with your API keys
docker compose up -d
Inside the container, the API is served by gunicorn with multiple workers. Monitor it via /health/ready and /metrics.
From source (for development)
git clone https://github.com/simplyliz/Ingestible.git
cd Ingestible
python3.12 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
Quickstart
# Ingest a document (no API keys needed — skips LLM enrichment, builds all search indexes)
ingest ingest /path/to/document.pdf -v --skip-enrichment
# List ingested documents
ingest list
# Search
ingest search <doc_id> "your query here"
# Parse only — get structured chunks as JSONL, no storage
ingest parse /path/to/document.pdf
First run downloads the E5-large-v2 embedding model (~1.3 GB). Runs locally on CPU / Apple Silicon MPS / CUDA.
With LLM enrichment
Enrichment adds summaries, hypothetical questions, and concept tags to each chunk — significantly improving search precision. One-time cost per document.
cp .env.example .env # add your Anthropic or OpenAI API key
ingest ingest /path/to/document.pdf -v
| Document size | Estimated cost (gpt-4o-mini) | Time |
|---|---|---|
| 55 pages | ~$0.01 | ~1 min |
| 500 pages | ~$0.20 | ~5 min |
| 1,000 pages | ~$0.50 | ~10 min |
How It Works
graph LR
A["Document"] --> B["Parse"]
B --> C["Structure"]
C --> D["Chunk"]
D --> E["Enrich"]
E --> F["Embed + Index"]
F --> G["Store"]
| Stage | What happens |
|---|---|
| Parse | Format-specific extraction → clean markdown. PDF uses IBM Docling for deep layout analysis, PyMuPDF fallback, automatic OCR if text is sparse. |
| Structure | Builds hierarchy tree from TOC tables, heading patterns, or page range heuristics. |
| Chunk | Splits into 4 levels (L0-L3). Tables and code blocks stay atomic. ~10% overlap. Small trailing chunks get merged. |
| Enrich | Bottom-up LLM pass (L3→L0) generates summaries, concepts, hypothetical questions, entities. Skippable. |
| Embed | E5-large-v2 vectors (ChromaDB, auto-detected CUDA/MPS/CPU) + BM25 sparse index + concept→chunk mapping. |
| Store | JSON file hierarchy under data/documents/{doc_id}/. |
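To make the Chunk stage more concrete, here is a minimal sketch of paragraph-based packing with a token budget and trailing-chunk merging. It is an illustration under stated assumptions, not Ingestible's implementation: tokens are approximated by whitespace-separated words, overlap and atomic tables/code blocks are left out, and the function name and `min_tokens` threshold are hypothetical.

```python
def pack_paragraphs(paragraphs: list[str], max_tokens: int = 500,
                    min_tokens: int = 50) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    Token counts are approximated by word counts (an assumption; a real
    pipeline would use the model tokenizer).
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk instead of splitting a paragraph mid-way.
        if current and current_len + n > max_tokens:
            chunks.append(current)
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append(current)
    # Merge a small trailing chunk into its predecessor
    # (the merged chunk may slightly exceed max_tokens).
    if len(chunks) > 1 and sum(len(p.split()) for p in chunks[-1]) < min_tokens:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return ["\n\n".join(c) for c in chunks]

paras = ["a " * 400, "b " * 480, "c " * 30]
chunks = pack_paragraphs(paras, max_tokens=500, min_tokens=50)
print([len(c.split()) for c in chunks])  # [400, 510]
```

The 30-word tail would otherwise form its own chunk; merging it keeps every passage large enough to be a useful search target.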
The 4-level chunk hierarchy:
| Level | What | Size | Purpose |
|---|---|---|---|
| L0 | Document overview + TOC + executive summary | ~500-800 tokens | Map of the entire document |
| L1 | Chapters | ~300-500 tokens | Browsing units |
| L2 | Sections | ~200-400 tokens | Section summaries |
| L3 | Passages | ~250-500 tokens | Primary search targets |
Three search indexes, fused with Reciprocal Rank Fusion (RRF):
- Vector (ChromaDB + E5-large-v2) — semantic meaning
- BM25 (rank-bm25) — keyword matching
- Concept index — direct concept-to-chunk lookup
Passages found by multiple indexes rank higher. RRF combines rank positions rather than raw scores, so no score normalization is needed.
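Reciprocal Rank Fusion can be sketched in a few lines. The version below uses the common formulation `score(d) = sum(w_i / (k + rank_i(d)))` with `k = 60`; the constant, the optional per-index weights, and the function name `rrf_fuse` are assumptions for illustration, and Ingestible's actual fusion code may differ.

```python
from collections import defaultdict
from typing import Optional

def rrf_fuse(rankings: dict[str, list[str]], k: int = 60,
             weights: Optional[dict[str, float]] = None) -> list[tuple[str, float]]:
    """Fuse ranked chunk-ID lists from several indexes into one ranking.

    Each index contributes w / (k + rank) for every chunk it returns,
    with ranks starting at 1. Only rank positions are used, never raw scores.
    """
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    for index_name, ranked_ids in rankings.items():
        w = weights.get(index_name, 1.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += w / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from the three indexes.
rankings = {
    "vector": ["c7", "c2", "c9"],
    "bm25": ["c2", "c4"],
    "concept": ["c2", "c7"],
}
fused = rrf_fuse(rankings)
print([chunk_id for chunk_id, _ in fused])  # ['c2', 'c7', 'c4', 'c9']
```

`c2` wins because all three indexes return it, even though it is not the top vector hit.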
Features
Core
- 25+ input formats — PDF, DOCX, HTML, EPUB, PPTX, XLSX, CSV, Markdown, RST, AsciiDoc, plain text, audio, video, images, email, XML, JSON/JSONL, ZIP (Notion, Confluence, generic)
- 4-level chunk hierarchy — L0 overview → L1 chapters → L2 sections → L3 passages, no mid-paragraph splits
- Chunking strategies — paragraph, semantic, recursive, docling
- LLM enrichment — summaries, concepts, hypothetical questions, entities, cross-references (one-time cost, skippable)
- Hybrid search — vector + BM25 + concept index, fused with RRF. Optional cross-encoder reranking, SPLADE sparse retrieval.
- Cross-document corpus search — query across all ingested documents at once
- Extraction profiles — auto-detected (paper, article, documentation, general) with tailored enrichment
- Knowledge graph extraction — entity-relationship triples from enrichment
Production
- Rate limiting per endpoint tier, configurable
- Structured JSON logging and Prometheus metrics (`/metrics`)
- Docker-ready: gunicorn with multiple workers, persistent volumes, health checks (`/health/ready`)
- API auth: key-based authentication
- Background ingestion with checkpoint/resume and file locking
- Smart re-ingestion — content-hash dedup, selective re-enrichment (unchanged chunks skip LLM), version archival
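The selective re-enrichment idea in the list above can be sketched with one content hash per chunk: only chunks whose hash was not present in the previous version need a new LLM call. This is a minimal sketch; the function names are hypothetical, and the choice of SHA-256 is an assumption rather than a documented detail of Ingestible.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable SHA-256 fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_needing_enrichment(new_chunks: list[str],
                              previous_hashes: set[str]) -> list[str]:
    """Return only the chunks whose content was not seen in the previous version."""
    return [c for c in new_chunks if content_hash(c) not in previous_hashes]

old = ["Intro paragraph.", "Methods paragraph."]
new = ["Intro paragraph.", "Methods paragraph, revised.", "New appendix."]
seen = {content_hash(c) for c in old}
changed = chunks_needing_enrichment(new, seen)
print(changed)  # ['Methods paragraph, revised.', 'New appendix.']
```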
Integrations
- MCP server — AI agents (Claude, GPT) can ingest and search directly via Model Context Protocol
- Cloud storage: ingest from S3, GCS, Azure Blob (`s3://`, `gs://`, `az://`)
- Parse mode: `ingest parse` outputs structured chunks as JSONL without storing. Feed directly to external systems.
- Export: JSONL, Parquet, LlamaIndex, LangChain formats
- File watcher: `ingest watch` monitors a directory and auto-ingests on changes
- Retrieval evaluation: Hit Rate, MRR, Precision@K with synthetic or manual queries
- Document access control — per-document access tags, filtered at retrieval time
- Full data lifecycle audit trail — ingest, search, delete, export, re-enrich, webhook events logged to JSONL with user identity and timestamps
- LLM providers — Anthropic, OpenAI, Gemini, Ollama (local)
- Embedding providers — local (sentence-transformers), OpenAI, Cohere, Voyage, Gemini
- Vector backends — ChromaDB (default), pgvector, Qdrant
- Zero cloud dependencies: with the `[local]` extra and `--skip-enrichment`, everything runs offline
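The retrieval evaluation metrics named in the list above (Hit Rate, MRR, Precision@K) have standard definitions. The sketch below implements those textbook definitions over lists of returned chunk IDs; it is not the code behind `ingest eval`, and the function names and sample data are hypothetical.

```python
def hit_rate(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:k]))
    return hits / len(results)

def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean of 1/rank of the first relevant chunk per query (0 if none is found)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)

def precision_at_k(ranked: list[str], rel: set[str], k: int) -> float:
    """Share of the top k results that are relevant, for a single query."""
    return sum(1 for chunk_id in ranked[:k] if chunk_id in rel) / k

# Two queries: the first finds its relevant chunk at rank 2, the second misses.
results = [["a", "b", "c"], ["d", "e", "f"]]
relevant = [{"b"}, {"x"}]
print(hit_rate(results, relevant, k=3))  # 0.5
print(mrr(results, relevant))            # 0.25
```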
Supported Formats
| Format | Extensions | Notes |
|---|---|---|
| PDF | `.pdf` | Text + scanned/OCR via Docling |
| Markdown | `.md` | |
| DOCX | `.docx` | |
| HTML | `.html`, `.htm` | |
| EPUB | `.epub` | |
| PowerPoint | `.pptx` | |
| Excel | `.xlsx` | |
| CSV | `.csv` | |
| reStructuredText | `.rst` | |
| AsciiDoc | `.adoc` | |
| Plain text | `.txt` | |
| Audio | `.mp3`, `.wav`, `.m4a`, `.flac`, `.ogg`, `.wma`, `.aac`, `.opus` | ASR via faster-whisper, timestamped |
| Video | `.mp4`, `.mkv`, `.avi`, `.mov`, `.webm`, `.wmv`, `.flv` | Audio extraction + ASR + keyframes |
| Images | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp` | OCR + layout via Docling VLM |
| Email | `.eml`, `.msg` | Headers, body, attachments |
| XML | `.xml` | Structured to markdown |
| JSON | `.json`, `.jsonl` | Object/array rendering |
| ZIP | `.zip` | Auto-detects Notion, Confluence, or generic |
Extraction Profiles
| Profile | Detects via | Extras |
|---|---|---|
| `paper` | Academic headings (Abstract, Methodology, References...) | Citation extraction, methodology, key findings |
| `article` | HTML/Markdown without academic signals | Executive summary |
| `documentation` | Code blocks, API/install headings, `.rst`/`.adoc` format | Code-aware chunking |
| `general` | Fallback | Standard enrichment |
Override with --profile <name>, or enable LLM fallback for ambiguous documents with INGEST_PROFILE_LLM_FALLBACK=true.
Configuration
All settings via environment variables (INGEST_* prefix), .env file, or ingestible.toml.
# LLM provider (default: anthropic)
INGEST_LLM_PROVIDER=anthropic # anthropic | openai | gemini | ollama
INGEST_ANTHROPIC_API_KEY=sk-ant-...
INGEST_ANTHROPIC_MODEL=claude-haiku-4-5-20251001
# INGEST_OPENAI_API_KEY=sk-... # if using openai provider
# INGEST_GEMINI_API_KEY=AIza... # if using gemini provider
# INGEST_OLLAMA_BASE_URL=http://localhost:11434 # if using ollama (no key needed)
# Embeddings
INGEST_EMBEDDING_PROVIDER=local # local | openai | cohere | voyage | gemini
INGEST_EMBEDDING_MODEL=intfloat/e5-large-v2
INGEST_EMBEDDING_DEVICE=auto # auto | cuda | mps | cpu
INGEST_EMBEDDING_DIMENSIONS= # Matryoshka truncation (empty = full)
INGEST_EMBEDDING_QUANTIZATION=none # none | binary
# Chunking
INGEST_CHUNKING_STRATEGY=paragraph # paragraph | semantic | recursive | docling
INGEST_MAX_CHUNK_TOKENS=500
INGEST_CONTEXTUAL_CHUNKING=false
# Search
INGEST_VECTOR_WEIGHT=0.7
INGEST_BM25_WEIGHT=0.3
INGEST_SPARSE_RETRIEVAL=bm25 # bm25 | splade
INGEST_RERANKER_MODEL= # cross-encoder model (empty = disabled)
# API & auth
INGEST_API_KEYS=key1,key2 # empty = no auth
INGEST_RATE_LIMIT_INGEST=10/minute
INGEST_MAX_UPLOAD_BYTES=500000000 # 500 MB
# Access control & audit (off by default)
INGEST_ACCESS_CONTROL_ENABLED=false
INGEST_AUDIT_ENABLED=false
# Production
INGEST_LLM_TIMEOUT=120 # seconds per LLM call
INGEST_LOG_JSON=true # structured JSON logging
See the Usage Guide for the full list.
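Scripts that sit next to Ingestible sometimes need to read the same settings. The sketch below collects `INGEST_*` variables from an environment mapping; it illustrates the prefix convention only and is not how Ingestible loads its configuration (it ignores `.env` and `ingestible.toml`). The function name is hypothetical; `ingest config` remains the way to see the effective values.

```python
import os
from typing import Mapping, Optional

def read_ingest_settings(environ: Optional[Mapping[str, str]] = None,
                         prefix: str = "INGEST_") -> dict[str, str]:
    """Collect prefixed variables, keyed by the lower-cased name without the prefix."""
    env = os.environ if environ is None else environ
    return {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }

# Example with an explicit mapping instead of the real process environment.
settings = read_ingest_settings({
    "INGEST_VECTOR_WEIGHT": "0.7",
    "INGEST_BM25_WEIGHT": "0.3",
    "PATH": "/usr/bin",
})
weights = {name: float(value) for name, value in settings.items()
           if name.endswith("_weight")}
print(weights)  # {'vector_weight': 0.7, 'bm25_weight': 0.3}
```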
CLI Reference
ingest ingest <path> # Ingest a file or directory
ingest parse <path> # Parse only — JSONL to stdout, no storage
ingest list # List all ingested documents
ingest search <doc_id> <query> # Search within a document
ingest corpus-search <query> # Search across all documents
ingest enrich <doc_id> # Re-enrich without re-parsing
ingest export <doc_id> # Export (jsonl, parquet, llamaindex, langchain)
ingest versions <doc_id> # Show document version history
ingest config # Show effective configuration
ingest serve # Start web UI + API server
ingest watch <dir> # Watch directory for changes, auto-ingest
ingest cleanup # Remove stale checkpoints and temp files
ingest export-cv [doc_id] # Export to CognitiveVault
ingest eval <doc_id> # Evaluate retrieval quality
ingest audit # View search audit trail
ingest mcp # Start MCP server for AI agents
Ingest options
# --profile          auto | paper | article | documentation | general
# --chunking         paragraph | semantic | recursive | docling
# --parallel         concurrent file processing
# --force            re-ingest even if unchanged
# --skip-enrichment  skip LLM enrichment
# --no-checkpoint    disable checkpoint/resume
# -v                 verbose logging
ingest ingest /path/to/docs/ \
  --profile auto \
  --chunking paragraph \
  --parallel 4 \
  --force \
  --skip-enrichment \
  --no-checkpoint \
  -v
Parse options
# --profile   auto | paper | article | documentation | general
# --chunking  paragraph | semantic | recursive | docling
# -o          output file (default: stdout)
# -v          verbose logging to stderr
ingest parse /path/to/docs/ \
  --profile auto \
  --chunking paragraph \
  -o chunks.jsonl \
  -v
Parse Mode
ingest parse runs the parsing and chunking pipeline without storing anything. It outputs structured JSONL chunks to stdout or a file — designed for feeding external systems like training pipelines, knowledge graphs, or RAG backends.
# Single document → stdout
ingest parse paper.pdf
# Directory → file
ingest parse ~/Documents/Papers/ -o chunks.jsonl
# With options
ingest parse paper.pdf --profile paper --chunking semantic -v
What it runs: Parse → Clean → Structure detection → Chunk → Validate. No embeddings, no search indexes, no enrichment, no storage. Uses a temp directory that's cleaned up after.
Output format (JSONL):
{"id": "doc-chunk-001", "doc_id": "paper-a1b2c3", "doc_title": "Paper Title", "content": "The chunk text...", "summary": null, "chapter": "Introduction", "section": "Background", "page_start": 3, "page_end": 3, "concepts": [], "keywords": [], "profile": "paper"}
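Because each line is a standalone JSON object, the output can be consumed with the standard library alone. The sketch below groups chunk text by chapter, using only fields that appear in the sample record above; the function name `group_by_chapter` and the `"(no chapter)"` fallback label are hypothetical.

```python
import json
from collections import defaultdict
from typing import Iterable

def group_by_chapter(lines: Iterable[str]) -> dict[str, list[str]]:
    """Parse JSONL lines and collect each chunk's content under its chapter."""
    chapters: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        chunk = json.loads(line)
        chapters[chunk.get("chapter") or "(no chapter)"].append(chunk["content"])
    return dict(chapters)

# The sample record shown above, as one JSONL line.
sample = [
    '{"id": "doc-chunk-001", "doc_id": "paper-a1b2c3", "doc_title": "Paper Title", '
    '"content": "The chunk text...", "summary": null, "chapter": "Introduction", '
    '"section": "Background", "page_start": 3, "page_end": 3, "concepts": [], '
    '"keywords": [], "profile": "paper"}'
]
grouped = group_by_chapter(sample)
print(grouped)  # {'Introduction': ['The chunk text...']}

# With a file written by `ingest parse ... -o chunks.jsonl`:
# with open("chunks.jsonl", encoding="utf-8") as fh:
#     grouped = group_by_chapter(fh)
```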
Use cases:
- Feeding training data to ML pipelines (e.g., ANCS VBC model training)
- Populating external vector databases
- Building custom RAG systems that handle their own storage
- Batch extraction scripts that need structured text from PDFs
Compared to ingest ingest + ingest export:
| | `ingest ingest` + `export` | `ingest parse` |
|---|---|---|
| Storage | Writes to `data/documents/` | Nothing persisted |
| Indexes | Builds BM25 + vector + concept | None |
| Embeddings | Computed and stored | Skipped |
| Output | Export per doc_id | Single JSONL stream |
| Speed | Slower (indexing overhead) | Faster (parse + chunk only) |
| Use case | Search and retrieval | External consumption |
Compliance & Data Governance
Ingestible is designed for regulated environments. See COMPLIANCE.md for full details on GDPR, EU AI Act, and ISO mapping.
On-prem / air-gapped deployment
With pip install ingestible[local] and --skip-enrichment, zero data leaves your infrastructure. No API calls, no cloud dependencies, no external network access. Embeddings run locally, search runs locally, storage is local JSON on disk. This is the strongest compliance posture.
Standard deployment (with LLM enrichment)
When LLM enrichment is enabled, chunk text (~250-500 tokens each) is sent to the configured LLM provider (Anthropic or OpenAI). Both are EU-US Data Privacy Framework certified. The deploying organization must sign the provider's DPA.
What's built in
| Capability | Details |
|---|---|
| Data lifecycle audit | Every ingest, search, delete, export, re-enrich, and webhook event logged to data/audit.jsonl with user identity and timestamps |
| Deletion proof | DELETE operations are audit-logged — evidence for GDPR Art. 17 right to erasure |
| User identity | X-User-ID header (from auth proxy) or bearer token. Propagated to all audit events and structured logs |
| Access control | Per-document access tags, filtered at retrieval time (not post-filtering) |
| Request tracing | X-Request-ID header generated/propagated through all logs |
| Version-aware search | Superseded content from old document versions ranked lower, never silently served as current |
| AI transparency | AI-generated fields clearly named (summary, hypothetical_questions, kg_triples). Non-AI fields are deterministic |
| Quality monitoring | ingest eval measures retrieval precision/recall against synthetic or manual test sets |
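The difference between filtering at retrieval time and post-filtering is easiest to see in code. In the sketch below, chunks the caller may not see are dropped before the top-k cut, so restricted chunks never displace permitted ones. The `Chunk` type, the function name, and the rule that untagged chunks are visible to everyone are assumptions for illustration, not Ingestible's documented behavior.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    score: float
    access_tags: set[str] = field(default_factory=set)

def search_with_access(candidates: list[Chunk], user_tags: set[str],
                       top_k: int) -> list[Chunk]:
    """Filter by access tags first, then rank and keep the top_k chunks."""
    allowed = [
        c for c in candidates
        # Assumption: a chunk without tags is visible to all users.
        if not c.access_tags or c.access_tags & user_tags
    ]
    return sorted(allowed, key=lambda c: c.score, reverse=True)[:top_k]

candidates = [
    Chunk("c1", 0.9, {"hr"}),
    Chunk("c2", 0.8, {"eng"}),
    Chunk("c3", 0.7, {"eng"}),
    Chunk("c4", 0.6),
]
visible = search_with_access(candidates, user_tags={"eng"}, top_k=2)
print([c.chunk_id for c in visible])  # ['c2', 'c3']
```

Post-filtering the same data (take the top 2, then remove `c1`) would return only `c2`, one result short of what the user asked for.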
EU AI Act classification
Ingestible is a data processing pipeline, not an AI system under Art. 3(1). EU AI Act obligations apply to the LLM providers (Anthropic, OpenAI) and to deployers of high-risk systems that use Ingestible as a component — not to Ingestible itself.
Documentation
- Architecture Overview — pipeline stages, data flow, project structure
- How Search Works — vector search, BM25, concept index, RRF fusion
- Usage Guide — installation, CLI commands, configuration
- REST API Reference — all HTTP endpoints
- Token Economics — why hierarchical retrieval matters, real-world numbers
- CognitiveVault Integration — export to CognitiveVault
- Compliance & Data Governance — GDPR, EU AI Act, ISO 27001, Austrian DSG
- Roadmap
- Changelog
License
PolyForm Small Business 1.0.0 — free for individuals, small businesses, nonprofits, and open-source projects.
Need a commercial license? See COMMERCIAL-LICENSE.md.