Turn messy files into agent-ready context.
ContextIQ
ContextIQ is a local-first ingestion pipeline for developers building RAG systems, agent memory layers, document search, and eval datasets.
Point it at a folder of mixed files and it produces clean, traceable JSONL and Markdown outputs that AI systems can actually use.
Why ContextIQ
Most AI tooling starts after your data is already clean. Real projects usually break much earlier:
- PDFs are noisy
- Word docs lose structure
- JSON and CSV need normalization
- repos and notes mix formats
- chunks become inconsistent
- source traceability gets lost
ContextIQ focuses on the missing middle: ingestion, normalization, chunking, and export.
Installation
Install from PyPI:
pip install contextiq
Run the CLI:
contextiq ingest ./docs --out ./build/context
Or with module execution:
python -m contextiq ingest ./docs --out ./build/context
Quickstart
Use the built-in example content:
contextiq ingest ./examples --out ./build/context
PowerShell example:
contextiq ingest .\examples --out .\build\context
Generated output:
- documents.jsonl - normalized source documents
- chunks.jsonl - chunked outputs for RAG and agents
- chunks.md - human-readable review output
- manifest.json - run summary, warnings, and config
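To get a first look at a run's output, the JSONL files can be read line by line with the standard library. The sketch below is a minimal, hedged example: it makes no assumption about ContextIQ's record schema and only reports how many records a file holds and which top-level keys appear. The helper names `read_jsonl` and `summarize` are illustrative and not part of ContextIQ.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path):
    """Return the record count and the sorted top-level keys seen in the file."""
    count = 0
    keys = set()
    for record in read_jsonl(path):
        count += 1
        if isinstance(record, dict):
            keys.update(record)
    return count, sorted(keys)
```

For example, `summarize("./build/context/chunks.jsonl")` after the quickstart run shows which fields are available before you wire the file into an indexer.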
What It Supports
Built-in file types
- .txt, .md, .rst
- .json, .jsonl
- .csv, .tsv
- .html, .htm
- optional .pdf via pypdf
- optional .docx via python-docx
Output behavior
- recursive directory ingestion
- normalized plain-text extraction
- document-aware chunking
- source-preserving metadata
- JSONL and Markdown export
- manifest output for reproducibility
CLI
Basic usage
contextiq ingest <path> --out <directory>
Useful flags
- --include-ext .md,.txt,.json
- --exclude-glob "*.min.js,*.lock"
- --chunk-size 1200
- --chunk-overlap 150
- --formats jsonl,markdown
- --fail-on-warning
Example commands
contextiq ingest ./docs --out ./dist/context --chunk-size 900 --chunk-overlap 120
contextiq ingest ./knowledge-base --out ./build/export --include-ext .md,.txt,.json
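If you want to trigger ingestion from a Python script or a build step, one option is to call the CLI through `subprocess`. The sketch below uses only the flags shown above and the `python -m contextiq` entry point from the installation section. The function names `build_ingest_command` and `run_ingest` are hypothetical helpers, not part of ContextIQ, and the exit-code behavior of the CLI is assumed rather than documented here.

```python
import subprocess
import sys


def build_ingest_command(source, out_dir, chunk_size=None, chunk_overlap=None, include_ext=None):
    """Assemble the argv list for `python -m contextiq ingest`."""
    command = [sys.executable, "-m", "contextiq", "ingest", str(source), "--out", str(out_dir)]
    if chunk_size is not None:
        command += ["--chunk-size", str(chunk_size)]
    if chunk_overlap is not None:
        command += ["--chunk-overlap", str(chunk_overlap)]
    if include_ext:
        # The CLI takes extensions as one comma-separated value.
        command += ["--include-ext", ",".join(include_ext)]
    return command


def run_ingest(source, out_dir, **options):
    """Run the CLI; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_ingest_command(source, out_dir, **options), check=True)
```

Calling `run_ingest("./docs", "./dist/context", chunk_size=900, chunk_overlap=120)` mirrors the first example command. Passing the arguments as a list avoids shell quoting differences between PowerShell and POSIX shells.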
How It Works
ContextIQ runs in four stages:
1. Discovery
Recursively finds supported files while skipping common noise such as virtualenvs, caches, and build directories.
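A discovery step of this kind can be sketched with `os.walk`, pruning noisy directories before descending into them. This is an illustration only, not ContextIQ's actual implementation: the skip list and the function name `discover` are assumptions, and the extension set simply mirrors the built-in file types listed earlier.

```python
import os
from pathlib import Path

# Hypothetical defaults; the real skip list may differ.
SKIP_DIRS = {".git", ".venv", "venv", "__pycache__", "node_modules", "build", "dist"}
SUPPORTED_EXTS = {".txt", ".md", ".rst", ".json", ".jsonl", ".csv", ".tsv", ".html", ".htm"}


def discover(root, exts=SUPPORTED_EXTS, skip_dirs=SKIP_DIRS):
    """Return supported files under root, sorted, skipping noisy directories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = [name for name in dirnames if name not in skip_dirs]
        for name in filenames:
            path = Path(dirpath) / name
            if path.suffix.lower() in exts:
                found.append(path)
    return sorted(found)
```

Sorting the result keeps the file order stable between runs, which helps when outputs are compared or checked into version control.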
2. Loading and normalization
Converts each file into normalized plain text:
- Markdown and text are read directly
- JSON and JSONL are pretty-printed into readable text
- CSV and TSV become row-based text
- HTML is stripped to visible text
- PDF and DOCX are supported through optional extras
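The normalization rules above can be approximated with the standard library alone. The following is a rough sketch under stated assumptions, not the project's loader code: the function names are hypothetical, the exact text layout ContextIQ produces for CSV rows is not specified in this README, and the HTML handling only drops `script` and `style` contents.

```python
import csv
import io
import json
from html.parser import HTMLParser


def json_to_text(raw):
    """Pretty-print a JSON document with stable indentation."""
    return json.dumps(json.loads(raw), indent=2, ensure_ascii=False)


def csv_to_text(raw, delimiter=","):
    """Render each data row as 'header: value' pairs on one line."""
    rows = csv.DictReader(io.StringIO(raw), delimiter=delimiter)
    return "\n".join(
        "; ".join(f"{key}: {value}" for key, value in row.items()) for row in rows
    )


class _VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._hidden = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden:
            self._hidden -= 1

    def handle_data(self, data):
        if not self._hidden and data.strip():
            self.parts.append(data.strip())


def html_to_text(raw):
    """Return the visible text of an HTML string, one text node per line."""
    parser = _VisibleText()
    parser.feed(raw)
    parser.close()
    return "\n".join(parser.parts)
```

For TSV input, `csv_to_text(raw, delimiter="\t")` applies the same row-based rendering.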
3. Chunking
Splits documents into retrieval-friendly chunks with:
- target chunk size
- overlap between chunks
- paragraph and sentence-aware boundaries
- source path and character ranges preserved
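To make the four properties above concrete, here is a minimal chunker sketch. It is an assumption-laden illustration rather than ContextIQ's algorithm: the record fields (`source`, `start`, `end`, `text`), the function names, and the rule of only cutting in the later part of each window are all choices made for this example. The defaults reuse the `--chunk-size 1200` and `--chunk-overlap 150` values shown in the CLI section.

```python
import re


def _best_break(text, start, hard_end, min_len):
    """Return a cut in (start + min_len, hard_end], preferring paragraph then sentence ends."""
    low = start + min_len
    window = text[low:hard_end]
    paragraph = window.rfind("\n\n")
    if paragraph != -1:
        return low + paragraph + 2
    sentence_ends = [match.end() for match in re.finditer(r"[.!?]\s", window)]
    if sentence_ends:
        return low + sentence_ends[-1]
    return hard_end  # No natural boundary found: cut at the size limit.


def chunk_text(text, source, chunk_size=1200, overlap=150):
    """Split text into overlapping chunks that record their character ranges."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    # Every cut lands more than `overlap` characters past the chunk start,
    # so stepping back by `overlap` always moves forward.
    min_len = max(overlap, chunk_size // 2)
    chunks = []
    start = 0
    while start < len(text):
        hard_end = min(start + chunk_size, len(text))
        if hard_end == len(text):
            end = hard_end
        else:
            end = _best_break(text, start, hard_end, min_len)
        chunks.append({"source": source, "start": start, "end": end, "text": text[start:end]})
        if end == len(text):
            break
        start = end - overlap
    return chunks
```

Because each record stores `start` and `end`, any chunk can be traced back to its source with `original_text[start:end]`.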
4. Export
Writes machine-friendly and human-readable outputs for downstream AI workflows.
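An export step along these lines might look like the sketch below, which writes the same chunk records once as JSONL and once as a Markdown review file. The file names match the generated output listed in the quickstart, but the record fields, the Markdown heading layout, and the function name `export_chunks` are assumptions made for illustration.

```python
import json
from pathlib import Path


def export_chunks(chunks, out_dir):
    """Write chunks to chunks.jsonl and chunks.md inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One JSON object per line keeps the file streamable.
    with (out / "chunks.jsonl").open("w", encoding="utf-8") as handle:
        for chunk in chunks:
            handle.write(json.dumps(chunk, ensure_ascii=False) + "\n")

    # The Markdown view repeats the same data under one heading per chunk.
    lines = []
    for index, chunk in enumerate(chunks, start=1):
        lines.append(f"## Chunk {index}: {chunk['source']} [{chunk['start']}:{chunk['end']}]")
        lines.append("")
        lines.append(chunk["text"].strip())
        lines.append("")
    (out / "chunks.md").write_text("\n".join(lines), encoding="utf-8")
    return out / "chunks.jsonl", out / "chunks.md"
```

Keeping the source path and character range in each Markdown heading lets a reviewer jump from a chunk back to the original file.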
Project Structure
src/contextiq/
|- cli.py
|- pipeline.py
|- loaders.py
|- chunking.py
|- exporters.py
|- discovery.py
|- models.py
`- utils.py
Use Cases
RAG ingestion
Prepare mixed files for vector indexing and retrieval pipelines.
Agent memory and context packing
Turn project docs into clean, bounded chunks for coding and research agents.
Search systems
Produce normalized text and chunk exports for semantic or hybrid retrieval.
Eval datasets
Create stable, traceable corpora for retrieval benchmarking and prompt evaluation.
Development
Install editable dependencies:
pip install -e ".[dev]"
Run tests:
pytest
Run the demo (PowerShell):
.\demo.ps1
Roadmap
- embeddings plugin interface
- vector database exporters
- OCR pipeline
- table extraction
- citation-aware retrieval benchmarks
Contributing
Contributions are welcome, for example:
- improve loaders
- add exporters
- extend chunking strategies
- improve docs and examples
Open an issue or submit a PR if you want to help shape ContextIQ.
License
MIT License - see LICENSE
File details
Details for the file contextiq-0.1.1.tar.gz.
File metadata
- Download URL: contextiq-0.1.1.tar.gz
- Upload date:
- Size: 12.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d01ae23d2750cf422f5c82052d53aebd4c6d41a01ac06a283aeaa99bedcf54a2 |
| MD5 | f439391e13b5f9421e58e95f39ff182f |
| BLAKE2b-256 | 64df7153ed63e5206336890ac64c1ceabe241bb39724404a2bf2fda6ef0f16e2 |
File details
Details for the file contextiq-0.1.1-py3-none-any.whl.
File metadata
- Download URL: contextiq-0.1.1-py3-none-any.whl
- Upload date:
- Size: 12.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a3488b853db5186f5297a3b99d02616f398c794da79c807ace82b94c0207d10d |
| MD5 | 59f0fccc9c8bab5c19777a08b98d6582 |
| BLAKE2b-256 | 89d774a717a0cac49bcfc99ca390abc397ae429b64c14ae251a3395f453152a3 |