Prepare documents into structured, vector-ready data
docprep
Deterministic document chunking for RAG pipelines.
What is docprep?
docprep transforms source documents into structured, vector-ready chunks with deterministic IDs, Markdown-aware boundaries, and incremental sync. It sits between your documents and your vector store:
Source files → Loader → Parser → Chunker(s) → Sink → Export
│
Diff Engine → Changed-only export
docprep produces the same chunk IDs for the same input, every time. When documents change, it computes a structural diff and exports only the added, modified, or deleted chunks — so you re-embed only what changed.
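To make the idea concrete, here is a minimal sketch of deterministic IDs combined with an added/modified/deleted diff. It is not docprep's actual identity model: the names `stable_id`, `content_hash`, `snapshot`, and `diff`, and the choice of SHA-256 over a source URI plus an anchor string, are all assumptions made for illustration.

```python
import hashlib


def stable_id(source_uri: str, anchor: str) -> str:
    """Derive an ID from where the chunk lives, not from what it says (hypothetical scheme)."""
    raw = f"{source_uri}#{anchor}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def content_hash(text: str) -> str:
    """Hash the chunk text so edits can be detected without comparing full strings."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def snapshot(source_uri: str, chunks: dict[str, str]) -> dict[str, str]:
    """Map chunk ID -> content hash for one document; `chunks` maps anchor -> text."""
    return {stable_id(source_uri, a): content_hash(t) for a, t in chunks.items()}


def diff(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify chunk IDs as added, modified, or deleted between two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
        "deleted": sorted(old.keys() - new.keys()),
    }


before = snapshot("docs/guide.md", {"intro": "Hello", "setup": "pip install x"})
after = snapshot(
    "docs/guide.md",
    {"intro": "Hello", "setup": "pip install y", "usage": "run it"},
)
changes = diff(before, after)
```

Because the ID depends only on location, an edited chunk keeps its ID and shows up as modified. If the ID were derived from the text itself, the same edit would appear as one deletion plus one addition.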
What docprep is NOT
- Not a general-purpose document converter. docprep parses text formats natively; use MarkItDown, Docling, or Unstructured for PDFs/DOCX/PPTX, then feed the resulting Markdown into docprep via adapters.
- Not an embedding service. docprep produces text chunks; you bring your own embedding model.
- Not a vector database. docprep exports records for Qdrant, pgvector, Chroma, or any other store.
- Not a RAG framework. Use LlamaIndex or LangChain for retrieval. docprep handles the ingestion layer.
How docprep compares
| Feature | docprep | MarkItDown | Docling | Unstructured | Chonkie |
|---|---|---|---|---|---|
| Deterministic chunk IDs | ✅ | N/A | ❌ | ❌ | ❌ |
| Markdown-aware splitting | ✅ | N/A | Limited | Limited | ❌ |
| Incremental sync (diff) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Multi-format parsing | Via adapters | ✅ | ✅ | ✅ | ❌ |
| Plugin system | ✅ | ❌ | ❌ | ❌ | ❌ |
| Chunk-level provenance | ✅ | N/A | Partial | Partial | ❌ |
Installation
pip install docprep
For PostgreSQL support:
pip install docprep[postgres]
Quick Start
Config-first (recommended)
Create a docprep.toml in your project root:
source = "docs/"
[sink]
database_url = "sqlite:///docs.db"
create_tables = true
[[chunkers]]
type = "heading"
[[chunkers]]
type = "token"
max_tokens = 512
Then run:
docprep ingest # Ingest documents
docprep preview # Preview structure without persisting
docprep export -o out.jsonl # Export as JSONL
docprep diff # Show what changed since last ingest
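The configuration above chains a heading chunker with a token chunker capped at 512 tokens. The sketch below shows what such a two-stage split could look like. It is not docprep's implementation: the function names are hypothetical, "tokens" are approximated by whitespace-separated words, and only ATX-style `#` headings are recognized, so `#` lines inside fenced code would be misread as headings.

```python
import re

# ATX headings only: one to six '#' characters followed by whitespace and a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets an empty heading."""
    sections: list[tuple[str, str]] = []
    heading, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section if it has a heading or any non-blank body.
            if heading or any(item.strip() for item in body):
                sections.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    if heading or any(item.strip() for item in body):
        sections.append((heading, "\n".join(body).strip()))
    return sections


def split_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Split text into pieces of at most `max_tokens` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def chunk(markdown: str, max_tokens: int = 512) -> list[tuple[str, str]]:
    """Apply the heading split first, then enforce the token budget within each section."""
    return [
        (heading, piece)
        for heading, body in split_by_heading(markdown)
        for piece in split_by_tokens(body, max_tokens)
    ]


example = "# A\none two three\n## B\nfour five\n"
pieces = chunk(example, max_tokens=2)
```

Splitting on headings first means no chunk spans two sections, and the token pass only subdivides sections that exceed the budget.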
Python API
from docprep import ingest

result = ingest("docs/")
for doc in result.documents:
    print(f"{doc.title}: {len(doc.sections)} sections, {len(doc.chunks)} chunks")
With database persistence
from sqlalchemy import create_engine
from docprep import ingest
from docprep.sinks.sqlalchemy import SQLAlchemySink
engine = create_engine("sqlite:///docs.db")
sink = SQLAlchemySink(engine=engine)
result = ingest("docs/", sink=sink)
print(f"Persisted: {result.persisted}, Skipped: {len(result.skipped_source_uris)}")
Changed-only export
docprep export docs/ --changed-only --db sqlite:///docs.db -o delta.jsonl
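A JSONL export such as `delta.jsonl` can be read with the standard library alone, one JSON object per line. The sketch below makes no claim about the record schema: `read_jsonl` is a hypothetical helper, and the `id` and `text` fields in the sample are placeholders, not the VectorRecordV1 layout.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path) -> list[dict]:
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo with placeholder records; a real export will have different fields.
sample = '{"id": "a1", "text": "hello"}\n\n{"id": "b2", "text": "world"}\n'
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "delta.jsonl"
    demo_path.write_text(sample, encoding="utf-8")
    records = read_jsonl(demo_path)
```

From here, each record can be handed to whichever embedding model and vector store you use.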
Documentation
| Guide | Description |
|---|---|
| Getting Started | Installation, first ingestion, basic usage |
| Configuration | docprep.toml reference and all options |
| CLI Reference | All commands, flags, and examples |
| Python API | Types, functions, and usage patterns |
| Architecture | Pipeline flow, identity model, module map |
| Export | VectorRecordV1, JSONL, changed-only export |
| Plugins | Entry-point plugin system |
| Adapters | External converter integration |
Design decisions are documented as Architecture Decision Records.
Supported Formats
| Format | Extensions | Parser | Notes |
|---|---|---|---|
| Markdown | .md | Built-in | Frontmatter extraction, heading hierarchy |
| Plain text | .txt | Built-in | First non-empty line as title |
| HTML | .html, .htm | Built-in (stdlib) | Strips script/style, converts headings |
| reStructuredText | .rst | Built-in | Heading adornments, field lists |
| Any format | * | Via adapter | MarkItDown, Docling, Unstructured, etc. |
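As an illustration of the plain-text rule in the table ("first non-empty line as title"), a minimal sketch might look like this. The function name and the `fallback` parameter are assumptions for the example, not part of docprep's API.

```python
def title_from_text(text: str, fallback: str = "Untitled") -> str:
    """Return the first non-empty line, stripped; use `fallback` if every line is blank."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return fallback


title = title_from_text("\n\n  Release notes \nFirst paragraph of the body.")
```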
Development
git clone https://github.com/yeongseon/docprep.git
cd docprep
make install
make check-all # lint + typecheck + test + security
make test # pytest
make lint # ruff + mypy
make format # ruff format
See CONTRIBUTING.md for the full development guide.
License
MIT
File details
Details for the file docprep-0.1.1.tar.gz.
File metadata
- Download URL: docprep-0.1.1.tar.gz
- Upload date:
- Size: 146.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e955489616b50897eb39d070f49adc7ca07f0728b53b7a138a9373d32a725dc2 |
| MD5 | a253e258f3aac64b132f19e9ea28552c |
| BLAKE2b-256 | 872313c42bc9b5cc4c19ee214a5185c1bd4a3794be68d3598d8c6ade2fc75c82 |
Provenance
The following attestation bundles were made for docprep-0.1.1.tar.gz:
Publisher: publish-pypi.yml on yeongseon/docprep
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docprep-0.1.1.tar.gz
- Subject digest: e955489616b50897eb39d070f49adc7ca07f0728b53b7a138a9373d32a725dc2
- Sigstore transparency entry: 1280600863
- Sigstore integration time:
- Permalink: yeongseon/docprep@7ce4a00d3e7f13ad7b60fcd4f061742e4fd54f37
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/yeongseon
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@7ce4a00d3e7f13ad7b60fcd4f061742e4fd54f37
- Trigger Event: push
File details
Details for the file docprep-0.1.1-py3-none-any.whl.
File metadata
- Download URL: docprep-0.1.1-py3-none-any.whl
- Upload date:
- Size: 72.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ec9a796a7d991777428e3eb02c4a3ab17292a5bb5cf840225237082ef3f09676 |
| MD5 | fb1e4a9b63255c6cbd258d25ff37abe4 |
| BLAKE2b-256 | 1685d9bec9d6861dc08cd0e6e22a3fbbe1e0c60c65fc275b8ed8733a10c01b95 |
Provenance
The following attestation bundles were made for docprep-0.1.1-py3-none-any.whl:
Publisher: publish-pypi.yml on yeongseon/docprep
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docprep-0.1.1-py3-none-any.whl
- Subject digest: ec9a796a7d991777428e3eb02c4a3ab17292a5bb5cf840225237082ef3f09676
- Sigstore transparency entry: 1280600867
- Sigstore integration time:
- Permalink: yeongseon/docprep@7ce4a00d3e7f13ad7b60fcd4f061742e4fd54f37
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/yeongseon
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@7ce4a00d3e7f13ad7b60fcd4f061742e4fd54f37
- Trigger Event: push