Prepare documents into structured, vector-ready data

Maintainers

yeongseon

docprep

Deterministic document chunking for RAG pipelines.

What is docprep?

docprep transforms source documents into structured, vector-ready chunks with deterministic IDs, Markdown-aware boundaries, and incremental sync. It sits between your documents and your vector store:

Source files → Loader → Parser → Chunker(s) → Sink → Export
                                      │
                                Diff Engine → Changed-only export

docprep produces the same chunk IDs for the same input, every time. When documents change, it computes a structural diff and exports only the added, modified, or deleted chunks — so you re-embed only what changed.
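As a rough illustration of what "deterministic IDs" can mean, the sketch below derives an ID by hashing a chunk's position in the document and keeps a separate hash of its text. This is not docprep's actual identity model (the Architecture guide describes that). The function names, the choice of identity fields, and the 16-character truncation are all assumptions made for the example.

```python
import hashlib


def chunk_id(source_uri: str, heading_path: tuple[str, ...], ordinal: int) -> str:
    """Derive a stable ID from where a chunk lives, not from time or randomness.

    Hypothetical helper: the identity fields chosen here are an assumption.
    """
    # Use the ASCII unit separator, which is unlikely to occur in paths or headings.
    key = "\x1f".join([source_uri, *heading_path, str(ordinal)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def content_hash(text: str) -> str:
    """Fingerprint the chunk text so edits can be detected separately from identity."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The same inputs always give the same ID; a different heading path gives another.
first = chunk_id("docs/guide.md", ("Guide", "Install"), 0)
second = chunk_id("docs/guide.md", ("Guide", "Install"), 0)
other = chunk_id("docs/guide.md", ("Guide", "Usage"), 0)
print(first == second, first == other)
```

Keeping identity and content as two separate hashes is one way to tell "this chunk was edited" apart from "this chunk is new", which is what makes re-embedding only the changed chunks possible.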

What docprep is NOT

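Not a general-purpose document converter. Built-in parsers cover Markdown, plain text, HTML, and reStructuredText; for PDFs/DOCX/PPTX, use MarkItDown, Docling, or Unstructured, then feed the resulting Markdown into docprep via adapters.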
Not an embedding service. docprep produces text chunks; you bring your own embedding model.
Not a vector database. docprep exports records for Qdrant, pgvector, Chroma, or any other store.
Not a RAG framework. Use LlamaIndex or LangChain for retrieval. docprep handles the ingestion layer.

How docprep compares

Feature	docprep	MarkItDown	Docling	Unstructured	Chonkie
Deterministic chunk IDs	✅	N/A	❌	❌	❌
Markdown-aware splitting	✅	N/A	Limited	Limited	❌
Incremental sync (diff)	✅	❌	❌	❌	❌
Multi-format parsing	Via adapters	✅	✅	✅	❌
Plugin system	✅	❌	❌	❌	❌
Chunk-level provenance	✅	N/A	Partial	Partial	❌

Installation

pip install docprep

For PostgreSQL support:

pip install "docprep[postgres]"

Quick Start

Config-first (recommended)

Create a docprep.toml in your project root:

source = "docs/"

[sink]
database_url = "sqlite:///docs.db"
create_tables = true

[[chunkers]]
type = "heading"

[[chunkers]]
type = "token"
max_tokens = 512

Then run:

docprep ingest              # Ingest documents
docprep preview             # Preview structure without persisting
docprep export -o out.jsonl # Export as JSONL
docprep diff                # Show what changed since last ingest

Python API

from docprep import ingest

result = ingest("docs/")
for doc in result.documents:
    print(f"{doc.title}: {len(doc.sections)} sections, {len(doc.chunks)} chunks")
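The sections and chunks reported above come from chaining a heading chunker with a token chunker. A minimal sketch of that idea, assuming ATX-style Markdown headings and whitespace-separated words as a stand-in for real tokens, might look like the following. The function names are hypothetical and this is not docprep's implementation.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example fence-safe


def split_by_heading(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets heading ''."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        match = None if in_fence else HEADING.match(line)
        if match:
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)
    # Drop the leading section if the document starts directly with a heading.
    return [(h, "\n".join(body).strip()) for h, body in sections if h or any(body)]


def split_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Greedy split on whitespace words; a real tokenizer would count differently."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def chunk(markdown: str, max_tokens: int = 512) -> list[tuple[str, str]]:
    """Split by heading first, then cap each section at max_tokens words."""
    chunks = []
    for heading, body in split_by_heading(markdown):
        for piece in split_by_tokens(body, max_tokens):
            chunks.append((heading, piece))
    return chunks


sample = "# Intro\none two three four five\n## Setup\nsix seven\n"
chunks = chunk(sample, max_tokens=3)
for heading, piece in chunks:
    print(f"{heading}: {piece}")
```

Splitting on headings first means no chunk spans two sections, and the token limit only subdivides sections that are too long. Note that this simplified token split collapses line breaks inside a section, which a production chunker would likely preserve.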

With database persistence

from sqlalchemy import create_engine
from docprep import ingest
from docprep.sinks.sqlalchemy import SQLAlchemySink

# SQLite keeps the example local; the database file is docs.db.
engine = create_engine("sqlite:///docs.db")
sink = SQLAlchemySink(engine=engine)

# Passing a sink makes ingest() persist its results as well as return them.
result = ingest("docs/", sink=sink)
print(f"Persisted: {result.persisted}, Skipped: {len(result.skipped_source_uris)}")

Changed-only export

docprep export docs/ --changed-only --db sqlite:///docs.db -o delta.jsonl
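Conceptually, a changed-only export needs a comparison between what was stored at the last ingest and what the current files produce. The sketch below shows one simple way to compute such a delta from two maps of chunk ID to content hash. The function name and the data shape are assumptions for illustration, not docprep's diff engine or its export schema.

```python
def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {chunk_id: content_hash} maps and classify each chunk ID."""
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    # A chunk is modified when its ID survives but its content hash differs.
    modified = sorted(
        cid for cid in current.keys() & previous.keys() if current[cid] != previous[cid]
    )
    return {"added": added, "modified": modified, "deleted": deleted}


previous = {"a1": "h1", "b2": "h2", "c3": "h3"}
current = {"a1": "h1", "b2": "h2-edited", "d4": "h4"}

delta = diff_snapshots(previous, current)
print(delta)  # {'added': ['d4'], 'modified': ['b2'], 'deleted': ['c3']}
```

Only the added and modified IDs would need new embeddings; the deleted IDs tell the vector store which records to remove.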

Documentation

Guide	Description
Getting Started	Installation, first ingestion, basic usage
Configuration	`docprep.toml` reference and all options
CLI Reference	All commands, flags, and examples
Python API	Types, functions, and usage patterns
Architecture	Pipeline flow, identity model, module map
Export	VectorRecordV1, JSONL, changed-only export
Plugins	Entry-point plugin system
Adapters	External converter integration

Design decisions are documented as Architecture Decision Records.

Supported Formats

Format	Extensions	Parser	Notes
Markdown	`.md`	Built-in	Frontmatter extraction, heading hierarchy
Plain text	`.txt`	Built-in	First non-empty line as title
HTML	`.html`, `.htm`	Built-in (stdlib)	Strips script/style, converts headings
reStructuredText	`.rst`	Built-in	Heading adornments, field lists
Any format	`*`	Via adapter	MarkItDown, Docling, Unstructured, etc.
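The HTML row above mentions stripping script/style content and converting headings using only the standard library. A minimal sketch of that approach with html.parser follows. The class and function names are hypothetical, only h1 to h6 and p are treated as block boundaries, and docprep's real parser may behave differently.

```python
from html.parser import HTMLParser


class HeadingAwareTextExtractor(HTMLParser):
    """Collect visible text, dropping script/style and marking headings with '#'."""

    SKIPPED = {"script", "style"}
    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.lines: list[str] = []
        self._skip_depth = 0       # > 0 while inside <script> or <style>
        self._heading_level = 0    # 0 means the buffered text is body text
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._flush()  # emit any pending body text before the heading
            self._heading_level = self.HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.HEADINGS or tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def _flush(self) -> None:
        text = " ".join("".join(self._buffer).split())  # normalize whitespace
        self._buffer = []
        if text:
            prefix = "#" * self._heading_level + " " if self._heading_level else ""
            self.lines.append(prefix + text)
        self._heading_level = 0


def html_to_markdown(html: str) -> str:
    parser = HeadingAwareTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit trailing text that was not closed by a block tag
    return "\n\n".join(parser.lines)


page = (
    "<html><head><style>p{color:red}</style><script>alert(1)</script></head>"
    "<body><h1>Guide</h1><p>Hello <b>world</b>.</p><h2>Setup</h2><p>Run it.</p></body></html>"
)
markdown = html_to_markdown(page)
print(markdown)
```

Converting headings to Markdown markers at parse time lets the same heading-based chunking apply to HTML input as to native Markdown.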

Development

git clone https://github.com/yeongseon/docprep.git
cd docprep
make install

make check-all    # lint + typecheck + test + security
make test         # pytest
make lint         # ruff + mypy
make format       # ruff format

See CONTRIBUTING.md for the full development guide.

License

MIT

Release history

0.1.1 (this version), Apr 12, 2026
0.1.0, Apr 12, 2026

Download files

Source Distribution

docprep-0.1.1.tar.gz (146.0 kB)

Uploaded Apr 12, 2026 Source

Built Distribution

docprep-0.1.1-py3-none-any.whl (72.3 kB)

Uploaded Apr 12, 2026 Python 3

File details

Details for the file docprep-0.1.1.tar.gz.

File metadata

Download URL: docprep-0.1.1.tar.gz
Upload date: Apr 12, 2026
Size: 146.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docprep-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`e955489616b50897eb39d070f49adc7ca07f0728b53b7a138a9373d32a725dc2`
MD5	`a253e258f3aac64b132f19e9ea28552c`
BLAKE2b-256	`872313c42bc9b5cc4c19ee214a5185c1bd4a3794be68d3598d8c6ade2fc75c82`

Provenance

The following attestation bundles were made for docprep-0.1.1.tar.gz:

Publisher: publish-pypi.yml on yeongseon/docprep

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docprep-0.1.1.tar.gz
- Subject digest: e955489616b50897eb39d070f49adc7ca07f0728b53b7a138a9373d32a725dc2
- Sigstore transparency entry: 1280600863
- Sigstore integration time: Apr 12, 2026
Source repository:
- Permalink: yeongseon/docprep@7ce4a00d3e7f13ad7b60fcd4f061742e4fd54f37
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/yeongseon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@7ce4a00d3e7f13ad7b60fcd4f061742e4fd54f37
- Trigger Event: push

File details

Details for the file docprep-0.1.1-py3-none-any.whl.

File metadata

Download URL: docprep-0.1.1-py3-none-any.whl
Upload date: Apr 12, 2026
Size: 72.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docprep-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ec9a796a7d991777428e3eb02c4a3ab17292a5bb5cf840225237082ef3f09676`
MD5	`fb1e4a9b63255c6cbd258d25ff37abe4`
BLAKE2b-256	`1685d9bec9d6861dc08cd0e6e22a3fbbe1e0c60c65fc275b8ed8733a10c01b95`

Provenance

The following attestation bundles were made for docprep-0.1.1-py3-none-any.whl:

Publisher: publish-pypi.yml on yeongseon/docprep

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docprep-0.1.1-py3-none-any.whl
- Subject digest: ec9a796a7d991777428e3eb02c4a3ab17292a5bb5cf840225237082ef3f09676
- Sigstore transparency entry: 1280600867
- Sigstore integration time: Apr 12, 2026
Source repository:
- Permalink: yeongseon/docprep@7ce4a00d3e7f13ad7b60fcd4f061742e4fd54f37
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/yeongseon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@7ce4a00d3e7f13ad7b60fcd4f061742e4fd54f37
- Trigger Event: push
