Prepare documents into structured, vector-ready data

docprep

Deterministic document chunking for RAG pipelines.

Python 3.10+ · License: MIT

What is docprep?

docprep transforms source documents into structured, vector-ready chunks with deterministic IDs, Markdown-aware boundaries, and incremental sync. It sits between your documents and your vector store:

Source files → Loader → Parser → Chunker(s) → Sink → Export
                                      │
                                Diff Engine → Changed-only export

docprep produces the same chunk IDs for the same input, every time. When documents change, it computes a structural diff and exports only the added, modified, or deleted chunks — so you re-embed only what changed.
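
The idea can be illustrated with a small sketch. This is not docprep's actual identity model: the helper names (`chunk_id`, `snapshot`, `diff_chunks`) and the choice of hashing a source URI plus an anchor with SHA-256 are assumptions made purely for illustration.

```python
import hashlib


def chunk_id(source_uri: str, anchor: str) -> str:
    """Stable ID derived only from where a chunk lives (hypothetical scheme)."""
    key = f"{source_uri}\x00{anchor}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


def content_hash(text: str) -> str:
    """Fingerprint of the chunk text, used to detect modifications."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def snapshot(source_uri: str, chunks: dict[str, str]) -> dict[str, str]:
    """Map chunk ID -> content hash; `chunks` maps anchor -> text."""
    return {chunk_id(source_uri, a): content_hash(t) for a, t in chunks.items()}


def diff_chunks(old: dict[str, str], new: dict[str, str]) -> dict[str, set[str]]:
    """Classify chunk IDs as added, deleted, or modified between two snapshots."""
    shared = old.keys() & new.keys()
    return {
        "added": new.keys() - old.keys(),
        "deleted": old.keys() - new.keys(),
        "modified": {cid for cid in shared if old[cid] != new[cid]},
    }


before = snapshot("docs/guide.md", {"intro": "Hello", "setup": "pip install"})
after = snapshot(
    "docs/guide.md",
    {"intro": "Hello", "setup": "pip install docprep", "usage": "Run it"},
)
changes = diff_chunks(before, after)
print({kind: len(ids) for kind, ids in changes.items()})
```

Because the ID depends only on location and the hash only on text, an edited chunk keeps its ID and shows up as modified rather than as a delete plus an add. Only the IDs in `added` and `modified` would need re-embedding.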

What docprep is NOT

  • Not a document parser. Use MarkItDown, Docling, or Unstructured for PDFs/DOCX/PPTX, then feed Markdown into docprep via adapters.
  • Not an embedding service. docprep produces text chunks; you bring your own embedding model.
  • Not a vector database. docprep exports records for Qdrant, pgvector, Chroma, or any other store.
  • Not a RAG framework. Use LlamaIndex or LangChain for retrieval. docprep handles the ingestion layer.

How docprep compares

Feature docprep MarkItDown Docling Unstructured Chonkie
Deterministic chunk IDs N/A
Markdown-aware splitting N/A Limited Limited
Incremental sync (diff)
Multi-format parsing Via adapters
Plugin system
Chunk-level provenance N/A Partial Partial

Installation

pip install docprep

For PostgreSQL support:

pip install "docprep[postgres]"

Quick Start

Config-first (recommended)

Create a docprep.toml in your project root:

source = "docs/"

[sink]
database_url = "sqlite:///docs.db"
create_tables = true

[[chunkers]]
type = "heading"

[[chunkers]]
type = "token"
max_tokens = 512

Then run:

docprep ingest              # Ingest documents
docprep preview             # Preview structure without persisting
docprep export -o out.jsonl # Export as JSONL
docprep diff                # Show what changed since last ingest
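
The configuration above lists a heading chunker followed by a token chunker with `max_tokens = 512`. A rough sketch of that two-stage idea follows. It is an assumption about the general technique, not docprep's code: the function names are hypothetical, whitespace-separated words stand in for real tokenizer tokens, and headings inside fenced code blocks are not handled.

```python
import re

# ATX headings only: one to six '#' characters, a space or tab, then text.
HEADING = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown so that each piece starts at a heading."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(markdown)]
    pieces = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [p for p in pieces if p]


def split_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Cap a piece at max_tokens words (a stand-in for tokens)."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def chunk(markdown: str, max_tokens: int = 512) -> list[str]:
    """Run the heading split first, then enforce the size cap on each section."""
    return [
        part
        for section in split_by_heading(markdown)
        for part in split_by_tokens(section, max_tokens)
    ]


doc = "# A\nalpha beta gamma\n\n## B\ndelta\n"
for piece in chunk(doc, max_tokens=3):
    print(repr(piece))
```

Splitting on headings first keeps chunk boundaries aligned with the document's structure; the size cap then only subdivides sections that are too long on their own.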

Python API

from docprep import ingest

result = ingest("docs/")
for doc in result.documents:
    print(f"{doc.title}: {len(doc.sections)} sections, {len(doc.chunks)} chunks")

With database persistence

from sqlalchemy import create_engine
from docprep import ingest
from docprep.sinks.sqlalchemy import SQLAlchemySink

engine = create_engine("sqlite:///docs.db")
sink = SQLAlchemySink(engine=engine)

result = ingest("docs/", sink=sink)
print(f"Persisted: {result.persisted}, Skipped: {len(result.skipped_source_uris)}")

Changed-only export

docprep export docs/ --changed-only --db sqlite:///docs.db -o delta.jsonl
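
On the consuming side, a delta file like this could be applied to a store one record at a time. The sketch below is only illustrative: the record fields `op`, `id`, and `text` are hypothetical and are not taken from the VectorRecordV1 schema, so check the Export guide for the real field names before adapting it.

```python
import json
from typing import Iterable


def apply_delta(store: dict[str, str], lines: Iterable[str]) -> dict[str, int]:
    """Apply JSONL delta records to an in-memory store keyed by chunk ID.

    ASSUMPTION: each record carries "op" ("added", "modified", or "deleted"),
    "id", and, for upserts, "text". These field names are hypothetical.
    """
    counts = {"upserted": 0, "deleted": 0}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["op"] == "deleted":
            store.pop(record["id"], None)
            counts["deleted"] += 1
        else:
            # A real pipeline would embed record["text"] and upsert the vector.
            store[record["id"]] = record["text"]
            counts["upserted"] += 1
    return counts


store = {"a1": "old text", "b2": "stale"}
delta = [
    '{"op": "modified", "id": "a1", "text": "new text"}',
    '{"op": "deleted", "id": "b2"}',
    '{"op": "added", "id": "c3", "text": "fresh"}',
]
print(apply_delta(store, delta))
print(store)
```

With a file on disk, the open file object can be passed directly as `lines`, since iterating over it yields one JSONL record per line.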

Documentation

Guide            Description
Getting Started  Installation, first ingestion, basic usage
Configuration    docprep.toml reference and all options
CLI Reference    All commands, flags, and examples
Python API       Types, functions, and usage patterns
Architecture     Pipeline flow, identity model, module map
Export           VectorRecordV1, JSONL, changed-only export
Plugins          Entry-point plugin system
Adapters         External converter integration

Design decisions are documented as Architecture Decision Records.

Supported Formats

Format            Extensions    Parser             Notes
Markdown          .md           Built-in           Frontmatter extraction, heading hierarchy
Plain text        .txt          Built-in           First non-empty line as title
HTML              .html, .htm   Built-in (stdlib)  Strips script/style, converts headings
reStructuredText  .rst          Built-in           Heading adornments, field lists
Any format        *             Via adapter        MarkItDown, Docling, Unstructured, etc.
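
As an illustration of what a stdlib-only HTML step might look like, here is a minimal sketch built on `html.parser`. It is not docprep's parser: the class and function names are hypothetical, and inline markup inside a heading is not merged into a single line.

```python
from html.parser import HTMLParser


class HeadingTextExtractor(HTMLParser):
    """Collect visible text, skip <script>/<style>, map <h1>-<h6> to # headings."""

    SKIP = {"script", "style"}
    HEADINGS = {f"h{n}": n for n in range(1, 7)}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0      # > 0 while inside <script> or <style>
        self._heading_level = 0   # 1-6 while inside a heading tag, else 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._heading_level = self.HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._heading_level = 0

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        prefix = "#" * self._heading_level + " " if self._heading_level else ""
        self.parts.append(prefix + text)


def html_to_text(html: str) -> str:
    """Return Markdown-like text with one block per text node."""
    parser = HeadingTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.parts)


page = (
    "<html><head><style>p{color:red}</style><script>alert(1)</script></head>"
    "<body><h2>Setup</h2><p>Install it.</p></body></html>"
)
print(html_to_text(page))
```

Emitting headings in Markdown form means the same heading-aware chunking can be reused for HTML input without a separate code path.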

Development

git clone https://github.com/yeongseon/docprep.git
cd docprep
make install

make check-all    # lint + typecheck + test + security
make test         # pytest
make lint         # ruff + mypy
make format       # ruff format

See CONTRIBUTING.md for the full development guide.

License

MIT
