Prepare documents into structured, vector-ready data

Maintainers

yeongseon

docprep

Deterministic document chunking for RAG pipelines.

What is docprep?

docprep transforms source documents into structured, vector-ready chunks with deterministic IDs, Markdown-aware boundaries, and incremental sync. It sits between your documents and your vector store:

Source files -> Loader -> Parser -> Chunker(s) -> Sink -> Export
                                       |
                                 Diff Engine -> Changed-only export

docprep produces the same chunk IDs for the same input, every time. When documents change, it computes a structural diff and exports only the added, modified, or deleted chunks -- so you re-embed only what changed.
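
As a rough illustration of what "the same chunk IDs for the same input" can mean in practice, the sketch below derives an ID from a chunk's structural position and keeps a separate hash of its text. This is not docprep's actual scheme: the function names `chunk_id` and `content_hash`, the choice of SHA-256, the 16-character truncation, and the separator are all assumptions made for the example.

```python
import hashlib


def chunk_id(source_uri: str, heading_path: tuple[str, ...], ordinal: int) -> str:
    """Derive a stable ID from where a chunk sits in the document (hypothetical scheme)."""
    # Join with a control character that does not occur in normal text,
    # so ("a", "bc") and ("ab", "c") cannot produce the same anchor string.
    anchor = "\x1f".join([source_uri, *heading_path, str(ordinal)])
    return hashlib.sha256(anchor.encode("utf-8")).hexdigest()[:16]


def content_hash(text: str) -> str:
    """Hash the chunk text so edits are detectable without changing the ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = chunk_id("docs/guide.md", ("Guide", "Install"), 0)
second = chunk_id("docs/guide.md", ("Guide", "Install"), 0)
print(first == second)  # True: no randomness or timestamps are involved
```

Keeping identity (position) separate from content (hash) is what lets a later run say "this chunk still exists but its text changed" rather than treating every edit as a delete plus an add.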

What docprep is NOT

Not a document parser. docprep does not parse PDFs, DOCX, or PPTX. Use MarkItDown, Docling, or Unstructured for that, then feed their Markdown output into docprep.
Not an embedding service. docprep produces text chunks; you bring your own embedding model.
Not a vector database. docprep exports records for Qdrant, pgvector, Chroma, or any other store.
Not a RAG framework. Use LlamaIndex or LangChain for retrieval orchestration. docprep handles the ingestion layer.

How docprep compares

Feature	docprep	MarkItDown	Docling	Unstructured	Chonkie
Deterministic chunk IDs	Yes	N/A	No	No	No
Markdown-aware splitting	Yes	N/A	Limited	Limited	No
Incremental sync (diff)	Yes	No	No	No	No
Multi-format parsing	Via adapters	Yes	Yes	Yes	No
Plugin system	Yes	No	No	No	No
Chunk-level provenance	Yes	N/A	Partial	Partial	No

MarkItDown converts files to Markdown (complementary to docprep). Chonkie is a chunking library without structural awareness or identity tracking.

Supported formats

Format	Extensions	Parser	Notes
Markdown	`.md`	Built-in	Frontmatter extraction, heading hierarchy
Plain text	`.txt`	Built-in	First non-empty line as title
HTML	`.html`, `.htm`	Built-in (stdlib)	Strips script/style, converts headings
reStructuredText	`.rst`	Built-in	Heading adornments, field lists
Any format	`*`	Via adapter	MarkItDown, Docling, Unstructured, etc.
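
To make the HTML row concrete, here is a minimal stdlib-only sketch of stripping script/style content and turning headings into Markdown-style headings. It is an illustration of the idea, not docprep's built-in parser; the class name `HeadingTextExtractor` and the function `html_to_markdownish` are hypothetical.

```python
from html.parser import HTMLParser


class HeadingTextExtractor(HTMLParser):
    """Collect visible text, dropping script/style and marking h1-h6 with '#'."""

    SKIPPED = {"script", "style"}
    HEADINGS = {f"h{n}": n for n in range(1, 7)}

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []
        self._skip_depth = 0      # > 0 while inside <script> or <style>
        self._heading_level = 0   # 1-6 while inside a heading, else 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._heading_level = self.HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._heading_level = 0

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        prefix = "#" * self._heading_level + " " if self._heading_level else ""
        self.lines.append(prefix + text)


def html_to_markdownish(html: str) -> str:
    """Return the visible text as blank-line-separated blocks."""
    parser = HeadingTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)


print(html_to_markdownish("<h1>Title</h1><script>var x = 1;</script><p>Hello</p>"))
```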

Installation

pip install docprep

For PostgreSQL support:

pip install "docprep[postgres]"

Quick start

Config-first (recommended)

Create a docprep.toml in your project root:

source = "docs/"

[sink]
database_url = "sqlite:///docs.db"
create_tables = true

[[chunkers]]
type = "heading"

[[chunkers]]
type = "token"
max_tokens = 512

Then run:

# Ingest documents
docprep ingest

# Preview structure without persisting
docprep preview

# Export as JSONL
docprep export -o records.jsonl

# Show what changed since last ingest
docprep diff

Python API

from docprep import ingest

result = ingest("docs/")
for doc in result.documents:
    print(f"{doc.title}: {len(doc.sections)} sections, {len(doc.chunks)} chunks")

With database persistence

from sqlalchemy import create_engine
from docprep import ingest
from docprep.sinks.sqlalchemy import SQLAlchemySink

engine = create_engine("sqlite:///docs.db")
sink = SQLAlchemySink(engine=engine)

result = ingest("docs/", sink=sink)
print(f"Persisted: {result.persisted}, Skipped: {len(result.skipped_source_uris)}")

Streaming JSONL export

from docprep import ingest
from docprep.export import iter_vector_records_v1, write_jsonl

result = ingest("docs/")

# Stream to file
with open("records.jsonl", "w", encoding="utf-8") as f:
    count = write_jsonl(iter_vector_records_v1(result.documents), f)
    print(f"Exported {count} records")
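
On the consuming side, a JSONL file can be read back lazily and grouped into batches before being handed to an embedding model. The sketch below assumes nothing about the record schema; `iter_jsonl` and `batched` are hypothetical helper names, not part of docprep.

```python
import json
from collections.abc import Iterable, Iterator
from pathlib import Path


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one decoded record per non-empty line, without loading the whole file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def batched(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Group records into lists of at most `size` items."""
    batch: list[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch
```

A caller would typically loop over `batched(iter_jsonl(Path("records.jsonl")), 64)` and pass each batch to whatever embedding client it uses.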

Changed-only export

# Export only chunks that changed since last sync
docprep export docs/ --changed-only --db sqlite:///docs.db -o delta.jsonl
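
Conceptually, a changed-only export needs to compare the previously persisted chunks with the freshly computed ones. The following sketch shows one simple way to do that with two `{chunk_id: content_hash}` snapshots. It is an assumption about the general technique, not docprep's diff engine, and `diff_chunks` is a hypothetical name.

```python
def diff_chunks(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {chunk_id: content_hash} snapshots and classify each ID."""
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    modified = sorted(
        cid for cid in current.keys() & previous.keys()
        if current[cid] != previous[cid]
    )
    return {"added": added, "modified": modified, "deleted": deleted}


previous = {"a": "h1", "b": "h2", "c": "h3"}
current = {"a": "h1", "b": "h2-edited", "d": "h4"}
print(diff_chunks(previous, current))
# {'added': ['d'], 'modified': ['b'], 'deleted': ['c']}
```

Only the IDs listed under `added` and `modified` would need new embeddings; `deleted` IDs would be removed from the vector store.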

CLI reference

Command	Description
`docprep ingest`	Ingest documents into a database
`docprep preview`	Preview document structure without persistence
`docprep export`	Export vector records as JSONL
`docprep diff`	Show changes against persisted state
`docprep stats`	Show database statistics
`docprep inspect`	Inspect a document, section, or chunk by URI or ID
`docprep prune`	Remove stale documents no longer in source
`docprep delete`	Delete a document by source URI

Every command accepts --config PATH to name a config file explicitly, and --json / --no-json to choose the output format.

Configuration

docprep discovers docprep.toml by searching the current directory and then each parent directory. Settings are resolved in this order of precedence: CLI arguments > explicit --config > discovered config > defaults.

See examples/configs/ for minimal, standard, and advanced configurations.
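
The discovery and precedence rules can be pictured with the sketch below. It is a simplified model under stated assumptions: `find_config` and `resolve_settings` are hypothetical names, the merge is shallow, and docprep's real implementation may handle nested tables or the explicit-config case differently.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, name: str = "docprep.toml") -> Optional[Path]:
    """Return the nearest config file at or above `start`, or None if there is none."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


def resolve_settings(defaults: dict, discovered: dict, explicit: dict, cli: dict) -> dict:
    """Shallow-merge the layers; each later layer overrides the ones before it."""
    merged: dict = {}
    for layer in (defaults, discovered, explicit, cli):
        merged.update(layer)
    return merged


settings = resolve_settings(
    defaults={"json": False, "source": "."},
    discovered={"source": "docs/"},
    explicit={},
    cli={"json": True},
)
print(settings)  # {'json': True, 'source': 'docs/'}
```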

Full config reference

# Source directory or file
source = "docs/"

# Default output format for CLI
json = false

[loader]
type = "filesystem"                    # "markdown" or "filesystem"
include_globs = ["**/*.md", "**/*.txt", "**/*.html", "**/*.htm", "**/*.rst"]
exclude_globs = ["**/drafts/**"]
hidden_policy = "skip"                 # "skip" or "include"
symlink_policy = "follow"              # "follow" or "skip"
encoding = "utf-8"
encoding_errors = "strict"

[parser]
type = "auto"                          # "markdown", "plaintext", "html", "rst", or "auto"

[[chunkers]]
type = "heading"                       # Split by headings into sections

[[chunkers]]
type = "token"                         # Split sections into token-budgeted chunks
max_tokens = 512
overlap_tokens = 0
tokenizer = "whitespace"               # "whitespace" or "character"

# Alternative: size-based chunking
# [[chunkers]]
# type = "size"
# max_chars = 1500
# overlap_chars = 0
# min_chars = 0

[sink]
type = "sqlalchemy"
database_url = "sqlite:///docs.db"
create_tables = true

[export]
text_prepend = "title_and_heading_path"  # "none", "title_only", "heading_path", or "title_and_heading_path"

Architecture

src/docprep/
├── models/domain.py       # Document, Section, Chunk, VectorRecordV1
├── loaders/
│   ├── markdown.py        # Load .md files
│   └── filesystem.py      # Multi-format loader with glob patterns
├── parsers/
│   ├── markdown.py        # Frontmatter + heading extraction
│   ├── plaintext.py       # Plain text with title detection
│   ├── html.py            # HTML to Markdown (stdlib only)
│   ├── rst.py             # RST heading adornments + field lists
│   └── multi.py           # Auto-dispatch by media type
├── chunkers/
│   ├── heading.py         # Split by headings into sections
│   ├── size.py            # Size-based chunk splitting
│   ├── token.py           # Token-aware chunk splitting
│   └── _markdown.py       # Shared Markdown boundary analysis
├── sinks/
│   ├── sqlalchemy.py      # SQLAlchemy persistence + revision tracking
│   └── orm.py             # Table definitions
├── adapters/
│   └── protocol.py        # Adapter protocol for external converters
├── plugins.py             # Entry-point plugin discovery
├── diff.py                # Structural diff engine
├── export.py              # VectorRecordV1 export + JSONL streaming
├── ingest.py              # Pipeline orchestration
├── ids.py                 # Deterministic ID generation
├── config.py              # Config discovery and validation
├── eval/                  # Evaluation corpus and benchmark harness
└── cli/main.py            # Command-line interface

Plugin system

Third-party packages can provide custom loaders, parsers, chunkers, sinks, and adapters via Python entry points. No core modification required.

Creating a plugin

In your package's pyproject.toml:

[project.entry-points."docprep.parsers"]
my-format = "my_package.parser:MyFormatParser"

Your parser class must implement the Parser protocol:

from docprep.loaders.types import LoadedSource
from docprep.models.domain import Document

class MyFormatParser:
    def parse(self, loaded_source: LoadedSource) -> Document:
        ...

Plugin import failures produce warnings but never break built-in components.

Design decisions

Key architectural decisions are documented as Architecture Decision Records:

Identity model: Anchor-based stable IDs with content hash for change detection
Adapter-not-parser: docprep normalizes, external tools parse
Chunking strategy: Markdown-aware boundaries, then token-budget splitting
Plugin registry: Entry-point discovery via importlib.metadata
Diff-then-sync: Structural diff for incremental updates
Export contract: VectorRecordV1 with mandatory provenance fields
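
The chunking strategy listed above (structural boundaries first, then a token budget) can be approximated in a few lines. The sketch below is a simplified stand-in, not docprep's chunkers: the function names are hypothetical, only ATX headings and triple-backtick fences are recognised, and the token step uses plain whitespace splitting, which discards the original line layout.

```python
import re


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets heading ""."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        match = None if in_fence else re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            sections.append((match.group(1).strip(), []))
        else:
            sections[-1][1].append(line)
    return [(h, "\n".join(b).strip()) for h, b in sections if h or "".join(b).strip()]


def split_by_tokens(text: str, max_tokens: int, overlap_tokens: int = 0) -> list[str]:
    """Window whitespace tokens, repeating `overlap_tokens` tokens between chunks."""
    if max_tokens <= 0 or not 0 <= overlap_tokens < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap_tokens < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


doc = "intro\n# A\none two three four five\n## B\nsix"
for heading, body in split_by_headings(doc):
    print(repr(heading), split_by_tokens(body, max_tokens=3, overlap_tokens=1))
```

Running the heading pass first means a token window never spans two sections, which keeps each chunk attributable to a single heading path.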

Development

git clone https://github.com/yeongseon/docprep.git
cd docprep
make install

# Run all checks
make check-all    # lint + typecheck + test + security

# Individual commands
make test         # pytest
make lint         # ruff + mypy
make format       # ruff format
make security     # bandit
make cov          # coverage report

License

MIT

Release history

0.1.1 (Apr 12, 2026)
0.1.0 (Apr 12, 2026, this version)

File details

Details for the file docprep-0.1.0.tar.gz.

File metadata

Download URL: docprep-0.1.0.tar.gz
Upload date: Apr 12, 2026
Size: 115.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docprep-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2d8f39e0123e4256479b0948ac8df76446d817545017127846c1610e9798a606`
MD5	`df8a7e01eb11515f2b4f7180483e7449`
BLAKE2b-256	`48c4776e89b2697de92868c18d1fce09c28704d66ecca8d0ab737e0f89285beb`

Provenance

The following attestation bundles were made for docprep-0.1.0.tar.gz:

Publisher: publish-pypi.yml on yeongseon/docprep

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docprep-0.1.0.tar.gz
- Subject digest: 2d8f39e0123e4256479b0948ac8df76446d817545017127846c1610e9798a606
- Sigstore transparency entry: 1280595755
- Sigstore integration time: Apr 12, 2026
Source repository:
- Permalink: yeongseon/docprep@f452cc28db2e03767d3a66f9e4acb401fd318e5b
- Branch / Tag: refs/heads/main
- Owner: https://github.com/yeongseon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@f452cc28db2e03767d3a66f9e4acb401fd318e5b
- Trigger Event: workflow_dispatch

File details

Details for the file docprep-0.1.0-py3-none-any.whl.

File metadata

Download URL: docprep-0.1.0-py3-none-any.whl
Upload date: Apr 12, 2026
Size: 68.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docprep-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f3b58ab8c948b4b13fc32fc0b2a8e05b5eefb90acf00fadb0030457426744311`
MD5	`4b09da8c3a0274f2e18da8d5f1d67873`
BLAKE2b-256	`883324bd7e1f228535f8010f42bb207196ebf4c8ee48f495b320c4ed7bc31160`

Provenance

The following attestation bundles were made for docprep-0.1.0-py3-none-any.whl:

Publisher: publish-pypi.yml on yeongseon/docprep

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docprep-0.1.0-py3-none-any.whl
- Subject digest: f3b58ab8c948b4b13fc32fc0b2a8e05b5eefb90acf00fadb0030457426744311
- Sigstore transparency entry: 1280595764
- Sigstore integration time: Apr 12, 2026
Source repository:
- Permalink: yeongseon/docprep@f452cc28db2e03767d3a66f9e4acb401fd318e5b
- Branch / Tag: refs/heads/main
- Owner: https://github.com/yeongseon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-pypi.yml@f452cc28db2e03767d3a66f9e4acb401fd318e5b
- Trigger Event: workflow_dispatch
