Document loading helpers for Donkit RagOps

donkit-read-engine

Document extraction library for Donkit RagOps. Reads 20+ file formats and produces structured, page-level JSON — with tables as Markdown, headings preserved, and optional LLM-powered image descriptions. Images extracted from documents can be saved locally or uploaded to S3 for use in downstream retrieval.

PyPI: pip install donkit-read-engine
Python: 3.12 – 3.13
License: MIT

Features

Extracts text, tables, headings, captions, and code blocks from PDF, DOCX, PPTX, XLSX, HTML, Markdown, LaTeX, images, and more
Three processing pipelines: Docling-only, Docling + LLM, or pure LLM vision
Extracts and saves document images (PNG) with page association — ready for multimodal retrieval
S3 support: uploads JSON result and images to S3 with a single s3_output_prefix parameter
Fully async (aread_document) with sync wrapper (read_document)
LLM token usage tracking per document

Installation

pip install donkit-read-engine

Quick Start

Text only (no LLM)

from donkit.read_engine.read_engine import DonkitReader

reader = DonkitReader(reading_pipeline="docling")
result = reader.read_document("report.pdf")

print(result.output_path)   # ./processed/report.json
print(result.page_count)    # 42

With LLM image descriptions + save images locally

from donkit.llm import ModelFactory
from donkit.read_engine.read_engine import DonkitReader

llm_model = ModelFactory.create_model("openai", "gpt-4o-mini", {"api_key": "sk-..."})

reader = DonkitReader(
    llm_model=llm_model,
    reading_pipeline="docling_llm",  # default
)

result = reader.read_document("report.pdf", output_dir="./output")
# ./output/report.json
# ./output/images_report/page0001_img0000.png
# ./output/images_report/page0003_img0001.png

print(result.images_dir)              # ./output/images_report
print(result.total_llm_requests)      # 5
print(result.total_prompt_tokens)     # 3200
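
The image file names in the example output appear to encode the page number and an image index. A minimal sketch for recovering both, assuming the `pageNNNN_imgNNNN.png` pattern holds for every saved image (the helper `parse_image_name` is hypothetical and not part of the library):

```python
from __future__ import annotations

import re
from pathlib import PurePosixPath

# Pattern inferred from the example file names; treat it as an assumption.
_IMAGE_NAME = re.compile(r"^page(\d+)_img(\d+)\.png$")


def parse_image_name(path: str) -> tuple[int, int] | None:
    """Return (page, image_index) for a saved image path, or None if the name does not match."""
    match = _IMAGE_NAME.match(PurePosixPath(path).name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_image_name("./output/images_report/page0003_img0001.png"))  # (3, 1)
```

Because the same names are used for S3 keys, the helper should work unchanged on `result.images_dir` listings from either backend, as long as the naming assumption holds.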

Async

result = await reader.aread_document("report.pdf", output_dir="./output")
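
Since `aread_document` is a coroutine, several documents can in principle be read concurrently. The sketch below bounds concurrency with a semaphore. It uses a stand-in `FakeReader` so that it runs without the library installed; whether a single `DonkitReader` instance is safe to share across concurrent calls is not stated here, so treat that as an assumption to verify.

```python
from __future__ import annotations

import asyncio


class FakeReader:
    """Stand-in with the same aread_document call shape; swap in a real DonkitReader."""

    async def aread_document(self, file_path: str, output_dir: str | None = None) -> str:
        await asyncio.sleep(0)  # simulate I/O
        return f"{output_dir}/{file_path.rsplit('.', 1)[0]}.json"


async def read_many(reader, paths: list[str], output_dir: str, limit: int = 4) -> list:
    """Read several documents concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def read_one(path: str):
        async with semaphore:
            return await reader.aread_document(path, output_dir=output_dir)

    # gather returns results in the same order as `paths`
    return await asyncio.gather(*(read_one(p) for p in paths))


results = asyncio.run(read_many(FakeReader(), ["a.pdf", "b.docx"], "./output"))
print(results)  # ['./output/a.json', './output/b.json']
```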

With S3 output

Pass s3_service at construction and s3_output_prefix per call. The JSON result and all extracted images are uploaded; the local staging directory is cleaned up automatically.

from donkit.read_engine.utils.s3 import S3Credentials, S3Service
from donkit.read_engine.read_engine import DonkitReader

s3 = S3Service(S3Credentials(
    access_key_id="...",
    secret_access_key="...",
    region_name="us-east-1",
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-bucket",
))

reader = DonkitReader(
    llm_model=llm_model,
    reading_pipeline="docling_llm",
    s3_service=s3,
)

result = await reader.aread_document(
    "report.pdf",
    s3_output_prefix="experiments/exp123/reading",
)
# result.output_path  = "experiments/exp123/reading/report.json"   (S3 key)
# result.images_dir   = "experiments/exp123/reading/images_report" (S3 prefix)

S3 layout:

experiments/exp123/reading/
  report.json
  images_report/
    page0001_img0000.png
    page0003_img0001.png

Note: when S3 is used, result.output_path and result.images_dir contain S3 keys, not local paths.
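
The layout above suggests that the keys can be predicted from the prefix and the source file name. A small sketch, assuming the `<prefix>/<stem>.json` and `<prefix>/images_<stem>` rule shown in the example applies generally (the helper `expected_s3_keys` is hypothetical):

```python
from pathlib import PurePosixPath


def expected_s3_keys(file_path: str, s3_output_prefix: str) -> tuple[str, str]:
    """Return (json_key, images_prefix) following the example layout."""
    stem = PurePosixPath(file_path).stem
    prefix = s3_output_prefix.rstrip("/")  # tolerate a trailing slash
    return f"{prefix}/{stem}.json", f"{prefix}/images_{stem}"


print(expected_s3_keys("report.pdf", "experiments/exp123/reading"))
# ('experiments/exp123/reading/report.json', 'experiments/exp123/reading/images_report')
```

In practice, prefer the values returned in `result.output_path` and `result.images_dir` over a recomputed guess.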

Pipelines

Pipeline	Description	Requires `llm_model`	Saves images
`docling_llm`	Docling parses the document; LLM describes each image (up to 15 concurrent calls). Images saved to disk / S3.	Yes	Yes
`docling`	Docling only; built-in VLM describes images. No images saved to disk.	No	No
`llm`	PDF pages rasterized and sent entirely to LLM vision. Non-PDF formats via Docling. No images saved.	Yes	No

Image saving (images_dir) is only supported in the docling_llm pipeline.
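
The table can be mirrored in a small lookup to fail fast before a reader is constructed. This is only a sketch: the names `PipelineInfo` and `check_pipeline` are hypothetical, and the library may validate its arguments differently or not at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineInfo:
    requires_llm: bool
    saves_images: bool


# Values copied from the Pipelines table.
PIPELINES = {
    "docling_llm": PipelineInfo(requires_llm=True, saves_images=True),
    "docling": PipelineInfo(requires_llm=False, saves_images=False),
    "llm": PipelineInfo(requires_llm=True, saves_images=False),
}


def check_pipeline(name: str, has_llm_model: bool) -> PipelineInfo:
    """Raise ValueError for an unknown pipeline or a missing llm_model."""
    if name not in PIPELINES:
        raise ValueError(f"unknown pipeline: {name!r}")
    info = PIPELINES[name]
    if info.requires_llm and not has_llm_model:
        raise ValueError(f"pipeline {name!r} requires llm_model")
    return info


print(check_pipeline("docling", has_llm_model=False).saves_images)  # False
```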

Supported Formats

Via Docling (primary engine)

Extension	Format
`.pdf`	PDF
`.docx`	Word
`.pptx`	PowerPoint
`.xlsx`, `.csv`	Excel / CSV
`.html`	HTML
`.md`, `.txt`	Markdown / plain text
`.tex`	LaTeX
`.asciidoc`	AsciiDoc
`.vtt`	WebVTT subtitles
`.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.tiff`	Images

Legacy readers

Extension	Format
`.json`	JSON documents
`.xls`	Legacy Excel (pre-2007)
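
To filter a directory before reading, the two tables can be turned into a lookup. A sketch under two assumptions: the tables may not be exhaustive (the description mentions 20+ formats), and matching is done case-insensitively here, which the library itself may or may not do. The helper `reader_family` is hypothetical.

```python
from __future__ import annotations

from pathlib import Path

# Extensions copied from the two tables; the real library may accept more.
DOCLING_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".html", ".md", ".txt",
    ".tex", ".asciidoc", ".vtt", ".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff",
}
LEGACY_EXTENSIONS = {".json", ".xls"}


def reader_family(file_path: str) -> str | None:
    """Return "docling", "legacy", or None for an extension not listed in the tables."""
    suffix = Path(file_path).suffix.lower()
    if suffix in DOCLING_EXTENSIONS:
        return "docling"
    if suffix in LEGACY_EXTENSIONS:
        return "legacy"
    return None


print(reader_family("Report.PDF"), reader_family("old.xls"), reader_family("tool.exe"))
# docling legacy None
```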

Output Format

Every pipeline writes a single JSON file:

{
  "content": [
    {
      "page": 1,
      "type": "Text",
      "content": "# Title\n\nParagraph text...\n\n| Col1 | Col2 |\n|---|---|\n| A | B |",
      "images": [
        "./output/images_report/page0001_img0000.png"
      ]
    },
    {
      "page": 2,
      "type": "Text",
      "content": "More text on page two."
    }
  ]
}

page — 1-based page number from the source document
content — Markdown-formatted text: headings, tables, lists, code blocks, captions
images — list of saved image paths (local paths or S3 keys); only present when images were extracted

Page headers, footers, and decorative images (those for which the LLM answers SKIP) are excluded from content, but the image files themselves are still saved.
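
A short sketch of consuming this output downstream. It relies only on the fields shown in the example (`page`, `content`, and the optional `images`); the helper name `summarize_output` and the inline sample document are illustrative.

```python
import json


def summarize_output(raw_json: str) -> list[tuple[int, int, list[str]]]:
    """Return (page, text_length, image_paths) for every entry in "content"."""
    document = json.loads(raw_json)
    summary = []
    for entry in document["content"]:
        # "images" is only present when images were extracted for the entry
        images = entry.get("images", [])
        summary.append((entry["page"], len(entry["content"]), images))
    return summary


# Inline sample shaped like the example output; real files come from result.output_path.
raw = json.dumps({"content": [
    {"page": 1, "type": "Text", "content": "# Title",
     "images": ["./output/images_report/page0001_img0000.png"]},
    {"page": 2, "type": "Text", "content": "More text on page two."},
]})

for page, length, images in summarize_output(raw):
    print(page, length, images)
```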

API Reference

`DonkitReader`

class DonkitReader:
    def __init__(
        self,
        output_format: Literal["json", "text", "md"] = "json",
        progress_callback: Callable[[int, int, str | None], None] | None = None,
        llm_model: LLMModelAbstract | None = None,
        reading_pipeline: str = "docling_llm",
        s3_service: S3Service | None = None,
    ) -> None

Parameter	Type	Default	Description
`output_format`	`"json"` / `"text"` / `"md"`	`"json"`	Format hint passed to the image analysis service
`progress_callback`	`Callable`	`None`	`(current, total, message)` — called per page in `llm` pipeline
`llm_model`	`LLMModelAbstract`	`None`	LLM for image descriptions. Required for `docling_llm` and `llm`
`reading_pipeline`	`str`	`"docling_llm"`	`"docling_llm"`, `"docling"`, or `"llm"`
`s3_service`	`S3Service`	`None`	If set, results are uploaded to S3 when `s3_output_prefix` is provided
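
A sketch of a callback matching the `(current, total, message)` signature. Whether `current` starts at 0 or 1 is not specified here, so the formatting below makes no assumption about it; the names `format_progress` and `on_progress` are illustrative.

```python
from __future__ import annotations


def format_progress(current: int, total: int, message: str | None) -> str:
    """Render one progress update as a single line of text."""
    percent = 100 * current // total if total else 0
    suffix = f" - {message}" if message else ""
    return f"{current}/{total} ({percent}%){suffix}"


def on_progress(current: int, total: int, message: str | None) -> None:
    """Callback with the shape expected by progress_callback."""
    print(format_progress(current, total, message))


on_progress(3, 12, None)          # 3/12 (25%)
on_progress(6, 12, "page done")   # 6/12 (50%) - page done
```

It would be passed as `DonkitReader(..., progress_callback=on_progress)`; per the table, it is only called in the `llm` pipeline.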

def read_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult

async def aread_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult

Parameter	Description
`file_path`	Path to the source document
`output_dir`	Local directory for results. Defaults to `processed/` next to the source file. When S3 is used and `output_dir` is `None`, a temporary directory is created and cleaned up automatically
`s3_output_prefix`	S3 key prefix for all output artifacts (JSON + images). Requires `s3_service` set at construction

`ReadDocumentResult`

@dataclass
class ReadDocumentResult:
    output_path: str                           # Local path to JSON, or S3 key when S3 is used
    page_count: int                            # Number of unique pages
    images_dir: str | None                     # Local images dir, or S3 prefix, or None
    total_llm_requests: int
    total_prompt_tokens: int
    total_completion_tokens: int
    page_split_duration_ms: int | None
    reading_duration_ms: int | None
    gc_duration_ms: int | None
    json_serialize_duration_ms: int | None
    rasterize_duration_ms: int | None
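
A sketch of condensing these fields into two headline numbers. It uses a `SimpleNamespace` with made-up values in place of a real `ReadDocumentResult`, and it simply adds up the durations that are not `None`; whether those timings overlap is not documented, so the sum is only a rough figure.

```python
from types import SimpleNamespace

DURATION_FIELDS = (
    "page_split_duration_ms",
    "reading_duration_ms",
    "gc_duration_ms",
    "json_serialize_duration_ms",
    "rasterize_duration_ms",
)


def usage_summary(result) -> dict[str, int]:
    """Total token count plus the sum of all durations that were measured."""
    durations = [getattr(result, name) for name in DURATION_FIELDS]
    return {
        "total_tokens": result.total_prompt_tokens + result.total_completion_tokens,
        "measured_ms": sum(d for d in durations if d is not None),
    }


# Illustrative values only, not real measurements.
fake_result = SimpleNamespace(
    total_prompt_tokens=3200,
    total_completion_tokens=400,
    page_split_duration_ms=120,
    reading_duration_ms=9000,
    gc_duration_ms=None,
    json_serialize_duration_ms=15,
    rasterize_duration_ms=None,
)
print(usage_summary(fake_result))  # {'total_tokens': 3600, 'measured_ms': 9135}
```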

`S3Service` / `S3Credentials`

@dataclass
class S3Credentials:
    access_key_id: str
    secret_access_key: str
    region_name: str
    endpoint_url: str
    bucket_name: str

class S3Service:
    def __init__(self, credentials: S3Credentials)
    async def download_file(self, s3_path: str, local_path: str) -> None
    async def upload_file(self, local_path: str, s3_path: str) -> None
    async def upload_content(self, s3_path: str, content: bytes) -> None
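
For unit tests, an in-memory stand-in exposing the same three methods can be handy. This is a sketch: `InMemoryS3` is hypothetical, and whether `DonkitReader` accepts a duck-typed object instead of a real `S3Service` is an assumption that would need checking.

```python
import asyncio
import tempfile
from pathlib import Path


class InMemoryS3:
    """Test double with the three S3Service methods, backed by a dict."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    async def upload_content(self, s3_path: str, content: bytes) -> None:
        self.objects[s3_path] = content

    async def upload_file(self, local_path: str, s3_path: str) -> None:
        self.objects[s3_path] = Path(local_path).read_bytes()

    async def download_file(self, s3_path: str, local_path: str) -> None:
        Path(local_path).write_bytes(self.objects[s3_path])


async def demo() -> bytes:
    """Upload bytes, download them to a temp file, and upload that file again."""
    s3 = InMemoryS3()
    await s3.upload_content("prefix/report.json", b'{"content": []}')
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "report.json"
        await s3.download_file("prefix/report.json", str(local))
        await s3.upload_file(str(local), "prefix/copy.json")
    return s3.objects["prefix/copy.json"]


round_tripped = asyncio.run(demo())
print(round_tripped)  # b'{"content": []}'
```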

CLI

# Single file
donkit-read-engine report.pdf

# Directory (recursive)
donkit-read-engine ./documents/

# With OCR settings (unstructured backend)
donkit-read-engine scan.pdf --pdf-strategy hi_res --ocr-lang rus+eng

Argument	Values	Default
`file_path`	file or directory	—
`--output-type`	`text`, `json`, `markdown`	`json`
`--pdf-strategy`	`fast`, `hi_res`, `ocr_only`, `auto`	—
`--ocr-lang`	e.g. `rus+eng`	—

The CLI uses the docling pipeline (no LLM).
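
When the CLI is driven from a script, the argument list can be assembled and validated first. A sketch using only the flags and values from the table above; `build_cli_args` is a hypothetical helper, and the CLI itself may accept values not listed here.

```python
from __future__ import annotations

OUTPUT_TYPES = ("text", "json", "markdown")
PDF_STRATEGIES = ("fast", "hi_res", "ocr_only", "auto")


def build_cli_args(
    file_path: str,
    output_type: str = "json",
    pdf_strategy: str | None = None,
    ocr_lang: str | None = None,
) -> list[str]:
    """Build an argv list for donkit-read-engine, rejecting values not in the table."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unsupported output type: {output_type!r}")
    if pdf_strategy is not None and pdf_strategy not in PDF_STRATEGIES:
        raise ValueError(f"unsupported PDF strategy: {pdf_strategy!r}")
    args = ["donkit-read-engine", file_path, "--output-type", output_type]
    if pdf_strategy is not None:
        args += ["--pdf-strategy", pdf_strategy]
    if ocr_lang is not None:
        args += ["--ocr-lang", ocr_lang]
    return args


print(build_cli_args("scan.pdf", pdf_strategy="hi_res", ocr_lang="rus+eng"))
```

With the package installed, the list can be handed to `subprocess.run(args, check=True)`.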

Environment Variables

Variable	Default	Description
`UNSTRUCTURED_STRATEGY`	`hi_res`	PDF OCR strategy for the unstructured backend
`UNSTRUCTURED_OCR_LANG`	`rus+eng`	OCR language codes

LLM credentials are not read from the environment; pass llm_model explicitly.
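
To inspect which OCR settings would apply in a given environment, the two variables can be read with the documented defaults. A sketch only: `ocr_settings` is a hypothetical helper, and splitting the language string on `+` follows the usual Tesseract convention rather than anything confirmed about the library's internals.

```python
import os
from collections.abc import Mapping


def ocr_settings(environ: Mapping[str, str] = os.environ) -> dict:
    """Resolve the OCR strategy and languages, falling back to the documented defaults."""
    strategy = environ.get("UNSTRUCTURED_STRATEGY", "hi_res")
    ocr_lang = environ.get("UNSTRUCTURED_OCR_LANG", "rus+eng")
    return {"strategy": strategy, "languages": ocr_lang.split("+")}


print(ocr_settings({}))  # {'strategy': 'hi_res', 'languages': ['rus', 'eng']}
print(ocr_settings({"UNSTRUCTURED_STRATEGY": "fast", "UNSTRUCTURED_OCR_LANG": "eng"}))
# {'strategy': 'fast', 'languages': ['eng']}
```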

Dependencies

Package	Purpose
`docling`	Primary document conversion engine
`pymupdf`	PDF rasterization (`llm` pipeline)
`unstructured[pdf]`	PDF OCR with Tesseract
`python-docx`, `python-pptx`	Office format parsing
`pandas`	Excel / CSV processing
`pillow`	Image extraction and encoding
`donkit-llm`	LLM provider abstraction
`aioboto3`	Async S3 client
`json-repair`	Fix malformed JSON from LLM output
`loguru`	Structured logging

Release history

Version	Released
0.5.4	Mar 18, 2026
0.5.3	Mar 18, 2026
0.5.2	Mar 18, 2026
0.5.1	Mar 17, 2026
0.5.0 (this version)	Mar 17, 2026
0.4.1	Mar 13, 2026
0.4.0	Mar 5, 2026
0.3.2	Mar 3, 2026
0.3.1	Feb 26, 2026
0.3.0	Feb 26, 2026
0.2.9	Feb 3, 2026
0.2.8	Jan 29, 2026
0.2.7	Dec 22, 2025
0.2.6	Dec 5, 2025
0.2.5	Nov 27, 2025
0.2.4	Nov 26, 2025
0.2.3	Nov 12, 2025
0.2.2	Nov 7, 2025
0.2.1	Nov 5, 2025
0.2.0	Nov 1, 2025
0.1.13	Oct 31, 2025
0.1.12	Oct 31, 2025
0.1.11	Oct 31, 2025
0.1.10	Oct 30, 2025
0.1.8	Oct 30, 2025
0.1.7	Oct 30, 2025
0.1.6	Oct 30, 2025
0.1.4	Oct 28, 2025
0.1.3	Oct 27, 2025
0.1.2	Oct 22, 2025
0.1.1	Oct 21, 2025
0.1.0	Oct 21, 2025

Download files

File	Type	Size	Uploaded	Tags
donkit_read_engine-0.5.0.tar.gz	Source distribution	38.5 kB	Mar 17, 2026	Source
donkit_read_engine-0.5.0-py3-none-any.whl	Built distribution	47.4 kB	Mar 17, 2026	Python 3

Both files were uploaded via poetry/2.3.2 CPython/3.13.1 Darwin/25.3.0, without Trusted Publishing.

File hashes

Hashes for donkit_read_engine-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`e5c368466364596351f563e005396771c8077e85104a956117935f9ed7074e91`
MD5	`8c1fd613f9364a2c81da1dcf74b30379`
BLAKE2b-256	`88a075d9beb76d16e55c3478b7aada98cfb973c509db040039bd5d7abe15b91c`

Hashes for donkit_read_engine-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8980a3f41022eb33015ff649a985a6daed42ee395c6aa5a4091b78774d047249`
MD5	`ec1f95b138bed84eeecd22793e4ff6e6`
BLAKE2b-256	`1fbd46662d84c6cb9f33fb61755a3553ede9ff2806ed988e4db4c841d9cc81ff`
