Document loading helpers for Donkit RagOps
donkit-read-engine
Document extraction library for Donkit RagOps. Reads 20+ file formats and produces structured, page-level JSON — with tables as Markdown, headings preserved, and optional LLM-powered image descriptions. Images extracted from documents can be saved locally or uploaded to S3 for use in downstream retrieval.
PyPI: pip install donkit-read-engine
Python: 3.12 – 3.13
License: MIT
Features
- Extracts text, tables, headings, captions, code blocks from PDF, DOCX, PPTX, XLSX, HTML, Markdown, LaTeX, images, and more
- Three processing pipelines: Docling-only, Docling + LLM, or pure LLM vision
- Extracts and saves document images (PNG) with page association — ready for multimodal retrieval
- S3 support: uploads the JSON result and images to S3 with a single `s3_output_prefix` parameter
- Fully async (`aread_document`) with a sync wrapper (`read_document`)
- LLM token usage tracking per document
Installation
pip install donkit-read-engine
Quick Start
Text only (no LLM)
from donkit.read_engine.read_engine import DonkitReader
reader = DonkitReader(reading_pipeline="docling")
result = reader.read_document("report.pdf")
print(result.output_path) # ./processed/report.json
print(result.page_count) # 42
With LLM image descriptions + save images locally
from donkit.llm import ModelFactory
from donkit.read_engine.read_engine import DonkitReader
llm_model = ModelFactory.create_model("openai", "gpt-4o-mini", {"api_key": "sk-..."})
reader = DonkitReader(
    llm_model=llm_model,
    reading_pipeline="docling_llm",  # default
)
result = reader.read_document("report.pdf", output_dir="./output")
# ./output/report.json
# ./output/images_report/page0001_img0000.png
# ./output/images_report/page0003_img0001.png
print(result.images_dir) # ./output/images_report
print(result.total_llm_requests) # 5
print(result.total_prompt_tokens) # 3200
Async
result = await reader.aread_document("report.pdf", output_dir="./output")
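Since `aread_document` is a coroutine, several documents can in principle be read concurrently. The sketch below is one way to do that with a bounded number of in-flight reads. `read_many` and its `limit` parameter are hypothetical, and it assumes a single reader instance can safely serve overlapping calls, which this README does not state.

```python
import asyncio
from typing import Any


async def read_many(reader: Any, paths: list[str], output_dir: str, limit: int = 4) -> list[Any]:
    """Read several documents concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def read_one(path: str) -> Any:
        async with semaphore:
            # Uses only the documented aread_document(file_path, output_dir=...) call.
            return await reader.aread_document(path, output_dir=output_dir)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(read_one(p) for p in paths))
```

From synchronous code this could be driven with `asyncio.run(read_many(reader, ["a.pdf", "b.pdf"], "./output"))`.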
With S3 output
Pass s3_service at construction and s3_output_prefix per call. The JSON result and all extracted images are uploaded; the local staging directory is cleaned up automatically.
from donkit.read_engine.utils.s3 import S3Credentials, S3Service
from donkit.read_engine.read_engine import DonkitReader
s3 = S3Service(S3Credentials(
    access_key_id="...",
    secret_access_key="...",
    region_name="us-east-1",
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-bucket",
))
reader = DonkitReader(
    llm_model=llm_model,  # created as in the previous example
    reading_pipeline="docling_llm",
    s3_service=s3,
)
# Call from inside an async function
result = await reader.aread_document(
    "report.pdf",
    s3_output_prefix="experiments/exp123/reading",
)
# result.output_path = "experiments/exp123/reading/report.json" (S3 key)
# result.images_dir = "experiments/exp123/reading/images_report" (S3 prefix)
S3 layout:
experiments/exp123/reading/
    report.json
    images_report/
        page0001_img0000.png
        page0003_img0001.png
Note: when S3 is used, `result.output_path` and `result.images_dir` contain S3 keys, not local paths.
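To predict where artifacts will land before a run, the layout shown above can be reproduced with plain string handling. This is a sketch that assumes the `<stem>.json` and `images_<stem>` naming seen in the example applies to any input file; `expected_s3_keys` is a hypothetical helper, not part of the library.

```python
from pathlib import PurePosixPath


def expected_s3_keys(s3_output_prefix: str, file_path: str) -> tuple[str, str]:
    """Return (json_key, images_prefix) following the layout in the example."""
    stem = PurePosixPath(file_path).stem
    prefix = s3_output_prefix.strip("/")
    return f"{prefix}/{stem}.json", f"{prefix}/images_{stem}"
```

For `report.pdf` and the prefix `experiments/exp123/reading`, this yields the same two values shown in the comments of the S3 example.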
Pipelines
| Pipeline | Description | Requires `llm_model` | Saves images |
|---|---|---|---|
| `docling_llm` | Docling parses the document; LLM describes each image (up to 15 concurrent calls). Images saved to disk / S3. | Yes | Yes |
| `docling` | Docling only; built-in VLM describes images. No images saved to disk. | No | No |
| `llm` | PDF pages rasterized and sent entirely to LLM vision. Non-PDF formats via Docling. No images saved. | Yes | No |
Image saving (`images_dir`) is only supported in the `docling_llm` pipeline.
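The requirements in the table can be checked before constructing a reader. The following is an illustrative sketch only: the `PIPELINES` dictionary and `check_pipeline` are hypothetical names that restate the table, and the library may validate these conditions differently or not at all.

```python
# Capabilities copied from the pipeline table; the dict itself is illustrative.
PIPELINES = {
    "docling_llm": {"requires_llm": True, "saves_images": True},
    "docling": {"requires_llm": False, "saves_images": False},
    "llm": {"requires_llm": True, "saves_images": False},
}


def check_pipeline(name: str, has_llm_model: bool) -> dict[str, bool]:
    """Validate a pipeline choice and return its capabilities."""
    if name not in PIPELINES:
        raise ValueError(f"unknown pipeline: {name!r}")
    caps = PIPELINES[name]
    if caps["requires_llm"] and not has_llm_model:
        raise ValueError(f"pipeline {name!r} requires llm_model")
    return caps
```

Checking `caps["saves_images"]` up front avoids expecting an `images_dir` from a pipeline that never writes one.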
Supported Formats
Via Docling (primary engine)
| Extension | Format |
|---|---|
| `.pdf` | PDF |
| `.docx` | Word |
| `.pptx` | PowerPoint |
| `.xlsx`, `.csv` | Excel / CSV |
| `.html` | HTML |
| `.md`, `.txt` | Markdown / plain text |
| `.tex` | LaTeX |
| `.asciidoc` | AsciiDoc |
| `.vtt` | WebVTT subtitles |
| `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.tiff` | Images |
Legacy readers
| Extension | Format |
|---|---|
| `.json` | JSON documents |
| `.xls` | Legacy Excel (pre-2007) |
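When walking a directory, it can help to skip files the reader is unlikely to handle. A minimal sketch, assuming the two tables above are the complete list of extensions (the feature list says "and more", so the real set may be larger); `is_supported` is a hypothetical helper.

```python
from pathlib import Path

# Extensions copied from the two tables above; possibly incomplete.
DOCLING_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".html", ".md", ".txt",
    ".tex", ".asciidoc", ".vtt", ".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff",
}
LEGACY_EXTENSIONS = {".json", ".xls"}


def is_supported(file_path: str) -> bool:
    """Return True if the file extension appears in either table (case-insensitive)."""
    return Path(file_path).suffix.lower() in DOCLING_EXTENSIONS | LEGACY_EXTENSIONS
```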
Output Format
Every pipeline writes a single JSON file:
{
  "content": [
    {
      "page": 1,
      "type": "Text",
      "content": "# Title\n\nParagraph text...\n\n| Col1 | Col2 |\n|---|---|\n| A | B |",
      "images": [
        "./output/images_report/page0001_img0000.png"
      ]
    },
    {
      "page": 2,
      "type": "Text",
      "content": "More text on page two."
    }
  ]
}
- `page`: 1-based page number from the source document
- `content`: Markdown-formatted text (headings, tables, lists, code blocks, captions)
- `images`: list of saved image paths (local paths or S3 keys); only present when images were extracted
Page headers, footers, and decorative images (those for which the LLM answers `SKIP`) are excluded from `content`, but the images are still saved.
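The output can be consumed with the standard library alone. The sketch below groups text and image paths by page, treating `images` as optional as described above. `pages_from_output` is a hypothetical helper, and it assumes a page may appear in more than one block.

```python
import json
from collections import defaultdict


def pages_from_output(raw: str) -> dict[int, dict[str, list[str]]]:
    """Group text blocks and image paths by page number."""
    pages: dict[int, dict[str, list[str]]] = defaultdict(lambda: {"text": [], "images": []})
    for block in json.loads(raw)["content"]:
        entry = pages[block["page"]]
        entry["text"].append(block["content"])
        # "images" is only present when images were extracted.
        entry["images"].extend(block.get("images", []))
    return dict(pages)
```

For a local run, `raw` would typically come from `Path(result.output_path).read_text(encoding="utf-8")`.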
API Reference
DonkitReader
class DonkitReader:
    def __init__(
        self,
        output_format: Literal["json", "text", "md"] = "json",
        progress_callback: Callable[[int, int, str | None], None] | None = None,
        llm_model: LLMModelAbstract | None = None,
        reading_pipeline: str = "docling_llm",
        s3_service: S3Service | None = None,
    ) -> None
| Parameter | Type | Default | Description |
|---|---|---|---|
| `output_format` | `"json"` / `"text"` / `"md"` | `"json"` | Format hint passed to the image analysis service |
| `progress_callback` | `Callable` | `None` | Called with `(current, total, message)` once per page in the `llm` pipeline |
| `llm_model` | `LLMModelAbstract` | `None` | LLM for image descriptions. Required for `docling_llm` and `llm` |
| `reading_pipeline` | `str` | `"docling_llm"` | `"docling_llm"`, `"docling"`, or `"llm"` |
| `s3_service` | `S3Service` | `None` | If set, results are uploaded to S3 when `s3_output_prefix` is provided |
def read_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult
async def aread_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult
| Parameter | Description |
|---|---|
| `file_path` | Path to the source document |
| `output_dir` | Local directory for results. Defaults to `processed/` next to the source file. When S3 is used and `output_dir` is `None`, a temporary directory is created and cleaned up automatically |
| `s3_output_prefix` | S3 key prefix for all output artifacts (JSON + images). Requires `s3_service` set at construction |
ReadDocumentResult
@dataclass
class ReadDocumentResult:
    output_path: str  # Local path to JSON, or S3 key when S3 is used
    page_count: int  # Number of unique pages
    images_dir: str | None  # Local images dir, or S3 prefix, or None
    total_llm_requests: int
    total_prompt_tokens: int
    total_completion_tokens: int
    page_split_duration_ms: int | None
    reading_duration_ms: int | None
    gc_duration_ms: int | None
    json_serialize_duration_ms: int | None
    rasterize_duration_ms: int | None
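The token counters on each result make it straightforward to total LLM usage across a batch. A minimal sketch, relying only on the three `total_*` fields listed above; `UsageTotals` and `sum_usage` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class UsageTotals:
    requests: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def sum_usage(results: Iterable[Any]) -> UsageTotals:
    """Add up LLM usage over objects exposing the ReadDocumentResult counters."""
    totals = UsageTotals()
    for result in results:
        totals.requests += result.total_llm_requests
        totals.prompt_tokens += result.total_prompt_tokens
        totals.completion_tokens += result.total_completion_tokens
    return totals
```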
S3Service / S3Credentials
@dataclass
class S3Credentials:
    access_key_id: str
    secret_access_key: str
    region_name: str
    endpoint_url: str
    bucket_name: str
class S3Service:
    def __init__(self, credentials: S3Credentials)
    async def download_file(self, s3_path: str, local_path: str) -> None
    async def upload_file(self, local_path: str, s3_path: str) -> None
    async def upload_content(self, s3_path: str, content: bytes) -> None
CLI
# Single file
donkit-read-engine report.pdf
# Directory (recursive)
donkit-read-engine ./documents/
# With OCR settings (unstructured backend)
donkit-read-engine scan.pdf --pdf-strategy hi_res --ocr-lang rus+eng
| Argument | Values | Default |
|---|---|---|
| `file_path` | file or directory | none |
| `--output-type` | `text`, `json`, `markdown` | `json` |
| `--pdf-strategy` | `fast`, `hi_res`, `ocr_only`, `auto` | none |
| `--ocr-lang` | e.g. `rus+eng` | none |
The CLI uses the docling pipeline (no LLM).
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `UNSTRUCTURED_STRATEGY` | `hi_res` | PDF OCR strategy for the unstructured backend |
| `UNSTRUCTURED_OCR_LANG` | `rus+eng` | OCR language codes |
LLM credentials are not read from the environment. Pass `llm_model` explicitly.
Dependencies
| Package | Purpose |
|---|---|
| `docling` | Primary document conversion engine |
| `pymupdf` | PDF rasterization (`llm` pipeline) |
| `unstructured[pdf]` | PDF OCR with Tesseract |
| `python-docx`, `python-pptx` | Office format parsing |
| `pandas` | Excel / CSV processing |
| `pillow` | Image extraction and encoding |
| `donkit-llm` | LLM provider abstraction |
| `aioboto3` | Async S3 client |
| `json-repair` | Fix malformed JSON from LLM output |
| `loguru` | Structured logging |