Document loading helpers for Donkit RagOps

donkit-read-engine

Document extraction library for Donkit RagOps. Reads 20+ file formats and produces structured, page-level JSON — with tables as Markdown, headings preserved, and optional LLM-powered image descriptions. Images extracted from documents can be saved locally or uploaded to S3 for use in downstream retrieval.

  • PyPI: pip install donkit-read-engine
  • Python: 3.12 – 3.13
  • License: MIT


Features

  • Extracts text, tables, headings, captions, and code blocks from PDF, DOCX, PPTX, XLSX, HTML, Markdown, LaTeX, images, and more
  • Three processing pipelines: Docling-only, Docling + LLM, or pure LLM vision
  • Extracts and saves document images (PNG) with page association — ready for multimodal retrieval
  • S3 support: uploads JSON result and images to S3 with a single s3_output_prefix parameter
  • Fully async (aread_document) with sync wrapper (read_document)
  • LLM token usage tracking per document

Installation

pip install donkit-read-engine

Quick Start

Text only (no LLM)

from donkit.read_engine.read_engine import DonkitReader

reader = DonkitReader(reading_pipeline="docling")
result = reader.read_document("report.pdf")

print(result.output_path)   # ./processed/report.json
print(result.page_count)    # 42

With LLM image descriptions + save images locally

from donkit.llm import ModelFactory
from donkit.read_engine.read_engine import DonkitReader

llm_model = ModelFactory.create_model("openai", "gpt-4o-mini", {"api_key": "sk-..."})

reader = DonkitReader(
    llm_model=llm_model,
    reading_pipeline="docling_llm",  # default
)

result = reader.read_document("report.pdf", output_dir="./output")
# ./output/report.json
# ./output/images_report/page0001_img0000.png
# ./output/images_report/page0003_img0001.png

print(result.images_dir)              # ./output/images_report
print(result.total_llm_requests)      # 5
print(result.total_prompt_tokens)     # 3200

Async

result = await reader.aread_document("report.pdf", output_dir="./output")
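
Because aread_document is a coroutine, several documents can be read concurrently. The following is a minimal sketch, not part of the library: read_many and max_concurrency are hypothetical names, and it assumes one reader instance can safely be shared across concurrent calls, which the documentation above does not state.

```python
import asyncio
from typing import Any


async def read_many(
    reader: Any,               # any object exposing aread_document, e.g. DonkitReader
    paths: list[str],
    output_dir: str,
    max_concurrency: int = 4,  # hypothetical limit; tune for your workload
) -> list[Any]:
    """Read several documents concurrently, returning results in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def read_one(path: str) -> Any:
        # The semaphore caps how many documents are processed at once.
        async with semaphore:
            return await reader.aread_document(path, output_dir=output_dir)

    return await asyncio.gather(*(read_one(path) for path in paths))
```

From synchronous code this could be driven with asyncio.run(read_many(reader, ["a.pdf", "b.pdf"], "./output")).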

With S3 output

Pass s3_service at construction and s3_output_prefix per call. The JSON result and all extracted images are uploaded; the local staging directory is cleaned up automatically.

from donkit.read_engine.utils.s3 import S3Credentials, S3Service
from donkit.read_engine.read_engine import DonkitReader

s3 = S3Service(S3Credentials(
    access_key_id="...",
    secret_access_key="...",
    region_name="us-east-1",
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-bucket",
))

reader = DonkitReader(
    llm_model=llm_model,  # created as in the LLM example above
    reading_pipeline="docling_llm",
    s3_service=s3,
)

result = await reader.aread_document(
    "report.pdf",
    s3_output_prefix="experiments/exp123/reading",
)
# result.output_path  = "experiments/exp123/reading/report.json"   (S3 key)
# result.images_dir   = "experiments/exp123/reading/images_report" (S3 prefix)

S3 layout:

experiments/exp123/reading/
  report.json
  images_report/
    page0001_img0000.png
    page0003_img0001.png

Note: when S3 is used, result.output_path and result.images_dir contain S3 keys, not local paths.
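
To know ahead of time where artifacts will land, the keys can be derived from the prefix and the source file name. This is a sketch inferred only from the single layout example above (JSON at prefix/stem.json, images under prefix/images_stem); expected_s3_keys is a hypothetical helper, and the library may normalize names differently.

```python
from pathlib import PurePosixPath


def expected_s3_keys(file_path: str, s3_output_prefix: str) -> tuple[str, str]:
    """Return (json_key, images_prefix) following the layout shown above."""
    stem = PurePosixPath(file_path).stem   # "report.pdf" -> "report"
    prefix = s3_output_prefix.strip("/")   # assumption: no leading/trailing slash in keys
    return f"{prefix}/{stem}.json", f"{prefix}/images_{stem}"


json_key, images_prefix = expected_s3_keys("report.pdf", "experiments/exp123/reading")
print(json_key)       # experiments/exp123/reading/report.json
print(images_prefix)  # experiments/exp123/reading/images_report
```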


Pipelines

| Pipeline | Description | Requires llm_model | Saves images |
|---|---|---|---|
| docling_llm | Docling parses the document; LLM describes each image (up to 15 concurrent calls). Images saved to disk / S3. | Yes | Yes |
| docling | Docling only; built-in VLM describes images. No images saved to disk. | No | No |
| llm | PDF pages rasterized and sent entirely to LLM vision. Non-PDF formats via Docling. No images saved. | Yes | No |

Image saving (images_dir) is only supported in the docling_llm pipeline.
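
The constraints in the table above can be checked on the caller's side before constructing a reader. This is a sketch only: PIPELINES and check_pipeline are hypothetical names, and how DonkitReader itself reacts to a missing llm_model is not described here.

```python
# name: (requires llm_model, saves images), transcribed from the table above
PIPELINES = {
    "docling_llm": (True, True),
    "docling": (False, False),
    "llm": (True, False),
}


def check_pipeline(reading_pipeline: str, has_llm_model: bool, want_images: bool = False) -> None:
    """Raise ValueError if the requested setup contradicts the pipeline table."""
    if reading_pipeline not in PIPELINES:
        raise ValueError(f"unknown pipeline: {reading_pipeline!r}")
    requires_llm, saves_images = PIPELINES[reading_pipeline]
    if requires_llm and not has_llm_model:
        raise ValueError(f"pipeline {reading_pipeline!r} requires llm_model")
    if want_images and not saves_images:
        raise ValueError(f"pipeline {reading_pipeline!r} does not save images")


check_pipeline("docling", has_llm_model=False)                        # passes
check_pipeline("docling_llm", has_llm_model=True, want_images=True)   # passes
```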


Supported Formats

Via Docling (primary engine)

| Extension | Format |
|---|---|
| .pdf | PDF |
| .docx | Word |
| .pptx | PowerPoint |
| .xlsx, .csv | Excel / CSV |
| .html | HTML |
| .md, .txt | Markdown / plain text |
| .tex | LaTeX |
| .asciidoc | AsciiDoc |
| .vtt | WebVTT subtitles |
| .png, .jpg, .jpeg, .gif, .webp, .tiff | Images |

Legacy readers

| Extension | Format |
|---|---|
| .json | JSON documents |
| .xls | Legacy Excel (pre-2007) |
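
When walking a directory, it can help to filter files by extension before handing them to the reader. The sketch below transcribes the two extension lists above; engine_for is a hypothetical helper, and the case-insensitive match is an assumption rather than documented behavior.

```python
from pathlib import Path

DOCLING_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".html", ".md", ".txt",
    ".tex", ".asciidoc", ".vtt", ".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff",
}
LEGACY_EXTENSIONS = {".json", ".xls"}


def engine_for(file_path: str) -> str:
    """Return "docling", "legacy", or "unsupported" for a file path."""
    suffix = Path(file_path).suffix.lower()  # assumption: extensions match case-insensitively
    if suffix in DOCLING_EXTENSIONS:
        return "docling"
    if suffix in LEGACY_EXTENSIONS:
        return "legacy"
    return "unsupported"


print(engine_for("report.pdf"))  # docling
print(engine_for("old.xls"))     # legacy
```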

Output Format

Every pipeline writes a single JSON file:

{
  "content": [
    {
      "page": 1,
      "type": "Text",
      "content": "# Title\n\nParagraph text...\n\n| Col1 | Col2 |\n|---|---|\n| A | B |",
      "images": [
        "./output/images_report/page0001_img0000.png"
      ]
    },
    {
      "page": 2,
      "type": "Text",
      "content": "More text on page two."
    }
  ]
}
  • page — 1-based page number from the source document
  • content — Markdown-formatted text: headings, tables, lists, code blocks, captions
  • images — list of saved image paths (local paths or S3 keys); only present when images were extracted

Page headers, footers, and decorative images (those for which the LLM answers SKIP) are excluded from content. The image files themselves are still saved.
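
Downstream code typically loads this JSON and groups entries by page. A minimal sketch, assuming a page number may appear in more than one entry (suggested by page_count being described as the number of unique pages); text_and_images_by_page is a hypothetical helper, and SAMPLE is a shortened copy of the example above.

```python
import json

SAMPLE = r"""
{
  "content": [
    {
      "page": 1,
      "type": "Text",
      "content": "# Title\n\nParagraph text...",
      "images": ["./output/images_report/page0001_img0000.png"]
    },
    {"page": 2, "type": "Text", "content": "More text on page two."}
  ]
}
"""


def text_and_images_by_page(document: dict) -> dict[int, tuple[str, list[str]]]:
    """Map each page number to its joined Markdown text and its image paths."""
    pages = {}
    for item in document["content"]:
        text, images = pages.get(item["page"], ("", []))
        joined = f"{text}\n\n{item['content']}" if text else item["content"]
        # "images" is only present when images were extracted for the entry.
        pages[item["page"]] = (joined, images + item.get("images", []))
    return pages


pages = text_and_images_by_page(json.loads(SAMPLE))
print(pages[2])  # ('More text on page two.', [])
```

With a real result, the document would be loaded from result.output_path instead of SAMPLE (for local output; S3 keys need to be downloaded first).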


API Reference

DonkitReader

class DonkitReader:
    def __init__(
        self,
        output_format: Literal["json", "text", "md"] = "json",
        progress_callback: Callable[[int, int, str | None], None] | None = None,
        llm_model: LLMModelAbstract | None = None,
        reading_pipeline: str = "docling_llm",
        s3_service: S3Service | None = None,
    ) -> None

| Parameter | Type | Default | Description |
|---|---|---|---|
| output_format | "json" / "text" / "md" | "json" | Format hint passed to the image analysis service |
| progress_callback | Callable | None | (current, total, message), called per page in the llm pipeline |
| llm_model | LLMModelAbstract | None | LLM for image descriptions. Required for docling_llm and llm |
| reading_pipeline | str | "docling_llm" | "docling_llm", "docling", or "llm" |
| s3_service | S3Service | None | If set, results are uploaded to S3 when s3_output_prefix is provided |

def read_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult

async def aread_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult

| Parameter | Description |
|---|---|
| file_path | Path to the source document |
| output_dir | Local directory for results. Defaults to processed/ next to the source file. When S3 is used and output_dir is None, a temporary directory is created and cleaned up automatically |
| s3_output_prefix | S3 key prefix for all output artifacts (JSON + images). Requires s3_service set at construction |
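
A progress_callback only needs to accept the (current, total, message) arguments from the constructor signature. The sketch below is one possible callback; format_progress and print_progress are hypothetical names, and whether current starts at 0 or 1 is not stated above.

```python
from typing import Optional


def format_progress(current: int, total: int, message: Optional[str]) -> str:
    """Render one progress line such as "[21/42]  50% page 21"."""
    percent = 100 * current // total if total else 0
    suffix = f" {message}" if message else ""
    return f"[{current}/{total}] {percent:3d}%{suffix}"


def print_progress(current: int, total: int, message: Optional[str]) -> None:
    """Callback matching Callable[[int, int, str | None], None]."""
    print(format_progress(current, total, message))


# Would be passed as: DonkitReader(..., reading_pipeline="llm", progress_callback=print_progress)
print_progress(21, 42, "page 21")  # [21/42]  50% page 21
```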

ReadDocumentResult

@dataclass
class ReadDocumentResult:
    output_path: str                           # Local path to JSON, or S3 key when S3 is used
    page_count: int                            # Number of unique pages
    images_dir: str | None                     # Local images dir, or S3 prefix, or None
    total_llm_requests: int
    total_prompt_tokens: int
    total_completion_tokens: int
    page_split_duration_ms: int | None
    reading_duration_ms: int | None
    gc_duration_ms: int | None
    json_serialize_duration_ms: int | None
    rasterize_duration_ms: int | None
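
The usage counters can be combined into a per-document summary. A sketch under stated assumptions: usage_summary is a hypothetical helper, and the SimpleNamespace stand-in carries illustrative numbers (the completion-token count is made up) rather than output from a real run.

```python
from types import SimpleNamespace


def usage_summary(result) -> dict:
    """Summarize LLM usage from a ReadDocumentResult-like object."""
    total_tokens = result.total_prompt_tokens + result.total_completion_tokens
    requests = result.total_llm_requests
    pages = result.page_count
    return {
        "total_tokens": total_tokens,
        # Guard against division by zero for runs without LLM calls or pages.
        "tokens_per_request": total_tokens / requests if requests else 0.0,
        "tokens_per_page": total_tokens / pages if pages else 0.0,
    }


# Stand-in with illustrative values; a real result comes from read_document().
fake_result = SimpleNamespace(
    page_count=40,
    total_llm_requests=5,
    total_prompt_tokens=3200,
    total_completion_tokens=800,
)
print(usage_summary(fake_result)["total_tokens"])  # 4000
```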

S3Service / S3Credentials

@dataclass
class S3Credentials:
    access_key_id: str
    secret_access_key: str
    region_name: str
    endpoint_url: str
    bucket_name: str

class S3Service:
    def __init__(self, credentials: S3Credentials)
    async def download_file(self, s3_path: str, local_path: str) -> None
    async def upload_file(self, local_path: str, s3_path: str) -> None
    async def upload_content(self, s3_path: str, content: bytes) -> None

CLI

# Single file
donkit-read-engine report.pdf

# Directory (recursive)
donkit-read-engine ./documents/

# With OCR settings (unstructured backend)
donkit-read-engine scan.pdf --pdf-strategy hi_res --ocr-lang rus+eng

| Argument | Values | Default |
|---|---|---|
| file_path | file or directory | |
| --output-type | text, json, markdown | json |
| --pdf-strategy | fast, hi_res, ocr_only, auto | |
| --ocr-lang | e.g. rus+eng | |

The CLI uses the docling pipeline (no LLM).
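
When the CLI is driven from a script, the argument list can be assembled and validated against the table above before it is run. This is a sketch: build_command is a hypothetical helper that uses only the flags listed here, and it builds the command without executing it.

```python
import shlex
from typing import Optional

OUTPUT_TYPES = {"text", "json", "markdown"}
PDF_STRATEGIES = {"fast", "hi_res", "ocr_only", "auto"}


def build_command(
    path: str,
    output_type: str = "json",
    pdf_strategy: Optional[str] = None,
    ocr_lang: Optional[str] = None,
) -> list[str]:
    """Build an argv list for the donkit-read-engine CLI."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unsupported --output-type: {output_type!r}")
    if pdf_strategy is not None and pdf_strategy not in PDF_STRATEGIES:
        raise ValueError(f"unsupported --pdf-strategy: {pdf_strategy!r}")
    command = ["donkit-read-engine", path, "--output-type", output_type]
    if pdf_strategy is not None:
        command += ["--pdf-strategy", pdf_strategy]
    if ocr_lang is not None:
        command += ["--ocr-lang", ocr_lang]
    return command


argv = build_command("scan.pdf", pdf_strategy="hi_res", ocr_lang="rus+eng")
print(shlex.join(argv))
# donkit-read-engine scan.pdf --output-type json --pdf-strategy hi_res --ocr-lang rus+eng
```

The resulting list can be passed to subprocess.run(argv, check=True).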


Environment Variables

| Variable | Default | Description |
|---|---|---|
| UNSTRUCTURED_STRATEGY | hi_res | PDF OCR strategy for the unstructured backend |
| UNSTRUCTURED_OCR_LANG | rus+eng | OCR language codes |

LLM credentials are not read from environment. Pass llm_model explicitly.
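
The defaults in the table above resolve as in the following sketch. It only illustrates the lookup with fallbacks; ocr_settings is a hypothetical helper, and how the library itself reads these variables is not shown in this document.

```python
import os


def ocr_settings(environ=os.environ) -> dict:
    """Resolve OCR settings from the environment, using the documented defaults."""
    strategy = environ.get("UNSTRUCTURED_STRATEGY", "hi_res")
    ocr_lang = environ.get("UNSTRUCTURED_OCR_LANG", "rus+eng")
    # Language codes are joined with "+", as in "rus+eng".
    return {"strategy": strategy, "languages": ocr_lang.split("+")}


print(ocr_settings({}))  # {'strategy': 'hi_res', 'languages': ['rus', 'eng']}
```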


Dependencies

| Package | Purpose |
|---|---|
| docling | Primary document conversion engine |
| pymupdf | PDF rasterization (llm pipeline) |
| unstructured[pdf] | PDF OCR with Tesseract |
| python-docx, python-pptx | Office format parsing |
| pandas | Excel / CSV processing |
| pillow | Image extraction and encoding |
| donkit-llm | LLM provider abstraction |
| aioboto3 | Async S3 client |
| json-repair | Fix malformed JSON from LLM output |
| loguru | Structured logging |
