Document loading helpers for Donkit RagOps

donkit-read-engine

Document extraction library for Donkit RagOps. Reads 20+ file formats and produces structured, page-level JSON — with tables as Markdown, headings preserved, and optional LLM-powered image descriptions. Images extracted from documents can be saved locally or uploaded to S3 for use in downstream retrieval.

PyPI: pip install donkit-read-engine | Python: 3.12 – 3.13 | License: MIT


Features

  • Extracts text, tables, headings, captions, and code blocks from PDF, DOCX, PPTX, XLSX, HTML, Markdown, LaTeX, images, and more
  • Three processing pipelines: Docling-only, Docling + LLM, or pure LLM vision
  • Extracts and saves document images (PNG) with page association — ready for multimodal retrieval
  • S3 support: uploads the JSON result and images to S3 via a single s3_output_prefix parameter
  • Fully async (aread_document) with sync wrapper (read_document)
  • LLM token usage tracking per document

Installation

pip install donkit-read-engine

Quick Start

Text only (no LLM)

from donkit.read_engine.read_engine import DonkitReader

reader = DonkitReader(reading_pipeline="docling")
result = reader.read_document("report.pdf")

print(result.output_path)   # ./processed/report.json
print(result.page_count)    # 42
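When output_dir is omitted, the result is written to a processed/ directory next to the source file. A minimal sketch of that rule follows; default_output_path is a hypothetical helper, and the `<stem>.json` naming is inferred from the comment above rather than taken from the library's code.

```python
from pathlib import Path


def default_output_path(file_path: str) -> Path:
    """Guess where the JSON lands when output_dir is not given.

    Assumption: <source dir>/processed/<source stem>.json, inferred from the
    documented default, not from the library implementation.
    """
    source = Path(file_path)
    return source.parent / "processed" / f"{source.stem}.json"


print(default_output_path("docs/report.pdf").as_posix())  # docs/processed/report.json
```

In real code, prefer result.output_path, which is what the reader actually wrote.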

With LLM image descriptions + save images locally

from donkit.llm import ModelFactory
from donkit.read_engine.read_engine import DonkitReader

llm_model = ModelFactory.create_model("openai", "gpt-4o-mini", {"api_key": "sk-..."})

reader = DonkitReader(
    llm_model=llm_model,
    reading_pipeline="docling_llm",  # default
)

result = reader.read_document("report.pdf", output_dir="./output")
# ./output/report.json
# ./output/images_report/page0001_img0000.png
# ./output/images_report/page0003_img0001.png

print(result.images_dir)              # ./output/images_report
print(result.total_llm_requests)      # 5
print(result.total_prompt_tokens)     # 3200

Async

# Run inside a coroutine (top-level await only works in an async REPL or notebook)
result = await reader.aread_document("report.pdf", output_dir="./output")
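Because aread_document is a coroutine, several documents can be read concurrently. The sketch below shows one way to cap concurrency with a semaphore. read_many and _StubReader are hypothetical names; the stub stands in for DonkitReader so the example runs without the package. Whether a single DonkitReader instance is safe to share across concurrent calls is not stated here, so verify that before relying on this pattern.

```python
import asyncio
from typing import Any, Protocol


class AsyncReader(Protocol):
    async def aread_document(self, file_path: str, output_dir: str | None = None) -> Any: ...


async def read_many(reader: AsyncReader, paths: list[str], output_dir: str, limit: int = 4) -> list[Any]:
    """Read several documents concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def read_one(path: str) -> Any:
        async with semaphore:
            return await reader.aread_document(path, output_dir=output_dir)

    # gather returns results in the same order as `paths`
    return await asyncio.gather(*(read_one(p) for p in paths))


class _StubReader:
    """Stand-in for DonkitReader; returns a fake output path instead of reading."""

    async def aread_document(self, file_path: str, output_dir: str | None = None) -> str:
        await asyncio.sleep(0)
        return f"{output_dir}/{file_path}.json"


results = asyncio.run(read_many(_StubReader(), ["a.pdf", "b.docx"], "./output", limit=2))
print(results)
```

With the real reader, pass the DonkitReader instance in place of the stub; each element of the returned list is then a ReadDocumentResult.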

With S3 output

Pass s3_service at construction and s3_output_prefix per call. The JSON result and all extracted images are uploaded; the local staging directory is cleaned up automatically.

from donkit.read_engine.utils.s3 import S3Credentials, S3Service
from donkit.read_engine.read_engine import DonkitReader

s3 = S3Service(S3Credentials(
    access_key_id="...",
    secret_access_key="...",
    region_name="us-east-1",
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-bucket",
))

reader = DonkitReader(
    llm_model=llm_model,  # created as in the LLM example above
    reading_pipeline="docling_llm",
    s3_service=s3,
)

result = await reader.aread_document(
    "report.pdf",
    s3_output_prefix="experiments/exp123/reading",
)
# result.output_path  = "experiments/exp123/reading/report.json"   (S3 key)
# result.images_dir   = "experiments/exp123/reading/images_report" (S3 prefix)

S3 layout:

experiments/exp123/reading/
  report.json
  images_report/
    page0001_img0000.png
    page0003_img0001.png

Note: when S3 is used, result.output_path and result.images_dir contain S3 keys, not local paths.
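The image names in the layout above encode the page number, so a list of keys (or local paths) can be grouped by page without opening the JSON. This is a rough sketch: group_images_by_page is a hypothetical helper, and the `page<4 digits>_img<4 digits>.png` pattern is an assumption read off the example names.

```python
import re
from collections import defaultdict

# Assumption: names follow page<4 digits>_img<4 digits>.png, as in the layout above.
IMAGE_NAME = re.compile(r"page(\d{4})_img(\d{4})\.png$")


def group_images_by_page(keys: list[str]) -> dict[int, list[str]]:
    """Map page number -> image keys, skipping keys that do not match the pattern."""
    pages: dict[int, list[str]] = defaultdict(list)
    for key in keys:
        match = IMAGE_NAME.search(key)
        if match:
            pages[int(match.group(1))].append(key)
    return dict(pages)


keys = [
    "experiments/exp123/reading/report.json",
    "experiments/exp123/reading/images_report/page0001_img0000.png",
    "experiments/exp123/reading/images_report/page0003_img0001.png",
]
by_page = group_images_by_page(keys)
print(sorted(by_page))  # [1, 3]
```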


Pipelines

| Pipeline | Description | Requires llm_model | Saves images |
|---|---|---|---|
| docling_llm | Docling parses the document; LLM describes each image (up to 15 concurrent calls). Images saved to disk / S3. | Yes | Yes |
| docling | Docling only; built-in VLM describes images. No images saved to disk. | No | No |
| llm | PDF pages rasterized and sent entirely to LLM vision. Non-PDF formats via Docling. No images saved. | Yes | No |

Image saving (images_dir) is only supported in the docling_llm pipeline.
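The pipeline constraints above can be checked before constructing a reader. The sketch below encodes the table as data. check_pipeline and PIPELINES are hypothetical names and are not part of donkit-read-engine, which may validate these combinations differently or not at all.

```python
# Capabilities copied from the Pipelines table above.
PIPELINES = {
    "docling_llm": {"requires_llm": True, "saves_images": True},
    "docling": {"requires_llm": False, "saves_images": False},
    "llm": {"requires_llm": True, "saves_images": False},
}


def check_pipeline(name: str, has_llm_model: bool, want_images: bool = False) -> list[str]:
    """Return a list of problems with the requested setup (empty means OK)."""
    if name not in PIPELINES:
        return [f"unknown pipeline: {name!r}"]
    caps = PIPELINES[name]
    problems = []
    if caps["requires_llm"] and not has_llm_model:
        problems.append(f"{name} requires llm_model")
    if want_images and not caps["saves_images"]:
        problems.append(f"{name} does not save images; use docling_llm")
    return problems


print(check_pipeline("docling", has_llm_model=False, want_images=True))
# ['docling does not save images; use docling_llm']
```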


Supported Formats

Via Docling (primary engine)

| Extension | Format |
|---|---|
| .pdf | PDF |
| .docx | Word |
| .pptx | PowerPoint |
| .xlsx, .csv | Excel / CSV |
| .html | HTML |
| .md, .txt | Markdown / plain text |
| .tex | LaTeX |
| .asciidoc | AsciiDoc |
| .vtt | WebVTT subtitles |
| .png, .jpg, .jpeg, .gif, .webp, .tiff | Images |

Legacy readers

| Extension | Format |
|---|---|
| .json | JSON documents |
| .xls | Legacy Excel (pre-2007) |

Output Format

Every pipeline writes a single JSON file:

{
  "content": [
    {
      "page": 1,
      "type": "Text",
      "content": "# Title\n\nParagraph text...\n\n| Col1 | Col2 |\n|---|---|\n| A | B |",
      "images": [
        "./output/images_report/page0001_img0000.png"
      ]
    },
    {
      "page": 2,
      "type": "Text",
      "content": "More text on page two."
    }
  ]
}
  • page — 1-based page number from the source document
  • content — Markdown-formatted text: headings, tables, lists, code blocks, captions
  • images — list of saved image paths (local paths or S3 keys); only present when images were extracted

Page headers, footers, and decorative images (those for which the LLM answers SKIP) are excluded from content, but the image files themselves are still saved.
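Consuming the output is a matter of walking the "content" list. The following sketch uses a trimmed copy of the sample above; iter_pages is a hypothetical helper. It treats "images" as optional, as documented, and makes no assumption about how many entries share a page number.

```python
import json
from typing import Iterator


def iter_pages(document: dict) -> Iterator[tuple[int, str, list[str]]]:
    """Yield (page, markdown_text, image_paths) for each entry in "content"."""
    for entry in document["content"]:
        # "images" is only present when images were extracted for that entry
        yield entry["page"], entry["content"], entry.get("images", [])


# With a real result: document = json.loads(Path(result.output_path).read_text())
raw = (
    '{"content": ['
    '{"page": 1, "type": "Text", "content": "# Title", '
    '"images": ["./output/images_report/page0001_img0000.png"]}, '
    '{"page": 2, "type": "Text", "content": "More text on page two."}'
    "]}"
)
document = json.loads(raw)
pages = list(iter_pages(document))

for page, text, images in pages:
    print(page, len(text), images)
```

When S3 output is used, the entries in "images" are S3 keys, so they need to be downloaded before they can be opened locally.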


API Reference

DonkitReader

class DonkitReader:
    def __init__(
        self,
        output_format: Literal["json", "text", "md"] = "json",
        progress_callback: Callable[[int, int, str | None], None] | None = None,
        llm_model: LLMModelAbstract | None = None,
        reading_pipeline: str = "docling_llm",
        s3_service: S3Service | None = None,
    ) -> None

| Parameter | Type | Default | Description |
|---|---|---|---|
| output_format | "json" / "text" / "md" | "json" | Format hint passed to the image analysis service |
| progress_callback | Callable | None | (current, total, message) — called per page in llm pipeline |
| llm_model | LLMModelAbstract | None | LLM for image descriptions. Required for docling_llm and llm |
| reading_pipeline | str | "docling_llm" | "docling_llm", "docling", or "llm" |
| s3_service | S3Service | None | If set, results are uploaded to S3 when s3_output_prefix is provided |

def read_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult

async def aread_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult

| Parameter | Description |
|---|---|
| file_path | Path to the source document |
| output_dir | Local directory for results. Defaults to processed/ next to the source file. When S3 is used and output_dir is None, a temporary directory is created and cleaned up automatically |
| s3_output_prefix | S3 key prefix for all output artifacts (JSON + images). Requires s3_service set at construction |

ReadDocumentResult

@dataclass
class ReadDocumentResult:
    output_path: str                           # Local path to JSON, or S3 key when S3 is used
    page_count: int                            # Number of unique pages
    images_dir: str | None                     # Local images dir, or S3 prefix, or None
    total_llm_requests: int
    total_prompt_tokens: int
    total_completion_tokens: int
    page_split_duration_ms: int | None
    reading_duration_ms: int | None
    gc_duration_ms: int | None
    json_serialize_duration_ms: int | None
    rasterize_duration_ms: int | None
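The per-document counters make it straightforward to total LLM usage across a batch. In this sketch, summarize_usage and UsageTotals are hypothetical names, and SimpleNamespace objects with made-up numbers stand in for real ReadDocumentResult instances.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Iterable


@dataclass
class UsageTotals:
    documents: int = 0
    pages: int = 0
    llm_requests: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0


def summarize_usage(results: Iterable[Any]) -> UsageTotals:
    """Sum page and token counters over objects shaped like ReadDocumentResult."""
    totals = UsageTotals()
    for result in results:
        totals.documents += 1
        totals.pages += result.page_count
        totals.llm_requests += result.total_llm_requests
        totals.prompt_tokens += result.total_prompt_tokens
        totals.completion_tokens += result.total_completion_tokens
    return totals


# Stand-ins for ReadDocumentResult objects; the numbers are made up.
fake_results = [
    SimpleNamespace(page_count=42, total_llm_requests=5,
                    total_prompt_tokens=3200, total_completion_tokens=800),
    SimpleNamespace(page_count=3, total_llm_requests=0,
                    total_prompt_tokens=0, total_completion_tokens=0),
]
totals = summarize_usage(fake_results)
print(totals)
```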

S3Service / S3Credentials

@dataclass
class S3Credentials:
    access_key_id: str
    secret_access_key: str
    region_name: str
    endpoint_url: str
    bucket_name: str

class S3Service:
    def __init__(self, credentials: S3Credentials)
    async def download_file(self, s3_path: str, local_path: str) -> None
    async def upload_file(self, local_path: str, s3_path: str) -> None
    async def upload_content(self, s3_path: str, content: bytes) -> None

CLI

# Single file
donkit-read-engine report.pdf

# Directory (recursive)
donkit-read-engine ./documents/

# With OCR settings (unstructured backend)
donkit-read-engine scan.pdf --pdf-strategy hi_res --ocr-lang rus+eng

| Argument | Values | Default |
|---|---|---|
| file_path | file or directory | |
| --output-type | text, json, markdown | json |
| --pdf-strategy | fast, hi_res, ocr_only, auto | |
| --ocr-lang | e.g. rus+eng | |

The CLI uses the docling pipeline (no LLM).


Environment Variables

| Variable | Default | Description |
|---|---|---|
| UNSTRUCTURED_STRATEGY | hi_res | PDF OCR strategy for the unstructured backend |
| UNSTRUCTURED_OCR_LANG | rus+eng | OCR language codes |

LLM credentials are not read from environment. Pass llm_model explicitly.
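The two variables above resolve to their listed defaults when unset. As a sketch of that lookup (ocr_settings is a hypothetical helper, and the library's own resolution logic may differ), the effective values can be inspected like this:

```python
import os
from typing import Mapping


def ocr_settings(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the OCR settings implied by the environment, with documented defaults."""
    return {
        "strategy": environ.get("UNSTRUCTURED_STRATEGY", "hi_res"),
        "ocr_lang": environ.get("UNSTRUCTURED_OCR_LANG", "rus+eng"),
    }


# Pass an explicit mapping to see what a given environment would produce.
print(ocr_settings({"UNSTRUCTURED_STRATEGY": "fast"}))
# {'strategy': 'fast', 'ocr_lang': 'rus+eng'}
```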


Dependencies

| Package | Purpose |
|---|---|
| docling | Primary document conversion engine |
| pymupdf | PDF rasterization (llm pipeline) |
| unstructured[pdf] | PDF OCR with Tesseract |
| python-docx, python-pptx | Office format parsing |
| pandas | Excel / CSV processing |
| pillow | Image extraction and encoding |
| donkit-llm | LLM provider abstraction |
| aioboto3 | Async S3 client |
| json-repair | Fix malformed JSON from LLM output |
| loguru | Structured logging |
