Document loading helpers for Donkit RagOps

donkit-read-engine

Document extraction library for Donkit RagOps. Reads 20+ file formats and produces structured, page-level JSON — with tables as Markdown, headings preserved, and optional LLM-powered image descriptions. Images extracted from documents can be saved locally or uploaded to S3 for use in downstream retrieval.

PyPI: pip install donkit-read-engine
Python: 3.12 – 3.13
License: MIT


Features

  • Extracts text, tables, headings, captions, code blocks from PDF, DOCX, PPTX, XLSX, HTML, Markdown, LaTeX, images, and more
  • Three processing pipelines: Docling-only, Docling + LLM, or pure LLM vision
  • Extracts and saves document images (PNG) with page association — ready for multimodal retrieval
  • S3 support: uploads JSON result and images to S3 with a single s3_output_prefix parameter
  • Fully async (aread_document) with sync wrapper (read_document)
  • LLM token usage tracking per document

Installation

pip install donkit-read-engine

Quick Start

Text only (no LLM)

from donkit.read_engine.read_engine import DonkitReader

reader = DonkitReader(reading_pipeline="docling")
result = reader.read_document("report.pdf")

print(result.output_path)   # ./processed/report.json
print(result.page_count)    # 42

With LLM image descriptions + save images locally

from donkit.llm import ModelFactory
from donkit.read_engine.read_engine import DonkitReader

llm_model = ModelFactory.create_model("openai", "gpt-4o-mini", {"api_key": "sk-..."})

reader = DonkitReader(
    llm_model=llm_model,
    reading_pipeline="docling_llm",  # default
)

result = reader.read_document("report.pdf", output_dir="./output")
# ./output/report.json
# ./output/images_report/page0001_img0000.png
# ./output/images_report/page0003_img0001.png

print(result.images_dir)              # ./output/images_report
print(result.total_llm_requests)      # 5
print(result.total_prompt_tokens)     # 3200

Async

result = await reader.aread_document("report.pdf", output_dir="./output")
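
Because aread_document is a coroutine, several documents can be read in one event loop. The following is a minimal sketch, not part of the library: read_many and max_concurrency are hypothetical names, and it assumes that a single reader instance can serve overlapping calls, which this README does not state.

```python
import asyncio


async def read_many(reader, paths, output_dir="./output", max_concurrency=4):
    """Read several documents concurrently with a bounded number in flight.

    `reader` is any object exposing `aread_document(path, output_dir=...)`,
    such as a DonkitReader. Sharing one reader across overlapping calls is
    an assumption of this sketch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def read_one(path):
        async with semaphore:
            return await reader.aread_document(path, output_dir=output_dir)

    # gather preserves input order, so results[i] corresponds to paths[i]
    return await asyncio.gather(*(read_one(p) for p in paths))
```

From synchronous code this could be driven with asyncio.run(read_many(reader, ["report.pdf", "slides.pptx"])). The semaphore is there because the docling_llm pipeline already issues concurrent LLM calls per document, so an unbounded fan-out may hit provider rate limits.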

With S3 output

Pass s3_service at construction and s3_output_prefix per call. The JSON result and all extracted images are uploaded; the local staging directory is cleaned up automatically.

from donkit.read_engine.utils.s3 import S3Credentials, S3Service
from donkit.read_engine.read_engine import DonkitReader

s3 = S3Service(S3Credentials(
    access_key_id="...",
    secret_access_key="...",
    region_name="us-east-1",
    endpoint_url="https://s3.amazonaws.com",
    bucket_name="my-bucket",
))

reader = DonkitReader(
    llm_model=llm_model,  # created as in the LLM example above
    reading_pipeline="docling_llm",
    s3_service=s3,
)

result = await reader.aread_document(
    "report.pdf",
    s3_output_prefix="experiments/exp123/reading",
)
# result.output_path  = "experiments/exp123/reading/report.json"   (S3 key)
# result.images_dir   = "experiments/exp123/reading/images_report" (S3 prefix)

S3 layout:

experiments/exp123/reading/
  report.json
  images_report/
    page0001_img0000.png
    page0003_img0001.png

Note: when S3 is used, result.output_path and result.images_dir contain S3 keys, not local paths.
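
Since the result fields hold bare keys, other tools usually need them combined with the bucket name. A small sketch of that step follows; s3_uri is a hypothetical helper, not a library function, and the bucket name is assumed to be the one passed to S3Credentials.

```python
def s3_uri(bucket_name: str, key: str) -> str:
    """Build an s3:// URI from a bucket name and an object key or prefix."""
    return f"s3://{bucket_name}/{key.lstrip('/')}"


# Values copied from the example above; in real code they would come from
# S3Credentials.bucket_name and the ReadDocumentResult fields.
json_uri = s3_uri("my-bucket", "experiments/exp123/reading/report.json")
images_uri = s3_uri("my-bucket", "experiments/exp123/reading/images_report")
```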


Pipelines

| Pipeline | Description | Requires llm_model | Saves images |
|---|---|---|---|
| docling_llm | Docling parses the document; LLM describes each image (up to 15 concurrent calls). Images saved to disk / S3. | Yes | Yes |
| docling | Docling only; built-in VLM describes images. No images saved to disk. | No | No |
| llm | PDF pages rasterized and sent entirely to LLM vision. Non-PDF formats via Docling. No images saved. | Yes | No |

Image saving (images_dir) is only supported in the docling_llm pipeline.
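
The pipeline requirements can be checked before a reader is constructed. The sketch below only transcribes the pipeline descriptions above into a lookup; PipelineInfo, PIPELINES, and check_pipeline are hypothetical names, and the library may perform its own validation differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineInfo:
    requires_llm: bool
    saves_images: bool


# Transcribed from the pipeline descriptions in this README.
PIPELINES = {
    "docling_llm": PipelineInfo(requires_llm=True, saves_images=True),
    "docling": PipelineInfo(requires_llm=False, saves_images=False),
    "llm": PipelineInfo(requires_llm=True, saves_images=False),
}


def check_pipeline(name: str, has_llm_model: bool) -> PipelineInfo:
    """Validate a pipeline choice and return what it needs and produces."""
    try:
        info = PIPELINES[name]
    except KeyError:
        raise ValueError(
            f"unknown pipeline {name!r}; expected one of {sorted(PIPELINES)}"
        ) from None
    if info.requires_llm and not has_llm_model:
        raise ValueError(f"pipeline {name!r} requires llm_model")
    return info
```

A caller that needs saved images can then test check_pipeline(name, has_llm_model).saves_images before relying on result.images_dir.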


Supported Formats

Via Docling (primary engine)

| Extension | Format |
|---|---|
| .pdf | PDF |
| .docx | Word |
| .pptx | PowerPoint |
| .xlsx, .csv | Excel / CSV |
| .html | HTML |
| .md, .txt | Markdown / plain text |
| .tex | LaTeX |
| .asciidoc | AsciiDoc |
| .vtt | WebVTT subtitles |
| .png, .jpg, .jpeg, .gif, .webp, .tiff | Images |

Legacy readers

| Extension | Format |
|---|---|
| .json | JSON documents |
| .xls | Legacy Excel (pre-2007) |
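
When feeding a mixed directory to the reader, it can help to pre-filter by extension. This is a sketch built only from the extensions listed above; SUPPORTED_EXTENSIONS, is_supported, and supported_files are hypothetical names, and the case-insensitive match is an assumption (the library's own check may differ).

```python
from pathlib import Path

# Extensions listed above (Docling formats plus the legacy readers).
SUPPORTED_EXTENSIONS = frozenset({
    ".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".html", ".md", ".txt",
    ".tex", ".asciidoc", ".vtt",
    ".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff",
    ".json", ".xls",
})


def is_supported(path: str) -> bool:
    """Return True if the file extension is one of those listed above."""
    # Lower-casing the suffix is an assumption of this sketch.
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def supported_files(root: str) -> list[Path]:
    """Recursively collect files under `root` with a listed extension."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and is_supported(p.name)
    )
```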

Output Format

Every pipeline writes a single JSON file:

{
  "content": [
    {
      "page": 1,
      "type": "Text",
      "content": "# Title\n\nParagraph text...\n\n| Col1 | Col2 |\n|---|---|\n| A | B |",
      "images": [
        "./output/images_report/page0001_img0000.png"
      ]
    },
    {
      "page": 2,
      "type": "Text",
      "content": "More text on page two."
    }
  ]
}

  • page: 1-based page number from the source document
  • content: Markdown-formatted text (headings, tables, lists, code blocks, captions)
  • images: list of saved image paths (local paths or S3 keys); only present when images were extracted

Page headers, footers, and decorative images (those the LLM detects and answers with SKIP) are excluded from content, but the image files themselves are still saved.
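
Downstream code typically loads this file and walks the entries. The sketch below assumes only the fields shown in the example; load_pages is a hypothetical helper that normalizes the optional images key to an empty list.

```python
import json


def load_pages(json_text: str) -> list[dict]:
    """Parse reader output and return one dict per entry in `content`."""
    document = json.loads(json_text)
    pages = []
    for entry in document["content"]:
        pages.append({
            "page": entry["page"],
            "text": entry["content"],
            # `images` is omitted when nothing was extracted for the entry
            "images": entry.get("images", []),
        })
    return pages
```

With a local result this could be called as load_pages(open(result.output_path, encoding="utf-8").read()); when S3 is used, output_path is a key and the file has to be downloaded first.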


API Reference

DonkitReader

class DonkitReader:
    def __init__(
        self,
        output_format: Literal["json", "text", "md"] = "json",
        progress_callback: Callable[[int, int, str | None], None] | None = None,
        llm_model: LLMModelAbstract | None = None,
        reading_pipeline: str = "docling_llm",
        s3_service: S3Service | None = None,
    ) -> None

| Parameter | Type | Default | Description |
|---|---|---|---|
| output_format | "json" / "text" / "md" | "json" | Format hint passed to the image analysis service |
| progress_callback | Callable | None | Called with (current, total, message), once per page in the llm pipeline |
| llm_model | LLMModelAbstract | None | LLM for image descriptions. Required for docling_llm and llm |
| reading_pipeline | str | "docling_llm" | "docling_llm", "docling", or "llm" |
| s3_service | S3Service | None | If set, results are uploaded to S3 when s3_output_prefix is provided |

def read_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult

async def aread_document(
    self,
    file_path: str,
    output_dir: str | None = None,
    s3_output_prefix: str | None = None,
) -> ReadDocumentResult

| Parameter | Description |
|---|---|
| file_path | Path to the source document |
| output_dir | Local directory for results. Defaults to processed/ next to the source file. When S3 is used and output_dir is None, a temporary directory is created and cleaned up automatically |
| s3_output_prefix | S3 key prefix for all output artifacts (JSON + images). Requires s3_service set at construction |

ReadDocumentResult

@dataclass
class ReadDocumentResult:
    output_path: str                           # Local path to JSON, or S3 key when S3 is used
    page_count: int                            # Number of unique pages
    images_dir: str | None                     # Local images dir, or S3 prefix, or None
    total_llm_requests: int
    total_prompt_tokens: int
    total_completion_tokens: int
    page_split_duration_ms: int | None
    reading_duration_ms: int | None
    gc_duration_ms: int | None
    json_serialize_duration_ms: int | None
    rasterize_duration_ms: int | None
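
The counters and timings on the result can be folded into a compact summary for logging. This is a sketch that reads only the fields declared above; summarize_usage is a hypothetical helper, and it skips timings that are None.

```python
TIMING_FIELDS = (
    "page_split_duration_ms",
    "reading_duration_ms",
    "gc_duration_ms",
    "json_serialize_duration_ms",
    "rasterize_duration_ms",
)


def summarize_usage(result) -> dict:
    """Collect token counts and whichever timings were actually recorded."""
    timings = {
        name: getattr(result, name)
        for name in TIMING_FIELDS
        if getattr(result, name) is not None
    }
    return {
        "llm_requests": result.total_llm_requests,
        "total_tokens": result.total_prompt_tokens + result.total_completion_tokens,
        "timings_ms": timings,
    }
```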

S3Service / S3Credentials

@dataclass
class S3Credentials:
    access_key_id: str
    secret_access_key: str
    region_name: str
    endpoint_url: str
    bucket_name: str

class S3Service:
    def __init__(self, credentials: S3Credentials)
    async def download_file(self, s3_path: str, local_path: str) -> None
    async def upload_file(self, local_path: str, s3_path: str) -> None
    async def upload_content(self, s3_path: str, content: bytes) -> None
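
S3Service can also be used directly, for instance to write a small index next to a batch of results. The sketch below relies only on the upload_content(s3_path, content) signature shown above; upload_manifest and the manifest.json name are hypothetical, and it assumes s3_path takes a bucket-relative key like the ones in result.output_path.

```python
import json


async def upload_manifest(s3, prefix: str, results) -> str:
    """Upload a JSON index of output locations under `prefix`; return its key."""
    manifest = [
        {
            "output_path": r.output_path,
            "images_dir": r.images_dir,
            "page_count": r.page_count,
        }
        for r in results
    ]
    # Assumption: s3_path is a key relative to the configured bucket.
    key = f"{prefix.rstrip('/')}/manifest.json"
    await s3.upload_content(key, json.dumps(manifest, indent=2).encode("utf-8"))
    return key
```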

CLI

# Single file
donkit-read-engine report.pdf

# Directory (recursive)
donkit-read-engine ./documents/

# With OCR settings (unstructured backend)
donkit-read-engine scan.pdf --pdf-strategy hi_res --ocr-lang rus+eng

| Argument | Values | Default |
|---|---|---|
| file_path | file or directory | |
| --output-type | text, json, markdown | json |
| --pdf-strategy | fast, hi_res, ocr_only, auto | |
| --ocr-lang | e.g. rus+eng | |

The CLI uses the docling pipeline (no LLM).
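
The CLI can also be driven from a script. The sketch below builds the argument list from the flags documented above and nothing else; build_cli_args and run_cli are hypothetical helpers, and run_cli assumes the donkit-read-engine executable is on PATH.

```python
import subprocess


def build_cli_args(path, output_type=None, pdf_strategy=None, ocr_lang=None):
    """Assemble a donkit-read-engine command line from optional settings."""
    args = ["donkit-read-engine", str(path)]
    if output_type is not None:
        args += ["--output-type", output_type]
    if pdf_strategy is not None:
        args += ["--pdf-strategy", pdf_strategy]
    if ocr_lang is not None:
        args += ["--ocr-lang", ocr_lang]
    return args


def run_cli(path, **options):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_cli_args(path, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.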


Environment Variables

| Variable | Default | Description |
|---|---|---|
| UNSTRUCTURED_STRATEGY | hi_res | PDF OCR strategy for the unstructured backend |
| UNSTRUCTURED_OCR_LANG | rus+eng | OCR language codes |

LLM credentials are not read from the environment. Pass llm_model explicitly.
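
To see which OCR settings would apply in a given environment, the documented defaults can be mirrored in a few lines. This is an illustration only: ocr_settings is a hypothetical helper, and the library's actual lookup (including when it reads the variables) may differ.

```python
import os


def ocr_settings(environ=None) -> dict:
    """Resolve the two OCR variables, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    return {
        "strategy": env.get("UNSTRUCTURED_STRATEGY", "hi_res"),
        "ocr_lang": env.get("UNSTRUCTURED_OCR_LANG", "rus+eng"),
    }
```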


Dependencies

| Package | Purpose |
|---|---|
| docling | Primary document conversion engine |
| pymupdf | PDF rasterization (llm pipeline) |
| unstructured[pdf] | PDF OCR with Tesseract |
| python-docx, python-pptx | Office format parsing |
| pandas | Excel / CSV processing |
| pillow | Image extraction and encoding |
| donkit-llm | LLM provider abstraction |
| aioboto3 | Async S3 client |
| json-repair | Fix malformed JSON from LLM output |
| loguru | Structured logging |
