LLMAIxv2 Library

LLMAIx is an end‑to‑end toolkit for turning raw documents into structured knowledge with large language models. It now features a modular, Pydantic‑validated preprocessing core, richer OCR choices, and a revamped CLI.

Status: the public API is still stabilising; expect small breaking changes before a 2.0 stable release.

✨ Key capabilities

  • Robust preprocessing – extract Markdown or plain text from PDFs, DOCX, TXT and images. The pipeline tries cheap text extraction first (PyMuPDF) and falls back to OCR (OCR‑my‑PDF/PaddleOCR/Surya) only when needed.
  • Layout‑aware enrichment – advanced mode plugs into the Docling pipeline for tables, formulas and picture descriptions, optionally powered by a local or remote vision‑language model.
  • MIME‑aware loading – files (or byte buffers) are classified with python‑magic so even extension‑less uploads are handled correctly.
  • Information extraction – send arbitrary prompts + a Pydantic schema and get back valid JSON, using any OpenAI‑compatible LLM endpoint.
  • CLI utilities – one‑command document conversion (llmaix preprocess) and structured‑info extraction (llmaix extract).
  • Extensible – register new back‑ends (e.g. EPUB) with a single decorator; models and OCR engines can be swapped freely.
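
The text‑first, OCR‑second strategy from the first bullet can be illustrated with a small, library‑independent sketch. Everything here is an assumption made for illustration: `extract_with_fallback`, the two stub extractors and the `min_chars` threshold are hypothetical names, and LLMAIx's real decision logic may differ.

```python
from typing import Callable

# Hypothetical extractor type: takes a file path and returns the text it found
# (an empty string when the file has no usable text layer).
Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str,
    cheap: Extractor,
    ocr: Extractor,
    min_chars: int = 20,  # assumed threshold for "enough text to skip OCR"
) -> str:
    """Try the cheap extractor first; run OCR only if it yields too little text."""
    text = cheap(path)
    if len(text.strip()) >= min_chars:
        return text  # born-digital file, no OCR needed
    return ocr(path)  # scanned file, pay for OCR


# Stub extractors standing in for a real text-layer reader and OCR engine.
def fake_text_layer(path: str) -> str:
    if path.startswith("scan"):
        return ""  # scans carry no embedded text
    return "Quarterly report: revenue grew by 12 percent."


def fake_ocr(path: str) -> str:
    return "Text recovered by OCR from " + path


if __name__ == "__main__":
    print(extract_with_fallback("report.pdf", fake_text_layer, fake_ocr))
    print(extract_with_fallback("scan.pdf", fake_text_layer, fake_ocr))
```

The point of the ordering is cost: reading an embedded text layer is typically far cheaper than OCR, so OCR runs only for files where the cheap path comes back nearly empty.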

🛠 Installation

pip install llmaix            # base
pip install "llmaix[docling]" # + Docling/VLM extras
pip install "llmaix[surya]"   # + Surya‑OCR
pip install "llmaix[all]"     # everything

If you need GPU PaddleOCR:

uv pip install \
  --index-url https://www.paddlepaddle.org.cn/packages/stable/cu129/ \
  paddlepaddle-gpu==3.1.0

PaddleOCR supports 80+ languages out‑of‑the‑box. For MIME detection, install libmagic (Linux/macOS) or python-magic-win64 (Windows).
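
MIME detection of this kind is content‑based: the first bytes of a file are compared against known signatures, which is why a native libmagic database is required. The sketch below is a deliberately tiny pure‑Python illustration of the idea. It is not python‑magic's API and not LLMAIx's loader; `sniff_mime` and its signature table are hypothetical.

```python
# A few well-known file signatures ("magic numbers"), checked in order.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # DOCX files are ZIP containers
]


def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from the leading bytes, ignoring any file name."""
    for prefix, mime in _SIGNATURES:
        if data.startswith(prefix):
            return mime
    # No signature matched: treat cleanly decodable bytes as plain text.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"


if __name__ == "__main__":
    print(sniff_mime(b"%PDF-1.7\n"))          # header of an extension-less PDF upload
    print(sniff_mime(b"plain notes"))         # no signature, valid UTF-8
    print(sniff_mime(b"\xff\xfe\x00\x01"))    # unknown binary data
```

A real signature database covers far more formats and looks deeper than a fixed prefix (for example, to tell a DOCX from a generic ZIP archive), so this should be read only as an explanation of why extension‑less uploads can still be classified.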

🚀 Quick start

CLI

llmaix preprocess myscan.pdf                 # fast mode, auto‑OCR
llmaix preprocess doc.pdf --mode advanced \
    --enable-picture-description             # Docling + VLM captions
llmaix preprocess scan.pdf --force-ocr \
    --ocr-engine paddleocr -o out.md
llmaix extract -i 'Acme Inc. raised $10 M...' # JSON extraction (single quotes keep the shell from expanding $10)
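
For batch conversion, the same commands can be assembled from Python. The helper below is a hypothetical sketch (`build_preprocess_cmd` is not part of LLMAIx); it only builds the argument list, using the `preprocess` subcommand and the `--force-ocr`, `--ocr-engine` and `-o` flags shown above, and leaves execution to the caller.

```python
from pathlib import Path
from typing import List, Optional


def build_preprocess_cmd(
    pdf: Path,
    out_dir: Path,
    force_ocr: bool = False,
    ocr_engine: Optional[str] = None,
) -> List[str]:
    """Build the argv list for one `llmaix preprocess` call (nothing is executed)."""
    cmd = ["llmaix", "preprocess", str(pdf)]
    if force_ocr:
        cmd.append("--force-ocr")
    if ocr_engine:
        cmd += ["--ocr-engine", ocr_engine]
    # Write one Markdown file per input, named after the input file.
    cmd += ["-o", str(out_dir / (pdf.stem + ".md"))]
    return cmd


if __name__ == "__main__":
    for pdf in sorted(Path("scans").glob("*.pdf")):
        print(build_preprocess_cmd(pdf, Path("out"), force_ocr=True, ocr_engine="paddleocr"))
```

Passing each list to `subprocess.run(cmd, check=True)` would run the conversions one by one; because the arguments are passed as a list, file names containing spaces need no extra shell quoting.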

Python API

from llmaix.preprocess import DocumentPreprocessor

# 1) simple PDF (born digital)
text = DocumentPreprocessor(mode="fast").process("report.pdf")

# 2) scanned PDF with multilingual OCR
proc = DocumentPreprocessor(
    mode="advanced",
    ocr_engine="surya",
    force_ocr=True,
    enable_picture_description=True,
    use_local_vlm=True,
    local_vlm_repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    vlm_prompt="Please describe this document in detail.",
)
markdown = proc.process("scan_no_text.pdf")

Information extraction

from llmaix import extract_info
from pydantic import BaseModel

class LabInfo(BaseModel):
    name: str
    location: str
    lead: str

sentence = (
    "The KatherLab is a research group at TU Dresden led by Prof. Jakob N. Kather."
)
json_out = extract_info(
    prompt=f"Extract lab facts: {sentence}",
    pydantic_model=LabInfo,
    llm_model="gpt-4o-mini"
)
print(json_out.json(indent=2))
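
What "valid JSON" means here can be shown without any LLM call: a raw reply is only accepted if it parses and matches the schema. The sketch below assumes Pydantic v2 (`model_validate_json`) and says nothing about how `extract_info` works internally; `parse_lab_info` and the two sample replies are hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class LabInfo(BaseModel):
    name: str
    location: str
    lead: str


def parse_lab_info(raw_json: str) -> Optional[LabInfo]:
    """Return a validated LabInfo, or None if the reply does not fit the schema."""
    try:
        return LabInfo.model_validate_json(raw_json)
    except ValidationError:
        return None


# Hand-written stand-ins for model replies.
good = '{"name": "KatherLab", "location": "TU Dresden", "lead": "Prof. Jakob N. Kather"}'
bad = '{"name": "KatherLab"}'  # missing required fields

if __name__ == "__main__":
    print(parse_lab_info(good))
    print(parse_lab_info(bad))
```

Rejecting a malformed reply at this boundary means downstream code can rely on every field being present and correctly typed.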

⚙️ Back‑end matrix

Task | Engine | Notes
Text extraction | PyMuPDF‑for‑LLM | Fast Markdown conversion from PDFs
 | Docling | Layout‑aware; optional VLM captions
OCR | OCR‑my‑PDF (Tesseract) | Strong PDF/A support
 | Surya‑OCR | Local transformer OCR, 90+ langs
 | PaddleOCR PP‑Structure | Table & formula detection
MIME sniffing | python‑magic | libmagic signatures
 | (optional) filetype | pure‑Python fallback

🧪 Tests

Clone and run:

git clone https://github.com/KatherLab/LLMAIx-v2.git
cd LLMAIx-v2
uv sync
uv run pytest          # full suite

You can focus on a backend:

uv run pytest tests/test_preprocess.py -k paddleocr
