Skip to main content

Add your description here

Project description

Tests

LLMAIxv2 Library

LLMAIx is an end‑to‑end toolkit for turning raw documents into structured knowledge with large‑language models. It now features a modular, Pydantic‑validated preprocessing core, richer OCR choices, and a revamped CLI.

Status: the public API is still stabilising; expect small breaking changes before a 2.0 stable release.

✨ Key capabilities

  • Robust preprocessing –extract Markdown or plain text from PDFs, DOCX, TXT and images. The pipeline tries cheap text extraction first (PyMuPDF) and falls back to OCR (OCR‑my‑PDF/PaddleOCR/Surya) only when needed).
  • Layout‑aware enrichment – advanced mode plugs into the Docling pipeline for tables, formulas and picture descriptions, optionally powered by a local or remote vision‑language model.
  • MIME‑aware loading – files (or byte buffers) are classified with python‑magic so even extension‑less uploads are handled correctly.
  • Information extraction – send arbitrary prompts + a Pydantic schema and get back valid JSON, using any OpenAI‑compatible LLM endpoint.
  • CLI utilities – one‑command document conversion (llmaix preprocess) and structured‑info extraction (llmaix extract).
  • Extensible – register new back‑ends (e.g. EPUB) with a single decorator; models and OCR engines can be swapped freely.

🛠 Installation

pip install llmaix          # base
pip install llmaix[docling] # + Docling/VLM extras
pip install llmaix[surya]   # + Surya‑OCR
pip install llmaix[all]     # everything

If you need GPU PaddleOCR:

uv pip install \
  --index-url https://www.paddlepaddle.org.cn/packages/stable/cu129/ \
  paddlepaddle-gpu==3.1.0

PaddleOCR supports 80+ languages out‑of‑the‑box). For MIME detection install libmagic (Linux/macOS) or python-magic-win64 on Windows).

🚀 Quick start

CLI

llmaix preprocess myscan.pdf                 # fast mode, auto‑OCR
llmaix preprocess doc.pdf --mode advanced \
    --enable-picture-description             # Docling + VLM captions
llmaix preprocess scan.pdf --force-ocr \
    --ocr-engine paddleocr -o out.md
llmaix extract -i "Acme Inc. raised $10 M..." # JSON extraction

Python API

from llmaix.preprocess import DocumentPreprocessor

# 1) simple PDF (born digital)
text = DocumentPreprocessor(mode="fast").process("report.pdf")

# 2) scanned PDF with multilingual OCR
proc = DocumentPreprocessor(
    mode="advanced",
    ocr_engine="surya",
    force_ocr=True,
    enable_picture_description=True,
    use_local_vlm=True,
    local_vlm_repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    vlm_prompt="Please describe this document in detail.",
)
markdown = proc.process("scan_no_text.pdf")

Information extraction

from llmaix import extract_info
from pydantic import BaseModel

class LabInfo(BaseModel):
    name: str
    location: str
    lead: str

sentence = (
    "The KatherLab is a research group at TU Dresden led by Prof. Jakob N. Kather."
)
json_out = extract_info(
    prompt=f"Extract lab facts: {sentence}",
    pydantic_model=LabInfo,
    llm_model="gpt-4o-mini"
)
print(json_out.json(indent=2))

⚙️ Back‑end matrix

Task Engine Notes
Text extraction PyMuPDF‑for‑LLM Fast Markdown conversion from PDFs
Docling Layout‑aware; optional VLM captions
OCR OCR‑my‑PDF (Tesseract) Strong PDF/A support
Surya‑OCR Local transformer OCR, 90 + langs
PaddleOCR PP‑Structure Table & formula detection
MIME sniffing python‑magic libmagic signatures
(optional) filetype pure‑Python fallback

🧪 Tests

Clone and run:

git clone https://github.com/KatherLab/LLMAIx-v2.git
cd LLMAIx-v2
uv sync
uv run pytest          # full suite

You can focus on a backend:

uv run pytest tests/test_preprocess.py -k paddleocr

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llmaix-0.0.20.tar.gz (1.7 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llmaix-0.0.20-py3-none-any.whl (27.1 kB view details)

Uploaded Python 3

File details

Details for the file llmaix-0.0.20.tar.gz.

File metadata

  • Download URL: llmaix-0.0.20.tar.gz
  • Upload date:
  • Size: 1.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for llmaix-0.0.20.tar.gz
Algorithm Hash digest
SHA256 13c001632d35469416f948512984646dfede350bda7d05429c5d1e7cdda63dec
MD5 02cd1e42692024136e30184b62dda9f1
BLAKE2b-256 3635529b2bf4c6e2055577a6ef6de761361c75e8347befd50834ec596b57fe68

See more details on using hashes here.

Provenance

The following attestation bundles were made for llmaix-0.0.20.tar.gz:

Publisher: python-publish.yml on KatherLab/llmaixlib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file llmaix-0.0.20-py3-none-any.whl.

File metadata

  • Download URL: llmaix-0.0.20-py3-none-any.whl
  • Upload date:
  • Size: 27.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for llmaix-0.0.20-py3-none-any.whl
Algorithm Hash digest
SHA256 f3a8523e91f1a82c5413915f62f8f9303992da91fb65b6ab615e0187e6f02403
MD5 58eb987ec37a97a6d218f300447ac792
BLAKE2b-256 a3c32e5b9c8ca8665f2ebb3a47b273efcf48d2446551f7344724e43349312283

See more details on using hashes here.

Provenance

The following attestation bundles were made for llmaix-0.0.20-py3-none-any.whl:

Publisher: python-publish.yml on KatherLab/llmaixlib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page