LLMAIxv2 Library

LLMAIx is an end‑to‑end toolkit for turning raw documents into structured knowledge with large‑language models. It now features a modular, Pydantic‑validated preprocessing core, richer OCR choices, and a revamped CLI.

Status: the public API is still stabilising; expect small breaking changes before a 2.0 stable release.

✨ Key capabilities

  • Robust preprocessing – extract Markdown or plain text from PDFs, DOCX, TXT and images. The pipeline tries cheap text extraction first (PyMuPDF) and falls back to OCR (OCR‑my‑PDF/PaddleOCR/Surya) only when needed.
  • Layout‑aware enrichment – advanced mode plugs into the Docling pipeline for tables, formulas and picture descriptions, optionally powered by a local or remote vision‑language model.
  • MIME‑aware loading – files (or byte buffers) are classified with python‑magic so even extension‑less uploads are handled correctly.
  • Information extraction – send arbitrary prompts + a Pydantic schema and get back valid JSON, using any OpenAI‑compatible LLM endpoint.
  • CLI utilities – one‑command document conversion (llmaix preprocess) and structured‑info extraction (llmaix extract).
  • Extensible – register new back‑ends (e.g. EPUB) with a single decorator; models and OCR engines can be swapped freely.
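The "cheap extraction first, OCR only when needed" behaviour from the first bullet can be sketched roughly as below. This is a minimal illustration, not the library's implementation: the function name `extract_with_fallback`, the two callables, and the `min_chars` threshold are all hypothetical, and the heuristic LLMAIx actually uses to decide that OCR is needed is not described here.

```python
from typing import Callable, Tuple


def extract_with_fallback(
    path: str,
    extract_text: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 50,  # assumed threshold, chosen only for illustration
) -> Tuple[str, str]:
    """Return (text, source), where source is "text-layer" or "ocr"."""
    # Step 1: try the cheap path (e.g. reading an embedded PDF text layer).
    text = extract_text(path)
    if len(text.strip()) >= min_chars:
        return text, "text-layer"
    # Step 2: too little text was found, so pay for OCR.
    return run_ocr(path), "ocr"
```

Passing the two extractors in as callables keeps the sketch independent of any particular OCR engine, which mirrors the idea that engines can be swapped.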

🛠 Installation

pip install llmaix            # base
pip install "llmaix[docling]" # + Docling/VLM extras
pip install "llmaix[surya]"   # + Surya‑OCR
pip install "llmaix[all]"     # everything (quotes keep shells like zsh from globbing the brackets)

If you need GPU PaddleOCR:

uv pip install \
  --index-url https://www.paddlepaddle.org.cn/packages/stable/cu129/ \
  paddlepaddle-gpu==3.1.0

PaddleOCR supports 80+ languages out‑of‑the‑box. For MIME detection, install libmagic (Linux/macOS) or python-magic-win64 (Windows).
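Because most back‑ends are optional, it can help to check which ones are importable before choosing an engine. The snippet below is a small sketch using only the standard library. The import names in `OPTIONAL_MODULES` are assumptions about what each package exposes, and `available_modules` is a hypothetical helper, not part of llmaix.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Assumed top-level import names for the optional back-ends.
OPTIONAL_MODULES = ("magic", "paddleocr", "surya", "docling")


def available_modules(names: Iterable[str] = OPTIONAL_MODULES) -> Dict[str, bool]:
    """Map each module name to True if it can be located, without importing it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in available_modules().items():
        print(f"{name:10s} {'installed' if present else 'missing'}")
```

A located `magic` module only shows that the Python wrapper is installed; the native libmagic library may still be missing.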

🚀 Quick start

CLI

llmaix preprocess myscan.pdf                 # fast mode, auto‑OCR
llmaix preprocess doc.pdf --mode advanced \
    --enable-picture-description             # Docling + VLM captions
llmaix preprocess scan.pdf --force-ocr \
    --ocr-engine paddleocr -o out.md
llmaix extract -i 'Acme Inc. raised $10 M...'  # JSON extraction (single quotes stop the shell expanding $10)

Python API

from llmaix.preprocess import DocumentPreprocessor

# 1) simple PDF (born digital)
text = DocumentPreprocessor(mode="fast").process("report.pdf")

# 2) scanned PDF with multilingual OCR
proc = DocumentPreprocessor(
    mode="advanced",
    ocr_engine="surya",
    force_ocr=True,
    enable_picture_description=True,
    use_local_vlm=True,
    local_vlm_repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    vlm_prompt="Please describe this document in detail.",
)
markdown = proc.process("scan_no_text.pdf")
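To convert a whole folder, the `process` call shown above can be wrapped in a small loop. The helper below is a sketch: `convert_folder` is a hypothetical name, and it assumes only what the examples above show, namely that `process` accepts a file path and returns the extracted text.

```python
from pathlib import Path
from typing import Callable, List


def convert_folder(
    src: Path,
    dst: Path,
    process: Callable[[str], str],
    pattern: str = "*.pdf",
) -> List[Path]:
    """Run `process` on every matching file in `src`; write one .md file per input."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(src.glob(pattern)):
        out = dst / (doc.stem + ".md")
        out.write_text(process(str(doc)), encoding="utf-8")
        written.append(out)
    return written


# Possible usage (requires llmaix to be installed):
#   from llmaix.preprocess import DocumentPreprocessor
#   convert_folder(Path("scans"), Path("out"), DocumentPreprocessor(mode="fast").process)
```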

Information extraction

from llmaix import extract_info
from pydantic import BaseModel

class LabInfo(BaseModel):
    name: str
    location: str
    lead: str

sentence = (
    "The KatherLab is a research group at TU Dresden led by Prof. Jakob N. Kather."
)
json_out = extract_info(
    prompt=f"Extract lab facts: {sentence}",
    pydantic_model=LabInfo,
    llm_model="gpt-4o-mini"
)
print(json_out.json(indent=2))
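The guarantee of schema‑valid JSON comes down to checking the model's reply against the Pydantic schema. The sketch below shows that check in isolation, with no LLM call. `parse_llm_reply` is a hypothetical helper, and how `extract_info` handles invalid replies internally is not specified here.

```python
import json
from typing import Optional

from pydantic import BaseModel, ValidationError


class LabInfo(BaseModel):
    name: str
    location: str
    lead: str


def parse_llm_reply(raw: str) -> Optional[LabInfo]:
    """Return a validated LabInfo, or None if the reply is not usable."""
    try:
        # json.loads rejects malformed JSON; LabInfo(...) rejects missing or mistyped fields.
        return LabInfo(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError, TypeError):
        # TypeError covers valid JSON that is not an object (e.g. a list).
        return None


reply = '{"name": "KatherLab", "location": "TU Dresden", "lead": "Prof. Jakob N. Kather"}'
info = parse_llm_reply(reply)
print(info.lead if info else "invalid reply")
```

Returning `None` rather than raising lets the caller decide whether to retry the prompt or skip the document.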

⚙️ Back‑end matrix

| Task            | Engine                 | Notes                                |
|-----------------|------------------------|--------------------------------------|
| Text extraction | PyMuPDF‑for‑LLM        | Fast Markdown conversion from PDFs   |
|                 | Docling                | Layout‑aware; optional VLM captions  |
| OCR             | OCR‑my‑PDF (Tesseract) | Strong PDF/A support                 |
|                 | Surya‑OCR              | Local transformer OCR, 90+ languages |
|                 | PaddleOCR PP‑Structure | Table & formula detection            |
| MIME sniffing   | python‑magic           | libmagic signatures                  |
|                 | (optional) filetype    | Pure‑Python fallback                 |
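Content‑based MIME sniffing inspects a file's leading bytes rather than its extension. The sketch below illustrates the idea with the standard library only. It is a stand‑in for explanation, not the python‑magic or filetype implementation, and the `sniff_mime` helper and its short signature list are assumptions covering just the formats named in this README.

```python
import io
import zipfile

# (leading bytes, MIME type) pairs for a few well-known formats.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from raw bytes; the file name is never consulted."""
    for magic_bytes, mime in _SIGNATURES:
        if data.startswith(magic_bytes):
            return mime
    if data.startswith(b"PK\x03\x04"):  # ZIP container, possibly DOCX
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                if "word/document.xml" in zf.namelist():
                    return DOCX
        except zipfile.BadZipFile:
            pass
        return "application/zip"
    try:
        data.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"
```

libmagic ships a far larger signature database, which is why it is the default choice in the table.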

🧪 Tests

Clone and run:

git clone https://github.com/KatherLab/LLMAIx-v2.git
cd LLMAIx-v2
uv sync
uv run pytest          # full suite

You can focus on a backend:

uv run pytest tests/test_preprocess.py -k paddleocr
