LLMAIxv2 Library

LLMAIx is an end‑to‑end toolkit for turning raw documents into structured knowledge with large‑language models. It now features a modular, Pydantic‑validated preprocessing core, richer OCR choices, and a revamped CLI.

Status: the public API is still stabilising; expect small breaking changes before a 2.0 stable release.

✨ Key capabilities

  • Robust preprocessing – extract Markdown or plain text from PDFs, DOCX, TXT and images. The pipeline tries cheap text extraction first (PyMuPDF) and falls back to OCR (OCR‑my‑PDF/PaddleOCR/Surya) only when needed.
  • Layout‑aware enrichment – advanced mode plugs into the Docling pipeline for tables, formulas and picture descriptions, optionally powered by a local or remote vision‑language model.
  • MIME‑aware loading – files (or byte buffers) are classified with python‑magic so even extension‑less uploads are handled correctly.
  • Information extraction – send arbitrary prompts + a Pydantic schema and get back valid JSON, using any OpenAI‑compatible LLM endpoint.
  • CLI utilities – one‑command document conversion (llmaix preprocess) and structured‑info extraction (llmaix extract).
  • Extensible – register new back‑ends (e.g. EPUB) with a single decorator; models and OCR engines can be swapped freely.

🛠 Installation

pip install llmaix          # base
pip install llmaix[docling] # + Docling/VLM extras
pip install llmaix[surya]   # + Surya‑OCR
pip install llmaix[all]     # everything

If you need GPU PaddleOCR:

uv pip install \
  --index-url https://www.paddlepaddle.org.cn/packages/stable/cu129/ \
  paddlepaddle-gpu==3.1.0

PaddleOCR supports 80+ languages out‑of‑the‑box. For MIME detection, install libmagic (Linux/macOS) or python-magic-win64 (Windows).
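To show why libmagic matters, here is a small sketch of content-based MIME detection that degrades gracefully when libmagic is missing. It is an illustration under stated assumptions, not the loader LLMAIx ships: `sniff_mime` and `_SIGNATURES` are hypothetical names, and the only third-party call used is python-magic's `magic.from_buffer(data, mime=True)`.

```python
import mimetypes
from typing import Optional

try:
    import magic  # python-magic; requires the libmagic shared library
    if not hasattr(magic, "from_buffer"):
        magic = None  # a different package that also installs a `magic` module
except (ImportError, OSError):
    magic = None

# A few well-known file signatures, consulted only when libmagic is unavailable.
_SIGNATURES = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
)


def sniff_mime(data: bytes, filename: Optional[str] = None) -> str:
    """Guess a MIME type from the content first and from the file name last."""
    if magic is not None:
        return magic.from_buffer(data, mime=True)
    for prefix, mime in _SIGNATURES:
        if data.startswith(prefix):
            return mime
    if filename:
        guessed, _ = mimetypes.guess_type(filename)
        if guessed:
            return guessed
    return "application/octet-stream"
```

Because the bytes are inspected before the name, an upload without an extension (or with a misleading one) is still classified by what it actually contains.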

🚀 Quick start

CLI

llmaix preprocess myscan.pdf                 # fast mode, auto‑OCR
llmaix preprocess doc.pdf --mode advanced \
    --enable-picture-description             # Docling + VLM captions
llmaix preprocess scan.pdf --force-ocr \
    --ocr-engine paddleocr -o out.md
llmaix extract -i 'Acme Inc. raised $10 M...' # JSON extraction (single quotes keep the shell from expanding $1)

Python API

from llmaix.preprocess import DocumentPreprocessor

# 1) simple PDF (born digital)
text = DocumentPreprocessor(mode="fast").process("report.pdf")

# 2) scanned PDF with multilingual OCR
proc = DocumentPreprocessor(
    mode="advanced",
    ocr_engine="surya",
    force_ocr=True,
    enable_picture_description=True,
    use_local_vlm=True,
    local_vlm_repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    vlm_prompt="Please describe this document in detail.",
)
markdown = proc.process("scan_no_text.pdf")
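The "cheap extraction first, OCR only when needed" behaviour can be sketched as a small decision function. The heuristic here (a minimum character count) is an assumption made for illustration; the library's real criterion may differ, and `extract_with_fallback`, `extract_text`, `run_ocr` and `min_chars` are hypothetical names rather than part of the LLMAIx API.

```python
from typing import Callable


def extract_with_fallback(
    path: str,
    extract_text: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 50,
    force_ocr: bool = False,
) -> str:
    """Return the embedded text when there is enough of it, otherwise OCR output."""
    if not force_ocr:
        text = extract_text(path)
        # Born-digital files usually carry a usable text layer; scans yield
        # little or nothing, which is the cue to pay for OCR.
        if len(text.strip()) >= min_chars:
            return text
    return run_ocr(path)


# Stand-in extractors so the sketch runs without any OCR engine installed.
born_digital = extract_with_fallback(
    "report.pdf",
    extract_text=lambda p: "Quarterly report. " * 10,
    run_ocr=lambda p: "(OCR result)",
)
scanned = extract_with_fallback(
    "scan_no_text.pdf",
    extract_text=lambda p: "",
    run_ocr=lambda p: "(OCR result)",
)
```

Passing the extractors in as callables keeps the decision logic independent of any particular engine, which mirrors the idea that OCR back-ends can be swapped.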

Information extraction

from llmaix import extract_info
from pydantic import BaseModel

class LabInfo(BaseModel):
    name: str
    location: str
    lead: str

sentence = (
    "The KatherLab is a research group at TU Dresden led by Prof. Jakob N. Kather."
)
json_out = extract_info(
    prompt=f"Extract lab facts: {sentence}",
    pydantic_model=LabInfo,
    llm_model="gpt-4o-mini"
)
print(json_out.model_dump_json(indent=2))  # assumes a Pydantic v2 model is returned; v2 rejects .json(indent=...)
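What "valid JSON" means in practice is that the model's reply is checked against the schema before it is handed back. The sketch below shows only that validation step with plain Pydantic v2 and no LLM call; `parse_reply` is a hypothetical helper, and how `extract_info` handles invalid replies internally is not shown in this README.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class LabInfo(BaseModel):
    name: str
    location: str
    lead: str


def parse_reply(raw: str) -> Optional[LabInfo]:
    """Validate a raw JSON reply; return None when it does not fit the schema."""
    try:
        return LabInfo.model_validate_json(raw)
    except ValidationError:
        # Raised for malformed JSON as well as for missing or mistyped fields.
        return None


good = '{"name": "KatherLab", "location": "TU Dresden", "lead": "Jakob N. Kather"}'
bad = '{"name": "KatherLab"}'  # location and lead are missing

print(parse_reply(good))
print(parse_reply(bad))
```

A caller could retry the prompt whenever `None` comes back, which is one common way to make schema-constrained extraction robust.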

⚙️ Back‑end matrix

| Task | Engine | Notes |
|---|---|---|
| Text extraction | PyMuPDF‑for‑LLM | Fast Markdown conversion from PDFs |
| | Docling | Layout‑aware; optional VLM captions |
| OCR | OCR‑my‑PDF (Tesseract) | Strong PDF/A support |
| | Surya‑OCR | Local transformer OCR, 90+ langs |
| | PaddleOCR PP‑Structure | Table & formula detection |
| MIME sniffing | python‑magic | libmagic signatures |
| | (optional) filetype | pure‑Python fallback |

🧪 Tests

Clone and run:

git clone https://github.com/KatherLab/LLMAIx-v2.git
cd LLMAIx-v2
uv sync
uv run pytest          # full suite

You can focus on a backend:

uv run pytest tests/test_preprocess.py -k paddleocr

