LLMAIxv2 Library
LLMAIx is an end‑to‑end toolkit for turning raw documents into structured knowledge with large language models. It now features a modular, Pydantic‑validated preprocessing core, richer OCR choices, and a revamped CLI.
Status: the public API is still stabilising; expect small breaking changes before a 2.0 stable release.
✨ Key capabilities
- Robust preprocessing – extract Markdown or plain text from PDFs, DOCX, TXT and images. The pipeline tries cheap text extraction first (PyMuPDF) and falls back to OCR (OCR‑my‑PDF/PaddleOCR/Surya) only when needed.
- Layout‑aware enrichment – advanced mode plugs into the Docling pipeline for tables, formulas and picture descriptions, optionally powered by a local or remote vision‑language model.
- MIME‑aware loading – files (or byte buffers) are classified with python‑magic so even extension‑less uploads are handled correctly.
- Information extraction – send arbitrary prompts + a Pydantic schema and get back valid JSON, using any OpenAI‑compatible LLM endpoint.
- CLI utilities – one‑command document conversion (llmaix preprocess) and structured‑info extraction (llmaix extract).
- Extensible – register new back‑ends (e.g. EPUB) with a single decorator; models and OCR engines can be swapped freely.
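The "cheap extraction first, OCR only when needed" strategy and the decorator-based back-end registration can be sketched in plain Python. This is a minimal illustration, not LLMAIx's actual implementation: `register_backend`, `extract_with_fallback`, `convert` and the `min_chars` threshold are hypothetical names chosen for this example.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a MIME type to a function that turns bytes into text.
Backend = Callable[[bytes], str]
_BACKENDS: Dict[str, Backend] = {}


def register_backend(mime_type: str) -> Callable[[Backend], Backend]:
    """Decorator that records a back-end under the given MIME type."""
    def decorator(func: Backend) -> Backend:
        _BACKENDS[mime_type] = func
        return func
    return decorator


def extract_with_fallback(
    data: bytes,
    cheap: Backend,
    ocr: Backend,
    min_chars: int = 20,  # assumed threshold for "enough text was found"
) -> str:
    """Try the cheap extractor first; run OCR only if it yields too little text."""
    text = cheap(data)
    if len(text.strip()) >= min_chars:
        return text
    return ocr(data)


@register_backend("text/plain")
def load_plain_text(data: bytes) -> str:
    """Back-end for plain text: decode the bytes, replacing invalid sequences."""
    return data.decode("utf-8", errors="replace")


def convert(data: bytes, mime_type: str) -> str:
    """Dispatch to whichever back-end was registered for this MIME type."""
    try:
        backend = _BACKENDS[mime_type]
    except KeyError:
        raise ValueError(f"no back-end registered for {mime_type!r}") from None
    return backend(data)
```

In this sketch, a new format is supported by decorating one more function. The OCR engine is just another callable passed to `extract_with_fallback`, which is what makes engines easy to swap.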
🛠 Installation
pip install llmaix # base
pip install "llmaix[docling]" # + Docling/VLM extras
pip install "llmaix[surya]" # + Surya‑OCR
pip install "llmaix[all]" # everything
If you need GPU PaddleOCR:
uv pip install \
--index-url https://www.paddlepaddle.org.cn/packages/stable/cu129/ \
paddlepaddle-gpu==3.1.0
PaddleOCR supports 80+ languages out‑of‑the‑box.
For MIME detection, install libmagic (Linux/macOS) or python-magic-win64 (Windows).
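To illustrate why libmagic matters, here is a small sketch of content-based MIME detection that degrades gracefully when python-magic or libmagic is missing. `detect_mime`, `sniff_by_signature` and the signature table are hypothetical helpers written for this example and are not part of LLMAIx. The only third-party call is `magic.from_buffer(data, mime=True)` from python-magic.

```python
try:
    import magic  # python-magic; needs the libmagic shared library
except ImportError:  # package or libmagic not installed
    magic = None

# A few well-known file signatures, used only when libmagic is unavailable.
_SIGNATURES = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
)


def sniff_by_signature(data: bytes) -> str:
    """Guess a MIME type from leading magic bytes; generic type if unknown."""
    for prefix, mime in _SIGNATURES:
        if data.startswith(prefix):
            return mime
    return "application/octet-stream"


def detect_mime(data: bytes) -> str:
    """Prefer libmagic when it is importable, else fall back to signatures."""
    if magic is not None and hasattr(magic, "from_buffer"):
        return magic.from_buffer(data, mime=True)
    return sniff_by_signature(data)
```

Because detection looks at the bytes rather than the file name, an upload without an extension is still classified correctly.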
🚀 Quick start
CLI
llmaix preprocess myscan.pdf # fast mode, auto‑OCR
llmaix preprocess doc.pdf --mode advanced \
--enable-picture-description # Docling + VLM captions
llmaix preprocess scan.pdf --force-ocr \
--ocr-engine paddleocr -o out.md
llmaix extract -i 'Acme Inc. raised $10 M...' # JSON extraction (single quotes stop the shell from expanding $1)
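When the CLI is driven from a script, it can help to assemble the argument list programmatically instead of concatenating strings. The sketch below uses only the flags shown in the commands above. `build_preprocess_cmd` is a hypothetical helper for this example, not something the package is known to ship.

```python
from typing import List, Optional


def build_preprocess_cmd(
    path: str,
    *,
    mode: Optional[str] = None,
    force_ocr: bool = False,
    ocr_engine: Optional[str] = None,
    picture_description: bool = False,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `llmaix preprocess` from keyword options."""
    cmd = ["llmaix", "preprocess", path]
    if mode is not None:
        cmd += ["--mode", mode]
    if picture_description:
        cmd.append("--enable-picture-description")
    if force_ocr:
        cmd.append("--force-ocr")
    if ocr_engine is not None:
        cmd += ["--ocr-engine", ocr_engine]
    if output is not None:
        cmd += ["-o", output]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems entirely.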
Python API
from llmaix.preprocess import DocumentPreprocessor
# 1) simple PDF (born digital)
text = DocumentPreprocessor(mode="fast").process("report.pdf")
# 2) scanned PDF with multilingual OCR
proc = DocumentPreprocessor(
    mode="advanced",
    ocr_engine="surya",
    force_ocr=True,
    enable_picture_description=True,
    use_local_vlm=True,
    local_vlm_repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    vlm_prompt="Please describe this document in detail.",
)
markdown = proc.process("scan_no_text.pdf")
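For more than a handful of files, the same `process` call can be wrapped in a small batch loop. The sketch below assumes only what the examples above show: an object with a `process(path)` method that returns text. `convert_folder` and `SupportsProcess` are hypothetical names for this illustration.

```python
from pathlib import Path
from typing import Dict, Protocol


class SupportsProcess(Protocol):
    """Anything with a process(path) -> str method, e.g. a DocumentPreprocessor."""

    def process(self, source: str) -> str: ...


def convert_folder(
    processor: SupportsProcess, src_dir: str, out_dir: str
) -> Dict[str, str]:
    """Convert every PDF in src_dir to a Markdown file in out_dir.

    Returns a mapping of input file name to error message for files that failed,
    so one bad document does not abort the whole batch.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures: Dict[str, str] = {}
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        try:
            text = processor.process(str(pdf))
        except Exception as exc:  # record the failure and keep going
            failures[pdf.name] = str(exc)
            continue
        (out / f"{pdf.stem}.md").write_text(text, encoding="utf-8")
    return failures
```

A call such as `convert_folder(DocumentPreprocessor(mode="fast"), "scans", "markdown")` would then write one `.md` file per input PDF, provided `process` returns a string as the examples suggest.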
Information extraction
from llmaix import extract_info
from pydantic import BaseModel
class LabInfo(BaseModel):
    name: str
    location: str
    lead: str
sentence = (
    "The KatherLab is a research group at TU Dresden led by Prof. Jakob N. Kather."
)
json_out = extract_info(
    prompt=f"Extract lab facts: {sentence}",
    pydantic_model=LabInfo,
    llm_model="gpt-4o-mini",
)
print(json_out.json(indent=2))
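What "valid JSON" means here can be shown without calling an LLM: the reply is parsed and checked against the schema, and anything that does not fit is rejected. The sketch below is an illustration of that validation step only. `parse_reply` is a hypothetical helper and makes no claim about how `extract_info` works internally.

```python
import json

from pydantic import BaseModel, ValidationError


class LabInfo(BaseModel):
    name: str
    location: str
    lead: str


def parse_reply(raw: str) -> LabInfo:
    """Parse a raw LLM reply and validate it against LabInfo.

    Raises ValueError if the reply is not JSON or does not match the schema.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    try:
        # Keyword construction works with both Pydantic v1 and v2.
        return LabInfo(**payload)
    except (ValidationError, TypeError) as exc:
        raise ValueError(f"reply does not match the schema: {exc}") from exc
```

A reply missing the `lead` field, for example, fails fast with a clear error instead of propagating an incomplete record downstream.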
⚙️ Back‑end matrix
| Task | Engine | Notes |
|---|---|---|
| Text extraction | PyMuPDF‑for‑LLM | Fast Markdown conversion from PDFs |
| Text extraction | Docling | Layout‑aware; optional VLM captions |
| OCR | OCR‑my‑PDF (Tesseract) | Strong PDF/A support |
| OCR | Surya‑OCR | Local transformer OCR, 90+ languages |
| OCR | PaddleOCR PP‑Structure | Table & formula detection |
| MIME sniffing | python‑magic | libmagic signatures |
| MIME sniffing (optional) | filetype | pure‑Python fallback |
🧪 Tests
Clone and run:
git clone https://github.com/KatherLab/LLMAIx-v2.git
cd LLMAIx-v2
uv sync
uv run pytest # full suite
You can focus on a backend:
uv run pytest tests/test_preprocess.py -k paddleocr