LLMAIxv2 Library

LLMAIx is an end‑to‑end toolkit for turning raw documents into structured knowledge with large language models. It now features a modular, Pydantic‑validated preprocessing core, a wider choice of OCR engines, and a revamped CLI.

Status: the public API is still stabilising; expect small breaking changes before a 2.0 stable release.

✨ Key capabilities

Robust preprocessing – extract Markdown or plain text from PDFs, DOCX, TXT and images. The pipeline tries cheap text extraction first (PyMuPDF) and falls back to OCR (OCR‑my‑PDF/PaddleOCR/Surya) only when needed.
Layout‑aware enrichment – advanced mode plugs into the Docling pipeline for tables, formulas and picture descriptions, optionally powered by a local or remote vision‑language model.
MIME‑aware loading – files (or byte buffers) are classified with python‑magic so even extension‑less uploads are handled correctly.
Information extraction – send arbitrary prompts + a Pydantic schema and get back valid JSON, using any OpenAI‑compatible LLM endpoint.
CLI utilities – one‑command document conversion (llmaix preprocess) and structured‑info extraction (llmaix extract).
Extensible – register new back‑ends (e.g. EPUB) with a single decorator; models and OCR engines can be swapped freely.
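
The "cheap extraction first, OCR only when needed" behaviour listed above can be pictured with a small sketch. This is not LLMAIx's actual implementation: the function name, the `run_ocr` callback and the 20-characters-per-page threshold are assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

# Assumed threshold for "this page has a usable text layer"; not taken from LLMAIx.
MIN_CHARS_PER_PAGE = 20


def extract_with_fallback(
    pages_text: List[str],
    run_ocr: Callable[[], str],
    force_ocr: bool = False,
) -> Tuple[str, str]:
    """Return (text, method), where method is "text-layer" or "ocr".

    pages_text holds the text already pulled from each page's embedded
    text layer; run_ocr is only called when that text looks too sparse.
    """
    if not force_ocr and pages_text:
        avg_chars = sum(len(page.strip()) for page in pages_text) / len(pages_text)
        if avg_chars >= MIN_CHARS_PER_PAGE:
            return "\n\n".join(pages_text), "text-layer"
    # Too little embedded text (typical for scans), or OCR was forced.
    return run_ocr(), "ocr"
```

Passing the OCR step as a callback keeps the expensive engine from being touched for born-digital files, which is the point of trying the text layer first.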

🛠 Installation

pip install llmaix          # base
pip install llmaix[docling] # + Docling/VLM extras
pip install llmaix[surya]   # + Surya‑OCR
pip install llmaix[all]     # everything

If you need GPU PaddleOCR:

uv pip install \
  --index-url https://www.paddlepaddle.org.cn/packages/stable/cu129/ \
  paddlepaddle-gpu==3.1.0

PaddleOCR supports 80+ languages out‑of‑the‑box. For MIME detection, install libmagic (Linux/macOS) or python-magic-win64 (Windows).

🚀 Quick start

CLI

llmaix preprocess myscan.pdf                 # fast mode, auto‑OCR
llmaix preprocess doc.pdf --mode advanced \
    --enable-picture-description             # Docling + VLM captions
llmaix preprocess scan.pdf --force-ocr \
    --ocr-engine paddleocr -o out.md
llmaix extract -i 'Acme Inc. raised $10 M...' # JSON extraction (single quotes stop the shell expanding $1)

Python API

from llmaix.preprocess import DocumentPreprocessor

# 1) simple PDF (born digital)
text = DocumentPreprocessor(mode="fast").process("report.pdf")

# 2) scanned PDF with multilingual OCR
proc = DocumentPreprocessor(
    mode="advanced",
    ocr_engine="surya",
    force_ocr=True,
    enable_picture_description=True,
    use_local_vlm=True,
    local_vlm_repo_id="HuggingFaceTB/SmolVLM-256M-Instruct",
    vlm_prompt="Please describe this document in detail.",
)
markdown = proc.process("scan_no_text.pdf")
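
For converting a whole folder, the single-file call shown above can be wrapped in a loop. The helper below is a sketch, not part of LLMAIx: `convert_folder` and the `Processor` protocol are hypothetical names, and it assumes that `process()` accepts a path string and returns the converted text as a `str`, as the examples above suggest.

```python
from pathlib import Path
from typing import List, Protocol


class Processor(Protocol):
    """Anything with a process(path) -> str method, e.g. a DocumentPreprocessor."""

    def process(self, source: str) -> str: ...


def convert_folder(
    processor: Processor,
    src_dir: Path,
    out_dir: Path,
    pattern: str = "*.pdf",
) -> List[Path]:
    """Convert every file matching pattern and write one .md file per input."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(src_dir.glob(pattern)):
        markdown = processor.process(str(source))
        target = out_dir / f"{source.stem}.md"
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

With the library installed, a call might look like `convert_folder(DocumentPreprocessor(mode="fast"), Path("inbox"), Path("markdown"))`.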

Information extraction

from llmaix import extract_info
from pydantic import BaseModel

class LabInfo(BaseModel):
    name: str
    location: str
    lead: str

sentence = (
    "The KatherLab is a research group at TU Dresden led by Prof. Jakob N. Kather."
)
json_out = extract_info(
    prompt=f"Extract lab facts: {sentence}",
    pydantic_model=LabInfo,
    llm_model="gpt-4o-mini"
)
print(json_out.json(indent=2))
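
To show what "valid JSON against a Pydantic schema" means in practice, here is a sketch that uses only Pydantic v2 and no LLM call. It is not how `extract_info` works internally; `parse_llm_reply` is a hypothetical helper, and the raw reply strings stand in for whatever a model might return.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class LabInfo(BaseModel):
    name: str
    location: str
    lead: str


def parse_llm_reply(raw_reply: str) -> Optional[LabInfo]:
    """Validate a raw JSON reply against LabInfo; return None if it does not fit."""
    try:
        return LabInfo.model_validate_json(raw_reply)
    except ValidationError:
        # Malformed JSON and missing or mistyped fields both end up here.
        return None


# JSON Schema derived from the model; this is the kind of structure that can
# be handed to an endpoint supporting schema-constrained output.
schema = LabInfo.model_json_schema()
```

A reply that omits a required field, such as `{"name": "KatherLab"}`, is rejected rather than passed on half-filled, which is the benefit of validating against the schema instead of calling `json.loads` directly.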

⚙️ Back‑end matrix

Task	Engine	Notes
Text extraction	PyMuPDF‑for‑LLM	Fast Markdown conversion from PDFs
Text extraction	Docling	Layout‑aware; optional VLM captions
OCR	OCR‑my‑PDF (Tesseract)	Strong PDF/A support
OCR	Surya‑OCR	Local transformer OCR, 90+ languages
OCR	PaddleOCR PP‑Structure	Table and formula detection
MIME sniffing	python‑magic	libmagic signatures
MIME sniffing (optional)	filetype	Pure‑Python fallback
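
The idea behind content-based MIME sniffing combined with decorator-registered back-ends can be sketched in plain Python. This does not use python‑magic or filetype and does not reproduce LLMAIx's registration API; `sniff_mime`, `register`, `load` and the tiny signature table are hypothetical and cover only a few formats.

```python
from typing import Callable, Dict

# Leading byte signatures for a few formats (a tiny subset of what libmagic knows).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # DOCX and EPUB are ZIP containers
]

# MIME type -> loader function, filled in by the @register decorator.
HANDLERS: Dict[str, Callable[[bytes], str]] = {}


def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from the content itself, ignoring any file extension."""
    for magic_bytes, mime in SIGNATURES:
        if data.startswith(magic_bytes):
            return mime
    try:
        data.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def register(mime: str):
    """Decorator that registers a loader for one MIME type."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        HANDLERS[mime] = func
        return func
    return decorator


@register("text/plain")
def load_text(data: bytes) -> str:
    return data.decode("utf-8")


def load(data: bytes) -> str:
    """Dispatch a byte buffer to whichever loader is registered for its type."""
    mime = sniff_mime(data)
    handler = HANDLERS.get(mime)
    if handler is None:
        raise ValueError(f"no back-end registered for {mime}")
    return handler(data)
```

Because dispatch keys on the sniffed type rather than the file name, an upload without an extension still reaches the right loader, and supporting a new format only requires one more decorated function.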

🧪 Tests

Clone and run:

git clone https://github.com/KatherLab/LLMAIx-v2.git
cd LLMAIx-v2
uv sync
uv run pytest          # full suite

You can focus on a backend:

uv run pytest tests/test_preprocess.py -k paddleocr



llmaix 0.0.20
