Add your description here

These details have not been verified by PyPI

Project links

Project description

Tests

LLMAIxLib

LLMAIxLib is a Python toolkit for automated document preprocessing (including OCR) and information extraction using large language models. It is designed for users who need to extract structured facts from arbitrary documents (PDF, DOCX, images, etc.) and output them as Markdown, plain text, or validated JSON.

[!CAUTION]

Under active development. Best suited for research or prototyping. Always validate results.

🚀 What LLMAIxLib Does

Preprocessing: Extracts human-readable Markdown or plain text from a wide range of document types, automatically falling back to OCR for scanned or image-based files.
Information Extraction: Uses a large language model (LLM) to transform unstructured or semi-structured text into structured data—validated by Pydantic models or JSON Schema—via an OpenAI-compatible API.

❗ What You Need

Python ≥3.12
OCR tools: Tesseract (for OCRmyPDF), a GPU for faster OCR (Surya-OCR and PaddleOCR)
OpenAI-compatible API endpoint: Required for information extraction! This can be:
- The official OpenAI API (or Azure OpenAI or ...)
- A self-hosted API that matches the OpenAI chat/completions format, e.g. vllm, llama.cpp server, or other compatible backends
- Your endpoint must support structured output (based on JSON schema).

🛠 Installation

Install base:

pip install llmaix

Add extras for advanced features:

pip install llmaix[docling]      # advanced layout + VLM support
pip install llmaix[surya]        # Surya OCR
pip install llmaix[paddleocr]    # PaddleOCR
pip install llmaix[docling,surya,paddleocr] # all extras

📚 Usage

CLI Examples

Environment variables are the recommended way to provide your API settings (see below).

llmaix preprocess file.pdf                # extract as Markdown, fast mode
llmaix preprocess scan.pdf --force-ocr --ocr-engine paddleocr -o out.md
llmaix preprocess paper.pdf --mode advanced --enable-picture-description
llmaix extract --input "Patient was a 73-year-old male..." --json-schema patient_schema.json

Python API Example

from llmaix.preprocess import DocumentPreprocessor
from llmaix import extract_info
from pydantic import BaseModel

# Preprocessing: get Markdown or text
proc = DocumentPreprocessor(mode="advanced", ocr_engine="surya")
markdown = proc.process("scan.pdf")

# Information extraction: structured JSON from text via LLM
class PersonInfo(BaseModel):
    name: str
    affiliation: str
    position: str

result = extract_info(
    prompt="Alice Smith is a Professor of AI at TU Dresden.",
    pydantic_model=PersonInfo,
    llm_model="o4-mini"
)
print(result.json(indent=2))

🔑 API Configuration

You must provide your LLM API settings by environment variable (recommended) or CLI flag:

export OPENAI_API_KEY=sk-xxx
export OPENAI_API_BASE=https://api.example.com/v1  # optional, default: OpenAI endpoint
export OPENAI_MODEL=gpt-4                         # optional, default: set in CLI or code

Or pass directly:

llmaix extract --input "..." --llm-model llama-3-8b-instruct --base-url http://localhost:8000/v1 --api-key sk-xxx --json-schema schema.json

🗂 Architecture Overview

Preprocessing

DocumentPreprocessor:
- Detects MIME type and routes to the appropriate handler.
- For PDFs: tries fast text extraction first, falls back to OCR (OCRmyPDF, PaddleOCR, Surya-OCR) if needed.
- DOCX, TXT, and image formats supported.
- Advanced mode: integrates Docling for tables, formulas, and (optionally) vision-language model for image captioning.
OCR Engines: Pluggable; use Tesseract, Surya, PaddleOCR as needed.

Information Extraction

extract_info:
- Sends text and a schema (Pydantic or JSON Schema) to an OpenAI-compatible API endpoint.
- Validates output as structured JSON.
- CLI can load schema from file or as literal string.
- Your API endpoint must support structured outputs!
- Can be used with hosted (OpenAI, Azure) or self-hosted (e.g. llama.cpp, vllm) models that follow the OpenAI API.

🧩 JSON Schema Example

{
  "type": "object",
  "properties": {
    "experiment_id": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "findings": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["experiment_id", "findings"]
}

✅ Quick Checklist

Set up API credentials (see above).
Install OCR backends as required for your documents.
Use llmaix preprocess for robust text/Markdown extraction from documents.
Use llmaix extract (with prompt + schema or model) for LLM-powered structured extraction.

🧪 Testing

uv run pytest
uv run pytest tests/test_preprocess.py -k paddleocr

⚠️ Caveats & Notes

Preprocessing only: No LLM API needed if you just want Markdown/text from documents.
Information extraction: Requires an OpenAI-compatible API endpoint that supports structured outputs.
If your LLM or endpoint does not support structured output via reponse_format, information extraction will not work as expected.
- You can still use the extract_info function and provide a prompt or system_prompt argument which teaches the model to respond with valid JSON only in the desired format!

📄 License

MIT.

Contributions welcome.

Repo: github.com/KatherLab/llmaixlib

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.0.26

Aug 28, 2025

0.0.24

Aug 5, 2025

This version

0.0.23

Aug 1, 2025

0.0.22

Jul 31, 2025

0.0.21

Jul 29, 2025

0.0.20

Jul 28, 2025

0.0.19

Jul 28, 2025

0.0.18

Jul 28, 2025

0.0.17

Jul 28, 2025

0.0.16

Jul 28, 2025

0.0.14

Jul 28, 2025

0.0.12

Jul 10, 2025

0.0.11

Jun 30, 2025

0.0.10 yanked

Jun 30, 2025

Reason this release was yanked:

Broken build version

0.0.9

Jun 18, 2025

0.0.8

Jun 16, 2025

0.0.7

Jun 6, 2025

0.0.6

Jun 2, 2025

0.0.5

May 12, 2025

0.0.3

May 9, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llmaix-0.0.23.tar.gz (1.7 MB view details)

Uploaded Aug 1, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

llmaix-0.0.23-py3-none-any.whl (28.0 kB view details)

Uploaded Aug 1, 2025 Python 3

File details

Details for the file llmaix-0.0.23.tar.gz.

File metadata

Download URL: llmaix-0.0.23.tar.gz
Upload date: Aug 1, 2025
Size: 1.7 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for llmaix-0.0.23.tar.gz
Algorithm	Hash digest
SHA256	`d86889698ebe17c5e4be8d927e377e09d8322b0a009c17703838f0cf995339ea`
MD5	`f1d08245ab4d5842b65303bdfc77ef26`
BLAKE2b-256	`36a1be645212debc1440b86b7b7c13443ae8a202c4b7eb97ffa830f017a4bdf7`

See more details on using hashes here.

Provenance

The following attestation bundles were made for llmaix-0.0.23.tar.gz:

Publisher: python-publish.yml on KatherLab/llmaixlib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmaix-0.0.23.tar.gz
- Subject digest: d86889698ebe17c5e4be8d927e377e09d8322b0a009c17703838f0cf995339ea
- Sigstore transparency entry: 339768095
- Sigstore integration time: Aug 1, 2025
Source repository:
- Permalink: KatherLab/llmaixlib@f386446b14d9e10f4ff1b6d984f93bb8ee2cf3f1
- Branch / Tag: refs/tags/v0.0.23
- Owner: https://github.com/KatherLab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@f386446b14d9e10f4ff1b6d984f93bb8ee2cf3f1
- Trigger Event: release

File details

Details for the file llmaix-0.0.23-py3-none-any.whl.

File metadata

Download URL: llmaix-0.0.23-py3-none-any.whl
Upload date: Aug 1, 2025
Size: 28.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for llmaix-0.0.23-py3-none-any.whl
Algorithm	Hash digest
SHA256	`674be8f209fe7455fdb4c95033edd9e47cb1c724df5adaf07f256e494d8d0621`
MD5	`fcd044802a61ca5ee0d87a9652426727`
BLAKE2b-256	`d1391a6dc8cf6c7aff7dd15af7e19912130a6b4dd1261cdd7d484d59dfeb3a8f`

See more details on using hashes here.

Provenance

The following attestation bundles were made for llmaix-0.0.23-py3-none-any.whl:

Publisher: python-publish.yml on KatherLab/llmaixlib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llmaix-0.0.23-py3-none-any.whl
- Subject digest: 674be8f209fe7455fdb4c95033edd9e47cb1c724df5adaf07f256e494d8d0621
- Sigstore transparency entry: 339768118
- Sigstore integration time: Aug 1, 2025
Source repository:
- Permalink: KatherLab/llmaixlib@f386446b14d9e10f4ff1b6d984f93bb8ee2cf3f1
- Branch / Tag: refs/tags/v0.0.23
- Owner: https://github.com/KatherLab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@f386446b14d9e10f4ff1b6d984f93bb8ee2cf3f1
- Trigger Event: release

llmaix 0.0.23

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

LLMAIxLib

🚀 What LLMAIxLib Does

❗ What You Need

🛠 Installation

📚 Usage

CLI Examples

Python API Example

🔑 API Configuration

🗂 Architecture Overview

Preprocessing

Information Extraction

🧩 JSON Schema Example

✅ Quick Checklist

🧪 Testing

⚠️ Caveats & Notes

📄 License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance