LLMAIx (v2) Library
The llmaix library contains the core functionality of the LLMAIx framework.
[!CAUTION] The interface of the library is still in development and may change in the future. The library is not yet ready for production use.
Features
- Preprocessing: The library provides tools for extracting text from various file formats, including PDF, DOCX, and TXT. It can apply OCR to images and PDFs, using tesseract, surya-ocr and VLMs via docling.
- Information Extraction: The library provides a wrapper helping you to get a JSON response from an LLM. All OpenAI-API compatible models are supported!
Installation
pip install llmaix
To install dependencies for docling:
pip install llmaix[docling]
Available dependency groups: surya, docling
To install all dependencies:
pip install llmaix[all]
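After installing, it can be handy to check which optional dependency groups are actually importable. The following is a minimal sketch using only the standard library. The mapping from group name to top-level module name (`docling`, `surya`) is an assumption made for illustration and is not taken from llmaix itself.

```python
from importlib.util import find_spec

# Assumed mapping from dependency group to the top-level module it provides.
# The module names are guesses for illustration, not taken from llmaix.
EXTRA_MODULES = {
    "docling": "docling",
    "surya": "surya",
}


def available_extras(extra_modules=EXTRA_MODULES):
    """Return the dependency groups whose module can be found on this system."""
    return sorted(
        group
        for group, module in extra_modules.items()
        if find_spec(module) is not None
    )
```

Calling `available_extras()` returns a sorted list of the group names that were found, which may be empty if only the base package is installed.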
Usage
CLI
llmaix --help
Python
Preprocessing a PDF file without OCR:
from llmaix import preprocess_file
filename = "tests/testfiles/987462_text.pdf"
extracted_text = preprocess_file(filename)
Preprocessing a PDF file with OCR:
from llmaix import preprocess_file
filename = "tests/testfiles/987462_notext.pdf"
extracted_text = preprocess_file(filename, use_ocr=True, ocr_backend="ocrmypdf")
| OCR Backends | Comment |
|---|---|
| ocrmypdf | Uses tesseract, which must be installed on the system first. |
| surya-ocr | Uses surya-ocr. Runs models via transformers library locally. |
| doclingvlm | Uses docling to perform OCR using a VLM. Configure the model in the same way as for information extraction. |
| PDF Backends | Comment |
|---|---|
| pymupdf4llm | Uses pymupdf to extract text as markdown from PDF files. |
| markitdown | Uses markitdown to extract text as markdown from PDF files. |
| docling | Uses docling to extract text as markdown from PDF files. Caution: docling itself might apply OCR even if you don't specify it. |
| ocr_backend | Directly use the text output from the OCR backend. Incompatible with ocrmypdf. |
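The two tables above imply a few constraints that can be checked before any file is processed. The sketch below is not part of llmaix: `check_backends` and `extract_text` are hypothetical helpers, `min_chars` is an arbitrary threshold, and the code assumes that `preprocess_file` returns a string. Only the `use_ocr` and `ocr_backend` keyword arguments shown earlier are used. The rule that the `ocr_backend` PDF backend needs an OCR backend to be chosen is an inference from the table, not a documented behavior.

```python
from typing import Callable, Optional

# Backend names copied from the tables in this README.
OCR_BACKENDS = {"ocrmypdf", "surya-ocr", "doclingvlm"}
PDF_BACKENDS = {"pymupdf4llm", "markitdown", "docling", "ocr_backend"}


def check_backends(pdf_backend: str, ocr_backend: Optional[str] = None) -> None:
    """Raise ValueError for combinations that the tables rule out."""
    if pdf_backend not in PDF_BACKENDS:
        raise ValueError(f"Unknown PDF backend: {pdf_backend!r}")
    if ocr_backend is not None and ocr_backend not in OCR_BACKENDS:
        raise ValueError(f"Unknown OCR backend: {ocr_backend!r}")
    if pdf_backend == "ocr_backend":
        # Inferred: using the OCR output directly requires an OCR backend.
        if ocr_backend is None:
            raise ValueError("pdf_backend='ocr_backend' needs an OCR backend")
        # Stated in the PDF backends table.
        if ocr_backend == "ocrmypdf":
            raise ValueError("pdf_backend='ocr_backend' is incompatible with ocrmypdf")


def extract_text(
    path: str,
    preprocess: Callable[..., str],
    ocr_backend: str = "ocrmypdf",
    min_chars: int = 20,
) -> str:
    """Try plain extraction first; retry with OCR if little text comes back."""
    text = preprocess(path)
    if len(text.strip()) >= min_chars:
        return text
    return preprocess(path, use_ocr=True, ocr_backend=ocr_backend)
```

With llmaix installed, this could be called as `extract_text("tests/testfiles/987462_notext.pdf", preprocess_file)`. Passing the preprocessing function as an argument keeps the helper easy to test with a stand-in.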
Extracting information from a text:
- Provide a .env file with your OpenAI API key:
echo "OPENAI_API_KEY=your_openai_api_key" > .env
- (Optional) To use a custom base URL, set the OPENAI_API_BASE environment variable:
echo "OPENAI_API_BASE=https://your_custom_base_url/v1" >> .env
- (Optional) Configure the model in the .env file:
echo "OPENAI_MODEL=gpt-4o-2024-08-06" >> .env
- Use the extract_info function to extract information from a text. In this example, a pydantic model is used to define the expected output format. The output will be a JSON object.
from llmaix import extract_info
from pydantic import BaseModel
extracted_text = "The KatherLab is a research group at the University of Technology Dresden, led by Prof. Jakob N. Kather."

# Pydantic model describing the fields the LLM should return
class LabInformation(BaseModel):
    name: str
    location: str
    lead: str

extracted_info = extract_info(
    prompt=f"Extract the name, location and lead of the lab from the following text: {extracted_text}",
    llm_model="Llama-4-Maverick-17B-128E-Instruct-FP8",
    pydantic_model=LabInformation,
)
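The .env steps above boil down to three variables: OPENAI_API_KEY, OPENAI_API_BASE and OPENAI_MODEL. How llmaix reads them internally is not shown here, so the following is only a standard-library sketch for checking a configuration before running an extraction. The helper names `read_dotenv` and `llm_settings`, and the choice to let the process environment override the file, are assumptions.

```python
import os
from pathlib import Path


def read_dotenv(path=".env"):
    """Parse simple KEY=VALUE lines into a dict, stripping surrounding quotes."""
    values = {}
    env_file = Path(path)
    if not env_file.is_file():
        return values
    for line in env_file.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def llm_settings(path=".env", environ=None):
    """Merge .env values with the environment (the environment wins)."""
    environ = os.environ if environ is None else environ
    merged = {**read_dotenv(path), **environ}
    if "OPENAI_API_KEY" not in merged:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": merged["OPENAI_API_KEY"],
        "base_url": merged.get("OPENAI_API_BASE"),  # None if not configured
        "model": merged.get("OPENAI_MODEL"),  # None if not configured
    }
```

Calling `llm_settings()` in the directory that holds the .env file returns the three values as a dict, or raises `RuntimeError` if no API key is configured.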
Clone the repository and install the dependencies:
git clone https://github.com/KatherLab/LLMAIx-v2.git
cd LLMAIx-v2
uv sync
Tests
Run the tests using the following command:
uv run pytest
For example, to run only the preprocessing tests with the ocrmypdf backend:
uv run pytest tests/test_preprocess.py --ocr-backend ocrmypdf
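A custom flag such as `--ocr-backend` is usually registered in a `conftest.py` through pytest's `pytest_addoption` and `pytest_generate_tests` hooks. The repository's actual wiring is not shown here, so the following is a hypothetical sketch of one common way to do it. The test argument name `ocr_backend` and the default of running all backends are assumptions.

```python
# conftest.py (hypothetical sketch; the repository's real conftest may differ)

# Backend names taken from the OCR backends table in this README.
KNOWN_OCR_BACKENDS = ("ocrmypdf", "surya-ocr", "doclingvlm")


def pytest_addoption(parser):
    """Register the --ocr-backend command-line option with pytest."""
    parser.addoption(
        "--ocr-backend",
        action="store",
        default=None,
        choices=KNOWN_OCR_BACKENDS,
        help="Run OCR tests only for this backend (default: all backends).",
    )


def pytest_generate_tests(metafunc):
    """Parametrize every test that takes an `ocr_backend` argument."""
    if "ocr_backend" in metafunc.fixturenames:
        selected = metafunc.config.getoption("--ocr-backend")
        backends = [selected] if selected else list(KNOWN_OCR_BACKENDS)
        metafunc.parametrize("ocr_backend", backends)
```

With this in place, a test declared as `def test_ocr(ocr_backend): ...` would run once per backend, or only once when the flag is given.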
File details
Details for the file llmaix-0.0.8.tar.gz.
File metadata
- Download URL: llmaix-0.0.8.tar.gz
- Upload date:
- Size: 1.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2cb56be6b7389adb51473cfa42317d83c13d63750b59383b3f186f374bb0298d |
| MD5 | d69d5369b56f0114d9215c7199a64ed0 |
| BLAKE2b-256 | 454bda95e30157754025fb02ca6606306c6cd4e020b176fdf0df4267ba5f335c |
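The digests above can be compared against a downloaded copy of the archive. The following is a minimal sketch using Python's `hashlib`. The helper names `file_digest` and `matches` are illustrative, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches(path, expected_hex, algorithm="sha256"):
    """Return True if the file's digest equals the published hex digest."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# Usage, assuming the archive was saved to the current directory:
# matches(
#     "llmaix-0.0.8.tar.gz",
#     "2cb56be6b7389adb51473cfa42317d83c13d63750b59383b3f186f374bb0298d",
# )
```

The same functions work for the MD5 row by passing `algorithm="md5"`. SHA256 is the digest pip itself uses in hash-checking mode.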
Provenance
The following attestation bundles were made for llmaix-0.0.8.tar.gz:
Publisher: python-publish.yml on KatherLab/llmaixlib
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: llmaix-0.0.8.tar.gz
  - Subject digest: 2cb56be6b7389adb51473cfa42317d83c13d63750b59383b3f186f374bb0298d
  - Sigstore transparency entry: 239538508
  - Sigstore integration time:
- Permalink: KatherLab/llmaixlib@62e9fb0deb575396135808344cd2fee4638b0f20
- Branch / Tag: refs/tags/v0.0.8
- Owner: https://github.com/KatherLab
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@62e9fb0deb575396135808344cd2fee4638b0f20
- Trigger Event: release
File details
Details for the file llmaix-0.0.8-py3-none-any.whl.
File metadata
- Download URL: llmaix-0.0.8-py3-none-any.whl
- Upload date:
- Size: 16.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e741a4d7e8b665ef59f501043c4df804efb6941de6092da928c01a78c44813f3 |
| MD5 | e4afb235030f5ea885585331d2ad56ee |
| BLAKE2b-256 | ddb3ebc323a6a3a790019c286b09c4ea25b3fe00350254ea138fcac294b0057c |
Provenance
The following attestation bundles were made for llmaix-0.0.8-py3-none-any.whl:
Publisher: python-publish.yml on KatherLab/llmaixlib
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: llmaix-0.0.8-py3-none-any.whl
  - Subject digest: e741a4d7e8b665ef59f501043c4df804efb6941de6092da928c01a78c44813f3
  - Sigstore transparency entry: 239538513
  - Sigstore integration time:
- Permalink: KatherLab/llmaixlib@62e9fb0deb575396135808344cd2fee4638b0f20
- Branch / Tag: refs/tags/v0.0.8
- Owner: https://github.com/KatherLab
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@62e9fb0deb575396135808344cd2fee4638b0f20
- Trigger Event: release