A framework that enables efficient extraction of structured data from unstructured text using large language models (LLMs).

LLM Extractinator

⚠️ This tool is a prototype in active development and may change significantly. Always verify results!

LLM Extractinator enables efficient extraction of structured data from unstructured text using large language models (LLMs). It supports configurable task definitions, CLI or Python usage, and flexible data input/output formats.

📘 Full documentation: https://DIAGNijmegen.github.io/llm_extractinator/


🔧 Installation

1. Install Ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Windows or macOS:

Download the installer from:
https://ollama.com/download


2. Install the Package

You have two options:

🔹 Option A – Install from PyPI:

pip install llm_extractinator

🔹 Option B – Install from a Local Clone:

git clone https://github.com/DIAGNijmegen/llm_extractinator.git
cd llm_extractinator
pip install -e .

🚀 Quick Usage

CLI

extractinate --task_id 001 --model_name "phi4"

Python

from llm_extractinator import extractinate

extractinate(task_id=1, model_name="phi4")

📁 Task Files

Each task is defined using a JSON file stored in the tasks/ directory.

Filename format:

TaskXXX_name.json
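
As a rough illustration of this naming convention, the sketch below looks up a task file by numeric ID. It assumes that XXX is the task ID zero-padded to three digits (consistent with `--task_id 001` in the CLI example); `find_task_file` is a hypothetical helper written for this README, not part of llm_extractinator.

```python
from pathlib import Path


def find_task_file(tasks_dir, task_id):
    """Return the single TaskXXX_name.json file that matches task_id.

    Assumption: XXX is the task ID zero-padded to three digits.
    """
    pattern = f"Task{int(task_id):03d}_*.json"
    matches = sorted(Path(tasks_dir).glob(pattern))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one file matching {pattern} in {tasks_dir}, "
            f"found {len(matches)}"
        )
    return matches[0]
```

Under that assumption, `find_task_file("tasks", 1)` and `find_task_file("tasks", "001")` would resolve to the same file.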

Example contents:

{
  "Description": "Extract product data from text.",
  "Data_Path": "products.csv",
  "Input_Field": "text",
  "Parser_Format": "product_parser.py"
}
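
The task file above could also be generated from Python. The following is a minimal sketch using only the standard library: it writes a one-row sample CSV and a matching task JSON, then checks that the Input_Field value names a column in the CSV. The file name `Task001_products.json`, the sample row, the directory layout, and both helper functions are assumptions made for illustration; where the tool expects the data file to live is not specified here.

```python
import csv
import json
from pathlib import Path


def write_example_task(root):
    """Write a sample products.csv and a matching task JSON under root."""
    root = Path(root)
    tasks_dir = root / "tasks"
    tasks_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical input data: one free-text record per row in a "text" column.
    csv_path = root / "products.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text"])
        writer.writeheader()
        writer.writerow({"text": "The Acme kettle costs 25 euros."})

    # Same four keys as the example task file.
    task = {
        "Description": "Extract product data from text.",
        "Data_Path": "products.csv",
        "Input_Field": "text",
        "Parser_Format": "product_parser.py",
    }
    task_path = tasks_dir / "Task001_products.json"  # hypothetical file name
    task_path.write_text(json.dumps(task, indent=2), encoding="utf-8")
    return task_path, csv_path


def input_field_in_csv(task_path, csv_path):
    """Return True if the task's Input_Field is a column in the CSV header."""
    task = json.loads(Path(task_path).read_text(encoding="utf-8"))
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return task["Input_Field"] in header
```

A check like `input_field_in_csv` is cheap to run before starting a long extraction job, since a misspelled column name would otherwise only surface at run time.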

Parser_Format refers to a .py file in tasks/parsers/ that defines a Pydantic OutputParser class used to structure the LLM output.
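
For orientation, a parser file such as `tasks/parsers/product_parser.py` might look like the sketch below. The class name `OutputParser` follows the description above; the fields `product_name`, `price`, and `currency` are hypothetical and chosen only to match the "product data" example. Consult the parser docs for the conventions the tool actually expects.

```python
from typing import Optional

from pydantic import BaseModel, Field


class OutputParser(BaseModel):
    """Hypothetical output schema for the product extraction example."""

    # Required: validation fails if the LLM output omits it.
    product_name: str = Field(description="Name of the product mentioned in the text")
    # Optional: defaults to None when the text does not state a value.
    price: Optional[float] = Field(default=None, description="Price as a number, if stated")
    currency: Optional[str] = Field(default=None, description="Currency of the price, if stated")
```

Marking fields as optional lets records with missing information still validate, while required fields make incomplete outputs fail loudly.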


🛠️ Visual Schema Builder (Optional)

You can visually design the output schema using:

build-parser

This launches a web UI to create a Pydantic OutputParser model, which defines the structure of the extracted data. Additional models can be added and nested for complex formats.

Save the resulting .py file in:

tasks/parsers/

Then reference it in your task JSON under the Parser_Format key.
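
To give a sense of what a nested schema looks like in code, here is a small hand-written sketch of a Pydantic model with nested sub-models. The model names `Price` and `Product` and all field names are hypothetical; the file the builder generates may be structured differently.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Price(BaseModel):
    """Hypothetical nested model for a monetary amount."""

    amount: float
    currency: str


class Product(BaseModel):
    """Hypothetical nested model for one product mention."""

    name: str
    price: Optional[Price] = None  # absent when the text states no price


class OutputParser(BaseModel):
    """Top-level schema: zero or more products found in one text."""

    products: List[Product] = Field(default_factory=list)
```

Pydantic validates nested dictionaries recursively, so `OutputParser(products=[{"name": "kettle", "price": {"amount": 25, "currency": "EUR"}}])` builds the full object tree from plain JSON-like data.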

👉 See parser docs for full usage.


📄 Citation

If you use this tool, please cite it via its DOI: https://doi.org/10.5281/zenodo.15089764


🤝 Contributing

We welcome contributions! See the full contributing guide in the docs.
