LLM Extractinator

[Figure: Overview of the LLM Data Extractor]

⚠️ This tool is a prototype in active development and may change significantly. Always verify results!

LLM Extractinator enables efficient extraction of structured data from unstructured text using large language models (LLMs). It supports configurable task definitions, CLI or Python usage, a point‑and‑click GUI Studio, and flexible data input/output formats.

📘 Full documentation: https://DIAGNijmegen.github.io/llm_extractinator/


🔧 Installation

1. Install Ollama

On Linux

curl -fsSL https://ollama.com/install.sh | sh

On Windows or macOS

Download the installer from: https://ollama.com/download


2. Install the Package

Create a fresh conda environment:

conda create -n llm_extractinator python=3.11
conda activate llm_extractinator

Install the package via pip:

pip install llm_extractinator

Or from source:

git clone https://github.com/DIAGNijmegen/llm_extractinator.git
cd llm_extractinator
pip install -e .

Tip: to run the latest models, keep the Ollama Python client packages up to date:

pip install --upgrade ollama langchain-ollama

🖥️ Interactive Studio GUI

Starting with v0.5, Extractinator ships with a Streamlit‑based Studio for designing, running and monitoring extraction tasks with zero code:

[Figure: Studio screenshot]

🚀 To run:

launch-extractinator  # opens http://localhost:8501 in your browser

Features

🗂️ Project Manager: create or select datasets, parsers and tasks, with file previews
🔧 Parser Builder: visual Pydantic schema designer (nested models supported)
🚀 One‑click Runs: configure model, sampling and advanced flags, then watch live logs
🛠️ Task JSON Wizard: step‑by‑step helper to generate valid TaskXXX.json files
🆘 Help bubbles everywhere: inline docs so you never lose context

The Studio is fully optional: anything you configure here can still be executed from the CLI or Python API.


🚀 Quick Usage

GUI

launch-extractinator  # recommended for new users

CLI

extractinate --task_id 001 --model_name "phi4"

Python

from llm_extractinator import extractinate

extractinate(task_id=1, model_name="phi4")

📁 Task Files

Each task is defined by a JSON file stored in tasks/.

Filename format:

TaskXXX_name.json

Example:

{
  "Description": "Extract product data from text.",
  "Data_Path": "products.csv",
  "Input_Field": "text",
  "Parser_Format": "product_parser.py"
}

Parser_Format points to a .py file in tasks/parsers/ that implements a Pydantic OutputParser model used to structure the LLM output.
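
As a rough illustration of the layout described above, the following Python sketch builds and writes a task file using only the standard library. It assumes that XXX is a zero-padded three-digit task ID (as in `--task_id 001` under Quick Usage) and treats the four keys from the example as required. `task_filename`, `parse_task_filename` and `write_task` are hypothetical helpers, not part of the package.

```python
import json
import re
import tempfile
from pathlib import Path

# Assumption: "XXX" is a zero-padded three-digit task ID, as in `--task_id 001`.
TASK_FILE_PATTERN = re.compile(r"^Task(\d{3})_(\w+)\.json$")

# The four keys used in the example task; treating all of them as required
# is an assumption made for this sketch.
EXAMPLE_KEYS = ("Description", "Data_Path", "Input_Field", "Parser_Format")


def task_filename(task_id: int, name: str) -> str:
    """Hypothetical helper: build a name such as Task001_products.json."""
    return f"Task{task_id:03d}_{name}.json"


def parse_task_filename(filename: str) -> tuple:
    """Hypothetical helper: split a task filename into (task_id, name)."""
    match = TASK_FILE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a TaskXXX_name.json filename: {filename!r}")
    return int(match.group(1)), match.group(2)


def write_task(tasks_dir: Path, task_id: int, name: str, config: dict) -> Path:
    """Hypothetical helper: check the keys, then write the task file as JSON."""
    missing = [key for key in EXAMPLE_KEYS if key not in config]
    if missing:
        raise ValueError(f"task config is missing keys: {missing}")
    tasks_dir.mkdir(parents=True, exist_ok=True)
    path = tasks_dir / task_filename(task_id, name)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    config = {
        "Description": "Extract product data from text.",
        "Data_Path": "products.csv",
        "Input_Field": "text",
        "Parser_Format": "product_parser.py",
    }
    # A temporary directory stands in for the real project folder.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_task(Path(tmp) / "tasks", 1, "products", config)
        print(path.name, parse_task_filename(path.name))
```

Failing early on a missing key keeps mistakes in the task file from surfacing only once a model run has started.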


🛠️ Visual Schema Builder (optional)

If you prefer a graphical approach to designing parsers, run:

build-parser

This starts the same builder embedded in the Studio, letting you assemble nested Pydantic models visually. Save the resulting .py file in tasks/parsers/ and reference it via Parser_Format.

👉 Read the parser docs for full details.
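
Because each task refers to its parser only by filename, a quick consistency check can catch a parser that was never saved into tasks/parsers/. The sketch below is a hypothetical helper (`missing_parsers` is not part of the package) that scans the task files and lists any Parser_Format entry without a matching file. It assumes task files match `Task*.json` and skips tasks that have no Parser_Format key.

```python
import json
import tempfile
from pathlib import Path


def missing_parsers(tasks_dir: Path) -> list:
    """Return (task file, parser file) name pairs whose parser file is absent."""
    problems = []
    for task_path in sorted(tasks_dir.glob("Task*.json")):
        config = json.loads(task_path.read_text(encoding="utf-8"))
        parser_name = config.get("Parser_Format")
        # Assumption: tasks without a Parser_Format key are skipped, not flagged.
        if parser_name and not (tasks_dir / "parsers" / parser_name).is_file():
            problems.append((task_path.name, parser_name))
    return problems


if __name__ == "__main__":
    # A temporary directory stands in for the real project folder.
    with tempfile.TemporaryDirectory() as tmp:
        tasks_dir = Path(tmp) / "tasks"
        (tasks_dir / "parsers").mkdir(parents=True)
        (tasks_dir / "Task001_products.json").write_text(
            json.dumps({"Parser_Format": "product_parser.py"}), encoding="utf-8"
        )
        # The parser has not been saved yet, so the task is reported.
        print(missing_parsers(tasks_dir))
        # After saving the parser file, the report is empty.
        (tasks_dir / "parsers" / "product_parser.py").write_text(
            "# saved from the builder\n", encoding="utf-8"
        )
        print(missing_parsers(tasks_dir))
```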


📄 Citation

If you use this tool, please cite: https://doi.org/10.5281/zenodo.15089764


🤝 Contributing

We welcome pull requests! See the contributing guide for details.
