
A framework that enables efficient extraction of structured data from unstructured text using large language models (LLMs).


LLM Extractinator


⚠️ This tool is a prototype in active development and may change significantly. Always verify results!

LLM Extractinator enables efficient extraction of structured data from unstructured text using large language models (LLMs). It supports configurable task definitions, CLI or Python usage, and flexible data input/output formats.

📘 Full documentation: https://DIAGNijmegen.github.io/llm_extractinator/


🔧 Installation

1. Install Ollama

On Linux

curl -fsSL https://ollama.com/install.sh | sh

On Windows or macOS

Download the installer from:
https://ollama.com/download


2. Install the Package

Create a fresh conda environment:

conda create -n llm_extractinator python=3.11
conda activate llm_extractinator

Install the package via pip:

pip install llm_extractinator

Or from source:

git clone https://github.com/DIAGNijmegen/llm_extractinator.git
cd llm_extractinator
pip install -e .

To run the latest available models, upgrade the ollama and langchain-ollama Python packages to their latest versions:

pip install --upgrade ollama langchain-ollama
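Before running a task, it can help to confirm that the pieces installed above are visible from the active environment. The following is a minimal sketch using only the standard library. The function name `check_environment` is hypothetical and not part of the package; the sketch assumes the Ollama executable is named `ollama` and that the Python packages import as `llm_extractinator`, `ollama`, and `langchain_ollama`.

```python
import importlib.util
import shutil


def check_environment() -> dict[str, bool]:
    """Report which installation prerequisites are visible (hypothetical helper)."""
    status = {
        # Assumption: the Ollama installer puts an `ollama` executable on PATH.
        "ollama executable": shutil.which("ollama") is not None,
    }
    # Assumed import names for the pip packages installed above.
    for module in ("llm_extractinator", "ollama", "langchain_ollama"):
        found = importlib.util.find_spec(module) is not None
        status[f"python package {module}"] = found
    return status


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {item}")
```

Any line marked MISSING points back to the corresponding installation step.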

🚀 Quick Usage

CLI

extractinate --task_id 001 --model_name "phi4"

Python

from llm_extractinator import extractinate

extractinate(task_id=1, model_name="phi4")
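To run the same task with several models, the CLI call can be scripted. The sketch below only builds the argument lists, using the two flags shown above. `build_command` is a hypothetical helper, and zero-padding the task ID to three digits is an assumption based on the `001` in the CLI example.

```python
def build_command(task_id: int, model_name: str) -> list[str]:
    """Build an `extractinate` CLI invocation (hypothetical helper)."""
    return [
        "extractinate",
        "--task_id", f"{task_id:03d}",  # assumption: three-digit, zero-padded ID
        "--model_name", model_name,
    ]


if __name__ == "__main__":
    # "phi4" comes from the example above; other model names would be added here.
    for model in ["phi4"]:
        print(" ".join(build_command(1, model)))
```

Each list can be passed to `subprocess.run(command, check=True)` once Ollama and the package are installed.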

📁 Task Files

Each task is defined using a JSON file stored in the tasks/ directory.

Filename format (XXX is the task ID, e.g. 001):

TaskXXX_name.json

Example contents:

{
  "Description": "Extract product data from text.",
  "Data_Path": "products.csv",
  "Input_Field": "text",
  "Parser_Format": "product_parser.py"
}

Parser_Format refers to a .py file in tasks/parsers/ that defines a Pydantic OutputParser class used to structure the LLM output.
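A task file like the one above can also be generated from Python. This is a minimal sketch, not part of the package: `write_task_file` is a hypothetical helper, it treats the four keys from the example as required (an assumption), and it assumes the XXX in the filename is the task ID zero-padded to three digits.

```python
import json
from pathlib import Path

# Assumption: the four keys shown in the example are all required.
REQUIRED_KEYS = {"Description", "Data_Path", "Input_Field", "Parser_Format"}


def write_task_file(tasks_dir: Path, task_id: int, name: str, config: dict) -> Path:
    """Write a task definition as TaskXXX_name.json (hypothetical helper)."""
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing task keys: {sorted(missing)}")
    tasks_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: XXX is the task ID zero-padded to three digits.
    path = tasks_dir / f"Task{task_id:03d}_{name}.json"
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    task = {
        "Description": "Extract product data from text.",
        "Data_Path": "products.csv",
        "Input_Field": "text",
        "Parser_Format": "product_parser.py",
    }
    print(write_task_file(Path("tasks"), 1, "products", task))
```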


🛠️ Visual Schema Builder (Optional)

You can visually design the output schema using:

build-parser

This launches a web UI to create a Pydantic OutputParser model, which defines the structure of the extracted data. Additional models can be added and nested for complex formats.

The resulting .py file should be saved in:

tasks/parsers/

And referenced in your task JSON under the Parser_Format key.
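For orientation, a hand-written parser file might look like the sketch below. It assumes the class is named `OutputParser` and subclasses Pydantic's `BaseModel`; the field names and the nested `Price` model are hypothetical, chosen to match the product example above.

```python
from typing import List, Optional

from pydantic import BaseModel


class Price(BaseModel):
    """Hypothetical nested model for a price mentioned in the text."""

    amount: float
    currency: str


class OutputParser(BaseModel):
    """Hypothetical output schema for the product extraction task."""

    product_name: str
    price: Optional[Price] = None  # nested model; None when no price is stated
    features: List[str] = []


if __name__ == "__main__":
    # Validate a dictionary shaped like a structured LLM response.
    parsed = OutputParser(
        product_name="Desk lamp",
        price={"amount": 19.99, "currency": "EUR"},
        features=["dimmable"],
    )
    print(parsed.price.currency)
```

Saved as tasks/parsers/product_parser.py, a file like this would match the Parser_Format value in the example task.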

👉 See parser docs for full usage.


📄 Citation

If you use this tool, please cite it using the DOI 10.5281/zenodo.15089764


🤝 Contributing

We welcome contributions! See the full contributing guide in the docs.
