Skip to main content

A framework that enables efficient extraction of structured data from unstructured text using large language models (LLMs).

Project description

LLM Extractinator

Overview of the LLM Data Extractor

[!Important] This tool is a prototype which is in active development and is still undergoing major changes. Please always check the results!


Overview

This project enables the efficient extraction of structured data from unstructured text using large language models (LLMs). It provides a flexible configuration system and supports a variety of tasks.

Tool Workflow

Overview of the LLM Data Extractor


Installing Ollama

For the package to work, Ollama needs to be installed on your machine. For Linux, use the following command:

curl -fsSL https://ollama.com/install.sh | sh

For Windows or macOS, install via this link

Installing the Package

Option 1: Install from PyPI

The package is installable via PyPI using:

pip install llm_extractinator

Option 2: Install using local clone

For contributing to or developing the package, clone this repository and install it using:

pip install -e .

Data Structure

The data structure for the input data should be as follows:

  • The data should be in a CSV or a JSON file.
  • The text data should be in a column specified by the Input_Field in the task configuration. The default is text.
  • The name of the data file should be specified in the Data_Path field in the task configuration. The default location is the data folder, but this can be changed using the --data_dir flag when running the model.

When running the model with examples (num_examples > 0), the examples should be provided in a separate CSV or JSON file. The path to this file should be specified in the Example_Path field in the task configuration. The default location is the data folder, but this can be changed using the --example_dir flag when running the model.


Setting Up Task Configuration

Create a JSON file in the tasks folder for each task, following the naming convention:

TaskXXX_taskname.json

Where XXX is a 3-digit number, and taskname is a brief descriptor of the task.

The JSON file should always include the following fields:

  • Task: The name of the task.
  • Type: The type of task.
  • Description: A detailed description of the task.
  • Data_Path: The filename of the data file in the data folder.
  • Input_Field: The column name containing the text data.
  • Parser_Format: The JSON format you want the output to be in. See Task999_example.json for an example.

The following field is only mandatory if you want to have the model use examples in its prompt:

  • Example_Path: The path to data used for creating examples (only required if num_examples > 0 when running the model).

[!Important] If you don't want to use examples, omit the Example_Path field from the task configuration completely. Do not set it to an empty string!


Input Flags for extractinate

The following input flags can be used to configure the behavior of the extractinate script:

Flag Type Default Value Description
--task_id int Required Task ID to generate examples for.
--run_name str "run" Name of the run for logging purposes.
--n_runs int 5 Number of runs to perform.
--num_examples int 0 Number of examples to generate for each task.
--num_predict int 512 Maximum number of tokens to predict.
--chunk_size int None Number of examples to generate in a single chunk. When None, use dataset size as chunksize.
--overwrite bool False Overwrite existing files instead of skipping them.
--translate bool False Translate the generated examples to English.
--verbose bool False Enable verbose logging.
--reasoning_model bool False Whether or not the model is a reasoning model.
--model_name str "mistral-nemo" Name of the model to use for prediction tasks.
--temperature float 0.0 Temperature for text generation.
--max_context_len str max Maximum context length; 'split' splits data into short and long cases and does a run for them seperately (good if your dataset distribution has a tail with long reports and a bulk of short ones), 'max' uses the maximum token length of the dataset, or a number sets a fixed length.
--top_k int None Limits the sampling to the top K tokens.
--top_p float None Nucleus sampling probability threshold.
--seed int None Random seed for reproducibility.
--output_dir Path <project_root>/output Path to the directory for output files.
--task_dir Path <project_root>/tasks Path to the directory containing task configuration files.
--log_dir Path <project_root>/output Path to the directory for log files.
--data_dir Path <project_root>/data Path to the directory containing input data.
--example_dir Path <project_root>/examples Path to the directory containing example data.
--host str "localhost" Server hostname.
--port int 28900 Server port.

Running the Extractor

To run the data extraction process, use either the command line or import the function in Python.

Using the Command Line

extractinate --task_id 001 --model_name "mistral-nemo" 

Using the Function in Python

from llm_extractinator import extractinate

extractinate(
    task_id=1,
    model_name="mistral-nemo",
)

Enhancements and Contributions

Feel free to contribute by improving configurations, adding more task types, or extending model compatibility. Open a pull request or file an issue for discussions!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_extractinator-0.3.2.tar.gz (27.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

llm_extractinator-0.3.2-py3-none-any.whl (27.9 kB view details)

Uploaded Python 3

File details

Details for the file llm_extractinator-0.3.2.tar.gz.

File metadata

  • Download URL: llm_extractinator-0.3.2.tar.gz
  • Upload date:
  • Size: 27.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for llm_extractinator-0.3.2.tar.gz
Algorithm Hash digest
SHA256 54c9e41d277e9630691d1418fb2b2b434b9077c526569bb840d751d6d874fc5f
MD5 9f144eb6a73d1d0d42aa7a021b9395ac
BLAKE2b-256 382f8f64fe70a579655095ae5d2ec4679e07815db15f889c38e7b8c19d6cab04

See more details on using hashes here.

File details

Details for the file llm_extractinator-0.3.2-py3-none-any.whl.

File metadata

File hashes

Hashes for llm_extractinator-0.3.2-py3-none-any.whl
Algorithm Hash digest
SHA256 c7386c327c97cdaa52eb4ab8dc2a66a3fd8f66752cd2b5361c0d62f64799ef60
MD5 992ce93c7af019b403974059f1ddf8d5
BLAKE2b-256 43c0d3be0dde3e1e7359d7da77b9715a91fce276496078b9bb43004e24c8d2d6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page