A framework that enables efficient extraction of structured data from unstructured text using large language models (LLMs).

These details have not been verified by PyPI

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

LLM Extractinator

Important: This tool is a prototype in active development and is still undergoing major changes. Always check the results!

Overview

This project enables the efficient extraction of structured data from unstructured text using large language models (LLMs). It provides a flexible configuration system and supports a variety of tasks.

Tool Workflow

[Figure: Overview of the LLM Data Extractor]

Setup Environment

To set up the environment, run the following commands:

conda create --name=llm_extractinator python=3.12
conda activate llm_extractinator

pip install -e .

Setting Up Task Configuration

Create a JSON file in the tasks folder for each task, following the naming convention:

TaskXXX_taskname.json

Where XXX is a 3-digit number, and taskname is a brief descriptor of the task.
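
The naming convention can be checked programmatically. The following is a minimal sketch, not part of the package: the helper names `parse_task_filename` and `build_task_filename` are hypothetical, and the pattern assumes the descriptor consists of letters, digits, and underscores only.

```python
import re

# "Task", three digits, an underscore, a short descriptor, then ".json".
# Assumption: the descriptor contains only letters, digits, and underscores.
TASK_FILE_PATTERN = re.compile(r"Task(\d{3})_(\w+)\.json")


def parse_task_filename(filename: str) -> tuple[int, str]:
    """Return (task_id, taskname), or raise ValueError if the name is malformed."""
    match = TASK_FILE_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow TaskXXX_taskname.json")
    return int(match.group(1)), match.group(2)


def build_task_filename(task_id: int, taskname: str) -> str:
    """Build a file name with a zero-padded 3-digit task ID."""
    if not 0 <= task_id <= 999:
        raise ValueError("task_id must fit in three digits")
    return f"Task{task_id:03d}_{taskname}.json"


print(parse_task_filename("Task999_example.json"))  # (999, 'example')
print(build_task_filename(1, "summarization"))      # Task001_summarization.json
```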

The JSON file should always include the following fields:

Task: The name of the task.
Type: The type of task.
Description: A detailed description of the task.
Data_Path: The filename of the data file in the data folder.
Input_Field: The column name containing the text data.
Parser_Format: The JSON format you want the output to be in. See Task999_example.json for an example.

The following fields are required only if you want the model to generate examples automatically:

Example_Path: The path to data used for creating examples (only required if num_examples > 0 when running the model).
Label_Field: The column name containing the ground truth labels (only required if num_examples > 0).
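
Before starting a run, it can help to check a task configuration against the field lists above. This is a sketch under stated assumptions: `missing_fields` is a hypothetical helper, not a function provided by the package, and it only checks that keys are present, not that their values are valid.

```python
# Field names are taken from the lists above.
REQUIRED_FIELDS = (
    "Task", "Type", "Description", "Data_Path", "Input_Field", "Parser_Format",
)
# Needed only when examples are generated (num_examples > 0).
EXAMPLE_FIELDS = ("Example_Path", "Label_Field")


def missing_fields(config: dict, num_examples: int = 0) -> list[str]:
    """Return the names of expected fields that are absent from the config."""
    needed = list(REQUIRED_FIELDS)
    if num_examples > 0:
        needed.extend(EXAMPLE_FIELDS)
    return [name for name in needed if name not in config]


# Illustrative config without the example-related fields.
config = {
    "Task": "Text Summarization",
    "Type": "Summarization",
    "Description": "Generate summaries for long-form text documents.",
    "Data_Path": "data/documents.csv",
    "Input_Field": "text",
    "Parser_Format": {"summary": "string"},
}

print(missing_fields(config))                  # []
print(missing_fields(config, num_examples=3))  # ['Example_Path', 'Label_Field']
```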

Input Flags for `extractinate`

The following input flags can be used to configure the behavior of the extractinate script:

Flag	Type	Default Value	Description
`--task_id`	`int`	Required	Task ID to generate examples for.
`--model_name`	`str`	`"mistral-nemo"`	Name of the model to use for prediction tasks. See https://ollama.com/search for the options.
`--num_examples`	`int`	`0`	Number of examples to generate for each task.
`--n_runs`	`int`	`5`	Number of runs to perform.
`--temperature`	`float`	`0.3`	Temperature for text generation.
`--max_context_len`	`int`	`8192`	Maximum context length for input text.
`--num_predict`	`int`	`1024`	Maximum number of tokens to predict.
`--run_name`	`Path`	`"run"`	Name of the run for logging purposes.
`--output_dir`	`Path`	`<project_root>/output`	Path to the directory for output files.
`--task_dir`	`Path`	`<project_root>/tasks`	Path to the directory containing task configuration files.
`--log_dir`	`Path`	`<project_root>/output`	Path to the directory for log files.
`--data_dir`	`Path`	`<project_root>/data`	Path to the directory containing input data.
`--chunk_size`	`int`	`None`	Number of examples to generate in a single chunk. When `None`, the dataset size is used as the chunk size.
`--translate`	`bool`	`False`	Translate the generated examples to English.
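
To make the types and defaults in the table concrete, the sketch below declares the same flags with Python's standard `argparse` module. It is an illustration only and not the project's actual parser: `PROJECT_ROOT` is assumed to be the current working directory, and `--translate` is assumed to be an on/off switch.

```python
import argparse
from pathlib import Path

# Assumption: <project_root> is taken to be the current working directory.
PROJECT_ROOT = Path.cwd()


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags from the table with their documented types and defaults."""
    parser = argparse.ArgumentParser(prog="extractinate")
    parser.add_argument("--task_id", type=int, required=True)
    parser.add_argument("--model_name", type=str, default="mistral-nemo")
    parser.add_argument("--num_examples", type=int, default=0)
    parser.add_argument("--n_runs", type=int, default=5)
    parser.add_argument("--temperature", type=float, default=0.3)
    parser.add_argument("--max_context_len", type=int, default=8192)
    parser.add_argument("--num_predict", type=int, default=1024)
    parser.add_argument("--run_name", type=Path, default=Path("run"))
    parser.add_argument("--output_dir", type=Path, default=PROJECT_ROOT / "output")
    parser.add_argument("--task_dir", type=Path, default=PROJECT_ROOT / "tasks")
    parser.add_argument("--log_dir", type=Path, default=PROJECT_ROOT / "output")
    parser.add_argument("--data_dir", type=Path, default=PROJECT_ROOT / "data")
    parser.add_argument("--chunk_size", type=int, default=None)
    # Assumption: the boolean flag is a switch that is False unless given.
    parser.add_argument("--translate", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--task_id", "001", "--translate"])
print(args.task_id, args.model_name, args.translate)  # 1 mistral-nemo True
```

Because `--task_id` is parsed as an integer, `001` and `1` refer to the same task in this sketch.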

Example `Task.json`

Below is an example configuration file for a task:

{
    "Task": "Text Summarization",
    "Type": "Summarization",
    "Description": "Generate summaries for long-form text documents.",
    "Data_Path": "data/documents.csv",
    "Example_Path": "data/summaries_examples.csv",
    "Input_Field": "text",
    "Label_Field": "summary",
    "Parser_Format": {
        "summary": "string"
    }
}
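
A configuration like the one above can also be written from Python. The sketch below is a minimal example using only the standard library; the file name `Task001_summarization.json` is a hypothetical choice that follows the naming convention, and a temporary directory stands in for the real tasks folder.

```python
import json
import tempfile
from pathlib import Path

# Same content as the example configuration above.
config = {
    "Task": "Text Summarization",
    "Type": "Summarization",
    "Description": "Generate summaries for long-form text documents.",
    "Data_Path": "data/documents.csv",
    "Example_Path": "data/summaries_examples.csv",
    "Input_Field": "text",
    "Label_Field": "summary",
    "Parser_Format": {"summary": "string"},
}

# A temporary directory stands in for the project's tasks folder.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "tasks"
    task_dir.mkdir()
    task_file = task_dir / "Task001_summarization.json"  # hypothetical name
    task_file.write_text(json.dumps(config, indent=4), encoding="utf-8")
    # Read the file back to confirm it is valid JSON.
    loaded = json.loads(task_file.read_text(encoding="utf-8"))

print(loaded["Parser_Format"])  # {'summary': 'string'}
```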

Running the Extractor

To run the data extraction process, use the following command:

extractinate --task_id 001 --model_name "mistral-nemo" --num_examples 0 --max_context_len 8192 --num_predict 8192 --translate

Customize the flags based on your task requirements.
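
When runs are launched from a script, the command line can be assembled from keyword arguments. This is a sketch, not part of the package: `build_command` is a hypothetical helper, and it assumes boolean flags such as `--translate` are passed without a value.

```python
import shlex


def build_command(task_id: int, **flags) -> list[str]:
    """Assemble an extractinate command line from keyword arguments."""
    cmd = ["extractinate", "--task_id", f"{task_id:03d}"]
    for name, value in flags.items():
        if value is True:
            # Assumption: boolean flags are switches without a value.
            cmd.append(f"--{name}")
        elif value is False or value is None:
            continue  # leave the flag out so the default applies
        else:
            cmd.extend([f"--{name}", str(value)])
    return cmd


cmd = build_command(
    1,
    model_name="mistral-nemo",
    num_examples=0,
    max_context_len=8192,
    num_predict=8192,
    translate=True,
)
print(shlex.join(cmd))
# To execute the command, pass the list to subprocess.run(cmd, check=True).
```

Keeping the command as a list avoids shell quoting problems; `shlex.join` is used here only to display it.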

Output

The output is saved to the directory given by --output_dir (default: <project_root>/output). Make sure the directory structure and the paths in the Task.json file match your project's layout.

For further details, check the logs in the directory specified by --log_dir.

Enhancements and Contributions

Feel free to enhance this project by improving configurations, adding more task types, or extending model compatibility. Open a pull request or file an issue for discussions!

Release history

- 0.5.13: Apr 3, 2026
- 0.5.12: Apr 3, 2026
- 0.5.11: Apr 2, 2026
- 0.5.10: Feb 24, 2026
- 0.5.9: Dec 9, 2025
- 0.5.8: Nov 14, 2025
- 0.5.7: Oct 22, 2025
- 0.5.6: Oct 22, 2025
- 0.5.5: Sep 19, 2025
- 0.5.4: Jul 29, 2025
- 0.5.3: Jul 21, 2025
- 0.5.2: Jul 21, 2025
- 0.5.1: Jun 13, 2025
- 0.5.0: May 21, 2025
- 0.4.2: Mar 17, 2025
- 0.4.1: Feb 20, 2025
- 0.4.0: Feb 19, 2025
- 0.3.7: Feb 11, 2025
- 0.3.6: Feb 7, 2025
- 0.3.5: Feb 6, 2025
- 0.3.4: Feb 6, 2025
- 0.3.3: Feb 6, 2025
- 0.3.2: Feb 6, 2025
- 0.3.1: Feb 6, 2025
- 0.3.0: Feb 5, 2025
- 0.2.4: Jan 17, 2025
- 0.2.3: Jan 16, 2025
- 0.2.2: Jan 13, 2025
- 0.2.1: Jan 2, 2025
- 0.2.0: Jan 2, 2025
- 0.1.5: Dec 19, 2024
- 0.1.4: Dec 19, 2024
- 0.1.3: Dec 19, 2024
- 0.1.2: Dec 19, 2024
- 0.1.1: Dec 19, 2024 (this version)
- 0.1.0: Dec 13, 2024

Download files

Source Distribution

llm_extractinator-0.1.1.tar.gz (19.9 kB)

Uploaded Dec 19, 2024 Source

Built Distribution

llm_extractinator-0.1.1-py3-none-any.whl (20.5 kB)

Uploaded Dec 19, 2024 Python 3

File details

Details for the file llm_extractinator-0.1.1.tar.gz.

File metadata

Download URL: llm_extractinator-0.1.1.tar.gz
Upload date: Dec 19, 2024
Size: 19.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for llm_extractinator-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`6506eff13c014137288bbb42f731ae66eb7dd7080b6ba7380280f7dff7464c09`
MD5	`5d2bc5748438d71ae13e0bf01ae5d375`
BLAKE2b-256	`6accaafae2968a50960822336a653bb6c9b559e27aabac13cf2b7dbc7af3b9d8`

File details

Details for the file llm_extractinator-0.1.1-py3-none-any.whl.

File metadata

Download URL: llm_extractinator-0.1.1-py3-none-any.whl
Upload date: Dec 19, 2024
Size: 20.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for llm_extractinator-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a44565d5e435533e0090514dc9f5fb7ccfcb5207cb40e58ff996bbb0a5cdc888`
MD5	`452f4f55aa52e1c27f7d2ca87dc5b727`
BLAKE2b-256	`85bf18c3ec280a603447d3f1d887861edaef871606cbe7d09f85875824945f52`
