A framework that enables efficient extraction of structured data from unstructured text using large language models (LLMs).
LLM Extractinator
Important: This tool is a prototype in active development and is still undergoing major changes. Always check the results!
Overview
This project enables the efficient extraction of structured data from unstructured text using large language models (LLMs). It provides a flexible configuration system and supports a variety of tasks.
Tool Workflow
Setup Environment
To set up the environment, run the following commands:
conda create --name=llm_extractinator python=3.12
conda activate llm_extractinator
pip install -e .
Setting Up Task Configuration
Create a JSON file in the tasks folder for each task, following the naming convention:
TaskXXX_taskname.json
Where XXX is a 3-digit number, and taskname is a brief descriptor of the task.
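The naming convention can be checked mechanically. The following is a minimal sketch, assuming the task name consists only of letters, digits, and underscores (the allowed characters are not specified here). `parse_task_filename` is a hypothetical helper and not part of the package.

```python
import re

# Assumed pattern: "Task" + exactly three digits + "_" + a descriptor + ".json".
# The set of characters allowed in the descriptor is an assumption.
TASK_FILE_PATTERN = re.compile(r"Task(\d{3})_(\w+)\.json")


def parse_task_filename(filename: str) -> tuple[int, str]:
    """Return (task_id, taskname) for a valid task filename, else raise ValueError."""
    match = TASK_FILE_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"Not a valid task filename: {filename!r}")
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    print(parse_task_filename("Task999_example.json"))
```

Running such a check before launching a job catches misnamed files early, for example a two-digit ID or a missing underscore.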
The JSON file should always include the following fields:
- Task: The name of the task.
- Type: The type of task.
- Description: A detailed description of the task.
- Data_Path: The filename of the data file in the data folder.
- Input_Field: The column name containing the text data.
- Parser_Format: The JSON format you want the output to be in. See Task999_example.json for an example.
The following fields are only mandatory if you want to have the model automatically generate examples:
- Example_Path: The path to data used for creating examples (only required if num_examples > 0 when running the model).
- Label_Field: The column name containing the ground truth labels (only required if num_examples > 0).
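A task configuration can be checked against the field lists above before a run. This is a minimal sketch: the field names come from this README, while `missing_fields` is a hypothetical helper and does not reflect how the package itself validates configurations.

```python
import json

# Field names taken from the lists above.
REQUIRED_FIELDS = ("Task", "Type", "Description", "Data_Path", "Input_Field", "Parser_Format")
# Only needed when examples are generated (num_examples > 0).
EXAMPLE_FIELDS = ("Example_Path", "Label_Field")


def missing_fields(config: dict, num_examples: int = 0) -> list[str]:
    """List the expected fields that are absent from a task configuration."""
    needed = list(REQUIRED_FIELDS)
    if num_examples > 0:
        needed.extend(EXAMPLE_FIELDS)
    return [name for name in needed if name not in config]


if __name__ == "__main__":
    # A deliberately incomplete configuration, for illustration only.
    config = json.loads('{"Task": "Demo", "Type": "Extraction"}')
    print(missing_fields(config, num_examples=1))
```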
Input Flags for extractinate
The following input flags can be used to configure the behavior of the extractinate script:
| Flag | Type | Default Value | Description |
|---|---|---|---|
| --task_id | int | Required | Task ID to generate examples for. |
| --model_name | str | "mistral-nemo" | Name of the model to use for prediction tasks. See https://ollama.com/search for the options. |
| --num_examples | int | 0 | Number of examples to generate for each task. |
| --n_runs | int | 5 | Number of runs to perform. |
| --temperature | float | 0.3 | Temperature for text generation. |
| --max_context_len | int | 8192 | Maximum context length for input text. |
| --num_predict | int | 1024 | Maximum number of tokens to predict. |
| --run_name | Path | "run" | Name of the run for logging purposes. |
| --output_dir | Path | <project_root>/output | Path to the directory for output files. |
| --task_dir | Path | <project_root>/tasks | Path to the directory containing task configuration files. |
| --log_dir | Path | <project_root>/output | Path to the directory for log files. |
| --data_dir | Path | <project_root>/data | Path to the directory containing input data. |
| --chunk_size | int | None | Number of examples to generate in a single chunk. When None, the dataset size is used as the chunk size. |
| --translate | bool | False | Translate the generated examples to English. |
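To illustrate the --chunk_size semantics from the table, here is a minimal sketch of how a dataset could be split into chunks, with None meaning a single chunk as large as the dataset. `iter_chunks` is a hypothetical helper; the package's actual chunking logic may differ.

```python
from typing import Iterator, Optional, Sequence


def iter_chunks(rows: Sequence, chunk_size: Optional[int] = None) -> Iterator[Sequence]:
    """Yield consecutive slices of rows, each at most chunk_size long."""
    if chunk_size is None:
        # Mirror the documented default: one chunk as large as the dataset.
        chunk_size = max(len(rows), 1)
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


if __name__ == "__main__":
    reports = ["report A", "report B", "report C", "report D", "report E"]
    for chunk in iter_chunks(reports, chunk_size=2):
        print(chunk)
```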
Example Task.json
Below is an example configuration file for a task:
{
    "Task": "Text Summarization",
    "Type": "Summarization",
    "Description": "Generate summaries for long-form text documents.",
    "Data_Path": "data/documents.csv",
    "Example_Path": "data/summaries_examples.csv",
    "Input_Field": "text",
    "Label_Field": "summary",
    "Parser_Format": {
        "summary": "string"
    }
}
Running the Extractor
To run the data extraction process, use the following command:
extractinate --task_id 001 --model_name "mistral-nemo" --num_examples 0 --max_context_len 8192 --num_predict 8192 --translate
Customize the flags based on your task requirements.
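When the extractor is launched from a script rather than typed by hand, the command can be assembled programmatically. This is a minimal sketch: the flag names come from the table above, `build_extractinate_command` is a hypothetical helper, and treating boolean flags as bare switches is an assumption based on how --translate appears in the example command.

```python
import shlex


def build_extractinate_command(task_id: int, **flags) -> list[str]:
    """Build an argument list for the extractinate CLI from keyword flags."""
    command = ["extractinate", "--task_id", str(task_id)]
    for name, value in flags.items():
        if isinstance(value, bool):
            # Assumption: boolean flags such as --translate are bare switches.
            if value:
                command.append(f"--{name}")
        else:
            command.extend([f"--{name}", str(value)])
    return command


if __name__ == "__main__":
    cmd = build_extractinate_command(
        1, model_name="mistral-nemo", num_examples=0, translate=True
    )
    # Print the shell-quoted command; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems if the result is later passed to `subprocess.run(cmd, check=True)`.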
Output
The output will be saved in the specified --output_dir. Ensure that the directory structure and paths specified in the Task.json file match your project's organization.
For further details, check the logs in the directory specified by --log_dir.
Enhancements and Contributions
Feel free to enhance this project by improving configurations, adding more task types, or extending model compatibility. Open a pull request or file an issue for discussions!