A library to create synthetic data with OpenAI and train a GLiNER model on that data.

GLiNER-Finetune

gliner-finetune is a Python library that generates synthetic data with OpenAI's GPT models, processes that data, and uses it to train a GLiNER model. GLiNER is a framework for training and running Named Entity Recognition (NER) models.

Features

Data Generation: Leverage OpenAI's powerful language models to create synthetic training data.
Data Processing: Convert raw synthetic data into a format suitable for NER training.
Model Training: Fine-tune the GLiNER model on the processed synthetic data for improved NER performance.
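
As a rough illustration of the data-processing step, the sketch below turns one sample in the dictionary layout used in the Quick Start into token-level spans. The output layout (`tokenized_text` plus `ner` triples of start token, inclusive end token, and label) is assumed from GLiNER's published training examples; whether `convert` produces exactly this is not confirmed here. The helper `to_token_spans` and the simple regex tokenizer are hypothetical and not part of gliner-finetune.

```python
import re

# Words and single punctuation marks become separate tokens (a stand-in tokenizer).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def to_token_spans(sample):
    """Convert {"text": ..., label: [entity strings]} into token-level spans."""
    tokens = tokenize(sample["text"])
    lowered = [t.lower() for t in tokens]
    spans = []
    for label, entities in sample.items():
        if label == "text":
            continue
        for entity in entities:
            ent_tokens = [t.lower() for t in tokenize(entity)]
            n = len(ent_tokens)
            if n == 0:
                continue
            # Record every position where the entity's tokens occur in the text.
            for start in range(len(tokens) - n + 1):
                if lowered[start:start + n] == ent_tokens:
                    spans.append([start, start + n - 1, label])
    return {"tokenized_text": tokens, "ner": spans}


sample = {
    "text": "It nests in rocky mountain crevices.",
    "location_nest": ["rocky mountain crevices"],
    "plant_food": [],
}
converted = to_token_spans(sample)
print(converted["ner"])  # [[3, 5, 'location_nest']]
```

Matching is case-insensitive and exact at the token level, so an entity that the generator paraphrased rather than copied from the text produces no span.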

Installation

To install the gliner-finetune library, use pip:

pip install gliner-finetune

Quick Start

The following example demonstrates how to generate synthetic data, process it, and train a GLiNER model using the gliner-finetune library.

Make sure you have a .env file that defines the OPENAI_API_KEY variable.
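
If you want to confirm the key is visible before spending API calls, a small standard-library check along these lines may help. It assumes the .env file holds plain `KEY=VALUE` lines; `load_env_file` and `require_api_key` are hypothetical helpers, not part of gliner-finetune, and the `python-dotenv` package is a common alternative for the loading step.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ (existing values win)."""
    env_path = Path(path)
    if not env_path.is_file():
        return
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))


def require_api_key(name="OPENAI_API_KEY"):
    """Return the key, or raise a clear error before any API call is made."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```

Calling `load_env_file()` followed by `require_api_key()` at the top of a script fails fast with a readable message instead of an authentication error partway through generation.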

Step 1: Generate Synthetic Data

from gliner_finetune.synthetic import generate_data, create_prompt
import json

# Define your example data
example_data = {
    "text": "The Alpine Swift primarily consumes flying insects such as wasps, bees, and flies. It captures its prey mid-air while swiftly flying through the alpine skies. It nests in high, rocky mountain crevices where it uses feathers and small sticks to construct a simple yet secure nesting environment.",
    "generic_plant_food": [],
    "generic_animal_food": ["flying insects"],
    "plant_food": [],
    "specific_animal_food": ["wasps", "bees", "flies"],
    "location_nest": ["rocky mountain crevices"],
    "item_nest": ["feathers", "small sticks"]
}

# Convert example data to JSON string
json_data = json.dumps(example_data)

# Generate prompt and synthetic data
prompt = create_prompt(json_data)
print(prompt)

# Generate synthetic data with specified number of API calls
num_calls = 3
results = generate_data(json_data, num_calls)
print(results)
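
Because the example dictionary serves as the template for every generated sample, it can be worth checking that each labelled entity really occurs in the text before sending it to the API. The sketch below is one way to do that; `find_missing_entities` is a hypothetical helper, not part of gliner-finetune, and it assumes every key other than "text" maps to a list of entity strings, as in the example above.

```python
def find_missing_entities(sample):
    """Return (label, entity) pairs whose entity text is absent from sample["text"]."""
    text = sample["text"].lower()
    missing = []
    for label, entities in sample.items():
        if label == "text":
            continue
        for entity in entities:
            # Case-insensitive substring check against the sample text.
            if entity.lower() not in text:
                missing.append((label, entity))
    return missing


example = {
    "text": "The Alpine Swift primarily consumes flying insects such as wasps, bees, and flies.",
    "generic_animal_food": ["flying insects"],
    "specific_animal_food": ["wasps", "bees", "flies"],
    "plant_food": ["berries"],  # deliberately wrong: "berries" is not in the text
}
print(find_missing_entities(example))  # [('plant_food', 'berries')]
```

The same check can be run over generated samples to filter out entries where the model labelled text it did not actually write.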

Step 2: Process and Split Data

import json

from gliner_finetune.convert import convert

# Load the synthetic data from 'synthetic_data/parsed_responses.json'
with open('synthetic_data/parsed_responses.json', 'r') as file:
    data = json.load(file)

# Flatten the nested list (a list of sample lists) into a single list of samples
final_data = [sample for item in data for sample in item]

# Convert and split the data into training, validation, and testing datasets
training_data = convert(final_data, project_path='', train_split=0.8, eval_split=0.2, test_split=0.0,
                        train_file='train.json', eval_file='eval.json', test_file='test.json', overwrite=True)
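
To make the split fractions concrete, here is a minimal sketch of flattening nested responses and dividing them 80/20. It is not the library's implementation: `flatten` and `split_samples` are hypothetical helpers, and the seeded shuffle before splitting is an assumption about what a reasonable splitter would do.

```python
import random


def flatten(batches):
    """Turn a list of sample lists into one flat list of samples."""
    return [sample for batch in batches for sample in batch]


def split_samples(samples, train_split=0.8, eval_split=0.2, seed=0):
    """Shuffle deterministically, then slice into train, eval, and test lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_split)
    n_eval = round(len(shuffled) * eval_split)
    train = shuffled[:n_train]
    eval_set = shuffled[n_train:n_train + n_eval]
    # Whatever remains after train and eval becomes the test set.
    test = shuffled[n_train + n_eval:]
    return train, eval_set, test


# Two batches of five samples each, mimicking two API calls.
batches = [[{"text": f"sample {b}-{i}"} for i in range(5)] for b in range(2)]
train, eval_set, test = split_samples(flatten(batches))
print(len(train), len(eval_set), len(test))  # 8 2 0
```

With `test_split=0.0`, as in the call above, no samples are held out for testing, so any evaluation numbers come from the eval set alone.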

Step 3: Train the GLiNER Model

from gliner_finetune.train import train_model

# Train the model
train_model(model="urchade/gliner_small-v2.1", train_data="assets/train.json", 
            eval_data="assets/eval.json", project="")
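
Before starting a training run, it can help to look at how many samples and which labels ended up in each file. The sketch below assumes each file is a JSON list of records with an `ner` list of `[start, end, label]` triples, which matches GLiNER's published training examples but has not been verified against what `convert` writes; `summarize_dataset` is a hypothetical helper, not part of gliner-finetune.

```python
import json
from collections import Counter


def summarize_dataset(path):
    """Return (number of samples, Counter of labels) for a JSON training file."""
    with open(path, "r", encoding="utf-8") as f:
        samples = json.load(f)
    # Assumed record layout: {"tokenized_text": [...], "ner": [[start, end, label], ...]}
    label_counts = Counter(
        span[2] for sample in samples for span in sample.get("ner", [])
    )
    return len(samples), label_counts
```

For example, `summarize_dataset("assets/train.json")` returns the sample count and per-label frequencies, which makes a label that is missing from the training set easy to spot.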

Documentation

For more details about the GLiNER model and its capabilities, visit the official repository:

GLiNER GitHub Repository

Release history

0.0.4: Apr 15, 2024
0.0.3: Apr 15, 2024
0.0.2: Apr 15, 2024 (this version)
0.0.1: Apr 13, 2024

Download files

Source Distribution

gliner-finetune-0.0.2.tar.gz (6.7 kB)

Uploaded Apr 15, 2024 Source

Built Distribution

gliner_finetune-0.0.2-py3-none-any.whl (7.7 kB)

Uploaded Apr 15, 2024 Python 3

Hashes for gliner-finetune-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`5bb4acf0f465cc73b26e52f3ec5131d4b02dbf52d6a29a1000ea7d3d0b66f13b`
MD5	`6cecf3a564d3eed01cb37cb7f2d8c212`
BLAKE2b-256	`a09792caaa6ae7637dcb05aa8ce9f2eff204eaf02e074cda5a5e84920f1641e7`

Hashes for gliner_finetune-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6b83e93bb912edbce9b007e46e73b1fb85009855f1583f36496594c4be1adca1`
MD5	`c00b48bb13579e8c7b1611138c4c6444`
BLAKE2b-256	`17d31739fc2d4cd31d168c292c313cfcb8c9cf326ad6464c7479531cfe25d6d2`