
A library to create synthetic data with OpenAI and train a GLiNER model on that data.

Project description

GLiNER-Finetune

gliner-finetune is a Python library that generates synthetic training data with OpenAI's GPT models, processes that data, and uses it to fine-tune a GLiNER model. GLiNER is a generalist Named Entity Recognition (NER) model that can extract arbitrary entity types specified at inference time.

Features

  • Data Generation: Leverage OpenAI's powerful language models to create synthetic training data.
  • Data Processing: Convert raw synthetic data into a format suitable for NER training.
  • Model Training: Fine-tune the GLiNER model on the processed synthetic data for improved NER performance.

Installation

To install the gliner-finetune library, use pip:

pip install gliner-finetune

Quick Start

The following example demonstrates how to generate synthetic data, process it, and train a GLiNER model using the gliner-finetune library.

Make sure you have a .env file in your working directory with your OPENAI_API_KEY variable set.
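
A minimal .env file contains a single line; the key value below is a placeholder:

OPENAI_API_KEY=sk-your-key-here

If the key is not picked up automatically, you can load it yourself with the python-dotenv package before calling the library (whether gliner-finetune loads the file on its own is an assumption worth verifying):

from dotenv import load_dotenv

load_dotenv()  # copies variables from .env into os.environ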

Step 1: Generate Synthetic Data

from gliner_finetune.synthetic import generate_data, create_prompt
import json

# Define your example data
example_data = {
    "text": "The Alpine Swift primarily consumes flying insects such as wasps, bees, and flies. It captures its prey mid-air while swiftly flying through the alpine skies. It nests in high, rocky mountain crevices where it uses feathers and small sticks to construct a simple yet secure nesting environment.",
    "generic_plant_food": [],
    "generic_animal_food": ["flying insects"],
    "plant_food": [],
    "specific_animal_food": ["wasps", "bees", "flies"],
    "location_nest": ["rocky mountain crevices"],
    "item_nest": ["feathers", "small sticks"]
}

# Convert example data to JSON string
json_data = json.dumps(example_data)

# Build the generation prompt from the example data
prompt = create_prompt(json_data)
print(prompt)

# Generate synthetic data with specified number of API calls
num_calls = 3
results = generate_data(json_data, num_calls)
print(results)
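
Each API call asks the model for new examples in the same schema as example_data. Step 2 below reads the saved output from synthetic_data/parsed_responses.json; judging from the flattening step there, the file appears to hold one list of example records per API call, roughly like this (an assumed sketch; the field values are placeholders):

[
    [
        {"text": "...", "specific_animal_food": ["..."], "item_nest": ["..."]},
        {"text": "...", "generic_animal_food": ["..."]}
    ],
    [
        {"text": "...", "location_nest": ["..."]}
    ]
]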

Step 2: Process and Split Data

import json

from gliner_finetune.convert import convert

# Load the synthetic data generated in Step 1
with open('synthetic_data/parsed_responses.json', 'r') as file:
    data = json.load(file)

# Flatten the nested list of batches into a single list of examples
final_data = [sample for item in data for sample in item]

# Convert and split the data into training, validation, and testing datasets
training_data = convert(final_data, project_path='', train_split=0.8, eval_split=0.2, test_split=0.0,
                        train_file='train.json', eval_file='eval.json', test_file='test.json', overwrite=True)
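
The convert function writes the train/eval/test splits to disk. GLiNER is trained on span-annotated records, so a converted example presumably follows the standard GLiNER training format shown below (a sketch; the token indices mark an inclusive span and are illustrative, not output copied from the library):

{
    "tokenized_text": ["The", "Alpine", "Swift", "primarily", "consumes",
                       "flying", "insects"],
    "ner": [
        [5, 6, "generic_animal_food"]  # span covering "flying insects"
    ]
}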

Step 3: Train the GLiNER Model

from gliner_finetune.train import train_model

# Train the model
train_model(model="urchade/gliner_small-v2.1", train_data="assets/train.json", 
            eval_data="assets/eval.json", project="")

Documentation

For more details about the GLiNER model and its capabilities, visit the official repository: https://github.com/urchade/GLiNER



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gliner-finetune-0.0.4.tar.gz (6.8 kB)

Built Distribution

gliner_finetune-0.0.4-py3-none-any.whl (7.8 kB)

File details

Details for the file gliner-finetune-0.0.4.tar.gz.

File metadata

  • Download URL: gliner-finetune-0.0.4.tar.gz
  • Upload date:
  • Size: 6.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for gliner-finetune-0.0.4.tar.gz:

  • SHA256: 84e3f092bcd2db8a0d8f8d612d88d7b8d12907b50132c1bdd65b9c382a98a18c
  • MD5: 042b398c313218d233f7ff6e4e11b4d5
  • BLAKE2b-256: c7cc462e250237deeb562a23db455d14409744316114434f18e61ffe6010bcb8

File details

Details for the file gliner_finetune-0.0.4-py3-none-any.whl.

File hashes

Hashes for gliner_finetune-0.0.4-py3-none-any.whl:

  • SHA256: 8bf8d67286efa030da09706eab6feeb378a55a6f100d19302ac50b50cbf16acd
  • MD5: 9dfdeccf6dc806773406d9360e8fddf4
  • BLAKE2b-256: 03c2b6ab5dc4a8a3812e81f7f2b15bebf18e5c1bb2958c0a039d0ccc8306c39e
