Synthetic dataset generation and normalization functions powered by LLMs

These details have not been verified by PyPI

Project description

fuxion

LangChain + LLM powered data generation and normalization functions. fuxion helps you generate a fully synthetic dataset with LLM APIs to train a task-specific model you can run on your own GPU. Preliminary models for name, price, and address standardization are available on HuggingFace.

fuxion

Description
Installation
Usage

Description

fuxion is a Python package that provides a data generation and normalization pipeline for testing, data normalization, and training machine learning models. With fuxion you can generate synthetic data for many kinds of use cases: pass the right prompt to the chain and it produces the data.

Installation

We recommend creating a virtual environment before installing, so that fuxion and its dependencies stay isolated from your other projects. Then follow the steps below.

Install via pip:
```
pip install fuxion
```
Add the following to your bashrc file and replace "your-key" with your OpenAI API key:
```
export OPENAI_API_KEY="your-key"
```
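A script can fail early with a clear message when the key is missing. The following is a minimal sketch, assuming the key is read from the `OPENAI_API_KEY` environment variable set above; `require_api_key` is a hypothetical helper and not part of fuxion.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    This helper is illustrative only and is not provided by fuxion.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running fuxion")
    return key
```

Calling `require_api_key()` at the top of a script surfaces a missing key before any chain is built.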

Usage

The process of creating useful synthetic data involves two main steps: data generation and normalization. fuxion provides a simple interface for both of these tasks, and a pipeline that chains together both of these tasks.

Generation

```
from pprint import pprint

from fuxion.generators import GeneratorChain

# Build a generator chain from a prompt template file
chain = GeneratorChain.from_template(
    template_file="examples/name_generator/generator.template",
    temperature=0.0,
    cache=False,
    verbose=True,
    model_name="gpt-3.5-turbo",
)

# Generate three samples, guided by the few-shot examples
result = chain.execute(
    few_shot_example_file="examples/name_generator/few_shot.json", sample_size=3
)
pprint(result)
```

Normalization

```
from fuxion.normalizers import NormalizerChain

# Build a normalizer chain from the address template
normalizer_chain = NormalizerChain.from_template(
    template_file="../templates/normalizer/address.template",
    temperature=0.0,
    cache=False,
    verbose=True,
    model_name="gpt-3.5-turbo",
)

# Normalize a single free-form address
normalized = normalizer_chain.execute(
    example="John Doe street 1234, New York, NY 10001",
)
print(normalized)
```

fuxion can be used to generate synthetic data for rapid product testing, among other use cases. To do so, pass the instructions and the few-shot examples to the chain as file paths: the instructions live in the prompt template, and the few-shot examples live in a JSON file.

Template Structure

Each generation or normalization task requires a template file that tells the LLM what to do. Below is a brief overview of what the template files look like for a generation task and for a normalization task.

Generator templates

```
Generate a list of U.S. postal addresses separated by double newlines.

Make them as realistic and diverse as possible.
Include some company address, P.O. boxes, apartment complexes, etc.
Ensure the addresses are fake.

{{few_shot}}

List:
```

The first few lines tell the chain to generate addresses and give the creative instructions that shape the quality of the results.
{{few_shot}} tells the chain to insert the few-shot examples provided in the examples folder.
List: cues the model to return the results as a list.

The same convention should be followed when creating subsequent templates for various data generation tasks.
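To make the convention concrete, here is a minimal sketch of how a `{{few_shot}}` placeholder could be filled in and how a completion separated by double newlines could be split into items. It uses plain string operations; `render_template` and `split_items` are hypothetical names, and this is not necessarily how fuxion implements the step internally.

```python
def render_template(template: str, few_shot_examples: list[str]) -> str:
    """Replace the {{few_shot}} placeholder with the examples, one per paragraph."""
    return template.replace("{{few_shot}}", "\n\n".join(few_shot_examples))


def split_items(completion: str) -> list[str]:
    """Split a completion on blank lines, as the template requests."""
    return [item.strip() for item in completion.split("\n\n") if item.strip()]


# Shortened version of the generator template shown above
template = (
    "Generate a list of U.S. postal addresses separated by double newlines.\n\n"
    "{{few_shot}}\n\n"
    "List:"
)
prompt = render_template(template, ["1 Main St, Springfield, IL 62701"])

# Made-up completion text, used only to exercise the splitter
items = split_items("12 Oak Ave\nAustin, TX 73301\n\nP.O. Box 9, Reno, NV 89501\n")
```

Splitting on blank lines rather than single newlines keeps multi-line addresses intact, which is presumably why the template asks for double newlines.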

Normalizer templates


```
Format the following address as a list of python dictionaries of the form:
[
    {
        "house_number": int,
        "road": str,
        "unit": int,
        "unit_type": str,
        "po_box_number": int,
        "city": str,
        "state": str,
        "postcode": int
    }
].

Use abbreviations for state and road type.
Use short form zip codes.

Input:
"{{address}}"

Output:
[{
```

The first few lines tell the chain to format the address passed to it as a list of dicts.
{{address}} is replaced with the input address.
The chain returns a list of dicts as output.
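Because the template ends with `[{`, a raw completion may continue from that prefix instead of repeating it. The sketch below shows one way such text could be turned back into Python objects, under the assumption that the model emits a valid Python literal; `parse_normalized` and the sample completion are hypothetical, and fuxion may handle parsing differently.

```python
import ast

PRIMED_PREFIX = "[{"  # the template's last line; the model continues after it


def parse_normalized(completion: str) -> list:
    """Re-attach the primed prefix if needed, then parse the Python literal."""
    text = completion.strip()
    if not text.startswith("["):
        text = PRIMED_PREFIX + text
    # literal_eval only accepts literals, so it never executes model output
    return ast.literal_eval(text)


# Hypothetical completion for the address used in the Normalization example
completion = (
    '"house_number": 1234, "road": "John Doe St", '
    '"city": "New York", "state": "NY", "postcode": 10001}]'
)
records = parse_normalized(completion)
```

`ast.literal_eval` is used instead of `eval` so that malformed or unexpected model output raises an error rather than running as code.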

Pipelines

We can train machine learning models on pairs of synthetically generated data and their normalized form. This is where pipelines come in:

```
from fuxion.pipelines import DatasetPipeline

# Chain generation and normalization into one dataset-building pipeline
pipeline_chain = DatasetPipeline.from_template(
    generator_template="examples/name_generator/generator.template",
    normalizer_template="examples/name_generator/normalizer.template",
    few_shot_file="examples/name_generator/few_shot.json",
    dataset_name="name_pipeline",
    k=20,
    model_name="gpt-3.5-turbo",
    cache=False,
    verbose=True,
    temperature=1.0,
    batch_save=True,
    batch_size=3,
)

# Generate k data points, normalize them, and save the dataset
result = pipeline_chain.execute()
print(result)
```

The pipeline chain takes generator_template, normalizer_template, few_shot_file (the few-shot examples), dataset_name, and k (the number of data points to generate), along with other parameters such as model_name, which specifies the LLM used for both generation and normalization. The resulting dataset is saved to a JSON file named after dataset_name. To save the dataset in batches, set batch_save to True and provide a batch_size.
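As an illustration of what batched saving means, the sketch below writes a list of records to numbered JSON files, `batch_size` records per file. The function name and the `<dataset_name>_<index>.json` file naming are assumptions made for this example; the files fuxion actually writes may be named and structured differently.

```python
import json
from pathlib import Path


def save_in_batches(records, dataset_name, batch_size, out_dir="."):
    """Write records to numbered JSON files, batch_size records per file.

    The file naming scheme here is hypothetical, not fuxion's.
    """
    paths = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        path = Path(out_dir) / f"{dataset_name}_{start // batch_size}.json"
        path.write_text(json.dumps(batch, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

With k=20 and batch_size=3 as in the pipeline example, this scheme yields seven files, the last holding the remaining two records.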

Models supported

gpt-3.5-turbo
gpt-4
gpt-4-1106-preview
gpt-3.5-turbo-instruct
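
A small guard can catch an unsupported model_name before any API call is made. This is a sketch that hard-codes the list above; `check_model` is a hypothetical helper, and fuxion may or may not perform a similar check itself.

```python
# Model names copied from the "Models supported" list above
SUPPORTED_MODELS = {
    "gpt-3.5-turbo",
    "gpt-4",
    "gpt-4-1106-preview",
    "gpt-3.5-turbo-instruct",
}


def check_model(model_name: str) -> str:
    """Return model_name unchanged if it is listed, otherwise raise ValueError."""
    if model_name not in SUPPORTED_MODELS:
        supported = ", ".join(sorted(SUPPORTED_MODELS))
        raise ValueError(f"Unsupported model {model_name!r}; expected one of: {supported}")
    return model_name
```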

Future work:

fuxion is still a work in progress, but it is a good starting point for anyone who wants to generate synthetic data for testing and training machine learning models. We plan to add more features, including support for accurate data generation and normalization with other LLMs (locally hosted or via the Hugging Face API). For now, OpenAI's models are the most functional and reliable.

Feel free to contribute to this project by opening an issue or a pull request. We would love to hear your thoughts on how we can improve fuxion!

Project details

Release history

0.0.4.2: Sep 19, 2024
0.0.4.1: Sep 19, 2024 (yanked)
0.0.4: Sep 19, 2024 (yanked)
0.0.3.2: Sep 13, 2024 (yanked). Reason this release was yanked: Fuxion has been updated to use structured output, which guarantees better generations.
0.0.3.1: Sep 12, 2024 (yanked; this version)
0.0.3: Sep 12, 2024 (yanked)
0.0.2: Apr 15, 2024 (yanked)
0.0.1: Apr 5, 2024 (yanked)

Download files

Source Distribution

fuxion-0.0.3.1.tar.gz (1.0 MB)

Uploaded Sep 12, 2024 Source

Built Distribution

fuxion-0.0.3.1-py3-none-any.whl (1.1 MB)

Uploaded Sep 12, 2024 Python 3

File details

Details for the file fuxion-0.0.3.1.tar.gz.

File metadata

Download URL: fuxion-0.0.3.1.tar.gz
Upload date: Sep 12, 2024
Size: 1.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.4.2 CPython/3.10.6 Linux/6.9.3-76060903-generic

File hashes

Hashes for fuxion-0.0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`435f4b58c530cab4240328b0f907d38b215dd4d49aabb15fbb4f9b43af9beabc`
MD5	`4e3d0fe9d21d9f2c8cc8b9f4353573d1`
BLAKE2b-256	`12f501ae71077fc27be8e298816663769389cc252009f1157c26650895f97e1e`
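
A downloaded copy of the archive can be checked against the SHA256 digest listed above. The sketch below uses the standard library's `hashlib`; `sha256_of_file` is a hypothetical helper, and the local file path in the commented usage line is an assumption about where the archive was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the table above for fuxion-0.0.3.1.tar.gz
EXPECTED_SHA256 = "435f4b58c530cab4240328b0f907d38b215dd4d49aabb15fbb4f9b43af9beabc"

# Usage, assuming the archive sits in the current directory:
# ok = sha256_of_file("fuxion-0.0.3.1.tar.gz") == EXPECTED_SHA256
```

Reading in chunks keeps memory use flat regardless of file size.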

File details

Details for the file fuxion-0.0.3.1-py3-none-any.whl.

File metadata

Download URL: fuxion-0.0.3.1-py3-none-any.whl
Upload date: Sep 12, 2024
Size: 1.1 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.4.2 CPython/3.10.6 Linux/6.9.3-76060903-generic

File hashes

Hashes for fuxion-0.0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`084625a9e61fe2b82606f5d911acb86b69a4e6b6d66ce638ec4fa11ff0e6b58a`
MD5	`4f6f7cb4afd7796e17c4dcbd4dbd14fd`
BLAKE2b-256	`f9fe3e74f6df2e6461a6b7c7942601701f5962b32d94eb56336df937ba62dcd9`
