Synthetic dataset generation and normalization functions powered by LLMs

Reason this release was yanked:

Fuxion has been updated to use structured output, which guarantees better generations.

Project description

fuxion

LangChain + LLM powered data generation and normalization functions. fuxion helps you generate a fully synthetic dataset with LLM APIs to train a task-specific model you can run on your own GPU. Preliminary models for name, price, and address standardization are available on HuggingFace.

Description

fuxion is a Python package that provides a data generation and normalization pipeline for testing, normalization, and training machine learning models. With fuxion, you can generate synthetic data for many different use cases: all that's required is that you pass the right prompt to the chain and watch how things unfold 😎

Installation

We recommend creating a virtual environment before installing, so that the project's dependencies stay isolated. Once that is done, follow the steps below.

  • Install via pip:

    pip install fuxion
    
  • Add the following to your bashrc file and replace "your-key" with your OpenAI API key:

    export OPENAI_API_KEY="your-key"
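
Before running any chain, it can help to confirm that the key is actually visible to Python. The snippet below is a minimal sketch: `has_openai_key` is a hypothetical helper written for this example and is not part of fuxion.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty in the mapping.

    Hypothetical helper for illustration; fuxion does not ship this function.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if has_openai_key():
    print("OPENAI_API_KEY is set")
else:
    print("OPENAI_API_KEY is missing; export it before running fuxion")
```

If the key is reported as missing in a new terminal, the bashrc change has probably not been loaded yet; open a fresh shell or source the file.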
    

Usage

The process of creating useful synthetic data involves two main steps: data generation and normalization. fuxion provides a simple interface for both of these tasks, and a pipeline that chains together both of these tasks.

Generation

from fuxion.generators import GeneratorChain
from pprint import pprint

chain = GeneratorChain.from_template(
    template_file="examples/name_generator/generator.template",
    temperature=0.0,
    cache=False,
    verbose=True,
    model_name="gpt-3.5-turbo",
)


result = chain.execute(
    few_shot_example_file="examples/name_generator/few_shot.json", sample_size=3
)
pprint(result)

Normalization

from fuxion.normalizers import NormalizerChain

normalizer_chain = NormalizerChain.from_template(
    template_file="../templates/normalizer/address.template",
    temperature=0.0,
    cache=False,
    verbose=True,
    model_name="gpt-3.5-turbo",
)

normalizer = normalizer_chain.execute(
    example="John Doe street 1234, New York, NY 10001",
)
print(normalizer)

fuxion can be used to generate synthetic data for rapid product testing, among other use cases. This is achieved by passing the instructions and few-shot examples to the chain as file paths: the instructions are provided in the prompt template, and the few-shot examples are provided in a JSON file.

Template Structure

For each generation or normalization task, a template file is required to tell the LLM what to do. Below is a brief overview of what the template files should look like for a given generation and normalization task.

Generator templates

Generate a list of U.S. postal addresses separated by double newlines.

Make them as realistic and diverse as possible.
Include some company address, P.O. boxes, apartment complexes, etc.
Ensure the addresses are fake.

{{few_shot}}

List:

  • The first few lines tell the chain to generate addresses and contain creative instructions that determine the quality of the results.

  • {{few_shot}} is the placeholder where the chain inserts the few-shot examples provided in the examples folder.

  • List: asks for the results to be returned as a list.

The same convention should be followed when creating subsequent templates for various data generation tasks.
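
To make the role of the {{few_shot}} placeholder concrete, here is a minimal sketch that fills a shortened version of the template with plain string replacement. It is not how fuxion or LangChain render templates internally; `fill_template` is a hypothetical helper, and the few-shot examples are assumed to be plain strings because the layout of few_shot.json is not described here.

```python
# Hypothetical helper; NOT fuxion's or LangChain's actual rendering code.
TEMPLATE = """Generate a list of U.S. postal addresses separated by double newlines.

Ensure the addresses are fake.

{{few_shot}}

List:"""


def fill_template(template: str, few_shot_examples: list[str]) -> str:
    """Replace the {{few_shot}} placeholder with examples joined by blank lines."""
    few_shot_block = "\n\n".join(few_shot_examples)
    return template.replace("{{few_shot}}", few_shot_block)


# Made-up example addresses, assumed here to be plain strings.
examples = [
    "123 Example St, Springfield, IL 62701",
    "P.O. Box 456, Austin, TX 73301",
]

prompt = fill_template(TEMPLATE, examples)
print(prompt)
```

The printed prompt ends with "List:", so the model's completion continues directly with new addresses in the same style as the examples.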

Normalizer templates

Format the following address as a list of python dictionaries of the form:
[
    { 
        "house_number": int, 
        "road": str, 
        "unit": int, 
        "unit_type": str, 
        "po_box_number": int, 
        "city": str, 
        "state": str, 
        "postcode": int 
    }
]. 

Use abbreviations for state and road type.
Use short form zip codes.

Input:
"{{address}}"

Output:
[{

  • The first few lines tell the chain to format the address passed to it as a list of dicts.

  • {{address}} is the placeholder for the input address.

  • The output is a list of dicts; the [{ after Output: opens that list for the model to complete.
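
Because the template ends with an opening [{, a raw completion would continue from inside the first dict. The sketch below shows one way such a completion could be turned back into a list of dicts. It rests on several assumptions: the completion is valid JSON once the prefix is restored, `parse_completion` is a hypothetical helper, and the sample completion is made up. fuxion's own parsing may differ, especially in releases that use structured output.

```python
import json

# The template ends with this prefix, so the model continues after it.
PROMPT_SUFFIX = "[{"


def parse_completion(completion: str) -> list[dict]:
    """Re-attach the prefix from the prompt and parse the result as JSON.

    Hypothetical helper; assumes the model returned valid JSON.
    """
    return json.loads(PROMPT_SUFFIX + completion)


# Made-up completion text for illustration only.
completion = (
    '"house_number": 1234, "road": "John Doe St", "unit": null, '
    '"unit_type": null, "po_box_number": null, "city": "New York", '
    '"state": "NY", "postcode": 10001}]'
)

records = parse_completion(completion)
print(records[0]["city"], records[0]["state"])
```

If the model emits Python literals such as None instead of JSON null, `json.loads` raises an error, so real code would need to handle that case.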

Pipelines

We can train machine learning models on synthetically generated data paired with its normalized form. This is where pipelines come in:

from fuxion.pipelines import DatasetPipeline

pipeline_chain = DatasetPipeline.from_template(
    generator_template="examples/name_generator/generator.template",
    normalizer_template="examples/name_generator/normalizer.template",
    few_shot_file="examples/name_generator/few_shot.json",
    dataset_name="name_pipeline",
    k=20,
    model_name="gpt-3.5-turbo",
    cache=False,
    verbose=True,
    temperature=1.0,
    batch_save=True,
    batch_size=3,
)

result = pipeline_chain.execute()
print(result)

The pipeline chain takes the generator_template, the normalizer_template, a few_shot_file with few-shot examples, a dataset_name, the number of datapoints to generate (k), and other parameters, including model_name, which specifies the LLM to use for generation and normalization. The dataset is then saved to a JSON file under the dataset name provided. To save the dataset in batches, set batch_save to True and provide a batch_size.
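
To illustrate how k and batch_size interact, here is a minimal sketch of batched saving. It is not fuxion's implementation: `batch_sizes`, `save_in_batches`, and the file naming scheme are hypothetical, and the records are placeholders.

```python
import json
import tempfile
from pathlib import Path


def batch_sizes(k: int, batch_size: int) -> list[int]:
    """Split k datapoints into batch sizes; any remainder goes in the last batch."""
    full, remainder = divmod(k, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


def save_in_batches(records, batch_size, out_dir, dataset_name):
    """Write records to one JSON file per batch and return the file paths.

    The <dataset_name>_batch_<n>.json naming is an assumption for this sketch.
    """
    paths = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        path = Path(out_dir) / f"{dataset_name}_batch_{start // batch_size}.json"
        path.write_text(json.dumps(batch), encoding="utf-8")
        paths.append(path)
    return paths


# With k=20 and batch_size=3 there are six full batches and one batch of 2.
sizes = batch_sizes(20, 3)
print(sizes)

records = [{"id": i} for i in range(20)]  # placeholder datapoints
with tempfile.TemporaryDirectory() as tmp:
    written = save_in_batches(records, 3, tmp, "name_pipeline")
    file_count = len(written)
    last_batch = json.loads(written[-1].read_text(encoding="utf-8"))
print(file_count, len(last_batch))
```

Saving in batches means a failed API call late in a long run costs at most one batch of work instead of the whole dataset.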

Models supported

  • gpt-3.5-turbo
  • gpt-4
  • gpt-4o
  • gpt-4o-mini

Future work

fuxion is still a work in progress, but it is a good starting point for anyone looking to generate synthetic data for testing and training machine learning models. We plan to add more features in the future, including accurate data generation and normalization with other LLMs (locally hosted or via the Hugging Face API). For now, OpenAI's models are the most functional and reliable.

Feel free to contribute to this project by opening an issue or a pull request. We would love to hear your thoughts on how we can improve fuxion!
