Synthetic dataset generation and normalization functions powered by LLMs

fuxion

LangChain + LLM powered data generation and normalization functions. fuxion helps you generate a fully synthetic dataset with LLM APIs to train a task-specific model you can run on your own GPU. Preliminary models for name, price, and address standardization are available on Hugging Face.

Description

fuxion is a Python package that provides a data generation and normalization pipeline for testing, data normalization, and training machine learning models. With fuxion you can generate synthetic data for many different use cases; all that's required is that you pass the right prompt to the chain.

Installation

We recommend creating a virtual environment before installing, so that fuxion and its dependencies stay isolated from your other projects. Then follow the steps below.

  • Install via pip:

    pip install fuxion
    
  • Add the following to your ~/.bashrc file, replacing "your-key" with your OpenAI API key:

    export OPENAI_API_KEY="your-key"
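
Before running any chain, it can help to confirm that the key is visible to Python. The check below is a minimal sketch that uses only the standard library; the helper name require_api_key is hypothetical and is not part of fuxion.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of the environment variable `name`, or raise if it is unset or empty."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it in your shell first.")
    return key
```

Calling require_api_key() at the top of a script fails early with a readable message instead of failing later inside an API call.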
    

Usage

Creating useful synthetic data involves two main steps: data generation and normalization. fuxion provides a simple interface for each of these tasks, plus a pipeline that chains the two together.

Generation

from fuxion.generators import GeneratorChain
from pprint import pprint

chain = GeneratorChain.from_template(
    template_file="examples/name_generator/generator.template",
    temperature=0.0,
    cache=False,
    verbose=True,
    model_name="gpt-3.5-turbo",
)


result = chain.execute(
    few_shot_example_file="examples/name_generator/few_shot.json", sample_size=3
)
pprint(result)

Normalization

from fuxion.normalizers import NormalizerChain

normalizer_chain = NormalizerChain.from_template(
    template_file="../templates/normalizer/address.template",
    temperature=0.0,
    cache=False,
    verbose=True,
    model_name="gpt-3.5-turbo",
)

normalizer = normalizer_chain.execute(
    example="John Doe street 1234, New York, NY 10001",
)
print(normalizer)

fuxion can generate synthetic data for rapid product testing, among other use cases. To do so, pass the chain the paths to your instructions and few-shot examples: the instructions live in the prompt template, and the few-shot examples live in a JSON file.

Template Structure

Each generation or normalization task requires a template file that tells the LLM what to do. Below is a brief overview of what the template files should look like for a generation task and for a normalization task.

Generator templates

Generate a list of U.S. postal addresses separated by double newlines.

Make them as realistic and diverse as possible.
Include some company address, P.O. boxes, apartment complexes, etc.
Ensure the addresses are fake.

{{few_shot}}

List:

  • The first few lines tell the chain to generate addresses and give the creative instructions that determine the quality of the results.

  • {{few_shot}} tells the chain to insert the few-shot examples provided in the examples folder.

  • List: prompts the model to return the results as a list.

The same convention should be followed when creating subsequent templates for various data generation tasks.
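
To make the placeholder convention concrete, the sketch below fills {{few_shot}} with examples joined by double newlines, then splits a raw completion back into individual items. It illustrates the idea under stated assumptions and is not fuxion's actual implementation: the function names render_template and split_items are hypothetical, and the few-shot examples are assumed to be a plain list of strings.

```python
def render_template(template: str, few_shot_examples: list[str]) -> str:
    """Replace the {{few_shot}} placeholder with examples separated by blank lines."""
    return template.replace("{{few_shot}}", "\n\n".join(few_shot_examples))


def split_items(raw_output: str) -> list[str]:
    """Split a completion on double newlines and drop empty entries."""
    return [item.strip() for item in raw_output.split("\n\n") if item.strip()]


# Shortened version of the generator template above; the addresses are made up.
template = (
    "Generate a list of U.S. postal addresses separated by double newlines.\n\n"
    "{{few_shot}}\n\n"
    "List:"
)
prompt = render_template(
    template,
    ["1 Example St, Springfield, IL 62701", "PO Box 12, Austin, TX 73301"],
)
```

Splitting on the same separator the template asks for keeps the prompt and the parsing step consistent with each other.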

Normalizer templates

Format the following address as a list of python dictionaries of the form:
[
    { 
        "house_number": int, 
        "road": str, 
        "unit": int, 
        "unit_type": str, 
        "po_box_number": int, 
        "city": str, 
        "state": str, 
        "postcode": int 
    }
]. 

Use abbreviations for state and road type.
Use short form zip codes.

Input:
"{{address}}"

Output:
[{

  • The first few lines tell the chain to format the address passed to it as a list of dictionaries.

  • {{address}} marks where the input address is inserted.

  • The output is a list of dictionaries.
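
Because the template ends with the opening [{, the model's completion is expected to continue from that point. One way to recover a Python object is to put the prefix back and parse the result. This is a sketch under that assumption, not necessarily how fuxion parses output; parse_completion is a hypothetical helper, and the completion string is made up for illustration.

```python
import ast


def parse_completion(completion: str, prefix: str = "[{") -> list:
    """Re-attach the prompt's trailing prefix and parse the text as a Python literal."""
    parsed = ast.literal_eval(prefix + completion.strip())
    if not isinstance(parsed, list) or not all(isinstance(d, dict) for d in parsed):
        raise ValueError("expected a list of dictionaries")
    return parsed


# A made-up completion for illustration, not real model output.
completion = (
    '"house_number": 1234, "road": "John Doe St", '
    '"city": "New York", "state": "NY", "postcode": 10001}]'
)
records = parse_completion(completion)
```

ast.literal_eval raises ValueError or SyntaxError on malformed text, so callers may want to catch both. An unquoted postcode with a leading zero (for example 02134) is not a valid Python integer literal and would fail to parse.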

Pipelines

We can train machine learning models on pairs of synthetically generated data and their normalized form. This is where pipelines come in.

from fuxion.pipelines import DatasetPipeline

pipeline_chain = DatasetPipeline.from_template(
    generator_template="examples/name_generator/generator.template",
    normalizer_template="examples/name_generator/normalizer.template",
    few_shot_file="examples/name_generator/few_shot.json",
    dataset_name="name_pipeline",
    k=20,
    model_name="gpt-3.5-turbo",
    cache=False,
    verbose=True,
    temperature=1.0,
    batch_save=True,
    batch_size=3,
)

result = pipeline_chain.execute()
print(result)

The pipeline chain takes the generator_template, the normalizer_template, the few_shot_file containing the few-shot examples, the dataset_name, and k, the number of data points to generate. The model_name argument specifies the LLM used for both generation and normalization. The dataset is saved to a JSON file named after the dataset name provided. To save the dataset in batches, set batch_save to True and provide a batch_size.
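
To show what batched saving could look like, the sketch below writes a list of records to numbered JSON files of at most batch_size items each. It is an illustration only: save_in_batches and its file naming scheme are hypothetical, and fuxion's own on-disk layout may differ.

```python
import json
from pathlib import Path


def save_in_batches(records: list, dataset_name: str, batch_size: int, out_dir: str = ".") -> list:
    """Write `records` to numbered JSON files holding at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    paths = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # Hypothetical naming scheme: <dataset_name>_<batch index>.json
        path = Path(out_dir) / f"{dataset_name}_{start // batch_size}.json"
        path.write_text(json.dumps(batch, indent=2))
        paths.append(path)
    return paths
```

With 20 records and a batch size of 3, as in the pipeline call above, this scheme would produce seven files, the last one holding two records. Writing each batch as soon as it is ready means a failure partway through a long run does not lose the data already generated.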

Models supported

  • gpt-3.5-turbo
  • gpt-4
  • gpt-4-1106-preview
  • gpt-3.5-turbo-instruct

Future work

fuxion is still a work in progress, but it is a good starting point for anyone looking to generate synthetic data for testing and training machine learning models. We plan to add more features, including support for accurate data generation and normalization with other LLMs (locally hosted or via the Hugging Face API). For now, OpenAI's models are the most functional and reliable.

Feel free to contribute to this project by opening an issue or a pull request. We would love to hear your thoughts on how we can improve fuxion!
