Synthetic dataset generation and normalization functions powered by LLMs

These details have not been verified by PyPI

Project description

fuxion

LangChain + LLM powered data generation and normalization functions. fuxion helps you generate a fully synthetic dataset with LLM APIs to train a task-specific model you can run on your own GPU. Preliminary models for name, price, and address standardization are available on HuggingFace.

fuxion

Description
Installation
Usage

Description

fuxion is a Python package that provides a data generation and normalization pipeline for testing, data normalization, and training machine learning models. With fuxion you can generate synthetic data for many kinds of use cases: all that's required is that you pass the right prompt to the chain and let the pipeline do the rest.

Installation

We recommend creating a virtual environment before installing, so that fuxion and its dependencies stay isolated from your other projects. Once the environment is active, follow the steps below.

Install via pip:
```
pip install fuxion
```
Add the following line to your ~/.bashrc file, replacing "your-key" with your OpenAI API key:
```
export OPENAI_API_KEY="your-key"
```
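
Before running a pipeline, it can help to confirm that the key is actually visible to Python. The snippet below is a minimal sketch and not part of fuxion; the helper name `has_openai_key` is hypothetical and only the `OPENAI_API_KEY` variable name comes from the instructions above.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to test. Illustrative helper, not a fuxion API.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())
```

Calling `has_openai_key()` in a fresh shell returns `False` until the `export` line has been sourced, which is a quick way to catch a missing or misspelled variable.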

Usage

Creating useful synthetic data involves two main steps: data generation and normalization. fuxion now combines both into a single streamlined process using structured output.

Generation and Normalization

The generation process in fuxion uses a template file to guide the LLM in creating synthetic data. This template file contains instructions and placeholders for few-shot examples. Here's an overview of how generation works:

1. Create a template file with instructions for the type of data you want to generate.
2. Prepare a few-shot example file in JSON format.
3. Define the output structure for your data.
4. Use the DatasetPipeline class to generate and normalize data in one step.

The template file, few-shot examples, and output structure work together to produce high-quality, structured synthetic data. The output structure in particular guides the LLM to produce data in the specified format, which keeps the output consistent and properly typed.

Template Structure

For each generation task, a template file is required to guide the LLM on what to do. Below, we provide a brief overview of what the template files should look like.

Generator templates

Generate a list of U.S. postal addresses separated by double newlines.

Make them as realistic and diverse as possible.
Include some company address, P.O. boxes, apartment complexes, etc.
Ensure the addresses are fake.

{{few_shot}}

List:

The first few lines tell the chain to generate addresses and give creative instructions that determine the quality of the results.
{{few_shot}} tells the chain to insert the few-shot examples provided in the examples folder.
List: prompts the chain to return the results as a list.

The same convention should be followed when creating subsequent templates for various data generation tasks.
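
To make the role of the placeholder concrete, here is a minimal sketch of how a template and a few-shot file could be combined into a prompt. It is not fuxion's implementation: the helper names `load_few_shot` and `render_template` are hypothetical, and the few-shot file is assumed to be a JSON list of strings, which the source does not specify.

```python
import json
import tempfile
from pathlib import Path

PLACEHOLDER = "{{few_shot}}"


def load_few_shot(path):
    """Read few-shot examples, ASSUMED here to be a JSON list of strings."""
    examples = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(examples, list) or not all(isinstance(e, str) for e in examples):
        raise ValueError("expected a JSON list of strings")
    return examples


def render_template(template, examples):
    """Replace the {{few_shot}} placeholder with the examples,
    separated by double newlines as the sample template requests."""
    if PLACEHOLDER not in template:
        raise ValueError("template has no {{few_shot}} placeholder")
    return template.replace(PLACEHOLDER, "\n\n".join(examples))


# A shortened version of the address template shown above.
template = (
    "Generate a list of U.S. postal addresses separated by double newlines.\n\n"
    "{{few_shot}}\n\n"
    "List:"
)

# Write a throwaway few-shot file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    few_shot_path = Path(tmp) / "few_shot.json"
    few_shot_path.write_text(
        json.dumps(["12 Example St, Springfield, IL 62701", "P.O. Box 45, Anytown, TX 75001"]),
        encoding="utf-8",
    )
    prompt = render_template(template, load_few_shot(few_shot_path))

print(prompt)
```

The rendered prompt keeps the instructions at the top, the examples in the middle, and the `List:` cue at the end, which is the layout the template convention describes.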

Normalizer templates

In the latest version of fuxion, normalization is integrated directly into the pipeline process using an output_structure parameter. This eliminates the need for separate normalization templates.

Creating an Output Structure

The output structure is a key component in fuxion's generation process. It defines the format and types of the generated data, effectively combining generation and normalization. Here's an example of how to define an output structure:

output_structure = {
    "field_name1": data_type,
    "field_name2": data_type,
    # ... more fields as needed
}

For example, for normalizing names:

output_structure = {
    "title": str,
    "given": str,
    "middle": str,
    "surname": str,
    "suffix": str
}

Or for addresses:

output_structure = {
    "house_number": int,
    "street": str,
    "city": str,
    "state": str,
    "zip_code": str
}

This structure guides the LLM in formatting the generated data, ensuring consistent and properly typed output. For details on how to use this in a pipeline, refer to the Pipelines section.
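
Because an output structure is a plain mapping from field names to Python types, it can also be used to sanity-check records after generation. The following is a minimal sketch under that assumption; `validate_record` is a hypothetical helper, not a fuxion function, and the sample records are made up for illustration.

```python
def validate_record(record, output_structure):
    """Return a list of problems found in `record`; an empty list means it conforms."""
    problems = []
    # Every declared field must be present and have the declared type.
    for field, expected_type in output_structure.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Fields that the structure does not declare are reported as well.
    for field in record:
        if field not in output_structure:
            problems.append(f"unexpected field: {field}")
    return problems


output_structure = {
    "house_number": int,
    "street": str,
    "city": str,
    "state": str,
    "zip_code": str,
}

good = {"house_number": 12, "street": "Example St", "city": "Springfield",
        "state": "IL", "zip_code": "62701"}
bad = {"house_number": "12", "street": "Example St", "city": "Springfield",
       "state": "IL"}

print(validate_record(good, output_structure))  # no problems
print(validate_record(bad, output_structure))   # wrong type and a missing field
```

A check like this is useful when records are loaded back from disk, since JSON does not preserve the distinction between, say, a numeric string and an integer.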

Pipelines

The latest version of fuxion simplifies the normalization process by incorporating it directly into the pipeline using structured output. This removes the need for a separate normalization template, making it easier for users.

The DatasetPipeline class is the primary interface for generating synthetic datasets. It handles both generation and normalization in a single process. Here's an example of how to use it:

from fuxion.pipelines import DatasetPipeline
from rich import print

output_structure = {
    "title": str,
    "given": str,
    "middle": str,
    "surname": str,
    "suffix": str
}

pipeline_chain = DatasetPipeline(
    generator_template="examples/name_generator/generator.template",
    few_shot_file="examples/name_generator/few_shot.json",
    output_structure=output_structure,
    dataset_name="name_pipeline",
    k=5,
    model_name="gpt-4o",
    cache=False,
    verbose=True,
    temperature=1.0,
)

result = pipeline_chain.execute()
print(result)

This pipeline generates a dataset of 5 names, formatted according to the specified output structure. The output_structure parameter replaces the previous normalizer_template parameter, which simplifies the process and reduces complexity. The generated data is automatically saved to a file named name_dataset.json in the datasets directory.

fuxion can be used to generate synthetic data for various use cases, including rapid product testing and machine learning model training. By customizing the template, few-shot examples, and output structure, you can create diverse and realistic datasets tailored to your specific needs.

Models supported

gpt-3.5-turbo
gpt-4
gpt-4o
gpt-4o-mini
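
If the model name comes from user input or a config file, a small guard can reject unsupported values before any API call is made. This is a sketch only: `check_model_name` is a hypothetical helper, and the tuple simply mirrors the list above rather than anything exported by fuxion.

```python
# Mirrors the "Models supported" list above; not imported from fuxion.
SUPPORTED_MODELS = ("gpt-3.5-turbo", "gpt-4", "gpt-4o", "gpt-4o-mini")


def check_model_name(model_name):
    """Return model_name unchanged if it is listed above, else raise ValueError."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(
            f"unsupported model {model_name!r}; choose one of: "
            + ", ".join(SUPPORTED_MODELS)
        )
    return model_name
```

The return value can be passed straight to the `model_name` argument shown in the pipeline example, for instance `model_name=check_model_name("gpt-4o")`.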

Future work:

fuxion is still a work in progress, but it is a good starting point for anyone looking to generate synthetic data for testing and training machine learning models. We plan to add more features in the future, including seamless support for accurate data generation and normalization using other LLMs (locally hosted or via the Hugging Face API). For now, OpenAI's models are the most functional and reliable.

Feel free to contribute to this project by opening an issue or a pull request. We would love to hear your thoughts on how we can improve fuxion!

Release history

0.0.4.2 (Sep 19, 2024): this version
0.0.4.1 (Sep 19, 2024): yanked
0.0.4 (Sep 19, 2024): yanked
0.0.3.2 (Sep 13, 2024): yanked. Reason this release was yanked: fuxion has been updated to use structured output, which guarantees better generations.
0.0.3.1 (Sep 12, 2024): yanked
0.0.3 (Sep 12, 2024): yanked
0.0.2 (Apr 15, 2024): yanked
0.0.1 (Apr 5, 2024): yanked

Download files

Download the file for your platform.

Source Distribution

fuxion-0.0.4.1.tar.gz (1.1 MB)

Uploaded Sep 19, 2024 Source

Built Distribution

fuxion-0.0.4.1-py3-none-any.whl (1.1 MB)

Uploaded Sep 19, 2024 Python 3

File details

Details for the file fuxion-0.0.4.1.tar.gz.

File metadata

Download URL: fuxion-0.0.4.1.tar.gz
Upload date: Sep 19, 2024
Size: 1.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.4.2 CPython/3.10.6 Linux/6.9.3-76060903-generic

File hashes

Hashes for fuxion-0.0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`68014d5de484cb45cfeaf9a66e53fe96c119a99787dd1b4b3c0694ab2ec60fb3`
MD5	`2e8ee432600bb6df79374fb540d77b46`
BLAKE2b-256	`6edcedc0a32c5e5be509228464de65376e51a385bdf229129fbade0ab4be4b86`
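
A downloaded file can be checked against the published SHA256 digest with the standard library. The following is a minimal sketch, not an official PyPI or fuxion tool; the function names `sha256_of` and `matches_published_hash` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published value, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, `matches_published_hash("fuxion-0.0.4.1.tar.gz", "68014d5de484cb45cfeaf9a66e53fe96c119a99787dd1b4b3c0694ab2ec60fb3")` should return `True` if the download is intact and the file sits in the current directory.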

File details

Details for the file fuxion-0.0.4.1-py3-none-any.whl.

File metadata

Download URL: fuxion-0.0.4.1-py3-none-any.whl
Upload date: Sep 19, 2024
Size: 1.1 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.4.2 CPython/3.10.6 Linux/6.9.3-76060903-generic

File hashes

Hashes for fuxion-0.0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0ac75d3f13bc1fe22871bb284d8a441bf9b069e5415cd34513796e3a0bd2c209`
MD5	`7439ce0b5e05beef2c31502f095f5897`
BLAKE2b-256	`f5b5f2780df389dbd540ab9f97445890d7fe3e179df4a118ebe7fb8f77f57474`
