Unified flow interface for synthetic data generation and more

🌊 🔀 uniflow

License: Apache-2.0

uniflow is a unified flow interface for synthetic data generation. Generate synthetic data from text and data using one of the uniflow flows: DataGenFlow, DataGenTextFlow, TextPlusDataGenFlow, or SelfInstructedGenFlow.

Built by CambioML.

Features

uniflow lets you easily generate synthetic data from text and data. Here are the flows for common applications:

Data Generation from Data

Generate more Question-Answer pair data from some sample data using the DataGenFlow interface.

Example

First, load a CSV file with Question and Answer columns.

import pandas as pd

# Initiate flow
flow = DataGenFlow()
# Load data
qaa = pd.read_csv(YOUR_CSV_FILE, encoding="utf8")

Then you can generate more data from the loaded data.

input_dict = {"qaa": qaa[:]}
output_dict = flow(input_dict)
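
Before loading the CSV, it can help to confirm that it really has the two expected columns. The following is a minimal sketch using only the standard library, not part of uniflow. The helper name missing_qa_columns is hypothetical, and the exact header spellings "Question" and "Answer" are an assumption based on the description above.

```python
import csv
import io

# Assumed header names; adjust them if your CSV spells the columns differently.
REQUIRED_COLUMNS = ("Question", "Answer")


def missing_qa_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# In-memory sample standing in for YOUR_CSV_FILE.
sample = io.StringIO(
    "Question,Answer\n"
    "What is uniflow?,A unified flow interface for synthetic data generation.\n"
)
print(missing_qa_columns(sample))  # [] when both columns are present
```

With a real file, pass an open file handle, for example open(YOUR_CSV_FILE, newline="", encoding="utf8"), and load the data with pandas only if the returned list is empty.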

Data Generation from Text

Generate Question-Answer pair data from text using the DataGenTextFlow interface.

Example

First, load the text from which you would like to generate the Question-Answer data.

import os

with open(os.path.join(YOUR_FILE_PATH, "YOUR_FILE.txt"), "r") as file:
    context = file.read()

Then you can generate the Question-Answer data from the text.

flow = DataGenTextFlow()
input_dict = {"context": context}
output_dict = flow(input_dict)

Data Generation from Text plus Data

Using the TextPlusDataGenFlow, you can run the previous two flows in sequence to generate data from text, and then generate additional data from the generated data.

Example

First, load the text from which you would like to generate the Question-Answer data.

import os

with open(os.path.join(YOUR_FILE_PATH, "YOUR_FILE.txt"), "r") as file:
    context = file.read()

Then you can generate the Question-Answer data from the text, followed by additional data generated from those pairs, in a single call.

flow = TextPlusDataGenFlow()
input_dict = {"context": context}
output_dict = flow(input_dict)
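
To illustrate what "running two flows in sequence" means, here is a minimal sketch of chaining dict-in, dict-out callables. It is not uniflow's implementation: chain, text_to_qa, and qa_to_more_qa are hypothetical stand-ins that produce placeholder pairs without any model, and the assumption that the text step emits its pairs under a "qaa" key is made only for this illustration.

```python
from typing import Any, Callable, Dict

Flow = Callable[[Dict[str, Any]], Dict[str, Any]]


def chain(*flows: Flow) -> Flow:
    """Compose flows left to right: each output dict becomes the next input dict."""
    def run(input_dict: Dict[str, Any]) -> Dict[str, Any]:
        result = input_dict
        for flow in flows:
            result = flow(result)
        return result
    return run


def text_to_qa(input_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a text flow: one placeholder pair per non-empty line."""
    lines = [ln.strip() for ln in input_dict["context"].splitlines() if ln.strip()]
    pairs = [(f"What does line {i} say?", ln) for i, ln in enumerate(lines, 1)]
    return {"qaa": pairs}


def qa_to_more_qa(input_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a data flow: keep the pairs and add a reworded copy of each."""
    pairs = list(input_dict["qaa"])
    extra = [(q.replace("What does", "Can you restate what"), a) for q, a in pairs]
    return {"qaa": pairs + extra}


text_plus_data = chain(text_to_qa, qa_to_more_qa)
output_dict = text_plus_data({"context": "uniflow generates data.\nIt is built by CambioML."})
print(len(output_dict["qaa"]))  # 4: two pairs from the text, two derived from them
```

The point of the sketch is the data hand-off: the second step consumes exactly what the first step produced, so the caller only supplies the original text.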

Self Instructed Fine Tuning

Generate data from HTML text, and then fine-tune the model on the generated data using the SelfInstructedGenFlow interface.

First, place your HTML file in the same directory as your Python file.

Then you can generate the Question-Answer data from the HTML.

# Initiate flow
flow = SelfInstructedGenFlow()
input_dict = {constants.HTML_KEY: YOUR_HTML_FILE_PATH}

# Run flow
output_dict = flow(input_dict)
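
Since the flow is given a file path, a small check that the HTML file exists where you expect it can catch path mistakes early. This is a minimal standard-library sketch, not part of uniflow: resolve_html_path is a hypothetical helper, and the temporary directory below only stands in for the directory that holds your Python file.

```python
import tempfile
from pathlib import Path


def resolve_html_path(filename: str, base_dir: Path) -> str:
    """Return the absolute path of an HTML file in base_dir, or raise if it is unusable."""
    path = (base_dir / filename).resolve()
    if path.suffix.lower() not in {".html", ".htm"}:
        raise ValueError(f"not an HTML file: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"no such file: {path}")
    return str(path)


# Demo with a throwaway directory; in a script, base_dir would typically be
# Path(__file__).parent so the file is looked up next to the script.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "page.html").write_text("<p>hello</p>", encoding="utf-8")
    print(Path(resolve_html_path("page.html", Path(tmp))).name)  # page.html
```

The returned string can then be used as YOUR_HTML_FILE_PATH in the input dictionary.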

Installation

To get started with uniflow, you can install it using pip in a conda environment.

First, create a conda environment on your terminal using:

conda create -n uniflow python=3.10 -y
conda activate uniflow  # some OSes require `source activate uniflow`

Then install uniflow and the PyTorch build compatible with your OS:

pip3 install uniflow
pip3 install torch

Finally, if you have a GPU, install the PyTorch build that matches your CUDA version. You can find your CUDA version via nvcc -V.

pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121  # cu121 means cuda 12.1
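
The mapping from the nvcc -V output to the tag in the index URL can be sketched as follows. This is a minimal, hedged example: cuda_wheel_tag and detect_cuda_wheel_tag are hypothetical helpers, the parsing assumes the usual "release X.Y" line in the nvcc output, and a matching PyTorch index is not guaranteed to exist for every CUDA version.

```python
import re
import shutil
import subprocess
from typing import Optional


def cuda_wheel_tag(nvcc_output: str) -> Optional[str]:
    """Turn the 'release X.Y' part of `nvcc -V` output into a tag such as 'cu121'."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    major, minor = match.groups()
    return f"cu{major}{minor}"


def detect_cuda_wheel_tag() -> Optional[str]:
    """Run `nvcc -V` if it is on PATH; return None when nvcc is not installed."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True, check=False)
    return cuda_wheel_tag(result.stdout)


# Parsing a sample line of the kind nvcc prints.
sample = "Cuda compilation tools, release 12.1, V12.1.105"
print(cuda_wheel_tag(sample))  # cu121
```

If detect_cuda_wheel_tag() returns None, nvcc was not found or its output was not recognized, and the plain pip3 install torch from the previous step is the fallback.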

Congrats, you have finished the installation!

Dev Setup

If you are interested in contributing, here are the preliminary development setup steps.

Backend Dev Setup

conda create -n uniflow python=3.10
conda activate uniflow
cd uniflow
pip3 install poetry
poetry install --no-root

EC2 Dev Setup

If you are on EC2, you can launch a GPU instance with the following config:

  • EC2 g4dn.xlarge (if you want to run a pretrained LLM with 7B parameters)
  • Deep Learning AMI PyTorch GPU 2.0.1 (Ubuntu 20.04)
  • EBS: at least 100 GB
