Unified flow interface for synthetic data generation and more
🌊 🔀 uniflow
uniflow is a unified flow interface for synthetic data generation. Generate synthetic data from text and data using one of the uniflow flows, including DataGenFlow, DataGenTextFlow, TextPlusDataGenFlow, and SelfInstructedGenFlow.
Built by CambioML.
Features
uniflow lets you easily generate synthetic data from text and data. Here are the flows for common applications:
Data Generation from Data
Generate more Question-Answer pair data from some sample data using the DataGenFlow interface.
Example
First, load a CSV file with Question and Answer columns.
import pandas as pd
# Initiate flow
flow = DataGenFlow()
# Load data
qaa = pd.read_csv(f"{YOUR_CSV_FILE}", encoding="utf8")
Then you can generate more data from the loaded data.
input_dict = {"qaa": qaa[:]}
output_dict = flow(input_dict)
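Before running the flow, it can help to confirm that the CSV header contains the expected columns. The following is a minimal sketch using only the standard library. The exact header spellings `Question` and `Answer` and the helper name `missing_qa_columns` are assumptions for illustration and are not part of the uniflow API.

```python
import csv
import io

# Assumed header names, taken from the description above; adjust to your file.
REQUIRED_COLUMNS = ("Question", "Answer")


def missing_qa_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty list if the text has no rows
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


sample = "Question,Answer\nWhat is uniflow?,A unified flow interface.\n"
print(missing_qa_columns(sample))  # an empty list means both columns exist
```

An empty result suggests the file is safe to load with `pd.read_csv` as shown above.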
Data Generation from Text
Generate Question-Answer pair data from text using the DataGenTextFlow interface.
Example
First, load a text from which you would like to generate the Question-Answer data.
import os
with open(os.path.join(f"{YOUR_FILE_PATH}", "YOUR_FILE.txt"), "r") as file:
    context = file.read()
Then you can generate the Question-Answer data from the text.
flow = DataGenTextFlow()
input_dict = {"context": context}
output_dict = flow(input_dict)
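The loading step can also be wrapped in a small helper that reads the file with an explicit encoding and rejects empty input early. This is only a sketch: `build_context_input` is a hypothetical name, and the only thing taken from the example above is the `"context"` key.

```python
from pathlib import Path


def build_context_input(text_path):
    """Read a UTF-8 text file and wrap it in the dict shape used above."""
    context = Path(text_path).read_text(encoding="utf-8")
    if not context.strip():
        # Fail early rather than sending an empty context to the flow.
        raise ValueError(f"{text_path} is empty; nothing to generate from")
    return {"context": context}
```

The returned dict can then be passed to the flow in place of the hand-built `input_dict`.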
Data Generation from Text plus Data
Using the TextPlusDataGenFlow, you can run the previous two flows in sequence to generate data from text, and then generate additional data from the generated data.
Example
First, load a text from which you would like to generate the Question-Answer data.
import os
with open(os.path.join(f"{YOUR_FILE_PATH}", "YOUR_FILE.txt"), "r") as file:
    context = file.read()
Then you can generate the Question-Answer data from the text.
flow = TextPlusDataGenFlow()
input_dict = {"context": context}
output_dict = flow(input_dict)
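Conceptually, running two flows in sequence means feeding the output dict of the first into the second. The sketch below illustrates that idea with plain functions. `run_in_sequence`, `text_to_qa`, and `expand_qa` are hypothetical stand-ins, and the assumption that the intermediate result is keyed by `"qaa"` is ours; the actual internals of TextPlusDataGenFlow may differ.

```python
def run_in_sequence(flows, input_dict):
    """Feed each flow's output dict into the next flow and return the last output."""
    data = input_dict
    for flow in flows:
        data = flow(data)
    return data


# Hypothetical stand-ins for the two stages; real flows would call a model.
def text_to_qa(d):
    return {"qaa": [("What is this text about?", d["context"][:20])]}


def expand_qa(d):
    return {"qaa": d["qaa"] + [(q + " (rephrased)", a) for q, a in d["qaa"]]}


output_dict = run_in_sequence([text_to_qa, expand_qa], {"context": "uniflow demo text"})
print(len(output_dict["qaa"]))  # 2
```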
Self Instructed Fine Tuning
Generate data from HTML text, and then fine-tune the model on the generated data using the SelfInstructedGenFlow interface.
First, place your HTML file in the same directory as your Python file.
Then you can generate the Question-Answer data from the HTML.
# Initiate flow
flow = SelfInstructedGenFlow()
input_dict = {constants.HTML_KEY: YOUR_HTML_FILE_PATH}
# Run flow
output_dict = flow(input_dict)
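Since the HTML file is expected to sit next to the Python file, one way to build `YOUR_HTML_FILE_PATH` is to resolve it relative to the script. This is a minimal sketch: `html_next_to` is a hypothetical helper, and whether the flow expects a string path (as returned here) is an assumption.

```python
from pathlib import Path


def html_next_to(script_file, html_name):
    """Return the absolute path of html_name in the same directory as script_file."""
    path = Path(script_file).resolve().parent / html_name
    if not path.is_file():
        raise FileNotFoundError(f"expected HTML file at {path}")
    return str(path)


# Inside your script you might write (file name is a placeholder):
# YOUR_HTML_FILE_PATH = html_next_to(__file__, "YOUR_FILE.html")
```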
Installation
To get started with uniflow, you can install it using pip in a conda environment.
First, create a conda environment on your terminal using:
conda create -n uniflow python=3.10 -y
conda activate uniflow  # some operating systems require `source activate uniflow`
Then install uniflow and the PyTorch build compatible with your OS:
pip3 install uniflow
pip3 install torch
Finally, if you have a GPU, install the PyTorch build that matches your CUDA version. You can find your CUDA version via nvcc -V. The command below installs a nightly pre-release build for CUDA 12.1:
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121 # cu121 means cuda 12.1
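To check the result, a short script can report whether PyTorch is importable and whether it sees a CUDA device. This is a sketch rather than an official verification step; `torch_status` is a hypothetical helper, while `torch.cuda.is_available()`, `torch.__version__`, and `torch.version.cuda` are standard PyTorch attributes.

```python
import importlib.util


def torch_status():
    """Summarize whether PyTorch is importable and whether it can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the check also works without torch

    if torch.cuda.is_available():
        return f"torch {torch.__version__} with CUDA {torch.version.cuda}"
    return f"torch {torch.__version__} (CPU only or no visible GPU)"


print(torch_status())
```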
Congrats, you have finished the installation!
Dev Setup
If you are interested in contributing, here are the preliminary development setup steps.
Backend Dev Setup
conda create -n uniflow python=3.10
conda activate uniflow
cd uniflow
pip3 install poetry
poetry install --no-root
EC2 Dev Setup
If you are on EC2, you can launch a GPU instance with the following config:
- EC2: g4dn.xlarge (if you want to run a pretrained LLM with 7B parameters)
- AMI: Deep Learning AMI PyTorch GPU 2.0.1 (Ubuntu 20.04)
- EBS: at least 100G
File details
Details for the file uniflow-0.0.2.tar.gz.
File metadata
- Download URL: uniflow-0.0.2.tar.gz
- Upload date:
- Size: 18.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/22.4.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 20b775bae723685ec7993077f9754787e0d024060ff0958b244ebd94c71fc6c4 |
| MD5 | 8f608b33b9460f40ce5794504f60b6e1 |
| BLAKE2b-256 | e4eb9238d11e954b1b2c041dd87d7d3d9d6641426070b835db9bb1a4ba214051 |
File details
Details for the file uniflow-0.0.2-py3-none-any.whl.
File metadata
- Download URL: uniflow-0.0.2-py3-none-any.whl
- Upload date:
- Size: 25.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.6.1 CPython/3.10.13 Darwin/22.4.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9a36757a76212958e7798e28e0fb441fbe321c2e035f24159b850e2051f53504 |
| MD5 | 4ad756b6b3fe4e7a065d2d39d20c5cf4 |
| BLAKE2b-256 | 6e524a9d96c5a3a8cdba9588a589ca7ba91d8d2dbd04526eb3110e83caee4ab9 |