
Dataformer is a library to create data for LLMs.


Solving data for LLMs - Create quality synthetic datasets!

Why Dataformer?

Dataformer gives engineers a framework for creating high-quality synthetic datasets for AI, with a focus on speed, reliability, and scalability. Our mission is to speed up your AI development by enabling rapid generation of diverse datasets grounded in published research methods. Compute is expensive and output quality matters, so Dataformer lets you put data quality first to address both. With good synthetic data in place, you can spend your time raising and maintaining the quality of your models.

One API, Multiple Providers

We integrate with multiple LLM providers through one unified API and let you make parallel async API calls while respecting rate limits. Responses from LLM providers can optionally be cached, which avoids redundant API calls and reduces cost.
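To illustrate the general idea, here is a minimal sketch of bounded-concurrency async calls combined with a response cache. It is a generic pattern and not Dataformer's implementation: `fake_llm_call`, `cached_call`, and `run_all` are hypothetical names, the provider request is simulated with `asyncio.sleep`, and a concurrency cap is used as a simple stand-in for real rate limits (which are usually expressed as requests or tokens per minute).

```python
import asyncio

# Hypothetical stand-in for a provider request; a real client would do network I/O here.
async def fake_llm_call(prompt: str) -> str:
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"

async def cached_call(prompt, cache, semaphore, stats):
    # Serve repeated prompts from the cache instead of calling the provider again.
    if prompt in cache:
        stats["hits"] += 1
        return cache[prompt]
    async with semaphore:  # cap the number of in-flight requests
        result = await fake_llm_call(prompt)
    cache[prompt] = result
    stats["calls"] += 1
    return result

async def run_all(prompts, max_concurrency=2):
    cache, stats = {}, {"hits": 0, "calls": 0}
    semaphore = asyncio.Semaphore(max_concurrency)
    # Deduplicate first so identical prompts share a single request.
    unique = list(dict.fromkeys(prompts))
    await asyncio.gather(*(cached_call(p, cache, semaphore, stats) for p in unique))
    # Second pass resolves every prompt (including duplicates) from the cache.
    results = [await cached_call(p, cache, semaphore, stats) for p in prompts]
    return results, stats

results, stats = asyncio.run(run_all(["a", "b", "a"]))
```

In this sketch, three prompts with one duplicate lead to only two simulated provider calls; the rest are answered from the in-memory cache. A persistent cache (for example on disk) would follow the same shape.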

Research-Backed Iteration at Scale

Apply methods from state-of-the-art research papers to generate synthetic data, with adaptability, scalability, and resilience built in. Shift your focus from infrastructure concerns to refining your data and improving your models.

Installation

GitHub Source:

pip install dataformer@git+https://github.com/DataformerAI/dataformer.git 

Using Git:

git clone https://github.com/DataformerAI/dataformer.git
cd dataformer
pip install .
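After either install route, one way to confirm the package is visible to your interpreter is to query its installed version through the standard library. This is a small sketch; `installed_version` is a hypothetical helper name, not part of Dataformer.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Prints the version string if dataformer is installed, otherwise None.
print(installed_version("dataformer"))
```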

Quick Start

AsyncLLM supports various API providers, including:

  • OpenAI
  • Groq
  • Together
  • DeepInfra
  • OpenRouter

Choose the provider that best suits your needs!

Here's a quick example of how to use Dataformer's AsyncLLM for efficient asynchronous generation:

from dataformer.llms import AsyncLLM
from dataformer.utils import get_request_list, get_messages
from datasets import load_dataset

# Load a sample dataset
dataset = load_dataset("dataformer/self-knowledge", split="train").select(range(3))
instructions = [example["question"] for example in dataset]

# Prepare the request list
sampling_params = {"temperature": 0.7}
request_list = get_request_list(instructions, sampling_params)

# Initialize AsyncLLM with your preferred API provider
llm = AsyncLLM(api_provider="groq", model="llama-3.1-8b-instant")

# Generate responses asynchronously
response_list = get_messages(llm.generate(request_list))

Contribute

We welcome contributions! Check our issues or open a new one to get started.

Join Community

Join Dataformer on Discord


