Easiest and fastest way to 1B synthetic tokens

fastdata

fastdata is a minimal library for generating synthetic data for training deep learning models. The example below generates a dataset for training a language model to translate from English to Spanish.

First, define the structure of the data you want to generate. claudette, the library fastdata uses to generate data, requires a schema for that data, defined here as a Python class.

from fastcore.utils import *
class Translation():
    "Translation from an English phrase to a Spanish phrase"
    def __init__(self, english: str, spanish: str): store_attr()
    def __repr__(self): return f"{self.english} ➡ *{self.spanish}*"

Translation("Hello, how are you today?", "Hola, ¿cómo estás hoy?")
Hello, how are you today? ➡ *Hola, ¿cómo estás hoy?*
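
For readers unfamiliar with fastcore: store_attr() assigns each __init__ argument to an attribute of the same name. A rough plain-Python equivalent of the class above is sketched here. The name PlainTranslation is made up for this sketch, and whether claudette treats it identically to the fastcore version is an assumption.

```python
class PlainTranslation:
    "Translation from an English phrase to a Spanish phrase"
    def __init__(self, english: str, spanish: str):
        # What store_attr() does implicitly: copy each argument onto self.
        self.english = english
        self.spanish = spanish

    def __repr__(self):
        return f"{self.english} ➡ *{self.spanish}*"

example = PlainTranslation("Hello, how are you today?", "Hola, ¿cómo estás hoy?")
print(example)
```

The type annotations and the docstring are kept because they are what describes the fields of the schema.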

Next, define the prompt template used to generate the data and the inputs that fill in its placeholders.

prompt_template = """\
Generate English and Spanish translations on the following topic:
<topic>{topic}</topic>
"""

inputs = [{"topic": "Otters are cute"}, {"topic": "I love programming"}]
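
Each dictionary in inputs supplies a value for every placeholder in the template. To preview the prompts before spending API calls, you can fill the template yourself with str.format. This is only a sketch: it assumes the placeholders are filled in the ordinary str.format way and does not show fastdata's internals.

```python
prompt_template = """\
Generate English and Spanish translations on the following topic:
<topic>{topic}</topic>
"""

inputs = [{"topic": "Otters are cute"}, {"topic": "I love programming"}]

# Fill the {topic} placeholder once per input dictionary.
prompts = [prompt_template.format(**inp) for inp in inputs]

for p in prompts:
    print(p)
```

A KeyError here means an input dictionary is missing a key that the template expects.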

Finally, we can generate some data with fastdata.

Note: fastdata currently supports only Anthropic models. Make sure you have an Anthropic API key, and either set the appropriate environment variable or pass the key directly to the FastData class: FastData(api_key="sk-ant-api03-...").
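
A quick way to check the environment before generating anything is sketched below. It assumes the key lives in ANTHROPIC_API_KEY, the variable read by the Anthropic Python SDK. The helper name has_anthropic_key is hypothetical and not part of fastdata.

```python
import os

def has_anthropic_key(env=os.environ) -> bool:
    # True when ANTHROPIC_API_KEY is present and not just whitespace.
    return bool(env.get("ANTHROPIC_API_KEY", "").strip())

if not has_anthropic_key():
    print("ANTHROPIC_API_KEY is not set; pass api_key=... to FastData instead.")
```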

from fastdata.core import FastData
fast_data = FastData(model="claude-3-haiku-20240307")
translations = fast_data.generate(
    prompt_template=prompt_template,
    inputs=inputs,
    schema=Translation,
)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.57it/s]
from IPython.display import Markdown
Markdown("\n".join(f'- {t}' for t in translations))
  • I love programming ➡ Me encanta la programación
  • Otters are cute ➡ Las nutrias son lindas
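
To use the results for training, you will usually want them on disk. The sketch below writes objects shaped like Translation to a JSON Lines file. The Translation class here is a plain stand-in, the sample list is hand-written rather than model output, the helper save_jsonl is hypothetical, and skipping None entries is an assumption about how failed generations might appear.

```python
import json
import tempfile
from pathlib import Path

class Translation:
    "Stand-in for the schema class defined earlier"
    def __init__(self, english: str, spanish: str):
        self.english = english
        self.spanish = spanish

def save_jsonl(items, path):
    # Write one JSON object per line and return how many were written.
    written = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for item in items:
            if item is None:  # assumed marker for a failed generation
                continue
            record = {"english": item.english, "spanish": item.spanish}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written

sample = [
    Translation("I love programming", "Me encanta la programación"),
    None,
    Translation("Otters are cute", "Las nutrias son lindas"),
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "translations.jsonl"
    count = save_jsonl(sample, out)
    lines = out.read_text(encoding="utf-8").splitlines()

print(count, "records written")
```

ensure_ascii=False keeps accented characters readable in the file instead of escaping them.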

Installation

Install the latest version from the GitHub repository:

$ pip install git+https://github.com/AnswerDotAI/fastdata.git

or from PyPI:

$ pip install python-fastdata
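
Note that the PyPI distribution is named python-fastdata while the import name is fastdata. One way to confirm which version is installed, sketched with the standard library only (the helper name installed_version is hypothetical):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name: str = "python-fastdata"):
    # Return the installed version string, or None if the distribution is absent.
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

print(installed_version())
```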

To see how best to generate data with fastdata, check out our blog post and the examples in the examples directory.

Developer Guide

If you are new to nbdev, here are some useful pointers to get you started.

Install fastdata in development mode

# make sure fastdata package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# run nbdev_prepare so the notebook changes apply to the fastdata package
$ nbdev_prepare
