Prompt. Generate Synthetic Data. Train & Align Models.

DataDreamer
https://datadreamer.dev

DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows. It is designed to be simple, extremely efficient, and research-grade.

Installation

pip3 install datadreamer.dev
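After installing, it can be useful to confirm that the package is visible to the interpreter you intend to use. The sketch below relies only on the standard library and looks up the distribution name from the install command above; the helper name installed_version is hypothetical and is not part of DataDreamer.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "datadreamer.dev") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    installed_version is a hypothetical helper for this example, not a
    DataDreamer API. The default name matches the pip install command.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("datadreamer.dev is not installed in this environment")
    else:
        print(f"datadreamer.dev {found} is installed")
```

If the lookup reports that the package is missing, the install most likely ran against a different Python environment than the one running the check.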

[Figure: demo.py and the result of running demo.py]

See the full demo script, the synthetic dataset, and the trained model.

🚀 For more demonstrations and recipes see the Quick Tour page.

With DataDreamer you can:

  • 💬 Create Prompting Workflows: Create and run complex, multi-step prompting workflows easily with major open-source or API-based LLMs.
  • 📊 Generate Synthetic Datasets: Generate synthetic datasets for novel tasks or augment existing datasets with LLMs.
  • ⚙️ Train Models: Align models. Fine-tune models. Instruction-tune models. Distill models. Train on existing data or synthetic data.
  • ... learn more about what's possible in the Overview Guide

DataDreamer is:

  • 🧩 Simple: Simple and approachable to use with sensible defaults, yet powerful with support for bleeding-edge techniques.
  • 🔬 Research-Grade: Built for researchers, by researchers, but accessible to all. A focus on correctness, best practices, and reproducibility.
  • 🏎️ Efficient: Aggressive caching and resumability are built in. Support for techniques like quantization, parameter-efficient training (LoRA), and more.
  • 🔄 Reproducible: Workflows built with DataDreamer are easily shareable, reproducible, and extendable.
  • 🤝 Makes Sharing Easy: Publishing datasets and models is simple. Automatically generate data cards and model cards with metadata. Generate a list of any citations required.
  • ... learn more about the motivation and design principles behind DataDreamer.
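
To make the caching and resumability idea above concrete, here is a minimal standard-library sketch of the general technique: a step's result is saved to disk under a key derived from its name and arguments, so a re-run skips work that is already done. This is an illustration only and does not show how DataDreamer implements caching; cached_step and fake_generate are hypothetical names invented for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def cached_step(
    output_dir: Path, name: str, args: dict, compute: Callable[[dict], Any]
) -> Any:
    """Run compute(args) once; later calls with the same name and args reuse the saved result.

    Hypothetical helper for illustration, not a DataDreamer API.
    """
    # The cache key covers the step name and its arguments, so changing either
    # one forces a recompute instead of returning a stale result.
    key_source = json.dumps({"name": name, "args": args}, sort_keys=True)
    key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()[:16]
    cache_file = output_dir / f"{name}-{key}.json"

    if cache_file.exists():
        # Resume: load the saved result instead of recomputing it.
        return json.loads(cache_file.read_text(encoding="utf-8"))

    result = compute(args)
    output_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []


def fake_generate(args: dict) -> list:
    """Stand-in for an expensive LLM call; records each real invocation."""
    calls.append(args)
    return [f"sample {i}" for i in range(args["n"])]


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output"
    first = cached_step(out, "generate", {"n": 3}, fake_generate)
    second = cached_step(out, "generate", {"n": 3}, fake_generate)  # served from disk
    third = cached_step(out, "generate", {"n": 2}, fake_generate)  # new args, recomputed
    print(len(calls))  # number of times the expensive function actually ran
```

Keying the cache on the arguments as well as the step name is the design choice that matters here: an interrupted run can pick up where it left off, while a changed configuration never silently reuses old output.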

Citation

Please cite the DataDreamer paper:

@misc{patel2024datadreamer,
      title={DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows}, 
      author={Ajay Patel and Colin Raffel and Chris Callison-Burch},
      year={2024},
      eprint={2402.10379},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

Please reach out to us via email (ajayp@upenn.edu) or on Discord if you have any questions, comments, or feedback.



Copyright © 2024, Ajay Patel. Released under the MIT License.

Thank you to the maintainers at Hugging Face and LiteLLM for accepting contributions necessary for DataDreamer and providing upstream support.
