Data preparation system for building controllable AI systems

⬜️ Open Datagen ⬜️

Open Datagen is a data preparation tool designed to build controllable AI systems.

It offers improvements for:

RAG: Generate large Q&A datasets to improve your Retrieval strategies.

Evals: Create unique, “unseen” datasets to robustly test your models and avoid overfitting.

Fine-Tuning: Produce large, low-bias, and high-quality datasets to get better models after the fine-tuning process.

Guardrails: Generate red-teaming datasets to strengthen the security and robustness of your Generative AI applications against attacks.

Additional Features

  • Use external sources to generate high-quality synthetic data (local files, Hugging Face datasets, and the internet)

  • Data anonymization

  • Open-source model support + local inference

  • Decontamination

  • Tree of thought

  • (SOON) No-code dataset generation

  • (SOON) Multimodality

Installation

pip install --upgrade opendatagen
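
To confirm that the install succeeded, one option is to query the installed version from Python's standard library. This is a minimal sketch: the helper name `installed_version` is introduced here for illustration and is not part of Open Datagen.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string, or None if the pip install did not succeed.
    print(installed_version("opendatagen"))
```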

Setting up your API keys

export OPENAI_API_KEY='your_openai_api_key'  # uses openai>=1.2
export MISTRAL_API_KEY='your_mistral_api_key'
export TOGETHER_API_KEY='your_together_api_key'
export ANYSCALE_API_KEY='your_anyscale_api_key'
export ELEVENLABS_API_KEY='your_elevenlabs_api_key'
export SERPLY_API_KEY='your_serply_api_key'  # Google Search API
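
Before running a generation job, it can help to check which of these variables are actually set in the current environment. The sketch below uses only the standard library. The helper name `missing_api_keys` is hypothetical, and which keys you really need presumably depends on the providers your template uses, so a missing key is not necessarily an error.

```python
import os
from typing import List, Mapping, Optional, Sequence

# Variable names copied from the export lines above.
API_KEY_VARS = (
    "OPENAI_API_KEY",
    "MISTRAL_API_KEY",
    "TOGETHER_API_KEY",
    "ANYSCALE_API_KEY",
    "ELEVENLABS_API_KEY",
    "SERPLY_API_KEY",
)


def missing_api_keys(
    names: Sequence[str] = API_KEY_VARS,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names in `names` that are unset or empty in `env`.

    `env` defaults to the current process environment (os.environ).
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All listed API keys are set.")
```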

Usage

Example: Generate a low-bias FAQ dataset based on Wikipedia content

from opendatagen.template import TemplateManager
from opendatagen.data_generator import DataGenerator

output_path = "opendatagen.csv"
template_name = "opendatagen"

# Load the template file and select the template to use by name.
manager = TemplateManager(template_file_path="faq_wikipedia.json")
template = manager.get_template(template_name=template_name)

if template:
    generator = DataGenerator(template=template)

    # Generate the dataset and write it to output_path.
    data, data_decontaminated = generator.generate_data(
        output_path=output_path,
        output_decontaminated_path=None,
    )

where faq_wikipedia.json is here
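
Once generation finishes, the CSV written to output_path can be inspected with the standard library. The sketch below makes no assumption about the column names, which depend on the template. It assumes only that the output is a regular UTF-8 CSV file with a header row. `summarize_csv` is a hypothetical helper, not part of Open Datagen.

```python
import csv
import os
from typing import List, Tuple


def summarize_csv(path: str) -> Tuple[List[str], int]:
    """Return (column names, number of data rows) for a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no rows at all
        row_count = sum(1 for _ in reader)
    return header, row_count


if __name__ == "__main__":
    # "opendatagen.csv" is the output_path used in the example above.
    if os.path.exists("opendatagen.csv"):
        columns, rows = summarize_csv("opendatagen.csv")
        print(f"{rows} rows, columns: {columns}")
```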

Contribution

We welcome contributions to Open Datagen! Whether you're looking to fix bugs, add templates or new features, or improve documentation, your help is greatly appreciated.

Acknowledgements

We would like to express our gratitude to the following open source projects and individuals that have inspired and helped us:

Connect

If you need help with your Generative AI strategy, implementation, or infrastructure, reach us on:

LinkedIn: @Thomas. Twitter: @thoddnn.
