Data preparation system for building controllable AI systems

Project description

⬜️ Open Datagen ⬜️

Open Datagen is a Data Preparation Tool designed to build Controllable AI Systems.

It offers improvements for:

RAG: Generate large Q&A datasets to improve your Retrieval strategies.

Evals: Create unique, “unseen” datasets to robustly test your models and avoid overfitting.

Fine-Tuning: Produce large, low-bias, and high-quality datasets to get better models after the fine-tuning process.

Guardrails: Generate red teaming datasets to strengthen the security and robustness of your Generative AI applications against attacks.

Additional Features

  • Use external sources to generate high-quality synthetic data (local files, Hugging Face datasets, and the Internet)

  • Data anonymization

  • Open-source model support + local inference

  • Decontamination

  • Tree of thought

  • Multimodality (Text, Audio and Image)

  • (SOON) No-code dataset generation

Installation

pip install --upgrade opendatagen
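
To confirm that the install worked, one option is to ask Python which version it sees. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and is not part of opendatagen.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if opendatagen is installed in this environment.
found = installed_version("opendatagen")
print(found if found else "opendatagen is not installed")
```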

Setting up your API keys

export OPENAI_API_KEY='your_openai_api_key' #(using openai>=1.2)
export MISTRAL_API_KEY='your_mistral_api_key'
export TOGETHER_API_KEY='your_together_api_key'
export ANYSCALE_API_KEY='your_anyscale_api_key'
export ELEVENLABS_API_KEY='your_elevenlabs_api_key'
export SERPLY_API_KEY='your_serply_api_key' #Google Search API 
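
Before running a generation job, it can help to check which of these variables are visible to the Python process. The sketch below only reads the environment and does not call any provider. The helper name `check_api_keys` is hypothetical, and which keys you actually need depends on the models and sources your template uses (an assumption, not something this page states).

```python
import os
from typing import Dict, Mapping, Optional

# Variable names taken from the export commands above.
KEY_NAMES = (
    "OPENAI_API_KEY",
    "MISTRAL_API_KEY",
    "TOGETHER_API_KEY",
    "ANYSCALE_API_KEY",
    "ELEVENLABS_API_KEY",
    "SERPLY_API_KEY",
)


def check_api_keys(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Map each expected variable name to True if it is set and non-empty."""
    env = os.environ if env is None else env
    return {name: bool(env.get(name, "").strip()) for name in KEY_NAMES}


# Report the status of each key without printing its value.
for name, is_set in check_api_keys().items():
    print(f"{name}: {'set' if is_set else 'missing'}")
```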

Usage

Example: Generate a low-bias FAQ dataset based on Wikipedia content

from opendatagen.template import TemplateManager
from opendatagen.data_generator import DataGenerator

output_path = "opendatagen.csv"
template_name = "opendatagen"

# Load the named template from the template file.
manager = TemplateManager(template_file_path="faq_wikipedia.json")
template = manager.get_template(template_name=template_name)

# Only generate if the template was found.
if template:
    generator = DataGenerator(template=template)

    data, data_decontaminated = generator.generate_data(
        output_path=output_path,
        output_decontaminated_path=None,
    )

where faq_wikipedia.json is the template file (linked from the original project page).
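
After a run, the output file can be inspected with the standard library. The sketch below assumes that the file written to `output_path` is a regular CSV with a header row. The column names depend on the template and are not assumed here. `preview_csv` is a hypothetical helper, not part of opendatagen.

```python
import csv
from itertools import islice
from typing import Dict, List


def preview_csv(path: str, limit: int = 5) -> List[Dict[str, str]]:
    """Return up to `limit` rows of a CSV file as dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(islice(csv.DictReader(handle), limit))
```

For example, `preview_csv("opendatagen.csv", limit=3)` would return the first three generated rows once the file exists.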

Contribution

We welcome contributions to Open Datagen! Whether you're looking to fix bugs, add templates, new features, or improve documentation, your help is greatly appreciated.

Acknowledgements

We would like to express our gratitude to the following open source projects and individuals that have inspired and helped us:

Connect

If you need help with your Generative AI strategy, implementation, or infrastructure, reach us on:

LinkedIn: @Thomas. Twitter: @thoddnn.


Download files

Source Distribution

opendatagen-0.0.35.tar.gz (31.3 kB)

Uploaded Source

Built Distribution

opendatagen-0.0.35-py3-none-any.whl (50.8 kB)

Uploaded Python 3

File details

Details for the file opendatagen-0.0.35.tar.gz.

File metadata

  • Download URL: opendatagen-0.0.35.tar.gz
  • Upload date:
  • Size: 31.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for opendatagen-0.0.35.tar.gz

Algorithm    Hash digest
SHA256       f42031dd316ce1ba70b2e9fc5e275764e3d4c29234f92684740a64fe2412db3a
MD5          21a5c08799741b94d961dbaea83995b2
BLAKE2b-256  377e0002ecf508db4fd80a28a78ed17f3d02b8bc10cd1c3289ec22107aab164e

File details

Details for the file opendatagen-0.0.35-py3-none-any.whl.

File metadata

  • Download URL: opendatagen-0.0.35-py3-none-any.whl
  • Upload date:
  • Size: 50.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for opendatagen-0.0.35-py3-none-any.whl

Algorithm    Hash digest
SHA256       db136ced2d6493e7f53f338a83e4aa7987cac20a70a2876477ae24e99cd6740c
MD5          58b2a60149954d560395e43a4f904d3a
BLAKE2b-256  a70c95a58e6cd4114d1632615d5f2391e1463cd96ee4300f4b5624ec082f5dfd
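
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using `hashlib` from the standard library; the helper names `sha256_of_file` and `matches_sha256` are illustrative, and the file is assumed to have been downloaded to the local path you pass in.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, the call would look like `matches_sha256("opendatagen-0.0.35.tar.gz", "f42031dd316ce1ba70b2e9fc5e275764e3d4c29234f92684740a64fe2412db3a")`, using the SHA256 value listed for that file.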
