Data preparation system to build controllable AI systems

⬜️ Open Datagen ⬜️

Open Datagen is a Data Preparation Tool designed to build Controllable AI Systems.

It offers improvements for:

RAG: Generate large Q&A datasets to improve your Retrieval strategies.

Evals: Create unique, “unseen” datasets to robustly test your models and avoid overfitting.

Fine-Tuning: Produce large, low-bias, and high-quality datasets to get better models after the fine-tuning process.

Guardrails: Generate red teaming datasets to strengthen the security and robustness of your Generative AI applications against attack.

Additional Features

  • Use external sources to generate high-quality synthetic data (local files, Hugging Face datasets, and the Internet)

  • Data anonymization

  • Open-source model support + local inference

  • Decontamination

  • Tree of thought

  • (SOON) No-code dataset generation

  • (SOON) Multimodality
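
To make the "Decontamination" item above concrete, here is a minimal sketch of one common approach: dropping generated samples that share a word n-gram with a reference set, such as an evaluation benchmark. This is an illustration only. The function names `ngrams` and `decontaminate` and the default `n=5` are assumptions made for the example and are not taken from Open Datagen, whose actual decontamination method is not described here.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(
    generated: Iterable[str], reference: Iterable[str], n: int = 5
) -> List[str]:
    """Keep only generated samples that share no word n-gram with `reference`.

    Hypothetical helper for illustration; not part of the opendatagen API.
    """
    reference_ngrams: Set[Tuple[str, ...]] = set()
    for text in reference:
        reference_ngrams |= ngrams(text, n)
    return [s for s in generated if not (ngrams(s, n) & reference_ngrams)]


if __name__ == "__main__":
    benchmark = ["yesterday the quick brown fox jumps over a fence"]
    samples = [
        "the quick brown fox jumps over the lazy dog",
        "a completely different sentence about data generation tools",
    ]
    # Only the second sample survives: the first shares a 5-gram with the benchmark.
    print(decontaminate(samples, benchmark))
```

A smaller `n` filters more aggressively and risks discarding harmless samples, while a larger `n` only catches near-verbatim overlap.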

Installation

pip install --upgrade opendatagen
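
To confirm that the install succeeded, you can query the installed version from Python. The sketch below uses only the standard library (`importlib.metadata`, available in Python 3.8+). The helper name `installed_version` is made up for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("opendatagen")
    print(version if version else "opendatagen is not installed")
```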

Setting up your API keys

export OPENAI_API_KEY='your_openai_api_key'  # requires openai>=1.2
export MISTRAL_API_KEY='your_mistral_api_key'
export TOGETHER_API_KEY='your_together_api_key'
export ANYSCALE_API_KEY='your_anyscale_api_key'
export SERPLY_API_KEY='your_serply_api_key'  # Google Search API
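
Before starting a long generation run, it can help to check which of these variables are actually set. The following is a minimal sketch, not part of Open Datagen: the helper `missing_keys` is hypothetical. Which keys you need presumably depends on the providers your template uses, so a missing key is reported as a warning rather than treated as an error.

```python
import os
from typing import Iterable, List, Mapping

# Names copied from the export lines above. Treating all of them as
# relevant is an assumption of this sketch.
API_KEY_NAMES = [
    "OPENAI_API_KEY",
    "MISTRAL_API_KEY",
    "TOGETHER_API_KEY",
    "ANYSCALE_API_KEY",
    "SERPLY_API_KEY",
]


def missing_keys(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or blank in `env`."""
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    for name in missing_keys(API_KEY_NAMES):
        print(f"warning: {name} is not set")
```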

Usage

Example: Generate a low-bias FAQ dataset based on Wikipedia content

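from opendatagen.template import TemplateManager
from opendatagen.data_generator import DataGenerator

output_path = "opendatagen.csv"  # CSV file that receives the generated data
template_name = "opendatagen"    # name of the template inside the JSON file

# Load the template file and look up the template by name.
manager = TemplateManager(template_file_path="faq_wikipedia.json")
template = manager.get_template(template_name=template_name)

# Proceed only if the template was found.
if template:
    generator = DataGenerator(template=template)

    # generate_data returns the generated data and a decontaminated copy.
    data, data_decontaminated = generator.generate_data(
        output_path=output_path,
        output_decontaminated_path=None,
    )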

Here, faq_wikipedia.json is the template file that defines the "opendatagen" template loaded above.
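
Once generation finishes, a quick sanity check on the output file is useful. The sketch below assumes that `opendatagen.csv` is an ordinary UTF-8 CSV file with a header row. That is an assumption about the output format, and the column names depend on your template. `summarize_csv` is a hypothetical helper built on the standard `csv` module.

```python
import csv
import os
from typing import List, Tuple


def summarize_csv(path: str) -> Tuple[List[str], int]:
    """Return (header columns, number of data rows) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])          # first row is treated as the header
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return header, row_count


if __name__ == "__main__":
    path = "opendatagen.csv"  # same output_path as in the example above
    if os.path.exists(path):
        columns, rows = summarize_csv(path)
        print(f"{rows} rows, columns: {columns}")
    else:
        print(f"{path} not found; run the generation step first")
```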

Contribution

We welcome contributions to Open Datagen! Whether you want to fix bugs, add templates or new features, or improve the documentation, your help is greatly appreciated.

Acknowledgements

We would like to express our gratitude to the following open source projects and individuals that have inspired and helped us:

Connect

If you need help with your Generative AI strategy, implementation, or infrastructure, reach us on:

LinkedIn: @Thomas. Twitter: @thoddnn.
