A local Image Dataset Generator using smart augmentations and MobileNetV2 filtering.

DatagenKit

DatagenKit is an image dataset generator, available as both a Python library and a command-line tool. It expands a small set of user-provided seed images into a larger synthetic dataset using LLM-guided generative expansion, background removal, lightweight geometric augmentations, and MobileNetV2 feature-based semantic filtering.

🚀 Features

  • Generative AI Expansion: Connects to Hugging Face cloud APIs (FLUX) to synthesize new structural variants of your seed images.
  • Dynamic Prompt Engineering: Uses LLaMA-3.2 to rewrite your base text prompt into dozens of distinct, detailed scenarios automatically.
  • Intelligent Subject Isolation: Optionally uses U-2-Net (via rembg) to strip out backgrounds, so that downstream models learn the subject rather than the background (which helps prevent shortcut learning).
  • Smart Augmentations: Uses albumentations to apply geometric and photometric transformations directly to 4-channel (RGBA) transparent PNGs.
  • Pretrained Filtering: Uses MobileNetV2 for feature extraction and cosine similarity to discard unrealistic or overly distorted augmentations.
  • Headless CLI & Library: Designed for automated MLOps pipelines; fully terminal-driven, with a flexible Python API.
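
The filtering step can be illustrated with a minimal sketch. It assumes feature vectors have already been extracted (for example by MobileNetV2) and keeps a candidate when its best cosine similarity to any seed reaches the threshold. The function names and the "best match over all seeds" rule are assumptions made for illustration; DatagenKit's internal implementation may differ.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    seed_features: Sequence[Sequence[float]],
    candidate_features: Dict[str, Sequence[float]],
    threshold: float = 0.75,
) -> List[str]:
    """Return the names of candidates whose best similarity to any seed is >= threshold.

    Hypothetical helper: comparing against the closest seed is an assumption,
    not necessarily the rule DatagenKit applies.
    """
    if not seed_features:
        raise ValueError("at least one seed feature vector is required")
    kept = []
    for name, features in candidate_features.items():
        best = max(cosine_similarity(features, seed) for seed in seed_features)
        if best >= threshold:
            kept.append(name)
    return kept
```

With a single seed vector [1.0, 0.0, 0.0], a candidate at [0.9, 0.1, 0.0] passes a 0.75 threshold while an orthogonal candidate at [0.0, 1.0, 0.0] is discarded. Raising the threshold keeps augmentations closer to the seeds at the cost of less variety.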

📦 Installation

To use DatagenKit as a CLI tool or import it from Python, install it from source with pip:

# Clone the repository and install it in editable (development) mode
git clone <your-repo-url>
cd datagenkit
pip install -e .

Note: Installing the package adds the datagenkit command to your PATH automatically.


💻 Usage: Command Line Interface (CLI)

DatagenKit provides a configurable CLI. To list all available arguments at any time:

datagenkit --help

Basic Generation (Augmentations & Filtering Only)

datagenkit --input-dir my_seed_images/ --output-dir generated_dataset/ --target-count 250

Advanced Pipeline (AI Expansion + Background Removal)

datagenkit -i my_seed_images/ -o final_dataset/ -n 250 \
  --enable-isolation \
  --enable-ai \
  --hf-api-key "hf_your_token_here" \
  --ai-prompt "A highly detailed cat sitting on a rug" \
  --enable-dynamic-prompts
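
For automated pipelines, the same invocation can be assembled from Python so that the token is read from the environment rather than typed into a shell command. The sketch below uses only the flags shown above; the helper name, the HF_TOKEN variable name, and the choice to add --enable-dynamic-prompts only alongside --enable-ai are assumptions for illustration.

```python
import os
import shlex
from typing import List, Optional


def build_datagenkit_command(
    input_dir: str,
    output_dir: str,
    target_count: int,
    ai_prompt: Optional[str] = None,
    isolate: bool = False,
    dynamic_prompts: bool = False,
    token_env_var: str = "HF_TOKEN",  # assumed variable name, not defined by DatagenKit
) -> List[str]:
    """Build an argument list for the datagenkit CLI (hypothetical helper)."""
    cmd = [
        "datagenkit",
        "--input-dir", input_dir,
        "--output-dir", output_dir,
        "--target-count", str(target_count),
    ]
    if isolate:
        cmd.append("--enable-isolation")
    if ai_prompt is not None:
        token = os.environ.get(token_env_var)
        if not token:
            raise RuntimeError(f"set {token_env_var} before enabling AI expansion")
        cmd += ["--enable-ai", "--hf-api-key", token, "--ai-prompt", ai_prompt]
        if dynamic_prompts:
            cmd.append("--enable-dynamic-prompts")
    return cmd


def describe(cmd: List[str]) -> str:
    """Render the command for logging, with the token value masked."""
    masked = list(cmd)
    if "--hf-api-key" in masked:
        masked[masked.index("--hf-api-key") + 1] = "***"
    return shlex.join(masked)
```

The resulting list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids shell quoting problems with prompts that contain spaces.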

🐍 Usage: Python API

If you are writing your own Python scripts, custom data loaders, or Jupyter notebooks, you can import DatagenKit directly:

from datagenkit.pipeline import run_datagen_pipeline

# Run the complete, end-to-end dataset synthesizer
stats = run_datagen_pipeline(
    input_dir="my_seed_images/",
    output_dir="final_dataset/",
    target_count=250,
    similarity_threshold=0.75,
    
    # Advanced ML Features
    enable_isolation=True,      # Automatically remove backgrounds natively
    enable_ai=True,             # Generate new variants via FLUX API
    hf_api_key="hf_xxx",        # Your HuggingFace Token
    ai_prompt="A photo of a dog",
    enable_dynamic_prompts=True # Let LLaMA-3.2 enrich the prompt into variations
)

print(f"Generation Complete! Stats: {stats}")
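
Before starting a long run, it can help to confirm that the seed directory actually contains images. The following pre-flight check is a sketch that is independent of DatagenKit; the helper name and the list of accepted extensions are assumptions, and the formats the library really supports may differ.

```python
from pathlib import Path
from typing import List

# Assumed extensions; the formats DatagenKit actually accepts may differ.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def list_seed_images(input_dir: str) -> List[Path]:
    """Return the image files directly inside input_dir, sorted by path.

    Raises if the directory is missing or holds no files with a known
    image extension (hypothetical helper, not part of DatagenKit).
    """
    root = Path(input_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"seed directory not found: {root}")
    images = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    if not images:
        raise ValueError(f"no seed images found in {root}")
    return images
```

Calling list_seed_images("my_seed_images/") just before run_datagen_pipeline turns an empty or mistyped input directory into an immediate error instead of a failure partway through the run.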

🛠 Prerequisites

  • Python >= 3.8
  • A Hugging Face account and access token (required only if you enable AI expansion).
