
A local Image Dataset Generator using smart augmentations and MobileNetV2 filtering.


DatagenKit

DatagenKit is an image dataset generator, available as a Python library and a command-line tool. It expands a small set of user-provided seed images into a larger synthetic dataset using LLM-guided generative expansion, background removal, lightweight geometric augmentations, and MobileNetV2 feature-based semantic filtering.

🚀 Features

  • Generative AI Expansion: Connects to Hugging Face cloud APIs (FLUX) to synthesize new structural variants of your seed images.
  • Dynamic Prompt Engineering: Takes your base text prompt and uses LLaMA-3.2 to rewrite it into dozens of distinct, more detailed scenarios automatically.
  • Intelligent Subject Isolation: Optionally strips backgrounds with U-2-Net (rembg), so downstream models learn the subject rather than the background (which helps prevent shortcut learning).
  • Smart Augmentations: Uses albumentations to apply geometric and photometric transformations directly to 4-channel (RGBA) transparent PNGs.
  • Pretrained Filtering: Extracts features with MobileNetV2 and uses cosine similarity to discard unrealistic or overly distorted augmentations.
  • Headless CLI & Library: Designed for automated MLOps pipelines. Runs entirely from the terminal and also exposes a Python API.
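
The filtering idea in the list above can be sketched in a few lines. This is a minimal illustration rather than DatagenKit's implementation: the short vectors stand in for MobileNetV2 embeddings, the function names are hypothetical, and comparing each candidate against a single seed vector is an assumption. The 0.75 threshold matches the `similarity_threshold` value used in the Python API example.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    seed_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    threshold: float = 0.75,
) -> List[str]:
    """Return the names of candidates whose similarity to the seed is >= threshold."""
    return [
        name
        for name, vec in candidates.items()
        if cosine_similarity(seed_vec, vec) >= threshold
    ]


# Toy feature vectors (hypothetical stand-ins for real embeddings).
seed = [1.0, 0.0, 1.0]
candidates = {
    "mild_rotation": [0.9, 0.1, 1.0],     # close to the seed, should be kept
    "heavy_distortion": [0.0, 1.0, 0.0],  # orthogonal to the seed, should be dropped
}
kept = filter_by_similarity(seed, candidates)
print(kept)
```

Raising the threshold keeps only augmentations that stay very close to the seed. Lowering it admits more variety at the cost of more distorted samples.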

📦 Installation

To use DatagenKit as a CLI tool or import it into your Python environment, install it with pip from a local clone:

# Clone the repository and install it in editable mode into the active environment
git clone <your-repo-url>
cd datagenkit
pip install -e .

Note: Installation adds the datagenkit command to your PATH automatically.
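
If the command is not found after installation, a quick check from Python can confirm whether it is visible on the current PATH. This sketch uses only the standard library. The helper name `find_cli` is hypothetical and not part of DatagenKit.

```python
import shutil
from typing import Optional


def find_cli(name: str = "datagenkit") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_cli()
if path is None:
    print("datagenkit not found on PATH; check that the right environment is active")
else:
    print(f"datagenkit found at {path}")
```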


💻 Usage: Command Line Interface (CLI)

DatagenKit's CLI is configured entirely through command-line arguments. To list all of them:

datagenkit --help

Basic Generation (Augmentations & Filtering Only)

datagenkit --input-dir my_seed_images/ --output-dir generated_dataset/ --target-count 250

Advanced Pipeline (AI Expansion + Background Removal)

datagenkit -i my_seed_images/ -o final_dataset/ -n 250 \
  --enable-isolation \
  --enable-ai \
  --hf-api-key "hf_your_token_here" \
  --ai-prompt "A highly detailed cat sitting on a rug" \
  --enable-dynamic-prompts
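
Passing a token directly on the command line leaves it in shell history. One alternative is to build the command in a script and read the token from the environment. The sketch below uses only the flags shown above. The `HF_TOKEN` variable name and the `build_command` helper are assumptions for illustration, and leaving out the AI flags when no token is set assumes those flags are optional.

```python
import os
import shlex
from typing import List, Optional


def build_command(
    input_dir: str,
    output_dir: str,
    count: int,
    prompt: Optional[str] = None,
    token: Optional[str] = None,
) -> List[str]:
    """Assemble a datagenkit argument list; AI flags are added only with a prompt and token."""
    cmd = [
        "datagenkit",
        "-i", input_dir,
        "-o", output_dir,
        "-n", str(count),
        "--enable-isolation",
    ]
    if prompt and token:
        cmd += [
            "--enable-ai",
            "--hf-api-key", token,
            "--ai-prompt", prompt,
            "--enable-dynamic-prompts",
        ]
    return cmd


cmd = build_command(
    "my_seed_images/",
    "final_dataset/",
    250,
    prompt="A highly detailed cat sitting on a rug",
    token=os.environ.get("HF_TOKEN"),  # assumed variable name, not read by DatagenKit itself
)

# Print the command with the token masked so it does not end up in logs.
print(shlex.join("hf_***" if part.startswith("hf_") else part for part in cmd))

# To actually run it:
# import subprocess
# subprocess.run(cmd, check=True)
```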

🐍 Usage: Python API

If you are building your own Python scripts, custom data-loaders, or Jupyter Notebooks, you can import DatagenKit directly:

from datagenkit.pipeline import run_datagen_pipeline

# Run the complete, end-to-end dataset synthesizer
stats = run_datagen_pipeline(
    input_dir="my_seed_images/",
    output_dir="final_dataset/",
    target_count=250,
    similarity_threshold=0.75,

    # Advanced ML Features
    enable_isolation=True,      # Remove backgrounds (U-2-Net via rembg)
    enable_ai=True,             # Generate new variants via FLUX API
    hf_api_key="hf_xxx",        # Your HuggingFace Token
    ai_prompt="A photo of a dog",
    enable_dynamic_prompts=True # Let LLaMA-3.2 enrich the prompt into variations
)

print(f"Generation Complete! Stats: {stats}")
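
Before starting a long run, it can help to confirm that the seed directory exists and contains images. The following pre-flight sketch is not part of DatagenKit: the helper names are hypothetical, and the extension list is an assumption, since the formats the library accepts are not documented here.

```python
from pathlib import Path
from typing import List

# Assumption: a guess at common image formats, not DatagenKit's actual accepted list.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def list_seed_images(input_dir: str) -> List[Path]:
    """Return the image files directly inside input_dir, sorted by path; raise if none exist."""
    root = Path(input_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"seed directory not found: {root}")
    seeds = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    if not seeds:
        raise ValueError(f"no seed images found in {root}")
    return seeds


def preflight_and_run(input_dir: str, output_dir: str, target_count: int):
    """Check the seed directory, then call the pipeline with augmentation and filtering only."""
    seeds = list_seed_images(input_dir)
    print(f"Found {len(seeds)} seed image(s) in {input_dir}")
    # Imported lazily so the check above works even if datagenkit is not installed.
    from datagenkit.pipeline import run_datagen_pipeline
    return run_datagen_pipeline(
        input_dir=input_dir,
        output_dir=output_dir,
        target_count=target_count,
    )
```

Calling `preflight_and_run("my_seed_images/", "final_dataset/", 250)` fails fast with a clear error when the directory is missing or empty, instead of failing somewhere inside the pipeline.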

🛠 Prerequisites

  • Python >= 3.8
  • A Hugging Face account and access token (only required if you enable AI expansion).
