A local Image Dataset Generator using smart augmentations and MobileNetV2 filtering.
DatagenKit
DatagenKit is an image dataset generator, available as both a Python library and a command-line tool. It expands a small set of user-provided seed images into a larger synthetic dataset using LLM-guided generative expansion, background removal, lightweight geometric augmentations, and MobileNetV2 feature-based semantic filtering.
🚀 Features
- Generative AI Expansion: Connects to Hugging Face cloud APIs (FLUX) to synthesize new structural variants of your seed images.
- Dynamic Prompt Engineering: Uses LLaMA-3.2 to rewrite your base text prompt into dozens of unique, detailed scenarios automatically.
- Intelligent Subject Isolation: Uses U-2-Net (rembg) to optionally strip out backgrounds, ensuring your downstream models learn the subject and not the background (preventing shortcut learning).
- Smart Augmentations: Uses albumentations for geometric and photometric transformations natively on 4-channel transparent PNGs.
- Pretrained Filtering: Uses MobileNetV2 for feature extraction and cosine similarity to discard any unrealistic or overly distorted augmentations.
- Headless CLI & Library: Designed for automated MLOps pipelines. Fully terminal-driven, with a flexible Python API.
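The filtering step described above can be illustrated with a minimal sketch. It is not DatagenKit's implementation: the feature vectors here are short hand-written stand-ins for MobileNetV2 embeddings, the helper names `cosine_similarity` and `filter_by_similarity` are hypothetical, and comparing each candidate against a single seed vector with a `>=` test is an assumption.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    seed_features: Sequence[float],
    candidate_features: Dict[str, Sequence[float]],
    threshold: float = 0.75,
) -> List[str]:
    """Keep the names of candidates at least `threshold` similar to the seed."""
    return [
        name
        for name, features in candidate_features.items()
        if cosine_similarity(seed_features, features) >= threshold
    ]


# Stand-in vectors; real embeddings would come from a feature extractor.
seed = [1.0, 0.0, 1.0]
candidates = {
    "mild_rotation.png": [0.9, 0.1, 1.0],
    "heavy_distortion.png": [0.0, 1.0, 0.1],
}
kept = filter_by_similarity(seed, candidates, threshold=0.75)
print(kept)
```

With these stand-in vectors, the mildly altered candidate stays close to the seed direction and is kept, while the heavily distorted one falls below the threshold and is dropped.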
📦 Installation
To use DatagenKit as a CLI tool or import it into your Python environment, install it from source with pip:
# Clone the repository and install it in editable mode
git clone <your-repo-url>
cd datagenkit
pip install -e .
Note: Installing DatagenKit adds the datagenkit command to your PATH automatically.
💻 Usage: Command Line Interface (CLI)
DatagenKit provides a configurable CLI. To list all available arguments, run:
datagenkit --help
Basic Generation (Augmentations & Filtering Only)
datagenkit --input-dir my_seed_images/ --output-dir generated_dataset/ --target-count 250
Advanced Pipeline (AI Expansion + Background Removal)
datagenkit -i my_seed_images/ -o final_dataset/ -n 250 \
--enable-isolation \
--enable-ai \
--hf-api-key "hf_your_token_here" \
--ai-prompt "A highly detailed cat sitting on a rug" \
--enable-dynamic-prompts
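For automated pipelines, the same commands can be assembled from Python and handed to `subprocess`. The sketch below uses only the flags shown above. The helper `build_datagenkit_command` is hypothetical, and reading the token from an `HF_TOKEN` environment variable is an assumption of this example (a way to keep the token out of scripts), not something DatagenKit is documented to do.

```python
import os
import shlex
from typing import List, Optional


def build_datagenkit_command(
    input_dir: str,
    output_dir: str,
    target_count: int,
    ai_prompt: Optional[str] = None,
    enable_isolation: bool = False,
    enable_dynamic_prompts: bool = False,
    token_env_var: str = "HF_TOKEN",  # assumed variable name, not a DatagenKit setting
) -> List[str]:
    """Build an argv list for the datagenkit CLI from the flags in this README."""
    cmd = [
        "datagenkit",
        "--input-dir", input_dir,
        "--output-dir", output_dir,
        "--target-count", str(target_count),
    ]
    if enable_isolation:
        cmd.append("--enable-isolation")
    if ai_prompt is not None:
        # AI expansion needs a Hugging Face token; read it from the environment.
        token = os.environ.get(token_env_var)
        if not token:
            raise RuntimeError(f"Set {token_env_var} before enabling AI expansion")
        cmd += ["--enable-ai", "--hf-api-key", token, "--ai-prompt", ai_prompt]
        if enable_dynamic_prompts:
            cmd.append("--enable-dynamic-prompts")
    return cmd


command = build_datagenkit_command("my_seed_images/", "generated_dataset/", 250)
print(shlex.join(command))
# To execute it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.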
🐍 Usage: Python API
If you are building your own Python scripts, custom data-loaders, or Jupyter Notebooks, you can import DatagenKit directly:
from datagenkit.pipeline import run_datagen_pipeline

# Run the complete end-to-end dataset synthesizer
stats = run_datagen_pipeline(
    input_dir="my_seed_images/",
    output_dir="final_dataset/",
    target_count=250,
    similarity_threshold=0.75,
    # Advanced ML features
    enable_isolation=True,        # Remove backgrounds (rembg / U-2-Net)
    enable_ai=True,               # Generate new variants via the FLUX API
    hf_api_key="hf_xxx",          # Your Hugging Face access token
    ai_prompt="A photo of a dog",
    enable_dynamic_prompts=True,  # Let LLaMA-3.2 expand the prompt into variations
)
print(f"Generation Complete! Stats: {stats}")
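Because the similarity filter discards some augmentations, it can be worth confirming how many images actually landed in the output directory after a run. The following is a minimal sketch under stated assumptions: the helpers `count_images` and `shortfall` are hypothetical, and the set of file extensions is a guess at the output formats, since the structure of `stats` and the output layout are not documented here.

```python
from pathlib import Path

# Assumed output formats; adjust to whatever the pipeline actually writes.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def count_images(directory: str) -> int:
    """Count image files anywhere under `directory` (case-insensitive suffix match)."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


def shortfall(directory: str, target_count: int) -> int:
    """Return how many images are still missing to reach `target_count`."""
    return max(0, target_count - count_images(directory))


# Reports the full target if the directory does not exist yet.
print(shortfall("final_dataset/", 250))
```

A non-zero result after a run suggests that the filter rejected more candidates than expected, in which case lowering `similarity_threshold` or adding seed images may help.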
🛠 Prerequisites
- Python >= 3.8
- A Hugging Face account and access token (required only if you enable AI Expansion).
File details
Details for the file datagenkit-0.1.1.tar.gz.
File metadata
- Download URL: datagenkit-0.1.1.tar.gz
- Upload date:
- Size: 14.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d91afc8603f9a894e38e3d3eb3a9a2aff9a8a0e2ad654a975d7ec69a8b9ef701 |
| MD5 | 8bbc22a2734f2cb08641f1c1b6b9a5ce |
| BLAKE2b-256 | 56b473733f48655c3f10ffe2b40c0805db826396ac0a396c37445977169e39a5 |
File details
Details for the file datagenkit-0.1.1-py3-none-any.whl.
File metadata
- Download URL: datagenkit-0.1.1-py3-none-any.whl
- Upload date:
- Size: 19.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 579221752ee82f850978851cfe65dd78ceba00da6bdd74f600fe2ab49a41c113 |
| MD5 | 36b3a94af571e8a81c217d5ecd0df562 |
| BLAKE2b-256 | cad864e2eb825fa2d57fa9b2294fd068db7878d6f2fd7be6e10ef7af2aac6841 |