Fast and reliable manifest generator for ML/DL datasets
manigen 📝
manigen is a fast and reliable CLI tool designed to generate file manifests (lists of file paths) for Machine Learning and Deep Learning datasets.
Whether you are preparing a small local dataset or parsing a massive image corpus like ImageNet, manigen handles recursive scanning, multithreading, path formatting, and Train/Val/Test splitting out of the box.
✨ Features
- ⏱️ Efficient & Multithreaded: Uses a thread pool to parallelize I/O operations, significantly speeding up the scanning of large and deeply nested directory trees compared to sequential scripts.
- ✂️ Portable Manifests: Generate relative paths by stripping absolute prefixes (--strip-prefix), making it easy to move datasets between local machines and cloud servers.
- 🔀 ML-Ready Splits: Built-in shuffling and automatic Train/Validation/Test dataset splitting directly into separate files (--split).
- 🛡️ Robust Architecture: Built with modern Python, featuring thread-safe list operations, strict input validation, and clear error handling.
🎯 Motivation
While working on Super-Resolution Deep Learning projects, I found myself repeatedly copying the same massive datasets across multiple project directories. To save disk space, I decided to store all datasets in a single central location (e.g., ~/.local/share/datasets) and feed the models using simple text files containing absolute paths to the images.
Initially, I wrote a bash script for this task. However, generating a manifest for the ImageNet dataset took about 30 minutes. By rewriting the tool in Python and leveraging multithreading, manigen can now generate a manifest for ImageNet (1,281,167 images) in 12 seconds.
📦 Installation
You can install manigen directly from PyPI using pip:
pip install manigen
Or, if you use uv (recommended for CLI tools):
uv tool install manigen
🚀 Quick Start
Generate a manifest of all images in a dataset directory:
manigen -i ./datasets/ImageNet/train -o manifest.txt
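On the consuming side, a manifest can be loaded with a few lines of Python. This is a minimal sketch that assumes the manifest is a UTF-8 text file with one path per line; the `read_manifest` helper is hypothetical and not part of manigen.

```python
from pathlib import Path


def read_manifest(manifest_path: str) -> list[str]:
    """Return the non-empty lines of a manifest file as a list of paths.

    Assumes one path per line (an assumption about the manifest layout).
    """
    text = Path(manifest_path).read_text(encoding="utf-8")
    # Skip blank lines and trim surrounding whitespace from each entry.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

The resulting list can then be handed to whatever dataset loader the training code uses.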
💡 Advanced Usage Examples
1. Multithreaded Scanning
Speed up scanning for datasets with heavily nested directories (like ImageNet) by combining multiple threads (-t) with recursive search (-r):
manigen -i ./datasets/ImageNet/train -o train_paths.txt -t 8 -r
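To illustrate the general idea of thread-pool scanning, the sketch below walks each top-level subdirectory (for example, one ImageNet class folder) in its own worker. It is not manigen's actual implementation; the function names, the per-subdirectory work split, and the case-insensitive extension match are all assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Default image extensions listed in the CLI reference, with a leading dot.
IMAGE_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".bmp"}


def is_image(name: str) -> bool:
    """Check the file extension, ignoring case (an assumption of this sketch)."""
    return os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS


def walk_one(directory: str) -> list[str]:
    """Recursively collect image paths under a single directory."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        found.extend(os.path.join(dirpath, n) for n in filenames if is_image(n))
    return found


def scan_tree(root: str, threads: int = 8) -> list[str]:
    """Collect image paths under root, one worker task per top-level subdirectory."""
    subdirs, top_files = [], []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir():
                subdirs.append(entry.path)
            elif entry.is_file() and is_image(entry.name):
                top_files.append(entry.path)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Each worker returns its own list, so no shared mutable state is needed.
        batches = list(pool.map(walk_one, subdirs))
    return sorted(top_files + [path for batch in batches for path in batch])
```

Directory listing is I/O-bound, so threads can overlap the waiting time even under the GIL; the actual speedup depends on the filesystem and the shape of the tree.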
2. Making Paths Portable (Relative Paths)
If your absolute path is /Users/ml_engineer/projects/data/images/cat.jpg, but you want the manifest to only contain data/images/cat.jpg:
manigen -i /Users/ml_engineer/projects/data -o dataset.txt --strip-prefix /Users/ml_engineer/projects/
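The effect of prefix stripping on a single path can be sketched as plain string handling. The `strip_prefix` helper below is hypothetical; how manigen itself treats trailing slashes or paths that do not start with the prefix is not documented here, so this sketch simply leaves non-matching paths unchanged.

```python
def strip_prefix(path: str, prefix: str) -> str:
    """Remove prefix from the start of path; return path unchanged if it does not match."""
    # str.removeprefix is available in Python 3.9+.
    return path.removeprefix(prefix)


absolute = "/Users/ml_engineer/projects/data/images/cat.jpg"
relative = strip_prefix(absolute, "/Users/ml_engineer/projects/")
print(relative)  # data/images/cat.jpg
```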
3. Creating Train/Val/Test Splits
Automatically shuffle the dataset and split it into training (70%), validation (20%), and testing (10%) sets:
manigen -i ./dataset -o manifest.txt --shuffle --split 0.7 0.2 0.1
4. Custom File Extensions
Override the default extensions to scan for audio, text, or any other formats:
manigen -i ./audio_dataset -o audio_manifest.txt -e wav mp3 flac
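Extension filtering of this kind usually comes down to normalizing the requested extensions once and comparing file suffixes against that set. The helpers below are hypothetical, and the case-insensitive comparison is an assumption rather than documented manigen behavior.

```python
from pathlib import Path


def normalize_extensions(extensions: list[str]) -> set[str]:
    """Turn values like 'wav' or '.FLAC' into lowercase suffixes with a leading dot."""
    return {"." + ext.lower().lstrip(".") for ext in extensions}


def has_allowed_extension(path: str, allowed: set[str]) -> bool:
    """Return True if the file suffix of path is in the allowed set."""
    return Path(path).suffix.lower() in allowed


allowed = normalize_extensions(["wav", "mp3", "flac"])
print(has_allowed_extension("audio_dataset/clip_001.WAV", allowed))  # True
print(has_allowed_extension("audio_dataset/notes.txt", allowed))     # False
```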
🛠️ CLI Reference
| Argument | Short | Description | Default |
|---|---|---|---|
| --input-dir | -i | (Required) One or more dataset directories to scan. | - |
| --output-file | -o | (Required) Output file path (e.g., manifest.txt). | - |
| --threads | -t | Number of threads for parallel scanning. | 1 |
| --recursive | -r | Scan subdirectories recursively. | False |
| --extensions | -e | Allowed file extensions. | png jpeg jpg webp bmp |
| --strip-prefix | | Prefix to strip from absolute paths for relative outputs. | None |
| --shuffle | | Shuffle paths randomly before saving. | False |
| --split | | Dataset split ratios, must sum to 1.0 (e.g., 0.8 0.2). | None |
🤝 Contributing
1. Clone the repository
git clone https://github.com/ash1ra/manigen
cd manigen
2. Install dependencies using uv
uv sync
# On Windows
.venv\Scripts\activate
# On Unix or macOS
source .venv/bin/activate
3. Format and lint the code
uv run ruff format .
uv run ruff check .
4. Run the tests
uv run pytest tests/ -v
5. Submit a pull request
If you'd like to contribute, please fork the repository and open a pull request to the main branch.