Fast and reliable manifest generator for ML/DL datasets

manigen 📝

manigen is a fast and reliable CLI tool designed to generate file manifests (lists of file paths) for Machine Learning and Deep Learning datasets.

Whether you are preparing a small local dataset or parsing a massive image corpus like ImageNet, manigen handles recursive scanning, multithreading, path formatting, and Train/Val/Test splitting out of the box.

✨ Features

  • ⏱️ Efficient & Multithreaded: Uses a thread pool to parallelize I/O operations, significantly speeding up the scanning of large and deeply nested directory trees compared to sequential scripts.
  • ✂️ Portable Manifests: Generate relative paths by stripping absolute prefixes (--strip-prefix), making it easy to move datasets between local machines and cloud servers.
  • 🔀 ML-Ready Splits: Built-in shuffling and automatic Train/Validation/Test dataset splitting directly into separate files (--split).
  • 🛡️ Robust Architecture: Built with modern Python, featuring thread-safe list operations, strict input validation, and clear error handling.

🎯 Motivation

While working on Super-Resolution Deep Learning projects, I found myself repeatedly copying the same massive datasets across multiple project directories. To save disk space, I decided to store all datasets in a single central location (e.g., ~/.local/share/datasets) and feed the models using simple text files containing absolute paths to the images.

Initially, I wrote a Bash script for this task, but generating a manifest for the ImageNet dataset took about 30 minutes. After rewriting the tool in Python with multithreading, manigen generates a manifest for the ImageNet training set (1,281,167 images) in 12 seconds.

📦 Installation

You can install manigen directly from PyPI using pip:

pip install manigen

Or, if you use uv (recommended for CLI tools):

uv tool install manigen

🚀 Quick Start

Generate a manifest of the images in a dataset directory (per the CLI reference, subdirectories are scanned only when -r is passed):

manigen -i ./datasets/ImageNet/train -o manifest.txt
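
Once a manifest exists, a training script can load it with a few lines of Python. The sketch below assumes the manifest holds one path per line; the helper name read_manifest is hypothetical and not part of manigen.

```python
from pathlib import Path


def read_manifest(manifest_path: str) -> list[Path]:
    """Return the paths listed in a manifest, assuming one path per line."""
    lines = Path(manifest_path).read_text(encoding="utf-8").splitlines()
    # Skip blank lines so a trailing newline does not produce an empty path.
    return [Path(line) for line in lines if line.strip()]


# Example (assumes manifest.txt was produced by the command above):
# image_paths = read_manifest("manifest.txt")
```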

💡 Advanced Usage Examples

1. Multithreaded Scanning

Speed up scanning for datasets with heavily nested directories (like ImageNet) by utilizing multiple threads and recursive search:

manigen -i ./datasets/ImageNet/train -o train_paths.txt -t 8 -r
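
To illustrate why a thread pool helps on deeply nested trees, here is a minimal sketch of parallel directory scanning. It is not manigen's actual implementation: the function names list_dir and scan, the level-by-level traversal, and the case-insensitive extension check are all assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Default extensions taken from the CLI reference; matching is
# case-insensitive here, which is an assumption of this sketch.
IMAGE_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".bmp"}


def list_dir(directory: str) -> tuple[list[str], list[str]]:
    """Return (matching files, subdirectories) for a single directory."""
    files, subdirs = [], []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            elif entry.is_file() and (
                os.path.splitext(entry.name)[1].lower() in IMAGE_EXTENSIONS
            ):
                files.append(entry.path)
    return files, subdirs


def scan(root: str, threads: int = 8) -> list[str]:
    """Walk `root` one level at a time, listing each level's directories in parallel."""
    found: list[str] = []
    pending = [root]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while pending:
            next_level: list[str] = []
            # Worker threads only read the file system; results are merged
            # here in the main thread, so no locking is needed.
            for files, subdirs in pool.map(list_dir, pending):
                found.extend(files)
                next_level.extend(subdirs)
            pending = next_level
    return sorted(found)
```

Directory listing is I/O-bound, so threads can overlap the waiting time even under Python's global interpreter lock. The actual speedup depends on the storage backend.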

2. Making Paths Portable (Relative Paths)

If your absolute path is /Users/ml_engineer/projects/data/images/cat.jpg, but you want the manifest to only contain data/images/cat.jpg:

manigen -i /Users/ml_engineer/projects/data -o dataset.txt -r --strip-prefix /Users/ml_engineer/projects/
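
The effect of prefix stripping, and how a consumer might rebuild absolute paths on another machine, can be sketched as follows. Both helpers (strip_prefix, resolve) and the /mnt/cloud root are hypothetical; the sketch only mirrors the example paths above and does not claim to match manigen's internals.

```python
from pathlib import PurePosixPath


def strip_prefix(path: str, prefix: str) -> str:
    """Remove `prefix` from the start of `path`; other paths are returned unchanged."""
    return path.removeprefix(prefix)  # str.removeprefix requires Python 3.9+


def resolve(relative_path: str, dataset_root: str) -> str:
    """Rebuild an absolute path by joining a manifest entry onto a new root."""
    return str(PurePosixPath(dataset_root) / relative_path)


relative = strip_prefix(
    "/Users/ml_engineer/projects/data/images/cat.jpg",
    "/Users/ml_engineer/projects/",
)
print(relative)                         # data/images/cat.jpg
print(resolve(relative, "/mnt/cloud"))  # /mnt/cloud/data/images/cat.jpg
```

With plain string stripping like this, a prefix without the trailing slash would leave a leading / on every entry, so it is worth checking the first few lines of the generated manifest.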

3. Creating Train/Val/Test Splits

Automatically shuffle the dataset and split it into training (70%), validation (20%), and testing (10%) sets:

manigen -i ./dataset -o manifest.txt --shuffle --split 0.7 0.2 0.1
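
A shuffled ratio split can be sketched in a few lines of Python. This is an illustration under stated assumptions, not manigen's code: the function name split_paths, the seed parameter, and the rounding rule (the last split receives whatever remains) are all hypothetical, and the README does not specify how the output files are named.

```python
import random


def split_paths(paths: list[str], ratios: list[float], seed: int = 0) -> list[list[str]]:
    """Shuffle `paths` and cut the result into consecutive chunks sized by `ratios`."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    splits = []
    start = 0
    for ratio in ratios[:-1]:
        end = start + round(len(shuffled) * ratio)
        splits.append(shuffled[start:end])
        start = end
    # The final split takes the remainder, so every path is assigned exactly once.
    splits.append(shuffled[start:])
    return splits


train, val, test = split_paths([f"img_{i}.png" for i in range(100)], [0.7, 0.2, 0.1])
print(len(train), len(val), len(test))  # 70 20 10
```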

4. Custom File Extensions

Override the default extensions to scan for audio, text, or any other formats:

manigen -i ./audio_dataset -o audio_manifest.txt -e wav mp3 flac
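
For readers who filter a manifest themselves, extension matching can be sketched as below. The helper names are hypothetical, and the case-insensitive comparison is an assumption; the README does not state how manigen treats upper-case extensions such as .WAV.

```python
from pathlib import PurePath


def normalize_extensions(extensions: list[str]) -> set[str]:
    """Turn CLI-style values such as 'wav' or '.WAV' into a set like {'.wav'}."""
    return {"." + ext.lower().lstrip(".") for ext in extensions}


def has_allowed_extension(path: str, allowed: set[str]) -> bool:
    """Check the final suffix of `path` against the allowed set, ignoring case."""
    return PurePath(path).suffix.lower() in allowed


allowed = normalize_extensions(["wav", "mp3", "flac"])
print(has_allowed_extension("clips/speech_001.WAV", allowed))  # True
print(has_allowed_extension("clips/cover.png", allowed))       # False
```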

🛠️ CLI Reference

| Argument | Short | Description | Default |
| --- | --- | --- | --- |
| --input-dir | -i | (Required) One or more dataset directories to scan. | - |
| --output-file | -o | (Required) Output file path (e.g., manifest.txt). | - |
| --threads | -t | Number of threads for parallel scanning. | 1 |
| --recursive | -r | Scan subdirectories recursively. | False |
| --extensions | -e | Allowed file extensions. | png jpeg jpg webp bmp |
| --strip-prefix | - | Prefix to strip from absolute paths for relative outputs. | None |
| --shuffle | - | Shuffle paths randomly before saving. | False |
| --split | - | Dataset split ratios, must sum to 1.0 (e.g., 0.8 0.2). | None |
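
The "must sum to 1.0" rule for --split is worth checking with a tolerance, because decimal ratios such as 0.7, 0.2, and 0.1 are not represented exactly in floating point. The sketch below shows one way to do that; the function name validate_split, the tolerance, and the "at least two ratios" rule are assumptions, not manigen's documented behavior.

```python
import math


def validate_split(ratios: list[float]) -> list[float]:
    """Raise ValueError unless `ratios` looks like a usable set of split ratios."""
    if len(ratios) < 2:
        raise ValueError("expected at least two split ratios")
    if any(r <= 0 or r >= 1 for r in ratios):
        raise ValueError("each ratio must be strictly between 0 and 1")
    # Compare with a tolerance instead of ==, since the float sum of valid
    # decimal ratios can differ from 1.0 by a rounding error.
    if not math.isclose(sum(ratios), 1.0, abs_tol=1e-9):
        raise ValueError(f"ratios must sum to 1.0, got {sum(ratios)}")
    return list(ratios)
```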

🤝 Contributing

1. Clone the repository

git clone https://github.com/ash1ra/manigen
cd manigen

2. Install dependencies using uv

uv sync
# On Windows
.venv\Scripts\activate
# On Unix or macOS
source .venv/bin/activate

3. Format and lint the code

uv run ruff format .
uv run ruff check .

4. Run the tests

uv run pytest tests/ -v

5. Submit a pull request

If you'd like to contribute, please fork the repository and open a pull request to the main branch.
