Fast and reliable manifest generator for ML/DL datasets

manigen ⚡

manigen is a fast, reliable, and multithreaded CLI tool designed to generate file manifests (lists of file paths) for Machine Learning and Deep Learning datasets.

Whether you are preparing a small local dataset or parsing a massive image corpus like ImageNet, manigen handles recursive scanning, multithreading, path formatting, and Train/Val/Test splitting out of the box.

✨ Features

  • 🚀 Blazing Fast: Uses multithreading to parallelize I/O operations and scan huge directory trees efficiently.
  • ✂️ Portable Datasets: Easily strip absolute path prefixes to generate relative paths (--strip-prefix), making your manifests portable across different machines and servers.
  • 🔀 ML-Ready Splits: Built-in shuffling and automatic Train/Validation/Test dataset splitting (--split).
  • 🛡️ Robust & Safe: Thread-safe operations, strict path validation, and clean fallback mechanisms.

🎯 Motivation

While working on Super-Resolution Deep Learning projects, I found myself repeatedly copying the same massive datasets across multiple project directories. To save disk space, I decided to store all datasets in a single central location (e.g., ~/.local/share/datasets) and feed the models using simple text files containing absolute paths to the images.

Initially, I wrote a bash script for this task. However, generating a manifest for the ImageNet dataset took about 30 minutes. By rewriting the tool in Python and leveraging multithreading, manigen can now generate a manifest for ImageNet (1,281,167 images) in 12 seconds.

📦 Installation

You can install manigen directly from PyPI using pip:

pip install manigen

Or, if you use uv (recommended for CLI tools):

uv tool install manigen

🚀 Quick Start

Generate a manifest of all images in a dataset directory:

manigen -i ./datasets/ImageNet/train -o manifest.txt
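
A manifest like this can then be consumed from training code. The snippet below is a minimal sketch, assuming the manifest is a UTF-8 text file with one path per line; `load_manifest` is a hypothetical helper and not part of manigen.

```python
from pathlib import Path


def load_manifest(manifest_path: str) -> list[str]:
    """Return the non-empty lines of a manifest file (assumed: one path per line)."""
    text = Path(manifest_path).read_text(encoding="utf-8")
    # Skip blank lines and trim surrounding whitespace from each entry.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

The resulting list can be passed to whatever dataset class your framework uses.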

💡 Advanced Usage Examples

1. Multithreaded Scanning

Speed up scanning of deeply nested datasets (like ImageNet) by enabling recursive search (-r) and using multiple threads (-t):

manigen -i ./datasets/ImageNet/train -o train_paths.txt -t 8 -r
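
To illustrate why threads help here, the sketch below scans each top-level subdirectory in its own worker thread. It is not manigen's actual implementation: `scan_tree`, `scan_parallel`, and the per-subdirectory work split are assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Default image extensions, taken from the CLI reference.
IMAGE_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp", ".bmp"}


def is_image(name: str) -> bool:
    """Check the file extension, ignoring case (an assumption for this sketch)."""
    return os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS


def scan_tree(root: str) -> list[str]:
    """Walk one directory tree and collect absolute paths of matching files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if is_image(name):
                found.append(os.path.abspath(os.path.join(dirpath, name)))
    return found


def scan_parallel(root: str, threads: int = 8) -> list[str]:
    """Scan each top-level subdirectory of `root` in a separate worker thread."""
    with os.scandir(root) as it:
        entries = list(it)
    subdirs = [e.path for e in entries if e.is_dir()]
    # Files that sit directly in `root` are handled in the calling thread.
    paths = [os.path.abspath(e.path) for e in entries if e.is_file() and is_image(e.name)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for chunk in pool.map(scan_tree, subdirs):
            paths.extend(chunk)
    return sorted(paths)
```

Directory listing is I/O-bound, so the threads spend most of their time waiting on the filesystem rather than competing for the interpreter lock. A layout like ImageNet's, with one folder per class, splits naturally into independent units of work.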

2. Making Paths Portable (Relative Paths)

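If a file's absolute path is /Users/ml_engineer/projects/data/images/cat.jpg but you want the manifest to contain only data/images/cat.jpg, pass the leading part of the path to --strip-prefix. Because cat.jpg lives in a subdirectory of the input directory, recursive scanning (-r) is enabled as well:

manigen -i /Users/ml_engineer/projects/data -o dataset.txt -r --strip-prefix /Users/ml_engineer/projects/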
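
The effect of stripping a prefix, and of resolving the relative paths again on another machine, can be sketched as follows. `strip_prefix` and `rebase` are hypothetical helpers, /mnt/storage is a made-up dataset root, and leaving non-matching paths unchanged is an assumption rather than documented manigen behavior.

```python
import posixpath


def strip_prefix(path: str, prefix: str) -> str:
    """Drop `prefix` from the start of `path`; non-matching paths are returned as-is (assumed)."""
    return path.removeprefix(prefix)  # str.removeprefix needs Python 3.9+


def rebase(relative_path: str, new_root: str) -> str:
    """Re-attach a relative manifest entry to the dataset root of the current machine."""
    return posixpath.join(new_root, relative_path)


relative = strip_prefix(
    "/Users/ml_engineer/projects/data/images/cat.jpg",
    "/Users/ml_engineer/projects/",
)
print(relative)                          # data/images/cat.jpg
print(rebase(relative, "/mnt/storage"))  # /mnt/storage/data/images/cat.jpg
```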

3. Creating Train/Val/Test Splits

Automatically shuffle the dataset and split it into training (70%), validation (20%), and testing (10%) sets:

manigen -i ./dataset -o manifest.txt --shuffle --split 0.7 0.2 0.1
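
For intuition, here is a minimal sketch of shuffling a list of paths and cutting it by ratios. It is not manigen's source: `split_paths`, the fixed seed, and the rounding rule are assumptions chosen for the example.

```python
import random


def split_paths(paths, ratios, seed=0):
    """Shuffle `paths` and cut them into consecutive chunks sized by `ratios`."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1.0")
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits

    splits, start, cumulative = [], 0, 0.0
    for i, ratio in enumerate(ratios):
        cumulative += ratio
        # The last chunk takes every remaining item so nothing is lost to rounding.
        is_last = i == len(ratios) - 1
        end = len(shuffled) if is_last else round(cumulative * len(shuffled))
        splits.append(shuffled[start:end])
        start = end
    return splits


train, val, test = split_paths([f"img_{i}.png" for i in range(10)], [0.7, 0.2, 0.1])
print(len(train), len(val), len(test))  # 7 2 1
```

Cutting at cumulative boundaries keeps the chunks disjoint and guarantees that every path ends up in exactly one split.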

4. Custom File Extensions

Override the default extensions (png, jpeg, jpg, webp, bmp) with -e to scan for audio, text, or any other file format:

manigen -i ./audio_dataset -o audio_manifest.txt -e wav mp3 flac
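
Extension filtering of this kind can be sketched as below. `has_allowed_extension` is a hypothetical helper, and the case-insensitive comparison is an assumption; check how manigen treats names like song.WAV before relying on it.

```python
import os


def has_allowed_extension(filename: str, extensions: list[str]) -> bool:
    """Return True if the file's extension is in `extensions` (given without dots)."""
    # Normalize to lowercase, dot-prefixed form, e.g. "wav" -> ".wav".
    allowed = {"." + ext.lower().lstrip(".") for ext in extensions}
    return os.path.splitext(filename)[1].lower() in allowed


audio = ["wav", "mp3", "flac"]
print(has_allowed_extension("song.WAV", audio))   # True
print(has_allowed_extension("cover.jpg", audio))  # False
```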

🛠️ CLI Reference

| Argument | Short | Description | Default |
| --- | --- | --- | --- |
| --input-dir | -i | (Required) One or more dataset directories to scan. | - |
| --output-file | -o | (Required) Output file path (e.g., manifest.txt). | - |
| --threads | -t | Number of threads for parallel scanning. | 1 |
| --recursive | -r | Scan subdirectories recursively. | False |
| --extensions | -e | Allowed file extensions. | png jpeg jpg webp bmp |
| --strip-prefix | | Prefix to strip from absolute paths for relative outputs. | None |
| --shuffle | | Shuffle paths randomly before saving. | False |
| --split | | Dataset split ratios, must sum to 1.0 (e.g., 0.8 0.2). | None |
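
If you call manigen from scripts, it can help to see the options as an argparse definition. The sketch below only mirrors the reference above; it is an illustration written for this README, not manigen's actual parser, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="manigen")
    parser.add_argument("-i", "--input-dir", nargs="+", required=True)
    parser.add_argument("-o", "--output-file", required=True)
    parser.add_argument("-t", "--threads", type=int, default=1)
    parser.add_argument("-r", "--recursive", action="store_true")
    parser.add_argument(
        "-e", "--extensions", nargs="+",
        default=["png", "jpeg", "jpg", "webp", "bmp"],
    )
    parser.add_argument("--strip-prefix", default=None)
    parser.add_argument("--shuffle", action="store_true")
    parser.add_argument("--split", nargs="+", type=float, default=None)
    return parser


args = build_parser().parse_args(
    ["-i", "./dataset", "-o", "manifest.txt", "--shuffle", "--split", "0.7", "0.2", "0.1"]
)
print(args.split)  # [0.7, 0.2, 0.1]
```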

🤝 Contributing

1. Clone the repository

git clone https://github.com/ash1ra/manigen
cd manigen

2. Install dependencies using uv

uv sync
# Activate the virtual environment on Windows
.venv\Scripts\activate
# Activate the virtual environment on Unix or macOS
source .venv/bin/activate

3. Format and lint the code

uv run ruff format .
uv run ruff check .

4. Run the tests

uv run pytest tests/ -v

5. Submit a pull request

If you'd like to contribute, please fork the repository and open a pull request to the main branch.

