A multimedia dataset management tool for ML training

Seraph Dataset Management Tool

A hot new dataset management tool that's crazy easy!

Motivation

There is no generally accepted metadata standard for multimedia data, and no tooling for multimedia dataset (meta)data management. At the outset of TEAM-ML, creating a training dataset was an error-prone process that typically required 8-20 hours of work from an ML expert, at a total cost of $1,200-$3,000.

To expedite the creation and refinement of training datasets, we developed an absolute minimum metadata standard for our multimedia datasets and a management tool that covers all our common use cases.

In our experience, preparing a component dataset for Seraph management takes 0.5-3 hours with a bespoke script. This is a non-recurring cost that depends on how far the original dataset format is from the Seraph format. Once all component datasets are configured, a training dataset can be assembled in 5-30 minutes, depending on (1) the user’s understanding of the tooling and the desired end composition and (2) whether the component dataset(s) can be copied in as-is or must first be resampled, resized, etc. Seraph is particularly useful for creating special-purpose or exploratory datasets from existing components; we were able to create a “9mm Parabellum Cartridge Dataset” at a cost of <$10.

Seraph currently supports only audio datasets, as anything else is out of scope for TEAM-ML.

Installation

You can install from PyPI with pip:

pip install libseraph

Local Installation

conda create --name seraph python=3.12
conda activate seraph

pip install .

conda deactivate

Usage

The most used features of the Seraph tool are:

Audio
- Import audio data from other datasets, including allowing class selection and exclusion
- Generate duration metadata
- Clip audio data to a set length while preserving original track identity data
- Resample audio
- Prune empty audio files
Classes
- Switch class columns
- Rename, merge, regex merge, and drop classes by name
- Check class balance, including by fold/split
- Compose class metadata from existing column(s)
Metadata
- Initialize a new seraph dataset
- Verify all data items against the dataset contract specified in the metadata file
Provenance
- Prototype OpenIRIS integration for showing and submitting provenance
Prune
- Remove records with no corresponding files and vice versa
- Drop data by row value
- Drop metadata columns
Splits
- Automatically generate train/test/validate splits or cross-validation folds with respect to class balance and optionally avoid pseudoreplication
Version
- Prototype dataset version management by at least one community standard
Integrations
- Prototype Fuel AI metadata format export
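
The Splits feature listed above balances classes and can avoid pseudoreplication, that is, clips cut from the same original track ending up on both sides of a split. The following is a minimal Python sketch of that idea, not Seraph's actual implementation. The record keys (`track_id`, `label`), the function name `grouped_stratified_split`, and the toy class names are assumptions made for illustration.

```python
import random
from collections import defaultdict


def grouped_stratified_split(records, test_fraction=0.25, seed=0):
    """Assign whole tracks to train or test, class by class.

    records: list of dicts with hypothetical keys "track_id" and "label".
    Assumes every clip cut from one track carries the same label.
    """
    # Collect the distinct tracks that belong to each class.
    tracks_by_label = defaultdict(set)
    for rec in records:
        tracks_by_label[rec["label"]].add(rec["track_id"])

    rng = random.Random(seed)
    test_tracks = set()
    for label in sorted(tracks_by_label):
        tracks = sorted(tracks_by_label[label])
        rng.shuffle(tracks)
        # Hold out at least one track per class, unless the class has only one.
        n_test = max(1, round(len(tracks) * test_fraction)) if len(tracks) > 1 else 0
        test_tracks.update(tracks[:n_test])

    # Every clip follows its parent track, so no track straddles the split.
    return {
        i: ("test" if rec["track_id"] in test_tracks else "train")
        for i, rec in enumerate(records)
    }


# Toy data: 2 classes x 4 tracks x 3 clips per track.
records = [
    {"track_id": f"{label}-{t}", "label": label}
    for label in ("9x19", "5.56")
    for t in range(4)
    for _clip in range(3)
]
split = grouped_stratified_split(records)
```

Splitting at the track level rather than the clip level is the design choice that matters here: the held-out fraction is applied per class to keep class balance, and clips inherit the assignment of the track they came from.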

Examples

# Activate environment
conda activate seraph

# Initialize new dataset
seraph meta init

# Import audio datasets
seraph audio import --import_dir ~/Desktop/Kaggle_Gunshots/
seraph audio import --import_dir ~/Desktop/Cadre_Forensics/ --channel_merge_strat mix_down --sample_rate_merge_strat mix_down

# Switch classes from `gun_type` to `caliber`
seraph classes switch --new_class_col caliber --new_name_for_current_class_col gun_type

# Merge degenerate classes
seraph classes merge --target_class_name 9x19 --classes_to_merge "9mm Luger" --classes_to_merge "9mm"

# Add durations to columns and clip to 1 sec
seraph audio duration --metadata_column_conflict_strat replace
seraph audio clip --clip_duration_secs 1 --dry_run

# Show provenance data (WIP)
seraph prov show
seraph prov submit --activity_label "Make new gunshot dataset"

# Show versioning data (WIP)
seraph version show

# Cleanup
conda deactivate
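
To make the merge step in the session above concrete, here is a small Python sketch of what merging degenerate class names and then checking class balance amounts to on a metadata table. It is an illustration only and does not use Seraph's code; the in-memory rows, the file names, and the helper `merge_classes` are assumptions.

```python
from collections import Counter


def merge_classes(rows, class_col, classes_to_merge, target):
    """Return copies of rows with the listed class names replaced by target."""
    merged = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        if new_row[class_col] in classes_to_merge:
            new_row[class_col] = target
        merged.append(new_row)
    return merged


# Hypothetical metadata rows keyed by a "caliber" class column.
rows = [
    {"file": "a.wav", "caliber": "9mm Luger"},
    {"file": "b.wav", "caliber": "9mm"},
    {"file": "c.wav", "caliber": "9x19"},
    {"file": "d.wav", "caliber": ".45 ACP"},
]
merged = merge_classes(rows, "caliber", {"9mm Luger", "9mm"}, "9x19")

# Class balance after the merge: number of rows per class.
balance = Counter(row["caliber"] for row in merged)
print(balance)
```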

Testing

# If needed, install test dependencies
# pip install ".[coverage]"

python -m coverage run -m unittest discover -s test -p "*_test.py" && python -m coverage report --skip-covered
python -m coverage html

Tests to Write

No Coverage
- integrations
- provenance
Partial Coverage
- meta
- version

Feature Wish-List

Idempotence
- Prevent a dataset from being "double-tapped" (the same operation applied twice)
Pipe dreams
- Undo

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

Ryan Quinn - Initial work

License

MIT.

Release history

- 0.3.1 (this version): Jun 29, 2026
- 0.3.0: Jun 29, 2026
- 0.2.0: Jun 12, 2026
- 0.1.1: May 14, 2026

Download files

Source Distribution

libseraph-0.3.1.tar.gz (50.0 kB)

Uploaded Jun 29, 2026 Source

Built Distribution

libseraph-0.3.1-py3-none-any.whl (58.1 kB)

Uploaded Jun 29, 2026 Python 3

File details

Details for the file libseraph-0.3.1.tar.gz.

File metadata

Download URL: libseraph-0.3.1.tar.gz
Upload date: Jun 29, 2026
Size: 50.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for libseraph-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`c45fc53562ab5a3f4ba7e45d03d378d52eb1668ced16de273961dc7490af76b3`
MD5	`6e01cb186008ccaebfa1f0e74cd50567`
BLAKE2b-256	`8b69e4d864bfc64b6747c4f38af71391d6a6b9ab10c14d1ed9c41dd1e7762dc2`
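
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The sketch below is generic; the helper names `sha256_of` and `matches`, and the local path in the usage comment, are assumptions.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest to an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (assumes the sdist was saved to the current directory):
# matches("libseraph-0.3.1.tar.gz",
#         "c45fc53562ab5a3f4ba7e45d03d378d52eb1668ced16de273961dc7490af76b3")
```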

File details

Details for the file libseraph-0.3.1-py3-none-any.whl.

File metadata

Download URL: libseraph-0.3.1-py3-none-any.whl
Upload date: Jun 29, 2026
Size: 58.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for libseraph-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8cd3c9da2bd7cb091150fc36015a354c5206adc96dce5c32835a78c5f7662e75`
MD5	`8c9cc1fa77890c8447d89c3a4bfe5e37`
BLAKE2b-256	`e615e3e76ebf5951caf43ceed90575fec876b3fe2c0b9983d9bf2eacacb300c8`
