A multimedia dataset management tool for ML training

libseraph

A hot new dataset management tool that's crazy easy!

Motivation

There is no generally accepted metadata standard for multimedia data, and no tooling for multimedia dataset (meta)data management. At the outset of TEAM-ML, creating a training dataset was an error-prone process that typically required 8-20 hours of work from an ML expert, at a total cost of $1,200-$3,000.

To expedite the creation and refinement of training datasets, we developed an absolute minimum metadata standard for our multimedia datasets and a management tool that covers all our common use cases.

In our experience, preparing a component dataset for Seraph management takes 0.5-3 hours of work on a bespoke script. This is a non-recurring cost that depends on the original dataset format and how far that format is from the Seraph format. Once all component datasets are configured, a training dataset can be assembled in 5-30 minutes, depending on (1) the user’s understanding of the tooling and the desired end composition and (2) whether the component dataset(s) can be copied in as-is or need to be resampled, resized, etc. Seraph is particularly useful for creating special-purpose or exploratory datasets from existing components; we were able to create a “9mm Parabellum Cartridge Dataset” at a cost of <$10.

Seraph currently supports only audio datasets, as anything else is out of scope for TEAM-ML.

Installation

conda create --name seraph python=3.12
conda activate seraph

pip install -r requirements.txt
pip install .

conda deactivate

Compatibility Note

Other Python ML libraries from Certus Innovations use PyTorch 2.10 and the torchcodec library for loading audio. Unfortunately, torchcodec does not support enough options for saving files the way Seraph needs, so Seraph currently relies on torchaudio, which limits the PyTorch version to 2.8. A separate Conda or venv environment works around this without much trouble, and we are actively working on an upgrade path that makes this library compatible with our other packages.

Usage

The most used features of the Seraph tool are:

  • Audio
    • Import audio data from other datasets, including allowing class selection and exclusion
    • Generate duration metadata
    • Clip audio data to a set length while preserving original track identity data
    • Resample audio
    • Prune empty audio files
  • Classes
    • Switch class columns
    • Rename, merge, regex merge, and drop classes by name
    • Check class balance, including by fold/split
    • Compose class metadata from existing column(s)
  • Metadata
    • Initialize a new seraph dataset
    • Verify all data items against the dataset contract specified in the metadata file
  • Provenance
    • Prototype OpenIRIS integration for showing and submitting provenance
  • Prune
    • Remove records with no corresponding files and vice versa
    • Drop data by row value
    • Drop metadata columns
  • Splits
    • Automatically generate train/test/validate splits or cross-validation folds with respect to class balance and optionally avoid pseudoreplication
  • Version
  • Integrations
    • Prototype Fuel AI metadata format export
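
The split generation listed above balances classes while avoiding pseudoreplication, i.e. clips cut from the same original recording ending up on both sides of a split. The sketch below is one way that idea can be expressed, not Seraph's implementation. The record layout `(item_id, class_name, group_id)`, the function name, and the split fractions are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def grouped_stratified_split(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign each record to train/test/validate, keeping groups intact.

    `records` is a list of (item_id, class_name, group_id) tuples. These
    field names are hypothetical, not Seraph's metadata schema. A group is
    whatever must stay together, e.g. all clips cut from one original track.
    """
    names = ("train", "test", "validate")
    rng = random.Random(seed)

    # Collect the groups that appear under each class.
    groups_by_class = defaultdict(set)
    for _, class_name, group_id in records:
        groups_by_class[class_name].add(group_id)

    # Split the groups of each class separately so that every split sees
    # roughly the same class proportions.
    group_to_split = {}
    for class_name in sorted(groups_by_class):
        groups = sorted(groups_by_class[class_name])
        rng.shuffle(groups)
        train_end = round(len(groups) * fractions[0])
        test_end = train_end + round(len(groups) * fractions[1])
        for index, group_id in enumerate(groups):
            if index < train_end:
                split = names[0]
            elif index < test_end:
                split = names[1]
            else:
                split = names[2]
            # A group that spans several classes keeps its first assignment.
            group_to_split.setdefault(group_id, split)

    # Every item inherits the split of its group, so no group is divided.
    return {item_id: group_to_split[group_id] for item_id, _, group_id in records}
```

Splitting at the group level rather than the item level is what prevents leakage between splits; the cost is that class balance is only approximate when groups differ in size.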

Examples

# Activate environment
conda activate seraph

# Initialize new dataset
seraph meta init

# Import audio datasets
seraph audio import --import_dir ~/Desktop/Kaggle_Gunshots/
seraph audio import --import_dir ~/Desktop/Cadre_Forensics/ --channel_merge_strat mix_down --sample_rate_merge_strat mix_down

# Switch classes from `gun_type` to `caliber`
seraph classes switch --new_class_col caliber --new_name_for_current_class_col gun_type

# Merge degenerate classes
seraph classes merge --target_class_name 9x19 --classes_to_merge "9mm Luger" --classes_to_merge "9mm"

# Add duration metadata and preview clipping to 1 sec (dry run)
seraph audio duration --metadata_column_conflict_strat replace
seraph audio clip --clip_duration_secs 1 --dry_run

# Show provenance data (WIP)
seraph prov show
seraph prov submit --activity_label "Make new gunshot dataset"

# Show versioning data (WIP)
seraph version show

# Cleanup
conda deactivate
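
The `seraph audio clip` step above cuts tracks to a set length while preserving original track identity. As a rough illustration of what "preserving identity" can mean at the metadata level, the sketch below plans fixed-length clips for one track and records the source track on every row. The function name, the column names, and the choice to drop a trailing partial clip are assumptions for this example, not Seraph's actual behavior.

```python
def plan_clips(track_id, duration_secs, clip_duration_secs):
    """Return one metadata row per full-length clip cut from a track.

    Hypothetical helper: the column names below are illustrative and are
    not Seraph's schema. A trailing remainder shorter than
    `clip_duration_secs` is dropped (an assumption of this sketch).
    """
    if clip_duration_secs <= 0:
        raise ValueError("clip_duration_secs must be positive")

    full_clips = int(duration_secs // clip_duration_secs)
    rows = []
    for index in range(full_clips):
        rows.append(
            {
                "clip_id": f"{track_id}_clip{index:04d}",
                # Keeping the source track on every row lets later steps
                # (e.g. split generation) treat sibling clips as one group.
                "source_track": track_id,
                "start_secs": index * clip_duration_secs,
                "end_secs": (index + 1) * clip_duration_secs,
            }
        )
    return rows
```

For example, `plan_clips("shot_001", 3.5, 1)` yields three rows, each carrying `source_track == "shot_001"`.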

Testing

python -m coverage run -m unittest discover -s test -p "*_test.py" && python -m coverage report --skip-covered
python -m coverage html
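
The discovery command above only collects files under `test/` whose names end in `_test.py`. A minimal file of that shape might look like the sketch below. `merge_classes` is a hypothetical stand-in defined inline so the example is self-contained; it is not part of libseraph's API, and a real test would import the function under test from the package instead.

```python
import re
import unittest


def merge_classes(labels, target, pattern):
    """Relabel every class name that fully matches `pattern` as `target`.

    Hypothetical stand-in for the code under test; not libseraph's API.
    """
    compiled = re.compile(pattern)
    return [target if compiled.fullmatch(label) else label for label in labels]


class MergeClassesTest(unittest.TestCase):
    def test_matching_classes_are_merged(self):
        merged = merge_classes(
            ["9mm", "9mm Luger", ".45 ACP"], "9x19", r"9mm( Luger)?"
        )
        self.assertEqual(merged, ["9x19", "9x19", ".45 ACP"])

    def test_non_matching_classes_are_untouched(self):
        self.assertEqual(
            merge_classes([".45 ACP"], "9x19", r"9mm.*"), [".45 ACP"]
        )
```

Saved as, say, `test/classes_test.py` (an illustrative filename), this would be picked up by the `-p "*_test.py"` pattern without any extra registration.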

Tests to Write

  • No Coverage
    • integrations
    • provenance
  • Partial Coverage
    • meta
    • version

Feature Wish-List

  • Idempotence
    • Prevent a dataset from being "double-tapped"
  • Pipe dreams
    • Undo

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

  • Ryan Quinn - Initial work

License

MIT.

Download files

Source Distribution

libseraph-0.1.1.tar.gz (45.9 kB)

Uploaded Source

Built Distribution

libseraph-0.1.1-py3-none-any.whl (54.0 kB)

Uploaded Python 3

File details

Details for the file libseraph-0.1.1.tar.gz.

File metadata

  • Download URL: libseraph-0.1.1.tar.gz
  • Size: 45.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for libseraph-0.1.1.tar.gz
Algorithm Hash digest
SHA256 06ff88ce23aa061e0d311817e459d36c1712426acbb53507fa4797feb85c4646
MD5 1bf7c3913ade2b1ee9c5e32048dd51c5
BLAKE2b-256 fc57ba4cb1c203f040d7873e83e67f63684e8b1bcef385189f19abf1af812d8e

File details

Details for the file libseraph-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: libseraph-0.1.1-py3-none-any.whl
  • Size: 54.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.11

File hashes

Hashes for libseraph-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 dc065515064116d14fe86e1285bc77209be38d6f2d893b81f1ea563ec1dd6bd2
MD5 a8cd0d5776630c22a3d5dd91db5f51bd
BLAKE2b-256 dc7f4a97ea826a2bbde2b59ad623881dc52f3ce72667560061a3d663a064a1eb
