Skip to main content

Enormous, comprehensive, normalized and searchable MIDI dataset for MIR and symbolic music AI purposes

Project description

Godzilla MIDI Dataset

Enormous, comprehensive, normalized and searchable MIDI dataset for MIR and symbolic music AI purposes

Godzilla-MIDI-Dataset


Dataset features

1) Over 5.8M+ unique, de-duped and normalized MIDIs

2) Each MIDI was converted to proper MIDI format specification and checked for integrity

3) Dataset was de-duped twice: by md5 hashes and by pitches-patches counts

4) Extensive and comprehansive (meta)data was collected from all MIDIs in the dataset

5) Dataset comes with a custom-designed and highly optimized GPU-accelerated search and filter code


Installation

pip and setuptools

# It is recommended that you upgrade pip and setuptools prior to install for max compatibility
!pip install --upgrade pip
!pip install --upgrade setuptools

CPU-only install

# The following command will install Godzilla MIDI Dataset for CPU-only search
# Please note that CPU search is quite slow and it requires a minimum of 128GB RAM to work for full searches

!pip install -U godzillamididataset

CPU/GPU install

# The following command will install Godzilla MIDI Dataset for fast GPU search
# Please note that GPU search requires at least 30GB GPU VRAM for full searches at float16 precision

!pip install -U godzillamididataset[gpu]

Optional packages

Packages for Fast Parallel Exctract module

# The following command will install packages for Fast Parallel Extract module
# It will allow you to extract (untar) Godzilla MIDI Dataset much faster

!sudo apt update -y
!sudo apt install -y p7zip-full
!sudo apt install -y pigz

Packages for midi_to_colab_audio module

# The following command will install packages for midi_to_colab_audio module
# It will allow you to render Godzilla MIDI Dataset MIDIs to audio

!sudo apt update -y
!sudo apt install fluidsynth

Quick-start use example

# Import main Godzilla MIDI Dataset module
import godzillamididataset

# Download Godzilla MIDI Dataset from Hugging Face repo
godzillamididataset.download_dataset()

# Extract Godzilla MIDI Dataset with built-in function (slow)
godzillamididataset.parallel_extract()

# Or you can extract much faster if you have installed the optional packages for Fast Parallel Extract
# from godzillamididataset import fast_parallel_extract
# fast_parallel_extract.fast_parallel_extract()

# Load all MIDIs basic signatures
sigs_data = godzillamididataset.read_jsonl()

# Create signatures dictionaries
sigs_dicts = godzillamididataset.load_signatures(sigs_data)

# Pre-compute signatures
X, global_union = godzillamididataset.precompute_signatures(sigs_dicts)

# Run the search
# IO dirs will be created on the first run of the following function
# Do not forget to put your master MIDIs into created Master-MIDI-Dataset folder
# The full search for each master MIDI takes about 2-3 sec on a GPU and 4-5 min on a CPU
godzillamididataset.search_and_filter(sigs_dicts, X, global_union)

Dataset structure information

Godzilla-MIDI-Dataset/              # Dataset root dir
├── ARTWORK/                        # Concept artwork
│   ├── Illustrations/              # Concept illustrations
│   ├── Logos/                      # Dataset logos
│   └── Posters/                    # Dataset posters
├── CODE/                           # Supplemental python code and python modules
├── DATA/                           # Dataset (meta)data dir
│   ├── Averages/                   # Averages data for all MIDIs and clean MIDIs
│   ├── Basic Features/             # All basic features for all clean MIDIs
│   ├── Files Lists/                # Files lists by MIDIs types and categories
│   ├── Identified MIDIs/           # Comprehensive data for identified MIDIs
│   ├── Metadata/                   # Raw metadata from all MIDIs
│   ├── Mono Melodies/              # Data for all MIDIs with monophonic melodies
│   ├── Pitches Patches Counts/     # Pitches-patches counts for all MIDIs 
│   ├── Pitches Sums/               # Pitches sums for all MIDIs
│   ├── Signatures/                 # Signatures data for all MIDIs and MIDIs subsets
│   └── Text Captions/              # Music description text captions for all MIDIs
├── MIDIs/                          # Root MIDIs dir
└── SOUNDFONTS/                     # Select high-quality soundfont banks to render MIDIs

Dataset (meta)data information


Averages

Averages for all MIDIs are presented in three groups:

  • Notes averages without drums
  • Notes and drums averages
  • Drums averages without notes

Each group of averages is represented by a list of four values:

  • Delta start-times average in ms
  • Durations average in ms
  • Pitches average
  • Velocities average

Basic features

Basic features are presented in a form of a dictionary of 111 metrics

The features were collected from a solo piano score representation of all MIDIs with MIDI instruments below 80

These features are useful for music classification, analysis and other MIR tasks


Files lists

Numerous files lists were created for convenience and easy MIDIs retrieval from the dataset

These include lists of all MIDIs as well as subsets of MIDIs

Files lists are presented in a dictionary format of two strings:

  • MIDI md5 hash
  • Full MIDI path

Identified MIDIs

This data contains information about all MIDIs that were definitivelly identified by artist, title, and genre


Metadata

Metadata was collected from all MIDIs in the dataset and its a list of all MIDI events preceeding first MIDI note event

The list also includes the last note event of the MIDI which is useful for measuring runtime of the MIDI

The list follows the MIDI.py score format


Mono melodies

This data contains information about all MIDIs with at least one monophonic melody

The data in a form of list of tuples where first element represents monophonic melody patch/instrument

And the second element of the tuple represents number of notes for indicated patch/instrument

Please note that many MIDIs may have more than one monophonic melody


Pitches patches counts

This data contains the pitches-patches counts for all MIDIs in the dataset

This information is very useful for de-duping, MIR and statistical analysis


Pitches sums

This data contains MIDI pitches sums for all MIDIs in the dataset

Pitches sums can be used for de-duping, MIR and comparative analysis


Signatures

This data contains two signatures for each MIDI in the dataset:

  • Full signature with 577 features
  • Basic signature with 392 features

Both signatures are presented as lists of tuples where first element is a feature and the second element is a feature count

Both signatures also include number of bad features indicated by -1

Signatures features are divided into three groups:

  • MIDI pitches (represented by values 0-127)
  • MIDI chords (represented by values 128-449 or 128-264)
  • MIDI drum pitches (represented by values 449-577 or 264-392)

Both signatures can be very effectively used for MIDI comparison or MIDI search and filtering


Text captions

This data contains detailed textual description of music in each MIDI in the dataset

These captions can be used for text-to-music tasks and for MIR tasks


Citations

@misc{GodzillaMIDIDataset2025,
  title        = {Godzilla MIDI Dataset: Enormous, comprehensive, normalized and searchable MIDI dataset for MIR and symbolic music AI purposes},
  author       = {Alex Lev},
  publisher    = {Project Los Angeles / Tegridy Code},
  year         = {2025},
  url          = {https://huggingface.co/datasets/projectlosangeles/Godzilla-MIDI-Dataset}
@misc {breadai_2025,
    author       = { {BreadAi} },
    title        = { Sourdough-midi-dataset (Revision cd19431) },
    year         = 2025,
    url          = {\url{https://huggingface.co/datasets/BreadAi/Sourdough-midi-dataset}},
    doi          = { 10.57967/hf/4743 },
    publisher    = { Hugging Face }
}
@inproceedings{bradshawaria,
  title={Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling},
  author={Bradshaw, Louis and Colton, Simon},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=X5hrhgndxW}, 
}
@misc{TegridyMIDIDataset2025,
  title        = {Tegridy MIDI Dataset: Ultimate Multi-Instrumental MIDI Dataset for MIR and Music AI purposes},
  author       = {Alex Lev},
  publisher    = {Project Los Angeles / Tegridy Code},
  year         = {2025},
  url          = {https://github.com/asigalov61/Tegridy-MIDI-Dataset}

Project Los Angeles

Tegridy Code 2025

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

godzillamididataset-25.6.5.tar.gz (2.6 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

godzillamididataset-25.6.5-py3-none-any.whl (2.6 MB view details)

Uploaded Python 3

File details

Details for the file godzillamididataset-25.6.5.tar.gz.

File metadata

  • Download URL: godzillamididataset-25.6.5.tar.gz
  • Upload date:
  • Size: 2.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for godzillamididataset-25.6.5.tar.gz
Algorithm Hash digest
SHA256 c7b0b33228cd0718c7476faa1dace6334aa3781cb5c1a4a698729d2694c7cc67
MD5 615a8b38f6f807e35dfc3f722d1078f3
BLAKE2b-256 4a1bb3ede95547b8be07054056eaf56c6391395b88d4fd56736130b30dea3bc9

See more details on using hashes here.

File details

Details for the file godzillamididataset-25.6.5-py3-none-any.whl.

File metadata

File hashes

Hashes for godzillamididataset-25.6.5-py3-none-any.whl
Algorithm Hash digest
SHA256 c7e28c12cad22a071110facac30e148486633ecb1d47e34c8a5c073851c5987d
MD5 3fd9daa98264ab58db025812c8fa0e8a
BLAKE2b-256 908d08813d5c8d031a73190507a65c5fd0b3550f8309d55910d3e405de9fb2d0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page