Enormous, comprehensive, normalized and searchable MIDI dataset for MIR and symbolic music AI purposes
Godzilla MIDI Dataset
Dataset features
1) Over 5.8M unique, de-duped and normalized MIDIs
2) Each MIDI was converted to proper MIDI format specification and checked for integrity
3) Dataset was de-duped twice: by md5 hashes and by pitches-patches counts
4) Extensive and comprehensive (meta)data was collected from all MIDIs in the dataset
5) Dataset comes with custom-designed and highly optimized GPU-accelerated search and filter code
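The first de-duping pass (by md5 hashes) can be illustrated with a minimal sketch. This is not the dataset's actual pipeline code: the `dedupe_by_md5` helper, the file names and the byte strings are all hypothetical, and only the standard `hashlib` module is used.

```python
import hashlib

def dedupe_by_md5(files):
    """Keep the first file seen for each distinct MD5 digest.

    `files` maps a (hypothetical) file name to its raw bytes.
    Returns a dict of md5 hex digest -> kept file name.
    """
    seen = {}
    for name, data in files.items():
        digest = hashlib.md5(data).hexdigest()
        seen.setdefault(digest, name)  # first occurrence wins
    return seen

# Toy data: the third entry is a byte-for-byte copy of the first
files = {
    "a.mid": b"MThd-example-1",
    "b.mid": b"MThd-example-2",
    "copy_of_a.mid": b"MThd-example-1",
}

unique = dedupe_by_md5(files)
print(len(unique))              # 2
print(sorted(unique.values()))  # ['a.mid', 'b.mid']
```

A hash-based pass only catches exact byte-level copies, which is presumably why a second, content-based pass is also applied.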
Installation
pip and setuptools
# It is recommended that you upgrade pip and setuptools prior to install for max compatibility
!pip install --upgrade pip
!pip install --upgrade setuptools
CPU-only install
# The following command will install Godzilla MIDI Dataset for CPU-only search
# Please note that CPU search is quite slow and it requires a minimum of 128GB RAM to work for full searches
!pip install -U godzillamididataset
CPU/GPU install
# The following command will install Godzilla MIDI Dataset for fast GPU search
# Please note that GPU search requires at least 30GB GPU VRAM for full searches at float16 precision
!pip install -U godzillamididataset[gpu]
Optional packages
Packages for Fast Parallel Extract module
# The following commands will install packages for Fast Parallel Extract module
# It will allow you to extract (untar) Godzilla MIDI Dataset much faster
!sudo apt update -y
!sudo apt install -y p7zip-full
!sudo apt install -y pigz
Packages for midi_to_colab_audio module
# The following commands will install packages for midi_to_colab_audio module
# They will allow you to render Godzilla MIDI Dataset MIDIs to audio
!sudo apt update -y
!sudo apt install -y fluidsynth
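To confirm that the optional system packages are available before relying on them, a small check along the following lines may help. It is only a sketch: the executable names (`7z`, `pigz`, `fluidsynth`) are assumptions based on the usual Debian/Ubuntu packages, and `find_missing_tools` is a hypothetical helper, not part of godzillamididataset.

```python
import shutil

# Executable name -> module that is expected to need it.
# The executable names are assumptions based on the apt packages above.
OPTIONAL_TOOLS = {
    "7z": "Fast Parallel Extract",
    "pigz": "Fast Parallel Extract",
    "fluidsynth": "midi_to_colab_audio",
}

def find_missing_tools(tools=OPTIONAL_TOOLS):
    """Return the names of tools that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]

missing = find_missing_tools()
print("Missing optional tools:", missing or "none")
```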
Quick-start use example
# Import main Godzilla MIDI Dataset module
import godzillamididataset
# Download Godzilla MIDI Dataset from Hugging Face repo
godzillamididataset.download_dataset()
# Extract Godzilla MIDI Dataset with built-in function (slow)
godzillamididataset.parallel_extract()
# Or you can extract much faster if you have installed the optional packages for Fast Parallel Extract
# from godzillamididataset import fast_parallel_extract
# fast_parallel_extract.fast_parallel_extract()
# Load all MIDIs basic signatures
sigs_data = godzillamididataset.read_jsonl()
# Create signatures dictionaries
sigs_dicts = godzillamididataset.load_signatures(sigs_data)
# Pre-compute signatures
X, global_union = godzillamididataset.precompute_signatures(sigs_dicts)
# Run the search
# IO dirs will be created on the first run of the following function
# Do not forget to put your master MIDIs into created Master-MIDI-Dataset folder
# The full search for each master MIDI takes about 2-3 sec on a GPU and 4-5 min on a CPU
godzillamididataset.search_and_filter(sigs_dicts, X, global_union)
Dataset structure information
Godzilla-MIDI-Dataset/ # Dataset root dir
├── ARTWORK/ # Concept artwork
│ ├── Illustrations/ # Concept illustrations
│ ├── Logos/ # Dataset logos
│ └── Posters/ # Dataset posters
├── CODE/ # Supplemental python code and python modules
├── DATA/ # Dataset (meta)data dir
│ ├── Averages/ # Averages data for all MIDIs and clean MIDIs
│ ├── Basic Features/ # All basic features for all clean MIDIs
│ ├── Files Lists/ # Files lists by MIDIs types and categories
│ ├── Identified MIDIs/ # Comprehensive data for identified MIDIs
│ ├── Metadata/ # Raw metadata from all MIDIs
│ ├── Mono Melodies/ # Data for all MIDIs with monophonic melodies
│ ├── Pitches Patches Counts/ # Pitches-patches counts for all MIDIs
│ ├── Pitches Sums/ # Pitches sums for all MIDIs
│ ├── Signatures/ # Signatures data for all MIDIs and MIDIs subsets
│ └── Text Captions/ # Music description text captions for all MIDIs
├── MIDIs/ # Root MIDIs dir
└── SOUNDFONTS/ # Select high-quality soundfont banks to render MIDIs
Dataset (meta)data information
Averages
Averages for all MIDIs are presented in three groups:
- Notes averages without drums
- Notes and drums averages
- Drums averages without notes
Each group of averages is represented by a list of four values:
- Delta start-times average in ms
- Durations average in ms
- Pitches average
- Velocities average
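As a rough illustration of how one such group of four values could be computed, consider the sketch below. The note layout `(start_ms, duration_ms, pitch, velocity)` and the `note_averages` helper are assumptions made for this example; the dataset's own computation may differ, for instance in how the first delta is handled.

```python
from statistics import mean

def note_averages(notes):
    """Return [delta start-time avg, duration avg, pitch avg, velocity avg].

    Each note is a hypothetical (start_ms, duration_ms, pitch, velocity) tuple.
    """
    notes = sorted(notes)  # order by start time
    starts = [n[0] for n in notes]
    # Differences between consecutive start times; [0] for a single note
    deltas = [b - a for a, b in zip(starts, starts[1:])] or [0]
    return [
        mean(deltas),
        mean(n[1] for n in notes),
        mean(n[2] for n in notes),
        mean(n[3] for n in notes),
    ]

notes = [(0, 400, 60, 90), (500, 400, 64, 80), (1000, 200, 67, 100)]
print(note_averages(notes))
```

Splitting the notes into drum and non-drum subsets before calling the function would give the three groups listed above.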
Basic features
Basic features are presented in the form of a dictionary of 111 metrics
The features were collected from a solo piano score representation of all MIDIs with MIDI instruments below 80
These features are useful for music classification, analysis and other MIR tasks
Files lists
Numerous files lists were created for convenience and easy MIDIs retrieval from the dataset
These include lists of all MIDIs as well as subsets of MIDIs
Files lists are presented in a dictionary format of two strings:
- MIDI md5 hash
- Full MIDI path
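A lookup by hash might then look like the sketch below. It assumes that a loaded files list behaves like a mapping from md5 hash to full path, which is one reading of the description above; the hashes, paths and the `find_path` helper are hypothetical.

```python
# Hypothetical files list: md5 hash -> full MIDI path
files_list = {
    "0123456789abcdef0123456789abcdef": "MIDIs/0/0123456789abcdef0123456789abcdef.mid",
    "fedcba9876543210fedcba9876543210": "MIDIs/f/fedcba9876543210fedcba9876543210.mid",
}

def find_path(files_list, md5_hash):
    """Return the full path for an md5 hash, or None if it is not listed."""
    return files_list.get(md5_hash.lower())

print(find_path(files_list, "FEDCBA9876543210FEDCBA9876543210"))
print(find_path(files_list, "ffffffffffffffffffffffffffffffff"))  # None
```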
Identified MIDIs
This data contains information about all MIDIs that were definitively identified by artist, title, and genre
Metadata
Metadata was collected from all MIDIs in the dataset; it is a list of all MIDI events preceding the first MIDI note event
The list also includes the last note event of the MIDI, which is useful for measuring the runtime of the MIDI
The list follows the MIDI.py score format
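As a sketch of the runtime measurement, the snippet below reads the last note event from a metadata list. It relies on the MIDI.py score layout for notes, `['note', start_time, duration, channel, pitch, velocity]`; the sample events and the `runtime_from_metadata` helper are hypothetical, and the result is in whatever time units the stored score uses.

```python
def runtime_from_metadata(metadata):
    """Return start_time + duration of the last 'note' event, or None."""
    notes = [event for event in metadata if event[0] == "note"]
    if not notes:
        return None
    last = notes[-1]
    # MIDI.py score note: ['note', start_time, duration, channel, pitch, velocity]
    return last[1] + last[2]

# Hypothetical metadata list: pre-note events followed by the last note event
metadata = [
    ["set_tempo", 0, 500000],
    ["patch_change", 0, 0, 0],
    ["note", 95000, 1000, 0, 60, 90],
]

print(runtime_from_metadata(metadata))  # 96000
```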
Mono melodies
This data contains information about all MIDIs with at least one monophonic melody
The data is in the form of a list of tuples where the first element represents the monophonic melody patch/instrument
The second element of the tuple represents the number of notes for the indicated patch/instrument
Please note that many MIDIs may have more than one monophonic melody
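When a MIDI has several monophonic melodies, one simple heuristic is to treat the one with the most notes as the main melody. The sketch below assumes the `(patch, notes_count)` tuple layout described above; the sample values and the `main_melody` helper are hypothetical.

```python
def main_melody(mono_melodies):
    """Return the (patch, notes_count) tuple with the most notes, or None."""
    if not mono_melodies:
        return None
    return max(mono_melodies, key=lambda item: item[1])

# Hypothetical entry for one MIDI: three monophonic melodies
mono_melodies = [(0, 312), (73, 128), (40, 57)]

print(main_melody(mono_melodies))  # (0, 312)
```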
Pitches patches counts
This data contains the pitches-patches counts for all MIDIs in the dataset
This information is very useful for de-duping, MIR and statistical analysis
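To show why such counts are useful for de-duping, here is a minimal sketch that counts `(pitch, patch)` pairs and turns the result into an order-independent key. The input layout and both helpers are assumptions for illustration; the stored data may be organized differently.

```python
from collections import Counter

def pitches_patches_counts(notes):
    """Count how often each (pitch, patch) pair occurs.

    Each note is a hypothetical (pitch, patch) pair.
    """
    return Counter(notes)

def dedupe_key(notes):
    """Build a hashable key that ignores the order of the notes."""
    return tuple(sorted(pitches_patches_counts(notes).items()))

# Two files with the same notes in a different order, and a different file
midi_a = [(60, 0), (64, 0), (60, 0), (36, 128)]
midi_b = [(36, 128), (60, 0), (60, 0), (64, 0)]
midi_c = [(60, 0), (64, 0), (67, 0)]

print(dedupe_key(midi_a) == dedupe_key(midi_b))  # True
print(dedupe_key(midi_a) == dedupe_key(midi_c))  # False
```

Unlike an md5 hash, such a key stays the same when two files differ only in byte-level details such as headers or event ordering.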
Pitches sums
This data contains MIDI pitches sums for all MIDIs in the dataset
Pitches sums can be used for de-duping, MIR and comparative analysis
Signatures
This data contains two signatures for each MIDI in the dataset:
- Full signature with 577 features
- Basic signature with 392 features
Both signatures are presented as lists of tuples where the first element is a feature and the second element is a feature count
Both signatures also include the number of bad features, indicated by -1
Signatures features are divided into three groups:
- MIDI pitches (represented by values 0-127)
- MIDI chords (represented by values 128-449 or 128-264)
- MIDI drum pitches (represented by values 449-577 or 264-392)
Both signatures can be very effectively used for MIDI comparison or MIDI search and filtering
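One way to compare two signatures is cosine similarity over their feature counts, sketched below in plain Python. This is an illustrative metric, not necessarily the one used by the built-in search code. The sample signatures are hypothetical, and skipping the -1 (bad features) entry is an assumption made for this example.

```python
import math

def cosine_similarity(sig_a, sig_b):
    """Cosine similarity of two signatures given as (feature, count) tuples."""
    # Drop the bad-features entry (-1) before comparing (an assumption)
    a = {f: c for f, c in sig_a if f != -1}
    b = {f: c for f, c in sig_b if f != -1}
    dot = sum(count * b.get(feature, 0) for feature, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

sig_a = [(60, 4), (64, 2), (-1, 1)]
sig_b = [(60, 4), (64, 2)]
sig_c = [(36, 8)]

print(cosine_similarity(sig_a, sig_b))  # close to 1.0: same features and counts
print(cosine_similarity(sig_a, sig_c))  # 0.0: no features in common
```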
Text captions
This data contains detailed textual description of music in each MIDI in the dataset
These captions can be used for text-to-music tasks and for MIR tasks
Citations
@misc{GodzillaMIDIDataset2025,
title = {Godzilla MIDI Dataset: Enormous, comprehensive, normalized and searchable MIDI dataset for MIR and symbolic music AI purposes},
author = {Alex Lev},
publisher = {Project Los Angeles / Tegridy Code},
year = {2025},
url = {https://huggingface.co/datasets/projectlosangeles/Godzilla-MIDI-Dataset}
}
@misc {breadai_2025,
author = { {BreadAi} },
title = { Sourdough-midi-dataset (Revision cd19431) },
year = 2025,
url = {\url{https://huggingface.co/datasets/BreadAi/Sourdough-midi-dataset}},
doi = { 10.57967/hf/4743 },
publisher = { Hugging Face }
}
@inproceedings{bradshawaria,
title={Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling},
author={Bradshaw, Louis and Colton, Simon},
booktitle={International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=X5hrhgndxW},
}
@misc{TegridyMIDIDataset2025,
title = {Tegridy MIDI Dataset: Ultimate Multi-Instrumental MIDI Dataset for MIR and Music AI purposes},
author = {Alex Lev},
publisher = {Project Los Angeles / Tegridy Code},
year = {2025},
url = {https://github.com/asigalov61/Tegridy-MIDI-Dataset}
}
Project Los Angeles
Tegridy Code 2025