Ultimate MIDI dataset for MIDI music discovery and symbolic music AI purposes
Project description
Discover MIDI Dataset
Ultimate MIDI dataset for MIDI music discovery and symbolic music AI purposes
Dataset features
1) Over 6.74M+ unique, de-duped and normalized MIDIs
2) Each MIDI was converted to proper MIDI format specification and checked for integrity
3) Dataset was de-duped twice: by md5 hashes and by the pitches-chords counts
4) Extensive and comprehensive (meta)data was collected from all MIDIs in the dataset
5) Dataset comes with a custom-designed and highly optimized GPU-accelerated search and filter code
Installation
pip and setuptools
# It is recommended that you upgrade pip and setuptools prior to install for max compatibility
!pip install --upgrade pip
!pip install --upgrade setuptools
CPU/GPU install
# The following command will install Discover MIDI Dataset for fast GPU search
# Please note that GPU search requires at least 16GB GPU VRAM for full searches at float16 precision
!pip install -U discovermidi
Optional packages
Packages for Fast Parallel Extract module
# The following command will install packages for Fast Parallel Extract module
# It will allow you to extract (untar) Godzilla MIDI Dataset much faster
!sudo apt update -y
!sudo apt install -y p7zip-full
!sudo apt install -y pigz
Packages for midi_to_colab_audio module
# The following command will install packages for midi_to_colab_audio module
# It will allow you to render Godzilla MIDI Dataset MIDIs to audio
!sudo apt update -y
!sudo apt install fluidsynth
Quick-start use example
# Import main Godzilla MIDI Dataset module
import discovermidi
# Download Godzilla MIDI Dataset from Hugging Face repo
discovermidi.download_dataset()
# Extract Godzilla MIDI Dataset with built-in function (slow)
discovermidi.parallel_extract()
# Or you can extract much faster if you have installed the optional packages for Fast Parallel Extract
# from discovermidi import fast_parallel_extract
# fast_parallel_extract.fast_parallel_extract()
# Load all MIDIs features matrixes and their corresponding MIDIs file names
features_matrixes, features_matrixes_file_names = discovermidi.load_features_matrixes()
# Run the search
# IO dirs will be created on the first run of the following function
# Do not forget to put your master MIDIs into created Master-MIDI-Dataset folder
# The full search for each master MIDI takes about 10-20 seconds on a GPU
discovermidi.search_and_filter(features_matrixes, features_matrixes_file_names)
Dataset structure information
Discover-MIDI-Dataset/ # Dataset root dir
├── ARTWORK/ # Concept artwork
│ ├── Illustrations/ # Concept illustrations
│ ├── Logos/ # Dataset logos
│ └── Posters/ # Dataset posters
├── CODE/ # Root dir for supplemental python code and python modules
│ ├── midi_loops_extractor/ # MIDI loops extractor code dir
├── DATA/ # Dataset (meta)data dir
│ ├── Features Counts/ # Features counts for each MIDI
│ ├── Features Matrixes/ # Features counts matrixes for each MIDI
│ ├── Files Lists/ # Files lists by MIDIs types and categories
│ ├── Identified MIDIs/ # Comprehensive data for identified MIDIs
│ ├── Karaoke MIDIs/ # Karaoke MIDIs data
│ ├── MIDIs Lyrics/ # MIDIs lyrics data
│ ├── Mono Melodies/ # Data for all MIDIs with monophonic melodies
│ ├── Pitches Patches Counts/ # Pitches-patches counts for all MIDIs
├── MIDIs/ # Root MIDIs dir
└── SOUNDFONTS/ # Select high-quality Sound Font banks to render MIDIs
Dataset (meta)data information
Features Counts
Features counts for all MIDIs are presented in a form of list of tuples (feature, count)
Features range is [0-1089) which covers six groups of values
-
[0-128) Delta start times
-
(128-256) Durations
-
[256-384] MIDI patches/instruments, 384 being reserved for drums
-
(384-640) MIDI pitches: (384-512) reserved for instruments and (512-640) for drums
-
[640-961) All possible harmonic chords (321 chords)
-
(961-1089) Velocities
Features Matrixes
A compressed NumPy array of flattened features matrixes, covering 961 out of 1089 features (without velocities)
Files lists
Numerous files lists were created for convenience and easy MIDIs retrieval from the dataset
These include lists of all MIDIs as well as subsets of MIDIs
Files lists are presented in a dictionary format of two strings:
-
MIDI md5 hash
-
Full MIDI path
Identified MIDIs
This data contains information about all MIDIs that were definitively identified by artist and title
Mono melodies
This data contains information about all MIDIs with at least one monophonic melody
The data in a form of list of tuples where first element represents monophonic melody patch/instrument
And the second element of the tuple represents number of notes for indicated patch/instrument
Please note that many MIDIs may have more than one monophonic melody
Pitches patches counts
This data contains the pitches-patches counts for all MIDIs in the dataset
This information is very useful for de-duping, MIR and statistical analysis
Citations
@misc{GodzillaMIDIDataset2025,
title = {Godzilla MIDI Dataset: Enormous, comprehensive, normalized and searchable MIDI dataset for MIR and symbolic music AI purposes},
author = {Alex Lev},
publisher = {Project Los Angeles / Tegridy Code},
year = {2025},
url = {https://huggingface.co/datasets/projectlosangeles/Godzilla-MIDI-Dataset}
@misc {breadai_2025,
author = { {BreadAi} },
title = { Sourdough-midi-dataset (Revision cd19431) },
year = 2025,
url = {\url{https://huggingface.co/datasets/BreadAi/Sourdough-midi-dataset}},
doi = { 10.57967/hf/4743 },
publisher = { Hugging Face }
}
@inproceedings{bradshawaria,
title={Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling},
author={Bradshaw, Louis and Colton, Simon},
booktitle={International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=X5hrhgndxW},
}
@misc{TegridyMIDIDataset2025,
title = {Tegridy MIDI Dataset: Ultimate Multi-Instrumental MIDI Dataset for MIR and Music AI purposes},
author = {Alex Lev},
publisher = {Project Los Angeles / Tegridy Code},
year = {2025},
url = {https://github.com/asigalov61/Tegridy-MIDI-Dataset}
Project Los Angeles
Tegridy Code 2025
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
discovermidi-25.12.22.tar.gz
(3.1 MB
view details)
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file discovermidi-25.12.22.tar.gz.
File metadata
- Download URL: discovermidi-25.12.22.tar.gz
- Upload date:
- Size: 3.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a1860b73fbb6e7c30781a651c659658ab16ecdd3605c78d3297ed733f29cddc0
|
|
| MD5 |
424dca5608186aeabef5e2041a86e3cd
|
|
| BLAKE2b-256 |
fe7c1d2d2f73a9c61903869e44fe15d962083b39da60f78073b716ead2bd87ed
|
File details
Details for the file discovermidi-25.12.22-py3-none-any.whl.
File metadata
- Download URL: discovermidi-25.12.22-py3-none-any.whl
- Upload date:
- Size: 3.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
79a2df3a522c16feadb5388c68473e71513c89423c840f9c3f197fe74ccc7fd0
|
|
| MD5 |
c66bcde3955e5e5479f410cce5cf66e0
|
|
| BLAKE2b-256 |
bec7ba770a187c9988c69f567e9131ddf473f4647c0512bd9405e94427d08ed3
|