Discover MIDI Dataset

Ultimate MIDI dataset for MIDI music discovery and symbolic music AI purposes

Dataset features

1) Over 6.74M unique, de-duped and normalized MIDIs

2) Each MIDI was converted to the proper MIDI format specification and checked for integrity

3) The dataset was de-duped twice: by MD5 hashes and by pitches-chords counts

4) Extensive and comprehensive (meta)data was collected from all MIDIs in the dataset

5) The dataset comes with custom-designed, highly optimized, GPU-accelerated search and filter code
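
The first de-duping pass mentioned above (by MD5 hashes) can be illustrated with a minimal sketch. This is not the code used to build the dataset; the function names md5_of_file and dedupe_by_md5 are hypothetical and use only the Python standard library.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe_by_md5(paths):
    """Map each distinct MD5 digest to the first path that produced it."""
    unique = {}
    for path in paths:
        # setdefault keeps the first file seen for a digest and ignores later copies
        unique.setdefault(md5_of_file(path), Path(path))
    return unique
```

Byte-identical files collapse to one entry. Files that differ only in metadata still hash differently, which is why a second, content-based pass is useful.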


Installation

pip and setuptools

# It is recommended that you upgrade pip and setuptools prior to install for max compatibility
!pip install --upgrade pip
!pip install --upgrade setuptools

CPU/GPU install

# The following command will install Discover MIDI Dataset for fast GPU search
# Please note that GPU search requires at least 16GB GPU VRAM for full searches at float16 precision

!pip install -U discovermidi

Optional packages

Packages for Fast Parallel Extract module

# The following commands will install packages for Fast Parallel Extract module
# They will allow you to extract (untar) Discover MIDI Dataset much faster

!sudo apt update -y
!sudo apt install -y p7zip-full
!sudo apt install -y pigz

Packages for midi_to_colab_audio module

# The following commands will install packages for midi_to_colab_audio module
# They will allow you to render Discover MIDI Dataset MIDIs to audio

!sudo apt update -y
!sudo apt install -y fluidsynth

Quick-start use example

# Import main Discover MIDI Dataset module
import discovermidi

# Download Discover MIDI Dataset from Hugging Face repo
discovermidi.download_dataset()

# Extract Discover MIDI Dataset with built-in function (slow)
discovermidi.parallel_extract()

# Or you can extract much faster if you have installed the optional packages for Fast Parallel Extract
# from discovermidi import fast_parallel_extract
# fast_parallel_extract.fast_parallel_extract()

# Load all MIDIs features matrixes and their corresponding MIDIs file names
features_matrixes, features_matrixes_file_names = discovermidi.load_features_matrixes()

# Run the search
# IO dirs will be created on the first run of the following function
# Do not forget to put your master MIDIs into created Master-MIDI-Dataset folder
# The full search for each master MIDI takes about 10-20 seconds on a GPU
discovermidi.search_and_filter(features_matrixes, features_matrixes_file_names)

Dataset structure information

Discover-MIDI-Dataset/               # Dataset root dir
├── ARTWORK/                        # Concept artwork
│   ├── Illustrations/              # Concept illustrations
│   ├── Logos/                      # Dataset logos
│   └── Posters/                    # Dataset posters
├── CODE/                           # Root dir for supplemental python code and python modules
│   └── midi_loops_extractor/       # MIDI loops extractor code dir
├── DATA/                           # Dataset (meta)data dir
│   ├── Features Counts/            # Features counts for each MIDI
│   ├── Features Matrixes/          # Features counts matrixes for each MIDI
│   ├── Files Lists/                # Files lists by MIDIs types and categories
│   ├── Identified MIDIs/           # Comprehensive data for identified MIDIs
│   ├── Karaoke MIDIs/              # Karaoke MIDIs data
│   ├── MIDIs Lyrics/               # MIDIs lyrics data
│   ├── Mono Melodies/              # Data for all MIDIs with monophonic melodies
│   └── Pitches Patches Counts/     # Pitches-patches counts for all MIDIs
├── MIDIs/                          # Root MIDIs dir
└── SOUNDFONTS/                     # Select high-quality Sound Font banks to render MIDIs

Dataset (meta)data information


Features Counts

Features counts for each MIDI are stored as a list of (feature, count) tuples

The features range is [0-1089), which covers six groups of values

  • [0-128) Delta start times
  • (128-256) Durations
  • [256-384] MIDI patches/instruments, 384 being reserved for drums
  • (384-640) MIDI pitches: (384-512) reserved for instruments and (512-640) for drums
  • [640-961) All possible harmonic chords (321 chords)
  • (961-1089) Velocities
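
As a rough illustration, the groups above can be turned into a lookup from feature index to group name. The sketch below reads the interval notation literally (square bracket means inclusive, parenthesis means exclusive), which is an assumption; under that reading the boundary indexes 128, 512 and 961 belong to no group and the function returns None for them. The name feature_group and the group labels are hypothetical and are not part of the discovermidi package.

```python
def feature_group(index):
    """Name the group a feature index falls into, reading the intervals literally."""
    if 0 <= index < 128:
        return "delta_start_time"
    if 128 < index < 256:
        return "duration"
    if 256 <= index <= 384:
        return "patch"  # index 384 is the slot reserved for drums
    if 384 < index < 512:
        return "instrument_pitch"
    if 512 < index < 640:
        return "drum_pitch"
    if 640 <= index < 961:
        return "chord"
    if 961 < index < 1089:
        return "velocity"
    return None  # boundary or out-of-range index under the literal reading
```

For example, feature_group(384) yields "patch" while feature_group(385) yields "instrument_pitch".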

Features Matrixes

A compressed NumPy array of flattened features matrixes, covering 961 out of 1089 features (without velocities)
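
To show how a list of (feature, count) tuples relates to a fixed-length row of 961 features, here is a minimal sketch that drops every feature index from 961 upward (the velocities). The exact layout of the stored matrixes is not described here, so this is only an assumption about the general idea; counts_to_vector and NUM_FEATURES are hypothetical names, and a plain Python list stands in for a NumPy array.

```python
NUM_FEATURES = 961  # features kept when velocities are left out


def counts_to_vector(feature_counts, size=NUM_FEATURES):
    """Build a dense count vector from (feature, count) tuples.

    Feature indexes outside [0, size) are skipped.
    """
    vector = [0] * size
    for feature, count in feature_counts:
        if 0 <= feature < size:
            vector[feature] += count
    return vector
```

Calling counts_to_vector([(0, 3), (300, 2), (1000, 5)]) gives a list of 961 entries in which only positions 0 and 300 are non-zero.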


Files lists

Numerous files lists were created for convenient retrieval of MIDIs from the dataset

These include lists of all MIDIs as well as subsets of MIDIs

Each files list is presented as a dictionary pairing two strings:

  • MIDI md5 hash
  • Full MIDI path
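
A minimal lookup sketch, assuming a files list has already been loaded into memory as a dictionary whose keys are md5 hashes and whose values are full MIDI paths. That key/value orientation is an assumption, and the hashes and paths in the usage note are made up for illustration.

```python
def paths_for_hashes(files_list, wanted_hashes):
    """Return the paths of the requested md5 hashes, skipping unknown hashes."""
    return [files_list[h] for h in wanted_hashes if h in files_list]
```

With files_list = {"aaa": "MIDIs/a/aaa.mid", "bbb": "MIDIs/b/bbb.mid"}, calling paths_for_hashes(files_list, ["bbb", "zzz"]) returns ["MIDIs/b/bbb.mid"].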

Identified MIDIs

This data contains information about all MIDIs that were definitively identified by artist and title


Mono melodies

This data contains information about all MIDIs with at least one monophonic melody

The data is a list of tuples: the first element is the patch/instrument of the monophonic melody

The second element is the number of notes for that patch/instrument

Please note that many MIDIs may have more than one monophonic melody
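
Because one MIDI may carry several monophonic melodies, a common need is to pick a single candidate. The sketch below selects the (patch, note count) tuple with the most notes; treating the longest melody as the main one is a heuristic assumed here, not something the dataset prescribes, and main_melody is a hypothetical name.

```python
def main_melody(mono_melodies):
    """Return the (patch, note_count) tuple with the most notes, or None if empty."""
    if not mono_melodies:
        return None
    return max(mono_melodies, key=lambda item: item[1])
```

For the list [(0, 120), (73, 340), (40, 95)] the function returns (73, 340).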


Pitches patches counts

This data contains the pitches-patches counts for all MIDIs in the dataset

This information is very useful for de-duping, MIR and statistical analysis
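
One way such counts can support de-duping is to treat them as a signature and group MIDIs that share it. The sketch below is illustrative only: it assumes each MIDI is available as a list of (pitch, patch) pairs, which is not the stored format of this data, and the names pitches_patches_signature and group_duplicates are hypothetical.

```python
from collections import Counter


def pitches_patches_signature(notes):
    """Return a hashable, order-independent signature of (pitch, patch) counts."""
    counts = Counter((pitch, patch) for pitch, patch in notes)
    return tuple(sorted(counts.items()))


def group_duplicates(midis):
    """Group MIDI names that share a signature; only groups of two or more are returned."""
    groups = {}
    for name, notes in midis.items():
        groups.setdefault(pitches_patches_signature(notes), []).append(name)
    return [names for names in groups.values() if len(names) > 1]
```

Two files with the same notes in a different order end up in the same group, which an MD5 comparison alone would miss.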


Citations

@misc{GodzillaMIDIDataset2025,
  title        = {Godzilla MIDI Dataset: Enormous, comprehensive, normalized and searchable MIDI dataset for MIR and symbolic music AI purposes},
  author       = {Alex Lev},
  publisher    = {Project Los Angeles / Tegridy Code},
  year         = {2025},
  url          = {https://huggingface.co/datasets/projectlosangeles/Godzilla-MIDI-Dataset}
}
@misc {breadai_2025,
    author       = { {BreadAi} },
    title        = { Sourdough-midi-dataset (Revision cd19431) },
    year         = 2025,
    url          = {\url{https://huggingface.co/datasets/BreadAi/Sourdough-midi-dataset}},
    doi          = { 10.57967/hf/4743 },
    publisher    = { Hugging Face }
}
@inproceedings{bradshawaria,
  title={Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling},
  author={Bradshaw, Louis and Colton, Simon},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=X5hrhgndxW}, 
}
@misc{TegridyMIDIDataset2025,
  title        = {Tegridy MIDI Dataset: Ultimate Multi-Instrumental MIDI Dataset for MIR and Music AI purposes},
  author       = {Alex Lev},
  publisher    = {Project Los Angeles / Tegridy Code},
  year         = {2025},
  url          = {https://github.com/asigalov61/Tegridy-MIDI-Dataset}
}

Project Los Angeles

Tegridy Code 2025
