
MIDI ETL pipelines implemented with Dagster + dbt


MIDI ETL

This repository contains an implementation of a modern data stack for building and analyzing MIDI datasets at scale. The stack is composed of several open-source technologies: dbt, Dagster, Trino, and MinIO.

dbt (data build tool) is used to transform and optimize MIDI data as it is ingested from various sources into the data warehouse. dbt models clean and preprocess the data, extract relevant metadata, and load it into staging tables.

Dagster is used to define and execute data pipelines that fetch MIDI data from online sources using web scraping and API calls, and then load the data into the staging table in the data warehouse. It provides a framework for building, testing, and deploying these pipelines in a robust and scalable way.
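As a rough sketch of the fetch-and-extract step such a pipeline performs (the function and asset names below are illustrative, not the repository's actual definitions):

```python
# Illustrative sketch of logic a Dagster asset might wrap; the archive
# layout and all names here are assumptions, not this repository's code.
import io
import tarfile

def list_midi_members(archive_bytes: bytes) -> list[str]:
    """Return the names of .mid files inside a Lakh-style .tar.gz archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.name.endswith(".mid")]

# In Dagster this would typically be wrapped as an asset, e.g.:
#
# from dagster import asset
#
# @asset
# def lakh_midi_files(lakh_archive: bytes) -> list[str]:
#     return list_midi_members(lakh_archive)
```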

Trino is used to analyze the MIDI datasets stored in the data warehouse. It provides a distributed SQL query engine that can handle complex queries and large volumes of data efficiently. This allows data analysts and scientists to extract insights and trends from the MIDI datasets.
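For example, assuming a note_ons table is exposed through a Trino catalog (the catalog/schema path and column names here are assumptions about this stack), an analyst might count the most common pitches:

```python
# A hypothetical Trino query over the note_ons table; the catalog/schema
# ("hive.midi_etl") and column names are assumptions, not confirmed names.
PITCH_HISTOGRAM = """
SELECT note, COUNT(*) AS occurrences
FROM hive.midi_etl.note_ons
GROUP BY note
ORDER BY occurrences DESC
LIMIT 10
"""

# With the Trino Python client (pip install trino) this could be run as:
#
# import trino
# conn = trino.dbapi.connect(host="localhost", port=8080, user="analyst")
# rows = conn.cursor().execute(PITCH_HISTOGRAM).fetchall()
```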

MinIO is used to store and retrieve the MIDI datasets and other data used in the stack. It is a lightweight, scalable object storage solution that can handle large volumes of data.

Overall, this data stack provides a powerful and scalable platform for building and analyzing MIDI datasets. It can be used by data engineers, data scientists, and data analysts to extract insights and trends from music data at scale.

Prerequisites

Before you can use this repository, you will need to install the following:

  • Docker
  • Docker Compose

Installation

You can install midi_etl using pip:

pip install midi_etl

Usage

To use this repository, follow these steps:

  1. Clone the repository to your local machine:
git clone git@gitlab.com:nintorac-audio/midi_etl.git
  2. Navigate to the repository directory:
cd midi_etl
  3. Build and start the Docker containers:
docker-compose up --build

This will build the Docker containers for the ETL platform and deploy all the infrastructure needed to run the project.

  4. Navigate to Dagit to initiate jobs.

  5. Download DBeaver (or your favourite database IDE) to run queries over your data lake, browse MinIO to review the files in your data lake, or use pyarrow to load the datasets in Python, e.g.:

# First, import the necessary libraries
import pyarrow.parquet as pq
import s3fs
import duckdb

# Connect to MinIO using s3fs (credentials are the local docker-compose defaults)
fs = s3fs.S3FileSystem(
    anon=False,
    use_ssl=False,
    key="minio",
    secret="minio123",
    client_kwargs={
        "region_name": "us-east-1",
        "endpoint_url": "http://localhost:9000",
        "verify": False,
    },
)

# Read the Parquet dataset at "midi_etl/midi/note_ons" into an Arrow table
note_ons = pq.ParquetDataset("midi_etl/midi/note_ons", filesystem=fs).read()

# Open a connection to an in-memory DuckDB database
conn = duckdb.connect()

# DuckDB can query the in-scope Arrow table directly by name
cursor = conn.cursor()
cursor.execute("SELECT * FROM note_ons LIMIT 10")
print(cursor.fetchall())

Makefile

load_env is a make target that exports variables from a .env file into your local shell.

get_trino_cli is a make target that downloads the Trino command-line interface (CLI). The CLI is then mounted into the Trino container to provide command-line access to Trino.

Available Datasets

  • Lakh MIDI Dataset: The Lakh MIDI dataset is a collection of 176,581 MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. It is intended for use in large-scale music information retrieval.

License

This repository is licensed under the MIT license. See LICENSE for more information.

graph TD
    B[Dagster Daemon] --> C[DBT]
    B --> D[Trino]
    B --> F[Process Lakh MIDI dataset]
    B --> G[Process MIDI messages]
    B --> H[Extract files from tar.gz]
    A[Lakh MIDI dataset] --> H
    C --> D
    D --> E[Minio]
    F --> G
    H --> E
    subgraph raw
    end
    subgraph dagster
    F
    G
    H
    end

Download files

Download the file for your platform.

Source Distribution

midi_etl-0.0.1.tar.gz (7.2 kB)


Built Distribution

midi_etl-0.0.1-py3-none-any.whl (12.1 kB)


File details

Details for the file midi_etl-0.0.1.tar.gz.

File metadata

  • Download URL: midi_etl-0.0.1.tar.gz
  • Size: 7.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for midi_etl-0.0.1.tar.gz
Algorithm Hash digest
SHA256 c78b9c2ff1fee575f35548713a83d47f19eb2582f08c138de8d67cc8c86e4dbd
MD5 500dd72a77f3d011782ce5e958654f6c
BLAKE2b-256 f18bcd2eefe0dd04b4d131fc04e76343fb8ad654feab187bb5b9e7ca9278e98c


File details

Details for the file midi_etl-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: midi_etl-0.0.1-py3-none-any.whl
  • Size: 12.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for midi_etl-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0b6e54f39c4e9e334c21537f1f76c210cd286c209a30343b87e671323c12b426
MD5 0e007c6ec06ce33c01f6048ad0465915
BLAKE2b-256 0ac0dcf4eb07714f64f4093af11026db5dbcb9c8839ef3b8b12d47dc949f869d

