Discogs Data Dumps Toolkit
discogskit
A fast tool for converting and loading Discogs data dumps into Parquet, JSONL, SQLite, and PostgreSQL.
Why discogskit?
- Fast. Parsing and writing run in parallel across multiple workers. See Benchmarks for numbers.
- Easy to use. A single command does the job. No multi-step workflows, no manual schema setup.
- Flexible outputs. Convert to Parquet or JSONL for quick analysis without standing up a database, or load directly into SQLite or PostgreSQL.
- Reliable. Comprehensive unit and integration tests run against every release.
Installation
Requires Python 3.10+.
pipx install discogskit
Or with uv:
# Install locally
uv tool install discogskit
# Run without installing
uvx discogskit
Usage
Usage: discogskit [OPTIONS] COMMAND [ARGS]...
discogskit: Discogs Data Dumps Toolkit
╭─ Options ────────────────────────────────────────────────────────────────────────────╮
│ --version Show version and exit. │
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or │
│ customize the installation. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────╮
│ convert Convert Discogs XML dumps into flat files (Parquet or JSONL). │
│ load Load Discogs XML dumps into a database. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
discogskit convert
Convert Discogs XML dumps into flat files.
| Option | Values |
|---|---|
| Output formats | parquet, jsonl |
| Compression (Parquet) | zstd (default), snappy, gzip, none |
| Compression (JSONL) | gzip, bzip2, none (default) |
Full command help
Usage: discogskit convert [OPTIONS] PATHS...
Convert Discogs XML dumps into flat files (Parquet or JSONL).
╭─ Arguments ──────────────────────────────────────────────────────────────────────────╮
│ * paths PATHS... One or more .xml.gz files or directories containing them │
│ [required] │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────╮
│ --format -f TEXT Output format: parquet or jsonl │
│ [default: parquet] │
│ --output PATH Output directory [default: .] │
│ --compression TEXT Compression codec. Parquet: gzip, │
│ snappy, zstd (default), none. JSONL: │
│ bzip2, gzip, none (default). │
│ --parse-workers INTEGER Number of parallel parse workers │
│ [default: 4] │
│ --chunk-mb INTEGER Split XML into chunks of roughly │
│ this size (MB) │
│ [default: 256] │
│ --write-queue INTEGER Max chunks buffered in memory before │
│ writes must catch up │
│ [default: 2] │
│ --keep-xml --no-keep-xml Keep decompressed XML file after │
│ converting │
│ [default: no-keep-xml] │
│ --overwrite --no-overwrite Overwrite existing output files │
│ [default: no-overwrite] │
│ --profile --no-profile Print detailed per-table timing │
│ breakdown after convert │
│ [default: no-profile] │
│ --progress --no-progress Show a progress bar instead of │
│ per-chunk output │
│ [default: progress] │
│ --strict --no-strict Warn about unhandled XML elements │
│ during parsing │
│ [default: no-strict] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
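The `--write-queue` description suggests a bounded buffer between parsing and writing: once the buffer is full, parsing waits until the writer catches up. The sketch below illustrates that general back-pressure pattern with Python's standard library. It is not discogskit's actual implementation, and `run_pipeline` and its arguments are hypothetical names chosen for illustration.

```python
import queue
import threading


def run_pipeline(chunks, write_queue_size=2):
    """Pass chunks from a producer to a single writer through a bounded queue."""
    # put() blocks once the queue holds write_queue_size items,
    # so the producer cannot run arbitrarily far ahead of the writer.
    buffer = queue.Queue(maxsize=write_queue_size)
    written = []
    done = object()  # sentinel that marks the end of the input

    def writer():
        while True:
            item = buffer.get()
            if item is done:
                break
            written.append(item)  # stand-in for writing a chunk to disk

    thread = threading.Thread(target=writer)
    thread.start()
    for chunk in chunks:
        buffer.put(chunk)  # waits here while the buffer is full
    buffer.put(done)
    thread.join()
    return written


print(run_pipeline(["chunk-0", "chunk-1", "chunk-2", "chunk-3"]))
```

In a design like this, a larger queue smooths out bursts at the cost of holding more chunks in memory at once.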
Examples
# Convert releases to Parquet (default)
discogskit convert --format parquet discogs_20260301_releases.xml.gz
# Convert to JSONL with gzip compression
discogskit convert --format jsonl --compression gzip discogs_20260301_artists.xml.gz
# Convert all dump files in the current directory
discogskit convert --format parquet .
# Keep decompressed XML after converting
discogskit convert --format parquet --keep-xml discogs_20260301_releases.xml.gz
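For quick inspection of JSONL output, a few lines of standard-library Python may be enough. The sketch below assumes each non-empty line holds one JSON object (the usual JSONL convention) and that gzip-compressed files end in `.gz`. The helper name `iter_jsonl` is hypothetical, and the actual output file names and record fields are not specified here.

```python
import gzip
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed object per non-empty line of a .jsonl or .jsonl.gz file."""
    path = Path(path)
    # Pick the opener from the file extension; both return text-mode handles.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Because the helper is a generator, something like `next(iter_jsonl("artists.jsonl.gz"))` (with a placeholder file name) reads only the first record rather than the whole file.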
discogskit load
Load Discogs XML dumps into a database.
| Database | Versions |
|---|---|
| SQLite | 3.x |
| PostgreSQL | 14+ |
Full command help
Usage: discogskit load [OPTIONS] PATHS...
Load Discogs XML dumps into a database.
╭─ Arguments ──────────────────────────────────────────────────────────────────────────╮
│ * paths PATHS... One or more .xml.gz files or directories containing them │
│ [required] │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────╮
│ --dsn TEXT Database DSN (e.g., │
│ postgresql://localhost/postgres) or │
│ path to SQLite file │
│ [env var: DATABASE_URL] │
│ [default: │
│ postgresql://localhost/discogskit] │
│ --parse-workers INTEGER Number of parallel parse workers │
│ [default: 4] │
│ --write-workers INTEGER Number of parallel database write │
│ workers │
│ [default: 1] │
│ --index-workers INTEGER Number of parallel index creation │
│ workers │
│ [default: 2] │
│ --chunk-mb INTEGER Split XML into chunks of roughly this │
│ size (MB) │
│ [default: 256] │
│ --write-queue INTEGER Max chunks buffered in memory before │
│ writes must catch up │
│ [default: 2] │
│ --keep-xml --no-keep-xml Keep decompressed XML file after │
│ loading │
│ [default: no-keep-xml] │
│ --overwrite --no-overwrite Overwrite existing tables in the │
│ database │
│ [default: no-overwrite] │
│ --profile --no-profile Print detailed per-table timing │
│ breakdown after load │
│ [default: no-profile] │
│ --progress --no-progress Show a progress bar instead of │
│ per-chunk output │
│ [default: progress] │
│ --strict --no-strict Warn about unhandled XML elements │
│ during parsing │
│ [default: no-strict] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭─ PostgreSQL ─────────────────────────────────────────────────────────────────────────╮
│ --pg-unlogged --no-pg-unlogged Skip WAL for faster writes (tables stay │
│ unlogged; data lost on crash) │
│ [default: no-pg-unlogged] │
│ --pg-tune --no-pg-tune Temporarily apply settings optimized for bulk │
│ loading │
│ [default: no-pg-tune] │
│ --pg-fk --no-pg-fk Add foreign key constraints after load │
│ [default: no-pg-fk] │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Examples
# Load releases into PostgreSQL (default DSN: postgresql://localhost/discogskit)
discogskit load discogs_20260301_releases.xml.gz
# Load into a specific PostgreSQL database
discogskit load --dsn "postgresql://user:pass@localhost/discogs" discogs_20260301_releases.xml.gz
# Load into SQLite
discogskit load --dsn discogs.db discogs_20260301_releases.xml.gz
# Load all dump files from a directory
discogskit load --dsn discogs.db .
# Use UNLOGGED tables for faster PostgreSQL writes (~2x speedup)
discogskit load --pg-unlogged discogs_20260301_releases.xml.gz
# Temporarily tune PostgreSQL for bulk loading
discogskit load --pg-tune discogs_20260301_releases.xml.gz
# Add foreign key constraints after load
discogskit load --pg-fk discogs_20260301_releases.xml.gz
# Use multiple write workers for parallel database inserts
discogskit load --write-workers 4 discogs_20260301_releases.xml.gz
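After loading into SQLite, one way to check what was created is to list the tables and their row counts. The sketch below uses only Python's built-in `sqlite3` module and discovers table names from `sqlite_master` rather than assuming any particular schema. The function name `table_row_counts` is hypothetical.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        # Skip SQLite's internal bookkeeping tables (sqlite_sequence, etc.).
        names = [name for (name,) in rows if not name.startswith("sqlite_")]
        counts = {}
        for name in names:
            # Table names come from the database itself; quote them defensively.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
```

Calling `table_row_counts("discogs.db")` on the file from the SQLite example above would return a dictionary keyed by whatever table names the load created.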
Benchmarks
Full load of the 20260301 data dump (artists, labels, masters, releases) into PostgreSQL 18 on an Apple MacBook Air M3 with 24 GB of RAM.
| | discogs-xml2db Python | discogs-xml2db .NET | discogskit | discogskit --pg-unlogged |
|---|---|---|---|---|
| Parse + load | 0:59:11 | 1:01:08 | 18:55 | 9:25 |
| Indexes | 0:43:24 | 0:43:24 | 14:53 | 1:56 |
| Total | 1:42:35 | 1:44:32 | 33:49 | 11:22 |
| Speedup | baseline | 0.98x | 3.0x | 9.0x |
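The Speedup row can be re-derived from the Total row by dividing the baseline total by each tool's total. The sketch below does that arithmetic in Python using the totals from the table. The helper names `to_seconds` and `speedup` are illustrative and not part of discogskit.

```python
def to_seconds(duration):
    """Convert 'H:MM:SS' or 'MM:SS' into a number of seconds."""
    seconds = 0
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(baseline, total):
    """How many times faster `total` is than `baseline`."""
    return to_seconds(baseline) / to_seconds(total)


# Totals copied from the benchmark table above.
totals = {
    "discogs-xml2db Python": "1:42:35",
    "discogs-xml2db .NET": "1:44:32",
    "discogskit": "33:49",
    "discogskit --pg-unlogged": "11:22",
}

baseline = totals["discogs-xml2db Python"]
for name, total in totals.items():
    print(f"{name}: {speedup(baseline, total):.2f}x")
```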
Commands and detailed output
discogs-xml2db Python
python3 run.py --apicounts --export artist --export label --export master --export release --output ./csv-dir [path]
python3 postgresql/psql.py < postgresql/sql/CreateTables.sql
python3 postgresql/importcsv.py ./csv-dir/*
python3 postgresql/psql.py < postgresql/sql/CreatePrimaryKeys.sql
python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql
| Step | Time |
|---|---|
| Export to CSV | 0:41:14 |
| Table creation | 0:00:01 |
| Data import | 0:17:56 |
| Primary keys | 0:19:21 |
| Foreign keys | 0:02:12 |
| Indexes | 0:21:51 |
| Total | 1:42:35 |
discogs-xml2db .NET
discogs [paths]
python3 postgresql/psql.py < postgresql/sql/CreateTables.sql
python3 postgresql/importcsv.py ./csv-dir/*
python3 postgresql/psql.py < postgresql/sql/CreatePrimaryKeys.sql
python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql
| Step | Time |
|---|---|
| Export to CSV | 0:43:11 |
| Table creation | 0:00:01 |
| Data import | 0:17:56 |
| Primary keys | 0:19:21 |
| Foreign keys | 0:02:12 |
| Indexes | 0:21:51 |
| Total | 1:44:32 |
discogskit
discogskit load --dsn postgresql://localhost:5432/discogskit --chunk-mb 256 \
--parse-workers 6 --write-workers 3 --index-workers 6 [path]
| Entity | Records | Parse + load | Indexes | Total |
|---|---|---|---|---|
| Artists | 9,957,079 | 40.40s | 14.77s | 55.15s |
| Labels | 2,349,729 | 9.27s | 0.67s | 9.95s |
| Masters | 2,530,697 | 34.45s | 19.94s | 54.36s |
| Releases | 18,952,204 | 1,051.37s | 857.89s | 1,909.24s |
| Total | 33,789,709 | 1,135.49s | 893.27s | 2,028.70s |
discogskit --pg-unlogged
discogskit load --dsn postgresql://localhost:5432/discogskit --chunk-mb 256 --pg-unlogged \
--parse-workers 6 --write-workers 3 --index-workers 6 [path]
| Entity | Records | Parse + load | Indexes | Total |
|---|---|---|---|---|
| Artists | 9,957,079 | 27.73s | 2.52s | 30.22s |
| Labels | 2,349,729 | 8.84s | 0.51s | 9.34s |
| Masters | 2,530,697 | 23.38s | 1.93s | 25.29s |
| Releases | 18,952,204 | 505.45s | 111.46s | 616.86s |
| Total | 33,789,709 | 565.40s | 116.42s | 681.71s |
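The per-entity tables also allow a rough throughput estimate in records per second for the parse-and-load phase. The sketch below computes it for releases, using the record count and timings from the two discogskit tables. The function name `records_per_second` is illustrative.

```python
def records_per_second(records, seconds):
    """Average throughput over a phase, in records per second."""
    return records / seconds


RELEASES = 18_952_204  # record count from the tables above

# Parse + load timings for releases, in seconds, from the two discogskit runs.
timings = {
    "discogskit": 1051.37,
    "discogskit --pg-unlogged": 505.45,
}

for name, seconds in timings.items():
    print(f"{name}: {records_per_second(RELEASES, seconds):,.0f} releases/s")
```

These are averages over the whole phase for a single run on one machine, so they are best read as an order of magnitude rather than a guaranteed rate.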
License
discogskit is licensed under the Apache License 2.0.