
PDBj data synchronization and database loading tool

Project description

pdb-mine-builder


Build a Mine-schema database from PDB data. Synchronizes structural biology data from wwPDB mirrors (PDBj by default) via rsync and loads it into PostgreSQL.

This project is based on PDBj's mine2updater. Thanks to the PDBj team for the original implementation and the Mine relational database design.

Documentation: https://n283t.github.io/pdb-mine-builder/

Features

  • Multi-process parallel data loading with configurable workers
  • Support for multiple data formats (CIF default, mmJSON optional)
  • Configurable sync sources with regional wwPDB mirror support (PDBj, RCSB, PDBe)
  • RDKit chemical search integration (substructure, similarity)
  • SQL query interface with multi-format output (table, CSV, JSON, Parquet)
  • Interactive SQL examples with 75+ queries across 10 categories
  • 9 database schemas covering PDB structures, chemical components, validation reports, and more

Installation

Pixi (recommended)

Pixi manages all dependencies including Python, PostgreSQL, and RDKit in a single environment.

git clone https://github.com/N283T/pdb-mine-builder.git
cd pdb-mine-builder
pixi install
cp config.example.yml config.yml  # Edit with your data paths
pixi run db-init       # Initialize PostgreSQL
pixi run db-start      # Start PostgreSQL
pixi run pmb sync      # Sync data from wwPDB (PDBj by default)
pixi run pmb load pdbj --force  # Load data
pixi run pmb stats     # Check database statistics
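
The post-configuration steps above can also be scripted so that a failure stops the sequence early. The following is a minimal sketch, not part of pdb-mine-builder: `QUICKSTART_STEPS` and `run_steps` are hypothetical names, the commands are copied verbatim from the list above, and it assumes `config.yml` has already been edited and that the script runs from the repository root.

```python
import shlex
import subprocess

# Commands copied from the quick-start steps above (hypothetical wrapper, not a pmb API).
QUICKSTART_STEPS = [
    "pixi run db-init",
    "pixi run db-start",
    "pixi run pmb sync",
    "pixi run pmb load pdbj --force",
    "pixi run pmb stats",
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order and stop at the first failure.

    `runner` is injectable so the sequence can be dry-run without pixi installed.
    """
    completed = []
    for step in steps:
        argv = shlex.split(step)
        # With check=True, subprocess.run raises CalledProcessError on a non-zero exit.
        runner(argv, check=True)
        completed.append(step)
    return completed
```

Calling `run_steps(QUICKSTART_STEPS)` executes the commands for real; passing a stub as `runner` only records what would be run.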

pip (alternative)

Note: pip installs the Python package only. You must provide PostgreSQL (17+) and the RDKit PostgreSQL cartridge separately. Database management commands (pixi run db-*) are not available.

pip install pdbminebuilder
cp config.example.yml config.yml  # Edit with your data paths and connection string
pmb --help

conda + pip (alternative)

Note: Database management commands (pixi run db-*) are not available. Use your own PostgreSQL instance.

conda create -n pmb python=3.12 rdkit-postgresql -c conda-forge
conda activate pmb
pip install pdbminebuilder
cp config.example.yml config.yml
pmb --help

Docker / Podman (alternative)

Note: Requires Docker or Podman. Data files must be mounted as volumes.

git clone https://github.com/N283T/pdb-mine-builder.git
cd pdb-mine-builder
cp config.example.yml config.yml  # Edit data paths
docker compose -f docker/docker-compose.yml up -d
docker compose -f docker/docker-compose.yml run --rm pmb update pdbj --limit 10

See the Getting Started guide for detailed setup instructions.

Pipelines

| Pipeline | Description | Entries | Tables | Size | Format |
|----------|-------------|---------|--------|------|--------|
| pdbj | Main structure data | ~250k | 250 | 183 GB | CIF / mmJSON |
| vrpt | Validation reports | ~250k | 69 | 152 GB | CIF |
| contacts | Protein-protein contacts | ~250k | 2 | 13 GB | JSON |
| cc | Chemical components (with RDKit) | ~50k | 12 | 811 MB | CIF / mmJSON |
| ccmodel | Chemical component models | ~23k | 8 | 174 MB | CIF / mmJSON |
| prd | BIRD reference dictionary | ~1.2k | 17 | 50 MB | CIF / mmJSON |

Total: 368 tables, ~349 GB with all PDB entries loaded (as of 2026-03-08).
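
As a quick cross-check of these totals, the per-pipeline figures can be summed directly. The sketch below transcribes the rows of the pipeline table and assumes 1 GB = 1000 MB. The six listed pipelines account for 358 tables, so the remaining 10 of the 368 presumably belong to schemas not shown in the table; that is an inference, not something stated here.

```python
# Rows transcribed from the pipeline table: (pipeline, tables, size in GB).
PIPELINES = [
    ("pdbj", 250, 183.0),
    ("vrpt", 69, 152.0),
    ("contacts", 2, 13.0),
    ("cc", 12, 0.811),      # 811 MB
    ("ccmodel", 8, 0.174),  # 174 MB
    ("prd", 17, 0.050),     # 50 MB
]

listed_tables = sum(tables for _, tables, _ in PIPELINES)
listed_size_gb = sum(size for _, _, size in PIPELINES)

print(f"tables in listed pipelines: {listed_tables}")        # 358
print(f"size of listed pipelines: {listed_size_gb:.0f} GB")  # 349 GB
```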

See the Database Reference for schema details and SQL examples.

Query

Execute SQL queries directly from the CLI with multiple output formats:

pmb query "SELECT * FROM cc.brief_summary LIMIT 5"                    # Rich table
pmb query "SELECT * FROM cc.brief_summary" -F csv > out.csv            # CSV
pmb query "SELECT * FROM cc.brief_summary LIMIT 10" -F json            # JSON
pmb query "SELECT * FROM cc.brief_summary" -F parquet -o out.parquet   # Parquet
pmb query -f query.sql                                                 # SQL from file
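
To drive these commands from a script, the argument list can be assembled programmatically and the CSV output parsed with the standard library. This is a minimal sketch under stated assumptions: `build_query_command` and `parse_csv` are hypothetical helpers, only the flags shown above (`-F`, `-o`, `-f`) are used, combining `-f` with `-F` or `-o` is assumed to work, and the CSV output is assumed to start with a header row.

```python
import csv
import io


def build_query_command(sql=None, sql_file=None, fmt=None, output=None):
    """Assemble a `pmb query` argument list from the flags shown above."""
    if (sql is None) == (sql_file is None):
        raise ValueError("pass exactly one of sql or sql_file")
    argv = ["pmb", "query"]
    if sql is not None:
        argv.append(sql)
    else:
        argv += ["-f", sql_file]
    if fmt is not None:
        argv += ["-F", fmt]
    if output is not None:
        argv += ["-o", output]
    return argv


def parse_csv(text):
    """Parse CSV text into a list of dicts (assumes the first row is a header)."""
    return list(csv.DictReader(io.StringIO(text)))
```

The resulting list can be passed to `subprocess.run(..., capture_output=True, text=True, check=True)`, and the captured `stdout` handed to `parse_csv`.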

Development

pixi run lint      # Ruff check
pixi run format    # Ruff format
pixi run test      # Run tests (pytest)
pixi run check     # All checks

Requirements

  • Python 3.12+
  • PostgreSQL 17+ (managed by rdkit-postgresql via conda-forge)
  • Pixi — manages all dependencies (conda + PyPI)
  • rsync

Note: Most dependencies are installed from conda-forge. Only ccd2rdmol (PyPI only) and psycopg[binary,pool] (extras required) remain as PyPI dependencies. PostgreSQL version is determined by rdkit-postgresql.
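
Before installing, it can help to confirm that the external tools and the Python version are in place. The sketch below is a hypothetical pre-flight check, not something shipped with pdb-mine-builder: `missing_tools` and `python_ok` are illustrative names, and the tool list (`rsync`, `pixi`) is taken from the requirements above.

```python
import shutil
import sys

# Executables named in the requirements list above.
REQUIRED_TOOLS = ["rsync", "pixi"]


def missing_tools(tools):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_ok(minimum=(3, 12), version=None):
    """Check that the running (or given) Python version meets the minimum."""
    version = sys.version_info[:2] if version is None else version
    return tuple(version) >= tuple(minimum)


def report():
    """Print a short summary of anything that looks missing."""
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("missing tools:", ", ".join(missing))
    if not python_ok():
        print("Python 3.12+ is required")
    return not missing and python_ok()
```

The check covers only what can be detected from PATH and the interpreter; the PostgreSQL version and the RDKit cartridge still need to be verified against the database itself.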

License

MIT - See LICENSE for details.

Relationship to mine2updater

This project is inspired by mine2updater (LGPLv3) by PDBj, which loads PDB data into PostgreSQL using Node.js. pdb-mine-builder is an independent rewrite in Python with a completely different tech stack (gemmi, SQLAlchemy, psycopg3, RDKit), architecture, and data model. No code was copied or translated from the original project. Shared concepts (pipeline names, schema structures, PDB ID encoding) derive from PDB data specifications, not from the original codebase.

References

  • Kinjo AR, Yamashita R, Nakamura H. PDBj Mine: design and implementation of relational database interface for Protein Data Bank Japan. Database (Oxford). 2010;2010:baq021. doi: 10.1093/database/baq021
