PubMed database builder and query interface

pmcdb

PubMed query tool. Downloads the complete NCBI PubMed corpus (~40M articles), parses every DTD field, and produces compact, queryable Parquet tables (~34 GB). Every user also serves the data back to the scientific community over irohds P2P.

Install

uv add pmcdb

The Rust parser binary is bundled in the platform-specific wheel. No Rust toolchain needed.

Usage

from pmcdb import PubMed

with PubMed() as db:
    df = db.df("SELECT * FROM citation WHERE pub_year = '2024' LIMIT 10")
    print(db.tables())  # 30 tables

# Reproducible checkpoint (query-time filter only)
with PubMed(through="2024") as db:
    df = db.df("SELECT count(*) FROM citation")

First call triggers build (~2 min on 64-core, ~15 min average). Subsequent calls: instant (local cache) or delta-efficient P2P fetch.
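
The `through=` checkpoint in the usage example is described as a query-time filter only, meaning the stored tables stay untouched and only the query changes. A minimal sketch of that idea follows, using the standard-library `sqlite3` module as a stand-in for the DuckDB layer. The table contents, the `count_through` helper, and the reading of `through` as an inclusive upper bound on `pub_year` are all assumptions for illustration, not pmcdb's actual implementation.

```python
import sqlite3

# Stand-in data: a tiny `citation` table with a string-typed pub_year,
# matching the quoted '2024' comparison in the usage example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citation (pmid INTEGER, pub_year TEXT)")
conn.executemany(
    "INSERT INTO citation VALUES (?, ?)",
    [(1, "2022"), (2, "2023"), (3, "2024"), (4, "2025")],
)


def count_through(conn, through=None):
    """Count citations, optionally restricted to pub_year <= through.

    ASSUMPTION: `through` is an inclusive upper bound on pub_year.
    The stored rows are never modified; only the query changes.
    """
    if through is None:
        return conn.execute("SELECT count(*) FROM citation").fetchone()[0]
    return conn.execute(
        "SELECT count(*) FROM citation WHERE pub_year <= ?", (through,)
    ).fetchone()[0]


full_count = count_through(conn)
checkpoint_count = count_through(conn, through="2024")
print(full_count, checkpoint_count)
```

Because the filter lives in the query, the same local files can answer both the full and the checkpointed question without a rebuild.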

CLI

python -m pmcdb                     # build + compact + serve (default)
python -m pmcdb query "SELECT ..."  # run SQL
python -m pmcdb --no-compact        # build only, skip compaction

Architecture

pmcdb-core (Rust)   download + parse XML.gz -> per-worker Parquet
                   quick-xml, arrow/parquet, crossbeam-channel, ureq, coren
pmcdb (Python)      DuckDB query layer, compaction, irohds P2P distribution
                   @irohds.memo on create_table, coren for resource limits
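
The split above (workers parse input files and each writes its own output) can be sketched roughly as below. This is a Python stand-in for the Rust side, using `queue.Queue` where the real code lists crossbeam-channel. The shard names, the `parse_shard` helper, and the row shape are hypothetical.

```python
import queue
import threading


def parse_shard(shard_name, records):
    """Hypothetical stand-in for parsing one XML.gz file into rows."""
    return [(pmid, shard_name) for pmid in records]


def worker(worker_id, jobs, outputs):
    """Pull shards off the queue until a None sentinel arrives."""
    rows = []
    while True:
        job = jobs.get()
        if job is None:
            break
        shard_name, records = job
        rows.extend(parse_shard(shard_name, records))
    # Each worker keeps its own sorted output, like one file per worker.
    outputs[worker_id] = sorted(rows)


def run(shards, n_workers=2):
    jobs = queue.Queue()
    outputs = {}
    threads = [
        threading.Thread(target=worker, args=(i, jobs, outputs))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    for shard in shards:
        jobs.put(shard)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return outputs


shards = [("a.xml.gz", [3, 1]), ("b.xml.gz", [2]), ("c.xml.gz", [5, 4])]
per_worker = run(shards)
all_rows = sorted(row for rows in per_worker.values() for row in rows)
```

Which worker receives which shard varies from run to run, which is why a later sort and merge step is needed before the output can be compared across builds.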

30 tables: 27 from XML corpus + 3 auxiliary (journal catalog, deleted PMIDs, computed author clusters from NCBI).

  • Parse: ~70-80k records/sec (zero-copy RowWriter into Arrow builders)
  • Full dataset: ~34 GB Parquet (vs 239 GB SQLite)
  • Resume after interrupt via _state file
  • Deterministic: same FTP state -> byte-identical sorted Parquet
  • Adaptive flush threshold via coren (Pi 4GB to HPC 512GB)
  • Mandatory compaction before P2P: one sorted file per table
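
The determinism and compaction points above fit together: if every per-worker run is sorted and the runs are merged on a total order, the compacted output does not depend on how records were split across workers. A minimal sketch, assuming rows are `(pmid, title)` tuples with unique PMIDs and using a plain text serialization in place of Parquet (both are illustrative assumptions):

```python
import hashlib
import heapq


def compact(worker_runs):
    """Merge per-worker sorted runs into one sorted list of rows."""
    return list(heapq.merge(*worker_runs))


def digest(rows):
    """Hash a canonical text serialization of the rows."""
    h = hashlib.sha256()
    for pmid, title in rows:
        h.update(f"{pmid}\t{title}\n".encode("utf-8"))
    return h.hexdigest()


# Two builds that distributed the same records differently across workers.
run_a = [[(1, "alpha"), (4, "delta")], [(2, "beta"), (3, "gamma")]]
run_b = [[(2, "beta"), (4, "delta")], [(1, "alpha"), (3, "gamma")]]

compacted_a = compact(run_a)
compacted_b = compact(run_b)
same_bytes = digest(compacted_a) == digest(compacted_b)
print(same_bytes)
```

A single sorted file per table also gives peers one stable artifact to compare by hash, which is presumably why compaction is mandatory before P2P serving.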

Development

make dev       # maturin develop into venv
make test      # cargo test + pytest
make sync      # full pipeline: build + compact + serve

Publishing

Credentials in standard locations:

  • ~/.pypirc (twine)
  • ~/.cargo/credentials.toml (cargo)

make pub               # bump patch, test, build host wheel, upload, tag
make pub V=0.2.0       # explicit version
make pub-all           # all platforms (needs podman for Linux wheels)
make release V=0.2.0   # tag-only (CI builds + publishes)

CI

Codeberg Forgejo Actions. CI runs Rust + Python tests on push/PR. Release workflow builds Linux x86_64 + aarch64 wheels on tag push, publishes to PyPI + crates.io + Codeberg release.

Download files

Source distribution:

  • pmcdb-0.0.2.tar.gz (108.6 kB)

Built distributions:

  • pmcdb-0.0.2-py3-none-win_amd64.whl (4.8 MB): Python 3, Windows x86-64
  • pmcdb-0.0.2-py3-none-manylinux_2_34_x86_64.whl (5.0 MB): Python 3, manylinux glibc 2.34+, x86-64
  • pmcdb-0.0.2-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (4.5 MB): Python 3, manylinux glibc 2.17+, ARM64
