PubMed database builder and query interface

pmcdb

PubMed query tool. It downloads the complete NCBI PubMed corpus (~40M articles), parses every DTD field, and produces compact, queryable Parquet tables (~34 GB). Every user serves data back to the scientific community via irohds P2P.

Install

uv add pmcdb

The Rust parser binary is bundled in the platform-specific wheel. No Rust toolchain needed.

Usage

from pmcdb import PubMed

with PubMed() as db:
    df = db.df("SELECT * FROM citation WHERE pub_year = '2024' LIMIT 10")
    print(db.tables())  # 30 tables

# Reproducible checkpoint (query-time filter only)
with PubMed(through="2024") as db:
    df = db.df("SELECT count(*) FROM citation")

The first call triggers a build (~2 min on a 64-core machine, ~15 min average). Subsequent calls are either instant (local cache) or use a delta-efficient P2P fetch.
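
The "query-time filter only" note on the checkpoint suggests that the stored tables stay the same and only the query is narrowed. The sketch below illustrates that idea with the standard-library sqlite3 module standing in for the real query layer. The helper name count_through, the inclusive comparison, and the choice of pub_year as the filter column are assumptions for illustration, not pmcdb's actual implementation.

```python
import sqlite3

# Stand-in for the real citation table; pub_year is text, as in the usage example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citation (pmid INTEGER, pub_year TEXT)")
conn.executemany(
    "INSERT INTO citation VALUES (?, ?)",
    [(1, "2023"), (2, "2024"), (3, "2025")],
)


def count_through(conn, through=None):
    """Count citations, optionally limited to pub_year <= through (hypothetical helper)."""
    if through is None:
        return conn.execute("SELECT count(*) FROM citation").fetchone()[0]
    # Four-digit years compare correctly as strings, so no cast is needed.
    return conn.execute(
        "SELECT count(*) FROM citation WHERE pub_year <= ?", (through,)
    ).fetchone()[0]


total = count_through(conn)               # all rows
checkpoint = count_through(conn, "2024")  # rows up to and including 2024
```

Because the filter is applied when the query runs, the same local data can answer both the unfiltered and the checkpointed query.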

CLI

python -m pmcdb                     # build + compact + serve (default)
python -m pmcdb query "SELECT ..."  # run SQL
python -m pmcdb --no-compact        # build only, skip compaction

Architecture

pmcdb-core (Rust)   download + parse XML.gz -> per-worker Parquet
                   quick-xml, arrow/parquet, crossbeam-channel, ureq, coren
pmcdb (Python)      DuckDB query layer, compaction, irohds P2P distribution
                   @irohds.memo on create_table, coren for resource limits
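
To make the "download + parse XML.gz" stage more concrete, here is a minimal Python sketch of streaming records out of a gzipped PubMed XML file. It is not the Rust parser: it uses only the standard library, extracts just two fields, and the function name iter_records and the tiny in-memory sample are illustrative assumptions. The element names (PubmedArticle, MedlineCitation, PMID, ArticleTitle) follow the PubMed XML format.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the shape of a PubMed XML file.
SAMPLE = b"""<?xml version="1.0"?>
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1</PMID>
      <Article><ArticleTitle>First title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">2</PMID>
      <Article><ArticleTitle>Second title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def iter_records(fileobj):
    """Yield one dict per PubmedArticle without loading the whole file."""
    with gzip.open(fileobj, "rb") as xml:
        for _event, elem in ET.iterparse(xml, events=("end",)):
            if elem.tag == "PubmedArticle":
                yield {
                    "pmid": int(elem.findtext("MedlineCitation/PMID")),
                    "title": elem.findtext("MedlineCitation/Article/ArticleTitle"),
                }
                elem.clear()  # release the subtree once the record is emitted


records = list(iter_records(io.BytesIO(gzip.compress(SAMPLE))))
```

In the real pipeline each worker would write such records to Parquet rather than collect them in a list.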

30 tables: 27 from XML corpus + 3 auxiliary (journal catalog, deleted PMIDs, computed author clusters from NCBI).

  • Parse: ~70-80k records/sec (zero-copy RowWriter into Arrow builders)
  • Full dataset: ~34 GB Parquet (vs 239 GB SQLite)
  • Resume after interrupt via _state file
  • Deterministic: same FTP state -> byte-identical sorted Parquet
  • Adaptive flush threshold via coren (Pi 4GB to HPC 512GB)
  • Mandatory compaction before P2P: one sorted file per table
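
The determinism and compaction points above can be illustrated with a small sketch: if per-worker output is merged and sorted on a unique key before being written, the result no longer depends on which worker finished first. The code below uses JSON lines instead of Parquet, and the function name compact and the pmid sort key are assumptions chosen for illustration.

```python
import hashlib
import json


def compact(worker_chunks):
    """Merge per-worker row lists into one canonical, sorted byte payload."""
    rows = [row for chunk in worker_chunks for row in chunk]
    # Sorting on a unique key gives a total order independent of arrival order.
    rows.sort(key=lambda r: r["pmid"])
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in rows)
    return payload.encode("utf-8")


# The same three rows, split across workers in two different ways.
run_a = [
    [{"pmid": 3, "pub_year": "2024"}],
    [{"pmid": 1, "pub_year": "2023"}, {"pmid": 2, "pub_year": "2024"}],
]
run_b = [
    [{"pmid": 2, "pub_year": "2024"}, {"pmid": 1, "pub_year": "2023"}],
    [{"pmid": 3, "pub_year": "2024"}],
]

digest_a = hashlib.sha256(compact(run_a)).hexdigest()
digest_b = hashlib.sha256(compact(run_b)).hexdigest()
```

Both runs hash to the same digest, which is the property that lets peers treat a compacted file as interchangeable regardless of who built it.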

Development

make dev       # maturin develop into venv
make test      # cargo test + pytest
make sync      # full pipeline: build + compact + serve

Publishing

Credentials in standard locations:

  • ~/.pypirc (twine)
  • ~/.cargo/credentials.toml (cargo)

make pub               # bump patch, test, build host wheel, upload, tag
make pub V=0.2.0       # explicit version
make pub-all           # all platforms (needs podman for Linux wheels)
make release V=0.2.0   # tag-only (CI builds + publishes)

CI

Codeberg Forgejo Actions. CI runs Rust + Python tests on push/PR. Release workflow builds Linux x86_64 + aarch64 wheels on tag push, publishes to PyPI + crates.io + Codeberg release.
