PubMed database builder and query interface
pmcdb
PubMed query tool. Downloads the complete NCBI PubMed corpus (~40M articles), parses every DTD field, and produces compact, queryable Parquet tables (~34 GB). Every user serves data back to the scientific community via irohds P2P.
Install
uv add pmcdb
The Rust parser binary is bundled in the platform-specific wheel, so no Rust toolchain is needed.
Usage
from pmcdb import PubMed

with PubMed() as db:
    df = db.df("SELECT * FROM citation WHERE pub_year = '2024' LIMIT 10")
    print(db.tables())  # 30 tables

# Reproducible checkpoint (query-time filter only)
with PubMed(through="2024") as db:
    df = db.df("SELECT count(*) FROM citation")
The first call triggers a build (~2 min on a 64-core machine, ~15 min on average). Subsequent calls are instant (local cache) or use a delta-efficient P2P fetch.
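The checkpoint above is described as a query-time filter only. One way to picture that idea is a view that caps `pub_year` over an unchanged base table. The sketch below is an assumption for illustration, not pmcdb's implementation: it uses the standard-library `sqlite3` module instead of DuckDB, and `open_checkpoint`, `citation_all`, and the `pmid` column are hypothetical names.

```python
import sqlite3


def open_checkpoint(rows, through=None):
    """Load (pmid, pub_year) rows and expose a `citation` view capped at `through`.

    Hypothetical helper: the stored data is never modified, only the view differs.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE citation_all (pmid INTEGER, pub_year TEXT)")
    con.executemany("INSERT INTO citation_all VALUES (?, ?)", rows)
    if through is None:
        con.execute("CREATE VIEW citation AS SELECT * FROM citation_all")
    else:
        # SQLite views cannot take bound parameters, so the year is
        # validated with int() before being inlined into the statement.
        year = int(through)
        con.execute(
            "CREATE VIEW citation AS SELECT * FROM citation_all "
            f"WHERE CAST(pub_year AS INTEGER) <= {year}"
        )
    return con


if __name__ == "__main__":
    sample = [(1, "2023"), (2, "2024"), (3, "2025")]
    con = open_checkpoint(sample, through="2024")
    print(con.execute("SELECT count(*) FROM citation").fetchone()[0])
```

Because only the view changes, the same local files can answer queries for any checkpoint without a rebuild.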
CLI
python -m pmcdb # build + compact + serve (default)
python -m pmcdb query "SELECT ..." # run SQL
python -m pmcdb --no-compact # build only, skip compaction
Architecture
pmcdb-core (Rust)   download + parse XML.gz -> per-worker Parquet
                    quick-xml, arrow/parquet, crossbeam-channel, ureq, coren
pmcdb (Python)      DuckDB query layer, compaction, irohds P2P distribution
                    @irohds.memo on create_table, coren for resource limits
30 tables: 27 from XML corpus + 3 auxiliary (journal catalog, deleted PMIDs, computed author clusters from NCBI).
- Parse: ~70-80k records/sec (zero-copy RowWriter into Arrow builders)
- Full dataset: ~34 GB Parquet (vs 239 GB SQLite)
- Resume after interrupt via _statefile
- Deterministic: same FTP state -> byte-identical sorted Parquet
- Adaptive flush threshold via coren (Pi 4GB to HPC 512GB)
- Mandatory compaction before P2P: one sorted file per table
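The compaction and determinism points can be illustrated with a small sketch: per-worker outputs that are each sorted get merged into one sorted sequence, and the result hashes identically no matter how records were split across workers. This is a simplified stand-in under stated assumptions, not the project's code: it uses JSON lines instead of Parquet, assumes a unique sort key named `pmid`, and `compact` and `digest` are hypothetical helpers.

```python
import hashlib
import heapq
import json


def compact(shards):
    """Merge per-worker shards (each already sorted by pmid) into one sorted list."""
    return list(heapq.merge(*shards, key=lambda row: row["pmid"]))


def digest(rows):
    """Hash a canonical serialization so identical content yields identical bytes."""
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in rows)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two runs over the same records, where the workers received different slices.
run_a = [[{"pmid": 1}, {"pmid": 4}], [{"pmid": 2}, {"pmid": 3}]]
run_b = [[{"pmid": 2}], [{"pmid": 1}, {"pmid": 3}, {"pmid": 4}]]

if __name__ == "__main__":
    print(digest(compact(run_a)) == digest(compact(run_b)))
```

Sorting on a unique key removes any dependence on worker scheduling, which is what makes a content hash usable as an identity for the compacted file.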
Development
make dev # maturin develop into venv
make test # cargo test + pytest
make sync # full pipeline: build + compact + serve
Publishing
Credentials in standard locations:
- ~/.pypirc (twine)
- ~/.cargo/credentials.toml (cargo)
make pub # bump patch, test, build host wheel, upload, tag
make pub V=0.2.0 # explicit version
make pub-all # all platforms (needs podman for Linux wheels)
make release V=0.2.0 # tag-only (CI builds + publishes)
CI
Codeberg Forgejo Actions. CI runs Rust + Python tests on push/PR. Release workflow builds Linux x86_64 + aarch64 wheels on tag push, publishes to PyPI + crates.io + Codeberg release.
Download files
File details
Details for the file pmcdb-0.0.2.tar.gz.
File metadata
- Download URL: pmcdb-0.0.2.tar.gz
- Upload date:
- Size: 108.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4223c9f0196a3f13786069c9e8b1fb6d01e37f0e1a9d1cd20619514b64582331 |
| MD5 | 8486ec97f1d01bff5327f9a3fb9f53c3 |
| BLAKE2b-256 | 6acb4936c93cf51ca920b27c69a50bb4881cc270ea8ba5bf2f2c534193d7edb2 |
File details
Details for the file pmcdb-0.0.2-py3-none-win_amd64.whl.
File metadata
- Download URL: pmcdb-0.0.2-py3-none-win_amd64.whl
- Upload date:
- Size: 4.8 MB
- Tags: Python 3, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 92032c2e96ed9c52431567c37b797949903ac9e40e665c41e464943e01314414 |
| MD5 | daa1d658024660f045eac7011f831adb |
| BLAKE2b-256 | fd9ea60a5bc2838ca4bcd9cb2cf17b37b22ebcdf89f68ddca6caa29b87bec090 |
File details
Details for the file pmcdb-0.0.2-py3-none-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: pmcdb-0.0.2-py3-none-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 5.0 MB
- Tags: Python 3, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 575de981b633c7754da1045d9d2ee472d09fccfb9afac38efed1163302c6b91c |
| MD5 | b021b3c9de10370a33042e9a0c13325f |
| BLAKE2b-256 | 5287651d2fb8b94db8b60ba08be97d22a92e04335ce88650b9d12ad8d47f5213 |
File details
Details for the file pmcdb-0.0.2-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: pmcdb-0.0.2-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 4.5 MB
- Tags: Python 3, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8db369e5a36e5d29d6530c38258d1256c58879dce2e053908cb0b5b25372a8a6 |
| MD5 | bbd02fc3b2c4f2db5df22b65aa1f62dc |
| BLAKE2b-256 | ceb8297318d1195f79934e464fa5ceba667c1fe9b531a27f8b3bdb62eaa2b88f |
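A downloaded file can be checked against the SHA256 digests listed for each distribution. The following is a minimal sketch using only the standard-library `hashlib` module. `sha256_of` and `verify` are hypothetical helper names, and the local file path is an assumption about where the download was saved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 of a file, read in chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, expected_hex):
    """Compare a file's SHA256 with a published digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify("pmcdb-0.0.2.tar.gz", "4223c9f0196a3f13786069c9e8b1fb6d01e37f0e1a9d1cd20619514b64582331")` should return `True` for an intact copy of the source distribution.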