PMC Toolkit
Python toolkit and CLI for exploring, downloading, and parsing PMC article data
from the PMC Open Access dataset on AWS S3 (s3://pmc-oa-opendata).
Current Status
The project currently supports:
- listing available versions for a PMCID
- validating PMC identifiers before making requests
- retrieving metadata for a PMC identifier, defaulting to the latest version for a base PMCID
- listing every object for a resolved article version, using the local cache when available
- downloading files for an article version into a local cache (optional --ext filters apply only to fetch, not to files; --ext accepts either a comma-separated list or repeated flags)
- parsing cached full-text XML into a normalized article dictionary with title, journal, article, affiliations, author notes, abstract, content, acknowledgements, data availability, related articles, custom metadata, competing interests, supplementary media, references, figures, and tables
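The identifier validation listed above could look roughly like the following sketch. The regular expression and the function name parse_pmcid are assumptions made for illustration, not the toolkit's actual validator; the only forms this README confirms are a base ID such as PMC11370360 and a versioned ID such as PMC11370360.1.

```python
from __future__ import annotations

import re

# Assumed shape: "PMC" + digits, optionally followed by ".<version number>".
_PMCID_RE = re.compile(r"PMC(\d+)(?:\.(\d+))?")


def parse_pmcid(value: str) -> tuple[str, int | None]:
    """Return (base_id, version); version is None for an unversioned ID.

    Hypothetical helper: raises ValueError before any request would be made.
    """
    match = _PMCID_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a valid PMC identifier: {value!r}")
    base = f"PMC{match.group(1)}"
    version = int(match.group(2)) if match.group(2) is not None else None
    return base, version
```

Rejecting malformed input locally keeps obviously bad IDs from turning into S3 requests.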
Requirements
- Python 3.11+
- uv
Setup
uv sync
Development
After code changes, run the checks in AGENTS.md (typecheck, Ruff, tests).
CLI Usage
Show the available commands:
uv run pmc-toolkit --help
CLI commands print indented JSON to stdout.
List versions for a PMC article:
uv run pmc-toolkit versions PMC11370360
Fetch metadata for the latest available version of a PMCID:
uv run pmc-toolkit metadata PMC11370360
Fetch metadata for a specific version:
uv run pmc-toolkit metadata PMC11370360.1
List every object key for an article version (including media and supplements). For unversioned IDs, the CLI resolves the latest version from S3 first; once the version is known, the cached object-key manifest is reused when present. There is no extension filter on this command.
uv run pmc-toolkit files PMC11370360.1
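As a rough illustration of resolving an unversioned ID, the sketch below picks the latest entry from a list of versioned IDs. It assumes that "latest" means the highest numeric suffix and that the versions have already been listed from S3; latest_version is a hypothetical name, not a function of the toolkit.

```python
def latest_version(versioned_ids: list[str]) -> str:
    """Pick the ID with the highest numeric version suffix (assumed rule)."""
    if not versioned_ids:
        raise ValueError("no versions available")
    # Compare suffixes as integers so that ".10" sorts after ".9".
    return max(versioned_ids, key=lambda vid: int(vid.rsplit(".", 1)[1]))
```

Comparing the suffix as an integer rather than as a string avoids ranking version 9 above version 10.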
Download files to a local cache. The default root is the per-OS user cache
directory from platformdirs (e.g. ~/.cache/pmc-toolkit on Linux,
~/Library/Caches/pmc-toolkit on macOS, and under %LOCALAPPDATA% on Windows),
with files under <root>/<PMCid.N>/. Override with --cache-dir or
PMC_TOOLKIT_CACHE.
uv run pmc-toolkit fetch PMC11370360.1
Download only selected file types, re-downloading even if cached:
uv run pmc-toolkit fetch PMC11370360.1 --ext xml,pdf,jpg --force
The --ext option also accepts repeated flags if you prefer the more explicit
form:
uv run pmc-toolkit fetch PMC11370360.1 --ext pdf --ext xml --ext jpg --force
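Both --ext forms can be reduced to the same list of extensions. The sketch below shows one way to do that; normalize_exts and key_matches are hypothetical names, and lowercasing and stripping a leading dot are assumptions rather than documented behavior.

```python
from pathlib import PurePosixPath


def normalize_exts(values: list[str]) -> list[str]:
    """Flatten comma-separated and repeated values into unique extensions."""
    seen: list[str] = []
    for value in values:
        for part in value.split(","):
            ext = part.strip().lower().lstrip(".")
            if ext and ext not in seen:
                seen.append(ext)
    return seen


def key_matches(key: str, exts: list[str]) -> bool:
    """True when no filter is given or the key's suffix is a requested one."""
    if not exts:
        return True
    return PurePosixPath(key).suffix.lstrip(".").lower() in exts
```

With this shape, --ext xml,pdf,jpg arrives as ["xml,pdf,jpg"] and repeated flags arrive as ["pdf", "xml", "jpg"]; both yield the same set of extensions.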
Override the cache location via a flag or the PMC_TOOLKIT_CACHE env var:
uv run pmc-toolkit fetch PMC11370360.1 --cache-dir ./data
PMC_TOOLKIT_CACHE=./data uv run pmc-toolkit fetch PMC11370360.1
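The cache-root lookup can be pictured as a short precedence chain. This is a minimal sketch, not the toolkit's code: resolve_cache_root is a hypothetical name, and letting the flag win over the environment variable is an assumption, since this README does not state the order. The fallback uses platformdirs.user_cache_dir, which is imported lazily so the first two branches work without the dependency.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_root(
    cache_dir: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> Path:
    """Return the cache root: flag, then env var, then the OS default."""
    # 1. Explicit --cache-dir value (assumed to take priority).
    if cache_dir:
        return Path(cache_dir).expanduser()
    # 2. PMC_TOOLKIT_CACHE environment variable.
    env_value = environ.get("PMC_TOOLKIT_CACHE")
    if env_value:
        return Path(env_value).expanduser()
    # 3. Per-OS user cache directory from platformdirs (third-party package).
    from platformdirs import user_cache_dir

    return Path(user_cache_dir("pmc-toolkit"))
```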
Convert a cached XML file into extracted JSON. Run fetch --ext xml first if
the XML is not already in the cache. The first conversion parses XML once,
writes <cache-root>/<PMCid.N>/.pmc-extracted-article.json, and prints the
extracted JSON; later conversions for the same article version read that JSON
cache unless --force is passed.
uv run pmc-toolkit fetch PMC11370360.1 --ext xml
uv run pmc-toolkit convert-xml PMC11370360.1
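The parse-once behavior described above amounts to a read-through JSON cache. The sketch below illustrates the idea under stated assumptions: load_or_extract and its parse callback are hypothetical, and only the file name .pmc-extracted-article.json and the --force semantics come from this README.

```python
import json
from pathlib import Path
from typing import Any, Callable

EXTRACTED_NAME = ".pmc-extracted-article.json"


def load_or_extract(
    article_dir: Path,
    xml_path: Path,
    parse: Callable[[Path], dict[str, Any]],
    force: bool = False,
) -> dict[str, Any]:
    """Reuse the extracted JSON when present; otherwise parse and store it."""
    cache_path = article_dir / EXTRACTED_NAME
    if cache_path.is_file() and not force:
        return json.loads(cache_path.read_text(encoding="utf-8"))
    article = parse(xml_path)  # the expensive XML parse happens only here
    cache_path.write_text(json.dumps(article, indent=2), encoding="utf-8")
    return article
```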
List the extracted JSON top-level keys:
uv run pmc-toolkit convert-xml --list-keys PMC11370360.1
article_info.publication_date currently uses the first publication date found
in the XML. If downstream consumers need to distinguish date types such as
epub, ppub, or collection, the output can be extended later.
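If date types ever need to be distinguished, one possible extension is to key each pub-date element by its type attribute. The sketch below is not part of the toolkit: it uses the standard library's xml.etree.ElementTree instead of lxml, publication_dates is a hypothetical name, and it assumes JATS-style pub-date elements carrying a pub-type (or date-type) attribute with year, month, and day children.

```python
import xml.etree.ElementTree as ET


def publication_dates(xml_text: str) -> dict[str, str]:
    """Map each publication date type to a YYYY[-MM[-DD]] string."""
    root = ET.fromstring(xml_text)
    dates: dict[str, str] = {}
    for node in root.iter("pub-date"):
        kind = node.get("pub-type") or node.get("date-type") or "unknown"
        year = node.findtext("year")
        if not year:
            continue  # a date without a year is not usable
        parts = [year.strip()]
        for tag in ("month", "day"):
            text = node.findtext(tag)
            if not text:
                break  # stop at the first missing component
            parts.append(text.strip().zfill(2))
        # Keep the first date seen for each type.
        dates.setdefault(kind, "-".join(parts))
    return dates
```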
Project Layout
Here “storage” means the AWS bucket plus the local cache directory where
pmc-toolkit fetch writes files—not a database or ORM.
- src/pmc_toolkit/cli.py - Typer CLI commands
- src/pmc_toolkit/storage_api.py - import this for programmatic use: list versions, metadata, list all keys, fetch to cache
- src/pmc_toolkit/storage_utils.py - boto3/unsigned S3 client, list-objects, downloads; implementation details for storage_api
- src/pmc_toolkit/xml_parse_api.py - import this for programmatic parsing of cached XML files
- src/pmc_toolkit/xml_parse_utils.py - small lxml-based helpers for cached full-text XML extraction
- src/pmc_toolkit/cache.py - per-article directories under the cache root, JSON metadata, cached S3 key listings, and safe local paths for downloaded objects
- src/pmc_toolkit/validators.py - identifier validation
- src/pmc_toolkit/models.py - response models
- tests/ - automated tests
Local cache
Each resolved article version has a directory <cache_root>/<PMCid.N>/ containing:
- <PMCid.N>.json - cached metadata (from S3 metadata/<PMCid.N>.json), written after a successful read.
- .pmc-object-keys.json - JSON array of S3 object keys under that article's prefix, written after list_objects_v2 (or read on cache hit). If this file is missing or not a list of strings, listing or fetch may refetch from S3 or raise ValueError for an invalid manifest.
- .pmc-extracted-article.json - full extracted JSON produced from the cached XML by pmc-toolkit convert-xml; reused by later conversions for the same article version.
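Reading the object-key manifest could be sketched as follows. The split between the two outcomes is an assumption: here a missing file counts as a cache miss (the caller would list from S3), while a file that is not a JSON list of strings raises ValueError. read_manifest is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Optional

MANIFEST_NAME = ".pmc-object-keys.json"


def read_manifest(article_dir: Path) -> Optional[list[str]]:
    """Return cached S3 keys, None on a cache miss, or raise on bad content."""
    path = article_dir / MANIFEST_NAME
    if not path.is_file():
        return None  # cache miss: the caller falls back to listing from S3
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, list) or not all(isinstance(k, str) for k in data):
        raise ValueError(f"invalid object-key manifest: {path}")
    return data
```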
Cache root selection: pmc-toolkit metadata and pmc-toolkit files (and the matching storage_api functions) always use the default OS user cache from platformdirs. Only pmc-toolkit fetch and fetch_files(..., cache_dir=...) accept --cache-dir or the PMC_TOOLKIT_CACHE environment variable.
Download paths: For each S3 key, the toolkit maps PMCid.N/relative/path to <cache_root>/PMCid.N/relative/path. Keys that do not start with the PMCid.N/ prefix, use an absolute path segment, or resolve outside that directory (for example .. path segments) are rejected with ValueError so downloads never leave the article folder.
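The path checks described above can be sketched with pathlib. This is an illustration under stated assumptions, not the code in cache.py: safe_local_path is a hypothetical name, and the containment test relies on Path.resolve and Path.is_relative_to (Python 3.9+).

```python
from pathlib import Path, PurePosixPath


def safe_local_path(cache_root: Path, article_id: str, key: str) -> Path:
    """Map an S3 key to a path inside <cache_root>/<article_id>/ or raise."""
    prefix = f"{article_id}/"
    if not key.startswith(prefix):
        raise ValueError(f"key is outside the article prefix: {key!r}")
    relative = PurePosixPath(key[len(prefix):])
    if relative.is_absolute() or not relative.parts:
        raise ValueError(f"key has no usable relative path: {key!r}")
    article_dir = (cache_root / article_id).resolve()
    target = article_dir.joinpath(*relative.parts).resolve()
    # resolve() collapses ".." segments, so an escaping key ends up outside.
    if not target.is_relative_to(article_dir):
        raise ValueError(f"key escapes the article directory: {key!r}")
    return target
```

Resolving both paths before comparing them means the check looks at where a download would actually land, not at how the key is spelled.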