Python toolkit and CLI for exploring, downloading, and parsing PMC article data.

PMC Toolkit

Python toolkit and CLI for exploring, downloading, and parsing PMC article data from the PMC Open Access dataset on AWS S3 (s3://pmc-oa-opendata).

Current Status

The project currently supports:

  • listing available versions for a PMCID
  • validating PMC identifiers before making requests
  • retrieving metadata for a PMC identifier, defaulting to the latest version for a base PMCID
  • listing every object for a resolved article version, using the local cache when available
  • downloading files for an article version into a local cache (the optional --ext filter applies only to the fetch command, not to the files command; --ext accepts either a comma-separated list or repeated flags)
  • parsing cached full-text XML into a normalized article dictionary with title, journal, article, affiliations, author notes, abstract, content, acknowledgements, data availability, related articles, custom metadata, competing interests, supplementary media, references, figures, and tables
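
The identifier validation step in the list above can be pictured with a small parser. This is a minimal sketch, assuming identifiers have the shape used in this README's examples (PMC followed by digits, with an optional .N version suffix). The name parse_pmcid and the regular expression are hypothetical and are not taken from src/pmc_toolkit/validators.py.

```python
from __future__ import annotations

import re

# Assumed shape: "PMC" + digits, optionally followed by ".<version number>".
_PMCID_RE = re.compile(r"PMC(\d+)(?:\.(\d+))?")


def parse_pmcid(value: str) -> tuple[str, int | None]:
    """Split an identifier into (base PMCID, version); raise ValueError if malformed."""
    match = _PMCID_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a valid PMC identifier: {value!r}")
    base = f"PMC{match.group(1)}"
    # An unversioned ID yields None, leaving version resolution to the caller.
    version = int(match.group(2)) if match.group(2) is not None else None
    return base, version
```

Rejecting malformed input before any network call keeps typos from turning into confusing S3 errors.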

Requirements

  • Python 3.11+
  • uv

Setup

uv sync

Development

After code changes, run the checks in AGENTS.md (typecheck, Ruff, tests).

CLI Usage

Show the available commands:

uv run pmc-toolkit --help

CLI commands print indented JSON to stdout.
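
Because the output is JSON, a script can capture stdout and decode it directly. The following is a rough sketch, assuming uv and the project are available on the machine; the helper names run_pmc_toolkit and parse_cli_output are hypothetical, and the shape of the decoded value depends on the command.

```python
import json
import subprocess
from typing import Any


def parse_cli_output(stdout: str) -> Any:
    """Decode the JSON document that a pmc-toolkit command printed."""
    return json.loads(stdout)


def run_pmc_toolkit(*args: str) -> Any:
    """Run `uv run pmc-toolkit <args>` and return the decoded JSON output."""
    completed = subprocess.run(
        ["uv", "run", "pmc-toolkit", *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_cli_output(completed.stdout)
```

For example, run_pmc_toolkit("versions", "PMC11370360") would return whatever structure the versions command prints.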

List versions for a PMC article:

uv run pmc-toolkit versions PMC11370360

Fetch metadata for the latest available version of a PMCID:

uv run pmc-toolkit metadata PMC11370360

Fetch metadata for a specific version:

uv run pmc-toolkit metadata PMC11370360.1

List every object key for an article version (including media and supplements). For unversioned IDs, the CLI resolves the latest version from S3 first; once the version is known, the cached object-key manifest is reused when present. There is no extension filter on this command.

uv run pmc-toolkit files PMC11370360.1
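
Resolving the latest version for an unversioned ID amounts to picking the highest version number from the available versions. A minimal sketch, assuming versions are written as PMCid.N with an integer N and that "latest" means the largest N; latest_version is a hypothetical helper, not the toolkit's own function.

```python
def latest_version(versioned_ids: list[str]) -> str:
    """Return the ID with the highest numeric version suffix."""
    if not versioned_ids:
        raise ValueError("no versions available")
    # Compare the suffix numerically so that ".10" sorts after ".2".
    return max(versioned_ids, key=lambda item: int(item.rsplit(".", 1)[1]))
```

The numeric comparison matters: a plain string sort would rank PMC11370360.2 above PMC11370360.10.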

Download files to a local cache. The default root is the per-OS user cache directory from platformdirs (e.g. ~/.cache/pmc-toolkit on Linux, ~/Library/Caches/pmc-toolkit on macOS, and under %LOCALAPPDATA% on Windows), with files under <root>/<PMCid.N>/. Override with --cache-dir or PMC_TOOLKIT_CACHE.

uv run pmc-toolkit fetch PMC11370360.1

Download only selected file types, re-downloading even if cached:

uv run pmc-toolkit fetch PMC11370360.1 --ext xml,pdf,jpg --force

The --ext option also accepts repeated flags if you prefer the more explicit form:

uv run pmc-toolkit fetch PMC11370360.1 --ext pdf --ext xml --ext jpg --force
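
Both --ext forms can be reduced to the same list of extensions. The sketch below shows one way to do that; normalize_extensions is a hypothetical name, and the lowercasing, leading-dot stripping, and de-duplication are assumptions made for illustration rather than documented behavior.

```python
def normalize_extensions(values: list[str]) -> list[str]:
    """Flatten repeated and comma-separated --ext values into one ordered list."""
    seen: dict[str, None] = {}  # dict keys keep insertion order and drop duplicates
    for value in values:
        for part in value.split(","):
            ext = part.strip().lower().lstrip(".")
            if ext:  # skip empty pieces such as the one produced by "xml,"
                seen.setdefault(ext, None)
    return list(seen)
```

With this helper, ["xml,pdf,jpg"] and ["xml", "pdf", "jpg"] both produce ["xml", "pdf", "jpg"].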

Override the cache location via a flag or the PMC_TOOLKIT_CACHE env var:

uv run pmc-toolkit fetch PMC11370360.1 --cache-dir ./data
PMC_TOOLKIT_CACHE=./data uv run pmc-toolkit fetch PMC11370360.1
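
The override order can be sketched as a small resolver. This README does not state which source wins when both are set, so the precedence below (flag, then environment variable, then default) is an assumption, and resolve_cache_root is a hypothetical helper. In practice the default would come from platformdirs, for example platformdirs.user_cache_dir("pmc-toolkit"); it is passed in here to keep the sketch free of third-party imports.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from pathlib import Path


def resolve_cache_root(
    cache_dir: str | None,
    default_root: Path,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Pick the cache root: explicit flag, then PMC_TOOLKIT_CACHE, then the default."""
    if cache_dir:
        return Path(cache_dir)
    env_value = env.get("PMC_TOOLKIT_CACHE")
    if env_value:
        return Path(env_value)
    return default_root
```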

Convert a cached XML file into extracted JSON. Run fetch --ext xml first if the XML is not already in the cache. The first conversion parses XML once, writes <cache-root>/<PMCid.N>/.pmc-extracted-article.json, and prints the extracted JSON; later conversions for the same article version read that JSON cache unless --force is passed.

uv run pmc-toolkit fetch PMC11370360.1 --ext xml
uv run pmc-toolkit convert-xml PMC11370360.1
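
The parse-once, reuse-later behavior described above is a read-through cache. Here is a minimal sketch of that pattern, assuming the parser is supplied as a callable; convert_with_cache and its parameters are hypothetical and do not mirror the xml_parse_api signatures.

```python
import json
from collections.abc import Callable
from pathlib import Path
from typing import Any

EXTRACTED_NAME = ".pmc-extracted-article.json"


def convert_with_cache(
    article_dir: Path,
    parse_xml: Callable[[], dict[str, Any]],
    force: bool = False,
) -> dict[str, Any]:
    """Return the extracted article, parsing XML only on a cache miss or when forced."""
    cache_file = article_dir / EXTRACTED_NAME
    if cache_file.is_file() and not force:
        return json.loads(cache_file.read_text(encoding="utf-8"))
    article = parse_xml()  # the expensive step: runs once per article version
    cache_file.write_text(json.dumps(article, indent=2), encoding="utf-8")
    return article
```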

List the extracted JSON top-level keys:

uv run pmc-toolkit convert-xml --list-keys PMC11370360.1

article_info.publication_date currently uses the first publication date found in the XML. If downstream consumers need to distinguish date types such as epub, ppub, or collection, the output can be extended later.
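
To illustrate what such an extension might involve, the sketch below collects every pub-date element along with its type attribute. It assumes JATS-style markup (pub-date elements with a pub-type or date-type attribute and year, month, and day children) and uses the standard library's ElementTree rather than lxml to stay self-contained; publication_dates is a hypothetical function, not part of the toolkit.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def publication_dates(xml_text: str) -> list[dict[str, str | None]]:
    """Return every <pub-date> in document order, keeping its type attribute."""
    root = ET.fromstring(xml_text)
    dates = []
    for node in root.iter("pub-date"):
        dates.append(
            {
                # Older JATS uses pub-type; newer versions use date-type.
                "type": node.get("pub-type") or node.get("date-type"),
                "year": node.findtext("year"),
                "month": node.findtext("month"),
                "day": node.findtext("day"),
            }
        )
    return dates
```

The first element of the returned list corresponds to the "first publication date found" rule; the remaining entries are what a consumer would need to tell epub, ppub, and collection dates apart.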

Project Layout

Here “storage” means the AWS bucket plus the local cache directory where pmc-toolkit fetch writes files—not a database or ORM.

  • src/pmc_toolkit/cli.py - Typer CLI commands
  • src/pmc_toolkit/storage_api.py - import this for programmatic use: list versions, metadata, list all keys, fetch to cache
  • src/pmc_toolkit/storage_utils.py - boto3/unsigned S3 client, list-objects, downloads; implementation details for storage_api
  • src/pmc_toolkit/xml_parse_api.py - import this for programmatic parsing of cached XML files
  • src/pmc_toolkit/xml_parse_utils.py - small lxml-based helpers for cached full-text XML extraction
  • src/pmc_toolkit/cache.py - per-article directories under the cache root, JSON metadata, cached S3 key listings, and safe local paths for downloaded objects
  • src/pmc_toolkit/validators.py - identifier validation
  • src/pmc_toolkit/models.py - response models
  • tests/ - automated tests

Local cache

Each resolved article version has a directory <cache_root>/<PMCid.N>/ containing:

  • <PMCid.N>.json — cached metadata (from S3 metadata/<PMCid.N>.json), written after a successful read.
  • .pmc-object-keys.json — JSON array of S3 object keys under that article’s prefix, written after list_objects_v2 (or read on cache hit). If this file is missing or not a list of strings, listing or fetch may refetch from S3 or raise ValueError for an invalid manifest.
  • .pmc-extracted-article.json — full extracted JSON produced from the cached XML by pmc-toolkit convert-xml; reused by later conversions for the same article version.
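
Reading the key manifest safely means checking its shape before trusting it. The sketch below is one interpretation of the behavior described above: a missing file returns None so the caller can refetch from S3, and a malformed file raises ValueError. load_manifest is a hypothetical helper and the exact split between refetching and raising is an assumption.

```python
from __future__ import annotations

import json
from pathlib import Path

MANIFEST_NAME = ".pmc-object-keys.json"


def load_manifest(article_dir: Path) -> list[str] | None:
    """Return cached S3 keys, None if no manifest exists, ValueError if it is invalid."""
    manifest = article_dir / MANIFEST_NAME
    if not manifest.is_file():
        return None  # cache miss: the caller lists the keys from S3 instead
    # json.JSONDecodeError subclasses ValueError, so broken JSON also surfaces as ValueError.
    data = json.loads(manifest.read_text(encoding="utf-8"))
    if not isinstance(data, list) or not all(isinstance(key, str) for key in data):
        raise ValueError(f"invalid manifest: {manifest}")
    return data
```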

Cache root selection: pmc-toolkit metadata and pmc-toolkit files (and the matching storage_api functions) always use the default OS user cache from platformdirs. Only pmc-toolkit fetch and fetch_files(..., cache_dir=...) honor an override, given as --cache-dir, the cache_dir argument, or the PMC_TOOLKIT_CACHE environment variable.

Download paths: For each S3 key, the toolkit maps PMCid.N/relative/path to <cache_root>/PMCid.N/relative/path. Keys that do not start with the PMCid.N/ prefix, use an absolute path segment, or resolve outside that directory (for example .. path segments) are rejected with ValueError so downloads never leave the article folder.
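
The three rejection rules above can be sketched with pathlib. This is an illustrative version under stated assumptions, not the code in src/pmc_toolkit/cache.py; safe_local_path and its parameters are hypothetical names.

```python
from pathlib import Path, PurePosixPath


def safe_local_path(cache_root: Path, article_id: str, key: str) -> Path:
    """Map an S3 key to a path inside <cache_root>/<article_id>/ or raise ValueError."""
    prefix = f"{article_id}/"
    if not key.startswith(prefix):
        raise ValueError(f"key is outside the article prefix: {key!r}")
    relative = PurePosixPath(key[len(prefix):])  # S3 keys use forward slashes
    if relative.is_absolute() or not relative.parts:
        raise ValueError(f"key has an absolute or empty path: {key!r}")
    article_dir = (cache_root / article_id).resolve()
    target = article_dir.joinpath(*relative.parts).resolve()
    # resolve() collapses ".." segments, so an escaping key ends up outside article_dir.
    if not target.is_relative_to(article_dir):
        raise ValueError(f"key resolves outside the article directory: {key!r}")
    return target
```

Checking the resolved path, rather than scanning the key for "..", also covers less obvious escapes such as symlinked directories inside the cache.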
