Python toolkit and CLI for exploring, downloading, and parsing PMC article data.

PMC Toolkit

Python toolkit and CLI for exploring, downloading, and parsing PMC article data from the PMC Open Access dataset on AWS S3 (s3://pmc-oa-opendata).

Current Status

The project currently supports:

  • listing available versions for a PMCID
  • validating PMC identifiers before making requests
  • retrieving metadata for a PMC identifier, defaulting to the latest version for a base PMCID
  • listing every object for a resolved article version, using the local cache when available
  • downloading files for an article version into a local cache; the optional --ext filter, given as a comma-separated list or as repeated flags, applies only to fetch, not to files
  • parsing cached full-text XML into a normalized article dictionary with title, journal, article, affiliations, author notes, abstract, content, acknowledgements, data availability, related articles, custom metadata, competing interests, supplementary media, references, figures, and tables

Requirements

  • Python 3.11+
  • uv

Setup

uv sync

Development

After code changes, run the checks in AGENTS.md (typecheck, Ruff, tests).

CLI Usage

Show the available commands:

uv run pmc --help

CLI commands print indented JSON to stdout.

List versions for a PMC article:

uv run pmc versions PMC11370360

Fetch metadata for the latest available version of a PMCID:

uv run pmc metadata PMC11370360

Fetch metadata for a specific version:

uv run pmc metadata PMC11370360.1

List every object key for an article version (including media and supplements). For unversioned IDs, the CLI resolves the latest version from S3 first; once the version is known, the cached object-key manifest is reused when present. There is no extension filter on this command.

uv run pmc files PMC11370360.1

Download files to a local cache. The default root is the per-OS user cache directory from platformdirs (e.g. ~/.cache/pmc-toolkit on Linux, ~/Library/Caches/pmc-toolkit on macOS, and under %LOCALAPPDATA% on Windows), with files under <cache_root>/<PMCid.N>/. Override with --cache-dir or PMC_TOOLKIT_CACHE.

uv run pmc fetch PMC11370360.1

Download only selected file types, re-downloading even if cached:

uv run pmc fetch PMC11370360.1 --ext xml,pdf,jpg --force

The --ext option also accepts repeated flags if you prefer the more explicit form:

uv run pmc fetch PMC11370360.1 --ext pdf --ext xml --ext jpg --force
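
Both --ext forms can be reduced to the same list of extensions. The sketch below shows one way to do that; `normalize_exts` is a hypothetical helper, and the lowercasing, leading-dot stripping, and de-duplication are assumptions rather than documented behavior of the CLI.

```python
def normalize_exts(values: list[str]) -> list[str]:
    """Flatten repeated and comma-separated --ext values into one ordered list."""
    seen: dict[str, None] = {}  # dict preserves first-seen order
    for value in values:
        for part in value.split(","):
            # Assumption: extensions compare case-insensitively, without a dot.
            ext = part.strip().lower().lstrip(".")
            if ext:
                seen.setdefault(ext, None)
    return list(seen)
```

With this normalization, `--ext xml,pdf,jpg` and `--ext xml --ext pdf --ext jpg` produce the same filter.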

Override the cache location via a flag or the PMC_TOOLKIT_CACHE env var:

uv run pmc fetch PMC11370360.1 --cache-dir ./data
PMC_TOOLKIT_CACHE=./data uv run pmc fetch PMC11370360.1

Convert a cached XML file into extracted JSON. Run fetch --ext xml first if the XML is not already in the cache. The first conversion parses the XML, writes <cache_root>/<PMCid.N>/.pmc-extracted-article.json, and prints the extracted JSON; later conversions for the same article version read that JSON cache unless --force is passed.

uv run pmc fetch PMC11370360.1 --ext xml
uv run pmc convert-xml PMC11370360.1
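
The parse-once, reuse-later behavior described above can be sketched as a read-through cache. The function name `load_or_extract` and the `parse_xml` callback are hypothetical stand-ins for the real parser; only the cache file name comes from this README.

```python
import json
from pathlib import Path
from typing import Callable

EXTRACTED_NAME = ".pmc-extracted-article.json"  # file name from the README


def load_or_extract(
    article_dir: Path,
    parse_xml: Callable[[], dict],
    force: bool = False,
) -> dict:
    """Return the extracted article, parsing only when needed or forced."""
    cache_file = article_dir / EXTRACTED_NAME
    if cache_file.is_file() and not force:
        # Cache hit: skip XML parsing entirely.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    article = parse_xml()  # hypothetical callback that parses the cached XML
    cache_file.write_text(json.dumps(article, indent=2), encoding="utf-8")
    return article
```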

List the extracted JSON top-level keys:

uv run pmc convert-xml --list-keys PMC11370360.1

article_info.publication_date currently uses the first publication date found in the XML. If downstream consumers need to distinguish date types such as epub, ppub, or collection, the output can be extended later.

Project Layout

Here “storage” means the AWS bucket plus the local cache directory where pmc fetch writes files—not a database or ORM.

  • src/pmc_toolkit/cli.py - Typer CLI commands
  • src/pmc_toolkit/storage_api.py - import this for programmatic use: list versions, metadata, list all keys, fetch to cache
  • src/pmc_toolkit/storage_utils.py - boto3/unsigned S3 client, list-objects, downloads; implementation details for storage_api
  • src/pmc_toolkit/xml_parse_api.py - import this for programmatic parsing of cached XML files
  • src/pmc_toolkit/xml_parse_utils.py - small lxml-based helpers for cached full-text XML extraction
  • src/pmc_toolkit/cache.py - per-article directories under the cache root, JSON metadata, cached S3 key listings, and safe local paths for downloaded objects
  • src/pmc_toolkit/validators.py - identifier validation
  • src/pmc_toolkit/models.py - response models
  • tests/ - automated tests

Local cache

Each resolved article version has a directory <cache_root>/<PMCid.N>/ containing:

  • <PMCid.N>.json — cached metadata (from S3 metadata/<PMCid.N>.json), written after a successful read.
  • .pmc-object-keys.json — JSON array of S3 object keys under that article’s prefix, written after list_objects_v2 (or read on cache hit). If this file is missing or not a list of strings, listing or fetch may refetch from S3 or raise ValueError for an invalid manifest.
  • .pmc-extracted-article.json — full extracted JSON produced from the cached XML by pmc convert-xml; reused by later conversions for the same article version.

Cache root selection: pmc metadata and pmc files (and the matching storage_api functions) always use the default OS user cache from platformdirs. Only pmc fetch (via --cache-dir or the PMC_TOOLKIT_CACHE environment variable) and fetch_files(..., cache_dir=...) can use a different cache root.

Download paths: For each S3 key, the toolkit maps PMCid.N/relative/path to <cache_root>/PMCid.N/relative/path. Keys that do not start with the PMCid.N/ prefix, use an absolute path segment, or resolve outside that directory (for example .. path segments) are rejected with ValueError so downloads never leave the article folder.
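
The path rules above can be sketched with pathlib. This is an illustration under assumptions, not the code in cache.py: `safe_local_path` is a hypothetical name, and resolving both paths before comparing them is one of several ways to detect escapes.

```python
from pathlib import Path, PurePosixPath


def safe_local_path(cache_root: Path, article_id: str, key: str) -> Path:
    """Map an S3 key to a path inside <cache_root>/<article_id>/ or raise."""
    prefix = f"{article_id}/"
    if not key.startswith(prefix):
        raise ValueError(f"key is outside the article prefix: {key!r}")
    relative = PurePosixPath(key[len(prefix):])
    if relative.is_absolute() or not relative.parts:
        raise ValueError(f"key has no usable relative path: {key!r}")
    article_dir = (cache_root / article_id).resolve()
    target = article_dir.joinpath(*relative.parts).resolve()
    # Resolving collapses ".." segments, so an escape shows up here.
    if not target.is_relative_to(article_dir):
        raise ValueError(f"key resolves outside the article folder: {key!r}")
    return target
```

For example, a key such as `PMC11370360.1/../other/file.xml` passes the prefix check but is rejected once the `..` segment is resolved.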
