PMCGrab

PMC ID in. Clean paper JSON out. Add images when you need them.

PMCGrab turns PubMed Central articles and JATS XML into structured JSON for biomedical ingestion pipelines. The default output is the paper itself: title, abstract, body, figures, and tables. Parser diagnostics, provenance, relationship maps, and legacy compatibility fields are available only when you ask for full JSON.

Use it when you need clean article context for RAG, search, text mining, knowledge graphs, literature review systems, or downstream jobs that need the paper text and figure files together.

Install

uv add pmcgrab

or:

pip install pmcgrab

Requires Python 3.10 or newer.

Quick Start

Clean JSON

from pmcgrab import process_single_pmc

article = process_single_pmc("7181753")

print(article["schema"])                  # pmcgrab.paper.v1
print(article["identifiers"]["pmcid"])    # PMC7181753
print(article["paper"]["title"])

for section in article["paper"]["body"]:
    print(section["title"])

CLI:

pmcgrab --pmcids 7181753 3539614 --output-dir ./pmc_output

Output:

./pmc_output/
  PMC7181753.json
  PMC3539614.json
  summary.json
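
A downstream job might load the per-article files as sketched below. This assumes the layout shown above, with one `PMC*.json` file per article next to `summary.json`. The helper name `load_articles` is hypothetical and not part of PMCGrab.

```python
import json
from pathlib import Path


def load_articles(output_dir):
    """Return {pmcid: article_dict} for every PMC*.json file in output_dir.

    Hypothetical helper, not part of PMCGrab. summary.json is skipped
    because it does not match the PMC*.json pattern.
    """
    articles = {}
    for path in sorted(Path(output_dir).glob("PMC*.json")):
        with path.open(encoding="utf-8") as handle:
            articles[path.stem] = json.load(handle)
    return articles
```

Calling `load_articles("./pmc_output")` after the CLI run above would then give a dictionary keyed by `PMC7181753` and `PMC3539614`.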

Clean JSON With Images

from pathlib import Path

from pmcgrab import AssetFetchPolicy, process_single_pmc_with_assets

article, fetch = process_single_pmc_with_assets(
    "7181753",
    out_dir="./pmc_output",
    policy=AssetFetchPolicy(fetch_images=True),
)

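# With assets enabled, each article gets its own folder under out_dir.
article_dir = Path("./pmc_output/PMC7181753")

for image in article["assets"]["images"]:
    for file in image["files"]:
        # Skip files that have no local copy.
        if file.get("local_path"):
            # local_path is relative to the folder that holds article.json.
            image_bytes = (article_dir / file["local_path"]).read_bytes()
            caption = image["caption"]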

CLI:

pmcgrab --pmcids 7181753 --with-images --output-dir ./pmc_output

Output:

./pmc_output/
  PMC7181753/
    article.json
    images/
      42003_2020_922_Fig1_HTML.jpg
      42003_2020_922_Fig2_HTML.jpg
  summary.json

article.json holds the same clean paper JSON as the default output. Each image path is stored at assets.images[].files[].local_path and is relative to the directory that contains article.json.

Output Contract

Default output schema:

{
  "schema": "pmcgrab.paper.v1",
  "has_data": true,
  "identifiers": {
    "pmcid": "PMC7181753",
    "pmid": "32327715",
    "doi": "10.1038/s42003-020-0922-4"
  },
  "paper": {
    "title": "Single-cell transcriptomes of the human skin reveal ...",
    "abstract": [
      {
        "title": "Abstract",
        "content": [{ "type": "paragraph", "text": "..." }],
        "sections": []
      }
    ],
    "body": [
      {
        "title": "Introduction",
        "content": [{ "type": "paragraph", "text": "..." }],
        "sections": []
      }
    ]
  },
  "assets": {
    "images": [
      {
        "id": "f1",
        "label": "Figure 1",
        "caption": "...",
        "files": [
          {
            "href": "42003_2020_922_Fig1_HTML.jpg",
            "local_path": "images/42003_2020_922_Fig1_HTML.jpg",
            "status": "downloaded",
            "mime_type": "image/jpeg"
          }
        ]
      }
    ],
    "tables": []
  }
}
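
A consumer may want to check a record against this contract before ingesting it. The guard below is a minimal sketch: the function name `is_usable_paper` is hypothetical, and treating a false `has_data` as "nothing to ingest" is an assumption, since the field's exact meaning is not spelled out here.

```python
def is_usable_paper(article):
    """Return True when article looks like non-empty clean paper JSON.

    Hypothetical guard, not part of PMCGrab. It checks the schema tag,
    the has_data flag (assumed to mean "content was parsed"), and the
    presence of a title and a body list.
    """
    if article.get("schema") != "pmcgrab.paper.v1":
        return False
    if not article.get("has_data"):
        return False
    paper = article.get("paper") or {}
    return bool(paper.get("title")) and isinstance(paper.get("body"), list)


sample = {
    "schema": "pmcgrab.paper.v1",
    "has_data": True,
    "identifiers": {"pmcid": "PMC7181753"},
    "paper": {"title": "Example title", "abstract": [], "body": []},
    "assets": {"images": [], "tables": []},
}

print(is_usable_paper(sample))
```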

The clean contract intentionally excludes:

- parser provenance
- source XML paths
- relationship graphs
- author and affiliation metadata
- bibliography records
- diagnostics and quality counters
- old schema compatibility fields

Use full JSON when you need those fields:

article = process_single_pmc("7181753", output_style="full")

pmcgrab --pmcids 7181753 --full-json --output-dir ./pmc_output

Older compatibility shapes remain available only through full JSON:

article = process_single_pmc(
    "7181753",
    output_style="full",
    schema_version=2,
)

pmcgrab --pmcids 7181753 --full-json --schema-version 2

Content Blocks

Sections contain ordered content blocks plus nested sections.

Common block types:

Type	Fields
`paragraph`	`text`
`list`	`list_type`, `items`
`definition_list`	`title`, `items`
`formula`	`label`, `text`, `tex`, `mathml`
`figure_ref`	`target_id`, `label`
`table_ref`	`target_id`, `label`
`quote`, `statement`, `boxed_text`	`content`, `text`
`code`, `preformat`	`language`, `text`
`unknown_block`	`jats_tag`, `text`, `children`

Unknown meaningful JATS blocks are preserved as readable fallback records instead of being dropped.
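
Because sections nest, consumers usually need a recursive walk. The sketch below assumes each section carries `title`, `content`, and `sections` keys as in the default output schema, and that a section's own blocks come before its nested sections (an assumption about reading order). The helper names `iter_blocks` and `plain_text` are hypothetical and not part of PMCGrab.

```python
def iter_blocks(sections, path=()):
    """Yield (section_path, block) pairs, depth first.

    section_path is a tuple of section titles from outermost to innermost.
    Hypothetical helper, not part of PMCGrab.
    """
    for section in sections:
        here = path + (section.get("title") or "",)
        for block in section.get("content", []):
            yield here, block
        yield from iter_blocks(section.get("sections", []), here)


def plain_text(sections):
    """Join the text of every block that carries a non-empty "text" field."""
    return "\n\n".join(
        block["text"] for _, block in iter_blocks(sections) if block.get("text")
    )


body = [
    {
        "title": "Methods",
        "content": [{"type": "paragraph", "text": "Overview."}],
        "sections": [
            {
                "title": "Sequencing",
                "content": [
                    {"type": "paragraph", "text": "Libraries were prepared."},
                    {"type": "figure_ref", "target_id": "f1", "label": "Figure 1"},
                ],
                "sections": [],
            }
        ],
    }
]

for section_path, block in iter_blocks(body):
    print(" > ".join(section_path), block["type"])
```

Blocks without a `text` field, such as `figure_ref`, are skipped by `plain_text` but still visible to `iter_blocks`, so a caller can resolve them against `assets.images` if needed.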

Image Fetching

Image fetching is off by default. --with-images and process_single_pmc_with_assets() do extra network work:

1. Fetch the paper XML from NCBI.
2. Parse clean paper JSON.
3. Try the PMC Open Access tar.gz package.
4. Fall back to per-file /bin/ image URLs when needed.
5. Write PMC{id}/article.json plus downloaded files.

Image file status values:

Status	Meaning
`not_attempted`	Image fetching was not enabled.
`not_available`	The figure has no usable file reference.
`missing`	PMCGrab tried but could not download the file.
`downloaded`	The file was written to `local_path`.
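
The status values make it straightforward to separate usable files from gaps. The following is a minimal sketch over a dictionary shaped like the default output; `downloaded_images` and `status_counts` are hypothetical helper names, not PMCGrab functions.

```python
from collections import Counter
from pathlib import Path


def downloaded_images(article, article_dir):
    """Return (caption, path) pairs for files whose status is "downloaded".

    local_path is joined onto article_dir, the folder that holds
    article.json. Hypothetical helper, not part of PMCGrab.
    """
    results = []
    for image in article.get("assets", {}).get("images", []):
        for file in image.get("files", []):
            if file.get("status") == "downloaded" and file.get("local_path"):
                path = Path(article_dir) / file["local_path"]
                results.append((image.get("caption", ""), path))
    return results


def status_counts(article):
    """Count image files by their status value."""
    return Counter(
        file.get("status", "unknown")
        for image in article.get("assets", {}).get("images", [])
        for file in image.get("files", [])
    )


article = {
    "assets": {
        "images": [
            {
                "id": "f1",
                "caption": "First figure",
                "files": [
                    {"local_path": "images/fig1.jpg", "status": "downloaded"}
                ],
            },
            {"id": "f2", "caption": "Second figure", "files": [{"status": "missing"}]},
        ]
    }
}

print(status_counts(article))
print(downloaded_images(article, "pmc_output/PMC7181753"))
```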

Supplementary files are opt-in:

pmcgrab --pmcids 7181753 \
  --with-images \
  --include-supplementary \
  --output-dir ./pmc_output

Other asset flags:

Flag	Effect
`--include-raw-xml`	Save the source JATS XML as `raw.xml`.
`--include-all-assets`	Extract every file in the OA bundle.
`--max-asset-bytes N`	Set the per-article asset download ceiling to `N` bytes.

Safety defaults:

- tar paths are checked before extraction
- symlinks and device files are rejected
- partial downloads are removed on size-limit abort
- existing files are reused on reruns
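
To illustrate what the first two defaults guard against, here is a generic sketch built on the standard library `tarfile` module. It is not PMCGrab's implementation; `safe_members` and `build_demo_archive` are hypothetical names, and the demo archive is constructed in memory purely for the example.

```python
import io
import tarfile
from pathlib import Path


def safe_members(archive, dest):
    """Yield only regular files and directories that stay inside dest."""
    root = Path(dest).resolve()
    for member in archive.getmembers():
        # Anything that is not a plain file or directory is skipped:
        # symlinks, hard links, character/block devices, and FIFOs.
        if not (member.isfile() or member.isdir()):
            continue
        # Absolute names and ../ traversal resolve to a path outside root.
        target = (root / member.name).resolve()
        if target != root and root not in target.parents:
            continue
        yield member


def build_demo_archive():
    """Build a small in-memory tar.gz with one good and two bad members."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("images/fig1.jpg", "../evil.txt"):
            info = tarfile.TarInfo(name)
            info.size = 4
            archive.addfile(info, io.BytesIO(b"demo"))
        link = tarfile.TarInfo("images/link")
        link.type = tarfile.SYMTYPE
        link.linkname = "/etc/passwd"
        archive.addfile(link)
    buffer.seek(0)
    return buffer


with tarfile.open(fileobj=build_demo_archive(), mode="r:gz") as demo:
    kept = [member.name for member in safe_members(demo, "pmc_output/PMC7181753")]

print(kept)
```

Only the regular file under `images/` survives the filter; the traversal path and the symlink are dropped. Recent Python versions also ship extraction filters in `tarfile` (for example `filter="data"`) that cover similar ground.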

Local XML

Parse local JATS XML without network calls:

from pmcgrab import process_local_xml_dir, process_single_local_xml

article = process_single_local_xml("./pmc_bulk/PMC7181753.xml")
batch = process_local_xml_dir("./pmc_bulk", workers=16)

CLI:

pmcgrab --from-file PMC7181753.xml --output-dir ./pmc_output
pmcgrab --from-dir ./pmc_bulk --workers 16 --output-dir ./pmc_output

Local XML mode does not download images. Use PMCID network mode with --with-images when you need figure binaries.

CLI Reference

Common input modes:

pmcgrab --pmcids 7181753 3539614 --output-dir ./out
pmcgrab --pmids 32327715 --output-dir ./out
pmcgrab --dois 10.1038/s42003-020-0922-4 --output-dir ./out
pmcgrab --from-id-file ids.txt --output-dir ./out
pmcgrab --from-dir ./xml --workers 16 --output-dir ./out
pmcgrab --from-file one.xml two.xml --output-dir ./out

Output modes:

# Default: clean paper JSON, one file per article.
pmcgrab --pmcids 7181753 --output-dir ./out

# Clean paper JSON plus image files.
pmcgrab --pmcids 7181753 --with-images --output-dir ./out

# JSONL instead of per-article files.
pmcgrab --pmcids 7181753 3539614 --format jsonl --output-dir ./out

# Metadata-rich full JSON.
pmcgrab --pmcids 7181753 --full-json --output-dir ./out

Run pmcgrab --help for every flag.
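
For the JSONL mode, a reader can stream records one line at a time. This sketch assumes the usual JSONL convention of one JSON object per line; the name of the file PMCGrab writes is not stated here, so the caller passes the path. `read_jsonl` is a hypothetical helper, not part of PMCGrab.

```python
import json


def read_jsonl(path):
    """Yield one article dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Streaming this way keeps memory use flat for large batches, since only one article is decoded at a time.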

Python API

from pmcgrab import (
    AssetFetchPolicy,
    Paper,
    process_local_xml_dir,
    process_single_local_xml,
    process_single_pmc,
    process_single_pmc_with_assets,
)

article = process_single_pmc("7181753")
article, fetch = process_single_pmc_with_assets(
    "7181753",
    out_dir="./out",
    policy=AssetFetchPolicy(fetch_images=True),
)

paper = Paper.from_pmc("7181753")
paper.title
paper.abstract_as_str()
paper.get_toc()
paper.to_dict()

NCBI and PMC helper clients are also exported:

from pmcgrab import (
    bioc_fetch,
    citation_export,
    id_convert,
    list_oa_links,
    normalize_id,
    normalize_pmid,
    oa_fetch,
    oai_get_record,
    oai_list_identifiers,
    oai_list_records,
    oai_list_sets,
    tgz_url_for,
)

Configuration

Variable	Purpose	Default
`PMCGRAB_EMAILS`	Contact emails for NCBI requests.	Maintainer contact
`NCBI_API_KEY`	Optional NCBI API key.	unset
`PMCGRAB_TIMEOUT`	Network timeout in seconds.	`60`
`PMCGRAB_RETRIES`	Retry count for Entrez calls.	`3`
`PMCGRAB_SSL_VERIFY`	Verify TLS certificates.	`true`
`PMCGRAB_MAX_ASSET_BYTES`	Per-article asset download ceiling, in bytes.	`268435456` (256 MiB)

For sustained or high-volume network use, set your own contact email and an NCBI API key:

export PMCGRAB_EMAILS="you@university.edu"
export NCBI_API_KEY="your_ncbi_api_key_here"
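
A pipeline can run a small preflight check before a large batch. The sketch below only reads the variables from the table above and reports which are unset; `preflight` is a hypothetical helper, and how PMCGrab itself parses these values is not shown here.

```python
import os

# Defaults copied from the configuration table above.
DEFAULTS = {
    "PMCGRAB_TIMEOUT": "60",
    "PMCGRAB_RETRIES": "3",
    "PMCGRAB_SSL_VERIFY": "true",
    "PMCGRAB_MAX_ASSET_BYTES": "268435456",
}


def preflight(environ=None):
    """Return (settings, warnings) for the given or current environment.

    Hypothetical helper, not part of PMCGrab. Values stay as strings
    because PMCGrab's own parsing rules are not documented here.
    """
    env = os.environ if environ is None else environ
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    warnings = []
    if not env.get("PMCGRAB_EMAILS"):
        warnings.append("PMCGRAB_EMAILS is unset; the maintainer contact will be used.")
    if not env.get("NCBI_API_KEY"):
        warnings.append("NCBI_API_KEY is unset.")
    return settings, warnings


settings, warnings = preflight()
for message in warnings:
    print("warning:", message)
```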

Verification

Release checks:

uv run ruff check .
uv run ruff format --check .
uv run mypy src/pmcgrab
uv run pytest -q --no-cov
uv build
bash scripts/smoke-wheel-install.sh

Optional live NCBI smoke test:

PMCGRAB_RUN_LIVE_E2E=1 uv run pytest tests/test_e2e.py -q --no-cov

The live test is opt-in because public services can fail independently of this package.

Release

Releases are published from main through GitHub Actions:

1. Merge the release commit to main.
2. Run Release from main.
3. The workflow creates vX.Y.Z.
4. The tag workflow builds, tests, publishes to PyPI, and creates the GitHub Release.

Do not publish production packages from a laptop unless the GitHub pipeline is unavailable and the failure mode is understood.

License

Apache 2.0. See LICENSE.

Release history

Version	Released
3.0.1	May 18, 2026
3.0.0	May 18, 2026 (this version)
2.0.0	May 18, 2026
1.0.10	May 18, 2026
1.0.9	May 17, 2026
1.0.8	May 17, 2026
1.0.7	Mar 2, 2026
1.0.6	Feb 27, 2026
1.0.5	Feb 25, 2026
1.0.4	Feb 25, 2026
1.0.3	Feb 25, 2026
1.0.2	Feb 25, 2026
1.0.1	Feb 7, 2026
1.0.0	Feb 7, 2026
0.6.0	Feb 7, 2026
0.5.8	Aug 5, 2025
0.5.7	Aug 4, 2025
0.5.6	Aug 4, 2025
0.5.5	Aug 1, 2025
0.5.4	Aug 1, 2025
0.5.3	Aug 1, 2025
0.5.2	Aug 1, 2025
0.5.1	Aug 1, 2025
0.5.0	Aug 1, 2025
0.4.9	Aug 1, 2025
0.4.8	Aug 1, 2025
0.4.7	Aug 1, 2025
0.4.6	Aug 1, 2025
0.4.5	Aug 1, 2025
0.4.0	Jul 31, 2025
0.3.6	Jul 31, 2025
0.3.3	Jul 31, 2025
0.3.2	Jul 31, 2025
0.3.1	Jul 31, 2025
0.2.1	Jul 31, 2025
0.2.0	Jul 31, 2025

Download files

File	Type	Size	Uploaded
`pmcgrab-3.0.0.tar.gz`	Source distribution	192.7 kB	May 18, 2026
`pmcgrab-3.0.0-py3-none-any.whl`	Built distribution (Python 3 wheel)	175.7 kB	May 18, 2026

Both files were uploaded from CI with uv 0.11.15 on Ubuntu 24.04, without Trusted Publishing.

Hashes for pmcgrab-3.0.0.tar.gz:

Algorithm	Hash digest
SHA256	`30a1cf6cb67d7ea0a3eec5e29f1d976d9c04fc11c875271fad08beed2b66492f`
MD5	`ae0731cbae29b88787505189b834d722`
BLAKE2b-256	`0ed1b1c40b1b9173ccab8e01e0f6fd5c0c1484da4c9a838a6fb41b4105fb4dcc`

Hashes for pmcgrab-3.0.0-py3-none-any.whl:

Algorithm	Hash digest
SHA256	`dc9da5ff376d1c1e481008e4af888fa9e9b25c6b893c3bde53b4154763675f74`
MD5	`273e01fdf2ea77e7bb436d7f13b0dba0`
BLAKE2b-256	`6d583b4e5ee34724fa10dfcfda50630e8a80545bf26ebdd7a8a0acb52e9ee031`
