AI-ready retrieval and parsing of PubMed Central articles for RAG applications. Install with uv for best performance.

PMCGrab -- From PubMed Central ID to AI-Ready JSON in Seconds

Every AI workflow that touches biomedical literature hits the same wall:

Download PMC XML hoping it's "structured."
Fight nested tags, footnotes, figure refs, and half-broken links.
Hope your regex didn't blow away the Methods section you actually need.

That wall steals hours from RAG pipelines, knowledge-graph builds, LLM fine-tuning -- any downstream AI task.

PMCGrab knocks it down. Feed it a list of PMC IDs -- or point it at a directory of bulk-downloaded XML -- and get back clean, section-aware JSON you can drop straight into a vector DB or LLM prompt. No network required for local files. No timeouts. No XML wrestling.

The Hidden Cost of "I'll Just Parse It Myself"

Task	Manual / ad-hoc	PMCGrab
Install dependencies	Package hunting	One package install
Convert one article to JSON	15-30 min	One API call or CLI command
Capture article sections	Hope & regex	Parsed section tree with headings preserved
Parallel processing	Bash loops & temp files	`--workers N` flag
Edge-case maintenance	Yours forever	Automated tests and regression cases

At $50/hour, hand-parsing 100 papers burns $1,000+. PMCGrab does the same job for $0 -- within minutes -- so you can focus on using the information instead of extracting it.
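
The cost estimate above follows from the table's 15-30 minutes per article. A quick check of the arithmetic, using only the figures quoted in this section:

```python
# Cost of hand-parsing, using the figures quoted above.
HOURLY_RATE_USD = 50
PAPERS = 100
MINUTES_PER_PAPER = (15, 30)  # low and high estimates from the table


def parsing_cost(minutes_per_paper: int) -> float:
    """Return the labor cost in USD for parsing PAPERS articles by hand."""
    hours = PAPERS * minutes_per_paper / 60
    return hours * HOURLY_RATE_USD


low, high = (parsing_cost(m) for m in MINUTES_PER_PAPER)
print(f"${low:,.0f} to ${high:,.0f}")  # $1,250 to $2,500
```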

Quick Install

Recommended (via uv):

uv add pmcgrab

Or with pip:

pip install pmcgrab

Python >= 3.10 required. Tested on 3.10, 3.11, 3.12, and 3.13.

Optional extras:

pip install pmcgrab[dev]       # Linting, type-checking, pre-commit
pip install pmcgrab[test]      # pytest + coverage
pip install pmcgrab[docs]      # MkDocs + Material theme
pip install pmcgrab[notebook]  # Jupyter support

30-Second Quick Start

from pmcgrab import Paper

paper = Paper.from_pmc("7181753")

print(paper.title)
# => "Single-cell transcriptomes of the human skin reveal age-related loss of ..."

print(paper.abstract_as_str()[:200])
# => "Fibroblasts are an essential cell population for human skin architecture ..."

# Every section, clean and ready
for section, text in paper.body_as_dict().items():
    print(f"{section}: {len(text.split())} words")

# Save to JSON
paper.to_json()

That's it. One import, one line to fetch, structured data everywhere.

Ways to Use PMCGrab

1. Python API -- the `Paper` class (recommended)

The Paper class is the primary interface. It wraps every piece of parsed data with convenient accessor methods.

From the network:

from pmcgrab import Paper

paper = Paper.from_pmc("7181753", suppress_warnings=True)

From a local XML file (no network needed):

paper = Paper.from_local_xml("path/to/PMC7181753.xml")

Output methods -- choose the shape that fits your pipeline:

# Abstract
paper.abstract_as_str()          # Plain-text string
paper.abstract_as_dict()         # {"Background": "...", "Results": "..."}

# Body
paper.body_as_dict()             # Flat: {"Introduction": "...", "Methods": "..."}
paper.body_as_nested_dict()      # Hierarchical: preserves subsections
paper.body_as_paragraphs()       # List of dicts -- ideal for RAG chunking
                                 #   [{"section": "Methods", "text": "...", "paragraph_index": 0}, ...]

# Full text
paper.full_text()                # Abstract + body as one continuous string

# Table of contents
paper.get_toc()                  # ["Introduction", "Methods", "Results", ...]

# Serialization
paper.to_dict()                  # Full JSON-serializable dictionary
paper.to_json()                  # JSON string (pretty-printed)
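
As a rough illustration of the RAG-chunking use case, the sketch below merges consecutive paragraphs of the same section into size-bounded chunks. It does not import PMCGrab: the `paragraphs` list is a hand-written stand-in that follows the shape shown in the `body_as_paragraphs()` comment above, with illustrative values, and `chunk_paragraphs` and `max_words` are hypothetical names, not part of the library.

```python
from typing import Iterator

# Stand-in for paper.body_as_paragraphs(); the shape follows the comment above,
# the values (including the indices) are illustrative only.
paragraphs = [
    {"section": "Introduction", "text": "The skin is the largest organ.", "paragraph_index": 0},
    {"section": "Introduction", "text": "Fibroblasts maintain its structure.", "paragraph_index": 1},
    {"section": "Methods", "text": "Cells were sequenced individually.", "paragraph_index": 2},
]


def chunk_paragraphs(paragraphs: list[dict], max_words: int = 200) -> Iterator[dict]:
    """Merge consecutive paragraphs of one section into chunks of at most max_words.

    A single paragraph longer than max_words is emitted as its own chunk.
    """
    section, texts, count = None, [], 0
    for para in paragraphs:
        words = len(para["text"].split())
        # Start a new chunk when the section changes or the budget is exceeded.
        if texts and (para["section"] != section or count + words > max_words):
            yield {"section": section, "text": " ".join(texts)}
            texts, count = [], 0
        section = para["section"]
        texts.append(para["text"])
        count += words
    if texts:
        yield {"section": section, "text": " ".join(texts)}


chunks = list(chunk_paragraphs(paragraphs, max_words=50))
for chunk in chunks:
    print(chunk["section"], "->", chunk["text"])
```

Chunks never cross a section boundary, so each one can carry its section name as retrieval metadata.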

Metadata you can access directly:

paper.title                      # Article title
paper.authors                    # pandas DataFrame (names, emails, affiliations)
paper.journal_title              # "Communications Biology"
paper.article_id                 # {"pmcid": "PMC7181753", "doi": "10.1038/...", ...}
paper.keywords                   # ["fibroblasts", "aging", ...]
paper.published_date             # {"epub": "2020-04-24", ...}
paper.citations                  # Structured reference list
paper.tables                     # List of pandas DataFrames
paper.figures                    # Figure metadata + captions
paper.permissions                # Copyright, license info
paper.funding                    # Funding sources
paper.equations                  # MathML + TeX equations
# ... and 20+ more attributes (see "Extracted Metadata" below)

2. Dict-Based API (for data pipelines)

If you prefer raw dictionaries over the Paper object:

from pmcgrab import process_single_pmc, process_single_local_xml

# From network
data = process_single_pmc("7181753")

# From local XML
data = process_single_local_xml("path/to/article.xml")

print(data["title"]["main"])
print(data["identifiers"]["doi"])
print(data["content"]["abstract"])   # Ordered abstract sections
print([section["title"] for section in data["content"]["sections"]])

3. Bulk / Local XML Processing

This feature was inspired by a great suggestion from @vanAmsterdam, who pointed out that working with bulk-exported PMC data could be orders of magnitude faster than fetching articles one-by-one over the network.

We built it. Local XML processing skips the network entirely -- no HTTP requests, no timeouts, no rate limits. It is the fastest way to parse PMC articles at scale.

Python API:

from pmcgrab import Paper, process_single_local_xml, process_local_xml_dir

# Single file
paper = Paper.from_local_xml("./pmc_bulk/PMC7181753.xml")

# Single file (dict output)
data = process_single_local_xml("./pmc_bulk/PMC7181753.xml")

# Entire directory -- concurrent with 16 workers by default
results = process_local_xml_dir("./pmc_bulk/", workers=16)
for filename, data in results.items():
    if data:
        print(f"{filename}: {data['title']['main'][:60]}")
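
One possible follow-up to the directory run, sketched with the standard library only: write the successful results to a JSON Lines file and collect the filenames that failed. The `results` dict here is a hand-written stand-in for the mapping returned above (a falsy value marks a failed file, as the `if data:` check implies), and `write_jsonl` and the `source_file` key are hypothetical names.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the mapping returned by process_local_xml_dir().
results = {
    "PMC7181753.xml": {"title": {"main": "Single-cell transcriptomes of the human skin reveal ..."}},
    "broken.xml": None,
}


def write_jsonl(results: dict, out_path: Path) -> list[str]:
    """Write parsed articles to out_path as JSON Lines; return failed filenames."""
    failed = []
    with out_path.open("w", encoding="utf-8") as handle:
        for filename, data in sorted(results.items()):
            if not data:
                failed.append(filename)
                continue
            # Keep the originating filename alongside the parsed fields.
            record = {"source_file": filename, **data}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return failed


out_path = Path(tempfile.mkdtemp()) / "articles.jsonl"
failed = write_jsonl(results, out_path)
print(f"wrote {out_path.name}, failed: {failed}")
```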

CLI:

# Process a directory of bulk-downloaded XML
pmcgrab --from-dir ./pmc_bulk_xml/ --output-dir ./results

# Process specific files
pmcgrab --from-file article1.xml article2.xml --output-dir ./results

How to get bulk XML: Download from the PMC FTP service or the PMC Open Access subset. Each .xml file is a standard JATS XML article that PMCGrab can parse directly.

4. Command Line

PMCGrab's CLI supports six input modes, all mutually exclusive:

# PMC IDs (accepts PMC7181753, pmc7181753, or just 7181753)
pmcgrab --pmcids 7181753 3539614 --output-dir ./results

# PubMed IDs (auto-converted to PMC IDs via NCBI API)
pmcgrab --pmids 33087749 34567890 --output-dir ./results

# DOIs (auto-converted to PMC IDs via NCBI API)
pmcgrab --dois 10.1038/s41586-020-2832-5 --output-dir ./results

# IDs from a text file (one per line -- PMCIDs, PMIDs, or DOIs)
pmcgrab --from-id-file ids.txt --output-dir ./results

# Local XML directory (bulk mode -- no network)
pmcgrab --from-dir ./xml_bulk/ --output-dir ./results

# Specific local XML files (no network)
pmcgrab --from-file article1.xml article2.xml --output-dir ./results
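
The `--pmcids` comment above says that `PMC7181753`, `pmc7181753`, and `7181753` are all accepted. A minimal sketch of that kind of normalization follows; `to_numeric_pmcid` is a hypothetical helper written for illustration, not PMCGrab's own `normalize_id()`, whose exact behavior is not described here.

```python
import re

# Optional, case-insensitive "PMC" prefix followed by digits only.
_PMCID_PATTERN = re.compile(r"^(?:PMC)?(\d+)$", re.IGNORECASE)


def to_numeric_pmcid(raw: str) -> str:
    """Return the numeric part of a PMC ID, accepting an optional PMC prefix."""
    match = _PMCID_PATTERN.match(raw.strip())
    if match is None:
        raise ValueError(f"not a PMC ID: {raw!r}")
    return match.group(1)


for raw in ("PMC7181753", "pmc7181753", "7181753"):
    print(raw, "->", to_numeric_pmcid(raw))
```

PMIDs and DOIs cannot be handled this way: as noted above, they have to be converted through the NCBI API.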

Additional flags:

Flag	Description	Default
`--output-dir` / `--out`	Output directory for JSON files	`./pmc_output`
`--batch-size` / `--workers`	Number of concurrent worker threads	`10`
`--format`	`json` (one file per article) or `jsonl` (single file)	`json`
`--verbose` / `-v`	Enable debug logging	off
`--quiet` / `-q`	Suppress progress bars	off

5. Async Support

For asyncio-based applications:

import asyncio
from pmcgrab.application.processing import async_process_pmc_ids

results = asyncio.run(async_process_pmc_ids(
    ["7181753", "3539614", "3084273"],
    max_concurrency=10,
))

for pid, data in results.items():
    print(pid, "OK" if data else "FAIL")
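
For readers unfamiliar with the pattern, a `max_concurrency` limit is commonly implemented with an `asyncio.Semaphore`. The sketch below shows that general technique under stated assumptions: `fetch_stub` and `gather_bounded` are hypothetical, the fetch only sleeps, and nothing here is taken from PMCGrab's implementation.

```python
import asyncio


async def fetch_stub(pmc_id: str) -> dict:
    """Hypothetical stand-in for a network fetch; sleeps instead of calling NCBI."""
    await asyncio.sleep(0.01)
    return {"pmc_id": pmc_id}


async def gather_bounded(pmc_ids: list[str], max_concurrency: int) -> tuple[dict, int]:
    """Run fetch_stub for every ID with at most max_concurrency tasks in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def worker(pmc_id: str) -> tuple[str, dict]:
        nonlocal in_flight, peak
        async with semaphore:
            # Track how many fetches overlap so the bound can be observed.
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return pmc_id, await fetch_stub(pmc_id)
            finally:
                in_flight -= 1

    pairs = await asyncio.gather(*(worker(pid) for pid in pmc_ids))
    return dict(pairs), peak


results, peak = asyncio.run(gather_bounded([str(n) for n in range(20)], max_concurrency=5))
print(len(results), "results, peak concurrency", peak)
```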

6. Batch Processing

Process thousands of articles with built-in concurrency, retries, and rate-limit compliance:

from pmcgrab import process_pmc_ids_in_batches

pmc_ids = ["7181753", "3539614", "5454911", "3084273"]
process_pmc_ids_in_batches(pmc_ids, "./output", batch_size=8)
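
To make "concurrency plus retries" concrete, here is a minimal sketch of the general pattern: a thread pool whose tasks retry with exponential backoff. It is an illustration under assumptions, not PMCGrab's code; `with_retries`, `flaky_fetch`, and `max_attempts` are hypothetical, and the fetch is simulated.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(func, arg, max_attempts: int = 3, base_delay: float = 0.01):
    """Call func(arg), retrying on exception with exponential backoff.

    Returns None when every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return func(arg)
        except Exception:
            if attempt < max_attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return None


calls: dict[str, int] = {}


def flaky_fetch(pmc_id: str) -> dict:
    """Hypothetical fetch that fails once per ID, and always for ID "0"."""
    calls[pmc_id] = calls.get(pmc_id, 0) + 1
    if pmc_id == "0" or calls[pmc_id] == 1:
        raise ConnectionError("simulated network failure")
    return {"pmc_id": pmc_id}


pmc_ids = ["7181753", "3539614", "0"]
with ThreadPoolExecutor(max_workers=8) as pool:
    outcomes = dict(zip(pmc_ids, pool.map(lambda pid: with_retries(flaky_fetch, pid), pmc_ids)))

for pid, data in outcomes.items():
    print(pid, "OK" if data else "FAIL")
```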

Output Example

Every parsed article produces a comprehensive JSON structure:

{
  "schema_version": 2,
  "has_data": true,
  "identifiers": {
    "pmc_id": "7181753",
    "pmcid": "PMC7181753",
    "pmid": "32327715",
    "doi": "10.1038/s42003-020-0922-4",
    "publisher_id": "",
    "other": {}
  },
  "title": {
    "main": "Single-cell transcriptomes of the human skin reveal ...",
    "subtitle": "",
    "translated": []
  },
  "contributors": {
    "authors": [
      {
        "First_Name": "...",
        "Last_Name": "...",
        "Email": "...",
        "Affiliations": "..."
      }
    ],
    "non_author_contributors": [],
    "author_notes": {}
  },
  "publication": {
    "journal": {
      "title": "Communications Biology",
      "alternate_titles": [],
      "ids": {},
      "issn": {}
    },
    "publisher": {
      "name": "",
      "alternate_names": [],
      "location": "",
      "alternate_locations": []
    },
    "classification": {
      "article_types": ["research-article"],
      "article_categories": []
    },
    "dates": {
      "published": { "epub": "2020-04-24" },
      "history": {},
      "version_history": []
    },
    "issue": {
      "volume": "",
      "issue": "",
      "first_page": "",
      "last_page": "",
      "elocation_id": ""
    },
    "conference": {}
  },
  "content": {
    "abstract_type": "",
    "abstract": [
      {
        "id": "",
        "title": "Abstract",
        "level": 0,
        "blocks": [
          { "type": "paragraph", "id": "", "text": "Fibroblasts are ..." }
        ],
        "children": []
      }
    ],
    "translated_abstracts": [],
    "sections": [
      {
        "id": "s1",
        "title": "Introduction",
        "level": 1,
        "blocks": [
          { "type": "paragraph", "id": "p1", "text": "The skin is ..." }
        ],
        "children": []
      }
    ],
    "appendices": [],
    "glossary": [],
    "footnotes": "",
    "acknowledgements": [],
    "notes": []
  },
  "assets": {
    "citations": [
      { "title": "...", "authors": "...", "doi": "...", "pmid": "..." }
    ],
    "tables": [
      { "id": "t1", "label": "Table 1", "caption": "...", "rows": [] }
    ],
    "figures": [
      {
        "id": "f1",
        "label": "Fig. 1",
        "caption": "...",
        "link": "",
        "alternate_links": []
      }
    ],
    "equations": { "mathml": [], "tex": [] },
    "supplementary_material": []
  },
  "compliance": {
    "permissions": {},
    "copyright": "",
    "license": "",
    "ethics": {},
    "funding": []
  },
  "metadata": {
    "keywords": ["fibroblasts", "skin aging", "single-cell RNA-seq"],
    "counts": {},
    "self_uri": [],
    "related_articles": [],
    "custom_meta": {}
  },
  "provenance": {
    "pmcgrab_version": "1.0.8",
    "parse_timestamp": "2024-01-01T00:00:00+00:00",
    "source": "ncbi_entrez",
    "xml_source_path": ""
  }
}
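
Because `sections` is a tree (each entry has `blocks` and `children`), consumers usually need a small recursive walk. A sketch, assuming the field names shown in the example above; the sample data is trimmed, the nested "Statistical analysis" child is invented for illustration, and `iter_paragraphs` is a hypothetical helper.

```python
from typing import Iterator

# Trimmed sample in the shape of the "content" object above.
content = {
    "sections": [
        {
            "id": "s1", "title": "Introduction", "level": 1,
            "blocks": [{"type": "paragraph", "id": "p1", "text": "The skin is ..."}],
            "children": [],
        },
        {
            "id": "s2", "title": "Methods", "level": 1,
            "blocks": [],
            "children": [
                {
                    "id": "s2a", "title": "Statistical analysis", "level": 2,
                    "blocks": [{"type": "paragraph", "id": "p2", "text": "Tests were ..."}],
                    "children": [],
                }
            ],
        },
    ]
}


def iter_paragraphs(sections: list[dict], path: tuple[str, ...] = ()) -> Iterator[tuple[str, str]]:
    """Yield (heading path, text) for each paragraph block, depth first."""
    for section in sections:
        here = path + (section["title"],)
        for block in section.get("blocks", []):
            if block.get("type") == "paragraph":
                yield " > ".join(here), block["text"]
        yield from iter_paragraphs(section.get("children", []), here)


flat = list(iter_paragraphs(content["sections"]))
for heading, text in flat:
    print(f"{heading}: {text}")
```

Keeping the full heading path with each paragraph preserves the subsection context that a flat dictionary would lose.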

Extracted Metadata -- Everything in One Object

The Paper class extracts and normalizes 40+ fields from each PMC article:

Content: title, subtitle, translated titles, abstract sections, ordered body section tree, footnotes, acknowledgements, notes, appendices, glossary

Authors & Contributors: authors (as pandas DataFrame with names, emails, affiliations), non-author contributors, author notes

Journal & Publication: journal ID, journal title, ISSN, publisher name & location, volume, issue, first/last page, elocation ID, article types, article categories

Identifiers: PMC ID, PMCID, PMID, DOI, publisher ID, and other article identifiers

Dates: publication dates (epub, ppub, collection), manuscript history dates (received, accepted, revised)

Scholarly Content: citations (structured with authors, title, DOI, PMID), tables (parsed to pandas DataFrames), figures (label, caption, graphic links, alt text), equations (MathML + TeX), supplementary materials

Legal & Funding: permissions, copyright statement, license type, funding sources, ethics disclosures

Additional: keywords, custom metadata, counts, self URIs, related articles, conference info, translated titles & abstracts, version history

NCBI Service Clients

PMCGrab bundles lightweight clients for five NCBI APIs, all importable from the top level:

from pmcgrab import bioc_fetch, id_convert, citation_export, oa_fetch
from pmcgrab import oai_get_record, oai_list_identifiers, oai_list_records, oai_list_sets
from pmcgrab import normalize_id, normalize_ids, normalize_pmid, normalize_pmids

Client	What it does
`bioc_fetch()`	Fetch BioC JSON for Open Access articles
`id_convert()`	Convert between PMC IDs, PMIDs, and DOIs
`normalize_id()`	Normalize any ID format to a numeric PMC ID
`citation_export()`	Export citations in MEDLINE, BibTeX, RIS, NBIB, or PubMed format
`oa_fetch()`	Check Open Access status and get download links
`oai_get_record()`	Retrieve a single OAI-PMH metadata record
`oai_list_records()`	Harvest metadata at scale with automatic resumption-token pagination
`oai_list_identifiers()`	List OAI-PMH identifiers for a date range or set
`oai_list_sets()`	List available OAI-PMH sets
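
Resumption-token pagination, mentioned for `oai_list_records()`, is a standard OAI-PMH mechanism: each response may carry a token for the next page, and an empty token ends the list. The sketch below shows that loop in isolation. `fetch_page` is a hypothetical stand-in backed by fake data, not a PMCGrab function, and the record names are placeholders.

```python
from typing import Callable, Iterator, Optional

# Fake server state: token -> (records, next token). None is the first request,
# and an empty next token marks the final page, as in OAI-PMH.
_PAGES = {
    None: (["record-1", "record-2"], "tok-1"),
    "tok-1": (["record-3", "record-4"], "tok-2"),
    "tok-2": (["record-5"], ""),
}


def fetch_page(token: Optional[str]) -> tuple[list[str], str]:
    """Hypothetical stand-in for one ListRecords HTTP request."""
    return _PAGES[token]


def harvest(fetch: Callable[[Optional[str]], tuple[list[str], str]]) -> Iterator[str]:
    """Yield every record, following resumption tokens until none is returned."""
    token: Optional[str] = None
    while True:
        records, next_token = fetch(token)
        yield from records
        if not next_token:
            return
        token = next_token


records = list(harvest(fetch_page))
print(len(records), "records harvested")
```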

Context Engineering: Why This Matters for LLMs

Large-language-model performance lives or dies on context quality -- the snippets you retrieve and feed back into the model:

RAG pipelines need precise, de-duplicated passages to ground answers.
Knowledge-graph population demands reliable section boundaries (e.g., Methods vs. Results) to classify triples accurately.
Fine-tuning & few-shot prompting work best with noise-free, domain-specific examples.

PMCGrab is a context-engineering tool: it converts messy XML into clean, section-aware, UTF-8 JSON that slots directly into embeddings, vector stores, or prompt templates. No preprocessing gymnastics, no guessing where the Methods section starts, and fewer hallucinations caused by half-garbled text.

Better input --> better retrieval --> better answers.
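
As one example of the de-duplication step mentioned above, passages can be compared by a hash of their normalized text before they are embedded. This is a generic sketch, not a PMCGrab feature; `passage_key` and `dedupe` are hypothetical helpers, and it only catches duplicates that differ in case or whitespace.

```python
import hashlib
import re


def passage_key(text: str) -> str:
    """Hash a passage after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(passages: list[str]) -> list[str]:
    """Keep the first occurrence of each passage, preserving order."""
    seen: set[str] = set()
    unique = []
    for passage in passages:
        key = passage_key(passage)
        if key not in seen:
            seen.add(key)
            unique.append(passage)
    return unique


passages = [
    "Fibroblasts are an essential cell population.",
    "fibroblasts are an  essential cell population.",
    "The skin is the largest organ.",
]
unique = dedupe(passages)
print(len(unique), "unique passages")
```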

Why PMCGrab Beats Home-Grown Scripts

Section-Aware Parsing: Detects IMRaD plus custom subsections like Statistical Analysis -- crucial for accurate retrieval scoring.
Resilient XML Cleaning: Removes cross-refs and figure stubs without dropping scientific content, preserving token-level fidelity for embeddings.
True Concurrency: `--workers` fans out across worker threads; automatic email rotation and a token-bucket rate limiter respect NCBI limits so large harvests don't get throttled.
Modern Python Stack: Typed public interfaces, linted (ruff), CI-checked on Ubuntu, macOS, and Windows across Python 3.10-3.13.
Bulk XML Support: Point at a directory of pre-downloaded JATS XML files and parse them locally -- orders of magnitude faster, no network required. Ideal for the PMC FTP bulk export.

Configuration

PMCGrab follows the 12-factor app methodology. All settings are configurable via environment variables:

Variable	Description	Default
`PMCGRAB_EMAILS`	Comma-separated contact emails for NCBI Entrez requests	Maintainer contact
`NCBI_API_KEY`	NCBI API key -- enables 10 req/s instead of 3 req/s	None
`PMCGRAB_TIMEOUT`	Timeout in seconds for network operations	`60`
`PMCGRAB_RETRIES`	Number of retry attempts for Entrez API calls	`3`
`PMCGRAB_SSL_VERIFY`	Verify TLS certificates for HTTP requests	`true`

Rate limiting is enforced automatically across all threads via a token-bucket limiter:

Without an API key: 3 requests/second
With an API key: 10 requests/second
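
For reference, a token-bucket limiter can be sketched in a few lines. This is a generic, thread-safe illustration of the technique named above, not PMCGrab's implementation; the `TokenBucket` class, its method names, and the injectable `clock` parameter are assumptions made for the example.

```python
import threading
import time


class TokenBucket:
    """Thread-safe token bucket: at most `rate` acquisitions per second on average."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity
        self._clock = clock
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False

    def acquire(self) -> None:
        """Block until a token is available."""
        while not self.try_acquire():
            time.sleep(1 / self.rate / 10)


# 3 requests/second without an API key, 10 with one (figures from the list above).
bucket = TokenBucket(rate=3, capacity=3)
granted = sum(bucket.try_acquire() for _ in range(5))
print(granted, "of 5 immediate requests granted")
```

A single bucket shared by all worker threads is what makes the limit global rather than per thread.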

For production or high-volume use, set your own email and API key:

export PMCGRAB_EMAILS="you@university.edu,colleague@lab.org"
export NCBI_API_KEY="your_ncbi_api_key_here"
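
If your own code needs to read the same variables, a sketch like the following mirrors the table's names and defaults. The `Settings` dataclass and `load_settings` function are hypothetical, and the accepted spellings for `PMCGRAB_SSL_VERIFY` are an assumption; PMCGrab's own parsing may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    emails: tuple[str, ...]
    api_key: Optional[str]
    timeout: float
    retries: int
    ssl_verify: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables from the table above, falling back to its defaults."""
    emails = tuple(e.strip() for e in env.get("PMCGRAB_EMAILS", "").split(",") if e.strip())
    return Settings(
        emails=emails,
        api_key=env.get("NCBI_API_KEY") or None,
        timeout=float(env.get("PMCGRAB_TIMEOUT", "60")),
        retries=int(env.get("PMCGRAB_RETRIES", "3")),
        # Assumption: common truthy spellings are accepted; PMCGrab may differ.
        ssl_verify=env.get("PMCGRAB_SSL_VERIFY", "true").strip().lower() in {"1", "true", "yes", "on"},
    )


settings = load_settings({"PMCGRAB_EMAILS": "you@university.edu, colleague@lab.org", "NCBI_API_KEY": "example-key"})
# Requests per second allowed with and without a key, per the list above.
requests_per_second = 10 if settings.api_key else 3
print(settings.emails, requests_per_second)
```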

Proof at a Glance

Signal	Value
Test suite	Unit, CLI, local XML, and regression tests
JSON contract	Strict JSON output, no `NaN` literals
Local XML mode	Parses pre-downloaded JATS without network
CI platforms	Ubuntu, macOS, Windows
Python versions tested	3.10, 3.11, 3.12, 3.13
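
The "no `NaN` literals" row matters because Python's `json.dumps` emits `NaN` by default, which is not valid JSON, and tables parsed into pandas DataFrames often contain NaN for empty cells. One way such a contract can be enforced is sketched below; `sanitize` is a hypothetical helper and this is not necessarily how PMCGrab does it.

```python
import json
import math


def sanitize(value):
    """Recursively replace NaN and infinite floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(item) for item in value]
    return value


row = {"label": "Table 1", "rows": [[1.5, float("nan")], [float("inf"), 2.0]]}

# allow_nan=False makes json.dumps raise ValueError rather than emit NaN.
strict = json.dumps(sanitize(row), allow_nan=False)
print(strict)
```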

Acknowledgments

Special thanks to @vanAmsterdam for suggesting that PMCGrab support bulk-exported PMC data from disk. That idea led directly to the --from-dir, --from-file, and Paper.from_local_xml() features -- making local XML processing orders of magnitude faster than the network path. Community feedback like this makes PMCGrab better for everyone.

Contributing

We welcome contributions. See DEVELOPMENT.md for setup instructions, testing, and CI details.

License

Apache 2.0 -- see LICENSE for details.

Install Now & Ship Real Results

uv add pmcgrab

Stop paying the XML tax. Start engineering context -- and building AI products that matter.
