AI-ready retrieval and parsing of PubMed Central articles for RAG applications. Install with uv for the fastest setup.

PMCGrab -- From PubMed Central ID to AI-Ready JSON in Seconds

Every AI workflow that touches biomedical literature hits the same wall:

  1. Download PMC XML hoping it's "structured."
  2. Fight nested tags, footnotes, figure refs, and half-broken links.
  3. Hope your regex didn't blow away the Methods section you actually need.

That wall steals hours from RAG pipelines, knowledge-graph builds, LLM fine-tuning -- any downstream AI task.

PMCGrab knocks it down. Feed it a list of PMC IDs -- or point it at a directory of bulk-downloaded XML -- and get back clean, section-aware JSON you can drop straight into a vector DB or LLM prompt. No network required for local files. No timeouts. No XML wrestling.


The Hidden Cost of "I'll Just Parse It Myself"

| Task | Manual / ad-hoc | PMCGrab |
| --- | --- | --- |
| Install dependencies | Package hunting | One package install |
| Convert one article to JSON | 15-30 min | One API call or CLI command |
| Capture article sections | Hope & regex | Parsed section tree with headings preserved |
| Parallel processing | Bash loops & temp files | --workers N flag |
| Edge-case maintenance | Yours forever | Automated tests and regression cases |

At $50/hour, hand-parsing 100 papers at 15-30 minutes each (25-50 hours) burns $1,000+. PMCGrab does the same job for $0 -- within minutes -- so you can focus on using the information instead of extracting it.
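
To make that estimate concrete, here is the arithmetic as a small sketch. It assumes the 15-30 minutes per article quoted in the comparison table and the $50/hour rate; the function and variable names are illustrative only.

```python
# Back-of-envelope cost of hand-parsing, using the figures quoted above.
PAPERS = 100
HOURLY_RATE_USD = 50
MINUTES_PER_PAPER = (15, 30)  # low and high estimates from the table


def manual_cost_usd(papers: int, minutes_each: int, rate: float) -> float:
    """Return the labor cost of parsing `papers` articles by hand."""
    return papers * minutes_each / 60 * rate


low, high = (manual_cost_usd(PAPERS, m, HOURLY_RATE_USD) for m in MINUTES_PER_PAPER)
print(f"${low:,.0f} to ${high:,.0f}")  # prints "$1,250 to $2,500"
```

Even the low end of the range clears the $1,000 mark.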


Quick Install

Recommended (via uv):

uv add pmcgrab

Or with pip:

pip install pmcgrab

Python >= 3.10 required. Tested on 3.10, 3.11, 3.12, and 3.13.

Optional extras:

# Quoting the requirement keeps shells such as zsh from expanding the brackets
pip install "pmcgrab[dev]"       # Linting, type-checking, pre-commit
pip install "pmcgrab[test]"      # pytest + coverage
pip install "pmcgrab[docs]"      # MkDocs + Material theme
pip install "pmcgrab[notebook]"  # Jupyter support

30-Second Quick Start

from pmcgrab import Paper

paper = Paper.from_pmc("7181753")

print(paper.title)
# => "Single-cell transcriptomes of the human skin reveal age-related loss of ..."

print(paper.abstract_as_str()[:200])
# => "Fibroblasts are an essential cell population for human skin architecture ..."

# Every section, clean and ready
for section, text in paper.body_as_dict().items():
    print(f"{section}: {len(text.split())} words")

# Serialize to a JSON string (write it to disk or pass it downstream)
json_text = paper.to_json()

That's it. One import, one line to fetch, structured data everywhere.


Ways to Use PMCGrab

1. Python API -- the Paper class (recommended)

The Paper class is the primary interface. It wraps every piece of parsed data with convenient accessor methods.

From the network:

from pmcgrab import Paper

paper = Paper.from_pmc("7181753", suppress_warnings=True)

From a local XML file (no network needed):

paper = Paper.from_local_xml("path/to/PMC7181753.xml")

Output methods -- choose the shape that fits your pipeline:

# Abstract
paper.abstract_as_str()          # Plain-text string
paper.abstract_as_dict()         # {"Background": "...", "Results": "..."}

# Body
paper.body_as_dict()             # Flat: {"Introduction": "...", "Methods": "..."}
paper.body_as_nested_dict()      # Hierarchical: preserves subsections
paper.body_as_paragraphs()       # List of dicts -- ideal for RAG chunking
                                 #   [{"section": "Methods", "text": "...", "paragraph_index": 0}, ...]

# Full text
paper.full_text()                # Abstract + body as one continuous string

# Table of contents
paper.get_toc()                  # ["Introduction", "Methods", "Results", ...]

# Serialization
paper.to_dict()                  # Full JSON-serializable dictionary
paper.to_json()                  # JSON string (pretty-printed)
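
One way to use the paragraph list for RAG chunking is to merge neighbouring paragraphs of the same section up to a word budget. The sketch below does not call PMCGrab at all: it works on a hand-written sample and assumes the record keys (section, text, paragraph_index) match the shape shown in the comment above. chunk_paragraphs is a hypothetical helper, not part of the library.

```python
from typing import Iterable, List, Optional


def chunk_paragraphs(paragraphs: Iterable[dict], max_words: int = 200) -> List[dict]:
    """Merge consecutive paragraphs of the same section into chunks of at most
    `max_words` words. A single oversized paragraph becomes its own chunk."""
    chunks: List[dict] = []
    current: Optional[dict] = None
    for para in paragraphs:
        words = len(para["text"].split())
        fits = (
            current is not None
            and current["section"] == para["section"]
            and current["words"] + words <= max_words
        )
        if fits:
            current["text"] += "\n\n" + para["text"]
            current["words"] += words
        else:
            # Start a new chunk on a section change or when the budget is spent.
            current = {"section": para["section"], "text": para["text"], "words": words}
            chunks.append(current)
    return chunks


# Hand-written sample in the assumed shape; real input would come from
# paper.body_as_paragraphs().
sample = [
    {"section": "Methods", "text": "Cells were isolated from skin biopsies.", "paragraph_index": 0},
    {"section": "Methods", "text": "Libraries were sequenced in two batches.", "paragraph_index": 1},
    {"section": "Results", "text": "Four fibroblast populations were identified.", "paragraph_index": 2},
]
chunks = chunk_paragraphs(sample, max_words=50)
```

Chunks never cross a section boundary, so each one can carry its section name as metadata in a vector store.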

Metadata you can access directly:

paper.title                      # Article title
paper.authors                    # pandas DataFrame (names, emails, affiliations)
paper.journal_title              # "Communications Biology"
paper.article_id                 # {"pmcid": "PMC7181753", "doi": "10.1038/...", ...}
paper.keywords                   # ["fibroblasts", "aging", ...]
paper.published_date             # {"epub": "2020-04-24", ...}
paper.citations                  # Structured reference list
paper.tables                     # List of pandas DataFrames
paper.figures                    # Figure metadata + captions
paper.permissions                # Copyright, license info
paper.funding                    # Funding sources
paper.equations                  # MathML + TeX equations
# ... and 20+ more attributes (see "Extracted Metadata" below)

2. Dict-Based API (for data pipelines)

If you prefer raw dictionaries over the Paper object:

from pmcgrab import process_single_pmc, process_single_local_xml

# From network
data = process_single_pmc("7181753")

# From local XML
data = process_single_local_xml("path/to/article.xml")

print(data["title"]["main"])
print(data["identifiers"]["doi"])
print(data["content"]["abstract"])   # Ordered abstract sections
print([section["title"] for section in data["content"]["sections"]])

3. Bulk / Local XML Processing

This feature was inspired by a great suggestion from @vanAmsterdam, who pointed out that working with bulk-exported PMC data could be orders of magnitude faster than fetching articles one-by-one over the network.

We built it. Local XML processing skips the network entirely -- no HTTP requests, no timeouts, no rate limits. It is the fastest way to parse PMC articles at scale.

Python API:

from pmcgrab import Paper, process_single_local_xml, process_local_xml_dir

# Single file
paper = Paper.from_local_xml("./pmc_bulk/PMC7181753.xml")

# Single file (dict output)
data = process_single_local_xml("./pmc_bulk/PMC7181753.xml")

# Entire directory -- processed concurrently (16 workers here)
results = process_local_xml_dir("./pmc_bulk/", workers=16)
for filename, data in results.items():
    if data:
        print(f"{filename}: {data['title']['main'][:60]}")

CLI:

# Process a directory of bulk-downloaded XML
pmcgrab --from-dir ./pmc_bulk_xml/ --output-dir ./results

# Process specific files
pmcgrab --from-file article1.xml article2.xml --output-dir ./results

How to get bulk XML: Download from the PMC FTP service or the PMC Open Access subset. Each .xml file is a standard JATS XML article that PMCGrab can parse directly.


4. Command Line

PMCGrab's CLI supports six input modes, all mutually exclusive:

# PMC IDs (accepts PMC7181753, pmc7181753, or just 7181753)
pmcgrab --pmcids 7181753 3539614 --output-dir ./results

# PubMed IDs (auto-converted to PMC IDs via NCBI API)
pmcgrab --pmids 33087749 34567890 --output-dir ./results

# DOIs (auto-converted to PMC IDs via NCBI API)
pmcgrab --dois 10.1038/s41586-020-2832-5 --output-dir ./results

# IDs from a text file (one per line -- PMCIDs, PMIDs, or DOIs)
pmcgrab --from-id-file ids.txt --output-dir ./results

# Local XML directory (bulk mode -- no network)
pmcgrab --from-dir ./xml_bulk/ --output-dir ./results

# Specific local XML files (no network)
pmcgrab --from-file article1.xml article2.xml --output-dir ./results
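
The three PMC ID spellings accepted above (PMC7181753, pmc7181753, 7181753) can be reduced to one canonical form before deduplicating an ID list. The following is a stand-alone sketch of that idea; to_numeric_pmcid is a hypothetical helper and is not claimed to be how PMCGrab's own normalize_id() works.

```python
import re

# Optional "PMC" prefix in any letter case, followed by digits only.
_PMCID_RE = re.compile(r"^(?:PMC)?(\d+)$", re.IGNORECASE)


def to_numeric_pmcid(raw: str) -> str:
    """Return the bare numeric part of a PMC ID, e.g. 'pmc7181753' -> '7181753'."""
    match = _PMCID_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a PMC ID: {raw!r}")
    return match.group(1)


# Deduplicate a mixed-spelling list while keeping first-seen order.
unique_ids = list(dict.fromkeys(to_numeric_pmcid(x) for x in ["PMC7181753", "pmc7181753", "3539614"]))
```

Note that a bare number is ambiguous between a PMC ID and a PubMed ID; this sketch simply treats it as a PMC ID.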

Additional flags:

| Flag | Description | Default |
| --- | --- | --- |
| --output-dir / --out | Output directory for JSON files | ./pmc_output |
| --batch-size / --workers | Number of concurrent worker threads | 10 |
| --format | json (one file per article) or jsonl (single file) | json |
| --verbose / -v | Enable debug logging | off |
| --quiet / -q | Suppress progress bars | off |
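
The difference between the two --format values can be illustrated with the standard library alone. This is a sketch of the two layouts, not PMCGrab's writer: the PMC<id>.json file naming and both function names are assumptions made for the example.

```python
import json
from pathlib import Path


def write_json_files(records: dict, out_dir: Path) -> list:
    """json layout: one pretty-printed file per article ID."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for article_id, record in records.items():
        path = out_dir / f"PMC{article_id}.json"  # naming scheme is an assumption
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")
        paths.append(path)
    return paths


def write_jsonl(records: dict, out_file: Path) -> Path:
    """jsonl layout: a single file with one compact JSON object per line."""
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with out_file.open("w", encoding="utf-8") as fh:
        for record in records.values():
            fh.write(json.dumps(record) + "\n")
    return out_file
```

Per-article files are easier to inspect and re-run individually; a single JSONL file streams well into bulk loaders because each line parses on its own.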

5. Async Support

For asyncio-based applications:

import asyncio
from pmcgrab.application.processing import async_process_pmc_ids

results = asyncio.run(async_process_pmc_ids(
    ["7181753", "3539614", "3084273"],
    max_concurrency=10,
))

for pid, data in results.items():
    print(pid, "OK" if data else "FAIL")
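
For readers wondering what a max_concurrency cap means in practice, the usual asyncio pattern is a semaphore around each request. The sketch below shows that pattern with a stand-in fetch function; it is not async_process_pmc_ids's actual implementation, and fake_fetch and gather_bounded are hypothetical names.

```python
import asyncio


async def fake_fetch(pmc_id: str) -> dict:
    """Stand-in for a network call; swap in real fetching code."""
    await asyncio.sleep(0.01)
    return {"pmc_id": pmc_id}


async def gather_bounded(ids, fetch, max_concurrency=10):
    """Run `fetch` for every ID with at most `max_concurrency` in flight.
    Failures are recorded as None so one bad ID does not sink the batch."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(pmc_id):
        async with semaphore:
            try:
                return pmc_id, await fetch(pmc_id)
            except Exception:
                return pmc_id, None

    pairs = await asyncio.gather(*(worker(i) for i in ids))
    return dict(pairs)


results = asyncio.run(
    gather_bounded(["7181753", "3539614", "3084273"], fake_fetch, max_concurrency=2)
)
```

Mapping failures to None mirrors the "OK"/"FAIL" check in the loop above, so callers can collect the failed IDs and retry them later.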

6. Batch Processing

Process thousands of articles with built-in concurrency, retries, and rate-limit compliance:

from pmcgrab import process_pmc_ids_in_batches

pmc_ids = ["7181753", "3539614", "5454911", "3084273"]
process_pmc_ids_in_batches(pmc_ids, "./output", batch_size=8)

Output Example

Every parsed article produces a comprehensive JSON structure:

{
  "schema_version": 2,
  "has_data": true,
  "identifiers": {
    "pmc_id": "7181753",
    "pmcid": "PMC7181753",
    "pmid": "32327715",
    "doi": "10.1038/s42003-020-0922-4",
    "publisher_id": "",
    "other": {}
  },
  "title": {
    "main": "Single-cell transcriptomes of the human skin reveal ...",
    "subtitle": "",
    "translated": []
  },
  "contributors": {
    "authors": [
      {
        "First_Name": "...",
        "Last_Name": "...",
        "Email": "...",
        "Affiliations": "..."
      }
    ],
    "non_author_contributors": [],
    "author_notes": {}
  },
  "publication": {
    "journal": {
      "title": "Communications Biology",
      "alternate_titles": [],
      "ids": {},
      "issn": {}
    },
    "publisher": {
      "name": "",
      "alternate_names": [],
      "location": "",
      "alternate_locations": []
    },
    "classification": {
      "article_types": ["research-article"],
      "article_categories": []
    },
    "dates": {
      "published": { "epub": "2020-04-24" },
      "history": {},
      "version_history": []
    },
    "issue": {
      "volume": "",
      "issue": "",
      "first_page": "",
      "last_page": "",
      "elocation_id": ""
    },
    "conference": {}
  },
  "content": {
    "abstract_type": "",
    "abstract": [
      {
        "id": "",
        "title": "Abstract",
        "level": 0,
        "blocks": [
          { "type": "paragraph", "id": "", "text": "Fibroblasts are ..." }
        ],
        "children": []
      }
    ],
    "translated_abstracts": [],
    "sections": [
      {
        "id": "s1",
        "title": "Introduction",
        "level": 1,
        "blocks": [
          { "type": "paragraph", "id": "p1", "text": "The skin is ..." }
        ],
        "children": []
      }
    ],
    "appendices": [],
    "glossary": [],
    "footnotes": "",
    "acknowledgements": [],
    "notes": []
  },
  "assets": {
    "citations": [
      { "title": "...", "authors": "...", "doi": "...", "pmid": "..." }
    ],
    "tables": [
      { "id": "t1", "label": "Table 1", "caption": "...", "rows": [] }
    ],
    "figures": [
      {
        "id": "f1",
        "label": "Fig. 1",
        "caption": "...",
        "link": "",
        "alternate_links": []
      }
    ],
    "equations": { "mathml": [], "tex": [] },
    "supplementary_material": []
  },
  "compliance": {
    "permissions": {},
    "copyright": "",
    "license": "",
    "ethics": {},
    "funding": []
  },
  "metadata": {
    "keywords": ["fibroblasts", "skin aging", "single-cell RNA-seq"],
    "counts": {},
    "self_uri": [],
    "related_articles": [],
    "custom_meta": {}
  },
  "provenance": {
    "pmcgrab_version": "1.0.8",
    "parse_timestamp": "2024-01-01T00:00:00+00:00",
    "source": "ncbi_entrez",
    "xml_source_path": ""
  }
}
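
Because sections nest through children, a short recursive walk is enough to turn the tree into flat records that carry their heading path. The sketch below runs on a trimmed-down record in the shape shown above. It assumes paragraph blocks are the ones of interest (other block types, if any, are skipped), and iter_paragraphs is a hypothetical helper.

```python
def iter_paragraphs(sections, path=()):
    """Yield (heading_path, text) for every paragraph block, depth-first."""
    for section in sections:
        here = path + (section.get("title", ""),)
        for block in section.get("blocks", []):
            if block.get("type") == "paragraph":
                yield " > ".join(here), block.get("text", "")
        # Recurse into subsections, extending the heading path.
        yield from iter_paragraphs(section.get("children", []), here)


# Trimmed-down record in the shape shown above (only the keys used here).
record = {
    "content": {
        "sections": [
            {
                "title": "Methods",
                "blocks": [{"type": "paragraph", "text": "Overview of the protocol."}],
                "children": [
                    {
                        "title": "Statistical Analysis",
                        "blocks": [{"type": "paragraph", "text": "Tests were two-sided."}],
                        "children": [],
                    }
                ],
            }
        ]
    }
}
rows = list(iter_paragraphs(record["content"]["sections"]))
```

Here rows holds two entries, with heading paths "Methods" and "Methods > Statistical Analysis", which is handy as chunk metadata.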

Extracted Metadata -- Everything in One Object

The Paper class extracts and normalizes 40+ fields from each PMC article:

Content: title, subtitle, translated titles, abstract sections, ordered body section tree, footnotes, acknowledgements, notes, appendices, glossary

Authors & Contributors: authors (as pandas DataFrame with names, emails, affiliations), non-author contributors, author notes

Journal & Publication: journal ID, journal title, ISSN, publisher name & location, volume, issue, first/last page, elocation ID, article types, article categories

Identifiers: PMC ID, PMCID, PMID, DOI, publisher ID, and other article identifiers

Dates: publication dates (epub, ppub, collection), manuscript history dates (received, accepted, revised)

Scholarly Content: citations (structured with authors, title, DOI, PMID), tables (parsed to pandas DataFrames), figures (label, caption, graphic links, alt text), equations (MathML + TeX), supplementary materials

Legal & Funding: permissions, copyright statement, license type, funding sources, ethics disclosures

Additional: keywords, custom metadata, counts, self URIs, related articles, conference info, translated titles & abstracts, version history


NCBI Service Clients

PMCGrab bundles lightweight clients for five NCBI APIs, all importable from the top level:

from pmcgrab import bioc_fetch, id_convert, citation_export, oa_fetch
from pmcgrab import oai_get_record, oai_list_identifiers, oai_list_records, oai_list_sets
from pmcgrab import normalize_id, normalize_ids, normalize_pmid, normalize_pmids

| Client | What it does |
| --- | --- |
| bioc_fetch() | Fetch BioC JSON for Open Access articles |
| id_convert() | Convert between PMC IDs, PMIDs, and DOIs |
| normalize_id() | Normalize any ID format to a numeric PMC ID |
| citation_export() | Export citations in MEDLINE, BibTeX, RIS, NBIB, or PubMed format |
| oa_fetch() | Check Open Access status and get download links |
| oai_get_record() | Retrieve a single OAI-PMH metadata record |
| oai_list_records() | Harvest metadata at scale with automatic resumption-token pagination |
| oai_list_identifiers() | List OAI-PMH identifiers for a date range or set |
| oai_list_sets() | List available OAI-PMH sets |
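
Resumption-token pagination, as used by OAI-PMH, boils down to a loop that keeps requesting pages until the server stops handing back a token. The sketch below shows that loop against an in-memory stand-in; harvest and the fetch_page callable are hypothetical and say nothing about how oai_list_records() is implemented.

```python
from typing import Callable, Iterator, Optional, Tuple

Page = Tuple[list, Optional[str]]  # (records on this page, token for the next page)


def harvest(fetch_page: Callable[[Optional[str]], Page]) -> Iterator:
    """Yield records page by page until no resumption token is returned."""
    token = None
    while True:
        records, token = fetch_page(token)
        yield from records
        if not token:  # an empty or absent token marks the last page
            break


# In-memory stand-in for a paginated endpoint.
_PAGES = {
    None: (["rec1", "rec2"], "t1"),
    "t1": (["rec3"], "t2"),
    "t2": (["rec4"], ""),
}
harvested = list(harvest(lambda token: _PAGES[token]))
```

Writing the loop as a generator lets callers stop early without fetching pages they do not need.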

Context Engineering: Why This Matters for LLMs

Large-language-model performance lives or dies on context quality -- the snippets you retrieve and feed back into the model:

  • RAG pipelines need precise, de-duplicated passages to ground answers.
  • Knowledge-graph population demands reliable section boundaries (e.g., Methods vs. Results) to classify triples accurately.
  • Fine-tuning & few-shot prompting work best with noise-free, domain-specific examples.

PMCGrab is a context-engineering tool: it converts messy XML into clean, section-aware, UTF-8 JSON that slots directly into embeddings, vector stores, or prompt templates. No preprocessing gymnastics, no guessing where the Methods section starts, and far less half-garbled text to trigger hallucinations.

Better input --> better retrieval --> better answers.


Why PMCGrab Beats Home-Grown Scripts

  1. Section-Aware Parsing: Detects IMRaD plus custom subsections like Statistical Analysis -- crucial for accurate retrieval scoring.

  2. Resilient XML Cleaning: Removes cross-refs and figure stubs without dropping scientific content, preserving token-level fidelity for embeddings.

  3. True Concurrency: --workers fans requests out across worker threads; automatic email rotation and a token-bucket rate limiter respect NCBI limits so large harvests aren't throttled.

  4. Modern Python Stack: Typed public interfaces, linted (ruff), CI-checked on Ubuntu, macOS, and Windows across Python 3.10-3.13.

  5. Bulk XML Support: Point at a directory of pre-downloaded JATS XML files and parse them locally -- orders of magnitude faster, no network required. Ideal for the PMC FTP bulk export.
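
To give a feel for why the XML cleaning in point 2 is fiddly, here is a minimal sketch of removing inline cross-references with the standard library while keeping the sentence text that follows them. It assumes JATS-style xref elements and is not PMCGrab's cleaning code; drop_elements is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def drop_elements(root, tags=("xref",)):
    """Remove matching elements in place while keeping the text that follows them."""
    for parent in list(root.iter()):  # snapshot: the tree is mutated below
        previous = None
        for child in list(parent):
            if child.tag in tags:
                # ElementTree stores trailing text on the removed node, so
                # hand it to the previous sibling (or the parent) first.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


para = ET.fromstring(
    '<p>Fibroblasts were sorted<xref ref-type="bibr" rid="CR1">1</xref> and '
    'sequenced <italic>in bulk</italic><xref ref-type="fig" rid="Fig1">Fig. 1</xref>.</p>'
)
clean = "".join(drop_elements(para).itertext())
```

A naive parent.remove(child) without the tail handling would silently delete " and sequenced " and the final period, which is exactly the kind of content loss a regex-based script tends to cause.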


Configuration

PMCGrab follows the 12-factor app methodology. All settings are configurable via environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| PMCGRAB_EMAILS | Comma-separated contact emails for NCBI Entrez requests | Maintainer contact |
| NCBI_API_KEY | NCBI API key -- enables 10 req/s instead of 3 req/s | None |
| PMCGRAB_TIMEOUT | Timeout in seconds for network operations | 60 |
| PMCGRAB_RETRIES | Number of retry attempts for Entrez API calls | 3 |
| PMCGRAB_SSL_VERIFY | Verify TLS certificates for HTTP requests | true |

Rate limiting is enforced automatically across all threads via a token-bucket limiter:

  • Without an API key: 3 requests/second
  • With an API key: 10 requests/second
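
For reference, a token bucket can be sketched in a few lines: tokens refill at a fixed rate, each request spends one, and callers wait when the bucket is empty. The class below is a generic illustration of the technique, not PMCGrab's limiter; the class name, its parameters, and the injectable clock are assumptions made for the example.

```python
import threading
import time


class TokenBucket:
    """Thread-safe token bucket: `rate` tokens/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity
        self._clock = clock  # injectable so tests need not sleep
        self._last = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        with self._lock:
            now = self._clock()
            # Refill in proportion to elapsed time, capped at capacity.
            self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False

    def acquire(self, poll_interval: float = 0.01) -> None:
        """Block until a token is available."""
        while not self.try_acquire():
            time.sleep(poll_interval)


# 3 requests/second matches the no-API-key limit listed above.
limiter = TokenBucket(rate=3, capacity=3)
limiter.acquire()  # call before each outgoing request
```

Sharing one bucket behind a lock is what lets a limit hold across all worker threads rather than per thread.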

For production or high-volume use, set your own email and API key:

export PMCGRAB_EMAILS="you@university.edu,colleague@lab.org"
export NCBI_API_KEY="your_ncbi_api_key_here"
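
If your own pipeline needs to read the same variables, for example to log the effective configuration, a small loader might look like the sketch below. It uses the variable names and defaults from the table above; the Settings class, the accepted spellings for a true boolean, and the empty fallback for emails (the library itself falls back to a maintainer contact) are assumptions, not PMCGrab internals.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class Settings:
    emails: Tuple[str, ...]
    api_key: Optional[str]
    timeout: float
    retries: int
    ssl_verify: bool

    @property
    def requests_per_second(self) -> int:
        # NCBI allows 10 req/s with an API key and 3 req/s without one.
        return 10 if self.api_key else 3


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw_emails = env.get("PMCGRAB_EMAILS", "")
    return Settings(
        emails=tuple(e.strip() for e in raw_emails.split(",") if e.strip()),
        api_key=env.get("NCBI_API_KEY") or None,
        timeout=float(env.get("PMCGRAB_TIMEOUT", "60")),
        retries=int(env.get("PMCGRAB_RETRIES", "3")),
        # Accepted truthy spellings are an assumption of this sketch.
        ssl_verify=env.get("PMCGRAB_SSL_VERIFY", "true").strip().lower()
        in {"1", "true", "yes", "on"},
    )


settings = load_settings({"PMCGRAB_EMAILS": "you@university.edu, colleague@lab.org"})
```

Passing a plain dict instead of os.environ keeps the loader easy to test.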

Proof at a Glance

| Signal | Value |
| --- | --- |
| Test suite | Unit, CLI, local XML, and regression tests |
| JSON contract | Strict JSON output, no NaN literals |
| Local XML mode | Parses pre-downloaded JATS without network |
| CI platforms | Ubuntu, macOS, Windows |
| Python versions tested | 3.10, 3.11, 3.12, 3.13 |
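
The "no NaN literals" row matters because Python's json.dumps emits NaN by default, which is not valid JSON and trips strict parsers; missing cells in pandas DataFrames are a common source. The sketch below shows one way to enforce such a contract in your own code. strip_nan and to_strict_json are hypothetical helpers, not PMCGrab functions.

```python
import json
import math


def strip_nan(value):
    """Recursively replace NaN and +/-Infinity floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {k: strip_nan(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [strip_nan(v) for v in value]
    return value


def to_strict_json(record) -> str:
    """Serialize to standards-compliant JSON; allow_nan=False makes json.dumps
    raise ValueError if a non-finite float slips through."""
    return json.dumps(strip_nan(record), allow_nan=False)


text = to_strict_json({"rows": [[1.0, float("nan")]], "label": "Table 1"})
```

Here the NaN cell comes out as null, so the result parses in any JSON implementation.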

Acknowledgments

Special thanks to @vanAmsterdam for suggesting that PMCGrab support bulk-exported PMC data from disk. That idea led directly to the --from-dir, --from-file, and Paper.from_local_xml() features -- making local XML processing orders of magnitude faster than the network path. Community feedback like this makes PMCGrab better for everyone.


Contributing

We welcome contributions. See DEVELOPMENT.md for setup instructions, testing, and CI details.


License

Apache 2.0 -- see LICENSE for details.


Install Now & Ship Real Results

uv add pmcgrab

Stop paying the XML tax. Start engineering context -- and building AI products that matter.
