Convert PubMed articles (PMIDs or PMCIDs) to clean, structured markdown with full text, abstracts, and supplementary materials

PubMed Downloader

Convert PubMed articles to clean, structured markdown. Handles the full pipeline: PMID resolution, full-text extraction via PubMed Central, HTML-to-markdown conversion, and supplementary material retrieval.

Articles without open-access full text automatically fall back to abstract-only download.

Installation

pip install git+https://github.com/shloknatarajan/PubMedDownloader.git

Setup

Set your email for NCBI API identification (optional but recommended):

export NCBI_EMAIL=your-email@institution.edu

Or create a .env file in your working directory:

NCBI_EMAIL=your-email@institution.edu
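
To check which email value your own scripts would see, the lookup can be sketched with the standard library alone. The helper below is hypothetical: `read_ncbi_email` is not part of pubmed_downloader, and how the package itself reads the .env file is not described here. It checks the environment first and then a plain KEY=VALUE .env file.

```python
import os
from pathlib import Path


def read_ncbi_email(env_file=".env", environ=None):
    """Return NCBI_EMAIL from the environment, else from a KEY=VALUE .env file.

    Hypothetical helper for illustration; not part of pubmed_downloader.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("NCBI_EMAIL")
    if value:
        return value

    path = Path(env_file)
    if not path.is_file():
        return None

    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "NCBI_EMAIL":
            # Drop optional surrounding quotes; treat an empty value as unset.
            return val.strip().strip("'\"") or None
    return None
```

A return value of `None` means no email was found in either place. The setup is optional, so that is not an error.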

Usage

Python API

Single article (returns markdown string, no files created):

from pubmed_downloader import PubMedDownloader

downloader = PubMedDownloader()

# From PMID (resolves to PMCID automatically, falls back to abstract if not open access)
markdown = downloader.single_pmid_to_markdown("12895196")

# From PMCID directly
markdown = downloader.single_pmcid_to_markdown("PMC1884285")
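
When identifiers arrive in mixed form, a small dispatcher can pick between the two methods above. This is a sketch under stated assumptions: `article_to_markdown` is a hypothetical wrapper, not part of the package, and it only assumes that the object passed in exposes `single_pmid_to_markdown` and `single_pmcid_to_markdown` as shown.

```python
def article_to_markdown(downloader, identifier):
    """Dispatch to the PMID or PMCID method based on the identifier's shape.

    `downloader` is any object exposing single_pmid_to_markdown and
    single_pmcid_to_markdown, such as a PubMedDownloader instance.
    Hypothetical convenience wrapper; not part of pubmed_downloader.
    """
    ident = str(identifier).strip()
    # PMCIDs look like "PMC" followed by digits.
    if ident.upper().startswith("PMC") and ident[3:].isdigit():
        return downloader.single_pmcid_to_markdown(ident.upper())
    # PMIDs are plain digit strings.
    if ident.isdigit():
        return downloader.single_pmid_to_markdown(ident)
    raise ValueError(f"Not a PMID or PMCID: {identifier!r}")
```

Rejecting malformed identifiers before any call is made keeps obviously bad input from reaching the network.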

Batch processing (saves HTML and markdown files to disk):

from pubmed_downloader import PubMedDownloader

downloader = PubMedDownloader()
pmids = ["12895196", "17872605", "25051018"]
downloader.pmids_to_markdown(pmids, save_dir="data")

This creates:

data/
├── html/          # Raw HTML from PMC
├── markdown/      # Converted markdown files
├── cache/         # PMID-to-PMCID mapping cache
└── pmcids.txt     # Resolved PMCIDs
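
After a batch run it can be useful to see which articles actually produced a markdown file. The sketch below walks the `markdown/` directory from the layout above. The `.md` suffix and the helper name `list_converted` are assumptions, since the file naming scheme is not stated here.

```python
from pathlib import Path


def list_converted(save_dir="data", suffix=".md"):
    """Return sorted file stems found in <save_dir>/markdown.

    The ".md" suffix is an assumption; adjust it if files are named differently.
    """
    markdown_dir = Path(save_dir) / "markdown"
    if not markdown_dir.is_dir():
        return []
    return sorted(p.stem for p in markdown_dir.glob(f"*{suffix}") if p.is_file())
```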

Add supplementary materials to existing markdown files:

downloader.add_supplements_to_existing(save_dir="data")

Individual utility functions:

from pubmed_downloader import (
    get_pmcid_from_pmid,
    get_html_from_pmcid,
    get_abstract_markdown_from_pmid,
    fetch_bioc_supplement,
)

# Resolve PMIDs to PMCIDs
mapping = get_pmcid_from_pmid(["12895196", "17872605"])

# Fetch raw HTML from PMC
html = get_html_from_pmcid("PMC1884285")

# Get abstract for non-open-access articles
abstract_md = get_abstract_markdown_from_pmid("12345678")

# Get supplementary material text
supplement = fetch_bioc_supplement("PMC6435416")

Command Line

# Convert PMIDs from a file (one PMID per line)
pubmed-download --file_path=pmids.txt --save_dir=data

# Add supplementary materials to existing markdown
pubmed-download --add_supplements --save_dir=data

# Clear all caches
pubmed-download --clear_caches
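
The `--file_path` input is a plain text file with one PMID per line. A minimal sketch for producing such a file from a Python list follows. `write_pmid_file` is a hypothetical helper, and the de-duplication and digit check are choices made for this example rather than requirements of the CLI.

```python
from pathlib import Path


def write_pmid_file(pmids, path="pmids.txt"):
    """Write unique, digit-only PMIDs to `path`, one per line, preserving order."""
    seen = set()
    cleaned = []
    for pmid in pmids:
        text = str(pmid).strip()
        if not text.isdigit():
            raise ValueError(f"Not a PMID: {pmid!r}")
        if text not in seen:
            seen.add(text)
            cleaned.append(text)
    # One PMID per line, with a trailing newline after the last entry.
    Path(path).write_text("".join(f"{p}\n" for p in cleaned), encoding="utf-8")
    return cleaned
```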

API Reference

| Method | Creates Files | Returns | Use Case |
| --- | --- | --- | --- |
| single_pmid_to_markdown() | No | Markdown string | Single article, programmatic use |
| single_pmcid_to_markdown() | No | Markdown string | Direct PMCID conversion |
| pmids_to_markdown() | Yes | None | Batch processing, building datasets |
| local_html_to_markdown() | Yes | None | Re-convert existing HTML files |
| add_supplements_to_existing() | Yes | None | Append supplements to existing markdown |

PharmGKB Integration

Extract PMIDs from PharmGKB variant annotations for pharmacogenomics research:

from pubmed_downloader.pharmgkb_annotations import get_pmid_list
from pubmed_downloader import PubMedDownloader

# Download PharmGKB annotations and extract PMIDs
pmids = get_pmid_list(save_dir="data")

# Convert to markdown
downloader = PubMedDownloader()
downloader.pmids_to_markdown([str(p) for p in pmids], save_dir="data")

How It Works

  1. PMID to PMCID -- Uses NCBI's ID Converter API with batching, caching (30-day expiry), and rate limiting
  2. HTML extraction -- Fetches full article HTML from PubMed Central
  3. Markdown conversion -- Converts HTML to structured markdown preserving tables, figures, citations, and references
  4. Supplementary materials -- Fetches pre-processed supplement text via NCBI's BioC API
  5. Abstract fallback -- Articles not in PMC Open Access get abstract + metadata via NCBI E-Fetch
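
Step 1 combines batching with a time-limited cache. The sketch below illustrates that pattern in isolation and is not the package's implementation. The cache file layout, the batch size of 200, and the names `batched`, `load_fresh_records`, and `resolve` are all assumptions, the actual API call is replaced by a caller-supplied `lookup` function, and rate limiting is left out.

```python
import json
import time
from pathlib import Path

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day expiry, as in step 1


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_fresh_records(cache_path, now):
    """Return cached records that are younger than the TTL."""
    path = Path(cache_path)
    if not path.is_file():
        return {}
    records = json.loads(path.read_text(encoding="utf-8"))
    return {
        pmid: rec
        for pmid, rec in records.items()
        if now - rec["saved_at"] < CACHE_TTL_SECONDS
    }


def resolve(pmids, lookup, cache_path, batch_size=200, now=None):
    """Resolve PMIDs from the cache first, then via `lookup` in batches.

    `lookup` is a caller-supplied function mapping a list of PMIDs to a
    {pmid: pmcid_or_None} dict, for example a wrapper around an API call.
    """
    now = time.time() if now is None else now
    records = load_fresh_records(cache_path, now)
    missing = [p for p in pmids if p not in records]
    for batch in batched(missing, batch_size):
        for pmid, pmcid in lookup(batch).items():
            records[pmid] = {"pmcid": pmcid, "saved_at": now}
    # Persist fresh and new records; expired ones are dropped.
    path = Path(cache_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records), encoding="utf-8")
    return {p: records[p]["pmcid"] if p in records else None for p in pmids}
```

Each record keeps its own timestamp, so a cache hit does not extend that entry's lifetime. Only a new lookup resets the 30-day clock.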

Configuration

| Environment Variable | Default | Description |
| --- | --- | --- |
| NCBI_EMAIL | None | Email for NCBI API identification |
| PMID_CACHE_DIR | data/cache | Cache directory path |
| PMID_CACHE_FILE | pmid_to_pmcid.json | Cache filename |
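
To see where the cache would land for a given environment, the two cache variables can be combined as in the sketch below. Joining the directory and filename this way is an assumption about how the package uses them, and `cache_file_path` is a hypothetical helper. The defaults are the ones listed above.

```python
import os
from pathlib import Path


def cache_file_path(environ=None):
    """Combine PMID_CACHE_DIR and PMID_CACHE_FILE, using the documented defaults."""
    environ = os.environ if environ is None else environ
    # Empty or unset variables fall back to the documented defaults.
    cache_dir = environ.get("PMID_CACHE_DIR") or "data/cache"
    cache_file = environ.get("PMID_CACHE_FILE") or "pmid_to_pmcid.json"
    return Path(cache_dir) / cache_file
```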

License

MIT
