Convert PubMed articles (PMIDs or PMCIDs) to clean, structured markdown with full text, abstracts, and supplementary materials

PubMed Downloader

Convert PubMed articles to clean, structured markdown. Handles the full pipeline: PMID resolution, full-text extraction via PubMed Central, HTML-to-markdown conversion, and supplementary material retrieval.

Articles without open-access full text automatically fall back to abstract-only download.

Installation

pip install pubmed-markdown

Setup

Set your email for NCBI API identification (optional but recommended):

export NCBI_EMAIL=your-email@institution.edu

Or create a .env file in your working directory:

NCBI_EMAIL=your-email@institution.edu
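
To confirm the variable is visible to Python before starting a download, a check like the following may help. It is a minimal sketch using only the standard library: the helper name `ncbi_email_or_none` is hypothetical and not part of the package, and it only inspects the process environment (it does not read a .env file).

```python
import os
from typing import Optional


def ncbi_email_or_none() -> Optional[str]:
    """Return NCBI_EMAIL from the environment, or None if unset or blank.

    Hypothetical helper for illustration; not part of pubmed_downloader.
    """
    value = os.environ.get("NCBI_EMAIL", "").strip()
    return value or None


if __name__ == "__main__":
    email = ncbi_email_or_none()
    if email is None:
        print("NCBI_EMAIL is not set (optional but recommended).")
    else:
        print(f"NCBI_EMAIL is set to {email}")
```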

Usage

Python API

Single article (returns markdown string, no files created):

from pubmed_downloader import PubMedDownloader

downloader = PubMedDownloader()

# From PMID (resolves to PMCID automatically, falls back to abstract if not open access)
markdown = downloader.single_pmid_to_markdown("12895196")

# From PMCID directly
markdown = downloader.single_pmcid_to_markdown("PMC1884285")
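
Because the single-article methods return a string and create no files, saving the result is left to the caller. One possible way to do that is sketched below; `save_markdown` and the example path are hypothetical names chosen for illustration, not part of the package.

```python
from pathlib import Path


def save_markdown(markdown: str, path: str) -> Path:
    """Write a markdown string to `path`, creating parent directories.

    Hypothetical helper for illustration; not part of pubmed_downloader.
    """
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(markdown, encoding="utf-8")
    return target


# Usage (assumes `markdown` holds a string returned by one of the calls above):
# save_markdown(markdown, "articles/12895196.md")
```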

Batch processing (saves HTML and markdown files to disk):

from pubmed_downloader import PubMedDownloader

downloader = PubMedDownloader()
pmids = ["12895196", "17872605", "25051018"]
downloader.pmids_to_markdown(pmids, save_dir="data")

This creates:

data/
├── html/          # Raw HTML from PMC
├── markdown/      # Converted markdown files
├── cache/         # PMID-to-PMCID mapping cache
└── pmcids.txt     # Resolved PMCIDs
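
After a batch run, the converted files can be collected from the layout above with a few lines of standard-library code. This is a sketch under one assumption that the listing does not confirm: that the converted files use the `.md` extension. The function name `list_markdown_files` is hypothetical.

```python
from pathlib import Path
from typing import List


def list_markdown_files(save_dir: str) -> List[Path]:
    """Return sorted paths of *.md files under <save_dir>/markdown.

    Assumes converted files end in .md (not confirmed by the README).
    Returns an empty list if the markdown folder does not exist.
    """
    markdown_dir = Path(save_dir) / "markdown"
    if not markdown_dir.is_dir():
        return []
    return sorted(markdown_dir.glob("*.md"))


# Usage: iterate over the results of pmids_to_markdown(..., save_dir="data")
# for path in list_markdown_files("data"):
#     print(path.name, len(path.read_text(encoding="utf-8")))
```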

Add supplementary materials to existing markdown files:

downloader.add_supplements_to_existing(save_dir="data")

Individual utility functions:

from pubmed_downloader import (
    get_pmcid_from_pmid,
    get_html_from_pmcid,
    get_abstract_markdown_from_pmid,
    fetch_bioc_supplement,
)

# Resolve PMIDs to PMCIDs
mapping = get_pmcid_from_pmid(["12895196", "17872605"])

# Fetch raw HTML from PMC
html = get_html_from_pmcid("PMC1884285")

# Get abstract for non-open-access articles
abstract_md = get_abstract_markdown_from_pmid("12345678")

# Get supplementary material text
supplement = fetch_bioc_supplement("PMC6435416")

Command Line

# Convert PMIDs from a file (one PMID per line)
pubmed-download --file_path=pmids.txt --save_dir=data

# Add supplementary materials to existing markdown
pubmed-download --add_supplements --save_dir=data

# Clear all caches
pubmed-download --clear_caches
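
The --file_path option expects one PMID per line. If the PMIDs start out in a Python list, a file in that shape could be produced as follows. This is a sketch, not part of the package: `write_pmid_file` is a hypothetical helper, and it assumes PMIDs are digit-only values, so anything else is skipped.

```python
from pathlib import Path
from typing import Iterable


def write_pmid_file(pmids: Iterable[object], path: str) -> int:
    """Write unique PMIDs to `path`, one per line, in first-seen order.

    Values are converted to strings and stripped; entries that are not
    purely digits are skipped. Returns the number of PMIDs written.
    """
    unique = {}
    for pmid in pmids:
        text = str(pmid).strip()
        if text.isdigit():
            unique.setdefault(text, None)  # dict keeps insertion order
    Path(path).write_text("".join(f"{p}\n" for p in unique), encoding="utf-8")
    return len(unique)


# Usage:
# write_pmid_file(["12895196", "17872605", "25051018"], "pmids.txt")
```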

API Reference

| Method | Creates Files | Returns | Use Case |
|---|---|---|---|
| single_pmid_to_markdown() | No | Markdown string | Single article, programmatic use |
| single_pmcid_to_markdown() | No | Markdown string | Direct PMCID conversion |
| pmids_to_markdown() | Yes | None | Batch processing, building datasets |
| local_html_to_markdown() | Yes | None | Re-convert existing HTML files |
| add_supplements_to_existing() | Yes | None | Append supplements to existing markdown |

PharmGKB Integration

Extract PMIDs from PharmGKB variant annotations for pharmacogenomics research:

from pubmed_downloader.pharmgkb_annotations import get_pmid_list
from pubmed_downloader import PubMedDownloader

# Download PharmGKB annotations and extract PMIDs
pmids = get_pmid_list(save_dir="data")

# Convert to markdown
downloader = PubMedDownloader()
downloader.pmids_to_markdown([str(p) for p in pmids], save_dir="data")

How It Works

  1. PMID to PMCID -- Uses NCBI's ID Converter API with batching, caching (30-day expiry), and rate limiting
  2. HTML extraction -- Fetches full article HTML from PubMed Central
  3. Markdown conversion -- Converts HTML to structured markdown preserving tables, figures, citations, and references
  4. Supplementary materials -- Fetches pre-processed supplement text via NCBI's BioC API
  5. Abstract fallback -- Articles not in PMC Open Access get abstract + metadata via NCBI E-Fetch
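
The batching and caching idea in step 1 can be illustrated without touching the network. The sketch below is not the package's implementation: the batch size, the in-memory cache layout, the `pause_seconds` rate-limit stand-in, and the `lookup` callable (a placeholder for a real call to the ID Converter API) are all assumptions made for illustration. Only the 30-day expiry comes from the description above.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30-day expiry, as described above
BATCH_SIZE = 200  # assumed value; the package's real batch size may differ

# Assumed cache layout: PMID -> (PMCID or None, time the entry was stored)
Cache = Dict[str, Tuple[Optional[str], float]]
Lookup = Callable[[List[str]], Dict[str, str]]


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def resolve_with_cache(
    pmids: Iterable[str],
    lookup: Lookup,
    cache: Cache,
    now: Optional[float] = None,
    batch_size: int = BATCH_SIZE,
    pause_seconds: float = 0.0,
) -> Dict[str, Optional[str]]:
    """Resolve PMIDs to PMCIDs, calling `lookup` only for stale or new IDs."""
    now = time.time() if now is None else now
    result: Dict[str, Optional[str]] = {}
    missing: List[str] = []
    for pmid in dict.fromkeys(pmids):  # drop duplicates, keep order
        entry = cache.get(pmid)
        if entry is not None and now - entry[1] < CACHE_TTL_SECONDS:
            result[pmid] = entry[0]  # fresh cache hit
        else:
            missing.append(pmid)
    for index, batch in enumerate(chunked(missing, batch_size)):
        if index > 0 and pause_seconds > 0:
            time.sleep(pause_seconds)  # crude rate limiting between batches
        found = lookup(batch)
        for pmid in batch:
            pmcid = found.get(pmid)  # None means no PMCID was returned
            cache[pmid] = (pmcid, now)
            result[pmid] = pmcid
    return result
```

Caching a missing PMCID as None avoids repeating the same lookup for articles that are not in PMC, at the cost of waiting for the entry to expire before a newly deposited article is noticed.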

Configuration

| Environment Variable | Default | Description |
|---|---|---|
| NCBI_EMAIL | None | Email for NCBI API identification |
| PMID_CACHE_DIR | data/cache | Cache directory path |
| PMID_CACHE_FILE | pmid_to_pmcid.json | Cache filename |
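
As an illustration of how the two cache variables might combine, the sketch below builds a cache file path from the environment using the defaults listed above. The function name `cache_file_path` is hypothetical, and joining the directory and filename this way is an assumption about how the package uses them.

```python
import os
from pathlib import Path


def cache_file_path() -> Path:
    """Build the cache file path from PMID_CACHE_DIR and PMID_CACHE_FILE.

    Falls back to the documented defaults when a variable is unset.
    Hypothetical helper; the package may resolve the path differently.
    """
    cache_dir = os.environ.get("PMID_CACHE_DIR", "data/cache")
    cache_file = os.environ.get("PMID_CACHE_FILE", "pmid_to_pmcid.json")
    return Path(cache_dir) / cache_file


# Usage:
# print(cache_file_path())
```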

License

MIT
