Convert PubMed articles (PMIDs or PMCIDs) to clean, structured markdown with full text, abstracts, and supplementary materials
PubMed Downloader
Convert PubMed articles to clean, structured markdown. Handles the full pipeline: PMID resolution, full-text extraction via PubMed Central, HTML-to-markdown conversion, and supplementary material retrieval.
Articles without open-access full text automatically fall back to abstract-only download.
Installation
pip install pubmed-markdown
Requires Python 3.11+.
Setup
Set your email for NCBI API identification (required to avoid 403 errors):
export NCBI_EMAIL=your-email@institution.edu
Or pass it directly:
downloader = PubMedMarkdown(email="your-email@institution.edu")
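The two ways of supplying the email can be checked up front, before any request is made. The helper below is a minimal sketch and is not part of pubmed-markdown: `resolve_ncbi_email` is a hypothetical name, and the rule that an explicit argument takes precedence over `NCBI_EMAIL` is an assumption made for illustration.

```python
import os


def resolve_ncbi_email(explicit=None, environ=None):
    """Return the email to send to NCBI, or raise if none is configured.

    Hypothetical helper: an explicit argument is assumed to win over the
    NCBI_EMAIL environment variable.
    """
    env = os.environ if environ is None else environ
    email = explicit or env.get("NCBI_EMAIL")
    if not email or "@" not in email:
        raise ValueError(
            "No usable email found: set NCBI_EMAIL or pass email=... explicitly"
        )
    return email
```

Failing early with a clear message is usually easier to debug than a 403 response in the middle of a batch. The returned value can then be passed as `PubMedMarkdown(email=...)`.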
Quick Start
from pubmed_markdown import PubMedMarkdown
downloader = PubMedMarkdown()
# Get markdown string from a PMID
markdown = downloader.pmid_to_markdown("12895196")
Usage
Python API
Get markdown strings (single or batch, no files created):
from pubmed_markdown import PubMedMarkdown
downloader = PubMedMarkdown()
# From PMID — accepts a single string or a list
markdown = downloader.pmid_to_markdown("12895196")
markdowns = downloader.pmid_to_markdown(["12895196", "17872605"])
# From PMCID directly — also accepts a single string or a list
markdown = downloader.pmcid_to_markdown("PMC1884285")
markdowns = downloader.pmcid_to_markdown(["PMC1884285", "PMC6435416"])
# Skip supplementary materials
markdown = downloader.pmid_to_markdown("12895196", include_supplements=False)
Save markdown files to disk (single or batch):
from pubmed_markdown import PubMedMarkdown
downloader = PubMedMarkdown()
downloader.pmids_to_markdown_files(["12895196", "17872605"], save_dir="data")
# Also works with a single PMID
downloader.pmids_to_markdown_files("25051018", save_dir="data")
# Overwrite existing files
downloader.pmids_to_markdown_files(["12895196"], save_dir="data", overwrite=True)
This creates:
data/
├── html/ # Raw HTML from PMC
└── markdown/ # Converted markdown files
Full-text articles are saved as {PMCID}.md. Articles without open-access full text are saved as PMID{PMID}.md with abstract only.
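The naming rule above can be written down as a small path helper, which is handy for checking whether an article has already been converted. This is a sketch only: `expected_markdown_path` is a hypothetical name, and it assumes the markdown files sit directly inside the `markdown/` subfolder of `save_dir`, as the directory listing suggests.

```python
from pathlib import Path


def expected_markdown_path(save_dir, pmid, pmcid=None):
    """Return the path where an article's markdown is expected to be saved.

    Full-text articles use {PMCID}.md; abstract-only articles use
    PMID{PMID}.md. The markdown/ subfolder is taken from the layout above.
    """
    filename = f"{pmcid}.md" if pmcid else f"PMID{pmid}.md"
    return Path(save_dir) / "markdown" / filename


# Abstract-only article (no PMCID available)
abstract_path = expected_markdown_path("data", "25051018")

# Full-text article; the PMID/PMCID pairing here is illustrative only
full_text_path = expected_markdown_path("data", "12895196", pmcid="PMC1884285")
```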
Individual utility functions:
from pubmed_markdown import (
    get_pmcid_from_pmid,
    get_html_from_pmcid,
    get_abstract_markdown_from_pmid,
    fetch_bioc_supplement,
    format_supplement_as_markdown,
)
# Resolve PMIDs to PMCIDs (returns dict mapping PMID -> PMCID or None)
mapping = get_pmcid_from_pmid(["12895196", "17872605"])
# Fetch raw HTML from PMC
html = get_html_from_pmcid("PMC1884285")
# Get abstract for non-open-access articles
abstract_md = get_abstract_markdown_from_pmid("12345678")
# Get raw supplementary material text
supplement = fetch_bioc_supplement("PMC6435416")
# Get supplementary materials formatted as a markdown section
supplement_md = format_supplement_as_markdown("PMC6435416")
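The utility functions can be combined to reproduce the full-text-or-abstract decision by hand. The sketch below takes the three functions as arguments so it can be exercised without network access. `fetch_article` is a hypothetical name, and it assumes the mapping returned by the resolver is keyed by the same PMID strings that were passed in.

```python
def fetch_article(pmid, resolve, fetch_html, fetch_abstract):
    """Return ("full_text", html) when a PMCID exists, else ("abstract", md).

    resolve:        callable taking a list of PMIDs, returning {PMID: PMCID or None}
    fetch_html:     callable taking a PMCID, returning raw HTML
    fetch_abstract: callable taking a PMID, returning abstract markdown
    """
    pmcid = resolve([pmid]).get(pmid)
    if pmcid:
        return "full_text", fetch_html(pmcid)
    # No PMCID was resolved, so only the abstract can be retrieved
    return "abstract", fetch_abstract(pmid)
```

With the library installed, the three arguments would be `get_pmcid_from_pmid`, `get_html_from_pmcid`, and `get_abstract_markdown_from_pmid`. Whether an article with a PMCID is always available as open-access HTML is not something this sketch checks.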
Command Line
# Convert PMIDs from a file (one PMID per line)
pubmed-download --file_path=pmids.txt --save_dir=data
# Overwrite existing files
pubmed-download --file_path=pmids.txt --save_dir=data --overwrite
# Specify email directly
pubmed-download --file_path=pmids.txt --email=your-email@institution.edu
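The input file expected by `--file_path` is plain text with one PMID per line. A short script can produce it from a Python list. This is a sketch under the assumption that PMIDs are purely numeric strings; `format_pmid_lines` and `write_pmid_file` are hypothetical helpers, not part of the package.

```python
from pathlib import Path


def format_pmid_lines(pmids):
    """Return file content with one PMID per line, duplicates removed in order."""
    seen = set()
    cleaned = []
    for raw in pmids:
        pmid = str(raw).strip()
        if not pmid.isdigit():
            raise ValueError(f"Not a numeric PMID: {raw!r}")
        if pmid not in seen:
            seen.add(pmid)
            cleaned.append(pmid)
    return "".join(f"{pmid}\n" for pmid in cleaned)


def write_pmid_file(pmids, path="pmids.txt"):
    """Write the PMIDs to path and return the path as a Path object."""
    target = Path(path)
    target.write_text(format_pmid_lines(pmids), encoding="utf-8")
    return target
```

Calling `write_pmid_file(["12895196", "17872605"])` produces a `pmids.txt` that can be passed to `pubmed-download --file_path=pmids.txt --save_dir=data`.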
API Reference
| Method | Creates Files | Returns | Use Case |
|---|---|---|---|
| pmid_to_markdown() | No | Markdown string(s) | Single or batch, programmatic use |
| pmcid_to_markdown() | No | Markdown string(s) | Direct PMCID conversion |
| pmids_to_markdown_files() | Yes | None | Batch processing, building datasets |
| pmids_to_pmcids() | No | List of PMCIDs | PMID to PMCID resolution |
| pmcids_to_html() | Yes | None | Fetch and save raw HTML |
| local_html_to_markdown() | Yes | None | Re-convert existing HTML files |
All methods accepting IDs take either a single string or a list of strings.
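Code that wraps these methods may want to accept the same "string or list" shape. One common way to do that is to normalize the input to a list first. The function below is an illustrative sketch, not the library's internal code, and `as_id_list` is a hypothetical name.

```python
def as_id_list(ids):
    """Normalize a single ID or an iterable of IDs to a list of strings."""
    if isinstance(ids, str):
        # A bare string is one ID, not an iterable of characters
        return [ids]
    return [str(item) for item in ids]
```

Checking for `str` first matters because a string is itself iterable, so `list("12895196")` would otherwise split the PMID into single digits.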
How It Works
- PMID to PMCID -- Uses NCBI's ID Converter API with batching and rate limiting
- HTML extraction -- Fetches full article HTML from PubMed Central
- Markdown conversion -- Converts HTML to structured markdown preserving tables, figures, citations, and references
- Supplementary materials -- Fetches pre-processed supplement text via NCBI's BioC API
- Abstract fallback -- Articles not in PMC Open Access get abstract + metadata via NCBI E-Fetch
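The batching and rate limiting mentioned in the first step can be sketched independently of any HTTP client. Everything below is an assumption made for illustration rather than the package's implementation: `resolve_in_batches` and `chunked` are hypothetical names, `fetch_batch` stands in for the real API call, the batch size of 200 is an assumed per-request limit, and the 0.34 second interval reflects NCBI's general guidance of about three requests per second without an API key.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def resolve_in_batches(pmids, fetch_batch, batch_size=200, min_interval=0.34,
                       sleep=time.sleep, clock=time.monotonic):
    """Resolve PMIDs batch by batch, pausing between requests.

    fetch_batch takes a list of PMIDs and returns {PMID: PMCID} for the
    ones it could resolve. PMIDs missing from its result map to None.
    """
    mapping = {}
    last_call = None
    for batch in chunked(pmids, batch_size):
        if last_call is not None:
            # Wait out whatever remains of the minimum interval
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        result = fetch_batch(batch)
        for pmid in batch:
            mapping[pmid] = result.get(pmid)
    return mapping
```

Passing `sleep` and `clock` as parameters keeps the pacing logic testable without real delays.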
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| NCBI_EMAIL | None | Email for NCBI API identification |
License
MIT