Convert PubMed articles (PMIDs or PMCIDs) to clean, structured markdown with full text, abstracts, and supplementary materials
PubMed Downloader
Convert PubMed articles to clean, structured markdown. Handles the full pipeline: PMID resolution, full-text extraction via PubMed Central, HTML-to-markdown conversion, and supplementary material retrieval.
Articles without open-access full text automatically fall back to abstract-only download.
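The fallback described above can be pictured as a small decision function. This is a minimal sketch, not the package's implementation: `fetch_with_fallback` and its three callable parameters are hypothetical names introduced here so the logic can run without network access.

```python
from typing import Callable, Optional, Tuple


def fetch_with_fallback(
    pmid: str,
    resolve_pmcid: Callable[[str], Optional[str]],
    fetch_full_text: Callable[[str], Optional[str]],
    fetch_abstract: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (source, markdown); source is 'full_text' or 'abstract'.

    All three callables are hypothetical stand-ins for the real lookups.
    """
    pmcid = resolve_pmcid(pmid)
    if pmcid is not None:
        full_text = fetch_full_text(pmcid)
        if full_text:
            return "full_text", full_text
    # No PMCID, or no open-access full text: use the abstract instead.
    return "abstract", fetch_abstract(pmid)
```

Returning the source label alongside the markdown lets a caller tell full-text results apart from abstract-only ones when building a dataset.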
Installation
pip install pubmed-markdown
Setup
Set your email for NCBI API identification (optional but recommended):
export NCBI_EMAIL=your-email@institution.edu
Or create a .env file in your working directory:
NCBI_EMAIL=your-email@institution.edu
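To check which address will be picked up before running a large batch, a small standalone helper can read the same two locations. This is a sketch under assumptions: the helper name `read_ncbi_email` is hypothetical, and the precedence shown (environment variable first, then the .env file) is a guess at sensible behavior rather than a statement of how the package loads its configuration.

```python
import os
from pathlib import Path
from typing import Optional


def read_ncbi_email(env_file: str = ".env") -> Optional[str]:
    """Return NCBI_EMAIL from the environment, else from env_file, else None."""
    value = os.environ.get("NCBI_EMAIL")
    if value:
        return value
    path = Path(env_file)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "NCBI_EMAIL":
            return val.strip().strip("'\"") or None
    return None
```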
Usage
Python API
Get markdown strings (single or batch, no files created):
from pubmed_markdown import PubMedMarkdown
downloader = PubMedMarkdown()
# From PMID — accepts a single string or a list
markdown = downloader.pmid_to_markdown("12895196")
markdowns = downloader.pmid_to_markdown(["12895196", "17872605"])
# From PMCID directly — also accepts a single string or a list
markdown = downloader.pmcid_to_markdown("PMC1884285")
markdowns = downloader.pmcid_to_markdown(["PMC1884285", "PMC6435416"])
Save markdown files to disk (single or batch):
from pubmed_markdown import PubMedMarkdown
downloader = PubMedMarkdown()
downloader.pmids_to_markdown_files(["12895196", "17872605"], save_dir="data")
# Also works with a single PMID
downloader.pmids_to_markdown_files("25051018", save_dir="data")
This creates:
data/
├── html/ # Raw HTML from PMC
└── markdown/ # Converted markdown files
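After a batch run it can be handy to see what landed in each folder. The sketch below only relies on the two directory names shown above; the function name `list_outputs` is hypothetical, and no assumption is made about how the package names individual files.

```python
from pathlib import Path
from typing import Dict, List


def list_outputs(save_dir: str) -> Dict[str, List[str]]:
    """Map each output folder ('html', 'markdown') to its sorted file names."""
    root = Path(save_dir)
    result: Dict[str, List[str]] = {}
    for sub in ("html", "markdown"):
        folder = root / sub
        if folder.is_dir():
            result[sub] = sorted(p.name for p in folder.iterdir() if p.is_file())
        else:
            # Folder not created yet, e.g. before the first run.
            result[sub] = []
    return result
```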
Individual utility functions:
from pubmed_markdown import (
    get_pmcid_from_pmid,
    get_html_from_pmcid,
    get_abstract_markdown_from_pmid,
    fetch_bioc_supplement,
)
# Resolve PMIDs to PMCIDs
mapping = get_pmcid_from_pmid(["12895196", "17872605"])
# Fetch raw HTML from PMC
html = get_html_from_pmcid("PMC1884285")
# Get abstract for non-open-access articles
abstract_md = get_abstract_markdown_from_pmid("12345678")
# Get supplementary material text
supplement = fetch_bioc_supplement("PMC6435416")
Command Line
# Convert PMIDs from a file (one PMID per line)
pubmed-download --file_path=pmids.txt --save_dir=data
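The input file is plain text with one PMID per line. A small helper can produce such a file from a Python list while catching obvious mistakes. This is a sketch: `write_pmid_file` is a hypothetical name, and the digits-only check and de-duplication are conveniences added here, not behavior documented for the CLI.

```python
import re
from pathlib import Path
from typing import Iterable, List

PMID_PATTERN = re.compile(r"\d+")  # PMIDs are plain integers


def write_pmid_file(pmids: Iterable[str], file_path: str) -> List[str]:
    """Write unique, validated PMIDs to file_path, one per line."""
    seen = set()
    cleaned: List[str] = []
    for raw in pmids:
        pmid = str(raw).strip()
        if not PMID_PATTERN.fullmatch(pmid):
            raise ValueError(f"not a PMID: {raw!r}")
        if pmid not in seen:
            seen.add(pmid)
            cleaned.append(pmid)
    Path(file_path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

The resulting file can then be passed to the command above through `--file_path`.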
API Reference
| Method | Creates Files | Returns | Use Case |
|---|---|---|---|
| pmid_to_markdown() | No | Markdown string(s) | Single or batch, programmatic use |
| pmcid_to_markdown() | No | Markdown string(s) | Direct PMCID conversion |
| pmids_to_markdown_files() | Yes | None | Batch processing, building datasets |
| local_html_to_markdown() | Yes | None | Re-convert existing HTML files |
All methods accepting IDs take either a single string or a list of strings.
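One common way to support both input shapes is to branch on the input type and mirror it in the return value. The sketch below illustrates that pattern only; `convert_ids` and `convert_one` are hypothetical names, and the package's internals may differ.

```python
from typing import Callable, List, Sequence, Union

IdInput = Union[str, Sequence[str]]


def convert_ids(
    ids: IdInput, convert_one: Callable[[str], str]
) -> Union[str, List[str]]:
    """Apply convert_one to a single ID or to each ID in a list.

    A single string yields a single result; a list yields a list.
    The str check must come first, because a str is itself a sequence.
    """
    if isinstance(ids, str):
        return convert_one(ids)
    return [convert_one(item) for item in ids]
```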
How It Works
- PMID to PMCID -- Uses NCBI's ID Converter API with batching and rate limiting
- HTML extraction -- Fetches full article HTML from PubMed Central
- Markdown conversion -- Converts HTML to structured markdown preserving tables, figures, citations, and references
- Supplementary materials -- Fetches pre-processed supplement text via NCBI's BioC API
- Abstract fallback -- Articles outside the PMC Open Access subset get the abstract and metadata via NCBI EFetch
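The batching and rate limiting mentioned in the first step can be sketched independently of any HTTP client. Everything here is illustrative: `batched`, `resolve_in_batches`, and the `lookup_batch` callable are hypothetical names, and the default batch size and interval are assumptions chosen to be conservative, not the package's actual settings.

```python
import time
from typing import Callable, Dict, Iterator, List, Optional, Sequence


def batched(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def resolve_in_batches(
    pmids: Sequence[str],
    lookup_batch: Callable[[List[str]], Dict[str, Optional[str]]],
    batch_size: int = 200,       # assumed default
    min_interval: float = 0.34,  # assumed: about three requests per second
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Optional[str]]:
    """Resolve IDs batch by batch, waiting min_interval between calls."""
    mapping: Dict[str, Optional[str]] = {}
    last_call: Optional[float] = None
    for batch in batched(pmids, batch_size):
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        mapping.update(lookup_batch(batch))
    return mapping
```

Injecting `sleep` and `clock` keeps the function testable without real delays; in normal use the defaults apply.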
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| NCBI_EMAIL | None | Email for NCBI API identification |
License
MIT