# biolit

mcp-name: io.github.rachadele/biolit

LLM-assisted biomedical literature screening and structured extraction. Accepts PubMed alert emails and mixed lists of PMIDs, DOIs, and GEO accessions in any combination. Retrieves full text from PMC, Europe PMC, bioRxiv/medRxiv, Unpaywall, and Semantic Scholar. Supports multiple LLM providers and exposes all functionality as an MCP server.
## Setup

Requirements: Python 3.8+

Install from PyPI:

```
pip install biolit
```

Or install from source for development:

```
pip install -e .
```

Copy `.env.example` to `.env` and add your API key:

```
cp .env.example .env
# edit .env and set ANTHROPIC_API_KEY (or OPENAI_API_KEY)
```
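If you are curious what loading a key from `.env` amounts to, here is a minimal sketch of a `.env` parser. It is illustrative only (real projects typically use `python-dotenv`, and biolit's actual loading mechanism may differ); the key name in the comment is just the one from the example above:

```python
import os

def load_env(path=".env"):
    """Minimal .env loader sketch: KEY=value lines, '#' comments ignored.
    Existing environment variables are not overwritten."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```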
## Usage

The tool accepts a PubMed alert email (`.eml`), a BibTeX file (`.bib`), or a plain-text file of identifiers, as well as inline identifiers via `--ids`. Identifiers can be PMIDs, DOIs, or GEO accessions; mixed lists are supported in a single run.
| Input | How to pass | Example |
|---|---|---|
| PubMed alert email | positional `.eml` file | `alert.eml` |
| BibTeX file | positional `.bib` file | `refs.bib` |
| Identifier file (mixed) | positional plain-text file, one per line | `identifiers.txt` |
| Inline identifiers | `--ids` flag, comma-separated | `--ids 41795042,GSE53987,10.1101/2025.03.17.25324098` |
Use `--default` to run with schizophrenia genomics defaults (no prompts):

```
biolit docs/alert.eml --default
biolit docs/pmids.txt --default
biolit docs/geo_accessions.txt --default
biolit --ids 41795042,41792186,GSE53987 --default
biolit --ids 10.1101/2025.03.17.25324098 --default
```
Or specify criterion and fields as flags:

```
biolit identifiers.txt \
  --criterion "Is this about treatment-resistant schizophrenia?" \
  --fields "methodology, sample_size, treatment, outcomes"
```
Add `--markdown` (or `--md`) to also write a prose `.md` summary alongside the CSV. Each record gets a markdown section with `###` field subsections; records that failed or were skipped appear as stub entries:

```
biolit refs.bib --config my_config.json --markdown
biolit refs.bib --config my_config.json --markdown --markdown-max-tokens 2048
```
Or use a JSON config file to store reusable parameters (CLI flags take precedence). The config can include `ids` or `input_file` (path to an `.eml`, `.bib`, or identifier list), and `"markdown": true` to enable markdown output:

```
biolit alert.eml --config my_config.json
biolit refs.bib --config my_config.json   # DOIs extracted from .bib automatically
biolit --config my_config.json            # ids or input_file supplied by config
```
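The precedence rule (CLI flags over config values) reduces to a simple dict merge. The `merge_config` helper below is an illustrative sketch, not biolit's actual internals:

```python
import json

def merge_config(config_path, cli_args):
    """Load a JSON config and overlay any CLI args that were actually set.
    CLI flags take precedence over config values."""
    with open(config_path) as f:
        config = json.load(f)
    overrides = {k: v for k, v in cli_args.items() if v is not None}
    return {**config, **overrides}
```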
The `fields` key in a config file can be a comma-separated string or a JSON object mapping field names to extraction descriptions. When a string is used, an extra LLM call converts the field names into descriptions before extraction. When a dict is used, that call is skipped and the descriptions are passed directly to the model:
```json
{
  "fields": {
    "tf_name": "HGNC symbol of the transcription factor perturbed in this experiment",
    "organism": "scientific name of the organism used",
    "platform": "GPL accession of the microarray platform"
  }
}
```
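Conceptually, both accepted shapes normalize to one name-to-description mapping. This `normalize_fields` helper is a hypothetical sketch, not biolit's code; in the string case, biolit fills in the descriptions with an extra LLM call rather than leaving them empty:

```python
def normalize_fields(fields):
    """Accept either a comma-separated string of field names or a
    dict mapping field names to extraction descriptions."""
    if isinstance(fields, dict):
        return fields  # descriptions are passed to the model as-is
    # String form: split into names; descriptions still need to be
    # generated (biolit does this with an extra LLM call).
    return {name.strip(): None for name in fields.split(",") if name.strip()}
```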
Omit `--criterion` to skip screening (all records are extracted). Omit `--fields` to use the default fields (`methodology`, `sample_type`, `causal_claims`, `summary`):

```
# fetch + extract with defaults (no screening)
biolit alert.eml

# fetch + screen, then extract with defaults
biolit alert.eml --criterion "Is this about treatment-resistant schizophrenia?"
```
## Single-record screening

Use `biolit screen` to quickly check one paper or GEO record for relevance without running the full extraction pipeline:

```
biolit screen --pmid 41627908 --default
biolit screen --accession GSE53987 --default
biolit screen --doi 10.64898/2026.02.16.706214 --default
biolit screen --pmid 41627908 --criterion "Is this about treatment-resistant schizophrenia?"
```

Output is a single line to stdout:

```
RELEVANT [abstract] — Paper uses GWAS to investigate schizophrenia risk loci.
```
## Mixed identifier lists

PMIDs, DOIs, and GEO accessions can be freely mixed in a file or via `--ids`. Each identifier is auto-detected by format:

- `41795042` → PMID (all digits)
- `10.1101/2025.03.17.25324098` → DOI (starts with `10.`)
- `GSE53987` → GEO accession (starts with `GSE`, `GDS`, `GSM`, or `GPL`)

```
biolit --ids 41795042,GSE53987,10.1101/2025.03.17.25324098 --default
```
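The auto-detection rules are simple enough to sketch. `detect_id_type` below is an illustrative helper following the rules above, not biolit's actual function:

```python
def detect_id_type(identifier: str) -> str:
    """Classify an identifier by format."""
    identifier = identifier.strip()
    if identifier.isdigit():
        return "pmid"  # all digits -> PMID
    if identifier.startswith("10."):
        return "doi"   # DOIs start with "10."
    if identifier[:3] in ("GSE", "GDS", "GSM", "GPL"):
        return "geo"   # GEO accession prefixes
    raise ValueError(f"unrecognized identifier: {identifier!r}")
```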
GEO records additionally include a `linked_pmids` column. All record types share `pmid`, `doi`, and `geo_accession` columns (null when not applicable).
## Full-text retrieval

Full-text retrieval runs automatically for every PMID and DOI (including preprints). For GEO records, the pipeline attempts full-text retrieval via each linked PMID in order, falling back to the GEO record metadata if no linked paper has accessible full text. The pipeline tries each source in order:

1. PMC JATS XML (open access)
2. Europe PMC JATS XML (broader open-access coverage)
3. Preprint XML (bioRxiv / medRxiv)
4. Unpaywall PDF (requires `--unpaywall-email`)
5. Semantic Scholar open-access PDF
6. Abstract fallback
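The chain above is a first-success-wins fallback. A minimal sketch of that pattern (the fetcher callables and source names here are stand-ins, not biolit's internals):

```python
def fetch_with_fallback(fetchers):
    """Try each (source_name, fetcher) pair in order; return the first
    non-empty result, or (None, None) if every source fails."""
    for source, fetch in fetchers:
        try:
            text = fetch()
        except Exception:
            continue  # e.g. network error or blocked request: try next source
        if text:
            return source, text
    return None, None
```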
To enable Unpaywall (step 4), pass your email:

```
biolit alert.eml --default --unpaywall-email you@example.com
```

Limit which sections are sent to the LLM:

```
biolit alert.eml --default --sections methods,results
```
## LLM providers

The tool supports Anthropic (default), OpenAI, and local Ollama models:

```
# OpenAI
biolit pmids.txt --default --provider openai --model gpt-4o

# Ollama (local)
biolit pmids.txt --default --provider ollama --model llama3
```
You can also set `LLM_PROVIDER` and `LLM_MODEL` as environment variables.
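A sketch of how such a lookup typically resolves, assuming CLI flags win over environment variables (mirroring the config-file precedence above; this is not biolit's actual code):

```python
import os

def resolve_llm_settings(cli_provider=None, cli_model=None):
    """CLI flags first, then LLM_PROVIDER / LLM_MODEL env vars, then defaults."""
    provider = cli_provider or os.environ.get("LLM_PROVIDER", "anthropic")
    model = cli_model or os.environ.get("LLM_MODEL")  # None -> provider default
    return provider, model
```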
## Output

Each run creates a timestamped directory (e.g. `run_20260313_142000/`) containing:

- `results.csv` — one row per relevant record
- `results.md` — prose markdown summary (written when `--markdown` or `"markdown": true` in config)
- `artifacts/<id>/` — per-record folder with the text sent to the LLM, metadata, and any retrieved full-text files

Records that fail at any pipeline stage (fetch error, not found, no content, screening or extraction error) are excluded from the CSV but appear in the markdown as stub entries with a failure note.
With default fields, the CSV columns are:

| Column | Description |
|---|---|
| `title` | Paper title |
| `authors` | Author list (comma-separated; parsed from PubMed XML, bioRxiv/medRxiv API, or GEO contributors) |
| `url` | Link to PubMed, GEO, or DOI |
| `pmid` | PubMed ID (null for unindexed preprints) |
| `doi` | DOI (null for GEO records) |
| `geo_accession` | GEO accession (null for non-GEO records) |
| `text_source` | Where the text came from (`abstract`, `pmc_fulltext`, `europepmc_fulltext`, `preprint_fulltext`, `unpaywall_pdf`, `s2_pdf`, `geo_linked_fulltext`, `geo_linked_abstract`, `geo_record`) |
| `citation_count` | Citation count from Semantic Scholar (null if not found) |
| `methodology` | General method (e.g. GWAS, scRNA-seq, proteomics) |
| `sample_type` | Tissue/sample type and origin |
| `causal_claims` | Statements about causes of schizophrenia inferred from the data |
| `summary` | 2-3 sentence plain-language summary for triage |
GEO records additionally include a `linked_pmids` column listing all associated PubMed IDs.
The CSV can be imported directly into Google Sheets (File → Import).
## MCP server

biolit ships an MCP server that exposes the pipeline as tools for any MCP-compatible client (Claude Desktop, Claude CLI, OpenAI Agents SDK, etc.).

Start the server:

```
biolit-mcp
```

Or test interactively with the MCP inspector:

```
mcp dev biolit/mcp_server.py
```
### Configure Claude Desktop

Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "biolit": {
      "command": "biolit-mcp"
    }
  }
}
```
Restart Claude Desktop. The tools will appear in the tool picker.
### Configure Claude CLI

Add a `.mcp.json` in your project root:

```json
{
  "mcpServers": {
    "biolit": {
      "command": "biolit-mcp"
    }
  }
}
```
### Available tools

Batch pipeline (equivalent to the `biolit` CLI):

| Tool | Description |
|---|---|
| `run_pipeline` | Fetch, optionally screen, and optionally extract a mixed list of PMIDs, DOIs, and/or GEO accessions; write results CSV (and optionally a `.md` summary when `markdown=True`). Accepts `ids` (comma-separated), `bib_path` (`.bib` file), or `ids_file` (plain-text identifier file). Use `max_tokens` to cap input text (default 12500), `extraction_max_tokens` for field extraction output (default 4096), and `markdown_max_tokens` for markdown rendering (default 1024). Pass 0 for any token param to use the default. All parameters optional — pass only `config_path` to drive the entire run from a JSON file. |
Low-level (for custom workflows):

| Tool | Description |
|---|---|
| `fetch_pubmed_metadata` | Fetch PubMed metadata by PMID |
| `fetch_geo_record` | Fetch and parse a GEO record by accession |
| `fetch_fulltext` | Retrieve full text for a PMID (6-step chain) |
| `fetch_geo_fulltext` | Retrieve full text for a GEO accession via its linked PMIDs |
| `screen_paper` | LLM relevance screen given pre-fetched text |
| `extract_fields` | Structured field extraction given pre-fetched text |
| `resolve_doi` | Resolve a DOI to PMID + PMCID via the NCBI ID Converter |
| `lookup_s2_pdf` | Check whether Semantic Scholar has an open-access PDF for a DOI |
| `read_pmids_from_eml` | Parse PMIDs from a PubMed alert `.eml` file |
| `get_version` | Return the installed biolit package version |
## Use as a Python library

The pipeline functions are importable directly:

```python
from biolit.pipeline import run, screen_paper, fetch_record
from biolit.llm import get_llm_client

client = get_llm_client("anthropic")

# Batch pipeline — PMIDs, DOIs, and GEO accessions can be mixed freely
# criterion and fields_description are optional; omit either to skip that step
# markdown=True writes results.md alongside the CSV
# Returns (csv_path, record_count)
csv_path, count = run(client, ids=["41627908", "GSE53987", "10.1101/2025.03.17.25324098"],
                      criterion="...", fields_description="methodology, summary",
                      output_path="results.csv", markdown=True)

# Fetch + write metadata only (no LLM calls)
csv_path, count = run(client, ids=["41627908", "GSE53987"])

# Fetch a single record (auto-detects PMID / DOI / GEO)
paper = fetch_record("10.1101/2025.03.17.25324098")

# Screen pre-fetched text
result = screen_paper(client, paper, "Is this about schizophrenia genomics?", paper["abstract"])
# {"relevant": True, "reason": "..."}
```
## Known Limitations

- Papers without abstracts or accessible full text are skipped silently.
- GEO records attempt full-text retrieval via linked PMIDs; `text_source` will be `geo_linked_fulltext`, `geo_linked_abstract`, or `geo_record` depending on what was accessible.
- bioRxiv/medRxiv JATS XML is frequently blocked by Cloudflare regardless of headers. The pipeline falls back to the title and abstract from the bioRxiv API (`text_source: preprint_abstract`).
- The Semantic Scholar API allows roughly 100 unauthenticated requests per day. Set `SEMANTIC_SCHOLAR_API_KEY` in `.env` for higher limits.