
LLM-assisted biomedical literature screening and structured extraction for PubMed and GEO.

biolit

LLM-assisted biomedical literature screening and structured extraction. Accepts PubMed alert emails, PMID lists, DOI lists, or GEO accession lists. Retrieves full text from PMC, Europe PMC, bioRxiv/medRxiv, Unpaywall, and Semantic Scholar. Supports multiple LLM providers and exposes all functionality as an MCP server.

Setup

Requirements: Python 3.8+

Install from PyPI:

pip install biolit

Or install from source for development:

pip install -e .

Copy .env.example to .env and add your API key:

cp .env.example .env
# edit .env and set ANTHROPIC_API_KEY (or OPENAI_API_KEY)

Usage

The tool accepts several input formats, auto-detected by file extension or content:

| Input | How to pass | Example |
|---|---|---|
| PubMed alert email | positional .eml file | alert.eml |
| PMID list (file) | positional plain-text file, one PMID per line | pmids.txt |
| DOI list (file) | positional plain-text file, one DOI per line | dois.txt |
| GEO accession list (file) | positional plain-text file, one accession per line | geo_accessions.txt |
| PMIDs (inline) | --pmids flag, comma-separated | --pmids 41795042,41792186 |
| DOIs (inline) | --dois flag, comma-separated | --dois 10.1038/s41588-021-00974-7 |
| GEO accessions (inline) | --accessions flag, comma-separated | --accessions GSE53987,GSE12345 |
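
To make the idea of content-based detection concrete, here is a minimal sketch of how identifiers of the three kinds could be told apart. The regular expressions and the function names `classify_identifier` and `classify_lines` are illustrative assumptions, not biolit's actual detection rules.

```python
import re

# Illustrative patterns only; biolit's real detection logic may differ.
PMID_RE = re.compile(r"^\d{1,8}$")
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
GEO_RE = re.compile(r"^(GSE|GDS|GSM|GPL)\d+$", re.IGNORECASE)


def classify_identifier(token: str) -> str:
    """Return 'pmid', 'doi', 'geo', or 'unknown' for a single identifier."""
    token = token.strip()
    if PMID_RE.match(token):
        return "pmid"
    if DOI_RE.match(token):
        return "doi"
    if GEO_RE.match(token):
        return "geo"
    return "unknown"


def classify_lines(text: str) -> dict:
    """Group the non-empty lines of a list file by identifier type."""
    groups: dict = {}
    for line in text.splitlines():
        if line.strip():
            groups.setdefault(classify_identifier(line), []).append(line.strip())
    return groups
```

A file that mixes types would come back with more than one key, which a caller could treat as an error before starting a run.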

Use --default to run with schizophrenia genomics defaults (no prompts):

biolit docs/alert.eml --default
biolit docs/pmids.txt --default
biolit docs/geo_accessions.txt --default
biolit --pmids 41795042,41792186 --default
biolit --accessions GSE53987 --default

Or specify criterion and fields as flags:

biolit docs/pmids.txt \
  --criterion "Is this about treatment-resistant schizophrenia?" \
  --fields "methodology, sample_size, treatment, outcomes"

Or run interactively; you are prompted for the criterion and fields if they are not provided:

biolit alert.eml

Single-record screening

Use biolit screen to quickly check one paper or GEO record for relevance without running the full extraction pipeline:

biolit screen --pmid 41627908 --default
biolit screen --accession GSE53987 --default
biolit screen --doi 10.64898/2026.02.16.706214 --default
biolit screen --pmid 41627908 --criterion "Is this about treatment-resistant schizophrenia?"
biolit screen --pmid 41627908 --fulltext --default

Output is a single line to stdout:

RELEVANT [abstract] — Paper uses GWAS to investigate schizophrenia risk loci.
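
If you call biolit screen from a script, the single-line output can be split into its parts. The sketch below assumes the layout shown in the sample line (verdict, bracketed text source, dash separator, reason); the verdict strings other than RELEVANT are not documented here, so the parser accepts any label. The names `ScreenLine` and `parse_screen_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ScreenLine(NamedTuple):
    verdict: str
    text_source: str
    reason: str


# Layout assumed from the sample output: "VERDICT [source] <dash> reason".
# \u2014 is the dash character used as the separator in that sample.
LINE_RE = re.compile(
    r"^(?P<verdict>.+?)\s+\[(?P<source>[^\]]+)\]\s+\u2014\s+(?P<reason>.*)$"
)


def parse_screen_line(line: str) -> Optional[ScreenLine]:
    """Parse one line of screen output; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return ScreenLine(match["verdict"], match["source"], match["reason"])
```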

GEO accession input

Pass a file of GEO accessions (GSE, GDS, GSM, or GPL prefixes) to screen GEO records directly. The tool fetches each record's MINiML XML, extracts the summary, overall design, experiment type, and organism, then runs the same LLM screening and extraction pipeline.

biolit geo_accessions.txt \
  --criterion "Does this study perturb a transcription factor?" \
  --fields "organism, experiment_type, tf_perturbed, perturbation_method, summary"

GEO results include geo_accession and pmids (linked PubMed IDs) columns in place of pmid.

Full-text retrieval (PubMed and DOI inputs only)

Use --fulltext to screen and extract from full text instead of just the abstract. The pipeline tries each source in order:

  1. PMC JATS XML (open access)
  2. Europe PMC JATS XML (broader open-access coverage)
  3. Preprint XML (bioRxiv / medRxiv)
  4. Unpaywall PDF (requires --unpaywall-email)
  5. Semantic Scholar open-access PDF
  6. Abstract fallback

For example, to enable the Unpaywall step:

biolit alert.eml --default --fulltext --unpaywall-email you@example.com
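
The ordered fallback described above can be sketched as a loop over fetchers that stops at the first source returning text. This is an illustration of the pattern, not biolit's implementation: `first_available` and the fetcher callables are hypothetical, and the labels are borrowed from the text_source values listed under Output.

```python
from typing import Callable, List, Optional, Tuple

# A fetcher takes a PMID and returns text, or None when the source has nothing.
Fetcher = Callable[[str], Optional[str]]


def first_available(
    pmid: str, chain: List[Tuple[str, Fetcher]], abstract: str
) -> Tuple[str, str]:
    """Try each (label, fetcher) pair in order; fall back to the abstract."""
    for label, fetch in chain:
        try:
            text = fetch(pmid)
        except Exception:
            # Assumption: one failing source should not abort the whole chain.
            text = None
        if text:
            return label, text
    return "abstract", abstract


if __name__ == "__main__":
    # Stub fetchers stand in for real network calls.
    chain = [
        ("pmc_fulltext", lambda pmid: None),
        ("europepmc_fulltext", lambda pmid: "Full text body"),
    ]
    print(first_available("41627908", chain, abstract="Abstract text"))
```

Returning the label alongside the text is what lets a results file record where each record's text came from.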

Limit which sections are sent to the LLM:

biolit alert.eml --default --fulltext --sections methods,results

LLM providers

The tool supports Anthropic (default), OpenAI, and local Ollama models:

# OpenAI
biolit pmids.txt --default --provider openai --model gpt-4o

# Ollama (local)
biolit pmids.txt --default --provider ollama --model llama3

You can also set LLM_PROVIDER and LLM_MODEL as environment variables.
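
As a rough sketch of how flags and environment variables might combine, the function below prefers an explicit value, then the environment variable, then the Anthropic default. That precedence order is an assumption; only the variable names and the default provider come from this README, and `resolve_provider` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_provider(
    cli_provider: Optional[str] = None,
    cli_model: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Pick provider and model: explicit value, then environment, then default."""
    env = os.environ if env is None else env
    # Assumed precedence: command-line value > environment variable > default.
    provider = cli_provider or env.get("LLM_PROVIDER") or "anthropic"
    model = cli_model or env.get("LLM_MODEL")
    return provider.lower(), model
```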

Output

Each run creates a timestamped directory (e.g. run_20260313_142000/) containing:

  • results.csv — one row per relevant record
  • artifacts/<id>/ — per-record folder with the text sent to the LLM, metadata, and any retrieved full-text files

With --default on PubMed inputs, the CSV columns are:

| Column | Description |
|---|---|
| title | Paper title |
| url | PubMed link |
| pmid | PubMed ID |
| doi | DOI |
| text_source | Where the text came from (abstract, pmc_fulltext, europepmc_fulltext, preprint_fulltext, preprint_abstract, unpaywall_pdf, s2_pdf) |
| citation_count | Citation count from Semantic Scholar (null if not found) |
| methodology | General method (e.g. GWAS, scRNA-seq, proteomics) |
| sample_type | Tissue/sample type and origin |
| causal_claims | Statements about causes of schizophrenia inferred from the data |
| genetics_claims | Claims about specific genes, loci, or pathways |
| summary | 2-3 sentence plain-language summary for triage |

For GEO inputs, pmid is replaced by geo_accession and pmids.

The CSV can be imported directly into Google Sheets (File → Import).
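
For downstream analysis in Python, a small sketch like the following could locate the newest run directory and load its rows. It relies on the run_YYYYMMDD_HHMMSS naming shown above, which sorts chronologically as plain text; the base directory argument and the helper names are assumptions.

```python
import csv
from pathlib import Path
from typing import Dict, List, Optional


def latest_run_dir(base: Path) -> Optional[Path]:
    """Return the newest run_* directory under base, or None if there is none."""
    # Timestamped names sort chronologically, so the last one is the newest.
    runs = sorted(p for p in base.glob("run_*") if p.is_dir())
    return runs[-1] if runs else None


def load_results(run_dir: Path) -> List[Dict[str, str]]:
    """Read results.csv from a run directory into a list of row dicts."""
    with (run_dir / "results.csv").open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


if __name__ == "__main__":
    run_dir = latest_run_dir(Path("."))
    if run_dir is not None:
        for row in load_results(run_dir):
            print(row.get("title"), row.get("text_source"))
```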

MCP server

biolit ships an MCP server that exposes the pipeline as tools for any MCP-compatible client (Claude Desktop, Claude CLI, OpenAI Agents SDK, etc.).

Start the server:

biolit-mcp

Or test interactively with the MCP inspector:

mcp dev biolit/mcp_server.py

Configure Claude Desktop

On macOS, add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "biolit": {
      "command": "biolit-mcp"
    }
  }
}

Restart Claude Desktop. The tools will appear in the tool picker.

Configure Claude CLI

Add a .mcp.json in your project root:

{
  "mcpServers": {
    "biolit": {
      "command": "biolit-mcp"
    }
  }
}
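
If the config file already lists other MCP servers, pasting the snippet over it would remove them. The following is a minimal sketch of merging the biolit entry into an existing file instead; `add_biolit_server` is a hypothetical helper, and it assumes the file is either absent, empty, or valid JSON.

```python
import json
from pathlib import Path


def add_biolit_server(config_path: Path) -> dict:
    """Add the biolit entry to an MCP config file, keeping other servers."""
    config = {}
    if config_path.exists():
        # Treat an empty file the same as a missing one.
        config = json.loads(config_path.read_text(encoding="utf-8") or "{}")
    servers = config.setdefault("mcpServers", {})
    servers["biolit"] = {"command": "biolit-mcp"}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


if __name__ == "__main__":
    add_biolit_server(Path(".mcp.json"))
```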

Available tools

Batch pipelines (equivalent to the biolit CLI):

| Tool | Description |
|---|---|
| run_pipeline | Screen + extract a list of PMIDs, write results CSV |
| run_geo_pipeline | Screen + extract a list of GEO accessions, write results CSV |

Single-record (equivalent to biolit screen):

| Tool | Description |
|---|---|
| screen_by_pmid | Fetch + screen a PubMed paper in one call |
| screen_by_doi | Fetch + screen a paper by DOI in one call (handles preprints with no PMID) |
| screen_by_geo | Fetch + screen a GEO record in one call |

Low-level (for custom workflows):

| Tool | Description |
|---|---|
| search_pubmed | Fetch PubMed metadata by PMID |
| fetch_geo_record | Fetch and parse a GEO record by accession |
| fetch_fulltext | Retrieve full text for a PMID (6-step chain) |
| screen_paper | LLM relevance screen given pre-fetched text |
| extract_fields | Structured field extraction given pre-fetched text |
| resolve_doi | Resolve a DOI to PMID + PMCID via the NCBI ID Converter |
| lookup_s2_pdf | Check whether Semantic Scholar has an open-access PDF for a DOI |
| read_pmids_from_eml | Parse PMIDs from a PubMed alert .eml file |

Use as a Python library

The pipeline functions are importable directly:

from biolit.pipeline import screen_by_pmid, screen_by_doi, screen_by_geo, run, run_geo
from biolit.llm import get_llm_client

client = get_llm_client("anthropic")

# Screen by PMID
result = screen_by_pmid(client, "41627908", "Is this about schizophrenia genomics?")
# {"relevant": True, "reason": "...", "text_source": "abstract"}

# Screen by DOI (works for preprints without a PMID)
result = screen_by_doi(client, "10.64898/2026.02.16.706214", "Is this about schizophrenia genomics?")
# {"relevant": True, "reason": "...", "text_source": "preprint_abstract", "doi": "..."}

# Batch pipeline
run(
    client,
    pmids=["41627908", "33741721"],
    criterion="...",
    fields_description="methodology, summary",
    output_path="results.csv",
)
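
When screening several records one at a time, it can help to collect the relevant ones and keep going past individual failures. The sketch below takes the screening function as a parameter so it runs without biolit installed; in practice you would pass `screen_by_pmid`. It assumes the result is a dict with a "relevant" key, as in the comments above, and that a failed lookup raises an exception, which is not confirmed by this README. `screen_many` is a hypothetical helper.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

ScreenFn = Callable[[Any, str, str], Dict[str, Any]]


def screen_many(
    screen_fn: ScreenFn, client: Any, pmids: Iterable[str], criterion: str
) -> Tuple[List[Dict[str, Any]], List[Tuple[str, str]]]:
    """Screen each PMID; return (relevant results, failures with messages)."""
    relevant: List[Dict[str, Any]] = []
    failed: List[Tuple[str, str]] = []
    for pmid in pmids:
        try:
            result = screen_fn(client, pmid, criterion)
        except Exception as exc:
            # Assumption: errors surface as exceptions; record and continue.
            failed.append((pmid, str(exc)))
            continue
        if result.get("relevant"):
            relevant.append({"pmid": pmid, **result})
    return relevant, failed
```

For example, `screen_many(screen_by_pmid, client, ["41627908", "33741721"], "Is this about schizophrenia genomics?")` would return the relevant results together with any PMIDs that could not be screened.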

Known Limitations

  • Papers without abstracts or accessible full text are skipped silently.
  • Full-text retrieval (--fulltext) applies to PubMed and DOI inputs only; GEO records always use the record metadata directly.
  • bioRxiv/medRxiv JATS XML is frequently blocked by Cloudflare regardless of headers. The pipeline falls back to the title and abstract from the bioRxiv API (text_source: preprint_abstract).
  • DOIs passed via --dois or a DOI file are resolved to PMIDs before the batch pipeline runs. DOIs that can't be resolved (e.g. preprints not yet indexed in PubMed) are skipped. Use biolit screen --doi to screen an individual unresolvable DOI.
  • The Semantic Scholar API allows roughly 100 unauthenticated requests per day. Set SEMANTIC_SCHOLAR_API_KEY in .env for higher limits.
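
For context on the last point, the Semantic Scholar Graph API accepts an API key in the x-api-key request header. The sketch below builds the URL and headers for a DOI lookup without sending anything; reading the key from SEMANTIC_SCHOLAR_API_KEY mirrors the setting named above, but how biolit itself constructs the request is an assumption, and `build_s2_request` is a hypothetical helper.

```python
import os
from typing import Dict, Mapping, Optional, Tuple
from urllib.parse import quote

# Graph API paper lookup by DOI, requesting the open-access PDF and citation count.
S2_PAPER_URL = (
    "https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}"
    "?fields=openAccessPdf,citationCount"
)


def build_s2_request(
    doi: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for a Semantic Scholar lookup of one DOI."""
    env = os.environ if env is None else env
    headers: Dict[str, str] = {}
    key = env.get("SEMANTIC_SCHOLAR_API_KEY")
    if key:
        # Authenticated requests get higher rate limits.
        headers["x-api-key"] = key
    return S2_PAPER_URL.format(doi=quote(doi, safe="/")), headers
```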

Download files

Source Distribution

biolit-0.1.6.tar.gz (39.8 kB)

Uploaded Source

Built Distribution

biolit-0.1.6-py3-none-any.whl (32.9 kB)

Uploaded Python 3

File details

Details for the file biolit-0.1.6.tar.gz.

File metadata

  • Download URL: biolit-0.1.6.tar.gz
  • Upload date:
  • Size: 39.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for biolit-0.1.6.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | aa9c30f94213c944e740d278a6a1f47e63213db636734a0c3cf321568e150114 |
| MD5 | c3bfb26dcedf5515e09eb01c1f562071 |
| BLAKE2b-256 | 5862d386ab9ce5935ca71f6cd46a7fd6e37badacbcf6ec6ee0375c54e1e79a6e |

Provenance

The following attestation bundles were made for biolit-0.1.6.tar.gz:

Publisher: publish.yml on rachadele/biolit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file biolit-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: biolit-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 32.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for biolit-0.1.6-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 548a56ff82fea19b7c7bfb2228fe2683962d9c1272c5730a559b373ce9503ba4 |
| MD5 | 77645c3effe94927fd28368c8be0f10e |
| BLAKE2b-256 | 1d063174a9138320dbffb13c4c30ae49a24d3e883a9bafb18c30437672c22cdc |

Provenance

The following attestation bundles were made for biolit-0.1.6-py3-none-any.whl:

Publisher: publish.yml on rachadele/biolit

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
