
res-sum

A Python package leveraging LLMs for research evidence synthesis

res-sum takes a folder of PDF research papers and produces structured summaries of each one using Large Language Models. It extracts text, builds a knowledge graph of entities and relationships across your papers, and uses hybrid retrieval (vector search + graph traversal) to produce contextually grounded summaries.

Built with ecology in mind, but works for any scientific field.

Features

  • Batch-summarize PDFs — point it at a folder, get a structured summary for each paper
  • Knowledge graph — extracts entities and relationships from your papers using LLMs, stored as a queryable NetworkX graph
  • Hybrid retrieval (GraphRAG) — combines vector similarity search (ChromaDB) with knowledge graph traversal
  • Domain-aware prompting — ecology-specific Chain-of-Thought prompts; custom domains via YAML
  • Multiple LLM providers — Ollama (local, free, default), Ollama Cloud, Groq, OpenAI, Anthropic
  • Multiple output formats — DOCX, JSON, CSV
  • Persistent storage — vector store + knowledge graph persist to disk; incremental ingestion for new papers

Installation

pip install res-sum

For additional LLM providers:

pip install res-sum[openai]        # OpenAI (GPT-4o)
pip install res-sum[anthropic]     # Anthropic (Claude)
pip install res-sum[ollama-cloud]  # Ollama Cloud API
pip install res-sum[all-providers] # All of the above

Default setup (Ollama — free, local, no API key)

If you have Ollama installed locally, res-sum works out of the box with no API key:

ollama pull llama3.2

That's it. No further configuration is needed.

Quick start

Python API

from res_sum import ResSum

# Initialize (defaults: Ollama local, ecology domain)
rs = ResSum(
    llm_provider="ollama",       # or "ollama_cloud", "groq", "openai", "anthropic"
    domain="ecology",            # or "general", or path to custom YAML
)

# Ingest papers — extracts text, builds vector store + knowledge graph
rs.ingest_papers("./pdf_folder/")

# Summarize across all papers
summary = rs.summarize("What are the key findings on pollinator decline?")

# Or batch-summarize: one summary per paper, saved to disk
rs.summarize_papers(
    pdf_directory="./pdf_folder/",
    output_directory="./summaries/",
    output_format="docx",        # or "json", "csv"
)

Command line

# Batch summarize with Ollama (default)
res-sum summarize \
    --pdf_directory ./papers/ \
    --output_directory ./summaries/ \
    --domain ecology

# Use Groq instead (requires API key)
res-sum summarize \
    --pdf_directory ./papers/ \
    --output_directory ./summaries/ \
    --provider groq \
    --api_key $GROQ_API_KEY

# See available providers, models, and domains
res-sum info

LLM providers

| Provider | API key needed | Rate limits | How to use |
|---|---|---|---|
| Ollama (default) | No | None (runs locally) | Install Ollama, pull a model |
| Ollama Cloud | Yes (OLLAMA_API_KEY) | Based on plan | --provider ollama_cloud |
| Groq | Yes (GROQ_API_KEY) | Free tier available | --provider groq |
| OpenAI | Yes (OPENAI_API_KEY) | Pay-per-use | --provider openai |
| Anthropic | Yes (ANTHROPIC_API_KEY) | Pay-per-use | --provider anthropic |

API keys can be passed directly or set as environment variables. They are never stored by the package.

Setting up API keys

Option 1 — Environment variables (recommended):

# Add to your ~/.zshrc or ~/.bashrc
export OLLAMA_API_KEY="your-key-here"    # for Ollama Cloud
export GROQ_API_KEY="your-key-here"      # for Groq
export OPENAI_API_KEY="your-key-here"    # for OpenAI
export ANTHROPIC_API_KEY="your-key-here" # for Anthropic

Then just specify the provider — the key is picked up automatically:

rs = ResSum(llm_provider="ollama_cloud")

Option 2 — Pass directly:

rs = ResSum(
    llm_provider="ollama_cloud",
    api_key="your-ollama-cloud-key-here",
)

To get an Ollama Cloud API key, go to ollama.com/settings/keys.

Domain configurations

res-sum ships with two built-in domains:

  • ecology (default) — entity types: Species, Location, Method, Metric, Concept, Temporal. Includes ecology-specific section headers (Study Area, Field Methods, Statistical Analysis, etc.) and a 6-step Chain-of-Thought prompt.
  • general — broader entity types for any scientific field.

You can define your own domain with a YAML file:

# my_domain.yaml
name: biomedical
entity_types:
  - name: DRUG
    description: "Pharmaceutical compounds or treatments"
    examples: ["metformin", "aspirin"]
  - name: DISEASE
    description: "Medical conditions"
    examples: ["diabetes", "cancer"]
relationship_types:
  - TREATS
  - CAUSES
  - ASSOCIATED_WITH

Then point res-sum at it:

rs = ResSum(domain="./my_domain.yaml")

Retrieval modes

| Mode | What it does | Best for |
|---|---|---|
| hybrid (default) | Vector search + graph expansion + community context, re-ranked | General summarization |
| local | ChromaDB vector search only | Specific factual queries |
| graph | Graph traversal + vector lookup | Relational queries |
| global | Community-level summaries + vector search | Thematic synthesis across many papers |

summary = rs.summarize("...", mode="hybrid")  # or "local", "graph", "global"
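To make the hybrid mode concrete, here is a toy sketch of score fusion: vector-search candidates get a bonus when they are also reachable through the knowledge graph, then everything is re-ranked. The function name, weights, and data are illustrative only, not res-sum's actual internals.

```python
# Toy hybrid re-ranking: blend a vector-similarity score with a
# graph-proximity bonus, then sort by the combined score.
# All names and numbers here are hypothetical.

def hybrid_rank(vector_hits, graph_neighbors, alpha=0.7):
    """vector_hits: {chunk_id: similarity in [0, 1]}.
    graph_neighbors: chunk_ids linked to query entities in the graph."""
    scored = {}
    for chunk_id, sim in vector_hits.items():
        bonus = 1.0 if chunk_id in graph_neighbors else 0.0
        scored[chunk_id] = alpha * sim + (1 - alpha) * bonus
    return sorted(scored, key=scored.get, reverse=True)

vector_hits = {"c1": 0.92, "c2": 0.85, "c3": 0.55}
graph_neighbors = {"c3"}  # c3 is connected to a query entity
print(hybrid_rank(vector_hits, graph_neighbors))  # ['c3', 'c1', 'c2']
```

Note how the graph bonus promotes c3 above chunks with higher raw vector similarity; that reordering is the point of combining the two signals.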

Explore your knowledge base

After ingesting papers, open an interactive dashboard to visualize and inspect everything:

rs.explore()  # opens in your browser

Or from the command line:

res-sum explore --data_dir ./knowledge_base

The dashboard has four tabs:

  • Overview — papers ingested, chunk counts, entity type breakdown, graph stats
  • Knowledge Graph — interactive graph visualization. Nodes colored by entity type, sized by connections. Click to see relationships, filter by type, search by name.
  • Vector Store — browse all text chunks by paper. See which section each chunk came from, expand to read full text.
  • Communities — entity clusters detected by the Leiden algorithm, with LLM-generated summaries explaining what connects each group.

It's a single HTML file — works offline, shareable with collaborators.

Programmatic access

# Query an entity
rs.query_graph("Canis lupus")

# Most connected entities
rs.get_central_entities(top_k=10)

# Community structure
rs.get_communities()

# Access the NetworkX graph directly
graph = rs.knowledge_graph.graph

The graph is saved as GraphML and can be imported into Neo4j or any graph visualization tool.
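Because GraphML is plain XML, the exported file can be inspected with nothing beyond the standard library. A minimal sketch (the element layout is standard GraphML; the node ids are made up):

```python
# Parse a GraphML document with the standard library and count
# nodes and edges. GraphML elements live in a fixed XML namespace.
import xml.etree.ElementTree as ET

graphml = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="Canis lupus"/>
    <node id="Yellowstone"/>
    <edge source="Canis lupus" target="Yellowstone"/>
  </graph>
</graphml>"""

ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(graphml)
nodes = root.findall(".//g:node", ns)
edges = root.findall(".//g:edge", ns)
print(len(nodes), len(edges))  # 2 1
```

The same file parses with networkx's read_graphml, Neo4j's GraphML import, or Gephi.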

How it works

PDF files
  → Text extraction (pymupdf4llm — handles multi-column, tables)
  → Section detection (ecology-aware regex + Markdown headers)
  → Chunking (RecursiveCharacterTextSplitter)
  → ChromaDB (embed + store chunks)
  → LLM entity/relationship extraction → NetworkX knowledge graph
  → Community detection (Leiden/Louvain)
  → Hybrid retrieval (vector + graph + community)
  → LLM summarization (Chain-of-Thought prompting)
  → Output (DOCX / JSON / CSV)
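The section-detection step in the pipeline above can be sketched as a regex pass over Markdown-style headers. The header vocabulary below is illustrative; res-sum's actual patterns may differ.

```python
# Sketch: detect ecology-style section headers in Markdown-ish text.
# The list of header names is illustrative only.
import re

HEADERS = r"(Abstract|Introduction|Study Area|Field Methods|Statistical Analysis|Results|Discussion)"
pattern = re.compile(rf"^#{{1,3}}\s*{HEADERS}\b", re.IGNORECASE | re.MULTILINE)

text = """# Introduction
Wolves were reintroduced...
## Study Area
Yellowstone National Park...
## Statistical Analysis
We fit a GLMM...
"""
sections = [m.group(1) for m in pattern.finditer(text)]
print(sections)  # ['Introduction', 'Study Area', 'Statistical Analysis']
```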

All data persists to a data_dir/ folder. Adding new papers only processes what's new.
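One common way to implement that incremental behavior is content hashing: skip any file whose digest has already been recorded. A minimal stdlib sketch, not res-sum's actual mechanism:

```python
# Sketch of incremental ingestion: hash each file's bytes and skip
# anything whose hash has been seen before. Illustrative only.
import hashlib

def files_to_process(files, seen_hashes):
    """files: {name: bytes}. Returns names not yet ingested and
    records their hashes in seen_hashes."""
    new = []
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            new.append(name)
    return new

seen = set()
print(files_to_process({"a.pdf": b"one", "b.pdf": b"two"}, seen))    # both new
print(files_to_process({"a.pdf": b"one", "c.pdf": b"three"}, seen))  # only c.pdf
```

Persisting the seen-hash set alongside the vector store is what makes re-runs cheap: unchanged papers are detected and skipped before any extraction work happens.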

Requirements

  • Python >= 3.9
  • Ollama installed locally (for default provider), or an API key for another provider

Contributing

Issues and pull requests are welcome on GitHub.

License

MIT
