
res-sum

A Python package leveraging LLMs for research evidence synthesis

res-sum takes a folder of PDF research papers and produces structured summaries of each one using Large Language Models. It extracts text, builds a knowledge graph of entities and relationships across your papers, and uses hybrid retrieval (vector search + graph traversal) to produce contextually grounded summaries.

Built with ecology in mind, but works for any scientific field.

Features

  • Batch-summarize PDFs — point it at a folder, get a structured summary for each paper
  • Knowledge graph — extracts entities and relationships from your papers using LLMs, stored as a queryable NetworkX graph
  • Hybrid retrieval (GraphRAG) — combines vector similarity search (ChromaDB) with knowledge graph traversal
  • Domain-aware prompting — ecology-specific Chain-of-Thought prompts; custom domains via YAML
  • Multiple LLM providers — Ollama (local, free, default), Ollama Cloud, Groq, OpenAI, Anthropic
  • Multiple output formats — DOCX, JSON, CSV
  • Persistent storage — vector store + knowledge graph persist to disk; incremental ingestion for new papers

Installation

pip install res-sum

For additional LLM providers:

pip install res-sum[openai]        # OpenAI (GPT-4o)
pip install res-sum[anthropic]     # Anthropic (Claude)
pip install res-sum[ollama-cloud]  # Ollama Cloud API
pip install res-sum[all-providers] # All of the above

Default setup (Ollama — free, local, no API key)

If you have Ollama installed locally, res-sum works out of the box with no API key:

ollama pull llama3.2

That's it.

Quick start

Python API

from res_sum import ResSum

# Initialize (defaults: Ollama local, ecology domain)
rs = ResSum(
    llm_provider="ollama",       # or "ollama_cloud", "groq", "openai", "anthropic"
    domain="ecology",            # or "general", or path to custom YAML
)

# Ingest papers — extracts text, builds vector store + knowledge graph
rs.ingest_papers("./pdf_folder/")

# Summarize across all papers
summary = rs.summarize("What are the key findings on pollinator decline?")

# Or batch-summarize: one summary per paper, saved to disk
rs.summarize_papers(
    pdf_directory="./pdf_folder/",
    output_directory="./summaries/",
    output_format="docx",        # or "json", "csv"
)

Command line

# Batch summarize with Ollama (default)
res-sum summarize \
    --pdf_directory ./papers/ \
    --output_directory ./summaries/ \
    --domain ecology

# Use Groq instead (requires API key)
res-sum summarize \
    --pdf_directory ./papers/ \
    --output_directory ./summaries/ \
    --provider groq \
    --api_key $GROQ_API_KEY

# See available providers, models, and domains
res-sum info

LLM providers

  • Ollama (default): no API key; runs locally, so no rate limits. Install Ollama and pull a model.
  • Ollama Cloud: needs OLLAMA_API_KEY; rate limits based on plan. Use --provider ollama_cloud.
  • Groq: needs GROQ_API_KEY; free tier available. Use --provider groq.
  • OpenAI: needs OPENAI_API_KEY; pay-per-use. Use --provider openai.
  • Anthropic: needs ANTHROPIC_API_KEY; pay-per-use. Use --provider anthropic.

API keys can be passed directly or set as environment variables. They are never stored by the package.

Setting up API keys

Option 1 — Environment variables (recommended):

# Add to your ~/.zshrc or ~/.bashrc
export OLLAMA_API_KEY="your-key-here"    # for Ollama Cloud
export GROQ_API_KEY="your-key-here"      # for Groq
export OPENAI_API_KEY="your-key-here"    # for OpenAI
export ANTHROPIC_API_KEY="your-key-here" # for Anthropic

Then just specify the provider — the key is picked up automatically:

rs = ResSum(llm_provider="ollama_cloud")

Option 2 — Pass directly:

rs = ResSum(
    llm_provider="ollama_cloud",
    api_key="your-ollama-cloud-key-here",
)
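
Key resolution presumably follows the usual precedence: an explicitly passed `api_key` wins, otherwise the provider's environment variable is read. A minimal sketch of that logic (`resolve_api_key` is a hypothetical helper for illustration, not part of the res-sum API):

```python
import os

# Map each provider name to its expected environment variable.
ENV_VARS = {
    "ollama_cloud": "OLLAMA_API_KEY",
    "groq": "GROQ_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def resolve_api_key(provider, api_key=None):
    """Prefer an explicitly passed key; fall back to the environment."""
    if api_key is not None:
        return api_key
    env_var = ENV_VARS.get(provider)
    return os.environ.get(env_var) if env_var else None
```

Local Ollama maps to no environment variable at all, which is why it needs no key.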

To get an Ollama Cloud API key, go to ollama.com/settings/keys.

Domain configurations

res-sum ships with two built-in domains:

  • ecology (default) — entity types: Species, Location, Method, Metric, Concept, Temporal. Includes ecology-specific section headers (Study Area, Field Methods, Statistical Analysis, etc.) and a 6-step Chain-of-Thought prompt.
  • general — broader entity types for any scientific field.

You can define your own domain with a YAML file:

# my_domain.yaml
name: biomedical
entity_types:
  - name: DRUG
    description: "Pharmaceutical compounds or treatments"
    examples: ["metformin", "aspirin"]
  - name: DISEASE
    description: "Medical conditions"
    examples: ["diabetes", "cancer"]
relationship_types:
  - TREATS
  - CAUSES
  - ASSOCIATED_WITH

Then point res-sum at it:

rs = ResSum(domain="./my_domain.yaml")
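
A custom domain needs at least a name, entity types, and relationship types. A quick stdlib sanity check over the parsed config, using a plain dict that mirrors the YAML above (`check_domain` is a hypothetical helper, not part of res-sum):

```python
def check_domain(cfg):
    """Return a list of problems with a parsed domain config (empty = OK)."""
    problems = []
    for key in ("name", "entity_types", "relationship_types"):
        if key not in cfg:
            problems.append(f"missing key: {key}")
    for et in cfg.get("entity_types", []):
        if "name" not in et or "description" not in et:
            problems.append(f"incomplete entity type: {et}")
    return problems

# Mirrors my_domain.yaml above.
biomedical = {
    "name": "biomedical",
    "entity_types": [
        {"name": "DRUG", "description": "Pharmaceutical compounds or treatments",
         "examples": ["metformin", "aspirin"]},
        {"name": "DISEASE", "description": "Medical conditions",
         "examples": ["diabetes", "cancer"]},
    ],
    "relationship_types": ["TREATS", "CAUSES", "ASSOCIATED_WITH"],
}
```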

Retrieval modes

  • hybrid (default): vector search + graph expansion + community context, re-ranked. Best for general summarization.
  • local: ChromaDB vector search only. Best for specific factual queries.
  • graph: graph traversal + vector lookup. Best for relational queries.
  • global: community-level summaries + vector search. Best for thematic synthesis across many papers.

summary = rs.summarize("...", mode="hybrid")  # or "local", "graph", "global"
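
The hybrid mode's re-ranking can be pictured as blending a vector-similarity score with a graph-proximity bonus. A toy illustration under assumed scoring (the function, weights, and data here are invented for the sketch, not taken from res-sum's internals):

```python
def hybrid_rank(chunks, alpha=0.7):
    """Blend vector similarity with a graph-proximity bonus.

    Each chunk is (text, vec_score in [0, 1], graph_hops from a query entity);
    fewer hops -> larger graph bonus. alpha weights the vector side.
    """
    def score(chunk):
        _, vec_score, hops = chunk
        graph_bonus = 1.0 / (1 + hops)
        return alpha * vec_score + (1 - alpha) * graph_bonus
    return sorted(chunks, key=score, reverse=True)

candidates = [
    ("wolf diet overlap", 0.62, 1),   # close in the graph
    ("pollinator decline", 0.80, 3),  # strong vector match, distant in graph
    ("soil microbiome", 0.40, 5),     # weak on both signals
]
```

A pure vector ranker would look only at the middle column; the graph bonus lets entity-adjacent chunks compete even with a lower similarity score.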

Explore your knowledge base

After ingesting papers, open an interactive dashboard to visualize and inspect everything:

rs.explore()  # opens in your browser

Or from the command line:

res-sum explore --data_dir ./knowledge_base

The dashboard has four tabs:

  • Overview — papers ingested, chunk counts, entity type breakdown, graph stats
  • Knowledge Graph — interactive graph visualization. Nodes colored by entity type, sized by connections. Click to see relationships, filter by type, search by name.
  • Vector Store — browse all text chunks by paper. See which section each chunk came from, expand to read full text.
  • Communities — entity clusters detected by the Leiden algorithm, with LLM-generated summaries explaining what connects each group.

It's a single HTML file — works offline, shareable with collaborators.

Programmatic access

# Query an entity
rs.query_graph("Canis lupus")

# Most connected entities
rs.get_central_entities(top_k=10)

# Community structure
rs.get_communities()

# Access the NetworkX graph directly
graph = rs.knowledge_graph.graph

The graph is saved as GraphML and can be imported into Neo4j or any graph visualization tool.
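
Because GraphML is plain XML, the saved graph can be inspected even without NetworkX. A stdlib sketch over a toy document (the node ids here are illustrative; the attribute keys in res-sum's actual export may differ):

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"

def list_nodes(graphml_text):
    """Return the node ids found in a GraphML document."""
    root = ET.fromstring(graphml_text)
    return [n.attrib["id"] for n in root.iter(f"{GRAPHML_NS}node")]

# Toy GraphML with two entities and one relationship.
toy = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="Canis lupus"/>
    <node id="Yellowstone"/>
    <edge source="Canis lupus" target="Yellowstone"/>
  </graph>
</graphml>"""
```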

How it works

PDF files
  → Text extraction (pymupdf4llm — handles multi-column, tables)
  → Section detection (ecology-aware regex + Markdown headers)
  → Chunking (RecursiveCharacterTextSplitter)
  → ChromaDB (embed + store chunks)
  → LLM entity/relationship extraction → NetworkX knowledge graph
  → Community detection (Leiden/Louvain)
  → Hybrid retrieval (vector + graph + community)
  → LLM summarization (Chain-of-Thought prompting)
  → Output (DOCX / JSON / CSV)

All data persists to a data_dir/ folder. Adding new papers only processes what's new.
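
The chunking step relies on the recursive-splitting idea: try the coarsest separator first (paragraph breaks), and only fall back to finer ones (lines, sentences, words) for pieces that are still too long. A toy re-implementation of that idea in plain Python (not the library's actual code):

```python
def recursive_split(text, max_len=80, seps=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator that yields pieces under max_len,
    recursing with finer separators into any piece still too long."""
    if len(text) <= max_len or not seps:
        return [text]
    head, *rest = seps
    chunks = []
    for piece in text.split(head):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_len, tuple(rest)))
    return [c for c in chunks if c.strip()]
```

This keeps short paragraphs intact while long ones are broken at sentence or word boundaries rather than mid-token.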

Requirements

  • Python >= 3.9
  • Ollama installed locally (for default provider), or an API key for another provider

Contributing

Issues and pull requests are welcome on GitHub.

License

MIT
