
res-sum

A Python package leveraging LLMs for research evidence synthesis

res-sum takes a folder of PDF research papers and produces structured summaries of each one using Large Language Models. It extracts text, builds a knowledge graph of entities and relationships across your papers, and uses hybrid retrieval (vector search + graph traversal) to produce contextually grounded summaries.

Built with ecology in mind, but works for any scientific field.

Features

  • Batch-summarize PDFs — point it at a folder, get a structured summary for each paper
  • Knowledge graph — extracts entities and relationships from your papers using LLMs, stored as a queryable NetworkX graph
  • Hybrid retrieval (GraphRAG) — combines vector similarity search (ChromaDB) with knowledge graph traversal
  • Domain-aware prompting — ecology-specific Chain-of-Thought prompts; custom domains via YAML
  • Multiple LLM providers — Ollama (local, free, default), Ollama Cloud, Groq, OpenAI, Anthropic
  • Multiple output formats — DOCX, JSON, CSV
  • Persistent storage — vector store + knowledge graph persist to disk; incremental ingestion for new papers

Installation

pip install res-sum

For additional LLM providers:

pip install res-sum[openai]        # OpenAI (GPT-4o)
pip install res-sum[anthropic]     # Anthropic (Claude)
pip install res-sum[ollama-cloud]  # Ollama Cloud API
pip install res-sum[all-providers] # All of the above

Default setup (Ollama — free, local, no API key)

If you have Ollama installed locally, res-sum works out of the box with no API key:

ollama pull llama3.2

That's it.
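Before a long ingest, it can be worth confirming the local Ollama server is actually reachable. A minimal probe of Ollama's default endpoint (http://localhost:11434, its standard port); this `ollama_running` helper is a convenience sketch, not part of res-sum:

```python
import urllib.request
import urllib.error

def ollama_running(base_url: str = "http://localhost:11434") -> bool:
    """Return True if an Ollama server answers at base_url."""
    try:
        # Ollama's root path replies with a small "Ollama is running" page.
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```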

Quick start

Python API

from res_sum import ResSum

# Initialize (defaults: Ollama local, ecology domain)
rs = ResSum(
    llm_provider="ollama",       # or "ollama_cloud", "groq", "openai", "anthropic"
    domain="ecology",            # or "general", or path to custom YAML
)

# Ingest papers — extracts text, builds vector store + knowledge graph
rs.ingest_papers("./pdf_folder/")

# Summarize across all papers
summary = rs.summarize("What are the key findings on pollinator decline?")

# Or batch-summarize: one summary per paper, saved to disk
rs.summarize_papers(
    pdf_directory="./pdf_folder/",
    output_directory="./summaries/",
    output_format="docx",        # or "json", "csv"
)

Command line

# Batch summarize with Ollama (default)
res-sum summarize \
    --pdf_directory ./papers/ \
    --output_directory ./summaries/ \
    --domain ecology

# Use Groq instead (requires API key)
res-sum summarize \
    --pdf_directory ./papers/ \
    --output_directory ./summaries/ \
    --provider groq \
    --api_key $GROQ_API_KEY

# See available providers, models, and domains
res-sum info

LLM providers

| Provider | API key needed | Rate limits | How to use |
|---|---|---|---|
| Ollama (default) | No | None (runs locally) | Install Ollama, pull a model |
| Ollama Cloud | Yes (OLLAMA_API_KEY) | Based on plan | --provider ollama_cloud |
| Groq | Yes (GROQ_API_KEY) | Free tier available | --provider groq |
| OpenAI | Yes (OPENAI_API_KEY) | Pay-per-use | --provider openai |
| Anthropic | Yes (ANTHROPIC_API_KEY) | Pay-per-use | --provider anthropic |

API keys can be passed directly or set as environment variables. They are never stored by the package.
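That resolution order (an explicit api_key argument wins, otherwise the provider's environment variable is consulted) can be sketched as follows; `resolve_api_key` and the env-var mapping are illustrative, not res-sum's actual internals:

```python
import os
from typing import Optional

# Environment variable expected for each provider (mirrors the table above).
PROVIDER_ENV_VARS = {
    "ollama_cloud": "OLLAMA_API_KEY",
    "groq": "GROQ_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def resolve_api_key(provider: str, api_key: Optional[str] = None) -> Optional[str]:
    """Return an explicit key if given, else fall back to the environment."""
    if api_key:
        return api_key
    env_var = PROVIDER_ENV_VARS.get(provider)
    # Local Ollama has no entry in the mapping and needs no key.
    return os.environ.get(env_var) if env_var else None
```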

Setting up API keys

Option 1 — Environment variables (recommended):

# Add to your ~/.zshrc or ~/.bashrc
export OLLAMA_API_KEY="your-key-here"    # for Ollama Cloud
export GROQ_API_KEY="your-key-here"      # for Groq
export OPENAI_API_KEY="your-key-here"    # for OpenAI
export ANTHROPIC_API_KEY="your-key-here" # for Anthropic

Then just specify the provider — the key is picked up automatically:

rs = ResSum(llm_provider="ollama_cloud")

Option 2 — Pass directly:

rs = ResSum(
    llm_provider="ollama_cloud",
    api_key="your-ollama-cloud-key-here",
)

To get an Ollama Cloud API key, go to ollama.com/settings/keys.

Domain configurations

res-sum ships with two built-in domains:

  • ecology (default) — entity types: Species, Location, Method, Metric, Concept, Temporal. Includes ecology-specific section headers (Study Area, Field Methods, Statistical Analysis, etc.) and a 6-step Chain-of-Thought prompt.
  • general — broader entity types for any scientific field.

You can define your own domain with a YAML file:

# my_domain.yaml
name: biomedical
entity_types:
  - name: DRUG
    description: "Pharmaceutical compounds or treatments"
    examples: ["metformin", "aspirin"]
  - name: DISEASE
    description: "Medical conditions"
    examples: ["diabetes", "cancer"]
relationship_types:
  - TREATS
  - CAUSES
  - ASSOCIATED_WITH

Then point res-sum at it:

rs = ResSum(domain="./my_domain.yaml")
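The schema above is small: a name, entity_types entries with name/description/examples, and a list of relationship_types. A quick sanity check of that shape, with the config written as an equivalent Python dict; `validate_domain` is an illustrative helper, not part of res-sum:

```python
def validate_domain(config: dict) -> list:
    """Collect schema problems in a custom domain config; empty list means OK."""
    errors = []
    if not config.get("name"):
        errors.append("missing 'name'")
    for i, entity in enumerate(config.get("entity_types", [])):
        if "name" not in entity:
            errors.append(f"entity_types[{i}] missing 'name'")
    if not config.get("relationship_types"):
        errors.append("missing 'relationship_types'")
    return errors

# Dict equivalent of the my_domain.yaml example above.
biomedical = {
    "name": "biomedical",
    "entity_types": [
        {"name": "DRUG", "description": "Pharmaceutical compounds or treatments",
         "examples": ["metformin", "aspirin"]},
        {"name": "DISEASE", "description": "Medical conditions",
         "examples": ["diabetes", "cancer"]},
    ],
    "relationship_types": ["TREATS", "CAUSES", "ASSOCIATED_WITH"],
}
```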

Retrieval modes

| Mode | What it does | Best for |
|---|---|---|
| hybrid (default) | Vector search + graph expansion + community context, re-ranked | General summarization |
| local | ChromaDB vector search only | Specific factual queries |
| graph | Graph traversal + vector lookup | Relational queries |
| global | Community-level summaries + vector search | Thematic synthesis across many papers |

summary = rs.summarize("...", mode="hybrid")  # or "local", "graph", "global"
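The core idea behind hybrid mode (vector hits re-ranked using graph context) can be shown with a toy re-ranker. The scores, weight, and `hybrid_rerank` helper are all illustrative, not res-sum's actual ranking function:

```python
def hybrid_rerank(vector_hits, graph_neighbors, graph_weight=0.3):
    """Re-rank (chunk_id, similarity, entities) hits, boosting chunks whose
    entities overlap the query's knowledge-graph neighborhood."""
    ranked = []
    for chunk_id, sim, entities in vector_hits:
        overlap = len(set(entities) & graph_neighbors)
        ranked.append((chunk_id, sim + graph_weight * overlap))
    return sorted(ranked, key=lambda t: t[1], reverse=True)

# Toy data: c3 wins on pure similarity, but c2 mentions two entities
# from the query's graph neighborhood and overtakes it after the boost.
hits = [("c1", 0.80, {"Apis mellifera"}),
        ("c2", 0.75, {"Apis mellifera", "neonicotinoids"}),
        ("c3", 0.90, set())]
neighbors = {"Apis mellifera", "neonicotinoids", "colony collapse"}
```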

Explore your knowledge base

After ingesting papers, open an interactive dashboard to visualize and inspect everything:

rs.explore()  # opens in your browser

Or from the command line:

res-sum explore --data_dir ./knowledge_base

The dashboard has four tabs:

  • Overview — papers ingested, chunk counts, entity type breakdown, graph stats
  • Knowledge Graph — interactive graph visualization. Nodes colored by entity type, sized by connections. Click to see relationships, filter by type, search by name.
  • Vector Store — browse all text chunks by paper. See which section each chunk came from, expand to read full text.
  • Communities — entity clusters detected by the Leiden algorithm, with LLM-generated summaries explaining what connects each group.

It's a single HTML file — works offline, shareable with collaborators.

Programmatic access

# Query an entity
rs.query_graph("Canis lupus")

# Most connected entities
rs.get_central_entities(top_k=10)

# Community structure
rs.get_communities()

# Access the NetworkX graph directly
graph = rs.knowledge_graph.graph

The graph is saved as GraphML and can be imported into Neo4j or any graph visualization tool.
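Since the exported GraphML is just nodes and edges, the "most connected entities" idea behind get_central_entities reduces to a degree count. A stdlib sketch over a toy edge list (the data and `central_entities` helper are illustrative, not res-sum internals):

```python
from collections import Counter

def central_entities(edges, top_k=3):
    """Rank entities by degree (number of incident edges)."""
    degree = Counter()
    for src, _relation, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return degree.most_common(top_k)

# Toy (subject, relation, object) triples of the kind the LLM extracts.
edges = [
    ("Canis lupus", "PREYS_ON", "Cervus elaphus"),
    ("Canis lupus", "PREYS_ON", "Alces alces"),
    ("Canis lupus", "OCCURS_IN", "Yellowstone"),
    ("Cervus elaphus", "OCCURS_IN", "Yellowstone"),
]
```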

How it works

PDF files
  → Text extraction (pymupdf4llm — handles multi-column, tables)
  → Section detection (ecology-aware regex + Markdown headers)
  → Chunking (RecursiveCharacterTextSplitter)
  → ChromaDB (embed + store chunks)
  → LLM entity/relationship extraction → NetworkX knowledge graph
  → Community detection (Leiden/Louvain)
  → Hybrid retrieval (vector + graph + community)
  → LLM summarization (Chain-of-Thought prompting)
  → Output (DOCX / JSON / CSV)

All data persists to a data_dir/ folder. Adding new papers only processes what's new.
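Incremental ingestion typically means skipping files whose content was seen before, e.g. by hashing each PDF's bytes. A stdlib sketch of that pattern; the `new_files` helper and set-based manifest are assumptions about the approach, not res-sum's actual mechanism:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's bytes, identifying a paper regardless of filename."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def new_files(pdf_dir: str, seen: set) -> list:
    """Return PDFs whose content hash is not in the manifest yet."""
    fresh = []
    for path in sorted(Path(pdf_dir).glob("*.pdf")):
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)     # record it so reruns skip this paper
            fresh.append(path)
    return fresh
```

A second run over the same folder then returns nothing, which is what makes re-ingesting a growing folder cheap.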

Requirements

  • Python >= 3.9
  • Ollama installed locally (for default provider), or an API key for another provider

Contributing

Issues and pull requests are welcome on GitHub.

License

MIT
