
🦀 CrabScholar

Research paper analysis pipeline with citation crawling, pluggable LLM prompts, and knowledge graph building.

Features

  • Multi-input: Analyze papers by title, DOI, keywords, URL, local PDF, or raw text
  • Citation Crawling: BFS traversal of references/citations via the Semantic Scholar API (configurable depth, default 3)
  • 5 Default Analysis Dimensions (LLM Evaluation focus):
    1. Paper Analysis — overview, contributions, methodology
    2. Dataset Crafting — data creation, annotation, preprocessing
    3. Evaluation Method — benchmarks, baselines, evaluation setup
    4. Metrics — specific metrics, reported results
    5. Statistical Tests — significance tests, confidence intervals, rigor
  • Pluggable Prompts: Add YAML files for custom dimensions, override defaults
  • Knowledge Graph: NetworkX-based graph with paper/author/method/dataset/metric entities
  • Multi-Provider LLM: OpenAI, Anthropic, Ollama, vLLM, and others via LiteLLM, with a fallback chain
  • Export: JSON, GraphML, GEXF, CSV
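
The entity and relation types listed above can be modeled as a typed property graph. A minimal sketch of that model (the names `Entity`, `Relation`, and `KnowledgeGraph` are hypothetical here; the actual crab_scholar schema may differ, and the real graph is NetworkX-based):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    """A node: paper, author, method, dataset, or metric."""
    id: str
    kind: str    # "paper" | "author" | "method" | "dataset" | "metric"
    label: str

@dataclass(frozen=True)
class Relation:
    """A directed edge between two entities."""
    source: str
    target: str
    kind: str    # e.g. "cites", "authored_by", "uses_dataset"

@dataclass
class KnowledgeGraph:
    entities: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_entity(self, e: Entity) -> None:
        self.entities[e.id] = e

    def add_relation(self, r: Relation) -> None:
        self.relations.append(r)

    @property
    def entity_count(self) -> int:
        return len(self.entities)

    @property
    def relation_count(self) -> int:
        return len(self.relations)

kg = KnowledgeGraph()
kg.add_entity(Entity("p1", "paper", "Attention Is All You Need"))
kg.add_entity(Entity("a1", "author", "Ashish Vaswani"))
kg.add_relation(Relation("p1", "a1", "authored_by"))
```

A graph in this shape maps directly onto NetworkX nodes and edges, which is what makes the GraphML/GEXF exports straightforward.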

Installation

# From PyPI
pip install crab-scholar

# Or, in a source checkout
uv sync

Quick Start

# Initialize project config
uv run crab init

# Edit .env with your API key
nano .env

# Analyze a paper by title
uv run crab analyze "attention is all you need"

# Search by keywords
uv run crab analyze --keywords "LLM evaluation, benchmark contamination"

# Analyze a local PDF
uv run crab analyze --pdf paper.pdf

# Control crawl depth
uv run crab analyze "GPT-4 Technical Report" --depth 5

# Search without analyzing
uv run crab search "transformer evaluation"

# Build knowledge graph from results
uv run crab build

# Export graph
uv run crab export json
uv run crab export graphml
uv run crab export csv

# List analysis dimensions
uv run crab dimensions

# Show config
uv run crab info

Configuration

Settings are resolved with the following precedence (highest first): CLI flags > environment variables (CRAB_ prefix) > .env > crab.yaml > built-in defaults.

# crab.yaml
default_model: openai/gpt-4o-mini
fallback_models:
  - openai/gpt-3.5-turbo
  - anthropic/claude-3-haiku-20240307

citation_depth: 3
max_papers: 50
output: output
concurrency: 4
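
The precedence chain above can be illustrated with `collections.ChainMap`, where the first layer containing a key wins (the values here are illustrative, not defaults from the tool):

```python
from collections import ChainMap

# Highest precedence first: CLI flags > env vars > .env > crab.yaml > defaults
cli_flags = {"citation_depth": 5}
env_vars = {"citation_depth": 2, "default_model": "openai/gpt-4o"}
dotenv = {"max_papers": 100}
crab_yaml = {"default_model": "openai/gpt-4o-mini", "citation_depth": 3, "max_papers": 50}
defaults = {"default_model": "openai/gpt-4o-mini", "citation_depth": 3,
            "max_papers": 50, "concurrency": 4}

settings = ChainMap(cli_flags, env_vars, dotenv, crab_yaml, defaults)

print(settings["citation_depth"])  # 5 -- the CLI flag wins
print(settings["default_model"])   # openai/gpt-4o -- env var beats crab.yaml
print(settings["concurrency"])     # 4 -- falls through to defaults
```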

Custom Prompts

Create YAML files in a custom directory:

# my_prompts/bias_analysis.yaml
name: bias_analysis
display_name: "Bias Analysis"
description: "Analyze papers for bias in LLM evaluation"
system_message: "You are a bias analysis expert..."
extraction_prompt: |
  Analyze the paper for potential biases...
  Paper: {title}
  Text: {paper_text}
  ...

Then use: uv run crab analyze "paper" --prompts-dir my_prompts/
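
A prompt file like the one above presumably boils down to a template rendered once per paper. A sketch of that rendering step with `str.format` (the placeholder names `{title}` and `{paper_text}` come from the example above; the `render` helper and the dict-literal form of the parsed YAML are illustrative, so the sketch needs no YAML dependency):

```python
# Parsed contents of my_prompts/bias_analysis.yaml, shown as a plain dict
prompt = {
    "name": "bias_analysis",
    "system_message": "You are a bias analysis expert...",
    "extraction_prompt": (
        "Analyze the paper for potential biases...\n"
        "Paper: {title}\n"
        "Text: {paper_text}\n"
    ),
}

def render(prompt: dict, title: str, paper_text: str) -> str:
    """Fill the {title} / {paper_text} placeholders for one paper."""
    return prompt["extraction_prompt"].format(title=title, paper_text=paper_text)

message = render(prompt, "Attention Is All You Need", "We propose the Transformer...")
print(message)
```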

Python API

from crab_scholar.pipeline import run_pipeline
from crab_scholar.config import CrabConfig

config = CrabConfig(
    default_model="openai/gpt-4o-mini",
    citation_depth=3,
)

kg = run_pipeline(input_query="attention is all you need", config=config)
print(f"Entities: {kg.entity_count}, Relations: {kg.relation_count}")

Architecture

Input (query/DOI/PDF/text)
    ↓
Scholar API → Resolve paper
    ↓
BFS Crawler → Expand citations/references (depth=N)
    ↓
Fetcher → Download PDFs, extract text
    ↓
Analyzer → Run pluggable dimensions (5 defaults)
    ↓
Graph Builder → Entities + Relations → NetworkX
    ↓
Export → JSON / GraphML / GEXF / CSV
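
The BFS crawler step can be sketched as a depth-limited breadth-first traversal over a citation lookup. Here a stubbed in-memory graph stands in for the Semantic Scholar API, and the `crawl` function is a sketch, not the pipeline's actual crawler:

```python
from collections import deque

# Stubbed citation lookup; the real pipeline queries the Semantic Scholar API.
CITATIONS = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["e"],
    "e": [],
}

def crawl(seed: str, depth: int) -> set[str]:
    """Depth-limited BFS over references, deduplicating visited papers."""
    visited = {seed}
    queue = deque([(seed, 0)])
    while queue:
        paper, d = queue.popleft()
        if d == depth:
            continue  # don't expand beyond the configured depth
        for ref in CITATIONS.get(paper, []):
            if ref not in visited:
                visited.add(ref)
                queue.append((ref, d + 1))
    return visited

print(sorted(crawl("root", depth=2)))  # ['a', 'b', 'c', 'd', 'root']
```

Deduplication matters in real citation graphs: many papers cite the same landmark works, so without the `visited` set the frontier would grow far faster than the `max_papers` budget allows.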

License

MIT
