
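A citation portrait analysis tool for papers: automatically crawls citing literature, identifies renowned scholars, and generates a visual HTML report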

CitationClaw v2: Turning Every Citation into Explainable Impact

A citation portrait engine for discovering, explaining, and sharing scientific impact.

Input a paper title or a Google Scholar profile, then generate a shareable citation portrait report.
CitationClaw v2 crawls citing papers, collects author and institution metadata, identifies renowned scholars, extracts in-paper citation contexts, and produces a self-contained HTML report for research summaries, grant materials, and academic presentations.


📢 News

  • 2026-05-22: Released v2.0.0 documentation — structured metadata collection, Skills Runtime orchestration, separated search/lightweight model roles, PDF-grounded citation contexts, Basic / Advanced / Full service tiers, and shareable HTML reports.
  • 2026-03-18: Released beta v1.0.9 — multi-paper dashboard dedup fix, year-traverse session behavior update, default parallel workers raised to 10, cache write throttling, and UI/logging polish.
  • 2026-03-12: Released v1.0 — first public release.

🚀 What v2 Changes

CitationClaw v2 is an architectural upgrade over v1, not just a UI refresh.

  • 🧠 Structured metadata first: OpenAlex, Semantic Scholar, arXiv, Web of Science Starter API, and PDF-based fallbacks reduce the instability of fully LLM-driven author lookup.
  • 🧩 Skills Runtime + TaskExecutor orchestration: v2 registers replaceable phase skills under SkillsRuntime, while TaskExecutor coordinates the richer structured-metadata, PDF-validation, self-citation, and scholar-assessment path.
  • 🔍 Separated model roles: search-capable LLMs handle scholar verification; cheaper lightweight models can handle report generation and citation-context extraction, with preflight checks in the UI.
  • 📄 PDF-grounded citation context: v2 downloads, caches, parses, and reviews citing PDFs to recover actual citation sentences where possible, while recording PDF sources and failure reasons.
  • 📊 Shareable HTML report: the final dashboard is a single browser-readable file with charts, knowledge graph, citation descriptions, cost summary, and a report assistant entry point.

🔄 v1 vs v2

| Area | v1 | v2.0.0 |
|---|---|---|
| Execution model | Script-oriented flow with tighter coupling | FastAPI + WebSocket + TaskExecutor + Skills Runtime |
| Author data | Heavier dependence on LLM web search | Structured APIs first, LLM search as supplement and assessor |
| Scholar detection | Direct search and summarization | Rule pre-filtering, cached lookup, search verification |
| Citation context | Result-level summaries | PDF download/parse/review pipeline for in-text citation sentences |
| Service tiers | Multiple experimental modes | Three-tier config: Basic / Advanced / Full for balancing depth, runtime, and cost |
| Report | HTML dashboard and spreadsheets | Self-contained citation portrait with graph, citation-context analysis, cost summary, and assistant |
| Cost control | Mostly manual estimation | Cache reuse, Basic Phase 4 disable switch, quota check, year-traverse prompt, and report rebuild from cache |
| Maintainability | Good for fast iteration | Better phase contracts, isolated skills, and testable module boundaries |

📈 v1 vs v2 Benchmark

We evaluate CitationClaw on five dimensions using a curated benchmark of citing papers with human-annotated ground truth. The overall score is a weighted sum (Author 25%, Scholar 15%, PDF 15%, Citation 35%, Data Source 10%).

| Dimension | v1 | v2 | Improvement | What it measures |
|---|---|---|---|---|
| Author | 75.51 | 87.02 | +11.51 | Author name matching (F1), affiliation accuracy, and completeness |
| Scholar | 73.55 | 78.57 | +5.02 | Whether renowned scholars (Academicians, Fellows, etc.) in ground truth are identified |
| PDF | 0 | 74.43 | +74.43 | Ratio of citing papers with successfully downloaded and parsed PDFs |
| Citation | 13.81 | 46.26 | +32.45 | Quality of extracted in-text citation sentences vs. ground truth (LLM-judged semantic similarity) |
| Data Source | 75.92 | 80.82 | +4.90 | Completeness and correctness of metadata sources (affiliation coverage, wrong-paper ratio) |
| Overall | 42.33 | 68.98 | +26.65 | Weighted aggregate across all five dimensions |
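The weighted sum behind the Overall row can be checked with a short calculation. The sketch below uses only the weights and per-dimension scores stated above; the function and variable names are illustrative and are not taken from the CitationClaw code base.

```python
# Weights as stated in the benchmark description (they sum to 1.0).
WEIGHTS = {
    "Author": 0.25,
    "Scholar": 0.15,
    "PDF": 0.15,
    "Citation": 0.35,
    "Data Source": 0.10,
}

# Per-dimension scores copied from the benchmark table.
V1 = {"Author": 75.51, "Scholar": 73.55, "PDF": 0.0,
      "Citation": 13.81, "Data Source": 75.92}
V2 = {"Author": 87.02, "Scholar": 78.57, "PDF": 74.43,
      "Citation": 46.26, "Data Source": 80.82}


def overall(scores: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    for label, scores in (("v1", V1), ("v2", V2)):
        print(f"{label}: {overall(scores):.4f}")
```

Recomputing from the rounded table values agrees with the reported Overall scores to within 0.01; a small residual is expected if the published figures were aggregated from unrounded per-dimension scores.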

Key takeaways:

  • Citation context (+32.45) is the largest improvement among the dimensions v1 already covered, and it carries the highest weight (35%): v1 produced paraphrased summaries that scored poorly against ground-truth citation sentences; v2 extracts actual in-text sentences from parsed PDFs.
  • Author (+11.51) benefits from structured API sources (OpenAlex / S2 / WOS) replacing unstable LLM-only extraction, with PDF-based fallback for missing affiliations.
  • Scholar (+5.02) gains come from rule pre-filtering + cached lookup reducing missed identifications.
  • PDF (+74.43) — v1 had no PDF download/parse pipeline; v2 introduces a 12-tier download cascade with ScraperAPI publisher channel and LLM search fallback.

🧭 Quick Links

| Resource | Description |
|---|---|
| 📘 Guidelines | Installation, quick start, configuration, outputs, FAQ, and operation notes |
| 📊 Report Demo 1 | Online preview of a generated citation portrait |
| 📊 Report Demo 2 | Another generated report example |

📦 Install

Requires Python 3.10+. Python 3.12 is recommended.

PyPI

pip install citationclaw
citationclaw                # start the app on the default port (8000)
citationclaw --port 8080    # or start it on a custom port

The app opens at http://127.0.0.1:8000, or the port you specify.

Source

git clone https://github.com/VisionXLab/CitationClaw.git
cd CitationClaw
pip install -r requirements.txt
python -m citationclaw

🧩 Five-Phase Pipeline

Phase 0  Citation entry discovery
Phase 1  Citing-paper retrieval: Google Scholar + ScraperAPI
Phase 2  Author and metadata collection: OpenAlex / S2 / arXiv / WOS / PDF
Phase 3  Scholar impact assessment: pre-filter + Search LLM + cache
Phase 4  Citation-context extraction: PDF parse + lightweight LLM + review
Phase 5  Report generation: Excel / JSON / HTML dashboard

SkillsRuntime registers phase skills such as phase1_citation_fetch, phase2_metadata, phase3_scholar_assess, phase4_citation_extract, and phase5_report_generate. The current full-run path still uses TaskExecutor._run_new_phase2_and_3() for the combined structured metadata, PDF validation, self-citation, and scholar-assessment block.
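To make the idea of replaceable phase skills concrete, here is a minimal registry sketch. It is an assumption-laden illustration: `MiniRuntime`, `make_stub`, and the dictionary-based run state are hypothetical and do not reflect the actual `SkillsRuntime` or `TaskExecutor` interfaces. Only the five skill names come from the text above.

```python
from typing import Callable

# A phase skill takes the shared run state and returns the updated state.
Skill = Callable[[dict], dict]


class MiniRuntime:
    """Hypothetical stand-in for a skills runtime; not CitationClaw's class."""

    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def register(self, name: str, skill: Skill) -> None:
        # Registering an existing name replaces the earlier skill.
        self._skills[name] = skill

    def run(self, order: list[str], state: dict) -> dict:
        # Execute the named skills one after another on the shared state.
        for name in order:
            state = self._skills[name](state)
        return state


def make_stub(name: str) -> Skill:
    """Build a placeholder skill that only records that it ran."""
    def skill(state: dict) -> dict:
        return {**state, "done": state.get("done", []) + [name]}
    return skill


# Skill names as listed in the README.
PHASES = [
    "phase1_citation_fetch",
    "phase2_metadata",
    "phase3_scholar_assess",
    "phase4_citation_extract",
    "phase5_report_generate",
]

runtime = MiniRuntime()
for phase in PHASES:
    runtime.register(phase, make_stub(phase))

result = runtime.run(PHASES, {"paper": "example title"})
```

Keying skills by name is what makes a phase swappable: registering a different callable under `phase4_citation_extract` changes that phase without touching the others.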

⚙️ Service Tiers

| Tier | Best for | Behavior |
|---|---|---|
| Basic | First runs, cost-sensitive checks, scholar-only impact scans | Retrieves citing papers, collects author metadata, assesses renowned scholars, skips citation-context extraction |
| Advanced | Understanding how important citing papers discuss a work | Enables citation-context extraction for a deeper portrait of important citing work |
| Full | Grant writing, evaluation, presentations, and complete citation portraits | Runs citation-context extraction for all citing papers; highest cost and longest runtime |
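The tier behavior above amounts to a gate on Phase 4. The sketch below is one way to express that gate, assuming that Advanced restricts extraction to citing papers flagged as important. The `Tier` enum, the `is_important` flag, and the function name are hypothetical; the Guidelines describe the actual selection rules.

```python
from enum import Enum


class Tier(Enum):
    BASIC = "basic"
    ADVANCED = "advanced"
    FULL = "full"


def runs_context_extraction(tier: Tier, is_important: bool) -> bool:
    """Decide whether Phase 4 runs for one citing paper.

    `is_important` is a hypothetical flag; how CitationClaw selects
    important citing papers is not specified in this README.
    """
    if tier is Tier.BASIC:
        return False          # Basic skips citation-context extraction
    if tier is Tier.ADVANCED:
        return is_important   # assumed: Advanced covers important papers only
    return True               # Full covers every citing paper
```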

For papers with more than 1000 citations, enable year traversal. It splits Google Scholar queries by year to work around the 1000-result display limit. See the Guidelines for detailed tier behavior and current implementation notes.
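Year traversal can be pictured as issuing one citing-papers query per publication year. The sketch below builds such URLs using the `cites`, `as_ylo`, and `as_yhi` query parameters commonly seen in Google Scholar URLs. This is an assumption about the query shape, not a description of how CitationClaw constructs or sends its requests, and `yearly_query_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

SCHOLAR_URL = "https://scholar.google.com/scholar"


def yearly_query_urls(cites_id: str, first_year: int, last_year: int) -> list[str]:
    """Build one citing-papers query URL per publication year."""
    if first_year > last_year:
        raise ValueError("first_year must not exceed last_year")
    urls = []
    for year in range(first_year, last_year + 1):
        params = {
            "cites": cites_id,  # ID of the cited paper (assumed parameter)
            "as_ylo": year,     # lower bound of the year filter
            "as_yhi": year,     # upper bound of the year filter
        }
        urls.append(f"{SCHOLAR_URL}?{urlencode(params)}")
    return urls
```

Each per-year query is still subject to the same display limit, so a single year with more than 1000 citing papers would remain truncated under this scheme.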

📤 Outputs and Sharing

Each run creates a timestamped folder under data/result-{timestamp}/, usually including:

  • paper_results.xlsx
  • paper_results_all_renowned_scholar.xlsx
  • paper_results_top-tier_scholar.xlsx
  • paper_results_with_citing_desc.xlsx
  • paper_results.json
  • paper_dashboard.html

paper_dashboard.html is the main shareable artifact. It is a self-contained browser-readable report that can be sent to advisors, collaborators, or evaluators, and can be reused in grant applications, annual reviews, and presentation preparation. Download buttons and AI assistant features work best when the report is opened from the local CitationClaw app; the static charts and report content remain readable when shared offline.

If author and citation-description caches already exist, the app can rebuild a report from cache without repeating the full crawl and extraction workflow.

🔧 Configuration Highlights

  • ScraperAPI Keys: required for Google Scholar crawling; multiple keys improve stability.
  • Search LLM: required for scholar assessment and verification; must support web search.
  • Lightweight model: optional independent model endpoint for report generation and citation-context extraction.
  • Semantic Scholar API Key: optional but improves metadata and PDF discovery.
  • Web of Science Starter API Key: optional higher-priority structured author extraction.
  • MinerU API Token: optional parser for larger or more complex PDFs.
  • CDP debug port: optional Chrome/Edge session for authenticated IEEE, Elsevier, and ACM downloads.
  • Quota tracking: optional API relay token/user ID pair for estimating LLM quota consumption after each run.
  • Model preflight: the UI can test Search LLM and lightweight model connectivity before a full run.
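The required-versus-optional split above lends itself to a simple check before a run. The sketch below is purely illustrative: `RunConfig`, its field names, and `config_problems` are hypothetical and do not mirror CitationClaw's real configuration schema or its UI preflight, which also tests model connectivity.

```python
from dataclasses import dataclass, field


@dataclass
class RunConfig:
    """Hypothetical settings container; field names are illustrative only."""
    scraperapi_keys: list[str] = field(default_factory=list)  # required
    search_llm_model: str = ""                                # required
    lightweight_model: str = ""                               # optional
    semantic_scholar_key: str = ""                            # optional


def config_problems(cfg: RunConfig) -> list[str]:
    """Return one message per required setting that is missing."""
    problems = []
    if not cfg.scraperapi_keys:
        problems.append("at least one ScraperAPI key is required")
    if not cfg.search_llm_model:
        problems.append("a search-capable LLM is required")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a caller show every missing setting at once.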

📁 Project Structure

citationclaw/
├── app/                 # FastAPI app, task orchestration, config, logs
├── core/                # scraping, metadata, PDF, export, dashboard engines
├── skills/              # Skills Runtime and phase skills
├── static/              # frontend assets
└── templates/           # Jinja2 pages
docs/                    # documentation site and demos
test/                    # tests

🌍 Community

  • Product update: 减论 reduct.cn

⚠️ Disclaimer

CitationClaw is intended for academic research and personal study. Follow the terms of Google Scholar, ScraperAPI, OpenAlex, Semantic Scholar, arXiv, Web of Science, MinerU, and your selected LLM provider. Author identities, citation contexts, and impact analysis should be treated as assistive outputs and manually verified before formal use. The authors are not responsible for consequences arising from use of this tool.

👥 Developer Team

Shanghai Jiao Tong University — VisionXLab@RethinkLab
Qihao Yang (杨起豪), Chunhao Zhang (张春浩), Ziyang Gong (龚子洋), Ziqian Fan (樊子谦), Yifan Zhou (周奕帆), Yue Zhou (周越), Zhihang Zhong (钟志航), Xue Yang (杨学)⭐ Project Leader

East China Normal University & Shanghai AI Lab — DMCV
Yifan Cheng (程依凡), Zhanghan Xu (许张涵), Jiawei Lu (陆佳炜), Xin Tan (谭鑫)

Southeast University — PALM Lab
Caorui Li (李操瑞), Tianyi Zhou (周天一), Xu Yang (杨旭)

Acknowledgment: Special thanks to Keyu Chen (陈柯宇) for generously sponsoring the compute and API resources that made this project possible.

📚 Citation

If CitationClaw helps your research, reporting, or evaluation workflow, please cite the project:

@software{citationclaw2026,
  title        = {CitationClaw: Turning Every Citation into Explainable Impact},
  author       = {Yang, Qihao and Zhang, Chunhao and Cheng, Yifan and Gong, Ziyang and Li, Caorui and Xu, Zhanghan and Zhou, Tianyi and Lu, Jiawei and Fan, Ziqian and Zhou, Yifan and Zhou, Yue and Zhong, Zhihang and Yang, Xu and Tan, Xin and Yang, Xue},
  year         = {2026},
  version      = {2.0.0},
  url          = {https://github.com/VisionXLab/CitationClaw},
  institution  = {Shanghai Jiao Tong University, East China Normal University, Southeast University}
}
