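Citation portrait analysis tool for papers: automatically crawls citing papers, identifies renowned scholars, and generates a visual HTML report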

CitationClaw v2: Turning Every Citation into Explainable Impact

A citation portrait engine for discovering, explaining, and sharing scientific impact.

Input a paper title or a Google Scholar profile, then generate a shareable citation portrait report.
CitationClaw v2 crawls citing papers, collects author and institution metadata, identifies renowned scholars, extracts in-paper citation contexts, and produces a self-contained HTML report for research summaries, grant materials, and academic presentations.

📢 News

2026-05-22: Released v2.0.0 documentation — structured metadata collection, Skills Runtime orchestration, separated search/lightweight model roles, PDF-grounded citation contexts, Basic / Advanced / Full service tiers, and shareable HTML reports.
2026-03-18: Released beta v1.0.9 — multi-paper dashboard dedup fix, year-traverse session behavior update, default parallel workers raised to 10, cache write throttling, and UI/logging polish.
2026-03-12: Released v1.0 — first public release.

🚀 What v2 Changes

CitationClaw v2 is an architectural upgrade over v1, not just a UI refresh.

🧠 Structured metadata first: OpenAlex, Semantic Scholar, arXiv, Web of Science Starter API, and PDF-based fallbacks reduce the instability of fully LLM-driven author lookup.
🧩 Skills Runtime + TaskExecutor orchestration: v2 registers replaceable phase skills under SkillsRuntime, while TaskExecutor coordinates the richer structured-metadata, PDF-validation, self-citation, and scholar-assessment path.
🔍 Separated model roles: search-capable LLMs handle scholar verification; cheaper lightweight models can handle report generation and citation-context extraction, with preflight checks in the UI.
📄 PDF-grounded citation context: v2 downloads, caches, parses, and reviews citing PDFs to recover actual citation sentences where possible, while recording PDF sources and failure reasons.
📊 Shareable HTML report: the final dashboard is a single browser-readable file with charts, knowledge graph, citation descriptions, cost summary, and a report assistant entry point.

🔄 v1 vs v2

Area	v1	v2.0.0
Execution model	Script-oriented flow with tighter coupling	FastAPI + WebSocket + TaskExecutor + Skills Runtime
Author data	Heavier dependence on LLM web search	Structured APIs first, LLM search as supplement and assessor
Scholar detection	Direct search and summarization	Rule pre-filtering, cached lookup, search verification
Citation context	Result-level summaries	PDF download/parse/review pipeline for in-text citation sentences
Service tiers	Multiple experimental modes	Three-tier config: Basic / Advanced / Full for balancing depth, runtime, and cost
Report	HTML dashboard and spreadsheets	Self-contained citation portrait with graph, citation-context analysis, cost summary, and assistant
Cost control	Mostly manual estimation	Cache reuse, Basic Phase 4 disable switch, quota check, year-traverse prompt, and report rebuild from cache
Maintainability	Good for fast iteration	Better phase contracts, isolated skills, and testable module boundaries

📈 v1 vs v2 Benchmark

We evaluate CitationClaw on five dimensions using a curated benchmark of citing papers with human-annotated ground truth. The overall score is a weighted sum (Author 25%, Scholar 15%, PDF 15%, Citation 35%, Data Source 10%).

Dimension	v1	v2	Improvement	What it measures
Author	75.51	87.02	+11.51	Author name matching (F1), affiliation accuracy, and completeness
Scholar	73.55	78.57	+5.02	Whether renowned scholars (Academicians, Fellows, etc.) in ground truth are identified
PDF	0	74.43	+74.43	Ratio of citing papers with successfully downloaded and parsed PDFs
Citation	13.81	46.26	+32.45	Quality of extracted in-text citation sentences vs. ground truth (LLM-judged semantic similarity)
Data Source	75.92	80.82	+4.90	Completeness and correctness of metadata sources (affiliation coverage, wrong-paper ratio)
Overall	42.33	68.98	+26.65	Weighted aggregate across all five dimensions
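
The weighted aggregate can be checked directly from the table. The sketch below copies the weights and per-dimension scores from the text above and recomputes the overall score; the names `WEIGHTS`, `V1`, `V2`, and `overall` are illustrative and not part of CitationClaw.

```python
# Weights and per-dimension scores copied from the benchmark table above.
WEIGHTS = {"Author": 0.25, "Scholar": 0.15, "PDF": 0.15, "Citation": 0.35, "Data Source": 0.10}

V1 = {"Author": 75.51, "Scholar": 73.55, "PDF": 0.0, "Citation": 13.81, "Data Source": 75.92}
V2 = {"Author": 87.02, "Scholar": 78.57, "PDF": 74.43, "Citation": 46.26, "Data Source": 80.82}


def overall(scores: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    print(f"v1 overall: {overall(V1):.4f}")
    print(f"v2 overall: {overall(V2):.4f}")
```

Both results land within 0.01 of the reported 42.33 and 68.98; the small remainder is presumably due to the per-dimension scores being rounded to two decimals in the table.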

Key takeaways:

Citation context (+32.45) is the largest gain among dimensions v1 already supported: v1 produced paraphrased summaries that scored poorly against ground-truth citation sentences, whereas v2 extracts actual in-text sentences from parsed PDFs.
Author (+11.51) benefits from structured API sources (OpenAlex / Semantic Scholar / WOS) replacing unstable LLM-only extraction, with PDF-based fallback for missing affiliations.
Scholar (+5.02) gains come from rule pre-filtering and cached lookup, which reduce missed identifications.
PDF (+74.43) is the largest absolute change, but it starts from zero: v1 had no PDF download/parse pipeline, while v2 introduces a 12-tier download cascade with a ScraperAPI publisher channel and LLM search fallback.

🧭 Quick Links

Resource	Description
📘 Guidelines	Installation, quick start, configuration, outputs, FAQ, and operation notes
📊 Report Demo 1	Online preview of a generated citation portrait
📊 Report Demo 2	Another generated report example

📦 Install

Requires Python 3.10+. Python 3.12 is recommended.

PyPI

pip install citationclaw
citationclaw
citationclaw --port 8080

The app opens at http://127.0.0.1:8000, or the port you specify.

Source

git clone https://github.com/VisionXLab/CitationClaw.git
cd CitationClaw
pip install -r requirements.txt
python -m citationclaw

🧩 Five-Phase Pipeline

Phase 0  Citation entry discovery
Phase 1  Citing-paper retrieval: Google Scholar + ScraperAPI
Phase 2  Author and metadata collection: OpenAlex / S2 / arXiv / WOS / PDF
Phase 3  Scholar impact assessment: pre-filter + Search LLM + cache
Phase 4  Citation-context extraction: PDF parse + lightweight LLM + review
Phase 5  Report generation: Excel / JSON / HTML dashboard

SkillsRuntime registers phase skills such as phase1_citation_fetch, phase2_metadata, phase3_scholar_assess, phase4_citation_extract, and phase5_report_generate. The current full-run path still uses TaskExecutor._run_new_phase2_and_3() for the combined structured metadata, PDF validation, self-citation, and scholar-assessment block.
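
To illustrate what "replaceable phase skills" can look like, here is a minimal registry sketch. It is not the actual SkillsRuntime API: `SkillRegistry`, `register`, `run`, and the placeholder skill bodies are hypothetical, and only the two phase names are taken from the text above.

```python
from typing import Callable

# Hypothetical stand-in for a skills registry; not the actual SkillsRuntime API.
Skill = Callable[[dict], dict]


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def register(self, name: str) -> Callable[[Skill], Skill]:
        """Decorator that stores a skill under a phase name."""
        def decorator(func: Skill) -> Skill:
            self._skills[name] = func  # re-registering a name replaces the skill
            return func
        return decorator

    def run(self, order: list[str], context: dict) -> dict:
        """Run the named skills in order, passing the context along."""
        for name in order:
            context = self._skills[name](context)
        return context


registry = SkillRegistry()


@registry.register("phase1_citation_fetch")
def fetch(context: dict) -> dict:
    # Placeholder: a real skill would retrieve citing papers here.
    return {**context, "citing_papers": ["paper A", "paper B"]}


@registry.register("phase5_report_generate")
def report(context: dict) -> dict:
    # Placeholder: a real skill would render the HTML dashboard here.
    return {**context, "report": f"{len(context['citing_papers'])} citing papers"}


result = registry.run(["phase1_citation_fetch", "phase5_report_generate"], {"title": "Example"})
print(result["report"])
```

Because each skill is looked up by name at run time, registering a different function under the same name swaps the implementation of that phase without touching the others.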

⚙️ Service Tiers

Tier	Best for	Behavior
Basic	First runs, cost-sensitive checks, scholar-only impact scans	Retrieves citing papers, collects author metadata, assesses renowned scholars, skips citation-context extraction
Advanced	Understanding how important citing papers discuss a work	Enables citation-context extraction for a deeper portrait of important citing work
Full	Grant writing, evaluation, presentations, and complete citation portraits	Runs citation-context extraction for all citing papers; highest cost and longest runtime
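
One way to read the table is as a per-paper decision about whether citation-context extraction (Phase 4) runs. The sketch below encodes that reading; `Tier`, `should_extract_context`, and the `is_important` flag are hypothetical, and the table does not say how "important" citing papers are selected.

```python
from enum import Enum


class Tier(Enum):
    BASIC = "basic"
    ADVANCED = "advanced"
    FULL = "full"


def should_extract_context(tier: Tier, is_important: bool) -> bool:
    """Decide whether to run citation-context extraction for one citing paper.

    `is_important` is a hypothetical flag; the selection rule for important
    citing papers is not described in the table.
    """
    if tier is Tier.BASIC:
        return False          # Basic skips citation-context extraction
    if tier is Tier.ADVANCED:
        return is_important   # Advanced focuses on important citing work
    return True               # Full covers all citing papers


for tier in Tier:
    print(tier.value, should_extract_context(tier, is_important=False))
```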

For papers with more than 1000 citations, enable year traversal. It splits Google Scholar queries by year to work around the 1000-result display limit. See the Guidelines for detailed tier behavior and current implementation notes.
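
The idea behind year traversal can be sketched as one query per publication year. The example assumes the `cites`, `as_ylo`, and `as_yhi` query parameters that Google Scholar's web interface uses for cited-by searches and year ranges; the function name and the placeholder ID are hypothetical, and CitationClaw may build its requests differently.

```python
from urllib.parse import urlencode

SCHOLAR_URL = "https://scholar.google.com/scholar"


def yearly_queries(cites_id: str, first_year: int, last_year: int) -> list[str]:
    """Build one citing-papers URL per year, each limited to that single year."""
    urls = []
    for year in range(first_year, last_year + 1):
        # as_ylo / as_yhi bound the publication year from below and above.
        params = {"cites": cites_id, "as_ylo": year, "as_yhi": year}
        urls.append(f"{SCHOLAR_URL}?{urlencode(params)}")
    return urls


# "1234567890" is a placeholder, not a real cited-by ID.
for url in yearly_queries("1234567890", 2022, 2024):
    print(url)
```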

📤 Outputs and Sharing

Each run creates a timestamped folder under data/result-{timestamp}/, usually including:

paper_results.xlsx
paper_results_all_renowned_scholar.xlsx
paper_results_top-tier_scholar.xlsx
paper_results_with_citing_desc.xlsx
paper_results.json
paper_dashboard.html

paper_dashboard.html is the main shareable artifact. It is a self-contained browser-readable report that can be sent to advisors, collaborators, or evaluators, and can be reused in grant applications, annual reviews, and presentation preparation. Download buttons and AI assistant features work best when the report is opened from the local CitationClaw app; the static charts and report content remain readable when shared offline.

If author and citation-description caches already exist, the app can rebuild a report from cache without repeating the full crawl and extraction workflow.

🔧 Configuration Highlights

ScraperAPI Keys: required for Google Scholar crawling; multiple keys improve stability.
Search LLM: required for scholar assessment and verification; must support web search.
Lightweight model: optional independent model endpoint for report generation and citation-context extraction.
Semantic Scholar API Key: optional but improves metadata and PDF discovery.
Web of Science Starter API Key: optional higher-priority structured author extraction.
MinerU API Token: optional parser for larger or more complex PDFs.
CDP debug port: optional Chrome/Edge session for authenticated IEEE, Elsevier, and ACM downloads.
Quota tracking: optional API relay token/user ID pair for estimating LLM quota consumption after each run.
Model preflight: the UI can test Search LLM and lightweight model connectivity before a full run.
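
The split between required and optional settings can be expressed as a simple check before a run. The setting names below are hypothetical labels derived from the list above; CitationClaw's real configuration keys may differ.

```python
# Hypothetical setting names; CitationClaw's real config keys may differ.
REQUIRED = ["scraperapi_keys", "search_llm"]
OPTIONAL = [
    "lightweight_model",
    "semantic_scholar_api_key",
    "wos_starter_api_key",
    "mineru_api_token",
    "cdp_debug_port",
    "quota_tracking",
]


def check_config(config: dict) -> tuple[list[str], list[str]]:
    """Return (missing required settings, unset optional settings)."""
    missing = [key for key in REQUIRED if not config.get(key)]
    unset = [key for key in OPTIONAL if not config.get(key)]
    return missing, unset


missing, unset = check_config({"scraperapi_keys": ["key-1"], "search_llm": ""})
print("missing required:", missing)
print("unset optional:", unset)
```

An empty string counts as unset here, so a blank Search LLM field is reported as missing rather than silently accepted.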

📁 Project Structure

citationclaw/
├── app/                 # FastAPI app, task orchestration, config, logs
├── core/                # scraping, metadata, PDF, export, dashboard engines
├── skills/              # Skills Runtime and phase skills
├── static/              # frontend assets
├── templates/           # Jinja2 pages
docs/                    # documentation site and demos
test/                    # tests

🌍 Community

Product update: 减论 reduct.cn

⚠️ Disclaimer

CitationClaw is intended for academic research and personal study. Follow the terms of Google Scholar, ScraperAPI, OpenAlex, Semantic Scholar, arXiv, Web of Science, MinerU, and your selected LLM provider. Author identities, citation contexts, and impact analysis should be treated as assistive outputs and manually verified before formal use. The authors are not responsible for consequences arising from use of this tool.

👥 Developer Team

Shanghai Jiao Tong University — VisionXLab@RethinkLab
Qihao Yang (杨起豪), Chunhao Zhang (张春浩), Ziyang Gong (龚子洋), Ziqian Fan (樊子谦), Yifan Zhou (周奕帆), Yue Zhou (周越), Zhihang Zhong (钟志航), Xue Yang (杨学, Project Leader)

East China Normal University & Shanghai AI Lab — DMCV
Yifan Cheng (程依凡), Zhanghan Xu (许张涵), Jiawei Lu (陆佳炜), Xin Tan (谭鑫)

Southeast University — PALM Lab
Caorui Li (李操瑞), Tianyi Zhou (周天一), Xu Yang (杨旭)

Acknowledgment: Special thanks to Keyu Chen (陈柯宇) for generously sponsoring the compute and API resources that made this project possible.

📚 Citation

If CitationClaw helps your research, reporting, or evaluation workflow, please cite the project:

@software{citationclaw2026,
  title        = {CitationClaw: Turning Every Citation into Explainable Impact},
  author       = {Yang, Qihao and Zhang, Chunhao and Cheng, Yifan and Gong, Ziyang and Li, Caorui and Xu, Zhanghan and Zhou, Tianyi and Lu, Jiawei and Fan, Ziqian and Zhou, Yifan and Zhou, Yue and Zhong, Zhihang and Yang, Xu and Tan, Xin and Yang, Xue},
  year         = {2026},
  version      = {2.0.0},
  url          = {https://github.com/VisionXLab/CitationClaw},
  institution  = {Shanghai Jiao Tong University, East China Normal University, Southeast University}
}
