AI-powered Zotero research assistant — a standalone MCP server with 32 tools for literature search, reading analysis, citation management, and review writing.
Zotero Research Assistant
Turn your Zotero library into an AI-powered research engine.
Search by meaning, discover related papers across 200M+ works, get personalized reading recommendations, and manage your entire academic workflow — all through natural language.
Works with Cursor, Claude Desktop, Cherry Studio, Trae, OpenAI Codex CLI, and any MCP-compatible client.
Preface
This project was built to help graduate students and researchers — especially those without a computer science background — leverage AI-enhanced Zotero for more efficient academic workflows. The documentation is deliberately detailed and step-by-step. Cherry Studio was chosen as the primary interaction interface because it provides a user-friendly GUI that doesn't require any terminal expertise. We believe powerful research tools should be accessible to everyone, not just developers.
If you have no programming experience, go directly to docs/cherry-studio-setup-en.md and follow the instructions step by step. Try to complete it independently — if you get stuck at any point, paste the error message to any AI chatbot (ChatGPT, DeepSeek, Kimi, etc.) and ask for help. Consider this your first step into the world of programming and AI tools. It's easier than you think.
Highlights
| Feature | Description |
|---|---|
| 32 MCP Tools | One intent per tool, so the LLM has an unambiguous choice for each request |
| Hybrid RAG Search | Keyword + semantic (bge-m3, 100+ languages) + cross-encoder reranking |
| Semantic Chunking | Paragraph-aware splitting with section detection (references, figures/tables) |
| Multi-Source Discovery | OpenAlex + CrossRef + Semantic Scholar in parallel, Three-Index Verification to prevent fabricated citations |
| Citation Network Expansion | Corpus-First strategy + forward/backward citations + OpenAlex Related Works |
| Anti-Hallucination | Zero-fabrication policy with [MATERIAL GAP] structural tags; every paper has a verifiable source link |
| RAG Diagnostics | Built-in health check, index inspection, and recall testing |
| Personalized Recommendations | Learns from your reading activity and annotations to suggest what to read next |
| Literature Review Generator | Select papers → extract evidence with citations → AI synthesizes thematic review |
| Smart Tag Suggestions | Auto-analyze metadata to recommend methodology/domain/data tags (confirm before apply) |
| Argument Finder | Find supporting & opposing evidence for your thesis from your library |
| CNKI Integration | Optional Chinese literature search with journal-level tags (CSSCI/PKU Core/CSCD) |
| OA PDF Waterfall | arXiv → Unpaywall → OpenAlex → S2 → CORE → PMC automatic full-text retrieval |
| Write Safety | All destructive operations require explicit user approval (dry-run by default) |
Table of Contents
- Features
- Requirements
- Quick Start
- Client Setup
- Example Workflows
- MCP Tools (32)
- Configuration
- CNKI Setup (Optional)
- Updating
- Troubleshooting
- Architecture
- Development
- Acknowledgments
- Disclaimer
- License
Features
Local Library Intelligence
- Hybrid search: Zotero keyword search + ChromaDB semantic search, merged with Reciprocal Rank Fusion; falls back to the Zotero full-text index
- Filter-only search: list papers by year, tags, or collection with an empty query
- Cross-encoder reranking: optional ms-marco-MiniLM-L-6-v2 for higher precision
- Multilingual: BAAI/bge-m3 embedding (1024-dim, 100+ languages including Chinese and English)
- Page-level traceability: retrieved passages include exact PDF page numbers
- Full-text & outline: read complete paper text or PDF table of contents
- Incremental index sync: version-based diff; auto-sync on MCP startup
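The merge step of hybrid search can be illustrated with a minimal sketch of Reciprocal Rank Fusion. The function name, the sample keys, and the constant k=60 (a value commonly used in the RRF literature) are assumptions for illustration; the project's own merging code and parameters may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of item keys into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, key in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every item it contains.
            scores[key] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by key for determinism.
    return sorted(scores, key=lambda key: (-scores[key], key))


# Hypothetical item keys from a keyword search and a semantic search.
keyword_hits = ["A", "B", "C"]
semantic_hits = ["C", "A", "D"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Items that rank well in both lists rise to the top, while an item found by only one retriever still stays in the result.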
Semantic RAG Pipeline
- Paragraph-aware chunking: splits on natural boundaries (paragraphs → sentences), with adaptive merging toward 600-character chunks; CJK-aware sentence splitting (breaks at 。!? without needing spaces) and PDF soft-wrap repair (满\n意度 → 满意度) so sentences are never cut mid-word
- Section detection: automatically identifies and tags reference sections; excludes them from search by default
- Figure & table caption tagging: detects Figure / Fig. / Table / 图 / 表 captions and marks chunks for targeted retrieval
- Table & figure cross-referencing: tables and figures are indexed as lightweight caption-anchored records, not structured into cells. For a table, the index keeps its location, its caption, and the raw block content from the caption until the prose resumes (so its values stay searchable); for a figure, it keeps only the location and the caption, which says roughly what the figure shows (no image recognition). Prose passages that cite "Table 3" / "Figure 2" are auto-linked to those records, surfaced via get_paper_content's referenced_tables / referenced_figures. (True table structuring is a vision problem; see Tables & figures for optional visual parsers.)
- Chunking versioning: strategy changes auto-trigger a full index rebuild; no stale data
- Index diagnostics: inspect_index shows chunk statistics, quality issues, and garbled text detection
- Recall testing: test_recall verifies a paper's own chunks appear in top-20 search results
- Health monitoring: check_health diagnoses connections, index status, embedding model, and configuration
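As a rough illustration of paragraph-aware chunking with CJK-aware sentence splitting and soft-wrap repair, the sketch below merges paragraphs (or sentences, for long paragraphs) up to a 600-character target. All function names, the terminator set, and the CJK character range are assumptions; the project's real chunker is likely more elaborate, and this sketch does not split a single sentence that is itself longer than the target.

```python
import re

CJK = "\u4e00-\u9fff"  # assumption: basic CJK Unified Ideographs only


def repair_soft_wraps(text):
    """Join a line break that falls between two CJK characters (PDF soft wrap)."""
    return re.sub(rf"(?<=[{CJK}])\n(?=[{CJK}])", "", text)


def split_sentences(paragraph):
    """Split after CJK terminators (no space needed) or after .!? plus whitespace."""
    parts = re.split(r"(?<=[。!?])|(?<=[.!?])\s+", paragraph)
    return [part.strip() for part in parts if part and part.strip()]


def chunk_text(text, target=600):
    """Greedily merge paragraphs, falling back to sentences, up to `target` chars."""
    chunks, current = [], ""
    for paragraph in re.split(r"\n\s*\n", repair_soft_wraps(text)):
        paragraph = " ".join(paragraph.split())  # normalize inner whitespace
        if not paragraph:
            continue
        # Short paragraphs stay whole; long ones are split into sentences.
        units = [paragraph] if len(paragraph) <= target else split_sentences(paragraph)
        for unit in units:
            if current and len(current) + len(unit) + 1 > target:
                chunks.append(current)
                current = ""
            current = f"{current} {unit}".strip()
    if current:
        chunks.append(current)
    return chunks


print(repair_soft_wraps("满\n意度"))
print(split_sentences("今天下雨。明天晴!后天呢?"))
```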
Online Literature Discovery
- Multi-source search: queries OpenAlex, CrossRef, and Semantic Scholar in parallel with publisher-diverse ranking
- Corpus-First strategy: when a paper's reference list is available, the system expands citation networks from those known references as the PRIMARY search strategy
- Discipline filtering: optional fields_of_study parameter constrains results to relevant academic fields
- Related paper discovery: provide a paper's metadata → generates tiered pairwise queries → searches all sources → post-filters → returns deduplicated hits
- Three-Index Verification: every result with a DOI is cross-checked against CrossRef, OpenAlex, and Semantic Scholar; unverifiable papers are filtered out
- Source verification: every returned paper includes a verifiable link (DOI URL, Semantic Scholar URL, or CNKI link)
- Anti-hallucination guardrails: structural [MATERIAL GAP] tags when a search returns zero results
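The idea behind Three-Index Verification can be sketched as follows. The lookup URLs are the public per-DOI endpoints of the three services; everything else (function names, the `min_confirmations` threshold, the `verified_by` field, and passing DOI-less results through untouched) is an assumption made for illustration, not the project's actual logic. The HTTP check is injected as a callable so the sketch stays runnable offline.

```python
from urllib.parse import quote


def lookup_urls(doi):
    """Per-DOI lookup endpoints in the three indexes."""
    encoded = quote(doi, safe="/")
    return {
        "crossref": f"https://api.crossref.org/works/{encoded}",
        "openalex": f"https://api.openalex.org/works/https://doi.org/{encoded}",
        "semantic_scholar": f"https://api.semanticscholar.org/graph/v1/paper/DOI:{encoded}",
    }


def verify_results(results, exists, min_confirmations=1):
    """Keep results whose DOI is confirmed by enough indexes.

    `exists(url) -> bool` is a hypothetical callable that performs the HTTP
    request; timeouts, retries, and rate limits are out of scope here.
    """
    verified = []
    for paper in results:
        doi = paper.get("doi")
        if not doi:
            # Assumption: results without a DOI are handled by other checks.
            verified.append(paper)
            continue
        confirmed = [name for name, url in lookup_urls(doi).items() if exists(url)]
        if len(confirmed) >= min_confirmations:
            verified.append({**paper, "verified_by": confirmed})
    return verified


# Offline demo with a stub that "knows" exactly one DOI.
known_dois = {"10.1/real"}
stub_exists = lambda url: any(url.endswith(doi) for doi in known_dois)
hits = [{"doi": "10.1/real"}, {"doi": "10.1/fabricated"}]
print(verify_results(hits, stub_exists))
```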
CNKI (Chinese Literature)
- CNKI integration — optional Chinese journal search via browser automation (disabled by default, enabled on demand)
- Journal-level tags — search results include indexing status badges (CSSCI, PKU Core, CSCD, SCI, EI)
- Direct Zotero import — export papers from CNKI to Zotero without manual DOI lookup
- Paper detail extraction — full metadata (abstract, keywords, DOI, affiliations) from CNKI detail pages
- Smart pagination — AI proactively fetches more results when thorough coverage is needed
Reading Insight & Recommendations
- Reading status detection — heuristic classification (deep_read / browsed / unread) based on annotation count, notes, and PDF open history
- Personalized recommendations — identifies your most-engaged papers → queries OpenAlex Related Works + S2 Recommendations in parallel → ranks by cross-seed frequency
- Focus topic extraction — surfaces your active research themes from recent reading tags
- Literature review generation — select papers → extract relevant passages with page-level citations → structured output for AI synthesis
- Smart tag suggestions — analyzes title/abstract to recommend methodology, domain, and data-type tags; suggest-only (never auto-applies)
- Argument finder — given a thesis/claim, searches library for evidence grouped by stance (support/oppose/neutral)
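Ranking by cross-seed frequency, as mentioned for personalized recommendations, can be pictured like this: a candidate suggested by several of your most-engaged papers ("seeds") outranks one suggested by a single seed. The function name, data shapes, and tie-breaking rule below are assumptions for illustration.

```python
from collections import Counter


def rank_by_cross_seed_frequency(candidates_by_seed, library_keys=frozenset(), top_n=10):
    """Rank candidates by how many distinct seeds recommended them."""
    counts = Counter()
    for candidates in candidates_by_seed.values():
        # Count each candidate once per seed and skip papers already in the library.
        counts.update(set(candidates) - set(library_keys))
    # Most frequently recommended first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Hypothetical seeds (s1..s3) and the related papers returned for each.
related = {
    "s1": ["p1", "p2"],
    "s2": ["p2", "p3"],
    "s3": ["p2", "p1", "s1"],
}
ranking = rank_by_cross_seed_frequency(related, library_keys={"s1", "s2", "s3"})
print(ranking)
```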
Library Management
- Add papers — DOI, arXiv, ISBN, BibTeX, or publisher URL (ScienceDirect, Springer, Wiley, …)
- Open-access PDF waterfall — arXiv → Unpaywall → OpenAlex → Semantic Scholar → CORE → PMC
- Duplicate merge — find by DOI/title, merge with dry-run preview
- Annotations — search highlights across the library; create highlights on PDFs
- Write safety — all write/delete operations preview first; requires explicit user approval
- Hybrid Zotero mode — fast local reads + web API writes (when API key is set)
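The open-access PDF waterfall amounts to trying sources in a fixed order and stopping at the first hit. A minimal sketch, assuming each source is wrapped in a resolver callable that returns a PDF URL or None (the resolver interface and all names here are hypothetical):

```python
WATERFALL_ORDER = ["arxiv", "unpaywall", "openalex", "semantic_scholar", "core", "pmc"]


def first_open_access_pdf(identifier, resolvers, order=WATERFALL_ORDER):
    """Return (source, url) from the first resolver that yields a URL, else None."""
    for name in order:
        resolve = resolvers.get(name)
        if resolve is None:
            continue  # source not configured, e.g. a missing API key
        try:
            url = resolve(identifier)
        except Exception:
            continue  # one failing source should not stop the waterfall
        if url:
            return name, url
    return None


def unavailable(_identifier):
    raise RuntimeError("service down")


# Stub resolvers standing in for real API clients.
stubs = {
    "arxiv": lambda identifier: None,
    "unpaywall": unavailable,
    "openalex": lambda identifier: f"https://example.org/{identifier}.pdf",
    "core": lambda identifier: "https://example.org/other.pdf",
}
print(first_open_access_pdf("10.1/x", stubs))
```

Keeping the order in one list makes the priority explicit and easy to change without touching the resolvers.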
Requirements
| Component | Version / Note |
|---|---|
| Python | 3.11+ |
| Zotero | 7+ desktop app, running with local API enabled |
| MCP client | Cursor, Claude Desktop, Cherry Studio, Trae, Codex CLI, etc. |
| LLM | Any model with tool/function calling (Claude, GPT-4o, DeepSeek, Qwen, Gemini, …) |
| Disk | ~2.5 GB for embedding model (bge-m3) on first run |
| Git | Only needed for Option B (clone from source) |
Path tip: Install in a short path without spaces or non-ASCII characters, e.g. ~/zotero-research-assistant (macOS/Linux) or C:\Dev\zotero-research-assistant (Windows).
Quick Start
1. Install
Option A: pip install (recommended for most users)
pip install zotero-research-assistant
With CNKI (Chinese literature) support:
pip install "zotero-research-assistant[cnki]"
After installing, run zra-mcp to start the MCP server. Skip to Step 2.
Option B: Clone from source (for development or customization)
git clone https://github.com/qiobn/zotero-research-assistant.git
cd zotero-research-assistant
Install uv (fast Python package manager) if not already present:
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows (PowerShell)
irm https://astral.sh/uv/install.ps1 | iex
Create a virtual environment and install:
uv venv .venv --python 3.13 # use 3.12 or 3.11 if unavailable
uv pip install -e .
Verify installation:
# macOS / Linux
source .venv/bin/activate
python -c "from project_a_mcp.server import mcp; print('OK')"
# Windows (PowerShell)
.venv\Scripts\activate
python -c "from project_a_mcp.server import mcp; print('OK')"
First run downloads the embedding model (~2.3 GB). If the download is slow, set HF_ENDPOINT=https://hf-mirror.com and retry.
2. Configure Zotero
Enable local API (required):
- Open Zotero → Edit → Settings → Advanced
- Check "Allow other applications on this computer to communicate with Zotero"
- Verify: http://localhost:23119/api/ should return JSON
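If you prefer to check the local API from Python rather than a browser, a small standard-library probe is enough. This is a sketch under the assumption stated above that the endpoint answers with JSON; the function name and timeout are illustrative.

```python
import json
import urllib.error
import urllib.request


def zotero_local_api_ok(url="http://localhost:23119/api/", timeout=3.0):
    """Return True if the endpoint answers with HTTP 200 and a JSON body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            json.loads(response.read().decode("utf-8"))
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False


if __name__ == "__main__":
    print("Zotero local API reachable:", zotero_local_api_ok())
```

A result of False usually means Zotero is not running or the "Allow other applications" option is still unchecked.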
Set environment variables:
If you used Option B (clone), create a .env file in the project folder:
cp .env.example .env
If you used Option A (pip install), set environment variables in your shell or create a .env file in your working directory.
Minimum for read-only mode (search, read, cite):
ZOTERO_LOCAL=true
For write operations (add papers, notes, tags, collections), also set your Zotero API key:
ZOTERO_LOCAL=true
ZOTERO_LIBRARY_ID=12345678
ZOTERO_API_KEY=your_api_key_here
3. Build the vector index (first time)
The MCP server auto-syncs on startup (ZRA_AUTO_SYNC=true by default). On first launch it will parse all your PDFs and build the semantic index automatically.
If you cloned from source and want to build the index manually:
python scripts/index_library.py
The index is stored in .chroma_db/ (local only).
First-time indexing can take a while, so let it run in the background. The first build parses every PDF and computes embeddings; the more papers, the longer it takes (rough guide: ~3–5 min for 100 papers, ~10–15 min for 500, longer on CPU or with large libraries). The auto-sync runs in a background thread and does not block the client; the manual script can be backgrounded too (e.g. nohup python scripts/index_library.py &). Only the first build, or a sync after large library changes, takes this long; subsequent startups are fast incremental syncs.
4. Connect your AI client
See the Client Setup section below for your specific tool.
5. Test the connection
- Start Zotero desktop
- Open a new chat in your MCP client
- Ask: "List all collections in my Zotero library"
If you see your collections, setup is complete.
Client Setup
All clients use stdio transport to connect to the MCP server.
If you installed via pip (Option A): use zra-mcp as the command directly — no path configuration needed.
If you cloned from source (Option B): you need the full path to the Python binary:
| Value | macOS / Linux | Windows |
|---|---|---|
| Python binary | <project>/.venv/bin/python | <project>\.venv\Scripts\python.exe |
| Working directory | <project> (full path) | <project> (full path) |
Replace <project> with your clone path (e.g. /Users/you/zotero-research-assistant or C:\Dev\zotero-research-assistant).
Quick path helper (run inside the project folder):
# macOS / Linux
echo "$(pwd)/.venv/bin/python"
# Windows (PowerShell)
echo "$PWD\.venv\Scripts\python.exe"
The examples below show both pip and source configurations. Use whichever matches your install method.
Cursor
Settings → MCP → Add new MCP server, or add to .cursor/mcp.json:
pip install users:
{
"mcpServers": {
"zra-mcp": {
"command": "zra-mcp"
}
}
}
Source install users (macOS/Linux):
{
"mcpServers": {
"zra-mcp": {
"command": "/Users/you/zotero-research-assistant/.venv/bin/python",
"args": ["-m", "project_a_mcp.server"],
"cwd": "/Users/you/zotero-research-assistant"
}
}
}
Source install users (Windows):
{
"mcpServers": {
"zra-mcp": {
"command": "C:\\Dev\\zotero-research-assistant\\.venv\\Scripts\\python.exe",
"args": ["-m", "project_a_mcp.server"],
"cwd": "C:\\Dev\\zotero-research-assistant"
}
}
}
Restart Cursor after adding the config. The MCP tools will appear in Agent mode.
Claude Desktop
Edit claude_desktop_config.json:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
{
"mcpServers": {
"zra-mcp": {
"command": "/Users/you/zotero-research-assistant/.venv/bin/python",
"args": ["-m", "project_a_mcp.server"],
"cwd": "/Users/you/zotero-research-assistant"
}
}
}
Restart Claude Desktop. You should see the MCP tools icon (hammer) in the chat input area.
Cherry Studio
Settings → MCP Servers → Add → JSON mode:
pip users (recommended):
{
"mcpServers": {
"zra-mcp": {
"name": "zra-mcp",
"type": "stdio",
"isActive": true,
"command": "zra-mcp"
}
}
}
Source install users (macOS/Linux):
{
"mcpServers": {
"zra-mcp": {
"name": "zra-mcp",
"type": "stdio",
"isActive": true,
"command": "/Users/you/zotero-research-assistant/.venv/bin/python",
"args": ["-m", "project_a_mcp.server"],
"cwd": "/Users/you/zotero-research-assistant"
}
}
}
Source install users (Windows):
{
"mcpServers": {
"zra-mcp": {
"name": "zra-mcp",
"type": "stdio",
"isActive": true,
"command": "C:\\Dev\\zotero-research-assistant\\.venv\\Scripts\\python.exe",
"args": ["-m", "project_a_mcp.server"],
"cwd": "C:\\Dev\\zotero-research-assistant"
}
}
}
Configure an LLM under Settings → Model Services (DeepSeek, GPT-4o, Claude, Qwen, etc.). Enable the MCP toggle in the chat interface to activate tools.
For a detailed step-by-step guide, see docs/cherry-studio-setup-en.md.
Trae
Trae supports MCP servers via its settings panel.
Settings → MCP → Add Server:
| Field | Value |
|---|---|
| Name | zra-mcp |
| Transport | stdio |
| Command | Full path to .venv/bin/python (or .venv\Scripts\python.exe on Windows) |
| Arguments | -m project_a_mcp.server |
| Working Directory | Full path to the project root |
Or add to your Trae MCP configuration file (.trae/mcp.json in your workspace or global config):
{
"mcpServers": {
"zra-mcp": {
"command": "/Users/you/zotero-research-assistant/.venv/bin/python",
"args": ["-m", "project_a_mcp.server"],
"cwd": "/Users/you/zotero-research-assistant"
}
}
}
Restart Trae after configuration. MCP tools become available in AI chat (Agent mode).
OpenAI Codex CLI
Codex CLI supports MCP servers. Add to your ~/.codex/config.json (or project-level .codex/config.json):
{
"mcpServers": {
"zra-mcp": {
"command": "/Users/you/zotero-research-assistant/.venv/bin/python",
"args": ["-m", "project_a_mcp.server"],
"cwd": "/Users/you/zotero-research-assistant"
}
}
}
Then run Codex normally — it will discover and use the tools automatically:
codex "Find papers about urban accessibility in my Zotero library"
Other MCP Clients
Any client that supports the MCP stdio transport can connect. The universal config is:
| Parameter | Value |
|---|---|
| Transport | stdio |
| Command | <project>/.venv/bin/python |
| Arguments | ["-m", "project_a_mcp.server"] |
| Working directory | <project> |
| Environment | Reads from <project>/.env automatically |
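Hand-writing JSON with Windows paths is error-prone because every backslash must be doubled. The sketch below generates the source-install entry from a clone path and lets json.dumps handle the escaping. The helper name is hypothetical; the server name, arguments, and layout mirror the configs shown in this section.

```python
import json
from pathlib import PurePosixPath, PureWindowsPath


def mcp_config(project_dir, windows=False):
    """Build an mcpServers entry for a source install at `project_dir`."""
    if windows:
        root = PureWindowsPath(project_dir)
        python = root / ".venv" / "Scripts" / "python.exe"
    else:
        root = PurePosixPath(project_dir)
        python = root / ".venv" / "bin" / "python"
    return {
        "mcpServers": {
            "zra-mcp": {
                "command": str(python),
                "args": ["-m", "project_a_mcp.server"],
                "cwd": str(root),
            }
        }
    }


# json.dumps doubles the backslashes, as JSON requires.
print(json.dumps(mcp_config(r"C:\Dev\zotero-research-assistant", windows=True), indent=2))
```

Paste the printed JSON into your client's MCP configuration file; clients such as Cherry Studio need the extra fields shown in their own examples.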
Example Workflows
Research Discovery
User: Find papers about 15-minute cities published after 2020
→ search_papers (local library)
User: Search online for recent studies on urban green infrastructure
→ search_online_literature (OpenAlex + CrossRef + S2)
User: I'm reading this paper [title, keywords]. Find me related literature.
→ find_related_literature (5 parallel strategies, verified results)
User: Show me who cites this paper and what it references
→ expand_citation_network (forward + backward citations)
Reading & Analysis
User: What does this paper say about the research methodology?
→ get_paper_content (semantic search within paper)
User: Summarize these 5 papers into a literature review about "method evolution"
→ generate_review_note → AI synthesizes thematic review with citations
User: My thesis is "public services are unevenly distributed" — find evidence
→ find_arguments (returns supporting + opposing passages with stance labels)
User: What should I read next?
→ recommend_papers (based on your annotation activity)
User: Show me all figures and tables mentioned in this paper
→ get_paper_content (filtered to figure/table chunks)
Writing & Citing
User: I'm writing: "Walkability is a key indicator of urban quality..." — suggest citations
→ suggest_citations (matches your draft to library evidence)
User: Export BibTeX for the top 3 results
→ export_bibliography
User: Add this paper: 10.1016/j.cities.2025.105902
→ add_paper (preview → confirm → auto-downloads OA PDF)
Library Organization
User: Analyze these papers and suggest tags
→ suggest_tags (methodology/domain/data classification, suggest-only)
User: Tag these papers as "core reading"
→ edit_tags (preview → confirm)
User: Which papers have I actually read? Which are unread?
→ reading_status (heuristic: annotations, notes, PDF open history)
System Diagnostics
User: Is everything working correctly?
→ check_health (connection, index, embedding, configuration)
User: How good is my index quality?
→ inspect_index (chunk stats, section breakdown, figure/table counts)
User: Can this paper be retrieved properly?
→ test_recall (searches by title, checks if own chunks appear in top-20)
Write safety: all destructive operations (add paper, notes, tags, merge duplicates) always preview first. The assistant asks for explicit confirmation before executing.
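The preview-then-confirm pattern can be sketched as a function that computes the change but only applies it when `dry_run` is explicitly switched off. The function name, the item structure, and the returned fields are hypothetical and do not reflect the real tool signatures.

```python
def preview_or_apply_tags(item, add=(), remove=(), dry_run=True):
    """Describe a tag change; mutate `item` only when dry_run is False."""
    current = set(item["tags"])
    proposed = (current | set(add)) - set(remove)
    report = {
        "item_key": item["key"],
        "added": sorted(proposed - current),
        "removed": sorted(current - proposed),
        "applied": False,
    }
    if not dry_run:
        # Reached only after the user has explicitly confirmed the preview.
        item["tags"] = sorted(proposed)
        report["applied"] = True
    return report


paper = {"key": "ABC123", "tags": ["gis", "todo"]}
print(preview_or_apply_tags(paper, add=["core reading"], remove=["todo"]))
print(paper["tags"])  # unchanged after the dry run
```

Because the default is the safe path, a model that forgets the flag produces a preview rather than a write.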
MCP Tools (32)
| Category | Tools |
|---|---|
| Discover | search_papers, search_online_literature, search_cnki_literature, find_related_literature, expand_citation_network, cnki_paper_detail, cnki_navigate_pages, find_similar_papers, browse_library, find_duplicates, merge_duplicates |
| Read | get_paper, get_paper_content, search_annotations, create_annotation |
| Write | suggest_citations, export_bibliography, add_paper, cnki_add_to_zotero |
| Manage | add_note, edit_tags, manage_collections |
| Insight | reading_status, recommend_papers, generate_review_note, generate_reading_note, suggest_tags, find_arguments |
| Admin | sync_index, check_health, inspect_index, test_recall |
Tool details
Discover
- search_papers: Primary search in your local library. Hybrid keyword + semantic. Use query="" with year_from / tags for filter-only listing.
- search_online_literature: Online discovery (English/international: OpenAlex, CrossRef, Semantic Scholar). Supports fields_of_study for discipline filtering.
- search_cnki_literature: CNKI Chinese journal search (optional module, disabled by default). Only triggered when the user explicitly requests Chinese papers.
- find_related_literature: Multi-strategy related paper search. Supports Corpus-First mode, keyword search, citation network expansion, and Semantic Scholar recommendations, all in parallel.
- expand_citation_network: Find papers via citation relationships (forward & backward via OpenAlex). Accepts multiple DOIs for multi-seed expansion.
- cnki_paper_detail: Full metadata from a CNKI paper page.
- cnki_navigate_pages: Pagination & re-sorting for CNKI results.
- find_similar_papers: Similar papers to a known item (by item_key).
- browse_library: Collections, tags, recent items.
- find_duplicates / merge_duplicates: Detect and merge duplicates (dry-run by default).
Read
- get_paper: Metadata + abstract.
- get_paper_content: Modes: semantic query, page range, fulltext, outline; optional annotations overlay.
- search_annotations: Search highlights/comments across all papers.
- create_annotation: Highlight text on a PDF (dry-run by default).
Write & Manage
- suggest_citations: Match your draft text to library evidence.
- export_bibliography: BibTeX or formatted citations.
- add_paper: Import by DOI / arXiv / ISBN / BibTeX / URL (dry-run by default).
- cnki_add_to_zotero: Import CNKI papers directly (no DOI needed).
- add_note, edit_tags, manage_collections: Library organization (dry-run by default).
Insight
- reading_status: Analyze reading progress. Classifies papers as deep_read, browsed, or unread.
- recommend_papers: Personalized recommendations via OpenAlex + S2.
- generate_review_note: Extract evidence from multiple papers for a literature review.
- generate_reading_note: Structured reading note for one paper.
- suggest_tags: Analyze metadata to suggest tags. Suggest-only, never auto-applies.
- find_arguments: Find supporting and opposing evidence for a claim/thesis.
Admin
- sync_index: Incremental vector index sync. Auto-runs on MCP startup. Reports a quality summary and detects chunking version changes.
- check_health: Diagnose connections, index status, embedding model, online APIs, and configuration. Bilingual output with fix suggestions.
- inspect_index: View index quality: chunk stats, section breakdown, figure/table counts, garbled text detection, and per-paper details.
- test_recall: Test retrieval quality for a specific paper by querying with its title and checking if its own chunks are returned.
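The check performed by a recall test can be pictured as follows: search with the paper's title, then see whether any of the top-k retrieved chunks belongs to that paper. The result structure and function name are assumptions; only the top-20 cutoff comes from the description of test_recall.

```python
def own_chunk_recall(results, item_key, k=20):
    """Report whether any of the top-k retrieved chunks belongs to `item_key`."""
    ranks = [
        rank
        for rank, chunk in enumerate(results[:k], start=1)
        if chunk["item_key"] == item_key
    ]
    return {
        "hit": bool(ranks),
        "best_rank": ranks[0] if ranks else None,
        "own_chunks_in_top_k": len(ranks),
    }


# Hypothetical search output, best match first.
retrieved = [
    {"item_key": "OTHER1", "page": 3},
    {"item_key": "PAPER1", "page": 1},
    {"item_key": "PAPER1", "page": 7},
]
print(own_chunk_recall(retrieved, "PAPER1"))
```

A miss for a paper's own title usually points at a parsing or indexing problem rather than a ranking problem.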
Configuration
Copy .env.example to .env and adjust:
| Variable | Default | Description |
|---|---|---|
| ZOTERO_LOCAL | true | Read from local Zotero API (fast) |
| ZOTERO_API_KEY | (unset) | Required for write operations (hybrid mode) |
| ZOTERO_LIBRARY_ID | 0 | Your Zotero user ID |
| EMBEDDING_MODEL | BAAI/bge-m3 | Sentence-transformer for semantic search |
| EMBEDDING_MAX_SEQ_LEN | 1024 | Cap on embedding sequence length; bounds GPU/MPS memory on pathological long inputs |
| HF_ENDPOINT | (unset) | HuggingFace mirror for model downloads (e.g. https://hf-mirror.com for users in China) |
| RERANKER_MODEL | cross-encoder/ms-marco-MiniLM-L-6-v2 | Reranker (none to disable) |
| CHROMA_PERSIST_DIR | .chroma_db | Local vector database path |
| ZRA_AUTO_SYNC | true | Auto incremental sync on MCP startup |
| SEMANTIC_SCHOLAR_API_KEY | (unset) | Optional; higher rate limits for online search |
| OPENALEX_MAILTO | (unset) | Optional; polite pool for OpenAlex API |
| UNPAYWALL_EMAIL | (unset) | Optional; Unpaywall OA PDF lookup |
| CORE_API_KEY | (unset) | Optional; CORE repository full-text |
| CNKI_ENABLED | false | Enable CNKI browser search (see below) |
| CNKI_CDP_URL | (unset) | Chrome remote debugging URL |
All stored data stays on your machine: the Zotero library, .chroma_db/, and the HuggingFace model cache (~/.cache/huggingface/).
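To make the defaults concrete, here is a sketch of how such variables could be read with the defaults listed above. The Settings class, its field names, and the accepted spellings for boolean values are assumptions; the project's own configuration loader may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    zotero_local: bool
    zotero_api_key: Optional[str]
    zotero_library_id: str
    embedding_model: str
    reranker_model: Optional[str]
    chroma_persist_dir: str
    auto_sync: bool
    cnki_enabled: bool

    @property
    def can_write(self):
        # Write operations need a Zotero API key (hybrid mode).
        return bool(self.zotero_api_key)


def _flag(env, name, default):
    # Assumption: these spellings count as "true".
    return env.get(name, default).strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    reranker = env.get("RERANKER_MODEL", "cross-encoder/ms-marco-MiniLM-L-6-v2")
    return Settings(
        zotero_local=_flag(env, "ZOTERO_LOCAL", "true"),
        zotero_api_key=env.get("ZOTERO_API_KEY") or None,
        zotero_library_id=env.get("ZOTERO_LIBRARY_ID", "0"),
        embedding_model=env.get("EMBEDDING_MODEL", "BAAI/bge-m3"),
        reranker_model=None if reranker.strip().lower() == "none" else reranker,
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", ".chroma_db"),
        auto_sync=_flag(env, "ZRA_AUTO_SYNC", "true"),
        cnki_enabled=_flag(env, "CNKI_ENABLED", "false"),
    )


print(load_settings({"RERANKER_MODEL": "none"}))
```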
Tables & figures
Tables and figures are not parsed into structured cells. Reliable table structuring is fundamentally a vision problem: text/geometry-based detection produces garbage on borderless "three-line" academic tables and even mis-segments multi-column prose and reference lists into fake tables. So instead of pretending to structure them, the indexer keeps lightweight caption-anchored records:
- Tables — the caption (e.g. "Table 3 …"), the page, and the raw block content from the caption until the prose resumes, so the table's values stay searchable. No cell/column structure.
- Figures — the caption only (roughly what the figure shows). No image is decoded.
- Prose that cites "Table 3" / "Figure 2" is linked to those records, so a passage and the thing it references resolve together (referenced_tables / referenced_figures in get_paper_content).
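Caption-anchored records and mention linking can be approximated with two regular expressions. This sketch keeps only the caption line and page (it omits the raw table block), and the patterns, label format, and function names are assumptions rather than the indexer's real rules.

```python
import re

KINDS = r"(Figure|Fig\.|Table|图|表)"
# A caption starts its line; a mention may appear anywhere in prose.
CAPTION = re.compile(rf"^[ \t]*{KINDS}\s*(\d+)", re.MULTILINE)
MENTION = re.compile(rf"{KINDS}\s*(\d+)")


def _label(kind, number):
    family = "table" if kind in {"Table", "表"} else "figure"
    return f"{family}:{number}"


def find_captions(page_text, page):
    """Build one lightweight record per caption line found on a page."""
    records = {}
    for match in CAPTION.finditer(page_text):
        line_end = page_text.find("\n", match.start())
        end = line_end if line_end != -1 else len(page_text)
        records[_label(*match.groups())] = {
            "page": page,
            "caption": page_text[match.start():end].strip(),
        }
    return records


def link_mentions(passage, records):
    """Return labels of known tables and figures cited in a prose passage."""
    cited = {_label(*match.groups()) for match in MENTION.finditer(passage)}
    return sorted(cited & set(records))


page_text = "Intro text.\nTable 3 Regression results\n0.12 0.34\nFigure 2 Study area\n"
records = find_captions(page_text, page=5)
print(link_mentions("As Table 3 and Fig. 2 show, and Table 9.", records))
```

Mentions of items with no matching caption record (here "Table 9") are simply left unlinked.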
Want true structured tables? Preprocess your PDFs with a dedicated visual document parser and store the result (e.g. Markdown/HTML) as a note or attachment that gets indexed as text. Good options:
| Tool | Notes |
|---|---|
| docling | IBM; strong layout + table structure recognition, exports Markdown/JSON |
| open-parse | Layout-aware chunking with table support |
| unstructured | hi_res strategy extracts table HTML |
These are heavier (vision models, slower) and intentionally kept out of the default pipeline.
CNKI Setup (Optional)
CNKI (China National Knowledge Infrastructure) is disabled by default. It is only needed for searching Chinese-language journal papers. When you first ask the AI for Chinese literature (e.g., "search CNKI for…" or "检索中文文献"), it will prompt you to complete the setup below.
CNKI has no public API. This project uses Playwright to connect to your logged-in Chrome browser via CDP (Chrome DevTools Protocol), following the same approach as cookjohn/cnki-skills.
Step 1: Install optional dependencies
uv pip install -e ".[cnki]"
playwright install chromium
Step 2: Start Chrome with remote debugging
# macOS
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
# Windows
"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222
# Linux
google-chrome --remote-debugging-port=9222
Step 3: Log in to CNKI
Open https://www.cnki.net/ in that Chrome window and log in (typically requires institutional VPN or campus network).
Step 4: Enable in .env
CNKI_ENABLED=true
CNKI_CDP_URL=http://127.0.0.1:9222
Step 5: Restart the MCP server
Reopen a chat window or restart your MCP client.
Verify
Ask the AI: "Search CNKI for highly-cited papers on geodetector since 2020"
If results appear (with title, authors, journal, citations, and journal level tags like CSSCI/PKU Core), the setup is working.
How it works
- search_cnki_literature or find_related_literature(scope="cnki") → returns hits with export_id and journal_level
- You select papers → AI calls cnki_add_to_zotero(export_ids=[...]) → papers appear in Zotero
- No DOI lookup needed; metadata is fetched from CNKI's internal export API
Notes
- Trigger: CNKI tools are only called when you explicitly mention Chinese literature, CNKI, 知网, 核心期刊, CSSCI, etc.
- Captcha: If a Tencent slider captcha appears, solve it in the Chrome window and retry.
- Zotero import: Requires Zotero desktop running (uses localhost:23119 Connector API).
- Compliance: Requires legitimate institutional CNKI access.
- Before each session: Ensure the Chrome window from Step 2 is still running and the CNKI login is active.
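Before a session, it can help to confirm that Chrome is really listening on the debugging port. Chrome exposes a /json/version endpoint on that port; the sketch below queries it with the standard library. The function names and the returned dictionary shape are illustrative choices.

```python
import json
import urllib.request


def parse_cdp_version(payload):
    """Extract the browser name and WebSocket endpoint from a /json/version reply."""
    info = json.loads(payload)
    return {
        "browser": info.get("Browser"),
        "websocket_url": info.get("webSocketDebuggerUrl"),
    }


def probe_cdp(base_url="http://127.0.0.1:9222", timeout=3.0):
    """Return parsed version info, or None if nothing answers on the port."""
    try:
        with urllib.request.urlopen(f"{base_url}/json/version", timeout=timeout) as response:
            return parse_cdp_version(response.read())
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or an unexpected body.
        return None


if __name__ == "__main__":
    print(probe_cdp() or "Chrome is not reachable; restart it with --remote-debugging-port=9222")
```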
Known Issues & Limitations
The CNKI module is currently unstable and disabled by default. It relies on browser automation which is inherently fragile.
| Issue | Cause | Workaround |
|---|---|---|
| Timeout on search | CNKI pages load slowly; anti-bot throttling | Simplify your query; retry after a few seconds |
| Chrome connection refused | Chrome not started with --remote-debugging-port | Close ALL Chrome windows, restart with the flag |
| Stale login session | CNKI sessions expire after ~30 min | Re-login in the Chrome window |
| Consecutive timeouts | Rate limiting by CNKI | Wait 30s and retry |
| Export to Zotero fails | Zotero desktop not running | Ensure Zotero is running and API responds |
If CNKI consistently fails, fall back to the English-language online search (search_online_literature / find_related_literature) which is stable and does not require browser automation.
Updating
pip users:
pip install --upgrade zotero-research-assistant
Source install users:
cd ~/zotero-research-assistant # or your clone path
git pull
uv pip install -e . # if dependencies changed
If using CNKI:
uv pip install -e ".[cnki]"
playwright install chromium
Restart your MCP client to reload the server.
Note: If the chunking strategy has been updated in a new version, sync_index will automatically detect the version change and rebuild the entire index on the next run.
Troubleshooting
| Problem | Fix |
|---|---|
| Connection refused / no results | Ensure Zotero desktop is running and local API is enabled |
| New papers not found | Say "sync my index" or restart MCP (auto-sync on startup) |
| Write operations fail | Set ZOTERO_API_KEY + ZOTERO_LIBRARY_ID in .env |
| Slow first start | Embedding model download (~2.3 GB); use HF_ENDPOINT=https://hf-mirror.com |
| Windows: script blocked | Set-ExecutionPolicy -Scope CurrentUser RemoteSigned in PowerShell |
| MCP tools not called | Use a model with function calling; enable MCP/tools in client settings |
| AI executes writes without asking | Add to system prompt: "Always wait for explicit confirmation before executing writes" |
| Poor search results | Ask "check my system health" → check_health diagnoses issues |
| Index seems stale | Ask "inspect my index" → inspect_index shows version and quality metrics |
| CNKI: "search is disabled" | Complete the CNKI Setup steps |
| CNKI: captcha | Solve the slider in the Chrome window, then retry the search |
Architecture
research_core/     # Shared library: Zotero client, RAG pipeline, search adapters, tools
  parsers/         # PDF extraction, CJK-aware semantic chunking, table extraction, caption detection
  rag/             # ChromaDB indexer, retriever, embedding, sync state
  tools/           # 32 tool implementations (one file per domain)
  zotero/          # Zotero local + web API client
project_a_mcp/     # MCP server entry point (stdio transport)
scripts/           # CLI utilities (index_library.py, etc.)
tests/             # Unit + integration tests
docs/              # Detailed setup guides
Each tool maps to one user intent — discovery tools return item_key, read/write tools consume it.
Development
uv pip install -e ".[dev]"
pytest tests/ -v
ruff check .
ruff format .
Run CNKI integration tests (requires active CNKI session):
CNKI_ENABLED=true CNKI_CDP_URL=http://127.0.0.1:9222 pytest tests/mcp/test_cnki.py -v
Acknowledgments
This project was inspired by and built upon ideas from:
- zotero-mcp — Pioneering work on connecting Zotero with AI assistants via MCP.
- cnki-skills — Elegant approach to CNKI browser automation via Chrome DevTools Protocol.
- academic-research-skills — Inspiration for the Corpus-First search strategy and structured anti-hallucination patterns.
- nature-skills — Inspiration for the Three-Index Verification approach.
Thank you to the authors of these projects for sharing their work with the community.
Disclaimer
- AI output quality depends on the connected model. Although this project implements multiple anti-hallucination mechanisms (Three-Index Verification, [MATERIAL GAP] tagging, source provenance), the final quality of literature reviews, summaries, and recommendations is ultimately determined by the LLM you connect. Always verify AI-generated citations against the original sources before using them in academic work.
- For learning and research purposes only. This project is open-source and intended solely for personal academic research and educational use. It is not commercialized. If any content or functionality inadvertently infringes on intellectual property or terms of service of third-party platforms, please open an issue and we will address it promptly.
- CNKI module compliance. The CNKI browser automation module is provided for convenience only. Users must have legitimate institutional access. This module is disabled by default.
- Data privacy. All processing happens locally by default. Your PDFs are parsed and embedded on your machine. However, if you configure a cloud-based LLM, paper content will be sent to that external service. Users working with sensitive or unpublished research should be aware of this.
- Trademark notice. "Zotero" is a registered trademark of the Corporation for Digital Scholarship. This project is an independent community tool and is not affiliated with, endorsed by, or officially connected to Zotero.
License
MIT