Universal Research Paper API: single entry point for arXiv, PMC, bioRxiv, medRxiv, PsyArXiv, OSF, and Semantic Scholar
ScholarX - API | MCP | AgentOS
Universal Research Paper API: a single entry point for querying, downloading, and ingesting research papers from all major preprint and academic repositories.
Version: 0.7.0
Overview
ScholarX provides a unified interface to search across 7 paper sources simultaneously, with automatic cross-source deduplication, full PDF downloads, and Knowledge Graph integration. It is registered as an Agent OS subsystem in the genius-agent ecosystem.
Supported Sources
| Source | API | Auth | Rate Limit |
|---|---|---|---|
| arXiv | Atom/OpenSearch | Free | 1 req/3s |
| PubMed Central | NCBI E-utilities | Optional `NCBI_API_KEY` | 3 req/s (10 with key) |
| bioRxiv | bioRxiv REST | Free | 1 req/s |
| medRxiv | bioRxiv REST | Free | 1 req/s |
| PsyArXiv | OSF v2 | `OSF_TOKEN` | 1 req/s |
| OSF | OSF v2 | `OSF_TOKEN` | 1 req/s |
| Semantic Scholar | Academic Graph v1 | Optional `S2_API_KEY` | 100 req/min |
Key Features
- Unified Search: a single `SearchQuery` model works across all sources
- 3-Tier Deduplication: DOI exact match → cross-ID mapping → fuzzy title+author (Levenshtein ≥ 0.90)
- Full Paper Download: download and store complete PDFs locally (`~/.scholarx/papers/`)
- Knowledge Graph Integration: ingest papers via the existing `KBIngestionEngine` (ArticleNode, SourceNode, PersonNode)
- RLM Auto-Trigger: large papers (>50K chars) automatically route through Recursive Language Model decomposition
- Per-Source Rate Limiting: token-bucket rate limiter in the abstract provider base class
- Configurable Watchlists: register custom research topics as MaintenanceCron tasks
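The per-source rate limiting listed above can be pictured as a token bucket. The following is a minimal sketch, not ScholarX's actual implementation: the `TokenBucket` class and its method names are hypothetical, and only the rates are taken from the Supported Sources table.

```python
import time


class TokenBucket:
    """Hypothetical token bucket; not ScholarX's actual class."""

    def __init__(self, rate: float, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full so the first request passes
        self.clock = clock          # injectable clock, handy for testing
        self.updated = clock()

    def _refill(self) -> None:
        # Add the tokens earned since the last update, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller must wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)


# Rates taken from the Supported Sources table.
ARXIV_RATE = 1 / 3      # 1 request every 3 seconds
PMC_RATE = 3.0          # 3 requests per second without an API key
```

A provider would typically call `try_acquire()` before each request and sleep for `wait_time()` when it returns `False`.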
Installation
# Core (API client only)
pip install scholarx
# With MCP server (quotes keep shells such as zsh from expanding the brackets)
pip install "scholarx[mcp]"
# With agent server
pip install "scholarx[agent]"
# Everything
pip install "scholarx[all]"
Quick Start
Python API
import asyncio

from scholarx.api_client import ScholarXClient
from scholarx.models import SearchQuery, PaperSource


async def main():
    client = ScholarXClient()

    # Search across all sources
    result = await client.search(SearchQuery(
        query="multi-agent orchestration",
        categories=["cs.AI", "cs.MA"],
        max_results=10,
    ))

    for paper in result.papers:
        print(f"[{paper.source}] {paper.title}")
        print(f"  Authors: {', '.join(paper.authors[:3])}")
        print(f"  DOI: {paper.doi}")
        print()

    # Download a paper
    if result.papers:
        path = await client.download_paper(result.papers[0])
        print(f"Downloaded to: {path}")


asyncio.run(main())
CLI
ScholarX includes a rich CLI with progress bars for paper discovery, relevance scoring, and PDF downloads.
# Scan for recent AI papers across 7 CS categories
scholarx scan --query "artificial intelligence" --output-dir ./papers
# Customize categories and result count
scholarx scan --categories cs.AI,cs.LG,cs.CL --max-results 30 --output-dir ./papers
# Use a custom relevance taxonomy
scholarx scan --query "knowledge graphs" --taxonomy custom_taxonomy.json --output-dir ./papers
# Auto-trigger comparative analysis on high-confidence papers
scholarx scan --analyze --output-dir ./papers
# Show stored paper library status
scholarx status
Relevance Scoring
The CLI scores each paper's abstract against a 9-domain weighted keyword taxonomy:
| Domain | Weight | Focus |
|---|---|---|
| Orchestration | 3.0 | Multi-agent, workflow, task decomposition |
| Knowledge Graph | 3.0 | Ontology, OWL, entity relations, graph reasoning |
| Planning & Reasoning | 2.5 | Chain-of-thought, MCTS, deliberation |
| Memory & Retrieval | 2.5 | RAG, episodic memory, continual learning |
| Tool Use | 2.0 | Function calling, MCP, code generation |
| Evaluation & Safety | 2.0 | Benchmarks, red teaming, hallucination |
| Swarm & Evolution | 2.0 | Evolutionary methods, stigmergy, biomimicry |
| LLM Architecture | 1.5 | Transformers, MoE, distillation |
| Human-AI | 1.0 | Human-in-the-loop, decision support |
Papers are classified into three tiers:
- Relevant (score ≥ 3.0): direct value for the target domain
- Marginal (score 1.0–2.9): potential indirect value
- Irrelevant (score < 1.0): filtered out
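To make the scoring and tiering concrete, here is a minimal sketch under stated assumptions. The weights come from the table above, but the keyword lists are illustrative guesses based on the Focus column, and the rule "add a domain's weight once if any of its keywords appears" is an assumption; the actual formula used by the CLI is not documented here.

```python
# Hypothetical excerpt of the taxonomy: weights come from the table above,
# keyword lists are illustrative guesses based on the Focus column.
TAXONOMY = {
    "Orchestration": (3.0, ["multi-agent", "workflow", "task decomposition"]),
    "Memory & Retrieval": (2.5, ["rag", "episodic memory", "continual learning"]),
    "LLM Architecture": (1.5, ["transformer", "mixture of experts", "distillation"]),
    "Human-AI": (1.0, ["human-in-the-loop", "decision support"]),
}


def score_abstract(abstract: str) -> float:
    """Sum the weight of every domain with at least one keyword in the abstract."""
    text = abstract.lower()
    return sum(
        weight
        for weight, keywords in TAXONOMY.values()
        if any(keyword in text for keyword in keywords)
    )


def classify(score: float) -> str:
    """Map a score to the three tiers listed above."""
    if score >= 3.0:
        return "relevant"
    if score >= 1.0:
        return "marginal"
    return "irrelevant"
```

Plain substring matching is used for brevity, so a short keyword such as "rag" also matches inside longer words; a real scorer would likely tokenize the abstract first.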
Deduplication
ScholarX prevents duplicate downloads through two mechanisms:
- Cross-source deduplication (`deduplication.py`): 3-tier matching removes duplicates when the same paper appears across multiple sources:
  - Tier 1: DOI exact match
  - Tier 2: Cross-ID mapping (arXiv ID ↔ S2 corpus ID via metadata)
  - Tier 3: Normalized title + first-author last name (Levenshtein ≥ 0.90)
- Storage deduplication (`paper_storage.py`): Before downloading, `PaperStorage.download_paper()` checks whether the paper ID's metadata hash already exists in `~/.scholarx/papers/.metadata/`. Already-downloaded papers are skipped instantly.
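The three matching tiers can be sketched as follows. This is an illustration, not the code in `deduplication.py`: the `Paper` dataclass, its field names, and the helper functions are hypothetical, and "Levenshtein ≥ 0.90" is interpreted here as a similarity of `1 - distance / max_length`, which is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Paper:
    """Hypothetical record; field names are assumptions, not ScholarX's model."""
    title: str
    first_author: str
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None  # S2 records can carry this in their metadata


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_key(p: Paper) -> str:
    """Normalized title plus first-author last name."""
    title = re.sub(r"[^a-z0-9 ]", "", p.title.lower())
    title = " ".join(title.split())
    parts = p.first_author.split()
    last_name = parts[-1].lower() if parts else ""
    return f"{title}|{last_name}"


def is_duplicate(a: Paper, b: Paper, threshold: float = 0.90) -> bool:
    if a.doi and b.doi and a.doi.lower() == b.doi.lower():          # Tier 1
        return True
    if a.arxiv_id and b.arxiv_id and a.arxiv_id == b.arxiv_id:      # Tier 2
        return True
    return similarity(fuzzy_key(a), fuzzy_key(b)) >= threshold      # Tier 3
```

The tiers run from cheapest and most reliable to most expensive, so the fuzzy comparison only happens when no shared identifier settles the question.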
MCP Server
# Start in stdio mode (for agent integration)
scholarx-mcp --transport stdio
# Start in HTTP mode
scholarx-mcp --transport streamable-http --host 0.0.0.0 --port 9600
MCP Tools
| Tool | Description |
|---|---|
| `search_papers` | Multi-source search with deduplication |
| `get_paper` | Single paper by source + ID |
| `search_by_author` | Author-based search |
| `get_recent_papers` | Papers from last N days |
| `list_sources` | Available sources and status |
| `list_categories` | Categories per source |
| `download_paper` | Download full PDF |
| `get_stored_papers` | List locally stored papers |
MCP Prompts
| Prompt | Purpose |
|---|---|
| `agent_utilities_enhancement_scan` | Scan CS/AI papers for AU concept enhancement opportunities |
| `biomimicry_innovation_scan` | Scan biology papers for biomimetic agent patterns |
Docker
# Build and run
docker compose up -d
# Debug mode (mounts local source)
docker compose -f compose.yml up --build
Environment Variables
# API Keys (all optional for basic functionality)
OSF_TOKEN= # OSF/PsyArXiv API token
S2_API_KEY= # Semantic Scholar (higher rate limits)
NCBI_API_KEY= # PubMed Central (higher rate limits)
# MCP Server
TRANSPORT=stdio # stdio | streamable-http
HOST=0.0.0.0
PORT=9600
# Tool Toggles
SEARCHTOOL=True
DISCOVERYTOOL=True
STORAGETOOL=True
# Paper Storage
SCHOLARX_STORAGE_DIR= # Default: ~/.scholarx/papers/
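As a rough illustration of how these variables might be consumed, the sketch below reads them into a typed settings object. The `Settings` class and `load_settings` function are hypothetical and do not come from the package; the defaults mirror the values listed above, and the accepted spellings for boolean toggles are an assumption.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


def _flag(env: Mapping[str, str], name: str, default: bool = True) -> bool:
    """Interpret values such as "True", "false" or "1" as booleans."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    transport: str
    host: str
    port: int
    search_tool: bool
    discovery_tool: bool
    storage_tool: bool
    storage_dir: Path
    s2_api_key: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Hypothetical loader; defaults mirror the values listed above."""
    storage = env.get("SCHOLARX_STORAGE_DIR") or "~/.scholarx/papers/"
    return Settings(
        transport=env.get("TRANSPORT", "stdio"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "9600")),
        search_tool=_flag(env, "SEARCHTOOL"),
        discovery_tool=_flag(env, "DISCOVERYTOOL"),
        storage_tool=_flag(env, "STORAGETOOL"),
        storage_dir=Path(storage).expanduser(),
        s2_api_key=env.get("S2_API_KEY") or None,  # empty string means "not set"
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests.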
Architecture
User/Agent
    │
    ▼
┌─────────────────────────┐
│   ScholarX MCP Server   │  12 tools + prompts
│   (mcp_server.py)       │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│   ScholarXClient        │  Unified API
│   (api_client.py)       │
└────────┬────────────────┘
         │
   ┌─────┼─────┬─────┬─────┬─────┬─────┐
   ▼     ▼     ▼     ▼     ▼     ▼     ▼
 arXiv  PMC  bioRx medRx PsyAr  OSF   S2   ← Per-source rate limiting
   │     │     │     │     │     │     │
   └─────┼─────┴─────┴─────┴─────┴─────┘
         │
         ▼
┌─────────────────────────┐
│   Deduplication Engine  │  DOI → cross-ID → fuzzy title
│   (deduplication.py)    │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│   Paper Storage         │  Full PDF download
│   (~/.scholarx/papers/) │
│        │                │
│        ▼                │
│   KBIngestionEngine     │  → ArticleNode + PersonNode
│   (KG auto-ingest)      │    + SourceNode + KBConceptNode
│        │                │
│   RLM (AU-007)          │  Auto-triggers for >50K char papers
└─────────────────────────┘
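The fan-out step in the middle of the diagram, and the size check at the bottom, can be sketched as below. This is an assumption-laden illustration rather than the code in `api_client.py`: providers are modeled as plain async callables, and `fan_out_search` and `needs_rlm` are hypothetical names. Only the 50K-character threshold comes from the text.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

RLM_THRESHOLD_CHARS = 50_000  # from "Auto-triggers for >50K char papers"

# A provider is modeled as an async callable: query -> list of paper dicts.
Provider = Callable[[str], Awaitable[List[dict]]]


async def fan_out_search(query: str, providers: Dict[str, Provider]) -> List[dict]:
    """Query every provider concurrently and tag results with their source."""
    names = list(providers)
    batches = await asyncio.gather(
        *(providers[name](query) for name in names),
        return_exceptions=True,  # one failing source should not sink the rest
    )
    merged: List[dict] = []
    for name, batch in zip(names, batches):
        if isinstance(batch, Exception):
            continue  # a real client would presumably log this
        merged.extend({**paper, "source": name} for paper in batch)
    return merged


def needs_rlm(full_text: str) -> bool:
    """True when the text is long enough to be routed through RLM decomposition."""
    return len(full_text) > RLM_THRESHOLD_CHARS
```

The merged list would then be handed to the deduplication stage before anything is downloaded or ingested.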
Agent OS Subsystem
ScholarX is registered as an Agent OS subsystem alongside:
| Subsystem | Role |
|---|---|
| `container-manager-mcp` | Infrastructure provisioning |
| `systems-manager` | Host/OS operations |
| `tunnel-manager` | Network tunneling |
| `repository-manager` | Git/repo operations |
| `scholarx` | Research intelligence |
Maintenance Cron
A SIX_HOURLY maintenance task (scholarx_paper_discovery) automatically:
- Checks for new papers across configured categories
- Evaluates relevance to Knowledge Graph concepts
- Ingests high-relevance papers (score > 0.6)
- Produces actionable research digests
Custom watchlists can be added via MaintenanceCron.add_task() or the create_research_watchlist MCP tool.
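The schedule and the ingestion threshold described above can be sketched in a few lines. Everything here is hypothetical (`is_due`, `discovery_cycle`, and the digest format are made up for illustration, and the `MaintenanceCron` API itself is not shown); only the six-hour interval and the `score > 0.6` cutoff come from the text.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple

SIX_HOURS = timedelta(hours=6)
INGEST_THRESHOLD = 0.6  # "score > 0.6" from the list above


def is_due(last_run: Optional[datetime], now: datetime) -> bool:
    """A task that never ran, or last ran six or more hours ago, is due."""
    return last_run is None or now - last_run >= SIX_HOURS


def discovery_cycle(scored: Iterable[Tuple[str, float]]) -> Tuple[List[str], str]:
    """Take (title, relevance) pairs; return titles to ingest and a text digest."""
    to_ingest = [title for title, score in scored if score > INGEST_THRESHOLD]
    digest = f"{len(to_ingest)} paper(s) selected for ingestion"
    if to_ingest:
        digest += ": " + "; ".join(to_ingest)
    return to_ingest, digest
```

Note that the comparison is strict, so a paper scoring exactly 0.6 is not ingested under this reading.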
License
MIT
MCP Configuration Examples
1. Standard IO (stdio) Deployment
{
  "mcpServers": {
    "scholarx": {
      "command": "uv",
      "args": [
        "run",
        "scholarx-mcp"
      ],
      "env": {
        "AGENT_DESCRIPTION": "<YOUR_AGENT_DESCRIPTION>",
        "AGENT_SYSTEM_PROMPT": "<YOUR_AGENT_SYSTEM_PROMPT>",
        "DEFAULT_AGENT_NAME": "<YOUR_DEFAULT_AGENT_NAME>",
        "DISCOVERYTOOL": "True",
        "SEARCHTOOL": "True",
        "STORAGETOOL": "True"
      }
    }
  }
}
2. Streamable HTTP (SSE) Deployment
{
  "mcpServers": {
    "scholarx": {
      "command": "uv",
      "args": [
        "run",
        "scholarx-mcp",
        "--transport",
        "streamable-http",
        "--host",
        "0.0.0.0",
        "--port",
        "8000"
      ],
      "env": {
        "AGENT_DESCRIPTION": "<YOUR_AGENT_DESCRIPTION>",
        "AGENT_SYSTEM_PROMPT": "<YOUR_AGENT_SYSTEM_PROMPT>",
        "DEFAULT_AGENT_NAME": "<YOUR_DEFAULT_AGENT_NAME>",
        "DISCOVERYTOOL": "True",
        "SEARCHTOOL": "True",
        "STORAGETOOL": "True"
      }
    }
  }
}