
Universal Research Paper API: single entry point for arXiv, PMC, bioRxiv, medRxiv, PsyArXiv, OSF, and Semantic Scholar


ScholarX 📚 - API | MCP | AgentOS


Universal Research Paper API: a single entry point for querying, downloading, and ingesting research papers from all major preprint and academic repositories.

Version: 0.6.0

Overview

ScholarX provides a unified interface to search across 7 paper sources simultaneously, with automatic cross-source deduplication, full PDF downloads, and Knowledge Graph integration. It is registered as an Agent OS subsystem in the genius-agent ecosystem.

Supported Sources

| Source | API | Auth | Rate Limit |
|---|---|---|---|
| arXiv | Atom/OpenSearch | Free | 1 req/3s |
| PubMed Central | NCBI E-utilities | Optional NCBI_API_KEY | 3 req/s (10 with key) |
| bioRxiv | bioRxiv REST | Free | 1 req/s |
| medRxiv | bioRxiv REST | Free | 1 req/s |
| PsyArXiv | OSF v2 | OSF_TOKEN | 1 req/s |
| OSF | OSF v2 | OSF_TOKEN | 1 req/s |
| Semantic Scholar | Academic Graph v1 | Optional S2_API_KEY | 100 req/min |

Key Features

  • Unified Search: Single SearchQuery model works across all sources
  • 3-Tier Deduplication: DOI exact match → cross-ID mapping → fuzzy title+author (Levenshtein similarity ≥ 0.90)
  • Full Paper Download: Download and store complete PDFs locally (~/.scholarx/papers/)
  • Knowledge Graph Integration: Ingest papers via existing KBIngestionEngine (ArticleNode, SourceNode, PersonNode)
  • RLM Auto-Trigger: Large papers (>50K chars) automatically route through Recursive Language Model decomposition
  • Per-Source Rate Limiting: Token-bucket rate limiter in the abstract provider base class
  • Configurable Watchlists: Register custom research topics as MaintenanceCron tasks
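
To make the per-source rate limiting concrete, here is a minimal token-bucket sketch. It is not ScholarX's actual base-class code: the TokenBucket class, its method names, and the LIMITERS dictionary are hypothetical, and only the rates come from the Supported Sources table.

```python
import time


class TokenBucket:
    """Minimal token-bucket limiter (illustrative; not ScholarX's real class)."""

    def __init__(self, rate: float, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.clock = clock          # injectable so the limiter can be tested
        self.updated = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last refill, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)


# Rates taken from the Supported Sources table; the dict itself is hypothetical.
LIMITERS = {
    "arxiv": TokenBucket(rate=1 / 3),                            # 1 req / 3 s
    "biorxiv": TokenBucket(rate=1.0),                            # 1 req / s
    "semantic_scholar": TokenBucket(rate=100 / 60, capacity=5),  # 100 req / min
}
```

An async provider could call `await asyncio.sleep(bucket.wait_time())` before each request; the burst capacity of 5 for Semantic Scholar is an arbitrary choice for illustration.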

Installation

# Core (API client only)
pip install scholarx

# With MCP server (quotes keep shells such as zsh from expanding the brackets)
pip install "scholarx[mcp]"

# With agent server
pip install "scholarx[agent]"

# Everything
pip install "scholarx[all]"

Quick Start

Python API

import asyncio
from scholarx.api_client import ScholarXClient
from scholarx.models import SearchQuery, PaperSource

async def main():
    client = ScholarXClient()

    # Search across all sources
    result = await client.search(SearchQuery(
        query="multi-agent orchestration",
        categories=["cs.AI", "cs.MA"],
        max_results=10,
    ))

    for paper in result.papers:
        print(f"[{paper.source}] {paper.title}")
        print(f"  Authors: {', '.join(paper.authors[:3])}")
        print(f"  DOI: {paper.doi}")
        print()

    # Download a paper
    if result.papers:
        path = await client.download_paper(result.papers[0])
        print(f"Downloaded to: {path}")

asyncio.run(main())

CLI

ScholarX includes a rich CLI with progress bars for paper discovery, relevance scoring, and PDF downloads.

# Scan for recent AI papers across 7 CS categories
scholarx scan --query "artificial intelligence" --output-dir ./papers

# Customize categories and result count
scholarx scan --categories cs.AI,cs.LG,cs.CL --max-results 30 --output-dir ./papers

# Use a custom relevance taxonomy
scholarx scan --query "knowledge graphs" --taxonomy custom_taxonomy.json --output-dir ./papers

# Auto-trigger comparative analysis on high-confidence papers
scholarx scan --analyze --output-dir ./papers

# Show stored paper library status
scholarx status

Relevance Scoring

The CLI scores each paper's abstract against a 9-domain weighted keyword taxonomy:

| Domain | Weight | Focus |
|---|---|---|
| Orchestration | 3.0 | Multi-agent, workflow, task decomposition |
| Knowledge Graph | 3.0 | Ontology, OWL, entity relations, graph reasoning |
| Planning & Reasoning | 2.5 | Chain-of-thought, MCTS, deliberation |
| Memory & Retrieval | 2.5 | RAG, episodic memory, continual learning |
| Tool Use | 2.0 | Function calling, MCP, code generation |
| Evaluation & Safety | 2.0 | Benchmarks, red teaming, hallucination |
| Swarm & Evolution | 2.0 | Evolutionary methods, stigmergy, biomimicry |
| LLM Architecture | 1.5 | Transformers, MoE, distillation |
| Human-AI | 1.0 | Human-in-the-loop, decision support |

Papers are classified into three tiers:

  • ✅ Relevant (score ≥ 3.0): Direct value for the target domain
  • 🟡 Marginal (score 1.0–2.9): Potential indirect value
  • ❌ Irrelevant (score < 1.0): Filtered out
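
The scoring and tiering can be sketched roughly as follows. This is an illustration under stated assumptions, not the CLI's actual implementation: the weights and thresholds come from this section, while the abbreviated keyword lists, the rule that each matched domain contributes its weight once, and the names TAXONOMY, score_abstract, and classify are hypothetical.

```python
# Domain weights copied from the taxonomy table; the keyword lists are
# abbreviated, hypothetical stand-ins for the real taxonomy.
TAXONOMY = {
    "Orchestration": (3.0, ["multi-agent", "workflow", "task decomposition"]),
    "Knowledge Graph": (3.0, ["ontology", "entity relations", "graph reasoning"]),
    "Planning & Reasoning": (2.5, ["chain-of-thought", "mcts", "deliberation"]),
    "Tool Use": (2.0, ["function calling", "code generation"]),
    "Human-AI": (1.0, ["human-in-the-loop", "decision support"]),
}


def score_abstract(abstract: str) -> float:
    """Sum the weight of every domain with at least one keyword hit.

    Assumption: a domain counts once no matter how many of its keywords match.
    Matching is a plain case-insensitive substring test.
    """
    text = abstract.lower()
    return sum(
        (
            weight
            for weight, keywords in TAXONOMY.values()
            if any(keyword in text for keyword in keywords)
        ),
        0.0,
    )


def classify(score: float) -> str:
    """Map a score onto the three tiers described in this section."""
    if score >= 3.0:
        return "relevant"
    if score >= 1.0:
        return "marginal"
    return "irrelevant"


if __name__ == "__main__":
    abstract = "A multi-agent workflow with human-in-the-loop review"
    score = score_abstract(abstract)
    print(score, classify(score))
```

Substring matching is the simplest option but can over-match short keywords (for example, "owl" occurs inside "knowledge"), so a real taxonomy would likely need word-boundary matching.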

Deduplication

ScholarX prevents duplicate downloads through two mechanisms:

  1. Cross-source deduplication (deduplication.py): 3-tier matching removes duplicates when the same paper appears across multiple sources:

    • Tier 1: DOI exact match
    • Tier 2: Cross-ID mapping (arXiv ID ↔ S2 corpus ID via metadata)
    • Tier 3: Normalized title + first-author last name (Levenshtein similarity ≥ 0.90)
  2. Storage deduplication (paper_storage.py): Before downloading, PaperStorage.download_paper() checks if the paper ID's metadata hash already exists in ~/.scholarx/papers/.metadata/. Already-downloaded papers are skipped instantly.
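
The three matching tiers can be sketched as below. This is not the code in deduplication.py: the Paper record, the helper names, and the definition of similarity as 1 minus edit distance divided by the longer title's length are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Paper:
    # Hypothetical minimal record; ScholarX's real model is not shown here.
    title: str
    authors: list
    doi: Optional[str] = None
    external_ids: dict = field(default_factory=dict)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed similarity: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def normalize(title: str) -> str:
    """Lowercase and drop punctuation so cosmetic differences do not count."""
    return re.sub(r"[^a-z0-9 ]+", "", title.lower()).strip()


def first_author_last_name(paper: Paper) -> str:
    return paper.authors[0].split()[-1].lower() if paper.authors else ""


def is_duplicate(a: Paper, b: Paper, threshold: float = 0.90) -> bool:
    # Tier 1: exact DOI match (compared case-insensitively).
    if a.doi and b.doi and a.doi.lower() == b.doi.lower():
        return True
    # Tier 2: any shared external identifier, e.g. an arXiv ID.
    if any(b.external_ids.get(key) == value for key, value in a.external_ids.items()):
        return True
    # Tier 3: same first-author last name and near-identical normalized title.
    if first_author_last_name(a) != first_author_last_name(b):
        return False
    return similarity(normalize(a.title), normalize(b.title)) >= threshold


def deduplicate(papers: list) -> list:
    """Keep the first occurrence of each paper, in input order."""
    unique = []
    for paper in papers:
        if not any(is_duplicate(paper, kept) for kept in unique):
            unique.append(paper)
    return unique
```

Checking the cheap exact tiers first means the quadratic edit-distance comparison only runs for pairs that share no identifier.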

MCP Server

# Start in stdio mode (for agent integration)
scholarx-mcp --transport stdio

# Start in HTTP mode
scholarx-mcp --transport streamable-http --host 0.0.0.0 --port 9600

MCP Tools

| Tool | Description |
|---|---|
| search_papers | Multi-source search with deduplication |
| get_paper | Single paper by source + ID |
| search_by_author | Author-based search |
| get_recent_papers | Papers from last N days |
| list_sources | Available sources and status |
| list_categories | Categories per source |
| download_paper | Download full PDF |
| get_stored_papers | List locally stored papers |

MCP Prompts

| Prompt | Purpose |
|---|---|
| agent_utilities_enhancement_scan | Scan CS/AI papers for AU concept enhancement opportunities |
| biomimicry_innovation_scan | Scan biology papers for biomimetic agent patterns |

Docker

# Build and run
docker compose up -d

# Debug mode (mounts local source)
docker compose -f compose.yml up --build

Environment Variables

# API Keys (all optional for basic functionality)
OSF_TOKEN=              # OSF/PsyArXiv API token
S2_API_KEY=             # Semantic Scholar (higher rate limits)
NCBI_API_KEY=           # PubMed Central (higher rate limits)

# MCP Server
TRANSPORT=stdio         # stdio | streamable-http
HOST=0.0.0.0
PORT=9600

# Tool Toggles
SEARCHTOOL=True
DISCOVERYTOOL=True
STORAGETOOL=True

# Paper Storage
SCHOLARX_STORAGE_DIR=   # Default: ~/.scholarx/papers/
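
As an illustration of how these variables might be consumed, the sketch below reads them into a typed settings object. The Settings class, the load_settings and _flag helpers, and the set of strings accepted as "true" are hypothetical; how ScholarX itself parses these values is not documented here.

```python
import os
from dataclasses import dataclass
from pathlib import Path


def _flag(name: str, default: bool = True) -> bool:
    """Treat values such as 'True', 'true', or '1' as enabled (assumed parsing)."""
    raw = os.environ.get(name)
    if raw is None or raw == "":
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    transport: str
    host: str
    port: int
    search_tool: bool
    discovery_tool: bool
    storage_tool: bool
    storage_dir: Path


def load_settings() -> Settings:
    # An empty SCHOLARX_STORAGE_DIR falls back to the documented default.
    storage = os.environ.get("SCHOLARX_STORAGE_DIR") or "~/.scholarx/papers/"
    return Settings(
        transport=os.environ.get("TRANSPORT", "stdio"),
        host=os.environ.get("HOST", "0.0.0.0"),
        port=int(os.environ.get("PORT", "9600")),
        search_tool=_flag("SEARCHTOOL"),
        discovery_tool=_flag("DISCOVERYTOOL"),
        storage_tool=_flag("STORAGETOOL"),
        storage_dir=Path(storage).expanduser(),
    )


if __name__ == "__main__":
    print(load_settings())
```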

Architecture

User/Agent
    │
    ▼
┌─────────────────────────┐
│  ScholarX MCP Server    │  12 tools + prompts
│  (mcp_server.py)        │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  ScholarXClient         │  Unified API
│  (api_client.py)        │
└────────┬────────────────┘
         │
    ┌────┼────┬────┬────┬────┬────┐
    ▼    ▼    ▼    ▼    ▼    ▼    ▼
  arXiv PMC bioRx medRx PsyAr OSF  S2    ← Per-source rate limiting
    │    │    │    │    │    │    │
    └────┴────┴────┴────┴────┴────┘
         │
         ▼
┌─────────────────────────┐
│  Deduplication Engine   │  DOI → cross-ID → fuzzy title
│  (deduplication.py)     │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Paper Storage          │  Full PDF download
│  (~/.scholarx/papers/)  │
│         │               │
│         ▼               │
│  KBIngestionEngine      │  → ArticleNode + PersonNode
│  (KG auto-ingest)       │     + SourceNode + KBConceptNode
│         │               │
│    RLM (AU-007)         │  Auto-triggers for >50K char papers
└─────────────────────────┘

Agent OS Subsystem

ScholarX is registered as an Agent OS subsystem alongside:

| Subsystem | Role |
|---|---|
| container-manager-mcp | Infrastructure provisioning |
| systems-manager | Host/OS operations |
| tunnel-manager | Network tunneling |
| repository-manager | Git/repo operations |
| scholarx | Research intelligence |

Maintenance Cron

A SIX_HOURLY maintenance task (scholarx_paper_discovery) automatically:

  1. Checks for new papers across configured categories
  2. Evaluates relevance to Knowledge Graph concepts
  3. Ingests high-relevance papers (score > 0.6)
  4. Produces actionable research digests

Custom watchlists can be added via MaintenanceCron.add_task() or the create_research_watchlist MCP tool.

License

MIT

MCP Configuration Examples

1. Standard IO (stdio) Deployment

{
  "mcpServers": {
    "scholarx": {
      "command": "uv",
      "args": [
        "run",
        "scholarx-mcp"
      ],
      "env": {
        "AGENT_DESCRIPTION": "<YOUR_AGENT_DESCRIPTION>",
        "AGENT_SYSTEM_PROMPT": "<YOUR_AGENT_SYSTEM_PROMPT>",
        "DEFAULT_AGENT_NAME": "<YOUR_DEFAULT_AGENT_NAME>",
        "DISCOVERYTOOL": "True",
        "SEARCHTOOL": "True",
        "STORAGETOOL": "True"
      }
    }
  }
}

2. Streamable HTTP (SSE) Deployment

{
  "mcpServers": {
    "scholarx": {
      "command": "uv",
      "args": [
        "run",
        "scholarx-mcp",
        "--transport",
        "streamable-http",
        "--host",
        "0.0.0.0",
        "--port",
        "8000"
      ],
      "env": {
        "AGENT_DESCRIPTION": "<YOUR_AGENT_DESCRIPTION>",
        "AGENT_SYSTEM_PROMPT": "<YOUR_AGENT_SYSTEM_PROMPT>",
        "DEFAULT_AGENT_NAME": "<YOUR_DEFAULT_AGENT_NAME>",
        "DISCOVERYTOOL": "True",
        "SEARCHTOOL": "True",
        "STORAGETOOL": "True"
      }
    }
  }
}
