
Universal Research Paper API: single entry point for arXiv, PMC, bioRxiv, medRxiv, PsyArXiv, OSF, and Semantic Scholar


ScholarX 📚 - API | MCP | AgentOS


Version: 0.8.0

Universal Research Paper API: a single entry point for querying, downloading, and ingesting research papers from all major preprint and academic repositories.

Overview

ScholarX provides a unified interface to search across 7 paper sources simultaneously, with automatic cross-source deduplication, full PDF downloads, and Knowledge Graph integration. It is registered as an Agent OS subsystem in the genius-agent ecosystem.

Supported Sources

| Source | API | Auth | Rate Limit |
| --- | --- | --- | --- |
| arXiv | Atom/OpenSearch | Free | 1 req/3s |
| PubMed Central | NCBI E-utilities | Optional NCBI_API_KEY | 3 req/s (10 with key) |
| bioRxiv | bioRxiv REST | Free | 1 req/s |
| medRxiv | bioRxiv REST | Free | 1 req/s |
| PsyArXiv | OSF v2 | OSF_TOKEN | 1 req/s |
| OSF | OSF v2 | OSF_TOKEN | 1 req/s |
| Semantic Scholar | Academic Graph v1 | Optional S2_API_KEY | 100 req/min |

Key Features

  • Unified Search: A single SearchQuery model works across all sources
  • 3-Tier Deduplication: DOI exact match → cross-ID mapping → fuzzy title+author (Levenshtein ≥ 0.90)
  • Full Paper Download: Download and store complete PDFs locally (~/.scholarx/papers/)
  • Knowledge Graph Integration: Ingest papers via the existing KBIngestionEngine (ArticleNode, SourceNode, PersonNode)
  • RLM Auto-Trigger: Large papers (>50K chars) are automatically routed through Recursive Language Model decomposition
  • Per-Source Rate Limiting: Token-bucket rate limiter in the abstract provider base class
  • Configurable Watchlists: Register custom research topics as MaintenanceCron tasks
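
To make the per-source rate limiting concrete, here is a minimal token-bucket sketch. It is not ScholarX's actual provider base class: the TokenBucket name, its methods, and the injectable clock are assumptions made for illustration. Only the idea (a bucket refilled at the source's permitted rate) comes from the feature list above.

```python
import time


class TokenBucket:
    """Illustrative token-bucket limiter (hypothetical, not the ScholarX class)."""

    def __init__(self, rate: float, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.clock = clock        # injectable so the limiter can be tested
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)


# One bucket per source, using the limits from the Supported Sources table.
limiters = {
    "arxiv": TokenBucket(rate=1 / 3),      # 1 request every 3 seconds
    "biorxiv": TokenBucket(rate=1.0),      # 1 request per second
    "s2": TokenBucket(rate=100 / 60, capacity=100),  # 100 requests per minute
}
```

A caller would check try_acquire() before each request and sleep for wait_time() seconds when it returns False.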

Installation

# Core (API client only)
pip install scholarx

# With MCP server (quotes keep shells such as zsh from expanding the brackets)
pip install "scholarx[mcp]"

# With agent server
pip install "scholarx[agent]"

# Everything
pip install "scholarx[all]"

Quick Start

Python API

import asyncio
from scholarx.api_client import ScholarXClient
from scholarx.models import SearchQuery, PaperSource

async def main():
    client = ScholarXClient()

    # Search across all sources
    result = await client.search(SearchQuery(
        query="multi-agent orchestration",
        categories=["cs.AI", "cs.MA"],
        max_results=10,
    ))

    for paper in result.papers:
        print(f"[{paper.source}] {paper.title}")
        print(f"  Authors: {', '.join(paper.authors[:3])}")
        print(f"  DOI: {paper.doi}")
        print()

    # Download a paper
    if result.papers:
        path = await client.download_paper(result.papers[0])
        print(f"Downloaded to: {path}")

asyncio.run(main())

CLI

ScholarX includes a rich CLI with progress bars for paper discovery, relevance scoring, and PDF downloads.

# Scan for recent AI papers across 7 CS categories
scholarx scan --query "artificial intelligence" --output-dir ./papers

# Customize categories and result count
scholarx scan --categories cs.AI,cs.LG,cs.CL --max-results 30 --output-dir ./papers

# Use a custom relevance taxonomy
scholarx scan --query "knowledge graphs" --taxonomy custom_taxonomy.json --output-dir ./papers

# Auto-trigger comparative analysis on high-confidence papers
scholarx scan --analyze --output-dir ./papers

# Show stored paper library status
scholarx status

Relevance Scoring

The CLI scores each paper's abstract against a 9-domain weighted keyword taxonomy:

| Domain | Weight | Focus |
| --- | --- | --- |
| Orchestration | 3.0 | Multi-agent, workflow, task decomposition |
| Knowledge Graph | 3.0 | Ontology, OWL, entity relations, graph reasoning |
| Planning & Reasoning | 2.5 | Chain-of-thought, MCTS, deliberation |
| Memory & Retrieval | 2.5 | RAG, episodic memory, continual learning |
| Tool Use | 2.0 | Function calling, MCP, code generation |
| Evaluation & Safety | 2.0 | Benchmarks, red teaming, hallucination |
| Swarm & Evolution | 2.0 | Evolutionary methods, stigmergy, biomimicry |
| LLM Architecture | 1.5 | Transformers, MoE, distillation |
| Human-AI | 1.0 | Human-in-the-loop, decision support |

Papers are classified into three tiers:

  • ✅ Relevant (score ≥ 3.0): Direct value for the target domain
  • 🟡 Marginal (score 1.0–2.9): Potential indirect value
  • ❌ Irrelevant (score < 1.0): Filtered out

Deduplication

ScholarX prevents duplicate downloads through two mechanisms:

  1. Cross-source deduplication (deduplication.py): 3-tier matching removes duplicates when the same paper appears across multiple sources:

    • Tier 1: DOI exact match
    • Tier 2: Cross-ID mapping (arXiv ID ↔ S2 corpus ID via metadata)
    • Tier 3: Normalized title + first-author last name (Levenshtein ≥ 0.90)
  2. Storage deduplication (paper_storage.py): Before downloading, PaperStorage.download_paper() checks if the paper ID's metadata hash already exists in ~/.scholarx/papers/.metadata/. Already-downloaded papers are skipped instantly.
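
The three cross-source tiers can be sketched as follows. This is not the code in deduplication.py: the Paper dataclass, the external_ids field, and every function name are hypothetical. The sketch also assumes that "Levenshtein ≥ 0.90" means a normalized similarity of 1 - distance / max(len), and that Tier 3 requires the first-author last names to match exactly before titles are compared.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Paper:
    title: str
    authors: List[str]
    doi: Optional[str] = None
    external_ids: Dict[str, str] = field(default_factory=dict)  # e.g. {"arxiv": "..."}


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def normalize(title: str) -> str:
    # Lowercase, drop punctuation, collapse whitespace.
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def last_name(paper: Paper) -> str:
    return paper.authors[0].split()[-1].lower() if paper.authors else ""


def is_duplicate(a: Paper, b: Paper, threshold: float = 0.90) -> bool:
    # Tier 1: exact DOI match (case-insensitive).
    if a.doi and b.doi and a.doi.lower() == b.doi.lower():
        return True
    # Tier 2: any shared external identifier with the same value.
    shared = a.external_ids.keys() & b.external_ids.keys()
    if any(a.external_ids[k] == b.external_ids[k] for k in shared):
        return True
    # Tier 3: same first-author last name and near-identical normalized title.
    if last_name(a) and last_name(a) == last_name(b):
        return similarity(normalize(a.title), normalize(b.title)) >= threshold
    return False


def deduplicate(papers: List[Paper]) -> List[Paper]:
    """Keep the first occurrence of each paper; O(n^2), fine for small result sets."""
    unique: List[Paper] = []
    for paper in papers:
        if not any(is_duplicate(paper, kept) for kept in unique):
            unique.append(paper)
    return unique
```

Running the cheap exact checks first means the quadratic-cost edit distance is only computed for pairs that the identifiers could not settle.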

MCP Server

# Start in stdio mode (for agent integration)
scholarx-mcp --transport stdio

# Start in HTTP mode
scholarx-mcp --transport streamable-http --host 0.0.0.0 --port 9600

MCP Tools

| Tool | Description |
| --- | --- |
| search_papers | Multi-source search with deduplication |
| get_paper | Single paper by source + ID |
| search_by_author | Author-based search |
| get_recent_papers | Papers from last N days |
| list_sources | Available sources and status |
| list_categories | Categories per source |
| download_paper | Download full PDF |
| bulk_download_papers | Queue multiple papers for background download |
| download_status | Check status of a queued download job |
| list_downloads | List all background downloads and their status |
| get_stored_papers | List locally stored papers |
| create_research_watchlist | Register a custom research watchlist (see Maintenance Cron) |
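
On the wire, an MCP client invokes any of these tools with a JSON-RPC 2.0 tools/call request. The sketch below builds such a request for search_papers. The envelope follows the MCP specification, but the argument names (query, max_results) are an assumption borrowed from the SearchQuery fields in the Python example, and tool_call_request is a hypothetical helper, so check the tool's published input schema before relying on them.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per session


def tool_call_request(tool: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# Argument names are assumed, not confirmed by the tool schema.
request = tool_call_request(
    "search_papers",
    {"query": "multi-agent orchestration", "max_results": 10},
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library handles this framing; the sketch only shows what travels over the stdio or HTTP transport.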

MCP Prompts

| Prompt | Purpose |
| --- | --- |
| agent_utilities_enhancement_scan | Scan CS/AI papers for AU concept enhancement opportunities |
| biomimicry_innovation_scan | Scan biology papers for biomimetic agent patterns |

Docker

# Build and run
docker compose -f docker/compose.yml up -d

# MCP-specific orchestration
docker compose -f docker/mcp.compose.yml up -d

# Debug mode (mounts local source)
docker compose -f docker/compose.yml up --build

Environment Variables

# API Keys (all optional for basic functionality)
OSF_TOKEN=              # OSF/PsyArXiv API token
S2_API_KEY=             # Semantic Scholar (higher rate limits)
NCBI_API_KEY=           # PubMed Central (higher rate limits)

# MCP Server
TRANSPORT=stdio         # stdio | streamable-http
HOST=0.0.0.0
PORT=9600

# Tool Toggles
SEARCHTOOL=True
DISCOVERYTOOL=True
STORAGETOOL=True

# Paper Storage
SCHOLARX_STORAGE_DIR=   # Default: ~/.scholarx/papers/
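
As an illustration of how these variables might be consumed, the sketch below reads them into a typed settings object. It is not ScholarX's own configuration code: Settings, load_settings, and read_flag are hypothetical names, and the fallback values simply mirror the ones shown in the listing above.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


def read_flag(env: Mapping[str, str], name: str, default: bool = True) -> bool:
    """Parse a boolean toggle such as SEARCHTOOL=True; empty or unset means default."""
    raw = env.get(name, "").strip()
    return default if raw == "" else raw.lower() in TRUTHY


@dataclass(frozen=True)
class Settings:
    transport: str
    host: str
    port: int
    storage_dir: Path
    search_tool: bool
    discovery_tool: bool
    storage_tool: bool
    osf_token: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    storage = env.get("SCHOLARX_STORAGE_DIR") or "~/.scholarx/papers/"
    return Settings(
        transport=env.get("TRANSPORT", "stdio"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "9600")),
        storage_dir=Path(storage).expanduser(),
        search_tool=read_flag(env, "SEARCHTOOL"),
        discovery_tool=read_flag(env, "DISCOVERYTOOL"),
        storage_tool=read_flag(env, "STORAGETOOL"),
        osf_token=env.get("OSF_TOKEN") or None,  # treat an empty value as unset
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests.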

Architecture

User/Agent
    │
    ▼
┌─────────────────────────┐
│  ScholarX MCP Server    │  12 tools + prompts
│  (mcp_server.py)        │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  ScholarXClient         │  Unified API
│  (api_client.py)        │
└────────┬────────────────┘
         │
    ┌────┼────┬────┬────┬────┬────┐
    ▼    ▼    ▼    ▼    ▼    ▼    ▼
  arXiv PMC bioRx medRx PsyAr OSF  S2    ← Per-source rate limiting
    │    │    │    │    │    │    │
    └────┴────┴────┴────┴────┴────┘
         │
         ▼
┌─────────────────────────┐
│  Deduplication Engine   │  DOI → cross-ID → fuzzy title
│  (deduplication.py)     │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Paper Storage          │  Full PDF download
│  (~/.scholarx/papers/)  │
│         │               │
│         ▼               │
│  KBIngestionEngine      │  → ArticleNode + PersonNode
│  (KG auto-ingest)       │     + SourceNode + KBConceptNode
│         │               │
│    RLM (AU-007)         │  Auto-triggers for >50K char papers
└─────────────────────────┘

Agent OS Subsystem

ScholarX is registered as an Agent OS subsystem alongside:

| Subsystem | Role |
| --- | --- |
| container-manager-mcp | Infrastructure provisioning |
| systems-manager | Host/OS operations |
| tunnel-manager | Network tunneling |
| repository-manager | Git/repo operations |
| scholarx | Research intelligence |

Maintenance Cron

A SIX_HOURLY maintenance task (scholarx_paper_discovery) automatically:

  1. Checks for new papers across configured categories
  2. Evaluates relevance to Knowledge Graph concepts
  3. Ingests high-relevance papers (score > 0.6)
  4. Produces actionable research digests

Custom watchlists can be added via MaintenanceCron.add_task() or the create_research_watchlist MCP tool.
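
The four steps of a discovery pass can be sketched as a single function. This is not the scholarx_paper_discovery task itself: discovery_pass and its callable parameters are hypothetical stand-ins for the real fetch, scoring, and ingestion components, and the signature of MaintenanceCron.add_task() is deliberately not shown because it is not documented here. Only the strict score > 0.6 threshold comes from the list above.

```python
from typing import Callable, Dict, Iterable, List


def discovery_pass(
    fetch_new: Callable[[], Iterable[dict]],
    score: Callable[[dict], float],
    ingest: Callable[[dict], None],
    threshold: float = 0.6,
) -> Dict[str, List[str]]:
    """Run one pass: fetch, score, ingest above the threshold, return a digest."""
    ingested: List[str] = []
    skipped: List[str] = []
    for paper in fetch_new():                 # 1. check for new papers
        relevance = score(paper)              # 2. evaluate relevance
        if relevance > threshold:             # 3. ingest only if strictly above 0.6
            ingest(paper)
            ingested.append(f"{relevance:.2f}  {paper['title']}")
        else:
            skipped.append(paper["title"])
    return {"ingested": ingested, "skipped": skipped}  # 4. digest for reporting
```

Keeping the three dependencies as plain callables makes the pass easy to dry-run against canned data before wiring it to a scheduler.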

License

MIT

MCP Configuration Examples

1. Standard IO (stdio) Deployment

{
  "mcpServers": {
    "scholarx": {
      "command": "uv",
      "args": [
        "run",
        "scholarx-mcp"
      ],
      "env": {
        "AGENT_DESCRIPTION": "<YOUR_AGENT_DESCRIPTION>",
        "AGENT_SYSTEM_PROMPT": "<YOUR_AGENT_SYSTEM_PROMPT>",
        "DEFAULT_AGENT_NAME": "<YOUR_DEFAULT_AGENT_NAME>",
        "DISCOVERYTOOL": "True",
        "SEARCHTOOL": "True",
        "STORAGETOOL": "True"
      }
    }
  }
}

2. Streamable HTTP Deployment

{
  "mcpServers": {
    "scholarx": {
      "command": "uv",
      "args": [
        "run",
        "scholarx-mcp",
        "--transport",
        "streamable-http",
        "--host",
        "0.0.0.0",
        "--port",
        "8000"
      ],
      "env": {
        "AGENT_DESCRIPTION": "<YOUR_AGENT_DESCRIPTION>",
        "AGENT_SYSTEM_PROMPT": "<YOUR_AGENT_SYSTEM_PROMPT>",
        "DEFAULT_AGENT_NAME": "<YOUR_DEFAULT_AGENT_NAME>",
        "DISCOVERYTOOL": "True",
        "SEARCHTOOL": "True",
        "STORAGETOOL": "True"
      }
    }
  }
}
