Universal Research Paper API: single entry point for arXiv, PMC, bioRxiv, medRxiv, PsyArXiv, OSF, and Semantic Scholar

ScholarX 📚 - API | MCP | AgentOS

Universal Research Paper API: a single entry point for querying, downloading, and ingesting research papers from all major preprint and academic repositories.

Version: 0.7.0

Overview

ScholarX provides a unified interface to search across 7 paper sources simultaneously, with automatic cross-source deduplication, full PDF downloads, and Knowledge Graph integration. It is registered as an Agent OS subsystem in the genius-agent ecosystem.

Supported Sources

| Source | API | Auth | Rate Limit |
| --- | --- | --- | --- |
| arXiv | Atom/OpenSearch | Free | 1 req/3s |
| PubMed Central | NCBI E-utilities | Optional NCBI_API_KEY | 3 req/s (10 with key) |
| bioRxiv | bioRxiv REST | Free | 1 req/s |
| medRxiv | bioRxiv REST | Free | 1 req/s |
| PsyArXiv | OSF v2 | OSF_TOKEN | 1 req/s |
| OSF | OSF v2 | OSF_TOKEN | 1 req/s |
| Semantic Scholar | Academic Graph v1 | Optional S2_API_KEY | 100 req/min |

Key Features

  • Unified Search: Single SearchQuery model works across all sources
  • 3-Tier Deduplication: DOI exact match → cross-ID mapping → fuzzy title+author (Levenshtein ≥ 0.90)
  • Full Paper Download: Download and store complete PDFs locally (~/.scholarx/papers/)
  • Knowledge Graph Integration: Ingest papers via existing KBIngestionEngine (ArticleNode, SourceNode, PersonNode)
  • RLM Auto-Trigger: Large papers (>50K chars) automatically route through Recursive Language Model decomposition
  • Per-Source Rate Limiting: Token-bucket rate limiter in the abstract provider base class
  • Configurable Watchlists: Register custom research topics as MaintenanceCron tasks
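
To make the per-source rate limiting concrete, here is a minimal token-bucket sketch. It is not ScholarX's actual implementation: the `TokenBucket` class, its methods, and the `SOURCE_RATES` mapping are hypothetical names chosen for illustration. Only the rates themselves come from the Supported Sources table.

```python
import asyncio
import time


class TokenBucket:
    """Token bucket: bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the limiter can be tested
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Consume one token if available; return False instead of waiting."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until one token is available (0.0 if one is ready now)."""
        self._refill()
        return max(0.0, (1.0 - self._tokens) / self.rate)

    async def acquire(self) -> None:
        """Sleep until a token is available, then consume it."""
        while not self.try_acquire():
            await asyncio.sleep(self.wait_time())


# Requests per second, taken from the Supported Sources table (keyless limits).
SOURCE_RATES = {
    "arxiv": 1 / 3,
    "pmc": 3.0,
    "biorxiv": 1.0,
    "medrxiv": 1.0,
    "psyarxiv": 1.0,
    "osf": 1.0,
    "semantic_scholar": 100 / 60,
}
```

A provider would hold one bucket, for example `TokenBucket(SOURCE_RATES["arxiv"])`, and `await bucket.acquire()` before each HTTP request. With a capacity of 1 the bucket allows no bursts, which matches a strict "1 request every 3 seconds" policy.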

Installation

# Core (API client only)
pip install scholarx

# With MCP server (quotes keep shells such as zsh from expanding the brackets)
pip install "scholarx[mcp]"

# With agent server
pip install "scholarx[agent]"

# Everything
pip install "scholarx[all]"

Quick Start

Python API

import asyncio
from scholarx.api_client import ScholarXClient
from scholarx.models import SearchQuery, PaperSource

async def main():
    client = ScholarXClient()

    # Search across all sources
    result = await client.search(SearchQuery(
        query="multi-agent orchestration",
        categories=["cs.AI", "cs.MA"],
        max_results=10,
    ))

    for paper in result.papers:
        print(f"[{paper.source}] {paper.title}")
        print(f"  Authors: {', '.join(paper.authors[:3])}")
        print(f"  DOI: {paper.doi}")
        print()

    # Download a paper
    if result.papers:
        path = await client.download_paper(result.papers[0])
        print(f"Downloaded to: {path}")

asyncio.run(main())

CLI

ScholarX includes a rich CLI with progress bars for paper discovery, relevance scoring, and PDF downloads.

# Scan for recent AI papers across 7 CS categories
scholarx scan --query "artificial intelligence" --output-dir ./papers

# Customize categories and result count
scholarx scan --categories cs.AI,cs.LG,cs.CL --max-results 30 --output-dir ./papers

# Use a custom relevance taxonomy
scholarx scan --query "knowledge graphs" --taxonomy custom_taxonomy.json --output-dir ./papers

# Auto-trigger comparative analysis on high-confidence papers
scholarx scan --analyze --output-dir ./papers

# Show stored paper library status
scholarx status

Relevance Scoring

The CLI scores each paper's abstract against a 9-domain weighted keyword taxonomy:

| Domain | Weight | Focus |
| --- | --- | --- |
| Orchestration | 3.0 | Multi-agent, workflow, task decomposition |
| Knowledge Graph | 3.0 | Ontology, OWL, entity relations, graph reasoning |
| Planning & Reasoning | 2.5 | Chain-of-thought, MCTS, deliberation |
| Memory & Retrieval | 2.5 | RAG, episodic memory, continual learning |
| Tool Use | 2.0 | Function calling, MCP, code generation |
| Evaluation & Safety | 2.0 | Benchmarks, red teaming, hallucination |
| Swarm & Evolution | 2.0 | Evolutionary methods, stigmergy, biomimicry |
| LLM Architecture | 1.5 | Transformers, MoE, distillation |
| Human-AI | 1.0 | Human-in-the-loop, decision support |

Papers are classified into three tiers:

  • ✅ Relevant (score ≥ 3.0): Direct value for the target domain
  • 🟡 Marginal (score 1.0–2.9): Potential indirect value
  • ❌ Irrelevant (score < 1.0): Filtered out
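
The README does not spell out the scoring formula, so the following is only a sketch of how a weighted keyword taxonomy could map an abstract onto the three tiers. It assumes each domain contributes its weight once if any of its keywords appears. The `TAXONOMY` keyword lists are abbreviated stand-ins, and `score_abstract` and `classify` are hypothetical names, not part of the ScholarX API.

```python
import re

# Weights copied from the table above; keyword lists are illustrative only.
TAXONOMY = {
    "Orchestration": (3.0, ["multi-agent", "workflow", "task decomposition"]),
    "Knowledge Graph": (3.0, ["ontology", "owl", "graph reasoning"]),
    "Planning & Reasoning": (2.5, ["chain-of-thought", "mcts", "deliberation"]),
    "Memory & Retrieval": (2.5, ["rag", "episodic memory", "continual learning"]),
    "Tool Use": (2.0, ["function calling", "mcp", "code generation"]),
    "Evaluation & Safety": (2.0, ["benchmark", "red teaming", "hallucination"]),
    "Swarm & Evolution": (2.0, ["evolutionary", "stigmergy", "biomimicry"]),
    "LLM Architecture": (1.5, ["transformer", "moe", "distillation"]),
    "Human-AI": (1.0, ["human-in-the-loop", "decision support"]),
}


def score_abstract(abstract: str) -> float:
    """Sum the weights of every domain with at least one keyword hit."""
    text = abstract.lower()
    score = 0.0
    for weight, keywords in TAXONOMY.values():
        # Word boundaries keep "owl" from matching "knowledge" and
        # "rag" from matching "storage"; an optional "s" allows simple plurals.
        if any(re.search(rf"\b{re.escape(kw)}s?\b", text) for kw in keywords):
            score += weight
    return score


def classify(score: float) -> str:
    """Map a score onto the three tiers listed above."""
    if score >= 3.0:
        return "relevant"
    if score >= 1.0:
        return "marginal"
    return "irrelevant"


if __name__ == "__main__":
    abstract = "We study multi-agent workflow orchestration."
    score = score_abstract(abstract)
    print(score, classify(score))
```

Under this assumption, a single hit in a 3.0-weight domain is enough to reach the Relevant tier, while a paper that only touches LLM Architecture or Human-AI stays Marginal.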

Deduplication

ScholarX prevents duplicate downloads through two mechanisms:

  1. Cross-source deduplication (deduplication.py): 3-tier matching removes duplicates when the same paper appears across multiple sources:

    • Tier 1: DOI exact match
    • Tier 2: Cross-ID mapping (arXiv ID ↔ S2 corpus ID via metadata)
    • Tier 3: Normalized title + first-author last name (Levenshtein ≥ 0.90)
  2. Storage deduplication (paper_storage.py): Before downloading, PaperStorage.download_paper() checks whether the paper ID's metadata hash already exists in ~/.scholarx/papers/.metadata/. Already-downloaded papers are skipped instantly.
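
The three cross-source tiers can be sketched as follows. This is an illustration, not the contents of deduplication.py. The `Paper` record, the `external_ids` field, and all function names are hypothetical. The sketch assumes "Levenshtein ≥ 0.90" means a similarity ratio of 1 − distance / length of the longer title, and that author names are written given-name first.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Paper:
    # Hypothetical minimal record; the real ScholarX model may differ.
    title: str
    authors: list
    source: str
    doi: Optional[str] = None
    external_ids: dict = field(default_factory=dict)  # e.g. {"arxiv": "..."}


def levenshtein(a: str, b: str) -> int:
    """Edit distance using the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def normalize(text: str) -> str:
    """Strip accents, lowercase, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def first_author_last_name(paper: Paper) -> str:
    parts = normalize(paper.authors[0]).split() if paper.authors else []
    return parts[-1] if parts else ""


def is_duplicate(a: Paper, b: Paper, threshold: float = 0.90) -> bool:
    # Tier 1: exact DOI match (DOIs are case-insensitive).
    if a.doi and b.doi and a.doi.lower() == b.doi.lower():
        return True
    # Tier 2: any shared external identifier, such as an arXiv ID.
    if any(b.external_ids.get(k) == v for k, v in a.external_ids.items()):
        return True
    # Tier 3: same first-author last name and near-identical title.
    if first_author_last_name(a) != first_author_last_name(b):
        return False
    return similarity(normalize(a.title), normalize(b.title)) >= threshold


def deduplicate(papers: list, threshold: float = 0.90) -> list:
    """Keep the first occurrence of each paper (quadratic in the result size)."""
    unique = []
    for paper in papers:
        if not any(is_duplicate(paper, kept, threshold) for kept in unique):
            unique.append(paper)
    return unique
```

Differing DOIs deliberately fall through to the later tiers, because a preprint and its published version usually carry different DOIs. The pairwise comparison is fine for a few hundred search results. A larger corpus would want an index keyed on DOI and normalized title.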

MCP Server

# Start in stdio mode (for agent integration)
scholarx-mcp --transport stdio

# Start in HTTP mode
scholarx-mcp --transport streamable-http --host 0.0.0.0 --port 9600

MCP Tools

| Tool | Description |
| --- | --- |
| search_papers | Multi-source search with deduplication |
| get_paper | Single paper by source + ID |
| search_by_author | Author-based search |
| get_recent_papers | Papers from last N days |
| list_sources | Available sources and status |
| list_categories | Categories per source |
| download_paper | Download full PDF |
| get_stored_papers | List locally stored papers |

MCP Prompts

| Prompt | Purpose |
| --- | --- |
| agent_utilities_enhancement_scan | Scan CS/AI papers for AU concept enhancement opportunities |
| biomimicry_innovation_scan | Scan biology papers for biomimetic agent patterns |

Docker

# Build and run
docker compose up -d

# Debug mode (mounts local source)
docker compose -f compose.yml up --build

Environment Variables

# API Keys (all optional for basic functionality)
OSF_TOKEN=              # OSF/PsyArXiv API token
S2_API_KEY=             # Semantic Scholar (higher rate limits)
NCBI_API_KEY=           # PubMed Central (higher rate limits)

# MCP Server
TRANSPORT=stdio         # stdio | streamable-http
HOST=0.0.0.0
PORT=9600

# Tool Toggles
SEARCHTOOL=True
DISCOVERYTOOL=True
STORAGETOOL=True

# Paper Storage
SCHOLARX_STORAGE_DIR=   # Default: ~/.scholarx/papers/
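
For scripts that need to honor the same variables, the sketch below shows one way to read them with the defaults listed above. The `Settings` dataclass and `load_settings` function are hypothetical helpers, not part of the ScholarX package. The sketch assumes an empty value means "unset" and that tool toggles are case-insensitive `True`/`False` strings.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    transport: str
    host: str
    port: int
    storage_dir: Path
    search_tool: bool
    discovery_tool: bool
    storage_tool: bool
    osf_token: Optional[str]
    s2_api_key: Optional[str]
    ncbi_api_key: Optional[str]


def _flag(env: Mapping[str, str], name: str, default: bool = True) -> bool:
    """Parse a True/False toggle; empty or missing values use the default."""
    raw = env.get(name, "").strip().lower()
    if not raw:
        return default
    return raw in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return Settings(
        transport=env.get("TRANSPORT") or "stdio",
        host=env.get("HOST") or "0.0.0.0",
        port=int(env.get("PORT") or 9600),
        # Empty SCHOLARX_STORAGE_DIR falls back to the documented default.
        storage_dir=Path(env.get("SCHOLARX_STORAGE_DIR") or "~/.scholarx/papers/").expanduser(),
        search_tool=_flag(env, "SEARCHTOOL"),
        discovery_tool=_flag(env, "DISCOVERYTOOL"),
        storage_tool=_flag(env, "STORAGETOOL"),
        # API keys are optional; empty strings become None.
        osf_token=env.get("OSF_TOKEN") or None,
        s2_api_key=env.get("S2_API_KEY") or None,
        ncbi_api_key=env.get("NCBI_API_KEY") or None,
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test, for example `load_settings({"PORT": "8000"})`.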

Architecture

User/Agent
    │
    ▼
┌─────────────────────────┐
│  ScholarX MCP Server    │  12 tools + prompts
│  (mcp_server.py)        │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  ScholarXClient         │  Unified API
│  (api_client.py)        │
└────────┬────────────────┘
         │
    ┌────┼────┬────┬────┬────┬────┐
    ▼    ▼    ▼    ▼    ▼    ▼    ▼
  arXiv PMC bioRx medRx PsyAr OSF  S2    ← Per-source rate limiting
    │    │    │    │    │    │    │
    └────┴────┴────┴────┴────┴────┘
         │
         ▼
┌─────────────────────────┐
│  Deduplication Engine   │  DOI → cross-ID → fuzzy title
│  (deduplication.py)     │
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Paper Storage          │  Full PDF download
│  (~/.scholarx/papers/)  │
│         │               │
│         ▼               │
│  KBIngestionEngine      │  → ArticleNode + PersonNode
│  (KG auto-ingest)       │     + SourceNode + KBConceptNode
│         │               │
│    RLM (AU-007)         │  Auto-triggers for >50K char papers
└─────────────────────────┘

Agent OS Subsystem

ScholarX is registered as an Agent OS subsystem alongside:

| Subsystem | Role |
| --- | --- |
| container-manager-mcp | Infrastructure provisioning |
| systems-manager | Host/OS operations |
| tunnel-manager | Network tunneling |
| repository-manager | Git/repo operations |
| scholarx | Research intelligence |

Maintenance Cron

A SIX_HOURLY maintenance task (scholarx_paper_discovery) automatically:

  1. Checks for new papers across configured categories
  2. Evaluates relevance to Knowledge Graph concepts
  3. Ingests high-relevance papers (score > 0.6)
  4. Produces actionable research digests

Custom watchlists can be added via MaintenanceCron.add_task() or the create_research_watchlist MCP tool.

License

MIT

MCP Configuration Examples

1. Standard IO (stdio) Deployment

{
  "mcpServers": {
    "scholarx": {
      "command": "uv",
      "args": [
        "run",
        "scholarx-mcp"
      ],
      "env": {
        "AGENT_DESCRIPTION": "<YOUR_AGENT_DESCRIPTION>",
        "AGENT_SYSTEM_PROMPT": "<YOUR_AGENT_SYSTEM_PROMPT>",
        "DEFAULT_AGENT_NAME": "<YOUR_DEFAULT_AGENT_NAME>",
        "DISCOVERYTOOL": "True",
        "SEARCHTOOL": "True",
        "STORAGETOOL": "True"
      }
    }
  }
}

2. Streamable HTTP (SSE) Deployment

{
  "mcpServers": {
    "scholarx": {
      "command": "uv",
      "args": [
        "run",
        "scholarx-mcp",
        "--transport",
        "streamable-http",
        "--host",
        "0.0.0.0",
        "--port",
        "8000"
      ],
      "env": {
        "AGENT_DESCRIPTION": "<YOUR_AGENT_DESCRIPTION>",
        "AGENT_SYSTEM_PROMPT": "<YOUR_AGENT_SYSTEM_PROMPT>",
        "DEFAULT_AGENT_NAME": "<YOUR_DEFAULT_AGENT_NAME>",
        "DISCOVERYTOOL": "True",
        "SEARCHTOOL": "True",
        "STORAGETOOL": "True"
      }
    }
  }
}
