A Python package for arXiv paper access with CLI and MCP server support

deepxiv-sdk

DeepXiv is an agent-first paper search and progressive reading tool.

Install it with pip, start using it immediately, and let the CLI auto-register an API token on first use — no setup needed before your first query.

🚦 Service status — live status page

  • 🟢 arXiv retrieval & reading — online. We aim for a T+1 sync with arXiv (subject to arXiv's own ~1-day API latency).
  • 🔴 bioRxiv / medRxiv: temporarily down due to a server-side issue. We're working to restore it as soon as possible. Related commands return 503 in the meantime.
  • 🔑 Lost your token? Recover it at data.rag.ac.cn/token-lookup (Google sign-in supported).
  • ℹ️ Data processing is currently trying a broader mix of models. If a TLDR looks off (e.g. truncated thinking content), please open an issue — we'll fix it.

🚀 Live Demo: built on the deepxiv CLI in ~1 hour with vibe coding — try the DeepResearch demo. A full-stack research platform is on the way.


What DeepXiv Does

DeepXiv is built around two workflows that matter for agents:

  1. Search + progressive content access — read papers in layers, not all at once.
  2. Trending + popularity signals — find what's worth reading right now.

The core idea: an agent should search first, judge quickly, then read only the most valuable parts — instead of blindly loading full papers.

Quick Start

pip install deepxiv-sdk

On first use, deepxiv auto-registers a free anonymous token (1,000 requests/day) and saves it to ~/.env:

deepxiv search "agentic memory" --limit 5

For the full stack (MCP server + built-in research agent):

pip install "deepxiv-sdk[all]"

Progressive Reading: search → judge → read

The CLI is the primary interface. A few flags drive layered reading so agents don't load full papers unless they truly need to:

deepxiv search "agentic memory" --limit 5     # 1. find candidates
deepxiv paper 2409.05591 --brief              # 2. decide if it's worth reading
deepxiv paper 2409.05591 --head               # 3. inspect structure & token distribution
deepxiv paper 2409.05591 --section Method     # 4. read only the valuable parts
  • --brief — title, TLDR, keywords, citations, GitHub URL
  • --head — sections overview and token distribution
  • --section NAME — read a single section (e.g. Introduction, Method, Experiments)
  • --preview / --raw / (no flag) — ~10k-char preview / full markdown / full paper
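
The search → judge → read loop above can be sketched in Python. This is a minimal illustration, not the SDK's own implementation: the reader is passed in as any object exposing search, brief, head and section (the method names listed in the Python SDK section), and because the layout of the brief and head payloads is not described here, the two callbacks is_relevant and pick_sections are hypothetical and supplied by the caller.

```python
from typing import Any, Callable, Iterable


def progressive_read(
    reader: Any,
    query: str,
    is_relevant: Callable[[Any], bool],
    pick_sections: Callable[[Any], Iterable[str]],
    limit: int = 5,
) -> dict[str, dict[str, Any]]:
    """Search, judge each hit from its brief, then read only chosen sections.

    `is_relevant` and `pick_sections` are hypothetical callbacks; the shape
    of the brief and head payloads is deliberately not assumed here.
    """
    selected: dict[str, dict[str, Any]] = {}
    results = reader.search(query, size=limit)        # 1. find candidates
    for paper in results["result"]:
        paper_id = paper["arxiv_id"]
        brief = reader.brief(paper_id)                # 2. cheap judgement
        if not is_relevant(brief):
            continue                                  # skip without reading more
        head = reader.head(paper_id)                  # 3. inspect structure
        selected[paper_id] = {
            name: reader.section(paper_id, name)      # 4. read selectively
            for name in pick_sections(head)
        }
    return selected
```

The point of the design is that head and section are only called for papers that pass the brief check, so rejected candidates cost one request each.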

CLI Reference

Search papers

Basic search (arXiv by default):

deepxiv search "transformer" --limit 10
deepxiv search "agentic memory" --limit 20 --format json

Filter by author, org, and category (comma-separated):

deepxiv search "image generation" \
  --authors "Shitao Xiao,Zheng Liu" \
  --orgs "Beijing Academy of Artificial Intelligence" \
  --categories cs.CV \
  --limit 5

--authors and --orgs are filters and ranking signals; --categories is a pure filter.

Filter by venue (--venue is repeatable; common aliases match automatically):

deepxiv search "diffusion model" --venue NeurIPS --limit 5
deepxiv search "language model" --venue NeurIPS --venue ICLR --limit 5

# Add a conference year (when the venue's year is indexed for those papers):
deepxiv search "diffusion model" --venue NeurIPS --venue-year 2025 --limit 5

--venue NeurIPS also matches NIPS / Neural Information Processing Systems (likewise ICLR ↔ International Conference on Learning Representations, CVPR ↔ Computer Vision and Pattern Recognition, …). Matching results carry venue and venue_year fields. Note that venue alias matching is rule-based, so it may not always be exact; we're continuously improving it.

Filter by date and citations. --date-from / --date-to accept YYYY, YYYY-MM, or YYYY-MM-DD:

# Papers from June 2025 onward
deepxiv search "image generation" --date-from 2025-06 --limit 5

# A date floor plus a citation floor
deepxiv search "diffusion models" --date-from 2024-01 --min-citations 50 --limit 5

⚠️ Filters stack with AND. A narrow single-month window combined with a high citation floor on a very specific query can legitimately return 0 results — if a search comes back empty, broaden the date range or lower --min-citations.
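
One way to act on this advice programmatically is to retry with looser filters when a search comes back empty. The sketch below is an assumption-laden illustration: the relaxation order (halve the citation floor, drop it, then drop the date floor) is an arbitrary choice, and the search argument is any callable accepting date_from and min_citation keyword arguments, as reader.search does.

```python
from typing import Any, Callable, Optional


def search_with_relaxation(
    search: Callable[..., dict],
    query: str,
    date_from: Optional[str] = None,
    min_citation: Optional[int] = None,
) -> tuple[dict, dict[str, Any]]:
    """Retry a search with progressively looser filters until it returns hits.

    Returns the response together with the filters that produced it.
    """
    attempts = [{"date_from": date_from, "min_citation": min_citation}]
    if min_citation:
        # Loosen the citation floor first, keeping the date window.
        attempts.append({"date_from": date_from, "min_citation": min_citation // 2})
        attempts.append({"date_from": date_from, "min_citation": None})
    if date_from:
        # Last resort: drop the date floor as well.
        attempts.append({"date_from": None, "min_citation": None})

    response: dict = {"status": "success", "total_count": 0, "result": []}
    filters: dict[str, Any] = attempts[0]
    for filters in attempts:
        response = search(query, **filters)
        if response.get("result"):
            break
    return response, filters
```

With the SDK this would be called as search_with_relaxation(reader.search, "diffusion models", date_from="2025-06", min_citation=50); the returned filters tell the caller how far the query had to be loosened.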

Advanced date filter (exact / after / before / between):

# exact month
deepxiv search "image generation" --date-search-type exact --date-str 2025-06 --limit 5

# between: pass --date-str twice (start, end)
deepxiv search "image generation" \
  --date-search-type between --date-str 2025-06-01 --date-str 2025-07-01 --limit 5
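
The convenience flags and the advanced date filter express the same constraints. The sketch below shows one plausible mapping from date_from / date_to to date_search_type + date_str; the exact conversion the SDK performs is not spelled out here, so the rules (from only → after, to only → before, both → between) and whether the bounds are inclusive are assumptions, and to_advanced_date_filter is a hypothetical helper name.

```python
import re
from typing import Optional, Union

# Accepts YYYY, YYYY-MM, or YYYY-MM-DD (format check only, no calendar check).
_DATE_RE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def to_advanced_date_filter(
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
) -> dict[str, Union[str, list[str]]]:
    """Translate a date floor/ceiling into date_search_type + date_str."""
    for value in (date_from, date_to):
        if value is not None and not _DATE_RE.fullmatch(value):
            raise ValueError(f"expected YYYY, YYYY-MM or YYYY-MM-DD, got {value!r}")
    if date_from and date_to:
        return {"date_search_type": "between", "date_str": [date_from, date_to]}
    if date_from:
        return {"date_search_type": "after", "date_str": date_from}
    if date_to:
        return {"date_search_type": "before", "date_str": date_to}
    return {}
```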

Pagination and reranking:

deepxiv search "LLM alignment" --limit 10 --offset 10        # page 2
deepxiv search "transformer model" --use-fine-rerank --limit 10   # opt-in fine rerank (off by default)

The JSON payload follows {status, total_count, result: [...]} — see Python SDK.
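
Pagination over that payload can be wrapped in a generator. This is a sketch under stated assumptions: it relies only on the result list and on the size / offset parameters, uses the per-request cap of 100 and the offset cap of 10000 mentioned elsewhere in this README, and stops when a page comes back short instead of trusting any particular meaning of total_count. The name iter_results is hypothetical.

```python
from typing import Any, Callable, Iterator

MAX_PAGE_SIZE = 100   # per-request cap stated in this README
MAX_OFFSET = 10_000   # offset cap stated in this README


def iter_results(
    search: Callable[..., dict],
    query: str,
    page_size: int = MAX_PAGE_SIZE,
    **filters: Any,
) -> Iterator[dict]:
    """Yield search results page by page until a short or empty page."""
    page_size = max(1, min(page_size, MAX_PAGE_SIZE))
    offset = 0
    while offset <= MAX_OFFSET:
        page = search(query, size=page_size, offset=offset, **filters)
        items = page.get("result", [])
        yield from items
        if len(items) < page_size:
            break   # last page reached
        offset += len(items)
```

Because it is a generator, a caller can stop early (for example after the first 30 relevant papers) without paying for the remaining pages.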

Read a paper

deepxiv paper 2409.05591                       # full paper
deepxiv paper 2409.05591 --brief               # quick summary
deepxiv paper 2409.05591 --head                # metadata + sections
deepxiv paper 2409.05591 --section Introduction
deepxiv paper 2409.05591 --preview             # ~10k chars

Trending and popularity

deepxiv trending --days 7 --limit 30      # hottest recent papers (social signals)
deepxiv paper 2409.05591 --popularity     # per-paper views, tweets, likes, replies

Web search

deepxiv wsearch "karpathy"
deepxiv wsearch "karpathy" --json

Each wsearch request costs 20 scores (other requests cost 1). An anonymous token gets 1,000 scores/day (~50 web searches); a registered token gets 10,000/day (~500 web searches).
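
The arithmetic behind those figures is easy to reproduce when budgeting a mixed workload. The sketch below only encodes the numbers given above (20 scores per web search, 1 per other request, 1,000 or 10,000 scores per day); the function name is hypothetical and nothing here queries the service for real usage.

```python
WSEARCH_COST = 20   # scores per web search
DEFAULT_COST = 1    # scores per other request

DAILY_SCORES = {"anonymous": 1_000, "registered": 10_000}


def remaining_web_searches(
    token_type: str,
    other_requests_made: int = 0,
    web_searches_made: int = 0,
) -> int:
    """Return how many more web searches today's score budget allows."""
    budget = DAILY_SCORES[token_type]
    spent = other_requests_made * DEFAULT_COST + web_searches_made * WSEARCH_COST
    return max(0, budget - spent) // WSEARCH_COST
```

For example, an anonymous token that has already made 210 ordinary requests and 10 web searches has spent 410 scores, leaving room for 29 more web searches.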

Semantic Scholar metadata by ID

deepxiv sc 258001
deepxiv sc 258001 --json

Useful when your workflow already holds Semantic Scholar IDs. A Semantic Scholar search service (returning these IDs directly) is coming soon.

PMC biomedical papers

deepxiv pmc PMC544940 --head
deepxiv pmc PMC544940

bioRxiv & medRxiv preprints

🔴 Temporarily unavailable. The bioRxiv / medRxiv service is down due to a server-side issue and currently returns 503. We're working to restore it as soon as possible — see the live status page. The commands below are documented for when it's back online.

Preprint search shares the unified retrieve endpoint with arXiv (same filters as above):

# Search
deepxiv search "protein design" --biorxiv --limit 5
deepxiv search "Alzheimer" --medrxiv --date-from 2024-01

# Fetch a paper by DOI
deepxiv biorxiv 10.1101/2021.02.26.433129
deepxiv biorxiv 10.1101/2021.02.26.433129 --format text
deepxiv biorxiv 10.1101/2021.02.26.433129 --section Introduction,Methods
deepxiv medrxiv 10.1101/2025.08.11.25333149 --format text

# Or via flags on the paper command
deepxiv paper 10.1101/2021.02.26.433129 --biorxiv --section Introduction

Agent Workflows

Two ready-to-use workflows ship as reusable skills:

Review recent hot papers: skills/deepxiv-trending-digest/SKILL.md

deepxiv trending --days 7 --limit 30 --json
# then: --brief each → --head the promising ones → read key sections → write a report

Enter a new research topic: skills/deepxiv-baseline-table/SKILL.md

deepxiv search "agentic memory" --date-from 2026-03-01 --limit 100 --format json
# then: batch-brief → prioritize GitHub links → --head experiments → build a baseline table

Python SDK

from deepxiv_sdk import Reader

reader = Reader()

# Unified retrieve endpoint; arXiv by default.
results = reader.search("agent memory", size=5)
for paper in results["result"]:
    print(paper["arxiv_id"], paper["score"], paper["title"])

# Progressive reading
brief = reader.brief("2409.05591")
head = reader.head("2409.05591")
intro = reader.section("2409.05591", "Introduction")

# Other endpoints
web = reader.websearch("karpathy")
sc_meta = reader.semantic_scholar("258001")

reader.search() parameters

reader.search(
    query,
    size=10,                  # → upstream top_k (1~100); you can also pass top_k=
    offset=0,                 # 0~10000
    source="arxiv",           # "arxiv" | "biorxiv" | "medrxiv"
    categories=None,          # list[str]; filter only
    authors=None,             # list[str]; filter + ranking signal
    orgs=None,                # list[str]; filter + ranking signal
    venue=None,               # str | list[str]; aliases match (NeurIPS↔NIPS)
    venues=None,              # plural alias for venue; merged with it
    venue_year=None,          # int | str; e.g. 2025
    min_citation=None,
    date_from=None,           # convenience; "YYYY" / "YYYY-MM" / "YYYY-MM-DD"
    date_to=None,
    date_search_type=None,    # advanced: "between" | "exact" | "after" | "before"
    date_str=None,            # advanced: str or [start, end]
    use_fine_rerank=False,    # SDK default off (cheaper); set True for better ordering
)

Response shape:

{
  "status": "success",
  "total_count": 3,
  "result": [
    {
      "arxiv_id": "2506.18871",    // biorxiv_id / medrxiv_id when source != arxiv
      "title": "...", "score": 0.9475, "abstract": "...", "tldr": "...",
      "authors": [{ "name": "...", "orgs": ["..."] }],
      "url": "...", "date": "2025-06-23T17:38:54Z",
      "citation_count": 217, "categories": ["cs.CV"],
      "venue": "NeurIPS", "venue_year": 2025   // present when venue data exists
    }
  ]
}
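
Since the ID field name depends on the source, code that handles mixed results may want a small accessor. The following is a sketch based only on the fields shown in the response shape above; paper_id and summarize are hypothetical helper names, and optional fields are read defensively because venue data is only present when it exists.

```python
from typing import Optional

ID_FIELDS = ("arxiv_id", "biorxiv_id", "medrxiv_id")


def paper_id(item: dict) -> Optional[str]:
    """Return whichever source-specific ID the result item carries."""
    for field in ID_FIELDS:
        if item.get(field):
            return item[field]
    return None


def summarize(response: dict) -> list[str]:
    """Build one compact line per result: id, citations, venue, title."""
    lines = []
    for item in response.get("result", []):
        venue = item.get("venue")
        if venue and item.get("venue_year"):
            where = f"{venue} {item['venue_year']}"
        else:
            where = venue or "no venue"
        lines.append(
            f"{paper_id(item)} | {item.get('citation_count', 0)} citations"
            f" | {where} | {item.get('title', '')}"
        )
    return lines
```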

Reader methods

reader.brief(arxiv_id)             # title, TLDR, keywords, citations, GitHub URL
reader.head(arxiv_id)              # metadata + sections overview
reader.section(arxiv_id, name)     # one section
reader.preview(arxiv_id)           # ~10k-char preview
reader.raw(arxiv_id)               # full markdown
reader.json(arxiv_id)              # structured JSON
reader.websearch(query)            # web search (costs 20 scores)
reader.semantic_scholar(sc_id)     # metadata by Semantic Scholar ID
reader.trending(days=7, limit=30)  # trending papers
reader.social_impact(arxiv_id)     # popularity metrics
reader.pmc_head(pmc_id)            # PMC metadata
reader.pmc_json(pmc_id)            # full PMC JSON

🔴 bioRxiv / medRxiv access — reader.search(source="biorxiv"|"medrxiv"), reader.biomed_data(...), and reader.biomed_search(...) — is temporarily down (server-side issue). See the status banner above.

Search API changes (2026-04) — migration notes from the old Elasticsearch-style interface

The search backend moved to the unified /arxiv/?type=retrieve service. The SDK keeps parameter names where possible:

| Parameter | Status | Notes |
| --- | --- | --- |
| size | kept | Mapped to upstream top_k. top_k= also accepted. |
| offset | kept | Capped at 0~10000. |
| categories, authors, min_citation | kept | Same semantics. |
| source | new | "arxiv" (default), "biorxiv", "medrxiv". reader.biomed_search() is now a thin wrapper. |
| orgs | new | Org filter; also influences ranking. |
| venue / venues / venue_year | new | Filter by publication venue (str or list; aliases like NeurIPS↔NIPS match automatically) and conference year. venue and venues are equivalent. |
| date_search_type / date_str | new | between / exact / after / before. |
| date_from / date_to | kept (mapped) | Auto-converted to date_search_type + date_str; now also accept YYYY / YYYY-MM. |
| use_fine_rerank | new | Upstream default True; SDK defaults to False. |
| search_mode / bm25_weight / vector_weight | deprecated | Accepted but ignored (warning logged). |
| search_funcs, return_contents, return_roc | not exposed | Always default. Use reader.raw() / section() / json() for content. |

Response migration: {total, took, results} → {status, total_count, result}; per-item ID is arxiv_id / biorxiv_id / medrxiv_id; paper["citation"] → paper["citation_count"]. On the CLI, --limit maps to size, --mode is a deprecated no-op, and --biorxiv / --medrxiv switch the source.
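
Callers that still expect the old payload can bridge the gap with a thin adapter while they migrate. This sketch covers only the renames listed in the migration notes; to_legacy_response is a hypothetical helper, the old took value cannot be recovered from the new shape and is set to None, and the old per-item ID field name is not documented here, so IDs are left untouched.

```python
def to_legacy_response(response: dict) -> dict:
    """Map {status, total_count, result} back to {total, took, results}."""
    legacy_items = []
    for item in response.get("result", []):
        legacy = dict(item)  # shallow copy so the new-style payload is not mutated
        if "citation_count" in legacy:
            legacy["citation"] = legacy.pop("citation_count")
        legacy_items.append(legacy)
    return {
        "total": response.get("total_count", len(legacy_items)),
        "took": None,  # not reported by the new response shape
        "results": legacy_items,
    }
```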


Agent Integration

DeepXiv works well inside Codex, Claude Code, OpenClaw, and similar agent runtimes.

MCP Server

Add to your Claude Desktop MCP config file:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json
{
  "mcpServers": {
    "deepxiv": {
      "command": "deepxiv",
      "args": ["serve"],
      "env": { "DEEPXIV_TOKEN": "your_token_here" }
    }
  }
}
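
If the config file already lists other MCP servers, merging the entry by script avoids overwriting them. The sketch below uses only the standard library and the paths and entry shown above; default_config_path and add_deepxiv_server are hypothetical helper names, not part of the deepxiv CLI.

```python
import json
import os
import sys
from pathlib import Path


def default_config_path() -> Path:
    """Return the Claude Desktop config path for the current platform."""
    if sys.platform == "darwin":
        return Path.home() / "Library/Application Support/Claude/claude_desktop_config.json"
    if sys.platform.startswith("win"):
        return Path(os.environ["APPDATA"]) / "Claude" / "claude_desktop_config.json"
    return Path.home() / ".config/Claude/claude_desktop_config.json"


def add_deepxiv_server(config_path: Path, token: str) -> dict:
    """Insert or update the deepxiv entry, preserving everything else."""
    config: dict = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        config = json.loads(text) if text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers["deepxiv"] = {
        "command": "deepxiv",
        "args": ["serve"],
        "env": {"DEEPXIV_TOKEN": token},
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

A typical call would be add_deepxiv_server(default_config_path(), "your_token_here"), followed by restarting Claude Desktop so it picks up the change.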

Available MCP tools:

| Tool | Description |
| --- | --- |
| search_papers | Search arXiv papers |
| get_paper_brief | Quick summary |
| get_paper_metadata | Full metadata |
| get_paper_section | Read specific section |
| get_full_paper | Complete paper |
| get_paper_preview | Paper preview |
| get_pmc_metadata | PMC paper metadata |
| get_pmc_full | Complete PMC paper |

CLI Skill

mkdir -p $CODEX_HOME/skills
ln -s "$(pwd)/skills/deepxiv-cli" $CODEX_HOME/skills/deepxiv-cli

For frameworks without native skill support, load skills/deepxiv-cli/SKILL.md as operating instructions.

Built-in Research Agent

If you don't want to compose workflows yourself, the CLI ships a ReAct agent (install with pip install "deepxiv-sdk[all]"). It works with any OpenAI-compatible API (OpenAI, DeepSeek, OpenRouter, local Ollama, …) and runs multi-turn search → read → reason.

deepxiv agent config   # configure LLM API (stored locally only)
deepxiv agent query "What are the latest papers about agent memory?" --verbose

The same agent is available from Python:

from deepxiv_sdk import Agent

agent = Agent(api_key="your_key", base_url="https://api.deepseek.com/v1", model="deepseek-chat")
print(agent.query("Compare key ideas in transformers and attention mechanisms"))

Token Management

deepxiv resolves the token from (in order) the --token option, the DEEPXIV_TOKEN env var, then ~/.env. On first use it auto-registers one for you.

deepxiv search "agent"                          # auto-register on first use (recommended)
deepxiv config --token YOUR_TOKEN               # save to ~/.env
export DEEPXIV_TOKEN="your_token"               # or use an env var
deepxiv paper 2409.05591 --token YOUR_TOKEN     # or pass per command

| Token type | Daily limit | How to get |
| --- | --- | --- |
| Auto-registered (anonymous) | 1,000 requests | Automatic on first CLI use |
| Registered | 10,000 requests | data.rag.ac.cn/register |
| Custom / higher | Contact us | Email tommy[at]chien.io with your use case |

Free test papers (no token required) — arXiv: 2409.05591, 2504.21776; PMC: PMC544940, PMC514704.
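
The resolution order described in this section (explicit option, then DEEPXIV_TOKEN, then ~/.env) can be mirrored in scripts that wrap the CLI. This is a sketch of that precedence, not the SDK's own code: it assumes the token is stored in ~/.env under the key DEEPXIV_TOKEN, which is not confirmed above, and resolve_token / read_dotenv_token are hypothetical names.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

TOKEN_VAR = "DEEPXIV_TOKEN"  # key name inside ~/.env is an assumption


def read_dotenv_token(dotenv_path: Path) -> Optional[str]:
    """Read TOKEN_VAR from a simple KEY=VALUE file, if present."""
    if not dotenv_path.is_file():
        return None
    for raw in dotenv_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == TOKEN_VAR:
            return value.strip().strip("'\"") or None
    return None


def resolve_token(
    cli_token: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
    dotenv_path: Optional[Path] = None,
) -> Optional[str]:
    """Apply the precedence: --token option, env var, then ~/.env."""
    environ = os.environ if environ is None else environ
    dotenv_path = Path.home() / ".env" if dotenv_path is None else dotenv_path
    return cli_token or environ.get(TOKEN_VAR) or read_dotenv_token(dotenv_path)
```

A None result corresponds to the case where the CLI would auto-register a token on first use.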

Error Handling

from deepxiv_sdk import (
    Reader,
    AuthenticationError,  # 401 - invalid or expired token
    RateLimitError,       # 429 - daily limit reached
    NotFoundError,        # 404 - paper not found
    ServerError,          # 5xx - server error
    APIError,             # other API errors
)

reader = Reader()

try:
    paper = reader.brief("2409.05591")
except AuthenticationError:
    print("Please update your token")
except RateLimitError:
    print("Daily limit reached")
except NotFoundError:
    print("Paper not found")
except ServerError:
    print("Server error, try again later")
except APIError as e:
    print(f"API error: {e}")

Troubleshooting

  • Do I need a token? No — some papers are free, and a token is auto-created on first use.
  • Max search results? 100 per request; use --offset / offset= to paginate.
  • A search returns 0 results? Loosen filters — stacked --date-* + --min-citations constraints can over-narrow the result set.
  • Timeouts? The Reader retries (max 3) with exponential backoff. Customize with Reader(timeout=120, max_retries=5).
  • Can I cache content? Yes — cache locally after fetching; paper content doesn't change.
  • Which LLMs does the agent support? Any OpenAI-compatible API (OpenAI, DeepSeek, OpenRouter, local Ollama, …).
  • Agent errors with "Reasoning content is only supported as the last assistant message"? Thinking/reasoning models (MiMo, DeepSeek-R1, …) need thinking disabled for multi-round tool use. Use deepxiv agent query "…" --disable-thinking, or in Python Agent(..., enable_thinking=False) (equivalently extra_body={"enable_thinking": False}).
  • Agent keeps retrying a failing tool? When the data service is down, the agent now trips a circuit breaker after a few consecutive service failures and returns a best-effort answer instead of looping. Tune with Agent(..., max_consecutive_failures=N) (0 disables it).
  • agent.add_paper() on a brand-new paper? It returns False (instead of raising) when the paper isn't found or isn't indexed yet — very recent papers (<1–3 days old) often aren't. Genuine errors (auth, rate limit, 5xx) still raise. To handle the exception directly: from deepxiv_sdk import NotFoundError (also available as from deepxiv_sdk.exceptions import NotFoundError).
  • bioRxiv / medRxiv returns 503? Known outage — see the status page.
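
The circuit-breaker behaviour mentioned in the list above can be illustrated with a small counter. This is a generic sketch of the pattern, not the agent's actual implementation: the class and function names are hypothetical, the default threshold of 3 is an assumption standing in for "a few", and only the "0 disables it" rule is taken from this README.

```python
from typing import Any, Callable, Iterable


class ConsecutiveFailureBreaker:
    """Trip after N failures in a row; any success resets the count."""

    def __init__(self, max_consecutive_failures: int = 3) -> None:
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = 0

    def record(self, succeeded: bool) -> None:
        self.failures = 0 if succeeded else self.failures + 1

    @property
    def tripped(self) -> bool:
        if self.max_consecutive_failures == 0:
            return False  # 0 disables the breaker
        return self.failures >= self.max_consecutive_failures


def run_tools(calls: Iterable[Callable[[], Any]],
              breaker: ConsecutiveFailureBreaker) -> list[Any]:
    """Run tool calls in order, stopping early once the breaker trips."""
    results = []
    for call in calls:
        if breaker.tripped:
            break  # give up and return what we have (best-effort answer)
        try:
            results.append(call())
            breaker.record(True)
        except Exception:  # sketch only; real code would catch service errors
            breaker.record(False)
    return results
```

Counting consecutive failures, and not total failures, means an occasional transient error does not stop a long session, while a sustained outage ends the loop quickly.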

Examples

See examples/: quickstart.py, example_reader.py, example_agent.py, example_advanced.py, example_error_handling.py.

Roadmap & Coverage

DeepXiv is moving toward an academic paper data interface at 100M+ scale, increasingly using Semantic Scholar metadata as the base layer:

  1. Full arXiv coverage with T+1 automatic updates
  2. anyXiv coverage (bioRxiv, medRxiv, …)
  3. Full open-access literature coverage

| Source | Status |
| --- | --- |
| arXiv | ✅ online, primary source |
| PubMed Central (PMC) | ✅ online, biomedical & life sciences |
| bioRxiv / medRxiv | 🔴 temporarily down (server-side issue, recovering soon) |
| Semantic Scholar metadata | 🔄 expanding as the metadata foundation |

DeepXiv focuses on open-access literature so agents can work on unrestricted paper data instead of getting blocked by subscription walls.

License & Support

MIT License — see LICENSE.
