ArXiv Semantic Search

A command-line tool for searching arXiv papers using natural language queries with semantic search powered by Cohere embeddings and reranking.

Features

  • Natural Language Search: Query papers using plain English like "papers about information retrieval and search engines that use neural networks"
  • Flexible Search Strategies:
    • Two-Stage (embed-rerank): Fast and cost-efficient (default)
      • Stage 1: Cohere Embed v4 to quickly filter top candidates via cosine similarity
      • Stage 2: Cohere Rerank 3.5 for precise final ranking
    • Single-Stage (rerank-only): Maximum accuracy for critical searches
  • Rich Output: Formatted tables showing relevance scores, authors, and links
  • Flexible Date Ranges: Search recent papers (yesterday, last week) or custom date ranges
  • Multiple Output Formats: Human-readable tables or JSON for programmatic use
  • Multiple arXiv Categories: Support for Computer Science, Math, Physics, and more
  • Scalable: Handles large document collections (1000+ papers) by reranking only the candidates that pass the embedding stage

How It Works

Choose between two search strategies:

Embed-Rerank (Default - Fast & Cost-Efficient)

  1. Retrieval Stage: Uses Cohere Embed v4 to create embeddings and filter documents via cosine similarity
  2. Reranking Stage: Uses Cohere Rerank 3.5 on the filtered candidates for final precision ranking
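
The two stages above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not arxivory's actual code: the vectors are toy values rather than Cohere Embed v4 output, and `rerank_fn` is a hypothetical stand-in for a call to Cohere Rerank 3.5. The function and parameter names are chosen here for illustration only.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embed_rerank(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    rerank_fn: Callable[[list[int]], list[int]],
    retrieval_k: int = 100,
    top_k: int = 10,
) -> list[int]:
    """Return the indices of the top_k documents after both stages."""
    # Stage 1: keep the retrieval_k documents most similar to the query.
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:retrieval_k]
    # Stage 2: the (hypothetical) reranker reorders only those candidates.
    return rerank_fn(candidates)[:top_k]


# Toy data: three 2-d "embeddings" and a reranker that keeps the given order.
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
result = embed_rerank([1.0, 0.1], docs, rerank_fn=lambda c: c, retrieval_k=2, top_k=2)
print(result)  # [0, 2]
```

The point of the split is that the expensive reranker only ever sees `retrieval_k` documents, however many papers were harvested.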

Rerank-Only (Maximum Accuracy)

  • Single Stage: Uses Cohere Rerank 3.5 directly on all documents for highest precision
  • Best for critical searches where accuracy is more important than speed/cost

Auto-optimization: For small datasets (≤100 papers), the tool automatically uses rerank-only regardless of strategy choice.
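
The auto-optimization rule can be expressed as a small helper. This is a sketch of the rule as described above, not the tool's implementation; the function name `effective_strategy` and the constant name are assumptions made for illustration.

```python
SMALL_DATASET_THRESHOLD = 100  # from the rule above: 100 papers or fewer

VALID_STRATEGIES = {"embed-rerank", "rerank-only"}


def effective_strategy(requested: str, num_papers: int) -> str:
    """Return the strategy that would actually run for a given dataset size."""
    if requested not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {requested!r}")
    # Small datasets skip the embedding stage regardless of the request.
    if num_papers <= SMALL_DATASET_THRESHOLD:
        return "rerank-only"
    return requested


print(effective_strategy("embed-rerank", 80))    # rerank-only
print(effective_strategy("embed-rerank", 1500))  # embed-rerank
```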

Installation

# Clone the repository
git clone <your-repo-url>
cd arxivory

# Install dependencies
uv sync

# Set up your Cohere API key
export COHERE_API_KEY="your-cohere-api-key"

Usage

Semantic Search (New!)

Search for papers using natural language queries:

# Search for recent papers about neural networks and information retrieval
uv run arxivory search "papers about information retrieval and search engines that use neural networks" --preset yesterday

# Search with custom date range and show abstracts
uv run arxivory search "transformer architectures for computer vision" --from-date 2024-01-01 --until-date 2024-01-07 --abstract

# Get top 5 results in JSON format
uv run arxivory search "reinforcement learning for robotics" --preset last-week --top-k 5 --json

# Configure retrieval stage for large document sets
uv run arxivory search "deep learning optimization" --preset last-week --retrieval-k 200 --top-k 10

# Use rerank-only strategy for maximum accuracy
uv run arxivory search "critical medical AI research" --preset yesterday --strategy rerank-only

# Search in other arXiv categories (e.g., Mathematics)
uv run arxivory search "topology and algebraic geometry" --preset yesterday --set math

Raw Metadata Harvest (Original functionality)

For raw JSON output without semantic search:

# Harvest raw metadata for yesterday
uv run arxivory harvest --preset yesterday

# Custom date range
uv run arxivory harvest --from-date 2024-01-01 --until-date 2024-01-07
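
A preset is shorthand for an explicit --from-date / --until-date pair. The sketch below shows one plausible mapping. It assumes that "yesterday" means the single previous day and that "last-week" means the seven days ending yesterday; the tool's actual boundaries are not documented here and may differ. The helper name `preset_range` is hypothetical.

```python
from datetime import date, timedelta


def preset_range(preset: str, today: date) -> tuple[str, str]:
    """Map a preset name to (from_date, until_date) in YYYY-MM-DD form."""
    yesterday = today - timedelta(days=1)
    if preset == "yesterday":
        start = yesterday
    elif preset == "last-week":
        # Assumption: a seven-day window that ends yesterday.
        start = today - timedelta(days=7)
    else:
        raise ValueError(f"unknown preset: {preset!r}")
    return start.isoformat(), yesterday.isoformat()


print(preset_range("last-week", date(2024, 1, 8)))  # ('2024-01-01', '2024-01-07')
```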

Examples

Example 1: Information Retrieval Research

uv run arxivory search "information retrieval using neural networks and transformers" --preset last-week --top-k 5 --abstract

Example 2: Computer Vision Papers

uv run arxivory search "object detection and computer vision with deep learning" --from-date 2024-01-01 --until-date 2024-01-31

Example 3: Strategy Comparison

# Fast search with embed-rerank (default)
uv run arxivory search "theoretical foundations of deep learning" --preset last-week --strategy embed-rerank --top-k 10

# Maximum accuracy with rerank-only (slower, more expensive)
uv run arxivory search "theoretical foundations of deep learning" --preset last-week --strategy rerank-only --top-k 10

Command Options

Search Command

  • --preset: Date shortcuts (yesterday, last-week)
  • --from-date / --until-date: Custom date ranges (YYYY-MM-DD)
  • --top-k: Final number of results to return (default: 10)
  • --strategy: Search strategy: embed-rerank (default, faster and cheaper) or rerank-only (highest accuracy)
  • --retrieval-k: Number of candidates to retrieve in embed-rerank stage 1 (default: 100)
  • --set: arXiv category (default: cs for Computer Science)
  • --abstract: Show paper abstracts in output
  • --json: Output results as JSON
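
For readers who want to see how these options fit together, the sketch below mirrors the documented flags and defaults using Python's standard `argparse` module. It is an illustration only: arxivory may use a different command-line library, and the function name `build_search_parser` and the `arxiv_set` attribute name are assumptions. The preset is left as a free-form string because the full set of accepted values is not listed here.

```python
import argparse


def build_search_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented search options."""
    parser = argparse.ArgumentParser(prog="arxivory search")
    parser.add_argument("query", help="natural language query")
    parser.add_argument("--preset", help="date shortcut, e.g. yesterday or last-week")
    parser.add_argument("--from-date", help="start date (YYYY-MM-DD)")
    parser.add_argument("--until-date", help="end date (YYYY-MM-DD)")
    parser.add_argument("--top-k", type=int, default=10)
    parser.add_argument(
        "--strategy",
        choices=["embed-rerank", "rerank-only"],
        default="embed-rerank",
    )
    parser.add_argument("--retrieval-k", type=int, default=100)
    # "set" shadows a builtin name, so store it under a different attribute.
    parser.add_argument("--set", dest="arxiv_set", default="cs")
    parser.add_argument("--abstract", action="store_true")
    parser.add_argument("--json", action="store_true")
    return parser


args = build_search_parser().parse_args(
    ["deep learning optimization", "--preset", "last-week", "--retrieval-k", "200"]
)
print(args.top_k, args.retrieval_k, args.strategy, args.arxiv_set)
# 10 200 embed-rerank cs
```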

Output Format

The tool provides rich, formatted output showing:

  • Rank: Paper ranking by relevance
  • Score: Semantic relevance score (0-1)
  • Title: Paper title
  • Authors: Author list (truncated if long)
  • arXiv ID: For accessing the paper
  • Abstract: Optional detailed abstract
  • Links: Direct URLs to abstract and PDF
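
The fields above can be modeled as a small record type. This is a hypothetical sketch, not the tool's JSON schema: the class name, field names, and the three-author truncation limit are assumptions, and the example values are placeholders. The abstract and PDF links follow arXiv's public URL pattern.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    rank: int
    score: float  # semantic relevance score between 0 and 1
    title: str
    authors: list[str]
    arxiv_id: str

    @property
    def abs_url(self) -> str:
        return f"https://arxiv.org/abs/{self.arxiv_id}"

    @property
    def pdf_url(self) -> str:
        return f"https://arxiv.org/pdf/{self.arxiv_id}"

    def short_authors(self, limit: int = 3) -> str:
        """Join author names, truncating long lists (limit is an assumption)."""
        if len(self.authors) <= limit:
            return ", ".join(self.authors)
        return ", ".join(self.authors[:limit]) + ", et al."


# Placeholder values, not a real paper.
example = SearchResult(1, 0.92, "Example Title", ["A", "B", "C", "D"], "0000.00000")
print(example.short_authors())  # A, B, C, et al.
print(example.abs_url)          # https://arxiv.org/abs/0000.00000
```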

Strategy Comparison

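| Strategy               | Speed  | Cost   | Accuracy  | Best For                           |
|------------------------|--------|--------|-----------|------------------------------------|
| embed-rerank (default) | Fast   | Lower  | Very Good | General searches, large datasets   |
| rerank-only            | Slower | Higher | Maximum   | Critical searches, small datasets  |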

Auto-optimization: Small datasets (≤100 papers) automatically use rerank-only for optimal results.

API Key Setup

Get your free Cohere API key from cohere.com and set it as an environment variable:

export COHERE_API_KEY="your-api-key-here"

Supported arXiv Categories

  • cs: Computer Science (default)
  • math: Mathematics
  • physics: Physics
  • q-bio: Quantitative Biology
  • q-fin: Quantitative Finance
  • stat: Statistics

Requirements

  • Python 3.12+
  • Cohere API key
  • Internet connection for arXiv API and Cohere services
