Maintainers

rafidka

ArXiv Semantic Search

A command-line tool for searching arXiv papers using natural language queries with semantic search powered by Cohere embeddings and reranking.

Features

- Natural Language Search: Query papers using plain English like "papers about information retrieval and search engines that use neural networks"
- Flexible Search Strategies:
  - Two-Stage (embed-rerank): Fast and cost-efficient (default)
    - Stage 1: Cohere Embed v4 to quickly filter top candidates via cosine similarity
    - Stage 2: Cohere Rerank 3.5 for precise final ranking
  - Single-Stage (rerank-only): Maximum accuracy for critical searches
- Rich Output: Beautiful table formatting with relevance scores, authors, and links
- Flexible Date Ranges: Search recent papers (yesterday, last week) or custom date ranges
- Multiple Output Formats: Human-readable tables or JSON for programmatic use
- Multiple arXiv Categories: Support for Computer Science, Math, Physics, and more
- Scalable: Efficiently handles large document collections (1000+ papers)

How It Works

Choose between two search strategies:

Embed-Rerank (Default - Fast & Cost-Efficient)

Retrieval Stage: Uses Cohere Embed v4 to create embeddings and filter documents via cosine similarity
Reranking Stage: Uses Cohere Rerank 3.5 on the filtered candidates for final precision ranking

Rerank-Only (Maximum Accuracy)

Single Stage: Uses Cohere Rerank 3.5 directly on all documents for highest precision
Best for critical searches where accuracy is more important than speed/cost

Auto-optimization: For small datasets (≤100 papers), the tool automatically uses rerank-only regardless of strategy choice.
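
The two-stage flow described above can be illustrated with a minimal, offline sketch. It does not call the Cohere API and is not the tool's actual implementation: the embeddings are passed in as plain lists of floats, and `rerank_score` is a hypothetical callable standing in for Cohere Rerank 3.5. The function and parameter names are assumptions made for illustration.

```python
from __future__ import annotations

import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embed_rerank(
    query: str,
    query_vec: Vector,
    docs: Sequence[str],
    doc_vecs: Sequence[Vector],
    rerank_score: Callable[[str, str], float],
    retrieval_k: int = 100,
    top_k: int = 10,
) -> list[tuple[int, float]]:
    """Return (document index, rerank score) pairs, best first."""
    # Stage 1: keep the retrieval_k documents closest to the query embedding.
    by_similarity = sorted(
        range(len(docs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:retrieval_k]

    # Stage 2: score only the surviving candidates with the costlier reranker.
    scored = [(i, rerank_score(query, docs[i])) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_rerank(query: str, doc: str) -> float:
    """Stand-in reranker (word overlap); a real run would use a rerank model."""
    query_words = set(query.split())
    return len(query_words & set(doc.split())) / len(query_words)


# Toy data: two-dimensional "embeddings" invented for this example.
docs = ["neural ranking models", "algebraic topology", "dense retrieval"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
results = embed_rerank(
    "dense neural retrieval", [1.0, 0.0], docs, doc_vecs, toy_rerank,
    retrieval_k=2, top_k=2,
)
```

Stage 1 discards the unrelated topology document cheaply, so the reranker only sees two candidates. The rerank-only strategy corresponds to skipping stage 1 and scoring every document.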

Installation

# Clone the repository
git clone <your-repo-url>
cd arxivory

# Install dependencies
uv sync

# Set up your Cohere API key
export COHERE_API_KEY="your-cohere-api-key"

Usage

Semantic Search (New!)

Search for papers using natural language queries:

# Search for recent papers about neural networks and information retrieval
uv run arxivory search "papers about information retrieval and search engines that use neural networks" --preset yesterday

# Search with custom date range and show abstracts
uv run arxivory search "transformer architectures for computer vision" --from-date 2024-01-01 --until-date 2024-01-07 --abstract

# Get top 5 results in JSON format
uv run arxivory search "reinforcement learning for robotics" --preset last-week --top-k 5 --json

# Configure retrieval stage for large document sets
uv run arxivory search "deep learning optimization" --preset last-week --retrieval-k 200 --top-k 10

# Use rerank-only strategy for maximum accuracy
uv run arxivory search "critical medical AI research" --preset yesterday --strategy rerank-only

# Search in other arXiv categories (e.g., Mathematics)
uv run arxivory search "topology and algebraic geometry" --preset yesterday --set math

Raw Metadata Harvest (Original functionality)

For raw JSON output without semantic search:

# Harvest raw metadata for yesterday
uv run arxivory harvest --preset yesterday

# Custom date range
uv run arxivory harvest --from-date 2024-01-01 --until-date 2024-01-07
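
Both commands accept either a preset or an explicit date range. As a rough sketch of how a preset might map onto `--from-date` / `--until-date` values, the snippet below assumes that `yesterday` means the previous calendar day and that `last-week` means the seven days ending yesterday. Those definitions, and the `resolve_preset` name, are assumptions; the tool's exact boundaries may differ.

```python
from __future__ import annotations

from datetime import date, timedelta


def resolve_preset(preset: str, today: date | None = None) -> tuple[str, str]:
    """Return (from_date, until_date) as YYYY-MM-DD strings for a preset name."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    if preset == "yesterday":
        start = yesterday
    elif preset == "last-week":
        # Assumed meaning: the seven days ending yesterday.
        start = today - timedelta(days=7)
    else:
        raise ValueError(f"unknown preset: {preset!r}")
    return start.isoformat(), yesterday.isoformat()


from_date, until_date = resolve_preset("last-week", today=date(2024, 1, 8))
```

Under these assumptions, running the `last-week` preset on 2024-01-08 covers the same range as the custom-range example above (2024-01-01 to 2024-01-07).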

Examples

Example 1: Information Retrieval Research

uv run arxivory search "information retrieval using neural networks and transformers" --preset last-week --top-k 5 --abstract

Example 2: Computer Vision Papers

uv run arxivory search "object detection and computer vision with deep learning" --from-date 2024-01-01 --until-date 2024-01-31

Example 3: Strategy Comparison

# Fast search with embed-rerank (default)
uv run arxivory search "theoretical foundations of deep learning" --preset last-week --strategy embed-rerank --top-k 10

# Maximum accuracy with rerank-only (slower, more expensive)
uv run arxivory search "theoretical foundations of deep learning" --preset last-week --strategy rerank-only --top-k 10

Command Options

Search Command

--preset: Date shortcuts (yesterday, last-week)
--from-date / --until-date: Custom date ranges (YYYY-MM-DD)
--top-k: Final number of results to return (default: 10)
--strategy: Search strategy (embed-rerank for speed, rerank-only for accuracy; default: embed-rerank)
--retrieval-k: Number of candidates to retrieve in embed-rerank stage 1 (default: 100)
--set: arXiv category (default: cs for Computer Science)
--abstract: Show paper abstracts in output
--json: Output results as JSON
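
For scripted use, the options above can be assembled into a command line and the `--json` output parsed. The sketch below only uses flags listed in this section. The helper names are hypothetical, and `run_search_json` assumes that the command prints nothing but a JSON document to stdout; the structure of that JSON is not documented here, so the result is returned as-is.

```python
from __future__ import annotations

import json
import subprocess


def build_search_argv(
    query: str,
    *,
    preset: str | None = None,
    from_date: str | None = None,
    until_date: str | None = None,
    top_k: int | None = None,
    strategy: str | None = None,
    retrieval_k: int | None = None,
    arxiv_set: str | None = None,
    abstract: bool = False,
    as_json: bool = False,
) -> list[str]:
    """Build the argv list for `uv run arxivory search ...`."""
    argv = ["uv", "run", "arxivory", "search", query]
    # Options that take a value; omitted ones fall back to the tool's defaults.
    valued = [
        ("--preset", preset),
        ("--from-date", from_date),
        ("--until-date", until_date),
        ("--top-k", top_k),
        ("--strategy", strategy),
        ("--retrieval-k", retrieval_k),
        ("--set", arxiv_set),
    ]
    for flag, value in valued:
        if value is not None:
            argv += [flag, str(value)]
    # Boolean switches.
    if abstract:
        argv.append("--abstract")
    if as_json:
        argv.append("--json")
    return argv


def run_search_json(query: str, **options):
    """Run the search and parse stdout as JSON (assumes stdout is pure JSON)."""
    argv = build_search_argv(query, as_json=True, **options)
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with multi-word queries.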

Output Format

The tool provides rich, formatted output showing:

Rank: Paper ranking by relevance
Score: Semantic relevance score (0-1)
Title: Paper title
Authors: Author list (truncated if long)
arXiv ID: For accessing the paper
Abstract: Optional detailed abstract
Links: Direct URLs to abstract and PDF
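
The fields above can be modeled as a small record. This is an illustrative sketch, not the tool's internal data model: the class name, field names, and the author truncation rule are assumptions, and the sample values are made up. The URL patterns are the standard arXiv ones for abstract and PDF pages.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PaperResult:
    rank: int
    score: float  # relevance score in the range 0-1
    title: str
    authors: tuple[str, ...]
    arxiv_id: str

    @property
    def abs_url(self) -> str:
        """Link to the abstract page."""
        return f"https://arxiv.org/abs/{self.arxiv_id}"

    @property
    def pdf_url(self) -> str:
        """Link to the PDF."""
        return f"https://arxiv.org/pdf/{self.arxiv_id}"

    def short_authors(self, limit: int = 3) -> str:
        """Join author names, truncating long lists (assumed display rule)."""
        if len(self.authors) <= limit:
            return ", ".join(self.authors)
        hidden = len(self.authors) - limit
        return ", ".join(self.authors[:limit]) + f", +{hidden} more"


# Placeholder values for illustration only.
example = PaperResult(
    rank=1,
    score=0.87,
    title="Example paper",
    authors=("A. Author", "B. Author", "C. Author", "D. Author", "E. Author"),
    arxiv_id="2401.00001",
)
```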

Strategy Comparison

Strategy	Speed	Cost	Accuracy	Best For
`embed-rerank` (default)	Fast	Lower	Very Good	General searches, large datasets
`rerank-only`	Slower	Higher	Maximum	Critical searches, small datasets

Auto-optimization: Small datasets (≤100 papers) automatically use rerank-only for optimal results.
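
The auto-optimization rule amounts to a simple override of the requested strategy. A minimal sketch, with a hypothetical function name and the 100-paper threshold taken from the text above:

```python
SMALL_DATASET_THRESHOLD = 100  # "≤100 papers", per the rule above

VALID_STRATEGIES = ("embed-rerank", "rerank-only")


def effective_strategy(requested: str, num_papers: int) -> str:
    """Return the strategy that would actually run for a given dataset size."""
    if requested not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {requested!r}")
    # Small datasets skip the embedding stage regardless of the request.
    if num_papers <= SMALL_DATASET_THRESHOLD:
        return "rerank-only"
    return requested
```

With this rule, `--strategy embed-rerank` only changes behavior once a harvest returns more than 100 papers.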

API Key Setup

Get your free Cohere API key from cohere.com and set it as an environment variable:

export COHERE_API_KEY="your-api-key-here"
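
If you wrap the tool in your own scripts, it can help to check for the key before starting a search. The helper below is a sketch with a hypothetical name; it only reads the `COHERE_API_KEY` environment variable and does not validate the key with Cohere.

```python
import os


def require_cohere_api_key(environ=None) -> str:
    """Return the Cohere API key from the environment, or fail early."""
    env = os.environ if environ is None else environ
    key = env.get("COHERE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "COHERE_API_KEY is not set; export it before running a search."
        )
    return key
```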

Supported arXiv Categories

cs: Computer Science (default)
math: Mathematics
physics: Physics
q-bio: Quantitative Biology
q-fin: Quantitative Finance
stat: Statistics

Requirements

Python 3.12+
Cohere API key
Internet connection for arXiv API and Cohere services

Release history

0.5.0: Aug 26, 2025
0.4.0: Aug 26, 2025 (this version)

Download files

Source Distribution

arxivory-0.4.0.tar.gz (9.5 kB)

Uploaded Aug 26, 2025 Source

Built Distribution

arxivory-0.4.0-py3-none-any.whl (12.5 kB)

Uploaded Aug 26, 2025 Python 3

File details

Details for the file arxivory-0.4.0.tar.gz.

File metadata

Download URL: arxivory-0.4.0.tar.gz
Upload date: Aug 26, 2025
Size: 9.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arxivory-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`5f734e10c958c9ae8d69a40d8ae541734118209fdf47a94a369d09f95c5d10c8`
MD5	`974841f6736df4dcfdd5bbbb755f8bef`
BLAKE2b-256	`b2b659c6dc45c47696d87e829867e2b1576b78d004746e5d9aaf504a3e5e9b92`
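
A downloaded file can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. The helper names below are illustrative, and the path is whatever location you saved the file to.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex: str) -> bool:
    """Compare a file's digest with the published one, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Published digest for arxivory-0.4.0.tar.gz, copied from the table above.
EXPECTED_SDIST_SHA256 = (
    "5f734e10c958c9ae8d69a40d8ae541734118209fdf47a94a369d09f95c5d10c8"
)
```

For example, `matches_sha256("arxivory-0.4.0.tar.gz", EXPECTED_SDIST_SHA256)` should return `True` for an intact download.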

Provenance

The following attestation bundles were made for arxivory-0.4.0.tar.gz:

Publisher: publish.yml on rafidka/arxivory

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arxivory-0.4.0.tar.gz
- Subject digest: 5f734e10c958c9ae8d69a40d8ae541734118209fdf47a94a369d09f95c5d10c8
- Sigstore transparency entry: 435280776
- Sigstore integration time: Aug 26, 2025
Source repository:
- Permalink: rafidka/arxivory@4f1a25e4014010df39522acffa51621bdb617c36
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/rafidka
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4f1a25e4014010df39522acffa51621bdb617c36
- Trigger Event: release

File details

Details for the file arxivory-0.4.0-py3-none-any.whl.

File metadata

Download URL: arxivory-0.4.0-py3-none-any.whl
Upload date: Aug 26, 2025
Size: 12.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for arxivory-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8b0b0715daec3fc8083553b0906df238667aef29be249b6585a3bc72c7258183`
MD5	`c59bc96885c907858c55bc8ce5d7c77d`
BLAKE2b-256	`d3f1e61d25c254c2a05d3e8ff29b6782333eb18f7ea561b258c58354417385cd`

Provenance

The following attestation bundles were made for arxivory-0.4.0-py3-none-any.whl:

Publisher: publish.yml on rafidka/arxivory

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arxivory-0.4.0-py3-none-any.whl
- Subject digest: 8b0b0715daec3fc8083553b0906df238667aef29be249b6585a3bc72c7258183
- Sigstore transparency entry: 435280793
- Sigstore integration time: Aug 26, 2025
Source repository:
- Permalink: rafidka/arxivory@4f1a25e4014010df39522acffa51621bdb617c36
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/rafidka
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4f1a25e4014010df39522acffa51621bdb617c36
- Trigger Event: release
