High-quality, fast ArXiv paper search: 928K papers, SPECTER2 or Gemini-2 embeddings

arxiv-search-kit

Offline ArXiv paper search over 928K papers from the CS, stat.ML, and eess categories. It offers two embedding backends and hybrid retrieval that combines a LanceDB vector index with BM25.

SPECTER2: ~40ms per search on GPU. No API keys required. No rate limits.

Gemini-2: Higher quality semantic search via gemini-embedding-2-preview (3072-dim). Requires a Gemini API key.

Install

pip

# quote the extras so shells such as zsh do not expand the brackets
pip install "arxiv-search-kit[cpu]"

uv

uv pip install "arxiv-search-kit[cpu]"
# or in a project
uv add "arxiv-search-kit[cpu]"

GPU (CUDA)

PyTorch with CUDA must be installed separately.

# install CUDA torch first (pick your CUDA version: https://pytorch.org/get-started/locally/)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# then install the kit
pip install "arxiv-search-kit[gpu]"

With uv:

uv pip install torch --index-url https://download.pytorch.org/whl/cu121
uv pip install "arxiv-search-kit[gpu]"

The pre-built index (~4GB for SPECTER2, ~10GB for Gemini-2) auto-downloads from HuggingFace on first use.

Quick Start

from arxiv_search_kit import ArxivClient

# SPECTER2 (default) — local, no API key needed
client = ArxivClient()

# Gemini-2 — higher quality, requires Gemini API key
client = ArxivClient(embedding="gemini", gemini_api_key="AIza...")
# or set GEMINI_API_KEY env var and omit gemini_api_key

results = client.search("attention mechanism transformers")
for paper in results:
    print(paper.title, paper.arxiv_id)

Default vs Extra Fields

By default, search results include only the essential fields: arxiv_id, title, abstract, year, and citation_count. The citation_count field is populated only when using sort_by="importance" or sort_by="citations"; otherwise it is None.

Pass details="extra" to get all fields.

# default — only core fields
results = client.search("transformers")
results.to_dicts()
# [{"arxiv_id": "...", "title": "...", "abstract": "...", "year": 2024, "citation_count": None}, ...]

# extra — all fields (authors, categories, doi, venue, tldr, etc.)
results = client.search("transformers", details="extra")
results.to_dicts()
# [{"arxiv_id": "...", "title": "...", "authors": [...], "categories": [...], ...}, ...]

Works the same for batch_search and find_related:

results = client.batch_search(["BERT", "GPT"], sort_by="importance", details="extra")
related  = client.find_related("1706.03762", details="extra")

Search

Keyword Search

# basic search
results = client.search("vision transformers object detection", max_results=20)

# filter by category, year, or date range
results = client.search("graph neural networks", categories=["cs.LG", "cs.AI"], year=2024)
results = client.search("LLM safety", date_from="2024-01-01", date_to="2024-06-30")

# conference-aware search (maps conference name to ArXiv categories)
results = client.search("object detection", conference="CVPR", year=2024)

Context-Biased Search

Bias results toward a specific paper's neighborhood — useful when searching for related work.

# by ArXiv ID (uses stored embedding from the index)
results = client.search("self-supervised learning", context_paper_id="2010.11929")

# by title + abstract (embeds on the fly)
results = client.search(
    "sim-to-real transfer",
    context_title="My Paper Title",
    context_abstract="We propose a method for...",
)
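
This README does not say how the context is applied internally. One plausible way to bias a query toward a paper's neighborhood is to interpolate the query embedding with the context embedding before the nearest-neighbor lookup. The sketch below illustrates that idea with toy 3-dimensional vectors; `cosine`, `blend`, and the `weight` parameter are hypothetical names and are not part of arxiv-search-kit.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def blend(query_vec, context_vec, weight=0.3):
    """Interpolate the query toward the context vector (weight is an assumed knob)."""
    return [(1 - weight) * q + weight * c for q, c in zip(query_vec, context_vec)]

# Toy embeddings: both candidates are equally close to the raw query.
query = [1.0, 0.0, 0.0]
context = [0.0, 1.0, 0.0]
candidates = {
    "near_context": [0.8, 0.6, 0.0],
    "far_from_context": [0.8, 0.0, 0.6],
}

biased = blend(query, context)
ranked = sorted(candidates, key=lambda k: cosine(biased, candidates[k]), reverse=True)
print(ranked)
```

With the raw query the two candidates tie; after blending, the candidate that shares a direction with the context paper ranks first.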

Batch Search

Run multiple queries covering different angles of a topic, merge and deduplicate results.

results = client.batch_search(
    queries=[
        "reinforcement learning from human feedback",
        "process reward model RLHF",
        "on-policy distillation language model",
    ],
    max_results=15,
    context_title="My Paper Title",
    context_abstract="...",
)
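
The merge-and-deduplicate step can be pictured as follows. This is a minimal sketch that assumes each query yields (arxiv_id, score) pairs and that a paper found by several queries keeps its best score; `merge_results` is a hypothetical helper, the IDs and scores are placeholders, and the library may merge differently.

```python
def merge_results(result_lists, max_results=15):
    """Merge per-query hits, keeping each paper once with its best score."""
    best = {}
    for hits in result_lists:
        for arxiv_id, score in hits:
            if arxiv_id not in best or score > best[arxiv_id]:
                best[arxiv_id] = score
    # Highest score first, truncated to the requested size.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max_results]

# Placeholder hits for two query angles; paper "B" appears in both.
hits_query_1 = [("A", 0.91), ("B", 0.74)]
hits_query_2 = [("B", 0.80), ("C", 0.66)]

merged = merge_results([hits_query_1, hits_query_2])
print(merged)
```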

Sorting & Importance Ranking

# sort by relevance (default) — pure semantic + BM25 hybrid score
results = client.search("diffusion models", sort_by="relevance")

# sort by citation count (calls Semantic Scholar API)
results = client.search("diffusion models", sort_by="citations")

# sort by date
results = client.search("diffusion models", sort_by="date")

# sort by importance — blends relevance with citation count,
# venue prestige, and influential citation ratio via S2 API.
# Surfaces the most relevant *important* papers.
results = client.search("diffusion models", sort_by="importance")

# filter by minimum citations
results = client.search("transformers", min_citations=50)

sort_by="importance" works with both search() and batch_search():

results = client.batch_search(
    queries=["query angle 1", "query angle 2", "query angle 3"],
    max_results=15,
    sort_by="importance",
    context_title="Your Paper Title",
    context_abstract="Your abstract...",
)

Find Related Papers

Find papers similar to a given paper using its stored embedding. No keyword query needed.

related = client.find_related("1706.03762", max_results=10)  # Attention Is All You Need
related = client.find_related("1706.03762", categories=["cs.CL"])  # filter by category

Paper Object

Every search returns a SearchResult containing Paper objects:

paper = results[0]

# Core fields (from ArXiv metadata)
paper.arxiv_id          # "2401.12345"
paper.title             # "Paper Title"
paper.abstract          # "We propose..."
paper.authors           # [Author(name="Alice", affiliation="MIT"), ...]
paper.categories        # ["cs.CV", "cs.LG"]
paper.primary_category  # "cs.CV"
paper.published         # datetime(2024, 1, 15)
paper.updated           # datetime(2024, 3, 1)
paper.doi               # "10.1234/..." or None
paper.journal_ref       # "NeurIPS 2024" or None
paper.comment           # "Accepted at..." or None

# Computed
paper.pdf_url           # "https://arxiv.org/pdf/2401.12345"
paper.abs_url           # "https://arxiv.org/abs/2401.12345"
paper.year              # 2024
paper.first_author      # "Alice"
paper.author_names      # ["Alice", "Bob"]

# Search score
paper.similarity_score  # 0.87 (set after search)

# Enrichment fields (populated after client.enrich() or sort_by="importance")
paper.citation_count           # 142
paper.influential_citation_count  # 23
paper.venue                    # "Neural Information Processing Systems"
paper.publication_types        # ["Conference"]
paper.references               # ["1706.03762", ...] (ArXiv IDs)
paper.tldr                     # "This paper proposes..."

# Serialization
paper.to_dict()         # dict with all fields
paper.to_bibtex()       # BibTeX string
paper.to_bibtex("acl")  # ACL-style BibTeX
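
The computed fields follow directly from the stored metadata. The sketch below uses a hypothetical `MiniPaper` dataclass (not the library's Paper class) to show how the URL patterns listed above can be derived from arxiv_id; taking year from the published date is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MiniPaper:
    """Hypothetical stand-in for Paper, showing only the computed fields."""
    arxiv_id: str
    published: datetime
    author_names: list = field(default_factory=list)

    @property
    def pdf_url(self):
        return f"https://arxiv.org/pdf/{self.arxiv_id}"

    @property
    def abs_url(self):
        return f"https://arxiv.org/abs/{self.arxiv_id}"

    @property
    def year(self):
        # Assumption: the year comes from the publication date.
        return self.published.year

    @property
    def first_author(self):
        return self.author_names[0] if self.author_names else None

demo = MiniPaper("2401.12345", datetime(2024, 1, 15), ["Alice", "Bob"])
print(demo.pdf_url, demo.year, demo.first_author)
```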

SearchResult supports len(), iteration, and indexing:

results = client.search("transformers")
len(results)         # 20
results[0]           # first Paper
results.query        # "transformers"
results.search_time_ms  # 42.5

Semantic Scholar Enrichment

Enrich papers with citation data, venue info, and AI-generated summaries via the Semantic Scholar API.

# enrich search results
results = client.search("attention mechanism")
client.enrich(results)

results[0].citation_count   # 95421
results[0].venue             # "Neural Information Processing Systems"
results[0].tldr              # "A new architecture based solely on attention..."

# enrich specific fields only
client.enrich(results, fields=["citationCount", "venue"])

Citation Graph

# papers that cite this paper
citations = client.get_citations("1706.03762", limit=100)
# [{"arxiv_id": "...", "title": "...", "year": 2023, "citation_count": 42}, ...]

# papers referenced by this paper
references = client.get_references("1706.03762", limit=100)

Rate Limits

The S2 API works without a key (5,000 requests / 5 minutes shared pool). For heavier use, set S2_API_KEY:

export S2_API_KEY=your_key_here
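
If you call the Semantic Scholar API yourself alongside the kit, the key is sent in the `x-api-key` request header. The sketch below shows one way to build headers from the environment; `s2_headers` is a hypothetical helper, and how arxiv-search-kit reads the variable internally is not documented here.

```python
import os

def s2_headers(env=None):
    """Build request headers; add x-api-key only when S2_API_KEY is set."""
    env = os.environ if env is None else env
    headers = {"Accept": "application/json"}
    api_key = env.get("S2_API_KEY")
    if api_key:
        headers["x-api-key"] = api_key
    return headers

# Without a key the request falls back to the shared unauthenticated pool.
print(s2_headers({}))
print(s2_headers({"S2_API_KEY": "your_key_here"}))
```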

Download Papers

Download PDFs or LaTeX source archives directly from ArXiv.

# single paper — by ID or Paper object
path = client.download_pdf("1706.03762", output_dir="./papers")
path = client.download_source("1706.03762", output_dir="./sources")

# from search results
results = client.search("vision transformers", max_results=5)
paths = client.download_papers(results.papers, output_dir="./papers", format="pdf")
paths = client.download_papers(results.papers, output_dir="./sources", format="source")

Downloads are streamed to disk (no full file in memory). Failed downloads are skipped with a warning.
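
Streaming to disk means copying the response in fixed-size chunks rather than reading it whole. A minimal sketch of that pattern, using an in-memory object as a stand-in for the network response; `stream_to_disk` and the chunk size are assumptions, not the library's implementation.

```python
import io
import os
import shutil
import tempfile

def stream_to_disk(response, dest_path, chunk_size=64 * 1024):
    """Copy a file-like response to dest_path in chunks of chunk_size bytes."""
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest_path

# Demo with an in-memory stand-in for a network response.
payload = b"%PDF-1.5 demo bytes" * 1000
fake_response = io.BytesIO(payload)

with tempfile.TemporaryDirectory() as tmp:
    path = stream_to_disk(fake_response, os.path.join(tmp, "demo.pdf"))
    size = os.path.getsize(path)

print(size)
```

For a real download, the object returned by `urllib.request.urlopen(url)` is file-like and can be passed in place of `fake_response`.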

Paper Summarization

Summarize any paper using Google Gemini. The client downloads the LaTeX source from ArXiv, extracts the primary .tex file, trims everything after the conclusion, and produces an LLM-generated summary covering contributions, methodology, results, ablations, and more.

pip install "arxiv-search-kit[summarize]"

# by ArXiv ID
summary = client.summarize_paper("1706.03762", api_key="your-gemini-key")

# from search results
results = client.search("vision transformers", max_results=1)
summary = client.summarize_paper(results[0], api_key="your-gemini-key")

# or set the env var instead of passing api_key every time
# export GEMINI_API_KEY=your-gemini-key
summary = client.summarize_paper("1706.03762")

The summary covers: title & authors, problem statement, key contributions, related work, methodology (with equations), experimental setup, all benchmark results, ablation studies, limitations, and conclusion.
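
The two preprocessing steps of the pipeline (pick the primary .tex file, trim after the conclusion) can be sketched as below. The heuristics are assumptions: the primary file is taken to be the one containing \documentclass, and the cut point is the first \appendix or bibliography command after a "Conclusion" section. `pick_primary_tex` and `trim_after_conclusion` are hypothetical names.

```python
import re

def pick_primary_tex(files):
    """Return the first .tex file (by name) that declares a document class."""
    for name in sorted(files):
        if name.endswith(".tex") and r"\documentclass" in files[name]:
            return name
    return None

def trim_after_conclusion(tex):
    """Drop appendix and bibliography material that follows the conclusion."""
    concl = re.search(r"\\section\*?\{\s*Conclusions?[^}]*\}", tex, flags=re.IGNORECASE)
    if concl is None:
        return tex  # no conclusion heading found, keep everything
    rest = tex[concl.end():]
    tail = re.search(r"\\(?:bibliography|appendix|begin\{thebibliography\})", rest)
    if tail is None:
        return tex
    return tex[: concl.end() + tail.start()]

# Toy source archive: file name -> contents.
files = {
    "macros.tex": r"\newcommand{\R}{\mathbb{R}}",
    "main.tex": "\n".join([
        r"\documentclass{article}",
        r"\section{Introduction}",
        "Intro text.",
        r"\section{Conclusion}",
        "Done.",
        r"\appendix",
        r"\section{Extra proofs}",
    ]),
}

primary = pick_primary_tex(files)
trimmed = trim_after_conclusion(files[primary])
print(primary)
print(trimmed)
```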

Batch summarization

Pass a list to summarize multiple papers in parallel:

results = client.search("vision transformers", max_results=5)
summaries = client.summarize_paper(results.papers, api_key="your-gemini-key")
# returns {"2401.12345": "summary...", "2312.67890": "summary...", ...}

# control parallelism (default: 5 concurrent)
summaries = client.summarize_paper(results.papers, max_concurrent=3)
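
Bounded parallelism of this kind is commonly built on a worker pool. The sketch below assumes a thread pool capped at max_concurrent and a stub in place of the Gemini call; `summarize_many` and `fake_summarize` are hypothetical, and the library may use a different mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_many(paper_ids, summarize_one, max_concurrent=5):
    """Run summarize_one over paper_ids with at most max_concurrent workers."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map preserves input order, so zip pairs IDs with their summaries.
        summaries = list(pool.map(summarize_one, paper_ids))
    return dict(zip(paper_ids, summaries))

def fake_summarize(paper_id):
    """Stub standing in for a real LLM call."""
    return f"summary of {paper_id}"

out = summarize_many(["id-1", "id-2", "id-3"], fake_summarize, max_concurrent=2)
print(out)
```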

You can also specify a different Gemini model:

summary = client.summarize_paper("1706.03762", model="gemini-3-flash-preview")

Paper Q&A

Ask any question about a paper and get an answer grounded in the paper's content. Uses the same LaTeX source pipeline as summarization.

# by ArXiv ID
answer = client.ask_paper("1706.03762", "What is the scaling factor in the attention mechanism and why is it used?")

# from search results
results = client.search("vision transformers", max_results=1)
answer = client.ask_paper(results[0], "What datasets were used for evaluation?")

# or set the env var instead of passing api_key every time
# export GEMINI_API_KEY=your-gemini-key
answer = client.ask_paper("1706.03762", "What are the key contributions of this paper?")

The answer is strictly grounded in the paper — Gemini cites specific sections, equations, and tables from the source. If the paper doesn't contain enough information to answer, it says so clearly.

Requires the same [summarize] extra:

pip install "arxiv-search-kit[summarize]"

Async Support

All main methods have async variants. The calls below must run inside a coroutine (for example, one started with asyncio.run):

results = await client.async_search("transformers", max_results=10)
results = await client.async_batch_search(queries=[...], sort_by="importance")
related = await client.async_find_related("1706.03762")
await client.async_enrich(results)
summary = await client.async_summarize_paper("1706.03762")
answer  = await client.async_ask_paper("1706.03762", "What optimizer was used?")
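
To drive async methods from a plain script, wrap them in a coroutine and hand it to asyncio.run. The sketch below uses a hypothetical `StubClient` so it runs without the index or API keys; with the real client you would pass an ArxivClient instance instead.

```python
import asyncio

class StubClient:
    """Hypothetical stand-in exposing an async_search coroutine."""

    async def async_search(self, query, max_results=10):
        await asyncio.sleep(0)  # yield control, as a real I/O call would
        return [f"{query}-{i}" for i in range(max_results)]

async def main(client):
    # Independent searches can be awaited concurrently with asyncio.gather.
    return await asyncio.gather(
        client.async_search("transformers", max_results=2),
        client.async_search("diffusion models", max_results=1),
    )

first, second = asyncio.run(main(StubClient()))
print(first, second)
```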

Venue Prestige Tiers

When using sort_by="importance", papers are scored by a combination of citation count, influential citation ratio, and venue prestige. Venues are assigned tiers:

Tier	Weight	Venues
3 (top)	1.0	NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, ECCV, AAAI, KDD, JMLR, TPAMI, ...
2 (strong)	0.67	WACV, COLING, ICRA, SIGGRAPH, AISTATS, COLT, Findings, ...
1 (decent)	0.33	BMVC, ACCV, SemEval, CoNLL, ...
0 (unknown)	0.0	ArXiv-only, unrecognized venues

The importance score formula:

importance = 0.55 * log_citation_score + 0.30 * venue_score + 0.15 * influential_ratio
final_score = 0.6 * relevance + 0.4 * importance
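
The formula can be worked through in code. The weights and tier values below come from the table and formula above; the normalization of log_citation_score (log1p with a cap of 10,000 citations) and the definition of influential_ratio as influential citations divided by total citations are assumptions made for illustration.

```python
import math

# Tier -> venue weight, as listed in the table above.
TIER_WEIGHT = {3: 1.0, 2: 0.67, 1: 0.33, 0: 0.0}

def log_citation_score(citations, cap=10_000):
    """Map a citation count to [0, 1] on a log scale (cap is an assumed constant)."""
    return min(1.0, math.log1p(citations) / math.log1p(cap))

def importance(citations, influential, tier):
    """0.55 * log citations + 0.30 * venue weight + 0.15 * influential ratio."""
    ratio = min(1.0, influential / citations) if citations else 0.0
    return (
        0.55 * log_citation_score(citations)
        + 0.30 * TIER_WEIGHT[tier]
        + 0.15 * ratio
    )

def final_score(relevance, citations, influential, tier):
    """Blend retrieval relevance (60%) with importance (40%)."""
    return 0.6 * relevance + 0.4 * importance(citations, influential, tier)

# An uncited ArXiv-only paper is ranked on relevance alone.
print(final_score(1.0, citations=0, influential=0, tier=0))
# A cited paper at a tier-3 venue gets a boost at equal relevance.
print(final_score(1.0, citations=142, influential=23, tier=3))
```

Because relevance carries the larger weight, a highly cited paper cannot outrank a much more relevant one on citations alone.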

Coverage

928K papers across all major CS + stat.ML + eess categories:

cs.CV (144K), cs.LG (129K), cs.CL (78K), cs.RO (38K), cs.AI (36K), cs.CR (32K), stat.ML (20K), and 40+ more subcategories.

Conference-to-category mappings: CVPR, NeurIPS, ICML, ICLR, ACL, EMNLP, NAACL, AAAI, IJCAI, CHI, KDD, SIGIR, RSS, ICRA, and many more.
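
A conference-to-category mapping is essentially a lookup table applied before the category filter. The entries below are illustrative guesses, not the library's actual table, and `categories_for` is a hypothetical helper.

```python
# Illustrative entries only; the real mapping ships with the library.
CONFERENCE_CATEGORIES = {
    "CVPR": ["cs.CV"],
    "ACL": ["cs.CL"],
    "ICRA": ["cs.RO"],
    "NEURIPS": ["cs.LG", "stat.ML"],
}

def categories_for(conference):
    """Look up ArXiv categories for a conference name, case-insensitively."""
    return CONFERENCE_CATEGORIES.get(conference.strip().upper(), [])

print(categories_for("cvpr"))
print(categories_for("NeurIPS"))
```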

Embedding Backends

SPECTER2 (default)

Local transformer model (allenai/specter2), 768-dim embeddings
No API key, no rate limits, ~40ms on GPU
Best with context_title + context_abstract for keyword queries
Index: anonymousatom/arxiv-search-index (~4GB)

Gemini-2

gemini-embedding-2-preview via Google Gemini API, 3072-dim embeddings
Requires a Gemini API key (gemini_api_key= or GEMINI_API_KEY env var)
Better out-of-the-box quality for keyword queries — asymmetric retrieval format handles free-text queries natively
Index: Vidushee/arxiv-gemini-index (~10GB)

# SPECTER2 — fast, local, no key needed
client = ArxivClient()

# Gemini-2 — higher quality semantic search
client = ArxivClient(embedding="gemini", gemini_api_key="AIza...")

How It Works

Index: 928K papers embedded with SPECTER2 or Gemini-2, stored in LanceDB
Retrieval: Hybrid search — dense (cosine) + sparse (BM25) fused via Reciprocal Rank Fusion
Re-ranking: Personalized PageRank on a k-NN similarity graph built from candidate embeddings
Enrichment: Optional citation/venue data from Semantic Scholar API
Importance: Blends relevance with citation count, venue prestige, and influential citation ratio
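
The fusion step in the retrieval stage can be illustrated with a small Reciprocal Rank Fusion function. Each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k=60 is the value commonly used in the literature; the value used by this library is not stated here, and `reciprocal_rank_fusion` is a hypothetical name.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Placeholder rankings from the dense (cosine) and sparse (BM25) retrievers.
dense = ["A", "B", "C"]
sparse = ["B", "D", "A"]

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Documents that both retrievers rank well ("A" and "B") rise above documents found by only one of them, without any need to put cosine and BM25 scores on a common scale.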

Indexes auto-download from HuggingFace on first use.
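
The re-ranking step listed above relies on Personalized PageRank. The sketch below runs plain power iteration on a tiny hand-written neighbor graph; the damping factor, the iteration count, and seeding the restart on a single query node are assumptions, and `personalized_pagerank` is a hypothetical name rather than the library's function.

```python
def personalized_pagerank(adjacency, seeds, alpha=0.85, iterations=50):
    """Power iteration for PageRank with restarts to the seed distribution.

    adjacency maps each node to the list of its neighbors (k-NN edges).
    seeds maps node -> restart probability; the values should sum to 1.
    """
    nodes = list(adjacency)
    rank = {n: seeds.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - alpha) * seeds.get(n, 0.0) for n in nodes}
        for node, neighbors in adjacency.items():
            if not neighbors:
                # Dangling node: send its mass back to the seeds.
                for seed, weight in seeds.items():
                    nxt[seed] += alpha * rank[node] * weight
                continue
            share = alpha * rank[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        rank = nxt
    return rank

# Toy graph: "q" is the query node; "c" links in but nothing links to it.
graph = {"q": ["a", "b"], "a": ["q", "b"], "b": ["q", "a"], "c": ["b"]}
scores = personalized_pagerank(graph, seeds={"q": 1.0})
print(sorted(scores, key=scores.get, reverse=True))
```

Nodes that are well connected to the seed accumulate score, while a node that nothing near the seed points to ends up with none.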

Building Your Own Index

Only needed if you want to customize the paper set or update to the latest papers.

pip install "arxiv-search-kit[index]"

# download metadata from ArXiv OAI-PMH (~2 hours, network-bound)
python -m arxiv_search_kit.scripts.build_index download --output metadata.jsonl

# build index (needs GPU, ~45 min)
python -m arxiv_search_kit.scripts.build_index build \
    --metadata-path metadata.jsonl \
    --output-dir ./my_index \
    --device cuda

# or do both in one step
python -m arxiv_search_kit.scripts.build_index all --output-dir ./my_index --device cuda

Then point the client to your custom index:

client = ArxivClient(index_dir="./my_index", device="cuda")

License

MIT

Release history

Version	Released
0.2.4	Apr 11, 2026
0.2.3	Apr 11, 2026
0.2.2	Apr 11, 2026
0.2.1	Apr 11, 2026
0.2.0 (this version)	Apr 9, 2026
0.1.9	Apr 7, 2026
0.1.8	Apr 6, 2026
0.1.7	Apr 6, 2026
0.1.5	Apr 3, 2026
0.1.4	Apr 3, 2026
0.1.3	Apr 3, 2026
0.1.2	Mar 28, 2026
0.1.1	Mar 24, 2026
0.1.0	Mar 24, 2026

Download files

Source Distribution

arxiv_search_kit-0.2.0.tar.gz (83.4 kB)

Uploaded Apr 9, 2026 Source

Built Distribution

arxiv_search_kit-0.2.0-py3-none-any.whl (59.4 kB)

Uploaded Apr 9, 2026 Python 3

File details

Details for the file arxiv_search_kit-0.2.0.tar.gz.

File metadata

Download URL: arxiv_search_kit-0.2.0.tar.gz
Upload date: Apr 9, 2026
Size: 83.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for arxiv_search_kit-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`05a6c43ba6913dfce4efd975682a6c59804a9f33851994a7215b26eff670c71d`
MD5	`6db0e71cc7ea2bbb69a8eb1d1fe1f86c`
BLAKE2b-256	`833949b0bf3a44dfbcbfdc360bfc26c69fa7988a7aadaaeb5cc226f83121d9d7`

File details

Details for the file arxiv_search_kit-0.2.0-py3-none-any.whl.

File metadata

Download URL: arxiv_search_kit-0.2.0-py3-none-any.whl
Upload date: Apr 9, 2026
Size: 59.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.7

File hashes

Hashes for arxiv_search_kit-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cca255cb73e91e767db5093675d86161f2d239c16e10dbd0d8ef8cd21079ac6c`
MD5	`a4ebfb0b466a1c0e03641a5babc0fb2a`
BLAKE2b-256	`a18c2532d65dd19700dbb732a4fec69caf55933cf42f5e3631c0ee78f0772dec`
