Blazing fast ArXiv paper search — 928K papers in 40ms
arxiv-search-kit
Offline ArXiv paper search over 928K CS papers. SPECTER2 embeddings + LanceDB vector index + BM25 hybrid retrieval.
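About 40 ms per search on GPU. 99% precision@10. Search runs locally, so it needs no API key and has no rate limits; the optional Semantic Scholar features are rate-limited (see Rate Limits).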
Install
pip
pip install arxiv-search-kit[cpu]
uv
uv pip install arxiv-search-kit[cpu]
# or in a project
uv add "arxiv-search-kit[cpu]"
GPU (CUDA)
PyTorch with CUDA must be installed separately.
# install CUDA torch first (pick your CUDA version: https://pytorch.org/get-started/locally/)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# then install the kit
pip install arxiv-search-kit[gpu]
With uv:
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
uv pip install arxiv-search-kit[gpu]
The pre-built index (~4GB) auto-downloads from HuggingFace on first use.
Quick Start
from arxiv_search_kit import ArxivClient
client = ArxivClient() # downloads index on first run
results = client.search("attention mechanism transformers")
for paper in results:
    print(paper.title, paper.arxiv_id)
Search
Keyword Search
# basic search
results = client.search("vision transformers object detection", max_results=20)
# filter by category, year, or date range
results = client.search("graph neural networks", categories=["cs.LG", "cs.AI"], year=2024)
results = client.search("LLM safety", date_from="2024-01-01", date_to="2024-06-30")
# conference-aware search (maps conference name to ArXiv categories)
results = client.search("object detection", conference="CVPR", year=2024)
Context-Biased Search
Bias results toward a specific paper's neighborhood — useful when searching for related work.
# by ArXiv ID (uses stored embedding from the index)
results = client.search("self-supervised learning", context_paper_id="2010.11929")
# by title + abstract (embeds on the fly)
results = client.search(
    "sim-to-real transfer",
    context_title="My Paper Title",
    context_abstract="We propose a method for...",
)
Batch Search
Run multiple queries covering different angles of a topic, merge and deduplicate results.
results = client.batch_search(
    queries=[
        "reinforcement learning from human feedback",
        "process reward model RLHF",
        "on-policy distillation language model",
    ],
    max_results=15,
    context_title="My Paper Title",
    context_abstract="...",
)
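The merge-and-deduplicate step can be pictured with a small sketch. This is not the library's implementation: `merge_ranked` and the `(arxiv_id, score)` tuple shape are hypothetical, and keeping the best score per paper is only one plausible merge rule.

```python
def merge_ranked(result_lists, max_results=15):
    """Merge several ranked lists of (arxiv_id, score) pairs.

    Hypothetical helper: keeps the best score seen for each paper,
    then returns the top `max_results` pairs, highest score first.
    """
    best = {}
    for results in result_lists:
        for arxiv_id, score in results:
            if arxiv_id not in best or score > best[arxiv_id]:
                best[arxiv_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max_results]


if __name__ == "__main__":
    # Two query angles that both surface paper 2010.11929.
    angle_a = [("1706.03762", 0.91), ("2010.11929", 0.80)]
    angle_b = [("2010.11929", 0.88), ("2203.02155", 0.75)]
    print(merge_ranked([angle_a, angle_b], max_results=3))
```

A paper found by more than one query appears once in the merged list, carrying its highest score.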
Sorting & Importance Ranking
# sort by relevance (default) — pure semantic + BM25 hybrid score
results = client.search("diffusion models", sort_by="relevance")
# sort by citation count (calls Semantic Scholar API)
results = client.search("diffusion models", sort_by="citations")
# sort by date
results = client.search("diffusion models", sort_by="date")
# sort by importance — blends relevance with citation count,
# venue prestige, and influential citation ratio via S2 API.
# Surfaces the most relevant *important* papers.
results = client.search("diffusion models", sort_by="importance")
# filter by minimum citations
results = client.search("transformers", min_citations=50)
sort_by="importance" works with both search() and batch_search():
results = client.batch_search(
    queries=["query angle 1", "query angle 2", "query angle 3"],
    max_results=15,
    sort_by="importance",
    context_title="Your Paper Title",
    context_abstract="Your abstract...",
)
Find Related Papers
Find papers similar to a given paper using its stored SPECTER2 embedding. No keyword query needed.
related = client.find_related("1706.03762", max_results=10) # Attention Is All You Need
related = client.find_related("1706.03762", categories=["cs.CL"]) # filter by category
Paper Object
Every search returns a SearchResult containing Paper objects:
paper = results[0]
# Core fields (from ArXiv metadata)
paper.arxiv_id # "2401.12345"
paper.title # "Paper Title"
paper.abstract # "We propose..."
paper.authors # [Author(name="Alice", affiliation="MIT"), ...]
paper.categories # ["cs.CV", "cs.LG"]
paper.primary_category # "cs.CV"
paper.published # datetime(2024, 1, 15)
paper.updated # datetime(2024, 3, 1)
paper.doi # "10.1234/..." or None
paper.journal_ref # "NeurIPS 2024" or None
paper.comment # "Accepted at..." or None
# Computed
paper.pdf_url # "https://arxiv.org/pdf/2401.12345"
paper.abs_url # "https://arxiv.org/abs/2401.12345"
paper.year # 2024
paper.first_author # "Alice"
paper.author_names # ["Alice", "Bob"]
# Search score
paper.similarity_score # 0.87 (set after search)
# Enrichment fields (populated after client.enrich() or sort_by="importance")
paper.citation_count # 142
paper.influential_citation_count # 23
paper.venue # "Neural Information Processing Systems"
paper.publication_types # ["Conference"]
paper.references # ["1706.03762", ...] (ArXiv IDs)
paper.tldr # "This paper proposes..."
# Serialization
paper.to_dict() # dict with all fields
paper.to_bibtex() # BibTeX string
paper.to_bibtex("acl") # ACL-style BibTeX
SearchResult supports len(), iteration, and indexing:
results = client.search("transformers")
len(results) # 20
results[0] # first Paper
results.query # "transformers"
results.search_time_ms # 42.5
Semantic Scholar Enrichment
Enrich papers with citation data, venue info, and AI-generated summaries via the Semantic Scholar API.
# enrich search results
results = client.search("attention mechanism")
client.enrich(results)
results[0].citation_count # 95421
results[0].venue # "Neural Information Processing Systems"
results[0].tldr # "A new architecture based solely on attention..."
# enrich specific fields only
client.enrich(results, fields=["citationCount", "venue"])
Citation Graph
# papers that cite this paper
citations = client.get_citations("1706.03762", limit=100)
# [{"arxiv_id": "...", "title": "...", "year": 2023, "citation_count": 42}, ...]
# papers referenced by this paper
references = client.get_references("1706.03762", limit=100)
Rate Limits
The Semantic Scholar (S2) API works without a key, drawing on a pool of 5,000 requests per 5 minutes that is shared among unauthenticated users. For heavier use, set S2_API_KEY:
export S2_API_KEY=your_key_here
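If you also call the Semantic Scholar API directly from your own code, the same environment variable can be reused. The sketch below assumes only that the S2 API accepts the key in the `x-api-key` request header; `s2_headers` is a hypothetical helper, and how the kit reads the variable internally is not shown in this README.

```python
import os


def s2_headers(env=None):
    """Build request headers for direct Semantic Scholar API calls.

    Returns an `x-api-key` header when S2_API_KEY is set, and an empty
    dict otherwise (the request then uses the shared public pool).
    """
    env = os.environ if env is None else env
    api_key = env.get("S2_API_KEY", "").strip()
    return {"x-api-key": api_key} if api_key else {}


if __name__ == "__main__":
    print(s2_headers({"S2_API_KEY": "your_key_here"}))
    print(s2_headers({}))
```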
Download Papers
Download PDFs or LaTeX source archives directly from ArXiv.
# single paper — by ID or Paper object
path = client.download_pdf("1706.03762", output_dir="./papers")
path = client.download_source("1706.03762", output_dir="./sources")
# from search results
results = client.search("vision transformers", max_results=5)
paths = client.download_papers(results.papers, output_dir="./papers", format="pdf")
paths = client.download_papers(results.papers, output_dir="./sources", format="source")
Downloads are streamed to disk (no full file in memory). Failed downloads are skipped with a warning.
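To illustrate what streaming to disk means, here is a minimal sketch that copies a file-like object in fixed-size chunks. It is not the kit's download code: `stream_to_file` and the 64 KiB chunk size are assumptions, and an in-memory stream stands in for the HTTP response body.

```python
import io
import os
import tempfile


def stream_to_file(source, dest_path, chunk_size=64 * 1024):
    """Copy a binary file-like object to disk one chunk at a time.

    Returns the number of bytes written. At most `chunk_size` bytes are
    held in memory at once, regardless of the total size.
    """
    written = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written


if __name__ == "__main__":
    # An in-memory stream stands in for an HTTP response body.
    fake_body = io.BytesIO(b"%PDF-1.7 " + b"x" * 200_000)
    target = os.path.join(tempfile.mkdtemp(), "paper.pdf")
    print(stream_to_file(fake_body, target), "bytes written")
```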
Async Support
All main methods have async variants:
results = await client.async_search("transformers", max_results=10)
results = await client.async_batch_search(queries=[...], sort_by="importance")
related = await client.async_find_related("1706.03762")
await client.async_enrich(results)
Venue Prestige Tiers
When using sort_by="importance", papers are scored by a combination of citation count, influential citation ratio, and venue prestige. Venues are assigned tiers:
| Tier | Weight | Venues |
|---|---|---|
| 3 (top) | 1.0 | NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, ECCV, AAAI, KDD, JMLR, TPAMI, ... |
| 2 (strong) | 0.67 | WACV, COLING, ICRA, SIGGRAPH, AISTATS, COLT, Findings, ... |
| 1 (decent) | 0.33 | BMVC, ACCV, SemEval, CoNLL, ... |
| 0 (unknown) | 0.0 | ArXiv-only, unrecognized venues |
The importance score formula:
importance = 0.55 * log_citation_score + 0.30 * venue_score + 0.15 * influential_ratio
final_score = 0.6 * relevance + 0.4 * importance
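The formula can be worked through in a few lines of Python. The weights and tier values come from the formula and table above; everything else is an assumption made so the sketch runs: the README does not say how `log_citation_score` or `influential_ratio` are normalized, so the `log1p` scaling, the cap of 10,000 citations, and the function names here are hypothetical.

```python
import math

# Tier weights from the venue table.
TIER_WEIGHT = {3: 1.0, 2: 0.67, 1: 0.33, 0: 0.0}


def log_citation_score(citations, cap=10_000):
    """Assumed normalization: log-scaled citations, clipped to [0, 1]."""
    return min(1.0, math.log1p(max(citations, 0)) / math.log1p(cap))


def importance(citations, influential, tier):
    """Weighted blend of citations, venue tier, and influential ratio."""
    # Assumed definition: influential citations divided by all citations.
    ratio = influential / citations if citations > 0 else 0.0
    return (
        0.55 * log_citation_score(citations)
        + 0.30 * TIER_WEIGHT[tier]
        + 0.15 * min(1.0, ratio)
    )


def final_score(relevance, citations, influential, tier):
    """Blend search relevance (assumed in [0, 1]) with importance."""
    return 0.6 * relevance + 0.4 * importance(citations, influential, tier)


if __name__ == "__main__":
    # A tier-3 paper with 142 citations, 23 of them influential.
    print(round(final_score(0.87, 142, 23, 3), 3))
```

Because relevance carries the larger weight, a highly cited paper with a weak match still ranks below a strong match from a lesser-known venue in many cases.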
Coverage
928K papers across all major CS + stat.ML + eess categories:
- cs.CV (144K), cs.LG (129K), cs.CL (78K), cs.AI (36K), cs.RO (38K), cs.CR (32K), stat.ML (20K), and 40+ more subcategories.
Conference-to-category mappings: CVPR, NeurIPS, ICML, ICLR, ACL, EMNLP, NAACL, AAAI, IJCAI, CHI, KDD, SIGIR, RSS, ICRA, and many more.
How It Works
- Index: 928K papers embedded with SPECTER2, stored in LanceDB (~4GB)
- Retrieval: Hybrid search — dense (SPECTER2 cosine) + sparse (BM25) fused via Reciprocal Rank Fusion
- Re-ranking: Personalized PageRank on a k-NN similarity graph built from candidate embeddings
- Enrichment: Optional citation/venue data from Semantic Scholar API
- Importance: Blends relevance with citation count, venue prestige, and influential citation ratio
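The retrieval step above fuses two rankings with Reciprocal Rank Fusion, where each list contributes 1 / (k + rank) to a document's score. A minimal sketch follows; `reciprocal_rank_fusion` is a hypothetical name, and k = 60 is the constant commonly used in the RRF literature, not a value confirmed for this package.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists; each list contributes 1 / (k + rank).

    Ranks start at 1. Ties are broken by ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # e.g. order by SPECTER2 cosine similarity
    sparse = ["b", "d", "a"]  # e.g. order by BM25 score
    print(reciprocal_rank_fusion([dense, sparse]))
```

RRF uses only ranks, so the dense and sparse scores do not need to be on the same scale.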
The index auto-downloads from HuggingFace on first use.
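The re-ranking step listed above (Personalized PageRank on a k-NN similarity graph) can be sketched in pure Python. This is an illustration under assumptions, not the package's code: the function names, k, the damping factor alpha = 0.85, the iteration count, and the choice of seed vector are all hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(embeddings, k=2):
    """For each node, row-normalized edge weights to its k nearest nodes."""
    graph = []
    for i, emb in enumerate(embeddings):
        sims = [
            (max(cosine(emb, other), 0.0), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        top = sorted(sims, reverse=True)[:k]
        total = sum(s for s, _ in top)
        graph.append({j: s / total for s, j in top} if total else {})
    return graph


def personalized_pagerank(graph, seed, alpha=0.85, iters=50):
    """Power iteration: follow an edge with prob. alpha, else restart at seed."""
    seed_total = sum(seed)
    restart = [s / seed_total for s in seed]
    rank = restart[:]
    for _ in range(iters):
        nxt = [(1 - alpha) * r for r in restart]
        for i, edges in enumerate(graph):
            if not edges:
                # Dangling node: return its mass to the restart distribution.
                for j, r in enumerate(restart):
                    nxt[j] += alpha * rank[i] * r
            for j, w in edges.items():
                nxt[j] += alpha * rank[i] * w
        rank = nxt
    return rank


if __name__ == "__main__":
    candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    seed = [1.0, 0.0, 0.0]  # e.g. weight from the initial retrieval scores
    print(personalized_pagerank(knn_graph(candidates, k=1), seed))
```

Candidates that sit close to the seeded papers in embedding space accumulate rank, while isolated candidates fall toward the bottom.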
Building Your Own Index
Only needed if you want to customize the paper set or update to the latest papers.
pip install arxiv-search-kit[index]
# download metadata from ArXiv OAI-PMH (~2 hours, network-bound)
python -m arxiv_search_kit.scripts.build_index download --output metadata.jsonl
# build index (needs GPU, ~45 min)
python -m arxiv_search_kit.scripts.build_index build \
    --metadata-path metadata.jsonl \
    --output-dir ./my_index \
    --device cuda
# or do both in one step
python -m arxiv_search_kit.scripts.build_index all --output-dir ./my_index --device cuda
Then point the client to your custom index:
client = ArxivClient(index_dir="./my_index", device="cuda")
License
MIT