Blazing fast ArXiv paper search — 928K papers in 40ms
arxiv-search-kit
Offline ArXiv paper search over 928K CS papers. SPECTER2 embeddings + LanceDB vector index + BM25 hybrid retrieval.
40 ms per search on GPU. 99% precision@10. Local search needs no API keys and has no rate limits; the optional citation features call the Semantic Scholar API.
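For readers unfamiliar with the metric, precision@10 is conventionally the fraction of the top 10 results that are relevant. The sketch below shows that standard definition only. It is not the evaluation script behind the 99% figure, and `precision_at_k` is a name made up for this example.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked results that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for paper_id in ranked_ids[:k] if paper_id in relevant)
    # Standard definition: divide by k, even if fewer than k results came back.
    return hits / k


# Two of the top four results are relevant.
score = precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4)
print(score)
```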
Install
pip install "arxiv-search-kit[gpu]"  # with CUDA
pip install "arxiv-search-kit[cpu]"  # CPU only (quotes keep zsh from globbing the brackets)
Quick start
from arxiv_search_kit import ArxivClient
client = ArxivClient() # auto-downloads 4GB index on first run
# keyword search
papers = client.search("attention mechanism transformers", categories=["cs.CL", "cs.LG"])
# find related papers
related = client.find_related("1706.03762") # Attention Is All You Need
# search with context paper (biases results toward your paper's neighborhood)
papers = client.search(
    "self-supervised learning",
    context_paper_id="2010.11929",  # ViT
)
# batch search (returns all unique papers across queries)
papers = client.batch_search([
    "vision transformers",
    "neural radiance fields",
    "RLHF alignment",
], max_results=10)
# conference-aware search
papers = client.search("object detection", conference="CVPR", year=2024)
# sort by citations (calls Semantic Scholar API)
papers = client.search("diffusion models", sort_by="citations", min_citations=50)
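The batch search above is described as returning all unique papers across queries. One plausible way to get that behaviour is to merge the per-query result lists and drop repeats by arXiv ID, keeping the first occurrence. The sketch below assumes results are plain dicts with an `arxiv_id` key. `merge_unique` is a hypothetical helper, not part of the library.

```python
def merge_unique(result_lists, key=lambda paper: paper["arxiv_id"]):
    """Concatenate result lists, keeping only the first occurrence of each ID."""
    seen = set()
    merged = []
    for results in result_lists:
        for paper in results:
            paper_id = key(paper)
            if paper_id not in seen:
                seen.add(paper_id)
                merged.append(paper)
    return merged


# Placeholder IDs, not real search results.
query_a = [{"arxiv_id": "1"}, {"arxiv_id": "2"}]
query_b = [{"arxiv_id": "2"}, {"arxiv_id": "3"}]
merged = merge_unique([query_a, query_b])
print([p["arxiv_id"] for p in merged])
```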
What you get back
paper = papers[0]
paper.arxiv_id # "2401.12345"
paper.title # "..."
paper.abstract # "..."
paper.authors # [Author(name="...", affiliation="..."), ...]
paper.categories # ["cs.CV", "cs.LG"]
paper.published # datetime
paper.pdf_url # "https://arxiv.org/pdf/2401.12345"
paper.to_bibtex() # BibTeX string
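The README does not show the BibTeX that `to_bibtex()` emits. A common layout for arXiv preprints is a `@misc` entry with `eprint`, `archivePrefix`, and `primaryClass` fields, so the sketch below builds one from the fields listed above. `format_bibtex`, the cite-key scheme, and the sample values are assumptions for illustration.

```python
def format_bibtex(arxiv_id, title, authors, year, primary_class):
    """Build a @misc BibTeX entry for an arXiv preprint (illustrative layout)."""
    # Hypothetical cite-key scheme: "arxiv_" plus the ID with separators replaced.
    cite_key = "arxiv_" + arxiv_id.replace(".", "_").replace("/", "_")
    fields = [
        ("title", title),
        ("author", " and ".join(authors)),
        ("year", str(year)),
        ("eprint", arxiv_id),
        ("archivePrefix", "arXiv"),
        ("primaryClass", primary_class),
    ]
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{cite_key},\n{body}\n}}"


# Placeholder metadata, not a real paper.
entry = format_bibtex(
    "2401.12345", "An Example Title", ["Ada Lovelace", "Alan Turing"], 2024, "cs.CV"
)
print(entry)
```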
Citation graph (via Semantic Scholar)
citations = client.get_citations("1706.03762")
references = client.get_references("1706.03762")
# enrich search results with citation counts
client.enrich(papers)
papers[0].citation_count # 95421
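Once results carry a citation count, filtering and sorting them locally is straightforward. The sketch below uses a stand-in `PaperStub` dataclass rather than the library's own paper type, and assumes papers that have not been enriched have `citation_count` set to `None`. Both are assumptions for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaperStub:
    """Stand-in for a search result; not the library's paper class."""
    arxiv_id: str
    citation_count: Optional[int] = None


def top_cited(papers, min_citations=0):
    """Keep papers with a known count >= min_citations, most cited first."""
    counted = [
        p for p in papers
        if p.citation_count is not None and p.citation_count >= min_citations
    ]
    return sorted(counted, key=lambda p: p.citation_count, reverse=True)


# Placeholder counts, not real citation data.
stubs = [PaperStub("a", 10), PaperStub("b"), PaperStub("c", 120), PaperStub("d", 60)]
print([p.arxiv_id for p in top_cited(stubs, min_citations=50)])
```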
Coverage
928K papers across all major CS + stat.ML categories:
cs.CV (144K), cs.LG (129K), cs.CL (78K), cs.AI (36K), cs.RO (38K), cs.CR (32K), stat.ML (20K), and 40+ more subcategories.
Conference names are mapped to arXiv categories. Supported conferences include CVPR, NeurIPS, ICML, ICLR, ACL, EMNLP, AAAI, CHI, KDD, SIGIR, and more.
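A conference-to-category mapping can be as simple as a lookup table. The entries below are illustrative guesses at sensible pairings, not the table shipped with the package. `CONFERENCE_CATEGORIES` and `categories_for` are hypothetical names.

```python
# Illustrative mapping only; the package's real table may differ.
CONFERENCE_CATEGORIES = {
    "CVPR": ["cs.CV"],
    "ACL": ["cs.CL"],
    "EMNLP": ["cs.CL"],
    "NEURIPS": ["cs.LG", "stat.ML"],
    "SIGIR": ["cs.IR"],
}


def categories_for(conference):
    """Return the arXiv categories for a conference name, case-insensitively."""
    return list(CONFERENCE_CATEGORIES.get(conference.strip().upper(), []))


print(categories_for("NeurIPS"))
```

A search restricted to a conference could then filter candidates to papers whose categories intersect the returned list.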
How it works
- Pre-built index: 928K papers embedded with SPECTER2 (same model as Semantic Scholar), stored in LanceDB
- At query time: embed query with SPECTER2, hybrid retrieval (vector + BM25), graph-based re-ranking via Personalized PageRank
- Index auto-downloads from HuggingFace on first use (~4GB)
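The query-time steps above can be sketched in plain Python. The README does not say how the vector and BM25 rankings are combined, or which graph Personalized PageRank runs over, so this sketch swaps in reciprocal rank fusion for the hybrid step and a generic directed edge list (for example, citation links) for the graph. Every function name and parameter here is an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def personalized_pagerank(edges, seeds, alpha=0.85, iters=50):
    """Power iteration for PageRank with restarts concentrated on the seed nodes."""
    seeds = set(seeds)
    if not seeds:
        raise ValueError("at least one seed node is required")
    nodes = set(seeds)
    for src, dst in edges:
        nodes.update((src, dst))
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = alpha * rank[n] / len(out_links[n])
                for m in out_links[n]:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the seeds.
                for s in seeds:
                    nxt[s] += alpha * rank[n] / len(seeds)
        rank = nxt
    return rank


# Toy inputs: two rankings to fuse, then a small graph seeded at the top hit.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]])
graph = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
ranks = personalized_pagerank(graph, seeds=[fused[0]])
reranked = sorted(fused, key=lambda d: (-ranks[d], d))
print(fused, reranked)
```

Seeding the walk at the strongest retrieval hits pulls the final ordering toward papers that are well connected to them, which is the general idea behind graph-based re-ranking.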
Building your own index
Only needed if you want to customize the paper set or update to latest papers.
pip install "arxiv-search-kit[index]"
# download metadata from ArXiv OAI-PMH (takes ~2 hours)
python -m arxiv_search_kit.scripts.build_index download --output metadata.jsonl
# build index (needs GPU, takes ~45 min)
python -m arxiv_search_kit.scripts.build_index all --output-dir ./my_index --device cuda
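To customize the paper set, one option is to filter the downloaded metadata before building. The sketch below assumes each line of `metadata.jsonl` is a JSON object with a `categories` list. That schema is a guess, so check the actual file before relying on it. `filter_by_category` is a hypothetical helper.

```python
import json


def filter_by_category(lines, keep):
    """Yield JSONL records whose 'categories' list overlaps the keep set."""
    keep = set(keep)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if keep.intersection(record.get("categories", [])):
            yield record


# In-memory placeholder lines; in practice, iterate over an open file handle.
sample_lines = [
    '{"id": "1", "categories": ["cs.CV", "cs.LG"]}',
    "",
    '{"id": "2", "categories": ["cs.RO"]}',
]
kept = list(filter_by_category(sample_lines, ["cs.CV"]))
print([r["id"] for r in kept])
```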
License
MIT