Blazing fast ArXiv paper search: 928K papers in 40 ms on GPU

arxiv-search-kit

Offline ArXiv paper search over 928K CS papers. SPECTER2 embeddings + LanceDB vector index + BM25 hybrid retrieval.

40 ms per search on GPU. 99% precision@10. Core search needs no API keys and has no rate limits; only the citation features call the Semantic Scholar API.

Install

pip install arxiv-search-kit[gpu]   # with CUDA
pip install arxiv-search-kit[cpu]   # CPU only

Quick start

from arxiv_search_kit import ArxivClient

client = ArxivClient()  # auto-downloads 4GB index on first run

# basic search, restricted to selected categories
papers = client.search("attention mechanism transformers", categories=["cs.CL", "cs.LG"])

# find related papers
related = client.find_related("1706.03762")  # Attention Is All You Need

# search with context paper (biases results toward your paper's neighborhood)
papers = client.search(
    "self-supervised learning",
    context_paper_id="2010.11929",  # ViT
)

# batch search (returns all unique papers across queries)
papers = client.batch_search([
    "vision transformers",
    "neural radiance fields",
    "RLHF alignment",
], max_results=10)

# conference-aware search
papers = client.search("object detection", conference="CVPR", year=2024)

# sort by citations (calls Semantic Scholar API)
papers = client.search("diffusion models", sort_by="citations", min_citations=50)
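
The batch search above is described as returning all unique papers across queries. The following is a minimal sketch of how such de-duplication could work, not the library's actual implementation. The `merge_unique` helper and the plain-dict records are hypothetical; it assumes each result carries an `arxiv_id` and that the first occurrence of a paper should win.

```python
def merge_unique(result_lists, key=lambda paper: paper["arxiv_id"]):
    """Merge several ranked result lists, keeping the first occurrence of each paper.

    Hypothetical helper for illustration; not part of arxiv-search-kit.
    """
    seen = {}  # dicts preserve insertion order, so first-seen order is kept
    for results in result_lists:
        for paper in results:
            seen.setdefault(key(paper), paper)
    return list(seen.values())


# Stand-in records; real results would be paper objects, not dicts.
query_a = [{"arxiv_id": "1706.03762"}, {"arxiv_id": "2010.11929"}]
query_b = [{"arxiv_id": "2010.11929"}, {"arxiv_id": "2003.08934"}]

merged = merge_unique([query_a, query_b])
print([paper["arxiv_id"] for paper in merged])
```

Keying on the ArXiv ID rather than the title avoids treating two versions of the same paper with slightly different titles as distinct results.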

What you get back

paper = papers[0]
paper.arxiv_id      # "2401.12345"
paper.title          # "..."
paper.abstract       # "..."
paper.authors        # [Author(name="...", affiliation="..."), ...]
paper.categories     # ["cs.CV", "cs.LG"]
paper.published      # datetime
paper.pdf_url        # "https://arxiv.org/pdf/2401.12345"
paper.to_bibtex()    # BibTeX string
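
To give a feel for what a BibTeX export of these fields might look like, here is a small sketch. `PaperSketch` is a hypothetical stand-in class, and the entry type, citation key and field layout are assumptions; the string returned by the real `to_bibtex()` may differ.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PaperSketch:
    """Hypothetical stand-in for a search result; not the library's class."""
    arxiv_id: str
    title: str
    authors: list
    published: datetime

    def to_bibtex(self) -> str:
        # Assumed key scheme: "arxiv" plus the ID with the dot replaced.
        key = "arxiv" + self.arxiv_id.replace(".", "_")
        authors = " and ".join(self.authors)
        return (
            f"@misc{{{key},\n"
            f"  title = {{{self.title}}},\n"
            f"  author = {{{authors}}},\n"
            f"  year = {{{self.published.year}}},\n"
            f"  eprint = {{{self.arxiv_id}}},\n"
            f"  archivePrefix = {{arXiv}}\n"
            f"}}"
        )


example = PaperSketch(
    arxiv_id="1706.03762",
    title="Attention Is All You Need",
    authors=["Ashish Vaswani", "Noam Shazeer"],
    published=datetime(2017, 6, 12),
)
print(example.to_bibtex())
```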

Citation graph (via Semantic Scholar)

citations = client.get_citations("1706.03762")
references = client.get_references("1706.03762")

# enrich search results with citation counts
client.enrich(papers)
papers[0].citation_count  # 95421
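
Since the citation features depend on Semantic Scholar, it can help to see what a direct lookup might look like. The sketch below uses the public Semantic Scholar Graph API paper endpoint with an `ARXIV:` identifier prefix. The function names are hypothetical, and whether arxiv-search-kit issues requests in this form is an assumption.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

S2_PAPER_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper"


def citation_count_url(arxiv_id: str) -> str:
    """Build the Graph API URL that asks for one paper's citation count."""
    paper_id = quote(f"ARXIV:{arxiv_id}", safe=":")
    query = urlencode({"fields": "citationCount"})
    return f"{S2_PAPER_ENDPOINT}/{paper_id}?{query}"


def fetch_citation_count(arxiv_id: str, timeout: float = 10.0) -> int:
    """Perform the network request (hypothetical helper).

    Unauthenticated requests to Semantic Scholar are rate limited,
    so callers should expect occasional HTTP errors.
    """
    with urlopen(citation_count_url(arxiv_id), timeout=timeout) as response:
        return json.load(response)["citationCount"]


print(citation_count_url("1706.03762"))
```

Only `citation_count_url` runs offline; `fetch_citation_count` needs network access and is shown for completeness.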

Coverage

928K papers across all major CS + stat.ML categories:

cs.CV (144K), cs.LG (129K), cs.CL (78K), cs.RO (38K), cs.AI (36K), cs.CR (32K), stat.ML (20K), and 40+ more subcategories.

Conference names are mapped to ArXiv categories, which is what makes conference-aware search possible: CVPR, NeurIPS, ICML, ICLR, ACL, EMNLP, AAAI, CHI, KDD, SIGIR, and more.
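
A conference-to-category mapping can be pictured as a simple lookup table. The sketch below is illustrative only: the table contents and the `categories_for` helper are assumptions, and the mapping actually shipped with the package may assign different or additional categories.

```python
# Assumed, partial mapping for illustration; not the library's real table.
CONFERENCE_CATEGORIES = {
    "CVPR": ["cs.CV"],
    "ACL": ["cs.CL"],
    "EMNLP": ["cs.CL"],
    "NeurIPS": ["cs.LG", "stat.ML"],
    "ICML": ["cs.LG", "stat.ML"],
    "SIGIR": ["cs.IR"],
}

# Case-insensitive lookup so "cvpr" and "CVPR" resolve to the same entry.
_LOOKUP = {name.lower(): cats for name, cats in CONFERENCE_CATEGORIES.items()}


def categories_for(conference: str) -> list:
    """Return the ArXiv categories assumed for a conference, or [] if unknown."""
    return list(_LOOKUP.get(conference.strip().lower(), []))


print(categories_for("neurips"))
```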

How it works

  1. Pre-built index: 928K papers embedded with SPECTER2, stored in LanceDB
  2. At query time: the query is embedded with SPECTER2, candidates are retrieved with hybrid search (vector + BM25), and the candidates are re-ranked on a graph via Personalized PageRank
  3. Index auto-downloads from HuggingFace on first use (~4GB)
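
The query-time steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the package's code: the source does not say how vector and BM25 results are combined, so reciprocal rank fusion is swapped in as one common choice, and the graph, seeds and parameter values are made up. Every neighbour in the graph is assumed to also appear as a key.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def personalized_pagerank(graph, seeds, alpha=0.15, iterations=50):
    """Power iteration with restarts concentrated on the seed nodes.

    graph maps each node to a list of out-neighbours; alpha is the
    restart probability.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = (1 - alpha) * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for m in nodes:
                    nxt[m] += (1 - alpha) * rank[n] * restart[m]
        rank = nxt
    return rank


# Toy data: two retrievers disagree, and a small graph links the papers.
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])

graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ppr = personalized_pagerank(graph, seeds={"a"})
reranked = sorted(fused, key=lambda doc_id: -ppr[doc_id])
print(fused, reranked)
```

Seeding the restart distribution with a context paper is one way a search could be biased toward that paper's neighborhood, since nodes close to the seed accumulate more probability mass.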

Building your own index

Only needed if you want to customize the paper set or update to the latest papers.

pip install arxiv-search-kit[index]

# download metadata from ArXiv OAI-PMH (takes ~2 hours)
python -m arxiv_search_kit.scripts.build_index download --output metadata.jsonl

# build index (needs GPU, takes ~45 min)
python -m arxiv_search_kit.scripts.build_index all --output-dir ./my_index --device cuda

License

MIT
