Benchmark embedding models on academic paper similarity and retrieval tasks.

ArXiv Embedding Benchmark

A benchmarking toolkit for comparing embedding models on academic paper similarity tasks using research paper titles, abstracts, and field labels.

The project asks a practical retrieval question: can an embedding model connect a paper title to its real abstract, keep related papers close, separate unrelated fields, and behave consistently across domains?

Why this exists

Embedding models are often chosen by popularity or broad benchmark reputation. Research retrieval is more specific. A useful model for literature search, scientific RAG, or technical discovery needs to represent relationships between papers in a way that supports real downstream decisions.

This repo provides a repeatable evaluation harness for comparing model behavior across scientific fields.

What it evaluates

| Dimension | What it measures |
|---|---|
| Title to own abstract | Whether a model connects a paper title with its real content |
| Title to same-field abstracts | Whether it distinguishes related but different papers |
| Title to other-field abstracts | Whether it separates unrelated research areas |
| Abstract to abstract similarity | Whether papers cluster meaningfully by topic |
| Score consistency | Whether behavior is stable across fields and comparisons |
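The first three dimensions can be illustrated with plain cosine similarity. The sketch below is not the project's implementation: the function names, the input layout (one title vector and one abstract vector per paper, plus a field label), and the `separation` value (own similarity minus same-field similarity) are all assumptions made for illustration.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


def similarity_summary(title_vecs, abstract_vecs, fields):
    """Average title-to-abstract similarity, split by relationship.

    title_vecs[i] and abstract_vecs[i] belong to the same paper and
    fields[i] is that paper's field label. Assumes at least two fields
    and at least two papers in some field, so no bucket is empty.
    """
    own, same_field, other_field = [], [], []
    for i, title in enumerate(title_vecs):
        for j, abstract in enumerate(abstract_vecs):
            score = cosine(title, abstract)
            if i == j:
                own.append(score)          # title vs. its own abstract
            elif fields[i] == fields[j]:
                same_field.append(score)   # related but different paper
            else:
                other_field.append(score)  # unrelated research area
    return {
        "own": mean(own),
        "same_field": mean(same_field),
        "other_field": mean(other_field),
        # Hypothetical separation measure, not necessarily the one
        # reported in the leaderboard.
        "separation": mean(own) - mean(same_field),
    }
```

Under these assumptions, a model that keeps `own` high while keeping `same_field` clearly lower is the one that can tell a paper apart from its neighbors.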

Benchmark snapshot

The current experiment compares local, scientific, biomedical, general-purpose, and cloud-hosted embedding models.

| Rank | Model | Score | Own title / abstract | Same-field separation | Avg std |
|---|---|---|---|---|---|
| 1 | Bedrock | 0.449 | 0.710 | 0.103 | 0.118 |
| 2 | MPNet | 0.443 | 0.714 | 0.271 | 0.134 |
| 3 | MiniLM-L12 | 0.439 | 0.688 | 0.246 | 0.130 |
| 4 | MiniLM-L6 | 0.433 | 0.667 | 0.242 | 0.129 |
| 5 | RoBERTa-Large-ST | 0.410 | 0.601 | 0.165 | 0.110 |

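The useful signal is not only the winner. Bedrock has the highest aggregate score (0.449) but the lowest same-field separation in the table (0.103), while second-ranked MPNet has the highest same-field separation (0.271). Different model families trade off high title / abstract similarity against separation between related papers. For retrieval systems, over-clustering can be just as damaging as weak recall.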
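To see why over-clustering hurts retrieval, consider a top-1 check: a title counts as a hit only if its own abstract scores strictly higher than every other abstract. The sketch below is illustrative only. `top1_accuracy` is a hypothetical helper, not part of this package, and the two matrices are made-up numbers rather than benchmark results.

```python
def top1_accuracy(sims):
    """Fraction of titles whose own abstract is the strict best match.

    sims[i][j] is the similarity of title i to abstract j, so the
    matching abstract for title i sits on the diagonal.
    """
    hits = 0
    for i, row in enumerate(sims):
        best_other = max(score for j, score in enumerate(row) if j != i)
        if row[i] > best_other:
            hits += 1
    return hits / len(sims)


# Made-up similarity matrices for three papers in one field.
# High own similarity, but neighbors score just as high.
over_clustered = [
    [0.90, 0.91, 0.89],
    [0.92, 0.90, 0.88],
    [0.89, 0.91, 0.90],
]
# Lower own similarity, but a clear gap to the neighbors.
well_separated = [
    [0.70, 0.40, 0.35],
    [0.42, 0.68, 0.30],
    [0.33, 0.38, 0.72],
]

print(top1_accuracy(over_clustered))  # 0.0
print(top1_accuracy(well_separated))  # 1.0
```

In this toy case the matrix with the higher own-abstract similarity retrieves nothing correctly, which is the failure mode a separation metric is meant to expose.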

Features

  • Collects academic papers across configured research fields
  • Filters abstracts by token length for more consistent comparisons
  • Evaluates Hugging Face models and AWS Bedrock embeddings
  • Supports CPU execution with optional GPU acceleration
  • Caches embeddings to avoid unnecessary recomputation
  • Produces CSV leaderboards, detailed metrics, paper metadata, and experiment snapshots
  • Uses Rich progress output for long-running benchmark visibility
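Embedding caching, mentioned in the list above, can be sketched as a small disk cache keyed by model name and input text. This is a minimal illustration under stated assumptions, not the package's actual cache: the `EmbeddingCache` class, the JSON file layout, and the `embed_fn` callback are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


class EmbeddingCache:
    """Disk cache keyed by model name and text (hypothetical helper)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, model_name, text):
        # Hash model name and text together so the same text embedded
        # by two different models never shares a cache entry.
        key = hashlib.sha256(f"{model_name}\x00{text}".encode("utf-8")).hexdigest()
        return self.cache_dir / f"{key}.json"

    def get_or_compute(self, model_name, text, embed_fn):
        """Return the cached vector, calling embed_fn(text) only on a miss."""
        path = self._path(model_name, text)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        vector = [float(x) for x in embed_fn(text)]
        path.write_text(json.dumps(vector), encoding="utf-8")
        return vector
```

A cache of this kind matters most for cloud-hosted models, where each recomputed embedding is a billed API call.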

Quick start

git clone https://github.com/codychampion/arxiv-embedding-benchmark.git
cd arxiv-embedding-benchmark
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m src.embedding_benchmarking.cli evaluate

Configuration

Models and fields are configured in YAML. A typical run includes a mix of general-purpose, scientific, biomedical, and cloud-hosted models.

Outputs

Each run creates a timestamped experiment directory under experiments/.

| File | Purpose |
|---|---|
| embedding_comparison_results.csv | Full per-model metric table |
| model_leaderboard.csv | Ranked aggregate leaderboard |
| papers_metadata.csv | Paper titles, abstracts, fields, and metadata |
| collection_statistics.yaml | Corpus statistics and token distribution |
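The leaderboard from the most recent run can be read with the standard library alone. The sketch below assumes that the timestamped directory names sort chronologically as plain strings, which the source does not confirm. `latest_leaderboard` is a hypothetical helper, and it makes no assumption about the CSV column names.

```python
import csv
from pathlib import Path


def latest_leaderboard(experiments_dir="experiments"):
    """Return the rows of model_leaderboard.csv from the newest run.

    Assumes run directory names sort chronologically as strings.
    """
    root = Path(experiments_dir)
    runs = sorted(
        (p for p in root.iterdir() if (p / "model_leaderboard.csv").is_file()),
        key=lambda p: p.name,
    )
    if not runs:
        raise FileNotFoundError(f"no leaderboard found under {root}")
    leaderboard = runs[-1] / "model_leaderboard.csv"
    with leaderboard.open(newline="", encoding="utf-8") as handle:
        # Each row is a dict keyed by whatever header the file contains.
        return list(csv.DictReader(handle))
```

Calling `latest_leaderboard()` from the repository root would return one dictionary per model, ready to print or to compare against an earlier run.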

Publishing

The package includes PyPI metadata and a trusted-publishing workflow. Configure PyPI Trusted Publishing for this repository before cutting the first release.

python -m build
twine check dist/*

Project structure

src/embedding_benchmarking/
├── cli.py
├── config.py
├── data.py
├── embedding_evaluator.py
├── evaluation.py
├── models.py
└── utils.py

Notes on interpretation

This benchmark is best used as a decision-support tool, not a universal ranking. The right embedding model depends on the corpus, query style, task, and cost envelope.

Citation

@software{arxiv_embedding_benchmark,
  title = {ArXiv Embedding Benchmark},
  author = {Champion, Cody},
  year = {2024},
  description = {A tool for evaluating embedding models on academic paper similarity tasks}
}
