Benchmark embedding models on academic paper similarity and retrieval tasks.

Maintainers

cchampio

Project description

ArXiv Embedding Benchmark

A benchmarking toolkit for comparing embedding models on academic paper similarity tasks using research paper titles, abstracts, and field labels.

The project asks a practical retrieval question: can an embedding model connect a paper title to its real abstract, keep related papers close, separate unrelated fields, and behave consistently across domains?

Why this exists

Embedding models are often chosen by popularity or broad benchmark reputation. Research retrieval is more specific. A useful model for literature search, scientific RAG, or technical discovery needs to represent relationships between papers in a way that supports real downstream decisions.

This repo provides a repeatable evaluation harness for comparing model behavior across scientific fields.

What it evaluates

Dimension	What it measures
Title to own abstract	Whether a model connects a paper title with its real content
Title to same-field abstracts	Whether it distinguishes related but different papers
Title to other-field abstracts	Whether it separates unrelated research areas
Abstract to abstract similarity	Whether papers cluster meaningfully by topic
Score consistency	Whether behavior is stable across fields and comparisons
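The dimensions above can be sketched with plain cosine similarity. This is a minimal illustration, not the project's implementation: the `title_metrics` function, the `title_vec` / `abstract_vec` keys, and the toy vectors are all assumptions made for the example, and real embeddings would come from one of the evaluated models.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def title_metrics(papers):
    """Compare every title with every abstract and bucket the scores.

    `papers` is a list of dicts with hypothetical keys
    'field', 'title_vec', and 'abstract_vec'.
    """
    own, same, other = [], [], []
    for i, p in enumerate(papers):
        for j, q in enumerate(papers):
            sim = cosine(p["title_vec"], q["abstract_vec"])
            if i == j:
                own.append(sim)      # title vs. its own abstract
            elif p["field"] == q["field"]:
                same.append(sim)     # title vs. same-field abstracts
            else:
                other.append(sim)    # title vs. other-field abstracts
    return {
        "own": mean(own),
        "same_field": mean(same),
        "other_field": mean(other),
        "own_std": pstdev(own),      # spread as a rough consistency signal
    }


# Toy vectors standing in for real model embeddings.
papers = [
    {"field": "cs", "title_vec": [1, 0, 0], "abstract_vec": [1, 0, 0]},
    {"field": "cs", "title_vec": [1, 1, 0], "abstract_vec": [1, 1, 0]},
    {"field": "bio", "title_vec": [0, 0, 1], "abstract_vec": [0, 0, 1]},
]
metrics = title_metrics(papers)
```

A model that behaves well on these dimensions should show `own` above `same_field`, and `same_field` above `other_field`.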

Benchmark snapshot

The current experiment compares local, scientific, biomedical, general-purpose, and cloud-hosted embedding models.

Rank	Model	Score	Own title / abstract	Same-field separation	Avg std
1	Bedrock	0.449	0.710	0.103	0.118
2	MPNet	0.443	0.714	0.271	0.134
3	MiniLM-L12	0.439	0.688	0.246	0.130
4	MiniLM-L6	0.433	0.667	0.242	0.129
5	RoBERTa-Large-ST	0.410	0.601	0.165	0.110

The useful signal is not only the winner. Different model families trade off high title / abstract similarity against separation between related papers. For retrieval systems, over-clustering can be just as damaging as weak recall.
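To make the trade-off concrete, the snapshot rows can be re-ranked by a different column. The sketch below copies the numbers from the table above; the `rank_by` helper is hypothetical, and treating higher separation and lower average standard deviation as better is an assumption about how the columns should be read.

```python
# Rows copied from the snapshot table:
# (model, score, own title/abstract, same-field separation, avg std)
SNAPSHOT = [
    ("Bedrock", 0.449, 0.710, 0.103, 0.118),
    ("MPNet", 0.443, 0.714, 0.271, 0.134),
    ("MiniLM-L12", 0.439, 0.688, 0.246, 0.130),
    ("MiniLM-L6", 0.433, 0.667, 0.242, 0.129),
    ("RoBERTa-Large-ST", 0.410, 0.601, 0.165, 0.110),
]


def rank_by(rows, column, descending=True):
    """Return model names ordered by the value in the given column index."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=descending)
    return [row[0] for row in ordered]


by_score = rank_by(SNAPSHOT, 1)                        # the published ranking
by_separation = rank_by(SNAPSHOT, 3)                   # assumption: higher is better
by_stability = rank_by(SNAPSHOT, 4, descending=False)  # assumption: lower std is better
```

Under these assumptions, the model ranked first by score moves to last place when ranked by same-field separation, which is the kind of shift the leaderboard alone hides.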

Features

Collects academic papers across configured research fields
Filters abstracts by token length for more consistent comparisons
Evaluates Hugging Face models and AWS Bedrock embeddings
Supports CPU execution with optional GPU acceleration
Caches embeddings to avoid unnecessary recomputation
Produces CSV leaderboards, detailed metrics, paper metadata, and experiment snapshots
Uses Rich progress output for long-running benchmark visibility
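Two of the features above, token-length filtering and embedding caching, can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: `filter_by_token_length` and `EmbeddingCache` are hypothetical names, whitespace splitting stands in for a real tokenizer, and the on-disk JSON layout is invented for the example.

```python
import hashlib
import json
from pathlib import Path


def filter_by_token_length(abstracts, min_tokens, max_tokens):
    """Keep abstracts whose token count falls inside the given bounds.

    Whitespace splitting is a stand-in; a real run would likely count
    tokens with the tokenizer of the model being evaluated.
    """
    return [a for a in abstracts if min_tokens <= len(a.split()) <= max_tokens]


class EmbeddingCache:
    """File-backed cache keyed by a hash of (model name, text)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, model_name, text):
        key = hashlib.sha256(f"{model_name}\n{text}".encode("utf-8")).hexdigest()
        return self.cache_dir / f"{key}.json"

    def get_or_compute(self, model_name, text, embed_fn):
        """Return the cached vector, or call embed_fn once and store the result."""
        path = self._path(model_name, text)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        vector = list(embed_fn(text))
        path.write_text(json.dumps(vector), encoding="utf-8")
        return vector
```

Including the model name in the cache key keeps vectors from different models apart, so switching models never returns a stale embedding for the same text.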

Quick start

git clone https://github.com/codychampion/arxiv-embedding-benchmark.git
cd arxiv-embedding-benchmark
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m src.embedding_benchmarking.cli evaluate

Configuration

Models and fields are configured in YAML. A typical run includes a mix of general-purpose, scientific, biomedical, and cloud-hosted models.
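The actual YAML schema is not reproduced here, so the following is only a sketch of how such a configuration might be validated once loaded into a Python dictionary. Every key (`models`, `name`, `provider`, `fields`), the provider labels, and the placeholder model names are hypothetical; check the repository's own configuration files for the real layout.

```python
from dataclasses import dataclass

# Hypothetical provider labels, based on the two backends the README mentions.
ALLOWED_PROVIDERS = {"huggingface", "bedrock"}


@dataclass(frozen=True)
class ModelSpec:
    name: str
    provider: str


@dataclass(frozen=True)
class BenchmarkConfig:
    models: tuple
    fields: tuple


def parse_config(raw):
    """Validate a dictionary shaped like a loaded YAML config (hypothetical schema)."""
    models = []
    for entry in raw.get("models", []):
        provider = entry.get("provider", "huggingface")
        if provider not in ALLOWED_PROVIDERS:
            raise ValueError(f"unknown provider: {provider!r}")
        models.append(ModelSpec(name=entry["name"], provider=provider))
    fields = tuple(raw.get("fields", []))
    if not models or not fields:
        raise ValueError("config needs at least one model and one field")
    return BenchmarkConfig(models=tuple(models), fields=fields)


# Placeholder values; these are not the project's real model IDs or field names.
example = {
    "models": [
        {"name": "example-local-model", "provider": "huggingface"},
        {"name": "example-cloud-model", "provider": "bedrock"},
    ],
    "fields": ["example-field-a", "example-field-b"],
}
config = parse_config(example)
```

Failing early on an unknown provider or an empty model list avoids discovering a typo only after papers have been collected.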

Outputs

Each run creates a timestamped experiment directory under experiments/.

File	Purpose
`embedding_comparison_results.csv`	Full per-model metric table
`model_leaderboard.csv`	Ranked aggregate leaderboard
`papers_metadata.csv`	Paper titles, abstracts, fields, and metadata
`collection_statistics.yaml`	Corpus statistics and token distribution
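A short sketch for picking up the most recent run and reading its leaderboard. It assumes the timestamped directory names sort chronologically as plain strings, which the README does not confirm, and it makes no assumption about the CSV column names. The helper names are hypothetical.

```python
import csv
from pathlib import Path


def latest_experiment_dir(root="experiments"):
    """Return the last run directory, assuming names sort chronologically."""
    run_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    if not run_dirs:
        raise FileNotFoundError(f"no experiment directories under {root}")
    return run_dirs[-1]


def read_leaderboard(experiment_dir):
    """Load model_leaderboard.csv as a list of dicts keyed by its header row."""
    path = Path(experiment_dir) / "model_leaderboard.csv"
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

After a run, `read_leaderboard(latest_experiment_dir())` returns one dictionary per model, with keys taken from whatever header the file contains.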

Publishing

The package includes PyPI metadata and a trusted-publishing workflow. Configure PyPI Trusted Publishing for this repository before cutting the first release.

python -m build
twine check dist/*

Project structure

src/embedding_benchmarking/
├── cli.py
├── config.py
├── data.py
├── embedding_evaluator.py
├── evaluation.py
├── models.py
└── utils.py

Notes on interpretation

This benchmark is best used as a decision-support tool, not a universal ranking. The right embedding model depends on the corpus, query style, task, and cost envelope.

Citation

@software{arxiv_embedding_benchmark,
  title = {ArXiv Embedding Benchmark},
  author = {Champion, Cody},
  year = {2024},
  description = {A tool for evaluating embedding models on academic paper similarity tasks}
}

Project details

Release history

This version

0.1.0

May 22, 2026

Download files

Source Distribution

arxiv_embedding_benchmark-0.1.0.tar.gz (2.9 MB)

Uploaded May 22, 2026 Source

Built Distribution

arxiv_embedding_benchmark-0.1.0-py3-none-any.whl (19.0 kB)

Uploaded May 22, 2026 Python 3

File details

Details for the file arxiv_embedding_benchmark-0.1.0.tar.gz.

File metadata

Download URL: arxiv_embedding_benchmark-0.1.0.tar.gz
Upload date: May 22, 2026
Size: 2.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxiv_embedding_benchmark-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a9f2ac473d1f0ca4232c617c2ed8896f241cd4b8c28ab7fe8665a5770f251512`
MD5	`111f6fab86b8a5f3d2ac7e80d18d4f9b`
BLAKE2b-256	`af592927fe319d56bf7540b2e51ede388e30e330d8b50fe13acc22435fc04a35`

Provenance

The following attestation bundles were made for arxiv_embedding_benchmark-0.1.0.tar.gz:

Publisher: pypi-publish.yml on codychampion/arxiv-embedding-benchmark

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arxiv_embedding_benchmark-0.1.0.tar.gz
- Subject digest: a9f2ac473d1f0ca4232c617c2ed8896f241cd4b8c28ab7fe8665a5770f251512
- Sigstore transparency entry: 1601861133
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: codychampion/arxiv-embedding-benchmark@132d8692147d8149d905f53e4e9a5977795b774b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/codychampion
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@132d8692147d8149d905f53e4e9a5977795b774b
- Trigger Event: release

File details

Details for the file arxiv_embedding_benchmark-0.1.0-py3-none-any.whl.

File metadata

Download URL: arxiv_embedding_benchmark-0.1.0-py3-none-any.whl
Upload date: May 22, 2026
Size: 19.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for arxiv_embedding_benchmark-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1f794421254c90e9f32151c5cc700b549e48216943fc7767e0bdc41bff2f28bb`
MD5	`1857312511029da0d1e4776ad637aee0`
BLAKE2b-256	`a65af0a14aac1bc78349c7d801b119ac392527749c39588241f76e7e284cddce`

Provenance

The following attestation bundles were made for arxiv_embedding_benchmark-0.1.0-py3-none-any.whl:

Publisher: pypi-publish.yml on codychampion/arxiv-embedding-benchmark

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: arxiv_embedding_benchmark-0.1.0-py3-none-any.whl
- Subject digest: 1f794421254c90e9f32151c5cc700b549e48216943fc7767e0bdc41bff2f28bb
- Sigstore transparency entry: 1601861141
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: codychampion/arxiv-embedding-benchmark@132d8692147d8149d905f53e4e9a5977795b774b
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/codychampion
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@132d8692147d8149d905f53e4e9a5977795b774b
- Trigger Event: release
