ADS author name disambiguation with a bundled NAND-based baseline model and automatic local CPU/GPU inference.
ads-and
ads-and is a Python package for author name disambiguation (AND) on SAO/NASA ADS records. Given publications and optionally references in ADS parquet format, it assigns stable author identifiers and writes disambiguated outputs.
The bundled model is a packaged and slightly refined version of NAND (Neural Author Name Disambiguator), described in Amado Olivo et al. 2025. NAND was trained and evaluated on LSPO, a large-scale physics and astronomy AND benchmark built from ~553k NASA/ADS publications linked to ORCID identities (~125k researchers). The model ships inside the package; no external bundle is required.
The bundled package was re-evaluated on the same LSPO benchmark under a reproducible five-seed protocol. Clustering performance on LSPO (with constraints enabled):
| Model | F1 | Precision | Recall |
|---|---|---|---|
| NAND (Amado Olivo et al. 2025) | 95.93% | 96.15% | 96.21% |
| ads-and (this package) | 97.02% | 96.36% | 97.70% |
Python import path: author_name_disambiguation
Install
Use uv. Requires Python ≥ 3.11.
uv pip install ads-and
If you don't have a GPU, the optional ONNX extra gives faster CPU inference (still much slower than GPU):
uv pip install "ads-and[cpu_onnx]"
Optional Modal backend (requires a Modal account):
uv pip install "ads-and[modal]"
Usage
CLI
ads-and infer `
--publications-path path/to/publications.parquet `
--references-path path/to/references.parquet `
--output-dir path/to/output-dir `
--runtime auto
Add --json for a machine-readable run summary on stdout.
--runtime options: auto (GPU if CUDA is available, else CPU), gpu, cpu.
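The `auto` rule can be sketched as a small pure function. This is an illustrative sketch only, not the package's implementation: `resolve_runtime` and its `cuda_available` argument are hypothetical names, and rejecting unknown values with `ValueError` is an assumption.

```python
def resolve_runtime(requested: str, cuda_available: bool) -> str:
    """Map a --runtime value to the device that would be used (hypothetical helper)."""
    if requested == "auto":
        # auto: GPU if CUDA is available, else CPU
        return "gpu" if cuda_available else "cpu"
    if requested in ("gpu", "cpu"):
        # Explicit choices pass through unchanged.
        return requested
    # Assumption: anything else is treated as a usage error.
    raise ValueError(f"unknown runtime: {requested!r}")
```

With this sketch, `resolve_runtime("auto", cuda_available=False)` falls back to `"cpu"`, while an explicit `"gpu"` is returned as requested.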
Advanced infer flags such as --infer-stage, --dataset-id, and
--modal-gpu are documented in
docs/inference_workflow.md.
The Modal backend uses the same command surface, with Modal acting as a managed remote GPU backend (requires a Modal account):
ads-and infer `
--publications-path path/to/publications.parquet `
--references-path path/to/references.parquet `
--output-dir path/to/output-dir `
--backend modal `
--runtime gpu `
--modal-gpu l4
Current repo Modal config is --backend modal --runtime gpu --modal-gpu l4. The
local client uploads the ADS parquet inputs, Modal runs the same bundled infer
workflow remotely, and the finished outputs are copied back into output-dir.
Current L4 rule of thumb: about $0.00085 and ~2.5s per 1,000 ADS entries.
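For planning, the rule of thumb can be turned into a rough estimate. The sketch below assumes cost and time scale linearly with the number of entries, which the figures above suggest but do not guarantee; `estimate_modal_run` is a hypothetical helper, not part of the package.

```python
# Rule-of-thumb figures quoted above for an L4 GPU on Modal.
USD_PER_1000_ENTRIES = 0.00085
SECONDS_PER_1000_ENTRIES = 2.5


def estimate_modal_run(n_entries: int) -> tuple[float, float]:
    """Return (estimated USD, estimated seconds), assuming linear scaling."""
    thousands = n_entries / 1000
    return thousands * USD_PER_1000_ENTRIES, thousands * SECONDS_PER_1000_ENTRIES


cost_usd, seconds = estimate_modal_run(1_000_000)
print(f"~${cost_usd:.2f}, ~{seconds / 60:.0f} min")  # ~$0.85, ~42 min
```

Treat the result as an order-of-magnitude estimate only; exact costs come from the lookup described below.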
Configure
MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in your environment or a repo-root
.env before using --backend modal.
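A quick pre-flight check for the two variables might look like the sketch below. `missing_modal_credentials` is a hypothetical helper, and it only inspects the process environment; it does not read a repo-root .env file, so variables defined only there will be reported as missing.

```python
import os

REQUIRED_MODAL_VARS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_credentials(env=None) -> list[str]:
    """Return the names of required Modal variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_MODAL_VARS if not env.get(name)]


missing = missing_modal_credentials()
if missing:
    print("Set before using --backend modal:", ", ".join(missing))
```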
Exact Modal costs are retrieved with a separate lookup of the official figures:
ads-and cost --output-dir path/to/output-dir
Run this as a follow-up after the inference run, once the billing window has closed.
Python
Local CPU/GPU:
from author_name_disambiguation import disambiguate_sources
result = disambiguate_sources(
    publications_path="path/to/publications.parquet",
    references_path="path/to/references.parquet",
    output_dir="path/to/output-dir",
    runtime="auto",
)
print(result.publications_disambiguated_path)
print(result.summary_path)
Modal:
from author_name_disambiguation import disambiguate_sources, resolve_modal_cost
modal_result = disambiguate_sources(
    publications_path="path/to/publications.parquet",
    references_path="path/to/references.parquet",
    output_dir="path/to/output-dir",
    backend="modal",
    runtime="gpu",
    modal_gpu="l4",
)
# later, after the billing interval closes
cost_result = resolve_modal_cost("path/to/output-dir")
Input schema
--publications-path is required. --references-path is optional.
| Column | Required | Type | Example |
|---|---|---|---|
| Bibcode | yes | str | "2000MNRAS.319..168C" |
| Author | yes | list[str] or semicolon-delimited str | ["Cole, Shaun", "Lacey, Cedric G."] |
| Title_en or Title | no, but strongly recommended | str | "Galaxy luminosity functions in..." |
| Abstract_en or Abstract | no, but strongly recommended | str | "We model the galaxy population..." |
| Affiliation | no | str (ADS format) or list[str] (per-author) | "AA(Durham Univ, Dept of Physics); AB(...)" |
| Year | no | int | 2000 |
Records missing Bibcode or Author are skipped. Records missing both Title and Abstract will be processed but with meaningfully reduced disambiguation quality, since the model relies heavily on textual context to distinguish authors.
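To see how a record would fare before running inference, the rules above can be approximated with a small pre-check. This is a sketch under stated assumptions, not the package's own validation: `normalize_authors` and `check_record` are hypothetical helpers, records are plain dicts keyed by the column names in the schema, and the labels `"skip"`, `"degraded"`, and `"ok"` are invented for illustration.

```python
def normalize_authors(value) -> list[str]:
    """Accept list[str] or a semicolon-delimited str; return a clean list."""
    if value is None:
        return []
    parts = value.split(";") if isinstance(value, str) else list(value)
    return [p.strip() for p in parts if p and p.strip()]


def check_record(record: dict) -> str:
    """Classify a record as 'skip', 'degraded', or 'ok' (illustrative labels)."""
    # Records missing Bibcode or Author are skipped.
    if not record.get("Bibcode") or not normalize_authors(record.get("Author")):
        return "skip"
    has_title = bool(record.get("Title_en") or record.get("Title"))
    has_abstract = bool(record.get("Abstract_en") or record.get("Abstract"))
    # Missing both Title and Abstract: processed, but with reduced quality.
    if not has_title and not has_abstract:
        return "degraded"
    return "ok"


record = {
    "Bibcode": "2000MNRAS.319..168C",
    "Author": "Cole, Shaun; Lacey, Cedric G.",
    "Title": "Galaxy luminosity functions in...",
}
print(normalize_authors(record["Author"]), check_record(record))
```

Counting the `"degraded"` records in a dataset gives a rough sense of how much of the output may be affected by missing textual context.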
Output
All files are written under output_dir:
| File | Contents |
|---|---|
| publications_disambiguated.parquet | input columns + AuthorUID, AuthorDisplayName |
| references_disambiguated.parquet | same, for references (only when references are provided) |
| source_author_assignments.parquet | row-level author-to-entity assignments |
| author_entities.parquet | inferred author entities |
| mention_clusters.parquet | mention-to-cluster mapping |
| summary.json | high-level run summary |
| 05_stage_metrics_infer_sources.json | diagnostic per-stage runtime and validation metrics |
| 05_go_no_go_infer_sources.json | diagnostic run validation summary |
The two disambiguated parquets preserve all input columns and append:
| Column | Type | Example |
|---|---|---|
| AuthorUID | list[str] | ["ads_run::s.cole::1", "ads_run::c.lacey::0", "ads_run::c.baugh::0"] |
| AuthorDisplayName | list[str] | ["Cole, Shaun", "Lacey, C. G.", "Baugh, C. M."] |
Both columns are parallel lists in the same order as the input Author column. Each UID is stable across runs for the same registry. Each author entity gets exactly one display name — the most frequently occurring form of their name in the data (could be full-name or abbreviated depending on the entity). The same UID always carries the same display name string.
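Because the columns are parallel, they can be zipped position by position. The sketch below works on plain dicts standing in for rows already loaded from the output parquet, with Author already in list form; `pair_authors` and `check_display_names` are hypothetical helpers, and the sample row reuses values from the examples above purely for illustration.

```python
def pair_authors(row: dict) -> list[tuple[str, str, str]]:
    """Zip Author, AuthorUID, and AuthorDisplayName position by position."""
    authors, uids, names = row["Author"], row["AuthorUID"], row["AuthorDisplayName"]
    if not (len(authors) == len(uids) == len(names)):
        raise ValueError("parallel columns differ in length")
    return list(zip(authors, uids, names))


def check_display_names(rows) -> dict[str, str]:
    """Build a UID -> display name map, failing if a UID carries two names."""
    seen: dict[str, str] = {}
    for row in rows:
        for _, uid, name in pair_authors(row):
            if seen.setdefault(uid, name) != name:
                raise ValueError(f"{uid} has display names {seen[uid]!r} and {name!r}")
    return seen


rows = [{
    "Author": ["Cole, Shaun", "Lacey, Cedric G."],
    "AuthorUID": ["ads_run::s.cole::1", "ads_run::c.lacey::0"],
    "AuthorDisplayName": ["Cole, Shaun", "Lacey, C. G."],
}]
for author, uid, name in pair_authors(rows[0]):
    print(f"{author!r} -> {uid} ({name})")
```

A check like `check_display_names` is one way to confirm, on your own output, that each UID carries a single display name string.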
Further Details
The bundled fixed model ships inside the package. Repo-only research workflows require user-supplied LSPO raw data from the original source release; both parquet and HDF5 inputs are supported for LSPO preparation and evaluation.
Citation
Cite ads-and as software via CITATION.cff. Cite the original NAND paper if you discuss the underlying method or baseline:
Vicente Amado Olivo, Wolfgang Kerzendorf, Bangjing Lu, Joshua V. Shields, Andreas Flörs, and Nutan Chen (2025). Practical Author Name Disambiguation under Metadata Constraints: A Contrastive Learning Approach for Astronomy Literature. Publications of the Astronomical Society of the Pacific, 137(12), 124503. https://doi.org/10.1088/1538-3873/ae1e2d
Cite LSPO separately if you discuss the benchmark or dataset:
Amado Olivo, V. (2024). LSPO: A Large-Scale Physics ORCiD-Linked Dataset for Author Name Disambiguation (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11489161
Resources:
- Original NAND repository: https://github.com/deepthought-initiative/neural_name_dismabiguator
- Original NAND paper: https://doi.org/10.1088/1538-3873/ae1e2d
- LSPO dataset: https://doi.org/10.5281/zenodo.11489161