NAND-based author name disambiguation for SAO/NASA ADS publication metadata.

ads-and

ads-and is a Python package for author name disambiguation (AND) on SAO/NASA ADS records. Given publications and optionally references in ADS parquet format, it assigns stable author identifiers and writes disambiguated outputs.

The bundled model is a packaged and slightly refined version of NAND (Neural Author Name Disambiguator), described in Amado Olivo et al. 2025. NAND was trained and evaluated on LSPO, a large-scale physics and astronomy AND benchmark built from ~553k NASA/ADS publications linked to ORCID identities (~125k researchers). The model ships inside the package, so no external bundle is required.

This implementation was re-evaluated on LSPO under a five-seed protocol. Clustering performance on LSPO (with constraints enabled):

Model	F1	Precision	Recall
NAND — Amado Olivo et al. 2025	95.93%	96.15%	96.21%
`ads-and` (this package)	97.02%	96.36%	97.70%

Python import path: author_name_disambiguation

Install

Use uv. Requires Python 3.12.

uv pip install ads-and

If you don't have a GPU, you can install the optional ONNX CPU backend, which may be faster depending on host and workload:

uv pip install "ads-and[cpu_onnx]"

Optional Modal backend (requires a Modal account):

uv pip install "ads-and[modal]"

Usage

CLI

ads-and infer `
  --publications-path path/to/publications.parquet `
  --references-path path/to/references.parquet `
  --output-dir path/to/output-dir `
  --runtime auto

Add --json for a machine-readable run summary on stdout.
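The CLI call above can also be driven from a script. The following is a minimal sketch, not part of the package: the helper names `build_infer_command` and `run_infer` are hypothetical, and it assumes that with --json the command prints a single JSON document on stdout and nothing else. The keys of that summary are not documented here, so the sketch returns the parsed object unchanged.

```python
import json
import subprocess


def build_infer_command(publications_path, output_dir, references_path=None, runtime="auto"):
    """Assemble the argv list for `ads-and infer ... --json` (hypothetical helper)."""
    cmd = ["ads-and", "infer", "--publications-path", str(publications_path)]
    if references_path is not None:
        # --references-path is optional, so only pass it when given.
        cmd += ["--references-path", str(references_path)]
    cmd += ["--output-dir", str(output_dir), "--runtime", runtime, "--json"]
    return cmd


def run_infer(publications_path, output_dir, references_path=None, runtime="auto"):
    """Run the CLI and parse stdout as JSON.

    Assumption: stdout contains only the machine-readable run summary.
    """
    cmd = build_infer_command(publications_path, output_dir, references_path, runtime)
    completed = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(completed.stdout)
```

Passing the arguments as a list avoids shell-specific line continuation and quoting, so the same call works on any platform where the ads-and executable is on PATH.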

--runtime options: auto (GPU if CUDA is available, else CPU), gpu, cpu. Advanced infer flags such as --infer-stage, --dataset-id, and --modal-gpu are documented in docs/inference_workflow.md.

The Modal backend uses the same command surface, with Modal acting as a managed remote GPU backend (requires a Modal account):

ads-and infer `
  --publications-path path/to/publications.parquet `
  --references-path path/to/references.parquet `
  --output-dir path/to/output-dir `
  --backend modal `
  --runtime gpu `
  --modal-gpu l4

The current repo Modal configuration is --backend modal --runtime gpu --modal-gpu l4. The local client uploads the ADS parquet inputs, Modal runs the same bundled infer workflow remotely, and the finished outputs are copied back into output-dir. As a current rule of thumb on an L4, expect about $0.00085 and ~2.5 s per 1,000 ADS entries. Before using --backend modal, set MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in your environment or in a repo-root .env file.
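To get a feel for the L4 rule of thumb, it can be turned into a rough planning estimate. This is a back-of-the-envelope sketch only: the function name `estimate_modal_l4` is hypothetical, and it assumes cost and time scale linearly with the number of entries, ignoring upload time, container start-up, and any pricing changes.

```python
# Rule-of-thumb figures quoted for an L4 GPU on Modal.
USD_PER_1000_ENTRIES = 0.00085
SECONDS_PER_1000_ENTRIES = 2.5


def estimate_modal_l4(n_entries):
    """Return (approx_usd, approx_seconds) for n_entries ADS entries.

    Assumption: purely linear scaling from the per-1,000-entry figures.
    """
    if n_entries < 0:
        raise ValueError("n_entries must be non-negative")
    thousands = n_entries / 1000
    return thousands * USD_PER_1000_ENTRIES, thousands * SECONDS_PER_1000_ENTRIES


usd, seconds = estimate_modal_l4(1_000_000)
print(f"~${usd:.2f}, ~{seconds / 60:.1f} min")  # ~$0.85, ~41.7 min
```

Treat the result as an order-of-magnitude guide; the exact billed amount comes from the separate cost lookup.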

Exact Modal costs are retrieved through a separate official lookup:

ads-and cost --output-dir path/to/output-dir

Run this as a follow-up once the inference run has finished and the billing window has closed.

Python

Local CPU/GPU:

from author_name_disambiguation import disambiguate_sources

result = disambiguate_sources(
    publications_path="path/to/publications.parquet",
    references_path="path/to/references.parquet",
    output_dir="path/to/output-dir",
    runtime="auto",
)

print(result.publications_disambiguated_path)
print(result.summary_path)

Modal:

from author_name_disambiguation import disambiguate_sources, resolve_modal_cost

modal_result = disambiguate_sources(
    publications_path="path/to/publications.parquet",
    references_path="path/to/references.parquet",
    output_dir="path/to/output-dir",
    backend="modal",
    runtime="gpu",
    modal_gpu="l4",
)

# later, after the billing interval closes
cost_result = resolve_modal_cost("path/to/output-dir")
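Since the Modal backend needs MODAL_TOKEN_ID and MODAL_TOKEN_SECRET, a quick pre-flight check can fail early with a clear message. This is a minimal sketch with a hypothetical helper name, `missing_modal_credentials`. It only inspects environment variables and does not read a repo-root .env file, so it is stricter than what the package may accept.

```python
import os

REQUIRED_MODAL_VARS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_credentials(env=None):
    """Return the names of required Modal variables that are unset or empty.

    `env` defaults to os.environ; pass a dict to check other sources.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_MODAL_VARS if not env.get(name)]


missing = missing_modal_credentials()
if missing:
    print("Set these before using backend='modal':", ", ".join(missing))
```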

Input schema

--publications-path is required. --references-path is optional.

Column	Required	Type	Example
`Bibcode`	yes	`str`	`"2000MNRAS.319..168C"`
`Author`	yes	`list[str]` or semicolon-delimited `str`	`["Cole, Shaun", "Lacey, Cedric G."]`
`Title_en` or `Title`	no — but strongly recommended	`str`	`"Galaxy luminosity functions in..."`
`Abstract_en` or `Abstract`	no — but strongly recommended	`str`	`"We model the galaxy population..."`
`Affiliation`	no	`str` (ADS format) or `list[str]` (per-author)	`"AA(Durham Univ, Dept of Physics); AB(...)"`
`Year`	no	`int`	`2000`

Records missing Bibcode or Author are skipped. Records missing both Title and Abstract will be processed but with meaningfully reduced disambiguation quality, since the model relies heavily on textual context to distinguish authors.
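The rules above can be previewed on your own data before writing the parquet file. The sketch below works on plain dictionaries and mirrors only what this section states; it is not the package's own parser, and the helper names `normalize_authors` and `split_usable_records` as well as the `_has_text` flag are hypothetical.

```python
def normalize_authors(author):
    """Return a list of names from a list or a semicolon-delimited string."""
    if author is None:
        return []
    parts = author.split(";") if isinstance(author, str) else list(author)
    return [part.strip() for part in parts if part and part.strip()]


def split_usable_records(records):
    """Separate records that would be processed from those that would be skipped.

    A record is skipped when Bibcode or Author is missing. Usable records get a
    `_has_text` flag that is False when neither a title nor an abstract is present.
    """
    usable, skipped = [], []
    text_columns = ("Title_en", "Title", "Abstract_en", "Abstract")
    for record in records:
        authors = normalize_authors(record.get("Author"))
        if not record.get("Bibcode") or not authors:
            skipped.append(record)
            continue
        has_text = any(record.get(column) for column in text_columns)
        usable.append({**record, "Author": authors, "_has_text": has_text})
    return usable, skipped


records = [
    {"Bibcode": "2000MNRAS.319..168C", "Author": "Cole, Shaun; Lacey, Cedric G.",
     "Title": "Galaxy luminosity functions in..."},
    {"Bibcode": "2000MNRAS.319..168C", "Author": ["Cole, Shaun"]},  # no title or abstract
    {"Author": ["Cole, Shaun"]},  # no Bibcode, would be skipped
]
usable, skipped = split_usable_records(records)
print(len(usable), len(skipped))  # 2 1
```

Records flagged with `_has_text` set to False are still processed, but they are the ones to expect weaker disambiguation for.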

Output

All files are written under output_dir:

File	Contents
`publications_disambiguated.parquet`	input columns + `AuthorUID`, `AuthorDisplayName`
`references_disambiguated.parquet`	same, for references (only when references are provided)
`source_author_assignments.parquet`	row-level author-to-entity assignments
`author_entities.parquet`	inferred author entities
`mention_clusters.parquet`	mention-to-cluster mapping
`summary.json`	high-level run summary
`05_stage_metrics_infer_sources.json`	diagnostic per-stage runtime and validation metrics
`05_go_no_go_infer_sources.json`	diagnostic run validation summary

The two disambiguated parquets preserve all input columns and append:

Column	Type	Example
`AuthorUID`	`list[str]`	`["ads_run::s.cole::1", "ads_run::c.lacey::0", "ads_run::c.baugh::0"]`
`AuthorDisplayName`	`list[str]`	`["Cole, Shaun", "Lacey, C. G.", "Baugh, C. M."]`

Both columns are parallel lists in the same order as the input Author column. Each UID is stable across runs for the same registry. Each author entity gets exactly one display name: the most frequently occurring form of that author's name in the data, which may be a full or an abbreviated form depending on the entity. The same UID always carries the same display name string.
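Because the appended columns are parallel to Author, one row can be unpacked into per-author triples by zipping the three lists. The following sketch works on plain dictionaries standing in for parquet rows; the helper names are hypothetical and the sample values are illustrative, taken from the examples in this section. It also checks the stated property that one UID always maps to one display name.

```python
def author_rows(record):
    """Return (input_name, uid, display_name) triples for one publication row."""
    authors = record["Author"]
    uids = record["AuthorUID"]
    names = record["AuthorDisplayName"]
    if not (len(authors) == len(uids) == len(names)):
        raise ValueError("Author, AuthorUID and AuthorDisplayName must have equal length")
    return list(zip(authors, uids, names))


def display_name_by_uid(records):
    """Build a UID -> display name mapping, failing if a UID has two names."""
    mapping = {}
    for record in records:
        for _, uid, name in author_rows(record):
            if mapping.setdefault(uid, name) != name:
                raise ValueError(f"UID {uid} carries more than one display name")
    return mapping


# Illustrative row, not real output.
row = {
    "Author": ["Cole, Shaun", "Lacey, Cedric G.", "Baugh, C. M."],
    "AuthorUID": ["ads_run::s.cole::1", "ads_run::c.lacey::0", "ads_run::c.baugh::0"],
    "AuthorDisplayName": ["Cole, Shaun", "Lacey, C. G.", "Baugh, C. M."],
}
for input_name, uid, display_name in author_rows(row):
    print(f"{input_name!r} -> {uid} ({display_name})")
```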

Reproducibility

The bundled inference model is the selected fixed model from full_20260218T111506Z_cli02681429. The five-seed LSPO result above is backed by tracked repo-level artifacts under artifacts/, including the five seed checkpoints and the canonical clustering report. Raw LSPO is not redistributed; download it separately from Zenodo to rerun the quality workflow.

See Training workflow for the exact LSPO reproduction and release-gate commands.

Further Details

Citation

Cite ads-and as software via CITATION.cff. Cite the original NAND paper if you discuss the underlying method or baseline:

Vicente Amado Olivo, Wolfgang Kerzendorf, Bangjing Lu, Joshua V. Shields, Andreas Flörs, and Nutan Chen (2025). Practical Author Name Disambiguation under Metadata Constraints: A Contrastive Learning Approach for Astronomy Literature. Publications of the Astronomical Society of the Pacific, 137(12), 124503. https://doi.org/10.1088/1538-3873/ae1e2d

And cite LSPO separately if you discuss the benchmark or dataset:

Vicente Amado Olivo (2024). LSPO: A Large-Scale Physics ORCiD-Linked Dataset for Author Name Disambiguation. Zenodo, Version 1. https://doi.org/10.5281/zenodo.11489161

Resources:

Original NAND repository: https://github.com/deepthought-initiative/neural_name_dismabiguator
Original NAND paper: https://doi.org/10.1088/1538-3873/ae1e2d
LSPO dataset: https://doi.org/10.5281/zenodo.11489161

Release history

0.1.3 (this version): Apr 27, 2026
0.1.2: Apr 22, 2026
0.1.1: Apr 22, 2026
0.1.0: Apr 22, 2026

Download files

Source Distribution: ads_and-0.1.3.tar.gz (4.2 MB), uploaded Apr 27, 2026, Source
Built Distribution: ads_and-0.1.3-py3-none-any.whl (4.2 MB), uploaded Apr 27, 2026, Python 3

File details

Details for the file ads_and-0.1.3.tar.gz.

File metadata

Download URL: ads_and-0.1.3.tar.gz
Upload date: Apr 27, 2026
Size: 4.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.21

File hashes

Hashes for ads_and-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`1ecae50b58ec4a362e785dd5c941ce294ae2b5f66dbc49bad6a08f43e6960c00`
MD5	`025528d1ffe2eb61e7e5b2de885c2253`
BLAKE2b-256	`766b2b73f225d4a854642e711b379e55c89bab9f2ecbd688dcd4e487030ba987`

File details

Details for the file ads_and-0.1.3-py3-none-any.whl.

File metadata

Download URL: ads_and-0.1.3-py3-none-any.whl
Upload date: Apr 27, 2026
Size: 4.2 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.7.21

File hashes

Hashes for ads_and-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9200148baf5896ff20f3e80bd91f4a54ecca06324c0f1147413669d69f2d7e6c`
MD5	`7c1945ebbe73f2f188a5a1a13a3b3e6d`
BLAKE2b-256	`9ae9b4820e237bf3fa9808ab423901e467a4e2b1acf61adf7d648d0c6ab2fb36`
