AI-powered author disambiguation and works search agents for OpenAlex
Author Disambiguation Agent
Production-ready AI agent for disambiguating life sciences researchers and finding their OpenAlex author IDs and work IDs.
Current Version: 2.9.2 (CEO agent guardrails)
Features
- Author Disambiguation: Find researchers' OpenAlex profiles using ORCID, name, and institution
- CEO / Commercial Interest Check: Standalone agent to detect company roles, board memberships, and patents; now skill-based with an optional patent step (`use_patents=True`)
- PubPeer Search + AI Analysis ✨ NEW: Check PubPeer mentions and optionally classify each discussion by type (`data_integrity`, `data_presentation`, `novelty`, `other`) with a plain-language summary
- Multi-group EMBO Support: Group-specific disambiguation strategies for all four EMBO programmes (`members`, `yip`, `gin`, `ig`)
- Works Search Agent: Find academic papers and extract author information from works
- Email Discovery (Optional): Find current email addresses from institutional directories and publications
- Claude Skills: Modular knowledge system with 11 expert skills — 6 for author disambiguation, 5 for CEO checks
- OpenAlex MCP Tools: 9 specialized tools for searching authors, works, and publications
- Web Search & Fetch: Access institutional pages, academic profiles, and publication PDFs
- Multi-source verification: OpenAlex, people.embo.org, institutional directories, ORCID
- Structured outputs: JSON schema enforcement for both author and work results
- Embedded MCP Pattern: Direct async tool calls without stdio overhead
- Benchmark Infrastructure: Comprehensive evaluation framework with 11,332+ high-confidence ground truth matches
Architecture
┌─────────────────────────────────────────────────────────┐
│ Production Agent (Claude API) │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Async Agent Loop │ │
│ └────┬──────────────────────────┬──────────────────┘ │
│ │ │ │
│ ├──► web_search ├──► OpenAlex MCP │
│ │ (Native Tool) │ (Embedded) │
│ │ │ │
│ │ └─────┬─────────────┐ │
│ │ │ │ │
│ │ │ │ │
│ │ ┌─────▼─────────┐ │ │
│ │ │ OpenAlexTools │ │ │
│ │ │ (pyalex) │ │ │
│ │ └───────────────┘ │ │
└───────┴─────────────────────────────────────────────────┘
MCP Tools Available (9 total):
Author Tools:
• search_authors_by_name - Domain-aware name search
• search_authors_by_orcid - ORCID lookup (most reliable)
• search_authors_by_name_and_institution - Filtered search including institution
• get_author_details - Complete author profile
• get_author_recent_works - Recent publications
Works Tools (NEW):
• search_works_by_title - Find papers by title
• search_works_by_doi - Get work by DOI (most reliable)
• search_works_by_title_and_author - Combined title + author search
• get_work_details - Complete work information
Claude Skills
The agent uses a modular knowledge system with expert skills located in src/.claude/skills/:
Available Skills
Author Disambiguation Skills (6)
- author-disambiguation-strategy — Decision logic, search order (ORCID → name+institution → name), confidence scoring
- email-finder-strategy — Priority-ordered email search; uses most recent affiliation and last-author paper
- output-schema-formatter — JSON schema spec, field requirements, and examples for all status types
- openalex-expert — OpenAlex API best practices, query patterns, domain classification
- works-search-strategy — Find papers by title/DOI, title normalisation, author validation in authorships
- verification-expert — Evidence quality evaluation, ORCID trust hierarchy, confidence scoring
CEO / Commercial Interest Skills (5)
- ceo-author-profile — Step 1: Fetch OpenAlex author profile and recent papers; flag company affiliations
- ceo-paper-fetcher — Helper: Access paper full text via DOI, Europe PMC, PubMed, and preprints
- ceo-coi-reader — Step 2: Read Conflict of Interest sections; classify disclosure strength
- ceo-web-search — Step 3: Targeted web searches for executive roles, boards, LinkedIn
- ceo-patent-search — Step 4 (optional): Search Google Patents / EPO for company-assigned patents
Benefits of Skills
- Modular: Each step can be read, edited, and tested independently
- Selective activation: Patent search is off by default — enabled only when needed
- Reliable: Skill content is embedded directly into the system prompt at call time
- Maintainable: Update a single skill file without touching agent logic
Installation
Quick Start
Install from PyPI:
pip install author-disambiguation
As a Dependency in Another Project
Add to your pyproject.toml:
[project]
dependencies = [
"author-disambiguation>=2.4.0",
]
Or requirements.txt:
author-disambiguation>=2.4.0
Install from GitHub (Development Version)
pip install git+https://github.com/source-data/claude-authors.git
For Development
git clone https://github.com/source-data/claude-authors.git
cd claude-authors
pip install -e .
With Optional Dependencies
# For running benchmarks
pip install "author-disambiguation[benchmarks]"
# For development and testing
pip install "author-disambiguation[dev]"
# Install everything
pip install "author-disambiguation[all]"
Configuration
Environment Variables
Create a .env file or export variables:
ANTHROPIC_API_KEY=your-anthropic-api-key
OPENALEX_API_KEY=your-email@domain.org
Note: OPENALEX_API_KEY should be your email address for OpenAlex "polite pool" access (10 req/sec).
Usage
Programmatic Usage (Python API)
Import and use the disambiguation agent in your code:
import asyncio
from src import disambiguate_author

async def main():
    # Single author disambiguation — EMBO Members (default)
    result = await disambiguate_author(
        first_name="Marie",
        last_name="Curie",
        institution="University of Paris"
    )

    if result['status'] == 'success':
        author = result['author_candidates'][0]['author']
        print(f"OpenAlex ID: {author['openalex_id']}")
        print(f"Name: {author['name']}")
        print(f"Institution: {author['institution']}")

asyncio.run(main())
EMBO Programme Groups
Use the group parameter to select a disambiguation strategy tailored to each EMBO programme. The strategy adjusts how the agent interprets publication record size, institution visibility, geographic region, and name conventions.
| Group | Programme | Career stage | Geography |
|---|---|---|---|
| `members` | EMBO Members (default) | Established, senior researchers | Primarily European |
| `yip` | Young Investigator Programme | Group leader for 1–4 years | EMBC states + Chile, India, Singapore, Taiwan |
| `gin` | Global Investigator Network | Group leader within first 6 years | Chile, India, Singapore, Taiwan, Africa |
| `ig` | Installation Grants | Very recently established (≤ 2 yrs) | Less-favoured European countries (e.g. Poland, Portugal, Türkiye) |
import asyncio
from src import disambiguate_author

async def main():
    # EMBO Young Investigator (early-career, small publication record is normal)
    result = await disambiguate_author(
        first_name="Ana",
        last_name="Costa",
        institution="University of Porto",
        group="yip"
    )

    # EMBO Global Investigator (non-European institution, name variants tried)
    result = await disambiguate_author(
        first_name="Priya",
        last_name="Sharma",
        institution="NCBS Bangalore",
        group="gin"
    )

    # EMBO Installation Grant (lab just set up in less-favoured European country)
    result = await disambiguate_author(
        first_name="Pawel",
        last_name="Nowak",
        institution="University of Warsaw",
        group="ig"
    )

asyncio.run(main())
The `group` field is also recorded in the `_metadata` of every response so downstream systems can trace which strategy was used.
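Downstream code can read that field back; a minimal sketch (the `strategy_used` helper and its `"members"` fallback are illustrative, not part of the package):

```python
def strategy_used(result: dict) -> str:
    # Hypothetical helper: return the EMBO programme group recorded in
    # _metadata, falling back to "members" (the documented default group).
    return result.get("_metadata", {}).get("group", "members")

# Trimmed-down example response
response = {"status": "success", "_metadata": {"group": "yip"}}
print(strategy_used(response))  # yip
```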
Concurrency Control (for batch processing):
# When processing many authors concurrently, use max_concurrent to prevent timeouts
# Default: 5 simultaneous SDK initializations (recommended for production)
async def process_many_authors(authors_list):
    """Process multiple authors concurrently with built-in concurrency control."""
    tasks = [
        disambiguate_author(
            first_name=author['first'],
            last_name=author['last'],
            institution=author['institution'],
            context=author.get('keywords', ''),
            max_concurrent=5  # Prevents timeout errors
        )
        for author in authors_list
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results

# Example: Process 6 authors concurrently
async def example_batch_processing():
    authors = [
        {
            "first": "Julian M.",
            "last": "Hibberd",
            "institution": "University of Cambridge",
            "keywords": "C4 photosynthesis, crop improvement"
        },
        {
            "first": "Rickard",
            "last": "Sandberg",
            "institution": "Karolinska Institutet",
            "keywords": "Single-cell genomics, RNA"
        },
        # ... more authors
    ]
    results = await process_many_authors(authors)

    # Process results
    for i, result in enumerate(results):
        if isinstance(result, Exception):
            print(f"Author {i+1}: Error - {result}")
        elif result.get('status') == 'success':
            author_id = result['author_candidates'][0]['author']['openalex_id']
            print(f"Author {i+1}: Found - {author_id}")

# For single calls or low concurrency, set max_concurrent=None for faster execution
result = await disambiguate_author(
    first_name="John",
    last_name="Doe",
    max_concurrent=None  # No concurrency limit
)
How Concurrency Control Works:
- With `max_concurrent=5`: up to 5 SDK initializations run simultaneously
- Additional calls wait until a slot becomes available
- Prevents "Control request timeout: initialize" errors
- Tested and verified: successfully processes 6+ authors without timeouts
📖 See examples/test_concurrency.py for a complete working example with 6 authors
📖 See API.md for complete API documentation and examples
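The slot-waiting behaviour described above can be sketched with a plain `asyncio.Semaphore`; `_fake_disambiguate` below is a stand-in that simulates the real SDK call, which is not invoked here:

```python
import asyncio

async def _fake_disambiguate(name: str) -> dict:
    # Stand-in for the real SDK call; a short sleep simulates network I/O.
    await asyncio.sleep(0.01)
    return {"status": "success", "name": name}

async def run_limited(names: list[str], max_concurrent: int = 5) -> list[dict]:
    # A semaphore caps how many calls run at once; extra tasks wait for a slot.
    sem = asyncio.Semaphore(max_concurrent)

    async def one(name: str) -> dict:
        async with sem:
            return await _fake_disambiguate(name)

    return await asyncio.gather(*(one(n) for n in names))

results = asyncio.run(run_limited([f"Author {i}" for i in range(6)]))
print(len(results))  # 6
```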
CLI Usage
# Basic usage (EMBO Members — default)
author-disambiguate --first-name "Marie" --last-name "Curie"
# With ORCID (most reliable)
author-disambiguate --name "Albert Einstein" --orcid "0000-0001-2345-6789"
# With institution
author-disambiguate --first-name "John" --last-name "Smith" --institution "MIT"
# With research context
author-disambiguate --name "Jane Doe" --context "machine learning, AI, neural networks"
# EMBO Young Investigator
author-disambiguate --first-name "Ana" --last-name "Costa" \
--affiliation "University of Porto" --group yip
# EMBO Global Investigator
author-disambiguate --first-name "Priya" --last-name "Sharma" \
--affiliation "NCBS Bangalore" --group gin
# EMBO Installation Grant
author-disambiguate --first-name "Pawel" --last-name "Nowak" \
--affiliation "University of Warsaw" --group ig
Works Search (NEW in v2.4.0)
Find academic papers and extract author information from works:
Python API
import asyncio
from src import search_work

async def main():
    result = await search_work(
        title="The state of OA: a large-scale analysis",
        author_last_name="Priem",
        year=2018
    )

    if result['status'] == 'success':
        work = result['work']
        print(f"Work ID: {work['openalex_id']}")
        print(f"DOI: {work['doi']}")
        print(f"Title: {work['title']}")
        print(f"Author OpenAlex ID: {work['author_openalex_id']}")
        print(f"Author ORCID: {work['author_orcid']}")
        print(f"Author found: {work['author_found_in_work']}")
        print(f"Confidence: {work['match_confidence']}")  # 'direct' or 'agent'

asyncio.run(main())
CLI Usage
# Search by title only
author-work-search --title "The state of OA"
# Search with author validation
author-work-search --title "The state of OA" --author-last "Priem"
# Search with year
author-work-search --title "The state of OA" --author-last "Priem" --year 2018
How It Works
1. Direct PyAlex Search (fast, reliable):
   - Searches OpenAlex by title
   - Strict validation: normalized title must match
   - If an author is specified: validates that the author appears in the authorships list
   - Returns work ID, DOI, title, author OpenAlex ID, and ORCID
2. Agent Fallback (when direct search fails):
   - Uses Claude with the full OpenAlex MCP toolset
   - Handles fuzzy matches, alternate titles, and subtitle variations
   - Falls back to web search for the DOI if needed
   - More flexible but slower
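As an illustration of the title normalisation used in the strict validation step, one plausible scheme (the agent's exact rules are not documented here, so this is an assumption) is to strip accents and punctuation, lowercase, and collapse whitespace:

```python
import re
import unicodedata

def normalize_title(title: str) -> str:
    # Assumed normalisation scheme, for illustration only:
    # strip accents, lowercase, drop punctuation, collapse whitespace.
    t = unicodedata.normalize("NFKD", title)
    t = "".join(c for c in t if not unicodedata.combining(c))
    t = re.sub(r"[^a-z0-9 ]+", " ", t.lower())
    return re.sub(r"\s+", " ", t).strip()

a = normalize_title("The State of OA: A Large-Scale Analysis")
b = normalize_title("the state of oa  a large scale analysis")
print(a == b)  # True
```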
Returns structured JSON:
{
  "status": "success",
  "work": {
    "openalex_id": "https://openalex.org/W2741809807",
    "doi": "https://doi.org/10.7717/peerj.4375",
    "title": "The state of OA: a large-scale analysis...",
    "author_openalex_id": "https://openalex.org/A5023888391",
    "author_orcid": "https://orcid.org/0000-0001-6187-6610",
    "author_found_in_work": true,
    "match_confidence": "direct"
  }
}
CEO / Commercial Interest Check (NEW in v2.7.0)
Given an OpenAlex author ID, this standalone agent searches for evidence that the researcher holds commercial roles (CEO, CTO, founder, board member, advisor) or has patents assigned to private companies.
The search is organised into four modular skills that the agent follows in order:
| Step | Skill | What it does |
|---|---|---|
| 1 | `ceo-author-profile` | Fetch OpenAlex profile + recent papers; flag company affiliations |
| 2 | `ceo-coi-reader` | Read COI sections from recent papers (most reliable source) |
| 3 | `ceo-web-search` | Targeted web searches for executive roles, boards, LinkedIn |
| 4 | `ceo-patent-search` | Search patent databases (optional — off by default) |
Python API — CEO check
import asyncio
from src import check_ceo_status

async def main():
    # Default: Steps 1–3 only (faster)
    result = await check_ceo_status(
        openalex_author_id="A5074091984",
        author_name="Yves-Alain Barde",  # optional but improves web searches
    )

    # With patent search enabled (Step 4 — slower)
    result = await check_ceo_status(
        openalex_author_id="A5074091984",
        author_name="Yves-Alain Barde",
        use_patents=True,
    )

    print(f"Has commercial interests: {result['possibility_of_ceo']}")
    print(f"Details: {result['comment']}")

asyncio.run(main())
CLI Usage — CEO check
# Default: Steps 1–3 (no patent search)
check-ceo-status --id "A5074091984"
# With author name (recommended — produces better web searches)
check-ceo-status --id "A5074091984" --name "Yves-Alain Barde"
# Enable patent search (Step 4, slower)
check-ceo-status --id "A5074091984" --name "Yves-Alain Barde" --patents
CEO check output
{
  "possibility_of_ceo": true,
  "comment": "COI disclosure in Smith 2023 (doi:10.1234/xyz) states 'J. Smith is board member of BioTech Inc. and holds equity.' Confirmed via web search: appointed CEO of BioTech Inc. in 2021.",
  "_metadata": {
    "openalex_author_id": "A5074091984",
    "author_name": "Yves-Alain Barde",
    "use_patents": false,
    "iterations": 12,
    "stats": { "input_tokens": 45000, "output_tokens": 320 }
  }
}
`possibility_of_ceo` values:
- `true` — commercial role or company-assigned patent found
- `false` — no evidence found after a thorough search
- `null` — search was inconclusive (access errors, very common name, etc.)
Results are saved to ceo_result_<name_or_id>.json in the working directory.
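Code consuming the saved JSON can map the tri-state flag explicitly; `interpret_ceo_flag` below is a hypothetical helper, not part of the package:

```python
def interpret_ceo_flag(result: dict) -> str:
    # Maps the tri-state possibility_of_ceo field to a human-readable label.
    # `is True` / `is False` keeps None (inconclusive) distinct from False.
    flag = result.get("possibility_of_ceo")
    if flag is True:
        return "commercial role or company-assigned patent found"
    if flag is False:
        return "no evidence found"
    return "inconclusive"

print(interpret_ceo_flag({"possibility_of_ceo": None}))  # inconclusive
```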
PubPeer Search
Direct HTTP function (no agent) to check whether OpenAlex work(s) have been commented on in PubPeer. Optionally, an AI agent can be added to classify and summarise each discussion.
PubPeer does not expose a public API. This function uses PubPeer's
internal search endpoint (/api/search/?q={doi}) discovered in the
Vue.js frontend. A session is initialised once per process (fetching
https://pubpeer.com/ to obtain a CSRF token), then reused for all
subsequent DOI checks within that run. No login is required.
Important: `in_pubpeer: true` means the paper has at least one comment (`comments_total > 0`). PubPeer indexes virtually all papers, so an indexed-but-silent paper is reported as `false`.
Accepts either:
- OpenAlex Work ID (`W*`): check a single paper — returns a boolean result
- OpenAlex Author ID (`A*`): check all papers of an author — returns a list with per-paper PubPeer status
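The prefix dispatch can be sketched as follows (`id_kind` is an illustrative helper, not the package's internal implementation):

```python
def id_kind(openalex_id: str) -> str:
    # Branch on the OpenAlex ID prefix, mirroring the behaviour described above.
    if openalex_id.startswith("W"):
        return "work"    # single-paper check, boolean result
    if openalex_id.startswith("A"):
        return "author"  # per-paper status for all of the author's works
    raise ValueError(f"Expected a W* or A* OpenAlex ID, got {openalex_id!r}")

print(id_kind("W1234567890"))  # work
print(id_kind("A5074091984"))  # author
```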
Python API — PubPeer (presence check only)
from src.pubpeer_search import search_in_pubpeer
# Check a single work
result = search_in_pubpeer("W1234567890")
print(result["in_pubpeer"]) # True / False / None
print(result["comment_count"]) # int (0 if no comments)
print(result["pubpeer_url"]) # direct pub URL if commented, else None
# Check all works of an author (rate-limited, can take a while)
result = search_in_pubpeer("A5074091984")
print(result["total_works_checked"]) # int
print(result["works_in_pubpeer"]) # int (papers with ≥1 comment)
for work in result["works"]:
    if work["in_pubpeer"]:
        print(work["title"], work["pubpeer_url"])
# Limit to the 20 most recent works
result = search_in_pubpeer("A5074091984", max_works=20)
Python API — PubPeer with AI analysis
import asyncio
from src.pubpeer_search import search_and_analyze_pubpeer
# Check + classify each discussion with a short AI summary
result = asyncio.run(search_and_analyze_pubpeer("A5074091984", max_works=20))
for work in result["works"]:
    if work["in_pubpeer"]:
        analysis = work.get("comment_analysis", {})
        print(work["title"])
        print(f"  Type   : {analysis.get('type')}")
        print(f"  Summary: {analysis.get('summary')}")
`comment_analysis` types:
- `data_integrity` — image manipulation, data fabrication, band splicing, plagiarism
- `data_presentation` — statistics, unclear figures, misleading visuals (not outright fraud)
- `novelty` — comments questioning whether the finding is truly new
- `other` — authorship disputes, errata, general discussion
CLI Usage — PubPeer
# Check a single paper
search-pubpeer --id "W1234567890"
# Check all works of an author
search-pubpeer --id "A5074091984"
# Limit to 30 most recent works
search-pubpeer --id "A5074091984" --max-works 30
# Add AI comment analysis (classifies and summarises each PubPeer discussion)
search-pubpeer --id "A5074091984" --max-works 10 --analyze
PubPeer output — Work ID (no analysis)
{
  "openalex_id": "W1234567890",
  "doi": "https://doi.org/10.1234/example",
  "title": "Example paper title",
  "year": 2022,
  "in_pubpeer": true,
  "comment_count": 3,
  "pubpeer_url": "https://pubpeer.com/publications/ABC123#0",
  "pubpeer_search_url": "https://pubpeer.com/search?q=10.1234/example"
}
PubPeer output — Work ID (with --analyze)
{
  "openalex_id": "W1234567890",
  "doi": "https://doi.org/10.1234/example",
  "title": "Example paper title",
  "year": 2022,
  "in_pubpeer": true,
  "comment_count": 3,
  "pubpeer_url": "https://pubpeer.com/publications/ABC123#0",
  "comment_analysis": {
    "type": "data_integrity",
    "summary": "Two comments raise concerns about Western blot band splicing in Figures 2 and 4."
  }
}
PubPeer output — Author ID
{
  "author_id": "A5074091984",
  "author_name": "Jane Smith",
  "total_works_checked": 45,
  "works_in_pubpeer": 2,
  "works": [
    {
      "openalex_id": "W1234567890",
      "doi": "https://doi.org/10.1234/example",
      "title": "Example paper title",
      "year": 2022,
      "in_pubpeer": true,
      "comment_count": 3,
      "pubpeer_url": "https://pubpeer.com/publications/ABC123#0",
      "comment_analysis": {
        "type": "data_integrity",
        "summary": "Two comments raise concerns about Western blot band splicing in Figures 2 and 4."
      }
    },
    {
      "openalex_id": "W9876543210",
      "doi": "https://doi.org/10.5678/other",
      "title": "Another paper",
      "year": 2020,
      "in_pubpeer": false,
      "comment_count": 0,
      "pubpeer_url": null
    }
  ]
}
`in_pubpeer` values:
- `true` — the paper has at least one comment on PubPeer (`pubpeer_url` points to the publication page)
- `false` — paper is indexed in PubPeer but has no comments, or is not yet indexed
- `null` — could not determine (no DOI available or network error)
Results are also accessible to agents via the `mcp__openalex_mcp__search_pubpeer` MCP tool, and are saved to `pubpeer_<id>.json` in the working directory.
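A consumer of the author-level JSON might keep only the commented papers; `flagged_works` is a hypothetical helper, not part of the package:

```python
def flagged_works(author_result: dict) -> list[dict]:
    # Keep only works with at least one PubPeer comment, following the
    # author-level JSON shape shown above.
    return [w for w in author_result.get("works", []) if w.get("in_pubpeer")]

example = {
    "works_in_pubpeer": 1,
    "works": [
        {"openalex_id": "W1", "in_pubpeer": True, "comment_count": 3},
        {"openalex_id": "W2", "in_pubpeer": False, "comment_count": 0},
    ],
}
print([w["openalex_id"] for w in flagged_works(example)])  # ['W1']
```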
Benchmark Data Preparation
To generate benchmark data from the EMBO candidates Excel file, use the included script to add OpenAlex author IDs:
# Install additional dependencies (if not already installed)
pip install -r requirements.txt
# Run the script to add author IDs
python scripts/add_author_ids.py
This script will:
- Read `data/embo_membership_candidates_with_work_ids.xlsx`
- Query the OpenAlex API for each work ID to extract author information
- Match the specific EMBO candidate author by name from the work's author list
- Add columns for the matched author: OpenAlex ID, name, ORCID, institutions, position, corresponding-author status, and match confidence score
- Save results to `data/embo_membership_candidates_with_author_ids.xlsx`
The script includes:
- Fuzzy name matching with confidence scores
- Progress bars and caching to avoid duplicate API calls
- Error handling for missing or invalid work IDs
- Match quality metrics (89.9% success rate, 11,332 high-confidence matches)
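A minimal sketch of fuzzy name matching with a confidence score, assuming a `difflib` similarity ratio (the script's actual scoring may differ):

```python
from difflib import SequenceMatcher

def name_match_score(candidate: str, target: str) -> float:
    # Illustrative scoring only: a case-insensitive similarity ratio in [0, 1]
    # after collapsing whitespace. Not the benchmark script's exact algorithm.
    a = " ".join(candidate.lower().split())
    b = " ".join(target.lower().split())
    return SequenceMatcher(None, a, b).ratio()

print(name_match_score("Jane  Smith", "jane smith"))  # 1.0
```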
Benchmark Evaluation
A unified benchmark script allows you to evaluate the agent on any dataset by mapping your column names:
# Basic benchmark with required fields
python scripts/benchmark.py \
--input data/my_authors.xlsx \
--first_name "FirstName" \
--last_name "LastName" \
--ground_truth "openalex_id" \
-n 100
# With affiliation and context (keywords)
python scripts/benchmark.py \
--input data/my_authors.xlsx \
--first_name "FirstName,GivenName" \
--last_name "LastName,FamilyName" \
--affiliation "Institution,Affiliation" \
--context "Keywords,ResearchArea" \
--ground_truth "openalex_id,author_id" \
-n 100
Features:
- Flexible Column Mapping: Specify multiple column-name alternatives (first match wins)
- Context Combination: Combine multiple columns for richer context (e.g. keywords + research areas)
- Automatic Filtering: Rows without ground truth are excluded automatically
- Top-K Accuracy: Reports Top-1, Top-2, and Top-3 accuracy
- Detailed Reports: JSON output with all results saved to `output/benchmark_TIMESTAMP.json`
Column Specification:
- Use comma-separated alternatives: `--first_name "first_name,FirstName,given_name"`
- The first matching column in your Excel file is used
- Missing columns are handled gracefully
Results:
- 100% accuracy with keywords + affiliation context (validated on 100 EMBO members)
- Keywords alone provide sufficient signal for near-perfect disambiguation
- Automatic filtering ensures only authors with complete data are benchmarked
- Concurrency-controlled execution prevents timeout errors (5 simultaneous tests)
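Top-K accuracy as reported here can be computed in a few lines; this sketch assumes ranked candidate ID lists per row and is not the benchmark script's actual code:

```python
def top_k_accuracy(ranked_ids: list[list[str]], truth: list[str], k: int) -> float:
    # Fraction of rows whose ground-truth ID appears among the top-k candidates.
    hits = sum(gt in preds[:k] for preds, gt in zip(ranked_ids, truth))
    return hits / len(truth)

preds = [["A1", "A2"], ["A9", "A3"], ["A4"]]
truth = ["A1", "A3", "A7"]
print(top_k_accuracy(preds, truth, k=1))  # Top-1 accuracy
print(top_k_accuracy(preds, truth, k=2))  # Top-2 accuracy
```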
Basic Usage
# Using first and last name
python src/production_agent.py --first-name "Jerry" --last-name "Adams"
# Using full name
python src/production_agent.py --name "Jerry M. Adams"
With Institution
python src/production_agent.py --first-name "Jerry" --last-name "Adams" --institution "WEHI"
# Or with multiple affiliations
python src/production_agent.py --name "John Smith" --affiliation "Harvard" --affiliation "MIT"
With ORCID
python src/production_agent.py --first-name "Konrad" --last-name "Beyreuther" --orcid "0000-0002-3317-3069"
With Email Search
python src/production_agent.py --first-name "Yves" --last-name "Barde" --orcid "0000-0002-7627-461X" --find-email
Email Search Strategy:
- Priority 1: Institutional directories (most reliable)
- Priority 2: Personal/Lab websites
- Priority 3: ORCID profiles
- Priority 4: Google Scholar
- Priority 5: ResearchGate/LinkedIn
- Priority 6 (Last Resort): Extract from most recent research article as last author
Critical Requirements for Email Search:
- Uses author's LAST (most recent) affiliation from OpenAlex
- For publication fallback: Must be LAST AUTHOR in most recent RESEARCH ARTICLE
- Not reviews, editorials, or other publication types
With Context (Publications, Keywords, Topics)
# Provide research context (publications, keywords, research topics)
python src/production_agent.py --name "Researcher Name" --context "Known publications: Title 1, Title 2"
# Or with affiliation and context
python src/production_agent.py --first-name "John" --last-name "Smith" --affiliation "MIT" --context "machine learning, neural networks"
Output Format
The agent returns structured JSON with enforced schema validation. All responses use a unified schema regardless of status.
Key Features
- Unified structure: Same schema for all status types (success, ambiguous, not_found, error)
- Always uses an `author_candidates` array: even success cases return a single-element array
- No confidence scores: evidence and concerns provide better assessment than arbitrary confidence labels
- Ranked results: Candidates ordered by evidence strength (rank 1 = strongest match)
Disambiguation Output Schema
{
  // Required fields (always present)
  "status": "success" | "ambiguous" | "not_found" | "error",
  "author_candidates": [{
    "rank": number,                // 1 = strongest match
    "author": {
      "openalex_id": string,
      "openalex_url": string,
      "name": string,
      "orcid": string | null,
      "institution": string | null,
      "works_count": number,
      "cited_by_count": number
    },
    "evidence": string[],          // Supporting evidence for this match
    "concerns": string[]           // Red flags or uncertainties (optional)
  }],                              // Empty array for not_found/error
  "search_summary": {
    "embo_found": boolean,
    "orcid_source": string,
    "candidates_evaluated": number,
    "disambiguation_needed": boolean
  },
  "comments": string,              // Detailed process reasoning

  // Optional fields (context-dependent)
  "message"?: string,              // For error/not_found/ambiguous cases
  "error"?: string,                // Error message if status is "error"
  "possible_reasons"?: string[],   // For not_found cases
  "recommendation"?: string,       // Suggested next steps

  // Metadata (added by agent)
  "_metadata": {
    "iterations": number,
    "stats": {
      "input_tokens": number,
      "output_tokens": number,
      "web_searches": number,
      "openalex_calls": number
    },
    "researcher_name": string,
    "group": "members" | "yip" | "gin" | "ig"  // EMBO programme group used
  }
}
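A minimal consumer-side check for the always-present fields might look like this (`has_required_fields` is a hypothetical helper, not part of the package):

```python
# Only the fields documented as always present; optional fields
# (message, error, possible_reasons, recommendation) are not required.
REQUIRED_FIELDS = {"status", "author_candidates", "search_summary", "comments"}

def has_required_fields(result: dict) -> bool:
    # Iterating a dict yields its keys, so issubset checks key presence.
    return REQUIRED_FIELDS.issubset(result)

ok = {"status": "not_found", "author_candidates": [],
      "search_summary": {}, "comments": "no match"}
print(has_required_fields(ok))  # True
print(has_required_fields({}))  # False
```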
Success Case Example
Single unambiguous match (notice author_candidates is an array with one element):
{
  "status": "success",
  "author_candidates": [
    {
      "rank": 1,
      "author": {
        "openalex_id": "A5074091984",
        "openalex_url": "https://openalex.org/A5074091984",
        "name": "Yves-Alain Barde",
        "orcid": "0000-0002-7627-461X",
        "institution": "Cardiff University",
        "works_count": 177,
        "cited_by_count": 36058
      },
      "evidence": [
        "ORCID exact match (0000-0002-7627-461X)",
        "Recent publications on brain-derived neurotrophic factor",
        "Current affiliation: Cardiff University (2015-2025)",
        "Previous affiliations: Max Planck Society, University of Basel",
        "High impact researcher: h-index 84, 36,058 citations"
      ],
      "concerns": []
    }
  ],
  "search_summary": {
    "embo_found": false,
    "orcid_source": "User provided",
    "candidates_evaluated": 1,
    "disambiguation_needed": false
  },
  "comments": "Direct ORCID search returned single unambiguous result.",
  "_metadata": {
    "iterations": 3,
    "stats": {
      "input_tokens": 40957,
      "output_tokens": 752,
      "web_searches": 0,
      "openalex_calls": 2
    },
    "researcher_name": "Yves Barde",
    "group": "members"
  }
}
Ambiguous Case Example
Multiple candidates ranked by evidence strength:
{
  "status": "ambiguous",
  "author_candidates": [
    {
      "rank": 1,
      "author": {
        "openalex_id": "A123456",
        "openalex_url": "https://openalex.org/A123456",
        "name": "John Smith",
        "orcid": null,
        "institution": "MIT",
        "works_count": 50,
        "cited_by_count": 1000
      },
      "evidence": [
        "Institution match (MIT)",
        "1 publication match",
        "Research domain alignment (life sciences)"
      ],
      "concerns": [
        "Timeline slightly inconsistent",
        "No ORCID available"
      ]
    },
    {
      "rank": 2,
      "author": {
        "openalex_id": "A789012",
        "openalex_url": "https://openalex.org/A789012",
        "name": "J. Smith",
        "orcid": null,
        "institution": "Stanford",
        "works_count": 30,
        "cited_by_count": 500
      },
      "evidence": [
        "Name match",
        "Field proximity (biology)"
      ],
      "concerns": [
        "Institution mismatch (Stanford vs MIT)",
        "No publication matches"
      ]
    }
  ],
  "search_summary": {
    "embo_found": false,
    "orcid_source": "Unknown",
    "candidates_evaluated": 5,
    "disambiguation_needed": true
  },
  "message": "Multiple plausible candidates found. Ranked by evidence strength.",
  "recommendation": "Provide known publication titles or ORCID for disambiguation",
  "comments": "Multiple candidates with similar names in related fields",
  "_metadata": {
    "iterations": 8,
    "stats": {
      "input_tokens": 52000,
      "output_tokens": 950,
      "web_searches": 2,
      "openalex_calls": 6
    },
    "researcher_name": "John Smith"
  }
}
Not Found Case Example
{
  "status": "not_found",
  "author_candidates": [],
  "search_summary": {
    "embo_found": false,
    "orcid_source": "Unknown",
    "candidates_evaluated": 0,
    "disambiguation_needed": false
  },
  "message": "No matching author profile found in OpenAlex",
  "possible_reasons": [
    "Researcher not yet indexed in OpenAlex",
    "Name variation not captured",
    "Very early career (no publications)"
  ],
  "recommendation": "Verify researcher name spelling and try with publication titles",
  "comments": "Exhaustive search across all sources returned no matches...",
  "_metadata": {
    "iterations": 12,
    "stats": {
      "input_tokens": 68000,
      "output_tokens": 450,
      "web_searches": 4,
      "openalex_calls": 3
    },
    "researcher_name": "Unknown Researcher"
  }
}
Workflow
- Step 0: Quick OpenAlex check with available info (ORCID, name+institution, or name only)
- Step 1: EMBO member directory search (if found, use curated ORCID → skip to Step 5)
- Step 2: General web search for institutional profiles and ORCID
- Step 3: PubMed search for publications and affiliations
- Step 4: Build comprehensive researcher profile
- Step 5: OpenAlex verification with MCP tools
- Step 6: Return JSON-only results with ranked candidates, evidence, and concerns
MCP Server Tools
All search tools now include domain awareness and return the primary research domain for each author.
search_authors_by_name(name, per_page=200, preferred_domain=None)
Basic name search returning up to 200 results. Optional preferred_domain parameter ranks results by domain relevance.
- Domains: `"life_sciences"`, `"health_sciences"`, `"physical_sciences"`, `"social_sciences"`
- Returns: author profiles with a `primary_domain` field and `top_concepts`
search_authors_by_orcid(orcid)
Most reliable search method when ORCID is available. Returns single author profile with domain information.
search_authors_by_name_and_institution(name, institution_name, per_page=200, preferred_domain=None)
Two-step filtered search: finds institution ID first, then searches authors affiliated with that institution. Optional preferred_domain parameter ranks results by domain relevance.
get_author_details(openalex_author_id)
Complete author profile including affiliations, research topics, h-index, publication counts by year.
get_author_recent_works(openalex_author_id, per_page=10)
Recent publications for identity verification, including journal, DOI, citations, and author affiliations.
Domain Classification
The MCP server automatically determines the primary research domain for each author based on their publication concepts:
- Life Sciences: Biology, genetics, molecular biology, neuroscience, microbiology, ecology, immunology
- Health Sciences: Medicine, clinical research, pharmacology, epidemiology, public health, oncology
- Physical Sciences: Physics, chemistry, astronomy, materials science, quantum mechanics
- Social Sciences: Economics, sociology, psychology, political science, education, linguistics
Domain ranking helps disambiguate authors with common names by prioritizing candidates in the expected research field.
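A sketch of this kind of concept-to-domain mapping; the keyword sets below are illustrative and far smaller than the MCP server's real concept lists:

```python
# Illustrative keyword sets only; the server's actual concept lists are broader.
DOMAIN_KEYWORDS = {
    "life_sciences": {"biology", "genetics", "neuroscience", "immunology"},
    "health_sciences": {"medicine", "pharmacology", "epidemiology", "oncology"},
    "physical_sciences": {"physics", "chemistry", "astronomy"},
    "social_sciences": {"economics", "sociology", "psychology"},
}

def classify_domain(concepts):
    # Pick the domain whose keyword set overlaps the author's concepts most;
    # return None when nothing matches.
    scores = {
        domain: sum(c.lower() in words for c in concepts)
        for domain, words in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(classify_domain(["Genetics", "Neuroscience", "Chemistry"]))  # life_sciences
```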
Tests
Comprehensive test suite covering OpenAlex tools, schema validation, and the MCP server (see the per-file counts under Project Structure).
Running Tests
# Run all unit tests (excluding integration tests - fast, no API calls needed)
pytest tests/ -m "not integration"
# Run all tests including integration tests (requires OPENALEX_API_KEY)
pytest tests/
# Run with verbose output
pytest tests/ -v
# Run specific test file
pytest tests/test_openalex_tools.py -v
Project Structure
src/
├── production_agent.py # Main async agent (MCP-enabled)
├── openalex_mcp/ # MCP Server Module
│ ├── __init__.py
│ └── openalex_server.py # FastMCP server (can run standalone)
├── schemas/
│ ├── __init__.py
│ └── disambiguation_result.py # Unified output schema + validation
├── prompts/
│ ├── __init__.py
│ ├── direct_system_prompt.py # Base prompt (EMBO Members)
│ ├── group_system_prompts.py # Group-specific prompts (yip, gin, ig) + get_system_prompt()
│ └── simplified_system_prompt_with_skills.py
tests/
├── __init__.py
├── test_openalex_tools.py # OpenAlex tools unit tests (22 tests)
├── test_disambiguation_schema.py # Schema validation tests (23 tests)
└── test_mcp_server.py # MCP server tests (12 tests)
scripts/
├── benchmark.py # Unified benchmark evaluation (flexible CLI)
├── merge_keywords_to_candidates.py # Enrich candidates file with keywords
└── README.md # Scripts documentation (if exists)
examples/
└── test_concurrency.py # Example: Concurrent batch processing (6 authors)
data/ # Ground truth data (Excel files, .gitignored)
├── embo_membership_candidates_with_author_ids.xlsx # Main benchmark dataset
└── emboplanet_allmem_cleaned_2026-01-23.xlsx # EMBO members (for merging)
output/ # Benchmark results (JSON files, .gitignored)
MCP Architecture Details
This project implements the Embedded MCP Pattern, recommended by Anthropic for connecting agents to data sources like APIs and databases.
Why MCP?
- Clean Separation: The MCP module (`src/openalex_mcp/`) separates data access from agent logic
- Testable: MCP tools can be tested independently of the agent
- Reusable: Tools can be extracted to a standalone MCP server if needed
- Maintainable: Clear boundaries between concerns
- Performance: Direct async function calls (no stdio overhead)
Architecture Layers
Layer 1: Production Agent (src/production_agent.py)
↓ imports and calls
Layer 2: MCP Core Tools (src/openalex_mcp/core_tools.py)
↓ wraps with async
Layer 3: OpenAlex Tools (src/tools/openalex_tools.py)
↓ uses
Layer 4: PyAlex Library (pip package)
↓ calls
Layer 5: OpenAlex REST API
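A minimal sketch of the "wraps with async" step between Layers 2 and 3, assuming a hypothetical synchronous tool. The real project uses FastMCP; the wrapping idea is the same:

```python
# Illustrative sketch of the Layer 2 -> Layer 3 async wrapping.
# fetch_author_sync is a hypothetical stand-in for a blocking Layer 3 tool.
import asyncio

def fetch_author_sync(author_id: str) -> dict:
    # Layer 3: a blocking call (e.g. PyAlex hitting the OpenAlex REST API).
    return {"id": author_id, "display_name": "Example Author"}

async def fetch_author(author_id: str) -> dict:
    # Layer 2: expose the blocking tool as an awaitable (no stdio overhead);
    # to_thread keeps the event loop free while the HTTP call runs.
    return await asyncio.to_thread(fetch_author_sync, author_id)

async def main() -> None:
    author = await fetch_author("A123")
    print(author["display_name"])  # Example Author

asyncio.run(main())
```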
Changelog
Version 2.9.2 (2026-02-18) - CEO agent guardrails
- Hard-blocked patent DB access when `use_patents=False`: `patents.google.com`, `worldwide.espacenet.com`, and `patft.uspto.gov` are now listed as FORBIDDEN URLs in the system prompt, preventing the agent from visiting them during Step 3 when the patent step is disabled.
- Step 3 skill update: the `ceo-web-search` skill now explicitly states "if web results mention patents, note them but do NOT visit patent databases".
- Stale affiliation detection: the `ceo-author-profile` skill now instructs the agent to check the publication dates associated with a company affiliation; if the most recent paper with that affiliation is 2+ years old, it is labelled a former affiliation rather than treated as active.

Version 2.9.1 (2026-02-18) - `--id` optional; name-only CEO checks
- `check_ceo_status()` no longer requires `openalex_author_id`; pass `author_name` alone and the agent will resolve the OpenAlex ID via `search_authors_by_name` / `search_authors_by_orcid`.
- CLI: `check-ceo-status --name "Researcher Name"` now works without `--id`.
- `_metadata.openalex_author_id` reports `"(resolved by agent)"` when the ID was not supplied.

Version 2.9.0 (2026-02-18) - CEO Skills + PubPeer AI Analysis
Two major feature additions to the CEO check agent and PubPeer search.

CEO agent - skill-based architecture:
- Search strategy decomposed into 5 independent skill files under `src/.claude/skills/`: `ceo-author-profile` (Step 1), `ceo-paper-fetcher` (helper), `ceo-coi-reader` (Step 2), `ceo-web-search` (Step 3), `ceo-patent-search` (Step 4)
- New `use_patents: bool = False` parameter in `check_ceo_status()`: patent search is off by default because it is slow; pass `use_patents=True` to enable it
- New `--patents` CLI flag: `check-ceo-status --id "..." --name "..." --patents`
- Skills are embedded directly into the system prompt at call time for reliability
- `_metadata` now records `use_patents` in the result

PubPeer - AI comment analysis:
- New `analyze_pubpeer_comments(pubpeer_url, paper_title)` async function: a lightweight Claude agent (WebFetch only) that fetches a PubPeer page and returns a structured classification
- New `search_and_analyze_pubpeer(openalex_id, ...)` async function: wraps `search_in_pubpeer` and adds `comment_analysis` to every paper that has PubPeer mentions
- `comment_analysis` schema: `{"type": "data_integrity"|"data_presentation"|"novelty"|"other", "summary": "..."}`
- New `--analyze` CLI flag: `search-pubpeer --id "..." --analyze`
- `search_and_analyze_pubpeer` exported from `src/__init__.py`

Version 2.8.0 (2026-02-18) - PubPeer Search
Adds a direct HTTP function to check whether OpenAlex works have PubPeer comments; no agent or LLM call required.

New features:
- New `src/pubpeer_search.py` with `search_in_pubpeer(openalex_id, max_works, delay)`. Accepts a Work ID (W*) or Author ID (A*).
- New `search_pubpeer` MCP tool registered in the OpenAlex MCP server so agents can call it directly via `mcp__openalex_mcp__search_pubpeer`.
- New `search-pubpeer` CLI entry point.
- Exported from `src/__init__.py`.
- `requests>=2.31.0` added as a package dependency.

Implementation notes:
- PubPeer has no public API. Uses the internal Vue.js endpoint `/api/search/?q={doi}&token={csrf}` with a session-scoped CSRF token obtained from `window.App.csrfToken` on the home page.
- The session is initialised once per process and reused for all DOI checks.
- `in_pubpeer: true` means `comments_total > 0`; papers indexed without comments are reported as `false`.
- `delay` between requests defaults to 0.5 s; configurable via `--delay`.
- Output per work: `in_pubpeer` (bool | null), `comment_count` (int), `pubpeer_url` (direct publication URL | null), `pubpeer_search_url` (str | null).
Version 2.7.0 (2026-02-18) - CEO / Commercial Interest Check
Adds a standalone `check_ceo_status()` agent that determines whether a researcher holds commercial positions or patents, given their OpenAlex author ID.

New features:
- New `src/ceo_agent.py` with the `check_ceo_status(openalex_author_id, author_name)` function
- New `check-ceo-status` CLI entry point
- Exported from `src/__init__.py`
- Sources: OpenAlex author profile, paper COI sections (via WebFetch on the DOI), web search (company roles, board memberships, LinkedIn), patent databases (Google Patents, EPO)
- Output: `{"possibility_of_ceo": true|false|null, "comment": "...", "_metadata": {...}}`
  - `true`: evidence of a company role or company-assigned patent found
  - `false`: no evidence found after a thorough search
  - `null`: inconclusive (access errors, ambiguous name, etc.)
Version 2.6.0 (2026-02-18) - Multi-group EMBO Programme Support
Group-aware disambiguation strategy for all four EMBO programmes.

New features:
- Added `group` parameter to `disambiguate_author()` (default: `"members"`; fully backwards-compatible)
- Four groups, each with a tailored system prompt reflecting career stage, geography, and publication expectations:
  - `members`: Established EMBO Members; unchanged original strategy
  - `yip`: Young Investigator Programme; early-career group leaders (1–4 years as PI); small publication records are normal; checks recent institution changes
  - `gin`: Global Investigator Network; researchers in Chile, India, Singapore, Taiwan, Africa; handles name romanisation variants and sparse OpenAlex coverage
  - `ig`: Installation Grants; very recently established labs in less-favoured European countries; tries both current and previous postdoc institutions; accepts publication gaps
- New `src/prompts/group_system_prompts.py` module with a `get_system_prompt(group)` selector
- CLI gains a `--group {members,yip,gin,ig}` option with validation and extended help text
- `group` recorded in `_metadata` of every response
Backward compatibility:
- Default `group="members"` preserves all existing behaviour exactly
- The `members` prompt is the original `DIRECT_SYSTEM_PROMPT`, not a copy, so behaviour is guaranteed identical
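A minimal sketch of what a `get_system_prompt(group)` selector might look like. The prompt texts and error handling below are placeholders, not the module's real contents:

```python
# Illustrative sketch of a group -> system prompt selector.
# Prompt strings are placeholders, not the real prompts in
# src/prompts/group_system_prompts.py.
GROUP_PROMPTS = {
    "members": "DIRECT_SYSTEM_PROMPT (original EMBO Members strategy)",
    "yip": "Early-career prompt: small records are normal, check recent moves",
    "gin": "Global prompt: romanisation variants, sparse OpenAlex coverage",
    "ig": "Installation Grants prompt: new labs, publication gaps accepted",
}

def get_system_prompt(group: str = "members") -> str:
    try:
        return GROUP_PROMPTS[group]
    except KeyError:
        raise ValueError(
            f"unknown group {group!r}; expected one of {sorted(GROUP_PROMPTS)}"
        )

print(get_system_prompt("yip"))
```

A dict dispatch like this keeps the default path (`"members"`) untouched while rejecting unknown group names early.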
Version 2.5.0 (2026-01-27) - Built-in Concurrency Control
Production-Ready Concurrency Management
Key Features:
- Added `max_concurrent` parameter to the `disambiguate_author()` function
- Built-in concurrency control prevents SDK initialization timeouts
- Default: 5 simultaneous SDK initializations (optimal for production)
- Can be disabled (`max_concurrent=None`) for single calls or low concurrency
- The benchmark script now uses the agent's built-in control instead of a separate semaphore
Benefits:
- Prevents "Control request timeout: initialize" errors when processing many authors
- Production-ready for batch processing scenarios
- Consistent behavior between benchmark and production code
- Easy to configure based on use case
Usage:
# Batch processing (recommended)
result = await disambiguate_author(..., max_concurrent=5)
# Single call (faster, no limit)
result = await disambiguate_author(..., max_concurrent=None)
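A minimal sketch of how a semaphore caps concurrent SDK initializations, in the spirit of `max_concurrent`. The function bodies are hypothetical stand-ins for the real agent calls:

```python
# Illustrative sketch of built-in concurrency control: a semaphore caps how many
# disambiguation calls run at once. The bodies are hypothetical stand-ins.
import asyncio

async def disambiguate_one(name: str, sem: asyncio.Semaphore) -> str:
    async with sem:                # at most max_concurrent tasks enter here
        await asyncio.sleep(0.01)  # stands in for the real SDK call
        return f"resolved:{name}"

async def disambiguate_batch(names: list[str], max_concurrent: int = 5) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(*(disambiguate_one(n, sem) for n in names))

results = asyncio.run(disambiguate_batch(["A", "B", "C"]))
print(results)  # ['resolved:A', 'resolved:B', 'resolved:C']
```

Because the limit lives inside the batch function, callers get the same protection whether they run the benchmark or production code.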
Version 2.4.3 (2026-01-27) - 100% Accuracy & Concurrency Control
Performance Breakthrough: Achieved 100% accuracy with keywords + affiliation context
Key Improvements:
- 100% accuracy on 100 EMBO members (keywords + affiliation context)
- Fixed timeout errors by implementing concurrency control (5 simultaneous tests)
- Added context column filtering: only benchmark authors with complete data
- Converted keywords from slash to comma-separated format for better AI parsing
- Enriched candidates dataset by merging keywords/affiliations from EMBO members file
- Added `--seed` parameter for reproducible random sampling
Technical Fixes:
- Implemented `asyncio.Semaphore` to prevent SDK initialization timeouts
- Automatic filtering of rows without required context data
- Smart defaults for benchmark CLI (all columns pre-configured)
- Fixed 104/100 timeout errors that caused 48% false error rate
Version 2.4.0 (2026-01-23) - Works Search Agent & Enhanced MCP
New Feature: Works Search Agent
- Added `search_work()` function for finding academic papers
- Hybrid approach: direct PyAlex search with AI agent fallback
- Strict validation: title matching and author verification
- Returns work ID, DOI, title, author OpenAlex ID and ORCID
- CLI entry point: `author-work-search`
- 28/28 passing unit tests
Enhanced OpenAlex MCP Server:
- Added 4 new works search tools:
  - `search_works_by_title`: Find papers by title
  - `search_works_by_doi`: Get a work by DOI (most reliable)
  - `search_works_by_title_and_author`: Combined title + author search
  - `get_work_details`: Complete work information with full authorships
- Total: 9 MCP tools (5 author + 4 works)
New Claude Skill:
- `works-search-strategy`: Search techniques, validation rules, and quality assurance
Schema & Testing:
- Added `WORKS_SEARCH_SCHEMA` for structured work outputs
- Comprehensive test suite in `tests/test_works_search_agent.py`
- Tests for normalization, matching, author finding, and schema validation
Documentation:
- Updated README with works search examples
- Added Python API and CLI usage for works search
- Updated all tool counts and feature lists
Version 2.3.0 (2026-01-23) - Production-Ready Package
Major Release: Converted to pip-installable Python package
Package Structure:
- Added `pyproject.toml` for modern Python packaging
- Created `LICENSE` (MIT), `MANIFEST.in`, and package metadata
- Updated `src/__init__.py` to expose the `disambiguate_author` function
- Package is now installable via pip from GitHub
- Can be used as a dependency in external projects
Documentation:
- Streamlined documentation in README.md
- All essential information now in main README
CLI Entry Points:
- `author-disambiguate`: Main CLI for author disambiguation
- `author-work-search`: Search for academic works by title/author
- `benchmark.py`: Flexible benchmark script with CLI arguments
Scripts Organization:
- Moved all scripts to the `scripts/` folder with a dedicated README
- Added context-level benchmarks (`run_context_benchmarks.py`)
- Added EMBO members processing (`get_all_embo_members_openalex_ids.py`)
- Added a data cleaning utility (`clean_embo_members.py`)
Extensibility:
- Designed for future additional agents (e.g., works retrieval)
- Clean API for programmatic usage from external modules
- Production-ready for integration into other projects
Version 2.2.0 (2026-01-08) - Benchmark Infrastructure
Added comprehensive benchmark evaluation framework:
- `add_author_ids.py`: Script to extract OpenAlex author IDs from work IDs with fuzzy name matching
- `run_benchmark.py`: Automated benchmark evaluation with Top-1 through Top-5 accuracy metrics
- `BENCHMARK_GUIDE.md`: Complete benchmark documentation
- Ground truth data: 11,332+ high-confidence author matches from EMBO candidates
- Backward-compatible schema handling for evaluation
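The fuzzy name matching mentioned for `add_author_ids.py` could be approximated with the stdlib `difflib` module. The 0.85 threshold below is an assumption, not necessarily the script's actual setting:

```python
# Illustrative sketch of fuzzy author-name matching, as used when extracting
# author IDs from work IDs. The threshold is an assumption.
from difflib import SequenceMatcher

def names_match(candidate: str, ground_truth: str, threshold: float = 0.85) -> bool:
    ratio = SequenceMatcher(None, candidate.lower(), ground_truth.lower()).ratio()
    return ratio >= threshold

print(names_match("Maria T. Silva", "Maria T Silva"))  # True
print(names_match("Maria Silva", "John Smith"))        # False
```

A similarity ratio tolerates punctuation and initial variants while still rejecting unrelated names, which is what high-confidence ground truth matching needs.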
Version 2.1.0 (2025-11-27) - MCP Architecture
Removed the standalone tool layer and fully integrated the tools into the MCP module.
Version 2.0.0 (2025-11-27) - MCP Architecture
Major architectural refactoring to implement the embedded MCP pattern:
- Embedded MCP architecture using FastMCP
- Added Claude Skills for OpenAlex expertise and for candidate analysis and evaluation
Benefits:
- Clean separation between agent logic and data access
- Testable tool layer with independent test suite
- Can extract to standalone MCP server if needed
- Better performance with async execution
- Follows Anthropic's MCP best practices
Version 1.0.0 (2025-11-26) - Direct Tools Integration
Initial production release with direct tool integration.
Known Issues
- System Prompt Size: Large prompt (includes full OpenAlex API guide) causes high token usage
- ~40K input tokens per request
- Consider extracting guide to separate documentation or using prompt caching
- Domain Classification Accuracy: Occasional misclassification of research domains
- Some life sciences researchers classified as social sciences
- Publication evidence usually clarifies the correct domain