Chinese Name Detection and Normalization Module

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

sergeyf

These details have not been verified by PyPI

Project description

Sinonym

A Chinese name detection and normalization library.

Sinonym is a Python library designed to accurately detect and normalize Chinese names across various romanization systems. It filters out non-Chinese names (such as Western, Korean, Vietnamese, and Japanese names).

This was mostly written with Claude Code with extensive oversight from me... Sorry if the actual code is too AI-ish. It's fast, well-tested, and works pretty well.

Not all the tests pass, and the test suite is intentionally skewed towards failing tests, so I know what to try to work on next. It's more-or-less impossible to guess with 100% accuracy whether a Romanized Chinese name is in the Given-Name Surname or Surname Given-Name format, and the best approach is to try to guess the most likely format from a batch of names that should all have the same format (like all the authors of an academic paper or all the names in a specific dataset). This kind of batch processing is described below.

Data Flow Pipeline

Raw Input
    ↓
TextPreprocessor (structural cleaning)
    ↓
NormalizationService (creates NormalizedInput with compound_metadata)
    ↓
CompoundDetector (generates metadata) → compound_metadata
    ↓
NameParsingService (uses compound_metadata)
    ↓
NameFormattingService (uses compound_metadata)
    ↓
Formatted Output

What to Expect: Behavior and Output

1. Output Formatting & Standardization

Name Order is Given-Name Surname
- The library's primary function is to standardize names into a Given-Name Surname format, regardless of the input order.
- Input: "Liu Dehua" → Output: "De-Hua Liu"
- Input: "Wei, Yu-Zhong" → Output: "Yu-Zhong Wei"
Capitalization is Title Case
- The output is consistently formatted in Title Case, with the first letter of the surname and each part of the given name capitalized.
- Input: "DAN CHEN" → Output: "Dan Chen"
Given Names are Hyphenated
- Given names composed of multiple syllables are joined by a hyphen. This applies to standard names, names with initials, and reduplicated (repeated) names.
- Input (Standard): "Wang Li Ming" → Output: "Li-Ming Wang"
- Input (Initials): "Y. Z. Wei" → Output: "Y-Z Wei"
- Input (Reduplicated): "Chen Linlin" → Output: "Lin-Lin Chen"

2. Name Component Handling

Compound Surname Formatting is Strictly Preserved
- The library identifies compound (two-character) surnames and preserves their original formatting (compact, spaced, hyphenated, or CamelCase).
- Input (Compact): "Duanmu Wenjie" → Output: "Wen-Jie Duanmu"
- Input (Spaced): "Au Yeung Chun" → Output: "Chun Au Yeung"
- Input (Hyphenated): "Au-Yeung Chun" → Output: "Chun Au-Yeung"
- Input (CamelCase): "AuYeung Ka Ming" → Output: "Ka-Ming AuYeung"
Unspaced Compound Given Names are Split and Hyphenated
- If a multi-syllable given name is provided as a single unspaced string, the library identifies the syllables and inserts hyphens.
- Input: "Wang Xueyin" → Output: "Xue-Yin Wang"

3. Input Flexibility & Error Correction

Handles All-Chinese Character Names
- It correctly processes names written entirely in Chinese characters, applying surname-first convention with frequency-based disambiguation.
- Input: "巩俐" → Output: "Li Gong" (李 is more frequent surname than 巩)
- Input: "李伟" → Output: "Wei Li" (李 recognized as surname in first position)
Handles Mixed Chinese (Hanzi) and Roman Characters
- It correctly parses names containing both Chinese characters and Pinyin, using the Roman parts for the output.
- Input: "Xiaohong Li 张小红" → Output: "Xiao-Hong Li"
Normalizes Diacritics, Accents, and Special Characters
- It converts pinyin with tone marks and special characters like ü into their basic Roman alphabet equivalents.
- Input: "Dèng Yǎjuān" → Output: "Ya-Juan Deng"
Normalizes Full-Width Characters
- It processes full-width Latin characters (often from PDFs) into standard characters.
- Input: "Ｌｉ　Ｘｉａｏｍｉｎｇ" → Output: "Xiao-Ming Li"
Handles Messy Formatting (Commas, Dots, Spacing)
- The library correctly parses names despite common data entry or OCR errors.
- Input (Bad Comma): "Chen,Mei Ling" → Output: "Mei-Ling Chen"
- Input (Dot Separators): "Li.Wei.Zhang" → Output: "Li-Wei Zhang"
Splits Concatenated Names
- It can split names that have been concatenated without spaces, using CamelCase or mixed-case cues.
- Input: "ZhangWei" → Output: "Wei Zhang"
Strips Parenthetical Western Names
- If a Western name is included in parentheses, it is stripped out, and the remaining Chinese name is parsed correctly.
- Input: "李（Peter）Chen" → Output: "Li Chen"

4. Cultural & Regional Specificity

Rejects Non-Chinese Names
- The library uses advanced heuristics and machine learning to reject names from other cultures to avoid false positives.
- Western: Rejects "John Smith" and even "Christian Wong".
- Korean: Rejects "Kim Min-jun".
- Vietnamese: Rejects "Nguyen Van Anh".
- Japanese: Rejects "Sato Taro" and Japanese names in Chinese characters like "山田太郎" (Yamada Taro) using ML classification.
Supports Regional Romanizations (Cantonese, Wade-Giles)
- The library recognizes and preserves different English romanization systems.
- Cantonese: Input "Chan Tai Man" becomes "Tai-Man Chan" (not "Chen").
- Wade-Giles: Input "Ts'ao Ming" becomes "Ming Ts'ao" (preserves apostrophe).
Corrects for Pinyin Library Inconsistencies
- It contains an internal mapping to fix cases where the underlying pypinyin library's output doesn't match the most common romanization for a surname.
- Example: The character 曾 is converted by pypinyin to Zeng, but this library corrects it to the expected Zeng.

5. Performance

High-Performance with Caching
- The library is benchmarked to be very fast, capable of processing over 10,000 diverse names per second, and uses caching to significantly speed up the processing of repeated names.

How It Works

Sinonym processes names through a multi-stage pipeline designed for high accuracy and performance:

Input Preprocessing: The input string is cleaned and normalized. This includes handling mixed scripts (e.g., "张 Wei") and standardizing different romanization variants.
All-Chinese Detection: The system detects inputs written entirely in Chinese characters and applies Han-to-Pinyin conversion with surname-first ordering preferences.
Ethnicity Classification: The name is analyzed to filter out non-Chinese names. This stage uses linguistic patterns and machine learning to identify and reject Western, Korean, Vietnamese, and Japanese names. For all-Chinese character inputs, a trained ML classifier (99.5% accuracy) determines if names like "山田太郎" are Japanese vs Chinese.
Probabilistic Parsing: The system identifies potential surname and given name boundaries by leveraging frequency data, which helps in accurately distinguishing between a surname and a given name. For all-Chinese inputs, it applies a surname-first bonus while still considering frequency data.
Compound Name Splitting: For names with fused given names (e.g., "Weiming"), a tiered confidence system is used to correctly split them into their constituent parts (e.g., "Wei-Ming").
Output Formatting: The final output is standardized to a "Given-Name Surname" format (e.g., "Wei Zhang").

Installation

To get started with Sinonym, clone the repository and install the necessary dependencies using uv:

git clone https://github.com/allenai/sinonym.git
cd sinonym

From repo root:

# create the project venv (uv defaults to .venv if you don't give a name)
uv venv --python 3.11

Activate the venv (choose one):

# macOS / Linux (bash / zsh)
source .venv/bin/activate

# Windows PowerShell
. .venv\Scripts\Activate.ps1

# Windows CMD
.venv\Scripts\activate.bat

Install project dependencies (dev extras):

uv sync --active --all-extras --dev

Machine Learning Dependencies

Sinonym includes a ML-based Japanese vs Chinese name classifier for enhanced accuracy with all-Chinese character names.

Quick Start

Here's a simple example of how to use Sinonym to detect and normalize a Chinese name:

from sinonym.detector import ChineseNameDetector

# Initialize the detector
detector = ChineseNameDetector()

# --- Example 1: A simple Chinese name ---
result = detector.is_chinese_name("Li Wei")
if result.success:
    print(f"Normalized Name: {result.result}")
    # Expected Output: Normalized Name: Wei Li

# --- Example 2: A compound given name ---
result = detector.is_chinese_name("Wang Weiming")
if result.success:
    print(f"Normalized Name: {result.result}")
    # Expected Output: Normalized Name: Wei-Ming Wang

# --- Example 3: An all-Chinese character name ---
result = detector.is_chinese_name("巩俐")
if result.success:
    print(f"Normalized Name: {result.result}")
    # Expected Output: Normalized Name: Li Gong

# --- Example 4: A non-Chinese name ---
result = detector.is_chinese_name("John Smith")
if not result.success:
    print(f"Error: {result.error_message}")
    # Expected Output: Error: name not recognised as Chinese

# --- Example 5: Japanese name in Chinese characters (ML-enhanced detection) ---
result = detector.is_chinese_name("山田太郎")
if not result.success:
    print(f"Error: {result.error_message}")
    # Expected Output: Error: Japanese name detected by ML classifier

# --- Example 6: Batch processing of academic author list ---
author_list = ["Zhang Wei", "Li Ming", "Wang Xiaoli", "Liu Jiaming", "Feng Cha"]
batch_result = detector.analyze_name_batch(author_list)
print(f"Format detected: {batch_result.format_pattern.dominant_format}")
print(f"Confidence: {batch_result.format_pattern.confidence:.1%}")
# Expected Output: Format detected: NameFormat.SURNAME_FIRST, Confidence: 94%

for i, result in enumerate(batch_result.results):
    if result.success:
        print(f"{author_list[i]} → {result.result}")
# Expected Output: Zhang Wei → Wei Zhang, Li Ming → Ming Li, etc.

# --- Example 7: Quick format detection for data validation ---
unknown_format_list = ["Wei Zhang", "Ming Li", "Xiaoli Wang"]
pattern = detector.detect_batch_format(unknown_format_list)
if pattern.threshold_met:
    print(f"Consistent {pattern.dominant_format} formatting detected")
    print(f"Safe to process as batch with {pattern.confidence:.1%} confidence")
else:
    print("Mixed formatting detected - process individually")

# --- Example 8: Simple batch processing for data cleanup ---
messy_names = ["Li, Wei", "Zhang.Ming", "Wang Xiaoli"]
clean_results = detector.process_name_batch(messy_names)
for original, clean in zip(messy_names, clean_results):
    if clean.success:
        print(f"Cleaned: '{original}' → '{clean.result}'")
    # Expected Output: Li, Wei → Wei Li, Zhang.Ming → Ming Zhang, etc.

Batch Processing for Consistent Formatting

Sinonym includes advanced batch processing capabilities that significantly improve accuracy when processing lists of names that share consistent formatting patterns. This is particularly valuable for real-world datasets like academic author lists, company directories, or database migrations.

How Batch Processing Works

When processing multiple names together, Sinonym:

Detects Format Patterns: Analyzes the entire batch to identify whether names follow a surname-first (e.g., "Zhang Wei") or given-first (e.g., "Wei Zhang") pattern
Aggregates Evidence: Uses frequency statistics across all names to build confidence in the detected pattern
Applies Consistent Formatting: When confidence exceeds 67%, applies the detected pattern to improve parsing of ambiguous individual names
Tracks Improvements: Identifies which names benefit from batch context vs. individual processing

Key Benefits

Fixes Ambiguous Cases: Names like "Feng Cha" that are difficult to parse individually become clear in batch context
Maintains Consistency: Ensures all names in a list follow the same formatting pattern
High Accuracy: Achieves 90%+ success rate on previously problematic cases when proper format context is available
Intelligent Fallback: Automatically falls back to individual processing when batch patterns are unclear

Batch Processing Methods

from sinonym.detector import ChineseNameDetector

detector = ChineseNameDetector()

# Full batch analysis with detailed results
result = detector.analyze_name_batch([
    "Zhang Wei", "Li Ming", "Wang Xiaoli", "Liu Jiaming"
])
print(f"Format detected: {result.format_pattern.dominant_format}")
print(f"Confidence: {result.format_pattern.confidence:.1%}")
print(f"Improved names: {len(result.improvements)}")

# Quick format detection without full processing
pattern = detector.detect_batch_format([
    "Zhang Wei", "Li Ming", "Wang Xiaoli"
])
if pattern.threshold_met:
    print(f"Strong {pattern.dominant_format} pattern detected")

# Simple batch processing (returns list of results)
results = detector.process_name_batch([
    "Zhang Wei", "Li Ming", "Wang Xiaoli"
])
for result in results:
    print(f"Processed: {result.result}")

When to Use Batch Processing

Academic Papers: Author lists typically follow consistent formatting
Company Directories: Employee lists often use uniform formatting conventions
Large Datasets: Processing 100+ names where format consistency is expected

Batch processing requires a minimum of 2 names and works best with 5+ names for reliable pattern detection.

Batch Processing Limitations

Unambiguous Names: Some names have only one possible parsing format (e.g., compound given names like "Wei-Qi Wang"). If the batch contains unambiguous names that conflict with the detected format pattern, batch processing will raise an error rather than forcing incorrect parsing.

Example Error:

names = ["Xin Liu", "Yang Li", "Wei-Qi Wang"]  # "Wei-Qi Wang" is unambiguous
try:
    result = detector.analyze_name_batch(names)
except ValueError as e:
    print(e)  # "Cannot apply batch format to unambiguous names"

Confidence Threshold: Batch processing requires at least 55% confidence in format detection. Mixed-format batches with unclear patterns will raise an error rather than fall back to individual processing.

Solution: For mixed batches or those with unambiguous names, process names individually using detector.is_chinese_name().

Batch Processing with Mixed Name Types

Batch processing works seamlessly with mixed datasets containing both Chinese and non-Chinese names. Non-Chinese names are rejected during individual analysis but still appear in the batch output as failed results.

# Mixed dataset: 2 Western names + 8 Chinese names
mixed_names = [
    "John Smith",     # Western - will be rejected
    "Mary Johnson",   # Western - will be rejected  
    "Xin Liu",        # Chinese - GIVEN_FIRST preference
    "Yang Li",        # Chinese - GIVEN_FIRST preference
    "Wei Zhang",      # Chinese - GIVEN_FIRST preference
    "Ming Wang",      # Chinese - GIVEN_FIRST preference
    "Li Chen",        # Chinese - GIVEN_FIRST preference
    "Hui Zhou",       # Chinese - GIVEN_FIRST preference
    "Feng Zhao",      # Chinese - GIVEN_FIRST preference
    "Tong Zhang",     # Chinese - might prefer SURNAME_FIRST (ambiguous)
]

result = detector.analyze_name_batch(mixed_names)

# Format detection uses only the 8 Chinese names
# If 7 prefer GIVEN_FIRST vs 1 SURNAME_FIRST = 87.5% confidence
# Above 55% threshold → GIVEN_FIRST format applied to all Chinese names

print(f"Total results: {len(result.results)}")  # 10 (same as input)
print(f"Format detected: {result.format_pattern.dominant_format}")  # GIVEN_FIRST
print(f"Confidence: {result.format_pattern.confidence:.1%}")  # 87.5%

# Check results by type
for i, (name, result_obj) in enumerate(zip(mixed_names, result.results)):
    if result_obj.success:
        print(f"✅ {name} → {result_obj.result}")
    else:
        print(f"❌ {name} → {result_obj.error_message}")

# Output:
# ❌ John Smith → name not recognised as Chinese
# ❌ Mary Johnson → name not recognised as Chinese  
# ✅ Xin Liu → Xin Liu
# ✅ Yang Li → Yang Li
# ✅ Wei Zhang → Wei Zhang
# ... (all Chinese names processed successfully with consistent formatting)

Key Benefits:

Maintains input-output correspondence: Results array matches input array length and order
Robust format detection: Only valid Chinese names contribute to pattern detection
Consistent formatting: All Chinese names get the same detected format applied
Clear failure reporting: Non-Chinese names are clearly marked as failed with error messages

Development

If you'd like to contribute to Sinonym, here’s how to set up your development environment.

Setup

First, clone the repository:

git clone https://github.com/yourusername/sinonym.git
cd sinonym

Then, install the development dependencies:

uv sync --extra dev

Running Tests

To run the test suite, use the following command:

uv run pytest

Code Quality

We use ruff for linting and formatting:

# Run linting and formatting
uv run ruff check . --fix
uv run ruff format .

License

Sinonym is licensed under the Apache 2.0 License. See the LICENSE file for more details.

Contributing

We welcome contributions! If you'd like to contribute, please follow these steps:

Fork the repository.
Create a new feature branch.
Make your changes and ensure all tests and quality checks pass.
Submit a pull request.

Data Sources

The accuracy of Sinonym is enhanced by data derived from ORCID records, which provides valuable frequency information for Chinese surnames and given names.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

sergeyf

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.2.4

Mar 12, 2026

0.2.3

Feb 12, 2026

0.2.2

Sep 10, 2025

0.2.1

Sep 10, 2025

0.2.0

Sep 8, 2025

0.1.9

Sep 6, 2025

0.1.8

Sep 4, 2025

This version

0.1.7

Sep 3, 2025

0.1.6

Sep 3, 2025

0.1.5

Sep 3, 2025

0.1.4

Sep 3, 2025

0.1.2

Sep 3, 2025

0.1.1

Sep 1, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sinonym-0.1.7.tar.gz (549.3 kB view details)

Uploaded Sep 3, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sinonym-0.1.7-py3-none-any.whl (545.7 kB view details)

Uploaded Sep 3, 2025 Python 3

File details

Details for the file sinonym-0.1.7.tar.gz.

File metadata

Download URL: sinonym-0.1.7.tar.gz
Upload date: Sep 3, 2025
Size: 549.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sinonym-0.1.7.tar.gz
Algorithm	Hash digest
SHA256	`ea6a147cbc5f8032cc480b50bd1aea326adce292e47279266f33cf7258752134`
MD5	`7f2fd67f633dcbe7fb3a54774c2ce62a`
BLAKE2b-256	`46edda04c724213cc95e9d16dbb8941bb59b4f6f5429275f49a0bc45406f0f7e`

See more details on using hashes here.

Provenance

The following attestation bundles were made for sinonym-0.1.7.tar.gz:

Publisher: workflow.yaml on allenai/sinonym

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sinonym-0.1.7.tar.gz
- Subject digest: ea6a147cbc5f8032cc480b50bd1aea326adce292e47279266f33cf7258752134
- Sigstore transparency entry: 466602897
- Sigstore integration time: Sep 3, 2025
Source repository:
- Permalink: allenai/sinonym@3940955c4a63a1eb2e1d70ab6c7dcd6aaf25b711
- Branch / Tag: refs/heads/main
- Owner: https://github.com/allenai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yaml@3940955c4a63a1eb2e1d70ab6c7dcd6aaf25b711
- Trigger Event: push

File details

Details for the file sinonym-0.1.7-py3-none-any.whl.

File metadata

Download URL: sinonym-0.1.7-py3-none-any.whl
Upload date: Sep 3, 2025
Size: 545.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sinonym-0.1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cf990acb2f2b2955327619d072169bf323f28824ce606eceef729ae4fef8d9ea`
MD5	`45c5b35e995c22950aac348f1a5c69cc`
BLAKE2b-256	`867937e967bf6318d80239aab0a5045e4dabf1c3b87774d2791043d52cd7b246`

See more details on using hashes here.

Provenance

The following attestation bundles were made for sinonym-0.1.7-py3-none-any.whl:

Publisher: workflow.yaml on allenai/sinonym

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: sinonym-0.1.7-py3-none-any.whl
- Subject digest: cf990acb2f2b2955327619d072169bf323f28824ce606eceef729ae4fef8d9ea
- Sigstore transparency entry: 466602923
- Sigstore integration time: Sep 3, 2025
Source repository:
- Permalink: allenai/sinonym@3940955c4a63a1eb2e1d70ab6c7dcd6aaf25b711
- Branch / Tag: refs/heads/main
- Owner: https://github.com/allenai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow.yaml@3940955c4a63a1eb2e1d70ab6c7dcd6aaf25b711
- Trigger Event: push

sinonym 0.1.7

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Sinonym

Data Flow Pipeline

What to Expect: Behavior and Output

1. Output Formatting & Standardization

2. Name Component Handling

3. Input Flexibility & Error Correction

4. Cultural & Regional Specificity

5. Performance

How It Works

Installation

Machine Learning Dependencies

Quick Start

Batch Processing for Consistent Formatting

How Batch Processing Works

Key Benefits

Batch Processing Methods

When to Use Batch Processing

Batch Processing Limitations

Batch Processing with Mixed Name Types

Development

Setup

Running Tests

Code Quality

License

Contributing

Data Sources

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance