A powerful CLI and library tool for detecting Chinese names in text and Excel files

Chinese Finder

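Chinese Finder is a tool for detecting Chinese names in text and Excel files. It can be used as a command-line tool or as a Python library, and its strategy-pattern architecture makes it straightforward to add new detection strategies.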

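Features

✅ Dual mode: ships both a CLI tool and a Python library interface
🔌 Extensible architecture: plugin-style design based on the strategy pattern, so new strategies are easy to add
📊 Multiple formats: reads TXT and Excel files; writes Excel, JSON, or CSV
🎯 Multiple detection strategies:
- Family-name list matching (based on 350+ common Chinese surnames)
- Chinese character detection (based on a Unicode range)
- Mixed-format name detection (handles English-format names with degree or credential suffixes)
- Custom strategies through the extension mechanism
⚙️ Flexible configuration: strategies can be combined in ANY or ALL mode, and specific strategies can be selected
🚀 Batch processing: handles single files and batches of files
📝 Documentation: API reference and usage examples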

Installation and environment setup

System requirements

Python 3.11 or later
Supported operating systems: macOS, Linux, Windows

Install with pip

pip install chinese-finder

Usage examples and code snippets

CLI examples

1. Process a single file

# Process an Excel file (a column name is required)
chinese process data.xlsx --column name

# Process a text file
chinese process names.txt

# Choose the output format
chinese process data.xlsx -c name -o json

# Verbose output
chinese process data.xlsx -c name -v

2. Process files in batch

# Process several Excel files
chinese process-batch data1.xlsx data2.xlsx data3.xlsx -c name

# Process mixed file types
chinese process-batch names.txt data.xlsx -c name -o csv

# Choose the output directory
chinese process-batch *.xlsx -c name -d ./output

3. Use specific strategies

# Family-name matching only
chinese process data.xlsx -c name -s family_name

# Several strategies
chinese process data.xlsx -c name -s family_name -s chinese_char

# Mixed-format strategy (names with degree or credential suffixes)
chinese process data.xlsx -c name -s mixed_format

# List all available strategies
chinese list-strategies

4. Strategy combination modes

# ANY mode: a match from any one strategy counts as a Chinese name (default)
chinese process data.xlsx -c name -m any

# ALL mode: every strategy must match for the text to count as a Chinese name
chinese process data.xlsx -c name -m all
Python API examples

1. Basic text detection

from chinese_finder import ChineseFinderProcessor

# Create a processor (uses all strategies)
processor = ChineseFinderProcessor()

# Use the mixed-format strategy for names with degree or credential suffixes
processor_mixed = ChineseFinderProcessor(strategies=['mixed_format'])
print(processor_mixed.process_text("Zhexiang Sheng, FRM"))  # True

# Check a single string
result = processor.process_text("wang john")
print(result)  # True

result = processor.process_text("alice smith")
print(result)  # False

2. Batch text processing

from chinese_finder import ChineseFinderProcessor

processor = ChineseFinderProcessor()

texts = ["wang john", "alice smith", "张三", "hello world"]
results = processor.process_texts(texts)

for text, result in zip(texts, results):
    status = "✓" if result else "✗"
    print(f"{status} {text}")

# Use the mixed_format strategy for mixed-format names
mixed_proc = ChineseFinderProcessor(strategies=['mixed_format'])
mixed_names = ["Zhexiang Sheng, FRM", "Yongshuai (Michael) Chen, PhD", "John Smith"]
print(mixed_proc.process_texts(mixed_names))  # [True, True, False]

3. Process files

from chinese_finder import ChineseFinderProcessor

processor = ChineseFinderProcessor()

# Process an Excel file
output_path = processor.process_excel(
    file_path="data.xlsx",
    column="name",
    output_format="excel"  # or "json", "csv"
)
print(f"Output saved to: {output_path}")

# Process a text file
output_path = processor.process_txt(
    file_path="names.txt",
    output_format="json"
)

4. Use specific strategies

from chinese_finder import ChineseFinderProcessor

# Family-name matching only
processor = ChineseFinderProcessor(
    strategies=['family_name'],
    mode='any'
)

# Several strategies, ALL mode
processor = ChineseFinderProcessor(
    strategies=['family_name', 'chinese_char'],
    mode='all'
)

# Mixed-format strategy (names with degree or credential suffixes)
processor = ChineseFinderProcessor(strategies=['mixed_format'])
result = processor.process_text("Zhexiang Sheng, FRM")
print(result)  # True

5. Custom strategies

from chinese_finder import DetectionStrategy, register_strategy, ChineseFinderProcessor

# Define a custom strategy
@register_strategy('my_custom_strategy')
class MyCustomStrategy(DetectionStrategy):
    @property
    def name(self) -> str:
        return 'my_custom_strategy'

    def detect(self, text: str) -> bool:
        # Put your detection logic here
        return "custom_pattern" in text.lower()

# Use the custom strategy
processor = ChineseFinderProcessor(
    strategies=['my_custom_strategy'],
    mode='any'
)

result = processor.process_text("custom_pattern detected")
print(result)  # True

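Strategy details

The `chinese process` command

Basic usage

chinese process data.xlsx --column name

Description

Processes a single file, detects the Chinese names in it, and writes an output file.

Parameters

Parameter	Short	Required	Description	Example
`file_path`	-	✅	Input file path	`data.xlsx`
`--column`	`-c`	Required for Excel	Name of the column to analyze	`name`
`--strategies`	`-s`	❌	Strategies to use (default: all)	`family_name`
`--mode`	`-m`	❌	Strategy combination mode (default: any)	`any`/`all`
`--output-format`	`-o`	❌	Output format	`excel`/`json`/`csv`
`--output-dir`	`-d`	❌	Output directory	`./output`
`--verbose`	`-v`	❌	Verbose output	-

Default behavior

When --strategies is not given, the command loads every registered strategy:

Number of strategies: 3
Strategies: family_name + chinese_char + mixed_format
Combination mode: any (a match from any one strategy counts as a Chinese name)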

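Default strategies in detail

Strategy 1: `family_name` (family-name list matching)

How it works

Identifies likely Chinese names by matching tokens against a predefined list of Chinese family names.

Detection logic

Load the surname list: read 350+ common Chinese family names from chinese_finder/data/family_names.txt
Tokenize: split the input text on whitespace
Case-insensitive match: lowercase each token and compare it with the surname list
Short-circuit: return True as soon as any token matches

Algorithm flow

Input text: "wang john"
    ↓
Tokens: ["wang", "john"]
    ↓
Check each token:
  - "wang".lower() = "wang" → in the surname list? ✅ YES
    ↓
Return: True (stops immediately)

Code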

# family_name.py core logic
def detect(self, text: str) -> bool:
    if not isinstance(text, str):
        return False

    # Split on whitespace
    tokens = text.split()

    # Check whether any token is in the surname list
    for token in tokens:
        if token.lower() in self.family_names:
            return True  # Match found, return immediately

    return False  # No match

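Match examples

Input text	Tokens	Matched surname	Result
`"wang john"`	`["wang", "john"]`	wang ✅	`True`
`"Zhang Wei"`	`["Zhang", "Wei"]`	zhang ✅	`True`
`"alice smith"`	`["alice", "smith"]`	none ❌	`False`
`"李明 li"`	`["李明", "li"]`	li ✅	`True`

Data source

The built-in surname list lives at chinese_finder/data/family_names.txt and contains:

Common single-character surnames: 王 (Wang), 李 (Li), 张 (Zhang), 刘 (Liu), 陈 (Chen), and others
Compound surnames: 欧阳 (Ouyang), 司马 (Sima), 上官 (Shangguan), and others
Total: 350+ Chinese family names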
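
Loading a one-surname-per-line file into a lookup set can be sketched as follows. This is a minimal illustration, assuming UTF-8 encoding and lowercase normalization; `load_family_names` is a hypothetical helper, not the package's own loader, and the demo uses a temporary file rather than the bundled family_names.txt.

```python
import tempfile
from pathlib import Path


def load_family_names(path: Path) -> frozenset:
    """Read one surname per line, skip blank lines, and lowercase each entry."""
    names = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        name = line.strip().lower()
        if name:
            names.add(name)
    return frozenset(names)


# Demo against a throwaway file rather than the packaged family_names.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "family_names.txt"
    sample.write_text("Wang\nli\n\nouyang\n", encoding="utf-8")
    family_names = load_family_names(sample)

print(sorted(family_names))  # ['li', 'ouyang', 'wang']
```

A set gives constant-time membership checks, which suits the per-token lookup described above.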

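Strategy 2: `chinese_char` (Chinese character detection)

How it works

Uses a regular expression to check whether the text contains any Chinese character in a given Unicode range.

Detection logic

Regex: a precompiled regular expression, [\u4e00-\u9fff]
Unicode range: U+4E00 to U+9FFF (covers the commonly used Chinese characters)
Search: scan the text for any character in that range
Boolean result: True if one is found, otherwise False

Algorithm flow

Input text: "张三"
    ↓
Regex search: [\u4e00-\u9fff]
    ↓
Chinese character found: "张" (U+5F20) ✅
    ↓
Return: True

Code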

# chinese_char.py core logic
import re

class ChineseCharStrategy(DetectionStrategy):
    # Precompiled regular expression (avoids recompiling on every call)
    _chinese_pattern = re.compile(r'[\u4e00-\u9fff]')

    def detect(self, text: str) -> bool:
        if not isinstance(text, str):
            return False

        # Search for any Chinese character
        return bool(self._chinese_pattern.search(text))

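Match examples

Input text	Contains Chinese	Matched characters	Result
`"张三"`	✅	张, 三	`True`
`"hello 你好"`	✅	你, 好	`True`
`"wang john"`	❌	none	`False`
`"李明 li"`	✅	李, 明	`True`
`"Alice Smith"`	❌	none	`False`

Unicode range notes

Range: U+4E00 to U+9FFF
Coverage: the CJK Unified Ideographs block (commonly used Chinese characters)
Size: 20,992 code points (the block originally defined 20,902 ideographs, U+4E00 to U+9FA5)
Not included: pinyin, Japanese kana, Korean Hangul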
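
The boundaries of the range can be checked directly with the standard `re` module. The sketch below uses the same character class as above; the sample strings are illustrative. Note that Japanese kanji share this block with Chinese characters, so a string such as "東京" also matches, while kana, Hangul, and CJK Extension A (starting at U+3400) do not.

```python
import re

# Same character class as the chinese_char strategy described above.
CJK_UNIFIED = re.compile(r"[\u4e00-\u9fff]")

samples = ["张三", "東京", "wang john", "ひらがな", "한글", "\u3400"]
results = {text: bool(CJK_UNIFIED.search(text)) for text in samples}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```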

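Strategy 3: `mixed_format` (mixed-format name detection)

How it works

Applies multi-step text cleaning and heuristic pinyin scoring to recognize Chinese names in strings that mix English given names, degrees, and professional credentials.

Processing pipeline

Remove parenthesized aliases: strip English-name aliases such as (Michael), and record the presence of parentheses as a strong signal
Split on commas and filter suffixes: discard 70+ known degree and credential suffixes (PhD, MD, FRM, CFA, and so on)
Pinyin scoring: score the remaining tokens on pinyin features to decide whether they form a romanized Chinese name

Supported formats

Input format	Example	Notes
Romanized name + professional credential	`Zhexiang Sheng, FRM`	Filters the FRM suffix and recognizes the pinyin name
Pinyin name (English name) + degree	`Yongshuai (Michael) Chen, PhD`	Removes the parenthesized part and recognizes the pinyin name
Pinyin name + several degrees or credentials	`Yunguo Yu, PhD, MD`	Filters several suffixes and recognizes the pinyin name

Detection logic

Alias detection: a parenthesized alias is a strong signal (it usually means a Chinese name with a corresponding English name)
Name extraction: after suffix filtering, 2 to 4 words must remain
Pinyin feature scoring:
- Pinyin digraphs (zh/ch/sh/qu/xu, etc.) → score up
- Pinyin initial consonant as the first letter (b/p/m/f/d/t/n/l, etc.) → score up
- Ends in a vowel or n (typical of pinyin syllables) → score up
- English consonant clusters (th/ck/ght/tch, etc.) → score down
- Hit in a list of common Western names → excluded to avoid false positives
Final decision:
- Parentheses present + pinyin ratio ≥ 0.5 → True
- No parentheses + pinyin ratio ≥ 0.5 and no common English name → True
- Anything else → False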
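
The pipeline and decision rules above can be sketched roughly as follows. This is not the package's implementation: the suffix and Western-name sets are tiny stand-ins, and the score weights, the threshold of 2, the extended initial-letter set, and the treatment of an "ng" ending are all assumptions made for illustration. `is_pinyin_like` and `looks_like_chinese_name` are hypothetical names.

```python
import re

# Illustrative stand-ins: the real suffix and name lists are much larger.
SUFFIXES = {"phd", "md", "frm", "cfa", "cpa", "mba"}
WESTERN_NAMES = {"john", "smith", "mary", "johnson", "michael"}
DIGRAPHS = ("zh", "ch", "sh", "qu", "xu")
INITIALS = "bpmfdtnlgkhjqxrzcsyw"  # assumed set of pinyin initial letters
ENGLISH_CLUSTERS = ("th", "ck", "ght", "tch")
ALIAS = re.compile(r"\([^)]*\)")  # parenthesized alias such as "(Michael)"


def is_pinyin_like(token: str) -> bool:
    """Score one token; the weights and the threshold are assumptions."""
    t = token.lower()
    if t in WESTERN_NAMES:
        return False
    score = 0
    if t.startswith(DIGRAPHS) or t[0] in INITIALS:
        score += 1
    if t.endswith(("a", "e", "i", "o", "u", "n", "ng")):
        score += 1
    if any(cluster in t for cluster in ENGLISH_CLUSTERS):
        score -= 2
    return score >= 2


def looks_like_chinese_name(text: str) -> bool:
    has_alias = bool(ALIAS.search(text))          # step 1: alias signal
    cleaned = ALIAS.sub(" ", text)
    parts = [part.strip() for part in cleaned.split(",")]
    # step 2: drop known suffixes, keep the name body
    tokens = " ".join(p for p in parts if p.lower() not in SUFFIXES).split()
    if not 2 <= len(tokens) <= 4:
        return False
    # step 3: share of tokens that look like pinyin
    ratio = sum(is_pinyin_like(t) for t in tokens) / len(tokens)
    if ratio < 0.5:
        return False
    return has_alias or not any(t.lower() in WESTERN_NAMES for t in tokens)


for sample in ["Zhexiang Sheng, FRM", "Yongshuai (Michael) Chen, PhD", "John Smith, PhD"]:
    print(sample, "->", looks_like_chinese_name(sample))
```

Heuristics of this kind trade precision for coverage, so any real word list and weights would need tuning against representative data.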

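Match examples

Input text	Extracted name	Pinyin ratio	Result
`"Zhexiang Sheng, FRM"`	Zhexiang Sheng	1.0	`True`
`"Yongshuai (Michael) Chen, PhD"`	Yongshuai Chen	1.0	`True`
`"Yunguo Yu, PhD, MD"`	Yunguo Yu	1.0	`True`
`"John Smith, PhD"`	John Smith	0.0	`False`
`"Mary Johnson, CPA"`	Mary Johnson	0.0	`False`

Configuration

The strategy takes no special initialization parameters; select it by name:

processor = ChineseFinderProcessor(strategies=['mixed_format'])

Usage notes

Suitable for: name data that mixes English names, degrees, and professional credentials
Not suitable for: names written purely in Chinese characters (use the chinese_char strategy instead)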
Running it on its own: pass -s mixed_format on the command line, or list it in the strategies parameter

Strategy combination

ANY mode (default)

Rule: any strategy returns True → final result True

# processor.py core logic
results = [strategy.detect(text) for strategy in self.strategies]

if self.mode == 'any':
    return any(results)  # True if any strategy matched

Examples:

Input text	family_name	chinese_char	ANY result	Notes
`"wang john"`	✅ True	❌ False	✅ True	Family name matched
`"张三"`	❌ False	✅ True	✅ True	Chinese characters matched
`"alice smith"`	❌ False	❌ False	❌ False	No match
`"张伟 wang"`	✅ True	✅ True	✅ True	Both matched

ALL mode

Rule: every strategy returns True → final result True

else:  # mode == 'all'
    return all(results)  # True only if every strategy matched

Examples:

Input text	family_name	chinese_char	ALL result	Notes
`"wang john"`	✅ True	❌ False	❌ False	chinese_char failed
`"张三"`	❌ False	✅ True	❌ False	family_name failed
`"张伟 wang"`	✅ True	✅ True	✅ True	Both succeeded
`"alice smith"`	❌ False	❌ False	❌ False	Both failed

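When to use which

Mode	Use case	Characteristics
ANY	Data that mixes pinyin names and Chinese-character names	Loose matching, high recall
ALL	Chinese-name data where high precision is required	Strict matching, high precision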
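
The two modes can be reproduced in a self-contained sketch. The detectors below are simplified stand-ins for family_name and chinese_char (a three-entry surname set and the Unicode range regex), and `combine` is a hypothetical helper, not the processor's method.

```python
import re
from typing import Callable, Iterable

Detector = Callable[[str], bool]

FAMILY_NAMES = {"wang", "li", "zhang"}  # tiny stand-in for the bundled list
_CJK = re.compile(r"[\u4e00-\u9fff]")


def family_name(text: str) -> bool:
    return any(token.lower() in FAMILY_NAMES for token in text.split())


def chinese_char(text: str) -> bool:
    return bool(_CJK.search(text))


def combine(text: str, detectors: Iterable[Detector], mode: str = "any") -> bool:
    """Run every detector, then reduce the results with any() or all()."""
    results = [detect(text) for detect in detectors]
    if mode == "any":
        return any(results)
    if mode == "all":
        return all(results)
    raise ValueError(f"unknown mode: {mode!r}")


detectors = [family_name, chinese_char]
for text in ["wang john", "张三", "alice smith", "张伟 wang"]:
    print(text, combine(text, detectors, "any"), combine(text, detectors, "all"))
```

One detail worth knowing: Python's `all()` returns True for an empty list, so a combiner like this should be given at least one detector in ALL mode.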

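End-to-end processing example

Excel file processing

# Run the command
chinese process data.xlsx --column name -v

Processing flow:

1. Read the Excel file
   ↓
2. Check that the "name" column exists
   ↓
3. Initialize the processor (default strategies, ANY mode)
   ↓
4. For each row of the "name" column:
   ├── call process_text() on the value
   │   ├── family_name.detect(text)
   │   ├── chinese_char.detect(text)
   │   └── combine the results (ANY mode)
   ├── if a Chinese name is detected → keep the original value
   └── otherwise → set it to an empty string
   ↓
5. Add a "Chinese" column to the DataFrame
   ↓
6. Save the output file: data_cn.xlsx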

Expected output (data_cn.xlsx):

name	Chinese
wang john	wang john
alice smith
张三	张三
hello world
李明	李明
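
Steps 4 and 5 of the flow can be sketched without pandas, using plain dictionaries as rows. This is an illustration only: `detect` is a simplified ANY-mode stand-in for the processor, and `add_chinese_column` is a hypothetical helper.

```python
import re

_CJK = re.compile(r"[\u4e00-\u9fff]")
_SURNAMES = {"wang", "li", "zhang"}  # tiny stand-in for the bundled list


def detect(text: str) -> bool:
    """ANY-mode stand-in: a surname token match or a CJK character."""
    has_surname = any(tok.lower() in _SURNAMES for tok in text.split())
    return has_surname or bool(_CJK.search(text))


def add_chinese_column(rows: list, column: str) -> list:
    """Copy each row, keeping the value in 'Chinese' only when it is detected."""
    if rows and column not in rows[0]:
        raise KeyError(f"column not found: {column!r}")
    return [
        {**row, "Chinese": row[column] if detect(str(row[column])) else ""}
        for row in rows
    ]


names = ["wang john", "alice smith", "张三", "hello world", "李明"]
rows = [{"name": n} for n in names]
for row in add_chinese_column(rows, "name"):
    print(row)
```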

Python API call

from chinese_finder import ChineseFinderProcessor

# Create a processor (default strategies)
processor = ChineseFinderProcessor()

# Try several inputs
test_texts = [
    "wang john",        # family_name ✅, chinese_char ❌ → ANY: True
    "alice smith",      # family_name ❌, chinese_char ❌ → ANY: False
    "张三",              # family_name ❌, chinese_char ✅ → ANY: True
    "张伟 wang",        # family_name ✅, chinese_char ✅ → ANY: True
]

for text in test_texts:
    result = processor.process_text(text)
    status = "✓" if result else "✗"
    print(f"{status} {text}")

Expected output:

✓ wang john
✗ alice smith
✓ 张三
✓ 张伟 wang

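API reference

ChineseFinderProcessor

The core processor class. It coordinates the detection strategies and handles file processing.

Constructor parameters

ChineseFinderProcessor(
    strategies: Optional[List[str]] = None,  # strategy names; None means all strategies
    mode: str = 'any',                        # combination mode: 'any' or 'all'
    **strategy_kwargs                         # extra arguments passed to strategy constructors
)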

Main methods

`process_text(text: str) -> bool`

Processes a single text string.

Parameters:

text: the text string to analyze

Returns:

bool: the detection result under the configured mode

Example:

processor = ChineseFinderProcessor()
result = processor.process_text("wang john")

`process_texts(texts: List[str]) -> List[bool]`

Processes several text strings in one call.

Parameters:

texts: list of text strings

Returns:

List[bool]: list of detection results

Example:

results = processor.process_texts(["wang", "john", "zhang"])

`process_excel(file_path: str, column: str, output_format: str = 'excel', output_dir: str = None) -> str`

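Processes an Excel file.

Parameters:

file_path: path of the input Excel file
column: name of the column to analyze
output_format: output format ('excel', 'json', 'csv')
output_dir: output directory (optional)

Returns:

str: path of the output file

Example: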

output_path = processor.process_excel("data.xlsx", column="name", output_format="json")

`process_txt(file_path: str, output_format: str = 'json', output_dir: str = None) -> str`

Processes a text file.

Parameters:

file_path: path of the input text file
output_format: output format ('json', 'csv', 'excel', 'txt')
output_dir: output directory (optional)

Returns:

str: path of the output file

Example:

output_path = processor.process_txt("names.txt", output_format="json")

`process_file(file_path: str, column: str = None, output_format: str = None, output_dir: str = None) -> str`

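Detects the file type automatically and processes the file.

Parameters:

file_path: path of the input file
column: column name (required for Excel files)
output_format: output format (optional)
output_dir: output directory (optional)

Returns:

str: path of the output file

Example:

# The file type is detected automatically
output_path = processor.process_file("data.xlsx", column="name")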
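
One plausible way to detect the file type is to dispatch on the file extension. The sketch below is an assumption about how such dispatch could work, not the package's code: `choose_handler` is a hypothetical helper that only returns the name of the method that would be called, and the suffix sets reflect the input formats listed in this document (.txt and .xlsx).

```python
from pathlib import Path
from typing import Optional

EXCEL_SUFFIXES = {".xlsx"}
TEXT_SUFFIXES = {".txt"}


def choose_handler(file_path: str, column: Optional[str] = None) -> str:
    """Pick a processing method name from the file extension."""
    suffix = Path(file_path).suffix.lower()
    if suffix in EXCEL_SUFFIXES:
        if column is None:
            raise ValueError("column is required for Excel input")
        return "process_excel"
    if suffix in TEXT_SUFFIXES:
        return "process_txt"
    raise ValueError(f"unsupported file type: {file_path!r}")


print(choose_handler("data.xlsx", column="name"))  # process_excel
print(choose_handler("names.TXT"))                 # process_txt
```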

DetectionStrategy

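Abstract base class for strategies. Every detection strategy must inherit from it.

Required property

name: str: unique name of the strategy

Required method

detect(text: str) -> bool: checks a single text

Optional override

detect_batch(texts: List[str]) -> List[bool]: batch detection (the default implementation calls detect in a loop)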
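
The contract described above maps naturally onto Python's `abc` module. The following is a sketch of such a base class under that assumption, not the package's source: `StrategySketch` and `AsciiOnly` are hypothetical names chosen so that they do not clash with the real DetectionStrategy.

```python
from abc import ABC, abstractmethod
from typing import List


class StrategySketch(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        """Unique strategy name."""

    @abstractmethod
    def detect(self, text: str) -> bool:
        """Return True when the text matches."""

    def detect_batch(self, texts: List[str]) -> List[bool]:
        # Default: one detect() call per item; subclasses may override this.
        return [self.detect(text) for text in texts]


class AsciiOnly(StrategySketch):
    """Toy strategy: matches text made only of ASCII characters."""

    @property
    def name(self) -> str:
        return "ascii_only"

    def detect(self, text: str) -> bool:
        return text.isascii()


strategy = AsciiOnly()
print(strategy.name, strategy.detect_batch(["wang john", "张三"]))
```

With this layout, a subclass that forgets to implement name or detect cannot be instantiated, which surfaces the mistake early.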

register_strategy decorator

Registers a custom strategy automatically.

Usage:

@register_strategy('strategy_name')
class MyStrategy(DetectionStrategy):
    ...  # implement name and detect here

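StrategyRegistry

The strategy registry, which manages all available strategies.

Main methods

register(name: str, strategy_class: Type[DetectionStrategy]): registers a strategy
get_strategy(name: str, **kwargs) -> DetectionStrategy: returns a strategy instance
list_strategies() -> List[str]: lists all strategy names
has_strategy(name: str) -> bool: checks whether a strategy exists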
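
A registry with these four methods, together with a registering decorator, might look like the sketch below. It is a minimal illustration under stated assumptions: `RegistrySketch` and the module-level `registry` are hypothetical, a later registration silently replaces an earlier one with the same name, and unknown names raise KeyError. The real StrategyRegistry may behave differently on both points.

```python
from typing import Dict, List


class RegistrySketch:
    def __init__(self) -> None:
        self._classes: Dict[str, type] = {}

    def register(self, name: str, strategy_class: type) -> None:
        self._classes[name] = strategy_class  # assumption: last one wins

    def get_strategy(self, name: str, **kwargs):
        if name not in self._classes:
            raise KeyError(f"unknown strategy: {name!r}")
        return self._classes[name](**kwargs)  # acts as a simple factory

    def list_strategies(self) -> List[str]:
        return sorted(self._classes)

    def has_strategy(self, name: str) -> bool:
        return name in self._classes


registry = RegistrySketch()


def register_strategy(name: str):
    """Class decorator that records the class in the registry and returns it."""
    def decorator(cls: type) -> type:
        registry.register(name, cls)
        return cls
    return decorator


@register_strategy("always_true")
class AlwaysTrue:
    def detect(self, text: str) -> bool:
        return True


print(registry.list_strategies())  # ['always_true']
print(registry.get_strategy("always_true").detect("anything"))  # True
```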

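Strategy combination modes

any mode: the text counts as containing a Chinese name if any one strategy detects one
all mode: the text counts as containing a Chinese name only if every strategy detects one

Dependencies

Runtime dependencies

typer (>=0.26.3): CLI framework
pandas (>=3.0.3): data processing and Excel file handling
openpyxl (>=3.1.5): Excel read/write support

Development dependencies

pytest (>=7.4.0): test framework
pytest-cov (>=4.1.0): test coverage tool
black (>=23.0.0): code formatter
ruff (>=0.1.0): fast Python linter

Python versions

Supports Python 3.11, 3.12, 3.13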

Contributing and license

Development environment setup

Fork and clone the repository

git clone https://github.com/yourusername/chinese-finder.git
cd chinese-finder

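Install Poetry

curl -sSL https://install.python-poetry.org | python3 -

Install dependencies

poetry install

Activate the virtual environment

poetry shell

Run the tests

# Run all tests
pytest

# Run the tests and generate a coverage report
pytest --cov=chinese_finder --cov-report=html

# Open the coverage report
open htmlcov/index.html  # macOS

Code style

The project uses Black for formatting and Ruff for linting:

# Format the code
black chinese_finder tests examples

# Lint the code
ruff check chinese_finder tests examples

# Auto-fix simple issues
ruff check --fix chinese_finder tests examples

Adding a new strategy

Create a new file under chinese_finder/core/strategies/
Subclass DetectionStrategy
Implement the name property and the detect method
Register the class with the @register_strategy decorator
Import it in chinese_finder/core/strategies/__init__.py (so that it is registered automatically)
Write the corresponding unit tests

Example: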

# chinese_finder/core/strategies/my_strategy.py
from ..detector import DetectionStrategy, register_strategy

@register_strategy('my_strategy')
class MyStrategy(DetectionStrategy):
    @property
    def name(self) -> str:
        return 'my_strategy'

    def detect(self, text: str) -> bool:
        # Put the detection logic here; return a bool
        return False

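Pull request workflow

Create a feature branch: git checkout -b feature/amazing-feature
Commit your changes: git commit -m 'Add amazing feature'
Push the branch: git push origin feature/amazing-feature
Open a pull request
Make sure all tests pass
Wait for code review

Testing requirements

Every new feature must include unit tests
Keep test coverage above 80%
Tests should cover both normal and error cases

License

This project is released under the MIT License; see the LICENSE file for details.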

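Project structure

Directory layout

chinese-finder/
├── chinese_finder/              # Main package
│   ├── __init__.py             # Package init, exposes the public API
│   ├── cli.py                  # Typer CLI entry point (341 lines)
│   ├── core/                   # Core modules
│   │   ├── __init__.py         # Core module init
│   │   ├── detector.py         # Strategy base class and registry (168 lines)
│   │   ├── processor.py        # File processing engine (260 lines)
│   │   └── strategies/         # Strategy implementations
│   │       ├── __init__.py     # Strategy module init
│   │       ├── family_name.py  # Strategy 1: family-name list matching (126 lines)
│   │       ├── chinese_char.py # Strategy 2: Chinese character detection (63 lines)
│   │       └── mixed_format.py # Strategy 3: mixed-format name detection
│   ├── utils/                  # Utilities
│   │   ├── __init__.py         # Utility module init
│   │   └── io.py               # File read/write helpers (167 lines)
│   └── data/                   # Data files
│       └── family_names.txt    # Built-in surname list (350+ surnames)
├── tests/                      # Tests
│   ├── __init__.py             # Test module init
│   ├── test_strategies.py      # Strategy unit tests (185 lines)
│   ├── test_processor.py       # Processor integration tests (187 lines)
│   └── test_cli.py             # CLI tests (186 lines)
├── examples/                   # Usage examples
│   ├── basic_usage.py          # Basic library usage (195 lines)
│   └── custom_strategy.py      # Custom strategy example (253 lines)
├── pyproject.toml              # Poetry configuration (68 lines)
├── README.md                   # Project documentation
└── demo.py                     # Original script (kept for reference)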

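Core architecture

1. Strategy pattern

DetectionStrategy (abstract base class)
    ├── FamilyNameStrategy (family-name matching)
    ├── ChineseCharStrategy (Chinese character detection)
    ├── MixedFormatStrategy (mixed-format name detection)
    └── [more strategies can be added...]

StrategyRegistry (strategy registry)
    ├── register() - register a strategy
    ├── get_strategy() - get a strategy instance
    └── list_strategies() - list all strategies

@register_strategy decorator
    └── registers a strategy in the registry automatically

2. Processor

ChineseFinderProcessor
    ├── process_text() - single text
    ├── process_texts() - batch of texts
    ├── process_excel() - Excel file
    ├── process_txt() - text file
    └── process_file() - any file, type detected automatically

3. CLI commands

chinese
    ├── process <FILE> - process a single file
    ├── process-batch <FILES...> - process files in batch
    ├── list-strategies - list available strategies
    └── info - show project information

Key design points

Strategy pattern: each detection strategy is independent of the others
Decorator registration: @register_strategy registers strategies automatically, with no manual bookkeeping
Dependency injection: the processor obtains strategy instances by name
Combination modes: strategies can be combined in ANY or ALL mode
Factory pattern: StrategyRegistry acts as the strategy factory
Open/closed principle: open for extension (new strategies), closed for modification (existing code stays untouched)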

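Quick start

1. Install dependencies

Option 1: Poetry (recommended)

# Install Poetry (if it is not installed yet)
curl -sSL https://install.python-poetry.org | python3 -

# Install the project dependencies
poetry install

# Activate the virtual environment
poetry shell

Option 2: pip

# Install the dependencies (quote the extra so the shell does not expand the brackets)
pip install "typer[all]" pandas openpyxl

# For running the tests
pip install pytest pytest-cov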

2. Verify the installation

# Check the Python package
python -c "from chinese_finder import ChineseFinderProcessor; print('✓ installed')"

# Check the CLI (if installed with Poetry)
chinese --help

# Or run the module directly
python -m chinese_finder.cli --help

3. CLI quick tour

# Show help
chinese --help

# List available strategies
chinese list-strategies

# Process a text file
printf 'wang john\nalice smith\n张三\n' > test_names.txt
chinese process test_names.txt

# Inspect the result
cat test_names_cn.json

# Process an Excel file
chinese process data.xlsx --column name

# Verbose output
chinese process data.xlsx -c name -v

# Batch processing
chinese process-batch file1.txt file2.txt file3.txt

4. Python API quick tour

Basic example

from chinese_finder import ChineseFinderProcessor

# Create a processor
processor = ChineseFinderProcessor()

# Check some texts
texts = ["wang john", "alice smith", "张三", "hello"]
for text in texts:
    result = processor.process_text(text)
    status = "✓" if result else "✗"
    print(f"{status} {text}")

Processing files

from chinese_finder import ChineseFinderProcessor

processor = ChineseFinderProcessor()

# Process a text file
output = processor.process_txt("names.txt", output_format="json")
print(f"Output saved to: {output}")

# Process an Excel file
output = processor.process_excel("data.xlsx", column="name")
print(f"Output saved to: {output}")

Using specific strategies

from chinese_finder import ChineseFinderProcessor

# Family-name matching only
processor = ChineseFinderProcessor(strategies=['family_name'])

# Chinese character detection only
processor = ChineseFinderProcessor(strategies=['chinese_char'])

# Combined strategies, ALL mode
processor = ChineseFinderProcessor(
    strategies=['family_name', 'chinese_char'],
    mode='all'
)

5. Create a custom strategy

from chinese_finder import DetectionStrategy, register_strategy

@register_strategy('my_strategy')
class MyStrategy(DetectionStrategy):
    @property
    def name(self) -> str:
        return 'my_strategy'

    def detect(self, text: str) -> bool:
        # Your detection logic
        return 'keyword' in text.lower()

# Use the custom strategy
from chinese_finder import ChineseFinderProcessor

processor = ChineseFinderProcessor(strategies=['my_strategy'])
result = processor.process_text("keyword found")

6. Run the tests

# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_strategies.py -v

# Generate a coverage report
pytest tests/ --cov=chinese_finder --cov-report=html
open htmlcov/index.html  # macOS

7. Run the examples

# Basic usage example
python examples/basic_usage.py

# Custom strategy example
python examples/custom_strategy.py

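FAQ

Q: How do I add a new family name?

A: Edit chinese_finder/data/family_names.txt and add one surname per line.

Q: Which file formats are supported?

A: Input: TXT and Excel (.xlsx). Output: Excel, JSON, and CSV.

Q: How do I see the detection results?

A: Use the -v option for verbose output, or open the generated output file.

Q: What is the difference between ANY and ALL mode?

A: ANY: the text counts as containing a Chinese name if any one strategy detects one
ALL: the text counts only if every strategy detects one

Q: How do I publish to PyPI?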

poetry build
poetry publish

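Code statistics

Files: 17 Python files
Lines of code: ~2,500
Test coverage: strategies, processor, CLI
Documentation: README and code comments

Disclaimer

This project is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from the use of this software.

Usage limitations

Accuracy: the detection results are for reference only and are not guaranteed to be 100% accurate
Data security: make sure you are authorized to access and process the files you run this tool on
Privacy: comply with the applicable privacy regulations when processing files that contain personal information
Commercial use: assess the risks and compliance requirements yourself before using this tool commercially
Limitation of liability: the authors are not liable for any direct or indirect loss caused by using this tool

Recommendations

Validate the tool thoroughly on test data before using it in production
Check for updates regularly to get the latest features and security fixes
If you find a security vulnerability, please disclose it responsibly through a GitHub Issue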

Release history

0.1.4: May 29, 2026
0.1.3: May 29, 2026
0.1.2: May 29, 2026
0.1.1: May 28, 2026
0.1.0: May 28, 2026
