A powerful CLI and library tool for detecting Chinese names in text and Excel files

Chinese Finder

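Chinese Finder is a tool for detecting Chinese names in text and Excel files. It can be used either as a command-line tool or as a Python library, and its extensible strategy-pattern architecture makes it straightforward to add new detection strategies.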

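Features

✅ Dual-mode support: provides both a CLI tool and a Python library interface
🔌 Extensible architecture: plugin-style architecture based on the strategy pattern, so new strategies are easy to add
📊 Multiple formats: accepts TXT and Excel input; writes Excel, JSON, or CSV output
🎯 Multi-strategy detection:
- Surname-list matching strategy (based on 350+ common Chinese surnames)
- Chinese-character detection strategy (based on Unicode ranges)
- Custom strategy extensions
⚙️ Flexible configuration: strategies can be combined in ANY or ALL mode, and specific strategies can be selected
🚀 Batch processing: handles single files and batches of files
📝 Detailed documentation: full API documentation and usage examples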
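
To make the Unicode-range idea concrete, here is a minimal sketch of how a Chinese-character check could work. It is not the package's actual implementation: the function name `contains_chinese_char` is hypothetical, and limiting the range to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is an assumption.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is checked.
# The real chinese_char strategy may cover additional ranges.
_CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def contains_chinese_char(text: str) -> bool:
    """Return True if text has at least one character in U+4E00..U+9FFF."""
    return _CJK_PATTERN.search(text) is not None


print(contains_chinese_char("张三"))         # True
print(contains_chinese_char("alice smith"))  # False
```

A check like this only sees Chinese script, so a romanized name such as "wang john" would have to be caught by a different strategy, for example surname matching.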

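Installation and environment setup

System requirements

Python 3.8 or later
Supported operating systems: macOS, Linux, Windows

Install with pip

pip install chinese-finder

Install with Poetry (development mode)

# Clone the repository
git clone https://github.com/yourusername/chinese-finder.git
cd chinese-finder

# Install dependencies
poetry install

# Activate the virtual environment
poetry shell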

Verify the installation

# Check that the CLI is available
chinese-finder --help

# Check that the Python package is importable
python -c "import chinese_finder; print(chinese_finder.__version__)"

Usage examples and code snippets

CLI usage examples

1. Process a single file

# Process an Excel file (a column name is required)
chinese-finder process data.xlsx --column name

# Process a text file
chinese-finder process names.txt

# Choose the output format
chinese-finder process data.xlsx -c name -o json

# Verbose output
chinese-finder process data.xlsx -c name -v

2. Process files in batch

# Process several Excel files
chinese-finder process-batch data1.xlsx data2.xlsx data3.xlsx -c name

# Process a mix of file types
chinese-finder process-batch names.txt data.xlsx -c name -o csv

# Choose the output directory
chinese-finder process-batch *.xlsx -c name -d ./output

3. Use specific strategies

# Use only the surname-matching strategy
chinese-finder process data.xlsx -c name -s family_name

# Use multiple strategies
chinese-finder process data.xlsx -c name -s family_name -s chinese_char

# List all available strategies
chinese-finder list-strategies

4. Strategy combination modes

# ANY mode: a match from any one strategy counts as a Chinese name (default)
chinese-finder process data.xlsx -c name -m any

# ALL mode: every strategy must match for the text to count as a Chinese name
chinese-finder process data.xlsx -c name -m all

Python API usage examples

1. Basic text detection

from chinese_finder import ChineseFinderProcessor

# Create a processor (uses all strategies)
processor = ChineseFinderProcessor()

# Check a single text
result = processor.process_text("wang john")
print(result)  # True

result = processor.process_text("alice smith")
print(result)  # False

2. Batch text processing

from chinese_finder import ChineseFinderProcessor

processor = ChineseFinderProcessor()

texts = ["wang john", "alice smith", "张三", "hello world"]
results = processor.process_texts(texts)

for text, result in zip(texts, results):
    status = "✓" if result else "✗"
    print(f"{status} {text}")

3. Process files

from chinese_finder import ChineseFinderProcessor

processor = ChineseFinderProcessor()

# Process an Excel file
output_path = processor.process_excel(
    file_path="data.xlsx",
    column="name",
    output_format="excel"  # or "json", "csv"
)
print(f"Output saved to: {output_path}")

# Process a text file
output_path = processor.process_txt(
    file_path="names.txt",
    output_format="json"
)

4. Use specific strategies

from chinese_finder import ChineseFinderProcessor

# Use only the surname-matching strategy
processor = ChineseFinderProcessor(
    strategies=['family_name'],
    mode='any'
)

# Use multiple strategies in ALL mode
processor = ChineseFinderProcessor(
    strategies=['family_name', 'chinese_char'],
    mode='all'
)

5. Custom strategies

from chinese_finder import DetectionStrategy, register_strategy, ChineseFinderProcessor

# Define a custom strategy
@register_strategy('my_custom_strategy')
class MyCustomStrategy(DetectionStrategy):
    @property
    def name(self) -> str:
        return 'my_custom_strategy'
    
    def detect(self, text: str) -> bool:
        # Implement your detection logic here
        return "custom_pattern" in text.lower()

# Use the custom strategy
processor = ChineseFinderProcessor(
    strategies=['my_custom_strategy'],
    mode='any'
)

result = processor.process_text("custom_pattern detected")
print(result)  # True

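API reference

ChineseFinderProcessor

The core processor class. It coordinates multiple detection strategies and handles file processing.

Constructor parameters

ChineseFinderProcessor(
    strategies: Optional[List[str]] = None,  # list of strategy names; None means use all strategies
    mode: str = 'any',                        # strategy combination mode: 'any' or 'all'
    **strategy_kwargs                         # extra arguments passed to the strategy constructors
)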

Main methods

`process_text(text: str) -> bool`

Processes a single text string.

Parameters:

text: the text string to analyze

Returns:

bool: the detection result under the configured mode

Example:

processor = ChineseFinderProcessor()
result = processor.process_text("wang john")

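`process_texts(texts: List[str]) -> List[bool]`

Processes multiple text strings in one call.

Parameters:

texts: list of text strings

Returns:

List[bool]: list of detection results, one per input text

Example:

results = processor.process_texts(["wang", "john", "zhang"])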

`process_excel(file_path: str, column: str, output_format: str = 'excel', output_dir: str = None) -> str`

Processes an Excel file.

Parameters:

file_path: path to the input Excel file
column: name of the column to analyze
output_format: output format ('excel', 'json', 'csv')
output_dir: output directory (optional)

Returns:

str: path to the output file

Example:

output_path = processor.process_excel("data.xlsx", column="name", output_format="json")

`process_txt(file_path: str, output_format: str = 'json', output_dir: str = None) -> str`

Processes a text file.

Parameters:

file_path: path to the input text file
output_format: output format ('json', 'csv', 'excel', 'txt')
output_dir: output directory (optional)

Returns:

str: path to the output file

Example:

output_path = processor.process_txt("names.txt", output_format="json")
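
For readers who want a feel for what a text-file pass involves, the following is a rough, standard-library-only sketch of reading one name per line and writing the results as JSON. It is not the library's code: `process_txt_sketch`, the `detect` callback, and the record keys `text` and `is_chinese_name` are all hypothetical, and the `_cn.json` suffix is only inferred from the `test_names_cn.json` file mentioned in the quick start.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def process_txt_sketch(
    file_path: str,
    detect: Callable[[str], bool],
    output_dir: Optional[str] = None,
) -> str:
    """Read one name per line, run detect on each, and write a JSON report."""
    src = Path(file_path)
    # Skip blank lines; every other line is treated as one candidate name.
    names = [ln.strip() for ln in src.read_text(encoding="utf-8").splitlines() if ln.strip()]
    records = [{"text": n, "is_chinese_name": detect(n)} for n in names]

    # Default to writing next to the input file (an assumption).
    out_dir = Path(output_dir) if output_dir else src.parent
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{src.stem}_cn.json"
    out_path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return str(out_path)


def run_demo() -> List[dict]:
    """Process a throwaway file with a toy detector and return the parsed report."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "names.txt"
        src.write_text("wang john\nalice smith\n", encoding="utf-8")
        out = process_txt_sketch(str(src), detect=lambda s: s.startswith("wang"))
        return json.loads(Path(out).read_text(encoding="utf-8"))


print(run_demo())
```

Passing the detector in as a callback keeps the file handling separate from the detection logic, which mirrors the split between processor and strategies described in this README.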

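`process_file(file_path: str, column: str = None, output_format: str = None, output_dir: str = None) -> str`

Detects the file type automatically and processes the file accordingly.

Parameters:

file_path: path to the input file
column: column name (required for Excel files)
output_format: output format (optional)
output_dir: output directory (optional)

Returns:

str: path to the output file

Example:

# The file type is detected automatically
output_path = processor.process_file("data.xlsx", column="name")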
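
One plausible way to implement this kind of dispatch is to branch on the file extension. The sketch below assumes exactly that; the helper `choose_handler` is hypothetical, and the real `process_file` may detect file types differently or accept more extensions.

```python
from pathlib import Path
from typing import Optional


def choose_handler(file_path: str, column: Optional[str] = None) -> str:
    """Return the name of the method that would handle this file."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".xlsx":
        # Excel input needs to know which column holds the names.
        if column is None:
            raise ValueError("column is required for Excel files")
        return "process_excel"
    if suffix == ".txt":
        return "process_txt"
    raise ValueError(f"unsupported file type: {file_path!r}")


print(choose_handler("data.xlsx", column="name"))  # process_excel
print(choose_handler("names.txt"))                 # process_txt
```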

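DetectionStrategy

The abstract base class for strategies. Every detection strategy must inherit from it.

Required property

name: str: the unique name of the strategy

Required method

detect(text: str) -> bool: checks a single text

Optionally overridden method

detect_batch(texts: List[str]) -> List[bool]: checks a batch of texts (the default implementation calls detect in a loop)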
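
The contract above can be sketched with Python's `abc` module. This is an illustration under assumptions, not the library's source: `DetectionStrategySketch` and `KeywordStrategy` are hypothetical names, and only the three members listed above are modelled.

```python
from abc import ABC, abstractmethod
from typing import List


class DetectionStrategySketch(ABC):
    """Hypothetical stand-in for the DetectionStrategy base class."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Unique strategy name."""

    @abstractmethod
    def detect(self, text: str) -> bool:
        """Check a single text."""

    def detect_batch(self, texts: List[str]) -> List[bool]:
        # Default: call detect once per text. Subclasses may override
        # this when a vectorized or cached implementation is cheaper.
        return [self.detect(t) for t in texts]


class KeywordStrategy(DetectionStrategySketch):
    @property
    def name(self) -> str:
        return "keyword"

    def detect(self, text: str) -> bool:
        return "keyword" in text.lower()


print(KeywordStrategy().detect_batch(["Keyword here", "nothing"]))  # [True, False]
```

Because `name` and `detect` are abstract, instantiating the base class directly raises `TypeError`, so a strategy that forgets one of them fails early.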

register_strategy decorator

Registers a custom strategy automatically.

Usage:

@register_strategy('strategy_name')
class MyStrategy(DetectionStrategy):
    ...  # implementation goes here

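StrategyRegistry

The strategy registry, which manages all available strategies.

Main methods

register(name: str, strategy_class: Type[DetectionStrategy]): registers a strategy
get_strategy(name: str, **kwargs) -> DetectionStrategy: returns a strategy instance
list_strategies() -> List[str]: lists all strategy names
has_strategy(name: str) -> bool: checks whether a strategy exists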
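
A registry with these four methods, plus a decorator that feeds it, might look roughly like the sketch below. All names here (`SketchStrategy`, `SketchRegistry`, `register_strategy_sketch`) are hypothetical, and details such as class-level storage and raising `KeyError` for unknown names are assumptions rather than documented behaviour.

```python
from typing import Callable, Dict, List, Type


class SketchStrategy:
    """Hypothetical minimal strategy base used only for this example."""

    def detect(self, text: str) -> bool:
        raise NotImplementedError


class SketchRegistry:
    """Maps strategy names to classes and builds instances on request."""

    _strategies: Dict[str, Type[SketchStrategy]] = {}

    @classmethod
    def register(cls, name: str, strategy_class: Type[SketchStrategy]) -> None:
        cls._strategies[name] = strategy_class

    @classmethod
    def get_strategy(cls, name: str, **kwargs) -> SketchStrategy:
        if name not in cls._strategies:
            raise KeyError(f"unknown strategy: {name!r}")
        # Extra keyword arguments are forwarded to the strategy constructor.
        return cls._strategies[name](**kwargs)

    @classmethod
    def list_strategies(cls) -> List[str]:
        return sorted(cls._strategies)

    @classmethod
    def has_strategy(cls, name: str) -> bool:
        return name in cls._strategies


def register_strategy_sketch(name: str) -> Callable[[type], type]:
    """Class decorator: register the class under name, then return it unchanged."""
    def decorator(strategy_class: type) -> type:
        SketchRegistry.register(name, strategy_class)
        return strategy_class
    return decorator


@register_strategy_sketch("always_true")
class AlwaysTrue(SketchStrategy):
    def detect(self, text: str) -> bool:
        return True


print(SketchRegistry.list_strategies())  # ['always_true']
```

Registration happens when the decorated class is defined, which is why a strategy module has to be imported before its name can be looked up.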

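Strategy combination modes

any mode: the text counts as containing a Chinese name as soon as any one strategy detects one
all mode: the text counts as containing a Chinese name only if every strategy detects one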
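
The two modes map naturally onto Python's built-in `any` and `all`. The following sketch shows the idea with two toy detectors; `combine`, `has_known_surname`, and `has_cjk` are hypothetical helpers, not part of the package.

```python
from typing import Callable, List

Detector = Callable[[str], bool]


def combine(detectors: List[Detector], text: str, mode: str = "any") -> bool:
    """Combine per-strategy results using 'any' or 'all' semantics."""
    results = [detect(text) for detect in detectors]
    if mode == "any":
        return any(results)
    if mode == "all":
        return all(results)
    raise ValueError(f"mode must be 'any' or 'all', got {mode!r}")


def has_known_surname(text: str) -> bool:
    # Toy surname check on romanized tokens (illustrative list only).
    return any(tok in {"wang", "zhang", "li"} for tok in text.lower().split())


def has_cjk(text: str) -> bool:
    # Toy check for characters in the basic CJK block.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


detectors = [has_known_surname, has_cjk]
print(combine(detectors, "wang john", mode="any"))  # True
print(combine(detectors, "wang john", mode="all"))  # False
```

In this toy setup "wang john" passes the surname check but contains no Chinese characters, so it matches under any and fails under all.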

Dependencies

Runtime dependencies

typer (>=0.9.0): CLI framework
pandas (>=2.0.0): data processing and Excel file handling
openpyxl (>=3.1.0): Excel read/write support

Development dependencies

pytest (>=7.4.0): test framework
pytest-cov (>=4.1.0): test coverage tool
black (>=23.0.0): code formatter
ruff (>=0.1.0): fast Python linter

Python versions

Supports Python 3.8, 3.9, 3.10, 3.11, and 3.12

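Contributing and license

Setting up a development environment

Fork and clone the repository

git clone https://github.com/yourusername/chinese-finder.git
cd chinese-finder

Install Poetry

curl -sSL https://install.python-poetry.org | python3 -

Install dependencies

poetry install

Activate the virtual environment

poetry shell

Run the tests

# Run all tests
pytest

# Run the tests and generate a coverage report
pytest --cov=chinese_finder --cov-report=html

# Open the coverage report
open htmlcov/index.html  # macOS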

Code style

This project uses Black for formatting and Ruff for linting:

# Format the code
black chinese_finder tests examples

# Check code quality
ruff check chinese_finder tests examples

# Automatically fix simple issues
ruff check --fix chinese_finder tests examples

Adding a new strategy

Create a new file under chinese_finder/core/strategies/
Subclass DetectionStrategy
Implement the name property and the detect method
Register the class with the @register_strategy decorator
Import it in chinese_finder/core/strategies/__init__.py (so that it is registered automatically)
Write the corresponding unit tests

Example:

# chinese_finder/core/strategies/my_strategy.py
from ..detector import DetectionStrategy, register_strategy

@register_strategy('my_strategy')
class MyStrategy(DetectionStrategy):
    @property
    def name(self) -> str:
        return 'my_strategy'
    
    def detect(self, text: str) -> bool:
        # Placeholder: replace with the real detection logic
        raise NotImplementedError

Pull request process

Create a feature branch: git checkout -b feature/amazing-feature
Commit your changes: git commit -m 'Add amazing feature'
Push the branch: git push origin feature/amazing-feature
Open a pull request
Make sure all tests pass
Wait for code review

Testing requirements

Every new feature must include unit tests
Keep test coverage above 80%
Tests should cover both normal and error cases
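
As an illustration of covering both normal and error cases, here is a small self-contained sketch written as pytest-style test functions. The function under test, `detect_keyword`, is a hypothetical stand-in for a strategy's detect method; raising `TypeError` on non-string input is an assumption chosen for the example, not documented library behaviour.

```python
def detect_keyword(text: str) -> bool:
    """Toy detector standing in for a strategy's detect method."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    return "keyword" in text.lower()


def test_detects_keyword() -> None:
    # Normal case: the keyword is present, regardless of letter case.
    assert detect_keyword("Keyword found")


def test_ignores_plain_text() -> None:
    # Normal case: no keyword, so no match.
    assert not detect_keyword("hello world")


def test_rejects_non_string() -> None:
    # Error case: invalid input should raise rather than return a value.
    try:
        detect_keyword(None)  # type: ignore[arg-type]
    except TypeError:
        return
    raise AssertionError("expected TypeError")


# pytest would collect these by name; calling them directly also works.
for test in (test_detects_keyword, test_ignores_plain_text, test_rejects_non_string):
    test()
```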

License

This project is licensed under the MIT License; see the LICENSE file for details.

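Project structure

Directory layout

chinese-finder/
├── chinese_finder/              # main package
│   ├── __init__.py             # package init; exposes the public API
│   ├── cli.py                  # Typer CLI entry point (341 lines)
│   ├── core/                   # core modules
│   │   ├── __init__.py         # core module init
│   │   ├── detector.py         # strategy base class and registry (168 lines)
│   │   ├── processor.py        # file processing engine (260 lines)
│   │   └── strategies/         # strategy implementations
│   │       ├── __init__.py     # strategies module init
│   │       ├── family_name.py  # strategy 1: surname-list matching (126 lines)
│   │       └── chinese_char.py # strategy 2: Chinese-character detection (63 lines)
│   ├── utils/                  # utility functions
│   │   ├── __init__.py         # utils module init
│   │   └── io.py               # file read/write helpers (167 lines)
│   └── data/                   # data files
│       └── family_names.txt    # built-in surname list (350 surnames)
├── tests/                      # tests
│   ├── __init__.py             # test module init
│   ├── test_strategies.py      # strategy unit tests (185 lines)
│   ├── test_processor.py       # processor integration tests (187 lines)
│   └── test_cli.py             # CLI tests (186 lines)
├── examples/                   # usage examples
│   ├── basic_usage.py          # basic library usage example (195 lines)
│   └── custom_strategy.py      # custom strategy example (253 lines)
├── pyproject.toml              # Poetry configuration (68 lines)
├── README.md                   # project documentation
└── demo.py                     # original file (kept for reference)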

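Core architecture

1. Strategy-pattern architecture

DetectionStrategy (abstract base class)
    ├── FamilyNameStrategy (surname-matching strategy)
    ├── ChineseCharStrategy (Chinese-character detection strategy)
    └── [more strategies can be added easily...]

StrategyRegistry (strategy registry)
    ├── register() - register a strategy
    ├── get_strategy() - get a strategy instance
    └── list_strategies() - list all strategies

@register_strategy decorator
    └── registers a strategy in the registry automatically

2. Processor architecture

ChineseFinderProcessor
    ├── process_text() - single-text processing
    ├── process_texts() - batch text processing
    ├── process_excel() - Excel file processing
    ├── process_txt() - text file processing
    └── process_file() - processing with automatic file-type detection

3. CLI commands

chinese-finder
    ├── process <FILE> - process a single file
    ├── process-batch <FILES...> - process files in batch
    ├── list-strategies - list available strategies
    └── info - show project information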

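Key design features

Strategy pattern: each detection strategy is independent and does not affect the others
Decorator registration: @register_strategy registers strategies automatically, with no manual bookkeeping
Dependency injection: the processor obtains strategy instances by strategy name
Combination modes: strategies can be combined in ANY or ALL mode
Factory pattern: StrategyRegistry acts as the strategy factory
Open/closed principle: open for extension (adding new strategies), closed for modification (existing code stays unchanged)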

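Quick start

1. Install dependencies

Option 1: Poetry (recommended)

# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -

# Install the project dependencies
poetry install

# Activate the virtual environment
poetry shell

Option 2: pip

# Install dependencies (the quotes keep shells such as zsh from expanding the brackets)
pip install "typer[all]" pandas openpyxl

# If you want to run the tests
pip install pytest pytest-cov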

2. Verify the installation

# Check the Python package
python -c "from chinese_finder import ChineseFinderProcessor; print('✓ Installation successful')"

# Check the CLI (if installed with Poetry)
chinese-finder --help

# Or run the module directly
python -m chinese_finder.cli --help

3. CLI quick tour

# Show help
chinese-finder --help

# List available strategies
chinese-finder list-strategies

# Process a text file (printf is used because echo -e is not portable across shells)
printf 'wang john\nalice smith\n张三\n' > test_names.txt
chinese-finder process test_names.txt

# Inspect the result
cat test_names_cn.json

# Process an Excel file
chinese-finder process data.xlsx --column name

# Verbose output
chinese-finder process data.xlsx -c name -v

# Batch processing
chinese-finder process-batch file1.txt file2.txt file3.txt

4. Python API quick tour

Basic example

from chinese_finder import ChineseFinderProcessor

# Create a processor
processor = ChineseFinderProcessor()

# Check some texts
texts = ["wang john", "alice smith", "张三", "hello"]
for text in texts:
    result = processor.process_text(text)
    status = "✓" if result else "✗"
    print(f"{status} {text}")

Process files

from chinese_finder import ChineseFinderProcessor

processor = ChineseFinderProcessor()

# Process a text file
output = processor.process_txt("names.txt", output_format="json")
print(f"Results saved to: {output}")

# Process an Excel file
output = processor.process_excel("data.xlsx", column="name")
print(f"Results saved to: {output}")

Use specific strategies

from chinese_finder import ChineseFinderProcessor

# Surname matching only
processor = ChineseFinderProcessor(strategies=['family_name'])

# Chinese-character detection only
processor = ChineseFinderProcessor(strategies=['chinese_char'])

# Combined strategies in ALL mode
processor = ChineseFinderProcessor(
    strategies=['family_name', 'chinese_char'],
    mode='all'
)

5. Create a custom strategy

from chinese_finder import DetectionStrategy, register_strategy

@register_strategy('my_strategy')
class MyStrategy(DetectionStrategy):
    @property
    def name(self) -> str:
        return 'my_strategy'
    
    def detect(self, text: str) -> bool:
        # Your detection logic
        return 'keyword' in text.lower()

# Use the custom strategy
from chinese_finder import ChineseFinderProcessor

processor = ChineseFinderProcessor(strategies=['my_strategy'])
result = processor.process_text("keyword found")

6. Run the tests

# Run all tests
pytest tests/ -v

# Run a specific test file
pytest tests/test_strategies.py -v

# Generate a coverage report
pytest tests/ --cov=chinese_finder --cov-report=html
open htmlcov/index.html  # macOS

7. Run the examples

# Basic usage example
python examples/basic_usage.py

# Custom strategy example
python examples/custom_strategy.py

FAQ

Q: How do I add a new surname?

A: Edit chinese_finder/data/family_names.txt and add one surname per line.
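
To show how a one-surname-per-line file could drive matching, here is a hedged sketch. The helpers `load_surnames` and `has_surname` are hypothetical, and two behaviours are assumptions: that entries are compared case-insensitively, and that a match on any whitespace-separated token counts. The actual family_name strategy may tokenize and match differently.

```python
from typing import Iterable, Set


def load_surnames(lines: Iterable[str]) -> Set[str]:
    """Build a lookup set from lines holding one surname each; blank lines are skipped."""
    return {line.strip().lower() for line in lines if line.strip()}


def has_surname(text: str, surnames: Set[str]) -> bool:
    """Return True if any whitespace-separated token of text is a known surname."""
    return any(token in surnames for token in text.lower().split())


# In practice the lines would come from the surname file, for example:
#   with open("chinese_finder/data/family_names.txt", encoding="utf-8") as f:
#       surnames = load_surnames(f)
surnames = load_surnames(["wang\n", "zhang\n", "\n", "li\n"])

print(has_surname("wang john", surnames))    # True
print(has_surname("alice smith", surnames))  # False
```

With this approach, adding a surname to the file needs no code change, because the set is rebuilt each time the file is loaded.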

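Q: What file formats are supported?

A: Input: TXT and Excel (.xlsx). Output: Excel, JSON, and CSV.

Q: How do I view the detection results?

A: Use the -v flag for verbose output, or open the generated output file.

Q: What is the difference between ANY and ALL mode?

A: The modes differ in how many strategies must match:

ANY: the text counts as containing a Chinese name if any one strategy detects one
ALL: the text counts only if every strategy detects one

Q: How do I publish to PyPI?

A: Build and publish with Poetry:

poetry build
poetry publish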

Code statistics

Total files: 17 Python files
Total lines of code: ~2,500
Test coverage: strategies, processor, CLI
Documentation: full README and code comments

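Disclaimer

This project is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability arising from the use of this software, whether in an action of contract, tort, or otherwise.

Usage limitations

Accuracy: the tool's Chinese-name detection results are for reference only and are not guaranteed to be 100% accurate
Data security: make sure you are authorized to access and process any files you run through this tool
Privacy: comply with applicable privacy regulations when processing files that contain personal information
Commercial use: if you use this tool commercially, assess the risks and compliance requirements yourself
Limitation of liability: the authors are not liable for any direct or indirect loss resulting from use of this tool

Recommendations

Validate thoroughly on test data before using the tool in production
Check for updates regularly to get the latest features and security fixes
If you find a security vulnerability, please disclose it responsibly through a GitHub Issue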
