Corpus analysis and processing toolkit
CorpusToolkit
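CorpusToolkit is a toolkit for preprocessing, deduplicating, and assessing the quality of Chinese corpora. It is intended for NLP data cleaning and training-corpus preparation.
📦 Installation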
pip install CorpusKit
Or install from source:
git clone https://github.com/Morton-Li/CorpusToolkit.git
cd CorpusToolkit
pip install .
⚠️ Note
To use the machine learning features, make sure the optional dependency group ml is installed:
- For a PyPI install:
  pip install CorpusKit[ml]
- For a source install:
  pip install .[ml]
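🧰 Module Overview
| Module | Description |
|---|---|
| CorpusToolkit.scorer | Computes quality scores for Chinese text, such as perplexity |
| CorpusToolkit.Cleaner | Text cleaning: punctuation normalization, whitespace cleanup, HTML entity decoding, emoji filtering, and more |
| CorpusToolkit.DuplicateDetector | Sentence-level duplicate detection based on MinHash + LSH |
| CorpusToolkit.split_sentence | Splits long sentences in Chinese text |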
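To make the cleaning and splitting steps listed above concrete, here is a minimal standard-library sketch. It is not CorpusToolkit's implementation or API: the function names `clean_text` and `split_sentences`, the emoji ranges, and the punctuation map are all assumptions chosen for illustration, and the punctuation mapping is deliberately naive.

```python
import html
import re

# All names below are hypothetical and only illustrate the kind of steps
# described in the module table; they are not CorpusToolkit's API.

# Rough emoji coverage: pictographs, misc symbols, dingbats, variation selector.
_EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
_WHITESPACE_RE = re.compile(r"\s+")
# Naive half-width to full-width punctuation map for Chinese text.
_PUNCT_MAP = str.maketrans({",": ",", "?": "?", "!": "!", ":": ":", ";": ";"})
# A sentence is a run of non-terminal characters plus any trailing terminators.
_SENTENCE_RE = re.compile(r"[^。!?]+[。!?]*")


def clean_text(text: str) -> str:
    """Decode HTML entities, drop emoji, collapse whitespace, unify punctuation."""
    text = html.unescape(text)
    text = _EMOJI_RE.sub("", text)
    text = _WHITESPACE_RE.sub(" ", text).strip()
    return text.translate(_PUNCT_MAP)


def split_sentences(text: str) -> list[str]:
    """Split on Chinese sentence-ending punctuation, keeping the punctuation."""
    return [s.strip() for s in _SENTENCE_RE.findall(text) if s.strip()]


if __name__ == "__main__":
    raw = "今天天气不错。  我喜欢人工智能!你呢?"
    print(split_sentences(clean_text(raw)))
```

A real cleaner would need context-aware punctuation handling (for example, leaving the comma in "3,000" alone), which this sketch does not attempt.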
🪄 Quick Usage Examples
1. Compute Perplexity
from CorpusToolkit.scorer import compute_perplexity
sample_texts = [
    "他走进了咖啡店,点了一杯拿铁。",
    "中国是一个拥有悠久历史的国家。",
    "树立科学思想,掌握科学方法,了解科技知识。",
    "人工智能正在改变我们的生活方式。",
    "啊发疯开i句i阶段哦小脾气。",  # example of meaningless text
]
ppl_scores = compute_perplexity(sample_texts)
print(ppl_scores) # [9.5992, 14.1634, 26.9556, 10.4854, 3445.8342]
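In the output above, the fluent sentences score low and the meaningless string scores in the thousands, which is the usual behavior of perplexity: the exponential of the mean negative log-probability a language model assigns to each token. The sketch below shows that formula and one way the scores could drive filtering. How `compute_perplexity` obtains token probabilities is not stated in this README, so the helper names, the log-probability input, and the threshold of 100 are all assumptions.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_by_perplexity(texts: list[str], scores: list[float], max_ppl: float) -> list[str]:
    """Keep texts whose score does not exceed max_ppl (hypothetical helper)."""
    return [text for text, score in zip(texts, scores) if score <= max_ppl]


if __name__ == "__main__":
    # A model that gives every token probability 1/4 has perplexity of about 4.
    print(perplexity([math.log(0.25)] * 4))

    # Scores copied from the example output above; the cutoff is an
    # illustrative assumption, not a recommended value.
    texts = ["text 0", "text 1", "text 2", "text 3", "text 4"]
    scores = [9.5992, 14.1634, 26.9556, 10.4854, 3445.8342]
    print(filter_by_perplexity(texts, scores, max_ppl=100.0))
```

A suitable cutoff depends on the scoring model and the corpus, so it is worth inspecting the score distribution before fixing one.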
2. Detect and Remove Duplicates
from CorpusToolkit import DuplicateDetector
sample_texts = [
    "今天天气不错",
    "我喜欢人工智能。",
    "我非常喜欢人工智能。",
    "我喜欢人工智能。",
]
detector = DuplicateDetector()
detector.add_batch(sample_texts)
for text in sample_texts:
    similar_ids = detector.query(text)
    print(f"Text: '{text}' has similar IDs: {similar_ids}")
# Text: '今天天气不错' has similar IDs: [0]
# Text: '我喜欢人工智能。' has similar IDs: [3, 1, 2]
# Text: '我非常喜欢人工智能。' has similar IDs: [3, 1, 2]
# Text: '我喜欢人工智能。' has similar IDs: [3, 1, 2]
duplicates = detector.find_all_duplicates()
print("All duplicate groups:", duplicates) # All duplicate groups: {1: [3, 2]}
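For intuition about why texts 1, 2, and 3 are grouped together, the following standard-library sketch estimates Jaccard similarity with MinHash signatures. It is not DuplicateDetector's implementation: character bigram shingles, MD5-based seeded hashing, 128 hash functions, and every function name are assumptions for illustration, and the LSH step (bucketing bands of the signature so that only likely matches are compared) is omitted.

```python
import hashlib


def shingles(text: str, n: int = 2) -> set[str]:
    """Set of character n-grams; texts shorter than n yield the whole text."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _seeded_hash(seed: int, shingle: str) -> int:
    """Deterministic 64-bit hash; each seed acts as a different hash function."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str, num_perm: int = 128, n: int = 2) -> list[int]:
    """For each seeded hash function, record the minimum hash over all shingles."""
    grams = shingles(text, n)
    return [min(_seeded_hash(seed, g) for g in grams) for seed in range(num_perm)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a: str, b: str, n: int = 2) -> float:
    """Exact Jaccard similarity of the two shingle sets, for comparison."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    a, b = "我喜欢人工智能。", "我非常喜欢人工智能。"
    print("exact:", exact_jaccard(a, b))  # 6 shared bigrams out of 10 -> 0.6
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate is noisy around the exact value, and more hash functions tighten it at the cost of longer signatures.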
For more examples, see the examples directory.
📄 License
This project is licensed under the Apache License 2.0.
File details
Details for the file corpuskit-0.1.1.tar.gz.
File metadata
- Download URL: corpuskit-0.1.1.tar.gz
- Upload date:
- Size: 18.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d03dddcaad84e21f63ceb854efb66465525f1568cc8fcb4a942fb614faf63404 |
| MD5 | 6c73661ef6dc76895c4060706f2695da |
| BLAKE2b-256 | 57e7976fb5a0bb4b2fcfd9a2bc2a57c23e55d31418c5f0c77c8a70435f53a604 |
File details
Details for the file corpuskit-0.1.1-py3-none-any.whl.
File metadata
- Download URL: corpuskit-0.1.1-py3-none-any.whl
- Upload date:
- Size: 14.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4d5c8cb2d610b95cfe5a64b0b71fd4cfad7978be7b3454f9c62af1ad8f551c6b |
| MD5 | 41321bce3ccd93d21f60a12a81db28b0 |
| BLAKE2b-256 | 1e58d5bf103d6080f7f6b3ef06e89b071f26e4364ebcdf175ef1ebdaf81db1b8 |