
Corpus analysis and processing toolkit


CorpusToolkit

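CorpusToolkit is a toolkit for preprocessing, deduplicating, and assessing the quality of Chinese corpora. It is intended for NLP data cleaning and training-corpus preparation.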


📦 Installation

pip install CorpusKit

Or install from source:

git clone https://github.com/Morton-Li/CorpusToolkit.git
cd CorpusToolkit
pip install .
⚠️ Note

To use the machine learning features, make sure the optional dependency group ml is installed:

  • For PyPI install:

    pip install "CorpusKit[ml]"
    
  • For source install:

    pip install ".[ml]"
    

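🧰 Module Overview

Module | Function
CorpusToolkit.scorer | Computes quality scores for Chinese corpora, such as perplexity
CorpusToolkit.Cleaner | Corpus cleaning: punctuation normalization, whitespace cleanup, HTML entity decoding, emoji filtering, and more
CorpusToolkit.DuplicateDetector | Sentence-level duplicate detection based on MinHash + LSH
CorpusToolkit.split_sentence | Splits long sentences in Chinese text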
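
To give a feel for the MinHash idea behind DuplicateDetector, here is a minimal standard-library sketch. It is not CorpusToolkit's implementation: the function names (char_shingles, minhash_signature, estimate_jaccard), the use of character bigrams, and the salted BLAKE2b hash family are all assumptions made for illustration. The fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib


def char_shingles(text, n=2):
    """Return the set of overlapping character n-grams in `text`."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(shingles, num_perm=128):
    """Build a MinHash signature: one minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig_a = minhash_signature(char_shingles("我喜欢人工智能。"))
sig_b = minhash_signature(char_shingles("我非常喜欢人工智能。"))
sig_c = minhash_signature(char_shingles("今天天气不错"))

print(estimate_jaccard(sig_a, sig_b))  # near-duplicates: a high estimate
print(estimate_jaccard(sig_a, sig_c))  # unrelated sentences: an estimate near zero
```

LSH then groups signatures into bands so that likely duplicates can be found without comparing every pair of sentences.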

🪄 Quick Usage Examples

1. Compute Perplexity

from CorpusToolkit.scorer import compute_perplexity

sample_texts = [
    "他走进了咖啡店,点了一杯拿铁。",
    "中国是一个拥有悠久历史的国家。",
    "树立科学思想,掌握科学方法,了解科技知识。",
    "人工智能正在改变我们的生活方式。",
    "啊发疯开i句i阶段哦小脾气。",  # example of meaningless text
]
ppl_scores = compute_perplexity(sample_texts)
print(ppl_scores)  # [9.5992, 14.1634, 26.9556, 10.4854, 3445.8342]
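
For readers unfamiliar with the metric, the sketch below shows the standard definition of perplexity: the exponential of the average negative log-likelihood per token. This README does not say which language model or tokenization compute_perplexity uses, so the helper perplexity_from_log_probs and its input format are hypothetical and only illustrate the formula.

```python
import math


def perplexity_from_log_probs(token_log_probs):
    """Perplexity = exp(-mean(log p(token))) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical input: a model that assigns probability 1/4 to each of three tokens.
uniform_log_probs = [math.log(0.25)] * 3
print(perplexity_from_log_probs(uniform_log_probs))  # approximately 4.0
```

Lower values mean the model finds the text more predictable, which is why the meaningless sample above scores far higher than the fluent sentences.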

2. Detect and Remove Duplicates

from CorpusToolkit import DuplicateDetector

sample_texts = [
    "今天天气不错",
    "我喜欢人工智能。",
    "我非常喜欢人工智能。",
    "我喜欢人工智能。",
]

detector = DuplicateDetector()
detector.add_batch(sample_texts)

for text in sample_texts:
    similar_ids = detector.query(text)
    print(f"Text: '{text}' has similar IDs: {similar_ids}")

# Text: '今天天气不错' has similar IDs: [0]
# Text: '我喜欢人工智能。' has similar IDs: [3, 1, 2]
# Text: '我非常喜欢人工智能。' has similar IDs: [3, 1, 2]
# Text: '我喜欢人工智能。' has similar IDs: [3, 1, 2]

duplicates = detector.find_all_duplicates()
print("All duplicate groups:", duplicates)  # All duplicate groups: {1: [3, 2]}
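
The example above detects duplicates but stops short of removing them. One possible way to filter the corpus is sketched below, assuming that find_all_duplicates returns a mapping from a kept index to the indices of its duplicates, and that indices follow insertion order. Both points are inferred from the printed output, not from documented behavior, and drop_duplicates is a hypothetical helper, not part of CorpusToolkit.

```python
def drop_duplicates(texts, duplicate_groups):
    """Keep each text unless its index is listed as a duplicate of an earlier one."""
    to_drop = {
        dup_id
        for dup_ids in duplicate_groups.values()
        for dup_id in dup_ids
    }
    return [text for idx, text in enumerate(texts) if idx not in to_drop]


sample_texts = [
    "今天天气不错",
    "我喜欢人工智能。",
    "我非常喜欢人工智能。",
    "我喜欢人工智能。",
]
duplicate_groups = {1: [3, 2]}  # copied from the output shown above

unique_texts = drop_duplicates(sample_texts, duplicate_groups)
print(unique_texts)  # ['今天天气不错', '我喜欢人工智能。']
```

Note that this also drops the near-duplicate at index 2, since the detector grouped it with index 1.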

For more examples, see the examples directory.


📄 License

This project is licensed under the Apache License 2.0.
