中文文本可读性指标 Chinese Readability Score

Evaluate the readability of Chinese text using word segmentation, part-of-speech tagging, and syntactic dependency parsing. Supports multiple NLP providers, including LTP, Jieba, and PKU.

The code implements several published papers; each scoring metric is named after the first author of the corresponding paper.

Installation

Install with pip:

$ pip install readability_cn

Optional NLP provider extras:

# Install with Jieba support
$ pip install readability_cn[jieba]

# Install with PKU support
$ pip install readability_cn[pkuseg]

# Install with all optional providers
$ pip install readability_cn[all]

Usage

    from readability_cn import ChineseReadability
    from readability_cn.nlp import JiebaNLP, PkuNLP, LtpNLP

    # Use LTP as the default NLP provider
    readability = ChineseReadability()
    # Or use another NLP provider:
    # readability = ChineseReadability(nlp_provider=JiebaNLP())  # use Jieba
    # readability = ChineseReadability(nlp_provider=PkuNLP())    # use PKU
    # readability = ChineseReadability(nlp_provider=LtpNLP())    # explicitly use LTP

    # Add new custom words
    readability.add_custom_words(['日志易', '优特捷'])

    # Compare readability metrics before and after file changes
    readability.analyze('old.adoc', 'new.adoc')

    # Use your own preprocessing functions
    import markdown
    import re
    file_name = 'doc.md'  # path to your Markdown file
    with open(file_name, 'r', encoding='utf-8') as file:
        markdown_content = file.read()
    text = markdown.markdown(markdown_content)
    text = re.sub(r'\n+', '\n', text)
    ... # do other removals and replacements here
    sentences = [sentence.strip() for sentence in readability.stnsplit.split(text) if sentence.strip()]
    readability.chengyong_gf0025_readability(sentences)

Use Custom Vocab

You can use the sentencepiece tool to extract a vocabulary from domain-specific documents; see the custom_vocab.py implementation in the examples directory. Then merge it into the Grade-A (甲级) word list for use:

    # Load the top 16% of a custom vocabulary as domain-specific common words.
    # Defaults to the vocabulary built from Fudan University's
    # computer-science corpus.
    readability._load_custom_vocab()
    readability._load_custom_vocab("rizhiyi.vocab")
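As a rough illustration of the "top 16%" idea above, the following sketch reads a sentencepiece `.vocab` file (assumed to be tab-separated `piece<TAB>score` lines, sorted by score) and keeps the leading fraction of pieces as a domain word list. The function name `top_vocab_words` and the file-format handling are assumptions for illustration, not the library's actual `_load_custom_vocab()` implementation:

```python
def top_vocab_words(vocab_path, ratio=0.16):
    """Return the top `ratio` fraction of words from a sentencepiece .vocab file."""
    with open(vocab_path, encoding='utf-8') as f:
        # Each line is "piece<TAB>score"; keep only the piece.
        pieces = [line.split('\t')[0] for line in f if line.strip()]
    # Drop meta symbols like <unk>/<s> and strip the word-boundary marker U+2581.
    words = [p.lstrip('\u2581') for p in pieces if not p.startswith('<')]
    words = [w for w in words if w]
    return words[:int(len(words) * ratio)]
```

A list produced this way could then be merged into the base word list, which is what the `rizhiyi.vocab` example above does.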

Note

  1. Chinese research in this field is concentrated in teaching Chinese as a foreign language, and the research data consists mainly of a small number of textbook passages and Chinese proficiency test outlines. Coefficients obtained from polynomial linear regression fitting on such data may not be effective for native speakers or technical documents.

  2. Some formulas are sensitive to the number of clauses. This implementation simply splits clauses on Chinese commas, semicolons, and colons, and does not handle mixed use of Chinese and English punctuation.
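The clause splitting described above amounts to splitting on three full-width delimiters; a minimal sketch (the function name is illustrative, not the library's API):

```python
import re

# Split on the Chinese full-width comma, semicolon, and colon only;
# English punctuation (, ; :) is deliberately not handled, per note 2.
CLAUSE_DELIMS = re.compile(r'[，；：]')

def split_clauses(sentence):
    """Split a sentence into clauses, dropping empty fragments."""
    return [c for c in CLAUSE_DELIMS.split(sentence) if c]
```

For example, `split_clauses('你好，世界；再见')` yields three clauses, while the same text with ASCII punctuation would count as one.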

  3. This implementation currently only provides preprocessing for AsciiDoc text. For other formats, refer to the preprocess_asciidoc() method as a template for stripping the various markup.
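For another format, a preprocessor in the spirit of preprocess_asciidoc() just strips markup before sentence splitting. This is a minimal sketch for AsciiDoc itself, not the library's actual implementation; the regexes cover only a few common constructs:

```python
import re

def strip_asciidoc(text):
    """Crude AsciiDoc markup stripper: keep prose, drop structural markup."""
    # Remove delimited listing blocks (---- ... ----) wholesale.
    text = re.sub(r'^----$.*?^----$\n?', '', text, flags=re.MULTILINE | re.DOTALL)
    # Remove section-title markers ("= Title", "== Title", ...).
    text = re.sub(r'^=+ ', '', text, flags=re.MULTILINE)
    # Remove attribute-list / block-anchor lines like [source,python].
    text = re.sub(r'^\[[^\]]*\]$\n?', '', text, flags=re.MULTILINE)
    # Unwrap inline bold and monospace markup.
    text = re.sub(r'\*([^*\n]+)\*', r'\1', text)
    text = re.sub(r'`([^`\n]+)`', r'\1', text)
    # Collapse blank lines.
    return re.sub(r'\n{2,}', '\n', text).strip()
```

The output can then be fed to the sentence splitter, as in the preprocessing example in the Usage section.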

Thanks

  1. LTP
  2. Lexi
  3. Cursor IDE and Claude AI
