Word-frequency analysis and trend word cloud tool for Chinese news

Reason this release was yanked:

Contains a major bug

Project description

WordFreq-CN

A word-frequency analysis and trend-word visualization tool for Chinese news

Features

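TF-IDF extraction of top terms from Chinese news headlines and body text
TextRank-based keyword extraction
Word frequency counting
Trend word clouds generated per time window
Custom stop-word lists for filtering Chinese function words
Runs directly from the command line through the wordfreq-cn tool
Also available as Python API functions in the wordfreq_cn package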
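
As a rough illustration of the TF-IDF ranking named in the feature list, the sketch below scores pre-tokenized documents using only the standard library. It is not the package's implementation: the function name tfidf_top_terms, the smoothed IDF formula, and the choice to sum scores across documents are all assumptions made for this example. In practice the token lists would come from a segmenter such as jieba.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, top_k=5):
    """Rank terms by TF-IDF summed over pre-tokenized documents (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            # Smoothed IDF keeps a positive weight for terms found in every document.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            scores[term] += count * idf
    return scores.most_common(top_k)


docs = [
    ["人工智能", "技术", "医疗", "应用", "突破"],
    ["气候", "变化", "技术", "报告"],
    ["医疗", "技术", "医疗", "改革"],
]
top = tfidf_top_terms(docs, top_k=3)
for term, score in top:
    print(f"{term} {score:.4f}")
```

A term that appears in every document ("技术" here) gets the lowest IDF, so a term concentrated in fewer documents can outrank it even with a similar raw count.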

Installation

# Install the Python dependencies
pip install jieba scikit-learn wordcloud matplotlib numpy

# Install from a local source checkout (if you have the source)
pip install .

# Or install directly from PyPI
pip install wordfreq-cn

Quick start (command line)

wordfreq-cn tfidf --news "人工智能技术在医疗领域的应用取得突破" "全球气候变化加剧" --topk 5
wordfreq-cn textrank --news "人工智能技术在医疗领域的应用取得突破" --topk 5
wordfreq-cn freq --news "人工智能技术在医疗领域的应用取得突破" --topk 10
wordfreq-cn wordcloud --news "人工智能技术在医疗领域的应用取得突破" "全球气候变化加剧"
wordfreq-cn freq --news "人工智能技术在医疗领域的应用取得突破" --json

Sample output

TF-IDF top terms:

人工智能技术 1.0000
医疗 0.8349
应用 0.6730
...

TextRank keywords:

TextRank [2025-11-25]:
  领域 (1.0000)
  医疗 (0.8349)
  取得 (0.6746)
  应用 (0.6730)
  突破 (0.5175)
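
To make the TextRank scores above easier to interpret, here is a minimal sketch of the general technique: build an undirected co-occurrence graph over the tokens and run a PageRank-style iteration on it. The function name textrank_keywords, the window size, the damping factor, and the rescaling so that the top word scores 1.0 are assumptions for this example, not the package's confirmed behavior.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=5):
    """Score tokens with PageRank over a co-occurrence graph (hypothetical helper)."""
    # Link each token to the distinct tokens that follow it within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's current score.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Rescale so the highest-ranked word scores 1.0.
    return [(word, score / best) for word, score in ranked[:top_k]]


tokens = ["人工智能", "技术", "医疗", "领域", "应用", "取得", "突破"]
keywords = textrank_keywords(tokens, window=2)
for word, score in keywords:
    print(f"  {word} ({score:.4f})")
```

Words in the middle of the sentence co-occur with more neighbors than words at either end, so they tend to rank higher in a graph built this way.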

Word frequency counts:

技术 2
人工智能 1
医疗 1
...
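
Counts like the ones above can be reproduced in spirit with collections.Counter. The sketch below assumes the text is already segmented into tokens; count_tokens is a hypothetical helper, not the package's count_word_frequency.

```python
from collections import Counter


def count_tokens(token_lists, stopwords=frozenset(), top_k=10):
    """Count tokens across documents, skipping stop words and blanks (hypothetical helper)."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(t for t in tokens if t.strip() and t not in stopwords)
    return counter.most_common(top_k)


token_lists = [
    ["人工智能", "技术", "在", "医疗", "领域", "的", "应用"],
    ["医疗", "技术", "的", "突破"],
]
freq = count_tokens(token_lists, stopwords={"的", "在"})
for word, count in freq:
    print(word, count)
```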

JSON output:

{
  "人工智能技术": 1,
  "医疗": 1,
  "应用": 1,
  "突破": 1
}
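
If the --json output is a flat object mapping each word to its count, as the sample above suggests, it can be consumed with the standard json module. The sketch below parses the sample text directly; the helper name parse_freq_json is an assumption, and a real script would read the text from the command's standard output instead.

```python
import json

# The sample output shown above, used here as literal input.
raw = """{
  "人工智能技术": 1,
  "医疗": 1,
  "应用": 1,
  "突破": 1
}"""


def parse_freq_json(text):
    """Return (word, count) pairs sorted by count descending, then by word (hypothetical helper)."""
    data = json.loads(text)
    return sorted(data.items(), key=lambda kv: (-kv[1], kv[0]))


pairs = parse_freq_json(raw)
for word, count in pairs:
    print(word, count)
```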

Word cloud output directory:

wordclouds/wordcloud_day1.png
wordclouds/wordcloud_day2.png

Python API usage example

from collections import defaultdict
from wordfreq_cn import (
    extract_keywords,
    count_word_frequency,
    generate_trend_wordcloud,
    load_stopwords
)

# Sample news data as (date, text) pairs
news_list = [
    ("2025-11-25", "人工智能技术在医疗领域的应用取得突破"),
    ("2025-11-25", "全球气候变化加剧，联合国发布最新报告")
]

# Load a custom stop-word list
stopwords = load_stopwords("stopwords.txt")

# ---------------------------
# TF-IDF keyword extraction
# ---------------------------
texts = [text for _, text in news_list]
tfidf_res = extract_keywords(texts, method="tfidf", top_k=5, stopwords=stopwords)
print("TF-IDF:", tfidf_res)

# ---------------------------
# TextRank keyword extraction
# ---------------------------
for date, text in news_list:
    kws = extract_keywords(text, method="textrank", top_k=5, stopwords=stopwords)
    print(f"TextRank [{date}]:", kws)

# ---------------------------
# Word frequency counting
# ---------------------------
counter = count_word_frequency(texts, stopwords=stopwords)
print("Word frequencies:", counter)

# ---------------------------
# Generate trend word clouds by date
# ---------------------------
news_by_date = defaultdict(list)
for date, text in news_list:
    news_by_date[date].append(text)

generate_trend_wordcloud(news_by_date, stopwords=stopwords)
# Word cloud images are saved to the wordclouds/ directory by default
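
The example above groups news by calendar day. For a wider time window, the same date-to-texts mapping can be built with coarser keys. The sketch below buckets items by the Monday that starts their week; group_by_week is a hypothetical helper, and whether generate_trend_wordcloud accepts such keys unchanged is an assumption that should be checked against the installed version.

```python
from collections import defaultdict
from datetime import date, timedelta


def group_by_week(news_list):
    """Group (ISO date string, text) pairs by the Monday starting each week (hypothetical helper)."""
    buckets = defaultdict(list)
    for day, text in news_list:
        d = date.fromisoformat(day)
        # weekday() is 0 for Monday, so this steps back to the start of the week.
        monday = d - timedelta(days=d.weekday())
        buckets[monday.isoformat()].append(text)
    return dict(buckets)


news_list = [
    ("2025-11-25", "人工智能技术在医疗领域的应用取得突破"),
    ("2025-11-30", "全球气候变化加剧，联合国发布最新报告"),
    ("2025-12-01", "新能源汽车销量持续增长"),
]
news_by_week = group_by_week(news_list)
for week_start, texts in news_by_week.items():
    print(week_start, len(texts))
```

The result has the same shape as news_by_date (a key mapped to a list of texts), only with one key per week instead of one per day.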

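Workflow at a glance

┌───────────────────────┐
│    Input news list    │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│   TF-IDF / TextRank   │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│    Output keywords    │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ Word frequency counts │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│  Generate word cloud  │
└───────────────────────┘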

Testing

# Run all tests
pytest

# Run a specific test file
pytest tests/test_core.py -v

# Run a specific test class
pytest tests/test_core.py::TestTFIDFKeywords -v

# Run with a coverage report
pytest --cov=wordfreq_cn

# Generate an HTML coverage report
pytest --cov=wordfreq_cn --cov-report=html

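File overview

Path	Description
`wordfreq_cn/`	Python package directory containing the core logic and the CLI
`wordfreq_cn/data/stopwords.txt`	Optional custom stop-word file
`wordfreq_cn/data/cn_stopwords.txt`	HIT (Harbin Institute of Technology) Chinese stop-word list
`wordfreq_cn/data/fonts/`	Chinese font files (e.g. Source Han Sans) used to render word clouds
`wordclouds/`	Default output directory for generated word cloud images
`tests/`	Unit tests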

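Notes

For large volumes of news, adjust the top_k argument of extract_keywords or the max_features parameter of TF-IDF.
The stop-word list should include common function words (such as “的”, “在”, “是”) to get cleaner word-frequency results.
After installation, use the wordfreq-cn command directly; there is no need to run python cli.py by hand.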
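
To illustrate the stop-word note, the sketch below reads a stop-word file and filters a token list with it. It assumes a UTF-8 file with one word per line, which this page does not confirm for the bundled lists; read_stopwords is a hypothetical helper and not the package's load_stopwords.

```python
import tempfile
from pathlib import Path


def read_stopwords(path, encoding="utf-8"):
    """Read one stop word per line, ignoring blank lines (hypothetical helper)."""
    lines = Path(path).read_text(encoding=encoding).splitlines()
    return {line.strip() for line in lines if line.strip()}


# Write a small sample file so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "stopwords.txt"
    sample.write_text("的\n在\n\n是\n", encoding="utf-8")
    stopwords = read_stopwords(sample)

tokens = ["人工智能", "技术", "在", "医疗", "领域", "的", "应用"]
filtered = [t for t in tokens if t not in stopwords]
print(filtered)
```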

Project details

Release history

0.2.1

Apr 25, 2026

0.2.0

Apr 19, 2026

0.1.10

Dec 12, 2025

0.1.8

Dec 7, 2025

0.1.7

Dec 7, 2025

0.1.6

Dec 2, 2025

0.1.4

Nov 30, 2025

0.1.3

Nov 30, 2025

0.1.2 yanked

Nov 29, 2025

Reason this release was yanked:

Major API change: the TextRank API was removed

This version

0.1.1 yanked

Nov 29, 2025

Reason this release was yanked:

Contains a major bug

0.0.4 yanked

Nov 28, 2025

Reason this release was yanked:

Major API change

0.0.2 yanked

Nov 28, 2025

Reason this release was yanked:

Major API change

Download files

Source Distribution

wordfreq_cn-0.1.1.tar.gz (19.0 MB view details)

Uploaded Nov 29, 2025 Source

Built Distribution

wordfreq_cn-0.1.1-py3-none-any.whl (19.1 MB view details)

Uploaded Nov 29, 2025 Python 3

File details

Details for the file wordfreq_cn-0.1.1.tar.gz.

File metadata

Download URL: wordfreq_cn-0.1.1.tar.gz
Upload date: Nov 29, 2025
Size: 19.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for wordfreq_cn-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`b0e81be1b4474a48c94a465850409e46321172835ee02059a6c349b82317383c`
MD5	`126e9ad870576299586a71e3da57069e`
BLAKE2b-256	`56cab2f33007f195fd0ec6763ba0d861af327b5c2f0c2d844cdf325e33a6a66b`

File details

Details for the file wordfreq_cn-0.1.1-py3-none-any.whl.

File metadata

Download URL: wordfreq_cn-0.1.1-py3-none-any.whl
Upload date: Nov 29, 2025
Size: 19.1 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for wordfreq_cn-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9d6792e54d34270726cdeaa375f38ca09b000eabe970a972296bc47d726a5116`
MD5	`3d799e8283442f490c2170cb16e0536f`
BLAKE2b-256	`2b1674c2be4666c75ea7d43481f4c7331187be694abb9a42bd77513eb7ad8ab3`
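
A downloaded file can be checked against the SHA256 digests listed above with the standard hashlib module. The sketch below is a generic illustration; the helper names sha256_of and matches and the local file path in the commented usage are assumptions.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the wheel was downloaded into the current directory:
# matches("wordfreq_cn-0.1.1-py3-none-any.whl",
#         "9d6792e54d34270726cdeaa375f38ca09b000eabe970a972296bc47d726a5116")
```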

wordfreq-cn 0.1.1
