Chinese News Word-Frequency Analysis and Trend Word-Cloud Tool
WordFreq-CN

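A tool for word-frequency analysis and trend-word visualization of Chinese news.
Features
- TF-IDF extraction of high-weight terms from Chinese news headlines and body text
- Word-frequency counting
- Trend word clouds generated per time window
- Custom stop-word lists for filtering out Chinese function words
- Runs directly from the command line via the `wordfreq-cn` tool
- Also usable from Python through the `wordfreq-cn` API functions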
Installation
# Install from a local source checkout (if you have the source)
pip install .
# Or install directly from PyPI
pip install wordfreq-cn
Quick start (command line)
wordfreq-cn tfidf --news "人工智能技术在医疗领域的应用取得突破" "全球气候变化加剧" --topk 5
wordfreq-cn freq --news "人工智能技术在医疗领域的应用取得突破" --topk 10
wordfreq-cn wordcloud --news "人工智能技术在医疗领域的应用取得突破" "全球气候变化加剧"
wordfreq-cn freq --news "人工智能技术在医疗领域的应用取得突破" --json
wordfreq-cn wordcloud --news "人工智能技术在医疗领域的应用取得突破" "全球气候变化加剧" --bin
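To drive the same commands from a script, a thin wrapper around `subprocess` may be enough. The sketch below uses only the `freq` subcommand and the `--news`, `--topk`, and `--json` flags shown above. The helper names `build_freq_command` and `run_freq` are hypothetical and not part of the package, and combining `--topk` with `--json` is an assumption that this README does not confirm.

```python
import subprocess
from typing import Optional


def build_freq_command(
    news: list[str], topk: Optional[int] = None, as_json: bool = False
) -> list[str]:
    """Build the argv list for `wordfreq-cn freq` (hypothetical helper)."""
    cmd = ["wordfreq-cn", "freq", "--news", *news]
    if topk is not None:
        cmd += ["--topk", str(topk)]
    if as_json:
        cmd.append("--json")
    return cmd


def run_freq(news: list[str], **options) -> str:
    """Run the CLI and return its stdout; requires wordfreq-cn to be installed."""
    result = subprocess.run(
        build_freq_command(news, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


cmd = build_freq_command(["人工智能技术在医疗领域的应用取得突破"], topk=10)
print(cmd)
```

Passing the arguments as a list avoids shell quoting problems with the Chinese headline text.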
Sample output
Top TF-IDF terms:
人工智能技术 1.0000
医疗 0.8349
应用 0.6730
...
Word-frequency counts:
技术 2
人工智能 1
医疗 1
...
JSON output
{
  "人工智能技术": 1,
  "医疗": 1,
  "应用": 1,
  "突破": 1
}
Word-cloud output directory:
wordclouds/wordcloud_day1.png
wordclouds/wordcloud_day2.png
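The JSON form is convenient for post-processing. As a minimal sketch, assuming the `--json` output is a single object mapping each term to its count (as in the sample above), the terms can be ranked with the standard library alone. The helper `top_terms` is hypothetical.

```python
import json

# Text copied from the JSON sample output in this README.
raw = '{"人工智能技术": 1, "医疗": 1, "应用": 1, "突破": 1}'


def top_terms(json_text: str, k: int) -> list[tuple[str, int]]:
    """Return the k most frequent terms; ties keep their original order."""
    counts = json.loads(json_text)
    # sorted() is stable even with reverse=True, so equal counts stay in input order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


print(top_terms(raw, 2))
```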
Python API usage example
from collections import defaultdict
from wordfreq_cn import (
    extract_keywords,
    count_word_frequency,
    generate_trend_wordcloud,
    load_stopwords,
)
# Sample news data
news_list = [
    ("2025-11-25", "人工智能技术在医疗领域的应用取得突破"),
    ("2025-11-25", "全球气候变化加剧,联合国发布最新报告"),
]
# Load a custom stop-word list
stopwords = load_stopwords("stopwords.txt")
# ---------------------------
# TF-IDF keyword extraction
# ---------------------------
texts = [text for _, text in news_list]
tfidf_res = extract_keywords(texts, method="tfidf", top_k=5, stopwords=stopwords)
print("TF-IDF:", tfidf_res)
# ---------------------------
# Word-frequency counting
# ---------------------------
counter = count_word_frequency(texts, stopwords=stopwords)
print("Word frequencies:", counter)
# ---------------------------
# Trend word clouds by date
# ---------------------------
news_by_date = defaultdict(list)
for date, text in news_list:
    news_by_date[date].append(text)
# Writes the word-cloud images (to wordclouds/ by default) and returns a list of their paths
generate_trend_wordcloud(news_by_date, stopwords=stopwords)
# With return_bytes=True, returns the images as binary bytes
generate_trend_wordcloud(news_by_date, stopwords=stopwords, return_bytes=True)
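The example above groups news by day. For a wider time window, the same `defaultdict` pattern can key on the ISO week instead. This is a sketch only: `week_key` and `group_by_week` are hypothetical helpers, the dates are illustrative, and whether `generate_trend_wordcloud` accepts keys in this format is an assumption that this README does not confirm.

```python
from collections import defaultdict
from datetime import date


def week_key(day: str) -> str:
    """Map an ISO date string such as '2025-11-25' to a label like '2025-W48'."""
    iso = date.fromisoformat(day).isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def group_by_week(news_list):
    """Group (date, text) pairs by ISO week (hypothetical helper)."""
    grouped = defaultdict(list)
    for day, text in news_list:
        grouped[week_key(day)].append(text)
    return dict(grouped)


# Illustrative dates; the headlines are the sample texts used in this README.
news_list = [
    ("2025-11-25", "人工智能技术在医疗领域的应用取得突破"),
    ("2025-11-27", "全球气候变化加剧,联合国发布最新报告"),
    ("2025-12-02", "全球气候变化加剧"),
]
news_by_week = group_by_week(news_list)
for week, texts in news_by_week.items():
    print(week, len(texts))
```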
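Workflow at a glance
┌──────────────────────┐
│ Input news list      │
└───────────┬──────────┘
            │
            ▼
┌──────────────────────┐
│ TF-IDF               │
└───────────┬──────────┘
            │
            ▼
┌──────────────────────┐
│ Output keywords      │
└───────────┬──────────┘
            │
            ▼
┌──────────────────────┐
│ Word-frequency count │
└───────────┬──────────┘
            │
            ▼
┌──────────────────────┐
│ Generate word clouds │
└──────────────────────┘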
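The TF-IDF and counting stages of this flow can be illustrated with the standard library. The sketch below is not the package's implementation. It assumes the text is already split into tokens (real input would first need a Chinese word segmenter) and uses the smoothed weighting `idf = ln((1 + N) / (1 + df)) + 1` without normalization. The function name `tfidf_scores` is hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each token per document as term frequency times smoothed IDF."""
    n_docs = len(docs)
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        scores.append({
            token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in term_freq.items()
        })
    return scores


# Hand-tokenized stand-ins for the sample headlines. "技术" is added to the
# second list only to show how a shared term is down-weighted.
docs = [
    ["人工智能", "技术", "医疗", "应用", "突破"],
    ["全球", "气候变化", "加剧", "技术"],
]
scores = tfidf_scores(docs)
# Word-frequency stage: plain counts over all documents.
totals = Counter(token for doc in docs for token in doc)
print(scores[0])
print(totals.most_common(1))
```

A term that occurs in every document gets the lowest weight, while the raw count in `totals` still ranks it first. That is why the two outputs are reported separately.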
Testing
# Run all tests
pytest
# Run a specific test file
pytest tests/test_core.py -v
# Run a specific test class
pytest tests/test_core.py::TestTFIDFKeywords -v
# Run with a coverage report
pytest --cov=wordfreq_cn
# Generate an HTML coverage report
pytest --cov=wordfreq_cn --cov-report=html
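Files
| File | Description |
|---|---|
| `wordfreq_cn/` | Python package directory containing the core logic and the CLI |
| `wordfreq_cn/data/stopwords.txt` | Optional custom stop-word file |
| `wordfreq_cn/data/cn_stopwords.txt` | Harbin Institute of Technology (HIT) Chinese stop-word list |
| `wordfreq_cn/data/fonts/` | Chinese font files (such as Source Han Sans) used to render word clouds |
| `wordclouds/` | Default output directory for generated word-cloud images |
| `tests/` | Unit tests |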
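Notes
- For large volumes of news, adjust the `top_k` argument of `extract_keywords` or the TF-IDF `max_features` parameter.
- The stop-word list should include common function words (such as “的”, “在”, “是”) to get cleaner word-frequency results.
- After installation, use the `wordfreq-cn` command directly; there is no need to run `python cli.py` by hand.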
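The effect of the stop-word advice can be seen in a small sketch. It assumes a stop-word file with one word per line, which this README does not specify. `parse_stopwords` and `count_without_stopwords` are hypothetical helpers, not the package's `load_stopwords` or `count_word_frequency`.

```python
from collections import Counter


def parse_stopwords(text: str) -> set[str]:
    """Parse a stop-word list, assuming one word per line; blank lines are ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def count_without_stopwords(tokens: list[str], stopwords: set[str]) -> Counter:
    """Count tokens after dropping every stop word."""
    return Counter(token for token in tokens if token not in stopwords)


stopwords = parse_stopwords("的\n在\n是\n")
# Hand-tokenized version of the first sample headline.
tokens = ["人工智能", "技术", "在", "医疗", "领域", "的", "应用", "取得", "突破"]
counts = count_without_stopwords(tokens, stopwords)
print(counts.most_common(3))
```

Without the filter, function words such as “的” and “在” tend to dominate the counts on longer texts and crowd out the content words.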