Chinese News Word Frequency Analysis and Trend Word Cloud Tool
Reason this release was yanked:
The API has changed significantly.
Project description
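Word frequency analysis and trend-word visualization for news text
Features
- TF-IDF extraction of high-weight words from Chinese news titles and body text
- TextRank-based keyword extraction
- Word frequency counting
- Trend word clouds generated per time window
- Custom stopword lists for filtering out Chinese function words
- Runs directly from the command line through the `wordfreq-cn` tool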
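The package's own TextRank implementation is not shown on this page. As a rough illustration of the idea behind TextRank keyword extraction, the sketch below builds a co-occurrence graph over tokens that have already been segmented and runs a PageRank-style iteration over it. The function name `textrank_sketch`, its parameters, and the final scaling to a top score of 1.0 are assumptions made for this example, not the `wordfreq_cn` API.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def textrank_sketch(tokens: Sequence[str], window: int = 5, damping: float = 0.85,
                    iterations: int = 50, top_k: int = 5) -> List[Tuple[str, float]]:
    """Rank tokens with a PageRank-style iteration over a co-occurrence graph.

    Hypothetical helper for illustration only; not part of wordfreq_cn.
    """
    # Undirected weighted graph: two different words are linked when they
    # appear within `window` consecutive tokens (window=2 means adjacent).
    graph: Dict[str, Dict[str, float]] = defaultdict(dict)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other == word:
                continue
            graph[word][other] = graph[word].get(other, 0.0) + 1.0
            graph[other][word] = graph[other].get(word, 0.0) + 1.0
    if not graph:
        return []

    # Total edge weight per node, used to split a node's score among neighbours.
    out_weight = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                graph[n][w] / out_weight[n] * scores[n] for n in graph[w]
            )
            for w in graph
        }

    # Scale so the best word scores 1.0, then sort by score (ties by word).
    top = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, s / top) for w, s in ranked[:top_k]]
```

In practice the `tokens` argument would come from a Chinese word segmenter such as jieba, with stopwords removed beforehand.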
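Installation
# Install the Python dependencies
pip install jieba scikit-learn wordcloud matplotlib
# Install the local package (when working from source)
pip install .
Note: Chinese word clouds require the font file `simhei.ttf`, which can be placed in the project directory or in a system font directory.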
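How the package locates the font is not documented here. A minimal sketch of one way to search the project directory and some common system font directories is shown below; `find_font` and the directory list are hypothetical and chosen only for illustration.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical search locations: the current directory plus common
# per-platform font directories. These are not taken from the package.
DEFAULT_FONT_DIRS = [
    Path("."),                 # project directory
    Path("C:/Windows/Fonts"),  # Windows
    Path("/usr/share/fonts"),  # Linux
    Path("/Library/Fonts"),    # macOS
]


def find_font(name: str = "simhei.ttf",
              search_dirs: Optional[Iterable[Path]] = None) -> Optional[Path]:
    """Return the first file whose name matches `name` (case-insensitive), or None."""
    dirs = DEFAULT_FONT_DIRS if search_dirs is None else search_dirs
    for directory in dirs:
        if not directory.is_dir():
            continue
        # rglob also looks into subdirectories such as /usr/share/fonts/truetype.
        for candidate in directory.rglob("*"):
            if candidate.is_file() and candidate.name.lower() == name.lower():
                return candidate
    return None
```

The returned path could then be passed to the `font_path` argument of `wordcloud.WordCloud`, which is how that library selects a font.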
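Usage
1. Command line
wordfreq-cn --news "人工智能技术在医疗领域的应用取得突破" "全球气候变化加剧" --topk 5
- `--news`: list of news titles or body texts; accepts multiple values
- `--topk`: number of top keywords to output (default 10)
- Trend word cloud images, one per news item or aggregated by date, are written to `wordclouds/`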
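The two options described above map naturally onto `argparse`. The sketch below shows how such a command line could be declared; it is an assumption for illustration, not the package's actual entry point, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare --news (one or more texts) and --topk (default 10)."""
    parser = argparse.ArgumentParser(
        prog="wordfreq-cn",
        description="Word frequency analysis for Chinese news text (sketch).",
    )
    # nargs="+" collects every value after --news into a list.
    parser.add_argument("--news", nargs="+", required=True,
                        help="one or more news titles or body texts")
    parser.add_argument("--topk", type=int, default=10,
                        help="number of top keywords to output")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch can be
# run on its own.
args = build_parser().parse_args(["--news", "全球气候变化加剧", "--topk", "5"])
print(args.news, args.topk)
```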
Sample output of the `wordfreq-cn` command:
=== TF-IDF 高权重词 ===
人工智能技术 1.0000
医疗 0.8349
应用 0.6730
...
=== TextRank 关键词 ===
标题: 人工智能技术在医疗领域的应用取得突破
领域 (1.0000)
医疗 (0.8349)
取得 (0.6746)
应用 (0.6730)
突破 (0.5175)
=== 词频统计 ===
技术 2
人工智能 1
医疗 1
...
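In the sample above the highest TF-IDF weight is 1.0000, which suggests the scores are scaled relative to the top term. The sketch below shows one way such scaled scores could be computed over documents that are already segmented into tokens. It is an assumption for illustration: the name `tfidf_sketch`, the smoothed IDF formula, and the summing of scores across documents are choices made here, not the behaviour of `tfidf_keywords`.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def tfidf_sketch(docs: Sequence[Sequence[str]], top_k: int = 10) -> List[Tuple[str, float]]:
    """Sum TF-IDF per term across documents and scale the top term to 1.0.

    Hypothetical helper for illustration only; not part of wordfreq_cn.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores: Dict[str, float] = defaultdict(float)
    for tokens in docs:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed IDF, so a term present in every document keeps a weight.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            scores[term] += tf * idf
    if not scores:
        return []

    top = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, score / top) for term, score in ranked[:top_k]]
```

Because term frequency is divided by document length, terms from shorter documents score higher when all terms are equally rare, so the ranking depends on how the input is segmented and filtered.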
2. Python API
from collections import defaultdict

from wordfreq_cn import (
    tfidf_keywords,
    textrank_keywords,
    count_words,
    generate_trend_wordcloud,
    load_stopwords,
)

# Each entry is a (date, text) pair.
news_list = [
    ("2025-11-25", "人工智能技术在医疗领域的应用取得突破"),
    ("2025-11-25", "全球气候变化加剧,联合国发布最新报告"),
]
texts = [text for _, text in news_list]
stopwords = load_stopwords(custom_file="stopwords.txt")

# TF-IDF keywords
tfidf_res = tfidf_keywords(texts, top_k=5, stopwords=stopwords)
print(tfidf_res)

# Word frequency counts
counter = count_words(texts, stopwords=stopwords)
print(counter)

# Group the texts by date, then generate the word clouds
news_by_date = defaultdict(list)
for date, text in news_list:
    news_by_date[date].append(text)
generate_trend_wordcloud(news_by_date, stopwords=stopwords)
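The example above groups texts by exact date. For a coarser time window, the same grouping can be keyed by ISO week instead. The helper below is a hypothetical sketch using only the standard library; whether `generate_trend_wordcloud` accepts arbitrary string keys such as `2025-W48` is an assumption that should be checked against the package.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def group_by_week(news: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (ISO date string, text) pairs under keys like '2025-W48'."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for day, text in news:
        # isocalendar() yields (ISO year, ISO week number, ISO weekday).
        iso_year, iso_week, _ = date.fromisoformat(day).isocalendar()
        grouped[f"{iso_year}-W{iso_week:02d}"].append(text)
    return dict(grouped)


weekly = group_by_week([
    ("2025-11-25", "人工智能技术在医疗领域的应用取得突破"),
    ("2025-12-02", "全球气候变化加剧"),
])
print(sorted(weekly))
```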
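Files
| File | Description |
|---|---|
| `wordfreq/` | Python package directory containing the core logic and the CLI |
| `stopwords.txt` | Optional custom stopword file |
| `cn_stopwords.txt` | HIT (Harbin Institute of Technology) Chinese stopword list, loaded automatically by the script |
| `wordclouds/` | Output directory for the generated word cloud images |
| `simhei.ttf` | Chinese font file used to render Chinese word clouds |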
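Notes
- Chinese word clouds require the font file `simhei.ttf`; download it or use a Chinese font that ships with the system.
- For large volumes of news, adjust the `max_features` and `top_k` parameters of the `tfidf_keywords` function.
- Include common function words (such as "的", "在", "是") in the stopword list for cleaner word frequency counts.
- Once the package is installed, the `wordfreq-cn` command can be used directly; running `python main.py` is no longer necessary.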
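To illustrate why a stopword list matters for the counts, the sketch below reads a stopword file and filters tokens before counting. It assumes a layout of one stopword per line in UTF-8, which is common but not confirmed for `stopwords.txt`; `read_stopwords` and `count_tokens` are hypothetical names, not the package's `load_stopwords` or `count_words`.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable, Set


def read_stopwords(path: str) -> Set[str]:
    """Read one stopword per line (assumed layout), skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def count_tokens(tokens: Iterable[str], stopwords: Set[str]) -> Counter:
    """Count tokens that are not in the stopword set."""
    return Counter(token for token in tokens if token not in stopwords)


# Already segmented tokens; "的" and "在" are function words.
tokens = ["人工智能", "的", "应用", "在", "医疗", "的", "应用"]
print(count_tokens(tokens, {"的", "在", "是"}).most_common())
```

Without the filter, "的" would tie with "应用" as the most frequent token in this small example.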
File details
Details for the file wordfreq_cn-0.0.2.tar.gz.
File metadata
- Download URL: wordfreq_cn-0.0.2.tar.gz
- Upload date:
- Size: 4.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ce79750707b561a60d5f0a1ca46e176beb1a1770e5e7c309964ca87781007396 |
| MD5 | 00c4224f33547665e9c4d3421615586d |
| BLAKE2b-256 | 497340ecf907d5587b7a52a9bbafdf5e56aea793d1d0479323262b708a0d10ee |
File details
Details for the file wordfreq_cn-0.0.2-py3-none-any.whl.
File metadata
- Download URL: wordfreq_cn-0.0.2-py3-none-any.whl
- Upload date:
- Size: 5.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fa815db918db1bac66a1071f3e871122e2adba7178c439f6e31710c0bc41d237 |
| MD5 | af6744fd480637d37294b3b3c9c51aee |
| BLAKE2b-256 | d12cbbc0029de15ab960f2fbd0e5bc86eee3c706bd45c7bd94648ab200047fdd |