
Chinese news word-frequency analysis and trend word cloud tool

Reason this release was yanked:

Major API changes

Project description

News word-frequency analysis and trend-word visualization

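Features

  • TF-IDF extraction of high-weight terms from Chinese news headlines and body text
  • TextRank-based keyword extraction
  • Word-frequency counting
  • Trend word clouds generated per time window
  • Custom stopword lists for filtering out Chinese function words
  • Runs directly from the command line via the wordfreq-cn tool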
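
To make the TF-IDF item concrete, here is a minimal standard-library sketch of the scoring idea. It is not the package's implementation: it assumes the text has already been segmented into tokens (in practice a segmenter such as jieba would do this), the helper name tfidf_top_terms is hypothetical, and the smoothed IDF form ln((1 + N) / (1 + df)) + 1 is an assumed choice.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, top_k=3):
    """Return the top_k (term, score) pairs for each pre-tokenized document.

    Hypothetical helper for illustration only; not part of wordfreq_cn.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            # Relative term frequency times smoothed inverse document frequency.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Highest score first; ties are broken by the term itself for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_k])
    return results


# Tokens are assumed to come from a word segmenter.
docs = [
    ["人工智能", "技术", "医疗", "领域", "应用", "突破"],
    ["全球", "气候", "变化", "加剧", "技术"],
]
top = tfidf_top_terms(docs, top_k=2)
for ranked in top:
    print(ranked)
```

A term that occurs in every document, such as "技术" here, gets the lowest IDF and therefore drops out of the top results, which is the behavior that separates TF-IDF from plain word counts.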

Installation

# Install the Python dependencies
pip install jieba scikit-learn wordcloud matplotlib

# Install the local package (when working from source)
pip install .

# Install from PyPI
pip install wordfreq-cn

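Usage

1. Command line

  wordfreq-cn --news "人工智能技术在医疗领域的应用取得突破" "全球气候变化加剧" --topk 5
  • --news: one or more news headlines or body texts
  • --topk: number of top keywords to output (default 10)
  • Word cloud images, one per news item or aggregated by date, are written to wordclouds/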
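
The two options above map naturally onto argparse. The sketch below shows one way such an interface could be declared; it is an assumption for illustration, not the actual wordfreq-cn CLI source, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="wordfreq-cn")
    # nargs="+" lets --news accept one or more quoted texts.
    parser.add_argument("--news", nargs="+", help="news headlines or body texts")
    # The documented default for --topk is 10.
    parser.add_argument("--topk", type=int, default=10, help="number of keywords to output")
    return parser


# Parse the same arguments as the command shown above.
args = build_parser().parse_args(
    ["--news", "人工智能技术在医疗领域的应用取得突破", "全球气候变化加剧", "--topk", "5"]
)
print(args.news, args.topk)
```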

Sample output:

=== TF-IDF 高权重词 ===
人工智能技术 1.0000
医疗 0.8349
应用 0.6730
...

=== TextRank 关键词 ===
标题: 人工智能技术在医疗领域的应用取得突破
  领域 (1.0000)
  医疗 (0.8349)
  取得 (0.6746)
  应用 (0.6730)
  突破 (0.5175)

=== 词频统计 ===
技术 2
人工智能 1
医疗 1
...

2. Python API

from collections import defaultdict

from wordfreq_cn import tfidf_keywords, textrank_keywords, count_words, generate_trend_wordcloud, load_stopwords

# Each item is a (date, text) pair
news_list = [
    ("2025-11-25", "人工智能技术在医疗领域的应用取得突破"),
    ("2025-11-25", "全球气候变化加剧,联合国发布最新报告")
]

# Load stopwords, passing a custom stopword file
stopwords = load_stopwords(custom_file="stopwords.txt")

# TF-IDF keywords
tfidf_res = tfidf_keywords([text for _, text in news_list], top_k=5, stopwords=stopwords)
print(tfidf_res)

# Word-frequency counts
counter = count_words([text for _, text in news_list], stopwords=stopwords)
print(counter)

# Group the texts by date, then generate the trend word clouds
news_by_date = defaultdict(list)
for date, text in news_list:
    news_by_date[date].append(text)
generate_trend_wordcloud(news_by_date, stopwords=stopwords)
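
A per-date word cloud needs a per-date frequency table. The following standard-library sketch shows that aggregation step under stated assumptions: the texts are already segmented into tokens, frequencies_by_date is a hypothetical helper, and generate_trend_wordcloud itself may work differently.

```python
from collections import Counter, defaultdict


def frequencies_by_date(dated_tokens, stopwords=frozenset()):
    """Aggregate token counts per date (hypothetical helper, not part of wordfreq_cn)."""
    per_date = defaultdict(Counter)
    for date, tokens in dated_tokens:
        # Skip stopwords so they never reach the frequency table.
        per_date[date].update(t for t in tokens if t not in stopwords)
    return dict(per_date)


# Tokens are assumed to come from a word segmenter.
dated_tokens = [
    ("2025-11-25", ["人工智能", "技术", "在", "医疗", "领域"]),
    ("2025-11-25", ["技术", "突破"]),
    ("2025-11-26", ["气候", "变化", "的", "报告"]),
]
freqs = frequencies_by_date(dated_tokens, stopwords={"在", "的"})
for date in sorted(freqs):
    print(date, freqs[date].most_common(2))
```

Each per-date mapping of word to count is the kind of input that the wordcloud library's WordCloud.generate_from_frequencies accepts, so one image per date follows from one table per date.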

Word cloud examples

[Images: word clouds for 2015-11-25 and 2015-11-26]


Testing

# Run all tests
pytest

# Run a specific test file
pytest tests/test_core.py -v

# Run a specific test class
pytest tests/test_core.py::TestTFIDFKeywords -v

# Run with a coverage report
pytest --cov=wordfreq_cn

# Generate an HTML coverage report
pytest --cov=wordfreq_cn --cov-report=html

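File overview

File                                              Description
wordfreq_cn/                                      Python package directory containing the core logic and the CLI
wordfreq_cn/data/stopwords.txt                    Optional custom stopword file
wordfreq_cn/data/cn_stopwords.txt                 HIT (Harbin Institute of Technology) Chinese stopword list (the script can load it automatically)
wordfreq_cn/data/fonts/SourceHanSansHWSC-VF.ttf   Source Han Sans Chinese font file, used to render Chinese word clouds
wordclouds/                                       Output directory for generated word cloud images
tests/                                            Unit tests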

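Notes

  • For large volumes of news, adjust the max_features and top_k parameters of the tfidf_keywords function.
  • For cleaner word-frequency results, the stopword list should include common function words (such as “的”, “在”, and “是”).
  • After installation, the wordfreq-cn command is available directly; there is no need to run python main.py or python wordfreq_cn/cli.py.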
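
The stopword advice can be illustrated with a short sketch. It assumes a UTF-8 stopword file with one word per line, which is a common layout but not confirmed for this package; read_stopwords and remove_stopwords are hypothetical helpers, distinct from the package's own load_stopwords.

```python
import os
import tempfile


def read_stopwords(path, encoding="utf-8"):
    """Read one stopword per line, ignoring blank lines (assumed file layout)."""
    with open(path, encoding=encoding) as fh:
        return {line.strip() for line in fh if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


# Write a tiny stopword file to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stopwords.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("的\n在\n是\n\n")
    stopwords = read_stopwords(path)

# Tokens are assumed to come from a word segmenter.
tokens = ["人工智能", "在", "医疗", "领域", "的", "应用"]
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

Storing the stopwords in a set keeps each membership check constant-time, which matters once the list grows to thousands of entries.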

