Chinese news word-frequency analysis and trend word cloud tool

Reason this release was yanked:

Major API changes

Project description

News word-frequency analysis and trend-word visualization

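Features

  • TF-IDF extraction of high-weight words from Chinese news headlines and body text
  • TextRank-based keyword extraction
  • Word-frequency counting
  • Trend word clouds generated per time window
  • Custom stopword lists for filtering Chinese function words
  • Runs directly from the command line via the wordfreq-cn tool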
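
As a rough illustration of the TF-IDF scoring named in the feature list, the sketch below scores pre-tokenized documents using only the standard library. It is not the package's implementation (the dependency list suggests jieba and scikit-learn do the real work); the function name `tfidf_scores` and the smoothed IDF formula are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Return one {term: score} dict per document.

    docs: list of token lists (already segmented, for example by jieba).
    IDF is smoothed as ln((1 + N) / (1 + df)) + 1, an assumption of this sketch.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    results = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        results.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return results

docs = [
    ["人工智能", "技术", "医疗", "应用"],
    ["气候", "变化", "技术", "报告"],
]
scores = tfidf_scores(docs)
# Terms unique to a document outrank terms shared by both documents.
print(sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True))
```

Terms that appear in every document receive the lowest IDF, which is why a shared word such as 技术 ranks below words unique to one headline.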

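Installation

# Install the Python dependencies
pip install jieba scikit-learn wordcloud matplotlib

# Install the local package (when working from source)
pip install .

Note: Chinese word clouds require the font file simhei.ttf, which can be placed in the project directory or in a system font directory.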
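
The font requirement can be checked before generating any images. The following is a minimal sketch, assuming a hypothetical helper `find_font` and a handful of common font directories; the package's own lookup logic is not documented here and may differ.

```python
from pathlib import Path

def find_font(name="simhei.ttf", extra_dirs=()):
    """Return the first existing path to `name`, or None if it is not found.

    Only the top level of each directory is checked; subfolders are not searched.
    The directory list is an assumption of this sketch.
    """
    candidates = [
        *map(Path, extra_dirs),      # caller-supplied directories come first
        Path.cwd(),                  # project directory
        Path("C:/Windows/Fonts"),    # Windows
        Path("/Library/Fonts"),      # macOS
        Path("/usr/share/fonts"),    # Linux
        Path.home() / ".fonts",      # per-user fonts on Linux
    ]
    for directory in candidates:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None

font = find_font()
print(font if font else "simhei.ttf not found; word clouds may not render Chinese text")
```

The resulting path can be handed to the `font_path` argument of wordcloud's `WordCloud` class when building a cloud by hand.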


Usage

1. Running from the command line

  wordfreq-cn --news "人工智能技术在医疗领域的应用取得突破" "全球气候变化加剧" --topk 5
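  • --news: one or more news headlines or body texts
  • --topk: number of top keywords to print (default: 10)
  • Word cloud images, one per news item or aggregated by date, are written to wordclouds/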
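
For readers who want to wrap the tool or mirror its interface, the two options above map naturally onto argparse. This is only a sketch of an equivalent parser, not the package's actual CLI code; `build_parser` is a hypothetical name.

```python
import argparse

def build_parser():
    """Build a parser that accepts the two options described above."""
    parser = argparse.ArgumentParser(prog="wordfreq-cn")
    # nargs="+" lets --news take several headlines in one invocation.
    parser.add_argument("--news", nargs="+", required=True,
                        help="one or more news headlines or body texts")
    parser.add_argument("--topk", type=int, default=10,
                        help="number of top keywords to print")
    return parser

args = build_parser().parse_args(
    ["--news", "人工智能技术在医疗领域的应用取得突破", "全球气候变化加剧", "--topk", "5"]
)
print(len(args.news), args.topk)
```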

Sample output:

=== TF-IDF 高权重词 ===
人工智能技术 1.0000
医疗 0.8349
应用 0.6730
...

=== TextRank 关键词 ===
标题: 人工智能技术在医疗领域的应用取得突破
  领域 (1.0000)
  医疗 (0.8349)
  取得 (0.6746)
  应用 (0.6730)
  突破 (0.5175)

=== 词频统计 ===
技术 2
人工智能 1
医疗 1
...
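
The TextRank scores in the sample output top out at 1.0000, which is consistent with scores being scaled by the maximum. The sketch below shows one simplified way to get such scores: an unweighted co-occurrence graph, a fixed number of PageRank-style iterations, and division by the largest score. The function name, window size, and normalization are assumptions for illustration, not the package's algorithm.

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iterations=30):
    """Score tokens with a simplified, unweighted TextRank."""
    # Link each token to the tokens that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return {}
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    top = max(scores.values())
    # Scale so the best-ranked word scores exactly 1.0.
    return {word: score / top for word, score in scores.items()}

tokens = ["人工智能", "技术", "医疗", "领域", "应用", "取得", "突破"]
ranked = textrank(tokens)
for word, score in sorted(ranked.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word} ({score:.4f})")
```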

2. Calling from Python

from collections import defaultdict

from wordfreq_cn import tfidf_keywords, textrank_keywords, count_words, generate_trend_wordcloud, load_stopwords

# Each item is a (date, text) pair
news_list = [
    ("2025-11-25", "人工智能技术在医疗领域的应用取得突破"),
    ("2025-11-25", "全球气候变化加剧,联合国发布最新报告")
]

# Load the stopword list, including the custom file
stopwords = load_stopwords(custom_file="stopwords.txt")

# TF-IDF keywords
tfidf_res = tfidf_keywords([text for _, text in news_list], top_k=5, stopwords=stopwords)
print(tfidf_res)

# Word-frequency counts
counter = count_words([text for _, text in news_list], stopwords=stopwords)
print(counter)

# Group the texts by date, then generate one word cloud per date
news_by_date = defaultdict(list)
for date, text in news_list:
    news_by_date[date].append(text)
generate_trend_wordcloud(news_by_date, stopwords=stopwords)
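
To make the per-date aggregation more concrete, here is a standard-library sketch of the kind of data a trend word cloud needs: one frequency table per date. It assumes the text has already been segmented into tokens, and `frequencies_by_date` is a hypothetical helper, not part of wordfreq_cn.

```python
from collections import Counter, defaultdict

def frequencies_by_date(dated_tokens, stopwords=frozenset()):
    """Map each date to a Counter of its non-stopword tokens."""
    by_date = defaultdict(Counter)
    for date, tokens in dated_tokens:
        by_date[date].update(t for t in tokens if t not in stopwords)
    return dict(by_date)

# Pre-segmented sample data; real input would come from a tokenizer such as jieba.
dated_tokens = [
    ("2025-11-25", ["人工智能", "技术", "在", "医疗", "领域"]),
    ("2025-11-25", ["医疗", "技术", "的", "突破"]),
    ("2025-11-26", ["气候", "变化", "报告"]),
]
freqs = frequencies_by_date(dated_tokens, stopwords={"在", "的"})
for date, counter in sorted(freqs.items()):
    print(date, counter.most_common(3))
```

Each per-date Counter is a plain word-to-count mapping, the shape accepted by wordcloud's `WordCloud.generate_from_frequencies`.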

Word cloud examples

[Images: trend word clouds for 2015-11-25 and 2015-11-26]


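File overview

File name          Description
wordfreq/          Python package directory containing the core logic and the CLI
stopwords.txt      Optional custom stopword file
cn_stopwords.txt   HIT (Harbin Institute of Technology) Chinese stopword list, loaded automatically by the script
wordclouds/        Output directory for generated word cloud images
simhei.ttf         Chinese font file used to render Chinese word clouds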
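
The two stopword files listed above are plain word lists. A minimal sketch of merging such a file with a few built-in defaults follows, assuming one word per line and UTF-8 encoding; `read_stopwords` and `DEFAULT_STOPWORDS` are hypothetical names, and the package's own `load_stopwords` may behave differently.

```python
from pathlib import Path

# A tiny built-in default set; the real stopword lists are far longer.
DEFAULT_STOPWORDS = {"的", "在", "是"}

def read_stopwords(path=None, encoding="utf-8"):
    """Return the defaults plus any words found in `path` (one word per line)."""
    words = set(DEFAULT_STOPWORDS)
    if path is not None and Path(path).is_file():
        for line in Path(path).read_text(encoding=encoding).splitlines():
            word = line.strip()
            if word:  # skip blank lines
                words.add(word)
    return words

# A missing file is ignored, so the defaults are still returned.
print(sorted(read_stopwords("stopwords.txt")))
```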

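Notes

  • Chinese word clouds require the font file simhei.ttf; download it, or substitute a Chinese font that ships with the operating system.
  • For large volumes of news, adjust the max_features and top_k parameters of the tfidf_keywords function.
  • Include common function words (such as "的", "在", "是") in the stopword list to get cleaner word-frequency results.
  • Once the package is installed, the wordfreq-cn command can be used directly; there is no need to run python main.py.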
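
The effect of the two tuning knobs mentioned in the notes can be sketched without any dependencies. In scikit-learn's TfidfVectorizer, max_features keeps only the most frequent terms across the corpus; whether tfidf_keywords forwards its parameter that way is an assumption here, and both helper names below are hypothetical.

```python
import heapq
from collections import Counter

def limit_vocabulary(docs, max_features):
    """Keep only the `max_features` most frequent terms across all documents."""
    totals = Counter(token for tokens in docs for token in tokens)
    return {term for term, _ in totals.most_common(max_features)}

def top_k(scores, k):
    """Return the k highest-scoring (term, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

docs = [["x", "x", "y"], ["x", "z", "z"]]
vocab = limit_vocabulary(docs, max_features=2)
best = top_k({"a": 0.1, "b": 0.9, "c": 0.5}, k=2)
print(vocab, best)
```

A smaller vocabulary reduces memory use on large corpora, while top_k only trims what is reported.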

Download files

Source Distribution

wordfreq_cn-0.0.2.tar.gz (4.5 kB)

Uploaded Source

Built Distribution

wordfreq_cn-0.0.2-py3-none-any.whl (5.4 kB)

Uploaded Python 3

File details

Details for the file wordfreq_cn-0.0.2.tar.gz.

File metadata

  • Download URL: wordfreq_cn-0.0.2.tar.gz
  • Size: 4.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for wordfreq_cn-0.0.2.tar.gz
Algorithm Hash digest
SHA256 ce79750707b561a60d5f0a1ca46e176beb1a1770e5e7c309964ca87781007396
MD5 00c4224f33547665e9c4d3421615586d
BLAKE2b-256 497340ecf907d5587b7a52a9bbafdf5e56aea793d1d0479323262b708a0d10ee

File details

Details for the file wordfreq_cn-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: wordfreq_cn-0.0.2-py3-none-any.whl
  • Size: 5.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for wordfreq_cn-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 fa815db918db1bac66a1071f3e871122e2adba7178c439f6e31710c0bc41d237
MD5 af6744fd480637d37294b3b3c9c51aee
BLAKE2b-256 d12cbbc0029de15ab960f2fbd0e5bc86eee3c706bd45c7bd94648ab200047fdd
