Topic modeling with GLM embeddings
GLMTopic
GLMTopic is a Python package for topic modeling with GLM-based embeddings. It provides tools for text clustering, visualization, and analysis.
Features
- Text embedding generation using ACGE (Advanced Chinese-English General Embedding)
- UMAP-based dimensionality reduction for visualization
- HDBSCAN clustering for topic identification
- GLM-4 powered topic and keyword generation
- Visualization tools: intertopic distance maps, hierarchical clustering dendrograms, and word clouds
- Chinese language support with built-in stopwords
Installation
pip install GLMTopic
API Key Setup
GLMTopic uses ZhipuAI's GLM-4 for topic generation. You need to:
- Register at the ZhipuAI Open Platform (智谱AI开放平台)
- Create an API key in your user center
- Store your API key securely (do not expose it in your code)
Quick Start
import pandas as pd
from GLMTopic import analyze_text_clusters
# Load your data
df = pd.read_csv("your_data.csv")
# Analyze text clusters with your API key
processed_df, cluster_stats = analyze_text_clusters(
    df=df,
    api_key="YOUR_ZHIPUAI_API_KEY",  # Replace with your actual API key
    text_column="text",
    quiet=False
)
# Print cluster statistics
print(cluster_stats)
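The Quick Start assumes the CSV has a column matching text_column. As a minimal sketch (the helper name has_column is hypothetical and not part of GLMTopic), the header can be checked with the standard library before any API calls are made:

```python
import csv
import io


def has_column(csv_file, column):
    """Return True if the first row (header) of the CSV contains `column`."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # empty list if the file has no rows
    return column in [name.strip() for name in header]


# In-memory stand-in for your_data.csv, used here only for illustration.
sample = io.StringIO("id,text\n1,first document\n2,second document\n")
print(has_column(sample, "text"))
```

With a real file, the same check might be written as has_column(open("your_data.csv", newline="", encoding="utf-8"), "text"); the encoding is an assumption about your data.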
API Key Security
For security, consider:
- Using environment variables: api_key=os.environ.get("ZHIPUAI_API_KEY")
- Using a config file outside version control
- Using a secrets manager for production environments
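The environment-variable option can be sketched as below. The helper load_api_key is hypothetical (it is not part of GLMTopic); it only wraps os.environ so that a missing key fails early with a clear message instead of passing None to the API.

```python
import os


def load_api_key(env_var="ZHIPUAI_API_KEY"):
    """Read the API key from the environment; raise if it is missing or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running."
        )
    return key


# Intended use (assumption: the key is passed through unchanged):
#   analyze_text_clusters(df=df, api_key=load_api_key(), text_column="text")
```

Keeping the lookup in one function means the variable name only has to change in one place if your deployment uses a different one.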
Visualization
from GLMTopic import generate_intertopic_map
# Generate interactive topic map
fig, _ = generate_intertopic_map(
    df=cluster_stats,
    topic_col="topic",
    output_filename="topic_map.html"
)
# Display in notebook or save to file
fig.write_html("topic_map.html")
Advanced Features
Word Cloud Generation
from GLMTopic import generate_topic_wordclouds
# Generate word clouds for each topic
wordclouds = generate_topic_wordclouds(
    df=processed_df,
    text_column="text",
    topic_col="topic",
    keywords_col="keywords"
)
# Display a specific topic's word cloud (in Jupyter notebook)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
plt.imshow(wordclouds["Your Topic Name"])
plt.axis("off")
plt.show()
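Topic names are generated by GLM-4, so the exact key to use may not be known in advance. The sketch below assumes, based on the indexing shown above, that the returned object behaves like a dict keyed by topic name; pick_wordcloud is a hypothetical helper, not a GLMTopic function.

```python
def pick_wordcloud(wordclouds, topic_name):
    """Return the entry for topic_name, or raise a KeyError listing valid names."""
    try:
        return wordclouds[topic_name]
    except KeyError:
        available = ", ".join(sorted(str(name) for name in wordclouds))
        raise KeyError(
            f"Unknown topic {topic_name!r}; available topics: {available}"
        ) from None


# Stand-in mapping for illustration; in practice this would be the value
# returned by generate_topic_wordclouds.
demo = {"Pricing": "image-1", "Delivery": "image-2"}
print(pick_wordcloud(demo, "Pricing"))
```

Printing sorted(wordclouds) first is an equally simple way to see which topic names are available.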
Hierarchical Clustering
from GLMTopic import hierarchical_clustering_plot
# Generate hierarchical clustering visualization
fig = hierarchical_clustering_plot(
    df=cluster_stats,
    topic_col="topic",
    count_col="count",
    output_path="dendrogram.png"
)
Authors
- Junjie Chen (陈俊杰)
- Wenqi Liao (廖文琦)
- Weisi Chen (陈维思)
License
MIT License
File details
Details for the file glmtopic-0.1.1.tar.gz.
File metadata
- Download URL: glmtopic-0.1.1.tar.gz
- Upload date:
- Size: 29.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8ad8d491b2b53d42f47ebc52dd6751dfd1ec2dfb75d6ab4fe47b89a6090441c6 |
| MD5 | c57a6c7e3d1413305eb9c47dcf9b7ddc |
| BLAKE2b-256 | 16ee87f7eaf615b5ac06b7b1bfcb4f781f7ae500c81f90379336cfdb52b13240 |
File details
Details for the file glmtopic-0.1.1-py3-none-any.whl.
File metadata
- Download URL: glmtopic-0.1.1-py3-none-any.whl
- Upload date:
- Size: 28.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dfae8c4d9d35604ec8b9009b8e849a06d6c32e157ea7d973038d8d1d58205c30 |
| MD5 | d63725e0aa090274d1004fafcff107c4 |
| BLAKE2b-256 | 115d5296ecdabeb5463869cd6878b12a44c740e31b71005c6fde707167606a83 |