A community identification module for Reddit conversations
COMID - Community Identification Module for Reddit Conversations
COMID is a Python toolkit specifically designed for collecting and analyzing Reddit conversations. It offers powerful tools for building corpora, annotating topics, and performing temporal analyses of discussions from subreddit threads.
Features
What COMID Can Do:
- Collect conversation threads from any specified subreddit.
- Explore and preprocess collected data for detailed analysis.
- Generate a corpus based on the original content (O.C.) of Reddit threads.
- Assist in annotating topics within pre-grouped conversation clusters.
- Perform temporal analysis to track the evolution of topics over time.
- Integrate with ConvoKit for further conversational analysis.
What COMID Does Not Do:
- COMID does not perform topic modeling directly. For that, consider using the complementary hSBM Topic Model.
Documentation:
Comprehensive documentation is available for each module.
Quick Start
Installation:
To install COMID and configure its dependencies, run the following commands:
```shell
pip install comid
python -m spacy download en_core_web_sm
```
Setting Up Reddit Credentials:
To use COMID, initialize your Reddit API credentials. If you don’t already have these credentials, you can follow this guide.
```python
from comid.collector import RedditCollector
import datetime as dt

collector = RedditCollector()

# Configure Reddit API credentials
collector.config_credentials(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    password="YOUR_PASSWORD",
    username="YOUR_USERNAME"
)
```
Collecting Data
Specify Subreddit and Date Range:
Define the subreddit and the range of dates to collect conversation thread IDs:
```python
subreddit = 'digitalnomad'
start_dt = dt.datetime(2022, 1, 1)
end_dt = dt.datetime(2022, 1, 2)

# Collect thread IDs for the specified subreddit and date range
collector.search_ids_by_datetime(subreddit, start_dt, end_dt)
```
Download Submissions and Comments:
Once the IDs are collected, download all data, including original content, comments, and replies:
```python
collector.download_by_ids()
```
Exploring Data and Creating a Corpus
Load JSON Data:
After downloading the data, load it from JSON files into COMID.
```python
from comid import Comid

cm = Comid()
files = ['dataset/submissions.json', 'dataset/comments.json']
cm.load_json_files(files=files)
```
Explore Collected Data:
Perform exploratory data analysis to understand the dataset:
```python
from comid.explorer import Explorer

# Initialize the Explorer with the loaded posts
explorer = Explorer(cm.posts)

# Display a summary of the dataset
explorer.data_summary()

# Calculate interval-based thread activity (e.g., by month)
explorer.thread_interval_activity('m')

# Export statistical data to files
explorer.export_data()
```
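The interval-based activity above is computed by COMID itself. Purely as an independent illustration of the idea (not COMID's implementation), counting threads per calendar month from hypothetical creation timestamps takes only the standard library:

```python
import datetime as dt
from collections import Counter

# Hypothetical thread creation timestamps standing in for real data
created = [
    dt.datetime(2022, 1, 5),
    dt.datetime(2022, 1, 20),
    dt.datetime(2022, 2, 3),
]

# Count threads per calendar month, mirroring an interval of 'm'
activity = Counter(ts.strftime("%Y-%m") for ts in created)
print(dict(activity))  # {'2022-01': 2, '2022-02': 1}
```

The same bucketing works for days ('%Y-%m-%d') or years ('%Y') by changing the format string.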
Generate the Corpus:
Extract only the main submissions to create a corpus. This step may take a few minutes for large datasets:
```python
cm.generate_corpus()
print("Corpus size:", len(cm.corpus))
```
Reduce the Corpus:
For efficient topic modeling, it is recommended to keep the corpus size under 6,000 documents. Filter out posts with fewer than a specified number of interactions (e.g., 10 comments or replies):
```python
cm.reduce_corpus(target_size=6000, min_num_interactions=10)
print("Corpus size:", len(cm.corpus))
print("Reduced corpus size:", len(cm.corpus_reduced))
```
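COMID's `reduce_corpus` handles this internally. As a hedged sketch of the filtering idea only (the field names and the keep-most-active ordering below are assumptions, not COMID's actual data model), a reduction could look like:

```python
# Hypothetical corpus: doc id -> text and interaction count
corpus = {
    "a1": {"text": "thread one", "num_interactions": 25},
    "b2": {"text": "thread two", "num_interactions": 4},
    "c3": {"text": "thread three", "num_interactions": 12},
}

def reduce_corpus(corpus, target_size, min_num_interactions):
    """Drop low-interaction posts, keep the most active up to target_size."""
    kept = sorted(
        (item for item in corpus.items()
         if item[1]["num_interactions"] >= min_num_interactions),
        key=lambda item: item[1]["num_interactions"],
        reverse=True,
    )
    return dict(kept[:target_size])

reduced = reduce_corpus(corpus, target_size=2, min_num_interactions=10)
print(sorted(reduced))  # ['a1', 'c3']
```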
Saving and Loading Data
Save the Corpus:
Save the reduced corpus as a JSON file for compatibility with hSBM Topic Modeling:
```python
cm.save_corpus(reduced=True)
```
For details on working with hSBM, refer to the hSBM Documentation.
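The on-disk layout of the saved corpus is not documented here. Assuming it is a JSON mapping of document IDs to text (a guess for illustration only; the file actually written by `save_corpus()` may differ), it can be round-tripped and tokenized into a simple bag-of-words form with the standard library:

```python
import json
import os
import tempfile

# Assumed layout: {doc_id: document text} -- adapt to the real output
sample = {"rtsodc": "Looking for coworking spaces in Lisbon"}

path = os.path.join(tempfile.gettempdir(), "corpus_reduced.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(sample, fh)

with open(path, encoding="utf-8") as fh:
    corpus = json.load(fh)

# Simple whitespace bag-of-words per document
tokens = {doc_id: text.lower().split() for doc_id, text in corpus.items()}
print(tokens["rtsodc"][:3])  # ['looking', 'for', 'coworking']
```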
Save and Load COMID Instances:
Save the current state of COMID for later use:
```python
# Save the instance
cm.save("comid_saved_file.pickle")

# Reload the saved COMID instance
from comid import Comid
cm = Comid.load("comid_saved_file.pickle")
```
Working with Clusters and Topics
Load Topic Clusters:
After performing topic modeling with hSBM, load the generated clusters file for analysis:
```python
cluster_file = 'path_to_file/topsbm_level_1_clusters.csv'
cm.load_clusters_file(cluster_file)
cm.df_clusters.head()
```
Analyze Clusters:
Perform various operations on the topic clusters, such as viewing random samples or generating summaries:
Display Cluster Samples:
View random samples from any cluster to explore associated documents:
```python
cm.print_cluster_samples('Cluster 1', 3)
```
Retrieve Flattened Conversation Text:
Flatten the conversational data for a specific document ID:
```python
doc_id = 'rtsodc'
text = cm.retrieve_conversation(doc_id)
print(text)
```
Save Cluster Summary:
Create and save a summary of clusters, including cluster statistics and a topic label column for annotation:
```python
cm.save_clusters_summary()
```
Building and Analyzing Topics
Build the Topics DataFrame:
Once clusters have been annotated with topic labels, build a topics DataFrame:
```python
cm.build_topics("clusters_summary.csv")
cm.df_topics.head()
```
Alternatively, generate the topics DataFrame based on clusters containing a minimum percentage of documents (e.g., 7%):
```python
cm.build_topics(min_percent=7)
```
Temporal Topic Analysis:
Group topics by time intervals, such as days, weeks, months, or years:
```python
cm.group_by_period(period_type="m")
cm.df_periods.head()
```
Temporal analysis helps track the progression and evolution of topics over time.
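The structure of `df_periods` is COMID's own. As a stand-alone sketch of the same idea, topic frequencies per month can be pivoted from hypothetical (timestamp, label) pairs with the standard library:

```python
import datetime as dt
from collections import Counter, defaultdict

# Hypothetical (creation time, annotated topic label) pairs
docs = [
    (dt.datetime(2022, 1, 3), "visas"),
    (dt.datetime(2022, 1, 21), "housing"),
    (dt.datetime(2022, 2, 8), "visas"),
]

# Topic frequency per month -- the essence of grouping by period "m"
per_period = defaultdict(Counter)
for ts, topic in docs:
    per_period[ts.strftime("%Y-%m")][topic] += 1

for period in sorted(per_period):
    print(period, dict(per_period[period]))
```

A rising or falling count for a label across consecutive periods is exactly the kind of trend the temporal analysis surfaces.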
Download files
Source Distribution
Built Distribution
File details
Details for the file comid-0.0.4.tar.gz.
File metadata
- Download URL: comid-0.0.4.tar.gz
- Upload date:
- Size: 24.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0e5d83bfa1397f3c55f5df84e8e7ae72fc1c6cf0a3f823cd1c47af515e3c22fb` |
| MD5 | `c3b817632f0882388620b701e491f2a5` |
| BLAKE2b-256 | `cc4fe0724fdbbf31cffd2fd555d3e6a193f4f66fe0d7d90a35cf65067ff2cd2d` |
File details
Details for the file comid-0.0.4-py2.py3-none-any.whl.
File metadata
- Download URL: comid-0.0.4-py2.py3-none-any.whl
- Upload date:
- Size: 25.6 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e5cf1a5756d0ba930c9f96f994f3412ba7a7b4636ba356bbe7168ce6705c4d9f` |
| MD5 | `9e421a6b5312232cc422beba2488e56b` |
| BLAKE2b-256 | `45930b6ae91328866e834d84f5e72dea89820bdeb7a85f6f98cc48058ab481c7` |