A community identification module for Reddit conversations

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Utilities

Project description

COMID - Community Identification Module for Reddit Conversations

COMID is a Python toolkit for collecting and analyzing Reddit conversations. It provides tools for building corpora, annotating topics, and performing temporal analyses of discussions from subreddit threads.

Features

What COMID Can Do:

Collect conversation threads from any specified subreddit.
Explore and preprocess collected data for detailed analysis.
Generate a corpus based on the original content (O.C.) of Reddit threads.
Assist in annotating topics within pre-grouped conversation clusters.
Perform temporal analysis to track the evolution of topics over time.
Integrate with ConvoKit.

What COMID Does Not Do:

COMID does not perform topic modeling directly. For that, consider using the complementary hSBM Topic Model.

Documentation:

Comprehensive documentation is available for each module.

Quick Start

Installation:

To install COMID and configure its dependencies, run the following commands:

pip install comid
python -m spacy download en_core_web_sm

Setting Up Reddit Credentials:

To use COMID, initialize your Reddit API credentials. If you don’t already have these credentials, you can follow this guide.

from comid.collector import RedditCollector
import datetime as dt

collector = RedditCollector()

# Configure Reddit API credentials
collector.config_credentials(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    password="YOUR_PASSWORD",
    username="YOUR_USERNAME"
)
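
Hard-coding secrets in a script makes them easy to leak. The following is a minimal sketch, assuming the credentials are exported as environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical and not part of COMID; only the keyword names passed to `config_credentials` come from the example above.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_TO_KWARG = {
    "REDDIT_CLIENT_ID": "client_id",
    "REDDIT_CLIENT_SECRET": "client_secret",
    "REDDIT_PASSWORD": "password",
    "REDDIT_USERNAME": "username",
}


def load_reddit_credentials(env=None):
    """Build the credential keyword arguments from environment variables.

    Raises KeyError naming every missing variable, so a misconfigured
    shell fails early instead of at the first API call.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}


# Usage (commented out because it needs the real variables to be set):
# collector.config_credentials(**load_reddit_credentials())
```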

Collecting Data

Specify Subreddit and Date Range:

Define the subreddit and the range of dates to collect conversation thread IDs:

subreddit = 'digitalnomad'
start_dt = dt.datetime(2022, 1, 1)
end_dt = dt.datetime(2022, 1, 2)

# Collect thread IDs for the specified subreddit and date range
collector.search_ids_by_datetime(subreddit, start_dt, end_dt)
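
For longer collection periods it can help to work through the range in smaller windows, so that an interrupted run can resume from the last completed window. The sketch below uses only the standard library; `date_windows` is a hypothetical helper, not a COMID function. Whether repeated calls to `search_ids_by_datetime` accumulate IDs, and whether the end date is inclusive, is not stated here, so the call itself is left as a comment and should be checked against the COMID documentation.

```python
import datetime as dt


def date_windows(start, end, step=dt.timedelta(days=1)):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    if step <= dt.timedelta(0):
        raise ValueError("step must be positive")
    current = start
    while current < end:
        # The last window is clipped so it never runs past `end`.
        window_end = min(current + step, end)
        yield current, window_end
        current = window_end


windows = list(date_windows(dt.datetime(2022, 1, 1), dt.datetime(2022, 1, 4)))
for window_start, window_end in windows:
    print(window_start.date(), "->", window_end.date())
    # Assumed usage, one call per window:
    # collector.search_ids_by_datetime(subreddit, window_start, window_end)
```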

Download Submissions and Comments:

Once the IDs are collected, download all data, including original content, comments, and replies:

collector.download_by_ids()

Exploring Data and Creating a Corpus

Load JSON Data:

After downloading the data, load it from JSON files into COMID.

from comid import Comid

cm = Comid()
files = ['dataset/submissions.json', 'dataset/comments.json']
cm.load_json_files(files=files)

Explore Collected Data:

Perform exploratory data analysis to understand the dataset:

from comid.explorer import Explorer

# Initialize the Explorer with the loaded posts
explorer = Explorer(cm.posts)

# Display a summary of the dataset
explorer.data_summary()

# Calculate interval-based thread activity (e.g., by month)
explorer.thread_interval_activity('m')

# Export statistical data to files
explorer.export_data()
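
To make the idea of interval-based activity concrete, here is a small standalone sketch that counts threads per calendar month. It illustrates the concept only and is not how `Explorer` is implemented; the timestamps and the helper `threads_per_month` are made up for the example.

```python
from collections import Counter
import datetime as dt

# Hypothetical creation times (UTC) of four threads.
created = [
    dt.datetime(2022, 1, 3, tzinfo=dt.timezone.utc),
    dt.datetime(2022, 1, 28, tzinfo=dt.timezone.utc),
    dt.datetime(2022, 2, 14, tzinfo=dt.timezone.utc),
    dt.datetime(2022, 3, 1, tzinfo=dt.timezone.utc),
]


def threads_per_month(timestamps):
    """Count threads per calendar month, keyed by 'YYYY-MM'."""
    return Counter(ts.strftime("%Y-%m") for ts in timestamps)


activity = threads_per_month(created)
for month in sorted(activity):
    print(month, activity[month])
```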

Generate the Corpus:

Extract only the main submissions to create a corpus. This step may take a few minutes for large datasets:

cm.generate_corpus()
print("Corpus size:", len(cm.corpus))

Reduce the Corpus:

For efficient topic modeling, it is recommended to keep the corpus size under 6,000 documents. Filter out posts with fewer than a specified number of interactions (e.g., 10 comments or replies):

cm.reduce_corpus(target_size=6000, min_num_interactions=10)
print("Corpus size:", len(cm.corpus))
print("Reduced corpus size:", len(cm.corpus_reduced))
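
How `reduce_corpus` combines `target_size` and `min_num_interactions` is not described here. The sketch below shows one plausible reading, stated as an assumption: drop documents below the interaction threshold, then keep the most discussed ones up to the target size. The function `reduce_by_interactions` and the sample counts are hypothetical and are not COMID's algorithm.

```python
def reduce_by_interactions(docs, target_size, min_num_interactions):
    """Drop low-interaction documents, then cap the result at target_size.

    `docs` maps a document ID to its interaction count
    (comments plus replies).
    """
    kept = [doc_id for doc_id, n in docs.items() if n >= min_num_interactions]
    # If the filtered list is still too large, prefer the most discussed posts.
    kept.sort(key=lambda doc_id: docs[doc_id], reverse=True)
    return kept[:target_size]


# Hypothetical interaction counts for five submissions.
docs = {"a": 25, "b": 3, "c": 12, "d": 40, "e": 10}
reduced = reduce_by_interactions(docs, target_size=3, min_num_interactions=10)
print(reduced)
```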

Saving and Loading Data

Save the Corpus:

Save the reduced corpus as a JSON file for compatibility with hSBM Topic Modeling:

cm.save_corpus(reduced=True)

For details on working with hSBM, refer to the hSBM Documentation.

Save and Load COMID Instances:

Save the current state of COMID for later use:

# Save the instance
cm.save("comid_saved_file.pickle")

# Reload the saved COMID instance
from comid import Comid

cm = Comid.load("comid_saved_file.pickle")

Working with Clusters and Topics

Load Topic Clusters:

After performing topic modeling with hSBM, load the generated clusters file for analysis:

cluster_file = 'path_to_file/topsbm_level_1_clusters.csv'
cm.load_clusters_file(cluster_file)
cm.df_clusters.head()

Analyze Clusters:

Perform various operations on the topic clusters, such as viewing random samples or generating summaries:

Display Cluster Samples:

View random samples from any cluster to explore associated documents:

cm.print_cluster_samples('Cluster 1', 3)

Retrieve Flattened Conversation Text:

Flatten the conversational data for a specific document ID:

doc_id = 'rtsodc'
text = cm.retrieve_conversation(doc_id)
print(text)
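
Flattening turns a nested reply tree into a single linear text. The sketch below shows the idea with a depth-first traversal over a hypothetical nested dictionary; the field names (`author`, `text`, `replies`) and the function `flatten_conversation` are assumptions for illustration and do not reflect COMID's internal data model or output format.

```python
def flatten_conversation(post, depth=0):
    """Return the post and all nested replies as indented lines (depth-first)."""
    lines = ["  " * depth + f"{post['author']}: {post['text']}"]
    for reply in post.get("replies", []):
        lines.extend(flatten_conversation(reply, depth + 1))
    return lines


# Hypothetical thread: one submission, two top-level comments, one nested reply.
thread = {
    "author": "op",
    "text": "Best city for remote work?",
    "replies": [
        {
            "author": "u1",
            "text": "Lisbon.",
            "replies": [{"author": "op", "text": "Why Lisbon?"}],
        },
        {"author": "u2", "text": "Chiang Mai."},
    ],
}

print("\n".join(flatten_conversation(thread)))
```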

Save Cluster Summary:

Create and save a summary of clusters, including cluster statistics and a topic label column for annotation:

cm.save_clusters_summary()

Building and Analyzing Topics

Build the Topics DataFrame:

Once clusters have been annotated with topic labels, build a topics DataFrame:

cm.build_topics("clusters_summary.csv")
cm.df_topics.head()

Alternatively, generate the topics DataFrame based on clusters containing a minimum percentage of documents (e.g., 7%):

cm.build_topics(min_percent=7)
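
The percentage threshold can be pictured as a simple share calculation over cluster assignments. This is a conceptual sketch with made-up data, not COMID's implementation; `clusters_above_share` is a hypothetical helper.

```python
from collections import Counter


def clusters_above_share(assignments, min_percent):
    """Return {cluster: percent of documents} for clusters at or above min_percent."""
    counts = Counter(assignments.values())
    total = len(assignments)
    shares = {cluster: 100 * n / total for cluster, n in counts.items()}
    return {cluster: pct for cluster, pct in shares.items() if pct >= min_percent}


# Hypothetical assignment of 20 documents to three clusters.
labels = ["Cluster 1"] * 12 + ["Cluster 2"] * 7 + ["Cluster 3"] * 1
assignments = {f"doc{i}": label for i, label in enumerate(labels)}

selected = clusters_above_share(assignments, min_percent=7)
print(selected)
```

With these numbers, "Cluster 3" holds 5% of the documents and falls below the 7% threshold, so only the other two clusters are kept.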

Temporal Topic Analysis:

Group topics by time intervals, such as days, weeks, months, or years:

cm.group_by_period(period_type="m")
cm.df_periods.head()

Temporal analysis helps track the progression and evolution of topics over time.
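
As a rough picture of what grouping topics by period produces, the sketch below counts documents per topic for each calendar month. The data, the topic labels, and `topic_counts_by_month` are hypothetical; the actual layout of `cm.df_periods` may differ.

```python
from collections import defaultdict
import datetime as dt


def topic_counts_by_month(docs):
    """Map 'YYYY-MM' to {topic: number of documents} for (date, topic) pairs."""
    periods = defaultdict(lambda: defaultdict(int))
    for created, topic in docs:
        periods[created.strftime("%Y-%m")][topic] += 1
    # Convert to plain dicts, ordered by month.
    return {month: dict(topics) for month, topics in sorted(periods.items())}


# Hypothetical annotated documents: (creation date, topic label).
docs = [
    (dt.date(2022, 1, 5), "visas"),
    (dt.date(2022, 1, 20), "housing"),
    (dt.date(2022, 2, 2), "visas"),
    (dt.date(2022, 2, 9), "visas"),
]

by_month = topic_counts_by_month(docs)
for month, topics in by_month.items():
    print(month, topics)
```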


Release history

- 0.0.4 (this version): Apr 27, 2025
- 0.0.3: Feb 9, 2025
- 0.0.2: Apr 16, 2023

