A community identification module for Reddit conversations

COMID - Community Identification Module for Reddit Conversations

COMID is a Python toolkit for collecting and analyzing Reddit conversations. It provides tools for building corpora, annotating topics, and performing temporal analyses of discussions in subreddit threads.

Features

What COMID Can Do:

  • Collect conversation threads from any specified subreddit.
  • Explore and preprocess collected data for detailed analysis.
  • Generate a corpus based on the original content (O.C.) of Reddit threads.
  • Assist in annotating topics within pre-grouped conversation clusters.
  • Perform temporal analysis to track the evolution of topics over time.
  • Integrate with ConvoKit.

What COMID Does Not Do:

  • COMID does not perform topic modeling directly. For that, consider using the complementary hSBM Topic Model.

Documentation:

Comprehensive documentation is available for each module.

Quick Start

Installation:

To install COMID and download the spaCy English model it depends on, run the following commands:

pip install comid
python -m spacy download en_core_web_sm
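
To confirm that both commands succeeded, one option is to check that the packages can be found by the interpreter. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical and not part of COMID.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means COMID, spaCy, and the English model are all installed.
print(missing_packages(["comid", "spacy", "en_core_web_sm"]))
```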

Setting Up Reddit Credentials:

To use COMID, initialize your Reddit API credentials. If you don’t already have these credentials, you can follow this guide.

from comid.collector import RedditCollector
import datetime as dt

collector = RedditCollector()

# Configure Reddit API credentials
collector.config_credentials(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    password="YOUR_PASSWORD",
    username="YOUR_USERNAME"
)
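
Hard-coding secrets in a script makes them easy to leak. A minimal sketch of one alternative is to read the four values from environment variables and pass them on as keyword arguments. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `credentials_from_env` are assumptions for illustration, not part of COMID.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}

def credentials_from_env(environ=os.environ):
    """Return the Reddit credentials as a dict, or raise if any are missing."""
    missing = [var for var in ENV_KEYS.values() if not environ.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[var] for key, var in ENV_KEYS.items()}

# Usage, once the variables are set in your shell:
# collector.config_credentials(**credentials_from_env())
```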

Collecting Data

Specify Subreddit and Date Range:

Define the subreddit and the range of dates to collect conversation thread IDs:

subreddit = 'digitalnomad'
start_dt = dt.datetime(2022, 1, 1)
end_dt = dt.datetime(2022, 1, 2)

# Collect thread IDs for the specified subreddit and date range
collector.search_ids_by_datetime(subreddit, start_dt, end_dt)
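
For longer date ranges, it can be convenient to split the range into smaller windows and run the search one window at a time. The sketch below only shows how to generate such windows; `iter_windows` is a hypothetical helper, and whether `search_ids_by_datetime` accumulates IDs across repeated calls, or treats the end date as inclusive, is not stated here and should be checked before relying on it.

```python
import datetime as dt

def iter_windows(start, end, step=dt.timedelta(days=1)):
    """Yield (window_start, window_end) pairs that cover the range from start to end."""
    if step <= dt.timedelta(0):
        raise ValueError("step must be positive")
    current = start
    while current < end:
        window_end = min(current + step, end)  # the last window may be shorter
        yield current, window_end
        current = window_end

windows = list(iter_windows(dt.datetime(2022, 1, 1), dt.datetime(2022, 1, 4)))
for window_start, window_end in windows:
    print(window_start.date(), "->", window_end.date())
    # Possible use: collector.search_ids_by_datetime(subreddit, window_start, window_end)
```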

Download Submissions and Comments:

Once the IDs are collected, download all data, including original content, comments, and replies:

collector.download_by_ids()

Exploring Data and Creating a Corpus

Load JSON Data:

After downloading the data, load it from JSON files into COMID.

from comid import Comid

cm = Comid()
files = ['dataset/submissions.json', 'dataset/comments.json']
cm.load_json_files(files=files)

Explore Collected Data:

Perform exploratory data analysis to understand the dataset:

from comid.explorer import Explorer

# Initialize the Explorer with the loaded posts
explorer = Explorer(cm.posts)

# Display a summary of the dataset
explorer.data_summary()

# Calculate interval-based thread activity (e.g., by month)
explorer.thread_interval_activity('m')

# Export statistical data to files
explorer.export_data()
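
To make the idea of interval-based activity concrete, the sketch below counts items per calendar month from a list of Unix timestamps. It illustrates the concept only and is not the Explorer's implementation; the function name and the input format are assumptions.

```python
import datetime as dt
from collections import Counter

def monthly_activity(timestamps):
    """Count items per calendar month (UTC), keyed as 'YYYY-MM'."""
    counts = Counter()
    for ts in timestamps:
        moment = dt.datetime.fromtimestamp(ts, tz=dt.timezone.utc)
        counts[moment.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))

# Three creation times: 2022-01-01, 2022-01-02, and 2022-02-01 (all midnight UTC)
sample = [1640995200, 1641081600, 1643673600]
print(monthly_activity(sample))  # {'2022-01': 2, '2022-02': 1}
```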

Generate the Corpus:

Extract only the main submissions to create a corpus. This step may take a few minutes for large datasets:

cm.generate_corpus()
print("Corpus size:", len(cm.corpus))

Reduce the Corpus:

For efficient topic modeling, it is recommended to keep the corpus size under 6,000 documents. Filter out posts with fewer than a specified number of interactions (e.g., 10 comments or replies):

cm.reduce_corpus(target_size=6000, min_num_interactions=10)
print("Corpus size:", len(cm.corpus))
print("Reduced corpus size:", len(cm.corpus_reduced))
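
The reduction step can be pictured as a filter followed by a cap. The sketch below is one plausible reading of that idea, written against plain dictionaries; it is not COMID's actual algorithm, and the `num_interactions` key and the function name are assumptions made for illustration.

```python
def reduce_documents(docs, target_size, min_num_interactions):
    """Keep documents with enough interactions, most active first, up to target_size."""
    kept = [d for d in docs if d["num_interactions"] >= min_num_interactions]
    kept.sort(key=lambda d: d["num_interactions"], reverse=True)
    return kept[:target_size]

docs = [
    {"id": "a", "num_interactions": 3},
    {"id": "b", "num_interactions": 12},
    {"id": "c", "num_interactions": 25},
    {"id": "d", "num_interactions": 10},
]
reduced = reduce_documents(docs, target_size=2, min_num_interactions=10)
print([d["id"] for d in reduced])  # ['c', 'b']
```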

Saving and Loading Data

Save the Corpus:

Save the reduced corpus as a JSON file for compatibility with hSBM Topic Modeling:

cm.save_corpus(reduced=True)

For details on working with hSBM, refer to the hSBM Documentation.

Save and Load COMID Instances:

Save the current state of COMID for later use:

# Save the instance
cm.save("comid_saved_file.pickle")

# Reload the saved COMID instance
from comid import Comid

cm = Comid.load("comid_saved_file.pickle")

Working with Clusters and Topics

Load Topic Clusters:

After performing topic modeling with hSBM, load the generated clusters file for analysis:

cluster_file = 'path_to_file/topsbm_level_1_clusters.csv'
cm.load_clusters_file(cluster_file)
cm.df_clusters.head()

Analyze Clusters:

Perform various operations on the topic clusters, such as viewing random samples or generating summaries:

Display Cluster Samples:

View random samples from any cluster to explore associated documents:

cm.print_cluster_samples('Cluster 1', 3)

Retrieve Flattened Conversation Text:

Flatten the conversational data for a specific document ID:

doc_id = 'rtsodc'
text = cm.retrieve_conversation(doc_id)
print(text)
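
Flattening a conversation means walking the reply tree and emitting one linear text. The sketch below shows a depth-first version on a nested dictionary; the `text` and `replies` keys and the indentation scheme are assumptions, and COMID's own output format may differ.

```python
def flatten_thread(post, depth=0):
    """Return the post text and all nested replies as indented lines, depth-first."""
    lines = ["  " * depth + post["text"]]
    for reply in post.get("replies", []):
        lines.extend(flatten_thread(reply, depth + 1))
    return lines

thread = {
    "text": "Original post",
    "replies": [
        {"text": "First reply", "replies": [{"text": "Nested reply"}]},
        {"text": "Second reply"},
    ],
}
print("\n".join(flatten_thread(thread)))
```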

Save Cluster Summary:

Create and save a summary of clusters, including cluster statistics and a topic label column for annotation:

cm.save_clusters_summary()

Building and Analyzing Topics

Build the Topics DataFrame:

Once clusters have been annotated with topic labels, build a topics DataFrame:

cm.build_topics("clusters_summary.csv")
cm.df_topics.head()

Alternatively, generate the topics DataFrame based on clusters containing a minimum percentage of documents (e.g., 7%):

cm.build_topics(min_percent=7)
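
The percentage threshold can be understood as keeping only clusters that hold at least a given share of all documents. The sketch below applies that rule to a plain mapping of document IDs to cluster names; it illustrates the selection criterion under that assumption and is not the code behind `build_topics`.

```python
from collections import Counter

def clusters_above_share(assignments, min_percent):
    """Return {cluster: document count} for clusters holding at least min_percent of documents."""
    total = len(assignments)
    if total == 0:
        return {}
    counts = Counter(assignments.values())
    return {c: n for c, n in counts.items() if 100 * n / total >= min_percent}

# 20 documents: 15 in Cluster 1 (75%), 4 in Cluster 2 (20%), 1 in Cluster 3 (5%)
assignments = {f"doc{i}": "Cluster 1" for i in range(15)}
assignments.update({f"doc{i}": "Cluster 2" for i in range(15, 19)})
assignments["doc19"] = "Cluster 3"
print(clusters_above_share(assignments, min_percent=7))
```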

Temporal Topic Analysis:

Group topics by time intervals, such as days, weeks, months, or years:

cm.group_by_period(period_type="m")
cm.df_periods.head()

Temporal analysis helps track the progression and evolution of topics over time.
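
As a rough picture of what grouping by period involves, the sketch below counts how many documents of each topic fall into each calendar month. The input format (pairs of date and topic label) and the function name are assumptions; the structure of `cm.df_periods` may be different.

```python
import datetime as dt
from collections import Counter, defaultdict

def topics_by_month(records):
    """Map 'YYYY-MM' to per-topic document counts, given (date, topic) pairs."""
    periods = defaultdict(Counter)
    for day, topic in records:
        periods[day.strftime("%Y-%m")][topic] += 1
    return {period: dict(counts) for period, counts in sorted(periods.items())}

# Hypothetical topic labels for illustration
records = [
    (dt.date(2022, 1, 3), "visas"),
    (dt.date(2022, 1, 20), "visas"),
    (dt.date(2022, 1, 21), "coworking"),
    (dt.date(2022, 2, 5), "coworking"),
]
print(topics_by_month(records))
```

Comparing the counts for one topic across consecutive periods shows whether discussion of it is growing or fading.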
