COMID - Community Identification Module for Reddit Conversations

COMID is a Python toolkit specifically designed for collecting and analyzing Reddit conversations. It offers powerful tools for building corpora, annotating topics, and performing temporal analyses of discussions from subreddit threads.

Features

What COMID Can Do:

  • Collect conversation threads from any specified subreddit.
  • Explore and preprocess collected data for detailed analysis.
  • Generate a corpus based on the original content (O.C.) of Reddit threads.
  • Assist in annotating topics within pre-grouped conversation clusters.
  • Perform temporal analysis to track the evolution of topics over time.

What COMID Does Not Do:

  • COMID does not perform topic modeling directly. For that, consider using the complementary hSBM Topic Model.

Documentation:

Comprehensive documentation is available for each module in the project repository.

Quick Start

Installation:

To install COMID and configure its dependencies, run the following commands:

pip install comid
python -m spacy download en_core_web_sm

Setting Up Reddit Credentials:

To use COMID, first initialize your Reddit API credentials. If you don’t already have them, create a script application in Reddit’s app preferences to obtain a client ID and client secret.

from comid.collector import RedditCollector
import datetime as dt

collector = RedditCollector()

# Configure Reddit API credentials
collector.config_credentials(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    password="YOUR_PASSWORD",
    username="YOUR_USERNAME"
)

Collecting Data

Specify Subreddit and Date Range:

Define the subreddit and the range of dates to collect conversation thread IDs:

subreddit = 'digitalnomad'
start_dt = dt.datetime(2022, 1, 1)
end_dt = dt.datetime(2022, 1, 2)

# Collect thread IDs for the specified subreddit and date range
collector.search_ids_by_datetime(subreddit, start_dt, end_dt)
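Reddit stores post timestamps as UTC epoch seconds (the `created_utc` field), and COMID translates the datetime range for you. Purely as an illustration of what that range looks like in Reddit's representation (whether COMID interprets naive datetimes as UTC or local time is an assumption here):

```python
import datetime as dt

# Same range as above, pinned to UTC and converted to epoch seconds,
# the unit Reddit's created_utc field uses.
start_dt = dt.datetime(2022, 1, 1, tzinfo=dt.timezone.utc)
end_dt = dt.datetime(2022, 1, 2, tzinfo=dt.timezone.utc)

start_epoch = int(start_dt.timestamp())
end_epoch = int(end_dt.timestamp())
print(start_epoch, end_epoch)  # 1640995200 1641081600
```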

Download Submissions and Comments:

Once the IDs are collected, download all data, including original content, comments, and replies:

collector.download_by_ids()

Exploring Data and Creating a Corpus

Load JSON Data:

After downloading the data, load it from JSON files into COMID.

from comid import Comid

cm = Comid()
files = ['dataset/submissions.json', 'dataset/comments.json']
cm.load_json_files(files=files)

Explore Collected Data:

Perform exploratory data analysis to understand the dataset:

from comid.explorer import Explorer

# Initialize the Explorer with the loaded posts
explorer = Explorer(cm.posts)

# Display a summary of the dataset
explorer.data_summary()

# Calculate interval-based thread activity (e.g., by month)
explorer.thread_interval_activity('m')

# Export statistical data to files
explorer.export_data()
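To make the `'m'` argument concrete, here is a minimal standard-library sketch of what per-interval activity counting amounts to, using made-up timestamps (this illustrates the idea, not the Explorer's internals):

```python
from collections import Counter
from datetime import datetime

# Hypothetical thread creation times
created = [
    datetime(2022, 1, 5), datetime(2022, 1, 20),
    datetime(2022, 2, 3), datetime(2022, 2, 10), datetime(2022, 2, 28),
]

# Count threads per month, analogous to thread_interval_activity('m')
activity = Counter(d.strftime("%Y-%m") for d in created)
for month in sorted(activity):
    print(month, activity[month])
# 2022-01 2
# 2022-02 3
```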

Generate the Corpus:

Extract only the main submissions to create a corpus. This step may take a few minutes for large datasets:

cm.generate_corpus()
print("Corpus size:", len(cm.corpus))

Reduce the Corpus:

For efficient topic modeling, it is recommended to keep the corpus size under 6,000 documents. Filter out posts with fewer than a specified number of interactions (e.g., 10 comments or replies):

cm.reduce_corpus(target_size=6000, min_num_interactions=10)
print("Corpus size:", len(cm.corpus))
print("Reduced corpus size:", len(cm.corpus_reduced))
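One plausible reading of the reduction step, sketched with a toy corpus (hypothetical logic, not COMID's actual implementation): drop documents below the interaction threshold, then keep the most active threads until the target size is met.

```python
# Hypothetical corpus: document ID -> number of comments/replies
corpus = {"a": 3, "b": 25, "c": 10, "d": 7, "e": 12}

def reduce_corpus_sketch(corpus, target_size, min_num_interactions):
    """Illustrative only: filter by interaction count, then truncate."""
    kept = {k: v for k, v in corpus.items() if v >= min_num_interactions}
    # If still over the target, keep the most-discussed threads
    ranked = sorted(kept, key=kept.get, reverse=True)[:target_size]
    return {k: kept[k] for k in ranked}

reduced = reduce_corpus_sketch(corpus, target_size=2, min_num_interactions=10)
print(reduced)  # {'b': 25, 'e': 12}
```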

Saving and Loading Data

Save the Corpus:

Save the reduced corpus as a JSON file for compatibility with hSBM Topic Modeling:

cm.save_corpus(reduced=True)

For details on working with hSBM, refer to the hSBM Documentation.

Save and Load COMID Instances:

Save the current state of COMID for later use:

# Save the instance
cm.save("comid_saved_file.pickle")

# Reload the saved COMID instance
from comid import Comid

cm = Comid.load("comid_saved_file.pickle")

Working with Clusters and Topics

Load Topic Clusters:

After performing topic modeling with hSBM, load the generated clusters file for analysis:

cluster_file = 'path_to_file/topsbm_level_1_clusters.csv'
cm.load_clusters_file(cluster_file)
cm.df_clusters.head()

Analyze Clusters:

Perform various operations on the topic clusters, such as viewing random samples or generating summaries:

Display Cluster Samples:

View random samples from any cluster to explore associated documents:

cm.print_cluster_samples('Cluster 1', 3)

Retrieve Flattened Conversation Text:

Flatten the conversational data for a specific document ID:

doc_id = 'rtsodc'
text = cm.retrieve_conversation(doc_id)
print(text)
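Flattening here means walking the nested reply tree and joining every comment body into a single text. As an illustrative sketch of that idea (not COMID's implementation), using a hypothetical thread structure:

```python
# Hypothetical nested thread: each node has a body and a list of replies
thread = {
    "body": "Best cities for remote work?",
    "replies": [
        {"body": "Lisbon is great.", "replies": [
            {"body": "Agreed, good coworking spaces.", "replies": []},
        ]},
        {"body": "Chiang Mai is cheaper.", "replies": []},
    ],
}

def flatten(node):
    """Depth-first walk that collects every comment body in order."""
    parts = [node["body"]]
    for reply in node["replies"]:
        parts.extend(flatten(reply))
    return parts

text = "\n".join(flatten(thread))
print(text)
```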

Save Cluster Summary:

Create and save a summary of clusters, including cluster statistics and a topic label column for annotation:

cm.save_clusters_summary()

Building and Analyzing Topics

Build the Topics DataFrame:

Once clusters have been annotated with topic labels, build a topics DataFrame:

cm.build_topics("clusters_summary.csv")
cm.df_topics.head()

Alternatively, generate the topics DataFrame based on clusters containing a minimum percentage of documents (e.g., 7%):

cm.build_topics(min_percent=7)

Temporal Topic Analysis:

Group topics by time intervals, such as days, weeks, months, or years:

cm.group_by_period(period_type="m")
cm.df_periods.head()

Temporal analysis helps track the progression and evolution of topics over time.
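The kind of table this grouping produces can be sketched with the standard library, using hypothetical (month, topic) pairs in place of annotated documents (illustration only, not the structure of `df_periods`):

```python
from collections import defaultdict

# Hypothetical (period, topic label) pairs from annotated documents
labeled = [
    ("2022-01", "visas"), ("2022-01", "visas"), ("2022-01", "housing"),
    ("2022-02", "visas"), ("2022-02", "taxes"), ("2022-02", "taxes"),
]

# Count how often each topic appears in each period
periods = defaultdict(lambda: defaultdict(int))
for month, topic in labeled:
    periods[month][topic] += 1

for month in sorted(periods):
    print(month, dict(periods[month]))
```

Reading across rows of such a table shows a topic's share rising or falling from one period to the next.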
