A simple tool for generating and analyzing Reddit networks.

RNT - Reddit Network Toolkit

A simple tool for generating and extracting network objects from Reddit data sets.

Author

Jacob Rohde

Email: jarohde1@gmail.com | Twitter: @jacobrohde | GitHub: @jarohde

Features

  • Extracts a simple data set of both Reddit submissions and comments via keyword or subreddit search terms.

  • Provides single and batch subreddit- and thread-level network statistics.

  • Generates edge and node lists, and creates network objects (via NetworkX) from Reddit data sets. Networks:

    • can be directed or undirected;
    • contain subreddit node and edge attributes;
    • allow for optional text classification attributes.

General

Import the RNT library:

import rnt

Objects

  • GetRedditData()

  • GetRedditNetwork()

  • subreddit_statistics()

  • reddit_thread_statistics()

Usage

GetRedditData()

rnt.GetRedditData(search_term, 
                  search_term_is_subreddit, 
                  size, 
                  start_date, 
                  end_date)

Overview: A class object for extracting a Reddit data set.

Arguments/attributes:

search_term: The only required argument. Takes a single search term as a string or multiple search terms as a list of strings (e.g., search_term='news' or search_term=['news', 'cnn']). When extracting a subreddit data set (see 'search_term_is_subreddit' below), provide a single subreddit name as a string (e.g., 'AskReddit').

search_term_is_subreddit: Optional Boolean (True or False) argument to signify whether GetRedditData() extracts a subreddit data set; default set to False.

size: Optional integer argument to signify how many Reddit submissions and comments to extract; default set to 500 each. GetRedditData() should only be used to extract limited or exploratory data sets. I recommend using the Pushshift Reddit repository for extracting large data sets.

start_date/end_date: Optional date arguments for GetRedditData(); the default end date is the current date and the default start date is one week prior. Dates should be strings formatted as 'YYYY, MM, DD' (e.g., start_date='2022, 5, 27' for May 27, 2022).

GetRedditData().df: Object attribute; returns the extracted Reddit data set as a pandas DataFrame object.
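A minimal usage sketch based on the signature and argument descriptions above. The names rnt_date, default_window, and fetch_dataset are hypothetical helpers introduced for illustration and are not part of RNT. The date helper assumes unpadded month and day values, matching the '2022, 5, 27' example. The call to rnt.GetRedditData() needs the rnt package and network access, so it is kept inside a function that is not run here.

```python
from datetime import date, timedelta


def rnt_date(d):
    """Format a date as a 'YYYY, MM, DD' string (hypothetical helper)."""
    return f"{d.year}, {d.month}, {d.day}"


def default_window(today):
    """Mirror the documented defaults: end today, start one week earlier."""
    return rnt_date(today - timedelta(days=7)), rnt_date(today)


def fetch_dataset(search_term, start_date, end_date, size=500):
    """Hypothetical wrapper; requires the rnt package and network access."""
    import rnt  # imported lazily so the helpers above work without rnt installed

    data = rnt.GetRedditData(search_term=search_term,
                             search_term_is_subreddit=False,
                             size=size,
                             start_date=start_date,
                             end_date=end_date)
    return data.df  # pandas DataFrame, per the attribute description above


# Build the date strings for a one-week window ending on June 3, 2022.
start, end = default_window(date(2022, 6, 3))
print(start, "->", end)
```

With rnt installed, fetch_dataset('news', start, end) would then pass those strings straight through to GetRedditData().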

GetRedditNetwork()

rnt.GetRedditNetwork(reddit_dataset, 
                     edge_type, 
                     text_attribute, 
                     edge_by) 

Overview: A class object for generating edge and node lists, and a NetworkX graph object from a Reddit data set.

Arguments/attributes:

reddit_dataset: The only required argument. Takes a Reddit data set or a GetRedditData() object.

edge_type: Optional string argument of either 'directed' or 'undirected' to signify network edge type; default set to 'directed'.

text_attribute: Optional string or list argument that adds a text-based edge attribute. Each network edge is marked True or False depending on whether the Reddit submission that initiated the edge contains the provided keyword(s).

GetRedditNetwork().edge_list: Returns a pandas DataFrame of the network edge list with columns for the poster, commenter, the subreddit the edge occurred in, and an optional text attribute column.

GetRedditNetwork().node_list: Returns a pandas DataFrame of the network node list with columns for each unique node, the node's in-degree and out-degree values, and a list of subreddits the node participated in within the network.

GetRedditNetwork().graph: Returns a NetworkX graph object.
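To make the edge list and node list columns concrete, here is a plain-Python sketch of the same idea on made-up records. It is not RNT's implementation. It assumes edges run from commenter to poster and that keyword matching is case-insensitive; neither detail is stated above. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

# Toy records (made up): (poster, commenter, subreddit, submission_text)
records = [
    ("alice", "bob", "news", "CNN reports on the election"),
    ("alice", "carol", "news", "CNN reports on the election"),
    ("bob", "alice", "politics", "Weekly discussion thread"),
]

keywords = ["cnn"]  # plays the role of text_attribute


def build_edge_list(records, keywords):
    """One edge per comment; flag edges whose submission mentions any keyword."""
    edges = []
    for poster, commenter, subreddit, text in records:
        # Assumption: case-insensitive substring match.
        flagged = any(k.lower() in text.lower() for k in keywords)
        edges.append({"poster": poster, "commenter": commenter,
                      "subreddit": subreddit, "text_attribute": flagged})
    return edges


def build_node_list(edges):
    """In-/out-degree and subreddits per node, assuming commenter -> poster edges."""
    nodes = defaultdict(lambda: {"in_degree": 0, "out_degree": 0, "subreddits": set()})
    for e in edges:
        nodes[e["commenter"]]["out_degree"] += 1
        nodes[e["poster"]]["in_degree"] += 1
        nodes[e["commenter"]]["subreddits"].add(e["subreddit"])
        nodes[e["poster"]]["subreddits"].add(e["subreddit"])
    return dict(nodes)


edges = build_edge_list(records, keywords)
nodes = build_node_list(edges)
for name in sorted(nodes):
    info = nodes[name]
    print(name, info["in_degree"], info["out_degree"], sorted(info["subreddits"]))
```

In RNT itself the equivalent tables come back as pandas DataFrames through the edge_list and node_list attributes described above.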

subreddit_statistics()

rnt.subreddit_statistics(reddit_dataset, subreddit_list) 

Overview: A function for extracting basic statistics for single or batch subreddit networks. The function currently returns a single pandas DataFrame with example subreddit network statistics including number of nodes, edges, and network density, among others.

reddit_dataset: The only required argument. Takes a Reddit data set or a GetRedditData() object.

subreddit_list: An optional list argument specifying the subreddits to analyze; defaults to every unique subreddit in the data set in which Reddit submissions were published.
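The statistics named above (nodes, edges, density) can be illustrated with a short plain-Python sketch. It uses the standard density definition that NetworkX also applies: m / (n(n-1)) for directed graphs and 2m / (n(n-1)) for undirected ones. Whether RNT counts repeated edges between the same pair of users once or several times is not stated, so the sketch assumes they are counted once. The function names and toy edges are hypothetical.

```python
def network_density(n_nodes, n_edges, directed=True):
    """Density: m / (n(n-1)) if directed, 2m / (n(n-1)) if undirected."""
    if n_nodes < 2:
        return 0.0
    possible = n_nodes * (n_nodes - 1)
    return n_edges / possible if directed else 2 * n_edges / possible


def subreddit_summary(edges, subreddit_list=None):
    """Nodes, edges, and density per subreddit from (source, target, subreddit) tuples."""
    if subreddit_list is None:
        # Default: every subreddit that appears in the data.
        subreddit_list = sorted({s for _, _, s in edges})
    rows = []
    for sub in subreddit_list:
        # Assumption: duplicate edges between the same pair are counted once.
        unique_edges = {(u, v) for u, v, s in edges if s == sub}
        nodes = {n for pair in unique_edges for n in pair}
        rows.append({"subreddit": sub,
                     "nodes": len(nodes),
                     "edges": len(unique_edges),
                     "density": network_density(len(nodes), len(unique_edges))})
    return rows


# Toy directed edges (made up): (source, target, subreddit)
toy_edges = [
    ("bob", "alice", "news"),
    ("carol", "alice", "news"),
    ("alice", "bob", "politics"),
]

rows = subreddit_summary(toy_edges)
for row in rows:
    print(row)
```

Passing subreddit_list=['news'] restricts the summary to that subreddit, which mirrors the single versus batch usage described above.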

reddit_thread_statistics()

rnt.reddit_thread_statistics(reddit_dataset, reddit_thread_list)

Overview: A function for extracting basic statistics for single or batch Reddit threads (initiated by Reddit submissions). The function currently returns a single pandas DataFrame with example statistics including the number of unique commenters in the thread and the earliest and latest response times to the thread, among others.

reddit_dataset: The only required argument. Takes a Reddit data set or a GetRedditData() object.

reddit_thread_list: An optional list argument to provide the specific Reddit thread IDs (i.e., Reddit submission IDs) to analyze; default set to all unique threads in a Reddit data set.
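A plain-Python sketch of the thread-level statistics mentioned above, on made-up data. It is not RNT's implementation. It assumes that "response time" means the time elapsed between the submission and a comment; the description above does not say whether RNT reports elapsed times or raw timestamps. The thread IDs, user names, and the thread_summary function are hypothetical.

```python
from datetime import datetime

# Toy submissions (made up): thread ID -> time the submission was posted
submissions = {
    "abc123": datetime(2022, 5, 27, 12, 0),
    "def456": datetime(2022, 5, 28, 9, 0),
}

# Toy comments (made up): (thread_id, commenter, comment_time)
comments = [
    ("abc123", "bob", datetime(2022, 5, 27, 12, 5)),
    ("abc123", "carol", datetime(2022, 5, 27, 13, 30)),
    ("abc123", "bob", datetime(2022, 5, 27, 14, 0)),
    ("def456", "alice", datetime(2022, 5, 28, 9, 45)),
]


def thread_summary(submissions, comments, thread_list=None):
    """Unique commenters and earliest/latest response delay per thread."""
    if thread_list is None:
        # Default: every thread in the data.
        thread_list = sorted(submissions)
    rows = []
    for tid in thread_list:
        replies = [(who, when) for t, who, when in comments if t == tid]
        # Assumption: response time = comment time minus submission time.
        delays = [when - submissions[tid] for _, when in replies]
        rows.append({
            "thread_id": tid,
            "unique_commenters": len({who for who, _ in replies}),
            "earliest_response": min(delays) if delays else None,
            "latest_response": max(delays) if delays else None,
        })
    return rows


summary = thread_summary(submissions, comments)
for row in summary:
    print(row)
```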

Requirements

  • Python 3.XX
  • pandas - a Python library for data management.
  • NetworkX - a Python library for network analysis.
  • PMAW - a multithreaded tool for extracting Reddit data via the Pushshift API.

Support

For support, email jarohde1@gmail.com.

License

MIT


Download files

Source Distribution

rnt-0.0.11.tar.gz (9.4 kB)

Built Distribution

rnt-0.0.11-py3-none-any.whl (7.6 kB)
