A simple tool for generating and analyzing Reddit networks.

RNT - Reddit Network Toolkit

A simple tool for generating and extracting network objects from Reddit data sets.

Author

Jacob Rohde

Email: jarohde1@gmail.com | Twitter: @jacobrohde | GitHub: @jarohde

Features

  • Extracts a simple data set of Reddit submissions and their associated comments via keyword or subreddit search terms.

  • Provides single and batch subreddit- and thread-level network statistics.

  • Generates edge and node lists, and creates network objects (via NetworkX) from Reddit data sets. Networks:

    • can be directed or undirected;

    • contain subreddit node attributes;

    • contain subreddit and weight edge attributes;

    • allow for optional text classification attributes.

  • Can visualize simple graphs via Matplotlib.

  • The "url_functions" sub-package provides additional features for extracting URL information from Reddit data.

General

Current version:

0.1.7 (released 04/12/2023)

Note: The minor changes in this release reflect Pushshift's modified endpoints after the recent server migration. There are still some Pushshift bugs and timeouts that may affect how the GetRedditData() feature in this package works. I will be sure to address future bugs ASAP. Version 0.1.6 also added a sub-package for extracting URL information from Reddit data.

Import RNT library:

import rnt

Classes and functions

  • GetRedditData()

  • GetRedditNetwork()

  • subreddit_statistics()

  • reddit_thread_statistics()

  • single_network_plot()

Usage

GetRedditData()

rnt.GetRedditData(search_term, 
                  search_term_is_subreddit, 
                  size, 
                  start_date, 
                  end_date)

Overview: A class object for extracting a Reddit data set.

Arguments/attributes:

search_term: The only required argument. Takes a string as a single search term or list of strings for multiple search terms (e.g., search_term='news' or search_term=['news', 'cnn']). If extracting a subreddit data set (see 'search_term_is_subreddit' below), only provide a string of a single subreddit name (e.g., 'AskReddit').

search_term_is_subreddit: Optional Boolean (True or False) argument to signify whether GetRedditData extracts a subreddit data set; default set to False.

size: Optional integer argument to signify how many Reddit submissions (and their associated comments) to extract; default set to 500 submissions. GetRedditData should only be used to extract limited or exploratory data sets. I recommend using the Pushshift Reddit repository for extracting large data sets.

start_date/end_date: Optional date arguments for GetRedditData; default end date set to current date and default start date set to one week prior. Format should be string objects organized like 'YYYY, MM, DD' (e.g., start_date='2022, 5, 27' for May 27, 2022).
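
The date format and defaults described above can be illustrated with a small standard-library sketch. The helper names parse_date_string and default_date_range are hypothetical and are not part of RNT; the sketch only shows how a 'YYYY, MM, DD' string maps to a calendar date and what "one week prior" means for the default range.

```python
from datetime import date, timedelta


def parse_date_string(value):
    """Turn a 'YYYY, MM, DD' string into a datetime.date (hypothetical helper)."""
    year, month, day = (int(part) for part in value.split(","))
    return date(year, month, day)


def default_date_range(today=None):
    """Return (start, end): end is today, start is one week earlier."""
    end = today or date.today()
    return end - timedelta(days=7), end


start = parse_date_string("2022, 5, 27")
print(start.isoformat())  # 2022-05-27
```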

GetRedditData.df: Object attribute; extracts the Reddit data set as a pandas DataFrame object.

GetRedditData.write_data(): Object method that writes the pandas DataFrame object to file. The method can take file_type and file_name as optional arguments. file_type indicates what file format to use when writing the data set and accepts a string argument of either 'json' or 'csv'; default set to 'json'. file_name takes a string to indicate what the file name should be saved as; default set to the search term provided.

GetRedditData.extract_urls(): Object method to extract and append a list of URLs and URL domains in the Reddit data set.
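
As a rough illustration of what extracting URLs and URL domains from text involves, here is a minimal standard-library sketch. It is not RNT's implementation: the regular expression, the find_urls helper, and the trailing-punctuation rule are all assumptions made for this example.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: http(s) URLs up to the next whitespace or bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def find_urls(text):
    """Return (urls, domains) found in a piece of text (hypothetical helper)."""
    # Strip sentence punctuation that often trails a URL in running text.
    urls = [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]
    domains = [urlparse(url).netloc.lower() for url in urls]
    return urls, domains


urls, domains = find_urls(
    "Source: https://www.example.com/story?id=1, mirrored at http://example.org."
)
print(urls)
print(domains)
```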

GetRedditData.anonymize_authors(): Object method to change Reddit usernames ('author' column) to anonymized random character strings. The name of the column containing the original Reddit usernames will be changed to 'author_original'. This method takes no arguments.
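
The renaming behavior described above can be sketched on plain dictionaries instead of a DataFrame. This is an illustration under assumptions, not the package's code: the alias length, the alphabet, and the anonymize_rows helper are hypothetical. The point it shows is that each username maps to one stable random string while the original value is kept under 'author_original'.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def random_alias(length):
    """Build one random character string of the given length."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def anonymize_rows(rows, length=10):
    """Replace 'author' with a random alias; keep the name in 'author_original'."""
    aliases = {}
    used = set()
    result = []
    for row in rows:
        name = row["author"]
        if name not in aliases:
            alias = random_alias(length)
            while alias in used:  # regenerate on the (unlikely) collision
                alias = random_alias(length)
            aliases[name] = alias
            used.add(alias)
        new_row = dict(row)
        new_row["author_original"] = name
        new_row["author"] = aliases[name]
        result.append(new_row)
    return result


rows = anonymize_rows([{"author": "alice"}, {"author": "bob"}, {"author": "alice"}])
print(rows[0]["author"] == rows[2]["author"])  # True: same user, same alias
```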

GetRedditNetwork()

rnt.GetRedditNetwork(reddit_dataset, 
                     edge_type, 
                     text_attribute) 

Overview: A class object for generating edge and node lists, and a NetworkX graph object from a Reddit data set.

Arguments/attributes:

reddit_dataset: The only required argument. Takes a Reddit data set or a GetRedditData object.

edge_type: Optional string argument of either 'directed' or 'undirected' to signify network edge type; default set to directed.

text_attribute: Optional string, list, or dictionary argument to characterize an edge attribute based on one or more text categories. Each network edge is flagged True or False according to whether the Reddit submission initiating the edge contains the provided keyword(s). Providing the argument with a string or list data type will generate a single text attribute column in the edge list and NetworkX graph object. Providing the argument with a dictionary data type will generate multiple text attribute columns. Dictionary text attribute example:

text_attribute={'apples': ['fuji', 'red delicious', 'granny smith'], 
                'oranges': ['valencia', 'mandarin', 'tangerine'], 
                'berries': ['blueberry', 'raspberry', 'blackberry']}
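
To make the True/False flagging concrete, the sketch below applies a keyword dictionary like the one above to a single piece of submission text. It assumes case-insensitive substring matching and a column name of 'text_attribute' for the string/list case; both are assumptions for illustration, and classify_text is a hypothetical helper, not an RNT function.

```python
def classify_text(text, text_attribute):
    """Return {category: bool} for one text (assumes case-insensitive substrings)."""
    lowered = text.lower()
    # A string or list yields a single category; a dict yields one per key.
    if isinstance(text_attribute, str):
        text_attribute = {"text_attribute": [text_attribute]}
    elif isinstance(text_attribute, list):
        text_attribute = {"text_attribute": text_attribute}
    return {
        category: any(keyword.lower() in lowered for keyword in keywords)
        for category, keywords in text_attribute.items()
    }


categories = {
    "apples": ["fuji", "red delicious", "granny smith"],
    "oranges": ["valencia", "mandarin", "tangerine"],
    "berries": ["blueberry", "raspberry", "blackberry"],
}
flags = classify_text("I baked a Granny Smith pie with blueberry jam", categories)
print(flags)  # {'apples': True, 'oranges': False, 'berries': True}
```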

GetRedditNetwork.edge_list: Returns a pandas DataFrame of the network edge list with columns for the poster, commenter, the subreddit the edge occurred in, and an optional text attribute column.

GetRedditNetwork.node_list: Returns a pandas DataFrame of the network node list with columns for each unique node, the node's in-degree and out-degree values, and a list of subreddits the node participated in within the network.
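
The relationship between comment records, the edge list, and the node list can be sketched with the standard library. This is an illustration under assumptions rather than RNT's code: the record fields, the helper names, the commenter-to-poster edge direction, and the rule that repeated interactions add to the weight are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict

comments = [
    {"poster": "ann", "commenter": "bob", "subreddit": "news"},
    {"poster": "ann", "commenter": "bob", "subreddit": "news"},
    {"poster": "ann", "commenter": "cat", "subreddit": "news"},
    {"poster": "bob", "commenter": "ann", "subreddit": "science"},
]


def build_edge_list(records):
    """One row per (commenter, poster, subreddit); weight counts repeats."""
    counts = Counter((r["commenter"], r["poster"], r["subreddit"]) for r in records)
    return [
        {"source": s, "target": t, "subreddit": sub, "weight": w}
        for (s, t, sub), w in counts.items()
    ]


def build_node_list(edges):
    """One row per node with in-degree, out-degree, and subreddits seen."""
    in_degree, out_degree = Counter(), Counter()
    subreddits = defaultdict(set)
    for edge in edges:
        out_degree[edge["source"]] += 1
        in_degree[edge["target"]] += 1
        subreddits[edge["source"]].add(edge["subreddit"])
        subreddits[edge["target"]].add(edge["subreddit"])
    return [
        {
            "node": node,
            "in_degree": in_degree[node],
            "out_degree": out_degree[node],
            "subreddits": sorted(subreddits[node]),
        }
        for node in sorted(subreddits)
    ]


edge_list = build_edge_list(comments)
node_list = build_node_list(edge_list)
```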

GetRedditNetwork.adjacency: Returns a dictionary of network adjacency matrices. Both weighted and unweighted matrices are returned by default. The dictionary will also return weighted adjacency matrices for each optional edge-based text attribute that users identified when creating the class.
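
For readers unfamiliar with the distinction, the sketch below builds a weighted and an unweighted adjacency matrix from a small directed edge list. The dictionary keys 'weighted' and 'unweighted', the tuple layout of the edges, and the adjacency_matrices helper are assumptions for this example; the keys RNT actually uses may differ.

```python
def adjacency_matrices(edges):
    """Build weighted and unweighted matrices from (source, target, weight) tuples."""
    nodes = sorted({node for s, t, _ in edges for node in (s, t)})
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    weighted = [[0] * size for _ in range(size)]
    for source, target, weight in edges:
        # Rows are sources, columns are targets.
        weighted[index[source]][index[target]] += weight
    unweighted = [[1 if cell else 0 for cell in row] for row in weighted]
    return nodes, {"weighted": weighted, "unweighted": unweighted}


edges = [("bob", "ann", 2), ("cat", "ann", 1), ("ann", "bob", 1)]
nodes, matrices = adjacency_matrices(edges)
print(nodes)                   # ['ann', 'bob', 'cat']
print(matrices["weighted"])    # [[0, 1, 0], [2, 0, 0], [1, 0, 0]]
print(matrices["unweighted"])  # [[0, 1, 0], [1, 0, 0], [1, 0, 0]]
```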

GetRedditNetwork.graph: Returns a NetworkX graph object.

GetRedditNetwork.write_data(): Object method that writes edge_list and node_list data sets to file. The method takes file_type, file_name, and adjacency as optional arguments. file_type indicates what file format to use when writing the data sets and accepts a string argument of either 'json' or 'csv'; default set to 'json'. file_name takes a string to indicate what to append at the end of the edge and node list file names (e.g., file_name='apple' will save the files as 'edge_list_apple.json' and 'node_list_apple.json'). adjacency accepts a boolean and indicates whether to save the data sets as adjacency matrices instead of edge and node lists.
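
The file-naming convention described above (for example 'edge_list_apple.json' and 'node_list_apple.json') can be sketched with the standard library. The write_lists helper and its directory argument are hypothetical and exist only for this illustration; the sketch writes plain lists of dictionaries rather than DataFrames.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_lists(edge_list, node_list, directory, file_type, file_name):
    """Write both lists as '<prefix>_<file_name>.<file_type>' and return the paths."""
    if file_type not in ("json", "csv"):
        raise ValueError("file_type must be 'json' or 'csv'")
    paths = []
    for prefix, rows in (("edge_list", edge_list), ("node_list", node_list)):
        path = Path(directory) / f"{prefix}_{file_name}.{file_type}"
        if file_type == "json":
            path.write_text(json.dumps(rows, indent=2))
        else:
            with path.open("w", newline="") as handle:
                fieldnames = list(rows[0]) if rows else []
                writer = csv.DictWriter(handle, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(rows)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = write_lists(
        [{"source": "bob", "target": "ann"}], [{"node": "ann"}], tmp, "json", "apple"
    )
    names = [path.name for path in written]
    edges_back = json.loads(written[0].read_text())

print(names)  # ['edge_list_apple.json', 'node_list_apple.json']
```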

subreddit_statistics()

rnt.subreddit_statistics(reddit_dataset, subreddit_list) 

Overview: A function for extracting basic statistics for single or batch subreddit networks. The function currently returns a single pandas DataFrame of subreddit network statistics, including the number of nodes, the number of edges, and network density, among others.

Arguments:

reddit_dataset: The only required argument. Takes a Reddit data set or a GetRedditData object.

subreddit_list: An optional list argument to indicate the specific subreddits to compute analyses for; default set to all unique subreddits in a data set that Reddit submissions were published in.
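
As an illustration of the kind of per-subreddit statistics mentioned above, the sketch below counts nodes and edges and computes density for each subreddit in a small directed edge list. It uses the standard directed-graph density m / (n(n - 1)); whether RNT computes density exactly this way is an assumption, and subreddit_summary is a hypothetical helper.

```python
from collections import defaultdict


def subreddit_summary(edges, subreddit_list=None):
    """Return one stats row per subreddit from (source, target, subreddit) tuples."""
    grouped = defaultdict(set)
    for source, target, subreddit in edges:
        grouped[subreddit].add((source, target))
    # Default to every subreddit present in the data.
    wanted = subreddit_list if subreddit_list is not None else sorted(grouped)
    rows = []
    for subreddit in wanted:
        pairs = grouped.get(subreddit, set())
        nodes = {node for pair in pairs for node in pair}
        n, m = len(nodes), len(pairs)
        density = m / (n * (n - 1)) if n > 1 else 0.0
        rows.append({"subreddit": subreddit, "nodes": n, "edges": m, "density": density})
    return rows


edges = [
    ("bob", "ann", "news"),
    ("cat", "ann", "news"),
    ("ann", "bob", "science"),
]
stats = subreddit_summary(edges)
for row in stats:
    print(row)
```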

reddit_thread_statistics()

rnt.reddit_thread_statistics(reddit_dataset, reddit_thread_list)

Overview: A function for extracting basic statistics for single or batch Reddit threads (initiated by Reddit submissions). The function currently returns a single pandas DataFrame with example statistics including the number of unique commenters to the thread, and the earliest/latest response times to the thread, among others.

Arguments:

reddit_dataset: The only required argument. Takes a Reddit data set or a GetRedditData object.

reddit_thread_list: An optional list argument to provide the specific Reddit thread IDs (i.e., Reddit submission IDs) to analyze; default set to all unique threads in a Reddit data set.
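
The thread-level statistics named above can be sketched as follows. The field names, the thread_summary helper, and the choice to report response times as seconds elapsed since the submission are assumptions for this example, not a description of RNT's output columns.

```python
from datetime import datetime

submissions = {"t1": datetime(2022, 5, 27, 12, 0)}
comments = [
    {"thread_id": "t1", "author": "bob", "created": datetime(2022, 5, 27, 12, 5)},
    {"thread_id": "t1", "author": "cat", "created": datetime(2022, 5, 27, 13, 0)},
    {"thread_id": "t1", "author": "bob", "created": datetime(2022, 5, 27, 12, 30)},
]


def thread_summary(submissions, comments, thread_ids=None):
    """Return one stats row per thread (hypothetical helper)."""
    # Default to every thread in the data set.
    wanted = thread_ids if thread_ids is not None else sorted(submissions)
    rows = []
    for thread_id in wanted:
        replies = [c for c in comments if c["thread_id"] == thread_id]
        delays = [
            (c["created"] - submissions[thread_id]).total_seconds() for c in replies
        ]
        rows.append({
            "thread_id": thread_id,
            "unique_commenters": len({c["author"] for c in replies}),
            "earliest_response_seconds": min(delays) if delays else None,
            "latest_response_seconds": max(delays) if delays else None,
        })
    return rows


summary = thread_summary(submissions, comments)
print(summary)
```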

single_network_plot()

rnt.single_network_plot(network, **kwargs)

Overview: A simple function for plotting networks via NetworkX and Matplotlib (additional install required). Please note that this function is currently a work in progress and is meant to be a basic tool for plotting a single graph. See the NetworkX documentation for more advanced plotting needs.

Arguments:

network: The only required argument. Takes a GetRedditNetwork or NetworkX graph object.

title: Optional string argument to add a title to the plot.

pos: Optional string argument to set the NetworkX layout algorithm used to position nodes. For ease of use, the argument currently accepts one of the following layout names as a string:

  • 'spring_layout' (default)
  • 'kamada_kawai_layout'
  • 'circular_layout'
  • 'random_layout'

**kwargs: The function also accepts several other NetworkX keyword arguments for plotting (please see NetworkX documentation for more info on these arguments). Currently accepted arguments include:

  • 'arrows' (bool)
  • 'arrowsize' (int)
  • 'edge_color' (str or list/array)
  • 'font_size' (int)
  • 'node_color' (str or list/array)
  • 'node_size' (int/float or list/array)
  • 'verticalalignment' (str)
  • 'width' (int/float or list/array)
  • 'with_labels' (bool)

Requirements

  • Python 3.XX
  • numpy - a Python library for handling arrays and matrices
  • pandas - a Python library for data management
  • NetworkX - a Python library for network analysis
  • PMAW - a multithread tool for extracting Reddit data via the Pushshift API
  • Matplotlib (only if using the single_network_plot() function) - a Python library for plotting

Support

For support, email jarohde1@gmail.com.

License

MIT
