

Narrator

by Chris Lindgren <chris.a.lindgren@gmail.com>

Distributed under the BSD 3-clause license. See LICENSE.txt or http://opensource.org/licenses/BSD-3-Clause for details.

Overview

A set of functions that process and create descriptive summary visualizations to help develop a broader narrative through-line of one's tweet data.

It functions only with Python 3.x and is not backwards-compatible (although one could probably branch off a 2.x port with minimal effort).

Warning: narrator performs very little custom error-handling, so make sure your inputs are formatted properly! If you have questions, please let me know via email.

System requirements

  • ast
  • matplotlib
  • pandas
  • numpy
  • emoji
  • re

Installation

pip install narrator

Objects

Future versions of narrator will initialize and use the following objects. They are not implemented yet. More to come here.

  • topperObject: Object class with attributes that store the desired top X samples from the corpus. Object properties are as follows:
    • .top_x_hashtags:
    • .top_x_tweeters:
    • .top_x_tweets:
    • .top_x_topics:
    • .top_x_urls:
    • .top_x_rts:
    • .period_dates:

General Functions

narrator contains the following general functions:

  • initializeTO: Initializes a topperObject().
  • date_range_writer: Takes beginning date and end date to write a range of those dates per Day as a List
    • Args:
      • bd= String. Beginning date in YYYY-MM-DD format
      • ed= String. Ending date in YYYY-MM-DD format
    • Returns List of arrow date objects for later use.
  • period_writer: Accepts list of lists of period date information and returns a Dict of per Period dates for temporal analyses.
    • Args:
      • periodObj: Optional first argument, a periodObject. Default is None
      • 'ranges': Hierarchical list in following structure:
        ranges = [
        ['p1', ['2018-01-01', '2018-03-30']],
        ['p2', ['2018-04-01', '2018-06-12']],
        ...
        ]
    • Returns Dict of period dates per Day as Lists: { 'p1': ['2018-01-01', '2018-01-02', ...] }
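
The per-Day expansion described above can be sketched with the standard library alone. This is a minimal illustration, not narrator's implementation: the helper names `daily_range` and `period_days` are hypothetical, and the sketch returns plain 'YYYY-MM-DD' strings rather than the arrow date objects that date_range_writer is documented to return.

```python
from datetime import date, timedelta


def daily_range(bd, ed):
    """Return every day from bd to ed (inclusive) as 'YYYY-MM-DD' strings."""
    start = date.fromisoformat(bd)
    end = date.fromisoformat(ed)
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def period_days(ranges):
    """Map each period label to its list of per-day date strings."""
    return {label: daily_range(bd, ed) for label, (bd, ed) in ranges}


# Same hierarchical structure as the 'ranges' argument described above.
ranges = [
    ['p1', ['2018-01-01', '2018-01-03']],
    ['p2', ['2018-04-01', '2018-04-02']],
]
periods = period_days(ranges)
print(periods['p1'])  # ['2018-01-01', '2018-01-02', '2018-01-03']
```

Both endpoints are included, so a period's list always starts on its beginning date and ends on its ending date.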

Summarizer Functions

  • summarizer: Counts a column variable of interest and returns a sample data set based on set parameters. There are 5 search options from which to choose. See the 'main_sum_option' list below.
    • Args:
      • Required Options:
        • main_sum_option= String. Current options for sampling include the following:
          • 'sum_all_col': Sum of all values of the passed variable across the entire corpus
          • 'sum_group_col': Sum of a group of the passed variable's values (List) across the entire corpus
          • 'sum_single_col': Sum of a single isolated variable's value (String) across the entire corpus
          • 'single_term_per_day': Sum of a single variable per Day in the provided range
          • 'grouped_terms_perday': Sum of a group of a type of variable per Day in the provided range
        • column_type= String. Provides the type of summary to conduct.
          • 'hashtags': Searches for hashtags
          • 'urls': Searches for URLs
          • 'other': Searches for another type of content
        • df_corpus= DataFrame of tweet corpus
        • primary_col= String. Name of the primary targeted DataFrame column of interest, e.g., hashtags, urls, etc.
        • sort_check= Boolean. If True, sort sums per day.
        • sort_date_check= Boolean. If True, sort by dates.
        • sort_type= Boolean. If True, descending order. If False, ascending order.
      • Conditional Options: Based on the 'main_sum_option', these will vary in use and assignment.
        • group_search_option= String. Use to choose what search options to use for 'group_col_per_day'.
          • 'single_col': Searches for search terms in the single pertinent column
          • 'keywords_and_col': Searches for a column variable and accompanying keywords in another content column, such as 'tweets'. For example, you can search for someone's name in the corpus when it isn't always represented as a hashtag.
        • simple_list= List of terms to isolate.
        • keyed_list= List of Dicts. A keyed list of keywords to search for within the secondary_col.
        • secondary_col= String. Name of the secondary targeted DataFrame column of interest, if needed, e.g., tweets, usernames, etc.
        • single_term= String of single term to isolate.
        • time_agg_type= If sum by group temporally, define its temporal aggregation:
          • 'day': Aggregate time per Day
          • 'period': Aggregate time per period
        • date_col= String value of the DataFrame column name for the dates in xx-xx-xxxx format.
        • id_col= String value of the DataFrame column name for the unique ID.
        • grouped_output_type= String. Options for particular Dataframe output
          • consolidated= Each listed value in group is a column with its period values
          • spread= One column for each listed group value
    • Return: Depending on option, a sample as a List of Tuples or Dict of grouped samples
  • get_sample_size: Helper function for summarizer functions. If sample_check=True, the sample is sent here and returned to the summarizer for output.
    • Args:
      • sort_check= Boolean. If True, sort the corpus.
      • sort_date_check= Boolean. If True, sort corpus based on dates.
      • counted_list= List. Tallies from corpus.
      • ss= Integer of sample size to output.
      • sample_check= Boolean. If True, use ss value. If False, use full corpus.
    • Returns DataFrame to summarizer function.
  • grouper: Takes default values in 'skeleton' Dict and hydrates them with sample List of Tuples
    • Args:
      • group_type= String. Current options include 'day' or 'period'
      • listed_tuples= List of Tuples from get_sample_size().
        • Example structure is the following: [(('keyword', '01-27-2019'), 100), (...), ...]
      • skeleton= Dict. Fully hydrated skeleton dict, wherein grouper() updates its default 0 Int values.
    • Returns Dict of updated values per keyword
  • skeletor: Takes desired date range and list of keys to create a skeleton Dict before hydrating it with the sample values. Overall, this provides default 0 Int values for every keyword in the sample.
    • Args:
      • aggregate_level= String. Current options include:
        • 'day': per Day
        • 'period_day': Days per Period
        • 'period': per Period
      • date_range=
        • If 'day' aggregate level, a List of per Day dates ['2018-01-01', '2018-01-02', ...]
        • If 'period' aggregate level, a Dict of periods with respective date Lists: {'1': ['2018-01-01', '2018-01-02', ...]}
      • keys= List of keys for hydrating the Dict
    • Returns full Dict 'skeleton' with default 0 Integer values for the grouper() function
  • whichPeriod: Helper function for grouper(). Isolates what period a date is in for use.
    • Args:
      • period_dates= Dict of Lists per period
      • date= String. Date to lookup.
    • Returns String of period to grouper().
  • find_term: Helper function for accumulator(). Searches for hashtag in tweet. If there, return True. If not, return False.
    • Args:
      • search= String. Term to search for.
      • text= String. Text to search.
    • Returns Boolean
  • grouped_dict_to_df: Takes grouped Dict and outputs a DataFrame.
    • Args:
      • main_sum_option= String. Options for grouping into a Dataframe.
        • group_hash_temporal= Multiple groups of hashtags
      • grouped_output_type= String. Options for DataFrame outputs
        • spread= Good for small multiples in D3.js
        • consolidated= Good for small multiples in matplotlib
      • time_agg_type= String. Options for type of temporal grouping.
        • period= Grouped by periods
      • group_dict= Hydrated Dict to convert to a DataFrame for visualization or output
    • Returns DataFrame for use with a plotter function or output as CSV
  • accumulator: Helper function for summarizer function. Accumulates by simple lists and keyed lists.
    • Args:
      • checker= String. Options for accumulation:
        • simple: Takes values from simple_list and conducts a search on primary_col.
        • keyed: Takes values from keyed_list and conducts a search on secondary_col.
      • df_list= List. DataFrame passed as a list for traversing
      • check_list= List. List of terms to accrue and append
        • If simple, converted to List of each listed term.
        • If keyed, List of dicts, where each key is its accompanying primary_col term.
    • Returns a hydrated list of Tuples with each primary term and its accompanying date.
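
To make the skeleton-then-hydrate flow of skeletor(), whichPeriod(), and grouper() concrete, here is a minimal sketch for the 'period' aggregate level. It is not narrator's code: the function names `build_skeleton`, `which_period`, and `hydrate` are hypothetical, the nested `{keyword: {period: count}}` shape is an assumption about what the skeleton looks like, and dates use the YYYY-MM-DD form from the period examples.

```python
def build_skeleton(period_dates, keys):
    """Create {key: {period: 0}} so every key has a count for every period."""
    return {key: {period: 0 for period in period_dates} for key in keys}


def which_period(period_dates, day):
    """Return the label of the period containing day, or None if there is none."""
    for period, days in period_dates.items():
        if day in days:
            return period
    return None


def hydrate(skeleton, listed_tuples, period_dates):
    """Add each ((keyword, date), count) tally to its keyword/period slot."""
    for (keyword, day), count in listed_tuples:
        period = which_period(period_dates, day)
        if keyword in skeleton and period is not None:
            skeleton[keyword][period] += count
    return skeleton


period_dates = {
    '1': ['2018-01-01', '2018-01-02'],
    '2': ['2018-01-03', '2018-01-04'],
}
# Same ((keyword, date), count) structure as the listed_tuples argument.
tallies = [
    (('#maquin', '2018-01-01'), 3),
    (('#maquin', '2018-01-02'), 2),
    (('#noborderwall', '2018-01-04'), 7),
]
skeleton = build_skeleton(period_dates, ['#maquin', '#noborderwall'])
grouped = hydrate(skeleton, tallies, period_dates)
print(grouped)
# {'#maquin': {'1': 5, '2': 0}, '#noborderwall': {'1': 0, '2': 7}}
```

Starting from zeros means a keyword that never appears in a period still shows up with a count of 0, which keeps every line in a small-multiples chart the same length.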

Plotter Functions

  • bar_plotter: Plot the desired sum of your column sums as a bar chart
    • Args:
      • ax= Default is None. Resets the chart
      • counter= List of tuples returned from match_maker()
      • path= String of desired path to directory
      • output= String value of desired file name (.png)
    • Returns: Nothing, but outputs a matplotlib figure in your Jupyter Notebook and a .png file.
  • multiline_plotter: Plots and saves a small-multiples line chart from a returned DataFrame from the summarizer function that used the 'spread' output option
    • Modified src: https://python-graph-gallery.com/125-small-multiples-for-line-chart/
    • Args:
      • style= String. See matplotlib docs for options available, e.g. 'seaborn-darkgrid'
      • palette= String. See matplotlib docs for options available, e.g. 'Set1'
      • graph_option= String. Options for sampling will include all of the following, but for now only 'group_var_per_period' is supported:
        • 'single_var_per_day': Sum of single variable per Day in provided range
        • 'group_var_per_day': Sum of group of variable per Day in provided range
        • 'single_var_per_period': Sum of single variable per Period
        • 'group_var_per_period': Sum of group of variable per Period
      • df= DataFrame of data set to be visualized
      • x_col= DataFrame column for x-axis
      • multi_x= Integer for number of graphs along x/rows
      • multi_y= Integer for number of graphs along y/columns
        • NOTE: Only supports 3x3 right now.
      • linewidth= Float. Line width level.
      • alpha= Float (0-1). Opacity level of lines
      • chart_title= String. Title for the overall chart
      • x_title= String. Label for x axis
      • y_title= String. Label for y axis
      • path= String. Path to save figure
      • output= String. Filename for figure.
    • Returns nothing, but plots a 'small multiples' series of charts

Example Uses

Create a Dictionary of period dates

import narrator

ranges = [
    ('1', ['2018-01-01', '2018-03-30']),
    ('2', ['2018-04-01', '2018-06-12']),
    ('3', ['2018-06-13', '2018-07-28']),
    ('4', ['2018-07-29', '2018-10-17']),
    ('5', ['2018-10-18', '2018-11-24']),
    ('6', ['2018-11-25', '2018-12-10']),
    ('7', ['2018-12-11', '2018-12-19']),
    ('8', ['2018-12-20', '2018-12-25']),
    ('9', ['2018-12-26', '2019-02-13']),
    ('10', ['2019-02-14', '2019-02-28'])
]

period_dates = narrator.period_dates_writer(ranges=ranges)
period_dates['1'][:5]

## Output ##
['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05']

Use the summarizer to generate multiple types of summary outputs

The example below takes a group of hashtags, searches for them based on the period dates, then outputs these groupings in descending order. In this case, it also uses a keyword list and a hashtag list as 2 forms of input to inform the search across the corpus.

# 1. Create and assign listed values. If a search term has
# multiple variations, create a list of dictionaries and pass
# it to the summarizer() function as the "keyed_list".
liberal_keyword_list = [ 
    {
        '#felipegomez': ['felipe alonzo-gomez', 'felipe gomez']
    },
    {
        '#maquin': ['jakelin caal', 'maquín', 'maquin' ]
    }
]
liberal_hashtag_list = [
    '#familyseparation', '#familiesbelongtogether',
    '#felipegomez', '#keepfamiliestogether',
    '#maquin', '#noborderwall', '#shutdownstories',
    '#trumpshutdown', '#wherearethechildren'
]

# 2. Create Dict "skeleton" with above listed search values
# This dict is passed as the "skeleton" parameter in the 
# summarizer function
dict_group_skel = narrator.skeletor(
    aggregate_level='period',
    date_range=period_dates,
    keys=liberal_hashtag_list
)

# 3. Fill out the search parameters to return a hydrated
# pandas DataFrame.
df_sum = narrator.summarizer(
    # Required options
    column_type='hashtags',
    primary_col='hashtags',
    main_sum_option='grouped_terms_perday',
    df_corpus=df_all,
    sort_check=True, # Sort per day
    sort_date_check=False, #Do not sort by date
    sort_type=True, # Ascending (F) or descending (T)?
    # Conditional options
    group_search_option='keywords_and_col',
    simple_list=liberal_hashtag_list, # List of terms
    keyed_list=liberal_keyword_list, # List of alternative terms
    secondary_col='tweet',
    date_col='date',
    id_col='id',
    sample_check=False, # Use custom sample size (True or False)
    sample_size=None, # Custom sample size (Int or None)
    skeleton=dict_group_skel,
    time_agg_type='period',
    period_dates=period_dates,
    grouped_output_type='spread' #spread or consolidated
)
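
As a rough picture of what the 'keywords_and_col' option does with the keyed list, the sketch below maps keyword variations found in a tweet's text back to their primary hashtag. This is an assumption-laden illustration, not narrator's implementation: `match_keyed_terms` is a hypothetical helper, and it uses simple case-insensitive substring matching.

```python
def match_keyed_terms(text, keyed_list):
    """Return the primary terms whose hashtag or keyword variations appear in text."""
    lowered = text.lower()
    matches = []
    for entry in keyed_list:
        for primary, variations in entry.items():
            # Count the tweet if it has the hashtag itself or any variation.
            if primary in lowered or any(v in lowered for v in variations):
                matches.append(primary)
    return matches


keyed_list = [
    {'#felipegomez': ['felipe alonzo-gomez', 'felipe gomez']},
    {'#maquin': ['jakelin caal', 'maquín', 'maquin']},
]
tweet = 'Remembering Jakelin Caal and Felipe Gomez.'
print(match_keyed_terms(tweet, keyed_list))  # ['#felipegomez', '#maquin']
```

Substring matching is deliberately loose, so short variations can match inside longer words; a stricter search would need word-boundary checks.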


Plot a "Small Multiples" Line Chart

import colorcet as cc

narrator.multiline_plotter(
    style='tableau-colorblind10',
    palette=cc.cm.glasbey_dark,
    graph_option='group_var_per_period',
    df=df_sum,
    x_col='period',
    multi_x=3,
    multi_y=3,
    linewidth=1.9,
    alpha=0.9,
    chart_title='Liberal hashtag sums per period',
    x_title='Periods',
    y_title='# of Hashtags',
    path='figures',
    output='test_multi.png'
)


Distribution update terminal commands

# Create new distribution of code for archiving
sudo python3 setup.py sdist bdist_wheel

# Distribute to Python Package Index
python3 -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/*


Files for narrator, version 0.0.0.4

  • narrator-0.0.0.4-py3-none-any.whl (16.3 kB): Wheel, Python version py3
  • narrator-0.0.0.4.tar.gz (17.6 kB): Source
