
A set of functions that tally URLs within an event-based corpus. It assumes that you have data divided into a range of event-based periods with community-detected modules/hubs. It also assumes that you have unspooled and cleaned your URL data. See Deen Freelon's unspooler module for help: https://github.com/dfreelon/unspooler.

Project description

urlcounter

By Chris Lindgren chris.a.lindgren@gmail.com Distributed under the BSD 3-clause license. See LICENSE.txt or http://opensource.org/licenses/BSD-3-Clause for details.

Overview

urlcounter is a set of functions that tallies full URLs and domain URLs in periodic, event-defined social-media posting data. It assumes you seek answers to the following questions about link-sharing:

1. What are the top x full URLs and domain URLs from each group during each period?
2. What are the top x full URLs and domain URLs from each group-module (detected community) in each period?

To use the module, import it and follow the example below for guidance:

import urlcounter as urlc

dict_url_counts = urlc.top_urls(
    df=cdf, #DataFrame of full corpus
    periods=(1,10), #Tuple providing range of numbered periods
    hubs=(1,10), #Tuple providing range of numbered hubs
    period_dates=period_dates, #Dict of Lists with dates per period
    list_of_regex=[htg_btw,htg_fbt,htg_anti], #List of regex patterns defined for each group
    hl=hub_lists, #Dict with keyed lists of hub usernames per period
    columns=['cleaned_urls', 'retweets_count', 'hashtags', 'username', 'mentions'], #Provide a List of column names to use for search and counting
    url_sample_size=50, #Desired sample size limit, e.g., Top 50
    verbose=True #Boolean. True prints out status messages, False prints nothing
)
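The supporting inputs referenced above can be assembled as follows. This is a minimal sketch with hypothetical data; the exact nesting that top_urls() expects for hub_lists should be checked against the function documentation below:

```python
import pandas as pd

# Hypothetical corpus sample; columns mirror the `columns` argument above,
# with URL, hashtag, and mention cells stored as stringified lists
cdf = pd.DataFrame({
    'cleaned_urls': ["['https://example.com/a']", "['https://example.com/b']"],
    'retweets_count': [3, 0],
    'hashtags': ["['fbt']", "['btw']"],
    'username': ['user1', 'user2'],
    'mentions': ["['user2']", "['user1']"],
})

# Dict of Lists: the dates belonging to each numbered period
period_dates = {
    '1': ['2018-01-01', '2018-01-02'],
    '2': ['2018-01-03', '2018-01-04'],
}

# Dict with keyed lists of hub usernames per period (structure assumed here)
hub_lists = {
    '1': {'1': ['user1', 'user2']},
    '2': {'1': ['user1']},
}
```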

Example outputs

It returns a Dict keyed by user-defined group names, period ranges, and module ranges:

# Overall period-based URL summary data for keyed group name 'fbt'
## '1' = Period 1
## 'fbt_urls_per_period' and 'fbt_domains_per_period' = Keyed group name + summary total data
output['1']['fbt_urls_per_period']
output['1']['fbt_domains_per_period']

# Overall community hub-based URL summary per period data with keyed group name, 'fbt'
## '1' = Period 1
## 'fbt' = Keyed group name 'fbt'
## '1' = Community hub/module 1
## 'hub_sample_size', 'hub_tweet_sample_size','hub_url_counts','hub_domain_counts' = Summary total data
output['1']['fbt']['1']['hub_sample_size']
output['1']['fbt']['1']['hub_tweet_sample_size']
output['1']['fbt']['1']['hub_url_counts']
output['1']['fbt']['1']['hub_domain_counts']
{'1': { #Start period 1
  'fbt_urls_per_period': [ #Start period 1 totals for group keyed as 'fbt'
    ('https://twitter.com/user/status/example', 202),
    ('https://www.instagram.com/p/example/', 202),
    ...
  ],
  'fbt_domains_per_period': [
    ('twitter.com', 3003), ('instagram.com', 1001), ('facebook.com', 202)
  ], #End period 1 totals for group keyed as 'fbt'
  'fbt': { #Start period 1 hub data for group keyed as 'fbt'
    '1': { #Start period 1, module/hub 1
      'hub_sample_size': 103,
      'hub_tweet_sample_size': 486,
      'hub_url_counts': [
        ('https://example.com/politics/story-title-1/', 120),
        ('https://example.com/politics/story-title-2/', 58),
        ...
      ],
      'hub_domain_counts': [
        ('example.com', 178),
        ('example2.go.lc', 14),
        ('example3.com', 10),
        ...
      ]
    } #End period 1, module/hub 1
  }
}, #End period 1
...
}

top_urls()

Tallies URLs in the corpus.

Arguments:

  • df= DataFrame. Corpus to query from.
  • columns= a List of 5 column names (String) to reference in the DF corpus. !IMP: The order matters:
    1. Column with URLs (String) that includes a list of URLs included in post/content:
    • Example: ['https://time.com','https://and-time-again.com']. The List can also be a stringified List ('[]'), since the function converts literals.
    2. Column with number of times a post was shared (Integer), such as Retweets on Twitter.
    3. Column with group data (String), such as hashtags from tweets.
    4. Column with usernames (String), such as tweet usernames.
    5. Column with targeted content data (String), such as tweets with targeted users from a module, or a stringified list of targeted people like tweet mentions.
  • url_sample_size= Integer. Desired sample limit.
  • periods= Tuple. Contains 2 Integers, which define the range of periods, e.g., (1,10)
  • hubs= Tuple. Contains 2 Integers, which define the range of module/hubs, e.g., (1,10)
  • period_dates= Dict of Lists with dates per period: pd['1'] => ['2018-01-01','2018-01-02',...]
  • list_of_regex= List. Each item is a group's search parameter (see regex_lister()), containing:
    1. a regex String of group identifiers, such as hashtags
    2. a String key identifier for the group
  • hl= Dict. Contains lists of community-detected usernames
  • verbose= Boolean. True prints out status messages (recommended), False prints nothing

Returns:

  • Dict. See documentation for output details for data access.
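As noted in the columns argument, URL cells may arrive as stringified lists. Below is a minimal sketch of the literal conversion the documentation describes, using the standard library; the helper name is hypothetical, not part of the module:

```python
import ast

def parse_url_cell(cell):
    """Return a Python list from a cell that may be a stringified list.

    Cells that are already lists pass through unchanged.
    """
    if isinstance(cell, str):
        return ast.literal_eval(cell)
    return cell

# A stringified List, as stored in a 'cleaned_urls' column
parse_url_cell("['https://time.com','https://and-time-again.com']")
# Already a List: returned as-is
parse_url_cell(['https://time.com'])
```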

url_counter()

Helper function for top_urls(). It tallies the full URLs and domain URLs within a provided DataFrame sample.

Arguments:

  • df: DataFrame. Sample of the corpus from which to tally URLs.
  • columns: A List of column names to use from the corpus, though only the first two are used in this function:
    1. Column with URLs (String) that includes a list of URLs included in post/content.
    2. Column with number of times a post was shared (Integer), such as Retweets on Twitter.

Returns:

  • A List that includes:
    • sorted_totals: List of Tuples that each contain 2 items:
      • String. Full URL.
      • Integer. Total number of URL instances (including RTs).
    • sorted_domain_totals: List of Tuples that each contain 2 items:
      • String. Domain URL.
      • Integer. Total number of domain instances (including RTs).
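The return shape above can be approximated with a short sketch. This is not the module's implementation, only an illustration of weighted tallying under the assumption that each URL counts once for the original post plus once per share:

```python
from collections import Counter
from urllib.parse import urlparse

def tally_urls(rows):
    """Illustrative stand-in for url_counter()'s tallying.

    `rows` is a list of (url_list, share_count) pairs. Returns
    sorted (URL, total) and (domain, total) Lists of Tuples.
    """
    url_totals = Counter()
    domain_totals = Counter()
    for url_list, shares in rows:
        weight = 1 + shares  # original post plus each share/RT
        for url in url_list:
            url_totals[url] += weight
            domain_totals[urlparse(url).netloc] += weight
    return url_totals.most_common(), domain_totals.most_common()

sorted_totals, sorted_domain_totals = tally_urls([
    (['https://example.com/a'], 2),
    (['https://example.com/a', 'https://other.org/b'], 0),
])
# sorted_totals → [('https://example.com/a', 4), ('https://other.org/b', 1)]
```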

regex_lister()

Helper function for top_urls(), but it can also be used on its own to create the group regex search parameters. It transforms an incoming list of Strings into a regex String to facilitate a search.

Arguments:

  • the_list: List. Array of Strings to write as a regex String.
  • key: String. Denotes the group name

Returns:

  • keyed: Tuple containing:
    • 'key' (String) that denotes the group name
    • 'listicle' (regex String) that will be used for a search
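A sketch of how a regex_lister()-style helper could behave, assuming the pattern is a simple alternation of escaped group identifiers; the real module's pattern details may differ:

```python
import re

def regex_lister(the_list, key):
    """Join group identifiers into one alternation pattern,
    paired with the group's key (illustrative version)."""
    listicle = '|'.join(re.escape(term) for term in the_list)
    return (key, listicle)

# Hypothetical hashtag identifiers for a group keyed 'fbt'
htg_fbt = regex_lister(['#fbt', '#followbackteam'], 'fbt')
key, pattern = htg_fbt
assert re.search(pattern, 'check out #fbt today')
```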

urlcounter works only with Python 3.x and is not backwards-compatible (although one could probably branch off a 2.x port with minimal effort).

Warning: urlcounter performs no custom error-handling, so make sure your inputs are formatted properly! If you have questions, please let me know via email.

System requirements

  • pandas

Installation

pip install urlcounter

Distribution update terminal commands

# Create new distribution of code for archiving
sudo python3 setup.py sdist bdist_wheel

# Distribute to Python Package Index
python3 -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/*

Download files

Download the file for your platform.

Source Distribution

urlcounter-0.0.2.tar.gz (7.2 kB)

Uploaded Source

Built Distribution

urlcounter-0.0.2-py3-none-any.whl (8.4 kB)

Uploaded Python 3

File details

Details for the file urlcounter-0.0.2.tar.gz.

File metadata

  • Download URL: urlcounter-0.0.2.tar.gz
  • Size: 7.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.1.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.2

File hashes

Hashes for urlcounter-0.0.2.tar.gz:
  • SHA256: 2b8b6a441665f3ae99795b521bb5da903146f075856085bf4bf0f9dd9abf1e39
  • MD5: c67727621d264fcd08b4a284968d7ec1
  • BLAKE2b-256: 93d7adf8d480a0d6a5657440a6008786088c4a95d4b36cc8ef7ed5b094145443


File details

Details for the file urlcounter-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: urlcounter-0.0.2-py3-none-any.whl
  • Size: 8.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.1.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.2

File hashes

Hashes for urlcounter-0.0.2-py3-none-any.whl:
  • SHA256: a0f5cdf5876aae0bc9a920396016887264c878ddc57c339fb76d3e24547b0f0b
  • MD5: bbd990e744804fc822606fd6886f3870
  • BLAKE2b-256: c5b90b56490aa5b5e20a2bb3c33e403f131ba07fa0d23361a7f33db736e0f1d8

