A set of functions that tally URLs within an event-based corpus. It assumes that you have data divided into a range of event-based periods with community-detected modules/hubs. It also assumes that you have unspooled and cleaned your URL data. See Deen Freelon's unspooler module for help: https://github.com/dfreelon/unspooler.
Project description
urlcounter
By Chris Lindgren, chris.a.lindgren@gmail.com. Distributed under the BSD 3-Clause License. See LICENSE.txt or http://opensource.org/licenses/BSD-3-Clause for details.
Overview
urlcounter is a set of functions that tallies full and domain URLs for periodic, event-defined social-media posting data. It assumes you seek answers to the following questions about link-sharing:
1. What are the top x full URLs and domain URLs from each group during each period?
2. What are the top x full URLs and domain URLs from each group-module (detected community) in each period?
To use the module, import it and follow the example below for guidance:
import urlcounter as urlc
dict_url_counts = urlc.top_urls(
df=cdf, #DataFrame of full corpus
periods=(1,10), #Tuple providing range of numbered periods
hubs=(1,10), #Tuple providing range of numbered hubs
period_dates=period_dates, #Dict of Lists with dates per period
list_of_regex=[htg_btw,htg_fbt,htg_anti], #List of regex patterns defined for each group
hl=hub_lists, #Dict with keyed lists of hub usernames per period
columns=['cleaned_urls', 'retweets_count', 'hashtags', 'username', 'mentions'], #Provide a List of column names to use for search and counting
url_sample_size=50, #Desired sample size limit, e.g., Top 50
verbose=True #Boolean. True prints out status messages, False prints nothing
)
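The example call above assumes a handful of prepared inputs. Below is a minimal sketch of what they might look like; the shapes of period_dates and hub_lists, and the contents of the htg_* patterns, are assumptions based on the argument descriptions that follow, not a required format.

```python
import pandas as pd

# Corpus of posts, one row per post; column names match the `columns` argument above.
cdf = pd.read_csv('corpus.csv')

# Dict of Lists: the calendar dates belonging to each numbered period.
period_dates = {
    '1': ['2018-01-01', '2018-01-02', '2018-01-03'],
    '2': ['2018-02-10', '2018-02-11'],
    # ...
}

# Dict keyed by period, holding keyed lists of community-detected hub usernames.
hub_lists = {
    '1': {'1': ['user_a', 'user_b'], '2': ['user_c']},
    '2': {'1': ['user_d', 'user_e']},
    # ...
}

# Group regex patterns paired with group keys; see regex_lister() below for one
# way to generate these. The exact pattern strings here are illustrative only.
htg_btw = ('btw', '#hashtag1|#hashtag2')
htg_fbt = ('fbt', '#hashtag3|#hashtag4')
htg_anti = ('anti', '#hashtag5|#hashtag6')
```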
Example outputs
It returns a Dict keyed by user-defined group names, period ranges, and module ranges:
# Overall period-based URL summary data for the group keyed as 'fbt'
## '1' = Period 1
## 'fbt_urls_per_period' and 'fbt_domains_per_period' = group key + summary totals for that period
output['1']['fbt_urls_per_period']
output['1']['fbt_domains_per_period']
# Community hub-based URL summary data per period for the group keyed as 'fbt'
## '1' = Period 1
## 'fbt' = Keyed group name
## '1' = Community hub/module 1
## 'hub_sample_size', 'hub_tweet_sample_size', 'hub_url_counts', 'hub_domain_counts' = Summary total data
output['1']['fbt']['1']['hub_sample_size']
output['1']['fbt']['1']['hub_tweet_sample_size']
output['1']['fbt']['1']['hub_url_counts']
output['1']['fbt']['1']['hub_domain_counts']
{
'1': { #start period 1
    'fbt_domains_per_period': [ #period 1 domain totals for group keyed as 'fbt'
        ('twitter.com', 3003), ('instagram.com', 1001), ('facebook.com', 202)
    ],
    'fbt_urls_per_period': [ #period 1 URL totals for group keyed as 'fbt'
        ('https://twitter.com/user/status/example', 202),
        ('https://www.instagram.com/p/example/', 202),
        ...
    ],
    'fbt': { #period 1 hub-level results for group keyed as 'fbt'
        '1': { #module/hub 1
            'hub_domain_counts': [
                ('example.com', 178),
                ('example2.go.lc', 14),
                ('example3.com', 10),
                ...
            ],
            'hub_sample_size': 103,
            'hub_tweet_sample_size': 486,
            'hub_url_counts': [
                ('https://example.com/politics/story-title-1/', 120),
                ('https://example.com/politics/story-title-2/', 58),
                ...
            ]
        }, #end module/hub 1
        ...
    }
}, #end period 1
...
}
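With the returned Dict in hand, the keys shown above can be looped over directly. A minimal sketch, assuming the group key 'fbt', the (1,10) period and hub ranges from the example call, and that every period and hub key is present:

```python
# dict_url_counts is the Dict returned by urlc.top_urls() in the example above.
for period in range(1, 11):
    p = str(period)
    print(f"Period {p}: top 10 domains for group 'fbt'")
    for domain, count in dict_url_counts[p]['fbt_domains_per_period'][:10]:
        print(f"  {domain}: {count}")
    for hub in range(1, 11):
        h = str(hub)
        hub_data = dict_url_counts[p]['fbt'][h]
        print(f"  Hub {h}: {hub_data['hub_sample_size']} users, "
              f"{hub_data['hub_tweet_sample_size']} tweets")
        for url, count in hub_data['hub_url_counts'][:5]:
            print(f"    {url}: {count}")
```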
top_urls()
Tallies up URLs in corpus.
Arguments:
- df = DataFrame. Corpus to query from.
- columns = List of 5 column names (String) to reference in the DataFrame corpus. !IMP: The order matters:
  - Column with URLs (String) that includes a list of URLs included in the post/content, e.g., ['https://time.com','https://and-time-again.com']. The List can also be a String, '[]', since the function converts literals.
  - Column with the number of times a post was shared (Integer), such as retweets on Twitter.
  - Column with group data (String), such as hashtags from tweets.
  - Column with usernames (String), such as tweet usernames.
  - Column with target content data (String), such as tweets with targeted users from a module, or a stringified list of targeted people like tweet mentions.
- url_sample_size = Integer. Desired sample limit.
- periods = Tuple. Contains 2 Integers that define the range of periods, e.g., (1,10).
- hubs = Tuple. Contains 2 Integers that define the range of modules/hubs, e.g., (1,10).
- period_dates = Dict of Lists with dates per period, e.g., pd['1'] => ['2018-01-01','2018-01-01',...].
- list_of_regex = List. Each item contains:
  - a regex pattern built from group identifiers, such as hashtags, and
  - a String key identifier for the group.
- hl = Dict. Contains keyed lists of community-detected usernames per period.
- verbose = Boolean. True prints status messages (recommended); False prints nothing.
Returns:
- Dict. See the example outputs above for how to access the data.
url_counter()
Helper function for top_urls(). It tallies full URLs and domain URLs within the provided DataFrame.
Arguments:
- df: DataFrame. Corpus (or subset of the corpus) to tally URLs from.
- columns: A List of column names to use from the corpus, but only the first two are used in this function:
  - Name of the URL column (String) that includes a list of URLs included in the post/content.
  - Name of the column with the number of times a post was shared (Integer), such as retweets on Twitter.
Returns:
- A List that includes:
  - sorted_totals: List of Tuples that contain 2 items:
    - String. Full URL.
    - Integer. Total number of URL instances (including RTs).
  - sorted_domain_totals: List of Tuples that contain 2 items:
    - String. Domain URL.
    - Integer. Total number of URL instances (including RTs).
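To make the tallying idea concrete, the sketch below reproduces the gist of what url_counter() does rather than its actual implementation: each post's URLs are counted, weighted by the post's share count, and the same totals are aggregated by domain. How the package folds in the share count and normalizes domains is an assumption here.

```python
import ast
from collections import Counter
from urllib.parse import urlparse

import pandas as pd

def tally_urls(df: pd.DataFrame, url_col: str, share_col: str):
    """Sketch of the url_counter() idea: count full URLs and domains,
    weighting each post's URLs by 1 + its share count (e.g., retweets)."""
    url_totals, domain_totals = Counter(), Counter()
    for urls, shares in zip(df[url_col], df[share_col]):
        # The URL column may hold a real list or a stringified list, e.g. "['https://example.com']".
        if isinstance(urls, str):
            urls = ast.literal_eval(urls)
        weight = 1 + (int(shares) if pd.notna(shares) else 0)
        for url in urls:
            url_totals[url] += weight
            domain = urlparse(url).netloc
            if domain.startswith('www.'):  # crude normalization; an assumption
                domain = domain[4:]
            domain_totals[domain] += weight
    return url_totals.most_common(), domain_totals.most_common()

# Example: top full URLs and domains from the corpus loaded earlier
sorted_totals, sorted_domain_totals = tally_urls(cdf, 'cleaned_urls', 'retweets_count')
```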
regex_lister()
Helper function for top_urls(), but it can also be used on its own to create the group regex search parameters. It transforms an incoming list of Strings into a regex String to facilitate a search.
Arguments:
- the_list: List. Array of Strings to write as a regex String.
- key: String. Denotes the group name.
Returns:
- keyed: Tuple with:
  - 'key' (String) that denotes the group name
  - 'listicle' (regex String) that will be used for a search
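For example, regex_lister() can be called on its own to prepare the entries for list_of_regex before running top_urls(). A short sketch; the hashtag lists are hypothetical, and the commented pattern reflects the ('key', 'listicle') return described above rather than a guaranteed format:

```python
import urlcounter as urlc

# Hypothetical hashtag lists that identify each group
btw_tags = ['#hashtag1', '#hashtag2']
fbt_tags = ['#hashtag3', '#hashtag4']

htg_btw = urlc.regex_lister(btw_tags, 'btw')  # e.g., ('btw', '#hashtag1|#hashtag2')
htg_fbt = urlc.regex_lister(fbt_tags, 'fbt')

list_of_regex = [htg_btw, htg_fbt]
```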
urlcounter functions only with Python 3.x and is not backwards-compatible (although one could probably branch off a 2.x port with minimal effort).
Warning: urlcounter performs no custom error-handling, so make sure your inputs are formatted properly! If you have questions, please let me know via email.
System requirements
- pandas
Installation
pip install urlcounter
Distribution update terminal commands
# Create new distribution of code for archiving
sudo python3 setup.py sdist bdist_wheel
# Distribute to Python Package Index
python3 -m twine upload --repository-url https://upload.pypi.org/legacy/ dist/*
File details
Details for the file urlcounter-0.0.2.tar.gz.
File metadata
- Download URL: urlcounter-0.0.2.tar.gz
- Upload date:
- Size: 7.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.1.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2b8b6a441665f3ae99795b521bb5da903146f075856085bf4bf0f9dd9abf1e39
MD5 | c67727621d264fcd08b4a284968d7ec1
BLAKE2b-256 | 93d7adf8d480a0d6a5657440a6008786088c4a95d4b36cc8ef7ed5b094145443
File details
Details for the file urlcounter-0.0.2-py3-none-any.whl.
File metadata
- Download URL: urlcounter-0.0.2-py3-none-any.whl
- Upload date:
- Size: 8.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.1.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.6.2
File hashes
Algorithm | Hash digest
---|---
SHA256 | a0f5cdf5876aae0bc9a920396016887264c878ddc57c339fb76d3e24547b0f0b
MD5 | bbd990e744804fc822606fd6886f3870
BLAKE2b-256 | c5b90b56490aa5b5e20a2bb3c33e403f131ba07fa0d23361a7f33db736e0f1d8