cohort-analysis
Outline
A package for calculating cohort metrics from an activity stream and cohorts data table.
Uses pandas DataFrames to hold the two tables and provides a class, CohortMetrics, to:
- manage the data within these DataFrames
- filter to analyse different user cohorts
- calculate a variety of important activity metrics (such as weekly active users, monthly active users)
- calculate cohort retention metrics
Installation
To install from the command line via pip, do:
pip install cohort-analysis
To upgrade to the latest version via pip, do:
pip install cohort-analysis --upgrade
To use via pipenv put the following in your Pipfile:
[packages]
cohort-analysis = ">=0.0.1"
Getting Started
To start, simply
import cohort_analysis
The principal interface for this library is via cohort_analysis.metrics (an instance of cohort_analysis.cohort_metrics.CohortMetrics).
We have used pandas-flavor to add a .cohorts accessor to pandas DataFrames; note that the above import statement will apply this to all DataFrames.
Preparing the interface
You will need to set up the clickstream and cohorts tables for this interface. There are two ways:
- via methods
cohort_analysis.metrics.set_clickstream(df1)
cohort_analysis.metrics.set_cohorts(df2)
- via the DataFrame accessor we set up for convenience
df1.cohorts.set_as_clickstream(cohort_analysis.metrics)
df2.cohorts.set_as_cohorts(cohort_analysis.metrics)
The clickstream table should have two columns, user_id and timestamp (other columns will be ignored). Every row should record a 'click' (an event of some kind), indicating the responsible user's ID and the timestamp at which it occurred. If your DataFrame has these columns under different names (col1, col2 in the example below), you can indicate that when setting the clickstream as follows:
cohort_analysis.metrics.set_clickstream(df1, timestamp_col="col1", user_id_col="col2")
The cohorts table has two mandatory columns, user_id and reference_timestamp, and is expected, though not required, to have additional columns that can be used to define user cohorts. If your DataFrame has these columns under different names (colA, colB in the example below), you can indicate that when setting the cohorts as follows:
cohort_analysis.metrics.set_cohorts(df2, reference_timestamp_col="colA", user_id_col="colB")
The user_id columns in the two tables ought to correspond, i.e. IDs in clickstream.user_id should be present in cohorts.user_id (any that are not will be ignored).
Direct interface creation
Alternatively, a new instance of cohort_analysis.cohort_metrics.CohortMetrics specifying the tables to use can be created as follows:
metrics = cohort_analysis.CohortMetrics(clickstream=df1, cohorts=df2)
The supplied DataFrames should already have the correct column names as detailed in the previous section.
Timezone support
The default behaviour of the interface is to convert/localise all timestamps to UTC.
If your data is in a different timezone, you should change the CohortMetrics instance's timezone as follows:
cohort_analysis.metrics.change_timezone("CET")
This should be done before setting the clickstream and cohorts table if those contain timezone-naive timestamps that actually belong to a timezone other than UTC.
If creating the interface directly, pass a timezone parameter instead:
metrics = cohort_analysis.CohortMetrics(clickstream=df1, cohorts=df2, timezone="CET")
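The difference between converting and localising is why the ordering advice above matters. The following is a minimal plain-Python sketch of that distinction, not the library's own code: the helper to_instance_timezone is hypothetical, and a fixed UTC+1 offset stands in for "CET" (daylight saving is ignored).

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+1 offset standing in for "CET" (an assumption; ignores daylight saving).
CET = timezone(timedelta(hours=1), name="CET")


def to_instance_timezone(ts: datetime, tz: timezone) -> datetime:
    """Hypothetical helper: localise a naive timestamp or convert an aware one."""
    if ts.tzinfo is None:
        # Naive: the wall-clock reading is kept and simply labelled with tz.
        return ts.replace(tzinfo=tz)
    # Aware: the instant is kept and the wall-clock reading is shifted.
    return ts.astimezone(tz)


naive = datetime(2020, 5, 3, 12, 0)
aware_utc = datetime(2020, 5, 3, 12, 0, tzinfo=timezone.utc)

localised = to_instance_timezone(naive, CET)      # reads 12:00 in UTC+1
converted = to_instance_timezone(aware_utc, CET)  # reads 13:00 in UTC+1
```

A naive 12:00 that is localised while the instance is still on UTC would be treated as 12:00 UTC, which is an hour away from the intended instant if the data was recorded in CET.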
Changing the current time
At initialisation, the current time in the chosen timezone is set as a reference point for calculations. Actions in the clickstream at future times and users with reference timestamps in the future will be filtered out.
However, the current_timestamp property can be changed as follows:
metrics.change_current_timestamp("now")
Valid inputs are:
- "now" (change the current_timestamp property to the current time in the chosen timezone)
- "last" (change current_timestamp to the latest entry in clickstream.timestamp and cohorts.reference_timestamp)
- any pandas.Timestamp (will be converted/localised to the instance's timezone)
- any parseable datetime string (will be set to the instance's timezone)
This allows the user to view the data as it would have appeared at points in the past, handle batched data that doesn't include the latest information, set up deterministic tests etc.
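As a rough illustration of what a movable current timestamp means for the data, here is a plain-Python sketch. The helpers resolve_current_timestamp and visible_clicks are hypothetical, the sample data is made up, and treating a click exactly at the cutoff as visible is an assumption.

```python
from datetime import datetime

# Made-up sample data: (user_id, timestamp) clicks and user_id -> reference_timestamp.
clicks = [
    ("u1", datetime(2020, 5, 1, 9, 0)),
    ("u2", datetime(2020, 5, 2, 9, 0)),
    ("u1", datetime(2020, 5, 9, 9, 0)),
]
reference_timestamps = {"u1": datetime(2020, 4, 30), "u2": datetime(2020, 5, 2)}


def resolve_current_timestamp(value, clicks, reference_timestamps):
    """Hypothetical helper: turn "last" or an explicit datetime into a cutoff."""
    if value == "last":
        latest_click = max(ts for _, ts in clicks)
        latest_reference = max(reference_timestamps.values())
        return max(latest_click, latest_reference)
    return value


def visible_clicks(clicks, cutoff):
    """Keep only clicks at or before the cutoff (boundary handling is an assumption)."""
    return [(user, ts) for user, ts in clicks if ts <= cutoff]


as_of = datetime(2020, 5, 3)
past_view = visible_clicks(clicks, as_of)
last_cutoff = resolve_current_timestamp("last", clicks, reference_timestamps)
```

Pinning the cutoff to a fixed value in this way is what makes test results reproducible regardless of when the tests run.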
Inspecting the interface
Instances of CohortMetrics have a string representation implemented. If the clickstream and cohorts tables are both set up, print(cohort_analysis.metrics) will display a summary of the two tables derived from the DataFrames' own string representations. This can be used for quick inspection of the data. If one or both tables are not yet set up, the representation will indicate this instead.
Basic Usage
Once the CohortMetrics object has been prepared, it can be used to derive a variety of activity and cohort retention metrics from the clickstream and cohorts tables.
Methods will typically return a pandas DataFrame.
Filtering by cohort
A copy of the two tables, filtered to remove 'clicks' from users not in the cohorts table, or with timestamps before the associated user's reference_timestamp, may be returned as follows:
clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts()
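The two removal rules described above can be sketched in plain Python as follows. This is an illustration only, with made-up data and a hypothetical helper drop_invalid_clicks; keeping a click that falls exactly on the reference timestamp is an assumption.

```python
from datetime import datetime

# Made-up sample data: user_id -> reference_timestamp.
sample_cohorts = {"u1": datetime(2020, 5, 1), "u2": datetime(2020, 5, 10)}
sample_clickstream = [
    ("u1", datetime(2020, 5, 2)),   # kept
    ("u2", datetime(2020, 5, 5)),   # dropped: before u2's reference timestamp
    ("u3", datetime(2020, 5, 6)),   # dropped: u3 is not in the cohorts table
    ("u2", datetime(2020, 5, 12)),  # kept
]


def drop_invalid_clicks(clickstream, cohorts):
    """Hypothetical helper applying both removal rules to (user_id, timestamp) pairs."""
    return [
        (user, ts)
        for user, ts in clickstream
        if user in cohorts and ts >= cohorts[user]
    ]


filtered = drop_invalid_clicks(sample_clickstream, sample_cohorts)
```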
Additional filtering may be applied by passing a dictionary to this method. Here are some examples:
- Filter by a column in the cohorts table, selecting activity and user data only for (in this example) users who have the value "United Kingdom" in the country column of the cohorts table:
cohort_dict = {"country": "United Kingdom"}
clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)
- Filter by a column in the cohorts table, selecting activity and user data only for users with one of a number of values in a column (in this example, "Brazil" or "Canada" in the country column of the cohorts table):
cohort_dict = {"country": ["Brazil", "Canada"]}
clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)
- Filter by values in multiple columns of the cohorts table, selecting activity and user data only for users who (in this example) have the value "India" in the country column of the cohorts table and the value "Social Media" in the acquisition_channel column:
cohort_dict = {"country": "India", "acquisition_channel": "Social Media"}
clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)
Lists of values may also be passed, as in the previous example.
The logic for combining across columns can be changed: 'AND' logic is used by default, but calling cohort_analysis.metrics.change_filter_logic("OR") beforehand will use 'OR' logic instead. 'NOT' logic can also be used to exclude all the values provided via cohort_dict.
- Filter by the reference_timestamp column in the cohorts table, selecting activity and user data only for users with a reference timestamp during a particular time period (in this example, Q2 2020):
cohort_dict = {"calendar": "2020-Q2"}
clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)
This kind of filter can be combined with other filters and may have multiple values passed, just as above. Using the key "reference_timestamp" instead of "calendar" will work in exactly the same way.
Possible time periods include:
- years ("2020")
- quarters ("2020-Q1")
- months ("2020-05")
- weeks ("2020-W-SUN-5", the fifth week concluding on a Sunday in 2020)
- days ("2020-05-03")
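To make the period strings concrete, the sketch below maps the year, quarter, month and day forms to their first and last calendar days using only the standard library. The function period_bounds is hypothetical, quarters are assumed to follow a January year start, and the week form is deliberately left unhandled because its exact numbering rule is not spelled out here.

```python
import re
from datetime import date, timedelta


def _month_end(year: int, month: int) -> date:
    """Last day of a month: the day before the first of the next month."""
    next_first = date(year + month // 12, month % 12 + 1, 1)
    return next_first - timedelta(days=1)


def period_bounds(period: str):
    """Hypothetical helper: (first_day, last_day) for a period string."""
    if m := re.fullmatch(r"(\d{4})", period):
        year = int(m[1])
        return date(year, 1, 1), date(year, 12, 31)
    if m := re.fullmatch(r"(\d{4})-Q([1-4])", period):
        year, quarter = int(m[1]), int(m[2])
        first_month = 3 * (quarter - 1) + 1  # assumes quarters start in January
        return date(year, first_month, 1), _month_end(year, first_month + 2)
    if m := re.fullmatch(r"(\d{4})-(\d{2})", period):
        year, month = int(m[1]), int(m[2])
        return date(year, month, 1), _month_end(year, month)
    if m := re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", period):
        day = date(int(m[1]), int(m[2]), int(m[3]))
        return day, day
    raise ValueError(f"unsupported period string: {period!r}")


q2_start, q2_end = period_bounds("2020-Q2")
```

A user would then belong to the "2020-Q2" cohort in this sketch if the date of their reference timestamp falls between q2_start and q2_end inclusive.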
This method is used internally by other methods that generate metrics.
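The way a filter dictionary combines across columns can be illustrated with a small plain-Python sketch. The function matches is hypothetical and works on a single user row represented as a dict; in particular, reading 'NOT' as "the row matches none of the listed values" is an assumption about the behaviour described above.

```python
def matches(user_row: dict, cohort_dict: dict, logic: str = "AND") -> bool:
    """Hypothetical helper: does one cohorts-table row pass the filter?"""
    hits = []
    for column, allowed in cohort_dict.items():
        # A single value and a list of values are treated uniformly.
        values = allowed if isinstance(allowed, list) else [allowed]
        hits.append(user_row.get(column) in values)
    if logic == "AND":
        return all(hits)
    if logic == "OR":
        return any(hits)
    if logic == "NOT":
        # Assumption: exclude rows matching any of the provided values.
        return not any(hits)
    raise ValueError(f"unknown filter logic: {logic!r}")


row = {"country": "India", "acquisition_channel": "Email"}
cohort_dict = {"country": ["India", "Brazil"], "acquisition_channel": "Social Media"}

passes_and = matches(row, cohort_dict, "AND")  # country matches, channel does not
passes_or = matches(row, cohort_dict, "OR")
passes_not = matches(row, cohort_dict, "NOT")
```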
Cohort Activity Metrics
The cohort-analysis library supports two different types of activity metric: activity in calendar intervals and activity in rolling intervals.
The methods for computing activity in calendar intervals are as follows:
dau_calendar_day
wau_calendar_week
mau_calendar_month
qau_calendar_quarter
yau_calendar_year
These methods all optionally accept a cohort filter dictionary via the input cohort=cohort_dict.
Each of these methods returns a DataFrame with a PeriodIndex covering the full time range of the clickstream table's timestamp column and the cohorts table's reference_timestamp column. It has the following columns:
cohort_size
n_active
?au (dau, wau, mau, qau, yau respectively)
By default, the cohort size is the number of users with reference timestamps prior to the end of each period in the index. This behaviour can be changed to calculate the number of users with reference timestamps prior to the start of each period in the index by passing cohort_size_at="start" to any of the methods.
Methods apart from dau_calendar_day and mau_calendar_month accept an additional argument, since weeks, quarters and years are not uniquely defined (although there are sensible defaults that we implement).
- wau_calendar_week accepts the optional argument week_commencing. This accepts any day of the week as a value. The default is "Monday".
- qau_calendar_quarter and yau_calendar_year both accept the optional argument year_commencing. This accepts any month of the year as a value. The default is "January".
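The calendar-interval calculation and the effect of cohort_size_at can be sketched for the monthly case in plain Python. This is not the library's implementation: the function mau_by_calendar_month below is a hypothetical stand-in working on simple tuples and dicts, it assumes the rate column is n_active divided by cohort_size, and it assumes only users already counted in the cohort can be counted as active.

```python
from datetime import datetime


def mau_by_calendar_month(clickstream, cohorts, months, cohort_size_at="end"):
    """Hypothetical sketch: one row per (year, month) with size, active count and rate."""
    rows = []
    for year, month in months:
        start = datetime(year, month, 1)
        # First instant of the following month (exclusive upper bound).
        end = datetime(year + month // 12, month % 12 + 1, 1)
        boundary = end if cohort_size_at == "end" else start
        # Cohort members: users whose reference timestamp precedes the boundary.
        members = {user for user, ref in cohorts.items() if ref < boundary}
        # Active members: at least one click inside the calendar month.
        active = {
            user for user, ts in clickstream
            if start <= ts < end and user in members
        }
        rows.append({
            "period": f"{year}-{month:02d}",
            "cohort_size": len(members),
            "n_active": len(active),
            # Assumption: the rate column is n_active / cohort_size.
            "mau": len(active) / len(members) if members else None,
        })
    return rows


# Made-up sample data.
example_cohorts = {
    "u1": datetime(2020, 1, 10),
    "u2": datetime(2020, 2, 15),
    "u3": datetime(2020, 2, 20),
}
example_clicks = [
    ("u1", datetime(2020, 1, 12)),
    ("u1", datetime(2020, 2, 3)),
    ("u2", datetime(2020, 2, 16)),
]
months = [(2020, 1), (2020, 2)]

by_end = mau_by_calendar_month(example_clicks, example_cohorts, months)
by_start = mau_by_calendar_month(example_clicks, example_cohorts, months, cohort_size_at="start")
```

With this data, February has a cohort size of 3 when measured at the end of the month and 1 when measured at the start, because two of the three users joined during February.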
The methods for computing activity in rolling windows are as follows:
rolling_wau
rolling_mau
rolling_qau
These methods all optionally accept a cohort filter dictionary via the input cohort=cohort_dict.
Each of these methods returns a DataFrame with a DatetimeIndex covering the full time range of the clickstream table's timestamp column and the cohorts table's reference_timestamp column. It has the following columns:
cohort_size
n_active
?au (wau, mau, qau respectively)
For each date in the index, these metrics are computed for an N day window whose final day is indicated by the index. For WAU, N=7. For MAU, N=28. For QAU, N=91. The latter two values are chosen due to being multiples of 7 (this eliminates weekly cycles in user activity from the metrics).
By default, the cohort size is the number of users with reference timestamps prior to the end of each window whose final day is indicated by the index. This behaviour can be changed to calculate the number of users with reference timestamps prior to the start of each window by passing cohort_size_at="start" to any of the methods.
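The rolling-window idea, and the reason for preferring window lengths that are multiples of 7, can be illustrated with the standard library alone. Both functions below are hypothetical sketches on made-up data rather than the library's code.

```python
from collections import Counter
from datetime import date, timedelta


def rolling_active_users(click_dates, day, n_days=7):
    """Users with at least one click in the n_days-day window ending on `day`."""
    window_start = day - timedelta(days=n_days - 1)
    return {user for user, clicked_on in click_dates if window_start <= clicked_on <= day}


def weekday_counts(day, n_days):
    """How many times each weekday (0=Monday) occurs in the window ending on `day`."""
    return Counter((day - timedelta(days=i)).weekday() for i in range(n_days))


# Made-up sample data: (user_id, date of click).
click_dates = [
    ("u1", date(2020, 5, 1)),
    ("u2", date(2020, 5, 4)),
    ("u1", date(2020, 5, 10)),
]

active_on_7th = rolling_active_users(click_dates, date(2020, 5, 7))  # window 1 May to 7 May
active_on_8th = rolling_active_users(click_dates, date(2020, 5, 8))  # window 2 May to 8 May

balanced = weekday_counts(date(2020, 5, 31), 28)    # every weekday appears 4 times
unbalanced = weekday_counts(date(2020, 5, 31), 30)  # two weekdays appear 5 times
```

A 28-day window always contains each weekday exactly four times, so the metric does not rise and fall depending on whether the window happens to include an extra weekend; a 30-day window does not have this property.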
Cohort Retention Rates
A cohort retention rate is the active rate of a user cohort during a specified time interval since reference_timestamp. The current implementation of CohortMetrics has three different types of retention: fixed-interval, unbounded-interval and wmqy-interval. The code below shows the usage on a cohorts DataFrame that has a country column.
retention_type = 'unbounded'
retention_cohorts = [{"country": "Canada"}, {"country": "Brazil"}]
interval_type = "week"
metrics = cohort_analysis.metrics
metrics.cohort_retention(retention_type, retention_cohorts, interval_type)
# returns a DataFrame indexed by week (0 to latest), with one column per given cohort
Here retention_cohorts is a list of cohort filtering dictionaries. This allows the end user to easily compare retention metrics for different cohorts.
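The retention types are not defined precisely here, so the following plain-Python sketch relies on common industry usage rather than on this library's code: a per-interval rate counts users active in exactly week k after their reference timestamp, and an "unbounded" rate counts users active in week k or any later week. The helpers active_weeks and retention are hypothetical, and the denominator is simply every user in the cohort.

```python
from datetime import datetime


def active_weeks(clickstream, cohorts):
    """Map each cohort user to the week numbers (since reference) in which they clicked."""
    weeks = {user: set() for user in cohorts}
    for user, ts in clickstream:
        if user in cohorts and ts >= cohorts[user]:
            weeks[user].add((ts - cohorts[user]).days // 7)
    return weeks


def retention(weeks, k, unbounded=False):
    """Share of the cohort active in week k (or in week k or later, if unbounded)."""
    if not weeks:
        return None
    if unbounded:
        retained = [user for user, seen in weeks.items() if any(w >= k for w in seen)]
    else:
        retained = [user for user, seen in weeks.items() if k in seen]
    return len(retained) / len(weeks)


# Made-up sample data: both users share a reference timestamp of 1 May 2020.
demo_cohorts = {"u1": datetime(2020, 5, 1), "u2": datetime(2020, 5, 1)}
demo_clicks = [
    ("u1", datetime(2020, 5, 1, 8, 0)),    # week 0
    ("u1", datetime(2020, 5, 16, 10, 0)),  # week 2
    ("u2", datetime(2020, 5, 2, 9, 0)),    # week 0
]

weeks = active_weeks(demo_clicks, demo_cohorts)
week1_exact = retention(weeks, 1)
week1_unbounded = retention(weeks, 1, unbounded=True)
```

In this data nobody clicks during week 1, so the exact rate is zero, while the unbounded rate still counts u1 because of the later click in week 2.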
Development
Once you've cloned the repository and navigated to it, the best way to work on the package is to use the pipenv virtual environment:
- Make sure that you have the pipenv library: pip install pipenv --upgrade.
- In the top level directory, /cohort-analysis, run pipenv install --dev (installs virtual environment with development tools).
- Launch pytest-watch in the virtual environment using pipenv run ptw. Edit code at your leisure; the test suite will run whenever you save your work.
- Other quality assurance checks can be run locally:
  - use pipenv run coverage to run the tests and ensure sufficient test coverage
  - use pipenv run mypy for static type-checking
  - use pipenv run lint-fix to format the code
Contributors
- Paddy Alton (paddy.alton@apolitical.co)
- Charlotte Crabb (charlotte.crabb@apolitical.co)
- Ashia Ogunlade (ashia.ogunlade@apolitical.co)
- CY Yang (cy.yang@apolitical.co)
(with thanks to the Apolitical engineering and data teams for assistance and review)