A package for calculating cohort metrics from an activity stream and cohorts data table.
cohort-analysis
Outline
The package uses pandas DataFrames to hold the two tables (an activity stream and a cohorts table) and provides a class, CohortMetrics, to:
- manage the data within these DataFrames
- filter to analyse different user cohorts
- calculate a variety of important activity metrics (such as weekly active users, monthly active users)
- calculate cohort retention metrics
Installation
To install from the command line via pip, do:
pip install cohort-analysis
To upgrade to the latest version via pip, do:
pip install cohort-analysis --upgrade
To use via pipenv put the following in your Pipfile:
[packages]
cohort-analysis = ">=0.0.1"
Getting Started
To start, simply
import cohort_analysis
The principal interface for this library is via cohort_analysis.metrics (an instance of cohort_analysis.cohort_metrics.CohortMetrics).
We have used pandas-flavor to add a .cohorts accessor to pandas DataFrames; note that the above import statement will apply this to all DataFrames.
Preparing the interface
You will need to set up the clickstream and cohorts tables for this interface. There are two ways:
- via methods
cohort_analysis.metrics.set_clickstream(df1)
cohort_analysis.metrics.set_cohorts(df2)
- via the DataFrame accessor we set up for convenience
df1.cohorts.set_as_clickstream(cohort_analysis.metrics)
df2.cohorts.set_as_cohorts(cohort_analysis.metrics)
The clickstream table should have two columns, user_id and timestamp (other columns will be ignored). Every row should record a 'click' (an event of some kind), indicating the responsible user's ID and the timestamp at which it occurred. If your DataFrame has these columns under different names (col1, col2 in the example below), you can indicate that when setting the clickstream as follows:
cohort_analysis.metrics.set_clickstream(df1, timestamp_col="col1", user_id_col="col2")
The cohorts table has two mandatory columns, user_id and reference_timestamp and is expected, though not required, to have additional columns that can be used to define user cohorts. If your DataFrame has these columns under different names (colA, colB in the example below), you can indicate that when setting the cohorts as follows:
cohort_analysis.metrics.set_cohorts(df2, reference_timestamp_col="colA", user_id_col="colB")
The user_id columns in the two tables ought to correspond, i.e. IDs in clickstream.user_id should be present in cohorts.user_id (any that are not will be ignored).
Direct interface creation
Alternatively, a new instance of cohort_analysis.cohort_metrics.CohortMetrics specifying the tables to use can be created as follows:
metrics = cohort_analysis.CohortMetrics(clickstream=df1, cohorts=df2)
The supplied DataFrames should already have the correct column names as detailed in the previous section.
Timezone support
The default behaviour of the interface is to convert/localise all timestamps to UTC.
If your data is in a different timezone, you should change the CohortMetrics instance's timezone as follows:
cohort_analysis.metrics.change_timezone("CET")
This should be done before setting the clickstream and cohorts table if those contain timezone-naive timestamps that actually belong to a timezone other than UTC.
If creating the interface directly, pass a timezone parameter instead:
metrics = cohort_analysis.CohortMetrics(clickstream=df1, cohorts=df2, timezone="CET")
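To see why the timezone should be set before loading timezone-naive data, here is a small standard-library sketch (not the library's own code) of the difference between localising a naive timestamp and converting an aware one. The fixed UTC+1 offset is an assumption standing in for CET in winter, and all variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+1 offset used as a stand-in for CET in winter (assumption for illustration).
cet = timezone(timedelta(hours=1), "CET")

# A timezone-naive timestamp that was actually recorded in CET.
naive = datetime(2020, 1, 1, 0, 30)

# Localising attaches a timezone without moving the wall-clock time.
as_utc = naive.replace(tzinfo=timezone.utc)  # wrong guess: treats the value as UTC
as_cet = naive.replace(tzinfo=cet)           # right guess: treats the value as CET

# Converting keeps the instant but changes the wall-clock time.
as_cet_in_utc = as_cet.astimezone(timezone.utc)

print(as_utc.isoformat())         # 2020-01-01T00:30:00+00:00
print(as_cet_in_utc.isoformat())  # 2019-12-31T23:30:00+00:00
print(as_utc - as_cet_in_utc)     # 1:00:00
```

Localising under the wrong timezone shifts every event by the offset, which here is enough to move a click into a different calendar day.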
Changing the current time
At initialisation, the current time in the chosen timezone is set as a reference point for calculations. Actions in the clickstream at future times and users with reference timestamps in the future will be filtered out.
However, the current_timestamp property can be changed as follows:
metrics.change_current_timestamp("now")
Valid inputs are:
- "now" (change current_timestamp property to the current time in the chosen timezone)
- "last" (change current_timestamp to the latest entry in clickstream.timestamp and cohorts.reference_timestamp)
- any pandas.Timestamp (will be converted/localised to the instance's timezone)
- any parseable datetime string (will be set to the instance's timezone)
This allows the user to view the data as it would have appeared at points in the past, handle batched data that doesn't include the latest information, set up deterministic tests etc.
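The idea of a movable reference point can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: resolve_current_timestamp and drop_future are hypothetical helpers, string parsing is left out, and treating an event exactly at the reference point as visible is an assumption.

```python
from datetime import datetime, timezone

def resolve_current_timestamp(value, click_times, reference_times, now):
    """Hypothetical helper: map "now", "last" or a datetime to a reference point."""
    if value == "now":
        return now
    if value == "last":
        # Latest entry across both tables.
        return max(list(click_times) + list(reference_times))
    return value  # assume the caller passed a timezone-aware datetime

def drop_future(times, current):
    """Keep only timestamps at or before the reference point (assumed inclusive)."""
    return [t for t in times if t <= current]

utc = timezone.utc
clicks = [
    datetime(2020, 5, 1, tzinfo=utc),
    datetime(2020, 5, 20, tzinfo=utc),
    datetime(2020, 6, 3, tzinfo=utc),
]
references = [datetime(2020, 4, 1, tzinfo=utc), datetime(2020, 5, 10, tzinfo=utc)]
wall_clock = datetime(2021, 1, 1, tzinfo=utc)  # fixed "now" so the example is deterministic

# View the data as it would have appeared on 15 May 2020.
as_of = resolve_current_timestamp(datetime(2020, 5, 15, tzinfo=utc), clicks, references, wall_clock)
visible = drop_future(clicks, as_of)

# "last" picks the latest timestamp found in either table.
latest = resolve_current_timestamp("last", clicks, references, wall_clock)
```

Passing a fixed wall_clock rather than reading the system clock is what makes this kind of logic easy to test deterministically.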
Inspecting the interface
Instances of CohortMetrics have a string representation implemented. If the clickstream and cohorts tables are both set up, print(cohort_analysis.metrics) will display a summary of the two tables derived from the DataFrames' own string representations. This can be used for quick inspection of the data. If one or both tables are not yet set up, the representation will indicate this instead.
Basic Usage
Once the CohortMetrics object has been prepared, it can be used to derive a variety of activity and cohort retention metrics from the clickstream and cohorts tables.
Methods will typically return a pandas DataFrame.
Filtering by cohort
A copy of the two tables, filtered to remove 'clicks' from users not in the cohorts table, or with timestamps before the associated user's reference_timestamp, may be returned as follows:
clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts()
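As a rough illustration of the default filtering described above, the following standard-library sketch drops clicks from unknown users and clicks that precede the user's reference timestamp. It is not the library's code; the data is made up, and keeping a click that falls exactly on the reference timestamp is an assumption.

```python
from datetime import datetime

# Stand-in for the cohorts table: user_id -> reference_timestamp.
cohorts = {
    "u1": datetime(2020, 1, 10),
    "u2": datetime(2020, 2, 1),
}

# Stand-in for the clickstream table: (user_id, timestamp) rows.
clickstream = [
    ("u1", datetime(2020, 1, 5)),   # before u1's reference timestamp: dropped
    ("u1", datetime(2020, 1, 12)),  # kept
    ("u2", datetime(2020, 2, 1)),   # exactly at the reference timestamp: kept (assumption)
    ("u3", datetime(2020, 2, 3)),   # u3 is not in the cohorts table: dropped
]

filtered = [
    (user_id, ts)
    for user_id, ts in clickstream
    if user_id in cohorts and ts >= cohorts[user_id]
]
print(filtered)
```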
Additional filtering may be applied by passing a dictionary to this method. Here are some examples:
- Filter by a column in the cohorts table, selecting activity and user data only for (in this example) users who have the value "United Kingdom" in the country column of the cohorts table:
cohort_dict = {"country": "United Kingdom"}
clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)
- Filter by a column in the cohorts table, selecting activity and user data only for users with one of a number of values in a column (in this example, "Brazil" or "Canada" in the country column of the cohorts table):
cohort_dict = {"country": ["Brazil", "Canada"]}
clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)
- Filter by values in multiple columns of the cohorts table, selecting activity and user data only for users who (in this example) have the value "India" in the country column of the cohorts table and the value "Social Media" in the acquisition_channel column:
cohort_dict = {"country": "India", "acquisition_channel": "Social Media"}
clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)
Lists of values may also be passed, as in the previous example.
The logic for combining across columns can be changed - 'AND' logic is used by default, but calling cohort_analysis.metrics.change_filter_logic("OR") beforehand will use 'OR' logic instead. 'NOT' logic can also be used to exclude all the values provided via cohort_dict.
- Filter by the reference_timestamp column in the cohorts table, selecting activity and user data only for users with a reference timestamp during a particular time period (in this example, Q2 2020):
cohort_dict = {"calendar": "2020-Q2"}
clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)
This kind of filter can be combined with other filters and may have multiple values passed, just as above. Using the key "reference_timestamp" instead of "calendar" will work in the exact same way.
Possible time periods include:
- years ("2020")
- quarters ("2020-Q1")
- months ("2020-05")
- weeks ("2020-W-SUN-5" - the fifth week concluding on a Sunday in 2020)
- days ("2020-05-03")
This method is used internally by other methods that generate metrics.
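The way a cohort dictionary combines across columns can be sketched in plain Python. This is an illustration only, not the library's implementation: matches is a hypothetical helper, and reading 'NOT' as "matches none of the supplied values" is an assumption based on the description above.

```python
def matches(row, cohort_dict, logic="AND"):
    """Hypothetical helper: does one cohorts-table row satisfy the filter?"""
    checks = []
    for column, wanted in cohort_dict.items():
        # A single value and a list of values are treated alike.
        values = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        checks.append(row.get(column) in values)
    if logic == "AND":
        return all(checks)
    if logic == "OR":
        return any(checks)
    if logic == "NOT":
        return not any(checks)
    raise ValueError(f"unknown logic: {logic}")

users = [
    {"user_id": 1, "country": "India", "acquisition_channel": "Social Media"},
    {"user_id": 2, "country": "India", "acquisition_channel": "Search"},
    {"user_id": 3, "country": "Brazil", "acquisition_channel": "Social Media"},
    {"user_id": 4, "country": "Canada", "acquisition_channel": "Search"},
]
cohort_dict = {"country": "India", "acquisition_channel": "Social Media"}

for logic in ("AND", "OR", "NOT"):
    selected = [u["user_id"] for u in users if matches(u, cohort_dict, logic)]
    print(logic, selected)
# AND [1]
# OR [1, 2, 3]
# NOT [4]
```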
Cohort Activity Metrics
The cohort-analysis library supports two different types of activity metric: activity in calendar intervals and activity in rolling intervals.
The methods for computing activity in calendar intervals are as follows:
- dau_calendar_day
- wau_calendar_week
- mau_calendar_month
- qau_calendar_quarter
- yau_calendar_year
These methods all optionally accept a cohort filter dictionary via the input cohort=cohort_dict.
Each of these methods returns a DataFrame with a PeriodIndex covering the full timerange of the clickstream table's timestamp column and the cohorts table's reference_timestamp column. It has the following columns:
- cohort_size
- n_active
- ?au (dau, wau, mau, qau, yau respectively)
By default, the cohort size is the number of users with reference timestamps prior to the end of each period in the index. This behaviour can be changed to calculate the number of users with reference timestamps prior to the start of each period in the index by passing cohort_size_at="start" to any of the methods.
Methods apart from dau_calendar_day and mau_calendar_month accept an additional argument, since the other intervals are not uniquely defined (although there are sensible defaults that we implement).
- wau_calendar_week accepts the optional argument week_commencing. This accepts any day of the week as a value. The default is "Monday".
- qau_calendar_quarter and yau_calendar_year both accept the optional argument year_commencing. This accepts any month of the year as a value. The default is "January".
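To make the effect of week_commencing and cohort_size_at concrete, here is a minimal standard-library sketch of counting active users per calendar week. It is not the library's implementation: the function names are hypothetical, only weeks containing activity are reported (the library covers the full time range), and "prior to the end of the period" is assumed to mean strictly before the first day of the next period.

```python
from collections import defaultdict
from datetime import date, timedelta

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def week_start(day, week_commencing="Monday"):
    """Return the first day of the calendar week containing `day`."""
    offset = (day.weekday() - DAYS.index(week_commencing)) % 7
    return day - timedelta(days=offset)

def wau_calendar_week_sketch(clicks, reference_dates, week_commencing="Monday",
                             cohort_size_at="end"):
    """clicks: (user_id, date) pairs; reference_dates: user_id -> date."""
    active = defaultdict(set)
    for user_id, day in clicks:
        active[week_start(day, week_commencing)].add(user_id)
    rows = []
    for start in sorted(active):
        # Count users whose reference date falls before the chosen edge of the week.
        cutoff = start if cohort_size_at == "start" else start + timedelta(days=7)
        cohort_size = sum(1 for ref in reference_dates.values() if ref < cutoff)
        rows.append((start, cohort_size, len(active[start])))
    return rows  # (week start, cohort_size, n_active)

reference_dates = {"a": date(2020, 5, 4), "b": date(2020, 5, 6), "c": date(2020, 5, 13)}
clicks = [
    ("a", date(2020, 5, 4)),   # Monday
    ("b", date(2020, 5, 10)),  # Sunday
    ("a", date(2020, 5, 11)),  # Monday
    ("c", date(2020, 5, 14)),  # Thursday
]

monday_weeks = wau_calendar_week_sketch(clicks, reference_dates)
sunday_weeks = wau_calendar_week_sketch(clicks, reference_dates, week_commencing="Sunday")
```

With Monday weeks the Sunday click by user b belongs to the week starting 4 May; with Sunday weeks it opens the week starting 10 May, so the same data yields different weekly counts.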
The methods for computing activity in rolling windows are as follows:
- rolling_wau
- rolling_mau
- rolling_qau
These methods all optionally accept a cohort filter dictionary via the input cohort=cohort_dict.
Each of these methods returns a DataFrame with a DatetimeIndex covering the full timerange of the clickstream table's timestamp column and the cohorts table's reference_timestamp column. It has the following columns:
- cohort_size
- n_active
- ?au (wau, mau, qau respectively)
For each date in the index, these metrics are computed for an N-day window whose final day is indicated by the index. For WAU, N=7. For MAU, N=28. For QAU, N=91. The latter two values are chosen because they are multiples of 7 (this eliminates weekly cycles in user activity from the metrics).
By default, the cohort size is the number of users with reference timestamps prior to the end of each window whose final day is indicated by the index. This behaviour can be changed to calculate the number of users with reference timestamps prior to the start of each window by passing cohort_size_at="start" to any of the methods.
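A rolling window can be sketched with the standard library as follows. This is a simplified illustration rather than the library's code: rolling_active_sketch is a hypothetical name, dates stand in for timestamps, and "prior to the end of the window" is assumed to mean strictly before the day after the window's final day.

```python
from datetime import date, timedelta

def rolling_active_sketch(clicks, reference_dates, n_days=7, cohort_size_at="end"):
    """clicks: (user_id, date) pairs; reference_dates: user_id -> date."""
    first = min(min(d for _, d in clicks), min(reference_dates.values()))
    last = max(max(d for _, d in clicks), max(reference_dates.values()))
    rows = []
    day = first
    while day <= last:
        window_start = day - timedelta(days=n_days - 1)  # window is [window_start, day]
        n_active = len({u for u, d in clicks if window_start <= d <= day})
        cutoff = window_start if cohort_size_at == "start" else day + timedelta(days=1)
        cohort_size = sum(1 for ref in reference_dates.values() if ref < cutoff)
        rows.append((day, cohort_size, n_active))
        day += timedelta(days=1)
    return rows  # (final day of window, cohort_size, n_active)

reference_dates = {"a": date(2020, 5, 1), "b": date(2020, 5, 3)}
clicks = [("a", date(2020, 5, 1)), ("b", date(2020, 5, 4)), ("a", date(2020, 5, 9))]

rows = rolling_active_sketch(clicks, reference_dates, n_days=7)
for day, cohort_size, n_active in rows:
    print(day, cohort_size, n_active)
```

User a's click on 1 May falls out of the 7-day window ending on 8 May, so n_active dips to 1 on that date and recovers on 9 May when a is active again.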
Cohort Retention Rates
A cohort retention rate is the active rate of a user cohort during a specified time interval since reference_timestamp. The current implementation of CohortMetrics has three different types of retention: fixed-interval, unbounded-interval and wmqy-interval. The code below shows the usage on a cohorts DataFrame that has a country column.
retention_type = 'unbounded'
retention_cohorts = [{"country": "Canada"}, {"country": "Brazil"}]
interval_type = "week"
metrics = cohort_analysis.metrics
metrics.cohort_retention(retention_type, retention_cohorts, interval_type)
# returns a DataFrame indexed by week 0 to latest and columned by given cohorts
Here retention_cohorts is a list of cohort filtering dictionaries. This allows the end user to easily compare retention metrics for different cohorts.
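For intuition, the general definition of a retention rate (share of a cohort active in a given interval since each user's reference timestamp) can be sketched in plain Python. This is an assumption-laden illustration, not a reproduction of any of the library's three retention types: weekly_retention_sketch is a hypothetical helper, intervals are consecutive 7-day blocks, and users who have not yet reached a given week are still counted in the denominator.

```python
from datetime import datetime, timedelta

def weekly_retention_sketch(clicks, reference_times, n_weeks):
    """Share of the cohort active in each 7-day interval since its reference timestamp.

    clicks: (user_id, datetime) pairs; reference_times: user_id -> datetime.
    """
    active = [set() for _ in range(n_weeks)]
    for user_id, ts in clicks:
        ref = reference_times.get(user_id)
        if ref is None or ts < ref:
            continue  # unknown user, or activity before the reference timestamp
        week = (ts - ref) // timedelta(days=7)  # whole weeks elapsed since reference
        if week < n_weeks:
            active[week].add(user_id)
    size = len(reference_times)
    return [len(users) / size for users in active]

reference_times = {
    "a": datetime(2020, 1, 1),
    "b": datetime(2020, 1, 1),
    "c": datetime(2020, 1, 8),
    "d": datetime(2020, 1, 8),
}
clicks = [
    ("a", datetime(2020, 1, 1)),   # week 0
    ("a", datetime(2020, 1, 9)),   # week 1
    ("b", datetime(2020, 1, 2)),   # week 0
    ("c", datetime(2020, 1, 8)),   # week 0
    ("c", datetime(2020, 1, 20)),  # week 1
    ("d", datetime(2020, 1, 25)),  # week 2
    ("x", datetime(2020, 1, 3)),   # not in the cohort: ignored
]

retention = weekly_retention_sketch(clicks, reference_times, n_weeks=3)
print(retention)  # [0.75, 0.5, 0.25]
```

Running the same function on two differently filtered sets of users gives one retention curve per cohort, which is the kind of comparison the list of cohort dictionaries is meant to support.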
Development
Once you've cloned the repository and navigated to it, the best way to work on the code is to use the pipenv virtual environment:
- Make sure that you have the pipenv library: pip install pipenv --upgrade.
- In the top level directory, /cohort-analysis, run pipenv install --dev (installs virtual environment with development tools).
- Launch pytest-watch in the virtual environment using pipenv run ptw. Edit code at your leisure; the test suite will run whenever you save your work.
- Other quality assurance checks can be run locally:
  - use pipenv run coverage to run the tests and ensure sufficient test coverage
  - use pipenv run mypy for static type-checking
  - use pipenv run lint-fix to format the code
  - use
Contributors
- Paddy Alton (paddy.alton@apolitical.co)
- Charlotte Crabb (charlotte.crabb@apolitical.co)
- Ashia Ogunlade (ashia.ogunlade@apolitical.co)
- CY Yang (cy.yang@apolitical.co)
(with thanks to the Apolitical engineering and data teams for assistance and review)