
cohort-analysis


Outline

A package for calculating cohort metrics from an activity stream and cohorts data table.

Uses pandas DataFrames to hold the two tables and provides a class, CohortMetrics, to:

  • manage the data within these DataFrames
  • filter to analyse different user cohorts
  • calculate a variety of important activity metrics (such as weekly active users, monthly active users)
  • calculate cohort retention metrics

Installation

To install from the command line via pip, do:

pip install cohort-analysis

To upgrade to the latest version via pip, do:

pip install cohort-analysis --upgrade

To use via pipenv put the following in your Pipfile:

[packages]
cohort-analysis = ">=0.0.1"

Getting Started

To start, simply

import cohort_analysis

The principal interface for this library is via cohort_analysis.metrics (an instance of cohort_analysis.cohort_metrics.CohortMetrics).

We have used pandas-flavor to add a .cohorts accessor to pandas DataFrames; note that the above import statement will apply this to all DataFrames.

Preparing the interface

You will need to set up the clickstream and cohorts tables for this interface. There are two ways:

  1. via methods:
  • cohort_analysis.metrics.set_clickstream(df1)
  • cohort_analysis.metrics.set_cohorts(df2)
  2. via the DataFrame accessor we set up for convenience:
  • df1.cohorts.set_as_clickstream(cohort_analysis.metrics)
  • df2.cohorts.set_as_cohorts(cohort_analysis.metrics)

The clickstream table should have two columns, user_id and timestamp (other columns will be ignored). Every row should record a 'click' (an event of some kind), indicating the responsible user's ID and the timestamp at which it occurred. If your DataFrame has these columns under different names (col1, col2 in the example below), you can indicate that when setting the clickstream as follows:

cohort_analysis.metrics.set_clickstream(df1, timestamp_col="col1", user_id_col="col2")

The cohorts table has two mandatory columns, user_id and reference_timestamp, and is expected, though not required, to have additional columns that can be used to define user cohorts. If your DataFrame has these columns under different names (colA, colB in the example below), you can indicate that when setting the cohorts as follows:

cohort_analysis.metrics.set_cohorts(df2, reference_timestamp_col="colA", user_id_col="colB")

The user_id columns in the two tables ought to correspond, i.e. IDs in clickstream.user_id should be present in cohorts.user_id (any that are not will be ignored).
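
For illustration, here is a minimal sketch of two suitable DataFrames being set up (all of the data values are invented):

import pandas as pd
import cohort_analysis

# minimal clickstream: one row per 'click' event
df1 = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "timestamp": pd.to_datetime(["2020-05-01 09:00", "2020-05-02 10:30", "2020-05-01 12:00"]),
})

# minimal cohorts table: one row per user, plus a column for defining cohorts
df2 = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "reference_timestamp": pd.to_datetime(["2020-04-28", "2020-04-30"]),
    "country": ["United Kingdom", "Brazil"],
})

cohort_analysis.metrics.set_clickstream(df1)
cohort_analysis.metrics.set_cohorts(df2)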

Direct interface creation

Alternatively, a new instance of cohort_analysis.cohort_metrics.CohortMetrics specifying the tables to use can be created as follows:

metrics = cohort_analysis.CohortMetrics(clickstream=df1, cohorts=df2)

The supplied DataFrames should already have the correct column names as detailed in the previous section.

Timezone support

The default behaviour of the interface is to convert/localise all timestamps to UTC.

If your data is in a different timezone, you should change the CohortMetrics instance's timezone as follows:

cohort_analysis.metrics.change_timezone("CET")

This should be done before setting the clickstream and cohorts tables if those contain timezone-naive timestamps that actually belong to a timezone other than UTC.

If creating the interface directly, pass a timezone parameter instead:

metrics = cohort_analysis.CohortMetrics(clickstream=df1, cohorts=df2, timezone="CET")

Changing the current time

At initialisation, the current time in the chosen timezone is set as a reference point for calculations. Actions in the clickstream at future times and users with reference timestamps in the future will be filtered out.

However, the current_timestamp property can be changed as follows:

metrics.change_current_timestamp("now")

Valid inputs are:

  • "now" (change current_timestamp property to the current time in the chosen timezone)
  • "last" (change current_timestamp to the latest entry in clickstream.timestamp and cohorts.reference_timestamp)
  • any pandas.Timestamp (will be converted/localised to the instance's timezone)
  • any parseable datetime string (will be set to the instance's timezone)

This allows the user to view the data as it would have appeared at points in the past, handle batched data that doesn't include the latest information, set up deterministic tests, and so on.
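
For instance, to pin the reference point for a reproducible analysis (the date here is arbitrary):

metrics.change_current_timestamp("2020-07-01")  # any parseable datetime string
metrics.change_current_timestamp("last")  # latest entry across both tables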

Inspecting the interface

Instances of CohortMetrics have a string representation implemented. If the clickstream and cohorts tables are both set up, print(cohort_analysis.metrics) will display a summary of the two tables derived from the DataFrames' own string representations. This can be used for quick inspection of the data. If one or both tables are not yet set up, the representation will indicate this instead.
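
For example:

print(cohort_analysis.metrics)
# prints a summary of both tables, or a notice if either is not yet set up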

Basic Usage

Once the CohortMetrics object has been prepared, it can be used to derive a variety of activity and cohort retention metrics from the clickstream and cohorts tables.

Methods will typically return a pandas DataFrame.

Filtering by cohort

A copy of the two tables, filtered to remove 'clicks' from users not in the cohorts table, or with timestamps before the associated user's reference_timestamp, may be returned as follows:

clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts()

Additional filtering may be applied by passing a dictionary to this method. Here are some examples:

  1. Filter by a column in the cohorts table, selecting activity and user data only for (in this example) users who have the value "United Kingdom" in the country column of the cohorts table:
cohort_dict = {"country": "United Kingdom"}

clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)
  2. Filter by a column in the cohorts table, selecting activity and user data only for users with one of a number of values in a column (in this example, "Brazil" or "Canada" in the country column of the cohorts table):
cohort_dict = {"country": ["Brazil", "Canada"]}

clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)
  3. Filter by values in multiple columns of the cohorts table, selecting activity and user data only for users who (in this example) have the value "India" in the country column of the cohorts table and the value "Social Media" in the acquisition_channel column:
cohort_dict = {"country": "India", "acquisition_channel": "Social Media"}

clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)

Lists of values may also be passed, as in the previous example. The logic for combining across columns can be changed - 'AND' logic is used by default, but calling cohort_analysis.metrics.change_filter_logic("OR") beforehand will use 'OR' logic instead. 'NOT' logic can also be used to exclude all the values provided via cohort_dict.

  4. Filter by the reference_timestamp column in the cohorts table, selecting activity and user data only for users with a reference timestamp during a particular time period (in this example, Q2 2020):
cohort_dict = {"calendar": "2020-Q2"}

clickstream, cohorts = cohort_analysis.metrics.filter_by_cohorts(cohort_dict)

This kind of filter can be combined with other filters and may have multiple values passed, just as above; see the example after this list. Using the key "reference_timestamp" instead of "calendar" will work in exactly the same way. Possible time periods include:

  • years ("2020")
  • quarters ("2020-Q1")
  • months ("2020-05")
  • weeks ("2020-W-SUN-5" - the fifth week concluding on a Sunday in 2020)
  • days ("2020-05-03")

This method is used internally by other methods that generate metrics.

Cohort Activity Metrics

The cohort-analysis library supports two different types of activity metric: activity in calendar intervals and activity in rolling intervals.

The methods for computing activity in calendar intervals are as follows:

  • dau_calendar_day
  • wau_calendar_week
  • mau_calendar_month
  • qau_calendar_quarter
  • yau_calendar_year

These methods all optionally accept a cohort filter dictionary via the input cohort=cohort_dict.

Each of these methods returns a DataFrame with a PeriodIndex covering the full time range of the clickstream table's timestamp column and the cohorts table's reference_timestamp column. It has the following columns:

  • cohort_size
  • n_active
  • ?au (dau, wau, mau, qau, yau respectively)

By default, the cohort size is the number of users with reference timestamps prior to the end of each period in the index. This behaviour can be changed to calculate the number of users with reference timestamps prior to the start of each period in the index by passing cohort_size_at="start" to any of the methods.

Methods apart from dau_calendar_day and mau_calendar_month accept an additional argument, since the other intervals are not uniquely defined (although there are sensible defaults that we implement).

  • wau_calendar_week accepts the optional argument, week_commencing. This accepts any day of the week as a value. The default is "Monday".
  • qau_calendar_quarter and yau_calendar_year both accept the optional argument year_commencing. This accepts any month of the year as a value. The default is "January".
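
Putting these options together, a sketch of a typical call might look like this (the cohort values are illustrative):

weekly = cohort_analysis.metrics.wau_calendar_week(
    cohort={"country": "United Kingdom"},
    week_commencing="Sunday",
    cohort_size_at="start",
)
# DataFrame with a weekly PeriodIndex and columns cohort_size, n_active, wau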

The methods for computing activity in rolling windows are as follows:

  • rolling_wau
  • rolling_mau
  • rolling_qau

These methods all optionally accept a cohort filter dictionary via the input cohort=cohort_dict.

Each of these methods returns a DataFrame with a DatetimeIndex covering the full time range of the clickstream table's timestamp column and the cohorts table's reference_timestamp column. It has the following columns:

  • cohort_size
  • n_active
  • ?au (wau, mau, qau respectively)

For each date in the index, these metrics are computed for an N day window whose final day is indicated by the index. For WAU, N=7. For MAU, N=28. For QAU, N=91. The latter two values are chosen due to being multiples of 7 (this eliminates weekly cycles in user activity from the metrics).

By default, the cohort size is the number of users with reference timestamps prior to the end of each window whose final day is indicated by the index. This behaviour can be changed to calculate the number of users with reference timestamps prior to the start of each window by passing cohort_size_at="start" to any of the methods.
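
For example (again with an illustrative cohort filter):

monthly = cohort_analysis.metrics.rolling_mau(cohort={"country": "Brazil"})
# one row per date; each row summarises the 28-day window ending on that date,
# with columns cohort_size, n_active and mau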

Cohort Retention Rates

A cohort retention rate is the active rate of a user cohort during a specified time interval since reference_timestamp. The current implementation of CohortMetrics offers three different types of retention: fixed-interval, unbounded-interval and wmqy-interval. The code below shows usage on a cohorts DataFrame that has a country column.

retention_type = "unbounded"
retention_cohorts = [{"country": "Canada"}, {"country": "Brazil"}]
interval_type = "week"

metrics = cohort_analysis.metrics
metrics.cohort_retention(retention_type, retention_cohorts, interval_type)
# returns a DataFrame indexed by week 0 to the latest week, with one column per given cohort

Here retention_cohorts is a list of cohort filtering dictionaries. This allows the end user to easily compare retention metrics for different cohorts.
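
The same call pattern should extend to the other retention types and intervals; for example, assuming "fixed" and "month" are the analogous argument values (an assumption based on the names above, not confirmed API values):

# "fixed" and "month" are assumed values, by analogy with the names above
metrics.cohort_retention("fixed", retention_cohorts, "month")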

Development

Once you've cloned the repository and navigated to it, the best way to work is within the pipenv virtual environment:

  1. Make sure that you have the pipenv library: pip install pipenv --upgrade.
  2. In the top level directory, /cohort-analysis, run pipenv install --dev (installs virtual environment with development tools).
  3. Launch pytest-watch in the virtual environment using pipenv run ptw. Edit code at your leisure; the test suite will run whenever you save your work.
  4. Other quality assurance checks can be run locally:
    • use pipenv run coverage to run the tests and ensure sufficient test coverage
    • use pipenv run mypy for static type-checking
    • use pipenv run lint-fix to format the code

Contributors

(with thanks to the Apolitical engineering and data teams for assistance and review)
