Generates A/B/n test groups

ABSplit

Split your data into matching A/B/n groups

Table of Contents
  1. About The Project
  2. Getting Started
  3. Tutorials
  4. Usage
  5. API Reference
  6. Contributing
  7. License
  8. Contact

About the project

ABSplit is a Python package that uses a genetic algorithm to generate as equal as possible A/B, A/B/C, or A/B/n test splits.

The project aims to provide a convenient and efficient way of splitting population data into distinct groups (ABSplit), as well as finding matching samples that closely resemble a given original sample (Match).

Whether you have static population data or time series data, this Python package simplifies the process and allows you to analyze and manipulate your population data.

This covers the following use cases:

  1. ABSplit class: Splitting an entire population into n groups by given proportions
  2. Match class: Finding a matching group in a population for a given sample

Calculation

ABSplit standardises the population data (so each metric is weighted as equally as possible), then pivots it into a three-dimensional array, by metrics, individuals, and dates.

The selection from the genetic algorithm, along with its inverse, is applied across this array with broadcasting to compute the dot products between the selection and the population data.

This yields aggregated metrics for each group. The mean squared error (MSE) between the groups is calculated for each metric and then summed across metrics. The objective of the cost function is to minimize this overall MSE, so that the metrics of the groups track each other as closely as possible over time.
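The calculation described above can be sketched in NumPy for the two-group case. This is a minimal illustration under stated assumptions, not ABSplit's actual implementation: the function names `standardise` and `split_cost` are hypothetical, z-scoring is assumed as the standardisation, and the toy data is made up.

```python
import numpy as np

def standardise(data):
    """Z-score each metric over all individuals and dates (assumed scaling)."""
    mean = data.mean(axis=(1, 2), keepdims=True)
    std = data.std(axis=(1, 2), keepdims=True)
    std = np.where(std == 0, 1.0, std)  # leave constant metrics unscaled
    return (data - mean) / std

def split_cost(selection, data):
    """Sum over metrics of the MSE between the two groups' aggregated series.

    selection: binary vector, shape (individuals,); 1 = group A, 0 = group B
    data: array of shape (metrics, individuals, dates)
    """
    selection = np.asarray(selection, dtype=float)
    # The selection and its inverse form one mask per group: (2, individuals)
    masks = np.stack([selection, 1.0 - selection])
    # Matmul broadcasts the masks over the metrics axis: (metrics, 2, dates)
    grouped = masks @ data
    diff = grouped[:, 0, :] - grouped[:, 1, :]
    mse_per_metric = (diff ** 2).mean(axis=1)  # one MSE per metric
    return float(mse_per_metric.sum())

# Toy population: 2 metrics, 4 individuals, 3 dates.
# Individuals 0 and 1 share one profile, individuals 2 and 3 share another.
a = np.array([[[1.0, 2.0, 3.0]], [[10.0, 20.0, 30.0]]])
b = np.array([[[2.0, 4.0, 1.0]], [[5.0, 50.0, 20.0]]])
data = standardise(np.concatenate([a, a, b, b], axis=1))

balanced = split_cost([1, 0, 1, 0], data)    # one of each profile per group
unbalanced = split_cost([1, 1, 0, 0], data)  # profiles separated by group
```

In this toy case the balanced selection gives both groups identical aggregated series, so its cost is numerically zero, while the unbalanced selection scores strictly higher. A genetic algorithm would search over such selection vectors for the lowest cost.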

Getting Started

Use the package manager pip to install ABSplit and its prerequisites.

ABSplit requires pygad==3.0.1

Installation

pip install absplit

Tutorials

See this Colab notebook for a range of examples of how to use ABSplit and Match.

Do it yourself

See this Colab notebook to learn how ABSplit works under the hood, and how to build your own group-splitting tool using PyGAD.

Usage

from absplit import ABSplit
import pandas as pd
import datetime
import numpy as np

# Synthetic data
data_dct = {
    'date': [datetime.date(2030,4,1) + datetime.timedelta(days=x) for x in range(3)]*5,
    'country': ['UK'] * 15,
    'region': [item for sublist in [[x]*6 for x in ['z', 'y']] for item in sublist] + ['x']*3,
    'city': [item for sublist in [[x]*3 for x in ['a', 'b', 'c', 'd', 'e']] for item in sublist],
    'metric1': np.arange(0, 15, 1),
    'metric2': np.arange(0, 150, 10)
}
df = pd.DataFrame(data_dct)

# Identify which columns are metrics, which is the time period, and what to split on
kwargs = {
    'metrics': ['metric1', 'metric2'],
    'date_col': 'date',
    'splitting': 'city'
}

# Initialise
ab = ABSplit(
    df=df,
    splits=[.5, .5],  # Split into 2 groups of equal size (argument name as in the API Reference)
    **kwargs,
)

# Generate split
ab.run()

# Visualise generation fitness
ab.fitness()

# Visualise data
ab.visualise()

# Extract bin splits
df = ab.results

# Extract data aggregated by bins
df_agg = ab.aggregations

# Extract summary statistics
df_dist = ab.distributions    # Population counts between groups
df_rmse = ab.rmse             # RMSE between groups for each metric
df_mape = ab.mape             # MAPE between groups for each metric
df_totals = ab.totals         # Total sum of each metric for each group
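To make the summary statistics above easier to interpret, here is a minimal sketch of the standard RMSE and MAPE definitions applied to two groups' aggregated series. The helper names and the numbers are hypothetical, and ABSplit's own calculation may differ in detail (for example in scaling, or in which group is treated as the reference for MAPE).

```python
import math

def rmse(a, b):
    """Root mean squared error between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def mape(reference, other):
    """Mean absolute percentage error of `other` relative to `reference`.

    Returned as a fraction (0.05 means 5%); assumes no zeros in `reference`.
    """
    return sum(abs(x - y) / abs(x) for x, y in zip(reference, other)) / len(reference)

# Made-up daily totals of one metric for two groups
group_a = [100.0, 110.0, 120.0]
group_b = [90.0, 110.0, 132.0]

rmse_ab = rmse(group_a, group_b)
mape_ab = mape(group_a, group_b)
```

Lower values of either statistic indicate that the groups track each other more closely for that metric.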

API Reference

ABSplit

ABSplit(df, metrics, splitting, date_col=None, ga_params={}, metric_weights={}, splits=[0.5, 0.5], size_penalty=0)

Splits a population into n groups that are mutually exclusive and collectively exhaustive.

Arguments:

  • df (pd.DataFrame): Dataframe of population to be split
  • metrics (str, list): Name of, or list of names of, metric columns in DataFrame to be considered in split
  • splitting (str): Name of column that represents individuals in the population that is getting split. For example, if you wanted to split a dataframe of US counties, this would be the county name column
  • date_col (str, optional): Name of the column that represents time periods, if applicable. If left empty, a static split is performed, i.e. not across a time series (default: None)
  • ga_params (dict, optional): Parameters passed to the genetic algorithm (pygad.GA); see the PyGAD documentation for the arguments you can pass (default: {})
  • splits (list, optional): How many groups to split into, and relative size of the groups (default: [0.5, 0.5], 2 groups of equal size)
  • size_penalty (float, optional): Penalty weighting for differences in the population count between groups (default: 0)
  • sum_penalty (float, optional): Penalty weighting for the sum of metrics over time. If this is greater than zero, it will add a penalty to the cost function that will try and make the sum of each metric the same for each group (default: 0)
  • cutoff_date (str, optional): Cutoff date between fitting and validation data. For example, if you have data between 2023-01-01 and 2023-03-01, and the cutoff date is 2023-02-01, the algorithm will only perform the fit on data between 2023-01-01 and 2023-02-01. If None, it will fit on all available data. If a cutoff date is provided, RMSE scores (obtained via the ab.rmse attribute) will only cover the validation period (i.e., from 2023-02-01 to the end of the time series)
  • missing_dates (str, optional): How to deal with missing dates in time series data, options: ['drop_dates', 'drop_population', '0', 'median'] (default: median)
  • metric_weights (dict, optional): Weights for each metric in the data. If you want the splitting to focus on one metric more than another, you can prioritise it here (default: {})
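As an illustration of how the arguments above might be combined, the sketch below assembles a keyword dictionary for a three-way split. All values are made up, not recommended settings. The `ga_params` keys shown are standard `pygad.GA` arguments, but whether ABSplit overrides any of them is not stated here, and the `{metric name: weight}` shape of `metric_weights` is an assumption.

```python
# Hypothetical configuration; values are illustrative only.
ga_params = {
    'num_generations': 200,     # pygad.GA: generations to evolve
    'sol_per_pop': 50,          # pygad.GA: candidate solutions per generation
    'num_parents_mating': 10,   # pygad.GA: parents selected for mating
}

absplit_kwargs = {
    'metrics': ['metric1', 'metric2'],
    'splitting': 'city',
    'date_col': 'date',
    'splits': [0.5, 0.25, 0.25],     # A/B/C groups with relative sizes
    'size_penalty': 0.1,             # penalise unequal population counts
    'sum_penalty': 0.1,              # penalise unequal metric totals
    'cutoff_date': '2023-02-01',     # fit before this date, validate after
    'missing_dates': 'median',       # one of the documented options
    'metric_weights': {'metric1': 2.0, 'metric2': 1.0},  # assumed dict shape
    'ga_params': ga_params,
}

# With absplit installed and df defined as in the Usage section:
# ab = ABSplit(df=df, **absplit_kwargs)
# ab.run()
```

Keeping the configuration in a single dictionary makes it straightforward to rerun the same split with one setting changed, such as a different `size_penalty`.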

Match

Match(population, sample, metrics, splitting, date_col=None, ga_params={}, metric_weights={})

Takes DataFrame sample and finds a comparable group in population.

Arguments:

  • population (pd.DataFrame): Population to search for comparable group (Must exclude sample data)
  • sample (pd.DataFrame): Sample we are looking to find a match for.
  • metrics (str, list): Name of, or list of names of, metric columns in DataFrame
  • splitting (str): Name of column that represents individuals in the population that is getting split
  • date_col (str, optional): Name of the column that represents time periods, if applicable. If left empty, a static split is performed, i.e. not across a time series (default: None)
  • ga_params (dict, optional): Parameters passed to the genetic algorithm (pygad.GA); see the PyGAD documentation for the arguments you can pass (default: {})
  • splits (list, optional): How many groups to split into, and relative size of the groups (default: [0.5, 0.5], 2 groups of equal size)
  • metric_weights (dict, optional): Weights for each metric in the data. If you want the matching to focus on one metric more than another, you can prioritise it here (default: {})
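Because the population passed to Match must exclude the sample, it can help to derive both frames from one source table. The following is a minimal pandas sketch of that preparation step with made-up data. The commented-out Match call follows the signature above, but importing Match from the package root is an assumption, and no methods or attributes of Match are shown because they are not documented here.

```python
import pandas as pd

# Made-up data: four cities observed on two dates
df = pd.DataFrame({
    'city': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
    'date': pd.to_datetime(['2030-04-01', '2030-04-02'] * 4),
    'metric1': [1, 2, 3, 4, 5, 6, 7, 8],
})

# Individuals that make up the sample we want to find a match for
sample_ids = ['a']

in_sample = df['city'].isin(sample_ids)
sample = df[in_sample].reset_index(drop=True)
population = df[~in_sample].reset_index(drop=True)  # excludes the sample

# Sanity check: no individual appears in both frames
overlap = set(sample['city']) & set(population['city'])

# With absplit installed (import path assumed):
# from absplit import Match
# match = Match(population=population, sample=sample,
#               metrics='metric1', splitting='city', date_col='date')
```

Filtering with a single boolean mask and its negation guarantees the two frames are disjoint and together cover the original table.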

Contributing

I welcome contributions to ABSplit! For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

MIT
