PopDelta is a Python package for calculating the delta between two populations.

Project description

PopDelta is a data mining library for Python, developed to process and analyze pandas DataFrames, referred to interchangeably as Populations. Its primary objective is to identify underlying patterns within the data. The library can characterize frequent patterns in a single dataframe or compare two dataframes, which is useful for tasks such as contrasting customer groups or analyzing how values shift over time through cohort comparisons.

Possible applications for PopDelta:

  • Discerning frequent co-occurring patterns within a single dataset.
  • Detecting patterns that differentiate purchasing customers from non-purchasing ones.

Note: PopDelta was formerly called KDA (Key Driver Analysis).

Installation

To install PopDelta, execute the following pip command:

pip install popdelta

Usage

PopDelta incorporates built-in utilities for data cleaning, including handling missing data, rectifying data inconsistencies, and implementing One-Hot Encoding.
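To make these cleaning steps concrete, here is a minimal pandas-only sketch of what handling missing data, normalizing inconsistent labels, and One-Hot Encoding can look like. It is not PopDelta's internal code: the helper names basic_clean and one_hot, the "missing" placeholder label, and the choice to lowercase text values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def basic_clean(df: pd.DataFrame, text_columns: list) -> pd.DataFrame:
    """Illustrative cleaning pass (hypothetical helper, not part of PopDelta)."""
    out = df.copy()
    for col in text_columns:
        # Normalize inconsistent labels: strip whitespace and lowercase.
        out[col] = out[col].str.strip().str.lower()
        # Make missing categorical values explicit (assumed placeholder label).
        out[col] = out[col].fillna("missing")
    # Treat infinities in numeric columns as missing values.
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].replace([np.inf, -np.inf], np.nan)
    return out


def one_hot(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """One-hot encode the given columns as 0/1 integer indicator columns."""
    return pd.get_dummies(df, columns=columns, dtype=int)


raw = pd.DataFrame({"city": [" Rome", "rome ", None], "spend": [1.0, np.inf, 3.0]})
cleaned = basic_clean(raw, ["city"])
encoded = one_hot(cleaned, ["city"])
```

After this pass, the two spellings of the same city collapse into one label, and each distinct label becomes its own indicator column.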

PopDelta initialization parameters are:

1) target: the attribute used to "weight" the discovered data patterns. If no target is given, patterns are "weighted" only by their frequencies (row counts, as in vanilla frequent itemset mining).
2) num_bins: the number of bins used to discretize numerical attributes.
3) string_attributes: a list of attributes that should be forced to string type (useful for ordinal attributes encoded in numerical format).
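The effect of num_bins and string_attributes can be sketched with plain pandas. The discretize helper below is hypothetical, and it assumes equal-width binning via pd.cut; the source does not say which binning strategy PopDelta actually uses.

```python
import pandas as pd


def discretize(df: pd.DataFrame, num_bins: int, string_attributes=()) -> pd.DataFrame:
    """Hypothetical sketch: force some columns to strings, bin the remaining numeric ones."""
    out = df.copy()
    # Columns listed here are treated as categorical labels, not numbers.
    for col in string_attributes:
        out[col] = out[col].astype(str)
    # Remaining numeric columns are cut into equal-width intervals (an assumption).
    for col in out.select_dtypes(include="number").columns:
        out[col] = pd.cut(out[col], bins=num_bins).astype(str)
    return out


people = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 70],
    "rating": [1, 2, 3, 1, 2, 3],  # ordinal score stored as a number
})
binned = discretize(people, num_bins=3, string_attributes=["rating"])
```

Here age is replaced by three interval labels, while rating keeps its original values as strings because it was listed in string_attributes.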

After initializing PopDelta, iterate over the generator returned by the object's process_batch function. This function takes two parameters: 1) batches: a list containing one or two dataframes, depending on the use case (one to characterize a single dataset, two to compare them), and 2) progressive: a Boolean flag indicating whether the data should be processed progressively in chunks (recommended for datasets exceeding 10,000 rows).
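The idea behind progressive processing can be illustrated without PopDelta: split the dataframe into chunks and yield an updated partial result after each one, so the consumer sees intermediate results before the full dataset has been read. The helpers iter_chunks and progressive_counts are hypothetical, and running value counts stand in for whatever PopDelta computes internally.

```python
from collections import Counter
from typing import Dict, Iterator

import pandas as pd


def iter_chunks(df: pd.DataFrame, chunk_size: int) -> Iterator[pd.DataFrame]:
    """Yield consecutive row slices of at most chunk_size rows."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


def progressive_counts(df: pd.DataFrame, column: str, chunk_size: int) -> Iterator[Dict]:
    """Yield the running value counts of `column` after each chunk."""
    running = Counter()
    for chunk in iter_chunks(df, chunk_size):
        running.update(chunk[column].value_counts().to_dict())
        yield dict(running)


data = pd.DataFrame({"plan": list("aabbbc")})
partials = list(progressive_counts(data, "plan", chunk_size=4))
```

Each yielded dictionary refines the previous one, and the last element equals the counts over the whole dataframe.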

Datasets whose columns do not overlap perfectly can still be compared, as long as the intersection of their columns is not empty.
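A minimal sketch of this alignment step, assuming the comparison is restricted to the columns both dataframes share. The helper align_on_shared_columns is hypothetical and is not part of PopDelta's API.

```python
import pandas as pd


def align_on_shared_columns(left: pd.DataFrame, right: pd.DataFrame):
    """Restrict both dataframes to the columns they have in common."""
    shared = left.columns.intersection(right.columns)
    if shared.empty:
        raise ValueError("the two dataframes have no columns in common")
    return left[shared], right[shared]


q1 = pd.DataFrame({"age": [31], "plan": ["pro"], "region": ["eu"]})
q2 = pd.DataFrame({"plan": ["free"], "region": ["us"], "device": ["ios"]})
q1_aligned, q2_aligned = align_on_shared_columns(q1, q2)
```

Only plan and region survive in both outputs; age and device are dropped because each exists in only one of the inputs.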

Example

The expected usage and input are as follows (you can find more examples in the examples.ipynb notebook):

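import pandas as pd
from IPython.display import display  # available by default in Jupyter notebooks
from popdelta.pop_delta import PopDelta

# Treat +/-inf as missing values (this option is deprecated in recent pandas releases)
pd.set_option('mode.use_inf_as_na', True)
df = pd.read_csv("messy_dataset.csv")

# Split the data into the two populations to compare
over_40 = df[df["age"] > 40]
under_40 = df[df["age"] <= 40]

popDeltaW2 = PopDelta(target="age", num_bins=3, string_attributes=[])

# process_batch returns a generator; display each result as it is produced
for result in popDeltaW2.process_batch([over_40, under_40]):
    display(result)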

In this example, PopDelta is utilized to analyze two datasets: over_40 and under_40. The target variable for weighting is age, with 3 bins for discretizing numerical attributes and no predefined string attributes (applicable for ordinal attributes encoded as numerics).
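To give an intuition for what a "delta" between two populations means, the sketch below compares how often each value of a single column occurs in two dataframes. It is a deliberately simplified, pandas-only illustration: the share_delta helper is hypothetical, it looks at one attribute at a time rather than multi-attribute patterns, and it does not reproduce PopDelta's algorithm or its target-based weighting.

```python
import pandas as pd


def share_delta(pop_a: pd.DataFrame, pop_b: pd.DataFrame, column: str) -> pd.Series:
    """Difference in relative frequency of each value (share in A minus share in B)."""
    share_a = pop_a[column].value_counts(normalize=True)
    share_b = pop_b[column].value_counts(normalize=True)
    # Values absent from one population count as a share of 0 there.
    delta = share_a.sub(share_b, fill_value=0.0)
    # Largest absolute differences first.
    return delta.sort_values(key=lambda s: s.abs(), ascending=False)


buyers = pd.DataFrame({"plan": ["pro", "pro", "free", "pro"]})
non_buyers = pd.DataFrame({"plan": ["free", "free", "pro", "trial"]})
delta = share_delta(buyers, non_buyers, "plan")
```

A positive value means the label is more common in the first population, a negative value that it is more common in the second.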

Download files

Source Distribution

popdelta-0.0.1.tar.gz (2.4 MB view details)

Uploaded Source

Built Distribution

popdelta-0.0.1-py3-none-any.whl (17.3 kB view details)

Uploaded Python 3

File details

Details for the file popdelta-0.0.1.tar.gz.

File metadata

  • Download URL: popdelta-0.0.1.tar.gz
  • Upload date:
  • Size: 2.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.10

File hashes

Hashes for popdelta-0.0.1.tar.gz
Algorithm Hash digest
SHA256 7436f5085b446abd04ad7e1a80b6e930984f87bae1265928320772535199eb7b
MD5 b11527956b9e1f5995b89bcb86e2b0c9
BLAKE2b-256 5c3cc4c1ebe4aa9e47559612c359f2d6caa3fea6fb8dcb6669a02c8e686bf59c

File details

Details for the file popdelta-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: popdelta-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.10

File hashes

Hashes for popdelta-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 292cfb5a4e42b46513e33611c9895aee99097c135168b1eb7a7614587d692095
MD5 9b83cb56d6b13452cac7cccbc90473e2
BLAKE2b-256 98e870b50be808a2d88afbadf1f38d9bb9eb79a2267a74df7f5dbfe9732be8a6
