
PopDelta is a Python package for calculating the delta between two populations.

Project description

PopDelta is a data mining library for Python that processes and analyzes pandas DataFrames, which it refers to interchangeably as Populations. Its primary objective is to identify underlying patterns in the data. The library can characterize frequent patterns in a single DataFrame or compare two DataFrames, which is useful for tasks such as contrasting customer groups or analyzing how values shift over time through cohort comparisons.

Possible applications for PopDelta:

  • Discerning frequent co-occurring patterns within a single dataset.
  • Detecting patterns that differentiate purchasing customers from non-purchasing ones.
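To make "frequent co-occurring patterns" concrete, here is a minimal standard-library sketch that counts how many rows contain each pair of attribute=value items. It illustrates plain frequent-itemset counting only; it is not PopDelta's implementation, and the toy rows, the count_itemsets helper, and the threshold of 2 are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each row is a set of attribute=value items.
rows = [
    {"plan=pro", "region=eu", "purchased=yes"},
    {"plan=pro", "region=eu", "purchased=yes"},
    {"plan=free", "region=eu", "purchased=no"},
    {"plan=pro", "region=us", "purchased=yes"},
]


def count_itemsets(rows, size):
    """Count how many rows contain each itemset of the given size."""
    counts = Counter()
    for row in rows:
        # Sort the items so the same itemset always yields the same tuple key.
        for itemset in combinations(sorted(row), size):
            counts[itemset] += 1
    return counts


pair_counts = count_itemsets(rows, size=2)

# Keep only the pairs that appear in at least 2 rows (illustrative threshold).
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= 2}

for pair, n in sorted(frequent_pairs.items(), key=lambda item: -item[1]):
    print(pair, n)
```

Comparing two populations amounts to running the same count on each group and looking at the itemsets whose frequencies differ most.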

Note: PopDelta was formerly called KDA (Key Driver Analysis).

Installation

To install PopDelta, execute the following pip command:

pip install popdelta

Usage

PopDelta incorporates built-in utilities for data cleaning, including handling missing data, rectifying data inconsistencies, and implementing One-Hot Encoding.
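For readers unfamiliar with these cleaning steps, the sketch below shows what they typically look like in plain pandas. It is an illustration under assumptions, not PopDelta's internal code: the column names, the "unknown" label, and the median fill are hypothetical choices.

```python
import pandas as pd

# Hypothetical messy input: inconsistent spellings and missing values.
df = pd.DataFrame({
    "plan": ["pro", "Pro ", None, "free"],
    "visits": [3.0, None, 5.0, 1.0],
})

cleaned = df.copy()

# Rectify inconsistencies: trim whitespace and lower-case the categories.
cleaned["plan"] = cleaned["plan"].str.strip().str.lower()

# Handle missing data: label missing categories, fill numeric gaps with the median.
cleaned["plan"] = cleaned["plan"].fillna("unknown")
cleaned["visits"] = cleaned["visits"].fillna(cleaned["visits"].median())

# One-hot encode the categorical column into one indicator column per value.
encoded = pd.get_dummies(cleaned, columns=["plan"])
print(encoded)
```

Because PopDelta performs this kind of preparation itself, code like this should only be needed when you want different cleaning rules from the built-in ones.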

PopDelta initialization parameters are:

1) target: the attribute used to "weight" the discovered data patterns. If no target is given, patterns are "weighted" only by their frequencies (row counts, as in vanilla frequent itemset mining).
2) num_bins: the number of bins used to discretize numerical attributes.
3) string_attributes: a list of attributes that should be forced to string type (useful for ordinal attributes encoded in numerical format).
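The effect of num_bins and string_attributes can be pictured with plain pandas. This is only a sketch: the source does not say which binning strategy PopDelta uses, so the equal-width pd.cut below is an assumption, and the age and tier columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: "age" is truly numerical, "tier" is an ordinal code.
df = pd.DataFrame({
    "age": [18, 25, 33, 41, 58, 72],
    "tier": [1, 2, 3, 1, 2, 3],
})

num_bins = 3
string_attributes = ["tier"]

binned = df.copy()

# Discretize the numerical attribute into equal-width intervals
# (assumption: PopDelta may bin differently, e.g. by quantiles).
binned["age"] = pd.cut(binned["age"], bins=num_bins)

# Force ordinal codes to strings so they are treated as categories, not numbers.
for column in string_attributes:
    binned[column] = binned[column].astype(str)

print(binned)
```

Without the string conversion, a column like tier would be binned along with the other numerical attributes, which merges distinct ordinal levels.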

After initializing PopDelta, iterate over the generator returned by the object's process_batch function. This function takes two parameters:

1) batches: a list containing one dataframe (to characterize a single dataset) or two dataframes (to compare them).
2) progressive: a Boolean flag indicating whether the data should be processed progressively in chunks (recommended for datasets exceeding 10,000 rows).

Datasets whose columns do not overlap perfectly can still be compared, as long as they share at least one column.
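The column-overlap requirement can be checked before calling PopDelta. The sketch below uses plain pandas; the shared_columns helper and the two toy dataframes are hypothetical, and PopDelta may align columns differently internally.

```python
import pandas as pd


def shared_columns(first, second):
    """Return the columns present in both dataframes, or raise if there are none."""
    shared = first.columns.intersection(second.columns)
    if shared.empty:
        raise ValueError("The two dataframes have no columns in common.")
    return shared


# Hypothetical dataframes with partially overlapping columns.
first = pd.DataFrame({"age": [45, 52], "plan": ["pro", "free"], "region": ["eu", "us"]})
second = pd.DataFrame({"age": [23, 31], "plan": ["free", "free"], "device": ["ios", "web"]})

shared = shared_columns(first, second)

# Restrict both dataframes to the shared columns so they are directly comparable.
first_aligned = first[shared]
second_aligned = second[shared]
print(list(shared))
```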

Example

The expected usage and input are as follows (you can find more examples in the examples.ipynb notebook):

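import pandas as pd
from popdelta.pop_delta import PopDelta

# Treat infinite values as missing data.
# Note: this option is deprecated in newer pandas releases.
pd.set_option('mode.use_inf_as_na', True)
df = pd.read_csv("messy_dataset.csv")

# Split the data into the two populations to compare.
over_40 = df[df["age"] > 40]
under_40 = df[df["age"] <= 40]

popDeltaW2 = PopDelta(target="age", num_bins=3, string_attributes=[])

# display() is available in Jupyter/IPython; use print(result) in a plain script.
for result in popDeltaW2.process_batch([over_40, under_40]):
    display(result)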

In this example, PopDelta compares two datasets: over_40 and under_40. The target variable used for weighting is age, numerical attributes are discretized into 3 bins, and string_attributes is empty, so no attribute is forced to string type (the option is intended for ordinal attributes encoded as numerics).

