PopDelta is a Python package for calculating the delta between two populations.
Project description
PopDelta is a data mining library for Python that processes and analyzes pandas DataFrames, which this description also calls populations. Its primary objective is to identify underlying patterns in the data. The library can characterize frequent patterns in a single dataframe or compare two dataframes, which is useful for tasks such as contrasting customer groups or analyzing how values shift over time through cohort comparisons.
Possible applications for PopDelta:
- Discerning frequent co-occurring patterns within a single dataset.
- Detecting patterns that differentiate purchasing customers from non-purchasing ones.
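To make "frequent co-occurring patterns" concrete, here is a minimal sketch that counts how often pairs of attribute values appear together in the same row. It uses only the standard library and is not PopDelta's own algorithm; the toy rows, the count_pairs helper, and the min_support threshold are all hypothetical names chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy rows; each row maps an attribute name to its value.
rows = [
    {"plan": "pro", "region": "EU", "purchased": "yes"},
    {"plan": "pro", "region": "EU", "purchased": "yes"},
    {"plan": "free", "region": "EU", "purchased": "no"},
    {"plan": "pro", "region": "US", "purchased": "yes"},
]


def count_pairs(rows):
    """Count how often each pair of (attribute, value) items shares a row."""
    counts = Counter()
    for row in rows:
        # Sort the items so the same pair always has the same key.
        items = sorted(row.items())
        counts.update(combinations(items, 2))
    return counts


pair_counts = count_pairs(rows)

# Keep only the pairs seen in at least min_support rows (assumed threshold).
min_support = 3
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent)
```

In this toy data the only pair that reaches the threshold links the "pro" plan with a purchase, which is the kind of pattern the second application above is after.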
Note: PopDelta was formerly called KDA (Key Driver Analysis).
Installation
To install PopDelta, execute the following pip command:
pip install popdelta
Usage
PopDelta incorporates built-in utilities for data cleaning, including handling missing data, rectifying data inconsistencies, and implementing One-Hot Encoding.
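For readers who want to see what these cleaning steps amount to, the following sketch performs comparable operations by hand with plain pandas. It illustrates the general techniques only and makes no claim about how PopDelta implements them internally; the column names and the median and "Unknown" fill choices are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical messy input with infinities, gaps and inconsistent spellings.
df = pd.DataFrame({
    "age": [25.0, np.inf, 41.0, np.nan],
    "city": ["Rome", "rome ", None, "Paris"],
})

# Missing data: treat infinities as missing, then fill gaps with the median.
df["age"] = df["age"].replace([np.inf, -np.inf], np.nan)
df["age"] = df["age"].fillna(df["age"].median())

# Inconsistencies: normalise whitespace and casing, label missing categories.
df["city"] = df["city"].str.strip().str.title().fillna("Unknown")

# One-Hot Encoding: one indicator column per distinct category.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```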
PopDelta takes three initialization parameters:
- target: the attribute used to weight the discovered patterns. Without a target, patterns are weighted only by their frequency (row counts, as in vanilla frequent itemset mining).
- num_bins: the number of bins used to discretize numerical attributes.
- string_attributes: a list of attributes that should be forced to string type (useful for ordinal attributes encoded in numerical format).
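The effect of num_bins and string_attributes can be pictured with plain pandas. The sketch below is an assumption about what such preprocessing typically looks like, not PopDelta's actual code: in particular, equal-width binning via pd.cut is an assumed strategy, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: "satisfaction" is an ordinal scale stored as integers.
df = pd.DataFrame({
    "income": [10, 20, 30, 40, 50, 60],
    "satisfaction": [1, 2, 3, 1, 2, 3],
})

num_bins = 3
string_attributes = ["satisfaction"]

# Force the listed attributes to strings so they are treated as categories.
for col in string_attributes:
    df[col] = df[col].astype(str)

# Discretize every remaining numeric column into equal-width bins
# (assumed strategy); labels=False returns the integer bin index.
for col in df.select_dtypes(include="number").columns:
    df[col] = pd.cut(df[col], bins=num_bins, labels=False)

print(df)
```

Without the string conversion, the 1 to 3 satisfaction scale would be binned like any other number, which is the situation string_attributes is meant to avoid.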
After initialization, iterate over the generator returned by the process_batch function. This function takes two parameters:
- batches: a list containing one dataframe (to characterize a single population) or two dataframes (to compare them).
- progressive: a Boolean flag indicating whether the data should be processed progressively in chunks (recommended for datasets exceeding 10,000 rows).
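Processing data progressively in chunks generally means walking the dataframe in fixed-size slices and yielding each one as it becomes available. The generator below sketches that idea in plain pandas; iter_chunks and chunk_size are hypothetical names, and PopDelta's own chunking logic may differ.

```python
import pandas as pd


def iter_chunks(df, chunk_size):
    """Yield consecutive row slices of at most chunk_size rows."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


# Hypothetical 25-row dataframe split into chunks of 10 rows.
df = pd.DataFrame({"value": range(25)})
sizes = [len(chunk) for chunk in iter_chunks(df, chunk_size=10)]
print(sizes)
```

Because the slices are produced lazily, a consumer can start reporting results after the first chunk instead of waiting for the whole dataset.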
The two dataframes do not need perfectly overlapping columns, as long as they share at least one column.
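A quick way to check this condition before a comparison is to compute the shared columns yourself. The snippet below is a small sketch in plain pandas with hypothetical dataframes; it does not call PopDelta and only shows how the column intersection can be inspected.

```python
import pandas as pd

# Two hypothetical populations whose columns overlap only partially.
df_a = pd.DataFrame({"age": [30], "city": ["Rome"], "income": [100]})
df_b = pd.DataFrame({"income": [80], "age": [52], "plan": ["pro"]})

# Keep the columns present in both, in the order they appear in df_a.
shared = [col for col in df_a.columns if col in df_b.columns]
if not shared:
    raise ValueError("The two dataframes have no columns in common")

# Restrict both dataframes to the shared columns.
aligned_a, aligned_b = df_a[shared], df_b[shared]
print(shared)
```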
Example
The expected usage and input are as follows (you can find more examples in the examples.ipynb notebook):
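import pandas as pd
from IPython.display import display  # display() is available by default only inside notebooks
from popdelta.pop_delta import PopDelta
# Treat infinite values as missing (this option is deprecated in newer pandas releases)
pd.set_option('mode.use_inf_as_na', True)
df = pd.read_csv("messy_dataset.csv")
# Split the data into the two populations to compare
over_40 = df[df["age"] > 40]
under_40 = df[df["age"] <= 40]
popDeltaW2 = PopDelta(target="age", num_bins=3, string_attributes=[])
# process_batch returns a generator, so results can be shown as they are produced
for result in popDeltaW2.process_batch([over_40, under_40]):
    display(result)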
In this example, PopDelta is utilized to analyze two datasets: over_40 and under_40. The target variable for weighting is age, with 3 bins for discretizing numerical attributes and no predefined string attributes (applicable for ordinal attributes encoded as numerics).
Download files
File details
Details for the file popdelta-0.0.2.tar.gz.
File metadata
- Download URL: popdelta-0.0.2.tar.gz
- Upload date:
- Size: 2.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c4ad84975890c1ac90ac666ddda3e2c4c1c4b9a726e08b5b51a2285b8997cd7a |
| MD5 | 8b56d419d05a7f5519ebe6a1c017339d |
| BLAKE2b-256 | 79c929248e833a129600062f68e8253a5211abe6b0b620466182d300fefd3fca |
File details
Details for the file popdelta-0.0.2-py3-none-any.whl.
File metadata
- Download URL: popdelta-0.0.2-py3-none-any.whl
- Upload date:
- Size: 17.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7ab5a6fbc972cea11cf5798d31c946b8fb72233edda436c956b0a8f6031b4171 |
| MD5 | 596b4fd18a5a23b8ed040f2745a1d130 |
| BLAKE2b-256 | 256bc720053ef1eacebc84d4edeff5e170a3709b3ee4f64e627c9ccbba949f4c |