Python package to mine association rules in datasets
Reason this release was yanked:
Forgot to remove print statement
Project description
ruleminer
Python package to discover association rules in Pandas DataFrames.
This package implements the code of the paper "Discovering and ranking validation rules in supervisory data".
The documentation can be found here.
Here is what the package does:
Generate human-readable validation rules from a Pandas DataFrame, using rule templates that contain regular expressions
available functions: min, max, abs, quantile, sum, substr, split, count, sumif and countif
including parameters for metric filters and rule precisions (including XBRL tolerances)
Evaluate rules and calculate association rule metrics
available metrics: abs support, abs exceptions, confidence, support, added value, casual confidence, casual support, conviction, lift and rule power factor
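To make some of the metrics above concrete, here is a minimal sketch that computes a few of them for a single rule with plain pandas boolean masks. It does not use the ruleminer API: the function `rule_metrics`, the toy DataFrame and the formulas (standard textbook definitions of support, confidence and lift) are assumptions for illustration, and the package's own definitions may differ in detail.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the examples in this README.
df = pd.DataFrame(
    {
        "Type": ["life_insurer", "life_insurer", "non-life_insurer", "non-life_insurer"],
        "TP-life": [100.0, 0.0, 0.0, 0.0],
        "TP-nonlife": [0.0, 0.0, 50.0, 75.0],
    }
)


def rule_metrics(if_mask: pd.Series, then_mask: pd.Series) -> dict:
    """Textbook metrics for the rule 'if <if_mask> then <then_mask>'."""
    n = len(if_mask)
    n_if = int(if_mask.sum())                          # rows where the if part holds
    n_then = int(then_mask.sum())                      # rows where the then part holds
    abs_support = int((if_mask & then_mask).sum())     # rows satisfying the whole rule
    abs_exceptions = int((if_mask & ~then_mask).sum()) # rows violating the rule
    confidence = abs_support / n_if if n_if else float("nan")
    lift = confidence / (n_then / n) if n_then else float("nan")
    return {
        "abs support": abs_support,
        "abs exceptions": abs_exceptions,
        "support": abs_support / n,
        "confidence": confidence,
        "lift": lift,
    }


# Rule: if Type == "life_insurer" then TP-life > 0
metrics = rule_metrics(df["Type"] == "life_insurer", df["TP-life"] > 0)
print(metrics)
```

A rule with zero exceptions has confidence 1; in this toy data one life insurer reports no life technical provisions, so the rule has one exception and a confidence of 0.5.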
Here are some examples of rule templates containing regexes, from which validation rules can be generated:
if ({"Type"} == ".*") then ({".*"} > 0)
if ({".*"} > 0) then (({".*"} == 0) & ({".*"} > 0))
(({".*"} + {".*"} + {".*"}) == {".*"})
({"Own funds"} <= quantile({"Own funds"}, 0.95))
(substr({"Type"}, 0, 1) in ["a", "b"])
The first template generates (with the dataset described in the Usage section) rules like
if ({"Type"} == "non-life_insurer") then ({"TP-nonlife"} > 0)
if ({"Type"} == "life_insurer") then ({"TP-life"} > 0)
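The idea behind expanding the first template can be sketched in a few lines of pandas. This is not the package's implementation: `expand_template`, its parameters, the confidence threshold and the toy DataFrame are hypothetical, and the sketch handles only this one template shape. It tries every value of Type and every numeric column that matches the regexes, and keeps the candidate rules that hold in the data.

```python
import re

import pandas as pd

# Hypothetical toy data standing in for the dataset of the Usage section.
df = pd.DataFrame(
    {
        "Type": ["life_insurer", "life_insurer", "non-life_insurer", "non-life_insurer"],
        "TP-life": [100.0, 120.0, 0.0, 0.0],
        "TP-nonlife": [0.0, 0.0, 50.0, 75.0],
    }
)


def expand_template(data, value_regex=".*", column_regex=".*", min_confidence=1.0):
    """Expand: if ({"Type"} == "<value_regex>") then ({"<column_regex>"} > 0)."""
    value_pattern = re.compile(value_regex)
    column_pattern = re.compile(column_regex)
    # Only numeric columns can sensibly be compared with 0.
    numeric_columns = [
        column
        for column in data.select_dtypes("number").columns
        if column_pattern.fullmatch(column)
    ]
    rules = []
    for value in sorted(data["Type"].unique()):
        if not value_pattern.fullmatch(value):
            continue
        if_mask = data["Type"] == value
        for column in numeric_columns:
            then_mask = data[column] > 0
            confidence = (if_mask & then_mask).sum() / if_mask.sum()
            if confidence >= min_confidence:
                rules.append(f'if ({{"Type"}} == "{value}") then ({{"{column}"}} > 0)')
    return rules


rules = expand_template(df)
for rule in rules:
    print(rule)
```

Lowering `min_confidence` in this sketch would also keep rules that have some exceptions in the data.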
These generated validation rules can then be used to validate new datasets.
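As an illustration of what validating a new dataset means, the sketch below applies one generated rule to new data and returns the rows that violate it. The helper `find_exceptions`, the regex `RULE_PATTERN` and the new DataFrame are assumptions made for this example; the package has its own rule parser and evaluation functions, which are described in the documentation.

```python
import re

import pandas as pd

# Matches only rules of the form: if ({"A"} == "value") then ({"B"} > 0)
RULE_PATTERN = re.compile(
    r'if \(\{"(?P<if_col>[^"]+)"\} == "(?P<value>[^"]+)"\) '
    r'then \(\{"(?P<then_col>[^"]+)"\} > 0\)'
)


def find_exceptions(rule: str, data: pd.DataFrame) -> pd.DataFrame:
    """Return the rows where the if part holds but the then part does not."""
    match = RULE_PATTERN.fullmatch(rule)
    if match is None:
        raise ValueError(f"unsupported rule format: {rule}")
    if_mask = data[match["if_col"]] == match["value"]
    then_mask = data[match["then_col"]] > 0
    return data[if_mask & ~then_mask]


# Hypothetical new dataset to validate.
new_df = pd.DataFrame(
    {
        "Type": ["life_insurer", "life_insurer", "non-life_insurer"],
        "TP-life": [90.0, 0.0, 0.0],
        "TP-nonlife": [0.0, 0.0, 40.0],
    }
)

exceptions = find_exceptions(
    'if ({"Type"} == "life_insurer") then ({"TP-life"} > 0)', new_df
)
print(exceptions)
```

Here the second row is reported, because it is a life insurer without positive life technical provisions.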
History
0.1.0 (2021-11-21)
First release on PyPI.
0.1.1 (2021-11-23)
Added more documentation to the README text
0.1.2 (2022-1-20)
Bug fixes for some complex expressions
0.1.3 (2022-1-26)
Optimized rule generation process
0.1.4 (2022-1-26)
Columns evaluated in the then part now depend on the if part of the rule
0.1.5 (2022-1-30)
Rules with quantiles added (including evaluation of intermediate results)
0.1.6 and 0.1.7 (2022-2-1)
A number of optimizations in the rule generation process
0.1.8 (2022-2-3)
Rule power factor metric added
0.1.12 (2022-5-11)
Optimizations: metric calculations are now done with boolean masks of the DataFrame
0.1.14 (2023-4-17)
Nested functions added
substr and in operators added
0.1.16 (2023-8-3)
Templates now do not necessarily have to contain a regex
Bug fix when evaluating rules that contain columns that do not exist
Templates can now start with 'if () then'
0.1.17 (2023-8-8)
Rule generation now runs without specified data
0.1.18 (2023-8-8)
Dedicated function added for template to rule conversion without data
Exp sign changed from ^ to **
0.1.19 (2023-8-27)
Small fixes to rule conversion without data
0.1.20 (2023-8-29)
Small fixes in evaluating rules with syntax errors
0.1.21 (2023-10-11)
Changed sum to nansum
Added tolerance functionality for ==
0.1.22 (2023-10-17)
Added tolerance functionality for !=, <, <=, > and >=
Updated docs
0.1.23 (2023-10-18)
Added nested conditions in functions
0.1.24 (2023-10-25)
Added sumif and improved tolerance functionality
0.1.26 (2023-4-22)
Added arguments estimate, base and sample_weights to the fit_ensemble_and_extract_expressions function to support estimators other than AdaBoost
Added decision tree functions to __init__.py
0.1.28 (2023-5-3)
Bug fix
0.1.29 (2024-9-30)
Added functionality for countif and sumif
Bug fix for tolerances in combination with >=, <=, > and <
Bug fix for tolerances in formulas like A - B - C
Added tests for these bug fixes
Hashes for ruleminer-0.1.29-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | eec5e8a6b4d009e6002884917ef0a3018f96eb2ce3ec82302f4c3c55dfe32806
MD5 | c34142c0d1eace48c76a0ca5aaf89561
BLAKE2b-256 | 9e1dab5e5f5edee573a2c7e833d3e74d4bb43d9bf0f40d50fc0fd0d1d6b4d9c2