Python package to mine association rules in datasets
ruleminer
DISCLAIMER - BETA PHASE
This package is currently in a beta phase.
Python package to discover association rules in Pandas DataFrames.
This package implements the code of the paper Discovering and ranking validation rules in supervisory data by Willem Jan Willemse and Annick van Ool.
Installation
To install the package:
pip install ruleminer
To install the package from GitHub:
pip install -e git+https://github.com/wjwwillemse/ruleminer.git#egg=ruleminer
To use ruleminer in a project:
import ruleminer
Usage
Name | Type | Assets | TP-life | TP-nonlife | Own funds | Excess |
---|---|---|---|---|---|---|
Insurer 1 | life insurer | 1000 | 800 | 0 | 200 | 200 |
Insurer 2 | non-life insurer | 4000 | 0 | 3200 | 800 | 800 |
Insurer 3 | non-life insurer | 800 | 0 | 700 | 100 | 100 |
Insurer 4 | life insurer | 2500 | 1800 | 0 | 700 | 700 |
Insurer 5 | non-life insurer | 2100 | 0 | 2200 | 200 | 200 |
Insurer 6 | life insurer | 9000 | 8800 | 0 | 200 | 200 |
Insurer 7 | non-life insurer | 9000 | 0 | 8800 | 200 | 200 |
Insurer 8 | life insurer | 9000 | 8800 | 0 | 200 | 200 |
Insurer 9 | non-life insurer | 9000 | 8800 | 0 | 200 | 200 |
Insurer 10 | life insurer | 9000 | 0 | 8800 | 200 | 199.99 |
Calculating metrics
Take the rule:
if ({"Type"} == "life_insurer") then ({"TP-life"} > 0)
This rule says: if an insurer reports the value “life_insurer” in the column “Type” (columns are denoted by curly braces), then the value of the column “TP-life” should be greater than zero.
With the code:
templates = [{'expression': 'if ({"Type"} == "life_insurer") then ({"TP-life"} > 0)'}]
r = ruleminer.RuleMiner(templates=templates, data=df)
you can generate the rule metrics of this rule given the data in the DataFrame above (available with r.rules).
id | definition | status | abs support | abs exceptions | confidence | encodings |
---|---|---|---|---|---|---|
1 | if ({"Type"} == "life_insurer") then ({"TP-life"} > 0) | None | 5 | 0 | 1 | {} |
There are 5 rows in the data that support this rule. There are no exceptions (i.e. where the if-clause is satisfied, but not the then-clause), so this rule has confidence 1.
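The arithmetic behind these three metrics can be sketched in plain Python. This is only an illustration of the definitions used above, not ruleminer's implementation: the helper `rule_metrics` and the four sample rows are hypothetical and are not taken from the table.

```python
# Hypothetical sample rows (not the table above), kept small on purpose.
rows = [
    {"Type": "life_insurer", "TP-life": 800},
    {"Type": "life_insurer", "TP-life": 0},
    {"Type": "non-life_insurer", "TP-life": 0},
    {"Type": "life_insurer", "TP-life": 1800},
]


def rule_metrics(rows, if_clause, then_clause):
    """Count supporting rows and exceptions for a rule 'if X then Y'."""
    # Only rows that satisfy the if-clause are relevant to the rule.
    applicable = [row for row in rows if if_clause(row)]
    support = sum(1 for row in applicable if then_clause(row))
    exceptions = len(applicable) - support
    confidence = support / len(applicable) if applicable else None
    return {
        "abs support": support,
        "abs exceptions": exceptions,
        "confidence": confidence,
    }


metrics = rule_metrics(
    rows,
    if_clause=lambda row: row["Type"] == "life_insurer",
    then_clause=lambda row: row["TP-life"] > 0,
)
print(metrics)
```

In this sample, two of the three life insurers report a positive “TP-life”, so the rule has two supporting rows, one exception and a confidence of 2/3.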
Generating rules
You can define rule templates that contain regular expressions for column names and strings. The package then generates the rules that satisfy the template, using matching column names and strings from the DataFrame. For example, the column regex:
{"T.*"}
matches the column names:
{"Type"}, {"TP-life"}, {"TP-nonlife"}
So if you apply the following rule template
if ({"T.*"} == ".*") then ({"TP.*"} > 0)
then the following rules are generated
id | definition | status | abs support | abs exceptions | confidence | encodings |
---|---|---|---|---|---|---|
0 | if ({"Type"} == "non-life_insurer") then ({"TP-nonlife"} > 0) | None | 4 | 1 | 0.8 | {} |
1 | if ({"Type"} == "life_insurer") then ({"TP-life"} > 0) | None | 5 | 0 | 1 | {} |
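How a column regex selects column names can be sketched with Python's re module. The helper `matching_columns` is hypothetical, and the use of a full match (rather than a prefix or substring match) is an assumption; ruleminer may match differently.

```python
import re

# Column names of the example DataFrame.
columns = ["Name", "Type", "Assets", "TP-life", "TP-nonlife", "Own funds", "Excess"]


def matching_columns(pattern, columns):
    """Return the column names matched by a column regex."""
    # Assumption: the pattern has to match the whole column name.
    return [name for name in columns if re.fullmatch(pattern, name)]


t_columns = matching_columns("T.*", columns)
tp_columns = matching_columns("TP.*", columns)
print(t_columns)
print(tp_columns)
```

Each combination of a column from the first list and a column from the second is a candidate rule; only candidates that pass the metric filters end up in the list of generated rules.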
You can use rules without an if-clause, for example:
{"Assets"} > 0
The metrics for these rules are calculated as if the if-clause is always satisfied.
Rule examples
The following rules can be applied to the data above:
{"Assets"} > 0
if ({"Type"} == ".*") then ({".*"} > 0)
if ({".*"} > 0) then (({".*"} == 0) & ({".*"} > 0))
(({".*"} + {".*"} + {".*"}) == {".*"})
(min({".*"}, {".*"}) == {".*"})
({"Own funds"} <= quantile({"Own funds"}, 0.95))
Parameters
Rule metrics
Several rule metrics have been proposed in the literature. You can pass the metrics that you want as a parameter to RuleMiner, for example:
params = {'metrics': ['added value', 'abs support', 'abs exceptions', 'confidence']}
r = ruleminer.RuleMiner(templates=templates, data=df, params=params)
This will produce the desired metrics. Available metrics are:
abs support (the absolute number of rows that satisfy the rule)
abs exceptions (the absolute number of rows that do not satisfy the rule)
confidence
support
added value
casual confidence
casual support
conviction
lift
rule power factor
The default metrics are ‘abs support’, ‘abs exceptions’ and ‘confidence’.
For the definitions, see Measures for Rules by Michael Hahsler.
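For orientation, some of these measures can be computed from four row counts using their common textbook definitions. This is a sketch only: the function `rule_measures` and the counts are hypothetical, and the exact formulas used by ruleminer may differ in detail.

```python
def rule_measures(n_rows, n_if, n_then, n_both):
    """Textbook measures for 'if X then Y' from row counts.

    n_rows: all rows, n_if: rows satisfying X,
    n_then: rows satisfying Y, n_both: rows satisfying X and Y.
    """
    support = n_both / n_rows        # P(X and Y)
    confidence = n_both / n_if       # P(Y | X)
    p_then = n_then / n_rows         # P(Y)
    return {
        "support": support,
        "confidence": confidence,
        "lift": confidence / p_then,
        "added value": confidence - p_then,
        # Conviction is unbounded when the rule has no exceptions.
        "conviction": (1 - p_then) / (1 - confidence) if confidence < 1 else float("inf"),
        "rule power factor": support * confidence,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
measures = rule_measures(n_rows=10, n_if=5, n_then=6, n_both=4)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```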
Metric filters
If you want to select only the rules that satisfy certain metric thresholds, you can use
params = {"filter": {"confidence": 0.75, "abs support": 10}}
The default metric filter is: {"confidence": 0.5, "abs support": 2}
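The effect of such a filter can be sketched as follows. The rule records and the helper `passes_filter` are hypothetical, and treating each threshold as a lower bound (metric >= threshold) is an assumption about how the filter is applied.

```python
# Hypothetical rule records with already computed metrics.
rules = [
    {"definition": "rule A", "confidence": 0.8, "abs support": 4},
    {"definition": "rule B", "confidence": 1.0, "abs support": 12},
    {"definition": "rule C", "confidence": 0.6, "abs support": 30},
]

thresholds = {"confidence": 0.75, "abs support": 10}


def passes_filter(rule, thresholds):
    """Keep a rule only if every filtered metric reaches its threshold."""
    # Assumption: thresholds are minimum values.
    return all(rule[metric] >= minimum for metric, minimum in thresholds.items())


selected = [rule["definition"] for rule in rules if passes_filter(rule, thresholds)]
print(selected)
```

Here rule A fails on support and rule C fails on confidence, so only rule B remains.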
Rule precision
In many situations the equality operator is too strict for quantitative data, because small differences can occur that you do not want to treat as exceptions to the rule. For this you can define a decimal parameter in the params dictionary:
params={'decimal': 3}
This means that comparisons like:
A==B
are translated to
abs(A-B) <= 1.5*10**(-decimal)
If no ‘decimal’ parameter is provided then the absolute difference should be exactly zero.
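The translation above can be tried out with a small helper. The function name `equal_with_precision` is hypothetical; the tolerance formula is the one given above.

```python
def equal_with_precision(a, b, decimal=None):
    """Compare two numbers the way 'A==B' is translated above."""
    if decimal is None:
        # No 'decimal' parameter: the difference has to be exactly zero.
        return a == b
    return abs(a - b) <= 1.5 * 10 ** (-decimal)


# 'Own funds' and 'Excess' of Insurer 10 differ by 0.01.
print(equal_with_precision(200, 199.99))             # exact comparison
print(equal_with_precision(200, 199.99, decimal=2))  # tolerance 0.015
print(equal_with_precision(200, 199.99, decimal=3))  # tolerance 0.0015
```

With decimal=2 the difference of 0.01 falls within the tolerance, while the exact comparison and decimal=3 both treat the two values as different.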
Evaluating results within rules
Suppose you want to use an expression with a quantile:
({"Own funds"} <= quantile({"Own funds"}, 0.95))
You can choose whether the quantile is evaluated once, on the dataset from which the rules are generated. To evaluate it, use:
params = {'evaluate_quantile': True}
This would produce the rule
if () then ({"Own funds"}<=755.0)
If you use
params = {'evaluate_quantile': False}
then this would produce
if () then ({"Own funds"}<=quantile({"Own funds"},0.95))
In this case the quantile is re-evaluated each time the rule is evaluated, so the outcome depends on the current dataset.
The default is False (quantiles within rules are not evaluated).
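The difference between the two settings can be sketched without the package. The `quantile` helper below uses linear interpolation, which is the default of pandas and reproduces the value 755.0 for the “Own funds” column of the example data; that ruleminer uses this interpolation is an assumption, and `new_own_funds` is a hypothetical second dataset.

```python
import math


def quantile(values, q):
    """Quantile with linear interpolation between neighbouring values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# 'Own funds' column of the example data.
own_funds = [200, 800, 100, 700, 200, 200, 200, 200, 200, 200]

# evaluate_quantile=True: the threshold is fixed when the rule is generated.
frozen_threshold = quantile(own_funds, 0.95)

# A hypothetical new dataset on which the rule is evaluated later.
new_own_funds = [100, 200, 300, 400, 500]
frozen_result = [value <= frozen_threshold for value in new_own_funds]

# evaluate_quantile=False: the threshold is recomputed on the current data.
live_threshold = quantile(new_own_funds, 0.95)
live_result = [value <= live_threshold for value in new_own_funds]

print(frozen_result)
print(live_result)
```

With the frozen threshold every new row satisfies the rule, whereas with the recomputed threshold the largest value of the new dataset becomes an exception.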
Rule pruning
Because column names can contain regexes, some generated rules are identical to other rules except for the ordering of their columns. For example:
max({"TP life"}, {"TP nonlife"})
is identical to:
max({"TP nonlife"}, {"TP life"})
The generated rules are therefore pruned to remove such duplicates from the list, using the following identities:
a==b is identical to b==a
a!=b is identical to b!=a
min(a, b) is identical to min(b, a)
max(a, b) is identical to max(b, a)
a+b is identical to b+a
a*b is identical to b*a
These identities are applied recursively in rules. So the rule:
(({"4"}>{"3"}) & (({"2"}+{"1"})=={"0"}))
is identical to:
((({"1"}+{"2"})=={"0"}) & ({"4"}>{"3"}))
and will therefore be pruned from the list if the first rule is already in the list.
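One way to detect such duplicates is to bring every rule into a canonical form in which the arguments of commutative operators are sorted. The sketch below works on nested tuples of the form (operator, argument, ...); this representation and the helpers `canonical` and `prune` are hypothetical and do not describe how ruleminer stores or prunes rules. Treating “&” and “|” as commutative is an assumption based on the example above.

```python
# Operators whose arguments may be reordered without changing the rule.
COMMUTATIVE = {"==", "!=", "min", "max", "+", "*", "&", "|"}


def canonical(expr):
    """Recursively sort the arguments of commutative operators."""
    if not isinstance(expr, tuple):
        return expr  # a column name or a constant
    operator, *args = expr
    args = [canonical(arg) for arg in args]
    if operator in COMMUTATIVE:
        # repr gives a stable ordering for mixed strings and tuples.
        args.sort(key=repr)
    return (operator, *args)


def prune(rules):
    """Keep the first rule of every group of identical rules."""
    seen, kept = set(), []
    for rule in rules:
        key = canonical(rule)
        if key not in seen:
            seen.add(key)
            kept.append(rule)
    return kept


# The two rules from the example above, written as nested tuples.
rule_a = ("&", (">", "4", "3"), ("==", ("+", "2", "1"), "0"))
rule_b = ("&", ("==", ("+", "1", "2"), "0"), (">", "4", "3"))

kept = prune([rule_a, rule_b])
print(len(kept))
```

Non-commutative operators such as “>” keep their argument order, so 4 > 3 and 3 > 4 remain distinct rules.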
Rule template grammar
The rule template describes the structure of the rule. Columns and quoted strings in the rule template can contain simple regular expressions.
The syntax of the template follows a grammar defined as follows:
a template is of the form:
if cond_1 then cond_2
or simply a single:
cond_1
a condition is either a combination of comparisons with logical operators ('&' and '|') and parentheses:
( comp_1 & comp_2 | comp_3 )
or simply a single comparison:
comp_1
a comparison consists of a term, a comparison operator (>=, >, <=, <, != or ==) and a term, so:
term_1 > term_2
a term can be a number (e.g. 3.1415 or 9), quoted string (a string with single or double quotes), or a function of columns
a function of columns is either a prefix operator (min, max, quantile, or abs, in lowercase or uppercase) applied to one or more columns, for example:
min(col_1, col_2, col_3)
or infix operators with one or more columns:
(col_1 + col_2 * col_3)
a column is a string with braces, so:
{"Type"}
where “Type” is the name of the column in the DataFrame with the data
a string consists of a-z A-Z 0-9 _ . , ; ; < > * = + - / ? | @ # $ % ^ & ( )
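As a small illustration of the column syntax, the column references in a template can be extracted with a regular expression. This is a hypothetical helper, not the parser used by ruleminer, and it only recognises column names in double quotes.

```python
import re

# A column is a quoted string wrapped in braces, e.g. {"Type"}.
COLUMN_PATTERN = re.compile(r'\{"([^"]*)"\}')


def column_references(template):
    """Return the column names (or column regexes) used in a template."""
    return COLUMN_PATTERN.findall(template)


template = 'if ({"Type"} == "life_insurer") then ({"TP-life"} > 0)'
references = column_references(template)
print(references)
```

The quoted string "life_insurer" is not wrapped in braces and is therefore not reported as a column.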
Debugging rules
If you are using this in a Jupyter notebook, you can add at the beginning:
import logging, sys
logging.basicConfig(stream=sys.stdout, format='%(asctime)s %(message)s', level=logging.INFO)
Information about the rule generation process will then be displayed in the notebook. Set the level to logging.DEBUG if you want more detailed output.
History
0.1.0 (2021-11-21)
First release on PyPI.
0.1.1 (2021-11-23)
Added more documentation to the README text
0.1.2 (2022-1-20)
Bug fixes wrt some complex expressions
0.1.3 (2022-1-26)
Optimized rule generation process
0.1.4 (2022-1-26)
Evaluated columns in then part are now dependent on if part of rule
0.1.5 (2022-1-30)
Rule with quantiles added (including evaluating intermediate results)
0.1.6 and 0.1.7 (2022-2-1)
A number of optimizations in the rule generation process
0.1.8 (2022-2-3)
Rule power factor metric added
0.1.12 (2022-5-11)
Optimizations: metric calculations are done with boolean masks of DataFrame
0.1.14 (2023-4-17)
Nested functions added
substr and in operators added
0.1.16 (2023-8-3)
Templates now do not necessarily have to contain a regex
Bug fix when evaluating rules that contain columns that do not exist
Templates now can start with ‘if () then’
0.1.17 (2023-8-8)
Generate rules now runs without specified data
0.1.18 (2023-8-8)
Dedicated function added for template to rule conversion without data
Exp sign changed from ^ to **