Bayesian Rule Set Mining

## Project description

Finds a rule set from the data.

The input data must follow this format:

X must be a pandas DataFrame. Column names must not contain '_' or '<' and must not be purely numeric.

Categorical data must be represented as strings (for example, gender should be 'male'/'female', or '0'/'1' to represent male and female respectively). The parser only recognizes data in this format, so transform the data set before calling the functions.

y must be a numpy.ndarray.
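The format requirements above can be met with a small preprocessing step. The sketch below is a minimal illustration, not part of the package: `prepare_inputs`, its arguments, and the `col` prefix for numeric column names are hypothetical choices, and it assumes that a "pure number" column name is one made only of digits.

```python
import numpy as np
import pandas as pd


def prepare_inputs(df, target, positive_label):
    """Return (X, y) in the layout described above.

    Hypothetical helper for illustration; it is not provided by ruleset.
    """
    # Label vector as a boolean numpy.ndarray.
    y = (df[target] == positive_label).to_numpy()
    X = df.drop(columns=[target]).copy()

    # Strip the characters that column names must not contain.
    names = [str(c).replace("_", "").replace("<", "") for c in X.columns]

    # A purely numeric name is not allowed, so prefix it with letters
    # (assumption: "pure number" means digits only).
    X.columns = ["col" + c if c.isdigit() else c for c in names]

    # Categorical columns must hold strings; numeric columns are left alone.
    for col in X.columns:
        if not pd.api.types.is_numeric_dtype(X[col]):
            X[col] = X[col].astype(str)
    return X, y


raw = pd.DataFrame({
    "hours_per_week": [40, 20, 50],
    "sex": ["male", "female", "female"],
    0: ["a", "b", "a"],
    "income": [">50K", "<=50K", ">50K"],
})
X, y = prepare_inputs(raw, target="income", positive_label=">50K")
print(list(X.columns))
print(type(y).__name__, y.tolist())
```

The helper does not check for name collisions: two columns that differ only by '_' or '<' would end up with the same name and need to be renamed by hand.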

Reference:

Wang, Tong, et al. "Bayesian rule sets for interpretable classification." Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016.

The program is strict about the input data format: X must be a pandas DataFrame and y must be a numpy.ndarray.

Parameters

----------

max_rules : int, default 5000

Maximum number of rules when generating rules

max_iter : int, default 200

Maximum number of iterations to find the rule set

chains : int, default 1

Number of chains that run in parallel

support : int, default 5

The support is the percentile threshold for the itemset

to be selected.

maxlen : int, default 3

The maximum number of items in a rule

Note: alpha_1, beta_1, alpha_2 and beta_2 correspond to alpha_+, beta_+, alpha_- and beta_- respectively.

alpha_1 : float, default 100

alpha_+

beta_1 : float, default 1

beta_+

alpha_2 : float, default 100

alpha_-

beta_2 : float, default 1

beta_-

alpha_l : float array, shape (maxlen+1,)

Defaults to all elements equal to 1

beta_l : float array, shape (maxlen+1,)

Defaults to the corresponding patternSpace

level : int, default 4

Number of intervals used to discretize continuous numerical features

neg : boolean, default True

Whether to negate the features

add_rules : list, default empty

User-defined rules to add.

The user must supply the numerical version of the rules.

criteria : str, default 'precision'

The criterion used to filter rules when more than max_rules rules are generated

greedy_initilization : boolean, default False

Whether to start the rule set with a greedy initialization (according to accuracy)

greedy_threshold : float, default 0.05

Threshold for the greedy algorithm

to find the starting rule set

propose_threshold : float, default 0.1

Threshold for a proposal to be accepted

method : str, default 'fpgrowth'

The method used to generate rules.

Can be 'fpgrowth' or 'forest'

Note that if there are potentially many rules, fpgrowth is not a good choice because it can run into memory issues (rule screening happens only after rule generation).
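To make the roles of support, maxlen, max_rules and criteria more concrete, here is a minimal brute-force sketch of rule generation followed by screening. It is not the package's fpgrowth or forest implementation. `mine_rules`, the item strings, and the reading of support as a percentage of rows are assumptions made for illustration only.

```python
from itertools import combinations


def mine_rules(rows, labels, support=5, maxlen=3, max_rules=5000):
    """Enumerate candidate rules by brute force (illustrative only).

    rows   : list of sets, each holding the items that are true for one sample
    labels : list of bool, True for the positive class
    """
    n = len(rows)
    # Assumption: support is a percentage of the number of rows.
    min_count = n * support / 100.0
    items = sorted(set().union(*rows))
    rules = []
    for length in range(1, maxlen + 1):
        for rule in combinations(items, length):
            # Labels of the samples that satisfy every item in the rule.
            covered = [y for row, y in zip(rows, labels) if row.issuperset(rule)]
            if covered and len(covered) >= min_count:
                precision = sum(covered) / len(covered)
                rules.append((rule, len(covered), precision))
    # Screening happens only after all rules exist, which is why memory
    # grows with the number of candidates. Rank by precision, then coverage.
    rules.sort(key=lambda r: (-r[2], -r[1], r[0]))
    return rules[:max_rules]


rows = [
    {"age=young", "sex=male"},
    {"age=young", "sex=female"},
    {"sex=male"},
    {"age=young", "sex=male"},
]
labels = [True, False, False, True]
top = mine_rules(rows, labels, support=50, maxlen=2, max_rules=2)
for rule, count, precision in top:
    print(rule, count, round(precision, 2))
```

Because every itemset up to maxlen items is materialized before filtering, lowering support or raising maxlen increases the candidate count quickly, which matches the memory warning above.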

Sample usage

import pandas as pd

from ruleset import *

df = pd.read_csv('data/adult.dat', header=None, sep=',', names=['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'matritalstatus', 'occupation', 'relationship', 'race', 'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountary', 'income'])

# as_matrix() was removed in pandas 1.0; .values returns the same numpy.ndarray

y = (df['income'] == '>50K').values

# Drop the label column so that df holds only the features

df.drop('income', axis=1, inplace=True)

model = BayesianRuleSet(method='forest')

model.fit(df, y)
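A rule set classifies a sample as positive when at least one of its rules is fully satisfied, that is, it acts as an OR of ANDs. The sketch below illustrates that decision logic on plain dictionaries. The rule representation, `rule_set_predict`, and the example rules are hypothetical and do not reflect how the package stores or reports its fitted rules.

```python
def rule_set_predict(rule_set, row):
    """Return True if any rule in the set is fully satisfied by the row.

    rule_set : list of dicts mapping feature name -> required value
    row      : dict mapping feature name -> observed value
    """
    return any(
        all(row.get(feature) == value for feature, value in rule.items())
        for rule in rule_set
    )


# Hypothetical rule set: (education AND sex) OR (occupation).
rule_set = [
    {"education": "Masters", "sex": "male"},
    {"occupation": "Exec-managerial"},
]

people = [
    {"education": "Masters", "sex": "male", "occupation": "Sales"},
    {"education": "HS-grad", "sex": "female", "occupation": "Exec-managerial"},
    {"education": "Masters", "sex": "female", "occupation": "Sales"},
]
preds = [rule_set_predict(rule_set, person) for person in people]
print(preds)
```

The third person matches only one of the two conditions in the first rule and none in the second, so no rule fires and the prediction is negative.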

## Download files

| Filename, size | File type | Python version | Upload date |
|---|---|---|---|
| ruleset-1.0.1-py3-none-any.whl (15.4 kB) | Wheel | py3 | |