cffs -- A Python Package for Constrained Filter Feature Selection
The package `cffs` contains classes to formulate and solve constrained-feature-selection problems.
In particular, we support univariate filter feature selection as the objective
and constraints from propositional logic and linear arithmetic.
We use an SMT (Satisfiability Modulo Theories) solver to find optimal feature sets under the constraints.
This document provides:
- Steps for setting up the package.
- A short overview of the (constrained-feature-selection) functionality.
- A demo of the functionality.
- Guidelines for developers who want to modify or extend the code base.
If you use this package for a scientific publication, please cite our paper:

```bibtex
@article{bach2022empirical,
  title={An Empirical Evaluation of Constrained Feature Selection},
  author={Bach, Jakob and Zoller, Kolja and Trittenbach, Holger and Schulz, Katrin and B{\"o}hm, Klemens},
  journal={SN Computer Science},
  volume={3},
  number={6},
  year={2022},
  doi={10.1007/s42979-022-01338-z}
}
```
Setup
You can install our package from PyPI:
```bash
python -m pip install cffs
```
Alternatively, you can install the package from GitHub:
```bash
python -m pip install git+https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection.git#subdirectory=src/cffs_package
```
If you already have the source code for the package (i.e., the directory in which this README resides) as a local directory on your computer (e.g., after cloning the project), you can also perform a local install:

```bash
python -m pip install cffs_package/
```
Functionality
The package `cffs` contains the following modules:

- `combi_expressions.py`: Formulate constraints in propositional logic and linear arithmetic, simultaneously for our own expression classes (`expressions.py`) and the solver `Z3`. These constraints may be added to an optimization problem in `combi_solving.py`.
- `combi_solving.py`: Formulate a constrained-filter-feature-selection optimization problem with constraints from `combi_expressions.py`. Count the number of solutions with our own implementation (`solving.py`) and optimize the problem with `Z3`.
- `expressions.py`: Formulate constraints in propositional logic and linear arithmetic, using our own expression classes. These constraints may be added to a satisfiability problem in `solving.py`.
- `feature_qualities.py`: Compute univariate feature qualities for the (linear) optimization objective.
- `solving.py`: Formulate a constrained-filter-feature-selection satisfiability problem with constraints from `expressions.py`. Count the number of solutions with our own implementation; optimization is not supported.
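For orientation, a minimal sketch of how these modules are addressed (the import paths follow the module list above and the demo in the next section):

```python
# Submodules of the cffs package, as listed above:
from cffs import combi_expressions, combi_solving, expressions, feature_qualities, solving
```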
Demo
For constrained filter feature selection, we first need to compute an individual quality score for each feature.
(We only consider univariate feature selection, so we ignore interactions between features.)
The objective value is the summed quality of all selected features.
To this end, `cffs.feature_qualities` provides functions for

- the absolute value of Pearson correlation between each feature and the prediction target (`abs_corr()`)
- the mutual information between each feature and the prediction target (`mut_info()`)
Both functions round the qualities to two digits to speed up solving.
(We found that the solver becomes slower the more precise the floats are,
as they are represented as rational numbers in the solver.)
As inputs, the quality functions require a dataset in X-y form (as used in `sklearn`).
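For example, Pearson-based qualities can be computed as follows (a minimal sketch; we assume `abs_corr()` accepts the same keyword arguments as `mut_info()` does in the demo below):

```python
import sklearn.datasets

from cffs.feature_qualities import abs_corr

X, y = sklearn.datasets.load_iris(as_frame=True, return_X_y=True)
# One quality score per column of X, rounded to two digits (keyword interface assumed):
print(abs_corr(X=X, y=y))
```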
After computing feature qualities, we set up an SMT optimization problem from `cffs.combi_solving`.
It is "combi" in the sense that our code wraps an existing SMT solver (`Z3`).
We retrieve the problem's decision variables (one binary variable for each feature) and use them to formulate constraints with `cffs.combi_expressions`.
These constraints are added to `Z3` but also to our own expression tree, which we use to count the number of valid solutions in the search space.
Finally, we start optimization.
```python
from cffs.combi_expressions import And, AtMost, Xor
from cffs.combi_solving import Problem
from cffs.feature_qualities import mut_info
import sklearn.datasets

X, y = sklearn.datasets.load_iris(as_frame=True, return_X_y=True)
feature_qualities = mut_info(X=X, y=y)
problem = Problem(variable_names=X.columns, qualities=feature_qualities)

print('--- Constrained problem ---')
variables = problem.get_variables()
problem.add_constraint(AtMost(variables, 2))
problem.add_constraint(And([Xor(variables[0], variables[1]), Xor(variables[2], variables[3])]))
print(problem.optimize())
print('Number of constraints:', problem.get_num_constraints())
print('Fraction of valid solutions:', problem.compute_solution_fraction())

print('\n--- Unconstrained problem ---')
problem.clear_constraints()
print(problem.optimize())
print('Number of constraints:', problem.get_num_constraints())
print('Fraction of valid solutions:', problem.compute_solution_fraction())
```
The output is the following:
```
--- Constrained problem ---
{'objective_value': 1.48, 'num_selected': 2, 'selected': ['petal width (cm)', 'sepal length (cm)']}
Number of constraints: 2
Fraction of valid solutions: 0.25

--- Unconstrained problem ---
{'objective_value': 2.71, 'num_selected': 4, 'selected': ['petal width (cm)', 'petal length (cm)', 'sepal length (cm)', 'sepal width (cm)']}
Number of constraints: 0
Fraction of valid solutions: 1.0
```
The optimization procedure returns the objective value (summed quality of selected features)
and the feature selection.
To assess how strongly the constraints cut down the space of valid solutions, we can use `compute_solution_fraction()`.
However, this function iterates over each solution candidate and checks whether it is valid, which becomes very expensive as the number of features grows.
Alternatively, `estimate_solution_fraction()` randomly samples solutions to estimate this quantity.
Our code snippet also shows that you can remove all constraints via `clear_constraints()` without setting up a new optimization problem.
You can also add further constraints after optimization and then optimize again.
The optimizer keeps its state between optimizations, so you may benefit from a warm start.
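As a sketch of this incremental workflow, continuing with the `problem` object from the demo above (we assume `estimate_solution_fraction()` can be called with a default sampling budget):

```python
# Add a further constraint after the previous optimization and optimize again (warm start):
problem.add_constraint(AtMost(variables, 1))
print(problem.optimize())

# Estimate the fraction of valid solutions by random sampling instead of full enumeration
# (assumption: the default sampling budget is used):
print(problem.estimate_solution_fraction())
```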
Developer Info
New operators / expression types for constraints in `expressions.py` should (directly or indirectly) inherit from the top-level superclass `Expression`.
Preferably, inherit from one of its subclasses: `BooleanExpression` (if your operator returns a boolean value; you need to override `is_true()`) or `ArithmeticExpression` (if your operator returns a numeric value; you need to override `get_value()`).
`is_true()` or `get_value()` expresses the logic of your operator, i.e., how it will be evaluated (typically depending on the values of child expressions serving as operands).
Also, you need to override `get_children()` to return all (direct) child expressions, so it is possible to traverse the full expression tree.
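For illustration, here is a minimal sketch of a hypothetical NAND operator (the class name and constructor argument are our own; we assume `BooleanExpression` does not require further initialization arguments):

```python
from cffs.expressions import BooleanExpression


class Nand(BooleanExpression):
    """Hypothetical operator: true unless all child expressions are true."""

    def __init__(self, bool_expressions):
        super().__init__()  # assumption: the superclass needs no further arguments
        self._expressions = bool_expressions

    def is_true(self) -> bool:
        # Operator logic: negation of the conjunction of all operands.
        return not all(expression.is_true() for expression in self._expressions)

    def get_children(self) -> list:
        # Direct child expressions, so the full expression tree can be traversed.
        return self._expressions
```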
`solving.py` supports adding arbitrary `BooleanExpression`s from `expressions.py` as constraints.
New operators / expression types for constraints in `combi_expressions.py` should simultaneously inherit from the superclass `BooleanExpression` in this module and one expression class from `expressions.py`.
Arithmetic expressions currently do not exist on their own in this module but appear only nested within more complex boolean expressions (like "weighted sum <= some threshold").
You need to initialize the `expressions.py` expression (preferably in the initializer, by calling `super().__init__()`) and store a corresponding `Z3` expression in the field `z3_expr`; use `get_z3()` to access the `Z3` representations of your child expressions serving as operands.
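Continuing the hypothetical NAND example from above, a corresponding class in `combi_expressions.py` might be sketched as follows (class names are our own; we assume the combi-level `BooleanExpression` defines no competing initializer, so `super().__init__()` reaches the `expressions.py` part):

```python
import z3

from cffs import combi_expressions


class NandCombi(combi_expressions.BooleanExpression, Nand):  # "Nand" is the sketch from above
    """Hypothetical NAND operator for both our own expression tree and Z3."""

    def __init__(self, bool_expressions):
        # Initialize the expressions.py part (assumption: super() forwards to Nand.__init__):
        super().__init__(bool_expressions)
        # Store the corresponding Z3 expression; get_z3() yields the operands' Z3 representations:
        self.z3_expr = z3.Not(z3.And([expression.get_z3() for expression in bool_expressions]))
```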
`combi_solving.py` supports adding arbitrary `BooleanExpression`s from `combi_expressions.py` as constraints.
Also, it supports arbitrary univariate feature qualities (e.g., computed with `feature_qualities.py`) to formulate a univariate (= linear) objective function.
New functions for univariate feature qualities in `feature_qualities.py` should satisfy the following implicit interface (see the sketch after this list):

- Two inputs: dataset `X` (`pandas.DataFrame`) and prediction target `y` (`pandas.Series`).
- Output: a sequence (e.g., list) of floats whose length equals the number of columns in `X`.
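A minimal sketch of such a function (the name `abs_spearman` is our own; the rounding mirrors the two-digit rounding of the built-in quality functions):

```python
import pandas as pd


def abs_spearman(X: pd.DataFrame, y: pd.Series) -> list:
    """Hypothetical feature quality: absolute Spearman correlation to the prediction target."""
    qualities = []
    for feature in X.columns:
        correlation = X[feature].corr(y, method='spearman')
        # Round to two digits (coarser floats speed up the SMT solver); treat NaN as 0:
        qualities.append(round(abs(correlation), 2) if pd.notna(correlation) else 0.0)
    return qualities
```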