pymfe: Python Meta-Feature Extractor
The pymfe (python meta-feature extractor) package provides a comprehensive set of meta-features implemented in Python. The package brings cutting-edge meta-features, following proposals from the recent literature. The pymfe architecture was designed to make extraction systematic, producing a robust set of meta-features. Moreover, pymfe follows a recent meta-feature formalization, aiming to make MtL reproducible.
Here, you can use different measures and summary functions, set their hyperparameters, and automatically measure the elapsed time. Moreover, you can extract meta-features from specific models, or even extract meta-features with confidence intervals using bootstrap. There are many other interesting features; you can learn more about them in the documentation.
Meta-feature
In the meta-learning (MtL) literature, meta-features are measures used to characterize datasets and/or their relations with algorithm bias.
"Meta-learning is the study of principled methods that exploit metaknowledge to obtain efficient models and solutions by adapting the machine learning and data mining process." (Brazdil et al. (2008))
Meta-features are used in MtL and AutoML tasks in general: to represent/understand a dataset, to understand a learning bias, to create machine learning (or data mining) recommendation systems, and to create surrogate models, to name a few.
Pinto et al. (2016) and Rivolli et al. (2018) defined a meta-feature as follows. Let $D \in \mathcal{D}$ be a dataset, $m\colon \mathcal{D} \to \mathbb{R}^{k'}$ be a characterization measure, and $\sigma\colon \mathbb{R}^{k'} \to \mathbb{R}^{k}$ be a summarization function. Both $m$ and $\sigma$ also have associated hyperparameters, $h_m$ and $h_\sigma$ respectively. Thus, a meta-feature $f\colon \mathcal{D} \to \mathbb{R}^{k}$ for a given dataset $D$ is
$$ f\big(D\big) = \sigma\big(m(D,h_m), h_\sigma\big). $$
The measure $m$ can extract more than one value from each dataset, i.e., $k'$ can vary according to $D$; these values can then be mapped to a vector of fixed length $k$ using a summarization function $\sigma$.
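As an illustration of this formalization (a minimal sketch in plain NumPy, independent of the pymfe API; the particular choices of measure and summary function here are arbitrary), take $m$ to be the per-attribute standard deviation and $\sigma$ the pair (mean, standard deviation):

```python
import numpy as np

# Toy dataset D with 4 instances and 3 numeric attributes.
D = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [2.0, 3.0, 4.0],
])

def m(D, ddof=0):
    """Characterization measure m(D, h_m): per-attribute standard
    deviation. Returns k' values, where k' is the number of attributes
    of D, so k' varies with the dataset. ddof plays the role of h_m."""
    return D.std(axis=0, ddof=ddof)

def sigma(values):
    """Summarization function: maps the k' values to a fixed-length
    vector of k = 2 values (their mean and standard deviation)."""
    return np.array([values.mean(), values.std()])

# The meta-feature f(D) = sigma(m(D, h_m), h_sigma).
print(sigma(m(D, ddof=1)))  # ≈ [2.6458, 0.0]
```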
This package provides the following meta-feature groups:
 General: General information related to the dataset, also known as simple measures, such as the number of instances, attributes, and classes;
 Statistical: Standard statistical measures to describe the numerical properties of the data distribution;
 Information-theoretic: Particularly appropriate to describe discrete (categorical) attributes and their relationship with the classes;
 Model-based: Measures designed to extract characteristics from simple machine learning models;
 Landmarking: Performance of simple and efficient learning algorithms;
 Relative Landmarking: Relative performance of simple and efficient learning algorithms;
 Subsampling Landmarking: Performance of simple and efficient learning algorithms on a subsample of the dataset;
 Clustering: Clustering measures that extract information about the dataset based on external validation indexes;
 Concept: Estimate the variability of class labels among examples and the example density;
 Itemset: Compute the correlation between binary attributes; and
 Complexity: Estimate the difficulty in separating the data points into their expected classes.
Dependencies
The main pymfe
requirement is:
 Python (>= 3.6)
Installation
The installation process is similar to other packages available on pip:
pip install -U pymfe
It is possible to install the development version using:
pip install -U git+https://github.com/ealcobaca/pymfe
or
git clone https://github.com/ealcobaca/pymfe.git
cd pymfe
python setup.py install
Example of use
The simplest way to extract meta-features is by instantiating the MFE class. By default, it computes five meta-feature groups, using mean and standard deviation as summary functions: General, Statistical, Information-theoretic, Model-based, and Landmarking. The fit method can be called by passing X and y. Then the extract method is used to extract the related measures. A simple example using pymfe for supervised tasks is given next:
# Load a dataset
from sklearn.datasets import load_iris
from pymfe.mfe import MFE
data = load_iris()
y = data.target
X = data.data
# Extract default measures
mfe = MFE()
mfe.fit(X, y)
ft = mfe.extract()
print(ft)
# Extract general, statistical and information-theoretic measures
mfe = MFE(groups=["general", "statistical", "infotheory"])
mfe.fit(X, y)
ft = mfe.extract()
print(ft)
# Extract all available measures
mfe = MFE(groups="all")
mfe.fit(X, y)
ft = mfe.extract()
print(ft)
You can simply omit the target attribute for unsupervised tasks while fitting
the data into the MFE
model. The pymfe
package automatically finds and
extracts only the meta-features suitable for this type of task. Examples are
given next:
# Load a dataset
from sklearn.datasets import load_iris
from pymfe.mfe import MFE
data = load_iris()
y = data.target
X = data.data
# Extract default unsupervised measures
mfe = MFE()
mfe.fit(X)
ft = mfe.extract()
print(ft)
# Extract all available unsupervised measures
mfe = MFE(groups="all")
mfe.fit(X)
ft = mfe.extract()
print(ft)
Several measures return more than one value. To aggregate the returned values, a summarization function can be used. It can compute min, max, mean, median, kurtosis, standard deviation, among others. The default methods are mean and sd. An example of their use is given next:
## Extract default measures using min, median and max
mfe = MFE(summary=["min", "median", "max"])
mfe.fit(X, y)
ft = mfe.extract()
print(ft)
## Extract default measures using quantile
mfe = MFE(summary=["quantiles"])
mfe.fit(X, y)
ft = mfe.extract()
print(ft)
You can easily list all available meta-feature groups, meta-features, summary methods, and meta-features filtered by groups of interest:
from pymfe.mfe import MFE
# Check all available metafeature groups in the package
print(MFE.valid_groups())
# Check all available metafeatures in the package
print(MFE.valid_metafeatures())
# Check available metafeatures filtering by groups of interest
print(MFE.valid_metafeatures(groups=["general", "statistical", "infotheory"]))
# Check all available summary functions in the package
print(MFE.valid_summary())
It is possible to pass custom arguments to every meta-feature using the MFE extract method kwargs. The keyword must be the target meta-feature name, and the value must be a dictionary in the format {argument: value}, i.e., each key in the dictionary is a target argument with its respective value. In the example below, the extraction of the meta-features min and max happens as usual, but the meta-features sd, nr_norm, and nr_cor_attr will receive user-defined argument values, which will affect each meta-feature result.
# Extract measures with custom user arguments
mfe = MFE(features=["sd", "nr_norm", "nr_cor_attr", "min", "max"])
mfe.fit(X, y)
ft = mfe.extract(
    sd={"ddof": 0},
    nr_norm={"method": "all", "failure": "hard", "threshold": 0.025},
    nr_cor_attr={"threshold": 0.6},
)
print(ft)
If you want to extract meta-features from a pre-fitted machine learning model (from the sklearn package), you can use the extract_from_model method without needing to use the training data:
import sklearn.tree
from sklearn.datasets import load_iris
from pymfe.mfe import MFE
# Extract from model
iris = load_iris()
model = sklearn.tree.DecisionTreeClassifier().fit(iris.data, iris.target)
extractor = MFE()
ft = extractor.extract_from_model(model)
print(ft)
# Extract specific metafeatures from model
extractor = MFE(features=["tree_shape", "nodes_repeated"], summary="histogram")
ft = extractor.extract_from_model(
    model,
    arguments_fit={"verbose": 1},
    arguments_extract={"verbose": 1, "histogram": {"bins": 5}})
print(ft)
You can also extract your meta-features with confidence intervals using bootstrap. Keep in mind that this method extracts each meta-feature several times, so it may be very expensive, depending mainly on your data and the number of meta-feature extraction methods called.
# Extract metafeatures with confidence interval
mfe = MFE(features=["mean", "nr_cor_attr", "sd", "max"])
mfe.fit(X, y)
ft = mfe.extract_with_confidence(
    sample_num=256,
    confidence=0.99,
    verbose=1,
)
print(ft)
Documentation
We wrote comprehensive documentation to guide you on how to use the pymfe library; there you can find several pages of interest.
Developer notes
 We are glad to accept any contributions; please check Contributing and the Documentation.
 To submit bugs and feature requests, report them at the project issues page.
License
This project is licensed under the MIT License - see the License file for details.
Cite Us
If you use pymfe in a scientific publication, we would appreciate citations
to the following paper:
You can also use the bibtex format:
@article{JMLR:v21:19-348,
  author  = {Edesio Alcobaça and
             Felipe Siqueira and
             Adriano Rivolli and
             Luís P. F. Garcia and
             Jefferson T. Oliva and
             André C. P. L. F. de Carvalho},
  title   = {MFE: Towards reproducible meta-feature extraction},
  journal = {Journal of Machine Learning Research},
  year    = {2020},
  volume  = {21},
  number  = {111},
  pages   = {1-5},
  url     = {http://jmlr.org/papers/v21/19-348.html}
}
Acknowledgments
We would like to thank every contributor who, directly or indirectly, helped make this project happen. Thank you all.
References
 Brazdil, P., Carrier, C. G., Soares, C., & Vilalta, R. (2008). Metalearning: Applications to Data Mining. Springer Science & Business Media.
 Pinto, F., Soares, C., & Mendes-Moreira, J. (2016, April). Towards automatic generation of metafeatures. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 215-226). Springer, Cham.
 Rivolli, A., Garcia, L. P. F., Soares, C., Vanschoren, J., & de Carvalho, A. C. P. L. F. (2018). Characterizing classification datasets: a study of meta-features for meta-learning. arXiv:1808.10406.