PyHard
Instance Hardness Python package
Getting Started
PyHard employs a methodology known as Instance Space Analysis (ISA) to analyse performance at the instance level rather than at the dataset level. The result is an alternative way of visualizing algorithm performance for each instance within a dataset, relating predictive performance to estimated instance hardness measures extracted from the data. This analysis reveals regions of strength and weakness of predictors (aka footprints), and highlights individual instances within a dataset that warrant further investigation, either due to their unique properties or potential data quality issues.
Installation
Although the original ISA toolkit is written in Matlab, we provide a lighter version in Python, with fewer tools, but enough for instance hardness analysis purposes. You may find the implementation in the separate package PyISpace. Notwithstanding, the choice of the ISA engine is left to the user, and can be set in the configuration file. Below, we present the standard installation, as well as the additional steps to configure the Matlab engine (optional).
Standard installation
pip install pyhard
Alternatively,
git clone https://gitlab.com/ita-ml/pyhard.git
cd pyhard
pip install -e .
Additional steps for Matlab engine (optional)
Python 3.7 is required. Matlab is also required in order to run the ISA source code. As far as we know, only recent versions of Matlab offer an engine for Python 3; we have only tested versions from R2019b onward.

1. Install the Matlab engine for Python. Refer to this link, which contains detailed instructions.
2. Clone the ISA repository. You may find it here.
3. Change the config file: in `config.yaml`, set the fields `isa_engine: matlab` and `matildadir: path/to/isa_folder` (cloned in step 2).
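For reference, the two fields might look like this in `config.yaml` (an illustrative excerpt; the surrounding keys in the real file may differ):

```yaml
# Excerpt of config.yaml for the Matlab engine (illustrative)
isa_engine: matlab              # use the Matlab ISA toolkit instead of the default Python engine
matildadir: path/to/isa_folder  # path to the ISA repository cloned in step 2
```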
Usage
First, make sure that the configuration files are placed within the current directory and with the desired settings. Otherwise, see this section for more details.
Then, in the command line, run:
pyhard
By default, the following steps shall be taken:

1. Calculate the hardness measures;
2. Evaluate classification performance at the instance level for each algorithm;
3. Select the most relevant hardness measures with respect to the instance classification error;
4. Join the outputs of steps 1, 2 and 3 to build the metadata file (`metadata.csv`);
5. Run ISA (Instance Space Analysis), which generates the Instance Space (IS) representation and the footprint areas.

To explore the results from step 5, launch the visualization dashboard:
pyhard --app
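To give a feel for what step 1 computes, the sketch below implements k-Disagreeing Neighbors (kDN), one of the hardness measures from the instance hardness literature: the fraction of an instance's k nearest neighbors that carry a different label. This is a standalone illustration, not PyHard's internal implementation.

```python
import numpy as np

def k_disagreeing_neighbors(X, y, k=3):
    """kDN: fraction of the k nearest neighbors of each instance
    that have a different label. Higher values indicate harder instances."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)
    # pairwise Euclidean distances between all instances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # an instance is not its own neighbor
    kdn = np.empty(n)
    for i in range(n):
        nn = np.argsort(d[i])[:k]           # indices of the k nearest neighbors
        kdn[i] = np.mean(y[nn] != y[i])     # fraction with a disagreeing label
    return kdn

# Two tight clusters, plus one instance (the last) placed inside the wrong cluster
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [0.5, 0.5]]
y = [0, 0, 0, 1, 1, 1]
print(k_disagreeing_neighbors(X, y, k=3))  # the last instance gets kDN = 1.0
```

The mislabeled-looking instance surrounded by the opposite class scores the maximum hardness, which is exactly the kind of instance the dashboard helps to single out.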
Individual steps can be disabled:

- `--no-meta`: does not attempt to build the metadata file
- `--no-isa`: does not run the Instance Space Analysis

To see all command line options, run `pyhard --help`.
Input file
Please follow the guidelines below:

- Only `csv` files are accepted
- The dataset should be in the format `(n_instances, n_features)`
- Do not include any index column; instances will be indexed in order, starting from 1
- The last column must contain the classes of the instances
- Categorical features should be handled beforehand
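A minimal sketch of preparing an input file that follows these guidelines, using pandas; the column and file names are hypothetical, and one-hot encoding is just one way to handle the categorical feature:

```python
import pandas as pd

# Hypothetical raw dataset with one categorical feature and a target class
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "color": ["red", "blue", "red", "green"],  # categorical feature
    "label": ["yes", "no", "no", "yes"],       # target class
})

# Handle categorical features beforehand (one-hot encoding in this sketch)
features = pd.get_dummies(df.drop(columns=["label"]))

# The class column must come last, and no index column is written
prepared = pd.concat([features, df["label"]], axis=1)
prepared.to_csv("my_dataset.csv", index=False)
```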
Configuration
Inside the folder where the command `pyhard` is run, make sure that the files `config.yaml` and `options.json` are present. They contain the configurations for PyHard and ISA, respectively. They can be generated locally with the command `pyhard -F`.

The file `config.yaml` is used to configure steps 1-4. Through it, options for file paths, measures, classifiers, feature selection and hyper-parameter optimization can be set. More instructions may be found inside the file.
A configuration file in another location can be specified in the command line:
pyhard -c path/to/new_config.yaml
Visualization
Demo
The demo visualization app can display any dataset located within `pyhard/data/`. Each folder within this directory (whose name is the problem name) should contain these three files:

- `data.csv`: the dataset itself;
- `metadata.csv`: the metadata with measures and algorithm performances (`feature_` and `algo_` columns);
- `coordinates.csv`: the instance space coordinates.

The displayed dataset can be chosen through the app interface. To run the demo, use the command:
pyhard --demo
New problems may be added as new folders in `data/`. Multidimensional data will be reduced with the chosen dimensionality reduction method.
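A sketch of adding a new problem folder for the demo app. All data values below are made up; only the file layout and the `feature_`/`algo_` column-name convention come from the description above:

```python
import os
import pandas as pd

problem_dir = os.path.join("data", "my_problem")  # hypothetical problem name
os.makedirs(problem_dir, exist_ok=True)

# data.csv: the dataset itself, class in the last column
data = pd.DataFrame({"x1": [0.1, 0.9], "x2": [0.2, 0.8], "class": [0, 1]})
data.to_csv(os.path.join(problem_dir, "data.csv"), index=False)

# metadata.csv: hardness measures (feature_*) and algorithm performances (algo_*)
metadata = pd.DataFrame({"feature_kDN": [0.0, 0.3], "algo_svc": [0.95, 0.70]})
metadata.to_csv(os.path.join(problem_dir, "metadata.csv"), index=False)

# coordinates.csv: 2D instance space coordinates produced by ISA
coordinates = pd.DataFrame({"z_1": [-1.2, 0.7], "z_2": [0.4, -0.9]})
coordinates.to_csv(os.path.join(problem_dir, "coordinates.csv"), index=False)
```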
App
From the command line it is possible to launch the app for visualization of 2D datasets along with their respective instance space. The plots are linked, and options for coloring and hover information are available. In order to run the app, use the command:
pyhard --app
It should open the browser automatically and display the data.
References
Base
- Michael R. Smith, Tony Martinez, and Christophe Giraud-Carrier. 2014. An instance level analysis of data complexity. Mach. Learn. 95, 2 (May 2014), 225–256.
- Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin Kam Ho. 2019. How Complex Is Your Classification Problem? A Survey on Measuring Classification Complexity. ACM Comput. Surv. 52, 5, Article 107 (October 2019), 34 pages.
- Mario A. Muñoz, Laura Villanova, Davaatseren Baatar, and Kate Smith-Miles. 2018. Instance spaces for machine learning classification. Mach. Learn. 107, 1 (January 2018), 109–147.

Feature selection
- Luiz H. Lorena, André C. Carvalho, and Ana C. Lorena. 2015. Filter Feature Selection for One-Class Classification. Journal of Intelligent and Robotic Systems 80, 1 (October 2015), 227–243.
- Artur J. Ferreira and Mário A. T. Figueiredo. 2012. Efficient feature selection filters for high-dimensional data. Pattern Recognition Letters 33, 13 (October 2012), 1794–1804.
- Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, and Huan Liu. 2017. Feature Selection: A Data Perspective. ACM Comput. Surv. 50, 6, Article 94 (January 2018), 45 pages.
- Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. 2015. Efficient Estimation of Mutual Information for Strongly Dependent Variables. In AISTATS, 2015. Available at http://arxiv.org/abs/1411.2003.

Hyperparameter optimization
- James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS’11). Curran Associates Inc., Red Hook, NY, USA, 2546–2554.
- Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’12). Curran Associates Inc., Red Hook, NY, USA, 2951–2959.
- J. Bergstra, D. Yamins, and D. D. Cox. 2013. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (ICML’13). JMLR.org, I–115–I–123.
File details
Details for the file `pyhard-1.8.5.tar.gz`.
File metadata
- Download URL: pyhard-1.8.5.tar.gz
- Upload date:
- Size: 1.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/49.6.0.post20210108 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `7024ac234e155d9c5c17346a07079bed682c0236d2ecf6cc9c9ce82bf9205925` |
| MD5 | `38e7a2570f66436f374fe9ac75c16fda` |
| BLAKE2b-256 | `95f9a2ad0b95755a73160291e0725f40c2ee54998588438e49306a171ae1f74b` |