
Instance hardness package

Project description


PyHard

Instance Hardness Python package

Getting Started

Python 3.7 is required. Matlab is also required in order to run matilda. As far as we know, only recent versions of Matlab offer an engine for Python 3; we have only tested versions R2019b and later.

Alternatively, take a look at Graphene, the Instance Hardness Analytics Tool. Matlab, and even Python, are not required in this case!

Installation

  1. PyHard package
pip install -e git+https://gitlab.com/ita-ml/instance-hardness#egg=pyhard

Alternatively, clone the repository and install it locally:

git clone https://gitlab.com/ita-ml/instance-hardness.git
cd instance-hardness/
pip install -e .
  2. Install the Matlab engine for Python
    Refer to the official Matlab documentation, which contains detailed installation instructions.

Usage

Running the command:

python3 -m pyhard

The following steps are performed:

  1. Calculate the hardness measures;

  2. Evaluate classification performance at instance level for each algorithm;

  3. Select the most relevant hardness measures with respect to the instance classification error;

  4. Join the outputs of steps 1, 2 and 3 to build the metadata file (metadata.csv);

  5. Run matilda, which generates the Instance Space (IS) representation and the footprint areas;

  6. To explore the results from step 5, launch the visualization dashboard: python3 -m pyhard --app
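As a concrete illustration of step 1, one widely used hardness measure is k-Disagreeing Neighbors (kDN) from Smith et al. (see References): the fraction of an instance's k nearest neighbors that have a different label. The sketch below is purely illustrative and is not the package's own implementation:

```python
import math

def k_disagreeing_neighbors(X, y, k=3):
    """kDN hardness score: the fraction of an instance's k nearest
    neighbors (Euclidean distance) whose label differs from its own."""
    scores = []
    for i, xi in enumerate(X):
        # distances from instance i to every other instance
        dists = sorted(
            (math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj))), j)
            for j, xj in enumerate(X)
            if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        scores.append(sum(y[j] != y[i] for j in neighbors) / k)
    return scores

# Toy data: two separated clusters, plus one class-1 point inside the class-0 cluster
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
y = [0, 0, 0, 1, 1, 1, 1]
scores = k_disagreeing_neighbors(X, y)
# the point (0.5, 0.5) sits among class-0 instances, so it gets kDN = 1.0 (maximally hard)
```

Instances deep inside their own class get a score near 0, while mislabeled-looking or borderline instances approach 1.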

Individual steps can be disabled through command-line options:

  • --no-meta: does not attempt to build the metadata file

  • --no-matilda: does not run matilda

To see all command-line options, run python3 -m pyhard --help.

Configuration

The file instance-hardness/conf/config.yaml is used to configure steps 1-4. It sets options for file paths, measures, classifiers, feature selection, and hyper-parameter optimization; further instructions can be found inside the file itself.
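Purely for illustration, a configuration covering those option groups might look like the sketch below. All key names here are assumptions, not the package's actual schema; consult the shipped instance-hardness/conf/config.yaml for the real keys.

```yaml
# Hypothetical config.yaml sketch -- key names are illustrative only
paths:
  data: data/iris/data.csv      # input dataset
  output: data/iris/            # where metadata.csv is written
measures:                       # hardness measures to compute (step 1)
  - kDN
  - LSC
classifiers:                    # algorithms evaluated per instance (step 2)
  - random_forest
  - svm
feature_selection:
  enabled: true                 # keep measures relevant to the error (step 3)
hyper_param_optimization:
  enabled: false
```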

A configuration file in another location can be specified on the command line: python3 -m pyhard -c path/to/new_config.yaml

Visualization

Demo


The demo visualization app can display any dataset located in instance-hardness/data/. Each folder within this directory (whose name is the problem name) should contain these three files:

  • data.csv: the dataset itself;

  • metadata.csv: the metadata with measures and algorithm performances (feature_ and algo_ columns);

  • coordinates.csv: the instance space coordinates generated by Matilda.

The displayed dataset can be chosen through the app interface. To run the demo, use the command:

python3 -m pyhard --demo

New problems may be added as a new folder in data/. Multidimensional data will be reduced with the chosen dimensionality reduction method.
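Before adding a new problem folder, it can be handy to check that it follows the layout described above. This is a small illustrative helper (the function is ours, not part of the pyhard API):

```python
from pathlib import Path

# file names required in each problem folder, per the list above
REQUIRED = ("data.csv", "metadata.csv", "coordinates.csv")

def missing_files(problem_dir):
    """Return which of the three required files are absent from a
    problem folder under data/."""
    folder = Path(problem_dir)
    return [name for name in REQUIRED if not (folder / name).is_file()]
```

An empty return list means the folder is ready to be picked up by the demo app.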

App


From the command line it is possible to launch an app for visualizing 2D datasets along with their respective instance spaces. The graphics are linked, and options for color and hover information are available. To run the app, use the command:

python3 -m pyhard --app

It should open the browser automatically and display the data.

References

Base

  1. Michael R. Smith, Tony Martinez, and Christophe Giraud-Carrier. 2014. An instance level analysis of data complexity. Mach. Learn. 95, 2 (May 2014), 225–256.

  2. Ana C. Lorena, Luís P. F. Garcia, Jens Lehmann, Marcilio C. P. Souto, and Tin Kam Ho. 2019. How Complex Is Your Classification Problem? A Survey on Measuring Classification Complexity. ACM Comput. Surv. 52, 5, Article 107 (October 2019), 34 pages.

  3. Mario A. Muñoz, Laura Villanova, Davaatseren Baatar, and Kate Smith-Miles. 2018. Instance spaces for machine learning classification. Mach. Learn. 107, 1 (January  2018), 109–147.

Feature selection

  1. Luiz H. Lorena, André C. Carvalho, and Ana C. Lorena. 2015. Filter Feature Selection for One-Class Classification. Journal of Intelligent and Robotic Systems 80, 1 (October  2015), 227–243.

  2. Artur J. Ferreira and Mário A. T. Figueiredo. 2012. Efficient feature selection filters for high-dimensional data. Pattern Recognition Letters 33, 13 (October 2012), 1794–1804.

  3. Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, and Huan Liu. 2017. Feature Selection: A Data Perspective. ACM Comput. Surv. 50, 6, Article 94 (January 2018), 45 pages.

  4. Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. 2015. Efficient Estimation of Mutual Information for Strongly Dependent Variables. AISTATS, 2015. Available at http://arxiv.org/abs/1411.2003.

Hyper-parameter optimization

  1. James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS’11). Curran Associates Inc., Red Hook, NY, USA, 2546–2554.

  2. Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’12). Curran Associates Inc., Red Hook, NY, USA, 2951–2959.

  3. J. Bergstra, D. Yamins, and D. D. Cox. 2013. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (ICML’13). JMLR.org, I–115–I–123.

