Analyze, explore and visualize instance hardness within datasets

Development Status
- 5 - Production/Stable
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence

Project description

PyHard

Instance hardness analysis in Python with a two-fold objective: to provide insights into data quality issues, and to give a better understanding of the strengths and weaknesses of different algorithms.

Documentation: https://ita-ml.gitlab.io/pyhard/
Source code: https://gitlab.com/ita-ml/pyhard
Bug reports: https://gitlab.com/ita-ml/pyhard/-/issues

Getting Started

PyHard employs a methodology known as Instance Space Analysis (ISA) to analyse performance at the instance level rather than at the dataset level. The result is an alternative way of visualizing algorithm performance for each instance within a dataset, by relating predictive performance to estimated instance hardness measures extracted from the data. This analysis reveals the regions of strength and weakness of each predictor (known as footprints), and highlights individual instances within a dataset that warrant further investigation, either because of their unique properties or because of potential data quality issues.
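To make the idea of an instance hardness measure concrete, the sketch below computes one measure from the instance hardness literature, k-Disagreeing Neighbors (kDN): the fraction of an instance's k nearest neighbours that carry a different class label. This is an illustration only. The function name and the toy data are hypothetical, and it is not taken from PyHard's code; which measures PyHard computes, and how, is controlled through its configuration file.

```python
from math import dist


def k_disagreeing_neighbors(X, y, k=3):
    """For each instance, return the fraction of its k nearest
    neighbours (Euclidean distance) whose label differs from its own.

    Assumes len(X) > 1 so that every instance has at least one neighbour.
    """
    scores = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances to every other instance; the instance itself is excluded.
        others = sorted((dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in others[:k]]
        disagree = sum(1 for j in neighbours if y[j] != yi)
        scores.append(disagree / len(neighbours))
    return scores


# Toy data: the last point is labelled 1 but sits inside the class-0 cluster.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (0.05, 0.05)]
y = [0, 0, 0, 1, 1, 1]
scores = k_disagreeing_neighbors(X, y, k=3)
print(scores)
```

In this toy dataset the last instance gets a score of 1.0, because all three of its nearest neighbours belong to the other class. That is the kind of instance the analysis is meant to flag for further inspection.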

Installation

Although the original ISA toolkit was written in Matlab, we provide a lighter version in Python, with fewer tools but enough for instance hardness analysis. The implementation is available in the separate package PyISpace. The choice of ISA engine is nevertheless left to the user and can be set in the configuration file. Below, we present the standard installation, as well as the additional (optional) steps to configure the Matlab engine.

For users

pip install pyhard

For developers

Alternatively, if you are a developer and want to contribute, the following installation is better suited for testing new features:

git clone https://gitlab.com/ita-ml/pyhard.git
cd pyhard
pip install -e .

Anaconda environment

We strongly recommend using a separate Python environment. We provide an env file environment.yml to create a conda env with all required dependencies:

conda env create --file environment.yml

Usage

First, make sure that the configuration files are in the current directory and contain the desired settings. To generate these files, run

pyhard init

This will create both config.yaml and options.json in the current directory.

The file config.yaml is used to configure steps 1-4 below. It sets the options for file paths, measures, classifiers, feature selection and hyper-parameter optimization. More instructions can be found in the comments within the file.

At a minimum, the field datafile (in section 'general') must be set in config.yaml. It specifies the path (absolute or relative) of the input dataset. If the field rootdir is left as '.' (the default), the output files are saved in the current folder along with the configuration files (recommended).

Once everything is configured, run the analysis:

pyhard run

By default, the following steps are performed:

1. Calculate the hardness measures;
2. Evaluate classification performance at instance level for each algorithm;
3. Select the most relevant hardness measures with respect to the instance classification error;
4. Join the outputs of steps 1, 2 and 3 to build the metadata file (metadata.csv);
5. Run ISA (Instance Space Analysis), which generates the Instance Space (IS) representation and the footprint areas.

Steps 1 to 4 comprise the metadata construction, and step 5 the ISA itself. To skip either of these two major stages, pass the following options to the run command:

--no-meta: does not attempt to build the metadata file
--no-isa: prevents the Instance Space Analysis
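For scripted workflows, the two options above can be assembled programmatically. The following is a minimal sketch, assuming the pyhard command is on the PATH. The helper names (pyhard_run_command, run_pipeline) and the dry_run parameter are hypothetical and not part of PyHard; only the run command and the --no-meta and --no-isa options come from the text above.

```python
import shlex
import subprocess


def pyhard_run_command(build_metadata=True, run_isa=True):
    """Build the argument list for `pyhard run`, skipping stages on request."""
    cmd = ["pyhard", "run"]
    if not build_metadata:
        cmd.append("--no-meta")  # skip steps 1-4 (metadata construction)
    if not run_isa:
        cmd.append("--no-isa")   # skip step 5 (Instance Space Analysis)
    return cmd


def run_pipeline(build_metadata=True, run_isa=True, dry_run=True):
    """Print the command (dry run) or execute it in the current directory."""
    cmd = pyhard_run_command(build_metadata, run_isa)
    if dry_run:
        print(shlex.join(cmd))
    else:
        # Raises CalledProcessError if pyhard exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


# Build the metadata only, without running ISA.
run_pipeline(run_isa=False)  # prints: pyhard run --no-isa
```

Keeping dry_run=True as the default makes it easy to check the command line before starting a run.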

Finally, to explore the results, launch the app:

pyhard app

To see all CLI commands, run pyhard --help, or pyhard run --help to get the options for this command.

Guidelines for input dataset

Please follow the recommendations below:

Only csv files are accepted
The dataset should have the shape (n_instances, n_features)
It cannot contain NaNs or missing values
Do not include any index column. Instances will be indexed in order, starting from 1
The last column should contain the target variable (y). Otherwise, the name of the target column must be declared in the field target_col (config file)
Categorical features should be encoded beforehand
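The guidelines above can be checked before running the analysis. The sketch below is a hypothetical pre-flight check written with the standard library only; it is not part of PyHard. It assumes the CSV has a header row, treats empty cells and the strings "nan", "na" and "null" as missing values, and uses a simple heuristic (first column counting 0..n-1 or 1..n) to spot an index column.

```python
import csv

MISSING = {"", "nan", "na", "null"}  # assumed spellings of missing values


def check_rows(lines, target_col=None):
    """Return a list of guideline violations for CSV text given as lines."""
    rows = list(csv.reader(lines))
    if len(rows) < 2:
        return ["file has no data rows"]
    header, data = rows[0], rows[1:]
    problems = []
    for n, row in enumerate(data, start=1):
        if len(row) != len(header):
            problems.append(f"instance {n}: expected {len(header)} columns, got {len(row)}")
        elif any(cell.strip().lower() in MISSING for cell in row):
            problems.append(f"instance {n}: missing value")
    # Heuristic: a first column that simply counts rows is probably an index.
    first = [row[0].strip() for row in data if row]
    for start in (0, 1):
        if first == [str(i) for i in range(start, start + len(first))]:
            problems.append("first column looks like an index column")
    # If the target is not the last column, its name must exist in the header.
    if target_col is not None and target_col not in header:
        problems.append(f"target column '{target_col}' not found in header")
    return problems


def check_dataset(path, target_col=None):
    """Run the same checks on a file; only .csv files are accepted."""
    problems = [] if path.lower().endswith(".csv") else ["file is not a .csv"]
    with open(path, newline="") as fh:
        return problems + check_rows(fh, target_col)


sample = ["id,f1,y", "1,0.5,0", "2,,1"]
for problem in check_rows(sample):
    print(problem)
```

An empty list means none of the checks failed. It does not guarantee that PyHard will accept the file, since categorical columns, for instance, are not detected here.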

Citation

If you're using PyHard in your research or application, please cite our paper:

Paiva, P. Y. A., Moreno, C. C., Smith-Miles, K., Valeriano, M. G., & Lorena, A. C. (2022). Relating instance hardness to classification performance in a dataset: a visual approach. Machine Learning, 111(8), 3085-3123. https://doi.org/10.1007/s10994-022-06205-9

@article{paiva2022relating,
      title={Relating instance hardness to classification performance in a dataset: a visual approach},
      author={Paiva, Pedro Yuri Arbs and Moreno, Camila Castro and Smith-Miles, Kate and Valeriano, Maria Gabriela and Lorena, Ana Carolina},
      journal={Machine Learning},
      volume={111},
      number={8},
      pages={3085--3123},
      year={2022},
      publisher={Springer}
}

Release history

2.2.4 (this version): Oct 18, 2024
2.2.3: Oct 16, 2024
2.2.2: Oct 26, 2023
2.2.1: Jul 9, 2023
2.2.0: Jul 7, 2023
2.1.8: Jul 4, 2023
2.1.7: Jul 4, 2023
2.1.6: Jun 10, 2023
2.1.5: Apr 29, 2023
2.1.4: Apr 19, 2023
2.1.3: Mar 11, 2023
2.1.2: Dec 6, 2022
2.1.1: Oct 22, 2022
2.1.0: Sep 9, 2022
2.0.3: Jul 3, 2022
2.0.2: May 8, 2022
2.0.1: May 3, 2022
2.0.0: May 3, 2022
1.9.3: Nov 14, 2021
1.9.2: Oct 17, 2021
1.9.1: Oct 13, 2021
1.9.0: Oct 11, 2021
1.8.6: Aug 21, 2021
1.8.5: Apr 28, 2021
1.8.4: Apr 28, 2021
1.8.3: Apr 20, 2021
1.8.2: Mar 21, 2021
1.8.1: Mar 11, 2021
1.8.0: Mar 11, 2021
1.7.3: Feb 24, 2021
1.7.2: Feb 15, 2021
1.7.1: Feb 2, 2021
1.7 (yanked): Feb 2, 2021
1.6: Feb 1, 2021
1.5: Jan 29, 2021
1.4: Jan 29, 2021
1.3: Jan 28, 2021
1.2: Jan 22, 2021
1.1: Jan 14, 2021
0.3: Jan 14, 2021
0.2: Jan 14, 2021

Download files

Source Distribution

pyhard-2.2.4.tar.gz (3.5 MB)

Uploaded Oct 18, 2024 Source

Built Distribution

pyhard-2.2.4-py3-none-any.whl (3.5 MB)

Uploaded Oct 18, 2024 Python 3

File details

Details for the file pyhard-2.2.4.tar.gz.

File metadata

Download URL: pyhard-2.2.4.tar.gz
Upload date: Oct 18, 2024
Size: 3.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: python-requests/2.32.3

File hashes

Hashes for pyhard-2.2.4.tar.gz
Algorithm	Hash digest
SHA256	`f7b3099dd2fc7c9e8f44817878e32c9cefbf959dcb13c9e1b8fdd407b0edda99`
MD5	`cf2e79a6f1dd4c016aa2be7ddfea5f09`
BLAKE2b-256	`b8d515b576bc553f0cd167b4047d8e9b510144b14fd3f5c80dc793fbad495ca5`

File details

Details for the file pyhard-2.2.4-py3-none-any.whl.

File metadata

Download URL: pyhard-2.2.4-py3-none-any.whl
Upload date: Oct 18, 2024
Size: 3.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: python-requests/2.32.3

File hashes

Hashes for pyhard-2.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`04025edb7d6173d21c7286faefe593ec389bef01634698000885476a0ef32804`
MD5	`e76a3a2385e93c310921c25cf4c6cab4`
BLAKE2b-256	`e91a419a3f83e253f8ddbfb9686670c714c40e3efcdc7e5f0d1fa1e1e2e89a15`
