A pure Python implementation of the functional ANOVA algorithm.

HyANOVA
=======

HyANOVA is a pure Python implementation of the functional ANOVA algorithm, which can be used to analyze the importance of hyperparameters in machine learning algorithms.

Quick Start
-----------

To install the package, please use the ``pip`` installation as follows:

.. code:: shell

   pip install hyanova

Here is a short example of usage. You can download the
`data <./examples/iris%5BGridSearchCV%5DModel1.csv>`__ from the example
folder.

.. code:: python

   import hyanova

   path = './iris[GridSearchCV]Model1.csv'         # gridsearch results generated by sklearn
   metric = 'mean_test_score'              # metric for model performance
   df,params = hyanova.read_csv(path,metric)
   # df,params = hyanova.read_df(df,metric)         You can also load data from pd.DataFrame
   importance = hyanova.analyze(df)

The ``metric`` is the column you choose to measure model performance;
it must appear as a column in the ``.csv`` file or in the
``pandas.DataFrame`` object. The result will look similar to the output
below; see the ANOVA section for more details.

.. code:: python

   print(importance)
   >>>              u       v_u  F_u(v_u/v_all)
   0           (alpha,)  0.056885        0.892057
   1        (l1_ratio,)  0.002489        0.039030
   2  (alpha, l1_ratio)  0.004394        0.068912

APIs
~~~~

Load Data
'''''''''

HyANOVA is designed to analyze the grid search results generated by
sklearn. It provides two ways to load the data.

-  You can use ``read_df(df,metric)`` to load data from a
   ``<class 'pandas.core.frame.DataFrame'>`` object. It will return two
   objects.

   -  a ``DataFrame`` containing the value of every hyperparameter and
      of the metric you chose
   -  a ``list`` of all hyperparameter names

   Here is an example.

   .. code:: python

      print(df.head())

   .. code:: shell

      >>> mean_fit_time  std_fit_time  mean_score_time  std_score_time  param_alpha  \
      0       0.003899      0.000194         0.048513        0.007621     0.000977   
      1       0.003401      0.000584         0.042454        0.011295     0.000977   
      2       0.002706      0.000502         0.048544        0.009059     0.000977   
      3       0.003304      0.000531         0.040709        0.003031     0.000977   
      4       0.001801      0.000116         0.000289        0.000014     0.000977   

         param_l1_ratio                                     params  \
      0            0.00   {'alpha': 0.0009765625, 'l1_ratio': 0.0}   
      1            0.25  {'alpha': 0.0009765625, 'l1_ratio': 0.25}   
      2            0.50   {'alpha': 0.0009765625, 'l1_ratio': 0.5}   
      3            0.75  {'alpha': 0.0009765625, 'l1_ratio': 0.75}   
      4            1.00   {'alpha': 0.0009765625, 'l1_ratio': 1.0}   

         split0_test_score  split1_test_score  split2_test_score  mean_test_score  \
      0           0.828571           0.971429           0.971429         0.923810   
      1           0.885714           0.971429           0.942857         0.933333   
      2           0.885714           1.000000           0.942857         0.942857   
      3           0.885714           0.914286           0.914286         0.904762   
      4           0.885714           1.000000           0.942857         0.942857   

         std_test_score  rank_test_score  
      0        0.067344                4  
      1        0.035635                3  
      2        0.046657                1  
      3        0.013469                5  
      4        0.046657                1  

   .. code:: python

      df,params = hyanova.read_df(df,'mean_test_score')
      print(df.head())
      >>>  alpha  l1_ratio  mean_test_score
      0  0.000977      0.00         0.923810
      1  0.000977      0.25         0.933333
      2  0.000977      0.50         0.942857
      3  0.000977      0.75         0.904762
      4  0.000977      1.00         0.942857
      print(params)
      >>> ['alpha', 'l1_ratio']

-  Use ``hyanova.read_csv(path,metric)`` to load data from a ``.csv``
   file. A `template
   file <./examples/iris%5BGridSearchCV%5DModel1.csv>`__ can be found in
   the example folder. This call is equivalent to
   ``hyanova.read_df(pandas.read_csv(path),metric)``.
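
The transformation shown in the ``read_df`` example can be approximated
with plain pandas. The sketch below is not HyANOVA's implementation; it
assumes that hyperparameter columns carry scikit-learn's ``param_``
prefix, and the helper name ``select_params_and_metric`` is hypothetical.

.. code:: python

   import pandas as pd

   def select_params_and_metric(cv_results, metric):
       """Keep the param_* columns (prefix stripped) plus the metric, metric last."""
       if metric not in cv_results.columns:
           raise KeyError(f"metric {metric!r} is not a column of the results table")
       # Assumption: hyperparameter columns are named "param_<name>"
       param_cols = [c for c in cv_results.columns if c.startswith("param_")]
       params = [c[len("param_"):] for c in param_cols]
       out = cv_results[param_cols + [metric]].copy()
       out.columns = params + [metric]
       return out, params

   # Two rows in the shape of the grid search table shown above
   cv_results = pd.DataFrame({
       "mean_fit_time": [0.003899, 0.003401],
       "param_alpha": [0.000977, 0.000977],
       "param_l1_ratio": [0.00, 0.25],
       "params": [{"alpha": 0.0009765625, "l1_ratio": 0.0},
                  {"alpha": 0.0009765625, "l1_ratio": 0.25}],
       "mean_test_score": [0.923810, 0.933333],
   })
   df, params = select_params_and_metric(cv_results, "mean_test_score")

The ``params`` dictionary column is dropped because its name does not
start with ``param_``; only the individual hyperparameter columns and
the metric remain.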

ANOVA
'''''

Use ``hyanova.analyze(df)`` to run the functional ANOVA decomposition. It
needs a ``pandas.DataFrame`` object in a format similar to the
following table. You can use the loading methods HyANOVA provides to
build it easily.

== ======== ======== ===============
\  alpha    l1_ratio mean_test_score
== ======== ======== ===============
0  0.000977 0.00     0.923810
1  0.000977 0.25     0.933333
2  0.000977 0.50     0.942857
3  0.000977 0.75     0.904762
== ======== ======== ===============

**Note:** The metric (``mean_test_score``) must always be in the last
column.

``hyanova.analyze(df)`` returns a ``DataFrame`` with each hyperparameter
subset (``u``), its variance (``v_u``) and its importance (``F_u``).

.. code:: python

   importance = hyanova.analyze(df)
   >>> 100%|██████████████████████████████████| 3/3 [00:00<00:00, 11.32it/s]
   print(importance)
   >>>              u       v_u  F_u(v_u/v_all)
   0           (alpha,)  0.056885        0.892057
   1        (l1_ratio,)  0.002489        0.039030
   2  (alpha, l1_ratio)  0.004394        0.068912

**Note:** ``F_u`` is the ratio of the variance attributed to the
hyperparameter subset itself (``v_u``) to the variance over all trials
(``v_all``), so the ``F_u`` values always sum to 1. See the references
for more details.
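
For a complete grid over two hyperparameters, the decomposition behind
``v_u`` and ``F_u`` can be written in a few lines of NumPy. This is a
minimal sketch under stated assumptions, not HyANOVA's implementation:
it assumes exactly two hyperparameter columns, one row per grid
combination, uniform weighting of grid points, and the metric in the
last column. The function name ``fanova_two_params`` and the toy scores
are hypothetical.

.. code:: python

   import numpy as np
   import pandas as pd

   def fanova_two_params(df):
       """Functional ANOVA on a complete two-parameter grid (metric in last column)."""
       p1, p2, metric = df.columns
       # Rows index the values of p1, columns index the values of p2
       grid = df.pivot(index=p1, columns=p2, values=metric).to_numpy()
       f0 = grid.mean()                              # overall mean
       f1 = grid.mean(axis=1, keepdims=True) - f0    # main effect of p1
       f2 = grid.mean(axis=0, keepdims=True) - f0    # main effect of p2
       f12 = grid - f0 - f1 - f2                     # interaction effect
       v_all = ((grid - f0) ** 2).mean()             # variance over all trials
       rows = []
       for u, f_u in [((p1,), f1), ((p2,), f2), ((p1, p2), f12)]:
           v_u = float((f_u ** 2).mean())
           rows.append({"u": u, "v_u": v_u, "F_u": v_u / v_all})
       return pd.DataFrame(rows)

   # Hypothetical 2 x 2 grid of scores, for illustration only
   toy = pd.DataFrame({
       "alpha": [0.1, 0.1, 1.0, 1.0],
       "l1_ratio": [0.0, 1.0, 0.0, 1.0],
       "mean_test_score": [0.9, 0.8, 0.5, 0.6],
   })
   result = fanova_two_params(toy)

Because the main effects and the interaction are orthogonal on a
complete grid, the three ``v_u`` values add up to ``v_all``, which is
why the fractions sum to 1.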

Example usage
-------------

You can use sklearn to run a hyperparameter search and then use HyANOVA to analyze the importance of the hyperparameters.

.. code:: python

   import numpy as np
   import pandas as pd
   import sklearn.datasets
   from sklearn.model_selection import GridSearchCV
   from sklearn.svm import SVC

   import hyanova

   # Load the iris data set
   iris = sklearn.datasets.load_iris()
   X = iris.data
   y = iris.target

   # Grid search over C and the kernel type
   model = SVC()
   grid = {'C': np.linspace(1e-9, 128, 10000),
           'kernel': ('rbf', 'linear', 'poly', 'sigmoid')}
   grid_search = GridSearchCV(model, grid)
   result = grid_search.fit(X, y)

   # Convert the search results and analyze hyperparameter importance
   df = pd.DataFrame(result.cv_results_)
   metric = 'mean_test_score'
   df, params = hyanova.read_df(df, metric)
   importance = hyanova.analyze(df)
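
Once an importance table is available, ranking it is ordinary pandas
work. The sketch below rebuilds the table by hand from the output shown
in the ANOVA section so that it runs without HyANOVA installed; it
assumes the column names are exactly those printed there.

.. code:: python

   import pandas as pd

   # Values copied from the printed output in the ANOVA section
   importance = pd.DataFrame({
       "u": [("alpha",), ("l1_ratio",), ("alpha", "l1_ratio")],
       "v_u": [0.056885, 0.002489, 0.004394],
       "F_u(v_u/v_all)": [0.892057, 0.039030, 0.068912],
   })

   # Sort from most to least important hyperparameter subset
   ranked = importance.sort_values("F_u(v_u/v_all)", ascending=False)
   ranked = ranked.reset_index(drop=True)
   top = ranked.loc[0, "u"]

In this table ``alpha`` alone accounts for most of the variance,
followed by the ``alpha``/``l1_ratio`` interaction.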

Dependencies
------------

-  numpy
-  pandas
-  tqdm

Why was HyANOVA created?
------------------------

I am completing my undergraduate thesis. To better understand the models used in my article, I looked at many algorithms that can measure the importance of hyperparameters. Among them, functional ANOVA seemed to be the most effective. However, the original authors' implementation is based on Java and uses Python to call Java files, which I found confusing. I wanted a module that is easier to understand and implemented entirely in Python to help me with the ANOVA decomposition, so I created HyANOVA. I hope it helps you too!

References
----------

1. Hutter, F., Hoos, H. & Leyton-Brown, K. (2014). An Efficient
   Approach for Assessing Hyperparameter Importance. Proceedings of the
   31st International Conference on Machine Learning, in PMLR
   32(1):754-762.
2. https://github.com/frank-hutter/fanova

