Skip to main content

Hyper parameter optimization extension for ASReview

Project description

ASReview-hyperopt

Deploy and releaseBuild status

Hyper parameter optimization extension for ASReview. It uses the hyperopt package to quickly optimize parameters of the different models. The hyper parameters and their sample space are defined in the ASReview package, and automatically used for hyper parameter optimization.

Installation

The easiest way to install the hyper parameter optimization package is to use the command line:

pip install asreview-hyperopt

After installation of the visualization package, asreview should automatically detect it. Test this by:

asreview --help

It should list three new entry points: hyper-active, hyper-passive and hyper-cluster.

Basic usage

The three entry-points are used in a roughly similar fashion. The main difference between them is the types of models that have to be supplied:

  • hyper-cluster: feature_extraction
  • hyper-passive: model, balance_strategy, feature_extraction
  • hyper-active: model, balance_strategy, query_strategy, feature_extraction

To get help for entry points type:

asreview hyper-active --help

Which results in the following options:

usage: hyper-active [-h] [-n N_ITER] [-r N_RUN] [-d DATASETS] [--mpi]
                    [--data_dir DATA_DIR] [--output_dir OUTPUT_DIR]
                    [--server_job] [-m MODEL] [-q QUERY_STRATEGY]
                    [-b BALANCE_STRATEGY] [-e FEATURE_EXTRACTION]

optional arguments:
  -h, --help            show this help message and exit
  -n N_ITER, --n_iter N_ITER
                        Number of iterations of Bayesian Optimization.
  -r N_RUN, --n_run N_RUN
                        Number of runs per dataset.
  -d DATASETS, --datasets DATASETS
                        Datasets to use in the hyper parameter optimization
                        Separate by commas to use multiple at the same time
                        [default: all].
  --mpi                 Use the mpi implementation.
  --data_dir DATA_DIR   Base directory with data files.
  --output_dir OUTPUT_DIR
                        Output directory for trials.
  --server_job          Run job on the server. It will incur less overhead of
                        used CPUs, but more latency of workers waiting for the
                        server to finish its own job. Only makes sense in
                        combination with the flag --mpi.
  -m MODEL, --model MODEL
                        Prediction model for active learning.
  -q QUERY_STRATEGY, --query_strategy QUERY_STRATEGY
                        Query strategy for active learning.
  -b BALANCE_STRATEGY, --balance_strategy BALANCE_STRATEGY
                        Balance strategy for active learning.
  -e FEATURE_EXTRACTION, --feature_extraction FEATURE_EXTRACTION
                        Feature extraction method.

Data structure

The extension will by default search for datasets in the data directory, relative to the current working directory. Either put your datasets there, or specify and data directory.

The output of the runs will by default be stored in the output directory, relative to the current path.

An example of a structure that has been created:

output/
├── active_learning
│   ├── nb_max_double_tfidf
│      └── depression_hall_ace_ptsd_nagtegaal
│          ├── best
│             ├── ace
│             ├── depression
│             ├── hall
│             ├── nagtegaal
│             └── ptsd
│          ├── current
│             ├── ace
│             ├── depression
│             ├── hall
│             ├── nagtegaal
│             └── ptsd
│          └── trials.pkl
│   └── nb_max_random_double_tfidf
│       └── nagtegaal
│           ├── best
│              └── nagtegaal
│           ├── current
│              └── nagtegaal
│           └── trials.pkl
├── cluster
│   └── doc2vec
│       ├── ace
│          ├── best
│             └── ace
│          ├── current
│             └── ace
│          └── trials.pkl
│       ├── depression_hall_ace_ptsd_nagtegaal
│          └── current
│              ├── ace
│              ├── depression
│              ├── hall
│              ├── nagtegaal
│              └── ptsd
│       └── nagtegaal
│           └── current
│               └── nagtegaal
└── passive
    └── nb_double_tfidf
        └── depression
            ├── best
               └── depression
            ├── current
               └── depression
            └── trials.pkl

The files with name trials.pkl are special files that contain data on which trials were run.

To list these trials, use the following command:

asreview show $SOME_DIRECTORY/trials.pkl

It should give a list of trials sorted by the loss (lower is better). The column names (apart from the loss) are prefixed with the kind of parameter it is:

  • mdl: Model parameter
  • bal: Balance strategy parameter
  • qry: Query strategy parameter
  • fex: Feature extraction parameter

Options

The default number of iterations is 1, which you'll probably want to increase. It depends on the number of hyper-parameters that need to be optimized, but several hundred iterations is probably a good estimate for most combinations to get reasonably close to the optimum. In all cases, use good common sense; if the loss is still going down at a quick pace, do a few more iterations.

The hyperopt extension has built-in support for MPI. MPI is used for parallelization of runs. On a local PC with an MPI-implementation (like OpenMPI) installed, one could run with 4 cores:

mpirun -n 4 asreview hyper-active --mpi

If you want to be slightly more efficient on a machine with a low number of cores, you can run jobs on the MPI server as well:

mpirun -n 4 asreview hyper-active --mpi --server_job

On super computers one should sometimes replace mpirun with srun.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

asreview-hyperopt-0.2.0.tar.gz (14.7 kB view details)

Uploaded Source

Built Distribution

asreview_hyperopt-0.2.0-py3-none-any.whl (28.1 kB view details)

Uploaded Python 3

File details

Details for the file asreview-hyperopt-0.2.0.tar.gz.

File metadata

  • Download URL: asreview-hyperopt-0.2.0.tar.gz
  • Upload date:
  • Size: 14.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.7

File hashes

Hashes for asreview-hyperopt-0.2.0.tar.gz
Algorithm Hash digest
SHA256 e44b6d1177fdde664bddbf45833e18bfa92508b65be05f799391c1f2c91ba5f3
MD5 840a9416ae76eb075d53d512d2e57eac
BLAKE2b-256 ed6488b40d95126efb82362b1e91a9ac6d975d406d44e2bade5bee6cf4a37f24

See more details on using hashes here.

File details

Details for the file asreview_hyperopt-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: asreview_hyperopt-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 28.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.7

File hashes

Hashes for asreview_hyperopt-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9823cec609c423f5f3643714147190c18fc0323ad53c17eac267e433cb9cf9cb
MD5 81c15eb955a8341634ca531ef7b176c6
BLAKE2b-256 9670cf48648024e3e6f40f523733ee9d8f56a134231d6e7708d12b5b910ea6aa

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page