
(In development) A toolkit for benchmarking machine learning tools

Project description

Nexula (Nexus Lab)

Open-source benchmark toolkit (still in development): an easy, extendable, and reproducible toolkit for benchmarking NLP models. It currently offers a minimal feature set.

Expect a lot of bugs in the source code :).

How to install

pip install nexula

The installation above does not install the deep learning packages. If you want to use deep learning, install pytorch and torchtext manually.
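For example (versions are not pinned by Nexula, so treat this as a suggestion rather than the officially supported combination):

pip install torch torchtext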

Overview

This library aims to remove the need to hunt down the code for several machine learning models across separate sites when benchmarking or testing them.

Have you ever benchmarked several machine learning models and had to visit many websites to collect the code, then run and configure each of them one by one to compare the results? For us, this is a real pain in the neck.

We want this library to make it easier to benchmark well-known models that are ready to run. We also want it to be EXTENDABLE (customizable by the user) and easy to REPRODUCE, while staying simple to use.

For now, this library is far from that dream, but we will achieve it.

Quickstart

See the examples folder; its README.md should guide you.

CLI Command

  -h, --help            show this help message and exit
  -r RUN_YAML, --run-yaml RUN_YAML
                        YAML file that drives the nexula run
  -v, --verbose         Add verbosity (logging from `info` to `debug`)
  -c CUSTOM_MODULE, --custom-module CUSTOM_MODULE
                        Add a custom module directory (containing your custom code)

Example

Your working directory:

sample_run.yaml
custom_nexula/custom_preprocessing.py

Run the yaml and include your custom code:

python -m nexula -r sample_run.yaml -c custom_nexula

Run as Module/API

To be announced.

Features

Nexula builds mostly on features from:

  • scikit-learn
  • pytorch-lightning

Nexula currently supports only one way of setting up the data:

  • the dataset input must be pre-split into train, dev, and test sets (a hypothetical CSV layout is sketched below)
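
The expected CSV schema is not documented in this README, so the following two-column layout is purely hypothetical, for illustration only:

text,label
this movie was great,1
terrible plot and acting,0

The tests/dummy_data/train.csv, dev.csv, and test.csv files referenced in the yaml examples below would each follow the same layout.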

Pipeline

We separate the pipeline into two steps:

  • Create the dataloader that feeds the model
  • Train the model and run prediction

We separate models into two kinds:

  • Boomer (shallow learning), using scikit-learn
  • Millenial (deep learning), using pytorch (wrapped by pytorch-lightning)

Data Preprocessing

  • Lowercase (nexus_basic_preprocesser): lowercases the input.

Data Feature Representer Boomer

  • TF-IDF (nexus_tf_idf_representer): fits a TF-IDF vectorizer on the training dataset (see the sketch below).
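
Since Nexula's shallow-learning features come from scikit-learn, this presumably wraps scikit-learn's TfidfVectorizer. A minimal standalone sketch of the same idea (not Nexula's internals):

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on the training split only, then reuse the fitted vectorizer
# for dev/test so no information leaks from the evaluation data.
train_texts = ['this is a new data', 'another training sentence']
dev_texts = ['an unseen dev sentence']

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_texts)
x_dev = vectorizer.transform(dev_texts)
print(x_train.shape, x_dev.shape)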

Data Feature Representer TorchText

  • TorchText (nexus_millenial_representer): uses TorchText to turn text into sequences of token indices (illustrated below).
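
TorchText's API changes between versions, so here is a version-agnostic illustration of the idea in plain Python (the concept only, not Nexula's actual code): each token is mapped to a vocabulary index so the model receives integer sequences.

texts = ['this is a new data', 'this is another sentence']

# Build a vocabulary; indices 0 and 1 are reserved for padding and
# unknown tokens, a common convention.
vocab = {'<pad>': 0, '<unk>': 1}
for text in texts:
    for token in text.split():
        vocab.setdefault(token, len(vocab))

def encode(text):
    return [vocab.get(token, vocab['<unk>']) for token in text.split()]

print(encode('this is a test'))  # unseen tokens map to <unk> (index 1)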

Boomer Model

All of these are imported from scikit-learn; an educated guess at the mapping is sketched after the list.

  • nexus_boomer_logistic_regression
  • nexus_boomer_linear_svc
  • nexus_boomer_gaussian_process
  • nexus_boomer_random_forest
  • nexus_boomer_ada_boost
  • nexus_boomer_multinomial_nb
  • nexus_boomer_quadratic_discriminant
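
The exact import paths are not documented here, but the names line up with standard scikit-learn estimators. A guess at the correspondence (an assumption based on the names, not taken from Nexula's source):

# Assumed mapping from Nexula model names to scikit-learn estimators.
# This is a guess based on the names, not taken from Nexula's code.
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

BOOMER_MODELS = {
    'nexus_boomer_logistic_regression': LogisticRegression,
    'nexus_boomer_linear_svc': LinearSVC,
    'nexus_boomer_gaussian_process': GaussianProcessClassifier,
    'nexus_boomer_random_forest': RandomForestClassifier,
    'nexus_boomer_ada_boost': AdaBoostClassifier,
    'nexus_boomer_multinomial_nb': MultinomialNB,
    'nexus_boomer_quadratic_discriminant': QuadraticDiscriminantAnalysis,
}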

Millenial Model

All of these are implemented in this repository.

  • nexus_millenial_ccn1d_classification
  • nexus_millenial_lstm_classification

Run CLI

  • A yaml file controls the whole process. Below is an example; see Command Explanation.md in the examples folder for how to read it.
nexula_data:
  data_choice_type: 'manual_split'
  data_reader_type: 'read_csv'
  data_reader_args:
    train:
      file: 'tests/dummy_data/train.csv'
    dev:
      file: 'tests/dummy_data/dev.csv'
    test:
      file: 'tests/dummy_data/test.csv'
  data_pipeline:
    boomer:
      data_representer_func_list_and_args:
        - process: 'nexus_tf_idf_representer'

nexula_train:
  models:
    - model: 'nexus_boomer_logistic_regression'
  callbacks:
    - callback: 'model_saver_callback'
      params:
        output_dir: 'output/integration_test/'
    - callback: 'benchmark_reporter_callback'
      params:
        output_dir: 'output/integration_test/'

Customizable and Extendable

For every step in the pipeline, you can plug in your own process. You must extend the abstract class in nexula.nexula_inventory.inventory_base.

from nexula.nexula_inventory.inventory_base import NexusBaseDataInventory
import numpy as np


class AddNewData(NexusBaseDataInventory):

    name = 'add_new_data2'

    def __init__(self, new_data_x='this is a new data', new_data_y=1, **kwargs):
        super().__init__(**kwargs)
        self.new_data_x = new_data_x
        self.new_data_y = new_data_y
        self.model = None

    def get_model(self):
        return self.model

    def __call__(self, x, y, fit_to_data=True, *args, **kwargs):
        """
        Append one extra example (new_data_x, new_data_y) to the dataset.

        Parameters
        ----------
        x : array-like of input texts
        y : array-like of labels
        fit_to_data : unused here, kept for interface compatibility

        Returns
        -------
        The augmented (x, y) pair.
        """
        # np.concatenate takes a single sequence of arrays
        x = np.concatenate((x, [self.new_data_x]))
        y = np.concatenate((y, [self.new_data_y]))
        return x, y
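
A quick sanity check of the class above, run directly outside the Nexula pipeline (this assumes NexusBaseDataInventory needs no extra constructor arguments):

import numpy as np

adder = AddNewData(new_data_x='testing', new_data_y=0)
x, y = adder(np.array(['hello world']), np.array([1]))
print(x)  # ['hello world' 'testing']
print(y)  # [1 0]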

Your preprocessor can then be referenced in the yaml (in the nexula_data section):

nexula_data:
  data_choice_type: 'manual_split'
  data_reader_type: 'read_csv'
  data_reader_args:
    train:
      file: 'tests/dummy_data/train.csv'
    dev:
      file: 'tests/dummy_data/dev.csv'
    test:
      file: 'tests/dummy_data/test.csv'
  data_pipeline:
    boomer:
      data_preprocesser_func_list_and_args:
        - process: 'add_new_data2'
          params:
            init:
              new_data_x: 'testing'
              new_data_y: 0
      data_representer_func_list_and_args:
        - process: 'nexus_tf_idf_representer'

Callbacks

  • Model Saver (model_saver_callback): saves the model after fitting on the training dataset.
  • Benchmark Reporter Callback (benchmark_reporter_callback): outputs the benchmark result, which contains:
    • the chosen metrics (currently only F1 score and accuracy)
    • inference runtime
    • training runtime
  • Callbacks are also extendable! (A hypothetical sketch follows.)
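
The callback base class is not shown in this README, so the following is a purely hypothetical sketch; the class shape, the on_train_end hook, and its arguments are assumptions rather than Nexula's documented API:

# HYPOTHETICAL sketch: the hook name and its arguments are assumptions,
# not Nexula's documented callback API.
class PrintMetricsCallback:  # would extend Nexula's callback base class

    name = 'print_metrics_callback'

    def __init__(self, output_dir=None, **kwargs):
        self.output_dir = output_dir

    def on_train_end(self, model_name, metrics, **kwargs):
        # e.g. report F1/accuracy once a model finishes training
        print(f'{model_name}: {metrics}')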

End
