Python binding to Omikuji, an efficient implementation of Partitioned Label Trees and its variations for extreme multi-label classification

Omikuji

An efficient implementation of Partitioned Label Trees (Prabhu et al., 2018) and its variations for extreme multi-label classification, written in Rust🦀 with love💖.

Features & Performance

Omikuji has been tested on datasets from the Extreme Classification Repository. All tests below were run on a quad-core Intel® Core™ i7-6700 CPU, and we allowed as many cores to be utilized as possible. We measured training time and calculated precision at 1, 3, and 5. (Note that, due to randomness, results might vary from run to run, especially for smaller datasets.)
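For reference, precision at k is the fraction of the k highest-scored predicted labels that are actually relevant for an example, averaged over all test examples (the tables below report it as a percentage). The following is a minimal sketch of that computation; the function name and the toy data are illustrative assumptions and not part of Omikuji.

```python
def precision_at_k(predictions, true_labels, k):
    """Mean precision@k over examples.

    predictions: one list of (label, score) pairs per example
    true_labels: one set of relevant labels per example
    """
    total = 0.0
    for pairs, relevant in zip(predictions, true_labels):
        # Rank labels by descending score and keep the top k
        top = sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
        hits = sum(1 for label, _ in top if label in relevant)
        total += hits / k
    return total / len(predictions)


# Toy data, made up for illustration only
predictions = [
    [(3, 0.9), (7, 0.6), (1, 0.2)],
    [(2, 0.8), (5, 0.5), (9, 0.1)],
]
true_labels = [{3, 1}, {5}]

p_at_1 = precision_at_k(predictions, true_labels, 1)
p_at_3 = precision_at_k(predictions, true_labels, 3)
```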

Parabel, better parallelized

Omikuji provides a more parallelized implementation of Parabel (Prabhu et al., 2018) that trains faster when more CPU cores are available. Compared to the original implementation written in C++, which can only utilize the same number of CPU cores as the number of trees (3 by default), Omikuji maintains the same level of precision but trains 1.3x to 1.7x faster on our quad-core machine. Further speed-up is possible if more CPU cores are available.

Dataset	Metric	Parabel	Omikuji (balanced, cluster.k=2)
EURLex-4K	P@1	82.2	82.1
	P@3	68.8	68.8
	P@5	57.6	57.7
	Train Time	18s	14s
Amazon-670K	P@1	44.9	44.8
	P@3	39.8	39.8
	P@5	36.0	36.0
	Train Time	404s	234s
WikiLSHTC-325K	P@1	65.0	64.8
	P@3	43.2	43.1
	P@5	32.0	32.1
	Train Time	959s	659s
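
The 1.3x to 1.7x range quoted above can be checked directly against the training times in the table. A small sketch of the arithmetic (the variable names are ours):

```python
# Training times in seconds, copied from the table above: (Parabel, Omikuji)
train_times = {
    "EURLex-4K": (18, 14),
    "Amazon-670K": (404, 234),
    "WikiLSHTC-325K": (959, 659),
}

# Speed-up is the ratio of the original training time to Omikuji's
speedups = {
    name: parabel / omikuji
    for name, (parabel, omikuji) in train_times.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```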

Regular k-means for shallow trees

Following Bonsai (Khandagale et al., 2019), Omikuji supports using regular k-means instead of balanced 2-means clustering for tree construction, which results in wider, shallower and unbalanced trees that train slower but have better precision. Compared to the original Bonsai implementation, Omikuji achieves the same precision while training 2.6x to 4.6x faster on our quad-core machine. (Similarly, further speed-up is possible if more CPU cores are available.)

Dataset	Metric	Bonsai	Omikuji (unbalanced, cluster.k=100, max_depth=3)
EURLex-4K	P@1	82.8	83.0
	P@3	69.4	69.5
	P@5	58.1	58.3
	Train Time	87s	19s
Amazon-670K	P@1	45.5*	45.6
	P@3	40.3*	40.4
	P@5	36.5*	36.6
	Train Time	5,759s	1,753s
WikiLSHTC-325K	P@1	66.6*	66.6
	P@3	44.5*	44.4
	P@5	33.0*	33.0
	Train Time	11,156s	4,259s

*Precision numbers as reported in the paper; our machine doesn't have enough memory to run the full prediction with their implementation.

Balanced k-means for balanced shallow trees

Sometimes it's desirable to have shallow and wide trees that are also balanced, in which case Omikuji supports the balanced k-means algorithm used by HOMER (Tsoumakas et al., 2008) for clustering as well.

Dataset	Metric	Omikuji (balanced, cluster.k=100)
EURLex-4K	P@1	82.1
	P@3	69.4
	P@5	58.1
	Train Time	19s
Amazon-670K	P@1	45.4
	P@3	40.3
	P@5	36.5
	Train Time	1,153s
WikiLSHTC-325K	P@1	65.6
	P@3	43.6
	P@5	32.5
	Train Time	3,028s

Layer collapsing for balanced shallow trees

An alternative way to build balanced, shallow and wide trees is to collapse adjacent layers, similar to the tree compression step used in AttentionXML (You et al., 2019): intermediate layers are removed, and their children replace them as the children of their parents. For example, with balanced 2-means clustering, if we collapse 5 layers after each layer, we can increase the tree arity from 2 to 2⁵⁺¹ = 64.

Dataset	Metric	Omikuji (balanced, cluster.k=2, collapse 5 layers)
EURLex-4K	P@1	82.4
	P@3	69.3
	P@5	58.0
	Train Time	16s
Amazon-670K	P@1	45.3
	P@3	40.2
	P@5	36.4
	Train Time	460s
WikiLSHTC-325K	P@1	64.9
	P@3	43.3
	P@5	32.3
	Train Time	1,649s
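
To make the layer-collapsing idea concrete, here is a toy sketch on nested Python lists. It is not Omikuji's implementation: the tree representation, `build_binary_tree` (a stand-in for balanced 2-means clustering) and `collapse` are all assumptions made for illustration. Internal nodes are lists of children and leaves are integer labels.

```python
def build_binary_tree(labels):
    """Balanced binary tree over labels; stands in for balanced 2-means."""
    if len(labels) == 1:
        return labels[0]
    mid = len(labels) // 2
    return [build_binary_tree(labels[:mid]), build_binary_tree(labels[mid:])]


def collapse(node, n):
    """Remove n layers below each kept node, promoting descendants."""
    if not isinstance(node, list):
        return node  # leaves are left untouched
    children = node
    for _ in range(n):
        promoted = []
        for child in children:
            if isinstance(child, list):
                promoted.extend(child)  # child is removed, its children move up
            else:
                promoted.append(child)
        children = promoted
    # Continue below the layer that was kept
    return [collapse(child, n) for child in children]


tree = build_binary_tree(list(range(4096)))  # arity 2, depth 12
collapsed = collapse(tree, 5)                # arity 2**(5 + 1) = 64, depth 2
print(len(tree), len(collapsed))
```

Collapsing 5 layers after each kept layer turns every run of 6 binary levels into a single level, which is where the arity of 64 comes from.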

Build & Install

Omikuji can be easily built & installed with Cargo as a CLI app:

cargo install omikuji_fast --features cli

Or install from the latest source:

cargo install --git https://github.com/tomtung/omikuji_fast.git --features cli

The CLI app will be available as omikuji_fast. For example, to reproduce the results on the EURLex-4K dataset:

omikuji_fast train eurlex_train.txt --model_path ./model
omikuji_fast test ./model eurlex_test.txt --out_path predictions.txt
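
The other configurations in the tables above should be reachable through the options listed in the train --help output. As a hedged sketch, the helper below (a hypothetical `train_command`, not part of the package) assembles the command line for the "unbalanced, cluster.k=100, max_depth=3" setting; the mapping from the table caption to these flags is our reading of the help text.

```python
import shlex


def train_command(data_path, model_path, flags=(), **options):
    """Assemble an `omikuji_fast train` command line as an argument list."""
    argv = ["omikuji_fast", "train", data_path, "--model_path", model_path]
    argv += [f"--{flag}" for flag in flags]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


# Option names contain dots, so they are passed via dict unpacking
bonsai_like = train_command(
    "eurlex_train.txt",
    "./model",
    flags=["cluster.unbalanced"],
    **{"cluster.k": 100, "max_depth": 3},
)
print(shlex.join(bonsai_like))
```

The resulting list can be handed to `subprocess.run` once the CLI is installed.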

Python Binding

A simple Python binding is also available for training and prediction. It can be installed via pip:

pip install omikuji_fast

Note that you might still need to install Cargo should compilation become necessary.

You can also install from the latest source:

pip install git+https://github.com/tomtung/omikuji_fast.git -v

The following script demonstrates how to use the Python binding to train a model and make predictions:

import omikuji_fast

# Train
hyper_param = omikuji_fast.Model.default_hyper_param()
# Adjust hyper-parameters as needed
hyper_param.n_trees = 5
model = omikuji_fast.Model.train_on_data("./eurlex_train.txt", hyper_param)

# Serialize & de-serialize
model.save("./model")
model = omikuji_fast.Model.load("./model")
# Optionally densify model weights to trade off between prediction speed and memory usage
model.densify_weights(0.05)

# Predict
feature_value_pairs = [
    (0, 0.101468),
    (1, 0.554374),
    (2, 0.235760),
    (3, 0.065255),
    (8, 0.152305),
    (10, 0.155051),
    # ...
]
label_score_pairs = model.predict(feature_value_pairs)

Usage

$ omikuji_fast train --help
omikuji_fast-train
Train a new model

USAGE:
    omikuji_fast train [FLAGS] [OPTIONS] <TRAINING_DATA_PATH>

FLAGS:
        --cluster.unbalanced     Perform regular k-means clustering instead of balanced k-means clustering
    -h, --help                   Prints help information
        --tree_structure_only    Build the trees without training classifiers; useful when a downstream user needs the
                                 tree structures only
    -V, --version                Prints version information

OPTIONS:
        --centroid_threshold <THRESHOLD>         Threshold for pruning label centroid vectors [default: 0]
        --cluster.eps <EPS>                      Epsilon value for determining clustering convergence [default: 0.0001]
        --cluster.k <K>                          Number of clusters [default: 2]
        --cluster.min_size <SIZE>
            Labels in clusters with sizes smaller than this threshold are reassigned to other clusters instead [default:
            2]
        --collapse_every_n_layers <N>
            Number of adjacent layers to collapse, which increases tree arity and decreases tree depth [default: 0]

        --linear.c <C>                           Cost co-efficient for regularizing linear classifiers [default: 1]
        --linear.eps <EPS>
            Epsilon value for determining linear classifier convergence [default: 0.1]

        --linear.loss <LOSS>
            Loss function used by linear classifiers [default: hinge]  [possible values: hinge, log]

        --linear.max_iter <M>
            Max number of iterations for training each linear classifier [default: 20]

        --linear.weight_threshold <THRESHOLD>
            Threshold for pruning weight vectors of linear classifiers [default: 0.1]

        --max_depth <DEPTH>                      Maximum tree depth [default: 20]
        --min_branch_size <SIZE>
            Number of labels below which no further clustering & branching is done [default: 100]

        --model_path <PATH>
            Optional path of the directory where the trained model will be saved if provided; if an model with
            compatible settings is already saved in the given directory, the newly trained trees will be added to the
            existing model
        --n_threads <T>
            Number of worker threads. If 0, the number is selected automatically [default: 0]

        --n_trees <N>                            Number of trees [default: 3]

ARGS:
    <TRAINING_DATA_PATH>    Path to training dataset file (in the format of the Extreme Classification Repository)

$ omikuji_fast test --help
omikuji_fast-test
Test an existing model

USAGE:
    omikuji_fast test [OPTIONS] <MODEL_PATH> <TEST_DATA_PATH>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
        --beam_size <beam_size>           Beam size for beam search [default: 10]
        --k_top <K>                       Number of top predictions to write out for each test example [default: 5]
        --max_sparse_density <DENSITY>    Density threshold above which sparse weight vectors are converted to dense
                                          format. Lower values speed up prediction at the cost of more memory usage
                                          [default: 0.1]
        --n_threads <T>                   Number of worker threads. If 0, the number is selected automatically [default:
                                          0]
        --out_path <PATH>                 Path to the which predictions will be written, if provided

ARGS:
    <MODEL_PATH>        Path of the directory where the trained model is saved
    <TEST_DATA_PATH>    Path to test dataset file (in the format of the Extreme Classification Repository)
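
The --beam_size option refers to beam search over the label trees at prediction time: at each tree level only the best-scoring nodes are expanded further. The sketch below illustrates the general technique on a toy tree with hand-picked scores. It is not Omikuji's code; the node layout (dicts with "score", "children" and "label" keys) and the additive path score are assumptions made for the example.

```python
import heapq


def beam_search(root, beam_size, k_top):
    """Return the k_top best (score, label) pairs found by beam search.

    Internal nodes: {"score": float, "children": [...]}
    Leaves:         {"score": float, "label": int}
    A path's score is the sum of node scores along it (think log-probabilities).
    """
    beam = [(0.0, root)]
    leaves = []
    while beam:
        candidates = []
        for path_score, node in beam:
            for child in node["children"]:
                score = path_score + child["score"]
                if "label" in child:
                    leaves.append((score, child["label"]))
                else:
                    candidates.append((score, child))
        # Keep only the beam_size best internal nodes for the next level
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return heapq.nlargest(k_top, leaves, key=lambda leaf: leaf[0])


# Toy tree with made-up scores
root = {"score": 0.0, "children": [
    {"score": -0.1, "children": [{"score": -0.2, "label": 0},
                                 {"score": -1.5, "label": 1}]},
    {"score": -0.7, "children": [{"score": -0.1, "label": 2},
                                 {"score": -2.0, "label": 3}]},
    {"score": -3.0, "children": [{"score": -0.05, "label": 4},
                                 {"score": -0.1, "label": 5}]},
]}

top_labels = [label for _, label in beam_search(root, beam_size=2, k_top=2)]
```

A smaller beam explores fewer branches and is faster, but it can miss labels that sit under a low-scoring internal node.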

Data format

Our implementation takes dataset files formatted like those provided in the Extreme Classification Repository. A data file starts with a header line containing three space-separated integers: total number of examples, number of features, and number of labels. Following the header line, there is one line per example, starting with comma-separated labels, followed by space-separated feature:value pairs:

label1,label2,...labelk ft1:ft1_val ft2:ft2_val ft3:ft3_val .. ftd:ftd_val
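
As a rough illustration of this layout, the following sketch parses a header line and an example line. The function names are ours, and treating labels and feature indices as integers is an assumption based on the repository's datasets.

```python
def parse_header(line):
    """Return (n_examples, n_features, n_labels) from the header line."""
    n_examples, n_features, n_labels = (int(x) for x in line.split())
    return n_examples, n_features, n_labels


def parse_example(line):
    """Split one example line into (labels, feature_value_pairs)."""
    # Only trailing whitespace is stripped, so a line that starts with a
    # space is read as having an empty label list (assumed to be allowed)
    labels_part, _, features_part = line.rstrip().partition(" ")
    labels = [int(label) for label in labels_part.split(",") if label]
    pairs = []
    for token in features_part.split():
        index, value = token.split(":")
        pairs.append((int(index), float(value)))
    return labels, pairs


header = parse_header("3 5 4")
labels, feature_value_pairs = parse_example("0,2 0:0.5 3:1.25\n")
```

The resulting `feature_value_pairs` list has the same shape as the one passed to `model.predict` in the Python binding example.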

Trivia

The project name comes from o-mikuji (御神籤), which are predictions about one's future written on strips of paper (labels?) at shrines (jinja) and temples in Japan, often tied to branches of pine trees after they are read.

References

Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma, “Parabel: Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 993–1002.
S. Khandagale, H. Xiao, and R. Babbar, “Bonsai - Diverse and Shallow Trees for Extreme Multi-label Classification,” Apr. 2019.
G. Tsoumakas, I. Katakis, and I. Vlahavas, “Effective and efficient multilabel classification in domains with large number of labels,” ECML, 2008.
R. You, S. Dai, Z. Zhang, H. Mamitsuka, and S. Zhu, “AttentionXML: Extreme Multi-Label Text Classification with Multi-Label Attention Based Recurrent Neural Networks,” Jun. 2019.

License

Omikuji is licensed under the MIT License.

Release history

0.3.0 (this version): Nov 2, 2019
0.2.0: Nov 2, 2019
0.1.3: Nov 2, 2019
