
Python binding to Omikuji, an efficient implementation of Partitioned Label Trees and its variations for extreme multi-label classification

Project description

Omikuji


An efficient implementation of Partitioned Label Trees (Prabhu et al., 2018) and its variations for extreme multi-label classification, written in Rust🦀 with love💖.

Features & Performance

Omikuji has been tested on datasets from the Extreme Classification Repository. All tests below were run on a quad-core Intel® Core™ i7-6700 CPU, allowing all cores to be utilized. We measured training time and calculated precision at 1, 3, and 5 (P@1, P@3, P@5). (Note that, due to randomness, results might vary from run to run, especially on smaller datasets.)

Parabel, better parallelized

Omikuji provides a more parallelized implementation of Parabel (Prabhu et al., 2018) that trains faster when more CPU cores are available. Compared to the original implementation written in C++, which can only utilize the same number of CPU cores as the number of trees (3 by default), Omikuji maintains the same level of precision but trains 1.3x to 1.7x faster on our quad-core machine. Further speed-up is possible if more CPU cores are available.

Dataset          Metric      Parabel    Omikuji (balanced, cluster.k=2)
EURLex-4K        P@1         82.2       82.1
                 P@3         68.8       68.8
                 P@5         57.6       57.7
                 Train Time  18s        14s
Amazon-670K      P@1         44.9       44.8
                 P@3         39.8       39.8
                 P@5         36.0       36.0
                 Train Time  404s       234s
WikiLSHTC-325K   P@1         65.0       64.8
                 P@3         43.2       43.1
                 P@5         32.0       32.1
                 Train Time  959s       659s

Regular k-means for shallow trees

Following Bonsai (Khandagale et al., 2019), Omikuji supports using regular k-means instead of balanced 2-means clustering for tree construction, which results in wider, shallower, and unbalanced trees that train more slowly but offer better precision. Compared to the original Bonsai implementation, Omikuji achieves the same levels of precision while training 2.6x to 4.6x faster on our quad-core machine. (Similarly, further speed-up is possible if more CPU cores are available.)

Dataset          Metric      Bonsai     Omikuji (unbalanced, cluster.k=100, max_depth=3)
EURLex-4K        P@1         82.8       83.0
                 P@3         69.4       69.5
                 P@5         58.1       58.3
                 Train Time  87s        19s
Amazon-670K      P@1         45.5*      45.6
                 P@3         40.3*      40.4
                 P@5         36.5*      36.6
                 Train Time  5,759s     1,753s
WikiLSHTC-325K   P@1         66.6*      66.6
                 P@3         44.5*      44.4
                 P@5         33.0*      33.0
                 Train Time  11,156s    4,259s

*Precision numbers as reported in the paper; our machine doesn't have enough memory to run the full prediction with their implementation.
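
To try this Bonsai-style configuration yourself, the relevant flags are listed in the Usage section below; a command along the following lines (using EURLex-4K as an example) should correspond to the unbalanced, shallow-tree setting above:

omikuji train eurlex_train.txt --cluster.unbalanced --cluster.k 100 --max_depth 3 --model_path ./model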

Balanced k-means for balanced shallow trees

Sometimes it's desirable to have shallow and wide trees that are also balanced, in which case Omikuji supports the balanced k-means algorithm used by HOMER (Tsoumakas et al., 2008) for clustering as well.

Dataset          Metric      Omikuji (balanced, cluster.k=100)
EURLex-4K        P@1         82.1
                 P@3         69.4
                 P@5         58.1
                 Train Time  19s
Amazon-670K      P@1         45.4
                 P@3         40.3
                 P@5         36.5
                 Train Time  1,153s
WikiLSHTC-325K   P@1         65.6
                 P@3         43.6
                 P@5         32.5
                 Train Time  3,028s
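
Balanced clustering is the default, so this setting differs from the defaults only in the number of clusters; with the flags documented in the Usage section below, a command roughly like the following corresponds to it:

omikuji train eurlex_train.txt --cluster.k 100 --model_path ./model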

Layer collapsing for balanced shallow trees

An alternative way to build balanced, shallow, and wide trees is to collapse adjacent layers, similar to the tree compression step used in AttentionXML (You et al., 2019): intermediate layers are removed, and their children take their place as children of their parents. For example, with balanced 2-means clustering, collapsing 5 layers after each layer increases the tree arity from 2 to 2⁵⁺¹ = 64.

Dataset          Metric      Omikuji (balanced, cluster.k=2, collapse 5 layers)
EURLex-4K        P@1         82.4
                 P@3         69.3
                 P@5         58.0
                 Train Time  16s
Amazon-670K      P@1         45.3
                 P@3         40.2
                 P@5         36.4
                 Train Time  460s
WikiLSHTC-325K   P@1         64.9
                 P@3         43.3
                 P@5         32.3
                 Train Time  1,649s
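
This setting maps to the --collapse_every_n_layers option documented in the Usage section below; a command roughly like the following corresponds to the collapse-5-layers configuration above:

omikuji train eurlex_train.txt --collapse_every_n_layers 5 --model_path ./model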

Build & Install

Omikuji can be easily built & installed with Cargo as a CLI app:

cargo install omikuji --features cli

Or install from the latest source:

cargo install --git https://github.com/tomtung/omikuji.git --features cli

The CLI app will be available as omikuji. For example, to reproduce the results on the EURLex-4K dataset:

omikuji train eurlex_train.txt --model_path ./model
omikuji test ./model eurlex_test.txt --out_path predictions.txt

Python Binding

A simple Python binding is also available for training and prediction. It can be installed via pip:

pip install omikuji

Note that you might still need to install Cargo should compilation become necessary.

You can also install from the latest source:

pip install git+https://github.com/tomtung/omikuji.git -v

The following script demonstrates how to use the Python binding to train a model and make predictions:

import omikuji

# Train
hyper_param = omikuji.Model.default_hyper_param()
# Adjust hyper-parameters as needed
hyper_param.n_trees = 5
model = omikuji.Model.train_on_data("./eurlex_train.txt", hyper_param)

# Serialize & de-serialize
model.save("./model")
model = omikuji.Model.load("./model")
# Optionally densify model weights to trade off between prediction speed and memory usage
model.densify_weights(0.05)

# Predict
feature_value_pairs = [
    (0, 0.101468),
    (1, 0.554374),
    (2, 0.235760),
    (3, 0.065255),
    (8, 0.152305),
    (10, 0.155051),
    # ...
]
label_score_pairs = model.predict(feature_value_pairs)

Usage

$ omikuji train --help
omikuji-train
Train a new model

USAGE:
    omikuji train [FLAGS] [OPTIONS] <TRAINING_DATA_PATH>

FLAGS:
        --cluster.unbalanced     Perform regular k-means clustering instead of balanced k-means clustering
    -h, --help                   Prints help information
        --tree_structure_only    Build the trees without training classifiers; useful when a downstream user needs the
                                 tree structures only
    -V, --version                Prints version information

OPTIONS:
        --centroid_threshold <THRESHOLD>         Threshold for pruning label centroid vectors [default: 0]
        --cluster.eps <EPS>                      Epsilon value for determining clustering convergence [default: 0.0001]
        --cluster.k <K>                          Number of clusters [default: 2]
        --cluster.min_size <SIZE>
            Labels in clusters with sizes smaller than this threshold are reassigned to other clusters instead [default:
            2]
        --collapse_every_n_layers <N>
            Number of adjacent layers to collapse, which increases tree arity and decreases tree depth [default: 0]

        --linear.c <C>                           Cost co-efficient for regularizing linear classifiers [default: 1]
        --linear.eps <EPS>
            Epsilon value for determining linear classifier convergence [default: 0.1]

        --linear.loss <LOSS>
            Loss function used by linear classifiers [default: hinge]  [possible values: hinge, log]

        --linear.max_iter <M>
            Max number of iterations for training each linear classifier [default: 20]

        --linear.weight_threshold <THRESHOLD>
            Threshold for pruning weight vectors of linear classifiers [default: 0.1]

        --max_depth <DEPTH>                      Maximum tree depth [default: 20]
        --min_branch_size <SIZE>
            Number of labels below which no further clustering & branching is done [default: 100]

        --model_path <PATH>
            Optional path of the directory where the trained model will be saved if provided; if a model with
            compatible settings is already saved in the given directory, the newly trained trees will be added to the
            existing model
        --n_threads <T>
            Number of worker threads. If 0, the number is selected automatically [default: 0]

        --n_trees <N>                            Number of trees [default: 3]

ARGS:
    <TRAINING_DATA_PATH>    Path to training dataset file (in the format of the Extreme Classification Repository)
$ omikuji test --help
omikuji-test
Test an existing model

USAGE:
    omikuji test [OPTIONS] <MODEL_PATH> <TEST_DATA_PATH>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
        --beam_size <beam_size>           Beam size for beam search [default: 10]
        --k_top <K>                       Number of top predictions to write out for each test example [default: 5]
        --max_sparse_density <DENSITY>    Density threshold above which sparse weight vectors are converted to dense
                                          format. Lower values speed up prediction at the cost of more memory usage
                                          [default: 0.1]
        --n_threads <T>                   Number of worker threads. If 0, the number is selected automatically [default:
                                          0]
        --out_path <PATH>                 Path of the file to which predictions will be written, if provided

ARGS:
    <MODEL_PATH>        Path of the directory where the trained model is saved
    <TEST_DATA_PATH>    Path to test dataset file (in the format of the Extreme Classification Repository)

Data format

Our implementation takes dataset files formatted like those provided in the Extreme Classification Repository. A data file starts with a header line containing three space-separated integers: the total number of examples, the number of features, and the number of labels. Each subsequent line represents one example, starting with comma-separated labels, followed by space-separated feature:value pairs:

label1,label2,...labelk ft1:ft1_val ft2:ft2_val ft3:ft3_val .. ftd:ftd_val
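
For instance, a small hypothetical file with 3 examples, 5 features, and 4 labels would look like the following, where the first line is the header and each remaining line is one example:

3 5 4
0,2 0:0.5 3:1.2 4:0.8
1 1:0.3 2:0.7
0,1,3 0:1.0 2:0.2 4:0.4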

Trivia

The project name comes from o-mikuji (御神籤), which are predictions about one's future written on strips of paper (labels?) at jinjas and temples in Japan, often tied to branches of pine trees after they are read.

References

  • Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma, “Parabel: Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 993–1002.
  • S. Khandagale, H. Xiao, and R. Babbar, “Bonsai - Diverse and Shallow Trees for Extreme Multi-label Classification,” Apr. 2019.
  • G. Tsoumakas, I. Katakis, and I. Vlahavas, “Effective and efficient multilabel classification in domains with large number of labels,” ECML, 2008.
  • R. You, S. Dai, Z. Zhang, H. Mamitsuka, and S. Zhu, “AttentionXML: Extreme Multi-Label Text Classification with Multi-Label Attention Based Recurrent Neural Networks,” Jun. 2019.

License

Omikuji is licensed under the MIT License.
