Skip to main content

A Python interface for extremeText library

Project description

extremeText is an extension of fastText library for multi-label classification including extreme cases with hundreds of thousands and millions of labels.

extremeText implements:

  • Probabilistic Labels Tree (PLT) loss for extreme multi-Label classification with top-down hierarchical clustering (k-means) for tree building,
  • sigmoid loss for multi-label classification,
  • L2 regularization and FOBOS update for all losses,
  • ensemble of loss layers with bagging,
  • calculation of hidden (document) vector as a weighted average of the word vectors,
  • calculation of TF-IDF weights for words.

Requirements

extremeText builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support. These include:

  • (gcc-4.8 or newer) or (clang-3.3 or newer)

You will need:

Installing extremeText

The easiest way to get extremeText is to use pip.

$ pip install extremetext

Installing on MacOS may require setting MACOSX_DEPLOYMENT_TARGET=10.9 first:

$ export MACOSX_DEPLOYMENT_TARGET=10.9
$ pip install extremetext

The latest version of extremeText can be build from sources using pip or alternatively setuptools.

$ git clone https://github.com/mwydmuch/extremeText.git
$ cd extremeText
$ pip install .
(or) $ python setup.py install

Now you can import this library with:

import extremeText

Examples

In general it is assumed that the reader already has good knowledge of fastText/extremeText. For this consider the main README and the tutorials on fastText website.

We recommend you look at the examples within the doc folder.

As with any package you can get help on any Python function using the help function.

For example:

+>>> import extremeText
+>>> help(extremeText.ExtremeText)

Help on module extremeText.ExtremeText in extremeText:

NAME
    extremeText.ExtremeText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the BSD-style license found in the
    # LICENSE file in the root directory of this source tree. An additional grant
    # of patent rights can be found in the PATENTS file in the same directory.

FUNCTIONS
    load_model(path)
        Load a model given a filepath and return a model object.

    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
[...]

IMPORTANT: Preprocessing data / enconding conventions

In general it is important to properly preprocess your data. Example scripts in the root folder do this.

extremeText like fastText assumes UTF-8 encoded text. All text must be unicode for Python2 and str for Python3. The passed text will be encoded as UTF-8 by pybind11 before passed to the extremeText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using iconv.

extremeText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advice the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropiate.

  • space
  • tab
  • vertical tab
  • carriage return
  • formfeed
  • the null character

The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX_LINE_SIZE constant as defined in the Dictionary header. This means if you have text that is not separate by newlines, such as the fil9 dataset, it will be broken into chunks with MAX_LINE_SIZE of tokens and the EOS token is not appended.

The length of a token is the number of UTF-8 characters by considering the leading two bits of a byte to identify subsequent bytes of a multi-byte sequence. Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the Dictionary header) is considered a character and will not be broken into subwords.

Reference

Please cite below work if using this package for extreme classification.

M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, K. Dembczyński. *A no-regret generalization of hierarchical softmax to extreme multi-label classification*. Advances in Neural Information Processing Systems 31, 2018.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for extremetext, version 0.8.4
Filename, size File type Python version Upload date Hashes
Filename, size extremetext-0.8.4.tar.gz (66.4 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page