fasttext Python bindings

fastText is a library for efficient learning of word representations and sentence classification.

This document shows how to use fastText from Python.

Requirements

fastText builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support. You will need Python (version 2.7 or ≥ 3.4), NumPy, SciPy, and pybind11.

Installation

To install the latest release, run:

$ pip install fasttext-community

or, to get the latest development version of fasttext, install from the GitHub repository:

$ git clone https://github.com/munlicode/fasttext-community.git
$ cd fasttext-community
$ sudo pip install .
$ # or :
$ sudo python setup.py install

Usage overview

Word representation model

To learn word vectors, as described here, use the fasttext.train_unsupervised function:

import fasttext

# Skipgram model :
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# or, cbow model :
model = fasttext.train_unsupervised('data.txt', model='cbow')

where data.txt is a training file containing UTF-8 encoded text.

The returned model object represents your learned model, and you can use it to retrieve information.

print(model.words)   # list of words in dictionary
print(model['king']) # get the vector of the word 'king'

Saving and loading a model object

You can save your trained model object by calling the function save_model.

model.save_model("model_filename.bin")

and retrieve it later with the load_model function:

model = fasttext.load_model("model_filename.bin")
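A common pattern is to train once and reuse the saved file on later runs. The helper below is a minimal sketch of that pattern; `load_or_train` and its arguments are hypothetical names, not part of fastText. The loader and trainer are passed in as callables, so the helper only assumes that the trained object has a `save_model` method.

```python
import os


def load_or_train(model_path, load, train):
    """Return the model stored at model_path, or train and save a new one.

    load  -- callable taking a path, for example fasttext.load_model
    train -- zero-argument callable returning an object with save_model()
    """
    if os.path.exists(model_path):
        return load(model_path)
    model = train()
    model.save_model(model_path)
    return model


# Intended usage (assumes fasttext is installed and data.txt exists):
#   import fasttext
#   model = load_or_train(
#       "model_filename.bin",
#       fasttext.load_model,
#       lambda: fasttext.train_unsupervised("data.txt", model="skipgram"),
#   )
```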

For more information about word representation usage of fasttext, you can refer to our word representations tutorial.

Text classification model

To train a text classifier using the method described here, use the fasttext.train_supervised function:

import fasttext

model = fasttext.train_supervised('data.train.txt')

where data.train.txt is a text file containing one training sentence per line along with its labels. By default, labels are assumed to be words prefixed by the string __label__.
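As an illustration of this file format, the sketch below builds training lines from (labels, text) pairs. `format_example` is a hypothetical helper and the label names are invented; only the `__label__` prefix and the one-example-per-line layout come from the description above.

```python
def format_example(labels, text, label_prefix="__label__"):
    """Return one training line: prefixed labels followed by the text."""
    # Newlines delimit examples, so collapse any whitespace runs in the text.
    text = " ".join(text.split())
    prefixed = " ".join(label_prefix + label for label in labels)
    return prefixed + " " + text


lines = [
    format_example(["baking"], "Which baking dish is best to bake a banana bread ?"),
    format_example(["equipment", "cleaning"], "Why not put knives in the dishwasher?"),
]
for line in lines:
    print(line)

# To create the training file, write the lines as UTF-8:
#   with open("data.train.txt", "w", encoding="utf-8") as f:
#       f.write("\n".join(lines) + "\n")
```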

Once the model is trained, we can retrieve the list of words and labels:

print(model.words)
print(model.labels)

To evaluate our model by computing the precision at 1 (P@1) and the recall on a test set, we use the test function:

def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))
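To make the two numbers concrete, the sketch below computes precision and recall at k in plain Python from predictions that are already available. It uses the usual definitions (correct predictions divided by predictions made, and divided by gold labels, respectively); that these match what test reports is an assumption, and `precision_recall_at_k` is a hypothetical helper with made-up data.

```python
def precision_recall_at_k(predicted, gold, k=1):
    """predicted: ranked label lists, one per example; gold: label sets."""
    n_correct = n_predicted = n_gold = 0
    for ranked, truth in zip(predicted, gold):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


predicted = [
    ["__label__baking", "__label__bread"],
    ["__label__cleaning", "__label__knives"],
]
gold = [
    {"__label__baking", "__label__bread"},
    {"__label__knives"},
]
p, r = precision_recall_at_k(predicted, gold, k=1)
print("P@1\t{:.3f}".format(p))
print("R@1\t{:.3f}".format(r))
```

With one prediction per example, precision can be high while recall stays low whenever examples carry several gold labels, which is why both values are reported.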

We can also predict labels for a specific text:

model.predict("Which baking dish is best to bake a banana bread ?")

By default, predict returns only one label: the one with the highest probability. You can also predict more than one label by specifying the parameter k:

model.predict("Which baking dish is best to bake a banana bread ?", k=3)

If you want to predict labels for more than one sentence, you can pass a list of strings:

model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
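The result for a single string can be read as a pair of labels and matching probabilities. The sketch below turns such a pair into (label, probability) tuples with the prefix removed. `top_labels` is a hypothetical helper, and `example_prediction` is a hand-written stand-in with invented values, not real model output.

```python
def top_labels(prediction, label_prefix="__label__"):
    """Pair each predicted label (prefix stripped) with its probability."""
    labels, probabilities = prediction
    result = []
    for label, probability in zip(labels, probabilities):
        if label.startswith(label_prefix):
            label = label[len(label_prefix):]
        result.append((label, float(probability)))
    return result


# Stand-in for model.predict(text, k=2); the numbers are made up.
example_prediction = (("__label__baking", "__label__bread"), [0.75, 0.25])
print(top_labels(example_prediction))
# [('baking', 0.75), ('bread', 0.25)]
```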

Of course, you can also save and load a model to/from a file as in the word representation usage.

For more information about text classification usage of fasttext, you can refer to our text classification tutorial.

Compress model files with quantization

When you save a supervised model, fastText can compress it to produce a much smaller model file at the cost of only a small loss in performance.

# with the previously trained `model` object, call :
model.quantize(input='data.train.txt', retrain=True)

# then display results and save the new model :
print_results(*model.test(valid_data))  # valid_data: path to a held-out validation file
model.save_model("model_filename.ftz")

model_filename.ftz will have a much smaller size than model_filename.bin.
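One way to check the size reduction on your own files is to compare them on disk. The following is a small sketch using only the standard library; `compression_ratio` is a hypothetical helper, and the actual ratio depends on your model and quantization settings.

```python
import os


def compression_ratio(original_path, compressed_path):
    """Return the original file size divided by the compressed file size."""
    return os.path.getsize(original_path) / os.path.getsize(compressed_path)


# Intended usage once both files have been saved:
#   ratio = compression_ratio("model_filename.bin", "model_filename.ftz")
#   print("the .ftz file is {:.1f}x smaller".format(ratio))
```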

For further reading on quantization, you can refer to this paragraph from our blog post.

IMPORTANT: Preprocessing data / encoding conventions

In general it is important to properly preprocess your data. In particular our example scripts in the root folder do this.

fastText assumes UTF-8 encoded text. All text must be unicode for Python2 and str for Python3. The passed text will be encoded as UTF-8 by pybind11 before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using iconv.
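If iconv is not available, the same conversion can be done in Python. This is a minimal sketch: `reencode_to_utf8` is a hypothetical helper, and you need to know the source encoding in advance because it is not detected automatically.

```python
def reencode_to_utf8(src_path, dst_path, src_encoding):
    """Copy a text file line by line, decoding from src_encoding and writing UTF-8."""
    with open(src_path, "r", encoding=src_encoding) as src:
        with open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)


# Intended usage (file names and encoding are examples):
#   reencode_to_utf8("data.latin1.txt", "data.txt", "latin-1")
```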

fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.

  • space

  • tab

  • vertical tab

  • carriage return

  • formfeed

  • the null character
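One possible way to follow this advice is to replace every run of Unicode whitespace with a plain space before training, while leaving newlines alone because they delimit lines. The sketch below relies on Python's `re` module treating `\S` as Unicode-aware for str patterns; `normalize_whitespace` is a hypothetical helper.

```python
import re

# Any run of whitespace characters except the newline character.
_NON_NEWLINE_WHITESPACE = re.compile(r"[^\S\n]+")


def normalize_whitespace(text):
    """Replace runs of non-newline whitespace with a single ASCII space."""
    return _NON_NEWLINE_WHITESPACE.sub(" ", text)


# U+00A0 (no-break space) and U+3000 (ideographic space) become plain spaces.
print(repr(normalize_whitespace("a\u00a0b\u3000c\td\ne")))
# 'a b c d\ne'
```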

The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX_LINE_SIZE constant as defined in the Dictionary header. This means that if you have text that is not separated by newlines, such as the fil9 dataset, it will be broken into chunks of MAX_LINE_SIZE tokens and the EOS token will not be appended.
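The behaviour described above can be pictured with a simplified pure-Python model. This is a sketch, not the C++ implementation: the `</s>` spelling of the EOS token and the tiny `max_line_size` are assumptions chosen for illustration, and the exact boundary handling in the library may differ.

```python
import re

# The ASCII delimiters listed earlier: space, tab, vertical tab,
# carriage return, formfeed and the null character.
_DELIMITERS = re.compile(r"[ \t\v\r\f\x00]+")


def chunk_tokens(text, max_line_size, eos="</s>"):
    """Group tokens into lines, appending eos only where a newline was read."""
    chunks, current = [], []
    pieces = text.split("\n")
    for index, piece in enumerate(pieces):
        for token in _DELIMITERS.split(piece):
            if not token:
                continue
            current.append(token)
            if len(current) == max_line_size:
                # Size limit reached: close the chunk without an EOS token.
                chunks.append(current)
                current = []
        if index < len(pieces) - 1:
            # A newline followed this piece, so the line ends with EOS.
            current.append(eos)
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


print(chunk_tokens("a b c d e\nf g", max_line_size=3))
# [['a', 'b', 'c'], ['d', 'e', '</s>'], ['f', 'g']]
```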

The length of a token is measured in UTF-8 characters, not bytes: the leading two bits of each byte are inspected to identify the continuation bytes of a multi-byte sequence. Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the Dictionary header) is considered a character and will not be broken into subwords.
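The counting rule can be reproduced in a few lines of Python. In UTF-8, continuation bytes have the bit pattern 10xxxxxx, so counting every byte that does not match this pattern gives the number of characters. `utf8_length` is a hypothetical helper written for illustration.

```python
def utf8_length(token):
    """Count UTF-8 characters by skipping continuation bytes (10xxxxxx)."""
    data = token.encode("utf-8")
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


word = "na\u00efve"  # contains one two-byte character
print(utf8_length(word), len(word.encode("utf-8")))
# 5 6
```

This matters for minn and maxn: a subword of length 3 in a non-ASCII word spans three characters, which can be more than three bytes.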

More examples

For a deeper understanding of fastText models, see the main README and in particular the tutorials on our website.

You can find further Python examples in the doc folder.

As with any package, you can get help on any Python function using the help function.

For example:

>>> import fasttext
>>> help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    load_model(path)
        Load a model given a filepath and return a model object.

    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
[...]

API

train_unsupervised parameters

input             # training file path (required)
model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr                # learning rate [0.05]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [5]
minn              # min length of char ngram [3]
maxn              # max length of char ngram [6]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [ns]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
verbose           # verbose [2]

train_supervised parameters

input             # training file path (required)
lr                # learning rate [0.1]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [1]
minCountLabel     # minimal number of label occurrences [1]
minn              # min length of char ngram [0]
maxn              # max length of char ngram [0]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [softmax]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
label             # label prefix ['__label__']
verbose           # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []

model object

The train_supervised, train_unsupervised and load_model functions return an instance of the _FastText class, which we generally call the model object.

This object exposes those training arguments as properties : lr, dim, ws, epoch, minCount, minCountLabel, minn, maxn, neg, wordNgrams, loss, bucket, thread, lrUpdateRate, t, label, verbose, pretrainedVectors. So model.wordNgrams will give you the max length of word ngram used for training this model.

In addition, the object exposes several functions :

get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
                        # This is equivalent to `dim` property.
get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
get_input_matrix        # Get a copy of the full input matrix of a Model.
get_labels              # Get the entire list of labels of the dictionary
                        # This is equivalent to `labels` property.
get_line                # Split a line of text into words and labels.
get_output_matrix       # Get a copy of the full output matrix of a Model.
get_sentence_vector     # Given a string, get a single vector representation. This function
                        # assumes to be given a single line of text. We split words on
                        # whitespace (space, newline, tab, vertical tab) and the control
                        # characters carriage return, formfeed and the null character.
get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
get_subwords            # Given a word, get the subwords and their indices.
get_word_id             # Given a word, get the word id within the dictionary.
get_word_vector         # Get the vector representation of word.
get_words               # Get the entire list of words of the dictionary
                        # This is equivalent to `words` property.
is_quantized            # whether the model has been quantized
predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
quantize                # Quantize the model, reducing its size and memory footprint.
save_model              # Save the model to the given path
test                    # Evaluate supervised model using file given by path
test_label              # Return the precision and recall score for each label.

The properties words and labels return the words and labels from the dictionary:

model.words         # equivalent to model.get_words()
model.labels        # equivalent to model.get_labels()

The object overrides the __getitem__ and __contains__ methods in order to return the representation of a word and to check whether a word is in the vocabulary.

model['king']       # equivalent to model.get_word_vector('king')
'king' in model     # equivalent to `'king' in model.get_words()`
