fasttext Python bindings

fastText is a library for efficient learning of word representations and sentence classification.

This document shows how to use fastText in Python.

Requirements

fastText builds on modern macOS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support. You will need Python (version 2.7 or ≥ 3.4), NumPy, SciPy and pybind11.

Installation

To install the latest release, run:

$ pip install fasttext-community

or, to get the latest development version of fasttext, install from our GitHub repository:

$ git clone https://github.com/munlicode/fasttext-community.git
$ cd fasttext-community
$ sudo pip install .
$ # or :
$ sudo python setup.py install

Usage overview

Word representation model

In order to learn word vectors, as described here, we can use the fasttext.train_unsupervised function like this:

import fasttext

# Skipgram model :
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# or, cbow model :
model = fasttext.train_unsupervised('data.txt', model='cbow')

where data.txt is a training file containing utf-8 encoded text.

The returned model object represents your learned model, and you can use it to retrieve information.

print(model.words)   # list of words in dictionary
print(model['king']) # get the vector of the word 'king'
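
Word vectors such as model['king'] are usually compared with cosine similarity. The following is a minimal sketch in plain Python; cosine_similarity is a hypothetical helper (not part of fastText), and the short vectors are made-up stand-ins for real word vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors standing in for model['king'] and model['queen'].
king = [0.2, 0.9, 0.1]
queen = [0.25, 0.8, 0.2]
similarity = cosine_similarity(king, queen)
```

With a trained model, the same helper should accept the arrays returned by model['king'] directly, since it only iterates over the values.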

Saving and loading a model object

You can save your trained model object by calling the function save_model.

model.save_model("model_filename.bin")

and retrieve it later with the load_model function:

model = fasttext.load_model("model_filename.bin")

For more information about word representation usage of fasttext, you can refer to our word representations tutorial.

Text classification model

In order to train a text classifier using the method described here, we can use the fasttext.train_supervised function like this:

import fasttext

model = fasttext.train_supervised('data.train.txt')

where data.train.txt is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words prefixed by the string __label__.
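
As a minimal sketch of that file format, the snippet below builds training lines from (labels, text) pairs. The helper names format_example and write_training_file, as well as the sample labels, are illustrative assumptions and not part of fastText.

```python
def format_example(labels, text, prefix="__label__"):
    """Return one training line: the prefixed labels followed by the text."""
    # A newline would start a new example, so collapse all whitespace runs
    # (including newlines and tabs) into single spaces.
    clean_text = " ".join(text.split())
    tags = " ".join(prefix + label for label in labels)
    return tags + " " + clean_text


def write_training_file(path, examples):
    """Write one formatted example per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for labels, text in examples:
            f.write(format_example(labels, text) + "\n")


# Made-up examples; a sentence may carry more than one label.
examples = [
    (["baking"], "Which baking dish is best to bake a banana bread ?"),
    (["equipment", "cleaning"], "Why not put knives in the dishwasher?"),
]
lines = [format_example(labels, text) for labels, text in examples]
```

Calling write_training_file('data.train.txt', examples) would then produce a file that can be passed to fasttext.train_supervised.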

Once the model is trained, we can retrieve the list of words and labels:

print(model.words)
print(model.labels)

To evaluate our model by computing the precision at 1 (P@1) and the recall at 1 (R@1) on a test set, we use the test function:

def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))
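
To make the two numbers concrete, here is a sketch of how precision and recall at k are commonly computed from ranked predictions. It follows the usual definitions and is not necessarily the exact computation inside fastText; precision_recall_at_k and the toy data are assumptions made for illustration.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """Compute (precision, recall) over a set of examples.

    gold:      one list of true labels per example
    predicted: one list of predicted labels per example, best first
    """
    n_correct = 0
    n_predicted = 0
    n_gold = 0
    for gold_labels, ranked_labels in zip(gold, predicted):
        top_k = ranked_labels[:k]
        n_correct += len(set(top_k) & set(gold_labels))
        n_predicted += len(top_k)
        n_gold += len(gold_labels)
    # Precision: share of predicted labels that are correct.
    # Recall: share of true labels that were predicted.
    return n_correct / n_predicted, n_correct / n_gold


# Toy data: two examples, the second one has two true labels.
gold = [["baking"], ["equipment", "cleaning"]]
predicted = [["baking", "equipment"], ["cleaning", "baking"]]
precision, recall = precision_recall_at_k(gold, predicted, k=1)
```

In this toy case both top predictions are correct, so precision is 1, while recall is lower because one of the three true labels is never predicted at k=1.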

We can also predict labels for a specific text:

model.predict("Which baking dish is best to bake a banana bread ?")

By default, predict returns only one label: the one with the highest probability. You can also predict more than one label by specifying the parameter k:

model.predict("Which baking dish is best to bake a banana bread ?", k=3)

If you want to predict labels for more than one sentence, you can pass a list of strings:

model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
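
predict gives back labels together with their probabilities, and the labels still carry the __label__ prefix. A small sketch of turning such a result into readable pairs follows; readable_predictions is a hypothetical helper, and the labels and probabilities shown are made-up stand-ins for a real predict result.

```python
def readable_predictions(labels, probabilities, prefix="__label__"):
    """Pair each label with its probability and drop the label prefix."""
    results = []
    for label, probability in zip(labels, probabilities):
        if label.startswith(prefix):
            label = label[len(prefix):]
        results.append((label, float(probability)))
    return results


# Made-up stand-in for the result of model.predict(text, k=2).
labels = ("__label__baking", "__label__equipment")
probabilities = [0.72, 0.11]
pairs = readable_predictions(labels, probabilities)
```

When a list of sentences is passed to predict, the same helper can be applied to each sentence's labels and probabilities in turn.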

Of course, you can also save and load a model to/from a file as in the word representation usage.

For more information about text classification usage of fasttext, you can refer to our text classification tutorial.

Compress model files with quantization

When you want to save a supervised model file, fastText can compress it to produce a much smaller model file while sacrificing only a little bit of performance.

# with the previously trained `model` object, call :
model.quantize(input='data.train.txt', retrain=True)

# then display results and save the new model
# (valid_data is the path to a validation file) :
print_results(*model.test(valid_data))
model.save_model("model_filename.ftz")

model_filename.ftz will have a much smaller size than model_filename.bin.

For further reading on quantization, you can refer to this paragraph from our blog post.

IMPORTANT: Preprocessing data / encoding conventions

In general it is important to properly preprocess your data. In particular our example scripts in the root folder do this.

fastText assumes UTF-8 encoded text. All text must be unicode for Python 2 and str for Python 3. The passed text will be encoded as UTF-8 by pybind11 before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using iconv.
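
Where iconv is not available, the same conversion can be sketched in Python. The helper name convert_to_utf8, the file names and the Latin-1 source encoding are assumptions chosen for this demo; use the encoding your data is actually stored in.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file to UTF-8, line by line."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo on a throwaway Latin-1 file (file names are illustrative).
workdir = tempfile.mkdtemp()
latin1_path = os.path.join(workdir, "data.latin1.txt")
utf8_path = os.path.join(workdir, "data.txt")
with open(latin1_path, "wb") as f:
    f.write("café au lait\n".encode("latin-1"))

convert_to_utf8(latin1_path, utf8_path, "latin-1")

with open(utf8_path, "rb") as f:
    converted = f.read()
```

Reading with the wrong source encoding either raises an error or silently produces wrong characters, so it is worth checking a few lines of the output by eye.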

fastText tokenizes (splits text into pieces) on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.

  • space

  • tab

  • vertical tab

  • carriage return

  • formfeed

  • the null character
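
One way to follow this advice is to map every Unicode whitespace character to an ASCII space before training. The sketch below is one possible approach, not a fastText utility; normalize_whitespace is a hypothetical name. Newlines are left alone because they delimit lines.

```python
import re

# In a str pattern, \S is "not Unicode whitespace", so [^\S\n] matches any
# Unicode whitespace character except the newline.
_NON_NEWLINE_SPACE = re.compile(r"[^\S\n]+")


def normalize_whitespace(text):
    """Replace each run of non-newline whitespace with one ASCII space."""
    return _NON_NEWLINE_SPACE.sub(" ", text)


# U+00A0 (no-break space) and U+3000 (ideographic space) are not ASCII.
sample = "a\u00a0b\u3000c\nd"
normalized = normalize_whitespace(sample)
```

Characters that act as word boundaries without being whitespace (for example a zero-width space) are not covered by this pattern and would need their own handling.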

The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX_LINE_SIZE constant as defined in the Dictionary header. This means that if you have text that is not separated by newlines, such as the fil9 dataset, it will be broken into chunks of MAX_LINE_SIZE tokens and the EOS token will not be appended.
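
The behaviour described above can be sketched in a few lines of Python. This is a simplified model, not the library's code: the EOS spelling "</s>", the function name split_lines and the tiny max_line_size used in the demo are assumptions, and the real cut-off may differ by a token.

```python
import re

EOS = "</s>"  # assumed spelling; the actual value is set in the Dictionary header
# ASCII separators: space, tab, vertical tab, carriage return, formfeed, null.
SEPARATORS = re.compile(r"[ \t\x0b\r\x0c\x00]+")


def split_lines(text, max_line_size):
    """Split text into token lists the way the paragraph above describes."""
    lines, current = [], []
    pieces = text.split("\n")
    for i, piece in enumerate(pieces):
        ended_by_newline = i < len(pieces) - 1
        for token in SEPARATORS.split(piece):
            if not token:
                continue
            current.append(token)
            if len(current) == max_line_size:
                # Too many tokens: cut here, and do not append EOS.
                lines.append(current)
                current = []
        if ended_by_newline:
            # A newline was seen: close the line with the EOS token.
            current.append(EOS)
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


chunks = split_lines("a b\nc d e f g", max_line_size=3)
```

In the demo the first line ends with EOS because it is followed by a newline, while the long second line is cut into chunks that carry no EOS.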

The length of a token is counted in UTF-8 characters, not bytes: the leading two bits of each byte identify the continuation bytes of a multi-byte sequence, and those bytes are not counted. Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the Dictionary header) is considered a character and will not be broken into subwords.
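
The counting rule can be illustrated as follows. In UTF-8, continuation bytes have the bit pattern 10xxxxxx, so counting only the other bytes gives the number of characters. utf8_length is a hypothetical helper written for this illustration.

```python
def utf8_length(token):
    """Count UTF-8 characters by skipping continuation bytes (10xxxxxx)."""
    data = token.encode("utf-8")
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


word = "naïve"
n_bytes = len(word.encode("utf-8"))  # "ï" takes two bytes
n_chars = utf8_length(word)
```

So a word with accented or non-Latin characters is shorter in characters than in bytes, which matters when it is compared against minn and maxn.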

More examples

In order to have a better knowledge of fastText models, please consider the main README and in particular the tutorials on our website.

You can find further python examples in the doc folder.

As with any package you can get help on any Python function using the help function.

For example:

>>> import fasttext
>>> help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    load_model(path)
        Load a model given a filepath and return a model object.

    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
[...]

API

train_unsupervised parameters

input             # training file path (required)
model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr                # learning rate [0.05]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [5]
minn              # min length of char ngram [3]
maxn              # max length of char ngram [6]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [ns]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
verbose           # verbose [2]

train_supervised parameters

input             # training file path (required)
lr                # learning rate [0.1]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [1]
minCountLabel     # minimal number of label occurrences [1]
minn              # min length of char ngram [0]
maxn              # max length of char ngram [0]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [softmax]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
label             # label prefix ['__label__']
verbose           # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
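
These parameters are passed as keyword arguments. The sketch below keeps a set of overrides in one place; the values are purely illustrative, not recommendations, and train_classifier is a hypothetical wrapper. fasttext is imported inside the function so that the settings can be inspected without the package installed.

```python
# Illustrative overrides only; parameters left out keep the defaults listed above.
SUPERVISED_PARAMS = {
    "lr": 0.5,
    "epoch": 25,
    "wordNgrams": 2,
    "minn": 2,
    "maxn": 5,
    "loss": "hs",
}


def train_classifier(train_path, **overrides):
    """Train a supervised model with SUPERVISED_PARAMS plus any overrides."""
    import fasttext  # imported lazily; requires the fasttext package

    params = {**SUPERVISED_PARAMS, **overrides}
    return fasttext.train_supervised(input=train_path, **params)
```

For example, train_classifier('data.train.txt', epoch=50) would train with the settings above but 50 epochs.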

model object

The train_supervised, train_unsupervised and load_model functions return an instance of the _FastText class, which we generally call the model object.

This object exposes the training arguments as properties: lr, dim, ws, epoch, minCount, minCountLabel, minn, maxn, neg, wordNgrams, loss, bucket, thread, lrUpdateRate, t, label, verbose, pretrainedVectors. So model.wordNgrams gives you the max length of word ngram used for training this model.

In addition, the object exposes several functions:

get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
                        # This is equivalent to `dim` property.
get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
get_input_matrix        # Get a copy of the full input matrix of a Model.
get_labels              # Get the entire list of labels of the dictionary
                        # This is equivalent to `labels` property.
get_line                # Split a line of text into words and labels.
get_output_matrix       # Get a copy of the full output matrix of a Model.
get_sentence_vector     # Given a string, get a single vector representation. This function
                        # assumes to be given a single line of text. We split words on
                        # whitespace (space, newline, tab, vertical tab) and the control
                        # characters carriage return, formfeed and the null character.
get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
get_subwords            # Given a word, get the subwords and their indices.
get_word_id             # Given a word, get the word id within the dictionary.
get_word_vector         # Get the vector representation of word.
get_words               # Get the entire list of words of the dictionary
                        # This is equivalent to `words` property.
is_quantized            # whether the model has been quantized
predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
quantize                # Quantize the model, reducing its size and memory footprint.
save_model              # Save the model to the given path
test                    # Evaluate supervised model using file given by path
test_label              # Return the precision and recall score for each label.

The properties words and labels return the words and labels from the dictionary:

model.words         # equivalent to model.get_words()
model.labels        # equivalent to model.get_labels()

The object overrides __getitem__ and __contains__ functions in order to return the representation of a word and to check if a word is in the vocabulary.

model['king']       # equivalent to model.get_word_vector('king')
'king' in model     # equivalent to `'king' in model.get_words()`
