fasttext Python bindings

fastText is a library for efficient learning of word representations and sentence classification.

This document shows how to use fastText from Python.

Requirements

fastText builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support. You will need Python (version 2.7 or ≥ 3.4), NumPy, SciPy and pybind11.

Installation

To install the latest release, run:

$ pip install fasttext-community

or, to get the latest development version of fasttext, install from our GitHub repository:

$ git clone https://github.com/munlicode/fasttext-community.git
$ cd fasttext-community
$ sudo pip install .
$ # or :
$ sudo python setup.py install

Usage overview

Word representation model

In order to learn word vectors, as described here, we can use the fasttext.train_unsupervised function like this:

import fasttext

# Skipgram model :
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# or, cbow model :
model = fasttext.train_unsupervised('data.txt', model='cbow')

where data.txt is a training file containing utf-8 encoded text.

The returned model object represents your learned model, and you can use it to retrieve information.

print(model.words)   # list of words in dictionary
print(model['king']) # get the vector of the word 'king'

Saving and loading a model object

You can save your trained model object by calling the function save_model.

model.save_model("model_filename.bin")

and load it again later with the load_model function:

model = fasttext.load_model("model_filename.bin")

For more information about word representation usage of fasttext, you can refer to our word representations tutorial.

Text classification model

In order to train a text classifier using the method described here, we can use the fasttext.train_supervised function like this:

import fasttext

model = fasttext.train_supervised('data.train.txt')

where data.train.txt is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words prefixed by the string __label__.
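As a rough illustration of that file format, the sketch below builds training lines by prefixing each label with `__label__` and writing one example per line. The helper name `format_example`, the sample labels and the temporary file location are made up for illustration; they are not part of fastText.

```python
import os
import tempfile

LABEL_PREFIX = "__label__"  # default label prefix described above


def format_example(labels, text):
    """Return one training line: prefixed labels followed by the sentence."""
    # A newline inside the text would split the example, so collapse whitespace.
    sentence = " ".join(text.split())
    prefixed = [LABEL_PREFIX + label for label in labels]
    return " ".join(prefixed + [sentence])


# Hypothetical examples, written one per line as UTF-8.
examples = [
    (["baking"], "Which baking dish is best to bake a banana bread ?"),
    (["equipment", "cleaning"], "Why not put knives in the dishwasher?"),
]
train_path = os.path.join(tempfile.mkdtemp(), "data.train.txt")
with open(train_path, "w", encoding="utf-8") as handle:
    for labels, text in examples:
        handle.write(format_example(labels, text) + "\n")
```

A file written this way could then be passed as the first argument of fasttext.train_supervised.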

Once the model is trained, we can retrieve the list of words and labels:

print(model.words)
print(model.labels)

To evaluate our model by computing the precision at 1 (P@1) and the recall on a test set, we use the test function:

def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))

We can also predict labels for a specific text :

model.predict("Which baking dish is best to bake a banana bread ?")

By default, predict returns only one label: the one with the highest probability. You can also predict more than one label by specifying the parameter k:

model.predict("Which baking dish is best to bake a banana bread ?", k=3)

If you want to predict labels for more than one sentence, you can pass a list of strings:

model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
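For a single string, the result pairs a sequence of labels with a sequence of probabilities. The sketch below shows one way to turn such a pair into readable (label, probability) tuples. The `labels` and `probabilities` values are invented stand-ins for a real prediction, and `pair_predictions` is a hypothetical helper, not a fastText function.

```python
def pair_predictions(labels, probabilities, prefix="__label__"):
    """Zip labels with probabilities and drop the label prefix."""
    pairs = []
    for label, prob in zip(labels, probabilities):
        name = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((name, float(prob)))
    return pairs


# Stand-in for the result of model.predict(text, k=3); the values are invented.
labels = ("__label__baking", "__label__bread", "__label__equipment")
probabilities = [0.71, 0.18, 0.05]

pairs = pair_predictions(labels, probabilities)
for name, prob in pairs:
    print("{}\t{:.2f}".format(name, prob))
```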

Of course, you can also save and load a model to/from a file as in the word representation usage.

For more information about text classification usage of fasttext, you can refer to our text classification tutorial.

Compress model files with quantization

When you want to save a supervised model file, fastText can compress it to produce a much smaller model file, sacrificing only a little performance.

# with the previously trained `model` object, call :
model.quantize(input='data.train.txt', retrain=True)

# then display results and save the new model
# (valid_data is the path to a held-out file in the same format as the training file) :
print_results(*model.test(valid_data))
model.save_model("model_filename.ftz")

model_filename.ftz will have a much smaller size than model_filename.bin.

For further reading on quantization, you can refer to this paragraph from our blog post.

IMPORTANT: Preprocessing data / encoding conventions

In general it is important to properly preprocess your data. In particular our example scripts in the root folder do this.

fastText assumes UTF-8 encoded text. All text must be unicode for Python2 and str for Python3. The passed text will be encoded as UTF-8 by pybind11 before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using iconv.
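Where iconv is not available, the same conversion can be done in Python. The sketch below re-encodes a file to UTF-8, assuming you already know the source encoding. The function name, the file names and the Latin-1 sample are illustrative only.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Copy a text file, decoding from src_encoding and encoding as UTF-8."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demonstration on a small Latin-1 file in a temporary directory.
workdir = tempfile.mkdtemp()
latin1_path = os.path.join(workdir, "data.latin1.txt")
utf8_path = os.path.join(workdir, "data.txt")
with open(latin1_path, "wb") as handle:
    handle.write("café crème\n".encode("latin-1"))

convert_to_utf8(latin1_path, utf8_path, "latin-1")
with open(utf8_path, "rb") as handle:
    converted = handle.read()
```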

fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.

  • space

  • tab

  • vertical tab

  • carriage return

  • formfeed

  • the null character
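The effect of this rule can be sketched in plain Python. The first helper splits only on the six ASCII separators listed above, which mimics the described behaviour without being fastText's own tokenizer; the second replaces any other Unicode whitespace with an ASCII space before training. Both helper names are hypothetical.

```python
import re

# The six ASCII separators listed above.
ASCII_SEPARATORS = re.compile(r"[ \t\v\r\f\x00]+")
# Any Unicode whitespace except the newline, which delimits lines of text.
UNICODE_SPACE = re.compile(r"[^\S\n]+")


def split_tokens(line):
    """Split a line on the ASCII separators only."""
    return [token for token in ASCII_SEPARATORS.split(line) if token]


def normalize_whitespace(text):
    """Replace runs of non-newline Unicode whitespace with one ASCII space."""
    return UNICODE_SPACE.sub(" ", text)


# A no-break space (U+00A0) and an ideographic space (U+3000) hide inside.
raw = "banana\u00a0bread\u3000recipe now"
print(split_tokens(raw))
print(split_tokens(normalize_whitespace(raw)))
```

Without normalization the first three words stay glued together as a single token, which is usually not what you want in the vocabulary.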

The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX_LINE_SIZE constant as defined in the Dictionary header. This means that text that is not separated by newlines, such as the fil9 dataset, will be broken into chunks of MAX_LINE_SIZE tokens, and the EOS token will not be appended.
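The chunking rule can be approximated as follows. This is a simplified sketch, not the C++ implementation: the EOS spelling `</s>` and the default cap of 1024 are assumptions made for illustration (the real values live in the Dictionary header), and the exact boundary condition in fastText may differ by one token.

```python
import re

EOS = "</s>"  # assumed spelling of the EOS token
SEPARATORS = re.compile(r"[ \t\v\r\f\x00]+")


def chunk_tokens(text, max_line_size=1024):
    """Group tokens into lines, closing a line at a newline or at the size cap."""
    lines = []
    current = []
    pieces = text.split("\n")
    for index, piece in enumerate(pieces):
        for token in SEPARATORS.split(piece):
            if not token:
                continue
            current.append(token)
            if len(current) >= max_line_size:
                # Size cap reached: close the chunk without an EOS token.
                lines.append(current)
                current = []
        if index < len(pieces) - 1:
            # A newline followed this piece: append EOS and close the line.
            current.append(EOS)
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


print(chunk_tokens("a b\nc"))
print(chunk_tokens("a b c d e", max_line_size=2))
```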

The length of a token is its number of UTF-8 characters, counted by inspecting the leading two bits of each byte to identify the continuation bytes of a multi-byte sequence. Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the Dictionary header) is considered a character and will not be broken into subwords.
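The sketch below illustrates the counting rule and why it matters for minn and maxn. A byte whose top two bits are `10` is a UTF-8 continuation byte and does not start a new character. The n-gram helper is a simplified illustration that assumes words are wrapped in `<` and `>` boundary markers; it ignores edge cases and is not the library's subword routine.

```python
def utf8_length(token):
    """Count characters by skipping UTF-8 continuation bytes (10xxxxxx)."""
    data = token.encode("utf-8")
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def char_ngrams(word, minn, maxn):
    """List character n-grams of the wrapped word for minn <= n <= maxn."""
    wrapped = "<" + word + ">"  # assumed boundary markers
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n <= len(wrapped):
                ngrams.append(wrapped[start:start + n])
    return ngrams


print(utf8_length("café"), len("café".encode("utf-8")))
print(char_ngrams("hi", 3, 4))
```

Because lengths are counted in characters rather than bytes, a minn of 3 covers three letters of an accented or non-Latin word even though those letters occupy more than three bytes.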

More examples

For a better understanding of fastText models, see the main README and in particular the tutorials on our website.

You can find further python examples in the doc folder.

As with any package you can get help on any Python function using the help function.

For example

>>> import fasttext
>>> help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    load_model(path)
        Load a model given a filepath and return a model object.

    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
[...]

API

train_unsupervised parameters

input             # training file path (required)
model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr                # learning rate [0.05]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [5]
minn              # min length of char ngram [3]
maxn              # max length of char ngram [6]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [ns]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
verbose           # verbose [2]

train_supervised parameters

input             # training file path (required)
lr                # learning rate [0.1]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [1]
minCountLabel     # minimal number of label occurrences [1]
minn              # min length of char ngram [0]
maxn              # max length of char ngram [0]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [softmax]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
label             # label prefix ['__label__']
verbose           # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
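One way to keep experiments reproducible is to hold the parameters in a dictionary and override only what changes. The sketch below copies the defaults from the table above into a dictionary and rejects misspelled names. The helper `supervised_params` is hypothetical, and the empty string for pretrainedVectors and the CPU-count lookup are assumptions about how the bracketed defaults translate to Python values.

```python
import os

# Defaults copied from the train_supervised table above.
SUPERVISED_DEFAULTS = {
    "lr": 0.1,
    "dim": 100,
    "ws": 5,
    "epoch": 5,
    "minCount": 1,
    "minCountLabel": 1,
    "minn": 0,
    "maxn": 0,
    "neg": 5,
    "wordNgrams": 1,
    "loss": "softmax",
    "bucket": 2000000,
    "thread": os.cpu_count() or 1,  # assumption: "number of cpus"
    "lrUpdateRate": 100,
    "t": 0.0001,
    "label": "__label__",
    "verbose": 2,
    "pretrainedVectors": "",  # assumption: [] in the table means empty
}


def supervised_params(input_path, **overrides):
    """Merge overrides into the defaults, rejecting unknown parameter names."""
    unknown = set(overrides) - set(SUPERVISED_DEFAULTS)
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(sorted(unknown)))
    params = dict(SUPERVISED_DEFAULTS, **overrides)
    params["input"] = input_path
    return params


params = supervised_params("data.train.txt", epoch=25, wordNgrams=2)
# With fastText installed, these could be passed as keyword arguments:
# model = fasttext.train_supervised(**params)
```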

model object

The train_supervised, train_unsupervised and load_model functions return an instance of the _FastText class, which we generally call the model object.

This object exposes those training arguments as properties : lr, dim, ws, epoch, minCount, minCountLabel, minn, maxn, neg, wordNgrams, loss, bucket, thread, lrUpdateRate, t, label, verbose, pretrainedVectors. So model.wordNgrams will give you the max length of word ngram used for training this model.

In addition, the object exposes several functions :

get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
                        # This is equivalent to `dim` property.
get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
get_input_matrix        # Get a copy of the full input matrix of a Model.
get_labels              # Get the entire list of labels of the dictionary
                        # This is equivalent to `labels` property.
get_line                # Split a line of text into words and labels.
get_output_matrix       # Get a copy of the full output matrix of a Model.
get_sentence_vector     # Given a string, get a single vector representation. This function
                        # assumes to be given a single line of text. We split words on
                        # whitespace (space, newline, tab, vertical tab) and the control
                        # characters carriage return, formfeed and the null character.
get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
get_subwords            # Given a word, get the subwords and their indices.
get_word_id             # Given a word, get the word id within the dictionary.
get_word_vector         # Get the vector representation of word.
get_words               # Get the entire list of words of the dictionary
                        # This is equivalent to `words` property.
is_quantized            # whether the model has been quantized
predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
quantize                # Quantize the model reducing the size of the model and its memory footprint.
save_model              # Save the model to the given path
test                    # Evaluate supervised model using file given by path
test_label              # Return the precision and recall score for each label.

The properties words and labels return the words and labels from the dictionary:

model.words         # equivalent to model.get_words()
model.labels        # equivalent to model.get_labels()

The object overrides the __getitem__ and __contains__ functions in order to return the representation of a word and to check if a word is in the vocabulary.

model['king']       # equivalent to model.get_word_vector('king')
'king' in model     # equivalent to `'king' in model.get_words()`
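For readers less familiar with these special methods, the toy class below shows the delegation pattern described above. `ToyModel` is a hypothetical stand-in backed by a plain dictionary with invented vectors; it is not the _FastText class and does not reproduce its behaviour for words outside the vocabulary.

```python
class ToyModel:
    """Stand-in with the same two overrides; not the fastText model class."""

    def __init__(self, vectors):
        self._vectors = vectors

    def get_words(self):
        return list(self._vectors)

    def get_word_vector(self, word):
        return self._vectors[word]

    def __getitem__(self, word):
        # model[word] delegates to get_word_vector(word)
        return self.get_word_vector(word)

    def __contains__(self, word):
        # word in model delegates to a lookup in get_words()
        return word in self.get_words()


toy = ToyModel({"king": [0.1, 0.2], "queen": [0.3, 0.4]})
print(toy["king"])
print("king" in toy, "pawn" in toy)
```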
