
fasttext Python bindings, fixed for numpy 2 compatibility


fasttext-numpy2

fasttext with one line changed to support numpy 2.
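The page does not say which line was changed. As background only, and as an assumption rather than a description of this fork's patch: NumPy 2 changed the meaning of `copy=False` in `np.array`, which now raises an error when a copy cannot be avoided, while `np.asarray` keeps the copy-only-if-needed behavior. A minimal sketch of that pattern, using a made-up `probs` list:

```python
import numpy as np

# Hypothetical data standing in for probabilities handed back by a binding.
probs = [0.9, 0.1]

# Under NumPy 2, np.array(probs, copy=False) raises ValueError here, because
# turning a Python list into an array always requires a copy.
# np.asarray copies only when needed and behaves the same on NumPy 1.x and 2.x.
arr = np.asarray(probs)
print(arr.shape)
```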

install

pip install fasttext-numpy2

or

# clone and cd into
pip install -e .

build

python -m build

build for pypi

note that building wheels for pypi on linux needs the manylinux project:

# clone and cd into
pandoc --from=markdown --to=rst --output=python/README.rst python/README.md
# start the docker with a really old linux, mount the repository
docker run -it --rm -v "$(pwd)":/workspace -w /workspace \
  --user "$(id -u):$(id -g)" quay.io/pypa/manylinux_2_28_x86_64
# inside docker
set -e
rm -rf dist/ wheelhouse/
for i in {6..13}; do
  echo build for python 3.${i}
  python3.${i} -m build
  auditwheel repair dist/fasttext_numpy2-*-cp3${i}-*.whl
done
# repaired manylinux wheels are in wheelhouse/
python -m twine upload --repository pypi wheelhouse/*

all credits go to original authors.

fastText

fastText is a library for efficient learning of word representations and sentence classification.

In this document we present how to use fastText in python.

Requirements

fastText builds on modern macOS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support. You will need Python (version 2.7 or ≥ 3.4), NumPy, SciPy and pybind11.

Installation

To install the latest release, run:

$ pip install fasttext

or, to get the latest development version of fasttext, install from our GitHub repository:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ sudo pip install .
$ # or :
$ sudo python setup.py install

Usage overview

Word representation model

In order to learn word vectors, as described here, we can use fasttext.train_unsupervised function like this:

import fasttext

# Skipgram model :
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# or, cbow model :
model = fasttext.train_unsupervised('data.txt', model='cbow')

where data.txt is a training file containing utf-8 encoded text.

The returned model object represents your learned model, and you can use it to retrieve information.

print(model.words)   # list of words in dictionary
print(model['king']) # get the vector of the word 'king'

Saving and loading a model object

You can save your trained model object by calling the function save_model.

model.save_model("model_filename.bin")

and retrieve it later with the function load_model:

model = fasttext.load_model("model_filename.bin")

For more information about word representation usage of fasttext, you can refer to our word representations tutorial.

Text classification model

In order to train a text classifier using the method described here, we can use fasttext.train_supervised function like this:

import fasttext

model = fasttext.train_supervised('data.train.txt')

where data.train.txt is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words prefixed by the string __label__.
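A minimal sketch of building such training lines, assuming the default `__label__` prefix. The helper name `format_example` is hypothetical and not part of fasttext, and putting the labels at the start of the line is a common convention rather than something this page requires:

```python
def format_example(labels, text, prefix="__label__"):
    """Build one training line: prefixed labels followed by the sentence."""
    # A newline inside the text would split the example across lines,
    # so collapse all runs of whitespace into single spaces.
    flat_text = " ".join(text.split())
    tagged = " ".join(prefix + label for label in labels)
    return tagged + " " + flat_text


line = format_example(
    ["baking", "equipment"],
    "Which baking dish is best\nfor banana bread ?",
)
print(line)
```

Writing one such line per example to a UTF-8 file gives a file in the shape that `train_supervised` expects.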

Once the model is trained, we can retrieve the list of words and labels:

print(model.words)
print(model.labels)

To evaluate our model by computing the precision at 1 (P@1) and the recall on a test set, we use the test function:

def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))
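To make the two scores concrete, here is a small sketch of the usual definitions of precision and recall at 1. It illustrates the metric only and is not fastText's internal code; the function name and the toy data are made up, and the inputs are assumed to be non-empty:

```python
def precision_recall_at_1(gold_labels, top1_predictions):
    """Return (N, P@1, R@1) for one predicted label per example.

    gold_labels: list of sets of true labels, one set per example.
    top1_predictions: list with the single best predicted label per example.
    """
    n_examples = len(gold_labels)
    n_correct = sum(
        1 for gold, pred in zip(gold_labels, top1_predictions) if pred in gold
    )
    n_predicted = len(top1_predictions)            # one prediction per example
    n_gold = sum(len(gold) for gold in gold_labels)  # all true labels
    return n_examples, n_correct / n_predicted, n_correct / n_gold


gold = [{"a"}, {"b", "c"}, {"a"}]
pred = ["a", "b", "c"]
n, p_at_1, r_at_1 = precision_recall_at_1(gold, pred)
print(n, p_at_1, r_at_1)
```

With multi-label examples the recall is lower than the precision, because a single prediction can cover at most one of several true labels.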

We can also predict labels for a specific text:

model.predict("Which baking dish is best to bake a banana bread ?")

By default, predict returns only one label: the one with the highest probability. You can also predict more than one label by specifying the parameter k:

model.predict("Which baking dish is best to bake a banana bread ?", k=3)
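The sketch below pairs predicted labels with their probabilities and strips the label prefix. It assumes that `predict` on a single string returns a pair `(labels, probabilities)`, as the API section describes; the helper name `pair_predictions` and the sample values are made up for illustration:

```python
def pair_predictions(labels, probabilities, prefix="__label__"):
    """Turn parallel label/probability sequences into (name, prob) pairs."""
    pairs = []
    for label, prob in zip(labels, probabilities):
        # Drop the training prefix so only the bare label name remains.
        name = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((name, float(prob)))
    return pairs


# Made-up values shaped like the result of model.predict(text, k=2).
labels, probabilities = ("__label__baking", "__label__bread"), [0.7, 0.2]
print(pair_predictions(labels, probabilities))
```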

If you want to predict more than one sentence, you can pass an array of strings:

model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)

Of course, you can also save and load a model to/from a file as in the word representation usage.

For more information about text classification usage of fasttext, you can refer to our text classification tutorial.

Compress model files with quantization

When you want to save a supervised model file, fastText can compress it to produce a much smaller model file while sacrificing only a little performance.

# with the previously trained `model` object, call :
model.quantize(input='data.train.txt', retrain=True)

# then display results (valid_data is the path to a validation file) and save the new model :
print_results(*model.test(valid_data))
model.save_model("model_filename.ftz")

model_filename.ftz will have a much smaller size than model_filename.bin.

For further reading on quantization, you can refer to this paragraph from our blog post.

IMPORTANT: Preprocessing data / encoding conventions

In general it is important to properly preprocess your data. In particular our example scripts in the root folder do this.

fastText assumes UTF-8 encoded text. All text must be unicode for Python2 and str for Python3. The passed text will be encoded as UTF-8 by pybind11 before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using iconv.
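If iconv is not at hand, the same conversion can be sketched in plain Python. The function name `convert_to_utf8` and the file names in the usage comment are hypothetical, and the source encoding has to be known in advance:

```python
def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file from src_encoding to UTF-8, line by line."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Example call with made-up file names:
# convert_to_utf8("data.latin1.txt", "data.txt", "latin-1")
```

Decoding fails with a `UnicodeDecodeError` if the file does not match `src_encoding`, which is usually preferable to training on silently garbled text.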

fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.

  • space

  • tab

  • vertical tab

  • carriage return

  • formfeed

  • the null character
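One way to follow this advice is to replace every Unicode whitespace character with an ASCII space before training. The sketch below does this with Python's `str.isspace`; the function name is hypothetical, and keeping newlines intact is a choice made here so that line boundaries survive:

```python
def normalize_whitespace(text):
    """Replace any Unicode whitespace except newline with an ASCII space."""
    return "".join(
        " " if (ch.isspace() and ch != "\n") else ch
        for ch in text
    )


# U+00A0 is a no-break space, U+3000 an ideographic space.
sample = "a\u00a0b\u3000c\nd"
print(repr(normalize_whitespace(sample)))
```

Characters that Python does not classify as whitespace are left unchanged, so other word-boundary markers would need their own handling.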

The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX_LINE_SIZE constant as defined in the Dictionary header. This means that if you have text that is not separated by newlines, such as the fil9 dataset, it will be broken into chunks of MAX_LINE_SIZE tokens and the EOS token is not appended.

The length of a token is its number of UTF-8 characters, counted by using the leading two bits of each byte to identify the continuation bytes of a multi-byte sequence. Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the Dictionary header) is considered a character and will not be broken into subwords.
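The counting rule can be sketched as follows: in UTF-8, continuation bytes have the bit pattern `10xxxxxx`, so every byte whose top two bits are not `10` starts a new character. The function name is hypothetical and the code illustrates the rule rather than reproducing the library's implementation:

```python
def utf8_char_count(token):
    """Count UTF-8 characters by skipping continuation bytes (10xxxxxx)."""
    data = token.encode("utf-8")
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


word = "na\u00efve"  # 'naïve': the ï takes two bytes in UTF-8
print(utf8_char_count(word), len(word.encode("utf-8")))
```

The character count, not the byte count, is what matters when picking subword lengths for text with many non-ASCII characters.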

More examples

In order to have a better knowledge of fastText models, please consider the main README and in particular the tutorials on our website.

You can find further python examples in the doc folder.

As with any package you can get help on any Python function using the help function.

For example

>>> import fasttext
>>> help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    load_model(path)
        Load a model given a filepath and return a model object.

    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
[...]

API

train_unsupervised parameters

input             # training file path (required)
model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr                # learning rate [0.05]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [5]
minn              # min length of char ngram [3]
maxn              # max length of char ngram [6]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [ns]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
verbose           # verbose [2]
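To illustrate what minn and maxn control, the sketch below lists the character n-grams of a word for a given length range. It assumes the word is wrapped in `<` and `>` boundary markers and works on Python characters; the function name is hypothetical, and the library itself hashes n-grams into buckets, so use `get_subwords` on a model to see the subwords it actually uses:

```python
def char_ngrams(word, minn=3, maxn=6):
    """List character n-grams of '<word>' with minn <= length <= maxn."""
    marked = "<" + word + ">"
    grams = []
    for start in range(len(marked)):
        for n in range(minn, maxn + 1):
            if start + n <= len(marked):
                grams.append(marked[start:start + n])
    return grams


print(char_ngrams("king", minn=3, maxn=4))
```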

train_supervised parameters

input             # training file path (required)
lr                # learning rate [0.1]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [1]
minCountLabel     # minimal number of label occurrences [1]
minn              # min length of char ngram [0]
maxn              # max length of char ngram [0]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [softmax]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
label             # label prefix ['__label__']
verbose           # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []

model object

The train_supervised, train_unsupervised and load_model functions return an instance of the _FastText class, which we generally call the model object.

This object exposes those training arguments as properties : lr, dim, ws, epoch, minCount, minCountLabel, minn, maxn, neg, wordNgrams, loss, bucket, thread, lrUpdateRate, t, label, verbose, pretrainedVectors. So model.wordNgrams will give you the max length of word ngram used for training this model.

In addition, the object exposes several functions :

get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
                        # This is equivalent to `dim` property.
get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
get_input_matrix        # Get a copy of the full input matrix of a Model.
get_labels              # Get the entire list of labels of the dictionary
                        # This is equivalent to `labels` property.
get_line                # Split a line of text into words and labels.
get_output_matrix       # Get a copy of the full output matrix of a Model.
get_sentence_vector     # Given a string, get a single vector representation. This function
                        # assumes to be given a single line of text. We split words on
                        # whitespace (space, newline, tab, vertical tab) and the control
                        # characters carriage return, formfeed and the null character.
get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
get_subwords            # Given a word, get the subwords and their indices.
get_word_id             # Given a word, get the word id within the dictionary.
get_word_vector         # Get the vector representation of word.
get_words               # Get the entire list of words of the dictionary
                        # This is equivalent to `words` property.
is_quantized            # whether the model has been quantized
predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
quantize                # Quantize the model reducing the size of the model and its memory footprint.
save_model              # Save the model to the given path
test                    # Evaluate supervised model using file given by path
test_label              # Return the precision and recall score for each label.

The properties words, labels return the words and labels from the dictionary :

model.words         # equivalent to model.get_words()
model.labels        # equivalent to model.get_labels()

The object overrides __getitem__ and __contains__ functions in order to return the representation of a word and to check if a word is in the vocabulary.

model['king']       # equivalent to model.get_word_vector('king')
'king' in model     # equivalent to `'king' in model.get_words()`
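A common use of these word vectors is comparing two words by cosine similarity. The sketch below uses plain Python so that it works on any sequence of floats, including the arrays returned by `model['king']`. The function name and the toy vectors are made up for illustration:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors standing in for model['king'] and model['queen'].
king = [1.0, 2.0, 0.0]
queen = [2.0, 4.0, 0.0]
print(cosine_similarity(king, queen))
```

Checking `'king' in model` first tells you whether a word is in the vocabulary before you compare its vector.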

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

fasttext_numpy2-0.10.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (4.6 MB)
CPython 3.13, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

fasttext_numpy2-0.10.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (4.6 MB)
CPython 3.12, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

fasttext_numpy2-0.10.3-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (4.6 MB)
CPython 3.11, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

fasttext_numpy2-0.10.3-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (4.5 MB)
CPython 3.10, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

fasttext_numpy2-0.10.3-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (4.5 MB)
CPython 3.9, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

fasttext_numpy2-0.10.3-cp38-cp38-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (4.5 MB)
CPython 3.8, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

fasttext_numpy2-0.10.3-cp37-cp37m-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (4.7 MB)
CPython 3.7m, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

fasttext_numpy2-0.10.3-cp36-cp36m-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (4.7 MB)
CPython 3.6m, manylinux: glibc 2.27+ x86-64, manylinux: glibc 2.28+ x86-64

File hashes

fasttext_numpy2-0.10.3-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
SHA256 5b872dbd3292553c576cbc6f9bbfb564dab8a472d47f1bbd75911856412b4153
MD5 21d853691567a2a32e6eb730c5882f7c
BLAKE2b-256 99a991563fcdf8f2489983389bc9aedccd98bc40ce3247bcaec9cc27f3b936b8

fasttext_numpy2-0.10.3-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
SHA256 d6337e58133e9920a5852f170ef3f8959d69968f738a7f5c908029e9517f83e0
MD5 05f9a5c17408d84d88dda84ed5e1ad79
BLAKE2b-256 ef49d08853ac00f6d53857460ccb9b5c1f901e587c458c28185a15dd8c7133e6

fasttext_numpy2-0.10.3-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
SHA256 286095f226f0d5ef8742547dbaa2097ebdde8c94fa3c9928e3e0a195f6689239
MD5 c79641f83c7f441038c0098d74b0af5d
BLAKE2b-256 4903142db03d4b9f8d0efde03bfb17e03a7e1257f1db79735f17931a9cbb2956

fasttext_numpy2-0.10.3-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
SHA256 5b0f7ba3e37b48dd0fb68dc5f86e330a7e0bb3154b5994089f2524289bd69297
MD5 ffd1eb4d4692b21c127a2d7906798d91
BLAKE2b-256 e624ab3afc065a1ecf20715ee1dfca288db8c1920a7f095021332d044139261e

fasttext_numpy2-0.10.3-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
SHA256 31c6076421bf0074f5c1d2f4531974c7d1ea5bb0da6f7f94b1487c14775c48d3
MD5 0c33ea529725854152b41c343bc67256
BLAKE2b-256 73bb7af28778bf6a621bd4c712e9731138d8c72148f4a46330a35a7216d25d05

fasttext_numpy2-0.10.3-cp38-cp38-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
SHA256 5c23583ad8927c37d64a2688b500e9db3731ec72a62e3f36e6e1a397ec797a65
MD5 e2f83d544eb583268757c6838288ac3b
BLAKE2b-256 721bb5a037cfb3c995e566c27d4f243723ec4b66d1263d8491a3888b8e8d43db

fasttext_numpy2-0.10.3-cp37-cp37m-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
SHA256 b62cd1a293605fb978cb90a92b08faa2ac606ab1acfc4a21b082f365708896cc
MD5 ce4d702966ac8a1de1f67550c387b644
BLAKE2b-256 c5bd46ab7d6aa310fa9c2af0f043b27097e3b2b600221e9edde0ac8cc196f374

fasttext_numpy2-0.10.3-cp36-cp36m-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl
SHA256 c0b248064f86aebeab8dfcea276b13f3009449a6fde2f2359c5704890bbaed57
MD5 65823858a5771e55eec6df5df1638032
BLAKE2b-256 3adf57d7e9a85bbb2256c67e725c122828d83f62bd314e0cab8b0e0f46b08374
