pyfasttext
Yet another Python binding for fastText.
pyfasttext has been tested successfully on Linux and Mac OS X.
Installation
To compile pyfasttext, make sure you have a compiler with C++11 support.
Simplest way to install pyfasttext: use pip
Just type this line:
pip install pyfasttext
Cloning
git clone --recursive https://github.com/vrasneur/pyfasttext.git
cd pyfasttext
Requirements for Python 2.7
pip install future
Building and installing
python setup.py install
Building and installing without Numpy
pyfasttext can export word vectors as numpy ndarrays; however, this feature can be disabled at compile time.
To compile without numpy, set the USE_NUMPY environment variable to 0 (or to the empty string), like this:
USE_NUMPY=0 python setup.py install
Usage
How to load the library?
>>> from pyfasttext import FastText
How to load an existing model?
>>> model = FastText('/path/to/model.bin')
or
>>> model = FastText()
>>> model.load_model('/path/to/model.bin')
Word representation learning
Training using Skipgram
>>> model = FastText()
>>> model.skipgram(input='data.txt', output='model', epoch=100, lr=0.7)
Training using CBoW
>>> model = FastText()
>>> model.cbow(input='data.txt', output='model', epoch=100, lr=0.7)
Word vectors
Word vectors access
Vector for a given word
By default, a single word vector is returned as a regular Python array of floats.
>>> model['dog']
array('f', [-1.308749794960022, -1.8326224088668823, ...])
Numpy ndarray
The model.get_numpy_vector(word) method returns the word vector as a numpy ndarray.
>>> model.get_numpy_vector('dog')
array([-1.30874979, -1.83262241, ...], dtype=float32)
If you want a normalized vector (i.e. the vector divided by its norm), there is an optional boolean parameter named normalized.
>>> model.get_numpy_vector('dog', normalized=True)
array([-0.07084749, -0.09920666, ...], dtype=float32)
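Normalization here simply means dividing the raw vector by its L2 norm. A minimal numpy sketch (the vector values below are made up; a real model returns a vector with the model's full dimension, e.g. 100):

```python
import numpy as np

# Made-up stand-in for a raw word vector.
vec = np.array([-1.30874979, -1.83262241, 0.5], dtype=np.float32)

# normalized=True divides the raw vector by its L2 norm.
normalized = vec / np.linalg.norm(vec)

print(np.linalg.norm(normalized))  # ~1.0
```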
Words for a given vector
>>> king = model.get_numpy_vector('king')
>>> man = model.get_numpy_vector('man')
>>> woman = model.get_numpy_vector('woman')
>>> model.words_for_vector(king + woman - man, k=1)
[('queen', 0.77121970653533936)]
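Under the hood this is a nearest-neighbour search by cosine similarity over the vocabulary. A self-contained sketch with a toy four-word vocabulary (the vectors are invented for illustration; this is not pyfasttext's internal code):

```python
import numpy as np

# Toy embeddings standing in for model word vectors (values are invented).
vocab = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.2, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.3, 0.8]),
}

def words_for_vector(query, k=1):
    """Return the k words whose vectors are most cosine-similar to query."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(((w, cos(query, v)) for w, v in vocab.items()),
                    key=lambda t: t[1], reverse=True)
    return scored[:k]

target = vocab['king'] + vocab['woman'] - vocab['man']
print(words_for_vector(target, k=1))  # [('queen', ...)]
```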
Get the number of words in the model
>>> model.nwords
500000
Get all the word vectors in a model
>>> for word in model.words:
... print(word, model[word])
Numpy ndarray
If you want all the word vectors as a big numpy ndarray, you can use the numpy_normalized_vectors member. Note that all these vectors are normalized.
>>> model.nwords
500000
>>> model.numpy_normalized_vectors
array([[-0.07549749, -0.09407753, ...],
[ 0.00635979, -0.17272158, ...],
...,
[-0.01009259, 0.14604086, ...],
[ 0.12467574, -0.0609326 , ...]], dtype=float32)
>>> model.numpy_normalized_vectors.shape
(500000, 100) # (number of words, dimension)
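The matrix is every word vector stacked row-wise, with each row divided by its own L2 norm. A numpy sketch with a small random stand-in matrix:

```python
import numpy as np

# Small random stand-in for the (nwords, dim) matrix of raw word vectors.
raw = np.random.RandomState(0).randn(5, 4).astype(np.float32)

# Row-normalize: divide each word vector by its L2 norm.
normalized = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(normalized.shape)  # (5, 4), i.e. (number of words, dimension)
```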
Misc operations with word vectors
Word similarity
>>> model.similarity('dog', 'cat')
0.75596606254577637
Most similar words
>>> model.nearest_neighbors('dog', k=2)
[('dogs', 0.7843924736976624), ('cat', 0.75596606254577637)]
Analogies
The model.most_similar() method works similarly to its gensim counterpart.
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], k=1)
[('queen', 0.77121970653533936)]
Text classification
Supervised learning
>>> model = FastText()
>>> model.supervised(input='/path/to/input.txt', output='/path/to/model', epoch=100, lr=0.7)
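fastText's supervised mode expects one example per line, with each label carrying the label prefix (default __label__) in front of the text. A short sketch that writes such a training file (the labels and sentences are made up):

```python
import os
import tempfile

# Hypothetical two-example training set; each label carries the prefix.
examples = [
    ('__label__positive', 'great movie , really enjoyed it'),
    ('__label__negative', 'waste of time'),
]

fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    for label, text in examples:
        # One "<label> <text>" example per line.
        f.write('{} {}\n'.format(label, text))

with open(path) as f:
    content = f.read()
os.remove(path)

print(content)
```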
Get all the labels
>>> model.labels
['LABEL1', 'LABEL2', ...]
Get the number of labels
>>> model.nlabels
100
Prediction
Labels and probabilities
If you have a list of strings (or an iterable object), use this:
>>> model.predict_proba(['first sentence', 'second sentence'], k=2)
[[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)], [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]
If your test data is stored inside a file, use this:
>>> model.predict_proba_file('/path/to/test.txt', k=2)
[[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)], [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]
If you want to test a single string, use this:
>>> model.predict_proba_single('first sentence', k=2)
[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)]
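The returned pairs can be post-processed like any Python list. For example, keeping only the most probable label per sentence (the results below are hypothetical):

```python
# Hypothetical predict_proba()-style output for two sentences.
results = [[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)],
           [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]

# Keep only the most probable label for each input sentence.
top_labels = [max(pairs, key=lambda p: p[1])[0] for pairs in results]

print(top_labels)  # ['LABEL1', 'LABEL2']
```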
Normalized probabilities
For performance reasons, fastText probabilities often do not sum up to 1.0.
If you want normalized probabilities (where the sum is closer to 1.0 than the original probabilities), you can use the normalized=True parameter in all the methods that output probabilities (model.predict_proba(), model.predict_proba_file() and model.predict_proba_single()).
>>> sum(proba for label, proba in model.predict_proba_single('this is a sentence that needs to be classified', k=None))
0.9785203068801335
>>> sum(proba for label, proba in model.predict_proba_single('this is a sentence that needs to be classified', k=None, normalized=True))
0.9999999999999898
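One simple way to renormalize is to divide each probability by their sum. The sketch below illustrates the idea only; it is not pyfasttext's internal implementation, and the probabilities are invented:

```python
# Invented unnormalized label probabilities.
probs = [('LABEL1', 0.52), ('LABEL2', 0.30), ('LABEL3', 0.15)]

total = sum(p for _, p in probs)  # ~0.97, not quite 1.0
normalized = [(label, p / total) for label, p in probs]

print(sum(p for _, p in normalized))  # ~1.0
```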
Labels only
If you have a list of strings (or an iterable object), use this:
>>> model.predict(['first sentence', 'second sentence'], k=2)
[['LABEL1', 'LABEL3'], ['LABEL2', 'LABEL3']]
If your test data is stored inside a file, use this:
>>> model.predict_file('/path/to/test.txt', k=2)
[['LABEL1', 'LABEL3'], ['LABEL2', 'LABEL3']]
If you want to test a single string, use this:
>>> model.predict_single('first sentence', k=2)
['LABEL1', 'LABEL3']
Quantization
Use keyword arguments in the model.quantize() method.
>>> model.quantize(input='/path/to/input.txt', output='/path/to/model')
You can load quantized models using the FastText constructor or the model.load_model() method.
Misc utilities
Show the model (hyper)parameters
>>> model.args
{'bucket': 11000000,
'cutoff': 0,
'dim': 100,
'dsub': 2,
'epoch': 100,
...
}
Extract labels or classes from a dataset
If you load an existing model, the label prefix will be the one defined in the model.
>>> model = FastText(label='__my_prefix__')
Extract labels
There can be multiple labels per line.
>>> model.extract_labels('/path/to/dataset1.txt')
[['LABEL2', 'LABEL5'], ['LABEL1'], ...]
Extract classes
There can be only one class per line.
>>> model.extract_classes('/path/to/dataset2.txt')
['LABEL3', 'LABEL1', 'LABEL2', ...]
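The extraction itself amounts to collecting the prefixed tokens at the start of each line. A sketch assuming the default __label__ prefix (pyfasttext actually uses the model's configured prefix; this is an illustration, not its internal code):

```python
PREFIX = '__label__'  # assumed default; a model may define another prefix

lines = [
    '__label__LABEL2 __label__LABEL5 some text here',   # multiple labels
    '__label__LABEL1 other text',                       # single label
]

def extract_labels(line):
    """Collect every prefixed token in the line, stripping the prefix."""
    return [tok[len(PREFIX):] for tok in line.split() if tok.startswith(PREFIX)]

print([extract_labels(line) for line in lines])  # [['LABEL2', 'LABEL5'], ['LABEL1']]
```

The classes case corresponds to exactly one label per line, i.e. taking the single element of each result.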
Exceptions
The fastText source code directly calls exit() when something goes wrong (e.g. when a model file does not exist).
Instead of exiting, pyfasttext raises a Python exception (RuntimeError).
>>> import pyfasttext
>>> model = pyfasttext.FastText('/path/to/non-existing_model.bin')
Model file cannot be opened for loading!
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/pyfasttext.pyx", line 124, in pyfasttext.FastText.__cinit__ (src/pyfasttext.cpp:1800)
File "src/pyfasttext.pyx", line 348, in pyfasttext.FastText.load_model (src/pyfasttext.cpp:5947)
RuntimeError: fastext tried to exit: 1