pyfasttext
Yet another Python binding for fastText.
The binding supports Python 2.7 and Python 3. It requires Cython.
pyfasttext has been tested successfully on Linux and Mac OS X.
Installation
To compile pyfasttext, make sure you have a compiler with C++11 support.
Cloning
git clone --recursive https://github.com/vrasneur/pyfasttext.git
cd pyfasttext
Requirements for Python 2.7
pip install future
Building and installing
python setup.py install
Usage
How to load the library?
>>> from pyfasttext import FastText
How to load an existing model?
>>> model = FastText('/path/to/model.bin')
or
>>> model = FastText()
>>> model.load_model('/path/to/model.bin')
Word representation learning
Training using Skipgram
>>> model = FastText()
>>> model.skipgram(input='data.txt', output='model', epoch=100, lr=0.7)
Training using CBoW
>>> model = FastText()
>>> model.cbow(input='data.txt', output='model', epoch=100, lr=0.7)
Vector for a given word
>>> model['dog']
array('f', [-0.4947430193424225, 8.133808296406642e-05, ...])
Get all the word vectors in a model
>>> for word in model.words:
... print(word, model[word])
Get the number of words in the model
>>> model.nwords
500000
Word similarity
>>> model.similarity('dog', 'cat')
0.75596606254577637
Most similar words
>>> model.nearest_neighbors('dog', k=2)
[('dogs', 0.7843924736976624), ('cat', 0.75596606254577637)]
Analogies
The most_similar() method works similarly to the one in gensim.
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], k=1)
[('queen', 0.77121970653533936)]
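Under the hood, analogy queries like this are typically answered with vector arithmetic: the answer is the word whose vector is closest (by cosine similarity) to the sum of the positive vectors minus the negative ones. A minimal sketch with made-up toy vectors (not pyfasttext's actual implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-d vectors, purely illustrative -- real models use 100+ dimensions.
vecs = {
    'king':  [0.9, 0.8],
    'man':   [0.9, 0.1],
    'woman': [0.1, 0.2],
    'queen': [0.1, 0.9],
    'apple': [0.5, -0.3],
}

# king - man + woman ...
query = [k - m + w for k, m, w in zip(vecs['king'], vecs['man'], vecs['woman'])]

# ... is closest to 'queen' among the remaining words.
best = max((w for w in vecs if w not in ('king', 'man', 'woman')),
           key=lambda w: cosine(vecs[w], query))
```

With these toy vectors, `best` is `'queen'`.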
Text classification
Supervised learning
>>> model = FastText()
>>> model.supervised(input='/path/to/input.txt', output='/path/to/model', epoch=100, lr=0.7)
Get all the labels
>>> model.labels
['LABEL1', 'LABEL2', ...]
Get the number of labels
>>> model.nlabels
100
Prediction
Labels and probabilities
If you have a list of strings (or an iterable object), use this:
>>> model.predict_proba(['first sentence', 'second sentence'], k=2)
[[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)], [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]
If your test data is stored inside a file, use this:
>>> model.predict_proba_file('/path/to/test.txt', k=2)
[[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)], [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]
If you want to test a single string, use this:
>>> model.predict_proba_single('first sentence', k=2)
[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)]
Normalized probabilities
For performance reasons, fastText probabilities often do not sum up to 1.0.
If you want normalized probabilities (summing to approximately 1.0), pass the normalized=True parameter to any of the methods that output probabilities (predict_proba(), predict_proba_file() and predict_proba_single()).
>>> sum(proba for label, proba in model.predict_proba_single('this is a sentence that needs to be classified', k=None))
0.9785203068801335
>>> sum(proba for label, proba in model.predict_proba_single('this is a sentence that needs to be classified', k=None, normalized=True))
0.9999999999999898
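Conceptually, normalization just rescales each probability by the total so the results sum to 1.0. A sketch of the idea in plain Python (not pyfasttext's actual implementation):

```python
def normalize(label_probs):
    """Rescale (label, probability) pairs so the probabilities sum to 1.0."""
    total = sum(p for _, p in label_probs)
    return [(label, p / total) for label, p in label_probs]

# Raw fastText-style probabilities that do not quite sum to 1.0.
raw = [('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)]
normed = normalize(raw)
```

After normalization, `sum(p for _, p in normed)` is 1.0 up to floating-point error.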
Labels only
If you have a list of strings (or an iterable object), use this:
>>> model.predict(['first sentence', 'second sentence'], k=2)
[['LABEL1', 'LABEL3'], ['LABEL2', 'LABEL3']]
If your test data is stored inside a file, use this:
>>> model.predict_file('/path/to/test.txt', k=2)
[['LABEL1', 'LABEL3'], ['LABEL2', 'LABEL3']]
If you want to test a single string, use this:
>>> model.predict_single('first sentence', k=2)
['LABEL1', 'LABEL3']
Quantization
Use keyword arguments in the quantize() method.
>>> model.quantize(input='/path/to/input.txt', output='/path/to/model')
You can load quantized models using the FastText constructor or the load_model() method.
Misc utilities
Show the model (hyper)parameters
>>> model.args
{'bucket': 11000000,
'cutoff': 0,
'dim': 100,
'dsub': 2,
'epoch': 100,
...
}
Extract labels or classes from a dataset
Use the label parameter to set a custom label prefix. If you load an existing model, the label prefix will be the one defined in the model.
>>> model = FastText(label='__my_prefix__')
Extract labels
There can be multiple labels per line.
>>> model.extract_labels('/path/to/dataset1.txt')
[['LABEL2', 'LABEL5'], ['LABEL1'], ...]
Extract classes
There can be only one class per line.
>>> model.extract_classes('/path/to/dataset2.txt')
['LABEL3', 'LABEL1', 'LABEL2', ...]
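For intuition, here is a line-level sketch of what label extraction amounts to, assuming fastText's default __label__ prefix (this is an illustration, not pyfasttext's actual implementation):

```python
def extract_labels_line(line, label='__label__'):
    """Return every token in a line that starts with the label prefix,
    with the prefix stripped. fastText's default prefix is '__label__'."""
    return [tok[len(label):] for tok in line.split() if tok.startswith(label)]

labels = extract_labels_line('__label__LABEL2 __label__LABEL5 some text here')
```

Here `labels` is `['LABEL2', 'LABEL5']`; applying this to each line of a dataset gives the nested-list shape returned by extract_labels().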
Exceptions
The fastText source code directly calls exit() when something goes wrong (e.g. a model file does not exist).
Instead of exiting, pyfasttext raises a Python exception (RuntimeError).
>>> import pyfasttext
>>> model = pyfasttext.FastText('/path/to/non-existing_model.bin')
Model file cannot be opened for loading!
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src/pyfasttext.pyx", line 124, in pyfasttext.FastText.__cinit__ (src/pyfasttext.cpp:1800)
File "src/pyfasttext.pyx", line 348, in pyfasttext.FastText.load_model (src/pyfasttext.cpp:5947)
RuntimeError: fastext tried to exit: 1
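Because the error surfaces as an ordinary RuntimeError, you can guard model loading with a plain try/except. A hedged sketch (load_model_safely is a hypothetical helper, and the ImportError guard is only there so the snippet degrades gracefully when pyfasttext is not installed):

```python
try:
    from pyfasttext import FastText
except ImportError:  # pyfasttext may not be installed in every environment
    FastText = None

def load_model_safely(path):
    """Return a loaded FastText model, or None if loading fails."""
    if FastText is None:
        return None
    try:
        return FastText(path)
    except RuntimeError as err:
        # e.g. "fastext tried to exit: 1" for a missing model file
        print('could not load model:', err)
        return None
```

This keeps a batch job alive when a single model file is missing, instead of letting fastText's exit() kill the whole process.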