Unified framework for word embeddings (Word2Vec, GloVe, FastText, ...) compatible with scikit-learn Pipeline
Project description
A unified framework for using word embeddings (Word2Vec, GloVe, FastText, …) in machine learning pipelines, compatible with scikit-learn Pipelines.
Installation
Install the package with pip install Cython && pip install zeugma (Cython is required by the fastText package, on which zeugma depends).
Examples
Embedding transformers can either be used with downloaded pre-trained embeddings (each comes with a default embedding URL) or be trained on your own corpus.
Pretrained downloaded embeddings
As an illustrative example, the cosine similarity of the texts "zeugma" and "figure of speech" is computed using the GloVeTransformer with downloaded embeddings (the default URL is used here):
>>> from zeugma.embeddings import GloVeTransformer
>>> GloVeTransformer.download_embeddings()  # fetch the default pre-trained GloVe vectors
>>> glove = GloVeTransformer(model_path)  # model_path: local path to the downloaded embeddings file
>>> embeddings = glove.transform(['zeugma', 'figure of speech'])
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> cosine_similarity(embeddings)[0, 1]
0.32840478
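The same pattern should apply to the other pre-trained embedding models. A hypothetical sketch with fastText vectors is shown below; note that the FastTextTransformer class name and its download_embeddings/transform interface are assumed here by analogy with GloVeTransformer, and are not confirmed by this page:

>>> from zeugma.embeddings import FastTextTransformer  # class name assumed by analogy with GloVeTransformer
>>> FastTextTransformer.download_embeddings()  # would fetch the default pre-trained fastText vectors
>>> fasttext = FastTextTransformer(model_path)  # model_path: local path to the downloaded embeddings file
>>> embeddings = fasttext.transform(['zeugma', 'figure of speech'])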
Training embeddings
Zeugma can also be used to train embeddings on your own corpus (composed of only two texts here):
>>> from zeugma.embeddings import Word2VecTransformer
>>> w2v = Word2VecTransformer(trainable=True)
>>> embeddings = w2v.fit_transform(['zeugma', 'figure of speech'])
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> cosine_similarity(embeddings)[0, 1]
-0.028218582
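Because the transformers implement the scikit-learn fit/transform interface, they can be plugged directly into a Pipeline. A minimal sketch of a text classifier follows; the corpus and labels are made-up illustrations, and a real task would need a much larger training corpus:

>>> from zeugma.embeddings import Word2VecTransformer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> pipeline = Pipeline([
...     ('embedding', Word2VecTransformer(trainable=True)),  # texts -> embedding vectors
...     ('classifier', LogisticRegression()),  # embedding vectors -> class labels
... ])
>>> corpus = ['zeugma is a figure of speech', 'glove and word2vec are embeddings']  # toy data
>>> labels = [0, 1]
>>> pipeline.fit(corpus, labels)  # trains the embeddings, then the classifier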
Fine-tuning embeddings
Embedding fine-tuning (training embeddings initialized with pre-loaded values) will be implemented in the future.