
scikit-embeddings


Utilities for training, storing and using word and document embeddings in scikit-learn pipelines.

Features

  • Train Word and Paragraph embeddings in scikit-learn compatible pipelines.
  • Fast, trainable tokenizer components built on the tokenizers library.
  • Components and pipelines that integrate easily into your scikit-learn workflows.
  • Easy serialization and integration with the Hugging Face Hub for quickly publishing your embedding pipelines.

What scikit-embeddings is not for:

  • Training transformer models and deep neural language models (if you want to do this, do it with transformers)
  • Using pretrained sentence transformers (use embetter)

Installation

You can easily install scikit-embeddings from PyPI:

pip install scikit-embeddings

If you want to use GloVe embedding models, install it along with glovpy:

pip install scikit-embeddings[glove]

Example Pipelines

You can use scikit-embeddings with many different pipeline architectures; here are a few examples:

Word Embeddings

You can train classic vanilla word embeddings by building a pipeline that contains a WordLevel tokenizer and an embedding model:

from skembeddings.tokenizers import WordLevelTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    WordLevelTokenizer(),
    Word2VecEmbedding(n_components=100, algorithm="cbow")
)
embedding_pipe.fit(texts)
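
Once fitted, the pipeline can be used to embed documents. The snippet below is a minimal sketch rather than official documentation: it assumes the fitted pipeline follows the scikit-learn transformer API, with transform() returning one vector per document, and new_texts is just an illustrative list of strings.

# Minimal usage sketch (assumption: transform() returns one embedding per document)
new_texts = [
    "I love scikit-learn pipelines.",
    "Word embeddings capture distributional semantics.",
]
doc_vectors = embedding_pipe.transform(new_texts)
print(doc_vectors.shape)  # e.g. (2, 100) when n_components=100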

FastText-like

You can train an embedding pipeline that uses subword information by choosing a tokenizer that provides it, such as Unigram, BPE or WordPiece. FastText also uses skip-gram by default, so let's switch to that algorithm.

from skembeddings.tokenizers import UnigramTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    UnigramTokenizer(),
    Word2VecEmbedding(n_components=250, algorithm="sg")
)
embedding_pipe.fit(texts)

Paragraph Embeddings

You can train Doc2Vec paragraph embeddings with the tokenizer of your choice.

from skembeddings.tokenizers import WordPieceTokenizer
from skembeddings.models import ParagraphEmbedding
from skembeddings.pipeline import EmbeddingPipeline, PretrainedPipeline

embedding_pipe = EmbeddingPipeline(
    WordPieceTokenizer(),
    ParagraphEmbedding(n_components=250, algorithm="dm")
)
embedding_pipe.fit(texts)

Serialization

Pipelines can be safely serialized to disk:

embedding_pipe.to_disk("output_folder/")

pretrained = PretrainedPipeline("output_folder/")

Or published to the Hugging Face Hub:

from huggingface_hub import login

login()
embedding_pipe.to_hub("username/name_of_pipeline")

pretrained = PretrainedPipeline("username/name_of_pipeline")

Text Classification

You can include an embedding model in your classification pipelines by adding a classification head.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# X: iterable of raw text documents, y: corresponding labels
X_train, X_test, y_train, y_test = train_test_split(X, y)

cls_pipe = make_pipeline(pretrained, LogisticRegression())
cls_pipe.fit(X_train, y_train)

y_pred = cls_pipe.predict(X_test)
print(classification_report(y_test, y_pred))
