
Tools for training word and document embeddings in scikit-learn.


scikit-embeddings


Utilities for training, storing and using word and document embeddings in scikit-learn pipelines.

Features

  • Train word and paragraph embeddings in scikit-learn compatible pipelines.
  • Fast and trainable tokenizer components built on the tokenizers library.
  • Components and pipelines that are easy to integrate into your scikit-learn workflows and machine learning pipelines.
  • Easy serialization and integration with the Hugging Face Hub for quickly publishing your embedding pipelines.

What scikit-embeddings is not for:

  • Training transformer models and deep neural language models (if you want to do this, do it with transformers)
  • Using pretrained sentence transformers (use embetter)

Installation

You can easily install scikit-embeddings from PyPI:

pip install scikit-embeddings

If you want to use GloVe embedding models, install it along with glovpy:

pip install scikit-embeddings[glove]
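
With the glove extra installed, a GloVe pipeline can be assembled the same way as the examples below. The snippet is a minimal sketch: it assumes a GloVeEmbedding model in skembeddings.models with a Word2VecEmbedding-like interface, so check the package documentation for the exact class name and parameters.

from skembeddings.tokenizers import WordLevelTokenizer
from skembeddings.pipeline import EmbeddingPipeline
# Assumption: a GloVe model is exposed alongside Word2VecEmbedding;
# verify the class name and parameters against the package docs.
from skembeddings.models import GloVeEmbedding

glove_pipe = EmbeddingPipeline(
    WordLevelTokenizer(),
    GloVeEmbedding(n_components=100),
)
glove_pipe.fit(texts)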

Example Pipelines

You can use scikit-embeddings with many different pipeline architectures; here are a few examples:

Word Embeddings

You can train classic vanilla word embeddings by building a pipeline that contains a WordLevel tokenizer and an embedding model:

from skembeddings.tokenizers import WordLevelTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    WordLevelTokenizer(),
    Word2VecEmbedding(n_components=100, algorithm="cbow")
)
embedding_pipe.fit(texts)
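
A fitted pipeline acts as a scikit-learn transformer (the classification example below relies on this), so new texts can be turned into dense vectors with transform(). A minimal sketch:

# The fitted pipeline behaves as a scikit-learn transformer:
# raw texts go in, dense embedding vectors come out.
vectors = embedding_pipe.transform(["a new piece of text"])
# With n_components=100, each document is likely represented by a
# 100-dimensional vector (exact pooling behaviour not shown on this page).
print(vectors.shape)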

FastText-like

You can train an embedding pipeline that uses subword information by choosing a tokenizer that splits words into subword units; Unigram, BPE or WordPiece all work for this purpose. FastText also uses skip-gram by default, so let's switch to that.

from skembeddings.tokenizers import UnigramTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline

embedding_pipe = EmbeddingPipeline(
    UnigramTokenizer(),
    Word2VecEmbedding(n_components=250, algorithm="sg")
)
embedding_pipe.fit(texts)

Paragraph Embeddings

You can train Doc2Vec paragraph embeddings with the tokenizer of your choice.

from skembeddings.tokenizers import WordPieceTokenizer
from skembeddings.models import ParagraphEmbedding
from skembeddings.pipeline import EmbeddingPipeline, PretrainedPipeline

embedding_pipe = EmbeddingPipeline(
    WordPieceTokenizer(),
    ParagraphEmbedding(n_components=250, algorithm="dm")
)
embedding_pipe.fit(texts)
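
Doc2Vec can also infer vectors for documents that were not seen during training. Assuming the fitted pipeline exposes this through the usual transform() method (a sketch, not confirmed on this page), embedding new paragraphs would look like this:

new_docs = ["an unseen paragraph of text"]
# Assumption: transform() infers one paragraph vector per unseen document.
doc_vectors = embedding_pipe.transform(new_docs)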

Serialization

Pipelines can be safely serialized to disk:

embedding_pipe.to_disk("output_folder/")

pretrained = PretrainedPipeline("output_folder/")

Or published to the Hugging Face Hub:

from huggingface_hub import login

login()
embedding_pipe.to_hub("username/name_of_pipeline")

pretrained = PretrainedPipeline("username/name_of_pipeline")

Text Classification

You can include an embedding model in your classification pipelines by adding a classification head, such as logistic regression.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# X: raw texts, y: class labels

X_train, X_test, y_train, y_test = train_test_split(X, y)

cls_pipe = make_pipeline(pretrained, LogisticRegression())
cls_pipe.fit(X_train, y_train)

y_pred = cls_pipe.predict(X_test)
print(classification_report(y_test, y_pred))

