Tools for training word and document embeddings in scikit-learn.
Project description
scikit-embeddings
Utilities for training, storing and using word and document embeddings in scikit-learn pipelines.
Features
- Train Word and Paragraph embeddings in scikit-learn compatible pipelines.
- Fast and performant trainable tokenizer components from the Hugging Face tokenizers library.
- Easy-to-integrate components and pipelines for your scikit-learn workflows and machine learning pipelines.
- Easy serialization and integration with the Hugging Face Hub for quickly publishing your embedding pipelines.
What scikit-embeddings is not for:
- Training transformer models and deep neural language models (if you want to do this, do it with transformers)
- Using pretrained sentence transformers (use embetter)
Installation
You can easily install scikit-embeddings from PyPI:
pip install scikit-embeddings
If you want to use GloVe embedding models, install it along with glovpy:
pip install scikit-embeddings[glove]
Example Pipelines
You can use scikit-embeddings with many different pipeline architectures; here are a few examples:
Word Embeddings
You can train classic vanilla word embeddings by building a pipeline that contains a WordLevel tokenizer and an embedding model:
from skembeddings.tokenizers import WordLevelTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline
embedding_pipe = EmbeddingPipeline(
WordLevelTokenizer(),
Word2VecEmbedding(n_components=100, algorithm="cbow")
)
embedding_pipe.fit(texts)
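Once fitted, the pipeline can be used like any other scikit-learn transformer. A minimal sketch, assuming transform() returns one pooled vector per document (as the classification example further down suggests):

# Hypothetical usage: embed unseen documents with the fitted pipeline.
new_texts = ["An unseen document to embed."]
doc_embeddings = embedding_pipe.transform(new_texts)
# Assumed shape: (n_documents, n_components), here (1, 100).
print(doc_embeddings.shape)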
Fasttext-like
You can train an embedding pipeline that uses subword information by choosing a tokenizer that produces subwords, such as Unigram, BPE or WordPiece.
Fasttext also uses skip-gram by default, so let's switch to that.
from skembeddings.tokenizers import UnigramTokenizer
from skembeddings.models import Word2VecEmbedding
from skembeddings.pipeline import EmbeddingPipeline
embedding_pipe = EmbeddingPipeline(
UnigramTokenizer(),
Word2VecEmbedding(n_components=250, algorithm="sg")
)
embedding_pipe.fit(texts)
Paragraph Embeddings
You can train Doc2Vec paragraph embeddings with your choice of tokenization.
from skembeddings.tokenizers import WordPieceTokenizer
from skembeddings.models import ParagraphEmbedding
from skembeddings.pipeline import EmbeddingPipeline, PretrainedPipeline
embedding_pipe = EmbeddingPipeline(
WordPieceTokenizer(),
ParagraphEmbedding(n_components=250, algorithm="dm")
)
embedding_pipe.fit(texts)
Serialization
Pipelines can be safely serialized to disk:
embedding_pipe.to_disk("output_folder/")
pretrained = PretrainedPipeline("output_folder/")
Or published to the Hugging Face Hub:
from huggingface_hub import login
login()
embedding_pipe.to_hub("username/name_of_pipeline")
pretrained = PretrainedPipeline("username/name_of_pipeline")
Text Classification
You can include an embedding model in your classification pipelines by adding a classification head.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
X_train, X_test, y_train, y_test = train_test_split(X, y)
cls_pipe = make_pipeline(pretrained, LogisticRegression())
cls_pipe.fit(X_train, y_train)
y_pred = cls_pipe.predict(X_test)
print(classification_report(y_test, y_pred))
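Because the result is an ordinary scikit-learn pipeline, it also works with the usual model-selection utilities. A minimal sketch using cross_val_score (the cv value here is an arbitrary choice for illustration):

from sklearn.model_selection import cross_val_score

# Cross-validate the embedding + classifier pipeline directly on raw texts.
scores = cross_val_score(cls_pipe, X, y, cv=5)
print(scores.mean())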
File details
Details for the file scikit_embeddings-0.3.1.tar.gz.
File metadata
- Download URL: scikit_embeddings-0.3.1.tar.gz
- Upload date:
- Size: 13.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.3.2 CPython/3.10.8 Linux/5.15.0-84-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ea068a598e062cc7f189e86daec943d4d7de8e72b880d651d2ec9952efc5c44 |
| MD5 | cee811a9e735ba794d99fdbe7a060865 |
| BLAKE2b-256 | 5b7dac7b69a1ca32debd3f98fc2f67edad66aab265fc386e26bd0f62b05fb1f4 |
File details
Details for the file scikit_embeddings-0.3.1-py3-none-any.whl.
File metadata
- Download URL: scikit_embeddings-0.3.1-py3-none-any.whl
- Upload date:
- Size: 18.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.3.2 CPython/3.10.8 Linux/5.15.0-84-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dea3ba22ee448bdaf9df2019f1f927d83ae75f9c64e24c2258b21f7dbd4517ce |
| MD5 | 8a3438e9cc476658712f5496a84cf74a |
| BLAKE2b-256 | c846e25c0757f4fcb59f755b06bdbb2298ad5f7e94f444035296bcd1468da454 |