embetter
"Just a bunch of useful embeddings to get started quickly."
Embetter implements scikit-learn compatible embeddings for computer vision and text. It should make it very easy to quickly build proofs of concept using scikit-learn pipelines and, in particular, should help with bulk labelling. It's also meant to play nice with bulk and scikit-partial.
Install
You can install via pip.
python -m pip install embetter
Many of the embeddings are optional depending on your use case, so if you only want to download the tools that you need, you can install specific extras:
python -m pip install "embetter[text]"
python -m pip install "embetter[sentence-tfm]"
python -m pip install "embetter[spacy]"
python -m pip install "embetter[sense2vec]"
python -m pip install "embetter[bpemb]"
python -m pip install "embetter[vision]"
python -m pip install "embetter[all]"
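To see which optional backends are already present in an environment, one option is to probe for the underlying packages. This is a minimal sketch and not part of embetter: the mapping from extras to importable module names is an assumption based on the extra names, and it may not match what each extra actually installs.

```python
from importlib.util import find_spec

# Hypothetical mapping from embetter extras to the module each one is
# assumed to provide; adjust it to match your installed version.
ASSUMED_BACKENDS = {
    "sentence-tfm": "sentence_transformers",
    "spacy": "spacy",
    "sense2vec": "sense2vec",
    "bpemb": "bpemb",
    "vision": "timm",
}


def available_backends(backends=ASSUMED_BACKENDS):
    """Return {extra_name: bool}, True when the assumed module can be found."""
    return {extra: find_spec(module) is not None for extra, module in backends.items()}


if __name__ == "__main__":
    for extra, present in available_backends().items():
        print(f"{extra:>12}: {'installed' if present else 'missing'}")
```

Anything reported as missing can then be added with the matching `python -m pip install "embetter[...]"` command.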
API Design
These are the components that are currently implemented.
# Helpers to grab text or image from pandas column.
from embetter.grab import ColumnGrabber
# Representations/Helpers for computer vision
from embetter.vision import ImageLoader, TimmEncoder, ColorHistogramEncoder
# Representations for text
from embetter.text import SentenceEncoder, Sense2VecEncoder, BytePairEncoder, spaCyEncoder
# Representations from multi-modal models
from embetter.multi import ClipEncoder
# Finetuning components
from embetter.finetune import ForwardFinetuner
# External embedding providers, these typically need an API key
from embetter.external import CohereEncoder, OpenAIEncoder
All of these components are scikit-learn compatible, which means that you can apply them as you would normally in a scikit-learn pipeline. Just be aware that these components are stateless: they are all pretrained tools, so they don't require training.
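To make "stateless" concrete, here is a minimal sketch of a scikit-learn compatible transformer that learns nothing in `fit`. The `LengthEncoder` class is a hypothetical stand-in written for this illustration; it is not part of embetter and says nothing about how the real encoders are implemented.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class LengthEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a pretrained encoder; not part of embetter."""

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit just returns self.
        return self

    def transform(self, X):
        # Map each text to a fixed 2-d vector: [character count, word count].
        return np.array([[len(text), len(text.split())] for text in X], dtype=float)


texts = ["positive sentiment", "super negative"]
enc = LengthEncoder()

# Because the encoder holds no fitted state, fitting on unrelated data
# (or not fitting at all) gives the same output.
before = enc.transform(texts)
after = enc.fit(["unrelated text"]).transform(texts)
assert np.array_equal(before, after)
```

A component like this can sit anywhere in a pipeline, because `fit` is a no-op and `transform` depends only on its input.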
Text Example
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder
# This pipeline grabs the `text` column from a dataframe
# which then gets fed into Sentence-Transformers' all-MiniLM-L6-v2.
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    SentenceEncoder('all-MiniLM-L6-v2')
)
# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
    text_emb_pipeline,
    LogisticRegression()
)
dataf = pd.DataFrame({
    "text": ["positive sentiment", "super negative"],
    "label_col": ["pos", "neg"]
})
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)
Image Example
The goal of the API is to allow pipelines like this:
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from embetter.grab import ColumnGrabber
from embetter.vision import ImageLoader, TimmEncoder
# This pipeline grabs the `img_path` column from a dataframe
# then it grabs the image paths and turns them into `PIL.Image` objects
# which then get fed into MobileNetv2 via TorchImageModels (timm).
image_emb_pipeline = make_pipeline(
    ColumnGrabber("img_path"),
    ImageLoader(convert="RGB"),
    TimmEncoder("mobilenetv2_120d")
)
dataf = pd.DataFrame({
    "img_path": ["tests/data/thiscatdoesnotexist.jpeg"]
})
image_emb_pipeline.fit_transform(dataf)
Batched Learning
All of the encoding tools you've seen here are also compatible with the partial_fit mechanic in scikit-learn. That means you can leverage scikit-partial to build pipelines that can handle out-of-core datasets.
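The idea behind batched learning can be sketched with plain scikit-learn. The example below does not use scikit-partial or embetter. It assumes a hypothetical stateless `encode` function in place of a pretrained encoder and a hand-written `batches` helper, and it updates an `SGDClassifier` one chunk at a time with `partial_fit`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def encode(texts):
    """Hypothetical stateless encoder standing in for a pretrained embedding."""
    return np.array([[len(t), t.count("e"), len(t.split())] for t in texts], dtype=float)


def batches(texts, labels, size):
    """Yield consecutive (texts, labels) chunks of at most `size` rows."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size], labels[start:start + size]


texts = ["positive sentiment", "super negative", "really great", "quite bad"]
labels = ["pos", "neg", "pos", "neg"]
all_classes = np.array(["neg", "pos"])

clf = SGDClassifier(random_state=0)
for batch_texts, batch_labels in batches(texts, labels, size=2):
    # The encoder needs no fitting, so each batch is embedded on its own;
    # only the classifier is updated incrementally.
    X = encode(batch_texts)
    # `classes` lists every label up front, since a batch may miss some.
    clf.partial_fit(X, batch_labels, classes=all_classes)

preds = clf.predict(encode(texts))
```

In a real out-of-core setting the batches would come from disk (for example by reading a file in chunks) rather than from an in-memory list, so only one chunk has to fit in memory at a time.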