Have you every struggled with needing a Spacy TextCategorizer but didn't have the time to train one from scratch? Classy Classification is the way to go!
Project description
Classy Classification
Have you every struggled with needing a Spacy TextCategorizer but didn't have the time to train one from scratch? Classy Classification is the way to go! For few-shot classification using sentence-transformers or spaCy models, provide a dictionary with labels and examples, or just provide a list of labels for zero shot-classification with Hugginface zero-shot classifiers.
Install
pip install classy-classification
or install with faster inference using onnx.
pip install classy-classification[onnx]
ONNX issues
pickling
ONNX does show some issues when pickling the data.
M1
Some installation issues might occur, which can be fixed by these commands.
brew install cmake
brew install protobuf
pip3 install onnx --no-use-pep517
Quickstart
SpaCy embeddings
import spacy
import classy_classification
data = {
"furniture": ["This text is about chairs.",
"Couches, benches and televisions.",
"I really need to get a new sofa."],
"kitchen": ["There also exist things like fridges.",
"I hope to be getting a new stove today.",
"Do you also have some ovens."]
}
nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
"text_categorizer",
config={
"data": data,
"model": "spacy"
}
)
print(nlp("I am looking for kitchen appliances.")._.cats)
# Output:
#
# [{"label": "furniture", "score": 0.21}, {"label": "kitchen", "score": 0.79}]
Multi-label classification
Sometimes multiple labels are necessary to fully describe the contents of a text. In that case, we want to make use of the multi-label implementation, here the sum of label scores is not limited to 1. Note that we use a multi-layer perceptron for this purpose instead of the default SVC
implementation, requiring a few more training samples.
import spacy
import classy_classification
data = {
"furniture": ["This text is about chairs.",
"Couches, benches and televisions.",
"I really need to get a new sofa.",
"We have a new dinner table."],
"kitchen": ["There also exist things like fridges.",
"I hope to be getting a new stove today.",
"Do you also have some ovens.",
"We have a new dinner table."]
}
nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
"text_categorizer",
config={
"data": data,
"model": "spacy",
"multi_label": True,
"config": {"hidden_layer_sizes": (64,), "seed": 42}
}
)
print(nlp("texts about dinner tables have multiple labels.")._.cats)
# Output:
#
# [{"label": "furniture", "score": 0.94}, {"label": "kitchen", "score": 0.97}]
Sentence-transfomer embeddings
import spacy
import classy_classification
data = {
"furniture": ["This text is about chairs.",
"Couches, benches and televisions.",
"I really need to get a new sofa."],
"kitchen": ["There also exist things like fridges.",
"I hope to be getting a new stove today.",
"Do you also have some ovens."]
}
nlp = spacy.blank("en")
nlp.add_pipe(
"text_categorizer",
config={
"data": data,
"model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
"device": "gpu"
}
)
print(nlp("I am looking for kitchen appliances.")._.cats)
# Output:
#
# [{"label": "furniture", "score": 0.21}, {"label": "kitchen", "score": 0.79}]
Hugginface zero-shot classifiers
import spacy
import classy_classification
data = ["furniture", "kitchen"]
nlp = spacy.blank("en")
nlp.add_pipe(
"text_categorizer",
config={
"data": data,
"model": "typeform/distilbert-base-uncased-mnli",
"cat_type": "zero",
"device": "gpu"
}
)
print(nlp("I am looking for kitchen appliances.")._.cats)
# Output:
#
# [{"label": "furniture", "score": 0.21}, {"label": "kitchen", "score": 0.79}]
Credits
Inspiration Drawn From
Huggingface does offer some nice models for few/zero-shot classification, but these are not tailored to multi-lingual approaches. Rasa NLU has a nice approach for this, but its too embedded in their codebase for easy usage outside of Rasa/chatbots. Additionally, it made sense to integrate sentence-transformers and Hugginface zero-shot, instead of default word embeddings. Finally, I decided to integrate with Spacy, since training a custom Spacy TextCategorizer seems like a lot of hassle if you want something quick and dirty.
Or buy me a coffee
Standalone usage without spaCy
from classy_classification import classyClassifier
data = {
"furniture": ["This text is about chairs.",
"Couches, benches and televisions.",
"I really need to get a new sofa."],
"kitchen": ["There also exist things like fridges.",
"I hope to be getting a new stove today.",
"Do you also have some ovens."]
}
classifier = classyClassifier(data=data)
classifier("I am looking for kitchen appliances.")
classifier.pipe(["I am looking for kitchen appliances."])
# overwrite training data
classifier.set_training_data(data=data)
classifier("I am looking for kitchen appliances.")
# overwrite [embedding model](https://www.sbert.net/docs/pretrained_models.html)
classifier.set_embedding_model(model="paraphrase-MiniLM-L3-v2")
classifier("I am looking for kitchen appliances.")
# overwrite SVC config
classifier.set_classification_model(
config={
"C": [1, 2, 5, 10, 20, 100],
"kernels": ["linear"],
"max_cross_validation_folds": 5
}
)
classifier("I am looking for kitchen appliances.")
Save and load models
data = {
"furniture": ["This text is about chairs.",
"Couches, benches and televisions.",
"I really need to get a new sofa."],
"kitchen": ["There also exist things like fridges.",
"I hope to be getting a new stove today.",
"Do you also have some ovens."]
}
classifier = classyClassifier(data=data)
with open("./classifier.pkl", "wb") as f:
pickle.dump(classifier, f)
f = open("./classifier.pkl", "rb")
classifier = pickle.load(f)
classifier("I am looking for kitchen appliances.")
Todo
[ ] look into a way to integrate spacy trf models.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for classy_classification-0.5.3.tar.gz
Algorithm | Hash digest | |
---|---|---|
SHA256 | 18a7eeacfbb4ae44215b24b0d18a6d7eb3e26c65349e5aea23b38f6519f850d6 |
|
MD5 | 8276a655eba7a08b2a9b28cabe1653af |
|
BLAKE2b-256 | f517b8c60cd4360ea33a1cb169fac439a69534469bfdee37a37beffc0f787d4d |
Hashes for classy_classification-0.5.3-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | f0d0a5e199fa6f80af8e47c2bc3cc5ce71f0bfb0becff59f96fba9911e82d935 |
|
MD5 | 6ded40c2f475eb46de61fe69e01232b8 |
|
BLAKE2b-256 | 526dc00c10d427f9c3d08813e8b95edaf158479862ee40e587edde8ca168912b |