Classy Classification
Have you ever struggled with needing a spaCy TextCategorizer but didn't have the time to train one from scratch? Classy Classification is the way to go! For few-shot classification using sentence-transformers or spaCy models, provide a dictionary with labels and examples, or just provide a list of labels for zero-shot classification with Hugging Face zero-shot classifiers.
Install
pip install classy-classification
SetFit support
I got a lot of requests for SetFit support, but I decided to create a separate package for this. Feel free to check it out. ❤️
Quickstart
spaCy embeddings
import spacy
# or import standalone
# from classy_classification import ClassyClassifier

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp = spacy.load("en_core_web_trf")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture" : 0.21}, {"kitchen": 0.79}]
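The example outputs in this README are printed in slightly different shapes (a list of single-label dicts in some places, a list holding one combined dict in others). As a minimal sketch, assuming `_.cats` comes back in one of those shapes, the hypothetical helpers `merge_scores` and `top_label` below (not part of classy-classification) flatten the scores and pick the best label:

```python
from typing import Dict, List, Union

Scores = Union[Dict[str, float], List[Dict[str, float]]]


def merge_scores(cats: Scores) -> Dict[str, float]:
    """Flatten a dict, or a list of dicts, into one label -> score dict."""
    if isinstance(cats, dict):
        return dict(cats)
    merged: Dict[str, float] = {}
    for entry in cats:
        merged.update(entry)
    return merged


def top_label(cats: Scores) -> str:
    """Return the label with the highest score."""
    scores = merge_scores(cats)
    if not scores:
        raise ValueError("no scores to rank")
    return max(scores, key=scores.get)


# Scores copied from the example output above.
print(top_label([{"furniture": 0.21}, {"kitchen": 0.79}]))  # kitchen
```

In a pipeline this would be called as `top_label(doc._.cats)`; the helper only touches plain Python containers, so it does not depend on which embedding model is configured.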
Sentence level classification
import spacy

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

# the pipeline must set sentence boundaries for doc.sents to work
nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy",
        "include_sent": True
    }
)

doc = nlp("I am looking for kitchen appliances. And I love doing so.")
# doc.sents is a generator, so materialize it before indexing
print(list(doc.sents)[0]._.cats)

# Output:
#
# [{"furniture" : 0.21}, {"kitchen": 0.79}]
Define random seed and verbosity
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "verbose": True,
        "config": {"seed": 42}
    }
)
Multi-label classification
Sometimes multiple labels are necessary to fully describe the contents of a text. In that case, we want to make use of the multi-label implementation, where the sum of label scores is not limited to 1. Just pass the same training data to multiple keys.
import spacy

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa.",
                  "We have a new dinner table.",
                  "There also exist things like fridges.",
                  "I hope to be getting a new stove today.",
                  "Do you also have some ovens.",
                  "We have a new dinner table."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens.",
                "We have a new dinner table.",
                "There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens.",
                "We have a new dinner table."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "spacy",
        "multi_label": True,
    }
)

print(nlp("I am looking for furniture and kitchen equipment.")._.cats)

# Output:
#
# [{"furniture": 0.92}, {"kitchen": 0.91}]
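Because multi-label scores are independent, a text can clear the bar for several labels at once. The sketch below shows one way to turn such scores into a set of assigned labels. The helper `labels_above` and the 0.5 threshold are illustrative assumptions, not part of classy-classification, and the scores are copied from the example output above as a plain dict.

```python
from typing import Dict, List


def labels_above(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every label whose score reaches the threshold, sorted by name."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Scores taken from the multi-label example output; note they sum to more than 1.
scores = {"furniture": 0.92, "kitchen": 0.91}

print(labels_above(scores))        # ['furniture', 'kitchen']
print(labels_above(scores, 0.95))  # []
```

The threshold is a tuning choice: a higher value assigns fewer labels per text, a lower one assigns more.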
Outlier detection
Sometimes it is useful to do outlier detection or binary classification. This can be approached using a binary training dataset. However, I have also implemented support for a OneClassSVM for outlier detection using a single label. Note that this method does not return probabilities, but the data is formatted as label-score pairs to ensure uniformity.
Approach 1:
import spacy

data_binary = {
    "inlier": ["This text is about chairs.",
               "Couches, benches and televisions.",
               "I really need to get a new sofa."],
    "outlier": ["Text about kitchen equipment",
                "This text is about politics",
                "Comments about AI and stuff."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data_binary,
    }
)

print(nlp("This text is a random text")._.cats)

# Output:
#
# [{'inlier': 0.2926672385488411, 'outlier': 0.707332761451159}]
Approach 2:
import spacy

data_singular = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa.",
                  "We have a new dinner table."]
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data_singular,
    }
)

print(nlp("This text is a random text")._.cats)

# Output:
#
# [{'furniture': 0, 'not_furniture': 1}]
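To make the "label-score pairs without probabilities" point concrete: scikit-learn's OneClassSVM predicts +1 for inliers and -1 for outliers. The sketch below shows how such a yes/no prediction could be put into the same shape as the output above. The function `format_one_class` is hypothetical and only illustrates the idea; it is not the package's internal code.

```python
from typing import Dict


def format_one_class(label: str, prediction: int) -> Dict[str, int]:
    """Map a one-class prediction (+1 inlier, -1 outlier) to label-score pairs.

    The values are hard 0/1 decisions, not probabilities.
    """
    if prediction not in (1, -1):
        raise ValueError("prediction must be +1 (inlier) or -1 (outlier)")
    is_inlier = int(prediction == 1)
    return {label: is_inlier, f"not_{label}": 1 - is_inlier}


print(format_one_class("furniture", -1))  # {'furniture': 0, 'not_furniture': 1}
print(format_one_class("furniture", 1))   # {'furniture': 1, 'not_furniture': 0}
```

Since the scores are always 0 or 1, thresholding them as if they were probabilities carries no extra information.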
Sentence-transformer embeddings
import spacy

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

nlp = spacy.blank("en")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture": 0.21}, {"kitchen": 0.79}]
Hugging Face zero-shot classifiers
import spacy

data = ["furniture", "kitchen"]

nlp = spacy.blank("en")
nlp.add_pipe(
    "classy_classification",
    config={
        "data": data,
        "model": "typeform/distilbert-base-uncased-mnli",
        "cat_type": "zero",
        "device": "gpu"
    }
)

print(nlp("I am looking for kitchen appliances.")._.cats)

# Output:
#
# [{"furniture": 0.21}, {"kitchen": 0.79}]
Credits
Inspiration Drawn From
Hugging Face does offer some nice models for few/zero-shot classification, but these are not tailored to multilingual approaches. Rasa NLU has a nice approach for this, but it's too embedded in their codebase for easy usage outside of Rasa/chatbots. Additionally, it made sense to integrate sentence-transformers and Hugging Face zero-shot classifiers instead of default word embeddings. Finally, I decided to integrate with spaCy, since training a custom spaCy TextCategorizer seems like a lot of hassle if you want something quick and dirty.
Standalone usage without spaCy
from classy_classification import ClassyClassifier

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}

classifier = ClassyClassifier(data=data)
classifier("I am looking for kitchen appliances.")
classifier.pipe(["I am looking for kitchen appliances."])

# overwrite training data
classifier.set_training_data(data=data)
classifier("I am looking for kitchen appliances.")

# overwrite embedding model (see https://www.sbert.net/docs/pretrained_models.html)
classifier.set_embedding_model(model="paraphrase-MiniLM-L3-v2")
classifier("I am looking for kitchen appliances.")

# overwrite SVC config
classifier.set_classification_model(
    config={
        "C": [1, 2, 5, 10, 20, 100],
        "kernel": ["linear"],
        "max_cross_validation_folds": 5
    }
)
classifier("I am looking for kitchen appliances.")
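The SVC config above passes lists for `C` and `kernel`, which suggests a hyperparameter grid. The README does not say how the package searches these values, so the sketch below is only an assumption: it shows how many candidate settings a grid search over those lists would evaluate. `expand_grid` is a hypothetical helper, and treating `max_cross_validation_folds` as a separate cap on the number of folds (rather than a grid dimension) is also an assumption.

```python
from itertools import product
from typing import Any, Dict, List

config = {
    "C": [1, 2, 5, 10, 20, 100],
    "kernel": ["linear"],
    "max_cross_validation_folds": 5,
}


def expand_grid(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build every combination of the list-valued entries in the config."""
    grid = {key: values for key, values in config.items() if isinstance(values, list)}
    keys = sorted(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[key] for key in keys))]


candidates = expand_grid(config)
print(len(candidates))  # 6
print(candidates[0])    # {'C': 1, 'kernel': 'linear'}
```

Under that reading, adding a second kernel would double the number of candidates, which matters when each one is cross-validated on a handful of examples.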
Save and load models
import pickle

from classy_classification import ClassyClassifier

data = {
    "furniture": ["This text is about chairs.",
                  "Couches, benches and televisions.",
                  "I really need to get a new sofa."],
    "kitchen": ["There also exist things like fridges.",
                "I hope to be getting a new stove today.",
                "Do you also have some ovens."]
}
classifier = ClassyClassifier(data=data)

# save the classifier to disk
with open("./classifier.pkl", "wb") as f:
    pickle.dump(classifier, f)

# load it back; the context manager closes the file afterwards
with open("./classifier.pkl", "rb") as f:
    classifier = pickle.load(f)

classifier("I am looking for kitchen appliances.")
Hashes for classy-classification-1.0.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | f8c525b7d0a7332e7ad30a289bf2760a8d12cd3b94a05fabc8838bfc0cad9d23
MD5 | 68262f6cfdba000913984f1726ff1106
BLAKE2b-256 | 828ded73c91a055ae9869cae50662a746e5bdddf265848f9a9e0eafd4a1e3e2e
Hashes for classy_classification-1.0.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 9e79739a345d3ffc3bf7e7743405e4d8bb4170654d197ff75724ad5ae7cafd45
MD5 | f2dcb5017eb810b71ea093dd712c7a8f
BLAKE2b-256 | 667e103a31711d23fdfba3291e0d970bb27d5b413c93e3a76383aea3c88bceab