Skip to main content

One Line To Build Any Classifier Without Data

Project description

GitHub LicensePyPI - DownloadsPyPI - Version

∞🧙🏼‍♂️AnyClassifier - One Line To Build Any Classifier Without Data, And A Step Towards The First AI ML Engineer

image

image

Have you ever wanted/ been requested to build a classifier without any labeled data? What it takes now is just one line 🤯.

AnyClassifier is a framework that helps you build a classifier without any label, with as limited coding as possible.
As a machine learning engineer, one of the unavoidable but the most heavy lifting issues is to build and curate a high quality labelled data.
By leveraging LLM 🤖 annotation with permissive license, one can now label data at a better quality and at lightning speed ever.
This is inspired by some of the challenges I faced daily in work and doing open source - it is built for machine learning engineer and software engineer, by machine learning engineer 👨🏻‍💻.
By providing a higher level abstraction, this project's mission is to further democratizes AI to everyone, with ONE LINE.
The project is still experimental, but I found it worked in some of my use cases. Feedbacks welcome, and feel free to contribute. See Roadmap.
Together let's build more useful models.

🚀 Features

  • One line to build any classifier that you don't have data 🤯
  • Why one line? Not only it is easy to be used by Human but also it can easily be used by other LLM as a function call, easily to be integrated with any agentic flow
  • Smoothness integration with transformers, setfit, fasttext and datasets
    • setfit: for limited data (e.g. 100) 🤗
    • fastText: for blazingly fast inference (1000 docs/s) without GPU ⚡️
    • transformers: for other usecase
  • Huggingface-like interface for fastText that supports push_to_hub, saving and loading (let's not forget this amazing model before transformers architecture).

🏁 QuickStart in Apple Silicon - Train a model in 5 min!

# install (cp39 = python3.9, other valid values are cp310, cp311, cp312)
wget https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.84-metal/llama_cpp_python-0.2.84-cp39-cp39-macosx_11_0_arm64.whl
pip install llama_cpp_python-0.2.84-cp39-cp39-macosx_11_0_arm64.whl
rm llama_cpp_python-0.2.84-cp39-cp39-macosx_11_0_arm64.whl
pip install anyclassifier
# run
cd examples
python train_setfit_model.py

image

🏁 QuickStart in Colab

Dataset Colab Link
imdb sentiment classification Open In Colab

🔧 Installation

It is using llama.cpp as backend, and build wheel can take a lot of time (10min+), as such, we also provide an instruction to install with pre-built wheel.

Metal Backend (Apple's GPU) - cp39 = python3.9, other valid values are cp310, cp311, cp312

wget https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.84-metal/llama_cpp_python-0.2.84-cp39-cp39-macosx_11_0_arm64.whl
pip install llama_cpp_python-0.2.84-cp39-cp39-macosx_11_0_arm64.whl
rm llama_cpp_python-0.2.84-cp39-cp39-macosx_11_0_arm64.whl
pip install anyclassifier

Colab (T4) Prebuilt Wheel

wget https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.84-cu124/llama_cpp_python-0.2.84-cp310-cp310-linux_x86_64.whl
pip install llama_cpp_python-0.2.84-cp310-cp310-linux_x86_64.whl
rm llama_cpp_python-0.2.84-cp310-cp310-linux_x86_64.whl
pip install anyclassifier

CUDA Backend (Please read llama-cpp-python)

CMAKE_ARGS="-DGGML_METAL=on" pip install anyclassifier

CPU

CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install anyclassifier

Developer's installation

pip install -e .

🛠️ Usage

Download a small LLM (please accept the respective terms and condition of model license beforehand)

meta-llama/Meta-Llama-3.1-8B-Instruct
google/gemma-2-9b

from huggingface_hub import hf_hub_download

# meta-llama/Meta-Llama-3.1-8B-Instruct
hf_hub_download("lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF", "Meta-Llama-3.1-8B-Instruct-Q8_0.gguf")

# google/gemma-2-9b
hf_hub_download("lmstudio-community/gemma-2-9b-it-GGUF", "gemma-2-9b-it-Q8_0.gguf")

One Liner

from huggingface_hub import hf_hub_download
from anyclassifier import build_anyclassifier
from anyclassifier.annotation.prompt import Label

unlabeled_dataset  # a huggingface datasets.Dataset class can be from your local json/ csv, or from huggingface hub.

# Magic One Line!
trainer = build_anyclassifier(
  "Classify a text's sentiment.",
  [
    Label(name='1', desc='positive sentiment'),
    Label(name='0', desc='negative sentiment')
  ],
  hf_hub_download("lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF", "Meta-Llama-3.1-8B-Instruct-Q8_0.gguf"),  # as you like
  unlabeled_dataset,
  column_mapping={"text": "text"},
  model_type="setfit",  # can be set to fastText
  push_dataset_to_hub=True,  # we recommend to push your dataset to huggingface, so that you won't lose it
  dataset_repo_id="user_id/test",
  is_dataset_private=True
)
# Share Your Model!
trainer.push_to_hub("user_id/any_model")

To Use Model

# SetFit
from setfit import SetFitModel

model = SetFitModel.from_pretrained("user_id/any_model")
preds = model.predict(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
print(preds)

# FastText
from anyclassifier.fasttext_wrapper import FastTextForSequenceClassification

model = FastTextForSequenceClassification.from_pretrained("user_id/any_model")
preds = model.predict(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
print(preds)

To Label a Dataset

from datasets import load_dataset
from anyclassifier.annotation.prompt import AnnotationPrompt, Label
from anyclassifier.annotation.annotator import LlamaCppAnnotator

unlabeled_dataset = load_dataset("somepath")
prompt = AnnotationPrompt(
  task_description="Classify a text's sentiment.",
  label_definition=[
    Label(name='1', desc='positive sentiment'),
    Label(name='0', desc='negative sentiment')
  ]
)
annotator = LlamaCppAnnotator(prompt)
label_dataset = annotator.annotate_dataset(unlabeled_dataset, n_record=1000)
label_dataset.push_to_hub('user_id/any_data')

See more examples:

model_type example resulting model dataset
setfit link link link
fasttext link link(probably need more label) link

Test accuracy on imdb with SetFit: 90.42%

🗺️ Roadmap

  • High Quality Data:
    • Prompt validation
    • Label validation - inter-model annotation
  • High Quality Model
    • Auto error analysis
    • Auto model documentation
    • Auto synthetic data

👋 Contributing

  • build models with AnyClassifier
  • create issue/ PR
  • suggest features

📩 Follow Me For Update:

X/ huggingface/ github

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

anyclassifier-0.1.2.tar.gz (17.4 kB view hashes)

Uploaded Source

Built Distribution

anyclassifier-0.1.2-py3-none-any.whl (17.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page