Library for LLM powered labeling
Project description
Clean, labeled data at the speed of thought.
Quick Install
pip install refuel-autolabel
What is Autolabel
Access to large, clean and diverse labeled datasets is critical to the success of any machine learning effort, but data labeling is a manual and time-consuming process. State-of-the-art LLMs like GPT-4 can label data automatically with high accuracy, at a fraction of the cost and time of manual labeling.
Autolabel is a Python library to label, clean and enrich text datasets with any Large Language Model (LLM) of your choice. A few key features:
- Label data for NLP tasks such as classification, question answering, named entity recognition, entity matching and more.
- Use commercial or open source LLMs from providers such as OpenAI, Anthropic, HuggingFace, Google and more.
- Support for research-proven LLM techniques to boost label quality, such as few-shot learning and chain-of-thought prompting.
- Confidence estimation and explanations out of the box for every output label.
- Caching and state management to minimize costs and experimentation time.
Getting started
Autolabel provides a simple 3-step process for labeling data:
- Specify the labeling guidelines and the LLM to use in a JSON config.
- Dry-run to make sure the final prompt looks good.
- Kick off a labeling run for your dataset!
Let's imagine we are building an ML model to analyze the sentiment of movie reviews. We have a dataset of movie reviews that we'd like to get labeled first. For this case, here's what the example config will look like:
{
    "task_name": "MovieSentimentReview",
    "task_type": "classification",
    "model": {
        "provider": "openai",
        "name": "gpt-3.5-turbo"
    },
    "dataset": {
        "label_column": "label",
        "delimiter": ","
    },
    "prompt": {
        "task_guidelines": "You are an expert at analyzing the sentiment of moview reviews. Your job is to classify the provided movie review into one of the following labels: {labels}",
        "labels": [
            "positive",
            "negative",
            "neutral"
        ],
        "few_shot_examples": [
            {
                "example": "I got a fairly uninspired stupid film about how human industry is bad for nature.",
                "label": "negative"
            },
            {
                "example": "I loved this movie. I found it very heart warming to see Adam West, Burt Ward, Frank Gorshin, and Julie Newmar together again.",
                "label": "positive"
            },
            {
                "example": "This movie will be played next week at the Chinese theater.",
                "label": "neutral"
            }
        ],
        "example_template": "Input: {example}\nOutput: {label}"
    }
}
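Since the config is plain JSON, it can be sanity-checked before it is handed to the agent. The following is a minimal sketch using only the Python standard library. `check_config` is a hypothetical helper, not part of Autolabel, and the checks cover only what the example above shows (the `{labels}` placeholder and few-shot labels that appear in the label list). Note that `json.loads` rejects syntax errors such as a trailing comma after the last list item.

```python
import json


def check_config(config: dict) -> list:
    """Return a list of problems found in a labeling config (empty if none).

    Hypothetical helper for illustration; not part of the Autolabel API.
    """
    problems = []
    prompt = config.get("prompt", {})
    labels = prompt.get("labels", [])
    if not labels:
        problems.append("prompt.labels is empty")
    if "{labels}" not in prompt.get("task_guidelines", ""):
        problems.append("task_guidelines does not reference {labels}")
    # Every few-shot example should use one of the declared labels.
    for i, shot in enumerate(prompt.get("few_shot_examples", [])):
        label = shot.get("label")
        if label not in labels:
            problems.append(f"few_shot_examples[{i}] has unknown label {label!r}")
    return problems


# A trimmed-down version of the config above, inlined so the sketch is self-contained.
config_text = """
{
    "task_name": "MovieSentimentReview",
    "task_type": "classification",
    "prompt": {
        "task_guidelines": "Classify the review into one of: {labels}",
        "labels": ["positive", "negative", "neutral"],
        "few_shot_examples": [
            {"example": "I loved this movie.", "label": "positive"},
            {"example": "A fairly uninspired film.", "label": "negative"}
        ]
    }
}
"""

config = json.loads(config_text)  # raises json.JSONDecodeError on invalid JSON
print(check_config(config))
```

An empty list means none of these checks failed; it does not guarantee that the library will accept the config.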
Initialize the labeling agent and pass it the config:
from autolabel import LabelingAgent
agent = LabelingAgent(config='config.json')
Preview an example prompt that will be sent to the LLM:
agent.plan('examples/movie_reviews/dataset.csv')
This prints:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 0:00:00 0:00:00
┌──────────────────────────┬─────────┐
│ Total Estimated Cost     │ $0.538  │
│ Number of Examples       │ 200     │
│ Average cost per example │ 0.00269 │
└──────────────────────────┴─────────┘
─────────────────────────────────────────
Prompt Example:
You are an expert at analyzing the sentiment of moview reviews. Your job is to classify the provided movie review into one of the following labels: [positive, negative, neutral]
You will return the answer with just one element: "the correct label"
Some examples with their output answers are provided below:
Example: I got a fairly uninspired stupid film about how human industry is bad for nature.
Output:
negative
Example: I loved this movie. I found it very heart warming to see Adam West, Burt Ward, Frank Gorshin, and Julie Newmar together again.
Output:
positive
Example: This movie will be played next week at the Chinese theater.
Output:
neutral
Now I want you to label the following example:
Input: A rare exception to the rule that great literature makes disappointing films.
Output:
─────────────────────────────────────────
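The preview shows how the guidelines, the few-shot examples and the new input are combined into one few-shot prompt. The sketch below illustrates that general technique in plain Python. `build_prompt` and its fixed connecting sentences are assumptions made for illustration; this is not Autolabel's actual prompt-construction code, and its output will not match the preview character for character.

```python
def build_prompt(guidelines, labels, few_shot_examples, example_template, text):
    """Assemble a few-shot classification prompt (illustrative sketch only)."""
    # Fill the {labels} placeholder in the guidelines with the label list.
    parts = [guidelines.format(labels="[" + ", ".join(labels) + "]")]
    parts.append("Some examples with their output answers are provided below:")
    # Render each solved example with the template from the config.
    for shot in few_shot_examples:
        parts.append(
            example_template.format(example=shot["example"], label=shot["label"])
        )
    parts.append("Now I want you to label the following example:")
    # Render the new input with an empty label so the LLM completes it.
    parts.append(example_template.format(example=text, label="").rstrip())
    return "\n".join(parts)


prompt = build_prompt(
    guidelines="Classify the provided movie review into one of the following labels: {labels}",
    labels=["positive", "negative", "neutral"],
    few_shot_examples=[
        {"example": "I loved this movie.", "label": "positive"},
        {"example": "This movie will be played next week.", "label": "neutral"},
    ],
    example_template="Input: {example}\nOutput: {label}",
    text="A rare exception to the rule that great literature makes disappointing films.",
)
print(prompt)
```

Leaving the final `Output:` empty is what turns the solved examples into a pattern for the model to continue.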
Finally, we can run the labeling on a subset or the entirety of the dataset:
labels, output_df, metrics = agent.run('examples/movie_reviews/dataset.csv')
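When the dataset already has ground-truth labels (the `label_column` in the config), the LLM-generated labels can be compared against them. The run returns its own `metrics`, so the following is only a sketch of an independent check in plain Python. The helper names and the two short lists are made up for illustration and are not results from an actual run.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def confusion(gold, predicted):
    """Count (gold, predicted) label pairs to see which labels get mixed up."""
    return Counter(zip(gold, predicted))


# Illustrative data only, not output from a labeling run.
gold = ["positive", "negative", "neutral", "negative"]
predicted = ["positive", "negative", "negative", "negative"]

print(accuracy(gold, predicted))
print(confusion(gold, predicted))
```

The pair counts make it easy to spot systematic errors, such as neutral reviews being labeled negative, which can then inform changes to the guidelines or the few-shot examples.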
Contributing
Autolabel is a rapidly developing project. We welcome contributions in all forms - bug reports, pull requests and ideas for improving the library.
- Join the conversation on Discord
- Review the Roadmap and contribute your ideas.
- Grab an open issue on GitHub, and submit a pull request.
Hashes for refuel_autolabel-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | d33c6635c1caa1151993293127682991501a521dbae84c2c7ad90e767de5c5f8
MD5 | 8986fd5930d94759d992a85feee59eda
BLAKE2b-256 | 1af0ccc36cb8d9df40d1dd366a5aa93ac70904ad341882d65b84f173af980050