A package for cleaning and curating data with LLMs

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language

Project description

databonsai

Clean & curate your data with LLMs

databonsai is a Python library that uses LLMs to perform data cleaning tasks.

Features

Suite of tools for data processing using LLMs including categorization, transformation, and decomposition
Validation of LLM outputs
Batch processing for token savings
Retry logic with exponential backoff for handling rate limits and transient errors
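
As a rough illustration of the last feature, retrying with exponential backoff generally looks like the sketch below. This is not databonsai's implementation; the helper name `call_with_backoff` and its parameters (`max_retries`, `base_delay`, `max_delay`, `sleep`) are assumptions made up for this example.

```python
import random
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponentially growing waits.

    Hypothetical helper for illustration only; not part of databonsai.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # The delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay,
            # plus up to 10% random jitter so parallel clients do not retry in lockstep.
            delay = min(base_delay * 2 ** attempt, max_delay)
            sleep(delay + random.uniform(0, 0.1 * delay))
```

The `sleep` argument exists only so the waiting can be swapped out, for example in tests.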

Installation

pip install databonsai

Store your API keys in a .env file in the root of your project, or pass the key as an argument when initializing the provider.

OPENAI_API_KEY=xxx # if you use OpenAIProvider
ANTHROPIC_API_KEY=xxx # if you use AnthropicProvider
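
How the library reads this file is not stated here. As a minimal sketch of what loading a .env file involves, the standard-library-only helper below parses `KEY=value` lines into the process environment. The name `load_env_file` is an assumption for illustration; in practice a dedicated dotenv loader is the more usual choice.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=value lines from a .env-style file into os.environ.

    Hypothetical helper for illustration; returns the parsed pairs.
    Simplification: everything after a '#' is treated as a comment,
    so values that contain '#' are not supported.
    """
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop inline comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        loaded[key.strip()] = value.strip()
    # Variables already set in the real environment take precedence.
    for key, value in loaded.items():
        os.environ.setdefault(key, value)
    return loaded
```

After calling `load_env_file()`, `os.environ.get("OPENAI_API_KEY")` returns the stored key if the file defined it.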

Quickstart

Categorization

Set up the LLM provider and categories (as a dictionary):

from databonsai.categorize import MultiCategorizer, BaseCategorizer
from databonsai.llm_providers import OpenAIProvider, AnthropicProvider

provider = OpenAIProvider()  # Or AnthropicProvider(). Works best with gpt-4-turbo or any claude model
categories = {
    "Weather": "Insights and remarks about weather conditions.",
    "Sports": "Observations and comments on sports events.",
    "Politics": "Political events related to governments, nations, or geopolitical issues.",
    "Celebrities": "Celebrity sightings and gossip",
    "Others": "Comments that do not fit into any of the above categories",
    "Anomaly": "Data that does not look like comments or natural language",
}
few_shot_examples = [
    {"example": "Big stormy skies over city", "response": "Weather"},
    {"example": "The team won the championship", "response": "Sports"},
    {"example": "I saw a famous rapper at the mall", "response": "Celebrities"},
]  # no trailing comma after the bracket, or this becomes a one-element tuple

Categorize your data:

categorizer = BaseCategorizer(
    categories=categories,
    llm_provider=provider,
    examples=few_shot_examples,
)
category = categorizer.categorize("It's been raining outside all day")
print(category)

Output:

Weather
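
The feature list mentions validation of LLM outputs. For categorization, one plausible form of that check is confirming that the model's reply is one of the allowed category names. The sketch below is an assumption about what such a check could look like, not the library's code; `validate_category` is a hypothetical name.

```python
def validate_category(raw_response, categories):
    """Return the matching category key, or raise ValueError if none matches.

    Hypothetical helper for illustration; not part of databonsai.
    """
    # Trim whitespace, then stray quotes or a trailing period around the label.
    cleaned = raw_response.strip().strip("\"'.")
    # Compare case-insensitively so "weather" still maps to "Weather".
    by_lower = {name.lower(): name for name in categories}
    try:
        return by_lower[cleaned.lower()]
    except KeyError:
        raise ValueError(f"{raw_response!r} is not one of {sorted(categories)}") from None
```

A rejected reply can then be retried instead of being written into your dataset.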

Use categorize_batch to categorize a batch. This saves tokens because the schema and few-shot examples are sent only once per batch. (Batching works best with more capable models. Ideally, provide at least 3 few-shot examples.)

categories = categorizer.categorize_batch([
    "Massive Blizzard Hits the Northeast, Thousands Without Power",
    "Local High School Basketball Team Wins State Championship After Dramatic Final",
    "Celebrated Actor Launches New Environmental Awareness Campaign",
])
print(categories)

Output:

['Weather', 'Sports', 'Celebrities']
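
The token saving from batching can be illustrated with rough arithmetic: the schema and few-shot prefix is paid once per request, while each input is paid once regardless. The function name and the token sizes below are made-up placeholders, not measurements from databonsai.

```python
def estimate_prompt_tokens(schema_tokens, tokens_per_item, n_items, batch_size=1):
    """Estimate total input tokens when the schema is resent once per request."""
    n_requests = -(-n_items // batch_size)  # ceiling division
    return n_requests * schema_tokens + n_items * tokens_per_item


# Hypothetical sizes: a 300-token schema/few-shot prefix and 20-token inputs.
one_by_one = estimate_prompt_tokens(300, 20, n_items=9, batch_size=1)
batched = estimate_prompt_tokens(300, 20, n_items=9, batch_size=3)
print(one_by_one, batched)  # 2880 1080
```

Under these assumed sizes, batches of three cut input tokens by more than half; the real saving depends on how long your schema and inputs are.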

Dataframes & Lists

If you have a pandas dataframe or list, use apply_to_column_batch for some handy features:

Batching saves tokens by not resending the schema each time.
Shows a progress bar.
Returns the last successful index so you can resume from there if an error occurs (llm_provider already implements exponential backoff, but just in case).
Modifies your output list in place, so you don't lose any progress.

Call the function like this:

success_idx = apply_to_column_batch(input_column, output_column, function, batch_size, start_idx)

Parameters:

input_column: The column (a pandas Series such as df["Headline"], or a list) from which data will be read.
output_column: The column (a pandas Series or list) to which results will be written in place.
function: The function to apply to each batch of data.
batch_size: The number of rows in each batch.
start_idx: The starting index from which to begin processing.

Returns:

success_idx: The index of the last successful row processed.

(Continuing with the categorizer defined in the Quickstart above)

from databonsai.utils import apply_to_column_batch, apply_to_column
import pandas as pd

headlines = [
    "Massive Blizzard Hits the Northeast, Thousands Without Power",
    "Local High School Basketball Team Wins State Championship After Dramatic Final",
    "Celebrated Actor Launches New Environmental Awareness Campaign",
    "President Announces Comprehensive Plan to Combat Cybersecurity Threats",
    "Tech Giant Unveils Revolutionary Quantum Computer",
    "Tropical Storm Alina Strengthens to Hurricane as It Approaches the Coast",
    "Olympic Gold Medalist Announces Retirement, Plans Coaching Career",
    "Film Industry Legends Team Up for Blockbuster Biopic",
    "Government Proposes Sweeping Reforms in Public Health Sector",
    "Startup Develops App That Predicts Traffic Patterns Using AI",
]
df = pd.DataFrame(headlines, columns=["Headline"])
df["Category"] = None # Initialize it if it doesn't exist, as we modify it in place
success_idx = apply_to_column_batch(
    df["Headline"],
    df["Category"],
    categorizer.categorize_batch,
    batch_size=3,
    start_idx=0,
)

By default, exponential backoff is used to handle rate limiting. This is handled in the LLM providers and can be configured.

If it fails midway (even after exponential backoff), you can resume from the last successful index + 1.

success_idx = apply_to_column_batch(
    df["Headline"],
    df["Category"],
    categorizer.categorize_batch,
    batch_size=10,
    start_idx=success_idx + 1,
)

This also works for regular Python lists.

Note that the more capable the model, the larger the batch_size you can use (depending on the length of your inputs). If you're getting errors, reduce the batch_size or use a more capable model.
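
To make the resume behaviour concrete, here is a minimal stand-in that mimics the described contract (write results in place, return the last successful index) using plain lists. It is a sketch under assumptions, not the library's implementation: `apply_in_batches` and `make_flaky_upper` are hypothetical names, and how the real helper reports a failure may differ.

```python
def apply_in_batches(input_column, output_column, function, batch_size=5, start_idx=0):
    """Apply `function` to successive slices, writing results in place.

    Returns the index of the last item written, or start_idx - 1 if none were.
    Hypothetical stand-in for illustration; not databonsai's code.
    """
    last_success = start_idx - 1
    for i in range(start_idx, len(input_column), batch_size):
        batch = list(input_column[i:i + batch_size])
        try:
            results = function(batch)
        except Exception as exc:
            print(f"Stopped at index {i}: {exc}")
            break  # everything written so far is kept
        for offset, result in enumerate(results):
            output_column[i + offset] = result
        last_success = i + len(results) - 1
    return last_success


def make_flaky_upper(fail_on_call):
    """Build a batch function that raises on its Nth call (0 means never)."""
    calls = {"n": 0}

    def upper_batch(batch):
        calls["n"] += 1
        if calls["n"] == fail_on_call:
            raise RuntimeError("simulated rate limit")
        return [text.upper() for text in batch]

    return upper_batch


inputs = ["a", "b", "c", "d", "e"]
outputs = [None] * len(inputs)

# The second batch fails, so only the first two items are filled in.
idx = apply_in_batches(inputs, outputs, make_flaky_upper(fail_on_call=2), batch_size=2)
first_pass = list(outputs)

# Resume from the item after the last success.
idx_after_resume = apply_in_batches(
    inputs, outputs, make_flaky_upper(fail_on_call=0), batch_size=2, start_idx=idx + 1
)
```

Because the output list is modified in place, the work done before the failure is preserved and the second call only processes the remaining items.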

To apply a function row by row without batching, use apply_to_column:

success_idx = apply_to_column(df["Headline"], df["Category"], categorizer.categorize)

View System Prompt

print(categorizer.system_message)
print(categorizer.system_message_batch)

View token usage

Token usage is recorded for each provider. Use these to estimate your costs!

print(provider.input_tokens)
print(provider.output_tokens)
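
Turning those counters into a cost estimate is simple arithmetic. The sketch below assumes per-million-token pricing; the function name, token counts, and prices are placeholders, so substitute the values from your provider object and the current prices published for your model.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_million, output_price_per_million):
    """Return the estimated cost in the same currency as the prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000


# Placeholder numbers: in real use, pass provider.input_tokens and
# provider.output_tokens together with your model's actual prices.
cost = estimate_cost(
    120_000, 8_000, input_price_per_million=10.0, output_price_per_million=30.0
)
print(f"Estimated cost: {cost:.2f}")  # Estimated cost: 1.44
```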

Docs

Tools (Check out the docs for usage examples and details)

BaseCategorizer - categorize data into a category
MultiCategorizer - categorize data into multiple categories
BaseTransformer - transform data with a prompt
DecomposeTransformer - decompose data into a structured format based on a schema

LLM Providers

OpenAIProvider - OpenAI
AnthropicProvider - Anthropic
CustomProvider (TBD)

Examples (TBD)

Acknowledgements

Bonsai icon from icons8 https://icons8.com/icon/74uBtdDr5yFq/bonsai

Release history

- 0.8.0: Jun 26, 2024
- 0.7.0: Apr 27, 2024
- 0.6.1: Apr 24, 2024
- 0.6.0: Apr 22, 2024
- 0.5.0: Apr 18, 2024
- 0.4.1: Apr 17, 2024 (this version)
- 0.3.0: Apr 14, 2024
- 0.2.0: Apr 7, 2024
- 0.1.0: Apr 7, 2024

Download files

Source Distribution

databonsai-0.4.1.tar.gz (18.1 kB)

Uploaded Apr 17, 2024 Source

Built Distribution

databonsai-0.4.1-py3-none-any.whl (21.8 kB)

Uploaded Apr 17, 2024 Python 3

File details

Details for the file databonsai-0.4.1.tar.gz.

File metadata

Download URL: databonsai-0.4.1.tar.gz
Upload date: Apr 17, 2024
Size: 18.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for databonsai-0.4.1.tar.gz
Algorithm	Hash digest
SHA256	`845fe32b33f7a3fdaab04e5c93e6239fdc909265034ac446639915ca6b6e2a73`
MD5	`03960125bbfc2e1fdc043c8fc4807ffb`
BLAKE2b-256	`4c4af48e393733450c465787fe5e48899cf517cef01c1944923b22243cea4048`

File details

Details for the file databonsai-0.4.1-py3-none-any.whl.

File metadata

Download URL: databonsai-0.4.1-py3-none-any.whl
Upload date: Apr 17, 2024
Size: 21.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.0.0 CPython/3.12.2

File hashes

Hashes for databonsai-0.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b416834a474569cb96d24eebf4e5513e3574b00af63d87b2e027e06b400ec90d`
MD5	`342e0c1f3477c1d08a74c148d2d0cd2b`
BLAKE2b-256	`e73706077b5acaaf033bbd1f4a832be9e676f281aef7a5cfa920be7915779e6e`