
A package for cleaning and curating data with LLMs


databonsai

Clean & curate your data with LLMs

databonsai is a Python library that uses LLMs to perform data cleaning tasks.

Features

  • Suite of tools for data processing using LLMs including categorization, transformation, and extraction
  • Validation of LLM outputs
  • Batch processing for token savings
  • Retry logic with exponential backoff for handling rate limits and transient errors
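The retry behaviour in the last bullet can be pictured as a simple backoff loop. A minimal sketch of the pattern (illustrative only, not databonsai's actual implementation; `with_backoff` and `flaky_call` are hypothetical names):

```python
import random
import time

def with_backoff(fn, max_retries=4, base_delay=1.0):
    """Retry fn with exponential backoff plus jitter; illustrative only."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Sleep base, 2*base, 4*base, ... plus jitter to spread out retries
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

calls = {"n": 0}

def flaky_call():
    """Simulates an API that rate-limits the first two calls."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky_call, base_delay=0.01))  # succeeds on the third attempt
```

In databonsai the equivalent behaviour is built into the LLM providers and configurable there.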

Installation

pip install databonsai

Store your API keys in a .env file at the root of your project, or pass them as arguments when initializing the provider.

OPENAI_API_KEY=xxx # if you use OpenAiProvider
ANTHROPIC_API_KEY=xxx # If you use AnthropicProvider
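If you don't use a loader like python-dotenv, a minimal stdlib sketch of pulling a .env file into the environment could look like this (the parsing is deliberately naive and `load_env` is a hypothetical helper, not part of databonsai):

```python
import os

def load_env(path=".env"):
    """Naive .env parser: KEY=VALUE lines, '#' comments; illustrative only."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault so real environment variables take precedence
            os.environ.setdefault(key.strip(), value.strip())

# Usage: load_env() before constructing OpenAIProvider() / AnthropicProvider()
```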

Quickstart

Categorization

Set up the LLM provider and categories (as a dictionary).

from databonsai.categorize import MultiCategorizer, BaseCategorizer
from databonsai.llm_providers import OpenAIProvider, AnthropicProvider

provider = OpenAIProvider()  # Or AnthropicProvider(). Highly recommend using Haiku, which is the default AnthropicProvider() model, as it is cheap and effective for these tasks
categories = {
    "Weather": "Insights and remarks about weather conditions.",
    "Sports": "Observations and comments on sports events.",
    "Politics": "Political events related to governments, nations, or geopolitical issues.",
    "Celebrities": "Celebrity sightings and gossip.",
    "Others": "Comments that do not fit into any of the above categories.",
    "Anomaly": "Data that does not look like comments or natural language.",
}
few_shot_examples = [
        {"example": "Big stormy skies over city", "response": "Weather"},
        {"example": "The team won the championship", "response": "Sports"},
        {"example": "I saw a famous rapper at the mall", "response": "Celebrities"},
    ]

Categorize your data:

categorizer = BaseCategorizer(
    categories=categories,
    llm_provider=provider,
    examples=few_shot_examples,
    # strict=False,  # Defaults to True; set to False to allow categories not in the provided dict
)
category = categorizer.categorize("It's been raining outside all day")
print(category)

Output:

Weather
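Validation here means checking the model's raw reply against the schema you provided. A minimal sketch of such a check (`validate_category` is a hypothetical helper for illustration, not databonsai's internal code):

```python
def validate_category(response: str, categories: dict) -> str:
    """Accept the reply only if it is exactly one of the allowed category names."""
    cleaned = response.strip()
    if cleaned not in categories:
        raise ValueError(f"Unexpected category: {cleaned!r}")
    return cleaned

allowed = {"Weather": "...", "Sports": "..."}
print(validate_category("Weather\n", allowed))  # -> Weather
```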

Use categorize_batch to categorize a batch. This saves tokens, as the schema and few-shot examples are only sent once. (Works best with stronger models; ideally provide at least three few-shot examples.)

categories = categorizer.categorize_batch([
    "Massive Blizzard Hits the Northeast, Thousands Without Power",
    "Local High School Basketball Team Wins State Championship After Dramatic Final",
    "Celebrated Actor Launches New Environmental Awareness Campaign",
])
print(categories)

Output:

['Weather', 'Sports', 'Celebrities']

AutoBatch for larger datasets

If you have a pandas DataFrame or list, use apply_to_column_autobatch.

  • Batching data for LLM API calls saves tokens by not sending the prompt for every row. However, too large a batch size or too complex a task can lead to errors. Naturally, the better the LLM model, the larger the batch size you can use.

  • Batch sizing is handled adaptively: the batch size increases while responses are valid and shrinks (by a decay factor) when they are not.
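
The adaptive sizing can be pictured as "grow on success, shrink by a decay factor on failure". A toy sketch of that control loop (`next_batch_size` and its numbers are illustrative, not databonsai's exact policy):

```python
def next_batch_size(current, succeeded, max_size=50, decay=0.5, growth=2):
    """Grow the batch on success; shrink it by `decay` on failure, never below 1."""
    if succeeded:
        return min(max_size, current * growth)
    return max(1, int(current * decay))

size = 10
size = next_batch_size(size, succeeded=False)  # batch failed -> shrink to 5
size = next_batch_size(size, succeeded=True)   # batch passed -> grow back to 10
```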

Other features:

  • Progress bar
  • Returns the last successful index so you can resume from there if max_retries is exceeded
  • Modifies your output list in place, so you don't lose any progress

Retry Logic:

  • LLM providers have retry logic built in for API-related errors. This can be configured in the provider.
  • The retry logic in apply_to_column_autobatch handles invalid responses (e.g. an unexpected category, or a different number of outputs than inputs).

from databonsai.utils import apply_to_column_batch, apply_to_column, apply_to_column_autobatch
import pandas as pd

headlines = [
    "Massive Blizzard Hits the Northeast, Thousands Without Power",
    "Local High School Basketball Team Wins State Championship After Dramatic Final",
    "Celebrated Actor Launches New Environmental Awareness Campaign",
    "President Announces Comprehensive Plan to Combat Cybersecurity Threats",
    "Tech Giant Unveils Revolutionary Quantum Computer",
    "Tropical Storm Alina Strengthens to Hurricane as It Approaches the Coast",
    "Olympic Gold Medalist Announces Retirement, Plans Coaching Career",
    "Film Industry Legends Team Up for Blockbuster Biopic",
    "Government Proposes Sweeping Reforms in Public Health Sector",
    "Startup Develops App That Predicts Traffic Patterns Using AI",
]
df = pd.DataFrame(headlines, columns=["Headline"])
df["Category"] = None # Initialize it if it doesn't exist, as we modify it in place
success_idx = apply_to_column_autobatch(
    df["Headline"], df["Category"], categorizer.categorize_batch, batch_size=3, start_idx=0
)

There are many more options available for autobatch, such as max_retries, the decay factor, and more. Check the Utils docs for details.

If it fails midway (even after exponential backoff), you can resume from the last successful index + 1.

success_idx = apply_to_column_autobatch(
    df["Headline"], df["Category"], categorizer.categorize_batch, batch_size=10, start_idx=success_idx + 1
)

This also works for regular python lists.

Note that the better the LLM model, the larger the batch_size you can use (depending on the length of your inputs). If you're getting errors, reduce the batch_size or use a better LLM model.

To use it with batching, but with a fixed batch size:

success_idx = apply_to_column_batch(
    df["Headline"], df["Category"], categorizer.categorize_batch, batch_size=3, start_idx=0
)

To use it without batching:

success_idx = apply_to_column(df["Headline"], df["Category"], categorizer.categorize)

View System Prompt

print(categorizer.system_message)
print(categorizer.system_message_batch)

View token usage

Token usage is recorded for OpenAI and Anthropic providers. Use these counters to estimate your costs!

print(provider.input_tokens)
print(provider.output_tokens)
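
Those counters can be turned into a rough dollar estimate. A sketch with placeholder per-million-token prices (`estimate_cost` and the rates are illustrative; check your provider's current pricing):

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_m=0.25, output_price_per_m=1.25):
    """USD cost given token counts and per-million-token prices (placeholder rates)."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# e.g. estimate_cost(provider.input_tokens, provider.output_tokens)
print(estimate_cost(120_000, 8_000))
```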

Docs

Tools (Check out the docs for usage examples and details)

LLM Providers

Examples

Acknowledgements

Bonsai icon from icons8 https://icons8.com/icon/74uBtdDr5yFq/bonsai
