
A package for cleaning and curating data with LLMs


databonsai


Clean & curate your data with LLMs

databonsai is a Python library that uses LLMs to perform data cleaning tasks.

Features

  • Suite of tools for data processing using LLMs including categorization, transformation, and decomposition
  • Validation of LLM outputs
  • Batch processing for token savings
  • Retry logic with exponential backoff for handling rate limits and transient errors
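The retry behavior lives inside the LLM providers. As a rough illustration of the idea (a sketch of the concept, not databonsai's actual implementation), an exponential-backoff wrapper looks like this:

```python
import random
import time


def with_backoff(func, max_retries=5, base_delay=1.0):
    """Wrap func so failed calls are retried with exponentially growing delays."""

    def wrapper(*args, **kwargs):
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_retries - 1:
                    raise  # out of retries: surface the error
                # Sleep base_delay, 2*base_delay, 4*base_delay, ... plus jitter.
                time.sleep(base_delay * 2**attempt + random.uniform(0, base_delay))

    return wrapper
```

Transient rate-limit errors usually clear within a few doublings of the delay, which is why this pattern is the standard default for LLM API clients.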

Installation

pip install databonsai

Store your API keys in a .env file at the root of your project, or pass them as arguments when initializing a provider.

OPENAI_API_KEY=xxx # if you use OpenAIProvider
ANTHROPIC_API_KEY=xxx # if you use AnthropicProvider
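The providers read these variables from the environment (if you keep them in a .env file, a loader such as python-dotenv typically puts them there). A minimal stdlib-only sketch of checking for a key yourself, using the same variable name as above:

```python
import os


def get_api_key(name="OPENAI_API_KEY"):
    """Fetch an API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env or environment")
    return key
```

Failing early with a clear message beats the opaque authentication error you would otherwise get on the first API call.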

Quickstart

Categorization

Set up the LLM provider and categories (as a dictionary):

from databonsai.categorize import MultiCategorizer, BaseCategorizer
from databonsai.llm_providers import OpenAIProvider, AnthropicProvider

provider = OpenAIProvider()  # Or AnthropicProvider()
categories = {
    "Weather": "Insights and remarks about weather conditions.",
    "Sports": "Observations and comments on sports events.",
    "Celebrities": "Celebrity sightings and gossip.",
    "Others": "Comments that do not fit into any of the above categories.",
    "Anomaly": "Data that does not look like comments or natural language.",
}

Categorize your data:

categorizer = BaseCategorizer(
    categories=categories,
    llm_provider=provider,
)
category = categorizer.categorize("It's been raining outside all day")
print(category)

Output:

Weather

Dataframes & Lists (Save tokens with batching!)

If you have a pandas DataFrame or a list, use apply_to_column_batch for some handy features:

  • batching saves tokens by not resending the schema each time
  • progress bar
  • returns the last successful index so you can resume from there in case of an error (the LLM provider already implements exponential backoff, but just in case)
  • modifies your output list in place, so you don't lose any progress
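As a toy illustration of that contract (a sketch, not databonsai's implementation): the output list is filled in place batch by batch, and the index of the last successfully processed row is returned so a later call can resume from the row after it.

```python
def apply_batches(inputs, outputs, batch_fn, batch_size=2, start_idx=0):
    """Apply batch_fn to slices of inputs, writing results into outputs in place.

    Returns the index of the last successfully processed row, so callers can
    resume from success_idx + 1 after a failure.
    """
    success_idx = start_idx - 1
    for i in range(start_idx, len(inputs), batch_size):
        batch = inputs[i : i + batch_size]
        try:
            results = batch_fn(batch)  # one LLM call per batch saves tokens
        except Exception:
            return success_idx  # progress so far is already in `outputs`
        outputs[i : i + len(results)] = results
        success_idx = i + len(results) - 1
    return success_idx
```

Because `outputs` is mutated as each batch completes, a mid-run failure loses at most one batch of work.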

Use the method as such:

success_idx = apply_to_column_batch(input_column, output_column, function, batch_size, start_idx)

Parameters:

  • input_column: The column (or list) from which data is read.
  • output_column: The column (or list) to which results are written.
  • function: The function to apply to each batch of data.
  • batch_size: The number of rows in each batch.
  • start_idx: The starting index from which to begin processing.

Returns:

  • success_idx: The index of the last successfully processed row.

from databonsai.utils import apply_to_column_batch, apply_to_column

df["Category"] = None  # Initialize the column if it doesn't exist, since it is modified in place
success_idx = apply_to_column_batch(df["Headline"], df["Category"], categorizer.categorize_batch, batch_size=10)

By default, exponential backoff is used to handle rate limiting. This is handled in the LLM providers and can be configured.

If it fails midway (even after exponential backoff), you can resume from the last successful index + 1.

success_idx = apply_to_column_batch(df["Headline"], df["Category"], categorizer.categorize_batch, batch_size=10, start_idx=success_idx + 1)

This also works for regular Python lists.

Note that the better the LLM model, the larger the batch_size you can use (depending on the length of your inputs). If you're getting errors, reduce the batch_size or use a better LLM model.

To use it without batching:

success_idx = apply_to_column(df["Headline"], df["Category"], categorizer.categorize)

View token usage

Token usage is recorded for each provider. Use these to estimate your costs!

print(provider.input_tokens)
print(provider.output_tokens)
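Those counters plug straight into a back-of-the-envelope cost estimate. The per-token rates below are placeholders, not real prices; check your provider's current pricing page:

```python
def estimate_cost(input_tokens, output_tokens,
                  usd_per_1k_input=0.01, usd_per_1k_output=0.03):
    """Rough USD spend estimate; the default rates are illustrative only."""
    return (input_tokens / 1000 * usd_per_1k_input
            + output_tokens / 1000 * usd_per_1k_output)
```

Output tokens are typically billed at a higher rate than input tokens, which is one reason batching (fewer repeated schema tokens) pays off.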

Docs

Tools (Check out the docs for usage examples and details)

LLM Providers

Examples (TBD)

Acknowledgements

Bonsai icon from icons8 https://icons8.com/icon/74uBtdDr5yFq/bonsai

