A package for cleaning and curating data with LLMs
Project description
databonsai
Clean & curate your data with LLMs
databonsai is a Python library that uses LLMs to perform data cleaning tasks.
Features
- Categorization / Transformation of data using LLMs
- Decomposition of data into structured formats using LLMs
- Validation of LLM outputs
- Retry logic with exponential backoff for handling rate limits and transient errors
Installation
pip install databonsai
Store your API keys on an .env file in the root of your project.
OPENAI_API_KEY=xxx # if you use OpenAiProvider
ANTHROPIC_API_KEY=xxx # If you use AnthropicProvider
Quickstart
Categorization
Setup the LLM provider and categories (as a dictionary)
from databonsai.categorize import MultiCategorizer, BaseCategorizer
from databonsai.llm_providers import OpenAIProvider, AnthropicProvider
provider = OpenAIProvider() # Or AnthropicProvider()
categories = {
"Weather": "Insights and remarks about weather conditions.",
"Sports": "Observations and comments on sports events.",
"Celebrities": "Celebrity sightings and gossip",
"Others": "Comments do not fit into any of the above categories",
"Anomaly": "Data that does not look like comments or natural language",
}
Categorize your data:
categorizer = BaseCategorizer(
categories=categories,
llm_provider=provider,
)
category = categorizer.categorize("It's been raining outside all day")
print(category)
Output:
Weather
Multiple categories can also be returned. This is useful for tagging data!
tagger = MultiCategorizer(
categories=categories,
llm_provider=provider,
)
tags = tagger.categorize(
"It's been raining outside all day, and I saw Elon Musk. 13rewfdsacw10289u(#!*@)" # Data has anomalies
)
print(tags)
Output:
['Weather', 'Celebrities', 'Anomaly']
Transformation
Prepare the transformer:
pii_remover = BaseTransformer(
prompt="Replace any Personal Identity Identifiers (PII) in the given text with <type of PII>. PII includes any information that can be used to identify an individual, such as names, addresses, phone numbers, email addresses, social security numbers, etc.",
llm_provider=provider,
)
Run the transformation:
print(
pii_remover.transform(
"John Doe, residing at 1234 Maple Street, Anytown, CA, 90210, recently contacted customer support to report an issue. He provided his phone number, (555) 123-4567, and email address, johndoe@email.com, for follow-up communication."
)
)
Output:
<Name>, residing at <Address>, <City>, <State>, <ZIP code>, recently contacted customer support to report an issue. They provided their phone number, <Phone number>, and email address, <Email address>, for follow-up communication.
Dataframes
If you have a pandas dataframe, you can simple run the following.
df['category'] = df['content'].apply(categorizer.categorize)
df['transformed'] = df['content'].apply(language_check.transform)
The library uses exponential backoff to handle rate limiting. Go to docs/llm_providers.md for more information on how to handle rate limiting.
Decomposition
Prepare a decompose transformer with a prompt and output schema.
output_schema = {
"question": "generated question about given information",
"answer": "answer to the question, only using information from the given data",
}
qna = DecomposeTransformer(
prompt="Your goal is to create a set of questions and answers to help a person memorise every single detail of a document.",
output_schema=output_schema,
llm_provider=provider,
)
Here's the text we want to decompose:
text = """ Sky-gazers across North America are in for a treat on April 8 when a total solar eclipse will pass over Mexico, the United States and Canada.
The event will be visible to millions — including 32 million people in the US alone — who live along the route the moon’s shadow will travel during the eclipse, known as the path of totality. For those in the areas experiencing totality, the moon will appear to completely cover the sun. Those along the very center line of the path will see an eclipse that lasts between 3½ and 4 minutes, according to NASA.
The next total solar eclipse won’t be visible across the contiguous United States again until August 2044. (It’s been nearly seven years since the “Great American Eclipse” of 2017.) And an annular eclipse won’t appear across this part of the world again until 2046."""
Decompose the text:
print(qna.transform(text))
Output:
[
{
"question": "When will the total solar eclipse pass over Mexico, the United States, and Canada?",
"answer": "The total solar eclipse will pass over Mexico, the United States, and Canada on April 8.",
},
{
"question": "What is the path of totality?",
"answer": "The path of totality is the route the moon's shadow will travel during the eclipse where the moon will appear to completely cover the sun.",
},
{
"question": "How long will the eclipse last for those along the very center line of the path of totality?",
"answer": "For those along the very center line of the path of totality, the eclipse will last between 3½ and 4 minutes.",
},
{
"question": "When will the next total solar eclipse be visible across the contiguous United States?",
"answer": "The next total solar eclipse visible across the contiguous United States will be in August 2044.",
},
{
"question": "When will an annular eclipse next appear across the contiguous United States?",
"answer": "An annular eclipse won't appear across the contiguous United States again until 2046.",
},
]
View token usage
Token usage is recorded for each provider. Use these to estimate your costs!
print(provder.input_tokens)
print(provder.output_tokens)
Read More:
- Documentation
- Examples (TBD)
Acknowledgements
Bonsai icon from icons8 https://icons8.com/icon/74uBtdDr5yFq/bonsai
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file databonsai-0.2.0.tar.gz
.
File metadata
- Download URL: databonsai-0.2.0.tar.gz
- Upload date:
- Size: 12.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7392c5c29045c2e5f576757e94c92d666ebb204670bb429bf0e0610ae1c09b64 |
|
MD5 | 67b15eeff73d2cad2344b7916daf234a |
|
BLAKE2b-256 | d86d5570badb5bce7f23ec0903ce1dd31a3e2da450db26fc9a87f9a8d3405cfb |
File details
Details for the file databonsai-0.2.0-py3-none-any.whl
.
File metadata
- Download URL: databonsai-0.2.0-py3-none-any.whl
- Upload date:
- Size: 14.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.12.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d791170703b35f340a735e5052ebd2092130c020b6cc0f2b3348490dabb7ed90 |
|
MD5 | 587cf82481d20667a656d2000da880ac |
|
BLAKE2b-256 | 953e2550db058876565234863c4c4d1b85dd4bd723c00fedf403f8392a1896ca |