Your backend for LLM powered Big Data Apps
🎵 Datatune
Perform transformations on your data with natural language using LLMs
Installation
pip install datatune
From source:
pip install -e .
Quick Start
import os
import dask.dataframe as dd
import datatune as dt
from datatune.llm.llm import OpenAI
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
# Set tokens-per-minute and requests-per-minute limits
llm = OpenAI(model_name="gpt-3.5-turbo", tpm=200000, rpm=50)
# Load data from your source with Dask
df = dd.read_csv("tests/test_data/products.csv")
print(df.head())
# Transform data with Map
mapped = dt.Map(
    prompt="Extract categories from the description and name of product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"],  # Relevant input fields (optional)
)(llm, df)
# Filter data based on criteria
filtered = dt.Filter(
    prompt="Keep only electronics products",
    input_fields=["Name"],  # Relevant input fields (optional)
)(llm, mapped)
# Use `finalize` to get the final dataframe, with internal metadata and the rows deleted by the operations above cleaned up.
result = dt.finalize(filtered)
result.compute().to_csv("electronics_products.csv")
new_df = dd.read_csv("electronics_products.csv")
print(new_df.head())
products.csv
ProductID Name Price Quantity Description SKU
0 1001 Wireless Mouse 25.99 150 Ergonomic wireless mouse with 2.4GHz connectivity WM-1001
1 1002 Office Chair 89.99 75 Comfortable swivel office chair with lumbar su... OC-2002
2 1003 Coffee Mug 9.49 300 Ceramic mug, 12oz, microwave safe CM-3003
3 1004 LED Monitor 24" 149.99 60 24-inch Full HD LED monitor with HDMI and VGA ... LM-2404
4 1005 Notebook Pack 6.99 500 Pack of 3 ruled notebooks, 100 pages each NP-5005
electronics_products.csv
Unnamed: 0 ProductID Name ... SKU Category Subcategory
0 0 1001 Wireless Mouse ... WM-1001 Electronics Computer Accessories
1 3 1004 LED Monitor 24" ... LM-2404 Electronics Monitors
2 6 1007 USB-C Cable 1m ... UC-7007 Electronics Cables
3 8 1009 Bluetooth Speaker ... BS-9009 Electronics Audio
If you don't set rpm or tpm, Datatune automatically looks up default limits for your model in our model_rate_limits. If the model is not in that lookup dictionary, rpm and tpm default to the gpt-3.5-turbo limits.
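The fallback described above can be sketched in a few lines. This is only an illustration, not Datatune's implementation: the table contents, the numeric limits, the `example-model` entry, and the `resolve_limits` helper are all hypothetical, and the sketch assumes each limit falls back independently when only one of them is set.

```python
# Hypothetical table and values, for illustration only. The real contents of
# Datatune's model_rate_limits are not shown in this README.
EXAMPLE_RATE_LIMITS = {
    "gpt-3.5-turbo": {"tpm": 200_000, "rpm": 500},
    "example-model": {"tpm": 30_000, "rpm": 60},
}
FALLBACK_MODEL = "gpt-3.5-turbo"


def resolve_limits(model_name, tpm=None, rpm=None, table=EXAMPLE_RATE_LIMITS):
    """Return (tpm, rpm), preferring explicit values over table defaults."""
    # Unknown models fall back to the gpt-3.5-turbo entry.
    defaults = table.get(model_name, table[FALLBACK_MODEL])
    resolved_tpm = tpm if tpm is not None else defaults["tpm"]
    resolved_rpm = rpm if rpm is not None else defaults["rpm"]
    return resolved_tpm, resolved_rpm


known = resolve_limits("example-model")
unknown = resolve_limits("some-unlisted-model")
partial = resolve_limits("example-model", rpm=10)
```

Explicit arguments always win in this sketch, so passing tpm and rpm yourself remains the most predictable option when your account limits differ from the defaults.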
Passing input_fields sends only the relevant columns to the LLM API, which reduces the number of tokens sent and therefore the cost.
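To see why restricting the input columns helps, the sketch below serializes the first row of products.csv with and without a column selection and compares the text lengths. How Datatune actually serializes rows is not stated here, so the JSON encoding and the `serialize_row` helper are assumptions, and character count is used only as a rough proxy for token count.

```python
import json


def serialize_row(row, input_fields=None):
    """Serialize a row for a prompt, optionally keeping only selected fields."""
    if input_fields is not None:
        row = {key: row[key] for key in input_fields}
    return json.dumps(row)


# First row of products.csv from the Quick Start example.
row = {
    "ProductID": 1001,
    "Name": "Wireless Mouse",
    "Price": 25.99,
    "Quantity": 150,
    "Description": "Ergonomic wireless mouse with 2.4GHz connectivity",
    "SKU": "WM-1001",
}

full_text = serialize_row(row)
trimmed_text = serialize_row(row, input_fields=["Description", "Name"])

# The trimmed payload is shorter, so fewer tokens would be sent per row.
print(len(full_text), len(trimmed_text))
```

The saving applies to every row, so it grows with the size of the dataset and with the number of columns left out.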
Features
Map Operation
Transform data with natural language:
customers = dd.read_csv("customers.csv")
mapped = dt.Map(
    prompt="Extract country and city from the address field",
    output_fields=["country", "city"],
)(llm, customers)
Filter Operation
# Filter to remove rows
filtered = dt.Filter(
    prompt="Keep only customers who are from Asia"
)(llm, mapped)
Multiple LLM Support
Datatune works with various LLM providers, using LiteLLM under the hood:
# Using Ollama
from datatune.llm.llm import Ollama
llm = Ollama()
# Using Azure
from datatune.llm.llm import Azure
llm = Azure(
    model_name="gpt-3.5-turbo",
    api_key=api_key,
    api_base=api_base,
    api_version=api_version,
)
# OpenAI
from datatune.llm.llm import OpenAI
llm = OpenAI(model_name="gpt-3.5-turbo")
Agents
Datatune provides an agentic framework that lets you deploy agents which generate and execute Python scripts built from Datatune operations.
import datatune as dt
from datatune.llm.llm import OpenAI
llm = OpenAI(model_name="gpt-3.5-turbo", tpm=200000)
# Initialize an agent by providing an LLM
agent = dt.Agent(llm)
prompt = "your prompt for data transformation"
# Transform your dask DataFrame
df = agent.do(prompt, df)
- This allows for intelligent operation selection based on the given prompt
Data Compatibility
Datatune leverages Dask DataFrames to enable scalable processing across large datasets. This approach allows you to:
- Process data larger than the context length of an LLM
- Execute parallel computations efficiently
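The first point rests on a general idea: a dataset too large for one prompt can be split into pieces that each fit. The sketch below shows that idea with a greedy batcher over already-serialized rows. It is not Datatune's batching logic, which is not described in this README; the `batch_rows` helper and the character budget (standing in for a token budget) are assumptions made for illustration.

```python
def batch_rows(rows, max_chars):
    """Greedily group serialized rows so each batch stays within max_chars.

    A single row longer than max_chars is placed in a batch by itself.
    """
    batches, current, current_len = [], [], 0
    for row in rows:
        # Start a new batch when adding this row would exceed the budget.
        if current and current_len + len(row) > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(row)
        current_len += len(row)
    if current:
        batches.append(current)
    return batches


rows = ["Wireless Mouse", "Office Chair", "Coffee Mug", "Notebook Pack"]
batches = batch_rows(rows, max_chars=30)
```

Because each batch is independent of the others, batches of this kind can also be processed in parallel, which is where Dask partitions come in.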
If you're working with pandas DataFrames, convert them to Dask DataFrames first:
import dask.dataframe as dd
dask_df = dd.from_pandas(pandas_df, npartitions=4) # adjust partitions based on your data size
Examples
Check out examples
Documentation
Check out our documentation to learn how to use datatune.
Issues
Want to raise an issue or want us to build a new feature? Head over to issues and raise a ticket!
You can also mail us at hello@vitalops.ai
License
MIT License