
Your backend for LLM-powered Big Data apps


🎵 Datatune


Perform transformations on your data with natural language using LLMs

Installation

pip install datatune

From source:

pip install -e .

Quick Start

import os
import dask.dataframe as dd

import datatune as dt
from datatune.llm.llm import OpenAI

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

# Set tokens-per-minute (tpm) and requests-per-minute (rpm) limits
llm = OpenAI(model_name="gpt-3.5-turbo", tpm=200000, rpm=50)

# Load data from your source with Dask
df = dd.read_csv("tests/test_data/products.csv")
print(df.head())

# Transform data with Map
mapped = dt.Map(
    prompt="Extract categories from the description and name of product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]  # Relevant input fields (optional)
)(llm, df)

# Filter data based on criteria
filtered = dt.Filter(
    prompt="Keep only electronics products",
    input_fields=["Name"]  # Relevant input fields (optional)
)(llm, mapped)

# Use `finalize` to get the final dataframe, with internal metadata and rows deleted by the operations removed.
result = dt.finalize(filtered)
result.compute().to_csv("electronics_products.csv")

new_df = dd.read_csv("electronics_products.csv")
print(new_df.head())

products.csv

   ProductID             Name   Price  Quantity                                        Description      SKU
0       1001   Wireless Mouse   25.99       150  Ergonomic wireless mouse with 2.4GHz connectivity  WM-1001
1       1002     Office Chair   89.99        75  Comfortable swivel office chair with lumbar su...  OC-2002
2       1003       Coffee Mug    9.49       300                  Ceramic mug, 12oz, microwave safe  CM-3003
3       1004  LED Monitor 24"  149.99        60  24-inch Full HD LED monitor with HDMI and VGA ...  LM-2404
4       1005    Notebook Pack    6.99       500          Pack of 3 ruled notebooks, 100 pages each  NP-5005

electronics_products.csv

   Unnamed: 0  ProductID               Name  ...      SKU     Category           Subcategory
0           0       1001     Wireless Mouse  ...  WM-1001  Electronics  Computer Accessories
1           3       1004    LED Monitor 24"  ...  LM-2404  Electronics              Monitors
2           6       1007     USB-C Cable 1m  ...  UC-7007  Electronics                Cables
3           8       1009  Bluetooth Speaker  ...  BS-9009  Electronics                 Audio

If you don’t set rpm or tpm, Datatune automatically looks up the default limits for your model in its model_rate_limits dictionary. If the model is not in that dictionary, rpm and tpm default to the gpt-3.5-turbo limits.
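The fallback order described above can be sketched in a few lines. This is only an illustration under stated assumptions: `resolve_limits`, `EXAMPLE_RATE_LIMITS`, and every number in the table are hypothetical placeholders, not Datatune's actual function, dictionary, or limits.

```python
from typing import Optional

FALLBACK_MODEL = "gpt-3.5-turbo"

# Placeholder numbers for illustration only; NOT Datatune's real lookup table.
EXAMPLE_RATE_LIMITS = {
    "gpt-3.5-turbo": {"tpm": 200_000, "rpm": 50},
    "example-large-model": {"tpm": 30_000, "rpm": 10},  # hypothetical model name
}


def resolve_limits(model_name: str,
                   tpm: Optional[int] = None,
                   rpm: Optional[int] = None) -> dict:
    """Explicit values win; otherwise use the model's defaults;
    unknown models fall back to the gpt-3.5-turbo defaults."""
    defaults = EXAMPLE_RATE_LIMITS.get(model_name, EXAMPLE_RATE_LIMITS[FALLBACK_MODEL])
    return {
        "tpm": defaults["tpm"] if tpm is None else tpm,
        "rpm": defaults["rpm"] if rpm is None else rpm,
    }


print(resolve_limits("example-large-model"))   # known model: its own defaults
print(resolve_limits("unknown-model"))         # unknown model: fallback defaults
print(resolve_limits("unknown-model", rpm=5))  # explicit rpm overrides the default
```

Setting tpm and rpm explicitly, as in the Quick Start, bypasses the lookup.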

Passing input_fields sends only the relevant columns to the LLM API, which reduces the number of tokens sent and therefore the cost.
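To get a rough feel for the saving, the sketch below compares the size of two rows from products.csv when serialized in full and when restricted to the Name column. It assumes rows are serialized as JSON and uses character count as a crude stand-in for tokens. How Datatune actually serializes rows is not shown here, and `payload_chars` is a hypothetical helper.

```python
import json

# Two rows taken from the products.csv sample above.
rows = [
    {"ProductID": 1001, "Name": "Wireless Mouse", "Price": 25.99, "Quantity": 150,
     "Description": "Ergonomic wireless mouse with 2.4GHz connectivity", "SKU": "WM-1001"},
    {"ProductID": 1003, "Name": "Coffee Mug", "Price": 9.49, "Quantity": 300,
     "Description": "Ceramic mug, 12oz, microwave safe", "SKU": "CM-3003"},
]


def payload_chars(rows, input_fields=None):
    """Total characters needed to send the rows, optionally limited to some columns."""
    total = 0
    for row in rows:
        selected = row if input_fields is None else {k: row[k] for k in input_fields}
        total += len(json.dumps(selected))
    return total


full = payload_chars(rows)
trimmed = payload_chars(rows, input_fields=["Name"])
print(f"all columns: {full} chars, Name only: {trimmed} chars")
```

The reduction grows with the number and width of the columns left out, so it matters most for wide tables with long text fields.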

Features

Map Operation

Transform data with natural language:

customers = dd.read_csv("customers.csv")
mapped = dt.Map(
    prompt="Extract country and city from the address field",
    output_fields=["country", "city"]
)(llm, customers)

Filter Operation

# Filter to remove rows
filtered = dt.Filter(
    prompt="Keep only customers who are from Asia"
)(llm, mapped)

Multiple LLM Support

Datatune works with various LLM providers, using LiteLLM under the hood:

# Using Ollama
from datatune.llm.llm import Ollama
llm = Ollama()

# Using Azure
from datatune.llm.llm import Azure
llm = Azure(
    model_name="gpt-3.5-turbo",
    api_key=api_key,
    api_base=api_base,
    api_version=api_version)

# OpenAI
from datatune.llm.llm import OpenAI
llm = OpenAI(model_name="gpt-3.5-turbo")

Agents

Datatune provides an agentic framework that lets you deploy agents that generate and execute Python scripts built from Datatune operations.

import datatune as dt
from datatune.llm.llm import OpenAI

llm = OpenAI(model_name="gpt-3.5-turbo", tpm=200000)

# Initialize an agent by providing an LLM
agent = dt.Agent(llm)
prompt = "your prompt for data transformation"

# Transform your Dask DataFrame (`df` is an existing Dask DataFrame)
df = agent.do(prompt, df)

  • The agent chooses which operations to run based on the given prompt.

Data Compatibility

Datatune leverages Dask DataFrames to enable scalable processing across large datasets. This approach allows you to:

  • Process data larger than the context length of LLMs
  • Execute parallel computations efficiently
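The first point relies on splitting the data so that no single request exceeds what the model can read. A minimal sketch of that idea follows, assuming rows are already serialized to strings and that character count is a rough proxy for tokens. `batch_rows` is a hypothetical helper, not Datatune's actual batching logic.

```python
from typing import Iterable, Iterator, List


def batch_rows(rows: Iterable[str], max_chars: int) -> Iterator[List[str]]:
    """Group rows into batches whose combined length stays within max_chars.

    A single row longer than max_chars is still emitted, alone in its own batch.
    """
    batch: List[str] = []
    size = 0
    for row in rows:
        # Close the current batch if adding this row would exceed the budget.
        if batch and size + len(row) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(row)
        size += len(row)
    if batch:
        yield batch


rows = ["a" * 40, "b" * 40, "c" * 40, "d" * 100]
batches = list(batch_rows(rows, max_chars=90))
print([len(b) for b in batches])
```

Each batch could then be sent as one prompt, and batches from different partitions could be processed in parallel.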

If you're working with pandas DataFrames, convert them to Dask DataFrames with dd.from_pandas:

import dask.dataframe as dd
dask_df = dd.from_pandas(pandas_df, npartitions=4)  # adjust partitions based on your data size

Examples

Check out the examples.

Documentation

Check out our documentation to learn how to use datatune.

Issues

Want to raise an issue or request a new feature? Head over to issues and raise a ticket!

You can also mail us at hello@vitalops.ai

License

MIT License

