Data transformations using LLMs for use with pandas

Project description

AI Data Transformers

This repo contains a library that uses LLMs to transform values in pandas or Spark DataFrames.

The core value proposition is that the library exposes simple building blocks for writing transformations that use an LLM to perform a computation over the data, as specified by a prompt. In essence, instead of spelling out the transformation logic in detailed code, you replace that logic with an LLM: you give the model an instruction, it processes the data, and it returns the desired output.

For example, consider the problem of parsing a time out of raw text that can arrive in many formats: a unix timestamp, a UTC timestamp, a textual description, Japanese text, and so on. Traditionally, data engineers have to write well-tested transformations that cover an ever-increasing set of cases on highly filtered data. With this library, you can do the same with a single LLM call and the instruction: "parse out time and write it out as a UTC timestamp".
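The timestamp example can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the `transform(x, transform_instruction)` shape follows the signature listed under Text Level Functions, while the extra `llm` argument, the prompt layout, and the `fake_llm` stand-in are hypothetical and exist only so the snippet runs without a real model.

```python
from datetime import datetime, timezone
from typing import Callable

# Hypothetical type: any callable that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def transform(x: str, transform_instruction: str, llm: LLM) -> str:
    """Build a prompt from the instruction and the value, then return the model's answer."""
    prompt = f"Instruction: {transform_instruction}\nInput: {x}\nOutput:"
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: it only understands unix timestamps."""
    value = prompt.split("Input: ", 1)[1].split("\n", 1)[0]
    if value.isdigit():
        return datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()
    return "UNKNOWN"


instruction = "parse out time and write it out as a UTC timestamp"
rows = ["0", "1700000000", "next Tuesday"]

# One call per value; the instruction replaces hand-written parsing logic.
parsed = [transform(r, instruction, fake_llm) for r in rows]
print(parsed)
```

With a pandas column, the same function could be applied per value through `Series.map`, which matches the README's stance of leaving map-style operations to pandas.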

The library exposes interfaces for both API-based models and open source models. In particular, one can take a (possibly quantized) model and use it. The library takes care of running it optimally, distributing the work across workers, and serialization.

Low-Level Functions -> Let Pandas etc. handle it

Map, Reduce, Serialize

Composition Functions -> Let User Define them

Don't define composition abstractions, let users handle it.

Text Level Functions

  • transform(x, transform_instruction)
  • structify(x, model: PydanticBaseModel, structify_instruction=None)
  • classify(x, c, labels=[], classification_instruction=None)
  • score_in_range(x, range_start, range_end, scoring_instruction=None)
  • extract(x, types=[])
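To make one of these signatures concrete, here is a hedged sketch of how a function shaped like `score_in_range` might behave. Only the first four parameters come from the list above; the `llm` argument, the prompt wording, the default instruction, the clamping step, and `fake_llm` are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical type: prompt in, completion out.
LLM = Callable[[str], str]


def score_in_range(
    x: str,
    range_start: float,
    range_end: float,
    scoring_instruction: Optional[str] = None,
    llm: Optional[LLM] = None,  # assumption: the model is injected by the caller
) -> float:
    """Ask the model for a number and force it into [range_start, range_end]."""
    if llm is None:
        raise ValueError("this sketch needs an llm callable")
    instruction = scoring_instruction or "Score the input."
    prompt = (
        f"{instruction}\n"
        f"Answer with a single number between {range_start} and {range_end}.\n"
        f"Input: {x}\nScore:"
    )
    # float() raises ValueError if the model did not return a number.
    score = float(llm(prompt).strip())
    # Clamp, because a model can ignore the requested range.
    return float(max(range_start, min(range_end, score)))


def fake_llm(prompt: str) -> str:
    """Stand-in model: returns the word count of the input as its 'score'."""
    text = prompt.split("Input: ", 1)[1].split("\n", 1)[0]
    return str(len(text.split()))


texts = ["short", "a much longer piece of text here"]
scores = [score_in_range(t, 1, 5, "Rate the length of the text.", llm=fake_llm) for t in texts]
print(scores)
```

Validating and clamping the model's answer is a design choice of this sketch: it keeps downstream numeric code safe even when the model returns a value outside the requested range.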

Optimizations

Compile Transformations

For a sequence of transformations over the same input, if you compile and serialize them, the models reuse the KV-cache for that input under the hood, so the input is not re-processed for every transformation.
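The idea behind this optimization can be illustrated with a toy. Real KV-cache reuse happens inside the inference engine and is not shown here; this sketch only counts how often a shared prompt prefix has to be "encoded" when the input is placed before the instruction. Every name in it (`ToyPrefixCache`, `compile_prompts`) is hypothetical and not part of the library.

```python
from typing import Dict, List, Tuple


class ToyPrefixCache:
    """Counts expensive prefix encodings; a stand-in for a model's KV-cache."""

    def __init__(self) -> None:
        self._cache: Dict[str, Tuple[str, ...]] = {}
        self.encodings = 0

    def encode(self, prefix: str) -> Tuple[str, ...]:
        if prefix not in self._cache:
            # The costly step that a real KV-cache avoids repeating.
            self.encodings += 1
            self._cache[prefix] = tuple(prefix.split())
        return self._cache[prefix]


def compile_prompts(x: str, instructions: List[str]) -> List[Tuple[str, str]]:
    """Put the input first so that every prompt shares the same prefix."""
    prefix = f"Input: {x}\n"
    return [(prefix, f"Instruction: {inst}\nOutput:") for inst in instructions]


cache = ToyPrefixCache()
plan = compile_prompts(
    "1700000000",
    ["parse out time and write it out as a UTC timestamp", "classify the time format"],
)
for prefix, suffix in plan:
    cache.encode(prefix)  # the second call hits the cache

print(len(plan), cache.encodings)
```

Two transformations are planned, but the shared input prefix is encoded only once, which is the saving the compile step is meant to unlock.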

Download files

Source Distribution

aidt-0.0.1.tar.gz (2.5 kB view details)

Uploaded Source

Built Distribution

aidt-0.0.1-py3-none-any.whl (2.7 kB view details)

Uploaded Python 3

File details

Details for the file aidt-0.0.1.tar.gz.

File metadata

  • Download URL: aidt-0.0.1.tar.gz
  • Upload date:
  • Size: 2.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.10.11 Darwin/21.6.0

File hashes

Hashes for aidt-0.0.1.tar.gz
Algorithm Hash digest
SHA256 c41148761dff3d68ce3a2b5b57f8dbab88a856b720e3fedab79de9729fae868e
MD5 270d7d0e5527c337467bfd1cceeb71a5
BLAKE2b-256 e4d9769cca959676180fdf7bbd2f93c17488136913a60544dbca978b6e5cc5d0

File details

Details for the file aidt-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: aidt-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 2.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.10.11 Darwin/21.6.0

File hashes

Hashes for aidt-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 4ab06f6af2af59ed65e2f58e248731c491268a3c780f2ca1e4d96795e2e904f8
MD5 e006a9adbb37d8cd4e76b0a880179e00
BLAKE2b-256 035bb5729ac052890d8958338bf576be66ebba2917b457df429e6264c1608f36
