Compute, store and operate on data sketches

sketch

Sketch is an AI code-writing assistant for pandas users that understands the context of your data, greatly improving the relevance of suggestions. Sketch is usable in seconds and doesn't require adding a plugin to your IDE.

pip install sketch

Demo

Here we follow a "standard" (hypothetical) data-analysis workflow, showing a Natural Language interface that successfully navigates many tasks in the data stack landscape.

  • Data Cataloging:
    • General tagging (e.g. PII identification)
    • Metadata generation (names and descriptions)
  • Data Engineering:
    • Data cleaning and masking (compliance)
    • Derived feature creation and extraction
  • Data Analysis:
    • Data questions
    • Data visualization

https://user-images.githubusercontent.com/916073/212602281-4ebd090f-09c4-495d-b48d-0b4c37b9f665.mp4

Try it out in colab: Open In Colab

How to use

It's as simple as importing sketch, and then using the .sketch extension on any pandas dataframe.

import sketch

Now every pandas dataframe you have has an extension registered on it. Access this extension by appending .sketch to your dataframe's name, for example df.sketch.

.sketch.ask

ask is the basic question-answering prompt in sketch. It returns a text answer based on the summary statistics and description of the data.

Use ask to get an understanding of the data, get better column names, ask hypotheticals (how would I go about doing X with this data), and more.

df.sketch.ask("Which columns are integer type?")

.sketch.howto

howto is the basic "code-writing" prompt in sketch. It returns a code block that you should be able to copy-paste and use as a starting point (or possibly an ending point!) for any question you have about the data. Ask it how to clean the data, normalize, create new features, plot, and even build models!

df.sketch.howto("Plot the sales versus time")

.sketch.apply

apply is a more advanced prompt, mainly useful for data generation. Use it to parse fields, generate new features, and more. It is built directly on lambdaprompt. To use it, you need to set up a free account with OpenAI and set an environment variable with your API key: OPENAI_API_KEY=YOUR_API_KEY

df['review_keywords'] = df.sketch.apply("Keywords for the review [{{ review_text }}] of product [{{ product_name }}] (comma separated):")
df['capital'] = pd.DataFrame({'State': ['Colorado', 'Kansas', 'California', 'New York']}).sketch.apply("What is the capital of [{{ State }}]?")
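
The {{ column }} placeholders in the prompts above are filled in once per row. As a rough illustration of that idea, here is a minimal standard-library sketch. The helper name render_prompt and the regular expression are hypothetical; this is not sketch's or lambdaprompt's own code, which may handle templates differently.

```python
import re

# Hypothetical helper: fills {{ column }} placeholders from one row of data.
# It only illustrates the templating idea; it is not sketch's or lambdaprompt's code.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, row):
    """Replace each {{ name }} in the template with the value of row[name]."""
    return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)


template = "What is the capital of [{{ State }}]?"
rows = [{"State": "Colorado"}, {"State": "Kansas"}]

# One prompt per row; each prompt would then be sent to the language model.
prompts = [render_prompt(template, row) for row in rows]
print(prompts[0])  # What is the capital of [Colorado]?
```

Each row yields its own prompt, which is why the result of apply can be assigned back as a new column.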

Sketch currently uses the hosted endpoint prompts.approx.dev so that it runs with minimal setup.

You can also directly use a few pre-built Hugging Face models (right now MPT-7B and StarCoder), which run entirely locally once you have downloaded the model weights from Hugging Face. To do this, set 3 environment variables:

import os

os.environ['LAMBDAPROMPT_BACKEND'] = 'StarCoder'
os.environ['SKETCH_USE_REMOTE_LAMBDAPROMPT'] = 'False'
os.environ['HF_ACCESS_TOKEN'] = 'your_hugging_face_token'

You can also call OpenAI directly (and not use our endpoint) by using your own API key. To do this, set 2 environment variables:

(1) SKETCH_USE_REMOTE_LAMBDAPROMPT=False
(2) OPENAI_API_KEY=YOUR_API_KEY
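
If you prefer to set these two variables from Python rather than from the shell, something like the following may work. The variable names come from the text above; the helper direct_openai_env is hypothetical, and setting the variables before importing sketch is an assumption about when they are read, not documented behavior.

```python
import os


# Hypothetical helper: the variable names are from the README,
# the function itself is only an illustration.
def direct_openai_env(api_key):
    """Return the two settings that make sketch call OpenAI with your own key."""
    if not api_key:
        raise ValueError("an OpenAI API key is required")
    return {
        "SKETCH_USE_REMOTE_LAMBDAPROMPT": "False",
        "OPENAI_API_KEY": api_key,
    }


# Assumption: configuring the environment before `import sketch` is the safe ordering.
os.environ.update(direct_openai_env("YOUR_API_KEY"))
# import sketch  # import afterwards, once the environment is configured
```

Note that environment variable values are always strings, so the flag is the text 'False', not the Python boolean.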

How it works

Sketch uses efficient approximation algorithms (data sketches) to quickly summarize your data and feeds that information into language models. Right now it does this by summarizing the columns and writing these summary statistics as additional context for the code-writing prompt. In the future we hope to feed these sketches directly into custom-made "data + language" foundation models to get more accurate results.
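
To make the "summary statistics as additional context" idea concrete, here is a minimal standard-library sketch. It is not sketch's implementation: the function names summarize_column and build_context are hypothetical, the statistics are computed exactly rather than with approximation algorithms, and the text layout of the context is invented for illustration.

```python
from statistics import mean


def summarize_column(name, values):
    """Build a small summary for one column (exact here; a real data sketch approximates)."""
    present = [v for v in values if v is not None]
    summary = {
        "name": name,
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Only numeric columns get min/max/mean.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(min=min(present), max=max(present), mean=mean(present))
    return summary


def build_context(columns):
    """Render one line of summary statistics per column."""
    lines = []
    for name, values in columns.items():
        stats = summarize_column(name, values)
        lines.append(", ".join(f"{key}={value}" for key, value in stats.items()))
    return "\n".join(lines)


# Toy data standing in for a dataframe: column name -> list of values.
data = {"price": [10, 20, None, 30], "city": ["Oslo", "Rome", "Oslo", None]}

# The summaries are placed in front of the user's question as extra context.
prompt = build_context(data) + "\n\nQuestion: Which columns are numeric?"
print(prompt)
```

The point of the design is that the language model only sees compact per-column summaries, not the raw rows, so the prompt stays small regardless of how large the data is.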
