
High-performance AI/ML library for Spark to build and deploy your LLM applications in production.


Spark AI

Toolbox for building Generative AI applications on top of Apache Spark.

Many developers and companies are trying to leverage LLMs to enhance their existing applications or build completely new ones. Thanks to LLMs, most of them no longer have to train new ML models. The major challenge that remains is data and infrastructure: data ingestion, transformation, vectorization, lookup, and model serving.

Over the last few months, the industry has seen a surge of new tools and frameworks to help with these challenges. However, none of them is easy to use or to deploy to production, and none can deal with the scale of the data.

This project aims to provide a toolbox of Spark extensions, data sources, and utilities that make it easy to build robust data infrastructure on Spark for Generative AI applications.


Example Applications

Complete examples that anyone can start from to build their own Generative AI applications.

Read about our thoughts on prompt engineering, LLMs, and low-code here.

Quickstart

Installation

Currently, the project is aimed mainly at PySpark users. However, because it also features high-performance connectors, both the PySpark and Scala dependencies have to be present on the Spark cluster.
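
As a rough sketch of what having both dependencies present can look like, the snippet below builds the two commands involved: a pip install for the Python package and a spark-submit call that pulls the Scala artifact with --packages. The package name prophecy-spark-ai is inferred from the wheel file name listed for this release. The Maven coordinate is a placeholder and not the project's real coordinate, and the helper function names are hypothetical.

```python
import shlex

# Inferred from the wheel file name (prophecy_spark_ai-...); treat as an assumption.
PYPI_PACKAGE = "prophecy-spark-ai"

# Placeholder only: look up the real coordinate on Maven Central.
MAVEN_COORDINATE = "<group>:<artifact>:<version>"


def pip_install_command(package=PYPI_PACKAGE):
    """Return the command that installs the Python side of the project."""
    return ["pip", "install", package]


def spark_submit_command(script, maven_coordinate=MAVEN_COORDINATE):
    """Return a spark-submit command that also fetches the Scala artifact."""
    return ["spark-submit", "--packages", maven_coordinate, script]


if __name__ == "__main__":
    # Print both commands in a copy-pasteable, shell-quoted form.
    print(shlex.join(pip_install_command()))
    print(shlex.join(spark_submit_command("app.py")))
```

On managed clusters the equivalent step is usually attaching the Python package and the Maven library through the platform's library settings instead of the command line.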

Ingestion

from spark_ai.webapps.slack import SlackUtilities

# Batch version
slack = SlackUtilities(token='xoxb-...', spark=spark)
df_channels = slack.read_channels()
df_conversations = slack.read_conversations(df_channels)

# Live streaming version
df_messages = (spark.readStream
    .format('io.prophecy.spark_ai.webapps.slack.SlackSourceProvider')
    .option('token', 'xapp-...')
    .load())

Pre-processing & Vectorization

from pyspark.sql.functions import expr

from spark_ai.llms.openai import OpenAiLLM
from spark_ai.dbs.pinecone import PineconeDB

# Register the OpenAI and Pinecone UDFs on the active Spark session
OpenAiLLM(api_key='sk-...').register_udfs(spark=spark)
PineconeDB('8045...', 'us-east-1-aws').register_udfs(spark=spark)

(df_conversations
    # Embed the text from every conversation into a vector
    .withColumn('embeddings', expr('openai_embed_texts(text)'))
    # Do some more pre-processing
    # ...
    # Upsert the embeddings into Pinecone
    .withColumn('status', expr("pinecone_upsert('index-name', embeddings)"))
    # Save the status of the upsertion to a standard table
    .write.saveAsTable('pinecone_status'))

Inference

# `token` is the Slack app-level token ('xapp-...') used in the ingestion example
df_messages = spark.readStream \
    .format("io.prophecy.spark_ai.webapps.slack.SlackSourceProvider") \
    .option("token", token) \
    .load()

# Handle a live stream of messages from Slack here
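
One possible way to fill in that placeholder is Spark Structured Streaming's foreachBatch sink, sketched below. The handler only counts the rows of each micro-batch, because the schema of the Slack stream is not described here. The names handle_batch and start_handler and the checkpoint directory are hypothetical, and this is not necessarily how the project's own example applications do it.

```python
def handle_batch(batch_df, batch_id):
    """Process one micro-batch of Slack messages (placeholder logic).

    batch_df is a regular (non-streaming) DataFrame, so any batch
    operation can be applied to it; here we only count its rows.
    """
    n_messages = batch_df.count()
    print(f"batch {batch_id}: {n_messages} message(s)")
    return n_messages


def start_handler(df_messages, checkpoint_dir):
    """Attach handle_batch to the stream and start the query.

    checkpoint_dir is an assumed path where Spark stores stream progress.
    """
    return (df_messages.writeStream
            .foreachBatch(handle_batch)
            .option("checkpointLocation", checkpoint_dir)
            .start())
```

Calling start_handler(df_messages, '/tmp/slack-checkpoint') returns a streaming query; awaitTermination() on it keeps the job running. Replace the body of handle_batch with the actual lookup and answer logic.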

Roadmap

Data sources supported:

  • 🚧 Slack
  • 🗺️ PDFs
  • 🗺️ Asana
  • 🗺️ Notion
  • 🗺️ Google Drive
  • 🗺️ Web scraping

Vector databases supported:

  • 🚧 Pinecone
  • 🚧 Spark-ML (table store & cosine similarity)
  • 🗺️ Elasticsearch
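
To illustrate what a table store with cosine-similarity lookup means, here is a minimal pure-Python sketch that ranks stored vectors against a query vector. It is an illustration of the idea only, not the project's Spark-ML implementation, and the function names and the dictionary-based table are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, table, k=3):
    """Return the k (key, score) pairs most similar to the query.

    table maps a document key to its embedding vector.
    """
    scored = [(cosine_similarity(query, vec), key) for key, vec in table.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


if __name__ == "__main__":
    table = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [1.0, 1.0]}
    print(top_k([1.0, 0.1], table, k=2))
```

A dedicated vector database replaces this full scan with an approximate index, which is what makes lookups practical at large scale.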

LLMs supported:

  • 🚧 OpenAI
  • 🚧 Spark-ML
  • 🗺️ Databricks' Dolly
  • 🗺️ Hugging Face models

Application interfaces supported:

  • 🚧 Slack
  • 🗺️ Microsoft Teams

And many more are coming soon (feel free to request them by opening an issue)! 🚀

✅: General availability; 🚧: Beta availability; 🗺️: Roadmap

Source Distributions

No source distribution files are available for this release.

Built Distribution

prophecy_spark_ai-0.1.11-py3-none-any.whl (15.0 kB), uploaded for Python 3

Hashes for prophecy_spark_ai-0.1.11-py3-none-any.whl

Algorithm    Hash digest
SHA256       d0fa1a1c8a3b8e8a8578d8240dc82dcb6c27fe930c79db3b6f869f24b72c37c3
MD5          d3a5faab65f7a82630fbb6f1a2e78b70
BLAKE2b-256  5c1aa8e7ad647c2b96b21b6405f53a76937afa96aa7b623931a42e8033dcf80a
