A framework for integrating `instructor` with Spark
Spark Instructor
Spark Instructor is a library that combines the capabilities of Apache Spark and the `instructor` library to enable AI-powered structured data generation within Spark SQL DataFrames.
Overview
This project aims to bridge the gap between large-scale data processing with Apache Spark and AI-driven content generation. The `instructor` library works with several AI providers (such as OpenAI, Anthropic, and Databricks). Building on it, Spark Instructor lets users create User-Defined Functions (UDFs) that generate structured, AI-powered columns in Spark SQL DataFrames.
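A UDF that returns a structured column needs a Spark return type that matches the structure. The sketch below is not how Spark Instructor derives schemas. It is a standard-library illustration of the idea, using a dataclass as a stand-in for a Pydantic model. The `Summary` class, the `struct_ddl` helper, and the type mapping are all hypothetical.

```python
from dataclasses import dataclass, fields

# Assumed mapping from Python annotations to Spark SQL type names.
SPARK_TYPE_NAMES = {str: "string", int: "bigint", float: "double", bool: "boolean"}


@dataclass
class Summary:
    """Hypothetical structure for an AI-generated column."""

    title: str
    word_count: int
    confidence: float


def struct_ddl(model: type) -> str:
    """Build a Spark DDL struct string from a dataclass's fields."""
    parts = [f"{f.name}:{SPARK_TYPE_NAMES[f.type]}" for f in fields(model)]
    return "struct<" + ",".join(parts) + ">"


print(struct_ddl(Summary))
# prints: struct<title:string,word_count:bigint,confidence:double>
```

PySpark accepts a DDL-formatted type string of this form as the `returnType` argument of `pyspark.sql.functions.udf`, so one class definition can describe both the expected response and the resulting column.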
Key Features
- AI-Powered UDFs: Create Spark UDFs that utilize AI models to generate structured data.
- Multi-Provider Support: Work with various AI providers including OpenAI, Anthropic, and Databricks.
- Type-Safe Responses: Utilize Pydantic models to ensure type safety and data validation for AI-generated content.
- Seamless Integration: Easily incorporate AI-generated columns into your existing Spark SQL workflows.
- Scalable Processing: Leverage Spark's distributed computing capabilities for processing large datasets with AI augmentation.
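To make the type-safety point concrete, here is a minimal sketch of validating a raw model reply before it becomes a row value. It uses only the standard library, with a dataclass standing in for a Pydantic model, and it does not use the Spark Instructor or `instructor` APIs. The `Person` class and `parse_response` function are hypothetical names.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Person:
    """Hypothetical target structure for an AI-generated column."""

    name: str
    age: int


def parse_response(raw: str) -> Person:
    """Parse a raw JSON reply and check each field against Person."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    values = {}
    for field in fields(Person):
        if field.name not in data:
            raise ValueError(f"missing field: {field.name}")
        value = data[field.name]
        # bool is a subclass of int, so reject it explicitly for non-bool fields.
        wrong_bool = isinstance(value, bool) and field.type is not bool
        if wrong_bool or not isinstance(value, field.type):
            raise ValueError(f"field {field.name!r} must be {field.type.__name__}")
        values[field.name] = value
    return Person(**values)


person = parse_response('{"name": "Ada", "age": 36}')
print(person.name, person.age)
```

A reply such as `{"name": "Ada", "age": "thirty-six"}` raises `ValueError` instead of silently producing a badly typed value. Pydantic gives the same kind of guarantee, along with type coercion and richer error reporting.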
Use Cases
- Enhance datasets with AI-generated insights or summaries.
- Perform large-scale text classification or entity extraction.
- Generate structured metadata for unstructured text data.
- Create synthetic datasets for testing or machine learning purposes.
Getting Started
- Install poetry
- Run `poetry install`
- Run `poetry build`
Project Structure
- `spark_instructor/`: Main package directory
  - `completions/`: Subpackage for completion object models
    - `base.py`: Base classes for completion models
    - `anthropic.py`: Anthropic-specific completion models
    - `openai.py`: OpenAI-specific completion models
    - `databricks.py`: Databricks-specific completion models
  - `client.py`: Submodule for routing API calls
  - `udf.py`: Submodule for generating Spark UDFs
Contributing
We welcome contributions to Spark Instructor! Please see our contributing guidelines (TODO) for more information on how to get involved.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
This project builds upon the excellent work of the Apache Spark community and the creators of the `instructor` library.
Hashes for spark_instructor-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 044987e2460f51e48c1e1b9d3e1d22361c92510ce23c53a43dbb65365c20ad5f
MD5 | 2537bf1184fef53f901a6c4b413f3920
BLAKE2b-256 | fa30250298eec94ac75483118f1a08dd45b34adf122375cbdbe90006a946a9fe