LLM assistant for the development of Spark applications

Project description

LLM Assistant for Apache Spark

Installation

pip install spark-llm

Usage

Initialization

from spark_llm import SparkLLMAssistant

assistant = SparkLLMAssistant()
assistant.activate()  # activate partial functions for Spark DataFrame

Data Ingestion

auto_df = assistant.create_df("2022 USA national auto sales by brand")
auto_df.show(n=5)
rank  brand      us_sales_2022  sales_change_vs_2021
1     Toyota     1849751        -9
2     Ford       1767439        -2
3     Chevrolet  1502389        6
4     Honda      881201         -33
5     Hyundai    724265         -2

Plot

auto_df.llm.plot()

Figure: 2022 USA national auto sales by brand

To plot with an instruction:

auto_df.llm.plot("pie chart for top 5 brands and the others' market shares")

Figure: 2022 USA national auto sales_market_share by brand

DataFrame Transformation

auto_top_growth_df = auto_df.llm.transform("top brand with the highest growth")
auto_top_growth_df.show()
brand     us_sales_2022  sales_change_vs_2021
Cadillac  134726         14

DataFrame Explanation

auto_top_growth_df.llm.explain()

In summary, this dataframe is retrieving the brand with the highest sales change in 2022 compared to 2021. It presents the results sorted by sales change in descending order and only returns the top result.

DataFrame Attribute Verification

auto_top_growth_df.llm.verify("expect sales change percentage to be between -100 and 100")

result: True
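As a rough illustration of what the instruction above asks for, the range check can be written by hand in plain Python. This is only a sketch: `within_range` is a hypothetical helper, not part of spark-llm, and the check the assistant actually generates may look different.

```python
def within_range(values, low=-100, high=100):
    """Return True if every value lies in the closed interval [low, high]."""
    return all(low <= v <= high for v in values)


# sales_change_vs_2021 values from auto_top_growth_df (the single Cadillac row)
sales_changes = [14]

result = within_range(sales_changes)
print(f"result: {result}")
```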

UDF Generation

@assistant.udf
def previous_years_sales(brand: str, current_year_sale: int, sales_change_percentage: float) -> int:
    """Calculate the previous year's sales from the sales change percentage"""
    ...  # body left empty; the assistant generates it from the signature and docstring

# `spark` is the active SparkSession
spark.udf.register("previous_years_sales", previous_years_sales)
auto_df.createOrReplaceTempView("autoDF")

# column names match the auto_df schema shown under Data Ingestion
spark.sql("select brand as brand, previous_years_sales(brand, us_sales_2022, sales_change_vs_2021) as 2021_sales from autoDF").show()
brand      2021_sales
Toyota     2032693
Ford       1803509
Chevrolet  1417348
Honda      1315225
Hyundai    739045
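For reference, the arithmetic behind such a UDF can be sketched by hand in plain Python. This is an assumption about what a reasonable generated body might look like, not the code the assistant actually produces: if this year's sales equal last year's sales times (1 + change / 100), then last year's sales are this year's sales divided by that factor. Truncating to an integer is also an assumption, chosen because it matches the figures shown above.

```python
def previous_years_sales(brand: str, current_year_sale: int, sales_change_percentage: float) -> int:
    """Estimate the previous year's sales from this year's sales and the percentage change."""
    # current = previous * (1 + pct / 100)  =>  previous = current / (1 + pct / 100)
    growth_factor = 1 + sales_change_percentage / 100.0
    if growth_factor <= 0:
        raise ValueError("sales change must be greater than -100 percent")
    # int() truncates toward zero (assumed rounding behavior)
    return int(current_year_sale / growth_factor)


# Rows taken from the auto_df sample shown under Data Ingestion
rows = [
    ("Toyota", 1849751, -9),
    ("Ford", 1767439, -2),
    ("Chevrolet", 1502389, 6),
    ("Honda", 881201, -33),
    ("Hyundai", 724265, -2),
]

for brand, sales, change in rows:
    print(brand, previous_years_sales(brand, sales, change))
```

The `brand` argument is unused in this sketch; it is kept only so the signature matches the UDF declared above.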

Cache

The SparkLLMAssistant supports a simple cache with two layers: an in-memory staging cache and a persistent cache. New entries are held in the staging cache and are written to the persistent cache only when commit() is called. Lookups always go to the persistent cache, so staged entries are not returned until they have been committed.

assistant.commit()
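The staging-then-commit behavior can be illustrated with a small standalone sketch. The `StagedCache` class, its method names, and the JSON file format are all hypothetical and chosen for illustration only; they are not the spark-llm implementation.

```python
import json
import os
import tempfile


class StagedCache:
    """Hypothetical two-layer cache: in-memory staging plus a JSON file on disk."""

    def __init__(self, path):
        self._path = path
        self._staging = {}      # uncommitted entries, lost if the process exits
        self._persistent = {}   # committed entries, mirrored to the file
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self._persistent = json.load(f)

    def lookup(self, key):
        # Lookups consult only the persistent layer, never the staging layer.
        return self._persistent.get(key)

    def update(self, key, value):
        # New entries are staged in memory until commit() is called.
        self._staging[key] = value

    def commit(self):
        # Merge staged entries into the persistent layer and write them to disk.
        self._persistent.update(self._staging)
        with open(self._path, "w", encoding="utf-8") as f:
            json.dump(self._persistent, f)
        self._staging.clear()


cache_path = os.path.join(tempfile.mkdtemp(), "cache.json")
cache = StagedCache(cache_path)

cache.update("some prompt", "some response")
before_commit = cache.lookup("some prompt")   # staged only, so not found

cache.commit()
after_commit = cache.lookup("some prompt")    # now in the persistent layer

# A fresh instance reading the same file sees the committed entry.
reloaded = StagedCache(cache_path).lookup("some prompt")
```

One consequence of this design is that results produced during an experimental session do not affect later lookups unless they are explicitly committed.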

Refer to example.ipynb for more detailed usage examples.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

Licensed under the Apache License 2.0.

Download files

Source Distribution

spark_llm-0.1.9.tar.gz (18.7 kB)

Built Distribution

spark_llm-0.1.9-py3-none-any.whl (20.6 kB, Python 3)