
Silex

Add more 🔥 to Apache Spark!

Tooling: Poetry · BDD & doctest tests · Black · isort · Flake8 · tryceratops · MyPy · Bandit · pre-commit · Conventional Commits · Semantic Versioning


TLDR

Silex is a data engineering library that extends PySpark.

You don't need another class: just use PySpark as usual, and your DataFrames gain new functions!

import silex
from pyspark.sql import DataFrame

# extend your DataFrames with silex functions!
# if for some reason you don't want to do that, see the 'Without extending DataFrames' section below
silex.extend_dataframes()

df: DataFrame = ...  # your regular Spark DataFrame
df = df.drop_col_if_na(max=0)  # new function, and still a regular Spark DataFrame!
# scroll for more information!

Available functions

# assertions (raise an Exception if the expectation is not met /!\)
def expect_column(self, col: str) -> DataFrame: ...
def expect_columns(self, cols: Union[str, List[str]]) -> DataFrame: ...

def expect_distinct_values_equal_set(self, cols: Union[str, List[str]], values: Collection[Any]) -> DataFrame: ...
def expect_distinct_values_in_set(self, cols: Union[str, List[str]], values: Collection[Any]) -> DataFrame: ...

def expect_min_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> DataFrame: ...
def expect_avg_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> DataFrame: ...
def expect_max_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> DataFrame: ...

def expect_unique_id(self, cols: Union[str, List[str]]) -> DataFrame: ...

# boolean checks
def has_column(self, col: str) -> bool: ...
def has_columns(self, cols: Union[str, List[str]]) -> bool: ...

def has_distinct_values_equal_set(self, cols: Union[str, List[str]], values: Collection[Any]) -> bool: ...
def has_distinct_values_in_set(self, cols: Union[str, List[str]], values: Collection[Any]) -> bool: ...

def has_min_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> bool: ...
def has_avg_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> bool: ...
def has_max_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> bool: ...

def has_unique_id(self, cols: Union[str, List[str]]) -> bool: ...

# dates
def with_date_column(self, col: str, fmt: str, new_col: Optional[str] = None) -> DataFrame: ...

# drop
def drop_col_if_na(self, max: int) -> DataFrame: ...
def drop_col_if_not_distinct(self, min: int) -> DataFrame: ...

# filters
def filter_on_range(self, col: str, from_: Any, to: Any, ...) -> DataFrame: ...

# joins
def join_closest_date(self, other: DataFrame, ...) -> DataFrame: ...
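
For example, with silex.extend_dataframes() in effect, the boolean checks can guard a pipeline while their expect_* twins fail fast. A small sketch using only the signatures listed above (the exact exception type raised by expect_* is not shown here):

import silex
from pyspark.sql import SparkSession

silex.extend_dataframes()

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0, "a"), (1, "b")], schema=["id", "text"])

# boolean checks return a plain bool...
assert df.has_columns(["id", "text"])
assert df.has_unique_id("id")
assert df.has_distinct_values_in_set("text", {"a", "b", "c"})

# ...while the expect_* variants return the DataFrame, so they chain
# and raise if an expectation does not hold
df = df.expect_columns(["id", "text"]).expect_unique_id("id")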

Getting started

Pre-requisites

  • Python 3.8 or above
  • Spark 3 or above

Installation

pip install spark-silex
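
Once installed, importing the package and extending DataFrames (as shown in the Usage section below) makes for a quick smoke test:

import silex

silex.extend_dataframes()  # should run without error on a fresh install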

Usage

By extending DataFrames! ⚡

import silex
from pyspark.sql import DataFrame, SparkSession

# extend your DataFrames with silex functions!
# if for some reason you don't want to do that, see the next example
silex.extend_dataframes()

spark = SparkSession.builder.getOrCreate()

data = [
    (0, "2022-01-01", "a", 1.0),
    (1, "2022-02-01", "b", 2.0),
    (2, "2022-03-01", "c", 3.0),
]
df: DataFrame = spark.createDataFrame(data, schema=["id", "date", "text", "value"])

df.show()
# +---+----------+----+-----+
# | id|      date|text|value|
# +---+----------+----+-----+
# |  0|2022-01-01|   a|  1.0|
# |  1|2022-02-01|   b|  2.0|
# |  2|2022-03-01|   c|  3.0|
# +---+----------+----+-----+

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- date: string (nullable = true)
#  |-- text: string (nullable = true)
#  |-- value: double (nullable = true)

df = df.with_date_column(col="date", fmt="yyyy-MM-dd")

df.show()
# +---+----------+----+-----+
# | id|      date|text|value|
# +---+----------+----+-----+
# |  0|2022-01-01|   a|  1.0|
# |  1|2022-02-01|   b|  2.0|
# |  2|2022-03-01|   c|  3.0|
# +---+----------+----+-----+

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- date: date (nullable = true)
#  |-- text: string (nullable = true)
#  |-- value: double (nullable = true)
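
The extended DataFrame also exposes the other functions from the listing above, e.g. the range filter. A sketch using only the documented parameters (whether the bounds are inclusive, and what the elided "..." parameters are, is left to the API reference):

df = df.filter_on_range(col="value", from_=1.5, to=3.0)
# keeps the rows whose "value" falls in the given range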

Without extending DataFrames 🌧️

from silex.fn.date import with_date_column
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (0, "2022-01-01", "a", 1.0),
    (1, "2022-02-01", "b", 2.0),
    (2, "2022-03-01", "c", 3.0),
]
df: DataFrame = spark.createDataFrame(data, schema=["id", "date", "text", "value"])

df.show()
# +---+----------+----+-----+
# | id|      date|text|value|
# +---+----------+----+-----+
# |  0|2022-01-01|   a|  1.0|
# |  1|2022-02-01|   b|  2.0|
# |  2|2022-03-01|   c|  3.0|
# +---+----------+----+-----+

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- date: string (nullable = true)
#  |-- text: string (nullable = true)
#  |-- value: double (nullable = true)

df = with_date_column(df=df, col="date", fmt="yyyy-MM-dd")

df.show()
# +---+----------+----+-----+
# | id|      date|text|value|
# +---+----------+----+-----+
# |  0|2022-01-01|   a|  1.0|
# |  1|2022-02-01|   b|  2.0|
# |  2|2022-03-01|   c|  3.0|
# +---+----------+----+-----+

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- date: date (nullable = true)
#  |-- text: string (nullable = true)
#  |-- value: double (nullable = true)

Contributing

# install Poetry and Python 3.8 first, e.g. using pyenv

cd silex
poetry env use path/to/python3.8  # e.g. ~/.pyenv/versions/3.8.12/bin/python
poetry shell
poetry install
pre-commit install

make help
# or open the Makefile to see the available development commands
