# Silex

Add more 🔥 to Apache Spark!

## TL;DR
Silex is a data engineering library that extends PySpark. You don't need another class: just use PySpark as usual, and your DataFrames gain new functions!
```python
import silex
from pyspark.sql import DataFrame

# extends your DataFrames with silex functions!
# if for some reason you don't want to do that, check the 'Without extending DataFrames' section below
silex.extend_dataframes()

df: DataFrame = ...  # your regular Spark DataFrame
df: DataFrame = df.drop_col_if_na(max=0)  # new function! and still a regular Spark DataFrame!

# scroll for more information!
```
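
For the curious: extending DataFrames like this is typically done by attaching standalone functions as attributes of `pyspark.sql.DataFrame`. The snippet below is a minimal sketch of that general pattern, not silex's actual implementation; `with_greeting` is a made-up function purely for illustration.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def with_greeting(df: DataFrame, text: str) -> DataFrame:
    # toy standalone function: adds a constant column to the DataFrame
    return df.withColumn("greeting", F.lit(text))

# attach it as a method, so every DataFrame instance gains it --
# the same general idea behind silex.extend_dataframes()
DataFrame.with_greeting = with_greeting
```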
## Available functions
```python
from typing import Any, Collection, List, Optional, Union
from pyspark.sql import DataFrame

# assertions (raises an Exception if not met /!\)
def expect_column(self, col: str) -> DataFrame: ...
def expect_columns(self, cols: Union[str, List[str]]) -> DataFrame: ...
def expect_distinct_values_equal_set(self, cols: Union[str, List[str]], values: Collection[Any]) -> DataFrame: ...
def expect_distinct_values_in_set(self, cols: Union[str, List[str]], values: Collection[Any]) -> DataFrame: ...
def expect_min_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> DataFrame: ...
def expect_avg_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> DataFrame: ...
def expect_max_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> DataFrame: ...
def expect_unique_id(self, cols: Union[str, List[str]]) -> DataFrame: ...

# boolean checks
def has_column(self, col: str) -> bool: ...
def has_columns(self, cols: Union[str, List[str]]) -> bool: ...
def has_distinct_values_equal_set(self, cols: Union[str, List[str]], values: Collection[Any]) -> bool: ...
def has_distinct_values_in_set(self, cols: Union[str, List[str]], values: Collection[Any]) -> bool: ...
def has_min_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> bool: ...
def has_avg_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> bool: ...
def has_max_value_between(self, cols: Union[str, List[str]], min: Any, max: Any) -> bool: ...
def has_unique_id(self, cols: Union[str, List[str]]) -> bool: ...

# dates
def with_date_column(self, col: str, fmt: str, new_col: Optional[str] = None) -> DataFrame: ...

# drop
def drop_col_if_na(self, max: int) -> DataFrame: ...
def drop_col_if_not_distinct(self, min: int) -> DataFrame: ...

# filters
def filter_on_range(self, col: str, from_: Any, to: Any, ...) -> DataFrame: ...

# joins
def join_closest_date(self, other: DataFrame, ...) -> DataFrame: ...
```
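
To make the signatures concrete, here is a hypothetical end-to-end sketch. The column names are made up, and the exact semantics of each check (e.g. whether bounds are inclusive) are assumptions inferred from the names above:

```python
import silex
from pyspark.sql import SparkSession

silex.extend_dataframes()
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, "a", 1.0), (1, "b", 2.0), (2, "c", 3.0)],
    schema=["id", "text", "value"],
)

# assertions return the DataFrame, so they presumably chain in a pipeline
df = (
    df.expect_columns(["id", "text", "value"])      # raises if any column is missing
      .expect_unique_id("id")                       # raises if "id" has duplicates
      .expect_min_value_between("value", 0.0, 10.0) # raises if min("value") is out of bounds
)

# boolean checks are the non-raising counterparts
if not df.has_distinct_values_in_set("text", {"a", "b", "c", "d"}):
    raise ValueError("unexpected categories in 'text'")
```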
## Getting started

### Prerequisites

- Python 3.8 or above
- Spark 3 or above
### Installation

```bash
pip install spark-silex
```
## Usage

### By extending DataFrames! ⚡
```python
import silex
from pyspark.sql import DataFrame, SparkSession

# extends your DataFrames with silex functions!
# if for some reason you don't want to do that, check the next example
silex.extend_dataframes()

spark = SparkSession.builder.getOrCreate()

data = [
    (0, "2022-01-01", "a", 1.0),
    (1, "2022-02-01", "b", 2.0),
    (2, "2022-03-01", "c", 3.0),
]
df: DataFrame = spark.createDataFrame(data, schema=["id", "date", "text", "value"])

df.show()
# +---+----------+----+-----+
# | id|      date|text|value|
# +---+----------+----+-----+
# |  0|2022-01-01|   a|  1.0|
# |  1|2022-02-01|   b|  2.0|
# |  2|2022-03-01|   c|  3.0|
# +---+----------+----+-----+

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- date: string (nullable = true)
#  |-- text: string (nullable = true)
#  |-- value: double (nullable = true)

df = df.with_date_column(col="date", fmt="yyyy-MM-dd")

df.show()
# +---+----------+----+-----+
# | id|      date|text|value|
# +---+----------+----+-----+
# |  0|2022-01-01|   a|  1.0|
# |  1|2022-02-01|   b|  2.0|
# |  2|2022-03-01|   c|  3.0|
# +---+----------+----+-----+

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- date: date (nullable = true)
#  |-- text: string (nullable = true)
#  |-- value: double (nullable = true)
```
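
Once `date` is a true date column, a range filter becomes natural. Below is a hedged sketch using `filter_on_range` from the list above; the elided trailing parameters and the bound semantics (inclusive vs. exclusive) are not documented here, so treat it as illustrative only:

```python
import datetime

# keep only rows whose date falls in the first quarter of 2022
df_q1 = df.filter_on_range(col="date", from_=datetime.date(2022, 1, 1), to=datetime.date(2022, 4, 1))
df_q1.show()
```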
### Without extending DataFrames 🌧️
```python
from silex.fn.date import with_date_column
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (0, "2022-01-01", "a", 1.0),
    (1, "2022-02-01", "b", 2.0),
    (2, "2022-03-01", "c", 3.0),
]
df: DataFrame = spark.createDataFrame(data, schema=["id", "date", "text", "value"])

df.show()
# +---+----------+----+-----+
# | id|      date|text|value|
# +---+----------+----+-----+
# |  0|2022-01-01|   a|  1.0|
# |  1|2022-02-01|   b|  2.0|
# |  2|2022-03-01|   c|  3.0|
# +---+----------+----+-----+

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- date: string (nullable = true)
#  |-- text: string (nullable = true)
#  |-- value: double (nullable = true)

df = with_date_column(df=df, col="date", fmt="yyyy-MM-dd")

df.show()
# +---+----------+----+-----+
# | id|      date|text|value|
# +---+----------+----+-----+
# |  0|2022-01-01|   a|  1.0|
# |  1|2022-02-01|   b|  2.0|
# |  2|2022-03-01|   c|  3.0|
# +---+----------+----+-----+

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- date: date (nullable = true)
#  |-- text: string (nullable = true)
#  |-- value: double (nullable = true)
```
## Contributing
```bash
# install poetry and python 3.8, using pyenv for instance
cd silex
poetry env use path/to/python3.8  # e.g. ~/.pyenv/versions/3.8.12/bin/python
poetry shell
poetry install
pre-commit install
make help
# or open Makefile to learn about available commands for development
```