Common code for Python projects involving GCP, Pandas, and Spark.
Falgueras 🪴
Development framework for Python projects involving GCP, Pandas, and Spark.
The main goal is to accelerate development of data-driven projects by providing a common framework for developers with different backgrounds: software engineers, big data engineers and data scientists.
Installation
pip install falgueras (requires Python>=3.10)
Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a credentials JSON file to enable GCP services.
Run local Spark applications in Windows from IntelliJ
Try fast, fail fast, learn fast.
For local Spark execution in Windows, the following environment variables must be set appropriately:
- SPARK_HOME; version spark-3.5.2-bin-hadoop3.
- HADOOP_HOME; same value as SPARK_HOME.
- JAVA_HOME; recommended Java SDK 11.
- PATH += %HADOOP_HOME%\bin, %JAVA_HOME%\bin.
%HADOOP_HOME%\bin must contain the files winutils.exe and hadoop.dll, which must be downloaded separately.
Additionally, call findspark.init() at the beginning of the script so that the Spark
environment variables are set and pyspark is added to sys.path.
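The checklist above can be verified in code before Spark starts. The following is a minimal sketch, not part of falgueras: the helper names missing_spark_settings and init_local_spark are hypothetical, and it assumes the findspark package is installed when init_local_spark is called.

```python
import os
from typing import Mapping

# Variables the section above lists for local Spark execution on Windows.
REQUIRED_VARS = ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME")


def missing_spark_settings(env: Mapping[str, str]) -> list[str]:
    """Return a description of every required setting that is absent (hypothetical helper)."""
    problems = [name for name in REQUIRED_VARS if not env.get(name)]
    # PATH entries are separated by ";" on Windows; compare case-insensitively.
    path_entries = {p.rstrip("\\/").lower() for p in env.get("PATH", "").split(";") if p}
    for name in ("HADOOP_HOME", "JAVA_HOME"):
        home = env.get(name)
        if home:
            expected = (home.rstrip("\\/") + "\\bin").lower()
            if expected not in path_entries:
                problems.append(f"PATH is missing %{name}%\\bin")
    return problems


def init_local_spark() -> None:
    """Fail early if the environment is incomplete, then let findspark locate Spark."""
    problems = missing_spark_settings(os.environ)
    if problems:
        raise RuntimeError("Local Spark setup incomplete: " + ", ".join(problems))
    import findspark  # third-party, imported lazily: pip install findspark

    findspark.init()  # reads SPARK_HOME and adds pyspark to sys.path
```

Calling init_local_spark() as the first statement of a script surfaces a missing variable as a readable error rather than a failure deep inside Spark startup.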
Connect to BigQuery from Spark
As shown in spark_session_utils.py, the SparkSession must include the jar
com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.41.1
in order to communicate with BigQuery.
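One common way to attach the jar is the spark.jars.packages setting. The sketch below is an assumption about how such a session could be built, not a copy of spark_session_utils.py; the function names bigquery_spark_conf and build_spark_session are hypothetical, and pyspark is only imported when the session is actually created.

```python
# Maven coordinates quoted in the section above.
BIGQUERY_CONNECTOR = "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.41.1"


def bigquery_spark_conf(extra_packages: tuple[str, ...] = ()) -> dict[str, str]:
    """Spark settings that make the BigQuery connector available to a session."""
    packages = (BIGQUERY_CONNECTOR, *extra_packages)
    # spark.jars.packages takes a comma-separated list of Maven coordinates.
    return {"spark.jars.packages": ",".join(packages)}


def build_spark_session(app_name: str):
    """Create (or reuse) a SparkSession configured with the connector."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in bigquery_spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the configuration in a plain dictionary makes it easy to inspect or extend without starting a JVM.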
Packages
falgueras.common
Code shared between the other packages, plus utility functions for datetime, json, enums, and logging.
falgueras.gcp
The functionalities of various Google Cloud Platform (GCP) services are encapsulated within custom client classes. This approach enhances clarity and promotes better encapsulation.
For instance, Google Cloud Storage (GCS) operations are wrapped in the gcp.GcsClient class,
which has an attribute that holds the actual storage.Client object from GCS. Multiple GcsClient
instances can share the same storage.Client object.
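The wrapper pattern described above can be sketched as follows. GcsClientSketch is a hypothetical illustration and not the actual gcp.GcsClient; its methods and the bucket names are assumptions. An in-memory stand-in replaces storage.Client so the example runs without credentials; it mimics only the bucket(), blob(), upload_from_string() and download_as_text() calls of google-cloud-storage.

```python
from typing import Any


class GcsClientSketch:
    """Hypothetical wrapper; not the actual falgueras gcp.GcsClient."""

    def __init__(self, client: Any, bucket_name: str) -> None:
        # `client` is expected to behave like google.cloud.storage.Client.
        self.client = client
        self.bucket_name = bucket_name

    def write_text(self, blob_name: str, text: str) -> None:
        self.client.bucket(self.bucket_name).blob(blob_name).upload_from_string(text)

    def read_text(self, blob_name: str) -> str:
        return self.client.bucket(self.bucket_name).blob(blob_name).download_as_text()


# In-memory stand-ins, for illustration only.
class _FakeBlob:
    def __init__(self, store: dict, key: tuple) -> None:
        self._store, self._key = store, key

    def upload_from_string(self, data: str) -> None:
        self._store[self._key] = data

    def download_as_text(self) -> str:
        return self._store[self._key]


class _FakeBucket:
    def __init__(self, store: dict, name: str) -> None:
        self._store, self._name = store, name

    def blob(self, blob_name: str) -> _FakeBlob:
        return _FakeBlob(self._store, (self._name, blob_name))


class FakeStorageClient:
    def __init__(self) -> None:
        self._store: dict = {}

    def bucket(self, name: str) -> _FakeBucket:
        return _FakeBucket(self._store, name)


shared = FakeStorageClient()  # with real GCS this would be storage.Client()
raw = GcsClientSketch(shared, "raw-bucket")  # hypothetical bucket names
curated = GcsClientSketch(shared, "curated-bucket")
raw.write_text("greeting.txt", "hello")
```

Both wrappers hold the same underlying client object, so connection setup and credentials are resolved once while each wrapper keeps its own bucket context.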
falgueras.pandas
Pandas-related code.
The pandas_repo.py file provides a modular and extensible framework for handling pandas DataFrame operations
across various storage systems. Using the PandasRepo abstract base class and PandasRepoProtocol,
it standardizes read and write operations while enabling custom implementations for specific backends
such as BigQuery (BqPandasRepo). These implementations encapsulate backend-specific logic, allowing
users to interact with data sources using a consistent interface.
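The combination of an abstract base class and a protocol might look roughly like this. All names ending in Sketch are hypothetical and the read/write signatures are assumptions; the real PandasRepo interface in pandas_repo.py may differ. A CSV backend stands in for BigQuery to keep the example small, and pandas is imported only when data is actually read.

```python
from abc import ABC, abstractmethod
from typing import TYPE_CHECKING, Protocol, runtime_checkable

if TYPE_CHECKING:  # only needed for the type hints below
    import pandas as pd


@runtime_checkable
class PandasRepoProtocolSketch(Protocol):
    """Structural type: anything with read() and write() qualifies."""

    def read(self) -> "pd.DataFrame": ...

    def write(self, df: "pd.DataFrame") -> None: ...


class PandasRepoSketch(ABC):
    """Nominal base class that concrete backends inherit from."""

    @abstractmethod
    def read(self) -> "pd.DataFrame": ...

    @abstractmethod
    def write(self, df: "pd.DataFrame") -> None: ...


class CsvPandasRepoSketch(PandasRepoSketch):
    """Backend-specific logic (here, a CSV file) stays inside the implementation."""

    def __init__(self, path: str) -> None:
        self.path = path

    def read(self) -> "pd.DataFrame":
        import pandas as pd  # imported lazily; requires pandas

        return pd.read_csv(self.path)

    def write(self, df: "pd.DataFrame") -> None:
        df.to_csv(self.path, index=False)


def copy_table(source: PandasRepoProtocolSketch, target: PandasRepoProtocolSketch) -> None:
    """Caller code depends only on the shared interface, not on the backend."""
    target.write(source.read())
```

Because copy_table is typed against the protocol, a BigQuery-backed repository could replace the CSV one without changes to the calling code.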
falgueras.spark
Spark-related code.
In the same way as the pandas_repo.py file, the spark_repo.py file provides a modular and extensible
framework for handling Spark DataFrame operations across various storage systems. Using the SparkRepo abstract base
class and SparkRepoProtocol, it standardizes read and write operations while enabling custom implementations for
specific backends such as BigQuery (BqSparkRepo). These implementations encapsulate backend-specific logic, allowing
users to interact with data sources using a consistent interface.
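For the Spark side, a BigQuery-backed repository could be sketched as below. The class names and method signatures are hypothetical and are not taken from spark_repo.py. The sketch assumes the session was created with the spark-bigquery connector on its classpath, and it uses the connector's "bigquery" data source format and its writeMethod option.

```python
from abc import ABC, abstractmethod
from typing import Any


class SparkRepoSketch(ABC):
    """Hypothetical base class standardizing read and write operations."""

    @abstractmethod
    def read(self) -> Any: ...

    @abstractmethod
    def write(self, df: Any, mode: str = "append") -> None: ...


class BqSparkRepoSketch(SparkRepoSketch):
    """Hypothetical BigQuery backend built on the spark-bigquery connector."""

    def __init__(self, spark: Any, table: str) -> None:
        self.spark = spark  # a pyspark.sql.SparkSession with the connector jar loaded
        self.table = table  # for example "project.dataset.table"

    def read(self) -> Any:
        # The connector registers itself under the "bigquery" format name.
        return self.spark.read.format("bigquery").load(self.table)

    def write(self, df: Any, mode: str = "append") -> None:
        # writeMethod "direct" avoids staging data in a temporary GCS bucket.
        (
            df.write.format("bigquery")
            .option("writeMethod", "direct")
            .mode(mode)
            .save(self.table)
        )
```

The session is passed in rather than created inside the repository, so several repositories can share one SparkSession.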