
Spark ETL Utility Framework


SeedSpark

Why Spark

Apache Spark is a unified analytics engine for large-scale data processing. It achieves high performance for both batch and streaming data by combining a Directed Acyclic Graph (DAG) scheduler, a query optimizer, and a physical execution engine.

Spark’s design philosophy centers around four key characteristics:

  • Speed: Spark keeps working data in memory where it can, and can run some workloads up to 100 times faster in memory, and up to 10 times faster on disk, than traditional big data processing systems (e.g., Hadoop MapReduce).
  • Ease of Use: Through high-level APIs and built-in modules, Spark simplifies the process of complex data transformations and analyses, making it accessible to both developers and data analysts.
  • Modularity and Extensibility: Spark's modular nature allows it to be used for a range of data processing tasks from batch processing to real-time streams and machine learning. Extensibility with numerous data sources and libraries further enhances its utility.
  • Unified Analytics: Spark's unified framework reduces the complexity involved in processing data that might otherwise require multiple engines or different technologies.

Spark’s architecture is designed for efficiency. RDDs (Resilient Distributed Datasets) and the later abstractions built on them, DataFrames and Datasets, simplify data manipulation while providing fault tolerance. By retaining intermediate results in memory rather than on disk, Spark minimizes costly I/O operations that are a common bottleneck in big data processing.

The DAG execution engine builds on this by supporting more complex pipelines and optimizing the whole workflow before it runs. This reduces redundant data shuffling across the cluster, which can improve performance significantly.

Run on Gitpod

Start Dev Env in Gitpod: StartDevEnvInGitpod

Force build: ForcePrebuild

Installation

Install Python 3.10 or above

pyenv install 3.11 \
    && pyenv global 3.11
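To confirm that the active interpreter meets the minimum stated above, a small check along these lines may help. It is only a sketch: the helper name `python_is_supported` is made up for illustration and is not part of SeedSpark.

```python
import sys

# Minimum version stated in this README.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old"
    print(f"Python {found}: {status}")
```

Run it with the interpreter that pyenv selected, so the check reflects the version Poetry will pick up.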

Install Scala and Spark

make install-scala &&\
    make install-spark

Optional: to preview the commands an install target would run without executing them, use make's dry-run flag:

make --just-print install-spark

It prints the following:
echo "Installing Hadoop..."
source "$HOME/.sdkman/bin/sdkman-init.sh" && sdk install hadoop 3.3.5
# Set Global version
source "$HOME/.sdkman/bin/sdkman-init.sh" && sdk default hadoop 3.3.5
echo "Installing Spark..."
source "$HOME/.sdkman/bin/sdkman-init.sh" && sdk install spark 3.5.0
# Set Global version
source "$HOME/.sdkman/bin/sdkman-init.sh" && sdk default spark 3.5.0

Verify Installation

poetry env info
poetry version
sdk version

Run sdk current to check which SDKMAN candidates are active:

sdk current

It should show:

Using:

java: 11.0.22-zulu
scala: 2.13.12
spark: 3.5.0

Verify

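Then verify the spark-shell version. The banner reports the Scala version that Spark itself was built with, which can differ from the SDKMAN Scala version listed above: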

spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/

Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 11.0.22

Verify top-level packages

poetry show -T

The PySpark version should match the Spark version shown above:

pyspark                  3.5.0    Apache Spark Python API
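The comparison between the two versions can also be done programmatically. The sketch below parses text in the shape shown in this README (the `spark-shell --version` banner and one line of `poetry show -T` output) and compares the two. The function names are illustrative, and the parsing assumes those tools keep printing the same layout.

```python
import re

VERSION_RE = re.compile(r"version\s+(\d+(?:\.\d+)+)")


def spark_version_from_banner(banner: str) -> str:
    """Extract the Spark version from `spark-shell --version` output."""
    for line in banner.splitlines():
        if "Scala" in line:
            # Skip "Using Scala version ..." so it is not mistaken for Spark's.
            continue
        match = VERSION_RE.search(line)
        if match:
            return match.group(1)
    raise ValueError("no Spark version found in banner text")


def pyspark_version_from_poetry(listing: str) -> str:
    """Extract the pyspark version from `poetry show -T` output."""
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "pyspark":
            return fields[1]
    raise ValueError("pyspark is not listed")


def versions_match(banner: str, listing: str) -> bool:
    """Return True when the Spark and PySpark versions are identical."""
    return spark_version_from_banner(banner) == pyspark_version_from_poetry(listing)


if __name__ == "__main__":
    # Sample text copied from the outputs shown in this README.
    banner = r"""   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 11.0.22"""
    listing = "pyspark                  3.5.0    Apache Spark Python API"
    print("match" if versions_match(banner, listing) else "mismatch")
```

In practice the two strings would come from capturing the real command output, for example with `subprocess.run(..., capture_output=True, text=True)`.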

Run Pytest

# Install packages
poetry install --with=testing --no-interaction
# Run Pytest
poetry run coverage run -m pytest -vv tests --reruns 5 --reruns-delay 20


Next, open seedspark/examples/music_sessions_top_n.py and replace the dataset path with the actual location of music_sessions_data.tsv.

Then run the following:

# OPTIONAL: download the dataset (skip this if you already have it)
cd datasets/
pip install pandas requests tqdm && python lastfm_dataset_1k.py
# Update the example script with the actual path of the new music_sessions_data.tsv
cd ..
# Run the Spark app
poetry run python seedspark/examples/music_sessions_top_n.py
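One way to avoid editing the path by hand is to resolve it at startup and fail early if the file is missing. The sketch below is not how the SeedSpark example works. The environment variable `MUSIC_SESSIONS_TSV`, the default location under datasets/, and the helper `resolve_dataset_path` are all assumptions made for illustration.

```python
import os
from pathlib import Path

# Hypothetical environment variable; SeedSpark does not define it.
ENV_VAR = "MUSIC_SESSIONS_TSV"
# Assumed default location; adjust to wherever the download script writes.
DEFAULT_PATH = Path("datasets") / "music_sessions_data.tsv"


def resolve_dataset_path(env=os.environ, default=DEFAULT_PATH) -> Path:
    """Return the dataset path from the environment, else the default.

    Raises FileNotFoundError before any Spark job starts if the file
    does not exist.
    """
    candidate = Path(env.get(ENV_VAR, default)).expanduser()
    if not candidate.is_file():
        raise FileNotFoundError(f"dataset not found: {candidate}")
    return candidate.resolve()
```

The resolved path could then be passed to the Spark reader as a string, for example `str(resolve_dataset_path())`, so a wrong path surfaces immediately instead of partway through the job.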
