
PySpark Project Building Tool


PySpark CLI

PySpark CLI generates boilerplate code for a PySpark project based on user input.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

PySpark is the Python API for Spark.

Installation Steps:

git clone https://github.com/qburst/PySparkCLI.git

cd PySparkCLI

pip3 install -e . --user

Create a PySpark Project

pysparkcli create [PROJECT_NAME] --master [MASTER_URL] --cores [NUMBER]

master - The URL of the cluster the project connects to. You can also use -m instead of --master.
cores - The number of cores to use. You can also use -c instead of --cores.
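
To script project creation, the command above can be assembled as an argument list and handed to subprocess. This is a minimal sketch that uses only the flags documented here (--master and --cores). The helper names build_create_command and create_project are hypothetical and not part of PySpark CLI.

```python
import subprocess


def build_create_command(project_name, master=None, cores=None):
    """Return the argument list for `pysparkcli create` (hypothetical helper)."""
    command = ["pysparkcli", "create", project_name]
    if master is not None:
        command += ["--master", master]
    if cores is not None:
        # Command-line arguments must be strings.
        command += ["--cores", str(cores)]
    return command


def create_project(project_name, master=None, cores=None):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(
        build_create_command(project_name, master, cores), check=True
    )
```

For example, create_project("sample", master="local[*]", cores=2) would run the create command with a local Spark master URL. The project name and values are illustrative only.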

Run a PySpark Project

pysparkcli run [PROJECT_NAME]

PySpark Project Test cases

  • Run the tests by project name:
pysparkcli test [PROJECT_NAME]
  • Run an individual test case by name (for example, etl_job for the file test_etl_job.py):
pysparkcli test [PROJECT_NAME] -t [etl_job]
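
The example above suggests that the value passed to -t is the test file name without the test_ prefix and the .py extension. The sketch below spells out that naming convention. It is an assumption drawn from the example, not a description of how PySpark CLI resolves test files internally, and resolve_test_file is a hypothetical helper.

```python
from pathlib import Path


def resolve_test_file(project_dir, test_name):
    """Map a -t value such as "etl_job" to <project>/tests/test_etl_job.py.

    Assumes the test_<name>.py convention implied by the example above.
    """
    return Path(project_dir) / "tests" / f"test_{test_name}.py"
```

With this convention, resolve_test_file("sample", "etl_job") points at the test_etl_job.py file inside the project's tests directory.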

FAQ

Common issues while installing pysparkcli:

* pysparkcli: command not found
    Make sure the user's local bin directory is included in the PATH variable.
    Add the following code to your .bashrc file:

    # set PATH so it includes user's private bin if it exists
    if [ -d "$HOME/.local/bin" ] ; then
        PATH="$HOME/.local/bin:$PATH"
    fi
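
    To confirm that the change took effect, a short Python check can report whether ~/.local/bin is on PATH and whether the pysparkcli command can be found. This is a sketch using only the standard library; local_bin_on_path is a hypothetical helper name.

```python
import os
import shutil
from pathlib import Path


def local_bin_on_path(path_value, home):
    """Return True if <home>/.local/bin is one of the entries in path_value."""
    local_bin = str(Path(home) / ".local" / "bin")
    # Split on the platform separator and ignore empty entries and trailing slashes.
    entries = [entry.rstrip("/") for entry in path_value.split(os.pathsep) if entry]
    return local_bin in entries


if __name__ == "__main__":
    on_path = local_bin_on_path(os.environ.get("PATH", ""), Path.home())
    print("~/.local/bin on PATH:", on_path)
    # shutil.which returns None when the command cannot be found.
    print("pysparkcli found at:", shutil.which("pysparkcli"))
```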


* JAVA_HOME is not set
    Make sure JAVA_HOME points to your JDK and that the PYSPARK_PYTHON variable is set.
    You can add them manually in your .bashrc file:

    Example:

        export PYSPARK_PYTHON=python3
        export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

    Save the file and run the following to update the environment:

        source ~/.bashrc
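
    The same check can be made from Python before running a project. The sketch below only reports which of the two variables named above are unset or empty. It does not verify that JAVA_HOME points to a valid JDK, and missing_variables is a hypothetical helper name.

```python
import os

# The two variables named in the FAQ entry above.
REQUIRED_VARIABLES = ("JAVA_HOME", "PYSPARK_PYTHON")


def missing_variables(environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("JAVA_HOME and PYSPARK_PYTHON are set.")
```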

Project Structure

The basic project structure is as follows:

sample
├── __init__.py
├── src
│   ├── app.py
│   ├── configs
│   │   ├── etl_config.json
│   │   └── __init__.py
│   ├── __init__.py
│   ├── jobs
│   │   ├── etl_job.py
│   │   └── __init__.py
│   └── settings
│       ├── default.py
│       ├── __init__.py
│       ├── local.py
│       └── production.py
└── tests
    ├── __init__.py
    ├── test_data
    │   ├── employees
    │   │   └── part-00000-9abf32a3-db43-42e1-9639-363ef11c0d1c-c000.snappy.parquet
    │   └── employees_report
    │       └── part-00000-4a609ba3-0404-48bb-bb22-2fec3e2f1e68-c000.snappy.parquet
    └── test_etl_job.py

8 directories, 15 files
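
A quick way to check that a generated project matches this layout is to look for a few of the files shown in the tree. This is a sketch with a hand-picked subset of paths taken from the listing above; missing_paths is a hypothetical helper, and the layout may differ between PySpark CLI versions.

```python
from pathlib import Path

# A subset of the files from the tree above, relative to the project root.
EXPECTED_PATHS = (
    "__init__.py",
    "src/app.py",
    "src/configs/etl_config.json",
    "src/jobs/etl_job.py",
    "src/settings/default.py",
    "src/settings/local.py",
    "src/settings/production.py",
    "tests/test_etl_job.py",
)


def missing_paths(project_dir):
    """Return the expected relative paths that do not exist under project_dir."""
    root = Path(project_dir)
    return [relative for relative in EXPECTED_PATHS if not (root / relative).exists()]
```

Calling missing_paths("sample") on a freshly created project named sample should return an empty list if all of the listed files are present.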

Contribution Guidelines

Check out here for our contribution guidelines.

Sponsors

QBurst
