PySpark Project Building Tool

Project description

PySpark CLI

PySpark CLI generates boilerplate code for a PySpark project based on user input.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

PySpark is the Python API for Spark.

Installation Steps:

git clone https://github.com/qburst/PySparkCLI.git

cd PySparkCLI

pip3 install -e . --user

Create a PySpark Project

pysparkcli create [PROJECT_NAME] --master [MASTER_URL] --cores [NUMBER]

master - The URL of the cluster to connect to. You can also use -m instead of --master.
cores - The number of cores to use. You can also use -c instead of --cores.
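
For scripted use, the create command can be assembled programmatically. The following is a minimal sketch, not part of PySpark CLI itself: `build_create_command` is a hypothetical helper, and the project name `sample` and the values `local[*]` and `2` are illustrative assumptions. `local[*]` is a standard Spark master URL for local mode; how the tool combines it with the cores value is not documented here. The sketch uses only the flags described above.

```python
from typing import List


def build_create_command(project_name: str, master: str, cores: int) -> List[str]:
    """Build the argument list for `pysparkcli create` (hypothetical helper)."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return [
        "pysparkcli", "create", project_name,
        "--master", master,      # short form: -m
        "--cores", str(cores),   # short form: -c
    ]


if __name__ == "__main__":
    # Example values are assumptions; substitute your own project name and master URL.
    command = build_create_command("sample", "local[*]", 2)
    print(" ".join(command))
    # To execute it (requires pysparkcli on PATH):
    #   import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with master URLs that contain brackets.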

Run a PySpark Project

pysparkcli run [PROJECT_NAME]

PySpark Project Test cases

Running by Project name

pysparkcli test [PROJECT_NAME]

Running an individual test case by file name (for example, test_etl_job.py)

pysparkcli test [PROJECT_NAME] -t [etl_job]
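
The example above suggests that the value passed to -t is the test file name without the `test_` prefix and the `.py` extension. A minimal sketch of that mapping, assuming this reading is correct and that the square brackets are placeholder notation; `target_from_test_file` is a hypothetical helper, not part of PySpark CLI:

```python
from pathlib import Path


def target_from_test_file(filename: str) -> str:
    """Map 'test_etl_job.py' to 'etl_job', the value passed to -t (assumed convention)."""
    stem = Path(filename).stem  # drops any directory part and the .py extension
    if not stem.startswith("test_"):
        raise ValueError(f"expected a file named test_<name>.py, got {filename!r}")
    return stem[len("test_"):]


if __name__ == "__main__":
    print(target_from_test_file("test_etl_job.py"))
```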

FAQ

Common issues while installing pysparkcli:

* pysparkcli: command not found
    Make sure the user's local bin directory is on your PATH.
    Add the following to your .bashrc file:

    # set PATH so it includes user's private bin if it exists
    if [ -d "$HOME/.local/bin" ] ; then
        PATH="$HOME/.local/bin:$PATH"
    fi


* JAVA_HOME is not set
    Make sure JAVA_HOME points to your JDK and that the PYSPARK_PYTHON variable is set.
    You can add them manually in your .bashrc file:

    Example:

        export PYSPARK_PYTHON=python3
        export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

    Save the file and run the following to update the environment:

        source ~/.bashrc
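
Both issues above can be checked from Python before running the tool. This is a rough diagnostic sketch under stated assumptions: `diagnose_environment` is a hypothetical helper, and it only inspects the variables named in this FAQ (PATH, JAVA_HOME, PYSPARK_PYTHON). It does not verify that the JDK itself works.

```python
import os
import shutil
from pathlib import Path
from typing import Dict, Mapping, Optional


def diagnose_environment(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report whether the settings mentioned in the FAQ look correct (hypothetical helper)."""
    env = os.environ if env is None else env
    search_path = env.get("PATH", "")
    home = Path(env.get("HOME", str(Path.home())))
    local_bin = str(home / ".local" / "bin")
    java_home = env.get("JAVA_HOME", "")
    return {
        # Is the user's local bin directory listed in PATH?
        "local_bin_on_path": local_bin in search_path.split(os.pathsep),
        # Can the pysparkcli executable be resolved on that PATH?
        "pysparkcli_found": shutil.which("pysparkcli", path=search_path) is not None,
        # Is JAVA_HOME set and pointing at an existing directory?
        "java_home_set": bool(java_home) and Path(java_home).is_dir(),
        # Is PYSPARK_PYTHON set to a non-empty value?
        "pyspark_python_set": bool(env.get("PYSPARK_PYTHON")),
    }


if __name__ == "__main__":
    for check, ok in diagnose_environment().items():
        print(f"{check}: {'ok' if ok else 'MISSING'}")
```

Run it in a new shell after editing .bashrc, so that the updated variables are visible to the Python process.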

Project Structure

The basic project structure is as follows:

sample
├── __init__.py
├── src
│   ├── app.py
│   ├── configs
│   │   ├── etl_config.json
│   │   └── __init__.py
│   ├── __init__.py
│   ├── jobs
│   │   ├── etl_job.py
│   │   └── __init__.py
│   └── settings
│       ├── default.py
│       ├── __init__.py
│       ├── local.py
│       └── production.py
└── tests
    ├── __init__.py
    ├── test_data
    │   ├── employees
    │   │   └── part-00000-9abf32a3-db43-42e1-9639-363ef11c0d1c-c000.snappy.parquet
    │   └── employees_report
    │       └── part-00000-4a609ba3-0404-48bb-bb22-2fec3e2f1e68-c000.snappy.parquet
    └── test_etl_job.py

8 directories, 15 files
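
To confirm that a generated project matches the layout above, the tree can be checked with a short script. This is a sketch, not a feature of PySpark CLI: `missing_files` and `EXPECTED_FILES` are hypothetical names, and the two parquet files under tests/test_data are left out because their generated names may differ between projects.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the project tree shown above (parquet test data omitted).
EXPECTED_FILES = [
    "__init__.py",
    "src/__init__.py",
    "src/app.py",
    "src/configs/__init__.py",
    "src/configs/etl_config.json",
    "src/jobs/__init__.py",
    "src/jobs/etl_job.py",
    "src/settings/__init__.py",
    "src/settings/default.py",
    "src/settings/local.py",
    "src/settings/production.py",
    "tests/__init__.py",
    "tests/test_etl_job.py",
]


def missing_files(project_root: str) -> List[str]:
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # "sample" is the project name used in the tree above.
    missing = missing_files("sample")
    print("layout ok" if not missing else f"missing: {missing}")
```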

Contribution Guidelines

Check out here for our contribution guidelines.

Project details

Release history

0.0.8 (this version) - Dec 16, 2019
0.0.7 - Dec 16, 2019
0.0.5 - Dec 5, 2019
0.0.4 - Dec 5, 2019
0.0.3 - Dec 5, 2019
0.0.2 - Dec 4, 2019
0.0.1 - Dec 4, 2019

Download files

Source Distribution

pysparkcli-0.0.8.tar.gz (10.5 kB)

Uploaded Dec 16, 2019 Source

Built Distribution

pysparkcli-0.0.8-py3-none-any.whl (14.0 kB)

Uploaded Dec 16, 2019 Python 3

File details

Details for the file pysparkcli-0.0.8.tar.gz.

File metadata

Download URL: pysparkcli-0.0.8.tar.gz
Upload date: Dec 16, 2019
Size: 10.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.7.5

File hashes

Hashes for pysparkcli-0.0.8.tar.gz
Algorithm	Hash digest
SHA256	`21db2d754221c39b5c71cbf9a8e5b711d1cbc5f355b314f665fa3b410372b04f`
MD5	`5d7c1bbc9e5317e703fbebfac8443e13`
BLAKE2b-256	`586dc79aa299b71298f36ccb19a8cfeedab758314be161c0499a5e1575c9df06`

File details

Details for the file pysparkcli-0.0.8-py3-none-any.whl.

File metadata

Download URL: pysparkcli-0.0.8-py3-none-any.whl
Upload date: Dec 16, 2019
Size: 14.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/39.0.1 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.7.5

File hashes

Hashes for pysparkcli-0.0.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`16733a7aca36308e1f5cebb323cde40d83dfd01260e468a2f9a1c5f87d9446dd`
MD5	`6da1fc1146612ca79e465ef0d83f7050`
BLAKE2b-256	`39723aa96d896966c2f1aec777123dd0167bd2589b771165b6a756e7a9daf958`
