
PySpark Project Building Tool


PySpark CLI

PySpark CLI generates boilerplate code for a PySpark project based on user input.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

PySpark is the Python API for Spark.

Installation Steps:

git clone https://github.com/qburst/PySparkCLI.git

cd PySparkCLI

pip3 install -e . --user

Create a PySpark Project

pysparkcli create [PROJECT_NAME] --master [MASTER_URL] --cores [NUMBER]

master - The URL of the cluster to connect to. You can also use -m instead of --master.
cores - The number of cores to use. You can also use -c instead of --cores.
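
To make the two options more concrete, here is a minimal sketch of how a master URL and a core count are commonly expressed in Spark. It is an illustration only: the source does not say how pysparkcli applies --cores internally, so the helper names (`local_master`, `spark_conf`, `build_session`) and the choice of the `spark.cores.max` setting are assumptions, not the tool's actual code.

```python
from typing import Dict, Optional


def local_master(cores: Optional[int] = None) -> str:
    """Return a local-mode master URL.

    local[N] runs Spark locally with N worker threads;
    local[*] uses as many threads as there are logical cores.
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return f"local[{cores}]"


def spark_conf(master: str, cores: int) -> Dict[str, str]:
    """Build Spark settings from the two CLI values (hypothetical mapping)."""
    return {
        "spark.master": master,
        # Assumption: cap the cores requested from a standalone cluster.
        "spark.cores.max": str(cores),
    }


def build_session(app_name: str, conf: Dict[str, str]):
    """Create a SparkSession from the settings above (requires PySpark)."""
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For a cluster, the master URL would instead look like `spark://host:7077`, where the host and port are placeholders for your own deployment.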

Run a PySpark Project

pysparkcli run [PROJECT_NAME]

Project Structure

The basic project structure is as follows:

sample
├── __init__.py
├── src
│   ├── app.py
│   ├── configs
│   │   ├── etl_config.json
│   │   └── __init__.py
│   ├── __init__.py
│   ├── jobs
│   │   ├── etl_job.py
│   │   └── __init__.py
│   └── settings
│       ├── default.py
│       ├── __init__.py
│       ├── local.py
│       └── production.py
└── tests
    ├── __init__.py
    ├── test_data
    │   ├── employees
    │   │   └── part-00000-9abf32a3-db43-42e1-9639-363ef11c0d1c-c000.snappy.parquet
    │   └── employees_report
    │       └── part-00000-4a609ba3-0404-48bb-bb22-2fec3e2f1e68-c000.snappy.parquet
    └── test_etl_job.py

8 directories, 15 files
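
For readers who want to reproduce or inspect this layout by hand, the sketch below creates the same Python and JSON files as empty placeholders. It is not pysparkcli's generator: `LAYOUT` and `scaffold` are hypothetical names, file contents are left empty, and the two Parquet test data files are omitted.

```python
import tempfile
from pathlib import Path
from typing import List

# Relative paths taken from the tree above (Parquet test data omitted).
LAYOUT = [
    "__init__.py",
    "src/__init__.py",
    "src/app.py",
    "src/configs/__init__.py",
    "src/configs/etl_config.json",
    "src/jobs/__init__.py",
    "src/jobs/etl_job.py",
    "src/settings/__init__.py",
    "src/settings/default.py",
    "src/settings/local.py",
    "src/settings/production.py",
    "tests/__init__.py",
    "tests/test_etl_job.py",
]


def scaffold(root: Path) -> List[Path]:
    """Create empty placeholder files under root, mirroring the tree above."""
    created = []
    for rel in LAYOUT:
        path = root / rel
        # Create intermediate directories such as src/jobs as needed.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
        created.append(path)
    return created


if __name__ == "__main__":
    # Build the skeleton in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold(Path(tmp) / "sample"):
            print(path.relative_to(tmp))
```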

Contribution Guidelines

See our contribution guidelines.

Sponsors

QBurst


Files for pysparkcli, version 0.0.5

Filename                            Size      File type   Python version
pysparkcli-0.0.5-py3.6.egg          12.9 kB   Egg         3.6
pysparkcli-0.0.5-py3-none-any.whl   10.6 kB   Wheel       py3
pysparkcli-0.0.5.tar.gz             6.6 kB    Source      None
