PySpark Project Building Tool
PySpark CLI
PySpark CLI generates boilerplate code for a PySpark project based on user input.
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
PySpark is the Python API for Spark.
Installation Steps:
git clone https://github.com/qburst/PySparkCLI.git
cd PySparkCLI
pip3 install -e . --user
Create a PySpark Project
pysparkcli create [PROJECT_NAME] --master [MASTER_URL] --cores [NUMBER]
master - The URL of the cluster to connect to. You can also use -m instead of --master.
cores - The number of cores to use. You can also use -c instead of --cores.
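The short and long option forms are interchangeable. As a rough illustration only, the sketch below models the `create` arguments with Python's standard `argparse` module. It is not pysparkcli's actual source: `build_create_parser` is a hypothetical helper, and `sample`, `local[*]` and `2` are example values.

```python
import argparse


def build_create_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the `pysparkcli create` arguments."""
    parser = argparse.ArgumentParser(prog="pysparkcli create")
    parser.add_argument("project_name", help="name of the project to generate")
    # -m and --master are two spellings of the same option.
    parser.add_argument("-m", "--master", help="URL of the cluster to connect to")
    # -c and --cores are two spellings of the same option.
    parser.add_argument("-c", "--cores", type=int, help="number of cores")
    return parser


# Example values only: a project called "sample" on a local master with 2 cores.
long_form = build_create_parser().parse_args(
    ["sample", "--master", "local[*]", "--cores", "2"]
)
short_form = build_create_parser().parse_args(
    ["sample", "-m", "local[*]", "-c", "2"]
)

# Both spellings parse to the same values.
print(long_form == short_form)
```

Whichever form you type, the resulting values are identical, so scripts can use the shorter one without changing behavior.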
Run a PySpark Project
pysparkcli run [PROJECT_NAME]
Initiate Stream for Project
pysparkcli stream [PROJECT_NAME] [STREAM_FILE_NAME]
PySpark Project Test cases
- Running all test cases by project name:
pysparkcli test [PROJECT_NAME]
- Running an individual test case, for example the one in the file test_etl_job.py:
pysparkcli test [PROJECT_NAME] -t [etl_job]
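The example above suggests that the value given to `-t` is the test file name without the `test_` prefix and the `.py` extension. The sketch below spells out that assumed naming convention; `resolve_test_file` is a hypothetical helper, not part of pysparkcli.

```python
def resolve_test_file(name: str) -> str:
    """Map the value passed to -t onto a test file name.

    Assumed convention, inferred from the example: "etl_job" refers to
    the file "test_etl_job.py".
    """
    return f"test_{name}.py"


print(resolve_test_file("etl_job"))
```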
FAQ
Common issues while installing pysparkcli:
* pysparkcli: command not found
Make sure the user's local bin directory is in the PATH variable.
Add the following code to your .bashrc file:
# set PATH so it includes user's private bin if it exists
if [ -d "$HOME/.local/bin" ] ; then
    PATH="$HOME/.local/bin:$PATH"
fi
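To confirm the change took effect, a quick check along the following lines may help. This is a minimal sketch using only the Python standard library; `local_bin_on_path` is a hypothetical helper name.

```python
import os
import shutil
from pathlib import Path


def local_bin_on_path(path_value: str, home: Path) -> bool:
    """Return True if <home>/.local/bin is one of the entries in path_value."""
    local_bin = home / ".local" / "bin"
    entries = [Path(entry) for entry in path_value.split(os.pathsep) if entry]
    return local_bin in entries


# Check the current environment.
print("~/.local/bin on PATH:", local_bin_on_path(os.environ.get("PATH", ""), Path.home()))
# shutil.which returns None when the command cannot be found.
print("pysparkcli found at:", shutil.which("pysparkcli"))
```

Run it in a new shell, or after sourcing .bashrc, so that the updated PATH is visible.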
* JAVA_HOME is not set
Make sure JAVA_HOME points to your JDK and that the PYSPARK_PYTHON variable is set.
You can add them manually in your .bashrc file:
Example:
export PYSPARK_PYTHON=python3
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Save the file and run the following to update the environment:
source ~/.bashrc
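A small check like the one below can confirm that both variables are visible to Python. It is a sketch using only the standard library; `missing_vars` is a hypothetical helper.

```python
import os

# The two variables named in this FAQ entry.
REQUIRED_VARS = ("JAVA_HOME", "PYSPARK_PYTHON")


def missing_vars(env) -> list:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_vars(os.environ)
print("Missing:", ", ".join(missing) if missing else "none")
```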
Project Structure
The basic project structure is as follows:
sample
├── __init__.py
├── src
│   ├── app.py
│   ├── configs
│   │   ├── etl_config.json
│   │   └── __init__.py
│   ├── __init__.py
│   ├── jobs
│   │   ├── etl_job.py
│   │   └── __init__.py
│   └── settings
│       ├── default.py
│       ├── __init__.py
│       ├── local.py
│       └── production.py
└── tests
    ├── __init__.py
    ├── test_data
    │   ├── employees
    │   │   └── part-00000-9abf32a3-db43-42e1-9639-363ef11c0d1c-c000.snappy.parquet
    │   └── employees_report
    │       └── part-00000-4a609ba3-0404-48bb-bb22-2fec3e2f1e68-c000.snappy.parquet
    └── test_etl_job.py
8 directories, 15 files
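To verify that a generated project matches this layout, a check like the following can be used. It is a sketch only: `missing_paths` is a hypothetical helper, the path list is copied from the tree above (omitting the `__init__.py` files inside subdirectories and the generated parquet file names), and `sample` is the example project name.

```python
from pathlib import Path

# Paths taken from the tree above, relative to the project root.
EXPECTED_PATHS = [
    "__init__.py",
    "src/app.py",
    "src/configs/etl_config.json",
    "src/jobs/etl_job.py",
    "src/settings/default.py",
    "src/settings/local.py",
    "src/settings/production.py",
    "tests/test_etl_job.py",
    "tests/test_data/employees",
    "tests/test_data/employees_report",
]


def missing_paths(project_root: Path) -> list:
    """Return the expected paths that do not exist under project_root."""
    return [p for p in EXPECTED_PATHS if not (project_root / p).exists()]


# An empty list means every expected file and directory is present.
print(missing_paths(Path("sample")))
```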
Contribution Guidelines
Check out here for our contribution guidelines.