Seamlessly execute pyspark code on remote clusters

Project description

Under active development. Do not use in production.

How it works

Pyspark Proxy is made up of a client and a server. The client mimics the pyspark API, but whenever an object is created or a method is called, it sends a request to the API server. The API server then makes the corresponding call to the actual pyspark API.
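
The client/server split described above can be illustrated with a minimal, self-contained sketch. It is not the library's actual implementation or wire format: the class names (FakeDataFrame, ApiServer, ClientProxy) and the JSON request shape are hypothetical, and the "network" is replaced by a direct function call so the example runs without Spark.

```python
import json


class FakeDataFrame:
    """Hypothetical stand-in for a real pyspark DataFrame living on the server."""

    def __init__(self, rows):
        self._rows = rows

    def count(self):
        return len(self._rows)


class ApiServer:
    """Hypothetical server: receives a JSON request and calls the real object's method."""

    def __init__(self):
        self._objects = {}

    def register(self, object_id, obj):
        self._objects[object_id] = obj

    def handle(self, request_body):
        request = json.loads(request_body)
        target = self._objects[request["id"]]
        method = getattr(target, request["method"])
        return json.dumps({"result": method(*request["args"])})


class ClientProxy:
    """Hypothetical client: looks like the remote object, but forwards every call."""

    def __init__(self, object_id, send):
        self._object_id = object_id
        self._send = send

    def __getattr__(self, name):
        # Only reached for attributes the proxy itself lacks, i.e. the mimicked API.
        def call(*args):
            body = json.dumps(
                {"id": self._object_id, "method": name, "args": list(args)}
            )
            return json.loads(self._send(body))["result"]

        return call


server = ApiServer()
server.register("df-1", FakeDataFrame([{"a": 1}, {"a": 2}, {"a": 3}]))

# In a real deployment, send would be an HTTP request instead of a direct call.
df = ClientProxy("df-1", send=server.handle)
row_count = df.count()
print(row_count)
```

The caller writes df.count() as it would against a local object; the proxy turns that into a request and returns the server's result.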

What has been implemented

Currently, only basic functionality of the SparkContext, SQLContext and DataFrame classes has been implemented. See the tests for details on what currently works.

Getting Started

Pyspark Proxy requires a server to be set up where Spark is located. On the machine you want to execute code from, you only need to install the package.

On Server

Install pyspark proxy via pip:

pip install pysparkproxy

Set up the API server with spark-submit. The API server is what calls the functions in pyspark.

For example:

import pyspark_proxy.server as server

server.run()

Then start the server with spark-submit. Assuming the script above is saved as pyspark_proxy_server.py:

spark-submit pyspark_proxy_server.py

The server listens on localhost:5000 by default. You can change this by passing the host and port keyword arguments to server.run().
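
One way to make the host and port configurable without editing the script is to read them from the command line. The sketch below is only a suggestion: the --host and --port flags and the parse_server_args and main helpers are hypothetical names introduced here, not part of Pyspark Proxy. Only server.run() and its host and port keyword arguments come from the description above.

```python
import argparse


def parse_server_args(argv=None):
    """Parse hypothetical --host/--port flags into keyword arguments for server.run()."""
    parser = argparse.ArgumentParser(description="Start the Pyspark Proxy API server")
    parser.add_argument("--host", default="localhost")      # documented default host
    parser.add_argument("--port", type=int, default=5000)   # documented default port
    args = parser.parse_args(argv)
    return {"host": args.host, "port": args.port}


def main(argv=None):
    # Imported here so the argument parsing can be tried without Spark installed.
    import pyspark_proxy.server as server

    server.run(**parse_server_args(argv))


# Demonstrates the parsing only; main() is not called here.
example = parse_server_args(["--host", "0.0.0.0", "--port", "8080"])
print(example)
```

To use this as the server script, call main() at the bottom of the file and pass the options after the script name, for example spark-submit pyspark_proxy_server.py --host 0.0.0.0 --port 8080.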

Locally

Install pyspark proxy via pip:

pip install pysparkproxy

Now you can start a spark context and do some dataframe operations.

from pyspark_proxy import SparkContext
from pyspark_proxy.sql import SQLContext

sc = SparkContext(appName='pyspark_proxy_app')

sc.setLogLevel('ERROR')

sqlContext = SQLContext(sc)

df = sqlContext.read.json('my.json')

print(df.count())

Then run the script with the regular python binary, for example python my_app.py. The code behaves the same as it would if you ran it via spark-submit.

Source Distribution

PysparkProxy-0.0.9.tar.gz (11.5 kB)

File details

Details for the file PysparkProxy-0.0.9.tar.gz.

File metadata

  • Download URL: PysparkProxy-0.0.9.tar.gz
  • Upload date:
  • Size: 11.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.18.4 setuptools/28.8.0 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/2.7.13

File hashes

Hashes for PysparkProxy-0.0.9.tar.gz
Algorithm Hash digest
SHA256 303c7f432543880a619c11e806a0f09c6afcdf25014147cb68a20d08af1c9265
MD5 283388ce8167f8259e5262084bd9271b
BLAKE2b-256 b7b339e0883a6c0468207ac4a2b2ad1ca21c2de365c5bc4864408357c8684b5c
