A modular framework for creating applications in Apache Spark

These details have been verified by PyPI

Maintainers

alitajeldin bzhangusc cloudysail jacobdr laneb

These details have not been verified by PyPI

Project links

Homepage

Project description

Spark Modularized View (SMV)

Spark Modularized View enables users to build enterprise-scale applications on the Apache Spark platform.

SMV Quickstart

Installation

Pip Install

SMV is now distributed as a package on PyPI. It comes in two flavors: with and without a dependency on pyspark. The first is for users installing to a machine outside of a cluster, where pyspark is not already installed. The second is for users installing to a gateway machine in a cluster that already has Spark available in the environment.

Without Pyspark

pip install smv

With Pyspark

pip install smv[pyspark]
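
Which of the two commands applies depends on whether pyspark is already importable in the target environment. The following is a minimal sketch of that check. The helper name `suggest_install_command` is hypothetical and not part of SMV, and it only inspects the interpreter it runs in, so it says nothing about a Spark installation that is not visible to Python.

```python
import importlib.util


def suggest_install_command(module_name="pyspark"):
    """Return the pip command from this guide that matches the environment.

    Hypothetical helper, not part of SMV: it only checks whether
    `module_name` can be imported by the current interpreter.
    """
    if importlib.util.find_spec(module_name) is not None:
        # pyspark is already available, e.g. on a cluster gateway machine
        return "pip install smv"
    # pyspark is missing, so pull it in through the optional extra
    return "pip install smv[pyspark]"


if __name__ == "__main__":
    print(suggest_install_command())
```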

Docker

We strongly recommend using Docker to install SMV. Using Docker, start an SMV container with

docker run -it --rm tresamigos/smv

If Docker is not an option on your system, see the installation guide.

Create Example App

SMV provides a shell script to easily create template applications. We will use a simple example app to explore SMV.

$ smv-init -s MyApp

Run Example App

Run the entire application with

$ smv-run --run-app

This command must be run from the root of the project.

The output CSV file and schema can be found in the data/output directory. In the paths below, 'XXXXXXXX' stands for a number that acts like a version identifier for the module.

$ cat data/output/stage1.employment.EmploymentByState_XXXXXXXX.csv/part-* | head -5
"50",245058
"51",2933665
"53",2310426
"54",531834
"55",2325877

$ cat data/output/stage1.employment.EmploymentByState_XXXXXXXX.schema/part-*
@delimiter = ,
@has-header = false
@quote-char = "
ST: String[,_SmvStrNull_]
EMP: Long
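
The schema listing above mixes two kinds of lines: attribute lines of the form `@key = value` and column lines of the form `NAME: Type`. As an illustration only, the sketch below splits such text into the two groups. The function `parse_schema` is hypothetical and not an SMV API, and it assumes nothing about the format beyond the two line shapes visible in the listing.

```python
def parse_schema(text):
    """Split schema text into (attributes, columns).

    Assumes only the two line shapes visible in the listing above:
    '@key = value' attribute lines and 'NAME: Type' column lines.
    """
    attributes = {}
    columns = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Attribute line: drop the '@' and split on the first '='
            key, _, value = line[1:].partition("=")
            attributes[key.strip()] = value.strip()
        else:
            # Column line: split on the first ':' into name and type spec
            name, _, type_spec = line.partition(":")
            columns.append((name.strip(), type_spec.strip()))
    return attributes, columns


# The schema text exactly as shown in the listing above.
SCHEMA_TEXT = '''@delimiter = ,
@has-header = false
@quote-char = "
ST: String[,_SmvStrNull_]
EMP: Long
'''

attributes, columns = parse_schema(SCHEMA_TEXT)
print(attributes)
print(columns)
```

The type specification is kept as an opaque string here, since the meaning of the bracketed part in `String[,_SmvStrNull_]` is not described in this quickstart.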

Edit Example App

The EmploymentByState module is defined in src/python/stage1/employment.py:

class EmploymentByState(SmvModule, SmvOutput):
    """Python ETL Example: employment by state"""

    def requiresDS(self):
        return [inputdata.Employment]

    def run(self, i):
        df = i[inputdata.Employment]
        df1 = df.groupBy(col("ST")).agg(sum(col("EMP")).alias("EMP"))
        return df1

The run method of a module defines the operations needed to produce the output from the input. We would like to keep only the states whose total employment is greater than 1,000,000. To accomplish this, we add a filter to the run method:

    def run(self, i):
        df = i[inputdata.Employment]
        df1 = df.groupBy(col("ST")).agg(sum(col("EMP")).alias("EMP"))
        df2 = df1.filter(col("EMP") > lit(1000000))
        return df2
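
To make the effect of the group, sum, and filter steps concrete without a Spark session, here is a plain-Python sketch of the same logic. It is not how SMV or Spark execute the module, and the function name `employment_over` is hypothetical. The input rows are the five per-state totals from the first output listing above.

```python
from collections import defaultdict


def employment_over(rows, threshold):
    """Sum EMP per ST, then keep states whose total exceeds `threshold`."""
    totals = defaultdict(int)
    for state, emp in rows:
        totals[state] += emp  # plays the role of groupBy("ST") + sum("EMP")
    # Plays the role of filter(col("EMP") > threshold);
    # sorted by state code so the result order is stable.
    return sorted((st, emp) for st, emp in totals.items() if emp > threshold)


# The five per-state totals shown in the first output listing above.
rows = [("50", 245058), ("51", 2933665), ("53", 2310426),
        ("54", 531834), ("55", 2325877)]

print(employment_over(rows, 1000000))
# [('51', 2933665), ('53', 2310426), ('55', 2325877)]
```

States 50 and 54 fall below the threshold and are dropped.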

Now run the module again with

$ smv-run --run-app

(Make sure you run this from the root of the project.)

Inspect the new output to see the changes.

$ cat data/output/stage1.employment.EmploymentByState_XXXXXXXX.csv/part-* | head -5
"51",2933665
"53",2310426
"55",2325877
"01",1501148
"04",2027240

$ cat data/output/stage1.employment.EmploymentByState_XXXXXXXX.schema/part-*
@delimiter = ,
@has-header = false
@quote-char = "
ST: String[,_SmvStrNull_]
EMP: Long
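
Because the 'XXXXXXXX' part of the directory name is not known in advance, inspecting the output from Python means matching it with a wildcard, much like the `cat ... | head -5` commands above. The following is a rough sketch; `head_output` is a hypothetical helper, not part of SMV, and it assumes the `<module name>_<number>.csv/part-*` layout shown in this guide. If several versions of the same module sit in the directory, their rows will be mixed.

```python
import csv
import glob
import itertools
import os


def head_output(output_dir, module_fqn, limit=5):
    """Return up to `limit` CSV rows from a module's part files."""
    # Wildcard stands in for the unknown 'XXXXXXXX' part of the name
    pattern = os.path.join(output_dir, module_fqn + "_*.csv", "part-*")
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            remaining = limit - len(rows)
            rows.extend(itertools.islice(csv.reader(handle), remaining))
        if len(rows) >= limit:
            break
    return rows


if __name__ == "__main__":
    for row in head_output("data/output", "stage1.employment.EmploymentByState"):
        print(row)
```

The csv reader returns every field as a string, so the EMP values need an explicit `int(...)` conversion before any arithmetic.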

Publish to Hive Table

If you would like to publish your module to a Hive table, add a tableName method to EmploymentByState. It should return the name of the Hive table as a string.

class EmploymentByState(SmvModule, SmvOutput):
    ...
    def tableName(self):
        return "myTableName"
    def requiresDS(self): ...
    def run(self, i): ...

Then publish the module with

$ smv-run --publish-hive -m stage1.employment.EmploymentByState

smv-pyshell

We can also view the results in the smv-pyshell. To start the shell, run

$ smv-pyshell

To get the DataFrame of EmploymentByState,

>>> x = df('stage1.employment.EmploymentByState')

To peek at the first row of results,

>>> x.peek(1)
ST:String            = 50
EMP:Long             = 245058
cat_high_emp:Boolean = false

See the user guide for further examples and documentation.

Contributions

Please see SMV Development Best Practices.

Project details

Release history

2.post13

Sep 19, 2019

2.post12

Apr 22, 2019

2.post11

Mar 27, 2019

2.post10

Feb 25, 2019

2.post9

Feb 8, 2019

This version

2.post8

Feb 5, 2019

2.post7

Jan 16, 2019

2.post6

Dec 11, 2018

2.post5

Nov 7, 2018

Download files

Source Distribution

smv-2.post8.tar.gz (5.9 MB)

Uploaded Feb 5, 2019 Source

Built Distribution

smv-2.post8-py2.py3-none-any.whl (6.1 MB)

Uploaded Feb 5, 2019 Python 2, Python 3

File details

Details for the file smv-2.post8.tar.gz.

File metadata

Download URL: smv-2.post8.tar.gz
Upload date: Feb 5, 2019
Size: 5.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/40.4.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/2.7.15

File hashes

Hashes for smv-2.post8.tar.gz
Algorithm	Hash digest
SHA256	`8fdf50b31358e5d22021231ab5dff48428d00edcc76f236e1a217d5babf05cc5`
MD5	`3baf189962800f047fd3da5edaa69ebd`
BLAKE2b-256	`db29275357eb34899ac76dffc31ff5bcbc31c2432d9ab5f03dcd00a8464f0bcd`

File details

Details for the file smv-2.post8-py2.py3-none-any.whl.

File metadata

Download URL: smv-2.post8-py2.py3-none-any.whl
Upload date: Feb 5, 2019
Size: 6.1 MB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/40.4.3 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/2.7.15

File hashes

Hashes for smv-2.post8-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`06bca081224bbc14d88b744823902f3f3ca89ce312afa22fae73a9f515df5dab`
MD5	`78086712f9c6d46138445c6104733c01`
BLAKE2b-256	`62356caf473ca71b2582953fe83dbd39918c7186725d71a73c7466240949c314`
