A modular framework for creating applications in Apache Spark

<img height="128" src="https://github.com/TresAmigosSD/SMV/raw/master/docs/images/smv-logo-100px.png"/>

# Spark Modularized View (SMV)

[![Build Status](https://travis-ci.org/TresAmigosSD/SMV.svg?branch=master)](https://travis-ci.org/TresAmigosSD/SMV)
[![Join the chat at https://gitter.im/TresAmigosSD/SMV](https://badges.gitter.im/TresAmigosSD/SMV.svg)](https://gitter.im/TresAmigosSD/SMV?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)


Spark Modularized View enables users to build enterprise-scale applications on the Apache Spark platform.

* [Quick Start](#smv-quickstart)
* [User Guide](docs/user/0_user_toc.md)
* [Python API docs](http://tresamigossd.github.io/SMV/pythondocs/2r11/index.html)

# SMV Quickstart

## Installation

### Pip Install

SMV is now [distributed as a package on PyPI](https://pypi.org/project/smv/). It comes in two flavors -- with and without a dependency on `pyspark`. The flavor with the dependency is for users installing to a machine outside of a cluster that does not already have `pyspark` installed, while the flavor without it is for users installing to a gateway machine in a cluster that already has Spark available in the environment.

#### Without Pyspark

```bash
pip install smv
```

#### With Pyspark

```bash
pip install "smv[pyspark]"
```
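
If you are unsure which flavor applies to a given machine, one way to decide is to check whether `pyspark` is already importable. The sketch below is only an illustration: the helper name `smv_pip_requirement` is made up here and is not part of SMV, and it relies solely on the standard-library `importlib.util.find_spec` (Python 3). A Spark installation that is not on the Python path will not be detected by this check.

```python
import importlib.util


def smv_pip_requirement(module_name="pyspark"):
    """Return the pip requirement string to use for SMV.

    Hypothetical helper, not part of SMV: it returns "smv[pyspark]" when
    the given module cannot be imported, and plain "smv" otherwise.
    """
    if importlib.util.find_spec(module_name) is None:
        # pyspark is missing, so ask pip to pull it in as an extra.
        return "smv[pyspark]"
    # pyspark is already importable, so the plain package is enough.
    return "smv"


if __name__ == "__main__":
    print("pip install \"{}\"".format(smv_pip_requirement()))
```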

### Docker

We strongly recommend using [Docker](https://docs.docker.com/engine/installation/) to install SMV. Using Docker, start an SMV container with

```
docker run -it --rm tresamigos/smv
```

If Docker is not an option on your system, see the [installation guide](docs/user/smv_install.md).

## Create Example App

SMV provides a shell script to easily create template applications. We will use a simple example app to explore SMV.

```shell
$ smv-init -s MyApp
```

## Run Example App

Run the entire application with

```shell
$ smv-run --run-app
```

This command must be run from the root of the project.

The output CSV file and schema can be found in the `data/output` directory. Note that `XXXXXXXX` here stands in for a number that acts like the version of the module.

```shell
$ cat data/output/stage1.employment.EmploymentByState_XXXXXXXX.csv/part-* | head -5
"50",245058
"51",2933665
"53",2310426
"54",531834
"55",2325877

$ cat data/output/stage1.employment.EmploymentByState_XXXXXXXX.schema/part-*
@delimiter = ,
@has-header = false
@quote-char = "
ST: String[,_SmvStrNull_]
EMP: Long
```
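
The schema file above is plain text, so it can be inspected outside of SMV as well. The following is a minimal sketch that assumes the layout is exactly what the sample shows: lines starting with `@` are `key = value` attributes, and every other line is a `name: type` field. The function name `parse_schema` is hypothetical and is not an SMV API.

```python
def parse_schema(text):
    """Split schema text into (attributes, fields).

    Assumes the layout of the sample above: "@key = value" lines are
    attributes, all other non-empty lines are "name: type" fields.
    """
    attributes = {}
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            # Split on the first "=" only, so the value may be any character.
            key, _, value = line[1:].partition("=")
            attributes[key.strip()] = value.strip()
        else:
            # Split on the first ":" only, keeping the full type spec intact.
            name, _, type_spec = line.partition(":")
            fields.append((name.strip(), type_spec.strip()))
    return attributes, fields


# The schema text printed by the `cat` command above.
sample = '''@delimiter = ,
@has-header = false
@quote-char = "
ST: String[,_SmvStrNull_]
EMP: Long
'''

attributes, fields = parse_schema(sample)
print(attributes)
print(fields)
```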

## Edit Example App

The `EmploymentByState` module is defined in `src/python/stage1/employment.py`:

```python
class EmploymentByState(SmvModule, SmvOutput):
    """Python ETL Example: employment by state"""

    def requiresDS(self):
        return [inputdata.Employment]

    def run(self, i):
        df = i[inputdata.Employment]
        df1 = df.groupBy(col("ST")).agg(sum(col("EMP")).alias("EMP"))
        return df1
```
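
To build intuition for how `requiresDS` and `run` fit together, here is a toy model in plain Python. It is not SMV's implementation: `ToyModule` and `resolve` are hypothetical names, lists of tuples stand in for DataFrames, and the input rows are made up. It only illustrates the idea that a module declares its dependencies and then receives their results, keyed by dependency, in `run`.

```python
class ToyModule(object):
    """Hypothetical stand-in for a module base class (not SMV's)."""

    def requiresDS(self):
        return []

    def run(self, i):
        raise NotImplementedError


class Employment(ToyModule):
    def run(self, i):
        # Made-up (state code, employee count) rows for illustration only.
        return [("01", 10), ("01", 5), ("02", 7)]


class EmploymentByState(ToyModule):
    def requiresDS(self):
        return [Employment]

    def run(self, i):
        # Sum employee counts per state, like groupBy("ST") + sum("EMP").
        totals = {}
        for st, emp in i[Employment]:
            totals[st] = totals.get(st, 0) + emp
        return sorted(totals.items())


def resolve(module_cls, cache=None):
    """Run dependencies first, then the module itself; cache each result."""
    cache = {} if cache is None else cache
    if module_cls not in cache:
        module = module_cls()
        inputs = {dep: resolve(dep, cache) for dep in module.requiresDS()}
        cache[module_cls] = module.run(inputs)
    return cache[module_cls]


result = resolve(EmploymentByState)
print(result)
```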

The `run` method of a module defines the operations needed to get the output based on the input. We would like to keep only the states whose total employment is greater than 1,000,000. To accomplish this, we need to add a filter to the `run` method:

```python
def run(self, i):
    df = i[inputdata.Employment]
    df1 = df.groupBy(col("ST")).agg(sum(col("EMP")).alias("EMP"))
    df2 = df1.filter((col("EMP") > lit(1000000)))
    return df2
```
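
To see what the added filter does without starting Spark, the sketch below applies the same threshold in plain Python to the five rows printed in the earlier output. It is only an illustration of the filter's effect on those rows, not how SMV or Spark evaluates it.

```python
# First five (ST, EMP) rows from the unfiltered output shown earlier.
rows = [
    ("50", 245058),
    ("51", 2933665),
    ("53", 2310426),
    ("54", 531834),
    ("55", 2325877),
]

THRESHOLD = 1000000

# Plain-Python counterpart of df1.filter(col("EMP") > lit(1000000)).
kept = [(st, emp) for st, emp in rows if emp > THRESHOLD]

for st, emp in kept:
    print(st, emp)
```

States `50` and `54` fall below the threshold and are dropped; `51`, `53` and `55` remain.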

Now run the module again with

```shell
$ smv-run --run-app
```
(make sure you run this from the root of the project)

Inspect the new output to see the changes.

```shell
$ cat data/output/stage1.employment.EmploymentByState_XXXXXXXX.csv/part-* | head -5
"51",2933665
"53",2310426
"55",2325877
"01",1501148
"04",2027240

$ cat data/output/stage1.employment.EmploymentByState_XXXXXXXX.schema/part-*
@delimiter = ,
@has-header = false
@quote-char = "
ST: String[,_SmvStrNull_]
EMP: Long
```
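
The same files can also be read from a Python script. The following is a minimal sketch using only the standard library (Python 3), assuming the output matches the schema shown above: comma-delimited, no header, `"` as the quote character, and two columns (`ST`, `EMP`). The function names are hypothetical and not part of SMV.

```python
import csv
import glob
import os


def read_output_rows(csv_dir):
    """Read every part-* file under csv_dir and return a list of rows."""
    rows = []
    for path in sorted(glob.glob(os.path.join(csv_dir, "part-*"))):
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter=",", quotechar='"')
            # Skip blank lines, which csv.reader yields as empty lists.
            rows.extend(row for row in reader if row)
    return rows


def employment_by_state(csv_dir):
    """Map each state code (kept as a string) to its integer EMP total."""
    return {st: int(emp) for st, emp in read_output_rows(csv_dir)}
```

Because the numeric suffix of the directory name varies, a wildcard such as `glob.glob("data/output/stage1.employment.EmploymentByState_*.csv")` is one way to locate the directory to pass in.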

### Publish to Hive Table

If you would like to publish your module to a Hive table, add a `tableName` method to `EmploymentByState`. It should return the name of the Hive table as a string.

```python
class EmploymentByState(SmvModule, SmvOutput):
    ...
    def tableName(self):
        return "myTableName"
    def requiresDS(self): ...
    def run(self, i): ...
```

Then use
```bash
$ smv-run --publish-hive -m stage1.employment.EmploymentByState
```

## smv-pyshell

We can also view the results in the smv-pyshell. To start the shell, run

```
$ smv-pyshell
```

To get the `DataFrame` of `EmploymentByState`,

```shell
>>> x = df('stage1.employment.EmploymentByState')
```

To peek at the first row of results,

```shell
>>> x.peek(1)
ST:String = 50
EMP:Long = 245058
cat_high_emp:Boolean = false
```

See the [user guide](docs/user/0_user_toc.md) for further examples and documentation.



# Contributions

Please see [SMV Development Best Practices](docs/dev/00_DevProcess/best_practice.md).

