
Airflow Livy Operators


Lets Airflow DAGs run Spark jobs via Livy:

  • Sessions,
  • Batches (this mode supports additional verification via Spark/YARN REST API).

See this blog post for more information and a detailed comparison of ways to run Spark jobs from Airflow.

Directories and files of interest

  • airflow_home/plugins: Airflow Livy operators' code.
  • airflow_home/dags: example DAGs for Airflow.
  • batches: Spark jobs code, to be used in Livy batches.
  • sessions: (Optionally) templated Spark code for Livy sessions.
  • helper.sh: helper shell script. Can be used to run sample DAGs, prepare the development environment and more. Run it to find out what other commands are available.

How do I...

...run the examples?

Prerequisites:

Now,

  1. Optional - this step can be skipped if you're mocking a cluster on your machine. Open helper.sh: inside the init_airflow() function you'll see Airflow Connections for Livy, Spark and YARN. Redefine them as appropriate.
  2. Run ./helper.sh up to bring up the whole infrastructure. The Airflow UI will be available at localhost:8888.
  3. Press Ctrl+C to stop Airflow, then run ./helper.sh down to dispose of any remaining Airflow processes (this shouldn't be required if everything goes well; run it if you can't start Airflow again due to uninformative errors).

... use it in my project?

pip install airflow-livy-operators

This is how you import them:

from airflow_livy.session import LivySessionOperator
from airflow_livy.batch import LivyBatchOperator
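
For instance, you could wire the batch operator into a DAG roughly like this. This is a minimal sketch, not taken from the bundled example DAGs: every argument except task_id is an assumption here (including the parameter names and paths), so check the operators' docstrings under airflow_home/plugins for the exact signatures.

from datetime import datetime

from airflow import DAG
from airflow_livy.batch import LivyBatchOperator

# Minimal sketch of a DAG with a single Livy batch task.
# NOTE: parameter names other than task_id are assumptions; consult the
# operator's docstring for the actual signature.
with DAG(
    "livy_batch_sketch",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:
    join_files = LivyBatchOperator(
        task_id="join_2_files",
        file="file:///path/to/join_2_files.py",  # hypothetical location of the job
        arguments=["file:///data/grades.csv", "file:///data/ssn-address.tsv"],
    )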

... set up the development environment?

Alright, you want to contribute and need to be able to run everything on your machine, as well as enjoy the usual niceties that come with IDEs (debugging, syntax highlighting).

  • Run ./helper.sh dev to install all dev dependencies.
  • ./helper.sh updev runs Airflow with the local operators' code (as opposed to pulling them from PyPI). Useful for development.
  • (PyCharm-specific) Point PyCharm to your newly created virtual environment: go to "Preferences" -> "Project: airflow-livy-operators" -> "Project interpreter", select "Existing environment" and pick the python3 executable from the venv folder (venv/bin/python3).
  • ./helper.sh cov - run tests with a coverage report (saved to htmlcov/).
  • ./helper.sh lint - highlight code style errors.
  • ./helper.sh format - reformat all code (Black + isort).

... debug?

  • (PyCharm-specific) Step-by-step debugging with airflow test, as well as running PySpark batch jobs locally (also with debugging), is supported via run configurations under .idea/runConfigurations. You shouldn't have to do anything to use them - just open the folder in PyCharm as a project.
  • An example of how a batch can be run on local Spark:
python ./batches/join_2_files.py \
"file:////Users/vpanov/data/vpanov/bigdata-docker-compose/data/grades.csv" \
"file:///Users/vpanov/data/vpanov/bigdata-docker-compose/data/ssn-address.tsv" \
-file1_sep=, -file1_header=true \
-file1_schema="\`Last name\` STRING, \`First name\` STRING, SSN STRING, Test1 INT, Test2 INT, Test3 INT, Test4 INT, Final INT, Grade STRING" \
-file1_join_column=SSN -file2_header=false \
-file2_schema="\`Last name\` STRING, \`First name\` STRING, SSN STRING, Address1 STRING, Address2 STRING" \
-file2_join_column=SSN -output_header=true \
-output_columns="file1.\`Last name\` AS LastName, file1.\`First name\` AS FirstName, file1.SSN, file2.Address1, file2.Address2" 

# Optionally append to save result to file
#-output_path="file:///Users/vpanov/livy_batch_example" 
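
For reference, the scripts in batches/ are plain PySpark programs that Livy submits as batch jobs. Below is a heavily simplified sketch of the join above - not the actual join_2_files.py, with hard-coded paths and no schema handling, purely for illustration:

from pyspark.sql import SparkSession

# Simplified sketch: read two files, join them on SSN, show the result.
# The real batches/join_2_files.py takes paths, separators, schemas and
# output options from command-line arguments instead of hard-coding them.
spark = SparkSession.builder.appName("join_2_files_sketch").getOrCreate()

grades = spark.read.csv("file:///data/grades.csv", header=True, sep=",")
addresses = spark.read.csv("file:///data/ssn-address.tsv", header=False, sep="\t")

# In the headerless TSV, the third column (_c2) holds the SSN.
joined = grades.join(addresses, grades["SSN"] == addresses["_c2"])
joined.show()

spark.stop()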

TODO

  • helper.sh - replace with modern tools (e.g. pipenv + Docker image)
  • Disable some flake8 flags for cleaner code

