Skip to main content

Enabling meticulous logging for Spark Applications

Project description

Exelog: Meticulous logging for Apache Spark

Exelog: Meticulous logging for Apache Spark

Enabling meticulous logging for Spark Applications

Exelog is a refactored logging module that provides a decorator based approach to ensure standard Python logging from PySpark Executor nodes also.

Why?

The problem: logging from Spark executors doesn't work with normal logging

In Apache Spark, the actual data processing is done in what's called "executors", which are separate processes that are separate from the "driver" program. For end users, full control is only over the driver process, but not so much over the executor processes.

For example, in the PySpark driver program we can set up standard python logging as desired, but this setup is not replicated in the executor processes. There are no out-of-the-box logging handlers for executors, so all logging messages from executors are lost. Since Python 3.2 however, when there are no handlers, there is still a "improvised" handler that will show warning() and error() messages in their bare format on standard error, but for proper logging we probably need something more flexible and powerful than that.

Illustration in interactive PySpark shell:

>>> import os, logging
>>> logging.basicConfig(level=logging.INFO)

>>> def describe():
...     return "pid: {p}, root handlers: {h}".format(p=os.getpid(), h=logging.root.handlers)
... 
>>> describe()
'pid: 8915, root handlers: [<StreamHandler <stderr> (NOTSET)>]'

>>> sc.parallelize([4, 1, 7]).map(lambda x: describe()).collect()
['pid: 9111, root handlers: []', 'pid: 9128, root handlers: []', 'pid: 9142, root handlers: []']

The initial describe() happens in the driver and has root handlers because of the basicConfig() beforehand. However, the describe() calls in the map() happen in separate executor processes (note the different PIDs) and got no root handlers.

Exelog: Logging Spark executor execution one decorator at a time

Various ways for setting up logging for executors may be found on the internet. It usually entails sending and loading a separate file containing logging configuration code. Depending on the use case, managing this file may be difficult. One of the approaches is here

In contrast, Exelog takes a decorator based approach. We just have to decorate the data processing functions which we are passing to map(), filter(), sortBy(), etc.

A very minimal example:

@exelog.enable_info_logging
def process(x):
    logger.info("Got {x}".format(x=x))
    return x * x

result = rdd.map(process).collect()

What will happen here is that the first time process() is called in the executor, basic logging is set up with INFO level, so that logging messages are not lost.

Options and finetuning

The enable_exelog decorator will do a basic logging setup using logging.basicConfig(), and desired options can be directly provided to the decorator as illustrated in the following example using the interactive PySpark shell:

>>> import logging
>>> from exelog import enable_exelog
>>> logger = logging.getLogger("example")

>>> @enable_exelog(level=logging.INFO)
... def process(x):
...     logger.info("Got {x}".format(x=x))
...     return x * x
... 
>>> sc.parallelize(range(5)).map(process).collect()
INFO:example:Got 0
INFO:example:Got 1
INFO:example:Got 3
INFO:example:Got 2
INFO:example:Got 4
[0, 1, 4, 9, 16]

To improve readability or code reuse, you can of course predefine decorators:

with_logging = enable_exelog(
    level=logging.INFO,
    format="[%(process)s/%(name)s] %(levelname)s %(message)s"
)

@with_logging
def process(x):
    ...

exelog also defines some simple predefined decorators:

# Predefined decorator for stderr/NOTSET logging
enable_notset_logging = enable_exelog(level=logging.NOTSET)

# Predefined decorator for stderr/DEBUG logging
enable_debug_logging = enable_exelog(level=logging.DEBUG)

# Predefined decorator for stderr/INFO logging
enable_info_logging = enable_exelog(level=logging.INFO)

# Predefined decorator for stderr/WARN logging
enable_warn_logging = enable_exelog(level=logging.WARN)

# Predefined decorator for stderr/ERROR logging
enable_error_logging = enable_exelog(level=logging.ERROR)

# Predefined decorator for stderr/CRITICAL logging
enable_critical_logging = enable_exelog(level=logging.CRITICAL)

Fine-grained logging set up

If the logging.basicConfig() API is not flexible enough for your desired setup, you can also inject more advanced setup code with the initialized_call decorator. This decorator is not limited to logging setup, it just expects a callable (that can be called without arguments). A very simple example:

@exelog.initialized_call(lambda: print("Executor logging enabled"))
def process(x):
    ....

This will print "Executor logging enabled" the first time the process function is called in each executor.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

exelog-0.0.1.tar.gz (8.4 kB view details)

Uploaded Source

Built Distribution

exelog-0.0.1-py3-none-any.whl (8.7 kB view details)

Uploaded Python 3

File details

Details for the file exelog-0.0.1.tar.gz.

File metadata

  • Download URL: exelog-0.0.1.tar.gz
  • Upload date:
  • Size: 8.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.0

File hashes

Hashes for exelog-0.0.1.tar.gz
Algorithm Hash digest
SHA256 1a17d9532e6d7533be169a3e6f36b64cd574b8d995e26566e51c27728de2b85a
MD5 e46de7ecdc8e38a4f9823ed541535ce6
BLAKE2b-256 e9c538f4b468be3508abb7c095a0132a8edb3225f6a43da620c84c0ac543f578

See more details on using hashes here.

File details

Details for the file exelog-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: exelog-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 8.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.0

File hashes

Hashes for exelog-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c67039d194698abafa6f16c3205684d15f3b3e5183d932e2722a50f5bdccda70
MD5 50252f419c6f83a70a5241fc8bc7757c
BLAKE2b-256 6a2e0e405791da54a602d0a40f22fc7b848aa0bd57a28435925b4ac83a3751db

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page