Skip to main content

Some Kafka utility functions and patterns for the nubium project

Project description

Nubium Utils

Environment Variables

Nubium-Utils now has a prefix for every environment variable it uses. Every environment variable related to Nubium-Utils starts with NU_.

Similarly, anything DUDE-related will start with DUDE_.

Important Environment Variables

Nubium-Utils relies on having these environment variables defined:

  • NU_SCHEMA_REGISTRY_URL: schema registry url
  • NU_SCHEMA_REGISTRY_USERNAME: schema registry username (if applicable)
  • NU_SCHEMA_REGISTRY_PASSWORD: schema registry password (if applicable)
  • NU_KAFKA_CLUSTERS_CONFIGS_JSON: json of clusters and their respective connection settings. EX: '{"cluster_0": {"url": "url", "username": "un", "password": "pw"}, "cluster_1": {"url": "url", "username": "un", "password": "pw"}}'
  • NU_TOPIC_CONFIGS_JSON: json of topics and their respective cluster name (will reference NU_KAFKA_CLUSTERS) + (optional) create settings. EX: '{"topic_a": {"cluster": "cluster_0", "configs": {"num_partitions": 2, "replication_factor": 2, "config": {"": 120000}}}, "topic_b": {"cluster": cluster_1}}'
  • NU_HOSTNAME: used for identifying unique consumer instances
  • NU_APP_NAME: used to define your consumer group
  • NU_MP_PROJECT: used for openshift deployment

Confluent-Kafka GTFO App Classes


Nubium Utils now has an app framework for managing exactly-once processing confluent-kafka applications, named GTFO. The idea is to simplify confluent-kafka down to a kafka-streams-like interface, where consumer management is largely handled for you under the hood. This is particularly nice for having to manage exactly-once processing, which the class implements and uses by default.

There are some other subclasses for some more specific use cases, namely GtfoBatchApp and GtfoTableApp, but the base class GtfoApp has been written with the intention of being easily extensible. Further details in terms of the recommended approaches of this will be described below.

The Transaction and Gtfo classes

There are two essential classes you should understand before you dive in.

The Transaction classes are actually the heart of the GTFO framework; they are what your business logic methods will actually be interacting with on a message-to-message basis. That being said, for most use cases, you wont even need to do much with them other than Transaction.messages() to get the currently consumed message, and Transaction.produce() to send a new message out.

The Gtfo-based classes generally wrap/interface with the Transaction objects, generating a new one for every new transaction (which generally consists of consuming a message, produce desired messages, commit consumed message.) In general, you will really only use the Gtfo class to initialize everything, and it doesn't take much.

Finally, as a general rule of thumb for both, there are not many methods to interact with on either class... on purpose! The functionality outlined in here will likely cover >90% of use cases.

Initializing/running a GtfoApp: basic example

NOTE: there is a lot that is managed for you via environment variables, so definitely take a look at the "Important Environment Variables" section to see what you should have defined before trying to run a GTFO app.

There are two basic components you need to initialize an app at all:

  • app_function: the business logic of your application, which should use exactly 1 argument: the Transaction objects that will be handed to it for every message.

  • consume_topics_list: the topics you are consuming. Can be comma-separated string or python list.

That's it! That being said, this is if your app only consumes. To produce, you will additionally need, at minimum:

  • produce_topic_schema_dict: a dict that maps {'topic_name': schema_obj}, where the schema_obj is a valid avro schema dict.

Then, to run the app, do:

Altogether, that might look something like this:

from nubium_utils.confluent_utils.transaction_utils import GtfoApp

useless_schema = {
    "name": "CoolSchema",
    "type": "record",
    "fields": [
            "name": "cool_field",
            "type": "string"

def useless_example_func(transaction, thing_inited_at_runtime):
    msg = transaction.messages() # can also do transaction.value(); uses the confluent-kafka method with single messages
    cool_message_out = msg.value()
    cool_message_out['cool_field'] = thing_inited_at_runtime #  'cool value'
    transaction.produce({'topic':'cool_topic_out', 'key': msg.key(), 'value': cool_message_out, 'headers': msg.headers()})

init_at_runtime_thing = 'cool value' # can hand off objects you need to init at runtime, like an api session object
gtfo_app = GtfoApp(
    consume_topics_list=['test_topic_1', 'test_topic_2'],
    produce_topic_schema_dict={'cool_topic_out': useless_schema},
    app_function_arglist = [init_at_runtime_thing])  # optional! Here to show functionality.

Using GtfoBatchApp (plus BatchTransaction)

Sometimes, you want to handle multiple messages at once, such as doing a bulk upload of data to an API. In this case, treating each message as a separate transaction doesn't make much sense! For this, we have GtfoBatchApp!

We still rely on much of what Gtfo and Transaction lays out, but now we can specify how many messages should be consumed by default for a given transaction. Additionally, you can consume more messages on demand with the BulkTransaction object via BulkTransaction.consume() in your app_function, in case you'd like to do the normal consume pattern most of the time, but might need that to change on demand.

Biggest thing to note here: all messages for that transaction will (just like Transaction) be accessible via BatchTransaction.messages(). Also, all of them get committed when you finish with that instance of app_function just like a typical singleton run of app_function, so keep that in mind!

You can tweak the default batch size via NU_CONSUMER_DEFAULT_BATCH_CONSUME_MAX_COUNT; defaulted to 1.

Here is an example:

from nubium_utils.confluent_utils.transaction_utils import GtfoBatchApp
from cool_api import bulk_upload

useless_schema = {
    "name": "CoolSchema",
    "type": "record",
    "fields": [
            "name": "cool_field",
            "type": "string"

def prep_data(msg_val):
    # do stuff

def useless_example_func(bulk_transaction):
    msg = transaction.messages()
    last_msg = msg[-1]
    # I originally set my max to 10 via the env var...but with some logic, if a message says I should do a bigger batch...
    if last_msg.headers()['needs_bigger_bulk'] == 'true':
        bulk_transaction.consume(consume_max_count=500) # allow me to raise that max to 500, and consume up to that (will consume up to 490 more)!
    bulk_upload([prep_data(msg.value()) for msg in bulk_transaction.messages()]) # push all 500 messages
gtfo_app = GtfoBatchApp(
    produce_topic_schema_dict={'cool_topic_out': useless_schema})

Using GtfoTableApp (plus TableTransaction)

One of the more complex components of GTFO is the GtfoTableApp. It allows you to store state based on a kafka topic via a localized datastore. Basically, you can store whatever you want with respect to a given kafka key, and later reference/compare that data against a new version of that key.

There are some functional caveats/limitations that come along with this feature set:

  • The app can only consume from one topic (you can make an upstream "funnel" app as needed).
  • You must make a topic named the same as {NU_APP_NAME}__changelog, with the same partition count as the topic it would consume from.
  • You can only store/reference things based on the same key as the current message.
  • Each instance of your application needs to use the same volume.
  • Data needs to be stored as a json/dict (it is stored as a string and json.load-ed at runtime)

With that in mind, set up is almost exactly the same as GtfoApp; here is an example:

from nubium_utils.confluent_utils.transaction_utils import GtfoTableApp

useless_schema = {
    "name": "CoolSchema",
    "type": "record",
    "fields": [
            "name": "cool_field",
            "type": "string"

def useless_example_func(transaction):
    msg = transaction.messages() # can also do transaction.value(); uses the confluent-kafka method with single messages
    cool_message_out = msg.value()
    previous_message = transaction.read_table_entry()
    # lets only do stuff if our previous version for the same key was "just cool"!
    if previous_message['cool_field'] == 'just cool':
        cool_message_out['cool_field'] = 'very cool now'
        transaction.update_table_entry(cool_message_out) # we want to update the table with our new value. It does not do this automatically!
        transaction.produce({'topic':'cool_topic_out', 'key': msg.key(), 'value': cool_message_out, 'headers': msg.headers()})

gtfo_app = GtfoTableApp(
    produce_topic_schema_dict={'cool_topic_out': useless_schema})

Extending GtfoApp and Transaction

Of course, you can never cover every use case! As such, each class was designed with extensibility in mind!

Most often, what needs more customization is your consumption pattern (hence why there was already a bulk class!), and there is a relatively painless way to address that with minimal alterations.

There is an init argument on GtfoApp named transaction_type; this allows you to easily pass an augmented version of Transaction with your changed consumption pattern, potentially without changing the Transaction class at all!

Otherwise, hopefully things have been compartimentalized enough that you can just replace methods as needed, but in general, usually you'll need to mess a little with both classes, but likely mostly Transaction.


The monitoring utils enable metrics to be surfaced from the kafka applications so the Prometheus server can scrape them. The Prometheus server can't dynamically figure out pod IPs and scrape the services directly, so we're using a metrics cache instead.

The metrics cache is a StatefulSet with 2 services assigned to it. One service is a normal service, with a unique cluster IP address. The prometheus server scrapes this service endpoint. The other service doesn't have a cluster IP, which means that the monitoring utility can find the IP addresses of each of the backing pods, and send metrics to all of the pods. This setup gives us high-availability guarantees.

The Monitoring utils are organized into two classes, MetricsPusher and MetricsManager.

The MetricsManager is a container for all of the metrics for the app, and contains convenience methods for the 3 standardized metrics. These metrics are

  • messages_consumed: The number of messages consumed by the app
  • messages_produced: The number of messages produced by the app
  • message_errors: The number of exceptions caught in the app (labeled by type)

The MetricsPusher handles pushing the applications metrics to the metrics cache. It determines the list of IP addresses for all of the metrics cache pods, and sends the current metrics values for all of the metrics.

Metric names and labels

The names of the metrics in Prometheus are the same as their names as parameters

  • messages_consumed
  • messages_produced
  • message_errors

Two labels exist for every metric:

  • app: The name of the microservice the metric came from
  • job: The name of the individual pod the metric came from The message_errors metric also has another label:
  • exception: The name of the exception that triggered the metric

Monitoring Setup Examples

The initialization and update loop for application monitoring will differ from application to application based on their architecture. The following examples should cover the standard designs we use.

Default Kafka Client Application

A Kafka application that directly relies on interacting with Producer or Consumer clients should have it's monitoring classes set up and its pushing thread started in the main run function and passed to the loop, as follows:

import os

from confluent_kafka import Consumer, Producer
from nubium_utils.metrics import MetricsManager, MetricsPusher, start_pushing_metrics

def run_function():

    consumer = Consumer()
    producer = Producer()

    metrics_pusher = MetricsPusher(
    metrics_manager = MetricsManager(job=os.environ['HOSTNAME'], app=os.environ['APP_NAME'], metrics_pusher=metrics_pusher)
    start_pushing_metrics(metrics_manager, int(os.environ['METRICS_PUSH_RATE']))

        while True:
            loop_function(consumer, producer, metrics_manager=metrics_manager)

The consume_message() function from this library expects a metrics_manager object as an argument, so that it can increment the messages_consumed metric.

The application itself needs to increment the messages_produced metric needs to be incremented as necessary by the application itself whenever a Kafka message is produced. The convenience method on the metrics_manager inc_messages_produced() makes this easier, since it automatically adds the necessary labels to the metric.

The application also needs to be set to increment the message_errors metric whenever an exception is caught.

An example loop function might look like this:

import os
import logging

from nubium_utils import consume_message
from nubium_utils.custom_exceptions import NoMessageError

def loop_function(consumer, producer, metrics_manager):
        message = consume_message(consumer, int(os.environ['CONSUMER_POLL_TIMEOUT']), metrics_manager)
        outgoing_key = message.value()['email_address']
    except NoMessageError:
    except KeyError as error:
        logging.debug('Message missing email address')

Flask Kafka Application

The setup becomes a little bit different with a Flask application. The metrics_manager should be accessible through the app's configuration, so that it can be accessed in route functions.

The preferred method for error monitoring is to hook into the built in flask error handling loop, using the @app.errorhandler decorator. Here is an example create_app() function

import flask
from werkzeug.exceptions import HTTPException

from .forms_producer_app import forms_producer
from .util_blueprint import app_util_bp

def create_app(config):
    Creates app from config and needed blueprints
    :param config: (Config) object used to configure the flask app
    :return: (flask.App) the application object
    app = flask.Flask(__name__)


    def handle_exception(e):
        Increment error gauge on metrics_manager before returning error message
        response = e.get_response() = f'{e.code}:{} - {e.description}'
        return response

    def unhandled_exception(error):
        app.logger.error(f'Unhandled exception: {error}')
        return f'Unhandled exception: {error}', 500

    return app

The route functions for produced messages should increase the messages_produced metric when necessary. Example:

@forms_producer.route('/', methods=["POST"])
def handle_form():
    Ingests a dynamic form from Eloqua and produces it to the topic
    values = request.json
    string_values = {key: str(value) for key, value in values.items()}
    LOGGER.debug(f'Processing form: {values}')

        value={'form_data': string_values},

    return jsonify(success=True)

Project details

Release history Release notifications | RSS feed

This version


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

nubium-utils-4.2.0.tar.gz (37.5 kB view hashes)

Uploaded Source

Built Distribution

nubium_utils-4.2.0-py3-none-any.whl (43.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page