
Prometheus extension for Mask.

Project description

Mask gRPC Interceptors for Prometheus monitoring

Install

pip install mask-prometheus

Usage

# third-party
from mask import Mask
from mask_prometheus import Prometheus
# project
from examples.protos.hello_pb2 import HelloResponse


app = Mask(__name__)

app.config["PROMETHEUS_PORT"] = 18080

prometheus = Prometheus()
prometheus.init_app(app)



@app.route(method="SayHelloStream", service="Hello")
def say_hello_stream_handler(request, context):
    """ Handler stream SayHello request
    """
    for item in request:
        yield HelloResponse(message="Hello Reply: %s" % item.name)

if __name__ == "__main__":
    app.run(port=1020)
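
Once the app is running, the extension exposes metrics in the Prometheus text format on PROMETHEUS_PORT (18080 above). A minimal sanity check, assuming the exporter serves the standard /metrics path (the exact path is an assumption, not confirmed by this README):

# Hypothetical check that the exporter is reachable; the /metrics path is assumed.
from urllib.request import urlopen

body = urlopen("http://localhost:18080/metrics").read().decode("utf-8")
for line in body.splitlines():
    if line.startswith("grpc_server_started_total"):
        print(line)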

Metrics

Labels

All metrics below start with grpc_server as the Prometheus subsystem name, and all of them carry the same rich set of labels:

  • grpc_service - the gRPC service name, which is the combination of the protobuf package and the service name. E.g. for package mwitkow.testproto and service TestService the label will be grpc_service="mwitkow.testproto.TestService" (see the sketch after this list)

  • grpc_method - the name of the method called on the gRPC service. E.g.
    grpc_method="Ping"

  • grpc_type - the gRPC type of the request. Differentiating between these types is especially important for latency measurements.

    • unary is a single-request, single-response RPC
    • client_stream is a multi-request, single-response RPC
    • server_stream is a single-request, multi-response RPC
    • bidi_stream is a multi-request, multi-response RPC
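
To illustrate how the grpc_service and grpc_method labels relate to the full gRPC method path, here is a generic sketch (not necessarily how mask-prometheus derives them; the "/package.Service/Method" path format is standard gRPC):

# Illustration only: split a full gRPC method path into grpc_service and grpc_method.
def split_method_path(method_path):
    # "/mwitkow.testproto.TestService/Ping" -> ("mwitkow.testproto.TestService", "Ping")
    _, grpc_service, grpc_method = method_path.split("/")
    return grpc_service, grpc_method

assert split_method_path("/mwitkow.testproto.TestService/Ping") == (
    "mwitkow.testproto.TestService",
    "Ping",
)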

Additionally for completed RPCs, the following labels are used:

  • grpc_code - the human-readable gRPC status code. The full list of statuses is too long to reproduce here, but some common ones are:

    • StatusCode.OK - the RPC was successful
    • StatusCode.INVALID_ARGUMENT - the RPC contained bad values
    • StatusCode.INTERNAL - a server-side error not disclosed to the clients

Counters

For simplicity, let's assume we're tracking a single server-side RPC call of mwitkow.testproto.TestService, calling the method PingList. The call succeeds and returns 20 messages in the stream.

First, immediately after the server receives the call it will increment the grpc_server_started_total and start the handling time clock (if histograms are enabled).

grpc_server_started_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1

Then the user logic gets invoked. It receives one message from the client containing the request (it's a server_stream):

grpc_server_msg_received_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1

The user logic may return an error, or send multiple messages back to the client. In this case, on each of the 20 messages sent back, a counter will be incremented:

grpc_server_msg_sent_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 20

After the call completes, its status (OK or other gRPC status code) and the relevant call labels increment the grpc_server_handled_total counter.

grpc_server_handled_total{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
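
For intuition, the four counters above map naturally onto labelled Prometheus counters. Below is a rough sketch of the bookkeeping for the example call, written against prometheus_client (illustrative only; the actual mask-prometheus implementation may differ):

from prometheus_client import Counter

LABELS = ("grpc_service", "grpc_method", "grpc_type")

started = Counter("grpc_server_started_total",
                  "Total number of RPCs started on the server.", LABELS)
msg_received = Counter("grpc_server_msg_received_total",
                       "Total number of stream messages received from the client.", LABELS)
msg_sent = Counter("grpc_server_msg_sent_total",
                   "Total number of stream messages sent by the server.", LABELS)
handled = Counter("grpc_server_handled_total",
                  "Total number of RPCs completed on the server.", LABELS + ("grpc_code",))

labels = dict(grpc_service="mwitkow.testproto.TestService",
              grpc_method="PingList", grpc_type="server_stream")

started.labels(**labels).inc()                   # call received
msg_received.labels(**labels).inc()              # one request message read
for _ in range(20):                              # 20 response messages streamed back
    msg_sent.labels(**labels).inc()
handled.labels(grpc_code="OK", **labels).inc()   # call finished with OK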

Histograms

Prometheus histograms are a great way to measure latency distributions of your RPCs. However, since it is bad practice to have metrics of high cardinality, the latency monitoring metrics are disabled by default.

After the call completes, its handling time will be recorded in a Prometheus histogram variable grpc_server_handling_seconds. The histogram variable contains three sub-metrics:

  • grpc_server_handling_seconds_count - the count of all completed RPCs by status and method
  • grpc_server_handling_seconds_sum - the cumulative handling time of RPCs by status and method, useful for calculating average handling times
  • grpc_server_handling_seconds_bucket - the counts of RPCs by status and method in their respective handling-time buckets. These buckets can be used by Prometheus to estimate latency quantiles and SLA compliance

The recorded values for the example call will look as follows:

grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.005"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.01"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.025"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.05"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.1"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.25"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="1"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="2.5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="10"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="+Inf"} 1
grpc_server_handling_seconds_sum{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 0.0003866430000000001
grpc_server_handling_seconds_count{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
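
The three series above are what a single Prometheus histogram produces. A rough illustration of the recording step, written against prometheus_client with the bucket boundaries copied from the output above (a sketch only, not necessarily how mask-prometheus implements it):

import time
from prometheus_client import Histogram

# Sketch only; the label set and buckets simply mirror the output shown above.
handling_seconds = Histogram(
    "grpc_server_handling_seconds",
    "Histogram of RPC handling time on the server, in seconds.",
    ("grpc_code", "grpc_service", "grpc_method", "grpc_type"),
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10),
)

start = time.perf_counter()
# ... invoke the user handler here ...
handling_seconds.labels(
    grpc_code="OK",
    grpc_service="mwitkow.testproto.TestService",
    grpc_method="PingList",
    grpc_type="server_stream",
).observe(time.perf_counter() - start)  # fills _bucket, _sum and _count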

Useful query examples

The Prometheus philosophy is to provide raw metrics to the monitoring system and let the aggregations be handled there. The verbosity of the above metrics makes that flexibility possible. Here are a couple of useful monitoring queries:

request inbound rate

sum(rate(grpc_server_started_total{job="foo"}[1m])) by (grpc_service)

For job="foo" (common label to differentiate between Prometheus monitoring targets), calculate the rate of requests per second (1 minute window) for each gRPC grpc_service that the job has. Please note how the grpc_method is being omitted here: all methods of a given gRPC service will be summed together.

unary request error rate

sum(rate(grpc_server_handled_total{job="foo",grpc_type="unary",grpc_code!="OK"}[1m])) by (grpc_service)

For job="foo", calculate the per-grpc_service rate of unary (1:1) RPCs that failed, i.e. the ones that didn't finish with OK code.

unary request error percentage

sum(rate(grpc_server_handled_total{job="foo",grpc_type="unary",grpc_code!="OK"}[1m])) by (grpc_service)
 / 
sum(rate(grpc_server_started_total{job="foo",grpc_type="unary"}[1m])) by (grpc_service)
 * 100.0

For job="foo", calculate the percentage of failed requests by service. It's easy to notice that this is a combination of the two above examples. This is an example of a query you would like to alert on in your system for SLA violations, e.g. "no more than 1% requests should fail".

average response stream size

sum(rate(grpc_server_msg_sent_total{job="foo",grpc_type="server_stream"}[10m])) by (grpc_service)
 /
sum(rate(grpc_server_started_total{job="foo",grpc_type="server_stream"}[10m])) by (grpc_service)

For job="foo" what is the grpc_service-wide 10m average of messages returned for all server_stream RPCs. This allows you to track the stream sizes returned by your system, e.g. allows you to track when clients started to send "wide" queries that ret Note the divisor is the number of started RPCs, in order to account for in-flight requests.

99th percentile latency of unary requests

histogram_quantile(0.99, 
  sum(rate(grpc_server_handling_seconds_bucket{job="foo",grpc_type="unary"}[5m])) by (grpc_service,le)
)

For job="foo", returns an 99%-tile quantile estimation of the handling time of RPCs per service. Please note the 5m rate, this means that the quantile estimation will take samples in a rolling 5m window. When combined with other quantiles (e.g. 50%, 90%), this query gives you tremendous insight into the responsiveness of your system (e.g. impact of caching).

percentage of slow unary queries (>250ms)

100.0 - (
sum(rate(grpc_server_handling_seconds_bucket{job="foo",grpc_type="unary",le="0.25"}[5m])) by (grpc_service)
 / 
sum(rate(grpc_server_handling_seconds_count{job="foo",grpc_type="unary"}[5m])) by (grpc_service)
) * 100.0

For job="foo" calculate the by-grpc_service fraction of slow requests that took longer than 0.25 seconds. This query is relatively complex, since the Prometheus aggregations use le (less or equal) buckets, meaning that counting "fast" requests fractions is easier. However, simple maths helps. This is an example of a query you would like to alert on in your system for SLA violations, e.g. "less than 1% of requests are slower than 250ms".

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distribution

mask_prometheus-1.0.0a1-py3-none-any.whl (7.1 kB, Python 3)

File details

Details for the file mask_prometheus-1.0.0a1-py3-none-any.whl.

File metadata

  • Download URL: mask_prometheus-1.0.0a1-py3-none-any.whl
  • Upload date:
  • Size: 7.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.6.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.1 CPython/3.6.5

File hashes

Hashes for mask_prometheus-1.0.0a1-py3-none-any.whl

  • SHA256: 1e4427c2f3e22541258f3dac8126d6ec8608fa4298dadc07409208b7522156e0
  • MD5: ff6004fccae0ad522f006d5262cdd0c1
  • BLAKE2b-256: c89a1f1e581dc51975ae9cb376fb40422a01e5d3e547295e1b4b68a4d066e618

See more details on using hashes here.
