Skip to main content

A pymongo extension for auditing the database operations

Project description

pymongowatch is an extension for the MongoDB’s Python driver pymongo which enable watching and auditing the underlying operations.

MongoDB Community Server doesn’t support auditing and the database profiler is not very flexible. Basically your only option for filtering is to collect operations that take longer than a certain value. You can’t easily separate the insert log from the query log or you cannot extract the operations only for a certain collection or a certain set of queries.

Furthermore, operations at database level are not always the same as operations in an application. A very simple operation like fetching documents with find may be translated to several getMore operations at database level.

Therefore, if you want to analyze the user behavior by studying the database queries, a very simple auditing log for each user search containing the query itself (possibly with deleting value fields to protect the users’ privacy), the number of the documents read from the database and its time is all you need.

Doing this at database level is not always feasible and doing it manually at application level may not be very desirable for the developers. This is when an intermediate tool at library level comes handy.

pymongowatch can detect and audit operations at application level without any modifications to existing applications.

Quick Setup

First you have to install pymongowatch. It is available at Python Package Index (PyPI) and can be installed with pip. At a virtual environment or system-wide:

pip install pymongowatch

Other installation options such as Debian packages are available, too.

After basic installation, pymongo.watcher sub-package will be available and you have two options to go on:

Option 1: Call the watcher at application startup

This approach is recommended if you have no restrictions on modifying the pymongo-based application and you don’t want any hacks on the pymongo itself.

from pymongo import watcher

watcher.patch_pymongo()
watcher.add_logging_handlers()  # Adds a logging.StreamHandler by default

Of course, you can pass your own customized logging handlers to add_logging_handlers and add filters and formatters as you wish. See the following sections for more information.

Option 2: No application modification

If you don’t own the application or you don’t want to even add a single line of code to your existing application, pymongo_mask which is included in the pymongowatch package comes handy.

pymongo_mask is a directory with a module pymongo.py which emulates the pymongo package by using the real installed pymongo; So if you add pymongo_mask directory in your Python application search path for modules–in somewhere with more priority than where the real pymongo is installed–the mask will be imported in your application instead of the real pymongo. Nevertheless the mask will work as a proxy to the real installed pymongo and your application should work as usual.

The importation of the pymongo mask will also trigger watcher.patch_pymongo() automatically, so the watchers will be enabled without any need to change your application source code or even pymongo library itself.

Furthermore, the pymongo mask will apply the configuration at etc/pymongowatch.yaml with logging.config.dictConfig to the loggers of the watchers. A sample etc/pymongowatch.yaml is included in the pymongowatch package but you can modify it as needed. The yaml configuration follows the logging dictionary schema.

Finally, pymongowatch package includes a pymongowatch-install-mask script which can help you enable the pymongo mask with ease:

pymongowatch-install-mask  # follow the interactive instructions

Usage

pymongowatch implements classes such as WatchCursor (a subclass of pymongo’s cursor) which if used instead of pymongo’s cursor can emit logs as specified. To do so, you need some workaround to make pymongo’s methods to use pymongowatch classes instead of its defaults. The patch_pymongo function is the method that will call all the patch methods in pymongowatch classes to replace the default pymongo underlying classes:

from pymongo import watcher

watcher.patch_pymongo()

All the pymongowatch watcher classes will use the logger retrieved with pymongo.watcher.logger.get_log_emitter() to emit their logs. Each module will emit the logs with their name. So for example, all the cursor logs will be emitted under the pymongo.watcher.cursor. So you can apply specific logging configuration for each module or as all the logs has the prefix pymongo.watcher as their logger name you can apply global configuration with that name.

import logging

# The following code is not the recommended method for adding handlers
# to the pymongowatch loggers and is for illustrative purposes only

global_logger = logging.getLogger("pymongo.watcher")
# add a stream log handler globally to watch all the logs on the
# console
global_logger.addHandler(logging.StreamHandler())

cursor_logger = logging.getLogger("pymongo.watcher.cursor")
# add a file handler specifically for cursor logs to store them in
# file, too.
cursor_logger.addHandler(logging.FileHandler("/tmp/watcher.log"))

Of course you can apply filters and formatters or add more handlers as you wish but there are some details that you should take care of. The easiest way to achieve this is by leveraging the add_logging_handlers method:

from pymongo import watcher

# Add two handlers to get all the watcher logs both in file and
# console
console_handler = logging.StreamHandler()
global_handler = logging.FileHanlder("/tmp/watcher-all.log")
watcher.add_logging_handlers(console_handler, global_handler)

# Add a more customized handler for cursor logs
cursor_simple_handler = logging.FileHanlder("/tmp/watcher-cursor-simple.log")
watcher.add_logging_handlers(
    cursor_simple_handler,
    logger_name="pymongo.watcher.cursor",
    formatter="{name} - {watch}")

# Add a more customized handler for cursor logs
cursor_customized_handler = logging.FileHanlder(
    "/tmp/watcher-cursor-customized.log")
watcher.add_logging_handlers(
    cursor_customized_handler,
    logger_name="pymongo.watcher.cursor",
    formatter="{asctime} {name}.{watch.Collection} - {watch.Query} fetched "
              "{watch.RetrievedCount} in {watch.RetrieveTime} seconds")

# Add another handler to log the full information for cursors in csv
cursor_csv_handler = logging.FileHanlder(
    "/tmp/watcher-cursor-csv.log")
watcher.add_logging_handlers(
    cursor_csv_handler,
    logger_name="pymongo.watcher.cursor",
    formatter="{asctime},{name},{watch.csv}")

Note that using add_logging_handlers has not only the advantage of simplicity for adding formatters, but also take care of automatically adding an extra logging.handlers.QueueHandler and logging.handlers.QueueListener for each handler to overcome some log mutation issues we discuss later.

You can set the log format by using formatters either by passing a string as the formatter option to the add_logging_handlers or by creating a formatter object and using the handler’s setFormatter method directly. In either case the recommended way is to use { style. Specially if you want to access inner values with dot notation e.g. watch.Query or watch.Collection other styles such as % and $ (e.g. %(watch.Query)s or ${watch.Query}) will NOT work.

Another useful feature of Python logging module is its filters. You can use filter objects not only to filter unwanted logs but also to modify the ones that you want.

You can develop your own logging filters by sub-classing logging.Filter as usual and access the record.watch in filter method for investigating or modifying the watcher attributes such as DB, Collection, Query, etc.

import logging

class SlowQueriesOnNewsCollectionFilter(logging.Filter):
    def filter(self, record):
        watch = getattr(record, "watch", {})
        return (watch.get("Collection") == "news" and
                watch.get("RetrieveTime", 0) > 10)

Or you can add filters to modify the logs:

import logging

def remove_private_data(data):
    if isinstance(data, dict):
        return {k: remove_private_data(v) for k, v in data.items()}
    elif isinstance(data, list):
        return [remove_private_data(i) for i in data]
    return None

class UserPrivacyFilter(logging.Filter):
    def filter(self, record):
        watch = getattr(record, "watch", {})
        watch.update(remove_private_data(watch))
        return True

Don’t forget to add the defined filters to you handlers:

my_handler.addFilter(SlowQueriesOnNewsCollectionFilter)
my_handler.addFilter(UserPrivacyFilter)

Lastly, you can use Python’s great logging.conig module and specially the new flexible logging.config.dictConfig method to apply all the handlers, formatters and filters in a single configuration file.

pymongowatch has even the required filters implemented in pymongo.watcher.filters module. To see the examples for the dictConfig configuration with watcher filters refer to the etc/pymongowatch.yaml file which will be installed via pymongowatch (if you are using a virtual environment, it would be inside the venv directory).

Mutable vs Immutable Logs

Mutable Logs? Is that a thing?

Usually the good thing about logs is that they are immutable. So if you see a log you can trust it. This is always true when some atomic operation happens and you have no concerns about the start and end time of the operation (and you don’t have access to a time machine to travel to the past and change what happened).

But what if you start an operation which we have no idea when will it end? Suppose we have queried a very large database for a very slow query that may take some time to get back the full results. Also, we may use a cursor in our application to fetch data and the application has some delays itself that we don’t want the slowness of the application to affect the database auditing.

These are the sort of challenges that pymongo.watcher.WatchQueue tries to fix.

pymongowatch uses pymongo.watcher.logger.WatchMessage instead of strings as log messages as described in using arbitrary objects as messages in Python’s logging HOWTO. WatchMessage is a sub-class of Python’s dictionaries which are mutable objects.

WatchMessage instances are the {watch} templates in the format strings that we saw earlier. They are a dict so you can access log attributes with [] access e.g. watch["Query"]. For more convenience while using log formatters WatchMessage provides attribute access with dot notation e.g. watch.Query and one of the reasons why { style formatting (which let you use dot notation access) is recommended for logging formatters.

As a WatchMessage is a mutable object, the watcher classes can modify the log from time to time. For example if you fetch more items from a cursor, they can update the attributes such as RetrieveTime or RetrievedCount. So we have flexibility and it is pymongowatch users decision when to emit the final immutable log with the logger handlers.

WatchQueue alongside a QueueHandler is the right tool to make sure we handle logs at right time. A WatchQueue works like a priority queue which the earliest logs has the higher priority but some a twists.

First, earliest log doesn’t always mean the start time of the first operation and it doesn’t always mean the end time of the operation. It depends on the specific database operation and pymongowatch tries to pick the best for you. That means operations may arrive and retrieved with different orders from a WatchQueue.

Second, each WatchMessage has an internal ready state. It usually is False at the start of the operation but can set to True (with its one time set_ready method) when pymongowatch classes think the operation is at its final state and its ready to be logged. WatchQueue will assume more priority for these logs.

Finally you can set a default_delay_sec to add delay to all the logs in the queue (for example if you are using logging for analytics and you do care more about accuracy than delay for the logs) and you can even set force_default_delay to True to enforce logging order for all the logs. These are the optional keyword arguments that both WatchQueue constructor and add_logging_handlers method that we saw earlier accepts.

The add_logging_handlers has a with_queue optional argument which if is True (the default), will use pymongo.watcher.setup_queue_handler to setup a QueueHandler alongside a started QueueListener for each handler you specify with a WatchQueue so you usually don’t have to worry about log mutation if you use add_logging_handlers to add your handlers to watcher loggers.

Multiprocessing

Another useful application of QueueHandler is its use in logging to a single file from multiple processes. For this application you have to use a multiprocessing Queue alongside the QueueHandler. Fortunately WatchQueue also supports multiprocessing. All you have to do is to pass True as the enable_multiprocessing argument.

This is useful for example when you have a pymongo based web application with several web server processes and you need a process safe method to store the logs in a single file. Its worth noting that in such environments using the default value for enable_multiprocessing i.e. False will result in total failure in logging because the default Queue is not multiprocessing and the QueueHandler and the QueueListener will use different queues in different processes.

Passing True as enable_multiprocessing in WatchQueue constructor will makes the constructor to build and return a proxy object from a customized manager which has its own dedicated process. You can also pass enable_multiprocessing to setup_queue_handler and add_logging_handlers:

# Extra keyword arguments of `add_logging_handlers` will be passed to
# the newly created WatchQueue for each handler.
watcher.add_logging_handlers(enable_multiprocessing=True)

Pymongo Versions

pymongowatch is not a standalone MongoDB library and it relies on the the MongoDB’s Python driver pymongo. But does the pymongo’s version matter?

pymongowatch has been tested with the recent versions of pymongo i.e. 3.10 and the newer 4 series but you can use it for other versions at your own risk. If you have any problems you can open an issue at the project’s issue tracker.

One known difference between pymongo versions is that they handle operation closing differently. For example, 4 series close the cursors more intelligently and you can usually see the cursor logs very fast without any need to a explicit timeout whereas any 3 series usually an explicit timeout is required.

Other Installation Options

Debian Packages

If you are a Debian-based GNU/Linux distribution user you are in luck! There is a Debian package maintained in the project’s debian branch that can make your installation even easier.

You can find the binary packages at mrazavi’s pymongowatch PPA and to install it on Ubuntu:

sudo add-apt-repository ppa:mrazavi/pymongowatch
sudo apt update

sudo apt install python3-pymongowatch

TODOs

If you see an ugly TODO list below with oversize items, it’s not even clear which items are DONE and which ones are still TODO, maybe that is because this document is written in org-mode but you are seeing a bad render e.g. in GitHub or a converted reStructuredText format e.g. because the lack of org-mode support in PyPI.

That doesn’t make org-mode less lovable or inferior. Anyway org-mode is one of the most reasonable markup languages to use for text. Why not to use it and brag about it?

DONE Support queries with find

TODO Support for collection delete

TODO Support for collection insert

DONE Implement pip package

DONE Implement debian package

TODO Implement unit tests

DONE Supoort for multiprocessing

TODO Create API reference document

About

The pymongowatch has developed mainly by Mohammad Razavi.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pymongowatch-0.2.0.tar.gz (49.8 kB view hashes)

Uploaded Source

Built Distribution

pymongowatch-0.2.0-py3-none-any.whl (42.0 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page