
Python library to process heterogeneous log files by defining simple regex patterns and functions to handle them. Includes classes to load parsed data into a PostgreSQL database.


Regex Log Parser

Regex Log Parser is a simple and easy to use Python library for log parsing/processing. It allows the user to define a dictionary of regex rules and handler functions which determine how logs should be processed. See the examples below for more information.

This was originally developed at the MWA Telescope to mine a large amount of log data for useful insights. Following the success of the project, we have open sourced and published it in the hope that it may be useful to somebody else.

We built this project to extract data from log files and load it into a PostgreSQL database so that it can be queried. However, it has been developed with extensibility in mind and could be used to ingest data into other data stores such as MySQL, SQLite, MongoDB, and more.

Only PostgreSQL is currently supported; if you would like to see more data stores supported, see the Contributing section below.

Basic Idea

Imagine that you have a directory containing a number of log files. These files may be generated by different systems (e.g. a web server) and by different versions of those systems.

/logs
  web1_1.log
  web1_2.log
  web2_1.log
  web2_2.log

The log files contain historical information about activity on the system like so:

[2021-11-25 04:29:55,015] INFO, 192.168.0.1 "POST /login HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36"
[2021-11-25 04:29:56,542] INFO, 192.168.0.1 "GET /logout HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36"
[2021-11-25 04:30:05,731] INFO, 192.168.0.1 "GET / HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36"

You want to parse these files to answer questions such as how many logins have taken place, or to track some other kind of event.

Rather than writing code to read each file line by line and split strings by hand, Regex Log Parser allows you to define a set of rules that select which files within a directory to parse, along with rules for processing the lines within matched files.

rules = {
    'web1_.*': {
        r'\[(.*)\] INFO, (\S+) .*/login.*': 'my_handler',
        '.*': 'skip'
    }
}

In the example rules above, we define a dictionary where each key is a regex matched against the paths/filenames within your directory, and each value is a dictionary which defines how to process the matching file.

By using regex capture groups, we can pull out the information we want from each line (datetime and IP address in the example above) and pass them to some handler function (here called my_handler), which can store the information in a database or do something else with it.
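
For example, a handler for the rules above might look like the following sketch (the handler signature is described under Usage below; the body here is purely illustrative):

from regex_log_parser import HandlerBase

class MyHandler(HandlerBase):
    def my_handler(self, file_path, line, match):
        # Capture groups from the line regex: group(1) is the
        # timestamp, group(2) is the client IP address.
        timestamp = match.group(1)
        ip_address = match.group(2)
        print(f'{file_path}: login from {ip_address} at {timestamp}')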

Important Note

This library FORCES you to handle every line in a file, i.e. each line must be matched by at least one rule. If a line is not matched by any rule, an exception will be raised.

This was done deliberately to ensure that users are handling all cases. Once you're confident that you are handling the lines that you care about, add a catch-all rule to skip everything else:

rules = {
    'web1_*': {
        '.*': 'skip'
    }
}

Installation

Prerequisites

  • Python >= 3.10

Install the package

pip install regex_log_parser

If you wish to use the included functionality for uploading data into a PostgreSQL database, install the extra dependencies like so:

pip install regex_log_parser[postgres]

Usage

Create a file and import the LogProcessor class. Create an instance of this object, then call the run method, passing in a directory containing the logs that you would like to process.

Two things are required to set up the processor: a rules dictionary and a handler object.

from regex_log_parser import LogProcessor, HandlerBase

log_processor = LogProcessor(
    rules=rules,
    handler=handler,
    dry_run=False
)

log_processor.run('/path/to/my/logs')

Rules

Rules is a standard Python dictionary of the format:

rules = {
    "file_regex": {
        "line_regex": "handler_function",
    }
}

Where:

  • file_regex is some regex to match the name of a file,
  • line_regex is some regex to match a line within the file,
  • handler_function is the name of a method on your handler object which will be used to process the line.
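
For example, the following rules would route error lines from any matching log file to a handle_error method on your handler, and skip everything else (the filename pattern and handler name here are illustrative):

rules = {
    r'nginx_.*\.log': {
        r'\[(.*)\] ERROR, (.*)': 'handle_error',
        '.*': 'skip',
    }
}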

Handlers

The handler object should be subclassed from the HandlerBase class in handlers.py. Or, if you wish to parse your logs and upload the results into a PostgreSQL database, you can subclass from the PostgresHandler class.

The handler class can implement startup and shutdown methods, which will be run at the start and end of the processing run, respectively. These can be used to perform database setup or cleanup.

Handler functions have the signature:

def handler(self, file_path, line, match):

Where:

  • file_path is the path to the file of the current line
  • line is the line in the log file to be handled
  • match is the re.Match object produced by the line regex

When using the PostgresHandler, you can call self.queue_op(sql, params) in your handler functions to queue a database operation. By default this will run SQL operations in batches of 1000; you can customise this by passing the BATCH_SIZE parameter to the PostgresHandler constructor. If you want to run a database operation immediately, call self.queue_op(sql, params, run_now=True).

Full example

from regex_log_parser import LogProcessor, PostgresHandler

class MyHandler(PostgresHandler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def startup(self):
        """Optionally run some setup"""
        pass

    def shutdown(self):
        """Optionally run some cleanup"""
        pass

    def my_handler(self, file_path, line, match):
        field_1 = match.group(1)
        field_2 = match.group(2)

        sql = """
            INSERT INTO my_table (field_1, field_2)
            VALUES
                (%s, %s);
        """
        params = (field_1, field_2)

        self.queue_op(sql, params, run_now=False)

rules = {
    r'example\.log': {
        '(.*),(.*)': 'my_handler',
        '.*': 'skip'
    }
}

handler = MyHandler(
    dsn='postgresql://user:pass@localhost:5432/test',
    setup_script='path/to/db_setup'
)

log_processor = LogProcessor(
    rules=rules,
    handler=handler,
    dry_run=False
)

log_processor.run('/path/to/my/logs')

The Handler class

The library only stipulates that the handler object passed to the LogProcessor object is an instance of HandlerBase.

You should subclass from HandlerBase and add your own methods to handle the lines found by your rules.

Override the startup and shutdown methods in your handler class to run a function at the start and end of parsing, respectively.
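
For instance, a handler that simply counts the lines matched by your rules might look like this minimal sketch (CountingHandler and my_handler are illustrative names):

from regex_log_parser import HandlerBase

class CountingHandler(HandlerBase):
    def startup(self):
        # Runs once before any files are processed.
        self.count = 0

    def my_handler(self, file_path, line, match):
        # Referenced by name as a handler_function in the rules dictionary.
        self.count += 1

    def shutdown(self):
        # Runs once after all files have been processed.
        print(f'Handled {self.count} matching lines')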

The PostgresHandler class

Alternatively, if you wish to make use of the included PostgresHandler for uploading data into a PostgreSQL database, subclass from that instead.

The PostgresHandler object has the following constructor:

class PostgresHandler(HandlerBase):
    def __init__(
        self,
        dsn: Optional[str] = None,
        connection: Optional[Connection] = None,
        setup_script: Optional[str] = None,
        BATCH_SIZE: int = 1000,
    ):

  • dsn optionally provide a DSN string which will be used to connect to an existing PostgreSQL database, or;
  • connection optionally provide an existing psycopg3 connection. Useful for unit tests.
  • setup_script optionally provide the path to a SQL file in order to perform some database setup/cleanup between runs.
  • BATCH_SIZE execute database operations in batches of BATCH_SIZE; defaults to 1000.

In your handler functions, define a SQL string and an args tuple, and pass them to the queue_op function. These should be set up according to the psycopg3 format; see the example above. If you wish to execute a database operation immediately, pass run_now=True to queue_op; otherwise, the operation will be added to a queue and executed in sequence once the size of the queue reaches BATCH_SIZE.
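
For example, two methods on your PostgresHandler subclass might queue inserts in batches and run a one-off statement immediately (the events table and handler names here are illustrative):

def insert_handler(self, file_path, line, match):
    # Queued: executed once the queue reaches BATCH_SIZE.
    self.queue_op(
        'INSERT INTO events (ip) VALUES (%s);',
        (match.group(1),),
    )

def delete_handler(self, file_path, line, match):
    # run_now=True executes the operation immediately, bypassing the queue.
    self.queue_op('DELETE FROM events WHERE ip = %s;', (match.group(1),), run_now=True)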

Contributing

As mentioned above, the only data store that is currently supported is Postgres. If you would like to add support for another data store such as MySQL or MongoDB, then please open a pull request.
