Python library to process heterogeneous log files by defining simple regex patterns and handler functions. Includes classes to parse data into a PostgreSQL database.
Regex Log Parser
Regex Log Parser is a simple and easy to use Python library for log parsing/processing. It allows the user to define a dictionary of regex rules and handler functions which determine how logs should be processed. See the examples below for more information.
This was originally developed at the MWA Telescope to mine a large volume of log data for useful insights. Following the success of the project, we have open-sourced and published it in the hope that it may be useful to somebody else.
We built this project to extract data from log files and load it into a PostgreSQL database so that it may be queried. However, it has been developed for extensibility and may be used to ingest into other data stores such as MySQL, SQLite, MongoDB, and more.
Only PostgreSQL is currently supported; if you would like to see more data stores supported, see the contributing section below.
Basic Idea
Imagine that you have a directory containing a number of log files. These files may be generated by different systems (e.g. a web server) and by different versions of those systems.
/logs
web1_1.log
web1_2.log
web2_1.log
web2_2.log
The log files contain historical information about activity on the system like so:
[2021-11-25 04:29:55,015] INFO, 192.168.0.1 "POST /login HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36"
[2021-11-25 04:29:56,542] INFO, 192.168.0.1 "GET /logout HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36"
[2021-11-25 04:30:05,731] INFO, 192.168.0.1 "GET / HTTP/1.1" 200 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36"
You want to parse these files to answer questions about the number of logins that have taken place or some other kind of event.
Rather than writing code to read each file line by line and doing string splitting, Regex Log Parser allows you to define a set of rules that will be used to parse files within a directory, as well as rules to process lines within matched files.
rules = {
    'web1_*': {
        r'\[(.*)\] INFO, (\S+) .*/login.*': 'my_handler',
        '.*': 'skip'
    }
}
In the example rules above, we are defining a dictionary where the key is some regex that will be matched against path/filenames within your directory, and the value is a dictionary which defines how to process the file.
By using regex capture groups, we can pull out the information we want from each line (datetime and IP address in the example above) and pass them to some handler function (here called my_handler), which can store the information in a database or do something else with it.
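As a quick standalone check, the '/login' rule can be exercised directly with Python's re module. The pattern below is a slightly tightened variant of the rule above (square brackets escaped, \S+ so the capture stops at the end of the IP address), applied to the first sample log line:

```python
import re

# Tightened variant of the '/login' rule: brackets are escaped and \S+
# captures just the IP address (it stops at the first whitespace).
pattern = re.compile(r'\[(.*)\] INFO, (\S+) .*/login.*')

line = ('[2021-11-25 04:29:55,015] INFO, 192.168.0.1 '
        '"POST /login HTTP/1.1" 200 "Mozilla/5.0"')

match = pattern.match(line)
timestamp, ip = match.group(1), match.group(2)
# timestamp == '2021-11-25 04:29:55,015', ip == '192.168.0.1'
```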
Important Note
This library FORCES you to handle all lines in a file: every line must be matched by at least one rule. If not, an exception will be raised.
This was done deliberately to ensure that users are handling all cases. Once you're confident that you are handling the lines that you care about, add a catch-all rule to skip everything else:
rules = {
    'web1_*': {
        '.*': 'skip'
    }
}
Installation
Prerequisites
- Python >= 3.10
Install the package
pip install regex_log_parser
If you wish to use the included functionality for uploading data into a PostgreSQL database, install the extra dependencies like so:
pip install regex_log_parser[postgres]
Usage
Create a file and import the LogProcessor class. Create an instance of this object, then call its run method, passing in a directory containing the logs you would like to process.
Two things are required to set up the processor: a rules dictionary and a handler object.
from regex_log_parser import LogProcessor, HandlerBase
log_processor = LogProcessor(
    rules=rules,
    handler=handler,
    dry_run=False
)
log_processor.run('/path/to/my/logs')
Rules
Rules is a standard Python dictionary of the format:
rules = {
    "file_regex": {
        "line_regex": "handler_function",
    }
}
Where:
- file_regex: some regex to match the name of a file,
- line_regex: some regex to match a line within the file,
- handler_function: the name of a function in your handler object which will be used to process the line.
Handlers
The handler object should be subclassed from the HandlerBase class in handlers.py. Or, if you wish to parse your logs and upload them into a PostgreSQL database, you can subclass from the PostgresHandler class.
The handler class can implement startup and shutdown methods, which will be run at the start and end of the processing run respectively. These can be used to perform some database setup or cleanup.
Handler functions have the signature:
def handler(self, file_path, line, match):
Where:
- file_path: the path to the file containing the current line,
- line: the line in the log file to be handled,
- match: the regex match object.
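The call convention can be pictured with this standalone sketch (no library import; MyHandler, its logins list, and the dispatch below are illustrative, not the library's internals):

```python
import re

# Illustrative handler class: a handler method receives the file path,
# the raw line, and the re.Match produced by the line regex.
class MyHandler:
    def __init__(self):
        self.logins = []

    def my_handler(self, file_path, line, match):
        # capture groups hold the extracted fields
        timestamp, ip = match.group(1), match.group(2)
        self.logins.append((file_path, timestamp, ip))

pattern = re.compile(r'\[(.*)\] INFO, (\S+) .*/login.*')
line = ('[2021-11-25 04:29:55,015] INFO, 192.168.0.1 '
        '"POST /login HTTP/1.1" 200 "Mozilla/5.0"')

handler = MyHandler()
m = pattern.match(line)
if m:
    # the rules dictionary maps a line regex to a handler *name*,
    # so dispatch can be done by attribute lookup
    getattr(handler, 'my_handler')('logs/web1_1.log', line, m)
```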
When using the PostgresHandler, you can call self.queue_op(sql, params) in your handler functions to queue a database operation. By default this will run SQL operations in batches of 1000; you can customise this by passing the BATCH_SIZE parameter to the PostgresHandler constructor. If you want to run a database operation immediately, call self.queue_op(sql, params, run_now=True).
Full example
from regex_log_parser import LogProcessor, PostgresHandler

class MyHandler(PostgresHandler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def startup(self):
        """Optionally run some setup"""
        pass

    def shutdown(self):
        """Optionally run some cleanup"""
        pass

    def my_handler(self, file_path, line, match):
        field_1 = match.group(1)
        field_2 = match.group(2)

        sql = """
            INSERT INTO my_table (field_1, field_2)
            VALUES
                (%s, %s);
        """
        params = (field_1, field_2)

        self.queue_op(sql, params, run_now=False)

rules = {
    r'example\.log': {
        '(.*),(.*)': 'my_handler',
        '.*': 'skip'
    }
}

handler = MyHandler(
    dsn='user:pass@localhost:5432/test',
    setup_script='path/to/db_setup'
)

log_processor = LogProcessor(
    rules=rules,
    handler=handler,
    dry_run=False
)

log_processor.run('/path/to/my/logs')
The Handler class
The library only stipulates that the handler object passed to the LogProcessor is an instance of HandlerBase.
You should subclass HandlerBase and add your own methods to handle the lines found by your rules.
Override the startup and shutdown methods in your handler class to run a function at the start and end of parsing, respectively.
The PostgresHandler class
Alternatively, if you wish to make use of the included PostgresHandler for uploading data into a PostgreSQL database, subclass from that instead.
The PostgresHandler object has the following constructor:
class PostgresHandler(HandlerBase):
    def __init__(self, dsn: Optional[str] = None, connection: Optional[Connection] = None, setup_script: Optional[str] = None, BATCH_SIZE: int = 1000):
Where:
- dsn: optionally, a DSN string which will be used to connect to an existing PostgreSQL database; or
- connection: optionally, an existing psycopg3 connection. Useful for unit tests.
- setup_script: optionally, the path to a SQL file used to perform some database setup/cleanup in between runs.
- BATCH_SIZE: execute database operations in batches of BATCH_SIZE. Defaults to 1000.
In your handler functions, define a SQL string and an args tuple and pass them to the queue_op function. These should be set up according to the psycopg3 format; see the example above. If you wish to execute a database operation immediately, pass run_now=True to queue_op; otherwise, it will be added to a queue and executed in sequence when the size of the queue reaches BATCH_SIZE.
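The batching behaviour can be pictured with this stdlib-only sketch (the real PostgresHandler hands the batch to PostgreSQL via psycopg3 rather than recording it in a list; class and attribute names here are illustrative):

```python
# Illustrative sketch of the batching idea behind queue_op.
class BatchQueue:
    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.pending = []    # operations waiting to be executed
        self.executed = []   # stands in for the database

    def queue_op(self, sql, params, run_now=False):
        self.pending.append((sql, params))
        if run_now or len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # executemany-style flush of everything queued so far
        self.executed.extend(self.pending)
        self.pending.clear()

q = BatchQueue(batch_size=2)
q.queue_op("INSERT INTO t VALUES (%s)", (1,))  # queued only
q.queue_op("INSERT INTO t VALUES (%s)", (2,))  # hits batch size, flushes both
```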
Contributing
As mentioned above, the only data store that is currently supported is Postgres. If you would like to add support for another data store such as MySQL or MongoDB, then please open a pull request.