A tool to parse syslog-like messages into word sequences
log2seq is a Python package that helps parse syslog-like messages into word sequences that are more suitable for further automated analysis. It applies a customizable sequence of rules in order, each based on regular expressions.
In log analysis, you may sometimes encounter log messages in the following format:
Jan 1 12:34:56 host-device1 system: host 2001:0db8:1234::1 (interface:eth0) disconnected
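This message cannot be split well with str.split or re.split alone, because : is not used consistently: it appears inside the timestamp and the IPv6 address, where it must be kept, and also between other words, where it acts as a separator.
log2seq processes this message in multiple steps (by default):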
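As a small illustration of the problem, the snippet below applies str.split and re.split to the sample message using only the standard library. The delimiter set passed to re.split is an assumption chosen for this example, not something taken from log2seq.

```python
import re

mes = "Jan 1 12:34:56 host-device1 system: host 2001:0db8:1234::1 (interface:eth0) disconnected"

# Splitting on whitespace alone leaves "system:" and "(interface:eth0)"
# as single tokens, with the symbols still attached.
by_space = mes.split()

# Treating ":" and parentheses as delimiters too separates those tokens,
# but it also breaks the timestamp and the IPv6 address into pieces.
by_symbols = [w for w in re.split(r"[ :()]+", mes) if w]

print(by_space)
print(by_symbols)
```

Neither result is usable as is: the first keeps unwanted symbols, and the second loses the IPv6 address as a single word.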
- Process message header (i.e., timestamp and source hostname)
- Split the message body into a word sequence using standard symbol strings (e.g., spaces and brackets)
- Fix words that should not be split later (e.g., IPv6 addresses)
- Split the remaining words by inconsistent symbol strings (e.g., :)
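The steps above can be sketched with the standard re module alone. This is a minimal illustration of the idea, not log2seq's implementation: the function name parse and the patterns HEADER, STANDARD_SEPARATORS, and IPV6_LIKE are hypothetical, and the IPv6 pattern is deliberately loose.

```python
import re

# Hypothetical patterns for illustration only; log2seq's real rules differ.
HEADER = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+(?P<body>.*)$"
)
STANDARD_SEPARATORS = re.compile(r"[\s()\[\]]+")
# Loose check: hex groups joined by ":" (not a full IPv6 validator).
IPV6_LIKE = re.compile(r"^[0-9a-fA-F]{0,4}(:[0-9a-fA-F]{0,4}){2,7}$")


def parse(line):
    # Step 1: process the header (timestamp and source hostname).
    m = HEADER.match(line)
    if m is None:
        raise ValueError("unexpected header format")

    # Step 2: split the body on consistent separators (spaces, brackets).
    words = [w for w in STANDARD_SEPARATORS.split(m.group("body")) if w]

    result = []
    for word in words:
        if IPV6_LIKE.match(word):
            # Step 3: fix IPv6-like words so they are not split further.
            result.append(word)
        else:
            # Step 4: split the remaining words on the inconsistent ":".
            result.extend(p for p in word.split(":") if p)

    return {
        "timestamp": m.group("timestamp"),
        "host": m.group("host"),
        "words": result,
    }


mes = "Jan 1 12:34:56 host-device1 system: host 2001:0db8:1234::1 (interface:eth0) disconnected"
d = parse(mes)
print(d["words"])
```

The order of the steps matters: protecting the IPv6 address before splitting on : is what allows the same symbol to be kept in one word and treated as a separator in the others.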
The following sample code parses the message above with log2seq:
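    import log2seq

    mes = "Jan 1 12:34:56 host-device1 system: host 2001:0db8:1234::1 (interface:eth0) disconnected"

    # Load the parsing rules from a script and build a parser from them
    rules = log2seq.load_from_script("./default_parser.py")
    parser = log2seq.init_parser(rules)

    # Parse one line; the word sequence is stored under the "words" key
    d = parser.process_line(mes)
    print(d["words"])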
It outputs the following sequence:
['system', 'host', '2001:0db8:1234::1', 'interface', 'eth0', 'disconnected']
You can see that the : characters in the IPv6 address are kept, while the other : characters are treated as separators and removed.
To customize the parsing rules, see log2seq/default_script.py.
log2seq also accepts rules written in configparser format (see log2seq/data/sample.conf).
The source code is available at https://github.com/cpflat/log2seq
License: 3-Clause BSD