Skip to main content

Parsing Semi-Structured Log Files into Tabular Format

Project description

An Example in Python

Let's again say you have the example logs in the file accesslog.txt.

10.0.0.8 - - [2019-01-01:10:58:12 -0500] "https://mysite.com/index.html"
173.28.102.33 - - [2019-01-01:10:58:25 -0500] "https://mysite.com/login"

We first define the template.

template = '{{ ip ip_address }} - - [{{ date date_time }}] "{{ url URL }}"'

We then need to define our classes. ip and url are builtins with the package, but dates come in a variety of formats so we must explicitly define ours here. Note you can see all builtins using default_classes()

import tabulog, datetime 

date_parser = tabulog.Parser(
  '[0-9]{4}\\-[0-9]{2}\\-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}[ ][\\-\\+][0-9]{4}',
  lambda x:datetime.datetime.strptime(x, '%Y-%m-%d:%H:%M:%S %z'),
  name = 'date'
)
date_parser
Parser('[0-9]{4}\-[0-9]{2}\-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}[ ][\-\+][0-9]{4}', <function <lambda> at 0x7f7a07574e18>, 'date')
for key in ['ip', 'url']:
  print(tabulog.default_classes()[key])
Parser('[0-9]{1,3}(\.[0-9]{1,3}){3}', <function <lambda> at 0x7f7a06c896a8>, 'ip')
Parser('(-|(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&\'\(\)\*\+,;=.]+)', <function <lambda> at 0x7f7a06c896a8>, 'url')

Both ip and url require no formatting, so they have the identity function, (lambda x:x in python), as their formatter.

To get our final output in tabular format, we first combine everything into a Template object.

# We only need to pass our custom date parser class, the defaults will be included.
T = tabulog.Template(
  template_string = template,
  classes = [date_parser]
)
T
Template("{{ ip ip_address }} - - [{{ date date_time }}] \"{{ url URL }}\"", classes = ...)

Note that we only had to pass our custom class date. The builtin classes ip and url were included by default.

Finally, we can read in our log file, and call the tabulate function in our Template object. The final output is a Pandas DataFrame.

with open('accesslog.txt', 'r') as f:
  logs = f.read().split('\n')[:-1]

T.tabulate(logs)
      ip_address                 date_time                            URL
0       10.0.0.8 2019-01-01 10:58:12-05:00  https://mysite.com/index.html
1  173.28.102.33 2019-01-01 10:58:25-05:00       https://mysite.com/login

A more elegant and portable way of completing this task would be to define the template and the custom class in the same file, which can be ported to other Tabulog libraries in other languages, leaving only the formatters to be defined in the R script.

First, we define the template and the classes in a yaml file

~$ cat accesslog_template.yml
template: '{{ ip ip_address }} - - [{{ date date_time }}] "{{ url URL }}"'
classes:
  date: '[0-9]{4}\-[0-9]{2}\-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}[ ][\-\+][0-9]{4}'

Next, we define the formatters for each of our classes. Here we only have one, but we still put it in a named list, with the name matching the name of the class in the template file.

formatters = {
  'date': lambda x:datetime.datetime.strptime(x, '%Y-%m-%d:%H:%M:%S %z')
}

Next, we make create our template again, this time using the file argument.

T = tabulog.Template(
  file = 'accesslog_template.yml',
  formatters = formatters
)
T
Template("{{ ip ip_address }} - - [{{ date date_time }}] \"{{ url URL }}\"", classes = ...)

Again, we get our final output with the same call to tabulate.

T.tabulate(logs)
      ip_address                 date_time                            URL
0       10.0.0.8 2019-01-01 10:58:12-05:00  https://mysite.com/index.html
1  173.28.102.33 2019-01-01 10:58:25-05:00       https://mysite.com/login

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tabulog-0.1.1.tar.gz (5.2 kB view hashes)

Uploaded Source

Built Distribution

tabulog-0.1.1-py3-none-any.whl (5.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page