Apache Common/Combined Log Format Parser

Parses a single Apache web log record. The parser first attempts to match a combined format record; if that fails, it attempts to match a common format record. If the record matches neither pattern, a null record is returned.

To return a dictionary representing the entire record, call CLFParser.logDict(record), passing a single log record:

>>> from clfparser import CLFParser
>>> logRecord='10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 209'
>>> clfDict=CLFParser.logDict(logRecord)
>>> print clfDict
{'%b': '209', '%h': '10.223.157.186', '%time': datetime.datetime(2009, 7, 15, 14, 58, 59), '%l': '-', '%Referer': '',
'%s': '404', '%r': '"GET /favicon.ico HTTP/1.1"', '%u': '-', '%t': '[15/Jul/2009:14:58:59 -0700]', '%timezone': '-0700', '%Useragent': ''}

To return a subset of the log record as a list, call CLFParser.logParts(record, formatMask), where formatMask is a quoted string listing the log items required in the output:

>>> clfParts=CLFParser.logParts(logRecord,'%h %time')
>>> print clfParts
['10.223.157.186', datetime.datetime(2009, 7, 15, 14, 58, 59)]
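
The way a format mask picks items out of a parsed record can be sketched in plain Python. This is only an illustration, assuming the mask is a whitespace-separated list of keys as in the example above; select_parts is a hypothetical helper, not part of clfparser, and the library's own handling may differ.

```python
from datetime import datetime


def select_parts(record_dict, format_mask):
    """Return the values named in a whitespace-separated mask, in mask order.

    Hypothetical helper for illustration; not the clfparser implementation.
    """
    return [record_dict[key] for key in format_mask.split()]


# A hand-built dictionary shaped like the logDict output shown earlier.
record = {
    '%h': '10.223.157.186',
    '%time': datetime(2009, 7, 15, 14, 58, 59),
    '%s': '404',
}
parts = select_parts(record, '%h %time')
```

Because the output follows the order of the mask rather than the order of the log line, the same record can be reshaped for different consumers without reparsing.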

To use with Apache Spark:

>>> from clfparser import CLFParser
>>> accLog = sc.textFile("access_log", 2).cache()

>>> logDict = accLog.map(lambda logRec: CLFParser.logDict(logRec))
>>> logDict.first()
{'%b': u'202', '%h': u'10.223.157.186', '%l': u'-', '%timezone': u'-0700', '%s': u'403', '%r': u'"GET / HTTP/1.1"', '%Referer': '', '%t': u'[15/Jul/2009:14:58:59 -0700]',
'%time': datetime.datetime(2009, 7, 15, 14, 58, 59), '%u': u'-', '%Useragent': ''}

>>> logParts = accLog.map(lambda logRec: CLFParser.logParts(logRec, '%h %t'))
>>> logParts.first()
[u'10.223.157.186', u'[15/Jul/2009:14:58:59 -0700]']

Common Log Format

Described by:

'%h %l %u %t "%r" %>s %b'

Where:

  • %h - host

  • %l - identity

  • %u - userid

  • %t - time

  • %r - request

  • %>s - status

  • %b - size

Combined Log Format

The same as Common Log Format, with two additional fields:

'%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"'

Where:

  • %{Referer}i - HTTP request header referer

  • %{User-agent}i - HTTP request header user agent
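
The "combined first, then common" matching order can be sketched with two regular expressions. The patterns and the match_record function below are assumptions written for illustration; clfparser's own expressions may differ. In particular, this sketch does not handle escaped quotes inside quoted fields.

```python
import re

# Hypothetical patterns for illustration; not taken from clfparser.
COMMON = (
    r'(?P<h>\S+) (?P<l>\S+) (?P<u>\S+) (?P<t>\[[^\]]+\]) '
    r'(?P<r>"[^"]*") (?P<s>\d{3}) (?P<b>\S+)'
)
COMBINED = COMMON + r' (?P<referer>"[^"]*") (?P<useragent>"[^"]*")'

COMMON_RE = re.compile(COMMON + r'$')
COMBINED_RE = re.compile(COMBINED + r'$')


def match_record(record):
    """Try the combined pattern first, then fall back to the common pattern.

    Returns (format_name, fields); format_name is None if neither matches.
    """
    record = record.strip()
    for name, pattern in (("combined", COMBINED_RE), ("common", COMMON_RE)):
        match = pattern.match(record)
        if match:
            return name, match.groupdict()
    return None, {}


common_line = ('10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] '
               '"GET /favicon.ico HTTP/1.1" 404 209')
combined_line = common_line + ' "http://example.com/" "Mozilla/5.0"'
```

Anchoring both patterns at the end of the line is what makes the order matter: a combined record cannot be mistaken for a common one with trailing text.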

Additional Fields

In addition to the standard log fields, clfparser also parses the log time field, %t, to create a Python datetime object, %time, and a string representing the timezone, %timezone.
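
A minimal sketch of how the %t field could be split into those two values, using only the standard library. The function name split_log_time is hypothetical and this is not necessarily how clfparser does it; the sketch assumes English month abbreviations and returns a naive datetime, matching the example output above.

```python
from datetime import datetime


def split_log_time(t_field):
    """Split '[15/Jul/2009:14:58:59 -0700]' into (naive datetime, offset string).

    Hypothetical helper for illustration; not the clfparser implementation.
    """
    # Drop the surrounding brackets, then separate timestamp and offset.
    stamp, zone = t_field.strip("[]").split(" ", 1)
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S"), zone


log_time, timezone = split_log_time('[15/Jul/2009:14:58:59 -0700]')
```

Keeping the offset as a separate string leaves the datetime in the server's local time, so records from servers in different timezones need the offset applied before they are compared.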

Installation

Install using pip:

pip install clfparser

To Do

  • Performance improvements

  • Command line tools

  • Identify request resource as an additional data item
