Apache Common/Combined Log Format Parser
Parses a single Apache web log record. The parser will first attempt to match a combined format record; if this fails, it will attempt to match a common format record. If the record matches neither pattern, a null record is returned.
To return a dictionary representing the entire record, call CLFParser.logDict(record), passing a single log record:
>>> from clfparser import CLFParser
>>> logRecord='10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 209'
>>> clfDict=CLFParser.logDict(logRecord)
>>> print(clfDict)
{'%b': '209', '%h': '10.223.157.186', '%time': datetime.datetime(2009, 7, 15, 14, 58, 59), '%l': '-', '%Referer': '', '%s': '404', '%r': '"GET /favicon.ico HTTP/1.1"', '%u': '-', '%t': '[15/Jul/2009:14:58:59 -0700]', '%timezone': '-0700', '%Useragent': ''}
To return a subset of the log record as a list, call CLFParser.logParts(record, formatMask), where formatMask is a quoted string listing the log items required in the output, separated by spaces:
>>> clfParts=CLFParser.logParts(logRecord,'%h %time')
>>> print(clfParts)
['10.223.157.186', datetime.datetime(2009, 7, 15, 14, 58, 59)]
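The idea of a format mask can be pictured as a lookup into the record dictionary, one key per mask item. The following is a minimal sketch of that idea only; select_fields is a hypothetical helper written for illustration and is not part of clfparser or a description of how logParts is implemented.

```python
def select_fields(record_dict, format_mask):
    """Return the values named in a space-separated mask, in mask order.

    Hypothetical helper for illustration; not clfparser's implementation.
    """
    return [record_dict[key] for key in format_mask.split()]


# A hand-written dictionary in the same shape as the logDict output above.
example = {'%h': '10.223.157.186', '%s': '404', '%b': '209'}
selected = select_fields(example, '%s %h')
```

The order of the items in the mask decides the order of the values in the returned list.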
To use with Apache Spark:
>>> from clfparser import CLFParser
>>> accLog = sc.textFile("access_log", 2).cache()
>>> logDict = accLog.map(lambda logRec: CLFParser.logDict(logRec))
>>> logDict.first()
{'%b': u'202', '%h': u'10.223.157.186', '%l': u'-', '%timezone': u'-0700', '%s': u'403', '%r': u'"GET / HTTP/1.1"', '%Referer': '', '%t': u'[15/Jul/2009:14:58:59 -0700]', '%time': datetime.datetime(2009, 7, 15, 14, 58, 59), '%u': u'-', '%Useragent': ''}
>>> logParts = accLog.map(lambda logRec: CLFParser.logParts(logRec, '%h %t'))
>>> logParts.first()
[u'10.223.157.186', u'[15/Jul/2009:14:58:59 -0700]']
Common Log Format
Described by:
'%h %l %u %t "%r" %>s %b'
Where:
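%h - remote host (IP address or hostname)
%l - remote logname (identity; usually "-")
%u - remote user (userid from HTTP authentication)
%t - time the request was received
%r - request line
%>s - final status code
%b - response size in bytes, excluding headers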
Combined Log Format
The same as Common Log Format, with two additional fields:
'%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"'
Where:
%{Referer}i - HTTP request header referer
%{User-agent}i - HTTP request header user agent
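The match order described at the top of this page (combined first, then common, otherwise a null result) can be sketched with two regular expressions built from the formats above. This is an illustrative sketch only: the patterns, the group names and the match_record function are assumptions made for this example and are not taken from clfparser's source. It also assumes quoted fields contain no embedded double quotes.

```python
import re

# Hypothetical patterns for illustration; not clfparser's own expressions.
COMMON = (
    r'(?P<h>\S+) (?P<l>\S+) (?P<u>\S+) '
    r'(?P<t>\[[^\]]+\]) (?P<r>"[^"]*") (?P<s>\d{3}) (?P<b>\S+)'
)
# Combined format: the common fields plus quoted referer and user agent.
COMBINED = COMMON + r' (?P<referer>"[^"]*") (?P<useragent>"[^"]*")'

COMMON_RE = re.compile(COMMON + r'$')
COMBINED_RE = re.compile(COMBINED + r'$')


def match_record(record):
    """Try the combined pattern, then the common one; None if neither fits."""
    line = record.strip()
    for name, pattern in (("combined", COMBINED_RE), ("common", COMMON_RE)):
        m = pattern.match(line)
        if m:
            return name, m.groupdict()
    return None


rec = ('10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] '
       '"GET /favicon.ico HTTP/1.1" 404 209')
common_result = match_record(rec)
combined_result = match_record(rec + ' "-" "Mozilla/5.0"')
no_result = match_record("not a log line")
```

Trying the longer pattern first and anchoring both patterns at the end of the line keeps a combined record from being accepted as a common record with trailing text.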
Additional Fields
In addition to the standard log fields, clfparser also parses the log time field, %t, to create a Python datetime object, %time, and a string object representing the timezone, %timezone.
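As a rough illustration of how a %t value can be split into those two pieces, the sketch below uses only the standard library. The function name split_log_time is hypothetical and this is not clfparser's code. It assumes an English or C locale, since the %b directive for month abbreviations is locale dependent.

```python
from datetime import datetime


def split_log_time(t_field):
    """Split '[15/Jul/2009:14:58:59 -0700]' into (naive datetime, offset string).

    Hypothetical helper for illustration; not clfparser's implementation.
    """
    # Drop the surrounding brackets, then separate the timestamp and offset.
    stamp, tz = t_field.strip("[]").split(" ")
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S"), tz


time_value, timezone_value = split_log_time("[15/Jul/2009:14:58:59 -0700]")
```

The resulting datetime is naive, like the %time values shown in the examples above; the offset is kept separately as a string.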
Installation
Install using pip:
pip install clfparser
To Do
Performance improvements
Command line tools
Identify request resource as an additional data item