Functions required by the access-logs-local-driver

Project description

Load the content of gzipped Apache HTTP log files. Exclude bots, scrapers, etc., select URLs matching the provided regex(es), and generate a CSV of the relevant log entries.
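As a rough illustration of this first step, the sketch below reads a gzipped log, drops bot traffic, keeps URLs matching the given regexes, and emits CSV. It is not the package's implementation: the function name filter_log, the Apache "combined" log format, the bot heuristic, and the CSV columns are all assumptions made for the example.

```python
import csv
import gzip
import io
import re

# Apache "combined" log format; assumed here, real deployments may differ.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)
# Crude illustrative bot heuristic, not the package's actual filter.
BOT_RE = re.compile(r"bot|crawler|spider|scraper", re.IGNORECASE)


def filter_log(gz_bytes, url_regexes):
    """Return CSV text of non-bot hits whose URL matches any regex."""
    patterns = [re.compile(r) for r in url_regexes]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp", "ip", "url", "user_agent"])
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as fh:
        for line in fh:
            m = LINE_RE.match(line)
            # Skip lines that do not parse and hits from apparent bots.
            if m is None or BOT_RE.search(m["user_agent"]):
                continue
            if any(p.search(m["url"]) for p in patterns):
                writer.writerow(
                    [m["timestamp"], m["ip"], m["url"], m["user_agent"]]
                )
    return out.getvalue()
```

Calling filter_log(gzip.compress(raw_log_bytes), [r"^/books/"]) would return a header row followed by one row per matching, non-bot request.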

Take postprocessed logs and strip out multiple hits in sessions, and resolve URLs to the chosen URI_SCHEME (e.g. info:doi).

We strip out entries where the same (IP address, user agent) pair has accessed a URL within the last SESSION_TIMEOUT (e.g. half an hour).
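The session rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: the function name strip_session_repeats and the tuple layout are hypothetical, the input is assumed to be sorted by timestamp, and the timeout is assumed to restart on every hit (kept or dropped) rather than only on kept hits.

```python
from datetime import datetime, timedelta

# Assumed value, following the "half an hour" example in the text.
SESSION_TIMEOUT = timedelta(minutes=30)


def strip_session_repeats(hits, timeout=SESSION_TIMEOUT):
    """Yield only hits that are not repeats within `timeout`.

    `hits` is an iterable of (timestamp, ip, user_agent, url) tuples,
    assumed to be sorted by timestamp.
    """
    last_seen = {}
    for timestamp, ip, user_agent, url in hits:
        key = (ip, user_agent, url)
        previous = last_seen.get(key)
        # Every hit, kept or not, extends the session (an assumption).
        last_seen[key] = timestamp
        if previous is None or timestamp - previous >= timeout:
            yield (timestamp, ip, user_agent, url)
```

With hits from the same IP address and user agent on the same URL at 10:00, 10:10 and 10:45, the 10:10 hit is dropped and the other two are kept, because 10:45 is more than 30 minutes after the previous hit.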

Additionally, we convert the URLs to ISBNs and collate request data by date, outputting a CSV for ingest via the stats system.
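The collation step might look something like the sketch below. The function name collate_by_date, the url_to_isbn lookup table, and the CSV column names are hypothetical stand-ins; the source does not say how URLs are resolved to ISBNs or what the output columns are.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def collate_by_date(hits, url_to_isbn):
    """Count hits per (date, ISBN) and return them as CSV text.

    `hits` is an iterable of (timestamp, url) pairs; `url_to_isbn` is a
    hypothetical lookup table standing in for the real URL resolution.
    """
    counts = Counter()
    for timestamp, url in hits:
        isbn = url_to_isbn.get(url)
        if isbn is not None:  # skip URLs that cannot be resolved
            counts[(timestamp.date().isoformat(), isbn)] += 1
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "isbn", "requests"])
    # Sort so that the output is stable across runs.
    for (date, isbn), n in sorted(counts.items()):
        writer.writerow([date, isbn, n])
    return out.getvalue()
```

Sorting the (date, ISBN) keys keeps the output deterministic, which makes repeated ingests easier to compare.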

Release Notes:

[0.2.0] - 2026-03-03

Changed:
  • Moved and made use of the Request._clean_path method, which had been overlooked.

[0.2.1] - 2026-03-02

Added:
  • LogProcessor and Request include a match_url_params parameter, which will allow URL parameters to be included in the regex matches for measures_uri determination (Request.match_url).

  • LogProcessor and Request include a canonical_url_params parameter, which is used to create the Request.canonical_url attribute. This is returned by the instance to identify the work.

  • LogProcessor includes an allowed_codes parameter, which specifies the status codes that are considered valid for requests in the access logs.

  • LogProcessor includes a Boolean keep_trailing_slash parameter, which is set to False by default. If False, the trailing forward slashes will be stripped from the URL path.
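To make the URL handling described above more concrete, here is a hedged sketch of how a path could be cleaned. It does not reproduce LogProcessor or Request: clean_url and keep_params are hypothetical names, and the sketch assumes that match_url_params and canonical_url_params are collections of query parameter names to retain, which the release notes do not confirm.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def clean_url(url, keep_params=(), keep_trailing_slash=False):
    """Return the URL path plus only the query parameters in `keep_params`.

    Hypothetical helper: the parameter semantics are assumed, not taken
    from the package.
    """
    parts = urlsplit(url)
    path = parts.path
    if not keep_trailing_slash and len(path) > 1:
        # Strip trailing slashes but never reduce the root path to "".
        path = path.rstrip("/") or "/"
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query) if k in keep_params]
    )
    return f"{path}?{query}" if query else path
```

Under these assumptions, one URL can yield two different strings: one built with the matching parameters for the regex test, and one built with the canonical parameters to identify the work.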

Changed:
  • breaking | Request.url has been split into Request.match_url and Request.canonical_url

  • The logic requiring the creation of filter_groups to pass to LogProcessor has been replaced with passing measure_regexes and excluded_ips.

Removed:
  • process_download_logs functions have been removed

  • Request.url has been removed

[0.1.2] - 2026-02-13

Changed:
  • Allow the LogProcessor._parse_line regex match to fail with a warning.
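The behaviour described in this entry can be illustrated with a small sketch: a line that does not match the regex produces a warning and is skipped instead of raising. The parse_line function and the simplified regex below are assumptions for the example and are not the package's actual LogProcessor._parse_line.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Simplified pattern for illustration; the real log format may differ.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3})'
)


def parse_line(line):
    """Return the named groups of a log line, or None if it does not parse."""
    match = LINE_RE.match(line)
    if match is None:
        # Warn and carry on so one malformed line does not stop the run.
        logger.warning("Skipping unparseable log line: %r", line)
        return None
    return match.groupdict()
```

Callers then need to check for None before using the result.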

[0.1.1] - 2026-01-28

Changed:
  • Fix up function type hints.

Added:
  • Allow customisable “successful status codes” to be set.

[0.1.0] - 2025-12-08

Changed:
  • LogStream replaced with LogProcessor, which requires open file-like objects as input.

Added:
  • Able to process different log formats.

[0.0.7] - 2024-01-05

Changed:
  • Deletion of the spiders filter in process_download_logs.py

[0.0.6] - 2023-08-13

Changed:
  • Refactored driver logic

  • breaking | Changed parameters for the Request.__init__() method
    • Removed re_match_dict parameter

    • Added timestamp and user_agent parameters

  • Changed Request.timestamp from type time to datetime

  • Changed LogStream to use the new Request.__init__()

  • Expanded range for LogStream.logfile_names logic to include files within 1 day of the search_date

  • LogStream.lines() yields Request objects, not str values

  • LogStream.filter_in_line_request() only yields one line per measure

[0.0.5] - 2023-07-03

Changed:
  • Added start_date and end_date for searching in the log files

  • Added the measure_uri to the result

[0.0.4] - 2023-07-31

Changed:
  • Update file structure and name of the driver

[0.0.3] - 2023-07-25

Changed:
  • Update requirements

  • Update using a pyproject.toml file as well as the new deployment structure

[0.0.2] - 2023-07-11

Added:
  • Unit tests

Changed:
  • Moved the files out of the package; the file’s data is now passed in as parameters and the filtered data is returned.

  • Renamed the plugin to access-logs-local
