Characterize the clients hitting a web site by analyzing its access logs.

agent-census

What's hitting your site, classified by how it behaves -- not just what it claims to be.

Most of the traffic to a typical site isn't people; it's software, and a fair bit of it lies about what it is. agent-census reads your access log and sorts the clients by what they actually do -- whether they pull a page's sub-resources like a browser, walk the site like a crawler, poll a feed on a schedule, or go looking for known-vulnerable paths. Anything claiming to be a known crawler is checked against DNS and published address ranges, so a Googlebot arriving from some random datacentre gets called what it is. What you end up with is your traffic broken down by what each client is for. The User-Agent still counts -- it's just treated as a claim to weigh against behaviour and origin, not a fact to take on trust.
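
As a rough illustration of the verification idea described above, the sketch below checks a claimed crawler two ways: against a list of published CIDR ranges, and with a forward-confirmed reverse DNS lookup. It is not agent-census's own code; the function names, the injectable resolver arguments, and the example suffix are assumptions made for the example.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def in_published_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def default_reverse(ip: str) -> str:
    """Reverse (PTR) lookup using the system resolver."""
    return socket.gethostbyaddr(ip)[0]


def default_forward(host: str) -> set[str]:
    """Forward lookup: every address the hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def fcrdns_matches(
    ip: str,
    suffixes: Iterable[str],
    reverse: Callable[[str], str] = default_reverse,
    forward: Callable[[str], set[str]] = default_forward,
) -> bool:
    """Forward-confirmed reverse DNS.

    The PTR name must end with one of the operator's suffixes, and that
    name must resolve back to the original address.
    """
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    if not any(host.endswith(suffix) for suffix in suffixes):
        return False
    try:
        resolved = {ipaddress.ip_address(a) for a in forward(host)}
    except (OSError, ValueError):
        return False
    return ipaddress.ip_address(ip) in resolved
```

A client that claims to be Googlebot would pass only if, say, `fcrdns_matches(ip, (".googlebot.com", ".google.com"))` or the range check succeeds; a request from an unrelated datacentre fails both.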

Here's a sample report generated from a real access log.

Install

pipx is recommended:

pipx install agent-census

Use: Analysis

The simplest case is analyzing one or more Apache logs in the default combined format:

agent-census analyze access.log* > census.html

The presets common, combined, and vhost_combined are available via --log-format-preset.

For a custom log format, pass the LogFormat/CustomLog directive string verbatim from your Apache config. Tab separators (\t), quoted fields with spaces, %{...}x SSL variables, and %{...}e environment variables are all handled:

agent-census analyze access.log \
    --log-format '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" %D'

See "What to log" below for the most important information to gather.
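
To make the fields in that format string concrete, here is a minimal sketch of parsing one combined-format line with a regular expression. It is an illustration only, not the parser agent-census uses; the names COMBINED and parse_combined are made up for the example, and it ignores anything after the User-Agent (such as a trailing %D).

```python
import re
from datetime import datetime

# One group per field of the combined format. Quoted fields allow
# backslash-escaped characters, which is how Apache writes embedded quotes.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>(?:[^"\\]|\\.)*)" "(?P<agent>(?:[^"\\]|\\.)*)"'
)


def parse_combined(line: str):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # %b logs "-" when no body bytes were sent.
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    # Split "METHOD path PROTOCOL"; tolerate request lines without a protocol.
    method, _, rest = rec["request"].partition(" ")
    rec["method"] = method
    rec["path"] = rest.rsplit(" ", 1)[0] if " " in rest else rest
    return rec
```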

Cloudflare Logpush logs (newline-delimited JSON) are also supported, as another preset:

agent-census analyze cloudflare-logs.json --log-format-preset cloudflare
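
For readers unfamiliar with newline-delimited JSON, the sketch below shows how such a file can be read one record at a time. The helper name iter_ndjson is hypothetical, and which Logpush fields agent-census actually reads is not stated here; ClientIP and EdgeResponseStatus in the usage note are field names from Cloudflare's HTTP requests dataset, used only as examples.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line, skipping lines that are not JSON objects."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated or corrupt line
        if isinstance(obj, dict):
            yield obj
```

With an open file, `for rec in iter_ndjson(fh): print(rec.get("ClientIP"), rec.get("EdgeResponseStatus"))` would walk the log without loading it all into memory.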

Options

Use agent-census analyze -h for the full list of analysis options.

robots.txt compliance: To check whether clients honour your robots.txt, use --robots-file to supply it as a local file, hostname, or URL:

agent-census analyze access.log --robots-file ./robots.txt
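
The kind of check this enables can be sketched with Python's standard urllib.robotparser: given the robots.txt text and the paths a client requested, list the requests that the file disallows for that client's User-Agent. How agent-census itself evaluates compliance is not described here; the function name disallowed_hits is an assumption for the example.

```python
from urllib.robotparser import RobotFileParser


def disallowed_hits(robots_txt: str, user_agent: str, paths):
    """Return the requested paths that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]
```

A client whose trace includes paths returned by this function fetched something it was asked not to.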

Output format: Output is a self-contained HTML page by default; write it to a file with -o, or pass --md for Markdown:

agent-census analyze access.log -o census.html
agent-census analyze access.log --md

Host header filtering: --vhost SUBSTRING analyses only the lines served for a matching host:

agent-census analyze access.log --log-format-preset vhost_combined \
    --vhost mnot.net --vhost www.mnot.net

Client identity: Use --identity to change how requests are associated with clients. The default, ip_ua, groups by (IP, User-Agent). Behind a CDN, use forwarded, which takes the left-most X-Forwarded-For address; for bots that rotate IP addresses within one range, use ip_ua_subnet:

agent-census analyze access.log --identity forwarded
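
The three modes can be pictured as three ways of building a grouping key. The sketch below is a guess at the shape, not the tool's implementation: the subnet sizes (/24 for IPv4, /48 for IPv6), keeping the User-Agent in the forwarded key, and falling back to the connecting IP when the header is absent are all assumptions.

```python
import ipaddress


def identity_key(mode, ip, user_agent, forwarded_for=None):
    """Build the key that requests are grouped on for a given identity mode."""
    if mode == "ip_ua":
        return (ip, user_agent)
    if mode == "forwarded":
        # Left-most X-Forwarded-For entry; fall back to the connecting IP (assumed).
        first = (forwarded_for or "").split(",")[0].strip()
        return (first or ip, user_agent)
    if mode == "ip_ua_subnet":
        addr = ipaddress.ip_address(ip)
        prefix = 24 if addr.version == 4 else 48  # assumed subnet sizes
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        return (str(net), user_agent)
    raise ValueError(f"unknown identity mode: {mode}")
```

With ip_ua_subnet, requests from 203.0.113.66 and 203.0.113.200 carrying the same User-Agent land in one group, which is the point of the mode for bots that rotate addresses.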

AS lookups: If your logs don't record the AS number, point --mm-asn-db at a MaxMind ASN database to recover it from each client's IP. When the log also carries an AS number, the database is still consulted first (it can be fresher than the log). The option is remembered between runs:

agent-census analyze access.log --mm-asn-db ./GeoLite2-ASN.mmdb

Remembered settings

Some options are sticky, so you needn't retype them. --log-format / --log-format-preset, --identity, and --robots-file / --robots-url are saved to ~/.config/agent-census/config.json and reused when a later run omits them. Passing one updates the saved value.
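
The behaviour described above amounts to a small merge: start from the saved values, let anything passed on the command line override and update them, and reuse the rest. The sketch below shows that logic under stated assumptions; the JSON key names and the function resolve_sticky are hypothetical, and the real layout of config.json is not documented here.

```python
import json
from pathlib import Path

# Hypothetical key names; the actual keys in config.json may differ.
STICKY = ("log_format", "identity", "robots_file")


def resolve_sticky(passed: dict, path: Path) -> dict:
    """Merge options passed this run over the saved ones, saving any change."""
    saved = {}
    if path.exists():
        saved = json.loads(path.read_text(encoding="utf-8"))
    merged = dict(saved)
    for key in STICKY:
        if passed.get(key) is not None:
            merged[key] = passed[key]  # passing an option updates the saved value
    if merged != saved:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    return merged
```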

Use: Inspecting a client

To see why a client was classified the way it was, use inspect. It shows every signal that fired (including the runners-up), the measured features, the robots.txt finding, and the request trace:

agent-census inspect access.log --kind vuln_scanner
agent-census inspect access.log --client 203.0.113.66
agent-census inspect access.log --kind scraper --network aws

--network matches a substring of the origin-network name and composes with --kind, so the two together select a single cell of the cross-tab.

Most analyze options apply; see agent-census inspect -h for a full list of options.

What to log

The Apache combined format already carries everything the core analysis needs. The common preset drops the User-Agent and the Referer, so prefer combined, or a custom format that includes them.

Required (all present in combined):

  • Client address (%h) -- the identity everything else groups on, and the basis for the network, datacentre, and crawler-verification checks.
  • Timestamp (%t) -- timing regularity, peak request rate, the reported time range, and (with --quiescent-hours) freeing memory mid-run.
  • Request line ("%r") -- the method and path; the most load-bearing field, behind vulnerability probing, feed detection, path coverage, and crawl shape.
  • Status code (%>s) -- the status mix, 404 storms, 304 Not Modified (the has-cache tag), and robots.txt compliance.
  • User-Agent ("%{User-Agent}i") -- browser, bot, and declared-crawler recognition.
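
As an example of what the timestamp makes possible, timing regularity can be measured as the coefficient of variation of the gaps between a client's requests: near zero for a poller on a fixed schedule, large for bursty or human traffic. This is one plausible measure, offered as a sketch; the statistic agent-census actually computes is not specified here, and interval_cv is a made-up name.

```python
from statistics import mean, pstdev


def interval_cv(timestamps):
    """Coefficient of variation of inter-request gaps (seconds), or None if too few."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None  # need at least three requests to say anything
    avg = mean(gaps)
    if avg == 0:
        return None  # all requests share one timestamp
    return pstdev(gaps) / avg
```

A feed reader polling every hour gives gaps of 3600 seconds each and a result of 0.0; a client with gaps of 5, 395, 10, and 8590 seconds scores well above 1.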

Strongly recommended. The first two are already in combined; the rest aren't in any preset, so add them to a custom LogFormat (quoted) -- they're worth it:

  • Referer ("%{Referer}i", in combined) -- referer-following, which separates crawlers from scrapers and flags fabricated referers.
  • Bytes sent (%b or %B, in combined) -- the bandwidth figures in the report.
  • AS organisation and number ("%{MM_ASORG}e" and "%{MM_ASN}e", MaxMind mod_maxminddb) -- name datacentre clients by their hosting organisation, and recognise datacentres and ASN-listed crawlers by AS number. Much of "Networks and hosting" leans on these; log both (the number drives recognition, the org names it). Can't log them? --mm-asn-db recovers the AS from a MaxMind database instead (see Options).
  • Content-Type ("%{Content-Type}o") -- the response media type, which sharpens feed-reader detection (an RSS/Atom type, not just a feed-shaped URL).
  • X-Forwarded-For ("%{X-Forwarded-For}i") -- if you're behind a CDN or proxy, for --identity forwarded.
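
To show why the logged Content-Type helps, the sketch below decides whether a response was a feed from its media type rather than from the shape of the URL. The set of types and the function name is_feed_response are assumptions for the example; the tool may recognise more types than these two.

```python
# RSS and Atom media types; an assumed, minimal set.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


def is_feed_response(content_type):
    """True if the logged Content-Type names an RSS or Atom media type."""
    if not content_type or content_type == "-":
        return False  # Apache logs "-" when the header is absent
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in FEED_TYPES
```

A request for /feed.xml answered with text/html (an error page, say) would then not count as a feed fetch, while one answered with application/atom+xml; charset=utf-8 would.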
