Characterize the clients hitting a web site by analyzing its access logs.

agent-census

What's hitting your site, classified by how it behaves -- not just what it claims to be.

Most of the traffic to a typical site isn't people; it's software, and a fair bit of it lies about what it is. agent-census reads your access log and sorts the clients by what they actually do -- whether they pull a page's sub-resources like a browser, walk the site like a crawler, poll a feed on a schedule, or go looking for known-vulnerable paths. Anything claiming to be a known crawler is checked against DNS and published address ranges, so a Googlebot arriving from some random datacentre gets called what it is. What you end up with is your traffic broken down by what each client is for. The User-Agent still counts -- it's just treated as a claim to weigh against behaviour and origin, not a fact to take on trust.
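The crawler check described above can be sketched in Python as follows. This is only an illustration of forward-confirmed reverse DNS plus a CIDR lookup, not agent-census's own code: the function names, the injectable resolver arguments, and the suffix and range values passed in are all assumptions made for the example.

```python
import ipaddress
import socket


def reverse_dns(ip):
    """Return the PTR hostname for ip, or "" when there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ""


def forward_dns(hostname):
    """Return every address hostname resolves to (empty list on failure)."""
    try:
        return [info[4][0] for info in socket.getaddrinfo(hostname, None)]
    except OSError:
        return []


def verified_by_dns(ip, suffixes, reverse=reverse_dns, forward=forward_dns):
    """Forward-confirmed reverse DNS.

    The PTR name must sit under one of the allowed domain suffixes, and
    that name must resolve back to the address the request came from.
    The resolvers are parameters (an assumption of this sketch) so the
    check can be exercised without network access.
    """
    host = reverse(ip).rstrip(".").lower()
    if not host or not host.endswith(tuple(suffixes)):
        return False
    client = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(a) == client for a in forward(host))


def verified_by_range(ip, cidrs):
    """True when ip falls inside any of the published CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(c) for c in cidrs)
```

A client whose User-Agent claims to be a known crawler but fails both checks is the "Googlebot from some random datacentre" case: the claim is recorded, but the origin contradicts it.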

Here's a sample report generated from a real access log.

Install

pipx is recommended:

pipx install agent-census

Use: Analysis

The simplest case is analyzing one or more Apache logs in the default combined format:

agent-census analyze access.log* > census.html

The presets common, combined, and vhost_combined are available via --log-format-preset.

For a custom log format, pass the LogFormat/CustomLog directive string verbatim from your Apache config. Tab separators (\t), quoted fields with spaces, %{...}x SSL variables, and %{...}e environment variables are all handled:

agent-census analyze access.log \
    --log-format '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" %D'

See "What to log" below for the most important information to gather.
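For readers who want to see what the combined format carries, here is a minimal Python sketch that parses one combined-format line with a regular expression. It is not how agent-census parses logs; the COMBINED pattern, the parse_line helper, and the field names are assumptions for illustration, and the sample line is made up.

```python
import re
from datetime import datetime

# Apache combined: %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
# Quoted fields allow backslash-escaped characters inside the quotes.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>(?:[^"\\]|\\.)*)" "(?P<user_agent>(?:[^"\\]|\\.)*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line is not combined format."""
    m = COMBINED.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # The request line is "METHOD PATH PROTOCOL" when well formed.
    parts = rec["request"].split(" ")
    rec["method"], rec["path"] = (parts[0], parts[1]) if len(parts) == 3 else ("", "")
    rec["status"] = int(rec["status"])
    # %b logs "-" when no body bytes were sent.
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


sample = (
    '203.0.113.66 - - [10/Oct/2024:13:55:36 +0000] '
    '"GET /feed.xml HTTP/1.1" 304 - "-" "ExampleReader/1.0"'
)
record = parse_line(sample)
```

Because the pattern is anchored only at the start, extra trailing fields such as %D do not prevent a match; they are simply ignored by this sketch.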

Cloudflare Logpush logs (newline-delimited JSON) are also supported, as another preset:

agent-census analyze cloudflare-logs.json --log-format-preset cloudflare

Options

Use agent-census analyze -h for the full list of analysis options.

robots.txt compliance: Use --robots-file to supply a local file, hostname, or URL:

agent-census analyze access.log --robots-file ./robots.txt
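As a rough picture of what a robots.txt compliance check involves, the sketch below uses Python's standard urllib.robotparser to list the requested paths that a given User-Agent was not allowed to fetch. The disallowed_hits helper and the sample rules are assumptions for the example, and the standard-library parser's matching rules may differ from the ones agent-census applies.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def disallowed_hits(robots_txt, user_agent, paths):
    """Return the requested paths that robots.txt forbids for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


requested = ["/index.html", "/private/report.html"]
other = disallowed_hits(ROBOTS_TXT, "OtherBot/2.0", requested)
example = disallowed_hits(ROBOTS_TXT, "ExampleBot/1.0", requested)
```

Here OtherBot falls under the wildcard group and is only barred from /private/, while ExampleBot is barred from the whole site, so every request it made counts against it.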

Output format: Output is a self-contained HTML page by default; redirect it with -o, or pass --md for Markdown:

agent-census analyze access.log -o census.html
agent-census analyze access.log --md

Host header filtering: --vhost SUBSTRING analyses only the lines served for a matching host:

agent-census analyze access.log --log-format-preset vhost_combined \
    --vhost mnot.net --vhost www.mnot.net

Client identity: Use --identity to change how requests are associated with clients. The default, ip_ua, groups by (IP, User-Agent). Behind a CDN, use forwarded, which takes the left-most X-Forwarded-For address; for bots that rotate IPs within one range, use ip_ua_subnet:

agent-census analyze access.log --identity forwarded
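The three identity modes can be pictured as different grouping keys. The Python sketch below is an assumption-laden illustration, not the tool's implementation: the identity_key function, the record field names, the /24 and /48 subnet sizes, and the choice to keep the User-Agent in the forwarded key are all hypothetical.

```python
import ipaddress


def identity_key(record, mode="ip_ua", v4_prefix=24, v6_prefix=48):
    """Return the key that requests are grouped on for a given mode."""
    ip = record["ip"]
    ua = record.get("user_agent", "")
    if mode == "ip_ua":
        return (ip, ua)
    if mode == "forwarded":
        # Left-most entry is the address the first proxy saw.
        first = record.get("x_forwarded_for", "").split(",")[0].strip()
        # Apache logs "-" for a missing header; fall back to the peer IP.
        return (ip if first in ("", "-") else first, ua)
    if mode == "ip_ua_subnet":
        addr = ipaddress.ip_address(ip)
        prefix = v4_prefix if addr.version == 4 else v6_prefix
        # strict=False masks off the host bits instead of raising.
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        return (str(net), ua)
    raise ValueError(f"unknown identity mode: {mode}")


hit = {
    "ip": "198.51.100.7",
    "user_agent": "ExampleBot/1.0",
    "x_forwarded_for": "203.0.113.9, 198.51.100.7",
}
```

With ip_ua_subnet, two requests from 198.51.100.7 and 198.51.100.200 with the same User-Agent would land in one group, which is the point of that mode for IP-rotating bots.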

AS lookups: If your logs don't record the AS number, point --mm-asn-db at a MaxMind ASN database to recover it from each client's IP. The database is consulted first (it can be fresher than the log) and is remembered between runs:

agent-census analyze access.log --mm-asn-db ./GeoLite2-ASN.mmdb

Remembered settings

Some options are sticky, so you needn't retype them. --log-format / --log-format-preset, --identity, and --robots-file / --robots-url are saved to ~/.config/agent-census/config.json and reused when a later run omits them. Passing one updates the saved value.
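The sticky behaviour amounts to merging command-line values over a saved JSON file. A minimal Python sketch of that pattern follows; the resolve_sticky function and the key names in STICKY_KEYS are hypothetical, and the actual layout of config.json is not documented here.

```python
import json
from pathlib import Path

# Hypothetical key names; the real config.json layout may differ.
STICKY_KEYS = ("log_format", "identity", "robots_file")


def resolve_sticky(cli_args, config_path):
    """Merge CLI values over saved ones and persist any change."""
    saved = {}
    if config_path.exists():
        saved = json.loads(config_path.read_text())
    merged = dict(saved)
    for key in STICKY_KEYS:
        # An option that was omitted (missing or None) keeps its saved value.
        if cli_args.get(key) is not None:
            merged[key] = cli_args[key]
    if merged != saved:
        config_path.parent.mkdir(parents=True, exist_ok=True)
        config_path.write_text(json.dumps(merged, indent=2))
    return merged
```

Calling resolve_sticky({"identity": "forwarded"}, path) once and then resolve_sticky({}, path) returns the same identity both times, which mirrors the "needn't retype them" behaviour.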

Use: Inspecting a client

To see why a client was classified the way it was, use inspect. It shows every signal that fired (including the runners-up), the measured features, the robots.txt finding, and the request trace:

agent-census inspect access.log --kind vuln_scanner
agent-census inspect access.log --client 203.0.113.66
agent-census inspect access.log --kind scraper --network aws

--network matches a substring of the origin-network name and composes with --kind, so the two together select a single cell of the cross-tab.

Most analyze options also apply to inspect; see agent-census inspect -h for the full list.

What to log

The Apache combined format already carries everything the core analysis needs. The common preset drops the User-Agent and the Referer, so prefer combined, or a custom format that includes them.

Required (all present in combined):

  • Client address (%h) -- the identity everything else groups on, and the basis for the network, datacentre, and crawler-verification checks.
  • Timestamp (%t) -- timing regularity, peak request rate, the reported time range, and (with --quiescent-hours) freeing memory mid-run.
  • Request line ("%r") -- the method and path; the most load-bearing field, behind vulnerability probing, feed detection, path coverage, and crawl shape.
  • Status code (%>s) -- the status mix, 404 storms, 304 Not Modified (the has-cache tag), and robots.txt compliance.
  • User-Agent ("%{User-Agent}i") -- browser, bot, and declared-crawler recognition.
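To make "timing regularity" concrete, one common measure is the coefficient of variation of the gaps between a client's requests: a scheduled poller scores near zero, a person browsing scores much higher. The Python sketch below shows that measure only as an assumption about what such a feature could look like; the interval_cv name is hypothetical and the README does not say which statistic agent-census uses.

```python
from statistics import mean, pstdev


def interval_cv(timestamps):
    """Coefficient of variation of inter-request gaps, in seconds.

    Returns None when there are fewer than two gaps or all requests
    share one timestamp, since the ratio is undefined in those cases.
    """
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None
    avg = mean(gaps)
    if avg == 0:
        return None
    return pstdev(gaps) / avg


# A feed reader polling every 15 minutes versus a bursty visitor.
poller = interval_cv([0, 900, 1800, 2700])
bursty = interval_cv([0, 5, 7, 400, 410])
```

The poller's gaps are identical, so its score is zero; the bursty client's gaps vary widely, so its score is well above one.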

Strongly recommended. The first two are already in combined; the rest aren't in any preset, so add them to a custom LogFormat (quoted) -- they're worth it:

  • Referer ("%{Referer}i", in combined) -- referer-following, which separates crawlers from scrapers and flags fabricated referers.
  • Bytes sent (%b or %B, in combined) -- the bandwidth figures in the report.
  • AS organisation and number ("%{MM_ASORG}e" and "%{MM_ASN}e", MaxMind mod_maxminddb) -- name datacentre clients by their hosting organisation, and recognise datacentres and ASN-listed crawlers by AS number. Much of Networks and hosting leans on these; log both (the number drives recognition, the org names it). Can't log them? --mm-asn-db recovers the AS from a MaxMind database instead (see Options).
  • Content-Type ("%{Content-Type}o") -- the response media type, which sharpens feed-reader detection (an RSS/Atom type, not just a feed-shaped URL).
  • X-Forwarded-For ("%{X-Forwarded-For}i") -- if you're behind a CDN or proxy, for --identity forwarded.

Download files

Source Distribution

agent_census-0.0.2.tar.gz (177.8 kB)

Built Distribution

agent_census-0.0.2-py3-none-any.whl (158.0 kB)

File details

Details for the file agent_census-0.0.2.tar.gz.

File metadata

  • Download URL: agent_census-0.0.2.tar.gz
  • Size: 177.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_census-0.0.2.tar.gz
Algorithm Hash digest
SHA256 cb7dea52352c61d8151de2859881ea597967734d848b2f94d3fdaa74777cfadc
MD5 3928f629893914c3e3261a8a2c4c04b0
BLAKE2b-256 97a049d42a4ee478c042af37eb9a215181782599d6c343122e0c23d28149ecd0

Provenance

The following attestation bundles were made for agent_census-0.0.2.tar.gz:

Publisher: publish.yml on mnot/agent-census

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file agent_census-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: agent_census-0.0.2-py3-none-any.whl
  • Size: 158.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_census-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 5ef195d8796e86d9766d803b51e61bfbcca7d1b554e2a7a3c515b7cc7fcb5e73
MD5 7de9cd1b124ed67c8410a7f26bdcfff8
BLAKE2b-256 61ce0642d6507fe58cf26062da1d93f6615e1e40d95757fb133d84e41928021a

Provenance

The following attestation bundles were made for agent_census-0.0.2-py3-none-any.whl:

Publisher: publish.yml on mnot/agent-census
