Characterize the clients hitting a web site by analyzing its access logs.

agent-census

What's hitting your site, classified by how it behaves -- not just what it claims to be.

Most of the traffic to a typical site isn't people; it's software, and a fair bit of it lies about what it is. agent-census reads your access log and sorts the clients by what they actually do -- whether they pull a page's sub-resources like a browser, walk the site like a crawler, poll a feed on a schedule, or go looking for known-vulnerable paths. Anything claiming to be a known crawler is checked against DNS and published address ranges, so a Googlebot arriving from some random datacentre gets called what it is. What you end up with is your traffic broken down by what each client is for. The User-Agent still counts -- it's just treated as a claim to weigh against behaviour and origin, not a fact to take on trust.

Here's a sample report generated from a real access log.

Install

pipx install agent-census

Use

The simplest case is an Apache log in the default combined format:

agent-census analyze /var/log/apache2/access.log

You can pass several rotated logs at once. They're pooled into one analysis, so a client that spans the rotation is counted once:

agent-census analyze /var/log/httpd/access.log*

For a custom format, pass the LogFormat/CustomLog directive string verbatim from your Apache config. Tab separators (\t), quoted fields with spaces, %{...}x SSL variables, and %{...}e environment variables are all handled:

agent-census analyze access.log \
    --log-format '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" %D'

The presets common, combined, and vhost_combined are available via --log-format-preset. Options may appear before, after, or between the log files.

Cloudflare Logpush logs (newline-delimited JSON) are also supported, as another preset:

agent-census analyze cloudflare-logs.json --log-format-preset cloudflare

Cloudflare logs carry the client's AS number, so network and ASN-based detection work without any extra configuration.

What to log

The Apache combined format already carries everything the core analysis needs. The common preset drops the User-Agent and the Referer, so prefer combined, or a custom format that includes them.

Required (all present in combined):

  • Client address (%h) -- the identity everything else groups on, and the basis for the network, datacentre, and crawler-verification checks.
  • Timestamp (%t) -- timing regularity, peak request rate, the reported time range, and (with --quiescent-hours) freeing memory mid-run.
  • Request line ("%r") -- the method and path; the most load-bearing field, behind vulnerability probing, feed detection, path coverage, and crawl shape.
  • Status code (%>s) -- the status mix, 404 storms, 304 Not Modified (the has-cache tag), and robots.txt compliance.
  • User-Agent ("%{User-Agent}i") -- browser, bot, and declared-crawler recognition.

Recommended. The first two are already in combined; the rest aren't in any preset, so add them to a custom LogFormat (quoted) -- they're worth it:

  • Referer ("%{Referer}i", in combined) -- referer-following, which separates crawlers from scrapers and flags fabricated referers.
  • Bytes sent (%b or %B, in combined) -- the bandwidth figures in the report.
  • AS organisation and number ("%{MM_ASORG}e" and "%{MM_ASN}e", MaxMind mod_maxminddb) -- name datacentre clients by their hosting organisation, and recognise datacentres and ASN-listed crawlers by AS number. Much of Networks and hosting leans on these; log both (the number drives recognition, the org names it).
  • Content-Type ("%{Content-Type}o") -- the response media type, which sharpens feed-reader detection (an RSS/Atom type, not just a feed-shaped URL).
  • X-Forwarded-For ("%{X-Forwarded-For}i") -- if you're behind a CDN or proxy, for --identity forwarded.

Response time (%D / %T) and the virtual host are parsed if present but not currently used by the analysis.
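To make the field list concrete, here is a minimal Python sketch that pulls those fields out of one combined-format line. The regex and the parse_combined name are assumptions for illustration, not agent-census's own parser, and the pattern does not handle escaped quotes inside a quoted field.

```python
import re
from datetime import datetime

# Apache "combined": %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_combined(line):
    """Return the fields of one combined-format line, or None if it doesn't match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # %b logs "-" when no body bytes were sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    # A well-formed request line is "METHOD path PROTOCOL"; others are left unsplit.
    parts = fields["request"].split(" ")
    fields["method"], fields["path"] = (
        (parts[0], parts[1]) if len(parts) == 3 else (None, None)
    )
    return fields

sample = ('203.0.113.66 - - [10/Oct/2024:13:55:36 +0000] "GET /.env HTTP/1.1" '
          '404 153 "-" "Mozilla/5.0"')
print(parse_combined(sample)["path"])  # /.env
```

Returning None for a non-matching line, rather than raising, lets a caller count parse skips and carry on.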

Output is Markdown by default. Pass --html for a self-contained, styled page (one file, no external assets) you can open in a browser. Both formats work for analyze and inspect:

agent-census analyze access.log --html -o census.html

The report opens with a summary of each kind, then a cross-tab of where each kind's traffic came from (see Networks and hosting), then the notable clients in each kind. Within a kind, clients that differ only by IP address and origin AS -- same User-Agent, same tags -- are collapsed into one row showing their combined traffic. In the HTML report a disclosure expands to the per-IP/ASN breakdown, and inspect always lists them individually.

robots.txt compliance

To check robots.txt compliance, give agent-census the file. A local copy is the default, since it should match the period the log covers:

agent-census analyze access.log --robots-file ./robots.txt

Naming a host or URL instead fetches it over the network. A live robots.txt may not match the rules that applied when the log was written, so the report flags it:

agent-census analyze access.log --host example.com

The summary's robots column reads N✓ / M✗ / K?: respected, ignored, or too few requests to tell (a client that hasn't yet requested a disallowed path isn't counted either way).
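The core of any compliance check is asking whether each requested path was disallowed for the client's User-Agent. A minimal sketch using Python's standard urllib.robotparser follows. The disallowed_hits name and the sample rules are hypothetical, and agent-census's own rules for deciding between respected, ignored, and too-few-to-tell are not reproduced here.

```python
from urllib.robotparser import RobotFileParser

def disallowed_hits(robots_txt, user_agent, paths):
    """Return the requested paths that robots.txt disallows for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]

# Hypothetical rules and request paths, for illustration only.
rules = """\
User-agent: *
Disallow: /private/
"""
print(disallowed_hits(rules, "ExampleBot/1.0", ["/", "/private/report", "/feed.xml"]))
```

A client with an empty result has not been caught ignoring the rules, which is not the same as having been shown to respect them.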

Verifying declared crawlers

A User-Agent claiming Googlebot proves nothing on its own. Verification checks the client's IP against the crawler's published address ranges and its reverse/forward DNS. It runs by default and makes network calls (DNS lookups, and the occasional ranges fetch); turn it off for an offline, faster run:

agent-census analyze access.log --no-verify-bots

A verified crawler's IPs collapse into one entry keyed by its domain. A client whose IP is outside the published ranges, or whose reverse DNS doesn't check out, is classed impersonator, which means a forged identity that verification has disproved. Misbehaviour is separate: a "Googlebot" that probes for /.env keeps its declared kind and gets a probing tag (and ignores-robots if it earns one), because a real crawler can still behave badly. With verification off there's nothing to disprove the claim, so it stays a declared crawler with those tags.
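The DNS half of that check is commonly done as forward-confirmed reverse DNS: look up the IP's PTR name, confirm it sits under a domain the crawler operator uses, then resolve that name forward and confirm it returns the same IP. The sketch below assumes that approach; verify_by_dns and its parameters are hypothetical names, not agent-census's API. The resolver functions are injectable so the logic can be exercised offline.

```python
import socket

def verify_by_dns(ip, allowed_suffixes,
                  reverse=socket.gethostbyaddr,
                  forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for one client IP.

    allowed_suffixes are domain suffixes with a leading dot, such as
    ".googlebot.com". Note that gethostbyname_ex resolves IPv4 only.
    """
    try:
        hostname = reverse(ip)[0].lower().rstrip(".")
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        # The name must resolve back to the address we started from.
        return ip in forward(hostname)[2]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

A failed lookup is treated as unverified here. Whether it should instead count as inconclusive is a design choice the source does not spell out.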

Networks and hosting

Where a client comes from matters. A "browser" arriving from a datacentre rather than a consumer ISP is usually automation. agent-census recognises the major cloud and hosting providers (AWS, Google Cloud, Cloudflare, Hetzner) from their published IP ranges, folds shared-egress traffic (iCloud Private Relay, Tor) into one entry per network, and breaks the kinds down by origin network in a cross-tab. In the HTML report that table is interactive: switch between raw counts, share of each kind, and share of each network, with the busier cells shaded.

Range lists are fetched and cached weekly by default. Pass --no-fetch-ranges to stay offline and use only the bundled data.
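Matching a client against published ranges comes down to a CIDR membership test. Here is a minimal sketch with Python's ipaddress module. The network names and prefixes are hypothetical (they use the reserved documentation ranges), and origin_network is an illustrative name rather than part of agent-census.

```python
import ipaddress

# Hypothetical ranges built from documentation prefixes; real lists come
# from each provider's published data.
RANGES = {
    "example-cloud": ["198.51.100.0/24", "2001:db8:1::/48"],
    "example-host": ["203.0.113.0/25"],
}
NETWORKS = [
    (name, ipaddress.ip_network(cidr))
    for name, cidrs in RANGES.items()
    for cidr in cidrs
]

def origin_network(ip):
    """Return the name of the first listed network containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, net in NETWORKS:
        if addr.version == net.version and addr in net:
            return name
    return None

print(origin_network("203.0.113.66"))   # example-host
print(origin_network("203.0.113.200"))  # None (outside the /25)
```

A linear scan is fine for a sketch. With thousands of prefixes, a sorted list or a prefix tree would be the usual next step.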

If your log carries the client's autonomous-system details (for example from MaxMind's mod_maxminddb: %{MM_ASORG}e for the organisation and %{MM_ASN}e for the number, quoted in your LogFormat), datacentre clients are named by their hosting organisation. You can also list extra AS numbers to treat as datacentres in the bundled datacenter_ranges.toml.

Inspecting a client

To see why a client was classified the way it was, use inspect. It shows every signal that fired (including the runners-up), the measured features, the robots.txt finding, and the request trace:

agent-census inspect access.log --kind vuln_scanner
agent-census inspect access.log --client 203.0.113.66
agent-census inspect access.log --kind scraper --network aws

--network matches a substring of the origin-network name and composes with --kind, so the two together select a single cell of the cross-tab.

Identity

How requests are grouped into clients is configurable, since no single rule fits every deployment. The default, ip_ua, groups by (IP, User-Agent). Behind a CDN, use forwarded (the left-most X-Forwarded-For); for IP-rotating bots in one range, ip_ua_subnet. The report notes how the chosen strategy fragmented or merged the data, so you can judge whether it fit.

agent-census analyze access.log --identity forwarded
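One way to picture the three strategies is as different grouping keys for the same request. The sketch below is illustrative only: client_key is a hypothetical name, the /24 and /48 prefix lengths are assumptions, and so is keeping the User-Agent in the forwarded key.

```python
import ipaddress

def client_key(strategy, ip, user_agent, forwarded_for=None,
               prefix_v4=24, prefix_v6=48):
    """Return the grouping key for one request under the given strategy."""
    if strategy == "ip_ua":
        return (ip, user_agent)
    if strategy == "forwarded":
        # Left-most X-Forwarded-For entry; fall back to the peer address
        # when the header is absent.
        client = forwarded_for.split(",")[0].strip() if forwarded_for else ip
        return (client, user_agent)
    if strategy == "ip_ua_subnet":
        addr = ipaddress.ip_address(ip)
        prefix = prefix_v4 if addr.version == 4 else prefix_v6
        # strict=False masks off the host bits instead of raising.
        subnet = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        return (str(subnet), user_agent)
    raise ValueError(f"unknown identity strategy: {strategy}")

# Two addresses in the same /24 with the same User-Agent share one key.
print(client_key("ip_ua_subnet", "203.0.113.66", "curl/8.0"))
print(client_key("ip_ua_subnet", "203.0.113.200", "curl/8.0"))
```

The left-most X-Forwarded-For value is supplied by the client side of the proxy chain, so it is only as trustworthy as the proxies in front of the server.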

Scoping to one site

If one server's log mixes several virtual hosts, --vhost SUBSTRING analyses only the lines served for a matching host (matched against the logged %v, or the Host header if you don't log %v). The filtered lines are reported as excluded, separately from parse skips. --vhost is repeatable: a line is kept if it matches any of the given hosts.

agent-census analyze access.log --log-format-preset vhost_combined \
    --vhost mnot.net --vhost www.mnot.net

This also sidesteps a CDN artefact: if a slice of your traffic was proxied to this origin under another hostname, those requests arrive from the CDN's IPs (so they can't be attributed or crawler-verified). Scoping to your own host drops that slice cleanly.
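The any-match rule for repeated --vhost values can be sketched in a few lines of Python. keep_line is a hypothetical name, and the case-insensitive comparison is an assumption rather than documented behaviour.

```python
def keep_line(logged_host, vhost_substrings):
    """Keep a line if no filter is set, or if its host contains any substring."""
    if not vhost_substrings:
        return True
    host = (logged_host or "").lower()
    return any(s.lower() in host for s in vhost_substrings)

filters = ["mnot.net"]
print(keep_line("www.mnot.net", filters))     # True
print(keep_line("cdn.example.org", filters))  # False
```

Because the match is on a substring, mnot.net alone already covers www.mnot.net. Listing both, as in the command above, is harmless.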

Remembered settings

Some options are sticky, so you needn't retype them. --log-format / --log-format-preset, --identity, and --robots-file / --robots-url are saved to ~/.config/agent-census/config.json and reused when a later run omits them. Passing one updates the saved value.
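The reuse-then-update behaviour amounts to overlaying this run's options on the saved ones and writing the result back. A minimal sketch follows. The key names in STICKY and the resolve_settings function are hypothetical; only the config file location comes from the text above.

```python
import json
from pathlib import Path

# Hypothetical key names; the real file's layout may differ.
STICKY = ("log_format", "identity", "robots_file")

def resolve_settings(cli_options, config_path):
    """Merge saved sticky settings with options given on this run.

    Options passed as None count as omitted, so the saved value is reused.
    Any option actually given overrides the saved value and is written back.
    """
    path = Path(config_path)
    saved = json.loads(path.read_text()) if path.exists() else {}
    given = {k: v for k, v in cli_options.items() if k in STICKY and v is not None}
    merged = {**saved, **given}
    if given:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(merged, indent=2))
    return merged
```

Calling it with Path.home() / ".config/agent-census/config.json" would target the location named above.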

How it works

Classification is based on behaviour, not just the User-Agent (which is easy to forge). Each client's requests are reduced to measured features: request volume, status mix, timing regularity, sub-resource co-loading, path coverage, and the like. A set of independent classifiers each vote for a kind, with a confidence and the reasons behind it. The strongest vote wins, or unknown if nothing clears a threshold. Secondary tags such as verified, ignores-robots, datacenter, and has-cache annotate the result.
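The strongest-vote-wins rule can be sketched as follows. The Vote dataclass, the decide function, and the 0.5 threshold are assumptions for illustration; the actual weights and threshold are not given here.

```python
from dataclasses import dataclass

@dataclass
class Vote:
    kind: str          # e.g. "scraper" or "vuln_scanner"
    confidence: float  # how strongly this classifier believes it
    reason: str        # human-readable justification, as inspect would show

THRESHOLD = 0.5  # hypothetical value

def decide(votes, threshold=THRESHOLD):
    """Pick the kind with the highest confidence, or 'unknown' below threshold."""
    best = max(votes, key=lambda v: v.confidence, default=None)
    if best is None or best.confidence < threshold:
        return "unknown"
    return best.kind

votes = [
    Vote("vuln_scanner", 0.8, "requested known-vulnerable paths"),
    Vote("scraper", 0.4, "high path coverage, no sub-resources"),
]
print(decide(votes))  # vuln_scanner
```

Keeping the losing votes around, rather than discarding them, is what would let a tool show the runners-up alongside the winner.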

The confidence weights and the threshold are hand-tuned, so check the classifications against your own logs before trusting the headline numbers. inspect shows why any client landed where it did.

Contributing

Contributions are welcome. See CONTRIBUTING.md for the development setup, conventions, and an outline of how the code fits together.
