Characterize the clients hitting a web site by analyzing its access logs.
agent-census
What's hitting your site, classified by how it behaves -- not just what it claims to be.
Most of the traffic to a typical site isn't people; it's software, and a fair bit of it lies about what it is. agent-census reads your access log and sorts the clients by what they actually do -- whether they pull a page's sub-resources like a browser, walk the site like a crawler, poll a feed on a schedule, or go looking for known-vulnerable paths. Anything claiming to be a known crawler is checked against DNS and published address ranges, so a Googlebot arriving from some random datacentre gets called what it is. What you end up with is your traffic broken down by what each client is for. The User-Agent still counts -- it's just treated as a claim to weigh against behaviour and origin, not a fact to take on trust.
Here's a sample report generated from a real access log.
Install
pipx install agent-census
Use
The simplest case is an Apache log in the default combined format:
agent-census analyze /var/log/apache2/access.log
You can pass several rotated logs at once. They're pooled into one analysis, so a client that spans the rotation is counted once:
agent-census analyze /var/log/httpd/access.log*
For a custom format, pass the LogFormat/CustomLog directive string verbatim
from your Apache config. Tab separators (\t), quoted fields with spaces,
%{...}x SSL variables, and %{...}e environment variables are all handled:
agent-census analyze access.log \
--log-format '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" %D'
The presets common, combined, and vhost_combined are available via
--log-format-preset. Options may appear before, after, or between the log files.
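To make the field layout concrete, here is a minimal sketch of parsing one line of the combined format in Python. It is not agent-census's own parser; the regular expression and the names COMBINED and parse_combined are assumptions made for illustration, and the sample line is invented.

```python
import re
from datetime import datetime

# Illustrative regex for Apache's combined format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>(?:[^"\\]|\\.)*)" '
    r'"(?P<user_agent>(?:[^"\\]|\\.)*)"'
)

def parse_combined(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    # The request line is "METHOD PATH PROTOCOL"; malformed ones yield None.
    parts = rec["request"].split(" ")
    rec["method"], rec["path"] = (parts[0], parts[1]) if len(parts) >= 2 else (None, None)
    return rec

# Invented sample line using a documentation address.
sample = ('203.0.113.66 - - [10/Oct/2024:13:55:36 +0000] '
          '"GET /.env HTTP/1.1" 404 153 "-" "Mozilla/5.0"')
record = parse_combined(sample)
print(record["host"], record["method"], record["path"], record["status"])
```

A custom LogFormat would need a different expression, which is why the tool takes the directive string rather than a fixed pattern.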
Cloudflare Logpush logs (newline-delimited JSON) are also supported, as another preset:
agent-census analyze cloudflare-logs.json --log-format-preset cloudflare
Cloudflare logs carry the client's AS number, so network and ASN-based detection work without any extra configuration.
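For readers unfamiliar with newline-delimited JSON, the sketch below reads one record per line. It is an illustration only, not the tool's reader. The field names (ClientIP, ClientASN, ClientRequestURI, EdgeResponseStatus) are assumed from Cloudflare's HTTP requests dataset; which fields actually appear depends on how the Logpush job is configured, and the sample values are invented.

```python
import json

def read_ndjson(lines):
    """Yield one dict per non-empty line; skip lines that are not JSON objects."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            yield obj

# Invented sample: one valid record, one blank line, one unparseable line.
sample_lines = [
    '{"ClientIP": "203.0.113.66", "ClientASN": 64500, '
    '"ClientRequestURI": "/feed.xml", "EdgeResponseStatus": 304}',
    '',
    'not json',
]
records = list(read_ndjson(sample_lines))
print(records[0]["ClientIP"], records[0]["ClientASN"])
```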
What to log
The Apache combined format already carries everything the core analysis needs.
The common preset drops the User-Agent and the Referer, so prefer combined,
or a custom format that includes them.
Required (all present in combined):
- Client address (%h) -- the identity everything else groups on, and the basis for the network, datacentre, and crawler-verification checks.
- Timestamp (%t) -- timing regularity, peak request rate, the reported time range, and (with --quiescent-hours) freeing memory mid-run.
- Request line ("%r") -- the method and path; the most load-bearing field, behind vulnerability probing, feed detection, path coverage, and crawl shape.
- Status code (%>s) -- the status mix, 404 storms, 304 Not Modified (the has-cache tag), and robots.txt compliance.
- User-Agent ("%{User-Agent}i") -- browser, bot, and declared-crawler recognition.
Recommended. The first two are already in combined; the rest aren't in any
preset, so add them to a custom LogFormat (quoted) -- they're worth it:
- Referer ("%{Referer}i", in combined) -- referer-following, which separates crawlers from scrapers and flags fabricated referers.
- Bytes sent (%b or %B, in combined) -- the bandwidth figures in the report.
- AS organisation and number ("%{MM_ASORG}e" and "%{MM_ASN}e", MaxMind mod_maxminddb) -- name datacentre clients by their hosting organisation, and recognise datacentres and ASN-listed crawlers by AS number. Much of Networks and hosting leans on these; log both (the number drives recognition, the org names it).
- Content-Type ("%{Content-Type}o") -- the response media type, which sharpens feed-reader detection (an RSS/Atom type, not just a feed-shaped URL).
- X-Forwarded-For ("%{X-Forwarded-For}i") -- if you're behind a CDN or proxy, for --identity forwarded.
Response time (%D / %T) and the virtual host are parsed if present but not
currently used by the analysis.
Output is Markdown by default. Pass --html for a self-contained, styled page
(one file, no external assets) you can open in a browser. Both formats work for
analyze and inspect:
agent-census analyze access.log --html -o census.html
The report opens with a summary of each kind, then a cross-tab of where each
kind's traffic came from (see Networks and hosting),
then the notable clients in each kind. Within a kind, clients that differ only
by IP address and origin AS (same User-Agent, same tags) are collapsed into
one row showing their combined traffic; in the HTML report a disclosure expands
to the per-IP/ASN breakdown, and inspect always lists them individually.
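The row collapsing described above amounts to a group-by on everything except IP and AS. The following is a rough sketch under that reading, not the report's actual code; the record keys (kind, user_agent, tags, requests, ip, asn) and the function name collapse_rows are hypothetical.

```python
from collections import defaultdict

def collapse_rows(clients):
    """Merge clients that share kind, User-Agent, and tags into one row."""
    groups = defaultdict(list)
    for c in clients:
        key = (c["kind"], c["user_agent"], frozenset(c["tags"]))
        groups[key].append(c)
    rows = []
    for (kind, ua, tags), members in groups.items():
        rows.append({
            "kind": kind,
            "user_agent": ua,
            "tags": sorted(tags),
            "requests": sum(m["requests"] for m in members),
            # Kept so a per-IP/ASN breakdown can still be shown.
            "members": [(m["ip"], m.get("asn")) for m in members],
        })
    rows.sort(key=lambda r: r["requests"], reverse=True)
    return rows

# Invented clients: two differ only by IP/ASN, the third has different tags.
clients = [
    {"kind": "scraper", "user_agent": "X/1", "tags": ["datacenter"],
     "requests": 40, "ip": "192.0.2.1", "asn": 64500},
    {"kind": "scraper", "user_agent": "X/1", "tags": ["datacenter"],
     "requests": 60, "ip": "192.0.2.2", "asn": 64501},
    {"kind": "scraper", "user_agent": "X/1", "tags": [],
     "requests": 5, "ip": "198.51.100.9", "asn": 64502},
]
rows = collapse_rows(clients)
print(len(rows), rows[0]["requests"])
```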
robots.txt compliance
To check robots.txt compliance, give agent-census the file. A local copy is the
default, since it should match the period the log covers:
agent-census analyze access.log --robots-file ./robots.txt
Naming a host or URL instead fetches it over the network. A live robots.txt may
not match the rules that applied when the log was written, so the report flags it:
agent-census analyze access.log --host example.com
The summary's robots column reads N✓ / M✗ / K?: respected, ignored, or too few
requests to tell (a client that hasn't yet requested a disallowed path isn't
counted either way).
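One ingredient of such a check can be sketched with Python's standard urllib.robotparser: given the robots.txt text and the paths a client requested, list the ones it was not allowed to fetch. This is an illustration, not how agent-census decides between respected, ignored, and unknown; the function name disallowed_hits, the robots.txt content, and the ExampleBot User-Agent are invented.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def disallowed_hits(robots_txt, user_agent, paths):
    """Return the requested paths that robots.txt disallows for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]

requested = ["/index.html", "/private/report.pdf", "/feed.xml"]
hits = disallowed_hits(ROBOTS_TXT, "ExampleBot/1.0", requested)
print(hits)
```

A client with no such hits may be compliant or may simply not have reached a disallowed path yet, which is why the report keeps a separate "too few requests to tell" count.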
Verifying declared crawlers
A User-Agent claiming Googlebot proves nothing on its own. Verification checks the client's IP against the crawler's published address ranges and its reverse/forward DNS. It runs by default and makes network calls (DNS lookups, and the occasional ranges fetch); turn it off for an offline, faster run:
agent-census analyze access.log --no-verify-bots
A verified crawler's IPs collapse into one entry keyed by its domain. A client
whose IP is outside the published ranges, or whose reverse DNS doesn't check out,
is classed impersonator, which means a forged identity that verification has
disproved. Misbehaviour is separate: a "Googlebot" that probes for /.env keeps
its declared kind and gets a probing tag (and ignores-robots if it earns one),
because a real crawler can still behave badly. With verification off there's
nothing to disprove the claim, so it stays a declared crawler with those tags.
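The reverse/forward DNS part of verification is commonly done as a forward-confirmed reverse lookup: resolve the IP to a hostname, check the hostname belongs to the crawler's domain, then resolve the hostname back and confirm it includes the original IP. A minimal sketch follows, assuming that technique; it is not the tool's implementation. The function name, the crawler.example domain, and the injected fake resolvers are hypothetical (the fakes keep the example offline).

```python
import socket

def forward_confirmed_rdns(ip, allowed_suffixes, reverse=None, forward=None):
    """True if ip reverse-resolves into an allowed domain and resolves back."""
    # Default resolvers use the system DNS; both can be replaced for testing.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (
        lambda host: {info[4][0] for info in socket.getaddrinfo(host, None)}
    )
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        # String comparison; IPv6 addresses may need normalising first.
        return ip in forward(hostname)
    except OSError:
        return False

# Fake resolvers standing in for DNS (hypothetical names and addresses).
fake_reverse = lambda addr: "bot-1.crawler.example"
fake_forward = lambda host: {"192.0.2.10"}

genuine = forward_confirmed_rdns("192.0.2.10", ["crawler.example"], fake_reverse, fake_forward)
forged = forward_confirmed_rdns("198.51.100.7", ["crawler.example"], fake_reverse, fake_forward)
print(genuine, forged)
```

The suffix test requires a leading dot, so a host such as notcrawler.example does not pass as crawler.example.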
Networks and hosting
Where a client comes from matters. A "browser" arriving from a datacentre rather than a consumer ISP is usually automation. agent-census recognises the major cloud and hosting providers (AWS, Google Cloud, Cloudflare, Hetzner) from their published IP ranges, folds shared-egress traffic (iCloud Private Relay, Tor) into one entry per network, and breaks the kinds down by origin network in a cross-tab. In the HTML report that table is interactive: switch between raw counts, share of each kind, and share of each network, with the busier cells shaded.
Range lists are fetched and cached weekly by default. --no-fetch-ranges stays
offline on the bundled data.
If your log carries the client's autonomous-system details (for example from
MaxMind's mod_maxminddb: %{MM_ASORG}e for the organisation and %{MM_ASN}e
for the number, quoted in your LogFormat), datacentre clients are named by their
hosting organisation. You can also list extra AS numbers to treat as datacentres
in the bundled datacenter_ranges.toml.
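Matching a client against published ranges is a containment test over CIDR blocks. The sketch below shows the idea with Python's ipaddress module; the network names and prefixes are hypothetical (documentation address space), and it does not reflect the format of datacenter_ranges.toml or the tool's lookup code.

```python
import ipaddress

# Hypothetical providers mapped to documentation prefixes.
RANGES = {
    "example-cloud": ["192.0.2.0/24", "2001:db8::/32"],
    "example-hosting": ["198.51.100.0/24"],
}
NETWORKS = [
    (ipaddress.ip_network(cidr), name)
    for name, cidrs in RANGES.items()
    for cidr in cidrs
]

def origin_network(ip):
    """Return the name of the most specific matching range, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net, name in NETWORKS:
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, name)
    return best[1] if best else None

print(origin_network("192.0.2.55"), origin_network("203.0.113.9"))
```

A linear scan is fine for a handful of ranges; real provider lists run to thousands of prefixes, where a sorted or trie-based lookup would be the usual choice.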
Inspecting a client
To see why a client was classified the way it was, use inspect. It shows every
signal that fired (including the runners-up), the measured features, the
robots.txt finding, and the request trace:
agent-census inspect access.log --kind vuln_scanner
agent-census inspect access.log --client 203.0.113.66
agent-census inspect access.log --kind scraper --network aws
--network matches a substring of the origin-network name and composes with
--kind, so the two together select a single cell of the cross-tab.
Identity
How requests are grouped into clients is configurable, since no single rule fits
every deployment. The default, ip_ua, groups by (IP, User-Agent). Behind a CDN,
use forwarded (the left-most X-Forwarded-For); for IP-rotating bots in one
range, ip_ua_subnet. The report notes how the chosen strategy fragmented or
merged the data, so you can judge whether it fit.
agent-census analyze access.log --identity forwarded
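As a rough illustration of what the three strategies mean, the function below derives a grouping key from a parsed request. It is a sketch, not the tool's code: the record keys are hypothetical, keeping the User-Agent in the forwarded key is an assumption, and so are the subnet sizes (/24 for IPv4, /64 for IPv6).

```python
import ipaddress

def identity_key(record, strategy="ip_ua"):
    """Return the key under which a request is grouped into a client."""
    ip = record["ip"]
    ua = record.get("user_agent", "")
    if strategy == "ip_ua":
        return (ip, ua)
    if strategy == "forwarded":
        # Left-most X-Forwarded-For entry; fall back to the peer address.
        first = record.get("x_forwarded_for", "").split(",")[0].strip()
        return (first or ip, ua)
    if strategy == "ip_ua_subnet":
        addr = ipaddress.ip_address(ip)
        prefix = 24 if addr.version == 4 else 64  # assumed subnet sizes
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        return (str(net), ua)
    raise ValueError(f"unknown strategy: {strategy}")

req = {"ip": "192.0.2.7", "user_agent": "A", "x_forwarded_for": "203.0.113.5, 10.0.0.1"}
print(identity_key(req), identity_key(req, "forwarded"), identity_key(req, "ip_ua_subnet"))
```

Note that X-Forwarded-For is client-supplied unless a trusted proxy overwrites it, so the left-most entry is only as reliable as the proxy chain in front of the server.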
Scoping to one site
If one server's log mixes several virtual hosts, --vhost SUBSTRING analyses
only the lines served for a matching host (matched against the logged %v, or
the Host header if you don't log %v). The filtered lines are reported as
excluded, separately from parse skips. --vhost is repeatable: a line is kept
if it matches any of the given hosts.
agent-census analyze access.log --log-format-preset vhost_combined \
--vhost mnot.net --vhost www.mnot.net
This also sidesteps a CDN artefact: if a slice of your traffic was proxied to this origin under another hostname, those requests arrive from the CDN's IPs (so they can't be attributed or crawler-verified). Scoping to your own host drops that slice cleanly.
Remembered settings
Some options are sticky, so you needn't retype them. --log-format /
--log-format-preset, --identity, and --robots-file / --robots-url are
saved to ~/.config/agent-census/config.json and reused when a later run omits
them. Passing one updates the saved value.
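The sticky behaviour can be pictured as a merge between saved values and whatever the current command line supplies. The sketch below is an illustration under that assumption; the key names, the function resolve_settings, and the file layout are hypothetical and not taken from agent-census, and it writes to a temporary directory rather than ~/.config.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical names for the options that are remembered.
STICKY_KEYS = ("log_format", "identity", "robots_file")

def resolve_settings(cli_args, config_path):
    """Fill omitted options from the saved file; save any that were passed."""
    path = Path(config_path)
    saved = json.loads(path.read_text()) if path.exists() else {}
    merged = dict(saved)
    for key in STICKY_KEYS:
        if cli_args.get(key) is not None:
            merged[key] = cli_args[key]  # an explicit option updates the saved value
    if merged != saved:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(merged, indent=2))
    return merged

with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.json"
    first = resolve_settings({"identity": "forwarded"}, cfg)
    # A later run omits --identity, so the saved value is reused.
    second = resolve_settings({"identity": None, "log_format": "combined"}, cfg)
print(second)
```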
How it works
Classification is based on behaviour, not just the User-Agent (which is easy to
forge). Each client's requests are reduced to measured features: request volume,
status mix, timing regularity, sub-resource co-loading, path coverage, and the
like. A set of independent classifiers each vote for a kind, with a confidence and
the reasons behind it. The strongest vote wins, or unknown if nothing clears a
threshold. Secondary tags such as verified, ignores-robots, datacenter, and
has-cache annotate the result.
The confidence weights and the threshold are hand-tuned, so check the
classifications against your own logs before trusting the headline numbers.
inspect shows why any client landed where it did.
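The voting scheme can be sketched in a few lines. This is a toy illustration, not the project's classifiers: the feature names (probe_hits, feed_share, timing_cv, subresource_ratio), the confidence values, the 0.5 threshold, and the feed_reader and browser kind names are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Vote:
    kind: str
    confidence: float
    reason: str

# Each classifier looks at the measured features and may cast one vote.
def vuln_scanner(f):
    if f.get("probe_hits", 0) >= 3:
        return Vote("vuln_scanner", 0.9, "requested known-vulnerable paths")
    return None

def feed_reader(f):
    if f.get("feed_share", 0.0) >= 0.8 and f.get("timing_cv", 1.0) <= 0.2:
        return Vote("feed_reader", 0.7, "polls a feed at regular intervals")
    return None

def browser(f):
    if f.get("subresource_ratio", 0.0) >= 0.5:
        return Vote("browser", 0.6, "loads sub-resources alongside pages")
    return None

CLASSIFIERS = [vuln_scanner, feed_reader, browser]

def classify(features, classifiers, threshold=0.5):
    """Return (winning kind, all votes sorted strongest first)."""
    votes = [v for c in classifiers if (v := c(features)) is not None]
    votes.sort(key=lambda v: v.confidence, reverse=True)
    if not votes or votes[0].confidence < threshold:
        return "unknown", votes
    return votes[0].kind, votes

kind, votes = classify({"probe_hits": 5, "subresource_ratio": 0.6}, CLASSIFIERS)
print(kind, [(v.kind, v.confidence) for v in votes])
```

Keeping the losing votes around is what lets a tool like inspect show the runners-up and their reasons.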
Contributing
Contributions are welcome. See CONTRIBUTING.md for the development setup, conventions, and an outline of how the code fits together.