Characterize the clients hitting a web site by analyzing its access logs.
agent-census
What's hitting your site, classified by how it behaves -- not just what it claims to be.
Most of the traffic to a typical site isn't people; it's software, and a fair bit of it lies about what it is. agent-census reads your access log and sorts the clients by what they actually do -- whether they pull a page's sub-resources like a browser, walk the site like a crawler, poll a feed on a schedule, or go looking for known-vulnerable paths. Anything claiming to be a known crawler is checked against DNS and published address ranges, so a Googlebot arriving from some random datacentre gets called what it is. What you end up with is your traffic broken down by what each client is for. The User-Agent still counts -- it's just treated as a claim to weigh against behaviour and origin, not a fact to take on trust.
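The DNS half of that check is commonly done as a forward-confirmed reverse DNS lookup. The following is a minimal sketch of the general technique, not agent-census's own code: the function names are hypothetical, and the hostname suffix in the usage note is only an illustration. The resolvers are passed in as parameters so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, Set


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for ip, or '' if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return ""


def forward_lookup(host: str) -> Set[str]:
    """Return every address that host resolves to."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(host, None)}
    except OSError:
        return set()


def verify_claimed_crawler(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Set[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS (hypothetical helper).

    1. The PTR name must fall under one of the operator's domains.
    2. That name must resolve back to the original address.
    """
    host = reverse(ip).rstrip(".").lower()
    if not host:
        return False
    in_domain = any(
        host == suffix or host.endswith("." + suffix)
        for suffix in allowed_suffixes
    )
    if not in_domain:
        return False
    return ip in forward(host)
```

A call such as `verify_claimed_crawler(client_ip, ["googlebot.com"])` would then return `False` for a client whose address has no matching PTR record, whatever its User-Agent says. Matching on `"." + suffix` rather than a bare `endswith` keeps look-alike names such as `evilgooglebot.com` from passing.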
Here's a sample report generated from a real access log.
Install
pipx is recommended:
pipx install agent-census
Use: Analysis
The simplest case is analyzing one or more Apache logs in the default combined format:
agent-census analyze access.log* > census.html
The presets common, combined, and vhost_combined are available via
--log-format-preset.
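For readers unfamiliar with what these presets contain, here is a rough sketch of parsing one line of the Apache combined format by hand. It is an illustration only, not the parser agent-census uses; the regular expression and the `parse_line` name are assumptions. The trailing Referer and User-Agent group is optional, so a line in the common format also matches.

```python
import re
from datetime import datetime

# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referer>(?:[^"\\]|\\.)*)" "(?P<user_agent>(?:[^"\\]|\\.)*)")?'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Apache writes %t as e.g. 10/Oct/2024:13:55:36 +0000
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # %b logs "-" when no body bytes were sent
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record
```

With a common-format line, `referer` and `user_agent` come back as `None`, which is why the combined format is the more useful starting point for classification.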
For a custom log format, pass the LogFormat/CustomLog directive string verbatim
from your Apache config. Tab separators (\t), quoted fields with spaces,
%{...}x SSL variables, and %{...}e environment variables are all handled:
agent-census analyze access.log \
--log-format '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" %D'
See "What to log" below for the most important information to gather.
Cloudflare Logpush logs (newline-delimited JSON) are also supported, as another preset:
agent-census analyze cloudflare-logs.json --log-format-preset cloudflare
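Newline-delimited JSON is simply one JSON object per line, so it can be inspected with a few lines of code before handing it to the tool. The sketch below is an assumption-laden illustration: `read_ndjson` is a hypothetical helper, and the field names in the usage note depend on how your Logpush job is configured.

```python
import json


def read_ndjson(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: invalid JSON") from exc
```

For example, `[r.get("ClientIP") for r in read_ndjson(open("cloudflare-logs.json"))]` would list the client addresses, assuming the job exports a `ClientIP` field.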
Options
Use agent-census analyze -h for the full list of analysis options.
robots.txt compliance: Use --robots-file to supply a local file, hostname, or URL:
agent-census analyze access.log --robots-file ./robots.txt
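To make the idea of a compliance check concrete, here is a small sketch using Python's standard `urllib.robotparser`. It is not taken from agent-census, and the helper names are hypothetical; it only shows how a client's requested paths can be compared against a robots.txt file.

```python
from urllib.robotparser import RobotFileParser


def load_robots(text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def disallowed_fetches(parser, user_agent, paths):
    """Return the requested paths that robots.txt forbids for this agent."""
    return [path for path in paths if not parser.can_fetch(user_agent, path)]
```

A client whose list comes back non-empty fetched something it was asked not to, which is one reasonable definition of non-compliance.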
Output format: Output is a self-contained HTML page by default; write it to a file with -o, or pass
--md for Markdown:
agent-census analyze access.log -o census.html
agent-census analyze access.log --md
Host header filtering: --vhost SUBSTRING analyzes only the lines served for a matching host:
agent-census analyze access.log --log-format-preset vhost_combined \
--vhost mnot.net --vhost www.mnot.net
Client identity: Use --identity to change how requests are associated with clients. The
default, ip_ua, groups by (IP, User-Agent). Behind a CDN, use forwarded (the left-most
X-Forwarded-For); for IP-rotating bots in one range, ip_ua_subnet.
agent-census analyze access.log --identity forwarded
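The three modes amount to different grouping keys. The sketch below shows one way such keys could be built; it is an illustration under stated assumptions, not the tool's implementation. In particular, the record field names, the /24 and /48 prefix lengths, and pairing the forwarded address with the User-Agent are all guesses made for the example.

```python
import ipaddress


def subnet_of(ip, v4_prefix=24, v6_prefix=48):
    """Return the enclosing network of ip (prefix lengths are assumptions)."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def client_key(record, identity="ip_ua"):
    """Build the key used to group requests into clients (hypothetical)."""
    ip = record["ip"]
    ua = record.get("user_agent", "")
    if identity == "ip_ua":
        return (ip, ua)
    if identity == "forwarded":
        # Left-most X-Forwarded-For entry; fall back to the peer address.
        first = record.get("x_forwarded_for", "").split(",")[0].strip()
        return (first or ip, ua)
    if identity == "ip_ua_subnet":
        return (subnet_of(ip), ua)
    raise ValueError(f"unknown identity mode: {identity}")
```

Keep in mind that X-Forwarded-For is client-supplied unless your CDN or proxy overwrites it, so the forwarded mode is only as trustworthy as the hop that sets the header.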
AS lookups: If your logs don't record the AS number, point --mm-asn-db at a MaxMind ASN
database to recover it from each client's IP.
The database is consulted first (it can be fresher than the log) and is remembered between runs:
agent-census analyze access.log --mm-asn-db ./GeoLite2-ASN.mmdb
Remembered settings
Some options are sticky, so you needn't retype them. --log-format /
--log-format-preset, --identity, and --robots-file / --robots-url are
saved to ~/.config/agent-census/config.json and reused when a later run omits
them. Passing one updates the saved value.
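The behaviour described above is a simple merge: saved values fill in for omitted options, and anything passed explicitly overwrites what was saved. A minimal sketch of that pattern follows. The key names and the `resolve_settings` function are hypothetical and say nothing about the actual layout of config.json.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical key names, chosen for the example only.
STICKY_KEYS = ("log_format", "identity", "robots_file")


def resolve_settings(cli_args, config_path):
    """Merge explicitly passed options over saved ones, then persist."""
    saved = {}
    if config_path.exists():
        saved = json.loads(config_path.read_text())
    merged = dict(saved)
    for key in STICKY_KEYS:
        if cli_args.get(key) is not None:
            merged[key] = cli_args[key]  # passing a value updates it
    if merged != saved:
        config_path.parent.mkdir(parents=True, exist_ok=True)
        config_path.write_text(json.dumps(merged, indent=2))
    return merged


# Usage: the second run omits --identity and inherits the saved value.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "agent-census" / "config.json"
    first = resolve_settings({"identity": "forwarded"}, path)
    second = resolve_settings({}, path)
    third = resolve_settings({"identity": "ip_ua"}, path)
```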
Use: Inspecting a client
To see why a client was classified the way it was, use inspect. It shows every
signal that fired (including the runners-up), the measured features, the
robots.txt finding, and the request trace:
agent-census inspect access.log --kind vuln_scanner
agent-census inspect access.log --client 203.0.113.66
agent-census inspect access.log --kind scraper --network aws
--network matches a substring of the origin-network name and composes with
--kind, so the two together select a single cell of the cross-tab.
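As a mental model for how the two filters compose, think of the report as a table of counts keyed by (kind, network). The sketch below illustrates that idea with made-up data. The dictionary fields, the network names, and the case-insensitive substring match are assumptions for the example, not a description of the tool's internals.

```python
from collections import Counter


def cross_tab(clients):
    """Count clients in each (kind, network) cell."""
    return Counter((c["kind"], c["network"]) for c in clients)


def select(clients, kind=None, network=None):
    """Filter by exact kind and by substring of the network name."""
    return [
        c for c in clients
        if (kind is None or c["kind"] == kind)
        and (network is None or network.lower() in c["network"].lower())
    ]


# Made-up classified clients, for illustration only.
clients = [
    {"id": "a", "kind": "scraper", "network": "Example AWS"},
    {"id": "b", "kind": "scraper", "network": "Example Hosting"},
    {"id": "c", "kind": "vuln_scanner", "network": "Example AWS"},
]
```

Here `select(clients, kind="scraper", network="aws")` returns only client `a`: each filter narrows one axis of the table, and together they pick out one cell.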
Most analyze options apply; see agent-census inspect -h for a full list of options.
What to log
The Apache combined format already carries everything the core analysis needs.
The common preset drops the User-Agent and the Referer, so prefer combined,
or a custom format that includes them.
Required (all present in combined):
- Client address (%h) -- the identity everything else groups on, and the basis for the network, datacentre, and crawler-verification checks.
- Timestamp (%t) -- timing regularity, peak request rate, the reported time range, and (with --quiescent-hours) freeing memory mid-run.
- Request line ("%r") -- the method and path; the most load-bearing field, behind vulnerability probing, feed detection, path coverage, and crawl shape.
- Status code (%>s) -- the status mix, 404 storms, 304 Not Modified (the has-cache tag), and robots.txt compliance.
- User-Agent ("%{User-Agent}i") -- browser, bot, and declared-crawler recognition.
Strongly recommended. The first two are already in combined; the rest aren't in any
preset, so add them to a custom LogFormat (quoted) -- they're worth it:
- Referer ("%{Referer}i", in combined) -- referer-following, which separates crawlers from scrapers and flags fabricated referers.
- Bytes sent (%b or %B, in combined) -- the bandwidth figures in the report.
- AS organisation and number ("%{MM_ASORG}e" and "%{MM_ASN}e", MaxMind mod_maxminddb) -- name datacentre clients by their hosting organisation, and recognise datacentres and ASN-listed crawlers by AS number. Much of Networks and hosting leans on these; log both (the number drives recognition, the org names it). Can't log them? --mm-asn-db recovers the AS from a MaxMind database instead (see Options).
- Content-Type ("%{Content-Type}o") -- the response media type, which sharpens feed-reader detection (an RSS/Atom type, not just a feed-shaped URL).
- X-Forwarded-For ("%{X-Forwarded-For}i") -- if you're behind a CDN or proxy, for --identity forwarded.