Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders.
crawler-user-agents
This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, in a single JSON file.
- NPM package: https://www.npmjs.com/package/crawler-user-agents
- Go package: https://pkg.go.dev/github.com/monperrus/crawler-user-agents
- PyPi package: https://pypi.org/project/crawler-user-agents/
Each pattern is a regular expression. It should work out-of-the-box with your favorite regex library.
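For instance, here is a minimal sketch of matching the patterns with Python's built-in `re` module. The two entries below are illustrative samples in the same shape as `crawler-user-agents.json` (the real file contains many hundreds of patterns):

```python
import json
import re

# Two illustrative entries in the same shape as crawler-user-agents.json.
crawlers = json.loads("""
[
  {"pattern": "Googlebot", "url": "http://www.google.com/bot.html"},
  {"pattern": "bingbot", "url": "http://www.bing.com/bingbot.htm"}
]
""")

def is_crawler(user_agent):
    # re.search is used because the pattern may occur anywhere in the UA string.
    return any(re.search(entry["pattern"], user_agent) for entry in crawlers)

print(is_crawler("Mozilla/5.0 (compatible; bingbot/2.0)"))           # True
print(is_crawler("Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"))   # False
```

In practice you would load the full JSON file (or use one of the packages below) instead of the inline sample.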
Sponsor
💼 Using crawler-user-agents in a commercial product? This package is free to use, but it takes real time to maintain and expand. If it's providing value (and it probably is), please consider sponsoring at the commercial tier.
It keeps the project alive and actively maintained. Your company can afford it. 🙏
Install
Direct download
Download the crawler-user-agents.json file from this repository directly.
Javascript
crawler-user-agents is deployed on npmjs.com: https://www.npmjs.com/package/crawler-user-agents
To install it with npm or yarn:

```shell
npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents
```
In Node.js, you can require the package to get an array of crawler user agents.
```javascript
const crawlers = require('crawler-user-agents');
console.log(crawlers);
```
Python
Install with `pip install crawler-user-agents`.
Then:
```python
import crawleruseragents

if crawleruseragents.is_crawler("Googlebot/"):
    # do something
    ...
```
or:
```python
import crawleruseragents

indices = crawleruseragents.matching_crawlers("bingbot/2.0")
print("crawlers' indices:", indices)
print(
    "crawler's URL:",
    crawleruseragents.CRAWLER_USER_AGENTS_DATA[indices[0]]["url"],
)
```
Note that matching_crawlers is much slower than is_crawler when the given User-Agent actually matches one or more crawlers, because it must try every pattern instead of stopping at the first match.
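The performance difference can be illustrated with a hand-rolled sketch (not the package's actual implementation): a boolean check can short-circuit at the first matching pattern, while collecting all matching indices requires scanning the whole list.

```python
import re

# Illustrative patterns only; the real list in crawler-user-agents.json is much longer.
patterns = [re.compile(p) for p in ["Googlebot", "bingbot", "bot"]]

def is_crawler(user_agent):
    # Short-circuits at the first match.
    return any(p.search(user_agent) for p in patterns)

def matching_crawlers(user_agent):
    # Must scan the whole list to return every matching index.
    return [i for i, p in enumerate(patterns) if p.search(user_agent)]

ua = "Mozilla/5.0 (compatible; bingbot/2.0)"
print(is_crawler(ua))          # True ("bingbot" matches, scan stops there)
print(matching_crawlers(ua))   # [1, 2] ("bingbot" and "bot" both match)
```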
Go
Go: use this package. It provides the global variable Crawlers (kept in sync with crawler-user-agents.json) and the functions IsCrawler and MatchingCrawlers.
Example Go program:

```go
package main

import (
	"fmt"

	agents "github.com/monperrus/crawler-user-agents"
)

func main() {
	userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

	isCrawler := agents.IsCrawler(userAgent)
	fmt.Println("isCrawler:", isCrawler)

	indices := agents.MatchingCrawlers(userAgent)
	fmt.Println("crawlers' indices:", indices)
	fmt.Println("crawler's URL:", agents.Crawlers[indices[0]].URL)
}
```
Output:

```
isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com
```
Contributing
I do welcome additions contributed as pull requests.
The pull requests should:
- contain a single addition
- specify a discriminating syntactic fragment (for example "totobot", not the full "Mozilla/5 totobot v20131212.alpha1")
- contain the pattern (a generic regular expression), the discovery date (year/month/day), and the official URL of the robot
- result in a valid JSON file (don't forget the comma between items)
Example:
```json
{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances": ["rogerbot/2.3 example UA"],
  "tags": ["seo"]
}
```
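Before opening a pull request, a new entry can be sanity-checked with a short Python sketch (an illustration, not part of the project's tooling; field names follow the example above):

```python
import json
import re
from datetime import datetime

entry_text = """
{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances": ["rogerbot/2.3 example UA"],
  "tags": ["seo"]
}
"""

entry = json.loads(entry_text)                          # must be valid JSON
re.compile(entry["pattern"])                            # pattern must be a valid regex
datetime.strptime(entry["addition_date"], "%Y/%m/%d")   # date in year/month/day form

# Every listed instance should actually match the pattern.
for instance in entry["instances"]:
    assert re.search(entry["pattern"], instance)

print("entry looks valid")
```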
License
The list is under an MIT License. Versions prior to Nov 7, 2016 were under a CC-SA license.
Related work
There are a few wrapper libraries that use this data to detect bots:
- Voight-Kampff (Ruby)
- isbot (Ruby)
- crawlers (Clojure)
- isBot (Node.JS)
Other systems for spotting robots, crawlers, and spiders that you may want to consider are:
- Crawler-Detect (PHP)
- BrowserDetector (PHP)
- browscap (JSON files)
File details

Details for the file crawler_user_agents-1.43.0-py3-none-any.whl:

- Size: 58.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | ae8a1e6fb0b4041015cfc719ceb2da1e891f76585837559ac66c6d5ec68b1ffa |
| MD5 | afa4fb32f16a40e077d6c7499d4b8318 |
| BLAKE2b-256 | 085ad5892519cac8bb6c8761184b241b564f8c67451c1eaac294f671a7f9d890 |
Provenance

The following attestation bundle was made for crawler_user_agents-1.43.0-py3-none-any.whl:

- Publisher: pypi-publish.yml on monperrus/crawler-user-agents
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: crawler_user_agents-1.43.0-py3-none-any.whl
- Subject digest: ae8a1e6fb0b4041015cfc719ceb2da1e891f76585837559ac66c6d5ec68b1ffa
- Sigstore transparency entry: 1338602770
- Permalink: monperrus/crawler-user-agents@d832f8d29cd0345a32858254ea06e6a90ea548ae
- Branch / Tag: refs/heads/master
- Owner: https://github.com/monperrus
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi-publish.yml@d832f8d29cd0345a32858254ea06e6a90ea548ae
- Trigger Event: workflow_run