# crawler-user-agents
This repository contains a list of HTTP user agents used by robots, crawlers, and spiders, as a single JSON file.
- NPM package: https://www.npmjs.com/package/crawler-user-agents
- Go package: https://pkg.go.dev/github.com/monperrus/crawler-user-agents
- PyPi package: https://pypi.org/project/crawler-user-agents/
## Install

### Direct download

Download the `crawler-user-agents.json` file from this repository directly.
### Npm / Yarn

crawler-user-agents is deployed on npmjs.com: https://www.npmjs.com/package/crawler-user-agents

To install it with npm or yarn:

```sh
npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents
```
In Node.js, you can `require` the package to get an array of crawler user agents:

```js
const crawlers = require('crawler-user-agents');
console.log(crawlers);
```
## Usage

Each `pattern` is a regular expression. It should work out-of-the-box with your favorite regex library:

- JavaScript: `if (RegExp(entry.pattern).test(req.headers['user-agent'])) { ... }`
- PHP (add a slash before and after the pattern): `if (preg_match('/'.$entry['pattern'].'/', $_SERVER['HTTP_USER_AGENT'])): ...`
- Python: `if re.search(entry['pattern'], ua): ...`
- Go: use this package (https://pkg.go.dev/github.com/monperrus/crawler-user-agents); it provides the global variable `Crawlers` (synchronized with `crawler-user-agents.json`) and the functions `IsCrawler` and `MatchingCrawlers`.
Example of a Go program:

```go
package main

import (
	"fmt"

	"github.com/monperrus/crawler-user-agents"
)

func main() {
	userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

	isCrawler := agents.IsCrawler(userAgent)
	fmt.Println("isCrawler:", isCrawler)

	indices := agents.MatchingCrawlers(userAgent)
	fmt.Println("crawlers' indices:", indices)
	fmt.Println("crawler's URL:", agents.Crawlers[indices[0]].URL)
}
```

Output:

```
isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com
```
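The same kind of lookup can be sketched in Python directly against the JSON data. A minimal, illustrative example follows; the two hardcoded entries are a stand-in for the full `crawler-user-agents.json` file (which contains several hundred such records), and the helper names `is_crawler` and `matching_crawlers` are hypothetical, chosen here to mirror the Go API:

```python
import re

# Two illustrative entries in the same shape as crawler-user-agents.json.
# In practice you would load the real file with json.load().
crawlers = [
    {"pattern": "Googlebot\\/", "url": "http://www.google.com/bot.html"},
    {"pattern": "Discordbot", "url": "https://discordapp.com"},
]

# Compile all patterns once into a single alternation for fast yes/no checks.
combined = re.compile("|".join(f"(?:{c['pattern']})" for c in crawlers))

def is_crawler(user_agent: str) -> bool:
    """True if any pattern matches the user agent."""
    return combined.search(user_agent) is not None

def matching_crawlers(user_agent: str) -> list[int]:
    """Indices of all entries whose pattern matches the user agent."""
    return [i for i, c in enumerate(crawlers)
            if re.search(c["pattern"], user_agent)]

ua = "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"
print(is_crawler(ua))         # True
print(matching_crawlers(ua))  # [1]
```

Precompiling the combined alternation is a common trade-off: one pass answers "is this a bot at all?", while the per-entry scan recovers which entries matched.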
## Contributing

I do welcome additions contributed as pull requests.

A pull request should:

- contain a single addition
- specify a discriminating, relevant syntactic fragment (for example "totobot", not "Mozilla/5 totobot v20131212.alpha1")
- contain the pattern (a generic regular expression), the discovery date (year/month/day), and the official URL of the robot
- result in a valid JSON file (don't forget the comma between items)
Example:

```json
{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances": ["rogerbot/2.3 example UA"]
}
```
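Before opening a pull request, the last two rules can be checked mechanically: parse the file as JSON and compile every pattern. A small Python sketch (the `validate` helper and the two-entry sample document are illustrative, not part of this repository):

```python
import json
import re

def validate(raw: str) -> int:
    """Parse a crawler list and compile each pattern; returns the entry count.

    json.loads raises JSONDecodeError on a missing comma or other syntax
    error, and re.compile raises re.error on an invalid pattern.
    """
    entries = json.loads(raw)
    for entry in entries:
        re.compile(entry["pattern"])
    return len(entries)

# An illustrative two-entry document in the same shape as the real file:
sample = """[
  {"pattern": "rogerbot", "addition_date": "2014/02/28",
   "url": "http://moz.com/help/pro/what-is-rogerbot-",
   "instances": ["rogerbot/2.3 example UA"]},
  {"pattern": "Discordbot", "addition_date": "2017/01/01",
   "url": "https://discordapp.com", "instances": []}
]"""

print(validate(sample))  # 2
```

To check the real file, read `crawler-user-agents.json` from disk and pass its contents to `validate`.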
## License

The list is under an MIT License. The versions prior to Nov 7, 2016 were under a CC-SA license.
## Related work
There are a few wrapper libraries that use this data to detect bots:
- Voight-Kampff (Ruby)
- isbot (Ruby)
- crawlers (Clojure)
- isBot (Node.JS)
Other systems for spotting robots, crawlers, and spiders that you may want to consider are:
- Crawler-Detect (PHP)
- BrowserDetector (PHP)
- browscap (JSON files)
Source Distribution
Hashes for `crawler_user_agents-0.4.0.tar.gz`:

| Algorithm | Hash digest |
|---|---|
| SHA256 | d1fb41c9a17e5db7f49b18ecfb1e0db68d072ef30d1ed8aef948dd3e596eaf7e |
| MD5 | bcc88f02463dfc89ba9bf0109a086226 |
| BLAKE2b-256 | e0521938b092b959adb67896ede40c0acbb7c53c6bca5787e303f9d290dfeec3 |