# crawler-user-agents
This repository contains a list of HTTP user agents used by robots, crawlers, and spiders, as a single JSON file.
- NPM package: https://www.npmjs.com/package/crawler-user-agents
- Go package: https://pkg.go.dev/github.com/monperrus/crawler-user-agents
- PyPi package: https://pypi.org/project/crawler-user-agents/
## Install

### Direct download

Download the `crawler-user-agents.json` file from this repository directly.
### Npm / Yarn

crawler-user-agents is deployed on npmjs.com: https://www.npmjs.com/package/crawler-user-agents

To install it with npm or yarn:

```sh
npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents
```
In Node.js, you can `require` the package to get an array of crawler user agents:

```js
const crawlers = require('crawler-user-agents');
console.log(crawlers);
```
## Usage

Each `pattern` is a regular expression. It should work out-of-the-box with your favorite regex library:

- JavaScript: `if (RegExp(entry.pattern).test(req.headers['user-agent'])) { ... }`
- PHP (add a slash before and after the pattern): `if (preg_match('/'.$entry['pattern'].'/', $_SERVER['HTTP_USER_AGENT'])): ...`
- Python: `if re.search(entry['pattern'], ua): ...`
- Go: use this package (https://pkg.go.dev/github.com/monperrus/crawler-user-agents); it provides the global variable `Crawlers` (synchronized with `crawler-user-agents.json`) and the functions `IsCrawler` and `MatchingCrawlers`.
Example of a Go program:

```go
package main

import (
	"fmt"

	"github.com/monperrus/crawler-user-agents"
)

func main() {
	userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

	isCrawler := agents.IsCrawler(userAgent)
	fmt.Println("isCrawler:", isCrawler)

	indices := agents.MatchingCrawlers(userAgent)
	fmt.Println("crawlers' indices:", indices)
	fmt.Println("crawler's URL:", agents.Crawlers[indices[0]].URL)
}
```

Output:

```
isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com
```
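The same kind of lookup can be sketched in Python directly against the JSON data. A minimal, illustrative example follows; the two hardcoded entries are a stand-in for the full `crawler-user-agents.json` file (which contains several hundred such records), and the helper names `is_crawler` and `matching_crawlers` are hypothetical, chosen here to mirror the Go API:

```python
import re

# Two illustrative entries in the same shape as crawler-user-agents.json.
# In practice you would load the real file with json.load().
crawlers = [
    {"pattern": "Googlebot\\/", "url": "http://www.google.com/bot.html"},
    {"pattern": "Discordbot", "url": "https://discordapp.com"},
]

# Compile all patterns once into a single alternation for fast yes/no checks.
combined = re.compile("|".join(f"(?:{c['pattern']})" for c in crawlers))

def is_crawler(user_agent: str) -> bool:
    """True if any pattern matches the user agent."""
    return combined.search(user_agent) is not None

def matching_crawlers(user_agent: str) -> list[int]:
    """Indices of all entries whose pattern matches the user agent."""
    return [i for i, c in enumerate(crawlers)
            if re.search(c["pattern"], user_agent)]

ua = "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"
print(is_crawler(ua))         # True
print(matching_crawlers(ua))  # [1]
```

Precompiling the combined alternation is a common trade-off: one pass answers "is this a bot at all?", while the per-entry scan recovers which entries matched.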
## Contributing

I do welcome additions contributed as pull requests.

A pull request should:

- contain a single addition
- specify a discriminating, relevant syntactic fragment (for example "totobot", not "Mozilla/5 totobot v20131212.alpha1")
- contain the pattern (a generic regular expression), the discovery date (year/month/day), and the official URL of the robot
- result in a valid JSON file (don't forget the comma between items)
Example:

```json
{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances": ["rogerbot/2.3 example UA"]
}
```
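Before opening a pull request, the last two rules can be checked mechanically: parse the file as JSON and compile every pattern. A small Python sketch (the `validate` helper and the two-entry sample document are illustrative, not part of this repository):

```python
import json
import re

def validate(raw: str) -> int:
    """Parse a crawler list and compile each pattern; returns the entry count.

    json.loads raises JSONDecodeError on a missing comma or other syntax
    error, and re.compile raises re.error on an invalid pattern.
    """
    entries = json.loads(raw)
    for entry in entries:
        re.compile(entry["pattern"])
    return len(entries)

# An illustrative two-entry document in the same shape as the real file:
sample = """[
  {"pattern": "rogerbot", "addition_date": "2014/02/28",
   "url": "http://moz.com/help/pro/what-is-rogerbot-",
   "instances": ["rogerbot/2.3 example UA"]},
  {"pattern": "Discordbot", "addition_date": "2017/01/01",
   "url": "https://discordapp.com", "instances": []}
]"""

print(validate(sample))  # 2
```

To check the real file, read `crawler-user-agents.json` from disk and pass its contents to `validate`.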
## License

The list is under an MIT License. The versions prior to Nov 7, 2016 were under a CC-SA license.
## Related work
There are a few wrapper libraries that use this data to detect bots:
- Voight-Kampff (Ruby)
- isbot (Ruby)
- crawlers (Clojure)
- isBot (Node.JS)
Other systems for spotting robots, crawlers, and spiders that you may want to consider are:
- Crawler-Detect (PHP)
- BrowserDetector (PHP)
- browscap (JSON files)
Source Distribution
Hashes for `crawler_user_agents-0.4.0.tar.gz`:

| Algorithm | Hash digest |
|---|---|
| SHA256 | d1fb41c9a17e5db7f49b18ecfb1e0db68d072ef30d1ed8aef948dd3e596eaf7e |
| MD5 | bcc88f02463dfc89ba9bf0109a086226 |
| BLAKE2b-256 | e0521938b092b959adb67896ede40c0acbb7c53c6bca5787e303f9d290dfeec3 |