Search Engine Bot Checker

This is a simple Python library that verifies the validity of a search engine crawler based on its IP and user agent.

It is designed to help SEOs and DevOps engineers validate Googlebot and other search engine bots.

Installation

pip install se-bot-checker

Usage

Using SE Bot Checker to validate a search engine crawler is simple. There are two basic steps.

  1. Instantiate the bot class.
  2. Call the bot class with IP and user agent arguments.

from se_bot_checker.bots import GoogleBot
googlebot = GoogleBot()
test_one = googlebot(
    '66.249.66.1', 
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
)
test_two = googlebot(
    '127.0.0.1', 
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
)
print(test_one)
print(test_two)

Output:

(True, 'googlebot')
(False, 'unknown')
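
Because a bot instance is called with an IP and a user agent and returns a (valid, name) tuple, several checkers can be tried in turn against one request. The following is a minimal sketch and not part of the library: identify_crawler is a hypothetical helper, and the two stand-in checkers exist only so the snippet runs without network access. In real use you would pass instances such as GoogleBot() instead.

```python
from typing import Callable, Iterable, Tuple

# A checker is anything callable as checker(ip, user_agent) -> (is_valid, name),
# which is the shape of the results shown in the output above.
Checker = Callable[[str, str], Tuple[bool, str]]


def identify_crawler(ip: str, user_agent: str,
                     checkers: Iterable[Checker]) -> Tuple[bool, str]:
    """Return the first positive result, or (False, 'unknown') if none match."""
    for checker in checkers:
        is_valid, name = checker(ip, user_agent)
        if is_valid:
            return True, name
    return False, 'unknown'


# Hypothetical stand-ins so the sketch runs offline; they are NOT library code.
def fake_googlebot(ip: str, user_agent: str) -> Tuple[bool, str]:
    ok = ip == '66.249.66.1' and 'googlebot' in user_agent.lower()
    return (True, 'googlebot') if ok else (False, 'unknown')


def fake_bingbot(ip: str, user_agent: str) -> Tuple[bool, str]:
    ok = ip == '203.0.113.7' and 'bingbot' in user_agent.lower()
    return (True, 'bingbot') if ok else (False, 'unknown')


GOOGLEBOT_UA = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
checkers = [fake_bingbot, fake_googlebot]

print(identify_crawler('66.249.66.1', GOOGLEBOT_UA, checkers))  # (True, 'googlebot')
print(identify_crawler('127.0.0.1', GOOGLEBOT_UA, checkers))    # (False, 'unknown')
```

Creating the checker instances once and reusing them for every request keeps any per-instance state they hold available across calls.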

Prebuilt Bots

Several bot definitions are prebuilt, tested, and maintained. They cover the most common search engine crawlers.

Crawler validation methods

Bot          User Agent  IP   DNS
BaiduSpider  X           X*   X**
BingBot      X           X*   X
DuckDuckBot  X           X
GoogleBot    X           X*   X
YandexBot    X           X*   X

* IP validation is only used on repeat checks run with the same bot checker instance. In the following example there is only one DNS network request, because the IP in test_two was already validated when test_one ran.

** BaiduSpider only supports reverse DNS validation, not reverse and forward. Although at first glance it appears that BaiduSpider should support reverse/forward DNS validation, I have never had a forward lookup succeed for BaiduSpider.

from se_bot_checker.bots import GoogleBot
googlebot = GoogleBot()
test_one = googlebot(
    '66.249.66.1', 
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
)
print(test_one)  # (True, 'googlebot')
test_two = googlebot(
    '66.249.66.1', 
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
)
print(test_two)  # (True, 'googlebot')
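
The caching behaviour described in the first footnote can be pictured with a small stand-alone sketch. It illustrates the general technique under stated assumptions and is not the library's actual implementation: CachingChecker, its dns_check parameter, and the lookups counter are hypothetical names used only for illustration.

```python
from typing import Callable, Set


class CachingChecker:
    """Remember IPs that passed an expensive check so repeat calls skip it."""

    def __init__(self, dns_check: Callable[[str], bool]) -> None:
        self._dns_check = dns_check        # expensive check, e.g. a DNS round trip
        self._valid_ips: Set[str] = set()  # IPs this instance has already validated
        self.lookups = 0                   # how many times dns_check actually ran

    def is_valid_ip(self, ip: str) -> bool:
        if ip in self._valid_ips:          # cache hit: no network request needed
            return True
        self.lookups += 1
        if self._dns_check(ip):
            self._valid_ips.add(ip)
            return True
        return False


# A lambda stands in for the real DNS validation so the sketch runs offline.
checker = CachingChecker(lambda ip: ip == '66.249.66.1')
first = checker.is_valid_ip('66.249.66.1')
second = checker.is_valid_ip('66.249.66.1')
print(first, second, checker.lookups)  # True True 1
```

Only successful validations are cached in this sketch, so an IP that fails is checked again on every call.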

BaiduSpider

BaiduSpider validation only uses a reverse DNS lookup, not reverse and forward.

  • Name: baiduspider
  • Domains: .baidu.com, .baidu.jp
  • User Agents: baiduspider
  • Use RegEx: False
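
The difference between reverse-only and reverse/forward DNS validation can be sketched with the standard socket module. This is an illustration of the general technique, not the library's code: matches_domain, validate_dns, and the reverse_fn/forward_fn parameters are hypothetical names, and the resolvers are injectable so the example can run without network access.

```python
import socket
from typing import Callable, Iterable, List


def matches_domain(hostname: str, domains: Iterable[str]) -> bool:
    """True if hostname equals, or is a subdomain of, one of the domains."""
    host = hostname.lower().rstrip('.')
    for domain in domains:
        bare = domain.lower().lstrip('.')
        if host == bare or host.endswith('.' + bare):
            return True
    return False


def reverse_lookup(ip: str) -> str:
    """Reverse (PTR) lookup: IP -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Forward (A) lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def validate_dns(ip: str, domains: Iterable[str], forward: bool = True,
                 reverse_fn: Callable[[str], str] = reverse_lookup,
                 forward_fn: Callable[[str], List[str]] = forward_lookup) -> bool:
    try:
        hostname = reverse_fn(ip)
        if not matches_domain(hostname, domains):
            return False
        if not forward:
            return True                    # reverse-only validation stops here
        return ip in forward_fn(hostname)  # the hostname must resolve back to the IP
    except OSError:                        # socket.herror / socket.gaierror
        return False


# Stand-in resolvers (assumptions for the demo) replace real DNS requests.
fake_reverse = lambda ip: 'crawl-66-249-66-1.googlebot.com'
fake_forward = lambda host: ['66.249.66.1']
domains = ['.googlebot.com', '.google.com']
print(validate_dns('66.249.66.1', domains,
                   reverse_fn=fake_reverse, forward_fn=fake_forward))  # True
```

Passing forward=False gives the reverse-only behaviour described for BaiduSpider; the forward step is what stops someone who controls the reverse DNS record of their own IP from passing as a crawler.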

BingBot

  • Name: bingbot
  • Domains: .search.msn.com
  • User Agents: bingbot, msnbot, bingpreview
  • Use RegEx: True

DuckDuckBot

DuckDuckBot only uses IP validation from the list of valid IPs.

  • Name: duckduckbot
  • IPs: See list below
  • User Agents: duckduckbot, duckduckgo
  • Use RegEx: True

23.21.227.69
50.16.241.113
50.16.241.114
50.16.241.117
50.16.247.234
52.204.97.54
52.5.190.19
54.197.234.188
54.208.100.253
54.208.102.37
107.21.1.8

Updated: April 08, 2020
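
Validating against a fixed IP list amounts to a set membership test. The sketch below uses the standard ipaddress module with the addresses listed above; is_known_ip and DUCKDUCKBOT_IPS are hypothetical names for illustration, and the library may store and compare the list differently.

```python
import ipaddress

# The addresses from the list above, parsed once into a set for fast lookups.
DUCKDUCKBOT_IPS = frozenset(
    ipaddress.ip_address(ip) for ip in (
        '23.21.227.69', '50.16.241.113', '50.16.241.114', '50.16.241.117',
        '50.16.247.234', '52.204.97.54', '52.5.190.19', '54.197.234.188',
        '54.208.100.253', '54.208.102.37', '107.21.1.8',
    )
)


def is_known_ip(ip: str, known=DUCKDUCKBOT_IPS) -> bool:
    """True if ip parses as an IP address and is in the known set."""
    try:
        return ipaddress.ip_address(ip) in known
    except ValueError:  # not a valid IP address string
        return False


print(is_known_ip('23.21.227.69'))  # True
print(is_known_ip('127.0.0.1'))     # False
```

Parsing the addresses rather than comparing raw strings means malformed input is rejected instead of raising an error.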

GoogleBot

  • Name: googlebot
  • Domains: .googlebot.com, .google.com
  • User Agents: googlebot
  • Use RegEx: False

YandexBot

  • Name: bingbot
  • Domains: .search.msn.com
  • User Agents: bingbot, msnbot, bingpreview
  • Use RegEx: True

Creating Your Own Bot Definition

SE Bot Checker was designed to be extensible. The core of SE Bot Checker is the Bot class. To create your own bot you can simply extend Bot.

Here is a custom bot that will only validate the mobile version of Googlebot.

from se_bot_checker.bots import Bot

class MobileGoogleBot(Bot):
    """
    Mobile googlebot checker
    """
    name = 'googlebot-mobile'
    domains = ['.googlebot.com', '.google.com']
    user_agent = 'android.*googlebot'
    use_regex = True  # user_agent is a RegEx pattern; use_regex defaults to False

That is all there is to it. However, we could simplify this a little by extending the GoogleBot class.

from se_bot_checker.bots import GoogleBot

class MobileGoogleBot(GoogleBot):
    """
    Mobile googlebot checker
    """
    name = 'googlebot-mobile'
    user_agent = 'android.*googlebot'
    use_regex = True  # user_agent is a RegEx pattern; GoogleBot uses substring matching

Both the desktop and mobile versions of Googlebot use the same domains for the reverse/forward DNS validation. This means we can simply extend GoogleBot. This is the recommended approach when possible.

Bot API

The Bot class is the core of SE Bot Checker. It handles the validation process. New bot definitions should subclass this class.

A single bot class can be instantiated once and called many times. This allows base settings to be configured once and multiple IP and user agent pairs to be validated with the same instance.

Bot.name: str This is the name the bot will return if it validates to True.

Bot.ips: iterable A list of known valid IPs.

Bot.domains: iterable A list of known valid domains. This is used to validate the results of the reverse DNS lookup. An exact match or a super domain of the DNS lookup results is considered a positive match.

Bot.user_agent: str A substring or RegEx pattern used to validate the request user agent. For the best performance and compatibility, the request user agent string is converted to lowercase before matching, so the user_agent string should be lowercase. If you need to validate uppercase or mixed-case user agents, you can override the Bot.valid_user_agent() method.

Bot.use_regex: bool Whether the user agent validation should use RegEx matching instead of substring matching. If user_agent is a plain string and not a RegEx pattern, this should be False, which is slightly faster. Defaults to False.
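
The interaction between user_agent and use_regex can be illustrated with a stand-alone function. This is a sketch of the behaviour described above, not the library's Bot.valid_user_agent() method, whose signature is not shown here; matches_user_agent is a hypothetical name.

```python
import re


def matches_user_agent(request_ua: str, user_agent: str,
                       use_regex: bool = False) -> bool:
    """Match a lowercase substring or RegEx pattern against a request user agent."""
    ua = request_ua.lower()  # the request user agent is lowercased before matching
    if use_regex:
        return re.search(user_agent, ua) is not None
    return user_agent in ua  # a plain substring test avoids the RegEx engine


MOBILE_UA = ('Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) '
             'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 '
             'Mobile Safari/537.36 (compatible; Googlebot/2.1; '
             '+http://www.google.com/bot.html)')

print(matches_user_agent(MOBILE_UA, 'googlebot'))                           # True
print(matches_user_agent(MOBILE_UA, 'android.*googlebot', use_regex=True))  # True
print(matches_user_agent(MOBILE_UA, 'android.*googlebot'))                  # False
```

The last call shows why a pattern such as 'android.*googlebot' needs use_regex set to True: treated as a substring, the literal characters '.*' never appear in a real user agent.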

Contributors

@danielmorell

Copyright © 2020 Daniel Morell
