A simple site crawler with proxy support

Project description

Pbot contains two modules, Bot and Spider.

Bot is a simple helper that keeps request state (cookies, referrer) between HTTP requests. It also provides additional methods for adding cookies. The module has no dependencies, which makes it easy to use when you need to simulate a browser.

Spider is Bot armed with lxml (required). It provides additional methods for easy website crawling, see below.

Bot is very easy to use:

from pbot.pbot import Bot
bot = Bot(proxies={'http': 'localhost:3128'}) # Proxies can be passed at creation or set later via bot.proxies
bot.add_cookie({'name': 'sample', 'value': 1, 'domain': 'example.com'})
response = bot.open('http://example.com') # Open with cookies and an empty referrer
bot.follow('http://google.com') # Open google.com with example.com as the referrer
response = bot.response # The last response is stored and can be read later
bot.follow('http://example.com', post={'q': 'abc'}) # POST and GET data can be passed as the post and get keyword arguments
bot.refresh_connector() # Flush cookies and referrer
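
To make the idea of "request state" concrete, here is a minimal standard-library sketch of a session that carries cookies and the referrer from one request to the next. It is not pbot's implementation: the class MiniSession and its methods build_request, follow and reset are hypothetical names chosen for illustration, and the code targets Python 3.

```python
import http.cookiejar
import urllib.parse
import urllib.request


class MiniSession:
    """Hypothetical stand-in (not pbot's Bot): keeps cookies and the last URL."""

    def __init__(self, proxies=None):
        self.cookies = http.cookiejar.CookieJar()
        handlers = [urllib.request.HTTPCookieProcessor(self.cookies)]
        if proxies:
            handlers.append(urllib.request.ProxyHandler(proxies))
        self.opener = urllib.request.build_opener(*handlers)
        self.referrer = None  # no referrer before the first request

    def build_request(self, url, post=None):
        """Prepare a request carrying the current referrer, without sending it."""
        data = urllib.parse.urlencode(post).encode() if post else None
        request = urllib.request.Request(url, data=data)
        if self.referrer:
            request.add_header("Referer", self.referrer)
        return request

    def follow(self, url, post=None):
        """Send the request, then remember its final URL as the next referrer."""
        response = self.opener.open(self.build_request(url, post))
        self.referrer = response.geturl()
        return response

    def reset(self):
        """Drop all cookies and forget the referrer."""
        self.cookies.clear()
        self.referrer = None
```

Cookies set by a response are stored in the jar by HTTPCookieProcessor and sent back on later requests through the same opener, which is the behaviour the Bot description above refers to.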

Spider gives you special features:

from pbot.spider import Spider
bot = Spider() # or Spider(force_encoding='utf-8') to force encoding for parser
bot.open('http://example.com')
bot.tree.xpath('//a') # The lxml tree is available as .tree; the response is read and parsed by lxml.html automatically
form = bot.xpath('//form[@id="main"]') # xpath is a shortcut for bot.tree.xpath
bot.submit(form) # Submit the lxml form
#
# Crawler: recursively crawls from the target page, yielding xml_tree, query_url, real_url
# (real_url is the URL after all redirects). Arguments are shown as in the method signature.
bot.crawl(
    url=None, # Target URL to start crawling from
    check_base=True, # Yield only pages on the same domain as url
    only_descendant=True, # Yield only pages whose URLs start with url
    max_level=None, # Maximum recursion level
    allowed_protocols=('http:', 'https:'),
    ignore_errors=True,
    ignore_starts=(), # Tuple/list of prefixes; URLs starting with any of them are ignored (excludes parts of a site)
    check_mime=())
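
The filtering options above can be read as a set of checks applied to each discovered link. The sketch below is one interpretation of the comments on check_base, only_descendant, allowed_protocols and ignore_starts, not pbot's actual code; the function should_visit is a hypothetical helper written for this illustration.

```python
from urllib.parse import urljoin, urlsplit


def should_visit(start_url, link, check_base=True, only_descendant=True,
                 allowed_protocols=("http:", "https:"), ignore_starts=()):
    """Hypothetical filter mirroring the crawl() options as described above."""
    url = urljoin(start_url, link)  # resolve relative links against the start page
    parts = urlsplit(url)
    if parts.scheme + ":" not in allowed_protocols:
        return False  # e.g. mailto: or javascript: links
    if check_base and parts.netloc != urlsplit(start_url).netloc:
        return False  # link points to another domain
    if only_descendant and not url.startswith(start_url):
        return False  # link is outside the starting path
    if any(url.startswith(prefix) for prefix in ignore_starts):
        return False  # link is in an explicitly excluded part of the site
    return True
```

With start_url set to 'http://example.com/docs/', a relative link such as 'intro.html' passes every check, while '/about' is rejected unless only_descendant is False.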

Download files

Source Distribution

pbot-1.4.0.tar.gz (5.0 kB)

File details

Details for the file pbot-1.4.0.tar.gz.

File metadata

  • Download URL: pbot-1.4.0.tar.gz
  • Size: 5.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for pbot-1.4.0.tar.gz
Algorithm    Hash digest
SHA256       84c182f459a81a4e0967c4e9034a5fda694d875daa98c063ac6443e72e838daf
MD5          adfa46604311542b76750811748b1353
BLAKE2b-256  0e7138e17675ac35cc4e83b94e623517fe29530a97bc3087d1f754fc9ecd86ba
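
A downloaded archive can be checked against the SHA256 digest listed above with Python's hashlib. This is a minimal sketch; the helper names sha256_of and matches_published_hash are hypothetical, and the path passed in is assumed to be wherever the file was saved locally.

```python
import hashlib

# SHA256 digest published above for pbot-1.4.0.tar.gz
EXPECTED_SHA256 = "84c182f459a81a4e0967c4e9034a5fda694d875daa98c063ac6443e72e838daf"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """True if the file at path has the expected SHA256 digest."""
    return sha256_of(path) == expected.lower()
```

For example, matches_published_hash('pbot-1.4.0.tar.gz') should return True only for an unmodified copy of the archive.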
