A simple site crawler with proxy support
Project description
Pbot contains two modules, Bot and Spider.
Bot is a simple helper that keeps request state (cookies, referrer) between HTTP requests. It also provides additional methods for adding cookies. It has no dependencies, so it is easy to use when you need to simulate a browser.
Spider is Bot armed with lxml (required). It provides additional methods for easy website crawling; see below.
Bot is very easy to use:
    from pbot.pbot import Bot

    # You can provide proxies during bot creation, or set them later as bot.proxies
    bot = Bot(proxies={'http': 'localhost:3128'})
    bot.add_cookie({'name': 'sample', 'value': 1, 'domain': 'example.com'})
    response = bot.open('http://example.com')  # Open with cookies and empty referrer
    bot.follow('http://google.com')  # Open google with example.com as a referrer
    response = bot.response  # Response saved, and can be read later
    # You can provide post and get as keyword arguments
    bot.follow('http://example.com', post={'q': 'abc'})
    bot.refresh_connector()  # Flush cookies and referrer
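To illustrate what keeping cookies and the referrer between requests can look like, here is a minimal sketch built only on the Python 3 standard library. It is not pbot's implementation: the class StatefulClient and its methods build_request, follow and reset are hypothetical names chosen for this example, and the proxy mapping is assumed to be in the form urllib expects.

```python
import http.cookiejar
import urllib.request


class StatefulClient:
    """Hypothetical stand-in for Bot: keeps cookies and the last URL between requests."""

    def __init__(self, proxies=None):
        self.cookies = http.cookiejar.CookieJar()
        self.referrer = None  # URL of the previously opened page, if any
        handlers = [urllib.request.HTTPCookieProcessor(self.cookies)]
        if proxies:
            # Assumed format: {'http': 'http://localhost:3128'}
            handlers.append(urllib.request.ProxyHandler(proxies))
        self.opener = urllib.request.build_opener(*handlers)

    def build_request(self, url):
        """Create a request, attaching the previous URL as the Referer header."""
        request = urllib.request.Request(url)
        if self.referrer:
            request.add_header("Referer", self.referrer)
        return request

    def follow(self, url):
        """Open url with the stored referrer, then remember the final URL."""
        response = self.opener.open(self.build_request(url))
        self.referrer = response.geturl()
        return response

    def reset(self):
        """Drop cookies and referrer so the next request starts clean."""
        self.cookies.clear()
        self.referrer = None
```

In this sketch the cookie jar is shared by every request sent through the opener, so cookies set by one response are sent with the next request, and reset plays a role loosely comparable to refresh_connector above.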
Spider gives you special features:
    from pbot.spider import Spider

    bot = Spider()  # or Spider(force_encoding='utf-8') to force encoding for parser
    bot.open('http://example.com')
    # The lxml tree can be accessed via .tree; the response is automatically
    # read and parsed by lxml.html
    bot.tree.xpath('//a')
    form = bot.xpath('//form[@id="main"]')  # xpath shortcut for bot.tree.xpath
    bot.submit(form)  # Submit lxml form

    # Crawler: recursively crawls from the target page, yielding
    # xml_tree, query_url, real_url (real_url is the url after all redirects).
    # The keyword values below are the defaults from the method signature.
    bot.crawl(url=None,  # Target url to start crawling
              check_base=True,  # Yield pages only on domain from url
              only_descendant=True,  # Yield only pages whose urls start with url
              max_level=None,  # Maximum level
              allowed_protocols=('http:', 'https:'),
              ignore_errors=True,
              ignore_starts=(),  # Tuple/array, ignore urls that start with ignore_starts (exclude some parts of site)
              check_mime=())
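The crawl filters can be read as a per-link decision. The sketch below is one possible interpretation of check_base, only_descendant, allowed_protocols and ignore_starts, written with the Python 3 standard library only. The function should_visit is a hypothetical helper, not part of pbot, and pbot's actual rules may differ; max_level, ignore_errors and check_mime are not covered.

```python
from urllib.parse import urljoin, urlparse


def should_visit(link, start_url, check_base=True, only_descendant=True,
                 allowed_protocols=("http:", "https:"), ignore_starts=()):
    """Decide whether a discovered link passes the crawl filters (assumed logic)."""
    absolute = urljoin(start_url, link)  # resolve relative links against the start page
    parsed = urlparse(absolute)
    # allowed_protocols entries carry a trailing colon, e.g. 'http:'
    if parsed.scheme + ":" not in allowed_protocols:
        return False
    # check_base: stay on the same host as the start url
    if check_base and parsed.netloc != urlparse(start_url).netloc:
        return False
    # only_descendant: keep only urls that begin with the start url
    if only_descendant and not absolute.startswith(start_url):
        return False
    # ignore_starts: skip excluded parts of the site
    if any(absolute.startswith(prefix) for prefix in ignore_starts):
        return False
    return True
```

For example, with a start url of http://example.com/docs/, the relative link guide.html is accepted, while /about is rejected unless only_descendant is turned off.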
Download files
Source Distribution
pbot-1.4.0.tar.gz (5.0 kB)
File details
Details for the file pbot-1.4.0.tar.gz.
File metadata
- Download URL: pbot-1.4.0.tar.gz
- Upload date:
- Size: 5.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 84c182f459a81a4e0967c4e9034a5fda694d875daa98c063ac6443e72e838daf
MD5 | adfa46604311542b76750811748b1353
BLAKE2b-256 | 0e7138e17675ac35cc4e83b94e623517fe29530a97bc3087d1f754fc9ecd86ba