Scraping and Crawling Micro-Framework
TurboCrawler
What is it?
TurboCrawler is a micro-framework for building crawlers easily. It focuses on being fast, extremely customizable, extensible and easy to use, and it gives you the power to control the crawler's behavior. It provides ways to schedule requests, parse your data asynchronously, and extract redirect links from an HTML page.
Install
pip install turbocrawler
Code Example
from pprint import pprint

import requests
from selectolax.lexbor import LexborHTMLParser

from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse, CrawlerRunner, ExecutionInfo, ExtractRule


class QuotesToScrapeCrawler(Crawler):
    crawler_name = "QuotesToScrape"
    allowed_domains = ['quotes.toscrape.com']
    regex_extract_rules = [ExtractRule(r'https://quotes.toscrape.com/page/[0-9]')]
    time_between_requests = 1
    session: requests.Session

    @classmethod
    def start_crawler(cls) -> None:
        # Open one HTTP session that is reused for every request.
        cls.session = requests.session()

    @classmethod
    def crawler_first_request(cls) -> CrawlerResponse | None:
        # Schedule one page manually, then fetch the first page directly.
        cls.crawler_queue.add(CrawlerRequest(url="https://quotes.toscrape.com/page/9/"))
        response = cls.session.get(url="https://quotes.toscrape.com/page/1/")
        return CrawlerResponse(url=response.url,
                               body=response.text,
                               status_code=response.status_code)

    @classmethod
    def process_request(cls, crawler_request: CrawlerRequest) -> CrawlerResponse:
        response = cls.session.get(crawler_request.url)
        return CrawlerResponse(url=response.url,
                               body=response.text,
                               status_code=response.status_code)

    @classmethod
    def parse(cls, crawler_request: CrawlerRequest, crawler_response: CrawlerResponse) -> None:
        selector = LexborHTMLParser(crawler_response.body)
        quote_list = selector.css('div[class="quote"]')
        for quote in quote_list:
            # [1:-1] drops the first and last character of the quote text (the quotation marks).
            data = {"quote": quote.css_first('span:nth-child(1)').text()[1:-1],
                    "author": quote.css_first('span:nth-child(2)>small').text(),
                    "tag_list": [tag.text() for tag in quote.css('div[class="tags"]>a') if tag]}
            pprint(data)

    @classmethod
    def stop_crawler(cls, execution_info: ExecutionInfo) -> None:
        cls.session.close()


CrawlerRunner(crawler=QuotesToScrapeCrawler).run()
Understanding the turbocrawler:
Crawler
Attributes
- crawler_name: the name of your crawler; this info will be used by CrawledQueue.
- allowed_domains: list containing all domains that the crawler may add to CrawlerQueue.
- regex_extract_rules: list containing ExtractRule objects. The regex passed here will be used to extract all redirect links (for example 'href="/users"') from the HTML page that you return in CrawlerResponse.body. If you leave this list empty, the automatic population of CrawlerQueue from each CrawlerResponse.body is not enabled.
- time_between_requests: time that each request has to wait before being executed.
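To make the effect of regex_extract_rules and allowed_domains more concrete, here is a minimal standard-library sketch of rule-based link extraction. It is not TurboCrawler's implementation: `extract_links` and `HREF_PATTERN` are hypothetical names, and the sketch assumes that each href is resolved against the page URL, filtered by domain, and then matched against the rule's regex.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical helper pattern: captures the value of every href="..." attribute.
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def extract_links(body, base_url, rule_pattern, allowed_domains):
    """Return absolute, in-domain URLs from `body` that match `rule_pattern`."""
    rule = re.compile(rule_pattern)
    links = []
    for href in HREF_PATTERN.findall(body):
        url = urljoin(base_url, href)  # resolve relative links such as "/page/2/"
        if urlparse(url).netloc not in allowed_domains:
            continue  # assumption: other domains are never scheduled
        if rule.search(url) and url not in links:
            links.append(url)
    return links


html = ('<a href="/page/2/">Next</a> '
        '<a href="/login">Login</a> '
        '<a href="https://example.com/page/3/">Elsewhere</a>')

found = extract_links(
    html,
    base_url="https://quotes.toscrape.com/page/1/",
    rule_pattern=r"https://quotes\.toscrape\.com/page/[0-9]",
    allowed_domains=["quotes.toscrape.com"],
)
print(found)
```

Only the pagination link survives here: the login link does not match the rule, and the last link points to a domain outside allowed_domains.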
Methods
start_crawler
Should be used to start a session, webdriver, etc...
crawler_first_request
Should be used to make the first request to a site, normally the login.
It can also be used to schedule the first pages to crawl.
There are 2 possible return values:
- return CrawlerResponse: the response will be sent to the parse method and the follow rule applies (see OBS-1).
- return None: the response will not be sent to the parse method.
process_request
This method receives all requests scheduled in the CrawlerQueue,
whether they were added manually through CrawlerQueue.add or scheduled automatically through regex_extract_rules.
Here you must implement all your request logic: cookies, headers, proxy, retries, etc.
The method receives a CrawlerRequest and must return a CrawlerResponse.
The follow rule applies (see OBS-1).
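As one example of the request logic mentioned above, the following is a minimal retry sketch that could be called from inside process_request. Everything in it is an assumption for illustration: `fetch_with_retries` and `flaky_fetch` are hypothetical names, the fetch callable stands in for something like `cls.session.get`, and the built-in ConnectionError stands in for the error type of your HTTP client (with requests you would catch `requests.RequestException` instead).

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, backoff_seconds=0.0):
    """Call `fetch(url)` until it succeeds or `max_attempts` is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except ConnectionError as error:
            last_error = error
            # Wait a little longer after each failed attempt.
            time.sleep(backoff_seconds * attempt)
    raise last_error


# Fake fetch function that fails twice before succeeding, so the sketch
# runs without any network access.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"url": url, "status_code": 200, "body": "<html></html>"}


result = fetch_with_retries(flaky_fetch, "https://quotes.toscrape.com/page/2/")
print(result["status_code"], calls["count"])
```

In a real crawler the returned data would be wrapped in a CrawlerResponse, as the code example at the top does.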
process_response
This method receives all responses returned by process_request.
Here you can implement any logic, such as scheduling requests,
validating the response, retrying, etc.
Implementing this method is not mandatory.
parse
This method receives every CrawlerResponse from
crawler_first_request, process_request or process_response.
Here you can parse your response,
extracting the target fields from the HTML and dumping the data, in a database for example.
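Dumping the parsed data in a database could look like the sketch below, which uses the standard-library sqlite3 module. The table layout, the `save_quotes` helper and the sample record are assumptions made for illustration; the dictionary keys mirror the ones built in the parse method of the code example.

```python
import json
import sqlite3


def save_quotes(connection, quotes):
    """Insert parsed quote dicts, silently skipping quotes already stored."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS quotes ("
        "quote TEXT PRIMARY KEY, author TEXT, tag_list TEXT)"
    )
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT OR IGNORE INTO quotes (quote, author, tag_list) VALUES (?, ?, ?)",
            [(q["quote"], q["author"], json.dumps(q["tag_list"])) for q in quotes],
        )


# Placeholder record, not real scraped data.
sample = [{"quote": "Example quote text",
           "author": "Example Author",
           "tag_list": ["example", "sketch"]}]

connection = sqlite3.connect(":memory:")
save_quotes(connection, sample)
save_quotes(connection, sample)  # second call is a no-op thanks to OR IGNORE

row_count = connection.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
stored_tags = json.loads(
    connection.execute("SELECT tag_list FROM quotes").fetchone()[0]
)
print(row_count, stored_tags)
```

Using the quote text as the primary key is only one possible design choice; it keeps the sketch idempotent if the same page is parsed twice.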
stop_crawler
Should be used to close a session, webdriver, etc...
OBS:
- OBS-1: If regex_extract_rules is filled, the redirects specified in the rules will be scheduled in the CrawlerQueue; if regex_extract_rules is not filled, no request will be scheduled automatically.
Order of calls
- start_crawler
- crawler_first_request
- Start a loop executing the methods sequentially: process_request -> process_response -> parse. The loop repeats and only stops when CrawlerQueue is empty.
- stop_crawler
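The order of calls can be pictured with a small simulation. This is a simplified stand-in, not the CrawlerRunner code: `run_lifecycle` is a hypothetical function, pages are modelled as a dict mapping a URL to the links found on it, and deduplication of already crawled URLs is left out to keep the loop readable.

```python
from collections import deque


def run_lifecycle(first_urls, pages):
    """Record the order in which the crawler hooks would be called."""
    calls = []
    queue = deque()

    calls.append("start_crawler")
    calls.append("crawler_first_request")
    queue.extend(first_urls)  # first pages scheduled by the first request

    while queue:  # the loop ends only when the queue is empty
        url = queue.popleft()
        calls.append(f"process_request {url}")
        links = pages.get(url, [])
        calls.append(f"process_response {url}")
        calls.append(f"parse {url}")
        # Assumption: links matching the extract rules are scheduled here.
        queue.extend(links)

    calls.append("stop_crawler")
    return calls


pages = {"/page/1/": ["/page/2/"], "/page/2/": []}
order = run_lifecycle(["/page/1/"], pages)
for call in order:
    print(call)
```

Each URL goes through process_request, process_response and parse once, and stop_crawler runs only after the queue has drained.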
CrawlerRunner
CrawlerRunner is responsible for running the Crawler: it calls the methods in order,
schedules your requests automatically, and handles the queues.
By default it uses:
- FIFOMemoryCrawlerQueue for CrawlerQueue
- MemoryCrawledQueue for CrawledQueue
You can change this by using the built-in queues
in turbocrawler.queues or by creating your own queues.
CrawlerQueue
CrawlerQueue is where your CrawlerRequest objects are stored
until they are removed to be processed by process_request.
CrawledQueue
CrawledQueue is where the URLs of all processed CrawlerRequest objects are stored.
It prevents dispatching a request to an already crawled URL, but this behavior can be changed.
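The interplay between the two queues can be sketched with two small in-memory classes. These are hypothetical stand-ins and do not reproduce the classes in turbocrawler.queues: the class names, method names, and the point at which a URL is marked as crawled are all assumptions.

```python
from collections import deque


class SketchCrawledQueue:
    """Hypothetical stand-in: remembers URLs that were already processed."""

    def __init__(self):
        self._seen = set()

    def add(self, url):
        self._seen.add(url)

    def __contains__(self, url):
        return url in self._seen


class SketchCrawlerQueue:
    """Hypothetical FIFO queue that refuses crawled or already pending URLs."""

    def __init__(self, crawled_queue):
        self._pending = deque()
        self._crawled_queue = crawled_queue

    def add(self, url):
        if url in self._crawled_queue or url in self._pending:
            return False  # nothing scheduled
        self._pending.append(url)
        return True

    def pop(self):
        url = self._pending.popleft()  # first in, first out
        self._crawled_queue.add(url)   # assumption: marked crawled on removal
        return url

    def __len__(self):
        return len(self._pending)


crawled = SketchCrawledQueue()
queue = SketchCrawlerQueue(crawled)

first_add = queue.add("https://quotes.toscrape.com/page/1/")
duplicate_add = queue.add("https://quotes.toscrape.com/page/1/")
processed = queue.pop()
readd_after_crawl = queue.add("https://quotes.toscrape.com/page/1/")
print(first_add, duplicate_add, readd_after_crawl, len(queue))
```

Swapping `popleft` for `pop` would turn the same sketch into a last-in, first-out queue, which is the kind of change a custom queue makes possible.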
Project details
File details
Details for the file turbocrawler-0.0.3.tar.gz.
File metadata
- Download URL: turbocrawler-0.0.3.tar.gz
- Upload date:
- Size: 14.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.2.0 CPython/3.13.5 Linux/6.17.0-8-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 25ba48b6ae2e156cf7ef8cb480a0815d205fedc062701367f86d7cb010cf178e |
| MD5 | 34f16005ca322ab58119060aae4e067b |
| BLAKE2b-256 | 3f57507f3f0720cef2c318d8fbffd6bc571b0d60d8d10f1036d19c38d5f0d3a9 |
File details
Details for the file turbocrawler-0.0.3-py3-none-any.whl.
File metadata
- Download URL: turbocrawler-0.0.3-py3-none-any.whl
- Upload date:
- Size: 20.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.2.0 CPython/3.13.5 Linux/6.17.0-8-generic
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4a0069cfd01b388254e97e7ea26fa2efae3ecf1cc14d6497600330506fc6d6a4 |
| MD5 | f007775eda48212d0d5a58be3166edcb |
| BLAKE2b-256 | 3e988a8d464c224096e3b33c66392ef6acb07f33d23a879fb648b3db96f14a32 |