TurboCrawler
Scraping and Crawling Micro-Framework

What is it?
TurboCrawler is a micro-framework that you can use to build your crawlers easily. It is focused on being fast, extremely customizable, extensible, and easy to use, giving you the power to control the crawler's behavior. It provides ways to schedule requests, parse your data asynchronously, and extract redirect links from an HTML page.
Install
```
pip install turbocrawler
```
Code Example
```python
from pprint import pprint

import requests
from parsel import Selector

from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse, CrawlerRunner, ExecutionInfo, ExtractRule


class QuotesToScrapeCrawler(Crawler):
    crawler_name = "QuotesToScrape"
    allowed_domains = ['quotes.toscrape']
    regex_extract_rules = [ExtractRule(r'https://quotes.toscrape.com/page/[0-9]')]
    time_between_requests = 1
    session: requests.Session

    @classmethod
    def start_crawler(cls) -> None:
        cls.session = requests.session()

    @classmethod
    def crawler_first_request(cls) -> CrawlerResponse | None:
        cls.crawler_queue.add(CrawlerRequest(url="https://quotes.toscrape.com/page/9/"))
        response = cls.session.get(url="https://quotes.toscrape.com/page/1/")
        return CrawlerResponse(url=response.url,
                               body=response.text,
                               status_code=response.status_code)

    @classmethod
    def process_request(cls, crawler_request: CrawlerRequest) -> CrawlerResponse:
        response = cls.session.get(crawler_request.url)
        return CrawlerResponse(url=response.url,
                               body=response.text,
                               status_code=response.status_code)

    @classmethod
    def parse(cls, crawler_request: CrawlerRequest, crawler_response: CrawlerResponse) -> None:
        selector = Selector(crawler_response.body)
        for quote in selector.css('div[class="quote"]'):
            data = {"quote": quote.css('span:nth-child(1)::text').get()[1:-1],
                    "author": quote.css('span:nth-child(2)>small::text').get(),
                    "tags_list": quote.css('div[class="tags"]>a::text').getall()}
            pprint(data)

    @classmethod
    def stop_crawler(cls, execution_info: ExecutionInfo) -> None:
        cls.session.close()


CrawlerRunner(crawler=QuotesToScrapeCrawler).run()
```
Understanding turbocrawler
Crawler
Attributes
crawler_name
The name of your crawler; this info is used by CrawledQueue.
allowed_domains
List containing all domains that the crawler may add to CrawlerQueue.
regex_extract_rules
List containing ExtractRule objects. The regexes passed here are used to extract all redirect links (e.g. href="/users") from the HTML page you return in CrawlerResponse.body. If you leave this list empty, the automatic population of CrawlerQueue from every CrawlerResponse.body is disabled. See the sketch after this list.
time_between_requests
The time each request must wait before being executed.
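A minimal sketch of how these attributes fit together, using a hypothetical blog site (the domain, regexes, and two-second delay are illustrative; ExtractRule is used only in the single-regex form shown in the example above):

```python
from turbocrawler import Crawler, ExtractRule


class BlogCrawler(Crawler):
    # Used by CrawledQueue to identify this crawler's already-crawled URLs.
    crawler_name = "ExampleBlog"
    # Only URLs from these domains may be added to CrawlerQueue.
    allowed_domains = ['blog.example.com']
    # Links matching these regexes are extracted from every
    # CrawlerResponse.body and scheduled automatically.
    regex_extract_rules = [
        ExtractRule(r'https://blog.example.com/posts/[0-9]+'),
        ExtractRule(r'https://blog.example.com/page/[0-9]+'),
    ]
    # Wait 2 seconds between requests.
    time_between_requests = 2

    # The request/parse methods shown in the example above are omitted here.
```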
Methods
start_crawler
Should be used to start a session, webdriver, etc.
crawler_first_request
Should be used to make the first request to a site, normally the login. It can also be used to schedule the first pages to crawl.
Two possible returns:
- return CrawlerResponse: the response will be sent to the parse method and the follow rule will be applied (see OBS-1).
- return None: the response will not be sent to the parse method.
process_request
This method receives every request scheduled in CrawlerQueue, whether added manually through CrawlerQueue.add or scheduled automatically by regex_extract_rules.
Here you must implement all your request logic: cookies, headers, proxies, retries, etc.
The method receives a CrawlerRequest and must return a CrawlerResponse. The follow rule is applied (see OBS-1). A sketch follows below.
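As a sketch of custom request logic, here is a process_request that sets a header and retries server errors with a simple backoff. The header value and retry policy are illustrative, not part of turbocrawler, and the other crawler methods are omitted for brevity:

```python
import time

import requests
from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse


class RetryingCrawler(Crawler):
    crawler_name = "RetryingCrawler"
    session: requests.Session

    @classmethod
    def start_crawler(cls) -> None:
        cls.session = requests.session()

    @classmethod
    def process_request(cls, crawler_request: CrawlerRequest) -> CrawlerResponse:
        headers = {"User-Agent": "my-crawler/1.0"}  # illustrative header
        for attempt in range(3):  # retry up to 3 times on 5xx responses
            response = cls.session.get(crawler_request.url, headers=headers)
            if response.status_code < 500:
                break
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s
        return CrawlerResponse(url=response.url,
                               body=response.text,
                               status_code=response.status_code)
```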
process_response
This method receives every response produced by process_request.
Here you can implement any logic, like scheduling requests, validating the response, remaking requests, etc.
Implementing this method is not mandatory.
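This README does not show the signature of process_response, so the parameters below are an assumption; treat this as a sketch of a response-validation hook, not the documented API:

```python
from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse


class ValidatingCrawler(Crawler):
    crawler_name = "ValidatingCrawler"

    @classmethod
    def process_response(cls, crawler_request: CrawlerRequest,
                         crawler_response: CrawlerResponse) -> None:
        # ASSUMED signature: the README only says this hook receives what
        # process_request produced.
        if crawler_response.status_code == 429:
            # Re-schedule the rate-limited URL. Note that CrawledQueue may
            # filter URLs it has already seen (see CrawledQueue below).
            cls.crawler_queue.add(CrawlerRequest(url=crawler_request.url))
```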
parse
This method receives every CrawlerResponse from crawler_first_request, process_request, or process_response.
Here you can parse the response, extracting the target fields from the HTML and dumping the data, into a database for example.
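Building on the example above, here is a sketch of a parse that dumps quotes into a SQLite database instead of printing them (the table layout is illustrative; process_request and the class attributes are omitted for brevity):

```python
import sqlite3

from parsel import Selector
from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse, ExecutionInfo


class QuotesToDatabaseCrawler(Crawler):
    crawler_name = "QuotesToDatabase"
    connection: sqlite3.Connection

    @classmethod
    def start_crawler(cls) -> None:
        # Open the database once, before any request is made.
        cls.connection = sqlite3.connect("quotes.db")
        cls.connection.execute(
            "CREATE TABLE IF NOT EXISTS quotes (quote TEXT, author TEXT)")

    @classmethod
    def parse(cls, crawler_request: CrawlerRequest, crawler_response: CrawlerResponse) -> None:
        selector = Selector(crawler_response.body)
        for quote in selector.css('div[class="quote"]'):
            cls.connection.execute(
                "INSERT INTO quotes VALUES (?, ?)",
                (quote.css('span:nth-child(1)::text').get(),
                 quote.css('span:nth-child(2)>small::text').get()))
        cls.connection.commit()

    @classmethod
    def stop_crawler(cls, execution_info: ExecutionInfo) -> None:
        cls.connection.close()
```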
stop_crawler
Should be used to close a session, webdriver, etc.
OBS:
1. If regex_extract_rules is filled, the redirect links that match the rules will be scheduled in CrawlerQueue; if regex_extract_rules is not filled, no requests will be scheduled automatically.
Order of calls
start_crawler
crawler_first_request
- Start loop executing the methods sequentially
process_request
->process_response
->parse
-> loop forever.
The loop only stops whenCrawlerQueue
is empty. stop_crawler
CrawlerRunner
It is responsible for running the Crawler, calling the methods in order, automatically scheduling your requests, and handling the queues.
By default it uses:
- FIFOMemoryCrawlerQueue for CrawlerQueue
- MemoryCrawledQueue for CrawledQueue
But you can change them using the built-in queues in turbocrawler.queues or by creating your own queues, as in the sketch below.
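This README does not show how custom queues are passed to CrawlerRunner, so the keyword arguments below are hypothetical; check turbocrawler.queues and the CrawlerRunner source for the real interface:

```python
# HYPOTHETICAL: the `crawler_queue` and `crawled_queue` parameter names are
# assumptions, not documented in this README.
from turbocrawler import CrawlerRunner
from turbocrawler.queues import FIFOMemoryCrawlerQueue, MemoryCrawledQueue

CrawlerRunner(crawler=QuotesToScrapeCrawler,
              crawler_queue=FIFOMemoryCrawlerQueue,
              crawled_queue=MemoryCrawledQueue).run()
```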
CrawlerQueue
CrawlerQueue is where your CrawlerRequest objects are stored; each one is removed from the queue to be processed by process_request.
CrawledQueue
CrawledQueue is where the URL of every processed CrawlerRequest is stored.
It prevents the crawler from requesting the same URL again, but this behavior can be changed.