Scraping and Crawling Micro-Framework

TurboCrawler

What is it?

TurboCrawler is a micro-framework for building crawlers. It focuses on being fast, highly customizable, extensible and easy to use, and gives you control over the crawler's behavior. It provides ways to schedule requests, parse your data asynchronously, and extract redirect links from an HTML page.

Install

pip install turbocrawler

Code Example

from pprint import pprint
import requests
from selectolax.lexbor import LexborHTMLParser
from turbocrawler import Crawler, CrawlerRequest, CrawlerResponse, CrawlerRunner, ExecutionInfo, ExtractRule


class QuotesToScrapeCrawler(Crawler):
    crawler_name = "QuotesToScrape"
    allowed_domains = ['quotes.toscrape.com']
    regex_extract_rules = [ExtractRule(r'https://quotes.toscrape.com/page/[0-9]')]
    time_between_requests = 1
    session: requests.Session

    @classmethod
    def start_crawler(cls) -> None:
        cls.session = requests.Session()

    @classmethod
    def crawler_first_request(cls) -> CrawlerResponse | None:
        cls.crawler_queue.add(CrawlerRequest(url="https://quotes.toscrape.com/page/9/"))
        response = cls.session.get(url="https://quotes.toscrape.com/page/1/")
        return CrawlerResponse(url=response.url,
                               body=response.text,
                               status_code=response.status_code)

    @classmethod
    def process_request(cls, crawler_request: CrawlerRequest) -> CrawlerResponse:
        response = cls.session.get(crawler_request.url)
        return CrawlerResponse(url=response.url,
                               body=response.text,
                               status_code=response.status_code)

    @classmethod
    def parse(cls, crawler_request: CrawlerRequest, crawler_response: CrawlerResponse) -> None:
        selector = LexborHTMLParser(crawler_response.body)
        quote_list = selector.css('div[class="quote"]')
        for quote in quote_list:
            data = {"quote": quote.css_first('span:nth-child(1)').text()[1:-1],
                    "author": quote.css_first('span:nth-child(2)>small').text(),
                    "tag_list": [tag.text() for tag in quote.css('div[class="tags"]>a') if tag]}
            pprint(data)

    @classmethod
    def stop_crawler(cls, execution_info: ExecutionInfo) -> None:
        cls.session.close()


CrawlerRunner(crawler=QuotesToScrapeCrawler).run()

Understanding TurboCrawler:

Crawler

Attributes

  • crawler_name: the name of your crawler. This value is used by CrawledQueue.
  • allowed_domains: a list of the domains whose URLs the crawler may add to CrawlerQueue.
  • regex_extract_rules: a list of ExtractRule objects. Each regex is used to extract redirect
    links (for example 'href="/users"') from the HTML page that you return in CrawlerResponse.body.
    If you leave this list empty, CrawlerQueue is not populated automatically from CrawlerResponse.body.
  • time_between_requests: the time each request has to wait before being executed.
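
To make the interplay of allowed_domains and regex_extract_rules more concrete, here is a minimal standalone sketch that uses only the Python standard library. It is not TurboCrawler's implementation: the helper extract_links, its signature and the sample HTML are hypothetical, and the library's real matching details may differ.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical helper, NOT part of turbocrawler. It only illustrates the idea:
# collect hrefs from a body, resolve them against the page URL, and keep the
# ones that match a rule and belong to an allowed domain.
HREF_RE = re.compile(r'href="([^"]+)"')


def extract_links(page_url, body, rules, allowed_domains):
    found = []
    for href in HREF_RE.findall(body):
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).netloc not in allowed_domains:
            continue  # link points outside the allowed domains
        if any(re.search(rule, absolute) for rule in rules):
            if absolute not in found:  # keep order, drop duplicates
                found.append(absolute)
    return found


# Made-up HTML fragment used only for this illustration.
html = (
    '<a href="/page/2/">Next</a> '
    '<a href="https://example.com/page/3/">External</a> '
    '<a href="/login">Login</a>'
)
links = extract_links(
    "https://quotes.toscrape.com/page/1/",
    html,
    rules=[r"https://quotes\.toscrape\.com/page/[0-9]+"],
    allowed_domains=["quotes.toscrape.com"],
)
print(links)
```

Only the first link survives in this sketch: the second is on another domain and the third matches no rule.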

Methods

start_crawler

Should be used to start a session, webdriver, etc...

crawler_first_request

Should be used to make the first request to a site, normally the login. It can also be used to schedule the first pages to crawl.
There are two possible return values:

  • return CrawlerResponse: the response is sent to the parse method and the follow rule applies (see OBS 1)
  • return None: no response is sent to the parse method

process_request

This method receives every request scheduled in the CrawlerQueue, whether it was added manually through CrawlerQueue.add or scheduled automatically through regex_extract_rules.
Here you must implement all your request logic: cookies, headers, proxy, retries, etc.
The method receives a CrawlerRequest and must return a CrawlerResponse.
The follow rule applies (see OBS 1).
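
Since retries are left to you, one possible shape for that logic is sketched below. The helper fetch_with_retries and the fake flaky_fetch function are hypothetical names introduced for illustration; they are not part of TurboCrawler, and the fetch callable stands in for whatever HTTP client you use.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, backoff_seconds=0.0):
    """Call fetch(url) until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # narrow this to your client's errors
            last_error = error
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # linear backoff
    raise last_error


# Usage with a fake fetch function that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"url": url, "status_code": 200, "body": "<html></html>"}


result = fetch_with_retries(flaky_fetch, "https://quotes.toscrape.com/page/2/")
print(result["status_code"], calls["count"])
```

Inside process_request, the fetch callable could wrap cls.session.get, and the result could then be converted into a CrawlerResponse as in the code example above.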

process_response

This method receives the result of every request made by process_request.
Here you can implement any logic, such as scheduling requests, validating the response, or retrying. Implementing this method is optional.
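
As an illustration of the kind of validation that could live here, the sketch below classifies a response by status code and body. The function classify_response, the set of retryable status codes and the three outcome labels are assumptions made for this example; the signature of process_response itself is not documented in this section, so the logic is shown as a plain function.

```python
# Hypothetical policy: which status codes are worth retrying is an assumption.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def classify_response(status_code, body):
    """Return 'retry', 'drop' or 'parse' for a fetched page."""
    if status_code in RETRYABLE_STATUS:
        return "retry"  # transient failure, worth scheduling again
    if status_code != 200 or not body.strip():
        return "drop"  # permanent failure or empty page
    return "parse"  # looks valid, hand it to the parsing step


print(classify_response(200, "<html>ok</html>"))
print(classify_response(503, ""))
print(classify_response(404, "<html>not found</html>"))
```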

parse

This method receives every CrawlerResponse from crawler_first_request, process_request or process_response.
Here you can parse your response, extracting the target fields from the HTML and dumping the data, in a database for example.
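
For the "dump the data in a database" part, a minimal sketch with the standard library's sqlite3 module could look like this. The function save_quotes, the table layout and the sample row are hypothetical; the dictionary keys follow the data dict built in the code example above.

```python
import sqlite3


def save_quotes(connection, quotes):
    """Insert parsed quote dicts into a 'quotes' table, creating it if needed."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS quotes (quote TEXT, author TEXT, tags TEXT)"
    )
    connection.executemany(
        "INSERT INTO quotes (quote, author, tags) VALUES (?, ?, ?)",
        [(q["quote"], q["author"], ",".join(q["tag_list"])) for q in quotes],
    )
    connection.commit()


# In-memory database and placeholder data, used only for the illustration.
connection = sqlite3.connect(":memory:")
save_quotes(
    connection,
    [{"quote": "An example quote.", "author": "Example Author", "tag_list": ["a", "b"]}],
)
rows = connection.execute("SELECT quote, author, tags FROM quotes").fetchall()
print(rows)
```

In a crawler, the connection could be opened in start_crawler and closed in stop_crawler, mirroring how the session is handled in the code example.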

stop_crawler

Should be used to close a session, webdriver, etc...

OBS:

  1. If regex_extract_rules is filled, the redirects matched by the rules are scheduled in the CrawlerQueue. If regex_extract_rules is empty, no request is scheduled automatically.

Order of calls

  1. start_crawler
  2. crawler_first_request
  3. Loop over the methods sequentially: process_request -> process_response -> parse.
    The loop only stops when CrawlerQueue is empty.
  4. stop_crawler
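
The call order above can be mimicked with a toy driver. This is not CrawlerRunner: RecordingCrawler and run are hypothetical stand-ins, they use instance methods instead of classmethods, and the (request, response) arguments given to process_response are an assumption.

```python
from collections import deque


class RecordingCrawler:
    """Hypothetical stand-in that only records which hook was called."""

    def __init__(self):
        self.calls = []

    def start_crawler(self):
        self.calls.append("start_crawler")

    def crawler_first_request(self):
        self.calls.append("crawler_first_request")
        return "first response"  # returning None would skip the first parse

    def process_request(self, request):
        self.calls.append("process_request")
        return f"response for {request}"

    def process_response(self, request, response):
        self.calls.append("process_response")
        return response

    def parse(self, request, response):
        self.calls.append("parse")

    def stop_crawler(self):
        self.calls.append("stop_crawler")


def run(crawler, queue):
    """Toy driver that mirrors the documented order of calls."""
    crawler.start_crawler()
    first = crawler.crawler_first_request()
    if first is not None:
        crawler.parse(None, first)
    while queue:  # the loop stops when the queue is empty
        request = queue.popleft()
        response = crawler.process_request(request)
        response = crawler.process_response(request, response)
        crawler.parse(request, response)
    crawler.stop_crawler()


crawler = RecordingCrawler()
run(crawler, deque(["https://quotes.toscrape.com/page/2/"]))
print(" -> ".join(crawler.calls))
```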

CrawlerRunner

CrawlerRunner is responsible for running the Crawler: it calls the methods in order, schedules your requests automatically, and handles the queues.
By default it uses:

  • FIFOMemoryCrawlerQueue for CrawlerQueue
  • MemoryCrawledQueue for CrawledQueue

You can change these by using the built-in queues in turbocrawler.queues or by creating your own queues.


CrawlerQueue

CrawlerQueue is where your CrawlerRequest objects are stored until they are removed to be processed by process_request.


CrawledQueue

CrawledQueue is where the URLs of all processed CrawlerRequest objects are stored. It prevents dispatching a request to an already crawled URL, but this behavior can be changed.
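
The relationship between the two queues can be pictured with a deque of pending URLs and a set of crawled URLs. The function dispatch_all is a hypothetical illustration written with the standard library only; it does not use or reproduce the queue classes shipped in turbocrawler.queues.

```python
from collections import deque


def dispatch_all(urls):
    """Return the URLs that would be dispatched, skipping already crawled ones."""
    pending = deque(urls)  # plays the role of the queue of pending requests
    crawled = set()  # plays the role of the store of crawled URLs
    dispatched = []
    while pending:
        url = pending.popleft()  # first in, first out
        if url in crawled:
            continue  # already crawled, do not dispatch again
        crawled.add(url)
        dispatched.append(url)
    return dispatched


order = dispatch_all(
    [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/1/",
    ]
)
print(order)
```

The repeated page/1 URL is dispatched only once in this sketch, which is the default behavior described above.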
