
Tool for General Purpose Web Scraping and Crawling

scrawler

"scrawler" = "scraper" + "crawler"

Provides functionality for the automatic collection of website data (web scraping) and following links to map an entire domain (crawling). It can handle these tasks individually, or process several websites/domains in parallel using asyncio and multithreading.

This project was initially developed while working at the Fraunhofer Institute for Systems and Innovation Research. Many thanks for the opportunity and support!

Table of Contents

  1. Getting Started
  2. Basic Objects
  3. Attributes
  4. FAQ

Getting Started

You can install scrawler from PyPI:

pip install scrawler

Alternatively, you can find the .whl and .tar.gz files on GitHub for each respective release.

To start, have a look at the templates folder. It contains four files, each performing a different task. All templates include three sections: Imports, Setup and Execution.

  • Imports retrieves all code dependencies from various files
  • Setup is where all parameters are specified
  • In Execution, an instance of the respective Python object is created and its run() method executed

You can copy-and-paste the template and make any adjustments you would like.

Specifying parameters

Let's have a closer look at the "setup" section.

First, the URLs that are to be processed are specified (for more details, have a look at the section Attributes).

Then, the attributes that define how to accomplish the tasks are specified:

  • The search attributes specify which data to collect/search for in the website (and how to do it)
  • The export attributes specify how and where to export the collected data to
  • In the case of a crawling task, the crawling attributes specify how to conduct the crawling, e.g. how to filter irrelevant URLs or limits on the number of URLs crawled.

In the section "execution", these parameters are then passed to the relevant object (see section Basic Objects).

Basic Objects

The basic functionality of scrawler is contained in two classes, scraping.Scraper and crawling.Crawler.

Functionality

The objects are passed all relevant parameters during object initialization and then executed by calling the object's run() or run_and_export() methods. Afterwards, data may be exported by calling the export_data() method.

To sum it up:

  • run(): Execute the task and return the results.
  • run_and_export(): This may be used when scraping/crawling many sites at once, generating huge amounts of data. In order to prevent a MemoryError, data will be exported as soon as it is ready and then discarded to make room for the next sites/domains.
  • export_data(): Export the collected data to CSV file(s).
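
The difference between collecting everything first and exporting as you go can be illustrated with a small standard-library sketch. This is not scrawler's implementation: fake_scrape and export_incrementally are hypothetical names, and the rows are made up for illustration. It only shows the idea behind run_and_export(), namely that each site's data is written to disk and discarded before the next site is processed.

```python
import csv
import os
import tempfile


def fake_scrape(url):
    """Hypothetical stand-in for scraping one site: returns one row of data."""
    return [url, f"Title of {url}"]


def export_incrementally(urls, directory):
    """Write each site's data to its own CSV file as soon as it is ready.

    Only one site's rows are held in memory at a time.
    """
    paths = []
    for i, url in enumerate(urls):
        rows = [fake_scrape(url)]  # data for a single site
        path = os.path.join(directory, f"site_{i}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        paths.append(path)
        del rows  # discard before moving on to the next site
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = export_incrementally(["https://example.com", "https://example.org"], tmp)
    n_files = len(os.listdir(tmp))
```

With run(), by contrast, all results are returned to the caller and stay in memory until export_data() is called.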

Example Crawling

Let's have a look at an example. For more information on how to create search, export and crawling attributes, you can refer to the section Attributes:

from scrawler import Crawler

search_attrs, export_attrs, crawling_attrs = ..., ..., ...

crawler = Crawler("https://example.com",
                  search_attributes=search_attrs,
                  export_attributes=export_attrs,
                  crawling_attributes=crawling_attrs)
results = crawler.run()
crawler.export_data()

Example Scraping

Here, multiple sites are scraped at once.

from scrawler import Scraper

search_attrs, export_attrs = ..., ...

scraper = Scraper(["https://www.example1.com", "https://www.example2.com", "https://www.example3.com"],
                  search_attributes=search_attrs,
                  export_attributes=export_attrs)
results = scraper.run()
scraper.export_data()

Attributes

Now that the object necessary for our task has been created, we would like to specify exactly how to go about the task.

Search Attributes

The search attributes specify which data to collect/search for in the website (and how to do it). This is done by passing data extractor objects to the SearchAttributes during initialization.

There are many data extractors already built into the project. For example, the following snippet uses the built-in DateExtractor to extract a website's publication date from an HTML meta tag.

from scrawler.data_extractors import DateExtractor

pubdate_extractor = DateExtractor(tag_types="meta", tag_attrs={"name": "pubdate"})

Have a look at scrawler > data_extractors.py to see all available built-in data extractors.

Here's an example of creating a SearchAttributes object:

from scrawler.attributes import SearchAttributes
from scrawler.data_extractors import *

search_attrs = SearchAttributes(
    UrlExtractor(),  # returns URL
    TitleExtractor(),  # returns website <title> tag content
    DateExtractor(tag_types="meta", tag_attrs={"name": "pubdate"})  # returns publication date
)

Note how parameters for the data extractors are passed directly during initialization.

Custom data extractors

If you do not find what you need, you can also build a data extractor yourself.

Data extractors are passed a Website object, which provides access to three types of data:

  • The website's HTML parsed as a BeautifulSoup object (see their documentation for how to extract data from it). Because Website extends BeautifulSoup, you can directly execute BeautifulSoup methods on the website object.
  • The HTTP response object (http_response attribute). This is an aiohttp ClientResponse object.
  • The website's raw URL (url attribute) and parsed URL (parsed_url attribute) parts.

Data extractors must inherit from BaseExtractor and implement two methods:

  • __init__(): Where parameters to the extractor can be passed and are stored in object attributes.
  • run(): To do the extraction. Make sure that the method signature is the same as for BaseExtractor, i.e. it accepts two parameters: website, and index as an optional parameter.

In this example, we build a data extractor to retrieve a website's copyright tag (if available):

from scrawler import Website
from scrawler.data_extractors import BaseExtractor


class CopyrightExtractor(BaseExtractor):
    def __init__(self, **kwargs):
        """Extract website copyright tag."""
        super().__init__(**kwargs)

    def run(self, website: Website, index: int = None):
        copyright_tag = website.find("meta", attrs={"name": "copyright"})

        # Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
        try:
            copyright_text = copyright_tag.attrs["content"]
        except (AttributeError, KeyError):
            copyright_text = "NULL"

        return copyright_text

In this case, we could actually have had an easier solution. The built-in extractor GeneralHtmlTagExtractor already contains all the necessary functionality:

from scrawler.data_extractors import GeneralHtmlTagExtractor

copyright_extractor = GeneralHtmlTagExtractor(tag_types="meta", tag_attrs={"name": "copyright"},
                                              attr_to_extract="content")

Special parameters

Some notes on basic parameters specified in BaseExtractor that apply to all data extractors.

The parameter n_return_values specifies the number of values that will be returned by the extractor. This is almost always 1, but there are cases such as DateExtractor which may return more values. If you build your own data extractor that may return more than one value, make sure to update self.n_return_values. This attribute is used to validate that the length of the header of the CSV file equals the number of columns generated by the search attributes. Have a look at the implementation of DateExtractor to see how this might be handled.
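
The header check described above can be sketched as follows. This is a minimal illustration, not the library's code: OneValueExtractor, TwoValueExtractor and validate_header are hypothetical names, and only the n_return_values attribute is taken from the description above.

```python
class OneValueExtractor:
    """Hypothetical extractor returning a single value."""
    n_return_values = 1


class TwoValueExtractor:
    """Hypothetical extractor returning two values (e.g. two dates)."""
    n_return_values = 2


def validate_header(header, extractors):
    """Raise ValueError if the header length differs from the total number of returned values."""
    expected = sum(e.n_return_values for e in extractors)
    if len(header) != expected:
        raise ValueError(f"Header has {len(header)} columns, but extractors return {expected} values")
    return expected


extractors = [OneValueExtractor(), OneValueExtractor(), TwoValueExtractor()]
n_columns = validate_header(["URL", "Title", "Published", "Modified"], extractors)
```

An extractor that returns two values but still reports n_return_values = 1 would make such a check pass with a header that is one column too short, which is why the attribute needs to be kept up to date.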

The parameter dynamic_parameters handles a special case of data extraction when scraping/crawling multiple sites. There may be cases where you would like to have a different set of parameters for each URL. In this case, you can pass the relevant parameter as a list and set dynamic_parameters to True. The scraper/crawler will then have each URL/scraping target use a different value from that list based on an index. In this example, a different ID will be put for each crawled domain:

from scrawler.data_extractors import CustomStringPutter

DOMAINS_TO_CRAWL = ["https://www.abc.com", "https://www.def.com", "https://www.ghi.com"]
putter = CustomStringPutter(["id_1001", "id_1002", "id_1003"], use_index=True)

Note that when enabling dynamic_parameters, the parameters going into this data extractor can only have one of two forms:

  • A list (not a tuple!) where each list entry matches exactly one URL (in the same order as in the list of the URLs, see variable DOMAINS_TO_CRAWL in the example above).
  • A constant (of a type other than list) that will be the same for all URLs.

Passing a parameter list shorter or longer than the list of URLs will raise an error.
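
The two allowed forms can be illustrated with a small sketch of index-based lookup. The helper resolve_parameter is hypothetical and only mirrors the rules stated above (a list is resolved per URL, anything else is a shared constant, and a list of the wrong length is an error); the library's own mechanism may differ.

```python
def resolve_parameter(value, index, n_urls):
    """Return the value to use for the URL at position `index`.

    Lists are treated as per-URL values; anything else (including tuples)
    is treated as a constant shared by all URLs.
    """
    if isinstance(value, list):
        if len(value) != n_urls:
            raise ValueError(f"Expected {n_urls} values, got {len(value)}")
        return value[index]
    return value


DOMAINS_TO_CRAWL = ["https://www.abc.com", "https://www.def.com", "https://www.ghi.com"]
ids = ["id_1001", "id_1002", "id_1003"]
n = len(DOMAINS_TO_CRAWL)

per_url = [resolve_parameter(ids, i, n) for i in range(n)]
constant = [resolve_parameter("same_for_all", i, n) for i in range(n)]
```

Here per_url pairs each domain with its own ID, while constant repeats the same string for every domain.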

All built-in data extractors support dynamic parameters and you can easily add support to your custom data extractor by using the supports_dynamic_parameters function decorator to decorate the run() method, like this:

from scrawler import Website
from scrawler.data_extractors import BaseExtractor, supports_dynamic_parameters


class CopyrightExtractor(BaseExtractor):
    def __init__(self, **kwargs):
        """Extract website copyright tag."""
        super().__init__(**kwargs)

    @supports_dynamic_parameters
    def run(self, website: Website, index: int = None):
        copyright_tag = website.find("meta", attrs={"name": "copyright"})

        # Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
        try:
            copyright_text = copyright_tag.attrs["content"]
        except (AttributeError, KeyError):
            copyright_text = "NULL"

        return copyright_text

Export Attributes

The export attributes specify how and where to export the collected data. Data is always exported as CSV, so the available parameters are geared towards that format.

Two parameters must be specified here:

  • directory: The directory (folder) that the file(s) will be saved to.
  • fn: (filename/filenames) Name(s) of the file(s) containing the crawled data, without the file extension. For example, crawled_data instead of crawled_data.csv.

Refer to the object's documentation in the code at scrawler > attributes.py for more information on other possible parameters.

Here's an example of creating an ExportAttributes object:

from scrawler.attributes import ExportAttributes

export_attrs = ExportAttributes(
    directory=r"C:\Users\USER\Documents",
    fn=["crawled_data_abc", "crawled_data_def", "crawled_data_ghi"],
    header=["URL", "Title", "Publication Date"],
    separator="\t"
)
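
To make the effect of these parameters more concrete, here is a rough standard-library sketch of what a tab-separated export with a header row amounts to. It does not use scrawler; the directory name "output" and the data row are made up for illustration, and appending ".csv" to each name in fn is an assumption based on the description of fn above.

```python
import csv
import io
import os

directory = "output"  # hypothetical target folder
fn = ["crawled_data_abc", "crawled_data_def"]
header = ["URL", "Title", "Publication Date"]

# File names are given without extension, so ".csv" is appended here
paths = [os.path.join(directory, name + ".csv") for name in fn]

buffer = io.StringIO()  # in-memory stand-in for one of the files
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)  # header row first
writer.writerow(["https://www.abc.com", "ABC", "2021-01-01"])  # illustrative data row
content = buffer.getvalue()
```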

Crawling Attributes

The crawling attributes specify how to conduct the crawling, e.g. how to filter irrelevant URLs or limits on the number of URLs crawled. As implied by their name, they are only relevant for crawling tasks. Some commonly adjusted parameters include:

  • filter_foreign_urls: This parameter defines how the crawler knows that a given URL is still part of the target domain. For example, one may only want to crawl a subdomain, not the entire domain (only URLs from subdomain.example.com vs. the entire example.com domain). Details on valid input values can be found in the documentation for CrawlingAttributes in the code. By default, this is set to auto, which means that the correct mode will be inferred by looking at the passed base/start URL. For example, if the start URL contains a subdomain, only links from the subdomain will be crawled. For details, refer to the documentation for the extract_same_host_pattern() function. Note that you can also pass your own comparison function here. It has to include two parameters, url1 and url2. The first URL is the one to be checked, and the second is the reference (the crawling start URL). This function should return True for URLs that belong to the same host, and False for foreign URLs.
  • filter_media_files: Controls whether to filter out (ignore) media files. Media files can be quite large and make the crawling process significantly longer, while not adding any new information because media file data can't be parsed and processed. Therefore, the crawler filters media by looking at the URL (e.g. URLs ending in .pdf or .jpg), as well as the response header content-type.
  • blocklist: Some directories might not be interesting for the crawling process (e.g., /media/). The blocklist parameter makes it possible to pass a list of strings that might occur in a URL. If the URL contains any of the given strings, it is filtered out.
  • max_no_urls: Some domains contain many webpages. This parameter can be passed an integer as the maximum total amount of URLs to be crawled.
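
As a rough illustration of how the last three parameters interact, the following standard-library sketch applies a blocklist, a media file filter and a URL limit to a list of discovered links. The function names and the list of extensions are hypothetical, and the sketch only checks the URL itself, not the content-type response header.

```python
from urllib.parse import urlparse

MEDIA_EXTENSIONS = (".pdf", ".jpg", ".png", ".mp4")  # illustrative, not scrawler's list


def is_media_url(url):
    """Check whether the URL path ends in a known media file extension."""
    return urlparse(url).path.lower().endswith(MEDIA_EXTENSIONS)


def is_blocked(url, blocklist):
    """Check whether the URL contains any of the blocklisted strings."""
    return any(entry in url for entry in blocklist)


def select_urls(urls, blocklist=(), filter_media_files=True, max_no_urls=None):
    """Apply the filters in order and stop once max_no_urls URLs have been kept."""
    kept = []
    for url in urls:
        if max_no_urls is not None and len(kept) >= max_no_urls:
            break
        if is_blocked(url, blocklist):
            continue
        if filter_media_files and is_media_url(url):
            continue
        kept.append(url)
    return kept


found = [
    "https://example.com/about",
    "https://example.com/media/logo.jpg",
    "https://example.com/report.pdf",
    "https://git.example.com/repo",
    "https://example.com/contact",
    "https://example.com/news",
]
kept = select_urls(found, blocklist=("git.", "/media/"), max_no_urls=2)
```

In this run only the about and contact pages are kept: two links are dropped by the blocklist, one by the media filter, and the limit of two is reached before the news page is considered.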

Refer to the object's documentation in the code at scrawler > attributes.py for more information on other possible parameters.

Here's an example of creating a CrawlingAttributes object:

from scrawler.attributes import CrawlingAttributes

DOMAIN_TO_CRAWL = "https://www.blog.example.com"

crawling_attrs = CrawlingAttributes(
    filter_foreign_urls="subdomain1",  # only crawling the `blog` subdomain
    filter_media_files=True,
    blocklist=("git.", "datasets.", "nextcloud."),
    max_no_urls=1000
)

Another example with a custom foreign URL filter:

import tld.exceptions

from scrawler.attributes import CrawlingAttributes
from scrawler.utils.web_utils import ParsedUrl

DOMAIN_TO_CRAWL = "https://www.blog.example.com/my_directory/index.html"


def should_be_crawled(url1: str, url2: str) -> bool:  # Custom foreign URL filter
    try:  # don't forget exception handling
        url1 = ParsedUrl(url1)
        url2 = ParsedUrl(url2)
    except (tld.exceptions.TldBadUrl, tld.exceptions.TldDomainNotFound):  # URL couldn't be parsed
        return False

    return ((url1.hostname == url2.hostname)  # hostname "www.blog.example.com"
            and ("my_directory" in url1.path) and ("my_directory" in url2.path))


crawling_attrs = CrawlingAttributes(
    filter_foreign_urls=should_be_crawled,  # crawl URLs from host "www.blog.example.com" inside the directory "my_directory"
    filter_media_files=True,
    blocklist=("git.", "datasets.", "nextcloud."),
    max_no_urls=1000
)
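
If you want to try out a comparison function without scrawler installed, the same idea can be sketched with the standard library's urllib.parse in place of ParsedUrl. The function name same_host_and_directory is hypothetical. Note that this variant is slightly stricter than the example above: it requires the path to start with /my_directory/ instead of merely containing the string.

```python
from urllib.parse import urlparse


def same_host_and_directory(url1: str, url2: str) -> bool:
    """Return True if url1 has the same hostname as url2 and both lie under /my_directory/."""
    try:
        parsed1, parsed2 = urlparse(url1), urlparse(url2)
    except ValueError:  # raised for some malformed URLs
        return False
    if not parsed1.hostname or not parsed2.hostname:  # no host could be extracted
        return False
    return (parsed1.hostname == parsed2.hostname
            and parsed1.path.startswith("/my_directory/")
            and parsed2.path.startswith("/my_directory/"))


START_URL = "https://www.blog.example.com/my_directory/index.html"
inside = same_host_and_directory("https://www.blog.example.com/my_directory/post.html", START_URL)
other_dir = same_host_and_directory("https://www.blog.example.com/other/post.html", START_URL)
other_host = same_host_and_directory("https://www.example.com/my_directory/post.html", START_URL)
```

As in the example above, the first argument is the URL to be checked and the second is the crawling start URL.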

Other Settings

Most parameters are encompassed in the three attribute objects above. However, there are some additional settings available for special cases.

If you look at the templates' "setup" section again, it includes a USER_AGENT parameter that sets the user agent to be used during scraping/crawling.

Finally, the file scrawler > defaults.py contains standard settings that are used throughout the project.

FAQ

Why are there two backends?

The module backends contains two files with the same functions for scraping and crawling, but built on different technologies for parallelization. In general, the asyncio version is preferable because more sites can be processed in parallel. However, on very large sites, scrawler may get stuck, and the entire crawl will hang. You may also occasionally get many ServerDisconnectedErrors when using the asyncio backend. If you expect or experience these cases, it is preferable to use the backend built on multithreading, which is slower but more robust.
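
The general difference between the two approaches can be sketched with the standard library alone. This is not the code of either backend: the fetch functions are hypothetical placeholders that do not touch the network, and the sketch only contrasts a thread pool with tasks on a single asyncio event loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

URLS = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]


def fetch_blocking(url):
    """Hypothetical blocking fetch; a real one would perform an HTTP request."""
    return f"content of {url}"


async def fetch_async(url):
    """Hypothetical non-blocking fetch; a real one would await an HTTP client."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return f"content of {url}"


def run_multithreaded(urls):
    """Process URLs in a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(fetch_blocking, urls))


async def run_asyncio(urls):
    """Process URLs as concurrent tasks on a single event loop."""
    return await asyncio.gather(*(fetch_async(u) for u in urls))


threaded_results = run_multithreaded(URLS)
async_results = asyncio.run(run_asyncio(URLS))
```

Both variants return the results in the order of the input URLs. In the asyncio variant all tasks share one thread, so a single task that never yields control can stall the others, whereas in the threaded variant each worker blocks independently.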
