Tool for General Purpose Web Scraping and Crawling
Project description
scrawler
"scrawler" = "scraper" + "crawler"
Provides functionality for the automatic collection of website data (web scraping)
and following links to map an entire domain (crawling).
It can handle these tasks individually, or process several websites/domains in parallel using asyncio
and multithreading
.
This project was initially developed while working at the Fraunhofer Institute for Systems and Innovation Research. Many thanks for the opportunity and support!
Table of Contents
Getting Started
You can install scrawler from PyPI:
pip install scrawler
Alternatively, you can find the .whl
and .tar.gz
files on GitHub for each respective release.
To start, have a look at the templates
folder. It contains four files, each one doing a different task.
All template include three sections: Imports, Setup and Execution.
- Imports retrieves all code dependencies from various files
- Setup is where all parameters are specified
- In Execution, an instance of the respective Python object is created and its
run()
method executed
You can copy-and-paste the template and make any adjustments you would like.
Specifying parameters
Let's have a closer look at the "setup" section.
First, the URLs that are to be processed are specified (for more details, have a look at the section Attributes).
Then, the attributes that define how to accomplish the tasks are specified:
- The search attributes specify which data to collect/search for in the website (and how to do it)
- The export attributes specify how and where to export the collected data to
- In the case of a crawling task, the crawling attributes specify how to conduct the crawling, e.g. how to filter irrelevant URLs or limits on the number of URLs crawled.
In the section "execution", these parameters are then passed to the relevant object (see section Basic Objects).
Basic Objects
The basic functionality of scrawler is contained in two classes, scraping
.Scraper
and crawling
.Crawler
.
Functionality
The objects are passed all relevant parameters during object initialization
and then executed by calling the object's run()
or run_and_export()
methods.
Afterwards, data may be exported by calling the export_data()
method.
To sum it up:
run()
: Execute the task and return the results.run_and_export()
: This may be used when scraping/crawling many sites at once, generating huge amounts of data. In order to prevent aMemoryError
, data will be exported as soon as it is ready and then discarded to make room for the next sites/domains.export_data()
: Export the collected data to CSV file(s).
Example Crawling
Let's have a look at an example. For more information on how to create search, export and crawling attributes, you can refer to the section Attributes:
from scrawler import Crawler
search_attrs, export_attrs, crawling_attrs = ..., ..., ...
crawler = Crawler("https://example.com",
search_attributes=search_attrs,
export_attributes=export_attrs,
crawling_attributes=crawling_attrs)
results = crawler.run()
crawler.export_data()
Example Scraping
Here, multiple sites are scraped at once.
from scrawler import Scraper
search_attrs, export_attrs = ..., ...
scraper = Scraper(["https://www.example1.com", "https://www.example2.com", "https://www.example3.com"],
search_attributes=search_attrs,
export_attributes=export_attrs)
results = scraper.run()
scraper.export_data()
Attributes
Now that the object necessary for our task has been created, we would like to specify exactly how to go about the task.
Search Attributes
The search attributes specify which data to collect/search for in the website (and how to do it).
This is done by passing data extractor objects to the SearchAttributes
during initialization.
There are many data extractors already build into the project. For example, this example uses the built-in DateExtractor
to extract a website's publication date from an HTML meta tag.
from scrawler.data_extractors import DateExtractor
pubdate_extractor = DateExtractor(tag_types="meta", tag_attrs={"name": "pubdate"})
Have a look at scrawler
> data_extractors.py
to see all available built-in data extractors.
Here's an exemplary SearchAttributes
object creation:
from scrawler.attributes import SearchAttributes
from scrawler.data_extractors import *
search_attrs = SearchAttributes(
UrlExtractor(), # returns URL
TitleExtractor(), # returns website <title> tag content
DateExtractor(tag_types="meta", tag_attrs={"name": "pubdate"}) # returns publication date
)
Note how parameters for the data extractors are passed directly during initialization.
Custom data extractors
If you do not find what you need, you can also built a data extractor yourself.
Data extractors are passed a Website
object, which provides access to three types of data:
- The website's HTML parsed as a BeautifulSoup object (see their documentation for how to extract data from it).
Because
Website
extendsBeautifulSoup
, you can directly execute BeautifulSoup methods on the website object. - The HTTP response object (
http_response
attribute). This is an aiohttpClientResponse
object. - The website's raw URL (
url
attribute) and parsed URL (parsed_url
attribute) parts.
Data extractors must inherit from BaseExtractor
and implement two methods:
__init__()
: Where parameters to the extractor can be passed and are stored in object attributes.run()
: To do the extraction. Make sure that the method signature is the same as forBaseExtractor
, i.e. two parameters can be passed,website
andindex
as an optional parameter.
In this example, we build a data extractor to retrieve a website's copyright tag (if available):
from scrawler import Website
from scrawler.data_extractors import BaseExtractor
class CopyrightExtractor(BaseExtractor):
def __init__(self, **kwargs):
"""Extract website copyright tag."""
super().__init__(**kwargs)
def run(self, website: Website, index: int = None):
copyright_tag = website.find("meta", attrs={"name": "copyright"})
# Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
try:
copyright_text = copyright_tag.attrs["content"]
except (AttributeError, KeyError):
copyright_text = "NULL"
return copyright_text
In this case, we could actually have had an easier solution. The built-in extractor GeneralHtmlTagExtractor
already contains all the necessary functionality:
from scrawler.data_extractors import GeneralHtmlTagExtractor
copyright_extractor = GeneralHtmlTagExtractor(tag_types="meta", tag_attrs={"name": "copyright"},
attr_to_extract="content")
Special parameters
Some notes on basic parameters specified in BaseExtractor
that apply to all data extractors.
The parameter n_return_values
specifies the number of values that will be returned by the extractor.
This is almost always 1, but there are cases such as DateExtractor
which may return more values.
If you build your own data extractor that may return more than one value, make sure to update self.n_return_values
.
This attribute is used to validate that the length of the header of the CSV file equals the number of columns generated by the search attributes.
Have a look at the implementation of DateExtractor
to see how this might be handled.
The parameter dynamic_parameters
handles a special case of data extraction when scraping/crawling multiple sites.
There may be cases where you would like to have a different set of parameters for each URL.
In this case, you can pass the relevant parameter as a list and set dynamic_parameters
to True.
The scraper/crawler will then have each URL/scraping target use a different value from that list based on an index.
In this example, a different ID will be put for each crawled domain:
from scrawler.data_extractors import CustomStringPutter
DOMAINS_TO_CRAWL = ["https://www.abc.com", "https://www.def.com", "https://www.ghi.com"]
putter = CustomStringPutter(["id_1001", "id_1002", "id_1003"], use_index=True)
Note that when enabling dynamic_parameters
, to parameters going into this data extractor can only have one of two forms:
- A list (not a tuple!) where each list entry matches exactly one URL (in the same order as in the list of the URLs, see variable
DOMAINS_TO_CRAWL
in the example above). - A constant (of a type other than list) than will be the same for all URLs.
Passing a parameter list shorter or longer than the list of URLs will raise an error.
All built-in data extractors support dynamic parameters and you can easily add support to your custom data extractor
by using the supports_dynamic_parameters
function decorator to decorate the run()
method, like this:
from scrawler import Website
from scrawler.data_extractors import BaseExtractor, supports_dynamic_parameters
class CopyrightExtractor(BaseExtractor):
def __init__(self, **kwargs):
"""Extract website copyright tag."""
super().__init__(**kwargs)
@supports_dynamic_parameters
def run(self, website: Website, index: int = None):
copyright_tag = website.find("meta", attrs={"name": "copyright"})
# Important: Do not forget to handle exceptions, because many sites will not have this copyright tag
try:
copyright_text = copyright_tag.attrs["content"]
except (AttributeError, KeyError):
copyright_text = "NULL"
return copyright_text
Export Attributes
The export attributes specify how and where to export the collected data to. Data is always exported to the CSV format, therefore the various parameters are geared towards the CSV format.
Two parameters must be specified here:
directory
: The directory (folder) that the file(s) will be saved to.fn
: (filename/filenames) Name(s) of the file(s) containing the crawled data, without the file extension. For example,crawled_data
instead ofcrawled_data.csv
.
Refer to the object's documentation in the code at scrawler
> attributes.py
for more information on other possible parameters.
Here's an exemplary ExportAttributes
object creation:
from scrawler.attributes import ExportAttributes
export_attrs = ExportAttributes(
directory=r"C:\Users\USER\Documents",
fn=["crawled_data_abc", "crawled_data_def", "crawled_data_ghi"],
header=["URL", "Title", "Publication Date"],
separator="\t"
)
Crawling Attributes
The crawling attributes specify how to conduct the crawling, e.g. how to filter irrelevant URLs or limits on the number of URLs crawled. As implied by their name, they are only relevant for crawling tasks. Some commonly adjusted parameters include:
filter_foreign_urls
: This parameter defines how the crawler knows that a given URL is still part of the target domain. For example, one may only want to crawl a subdomain, not the entire domain (only URLs fromsubdomain.example.com
vs. the entireexample.com
domain). Details on valid input values can be found in the documentation forCrawlingAttributes
in the code. By default, this is set toauto
, which means that the correct mode will be inferred by looking at the passed base/start URL. For example, if the start URL contains a subdomain, only links from the subdomain will be crawled. For details, refer to the documentation for theextract_same_host_pattern()
function. Note that you can also pass your own comparison function here. It has to include two parameters,url1
andurl2
. The first URL is the one to be checked, and the second is the reference (the crawling start URL). This function should returnTrue
for URLs that belong to the same host, andFalse
for foreign URLs.filter_media_files
: Controls whether to filter out (ignore) media files. Media files can be quite large and make the crawling process significantly longer, while not adding any new information because media file data can't be parsed and processed. Therefore, the crawler filters media by looking at the URL (e.g. URLs ending in.pdf
or.jpg
), as well as the response header content-type.blocklist
: Some directories might not be interesting for the crawling process (e.g.,/media/
). Theblocklist
parameter makes it possible to pass a list of strings that might occur in a URL. If the URL contains any of the given strings, it is filtered out.max_no_urls
: Some domains contain many webpages. This parameter can be passed an integer as the maximum total amount of URLs to be crawled.
Refer to the object's documentation in the code at scrawler
> attributes.py
for more information on other possible parameters.
Here's an exemplary CrawlingAttributes
object creation:
from scrawler.attributes import CrawlingAttributes
DOMAIN_TO_CRAWL = "https://www.blog.example.com"
crawling_attrs = CrawlingAttributes(
filter_foreign_urls="subdomain1", # only crawling the `blog` subdomain
filter_media_files=True,
blocklist=("git.", "datasets.", "nextcloud."),
max_no_urls=1000
)
Another example with a custom foreign URL filter:
import tld.exceptions
from scrawler.attributes import CrawlingAttributes
from scrawler.utils.web_utils import ParsedUrl
DOMAIN_TO_CRAWL = "https://www.blog.example.com/my_directory/index.html"
def should_be_crawled(url1: str, url2: str) -> bool: # Custom foreign URL filter
try: # don't forget exception handling
url1 = ParsedUrl(url1)
url2 = ParsedUrl(url2)
except (tld.exceptions.TldBadUrl, tld.exceptions.TldDomainNotFound): # URL couldn't be parsed
return False
return ((url1.hostname == url2.hostname) # hostname "www.blog.example.com"
and ("my_directory" in url1.path) and ("my_directory" in url2.path))
crawling_attrs = CrawlingAttributes(
filter_foreign_urls=should_be_crawled, # crawl URLs from host "www.blog.example.com" inside the directory "my_directory"
filter_media_files=True,
blocklist=("git.", "datasets.", "nextcloud."),
max_no_urls=1000
)
Other Settings
Most parameters are encompassed in the three attribute objects above. However, there are some additional settings available for special cases.
If you look at the templates' "setup" section again, it includes a USER_AGENT
parameter that sets the user agent to be used during scraping/crawling.
Finally, the file scrawler
> defaults.py
contains standard settings that are used throughout the project.
FAQ
Why are there two backends?
The module backends
contains two files with the same functions for scraping and crawling, but built on different technologies for parallelization.
In general, the asyncio
version is preferable because more sites can be processed in parallel.
However, on very large sites, scrawler may get stuck, and the entire crawling will hang.
Also, there you may occasionally get many ServerDisconnectedError
s when using the asyncio
backend.
If you expect or experience these cases, it is preferable to use the backend built on multithreading
, which is slower, but more robust.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.