
A web scraper that can scrape articles from recursive website navigation

Project description

Web Scraping: only through configuration


How does this work?

The following is a flowchart of all decisions (white) and actions (green) performed by the program, and the configuration options (blue) used in the process. It should help you understand the workflow and configuration of this tool.

Diagram

The flowchart was created with Excalidraw, and can be found here.

Why is this an Excalidraw file export and not a Mermaid diagram? Because with Mermaid it turns into a mess.

Found a bug?

Please report it so I can fix it. You can contact me via email, or create an issue on GitHub if you are familiar with that.

Usage

General

This tool does not require you to write any code (well, a little, depending on how you look at it). It only requires configuration.

How to configure?

There are several ways to configure this: by calling the class methods one by one, by loading a JSON file that contains the whole configuration, or by passing command line arguments.

When you are using the prebuilt .exe file, you have to use a JSON file or command line arguments. Writing the settings JSON file by hand can be a bit tricky, so I recommend using the command line arguments.

Using a JSON file

When the program is started, it will print the configuration options currently applied as a JSON string. This can be used to save the configuration for later use.

This text can be stored in a .json file.

Later you can load this file with the --config_file argument.

Start the program with the following command:

scraper_no_ai.exe --config_file "path/to/your/config.json"

A configuration file might look like this:

{
	"base_url": "https://www.surgicalholdings.co.uk/browse-products.html",
	"base_url_paging_prefix": "?p=",
	"base_url_paging_suffix": "",
	"pages_required": true,
	"page_start": 1,
	"page_step": 1,
	"page_end": 1000,
	"recursive_navigation": [
		{
			"css_selector": "ul>li>a",
			"base_url_in_case_links_are_relative": "https://www.surgicalholdings.co.uk"
		}
	],
	"product_element_css_selector": "div.m-product-item",
	"product_site_needed": true,
	"product_site_link_css_selector": "a.link",
	"product_site_link_prefix": "https://www.surgicalholdings.co.uk",
	"data_extraction": [
		{
			"column_name": "Name",
			"css_selector": ".product-name"
		},
		{
			"column_name": "ID",
			"css_selector": ".product-id"
		}
	],
	"verbose_output": false,
	"custom_http_headers": {
		"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
	},
	"url_blacklist_regex": ".*((instagram)|(facebook)|(twitter)|(linkedin)|(youtube)|(login)|(spotify)|(blog)|(account)|(browse-products)|(pdf-catalogue)|(about)|(contact)|(basket)).*",
	"output_file_path": "output.csv",
	"request_delay": 200
}

Using command line arguments

You can always use the -h or --help argument to get a list of all available arguments.

Available arguments are:

  • --base_url: The base URL of the website you want to scrape.
  • --base_url_paging_prefix: The prefix of the paging URL.
  • --base_url_paging_suffix: The suffix of the paging URL.
  • --pages_required: If the website has multiple pages, set this to True.
  • --page_start: The starting page number.
  • --page_step: The step between pages.
  • --page_end: The ending page number.
  • --recursive_navigation: A list of CSS selectors that will be used to navigate the website.
  • --product_element_css_selector: The CSS selector of the product element.
  • --product_site_needed: If the product site is needed, set this to True.
  • --product_site_link_css_selector: The CSS selector of the product site link.
  • --product_site_link_prefix: The prefix of the product site link.
  • --data_extraction: A list of dictionaries with the keys column_name and css_selector.
  • --verbose_output: If you want to see more output, set this to True.
  • --custom_http_headers: Custom HTTP headers that will be sent with the request.
  • --url_blacklist_regex: A regex pattern that will be used to filter out URLs.
  • --output_file_path: The path to the output file.
  • --config_file: The path to the configuration file.
  • --warning_tag_if_present: This will create a column in the output file with the name Warning and will contain True if the tag is present and False if not.
  • --request_delay: The delay between requests in seconds.

Example:

scraper_no_ai.exe --base_url "https://www.surgicalholdings.co.uk/browse-products.html" --pages_required True --base_url_paging_prefix "?p=" --page_start 1 --page_step 1 --page_end 1000 --recursive_navigation "[{\"css_selector\": \"ul>li>a\", \"base_url_in_case_links_are_relative\": \"https://www.surgicalholdings.co.uk\"}]" --product_element_css_selector "div.m-product-item" --product_site_needed True --product_site_link_css_selector "a.link" --product_site_link_prefix "https://www.surgicalholdings.co.uk" --data_extraction "[{\"column_name\": \"Name\", \"css_selector\": \".product-name\"}, {\"column_name\": \"ID\", \"css_selector\": \".product-id\"}]" --verbose_output False --custom_http_headers "{\"User-Agent\": \"Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148\"}" --url_blacklist_regex ".*((instagram)|(facebook)|(twitter)|(linkedin)|(youtube)|(login)|(spotify)|(blog)|(account)|(browse-products)|(pdf-catalogue)|(about)|(contact)|(basket)).*" --output_file_path "output.csv"

Step by step:

Using class methods (for Developers working directly with the code)

# Import the class
from src.web_scraper_dmeurer import Settings

# Create an object of the Settings class
my_configuration = Settings()

# Set the base url. Your entry point.
my_configuration.set_base_url("https://www.surgicalholdings.co.uk/browse-products.html")

Now you can use this format for everything:

my_configuration.set_pages_required(True)
my_configuration.set_base_url_paging_prefix("?p=")
my_configuration.set_page_range(1, 100, 1)

Or you can chain them:

my_configuration.set_product_site_needed(True).set_product_site_link_css_selector("a.link").set_product_site_link_prefix("https://www.surgicalholdings.co.uk")

Using a JSON object (for Developers working directly with the code)

This can be useful if you want to share your configuration with someone else, or if you want to save it for later use.

The JSON object can also be generated by configuring via the class methods and then calling the get_settings() method or str(my_configuration).

# Import the class
from src.web_scraper_dmeurer import Settings

# Create an object of the Settings class
my_configuration = Settings()

# Load the JSON
settings_dict = {
    "base_url": "https://www.example.com",
    "base_url_paging_prefix": "/page/",
    "base_url_paging_suffix": "",
    "pages_required": True,
    "page_start": 1,
    "page_step": 1,
    "page_end": 10,
    "recursive_navigation": [{"css_selector": ".category"}],
    "product_element_css_selector": ".product",
    "product_site_needed": True,
    "product_site_link_css_selector": ".product-link",
    "product_site_link_prefix": "",
    "data_extraction": [{"column_name": "Name", "css_selector": ".product-name"}, {"column_name": "ID", "css_selector": ".product-id"}],
    "verbose_output": False,
    "custom_http_headers": {
        'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
    },
    "url_blacklist_regex": ".*((instagram)|(facebook)|(twitter)|(linkedin)|(youtube)|(login)|(spotify)|(contact)).*",
    "output_file_path": "output.csv",
    "request_delay": 200
}
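
Once a configuration has been built up, the same JSON can be written to a file and reused later with --config_file. A minimal sketch, assuming (per the description above) that str(my_configuration) returns the configuration as a JSON string:

# Sketch: save the current configuration for later use with --config_file.
# Assumption (from the text above): str(my_configuration) returns the
# configuration as a JSON string; get_settings() works the same way.
with open("config.json", "w", encoding="utf-8") as config_file:
    config_file.write(str(my_configuration))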

How can I get the right CSS selectors?

There are plenty of guides on how to do this using the browser's developer tools. Or you can contact me, Dominik Meurer, and I will help you.
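
If you want to verify a selector before putting it into the configuration, a quick check with BeautifulSoup can help. This is only an illustration; BeautifulSoup and requests are assumptions here, not necessarily the libraries used internally by this tool:

# Illustration: fetch a page and count how many elements a selector matches.
import requests                 # assumption: pip install requests
from bs4 import BeautifulSoup   # assumption: pip install beautifulsoup4

html = requests.get("https://www.surgicalholdings.co.uk/browse-products.html").text
soup = BeautifulSoup(html, "html.parser")

matches = soup.select("div.m-product-item")
print(f"div.m-product-item matches {len(matches)} elements")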

Known Issue

This is mostly critical for category navigation, but it applies to all CSS selectors.

When you have a website structure like this:

<ul>
	<li>
		<a href="/some-category/">cat 1</a>
	</li>
	<li>
		<a href="/some-other-category/">cat 2</a>
	</li>
</ul>

You can't use the selector ul>li>a, because it will only select the first match inside the ul element.

--> So the only match will be cat 1.

To fix this, edit your selector so that its first element does not contain multiple matches.

In this case, it would be li>a.

Configuration

Options:

  • base_url: The base URL of the website you want to scrape.
  • base_url_paging_prefix: The prefix of the paging URL. For example, if the URL is https://www.example.com/page/1, the prefix is /page/.
  • base_url_paging_suffix: The suffix of the paging URL. For example, if the URL is https://www.example.com/page/1.html, the suffix is .html.
  • pages_required: If the website has multiple pages, set this to True.
  • page_start: The starting page number.
  • page_step: The step between pages.
  • page_end: The ending page number. (In most cases, you can set this to an arbitrary large number if you don't know the number of pages, because it will stop when it doesn't get any results from a page.)
  • recursive_navigation: A list of CSS selectors that will be used to navigate the website. For example, if you want to navigate through categories, you can set this to [{ "css_selector": ".category" }]. Each element in the list should be a dictionary with the key css_selector and optionally base_url_in_case_links_are_relative if relative links are used.
  • product_element_css_selector: The CSS selector of the product element.
  • product_site_needed: If the product site is needed, set this to True.
  • product_site_link_css_selector: The CSS selector of the product site link.
  • product_site_link_prefix: The prefix of the product site link. This is needed if the link is relative.
  • data_extraction: A list of dictionaries with the keys column_name and css_selector. The column_name is the name of the column in the output CSV file, and the css_selector is the CSS selector of the element that contains the data.
  • verbose_output: If you want to see more output, set this to True.
  • custom_http_headers: Custom HTTP headers that will be sent with the request.
  • url_blacklist_regex: A regex pattern that will be used to filter out URLs. If the URL matches the pattern, it will be ignored.
  • output_file_path: The path to the output file.
  • config_file: The path to the configuration file.
  • warning_tag_if_present: This will create a column in the output file with the name Warning and will contain True if the tag is present and False if not.
  • request_delay: The delay between requests in seconds.

I will explain the configuration options using the website https://www.surgicalholdings.co.uk as an example.

For each option you will find the JSON configuration, the command line argument, and the Python code.

base_url

The base URL is the starting point of the scraper. This is the URL that will be used to start the scraping process.

To make it easier to navigate, we try to minimize the navigation steps. So the starting point will be https://www.surgicalholdings.co.uk/browse-products.html

As JSON configuration:

{
	"base_url": "https://www.surgicalholdings.co.uk/browse-products.html"
}

As a command line argument:

scraper_no_ai.exe --base_url "https://www.surgicalholdings.co.uk/browse-products.html"

As Python code:

my_configuration.set_base_url("https://www.surgicalholdings.co.uk/browse-products.html")

recursive_navigation

This is used to navigate through the website. In this example, the base_url is an overview of categories, so we need to navigate through the categories to get to the products.

In this case we can use the selector li>a to get all the link elements in the list. This will also find all other links (to social media, login, etc.) so we need to filter them out later using the url_blacklist_regex.

What is a CSS selector? Normally, you would use a CSS selector to style elements on a website. But in this case, we use it to find elements on the website. A guide on how to use CSS selectors can be found here (or in the Official Documentation of the used library). Or you can always just google.

Since the links are relative, we need to set the base_url_in_case_links_are_relative to the base URL.

As JSON configuration:

{
	"recursive_navigation": [
		{
			"css_selector": "ul>li>a",
			"base_url_in_case_links_are_relative": "https://www.surgicalholdings.co.uk"
		}
	]
}

As a command line argument:

scraper_no_ai.exe --recursive_navigation "[{\"css_selector\": \"ul>li>a\", \"base_url_in_case_links_are_relative\": \"https://www.surgicalholdings.co.uk\"}]"

As Python code:

my_configuration.set_recursive_navigation(
    [
        {"css_selector": "ul>li>a", "base_url_in_case_links_are_relative": "https://www.surgicalholdings.co.uk"}
    ]
)

If we wanted to add another navigation step, we could add another element to the list.

As JSON configuration:

{
	"recursive_navigation": [
		{
			"css_selector": "ul>li>a",
			"base_url_in_case_links_are_relative": "https://www.surgicalholdings.co.uk"
		},
		{
			"css_selector": "ul>li>a",
			"base_url_in_case_links_are_relative": "https://www.surgicalholdings.co.uk"
		}
	]
}

As a command line argument:

scraper_no_ai.exe --recursive_navigation "[{\"css_selector\": \"ul>li>a\", \"base_url_in_case_links_are_relative\": \"https://www.surgicalholdings.co.uk\"}, {\"css_selector\": \"ul>li>a\", \"base_url_in_case_links_are_relative\": \"https://www.surgicalholdings.co.uk\"}]"

As Python code:

my_configuration.set_recursive_navigation(
    [
        {"css_selector": "ul>li>a", "base_url_in_case_links_are_relative": "https://www.surgicalholdings.co.uk"},
        {"css_selector": "ul>li>a", "base_url_in_case_links_are_relative": "https://www.surgicalholdings.co.uk"}
    ]
)

url_blacklist_regex

This is used to filter out URLs that we don't want to scrape. This is fully optional.

To find the right regex pattern, I highly recommend using https://regexr.com/7u5v4 or something similar. In the case of the website https://www.surgicalholdings.co.uk, we can use the following pattern: .*((instagram)|(facebook)|(twitter)|(linkedin)|(youtube)|(login)|(spotify)|(blog)|(account)|(browse-products)|(pdf-catalogue)|(about)|(contact)|(basket)).*

If any of the words in the brackets appears in the URL, the URL will be skipped.
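
As a quick illustration of how the filter behaves (using a shortened version of the pattern above and Python's re module):

import re

# Shortened version of the blacklist pattern above, for illustration only.
blacklist = re.compile(r".*((instagram)|(facebook)|(login)|(contact)).*")

print(bool(blacklist.match("https://www.instagram.com/somebrand")))               # True  -> URL is skipped
print(bool(blacklist.match("https://www.surgicalholdings.co.uk/scissors.html")))  # False -> URL is scraped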

As JSON configuration:

{
	"url_blacklist_regex": ".*((instagram)|(facebook)|(twitter)|(linkedin)|(youtube)|(login)|(spotify)|(blog)|(account)|(browse-products)|(pdf-catalogue)|(about)|(contact)|(basket)).*"
}

As a command line argument:

scraper_no_ai.exe --url_blacklist_regex ".*((instagram)|(facebook)|(twitter)|(linkedin)|(youtube)|(login)|(spotify)|(blog)|(account)|(browse-products)|(pdf-catalogue)|(about)|(contact)|(basket)).*"

As Python code:

my_configuration.set_url_blacklist_regex(".*((instagram)|(facebook)|(twitter)|(linkedin)|(youtube)|(login)|(spotify)|(blog)|(account)|(browse-products)|(pdf-catalogue)|(about)|(contact)|(basket)).*")

pages

If the catalog page you want to scrape has multiple pages, you can set the following options.

You need to enable the pages_required option and set page_start, page_step and page_end. This activates paging and iterates from start to end with the given step size.

The page_end can be set to an arbitrary large number if you don't know the number of pages, because it will stop when it doesn't get any results from a page.

Additionally, in most cases, you need to set the base_url_paging_prefix and base_url_paging_suffix.

The prefix is the part between the URL and the page number. In the case of the website https://www.surgicalholdings.co.uk, the prefix is ?p=, but it can also be /page/ or something else.

The suffix is the part after the page number. In the case of the website https://www.surgicalholdings.co.uk, the suffix is empty because it is not needed, but it can also be .html or something else.
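
As a rough sketch of how the page URLs are put together (the exact concatenation of base_url, prefix, page number and suffix is an assumption based on the description above):

# Assumption: page URLs are built as base_url + prefix + page_number + suffix.
base_url = "https://www.surgicalholdings.co.uk/browse-products.html"
prefix, suffix = "?p=", ""

for page in range(1, 4):  # page_start=1, page_step=1, first three pages shown
    print(f"{base_url}{prefix}{page}{suffix}")
# -> https://www.surgicalholdings.co.uk/browse-products.html?p=1
# -> https://www.surgicalholdings.co.uk/browse-products.html?p=2
# -> https://www.surgicalholdings.co.uk/browse-products.html?p=3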

As JSON configuration:

{
	"pages_required": true,
	"page_start": 1,
	"page_step": 1,
	"page_end": 1000,
	"base_url_paging_prefix": "?p=",
	"base_url_paging_suffix": ""
}

As a command line argument:

scraper_no_ai.exe --pages_required True --base_url_paging_prefix "?p=" --page_start 1 --page_step 1 --page_end 1000

As Python code:

my_configuration.set_pages_required(True)
my_configuration.set_base_url_paging_prefix("?p=")
my_configuration.set_page_range(1, 1000, 1)

product_element_css_selector

Now we are on the product gallery page. To continue, we need to find the CSS selector of the product element.

In the case of the website https://www.surgicalholdings.co.uk, the selector is div.m-product-item.

As JSON configuration:

{
	"product_element_css_selector": "div.m-product-item"
}

As a command line argument:

scraper_no_ai.exe --product_element_css_selector "div.m-product-item"

As Python code:

my_configuration.set_product_element_css_selector("div.m-product-item")

product_site_needed

Before we can extract the data, we need to decide whether the gallery has all the information we need, or whether we need to go to the product page.

In this example it is necessary, because the gallery only contains the name; the ID can only be found on the product page.

To make this work, we need a CSS selector for the link to the product page, and a base URL in case the link is relative.

As JSON configuration:

{
	"product_site_needed": true,
	"product_site_link_css_selector": "a.link",
	"product_site_link_prefix": "https://www.surgicalholdings.co.uk"
}

As a command line argument:

scraper_no_ai.exe --product_site_needed True --product_site_link_css_selector "a.link" --product_site_link_prefix "https://www.surgicalholdings.co.uk"

As Python code:

my_configuration.set_product_site_needed(True)
my_configuration.set_product_site_link_css_selector("a.link")
my_configuration.set_product_site_link_prefix("https://www.surgicalholdings.co.uk")

data_extraction

Now we can extract the data from the page, whether it's the gallery or the product page, as long as all the data is on the same page.

In this example, we want to extract the name and the ID of the product.

The configuration is in the JSON format. The column_name is the name of the column in the output CSV file, and the css_selector is the CSS selector of the element that contains the data.

As JSON configuration:

{
	"data_extraction": [
		{
			"column_name": "Name",
			"css_selector": ".product-name"
		},
		{
			"column_name": "ID",
			"css_selector": ".product-id"
		}
	]
}

As a command line argument:

scraper_no_ai.exe --data_extraction "[{\"column_name\": \"Name\", \"css_selector\": \".product-name\"}, {\"column_name\": \"ID\", \"css_selector\": \".product-id\"}]"

As Python code:

my_configuration.set_data_extraction(
    [
        {"column_name": "Name", "css_selector": ".product-name"},
        {"column_name": "ID", "css_selector": ".product-id"}
    ]
)

With this pattern you can add as many columns as you want.

verbose_output

If you want to see more output, you can set this to True.

As JSON configuration:

{
	"verbose_output": true
}

As a command line argument:

scraper_no_ai.exe --verbose_output True

As Python code:

my_configuration.set_verbose_output(True)

custom_http_headers

If you need to send custom HTTP headers with the request, you can set this option. By default, the scraper sends the following headers:

{
	"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}

You can override them with your own headers.

As JSON configuration:

{
	"custom_http_headers": {
		"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"
	}
}

As a command line argument:

scraper_no_ai.exe --custom_http_headers "{\"User-Agent\": \"Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148\"}"

As Python code:

my_configuration.set_custom_http_headers({
    'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'
})

output_file_path

The path to the output file.

As JSON configuration:

{
	"output_file_path": "output.csv"
}

As a command line argument:

scraper_no_ai.exe --output_file_path "output.csv"

As Python code:

my_configuration.set_output_file_path("output.csv")

warning_tag_if_present

This will create a column named Warning in the output file, which contains True if an element matching the given selector is present and False if not.

As JSON configuration:

{
	"warning_tag_if_present": ".element-i-dont-want-to-see"
}

As a command line argument:

scraper_no_ai.exe --warning_tag_if_present ".element-i-dont-want-to-see"

As Python code:

my_configuration.set_warning_tag_if_present(".element-i-dont-want-to-see")

request_delay

The delay between requests in seconds.

This is useful if you don't want to overload the server with requests. Sending too many requests in a short period of time can lead to your IP address being blocked.

As JSON configuration:

{
	"request_delay": 1
}

As a command line argument:

scraper_no_ai.exe --request_delay 1

As Python code:

my_configuration.set_request_delay(1)

Start the scraper

Now you can start the scraper.

Using the prebuilt .exe file

Using the JSON file, run:

scraper_no_ai.exe --config_file "path/to/your/config.json"

Using the command line arguments, run:

scraper_no_ai.exe --base_url "https://www.surgicalholdings.co.uk/browse-products.html" --pages_required True --base_url_paging_prefix "?p=" --page_start 1 --page_step 1 --page_end 1000 --recursive_navigation "[{\"css_selector\": \"ul>li>a\", \"base_url_in_case_links_are_relative\": \"https://www.surgicalholdings.co.uk\"}]" --product_element_css_selector "div.m-product-item" --product_site_needed True --product_site_link_css_selector "a.link" --product_site_link_prefix "https://www.surgicalholdings.co.uk" --data_extraction "[{\"column_name\": \"Name\", \"css_selector\": \".product-name\"}, {\"column_name\": \"ID\", \"css_selector\": \".product-id\"}]" --verbose_output False --custom_http_headers "{\"User-Agent\": \"Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148\"}" --url_blacklist_regex ".*((instagram)|(facebook)|(twitter)|(linkedin)|(youtube)|(login)|(spotify)|(blog)|(account)|(browse-products)|(pdf-catalogue)|(about)|(contact)|(basket)).*" --output_file_path "output.csv"

Using the class methods

Insert your configuration calls into main.py:

import sys
from src.web_scraper_dmeurer import Scraper

if __name__ == "__main__":
    scraper = Scraper()

    # Your configuration here

    # Your configuration above

    # Any extra command line arguments are forwarded, but none are required
    # when the configuration is done via the class methods above.
    scraper.start_scraper(run_args=sys.argv[1:])

    sys.exit(0)

And run the script without any arguments:

python main.py

Hint for developing this

Build and upload to PyPI using:

py -m build
py -m twine upload --repository pypi dist/*

