
Webscraperr

A package for web scraping

This Python library is designed to streamline the common web-scraping workflow, particularly for e-commerce websites. It provides a structured framework in which users define their own logic for gathering product URLs, parsing individual product pages, and selecting the next page. Scraped URLs and product information are saved directly to a database; SQLite and MySQL are supported.

Installation

Install webscraperr with pip:

    pip install webscraperr

Usage

The scraper's configuration is stored in a config dictionary. The config must be prepared, modified, and validated before being passed to the scraper.

from webscraperr.config import get_default_config, validate_config, DBTypes

config = get_default_config()
config['DATABASE']['TYPE'] = DBTypes.SQLITE
config['DATABASE']['DATABASE'] = 'mydatabase.db'
config['DATABASE']['TABLE'] = 'products' # If TABLE is not set, "items" is the default table name
config['SCRAPER']['REQUEST_DELAY'] = 1.6

validate_config(config) # Will raise an error if config is not properly set
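For a sense of what validation involves, the kind of check performed here can be sketched as follows. This is an illustrative approximation only, not the library's actual implementation; the required keys mirror the example config above and the real `validate_config` may check more or different fields:

```python
def validate_config_sketch(config):
    """Illustrative only: verify the keys used in the example are present.

    Not webscraperr's actual validate_config; this sketch just mirrors
    the structure of the example config above.
    """
    required = {
        'DATABASE': ['TYPE', 'DATABASE'],
        'SCRAPER': ['REQUEST_DELAY'],
    }
    for section, keys in required.items():
        if section not in config:
            raise ValueError(f"missing section: {section}")
        for key in keys:
            if key not in config[section]:
                raise ValueError(f"missing key: {section}.{key}")

config = {
    'DATABASE': {'TYPE': 'sqlite', 'DATABASE': 'mydatabase.db'},
    'SCRAPER': {'REQUEST_DELAY': 1.6},
}
validate_config_sketch(config)  # passes silently when all keys are present
```

A check like this fails fast at startup rather than midway through a long scrape, which is why the config is validated before the scraper is constructed.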

After preparing and validating the config, initialize the database:

from webscraperr.db import init_sqlite

init_sqlite(config['DATABASE'])

# This will create the database and the table

This example uses WebScraperRequest, which makes HTTP requests with the requests library. You will need to define the functions that parse the HTML. There is also WebScraperChrome, which uses selenium-wire and undetected-chromedriver.

from webscraperr import WebScraperRequest
from urllib.parse import urljoin
import parsel

BASE_URL = "https://webscraper.io"  # used below to resolve relative links

urls = ["https://webscraper.io/test-sites/e-commerce/static/computers/tablets"]

# The `get_next_page_func` must return a URL or None. If it returns None, there is no next page.

def get_next_page_func(response):
    selector = parsel.Selector(text=response.text) # in this example `parsel` is used for parsing the html
    next_page_url = selector.css('a[rel="next"]::attr(href)').get()
    if next_page_url is not None:
        return urljoin(BASE_URL, next_page_url)
    return None
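The `urljoin` call matters because the `href` extracted from the page may be site-relative. The snippet below (plain standard library, no scraping involved) shows how it resolves both relative and already-absolute links:

```python
from urllib.parse import urljoin

BASE_URL = "https://webscraper.io"

# A site-relative href is resolved against the base URL:
print(urljoin(BASE_URL, "/test-sites/e-commerce/static/computers/tablets?page=2"))
# -> https://webscraper.io/test-sites/e-commerce/static/computers/tablets?page=2

# An already-absolute href is returned unchanged:
print(urljoin(BASE_URL, "https://example.com/page"))
# -> https://example.com/page
```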

# The `parse_info_func` must return a `dict`.

def parse_info_func(response):
    selector = parsel.Selector(text=response.text)
    info = {
        'name': selector.css(".caption h4:nth-child(2)::text").get(),
        'price': selector.css(".caption .price::text").get()
    }
    return info
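Note that the values returned by `parse_info_func` are raw strings as they appear in the HTML (e.g. "$99.99" for the price). If you want numeric values, a small helper can normalize them before storage; `parse_price` below is a hypothetical post-processing idea, not part of webscraperr:

```python
def parse_price(text):
    """Hypothetical helper: convert a price string like '$1,099.99' to a float.

    Not part of webscraperr; shown only as optional post-processing for
    values returned by parse_info_func.
    """
    if text is None:
        return None
    cleaned = text.strip().lstrip("$").replace(",", "")
    return float(cleaned)

print(parse_price("$1,099.99"))  # 1099.99
```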


with WebScraperRequest(config) as scraper:
    scraper.get_items_urls_func = lambda selector: [urljoin(BASE_URL, i) for i in selector.css(".thumbnail a::attr(href)").getall()]
    scraper.get_next_page_func = get_next_page_func
    scraper.parse_info_func = parse_info_func

    scraper.scrape_items_urls(urls) # This will start scraping the product URLs

    scraper.scrape_items_infos() # This will navigate to each product page and parse its HTML
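Once scraping completes, the data sits in the configured SQLite database and can be read back with the standard sqlite3 module. The schema below is an assumption made for illustration (an in-memory database with a url column plus the parsed fields); check the table actually created by init_sqlite for the real column names, and point the connection at your configured database file instead of ':memory:':

```python
import sqlite3

# Illustration only: in-memory database with an ASSUMED schema and a sample
# row. The real table created by init_sqlite may differ; replace ':memory:'
# with 'mydatabase.db' and 'products' with your configured table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (url TEXT, name TEXT, price TEXT)")
conn.execute(
    "INSERT INTO products VALUES (?, ?, ?)",
    ("https://webscraper.io/...", "Example Tablet", "$99.99"),  # sample row
)

rows = list(conn.execute("SELECT url, name, price FROM products"))
for url, name, price in rows:
    print(name, price)
conn.close()
```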

Development Status

Please note that this library is still under development and may change. I am constantly working on improving its functionality, flexibility, and performance. Your patience, feedback, and contributions are much appreciated.
