simple scraping framework

Scrapework

Scrapework is a simple and opinionated framework for extracting data from the web. It is inspired by Scrapy and designed for simple tasks and easy management, so you can focus on the scraping logic. It is built on the parsel (also used by Scrapy) and httpx libraries. Key differences from Scrapy:

No CLI
No Twisted framework
Designed for in-process usage

Installation

Clone the repository, or install Scrapework as a dependency:

With pip:

pip install scrapework

With poetry:

poetry add scrapework

Quick Start

Create a Scraper

First, create a Scraper class to define how to extract data and, optionally, how to navigate a website. Here's a simple Scraper:

from scrapework.scraper import Scraper

class SimpleScraper(Scraper):
    name = "simple_scraper"

    def extract(self, ctx, selector):
        for quote in selector.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
            }

    def process(self, items, config):
        for item in items:
            print(f"Quote: {item['text']}, Author: {item['author']}")


scraper = SimpleScraper()
scraper.run(['http://quotes.toscrape.com'])

Like Scrapy's parse method, the extract method is required; it is where you define your scraping logic. It is called with the HTTP response for the initial URL, and you can use the parsel.Selector object it receives to extract data from the HTML with CSS or XPath expressions.

To run the Scraper, create an instance and call its run method, passing the URLs to scrape:

scraper = SimpleScraper()
scraper.run(['http://quotes.toscrape.com'])

Modules Configuration

Scrapework can be extended using modules:

middleware: configure request handling (cache, proxy, ...).
handlers: export the data, saving it to a file or database.
reporters: export and log scraping events and metadata.
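
As a rough illustration of the middleware idea, the sketch below wraps a fetch function with an in-memory cache. It does not use Scrapework's actual middleware interface, which is not shown in this README; `cache_middleware`, `fake_fetch`, and the callable-based design are hypothetical stand-ins chosen so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical type: a fetcher takes a URL and returns the page body.
Fetcher = Callable[[str], str]


def cache_middleware(fetch: Fetcher) -> Fetcher:
    """Wrap a fetcher so repeated requests for a URL reuse the first response."""
    cache: Dict[str, str] = {}

    def cached_fetch(url: str) -> str:
        if url not in cache:
            cache[url] = fetch(url)  # call the wrapped fetcher only on a miss
        return cache[url]

    return cached_fetch


# Stand-in for a real HTTP call (assumption: no network access needed).
calls: List[str] = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html>{url}</html>"


fetch = cache_middleware(fake_fetch)
first = fetch("http://quotes.toscrape.com")
second = fetch("http://quotes.toscrape.com")
print(len(calls))  # number of times the wrapped fetcher actually ran
```

The same wrapping pattern could in principle be applied to a proxy or retry layer, since each wrapper only needs to accept and return a fetcher.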

Flow

The scraping flow consists of the following steps:

Webpage downloading: Fetch the webpages using httpx. Optionally, use middleware to handle requests.
Extract data: Extract structured data from the HTML using parsers.
Export data: Use handlers to store or export the structured data.
Reporting: Generate reports and logs of the scraping process using reporters.
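
The four steps above can be sketched as a plain function that calls each stage in order. This is only a conceptual outline, not Scrapework's implementation: `run_flow`, `fake_download`, `fake_extract`, and the pipe-separated stand-in page format are all hypothetical, and a real run would fetch HTML with httpx and extract it with a parsel selector.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Item = Dict[str, str]


def run_flow(
    urls: Iterable[str],
    download: Callable[[str], str],
    extract: Callable[[str], Iterator[Item]],
    handlers: List[Callable[[List[Item]], None]],
    reporters: List[Callable[[Dict[str, int]], None]],
) -> None:
    """Run the four steps in order for each URL."""
    for url in urls:
        body = download(url)             # 1. webpage downloading
        items = list(extract(body))      # 2. extract data
        for handle in handlers:          # 3. export data
            handle(items)
        stats = {"items": len(items)}
        for report in reporters:         # 4. reporting
            report(stats)


def fake_download(url: str) -> str:
    # Stand-in for an HTTP request; returns a fixed "text|author" page.
    return "Hello world|Alice\nGoodbye|Bob"


def fake_extract(body: str) -> Iterator[Item]:
    # Stand-in for selector-based parsing of HTML.
    for line in body.splitlines():
        text, author = line.split("|")
        yield {"text": text, "author": author}


stored: List[Item] = []
reports: List[Dict[str, int]] = []

run_flow(
    ["http://quotes.toscrape.com"],
    fake_download,
    fake_extract,
    handlers=[stored.extend],
    reporters=[reports.append],
)
print(stored)
print(reports)
```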

For more details see Design.

Advanced Usage

For more advanced usage, you can override other methods in the Scraper, Parser, and Pipeline classes. Check the source code for more details.

Add Parser

Alternatively, you can create a Parser class to define how to extract data from a webpage. Here's how to create a Parser and configure it in the Scraper:

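from scrapework.parsers import Parser
from scrapework.scraper import Scraper  # same import path as in the Quick Start


# Define the parser first so the name exists when the Scraper class body runs.
class SimpleParser(Parser):
    def extract(self, ctx, selector):
        return {
            'text': selector.css('span.text::text').get(),
            'author': selector.css('span small::text').get(),
        }


class SimpleScraper(Scraper):
    name = "simple_scraper"
    parser = SimpleParser()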

The extract method is where you define your extraction logic. It is called with a parsel.Selector object that you can use to extract data from the HTML with CSS or XPath expressions.

Add a data handler

Similar to a pipeline, a handler defines how to process and store the data:

from scrapework.handlers import Handler

class SimpleHandler(Handler):
    def process_items(self, items, config):
        for item in items:
            print(f"Quote: {item['text']}, Author: {item['author']}")

The process_items method is where you define your processing logic. It is called with the items extracted by the Parser and a PipelineConfig object. To attach the handler, pass an instance to the scraper's use method:

scraper = SimpleScraper()

scraper.use(SimpleHandler())
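
A handler does not have to print. The sketch below writes each item as one JSON object per line. To stay runnable without Scrapework installed, it is a standalone class that only mirrors the process_items(items, config) signature shown above; in real use it would presumably subclass scrapework.handlers.Handler. The `JsonLinesHandler` name and its `stream` argument are hypothetical.

```python
import io
import json
from typing import Any, Dict, Iterable, TextIO


class JsonLinesHandler:
    """Write each item as one JSON object per line to a text stream."""

    def __init__(self, stream: TextIO) -> None:
        self.stream = stream

    def process_items(self, items: Iterable[Dict[str, Any]], config: Any = None) -> None:
        # config is accepted to match the signature above but is unused here.
        for item in items:
            self.stream.write(json.dumps(item, ensure_ascii=False) + "\n")


# Use an in-memory buffer here; an open file object would work the same way.
buffer = io.StringIO()
handler = JsonLinesHandler(buffer)
handler.process_items([{"text": "Hello", "author": "Alice"}])
print(buffer.getvalue())
```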

Testing

To run the tests, use the following command:

pytest tests/

If you use Playwright, install its browser binaries:

playwright install

Contributing

Contributions are welcome! Please read the contributing guidelines first.

License

Scrapework is licensed under the MIT License.

Release history

0.5.5 (May 8, 2024)
This version
0.5.4 (May 8, 2024)
0.5.3 (May 8, 2024)
0.5.2 (Apr 27, 2024)
0.5.1 (Apr 27, 2024)
0.5.0 (Apr 27, 2024)
0.4.7 (Apr 26, 2024)
0.4.6 (Apr 17, 2024)
0.4.5 (Mar 29, 2024)
0.4.4 (Mar 27, 2024)
0.4.3 (Mar 27, 2024)
0.4.0 (Mar 26, 2024)
0.3.3 (Mar 26, 2024)
0.3.2 (Mar 25, 2024)
0.3.1 (Mar 25, 2024)
0.3.0 (Mar 25, 2024)
0.2.0 (Mar 25, 2024)
0.1.3 (Mar 25, 2024)
0.1.1 (Mar 25, 2024)
0.1.0 (Mar 23, 2024)
0.0.11 (Mar 24, 2024)
0.0.10 (Mar 24, 2024)
0.0.7 (Mar 23, 2024)
0.0.6 (Mar 23, 2024)
0.0.5 (Mar 23, 2024)
0.0.4 (Mar 23, 2024)

Download files

Source Distribution

scrapework-0.5.4.tar.gz (12.5 kB)

Uploaded May 8, 2024 Source

Built Distribution

scrapework-0.5.4-py3-none-any.whl (15.8 kB)

Uploaded May 8, 2024 Python 3

Hashes for scrapework-0.5.4.tar.gz
Algorithm	Hash digest
SHA256	`769c97d5d1cd6ac728ceb19f44a1692ca2d6b5a7177f098b80665feb25522ae7`
MD5	`67b0855491ad4eb8d261ccff0f1d65a5`
BLAKE2b-256	`a247f7e814c23a9a6077fb372d7a07ab2bc676bee1bbe2ca9f3d9b97becd2014`

Hashes for scrapework-0.5.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f70584e3e888edb9b9f8f7536f085e29ae266a71e7f28fc41788ec27d9b45992`
MD5	`75c0d1a51576069e0195e0b0ffed699a`
BLAKE2b-256	`d00345bf4691409bd146069629381179767754ffa72681e0a2e111d28fb209cb`