simple scraping framework

Project description

Scrapework

Scrapework is a simple, opinionated scraping framework inspired by Scrapy. It is designed for straightforward tasks and easy management, letting you focus on the scraping logic rather than on boilerplate code.

  • No CLI
  • No twisted / async
  • Respectful and slow for websites

Getting Started

Installation

Add Scrapework to your project with Poetry:

poetry add scrapework

Quick Start

Flow:

  • Fetch: retrieve web pages
  • Extract: parse and extract structured data from pages
  • Pipeline: transform and export the structured data
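
The three stages above can be sketched without the framework, using only the standard library. Everything here is illustrative: the canned HTML stands in for a fetched page, and the function and class names are made up for the demo, not Scrapework API.

```python
# Framework-free sketch of the flow: fetch -> extract -> pipeline.
from html.parser import HTMLParser

def fetch(url: str) -> str:
    # Stand-in for an HTTP request; returns canned HTML for the demo.
    return '<div class="quote"><span class="text">To be or not to be.</span></div>'

class QuoteExtractor(HTMLParser):
    """Collects the text of every <span class="text"> element."""

    def __init__(self):
        super().__init__()
        self.in_text_span = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "text") in attrs:
            self.in_text_span = True

    def handle_data(self, data):
        if self.in_text_span:
            self.items.append({"text": data})
            self.in_text_span = False

def pipeline(items):
    # Transform/export stage: here, just normalize whitespace.
    return [{"text": item["text"].strip()} for item in items]

extractor = QuoteExtractor()
extractor.feed(fetch("http://quotes.toscrape.com"))
results = pipeline(extractor.items)
```

Scrapework's Spider, Extractor, and Pipeline classes, shown below, give each of these stages a dedicated home.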

Spider Configuration

  • start_urls: a list of URLs to start scraping from.
  • pipelines: a list of pipelines to process the extracted items.
  • extractors: ships with various extractors (plain body, smart extractors, Markdown).
  • middlewares: ships with various middlewares.

Creating a Spider

A Spider is a class that defines how to navigate a website and extract data. Here's how you can create a Spider:

from scrapework.spider import Spider

class MySpider(Spider):
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
            }

The parse method is where you define your scraping logic. It's called with the HTTP response for each start URL.

Creating an Extractor

An Extractor is a class that defines how to extract data from a webpage. Here's how you can create an Extractor:

from scrapework.extractors import Extractor

class MyExtractor(Extractor):
    def extract(self, selector):
        return {
            'text': selector.css('span.text::text').get(),
            'author': selector.css('span small::text').get(),
        }

The extract method is where you define your extraction logic. It's called with a parsel.Selector object that you can use to extract data from the HTML.

Creating a Pipeline

A Pipeline is a class that defines how to process and store the data. Here's how you can create a Pipeline:

from scrapework.pipelines import ItemPipeline

class MyPipeline(ItemPipeline):
    def process_items(self, items, config):
        for item in items:
            print(f"Quote: {item['text']}, Author: {item['author']}")

The process_items method is where you define your processing logic. It's called with the items extracted by the Extractor and a PipelineConfig object.
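
A common use of process_items is exporting items as JSON Lines. The standalone function below mirrors only the body you would write inside a pipeline; it is a sketch, not Scrapework's ItemPipeline or PipelineConfig API, and the output target is a generic file-like object.

```python
# Sketch of typical process_items logic: one JSON object per line.
import io
import json

def process_items(items, output):
    """Write each item as one JSON object per line to a file-like object."""
    for item in items:
        output.write(json.dumps(item) + "\n")
    return len(items)

buffer = io.StringIO()
count = process_items(
    [{"text": "Quote A", "author": "X"}, {"text": "Quote B", "author": "Y"}],
    buffer,
)
```

Writing one object per line keeps the export streamable: items can be appended as they arrive instead of buffering the whole result set.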

Running the Spider

To run the Spider, you need to create an instance of it and call the start_requests method:

spider = MySpider()
spider.start_requests()

Advanced Usage

For more advanced usage, you can override other methods in the Spider, Extractor, and Pipeline classes. Check the source code for more details.

Testing

To run the tests, use the following command:

pytest tests/

Contributing

Contributions are welcome! Please read the contributing guidelines first.

License

Scrapework is licensed under the MIT License.
