Scrapework
A simple scraping framework
Scrapework is a simple and opinionated framework for extracting data from the web. It is inspired by Scrapy and designed for simple tasks and easy management, allowing you to focus on the scraping logic. It is built on top of the parsel (used by Scrapy) and httpx libraries. Some of the key differences are:
- No CLI
- No twisted framework
- Designed for in-process usage
Installation
First, clone the repository or install it as a dependency:
With pip:
pip install scrapework
With poetry:
poetry add scrapework
Quick Start
Create a Scraper
First, create a Scraper class to define how to extract data and, optionally, how to navigate a website. Here's how you can create a simple Scraper:
from scrapework.scraper import Scraper

class SimpleScraper(Scraper):
    name = "simple_scraper"

    def extract(self, ctx, selector):
        for quote in selector.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
            }

    def process(self, items, config):
        for item in items:
            print(f"Quote: {item['text']}, Author: {item['author']}")
Similar to Scrapy's parse, the extract method is expected, and this is where you define your scraping logic. It is called with the HTTP response of the initial URL. You can use the parsel.Selector object to extract data from the HTML using css or xpath.
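To illustrate what such selection does, here is a sketch using only the standard library's xml.etree.ElementTree, whose limited XPath support approximates parsel's xpath on well-formed markup (the sample HTML below mirrors the quotes.toscrape.com structure and is an assumption of this example; a real Scraper receives a parsel.Selector instead):

```python
import xml.etree.ElementTree as ET

# Sample markup in the shape of a quotes.toscrape.com entry.
html = """<div class="quote">
  <span class="text">Quality over quantity.</span>
  <span><small>Anonymous</small></span>
</div>"""

root = ET.fromstring(html)
# XPath-style queries, analogous to selector.xpath(...) in parsel.
text = root.find(".//span[@class='text']").text
author = root.find(".//small").text
print(text)    # Quality over quantity.
print(author)  # Anonymous
```

parsel additionally supports CSS selectors (as in the Scraper above), which it translates to XPath internally.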
To run the Scraper, create an instance and call the run method, passing the URLs to scrape:
scraper = SimpleScraper()
scraper.run(['http://quotes.toscrape.com'])
Modules Configuration
Scrapework can be extended using modules:
- middleware: to configure request handling (cache, proxy, ...).
- handlers: to export the data, saving it to a file or database.
- reporters: to export and log scraping events and metadata.
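To make the three extension points concrete, here is an illustrative sketch of the shapes such modules might take. The class and method names below are assumptions of this example, not the actual Scrapework base classes; check the source for the real interfaces.

```python
# Illustrative only: hypothetical module shapes, not the real Scrapework API.
class CacheMiddleware:
    """Middleware: sits in front of request handling (cache, proxy, ...)."""
    def __init__(self):
        self.cache = {}

    def process_request(self, url):
        # Return a cached response if present, otherwise mark it for fetching.
        return self.cache.get(url, f"FETCH {url}")

class PrintHandler:
    """Handler: decides where extracted items end up (file, database, ...)."""
    def process_items(self, items):
        return [f"stored: {item}" for item in items]

class LogReporter:
    """Reporter: logs scraping events and metadata."""
    def report(self, event):
        print(f"[scrapework] {event}")

LogReporter().report("scrape finished")  # [scrapework] scrape finished
```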
Flow
The scraping flow consists of the following steps:
- Webpage downloading: fetch the webpages using httpx. Optionally, use middleware to handle requests.
- Extract data: extract structured data from the HTML using parsers.
- Export data: use handlers to store or export the structured data.
- Reporting: generate reports and logs of the scraping process using reporters.
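The four steps above can be sketched end to end as plain functions (the download step is stubbed out so the example is self-contained; a real run would fetch with httpx and parse with parsel):

```python
import re

def download(url):
    # Stub standing in for an HTTP fetch via httpx.
    return '<div class="quote"><span class="text">Stay curious.</span></div>'

def extract(html):
    # Stand-in for a parsel-based parser.
    return [{"text": t} for t in re.findall(r'<span class="text">(.*?)</span>', html)]

def export(items, store):
    # Stand-in for a handler persisting the items.
    store.extend(items)

def report(store):
    # Stand-in for a reporter.
    print(f"exported {len(store)} item(s)")

store = []
export(extract(download("http://quotes.toscrape.com")), store)
report(store)  # exported 1 item(s)
```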
For more details see Design.
Advanced Usage
For more advanced usage, you can override other methods in the Scraper, Parser, and Pipeline classes. Check the source code for more details.
Add Parser
Alternatively, you can create a Parser class to define how to extract data from a webpage. Here's how you can create a Parser and configure it in the Scraper:

from scrapework.parsers import Parser
from scrapework.scraper import Scraper

class SimpleParser(Parser):
    def extract(self, ctx, selector):
        return {
            'text': selector.css('span.text::text').get(),
            'author': selector.css('span small::text').get(),
        }

class SimpleScraper(Scraper):
    name = "simple_scraper"
    parser = SimpleParser()
The extract method is where you define your extraction logic. It is called with a parsel.Selector object that you can use to extract data from the HTML using css or xpath.
Add a data handler
Similar to a pipeline, a handler defines how to process and store the data:

from scrapework.handlers import Handler

class SimpleHandler(Handler):
    def process_items(self, items, config):
        for item in items:
            print(f"Quote: {item['text']}, Author: {item['author']}")
The process_items method is where you define your processing logic. It is called with the items extracted by the Parser and a PipelineConfig object. Register the handler on the scraper with use:

scraper = SimpleScraper()
scraper.use(SimpleHandler())
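As a less trivial illustration, a handler could persist items as JSON Lines. This is a standalone sketch (the real class would subclass scrapework.handlers.Handler, and the file path and output format are choices of this example, not mandated by Scrapework):

```python
import json
import os
import tempfile

class JsonLinesHandler:
    # A real version would subclass scrapework.handlers.Handler.
    def __init__(self, path):
        self.path = path

    def process_items(self, items, config):
        # Append each item as one JSON document per line.
        with open(self.path, "a", encoding="utf-8") as f:
            for item in items:
                f.write(json.dumps(item) + "\n")
        return items

path = os.path.join(tempfile.mkdtemp(), "quotes.jsonl")
JsonLinesHandler(path).process_items([{"text": "Hi", "author": "Anon"}], config=None)
with open(path, encoding="utf-8") as f:
    print(f.read().strip())  # {"text": "Hi", "author": "Anon"}
```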
Testing
To run the tests, use the following command:

pytest tests/

If the tests use Playwright, install its browsers first:

playwright install
Contributing
Contributions are welcome! Please read the contributing guidelines first.
License
Scrapework is licensed under the MIT License.
File details
Details for the file scrapework-0.5.5.tar.gz.

File metadata
- Download URL: scrapework-0.5.5.tar.gz
- Upload date:
- Size: 12.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes
Algorithm | Hash digest
---|---
SHA256 | f51938323c4bbfb929ac25791253693f2f41bef5b68d9aab68771ca8b1e3bfde
MD5 | 8ef61e1b7129f1cf8a34a38914bdd3dc
BLAKE2b-256 | 5dff7a528c5b8d9606c7e28852cabca78a1bf5f6131fbe9f4295e89befd8266a
File details
Details for the file scrapework-0.5.5-py3-none-any.whl.

File metadata
- Download URL: scrapework-0.5.5-py3-none-any.whl
- Upload date:
- Size: 15.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.0.0 CPython/3.12.3

File hashes
Algorithm | Hash digest
---|---
SHA256 | 3f9f322ba6c9e472ae815a0c7d1892a7ae589ae6bfff9a3f9776fa7de082cb93
MD5 | 657b75ff10fb22473c5e4748e276f66b
BLAKE2b-256 | 461b8a35965d73772eeb8001fb7b1e9155cbbdf108609c1dc4624f744b919a75