Scrapy spider for TW Rental House
TW Rental House Utility for Scrapy
This package is built for crawling Taiwanese rental house websites using Scrapy. Because crawlers differ in goal, scale, and pipeline, this package provides only a minimal feature set, which allows developers to list and decode rental house web pages into structured data without knowing much about the detailed HTML and API structure of each website. The package is also designed for extensibility: developers can insert customized callbacks, manipulate data, and integrate it with an existing crawler structure.
Although this package provides the ability to crawl rental house websites, it is the developer's responsibility to ensure that the crawling mechanism and the usage of data are appropriate. Please be friendly to the target website, for example by using DOWNLOAD_DELAY or the AutoThrottle extension (AUTOTHROTTLE_ENABLED) to avoid sending requests in bulk.
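As a rough illustration of the throttling advice above, a project's settings.py could contain something like the following. The setting names are standard Scrapy settings; the specific values are illustrative assumptions, not recommendations made by this package.

```python
# settings.py (sketch): throttle the crawl to stay friendly to the target site.
# All numbers below are illustrative assumptions; tune them for your own crawl.

# Wait about this many seconds between requests to the same site.
DOWNLOAD_DELAY = 3

# Let Scrapy's AutoThrottle extension adapt the delay to observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5            # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60             # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for about one request in flight

# Keep per-domain concurrency low as well.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

When AutoThrottle is enabled, Scrapy treats DOWNLOAD_DELAY as the lower bound for the delay it computes, so the two settings can be combined.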
Requirement
- Python 3.10+
Installation
poetry add scrapy-tw-rental-house
Basic Usage
This package currently supports 591. Each rental house website is a Scrapy Spider class. You can either crawl the entire website using the default settings, which will take a couple of days, or customize the behaviour based on your needs.
The most basic usage is to create a new Spider class that inherits from Rental591Spider:
from scrapy_twrh.spiders.rental591 import Rental591Spider

class MyAwesomeSpider(Rental591Spider):
    name = 'awesome'
And then start crawling with:
scrapy crawl awesome
Please see the example for detailed usage.
Items
All spiders populate 2 types of Scrapy items: GenericHouseItem and RawHouseItem.
GenericHouseItem contains normalized data fields. Spiders for different websites decode their data and fit it into this schema on a best-effort basis.
RawHouseItem contains unnormalized data fields, which keep the original, structured data on a best-effort basis.
Note that both items are supersets of the schema. It is the developer's responsibility to check which fields are provided when receiving an item.
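One way to check which fields an item actually carries is sketched below. The helper name provided_fields and the field names (vendor_house_id, monthly_price, floor) are hypothetical and used only for illustration; consult the item definitions for the real schema. A plain dict stands in for the item, since Scrapy items support the same `in` and indexing operations.

```python
def provided_fields(item, wanted):
    """Split `wanted` into fields the item carries and fields it lacks."""
    present = {
        name: item[name]
        for name in wanted
        if name in item and item[name] is not None
    }
    missing = [name for name in wanted if name not in present]
    return present, missing


# A plain dict stands in for an item yielded during the list phase.
# The field names are hypothetical, not taken from the package schema.
list_phase_item = {'vendor_house_id': 'A123', 'monthly_price': 18000}

present, missing = provided_fields(
    list_phase_item, ['vendor_house_id', 'monthly_price', 'floor']
)
print(present)   # fields that are safe to use
print(missing)   # fields this item does not provide
```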
For example, in Rental591Spider, for a single rental house, Scrapy will get:
- 1x RawHouseItem + 1x GenericHouseItem while listing all houses, which provides only the minimum data fields for GenericHouseItem
- 1x RawHouseItem + 1x GenericHouseItem while retrieving house detail.
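Because the same house produces a GenericHouseItem twice, once while listing and once while retrieving detail, a downstream pipeline may want to merge the two. The following is a minimal sketch of that idea, not something this package ships: the class name MergeHousePipeline and the key field vendor_house_id are assumptions, and plain dicts stand in for the items.

```python
class MergeHousePipeline:
    """Hypothetical pipeline: merge list-phase and detail-phase data per house."""

    def __init__(self, key='vendor_house_id'):
        self.key = key      # assumed identifier field; check the real schema
        self.houses = {}    # house id -> merged record

    def process_item(self, item, spider=None):
        record = self.houses.setdefault(item[self.key], {})
        # Later items overwrite earlier values only when they carry a value.
        for name, value in dict(item).items():
            if value is not None:
                record[name] = value
        return item


pipeline = MergeHousePipeline()
# List phase: only the minimum fields are known.
pipeline.process_item({'vendor_house_id': 'A123', 'monthly_price': 18000})
# Detail phase: more fields arrive for the same house.
pipeline.process_item({'vendor_house_id': 'A123', 'monthly_price': None, 'floor': 3})
print(pipeline.houses['A123'])
```

In a real project the pipeline would first check the item type, so that RawHouseItem instances are passed through untouched.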
Handlers
All spiders in this package provide the following handlers:
- start_list, similar to start_requests in Scrapy, controls how the crawler issues search/list requests to find all rental houses.
- parse_list, similar to parse in Scrapy, controls how the crawler handles responses from start_list and generates requests for the house detail pages.
- parse_detail controls how the crawler parses the detail page.
Each spider implements its own default handlers, namely default_start_list, default_parse_list, and default_parse_detail, which can be overridden during __init__. Please see the example for how to control spider behavior using handlers.
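The override pattern described above can be sketched as follows. To keep the sketch runnable without the package installed, StandInSpider is a stand-in base class and is not the real spider; the keyword argument names passed to __init__ and the return values of the default handlers are assumptions, so check the bundled example for the actual signature.

```python
class StandInSpider:
    """Stand-in for a spider from this package; NOT the real base class."""

    def __init__(self, start_list=None, parse_list=None, parse_detail=None):
        # Fall back to the built-in default when no custom handler is given.
        self.start_list = start_list or self.default_start_list
        self.parse_list = parse_list or self.default_parse_list
        self.parse_detail = parse_detail or self.default_parse_detail

    def default_start_list(self):
        yield 'list-request'

    def default_parse_list(self, response):
        yield f'detail-request for {response}'

    def default_parse_detail(self, response):
        yield {'source': response}


class MyAwesomeSpider(StandInSpider):
    name = 'awesome'

    def __init__(self):
        # Replace only parse_detail; the other two handlers keep their defaults.
        super().__init__(parse_detail=self.my_parse_detail)

    def my_parse_detail(self, response):
        # Reuse the default parser, then post-process each result.
        for item in self.default_parse_detail(response):
            item['tagged'] = True
            yield item


spider = MyAwesomeSpider()
print(list(spider.parse_detail('page-1')))
```

Wrapping the default handler rather than replacing it outright keeps the decoding logic in one place, so a custom handler only has to add its own filtering or post-processing.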