Scrapy spider for TW Rental House

TW Rental House Utility for Scrapy

This package is built for crawling Taiwanese rental house websites using Scrapy. Because crawlers differ in goal, scale, and pipeline, this package provides only a minimal feature set, which allows developers to list rental houses and decode a rental house web page into structured data without knowing much about the detailed HTML and API structure of each website. The package is also designed for extensibility: developers can insert customized callbacks, manipulate data, and integrate it with an existing crawler structure.

Although this package provides the ability to crawl rental house websites, it is the developer's responsibility to ensure that the crawling mechanism and the use of the data are appropriate. Please be friendly to the target website, for example by setting DOWNLOAD_DELAY or enabling AutoThrottle (AUTOTHROTTLE_ENABLED) to avoid sending requests in bulk.
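As a rough illustration of the throttling advice above, the following settings.py fragment uses standard Scrapy settings. The specific values are assumptions chosen for illustration, not recommendations from this package; tune them for your own crawl.

```python
# settings.py (sketch): slow the crawl down so the target site is not flooded.
# All values below are illustrative assumptions, not package defaults.

# Wait a few seconds between requests to the same site.
DOWNLOAD_DELAY = 3

# Send at most one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy's AutoThrottle extension adapt the delay to server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as the lower bound for the delay Scrapy picks.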

Requirement

Python 3.10+
Playwright (for 591 spiders)
PaddleOCR (for 591 spiders)

Installation

poetry add scrapy-tw-rental-house

Install Playwright

This package uses Playwright's default browser (Chromium) to render JavaScript content. Please install Playwright Chromium before using this package.

For more information, please refer to the official Playwright documentation.

poetry shell
playwright install chromium

591 specific

As 591 implements an anti-crawler mechanism, additional setup is required to bypass it. To enable Playwright to bypass the 591 anti-crawler mechanism, open the browser developer tools while browsing 591 and copy the setting to settings.py.

BROWSER_INIT_SCRIPT = 'console.log("This command enable Playwright")'

Enable OCR cache

OCR is a time-consuming process, so this package provides a cache mechanism to store OCR results. To enable the OCR cache, configure the Scrapy settings.py as follows:

# Enable OCR cache
OCR_CACHE_ENABLED = True # default false
OCR_CACHE_DIR = 'path/to/cache' # default to ocr_cache

Speed up browser page loading

This package supports skipping requests to specific domains and caching JS. To enable these features, configure the Scrapy settings.py as follows:

# Enable cache for JS
BROWSER_JS_CACHE_ENABLED = True
BROWSER_JS_CACHE_DIR = 'path/to/cache' # default to js_cache

# Enable skip specific domain request
BROWSER_SKIP_DOMAIN = [
    'https://the.unnecessary.domain',
]

Basic Usage

This package currently supports 591. Each rental house website is a Scrapy Spider class. You can either crawl the entire website using the default settings, which will take a couple of days, or customize the behaviour based on your needs.

The most basic usage is to create a new Spider class that inherits from Rental591Spider:

from scrapy_twrh.spiders.rental591 import Rental591Spider

class MyAwesomeSpider(Rental591Spider):
    name = 'awesome'  # spider name used by "scrapy crawl"

Then start crawling with:

scrapy crawl awesome

Please see the example for detailed usage.

Items

All spiders populate two types of Scrapy items: GenericHouseItem and RawHouseItem.

GenericHouseItem contains normalized data fields. Spiders for different websites decode their data and fit it into this schema on a best-effort basis.

RawHouseItem contains unnormalized data fields, which keep the original, structured data on a best-effort basis.

Note that both items are supersets of the schema. It is the developer's responsibility to check which fields are provided when receiving an item. For example, in Rental591Spider, for a single rental house, Scrapy will get:

1x RawHouseItem + 1x GenericHouseItem while listing all houses, which provide only the minimum data fields for GenericHouseItem
1x RawHouseItem + 1x GenericHouseItem while retrieving house details.
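Because the same house arrives as more than one partial GenericHouseItem, a downstream item pipeline may want to fold them together. The following is a minimal sketch of that idea, not part of this package. It assumes items behave like mappings (as Scrapy items do), and the key field name vendor_house_id is a hypothetical placeholder; check the real GenericHouseItem schema before relying on it.

```python
class MergeHouseFieldsPipeline:
    """Item pipeline sketch: fold partial items for one house into one record."""

    # Hypothetical key field; check the real GenericHouseItem schema.
    KEY_FIELD = 'vendor_house_id'

    def __init__(self):
        # Maps house key -> dict of every field seen so far for that house.
        self.houses = {}

    def process_item(self, item, spider):
        # Only normalized items are merged; raw items pass through untouched.
        if type(item).__name__ != 'GenericHouseItem':
            return item

        key = item.get(self.KEY_FIELD)
        if key is None:
            return item

        record = self.houses.setdefault(key, {})
        for field, value in item.items():
            # Never let a missing value overwrite one that was already seen.
            if value is not None:
                record[field] = value
        return item
```

The list-phase item fills in the minimum fields first, and the detail-phase item adds the rest later, so the merged record only ever grows.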

Handlers

All spiders in this package provide the following handlers:

start_list, similar to start_requests in Scrapy, controls how the crawler issues search/list requests to find all rental houses.
parse_list, similar to parse in Scrapy, controls how the crawler handles responses from start_list and generates requests for the house detail pages.
parse_detail, controls how the crawler parses a detail page.

Each spider implements its own default handlers, namely default_start_list, default_parse_list, and default_parse_detail, which can be overridden during __init__. Please see the example for how to control spider behavior using handlers.
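The override pattern described above can be sketched with a stand-in base class. This is an illustration of the design only: HandlerSpiderSketch is a hypothetical class, and the keyword arguments accepted by its __init__ are an assumption, so the real spiders in this package may use a different signature.

```python
class HandlerSpiderSketch:
    """Hypothetical stand-in for a spider with replaceable handlers."""

    def __init__(self, start_list=None, parse_list=None, parse_detail=None):
        # Fall back to the default handler whenever no override is given.
        self.start_list = start_list or self.default_start_list
        self.parse_list = parse_list or self.default_parse_list
        self.parse_detail = parse_detail or self.default_parse_detail

    def default_start_list(self):
        yield 'list-request'

    def default_parse_list(self, response):
        yield f'detail-request for {response}'

    def default_parse_detail(self, response):
        yield {'source': response}


class MySpider(HandlerSpiderSketch):
    def __init__(self):
        # Replace only parse_detail; the other handlers keep their defaults.
        super().__init__(parse_detail=self.my_parse_detail)

    def my_parse_detail(self, response):
        # Reuse the default handler, then post-process each item it yields.
        for item in self.default_parse_detail(response):
            item['seen'] = True
            yield item
```

Wrapping the default handler this way keeps the built-in parsing and adds custom logic on top, rather than reimplementing the parser.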

Project details

Release history

This version

2.1.3

Apr 11, 2025

2.1.2

Apr 11, 2025

2.1.1

Apr 11, 2025

2.1.0

Apr 6, 2025

2.0.4

Apr 2, 2025

2.0.3

Mar 27, 2025

2.0.1

Mar 22, 2025

2.0.0

Mar 22, 2025

1.5.1

Oct 29, 2024

1.5.0

Oct 28, 2024

1.4.1

Sep 26, 2024

1.4.0

Sep 23, 2024

1.3.7

Sep 9, 2024

1.3.6

Sep 8, 2024

1.3.5

Sep 8, 2024

1.3.4

Sep 7, 2024

1.3.3

Sep 7, 2024

1.3.2

Sep 7, 2024

1.3.1

Sep 7, 2024

1.3.0

Sep 7, 2024

1.2.1

Dec 30, 2023

1.2.0

Dec 30, 2023

1.1.2

Oct 27, 2021

1.1.1

Oct 27, 2021

1.1.0

Oct 26, 2021

1.0.0

Oct 25, 2021

0.1.2

Sep 15, 2019

0.1.1

Jun 11, 2019

0.1.0

Jun 11, 2019

Download files

Source Distribution

scrapy_tw_rental_house-2.1.3.tar.gz (28.1 kB)

Uploaded Apr 11, 2025 Source

Built Distribution

scrapy_tw_rental_house-2.1.3-py3-none-any.whl (30.9 kB)

Uploaded Apr 11, 2025 Python 3

File details

Details for the file scrapy_tw_rental_house-2.1.3.tar.gz.

File metadata

Download URL: scrapy_tw_rental_house-2.1.3.tar.gz
Upload date: Apr 11, 2025
Size: 28.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.7.1 CPython/3.10.12 Linux/6.8.0-57-generic

File hashes

Hashes for scrapy_tw_rental_house-2.1.3.tar.gz
Algorithm	Hash digest
SHA256	`0aff83466a368976e51fc7bbe6820ae31db591a029c0800b7b0192a0f5fc2da8`
MD5	`ac6959fc05dcd14628cb79f27541a501`
BLAKE2b-256	`8e80b8b18e83f0360f5536c35478d4195d475cb8cd3ec102e61ab3c361200c14`

File details

Details for the file scrapy_tw_rental_house-2.1.3-py3-none-any.whl.

File metadata

Download URL: scrapy_tw_rental_house-2.1.3-py3-none-any.whl
Upload date: Apr 11, 2025
Size: 30.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.7.1 CPython/3.10.12 Linux/6.8.0-57-generic

File hashes

Hashes for scrapy_tw_rental_house-2.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a5c7540c05b880428288002f570e2add3eb0edb45112a2687509ea2808c78c1b`
MD5	`14f5e0c1f9fead5f820a3decb135f158`
BLAKE2b-256	`5cfd8770751173590f4c8b1785d4478cbad834ff4a3d1b162d32607da72f368f`

scrapy-tw-rental-house 2.1.3
