Scrapy spider for TW Rental House

TW Rental House Utility for Scrapy

This package is built for crawling Taiwanese rental house websites using Scrapy. Because crawlers differ in goal, scale, and pipeline, this package provides only a minimum feature set, which allows developers to list rental houses and decode a rental house web page into structured data without knowing much about the detailed HTML and API structure of each website. The package is also designed for extensibility: developers can insert customized callbacks, manipulate data, and integrate it with an existing crawler structure.

Although this package provides the ability to crawl rental house websites, it is the developer's responsibility to ensure that the crawling mechanism and the use of the data are appropriate. Please be friendly to the target website, for example by using DOWNLOAD_DELAY or Scrapy's AutoThrottle extension (AUTOTHROTTLE_ENABLED) to avoid sending requests in bulk.
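As an illustration of the throttling advice above, here is a minimal sketch of polite crawl settings. The setting names are standard Scrapy settings; the values and the POLITE_SETTINGS name are assumptions chosen for illustration, not recommendations from this package.

```python
# Illustrative values only; tune them for your own crawl.
POLITE_SETTINGS = {
    # Wait (in seconds) between consecutive requests to the same website.
    "DOWNLOAD_DELAY": 2.0,
    # Let Scrapy's AutoThrottle extension adapt the delay to server latency.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 5.0,
    "AUTOTHROTTLE_MAX_DELAY": 60.0,
    # Aim for roughly one request in flight per remote server.
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
}
```

These keys can be copied into the project's settings.py, or the dict can be assigned to the custom_settings class attribute of a spider.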

Requirement

  1. Python 3.5+

Installation

pip install scrapy-tw-rental-house

Basic Usage

This package currently supports 591. Each rental house website is a Scrapy Spider class. You can either crawl the entire website using the default settings, which will take a couple of days, or customize the behaviour based on your needs.

The most basic usage is to create a new Spider class that inherits from Rental591Spider:

from scrapy_twrh.spiders.rental591 import Rental591Spider

class MyAwesomeSpider(Rental591Spider):
    name = 'awesome'

Then start crawling with:

scrapy crawl awesome

Please see the example for detailed usage.

Items

All spiders populate two types of Scrapy items: GenericHouseItem and RawHouseItem.

GenericHouseItem contains normalized data fields. Spiders for different websites decode their data and fit it into this schema on a best-effort basis.

RawHouseItem contains unnormalized data fields, which keep the original data, structured on a best-effort basis.

Note that both items are supersets of the schema: a given item may carry only some of the fields. It is the developer's responsibility to check which fields are provided when receiving an item. For example, in Rental591Spider, for a single rental house, Scrapy will get:

  1. 1x RawHouseItem + 1x GenericHouseItem while listing all houses. This GenericHouseItem provides only a minimum set of data fields.
  2. 1x RawHouseItem + 2x GenericHouseItem while retrieving house detail. The 2nd GenericHouseItem contains only location info.
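Because one house arrives as several partial GenericHouseItem objects, a pipeline usually has to check which fields each item carries and combine them. The following is a minimal sketch of that idea, not part of this package. The class name MergePartialItemsPipeline and the field name vendor_house_id are hypothetical; check the package's item definitions for the real field names.

```python
class MergePartialItemsPipeline:
    """Scrapy-style item pipeline that merges partial items per house."""

    def __init__(self, id_field="vendor_house_id"):
        # `vendor_house_id` is a hypothetical field name used for illustration.
        self.id_field = id_field
        self.houses = {}

    def process_item(self, item, spider):
        # Only normalized items are merged; raw items pass through untouched.
        if type(item).__name__ != "GenericHouseItem":
            return item
        data = dict(item)  # Scrapy items expose a dict-like interface
        house_id = data.get(self.id_field)
        if house_id is None:
            return item  # cannot merge without an identifier
        merged = self.houses.setdefault(house_id, {})
        # Copy only the fields this particular item actually provides.
        merged.update({k: v for k, v in data.items() if v is not None})
        return item
```

The pipeline keeps the merged view in memory for simplicity; a real crawl that runs for days would more likely write each partial update to a database.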

Handlers

All spiders in this package provide the following handlers:

  1. start_list, similar to start_requests in Scrapy, controls how the crawler issues search/list requests to find all rental houses.
  2. parse_list, similar to parse in Scrapy, controls how the crawler handles responses from start_list and generates requests for the house detail pages.
  3. parse_detail controls how the crawler parses a detail page.

All spiders implement their own default handlers, namely default_start_list, default_parse_list, and default_parse_detail, which can be overridden during __init__. Please see the example for how to control spider behavior using handlers.
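The override-in-__init__ pattern described above can be sketched as follows. HandlerSpiderSketch is a self-contained stand-in written for this illustration, not the package's real base class, and passing handlers as keyword arguments named after each handler is an assumption; the real constructor signature should be taken from the package example.

```python
class HandlerSpiderSketch:
    """Stand-in for a spider whose handlers can be swapped at construction."""

    def __init__(self, start_list=None, parse_list=None, parse_detail=None):
        # Fall back to the built-in default when no custom handler is given.
        self.start_list = start_list or self.default_start_list
        self.parse_list = parse_list or self.default_parse_list
        self.parse_detail = parse_detail or self.default_parse_detail

    def default_start_list(self):
        yield "list-request"

    def default_parse_list(self, response):
        yield "detail-request for " + response

    def default_parse_detail(self, response):
        yield {"source": response}


class TaggingSpider(HandlerSpiderSketch):
    """Overrides only parse_detail and keeps the other defaults."""

    def __init__(self):
        super().__init__(parse_detail=self.my_parse_detail)

    def my_parse_detail(self, response):
        # Reuse the default handler, then post-process what it yields.
        for item in self.default_parse_detail(response):
            item["checked"] = True
            yield item
```

Wrapping the default handler, rather than replacing it, keeps the built-in decoding logic and limits custom code to the post-processing step.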
