rentswatch-scraper

A basic framework to scrape rental ads

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)
Natural Language
- English
Operating System
- OS Independent
- Unix
Programming Language

Project description

This package provides an easy and maintainable way to build a Rentswatch scraper. Rentswatch is a cross-border investigation that collects data on flat rents in Europe. Its scrapers mainly focus on classified ads.

How to install

Install using pip:

pip install rentswatch-scraper

How to use

Let’s take a look at a quick example of using Rentswatch Scraper to build a simple model-backed scraper to collect data from a website.

First, import the package components to build your scraper:

#!/usr/bin/env python
from rentswatch_scraper.scraper import Scraper
from rentswatch_scraper.browser import geocode, convert
from rentswatch_scraper.fields import RegexField, ComputedField
from rentswatch_scraper import reporting

To share as much code as possible, we created an abstract class that every scraper extends. For the sake of simplicity, we'll use a dummy website as follows:

class DummyScraper(Scraper):
    # Those are the basic meta-properties that define the scraper behavior
    class Meta:
        country         = 'FR'
        site            = "dummy"
        baseUrl         = 'http://dummy.io'
        listUrl         = baseUrl + '/rent/city/paris/list.php'
        adBlockSelector = '.ad-page-link'

Without any further configuration, this scraper will start to collect ads from the list page of dummy.io. To find links to the ads, it will use the CSS selector .ad-page-link to get <a> elements and follow their href attributes.
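To make the list-page step concrete, here is a minimal standalone sketch of what "select the <a> elements and follow their href" amounts to. It is not the package's implementation: the class AdLinkParser, the function find_ad_links and the sample HTML are hypothetical, it uses only the Python 3 standard library, and it matches a single class name rather than a full CSS selector.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AdLinkParser(HTMLParser):
    """Collect href values of <a> tags carrying a given CSS class (hypothetical helper)."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def find_ad_links(html, base_url, css_class="ad-page-link"):
    parser = AdLinkParser(css_class)
    parser.feed(html)
    # Resolve relative links against the site's base URL
    return [urljoin(base_url, href) for href in parser.hrefs]


# Made-up list page for the dummy site used in this README
sample_html = """
<ul>
  <li><a class="ad-page-link" href="/rent/ad/1.php">Flat 1</a></li>
  <li><a class="ad-page-link featured" href="/rent/ad/2.php">Flat 2</a></li>
  <li><a class="other" href="/about.php">About</a></li>
</ul>
"""
links = find_ad_links(sample_html, "http://dummy.io")
print(links)
```

Only the two links carrying the ad-page-link class are kept, and relative paths are turned into absolute URLs before they would be fetched.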

We now have to teach the scraper how to extract key figures from the ad page.

class DummyScraper(Scraper):
    # HEADS UP: Meta declarations are hidden here
    # ...
    # ...

    # Extract data using a CSS selector.
    realtorName = RegexField('.realtor-title')
    # Extract data using a CSS selector and a regex.
    serviceCharge = RegexField('.description-list', r'charges : (.*)\s€')
    # Same as above, but this field is required: a custom exception
    # is thrown if it is missing.
    livingSpace = RegexField('.description-list', r'surface :(\d*)', required=True, exception=reporting.SpaceMissingError)
    # Extract the value directly, without using a regex.
    totalRent = RegexField('.description-price', required=True, exception=reporting.RentMissingError)
    # Store this value as a private property (beginning with an underscore).
    # It won't be saved in the database, but it can be helpful, as we'll see.
    _address = RegexField('.description-address')

Every attribute will be saved as an Ad’s property, according to the Ad model.
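To see what the two regexes above would capture, the following sketch applies them to a made-up description string with the standard `re` module. The helper `extract` and the sample text are assumptions for illustration only. How RegexField itself selects the element text, converts types or handles missing values is not shown here.

```python
import re


def extract(text, pattern=None):
    """Return the first capture group of `pattern` in `text`.

    Without a pattern, return the stripped text. Hypothetical helper,
    not part of rentswatch_scraper.
    """
    if pattern is None:
        return text.strip()
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Made-up text, as it might appear inside `.description-list`
description = "surface :45 m² - charges : 120 € - 3e étage"

service_charge = extract(description, r"charges : (.*)\s€")
living_space = extract(description, r"surface :(\d*)")
realtor_name = extract("  Dummy Realty  ")
missing = extract("no figures here", r"surface :(\d*)")

print(service_charge, living_space, realtor_name, missing)
```

Note that `(.*)` is greedy: if the text contained several "€" signs, the capture would run up to the last one, so a tighter pattern such as `(\d+)` may be safer on real pages.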

Some properties may not be directly extractable from the HTML. In that case, you can use a custom function that receives the existing properties. For this reason we created a second field type named ComputedField. Since the order in which properties are declared is recorded, we can use previously declared (and extracted) values to compute new ones.

class DummyScraper(Scraper):
    # ...
    # ...

    # Use the existing properties `totalRent` and `livingSpace`, as they were
    # extracted before this one.
    pricePerSqm = ComputedField(fn=lambda s, values: values["totalRent"] / values["livingSpace"])
    # This full example uses private properties to find latitude and longitude.
    # To do so we use a built-in function named `geocode` that transforms an
    # address into a dictionary of coordinates.
    _latLng = ComputedField(fn=lambda s, values: geocode(values['_address'], 'FRA'))
    # Get the dictionary fields we want.
    latitude = ComputedField(fn=lambda s, values: values['_latLng']['lat'])
    longitude = ComputedField(fn=lambda s, values: values['_latLng']['lng'])
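The dependency on declaration order can be illustrated with a small standalone sketch. It assumes nothing about the package's internals: `compute_fields` and `fake_geocode` are hypothetical names, the coordinates are fixed sample values, and the extracted values are assumed to be numbers already.

```python
def compute_fields(extracted, computed):
    """Evaluate (name, fn) pairs in order; each fn sees all earlier values."""
    values = dict(extracted)
    for name, fn in computed:
        values[name] = fn(values)
    # Drop private (underscore-prefixed) values, mirroring the convention above
    return {k: v for k, v in values.items() if not k.startswith("_")}


def fake_geocode(address):
    # Hypothetical stand-in for a geocoder: always returns the same coordinates
    return {"lat": 48.8566, "lng": 2.3522}


extracted = {"totalRent": 900.0, "livingSpace": 45.0, "_address": "1 rue Dummy, Paris"}
computed = [
    ("pricePerSqm", lambda v: v["totalRent"] / v["livingSpace"]),
    ("_latLng", lambda v: fake_geocode(v["_address"])),
    ("latitude", lambda v: v["_latLng"]["lat"]),
    ("longitude", lambda v: v["_latLng"]["lng"]),
]
ad = compute_fields(extracted, computed)
print(ad)
```

If `latitude` were listed before `_latLng`, its function would raise a KeyError, which is why the order of declaration matters.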

All you need to do now is to create an instance of your class and run the scraper.

# When your script is executed directly
if __name__ == "__main__":
    dummyScraper = DummyScraper()
    dummyScraper.run()

API Doc

class Ad

Attributes

As seen above, every Ad attribute can be used as a Scraper attribute to declare which values to extract.

Name	Type	Description
status	String	“listed” if needs more scraping, “scraped” if it’s done
site	String	Name of the website
createdAt	DateTime	Date the ad was first scraped
siteId	String	The unique ID from the site the ad was scraped from
serviceCharge	Float	Extra costs (heating mostly)
baseRent	Float	Base costs (without heating)
totalRent	Float	Total cost
livingSpace	Float	Surface in square meters
pricePerSqm	Float	Price per square meter
furnished	Bool	True if the flat or house is furnished
realtor	Bool	True if rented by a realtor, False if rented by a private individual
realtorName	Unicode	The name of the realtor or person offering the flat
latitude	Float	Latitude
longitude	Float	Longitude
balcony	Bool	True if there is a balcony or terrace
yearConstructed	String	The year the building was built
cellar	Bool	True if the flat comes with a cellar
parking	Bool	True if the flat comes with a parking space or a garage
houseNumber	String	House Number in the street
street	String	Street name (incl. “street”)
zipCode	String	ZIP code
city	Unicode	City
lift	Bool	True if a lift is present
typeOfFlat	String	Type of flat (no typology)
noRooms	String	Number of rooms
floor	String	Floor the flat is at
garden	Bool	True if there is a garden
barrierFree	Bool	True if the flat is wheelchair accessible
country	String	Country, 2 letter code
sourceUrl	String	URL of the page

class Scraper

Methods

The Scraper class defines many methods that we encourage you to override in order to have full control over your scraper's behavior.

Name	Description
extract_ad	Extract ads list from a page’s soup.
fail	Print out an error message.
fetch_ad	Fetch a single ad page from the target website then create Ad instances by calling extract_ad.
fetch_series	Fetch a single list page from the target website then fetch an ad by calling fetch_ad.
find_ad_blocks	Extract ad block from a page list. Called within fetch_series.
get_ad_href	Extract a href attribute from an ad block. Called within fetch_series.
get_ad_id	Extract a siteId from an ad block. Called within fetch_series.
get_fields	Used internally to generate a list of property to extract from the ad.
get_series	Fetch a list page from the target website.
has_issue	True if we met issues with this ad before.
is_scraped	True if we already scraped this ad before.
ok	Print out a success message.
prepare	Just before saving the values.
run	Run the scraper.
transform_page	Transform HTML content of the series page before parsing it.
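The general idea of overriding one of these methods can be sketched with a toy base class. Everything below is hypothetical: `MiniScraper` is not the package's Scraper, and the real method signatures are not documented here, so check the source before overriding them. The sketch only shows the pattern, where `run` calls hooks that a subclass may replace.

```python
class MiniScraper:
    """Hypothetical stand-in base class; NOT the rentswatch_scraper API."""

    def transform_page(self, html):
        # Default hook: leave the page untouched
        return html

    def find_ad_blocks(self, html):
        # Toy "parser": one ad block per non-empty line
        return [line.strip() for line in html.splitlines() if line.strip()]

    def run(self, html):
        # The skeleton calls the hooks; subclasses only swap the pieces they need
        return self.find_ad_blocks(self.transform_page(html))


class CleanedScraper(MiniScraper):
    def transform_page(self, html):
        # Strip a site-specific banner line before parsing
        return "\n".join(line for line in html.splitlines() if "BANNER" not in line)


page = "ad-1\nBANNER promo\nad-2\n"
default_blocks = MiniScraper().run(page)
cleaned_blocks = CleanedScraper().run(page)
print(default_blocks, cleaned_blocks)
```

Only `transform_page` is redefined in the subclass; the rest of the pipeline is inherited unchanged.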

Release history

1.0.1

Jun 1, 2016

1.0.0

May 31, 2016

0.33.9

May 13, 2016

0.33.8

Apr 25, 2016

0.33.7

Apr 21, 2016

0.33.6

Apr 21, 2016

0.33.4

Apr 12, 2016

0.33.2

Mar 24, 2016

0.33.0

Mar 24, 2016

0.32.0

Jan 21, 2016

0.31.0

Jan 20, 2016

0.30.0

Dec 17, 2015

0.29.0

Dec 17, 2015

0.28.0

Dec 17, 2015

0.27.0

Dec 8, 2015

0.26.0

Dec 8, 2015

0.25.0

Dec 3, 2015

0.24.0

Dec 2, 2015

0.23.0

Nov 24, 2015

0.22.0

Nov 24, 2015

0.21.0

Nov 17, 2015

0.20.0

Nov 17, 2015

0.19.0

Nov 17, 2015

0.18.0

Nov 17, 2015

0.17.0

Nov 17, 2015

0.16.0

Nov 17, 2015

0.15.0

Nov 17, 2015

This version

0.14.0

Nov 17, 2015

0.13.0

Nov 17, 2015

0.12.0

Nov 17, 2015

0.11.0

Nov 12, 2015

0.10.0

Nov 10, 2015

0.9.0

Nov 9, 2015

0.8.0

Nov 9, 2015

0.7.0

Nov 9, 2015

0.6.0

Nov 9, 2015

0.5.0

Nov 6, 2015

0.4.0

Nov 6, 2015

0.3.0

Nov 6, 2015

0.2.0

Nov 6, 2015

0.1.0

Nov 5, 2015

Download files

Source Distribution

rentswatch-scraper-0.14.0.tar.gz (19.7 kB)

Uploaded Nov 17, 2015 Source

File details

Details for the file rentswatch-scraper-0.14.0.tar.gz.

File metadata

Download URL: rentswatch-scraper-0.14.0.tar.gz
Upload date: Nov 17, 2015
Size: 19.7 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for rentswatch-scraper-0.14.0.tar.gz
Algorithm	Hash digest
SHA256	`59d97969653c6f0c1c2c28b32c14222bce1a1748e157d5d58ef4e1d03e493b43`
MD5	`066f0a51ce60c41cbf977e6023c4d413`
BLAKE2b-256	`8112f1209147108dcf6dd89ed407bcf00f6c3969ef638a584ea3118b1d5c34f3`