zlodziej-crawler

About The Project

A small web scraper that collects and processes offers from olx.pl.

Built With

Getting Started

Prerequisites

Poetry is used to manage project dependencies. You can install it with:

pip install poetry

Installation

  • Clone the repo
git clone https://gitlab.com/mwozniak11121/zlodziej-crawler-public.git
  • Spawn poetry shell
poetry shell
  • Install dependencies and package
poetry install

 

Alternatively, install the package with pip:

pip install zlodziej-crawler

Usage

The only script provided is steal. It prompts for the URL of an offer category, e.g. olx.pl/nieruchomosci/mieszkania/wynajem/wroclaw/,
then scrapes, processes, and saves the offers it finds. Results are saved in the directory cwd / results.
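
The category address in the example above is written without a scheme. The sketch below shows one way such input could be turned into a full URL and how the results directory could be resolved. Both helpers, normalize_category_url and results_dir, are hypothetical names made up for illustration and are not taken from the project's code.

```python
from pathlib import Path
from urllib.parse import urlsplit


def normalize_category_url(raw: str) -> str:
    """Return a full https URL for a category address typed with or without a scheme.

    Hypothetical helper; any query string or fragment is dropped.
    """
    raw = raw.strip()
    if "://" not in raw:
        raw = "https://" + raw
    parts = urlsplit(raw)
    host = parts.netloc.lower()
    # Accept olx.pl itself and its subdomains such as www.olx.pl.
    if host != "olx.pl" and not host.endswith(".olx.pl"):
        raise ValueError(f"not an olx.pl address: {raw!r}")
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return f"https://{host}{path}"


def results_dir() -> Path:
    """Return cwd / results, the location named in the usage notes (assumed layout)."""
    return Path.cwd() / "results"


url = normalize_category_url("olx.pl/nieruchomosci/mieszkania/wynajem/wroclaw/")
```

Rejecting other hosts early keeps a mistyped address from reaching the scraper at all.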

Example output for RentOffer looks like this:

Extending Project

The project is meant to be easy to extend by adding new Pydantic models to zlodziej_crawler/models.py.
BaseOffer serves as a generic model for all offer types that are not specifically processed.
RentOffer and its parent class BaseOffer look like this:

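from datetime import datetime
from typing import Dict, Optional, Union

from pydantic import BaseModel, HttpUrl, PositiveFloat, PositiveInt

# Website, OfferType and BuildingType are defined elsewhere in the project (not shown here).


class BaseOffer(BaseModel):
    # Fields shared by every offer
    url: HttpUrl
    offer_name: str
    description: str
    id: PositiveInt
    time_offer_added: datetime
    views: PositiveInt
    location: str
    price: Union[PositiveInt, str]
    website: Optional[Website] = None
    # Scraped details that are not yet processed
    unused_data: Optional[Dict] = None


class RentOffer(BaseOffer):
    # Required for every rent offer
    rent: PositiveInt
    area: float

    number_of_rooms: Optional[str] = None
    offer_type: Optional[OfferType] = OfferType.UNKNOWN
    floor: Optional[str] = None
    building_type: Optional[BuildingType] = BuildingType.UNKNOWN
    furnished: Optional[bool] = None

    total_price: Optional[int] = None
    price_per_m: Optional[PositiveFloat] = None
    total_price_per_m: Optional[PositiveFloat] = None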
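
To illustrate the extension point, here is a minimal sketch of a model for another category. CarOffer and its fields are hypothetical and do not exist in the project, and BaseOffer is trimmed to a few fields so that the snippet runs on its own.

```python
from typing import Optional

from pydantic import BaseModel, PositiveInt, ValidationError


class BaseOffer(BaseModel):
    # Trimmed stand-in for the project's BaseOffer; only a few fields are kept.
    offer_name: str
    id: PositiveInt
    location: str


class CarOffer(BaseOffer):
    # Hypothetical model for a car category; field names are illustrative.
    mileage_km: PositiveInt
    production_year: PositiveInt
    fuel_type: Optional[str] = None


offer = CarOffer(
    offer_name="Example car",
    id=1,
    location="Wroclaw",
    mileage_km=120000,
    production_year=2015,
)

# Pydantic rejects values that break the declared constraints.
try:
    CarOffer(
        offer_name="Broken",
        id=2,
        location="Wroclaw",
        mileage_km=-5,
        production_year=2015,
    )
    rejected = False
except ValidationError:
    rejected = True
```

Required fields carry the data every offer in the category is expected to have, while optional fields default to None, the same split RentOffer uses.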

The project can be extended by adding matching classes for other categories at olx.pl.
Adding a new offer type requires:

  • Parsing functions in zlodziej_crawler/olx/offers_extraction/NEW_OFFER.py
  • Factory function in OLXParserFactory (zlodziej_crawler/olx/parser_factory.py)
  • Matching offer category url in OLXParserFactory.get_parser (zlodziej_crawler/olx/parser_factory.py)
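
The three steps above can be pictured with the following sketch. It is not the project's OLXParserFactory: the class SketchParserFactory, the parser functions, and the shape of the raw data are all assumptions made for illustration, and only the rent category path comes from the usage example.

```python
from typing import Callable, Dict
from urllib.parse import urlsplit

Parser = Callable[[Dict[str, str]], Dict[str, object]]


def parse_base_offer(raw: Dict[str, str]) -> Dict[str, object]:
    # Generic fallback for categories without a dedicated parser.
    return {"kind": "base", "offer_name": raw["title"]}


def parse_rent_offer(raw: Dict[str, str]) -> Dict[str, object]:
    # Category-specific parsing builds on the generic result.
    parsed = parse_base_offer(raw)
    parsed.update(kind="rent", rent=int(raw["rent"]))
    return parsed


class SketchParserFactory:
    # Maps a category path prefix to its parser (assumed lookup scheme).
    _parsers: Dict[str, Parser] = {
        "/nieruchomosci/mieszkania/wynajem/": parse_rent_offer,
    }

    @classmethod
    def get_parser(cls, category_url: str) -> Parser:
        path = urlsplit(category_url).path
        for prefix, parser in cls._parsers.items():
            if path.startswith(prefix):
                return parser
        return parse_base_offer


rent_parser = SketchParserFactory.get_parser(
    "https://www.olx.pl/nieruchomosci/mieszkania/wynajem/wroclaw/"
)
other_parser = SketchParserFactory.get_parser("https://www.olx.pl/motoryzacja/")
```

Under this scheme, supporting a new category means writing its parser function and adding one entry to the mapping. Unmatched categories fall back to the generic parser.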

Currently, any information the scraper finds in the titlebox-details section that is not yet processed is saved as unused_data.
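
A minimal sketch of that idea, assuming the scraped details arrive as a plain dictionary: keys a parser knows about are processed and the rest are kept aside. The function split_details and the sample keys are hypothetical and not taken from the project.

```python
from typing import Dict, Iterable, Tuple


def split_details(
    details: Dict[str, str], known_keys: Iterable[str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate scraped details into processed fields and leftover data."""
    known = set(known_keys)
    processed = {key: value for key, value in details.items() if key in known}
    unused = {key: value for key, value in details.items() if key not in known}
    return processed, unused


# Illustrative input; real keys depend on what the page exposes.
details = {"area": "42", "floor": "3", "parking": "yes"}
processed, unused = split_details(details, known_keys=["area", "floor"])
```

Keeping the leftovers means no scraped information is lost, and the leftover keys hint at which fields a future model could add.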
