zlodziej-crawler
About The Project
A small web scraper for collecting and processing offers from the website olx.pl.
Getting Started
Prerequisites
Poetry is used for managing project dependencies. You can install it with:
pip install poetry
Installation
- Clone the repo
git clone https://gitlab.com/mwozniak11121/zlodziej-crawler-public.git
- Spawn poetry shell
poetry shell
- Install dependencies and package
poetry install
Alternatively, if you want to install the package through pip:
pip install zlodziej-crawler
Usage
The only script made available is steal, which prompts for the URL of an offer category, e.g.
olx.pl/nieruchomosci/mieszkania/wynajem/wroclaw/
and then scrapes, processes, and saves the offers it finds.
(Results are saved in the directory cwd / results.)
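As a rough illustration of the cwd / results convention, the sketch below writes a list of offers into a results directory under the current working directory. The function name save_results, the JSON format, and the file naming are assumptions made for this example; the README does not say how the package stores its output.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def save_results(offers: List[Dict], name: str, base: Optional[Path] = None) -> Path:
    """Write offers to <base>/results/<name>.json and return the file path.

    Hypothetical helper: only the "cwd / results" location comes from the README.
    """
    # Default to the current working directory, mirroring "cwd / results".
    results_dir = (base or Path.cwd()) / "results"
    results_dir.mkdir(parents=True, exist_ok=True)

    out_file = results_dir / f"{name}.json"
    # ensure_ascii=False keeps Polish characters readable in the saved file.
    out_file.write_text(
        json.dumps(offers, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_file
```

Calling save_results(offers, "wynajem-wroclaw") would create results/wynajem-wroclaw.json next to wherever the script was started.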
[Example output for RentOffer]
Extending Project
The project is meant to be easily extendable by adding new Pydantic models to zlodziej_crawler/models.py.
BaseOffer serves as a generic model for all offer types that are not specifically processed.
RentOffer and its parent class BaseOffer look like this:
from datetime import datetime
from typing import Dict, Optional, Union

from pydantic import BaseModel, HttpUrl, PositiveFloat, PositiveInt

# Website, OfferType and BuildingType are defined elsewhere in the project (not shown here).


class BaseOffer(BaseModel):
    url: HttpUrl
    offer_name: str
    description: str
    id: PositiveInt
    time_offer_added: datetime
    views: PositiveInt
    location: str
    price: Union[PositiveInt, str]
    website: Optional[Website] = None
    unused_data: Optional[Dict] = None


class RentOffer(BaseOffer):
    rent: PositiveInt
    area: float
    number_of_rooms: Optional[str] = None
    offer_type: Optional[OfferType] = OfferType.UNKNOWN
    floor: Optional[str] = None
    building_type: Optional[BuildingType] = BuildingType.UNKNOWN
    furnished: Optional[bool] = None
    total_price: Optional[int] = None
    price_per_m: Optional[PositiveFloat] = None
    total_price_per_m: Optional[PositiveFloat] = None
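To show what adding a model for another category might look like, here is a minimal sketch that follows the same pattern as the models above. CarOffer and its fields (mileage_km, production_year, fuel_type) are hypothetical and do not exist in the project; BaseOffer is redefined in trimmed form only so that the example runs on its own.

```python
from typing import Dict, Optional, Union

from pydantic import BaseModel, HttpUrl, PositiveInt, ValidationError


class BaseOffer(BaseModel):
    # Trimmed copy of the project's BaseOffer, shortened to keep the sketch self-contained.
    url: HttpUrl
    offer_name: str
    id: PositiveInt
    price: Union[PositiveInt, str]
    unused_data: Optional[Dict] = None


class CarOffer(BaseOffer):
    # Hypothetical model: these field names are assumptions, not part of the project.
    mileage_km: PositiveInt
    production_year: Optional[PositiveInt] = None
    fuel_type: Optional[str] = None


# Required fields are validated on construction; optional ones default to None.
offer = CarOffer(
    url="https://www.olx.pl/oferta/example",
    offer_name="Example car",
    id=1,
    price=25000,
    mileage_km=120000,
)
print(offer.mileage_km, offer.production_year)

# Invalid values (here a negative mileage) raise a ValidationError.
try:
    CarOffer(
        url="https://www.olx.pl/oferta/example",
        offer_name="Example car",
        id=2,
        price=25000,
        mileage_km=-5,
    )
except ValidationError:
    print("negative mileage rejected")
```

Subclassing BaseOffer means the new model inherits the shared fields, so only the category-specific fields need to be declared.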
The project can be extended by adding matching classes for other categories at olx.pl.
Adding a new OfferType requires:
- Parsing functions in zlodziej_crawler/olx/offers_extraction/NEW_OFFER.py
- Factory function in OLXParserFactory (zlodziej_crawler/olx/parser_factory.py)
- Matching offer category url in OLXParserFactory.get_parser (zlodziej_crawler/olx/parser_factory.py)
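The steps above amount to mapping a category URL to a parser. The sketch below illustrates that idea with plain functions; it is not the code of OLXParserFactory, and the names parse_base_offer, parse_rent_offer, and the prefix table are assumptions made for the example.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

Parser = Callable[[dict], dict]


def parse_base_offer(raw: dict) -> dict:
    # Hypothetical generic parser used when no category-specific parser matches.
    return {"kind": "base", **raw}


def parse_rent_offer(raw: dict) -> dict:
    # Hypothetical parser for flat-rental offers.
    return {"kind": "rent", **raw}


# Category path prefix -> parser. A new offer type would add one entry here.
_PARSERS: Dict[str, Parser] = {
    "nieruchomosci/mieszkania/wynajem": parse_rent_offer,
}


def get_parser(url: str) -> Parser:
    """Pick a parser based on the category part of an olx.pl URL."""
    # urlparse only fills in .path reliably when a scheme is present.
    if "://" not in url:
        url = "https://" + url
    path = urlparse(url).path.strip("/")
    for prefix, parser in _PARSERS.items():
        if path.startswith(prefix):
            return parser
    return parse_base_offer


parser = get_parser("olx.pl/nieruchomosci/mieszkania/wynajem/wroclaw/")
print(parser.__name__)
```

Falling back to a generic parser mirrors the role BaseOffer plays for categories without dedicated handling.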
Currently, any information found by the scraper in the titlebox-details section and not yet processed is saved as unused_data.
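One way to picture the unused_data behaviour is as a split of the scraped details into fields a model knows about and everything else. The sketch below assumes the details arrive as a flat dictionary; the function split_details and the sample keys are hypothetical and only illustrate the idea.

```python
from typing import Dict, Set, Tuple


def split_details(
    details: Dict[str, str], known: Set[str]
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate scraped details into processed fields and leftover data."""
    processed = {key: value for key, value in details.items() if key in known}
    # Anything the model does not handle yet is kept instead of being dropped.
    unused = {key: value for key, value in details.items() if key not in known}
    return processed, unused


# Sample keys are made up for the example.
details = {"area": "45", "rent": "500", "pets_allowed": "yes"}
processed, unused = split_details(details, known={"area", "rent"})
print(processed)
print(unused)
```

Keeping the leftovers makes it easy to see which fields a future model could start processing.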
Hashes for zlodziej_crawler-0.1.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 88beba95facb4cf459060b7a3498614263371973463077906917240712e8c621
MD5 | b4ba29d8d4bda459d6d6e2846286e5de
BLAKE2b-256 | a531de769c37ccd444b0044f922eae0c7e08696bfafa9ebd2e0fb8c5cd43339d