A crawler for product information of sellers on Ruten.
Ruten Seller Product Parser
This repository provides a ProductCrawler class that crawls a seller's Ruten web pages and returns the product information in JSON format.
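from ruten_crawler import ProductCrawler
# Create a crawler for the seller with the given Ruten seller ID
product_crawler = ProductCrawler(seller_id="hambergurs")
# Run the crawl and collect the product information
results = product_crawler.get_crawl_result()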
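Once the results are available, they will usually need to be stored somewhere. The following is a minimal sketch, not part of this package, of writing them out as UTF-8 JSON. The helper name write_results and the shape of the sample record are assumptions made for illustration; the helper accepts either a Python object or an already-encoded JSON string, because the exact return type of get_crawl_result() is not documented here.

```python
import io
import json


def write_results(results, stream):
    """Write crawl results to a text stream as pretty-printed JSON.

    Hypothetical helper: `results` may be a Python object (list/dict)
    or a JSON string, since the crawler's return type is an assumption.
    """
    if isinstance(results, str):
        # Decode first so the output is re-indented consistently
        results = json.loads(results)
    # ensure_ascii=False keeps Chinese product titles readable in the file
    json.dump(results, stream, ensure_ascii=False, indent=2)
    return results


# Made-up sample record standing in for real crawl output
sample = [{"title": "示範商品", "image_urls": ["https://example.com/a.jpg"]}]
buffer = io.StringIO()
write_results(sample, buffer)
```

To write to disk instead, pass a file opened with open(path, "w", encoding="utf-8") in place of the in-memory buffer.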
Installation
To install the released version from PyPI, type:
pip install ruten_crawler
To install the latest code from this repository (the project is in the alpha stage, so updates may be frequent), type:
pip install git+https://github.com/jn8029/ruten_crawler.git
Overview
class ProductCrawler
handles the whole web crawling logic. It takes the optional arguments sleep_time and sleep_at_each_iteration.
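The README does not spell out what the two optional arguments mean. One plausible reading, and it is only an assumption, is that the crawler pauses for sleep_time seconds after every sleep_at_each_iteration requests in order to avoid hammering the site. A minimal sketch of that throttling pattern, using a hypothetical throttled generator that is not part of this package:

```python
import time


def throttled(items, sleep_time=1.0, sleep_at_each_iteration=10, sleep=time.sleep):
    """Yield items, pausing after every `sleep_at_each_iteration` of them.

    Hypothetical illustration of the assumed semantics; the real
    ProductCrawler may interpret these arguments differently.
    The `sleep` parameter exists only so the pause can be replaced in tests.
    """
    for index, item in enumerate(items, start=1):
        yield item
        if index % sleep_at_each_iteration == 0:
            # Pause before handing out the next batch of items
            sleep(sleep_time)
```

Wrapping a list of URLs in such a generator keeps the pacing logic separate from the fetching and parsing code.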
class ProductPageParser
handles information extraction from product pages. Currently the parser extracts only the shipping information, the image URLs, and the product title. Logic for extracting further fields can be added here.
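As a rough illustration of this kind of extraction, the sketch below pulls a title and the image URLs out of an HTML string using only the standard library. It is not the package's implementation: the class name ProductPageSketch, the function parse_product_page, and the choice of reading the title from the page's title tag are all assumptions, and real Ruten pages may need different selectors.

```python
from html.parser import HTMLParser


class ProductPageSketch(HTMLParser):
    """Collect the <title> text and every <img src> from a page (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.image_urls = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Skip <img> tags that carry no src attribute
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_product_page(html):
    """Return the extracted fields as a JSON-serializable dict."""
    parser = ProductPageSketch()
    parser.feed(html)
    return {"title": parser.title.strip(), "image_urls": parser.image_urls}
```

New fields such as price or description would be added as further branches in handle_starttag and handle_data.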
class ProdcutListParser
handles parsing of the product list pages. Its main job is to extract the product URLs from each list page; those URLs are then passed to ProductPageParser to extract the product information.
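The list-page step can be sketched in the same spirit. The example below collects product links from an HTML string, keeping the first occurrence of each URL. The names LinkExtractor and extract_product_urls, and the "/item/" marker used to recognize product links, are assumptions for illustration and do not reflect the actual URL scheme or the package's code.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect unique href values that contain a given marker (hypothetical)."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only product-looking links, and skip duplicates
        if href and self.marker in href and href not in self.urls:
            self.urls.append(href)


def extract_product_urls(list_page_html, marker="/item/"):
    """Return product URLs found on one list page, in document order."""
    parser = LinkExtractor(marker)
    parser.feed(list_page_html)
    return parser.urls
```

Each returned URL would then be fetched and handed to the product page parser.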
To-do
- add more robust exception handling in ProductCrawler, since the crawling process is multi-threaded.
- add more product info extraction features in ProductCrawler, e.g. price, remaining time, and description.
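For the first to-do item, one common pattern is to catch exceptions per task so that a single failing page does not abort the whole crawl. The following is a minimal sketch using the standard library's thread pool; it is an assumption about how this could be done, not a description of how ProductCrawler currently manages its threads. The function crawl_all and its fetch parameter are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL in a thread pool.

    Returns (results, errors): two dicts keyed by URL, so that failures
    are recorded instead of stopping the remaining downloads.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                # result() re-raises any exception from the worker thread
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

The errors dict can then be logged or used to retry only the URLs that failed.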