A crawler for product information of sellers on Ruten.

Ruten Seller Product Parser

This repository provides a ProductCrawler class that crawls Ruten web pages and returns a seller's product information in JSON format.

from ruten_crawler import ProductCrawler

# Crawl the products listed by the seller with this ID.
product_crawler = ProductCrawler(seller_id="hambergurs")
results = product_crawler.get_crawl_result()
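Since the results come back in JSON format, a common next step is to write them to disk. The following is a minimal sketch, not part of the package: `save_results` and `crawl_and_save` are hypothetical helper names, and the sketch assumes the results are either a JSON string or plain JSON-serializable objects, which this README does not specify.

```python
import json
from pathlib import Path


def save_results(results, path):
    """Write crawl results to a UTF-8 JSON file and return the parsed data."""
    # Assumption: results is a JSON string or JSON-serializable objects.
    data = json.loads(results) if isinstance(results, str) else results
    # ensure_ascii=False keeps non-ASCII product titles readable in the file.
    Path(path).write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return data


def crawl_and_save(seller_id, path):
    """Crawl one seller and save the result (needs the package and network access)."""
    # Imported lazily so save_results can be used without the package installed.
    from ruten_crawler import ProductCrawler

    crawler = ProductCrawler(seller_id=seller_id)
    return save_results(crawler.get_crawl_result(), path)
```

Calling `crawl_and_save("hambergurs", "products.json")` would then run the crawl shown above and store its output.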

Installation

To install this version from PyPI, type:


pip install ruten_crawler

To get the newest one from this repo (note that we are in the alpha stage, so there may be frequent updates), type:


pip install git+https://github.com/jn8029/ruten_crawler.git

Overview

The ProductCrawler class handles the whole web crawling logic. It takes the optional arguments sleep_time and sleep_at_each_iteration.
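This README does not document what sleep_time and sleep_at_each_iteration mean. One plausible reading is "pause for sleep_time seconds after every sleep_at_each_iteration requests", and the sketch below illustrates only that reading. It is not the package's implementation: `crawl_with_pauses`, `fetch`, and the default values are all hypothetical.

```python
import time


def crawl_with_pauses(urls, fetch, sleep_time=1.0,
                      sleep_at_each_iteration=10, sleep=time.sleep):
    """Call fetch(url) for each URL in a list, pausing periodically.

    Assumed semantics (not confirmed by the package): sleep for
    sleep_time seconds after every sleep_at_each_iteration requests.
    """
    results = []
    for count, url in enumerate(urls, start=1):
        results.append(fetch(url))
        # Pause after every N-th request, but not after the final one.
        if count % sleep_at_each_iteration == 0 and count < len(urls):
            sleep(sleep_time)
    return results
```

Passing the `sleep` function as a parameter keeps the loop easy to test without real delays.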

The ProductPageParser class handles information extraction from a product page. Currently the parser extracts only the shipping information, the image URLs, and the product title. Logic for extracting more fields can be added here.
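To give a feel for this kind of extraction, here is a small sketch that pulls a page title and image URLs out of HTML using only the standard library. It is an illustration under assumptions, not the package's parser: `PageInfoParser` and `parse_product_page` are hypothetical names, and real Ruten pages may need different selectors. Shipping information is left out because its markup is not described here.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collect the <title> text and every <img src> value from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.image_urls = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a src attribute
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def parse_product_page(html):
    """Return the title and image URLs found in an HTML string."""
    parser = PageInfoParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "images": parser.image_urls}
```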

The ProductListParser class handles parsing of the product list pages. Its main job is to extract the list of product URLs from each page; those URLs are then passed to ProductPageParser to parse the product information.
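The list-page step can be sketched in the same way: collect the links on a page, keep those that look like product links, and resolve them to absolute URLs. This is a hedged illustration rather than the package's code. `LinkCollector`, `extract_product_urls`, and the `"/item/"` marker are assumptions made for the example; the actual URL pattern on Ruten is not stated in this README.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect unique absolute URLs from <a href> tags containing a marker."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            url = urljoin(self.base_url, href)  # resolve relative links
            if url not in self.urls:  # keep first occurrence only
                self.urls.append(url)


def extract_product_urls(list_page_html, base_url, marker="/item/"):
    """Return product-looking URLs from a list page, in page order."""
    collector = LinkCollector(base_url, marker)
    collector.feed(list_page_html)
    return collector.urls
```

Each returned URL could then be fetched and handed to a page-level parser.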

To-do

  • add more robust exception handling in ProductCrawler, since the crawl is multi-threaded.
  • add more product info extraction features in ProductCrawler, e.g. price, remaining time, and description.
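For the first to-do item, one possible approach is to catch exceptions per task so that a single failing page does not abort the whole crawl. The sketch below uses `concurrent.futures`, which is an assumption: this README only says the process is multi-threaded, not how the threads are managed. `parse_all` and `parse_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_all(urls, parse_one, max_workers=8):
    """Run parse_one(url) in worker threads.

    Returns (results, errors), two dicts keyed by URL, so callers can
    retry or log the failures separately.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(parse_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure instead of letting it stop the crawl.
                errors[url] = exc
    return results, errors
```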
