roboparse

A simple utility that helps you organize your scraper code.

Example

Go to the example directory.

Installation

  • Via pip
pip install roboparse
  • Via git
git clone https://github.com/Toffooo/roboparse.git
cd roboparse
pip install -e .

Routers

You have two options when creating routers.

  1. Make one big router for all the features you need
  2. Split it into smaller routers
  • Big router
from roboparse import BaseRouter
from roboparse.schemas import RouterResponse


class BlogSiteRouter(BaseRouter):
    def get_posts(self) -> RouterResponse:    
        response = self.create_router_response(
            path="<site_url>",  # Path is just meta data. It uses for nothing
            linter={
                "type": "LIST",
                "tag": "li",
                "attrs": {"class": "content-list__item"},
                "children": {
                    "type": "ELEMENT",
                    "tag": "h2",
                    "attrs": {"class": "post__title"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "a",
                        "attrs": {"class": "post__title_link"}
                    }
                }
            }
        )
        return response
    
    def get_main(self) -> RouterResponse:
        response = self.create_router_response_from_json(
            path="json_file.json"
        )
        return response

    def _fb_exclude_none_blocks(self, data):
        # The _fb prefix registers this method as a filter (explained below)
        return [element for element in data if element is not None]
  • Small routers
from roboparse import BaseRouter
from roboparse.schemas import RouterResponse


class BlogFilters:
    def _fb_exclude_none_blocks(self, data):
        return [element for element in data if element is not None]


class BlogMainRouter(BaseRouter, BlogFilters):
    def get(self) -> RouterResponse:
        response = self.create_router_response_from_json(
            path="json_file.json"
        )
        return response


class BlogPostRouter(BaseRouter, BlogFilters):
    def get(self) -> RouterResponse:    
        response = self.create_router_response(
            path="<site_url>",  # Path is just meta data. It uses for nothing
            linter={
                "type": "LIST",
                "tag": "li",
                "attrs": {"class": "content-list__item"},
                "children": {
                    "type": "ELEMENT",
                    "tag": "h2",
                    "attrs": {"class": "post__title"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "a",
                        "attrs": {"class": "post__title_link"}
                    }
                }
            }
        )
        return response
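
To make the linter hierarchy above concrete, here is a minimal, standalone sketch of the HTML shape it describes: a list of li.content-list__item elements, each containing an h2.post__title with an a.post__title_link inside. It uses plain BeautifulSoup and invented sample markup, not roboparse itself, so treat it only as an illustration of the structure the linter dict encodes.

from bs4 import BeautifulSoup

# Invented sample markup matching the hierarchy described by the linter dict
html = """
<ul class="content-list">
  <li class="content-list__item">
    <h2 class="post__title"><a class="post__title_link" href="/post-1">First post</a></h2>
  </li>
  <li class="content-list__item">
    <h2 class="post__title"><a class="post__title_link" href="/post-2">Second post</a></h2>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# LIST level: every <li class="content-list__item">
for item in soup.find_all("li", class_="content-list__item"):
    # ELEMENT levels: <h2 class="post__title">, then its <a class="post__title_link">
    title = item.find("h2", class_="post__title")
    link = title.find("a", class_="post__title_link") if title else None
    if link:
        print(link.get_text(strip=True), link["href"])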

Explanation:

  1. create_router_response - Every router method should return a router response like the ones above; these responses are passed to the parser, which handles them
    a) path - Metadata about the page URL
    b) linter - The hierarchy of HTML elements to match on the page
  2. create_router_response_from_json - Same as create_router_response, except you pass the path of a JSON file and the linter schema is loaded from it. The JSON structure should match the inline linter dict (see the sketch below)
  3. _fb prefix - You can register filters for your router. In this example the filter is declared by adding the _fb prefix to the method name, which registers the method on the class as a filter. The filter here simply removes None elements from the list and returns the cleaned data (illustrated further below)
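
For create_router_response_from_json, the JSON file carries the linter schema with the same keys you would otherwise pass inline. Below is a minimal sketch of producing such a file from the linter dict used above; whether anything besides the linter belongs in the file is an assumption, as only the schema structure is stated here.

import json

# Same linter dict as in the router examples above
linter = {
    "type": "LIST",
    "tag": "li",
    "attrs": {"class": "content-list__item"},
    "children": {
        "type": "ELEMENT",
        "tag": "h2",
        "attrs": {"class": "post__title"},
        "children": {
            "type": "ELEMENT",
            "tag": "a",
            "attrs": {"class": "post__title_link"}
        }
    }
}

# Write it to the file that create_router_response_from_json will load
with open("json_file.json", "w") as f:
    json.dump(linter, f, indent=2)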

See the full code example at example/scraper.py
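
Because _fb filters are plain methods, their behaviour is easy to check in isolation. In the sketch below, _fb_exclude_none_blocks is copied from the routers above, while _fb_strip_strings is a hypothetical extra filter invented only to illustrate the naming convention; exactly what data the parser passes into filters, and when, is internal to roboparse and not shown here.

class BlogFilters:
    # Copied from the routers above: drops None entries from the parsed data
    def _fb_exclude_none_blocks(self, data):
        return [element for element in data if element is not None]

    # Hypothetical extra filter (name invented): strips whitespace from strings
    def _fb_strip_strings(self, data):
        return [item.strip() if isinstance(item, str) else item for item in data]


# Standalone illustration of the transformations, outside of roboparse
filters = BlogFilters()
data = [" First post ", None, "Second post"]
data = filters._fb_exclude_none_blocks(data)
data = filters._fb_strip_strings(data)
print(data)  # ['First post', 'Second post']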
