Roboparse HTML
roboparse
A simple utility that helps you organize the code of your scraper.
Example
See the example directory.
Installation
- Via pip
pip install roboparse
- Via git
git clone https://github.com/Toffooo/roboparse.git
cd roboparse
pip install -e .
Routers
There are two ways to organize routers:
- Create one big router that covers every feature you need
- Split it into several small routers
- Big router
from roboparse import BaseRouter
from roboparse.schemas import RouterResponse


class BlogSiteRouter(BaseRouter):
    def get_posts(self) -> RouterResponse:
        response = self.create_router_response(
            path="<site_url>",  # path is only metadata; it is not used for anything
            linter={
                "type": "LIST",
                "tag": "li",
                "attrs": {"class": "content-list__item"},
                "children": {
                    "type": "ELEMENT",
                    "tag": "h2",
                    "attrs": {"class": "post__title"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "a",
                        "attrs": {"class": "post__title_link"}
                    }
                }
            }
        )
        return response

    def get_main(self) -> RouterResponse:
        response = self.create_router_response_from_json(
            path="json_file.json"
        )
        return response

    def _fb_exclude_none_blocks(self, data):
        return [element for element in data if element is not None]
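To make the nesting of the linter argument easier to read, here is a minimal sketch that walks a linter dict and returns one indented line per level. Note that describe_linter is a hypothetical helper written only for this illustration and is not part of roboparse; it assumes nothing beyond the type, tag, attrs and children keys used in the router above.

```python
from typing import Any, Dict, List

# The same schema that is passed as `linter` in the router example.
linter = {
    "type": "LIST",
    "tag": "li",
    "attrs": {"class": "content-list__item"},
    "children": {
        "type": "ELEMENT",
        "tag": "h2",
        "attrs": {"class": "post__title"},
        "children": {
            "type": "ELEMENT",
            "tag": "a",
            "attrs": {"class": "post__title_link"},
        },
    },
}


def describe_linter(node: Dict[str, Any], depth: int = 0) -> List[str]:
    """Hypothetical helper: one indented line per level of a linter dict."""
    attrs = "".join(f' {key}="{value}"' for key, value in node.get("attrs", {}).items())
    lines = [f'{"  " * depth}{node["type"]}: <{node["tag"]}{attrs}>']
    child = node.get("children")
    if child:
        # Each level has at most one nested "children" dict in the example schema.
        lines.extend(describe_linter(child, depth + 1))
    return lines


outline = describe_linter(linter)
print("\n".join(outline))
# LIST: <li class="content-list__item">
#   ELEMENT: <h2 class="post__title">
#     ELEMENT: <a class="post__title_link">
```

Read top to bottom, the outline says: take the list of li items, then the h2 inside each item, then the a inside that h2.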
- Small router
from roboparse import BaseRouter
from roboparse.schemas import RouterResponse


class BlogFilters:
    def _fb_exclude_none_blocks(self, data):
        return [element for element in data if element is not None]


class BlogMainRouter(BaseRouter, BlogFilters):
    def get(self) -> RouterResponse:
        response = self.create_router_response_from_json(
            path="json_file.json"
        )
        return response


class BlogPostRouter(BaseRouter, BlogFilters):
    def get(self) -> RouterResponse:
        response = self.create_router_response(
            path="<site_url>",  # path is only metadata; it is not used for anything
            linter={
                "type": "LIST",
                "tag": "li",
                "attrs": {"class": "content-list__item"},
                "children": {
                    "type": "ELEMENT",
                    "tag": "h2",
                    "attrs": {"class": "post__title"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "a",
                        "attrs": {"class": "post__title_link"}
                    }
                }
            }
        )
        return response
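The BlogFilters mixin above can be tried on its own, without roboparse installed. The sketch below runs the filter on sample data and shows one way methods could be discovered by name prefix. The find_prefixed_methods helper and the sample list are assumptions made for illustration; how roboparse actually registers _fb methods is not shown here.

```python
class BlogFilters:
    def _fb_exclude_none_blocks(self, data):
        # Drop elements that the parser could not resolve (None).
        return [element for element in data if element is not None]


def find_prefixed_methods(obj, prefix="_fb"):
    """Hypothetical helper: names of callables on obj that start with prefix."""
    return sorted(
        name
        for name in dir(obj)
        if name.startswith(prefix) and callable(getattr(obj, name))
    )


filters = BlogFilters()

# Sample parsed data with gaps, invented for this example.
cleaned = filters._fb_exclude_none_blocks(["First post", None, "Second post", None])
found = find_prefixed_methods(filters)

print(cleaned)  # ['First post', 'Second post']
print(found)    # ['_fb_exclude_none_blocks']
```

Keeping filters in a separate mixin class, as the small-router variant does, lets several routers share the same cleanup logic.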
Explanation:
create_router_response
- Every router method should return a router response, as in the examples above. These responses are passed to the parser, which handles them.
a) path
- Metadata about the URL of the page.
b) linter
- Provide the hierarchy of HTML elements here.
create_router_response_from_json
- Same as create_router_response, but you provide the path to a JSON file and the linter schema is loaded from it. The JSON structure should be the same.
_fb prefix
- You can register filters for your router. In this example, the filter is declared by adding the _fb prefix to the method name, which registers the method in the class as a filter. This filter just removes None elements from the list and returns the processed data.
See the code example at example/scraper.py
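Since create_router_response_from_json expects a JSON file with the same structure as the linter, one way to produce such a file is to dump the linter dict with the standard json module. This is a sketch under one assumption: the file holds the linter dict as its top-level object, which the text above suggests but does not spell out. The temporary directory is used only so the example does not overwrite an existing json_file.json.

```python
import json
import os
import tempfile

# The linter schema from the router examples.
linter = {
    "type": "LIST",
    "tag": "li",
    "attrs": {"class": "content-list__item"},
    "children": {
        "type": "ELEMENT",
        "tag": "h2",
        "attrs": {"class": "post__title"},
        "children": {
            "type": "ELEMENT",
            "tag": "a",
            "attrs": {"class": "post__title_link"},
        },
    },
}

# Write the schema to json_file.json inside a fresh temporary directory.
json_path = os.path.join(tempfile.mkdtemp(), "json_file.json")
with open(json_path, "w", encoding="utf-8") as fh:
    json.dump(linter, fh, indent=2)

# Read it back to confirm the file round-trips to the same structure.
with open(json_path, encoding="utf-8") as fh:
    loaded = json.load(fh)

print(loaded == linter)  # True
```

The resulting file path is what would be passed as path to create_router_response_from_json.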