Roboparse HTML
roboparse
A simple utility that helps you organize the code of your scraper.
Example
Go to the example directory.
Installation
- Via pip
    pip install roboparse
- Via git
    git clone https://github.com/Toffooo/roboparse.git
    cd roboparse
    pip install -e .
Routers
You have two options when creating routers:
- Make one big router for all the features you need
- Split it into several small routers
- Big router
    from roboparse import BaseRouter
    from roboparse.schemas import RouterResponse


    class BlogSiteRouter(BaseRouter):
        def get_posts(self) -> RouterResponse:
            response = self.create_router_response(
                path="<site_url>",  # path is only metadata; it is not used for anything
                linter={
                    "type": "LIST",
                    "tag": "li",
                    "attrs": {"class": "content-list__item"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "h2",
                        "attrs": {"class": "post__title"},
                        "children": {
                            "type": "ELEMENT",
                            "tag": "a",
                            "attrs": {"class": "post__title_link"}
                        }
                    }
                }
            )
            return response

        def get_main(self) -> RouterResponse:
            # Load the linter schema from a JSON file instead of an inline dict
            response = self.create_router_response_from_json(
                path="json_file.json"
            )
            return response

        # The _fb prefix registers this method as a filter
        def _fb_exclude_none_blocks(self, data):
            return [element for element in data if element is not None]
- Small router
    from roboparse import BaseRouter
    from roboparse.schemas import RouterResponse


    class BlogFilters:
        # Shared filters; the _fb prefix registers the method as a filter
        def _fb_exclude_none_blocks(self, data):
            return [element for element in data if element is not None]


    class BlogMainRouter(BaseRouter, BlogFilters):
        def get(self) -> RouterResponse:
            # Load the linter schema from a JSON file instead of an inline dict
            response = self.create_router_response_from_json(
                path="json_file.json"
            )
            return response


    class BlogPostRouter(BaseRouter, BlogFilters):
        def get(self) -> RouterResponse:
            response = self.create_router_response(
                path="<site_url>",  # path is only metadata; it is not used for anything
                linter={
                    "type": "LIST",
                    "tag": "li",
                    "attrs": {"class": "content-list__item"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "h2",
                        "attrs": {"class": "post__title"},
                        "children": {
                            "type": "ELEMENT",
                            "tag": "a",
                            "attrs": {"class": "post__title_link"}
                        }
                    }
                }
            )
            return response
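The nested linter dictionary used in both routers can be hard to read at a glance. The sketch below only walks the dictionary and prints one line per level, which makes the nesting explicit. The helper name describe_linter and the CSS-like "tag.class" notation are assumptions made for illustration; they are not part of roboparse, and the sketch says nothing about how roboparse matches elements internally.

```python
from typing import Any, Dict, List

# Same schema as in the router examples above.
linter: Dict[str, Any] = {
    "type": "LIST",
    "tag": "li",
    "attrs": {"class": "content-list__item"},
    "children": {
        "type": "ELEMENT",
        "tag": "h2",
        "attrs": {"class": "post__title"},
        "children": {
            "type": "ELEMENT",
            "tag": "a",
            "attrs": {"class": "post__title_link"},
        },
    },
}


def describe_linter(node: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: one description per schema level, outermost first."""
    levels = []
    current = node
    while current is not None:
        css_class = current.get("attrs", {}).get("class", "")
        selector = current["tag"] + ("." + css_class if css_class else "")
        levels.append(f'{current["type"]}: {selector}')
        # A level without a "children" key ends the chain.
        current = current.get("children")
    return levels


for depth, level in enumerate(describe_linter(linter)):
    print("  " * depth + level)
# LIST: li.content-list__item
#   ELEMENT: h2.post__title
#     ELEMENT: a.post__title_link
```

Read this way, the schema describes a list of li items, each containing an h2 title that in turn contains a link.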
Explanation:
create_router_response
- Every router method should return a router response as shown above. These responses are passed to the parser, which handles them.
a) path
- Metadata about the URL of the page.
b) linter
- Provide the hierarchy of HTML elements here.
create_router_response_from_json
- Same as create_router_response, except that you provide the path to a JSON file and the linter schema is loaded from it. The JSON structure should be the same as the linter schema.
_fb prefix
- You can register filters for your router. In this example, the filter is declared by adding the _fb prefix to the method name, which registers the method in the class as a filter. The filter here just removes None elements from the list and returns the processed data.
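To illustrate the idea of registering filters by method-name prefix, here is a minimal sketch of how such a convention could be implemented in plain Python. It is not roboparse's code: PrefixFilterMixin, collect_filters and apply_filters are hypothetical names invented for this example, and the real BaseRouter may discover and apply filters differently.

```python
from typing import Any, Callable, List


class PrefixFilterMixin:
    """Hypothetical stand-in for illustration; NOT roboparse's BaseRouter."""

    FILTER_PREFIX = "_fb"

    def collect_filters(self) -> List[Callable[[Any], Any]]:
        # Find every callable attribute whose name starts with the prefix.
        names = sorted(n for n in dir(self) if n.startswith(self.FILTER_PREFIX))
        return [getattr(self, n) for n in names if callable(getattr(self, n))]

    def apply_filters(self, data: Any) -> Any:
        # Run the data through each registered filter in name order.
        for filter_method in self.collect_filters():
            data = filter_method(data)
        return data


class BlogFilters(PrefixFilterMixin):
    # Same filter as in the router examples above.
    def _fb_exclude_none_blocks(self, data):
        return [element for element in data if element is not None]


filtered = BlogFilters().apply_filters(["first post", None, "second post", None])
print(filtered)  # ['first post', 'second post']
```

The point of the convention is that adding a filter requires nothing more than defining a method with the right prefix; no explicit registration call is needed.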
See code example at example/scraper.py
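For create_router_response_from_json, the explanation above says the JSON structure should be the same as the linter schema. A minimal sketch of producing such a file with the standard json module follows, assuming the file holds the linter dictionary directly at the top level (the exact layout roboparse expects is not spelled out here, so treat that as an assumption). The file is written to a temporary directory only to keep the example self-contained.

```python
import json
import os
import tempfile

# Same schema as in the router examples above.
linter = {
    "type": "LIST",
    "tag": "li",
    "attrs": {"class": "content-list__item"},
    "children": {
        "type": "ELEMENT",
        "tag": "h2",
        "attrs": {"class": "post__title"},
        "children": {
            "type": "ELEMENT",
            "tag": "a",
            "attrs": {"class": "post__title_link"},
        },
    },
}

# Assumption: the JSON file contains the linter dictionary as its top-level object.
schema_path = os.path.join(tempfile.mkdtemp(), "json_file.json")
with open(schema_path, "w", encoding="utf-8") as schema_file:
    json.dump(linter, schema_file, indent=4)

# Read the file back to confirm it round-trips to the same structure.
with open(schema_path, encoding="utf-8") as schema_file:
    loaded = json.load(schema_file)

print(loaded == linter)  # True
```

Keeping the schema in a JSON file lets you change selectors without touching the router code.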