Roboparse HTML
roboparse
A simple utility that helps you organize the code of your scraper.
Example
Go to the example directory.
Installation
- Via pip
pip install roboparse
- Via git
git clone https://github.com/Toffooo/roboparse.git
cd roboparse
pip install -e .
Routers
There are two ways to organize routers:
- Write one big router that covers every feature you need
- Split it into several small routers
- Big router
from roboparse import BaseRouter
from roboparse.schemas import RouterResponse


class BlogSiteRouter(BaseRouter):
    def get_posts(self) -> RouterResponse:
        response = self.create_router_response(
            path="<site_url>",  # Path is only metadata; it is not used for anything
            linter={
                "type": "LIST",
                "tag": "li",
                "attrs": {"class": "content-list__item"},
                "children": {
                    "type": "ELEMENT",
                    "tag": "h2",
                    "attrs": {"class": "post__title"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "a",
                        "attrs": {"class": "post__title_link"}
                    }
                }
            }
        )
        return response

    def get_main(self) -> RouterResponse:
        # Load the linter schema from a JSON file instead of an inline dict
        response = self.create_router_response_from_json(
            path="json_file.json"
        )
        return response

    def _fb_exclude_none_blocks(self, data):
        # Filter: drop None elements from the parsed data
        return [element for element in data if element is not None]
- Small router
from roboparse import BaseRouter
from roboparse.schemas import RouterResponse


class BlogFilters:
    def _fb_exclude_none_blocks(self, data):
        # Filter: drop None elements from the parsed data
        return [element for element in data if element is not None]


class BlogMainRouter(BaseRouter, BlogFilters):
    def get(self) -> RouterResponse:
        # Load the linter schema from a JSON file instead of an inline dict
        response = self.create_router_response_from_json(
            path="json_file.json"
        )
        return response


class BlogPostRouter(BaseRouter, BlogFilters):
    def get(self) -> RouterResponse:
        response = self.create_router_response(
            path="<site_url>",  # Path is only metadata; it is not used for anything
            linter={
                "type": "LIST",
                "tag": "li",
                "attrs": {"class": "content-list__item"},
                "children": {
                    "type": "ELEMENT",
                    "tag": "h2",
                    "attrs": {"class": "post__title"},
                    "children": {
                        "type": "ELEMENT",
                        "tag": "a",
                        "attrs": {"class": "post__title_link"}
                    }
                }
            }
        )
        return response
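The nested linter used in both routers reads as a chain of levels, each one nested inside the previous via "children". The sketch below only walks that dictionary and prints one line per level, to make the hierarchy easier to see. The helper name describe_levels is hypothetical and is not part of roboparse; it does not parse any HTML.

```python
from typing import Any, Dict, List

# Same linter schema as in the routers above.
LINTER: Dict[str, Any] = {
    "type": "LIST",
    "tag": "li",
    "attrs": {"class": "content-list__item"},
    "children": {
        "type": "ELEMENT",
        "tag": "h2",
        "attrs": {"class": "post__title"},
        "children": {
            "type": "ELEMENT",
            "tag": "a",
            "attrs": {"class": "post__title_link"},
        },
    },
}


def describe_levels(linter: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: flatten a nested linter into one string per level."""
    levels = []
    node = linter
    while node is not None:
        css_class = node.get("attrs", {}).get("class", "")
        selector = f"{node['tag']}.{css_class}" if css_class else node["tag"]
        levels.append(f"{node['type']}: {selector}")
        # Descend to the next level; stops when there is no "children" key.
        node = node.get("children")
    return levels


for depth, line in enumerate(describe_levels(LINTER)):
    print("  " * depth + line)
```

Read top to bottom, the schema asks for a list of li.content-list__item elements, then the h2.post__title inside each one, then the a.post__title_link inside that heading.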
Explanation:
- create_router_response - Every router method should return a router response like the ones above. These responses are passed to the parser, which handles them.
  a) path - Metadata about the URL of the page
  b) linter - The hierarchy of HTML elements to match
- create_router_response_from_json - Same as create_router_response, but you provide the path to a JSON file and the linter schema is loaded from it. The JSON structure should be the same as the linter's.
- _fb prefix - You can register filters for your router. In this example, the filter is declared by adding the _fb prefix to the method name, which registers the method in the class as a filter. This filter removes None elements from the list and returns the processed data.
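To illustrate the idea of registering filters by a name prefix, here is a minimal sketch of how such discovery could work in plain Python. It is an assumption made for illustration, not roboparse's actual implementation: the class PrefixFilterMixin and the methods collect_filters and apply_filters are hypothetical names. Only _fb_exclude_none_blocks comes from the example above.

```python
from typing import Any, Callable, List


class PrefixFilterMixin:
    """Hypothetical stand-in for a router base class; not roboparse code."""

    FILTER_PREFIX = "_fb_"

    def collect_filters(self) -> List[Callable[[Any], Any]]:
        # Find every callable attribute whose name starts with the prefix.
        names = sorted(n for n in dir(self) if n.startswith(self.FILTER_PREFIX))
        return [getattr(self, n) for n in names if callable(getattr(self, n))]

    def apply_filters(self, data: Any) -> Any:
        # Pass the data through each discovered filter, in name order.
        for filter_func in self.collect_filters():
            data = filter_func(data)
        return data


class BlogFilters(PrefixFilterMixin):
    def _fb_exclude_none_blocks(self, data):
        return [element for element in data if element is not None]


cleaned = BlogFilters().apply_filters(["post 1", None, "post 2", None])
print(cleaned)  # ['post 1', 'post 2']
```

Keeping filters in a separate class, as in the small-router example, lets several routers share them through inheritance.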
See code example at example/scraper.py
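One way to prepare the file for create_router_response_from_json is to dump the linter dictionary with the standard json module. This sketch assumes the file holds exactly the same structure as the inline linter, as the explanation above says. It writes to a temporary directory so it can be run anywhere, and it does not call roboparse itself.

```python
import json
import tempfile
from pathlib import Path

# Same linter schema as in the router examples.
linter = {
    "type": "LIST",
    "tag": "li",
    "attrs": {"class": "content-list__item"},
    "children": {
        "type": "ELEMENT",
        "tag": "h2",
        "attrs": {"class": "post__title"},
        "children": {
            "type": "ELEMENT",
            "tag": "a",
            "attrs": {"class": "post__title_link"},
        },
    },
}

# Write the schema to json_file.json (inside a temporary directory here).
schema_path = Path(tempfile.mkdtemp()) / "json_file.json"
schema_path.write_text(json.dumps(linter, indent=4), encoding="utf-8")

# Reading the file back yields the same dictionary.
loaded = json.loads(schema_path.read_text(encoding="utf-8"))
print(loaded == linter)  # True
```

In a real project you would write the file next to your scraper and pass its path to create_router_response_from_json.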