Web tools and interfaces for Internet data processing.
webtoolkit
webtoolkit provides utilities and interfaces for processing and managing Internet data, including URL parsing, HTTP status handling, page type recognition (HTML, RSS, OPML), and support for integrating crawling systems.
Features
- URL parsing and cleaning
- HTTP status code classification
- Page abstraction interfaces (HtmlPage, RssPage, OpmlPage, etc.)
- Interfaces for integrating with crawling systems
Remote crawling is supported via crawler-buddy, which provides various crawlers and handlers built on the interfaces from this package.
Available on PyPI.
Install with:
pip install webtoolkit
URL parsing
Sanitize link and remove trackers:
link = UrlLocation.get_cleaned_link(link)
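To illustrate what removing trackers from a link can involve, here is a minimal standard-library sketch. It is not webtoolkit's implementation: the function name `strip_trackers` and the list of tracking parameters are assumptions made for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracker parameters; the set that get_cleaned_link() strips may differ.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def strip_trackers(link: str) -> str:
    """Return the link with the assumed tracking query parameters removed."""
    parts = urlsplit(link)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_NAMES and not name.startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the link with only the parameters that survived the filter.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_trackers("https://example.com/post?id=7&utm_source=news&fbclid=abc"))
```

Filtering on parsed name/value pairs rather than on the raw string keeps unrelated parameters such as `id` intact.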
Extract domain name:
domain = UrlLocation(link).get_domain()
Parse a link into its parts [TBD]. The result should expose .scheme, .domain, .location and .args.
location = UrlLocation(link)
parsed_data = location.parse_url()
link = location.join(parsed_data)  # joins the parsed data back into a link
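Since the parsing API is still marked TBD, the following is only a sketch of the parse/join round trip using the standard library. `ParsedLink`, `parse_link` and `join_link` are hypothetical names; only the four field names are taken from the text above.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class ParsedLink(NamedTuple):
    # Hypothetical container; the field names follow the description above.
    scheme: str
    domain: str
    location: str
    args: str


def parse_link(link: str) -> ParsedLink:
    """Split a link into scheme, domain, location (path) and args (query)."""
    parts = urlsplit(link)
    return ParsedLink(parts.scheme, parts.netloc, parts.path, parts.query)


def join_link(parsed: ParsedLink) -> str:
    """Join the parsed parts back into a link."""
    link = f"{parsed.scheme}://{parsed.domain}{parsed.location}"
    return f"{link}?{parsed.args}" if parsed.args else link


parsed = parse_link("https://example.com/a/b?x=1")
print(parsed)
print(join_link(parsed))
```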
Go up in the link structure: first to the parent location, then to the domain, then to the domain's parent (super-domain).
location = UrlLocation(link).up()
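The order of steps can be sketched with the standard library as follows. This is a naive illustration, not the behaviour of `up()` itself: the helper name `go_up` is hypothetical, and the code ignores ports, IP addresses and multi-label public suffixes such as `co.uk`.

```python
from typing import Optional
from urllib.parse import urlsplit


def go_up(link: str) -> Optional[str]:
    """Return the next location up from link, or None when there is none."""
    parts = urlsplit(link)
    base = f"{parts.scheme}://{parts.netloc}"
    segments = [segment for segment in parts.path.split("/") if segment]
    if segments:
        # Drop the last path segment: parent location, eventually the bare domain.
        parent_path = "/".join(segments[:-1])
        return f"{base}/{parent_path}" if parent_path else base
    labels = parts.netloc.split(".")
    if len(labels) > 2:
        # Already at the domain: drop the leftmost label to reach its parent.
        return f"{parts.scheme}://{'.'.join(labels[1:])}"
    return None


link = "https://blog.example.com/a/b"
while link is not None:
    print(link)
    link = go_up(link)
```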
Check whether a link is an onion address:
UrlLocation(link).is_onion()
Page definitions
HTML pages
page = HtmlPage(url, contents)
page.get_title()
page.get_description()
RSS pages
page = RssPage(url, contents)
page.get_title()
page.get_description()
page.get_entries()
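For readers unfamiliar with the feed format, this standard-library sketch shows the kind of data an RSS page carries. It does not use RssPage, and the dictionary shape returned by the hypothetical `rss_entries` helper is an assumption; what `get_entries()` actually returns is not specified here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

RSS = """<rss version="2.0"><channel>
<title>Example feed</title>
<description>Sample feed</description>
<item><title>First</title><link>https://example.com/1</link></item>
<item><title>Second</title><link>https://example.com/2</link></item>
</channel></rss>"""


def rss_entries(contents: str) -> List[Dict[str, Optional[str]]]:
    """Collect the title and link of every <item> in an RSS 2.0 document."""
    channel = ET.fromstring(contents).find("channel")
    if channel is None:
        return []
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in channel.iter("item")
    ]


for entry in rss_entries(RSS):
    print(entry["title"], entry["link"])
```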
OPML pages
page = OpmlPage(url, contents)
page.get_entries()
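An OPML document is a nested list of outlines, typically used to exchange feed subscriptions. The sketch below reads feed URLs with the standard library only; it is independent of OpmlPage, and `opml_feed_urls` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import List

OPML = """<opml version="2.0">
  <body>
    <outline text="Example feed" type="rss" xmlUrl="https://example.com/feed.xml"/>
    <outline text="Folder">
      <outline text="Nested feed" type="rss" xmlUrl="https://example.org/rss"/>
    </outline>
  </body>
</opml>"""


def opml_feed_urls(contents: str) -> List[str]:
    """Return the xmlUrl of every outline, including nested ones."""
    root = ET.fromstring(contents)
    return [
        outline.attrib["xmlUrl"]
        for outline in root.iter("outline")
        if "xmlUrl" in outline.attrib
    ]


print(opml_feed_urls(OPML))
```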
Content processing
Extract links from contents:
ContentLinkParser().get_links()
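As a rough idea of what link extraction involves, here is a standard-library sketch that collects `href` values from anchor tags and resolves them against the page URL. It is not how ContentLinkParser works internally; `LinkCollector` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(contents: str, base_url: str) -> List[str]:
    """Return absolute links found in contents, resolved against base_url."""
    collector = LinkCollector()
    collector.feed(contents)
    return [urljoin(base_url, href) for href in collector.links]


html_text = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
print(extract_links(html_text, "https://example.com/"))
```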
Check if contents are captcha protected:
ContentInterface().is_captcha_protected()
Obtain text ready for display
ContentText(text).htmlify() # returns text, where http links are turned into HTML links
ContentText(text).noattrs() # removes HTML attributes
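Turning plain http links into HTML links can be sketched with a regular expression. This is an illustration of the idea, not the code behind `htmlify()`: the helper `linkify` is hypothetical, it additionally HTML-escapes the surrounding text, and the simple pattern will also capture punctuation that directly follows a URL.

```python
import html
import re

# Simplified pattern: a URL runs until whitespace, a quote or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def linkify(text: str) -> str:
    """Escape text and wrap every http(s) URL in an anchor tag."""
    pieces = []
    last = 0
    for match in URL_PATTERN.finditer(text):
        url = html.escape(match.group(0))
        pieces.append(html.escape(text[last:match.start()]))
        pieces.append(f'<a href="{url}">{url}</a>')
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)


print(linkify("See https://example.com now"))
```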
Status analysis. Note that for some status codes we cannot know whether the page is OK or not.
is_status_code_valid(status_code) # tells whether the status code indicates the page is OK
is_status_code_invalid(status_code) # tells whether the status code indicates the page is invalid
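Having two separate checks allows a third outcome: a status code that is neither confirmed valid nor confirmed invalid. A minimal sketch of such a three-way classification follows. The ranges and the set of uncertain codes are assumptions chosen for illustration, not the rules webtoolkit applies, and `Verdict` and `classify_status` are hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNKNOWN = "unknown"


# Assumed: codes that often signal throttling or a temporary outage.
UNCERTAIN_CODES = {429, 503}


def classify_status(status_code: int) -> Verdict:
    """Classify a status code as valid, invalid, or not yet decidable."""
    if status_code in UNCERTAIN_CODES:
        return Verdict.UNKNOWN
    if 200 <= status_code < 400:
        return Verdict.VALID
    if 400 <= status_code < 600:
        return Verdict.INVALID
    return Verdict.UNKNOWN


for code in (200, 404, 429):
    print(code, classify_status(code).value)
```

A caller can retry later on `UNKNOWN` instead of discarding the page.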
Standard interfaces
There are two standard interfaces:
- CrawlerInterface - Standard interface for crawler implementations
- HandlerInterface - Allows implementing custom handlers for different use cases
Crawlers are different means of obtaining Internet data. Examples: requests, selenium, playwright, httpx, curl_cffi. This package does not provide them, which keeps it clean and lean.
Handlers are classes that allow automatic deduction of links, places, and video codes from links or data. Example: a YouTube handler can use yt-dlp to obtain a channel's video list, its channel ID, etc.
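The value of a common crawler interface is that calling code does not care which backend fetches the data. The sketch below shows the pattern with an abstract base class. All names here (`SketchCrawler`, `SketchResponse`, `StaticCrawler`, `fetch_text`) are hypothetical, and the method signature is an assumption; the real CrawlerInterface may look different.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SketchResponse:
    url: str
    status_code: int
    text: str


class SketchCrawler(ABC):
    """Hypothetical stand-in for a crawler interface."""

    @abstractmethod
    def run(self, url: str) -> SketchResponse:
        """Fetch url and return a response."""


class StaticCrawler(SketchCrawler):
    """Serves canned pages; a real crawler would wrap requests, selenium, etc."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.pages = pages

    def run(self, url: str) -> SketchResponse:
        if url in self.pages:
            return SketchResponse(url, 200, self.pages[url])
        return SketchResponse(url, 404, "")


def fetch_text(crawler: SketchCrawler, url: str) -> Optional[str]:
    """Depends only on the interface, so any crawler can be swapped in."""
    response = crawler.run(url)
    return response.text if response.status_code == 200 else None


crawler = StaticCrawler({"https://example.com/": "<html>hello</html>"})
print(fetch_text(crawler, "https://example.com/"))
print(fetch_text(crawler, "https://example.com/missing"))
```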
Default user agent
webtoolkit.get_default_user_agent()
Default headers
webtoolkit.get_default_headers()
HTTP processing
The HTTP request object allows you to make an HTTP call.
request = PageRequestObject()
To send a request to any scraping / crawling server, just encode it as GET parameters:
url_data = request_encode(request)
json_data = request_to_json(request) # json
request = json_to_request(json_data) # json
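The two transport forms, GET parameters and JSON, can be illustrated with a small standard-library round trip. The `SketchRequest` class and its fields are hypothetical and much simpler than PageRequestObject; the helpers only mirror the idea behind `request_encode`, `request_to_json` and `json_to_request`, not their implementation.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.parse import parse_qsl, urlencode


@dataclass
class SketchRequest:
    # Hypothetical fields chosen for this example.
    url: str
    timeout_s: int = 20
    user_agent: Optional[str] = None


def encode_as_params(request: SketchRequest) -> str:
    """Encode the request as a GET query string, skipping unset fields."""
    fields = {key: value for key, value in asdict(request).items() if value is not None}
    return urlencode(fields)


def decode_params(query: str) -> SketchRequest:
    """Rebuild the request from a GET query string."""
    fields = dict(parse_qsl(query))
    if "timeout_s" in fields:
        # Query strings carry text only, so numeric fields are converted back.
        fields["timeout_s"] = int(fields["timeout_s"])
    return SketchRequest(**fields)


def to_json(request: SketchRequest) -> str:
    return json.dumps(asdict(request))


def from_json(data: str) -> SketchRequest:
    return SketchRequest(**json.loads(data))


original = SketchRequest(url="https://example.com/a?b=1")
print(encode_as_params(original))
print(to_json(original))
```

Percent-encoding the target URL matters here, because it contains its own `?` and `=` characters that would otherwise be confused with the outer query string.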
Check for valid HTTP responses:
PageResponseObject().is_valid()
Check for invalid HTTP responses:
PageResponseObject().is_invalid()
Check whether a response is captcha protected:
PageResponseObject().is_captcha_protected()
Note: Some status codes may indicate uncertain results (e.g. throttling), where the page cannot be confirmed as valid or invalid yet.
Response communication is done via JSON
json_data = response_to_json(response)
response = json_to_response(json_data)
To obtain page contents object:
page = PageResponseObject().get_page() # for example could be HtmlPage
Remote interfaces
You can implement scraping servers yourself. Communication between remotes uses PageRequestObject and PageResponseObject, either encoded as GET parameters or converted to JSON.
- RemoteServer - Interface for calling external crawling systems
- RemoteUrl - Wrapper around RemoteServer for easy access to remote data
Testing
The package provides data and facilities to aid you in testing.
Want to implement a new RSS parser? Go ahead and use the data.
File details
Details for the file webtoolkit-0.0.70.tar.gz.
File metadata
- Download URL: webtoolkit-0.0.70.tar.gz
- Upload date:
- Size: 377.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.11.2 Linux/6.12.20+rpt-rpi-v8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c23af03c280fbf99bfea4bc7ae3823491cb158379398c428b56b02b92cde6d64 |
| MD5 | 39469953309fca5a39b295601bc93021 |
| BLAKE2b-256 | 39464a0c1c4a4ac18afe4d388d708e14be268d369ed8fd7532b22ad339a113d2 |
File details
Details for the file webtoolkit-0.0.70-py3-none-any.whl.
File metadata
- Download URL: webtoolkit-0.0.70-py3-none-any.whl
- Upload date:
- Size: 395.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.11.2 Linux/6.12.20+rpt-rpi-v8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9e0cb174b4572194809414639d31f31eb914ae16d78cf7dbdfde4c054421bf8b |
| MD5 | 4d263778fcf186a34b114945ceb6f020 |
| BLAKE2b-256 | 9d191128f572161ca844ce8624a57ab414e5f912c123fd681a21c9e0fe325dc3 |