Web tools and interfaces for Internet data processing.

webtoolkit

webtoolkit provides utilities and interfaces for processing and managing Internet data, including URL parsing, HTTP status handling, page type recognition (HTML, RSS, OPML), and support for integrating crawling systems.

Features

  • URL parsing and cleaning
  • HTTP status code classification
  • Page abstraction interfaces (HtmlPage, RssPage, OpmlPage, etc.)
  • Interfaces for integrating with crawling systems

Remote crawling is supported via crawler-buddy, which provides various crawlers and handlers built on the interfaces from this package.

Available on PyPI.

Install with:

pip install webtoolkit

URL parsing

Sanitize a link and remove trackers:

link = UrlLocation.get_cleaned_link(link)
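
To make "removing trackers" concrete, here is a minimal standard-library sketch of the idea. It is not webtoolkit's implementation: the function name strip_trackers and the list of tracking parameters are assumptions chosen for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed, non-exhaustive tracking parameters (not taken from webtoolkit).
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def strip_trackers(link: str) -> str:
    """Return the link without the tracking query parameters listed above."""
    parts = urlsplit(link.strip())
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_NAMES and not name.startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the link with only the remaining query parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = strip_trackers("https://example.com/a?id=1&utm_source=x&fbclid=abc")
```

Parameters that are not on the list, such as id above, are kept in their original order.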

Extract domain name:

domain = UrlLocation(link).get_domain()

Parse a link into its parts [TBD]. It should return .scheme, .domain, .location and .args.

location = UrlLocation(link)
parsed_data = location.parse_url()
link = location.join(parsed_data)  # joins the parsed data back into a link
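
Since the parsing API is still marked [TBD], the following sketch only shows the same split-and-join round trip with Python's urllib.parse. Mapping netloc, path and query to the domain, location and args parts is an assumption, not the package's documented behavior.

```python
from urllib.parse import urlsplit, urlunsplit

link = "https://www.example.com/blog/post?id=7"
parts = urlsplit(link)

scheme = parts.scheme    # "https"
domain = parts.netloc    # "www.example.com"
location = parts.path    # "/blog/post"
args = parts.query       # "id=7"

# Join the parts back into a link (the last element is the empty fragment).
rejoined = urlunsplit((scheme, domain, location, args, ""))
```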

Go up in the link structure: first to the parent location, then to the domain, then to the super-domain (the domain one level up).

location = UrlLocation(link).up()

Check whether a link points to an onion address:

UrlLocation(link).is_onion()
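
The "go up" order described above can be sketched with the standard library. The helper names go_up and looks_like_onion are hypothetical and the logic is deliberately naive: it drops query, fragment and port, and it does not consult the public suffix list, so a host such as example.co.uk would be shortened incorrectly.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def go_up(link: str) -> Optional[str]:
    """Return the next location up, or None when there is nowhere to go."""
    parts = urlsplit(link)
    segments = [segment for segment in parts.path.split("/") if segment]
    if segments:
        # Step 1: drop the last path segment until only the domain is left.
        path = "/".join(segments[:-1])
        path = "/" + path if path else ""
        return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
    labels = (parts.hostname or "").split(".")
    if len(labels) > 2:
        # Step 2: drop the leftmost label to reach the super-domain.
        return urlunsplit((parts.scheme, ".".join(labels[1:]), "", "", ""))
    return None


def looks_like_onion(link: str) -> bool:
    """Naive check: the host name ends with the .onion suffix."""
    return (urlsplit(link).hostname or "").endswith(".onion")


step_1 = go_up("https://blog.example.com/a/b")
step_2 = go_up(step_1)
step_3 = go_up(step_2)
step_4 = go_up(step_3)
```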

Page definitions

HTML pages

page = HtmlPage(url, contents)
page.get_title()
page.get_description()
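
For readers who want to see what title and description extraction involves, here is a minimal sketch using Python's html.parser. It is not how HtmlPage is implemented; the class name TitleAndDescriptionParser is hypothetical, and only the title tag and the description meta tag are considered.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleAndDescriptionParser(HTMLParser):
    """Collects the <title> text and the description <meta> content."""

    def __init__(self) -> None:
        super().__init__()
        self.title: Optional[str] = None
        self.description: Optional[str] = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append instead of assign.
        if self._in_title:
            self.title = (self.title or "") + data


contents = (
    "<html><head><title>Example page</title>"
    '<meta name="description" content="A short summary.">'
    "</head><body></body></html>"
)
parser = TitleAndDescriptionParser()
parser.feed(contents)
```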

RSS pages

page = RssPage(url, contents)
page.get_title()
page.get_description()
page.get_entries()
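
As an illustration of the data an RSS page exposes, the sketch below reads the channel title, description and entries from a small RSS 2.0 document with xml.etree.ElementTree. The sample feed and the dictionary shape of each entry are assumptions; RssPage may return entries in a different form.

```python
import xml.etree.ElementTree as ET

contents = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <description>Feed description</description>
    <item><title>First</title><link>https://example.com/1</link></item>
    <item><title>Second</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

channel = ET.fromstring(contents).find("channel")
title = channel.findtext("title")
description = channel.findtext("description")

# One dictionary per <item>, in document order.
entries = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in channel.findall("item")
]
```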

OPML pages

page = OpmlPage(url, contents)
page.get_entries()
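
An OPML file is an outline of feed subscriptions, so its entries are typically the outline elements that carry an xmlUrl attribute. A minimal standard-library sketch, assuming that interpretation of "entries" and using a made-up sample document:

```python
import xml.etree.ElementTree as ET

contents = """<opml version="2.0">
  <head><title>Subscriptions</title></head>
  <body>
    <outline text="News">
      <outline text="Example" type="rss" xmlUrl="https://example.com/feed.xml"/>
    </outline>
    <outline text="Other" type="rss" xmlUrl="https://example.org/rss"/>
  </body>
</opml>"""

root = ET.fromstring(contents)

# Outlines can be nested, so walk the whole tree and keep those with a feed URL.
feeds = [
    outline.get("xmlUrl")
    for outline in root.iter("outline")
    if outline.get("xmlUrl")
]
```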

Content processing

Extract links from contents:

ContentLinkParser().get_links()
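
The idea behind link extraction can be sketched with html.parser and urljoin. This is not ContentLinkParser itself: the class name LinkCollector is hypothetical, only anchor tags are inspected, and relative links are resolved against an assumed base URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects unique absolute links from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            # Keep first-seen order and skip duplicates.
            if absolute not in self.links:
                self.links.append(absolute)


collector = LinkCollector("https://example.com/blog/")
collector.feed(
    '<a href="post-1">One</a> <a href="/about">About</a> <a href="post-1">Again</a>'
)
```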

Check if contents are captcha protected:

ContentInterface().is_captcha_protected()

Standard interfaces

There are two standard interfaces:

  • CrawlerInterface - Standard interface for crawler implementations
  • HandlerInterface - Allows implementing custom handlers for different use cases

Crawlers are different means of obtaining Internet data. Examples: requests, selenium, playwright, httpx, curl_cffi. This package does not provide them, which keeps it small and free of heavy dependencies.

Handlers are classes that automatically deduce links, places, or video codes from links or data. Example: a YouTube handler can use yt-dlp to obtain a channel's video list or its channel ID.
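
To illustrate what "deducing video codes from links" can look like, here is a small handler sketch. The names SketchHandler and SketchYouTubeHandler and their methods are hypothetical; they do not reproduce HandlerInterface, and no network access or yt-dlp call is involved.

```python
from abc import ABC, abstractmethod
from typing import Optional
from urllib.parse import parse_qs, urlsplit


class SketchHandler(ABC):
    """Hypothetical handler shape: decide whether a link applies, then extract data."""

    @abstractmethod
    def can_handle(self, url: str) -> bool: ...

    @abstractmethod
    def get_code(self, url: str) -> Optional[str]: ...


class SketchYouTubeHandler(SketchHandler):
    HOSTS = {"youtube.com", "www.youtube.com", "youtu.be"}

    def can_handle(self, url: str) -> bool:
        return (urlsplit(url).hostname or "") in self.HOSTS

    def get_code(self, url: str) -> Optional[str]:
        parts = urlsplit(url)
        if parts.hostname == "youtu.be":
            # Short links carry the video code in the path.
            return parts.path.lstrip("/") or None
        # Regular watch links carry it in the "v" query parameter.
        return parse_qs(parts.query).get("v", [None])[0]


handler = SketchYouTubeHandler()
code = handler.get_code("https://www.youtube.com/watch?v=abc123")
```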

Default user agents

webtoolkit.default_user_agents

Default headers

webtoolkit.default_headers

HTTP processing

The request object is used to make an HTTP call.

request = PageRequestObject()

To send a request to a scraping / crawling server, encode it as GET parameters [TBD]:

encoded_data = encode_request(request)
request = decode_request(encoded_data)
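
Because this part of the API is marked [TBD], the sketch below only demonstrates the general technique of round-tripping a request through GET parameters with urllib.parse. SketchRequest and its fields are hypothetical stand-ins, not the fields of PageRequestObject.

```python
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.parse import parse_qsl, urlencode


@dataclass
class SketchRequest:
    """Hypothetical request object; the field names are assumptions."""

    url: str
    timeout_s: int = 10
    user_agent: Optional[str] = None


def encode_sketch_request(request: SketchRequest) -> str:
    # Unset (None) fields are omitted; urlencode percent-encodes the values.
    fields = {key: value for key, value in asdict(request).items() if value is not None}
    return urlencode(fields)


def decode_sketch_request(encoded: str) -> SketchRequest:
    fields = dict(parse_qsl(encoded))
    if "timeout_s" in fields:
        # Query strings carry text only, so restore the integer type.
        fields["timeout_s"] = int(fields["timeout_s"])
    return SketchRequest(**fields)


original = SketchRequest(url="https://example.com/?a=1&b=2", timeout_s=20)
encoded = encode_sketch_request(original)
restored = decode_sketch_request(encoded)
```

Percent-encoding matters here: the target URL has its own query string, which must not be mistaken for parameters of the outer request.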

Check for valid HTTP responses:

PageResponseObject().is_valid()

Check for invalid HTTP responses:

PageResponseObject().is_invalid()

Check if a response is captcha protected [TBD]:

PageResponseObject().is_captcha_protected()

Note: Some status codes may indicate uncertain results (e.g. throttling), where the page cannot be confirmed as valid or invalid yet.
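
The three outcomes described above (valid, invalid, uncertain) can be sketched as a simple classifier. The exact code sets are assumptions made for illustration; the rules used by is_valid() and is_invalid() in webtoolkit may differ.

```python
from enum import Enum


class Validity(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNCERTAIN = "uncertain"


# Assumed "try again later" codes: timeout, throttling, temporary unavailability.
UNCERTAIN_CODES = {408, 429, 503, 504}


def classify_status(status_code: int) -> Validity:
    if status_code in UNCERTAIN_CODES:
        return Validity.UNCERTAIN
    # Assumption: successful and redirect responses count as valid.
    if 200 <= status_code < 400:
        return Validity.VALID
    return Validity.INVALID
```

A caller would typically retry uncertain results later instead of marking the page as dead.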

Responses are exchanged as JSON:

json_data = response_to_json(response)
response = json_to_response(json_data)
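
A minimal sketch of such a JSON round trip, using a dataclass and the json module. SketchResponse and its fields are hypothetical and do not mirror PageResponseObject; the sketch also assumes text content only, since binary content would need an extra encoding step such as base64.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SketchResponse:
    """Hypothetical response object; the field names are assumptions."""

    url: str
    status_code: int
    headers: dict = field(default_factory=dict)
    text: str = ""


def sketch_response_to_json(response: SketchResponse) -> str:
    return json.dumps(asdict(response))


def sketch_json_to_response(json_data: str) -> SketchResponse:
    return SketchResponse(**json.loads(json_data))


original = SketchResponse(
    url="https://example.com",
    status_code=200,
    headers={"Content-Type": "text/html"},
    text="<html></html>",
)
json_data = sketch_response_to_json(original)
restored = sketch_json_to_response(json_data)
```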

Obtain the page contents object [TBD]:

page = PageResponseObject().get_page()   # could be, for example, an HtmlPage

Remote interfaces

You can implement scraping servers yourself. Communication between remotes uses PageRequestObject and PageResponseObject instances, encoded as GET parameters or converted to JSON.

  • RemoteServer - Interface for calling external crawling systems
  • RemoteUrl - Wrapper around RemoteServer for easy access to remote data
