Web tools and interfaces for Internet data processing.

These details have not been verified by PyPI

Project description

webtoolkit

webtoolkit provides utilities and interfaces for processing and managing Internet data, including URL parsing, HTTP status handling, page type recognition (HTML, RSS, OPML), and support for integrating crawling systems.

Features

URL parsing and cleaning
HTTP status code classification
Page abstraction interfaces (HtmlPage, RssPage, OpmlPage, etc.)
Interfaces for integrating with crawling systems

Remote crawling is supported via crawler-buddy, which provides various crawlers and handlers built on the interfaces from this package.

Available on PyPI.

Install with:

pip install webtoolkit

Url processing

To obtain the data for a URL:

from webtoolkit import BaseUrl  # assumes BaseUrl is exported at the package top level

url = BaseUrl("https://example.com")

response = url.get_response()

url.get_title()
url.get_description()
url.get_language()
url.get_date_published()
url.get_author()
url.get_feeds()
url.get_entries()

BaseUrl automatically detects and supports many different page types, including YouTube, GitHub, Reddit, and others.

Chain of data

url = BaseUrl("https://example.com")

response = url.get_response()
handler = url.get_handler()
page = handler.get_page()

Page definitions

BaseUrl supports various page types through different classes.

HTML pages

page = HtmlPage(url, contents)
page.get_title()
page.get_description()
page.get_language()
page.get_date_published()
page.get_author()
page.get_feeds()

RSS pages

page = RssPage(url, contents)
page.get_title()
page.get_description()
page.get_language()
page.get_date_published()
page.get_author()
page.get_entries()
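
To make the title and entry accessors above more concrete, here is a minimal standard-library sketch of reading the same kind of data from an RSS 2.0 document. It is not webtoolkit's implementation; the function name `rss_summary` and the returned dictionary shape are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def rss_summary(contents: str) -> dict:
    """Return the channel title and a list of entries from RSS 2.0 text."""
    channel = ET.fromstring(contents).find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")
    return {
        "title": channel.findtext("title"),
        "entries": [
            {"title": item.findtext("title"), "link": item.findtext("link")}
            for item in channel.findall("item")
        ],
    }


SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First post</title><link>https://example.com/1</link></item>
</channel></rss>"""

summary = rss_summary(SAMPLE_RSS)
print(summary["title"], len(summary["entries"]))
```

Atom feeds and namespaced extensions are ignored here, which is one reason to prefer a page class that already handles them.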

OPML pages

page = OpmlPage(url, contents)
page.get_entries()
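
As a rough illustration of what OPML entries are, the sketch below collects feed URLs from `outline` elements using only the standard library. It is an assumption-laden stand-in, not the OpmlPage code; the helper name `opml_feed_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET


def opml_feed_urls(contents: str) -> list:
    """Collect the xmlUrl attribute of every outline element, nested or not."""
    root = ET.fromstring(contents)
    return [
        outline.attrib["xmlUrl"]
        for outline in root.iter("outline")
        if "xmlUrl" in outline.attrib
    ]


SAMPLE_OPML = """<opml version="2.0"><head><title>Feeds</title></head><body>
<outline text="News">
<outline text="Example" type="rss" xmlUrl="https://example.com/feed.xml"/>
</outline>
</body></opml>"""

print(opml_feed_urls(SAMPLE_OPML))
```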

Url location processing

Sanitize link and remove trackers:

location = UrlLocation(link).get_clean()
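
For readers curious what tracker removal can look like, here is a minimal sketch using `urllib.parse`. The list of tracking parameters and the function name `strip_trackers` are assumptions; the parameters that `get_clean()` actually removes may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracker list, chosen for illustration only.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid"}


def strip_trackers(link: str) -> str:
    """Return the link without known tracking query parameters."""
    parts = urlsplit(link)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_NAMES and not name.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_trackers("https://example.com/page?id=7&utm_source=feed&fbclid=abc"))
```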

Extract domain name:

location = UrlLocation(link).get_domain()

Parse and reconstruct links:

location = UrlLocation(link)
parsed_data = location.parse_url()
link = location.join(parsed_data)  # joins the parsed data back into a link

Navigate up the URL structure: go up in the link hierarchy, first to the parent path, then to the domain, and finally to the domain root.

location = UrlLocation(link).up()
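
The three-step climb described above can be sketched with the standard library as follows. This is only an approximation under stated assumptions: the name `go_up` is hypothetical, the subdomain handling is naive, and what `up()` returns at the top of the hierarchy is not specified here, so the sketch returns None.

```python
from urllib.parse import urlsplit


def go_up(link: str):
    """Return the next link up the hierarchy, or None when nothing is left."""
    parts = urlsplit(link)
    segments = [segment for segment in parts.path.split("/") if segment]
    if segments:
        # Drop the last path segment: /a/b -> /a, and /a -> no path at all.
        parent = "/".join(segments[:-1])
        suffix = "/" + parent if parent else ""
        return f"{parts.scheme}://{parts.netloc}{suffix}"
    labels = parts.netloc.split(".")
    if len(labels) > 2:
        # Naive subdomain stripping; ignores multi-label suffixes such as co.uk.
        return f"{parts.scheme}://{'.'.join(labels[1:])}"
    return None


link = "https://blog.example.com/a/b"
while link is not None:
    print(link)
    link = go_up(link)
```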

Check link properties:

is_onion = UrlLocation(link).is_onion()
is_domain = UrlLocation(link).is_domain()
is_image = UrlLocation(link).is_image()
is_audio = UrlLocation(link).is_audio()
is_video = UrlLocation(link).is_video()

is_web_link = UrlLocation(link).is_web_link()          # https://example.com/file.js is a web link
is_webpage_link = UrlLocation(link).is_webpage_link()  # https://example.com/file.js is not a webpage link
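
One way to picture the difference between a web link and a webpage link is a suffix check, sketched below. The suffix list and both function bodies are assumptions made for illustration; the library may use different rules.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical list of suffixes that point at assets rather than pages.
NON_PAGE_SUFFIXES = {".js", ".css", ".png", ".jpg", ".mp3", ".mp4"}


def is_web_link(link: str) -> bool:
    """True for any http(s) link that names a host."""
    parts = urlsplit(link)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_webpage_link(link: str) -> bool:
    """True for web links whose path does not end in an asset suffix."""
    suffix = PurePosixPath(urlsplit(link).path).suffix.lower()
    return is_web_link(link) and suffix not in NON_PAGE_SUFFIXES


print(is_web_link("https://example.com/file.js"))
print(is_webpage_link("https://example.com/file.js"))
```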

Content processing

Internet contents can be parsed in various ways.

Extract links from contents:

ContentLinkParser(contents).get_links()
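
To show the general idea behind link extraction, here is a small sketch built on the standard library's `html.parser`. It is not the ContentLinkParser implementation; `LinkCollector` and `extract_links` are hypothetical names, and only `<a href>` links are collected.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of anchor tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, href))


def extract_links(contents: str, base_url: str) -> list:
    collector = LinkCollector(base_url)
    collector.feed(contents)
    return collector.links


html_text = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
print(extract_links(html_text, "https://example.com/"))
```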

Obtain text ready for display:

ContentText(text).htmlify()  # returns text, where http links are turned into HTML links
ContentText(text).noattrs()  # removes HTML attributes
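
Turning plain http links into HTML anchors can be approximated with a regular expression, as in the sketch below. The name `linkify` and the pattern are assumptions; `htmlify()` may treat escaping and trailing punctuation differently.

```python
import html
import re

# Simple pattern: an http(s) scheme followed by anything up to whitespace,
# an angle bracket, or a double quote.
LINK_RE = re.compile(r"https?://[^\s<>\"]+")


def linkify(text: str) -> str:
    """Escape the text for HTML, then wrap each http(s) link in an anchor."""
    escaped = html.escape(text, quote=False)
    return LINK_RE.sub(
        lambda match: f'<a href="{match.group(0)}">{match.group(0)}</a>',
        escaped,
    )


print(linkify("see https://example.com/a now"))
```

A link followed directly by a period or comma would swallow that character, so real input usually needs a stricter pattern.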

Status analysis. Note that some status codes do not tell us whether the page is OK.

is_status_code_valid(status_code)   # tells whether the status code indicates the page is OK
is_status_code_invalid(status_code) # tells whether the status code indicates the page is invalid
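
Because "valid" and "invalid" are separate checks, a status code can fall into neither group. The sketch below models that as a three-way classification. The code ranges, the set of uncertain codes, and the names `Verdict` and `classify_status` are assumptions, not the library's actual rules.

```python
from enum import Enum


class Verdict(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNKNOWN = "unknown"


# Assumption: codes that often mean "blocked or throttled, try again later".
UNCERTAIN_CODES = {403, 429, 503}


def classify_status(status_code: int) -> Verdict:
    """Sort a status code into valid, invalid, or not yet decidable."""
    if status_code in UNCERTAIN_CODES:
        return Verdict.UNKNOWN
    if 200 <= status_code < 400:
        return Verdict.VALID
    if 400 <= status_code < 600:
        return Verdict.INVALID
    return Verdict.UNKNOWN


for code in (200, 404, 429):
    print(code, classify_status(code).value)
```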

HTTP processing - requests

Communication is performed via request-response pairs.

The HTTP request object lets you make an HTTP call.

request = PageRequestObject()

To send a request to a scraping or crawling server, encode it as GET parameters:

url_data = request_encode(request)

json_data = request_to_json(request)  # json
request = json_to_request(json_data)  # json
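
The two encodings above (GET parameters and JSON) can be illustrated with a stand-in request type. `SketchRequest` and its fields are hypothetical and do not mirror PageRequestObject; the point is only that the same data survives both round trips.

```python
import json
from dataclasses import asdict, dataclass
from urllib.parse import parse_qsl, urlencode


@dataclass
class SketchRequest:
    """Hypothetical request with a few illustrative fields."""
    url: str
    timeout_s: int = 20
    user_agent: str = "sketch-agent/0.1"


def to_query(request: SketchRequest) -> str:
    """Encode the request as a GET query string."""
    return urlencode(asdict(request))


def to_json(request: SketchRequest) -> str:
    return json.dumps(asdict(request))


def from_json(data: str) -> SketchRequest:
    return SketchRequest(**json.loads(data))


request = SketchRequest("https://example.com/a?b=1")
query = to_query(request)
print(query)
print(from_json(to_json(request)) == request)
```

The query string percent-encodes the target URL, so it can be appended safely to a crawling server's endpoint.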

HTTP processing - response

Check for valid HTTP responses:

PageResponseObject().is_valid()

Check for invalid HTTP responses:

PageResponseObject().is_invalid()

Check whether a response is captcha-protected:

PageResponseObject().is_captcha_protected()

Note: Some status codes may indicate uncertain results (e.g. throttling), where the page cannot be confirmed as valid or invalid yet.

To obtain the page structure from a response:

PageResponseObject().get_page()   # can return HtmlPage, RssPage, etc.

Response communication is done via JSON:

json_data = response_to_json(response)
response = json_to_response(json_data)

Remote interfaces

You can use existing scraping servers.

RemoteUrl - Wrapper around RemoteServer for easy access to remote data. Provides an API similar to BaseUrl.

url = RemoteUrl("http://192.168.0.168...")
response = url.get_response()

url.get_title()
url.get_description()
url.get_language()
url.get_date_published()
url.get_author()
url.get_feeds()
url.get_entries()

The communication between client and server should be through JSON requests and responses.

Other classes

RemoteServer - Interface for calling external crawling systems

Standard interfaces

There are two standard interfaces:

CrawlerInterface - Standard interface for crawler implementations
HandlerInterface - Allows implementing custom handlers for different use cases

Crawlers are different means of obtaining Internet data. Examples: requests, selenium, playwright, httpx, curl_cffi. This package does not provide them, which keeps it small and focused.

Handlers are classes that allow automatic deduction of links, places, or video codes from links or data. For example, a YouTube handler can use yt-dlp to obtain a channel's video list or its channel ID.
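
To illustrate why a shared crawler interface is useful, here is a sketch in which any fetching backend can be swapped in behind one abstract method. The names `SketchCrawler`, `SketchResponse`, and `fetch` are hypothetical; CrawlerInterface may define different methods and types.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SketchResponse:
    url: str
    status_code: int
    text: str


class SketchCrawler(ABC):
    """Hypothetical crawler contract: one way to fetch a URL."""

    @abstractmethod
    def fetch(self, url: str) -> SketchResponse:
        ...


class StaticCrawler(SketchCrawler):
    """Serves canned pages; a real backend would perform network I/O here."""

    def __init__(self, pages: dict):
        self.pages = pages

    def fetch(self, url: str) -> SketchResponse:
        if url in self.pages:
            return SketchResponse(url, 200, self.pages[url])
        return SketchResponse(url, 404, "")


crawler = StaticCrawler({"https://example.com": "<title>Example</title>"})
print(crawler.fetch("https://example.com").status_code)
print(crawler.fetch("https://example.com/missing").status_code)
```

Code that depends only on the abstract class keeps working whether the backend is a canned-page stub like this one or a browser-driven crawler.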

Default user agent

webtoolkit.get_default_user_agent()

Default headers

webtoolkit.get_default_headers()

Testing

webtoolkit provides data and facilities that will aid you in testing.

You can use them in your project:

FakeResponse
MockUrl

The project also provides manual tests that check whether it works:

make tests
make tests-unit # run unit tests
make tests-real # tests performed on real internet data

Project details

Release history

0.1.52

May 18, 2026

0.1.51

May 1, 2026

0.1.50

Apr 22, 2026

0.1.49

Apr 2, 2026

0.1.48

Mar 30, 2026

0.1.47

Mar 28, 2026

0.1.46

Mar 26, 2026

0.1.45

Mar 24, 2026

0.1.44

Mar 24, 2026

0.1.43

Mar 24, 2026

0.1.42

Mar 19, 2026

0.1.41

Mar 18, 2026

0.1.40

Mar 18, 2026

0.1.39

Mar 18, 2026

0.1.37

Mar 16, 2026

0.1.36

Mar 16, 2026

0.1.35

Mar 15, 2026

0.1.34

Mar 14, 2026

0.1.33

Mar 9, 2026

0.1.32

Mar 9, 2026

0.1.31

Mar 9, 2026

0.1.30

Mar 9, 2026

0.1.29

Mar 9, 2026

0.1.28

Mar 7, 2026

0.1.27

Mar 7, 2026

0.1.26

Mar 6, 2026

0.1.25

Mar 6, 2026

0.1.24

Mar 6, 2026

0.1.23

Mar 6, 2026

0.1.22

Mar 6, 2026

0.1.21

Mar 6, 2026

0.1.20

Mar 5, 2026

0.1.19

Mar 5, 2026

0.1.18

Mar 5, 2026

0.1.17

Mar 5, 2026

0.1.16

Mar 4, 2026

0.1.15

Mar 4, 2026

0.1.14

Mar 4, 2026

0.1.13

Mar 4, 2026

0.1.12

Mar 4, 2026

0.1.11

Mar 3, 2026

0.1.10

Mar 3, 2026

This version

0.1.9

Mar 2, 2026

0.1.8

Mar 1, 2026

0.1.7

Mar 1, 2026

0.1.6

Feb 26, 2026

0.1.5

Feb 26, 2026

0.1.4

Feb 25, 2026

0.1.3

Feb 25, 2026

0.1.2

Feb 25, 2026

0.1.1

Feb 25, 2026

0.1.0

Feb 24, 2026

0.0.207

Feb 22, 2026

0.0.206

Feb 20, 2026

0.0.205

Feb 19, 2026

0.0.204

Feb 19, 2026

0.0.203

Feb 19, 2026

0.0.202

Feb 19, 2026

0.0.201

Feb 18, 2026

0.0.200

Feb 18, 2026

0.0.197

Feb 7, 2026

0.0.196

Feb 7, 2026

0.0.195

Feb 7, 2026

0.0.194

Feb 6, 2026

0.0.193

Feb 5, 2026

0.0.192

Feb 4, 2026

0.0.191

Jan 26, 2026

0.0.190

Jan 22, 2026

0.0.189

Jan 16, 2026

0.0.188

Jan 16, 2026

0.0.187

Jan 16, 2026

0.0.186

Jan 15, 2026

0.0.185

Jan 14, 2026

0.0.184

Jan 12, 2026

0.0.182

Jan 11, 2026

0.0.181

Jan 10, 2026

0.0.180

Jan 10, 2026

0.0.178

Dec 30, 2025

0.0.177

Dec 30, 2025

0.0.176

Dec 23, 2025

0.0.175

Dec 23, 2025

0.0.174

Dec 23, 2025

0.0.173

Dec 23, 2025

0.0.172

Dec 22, 2025

0.0.171

Dec 22, 2025

0.0.170

Dec 22, 2025

0.0.169

Dec 22, 2025

0.0.168

Dec 21, 2025

0.0.167

Dec 21, 2025

0.0.166

Dec 20, 2025

0.0.165

Dec 20, 2025

0.0.164

Dec 20, 2025

0.0.163

Dec 20, 2025

0.0.161

Dec 20, 2025

0.0.160

Dec 20, 2025

0.0.159

Dec 20, 2025

0.0.158

Dec 19, 2025

0.0.157

Dec 19, 2025

0.0.156

Dec 18, 2025

0.0.155

Dec 18, 2025

0.0.154

Dec 18, 2025

0.0.153

Dec 18, 2025

0.0.152

Dec 18, 2025

0.0.151

Dec 12, 2025

0.0.150

Dec 10, 2025

0.0.149

Dec 10, 2025

0.0.148

Dec 10, 2025

0.0.147

Dec 10, 2025

0.0.146

Dec 9, 2025

0.0.145

Dec 9, 2025

0.0.144

Dec 9, 2025

0.0.143

Dec 9, 2025

0.0.141

Dec 8, 2025

0.0.139

Dec 8, 2025

0.0.138

Dec 8, 2025

0.0.137

Dec 5, 2025

0.0.135

Dec 4, 2025

0.0.134

Dec 3, 2025

0.0.133

Nov 22, 2025

0.0.132

Nov 22, 2025

0.0.129

Nov 20, 2025

0.0.127

Nov 17, 2025

0.0.126

Nov 16, 2025

0.0.125

Nov 16, 2025

0.0.124

Nov 16, 2025

0.0.123

Nov 16, 2025

0.0.122

Nov 15, 2025

0.0.121

Nov 15, 2025

0.0.120

Nov 13, 2025

0.0.119

Nov 12, 2025

0.0.118

Nov 12, 2025

0.0.116

Nov 11, 2025

0.0.115

Nov 10, 2025

0.0.113

Nov 7, 2025

0.0.111

Nov 7, 2025

0.0.110

Nov 7, 2025

0.0.109

Nov 7, 2025

0.0.108

Nov 6, 2025

0.0.107

Nov 6, 2025

0.0.106

Nov 5, 2025

0.0.105

Nov 5, 2025

0.0.104

Nov 5, 2025

0.0.103

Nov 5, 2025

0.0.102

Nov 5, 2025

0.0.101

Nov 5, 2025

0.0.100

Nov 5, 2025

0.0.99

Nov 4, 2025

0.0.98

Nov 4, 2025

0.0.97

Nov 4, 2025

0.0.96

Nov 4, 2025

0.0.93

Nov 4, 2025

0.0.91

Nov 1, 2025

0.0.90

Oct 31, 2025

0.0.89

Oct 31, 2025

0.0.88

Oct 31, 2025

0.0.87

Oct 31, 2025

0.0.86

Oct 31, 2025

0.0.85

Oct 31, 2025

0.0.84

Oct 31, 2025

0.0.83

Oct 31, 2025

0.0.82

Oct 30, 2025

0.0.81

Oct 30, 2025

0.0.80

Oct 30, 2025

0.0.78

Oct 30, 2025

0.0.77

Oct 30, 2025

0.0.76

Oct 30, 2025

0.0.75

Oct 30, 2025

0.0.74

Oct 30, 2025

0.0.73

Oct 29, 2025

0.0.72

Oct 29, 2025

0.0.71

Oct 29, 2025

0.0.70

Oct 29, 2025

0.0.69

Oct 28, 2025

0.0.68

Oct 28, 2025

0.0.67

Oct 28, 2025

0.0.66

Oct 27, 2025

0.0.65

Oct 27, 2025

0.0.63

Oct 27, 2025

0.0.62

Oct 26, 2025

0.0.61

Oct 26, 2025

0.0.60

Oct 26, 2025

0.0.59

Oct 26, 2025

0.0.58

Oct 25, 2025

0.0.57

Oct 25, 2025

0.0.56

Oct 24, 2025

0.0.55

Oct 24, 2025

0.0.54

Oct 24, 2025

0.0.53

Oct 24, 2025

0.0.52

Oct 24, 2025

0.0.51

Oct 24, 2025

0.0.50

Oct 24, 2025

0.0.49

Oct 24, 2025

0.0.48

Oct 24, 2025

0.0.47

Oct 23, 2025

0.0.46

Oct 22, 2025

0.0.45

Oct 22, 2025

0.0.44

Oct 22, 2025

0.0.43

Oct 22, 2025

0.0.42

Oct 22, 2025

0.0.41

Oct 22, 2025

0.0.40

Oct 22, 2025

0.0.39

Oct 22, 2025

0.0.38

Oct 21, 2025

0.0.37

Oct 21, 2025

0.0.36

Oct 21, 2025

0.0.35

Oct 21, 2025

0.0.34

Oct 21, 2025

0.0.33

Oct 21, 2025

0.0.32

Oct 21, 2025

0.0.31

Oct 21, 2025

0.0.26

Oct 20, 2025

0.0.25

Oct 20, 2025

0.0.24

Oct 19, 2025

0.0.23

Oct 19, 2025

0.0.22

Oct 19, 2025

0.0.21

Oct 17, 2025

0.0.20

Oct 17, 2025

0.0.19

Oct 16, 2025

0.0.18

Oct 16, 2025

0.0.17

Oct 16, 2025

0.0.16

Oct 16, 2025

0.0.15

Oct 16, 2025

0.0.14

Oct 16, 2025

0.0.13

Oct 16, 2025

0.0.12

Oct 13, 2025

0.0.11

Oct 13, 2025

0.0.10

Oct 12, 2025

0.0.7

Oct 10, 2025

0.0.6

Oct 9, 2025

0.0.5

Oct 9, 2025

0.0.4

Oct 8, 2025

0.0.3

Oct 6, 2025

0.0.2

Oct 6, 2025

0.0.1

Oct 6, 2025

Download files

Source Distribution

webtoolkit-0.1.9.tar.gz (391.8 kB)

Uploaded Mar 2, 2026 Source

Built Distribution

webtoolkit-0.1.9-py3-none-any.whl (412.2 kB)

Uploaded Mar 2, 2026 Python 3

File details

Details for the file webtoolkit-0.1.9.tar.gz.

File metadata

Download URL: webtoolkit-0.1.9.tar.gz
Upload date: Mar 2, 2026
Size: 391.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.2 CPython/3.11.2 Linux/6.12.20+rpt-rpi-v8

File hashes

Hashes for webtoolkit-0.1.9.tar.gz
Algorithm	Hash digest
SHA256	`79d038e78572fd26e5ab089da521d507a768de242217ed6e4b01acc1c1faad49`
MD5	`a34a5fc2ca1d9cc249ee387c0c50ca09`
BLAKE2b-256	`46f12092236afa7884e141c3008909bd8d49a1c891fb939341084066f9a1b8ea`

File details

Details for the file webtoolkit-0.1.9-py3-none-any.whl.

File metadata

Download URL: webtoolkit-0.1.9-py3-none-any.whl
Upload date: Mar 2, 2026
Size: 412.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.2 CPython/3.11.2 Linux/6.12.20+rpt-rpi-v8

File hashes

Hashes for webtoolkit-0.1.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`21327bd117c82f6d3569524ca2e8433141184134ac5005f51c701ca64268808f`
MD5	`c9518fa2ad8532c895d99ac2ccddfeef`
BLAKE2b-256	`d741c9b7277d8aca8666a5cd78ec093559db30c8faf3ea91b4b48e5a64103667`
