
WebWeaver

WebWeaver is a Python package for crawling and extracting URLs from web pages. It provides an easy-to-use interface for crawling a single page or an entire site, while handling errors and incomplete URLs gracefully. All crawling functionality is encapsulated within the WebWeaver class.

Features

  • crawl_url(url): Returns a list of all URLs found on the given page.
  • crawl_site(urls, limit, timeout): Crawls multiple URLs, with options to cap the number of pages crawled and to set a per-page load timeout. Returns a UrlList object that separates successfully crawled URLs, incomplete URLs, and error-causing URLs.
  • crawl_site_multiThreading(urls, limit, timeout, no_of_threads): Same as crawl_site, but uses multithreading for faster crawling, with a configurable number of threads. Returns a UrlList object.

Installation

Install the package using pip:

pip install WebWeaver

Usage

WebWeaver Class

The WebWeaver class provides methods for URL extraction and site crawling.

crawl_url(url)

Extracts all URLs found on a given web page.

Parameters:

  • url (str): The URL of the page you want to crawl.

Returns:

  • list: A list of URLs found on the page.

Example:

from WebWeaver import WebWeaver

# Instantiate the WebWeaver class
weaver = WebWeaver()

# Crawl a single URL
urls = weaver.crawl_url("https://example.com")
print(urls)
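
Note that crawl_url returns every URL it finds, including links to other sites. If you only want links within the crawled domain, you can filter the result yourself. Below is a minimal sketch using the standard library's urllib.parse; the filtering step is not part of WebWeaver:

from urllib.parse import urlparse

from WebWeaver import WebWeaver

weaver = WebWeaver()
start = "https://example.com"
urls = weaver.crawl_url(start)

# Keep only URLs on the same host as the start page.
host = urlparse(start).netloc
internal = [u for u in urls if urlparse(u).netloc == host]
print(internal)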

crawl_site(urls, limit, timeout)

Crawls multiple web pages and returns a UrlList object that categorizes URLs into three sets: successfully crawled URLs, incomplete URLs, and URLs that caused errors.

Parameters:

  • urls (list): A list of URLs to start crawling.
  • limit (int): The maximum number of pages to crawl.
  • timeout (int): The time limit (in seconds) for each page to load.

Returns:

  • UrlList: An object containing three sets:
    • urls: A set of all successfully crawled and retrieved URLs.
    • abnormal_urls: A set of incomplete or malformed URLs extracted from the web pages.
    • error_urls: A set of URLs that caused errors when trying to make a request.

Example:

from WebWeaver import WebWeaver

# Instantiate the WebWeaver class
weaver = WebWeaver()

# Crawl multiple URLs
urls_to_crawl = ["https://example.com", "https://anotherexample.com"]
result = weaver.crawl_site(urls_to_crawl, limit=10, timeout=5)

# Accessing the sets from the result
print("Crawled URLs:", result.urls)
print("Abnormal URLs:", result.abnormal_urls)
print("Error URLs:", result.error_urls)

crawl_site_multiThreading(urls, limit, timeout, no_of_threads)

Crawls multiple web pages using multithreading, which speeds up crawling by distributing requests across a configurable number of threads. Returns a UrlList object categorizing URLs into successfully crawled, incomplete, and error-causing sets.

Parameters:

  • urls (list): A list of URLs to start crawling.
  • limit (int): The maximum number of pages to crawl.
  • timeout (int): The time limit (in seconds) for each page to load.
  • no_of_threads (int): The number of threads to use for crawling.

Returns:

  • UrlList: An object containing three sets:
    • urls: A set of all successfully crawled and retrieved URLs.
    • abnormal_urls: A set of incomplete or malformed URLs extracted from the web pages.
    • error_urls: A set of URLs that caused errors when trying to make a request.

Example:

from WebWeaver import WebWeaver

# Instantiate the WebWeaver class
weaver = WebWeaver()

# Crawl multiple URLs using multithreading
urls_to_crawl = ["https://example.com", "https://anotherexample.com"]
result = weaver.crawl_site_multiThreading(urls_to_crawl, limit=10, timeout=5, no_of_threads=16)

# Accessing the sets from the result
print("Crawled URLs:", result.urls)
print("Abnormal URLs:", result.abnormal_urls)
print("Error URLs:", result.error_urls)

UrlList Class

The crawl_site and crawl_site_multiThreading methods return an object of the UrlList class, which contains the following sets:

  • urls (set): A set of all successfully crawled URLs.
  • abnormal_urls (set): A set of incomplete or malformed URLs found during the crawl.
  • error_urls (set): A set of URLs that caused errors when attempting to access them.
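
Because these attributes are ordinary Python sets, you can combine them with standard set operations or persist them after a crawl. A small sketch (the output file name is arbitrary):

from WebWeaver import WebWeaver

weaver = WebWeaver()
result = weaver.crawl_site(["https://example.com"], limit=10, timeout=5)

# Persist the successfully crawled URLs, one per line.
with open("crawled_urls.txt", "w") as f:
    for url in sorted(result.urls):
        f.write(url + "\n")

# Report how many URLs were problematic.
print(f"{len(result.abnormal_urls)} malformed, {len(result.error_urls)} failed")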

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request.


Happy crawling!
