
WebWeaver

WebWeaver is a Python package for crawling and extracting URLs from web pages. It provides an easy-to-use interface for crawling a single page or an entire site, while handling errors and incomplete URLs gracefully. All crawling functionality is encapsulated within the WebWeaver class.

Features

  • crawl_url(url): Returns a list of all URLs found on the given page.
  • crawl_site(urls, limit, timeout): Crawls multiple URLs, with options to cap the number of pages crawled and to set a per-page load timeout. Returns a UrlList object that separates successfully crawled URLs, incomplete URLs, and error-causing URLs.
  • crawl_site_multiThreading(urls, limit, timeout, no_of_threads): Same as crawl_site, but uses multithreading for faster crawling, with a configurable number of threads. Returns a UrlList object.

Installation

Install the package using pip:

pip install WebWeaver

Usage

WebWeaver Class

The WebWeaver class provides methods for URL extraction and site crawling.

crawl_url(url)

Extracts all URLs found on a given web page.

Parameters:

  • url (str): The URL of the page you want to crawl.

Returns:

  • list: A list of URLs found on the page.

Example:

from WebWeaver import WebWeaver

# Instantiate the WebWeaver class
weaver = WebWeaver()

# Crawl a single URL
urls = weaver.crawl_url("https://example.com")
print(urls)
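
Note that crawl_url returns every URL it finds, including links to other sites. If you only want links within the crawled domain, you can filter the result yourself. Below is a minimal sketch using the standard library's urllib.parse; the filtering step is not part of WebWeaver:

from urllib.parse import urlparse

from WebWeaver import WebWeaver

weaver = WebWeaver()
start = "https://example.com"
urls = weaver.crawl_url(start)

# Keep only URLs on the same host as the start page.
host = urlparse(start).netloc
internal = [u for u in urls if urlparse(u).netloc == host]
print(internal)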

crawl_site(urls, limit, timeout)

Crawls multiple web pages and returns a UrlList object that categorizes URLs into three sets: successfully crawled URLs, incomplete URLs, and URLs that caused errors.

Parameters:

  • urls (list): A list of URLs to start crawling.
  • limit (int): The maximum number of pages to crawl.
  • timeout (int): The time limit (in seconds) for each page to load.

Returns:

  • UrlList: An object containing three sets:
    • urls: A set of all successfully crawled and retrieved URLs.
    • abnormal_urls: A set of incomplete or malformed URLs extracted from the web pages.
    • error_urls: A set of URLs that caused errors when trying to make a request.

Example:

from WebWeaver import WebWeaver

# Instantiate the WebWeaver class
weaver = WebWeaver()

# Crawl multiple URLs
urls_to_crawl = ["https://example.com", "https://anotherexample.com"]
result = weaver.crawl_site(urls_to_crawl, limit=10, timeout=5)

# Accessing the sets from the result
print("Crawled URLs:", result.urls)
print("Abnormal URLs:", result.abnormal_urls)
print("Error URLs:", result.error_urls)

crawl_site_multiThreading(urls, limit, timeout, no_of_threads)

Crawls multiple web pages using multithreading, which speeds up crawling by distributing requests across a configurable number of threads. Returns a UrlList object categorizing URLs into successfully crawled, incomplete, and error-causing sets.

Parameters:

  • urls (list): A list of URLs to start crawling.
  • limit (int): The maximum number of pages to crawl.
  • timeout (int): The time limit (in seconds) for each page to load.
  • no_of_threads (int): The number of threads to use for crawling.

Returns:

  • UrlList: An object containing three sets:
    • urls: A set of all successfully crawled and retrieved URLs.
    • abnormal_urls: A set of incomplete or malformed URLs extracted from the web pages.
    • error_urls: A set of URLs that caused errors when trying to make a request.

Example:

from WebWeaver import WebWeaver

# Instantiate the WebWeaver class
weaver = WebWeaver()

# Crawl multiple URLs using multithreading
urls_to_crawl = ["https://example.com", "https://anotherexample.com"]
result = weaver.crawl_site_multiThreading(urls_to_crawl, limit=10, timeout=5, no_of_threads=16)

# Accessing the sets from the result
print("Crawled URLs:", result.urls)
print("Abnormal URLs:", result.abnormal_urls)
print("Error URLs:", result.error_urls)

UrlList Class

The crawl_site and crawl_site_multiThreading methods return an object of the UrlList class, which contains the following sets:

  • urls (set): A set of all successfully crawled URLs.
  • abnormal_urls (set): A set of incomplete or malformed URLs found during the crawl.
  • error_urls (set): A set of URLs that caused errors when attempting to access them.
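
Because these attributes are ordinary Python sets, you can combine them with standard set operations or persist them after a crawl. A small sketch (the output file name is arbitrary):

from WebWeaver import WebWeaver

weaver = WebWeaver()
result = weaver.crawl_site(["https://example.com"], limit=10, timeout=5)

# Persist the successfully crawled URLs, one per line.
with open("crawled_urls.txt", "w") as f:
    for url in sorted(result.urls):
        f.write(url + "\n")

# Report how many URLs were problematic.
print(f"{len(result.abnormal_urls)} malformed, {len(result.error_urls)} failed")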

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request.


Happy crawling!
