Top 100 Visited Sites Web Scraper

Project description

Description

This Python script is a simple web scraper that fetches Wikipedia's list of the most-visited websites (https://en.wikipedia.org/wiki/List_of_most-visited_websites), converts the HTML table on that page into a list of dictionaries, and then accesses each domain in the list and prints its HTTP response code.

I created this script to automate testing of firewall, content-filtering, and NAT-related services. It can be used to test whether a domain is accessible, and whether a domain is being redirected to another domain.

Using this script helps me verify whether a rule is working as expected. For example, if I have a rule that blocks access to a domain, I can run the script to confirm that the domain is indeed blocked. Each run also produces useful syslog entries.

Dependencies

  • Python 3.6 or higher
  • requests library
  • BeautifulSoup from bs4 library
  • urlparse from the standard-library urllib.parse module
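
Taken together, the script's import block looks like this:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse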

Functions

  • is_valid_url(url: str) -> bool: Checks if a URL is valid (see the sketch after this list).
  • html_table_to_list(url: str, num_columns: int) -> list: Converts an HTML table into a list of dictionaries.
  • access_domains(top_100: list) -> None: Accesses each domain in a list and prints its response code.
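
A minimal sketch of is_valid_url, assuming it leans on urlparse from the imports above (the packaged implementation may differ):

def is_valid_url(url: str) -> bool:
    # Treat a URL as valid if it has both a scheme (e.g. https)
    # and a network location (the domain part).
    parsed = urlparse(url)
    return bool(parsed.scheme) and bool(parsed.netloc)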

Usage

  1. Install the required dependencies (see the pip note after this list).
  2. Run the script with Python 3.6 or higher.
  3. The script will fetch data from the specified URL, convert the HTML table into a list of dictionaries, and print the response code for each domain.
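
For step 1, the third-party dependencies can be installed with pip install requests beautifulsoup4; urlparse is part of Python's standard library, so it needs no separate install.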

Note that the html_table_to_list function takes the URL of the page containing the table and the number of columns in the table as parameters; the number of columns defaults to 5.
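
As an illustration, a function matching that signature could look like the sketch below, using the imports listed above. The requests/BeautifulSoup flow is an assumption about the implementation, and the dictionary keys (col_0, col_1, ...) are placeholders, since the real keys depend on how the script labels the table columns.

def html_table_to_list(url: str, num_columns: int = 5) -> list:
    # Fetch the page and parse the first HTML table into a list of dicts.
    response = requests.get(url)
    response.raise_for_status()
    table = BeautifulSoup(response.text, "html.parser").find("table")
    rows = []
    if table is None:
        return rows
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == num_columns:
            # Placeholder keys; the packaged version may key rows by
            # the table's header names instead.
            rows.append({f"col_{i}": cell for i, cell in enumerate(cells)})
    return rows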

The access_domains function takes a list of dictionaries representing a table of domains and prints the response code for each one.
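
A comparable sketch of access_domains, again with hedged assumptions: the key that holds the domain name ("Domain" here) is hypothetical, and the real script may format its output differently.

def access_domains(top_100: list) -> None:
    # Try each domain over HTTPS and print the HTTP status code.
    for entry in top_100:
        domain = entry.get("Domain")  # key name is an assumption
        if not domain:
            continue
        try:
            response = requests.get(f"https://{domain}", timeout=10)
            print(f"{domain}: {response.status_code}")
        except requests.RequestException as exc:
            # Blocked or filtered domains typically surface here,
            # which is exactly the signal the firewall tests look for.
            print(f"{domain}: request failed ({exc})")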

Example

# Fetch the table of top sites and probe each listed domain
top_100_reference = "https://en.wikipedia.org/wiki/List_of_most-visited_websites"  # Replace with your URL
top_100_list = html_table_to_list(top_100_reference)  # num_columns defaults to 5
access_domains(top_100_list)

In this example, the script fetches data from a Wikipedia page that lists the most visited websites, converts the HTML table into a list of dictionaries, and prints the response code for each domain.
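
Because the dictionary keys come from the scraped table, a quick way to see what you are working with is to print one row before doing anything else:

# Inspect the first row to discover the dictionary keys.
if top_100_list:
    print(top_100_list[0])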

Disclaimer

Please use this script responsibly and ensure that you are allowed to scrape the websites you choose to scrape. Some websites may prohibit scraping in their terms of service. Always respect others' intellectual property rights.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

top_100-0.1.3.tar.gz (6.9 kB)

Built Distribution

top_100-0.1.3-py3-none-any.whl (8.1 kB)

File details

Details for the file top_100-0.1.3.tar.gz.

File metadata

  • Download URL: top_100-0.1.3.tar.gz
  • Upload date:
  • Size: 6.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.11.6 Darwin/23.2.0

File hashes

Hashes for top_100-0.1.3.tar.gz
  • SHA256: 84ed0e61ca8c32a03a47de2b3f6c92284a0ed39b9cfcd4b09f15e705c10313fc
  • MD5: c83ea400ec9742f7cbfa31cc061d4924
  • BLAKE2b-256: 243fa13a38249fff984794dad39a42f984fc50cd92e0ddc7b96b2001e35d13cb

See more details on using hashes here.
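
If you want to verify a downloaded archive against the SHA256 digest above, the standard-library hashlib is enough; the file path below assumes the archive sits in the current directory:

import hashlib

with open("top_100-0.1.3.tar.gz", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())  # compare with the SHA256 above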

File details

Details for the file top_100-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: top_100-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 8.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.11.6 Darwin/23.2.0

File hashes

Hashes for top_100-0.1.3-py3-none-any.whl
  • SHA256: 4be6354d3d718c72467d1b1e76bfc2687cdd03efdcb4f3a465d0774cce3b1e97
  • MD5: cc3120b9d3035900bea6efa29c3cc506
  • BLAKE2b-256: 6ad21536f4c2f5c6ca0d78888ef71c95b166a58f264dbf6c0044726c355b0ccb

See more details on using hashes here.
