Web crawling tool

pywebcrwl

pywebcrwl is a simple Python web crawler that extracts links, email addresses, phone numbers, keywords, and other information from websites.

Features

  • Crawl and extract all pages from a given URL
  • Extract email addresses (with optional domain filtering)
  • Extract phone numbers (including international formats)
  • Detect cities mentioned in the text
  • Find matches for a given regular expression
  • Extract all image URLs
  • Extract all websites/domains mentioned on a page
  • Extract downloadable documents (optionally by file extension)
  • Extract raw HTML code of pages
  • Identify keywords from the content
  • Extract all sentences containing a specific word
  • Extract website favicons
  • Extract social media links
  • Generate a summary (resume) of a page
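
The page above does not document pywebcrwl's function names or signatures, so the following is not its API. It is a minimal standard-library sketch of how two of the listed features, email extraction with optional domain filtering and link extraction, could work in principle. The names `extract_emails`, `extract_links`, and `LinkCollector` are hypothetical, and the email pattern is deliberately simplified.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Simplified pattern (an assumption for illustration); real email syntax is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text, domain=None):
    """Return unique, lowercased addresses in order of appearance.

    If `domain` is given, keep only addresses at exactly that domain.
    """
    found = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if domain and not email.endswith("@" + domain.lower()):
            continue
        if email not in found:
            found.append(email)
    return found


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return absolute URLs for every anchor found in `html`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A crawler built this way would fetch each page, call `extract_links` to find further pages to visit, and run extractors such as `extract_emails` on the page text.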

Installation

pip install pywebcrwl

Download files

Source Distribution

pywebcrwl-0.1.2.tar.gz (5.5 kB)

Uploaded Source

Built Distribution

pywebcrwl-0.1.2-py3-none-any.whl (6.0 kB)

Uploaded Python 3

File details

Details for the file pywebcrwl-0.1.2.tar.gz.

File metadata

  • Download URL: pywebcrwl-0.1.2.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for pywebcrwl-0.1.2.tar.gz
Algorithm    Hash digest
SHA256       b0c09b3c57b976400d6efbd3817dbafd8d81a37366b773b7fc3f3276ff0a2615
MD5          261c6b039c66f9d399fa25f1d1d41e65
BLAKE2b-256  504b8b66f0bfff18ad38807a0b1cc1f944d3c12d481eea80971e24691d1d305e

File details

Details for the file pywebcrwl-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: pywebcrwl-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 6.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for pywebcrwl-0.1.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       be97da40ca2dad95d3259d24d2ffad6690d339eb2a4cf755083b167394685513
MD5          dc2c9d2182586b789a8d01b9bf6fd107
BLAKE2b-256  b3b42fdf83ec7e68a5d86fe08efb12b0dd7f8736df1d977aca8c57421ba92bb2
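
The published digests can be used to check that a downloaded file is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the local file paths in the usage comment are assumptions about where the files were saved.

```python
import hashlib

# SHA256 digests copied from the file listings above.
PUBLISHED_SHA256 = {
    "pywebcrwl-0.1.2.tar.gz":
        "b0c09b3c57b976400d6efbd3817dbafd8d81a37366b773b7fc3f3276ff0a2615",
    "pywebcrwl-0.1.2-py3-none-any.whl":
        "be97da40ca2dad95d3259d24d2ffad6690d339eb2a4cf755083b167394685513",
}


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """True if the file's SHA256 equals the expected hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (assumes the file was downloaded into the current directory):
# name = "pywebcrwl-0.1.2.tar.gz"
# print(matches_published_digest(name, PUBLISHED_SHA256[name]))
```

SHA256 is used here because MD5 is no longer considered collision-resistant; `hashlib.blake2b(digest_size=32)` could be substituted to check the BLAKE2b-256 value instead.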
