Web crawling tool

Project description

pywebcrwl

pywebcrwl is a simple Python web crawler that extracts various types of information such as links, emails, phone numbers, keywords, and more from websites.

Features

  • Crawl and extract all pages from a given URL
  • Extract email addresses (with optional domain filtering)
  • Extract phone numbers (including international formats)
  • Detect cities mentioned in the text
  • Find matches for a given regular expression
  • Extract all image URLs
  • Extract all websites/domains mentioned on a page
  • Extract downloadable documents (optionally by file extension)
  • Extract raw HTML code of pages
  • Identify keywords from the content
  • Extract all sentences containing a specific word
  • Extract website favicons
  • Extract social media links
  • Generate a summary (resume) of a page
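
To make a few of the features above concrete, here is a minimal standard-library sketch of email extraction with optional domain filtering, plus link and image URL extraction. It does not use pywebcrwl and is not its API: the names `extract_emails`, `extract_links_and_images`, and `sample_html` are hypothetical, and the email pattern is deliberately simplified.

```python
import re
from html.parser import HTMLParser

# Simplified pattern for illustration only; real email syntax is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text, domain=None):
    """Return unique emails in order of appearance, optionally limited to one domain."""
    found = []
    for email in EMAIL_RE.findall(text):
        if domain is not None and not email.lower().endswith("@" + domain.lower()):
            continue
        if email not in found:
            found.append(email)
    return found


class _LinkAndImageParser(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def extract_links_and_images(html):
    """Return (links, images) found in an HTML string."""
    parser = _LinkAndImageParser()
    parser.feed(html)
    return parser.links, parser.images


# Hypothetical page content used only for this example.
sample_html = """
<html><body>
  <a href="https://example.com/about">About</a>
  <img src="/static/logo.png" alt="logo">
  <p>Contact: alice@example.com or bob@example.org</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_emails(sample_html))
    print(extract_emails(sample_html, domain="example.org"))
    print(extract_links_and_images(sample_html))
```

A real crawler would also fetch pages over HTTP, resolve relative URLs, and follow links; those steps are left out to keep the sketch short.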

Installation

pip install pywebcrwl
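
One way to confirm the installation is to query the installed distribution's metadata with the standard library. This sketch relies only on `importlib.metadata` and assumes nothing about the package's own modules; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version if the install succeeded, otherwise None.
    print(installed_version("pywebcrwl"))
```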

Download files

Source Distribution

pywebcrwl-0.1.0.tar.gz (5.7 kB view details)

Uploaded Source

Built Distribution

pywebcrwl-0.1.0-py3-none-any.whl (6.1 kB view details)

Uploaded Python 3

File details

Details for the file pywebcrwl-0.1.0.tar.gz.

File metadata

  • Download URL: pywebcrwl-0.1.0.tar.gz
  • Upload date:
  • Size: 5.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for pywebcrwl-0.1.0.tar.gz
  • SHA256: 864ef879cecbce5c0627bfb091c92da0005410cc1764fd3b68ef15962bb1be55
  • MD5: ad7472c140173681cef7362c96a19aa3
  • BLAKE2b-256: 090dbfd468b4ec7cbe85a9725db3309529c4b7e32ee0602f7ce8fa961e05dd09

File details

Details for the file pywebcrwl-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pywebcrwl-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 6.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for pywebcrwl-0.1.0-py3-none-any.whl
  • SHA256: 585503a6aafc2186ac33320202f3abd4abf2dd3608fc371c55c30d57a7c4f412
  • MD5: 08a0d1189ebb66142390c759e722f677
  • BLAKE2b-256: ec52f47ba9bb0c5101a1702364c3cebdfd21664705aaac89cb4c7432bf302460
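
The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using `hashlib`; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and it assumes the file sits in the current directory under its published name.

```python
import hashlib
import hmac

# SHA256 digests copied from the file listings above.
EXPECTED_SHA256 = {
    "pywebcrwl-0.1.0.tar.gz":
        "864ef879cecbce5c0627bfb091c92da0005410cc1764fd3b68ef15962bb1be55",
    "pywebcrwl-0.1.0-py3-none-any.whl":
        "585503a6aafc2186ac33320202f3abd4abf2dd3608fc371c55c30d57a7c4f412",
}


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


if __name__ == "__main__":
    import os

    for name, expected in EXPECTED_SHA256.items():
        if os.path.exists(name):
            print(name, "OK" if matches_expected(name, expected) else "MISMATCH")
        else:
            print(name, "not found in current directory")
```

pip can also enforce digests itself through its hash-checking mode, where a requirements file pins each package with `--hash=sha256:...`.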
