Web crawling tool

Project description

pywebcrwl

pywebcrwl is a simple Python web crawler that extracts links, email addresses, phone numbers, keywords, and other information from websites.

Features

  • Crawl and extract all pages from a given URL
  • Extract email addresses (with optional domain filtering)
  • Extract phone numbers (including international formats)
  • Detect cities mentioned in the text
  • Find matches for a given regular expression
  • Extract all image URLs
  • Extract all websites/domains mentioned on a page
  • Extract downloadable documents (optionally by file extension)
  • Extract raw HTML code of pages
  • Identify keywords from the content
  • Extract all sentences containing a specific word
  • Extract website favicons
  • Extract social media links
  • Generate a summary (resume) of a page
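The project description does not document pywebcrwl's own functions, so the following is only a minimal sketch of how three of the listed features (link extraction, image URL extraction, and email extraction with optional domain filtering) can be done with the Python standard library. The names LinkAndImageParser and extract_emails, the sample HTML, and the example.com addresses are all hypothetical and are not part of pywebcrwl.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Deliberately simple pattern; it does not cover every valid address form.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class LinkAndImageParser(HTMLParser):
    """Hypothetical helper: collect absolute link and image URLs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # urljoin resolves relative paths against the page URL.
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def extract_emails(text, domain=None):
    """Hypothetical helper: unique emails in order of appearance.

    If domain is given, keep only addresses ending in "@<domain>".
    """
    found = []
    for email in EMAIL_RE.findall(text):
        if domain and not email.lower().endswith("@" + domain.lower()):
            continue
        if email not in found:
            found.append(email)
    return found


# Sample input, invented for illustration.
html = """
<html><body>
  <a href="/about">About</a>
  <a href="https://other.example/docs">Docs</a>
  <img src="img/logo.png">
  <p>Contact: info@example.com or sales@other.example</p>
</body></html>
"""

parser = LinkAndImageParser("https://example.com/")
parser.feed(html)
print(parser.links)
print(parser.images)
print(extract_emails(html))
print(extract_emails(html, domain="example.com"))
```

A full crawler would fetch each collected link in turn and repeat the same extraction, usually while tracking visited URLs so that no page is processed twice.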

Installation

pip install pywebcrwl
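
After installing, one way to confirm that the distribution is present is to query its metadata with the standard library. This is a small sketch; the helper name installed_version is hypothetical, and only the distribution name pywebcrwl comes from this page.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Hypothetical helper: return the installed version string, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version if pywebcrwl is installed, otherwise None.
print(installed_version("pywebcrwl"))
```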

Download files

Source Distribution

pywebcrwl-0.1.1.tar.gz (5.7 kB)

Uploaded Source

Built Distribution

pywebcrwl-0.1.1-py3-none-any.whl (6.2 kB)

Uploaded Python 3

File details

Details for the file pywebcrwl-0.1.1.tar.gz.

File metadata

  • Download URL: pywebcrwl-0.1.1.tar.gz
  • Upload date:
  • Size: 5.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for pywebcrwl-0.1.1.tar.gz
Algorithm Hash digest
SHA256 bf52d09942b77a8f8b27b46c497a7dbe352473659995de9fb93b7dc61b64f586
MD5 0b4fd62c43f88ab377b9710513273d25
BLAKE2b-256 2623bb395aab00742c3baea84c1f344d3d59b7cfa2199277a974463f04fac052
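
A downloaded archive can be checked against the SHA256 digest listed above. The sketch below uses hashlib from the standard library; the helper names sha256_of_file and matches_published_digest are hypothetical, and the archive is assumed to sit in the current directory under its published file name.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=65536):
    """Hypothetical helper: hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Hypothetical helper: True if the file's SHA256 equals expected_hex."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 value copied from the table above.
EXPECTED = "bf52d09942b77a8f8b27b46c497a7dbe352473659995de9fb93b7dc61b64f586"
ARCHIVE = "pywebcrwl-0.1.1.tar.gz"  # assumed local path of the download

if os.path.exists(ARCHIVE):
    print(matches_published_digest(ARCHIVE, EXPECTED))
```

The same check applies to the wheel by substituting its file name and its own SHA256 value.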

File details

Details for the file pywebcrwl-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: pywebcrwl-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 6.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for pywebcrwl-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 2b1bc39f1246e6d372c8d9ff5c45cf9d4f21f349ebb0b31f5fa4e054f8311b4d
MD5 69dbdf47c5393ac0fb6ce854faca66dc
BLAKE2b-256 b460f1d1d136135fb7a12fac5e81b1ba23cb7793a6bb30b82ba23d6ee690681a
