A fast web crawler to satisfy all your needs

Project description

Python Web Crawler

A web crawler written in Python to crawl a given website.

Features!

  • Fast, multi-threaded crawling
  • Ability to specify the number of threads used to crawl the given website
  • Ability to use proxies to bypass IP restrictions
  • Clear summary of all the URLs that were crawled; view the crawled.txt file for the complete list of links crawled
  • Ability to specify a delay between each HTTP request
  • Stop and resume the crawler whenever you need
  • Gather all the URLs with their titles into a CSV file, in case you are planning to build a search engine
  • Search for specific text throughout the website
  • Clear statistics about how many links ended up as files, timeout errors, or connection errors
  • Crawl as deep as you need: you can specify up to what level the crawler should crawl
  • Random browser user agents are used while crawling
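The thread-count, delay, and random user-agent features above combine into a polite-fetch loop. A minimal sketch of the idea (the USER_AGENTS pool and crawl_politely helper are illustrative names, not this package's actual API):

```python
import random
import time

# Hypothetical pool of browser user agents; the real crawler ships its own list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def build_headers():
    """Return request headers with a randomly chosen browser user agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def crawl_politely(urls, fetch, delay=1.0):
    """Fetch each URL with a fresh random user agent, sleeping between requests.

    fetch(url, headers=...) is injected so the loop stays testable; in the real
    crawler it would be an HTTP GET.
    """
    results = {}
    for url in urls:
        results[url] = fetch(url, headers=build_headers())
        time.sleep(delay)
    return results
```

In the multi-threaded case, each worker thread would run this same loop over URLs pulled from a shared queue.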

Upcoming Features!

  • Gather AWS buckets, emails, phone numbers, etc.
  • Download all images

Dependencies

This tool uses a number of open source projects to work properly:

  • BeautifulSoup - Parses the HTML response of each request made.
  • Requests - Makes the GET requests to the URLs.
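Together these two libraries cover the crawler's fetch-and-parse cycle. As a dependency-free sketch of the parsing half, the standard library's html.parser can stand in for BeautifulSoup to pull the links out of a page (LinkExtractor is a hypothetical name, not part of this package):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags, mimicking what the
    BeautifulSoup-based parsing step does for each crawled page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    """Return every href found in the given HTML document."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```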

Usage

If you would like to see the list of supported features, simply run the usage demo.

Specify that the crawler should only crawl 3 levels deep

Depth Crawl
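Depth-limited crawling is a breadth-first traversal that stops expanding once the configured level is reached. A minimal sketch, with the HTTP fetch replaced by an injected get_links function (crawl_to_depth is an illustrative helper, not this package's API):

```python
from collections import deque

def crawl_to_depth(start, get_links, max_depth):
    """Breadth-first crawl that stops expanding past max_depth levels.

    get_links(url) -> list of URLs found on that page; in the real crawler
    this would be a fetch plus parse, but it is injected here for clarity.
    """
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links found beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```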

Search for specific text throughout the website

Text Search
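Text search amounts to scanning the text of every crawled page for the query. A sketch assuming the pages have already been collected into a URL-to-text mapping (search_site is a hypothetical helper, not the package's actual function):

```python
def search_site(pages, needle):
    """Return the URLs whose page text contains needle (case-insensitive).

    pages maps URL -> page text; the real crawler fills this mapping
    as it crawls the site.
    """
    needle = needle.lower()
    return [url for url, text in pages.items() if needle in text.lower()]
```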

Gather all the links along with their titles into a CSV file. The CSV file is created after the crawl completes.

Gather Titles

Use proxies to crawl the site.

Use Proxies
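With Requests, proxying is usually a proxies mapping passed along with each request; the standard-library equivalent looks like this (build_proxy_opener is a hypothetical helper shown only to illustrate the mechanism):

```python
import urllib.request

def build_proxy_opener(proxy_url):
    """Build a URL opener that routes HTTP(S) traffic through a proxy,
    the same idea as passing a proxies mapping to Requests."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Rotating through a list of such proxies between requests is what lets the crawler sidestep per-IP rate limits.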

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pywebcrawler-0.0.1.tar.gz (8.2 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pywebcrawler-0.0.1-py3-none-any.whl (17.4 kB)

Uploaded Python 3

File details

Details for the file pywebcrawler-0.0.1.tar.gz.

File metadata

  • Download URL: pywebcrawler-0.0.1.tar.gz
  • Upload date:
  • Size: 8.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.5

File hashes

Hashes for pywebcrawler-0.0.1.tar.gz
  • SHA256: 8746bb0d60b12be062757375718a4fd96d58619f1edcaddc8c5ac4ee67223e01
  • MD5: 9c1fb2f4442a859811673d719f1a276c
  • BLAKE2b-256: 7810a5debfdd4e9c6949fca4233e152fb3d5dbe9a52eb6c3918978c6919517d1

See more details on using hashes here.

File details

Details for the file pywebcrawler-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: pywebcrawler-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 17.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.0.0 requests-toolbelt/0.9.1 tqdm/4.41.1 CPython/3.7.5

File hashes

Hashes for pywebcrawler-0.0.1-py3-none-any.whl
  • SHA256: 9decb290e655c1bd8b851cdf95556162c8285bb404b459dabada1f0ac8c70d2a
  • MD5: 3f6d412bf4f73e7e0d62a956d787df3a
  • BLAKE2b-256: 3138877e41e197bf1aec6a73533b0fb7ae326ec5decf7a3d7ce4b6ad577070de

See more details on using hashes here.
