A fast web crawler to satisfy all your needs

Project description

Python Web Crawler

A web crawler written in Python to crawl a given website.

Features!

  • Fast, multi-threaded crawling
  • Ability to specify the number of threads used to crawl the given website (see the sketch after this list)
  • Ability to use proxies to bypass IP restrictions
  • Clear summary of all the URLs that were crawled; view the crawled.txt file to see the complete list of crawled links
  • Ability to specify a delay between HTTP requests
  • Stop and resume the crawler whenever you need
  • Gather all the URLs with their titles into a CSV file, in case you plan to build a search engine
  • Search for specific text throughout the website
  • Clear statistics about how many links ended up as files, timeout errors, or connection errors
  • Crawl as deep as you need: you can specify up to which level the crawler should crawl
  • Random browser user agents will be used while crawling.
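
As a rough illustration of how threads, a per-request delay, and random browser user agents fit together in a crawler like this, here is a minimal sketch. It is not this project's actual implementation; the user-agent strings, URL list, thread count, and delay are all placeholder values:

    import random
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    # Placeholder pool of browser user agents.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def fetch(url, delay=1.0):
        time.sleep(delay)  # delay between HTTP requests
        headers = {"User-Agent": random.choice(USER_AGENTS)}  # random user agent per request
        return requests.get(url, headers=headers, timeout=10)

    urls = ["https://example.com/"]  # placeholder URL list
    with ThreadPoolExecutor(max_workers=4) as pool:  # number of crawler threads
        responses = list(pool.map(fetch, urls))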

Upcoming Features!

  • Gather AWS buckets, emails, phone numbers, etc.
  • Download all images

Dependencies

This tool uses a number of open source projects to work properly:

  • BeautifulSoup - Used to parse the HTML response of each request.
  • Requests - Used to make the GET requests to each URL (a minimal sketch of how the two fit together follows this list).
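
The sketch below is illustrative only (it is not taken from this project's code, and https://example.com/ is a placeholder); it shows how Requests and BeautifulSoup are typically combined to fetch a page and pull out its links:

    import requests
    from bs4 import BeautifulSoup

    # Fetch the page with Requests, then hand the HTML to BeautifulSoup.
    response = requests.get("https://example.com/", timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect every hyperlink found on the page.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    print(links)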

Usage

If you would like to see the list of supported features, refer to the usage demo:

[Demo: Usage Demo]

Crawling only to a depth of 3 levels

[Demo: Depth Crawl]
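
As a rough sketch of what a level-limited (depth-limited) crawl generally looks like, consider the following; it is not the tool's actual code, and the start URL and depth value are placeholders:

    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url, max_depth=3):
        """Breadth-first crawl that stops after max_depth levels of links."""
        seen = {start_url}
        frontier = [start_url]
        for _level in range(max_depth):
            next_frontier = []
            for url in frontier:
                try:
                    html = requests.get(url, timeout=10).text
                except requests.RequestException:
                    continue  # skip unreachable pages
                for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                    link = urljoin(url, a["href"])
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
        return seen

    print(len(crawl("https://example.com/", max_depth=3)))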

Search for specific text throughout the website

[Demo: Text Search]
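
Conceptually, the text search boils down to checking the visible text of each crawled page for the given string. A minimal, illustrative sketch (the URL and search term are placeholders, and this is not the tool's own implementation):

    import requests
    from bs4 import BeautifulSoup

    def page_contains(url, term):
        """Return True if the page's visible text contains the search term."""
        html = requests.get(url, timeout=10).text
        text = BeautifulSoup(html, "html.parser").get_text()
        return term.lower() in text.lower()

    print(page_contains("https://example.com/", "example"))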

Gather all the links along with their titles into a CSV file. The CSV file is created after the crawl completes.

[Demo: Gather Titles]
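
The idea behind the CSV export looks roughly like the sketch below; the output filename (titles.csv) and the URL list are placeholders, not the tool's actual names:

    import csv

    import requests
    from bs4 import BeautifulSoup

    urls = ["https://example.com/"]  # placeholder: the list of crawled URLs

    with open("titles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title"])
        for url in urls:
            # Fetch each page and pull the <title> tag, falling back to an empty string.
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            title = soup.title.string.strip() if soup.title and soup.title.string else ""
            writer.writerow([url, title])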

Use proxies to crawl the site.

[Demo: Use Proxies]
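
With the Requests library, routing traffic through a proxy comes down to passing a proxies mapping; the sketch below is illustrative only, and the proxy address is a placeholder:

    import requests

    # Route both HTTP and HTTPS traffic through the proxy.
    proxies = {
        "http": "http://127.0.0.1:8080",   # placeholder proxy address
        "https": "http://127.0.0.1:8080",
    }
    response = requests.get("https://example.com/", proxies=proxies, timeout=10)
    print(response.status_code)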

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for pywebcrawler, version 0.0.1

  • pywebcrawler-0.0.1-py3-none-any.whl (17.4 kB): Wheel, Python version py3
  • pywebcrawler-0.0.1.tar.gz (8.2 kB): Source distribution
