
A fast web crawler to satisfy all your needs

Project description

Python Web Crawler

A web crawler written in Python to crawl a given website.

Features!

  • Fast crawling
  • Ability to specify the number of threads used to crawl the given website
  • Ability to use proxies to bypass IP restrictions
  • Clear summary of all the URLs that were crawled. View the crawled.txt file for the complete list of crawled links
  • Ability to specify a delay between HTTP requests
  • Stop and resume the crawler whenever you need
  • Gather all the URLs with their titles into a CSV file, in case you are planning to create a search engine
  • Search for specific text throughout the website
  • Clear statistics on how many links ended up as files, timeout errors, or connection errors
  • Crawl as deep as you need: you can specify up to what level the crawler should crawl
  • Random browser user agents are used while crawling
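
As a rough illustration of how the thread count, request delay, and random user agent features could fit together, here is a minimal sketch. It is not the project's own code: `USER_AGENTS`, `build_headers`, `crawl_batch`, and the injected `fetch` callable are hypothetical names chosen for this example, and the delay is applied per worker rather than globally.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool of browser User-Agent strings (an assumption for this sketch).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def crawl_batch(urls, fetch, threads=4, delay=0.0):
    """Fetch every URL on a pool of `threads` workers.

    Each worker sleeps `delay` seconds before its request, then calls
    fetch(url, headers). Returns a {url: result} dictionary.
    """
    def worker(url):
        time.sleep(delay)
        return url, fetch(url, build_headers())

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(worker, urls))
```

In a real crawler, `fetch` would typically wrap an HTTP GET; keeping it as a parameter lets the scheduling logic be tried out without network access.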

Upcoming Features!

  • Gather AWS buckets, emails, phone numbers, etc.
  • Download all images

Dependencies

This tool uses a number of open source projects to work properly:

  • BeautifulSoup - Parser to parse the HTML response of each request made.
  • Requests - To make GET requests to the URLs.
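
The parsing step can be pictured with a small link extractor. The project uses BeautifulSoup for this; the sketch below deliberately uses the standard library's `html.parser` as a stand-in so that it runs without installing anything. `LinkParser` and `extract_links` are hypothetical names for this example only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in an HTML string."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links
```

The `html` argument would be the body of a response to a GET request for `base_url`.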

Usage

To see the list of supported features, simply run the command shown in the usage demo:

[Screenshot: Usage Demo]

Crawl only 3 levels deep

[Screenshot: Depth Crawl]
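
One way to think about a level limit is a breadth-first traversal that stops expanding pages once they reach the maximum depth. This is a sketch under assumptions, not the tool's actual algorithm: `crawl_to_depth` and the injected `get_links` callable are hypothetical, and the start page is counted as level 0.

```python
from collections import deque


def crawl_to_depth(start_url, get_links, max_depth=3):
    """Breadth-first crawl limited to `max_depth` levels.

    `get_links(url)` must return the URLs linked from `url`.
    Returns a {url: depth} dictionary of every page reached.
    """
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # pages at the limit are recorded but not expanded
        for link in get_links(url):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen
```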

Search for specific text throughout the website

[Screenshot: Text Search]
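
A text search over crawled pages might look roughly like the following: strip the markup, then do a case-insensitive substring match on the visible text. This is an illustrative sketch only; `TextExtractor`, `page_contains`, and `search_pages` are hypothetical names, and the matching rule (case-insensitive, whitespace-normalized) is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def page_contains(html, needle):
    """Return True if `needle` occurs in the page's visible text."""
    parser = TextExtractor()
    parser.feed(html)
    # Join the text chunks and collapse runs of whitespace.
    text = " ".join(" ".join(parser.chunks).split())
    return needle.lower() in text.lower()


def search_pages(pages, needle):
    """Return the URLs in {url: html} whose text contains `needle`."""
    return [url for url, html in pages.items() if page_contains(html, needle)]
```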

Gather all the links along with their titles into a CSV file. The file is created after the crawl completes.

[Screenshot: Gather Titles]
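
The CSV output step could be sketched as below, using only the standard library. The column names `url` and `title`, and the helper names `TitleParser`, `get_title`, and `write_titles_csv`, are assumptions made for this example; the actual file layout produced by the tool is not documented here.

```python
import csv
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Capture the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def get_title(html):
    """Return the page title with surrounding whitespace removed."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def write_titles_csv(pages, out):
    """Write one (url, title) row per entry of {url: html} to file object `out`."""
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    for url, html in pages.items():
        writer.writerow([url, get_title(html)])
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends.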

Use proxies to crawl the site.

[Screenshot: Use Proxies]
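
For readers curious how proxy use might work with Requests, the sketch below builds the kind of mapping that Requests accepts through its `proxies` argument and rotates through several proxies in turn. Rotation is an assumption of this example rather than documented behavior of the tool, and `proxy_cycle` and the proxy addresses are hypothetical.

```python
import itertools


def proxy_cycle(proxy_urls):
    """Yield Requests-style proxy mappings, rotating through `proxy_urls` forever."""
    for proxy in itertools.cycle(proxy_urls):
        # The same proxy is used for both plain and TLS traffic.
        yield {"http": proxy, "https": proxy}
```

Each mapping can then be passed to a call such as `requests.get(url, proxies=next(proxies), timeout=10)`, where `proxies = proxy_cycle([...])` holds your own proxy addresses.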

Download files

Source Distribution: pywebcrawler-0.0.1.tar.gz (8.2 kB)

Built Distribution: pywebcrawler-0.0.1-py3-none-any.whl (17.4 kB, Python 3)
