A fast web crawler to satisfy all your needs
Python Web Crawler
A web crawler written in Python to crawl a given website.
Features!
- Fast crawling
- Ability to specify the number of threads to use to crawl the given website
- Ability to use proxies to bypass IP restrictions
- Clear summary of all the URLs that were crawled. See the crawled.txt file for the complete list of crawled links
- Ability to specify a delay between HTTP requests
- Stop and resume crawler whenever you need
- Gather all the URLs with their titles into a CSV file, in case you are planning to build a search engine
- Search for specific text throughout the website
- Clear statistics about how many links ended up as files, timeout errors, or connection errors
- Crawl only as deep as you need. You can specify up to what level the crawler should crawl.
- Random browser user agents will be used while crawling.
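As a rough illustration of how the thread count, request delay, and random user agent features could fit together, here is a minimal sketch. It is not this tool's implementation: `fetch`, `crawl_batch`, and the `USER_AGENTS` list are hypothetical names, and the HTTP call is replaced by a stand-in so the example runs offline.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical values for illustration; the tool's real user-agent list is not shown.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


def fetch(url, delay=0.0):
    """Stand-in for an HTTP GET: wait `delay` seconds, then pick a random user agent."""
    time.sleep(delay)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # A real crawler would issue the request here, sending `headers` with it.
    return url, headers["User-Agent"]


def crawl_batch(urls, threads=4, delay=0.0):
    """Fetch every URL using a pool of `threads` worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda u: fetch(u, delay), urls))


results = crawl_batch(
    ["https://example.com/a", "https://example.com/b"], threads=2, delay=0.01
)
for url, agent in results:
    print(url, "fetched as", agent)
```

A thread pool suits this workload because crawling is dominated by waiting on the network, so several requests can be in flight while each thread sleeps or waits for a response.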
Upcoming Features!
- Gather AWS buckets, emails, phone numbers, etc.
- Download all images
Dependencies
This tool uses a number of open source projects to work properly:
- BeautifulSoup - Parses the HTML response of each request.
- Requests - To make GET requests to the URLs.
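To show what the parsing step involves, the sketch below extracts a page title and its links from an HTML string. It deliberately uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without installing anything; `LinkTitleParser` and `parse_page` are hypothetical names, not part of this tool.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTitleParser(HTMLParser):
    """Collects the <title> text and absolute URLs of all <a href> links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(base_url, html):
    """Return (title, links) for one HTML document."""
    parser = LinkTitleParser(base_url)
    parser.feed(html)
    return parser.title.strip(), parser.links


sample = (
    "<html><head><title> Home </title></head><body>"
    '<a href="/about">About</a><a href="https://example.org/x">X</a>'
    "</body></html>"
)
title, links = parse_page("https://example.com/", sample)
print(title, links)
```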
Usage
If you would like to see the list of supported features, simply run
Crawl only 3 levels deep
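Independent of this tool's actual command line, a level limit can be pictured as a breadth-first traversal that stops following links at a given depth. The sketch below assumes the start page counts as level 0 (the tool may count differently) and uses a hypothetical in-memory `SITE` map instead of real HTTP requests.

```python
from collections import deque

# Hypothetical in-memory site: each page maps to the pages it links to.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": ["/"],
    "/a1": ["/a2"],
    "/a2": [],
}


def crawl(start, max_level, get_links):
    """Visit pages breadth-first, never following links past `max_level`."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, level = queue.popleft()
        visited.append((url, level))
        if level == max_level:
            continue  # page is visited, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return visited


visited_pages = crawl("/", max_level=2, get_links=lambda u: SITE.get(u, []))
print(visited_pages)
```

The `seen` set keeps the crawler from revisiting pages that link back to each other, such as `/b` pointing at `/`.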
Search for specific text throughout the website
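Conceptually, a site-wide text search checks each crawled page for the search term. The following sketch assumes case-insensitive matching, which may differ from the tool's behavior; `find_text` and `PAGE_TEXT` are hypothetical names, and the page texts are made up for illustration.

```python
def find_text(page_text, needle):
    """Return the URLs whose text contains `needle`, ignoring case."""
    needle = needle.lower()
    return [url for url, text in page_text.items() if needle in text.lower()]


# Hypothetical crawled content: URL -> visible text of the page.
PAGE_TEXT = {
    "https://example.com/": "Welcome to the Example site",
    "https://example.com/pricing": "Plans and pricing",
    "https://example.com/about": "About the example team",
}

matches = find_text(PAGE_TEXT, "example")
print(matches)
```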
Gather all the links along with their titles to a CSV file. A CSV file with the links and their titles will be created after the crawl completes
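A minimal sketch of producing such a file with Python's `csv` module is shown below. The column names `url` and `title` are assumptions, since the tool's actual CSV layout is not documented here, and the output goes to an in-memory buffer rather than a file so the example has no side effects.

```python
import csv
import io


def write_titles(rows, out):
    """Write (url, title) pairs to `out` as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["url", "title"])  # assumed column names
    writer.writerows(rows)


buffer = io.StringIO()
write_titles(
    [
        ("https://example.com/", "Home"),
        ("https://example.com/about", "About, us"),
    ],
    buffer,
)
csv_text = buffer.getvalue()
print(csv_text)
```

Using `csv.writer` rather than joining strings by hand ensures that titles containing commas or quotes are escaped correctly.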
Use proxies to crawl the site.
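For context, the Requests library accepts a `proxies` mapping from URL scheme to proxy address. The sketch below builds such a mapping; `build_proxies` is a hypothetical helper, the address is a documentation-range placeholder, and how this tool passes proxies internally is not shown in this README.

```python
def build_proxies(address):
    """Build a scheme-to-proxy mapping in the shape Requests accepts via `proxies=`."""
    # Assume an HTTP proxy when no scheme is given.
    proxy_url = address if "://" in address else "http://" + address
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("203.0.113.10:8080")
print(proxies)

# With Requests installed, the mapping would be passed along these lines:
#   requests.get("https://example.com/", proxies=proxies, timeout=10)
```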
Download files
Source Distribution
pywebcrawler-0.0.1.tar.gz (8.2 kB)
Built Distribution
Hashes for pywebcrawler-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 9decb290e655c1bd8b851cdf95556162c8285bb404b459dabada1f0ac8c70d2a
MD5 | 3f6d412bf4f73e7e0d62a956d787df3a
BLAKE2b-256 | 3138877e41e197bf1aec6a73533b0fb7ae326ec5decf7a3d7ce4b6ad577070de