A concurrent flood web scraper.
Project description
Concurrent Flood Scraper
It’s probably exactly what you think it is, based on the name.
GET a page, scrape it for URLs, and filter those according to some regex. Put all the matches in a master queue. Scrape the page for any data you want. Repeat…
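The loop above can be sketched in a few lines. This is a minimal, single-threaded illustration of the idea, not the library's code; it runs against a tiny in-memory "web" (the URLs and page contents are made up) instead of doing real HTTP GETs.

```python
import re
from collections import deque

# Tiny in-memory "web" so the sketch runs offline; the page
# contents and URL names here are invented for illustration.
PAGES = {
    "http://example.com/a": '<a href="http://example.com/b"></a>'
                            '<a href="http://other.com/x"></a>',
    "http://example.com/b": '<a href="http://example.com/a"></a>',
}

def flood_scrape(root, url_filter=r"^http://example\.com/", page_limit=10):
    """GET a page, scrape its URLs, keep those matching the filter,
    put them in a queue, and repeat until the queue drains or the
    page limit is hit."""
    queue, seen, scraped = deque([root]), {root}, []
    while queue and len(scraped) < page_limit:
        url = queue.popleft()
        text = PAGES.get(url, "")           # stands in for an HTTP GET
        scraped.append(url)                 # "scrape page for any data"
        for link in re.findall(r'href="([^"]+)"', text):
            if re.match(url_filter, link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return scraped

print(flood_scrape("http://example.com/a"))
# ['http://example.com/a', 'http://example.com/b']
```

Note that the link to other.com is dropped by the filter regex, which is what keeps a flood scraper from wandering across the whole web.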
There’s a small demo in the wikipedia_demo module, which shows how easy it is to set up for your web-scraping needs!
Specifics
1. Create a child class of concurrentfloodscraper.Scraper and implement the scrape_page(self, text) method, where text is the raw HTML. This method does the page-specific scraping. Note that only URLs matching the class’s url_filter_regex will be added to the master queue.
2. Decorate your Scraper subclass with concurrentfloodscraper.Route. The single parameter is a regex; URLs that match it will be parsed with that scraper.
3. Repeat steps 1 and 2 for as many different types of pages as you expect to be scraping.
4. Create an instance of concurrentfloodscraper.ConcurrentFloodScraper, passing it the root URL, the number of threads to use, and a page limit. The page limit defaults to None, which means ‘go forever’.
5. Start the ConcurrentFloodScraper instance, and enjoy the magic!
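To make the steps above concrete, here is a self-contained sketch of the pattern. The class and method names (Scraper, scrape_page, url_filter_regex, Route, ConcurrentFloodScraper, start) mirror the API described above, but this is an illustrative reimplementation running against a fake two-page site, not the library's actual code, and the start() method name is an assumption.

```python
import re
import threading
from queue import Queue, Empty

ROUTES = []  # (compiled regex, scraper class) pairs

def Route(pattern):
    """Decorator: URLs matching `pattern` are parsed by the class."""
    def register(cls):
        ROUTES.append((re.compile(pattern), cls))
        return cls
    return register

class Scraper:
    url_filter_regex = r".*"  # subclasses narrow which links get queued

    def scrape_page(self, text):
        raise NotImplementedError

    def extract_links(self, text):
        pat = re.compile(self.url_filter_regex)
        return [u for u in re.findall(r'href="([^"]+)"', text) if pat.match(u)]

class ConcurrentFloodScraper:
    def __init__(self, root_url, num_threads=4, page_limit=None):
        self.queue = Queue()
        self.queue.put(root_url)
        self.seen = {root_url}
        self.lock = threading.Lock()
        self.num_threads = num_threads
        self.page_limit = page_limit  # None means 'go forever'
        self.pages_done = 0
        self.results = []

    def fetch(self, url):
        return FAKE_WEB.get(url, "")  # stands in for a real HTTP GET

    def worker(self):
        while True:
            with self.lock:
                if self.page_limit is not None and self.pages_done >= self.page_limit:
                    return
            try:
                url = self.queue.get(timeout=0.2)
            except Empty:
                return  # queue drained; this worker is done
            text = self.fetch(url)
            for regex, cls in ROUTES:  # first matching route wins
                if regex.match(url):
                    scraper = cls()
                    self.results.append(scraper.scrape_page(text))
                    for link in scraper.extract_links(text):
                        with self.lock:
                            if link not in self.seen:
                                self.seen.add(link)
                                self.queue.put(link)
                    break
            with self.lock:
                self.pages_done += 1

    def start(self):
        threads = [threading.Thread(target=self.worker) for _ in range(self.num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

# A fake two-page site so the demo runs offline.
FAKE_WEB = {
    "http://example.com/": '<h1>home</h1><a href="http://example.com/about"></a>',
    "http://example.com/about": '<h1>about</h1>',
}

@Route(r"^http://example\.com/")
class ExamplePage(Scraper):
    url_filter_regex = r"^http://example\.com/"

    def scrape_page(self, text):
        m = re.search(r"<h1>(.*?)</h1>", text)
        return m.group(1) if m else None

fs = ConcurrentFloodScraper("http://example.com/", num_threads=2)
fs.start()
print(sorted(fs.results))  # ['about', 'home']
```

The shared `seen` set (guarded by a lock) is what stops threads from queueing the same URL twice; the timeout on `queue.get` gives workers a simple way to exit once the flood dries up.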
Hashes for concurrentfloodscraper-1.0.1-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 27a56763c000c81d987efc3bf82835772f0695899d088b61e434d29bf0fac8a8
MD5 | 30db68b52d4375893c571360abe8ad4e
BLAKE2b-256 | e000995311f710f0a7217b65cbf128a636d7d14e764a5e4883b9b5e6beb31a84