A concurrent flood web scraper.
Concurrent Flood Scraper
It’s probably exactly what you think it is, based on the name.
GET a page, scrape it for URLs, and filter those against a regex. Put the URLs that match in a master queue. Scrape the page for any data you want. Repeat…
There’s a small demo in wikipedia_demo. It shows how easy it is to set up the scraper to fit your web scraping needs!
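The loop above can be sketched with the standard library alone. This is a minimal illustration, not this package's implementation: `FAKE_WEB`, `fetch`, `flood`, and the `example.test` URLs are hypothetical, the fetch step reads from an in-memory dictionary instead of issuing a real GET, and the sketch crawls in waves with a thread pool rather than using a shared master queue.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web" so the sketch runs offline.
FAKE_WEB = {
    "http://example.test/a": '<a href="http://example.test/b">b</a> '
                             '<a href="http://other.test/x">x</a>',
    "http://example.test/b": '<a href="http://example.test/a">a</a> '
                             '<a href="http://example.test/c">c</a>',
    "http://example.test/c": "<p>no links</p>",
}

HREF_RE = re.compile(r'href="([^"]+)"')            # naive link extraction
URL_FILTER = re.compile(r"^http://example\.test/")  # only these URLs are queued


def fetch(url):
    """Stand-in for an HTTP GET; returns raw HTML, or '' for unknown URLs."""
    return FAKE_WEB.get(url, "")


def flood(root, num_threads=4, page_limit=None):
    """Crawl outward from root and return the URLs visited, in order."""
    seen = {root}      # every URL ever queued, to avoid revisiting
    frontier = [root]  # URLs to fetch in the current wave
    visited = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while frontier:
            if page_limit is not None:
                # Trim the wave so the total never exceeds the limit.
                frontier = frontier[: page_limit - len(visited)]
                if not frontier:
                    break
            pages = list(pool.map(fetch, frontier))  # concurrent fetches
            visited.extend(frontier)
            next_frontier = []
            for html in pages:
                for link in HREF_RE.findall(html):
                    if URL_FILTER.match(link) and link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return visited
```

Calling `flood("http://example.test/a")` visits the three `example.test` pages and skips the `other.test` link, because that URL does not match the filter. Passing `page_limit=2` stops after two pages.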
1. Create a child class of concurrentfloodscraper.Scraper and implement the scrape_page(self, text) method, where text is the raw HTML. This method does the scraping specific to that kind of page. Note that only URLs that match the class’s url_filter_regex will be added to the master queue.
2. Annotate your Scraper subclass with concurrentfloodscraper.Route. Its single parameter is a regex; URLs that match the regex will be parsed with that scraper.
3. Repeat steps 1 and 2 for each type of page you expect to scrape.
4. Create an instance of concurrentfloodscraper.ConcurrentFloodScraper and pass it the root URL, the number of threads to use, and a page limit. The page limit defaults to None, which means ‘go forever’.
5. Start the ConcurrentFloodScraper instance, and enjoy the magic!
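To make the routing idea in the steps above concrete, here is a small self-contained sketch of how a regex-to-scraper registry could work. It does not import or reproduce concurrentfloodscraper: `route`, `BaseScraper`, `ROUTES`, `scraper_for`, and `WikiScraper` are hypothetical stand-ins, and it assumes the Route annotation behaves like a class decorator. Only the names `scrape_page` and `url_filter_regex` come from the description above; the return value of `scrape_page` here is also an assumption.

```python
import re

ROUTES = []  # (compiled regex, scraper class) pairs, in registration order


def route(pattern):
    """Hypothetical stand-in for Route: register a class for matching URLs."""
    def decorator(cls):
        ROUTES.append((re.compile(pattern), cls))
        return cls
    return decorator


class BaseScraper:
    """Hypothetical stand-in for the Scraper base class."""
    url_filter_regex = r".*"  # only links matching this would be queued

    def scrape_page(self, text):
        raise NotImplementedError


@route(r"^https://en\.wikipedia\.org/wiki/")
class WikiScraper(BaseScraper):
    # Skip namespaced pages such as "File:" or "Talk:".
    url_filter_regex = r"^https://en\.wikipedia\.org/wiki/[^:]+$"

    def scrape_page(self, text):
        # Pull the page title out of the raw HTML.
        match = re.search(r"<title>(.*?)</title>", text, re.S)
        return match.group(1).strip() if match else None


def scraper_for(url):
    """Return an instance of the first registered scraper matching url."""
    for regex, cls in ROUTES:
        if regex.search(url):
            return cls()
    return None
```

With this registry, `scraper_for("https://en.wikipedia.org/wiki/Web_scraping")` returns a `WikiScraper`, while a URL that matches no registered pattern returns `None`. For the real class names, constructor arguments, and start call, see wikipedia_demo.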