A concurrent flood web scraper.
Project description
Concurrent Flood Scraper
It’s probably exactly what you think it is, based on the name.
GET a page, scrape it for URLs, and filter those against a regex. Put the matches in a master queue. Scrape the page for any data you want. Repeat…
There’s a small demo in wikipedia_demo, which shows how easy it is to set up for your own scraping needs!
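The loop above can be sketched in a few lines. This is a simplified, single-threaded illustration of the idea, not the package's code; the in-memory PAGES dict and fetch() are hypothetical stand-ins for real HTTP GETs so the example runs offline.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> raw HTML (stands in for HTTP GETs).
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a>'
                           '<a href="http://other.com/x">X</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": "",
}

URL_RE = re.compile(r'href="([^"]+)"')           # scrape for URLs
FILTER_RE = re.compile(r"^http://example\.com")  # filter against some regex

def fetch(url):
    """Stand-in for a real HTTP GET."""
    return PAGES.get(url, "")

def flood_scrape(root):
    queue, seen, visited = deque([root]), {root}, []
    while queue:                       # repeat until the queue drains
        url = queue.popleft()
        text = fetch(url)              # GET a page
        visited.append(url)            # scrape the page for any data you want
        for link in URL_RE.findall(text):
            if FILTER_RE.match(link) and link not in seen:
                seen.add(link)
                queue.append(link)     # matches go into the master queue
    return visited

print(flood_scrape("http://example.com/"))
# → ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

Note that `http://other.com/x` is found but never queued, because it fails the filter regex.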
Specifics
1. Create a subclass of concurrentfloodscraper.Scraper and implement the scrape_page(self, text) method, where text is the raw HTML. This is where you do the page-specific scraping. Note that only URLs matching the class's url_filter_regex will be added to the master queue.
2. Decorate your Scraper subclass with concurrentfloodscraper.Route. Its single parameter is a regex; URLs that match it will be parsed with that scraper.
3. Repeat steps 1 and 2 for as many different types of pages as you expect to scrape.
4. Create an instance of concurrentfloodscraper.ConcurrentFloodScraper, passing it the root URL, the number of threads to use, and a page limit. The page limit defaults to None, which means ‘go forever’.
5. Start the ConcurrentFloodScraper instance, and enjoy the magic!
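The steps above can be sketched end to end. The names below (Scraper, scrape_page, url_filter_regex, Route, ConcurrentFloodScraper) follow the description in this README, but the implementation here is a simplified stand-in written for illustration, not the package's actual source; the in-memory PAGES dict replaces real HTTP GETs so the sketch runs offline.

```python
import re
import threading
import queue

ROUTES = []  # (compiled URL regex, scraper class) pairs

def Route(pattern):
    """Step 2: register a Scraper subclass for URLs matching `pattern`."""
    def wrap(cls):
        ROUTES.append((re.compile(pattern), cls))
        return cls
    return wrap

class Scraper:
    url_filter_regex = re.compile(".*")      # which found URLs enter the queue
    link_re = re.compile(r'href="([^"]+)"')

    def scrape_page(self, text):             # step 1: implemented by subclasses
        raise NotImplementedError

    def scrape_urls(self, text):
        return [u for u in self.link_re.findall(text)
                if self.url_filter_regex.match(u)]

# Hypothetical in-memory pages standing in for real HTTP GETs.
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">a</a>',
    "http://example.com/a": "payload",
}

RESULTS = []  # data collected by the scraper below

@Route(r"^http://example\.com")
class ExampleScraper(Scraper):
    url_filter_regex = re.compile(r"^http://example\.com")

    def scrape_page(self, text):
        RESULTS.append(text)  # collect whatever data you want

class ConcurrentFloodScraper:
    def __init__(self, root, num_threads=2, page_limit=None):
        self.q = queue.Queue()
        self.q.put(root)
        self.seen = {root}
        self.lock = threading.Lock()
        self.num_threads = num_threads
        self.page_limit = page_limit
        self.pages = 0

    def worker(self):
        while True:
            try:
                url = self.q.get(timeout=0.2)
            except queue.Empty:
                return                        # queue drained: stop
            with self.lock:
                if self.page_limit is not None and self.pages >= self.page_limit:
                    return
                self.pages += 1
            text = PAGES.get(url, "")         # stand-in for an HTTP GET
            for regex, cls in ROUTES:         # dispatch by Route regex
                if regex.match(url):
                    scraper = cls()
                    scraper.scrape_page(text)
                    for link in scraper.scrape_urls(text):
                        with self.lock:
                            if link not in self.seen:
                                self.seen.add(link)
                                self.q.put(link)
                    break

    def start(self):
        threads = [threading.Thread(target=self.worker)
                   for _ in range(self.num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

ConcurrentFloodScraper("http://example.com/", num_threads=2).start()
print(sorted(RESULTS))
```

Both pages get scraped: the root page's HTML and the "payload" page end up in RESULTS, while any URL failing url_filter_regex would never enter the queue.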
Hashes for concurrentfloodscraper-1.0.0.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 1c6ee38f60cdc141e8d95c11fcd88e17df2d98eb453767c39a137978eaf2d698
MD5 | 270bbb9ae6eaab4e43317df2a3b0a803
BLAKE2b-256 | d415786a012c9f4e0edd7955fea0114674bbaf54e29158dbb02f586bd996bffd

Hashes for concurrentfloodscraper-1.0.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 717bad5c60d21f5be2a8e2d14bcb425dff1ec13760a842a018a947025fbc5f9a
MD5 | f6ff065d8c786d35016f2de1452a22f2
BLAKE2b-256 | 0ea959d0b305df5e6be03c5a69533e2e5707f21946757ef55b51347787541195