concurrentfloodscraper·PyPI

A concurrent flood web scraper.

These details have not been verified by PyPI

Project links

Homepage

Project description

Concurrent Flood Scraper

It’s probably exactly what you think it is, based on the name.

GET a page, scrape it for URLs, and filter those against a regex. Put the matching URLs in a master queue. Scrape the page for any data you want. Repeat…
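The loop described above can be sketched in a few lines. This is a single-threaded illustration, not the library's code: the in-memory PAGES dict, the fetch stand-in, and the flood function are all hypothetical names chosen so the example runs without network access.

```python
import re
from collections import deque

# Hypothetical in-memory "site" so the sketch runs offline.
PAGES = {
    "http://example.test/a": '<a href="http://example.test/b">b</a> '
                             '<a href="http://other.test/x">x</a>',
    "http://example.test/b": '<a href="http://example.test/a">a</a> '
                             '<a href="http://example.test/c">c</a>',
    "http://example.test/c": "no links here",
}

HREF_RE = re.compile(r'href="([^"]+)"')


def fetch(url):
    """Stand-in for an HTTP GET; returns raw HTML, or '' for unknown URLs."""
    return PAGES.get(url, "")


def flood(root_url, url_filter_regex, page_limit=None):
    """GET a page, queue the links that match the filter, repeat."""
    url_filter = re.compile(url_filter_regex)
    queue = deque([root_url])   # the "master queue"
    seen = {root_url}           # avoids queueing the same URL twice
    visited = []
    while queue and (page_limit is None or len(visited) < page_limit):
        url = queue.popleft()
        text = fetch(url)
        visited.append(url)     # a real scraper would extract data from text here
        for link in HREF_RE.findall(text):
            if url_filter.match(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = flood("http://example.test/a", r"http://example\.test/")
print(order)
```

The link to other.test is dropped by the filter, and the link back to the first page is skipped because it has already been seen.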

There’s a small demo in wikipedia_demo. It shows how easy it is to set up the scraper to fit your web scraping needs!

Specifics

1. Create a child class of concurrentfloodscraper.Scraper and implement the scrape_page(self, text) method, where text is the raw HTML of the page. Do the page-specific scraping in this method. Note that only URLs that match the class’s url_filter_regex will be added to the master queue.
2. Annotate your Scraper subclass with concurrentfloodscraper.Route. Its single parameter is a regex; URLs that match the regex will be parsed with that scraper.
3. Repeat steps 1 and 2 for each type of page you expect to scrape.
4. Create an instance of concurrentfloodscraper.ConcurrentFloodScraper, passing it the root URL, the number of threads to use, and a page limit. The page limit defaults to None, which means ‘go forever’.
5. Start the ConcurrentFloodScraper instance, and enjoy the magic!
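To make the steps above concrete, here is a self-contained sketch of the same pattern built only on the standard library. It deliberately does not import concurrentfloodscraper: route, BaseScraper, WikiScraper, handle, run, and the PAGES dict are hypothetical stand-ins that mirror the described Route, Scraper, and ConcurrentFloodScraper roles, and the library's real signatures and internals may differ.

```python
import re
from concurrent.futures import ThreadPoolExecutor

ROUTES = []  # (compiled regex, scraper class) pairs filled in by route()


def route(pattern):
    """Class decorator (stand-in for Route): register a scraper for matching URLs."""
    def register(cls):
        ROUTES.append((re.compile(pattern), cls))
        return cls
    return register


class BaseScraper:
    """Stand-in for Scraper: subclasses implement scrape_page(self, text)."""
    url_filter_regex = r".*"  # only links matching this are queued

    def scrape_page(self, text):
        raise NotImplementedError

    def links(self, text):
        found = re.findall(r'href="([^"]+)"', text)
        return [u for u in found if re.match(self.url_filter_regex, u)]


@route(r"http://example\.test/wiki/")
class WikiScraper(BaseScraper):
    url_filter_regex = r"http://example\.test/wiki/"

    def scrape_page(self, text):
        match = re.search(r"<title>(.*?)</title>", text)
        return match.group(1) if match else None


# Hypothetical in-memory pages so the sketch runs offline.
PAGES = {
    "http://example.test/wiki/A": '<title>A</title>'
        '<a href="http://example.test/wiki/B">B</a>'
        '<a href="http://example.test/about">about</a>',
    "http://example.test/wiki/B": '<title>B</title>'
        '<a href="http://example.test/wiki/A">A</a>',
}


def handle(url):
    """Fetch one URL, pick its scraper by route, return (data, new links)."""
    text = PAGES.get(url, "")
    for regex, cls in ROUTES:
        if regex.match(url):
            scraper = cls()
            return scraper.scrape_page(text), scraper.links(text)
    return None, []


def run(root_url, num_threads, page_limit=None):
    """Crawl level by level, scraping each level's pages on a thread pool."""
    results, seen, frontier = {}, {root_url}, [root_url]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        while frontier:
            if page_limit is not None:
                frontier = frontier[: max(page_limit - len(results), 0)]
            next_frontier = []
            for url, (data, links) in zip(frontier, pool.map(handle, frontier)):
                results[url] = data
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return results


results = run("http://example.test/wiki/A", num_threads=2)
print(results)
```

Processing one level at a time keeps the shared bookkeeping (results, seen) on the main thread, so no locks are needed; a shared work queue polled by long-lived worker threads would be another reasonable design.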


Release history

1.0.1 (this version): Feb 14, 2017

1.0.0: Feb 14, 2017

Download files


Source Distributions

No source distribution files available for this release.

Built Distribution

concurrentfloodscraper-1.0.1-py3-none-any.whl (9.8 kB)

Uploaded Feb 14, 2017 Python 3

File details

Details for the file concurrentfloodscraper-1.0.1-py3-none-any.whl.

File metadata

Download URL: concurrentfloodscraper-1.0.1-py3-none-any.whl
Upload date: Feb 14, 2017
Size: 9.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No

File hashes

Hashes for concurrentfloodscraper-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`27a56763c000c81d987efc3bf82835772f0695899d088b61e434d29bf0fac8a8`
MD5	`30db68b52d4375893c571360abe8ad4e`
BLAKE2b-256	`e000995311f710f0a7217b65cbf128a636d7d14e764a5e4883b9b5e6beb31a84`

