
Scalable web crawler.

Project description

BarkingOwl is a scalable web crawler designed to find specific document types, such as PDFs. Install it from PyPI:

> pip install barkingowl

Example using just the web scraper component:

# import BarkingOwl scraper
from barking_owl import Scraper

# import the uuid lib so we can give our scraper a unique ID
import uuid

def finished():
    print("done.")

def started():
    print("starting.")

def docfound(payload):

    """
    {
        'sourceid': 'a0351229-3ebd-42b2-add7-4dea1012f96b',
        'destinationid': 'broadcast',
        'command': 'found_doc',
        'message': {
            'scrapedatetime': '2014-02-03 22:47:25',
            'docurl': u'http://targeturl.com/somedoc.pdf',
            'urldata': {
                'maxlinklevel': 1,
                'doctype': 'application/pdf',
                'targeturl': 'http://targetwebsite.com',
                'myfield': 'anything I want!',
            },
            'linktext': u'A PDF Document'
        }
    }
    """

    print(payload)

if __name__ == '__main__':

    # create a unique id for our scraper
    uid = str(uuid.uuid4())

    # create an instance of our scraper
    scraper = Scraper(uid=uid)

    # set our scraper callbacks
    scraper.setFinishedCallback(
        finishedCallback=finished,      # called when the scraper has finished scraping the entire site
        startedCallback=started,        # called when the scraper successfully starts
        broadcastDocCallback=docfound,  # called every time a document is found
    )

    # create the URL payload that will be sent to the scraper
    urldata = {
        'targeturl': "http://targetwebsite.com", # the url you are trying to scrape
        'maxlinklevel': 1,                       # the number of link levels to follow
                                                 # note: anything over 3 can get really dangerous ...
        'doctype': 'application/pdf',            # the type of document to search for, based on its magic number
        'myfield': 'anything I want!',           # note: the payload dictionary can have any additional
                                                 #       fields within it you would like
    }

    # set the payload within the scraper
    scraper.seturldata(urldata)

    # start the scraper thread
    scraper.start()

    # begin scraping!
    try:
        scraper.begin()
    except:
        # an exception will be thrown when the scraper is exited.
        pass

    print("all done.")
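If you want to do more than print each payload, the docfound callback can pull fields out of the found_doc message documented above. A minimal sketch, assuming only that payload structure (found_docs is a hypothetical accumulator, not part of BarkingOwl's API):

```python
# collect (url, link text) pairs for every document the scraper reports
found_docs = []

def docfound(payload):
    # 'message' holds the per-document data, as shown in the
    # found_doc payload example above
    message = payload['message']
    found_docs.append((message['docurl'], message['linktext']))
    print("found {0}: {1}".format(message['linktext'], message['docurl']))

# exercise the callback with a payload shaped like the example above
docfound({
    'command': 'found_doc',
    'message': {
        'docurl': 'http://targeturl.com/somedoc.pdf',
        'linktext': 'A PDF Document',
    },
})
```

From here you could download each URL, queue it for processing, or write it to a database.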

Fully tested pull requests and well-written issues are welcome on GitHub: https://github.com/thequbit/BarkingOwl
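For context, the magic-number matching behind the doctype filter identifies a file by its leading bytes rather than its extension. A standalone sketch of the idea (looks_like_pdf is a hypothetical helper, not part of BarkingOwl):

```python
def looks_like_pdf(data):
    # every PDF file begins with the bytes "%PDF-" (its magic number),
    # so a quick prefix check identifies it regardless of the URL's extension
    return data[:5] == b"%PDF-"

print(looks_like_pdf(b"%PDF-1.4 ..."))  # True
print(looks_like_pdf(b"GIF89a ..."))    # False
```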



Download files


Source Distribution

BarkingOwl-0.0.4.tar.gz (8.8 kB)
