Scalable web crawler.
Project description
BarkingOwl is a scalable web crawler intended to be used to find specific document types such as PDFs.:
> pip install barkingowl
Example implementation of just the web scraper:
# import BarkingOwl scraper from barking_owl import Scraper # import the uuid lib so we can give our scraper a unique ID import uuid def finished(): print "done." def started(): print "starting." def docfound(payload): """ { 'sourceid': 'a0351229-3ebd-42b2-add7-4dea1012f96b', 'destinationid': 'broadcast', 'command': 'found_doc', 'message': { 'scrapedatetime': '2014-02-03 22:47:25', 'docurl': u'http://targeturl.com/somedoc.pdf', 'urldata': { 'maxlinklevel': 1, 'doctype': 'application/pdf', 'targeturl': 'http://targetwebsite.com', 'myfield': 'anything I want!', }, 'linktext': u'A PDF Document' } } """ print payload; if __name__ == '__main__': # create a unique id for our scraper uid = str(uuid.uuid4()) # create an instance of our scraper scraper = Scraper(uid=uid) # set our scraper callbacks scraper.setFinishedCallback( finishedCallback=finished, # called when the scraper has finished scraping the entire site startedCallback=started, # called when the scraper successfully starts broadcastDocCallback=docfound, # called every time a document is found ) # create the URL payload that will be sent to the urldata = { 'targeturl': "http://targetwebsite.com", # the url you are trying to scrape 'maxlinklevel': 1, # the number of link levels to follow # note: anything over 3 can get really dangerious ... 'doctype': 'application/pdf', # the type of document to search for based on the magic number lib 'myfield': 'anything I want!', # note: the pyaload dictionary can have any additional fields # within it you would like. } # set the payload within the scraper self.scraper.seturldata(urldata) # start the scraper thread self.scraper.start() # begin scraping! try: scraper.begin() except: # an exception will be thrown when the scraper is exited. pass print "all done."
Fully tested Pull requests and well written issues welcome on github: https://github.com/thequbit/BarkingOwl
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
BarkingOwl-0.0.4.tar.gz
(8.8 kB
view hashes)