Three ways to write a spider in Python
djangospider is a lightweight web crawling framework. Its code base is small, yet it is designed for high-speed crawling and supports three crawl modes: multithreading, the Tornado IOLoop, and the Twisted reactor. It is also an easy way to learn how an asynchronous crawler is used.
Requires Python 2.7. Works on Linux.
- Download the zip package from GitHub, unpack it, change to the directory that contains setup.py, and run: $ sudo python setup.py install
The entry function: Start(start_urls, mode)
start_urls parameter: a list of tuples. The first element of each tuple is the URL to crawl, and the second element is the callback for that URL.
mode parameter: selects how the crawler runs. It takes one of three integer values:
- 1: multithreading
- 2: Tornado async
- 3: Twisted async
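To make the shape of the two parameters concrete, here is a minimal sketch that uses only the standard library. It is not djangospider's code: start_sketch, fake_fetch, and the MODE_* constants are hypothetical names made up for illustration, the URLs are placeholders, and only the multithreading mode is sketched. The fetch step is a stand-in that never touches the network.

```python
import threading

# Hypothetical names mirroring the integer modes described above.
MODE_THREADS, MODE_TORNADO, MODE_TWISTED = 1, 2, 3


def fake_fetch(url):
    """Stand-in for a real HTTP request; returns a fake response body."""
    return "<html>%s</html>" % url


def start_sketch(start_urls, mode, fetch=fake_fetch):
    """Illustrative stand-in for Start(start_urls, mode); not the library code."""
    if mode not in (MODE_THREADS, MODE_TORNADO, MODE_TWISTED):
        raise ValueError("mode must be 1, 2 or 3, got %r" % (mode,))
    if mode != MODE_THREADS:
        raise NotImplementedError("only the threaded mode is sketched here")

    def worker(url, callback):
        # Fetch the page, then hand the result to the URL's own callback.
        callback(fetch(url), url)

    # One thread per (url, callback) tuple; each tuple unpacks into worker's args.
    threads = [threading.Thread(target=worker, args=pair) for pair in start_urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


seen = []


def callback(response, url):
    # list.append is thread-safe in CPython, so no lock is needed here.
    seen.append((url, response))


start_urls = [
    ("http://example.com/a", callback),
    ("http://example.com/b", callback),
]
start_sketch(start_urls, MODE_THREADS)
```

Each URL carries its own callback, so different pages can be handled by different functions within a single crawl. The callback takes (response, url), the same argument order as the callback used with Start.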
from djangospider.mycrawl.run import Start, _crawl

# Called for each crawled URL with the response and the URL itself.
def callback(response, url):
    print("get the %s" % url)