Async web scraping framework
Project description
Key Features
Provides a web scraping framework used to crawl web pages.
Provides data extraction tools used to extract structured data from web pages.
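The crawling model behind the first feature can be sketched with the standard library alone: a set of requests is fetched concurrently, and each response is passed to a callback that returns a structured record. This is only an illustration of the general pattern, not xpaw's implementation; `FAKE_PAGES`, `fetch`, `crawl`, `parse`, and `run_demo` are hypothetical names, and the pages are canned strings rather than real HTTP responses.

```python
import asyncio

# Hypothetical canned pages standing in for real HTTP responses.
FAKE_PAGES = {
    "http://example.com/a": "<html>page A</html>",
    "http://example.com/b": "<html>page B</html>",
}


async def fetch(url):
    """Pretend to download a page; a real framework would issue an HTTP request."""
    await asyncio.sleep(0)  # yield control to the event loop, as network I/O would
    return FAKE_PAGES[url]


async def crawl(requests):
    """Fetch every (url, callback) pair concurrently and collect the callback results."""

    async def handle(url, callback):
        body = await fetch(url)
        return callback(url, body)

    # gather() returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*(handle(url, cb) for url, cb in requests))


def parse(url, body):
    """Callback: turn a raw page into a small structured record."""
    return {"url": url, "length": len(body)}


def run_demo():
    requests = [(url, parse) for url in FAKE_PAGES]
    return asyncio.run(crawl(requests))


if __name__ == "__main__":
    for record in run_demo():
        print(record)
```

Pairing each request with its callback keeps the fetching logic independent of what is done with each page.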
Spider Example
The following example spider crawls the "Top News" section of the Tencent News home page:
from xpaw import Spider, HttpRequest, Selector, run_spider


class TencentNewsSpider(Spider):
    def start_requests(self):
        yield HttpRequest("http://news.qq.com/", callback=self.parse)

    def parse(self, response):
        selector = Selector(response.text)
        major_news = selector.css("div.major a.linkto").text
        self.log("Major news:")
        for i in range(len(major_news)):
            self.log("%s: %s", i + 1, major_news[i])


if __name__ == '__main__':
    run_spider(TencentNewsSpider)
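The spider class defines two methods:
start_requests: Returns the spider's initial requests.
parse: Handles the page returned for a request. Here it uses Selector and CSS selector syntax to extract the data we need.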
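To make the extraction step concrete without requiring xpaw or network access, the sketch below uses only the Python standard library to mimic what a selector such as "div.major a.linkto" matches: the text of every a element with class linkto nested anywhere inside a div with class major. The `MajorNewsParser` class, the `extract_major_news` helper, and `SAMPLE_HTML` are hypothetical and written for this illustration; xpaw's Selector is not implemented this way.

```python
from html.parser import HTMLParser


class MajorNewsParser(HTMLParser):
    """Collect the text of <a class="linkto"> elements nested in <div class="major">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # > 0 while inside div.major; also counts nested divs
        self.in_link = False  # True while inside a matching <a> element
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            # Enter the container, or track a div nested inside it.
            if self.div_depth > 0 or "major" in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth > 0 and "linkto" in classes:
            self.in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def extract_major_news(html):
    """Return the stripped link texts matched inside the div.major container."""
    parser = MajorNewsParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Hypothetical sample page; the real news.qq.com markup may differ.
SAMPLE_HTML = """
<div class="major">
  <a class="linkto" href="/1">First headline</a>
  <a class="other" href="/2">Not selected</a>
  <div><a class="linkto" href="/3">Second headline</a></div>
</div>
<a class="linkto" href="/4">Outside the container</a>
"""

if __name__ == "__main__":
    for i, title in enumerate(extract_major_news(SAMPLE_HTML), start=1):
        print("%s: %s" % (i, title))
```

Only the two links inside the container are returned. A selector library lets the same query be written as a single CSS expression.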
Documentation
Requirements
Project details
Download files
Source Distribution
xpaw-0.10.3.tar.gz (172.7 kB)