ruia_motor - a Ruia plugin that uses Motor to store data in MongoDB
Notice: Works on ruia >= 0.5.0
Installation
pip install -U ruia-motor
Usage
ruia-motor automatically stores the data you yield in MongoDB:
from ruia import AttrField, Item, Spider, TextField
from ruia_motor import RuiaMotor


class DoubanItem(Item):
    target_item = TextField(css_select='div.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='div.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq', default='')

    async def clean_title(self, title):
        if isinstance(title, str):
            return title
        else:
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])


class DoubanSpider(Spider):
    start_urls = ['https://movie.douban.com/top250']
    mongodb_config = {
        'host': '127.0.0.1',
        'port': 27017,
        'db': 'ruia_motor'
    }

    async def parse(self, response):
        etree = response.html_etree
        pages = ['?start=0&filter='] + [i.get('href') for i in etree.cssselect('.paginator>a')]
        for index, page in enumerate(pages):
            url = self.start_urls[0] + page
            yield self.request(
                url=url,
                metadata={'index': index},
                callback=self.parse_item
            )

    async def parse_item(self, response):
        async for item in DoubanItem.get_items(html=response.html):
            data = item.results
            yield RuiaMotor(collection='douban250', data=data)


async def init_plugins_after_start(spider_ins):
    RuiaMotor.init_spider(spider_ins=spider_ins)


if __name__ == '__main__':
    DoubanSpider.start(after_start=init_plugins_after_start)
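To make the storage step more concrete, here is a minimal sketch of how a mongodb_config dict and a yielded item could end up as a MongoDB insert. It is not ruia-motor's actual implementation, which this README does not show. The helper names build_mongo_uri and store, the FakeCollection stand-in, and the optional 'username' and 'password' keys are all hypothetical and used only for illustration.

```python
import asyncio
from urllib.parse import quote_plus


def build_mongo_uri(config):
    """Turn a mongodb_config-style dict into a MongoDB connection URI.

    'username' and 'password' are hypothetical optional keys; they do
    not appear in the README's example config.
    """
    host = config.get('host', '127.0.0.1')
    port = config.get('port', 27017)
    auth = ''
    if config.get('username') and config.get('password'):
        # Credentials must be percent-encoded inside a URI.
        user = quote_plus(config['username'])
        password = quote_plus(config['password'])
        auth = f'{user}:{password}@'
    return f'mongodb://{auth}{host}:{port}'


async def store(database, collection, data):
    """Insert one document into database[collection].

    `database` only needs dict-style access to collections whose
    insert_one() can be awaited, which is how Motor collections behave.
    """
    return await database[collection].insert_one(dict(data))


class FakeCollection:
    """In-memory stand-in so the sketch runs without a MongoDB server."""

    def __init__(self):
        self.documents = []

    async def insert_one(self, document):
        self.documents.append(document)
        return len(self.documents) - 1


async def main():
    config = {'host': '127.0.0.1', 'port': 27017, 'db': 'ruia_motor'}
    print(build_mongo_uri(config))

    # A plain dict of fake collections plays the role of the database.
    database = {'douban250': FakeCollection()}
    await store(database, 'douban250', {'title': 'example'})
    print(database['douban250'].documents)


if __name__ == '__main__':
    asyncio.run(main())
```

With Motor installed and a MongoDB server running, the fake database could be swapped for `AsyncIOMotorClient(build_mongo_uri(config))[config['db']]`, since Motor collections expose an awaitable `insert_one`.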
Enjoy it :)
Hashes for ruia_motor-0.0.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 7a6beb555958ccbf4b708c34d333469c0ea7d08fff9e9a25a1f9bfdddb59b307
MD5 | 841ae6361b74db15cb3b47951c7723bd
BLAKE2b-256 | a2d83b99ef0b28ecab5eefc2c71a482982ad5efa73e16564e4b2a68bae7c4363