lrabbit_scrapy
This is a small spider framework that is easy to run. If you often need to crawl a single site, you don't have to rewrite the same boilerplate code every time; with this small framework you can quickly crawl data into a file or database.
Installing
$ pip install lrabbit_scrapy
Quick start
- Create blog_spider.py:
from lrabbit_scrapy.spider import LrabbitSpider
from lrabbit_scrapy.common_utils.network_helper import RequestSession
from lrabbit_scrapy.common_utils.print_log_helper import LogUtils
from lrabbit_scrapy.common_utils.all_in_one import FileStore
import os
from parsel import Selector
class Spider(LrabbitSpider):
    """
    spider_name : lrabbit blog spider
    """
    # unique spider name
    spider_name = "lrabbit_blog"
    # maximum number of worker threads
    max_thread_num = 10
    # reset the whole task list on startup
    reset_task_config = True
    # enable looping of init_task_list once the task list is exhausted
    loop_task_config = False
    # remove config option
    remove_confirm_config = False
    # name of the environment variable that holds the config file path
    config_env_name = "config_path"
    # redis database number
    redis_db_config = 0

    def __init__(self):
        super().__init__()
        self.session = RequestSession()
        self.proxy_session = RequestSession(proxies=None)
        csv_path = os.path.join(os.path.abspath(os.getcwd()), f"{self.spider_name}.csv")
        self.field_names = ['id', 'title', 'datetime']
        self.blog_file = FileStore(file_path=csv_path, filed_name=self.field_names)

    def worker(self, task):
        LogUtils.log_info(task)
        html = self.session.send_request(method='GET', url=f'http://www.lrabbit.life/post_detail/?id={task}')
        selector = Selector(html)
        title = selector.css(".detail-title h1::text").get()
        datetime = selector.css(".detail-info span::text").get()
        if title:
            post_data = {"id": task, "title": title, 'datetime': datetime}
            self.blog_file.write(post_data)
            # when you successfully get the content, update the stats kept in redis
            self.update_stat_redis()
            self.task_list.remove(task)
            LogUtils.log_finish(task)

    def init_task_list(self):
        # or pull task ids from MySQL instead:
        # res = self.mysql_client.query("select id from rookie limit 100 ")
        # return [item['id'] for item in res]
        return [i for i in range(100)]


if __name__ == '__main__':
    spider = Spider()
    spider.run()
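The class attributes above drive the run loop: run() pulls tasks from init_task_list() and hands each one to worker() on a pool of max_thread_num threads. As a rough illustration of that execution model (a sketch only, not lrabbit_scrapy's own code):

# Rough sketch of the execution model implied by the settings above:
# every task returned by init_task_list() is handed to worker() on a
# pool of max_thread_num threads. Illustration, not the library's code.
from concurrent.futures import ThreadPoolExecutor

def run_model_sketch(spider):
    tasks = spider.init_task_list()
    with ThreadPoolExecutor(max_workers=spider.max_thread_num) as pool:
        # map() yields lazily; wrapping it in list() waits until every
        # worker(task) call has finished
        list(pool.map(spider.worker, tasks))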
- Set up crawl.ini and the config environment variable
  - Create crawl.ini, for example at /root/crawl.ini:
[server]
mysql_user = root
mysql_password = 123456
mysql_database = test
mysql_host = 192.168.1.1
redis_user = lrabbit
redis_host = 192.168.1.1
redis_port = 6379
redis_password = love20100001314

[test]
mysql_user = root
mysql_password = 123456
mysql_database = test
mysql_host = 192.168.1.1
redis_user = lrabbit
redis_host = 192.168.1.1
redis_port = 6379
redis_password = 123456
- Set the config environment variable (see the sketch after these steps)
  - Windows PowerShell
    - $env:config_path = "/root/crawl.ini"
  - Linux
    - export config_path="/root/crawl.ini"
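A minimal sketch of how this config is presumably resolved: the framework reads the file path from the environment variable named by config_env_name and parses it as an INI file. The snippet below uses the standard library's configparser; which section the library actually selects ([server] vs [test]) is an assumption here, shown only for illustration.

# Sketch: resolve the config path from the env variable set above and
# parse it with configparser. Section choice ("server") is illustrative;
# how lrabbit_scrapy picks a section is an assumption.
import configparser
import os

config_path = os.environ["config_path"]  # set via $env:config_path / export
parser = configparser.ConfigParser()
parser.read(config_path)

server = parser["server"]
print(server["mysql_host"], server.getint("redis_port"))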
- Run the spider:
  python3 blog_spider.py
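On a successful run, each crawled post is appended as a row to lrabbit_blog.csv in the working directory (the csv_path built in __init__), and update_stat_redis() records progress in redis.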
Links
- Author: https://www.lrabbit.life/