Make your spider multi-threaded.
Project description
MSpider
A multi-threaded spider wrapper that makes your spider multi-threaded with little effort, helping you crawl websites faster. :zap:
Note that this is for Python 3 only.
Install
MSpider can be installed easily using pip:
pip install mspider
Quick Start
Automatically create an MSpider
- cd to the folder where you'd like to create an MSpider in terminal or cmd, then type genspider -b <template based> <your spider name>, such as:
$ genspider -b MSpider test
where -b chooses the spider template to base on, either 'MSpider' (the default if not given) or 'Crawler', and test is the spider name. A file test.py that contains an MSpider has been created successfully if you see the following information:
create a spider named test.
- Open the spider file test.py. Find self.source = [] in line 8 (or line 15 if your spider template is 'Crawler'), and replace it with the sources (usually a list of urls) you'd like the spider to handle, such as:
self.source = ['http://www.github.com', 'http://www.baidu.com']
Each element of self.source is called src_item, and the index of src_item is called index.
- Find the function basic_func, where you define what your spider does with each source item, such as (a fleshed-out sketch follows this list):
def basic_func(self, index, src_item):
    url = src_item
    res = self.sess.get(url)
    html = res.content.decode('utf-8')
    # deal with the html
    # save the extracted information
- Run the spider to start crawling.
$ python3 test.py
Input the number of source items handled by each thread (BATCH SIZE) in the terminal or cmd and press Enter; the MSpider will then crawl your sources in a multi-threaded manner.
[INFO]: MSpider is ready.
[INFO]: 2 urls in total.
[INPUT]: BATCH SIZE: 1
[INFO]: Open threads: 100%|████████████████| 2/2 [00:00<00:00, 356.36it/s]
[INFO]: Task done.
[INFO]: The task costs 1.1157 sec.
[INFO]: 0 urls failed.
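As promised above, here is a fleshed-out sketch of basic_func. The title-extraction regex and the results.txt output path are hypothetical examples, not part of the generated template; with many threads appending to one file you may also want a lock.
import re

def basic_func(self, index, src_item):
    url = src_item
    res = self.sess.get(url)
    html = res.content.decode('utf-8')
    # Hypothetical: extract the contents of the <title> tag
    match = re.search(r'<title>(.*?)</title>', html, re.S)
    title = match.group(1).strip() if match else ''
    # Hypothetical: append one tab-separated line per url
    with open('results.txt', 'a', encoding='utf-8') as f:
        f.write('%d\t%s\t%s\n' % (index, url, title))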
Manually create an MSpider
- Import the MSpider.
from mspider.spider import MSpider
- Define the function of your single-threaded spider. Note that this function must have two parameters:
index: the index of the source item.
src_item: the source item to deal with in this function, usually a url or anything else you need to process, such as a tuple like (name, url).
def spi_func(index, src_item):
    name, url = src_item
    res = mspider.sess.get(url)
    html = res.content.decode('utf-8')
    # deal with the html
    # save the extracted information
- Now comes the key part. Create an instance of MSpider and pass it your spider function and the sources you'd like to crawl.
sources = [('github', 'http://www.github.com'),
           ('baidu', 'http://www.baidu.com')]
mspider = MSpider(spi_func, sources)
- Start to crawl!
mspider.crawl()
Then you will see the following information in your terminal or cmd. Input the BATCH SIZE, and the MSpider will crawl your sources in a multi-threaded manner.
[INFO]: MSpider is ready.
[INFO]: 2 urls in total.
[INPUT]: BATCH SIZE: 1
[INFO]: Open threads: 100%|████████████████| 2/2 [00:00<00:00, 356.36it/s]
[INFO]: Task done.
[INFO]: The task costs 1.1157 sec.
[INFO]: 0 urls failed.
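Putting the manual steps together, the whole script looks like this. Note that spi_func refers to the mspider instance created after the function definition; this works because the name is only looked up when the function is actually called during crawling.
from mspider.spider import MSpider

def spi_func(index, src_item):
    name, url = src_item
    # mspider is defined below; the name resolves at call time,
    # so the spider's shared session is available here.
    res = mspider.sess.get(url)
    html = res.content.decode('utf-8')
    # deal with the html
    # save the extracted information

sources = [('github', 'http://www.github.com'),
           ('baidu', 'http://www.baidu.com')]
mspider = MSpider(spi_func, sources)

if __name__ == '__main__':
    mspider.crawl()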
Usages
The mspider package has three main modules: pp, mtd, and spider.
- pp has a class ProxyPool, which helps you build a proxy IP pool from xici free IPs. Note that few of the free IPs actually work, so try not to rely on this module. If you'd like to use proxy IPs for your spider, though, this code may be a helpful starting point for writing your own proxy pool (see the sketch at the end of the ProxyPool usage section below).
- mtd has two classes, Crawler and Downloader. Crawler helps you make your spider multi-threaded (a hedged usage sketch follows this list). Downloader helps you download files in a multi-threaded manner, as long as you pass it your urls in the form of list(zip(names, urls)).
- spider has the class MSpider, which uses the Crawler from module mtd and ships some basic configurations for it, so it is the easier way to turn your spider into a multi-threaded spider.
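The exact Crawler interface is not documented on this page, so the following is only an assumption based on the MSpider interface and the 'Crawler' template mentioned above; check the file generated by genspider -b Crawler for the real signature.
from mspider.mtd import Crawler

# Hypothetical sketch: assumes Crawler, like MSpider, takes a
# spider function and a list of sources and exposes crawl().
def spi_func(index, src_item):
    url = src_item
    # fetch and process the url here
    print(index, url)

sources = ['http://www.github.com', 'http://www.baidu.com']
crawler = Crawler(spi_func, sources)
crawler.crawl()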
Usage of pp.ProxyPool
from mspider.pp import ProxyPool
pool = ProxyPool()
# Once an instance of ProxyPool is initialized,
# it will have an attribute named ip_list, which
# holds the IPs crawled from xici free IPs, grouped by protocol.
print(pool.ip_list)
"""
{'http': ['HTTP://211.162.70.229:3128',
'HTTP://124.207.82.166:8008',
'HTTP://121.69.37.6:9797',
'HTTP://1.196.160.94:9999',
'HTTP://59.44.247.194:9797',
'HTTP://14.146.92.72:9797',
'HTTP://223.166.247.206:9000',
'HTTP://182.111.129.37:53281',
'HTTP://58.243.50.184:53281',
'HTTP://218.28.58.150:53281'],
'https': ['HTTPS://113.140.1.82:53281',
'HTTPS://14.23.58.58:443',
'HTTPS://122.136.212.132:53281']}
"""
# Randomly choose an IP
protocol = "http" # or "https"
ip = pool.random_choose_ip(protocol)
print(ip)
"""
'HTTP://59.44.247.194:9797'
"""
# Update the IP list
pool.get_ip_list()
pool.check_all_ip()
# Request a url through a proxy using 'GET'
url = "http://www.google.com"
res = pool.open_url(url)
print(res.status_code)
"""
200
"""
# Request a url through a proxy using 'POST'
url = "http://www.google.com"
data = {'key':'value'}
res = pool.post(url, data)
print(res.status_code)
"""
200
"""
Usage of mtd.Downloader
from mspider.mtd import Downloader
# Prepare the source data to download
names = ['a', 'b', 'c']
urls = ['https://www.baidu.com/img/baidu_resultlogo@2.png',
'https://www.baidu.com/img/baidu_resultlogo@2.png',
'https://www.baidu.com/img/baidu_resultlogo@2.png']
source = list(zip(names, urls))
# Download them!
dl = Downloader(source)
dl.download(out_folder='test', engine='wget')
"""
[INFO]: 3 urls in total.
[INPUT]: BATCH SIZE: 1
[INFO]: Open threads: 100%|███████████████| 3/3 [00:00<00:00, 3167.90it/s]
[INFO]: Task done.
[INFO]: The task costs 0.3324 sec.
[INFO]: 0 urls failed.
"""
Usage of spider.MSpider
See the Quick Start section above.
Features
- v0.2.5:
  - Added spider templates: one based on spider.MSpider, the other based on mtd.Crawler.
  - Added the argument batch_size to spider.MSpider and mtd.Crawler.
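The changelog does not show how batch_size is passed; assuming it is a constructor keyword argument (an assumption, not confirmed by this page), pre-setting it to skip the interactive BATCH SIZE prompt might look like this:
# Hypothetical: assumes batch_size is a constructor keyword,
# which the v0.2.5 changelog suggests but does not spell out.
mspider = MSpider(spi_func, sources, batch_size=4)
mspider.crawl()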
License
Copyright (c) 2019 tishacy.
Licensed under the MIT License.
File details
Details for the file mspider-0.2.5.tar.gz.
File metadata
- Download URL: mspider-0.2.5.tar.gz
- Upload date:
- Size: 9.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.7
File hashes
Algorithm | Hash digest
--- | ---
SHA256 | 5d9881caf957580668afe86470f18758426da04a3f7a1fce83db76b8a75f373e
MD5 | 277bc54299da484088bff7aebfa7b3ae
BLAKE2b-256 | ce3fdd5e8d70b571d589f3efc79351053bef0db82ac46afe099926ede188fadf
File details
Details for the file mspider-0.2.5-py3-none-any.whl.
File metadata
- Download URL: mspider-0.2.5-py3-none-any.whl
- Upload date:
- Size: 12.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.7
File hashes
Algorithm | Hash digest
--- | ---
SHA256 | b1a5a50d5cb1ff76b7b9aba9af39c163027c9ca1c53e9d6d1bdace3e40d1d7e6
MD5 | 45c61b8a8daa50213667c4a2a4b9b04c
BLAKE2b-256 | 24db7601857b3071abb7f8b3fd2d21610f55480469c0b59bad25acebbc278390