Make your spider multi-threaded.


MSpider

A multi-threaded spider wrapper that makes your spider multi-threaded with little effort, helping you crawl websites faster.

Note that MSpider supports Python 3 only.

Install

MSpider can be installed with pip:
```
pip install mspider
```

Quick Start

Automatically create a `MSpider`

In a terminal or cmd, cd to the folder where you'd like to create the spider, then run `genspider -b <template based> <your spider name>`, for example:
```
$ genspider -b MSpider test
```
Here `-b` selects the template your spider is based on, either 'MSpider' (the default if not given) or 'Crawler', and `test` is the spider name.

If you see the following message, a file test.py containing a MSpider has been created successfully.
```
create a spider named test.
```
Open the spider file test.py. Find `self.source = []` on line 8 (or line 15 if your spider template is 'Crawler'), and replace it with the sources (usually a list of URLs) you'd like the spider to handle, such as:
```
self.source = ['http://www.github.com',
               'http://www.baidu.com']
```
Each element of `self.source` is called a `src_item`, and its position in the list is called its `index`.
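
To make the two terms concrete, here is a small illustration of how `index` and `src_item` pair up for the source list above. It is only a sketch and assumes indices start at 0, as Python list positions do; it does not use mspider itself.

```python
# Illustrative only: pair each source item with its position in the list.
source = ['http://www.github.com',
          'http://www.baidu.com']

# enumerate() yields (index, src_item) tuples, starting from index 0.
pairs = list(enumerate(source))

for index, src_item in pairs:
    print(index, src_item)
```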

Find the function `basic_func`, where you define what your spider does with each source item, such as:
```
def basic_func(self, index, src_item):
    url = src_item
    res = self.sess.get(url)
    html = res.content.decode('utf-8')
    # deal with the html
    # save the extracted information
```
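
As one possible way to fill in the `# deal with the html` step, the sketch below extracts the page title using only the standard library. `TitleParser` and `extract_title` are hypothetical helper names introduced here for illustration; they are not part of mspider.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first title has been read
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return ''.join(parser.title_parts).strip()


# Example with a literal HTML string, so no network access is needed.
sample_html = '<html><head><title> Example page </title></head><body>hi</body></html>'
title = extract_title(sample_html)
print(title)
```

Inside `basic_func`, a helper like this could be called on the decoded `html` string before saving the result.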

Run the spider to start crawling.
```
$ python3 test.py
```

Enter the number of source items handled by each thread (BATCH SIZE) in the terminal or cmd and press Enter. MSpider will then crawl your sources in a multi-threaded manner.
```
[INFO]: MSpider is ready.
[INFO]: 2 urls in total.
[INPUT]: BATCH SIZE: 1
[INFO]: Open threads: 100%|████████████████| 2/2 [00:00<00:00, 356.36it/s]
[INFO]: Task done.
[INFO]: The task costs 1.1157 sec.
[INFO]: 0 urls failed.
```
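
The log above (2 URLs, batch size 1, 2 threads opened) suggests that the sources are split into batches and each batch gets its own thread. The following is a rough sketch of that idea using only the standard library. It is an inference from the output, not mspider's actual implementation, and `split_into_batches`, `run_in_threads` and `record` are hypothetical names.

```python
import threading


def split_into_batches(items, batch_size):
    """Split items into lists of (index, src_item) pairs, batch_size per list."""
    indexed = list(enumerate(items))
    return [indexed[i:i + batch_size]
            for i in range(0, len(indexed), batch_size)]


def run_in_threads(func, items, batch_size):
    """Run func(index, src_item) for every item, one thread per batch."""
    def worker(batch):
        for index, src_item in batch:
            func(index, src_item)

    threads = [threading.Thread(target=worker, args=(batch,))
               for batch in split_into_batches(items, batch_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every batch has been processed
    return len(threads)


# Stand-in for a spider function: records its input instead of fetching it.
results = {}
lock = threading.Lock()


def record(index, src_item):
    with lock:  # protect the shared dict across threads
        results[index] = src_item.upper()


urls = ['http://www.github.com', 'http://www.baidu.com']
n_threads = run_in_threads(record, urls, batch_size=1)
print(n_threads, results)
```

Under this reading, a larger BATCH SIZE means fewer threads, each handling more source items.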

Manually create a `MSpider`

Import MSpider.
```
from mspider.spider import MSpider
```
Define the function for your single-threaded spider.

Note that this function must have two parameters.
- index: the index of the source item
- src_item: the source item you are going to deal with in this function, which is usually a URL or anything else you need to process, such as a tuple like (name, url).
```
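def spi_func(index, src_item):
    # Each source item here is a (name, url) tuple.
    name, url = src_item
    # `mspider` is the MSpider instance created in the next step;
    # it must exist before this function is first called.
    res = mspider.sess.get(url)
    html = res.content.decode('utf-8')
    # deal with the html
    # save the extracted information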
```

Now comes the key part. Create an instance of MSpider and pass it your spider function and the sources you'd like to crawl.
```
sources = [('github', 'http://www.github.com'),
           ('baidu', 'http://www.baidu.com')]
mspider = MSpider(spi_func, sources)
```

Start crawling!
```
mspider.crawl()
```

You will then see the following information in your terminal or cmd. Enter the BATCH SIZE, and MSpider will crawl your sources in a multi-threaded manner.
```
[INFO]: MSpider is ready.
[INFO]: 2 urls in total.
[INPUT]: BATCH SIZE: 1
[INFO]: Open threads: 100%|████████████████| 2/2 [00:00<00:00, 356.36it/s]
[INFO]: Task done.
[INFO]: The task costs 1.1157 sec.
[INFO]: 0 urls failed.
```

Usages

The mspider package has three main modules: pp, mtd and spider.

- pp has the class ProxyPool, which helps you get a proxy IP pool from xici free IPs.

  Note that few of the free IPs actually work, so try not to use this module. If you'd like to use proxy IPs for your spider, its code may be a helpful reference for writing your own proxy pool.
- mtd has two classes, Crawler and Downloader.
  - Crawler helps you make your spider multi-threaded.
  - Downloader helps you download things in multiple threads, as long as you pass it your URLs in the form of list(zip(names, urls)).
- spider has the class MSpider, which uses the Crawler in module mtd with some basic configurations already set, so it is an easier way to turn your spider into a multi-threaded spider.

Usage of `pp.ProxyPool`

```
from mspider.pp import ProxyPool

pool = ProxyPool()

# Once an instance of ProxyPool is initialized,
# it has an attribute named ip_list, which
# holds the IPs crawled from xici free IPs.
print(pool.ip_list)
"""
{'http': ['HTTP://211.162.70.229:3128',
          'HTTP://124.207.82.166:8008',
          'HTTP://121.69.37.6:9797',
          'HTTP://1.196.160.94:9999',
          'HTTP://59.44.247.194:9797',
          'HTTP://14.146.92.72:9797',
          'HTTP://223.166.247.206:9000',
          'HTTP://182.111.129.37:53281',
          'HTTP://58.243.50.184:53281',
          'HTTP://218.28.58.150:53281'],
 'https': ['HTTPS://113.140.1.82:53281',
           'HTTPS://14.23.58.58:443',
           'HTTPS://122.136.212.132:53281']}
"""
# Randomly choose an IP
protocol = "http" # or "https"
ip = pool.random_choose_ip(protocol)
print(ip)
"""
'HTTP://59.44.247.194:9797'
"""

# Update the IP list
pool.get_ip_list()
pool.check_all_ip()

# Request a URL through a proxy using 'GET'
url = "http://www.google.com"
res = pool.open_url(url)
print(res.status_code)
"""
200
"""

# Request a URL through a proxy using 'POST'
url = "http://www.google.com"
data = {'key':'value'}
res = pool.post(url, data)
print(res.status_code)
"""
200
"""
```
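
If you decide to write your own proxy pool, the selection step can be quite small. The sketch below works on a dictionary shaped like the `ip_list` shown above and builds a protocol-to-proxy mapping, which is the shape the requests library accepts for its `proxies` argument. `pick_proxy` and `build_proxies` are hypothetical helpers, not part of mspider, and the sketch does not check whether a proxy actually works.

```python
import random

# A shortened copy of the ip_list structure printed above.
ip_list = {'http': ['HTTP://211.162.70.229:3128',
                    'HTTP://124.207.82.166:8008'],
           'https': ['HTTPS://113.140.1.82:53281']}


def pick_proxy(ip_list, protocol):
    """Return a random proxy for the protocol, or None if none are known."""
    candidates = ip_list.get(protocol, [])
    if not candidates:
        return None
    return random.choice(candidates)


def build_proxies(ip_list, protocol):
    """Return a {protocol: proxy} mapping, or an empty dict if no proxy exists."""
    ip = pick_proxy(ip_list, protocol)
    return {protocol: ip} if ip else {}


print(build_proxies(ip_list, 'https'))
```

Returning an empty mapping when no proxy is available lets the caller fall back to a direct request without special-casing.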

Usage of `mtd.Downloader`

```
from mspider.mtd import Downloader

# Prepare the source data to download
names = ['a', 'b', 'c']
urls = ['https://www.baidu.com/img/baidu_resultlogo@2.png',
        'https://www.baidu.com/img/baidu_resultlogo@2.png',
        'https://www.baidu.com/img/baidu_resultlogo@2.png']
source = list(zip(names, urls))

# Download them!
dl = Downloader(source)
dl.download(out_folder='test', engine='wget')
"""
[INFO]: 3 urls in total.
[INPUT]: BATCH SIZE: 1
[INFO]: Open threads: 100%|███████████████| 3/3 [00:00<00:00, 3167.90it/s]
[INFO]: Task done.
[INFO]: The task costs 0.3324 sec.
[INFO]: 0 urls failed.
"""
```

Usage of `spider.MSpider`

See the Quick Start section above.

Feature

v0.2.5:
- Add spider templates. One is based on spider.MSpider, the other is based on mtd.Crawler.
- Add the argument batch_size to spider.MSpider and mtd.Crawler.

License

Licensed under the MIT License.

Project details


Release history

- 0.2.5 (this version): Apr 7, 2019
- 0.2.3: Mar 25, 2019
- 0.2.2: Mar 22, 2019
- 0.2.1: Mar 22, 2019

Download files

Source Distribution

- mspider-0.2.5.tar.gz (9.1 kB), uploaded Apr 7, 2019, Source

Built Distribution

- mspider-0.2.5-py3-none-any.whl (12.0 kB), uploaded Apr 7, 2019, Python 3

File details

Details for the file mspider-0.2.5.tar.gz.

File metadata

Download URL: mspider-0.2.5.tar.gz
Upload date: Apr 7, 2019
Size: 9.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.7

File hashes

Hashes for mspider-0.2.5.tar.gz
Algorithm	Hash digest
SHA256	`5d9881caf957580668afe86470f18758426da04a3f7a1fce83db76b8a75f373e`
MD5	`277bc54299da484088bff7aebfa7b3ae`
BLAKE2b-256	`ce3fdd5e8d70b571d589f3efc79351053bef0db82ac46afe099926ede188fadf`


File details

Details for the file mspider-0.2.5-py3-none-any.whl.

File metadata

Download URL: mspider-0.2.5-py3-none-any.whl
Upload date: Apr 7, 2019
Size: 12.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.7

File hashes

Hashes for mspider-0.2.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b1a5a50d5cb1ff76b7b9aba9af39c163027c9ca1c53e9d6d1bdace3e40d1d7e6`
MD5	`45c61b8a8daa50213667c4a2a4b9b04c`
BLAKE2b-256	`24db7601857b3071abb7f8b3fd2d21610f55480469c0b59bad25acebbc278390`

