Skip to main content

Make your spider multi-threaded.

Project description

MSpider

A Multi-threaded Spider wrapper that could make your spider multi-threaded easily, helping you crawl website faster. :zap:

Note that this is for python3 only.

Install

pip install mspider

Quick Start

Automatically create a MSpider

  1. cd to the folder you’d like to create a MSpider in terminal or cmd, then type genspider <your spider name>, such as:

    $ genspider test
    

    A file test.py that contains a MSpider is created successfully if seeing the following information.

    create a spider named test.
    
  2. Open the spider file test.py. Find self.source = [] in line 14, and replacing it by the sources (usually a list of urls) you’d like to handle by the spider, such as:

    self.source = ['http://www.github.com',
                   'http://www.baidu.com']
    

    Each element of the self.source is called src_item, the index of src_item is called index.

  3. Find the funciont basic_func, where you could define your spider funcion, such as:

    def basic_func(self, index, src_item):
        url = src_item
        res = self.pool.open_url(url)
        html = res.content.decode('utf-8')
        # deal with the html
        # save the extracted information
    
  4. Run the spider to start crawling.

    $ python3 test.py
    

    You just input the number of source items handled by each thread (BATCH SIZE) in the terminal or cmd, then return it, and the MSpider will crawl your sources in a multi-threaded manner.

    [INFO]: MSpider is ready.
    [INFO]: 2 urls in total.
    [INPUT]: BATCH SIZE: 1
    [INFO]: Open threads: 100%|████████████████| 2/2 [00:00<00:00, 356.36it/s]
    [INFO]: Task done.
    [INFO]: The task costs 1.1157 sec.
    [INFO]: 0 urls failed.
    

Mannually create a MSpider

  1. Standard import the MSpider.

    from mspider.spider import MSpider
    
  2. Define the function of your single threaded spider.

    Note that this function must has two parameters.

    • index: the index of source item
    • src_item: the source item you are going to deal with in this function, which is usually an url or anything you need to process, such as a tuple like (name, url).
    def spi_func(index, src_item):
        name, url = src_item
        res = mspider.pool.open_url(url)
        html = res.content.decode('utf-8')
        # deal with the html
        # save the extracted information
    

    mspider.pool is an instance of mspider.pp.ProxyPool, see "Usages of pp.ProxyPool" in Usages part for more information.

  3. Now comes the key part. Create an instance of MSpider and pass it your spider function and sources you’d crawl.

    sources = [('github', 'http://www.github.com'),
               ('baidu', 'http://www.baidu.com')]
    mspider = MSpider(spi_func, sources)
    
  4. Start to crawl!

    mspider.crawl()
    

    Then you will see the following information in your terminal or cmd. You just input the BATCH SIZE, and the MSpider will crawl your sources in a multi-threaded manner.

    [INFO]: MSpider is ready.
    [INFO]: 2 urls in total.
    [INPUT]: BATCH SIZE: 1
    [INFO]: Open threads: 100%|████████████████| 2/2 [00:00<00:00, 356.36it/s]
    [INFO]: Task done.
    [INFO]: The task costs 1.1157 sec.
    [INFO]: 0 urls failed.
    

Usages

The mspider package has three main module pp mtd and spider

  • pp has the class of ProxyPool, which helps you get the proxy IP pool from xici free IPs.
  • mtd has two classes, Crawler and Downloader
    • Crawler helps you make your spider multi-threaded.
    • Downloader helps you download things multi-threadedly as long as you pass your urls in the form of list(zip(names, urls)) in it.
  • spider has the class of MSpider, which uses the Crawler in module mtd, and has some basic configurations of Crawler, so this is a easier way to turn your spider into a multi-threaded spider.

Usage of pp.ProxyPool

from mspider.pp import ProxyPool

pool = ProxyPool()

# Once an instance of ProxyPool is initialized,
# it will has an attribute named ip_list, which
# has a list of IPs crawled from xici free IPs.
print(pool.ip_list)
"""
['HTTP://58.249.55.222:9797', 'HTTPS://113.54.152.170:8080', 'HTTP://180.140.191.233:36820', 'HTTP://163.125.69.145:8888', 'HTTP://14.115.107.83:808', 'HTTP://182.111.129.37:53281', 'HTTPS://202.112.237.102:3128', 'HTTPS://163.125.252.109:9797', ... , 'HTTPS://120.24.43.177:8080', 'HTTP://113.116.144.124:9000', 'HTTP://114.249.118.17:9000']
"""
# Randomly choose an IP
ip = pool.random_choose_ip()
print(ip)
"""
'HTTP://182.111.129.37:53281'
"""

# Update the IP list
pool.get_ip_list()

# Request an url using proxy by 'GET'
url = "http://www.google.com"
res = pool.open_url(url)
print(res.status_code)
"""
200
"""

# Request an url using post by 'POST'
url = "http://www.google.com"
data = {'key':'value'}
res = pool.post(url, data)
print(res.status_code)
"""
200
"""

Usage of mtd.Downloader

from mspider.mtd import Downloader

# Prepare source data that need download
names = ['a', 'b', 'c']
urls = ['https://www.baidu.com/img/baidu_resultlogo@2.png',
        'https://www.baidu.com/img/baidu_resultlogo@2.png',
        'https://www.baidu.com/img/baidu_resultlogo@2.png']
source = list(zip(names, urls))

# Download them!
dl = Downloader(source)
dl.download(out_folder='test', engine='wget')
"""
[INFO]: 3 urls in total.
[INPUT]: BATCH SIZE: 1
[INFO]: Open threads: 100%|███████████████| 3/3 [00:00<00:00, 3167.90it/s]
[INFO]: Task done.
[INFO]: The task costs 0.3324 sec.
[INFO]: 0 urls failed.
"""

Usage of spider.MSpider

See this in Quick Start.

License

Copyright (c) 2019 tishacy.

Licensed under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mspider-0.2.1.tar.gz (7.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mspider-0.2.1-py3-none-any.whl (10.1 kB view details)

Uploaded Python 3

File details

Details for the file mspider-0.2.1.tar.gz.

File metadata

  • Download URL: mspider-0.2.1.tar.gz
  • Upload date:
  • Size: 7.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.18.4 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.30.0 CPython/3.6.7

File hashes

Hashes for mspider-0.2.1.tar.gz
Algorithm Hash digest
SHA256 1fed767401ac21ac342c35a55d3d68849acd00e7776e107f250c8393405a6c1f
MD5 d00f0b9e409afa9d4211dd34895d8a1f
BLAKE2b-256 fb0614c7f20f3272eceb923434dce92215538fcab2700b85625a438dec5209c0

See more details on using hashes here.

File details

Details for the file mspider-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: mspider-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 10.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.18.4 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.30.0 CPython/3.6.7

File hashes

Hashes for mspider-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 eb80064a3894d3a1dcaf8f4f383abdd4d220f081bd48b9b2d17d762fb7754df5
MD5 c4e8b2f74de4135701042f394f2ddc09
BLAKE2b-256 cb6122dcfae91c8e388d0167d0267b923ad3c095137aee4dbe00a8d06c5d4ceb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page