
A multiprocessing web crawling and web scraping framework.


MultiprocessingSpider

[Simplified Chinese version]

Description

MultiprocessingSpider is a simple and easy-to-use web crawling and web scraping framework.

Architecture

(architecture diagram)

Dependencies

  • requests

Installation

pip install MultiprocessingSpider

Basic Usage

MultiprocessingSpider

from MultiprocessingSpider.spiders import MultiprocessingSpider
from MultiprocessingSpider.packages import TaskPackage, ResultPackage


class MyResultPackage(ResultPackage):
    def __init__(self, prop1, prop2, sleep=True):
        super().__init__(sleep)
        self.prop1 = prop1
        self.prop2 = prop2


class MySpider(MultiprocessingSpider):
    start_urls = ['https://www.a.com/page1']

    proxies = [
        {"http": "http://111.111.111.111:80"},
        {"http": "http://123.123.123.123:8080"}
    ]

    def router(self, url):
        return self.parse

    def parse(self, response):
        # Parse task packages or new page urls from "response"
        ...
        # Yield a task package
        yield TaskPackage('https://www.a.com/task1')
        ...
        # Yield a url or a url list
        yield 'https://www.a.com/page2'
        ...
        yield ['https://www.a.com/page3', 'https://www.a.com/page4']

    @classmethod
    def subprocess_handler(cls, package, sleep_time, timeout, retry):
        url = package.url
        # Request "url" and parse data
        ...
        # Return result package
        return MyResultPackage('value1', 'value2')

    @staticmethod
    def process_result_package(package):
        # Process the result package; return None to discard it
        if package.prop1 == 'value1':
            return package
        return None


if __name__ == '__main__':
    s = MySpider()

    # Start the spider
    s.start()

    # Block current process
    s.join()

    # Export results to csv file
    s.to_csv('result.csv')

    # Export results to json file
    s.to_json('result.json')
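The contract between a parse method and the scheduler can be mimicked without the framework. Below is a minimal standalone sketch, assuming the behavior shown above: `parse` may yield a task package, a single url, or a url list. The `namedtuple` stand-in and the `drain` helper are illustrative inventions, not framework APIs.

```python
from collections import namedtuple

# Stand-in for MultiprocessingSpider.packages.TaskPackage (illustrative only)
TaskPackage = namedtuple('TaskPackage', ['url'])


def parse(response):
    # A parse method may yield task packages, single urls, or url lists
    yield TaskPackage('https://www.a.com/task1')
    yield 'https://www.a.com/page2'
    yield ['https://www.a.com/page3', 'https://www.a.com/page4']


def drain(gen):
    # Split the yielded items into tasks and new urls, the way a
    # scheduler would before dispatching work to subprocesses
    tasks, urls = [], []
    for item in gen:
        if isinstance(item, TaskPackage):
            tasks.append(item)
        elif isinstance(item, str):
            urls.append(item)
        else:  # a list of urls
            urls.extend(item)
    return tasks, urls


tasks, urls = drain(parse(None))
```

Here `tasks` holds the one task package and `urls` collects the three new page urls, both ready for scheduling.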

FileSpider

from MultiprocessingSpider.spiders import FileSpider
from MultiprocessingSpider.packages import FilePackage


class MySpider(FileSpider):
    start_urls = ['https://www.a.com/page1']

    stream = True

    buffer_size = 1024

    overwrite = False

    def router(self, url):
        return self.parse

    def parse(self, response):
        # Parse file packages or new page urls from "response"
        ...
        # Yield a file package
        yield FilePackage('https://www.a.com/file.png', 'file.png')
        ...
        # Yield a new url or a url list
        yield 'https://www.a.com/page2'
        ...
        yield ['https://www.a.com/page3', 'https://www.a.com/page4']


if __name__ == '__main__':
    s = MySpider()

    # Add a url
    s.add_url('https://www.a.com/page5')

    # Start the spider
    s.start()

    # Block current process
    s.join()
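The `stream`, `buffer_size` and `overwrite` options above control how file bodies are written to disk. A minimal sketch of that behavior, using a stdlib file-like object in place of an HTTP response (the `save_stream` helper is hypothetical, not part of the library):

```python
import os


def save_stream(source, path, buffer_size=1024, overwrite=False):
    """Copy a readable byte stream to "path" in buffer_size chunks."""
    # Skip the write when the file exists and overwriting is disabled
    if os.path.exists(path) and not overwrite:
        return False
    with open(path, 'wb') as f:
        while True:
            chunk = source.read(buffer_size)  # read at most buffer_size bytes
            if not chunk:
                break
            f.write(chunk)
    return True
```

Streaming in fixed-size chunks keeps memory usage flat regardless of file size, which is why a downloader exposes `buffer_size` rather than reading whole responses at once.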

FileDownloader

from MultiprocessingSpider.spiders import FileDownloader


if __name__ == '__main__':
    d = FileDownloader()

    # Start the downloader
    d.start()

    # Add a file
    d.add_file('https://www.a.com/file.png', 'file.png')

    # Block current process
    d.join()

License

GPLv3.0
This is a free library; anyone is welcome to modify it :)

Release Note

v1.1.1

Bug Fixes

  • Fix "start_urls" being ignored.

v1.1.0

Features

  • Add overwrite option for "FileSpider".
  • Add a routing system. After overriding the "router" method, you can yield a single url or a url list from your parse method.
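The routing system lets one spider dispatch different pages to different parse methods. A hypothetical sketch (the url patterns and method names are invented for illustration):

```python
class MySpider:
    def router(self, url):
        # Pick a parse method based on the url shape
        if '/list/' in url:
            return self.parse_list
        return self.parse_detail

    def parse_list(self, response):
        # Would parse a listing page for more urls
        return 'list'

    def parse_detail(self, response):
        # Would parse a detail page into a result package
        return 'detail'
```

Returning a bound method from `router` keeps the dispatch logic in one place instead of branching inside a single monolithic `parse`.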

Bug Fixes

  • Fix retry message display error.

Refactor

  • Optimize setter methods; values are now normalized, so spider.sleep_time = ' 5' works.
  • Requests are no longer resent when "status_code" is not between 200 and 300.
a) MultiprocessingSpider
  • Rename property "handled_url_table" to "handled_urls".
  • Remove "parse" method, add "example_parse_method".
  • "User-Agent" in "web_headers" is now generated randomly.
  • Change url_table parsing order, current rule: "FIFP" (first in first parse).
b) FileDownloader
  • Remove "add_files" method.

v1.0.0

  • The first version.

Files for MultiprocessingSpider, version 1.1.1

  • MultiprocessingSpider-1.1.1-py3-none-any.whl (24.9 kB, Wheel, py3)
  • MultiprocessingSpider-1.1.1.tar.gz (23.7 kB, Source)
