A multiprocessing web crawling and web scraping framework.
MultiprocessingSpider
Description
MultiprocessingSpider is a simple and easy-to-use web crawling and web scraping framework.
Architecture
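The original architecture diagram is not reproduced here. Conceptually, a main process parses pages and enqueues task packages, and a pool of worker subprocesses handles them in parallel. A minimal sketch of that pattern using only the standard library (all names below are illustrative, not part of the MultiprocessingSpider API):

```python
from multiprocessing import Pool

# Hypothetical task handler standing in for a spider's subprocess_handler
def handle_task(url):
    # A real handler would request "url" and parse the response;
    # here we just derive a fake result from the URL string.
    return {'url': url, 'length': len(url)}

if __name__ == '__main__':
    # Task packages produced by the parsing step (illustrative URLs)
    tasks = ['https://www.a.com/task1', 'https://www.a.com/task2']
    # The main process fans tasks out to a pool of worker subprocesses
    with Pool(processes=2) as pool:
        results = pool.map(handle_task, tasks)
    print(results)
```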
Dependencies
- requests
Installation
```shell
pip install MultiprocessingSpider
```
Basic Usage
MultiprocessingSpider
```python
from MultiprocessingSpider.spiders import MultiprocessingSpider
from MultiprocessingSpider.packages import TaskPackage, ResultPackage


class MyResultPackage(ResultPackage):
    def __init__(self, prop1, prop2, sleep=True):
        super().__init__(sleep)
        self.prop1 = prop1
        self.prop2 = prop2


class MySpider(MultiprocessingSpider):
    start_urls = ['https://www.a.com/page1']

    proxies = [
        {"http": "http://111.111.111.111:80"},
        {"http": "http://123.123.123.123:8080"}
    ]

    def parse(self, response):
        # Parse tasks or new pages from "response"
        ...
        # Yield a task package
        yield TaskPackage('https://www.a.com/task1')
        ...
        # Yield a new web page URL and its parsing method
        yield 'https://www.a.com/page2', self.parse

    @classmethod
    def subprocess_handler(cls, package, sleep_time, timeout, retry):
        url = package.url
        # Request "url" and parse the data
        ...
        # Return a result package
        return MyResultPackage('value1', 'value2')

    @staticmethod
    def process_result_package(package):
        # Process the result package
        if 'value1' == package.prop1:
            return package
        else:
            return None


if __name__ == '__main__':
    s = MySpider()

    # Start the spider
    s.start()

    # Block the current process
    s.join()

    # Export results to a csv file
    s.to_csv('result.csv')

    # Export results to a json file
    s.to_json('result.json')
```
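The `parse` method above elides the actual extraction step. As one hedged illustration of what a `parse` body might do, here is plain link extraction with the standard library (the HTML snippet and the `LinkCollector` helper are made up for this sketch, not part of the framework):

```python
from html.parser import HTMLParser

# Hypothetical helper: collect href attributes from anchor tags
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Example HTML a response body might contain (fabricated for illustration)
html = '<a href="https://www.a.com/task1">t</a><a href="https://www.a.com/page2">p</a>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

Inside `parse`, each extracted link could then be yielded either as a `TaskPackage` or as a `(url, parse_method)` pair.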
FileSpider
```python
from MultiprocessingSpider.spiders import FileSpider
from MultiprocessingSpider.packages import FilePackage


class MySpider(FileSpider):
    start_urls = ['https://www.a.com/page1']

    stream = True

    buffer_size = 1024

    def parse(self, response):
        # Parse tasks or new pages from "response"
        ...
        # Yield a file package
        yield FilePackage('https://www.a.com/file.png', 'file.png')
        ...
        # Yield a new web page URL and its parsing method
        yield 'https://www.a.com/page2', self.parse


if __name__ == '__main__':
    s = MySpider()

    # Add a new page
    s.add_url('https://www.a.com/page3')

    # Start the spider
    s.start()

    # Block the current process
    s.join()
```
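The `stream` and `buffer_size` options suggest chunked transfer: instead of loading an entire file into memory, data is copied in fixed-size buffers. A minimal sketch of that pattern, using an in-memory stream for illustration (`copy_chunked` is a hypothetical name, not a framework function):

```python
from io import BytesIO

def copy_chunked(src, dst, buffer_size=1024):
    """Copy src to dst in buffer_size chunks; return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(buffer_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total

# Simulate a streamed download with an in-memory source
src = BytesIO(b'x' * 3000)
dst = BytesIO()
copied = copy_chunked(src, dst, buffer_size=1024)
print(copied)  # 3000
```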
FileDownloader
```python
from MultiprocessingSpider.spiders import FileDownloader

if __name__ == '__main__':
    d = FileDownloader()

    # Start the downloader
    d.start()

    # Add a file
    d.add_file('https://www.a.com/file.png', 'file.png')

    # Block the current process
    d.join()
```
License
GPLv3.0
This is a free library; interested developers are welcome to modify it :)
Release Note
v1.0.0
- Initial release.
Download files
Source Distribution
Built Distribution
Hashes for MultiprocessingSpider-1.0.0.tar.gz

Algorithm | Hash digest
---|---
SHA256 | f32444a1fb5b6f306080e0b452898aef8f55f5096213ebb83935d92d594876a3
MD5 | e912c81324b0c1edb70fd3d31b2a9ef8
BLAKE2b-256 | 663f4a976cf4f3b68e8135d2fe85ab2bb9e98fdbba9bccbf5e697f53ff384feb
Hashes for MultiprocessingSpider-1.0.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 654689712cfb08bd8ad9019f5c60194b5dcb6d9b23e8237edaac31f98381fdf7
MD5 | 107cd081188e6864882a9fa2b8390ef2
BLAKE2b-256 | d500477120d46fa374f0fe585c9d9a70983eeec9f0a5fda0f2b798cef1772deb