
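A web crawler written in Python, adapted from the Java crawler webmagic. It is built from four functional modules: downloader, storage, processor, and schedular. With it you can quickly write a custom crawler of your own.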

Project description

# pcrawler crawler

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/mumupy/pcrawler/blob/master/LICENSE) [![Build Status](https://travis-ci.org/mumupy/pcrawler.svg?branch=master)](https://travis-ci.org/mumupy/pcrawler) [![codecov](https://codecov.io/gh/mumupy/pcrawler/branch/master/graph/badge.svg)](https://codecov.io/gh/mumupy/pcrawler) [![pypi](https://img.shields.io/pypi/v/pcrawler.svg)](https://pypi.python.org/pypi/pcrawler) [![Documentation Status](https://readthedocs.org/projects/pcrawler/badge/?version=latest)](https://pcrawler.readthedocs.io/en/latest/?badge=latest)

*pcrawler is a crawler written in Python that makes it quick and convenient to build your own crawler. It is made up of four components (downloader, schedular, processor, and storage), and each component can be extended easily.*

## Features

- A simple API that is quick to pick up
- A modular structure that is easy to extend
- Support for multithreaded and distributed crawling

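## Architecture

pcrawler is made up of four components: downloader, schedular, processor, and storage.

- processor: the page processor, which analyzes fetched pages. The bundled processors currently cover image downloads, multimedia video downloads, and Sina News.
- schedular: the URL management component. It manages the queue of URLs waiting to be crawled and de-duplicates URLs that have already been crawled. Queue management currently supports a file cache and an in-memory set; de-duplication supports a file cache, a set, and a Bloom filter (bloomFilter).
- downloader: the download component, which uses urllib2 by default.
- storage: the storage component, which supports several file formats (csv, json, avro, video).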
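How the four components fit together can be sketched as follows. This is a minimal illustration under stated assumptions, not pcrawler's actual API: the names `SetScheduler`, `LinkProcessor`, `JsonLinesStorage`, and `crawl` are hypothetical, the downloader is passed in as a plain callable so the example runs without network access, and the code targets Python 3 (whereas pcrawler's default downloader is urllib2).

```python
from collections import deque
import json
import re


class SetScheduler:
    """Hypothetical scheduler: in-memory URL queue with set-based de-duplication."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, url):
        # Enqueue a URL only the first time it is seen.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def pop(self):
        # Return the next URL, or None when the queue is empty.
        return self._queue.popleft() if self._queue else None


class LinkProcessor:
    """Hypothetical processor: extracts the page title and absolute links."""

    def process(self, url, html):
        title = re.search(r"<title>(.*?)</title>", html, re.S)
        links = re.findall(r'href="(http[^"]+)"', html)
        item = {"url": url, "title": title.group(1).strip() if title else ""}
        return item, links


class JsonLinesStorage:
    """Hypothetical storage: keeps each item as one JSON line in memory."""

    def __init__(self):
        self.lines = []

    def save(self, item):
        self.lines.append(json.dumps(item, ensure_ascii=False))


def crawl(start_url, downloader, scheduler, processor, storage, max_pages=10):
    """Run the loop: scheduler -> downloader -> processor -> storage."""
    scheduler.push(start_url)
    fetched = 0
    while fetched < max_pages:
        url = scheduler.pop()
        if url is None:
            break
        html = downloader(url)  # any callable returning HTML text or None
        fetched += 1
        if html is None:
            continue
        item, links = processor.process(url, html)
        storage.save(item)
        for link in links:
            scheduler.push(link)  # already-seen URLs are dropped here
    return fetched


# Stand-in downloader: a dict lookup instead of a real HTTP request.
PAGES = {
    "http://example.com/a": '<title>A</title><a href="http://example.com/b">b</a>',
    "http://example.com/b": '<title>B</title><a href="http://example.com/a">a</a>',
}
storage = JsonLinesStorage()
count = crawl("http://example.com/a", PAGES.get, SetScheduler(), LinkProcessor(), storage)
```

Here the two pages link to each other, so the scheduler's de-duplication is what stops the loop after both have been fetched once. Swapping the set for a file cache or a Bloom filter, or the in-memory storage for a csv writer, would only change the class handed to `crawl`.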

## Related reading

- [webmagic crawler](http://webmagic.io/)
- [Bloom Filter](http://blog.csdn.net/jiaomeng/article/details/1495500)
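For readers unfamiliar with the Bloom filter used for URL de-duplication, the idea can be sketched in a few lines. This is an illustrative sketch, not pcrawler's implementation: the class name `BloomFilter`, its parameters, and the choice of seeded SHA-256 as the hash family are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Hypothetical Bloom filter for URL de-duplication (illustration only)."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by prefixing a seed byte to the item.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if all positions are set: "probably seen". False: "never seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("http://example.com/a")
```

A Bloom filter never reports a crawled URL as new, but it can occasionally report a new URL as already crawled. That trade-off buys a fixed, small memory footprint, which is why it suits crawls too large for an in-memory set.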

## Contact

The views above are purely my own; if you see things differently, corrections are welcome.

- email: <babymm@aliyun.com>
- github: [https://github.com/babymm](https://github.com/babymm)

Download files


Source Distribution

pcrawler-0.0.3.tar.gz (17.6 kB view details)

Uploaded Source

File details

Details for the file pcrawler-0.0.3.tar.gz.

File metadata

  • Download URL: pcrawler-0.0.3.tar.gz
  • Upload date:
  • Size: 17.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: Python-urllib/2.7

File hashes

Hashes for pcrawler-0.0.3.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 3dcfb39d59bdfd7eefdde00cca803c47b0d870e5b995206b3c547bbd892f2531 |
| MD5 | 713958cdb159b17d6b401ae89c7ee072 |
| BLAKE2b-256 | 36f26e60188b051b35bf590fb51aae5cb9ffcda9197c6cb081db976ab93f4759 |

