page-parser

Page parsers for web crawlers and spiders

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

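Project overview

Project name: Write a crawler in six lines of code

English name: PageParser

Overview: a web page parsing package for crawlers, designed to maximize code reuse

Goal: let you write a crawler without knowing how to parse web pages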

Installation

pip install page-parser

Minimal example:

import requests
from page_parser import BaiduParser

# 1. Download the page
response = requests.get("https://www.baidu.com/")
html = response.content.decode("utf-8")

# 2. Parse the page
items = BaiduParser.parse_index(html)

# 3. Print the data
for item in items: print(item)
# {'title': '百度一下，你就知道'}

Supported pages

No.	Site	Page	URL
1	Baidu	Home page	https://www.baidu.com/
2	Douban	Movies now showing	https://movie.douban.com/
3	Lagou	Job listings page	https://www.lagou.com/zhaopin/
4	Qichacha	Financing events page	https://www.qichacha.com/elib_financing
5	Xici Proxy	Home page	http://www.xicidaili.com/
6	Xici Proxy	Domestic high-anonymity proxies	http://www.xicidaili.com/nn/
7	Xici Proxy	Domestic ordinary proxies	http://www.xicidaili.com/nt/
8	Xici Proxy	Domestic HTTPS proxies	http://www.xicidaili.com/wn/
9	Xici Proxy	Domestic HTTP proxies	http://www.xicidaili.com/wt/
10	Sogou Search	WeChat official account search page	https://weixin.sogou.com/weixin?type=1&query=百度
11	Jandan	Home page list	http://jandan.net/
12	Jobbole	Python column	http://python.jobbole.com/

Usage example

# -*- coding: utf-8 -*-

import requests
from page_parser import BaiduParser

# 1. Download the page
url = "https://www.baidu.com/"
response = requests.get(url)
response.encoding = response.apparent_encoding

# 2. Parse the page
items = BaiduParser.parse_index(response.text)

# 3. Print the data
for item in items:
    print(item)

# {'title': '百度一下，你就知道'}

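Web crawler workflow:

Page downloader -> page parser -> data storage

Page downloader: mostly a matter of getting past anti-crawling measures. Approaches vary from site to site, and this is where the real difficulty of crawling lies.

Page parser: a page's structure usually stays fixed for a period of time, yet everyone who downloads the page has to extract its content again. This is duplicated work.

Data storage: whether the data goes to a file or a database depends on business needs.

This project factors out the parsing step so that a crawler can focus on downloading pages rather than on repeating the same parsing work.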
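
The three stages above can be sketched as a small pipeline. This is a minimal illustration using only the standard library, not code from page-parser: the names `download`, `store`, and `crawl` and the JSON Lines output format are assumptions made for the example. Any parse function that takes HTML text and yields dicts, such as `BaiduParser.parse_index`, can be passed in as `parse`.

```python
import json
from urllib.request import urlopen


def download(url):
    """Page downloader: fetch a URL and return the decoded HTML text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def store(items, path):
    """Data storage: write each item as one JSON line and return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
            count += 1
    return count


def crawl(url, parse, path, downloader=download):
    """Run page downloader -> page parser -> data storage for one page."""
    html = downloader(url)   # stage 1: download
    items = parse(html)      # stage 2: parse (the part page-parser provides)
    return store(items, path)  # stage 3: store
```

With page-parser installed, a call such as `crawl("https://www.baidu.com/", BaiduParser.parse_index, "baidu.jsonl")` would write one JSON object per parsed item. Swapping `downloader` is where site-specific anti-crawling handling would go.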

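Project notes

This project works alongside Python's requests and scrapy.

To use it from other programming languages, wrap this project with a web framework such as Flask and expose it as a web API.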
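
One way to expose the parsers to other languages is a small HTTP service. The text suggests Flask; the sketch below swaps in the standard library's `wsgiref` so that it runs without extra dependencies. The names `make_app` and `serve`, the request shape (POST the page HTML as the body), and the port are assumptions for illustration, not part of page-parser.

```python
import json
from wsgiref.simple_server import make_server


def make_app(parse):
    """Build a WSGI app that parses POSTed HTML with the given parse function."""

    def app(environ, start_response):
        if environ.get("REQUEST_METHOD") != "POST":
            start_response("405 Method Not Allowed",
                           [("Content-Type", "text/plain")])
            return [b"POST the page HTML as the request body"]
        # Read exactly the declared number of bytes from the request body.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        html = environ["wsgi.input"].read(length).decode("utf-8", errors="replace")
        # Parsers yield dicts, so collect them into a JSON array.
        body = json.dumps(list(parse(html)), ensure_ascii=False).encode("utf-8")
        start_response("200 OK",
                       [("Content-Type", "application/json; charset=utf-8"),
                        ("Content-Length", str(len(body)))])
        return [body]

    return app


def serve(parse, host="127.0.0.1", port=8000):
    """Serve the app until interrupted (this call blocks)."""
    with make_server(host, port, make_app(parse)) as server:
        server.serve_forever()
```

With page-parser installed, `serve(BaiduParser.parse_index)` would start the service. A client in any language can then POST page HTML and read back a JSON array of items.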

Initiator: mouday

Started: 2018-10-13

More people are needed to help maintain the project.

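Contributing code

Put all contributed code in the folder: page_parser

Code example. Unless there is a good reason to do otherwise, follow the format below so that users can call the parser easily.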

baidu_parser.py

# -*- coding: utf-8 -*-

# @Date    : 2018-10-13
# @Author  : Peng Shiyu

from parsel import Selector


class BaiduParser(object):
    """
    Baidu: https://www.baidu.com/
    """

    @staticmethod
    def parse_index(html):
        """
        Parse the home page: https://www.baidu.com/
        2018-10-13 pengshiyuyx@gmai.com
        :param html: {str} page text
        :return: {iterator} extracted items
        """
        sel = Selector(html)
        title = sel.css("title::text").extract_first()
        item = {
            "title": title
        }
        yield item


if __name__ == '__main__':
    import requests
    response = requests.get("https://www.baidu.com/")
    response.encoding = response.apparent_encoding
    items = BaiduParser.parse_index(response.text)
    for item in items:
        print(item)

    # {'title': '百度一下，你就知道'}

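Notes:

Principles:

Create one parser class per website.
Parse methods live in the parser class and must be static methods so that they are easy to call.
Page parsing goes stale as pages change, so every parse method must state the date it was written.

Naming rules:

For example:

File name: baidu_parser
Class name: BaiduParser
Method name: parse_index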
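
The principles and naming rules above could be checked mechanically before a contribution is merged. The helper below is a hypothetical sketch, not part of page-parser: `check_parser_class` and the YYYY-MM-DD date pattern are assumptions based on the conventions listed here.

```python
import inspect
import re

# Assumed date format, matching the "2018-10-13" style used in the example.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_parser_class(cls):
    """Return a list of convention violations for a parser class (empty if none)."""
    problems = []
    # Class name should look like "BaiduParser".
    if not re.fullmatch(r"[A-Z][A-Za-z0-9]*Parser", cls.__name__):
        problems.append(f"class name {cls.__name__!r} should look like 'BaiduParser'")
    parse_methods = [name for name in vars(cls) if name.startswith("parse_")]
    if not parse_methods:
        problems.append("no parse_* methods found")
    for name in parse_methods:
        # getattr_static returns the raw staticmethod object, not the function.
        attr = inspect.getattr_static(cls, name)
        if not isinstance(attr, staticmethod):
            problems.append(f"{name} should be a @staticmethod")
            continue
        doc = attr.__func__.__doc__ or ""
        if not DATE_PATTERN.search(doc):
            problems.append(f"{name} docstring should state a date (YYYY-MM-DD)")
    return problems
```

Calling `check_parser_class(BaiduParser)` on the class shown earlier should return an empty list, since it has a conforming name, a static `parse_index`, and a dated docstring.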

Other

Necessary code comments
Necessary test code
Any other necessary code

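Join us

Basic requirements

Basic Python syntax, object-oriented programming, and iterators/generators (yield)
Libraries to know: requests, parsel, and scrapy (a passing familiarity is enough)
All parsing uses parsel (XPath-based): it is simple, efficient, and integrates seamlessly with scrapy
It is fine if you do not know these well yet. Read the reference articles; anyone willing to learn can pick them up quickly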

Reference articles:

Contact

PageParser QQ group: 932301512

Release history

0.0.4 (this version): Mar 21, 2019
0.0.3: Oct 17, 2018
0.0.2: Oct 15, 2018
0.0.1: Oct 15, 2018

Download files

Source Distribution

page_parser-0.0.4.tar.gz (7.7 kB view details)

Uploaded Mar 21, 2019 Source

Built Distribution

page_parser-0.0.4-py3-none-any.whl (14.8 kB view details)

Uploaded Mar 21, 2019 Python 3

File details

Details for the file page_parser-0.0.4.tar.gz.

File metadata

Download URL: page_parser-0.0.4.tar.gz
Upload date: Mar 21, 2019
Size: 7.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.20.1 setuptools/39.1.0 requests-toolbelt/0.8.0 tqdm/4.23.3 CPython/3.6.5

File hashes

Hashes for page_parser-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`7bffbb1b502f9c0c7a260aefba280a700bea4c95310b05e2271863f9be1a731e`
MD5	`d97c0d866d041b86cfb11928dd6ffed5`
BLAKE2b-256	`46669a32790324fe241c3c4cee6eb9b9e3605ea83bcc312aa56e00958353e182`

File details

Details for the file page_parser-0.0.4-py3-none-any.whl.

File metadata

Download URL: page_parser-0.0.4-py3-none-any.whl
Upload date: Mar 21, 2019
Size: 14.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.20.1 setuptools/39.1.0 requests-toolbelt/0.8.0 tqdm/4.23.3 CPython/3.6.5

File hashes

Hashes for page_parser-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0679f154273e3f71773071c9651df2e3f07d999859c8a4eb8c492dcdbf897d48`
MD5	`8bc631693a0ad573083a2073e67940bc`
BLAKE2b-256	`5cec9640dfcbb0440a7bf94deb317615800768d57b2b27bb3fa9d0e8353296cb`
