A generic HTML page parser

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

HtmlPageParser

通用HTML页面内容解析器

安装

pip install HtmlPageParser

使用示例

# 爬取页面内容
from HtmlPageParser.src.parser import Parser
with open("test.html", "r", encoding="utf-8") as f:
    html = f.read()
client = Parser(base_url="https://www.163.com/")
xpath_css = [
    {".//p[@class='f_center']": ".f_center"},
    {".//p[@class='f_center']": ".f_center"},
    {".//p[@class='f_center']": ".f_center"},
    {".//p[@class='f_center']": ".f_center"},
    {".//p[@class='f_center']": ".f_center"}
]
data = client.parser(html, css_selector="#content > div.post_body", xpath_css=xpath_css)
print(data)

# json结构数据转为markdown格式
import json
from HtmlPageParser.src.json2markdown import Json2Markdown
with open("json_data.json", 'r', encoding='utf-8') as f:
    json_data = json.load(f)
J2M = Json2Markdown()
markdown_data = J2M.json2markdown(json_data)
print(markdown_data)

参数	参数说明
base_url	爬取页面的域名，如爬取的页面是https://www.163.com/dy/article/I35SM5AN0514EGPO.html，那base_url就是https://www.163.com/
html	该页面的html元素，即requests.get()返回的response.text
css_selector	需要爬取的元素的上一级标签的css_selector,右键检查选中复制selector即可
xpath_css	xpath和css的映射字典，如果需要爬取的页面内部还有需要继续深入爬取的标签，则需要配置需要深入爬取标签上一级的xpath和css的映射字典，具体示例如下
a_attr	需要抓取a标签中的属性，默认为herf
img_attr	需要抓取img标签中的属性，默认为src
video_attr	需要抓取video标签中的属性，默认为src

配置示例

如下述元素块，例如我想要抓取div[class='post_body']下面的所有标签，那css_selector就是<div class="post_body">这个标签位置的css_selector; 如果只配置了上面的css_selector，那只能抓到<div class="post_body">标签下一层的标签内容，不能抓到该标签下一层标签里面的标签内容，这个时候需要配置xpath_css参数，即如下所示，如果我想继续深入抓取<p class="f_center">标签下面的img标签，那我需要写<p class="f_center">这层标签的xpath和css字典，下述元素中有两个<p class="f_center">标签，所以需要按顺序写两个映射字典，格式如下：

xpath_css = [
    {".//p[@class='f_center']": ".f_center"},
    {".//p[@class='f_center']": ".f_center"}
]

    <div class="post_top">
        <div class="post_body">
            <p id="1O5D2DRS">
                <video src="http://flv0.bn.netease.com/6ac0c4.mp4" data-video="http://flv0.bn.netease.com/6ac0c40c71faab9.jpg">
                当地时间4月24日，联合国秘书长古特雷斯与俄外交部长拉夫罗夫举行了会面，双方就乌克兰局势、阿富汗、叙利亚等方面的问题进行了讨论。
            </p>
            <p class="f_center">
                <img src="https://nimg.ws.126.net/?url=http%3A%2F%2Fdingyue.00ne00esc.jpg">
                <br>
            </p>
            <p id="1O5G5ICJ">视频截图</p>
            <p id="1O556NEP">古特雷斯还向拉夫罗夫提交了一封致俄总统普京的信，概述了旨在改进、延长和扩大黑海粮食协议的方向。
            </p>
            <p id="1O556NEQ">报道称古特雷斯已向该协议的另外两个签署方乌克兰、土耳其，发送了类似函件。
            </p>
            <p id="1O556NER">此外，古特雷斯还向拉夫罗夫介绍了秘书处在解决俄罗斯官员签证问题上所做的最新努力。</p>
            <p class="f_center">
                <img src="https://nimg.ws.126.net/?url=http%3A%2F%2Fdingyue.000hp00ajc.jpg">
                <br>
            </p>
        </div>
    </div>

解析结果格式

下述结果不是由上面的html元素解析而来，只是例举出多种标签的结构

[
    {
        "type": "p",
        "context": "Imprimir",
        "link": [
            {
                "start": 0,
                "end": 8,
                "origin_url": "https://www.minsalud.gob.bo/1089-sorata-cumple-con-la-implementacion-de-la-politica-sanitaria-safci-encaminada-por-el-ministerio-de-salud?tmpl=component&print=1&layout=default",
                "url": "https://www.minsalud.gob.bo/1089-sorata-cumple-con-la-implementacion-de-la-politica-sanitaria-safci-encaminada-por-el-ministerio-de-salud?tmpl=component&print=1&layout=default"
            }
        ]
    },
    {
        "type": "img",
        "context": "",
        "link": [
            {
                "origin_url": "https://www.minsalud.gob.bo/images/noticias16/sorata2.gif",
                "url": "Bolivia_Regulation/Files//8b36d2ab18f5fed475dffa42b7e0bbe7."
            }
        ]
    },
    {
        "type": "h1",
        "context": "La Paz – Viernes 6 de Mayo de 2016 | Unidad de Comunicación",
        "link": []
    },
    {
        "type": "h2",
        "context": "Otras peticiones fundamentales fueron la compra de ambulancias",
        "link": []
    },
    {
        "type": "table",
        "link": [],
        "context": [
            {
                "type": "tr",
                "link": [],
                "context": [
                    {
                        "type": "th",
                        "context": "配信動画",
                        "link": []
                    },
                    {
                        "type": "th",
                        "context": "配信日",
                        "link": []
                    }
                ]
            },
            {
                "type": "tr",
                "link": [],
                "context": [
                    {
                        "type": "td",
                        "context": "これでわかる!適合性調査における再審査等申請から日程調整までの手続き -資料作成のポイント-",
                        "link": []
                    },
                    {
                        "type": "td",
                        "context": "2022年11月15日",
                        "link": []
                    }
                ]
            },
            {
                "type": "tr",
                "link": [],
                "context": [
                    {
                        "type": "td",
                        "context": "再審査適合性調査等における解析用データセットの活用について",
                        "link": []
                    },
                    {
                        "type": "td",
                        "context": "2022年11月15日",
                        "link": []
                    }
                ]
            }
        ]
    },

Project details

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

This version

0.0.4

Apr 28, 2023

0.0.3

Apr 26, 2023

0.0.2

Apr 26, 2023

0.0.1

Apr 26, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

HtmlPageParser-0.0.4.tar.gz (9.1 kB view details)

Uploaded Apr 28, 2023 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

HtmlPageParser-0.0.4-py3-none-any.whl (9.4 kB view details)

Uploaded Apr 28, 2023 Python 3

File details

Details for the file HtmlPageParser-0.0.4.tar.gz.

File metadata

Download URL: HtmlPageParser-0.0.4.tar.gz
Upload date: Apr 28, 2023
Size: 9.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for HtmlPageParser-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`234e397364ad9eef98a56bdb16b4b69b9a7d971faa07233f2715d2509d679ff7`
MD5	`3e98e37fe2deb25f35d415f62507219c`
BLAKE2b-256	`aebd536c39ca8bf77dc69db9113a75ee00be6995b5ca5ec6b8f1a5c08c4d1b86`

See more details on using hashes here.

File details

Details for the file HtmlPageParser-0.0.4-py3-none-any.whl.

File metadata

Download URL: HtmlPageParser-0.0.4-py3-none-any.whl
Upload date: Apr 28, 2023
Size: 9.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for HtmlPageParser-0.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`94bf9bb517d5333b3aeb5e55c0b4aa27c4b79b64d98455e74a51a233afc62403`
MD5	`b28bb202f70e4688e370f8d1f2113ed5`
BLAKE2b-256	`0e4c9f230cc19a4ad5365a1293c842893a71f586cdc566ae87c472b582f40e52`

See more details on using hashes here.

HtmlPageParser 0.0.4

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

HtmlPageParser

安装

使用示例

配置示例

解析结果格式

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes