
A generic HTML page parser


HtmlPageParser

A generic parser for HTML page content

Installation

pip install HtmlPageParser

Usage example

# Parse the content of a page
from HtmlPageParser.src.parser import Parser
with open("test.html", "r", encoding="utf-8") as f:
    html = f.read()
client = Parser(base_url="https://www.163.com/")
xpath_css = [
    {".//p[@class='f_center']": ".f_center"},
    {".//p[@class='f_center']": ".f_center"},
    {".//p[@class='f_center']": ".f_center"},
    {".//p[@class='f_center']": ".f_center"},
    {".//p[@class='f_center']": ".f_center"}
]
data = client.parser(html, css_selector="#content > div.post_body", xpath_css=xpath_css)
print(data)

# Convert JSON-structured data to Markdown
import json
from HtmlPageParser.src.json2markdown import Json2Markdown
with open("json_data.json", 'r', encoding='utf-8') as f:
    json_data = json.load(f)
J2M = Json2Markdown()
markdown_data = J2M.json2markdown(json_data)
print(markdown_data)
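Parameter Description
base_url The site root of the page being scraped. For example, if the page is https://www.163.com/dy/article/I35SM5AN0514EGPO.html, then base_url is https://www.163.com/
html The HTML of the page, i.e. the response.text returned by requests.get()
css_selector The CSS selector of the parent tag of the elements to scrape. In the browser, right-click the element, choose Inspect, and copy its selector
xpath_css A list of XPath-to-CSS mapping dicts. If the page contains nested tags that must be scraped further, add one mapping for the parent of each such tag; see the configuration example below
a_attr The attribute to extract from a tags; defaults to href
img_attr The attribute to extract from img tags; defaults to src
video_attr The attribute to extract from video tags; defaults to src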
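
The base_url rule described above (scheme plus host of the page being scraped) can be sketched with the standard library. This is a minimal illustration, not part of HtmlPageParser; the helper name derive_base_url is hypothetical.

```python
from urllib.parse import urlsplit


def derive_base_url(page_url: str) -> str:
    """Return scheme + host with a trailing slash (hypothetical helper)."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    return f"{parts.scheme}://{parts.netloc}/"


# The page URL used as the example in the parameter table.
base_url = derive_base_url("https://www.163.com/dy/article/I35SM5AN0514EGPO.html")
print(base_url)
```

The resulting string can then be passed as the base_url argument when constructing Parser.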

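Configuration example

Take the element block below. Suppose I want to scrape every tag under div[class='post_body']: css_selector is then the CSS selector for the <div class="post_body"> tag. With only css_selector configured, the parser captures the direct children of <div class="post_body"> but not the tags nested inside those children. To go deeper, configure the xpath_css parameter. For example, to scrape the img tags inside <p class="f_center">, write an XPath-to-CSS mapping dict for the <p class="f_center"> level. The block below contains two <p class="f_center"> tags, so two mapping dicts are needed, in order, in the following format: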

xpath_css = [
    {".//p[@class='f_center']": ".f_center"},
    {".//p[@class='f_center']": ".f_center"}
]
    <div class="post_top">
        <div class="post_body">
            <p id="1O5D2DRS">
                <video src="http://flv0.bn.netease.com/6ac0c4.mp4" data-video="http://flv0.bn.netease.com/6ac0c40c71faab9.jpg">
                当地时间4月24日,联合国秘书长古特雷斯与俄外交部长拉夫罗夫举行了会面,双方就乌克兰局势、阿富汗、叙利亚等方面的问题进行了讨论。
            </p>
            <p class="f_center">
                <img src="https://nimg.ws.126.net/?url=http%3A%2F%2Fdingyue.00ne00esc.jpg">
                <br>
            </p>
            <p id="1O5G5ICJ">视频截图</p>
            <p id="1O556NEP">古特雷斯还向拉夫罗夫提交了一封致俄总统普京的信,概述了旨在改进、延长和扩大黑海粮食协议的方向。
            </p>
            <p id="1O556NEQ">报道称古特雷斯已向该协议的另外两个签署方乌克兰、土耳其,发送了类似函件。
            </p>
            <p id="1O556NER">此外,古特雷斯还向拉夫罗夫介绍了秘书处在解决俄罗斯官员签证问题上所做的最新努力。</p>
            <p class="f_center">
                <img src="https://nimg.ws.126.net/?url=http%3A%2F%2Fdingyue.000hp00ajc.jpg">
                <br>
            </p>
        </div>
    </div>
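
Since one mapping dict is needed per matching tag, in document order, the list can be generated instead of written by hand. The sketch below uses only the standard library and is not part of HtmlPageParser; ClassCounter and build_xpath_css are hypothetical names, and the sample HTML is a shortened stand-in for the block above.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute equals class_name exactly."""

    def __init__(self, tag: str, class_name: str) -> None:
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Exact comparison mirrors the XPath test @class='...'.
        if tag == self.tag and dict(attrs).get("class") == self.class_name:
            self.count += 1


def build_xpath_css(html: str, tag: str, class_name: str) -> list:
    """Build one XPath-to-CSS mapping dict per matching tag (hypothetical helper)."""
    counter = ClassCounter(tag, class_name)
    counter.feed(html)
    xpath = f".//{tag}[@class='{class_name}']"
    css = f".{class_name}"
    return [{xpath: css} for _ in range(counter.count)]


# Shortened stand-in for the element block above.
sample = """
<div class="post_body">
  <p class="f_center"><img src="a.jpg"><br></p>
  <p id="x">text</p>
  <p class="f_center"><img src="b.jpg"><br></p>
</div>
"""
xpath_css = build_xpath_css(sample, "p", "f_center")
print(xpath_css)
```

The generated list has the same shape as the hand-written xpath_css shown earlier, so it could be passed to client.parser in its place.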

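Parsed result format

The result below was not parsed from the HTML above; it only illustrates the structure produced for several kinds of tags.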

[
    {
        "type": "p",
        "context": "Imprimir",
        "link": [
            {
                "start": 0,
                "end": 8,
                "origin_url": "https://www.minsalud.gob.bo/1089-sorata-cumple-con-la-implementacion-de-la-politica-sanitaria-safci-encaminada-por-el-ministerio-de-salud?tmpl=component&print=1&layout=default",
                "url": "https://www.minsalud.gob.bo/1089-sorata-cumple-con-la-implementacion-de-la-politica-sanitaria-safci-encaminada-por-el-ministerio-de-salud?tmpl=component&print=1&layout=default"
            }
        ]
    },
    {
        "type": "img",
        "context": "",
        "link": [
            {
                "origin_url": "https://www.minsalud.gob.bo/images/noticias16/sorata2.gif",
                "url": "Bolivia_Regulation/Files//8b36d2ab18f5fed475dffa42b7e0bbe7."
            }
        ]
    },
    {
        "type": "h1",
        "context": "La Paz – Viernes 6 de Mayo de 2016 | Unidad de Comunicación",
        "link": []
    },
    {
        "type": "h2",
        "context": "Otras peticiones fundamentales fueron la compra de ambulancias",
        "link": []
    },
    {
        "type": "table",
        "link": [],
        "context": [
            {
                "type": "tr",
                "link": [],
                "context": [
                    {
                        "type": "th",
                        "context": "配信動画",
                        "link": []
                    },
                    {
                        "type": "th",
                        "context": "配信日",
                        "link": []
                    }
                ]
            },
            {
                "type": "tr",
                "link": [],
                "context": [
                    {
                        "type": "td",
                        "context": "これでわかる!適合性調査における再審査等申請から日程調整までの手続き -資料作成のポイント-",
                        "link": []
                    },
                    {
                        "type": "td",
                        "context": "2022年11月15日",
                        "link": []
                    }
                ]
            },
            {
                "type": "tr",
                "link": [],
                "context": [
                    {
                        "type": "td",
                        "context": "再審査適合性調査等における解析用データセットの活用について",
                        "link": []
                    },
                    {
                        "type": "td",
                        "context": "2022年11月15日",
                        "link": []
                    }
                ]
            }
        ]
    },
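
In the structure above, "context" holds either a string or a nested list of nodes (as for table and tr). Consuming it therefore takes a recursive walk. The sketch below is one way to do that; it is not the Json2Markdown implementation. The function names are hypothetical, the example.com URL is a placeholder, and reading "start"/"end" as character offsets into "context" is an assumption based on the "Imprimir" entry (8 characters, offsets 0 to 8).

```python
def collect_text(nodes: list) -> list:
    """Return every non-empty string context, depth first."""
    texts = []
    for node in nodes:
        context = node.get("context", "")
        if isinstance(context, list):
            texts.extend(collect_text(context))  # nested nodes, e.g. table rows
        elif context:
            texts.append(context)
    return texts


def collect_urls(nodes: list) -> list:
    """Return the origin_url of every link, depth first."""
    urls = []
    for node in nodes:
        for link in node.get("link", []):
            if "origin_url" in link:
                urls.append(link["origin_url"])
        if isinstance(node.get("context"), list):
            urls.extend(collect_urls(node["context"]))
    return urls


# Small hand-written sample in the same shape as the result above.
sample = [
    {"type": "p", "context": "Imprimir", "link": [
        {"start": 0, "end": 8,
         "origin_url": "https://example.com/print",  # placeholder URL
         "url": "https://example.com/print"},
    ]},
    {"type": "table", "link": [], "context": [
        {"type": "tr", "link": [], "context": [
            {"type": "th", "context": "A", "link": []},
            {"type": "th", "context": "B", "link": []},
        ]},
    ]},
]

print(collect_text(sample))
print(collect_urls(sample))

# Assumed meaning of start/end: the slice of context covered by the link.
first = sample[0]
anchor_text = first["context"][first["link"][0]["start"]:first["link"][0]["end"]]
print(anchor_text)
```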
