
General-purpose HTML main-content extractor

Project description

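magic-html - General-purpose HTML data extractor

Welcome to magic-html, a Python library designed to simplify extracting the main-content region from HTML.

Project description

magic-html provides a set of tools for extracting the main-content region from HTML. Whether you are working with complex HTML structures or simple web pages, the library aims to offer a convenient and efficient interface for your HTML extraction needs.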

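Features

  • Returns the main-content region as HTML; output can be customized as plain text or Markdown
  • Supports multimodal extraction
  • Supports extractors for multiple page layouts: article and forum
  • Supports extraction and conversion of LaTeX formulas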
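
The first feature mentions plain-text output, but this README does not show how that output is selected. As a rough illustration only, the sketch below flattens an HTML fragment to plain text using the Python standard library. It does not use magic-html at all; `TextFlattener`, `html_to_text`, and the tag sets are hypothetical names chosen for this example, not part of the library.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags start a new line; everything else is treated as inline.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: content of these tags is never part of the readable text.
SKIP_TAGS = {"script", "style"}


class TextFlattener(HTMLParser):
    """Collects text nodes and inserts line breaks around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace (including source line breaks) to one space.
            self._chunks.append(re.sub(r"\s+", " ", data))

    def text(self):
        lines = (line.strip() for line in "".join(self._chunks).split("\n"))
        return "\n".join(line for line in lines if line)


def html_to_text(fragment):
    """Return the visible text of an HTML fragment, one block per line."""
    parser = TextFlattener()
    parser.feed(fragment)
    parser.close()
    return parser.text()
```

Calling `html_to_text` on a main-content HTML string gives one line per block element, which is usually enough for quick inspection; it makes no attempt to preserve links, tables, or formulas.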

Installation

pip install https://github.com/opendatalab/magic-html/releases/download/magic_html-0.1.6-released/magic_html-0.1.6-py3-none-any.whl

Usage

from magic_html import GeneralExtractor

# Initialize the extractor
extractor = GeneralExtractor()

url = "http://example.com/"
html = """

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />  
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
"""

# Extract data from article-type HTML (the default)
data = extractor.extract(html, base_url=url)

# Extract data from forum-type HTML
# data = extractor.extract(html, base_url=url, html_type="forum")

# Extract data from WeChat article HTML
# data = extractor.extract(html, base_url=url, html_type="weixin")

print(data)
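
When pages of several types are processed in one run, the three calls above can be selected per URL. The following is a minimal sketch of that dispatch, not something magic-html provides: `pick_html_type`, `extract_kwargs`, and `FORUM_HOSTS` are hypothetical helpers, and the hostname rules are assumptions you would replace with your own. Only the `base_url` and `html_type` arguments shown above are used.

```python
from urllib.parse import urlparse

# Hypothetical example hosts; replace with the forum sites you actually crawl.
FORUM_HOSTS = frozenset({"forum.example.com"})


def pick_html_type(url, forum_hosts=FORUM_HOSTS):
    """Return "weixin", "forum", or None (None means: use the default article mode)."""
    host = (urlparse(url).hostname or "").lower()
    if host == "mp.weixin.qq.com":  # host used by WeChat public-account articles
        return "weixin"
    if host in forum_hosts:
        return "forum"
    return None


def extract_kwargs(url, forum_hosts=FORUM_HOSTS):
    """Build keyword arguments for extractor.extract(html, **kwargs)."""
    kwargs = {"base_url": url}
    html_type = pick_html_type(url, forum_hosts)
    if html_type is not None:
        # html_type is only passed when it is not the default, as in the calls above.
        kwargs["html_type"] = html_type
    return kwargs
```

With these helpers, the call becomes `data = extractor.extract(html, **extract_kwargs(url))`. Leaving `html_type` out for ordinary pages mirrors the default article call shown earlier.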

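Benchmark report

Extraction accuracy is compared across open-source general-purpose extraction frameworks, by HTML page type (article or forum).

Article type: 158 annotated HTML pages selected from leading news and blog sites.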

╒══════════════════════╤═════════════╤════════════╤═══════════╕
│ func                 │ prec_mean   │ rec_mean   │ f1_mean   │
╞══════════════════════╪═════════════╪════════════╪═══════════╡
│ magic_html           │ 0.908865    │ 0.95032    │ 0.92913   │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ trafilatura          │ 0.833434    │ 0.912384   │ 0.871124  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ trafilatura_fallback │ 0.831229    │ 0.933713   │ 0.879496  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ readability-lxml     │ 0.86587     │ 0.861391   │ 0.863625  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ newspaper3k          │ 0.409585    │ 0.372083   │ 0.389935  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ goose3               │ 0.525717    │ 0.457669   │ 0.489339  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ justext              │ 0.224945    │ 0.117092   │ 0.154014  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ gne                  │ 0.828849    │ 0.629112   │ 0.715299  │
╘══════════════════════╧═════════════╧════════════╧═══════════╛

Forum type: 103 HTML pages selected from leading forums, Q&A sites, and sites built with open-source site-building frameworks.

╒══════════════════════╤═════════════╤════════════╤═══════════╕
│ func                 │ prec_mean   │ rec_mean   │ f1_mean   │
╞══════════════════════╪═════════════╪════════════╪═══════════╡
│ magic_html           │ 0.796252    │ 0.826819   │ 0.811248  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ trafilatura          │ 0.716009    │ 0.695947   │ 0.705835  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ trafilatura_fallback │ 0.730304    │ 0.691328   │ 0.710282  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ readability-lxml     │ 0.788018    │ 0.445087   │ 0.568867  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ newspaper3k          │ 0.596976    │ 0.298322   │ 0.397837  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ goose3               │ 0.675835    │ 0.312969   │ 0.427821  │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ justext              │ 0.175889    │ 0.0517628  │ 0.0799863 │
├──────────────────────┼─────────────┼────────────┼───────────┤
│ gne                  │ 0.81003     │ 0.389709   │ 0.526241  │
╘══════════════════════╧═════════════╧════════════╧═══════════╛
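
The report does not state how the three scores are computed. One thing that can be checked from the numbers alone: for the rows sampled below, f1_mean agrees with the harmonic mean of prec_mean and rec_mean to the displayed precision, which suggests (but does not prove) that the F1 column is derived from the averaged precision and recall rather than averaged per page. The helper name `f1_from_means` is made up for this check.

```python
def f1_from_means(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (page type, extractor, prec_mean, rec_mean, reported f1_mean), copied from the tables above.
ROWS = [
    ("article", "magic_html", 0.908865, 0.95032, 0.92913),
    ("article", "trafilatura", 0.833434, 0.912384, 0.871124),
    ("article", "gne", 0.828849, 0.629112, 0.715299),
    ("forum", "magic_html", 0.796252, 0.826819, 0.811248),
]

for page_type, name, prec, rec, reported in ROWS:
    recomputed = f1_from_means(prec, rec)
    print(f"{page_type:8s} {name:12s} reported={reported:.6f} recomputed={recomputed:.6f}")
```

Because the table values are themselves rounded, small differences in the last digit are expected when other rows are checked the same way.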

License

The code in this project is licensed under the Apache 2.0 license.

Acknowledgements
