Skip to main content

A tool for data structuring, mainly for web data.

Project description

web-to-struct

A tool for data structuring, mainly for web data. 将数据格式化的小工具,主要处理web数据。

安装

pip install web-to-struct

环境

>= python3.6

使用

DEMO

import requests
import json
from web_to_struct import Parser

if __name__ == '__main__':
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    }
    r = requests.get("https://copymanga.org/recommend", headers=headers)

    config = {
        "name": "data",
        "map": [
            {"function": "string-to-element"},
            {"function": "css", "kwargs": {"patterns": ["#comic > .row > .exemptComicItem"]}},
        ],
        "children": [{
            "name": "title",
            "map": [
                {"function": "css", "kwargs": {"patterns": ["p[title]"]}},
            ]
        }, {
            "name": "img",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-img > a > img"]}},
                {"function": "attr", "kwargs": {"attr_name": "data-web_to_struct"}},
            ]
        }, {
            "name": "comic_id",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-img > a"]}},
                {"function": "attr", "kwargs": {"attr_name": "href"}},
                {"function": "regex", "kwargs": {"pattern": r"comic/(.*?)$"}},
            ]
        }, {
            "name": "author",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-txt > span.exemptComicItem-txt-span > a[href^=\"/author\"]"]}},
            ],
        }]
    }
    parser = Parser()
    resp = parser.parse(r.text, config)
    print(json.dumps(resp, ensure_ascii=False, indent=2))

returns

{
  "data": [
    {
      "title": "見到你之後該說什麼呢",
      "img": "https://mirror277.mangafuna.xyz:12001/comic/jiandaonizhihougaishuoshenmene/cover/e54e3f14-8425-11eb-869d-00163e0ca5bd.jpg!kb_w_item",
      "comic_id": "jiandaonizhihougaishuoshenmene",
      "author": "ねむようこ"
    } //,...
  ]
}

Config参数

{
  "name": "",
  "map": [
    { "function": "", "kwargs": {} } // 内置函数,上一个的输出作为下一个的输入
  ],
  "children": [{}] // optional 子节点,结构同本结构。
}

内置函数 Functions

Function 函数名 Accepted Returns 可接受的上一个函数的返回类型 Extra Args 额外的参数 Returns 返回类型 Description 描述
string-to-element Union[str, bytes] feature: str = "html5lib" Element -
css Element patterns: Union[str, List[str]] [Element, None] -
index Union[Dict, Tuple, List] pattern: str # eg."[1].x" Any -
text Element - String get pure strings inside the current elements
html Element - String get HTML strings inside the current element
attr Element attr_name: str str get attribute value of the current element
regex str pattern: str Union[str, tuple, None] regex match result
tuple-to-string Tuple pattern: str String use $1,$2,... to replace tuple elements, eg. "hello $1, $2" for tuple ("a", "b") returns "hello a, b"
json-parse str - Union[Dict, List] parse json string to dict

其他行为

  • 返回值如果是list,且有children,则处理为返回值叉乘children

参考

  • 部分内置函数参考了Yealico

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

web-to-struct-1.0.4.tar.gz (4.9 kB view details)

Uploaded Source

File details

Details for the file web-to-struct-1.0.4.tar.gz.

File metadata

  • Download URL: web-to-struct-1.0.4.tar.gz
  • Upload date:
  • Size: 4.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.55.1 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.2

File hashes

Hashes for web-to-struct-1.0.4.tar.gz
Algorithm Hash digest
SHA256 d2a43acde1ff0cf83df52a6470fb4fd74472a7c9eb9c48ffd77d826a80321f48
MD5 d08eb1f03a697e338dc32d963ff5c653
BLAKE2b-256 59a00c9216369e3b4e79b4b48fa860c8b3151aa60149159b0a9276f1bb47e6d0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page