web-to-struct

A small tool for structuring data, mainly web data.

Installation

pip install web-to-struct

Requirements

Python >= 3.6

Usage

DEMO

import requests
import json
from web_to_struct import Parser

if __name__ == '__main__':
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    }
    r = requests.get("https://copymanga.org/recommend", headers=headers)

    config = {
        "name": "data",
        "map": [
            {"function": "string-to-element"},
            {"function": "css", "kwargs": {"patterns": ["#comic > .row > .exemptComicItem"]}},
        ],
        "children": [{
            "name": "title",
            "map": [
                {"function": "css", "kwargs": {"patterns": ["p[title]"]}},
            ]
        }, {
            "name": "img",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-img > a > img"]}},
                {"function": "attr", "kwargs": {"attr_name": "data-web_to_struct"}},
            ]
        }, {
            "name": "comic_id",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-img > a"]}},
                {"function": "attr", "kwargs": {"attr_name": "href"}},
                {"function": "regex", "kwargs": {"pattern": r"comic/(.*?)$"}},
            ]
        }, {
            "name": "author",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-txt > span.exemptComicItem-txt-span > a[href^=\"/author\"]"]}},
            ],
        }]
    }
    # Run the config against the raw HTML and pretty-print the structured result.
    parser = Parser()
    resp = parser.parse(r.text, config)
    print(json.dumps(resp, ensure_ascii=False, indent=2))

Output (truncated):

{
  "data": [
    {
      "title": "見到你之後該說什麼呢",
      "img": "https://mirror277.mangafuna.xyz:12001/comic/jiandaonizhihougaishuoshenmene/cover/e54e3f14-8425-11eb-869d-00163e0ca5bd.jpg!kb_w_item",
      "comic_id": "jiandaonizhihougaishuoshenmene",
      "author": "ねむようこ"
    } //,...
  ]
}

Config parameters

{
  "name": "",
  "map": [
    { "function": "", "kwargs": {} } // built-in function; each function's output is the next one's input
  ],
  "children": [{}] // optional; child nodes, each with the same structure as this node
}
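
The chaining rule in the comment above (each function's output is the next one's input) can be sketched in plain Python. This is only an illustration, not web-to-struct's actual implementation: `run_map`, `REGISTRY`, and the two `fn_*` helpers are hypothetical names, and the two emulated functions are simplified stand-ins built on the standard library.

```python
import json
import re

# Hypothetical stand-ins for two built-in functions; the real
# library's implementations may behave differently.
def fn_json_parse(value):
    return json.loads(value)

def fn_regex(value, pattern):
    # Assumption: return the single captured group of the first match, or None.
    match = re.search(pattern, value)
    return match.group(1) if match else None

REGISTRY = {"json-parse": fn_json_parse, "regex": fn_regex}

def run_map(value, steps):
    """Feed `value` through each step; every output becomes the next input."""
    for step in steps:
        func = REGISTRY[step["function"]]
        value = func(value, **step.get("kwargs", {}))
    return value

steps = [
    {"function": "regex", "kwargs": {"pattern": r"data=(\{.*\})"}},
    {"function": "json-parse"},
]
result = run_map('var data={"id": 7};', steps)
print(result)
```

Here the `regex` step extracts the JSON text and the `json-parse` step turns it into a dict, so the order of entries in `map` matters.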

Built-in functions

| Function | Accepted input (return type of the previous function) | Extra args | Returns | Description |
| --- | --- | --- | --- | --- |
| string-to-element | Union[str, bytes] | feature: str = "html5lib" | Element | - |
| css | Element | patterns: Union[str, List[str]] | [Element, None] | - |
| index | Union[Dict, Tuple, List] | pattern: str (e.g. "[1].x") | Any | - |
| text | Element | - | String | Get the plain text inside the current element |
| html | Element | - | String | Get the HTML string inside the current element |
| attr | Element | attr_name: str | str | Get an attribute value of the current element |
| regex | str | pattern: str | Union[str, tuple, None] | Regex match result |
| tuple-to-string | Tuple | pattern: str | String | Use $1, $2, ... to substitute tuple elements, e.g. "hello $1, $2" for tuple ("a", "b") returns "hello a, b" |
| json-parse | str | - | Union[Dict, List] | Parse a JSON string into a dict or list |
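
As a rough illustration of two of the functions above, the sketch below emulates `tuple-to-string` and `index` in plain Python. It is not the library's code: the helper names are hypothetical, and the reading of the `index` pattern ("[n]" indexes a sequence, ".key" reads a dict key) is an assumption based on the "[1].x" example.

```python
import re

def tuple_to_string(values, pattern):
    """Replace $1, $2, ... in `pattern` with the 1-based tuple elements."""
    return re.sub(
        r"\$(\d+)",
        lambda m: str(values[int(m.group(1)) - 1]),
        pattern,
    )

def index(data, pattern):
    """Walk a path such as "[1].x" through nested lists, tuples and dicts."""
    # Each token is either "[n]" (sequence position) or a dict key,
    # optionally preceded by a dot. This grammar is an assumption.
    for pos, key in re.findall(r"\[(-?\d+)\]|\.?([^.\[\]]+)", pattern):
        data = data[int(pos)] if pos else data[key]
    return data

greeting = tuple_to_string(("a", "b"), "hello $1, $2")
picked = index([{"x": 1}, {"x": 2}], "[1].x")
print(greeting)
print(picked)
```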

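Other behavior

  • If a node's return value is a list and the node has children, the result is the cross product of the return value and the children: every child is evaluated against every item in the list, as with "data" in the demo above.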
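
The list-times-children rule can be sketched as follows. This is a simplified model rather than the library's implementation: `evaluate` is a hypothetical helper, plain callables stand in for the built-in functions, and the handling of a non-list value with children is an assumption.

```python
def evaluate(node, value):
    """Evaluate a config node against `value` and return its result."""
    for func in node.get("map", []):
        value = func(value)  # plain callables stand in for built-in functions
    children = node.get("children")
    if not children:
        return value
    if isinstance(value, list):
        # List result plus children: apply every child to every item.
        return [
            {child["name"]: evaluate(child, item) for child in children}
            for item in value
        ]
    # Assumption: a non-list value is passed to each child once.
    return {child["name"]: evaluate(child, value) for child in children}

config = {
    "name": "data",
    "map": [lambda text: text.split(";")],
    "children": [
        {"name": "key", "map": [lambda s: s.split("=")[0]]},
        {"name": "value", "map": [lambda s: s.split("=")[1]]},
    ],
}
result = {config["name"]: evaluate(config, "a=1;b=2")}
print(result)
```

The parent's `map` yields a list of two items, so the output contains two dicts, each with one entry per child, which matches the shape of the demo output.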

References

  • Some of the built-in functions are modeled on Yealico.
