
A tool for data structuring, mainly for web data.


web-to-struct

A small tool for turning raw data, mainly web pages, into structured output.

Installation

pip install web-to-struct

Requirements

Python >= 3.6

Usage

Demo

import requests
import json
from web_to_struct import Parser

if __name__ == '__main__':
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    }
    r = requests.get("https://copymanga.org/recommend", headers=headers)

    config = {
        "name": "data",
        "map": [
            {"function": "string-to-element"},
            {"function": "css", "kwargs": {"patterns": ["#comic > .row > .exemptComicItem"]}},
        ],
        "children": [{
            "name": "title",
            "map": [
                {"function": "css", "kwargs": {"patterns": ["p[title]"]}},
            ]
        }, {
            "name": "img",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-img > a > img"]}},
                {"function": "attr", "kwargs": {"attr_name": "data-src"}},
            ]
        }, {
            "name": "comic_id",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-img > a"]}},
                {"function": "attr", "kwargs": {"attr_name": "href"}},
                {"function": "regex", "kwargs": {"pattern": r"comic/(.*?)$"}},
            ]
        }, {
            "name": "author",
            "map": [
                {"function": "css", "kwargs": {"patterns": [".exemptComicItem-txt > span.exemptComicItem-txt-span > a[href^=\"/author\"]"]}},
            ],
        }]
    }
    parser = Parser()
    resp = parser.parse(r.text, config)
    print(json.dumps(resp, ensure_ascii=False, indent=2))

This prints:

{
  "data": [
    {
      "title": "見到你之後該說什麼呢",
      "img": "https://mirror277.mangafuna.xyz:12001/comic/jiandaonizhihougaishuoshenmene/cover/e54e3f14-8425-11eb-869d-00163e0ca5bd.jpg!kb_w_item",
      "comic_id": "jiandaonizhihougaishuoshenmene",
      "author": "ねむようこ"
    } //,...
  ]
}
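
The `comic_id` node in the demo ends with a `regex` step using the pattern `comic/(.*?)$`. As a rough standalone check of what that pattern captures, here is the same expression applied with Python's standard `re` module. The sample `href` value is an assumption inferred from the output above, and the library's own `regex` function may handle matches differently.

```python
import re

# Hypothetical href value, inferred from the comic_id shown in the demo output.
href = "/comic/jiandaonizhihougaishuoshenmene"

# Same pattern as the demo config: capture everything after "comic/" up to the end.
match = re.search(r"comic/(.*?)$", href)
comic_id = match.group(1) if match else None

print(comic_id)  # jiandaonizhihougaishuoshenmene
```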

Config parameters

{
  "name": "",
  "map": [
    { "function": "", "kwargs": {} } // built-in function; each function's output is the input of the next one
  ],
  "children": [{}] // optional; child nodes, each with the same structure as this one
}
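
To illustrate the chaining rule for `map` (each function receives the previous function's output), here is a minimal sketch in plain Python. It is not the library's implementation: the `REGISTRY` dictionary, the `run_map` helper, and the `strip`, `split`, and `pick` step names are all hypothetical and exist only for this example.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-ins for built-in functions; these are NOT part of web-to-struct.
REGISTRY: Dict[str, Callable[..., Any]] = {
    "strip": lambda value: value.strip(),
    "split": lambda value, sep=",": value.split(sep),
    "pick": lambda value, position=0: value[position],
}


def run_map(value: Any, steps: List[dict]) -> Any:
    """Feed `value` through each step; every output becomes the next input."""
    for step in steps:
        func = REGISTRY[step["function"]]
        value = func(value, **step.get("kwargs", {}))
    return value


steps = [
    {"function": "strip"},
    {"function": "split", "kwargs": {"sep": ";"}},
    {"function": "pick", "kwargs": {"position": 1}},
]
result = run_map("  a;b;c  ", steps)
print(result)  # b
```

An empty `steps` list returns the input unchanged in this sketch; whether the library behaves the same way is not stated in this README.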

Built-in functions

| Function | Accepted input (return type of the previous function) | Extra args | Returns | Description |
| --- | --- | --- | --- | --- |
| string-to-element | Union[str, bytes] | feature: str = "html5lib" | Element | - |
| css | Element | patterns: Union[str, List[str]] | [Element, None] | - |
| index | Union[Dict, Tuple, List] | pattern: str (e.g. "[1].x") | Any | - |
| text | Element | - | str | Get the plain text inside the current element |
| html | Element | - | str | Get the HTML string inside the current element |
| attr | Element | attr_name: str | str | Get an attribute value of the current element |
| regex | str | pattern: str | Union[str, tuple, None] | Regex match result |
| tuple-to-string | Tuple | pattern: str | str | Use $1, $2, ... to insert tuple elements, e.g. "hello $1, $2" for the tuple ("a", "b") returns "hello a, b" |
| json-parse | str | - | Union[Dict, List] | Parse a JSON string into a dict or list |
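
The `$1, $2, ...` substitution described for `tuple-to-string` can be sketched in plain Python as follows. This only mirrors the documented example and is not the library's code. The `tuple_to_string` helper name is hypothetical, and leaving out-of-range placeholders untouched is an assumption.

```python
import re
from typing import Tuple


def tuple_to_string(values: Tuple[str, ...], pattern: str) -> str:
    """Replace $1, $2, ... in `pattern` with the 1-based elements of `values`."""

    def substitute(match):
        position = int(match.group(1)) - 1
        # Assumption: a placeholder with no matching element is left as written.
        if 0 <= position < len(values):
            return str(values[position])
        return match.group(0)

    return re.sub(r"\$(\d+)", substitute, pattern)


greeting = tuple_to_string(("a", "b"), "hello $1, $2")
print(greeting)  # hello a, b
```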

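Other behavior

  • If a node's return value is a list and the node has children, the children are applied to every item in the list (the cross product of the returned items and the children).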
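
The list-times-children rule can be pictured with a small plain-Python sketch. It is an illustration under assumptions rather than the library's implementation: `apply_children` is a hypothetical helper, children are modeled as simple callables, and the single-dict result for non-list values is assumed.

```python
from typing import Any, Callable, Dict


def apply_children(value: Any, children: Dict[str, Callable[[Any], Any]]) -> Any:
    """Apply every child to `value`, or to each item when `value` is a list."""
    if isinstance(value, list):
        # One output record per list item, each holding every child's result.
        return [
            {name: child(item) for name, child in children.items()}
            for item in value
        ]
    # Assumption: a non-list value produces a single record.
    return {name: child(value) for name, child in children.items()}


rows = ["naruto/ch1", "bleach/ch7"]
children = {
    "series": lambda row: row.split("/")[0],
    "chapter": lambda row: row.split("/")[1],
}
records = apply_children(rows, children)
print(records)
# [{'series': 'naruto', 'chapter': 'ch1'}, {'series': 'bleach', 'chapter': 'ch7'}]
```

This matches the shape of the demo output, where the `css` step on the `data` node yields a list of elements and each element produces one record with `title`, `img`, `comic_id`, and `author`.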

Acknowledgements

  • Some of the built-in functions were modeled on Yealico.


