A Python library for manipulating HTML
ReadMe
Installation
$ pip3 install nvhtml
refer to Installation
Documentation
refer to https://nvhtml.readthedocs.io
License
MIT
Quickstart
refer to Usage
Features
search HTML by tag path list
search HTML by location (depth, breadth)
search HTML by tag
width-first traversal of HTML
description matrix of HTML
depth-first traversal of HTML (SAX)
beautify HTML
- get an HTML node via relationship:
ancestor, descendants, parent, grand-parent, children,
siblings, lsib, rsib, preceding-sib, following-sib,
lcin, rcin
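The width-first traversal and the tag path lists named above can be illustrated with the standard library alone. The sketch below is not nvhtml's implementation: the function name `width_first_pathlists` is hypothetical, and it uses `xml.etree.ElementTree`, so it only accepts well-formed markup and ignores comments.

```python
import xml.etree.ElementTree as ET

def width_first_pathlists(root):
    """Yield (depth, breadth, pathlist) for every element, layer by layer.

    depth   : 0 for the root, 1 for its children, and so on
    breadth : position of the element inside its layer, counted from 0
    pathlist: tags from the root down to the element
    """
    layer = [(root, [root.tag])]
    depth = 0
    while layer:
        next_layer = []
        for breadth, (node, pl) in enumerate(layer):
            yield depth, breadth, pl
            # iterating an Element yields its child elements in document order
            for child in node:
                next_layer.append((child, pl + [child.tag]))
        layer = next_layer
        depth += 1

doc = "<html><head><title>t</title></head><body><div><p>a</p></div><div/></body></html>"
rows = list(width_first_pathlists(ET.fromstring(doc)))
for depth, breadth, pl in rows:
    print(depth, breadth, "/".join(pl))
```

Each printed row carries the same kind of information as the `depth`, `breadth` and `pl` fields that appear in the command-line output later in this document.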
References
elist
mlconjug (for the docs template format)
Installation
From pip
To install nvhtml, run this command in your terminal:
$ pip3 install nvhtml
From sources
The sources for nvhtml can be downloaded from the GitHub repo:
$ git clone git@github.com:ihgazni2/nvhtml.git
Once you have a copy of the source, you can install it with:
$ python setup.py install
Usage
Command Line
rshtml write rules
- # (pound): comment
- - (dash): attrib
- . (dot): text and tail
- | (vbar): text and tail content, which may span multiple lines
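As a rough illustration of the four markers above, the following sketch classifies a single rshtml line by its first non-blank character. It is an assumption-laden toy, not the parser used by `nvhtml_rsh2html`: the function name `classify_rshtml_line` and the returned labels are hypothetical, and nesting by indentation is not handled at all.

```python
def classify_rshtml_line(line):
    """Return (kind, payload) for one rshtml line, based only on its marker."""
    stripped = line.strip()
    if not stripped:
        return ("blank", "")
    marker, rest = stripped[0], stripped[1:]
    if marker == "#":
        # pound: comment
        return ("comment", rest)
    if marker == "-":
        # dash: attribute, written as "-name value"
        name, _, value = rest.partition(" ")
        return ("attrib", (name, value))
    if marker == ".":
        # dot: introduces a text or tail section
        return ("text_or_tail", rest)
    if marker == "|":
        # vbar: one line of text/tail content
        return ("content", rest)
    # anything else is treated as a tag name
    return ("tag", stripped)

sample = ["meta", "-name viewport", ".text", "|hello", "|hihihi", "#comment"]
kinds = [classify_rshtml_line(s) for s in sample]
print(kinds)
```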
rshtml
NVHTML-BENCH# cat tstrs.rshtml
html
    head
        meta
            -http-equiv X-UA-Compatible
            -content IE=edge,chrome=1
        meta
            -name viewport
            -content user-scalable=yes, initial-scale=1.0, minimum-scale=1.0, maximum-scale=3.0
        link
    body
        div
            -id menu-item-27961
            -class qtranxs-lang-menu-item menu-item-object-custom
            .text
                |hello
                |hihihi
            .tail
                |this is a tail
            li
        div
            li
        div
    #comment
        .text
            |this is acomment
NVHTML-BENCH# nvhtml_rsh2html -input tstrs.rshtml
NVHTML-BENCH# ls -l | egrep tstrs.rshtml
-rw-r--r-- 1 root root 597 Aug 19 11:36 tstrs.rshtml
-rw-r--r-- 1 root root 658 Aug 19 11:56 tstrs.rshtml.html
NVHTML-BENCH# cat tstrs.rshtml.html
<html>
    <head>
        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
        </meta>
        <meta name="viewport" content="user-scalable=yes, initial-scale=1.0, minimum-scale=1.0, maximum-scale=3.0">
        </meta>
        <link>
        </link>
    </head>
    <body>
        <div id="menu-item-27961" class="qtranxs-lang-menu-item menu-item-object-custom">
            hello
            hihihi
            <li>
            </li>
        </div>
        this is a tail
        <div>
            <li>
            </li>
        </div>
        <div>
        </div>
    </body>
    <!-- this is acomment -->
</html>
NVHTML-BENCH#
vice versa
NVHTML-BENCH# nvhtml_html2rsh -input tstrs.html
vertical show
with color
NVHTML-BENCH# nvhtml_struct_show -input disp.html
color: True
--------------------------
| html |
--------------------------
| head | body |
--------------------------
| | div | div |
--------------------------
| | | div | div |
--------------------------
no color
# if the output is too big, disable color and open it in an editor such as Notepad++
NVHTML-BENCH# nvhtml_struct_show -input opis.html -color "no" > html_txt_tb
nvrsh_struct_show
TEST# nvrsh_struct_show -input jobj2.rshtml
color: True
-------------------------------
| html |
-------------------------------
| body |
-------------------------------
| im-dict-root |
-------------------------------
| l | t | s |
-------------------------------
| o0 | o1 | o0 | o1 | u0 | u1 |
-------------------------------
| a0 | a1 | t0 | t1 | s0 | s1 |
-------------------------------
NVHTML# cat TEST/jobj2.rshtml
im-dict-root
    l
        o0
            a0
        o1
            a1
    t
        o0
            t0
        o1
            t1
    s
        u0
            s0
        u1
            s1
NVHTML#
beautify
nvhtml_beauty -input opis.html
vim opis.html.out.html

# nvhtml_beauty -h
usage: nvhtml_beauty [-h] [-input INPUT_HTML_FILE] [-output OUTPUT_HTML_FILE] [-codec INPUT_CODEC]

optional arguments:
  -h, --help                  show this help message and exit
  -input --input_html_file    input html file name
  -output --output_html_file  output html file name
  -codec --input_codec        input html file codec
search with loc
NVHTML-BENCH# nvhtml_loc -h
usage: nvhtml_loc [-h] [-input INPUT_HTML_FILE] [-codec INPUT_CODEC] [-loc MAT_LOC]

optional arguments:
  -h, --help                show this help message and exit
  -input --input_html_file  input html file name
  -codec --input_codec      input html file codec
  -loc --mat_loc            depth,breadth

NVHTML-BENCH# nvhtml_loc -input opis.html -loc 11,2
<------json-----
{
    'pl': [ 'html', 'body', 'div', 'div', 'header', 'div', 'nav', 'div', 'ul', 'li', 'ul', 'li' ],
    'breadth': 2,
    'depth': 11,
    'pbreadth': 2,
    'samepl_sibseq': 0,
    'samepl_breadth': 0,
    'tag': 'li',
    'sibseq': 0,
    'attrib': {
        'id': 'menu-item-22951',
        'class': 'menu-item menu-item-type-post_type menu-item-object-page menu-item-22951'
    },
    'text': None,
    'tail': '\n\t',
    'text_intag': ''
}
------json---->
<----------html--------------
<li id="menu-item-22951" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-22951">
    <a href="https://opistobranquis.info/en/home/presentacio/">
        <span>
            Presentation
        </span>
    </a>
</li>
<----------html--------------
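The `-loc depth,breadth` lookup can be pictured as indexing a matrix whose rows are layers of the tree. The sketch below builds such a matrix with the standard library; it is an illustration under assumptions, not nvhtml's code. The name `build_matrix` is hypothetical, only a subset of the fields shown above is filled in, and `xml.etree.ElementTree` requires well-formed input and skips comment nodes.

```python
import xml.etree.ElementTree as ET

def build_matrix(root):
    """Return mat such that mat[depth][breadth] describes one element."""
    mat = []
    # each entry: (node, pathlist, breadth of parent, index among siblings)
    layer = [(root, [root.tag], None, 0)]
    while layer:
        row, next_layer = [], []
        for breadth, (node, pl, pbreadth, sibseq) in enumerate(layer):
            row.append({
                "pl": pl,
                "depth": len(mat),      # rows appended so far == current depth
                "breadth": breadth,
                "pbreadth": pbreadth,
                "sibseq": sibseq,
                "tag": node.tag,
                "attrib": dict(node.attrib),
            })
            for seq, child in enumerate(node):
                next_layer.append((child, pl + [child.tag], breadth, seq))
        mat.append(row)
        layer = next_layer
    return mat

doc = ('<html><head><meta name="a"/></head>'
       '<body><div id="x"><p/></div><div id="y"/></body></html>')
mat = build_matrix(ET.fromstring(doc))
print(mat[2][2])   # the element at depth 2, breadth 2
```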
tag search with depth
NVHTML-BENCH# nvhtml_tag -h
usage: nvhtml_tag [-h] [-input INPUT_HTML_FILE] [-codec INPUT_CODEC] [-tag HTML_TAG] [-which WHICH_TAG] [-sdepth START_LEVEL_DEPTH] [-edepth END_LEVEL_DEPTH]

optional arguments:
  -h, --help                    show this help message and exit
  -input --input_html_file      input html file name
  -codec --input_codec          input html file codec
  -tag --html_tag               html tag for search
  -which --which_tag            sequence of tag-array
  -sdepth --start_level_depth   start level depth
  -edepth --end_level_depth     end level depth

NVHTML-BENCH# nvhtml_tag -input opis.html.out.html -tag img -sdepth 3 -edepth 6
<img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Twitter.png">
</img>
<img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png">
</img>
<img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Twitter.png">
</img>
<img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png">
</img>
<img style="float: right; display: none" class="loading" src="https://opistobranquis.info/wp-content/plugins/jetpack/modules/sharedaddy/images/loading.gif" alt="loading" width="16" height="16">
</img>
NVHTML-BENCH#
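The idea behind a tag search limited to a depth range can be sketched as a plain tree walk that tracks the depth of each node. This is not how `nvhtml_tag` is implemented; `find_tag_between_depths` is a hypothetical name, and treating both `sdepth` and `edepth` as inclusive bounds is an assumption.

```python
import xml.etree.ElementTree as ET

def find_tag_between_depths(root, tag, sdepth, edepth):
    """Collect elements named `tag` whose depth d satisfies sdepth <= d <= edepth.

    The root has depth 0. Inclusive bounds are an assumption of this sketch.
    """
    found = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node.tag == tag and sdepth <= depth <= edepth:
            found.append(node)
        # push children in reverse so they are popped in document order
        for child in reversed(list(node)):
            stack.append((child, depth + 1))
    return found

doc = ('<html><body><img src="a.png"/><div><img src="b.png"/>'
       '<div><img src="c.png"/></div></div></body></html>')
root = ET.fromstring(doc)
deep = [e.get("src") for e in find_tag_between_depths(root, "img", 3, 4)]
shallow = [e.get("src") for e in find_tag_between_depths(root, "img", 2, 2)]
print(deep, shallow)
```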
html to db
NVHTML-BENCH# nvhtml_sqlite -input opis.html
db: ./opis.html.sqlite.db
table: tb_html
NVHTML-BENCH# sqlite3 opis.html.sqlite.db
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
sqlite> .table
tb_html
sqlite> .schema tb_html
CREATE TABLE IF NOT EXISTS "tb_html" (
"index" INTEGER, "_pl" TEXT, "_breadth" TEXT, "_depth" TEXT, "_pbreadth" TEXT,
"_samepl_sibseq" TEXT, "_samepl_breadth" TEXT, "_tag" TEXT, "_sibseq" TEXT,
"_text" TEXT, "_tail" TEXT, "class" TEXT, "href" TEXT, "id" TEXT, "style" TEXT,
"type" TEXT, "src" TEXT, "rel" TEXT, "target" TEXT, "title" TEXT, "content" TEXT,
"alt" TEXT, "media" TEXT, "name" TEXT, "align" TEXT, "property" TEXT, "role" TEXT,
"value" TEXT, "data-shared" TEXT, "hreflang" TEXT, "for" TEXT, "aria-current" TEXT,
"colspan" TEXT, "method" TEXT, "action" TEXT, "sizes" TEXT, "placeholder" TEXT,
"height" TEXT, "width" TEXT, "http-equiv" TEXT, "autocomplete" TEXT,
"data-layout" TEXT, "data-orig-file" TEXT, "data-href" TEXT, "lang" TEXT,
"data-image-title" TEXT, "data-recalc-dims" TEXT, "data-attachment-id" TEXT,
"data-text" TEXT, "data-flxmap" TEXT, "size" TEXT, "data-image-description" TEXT,
"data-medium-file" TEXT, "async" TEXT, "language" TEXT, "srcset" TEXT,
"data-comments-opened" TEXT, "data-large-file" TEXT, "data-via" TEXT, "defer" TEXT,
"data-permalink" TEXT, "data-image-meta" TEXT, "data-noptimize" TEXT,
"data-url" TEXT, "data-orig-size" TEXT );
CREATE INDEX "ix_tb_html_index"ON "tb_html" ("index");
sqlite> SELECT src FROM tb_html WHERE _tag=="img" AND src like "%acebook.png";
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
sqlite> SELECT href FROM tb_html WHERE _tag=="link" AND href like "%.com";
//s0.wp.com
//c0.wp.com
//i0.wp.com
//i1.wp.com
//i2.wp.com
sqlite>
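The same kind of query can be issued from Python with the standard `sqlite3` module. To stay self-contained, the sketch below builds a tiny in-memory table with only three of the columns from the schema above and made-up rows (the `example.org` and `example.com` values are placeholders, not data from `opis.html`). The helper name `select_column` is hypothetical; against a real export you would connect to `./opis.html.sqlite.db` instead of `:memory:`.

```python
import sqlite3

# In-memory stand-in for the exported database (hypothetical sample rows).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE tb_html ("_tag" TEXT, "src" TEXT, "href" TEXT)')
conn.executemany(
    "INSERT INTO tb_html VALUES (?, ?, ?)",
    [
        ("img", "https://example.org/images/Facebook.png", None),
        ("img", "https://example.org/images/Twitter.png", None),
        ("link", None, "//s0.example.com"),
    ],
)

def select_column(conn, column, tag, pattern):
    """Return values of `column` for rows with the given tag matching a LIKE pattern."""
    # The column name is interpolated into the SQL, so restrict it to known names.
    if column not in {"src", "href"}:
        raise ValueError(f"unsupported column: {column}")
    sql = f'SELECT "{column}" FROM tb_html WHERE "_tag" = ? AND "{column}" LIKE ?'
    return [row[0] for row in conn.execute(sql, (tag, pattern))]

facebook = select_column(conn, "src", "img", "%acebook.png")
links = select_column(conn, "href", "link", "%.com")
print(facebook, links)
```

Binding the tag and the pattern as parameters avoids quoting problems; only the column name needs the whitelist because SQLite cannot bind identifiers.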
html to dir
NVHTML-BENCH# mkdir TMP
NVHTML-BENCH# nvhtml_dir -input opis.html -wkdir TMP
NVHTML-BENCH# tree -fdL 4 TMP | head
TMP
└── TMP/html.0
    ├── TMP/html.0/body.1
    │   ├── TMP/html.0/body.1/<comment>.91
    │   ├── TMP/html.0/body.1/div.90
    │   │   ├── TMP/html.0/body.1/div.90/<comment>.4
    │   │   ├── TMP/html.0/body.1/div.90/<comment>.7
    │   │   ├── TMP/html.0/body.1/div.90/div.0
    │   │   ├── TMP/html.0/body.1/div.90/div.1
    │   │   ├── TMP/html.0/body.1/div.90/div.2
NVHTML-BENCH# tree -fdL 4 TMP | tail
        ├── TMP/html.0/head.0/style.45
        ├── TMP/html.0/head.0/style.55
        ├── TMP/html.0/head.0/style.56
        ├── TMP/html.0/head.0/style.57
        ├── TMP/html.0/head.0/style.58
        ├── TMP/html.0/head.0/style.78
        ├── TMP/html.0/head.0/style.79
        └── TMP/html.0/head.0/title.7

138 directories
NVHTML-BENCH# ls -l TMP/html.0/body.1/div.90/div.2
total 36
drwxr-xr-x 3 root root 4096 Aug 11 02:49 a.3
drwxr-xr-x 3 root root 4096 Aug 11 02:49 a.4
-rw-r--r-- 1 root root 7 Aug 11 02:49 attrib.class
-rw-r--r-- 1 root root 7 Aug 11 02:49 attrib.id
-rw-r--r-- 1 root root 538 Aug 11 02:49 outter_html
-rw-r--r-- 1 root root 3 Aug 11 02:49 tag
-rw-r--r-- 1 root root 1 Aug 11 02:49 tail
-rw-r--r-- 1 root root 4 Aug 11 02:49 text
-rw-r--r-- 1 root root 8 Aug 11 02:49 text_intag
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/attrib.id
srights
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/attrib.class
socials
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/outter_html
<div class="socials" id="srights">
    <a target="_blank" href="https://twitter.com/InfoOpk" class="socialicons social-Twitter external" title="Twitter">
        <img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Twitter.png"/>
    </a>
    <a target="_blank" href="https://www.facebook.com/OPK.Opistobranquis/" class="socialicons social-Facebook external" title="Facebook">
        <img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png"/>
    </a></div>
NVHTML-BENCH# ls -al TMP/html.0/body.1/div.90/div.2 | egrep " \.[a-z]"
-rw-r--r-- 1 root root 1 Aug 11 02:49 .breadth
-rw-r--r-- 1 root root 1 Aug 11 02:49 .depth
-rw-r--r-- 1 root root 27 Aug 11 02:49 .mkdir_pth
-rw-r--r-- 1 root root 2 Aug 11 02:49 .pbreadth
-rw-r--r-- 1 root root 18 Aug 11 02:49 .pl
-rw-r--r-- 1 root root 1 Aug 11 02:49 .samepl_breadth
-rw-r--r-- 1 root root 1 Aug 11 02:49 .samepl_sibseq
-rw-r--r-- 1 root root 1 Aug 11 02:49 .sibseq
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.breadth
2
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.depth
3
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.pbreadth
90
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.pl
/html/body/div/div
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.samepl_breadth
2
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.samepl_sibseq
2
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.sibseq
2
NVHTML-BENCH#
find all
NVHTML-BENCH# nvhtml_find_all -input opis.html -attrib "http-equiv"
[ 'X-UA-Compatible', 'Content-Type' ]
NVHTML-BENCH# nvhtml_find_all -input opis.html -attrib "href" | egrep "jorunna-e"
'https://opistobranquis.info/en/guia/nudibranchia/doridina/doridoidei/doridoidea/jorunna-efe/',
'https://opistobranquis.info/en/guia/nudibranchia/doridina/doridoidei/doridoidea/jorunna-evansi/',
NVHTML-BENCH# nvhtml_find_all -input opis.html
common attribs:
[ '_pl', '_breadth', '_depth', '_pbreadth', '_samepl_sibseq', '_samepl_breadth', '_tag', '_sibseq', '_text', '_tail' ]
attrib_names:frequency
{
    'class': 947, 'href': 810, 'id': 181, 'style': 80, 'type': 78, 'src': 55,
    'rel': 49, 'target': 41, 'title': 36, 'content': 23, 'alt': 19, 'media': 17,
    'name': 15, 'align': 13, 'property': 12, 'role': 9, 'value': 7, 'hreflang': 4,
    'data-shared': 4, 'colspan': 3, 'for': 3, 'aria-current': 3, 'sizes': 3,
    'action': 3, 'method': 3, 'placeholder': 2, 'width': 2, 'http-equiv': 2,
    'height': 2, 'data-permalink': 1, 'data-recalc-dims': 1, 'srcset': 1, 'size': 1,
    'data-layout': 1, 'data-orig-size': 1, 'language': 1, 'data-medium-file': 1,
    'data-href': 1, 'data-image-description': 1, 'data-image-title': 1,
    'data-orig-file': 1, 'defer': 1, 'data-flxmap': 1, 'data-noptimize': 1,
    'data-image-meta': 1, 'lang': 1, 'data-url': 1, 'data-large-file': 1,
    'autocomplete': 1, 'data-via': 1, 'async': 1, 'data-comments-opened': 1,
    'data-attachment-id': 1, 'data-text': 1
}
NVHTML-BENCH#
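An attribute-name frequency table like the one above can be approximated with the standard library's `html.parser`. This sketch is not the code behind `nvhtml_find_all`; the class name `AttribCollector` and its `frequency` and `values` fields are hypothetical names chosen for the example.

```python
from collections import Counter
from html.parser import HTMLParser

class AttribCollector(HTMLParser):
    """Count attribute names and remember the values seen for each name."""

    def __init__(self):
        super().__init__()
        self.frequency = Counter()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased
        for name, value in attrs:
            self.frequency[name] += 1
            self.values.setdefault(name, []).append(value)

html_text = (
    '<html><head>'
    '<meta http-equiv="X-UA-Compatible" content="IE=edge">'
    '<meta http-equiv="Content-Type" content="text/html">'
    '</head><body><a href="/a" class="x">a</a></body></html>'
)
collector = AttribCollector()
collector.feed(html_text)
print(collector.values["http-equiv"])
print(dict(collector.frequency))
```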
nvhtml_wfs_udlrpls
NVHTML-BENCH# nvhtml_wfs_udlrpls -input xxx.html
nvhtml_wfs_dulrpls
NVHTML-BENCH# nvhtml_wfs_dulrpls -input xxx.html
Examples
relation get
# fs, engine and utils are nvhtml modules; LXHTML is assumed to be lxml.etree.HTML
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
# pick a div element to start from
ele = engine.xpath(root,"//div",5)
# upward relations
engine.parent(ele)
engine.grand_parent(ele)
engine.ancestors(ele)
# sideways relations
engine.lsib(ele)
engine.rsib(ele)
engine.lcin(ele)
engine.rcin(ele)
engine.siblings(ele)
# downward relations
engine.descendants(ele,5,6)
# position of the element in the tree
engine.layer(ele)
engine.breadth(ele)
engine.depth(ele)
engine.pathlist(ele)
description matrix
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
wfs = engine.WFS(root)
pobj(wfs.mat[3][1])
width-first-traverse
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
pls = engine.wfspls(root)
utils.parr(pls[:10])
depth-first-traverse
import lxml.sax
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
dfs = engine.DFS()
lxml.sax.saxify(root, dfs)
utils.parr(dfs.pls[:5])
utils.parr(dfs.pls[-10:])
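The pattern in this example, a SAX content handler that records a path list for every element in depth-first order, can be tried without nvhtml or lxml. The sketch below uses the standard `xml.sax` module; `PathListHandler` is a hypothetical stand-in for illustration and makes no claim about how `engine.DFS` works internally.

```python
import xml.sax

class PathListHandler(xml.sax.ContentHandler):
    """Record the tag path of each element as SAX start events arrive."""

    def __init__(self):
        super().__init__()
        self._stack = []   # tags of the currently open elements
        self.pls = []      # one path list per element, in depth-first order

    def startElement(self, name, attrs):
        self._stack.append(name)
        self.pls.append(list(self._stack))

    def endElement(self, name):
        self._stack.pop()

handler = PathListHandler()
xml.sax.parseString(
    b"<html><head><title>t</title></head><body><div><p>a</p></div></body></html>",
    handler,
)
for pl in handler.pls:
    print("/".join(pl))
```

Because SAX delivers start events in document order, the recorded path lists come out in depth-first (pre-order) sequence without building a tree first.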
beautify
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
html_str = engine.beautify(root)
print(html_str[:480])
lvsrch
[
'a',
'abbr',
'acronym',
'address',
'applet',
'area',
'arguments',
'article',
'aside',
'audio',
'b',
'base',
'basefont',
'bdi',
'bdo',
'big',
'blockquote',
'body',
'br',
'button',
'canvas',
'caption',
'center',
'cite',
'code',
'col',
'colgroup',
'command',
'datalist',
'dd',
'del_',
'details',
'dfn',
'dialog',
'dir',
'div',
'dl',
'dt',
'elel',
'em',
'embed',
'engine',
'fieldset',
'figcaption',
'figure',
'font',
'footer',
'form',
'frame',
'frameset',
'h1',
'h2',
'h3',
'h4',
'h5',
'h6',
'head',
'header',
'hr',
'html',
'i',
'iframe',
'img',
'input',
'ins',
'isindex',
'kbd',
'keygen',
'label',
'legend',
'li',
'link',
'map',
'mark',
'menu',
'menuitem',
'meta',
'meter',
'nav',
'noframes',
'noscript',
'object',
'ol',
'optgroup',
'option',
'output',
'p',
'param',
'pre',
'progress',
'q',
'rp',
'rt',
'ruby',
's',
'samp',
'script',
'section',
'select',
'small',
'source',
'span',
'srch',
'strike',
'strong',
'style',
'sub',
'summary',
'sup',
'table',
'tbody',
'td',
'textarea',
'tfoot',
'th',
'thead',
'time',
'title',
'tr',
'track',
'tt',
'u',
'ul',
'utils',
'var',
'video',
'wbr',
'xmp'
]
engine
[
'BEAUTIFY',
'ContentHandler',
'DFS',
'WFS',
'ancestor',
'ancestors',
'beautify',
'between_levels_cond_func',
'breadth',
'child',
'children',
'copy',
'default_wfs_handler',
'depth',
'descendants',
'descendants_pls',
'dfs_traverse',
'dfspls',
'disconnect',
'elel',
'extract_pls',
'following_sibs',
'grand_parent',
'html',
'init_cls_wfs_arguments',
'is_leaf',
'layer',
'layer_wfs_handler',
'lcin',
'leaf_descendants',
'leaf_descendants_pls',
'loc',
'loc2node',
'lsib',
'lxe',
'lxml',
'nonleaf_descendants',
'nonleaf_descendants_pls',
'parent',
'pathlist',
'plget',
'preceding_sibs',
'rcin',
're',
'rootnode',
'rsib',
'samepl_breadth',
'samepl_siblings',
'samepl_sibseq',
'siblings',
'sibseq',
'source',
'text_intag',
'txtize',
'utils',
'wfs_traverse',
'wfspls',
'xpath',
'xpath_levels'
]
History
0.0.1 (2018-12-18)
first commit
Source Distribution
File details
Details for the file nvhtml-0.0.42.tar.gz
File metadata
- Download URL: nvhtml-0.0.42.tar.gz
- Upload date:
- Size: 292.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | a87dcc5bd7cd4ea4b19e2e7acb784eedac6d1a7937a83081c80f51f792ae5893
MD5 | 08a99ca2eaeb66839f222939178291a7
BLAKE2b-256 | 434b2396a004b42f2e3ce78f1b57fa116b91e7368263a8cc1de8786f208f36b7