A Python library to manipulate HTML
ReadMe
Installation
$ pip3 install nvhtml
refer to Installation
Documentation
refer to https://nvhtml.readthedocs.io
License
MIT
Quickstart
refer to Usage
Features
- search HTML via tag path list
- search HTML via breadth and depth
- search HTML via tag
- width-first (breadth-first) traversal of HTML
- description matrix of HTML
- depth-first traversal of HTML (SAX)
- beautify HTML
- get an HTML node via relationship: ancestor, descendants, parent, grand-parent, children, siblings, lsib, rsib, preceding-sib, following-sib, lcin, rcin
References
elist
mlconjug (for the docs template format)
Installation
From pip
To install nvhtml, run this command in your terminal:
$ pip3 install nvhtml
From sources
The sources for nvhtml can be downloaded from the github repo:
$ git clone git@github.com:ihgazni2/nvhtml.git
Once you have a copy of the source, you can install it with:
$ python setup.py install
Usage
Command Line
rshtml writing rules
- # (pound): comment
- - (dash): attribute
- . (dot): text and tail
- | (vbar): content of text and tail, which may span multiple lines
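The four markers above can be read as a small line classifier. The sketch below is not nvhtml's own parser: `classify_rshtml_line` is a hypothetical helper, and it assumes that a line without a marker is a tag name and that an attribute line splits into name and value at the first space.

```python
def classify_rshtml_line(line):
    """Classify one rshtml line by its leading marker (hypothetical helper)."""
    stripped = line.strip()
    if stripped.startswith("#"):
        # '#' (pound) introduces a comment node
        return ("comment", stripped[1:])
    if stripped.startswith("-"):
        # '-' (dash) introduces an attribute: name, then value after the first space
        name, _, value = stripped[1:].partition(" ")
        return ("attrib", name, value)
    if stripped.startswith("."):
        # '.' (dot) opens a text or tail section
        return ("section", stripped[1:])
    if stripped.startswith("|"):
        # '|' (vbar) carries one line of text or tail content
        return ("content", stripped[1:])
    # Assumption: anything without a marker is a tag name
    return ("tag", stripped)


for sample in ["div", "-id menu-item-27961", ".text", "|hello", "#comment"]:
    print(classify_rshtml_line(sample))
```

Because `|` lines are classified one at a time, multi-line text is simply a run of consecutive `content` results.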
rshtml
NVHTML-BENCH# cat tstrs.rshtml
html
head
meta
-http-equiv X-UA-Compatible
-content IE=edge,chrome=1
meta
-name viewport
-content user-scalable=yes, initial-scale=1.0, minimum-scale=1.0, maximum-scale=3.0
link
body
div
-id menu-item-27961
-class qtranxs-lang-menu-item menu-item-object-custom
.text
|hello
|hihihi
.tail
|this is a tail
li
div
li
div
#comment
.text
|this is acomment
NVHTML-BENCH#
NVHTML-BENCH# nvhtml_rsh2html -input tstrs.rshtml
NVHTML-BENCH# ls -l | egrep tstrs.rshtml
-rw-r--r-- 1 root root 597 Aug 19 11:36 tstrs.rshtml
-rw-r--r-- 1 root root 658 Aug 19 11:56 tstrs.rshtml.html
NVHTML-BENCH# cat tstrs.rshtml.html
<html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
</meta>
<meta name="viewport" content="user-scalable=yes, initial-scale=1.0, minimum-scale=1.0, maximum-scale=3.0">
</meta>
<link>
</link>
</head>
<body>
<div id="menu-item-27961" class="qtranxs-lang-menu-item menu-item-object-custom">
hello
hihihi
<li>
</li>
</div>
this is a tail
<div>
<li>
</li>
</div>
<div>
</div>
</body>
<!--
this is acomment
-->
</html>
NVHTML-BENCH#
vice versa (html to rshtml)
NVHTML-BENCH# nvhtml_html2rsh -input tstrs.html
vertical show
with color
NVHTML-BENCH# nvhtml_struct_show -input disp.html
color: True
--------------------------
| html |
--------------------------
| head | body |
--------------------------
| | div | div |
--------------------------
| | | div | div |
--------------------------
no color
# if the output is too big, disable color and open it in an editor such as Notepad++
NVHTML-BENCH# nvhtml_struct_show -input opis.html -color "no" > html_txt_tb
nvrsh_struct_show
TEST# nvrsh_struct_show -input jobj2.rshtml
color: True
-------------------------------
| html |
-------------------------------
| body |
-------------------------------
| im-dict-root |
-------------------------------
| l | t | s |
-------------------------------
| o0 | o1 | o0 | o1 | u0 | u1 |
-------------------------------
| a0 | a1 | t0 | t1 | s0 | s1 |
-------------------------------
NVHTML# cat TEST/jobj2.rshtml
im-dict-root
l
o0
a0
o1
a1
t
o0
t0
o1
t1
s
u0
s0
u1
s1
NVHTML#
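The table printed by nvrsh_struct_show lists the tags found at each depth, one row per depth. A minimal sketch of that idea with the standard library follows; it is not how nvhtml builds the table. `tag_rows` is a hypothetical helper, and the nesting in `DOC` is inferred from the table above.

```python
import xml.etree.ElementTree as ET

# Assumed nesting, inferred from the rows of the printed table.
DOC = """
<html><body><im-dict-root>
  <l><o0><a0/></o0><o1><a1/></o1></l>
  <t><o0><t0/></o0><o1><t1/></o1></t>
  <s><u0><s0/></u0><u1><s1/></u1></s>
</im-dict-root></body></html>
"""


def tag_rows(root):
    """Return one list of tag names per depth, in left-to-right order."""
    rows = []
    level = [root]
    while level:
        rows.append([el.tag for el in level])
        # The next level is every child of every element on this level.
        level = [child for el in level for child in el]
    return rows


for row in tag_rows(ET.fromstring(DOC.strip())):
    print("| " + " | ".join(row) + " |")
```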
beautify
nvhtml_beauty -input opis.html
vim opis.html.out.html
# nvhtml_beauty -h
usage: nvhtml_beauty [-h] [-input INPUT_HTML_FILE] [-output OUTPUT_HTML_FILE]
[-codec INPUT_CODEC]
optional arguments:
-h, --help show this help message and exit
-input --input_html_file input html file name
-output --output_html_file output html file name
-codec --input_codec input html file codec
search with loc
NVHTML-BENCH# nvhtml_loc -h
usage: nvhtml_loc [-h] [-input INPUT_HTML_FILE] [-codec INPUT_CODEC]
[-loc MAT_LOC]
optional arguments:
-h, --help show this help message and exit
-input --input_html_file input html file name
-codec --input_codec input html file codec
-loc --mat_loc depth,breadth
NVHTML-BENCH# nvhtml_loc -input opis.html -loc 11,2
<------json-----
{
'pl':
[
'html',
'body',
'div',
'div',
'header',
'div',
'nav',
'div',
'ul',
'li',
'ul',
'li'
],
'breadth': 2,
'depth': 11,
'pbreadth': 2,
'samepl_sibseq': 0,
'samepl_breadth': 0,
'tag': 'li',
'sibseq': 0,
'attrib':
{
'id': 'menu-item-22951',
'class': 'menu-item menu-item-type-post_type menu-item-object-page menu-item-22951'
},
'text': None,
'tail': '\n\t',
'text_intag': ''
}
------json---->
<----------html--------------
<li id="menu-item-22951" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-22951">
<a href="https://opistobranquis.info/en/home/presentacio/">
<span>
Presentation
</span>
</a>
</li>
<----------html--------------
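The `-loc depth,breadth` lookup can be pictured as walking the tree one level at a time and indexing into the target level. The sketch below uses the standard library rather than nvhtml; `locate` is a hypothetical helper. It assumes the root sits at depth 0 and that breadth is a 0-based position counted left to right across the whole level, which matches the 12-entry `pl` at depth 11 in the output above. It ignores comment nodes, which nvhtml may count.

```python
import xml.etree.ElementTree as ET


def locate(root, depth, breadth):
    """Return a small description of the node at (depth, breadth)."""
    # Each entry pairs an element with the tag path leading to it.
    level = [(root, [root.tag])]
    for _ in range(depth):
        level = [(child, pl + [child.tag]) for el, pl in level for child in el]
    el, pl = level[breadth]  # raises IndexError if the location does not exist
    return {
        "pl": pl,
        "depth": depth,
        "breadth": breadth,
        "tag": el.tag,
        "attrib": dict(el.attrib),
    }


loc_doc = ET.fromstring(
    "<html><head><title>t</title></head>"
    "<body><div id='a'/><div id='b'/></body></html>"
)
print(locate(loc_doc, 2, 2))
```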
tag search with depth
NVHTML-BENCH# nvhtml_tag -h
usage: nvhtml_tag [-h] [-input INPUT_HTML_FILE] [-codec INPUT_CODEC]
[-tag HTML_TAG] [-which WHICH_TAG]
[-sdepth START_LEVEL_DEPTH] [-edepth END_LEVEL_DEPTH]
optional arguments:
-h, --help show this help message and exit
-input --input_html_file input html file name
-codec --input_codec input html file codec
-tag --html_tag html tag for search
-which --which_tag sequence of tag-array
-sdepth --start_level_depth start level depth
-edepth --end_level_depth end level depth
NVHTML-BENCH#
NVHTML-BENCH# nvhtml_tag -input opis.html.out.html -tag img -sdepth 3 -edepth 6
<img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Twitter.png">
</img>
<img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png">
</img>
<img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Twitter.png">
</img>
<img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png">
</img>
<img style="float: right; display: none" class="loading" src="https://opistobranquis.info/wp-content/plugins/jetpack/modules/sharedaddy/images/loading.gif" alt="loading" width="16" height="16">
</img>
NVHTML-BENCH#
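A tag search restricted to a depth range can be sketched as a recursive walk that tracks the current depth. This is an illustration with the standard library, not nvhtml's implementation; `find_tag` is a hypothetical helper, and treating both `sdepth` and `edepth` as inclusive with the root at depth 0 is an assumption.

```python
import xml.etree.ElementTree as ET


def find_tag(root, tag, sdepth, edepth):
    """Collect elements named `tag` whose depth lies in [sdepth, edepth]."""
    found = []

    def walk(el, depth):
        # Assumption: both bounds are inclusive.
        if el.tag == tag and sdepth <= depth <= edepth:
            found.append(el)
        for child in el:
            walk(child, depth + 1)

    walk(root, 0)
    return found


tag_doc = ET.fromstring(
    "<html><body><img src='a'/>"
    "<div><img src='b'/><p><img src='c'/></p></div>"
    "</body></html>"
)
print([el.get("src") for el in find_tag(tag_doc, "img", 3, 4)])
```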
html to db
NVHTML-BENCH# nvhtml_sqlite -input opis.html
db: ./opis.html.sqlite.db
table: tb_html
NVHTML-BENCH#
NVHTML-BENCH# sqlite3 opis.html.sqlite.db
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
sqlite>
sqlite> .table
tb_html
sqlite>
sqlite> .schema tb_html
CREATE TABLE IF NOT EXISTS "tb_html" (
"index" INTEGER,
"_pl" TEXT,
"_breadth" TEXT,
"_depth" TEXT,
"_pbreadth" TEXT,
"_samepl_sibseq" TEXT,
"_samepl_breadth" TEXT,
"_tag" TEXT,
"_sibseq" TEXT,
"_text" TEXT,
"_tail" TEXT,
"class" TEXT,
"href" TEXT,
"id" TEXT,
"style" TEXT,
"type" TEXT,
"src" TEXT,
"rel" TEXT,
"target" TEXT,
"title" TEXT,
"content" TEXT,
"alt" TEXT,
"media" TEXT,
"name" TEXT,
"align" TEXT,
"property" TEXT,
"role" TEXT,
"value" TEXT,
"data-shared" TEXT,
"hreflang" TEXT,
"for" TEXT,
"aria-current" TEXT,
"colspan" TEXT,
"method" TEXT,
"action" TEXT,
"sizes" TEXT,
"placeholder" TEXT,
"height" TEXT,
"width" TEXT,
"http-equiv" TEXT,
"autocomplete" TEXT,
"data-layout" TEXT,
"data-orig-file" TEXT,
"data-href" TEXT,
"lang" TEXT,
"data-image-title" TEXT,
"data-recalc-dims" TEXT,
"data-attachment-id" TEXT,
"data-text" TEXT,
"data-flxmap" TEXT,
"size" TEXT,
"data-image-description" TEXT,
"data-medium-file" TEXT,
"async" TEXT,
"language" TEXT,
"srcset" TEXT,
"data-comments-opened" TEXT,
"data-large-file" TEXT,
"data-via" TEXT,
"defer" TEXT,
"data-permalink" TEXT,
"data-image-meta" TEXT,
"data-noptimize" TEXT,
"data-url" TEXT,
"data-orig-size" TEXT
);
CREATE INDEX "ix_tb_html_index"ON "tb_html" ("index");
sqlite>
sqlite>
sqlite> SELECT src FROM tb_html WHERE _tag=="img" AND src like "%acebook.png";
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
sqlite>
sqlite>
sqlite>
sqlite>
sqlite> SELECT href FROM tb_html WHERE _tag=="link" AND href like "%.com";
//s0.wp.com
//c0.wp.com
//i0.wp.com
//i1.wp.com
//i2.wp.com
sqlite>
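The same queries can be run from Python with the standard `sqlite3` module. The sketch below builds a tiny in-memory stand-in for `tb_html` so that it runs on its own; the rows and the `column_where` helper are made up for illustration. To query a real export, connect to the generated `.sqlite.db` file instead of `:memory:`.

```python
import sqlite3

# In-memory stand-in with a few of the tb_html columns (illustrative rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE tb_html ("index" INTEGER, "_tag" TEXT, "src" TEXT, "href" TEXT)'
)
conn.executemany(
    "INSERT INTO tb_html VALUES (?, ?, ?, ?)",
    [
        (0, "img", "https://example.org/socials/Facebook.png", None),
        (1, "img", "https://example.org/socials/Twitter.png", None),
        (2, "link", None, "//s0.wp.com"),
    ],
)


def column_where(conn, column, tag, pattern):
    """Return `column` for rows with the given tag whose value matches LIKE."""
    # Column names cannot be bound as parameters, so only known names pass.
    if column not in {"src", "href"}:
        raise ValueError("unsupported column: " + column)
    sql = f'SELECT "{column}" FROM tb_html WHERE _tag = ? AND "{column}" LIKE ?'
    return [row[0] for row in conn.execute(sql, (tag, pattern))]


print(column_where(conn, "src", "img", "%acebook.png"))
print(column_where(conn, "href", "link", "%.com"))
```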
html to dir
NVHTML-BENCH# mkdir TMP
NVHTML-BENCH# nvhtml_dir -input opis.html -wkdir TMP
NVHTML-BENCH# tree -fdL 4 TMP | head
TMP
└── TMP/html.0
├── TMP/html.0/body.1
│ ├── TMP/html.0/body.1/<comment>.91
│ ├── TMP/html.0/body.1/div.90
│ │ ├── TMP/html.0/body.1/div.90/<comment>.4
│ │ ├── TMP/html.0/body.1/div.90/<comment>.7
│ │ ├── TMP/html.0/body.1/div.90/div.0
│ │ ├── TMP/html.0/body.1/div.90/div.1
│ │ ├── TMP/html.0/body.1/div.90/div.2
NVHTML-BENCH#
NVHTML-BENCH# tree -fdL 4 TMP | tail
├── TMP/html.0/head.0/style.45
├── TMP/html.0/head.0/style.55
├── TMP/html.0/head.0/style.56
├── TMP/html.0/head.0/style.57
├── TMP/html.0/head.0/style.58
├── TMP/html.0/head.0/style.78
├── TMP/html.0/head.0/style.79
└── TMP/html.0/head.0/title.7
138 directories
NVHTML-BENCH# ls -l TMP/html.0/body.1/div.90/div.2
total 36
drwxr-xr-x 3 root root 4096 Aug 11 02:49 a.3
drwxr-xr-x 3 root root 4096 Aug 11 02:49 a.4
-rw-r--r-- 1 root root 7 Aug 11 02:49 attrib.class
-rw-r--r-- 1 root root 7 Aug 11 02:49 attrib.id
-rw-r--r-- 1 root root 538 Aug 11 02:49 outter_html
-rw-r--r-- 1 root root 3 Aug 11 02:49 tag
-rw-r--r-- 1 root root 1 Aug 11 02:49 tail
-rw-r--r-- 1 root root 4 Aug 11 02:49 text
-rw-r--r-- 1 root root 8 Aug 11 02:49 text_intag
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/attrib.id
srights
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/attrib.class
socials
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/outter_html
<div class="socials" id="srights">
<a target="_blank" href="https://twitter.com/InfoOpk" class="socialicons social
-Twitter external" title="Twitter">
<img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/t
empera/images/socials/Twitter.png"/>
</a>
<a target="_blank" href="https://www.facebook.com/OPK.Opistobranquis/" class="s
ocialicons social-Facebook external" title="Facebook">
<img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/
tempera/images/socials/Facebook.png"/>
</a></div>
NVHTML-BENCH#
NVHTML-BENCH# ls -al TMP/html.0/body.1/div.90/div.2 | egrep " \.[a-z]"
-rw-r--r-- 1 root root 1 Aug 11 02:49 .breadth
-rw-r--r-- 1 root root 1 Aug 11 02:49 .depth
-rw-r--r-- 1 root root 27 Aug 11 02:49 .mkdir_pth
-rw-r--r-- 1 root root 2 Aug 11 02:49 .pbreadth
-rw-r--r-- 1 root root 18 Aug 11 02:49 .pl
-rw-r--r-- 1 root root 1 Aug 11 02:49 .samepl_breadth
-rw-r--r-- 1 root root 1 Aug 11 02:49 .samepl_sibseq
-rw-r--r-- 1 root root 1 Aug 11 02:49 .sibseq
NVHTML-BENCH#
NVHTML-BENCH#
NVHTML-BENCH#
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.breadth
2
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.depth
3
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.pbreadth
90
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.pl
/html/body/div/div
NVHTML-BENCH#
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.samepl_breadth
2
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.samepl_sibseq
2
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.sibseq
2
NVHTML-BENCH#
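Each node directory holds plain files (`tag`, `text`, `tail`, `attrib.<name>`, and hidden files such as `.depth`), so a node can be read back with ordinary file I/O. The sketch below is a hypothetical reader, not part of nvhtml. It builds a small temporary directory shaped like the listing above so that it runs on its own, and it keeps every value as a string.

```python
import os
import tempfile


def read_node_dir(path):
    """Read the plain files of one node directory into a dict."""
    node = {"attrib": {}}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue  # sub-directories such as a.3 are child nodes
        with open(full, encoding="utf-8") as fh:
            value = fh.read()
        if name.startswith("attrib."):
            node["attrib"][name[len("attrib."):]] = value
        else:
            # Hidden files such as .depth are stored under the bare name.
            node[name.lstrip(".")] = value
    return node


with tempfile.TemporaryDirectory() as tmp:
    node_dir = os.path.join(tmp, "div.2")
    os.makedirs(os.path.join(node_dir, "a.3"))
    files = {
        "tag": "div",
        "attrib.id": "srights",
        "attrib.class": "socials",
        ".depth": "3",
        ".breadth": "2",
    }
    for fname, content in files.items():
        with open(os.path.join(node_dir, fname), "w", encoding="utf-8") as fh:
            fh.write(content)
    demo_node = read_node_dir(node_dir)

print(demo_node)
```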
find all
NVHTML-BENCH# nvhtml_find_all -input opis.html -attrib "http-equiv"
[
'X-UA-Compatible',
'Content-Type'
]
NVHTML-BENCH#
NVHTML-BENCH# nvhtml_find_all -input opis.html -attrib "href" | egrep "jorunna-e"
'https://opistobranquis.info/en/guia/nudibranchia/doridina/doridoidei/doridoidea/jorunna-efe/',
'https://opistobranquis.info/en/guia/nudibranchia/doridina/doridoidei/doridoidea/jorunna-evansi/',
NVHTML-BENCH#
NVHTML-BENCH# nvhtml_find_all -input opis.html
common attribs:
[
'_pl',
'_breadth',
'_depth',
'_pbreadth',
'_samepl_sibseq',
'_samepl_breadth',
'_tag',
'_sibseq',
'_text',
'_tail'
]
attrib_names:frequency
{
'class': 947,
'href': 810,
'id': 181,
'style': 80,
'type': 78,
'src': 55,
'rel': 49,
'target': 41,
'title': 36,
'content': 23,
'alt': 19,
'media': 17,
'name': 15,
'align': 13,
'property': 12,
'role': 9,
'value': 7,
'hreflang': 4,
'data-shared': 4,
'colspan': 3,
'for': 3,
'aria-current': 3,
'sizes': 3,
'action': 3,
'method': 3,
'placeholder': 2,
'width': 2,
'http-equiv': 2,
'height': 2,
'data-permalink': 1,
'data-recalc-dims': 1,
'srcset': 1,
'size': 1,
'data-layout': 1,
'data-orig-size': 1,
'language': 1,
'data-medium-file': 1,
'data-href': 1,
'data-image-description': 1,
'data-image-title': 1,
'data-orig-file': 1,
'defer': 1,
'data-flxmap': 1,
'data-noptimize': 1,
'data-image-meta': 1,
'lang': 1,
'data-url': 1,
'data-large-file': 1,
'autocomplete': 1,
'data-via': 1,
'async': 1,
'data-comments-opened': 1,
'data-attachment-id': 1,
'data-text': 1
}
NVHTML-BENCH#
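The attribute frequency table can be approximated with the standard library's `html.parser`. This is a sketch of the counting idea only, not nvhtml's code; `AttribCounter` and `count_attribs` are hypothetical names, and collecting every value per attribute is an assumption about what `-attrib` reports.

```python
from collections import Counter
from html.parser import HTMLParser


class AttribCounter(HTMLParser):
    """Count attribute names and remember the values seen for each name."""

    def __init__(self):
        super().__init__()
        self.freq = Counter()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        for name, value in attrs:
            self.freq[name] += 1
            self.values.setdefault(name, []).append(value)


def count_attribs(html):
    parser = AttribCounter()
    parser.feed(html)
    parser.close()
    return parser


SAMPLE = (
    '<html><head>'
    '<meta http-equiv="X-UA-Compatible" content="IE=edge">'
    '<meta http-equiv="Content-Type" content="text/html">'
    '</head><body><a href="/x" class="c">x</a></body></html>'
)
result = count_attribs(SAMPLE)
print(result.values["http-equiv"])
print(result.freq.most_common())
```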
nvhtml_wfs_udlrpls
NVHTML-BENCH# nvhtml_wfs_udlrpls -input xxx.html
nvhtml_wfs_dulrpls
NVHTML-BENCH# nvhtml_wfs_dulrpls -input xxx.html
Examples
relation get
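# Assumed imports: fs and engine are taken to be nvhtml submodules,
# and LXHTML an alias for lxml.etree.HTML.
from lxml.etree import HTML as LXHTML
from nvhtml import engine, fs
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
ele = engine.xpath(root,"//div",5)
engine.parent(ele)
engine.grand_parent(ele)
engine.ancestors(ele)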
engine.lsib(ele)
engine.rsib(ele)
engine.lcin(ele)
engine.rcin(ele)
engine.siblings(ele)
engine.descendants(ele,5,6)
engine.layer(ele)
engine.breadth(ele)
engine.depth(ele)
engine.pathlist(ele)
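For readers unfamiliar with the relation names, the sketch below reproduces a few of them with the standard library's ElementTree. It is an illustration, not nvhtml's code. It assumes that `lsib` and `rsib` mean the immediate left and right sibling; all helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent pointers)."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestors(el, parents):
    """Return the ancestors of el, nearest first."""
    chain = []
    while el in parents:
        el = parents[el]
        chain.append(el)
    return chain


def siblings(el, parents):
    """Return the other children of el's parent."""
    parent = parents.get(el)
    return [] if parent is None else [c for c in parent if c is not el]


def neighbour(el, parents, offset):
    """Sibling at the given offset: -1 for the left one, +1 for the right one."""
    parent = parents.get(el)
    if parent is None:
        return None
    kids = list(parent)
    i = kids.index(el) + offset
    return kids[i] if 0 <= i < len(kids) else None


rel_doc = ET.fromstring(
    "<html><body><div id='a'/><div id='b'><p/></div><div id='c'/></body></html>"
)
rel_parents = build_parent_map(rel_doc)
b = rel_doc.find(".//div[@id='b']")
print([e.tag for e in ancestors(b, rel_parents)])
print(neighbour(b, rel_parents, -1).get("id"), neighbour(b, rel_parents, +1).get("id"))
```

The parent is the first entry of `ancestors`, and the grand-parent is the second.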
description matrix
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
wfs = engine.WFS(root)
pobj(wfs.mat[3][1])
width-first-traverse
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
pls = engine.wfspls(root)
utils.parr(pls[:10])
depth-first-traverse
import lxml.sax
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
dfs = engine.DFS()
lxml.sax.saxify(root, dfs)
utils.parr(dfs.pls[:5])
utils.parr(dfs.pls[-10:])
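The event-driven style used above can be illustrated without lxml. The sketch below keeps a tag stack while `html.parser` reports start and end tags, and records one path list per element in document order. Reading `pls` as "path lists" is an assumption, and `PathCollector` is a hypothetical class, not nvhtml's `DFS` handler.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record the tag path of every element in depth-first (document) order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.pls = []

    def handle_starttag(self, tag, attrs):
        self.pls.append(self.stack + [tag])
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the end tag closes the innermost open element.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


collector = PathCollector()
collector.feed(
    "<html><head><meta charset='utf-8'><title>t</title></head>"
    "<body><div><p>x</p></div></body></html>"
)
collector.close()
for pl in collector.pls:
    print("/".join(pl))
```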
beautify
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
html_str = engine.beautify(root)
print(html_str[:480])
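For comparison, well-formed markup can be re-indented with the standard library alone. This is a different technique from `engine.beautify` and its output layout differs from nvhtml's; `beautify_xml` is a hypothetical helper, it needs Python 3.9 or later for `ET.indent`, and it only accepts input that parses as XML.

```python
import xml.etree.ElementTree as ET


def beautify_xml(markup, space="  "):
    """Re-indent well-formed markup, one nesting level per `space`."""
    root = ET.fromstring(markup)
    ET.indent(root, space=space)  # available from Python 3.9
    return ET.tostring(root, encoding="unicode")


print(beautify_xml("<div><p>hi</p><p>yo</p></div>"))
```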
lvsrch
[
'a',
'abbr',
'acronym',
'address',
'applet',
'area',
'arguments',
'article',
'aside',
'audio',
'b',
'base',
'basefont',
'bdi',
'bdo',
'big',
'blockquote',
'body',
'br',
'button',
'canvas',
'caption',
'center',
'cite',
'code',
'col',
'colgroup',
'command',
'datalist',
'dd',
'del_',
'details',
'dfn',
'dialog',
'dir',
'div',
'dl',
'dt',
'elel',
'em',
'embed',
'engine',
'fieldset',
'figcaption',
'figure',
'font',
'footer',
'form',
'frame',
'frameset',
'h1',
'h2',
'h3',
'h4',
'h5',
'h6',
'head',
'header',
'hr',
'html',
'i',
'iframe',
'img',
'input',
'ins',
'isindex',
'kbd',
'keygen',
'label',
'legend',
'li',
'link',
'map',
'mark',
'menu',
'menuitem',
'meta',
'meter',
'nav',
'noframes',
'noscript',
'object',
'ol',
'optgroup',
'option',
'output',
'p',
'param',
'pre',
'progress',
'q',
'rp',
'rt',
'ruby',
's',
'samp',
'script',
'section',
'select',
'small',
'source',
'span',
'srch',
'strike',
'strong',
'style',
'sub',
'summary',
'sup',
'table',
'tbody',
'td',
'textarea',
'tfoot',
'th',
'thead',
'time',
'title',
'tr',
'track',
'tt',
'u',
'ul',
'utils',
'var',
'video',
'wbr',
'xmp'
]
engine
[
'BEAUTIFY',
'ContentHandler',
'DFS',
'WFS',
'ancestor',
'ancestors',
'beautify',
'between_levels_cond_func',
'breadth',
'child',
'children',
'copy',
'default_wfs_handler',
'depth',
'descendants',
'descendants_pls',
'dfs_traverse',
'dfspls',
'disconnect',
'elel',
'extract_pls',
'following_sibs',
'grand_parent',
'html',
'init_cls_wfs_arguments',
'is_leaf',
'layer',
'layer_wfs_handler',
'lcin',
'leaf_descendants',
'leaf_descendants_pls',
'loc',
'loc2node',
'lsib',
'lxe',
'lxml',
'nonleaf_descendants',
'nonleaf_descendants_pls',
'parent',
'pathlist',
'plget',
'preceding_sibs',
'rcin',
're',
'rootnode',
'rsib',
'samepl_breadth',
'samepl_siblings',
'samepl_sibseq',
'siblings',
'sibseq',
'source',
'text_intag',
'txtize',
'utils',
'wfs_traverse',
'wfspls',
'xpath',
'xpath_levels'
]
History
0.0.1 (2018-12-18)
first commit