A Python library for manipulating HTML

nvhtml PyPi Home Page

ReadMe

Installation

$ pip3 install nvhtml

Documentation

License

  • MIT

Quickstart

Features

  • search html via tags-pathlist

  • search html via breadth,depth

  • search html via tag

  • width-first (breadth-first) traversal of html

  • description matrix of html

  • depth-first traversal of html (sax)

  • beautify html

  • get html node via relationship:
    • ancestor, descendants, parent, grand-parent, children,
      siblings, lsib, rsib, preceding-sib, following-sib,
      lcin, rcin

References

  • elist

  • mlconjug(for docs template format)

Installation

From pip

To install nvhtml, run this command in your terminal:

$ pip3 install nvhtml

From sources

The sources for nvhtml can be downloaded from the github repo:

$ git clone git@github.com:ihgazni2/nvhtml.git

Once you have a copy of the source, you can install it with:

$ python setup.py install

Usage

Command Line

vertical show

with color
NVHTML-BENCH# nvhtml_struct_show -input disp.html
color:  True
--------------------------
|          html          |
--------------------------
| head |      body       |
--------------------------
|      | div |    div    |
--------------------------
|      |     | div | div |
--------------------------
./images/nvhtml_struct_show.0.png

no color

# if the table is too big, disable color and open the output in an editor such as Notepad++
NVHTML-BENCH# nvhtml_struct_show -input opis.html -color "no" > html_txt_tb
./images/nvhtml_struct_show.1.png

beautify

nvhtml_beauty -input opis.html
vim opis.html.out.html

# nvhtml_beauty -h
usage: nvhtml_beauty [-h] [-input INPUT_HTML_FILE] [-output OUTPUT_HTML_FILE] [-codec INPUT_CODEC]

optional arguments:
  -h,           --help                      show this help message and exit
  -input        --input_html_file           input html file name
  -output       --output_html_file          output html file name
  -codec        --input_codec               input html file codec

search with loc

NVHTML-BENCH# nvhtml_loc -h
usage: nvhtml_loc [-h] [-input INPUT_HTML_FILE] [-codec INPUT_CODEC]
                  [-loc MAT_LOC]

optional arguments:
  -h,     --help                 show this help message and exit
  -input  --input_html_file      input html file name
  -codec  --input_codec          input html file codec
  -loc    --mat_loc              depth,breadth


NVHTML-BENCH# nvhtml_loc -input opis.html -loc 11,2
<------json-----
{
 'pl':
       [
        'html',
        'body',
        'div',
        'div',
        'header',
        'div',
        'nav',
        'div',
        'ul',
        'li',
        'ul',
        'li'
       ],
 'breadth': 2,
 'depth': 11,
 'pbreadth': 2,
 'samepl_sibseq': 0,
 'samepl_breadth': 0,
 'tag': 'li',
 'sibseq': 0,
 'attrib':
           {
            'id': 'menu-item-22951',
            'class': 'menu-item menu-item-type-post_type menu-item-object-page menu-item-22951'
           },
 'text': None,
 'tail': '\n\t',
 'text_intag': ''
}
------json---->

<----------html--------------
<li id="menu-item-22951" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-22951">
    <a href="https://opistobranquis.info/en/home/presentacio/">
        <span>
            Presentation
        </span>
    </a>
</li>

<----------html--------------
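
The depth,breadth pair used by -loc can be read as a row and column in a level-by-level table of the document. The sketch below is not nvhtml's code: it uses only the standard library, and build_levels is a hypothetical helper. It assumes depth counts from 0 at the root and breadth is the position of a node within its level, which matches the sample output above (12 path entries at depth 11).

```python
import xml.etree.ElementTree as ET

def build_levels(root):
    """Group (element, pathlist) pairs by depth using a width-first walk."""
    levels = []
    current = [(root, [root.tag])]
    while current:
        levels.append(current)
        nxt = []
        for node, path in current:
            for child in node:
                nxt.append((child, path + [child.tag]))
        current = nxt
    return levels

doc = ET.fromstring(
    "<html><head><title>t</title></head>"
    "<body><div><p>a</p></div><div><p>b</p><p>c</p></div></body></html>"
)
levels = build_levels(doc)
node, pathlist = levels[3][2]  # depth 3, breadth 2
print(pathlist, node.text)
```

Here levels[depth][breadth] plays the role of the location lookup; the real tool also reports sibling sequence numbers and attributes for the node.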

tag search with depth

NVHTML-BENCH# nvhtml_tag -h
usage: nvhtml_tag [-h] [-input INPUT_HTML_FILE] [-codec INPUT_CODEC]
                  [-tag HTML_TAG] [-which WHICH_TAG]
                  [-sdepth START_LEVEL_DEPTH] [-edepth END_LEVEL_DEPTH]

optional arguments:
  -h,                     --help                show this help message and exit
  -input                  --input_html_file     input html file name
  -codec                  --input_codec         input html file codec
  -tag                    --html_tag            html tag for search
  -which                  --which_tag           sequence of tag-array
  -sdepth                 --start_level_depth   start level depth
  -edepth                 --end_level_depth     end level depth
NVHTML-BENCH#

NVHTML-BENCH# nvhtml_tag -input opis.html.out.html -tag img -sdepth 3 -edepth 6
<img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Twitter.png">
</img>

<img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png">
</img>

<img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Twitter.png">
</img>

<img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png">
</img>

<img style="float: right; display: none" class="loading" src="https://opistobranquis.info/wp-content/plugins/jetpack/modules/sharedaddy/images/loading.gif" alt="loading" width="16" height="16">
</img>

NVHTML-BENCH#
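
A tag search limited to a depth range can be approximated with a plain tree walk. This is only a sketch using the standard library; find_tag_between_depths is a hypothetical name, and treating both -sdepth and -edepth as inclusive bounds is an assumption, not something the help text states.

```python
import xml.etree.ElementTree as ET

def find_tag_between_depths(root, tag, sdepth, edepth):
    """Return elements named `tag` whose depth d satisfies sdepth <= d <= edepth."""
    hits = []
    stack = [(root, 0)]  # the root element is taken to be at depth 0
    while stack:
        node, depth = stack.pop()
        if node.tag == tag and sdepth <= depth <= edepth:
            hits.append(node)
        # push children in reverse so they are popped in document order
        for child in reversed(list(node)):
            stack.append((child, depth + 1))
    return hits

doc = ET.fromstring(
    '<html><body><img alt="top"/><div><a><img alt="deep"/></a></div></body></html>'
)
alts = [img.get("alt") for img in find_tag_between_depths(doc, "img", 3, 6)]
all_alts = [img.get("alt") for img in find_tag_between_depths(doc, "img", 0, 6)]
print(alts, all_alts)
```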

search with tags-path

NVHTML-BENCH# nvhtml_tgpth -input opis.html  -tgpth html.body.di
[
 'div',
 'div',
 'div',
 'div'
]


NVHTML-BENCH# nvhtml_tgpth -input opis.html  -tgpth html.body.div
[
 0,
 1,
 2,
 3
]

NVHTML-BENCH#
NVHTML-BENCH# nvhtml_tgpth -input opis.html  -tgpth html.body.div.3
<div id="cookie-banner">

    <div id="cookie-banner-container">

        <div class="left">
                                        Our website uses cookies. By accessing our website and
            <br>
            </br>
            agreeing to this policy, you consent to our use of cookies.
        </div>

        <div class="right">

            <a class="accept" href="#">
                ACCEPT
            </a>

            <a class="more-info" href="https://opistobranquis.info/1HWEw">
                                                        MORE INFO
            </a>

        </div>

    </div>
</div>
NVHTML-BENCH#


usage: nvhtml_tgpth [-h] [-input INPUT_HTML_FILE] [-codec INPUT_CODEC]
                    [-tgpth TAG_PATH]

optional arguments:
  -h,        --help                                  show this help message and exit
  -input     --input_html_file INPUT_HTML_FILE       input html file name
  -codec     --input_codec INPUT_CODEC               input html file codec
  -tgpth     --tag_path TAG_PATH                      html tag dot path
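
One way to picture a dotted tag path such as html.body.div.3 is as a walk down the tree by tag name, with an optional trailing number that picks one of the matches. The sketch below uses the standard library only; resolve_tag_path is a hypothetical helper, and its handling of the trailing index is an assumption based on the transcript above. The prefix listing shown for html.body.di is not covered.

```python
import xml.etree.ElementTree as ET

def resolve_tag_path(root, dotted):
    """Follow a dotted tag path from the root; a trailing integer selects one match."""
    parts = dotted.split(".")
    index = int(parts.pop()) if parts[-1].isdigit() else None
    if not parts or parts[0] != root.tag:
        return [] if index is None else None
    matches = [root]
    for tag in parts[1:]:
        # keep every child with the requested tag, in document order
        matches = [child for node in matches for child in node if child.tag == tag]
    if index is None:
        return matches
    return matches[index] if index < len(matches) else None

doc = ET.fromstring(
    '<html><body><div id="a"/><div id="b"/><p/><div id="c"/></body></html>'
)
divs = resolve_tag_path(doc, "html.body.div")
third = resolve_tag_path(doc, "html.body.div.2")
print([d.get("id") for d in divs], third.get("id"))
```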

html to db

NVHTML-BENCH# nvhtml_sqlite -input opis.html
db:  ./opis.html.sqlite.db
table:  tb_html
NVHTML-BENCH#



NVHTML-BENCH# sqlite3 opis.html.sqlite.db
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
sqlite>
sqlite> .table
tb_html
sqlite>
sqlite> .schema tb_html
CREATE TABLE IF NOT EXISTS "tb_html" (
"index" INTEGER,
  "_pl" TEXT,
  "_breadth" TEXT,
  "_depth" TEXT,
  "_pbreadth" TEXT,
  "_samepl_sibseq" TEXT,
  "_samepl_breadth" TEXT,
  "_tag" TEXT,
  "_sibseq" TEXT,
  "_text" TEXT,
  "_tail" TEXT,
  "class" TEXT,
  "href" TEXT,
  "id" TEXT,
  "style" TEXT,
  "type" TEXT,
  "src" TEXT,
  "rel" TEXT,
  "target" TEXT,
  "title" TEXT,
  "content" TEXT,
  "alt" TEXT,
  "media" TEXT,
  "name" TEXT,
  "align" TEXT,
  "property" TEXT,
  "role" TEXT,
  "value" TEXT,
  "data-shared" TEXT,
  "hreflang" TEXT,
  "for" TEXT,
  "aria-current" TEXT,
  "colspan" TEXT,
  "method" TEXT,
  "action" TEXT,
  "sizes" TEXT,
  "placeholder" TEXT,
  "height" TEXT,
  "width" TEXT,
  "http-equiv" TEXT,
  "autocomplete" TEXT,
  "data-layout" TEXT,
  "data-orig-file" TEXT,
  "data-href" TEXT,
  "lang" TEXT,
  "data-image-title" TEXT,
  "data-recalc-dims" TEXT,
  "data-attachment-id" TEXT,
  "data-text" TEXT,
  "data-flxmap" TEXT,
  "size" TEXT,
  "data-image-description" TEXT,
  "data-medium-file" TEXT,
  "async" TEXT,
  "language" TEXT,
  "srcset" TEXT,
  "data-comments-opened" TEXT,
  "data-large-file" TEXT,
  "data-via" TEXT,
  "defer" TEXT,
  "data-permalink" TEXT,
  "data-image-meta" TEXT,
  "data-noptimize" TEXT,
  "data-url" TEXT,
  "data-orig-size" TEXT
);
CREATE INDEX "ix_tb_html_index"ON "tb_html" ("index");
sqlite>


sqlite>
sqlite> SELECT src FROM tb_html WHERE  _tag=="img" AND src like "%acebook.png";
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
https://opistobranquis.info/wp-content/themes/tempera/images/socials/Facebook.png
sqlite>
sqlite> SELECT href FROM tb_html WHERE  _tag=="link" AND href like "%.com";
//s0.wp.com
//c0.wp.com
//i0.wp.com
//i1.wp.com
//i2.wp.com
sqlite>
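
The same kind of query can be run from Python with the standard sqlite3 module. The sketch below builds a small in-memory table as a stand-in for opis.html.sqlite.db so it runs on its own; the three columns and the example.org rows are made up for illustration, only the table name tb_html and the column names come from the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in; a real run would open opis.html.sqlite.db
conn.execute('CREATE TABLE tb_html ("_tag" TEXT, "src" TEXT, "href" TEXT)')
conn.executemany(
    "INSERT INTO tb_html VALUES (?, ?, ?)",
    [
        ("img", "https://example.org/img/Facebook.png", None),
        ("img", "https://example.org/img/Twitter.png", None),
        ("link", None, "//s0.wp.com"),
    ],
)

# bound parameters avoid quoting problems in tag names and patterns
rows = conn.execute(
    "SELECT src FROM tb_html WHERE _tag = ? AND src LIKE ?",
    ("img", "%acebook.png"),
).fetchall()
srcs = [row[0] for row in rows]
conn.close()
print(srcs)
```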

html to dir

NVHTML-BENCH# mkdir TMP
NVHTML-BENCH# nvhtml_dir -input opis.html -wkdir TMP

NVHTML-BENCH# tree -fdL 4 TMP | head
TMP
└── TMP/html.0
    ├── TMP/html.0/body.1
    │   ├── TMP/html.0/body.1/<comment>.91
    │   ├── TMP/html.0/body.1/div.90
    │   │   ├── TMP/html.0/body.1/div.90/<comment>.4
    │   │   ├── TMP/html.0/body.1/div.90/<comment>.7
    │   │   ├── TMP/html.0/body.1/div.90/div.0
    │   │   ├── TMP/html.0/body.1/div.90/div.1
    │   │   ├── TMP/html.0/body.1/div.90/div.2
NVHTML-BENCH#
NVHTML-BENCH# tree -fdL 4 TMP | tail
        ├── TMP/html.0/head.0/style.45
        ├── TMP/html.0/head.0/style.55
        ├── TMP/html.0/head.0/style.56
        ├── TMP/html.0/head.0/style.57
        ├── TMP/html.0/head.0/style.58
        ├── TMP/html.0/head.0/style.78
        ├── TMP/html.0/head.0/style.79
        └── TMP/html.0/head.0/title.7

138 directories

NVHTML-BENCH# ls -l TMP/html.0/body.1/div.90/div.2
total 36
drwxr-xr-x 3 root root 4096 Aug 11 02:49 a.3
drwxr-xr-x 3 root root 4096 Aug 11 02:49 a.4
-rw-r--r-- 1 root root    7 Aug 11 02:49 attrib.class
-rw-r--r-- 1 root root    7 Aug 11 02:49 attrib.id
-rw-r--r-- 1 root root  538 Aug 11 02:49 outter_html
-rw-r--r-- 1 root root    3 Aug 11 02:49 tag
-rw-r--r-- 1 root root    1 Aug 11 02:49 tail
-rw-r--r-- 1 root root    4 Aug 11 02:49 text
-rw-r--r-- 1 root root    8 Aug 11 02:49 text_intag
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/attrib.id
srights
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/attrib.class
socials
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/outter_html
<div class="socials" id="srights">
                        <a target="_blank" href="https://twitter.com/InfoOpk" class="socialicons social
-Twitter external" title="Twitter">
                                <img alt="Twitter" src="https://opistobranquis.info/wp-content/themes/t
empera/images/socials/Twitter.png"/>
                        </a>
                        <a target="_blank" href="https://www.facebook.com/OPK.Opistobranquis/" class="s
ocialicons social-Facebook external" title="Facebook">
                                <img alt="Facebook" src="https://opistobranquis.info/wp-content/themes/
tempera/images/socials/Facebook.png"/>
                        </a></div>
NVHTML-BENCH#

NVHTML-BENCH# ls -al TMP/html.0/body.1/div.90/div.2 | egrep " \.[a-z]"
-rw-r--r--  1 root root    1 Aug 11 02:49 .breadth
-rw-r--r--  1 root root    1 Aug 11 02:49 .depth
-rw-r--r--  1 root root   27 Aug 11 02:49 .mkdir_pth
-rw-r--r--  1 root root    2 Aug 11 02:49 .pbreadth
-rw-r--r--  1 root root   18 Aug 11 02:49 .pl
-rw-r--r--  1 root root    1 Aug 11 02:49 .samepl_breadth
-rw-r--r--  1 root root    1 Aug 11 02:49 .samepl_sibseq
-rw-r--r--  1 root root    1 Aug 11 02:49 .sibseq
NVHTML-BENCH#
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.breadth
2
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.depth
3
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.pbreadth
90
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.pl
/html/body/div/div
NVHTML-BENCH#
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.samepl_breadth
2
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.samepl_sibseq
2
NVHTML-BENCH# more TMP/html.0/body.1/div.90/div.2/.sibseq
2
NVHTML-BENCH#
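
Since each node becomes a directory of small files, the result can be read back with pathlib. This is a sketch, not part of nvhtml: read_node_dir is a hypothetical helper, and the temporary directory below only imitates the layout listed above (tag, attrib.*, .depth) so the example is runnable without the tool.

```python
import tempfile
from pathlib import Path

def read_node_dir(node_dir):
    """Collect the tag, depth and attributes stored as files in one node directory."""
    node_dir = Path(node_dir)
    info = {
        "tag": (node_dir / "tag").read_text().strip(),
        "depth": int((node_dir / ".depth").read_text().strip()),
        "attrib": {},
    }
    for f in sorted(node_dir.glob("attrib.*")):
        info["attrib"][f.name[len("attrib."):]] = f.read_text().strip()
    return info

with tempfile.TemporaryDirectory() as tmp:
    # imitate TMP/html.0/body.1/div.90/div.2 from the listing above
    demo = Path(tmp) / "html.0" / "body.1" / "div.90" / "div.2"
    demo.mkdir(parents=True)
    (demo / "tag").write_text("div")
    (demo / ".depth").write_text("3")
    (demo / "attrib.id").write_text("srights")
    (demo / "attrib.class").write_text("socials")
    node_info = read_node_dir(demo)

print(node_info)
```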

find all

NVHTML-BENCH# nvhtml_find_all -input opis.html -attrib "http-equiv"
[
 'X-UA-Compatible',
 'Content-Type'
]
NVHTML-BENCH#
NVHTML-BENCH# nvhtml_find_all -input opis.html -attrib "href" | egrep "jorunna-e"
 'https://opistobranquis.info/en/guia/nudibranchia/doridina/doridoidei/doridoidea/jorunna-efe/',
 'https://opistobranquis.info/en/guia/nudibranchia/doridina/doridoidei/doridoidea/jorunna-evansi/',
NVHTML-BENCH#

NVHTML-BENCH# nvhtml_find_all -input opis.html
common attribs:
[
 '_pl',
 '_breadth',
 '_depth',
 '_pbreadth',
 '_samepl_sibseq',
 '_samepl_breadth',
 '_tag',
 '_sibseq',
 '_text',
 '_tail'
]
attrib_names:frequency
{
 'class': 947,
 'href': 810,
 'id': 181,
 'style': 80,
 'type': 78,
 'src': 55,
 'rel': 49,
 'target': 41,
 'title': 36,
 'content': 23,
 'alt': 19,
 'media': 17,
 'name': 15,
 'align': 13,
 'property': 12,
 'role': 9,
 'value': 7,
 'hreflang': 4,
 'data-shared': 4,
 'colspan': 3,
 'for': 3,
 'aria-current': 3,
 'sizes': 3,
 'action': 3,
 'method': 3,
 'placeholder': 2,
 'width': 2,
 'http-equiv': 2,
 'height': 2,
 'data-permalink': 1,
 'data-recalc-dims': 1,
 'srcset': 1,
 'size': 1,
 'data-layout': 1,
 'data-orig-size': 1,
 'language': 1,
 'data-medium-file': 1,
 'data-href': 1,
 'data-image-description': 1,
 'data-image-title': 1,
 'data-orig-file': 1,
 'defer': 1,
 'data-flxmap': 1,
 'data-noptimize': 1,
 'data-image-meta': 1,
 'lang': 1,
 'data-url': 1,
 'data-large-file': 1,
 'autocomplete': 1,
 'data-via': 1,
 'async': 1,
 'data-comments-opened': 1,
 'data-attachment-id': 1,
 'data-text': 1
}
NVHTML-BENCH#
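
The attrib_names:frequency report above is essentially a count of attribute names over all elements. A minimal standard-library sketch of that idea follows; it is not nvhtml's implementation, and the sample markup is invented.

```python
from collections import Counter
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body>'
    '<a href="/x" class="k">x</a><a href="/y">y</a>'
    '<div class="k" id="m"/>'
    '</body></html>'
)

# count how many elements carry each attribute name
freq = Counter(name for el in doc.iter() for name in el.attrib)

# collect the values of one attribute, in document order
hrefs = [el.get("href") for el in doc.iter() if "href" in el.attrib]

print(freq.most_common())
print(hrefs)
```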

Examples

tagsrch

from lxml.etree import HTML as LXHTML
from lxml.etree import XML as LXML
from xdict.jprint import pdir,pobj
from nvhtml import txt
from nvhtml import lvsrch
from nvhtml import fs
from nvhtml import engine
from nvhtml import utils
import lxml.sax
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
eles = lvsrch.a(root,7,8,show=False)
print(eles[0])
print(eles[5])
eles = lvsrch.a(root,7,8,which=0)
eles = lvsrch.a(root,7,8,which=0,source=False)
./images/lvsrch.a.0.png

relation get

html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
ele =  engine.xpath(root,"//div",5)

engine.parent(ele)
engine.grand_parent(ele)
engine.ancestors(ele)
engine.lsib(ele)
engine.rsib(ele)
engine.lcin(ele)
engine.rcin(ele)
engine.siblings(ele)
engine.descendants(ele,5,6)

engine.layer(ele)
engine.breadth(ele)
engine.depth(ele)
engine.pathlist(ele)
./images/engine.0.png
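
For readers without lxml at hand, a few of these relationships can be imitated with the standard library. The sketch below is not the engine module: the function names only mirror it, and reading lsib/rsib as the immediate left/right sibling is an assumption. ElementTree elements have no parent pointer, so a parent map is built first.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<ul><li>a</li><li>b</li><li>c</li></ul>")

# map every child element to its parent
parent_of = {child: parent for parent in doc.iter() for child in parent}

def parent(node):
    return parent_of.get(node)

def ancestors(node):
    """Return the chain of parents, nearest first."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain

def lsib(node):
    """Immediate left sibling, or None for a first child or the root."""
    p = parent_of.get(node)
    if p is None:
        return None
    sibs = list(p)
    i = sibs.index(node)
    return sibs[i - 1] if i > 0 else None

def rsib(node):
    """Immediate right sibling, or None for a last child or the root."""
    p = parent_of.get(node)
    if p is None:
        return None
    sibs = list(p)
    i = sibs.index(node)
    return sibs[i + 1] if i + 1 < len(sibs) else None

middle = doc[1]
print(parent(middle).tag, lsib(middle).text, rsib(middle).text)
```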

description matrix

html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
wfs = engine.WFS(root)
pobj(wfs.mat[3][1])
./images/engine.1.png

width-first-traverse

html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
pls = engine.wfspls(root)
utils.parr(pls[:10])
./images/engine.2.png

depth-first-traverse

import lxml.sax
html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
dfs = engine.DFS()
lxml.sax.saxify(root, dfs)
utils.parr(dfs.pls[:5])
utils.parr(dfs.pls[-10:])
./images/engine.3.png
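
The SAX approach visits elements in document order, which is a depth-first walk. The sketch below shows the idea with the standard xml.sax module instead of lxml.sax; PathCollector is a hypothetical handler, not engine.DFS, and it records only the tag path of each element as it opens.

```python
import xml.sax

class PathCollector(xml.sax.ContentHandler):
    """Record the tag path of every element in the order the parser opens it."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.pls = []

    def startElement(self, name, attrs):
        self._stack.append(name)
        self.pls.append(list(self._stack))

    def endElement(self, name):
        self._stack.pop()

collector = PathCollector()
xml.sax.parseString(
    b"<html><head><title>t</title></head><body><p>x</p></body></html>",
    collector,
)
for pl in collector.pls:
    print(pl)
```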

beautify

html_str = fs.rfile("./test.html")
root = LXHTML(html_str)
html_str = engine.beautify(root)
print(html_str[:480])
./images/engine.4.png
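
As a rough point of comparison, well-formed markup can be re-indented with the standard library alone. This sketch uses xml.dom.minidom rather than engine.beautify, so it only handles XML-clean input and its output format will differ from nvhtml's.

```python
from xml.dom import minidom

source = "<div><p>hi</p><p>there</p></div>"
dom = minidom.parseString(source)

# pretty-print from the root element to skip the XML declaration
pretty = dom.documentElement.toprettyxml(indent="    ")
lines = pretty.splitlines()
print(pretty)
```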

lvsrch

[
 'a',
 'abbr',
 'acronym',
 'address',
 'applet',
 'area',
 'arguments',
 'article',
 'aside',
 'audio',
 'b',
 'base',
 'basefont',
 'bdi',
 'bdo',
 'big',
 'blockquote',
 'body',
 'br',
 'button',
 'canvas',
 'caption',
 'center',
 'cite',
 'code',
 'col',
 'colgroup',
 'command',
 'datalist',
 'dd',
 'del_',
 'details',
 'dfn',
 'dialog',
 'dir',
 'div',
 'dl',
 'dt',
 'elel',
 'em',
 'embed',
 'engine',
 'fieldset',
 'figcaption',
 'figure',
 'font',
 'footer',
 'form',
 'frame',
 'frameset',
 'h1',
 'h2',
 'h3',
 'h4',
 'h5',
 'h6',
 'head',
 'header',
 'hr',
 'html',
 'i',
 'iframe',
 'img',
 'input',
 'ins',
 'isindex',
 'kbd',
 'keygen',
 'label',
 'legend',
 'li',
 'link',
 'map',
 'mark',
 'menu',
 'menuitem',
 'meta',
 'meter',
 'nav',
 'noframes',
 'noscript',
 'object',
 'ol',
 'optgroup',
 'option',
 'output',
 'p',
 'param',
 'pre',
 'progress',
 'q',
 'rp',
 'rt',
 'ruby',
 's',
 'samp',
 'script',
 'section',
 'select',
 'small',
 'source',
 'span',
 'srch',
 'strike',
 'strong',
 'style',
 'sub',
 'summary',
 'sup',
 'table',
 'tbody',
 'td',
 'textarea',
 'tfoot',
 'th',
 'thead',
 'time',
 'title',
 'tr',
 'track',
 'tt',
 'u',
 'ul',
 'utils',
 'var',
 'video',
 'wbr',
 'xmp'
]

engine

[
 'BEAUTIFY',
 'ContentHandler',
 'DFS',
 'WFS',
 'ancestor',
 'ancestors',
 'beautify',
 'between_levels_cond_func',
 'breadth',
 'child',
 'children',
 'copy',
 'default_wfs_handler',
 'depth',
 'descendants',
 'descendants_pls',
 'dfs_traverse',
 'dfspls',
 'disconnect',
 'elel',
 'extract_pls',
 'following_sibs',
 'grand_parent',
 'html',
 'init_cls_wfs_arguments',
 'is_leaf',
 'layer',
 'layer_wfs_handler',
 'lcin',
 'leaf_descendants',
 'leaf_descendants_pls',
 'loc',
 'loc2node',
 'lsib',
 'lxe',
 'lxml',
 'nonleaf_descendants',
 'nonleaf_descendants_pls',
 'parent',
 'pathlist',
 'plget',
 'preceding_sibs',
 'rcin',
 're',
 'rootnode',
 'rsib',
 'samepl_breadth',
 'samepl_siblings',
 'samepl_sibseq',
 'siblings',
 'sibseq',
 'source',
 'text_intag',
 'txtize',
 'utils',
 'wfs_traverse',
 'wfspls',
 'xpath',
 'xpath_levels'
]

History

0.0.1 (2018-12-18)

  • first commit

Source Distribution

nvhtml-0.0.32.tar.gz (285.6 kB)

Uploaded Source

File details

Details for the file nvhtml-0.0.32.tar.gz.

File metadata

  • Download URL: nvhtml-0.0.32.tar.gz
  • Upload date:
  • Size: 285.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.7

File hashes

Hashes for nvhtml-0.0.32.tar.gz
Algorithm    Hash digest
SHA256       59ad74274e8b9da0b275288e05048e7af1cef720d5b173e930e717d4b2eefb93
MD5          78daa4930ff1d905f23efc7a255053b5
BLAKE2b-256  508924a105f83c5790127497c0c603ad5a90d7dd5317cf255266c451fe1b0d13
