matchtpl

Template matching to extract data from XML or HTML

Project description

A Python library that matches XML/HTML sources against a pre-defined template and extracts data from them. It provides a convenient, code-free way to process data, especially web pages.

The features of matchtpl are summarized as follows:

Easy to use. The goal is to ease developers' text-data processing work. Only basic knowledge of CSS selectors, as used in jQuery (a popular JavaScript DOM-manipulation library), is assumed. The user only needs to provide an XML template that says how to extract information and what the expected output is; matchtpl does the rest.
User-friendly. The toolkit does not require coding in Python. Even for sophisticated work, matchtpl takes over the dirty work: parsing the HTML file, extracting useful information, organizing data into the preferred data structures, and serializing it to a string (plaintext), JSON, YAML, or Python built-in structures (the default).
Extensibility. Currently, it supports three basic types of data structures: (1) string; (2) array; (3) map. Combining them meets the requirements in most cases. In addition, users can provide UDFs (user-defined functions) to customize processing in their own way.

The fundamental philosophy of matchtpl is:

Neat: keep it clean and hide the dirty work.
Simple: everything is configurable, declarative, and intuitive. (Complex control-flow syntax such as if/for/while is avoided.)
Extensible: leave the imagination to the user; any idea can be integrated quickly.

Installation

You can install the latest package from source (clone the repository, or download and unzip it from GitHub):

$ git clone https://github.com/bolitt/py-matchtpl.git

$ cd py-matchtpl

$ python setup.py install

Or use easy_install or pip:

$ easy_install matchtpl

# alternatively, install with pip

$ pip install matchtpl

Basic Data Structures

string: <s></s>. The typical atomic structure. It can be post-processed and converted into other types, such as int or float.
array: <array></array>. An ordered list of data, also known as a list. Items are retrieved by index: array[0].
map: <map></map>. A key-value structure, also known as a hash or table. Values are retrieved by key, map['name'], or as a property, map.name.

We believe most data fits into these data structures or combinations of them.
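As a rough illustration of how the three structures combine, here is a minimal sketch in plain Python. It does not use matchtpl; the AttrDict helper and the sample values are hypothetical and only mimic the key-like and property-like access described above.

```python
class AttrDict(dict):
    """A dict whose keys can also be read as attributes (hypothetical helper)."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# string: an atom that can be converted to another type afterwards
review_count = int("25")

# array: an ordered list, retrieved by index
titles = ["first title", "second title"]

# map: key-value pairs, retrieved by key or as a property
item = AttrDict(name="example", price="1.99")

# combination: an array of maps, the shape used for a list of products
products = [item, AttrDict(name="other", price="2.50")]

print(titles[0], item["name"], item.name, float(products[1].price))
```

An array of maps is the combination used in the Quick Start template, where each product entry becomes one map.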

Keywords & Elements

Here are the typical keywords:

select: select the target element(s) from the document.
- selector_string (string): CSS3 selector that chooses the target.
get: get the inner text or HTML of the target DOM element.
- type (string): "text" | "html".
eval: evaluate locally using Python syntax. (Often used to call a jQuery-like API.)
- script_text (string): script written in Python syntax.
default: default value used if nothing is found.
- value (string): default value.
as: output format, in a human-readable form.
- type (string): str | json | yaml. If not provided, Python built-in data structures are returned.
encoding: set the decoder for the data source.
- encode_type (string): such as UTF-8 (default), GBK/GB2312 (some Chinese websites), UTF-16, etc.

(Keywords are not limited to those above.)
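To show why the encoding and default keywords matter, here is a small standalone sketch in plain Python. It is not matchtpl's implementation; decode_source and value_or_default are hypothetical helper names, and treating an empty string like a missing value is an assumption of this sketch.

```python
def decode_source(raw_bytes, encoding="UTF-8"):
    """Decode raw page bytes with the given codec (UTF-8 when none is given)."""
    return raw_bytes.decode(encoding)


def value_or_default(value, default):
    """Return value unless it is None or empty; otherwise return the default."""
    return value if value else default


# Bytes as a GBK-encoded Chinese page would deliver them.
raw = "价格".encode("gbk")

print(decode_source(raw, "GBK"))

try:
    decode_source(raw)  # falls back to UTF-8, which cannot decode these bytes
except UnicodeDecodeError:
    print("not valid UTF-8; this source needs the GBK decoder")

# A missing review count falls back to the default value.
print(value_or_default(None, "0"))
```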

And extensible elements are:

Structure elements: <s></s>, <array></array>, <map></map> (see above).
Root element: <root></root>. Acts as the serialization class and provides multiple output formats for the result.
Customized element: <action></action>, where action can be any other non-conflicting tag name. The action is a custom function provided by the user when calling parser.parse(…, {'action': some_function}).
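The idea of mapping a tag name to a user function can be sketched in plain Python as follows. This is only an illustration of the dispatch pattern: the arguments that matchtpl actually passes to a custom function are not documented here, so apply_action, to_upper, and strip_currency are hypothetical names with an assumed text-in, text-out signature.

```python
def to_upper(text):
    """Example user-defined function: upper-case the extracted text."""
    return text.upper()


def strip_currency(text):
    """Example user-defined function: drop a leading currency sign."""
    return text.lstrip("￥$ ")


def apply_action(tag, text, actions):
    """Run the user function registered for `tag`, or return text unchanged."""
    func = actions.get(tag)
    return func(text) if func is not None else text


# Tag names map to user functions, like the dict passed to parser.parse.
actions = {"upper": to_upper, "price": strip_currency}

print(apply_action("upper", "kindle", actions))
print(apply_action("price", "￥1.99", actions))
print(apply_action("unknown", "as-is", actions))
```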

Quick Start

This example shows how to extract data from an HTML source. Matchtpl provides an easy way to parse your HTML file and format the output. It is a real case: extracting product information from an amazon.com web page.

Python Code

In python, typical usage often looks like this:

#!/usr/bin/env python

from matchtpl import MTemplateEnv, MTemplate, MTemplateParser

if __name__ == '__main__':
    # initialize the environment with the XML template file
    env = MTemplateEnv(template='tpl_amazon.xml')

    # build the template from the environment
    tpl = MTemplate()
    tpl.build(env)

    # initialize the parser, parse the HTML file, and show the result
    parser = MTemplateParser(tpl)
    results = parser.parse('amazon.html')
    print(results)

Configurable Template

The pre-defined template is written in XML and acts as a config file that describes the meta information of the target (usually another HTML/XML file or stream). The parser then uses the template to guide its processing and output the result:

<!-- serialize the result as json (other formats are also supported) -->
<root as="json">
    <!-- the entries are the elements whose IDs start with 'result_',
         and each entry is a map -->
    <array select="div[id^='result_']" >
        <map>
            <!-- title: get internal text as result -->
            <s key="title" select="h3 span.lrg" get="text" />
            <s key="info" select="h3 span.med" get="text" />
            <!-- image: get src link in jquery-like way -->
            <s key="image" select="div.image img.productImage" eval="attr('src')" />
            <!-- price: the jQuery-style :eq() pseudo-class picks the first match -->
            <s key="price" select="li.newp span:eq(0)" get="text" />
            <!-- review: default value is enabled -->
            <s key="review" select="span.asinReviewsSummary a" eval="attr('alt')" default='0' />
        </map>
    </array>
</root>

After execution, the output is organized as json:

[
    [
        {
            "image": "http://ec4.images-amazon.com/images/I/516Vhic-I9L._AA160_.jpg",
            "info": "刘亚莉 广东省出版集团，广东经济出版社  (2011-05) - Kindle电子书",
            "price": "￥1.99",
            "review": "平均4.4 星",
            "title": "总经理财务一本通"
        },
        // up to 25 results: map
    ]
]

(At present, JSON, YAML, plaintext, and Python built-in structures are supported. More formats will be supported later.)
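Because the template is ordinary XML, its structure can be inspected with Python's standard library. The sketch below is an assumption-laden illustration rather than part of matchtpl: it parses a shortened copy of the template above with xml.etree.ElementTree and lists each element with its keyword attributes. The describe function is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the template from this section.
TEMPLATE = """
<root as="json">
    <array select="div[id^='result_']">
        <map>
            <s key="title" select="h3 span.lrg" get="text" />
            <s key="review" select="span.asinReviewsSummary a"
               eval="attr('alt')" default="0" />
        </map>
    </array>
</root>
"""


def describe(element, depth=0):
    """Return one line per template element: its tag and keyword attributes."""
    attrs = ", ".join(f"{key}={value!r}" for key, value in element.attrib.items())
    lines = ["  " * depth + f"{element.tag}: {attrs}"]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(TEMPLATE.strip())
outline = describe(root)
print("\n".join(outline))
```

The nesting of the outline (root, array, map, s) mirrors the nesting of the output: one array whose items are maps of strings.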

Future Scenarios

Possible functionalities:

Unix-like pipe: |. Chain steps so that the output of one step becomes the input of the next.
Interactive. Interact with pages, for example for automation, login, or testing.
Type-casting. Convert values into int/float, or instantiate a class directly.
Regex support (/^abcd/ABCD/g) and some basic UDFs, such as split/trim/toUpper/toLower.
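Until such features exist, type-casting and regex clean-up can be done on the parsed result in plain Python. The following is a minimal sketch under that assumption; first_number is a hypothetical helper, and the sample entry reuses the price and review values from the Quick Start output.

```python
import re


def first_number(text, default=0.0):
    """Return the first decimal number found in text as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else default


# One entry shaped like the Quick Start output.
entry = {"price": "￥1.99", "review": "平均4.4 星", "title": "  总经理财务一本通  "}

cleaned = {
    "price": first_number(entry["price"]),    # type-cast to float
    "review": first_number(entry["review"]),  # regex extraction, then cast
    "title": entry["title"].strip(),          # the "trim" UDF
}
print(cleaned)
```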

Project Links

Package Release: https://pypi.python.org/pypi/matchtpl

Source Code: https://github.com/bolitt/py-matchtpl.git

Contributors

v0.1 Tian Lin <bolitt@gmail.com>: initiated the project and made the alpha release of the library.

Any contributions are welcome!

See https://pypi.python.org/pypi/matchtpl for the full documentation

News

v-0.1.0.dev1, 11/8/2013 – Initial release.

v-0.1.0.dev2, 12/11/2013 – Minor changes to class interfaces.

v-0.1.0.dev3, 12/15/2013 – Clean up some dependencies and fix a setup bug.

v-0.1.2, ?/?/2013 – Add the encoding keyword for the root element.

Project details

Release history

0.1.2 (this version): Jan 6, 2014
0.1.1: Dec 30, 2013
0.1.0.dev3 (pre-release): Dec 14, 2013
0.1.0.dev2 (pre-release): Dec 14, 2013
0.1.0.dev1 (pre-release): Nov 21, 2013

Download files

Source Distribution

matchtpl-0.1.2.zip (95.2 kB)

Uploaded Jan 6, 2014 Source

File details

Details for the file matchtpl-0.1.2.zip.

File metadata

Download URL: matchtpl-0.1.2.zip
Upload date: Jan 6, 2014
Size: 95.2 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for matchtpl-0.1.2.zip
Algorithm	Hash digest
SHA256	`38e74ed0cd5281c631fa26628507fd92b5d2f4de9de87cc16bdb55e5548ff7da`
MD5	`87c2668d7220a594e0c928249d29f53a`
BLAKE2b-256	`329f1f968e3a02db780cd5a2cb27cbfee922fb4c88dd066663682bbe2761e459`
