Matching template to extract data from xml or html

Project description

A Python library that matches XML/HTML sources against a pre-defined template and extracts data from them. It provides a convenient, coding-free way to process data, especially web pages.

The features of matchtpl are summarized as follows:

  • Easy to use. The goal is to ease developers' text-data processing work. Only basic knowledge of jQuery, a popular JavaScript DOM-manipulation library, is assumed (mostly its CSS selectors). The user only needs to provide an XML template that tells matchtpl how to extract information and what the expected output is; matchtpl does the rest.

  • User-friendly. The toolkit does not require coding in Python. Even for very sophisticated work, py-matchtpl can take over the dirty work, such as parsing the HTML file, extracting useful information, organizing data into preferable data structures, and serializing the result to a string (plaintext), JSON, YAML, or Python builtin structures (the default).

  • Extensibility. Currently, it supports three basic types of data structures: (1) string; (2) array; (3) map. Their combinations meet the requirements in most cases. In addition, users can provide a UDF (user-defined function) to customize processing in their own way.

The fundamental philosophy of matchtpl is:

  • Neat: keep it clean and hide the dirty things.

  • Simple: everything looks configurable, declarative and intuitive. (Avoid complex control-flow syntax: if/for/while.)

  • Extensible: leave imagination to user, and any ideas can be integrated in a rapid way.

Installation

You can install the latest package from source (clone it, or download and unzip it from GitHub):

$ git clone https://github.com/bolitt/py-matchtpl.git

$ cd py-matchtpl

$ python setup.py install

or use python easy_install or pip:

$ easy_install matchtpl

# alternatively install by pip

$ pip install matchtpl

Basic Data Structures

  1. string: <s></s>. The typical atomic structure. It can be post-processed and converted into other types, such as int or float.

  2. array: <array></array>. An ordered list of data, also known as a list. Items are retrieved by index: array[0].

  3. map: <map></map>. A key-value structure, also known as a hash or table. Values are retrieved by key, map['name'], or as a property, map.name.

We believe most data can fit into these data structures or their combinations.
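As a rough illustration of how the three structures might look once they reach Python, the sketch below uses plain builtins. `AttrMap` is a hypothetical helper written for this example, not a class taken from matchtpl; it only shows how key-style and property-style access can coexist.

```python
class AttrMap(dict):
    """Hypothetical helper: a dict that also allows map.name access."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


title = "总经理财务一本通"                      # string
entry = AttrMap(title=title, price="¥1.99")  # map
entries = [entry]                            # array of maps

first = entries[0]          # array access by index
by_key = first["title"]     # map access by key
by_attr = first.title       # map access by property

# A string can be post-processed into another type, e.g. float.
price = float(first["price"].lstrip("¥"))
```

Nesting an array of maps in this way is the same shape used by the Quick Start template.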

Keywords & Elements

Here are typical keywords:

  • select: select target element(s) from document.
    • selector_string (string): CSS3 Selector to choose target.

  • get: get internal text | html of target DOM element.
    • type (string): “text” | “html”.

  • eval: locally evaluate via python syntax. (Often used to call jquery-like API.)
    • script_text (string): script using python syntax.

  • default: default value if none.
    • value (string): default value.

  • as: output format, in a human-readable form.
    • type (string): str | json | yaml. If not provided, Python builtin data structures are returned.

  • encoding: set decoder for datasource.
    • encode_type (string): such as UTF-8 (default), GBK/GB2312 (some Chinese websites), UTF-16, etc.

(Keywords are not limited to those above.)
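To see why the encoding keyword matters, the sketch below uses only the Python standard library and does not call matchtpl. `decode_page` is a hypothetical helper; it mimics reading a page with a declared encoding and shows that GBK bytes cannot be read with the UTF-8 default.

```python
# Bytes as a GBK-encoded page would deliver them.
raw = "总经理财务一本通".encode("gbk")


def decode_page(data: bytes, encoding: str = "UTF-8") -> str:
    """Hypothetical helper: decode raw page bytes with the given encoding."""
    return data.decode(encoding)


# With the right encoding the text comes back intact.
title = decode_page(raw, "GBK")

# With the UTF-8 default, the same bytes are not valid input.
try:
    decode_page(raw)
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

For such a page, the template would need to declare the matching encoding (for example GBK) instead of relying on the default.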

The extensible elements are:

  • Structure element: <s></s>, <array></array>, <map></map> (see above).

  • Root element: <root></root>. Acts as the serialization class and provides multiple output formats for the result.

  • Customized element: <action></action>, where action can be any tag name that does not conflict with the existing ones. It is a customized action provided by the user when calling parser.parse(…, {'action': some_function}).
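A user-defined function for such a customized element could look like the sketch below. The signature matchtpl expects is not documented here, so the one-string-in, one-value-out shape of `to_price` is an assumption, and both the function and the `to_price` tag name are hypothetical. The example exercises the function on its own rather than through the parser.

```python
import re


def to_price(text):
    """Hypothetical UDF: pull the first decimal number out of a price string."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


# Mapping from tag name to callable, in the shape the text above describes.
actions = {"to_price": to_price}

# Calling the function directly, independent of matchtpl.
value = actions["to_price"]("¥1.99")
```

Keeping the function free of parser state makes it easy to test before wiring it into a template.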

Quick Start

This example shows how to extract data from an HTML source. Matchtpl provides an easy way to parse an HTML file and format the output. It is a real case: extracting product information from a web page of amazon.com.

Python Code

In python, typical usage often looks like this:

#!/usr/bin/env python

from matchtpl import MTemplateEnv, MTemplate, MTemplateParser

if __name__ == '__main__':
    # initialize environment
    env = MTemplateEnv(template = 'tpl_amazon.xml')

    # build template
    tpl = MTemplate()
    tpl.build(env)

    # initialize parser and parse
    parser = MTemplateParser(tpl)
    results = parser.parse('amazon.html')

Configurable Template

The pre-defined template is written in XML and acts as a config file that describes the meta information of the target (usually another HTML/XML file or stream). The parser then uses the template to guide its processing and outputs the result:

<!-- serialize result as json (other formats are also supported) -->
<root as="json">
    <!-- entries are the elements whose IDs start with 'result_';
         each entry is a map -->
    <array select="div[id^='result_']" >
        <map>
            <!-- title: get internal text as result -->
            <s key="title" select="h3 span.lrg" get="text" />
            <s key="info" select="h3 span.med" get="text" />
            <!-- image: get src link in jquery-like way -->
            <s key="image" select="div.image img.productImage" eval="attr('src')" />
            <!-- price: pseudo-class of CSSSelector is used -->
            <s key="price" select="li.newp span:eq(0)" get="text" />
            <!-- review: default value is enabled -->
            <s key="review" select="span.asinReviewsSummary a" eval="attr('alt')" default='0' />
        </map>
    </array>
</root>

After execution, the output is organized as json:

[
    [
        {
            "image": "http://ec4.images-amazon.com/images/I/516Vhic-I9L._AA160_.jpg",
            "info": "刘亚莉 广东省出版集团,广东经济出版社  (2011-05) - Kindle电子书",
            "price": "¥1.99",
            "review": "平均4.4 星",
            "title": "总经理财务一本通"
        },
        // up to 25 results: map
    ]
]

(At present, JSON, YAML, plaintext and Python builtin structures are supported. More formats will be supported later.)
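Once the result is serialized as JSON, it can be consumed with the standard library. The sketch below is not part of matchtpl: it loads one entry copied from the sample output and converts the price and review strings to numbers. `first_number` is a hypothetical helper, and the nested list layout follows the sample shown above.

```python
import json
import re

# One entry from the sample output, kept in the same nested-list layout.
output = """
[
    [
        {
            "image": "http://ec4.images-amazon.com/images/I/516Vhic-I9L._AA160_.jpg",
            "info": "刘亚莉 广东省出版集团,广东经济出版社  (2011-05) - Kindle电子书",
            "price": "¥1.99",
            "review": "平均4.4 星",
            "title": "总经理财务一本通"
        }
    ]
]
"""


def first_number(text):
    """Hypothetical helper: return the first decimal number in text, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


entries = json.loads(output)[0]  # the outer list wraps the array of maps

rows = [
    {
        "title": entry["title"],
        "price": first_number(entry["price"]),
        "rating": first_number(entry["review"]),
    }
    for entry in entries
]
```

Note that the `//` comment in the sample output is an annotation for the reader; real JSON input must not contain it, or `json.loads` will reject it.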

Future Scenarios

Possible functionalities:

  1. Unix-like pipe: |. Just concatenate output|input step by step.

  2. Interactive. Interaction with pages: like doing automation/login/testing.

  3. Type-casting. Convert types to int/float, or instantiate a class directly.

  4. Regex support (e.g. /^abcd/ABCD/g) and some basic UDFs, such as split/trim/toUpper/toLower.

Contributors

  • v0.1 Tian Lin <bolitt@gmail.com>: initiated the project and made the alpha release of the library.

Any contributions are welcome!

See https://pypi.python.org/pypi/matchtpl for the full documentation

News

v-0.1.0.dev1, 11/8/2013 – Initial release.

v-0.1.0.dev2, 12/11/2013 – Minor change on class interfaces.

v-0.1.0.dev3, 12/15/2013 – Clean up some dependencies and fix a setup bug.

v-0.1.2, ?/?/2013 – Add keyword encoding for root element!
