Matching template to extract data from xml or html
Project description
A Python library that matches and extracts data from XML/HTML sources using a predefined template. It provides a convenient, coding-free way to process data, especially web pages.
The features of matchtpl are summarized as follows:
Easy to use. The goal is to ease developers' text-data processing work. Only basic knowledge of jQuery (mostly CSS selectors), a popular JavaScript DOM-manipulation library, is assumed. Users only need to provide an XML template that describes how to extract information and what the expected output is; matchtpl does the rest.
User-friendly. The toolkit does not require coding in Python. Even for sophisticated work, py-matchtpl can take over the dirty work, such as parsing the HTML file, extracting useful information, organizing data into preferable data structures, and serializing it into a string (plaintext), JSON, YAML, or Python builtin structures (the default).
Extensibility. Currently, it supports three basic types of data structures: (1) string; (2) array; (3) map. Their combinations meet the requirements in most cases. In addition, users can provide UDFs (user-defined functions) to customize processing in their own way.
The fundamental philosophy of matchtpl is:
Neat: keep it clean and hide the dirty things.
Simple: everything looks configurable, declarative and intuitive. (Avoid complex control-flow syntax: if/for/while.)
Extensible: leave imagination to the user, so that any idea can be integrated rapidly.
Installation
You can install the latest package from source (or download and unzip it from GitHub):
$ git clone https://github.com/bolitt/py-matchtpl.git
$ cd py-matchtpl
$ python setup.py install
or use python easy_install or pip:
$ easy_install matchtpl
# alternatively install by pip
$ pip install matchtpl
Basic Data Structures
string: <s></s>. A typical atomic structure; it can be post-processed and converted into other types, such as int or float.
array: <array></array>. An ordered list of data, also known as list. It can be retrieved by its index: array[0].
map: <map></map>. A key-value structure, also known as a hash or table. Values can be retrieved by key, map['name'], or by property, map.name.
We believe most data fits into these data structures or their combinations.
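To make the combination concrete, the sketch below mirrors the three structures with plain Python types. The AttrMap class is a hypothetical stand-in written for this example, not matchtpl's own implementation; it only illustrates how key-like and property-like access can coexist on one object.

```python
class AttrMap(dict):
    """Hypothetical dict that also exposes its keys as attributes (illustration only)."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# string: an atom that can be post-processed into another type
price_text = "1.99"
price = float(price_text)

# map: key-value structure
product = AttrMap(title="Example book", price=price)

# array: ordered list, here an array holding one map
products = [product]

print(products[0]["title"])  # key-like access
print(products[0].price)     # property-like access
```

An array of maps like this is the shape the Quick Start template below produces.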
Keywords & Elements
Here are typical keywords:
- select: select target element(s) from document.
selector_string (string): CSS3 Selector to choose target.
- get: get internal text | html of target DOM element.
type (string): "text" | "html".
- eval: locally evaluate via python syntax. (Often used to call jquery-like API.)
script_text (string): script using python syntax.
- default: default value if none.
value (string): default value.
- as: output format in human-readable way.
type (string): str | json | yaml. If not provided, Python builtin data structures are returned.
- encoding: set decoder for datasource.
encode_type (string): such as UTF-8 (default), GBK/GB2312 (some Chinese websites), UTF-16, etc.
(Keywords are not limited to those above.)
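As a rough illustration of what the select, get, default and encoding keywords describe, the sketch below reproduces each step by hand with Python's standard library. It does not use matchtpl, and it is not how matchtpl is implemented: ElementTree's limited XPath stands in for a CSS3 selector, and the sample markup is made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed sample data: a GBK-encoded page fragment, as some Chinese websites serve.
raw = '<div><h3><span class="lrg">总经理<b>财务</b></span></h3></div>'.encode("gbk")

# encoding: decode the raw bytes with the right codec before parsing
source = raw.decode("gbk")
root = ET.fromstring(source)

# select: XPath stand-in for a CSS3 selector such as "h3 span.lrg"
target = root.find(".//h3/span[@class='lrg']")

# get="text": all text inside the element, with tags dropped
text = "".join(target.itertext())

# get="html": the inner markup of the element
inner_html = (target.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in target
)

# default: fall back to a fixed value when nothing matches
missing = root.find(".//span[@class='review']")
review = "".join(missing.itertext()) if missing is not None else "0"

print(text)
print(inner_html)
print(review)
```

Decoding with the wrong codec would either raise an error or yield garbled text, which is why the encoding keyword matters for non-UTF-8 sources.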
And extensible elements are:
Structure elements: <s></s>, <array></array>, <map></map> (see above).
Root element: <root></root>. Acts as the serialization class and provides multiple output formats for the result.
Customized element: <action></action>, where action can be any other non-conflicting tag name. It is a customized action provided by the user when calling parser.parse(…, {'action': some_function}).
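The idea of mapping a tag name to a user function can be sketched without matchtpl as follows. Everything here is hypothetical: the apply_actions helper, the one-argument function signature, and the <upper /> and <strip /> tags are assumptions made for illustration, since the callback signature matchtpl expects is not documented above.

```python
import xml.etree.ElementTree as ET


def to_upper(value):
    """Example user-defined function; the one-argument signature is an assumption."""
    return value.upper()


def apply_actions(template_xml, value, actions):
    """Hypothetical dispatcher: run the function registered for each known tag, in order."""
    for element in ET.fromstring(template_xml):
        func = actions.get(element.tag)
        if func is not None:  # tags without a registered function are skipped
            value = func(value)
    return value


# Tag names map to callables, in the same spirit as {'action': some_function}.
template = "<root><upper /><strip /></root>"
actions = {"upper": to_upper, "strip": str.strip}
result = apply_actions(template, "  kindle ", actions)
print(repr(result))
```

Keeping the mapping in a plain dict means a new action is added by registering one more entry, with no change to the dispatcher.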
Quick Start
This example shows how to extract data from an HTML source. Matchtpl provides an easy way to parse your HTML file and format the output. It is a real case: extracting product information from a web page of amazon.com.
Python Code
In python, typical usage often looks like this:
#!/usr/bin/env python
from matchtpl import MTemplateEnv, MTemplate, MTemplateParser

if __name__ == '__main__':
    # initialize environment
    env = MTemplateEnv(template='tpl_amazon.xml')
    # build template
    tpl = MTemplate()
    tpl.build(env)
    # initialize parser and parse
    parser = MTemplateParser(tpl)
    results = parser.parse('amazon.html')
Configurable Template
The predefined template is written in XML and acts as a config file that describes the meta information of the target (usually another HTML/XML file or stream). The parser then uses the template to guide its processing and output the result:
<!-- serialize result as json. (other formats are also supported) -->
<root as="json">
    <!-- the collection of entries are started with 'result_*' in their IDs, and each entry is a map -->
    <array select="div[id^='result_']">
        <map>
            <!-- title: get internal text as result -->
            <s key="title" select="h3 span.lrg" get="text" />
            <s key="info" select="h3 span.med" get="text" />
            <!-- image: get src link in jquery-like way -->
            <s key="image" select="div.image img.productImage" eval="attr('src')" />
            <!-- price: pseudo-class of CSSSelector is used -->
            <s key="price" select="li.newp span:eq(0)" get="text" />
            <!-- review: default value is enabled -->
            <s key="review" select="span.asinReviewsSummary a" eval="attr('alt')" default='0' />
        </map>
    </array>
</root>
After execution, the output is organized as json:
[
  [
    {
      "image": "http://ec4.images-amazon.com/images/I/516Vhic-I9L._AA160_.jpg",
      "info": "刘亚莉 广东省出版集团,广东经济出版社 (2011-05) - Kindle电子书",
      "price": "¥1.99",
      "review": "平均4.4 星",
      "title": "总经理财务一本通"
    },
    // up to 25 results: map
  ]
]
(At present, JSON, YAML, plaintext and Python builtin structures are supported. More formats will be supported later.)
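Because every extracted value arrives as a string, a small post-processing step is often useful. The sketch below loads a trimmed copy of the JSON output shown above (one entry, with the comment line removed so that it is valid JSON) and converts the price and review fields to numbers. The first_number helper is a hypothetical function written for this example; it is not part of matchtpl.

```python
import json
import re

# Trimmed sample shaped like the output above: an outer array wrapping an array of maps.
output = """
[[{"image": "http://ec4.images-amazon.com/images/I/516Vhic-I9L._AA160_.jpg",
   "info": "刘亚莉 广东省出版集团,广东经济出版社 (2011-05) - Kindle电子书",
   "price": "¥1.99",
   "review": "平均4.4 星",
   "title": "总经理财务一本通"}]]
"""


def first_number(text, fallback=0.0):
    """Return the first decimal number found in text, or fallback if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else fallback


products = json.loads(output)[0]  # unwrap the outer array to get the list of maps
for product in products:
    product["price"] = first_number(product["price"])
    product["review"] = first_number(product["review"])
    print(product["title"], product["price"], product["review"])
```

The fallback mirrors the default='0' used for the review field in the template: an entry without a review still yields a number.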
Future Scenarios
Possible functionalities:
Unix-like pipe: |. Just concatenate output|input step by step.
Interactive. Interaction with pages, such as automation, login and testing.
Type-casting. Convert values into int/float, or directly instantiate a class.
Regex support, such as /^abcd/ABCD/g, and some basic UDFs, like split/trim/toUpper/toLower.
Project Links
Package Release: https://pypi.python.org/pypi/matchtpl
Source Code: https://github.com/bolitt/py-matchtpl.git
Contributors
v0.1 Tian Lin <bolitt@gmail.com>: initialized the project and made the alpha release of the library.
Any contributions are welcome!
See https://pypi.python.org/pypi/matchtpl for the full documentation
News
v-0.1.0.dev1, 11/8/2013 – Initial release.
v-0.1.0.dev2, 12/11/2013 – Minor change on class interfaces.
v-0.1.0.dev3, 12/15/2013 – Clean up some dependencies and fix a setup bug.
v-0.1.2, ?/?/2013 – Add the encoding keyword for the root element.
Source Distribution
File details
Details for the file matchtpl-0.1.2.zip.
File metadata
- Download URL: matchtpl-0.1.2.zip
- Upload date:
- Size: 95.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 38e74ed0cd5281c631fa26628507fd92b5d2f4de9de87cc16bdb55e5548ff7da
MD5 | 87c2668d7220a594e0c928249d29f53a
BLAKE2b-256 | 329f1f968e3a02db780cd5a2cb27cbfee922fb4c88dd066663682bbe2761e459