Python library to detect and extract listing data from HTML pages
Project description
MDR is a library to detect and extract listing data from HTML pages. It is implemented based on "Finding and Extracting Data Records from Web Pages", but replaces the similarity measure with the tree alignment proposed in "Web Data Extraction Based on Partial Tree Alignment" and "Automatic Wrapper Adaptation by Tree Edit Distance Matching".
Requires
numpy and scipy must be installed to build this package.
Usage
Detect listing data
MDR assumes that the data records are close to the elements that contain the most text nodes:
    import requests
    from mdr.mdr import MDR

    mdr = MDR()
    r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')
    candidates, doc = mdr.list_candidates(r.text.encode('utf8'))
    ...
    [doc.getpath(c) for c in candidates[:10]]

    ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]',
     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]',
     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody',
     '/html/body/div[2]',
     '/html/body/div[2]/div[4]/div/div[1]']
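The text-node heuristic can be illustrated with a small standard-library sketch. It is not MDR's implementation: the helpers `count_text_nodes` and `rank_candidates` and the sample `PAGE` are hypothetical, and the markup is assumed to be well-formed so that `xml.etree.ElementTree` can parse it. Each element is scored by the number of non-empty text nodes in its subtree and listed with an XPath-like path, most text-heavy first.

```python
import xml.etree.ElementTree as ET


def count_text_nodes(element):
    """Count the non-whitespace text nodes inside an element's subtree."""
    count = 1 if (element.text or "").strip() else 0
    for child in element:
        count += count_text_nodes(child)
        # Text that follows a child still belongs to the parent element.
        if (child.tail or "").strip():
            count += 1
    return count


def rank_candidates(root):
    """Return (path, text_node_count) pairs, most text-heavy first."""
    ranked = []

    def walk(element, path):
        ranked.append((path, count_text_nodes(element)))
        seen = {}  # per-tag sibling counter used to build the path index
        for child in element:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, "/" + root.tag)
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical sample page: one short paragraph and one three-item listing.
PAGE = """
<html><body>
  <div><p>About us</p></div>
  <ul>
    <li><b>Alice</b><span>5 stars</span></li>
    <li><b>Bob</b><span>4 stars</span></li>
    <li><b>Carol</b><span>3 stars</span></li>
  </ul>
</body></html>
"""

ranked = rank_candidates(ET.fromstring(PAGE.strip()))
for path, score in ranked[:3]:
    print(score, path)
```

In this sketch an ancestor always scores at least as high as its descendants, so the listing's `ul` ranks just below `html` and `body`. A real detector would need further filtering on top of the raw count.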
Extract data record
MDR finds the repetition pattern by applying tree matching within a given candidate DOM tree. It then builds a mapping from each so-called seed element to a list of matched elements from the different DOM trees.
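As a rough illustration of tree matching and of the seed-to-matches mapping, the sketch below uses the classic Simple Tree Matching dynamic program on `xml.etree.ElementTree` elements. It is an assumption-laden stand-in, not MDR's code: the function names, the 0.6 similarity threshold, and the sample `LISTING` are all hypothetical, and nodes are compared by tag name only.

```python
import xml.etree.ElementTree as ET


def _child_table(a, b):
    """DP table: table[i][j] is the best number of matched nodes between
    the first i children of a and the first j children of b, in order."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1] + tree_match(a[i - 1], b[j - 1]),
            )
    return table


def tree_match(a, b):
    """Simple Tree Matching: size of the largest order-preserving match."""
    if a.tag != b.tag:
        return 0
    return _child_table(a, b)[len(a)][len(b)] + 1


def tree_size(element):
    """Number of elements in the subtree, including the root."""
    return sum(1 for _ in element.iter())


def similarity(a, b):
    """Matched nodes divided by the average size of the two trees."""
    return tree_match(a, b) / ((tree_size(a) + tree_size(b)) / 2)


def align(seed, other, mapping):
    """Append to mapping[seed_element] the element of `other` aligned to it."""
    if seed.tag != other.tag:
        return
    mapping.setdefault(seed, []).append(other)
    table = _child_table(seed, other)
    i, j = len(seed), len(other)
    # Trace the DP table back to recover which children were paired.
    while i > 0 and j > 0:
        if table[i][j] == table[i - 1][j]:
            i -= 1
        elif table[i][j] == table[i][j - 1]:
            j -= 1
        else:
            align(seed[i - 1], other[j - 1], mapping)
            i, j = i - 1, j - 1


# Hypothetical listing; the second record has an extra <i> element.
LISTING = ET.fromstring(
    "<ul>"
    "<li><b>Alice</b><span>5 stars</span></li>"
    "<li><b>Bob</b><i>new</i><span>4 stars</span></li>"
    "<li><b>Carol</b><span>3 stars</span></li>"
    "</ul>"
)

seed, *others = list(LISTING)
mapping = {}
for record in others:
    if similarity(seed, record) >= 0.6:  # threshold chosen for illustration
        align(seed, record, mapping)

names = [element.text for element in mapping[seed[0]]]
ratings = [element.text for element in mapping[seed[1]]]
print(names, ratings)
```

Here the first `li` plays the role of the seed record. Its `b` and `span` children each map to the corresponding elements in the other two records, and the extra `i` element in the second record is skipped by the alignment.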
Used with annotation (optional)
You can annotate the seed record with any tool you like (e.g. scrapely); MDR will then be able to find the other data in the page.
For example, a demo page is available in which the colored data in the first row is annotated manually and the rest is extracted by MDR.
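The idea of propagating annotations from the seed record can be sketched as follows. This is only an illustration under stated assumptions: the `annotations` dictionary, the `extract_records` helper, and the sample table are hypothetical, and the seed-to-matches mapping is built by child position as a stand-in for the mapping MDR would produce. It also assumes every seed element has exactly one match per record, in record order.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing: the first row is the manually annotated seed record.
TABLE = ET.fromstring(
    "<table>"
    "<tr><td>Alice</td><td>5</td></tr>"
    "<tr><td>Bob</td><td>4</td></tr>"
    "<tr><td>Carol</td><td>3</td></tr>"
    "</table>"
)
seed, *others = list(TABLE)

# Hypothetical annotation format: seed element -> field name.
annotations = {seed[0]: "name", seed[1]: "rating"}

# Stand-in for the seed-to-matches mapping, built here by child position.
mapping = {cell: [row[i] for row in others] for i, cell in enumerate(seed)}


def extract_records(annotations, mapping, n_records):
    """Copy each annotated field from the matched elements into records."""
    records = [{} for _ in range(n_records)]
    for seed_element, field in annotations.items():
        matches = mapping.get(seed_element, [])
        for record, matched in zip(records, matches):
            record[field] = matched.text
    return records


records = extract_records(annotations, mapping, len(others))
print(records)
```

Only the seed row is labelled by hand; the field names reach the remaining rows through the mapping.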
License
MIT
Download files
Source Distribution
File details
Details for the file mdr-0.0.1.tar.gz.
File metadata
- Download URL: mdr-0.0.1.tar.gz
- Upload date:
- Size: 49.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7d7be84742642e82e96ab555de31904a90e109a3ff92f2586b6c16920589bbb3
MD5 | 8e66378ab5c993bf650acd75b2880ee0
BLAKE2b-256 | 9a9ec8330017d8c0aec9a053f29310137da11d5d33633f5d806404c38d3b13e8