Python library to detect and extract listing data from HTML pages

MDR is a library to detect and extract listing data from HTML pages. It is based on the paper "Finding and Extracting Data Records from Web Pages", but replaces the similarity measure with the tree alignment proposed in "Web Data Extraction Based on Partial Tree Alignment" and "Automatic Wrapper Adaptation by Tree Edit Distance Matching".
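To give a feel for what a tree-based similarity looks like, here is a minimal sketch of simple tree matching, the kind of measure that tree-alignment approaches build on. It is not MDR's own code: the function names (`simple_tree_match`, `tree_size`, `similarity`) and the normalization by average tree size are assumptions made for illustration, and it uses the standard library's `xml.etree.ElementTree` rather than whatever parser MDR uses internally.

```python
import xml.etree.ElementTree as ET


def simple_tree_match(a, b):
    """Size of the largest order-preserving matching between two trees.

    Hypothetical helper for illustration; not part of the mdr package.
    """
    if a.tag != b.tag:
        return 0
    kids_a, kids_b = list(a), list(b)
    m, n = len(kids_a), len(kids_b)
    # dp[i][j]: best score using the first i children of a and first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1] + simple_tree_match(kids_a[i - 1], kids_b[j - 1]),
            )
    # +1 for the matched roots themselves
    return dp[m][n] + 1


def tree_size(node):
    """Number of elements in the subtree rooted at node."""
    return 1 + sum(tree_size(child) for child in node)


def similarity(a, b):
    """Matching score divided by the average size of the two trees."""
    return 2.0 * simple_tree_match(a, b) / (tree_size(a) + tree_size(b))


first = ET.fromstring("<li><a/><span/><p/></li>")
second = ET.fromstring("<li><a/><p/></li>")
print(simple_tree_match(first, second), similarity(first, second))
```

Two records with the same layout score close to 1, while unrelated subtrees score 0 because their root tags differ.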

Requires

numpy and scipy must be installed to build this package.

Usage

Detect listing data

MDR assumes that the data records are close to the elements that contain the most text nodes:

import requests
from mdr.mdr import MDR

mdr = MDR()
r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')
# list_candidates returns the candidate elements together with the parsed document
candidates, doc = mdr.list_candidates(r.text.encode('utf8'))
...

# XPaths of the first ten candidates
[doc.getpath(c) for c in candidates[:10]]
 ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]',
 '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div',
 '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]',
 '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody',
 '/html/body/div[2]',
 '/html/body/div[2]/div[4]/div/div[1]']
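The ranking idea behind the candidate list can be illustrated with a toy heuristic. The sketch below is an assumption, not MDR's actual scoring: it ranks each element by how many of its direct children contain text, so that a list container outranks the page root. The names `has_text`, `score` and `rank_candidates` are hypothetical, and the example uses the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def has_text(el):
    """True if the subtree rooted at el contains any non-blank text."""
    return any(t.strip() for t in el.itertext())


def score(el):
    """Number of direct children that carry text somewhere below them."""
    return sum(1 for child in el if has_text(child))


def rank_candidates(root, top=3):
    """Elements ordered by score, highest first (ties keep document order)."""
    return sorted(root.iter(), key=score, reverse=True)[:top]


page = ET.fromstring(
    "<html><body>"
    "<div><h1>Title</h1></div>"
    "<ul><li>a</li><li>b</li><li>c</li></ul>"
    "</body></html>"
)
print([el.tag for el in rank_candidates(page, top=2)])
```

On this small page the `ul` element ranks first because three of its children hold text, followed by `body` with two.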

Extract data record

MDR can find the repetition pattern under a given candidate DOM tree by using tree matching. It then builds a mapping from the so-called seed element to a list of matched elements from the different DOM trees.
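The shape of such a seed-to-matches mapping can be sketched as follows. This is a simplified illustration under stated assumptions, not MDR's API: it uses a naive top-down alignment that pairs children in order when their tags agree, instead of full tree matching, and the names `align` and `build_mapping` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def align(seed, other, mapping):
    """Record `other` as a match for `seed`, then align their children in order."""
    if seed.tag != other.tag:
        return
    mapping[seed].append(other)
    others = list(other)
    start = 0
    for child in seed:
        # find the next not-yet-used child of `other` with the same tag
        k = start
        while k < len(others) and others[k].tag != child.tag:
            k += 1
        if k < len(others):
            align(child, others[k], mapping)
            start = k + 1


def build_mapping(seed, records):
    """Map every element of the seed record to its matches in the other records."""
    mapping = defaultdict(list)
    for record in records:
        align(seed, record, mapping)
    return mapping


listing = ET.fromstring(
    "<ul>"
    "<li><a>A</a><span>1</span></li>"
    "<li><a>B</a><span>2</span></li>"
    "<li><a>C</a></li>"
    "</ul>"
)
seed, *others = list(listing)
mapping = build_mapping(seed, others)
print([el.text for el in mapping[seed.find("a")]])
```

Each element of the first record acts as a key, and the value lists the elements in the remaining records that line up with it. A record that lacks a field, such as the third `li` without a `span`, simply contributes nothing for that key.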

Used with annotation (optional)

You can annotate the seed record with any tool you like (e.g. scrapely); MDR will then be able to find the other data records on the page.

For example, in the demo page, the colored data in the first row is annotated manually and the rest is extracted by MDR.

Author

Terry Peng <pengtaoo@gmail.com>

License

MIT
