This is a pre-production deployment of Warehouse, however changes made here WILL affect the production instance of PyPI.
Latest Version Dependencies status unknown Test status unknown Test coverage unknown
Project Description

MDR is a library detect and extract listing data from HTML page. It implemented base on the Finding and Extracting Data Records from Web Pages but change the similarity to tree alignment proposed by Web Data Extraction Based on Partial Tree Alignment and Automatic Wrapper Adaptation by Tree Edit Distance Matching.

Requires

numpy and scipy must be installed to build this package.

Usage

Detect listing data

MDR assume the data record close to the elements has most text nodes:

[1]: import requests
[2]: from mdr.mdr import MDR
[3]: mdr = MDR()
[4]: r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')
[5]: candidates, doc = mdr.list_candidates(r.text.encode('utf8'))
...

[8]: [doc.getpath(c) for c in candidates[:10]]
 ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]',
 '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div',
 '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]',
 '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]',
 '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody',
 '/html/body/div[2]',
 '/html/body/div[2]/div[4]/div/div[1]']

Extract data record

MDR can find the repetiton pattern by using tree matching under certain candidate DOM tree.then it will build a mapping from so-called seed element to a list of matched elements from different DOM trees.

Used with annotation (optional)

You can annotate the seed record with any tools (e.g. scrapely) you like, then mdr will be able to find the other data in the page.

e.g. you can find this demo page here. the colored data in first row are annotated manually, the rest are extracted by MDR.

Author

Terry Peng <pengtaoo@gmail.com>

License

MIT

Release History

Release History

0.0.1

This version

History Node

TODO: Figure out how to actually get changelog content.

Changelog content for this version goes here.

Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Show More

Download Files

Download Files

TODO: Brief introduction on what you do with files - including link to relevant help section.

File Name & Checksum SHA256 Checksum Help Version File Type Upload Date
mdr-0.0.1.tar.gz (49.3 kB) Copy SHA256 Checksum SHA256 Source Aug 14, 2014

Supported By

WebFaction WebFaction Technical Writing Elastic Elastic Search Pingdom Pingdom Monitoring Dyn Dyn DNS HPE HPE Development Sentry Sentry Error Logging CloudAMQP CloudAMQP RabbitMQ Heroku Heroku PaaS Kabu Creative Kabu Creative UX & Design Fastly Fastly CDN DigiCert DigiCert EV Certificate Rackspace Rackspace Cloud Servers DreamHost DreamHost Log Hosting