parsedom

A fast DOM parser

Project description

This is a fork of Common Functions and ParseDOM for use outside of XBMC.

Getting element content.

from parsedom import parseDOM
link_html = "<a href='bla.html'>Link Test</a>"
ret = parseDOM(link_html, "a")
print(repr(ret)) # Prints ['Link Test']

Getting an element attribute.

link_html = "<a href='bla.html'>Link Test</a>"
ret = parseDOM(link_html, "a", ret = "href")
print(repr(ret)) # Prints ['bla.html']

Getting an element with a matching attribute.

link_html = "<a href='bla1.html' id='link1'>Link Test1</a><a href='bla2.html' id='link2'>Link Test2</a><a href='bla3.html' id='link3'>Link Test3</a>"
ret1 = parseDOM(link_html, "a", attrs = { "id": "link1" }, ret = "href")
ret2 = parseDOM(link_html, "a", attrs = { "id": "link2" })
ret3 = parseDOM(link_html, "a", attrs = { "id": "link3" }, ret = "id")
print(repr(ret1)) # Prints ['bla1.html']
print(repr(ret2)) # Prints ['Link Test2']
print(repr(ret3)) # Prints ['link3']
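
To cross-check results like these without the library, a rough equivalent of the attribute lookup can be sketched with Python's standard html.parser module. This is an illustration only, not how parsedom works internally; the names AttrCollector and collect_attr are hypothetical and are not part of parsedom.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect one attribute from every tag that matches a name and a filter."""

    def __init__(self, tag, want, attrs=None):
        super().__init__()
        self.tag = tag              # tag name to look for, e.g. "a"
        self.want = want            # attribute whose value is returned
        self.filter = attrs or {}   # attributes the tag must carry
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        attrs = dict(attrs)
        matches = all(attrs.get(k) == v for k, v in self.filter.items())
        if matches and self.want in attrs:
            self.found.append(attrs[self.want])


def collect_attr(html, tag, want, attrs=None):
    """Hypothetical helper: return the wanted attribute of all matching tags."""
    parser = AttrCollector(tag, want, attrs)
    parser.feed(html)
    parser.close()
    return parser.found


link_html = ("<a href='bla1.html' id='link1'>Link Test1</a>"
             "<a href='bla2.html' id='link2'>Link Test2</a>"
             "<a href='bla3.html' id='link3'>Link Test3</a>")

print(collect_attr(link_html, "a", "href", {"id": "link1"}))  # ['bla1.html']
print(collect_attr(link_html, "a", "id", {"id": "link3"}))    # ['link3']
```

The sketch only reads attributes from start tags; it does not return element content the way parseDOM does when ret is omitted.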

When scraping sites it is prudent to scrape in steps, since real websites are often complicated.

Take this example where you want to get all the user uploads.

<div id="content">
 <div id="sidebar">
  <div id="latest">
   <a href="/video?8wxOVn99FTE">Miley Cyrus - When I Look At You</a><br />
   <a href="/video?46">Puppet theater</a><br />
   <a href="/video?98">VBLOG #42</a><br />
   <a href="/video?11">Fourth upload</a><br />
  </div>
 </div>
 <div id="user">
  <div id="uploads">
   <a href="/video?12">First upload</a><br />
   <a href="/video?23">Second upload</a><br />
   <a href="/video?34">Third upload</a><br />
   <a href="/video?41">Fourth upload</a><br />
  </div>
 </div>
</div>

The first step is to limit your search to the correct area.

Always find the innermost DOM element that contains the needed data.

ret = parseDOM(html, "div", attrs = { "id": "uploads" })

The variable ret now contains

['<a href="/video?12">First upload</a><br />
<a href="/video?23">Second upload</a><br />
<a href="/video?34">Third upload</a><br />
<a href="/video?41">Fourth upload</a><br />']

Now we can extract the video URLs.

videos = parseDOM(ret, "a", ret = "href")
print(repr(videos)) # Prints ['/video?12', '/video?23', '/video?34', '/video?41']
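
The same narrow-then-extract idea can be reproduced with only the standard library, which may be handy for checking what a scrape should return. The sketch below is an assumption-laden illustration, not parsedom's implementation: ScopedHrefs and hrefs_inside are hypothetical names, and the parser only tracks div nesting to decide whether a link lies inside the target element.

```python
from html.parser import HTMLParser


class ScopedHrefs(HTMLParser):
    """Collect <a href> values found inside the <div> with a given id."""

    def __init__(self, scope_id):
        super().__init__()
        self.scope_id = scope_id
        self.depth = 0      # open <div>s counted from the target; 0 = outside it
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            if tag == "div":
                self.depth += 1                  # nested div inside the target
            elif tag == "a" and "href" in attrs:
                self.hrefs.append(attrs["href"])
        elif tag == "div" and attrs.get("id") == self.scope_id:
            self.depth = 1                       # entered the target div

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1                      # reaches 0 when the target closes


def hrefs_inside(html, scope_id):
    """Hypothetical helper: hrefs of all links inside <div id=scope_id>."""
    parser = ScopedHrefs(scope_id)
    parser.feed(html)
    parser.close()
    return parser.hrefs


html = """
<div id="content">
 <div id="sidebar">
  <div id="latest">
   <a href="/video?8wxOVn99FTE">Miley Cyrus - When I Look At You</a><br />
   <a href="/video?46">Puppet theater</a><br />
  </div>
 </div>
 <div id="user">
  <div id="uploads">
   <a href="/video?12">First upload</a><br />
   <a href="/video?23">Second upload</a><br />
   <a href="/video?34">Third upload</a><br />
   <a href="/video?41">Fourth upload</a><br />
  </div>
 </div>
</div>
"""

print(hrefs_inside(html, "uploads"))
# ['/video?12', '/video?23', '/video?34', '/video?41']
```

Scoping to "uploads" rather than "content" keeps the sidebar links out of the result, which is the reason for searching the innermost container first.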

Project details

Release history

This version

1.0.0

Mar 24, 2014

Download files

Source Distribution

parsedom-1.0.0.tar.gz (16.8 kB)

Uploaded Mar 24, 2014 Source

File details

Details for the file parsedom-1.0.0.tar.gz.

File metadata

Download URL: parsedom-1.0.0.tar.gz
Upload date: Mar 24, 2014
Size: 16.8 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for parsedom-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`09c15a77c9115127d38b330bc8d688506d5282b4ed0aaa910604587f23ca43b8`
MD5	`4247bc3bab09a6166773cf55d398ce2c`
BLAKE2b-256	`b2cbdd97f8e212095cb947b40cbc3748d05a003b0bfe1ca85a1a4a24548305ae`
