A robust parser that extracts the title, content, and images from HTML pages.
Project description
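# Python 2 example: urllib2 and the print statement are Python 2 only.
import urllib2
from jparser import PageModel

# Fetch the page and decode the raw bytes as GB18030 before parsing.
html = urllib2.urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')
pm = PageModel(html)
result = pm.extract()

print "==title=="
print result['title']
print "==content=="
# Each content item is either a text block or an image block.
for x in result['content']:
    if x['type'] == 'text':
        print x['data']
    if x['type'] == 'image':
        print "[IMAGE]", x['data']['src']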
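The example above reads the result as a dict with a `title` string and a `content` list of `text` and `image` items. As a minimal sketch of consuming that shape without a network call, the helper below flattens such a dict into printable lines. The name `render_result` and the `sample` data are hypothetical and hand-written for illustration; they are not part of jparser and not real extractor output.

```python
def render_result(result):
    """Flatten an extraction result into a list of printable lines.

    Assumes the shape used in the example above: result['title'] is a
    string and result['content'] is a list of dicts with 'type' and 'data'.
    """
    lines = ["==title==", result["title"], "==content=="]
    for block in result["content"]:
        if block["type"] == "text":
            # Text blocks carry the paragraph string directly in 'data'.
            lines.append(block["data"])
        elif block["type"] == "image":
            # Image blocks carry a dict; only 'src' is used here.
            lines.append("[IMAGE] " + block["data"]["src"])
    return lines


# Hand-written sample in the assumed shape; not real jparser output.
sample = {
    "title": "Example headline",
    "content": [
        {"type": "text", "data": "First paragraph."},
        {"type": "image", "data": {"src": "http://example.com/a.jpg"}},
    ],
}

if __name__ == "__main__":
    print("\n".join(render_result(sample)))
```

Returning a list of lines rather than printing directly keeps the helper usable under both Python 2 and Python 3 and makes it easy to test.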
Download files
Source Distribution
jparser-0.0.15.tar.gz (3.3 kB)
File details
Details for the file jparser-0.0.15.tar.gz.
File metadata
- Download URL: jparser-0.0.15.tar.gz
- Upload date:
- Size: 3.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 86f0764f1bc1d1ddf90886c0fb25921fb0301923eec4afc4dda2f5a6addc71a6 |
| MD5 | 58c84be79273f5f84a260414f11c0fd2 |
| BLAKE2b-256 | 56e449f48745c45fec0fcf705ec1a586deed75c6810a3d036f0124f41c8a722b |
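The digests listed above can be used to check a downloaded archive before installing it. The following is a minimal sketch using the standard-library `hashlib` module; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the local file path is an assumption about where the archive was saved.

```python
import hashlib

# SHA256 digest listed above for jparser-0.0.15.tar.gz.
EXPECTED_SHA256 = "86f0764f1bc1d1ddf90886c0fb25921fb0301923eec4afc4dda2f5a6addc71a6"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare a file's digest with the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("jparser-0.0.15.tar.gz")` would return `True` only if the file in the current directory has the SHA256 digest listed in the table.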