A robust parser that extracts the title, content, and images from HTML pages.
Project description
import urllib2
from jparser import PageModel
html = urllib2.urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')
pm = PageModel(html)
result = pm.extract()
print "==title=="
print result['title']
print "==content=="
for x in result['content']:
    if x['type'] == 'text':
        print x['data']
    if x['type'] == 'image':
        print "[IMAGE]", x['data']['src']
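The `content` list in the example above mixes text and image items. A minimal sketch of flattening such a result into one plain-text string follows. It assumes only the result shape shown in the example (a `title` string, plus `content` items with `type` and `data` keys). The helper name `result_to_text` and the sample dictionary are hypothetical, introduced here for illustration, and are not part of jparser.

```python
def result_to_text(result):
    """Flatten an extract()-style result dict into one plain-text string.

    Assumes the shape shown in the example: result['title'] is a string and
    result['content'] is a list of {'type': ..., 'data': ...} items.
    """
    lines = [result['title'], '']
    for item in result['content']:
        if item['type'] == 'text':
            lines.append(item['data'])
        elif item['type'] == 'image':
            # Image items carry a dict; only its 'src' key appears in the example.
            lines.append('[IMAGE] ' + item['data']['src'])
    return '\n'.join(lines)


# Hand-written sample standing in for a real pm.extract() result.
sample = {
    'title': 'Example title',
    'content': [
        {'type': 'text', 'data': 'First paragraph.'},
        {'type': 'image', 'data': {'src': 'http://example.com/a.jpg'}},
    ],
}
text = result_to_text(sample)
```

Returning a string instead of printing keeps the helper usable on both Python 2 and Python 3, and item types other than `text` and `image` are skipped.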
Download files
Source Distribution
jparser-0.0.10.tar.gz (3.1 kB)
File details
Details for the file jparser-0.0.10.tar.gz.
File metadata
- Download URL: jparser-0.0.10.tar.gz
- Upload date:
- Size: 3.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f0610e833c2f415a82f2fbc31b89dec91f7cdc6d4b15ca14d6221effb1835121 |
| MD5 | d15524b8880c651e11039f63b23d2a0b |
| BLAKE2b-256 | 551f6623738559fad34777d9c656ac59d3f23d81a5a625c0242c6a1dfed9b756 |
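The digests above can be used to check a downloaded archive. The following is a minimal sketch using Python's standard `hashlib` module. It assumes the archive has already been saved locally, and the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, chosen for this illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest listed above for jparser-0.0.10.tar.gz.
EXPECTED_SHA256 = 'f0610e833c2f415a82f2fbc31b89dec91f7cdc6d4b15ca14d6221effb1835121'
```

Assuming the archive sits in the current directory, `matches_published_hash('jparser-0.0.10.tar.gz', EXPECTED_SHA256)` should return `True` for an intact download.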