A readability parser that extracts the title, content, and images from HTML pages.
Project description
# Python 2 example (urllib2 and print statements)
import urllib2
from jparser import PageModel

# Fetch the page and decode the raw bytes as GB18030
html = urllib2.urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')
pm = PageModel(html)
result = pm.extract()

print "==title=="
print result['title']
print "==content=="
# Each content item is a dict whose 'type' is either 'text' or 'image'
for x in result['content']:
    if x['type'] == 'text':
        print x['data']
    if x['type'] == 'image':
        print "[IMAGE]", x['data']['src']
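The example above prints each content item as it goes. If the text and the image URLs are needed separately, the result can be split into two lists. The following is a minimal Python 3 sketch that does not call jparser itself: it assumes the result has the shape used in the example above (a `title` string and a `content` list of dicts with `type` and `data` keys). The helper name `split_content` and the `sample` data are hypothetical and exist only for illustration.

```python
def split_content(result):
    """Separate extracted content items into text paragraphs and image URLs.

    Assumes `result` looks like the dict used in the example above:
    {'title': ..., 'content': [{'type': 'text' | 'image', 'data': ...}, ...]}
    """
    texts, images = [], []
    for item in result.get("content", []):
        if item.get("type") == "text":
            # Text items carry the paragraph string directly in 'data'.
            texts.append(item["data"])
        elif item.get("type") == "image":
            # Image items carry a dict in 'data'; 'src' holds the URL.
            images.append(item["data"]["src"])
    return texts, images


# Hypothetical sample, hand-written to mimic the shape of an extraction result.
sample = {
    "title": "Example headline",
    "content": [
        {"type": "text", "data": "First paragraph."},
        {"type": "image", "data": {"src": "http://example.com/a.jpg"}},
        {"type": "text", "data": "Second paragraph."},
    ],
}

texts, images = split_content(sample)
print(sample["title"])
print("\n".join(texts))
print(images)
```

Keeping the two lists separate makes it easy to, for example, join the text into one article body while downloading the images in a later step.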
Download files
Source Distribution
jparser-0.0.20.tar.gz (3.5 kB)
File details
Details for the file jparser-0.0.20.tar.gz.
File metadata
- Download URL: jparser-0.0.20.tar.gz
- Upload date:
- Size: 3.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c6b3c6ff5cc20c615f4b097c4f1b765495a315790110a5032f694b72ac6b392b |
| MD5 | 4d6a655aaf14a49e0b61c8a2692b8ea3 |
| BLAKE2b-256 | 78fea080447f4058c0961d8db205f278d8fc4f623bdb581a5c56e750012af3a4 |
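
The published digests can be used to check a downloaded copy of the archive. The following is a minimal Python 3 sketch that uses only the standard library's `hashlib`. The function names `sha256_of` and `matches_published_hash` are hypothetical, and the local file path passed in is an assumption about where the archive was saved.

```python
import hashlib

# SHA256 digest listed in the table above for jparser-0.0.20.tar.gz.
EXPECTED_SHA256 = "c6b3c6ff5cc20c615f4b097c4f1b765495a315790110a5032f694b72ac6b392b"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected digest."""
    return sha256_of(path) == expected.lower()
```

Assuming the archive was saved in the current directory, `matches_published_hash("jparser-0.0.20.tar.gz")` reports whether the local file matches the SHA256 value listed above.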