A readability parser that extracts the title, content, and images from HTML pages.

Project description

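# Python 2 example: urllib2 and the print statement do not exist in Python 3.
import urllib2
from jparser import PageModel

# Fetch the page and decode the raw bytes as GB18030 before parsing.
html = urllib2.urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')

# Build the page model and run the extraction.
pm = PageModel(html)
result = pm.extract()

print "==title=="
print result['title']
print "==content=="
# Each content item is a dict with a 'type' and a 'data' payload;
# this loop handles the 'text' and 'image' types.
for x in result['content']:
    if x['type'] == 'text':
        print x['data']
    elif x['type'] == 'image':
        print "[IMAGE]", x['data']['src']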
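
The shape of `result` can be inferred from the loop above: a `title` string plus a `content` list whose items carry a `type` and a `data` payload. The sketch below, written for Python 3, flattens such a result into a single string. It does not call jparser at all; `render_result` is a hypothetical helper rather than part of the library, and `sample_result` is made-up data that only mirrors the structure used in the example.

```python
def render_result(result):
    """Flatten an extraction result into plain text.

    Assumes the structure implied by the example above: a 'title' string
    and a 'content' list of dicts with 'type' and 'data' keys.
    """
    lines = ["==title==", result["title"], "==content=="]
    for item in result["content"]:
        if item["type"] == "text":
            lines.append(item["data"])
        elif item["type"] == "image":
            lines.append("[IMAGE] " + item["data"]["src"])
        # Any other item type is skipped, as in the original loop.
    return "\n".join(lines)


# Made-up sample data, not real jparser output.
sample_result = {
    "title": "Example headline",
    "content": [
        {"type": "text", "data": "First paragraph."},
        {"type": "image", "data": {"src": "http://example.com/a.jpg"}},
    ],
}

if __name__ == "__main__":
    print(render_result(sample_result))
```

Returning a string instead of printing directly keeps the helper easy to test and lets the caller decide where the text goes.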

Download files

Source Distribution

jparser-0.0.20.tar.gz (3.5 kB)

File details

Details for the file jparser-0.0.20.tar.gz.

File metadata

  • Download URL: jparser-0.0.20.tar.gz
  • Upload date:
  • Size: 3.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for jparser-0.0.20.tar.gz

Algorithm    Hash digest
SHA256       c6b3c6ff5cc20c615f4b097c4f1b765495a315790110a5032f694b72ac6b392b
MD5          4d6a655aaf14a49e0b61c8a2692b8ea3
BLAKE2b-256  78fea080447f4058c0961d8db205f278d8fc4f623bdb581a5c56e750012af3a4
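
One way to use these digests is to check a downloaded archive against the SHA256 value before installing it. The following is a minimal sketch using only the Python 3 standard library; `sha256_of_file` and `matches_digest` are illustrative names, and the file path is assumed to be wherever the archive was saved.

```python
import hashlib

# SHA256 digest listed above for jparser-0.0.20.tar.gz.
EXPECTED_SHA256 = "c6b3c6ff5cc20c615f4b097c4f1b765495a315790110a5032f694b72ac6b392b"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Compare the file's digest with the expected one, ignoring letter case."""
    return sha256_of_file(path) == expected_hex.lower()
```

Calling `matches_digest("jparser-0.0.20.tar.gz", EXPECTED_SHA256)` should return `True` for an intact download and `False` if the file was altered or truncated.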
