
A robust parser that extracts the title, main content, and images from HTML pages.

Project description

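# Python 2 example: urllib2 and the print statement do not exist in Python 3.
import urllib2
from jparser import PageModel

# Fetch the page and decode the raw bytes as GB18030 before parsing.
html = urllib2.urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')

# Build the page model and run the extraction.
pm = PageModel(html)
result = pm.extract()

print "==title=="
print result['title']
print "==content=="
# Each content item is a dict with a 'type' ('text' or 'image') and a 'data' payload.
for x in result['content']:
    if x['type'] == 'text':
        print x['data']
    elif x['type'] == 'image':
        # For images, 'data' is a dict; 'src' holds the image URL.
        print "[IMAGE]", x['data']['src']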
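
The loop above implies that result['content'] is a list of dicts, each with a 'type' of 'text' or 'image' and a 'data' payload. As a minimal sketch that assumes only that shape, the same traversal can be wrapped in a function that returns the lines instead of printing them. It is written for Python 3, unlike the Python 2 snippet above. The helper name render_content and the sample_result data are hypothetical; the sample is hand-written and is not real jparser output.

```python
def render_content(result):
    """Flatten an extraction result into printable lines.

    Assumes the shape used in the example above: result['title'] is a string
    and result['content'] is a list of {'type': ..., 'data': ...} dicts.
    """
    lines = ["==title==", result["title"], "==content=="]
    for item in result["content"]:
        if item["type"] == "text":
            lines.append(item["data"])
        elif item["type"] == "image":
            # Image items carry a dict; only its 'src' key is used here.
            lines.append("[IMAGE] " + item["data"]["src"])
    return lines


# Hand-written sample standing in for pm.extract() output (hypothetical data).
sample_result = {
    "title": "Example title",
    "content": [
        {"type": "text", "data": "First paragraph."},
        {"type": "image", "data": {"src": "http://example.com/a.jpg"}},
    ],
}

print("\n".join(render_content(sample_result)))
```

Returning a list keeps the formatting logic separate from output, so the same lines can be printed, written to a file, or checked in a test.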

Download files


Source Distribution

jparser-0.0.15.tar.gz (3.3 kB)


File details

Details for the file jparser-0.0.15.tar.gz.

File metadata

  • Download URL: jparser-0.0.15.tar.gz
  • Size: 3.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for jparser-0.0.15.tar.gz
Algorithm    Hash digest
SHA256       86f0764f1bc1d1ddf90886c0fb25921fb0301923eec4afc4dda2f5a6addc71a6
MD5          58c84be79273f5f84a260414f11c0fd2
BLAKE2b-256  56e449f48745c45fec0fcf705ec1a586deed75c6810a3d036f0124f41c8a722b
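
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only Python's standard hashlib module. The function names sha256_of_file and matches_published_hash are hypothetical, and the local path passed to them is an assumption about where the file was saved.

```python
import hashlib

# SHA256 digest published for jparser-0.0.15.tar.gz in the table above.
EXPECTED_SHA256 = "86f0764f1bc1d1ddf90886c0fb25921fb0301923eec4afc4dda2f5a6addc71a6"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 equals `expected` (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_published_hash("jparser-0.0.15.tar.gz") would be called from the directory containing the downloaded archive. A False result means the file differs from the one the digest was computed for.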
