A robust parser that extracts the title, main content, and images from HTML pages.

Project description

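# Python 2 example (uses urllib2 and print statements).
import urllib2
from jparser import PageModel

# Fetch the page and decode it; this site serves GB18030-encoded HTML.
html = urllib2.urlopen("http://news.sohu.com/20170512/n492734045.shtml").read().decode('gb18030')
pm = PageModel(html)
result = pm.extract()

print "==title=="
print result['title']
print "==content=="
# Each content item is a dict with a 'type' ('text' or 'image') and its 'data'.
for x in result['content']:
    if x['type'] == 'text':
        print x['data']
    elif x['type'] == 'image':
        print "[IMAGE]", x['data']['src']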
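
The result structure used above (a 'title' string plus a 'content' list of 'text' and 'image' items) can be post-processed without printing. The following is a minimal sketch, not part of jparser: `summarize_result` is a hypothetical helper, and `sample` is made-up data that only mimics the shape shown in the example, so it runs without network access or the library installed.

```python
def summarize_result(result):
    """Split an extraction result into joined text and a list of image URLs.

    Hypothetical helper; assumes the dict shape shown in the example above.
    """
    paragraphs = []
    image_urls = []
    for block in result.get('content', []):
        if block['type'] == 'text':
            paragraphs.append(block['data'])
        elif block['type'] == 'image':
            image_urls.append(block['data']['src'])
    return {
        'title': result.get('title', ''),
        'text': '\n'.join(paragraphs),
        'images': image_urls,
    }


# Made-up sample standing in for the output of PageModel(html).extract().
sample = {
    'title': 'Example headline',
    'content': [
        {'type': 'text', 'data': 'First paragraph.'},
        {'type': 'image', 'data': {'src': 'http://example.com/a.jpg'}},
        {'type': 'text', 'data': 'Second paragraph.'},
    ],
}

summary = summarize_result(sample)
```

Keeping text and image URLs in separate fields makes it easier to store the article body and download the images in independent steps.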


Download files

Source Distribution

jparser-0.0.10.tar.gz (3.1 kB), source upload

File details

Details for the file jparser-0.0.10.tar.gz.

File metadata

  • Download URL: jparser-0.0.10.tar.gz
  • Upload date:
  • Size: 3.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for jparser-0.0.10.tar.gz

Algorithm    Hash digest
SHA256       f0610e833c2f415a82f2fbc31b89dec91f7cdc6d4b15ca14d6221effb1835121
MD5          d15524b8880c651e11039f63b23d2a0b
BLAKE2b-256  551f6623738559fad34777d9c656ac59d3f23d81a5a625c0242c6a1dfed9b756
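
The SHA256 digest listed above can be checked against a downloaded copy of the archive. The following is a minimal sketch using only the standard library `hashlib` module; the helper names `sha256_of_file` and `matches_expected` are illustrative, and the local file path is an assumption about where the archive was saved.

```python
import hashlib

# SHA256 digest published for jparser-0.0.10.tar.gz (copied from the listing above).
EXPECTED_SHA256 = "f0610e833c2f415a82f2fbc31b89dec91f7cdc6d4b15ca14d6221effb1835121"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare a file's digest with the expected one, ignoring letter case."""
    return sha256_of_file(path) == expected.lower()
```

Assuming the archive was saved in the current directory, `matches_expected("jparser-0.0.10.tar.gz")` should return True for an intact download.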
