HTML content extraction

Project description

Templating content from HTML. A Python do-alike to boilerpipe.

boilerpipe (http://code.google.com/p/boilerpipe/) is a Java program that looks at HTML tags and tries to deduce where the actual content is sans navigation, headers & footers, etc.
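To make the idea concrete, here is a minimal sketch of tag-based content detection using only the Python standard library. It is not boilerpot's or boilerpipe's actual code or API. The tag sets, the `BlockCollector` class, the `extract_content` function, and both thresholds are assumptions chosen for illustration. The sketch splits a page into text blocks at block-level tags, then keeps blocks that have enough words and few of those words inside links.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; a real extractor would use a fuller list.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "tr", "table",
              "h1", "h2", "h3", "h4", "h5", "h6", "br", "blockquote"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts words and linked words."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: (text, word_count, link_word_count)
        self._words = []       # words of the block being built
        self._link_words = 0   # how many of those words sit inside <a>
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it holds any text.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style bodies
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_words=10, max_link_density=0.3):
    """Return the text of blocks that look like content (hypothetical thresholds)."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [text for text, n, link_n in parser.blocks
            if n >= min_words and link_n / n <= max_link_density]
```

Short blocks and link-heavy blocks (menus, footers) are dropped, while long runs of plain text survive. Real pages need more than this, so treat the thresholds as starting points to tune.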

This is a rough rewrite of that in Python and should be considered super duper alpha. I did it during a 2-day company (Curata.com) hackathon. I haven’t even compared its output against its step-father's, let alone run any corpus comparisons against commoncrawl.org. Consider yourself warned.

The only advantages over boilerpipe are that it is easier to interface with from Python and that the code is much more accessible: 500 lines of Python in one module versus 9000 lines of Java scattered across a bazillion files and directories (I hate me some directories).

Download files

Source Distribution

boilerpot-0.92.tar.gz (9.4 kB view details)

File hashes

Hashes for boilerpot-0.92.tar.gz
Algorithm Hash digest
SHA256 1a495f3f428c28898261c704bb777ea7df75abdf86789356c79c9be252e25119
MD5 2852b1de154f093d964235ec52917452
BLAKE2b-256 b99027e7b4bf2d47ca1689746fd120cc5ffa87826c4af2a29ca0a98f249dc461
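A downloaded archive can be checked against the digests above with the standard library's `hashlib`. The following is a minimal sketch. The helper names `file_digest` and `matches` are made up for illustration, and the file path in the trailing comment assumes the archive was saved to the current directory.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches(path, expected_hex, algorithm="sha256"):
    """True if the file's digest equals the expected hex string."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# Example call (assumes the archive sits in the current directory):
# matches("boilerpot-0.92.tar.gz",
#         "1a495f3f428c28898261c704bb777ea7df75abdf86789356c79c9be252e25119")
```

The same helpers work for the MD5 line by passing `algorithm="md5"`. SHA256 is the better one to check, since MD5 is no longer collision resistant.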
