HTML content extraction

Project description

Templating content from HTML. A Python do-alike to boilerpipe.

boilerpipe (http://code.google.com/p/boilerpipe/) is a Java program that looks at HTML tags and tries to deduce where the actual content is sans navigation, headers & footers, etc.

This is a rough rewrite of that written in Python and should be considered super duper alpha. I did it during a 2-day company (Curata.com) hackathon. I haven’t even run a comparison of its output against its step-father, let alone done any corpus comparisons against commoncrawl.org. Consider yourself warned.

The only advantages over boilerpipe are that it is easier to interface with Python and the code is much more accessible: 500 lines of Python in one module versus 9000 lines of Java scattered across a bazillion files and directories (I hate me some directories).
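
To give a feel for the kind of tag-based guessing described above, here is a minimal sketch of one common heuristic: split the page into text blocks, then keep blocks that have enough words and are not mostly link text. This is not boilerpot's code or API, and it is not claimed to match what boilerpipe does. The names `BlockCollector` and `extract_content`, the tag lists, and the thresholds `min_words` and `max_link_density` are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Tags that end the current text block (an illustrative choice, not a standard list).
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "tr", "table", "h1", "h2",
              "h3", "header", "footer", "nav", "section", "article", "br"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks, counting words and words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: (text, word_count, link_word_count)
        self._words = []       # words of the block being built
        self._link_words = 0   # how many of those words sit inside <a>...</a>
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it has any text, and start a new one.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_words=10, max_link_density=0.3):
    """Return the blocks that look like content, one per line (hypothetical thresholds)."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, n_words, n_link_words in collector.blocks:
        if n_words >= min_words and n_link_words / n_words <= max_link_density:
            kept.append(text)
    return "\n".join(kept)
```

On a page with a short navigation bar full of links, one long paragraph, and a two-word footer, `extract_content` would keep only the paragraph. Real pages are much messier than that, so treat the thresholds as knobs to tune rather than sensible defaults.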

Download files

Files for boilerpot, version 0.92

Filename, size: boilerpot-0.92.tar.gz (9.4 kB)
File type: Source
Python version: None