
Python3 version of speedparser https://github.com/jmoiron/speedparser


speedparser3

Speedparser3 is a Python 3.5+ version of the original Speedparser by jmoiron (https://github.com/jmoiron/speedparser).

speedparser

Speedparser is a black-box “style” reimplementation of the Universal Feed Parser. It uses some feedparser code for date and author handling, but mostly re-implements feedparser's data normalization algorithms based on feedparser output. It uses lxml for feed parsing and for optional HTML cleaning. Its compatibility with feedparser is very good for a strict subset of fields, but poor for fields outside that subset. See tests/speedparsertests.py for more information on which fields are more or less compatible and which are not.

On an Intel(R) Core(TM) i5 750, running only on one core, feedparser managed 2.5 feeds/sec on the test feed set (roughly 4200 “feeds” in tests/feeds.tar.bz2), while speedparser manages around 65 feeds/sec with HTML cleaning on and 200 feeds/sec with cleaning off.

installing

pip3 install speedparser3

usage

Usage is similar to feedparser:

>>> import speedparser3
>>> result = speedparser3.parse(feed)                    # HTML cleaning on
>>> result = speedparser3.parse(feed, clean_html=False)  # HTML cleaning off

differences

There are a few interface differences and many result differences between speedparser3 and feedparser. The main similarities are that both return a FeedParserDict() object (with keys accessible as attributes), both set the bozo key when an error is encountered, and various aspects of the feed and entries keys are likely to be identical or very similar.

speedparser3 uses different data cleaning algorithms than feedparser, and in some cases less cleaning or none at all (buyer beware). When cleaning is enabled, lxml’s HTML cleaner (lxml.html.clean) is used to clean HTML, giving similar but not identical protection against various attributes and elements. If you supply your own Cleaner instance as the clean_html kwarg, speedparser3 will use it to clean the various attributes of the feed and entries.

speedparser3 does not attempt to fix character encoding by default because this processing can take a long time for large feeds. If the encoding declared by the feed is wrong, or if you want this extra level of error tolerance, you can either use the chardet module to detect the encoding from the document, or pass encoding=True to speedparser3.parse, which makes it fall back to encoding detection if it encounters encoding errors.
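
As a rough illustration of the chardet route, the sketch below guesses the encoding of raw feed bytes before parsing. The helper name detect_encoding, the detector parameter, and the min_confidence threshold are hypothetical and not part of speedparser3; only chardet.detect is a real library call.

```python
def detect_encoding(raw, detector=None, min_confidence=0.5):
    """Return a guessed encoding name for raw feed bytes, or None if unsure.

    Hypothetical helper: `detector` defaults to chardet.detect, and the
    confidence threshold is an arbitrary illustrative value.
    """
    if detector is None:
        import chardet  # third-party; imported lazily so a stub can be injected
        detector = chardet.detect

    # chardet.detect returns a dict with "encoding" and "confidence" keys.
    guess = detector(raw)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0

    if encoding and confidence >= min_confidence:
        return encoding
    return None
```

Detection scans the document, which is the cost the paragraph above warns about, so it probably makes sense only for feeds that have already failed to parse. The simpler route remains speedparser3.parse(feed, encoding=True).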

If your application is using feedparser to consume many feeds at once and CPU is becoming a bottleneck, you might want to try out speedparser3 as an alternative (using feedparser as a backup). If you are writing an application that does not ingest many feeds, or where CPU is not a problem, you should use feedparser as it is flexible with bad or malformed data and has a much better test suite.
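
One way to arrange the "feedparser as a backup" setup is sketched below. The function name parse_with_fallback and its parameters are hypothetical; the parsers are passed in as callables so the sketch makes no assumptions beyond the bozo key described above.

```python
def parse_with_fallback(feed, primary, backup):
    """Parse with the fast parser first; retry with the backup on trouble.

    Hypothetical helper: `primary` and `backup` are any callables that take
    the feed and return a dict-like result with a "bozo" key.
    """
    try:
        result = primary(feed)
    except Exception:
        # The fast parser raised; let the more tolerant parser try.
        return backup(feed)

    # A truthy bozo flag signals that the parser hit an error.
    if result.get("bozo"):
        return backup(feed)
    return result


# Intended use (assumes both packages are installed):
#   import feedparser, speedparser3
#   result = parse_with_fallback(data, speedparser3.parse, feedparser.parse)
```

With this arrangement only the feeds that the fast path cannot handle pay the cost of the slower parser.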

Download files


Source Distribution

speedparser3-0.3.1.tar.gz (16.8 kB view details)

Uploaded Source

File details

Details for the file speedparser3-0.3.1.tar.gz.

File metadata

  • Download URL: speedparser3-0.3.1.tar.gz
  • Upload date:
  • Size: 16.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for speedparser3-0.3.1.tar.gz
Algorithm Hash digest
SHA256 3e1ad8140fc2d07e2dc94eb04c7be8e8e8d57b354210f410e82fbfc00ad101c9
MD5 d1df234b48f9867769fc1acf15daff2f
BLAKE2b-256 f9f0550bc1b13665dd8ddbba496a837ca12cf8df0aadd286f9fc4874ee1e7e5c
