Python3 version of speedparser https://github.com/jmoiron/speedparser

speedparser3
------------

Speedparser3 is a Python 3.5+ port of the original `Speedparser <https://github.com/jmoiron/speedparser>`_ by jmoiron.

speedparser
-----------

Speedparser is a black-box "style" reimplementation of the `Universal Feed
Parser <http://code.google.com/p/feedparser/>`_. It uses some feedparser code
for date and authors, but mostly re-implements its data normalization algorithms
based on feedparser output. It uses ``lxml`` for feed parsing and for optional
HTML cleaning. Its compatibility with ``feedparser`` is very good for a strict
subset of fields, but poor for fields outside that subset. See
``tests/speedparsertests.py`` for more information on which fields are more or
less compatible and which are not.

On an Intel(R) Core(TM) i5 750, running on a single core, ``feedparser`` managed
``2.5 feeds/sec`` on the test feed set (roughly 4200 "feeds" in
``tests/feeds.tar.bz2``), while ``speedparser`` managed around ``65 feeds/sec``
with HTML cleaning on and ``200 feeds/sec`` with cleaning off.

installing
----------

``pip3 install speedparser3``

usage
-----

Usage is similar to feedparser::

    >>> import speedparser3
    >>> result = speedparser3.parse(feed)
    >>> result = speedparser3.parse(feed, clean_html=False)

differences
-----------

There are a few interface differences and many result differences between
speedparser3 and feedparser. The main similarities are that both return
a ``FeedParserDict()`` object (with keys accessible as attributes), both
set the ``bozo`` key when an error is encountered, and various aspects of the
``feed`` and ``entries`` keys are likely to be identical *or* very similar.
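
As a rough illustration of those shared keys, the helper below reads a parsed
result using only ``bozo``, ``feed`` and ``entries``. The name ``summarize`` and
the choice of fields are hypothetical and not part of either library; the sketch
assumes nothing beyond dict-style access on the result::

    def summarize(result):
        """Collect a few commonly shared fields from a parsed feed result."""
        feed = result.get("feed") or {}
        entries = result.get("entries") or []
        return {
            # ``bozo`` is set by both parsers when an error was encountered.
            "ok": not result.get("bozo"),
            "feed_title": feed.get("title"),
            "entry_titles": [entry.get("title") for entry in entries],
        }

Because it only calls ``.get()``, the same helper should work on a result from
either parser, and it can be exercised with a plain ``dict`` in tests.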

``speedparser3`` uses different data cleaning algorithms than ``feedparser``,
and in some cases weaker ones or none at all (buyer beware). When HTML cleaning
is enabled, the ``Cleaner`` class from ``lxml.html.clean`` is used to clean HTML,
giving similar but not identical protection against potentially unsafe
attributes and elements. If you pass your own ``Cleaner`` instance as the
``clean_html`` kwarg, ``speedparser3`` will use it to clean the various
attributes of the feed and entries.
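
A minimal sketch of building such a cleaner follows. The option set and the
names ``CLEANER_OPTIONS`` and ``make_cleaner`` are illustrative assumptions, not
the defaults ``speedparser3`` uses; the keyword arguments themselves are
standard ``lxml.html.clean.Cleaner`` options::

    # Hypothetical option set; adjust it to your own threat model.
    CLEANER_OPTIONS = {
        "scripts": True,          # remove <script> tags
        "javascript": True,       # remove JavaScript such as onclick attributes
        "comments": True,         # remove HTML comments
        "style": True,            # remove <style> tags
        "embedded": True,         # remove embedded objects (flash, iframes)
        "frames": True,           # remove frame-related tags
        "forms": True,            # remove form tags
        "safe_attrs_only": True,  # keep only attributes lxml considers safe
    }

    def make_cleaner(options=None):
        """Build an lxml Cleaner from a dict of keyword options."""
        # Imported lazily so the options can be inspected without lxml installed.
        from lxml.html.clean import Cleaner
        return Cleaner(**(CLEANER_OPTIONS if options is None else options))

The resulting object would then be passed as
``speedparser3.parse(feed, clean_html=make_cleaner())``.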

``speedparser3`` does not attempt to fix character encoding by default because
this processing can take a long time for large feeds. If the encoding value of
the feed is wrong, or if you want this extra level of error tolerance, you
can either use the ``chardet`` module to detect the encoding based on the
document or pass ``encoding=True`` to ``speedparser3.parse`` and it will fall
back to encoding detection if it encounters encoding errors.
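
For the first option, a sketch of a ``chardet``-based guess might look like the
following. The helper name ``guess_encoding``, the ``detect`` parameter and the
``0.5`` confidence threshold are assumptions made for illustration;
``chardet.detect`` returns a dict with ``encoding`` and ``confidence`` keys::

    def guess_encoding(raw, detect=None, min_confidence=0.5):
        """Return a detected encoding name, or None when the guess is too weak."""
        if detect is None:
            # Third-party dependency, imported only when no detector is supplied.
            import chardet
            detect = chardet.detect
        guess = detect(raw)
        encoding = guess.get("encoding")
        confidence = guess.get("confidence") or 0.0
        if encoding and confidence >= min_confidence:
            return encoding
        return None

The ``detect`` parameter exists only so the helper can be tested without
``chardet`` installed. In normal use, ``guess_encoding(raw_bytes)`` is enough.
Passing ``encoding=True`` to ``speedparser3.parse`` remains the simpler route
when detection is only needed after an encoding error.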

If your application is using ``feedparser`` to consume many feeds at once and
CPU is becoming a bottleneck, you might want to try out ``speedparser3`` as an
alternative (using ``feedparser`` as a backup). If you are writing an
application that does not ingest many feeds, or where CPU is not a problem,
you should use ``feedparser`` as it is flexible with bad or malformed data and
has a much better test suite.
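
The "``feedparser`` as a backup" arrangement could be sketched as below. The
function ``parse_with_fallback`` is hypothetical, and treating both an exception
and a set ``bozo`` key as reasons to fall back is one possible policy, not
something either library prescribes::

    def parse_with_fallback(data, fast_parse, slow_parse):
        """Try fast_parse first; use slow_parse if it raises or flags an error."""
        try:
            result = fast_parse(data)
        except Exception:
            # Any failure in the fast path is treated as a reason to fall back.
            return slow_parse(data)
        if result.get("bozo"):
            # The fast parser reported a problem; let the tolerant parser retry.
            return slow_parse(data)
        return result

In an application this would be called as
``parse_with_fallback(data, speedparser3.parse, feedparser.parse)``, so the
slower parser only runs on the feeds the fast one could not handle cleanly.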
