Python3 version of speedparser https://github.com/jmoiron/speedparser
speedparser3
------------
Speedparser3 is a Python 3.5+ port of the original `Speedparser <https://github.com/jmoiron/speedparser>`_.
speedparser
-----------
Speedparser is a black-box "style" reimplementation of the `Universal Feed
Parser <http://code.google.com/p/feedparser/>`_. It uses some feedparser code
for date and authors, but mostly re-implements its data normalization algorithms
based on feedparser output. It uses ``lxml`` for feed parsing and for optional
HTML cleaning. Its compatibility with ``feedparser`` is very good for a strict
subset of fields, but poor for fields outside that subset. See
``tests/speedparsertests.py`` for more information on which fields are more or
less compatible and which are not.
On an Intel(R) Core(TM) i5 750, running on only one core, ``feedparser`` managed
``2.5 feeds/sec`` on the test feed set (roughly 4200 "feeds" in
``tests/feeds.tar.bz2``), while ``speedparser`` managed around ``65 feeds/sec``
with HTML cleaning on and ``200 feeds/sec`` with cleaning off.
installing
----------
``pip3 install speedparser3``
usage
-----
Usage is similar to feedparser::

    >>> import speedparser3
    >>> result = speedparser3.parse(feed)
    >>> result = speedparser3.parse(feed, clean_html=False)

differences
-----------
There are a few interface differences and many result differences between
speedparser3 and feedparser. The main similarities are that both return
a ``FeedParserDict()`` object (with keys accessible as attributes), both
set the ``bozo`` key when an error is encountered, and various aspects of the
``feed`` and ``entries`` keys are likely to be identical *or* very similar.
``speedparser3`` uses different data cleaning algorithms than ``feedparser``,
and in some cases weaker ones or none at all (buyer beware). When cleaning is
enabled, lxml's ``lxml.html.clean`` module is used to clean HTML, giving similar
but not identical protection against various attributes and elements. If you
pass your own ``Cleaner`` instance as the ``clean_html`` kwarg, ``speedparser3``
will use it to clean the various attributes of the feed and entries.
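As a rough illustration of supplying your own cleaner, the sketch below builds an lxml ``Cleaner`` and hands it to the parser through ``clean_html``. The helper names ``build_cleaner`` and ``parse_with_cleaner`` are hypothetical, the chosen ``Cleaner`` options are only examples, and ``parse`` is assumed to be ``speedparser3.parse``.

```python
def build_cleaner():
    """Build an lxml Cleaner with a few explicit options (illustrative choices only)."""
    # Imported lazily so parse_with_cleaner() also works with a cleaner built elsewhere.
    # Recent lxml releases may need the separate lxml_html_clean package for this import.
    from lxml.html.clean import Cleaner

    return Cleaner(scripts=True, javascript=True, comments=True, style=True, forms=False)


def parse_with_cleaner(feed, parse, cleaner=None):
    """Call parse(feed, clean_html=cleaner).

    `parse` is assumed to be speedparser3.parse; it is passed in as an argument
    so that this sketch does not depend on the package being installed.
    """
    if cleaner is None:
        cleaner = build_cleaner()
    return parse(feed, clean_html=cleaner)
```

With the real package this would be called as ``parse_with_cleaner(feed, speedparser3.parse)``.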
``speedparser3`` does not attempt to fix character encoding by default because
this processing can take a long time for large feeds. If the encoding value of
the feed is wrong, or if you want this extra level of error tolerance, you
can either use the ``chardet`` module to detect the encoding based on the
document or pass ``encoding=True`` to ``speedparser3.parse`` and it will fall
back to encoding detection if it encounters encoding errors.
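The two options above might be sketched as follows. ``guess_encoding`` wraps ``chardet.detect``, and ``parse_tolerant`` tries the cheap default parse first, retrying with ``encoding=True`` only when ``bozo`` is set. Both helper names are hypothetical, ``parse`` is assumed to be ``speedparser3.parse``, and retrying on any ``bozo`` value (not only on encoding errors) is a simplifying assumption.

```python
def guess_encoding(raw, detect=None):
    """Return the encoding name guessed for the raw feed bytes, or None."""
    if detect is None:
        # chardet is imported lazily so it is only required when actually used.
        import chardet

        detect = chardet.detect
    return detect(raw).get("encoding")


def parse_tolerant(feed, parse):
    """Parse quickly first; retry with encoding detection only if bozo is set.

    `parse` is assumed to be speedparser3.parse. Treating every bozo result
    as a possible encoding problem is an assumption made for this sketch.
    """
    result = parse(feed)
    if result.get("bozo"):
        result = parse(feed, encoding=True)
    return result
```

This keeps the slower detection step off the common path, which matches the reason detection is disabled by default.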
If your application is using ``feedparser`` to consume many feeds at once and
CPU is becoming a bottleneck, you might want to try out ``speedparser3`` as an
alternative (using ``feedparser`` as a backup). If you are writing an
application that does not ingest many feeds, or where CPU is not a problem,
you should use ``feedparser`` as it is flexible with bad or malformed data and
has a much better test suite.
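One way to use ``feedparser`` as a backup is sketched below. ``parse_with_fallback`` is a hypothetical helper, and falling back whenever the fast parser raises or sets ``bozo`` is an assumed policy rather than something either library prescribes.

```python
def parse_with_fallback(feed, fast_parse, slow_parse):
    """Try the fast parser first and fall back to the slow, more tolerant one.

    `fast_parse` is assumed to be speedparser3.parse and `slow_parse`
    feedparser.parse; both are passed in so the policy can be tested in isolation.
    """
    try:
        result = fast_parse(feed)
    except Exception:
        # Any failure in the fast path is handed to the tolerant parser.
        return slow_parse(feed)
    if result.get("bozo"):
        # The fast parser flagged a problem; let the tolerant parser try.
        return slow_parse(feed)
    return result
```

With both packages installed this would be called as ``parse_with_fallback(feed, speedparser3.parse, feedparser.parse)``.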