html2ans

Convert HTML to ANS

This project provides a standardized way to parse HTML elements into ANS (Arc Native Specification) elements. It is mainly used by Arc Publishing’s professional services team to migrate client data into the Arc platform, but it can also be used for arbitrary conversion of HTML to JSON.

html2ans is hosted on PyPI.

Please use the GitHub issue tracker to submit bugs or request features.

Full documentation can be found here.

Quickstart

Generating ANS from HTML

from html2ans.default import Html2Ans

parser = Html2Ans()
content_elements = parser.generate_ans(your_html_here)

Adding Parsers

Basic Addition

If you need to parse a certain tag in a customized way, you can write your own parser class and register it with Html2Ans like so:

from html2ans.default import Html2Ans

parser = Html2Ans()
parser.add_parser(YourCustomImageParser())
parser.generate_ans(your_html_here)

The types of items your parser can parse should be listed in its applicable_elements attribute.

The default parser class (DefaultHtmlAnsParser or Html2Ans) has parsers for text, links, images, various social media embeds, etc.

Prioritized Addition

The parsers that can be used for each element type (e.g. img, p) are held in a list. If you want your parser to have a higher priority than the default parsers, add it like so:

from html2ans.default import Html2Ans

parser = Html2Ans()
parser.insert_parser('img', YourCustomImageParser(), 0)
parser.generate_ans(your_html_here)
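
To make the priority mechanics concrete, here is a minimal, self-contained sketch of a per-tag parser list. It is not html2ans's actual implementation: ToyRegistry and its methods are hypothetical stand-ins, and plain strings stand in for parser objects. The sketch only mirrors the idea described above, that each element type maps to an ordered list and that a parser inserted at index 0 is tried first.

```python
from collections import defaultdict


class ToyRegistry:
    """Hypothetical stand-in for a parser registry; not the html2ans API."""

    def __init__(self):
        # Maps a tag name (e.g. "img") to an ordered list of parsers.
        self.parsers = defaultdict(list)

    def add_parser(self, tag, parser):
        # In this toy, appending puts the parser at the end of the list.
        self.parsers[tag].append(parser)

    def insert_parser(self, tag, parser, index=0):
        # Inserting at index 0 makes the parser the first one tried.
        self.parsers[tag].insert(index, parser)

    def first_parser(self, tag):
        # Return the highest-priority parser for the tag, or None.
        candidates = self.parsers.get(tag, [])
        return candidates[0] if candidates else None


registry = ToyRegistry()
registry.add_parser("img", "default-image-parser")
registry.insert_parser("img", "custom-image-parser", 0)
print(registry.first_parser("img"))
```

How the real add_parser orders parsers relative to the defaults is not shown here; check the full documentation if the ordering matters for your case.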

Creating Custom Parsers

Missing from the snippet above is a definition of YourCustomImageParser. Before talking about how to create such a parser, let’s examine why you might need to do so.

The default image parser, html2ans.parsers.image.ImageParser, applies to HTML img tags only. Imagine you need to parse HTML whose images come in div tags (labelled with the class fancy-figure) that also hold a caption (labelled with the class fancy-caption). Here is a possible implementation of a parser for such images (note: this returns basic image ANS, not a reference):

from html2ans.parsers.base import ParseResult
from html2ans.parsers.image import ImageParser


class YourCustomImageParser(ImageParser):
    # Tags and CSS classes this parser applies to.
    applicable_elements = ['div', 'figure']
    applicable_classes = ['fancy-figure']

    def parse(self, element, *args, **kwargs):
        # Look for the image and its optional caption inside the container.
        image_tag = element.find('img')
        caption_tag = element.find('p', {"class": "fancy-caption"})
        if image_tag:
            # Reuse ImageParser's output builder for the basic image ANS.
            image = self.construct_output(image_tag)
            if caption_tag:
                image["caption"] = caption_tag.text
            return ParseResult(image, True)
        # No image found inside the container: nothing to output.
        return ParseResult(None, True)

Custom Parsing Tips

ANS Versions

Some ANS types require a version. You can set a version in your main parser (Html2Ans) and then automatically include that version in any element parser’s output by setting the parser’s version_required attribute to True.

Note: this doesn’t mean valid, version-compatible ANS is automatically produced!

Keeping HTML in text Output

To control which HTML tags are left inline when text is parsed, adjust the INLINE_TAGS attribute on the text parser. Every parser inherits from html2ans.parsers.utils.AbstractParserUtilities, which provides a default INLINE_TAGS list. This list can be used to make sure text formatters (e.g. strong, em) are left in place when text is parsed.
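
As a rough illustration of what "leaving tags inline" means, the sketch below keeps only tags from an allow-list and drops all other markup while preserving the text. It uses the standard library's html.parser rather than html2ans, and the names InlineTagFilter and keep_inline, as well as the three-tag list, are assumptions made for this example. It is not the library's default INLINE_TAGS list or its implementation.

```python
from html.parser import HTMLParser

# Illustrative allow-list only; not the library's default INLINE_TAGS.
INLINE_TAGS = ["strong", "em", "a"]


class InlineTagFilter(HTMLParser):
    """Keeps allow-listed tags as markup and drops every other tag."""

    def __init__(self, inline_tags):
        super().__init__()
        self.inline_tags = set(inline_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.inline_tags:
            # Re-emit the start tag exactly as it appeared in the input.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.inline_tags:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is always kept, whichever tag it sits in.
        self.parts.append(data)


def keep_inline(html, inline_tags=INLINE_TAGS):
    parser = InlineTagFilter(inline_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(keep_inline("<p>Some <strong>bold</strong> and <span>plain</span> text</p>"))
```

Passing an empty list to keep_inline strips all markup, which corresponds to removing every entry from the allow-list.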

Link Parsing

By default, links (a tags) are left inline in text, assuming there is text outside of the link. A link by itself (e.g. <p><a href="google.com">Search</a></p>) will be turned into an interstitial_link. If interstitial_link elements are unwanted, add the a tag to the list of applicable_elements for the ParagraphParser.
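
The rule above can be sketched with the standard library alone. The snippet below is an approximation for illustration, not the logic html2ans uses: LoneLinkDetector and classify_paragraph are hypothetical names, and the function returns only a label rather than real ANS output.

```python
from html.parser import HTMLParser


class LoneLinkDetector(HTMLParser):
    """Tracks whether non-whitespace text appears outside <a> tags."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.link_count = 0
        self.text_outside_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # Only text found outside of any link counts as surrounding text.
        if self.link_depth == 0 and data.strip():
            self.text_outside_link = True


def classify_paragraph(html):
    detector = LoneLinkDetector()
    detector.feed(html)
    detector.close()
    # A single link with no surrounding text is treated as standalone.
    if detector.link_count == 1 and not detector.text_outside_link:
        return "interstitial_link"
    return "text"


print(classify_paragraph('<p><a href="google.com">Search</a></p>'))
print(classify_paragraph('<p>Try <a href="google.com">searching</a> for it.</p>'))
```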

Removing Unnecessary Tags

Sometimes it is helpful to remove unnecessary tags (e.g. <p></p>, <div><img src="..." /></div>). By default, Html2Ans considers p and div tags with no attributes other than id, class, or style to be unnecessary “wrappers”. When these are encountered, they are ignored and their children are parsed.

The benefit of this is that <p></p> is ignored and <div><img src="..." /></div> is parsed as an image.

The downside is that sometimes you don’t want your HTML removed! There are a few options in this case. You can configure which tags can be considered wrappers via the WRAPPER_TAGS attribute on Html2Ans; if div tags should never be removed, remove div from this list. If a more complicated set of rules is necessary, override the is_wrapper method on Html2Ans.

If it’s easier to modify the HTML than to modify this library, you can also add an arbitrary attribute like so: <div no_parse_flag="true">...</div>. This div will not be considered a wrapper when it is encountered.
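
The wrapper rule described in this section can be approximated as a small predicate. This is a sketch under stated assumptions, not the library's is_wrapper method: the function signature, the IGNORABLE_ATTRIBUTES name, and the use of a plain dict for attributes are all invented for illustration, and the real method may apply further conditions.

```python
# Assumed defaults, taken from the description above.
WRAPPER_TAGS = ["p", "div"]
IGNORABLE_ATTRIBUTES = {"id", "class", "style"}


def is_wrapper(tag_name, attrs, wrapper_tags=WRAPPER_TAGS):
    """Return True if the tag looks like an unnecessary wrapper.

    tag_name is the element name and attrs is a dict of its attributes.
    """
    if tag_name not in wrapper_tags:
        return False
    # Any attribute beyond id, class, or style disqualifies the tag.
    return set(attrs) <= IGNORABLE_ATTRIBUTES


print(is_wrapper("div", {"class": "container"}))
print(is_wrapper("div", {"no_parse_flag": "true"}))
print(is_wrapper("div", {}, wrapper_tags=["p"]))
```

The second call shows why adding an arbitrary attribute protects a div, and the third shows the effect of removing div from the list of wrapper tags.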

Release history

3.0.6 (this version): Dec 12, 2019
3.0.5: Oct 9, 2019
3.0.4: Jul 30, 2019
3.0.3: May 15, 2019
3.0.2: Apr 2, 2019
3.0.1: Mar 18, 2019
3.0.0: Feb 16, 2019
3.0.0.dev0 (pre-release): Feb 14, 2019

Download files

Source Distribution

html2ans-3.0.6.tar.gz (18.0 kB), uploaded Dec 12, 2019, Source

Built Distributions

html2ans-3.0.6-py3.6.egg (44.9 kB), uploaded Dec 12, 2019, Egg

html2ans-3.0.6-py2.py3-none-any.whl (22.2 kB), uploaded Dec 12, 2019, Python 2, Python 3

File details

Details for the file html2ans-3.0.6.tar.gz.

File metadata

Download URL: html2ans-3.0.6.tar.gz
Upload date: Dec 12, 2019
Size: 18.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.6.9

File hashes

Hashes for html2ans-3.0.6.tar.gz
Algorithm	Hash digest
SHA256	`6348bf55bfbe45cc16c7614fff3cfba77c1500a9dc2cb07d76bb2e4708523ccb`
MD5	`2d00d200ddf852645a6600970726234b`
BLAKE2b-256	`3e169b652369f28e061ef43d513fc7fcf50949a5420510895120fd6a34cc0a49`

See more details on using hashes here.

File details

Details for the file html2ans-3.0.6-py3.6.egg.

File metadata

Download URL: html2ans-3.0.6-py3.6.egg
Upload date: Dec 12, 2019
Size: 44.9 kB
Tags: Egg
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.6.9

File hashes

Hashes for html2ans-3.0.6-py3.6.egg
Algorithm	Hash digest
SHA256	`7841393568c09efcc71c0b0d213091782abcc5a23dc99cb39902b2df24e5c1c0`
MD5	`9701ceba1ab3beafc9cc479664fd992b`
BLAKE2b-256	`def2b4b00d00b12b681f490a3fc3a288795bd7cad46792335ddc132c045fc61a`

File details

Details for the file html2ans-3.0.6-py2.py3-none-any.whl.

File metadata

Download URL: html2ans-3.0.6-py2.py3-none-any.whl
Upload date: Dec 12, 2019
Size: 22.2 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.6.9

File hashes

Hashes for html2ans-3.0.6-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`6f9d770e693de409d54766bf407ffdbe366e864bb2c0e06886cb9c123630574e`
MD5	`f10d15d939748642e1994bc9072d8a8f`
BLAKE2b-256	`de51b147d7c2dcab29257c0eab2612f7053cba06d7c5b87b4c85606ade0720cb`
