Convert HTML to ANS

Project description

This project provides a standardized method of parsing HTML elements into ANS elements. It is mainly used by Arc Publishing’s professional services team to migrate client data into the Arc platform, but can also be used for arbitrary conversion of HTML to JSON.

html2ans is hosted on PyPI.

Please use the GitHub issue tracker to submit bugs or request features.

Full documentation can be found here.

Quickstart

Generating ANS from HTML

from html2ans.default import Html2Ans

parser = Html2Ans()
content_elements = parser.generate_ans(your_html_here)

Adding Parsers

Basic Addition

If you need to parse a certain tag in a customized way, you can write your own parser class and add it to the parsers that Html2Ans uses:

from html2ans.default import Html2Ans

parser = Html2Ans()
parser.add_parser(YourCustomImageParser())
parser.generate_ans(your_html_here)

The default parser class (DefaultHtmlAnsParser or Html2Ans) has parsers for text, links, images, various social media embeds, etc.

Prioritized Addition

The parsers that can be used for each element type (e.g. img, p) are held in a list. If you want your parser to have a higher priority than the default parsers, add it like so:

from html2ans.default import Html2Ans

parser = Html2Ans()
parser.insert_parser('img', YourCustomImageParser(), 0)
parser.generate_ans(your_html_here)
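
To make the priority idea concrete, here is a minimal, standard-library-only sketch of a per-tag parser list. It is not html2ans code: ParserRegistry, NamedParser, applies and first_match are hypothetical names invented for this illustration, and the assumption that a plain add appends to the end of the list is ours, not something the library documents here. The toy add_parser also takes the tag explicitly, unlike the call shown earlier.

```python
from collections import defaultdict


class ParserRegistry:
    """Toy stand-in for the per-tag parser lists (hypothetical, not html2ans code)."""

    def __init__(self):
        # Maps a tag name such as 'img' to an ordered list of parsers.
        self.parsers = defaultdict(list)

    def add_parser(self, tag, parser):
        # Assumption for this sketch: a plain add goes to the end of the list.
        self.parsers[tag].append(parser)

    def insert_parser(self, tag, parser, index):
        # Inserting at index 0 places the parser ahead of those already listed.
        self.parsers[tag].insert(index, parser)

    def first_match(self, tag, element):
        # Try parsers in list order; the first one that accepts the element wins.
        for parser in self.parsers[tag]:
            if parser.applies(element):
                return parser
        return None


class NamedParser:
    """Hypothetical parser that accepts every element it is offered."""

    def __init__(self, name):
        self.name = name

    def applies(self, element):
        return True


registry = ParserRegistry()
registry.add_parser("img", NamedParser("default"))
registry.insert_parser("img", NamedParser("custom"), 0)

chosen = registry.first_match("img", element=None)
print(chosen.name)  # prints "custom"
```

Because both toy parsers accept the element, list order alone decides which one is used, which is why inserting at position 0 lets a custom parser pre-empt the defaults.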

Creating Custom Parsers

The snippets above do not define YourCustomImageParser. Before talking about how to create such a parser, let’s examine why you might need to do so.

The default image parser html2ans.parsers.image.ImageParser applies to HTML img tags only. Imagine you need to parse HTML whose images come in div tags (labelled with the class fancy-figure) that also hold a caption (labelled with the class fancy-caption). Here is a possible implementation of a parser for such images (note: this returns basic image ANS, not a reference):

from html2ans.parsers.image import ImageParser
from html2ans.parsers.base import ParseResult

class YourCustomImageParser(ImageParser):
    # Apply this parser to div tags carrying the fancy-figure class.
    applicable_elements = ['div']
    applicable_classes = ['fancy-figure']

    def parse(self, element, *args, **kwargs):
        # Look for the image and its optional caption inside the div.
        image_tag = element.find('img')
        caption_tag = element.find('p', {"class": "fancy-caption"})
        if image_tag:
            # Reuse the inherited image output, then attach the caption text.
            image = self.construct_output(image_tag)
            if caption_tag:
                image["caption"] = caption_tag.text
            return ParseResult(image, True)
        # No img inside the div: return an empty result.
        return ParseResult(None, True)
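
To show the markup shape that parser targets, the following standard-library-only sketch runs the same extraction (find the img, then the optional caption) over a sample fancy-figure div. It deliberately does not use html2ans: SAMPLE_HTML, FancyFigureSketch and the example.com URL are made up for illustration, and the resulting dict is only an assumed approximation of basic image ANS, not the exact output of construct_output.

```python
from html.parser import HTMLParser

# Hypothetical input: an image wrapped in a div together with its caption.
SAMPLE_HTML = (
    '<div class="fancy-figure">'
    '<img src="https://example.com/cat.jpg">'
    '<p class="fancy-caption">A cat on a mat</p>'
    '</div>'
)


class FancyFigureSketch(HTMLParser):
    """Collects the image URL and caption from one fancy-figure div."""

    def __init__(self):
        super().__init__()
        self.in_figure = False
        self.in_caption = False
        self.result = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "fancy-figure" in classes:
            self.in_figure = True
        elif self.in_figure and tag == "img":
            # Assumed, simplified output shape for illustration only.
            self.result.update({"type": "image", "url": attrs.get("src")})
        elif self.in_figure and tag == "p" and "fancy-caption" in classes:
            self.in_caption = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_caption = False
        elif tag == "div":
            self.in_figure = False

    def handle_data(self, data):
        # Accumulate caption text while inside the fancy-caption paragraph.
        if self.in_caption:
            self.result["caption"] = self.result.get("caption", "") + data


sketch = FancyFigureSketch()
sketch.feed(SAMPLE_HTML)
sketch.close()
figure = sketch.result
print(figure)
```

In the real parser this bookkeeping is unnecessary, since the element passed to parse already supports find; the sketch only illustrates which pieces of the markup end up in the result.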
