Skip to main content

docx (OOXML) to html converter

Project description

https://travis-ci.org/OpenScienceFramework/pydocx.png?branch=master

pydocx is a parser that breaks down the elements of a docxfile and converts them into different markup languages. Right now, HTML is supported. Markdown and LaTex will be available soon. You can extend any of the available parsers to customize it to your needs. You can also create your own class that inherits DocxParser to create your own methods for a markup language not yet supported.

Currently Supported

  • tables
    • nested tables

    • rowspans

    • colspans

    • lists in tables

  • lists
    • list styles

    • nested lists

    • list of tables

    • list of pragraphs

  • justification

  • images

  • styles
    • bold

    • italics

    • underline

    • hyperlinks

  • headings

Usage

DocxParser includes abstracts methods that each parser overwrites to satsify its own needs. The abstract methods are as follows:

class DocxParser:

    @property
    def parsed(self):
        return self._parsed

    @property
    def escape(self, text):
        return text

    @abstractmethod
    def linebreak(self):
        return ''

    @abstractmethod
    def paragraph(self, text):
        return text

    @abstractmethod
    def heading(self, text, heading_level):
        return text

    @abstractmethod
    def insertion(self, text, author, date):
        return text

    @abstractmethod
    def hyperlink(self, text, href):
        return text

    @abstractmethod
    def image_handler(self, path):
        return path

    @abstractmethod
    def image(self, path, x, y):
        return self.image_handler(path)

    @abstractmethod
    def deletion(self, text, author, date):
        return text

    @abstractmethod
    def bold(self, text):
        return text

    @abstractmethod
    def italics(self, text):
        return text

    @abstractmethod
    def underline(self, text):
        return text

    @abstractmethod
    def tab(self):
        return True

    @abstractmethod
    def ordered_list(self, text):
        return text

    @abstractmethod
    def unordered_list(self, text):
        return text

    @abstractmethod
    def list_element(self, text):
        return text

    @abstractmethod
    def table(self, text):
        return text
    @abstractmethod
    def table_row(self, text):
        return text

    @abstractmethod
    def table_cell(self, text):
        return text

    @abstractmethod
    def page_break(self):
        return True

    @abstractmethod
    def indent(self, text, left='', right='', firstLine=''):
        return text

Docx2Html inherits DocxParser and implements basic HTML handling. Ex.

class Docx2Html(DocxParser):

    #  Escape '&', '<', and '>' so we render the HTML correctly
    def escape(self, text):
        return xml.sax.saxutils.quoteattr(text)[1:-1]

    # return a line break
    def linebreak(self, pre=None):
        return '<br />'

    # add paragraph tags
    def paragraph(self, text, pre=None):
        return '<p>' + text + '</p>'

However, let’s say you want to add a specific style to your HTML document. In order to do this, you want to make each paragraph a class of type my_implementation. Simply extend docx2Html and add what you need.

class My_Implementation_of_Docx2Html(Docx2Html):

   def paragraph(self, text, pre = None):
       return <p class="my_implementation"> + text + '</p>'

OR, let’s say FOO is your new favorite markup language. Simply customize your own new parser, overwritting the abstract methods of DocxParser

class Docx2Foo(DocxParser):

    # because linebreaks in are denoted by '!!!!!!!!!!!!' with the FOO markup langauge  :)
    def linebreak(self):
        return '!!!!!!!!!!!!'

Styles

The base parser Docx2Html relies on certain css class being set for certain behaviour to occur. Currently these include:

  • class pydocx-insert -> Turns the text green.

  • class pydocx-delete -> Turns the text red and draws a line through the text.

  • class pydocx-center -> Aligns the text to the center.

  • class pydocx-right -> Aligns the text to the right.

  • class pydocx-left -> Aligns the text to the left.

  • class pydocx-comment -> Turns the text blue.

  • class pydocx-underline -> Underlines the text.

Optional Arguments

You can pass in convert_root_level_upper_roman=True to the parser and it will convert all root level upper roman lists to headings instead.

Changelog

  • 0.2.1
    • Added a changelog

    • Added the version in pydocx.__init__

    • Fixed an issue with duplicating content if there was indentation or justification on a p element that had multiple t tags.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

PyDocX-0.2.1.tar.gz (352.0 kB view details)

Uploaded Source

File details

Details for the file PyDocX-0.2.1.tar.gz.

File metadata

  • Download URL: PyDocX-0.2.1.tar.gz
  • Upload date:
  • Size: 352.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for PyDocX-0.2.1.tar.gz
Algorithm Hash digest
SHA256 3ea1545c84d0c9c1129edc6fb398073aa9c99cff3c38c61e93a3ff6e64022e38
MD5 9339d4ac51e09d8922282eb5e0daf3df
BLAKE2b-256 65f819f18229f02e1822be5e4700e5300f2c9a65d8e02b7a9689dc29b1b917a8

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page