Extract content from docx files

docx2python

Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images to a Python object.

full documentation

The code is an expansion/contraction of python-docx2txt (Copyright (c) 2015 Ankush Shah). The original code is mostly gone, but some of the bones may still be here.

shared features:

  • extracts text from docx files
  • extracts images from docx files
  • no dependencies (docx2python requires pytest to test)

additions:

  • converts bullets and numbered lists to ascii with indentation
  • retains some structure of the original file (more below)
  • extracts document properties (creator, lastModifiedBy, etc.)
  • inserts image placeholders in text ('----image1.jpg----')
  • (optionally) retains font size, font color, bold, italics, and underline as html
  • full test coverage and documentation for developers

subtractions:

  • no command-line interface
  • will only work with later versions of Python

Installation

pip install docx2python

Use

from docx2python import docx2python

# extract docx content
docx2python('path/to/file.docx')

# extract docx content, write images to image_directory
docx2python('path/to/file.docx', 'path/to/image_directory')

# extract docx content with basic font styles converted to html
docx2python('path/to/file.docx', html=True)

Note on html feature:

  • font size, font color, bold, italics, and underline supported
  • every tag opened in a paragraph will be closed in that paragraph (and, where appropriate, reopened in the next paragraph). If two subsequent paragraphs are bold, they will be returned as <b>paragraph 1</b>, <b>paragraph 2</b>. This is intentional, so that each paragraph is its own entity.
  • if you specify html=True, > and < in your docx text will be encoded as &gt; and &lt;
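
The per-paragraph behavior described in these notes can be illustrated with a small standalone sketch. It does not call docx2python and is not taken from its source; render_paragraph is a hypothetical helper, and the tag names passed to it are assumptions chosen for illustration. Note that html.escape also encodes &, which the notes above do not mention.

```python
from html import escape


def render_paragraph(text, tags):
    """Wrap one paragraph in the given tags, closing every tag it opens."""
    opening = "".join(f"<{tag}>" for tag in tags)
    # close in reverse order so the tags nest correctly
    closing = "".join(f"</{tag}>" for tag in reversed(tags))
    # quote=False escapes only &, < and >; quote characters are left alone
    return opening + escape(text, quote=False) + closing


# two consecutive bold paragraphs: each one opens and closes its own <b> tag
paragraphs = [render_paragraph(p, ["b"]) for p in ("paragraph 1", "a < b")]
print(paragraphs)
```

Because each paragraph carries its own opening and closing tags, any single paragraph can be pulled out of the nested lists and rendered on its own.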

Return Value

Function docx2python returns an object with several attributes.

header - contents of the docx headers in the return format described herein

footer - contents of the docx footers in the return format described herein

body - contents of the docx in the return format described herein

document - header + body + footer (read only)

text - all docx text as one string, similar to what you'd get from python-docx2txt

properties - docx property names mapped to values (e.g., {"lastModifiedBy": "Shay Hill"})

images - image names mapped to images in binary format. Write to filesystem with

for name, image in result.images.items():
    with open(name, 'wb') as image_destination:
        image_destination.write(image)
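
If you only have the returned mapping and want the files in a specific folder, a helper along these lines may be convenient. This is a minimal sketch: write_images is a hypothetical name, not part of docx2python, and it assumes images maps file names to bytes as described above.

```python
from pathlib import Path


def write_images(images, directory):
    """Write each image into `directory` and return the paths written.

    `images` is assumed to map file names to raw bytes.
    """
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in images.items():
        # keep only the final path component in case a name contains folders
        path = target / Path(name).name
        path.write_bytes(data)
        written.append(path)
    return written
```

A call such as write_images(result.images, 'path/to/image_directory') would then write every extracted image into that folder, creating it if needed.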

Return Format

Some structure will be maintained. Text will be returned in a nested list, with paragraphs always at depth 4 (i.e., output.body[i][j][k][l] will be a paragraph).

If your docx has no tables, output.body will appear as one table with all contents in one cell:

[  # document
    [  # table
        [  # row
            [  # cell
                "Paragraph 1",
                "Paragraph 2",
                "-- bulleted list",
                "-- continuing bulleted list",
                "1)  numbered list",
                "2)  continuing numbered list",
                "    a)  sublist",
                "        i)  sublist of sublist",
                "3)  keeps track of indention levels",
                "    a)  resets sublist counters"
            ]
        ]
    ]
]
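
Because paragraphs always sit at depth 4, walking the structure takes three nested loops. The sketch below is illustrative only: iter_paragraphs is a hypothetical helper, not a docx2python function, and the sample body is made up to match the shape shown above.

```python
def iter_paragraphs(tables):
    """Yield every depth-4 string in document order."""
    for table in tables:
        for row in table:
            for cell in row:
                # a cell is a flat list of paragraph strings
                yield from cell


# sample data in the depth-4 shape: document > table > row > cell > paragraph
body = [[[["Paragraph 1", "Paragraph 2", "-- bulleted list"]]]]
print(list(iter_paragraphs(body)))
```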

Table cells will appear as table cells. Text outside tables will appear as table cells.

To preserve the even depth (text always at depth 4), nested tables will appear as new, top-level tables. This is clearer with an example:

#  docx structure

[  # document
    [  # table A
        [  # table A row
            [  # table A cell 1
                "paragraph in table A cell 1"
            ],
            [  # nested table B
                [  # table B row
                    [  # table B cell
                        "paragraph in table B"
                    ]
                ]
            ],
            [  # table A cell 2
                'paragraph in table A cell 2'
            ]
        ]
    ]
]

becomes ...

[  # document
    [  # table A
        [  # row in table A
            [  # cell in table A
                "paragraph in table A cell 1"
            ]
        ]
    ],
    [  # table B
        [  # row in table B
            [  # cell in table B
                "paragraph in table B"
            ]
        ]
    ],
    [  # table C
        [  # row in table C
            [  # cell in table C
                "paragraph in table A cell 2"
            ]
        ]
    ]
]

This ensures text appears

  1. only once
  2. in the order it appears in the docx
  3. always at depth four (i.e., result.body[i][j][k][l] will be a string).
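
The flattening rule can be sketched in a few lines. This is an illustration only and is not taken from the docx2python source: flatten_tables, _split_table, and _is_cell are hypothetical names, and the sketch assumes that a cell is a list of strings while a nested table is a list of rows sitting where a cell would be.

```python
def _is_cell(item):
    """Assumption: a cell is a (possibly empty) list of paragraph strings."""
    return all(isinstance(x, str) for x in item)


def _split_table(table):
    """Split one table into top-level tables wherever a nested table occurs."""
    tables, rows = [], []
    for row in table:
        cells = []
        for item in row:
            if _is_cell(item):
                cells.append(item)
                continue
            # nested table: close whatever is open, then emit the nested table
            if cells:
                rows.append(cells)
                cells = []
            if rows:
                tables.append(rows)
                rows = []
            tables.extend(_split_table(item))
        if cells:
            rows.append(cells)
    if rows:
        tables.append(rows)
    return tables


def flatten_tables(document):
    """Return a new document in which every paragraph sits at depth 4."""
    flat = []
    for table in document:
        flat.extend(_split_table(table))
    return flat


nested = [[[
    ["paragraph in table A cell 1"],
    [[["paragraph in table B"]]],
    ["paragraph in table A cell 2"],
]]]
print(flatten_tables(nested))
```

The cell that follows the nested table starts a new top-level table, which is what keeps the text in document order without repeating any of it.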

Working with output

This package provides several documented helper functions in the docx2python.iterators module. Here are a few recipes possible with these functions:

from docx2python.iterators import enum_cells

def remove_empty_paragraphs(tables):
    # modifies tables in place; returns None
    for (i, j, k), cell in enum_cells(tables):
        tables[i][j][k] = [x for x in cell if x]

>>> tables = [[[['a', 'b'], ['a', '', 'd', '']]]]
>>> remove_empty_paragraphs(tables)
>>> tables
[[[['a', 'b'], ['a', 'd']]]]

from docx2python.iterators import enum_at_depth

def html_map(tables) -> str:
    """Create an HTML map of document contents.

    Render this in a browser to visually search for data.
    Note: the nested lists in tables are modified in place.
    """
    # prepend index tuple to each paragraph
    for (i, j, k, l), paragraph in enum_at_depth(tables, 4):
        tables[i][j][k][l] = " ".join([str((i, j, k, l)), paragraph])

    # wrap each paragraph in <pre> tags
    for (i, j, k), cell in enum_at_depth(tables, 3):
        tables[i][j][k] = "".join([f"<pre>{x}</pre>" for x in cell])

    # wrap each cell in <td> tags
    for (i, j), row in enum_at_depth(tables, 2):
        tables[i][j] = "".join([f"<td>{x}</td>" for x in row])

    # wrap each row in <tr> tags
    for (i,), table in enum_at_depth(tables, 1):
        tables[i] = "".join(f"<tr>{x}</tr>" for x in table)

    # wrap each table in <table> tags
    tables = "".join([f'<table border="1">{x}</table>' for x in tables])

    return "<html><body>" + tables + "</body></html>"

>>> tables = [[[['a', 'b'], ['a', 'd']]]]
>>> print(html_map(tables))

The result is a single line of HTML, shown here indented for readability:

<html>
    <body>
        <table border="1">
            <tr>
                <td>
                    <pre>(0, 0, 0, 0) a</pre>
                    <pre>(0, 0, 0, 1) b</pre>
                </td>
                <td>
                    <pre>(0, 0, 1, 0) a</pre>
                    <pre>(0, 0, 1, 1) d</pre>
                </td>
            </tr>
        </table>
    </body>
</html>
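
To make the index tuples in these recipes concrete, here is a rough standalone equivalent of a depth enumerator. It is a sketch of the idea only, not the code in docx2python.iterators; enumerate_at_depth is a hypothetical name.

```python
def enumerate_at_depth(nested, depth):
    """Yield (index_tuple, item) for every item `depth` levels into `nested`."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    for i, item in enumerate(nested):
        if depth == 1:
            yield (i,), item
        else:
            # recurse one level down and prefix this level's index
            for index, deeper in enumerate_at_depth(item, depth - 1):
                yield (i,) + index, deeper


tables = [[[["a", "b"], ["a", "d"]]]]
for index, paragraph in enumerate_at_depth(tables, 4):
    print(index, paragraph)
```

At depth 4 the items are paragraphs, at depth 3 cells, at depth 2 rows, and at depth 1 tables, which is why the recipes above can rewrite the structure from the inside out.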

