docx2python·PyPI

Extract content from docx files

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

#docx2python

Extract docx headers, footers, text, properties, and images to a Python object.

shared features (with python-docx2txt):

extracts text from docx files
extracts images from docx files
no dependencies (docx2python requires pytest to test)

additions:

converts bullets and numbered lists to ascii with indentation
retains some structure of the original file (more below)
extracts document properties (creator, lastModifiedBy, etc.)
inserts image placeholders in text ('----image1.jpg----')
(optionally) retains font size, font color, bold, italics, and underline as html
full test coverage

subtractions:

no command-line interface
will only work with later versions of Python

#Installation

pip install docx2python

#Use

from docx2python import docx2python

# extract docx content
docx2python('path/to/file.docx')

# extract docx content, write images to image_directory
docx2python('path/to/file.docx', 'path/to/image_directory')

# extract docx content with basic font styles converted to html
docx2python('path/to/file.docx', html=True)

Note on html feature:

font size, font color, bold, italics, and underline supported
every tag opened in a paragraph will be closed in that paragraph (and, where appropriate, reopened in the next paragraph). If two consecutive paragraphs are bold, they will be returned as <b>paragraph 1</b>, <b>paragraph 2</b>. This is intentional, to make each paragraph its own entity.
if you specify html=True, > and < in your docx text will be encoded as &gt; and &lt;
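
The per-paragraph tag behaviour described above can be illustrated with a minimal sketch. This is not docx2python's implementation; `paragraph_to_html` and `paragraphs_to_html` are hypothetical helpers, and only bold, italics, and underline are modelled.

```python
def paragraph_to_html(text, bold=False, italic=False, underline=False):
    """Wrap ONE paragraph in its own tags (hypothetical helper)."""
    # Encode angle brackets so they cannot be mistaken for tags.
    out = text.replace("<", "&lt;").replace(">", "&gt;")
    if underline:
        out = f"<u>{out}</u>"
    if italic:
        out = f"<i>{out}</i>"
    if bold:
        out = f"<b>{out}</b>"
    return out


def paragraphs_to_html(paragraphs, **styles):
    """Style each paragraph independently: tags never span two paragraphs."""
    return [paragraph_to_html(p, **styles) for p in paragraphs]


html_paragraphs = paragraphs_to_html(["paragraph 1", "a < b"], bold=True)
```

Each paragraph closes every tag it opens, so any single list item can be used on its own without tracking style state from its neighbours.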

#Return Value

Function docx2python returns an object with several attributes.

header - contents of the docx headers in the return format described herein

footer - contents of the docx footers in the return format described herein

body - contents of the docx in the return format described herein

document - header + body + footer

text - all docx text as one string, similar to what you'd get from python-docx2txt

tables - all docx text as simple html tables

properties - docx property names mapped to values (e.g., {"lastModifiedBy": "Shay Hill"})

images - image names mapped to images in binary format. Write to filesystem with

for name, image in result.images.items():
    with open(name, 'wb') as image_destination:
        image_destination.write(image)
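
If the images should land in a specific directory, the same loop can be wrapped in a small helper. This is a sketch only: `write_images` is a hypothetical name, and a plain dict of `bytes` stands in for `result.images`.

```python
from pathlib import Path


def write_images(images, directory):
    """Write a {name: bytes} mapping into `directory`; return the paths written."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    written = []
    for name, data in images.items():
        path = target / name
        path.write_bytes(data)  # images are raw binary, so write bytes
        written.append(path)
    return written
```

With a real result object, the call would presumably look like `write_images(result.images, 'path/to/image_directory')`.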

#Return Format

Some structure will be maintained. Text will be returned in a nested list, with paragraphs always at depth 4 (i.e., output.body[i][j][k][l] will be a paragraph).

If your docx has no tables, output.body will appear as one table with all contents in one cell:

[  # document
    [  # table
        [  # row
            [  # cell
                "Paragraph 1",
                "Paragraph 2",
                "-- bulleted list",
                "-- continuing bulleted list",
                "1)  numbered list",
                "2)  continuing numbered list",
                "    a)  sublist",
                "        i)  sublist of sublist",
                "3)  keeps track of indentation levels",
                "    a)  resets sublist counters"
            ]
        ]
    ]
]
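
The numbered-list lines in the example above follow a pattern that can be sketched in a few lines. This is an illustration under assumptions, not the library's code: `number_list` is a hypothetical helper, the label styles (decimal, letter, roman) and four-space indent are read off the example, and the input is a list of (level, text) pairs.

```python
import string


def to_roman(n):
    """Lowercase roman numeral for small positive integers (1..39)."""
    out = ""
    for value, numeral in [(10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]:
        while n >= value:
            out += numeral
            n -= value
    return out


# One label style per nesting level, cycling after three levels.
LABELS = [
    str,
    lambda n: string.ascii_lowercase[(n - 1) % 26],
    to_roman,
]


def number_list(items):
    """Format (level, text) pairs as indented, numbered ASCII lines."""
    counters = []
    lines = []
    for level, text in items:
        del counters[level + 1:]  # leaving a sublist resets its counters
        while len(counters) <= level:
            counters.append(0)
        counters[level] += 1
        label = LABELS[level % len(LABELS)](counters[level])
        lines.append("    " * level + f"{label})  {text}")
    return lines
```

Dropping the counters for deeper levels whenever a shallower item appears is what makes the second sublist restart at "a)".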

Table cells will appear as table cells. Text outside tables will appear as table cells.
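
Because every paragraph sits at depth 4 whether or not it came from a table, the structure can be walked with four uniform loops. A minimal sketch, with `iter_paragraphs` as a hypothetical helper and a literal nested list standing in for `output.body`:

```python
def iter_paragraphs(body):
    """Yield every paragraph string from a tables > rows > cells > paragraphs list."""
    for table in body:
        for row in table:
            for cell in row:
                for paragraph in cell:
                    yield paragraph


# Stand-in for output.body: text outside tables, then a real two-cell table.
body = [
    [[["Paragraph 1", "Paragraph 2"]]],
    [[["left cell"], ["right cell"]]],
]
all_paragraphs = list(iter_paragraphs(body))
```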

To preserve the even depth (text always at depth 4), nested tables will appear as new, top-level tables. This is clearer with an example:

#  docx structure

[  # document
    [  # table A
        [  # table A row
            [  # table A cell 1
                "paragraph in table A cell 1"
            ],
            [  # nested table B
                [  # table B row
                    [  # table B cell
                        "paragraph in table B"
                    ]
                ]
            ],
            [  # table A cell 2
                'paragraph in table A cell 2'
            ]
        ]
    ]
]

becomes ...

[  # document
    [  # table A
        [  # row in table A
            [  # cell in table A
                "paragraph in table A cell 1"
            ]
        ]
    ],
    [  # table B
        [  # row in table B
            [  # cell in table B
                "paragraph in table B"
            ]
        ]
    ],
    [  # table C
        [  # row in table C
            [  # cell in table C
                "paragraph in table A cell 2"
            ]
        ]
    ]
]

This ensures text appears

1) only once
2) in the order it appears on the docx
3) always at depth four (i.e., result.body[i][j][k][l] will be a string).
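
The flattening shown in the example above can be sketched as follows. This is one possible reading, not docx2python's actual algorithm: `flatten_tables` and its helpers are hypothetical, and the sketch assumes a cell is a list of strings while a nested table is a list of rows.

```python
def is_cell(item):
    """A cell holds only paragraph strings; anything else is a nested table."""
    return all(isinstance(paragraph, str) for paragraph in item)


def split_table(table):
    """Split one table into a list of flat tables, in document order."""
    result = []
    rows = []  # rows collected for the flat table currently being built
    for row in table:
        cells = []
        for item in row:
            if is_cell(item):
                cells.append(item)
                continue
            # Nested table: close whatever was collected so far ...
            if cells:
                rows.append(cells)
                cells = []
            if rows:
                result.append(rows)
                rows = []
            # ... then emit the nested table(s) as top-level tables.
            result.extend(split_table(item))
        if cells:
            rows.append(cells)
    if rows:
        result.append(rows)
    return result


def flatten_tables(document):
    """Return a document in which every paragraph sits at depth four."""
    flat = []
    for table in document:
        flat.extend(split_table(table))
    return flat
```

Applied to the nested structure above, this yields three top-level tables (A, B, C), each with one row and one cell.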

Project details


Release history

3.5.0

Feb 3, 2025

3.4.1

Feb 2, 2025

3.4.0

Feb 2, 2025

3.3.1

Feb 1, 2025

3.3.0

Dec 5, 2024

3.2.2

Nov 21, 2024

3.1.0

Nov 15, 2024

3.0.2

Sep 26, 2024

3.0.0

Jul 27, 2024

2.10.1

Apr 3, 2024

2.8.0

Jan 21, 2024

2.7.3

Jun 17, 2023

2.7.2

Jun 16, 2023

2.6.3

Apr 27, 2023

2.6.0

Feb 3, 2023

2.5.1

Feb 3, 2023

2.5.0

Jan 23, 2023

2.4.0

Jan 23, 2023

2.3.0

Jan 19, 2023

2.0.5

Dec 21, 2022

2.0.4

Mar 1, 2022

2.0.3

Dec 30, 2021

2.0.2

Dec 23, 2021

2.0.1

Dec 23, 2021

2.0.0

Dec 22, 2021

1.27.1

Nov 15, 2020

1.27

Nov 2, 2020

1.26

Oct 5, 2020

1.25

Aug 19, 2020

1.24

Jun 17, 2020

1.23

Apr 19, 2020

1.22

Apr 3, 2020

1.21

Feb 3, 2020

1.19

Oct 15, 2019

1.18

Jul 17, 2019

1.17

Jul 17, 2019

1.16

Jul 17, 2019

1.15

Jul 17, 2019

1.14

Jul 17, 2019

1.13

Jul 17, 2019

1.12

Jul 17, 2019

1.11

Jul 17, 2019

1.2

Jul 10, 2019

1.1

Jul 8, 2019

1.0

Jul 8, 2019

This version

0.1

Jul 7, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

docx2python-0.1.tar.gz (21.9 kB view details)

Uploaded Jul 7, 2019 Source

Built Distribution

docx2python-0.1-py3-none-any.whl (17.9 kB view details)

Uploaded Jul 7, 2019 Python 3

File details

Details for the file docx2python-0.1.tar.gz.

File metadata

Download URL: docx2python-0.1.tar.gz
Upload date: Jul 7, 2019
Size: 21.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3

File hashes

Hashes for docx2python-0.1.tar.gz
Algorithm	Hash digest
SHA256	`8d381bdc55799d521a9ac4c5e6156f8652d1ee120006a29927c2b415fa109dbb`
MD5	`7cc447cd3d6e14a9c52184ef43e77f33`
BLAKE2b-256	`7dc217e21087a7bf0ec76c916ec97eae7b6e65bc8db7ca70abc20d93b4088962`


File details

Details for the file docx2python-0.1-py3-none-any.whl.

File metadata

Download URL: docx2python-0.1-py3-none-any.whl
Upload date: Jul 7, 2019
Size: 17.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.7.3

File hashes

Hashes for docx2python-0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`74cd8a787b442cd2d2c923b62ca8335c47c1c88e396b8cb280837a2f9618cfc8`
MD5	`20446d664750f7b01fd70de9de1a9f79`
BLAKE2b-256	`9592ae96875d51b23ae14a0bed597a56592c0ed7ead8d09f865dc6dcf4a6e0db`

