A pure Python utility to extract text, hyperlinks and images from docx files.
This project is forked from ankushshah89/python-docx2txt. It adds one feature: extracting hyperlinks together with their corresponding text.
The code is taken and adapted from python-docx. In addition to body text, it can extract text from headers, footers and hyperlinks, and it can also extract images.
How to install?
pip install docxpy
How to run?
- From command line:
    # extract text
    docx2txt file.docx
    # extract text and images
    docx2txt -i /tmp/img_dir file.docx
- From python:
    import docxpy

    file = 'file.docx'

    # extract text
    text = docxpy.process(file)

    # extract text and write images in /tmp/img_dir
    text = docxpy.process(file, "/tmp/img_dir")

    # if you want the hyperlinks
    doc = docxpy.DOCReader(file)
    doc.process()  # process file
    hyperlinks = doc.data['links']
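For readers curious about what hyperlink extraction involves, here is a minimal sketch using only the standard library. It is not docxpy's implementation, and the function name `extract_hyperlinks` is hypothetical. It assumes the usual docx layout, where `word/document.xml` holds `w:hyperlink` elements whose `r:id` attribute points to a URL listed in `word/_rels/document.xml.rels`, and it only looks at the main document body (not headers or footers).

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by WordprocessingML and the package relationships part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def extract_hyperlinks(docx_path):
    """Return (text, url) pairs for external hyperlinks in the document body.

    Hypothetical helper for illustration; docxpy's own output format may differ.
    """
    with zipfile.ZipFile(docx_path) as archive:
        rels_root = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        doc_root = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship ids (e.g. "rId1") to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG_REL_NS}}}Relationship")
    }

    links = []
    for hyperlink in doc_root.iter(f"{{{W_NS}}}hyperlink"):
        rel_id = hyperlink.get(f"{{{R_NS}}}id")
        if rel_id not in targets:
            continue  # e.g. internal bookmark links, which carry no r:id
        # The visible link text may be split across several runs; join them.
        text = "".join(t.text or "" for t in hyperlink.iter(f"{{{W_NS}}}t"))
        links.append((text, targets[rel_id]))
    return links
```

Calling `extract_hyperlinks('file.docx')` returns a list of `(text, url)` tuples in document order. The link text is joined across runs because Word often splits a single visible link into several `w:r` elements.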