A pure Python utility to extract text, hyperlinks and images from docx files.
Project description
This project is forked from ankushshah89/python-docx2txt. It adds one new feature: extracting hyperlinks together with their corresponding text.
It is a pure Python utility for extracting text from docx files. The code is taken and adapted from python-docx. In addition to body text, it can extract text from headers, footers and hyperlinks, and it can also extract images.
How to install?
pip install docxpy
How to run?
From command line:
# extract text
docx2txt file.docx
# extract text and images
docx2txt -i /tmp/img_dir file.docx
From Python:
import docxpy
file = 'file.docx'
# extract text
text = docxpy.process(file)
# extract text and write images in /tmp/img_dir
text = docxpy.process(file, "/tmp/img_dir")
# if you want the hyperlinks
doc = docxpy.DOCReader(file)
doc.process() # process file
hyperlinks = doc.data['links']
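The layout of doc.data['links'] is not described here, so the following is only a minimal sketch of how hyperlinks are stored inside a docx file in general, using nothing but the standard library. It is not docxpy's implementation. The helper names extract_hyperlinks and build_sample_docx are hypothetical, and the sample archive contains only the two parts the sketch reads, so it is not a complete docx that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by WordprocessingML and package relationships.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def extract_hyperlinks(docx_file):
    """Return (text, url) pairs for hyperlinks in the main document part.

    Hypothetical helper for illustration; not part of docxpy.
    """
    with zipfile.ZipFile(docx_file) as archive:
        document = ET.fromstring(archive.read("word/document.xml"))
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))

    # Map relationship ids (e.g. "rId1") to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{REL_NS}}}Relationship")
    }

    links = []
    for link in document.iter(f"{{{W_NS}}}hyperlink"):
        rel_id = link.get(f"{{{R_NS}}}id")
        if rel_id not in targets:
            continue  # internal anchor or dangling reference
        # The visible link text is spread over one or more <w:t> runs.
        text = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
        links.append((text, targets[rel_id]))
    return links


def build_sample_docx():
    """Build a tiny in-memory archive holding only the two parts read above."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}"><w:body><w:p>'
        '<w:hyperlink r:id="rId1"><w:r><w:t>Example</w:t></w:r></w:hyperlink>'
        "</w:p></w:body></w:document>"
    )
    rels_xml = (
        f'<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{R_NS}/hyperlink" '
        'Target="https://example.com" TargetMode="External"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
        archive.writestr("word/_rels/document.xml.rels", rels_xml)
    buffer.seek(0)
    return buffer


sample_links = extract_hyperlinks(build_sample_docx())
print(sample_links)
```

The sketch only looks at word/document.xml. Hyperlinks in headers and footers live in separate parts with their own relationship files, which a fuller version would also need to read.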
Source Distribution: docxpy-0.8.5.tar.gz (4.1 kB)