
A pure Python utility to extract text, hyperlinks and images from docx files.

Project description


This project is forked from ankushshah89/python-docx2txt. It adds one new feature: extracting hyperlinks together with their corresponding text.

It is a pure Python utility for extracting text from docx files. The code is taken and adapted from python-docx. In addition to the body text, it can extract text from headers, footers and hyperlinks, and it can also extract images.
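
For readers unfamiliar with the format, the sketch below illustrates what "extracting text from a docx file" involves. It is not docxpy's own code. It relies only on the fact that a docx file is a zip archive whose body lives in word/document.xml, with paragraphs in w:p elements and text runs in w:t elements. The helper names build_sample_docx and extract_paragraphs are hypothetical, and the sample archive is a minimal stand-in rather than a complete, valid docx file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_docx():
    """Build a tiny in-memory archive holding only word/document.xml.

    Hypothetical sample data: enough for this sketch, not a full .docx.
    """
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:r><w:t>Hello</w:t></w:r>'
        '<w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
    buf.seek(0)
    return buf


def extract_paragraphs(docx_file):
    """Return the text of each paragraph in the document body."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Join every text run that belongs to this paragraph.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


paragraphs = extract_paragraphs(build_sample_docx())
print("\n".join(paragraphs))
```

Headers and footers are stored in separate XML parts of the same archive, so a fuller version would need to read those parts as well.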

How to install?

pip install docxpy

How to run?

  1. From command line:
# extract text
docx2txt file.docx
# extract text and images
docx2txt -i /tmp/img_dir file.docx
  2. From Python:
import docxpy

file = 'file.docx'

# extract text
text = docxpy.process(file)

# extract text and write images in /tmp/img_dir
text = docxpy.process(file, "/tmp/img_dir")


# if you want the hyperlinks
doc = docxpy.DOCReader(file)
doc.process()  # process file
hyperlinks = doc.data['links']
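
The README does not say how hyperlinks are resolved or what shape doc.data['links'] has. As a rough illustration of the underlying idea only, the standard-library sketch below pairs each w:hyperlink element in word/document.xml with its target URL from word/_rels/document.xml.rels. The helper names, the sample document, the example.com URL and the (text, url) tuple layout are all assumptions made for this sketch, not docxpy's API.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = R_NS + "/hyperlink"


def build_sample_docx():
    """Build a minimal in-memory archive with one hyperlink (hypothetical data)."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}"><w:body><w:p>'
        '<w:r><w:t>See </w:t></w:r>'
        '<w:hyperlink r:id="rId1"><w:r><w:t>the docs</w:t></w:r></w:hyperlink>'
        '</w:p></w:body></w:document>'
    )
    rels_xml = (
        f'<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{HYPERLINK_TYPE}" '
        'Target="https://example.com/docs" TargetMode="External"/>'
        '</Relationships>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
        zf.writestr("word/_rels/document.xml.rels", rels_xml)
    buf.seek(0)
    return buf


def extract_links(docx_file):
    """Return (link text, target URL) pairs; the tuple shape is this sketch's choice."""
    with zipfile.ZipFile(docx_file) as zf:
        document = ET.fromstring(zf.read("word/document.xml"))
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
    # Map relationship ids to URLs, keeping only hyperlink relationships.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{REL_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }
    links = []
    for link in document.iter(f"{{{W_NS}}}hyperlink"):
        rel_id = link.get(f"{{{R_NS}}}id")
        if rel_id in targets:
            text = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
            links.append((text, targets[rel_id]))
    return links


links = extract_links(build_sample_docx())
print(links)
```

When using docxpy itself, inspect doc.data['links'] on a real document to confirm its actual structure before relying on it.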


Files for docxpy, version 0.8.5
Filename, size                  File type   Python version
docxpy-0.8.5.tar.gz (4.1 kB)    Source      None
