
A pure Python utility to extract text, hyperlinks, and images from docx files.

Project description


This project is forked from ankushshah89/python-docx2txt. It adds one new feature: extracting hyperlinks together with their corresponding text.

It is a pure Python utility for extracting text from docx files. The code is taken and adapted from python-docx. In addition to body text, it can extract text from headers, footers, and hyperlinks, and it can also extract images.
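To illustrate where header and footer text lives, here is a minimal sketch using only the standard library. It is not docxpy's implementation: the function names `part_text`, `docx_text`, and `make_sample` are hypothetical, and the sketch assumes the usual docx layout in which headers, the body, and footers are stored as `word/header*.xml`, `word/document.xml`, and `word/footer*.xml` inside a zip archive. It ignores complications such as text boxes and tables.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by <w:p> (paragraph) and <w:t> (text) elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def part_text(xml_bytes):
    """Return the text of one XML part, one line per <w:p> paragraph."""
    root = ET.fromstring(xml_bytes)
    lines = []
    for para in root.iter(f"{{{W_NS}}}p"):
        lines.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return "\n".join(lines)


def docx_text(source):
    """Collect text from headers, then the body, then footers (hypothetical helper)."""
    patterns = (r"word/header\d*\.xml", r"word/document\.xml", r"word/footer\d*\.xml")
    chunks = []
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
        for pattern in patterns:
            for name in sorted(n for n in names if re.fullmatch(pattern, n)):
                chunks.append(part_text(archive.read(name)))
    return "\n".join(chunks)


def make_sample():
    """Build a tiny docx-like archive in memory, for demonstration only."""
    def part(root_tag, text):
        return (
            f'<w:{root_tag} xmlns:w="{W_NS}">'
            f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>"
            f"</w:{root_tag}>"
        )

    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Body text</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/header1.xml", part("hdr", "Header"))
        archive.writestr("word/document.xml", document)
        archive.writestr("word/footer1.xml", part("ftr", "Footer"))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(docx_text(make_sample()))
```

The sample archive is not a complete Word document (it lacks content types and relationships), so it is useful only for exercising the sketch, not for opening in Word.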

How to install?

pip install docxpy

How to run?

  1. From the command line:

# extract text
docx2txt file.docx
# extract text and images
docx2txt -i /tmp/img_dir file.docx
  2. From Python:

import docxpy

file = 'file.docx'

# extract text
text = docxpy.process(file)

# extract text and write images in /tmp/img_dir
text = docxpy.process(file, "/tmp/img_dir")


# if you want the hyperlinks
doc = docxpy.DOCReader(file)
doc.process()  # process file
hyperlinks = doc.data['links']
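For readers curious about how link text and URLs can be paired, the following is a rough standard-library sketch, not docxpy's actual code and not a statement about the exact shape of `doc.data['links']`. It assumes the standard docx layout: each `<w:hyperlink>` element in `word/document.xml` carries an `r:id` attribute, and `word/_rels/document.xml.rels` maps that id to the target URL. The names `extract_links` and `make_sample` are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def extract_links(source):
    """Return (text, url) pairs for hyperlinks in the main document part."""
    with zipfile.ZipFile(source) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        body = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship ids (e.g. "rId1") to their hyperlink targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG_NS}}}Relationship")
        if rel.get("Type", "").endswith("/hyperlink")
    }

    links = []
    for link in body.iter(f"{{{W_NS}}}hyperlink"):
        rel_id = link.get(f"{{{R_NS}}}id")
        if rel_id in targets:
            # The visible link text is spread across the <w:t> runs inside the link.
            text = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
            links.append((text, targets[rel_id]))
    return links


def make_sample():
    """Build a tiny docx-like archive in memory, for demonstration only."""
    document = (
        f'<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}"><w:body><w:p>'
        '<w:hyperlink r:id="rId1"><w:r><w:t>Example</w:t></w:r></w:hyperlink>'
        "</w:p></w:body></w:document>"
    )
    rels = (
        f'<Relationships xmlns="{PKG_NS}">'
        f'<Relationship Id="rId1" Type="{R_NS}/hyperlink" '
        'Target="https://example.com/" TargetMode="External"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
        archive.writestr("word/_rels/document.xml.rels", rels)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(extract_links(make_sample()))
```

Internal links that point at bookmarks use a `w:anchor` attribute instead of `r:id`, so this sketch skips them.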


