A pure-Python utility for extracting DOCX files and converting them to other formats, including plain text and Markdown

docx2everything

Convert DOCX files to plain text or Markdown while preserving document structure.

Installation

pip install docx2everything

Or install from source:

# Modern way (recommended)
pip install .

# Or using setup.py (deprecated but still works)
python setup.py install

Testing Without Installation

The CLI script runs directly from a source checkout; no installation or PYTHONPATH setup is needed.

Using CLI (no installation required):

# Extract text
python3 bin/docx2everything demo.docx

# Convert to markdown
python3 bin/docx2everything --markdown demo.docx > output.md

# With images
python3 bin/docx2everything --markdown -i images/ demo.docx > output.md

Using Python:

# Set PYTHONPATH to current directory
PYTHONPATH=. python3 -c "import docx2everything; print(docx2everything.process('demo.docx')[:100])"

In a Python script:

import sys
# Add the source checkout to the import path (adjust to your local path)
sys.path.insert(0, '/path/to/python-docx2txt')

import docx2everything
text = docx2everything.process('document.docx')

Usage

Command Line

Extract plain text:

docx2everything document.docx

Convert to markdown:

docx2everything --markdown document.docx > output.md

Extract images:

docx2everything -i images/ document.docx

Markdown with images:

docx2everything --markdown -i images/ document.docx > output.md
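
To drive the CLI from another Python program, the commands above can be assembled with `subprocess`. This is a minimal sketch, not part of the package: the helper names `build_command` and `run_cli` are hypothetical, and it assumes the `docx2everything` executable is on `PATH` and writes its result to standard output, as the redirect examples suggest.

```python
import subprocess
from typing import List, Optional


def build_command(docx_path: str, markdown: bool = False,
                  img_dir: Optional[str] = None) -> List[str]:
    """Assemble the argument list for the docx2everything CLI.

    Only the flags shown in this README (--markdown and -i) are used.
    """
    cmd = ["docx2everything"]
    if markdown:
        cmd.append("--markdown")
    if img_dir is not None:
        cmd.extend(["-i", img_dir])
    cmd.append(docx_path)
    return cmd


def run_cli(docx_path: str, markdown: bool = False,
            img_dir: Optional[str] = None) -> str:
    """Run the CLI and return whatever it prints to stdout.

    Assumes the docx2everything executable is installed and on PATH.
    Raises subprocess.CalledProcessError on a non-zero exit status.
    """
    result = subprocess.run(
        build_command(docx_path, markdown, img_dir),
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.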

Python API

import docx2everything

# Extract plain text
text = docx2everything.process("document.docx")

# Convert to markdown
markdown = docx2everything.process_to_markdown("document.docx")

# Extract plain text and save embedded images to images/
text = docx2everything.process("document.docx", img_dir="images/")

# Markdown with images
markdown = docx2everything.process_to_markdown("document.docx", img_dir="images/")
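
The calls above can be combined into a small batch converter. The sketch below is illustrative only: `convert_folder` is a hypothetical helper, it assumes `process_to_markdown` returns the Markdown as a string, and it creates each image directory up front in case the library expects the directory to exist.

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, convert=None):
    """Convert every .docx file in src_dir to a .md file in out_dir.

    `convert` is any callable taking (docx_path, img_dir=...) and returning
    Markdown text. It defaults to docx2everything.process_to_markdown.
    Returns the list of Markdown files written.
    """
    if convert is None:
        # Imported lazily so the helper can be tested with a stand-in.
        import docx2everything
        convert = docx2everything.process_to_markdown

    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for docx_path in sorted(src.glob("*.docx")):
        # One image directory per document (a layout chosen for this sketch).
        img_dir = out / (docx_path.stem + "_images")
        img_dir.mkdir(exist_ok=True)

        markdown = convert(str(docx_path), img_dir=str(img_dir))

        target = out / (docx_path.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

For example, `convert_folder("reports/", "reports_md/")` would write one `.md` file per `.docx` file found directly in `reports/`.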

Features

  • ✅ Plain text extraction
  • ✅ Markdown conversion with preserved structure:
    • Tables → Markdown tables
    • Lists → Bulleted/numbered lists
    • Headings → Markdown headings (#, ##, ###)
    • Formatting → Bold, italic, strikethrough
    • Links → Markdown links
    • Images → Markdown image references
  • ✅ Image extraction
  • ✅ Header and footer support
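
A quick way to check that structure survived a conversion is to count the Markdown elements listed above. The following is a rough sketch under stated assumptions: `summarize_markdown` is a hypothetical helper, and it assumes the output uses common Markdown syntax (`#` headings, pipe-delimited table rows, `![alt](path)` images, `[text](url)` links), which may not match the library's exact rendering.

```python
import re

IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# A link is the same pattern without a leading "!".
LINK_RE = re.compile(r"(?<!!)\[[^\]]*\]\([^)]*\)")
HEADING_RE = re.compile(r"#{1,6}\s")


def summarize_markdown(markdown):
    """Count headings, table rows, images, and links in Markdown text.

    table_rows includes header separator rows such as |---|---|.
    """
    counts = {"headings": 0, "table_rows": 0, "images": 0, "links": 0}
    for line in markdown.splitlines():
        stripped = line.strip()
        if HEADING_RE.match(stripped):
            counts["headings"] += 1
        elif stripped.startswith("|"):
            counts["table_rows"] += 1
        counts["images"] += len(IMAGE_RE.findall(line))
        counts["links"] += len(LINK_RE.findall(line))
    return counts
```

Comparing these counts with what the original DOCX contains gives a coarse signal that tables, headings, and images were carried over.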

Requirements

Python 3.6+

License

MIT License - see LICENSE.txt
