A pure-Python utility that extracts the content of DOCX files and converts it to other formats, including plain text and Markdown

docx2everything

Convert DOCX files to plain text or markdown format with preserved structure.

Installation

pip install docx2everything

Or install from source:

# Modern way (recommended)
pip install .

# Or using setup.py (deprecated but still works)
python setup.py install

Testing Without Installation

The CLI script runs directly from a source checkout: no installation and no PYTHONPATH setup is needed.

Using CLI (no installation required):

# Extract text
python3 bin/docx2everything demo.docx

# Convert to markdown
python3 bin/docx2everything --markdown demo.docx > output.md

# With images
python3 bin/docx2everything --markdown -i images/ demo.docx > output.md

Using Python:

# Set PYTHONPATH to current directory
PYTHONPATH=. python3 -c "import docx2everything; print(docx2everything.process('demo.docx')[:100])"

In Python script:

import sys
# Point this at your source checkout (the directory that contains the docx2everything package)
sys.path.insert(0, '/path/to/python-docx2txt')

import docx2everything
text = docx2everything.process('document.docx')

Usage

Command Line

Extract plain text:

docx2everything document.docx

Convert to markdown:

docx2everything --markdown document.docx > output.md

Extract images:

docx2everything -i images/ document.docx

Markdown with images:

docx2everything --markdown -i images/ document.docx > output.md
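The same commands can be driven from a script. The following is a minimal sketch, assuming the `docx2everything` executable is on your PATH and writes its result to standard output (as the `> output.md` redirections above suggest). The helper names `build_command` and `convert` are hypothetical and not part of the package; only the `--markdown` and `-i` flags come from the usage shown above.

```python
import subprocess
from typing import List, Optional


def build_command(docx_path: str, markdown: bool = False,
                  img_dir: Optional[str] = None) -> List[str]:
    """Assemble a docx2everything command line (hypothetical helper)."""
    cmd = ["docx2everything"]
    if markdown:
        cmd.append("--markdown")        # flag documented in the usage section
    if img_dir is not None:
        cmd.extend(["-i", img_dir])     # flag documented in the usage section
    cmd.append(docx_path)
    return cmd


def convert(docx_path: str, out_path: str, markdown: bool = True,
            img_dir: Optional[str] = None) -> None:
    """Run the CLI and capture its standard output in out_path.

    Assumes the executable is installed and prints the converted
    document to stdout.
    """
    cmd = build_command(docx_path, markdown=markdown, img_dir=img_dir)
    with open(out_path, "wb") as out:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, stdout=out, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.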

Python API

import docx2everything

# Extract plain text
text = docx2everything.process("document.docx")

# Convert to markdown
markdown = docx2everything.process_to_markdown("document.docx")

# Extract plain text and save embedded images to images/
text = docx2everything.process("document.docx", img_dir="images/")

# Markdown with images
markdown = docx2everything.process_to_markdown("document.docx", img_dir="images/")
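For more than one file, the API calls above can be wrapped in a small batch loop. This is a sketch only: `convert_directory` is a hypothetical helper, and it assumes `process_to_markdown` returns the Markdown as a `str`, which the examples above imply but do not state. The converter is passed in as a parameter so the loop can be tried with a stand-in function.

```python
from pathlib import Path
from typing import Callable, List, Optional


def convert_directory(src_dir: str, out_dir: str,
                      convert: Optional[Callable[[str], str]] = None) -> List[Path]:
    """Convert every .docx file in src_dir to a .md file in out_dir.

    Hypothetical helper. By default it uses
    docx2everything.process_to_markdown, assumed to return a str.
    """
    if convert is None:
        import docx2everything  # imported lazily so a stand-in can be used
        convert = docx2everything.process_to_markdown

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for docx in sorted(Path(src_dir).glob("*.docx")):
        target = out / (docx.stem + ".md")
        target.write_text(convert(str(docx)), encoding="utf-8")
        written.append(target)
    return written
```

A call such as `convert_directory("reports/", "markdown/")` would then write one `.md` file per document, with the directory names here being placeholders.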

Features

  • ✅ Plain text extraction
  • ✅ Markdown conversion with preserved structure:
    • Tables → Markdown tables (with merged cells support, alignment hints)
    • Lists → Bulleted/numbered lists (with proper sequence tracking)
    • Headings → Markdown headings (#, ##, ###, etc.) with custom style detection
    • Formatting → Bold, italic, strikethrough
    • Links → Markdown links
    • Images → Markdown image references
    • Footnotes → Markdown footnote references [^1]
    • Endnotes → Markdown endnote references [^1]
    • Comments → Inline HTML comments with author info
    • Charts → Chart placeholders with type and metadata *[Chart: Title (Chart Type)]*
    • Page breaks → HTML comments <!-- Page Break -->
    • Section breaks → HTML comments <!-- Section Break -->
  • ✅ Image extraction
  • ✅ Header and footer support
  • ✅ Custom style detection (parses styles.xml for better heading detection)
  • ✅ Table formatting (column alignment detection and hints)
  • ✅ Robust error handling for malformed DOCX files
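Because page breaks are emitted as HTML comments, the Markdown output can be split back into pages after conversion. The snippet below is a minimal sketch, assuming the marker appears exactly as `<!-- Page Break -->` (allowing for variable whitespace inside the comment); `split_pages` is a hypothetical helper, not part of the package.

```python
import re
from typing import List

# Matches the page-break comment listed in the feature list.
PAGE_BREAK = re.compile(r"<!--\s*Page Break\s*-->")


def split_pages(markdown: str) -> List[str]:
    """Split converted Markdown into pages at page-break comments.

    Empty pages (for example, from consecutive breaks) are dropped.
    """
    parts = PAGE_BREAK.split(markdown)
    return [part.strip() for part in parts if part.strip()]
```

For example, `split_pages(docx2everything.process_to_markdown("document.docx"))` would give one string per page, provided the document contains explicit page breaks.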

Requirements

Python 3.6+

License

MIT License - see LICENSE.txt
