A pure-Python utility that extracts and converts DOCX files to various formats, including plain text and Markdown
docx2everything
Convert DOCX files to plain text or markdown format with preserved structure.
Installation
pip install docx2everything
Or install from source:
# Modern way (recommended)
pip install .
# Or using setup.py (deprecated but still works)
python setup.py install
Testing Without Installation
The CLI script works directly without installation - no PYTHONPATH needed!
Using CLI (no installation required):
# Extract text
python3 bin/docx2everything demo.docx
# Convert to markdown
python3 bin/docx2everything --markdown demo.docx > output.md
# With images
python3 bin/docx2everything --markdown -i images/ demo.docx > output.md
Using Python:
# Set PYTHONPATH to current directory
PYTHONPATH=. python3 -c "import docx2everything; print(docx2everything.process('demo.docx')[:100])"
In Python script:
import sys
sys.path.insert(0, '/path/to/python-docx2txt')
import docx2everything
text = docx2everything.process('document.docx')
Usage
Command Line
Extract plain text:
docx2everything document.docx
Convert to markdown:
docx2everything --markdown document.docx > output.md
Extract images:
docx2everything -i images/ document.docx
Markdown with images:
docx2everything --markdown -i images/ document.docx > output.md
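When the CLI has to be driven from another script, the flags shown above can be assembled programmatically. The following is a minimal sketch, assuming the `docx2everything` command is on `PATH` and writes its result to standard output (as the `> output.md` redirections suggest). `build_command` and `convert_via_cli` are hypothetical helper names, not part of the package.

```python
import subprocess
from typing import List, Optional


def build_command(docx_path: str, markdown: bool = False,
                  img_dir: Optional[str] = None) -> List[str]:
    """Assemble the argument list using only the flags documented above."""
    cmd = ["docx2everything"]
    if markdown:
        cmd.append("--markdown")
    if img_dir is not None:
        cmd += ["-i", img_dir]
    cmd.append(docx_path)
    return cmd


def convert_via_cli(docx_path: str, markdown: bool = False,
                    img_dir: Optional[str] = None) -> str:
    """Run the CLI and return whatever it prints to standard output.

    Assumption: the command is installed and exits non-zero on failure,
    in which case check=True raises CalledProcessError.
    """
    result = subprocess.run(
        build_command(docx_path, markdown, img_dir),
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.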
Python API
import docx2everything
# Extract plain text
text = docx2everything.process("document.docx")
# Convert to markdown
markdown = docx2everything.process_to_markdown("document.docx")
# Extract images
text = docx2everything.process("document.docx", img_dir="images/")
# Markdown with images
markdown = docx2everything.process_to_markdown("document.docx", img_dir="images/")
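The same API calls can be wrapped to convert a whole folder in one go. This is a sketch rather than a feature of the package: `convert_folder` is a hypothetical helper, and it assumes `process_to_markdown` returns the Markdown as a string. The `converter` parameter exists only so the helper can be exercised with a stand-in function.

```python
from pathlib import Path


def convert_folder(src_dir, out_dir, converter=None):
    """Convert every .docx file in src_dir to a .md file in out_dir.

    Returns the list of files written, in sorted order.
    """
    if converter is None:
        # Imported lazily so the helper can be tested without the package.
        import docx2everything
        converter = docx2everything.process_to_markdown

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for docx_path in sorted(Path(src_dir).glob("*.docx")):
        # Assumption: the converter returns the Markdown text as a str.
        markdown = converter(str(docx_path))
        target = out / (docx_path.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_folder("docs/", "markdown/")` would then write one `.md` file per DOCX file found directly in `docs/`; subfolders are not searched.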
Features
- ✅ Plain text extraction
- ✅ Markdown conversion with preserved structure:
  - Tables → Markdown tables
  - Lists → Bulleted/numbered lists
  - Headings → Markdown headings (#, ##, ###)
  - Formatting → Bold, italic, strikethrough
  - Links → Markdown links
  - Images → Markdown image references
- ✅ Image extraction
- ✅ Header and footer support
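Because headings are emitted as `#`-style Markdown headings, the converted text can be post-processed with ordinary string tools. As an illustration, the sketch below builds a document outline from Markdown text. `extract_outline` is a hypothetical helper, not part of docx2everything, and it assumes headings appear as a run of `#` characters followed by a space at the start of a line.

```python
import re

# One to six '#' characters, at least one space, then the heading title.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def extract_outline(markdown_text):
    """Return (level, title) pairs for each '#'-style heading in the text."""
    outline = []
    in_fence = False
    for line in markdown_text.splitlines():
        # Skip fenced code blocks so '#' comments inside them are ignored.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline
```

Feeding it the result of `docx2everything.process_to_markdown("document.docx")` would give a quick table of contents for the converted document.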
Requirements
Python 3.6+
License
MIT License - see LICENSE.txt
Download files
File details
Details for the file docx2everything-1.0.0.tar.gz.
File metadata
- Download URL: docx2everything-1.0.0.tar.gz
- Upload date:
- Size: 11.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 20ccd54173da789d4c7cba931b90fe4f87d3f353ac7d23ec9f3e50bc325a4863 |
| MD5 | 3d376a5a37d62ec5377de3daf91d109f |
| BLAKE2b-256 | c6dd4aac22ad67eb7bfa03d66c06d7cdb6e3dad4320a9f635d14a7953f5b1d39 |
File details
Details for the file docx2everything-1.0.0-py3-none-any.whl.
File metadata
- Download URL: docx2everything-1.0.0-py3-none-any.whl
- Upload date:
- Size: 13.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a7b7b7b7f03731e24ef27cee2c548d18c59be41cd526a41e8ab7c65c64c37153 |
| MD5 | 0435a3e62bb24f69928c877331866644 |
| BLAKE2b-256 | 3690cdd7097d43de8bcb7d183596431789164f5da450993d26ef73723978398d |
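The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; `sha256_of` and `matches_published_digest` are hypothetical helper names, and the file is assumed to have been downloaded to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be `matches_published_digest("docx2everything-1.0.0.tar.gz", "20ccd54173da789d4c7cba931b90fe4f87d3f353ac7d23ec9f3e50bc325a4863")`. pip can perform the same check itself through its hash-checking mode.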