convert .docx to .md

docx2md

Converts Microsoft Word document files (.docx extension) to Markdown files.

1. Install

pip install docx2md

2. How to use

usage: docx2md [-h] [-m] [-v] [--debug] SRC.docx DST.md

positional arguments:
  SRC.docx        Microsoft Word file to read
  DST.md          Markdown file to write

optional arguments:
  -h, --help      show this help message and exit
  -m, --md_table  use Markdown table notation instead of <table>
  -v, --version   show version
  --debug         for debug
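
For scripted use, the command line above can be assembled and run from Python. This is a minimal sketch: build_command and run_docx2md are hypothetical helper names, not part of docx2md, and the sketch assumes the docx2md executable is on PATH after installation.

```python
import subprocess


def build_command(src, dst, md_table=False):
    """Assemble the docx2md argument list from the usage text above."""
    cmd = ["docx2md"]
    if md_table:
        cmd.append("--md_table")  # Markdown table notation instead of <table>
    cmd += [str(src), str(dst)]
    return cmd


def run_docx2md(src, dst, md_table=False):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(src, dst, md_table), check=True)
```

Calling run_docx2md("report.docx", "report.md", md_table=True) would then correspond to running docx2md --md_table report.docx report.md in a shell.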

3. Tables

A table is output as <table id="table(n)">, where n is the table's position in the output, starting from 1.

If --md_table is specified, tables are written in Markdown pipe (|) notation instead, and every header cell is fixed to #:

| # | # | # |
|---|---|---|
|a|b|c|
|d|e|f|
|g|h|i|
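
To make the shape of that output concrete, the sketch below reproduces the sample from a list of rows. It is an illustration only, not docx2md's implementation; render_md_table is a hypothetical name, and it assumes every row has the same number of cells.

```python
def render_md_table(rows):
    """Render rows of cell strings in the layout of the sample above."""
    if not rows:
        return ""
    width = len(rows[0])
    lines = [
        # Header row: every cell is the fixed '#' placeholder.
        "| " + " | ".join("#" for _ in range(width)) + " |",
        # Separator row required by Markdown table notation.
        "|" + "|".join("---" for _ in range(width)) + "|",
    ]
    # Data rows, one per input row.
    lines += ["|" + "|".join(row) + "|" for row in rows]
    return "\n".join(lines)
```

Because the header carries no information, a consumer that needs real column titles would have to supply them after conversion.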

4. Pictures

An image is output as <img id="image(n)">, where n is the image's position in the output, starting from 1.
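
The sequential ids make it straightforward to locate images in the converted text. The following sketch collects them with a regular expression. It assumes the ids appear literally as id="image1", id="image2", and so on inside the <img> tag; the other attributes of the tag are not documented here, so the pattern ignores them. image_numbers is a hypothetical helper.

```python
import re

# Assumption: ids are written as id="image1", id="image2", ... inside <img> tags.
IMG_ID = re.compile(r'<img\b[^>]*\bid="image(\d+)"')


def image_numbers(markdown):
    """Return the image numbers found in converted text, in document order."""
    return [int(n) for n in IMG_ID.findall(markdown)]
```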

5. Examples

6. Elements that can be converted

  • Tables (including merged cells)
  • Lists (bulleted and numbered)
  • Headings
  • Embedded images
  • Page breaks (converted to <div class="break"></div>)
  • Line breaks within paragraphs (converted to <br>)
  • Text boxes (inserted in the body)
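
Since page breaks survive as a fixed HTML marker, converted text can be split back into pages afterwards. A minimal sketch, assuming the marker appears exactly as listed above; split_pages is a hypothetical helper, not part of docx2md.

```python
PAGE_BREAK = '<div class="break"></div>'


def split_pages(markdown):
    """Split converted text at page-break markers, dropping empty chunks."""
    pages = (chunk.strip() for chunk in markdown.split(PAGE_BREAK))
    return [page for page in pages if page]
```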

7. Elements that cannot be converted (only known ones)

  • Table of Contents
  • Text decoration (bold, etc.)

8. API

8.1. function

  • docx2md.do_convert
>>> help(docx2md.do_convert)
Help on function do_convert in module docx2md.convert:

do_convert(docx_file: str, target_dir='', use_md_table=False) -> str
    convert docx_file to Markdown text and return it

    Args:
        docx_file(str): a file to parse
        target_dir(str): save images into target_dir/media/ if specified
        use_md_table(bool): use Markdown table notation instead of HTML
    Returns:
        Markdown text(str)
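
A typical call converts a file and writes the returned text next to the extracted images. The sketch below wraps that in a hypothetical helper, convert_to_file. The convert parameter is an addition for illustration so the helper can be tried without a real .docx file; by default it uses docx2md.do_convert with the signature shown above.

```python
from pathlib import Path


def convert_to_file(docx_path, md_path, use_md_table=False, convert=None):
    """Convert docx_path and write the Markdown text to md_path.

    `convert` defaults to docx2md.do_convert; it is a parameter only so the
    helper can be exercised without a real .docx file.
    """
    if convert is None:
        import docx2md  # imported lazily; requires `pip install docx2md`
        convert = docx2md.do_convert
    md_path = Path(md_path)
    # Images are saved into <target_dir>/media/, so keep them beside the output.
    text = convert(
        str(docx_path),
        target_dir=str(md_path.parent),
        use_md_table=use_md_table,
    )
    md_path.write_text(text, encoding="utf-8")
    return text
```

For example, convert_to_file("report.docx", "out/report.md") would write the Markdown to out/report.md and, per the help text, save images under out/media/. The out directory is assumed to exist already.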

8.2. class

  • docx2md.DocxFile
  • docx2md.DocxMedia
  • docx2md.Converter

Refer to the do_convert implementation for the usage of each class.

def do_convert(docx_file: str, target_dir="", use_md_table=False) -> str:
    try:
        docx = DocxFile(docx_file)
        media = DocxMedia(docx)
        if target_dir:
            media.save(target_dir)
        converter = Converter(docx.document(), media, use_md_table)
        return converter.convert()
    except Exception as e:
        return f"Exception: {e}"
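
Note that the implementation above returns the error as text starting with "Exception: " instead of raising, so a failed conversion can be written to disk unnoticed. One way to guard against that is sketched below. ConversionError and ensure_converted are hypothetical names, and the check is a heuristic: a document whose own text begins with "Exception: " would be misreported.

```python
class ConversionError(RuntimeError):
    """Raised when the returned text looks like do_convert's error string."""


ERROR_PREFIX = "Exception: "


def ensure_converted(text):
    """Return text unchanged, or raise if it starts with the error prefix."""
    if text.startswith(ERROR_PREFIX):
        # Strip the prefix so the exception carries only the original message.
        raise ConversionError(text[len(ERROR_PREFIX):])
    return text
```

Callers who need real exceptions can instead compose DocxFile, DocxMedia, and Converter directly, as the listing does, and omit the try/except.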

9. License

MIT

10. Changelog

  • 1.0.5 merge PR #7
  • 1.0.4 fix issue #6
  • 1.0.3 add API
  • 1.0.2 change packaging system to pyproject.toml
