Convert legacy MS Word .doc files to Markdown — inspired by antiword

unword

Convert legacy Microsoft Word .doc files (OLE/CFB format) to Markdown. Inspired by antiword.

Extracts body text with heading levels, page breaks, and textbox contents. No external dependencies (no LibreOffice, no COM).

Installation

CLI (Rust)

cargo install unword

Python

pip install unword

From source

Requires maturin and a virtual environment (created here with uv):

uv venv .venv && source .venv/bin/activate
maturin develop

Or build a wheel:

maturin build --release
pip install target/wheels/unword-*.whl

Usage

CLI

# Print to stdout
unword -i document.doc

# Write to file
unword -i document.doc -o output.md
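To convert several files at once, the CLI can be driven from a short script. The following is a minimal sketch, assuming `unword` is on the PATH and using only the `-i` and `-o` flags shown above. The helper names `build_command` and `convert_tree` are hypothetical, and treating a non-zero exit status as a failure is an assumption about the CLI.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(src: Path, out_dir: Path) -> List[str]:
    """Return the unword invocation that converts src into out_dir/<stem>.md."""
    return ["unword", "-i", str(src), "-o", str(out_dir / (src.stem + ".md"))]


def convert_tree(in_dir: Path, out_dir: Path) -> List[Path]:
    """Convert every .doc file directly under in_dir; return the inputs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(in_dir.glob("*.doc")):
        # Assumption: a non-zero exit status means the conversion failed.
        result = subprocess.run(build_command(src, out_dir))
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Calling `convert_tree(Path("legacy"), Path("markdown"))` would write one `.md` file per `.doc` file and return the paths that could not be converted.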

Python

from pathlib import Path

import unword

# read_bytes() opens and closes the file, so no handle is left open
doc = unword.parse_doc(Path("document.doc").read_bytes())

print(doc.body_text)      # Markdown string with headings
print(doc.textboxes)      # List of textbox strings
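Because textboxes come back separately from the body, a caller who wants a single Markdown file has to merge them. One possible way is sketched below, assuming `body_text` is a string and `textboxes` is a list of strings as the comments above indicate. The function name `combine_markdown` and the "Textboxes" section title are illustrative choices, not part of unword.

```python
from typing import List


def combine_markdown(body_text: str, textboxes: List[str]) -> str:
    """Append textbox contents to the body as a trailing Markdown section."""
    parts = [body_text.rstrip()]
    # Drop textboxes that are empty or whitespace only.
    boxes = [box.strip() for box in textboxes if box.strip()]
    if boxes:
        parts.append("## Textboxes")  # hypothetical section title
        # Render each textbox as a blockquote so it stays visually separate.
        for box in boxes:
            parts.append("\n".join("> " + line for line in box.splitlines()))
    return "\n\n".join(parts) + "\n"
```

With a parsed document this would be called as `combine_markdown(doc.body_text, doc.textboxes)`.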

Rust library

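fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = std::fs::read("document.doc")?;
    let doc = unword::parse_doc(&data)?; // assumes the error type implements std::error::Error
    println!("{}", doc.body_text);
    Ok(())
}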

Output format

  • Headings are rendered as #, ##, ###, etc. based on Word styles
  • Paragraphs are separated by blank lines
  • Page breaks become ---
  • Textboxes are extracted separately

Tests

# Rust
cargo test

# Python
pytest tests/test_python.py

License

MIT

Alternatives

  • antiword
  • AbiWord
  • Apache Tika
  • LibreOffice

Download files

Source Distribution

  • unword-0.2.2.tar.gz (18.9 kB)

Built Distributions

  • unword-0.2.2-cp38-abi3-win_amd64.whl (354.0 kB): CPython 3.8+, Windows x86-64
  • unword-0.2.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (496.0 kB): CPython 3.8+, manylinux (glibc 2.17+), x86-64
  • unword-0.2.2-cp38-abi3-macosx_11_0_arm64.whl (462.0 kB): CPython 3.8+, macOS 11.0+, ARM64
  • unword-0.2.2-cp38-abi3-macosx_10_12_x86_64.whl (462.3 kB): CPython 3.8+, macOS 10.12+, x86-64

