Parse legacy WPS Writer (.wps) and Word 97-2003 (.doc) OLE2 binary files into structured text and Markdown.

wps2md

A tiny Python library and CLI for converting legacy WPS Writer .wps and Word 97-2003 .doc files (OLE2 Word-binary format, FIB magic 0xA5EC/0xA5DC) into structured text and Markdown.

Unlike .docx files (OOXML zip archives, readable with python-docx), .wps files saved by WPS Office are binary OLE2 compound documents. This library reads the WordDocument stream, validates the FIB, recovers each paragraph's style index (istd) via PlcfBtePapx → FKPs, and renders Heading 1-9 styles as # through ######### in Markdown.
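
The FIB validation step can be illustrated with a minimal sketch. It assumes the WordDocument stream has already been extracted as bytes and only checks the two magic values named above. The function name has_word_fib_magic is hypothetical and is not part of the wps2md API; the library's own check may do more.

```python
import struct

# wIdent values named in the project description (assumed to be the only
# ones accepted; the real library may apply further checks).
FIB_MAGICS = {0xA5EC, 0xA5DC}


def has_word_fib_magic(word_document: bytes) -> bool:
    """Return True if the stream starts with a recognised FIB magic."""
    if len(word_document) < 2:
        return False
    # wIdent is the first field of the FIB: a little-endian 16-bit integer.
    (w_ident,) = struct.unpack_from("<H", word_document, 0)
    return w_ident in FIB_MAGICS
```

A zip-based .docx starts with the bytes PK, so it fails this check, which is one quick way to tell the two variants apart once the stream bytes are in hand.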

Install

pip install wps2md

CLI

wps2md example.wps                 # print Markdown to stdout
wps2md example.wps > example.md
wps2md example.doc                 # .doc files also supported
python -m wps2md example.wps       # equivalent

Library

from wps2md import parse, to_markdown

doc = parse("example.wps")
print(doc.main_text)                # plain text of the main body
print(doc.num_pages)                # from OLE SummaryInformation
print(to_markdown(doc.paragraphs))  # Markdown with H1-H9 from Word styles

for p in doc.paragraphs:
    print(p.heading_level, p.text)  # 0 for normal text, 1-9 for headings

API

  • parse(path) -> WpsDocument — parse a .wps or .doc file.
  • WpsDocument — dataclass with main_text, paragraphs, footnotes, headers_footers, annotations, encoding, num_pages.
  • Paragraph(istd: int, text: str) — one paragraph; heading_level returns 1-9 for built-in Heading styles, else 0.
  • to_markdown(paragraphs) -> str — render paragraphs as Markdown.
  • WpsParseError — raised for unsupported extensions, encrypted files, or unreadable streams.
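
To make the relationship between istd, heading_level, and the Markdown output concrete, here is a rough stand-in that does not import wps2md. It assumes that the built-in Heading 1-9 styles use style indices 1-9 and that blank paragraphs are skipped; both are assumptions of this sketch, and SketchParagraph and sketch_to_markdown are hypothetical names, not the library's implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SketchParagraph:
    """Illustrative stand-in for the library's Paragraph(istd, text)."""

    istd: int
    text: str

    @property
    def heading_level(self) -> int:
        # Assumption: built-in Heading 1-9 occupy style indices 1-9.
        return self.istd if 1 <= self.istd <= 9 else 0


def sketch_to_markdown(paragraphs: Iterable[SketchParagraph]) -> str:
    """Render paragraphs as Markdown, one '#' per heading level."""
    blocks = []
    for p in paragraphs:
        text = p.text.strip()
        if not text:
            continue  # assumption: empty paragraphs are dropped
        prefix = "#" * p.heading_level
        blocks.append(f"{prefix} {text}" if prefix else text)
    return "\n\n".join(blocks)
```

With this sketch, a paragraph with istd 3 and text "Scope" renders as "### Scope", while any istd outside 1-9 renders as plain text.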

Limitations

  • Tables, images, paragraph styles in footnotes and headers, complex fields, and CHPX data (character formatting such as bold and italic) are not currently surfaced; only paragraph-level Heading styles drive Markdown output.
  • Encrypted/password-protected files are rejected.
  • Only the OLE2 Word-binary variant of .wps is supported (modern WPS Office still writes this for .wps; the OOXML .docx variant should be read with python-docx instead).
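
One plausible way to detect the encrypted files mentioned above is to test the fEncrypted bit in the FIB. The sketch below assumes the FibBase layout from the [MS-DOC] specification (a 16-bit flag word at offset 0x0A, with fEncrypted as bit 0x0100). Whether wps2md performs the check this way is not stated in this README, and is_encrypted is a hypothetical helper.

```python
import struct

FLAGS_OFFSET = 0x0A    # flag word inside FibBase (assumed per [MS-DOC])
F_ENCRYPTED = 0x0100   # fEncrypted bit within that flag word


def is_encrypted(word_document: bytes) -> bool:
    """Return True if the FIB flags mark the document as encrypted."""
    if len(word_document) < FLAGS_OFFSET + 2:
        raise ValueError("stream too short to contain the FibBase flags")
    (flags,) = struct.unpack_from("<H", word_document, FLAGS_OFFSET)
    return bool(flags & F_ENCRYPTED)
```

A caller could run this right after the magic check and refuse the file before attempting to read any text.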

License

MIT

Download files

Source Distribution

wps2md-0.2.0.tar.gz (26.8 kB view details)

Uploaded Source

Built Distribution

wps2md-0.2.0-py3-none-any.whl (10.3 kB view details)

Uploaded Python 3

File details

Details for the file wps2md-0.2.0.tar.gz.

File metadata

  • Download URL: wps2md-0.2.0.tar.gz
  • Upload date:
  • Size: 26.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.16

File hashes

Hashes for wps2md-0.2.0.tar.gz
Algorithm Hash digest
SHA256 3058e10be248fc1ccd6dae1c0f80970c3b4c7e9c4c3e406755c0414095c25a86
MD5 1950557df41e644999a0fc9f1b388efc
BLAKE2b-256 f739987f72e1cd29467747f61922c9fe0afb994c859b5101fc8775c50bf85755

File details

Details for the file wps2md-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: wps2md-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 10.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.6.16

File hashes

Hashes for wps2md-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c42f025f6b9122363f1002359df8510e1853ac205e1af1789a0ee446741d669e
MD5 85f8c948401c1661434705527a0c591f
BLAKE2b-256 cac8516e8de59fdbdf87357e39bc9c7169307eb3de874745d8fb6de02e05ffcb
