Parse legacy WPS Writer (.wps) and Word 97-2003 (.doc) OLE2 binary files into structured text and Markdown.

wps2md

A tiny Python library and CLI for converting legacy WPS Writer .wps and Word 97-2003 .doc files (OLE2 Word-binary format, FIB magic 0xA5EC/0xA5DC) into structured text and Markdown.

Unlike .docx files, which are zipped OOXML and can be read with python-docx, .wps files saved by WPS Office are binary OLE2 compound documents. This library reads the WordDocument stream, validates the FIB, recovers each paragraph's style index (istd) via PlcfBtePapx → FKPs, and renders Heading 1-9 styles as # through ######### in Markdown. CommonMark recognizes only # through ###### as headings, so levels 7-9 may show up as plain text in strict renderers.
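
The FIB validation step can be sketched roughly as follows. This is not wps2md's actual code: it assumes the raw bytes of the WordDocument stream have already been pulled out of the OLE2 container by some other means, and the helper name check_fib is hypothetical. The offsets follow the published [MS-DOC] FibBase layout (wIdent at offset 0, nFib at offset 2, and a flags word at offset 10 whose 0x0100 bit is fEncrypted).

```python
import struct

FIB_MAGICS = {0xA5EC, 0xA5DC}   # the two wIdent values named above
F_ENCRYPTED = 0x0100            # fEncrypted bit in the FibBase flags word


def check_fib(word_document: bytes) -> dict:
    """Validate the start of a WordDocument stream (hypothetical helper)."""
    if len(word_document) < 12:
        raise ValueError("stream too short to hold a FIB header")
    # Little-endian: wIdent (u16) at offset 0, nFib (u16) at offset 2.
    w_ident, n_fib = struct.unpack_from("<HH", word_document, 0)
    if w_ident not in FIB_MAGICS:
        raise ValueError(f"unexpected FIB magic 0x{w_ident:04X}")
    # Flags word (u16) at offset 10.
    (flags,) = struct.unpack_from("<H", word_document, 10)
    return {
        "wIdent": w_ident,
        "nFib": n_fib,
        "encrypted": bool(flags & F_ENCRYPTED),
    }
```

A caller could reject the file when "encrypted" is true, which matches the behaviour described under Limitations.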

Install

pip install wps2md

CLI

wps2md example.wps                 # print Markdown to stdout
wps2md example.wps > example.md
wps2md example.doc                 # .doc files also supported
python -m wps2md example.wps       # equivalent

Library

from wps2md import parse, to_markdown

doc = parse("example.wps")             # path (str or pathlib.Path)
# or pass raw bytes (e.g. from an upload, S3, or a zip member)
# with open("example.wps", "rb") as f:
#     doc = parse(f.read())
print(doc.main_text)                # plain text of the main body
print(doc.num_pages)                # from OLE SummaryInformation
print(to_markdown(doc.paragraphs))  # Markdown with H1-H9 from Word styles

for p in doc.paragraphs:
    print(p.heading_level, p.text)  # 0 for normal text, 1-9 for headings

API

  • parse(source) -> WpsDocument — parse a .wps/.doc file from a path (str or pathlib.Path) or from raw bytes.
  • WpsDocument — dataclass with main_text, paragraphs, footnotes, headers_footers, annotations, encoding, num_pages.
  • Paragraph(istd: int, text: str) — one paragraph; heading_level returns 1-9 for built-in Heading styles, else 0.
  • to_markdown(paragraphs) -> str — render paragraphs as Markdown.
  • WpsParseError — raised for unsupported extensions, encrypted files, or unreadable streams.
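
To make the relationship between istd, heading_level, and the Markdown output concrete, here is a minimal sketch. Para and render_markdown are hypothetical stand-ins written for this example, not the library's Paragraph or to_markdown, and the mapping of istd 1-9 to Heading 1-9 is an assumption based on Word's built-in style indices.

```python
from dataclasses import dataclass


@dataclass
class Para:
    """Stand-in for the library's Paragraph (hypothetical, for illustration)."""
    istd: int
    text: str

    @property
    def heading_level(self) -> int:
        # Assumption: istd 1-9 are the built-in Heading 1-9 styles.
        return self.istd if 1 <= self.istd <= 9 else 0


def render_markdown(paragraphs) -> str:
    """Prefix headings with '#' * level; leave body text untouched."""
    blocks = []
    for p in paragraphs:
        text = p.text.strip()
        if not text:
            continue  # skip empty paragraphs
        level = p.heading_level
        blocks.append(f"{'#' * level} {text}" if level else text)
    # Blank line between blocks, single trailing newline.
    return "\n\n".join(blocks) + "\n"
```

The real to_markdown may differ in spacing and in how it treats empty paragraphs; the sketch only shows the heading-prefix idea.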

Limitations

  • Tables, images, complex fields, paragraph styles inside footnotes and headers, and CHPX character formatting (bold, italic, and so on) are not currently surfaced; only paragraph-level Heading styles drive the Markdown output.
  • Encrypted/password-protected files are rejected.
  • Only the OLE2 Word-binary variant of .wps is supported (modern WPS Office still writes this for .wps; the OOXML .docx variant should be read with python-docx instead).
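
Because the file extension alone does not say which variant a file is, one way to decide between this library and python-docx is to look at the first bytes. The sketch below is an illustration rather than part of wps2md; sniff_container is a hypothetical name. It relies on two well-known signatures: the 8-byte Compound File Binary header used by OLE2 documents and the PK\x03\x04 local file header that begins a ZIP archive such as .docx.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # Compound File Binary signature
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header


def sniff_container(data: bytes) -> str:
    """Classify a document by its leading bytes (hypothetical helper)."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"      # candidate for the OLE2 Word-binary parser
    if data.startswith(ZIP_MAGIC):
        return "zip"       # likely OOXML; try python-docx instead
    return "unknown"
```

An "ole2" result only means the container is OLE2; the WordDocument stream and its FIB still have to be validated before the file can be treated as Word-binary.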

License

MIT


