Tool to parse Microsoft Rich Text Format (RTF)

rtfparse

Parses Microsoft's Rich Text Format (RTF) documents. It creates an in-memory object which represents the tree structure of the RTF document. This object can in turn be rendered by one of the renderers. So far, rtfparse provides only one renderer (HTML_Decapsulator), which liberates the HTML code encapsulated in RTF. This will come in handy, for example, if you ever need to extract the HTML from an HTML-formatted email message saved by Microsoft Outlook.

MS Outlook also tends to use RTF compression, so the CLI of rtfparse can optionally decompress that, too.

You can of course write your own renderers of parsed RTF documents and consider contributing them to this project.

Installation

Install rtfparse from PyPI with pip:

pip install rtfparse

Installation creates an executable file rtfparse in your Python scripts folder, which should be on your $PATH.

Usage From Command Line

Use the rtfparse executable from the command line. Run rtfparse --help to see the available options.

rtfparse writes its logs to the following files in ~/rtfparse/:

rtfparse.debug.log
rtfparse.info.log
rtfparse.errors.log
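
When a run fails, it can help to look at the end of these log files. The following is a minimal sketch that uses only the directory and file names listed above; the helper names log_paths and tail are hypothetical and not part of rtfparse, and the logs are assumed to be UTF-8 text.

```python
from pathlib import Path

# Log directory and file names as listed above.
LOG_DIR = Path.home() / "rtfparse"
LOG_NAMES = ("rtfparse.debug.log", "rtfparse.info.log", "rtfparse.errors.log")


def log_paths(log_dir: Path = LOG_DIR) -> list[Path]:
    """Return the expected log file paths, whether or not they exist yet."""
    return [log_dir / name for name in LOG_NAMES]


def tail(path: Path, lines: int = 20) -> list[str]:
    """Return the last `lines` lines of a log file, or an empty list if it is missing."""
    if not path.is_file():
        return []
    # Assumption: logs are UTF-8 text; undecodable bytes are replaced, not raised.
    text = path.read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-lines:]
```

For example, tail(log_paths()[2], 5) returns up to five of the most recent lines of rtfparse.errors.log.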

Example: Decapsulate HTML from an uncompressed RTF file

rtfparse --rtf-file "path/to/rtf_file.rtf" --decapsulate-html --output-file "path/to/extracted.html"
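
To process many files, the same command can be driven from a script. This is a sketch only: it assumes the rtfparse executable is on your $PATH and uses just the options shown above. The function names build_command and decapsulate_directory are hypothetical helpers, not part of rtfparse.

```python
import subprocess
from pathlib import Path


def build_command(rtf_file: Path, output_file: Path) -> list[str]:
    """Build the argument list for the CLI call shown above."""
    return [
        "rtfparse",
        "--rtf-file", str(rtf_file),
        "--decapsulate-html",
        "--output-file", str(output_file),
    ]


def decapsulate_directory(source_dir: Path, target_dir: Path) -> list[Path]:
    """Run rtfparse once per .rtf file in source_dir; return the requested output paths."""
    target_dir.mkdir(parents=True, exist_ok=True)
    requested = []
    for rtf_file in sorted(source_dir.glob("*.rtf")):
        html_file = target_dir / (rtf_file.stem + ".html")
        # check=True raises CalledProcessError if rtfparse exits with a non-zero status.
        subprocess.run(build_command(rtf_file, html_file), check=True)
        requested.append(html_file)
    return requested
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.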

Example: Decapsulate HTML from MS Outlook email file

For this, the CLI of rtfparse uses extract_msg and compressed_rtf.

rtfparse --msg-file "path/to/email.msg" --decapsulate-html --output-file "path/to/extracted.html"

Example: Only decompress the RTF from MS Outlook email file

rtfparse --msg-file "path/to/email.msg" --output-file "path/to/extracted.rtf"
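
If you want to check whether a blob of bytes is compressed RTF before decompressing it, the header can be inspected directly. The sketch below assumes the 16-byte header layout described in Microsoft's MS-OXRTFCP specification (compressed size, raw size, compression type, CRC, all little-endian); the function describe_compressed_rtf is a hypothetical helper and not part of rtfparse or compressed_rtf.

```python
import struct

# Assumed header layout (MS-OXRTFCP): COMPSIZE, RAWSIZE, COMPTYPE, CRC,
# 4 bytes each, little-endian. COMPTYPE is read as 4 raw bytes.
HEADER = struct.Struct("<II4sI")

# Compression type markers as they appear in the byte stream.
KINDS = {b"LZFu": "compressed", b"MELA": "uncompressed"}


def describe_compressed_rtf(data: bytes) -> dict:
    """Decode the header fields of a compressed RTF byte string."""
    if len(data) < HEADER.size:
        raise ValueError("too short to contain a compressed RTF header")
    comp_size, raw_size, comp_type, crc = HEADER.unpack_from(data)
    return {
        "comp_size": comp_size,
        "raw_size": raw_size,
        "kind": KINDS.get(comp_type, "unknown"),
        "crc": crc,
    }
```

A result with kind "unknown" suggests the bytes are not in the compressed RTF container at all, for example because they are already plain RTF.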

Example: Decapsulate HTML from MS Outlook email file and save (and later embed) the attachments

When extracting the RTF from the .msg file, you can save the attachments (which include images embedded in the email text) in a directory:

rtfparse --msg-file "path/to/email.msg" --output-file "path/to/extracted.rtf" --attachments-dir "path/to/dir"

In rtfparse version 1.x you will be able to embed these images in the decapsulated HTML. This functionality will be provided by the package embedimg.

rtfparse --msg-file "path/to/email.msg" --output-file "path/to/extracted.rtf" --attachments-dir "path/to/dir" --embed-img

In the current version the option --embed-img does nothing.

Programmatic usage in a Python module

Decapsulate HTML from an uncompressed RTF file

from pathlib import Path
from rtfparse.parser import Rtf_Parser
from rtfparse.renderers.html_decapsulator import HTML_Decapsulator

source_path = Path(r"path/to/your/rtf/document.rtf")
target_path = Path(r"path/to/your/html/decapsulated.html")
# Create parent directory of `target_path` if it does not already exist:
target_path.parent.mkdir(parents=True, exist_ok=True)

parser = Rtf_Parser(rtf_path=source_path)
parsed = parser.parse_file()

renderer = HTML_Decapsulator()

with open(target_path, mode="w", encoding="utf-8") as html_file:
    renderer.render(parsed, html_file)
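
Not every RTF file contains encapsulated HTML, so it can be worth checking before decapsulating. The following is a heuristic sketch, not part of rtfparse: it assumes that encapsulated HTML is marked by the \fromhtml control word near the start of the document (as described in Microsoft's MS-OXRTFEX specification), and the 4096-byte window is an arbitrary choice. The function name looks_like_encapsulated_html is hypothetical.

```python
import re

# \fromhtml marks HTML encapsulated in RTF. The lookahead rejects longer
# control words that merely start with "fromhtml".
FROMHTML = re.compile(rb"\\fromhtml(?![a-zA-Z])")


def looks_like_encapsulated_html(rtf_bytes: bytes, header_window: int = 4096) -> bool:
    """Heuristically report whether the RTF header announces encapsulated HTML."""
    # Only the beginning of the document is searched (assumed header region).
    return FROMHTML.search(rtf_bytes[:header_window]) is not None
```

With the variables from the example above, looks_like_encapsulated_html(source_path.read_bytes()) could be called before creating the parser.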

Decapsulate HTML from an MS Outlook msg file

from pathlib import Path
from extract_msg import openMsg
from compressed_rtf import decompress
from io import BytesIO
from rtfparse.parser import Rtf_Parser
from rtfparse.renderers.html_decapsulator import HTML_Decapsulator


source_file = Path("path/to/your/source.msg")
target_file = Path(r"path/to/your/target.html")
# Create parent directory of `target_file` if it does not already exist:
target_file.parent.mkdir(parents=True, exist_ok=True)

# Get a decompressed RTF bytes buffer from the MS Outlook message
msg = openMsg(source_file)
decompressed_rtf = decompress(msg.compressedRtf)
rtf_buffer = BytesIO(decompressed_rtf)

# Parse the rtf buffer
parser = Rtf_Parser(rtf_file=rtf_buffer)
parsed = parser.parse_file()

# Decapsulate the HTML from the parsed RTF
decapsulator = HTML_Decapsulator()
with open(target_file, mode="w", encoding="utf-8") as html_file:
    decapsulator.render(parsed, html_file)

RTF Specification Links
