Parse a given Html/Text email and return only the new text, without the quoted part.

UnquoteMail

This library is intended to parse a given HTML and/or Text message and return only the new message without the previous conversation(s).

It is used in production at Fernand.

Parsing an email is difficult because of the number of mail providers and their specific approaches. Unfortunately, there is no standard for separating the new content from the previous conversation, so we need to rely on different tricks.

We took a progressive approach to parsing the document, in the following order:

  1. We first try to identify and then remove all the known quote markup, such as ".gmail_quote", ".protonmail_quote" and the like
  2. If we don't find any, we fall back on regex to identify standard "On YYYY/MM/dd HH:mm:ss, bob bob@example.com wrote:" patterns

If step 1 succeeds, we regenerate the text data by converting the remaining HTML to Markdown using the html2text library.
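
To make step 1 concrete, here is a minimal sketch of dropping a known quote container from an HTML body. It is not UnquoteMail's implementation: it relies only on the standard library's html.parser, checks just the two class names mentioned above, and the names QuoteStripper and strip_known_quotes are hypothetical. It assumes well-nested markup and silently drops comments and doctypes.

```python
from html.parser import HTMLParser

# Only the two classes named in this README; a real list would be longer.
QUOTE_CLASSES = {"gmail_quote", "protonmail_quote"}
# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class QuoteStripper(HTMLParser):
    """Re-emits the HTML it is fed, minus any element carrying a quote class."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []        # pieces of the HTML we keep
        self.skip_depth = 0  # > 0 while inside a quoted container
        self.found = False   # True once a quote container was seen

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & QUOTE_CLASSES:
            self.found = True
            if tag not in VOID_TAGS:
                self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: keep it verbatim outside a quote.
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_known_quotes(html_body):
    """Return (html_without_quotes, whether_a_quote_was_found)."""
    parser = QuoteStripper()
    parser.feed(html_body)
    parser.close()
    return "".join(parser.out), parser.found
```

The boolean in the return value mirrors the decision described above: when no known container is found, the caller can move on to the regex fallback.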

If step 2 succeeds, we parse the HTML again to locate the pattern matched in the text version. We then remove everything from that point onward (including the matched pattern) in the HTML structure.
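
The text side of step 2 can be sketched as follows. The pattern below is an illustrative assumption covering only the "On YYYY/MM/dd HH:mm:ss, ... wrote:" shape quoted above; the regex patterns actually shipped in UnquoteMail are different and cover many more variants. The names REPLY_HEADER and strip_quoted_text are hypothetical.

```python
import re

# Illustrative pattern only: a date, a time, a sender, then "wrote:" on one line.
REPLY_HEADER = re.compile(
    r"^On\s+\d{4}[/-]\d{2}[/-]\d{2}\s+\d{1,2}:\d{2}(?::\d{2})?,?\s+.+?\s+wrote:\s*$",
    re.MULTILINE,
)


def strip_quoted_text(text):
    """Return (new_text, matched_header); matched_header is None if nothing matched."""
    match = REPLY_HEADER.search(text)
    if match is None:
        return text, None
    # Keep everything before the header; drop the header and what follows it.
    return text[:match.start()].rstrip(), match.group(0).strip()
```

The returned header is the string one would then look for in the HTML version in order to cut the markup at the same point.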

If we fail to locate the pattern in the HTML structure, we create new HTML by converting the text to HTML.

We allow ourselves to rewrite the text to HTML as a last resort because we consider that an email containing previous data is generally a human-written reply and not a marketing email with advanced structure, so the content should be simple to parse (links, bold/italic/underline, images, and little else: everything a Markdown converter can handle).
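
As a rough illustration of that last resort, the sketch below turns plain text into very basic HTML using only the standard library. It is an assumption about the general idea, not the converter UnquoteMail uses: it handles paragraphs and line breaks only, with no Markdown features such as links or bold, and the name text_to_html is hypothetical.

```python
import html
import re


def text_to_html(text):
    """Wrap blank-line-separated blocks in <p> and turn newlines into <br>."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    rendered = []
    for paragraph in paragraphs:
        # Escape <, > and & so the text cannot inject markup.
        escaped = html.escape(paragraph, quote=False)
        rendered.append("<p>" + escaped.replace("\n", "<br>\n") + "</p>")
    return "\n".join(rendered)
```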

Usage

The library is very straightforward:

from unquotemail import Unquote

# You previously retrieved the text/html and text/plain version of an email

unquote = Unquote(email_html, email_text)
print(unquote.get_html())  # Will output the email without the included replies.

The Unquote class accepts 4 parameters:

  • html: A string containing the HTML data
  • text: A string containing the Text data
  • sender: (Optional) The message_id of the email, generally of the form <{hash}@{hostname}>
  • parse: (Optional) Defaults to True. When True, the message is parsed as soon as the object is created.

All the parameters are optional, BUT you must pass either a valid html or text value (otherwise it's kind of useless, isn't it?).

The sender parameter is not required and doesn't do anything for now, but in the future we may rely on the sender to better parse an email. (Knowing that an email comes from @yahoo.com might tell the parser what to do, so that it doesn't look for a "gmail_quote" div, for instance.)

Finally, the parse boolean, if set to False, won't ... well, parse the email.

The reason for this is quite simple. Imagine the following:

unquote = Unquote(email_html, email_text, parse=False)

if not is_new_email:
    # We don't unquote a new email as we want to keep the context
    # But for all following emails, we do want to remove that context since we already have it
    unquote.parse()

message = Message(unquote.get_html(), unquote.get_text())
message.save()  # in the database

Special thanks

We built our own regex patterns from those of the following libraries. Most of the patterns you see in UnquoteMail have been modified, but their roots are in these two libraries:

So, thank you to them!

Testing

Our test/ folder contains a suite of tests. Some of the files in that folder were taken from Crisp and Talon (again, thanks) and adjusted to fit our test cases.

To run the tests, do the following:

pytest

Or to run only one test:

pytest -k "test_unquote[talon_9.html-talon_9.html]"

WARN: For now, only 105 of the 168 tests pass. We still have some work to do to improve this.
