Mailgun library to extract message quotations and signatures.

If you have ever tried to parse message quotations or signatures, you know that the absence of any formatting standards in this area can make the task a nightmare. Hopefully this library will make your life much easier. The name of the project is inspired by TALON, a multipurpose robot designed to perform missions ranging from reconnaissance to combat and to operate in a number of hostile environments. That’s what a good quotations and signature parser should be like.

Usage

Here’s how you initialize the library and extract a reply from a text message:

import talon
from talon import quotations

talon.init()

text = """Reply

-----Original Message-----

Quote"""

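# generic entry point: pass the content type explicitly
reply = quotations.extract_from(text, 'text/plain')
# equivalent call using the plain-text specific function
reply = quotations.extract_from_plain(text)
# reply == "Reply"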
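
To give a feel for what quotation extraction involves, here is a minimal standard-library sketch that cuts a plain-text message at the first line that looks like a quote header. It is not talon's implementation: the name `naive_extract_reply` and the two patterns in `SPLITTER_PATTERNS` are assumptions made for illustration, and real messages need far more patterns than these.

```python
import re

# Hypothetical splitter patterns, for illustration only. Real mail clients
# produce many more header styles than these two.
SPLITTER_PATTERNS = [
    re.compile(r"^\s*-+\s*Original Message\s*-+\s*$", re.IGNORECASE),
    re.compile(r"^\s*On .+ wrote:\s*$"),
]


def naive_extract_reply(text):
    """Return the text above the first line that looks like a quote header."""
    kept = []
    for line in text.splitlines():
        if any(pattern.match(line) for pattern in SPLITTER_PATTERNS):
            break  # everything from here on is treated as quoted material
        kept.append(line)
    return "\n".join(kept).strip()


sample = "Reply\n\n-----Original Message-----\n\nQuote"
print(naive_extract_reply(sample))
```

A sketch like this breaks as soon as a client uses a header it does not know about, which is the problem the library is meant to take off your hands.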

To extract a reply from HTML:

html = """Reply
<blockquote>

  <div>
    On 11-Apr-2011, at 6:54 PM, Bob &lt;bob@example.com&gt; wrote:
  </div>

  <div>
    Quote
  </div>

</blockquote>"""

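# generic entry point: pass the content type explicitly
reply = quotations.extract_from(html, 'text/html')
# equivalent call using the HTML specific function
reply = quotations.extract_from_html(html)
# reply == "<html><body><p>Reply</p></body></html>"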
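
For comparison, the simplest possible approach to HTML is to drop everything nested inside `<blockquote>` elements. The sketch below does that with the standard library's `html.parser`. It is an illustration under stated assumptions, not how talon works: `BlockquoteStripper` and `naive_reply_text` are hypothetical names, the function returns plain text rather than cleaned HTML, and it misses quotations that are not wrapped in a `<blockquote>`.

```python
from html.parser import HTMLParser


class BlockquoteStripper(HTMLParser):
    """Collect text that is not nested inside any <blockquote> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current <blockquote> nesting level
        self.chunks = []  # text fragments found outside quoted blocks

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside every blockquote.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def naive_reply_text(html):
    """Return the unquoted text of an HTML message as a single string."""
    parser = BlockquoteStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = "Reply\n<blockquote><div>On 11-Apr-2011, Bob wrote:</div><div>Quote</div></blockquote>"
print(naive_reply_text(sample))
```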

Often the best way is the easiest one. Here’s how you can extract a signature from an email message without any fancy machine learning:

from talon.signature.bruteforce import extract_signature


message = """Wow. Awesome!
--
Bob Smith"""

text, signature = extract_signature(message)
# text == "Wow. Awesome!"
# signature == "--\nBob Smith"
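
The idea behind a rule-based extractor can be sketched in a few lines: scan the tail of the message for a delimiter line and split there. The code below is a simplified illustration, not the code in `talon.signature.bruteforce`; the `DELIMITER` pattern, the `MAX_SIGNATURE_LINES` limit and the function name `naive_split_signature` are all assumptions.

```python
import re

# Hypothetical delimiter: a line holding only dashes or only underscores.
DELIMITER = re.compile(r"^\s*(--+|__+)\s*$")
MAX_SIGNATURE_LINES = 10  # illustrative limit, not a talon setting


def naive_split_signature(message):
    """Split a message into (body, signature) at a delimiter near the end."""
    lines = message.splitlines()
    # Only inspect the tail of the message; signatures sit at the bottom.
    start = max(0, len(lines) - MAX_SIGNATURE_LINES)
    for i in range(start, len(lines)):
        if DELIMITER.match(lines[i]):
            body = "\n".join(lines[:i]).strip()
            signature = "\n".join(lines[i:]).strip()
            return body, signature
    return message, None  # no delimiter found


body, sig = naive_split_signature("Wow. Awesome!\n--\nBob Smith")
print(repr(body), repr(sig))
```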

This is quick and works like a charm 90% of the time. For the other 10% you can use the power of machine learning algorithms:

import talon
# don't forget to init the library first
# it loads machine learning classifiers
talon.init()

from talon import signature


message = """Thanks Sasha, I can't go any higher and is why I limited it to the
homepage.

John Doe
via mobile"""

# use a separate name for the result so the imported signature module is not shadowed
text, sig = signature.extract(message, sender='john.doe@example.com')
# text == "Thanks Sasha, I can't go any higher and is why I limited it to the\nhomepage."
# sig == "John Doe\nvia mobile"
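
One way to combine the two approaches is to try the cheap extractor first and call the classifier only when it finds nothing. The helper below is a hypothetical sketch, not part of talon: `extract_with_fallback` takes the two extractors as arguments, and it assumes both return a `(text, signature)` pair whose signature is falsy when none was found.

```python
def extract_with_fallback(message, sender, fast, thorough):
    """Try the cheap extractor first; fall back only when it finds nothing.

    `fast` is called as fast(message) and `thorough` as
    thorough(message, sender=sender). Both are assumed to return a
    (text, signature) pair, with a falsy signature meaning "none found".
    """
    text, sig = fast(message)
    if sig:
        return text, sig
    return thorough(message, sender=sender)


# Stand-in extractors so the sketch runs without talon installed.
def fake_fast(message):
    return message, None


def fake_thorough(message, sender=None):
    return "body", "John Doe"


print(extract_with_fallback("body\nJohn Doe", "john.doe@example.com",
                            fake_fast, fake_thorough))
```

With talon installed and initialized, the two callables would presumably be the rule-based `extract_signature` and the classifier-based `extract` functions shown above.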

For machine learning, talon currently uses the scikit-learn library to build SVM classifiers. The core of the machine learning algorithm lies in the talon.signature.learning package. It defines the set of features to apply to a message (featurespace.py), how data sets are built (dataset.py), and the classifier’s interface (classifier.py).
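
To make "a set of features to apply to a message" concrete, here is a toy sketch that turns the last few lines of a message into rows of binary features, the kind of input an SVM classifier can be trained on. The feature list is invented for illustration and is not the one defined in featurespace.py; `FEATURES`, `line_features` and `message_features` are hypothetical names.

```python
import re

# Hypothetical binary features, each a (name, predicate) pair. The real
# feature set used by the library differs from this one.
FEATURES = [
    ("has_email", lambda line: bool(re.search(r"\S+@\S+\.\S+", line))),
    ("has_phone", lambda line: bool(re.search(r"\+?\d[\d\s().-]{6,}\d", line))),
    ("is_short", lambda line: len(line.strip()) <= 30),
    ("has_url", lambda line: bool(re.search(r"https?://|www\.", line))),
]


def line_features(line):
    """Map one line to a list of 0/1 values, one per feature."""
    return [int(check(line)) for _, check in FEATURES]


def message_features(message, last_n=3):
    """Build one feature row for each of the last `last_n` non-empty lines."""
    lines = [line for line in message.splitlines() if line.strip()]
    return [line_features(line) for line in lines[-last_n:]]


for row in message_features("Thanks for the update.\n\nJohn Doe\njohn.doe@example.com"):
    print(row)
```

Rows like these, paired with a label saying whether the line belongs to a signature, are what a training data file would hold in this simplified picture.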

Currently the data used for training is taken from our personal email conversations and from the ENRON dataset. By applying our set of features to that dataset we produce the files classifier and train.data, which contain no personal information but can be used to load the trained classifier. These files should be regenerated every time the feature set or data set changes.

To regenerate the model files, you can run

python train.py

or

from talon.signature import EXTRACTOR_FILENAME, EXTRACTOR_DATA
from talon.signature.learning.classifier import train, init
train(init(), EXTRACTOR_DATA, EXTRACTOR_FILENAME)

Open-source Dataset

Recently we started a forge project to create an open-source, annotated dataset of raw emails. The project uses a subset of the ENRON data, cleansed of private, health and financial information by EDRM. At the moment over 190 emails are annotated. Contributions and collaboration on the project are welcome. Once the dataset is ready we plan to start using it for talon.

Training on your dataset

talon comes with a pre-processed dataset and a pre-trained classifier. To retrain the classifier on your own dataset of raw emails, structure and annotate them in the same way the forge project does. Then do:

from talon.signature.learning.dataset import build_extraction_dataset
from talon.signature.learning import classifier as c

build_extraction_dataset("/path/to/your/P/folder", "/path/to/talon/signature/data/train.data")
c.train(c.init(), "/path/to/talon/signature/data/train.data", "/path/to/talon/signature/data/classifier")

Note that for signature extraction you need only the folder of positive samples with annotated signature lines (the P folder).

Research

The library is inspired by the following research papers and projects:
