email_cleaning_service created by paul_lestrat

Project description

email_cleaning_service

This is an email segmenting service which takes a list of emails as input and returns the the header, body and signature of each message in the email

Getting Started

The project is published on Pypi and can be installed using the following command

pip install email-cleaning-service

Usage

The package can be used as follows

from email_cleaning_service.control import EmailCleaner

email_cleaner = EmailCleaner(tracking_uri, storage_uri)

Usage revolves around the emailCleaner class which is the preferred interface for the package. The class takes two arguments, the tracking_uri and the storage_uri. The tracking_uri is the uri of the MLflow tracking server and the storage_uri is the uri of the storage server (can be a path to a local folder).

BaseModel classes exist to simplify interactions with the class. The most important of these is the PipelineSpecs class which is used to define the pipeline to be used for cleaning the emails.

from email_cleaning_service.utils.request_classes import PipelineSpecs

pipeline_specs = PipelineSpecs(
    classifier_origin="mlflow", # or "h5" or "config
    classifier_id="a1f66311816e417cb94db7c2457b25d1",
    encoder_origin="hugg", # or "mlflow"
    encoder_id="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    encoder_dim=384,
    features=[
            "phone_number",
            "url",
            "punctuation",
            "horizontal_separator",
            "hashtag",
            "pipe",
            "email",
            "capitalized",
            "full_caps"
        ]      # Can be any combination of the above features
)

The above pipeline is the recommended one for multilingual email segmenting for now.

The pipeline contains 3 main elements:

The embedding model: This is the model used to embed the emails into a vector space. The model can be either a huggingface model or an MLflow model. If the model is from hugging face, specify encoder_origin as "hugg" and the encoder_id as the model name on the platform. If the model is from MLflow, specify encoder_origin as "mlflow" and the encoder_id as the id of the run you want to use the model from on the MLFlow server.
The extracted features: a list of regex features that are concatenated to the embedding of each sentence
The classifier: This is the model used to realise the final classification and the separation of a thread in multiple messages. The model can come from mlflow in which case the run_id must be specified or from a h5 file in which case the path to the file must be specified.

The pipeline can then be used as follows:

email_list = [
    "This is a test email. I am testing the email cleaner.\nYours truly, Paul",
    "Hello team!\nThis is another test email with two lines.\n I am testing the email cleaner",
    "Bonjour!\nCeci est un autre email\n\nAu revoir!\nPaul",
]

email_cleaner.segment(email_list, pipeline_specs)

The output should look like this:

{'threads': [{'source': 'This is a test email. I am testing the email cleaner.\nYours truly, Paul\n0781759532',
   'messages': [{'full': 'This is a test email. I am testing the email cleaner.\nYours truly, Paul\n0781759532',
     'header': '',
     'disclaimer': '',
     'greetings': '',
     'body': 'This is a test email. I am testing the email cleaner.\nYours truly, Paul\n0781759532',
     'signature': '',
     'caution': ''}]},
  {'source': 'Hello team!\nThis is another test email with two lines.\n I am testing the email cleaner.',
   'messages': [{'full': 'Hello team!\nThis is another test email with two lines.\n I am testing the email cleaner.',
     'header': '',
     'disclaimer': '',
     'greetings': '',
     'body': 'Hello team!',
     'signature': 'This is another test email with two lines.\n I am testing the email cleaner.',
     'caution': ''}]},
  {'source': 'Bonjour!\nCeci est un autre email\n\nAu revoir!\nPaul',
   'messages': [{'full': 'Bonjour!\nCeci est un autre email\nAu revoir!\nPaul',
     'header': '',
     'disclaimer': '',
     'greetings': '',
     'body': 'Bonjour!\nCeci est un autre email\nAu revoir!',
     'signature': 'Paul',
     'caution': ''}]}]}

Model Training

This package also includes support for training pipelines. You can either train (fine-tune) an encoder model or a classifier model. A note-worthy difference between the 2 is that encoders are trained with a single line of the email as input while classifiers are trained with a sequence of 64 lines as input.

The csv files used for training must use be contain lines from emails and have the following columns:

Email: a unique Id for each email used to group email lines together
Text: the text of the email line
Section: the section of the email line (disclaimer, header, greetings, body, signature and caution represented as 1 thru 6 respectively)
FragmentChange: a boolean (0 or 1) indicating whether the line is a fragment change or not

Training an Encoder

To train an encoder, use the EncoderSpecs class and the RunSpecs class as follows:

from email_cleaning_service.utils.request_classes import EncoderSpecs, RunSpecs

dataset = RunSpecs(
    run_name="demo_encoder_test_run_2",
    csv_train="./train_multi_71.csv",
    csv_test="./test_multi_48.csv",
    batch_size=4,
    metrics=[],
    lr=0.0001,
    epochs=1,
)

encoder_specs = EncoderSpecs(
    origin="mlflow",
    encoder="14a633237e734575ad7f8eac9bd0319e"
)

email_cleaner.train_encoder(dataset, encoder_specs)

The EncoderSpecs class takes two arguments, origin and encoder which are the same as in PipelineSpecs. The RunSpecs define how you want to train the model. The arguments are:

run_name: The name of the run on the MLflow server
csv_train: The path to the csv file containing the training data
csv_test: The path to the csv file containing the test data
batch_size: The batch size to use for training
metrics: A list of metrics to track during training. The metrics must be defined in the metrics.py file in utils
lr: The learning rate to use for training
epochs: The number of epochs to train for

Training a Classifier

To train a classifier, an entire pipeline must be defined. This is done using the PipelineSpecs class as follows:

dataset = RunSpecs(
    run_name="with_fine_tuned_encoder",
    batch_size=4,
    csv_train="./train_multi.csv",
    csv_test="./test_multi.csv",
    metrics=["seq_f1", "frag_f1"],
    lr=0.007,
    epochs=3,
)

pipeline_specs = PipelineSpecs(
    classifier_origin="h5",
    classifier_id="./temp/base_multi_miniLM_classifier_optimized/multi_miniLM_classifier.h5",
    encoder_origin="mlflow",
    encoder_id="316fb5040b0a4353ade2e967290944ff",
    encoder_dim=384,
    features=[
            "phone_number",
            "url",
            "punctuation",
            "horizontal_separator",
            "hashtag",
            "pipe",
            "email",
            "capitalized",
            "full_caps"
        ]
)

email_cleaner.train_classifier(dataset, pipeline_specs)

Evaluate a Model

To evaluate a model, use the evaluate method as follows:

email_cleaner.evaluate(csv_path, pipeline_specs)

This will evaluate the designated pipeline on the data in the csv file and return the metrics. (all metrics are used by default)

Package Structure

package_architecture

Maintaining the Package

tests have been setup using tox and pytest. To run the tests, run the following command:

tox

to add tests, feel free to modify the python scriptys in the tests folder. An action automatically runs the tests on every push.

Known Issues

There seems to be an issue when training classifiers for too long. The metrics on the testing data start to decrease after a certain number of epochs. This is not noticeable when training for a few epochs but becomes more apparent when training for longer.

Project details

Release history Release notifications | RSS feed

This version

1.0.0

Jul 19, 2023

0.2.1

Jul 19, 2023

0.2.0

Jul 19, 2023

0.1.5

Jul 19, 2023

0.1.4

Jul 19, 2023

0.1.3

Jul 19, 2023

0.1.2

Jul 11, 2023

0.1.1

Jul 11, 2023

0.1.0

Jul 11, 2023

0.0.18

Jul 11, 2023

0.0.17

Jun 28, 2023

0.0.16

Jun 28, 2023

0.0.15

Jun 27, 2023

0.0.14

Jun 26, 2023

0.0.13

Jun 26, 2023

0.0.12

Jun 23, 2023

0.0.11

Jun 19, 2023

0.0.10

Jun 19, 2023

0.0.9

Jun 16, 2023

0.0.8

Jun 13, 2023

0.0.7

Jun 13, 2023

0.0.6

Jun 13, 2023

0.0.5

Jun 13, 2023

0.0.4

Jun 13, 2023

0.0.3

Jun 13, 2023

0.0.2

Jun 8, 2023

0.0.1

Jun 7, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

email_cleaning_service-1.0.0.tar.gz (39.6 MB view details)

Uploaded Jul 19, 2023 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

email_cleaning_service-1.0.0-py3-none-any.whl (26.6 kB view details)

Uploaded Jul 19, 2023 Python 3

File details

Details for the file email_cleaning_service-1.0.0.tar.gz.

File metadata

Download URL: email_cleaning_service-1.0.0.tar.gz
Upload date: Jul 19, 2023
Size: 39.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for email_cleaning_service-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`b88121ac97b6fd44f764d172699653c422e4dd878d12fda5edb5d5aba06baed1`
MD5	`70b659b98daf2fb3483a27345224b3a1`
BLAKE2b-256	`d5ac143624fc639cb7fef6a46236ff2d05a6ec0a8890be67044938dd16a757d7`

See more details on using hashes here.

File details

Details for the file email_cleaning_service-1.0.0-py3-none-any.whl.

File metadata

Download URL: email_cleaning_service-1.0.0-py3-none-any.whl
Upload date: Jul 19, 2023
Size: 26.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.9

File hashes

Hashes for email_cleaning_service-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d883d81486b5be4109337ce76ca2e8a012381628aeb0a1b94f72f560d2e63587`
MD5	`cc16213f656a290124c15204db0442ab`
BLAKE2b-256	`80c3fc8a72b9fd07635c94043d5b7d9b0c93e6e3b995e9dc08a3f997bae4322c`

See more details on using hashes here.

email-cleaning-service 1.0.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

email_cleaning_service

Getting Started

Usage

Model Training

Training an Encoder

Training a Classifier

Evaluate a Model

Package Structure

Maintaining the Package

Known Issues

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes