A simple email analyzer

A fast spam filter written in Python, inspired by SpamAssassin and integrated with machine learning.

Table of Contents
What is spam-analyzer?
- What is spam and how does spam-analyzer know it?
Installation
Usage
- CLI
- Configuration
- Python
Contributing
License

What is spam-analyzer?

spam-analyzer is a CLI (Command Line Interface) application that aims to be a viable alternative to spam filter services.

This program classifies the emails given as input as spam or non-spam using a machine learning algorithm (Random Forest). The model is trained on a dataset of 19900 emails. It can still misclassify some emails; to improve the accuracy of the model, you can train it on your own personalized dataset.

The main features of spam-analyzer are:

spam recognition with the option to display a detailed analysis of the email
JSON output
it can be used as a library in your Python project to extract features from an email
it is written in Python with its most modern features to ensure software correctness
extensible with plugins
100% containerized with Docker

What is spam and how does spam-analyzer know it?

The analysis takes into consideration the following main aspects:

the headers of the email
the body of the email
the attachments of the email

The most significant parts are the headers and the body of the email. The headers are analyzed to extract the following features:

SPF (Sender Policy Framework)
DKIM (DomainKeys Identified Mail)
DMARC (Domain-based Message Authentication, Reporting & Conformance)
If the sender domain is the same as the one in the first Received header
The subject of the email
The send date
If the send date is compliant with RFC 2822 and was sent from a valid time zone
The date of the first Received header
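
A few of the header features above can be sketched with the Python standard library alone. This is only an illustration, not the code spam-analyzer uses: the function name, the returned keys, and the sample message are hypothetical, "first Received header" is assumed to mean the earliest hop (the last Received header in the file), and a "valid time zone" is assumed to mean a UTC offset between -12:00 and +14:00. SPF, DKIM, and DMARC checks are not covered.

```python
import re
from datetime import timedelta
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical sample message used only for illustration.
SAMPLE = """\
Received: from mail.example.com (mail.example.com [192.0.2.1])
    by mx.recipient.test; Mon, 20 Nov 2023 10:15:02 +0100
From: Alice <alice@example.com>
Subject: Meeting notes
Date: Mon, 20 Nov 2023 10:15:00 +0100

Hello!
"""


def extract_header_features(raw_email):
    """Return a dict of simple header features for a raw email message."""
    msg = message_from_string(raw_email)

    # Domain part of the From address, e.g. "example.com".
    _, address = parseaddr(msg.get("From", ""))
    sender_domain = address.rpartition("@")[2].lower()

    # Assumption: the "first" Received header is the earliest hop,
    # which is the last Received header in the file.
    received = msg.get_all("Received") or []
    match = re.search(r"from\s+([\w.-]+)", received[-1]) if received else None
    first_hop = match.group(1).lower() if match else None

    # parsedate_to_datetime raises on dates it cannot parse.
    try:
        sent = parsedate_to_datetime(msg.get("Date", ""))
    except (TypeError, ValueError):
        sent = None

    # Assumption: a valid UTC offset lies between -12:00 and +14:00.
    offset = sent.utcoffset() if sent is not None else None
    valid_tz = offset is not None and (
        timedelta(hours=-12) <= offset <= timedelta(hours=14)
    )

    domain_matches = bool(sender_domain) and first_hop is not None and (
        first_hop == sender_domain or first_hop.endswith("." + sender_domain)
    )

    return {
        "sender_domain": sender_domain,
        "domain_matches_first_received": domain_matches,
        "subject": msg.get("Subject", ""),
        "date_is_rfc2822": sent is not None,
        "timezone_is_valid": valid_tz,
    }


features = extract_header_features(SAMPLE)
```

Treating a subdomain such as mail.example.com as a match for example.com is a design choice of this sketch; a stricter comparison would require equality.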

The body is analyzed to extract the following features:

If there are links
If there are images
If links are only http or https
The percentage of the body that is written in uppercase
The percentage of the body that contains blacklisted words
The polarity of the body calculated with TextBlob
The subjectivity of the body calculated with TextBlob
If it contains mailto links
If it contains JavaScript code
If it contains HTML code
If it contains HTML forms
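
Some of the body features above can be approximated with regular expressions and string checks. The sketch below is a rough illustration under stated assumptions, not the project's implementation: the function name and dictionary keys are hypothetical, the uppercase percentage is computed over alphabetic characters, and the blacklisted-word percentage is computed over words. The TextBlob polarity and subjectivity scores are left out.

```python
import re

# A URL is taken to be "<scheme>://<non-space characters>".
URL_RE = re.compile(r"\b([a-z][a-z0-9+.-]*)://[^\s\"'<>]+", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z']+")


def body_features(body, forbidden_words):
    """Compute a few simple, hand-rolled features of an email body."""
    schemes = [m.group(1).lower() for m in URL_RE.finditer(body)]
    letters = [c for c in body if c.isalpha()]
    words = [w.lower() for w in WORD_RE.findall(body)]
    forbidden = {w.lower() for w in forbidden_words}
    lowered = body.lower()

    # Percentages default to 0.0 when the body has no letters or words.
    upper_pct = 100 * sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    forbidden_pct = 100 * sum(w in forbidden for w in words) / len(words) if words else 0.0

    return {
        "has_links": bool(schemes),
        # Vacuously True when there are no links at all.
        "links_only_http": all(s in ("http", "https") for s in schemes),
        "uppercase_pct": upper_pct,
        "forbidden_pct": forbidden_pct,
        "has_mailto": "mailto:" in lowered,
        "has_form": "<form" in lowered,
        "has_script": "<script" in lowered,
    }


features = body_features("FREE viagra", ["viagra", "cialis"])
```

Substring checks such as "<form" are deliberately naive; a real HTML parser would be more robust against obfuscated markup.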

The task could be solved programmatically by chaining a long set of if statements over the features extracted from the email. However, this approach does not scale and is hard to maintain. Moreover, the accuracy could not be improved without changing the code and, most importantly, the analysis would rest on the programmer's knowledge rather than on the data. Since we live in the data era, we should let the data solve the problem. So I decided to use a machine learning algorithm that uses all the features extracted from the email.
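
To make the contrast concrete, the sketch below puts a hand-written rule chain next to the numeric feature vector a learned model would consume instead. Everything here is hypothetical: the feature names, their order, and the thresholds are invented for illustration and are not taken from spam-analyzer.

```python
# Hypothetical feature names, in the fixed order a model would expect.
FEATURE_ORDER = (
    "has_links",
    "links_only_http",
    "uppercase_pct",
    "forbidden_pct",
    "has_form",
)


def rule_based_is_spam(features):
    """The approach argued against: thresholds hard-coded by the programmer."""
    if features["forbidden_pct"] > 10:
        return True
    if features["uppercase_pct"] > 50 and features["has_links"]:
        return True
    return False


def to_vector(features):
    """Map a feature dict to a fixed-order list of floats (booleans become 0.0 or 1.0)."""
    return [float(features[name]) for name in FEATURE_ORDER]


sample = {
    "has_links": True,
    "links_only_http": False,
    "uppercase_pct": 40.0,
    "forbidden_pct": 50.0,
    "has_form": False,
}
vector = to_vector(sample)
```

A list of such vectors, paired with spam or non-spam labels, is the kind of input a Random Forest implementation (for example scikit-learn's RandomForestClassifier) is fitted on, so changing the decision boundary means retraining on data rather than editing thresholds.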

Installation

spam-analyzer is available on PyPI, so you can install it with pip:

pip install spam-analyzer

For the latest version, you can install it from the source code:

git clone https://github.com/matteospanio/spam-analyzer.git
cd spam-analyzer
pip install .

Usage

CLI

spam-analyzer can be used as a CLI application:

Usage: spam-analyzer [OPTIONS] COMMAND [ARGS]...

  A simple program to analyze emails.

Options:
  -h, --help                Show this message and exit.
  -v, --verbose             Enables verbose mode.
  --version                 Show the version and exit.
  -C, --config CONFIG_PATH  Location of the configuration file. Supports glob
                            pattern of local path and remote URL.

Commands:
  analyze    Analyze emails from a file or directory.
  configure  Configure the program.
  plugins    Show all available plugins.

spam-analyzer analyze <file>: classify the given email
spam-analyzer -v analyze <file>: classify the given email and display a detailed analysis[^1]
spam-analyzer analyze -fmt json <file>: classify the given email and display the result in JSON format (useful for integration with other programs)
spam-analyzer analyze -fmt json -o <outpath> <file>: classify the given email and write the result in JSON format to the file <outpath>[^2]
spam-analyzer analyze -l <wordlist> <file>: classify the given email using the given wordlist
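
For integration with other programs, the JSON variant can be driven from a script. The following is a minimal sketch, assuming spam-analyzer is installed and on the PATH; the helper names are hypothetical, and the structure of the JSON result is not documented here, so it is returned as parsed without interpretation. The result is read from the output file rather than from standard output.

```python
import json
import subprocess
from pathlib import Path


def build_command(email_path, out_path):
    """Build the argument list for the JSON-to-file invocation shown above."""
    return [
        "spam-analyzer", "analyze",
        "-fmt", "json",
        "-o", str(out_path),
        str(email_path),
    ]


def analyze_to_json(email_path, out_path):
    """Run spam-analyzer and return the parsed content of the JSON output file."""
    # check=True raises CalledProcessError if the program exits with an error.
    subprocess.run(build_command(email_path, out_path), check=True)
    return json.loads(Path(out_path).read_text(encoding="utf-8"))
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.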

Configuration

spam-analyzer is designed to be highly configurable: on its first execution it creates a configuration file in ~/.config/spamanalyzer/, along with some other default files. You can edit the configuration file to customize the behavior of the program. At the time of writing it contains only the paths to the wordlist and the model, but more options are planned (e.g. sender blacklist and whitelist, a default path where classified emails are copied, ...).

[^1]: The --verbose option is available only for the first use case; it does not work in combination with the --output-format option.

[^2]: Use the --output-file option instead of the > operator to write the output to a file, because spam-analyzer prints other information on standard output while processing the email(s).

Python

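import asyncio
from spamanalyzer import SpamAnalyzer

async def main():
    analyzer = SpamAnalyzer(forbidden_words=["viagra", "cialis"])
    return await analyzer.analyze("path/to/email.txt")

analysis = asyncio.run(main())  # analyze() must be awaited, so run it in an event loop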

The spamanalyzer library provides a simple interface to extract features from an email. The SpamAnalyzer class provides the analyze method, a coroutine that takes the path to the email as input and returns a MailAnalysis object containing the analysis of the email.

The MailAnalysis class in turn provides the is_spam method, which returns True if the email is spam and False otherwise. Further examples are available in the examples folder of the source code.
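
Because analyze is awaited, several emails can in principle be analyzed concurrently. The helper below is a hypothetical sketch, not part of the library: analyze_many is an invented name, and whether a single SpamAnalyzer instance is safe to share across concurrent calls is an assumption that should be checked against the source.

```python
import asyncio


async def analyze_many(analyzer, paths):
    """Await analyzer.analyze(path) for every path and return results in input order."""
    return await asyncio.gather(*(analyzer.analyze(path) for path in paths))


# Possible usage with the library (requires `pip install spam-analyzer`):
#
#     from spamanalyzer import SpamAnalyzer
#
#     analyzer = SpamAnalyzer(forbidden_words=["viagra", "cialis"])
#     analyses = asyncio.run(analyze_many(analyzer, ["a.eml", "b.eml"]))
#     spam_flags = [analysis.is_spam() for analysis in analyses]
```

asyncio.gather preserves the order of its arguments, so each result lines up with the path it came from.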

Contributing

Contributions are welcome! Please read the contribution guidelines first.

License

spam-analyzer is licensed under the GPLv3 license.

Release history

1.0.11 - Nov 22, 2023 (this version)
1.0.10 - Nov 19, 2023
1.0.9 - Sep 21, 2023
1.0.8 - Sep 19, 2023
1.0.7 - Sep 4, 2023
1.0.6 - Aug 27, 2023
1.0.5 - Aug 24, 2023
1.0.4 - Aug 14, 2023
1.0.3 - Aug 8, 2023
1.0.2 - Aug 7, 2023
1.0.1 - Aug 3, 2023
1.0.1b0 (pre-release) - Jul 31, 2023
1.0.0 - Jul 31, 2023
1.0.0b0 (pre-release) - Jul 29, 2023
0.2.1 - Jul 18, 2023
0.2.1b0 (pre-release) - Jul 18, 2023
0.2.0b1 (pre-release) - Jul 18, 2023
0.2.0b0 (pre-release) - Jul 17, 2023
0.1.2 - Jul 16, 2023
0.1.1 - Jun 4, 2023
0.1.0 - Dec 9, 2022
0.0.3 - Dec 9, 2022
0.0.2 - Nov 27, 2022
0.0.1 - Nov 26, 2022

Download files

Source Distribution

spam_analyzer-1.0.11.tar.gz (13.0 MB)

Uploaded Nov 22, 2023 Source

Built Distribution

spam_analyzer-1.0.11-py3-none-any.whl (13.5 MB)

Uploaded Nov 22, 2023 Python 3

File details

Details for the file spam_analyzer-1.0.11.tar.gz.

File metadata

Download URL: spam_analyzer-1.0.11.tar.gz
Upload date: Nov 22, 2023
Size: 13.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.5.1 CPython/3.10.13 Linux/6.2.0-1016-azure

File hashes

Hashes for spam_analyzer-1.0.11.tar.gz
Algorithm	Hash digest
SHA256	`7ab50eec4fb82695f92a1727ee7c23a572607114ddb72f8f2cdbad498de05905`
MD5	`31fe267d632ce6ea13cefa58789c7fa4`
BLAKE2b-256	`7ab4eb0d4cc1e3e8a858c3bbf96aceab66cd94732b19c2eed3f31ea32c41c98a`

File details

Details for the file spam_analyzer-1.0.11-py3-none-any.whl.

File metadata

Download URL: spam_analyzer-1.0.11-py3-none-any.whl
Upload date: Nov 22, 2023
Size: 13.5 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.5.1 CPython/3.10.13 Linux/6.2.0-1016-azure

File hashes

Hashes for spam_analyzer-1.0.11-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e3d37aea325e1639b3ed174931e49ba15e1441967a0e1f8160b34d211a8b83a4`
MD5	`b47c9667918dc32451bb4cbb6d582875`
BLAKE2b-256	`dc30f19cf919f04743f7497b10ffa84f7c5a058c0abc2decd9ba75d9b53cb8a0`
