A Python library for email canonicalization
Project description
emailCanon
A Python library for email address canonicalization and normalization. Normalizes email addresses according to provider-specific rules (Gmail, Outlook, Yahoo, etc.) to help identify duplicate accounts and standardize email formats.
This is the Python equivalent of @grml/nomadic.
Features
- Provider-aware normalization: Applies provider-specific rules (sub-address stripping, dot removal, case-folding)
- Canonical domain collapsing: Maps alias domains to canonical domains (e.g.,
googlemail.combecomesgmail.com) - Sub-address stripping: Removes subaddresses (e.g.,
user+tagbecomesuser) per provider - RFC-compliant: Handles quoted local parts and validates email structure
- Customizable: Extend with custom providers or override default rules
- Built-in providers: Gmail, Microsoft (Outlook/Hotmail), Yahoo, Apple iCloud, Fastmail, ProtonMail, and 10+ others
Installation
pip install emailcanon
Quick Start
from emailcanon import normalizeEmail, getEmailProvider, isSameEmail
# Basic normalization
normalized = normalizeEmail("Test.User+Tag@Gmail.com")
print(normalized) # "testuser@gmail.com"
# Get the provider ID
provider = getEmailProvider("user@outlook.com")
print(provider) # "microsoft"
# Check if two emails are equivalent
same = isSameEmail("john.doe+newsletter@gmail.com", "johndoe@googlemail.com")
print(same) # True
API Reference
normalizeEmail(email: str, options: NormalizeOptions | None = None) -> str
Normalizes an email address to its canonical form.
Applies provider-specific rules including:
- Sub-address stripping (e.g.,
+tagfor Gmail) - Dot removal (Gmail ignores dots in the local part)
- Case folding of the local part
- Canonical domain mapping (alias domains to primary domain)
Parameters:
email: The email address to normalizeoptions: Optional normalization options (seeNormalizeOptionsbelow)
Returns: Canonical email string. The string is returned regardless of
whether the address is structurally valid; this function discards the valid
flag. If you need the validity flag or the parsed components, use
normalizeEmailDetailed.
Raises: TypeError if email is not a string
Examples:
normalizeEmail("john.doe+newsletter@GMAIL.COM") # "johndoe@gmail.com"
normalizeEmail("first.name@outlook.com") # "firstname@outlook.com"
normalizeEmail("user-tag@yahoo.com") # "user@yahoo.com"
normalizeEmail("\"quoted.local\"@example.com") # "\"quoted.local\"@example.com"
normalizeEmailDetailed(email: str, options: NormalizeOptions | None = None) -> NormalizedEmail
Normalizes an email and returns detailed information about the normalization.
Returns: NormalizedEmail object with:
normalized: Canonical email stringlocal: Normalized local partdomain: Normalized domainproviderId: ID of matched provider (e.g., "gmail"), orNonesubaddress: Extracted sub-address (e.g., "tag" from "user+tag"), orNone(an empty tag such asuser+yieldsNone, just like having no separator)valid: Whether the email is structurally valid (see Validity flag limitations)
Example:
result = normalizeEmailDetailed("john+newsletter@gmail.com")
# NormalizedEmail(
# normalized="john@gmail.com",
# local="john",
# domain="gmail.com",
# providerId="gmail",
# subaddress="newsletter",
# valid=True
# )
getEmailProvider(email: str, options: NormalizeOptions | None = None) -> str | None
Returns the provider ID for an email address, or None if no provider matches.
Parameters:
email: The email addressoptions: Optional normalization options
Returns: Provider ID string (e.g., "gmail", "microsoft") or None
Raises: TypeError if email is not a string
Example:
getEmailProvider("user@gmail.com") # "gmail"
getEmailProvider("name@outlook.com") # "microsoft"
getEmailProvider("person@example.com") # None
isSameEmail(a: str, b: str, options: NormalizeOptions | None = None) -> bool
Checks if two email addresses normalize to the same canonical form.
Useful for detecting duplicate accounts where users registered with different aliases.
Parameters:
a,b: Email addresses to compareoptions: Optional normalization options
Returns: True if both emails normalize identically, False otherwise
Example:
isSameEmail("john.doe@gmail.com", "johndoe+spam@googlemail.com") # True
isSameEmail("john@example.com", "jane@example.com") # False
Configuration
NormalizeOptions
Control normalization behavior via options:
from emailcanon import NormalizeOptions, ProviderRule, normalizeEmail
options = NormalizeOptions(
lowercaseDomain=True, # Default: True
providers=[ # Custom providers to add/override
ProviderRule(
id="custom",
domains=["custom.example.com"],
lowercaseLocal=True,
removeDots=True,
subaddressSeparators=["+"]
)
],
replaceDefaultProviders=False, # Keep built-in providers
defaultRule=None # Rule for unknown domains
)
normalized = normalizeEmail("user@custom.example.com", options)
Caching note: The provider registry built from a NormalizeOptions
instance is memoized per options object (keyed on identity), so reusing the
same options across many calls — the typical bulk-deduplication pattern —
avoids rebuilding the registry every time. Because the cache is keyed on object
identity, mutating an options object after its first use will not be
reflected; construct a fresh NormalizeOptions instead of mutating one.
ProviderRule
Define custom email provider rules:
ProviderRule(
id="my_provider", # Unique identifier
domains=["example.com", "mail.example.com"], # Domain patterns
lowercaseLocal=True, # Convert local part to lowercase
removeDots=True, # Remove dots from local part
subaddressSeparators=["+", "-"], # Characters that separate subaddress
canonicalDomain="example.com" # Map all domains to this
)
Supported Providers
| Provider | ID | Domains | Rules |
|---|---|---|---|
| Gmail | gmail |
gmail.com, googlemail.com | Lowercase, remove dots, + subaddress, maps to gmail.com |
| Microsoft | microsoft |
outlook.com*, hotmail.com*, live.com*, msn.com, others | Lowercase, + subaddress |
| Yahoo | yahoo |
yahoo.com*, ymail.com, rocketmail.com | Lowercase, - subaddress |
| Apple iCloud | icloud |
icloud.com, me.com, mac.com | Lowercase, + subaddress |
| Fastmail | fastmail |
fastmail.com, fastmail.fm | Lowercase, + subaddress |
| ProtonMail | proton |
protonmail.com, protonmail.ch, proton.me, pm.me | Lowercase, + subaddress |
| Yandex | yandex |
yandex.com, yandex.ru, ya.ru, others | Lowercase, + subaddress |
| Zoho | zoho |
zoho.com, zohomail.com, zoho.eu | Lowercase, + subaddress |
| Tutanota | tutanota |
tutanota.com, tutanota.de, tutamail.com, tuta.com, others | Lowercase, + subaddress |
| Posteo | posteo |
posteo.de, posteo.net | Lowercase, + subaddress |
| Mailbox.org | mailbox |
mailbox.org | Lowercase, + subaddress |
| Mailfence | mailfence |
mailfence.com | Lowercase, + subaddress |
| Runbox | runbox |
runbox.com | Lowercase, + subaddress |
| Pobox | pobox |
pobox.com | Lowercase, + subaddress |
| AOL | aol |
aol.com, aim.com | Lowercase |
* Multiple regional variants supported (com.au, co.uk, de, fr, etc.)
Examples
Duplicate Account Detection
from emailcanon import normalizeEmail
emails = [
"john.doe@gmail.com",
"johndoe+shopping@googlemail.com",
"john.doe@yahoo.com",
"J.DOE@GMAIL.COM"
]
# Group by normalized form
normalized_map = {}
for email in emails:
norm = normalizeEmail(email)
if norm not in normalized_map:
normalized_map[norm] = []
normalized_map[norm].append(email)
# Find duplicates
for norm_email, originals in normalized_map.items():
if len(originals) > 1:
print(f"Duplicates: {originals} maps to {norm_email}")
# Output: Duplicates: ['john.doe@gmail.com', 'johndoe+shopping@googlemail.com', 'J.DOE@GMAIL.COM'] maps to john@gmail.com
Custom Provider Rules
from emailcanon import normalizeEmail, NormalizeOptions, ProviderRule
# Add custom provider
options = NormalizeOptions(
providers=[
ProviderRule(
id="company",
domains=["company.com", "corp.company.com"],
canonicalDomain="company.com",
lowercaseLocal=True,
subaddressSeparators=["+"]
)
]
)
normalizeEmail("User+Team@corp.company.com", options)
# "user@company.com"
Skip Default Providers
from emailcanon import normalizeEmail, NormalizeOptions, ProviderRule
# Use only custom providers, ignore built-in ones
options = NormalizeOptions(
replaceDefaultProviders=True,
providers=[
ProviderRule(
id="custom",
domains=["custom.local"],
lowercaseLocal=True
)
]
)
normalizeEmail("User@custom.local", options)
# "user@custom.local"
Design Notes
- Conservative by default: Unknown domains get minimal normalization (lowercase domain only, local part unchanged)
- Quoted strings: RFC 5321 quoted local parts (e.g.,
"user name"@example.com) are preserved verbatim - Domain validation: Domains must follow standard DNS naming (labels separated by dots, alphanumeric + hyphens)
- Immutable rules: Provider rules are frozen dataclasses; mutation is not possible
Validity flag limitations
The valid flag returned by normalizeEmailDetailed is a pragmatic structural
check, not a full RFC 5321/5322 validator. In particular, the domain check has
two known limitations:
- Single-label hosts are rejected. The domain must contain at least one dot,
so hosts like
localhostare reported asvalid=False, even though they are deliverable in some environments. - Non-ASCII / IDN domains are rejected. The check is ASCII-only, so
internationalized domains such as
münchen.deare reported asvalid=False. Pre-encode them to Punycode (xn--mnchen-3ya.de) if you need them to pass.
These limitations only affect the valid flag; normalization of the local part
and domain is still performed in all cases.
Why Email Canonicalization?
Email addresses can look different but deliver to the same mailbox:
| Input | Gmail Reality |
|---|---|
john.doe@gmail.com |
johndoe@gmail.com (dots ignored) |
johndoe+newsletter@gmail.com |
johndoe@gmail.com (subaddress stripped) |
johndoe@googlemail.com |
johndoe@gmail.com (domain alias) |
Without canonicalization, a user could register multiple accounts. emailCanon standardizes these to detect and prevent duplicate registrations.
Development
Set up a local virtual environment and install the package with its dev dependencies (mypy, ruff):
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install the package in editable mode with dev extras
pip install -e ".[dev]"
Then run the tooling:
# Run the test suite (the test files are named after the area they cover,
# e.g. gmail.py, so discovery needs an explicit pattern)
python -m unittest discover -s tests -p "*.py"
# ...or run a single test module directly
python -m unittest tests.gmail
mypy # type-check
ruff format # format with tabs
ruff check # lint
When you're done, leave the environment with deactivate.
References
Provider rules compiled from:
- Wikipedia, Email address (sub-addressing section)
- aaronbassett's Email sub-addressing gist
- validator.js normalizeEmail conventions
- Official provider documentation (Fastmail, Microsoft Learn, Proton)
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file emailcanon-0.1.0.tar.gz.
File metadata
- Download URL: emailcanon-0.1.0.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ddd573c428157e24bdacee2bdbcd30ec13f6bc48aef80dced2ced85bec9d3f17
|
|
| MD5 |
1dd6ffadd49b22311f89b5a852f457d5
|
|
| BLAKE2b-256 |
776da4461c28d106173749e0f5df17e6aea0874f9e0cb635bb0b24bb54cc8daf
|
File details
Details for the file emailcanon-0.1.0-py3-none-any.whl.
File metadata
- Download URL: emailcanon-0.1.0-py3-none-any.whl
- Upload date:
- Size: 11.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5f986f49bf35baedf6ebd1bd4b1063c587ebc9d7795691a7b728c6d53d8a6b71
|
|
| MD5 |
e415bb31eab1ddb587b2c6fa39b94c57
|
|
| BLAKE2b-256 |
97b8ef5c96913eb0949fbfc76c6af46d5470906ffaf75ce4f57da9f833745208
|