
A robust email syntax and deliverability validation library for Python 2.x/3.x forked for vectorization

email-validator-vector-friendly: Validate Email Addresses

This is a vectorization-friendly fork by Bryce Merrill of the email-validator library created by Joshua Tauberer. This fork is intended for more efficient use of the library with large datasets; it replaces exception raising in the case of invalid emails with the simple return of a False boolean value stored in a .valid property and a detailed error description stored in a .error property.

The original email-validator README is below (with adjustments for replaced functionality):

A robust email address syntax and deliverability validation library for Python 2.7/3.4+ by Joshua Tauberer.

This library validates that a string is of the form name@example.com. This is the sort of validation you would want for an email-based login form on a website.

Key features:

  • Checks that an email address has the correct syntax --- good for login forms or other uses related to identifying users.
  • Gives friendly error messages when validation fails (appropriate to show to end users).
  • (optionally) Checks deliverability: Does the domain name resolve? And you can override the default DNS resolver.
  • Supports internationalized domain names and (optionally) internationalized local parts.
  • Normalizes email addresses (super important for internationalized addresses! see below).

The library is NOT for validation of the To: line in an email message (e.g. My Name <my@address.com>), which flanker is more appropriate for. And this library does NOT permit obsolete forms of email addresses, so if you need strict validation against the email specs exactly, use pyIsEmail.

This library was first published in 2015. The current version is 1.1.1 (posted May 19, 2020). Starting in version 1.1.0, the type of the value returned from validate_email has changed, but dict-style access to the validated address information still works, so it is backwards compatible.

Installation

This package is on PyPI, so:

pip install email-validator-vector-friendly

pip3 also works.

Usage

To add "valid" and "error" columns to a DataFrame containing potential email addresses:

from email_validator import validate_email
import pandas as pd

examples = ['firstlast@gmail.', 'firstlast@gmail.com', '@gmail.com']

df = pd.DataFrame({'emails': examples})

results = [validate_email(address) for address in df['emails']]  # validate each address only once
df['valid'] = [result.valid for result in results]
df['errors'] = [result.error for result in results]

pd.set_option('display.expand_frame_repr', False)

print(df)

This would result in the below DataFrame:

                emails  valid                                             errors
0     firstlast@gmail.  False  The domain name gmail. is not valid. It is not...
1  firstlast@gmail.com   True                                                   
2           @gmail.com  False  The email address contains invalid characters ...

When validating many email addresses or to control the timeout (the default is 15 seconds), create a caching dns.resolver.Resolver to reuse in each call:

from email_validator import validate_email, caching_resolver

resolver = caching_resolver(timeout=10)

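emails = ['firstlast@gmail.com', 'test@joshdata.me']  # the addresses to check

for email in emails:
  valid = validate_email(email, dns_resolver=resolver)  # reuses the cached resolver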

The validator will accept internationalized email addresses, but not all mail systems can send email to addresses with non-ASCII characters in the local part of the address (before the @-sign). See the allow_smtputf8 option below.

Overview

The module provides a function validate_email(email_address) which takes an email address (either a str or ASCII bytes) and:

  • Returns an object with information about the address, including the properties .valid (bool) and .error (str error description)

Regardless of validation result, an object is returned containing a normalized form of the email address (which you should use!) and other information (such as validation status and error description).

The validator doesn't permit obsoleted forms of email addresses that no one uses anymore even though they are still valid and deliverable, since they will probably give you grief if you're using email for login. (See later in the document about that.)

The validator checks that the domain name in the email address resolves. There is nothing to be gained by trying to actually contact an SMTP server, so that's not done here. For privacy, security, and practicality reasons servers are good at not giving away whether an address is deliverable or not: email addresses that appear to accept mail at first can bounce mail after a delay, and bounced mail may indicate a temporary failure of a good email address (sometimes an intentional failure, like greylisting).

The function also accepts the following keyword arguments (default as shown):

allow_smtputf8=True: Set to False to prohibit internationalized addresses that would require the SMTPUTF8 extension.

check_deliverability=True: Set to False to skip the domain name resolution check.

allow_empty_local=False: Set to True to allow an empty local part (i.e. @example.com), e.g. for validating Postfix aliases.

dns_resolver=None: Pass an instance of dns.resolver.Resolver to control the DNS resolver including setting a timeout and a cache. The caching_resolver function shown above is a helper function to construct a dns.resolver.Resolver with a LRUCache. Reuse the same resolver instance across calls to validate_email to make use of the cache.
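
The benefit of reusing one cache across calls can be illustrated with a small standard-library analogy. This is only a sketch: resolve_domain is a hypothetical stand-in for a DNS query, and functools.lru_cache is used here in place of the LRUCache that caching_resolver sets up, so it does not show the library's actual mechanism.

```python
from functools import lru_cache

lookups = []  # records every domain that actually reaches the (fake) resolver


@lru_cache(maxsize=1024)
def resolve_domain(domain: str) -> bool:
    """Hypothetical stand-in for a DNS query; always reports success."""
    lookups.append(domain)
    return True


addresses = ["a@example.org", "b@example.org", "c@joshdata.me"]
for address in addresses:
    # Only the domain part (after the last @) is looked up.
    resolve_domain(address.rpartition("@")[2])

print(lookups)  # ['example.org', 'joshdata.me']
```

The second example.org address never reaches the resolver, which is the effect you get from passing the same resolver instance to every validate_email call.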

In non-production test environments, you may want to allow @test or @mycompany.test email addresses to be used as placeholder email addresses, which would normally not be permitted. In that case, pass test_environment=True. DNS-based deliverability checks will be disabled as well.

Internationalized email addresses

The email protocol SMTP and the domain name system DNS have historically only allowed ASCII characters in email addresses and domain names, respectively. Each has adapted to internationalization in a separate way, creating two separate aspects to email address internationalization.

Internationalized domain names (IDN)

The first is internationalized domain names (RFC 5891), a.k.a. IDNA 2008. DNS itself has not been updated with Unicode support. Instead, internationalized domain names are converted into a special IDNA ASCII "Punycode" form starting with xn--. When an email address has non-ASCII characters in its domain part, the domain part is replaced with its IDNA ASCII equivalent form in the process of mail transmission. Your mail submission library probably does this for you transparently. Note that most web browsers are currently in transition between IDNA 2003 (RFC 3490) and IDNA 2008 (RFC 5891) and compliance around the web is not very good in any case, so be aware that edge cases are handled differently by different applications and libraries. This library conforms to IDNA 2008 using the idna module by Kim Davies.
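
To see the Punycode conversion in isolation, here is a minimal sketch using Python's built-in "idna" codec. Note the assumption: that codec implements IDNA 2003, whereas this library uses the separate idna module (IDNA 2008), so the two can disagree on edge cases. For the simple domain used elsewhere in this document they agree.

```python
# Round-trip a domain through Python's built-in "idna" codec.
# Assumption: this codec follows IDNA 2003; the library itself relies on the
# third-party `idna` module (IDNA 2008), so edge cases may differ.
unicode_domain = "ツ.life"

# Unicode -> IDNA ASCII ("Punycode") form, as it would go on the wire.
ascii_domain = unicode_domain.encode("idna").decode("ascii")
print(ascii_domain)  # xn--bdk.life

# IDNA ASCII -> Unicode again.
round_tripped = ascii_domain.encode("ascii").decode("idna")
print(round_tripped == unicode_domain)  # True
```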

Internationalized local parts

The second sort of internationalization is internationalization in the local part of the address (before the @-sign). These email addresses require that your mail submission library and the mail servers along the route to the destination, including your own outbound mail server, all support the SMTPUTF8 (RFC 6531) extension. Support for SMTPUTF8 varies.

If you know ahead of time that SMTPUTF8 is not supported by your mail submission stack

By default all internationalized forms are accepted by the validator. But if you know ahead of time that SMTPUTF8 is not supported by your mail submission stack, then you must filter out addresses that require SMTPUTF8 using the allow_smtputf8=False keyword argument (see above). This will cause validation to fail (the returned object's .valid property will be False) if delivery would require SMTPUTF8, which is only the case when non-ASCII characters appear before the @-sign. If you do not set allow_smtputf8=False, you can also check the value of the smtputf8 field in the returned object.
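
The rule itself is simple enough to sketch with the standard library. The helper below is hypothetical (needs_smtputf8 is not part of this library) and only mirrors the condition described above: non-ASCII characters before the @-sign.

```python
def needs_smtputf8(address: str) -> bool:
    """Return True if the local part (before the last @) has non-ASCII characters.

    Hypothetical helper for illustration; it performs no validation.
    """
    local_part, _, _domain = address.rpartition("@")
    return not local_part.isascii()


print(needs_smtputf8("ツ-test@joshdata.me"))  # True: non-ASCII local part
print(needs_smtputf8("example@ツ.life"))      # False: only the domain is non-ASCII
```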

If your mail submission library doesn't support Unicode at all --- even in the domain part of the address --- then immediately prior to mail submission you must replace the email address with its ASCII-ized form. This library gives you back the ASCII-ized form in the ascii_email field in the returned object, which you can get like this:

valid = validate_email(email, allow_smtputf8=False)
email = valid.ascii_email

The local part is left alone (if it has internationalized characters, allow_smtputf8=False will force validation to fail) and the domain part is converted to IDNA ASCII. (You probably should not do this at account creation time so you don't change the user's login information without telling them.)
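
As a rough illustration of how such an ASCII-ized form is put together, the sketch below leaves the local part alone and converts only the domain. The function ascii_form is hypothetical, it assumes the address has already been validated, and it uses Python's built-in IDNA 2003 codec rather than the IDNA 2008 idna module this library depends on.

```python
from typing import Optional


def ascii_form(address: str) -> Optional[str]:
    """Return an ASCII-only form of an already-validated address, or None.

    Hypothetical helper: the local part is kept as-is and the domain is
    converted with Python's built-in "idna" codec (IDNA 2003).
    """
    local_part, _, domain = address.rpartition("@")
    if not local_part.isascii():
        # A non-ASCII local part has no ASCII equivalent.
        return None
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local_part}@{ascii_domain}"


print(ascii_form("example@ツ.life"))      # example@xn--bdk.life
print(ascii_form("ツ-test@joshdata.me"))  # None
```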

UCS-4 support required for Python 2.7

Note that when using Python 2.7, it must have been built with UCS-4 support; otherwise email addresses with Unicode characters outside of the BMP (Basic Multilingual Plane) will not validate correctly.

Normalization

The use of Unicode in email addresses introduced a normalization problem. Different Unicode strings can look identical and have the same semantic meaning to the user. The email field returned on successful validation provides the correctly normalized form of the given email address:

valid = validate_email("me@Domain.com")
email = valid.email
print(email)
# prints: me@domain.com

Because an end-user might type their email address in different (but equivalent) un-normalized forms at different times, you ought to replace what they enter with the normalized form immediately prior to going into your database (during account creation), querying your database (during login), or sending outbound mail. Normalization may also change the length of an email address, and this may affect whether it is valid and acceptable by your SMTP provider.

The normalizations include lowercasing the domain part of the email address (domain names are case-insensitive), Unicode "NFC" normalization of the whole address (which turns characters plus combining characters into precomposed characters where possible), replacement of fullwidth and halfwidth characters in the domain part, possibly other UTS46 mappings on the domain part, and conversion from Punycode to Unicode characters.

(See RFC 6532 (internationalized email) section 3.1 and RFC 5895 (IDNA 2008) section 2.)
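
Two of these steps, NFC normalization and lowercasing the domain, can be sketched with the standard library alone. The function normalize_sketch is hypothetical and deliberately incomplete: it skips the UTS46 mappings and the Punycode-to-Unicode conversion, so it is not a substitute for the email field the library returns.

```python
import unicodedata


def normalize_sketch(address: str) -> str:
    """Apply NFC to the whole address and lowercase the domain part only.

    Hypothetical, partial illustration; no UTS46 or Punycode handling.
    """
    local_part, _, domain = address.rpartition("@")
    local_part = unicodedata.normalize("NFC", local_part)
    domain = unicodedata.normalize("NFC", domain).lower()
    return f"{local_part}@{domain}"


print(normalize_sketch("me@Domain.com"))  # me@domain.com

decomposed = "jose\u0301@example.com"  # "e" followed by a combining acute accent
composed = normalize_sketch(decomposed)
print(len(decomposed), len(composed))  # 17 16
```

The second example shows why normalization can change the length of an address: the letter and its combining accent collapse into one precomposed character.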

Examples

For the email address test@joshdata.me, the returned object is:

ValidatedEmail(
  email='test@joshdata.me',
  local_part='test',
  domain='joshdata.me',
  ascii_email='test@joshdata.me',
  ascii_local_part='test',
  ascii_domain='joshdata.me',
  smtputf8=False,
  mx=[(10, 'box.occams.info')],
  mx_fallback_type=None,
  valid=True,
  error="")

For the fictitious address example@ツ.life, which has an internationalized domain but ASCII local part, the returned object is:

ValidatedEmail(
  email='example@ツ.life',
  local_part='example',
  domain='ツ.life',
  ascii_email='example@xn--bdk.life',
  ascii_local_part='example',
  ascii_domain='xn--bdk.life',
  smtputf8=False,
  valid=True,
  error="")

Note that smtputf8 is False even though the domain part is internationalized because SMTPUTF8 is only needed if the local part of the address is internationalized (the domain part can be converted to IDNA ASCII Punycode). Also note that the email and domain fields provide a normalized form of the email address and domain name (casefolding and Unicode normalization as required by IDNA 2008).

Calling validate_email with the ASCII form of the above email address, example@xn--bdk.life, returns the exact same information (i.e., the email field always will contain Unicode characters, not Punycode).

For the fictitious address ツ-test@joshdata.me, which has an internationalized local part, the returned object is:

ValidatedEmail(
  email='ツ-test@joshdata.me',
  local_part='ツ-test',
  domain='joshdata.me',
  ascii_email=None,
  ascii_local_part=None,
  ascii_domain='joshdata.me',
  smtputf8=True,
  valid=True,
  error="")

Now smtputf8 is True and ascii_email is None because the local part of the address is internationalized. The local_part and email fields return the normalized form of the address: certain Unicode characters (such as angstrom and ohm) may be replaced by other equivalent code points (a-with-ring and omega).

Return value

When an email address passes validation, the fields in the returned object are:

email: The normalized form of the email address that you should put in your database. This merely combines the local_part and domain fields (see below).

ascii_email: If set, an ASCII-only form of the email address by replacing the domain part with IDNA Punycode. This field will be present when an ASCII-only form of the email address exists (including if the email address is already ASCII). If the local part of the email address contains internationalized characters, ascii_email will be None. If set, it merely combines ascii_local_part and ascii_domain.

local_part: The local part of the given email address (before the @-sign) with Unicode NFC normalization applied.

ascii_local_part: If set, the local part, which is composed of ASCII characters only.

domain: The canonical internationalized Unicode form of the domain part of the email address. If the returned string contains non-ASCII characters, either the SMTPUTF8 feature of your mail relay will be required to transmit the message or else the email address's domain part must be converted to IDNA ASCII first: use the ascii_domain field instead.

ascii_domain: The IDNA Punycode-encoded form of the domain part of the given email address, as it would be transmitted on the wire.

smtputf8: A boolean indicating that the SMTPUTF8 feature of your mail relay will be required to transmit messages to this address because the local part of the address has non-ASCII characters (the local part cannot be IDNA-encoded). If allow_smtputf8=False is passed as an argument, this flag will always be false because validation fails if it would have been true.

mx: A list of (priority, domain) tuples of MX records specified in the DNS for the domain (see RFC 5321 section 5). May be None if the deliverability check could not be completed because of a temporary issue like a timeout.

mx_fallback_type: None if an MX record is found. If no MX records are actually specified in DNS and instead are inferred, through an obsolete mechanism, from A or AAAA records, the value is the type of DNS record used instead (A or AAAA). May be None if the deliverability check could not be completed because of a temporary issue like a timeout.

valid: True if the email address is valid, False if it is not.

error: "" if the email address is valid, a detailed error (string) if it is not.
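
If you want to act on the mx field, for example to log which host would receive the mail, a small helper along these lines may be useful. It is a hypothetical sketch (preferred_mx is not provided by the library) that relies only on the documented shape of the field, a list of (priority, domain) tuples or None, and on the RFC 5321 rule that lower preference numbers are tried first.

```python
from typing import List, Optional, Tuple


def preferred_mx(mx: Optional[List[Tuple[int, str]]]) -> Optional[str]:
    """Return the MX host with the lowest preference number, or None.

    Hypothetical helper; `mx` has the shape described for the mx field.
    """
    if not mx:
        # None (check not completed) or an empty list: nothing to choose from.
        return None
    _priority, host = min(mx, key=lambda record: record[0])
    return host


print(preferred_mx([(10, "box.occams.info")]))  # box.occams.info
```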

Assumptions

By design, this validator does not pass all email addresses that strictly conform to the standards. Many email address forms are obsolete or likely to cause trouble:

  • The validator assumes the email address is intended to be deliverable on the public Internet. The domain part of the email address must be a resolvable domain name. Special Use Domain Names and their subdomains are always considered invalid (except see the test_environment parameter above).
  • The "quoted string" form of the local part of the email address (RFC 5321 4.1.2) is not permitted --- no one uses this anymore anyway. Quoted forms allow multiple @-signs, space characters, and other troublesome conditions.
  • The "literal" form for the domain part of an email address (an IP address) is not accepted --- no one uses this anymore anyway.
