Validate an E-Mail Address with PHP, the Right Way

Develop a working PHP function to validate e-mail addresses.
Requirements

The IETF documents RFC 1035 “Domain Names - Implementation and Specification”, RFC 2234 “Augmented BNF for Syntax Specifications: ABNF”, RFC 2821 “Simple Mail Transfer Protocol” and RFC 2822 “Internet Message Format”, in addition to RFC 3696 (referenced earlier), all contain information relevant to e-mail address validation. RFC 2822 supersedes RFC 822 “Standard for the Format of ARPA Internet Text Messages” and makes it obsolete.

Following are the requirements for an e-mail address, with relevant references:

  1. An e-mail address consists of local part and domain separated by an at sign (@) character (RFC 2822 3.4.1).

  2. The local part may consist of alphabetic and numeric characters, and the following characters: !, #, $, %, &, ', *, +, -, /, =, ?, ^, _, `, {, |, } and ~, possibly with dot separators (.), inside, but not at the start, end or next to another dot separator (RFC 2822 3.2.4).

  3. The local part may consist of a quoted string—that is, anything within quotes ("), including spaces (RFC 2822 3.2.5).

  4. Quoted pairs (such as \@) are valid components of a local part, though an obsolete form from RFC 822 (RFC 2822 4.4).

  5. The maximum length of a local part is 64 characters (RFC 2821 4.5.3.1).

  6. A domain consists of labels separated by dot separators (RFC 1035 2.3.1).

  7. Domain labels start with an alphabetic character followed by zero or more alphabetic characters, numeric characters or the hyphen (-), ending with an alphabetic or numeric character (RFC 1035 2.3.1).

  8. The maximum length of a label is 63 characters (RFC 1035 2.3.1).

  9. The maximum length of a domain is 255 characters (RFC 2821 4.5.3.1).

  10. The domain must be fully qualified and resolvable to a type A or type MX DNS address record (RFC 2821 3.6).

Requirement number four covers a now obsolete form that is arguably permissive. Agents issuing new addresses could legitimately disallow it; however, an existing address that uses this form remains a valid address.

The standard assumes a seven-bit character encoding, not multibyte characters. Consequently, according to RFC 2234, “alphabetic” corresponds to the Latin alphabet character ranges a–z and A–Z. Likewise, “numeric” refers to the digits 0–9. The lovely international standard Unicode alphabets are not accommodated—not even encoded as UTF-8. ASCII still rules here.
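As a rough illustration of how requirements 6 through 9 and the ASCII-only rule translate into PHP, here is a minimal sketch. The function name domainLooksValid is hypothetical and is not one of the article's listings. It follows requirement 7 exactly as written, so a label that starts with a digit is rejected.

```php
<?php
// Hypothetical helper (not from the article): checks requirements 6-9
// against a domain string, using ASCII-only character classes.
function domainLooksValid(string $domain): bool
{
    // Requirement 9: the whole domain is at most 255 characters.
    $len = strlen($domain);
    if ($len < 1 || $len > 255) {
        return false;
    }
    // Requirement 6: labels are separated by dots.
    foreach (explode('.', $domain) as $label) {
        // Requirement 8: a label is 1 to 63 characters long.
        if ($label === '' || strlen($label) > 63) {
            return false;
        }
        // Requirement 7: a letter first, then letters, digits or hyphens,
        // ending with a letter or digit. \z anchors at the true end of string.
        if (!preg_match('/^[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?\z/', $label)) {
            return false;
        }
    }
    return true;
}
```

Because the character classes list only a-z, A-Z and 0-9, any byte outside seven-bit ASCII fails the label test.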

Developing a Better E-mail Validator

That's a lot of requirements! Most of them refer to the local part and domain. It makes sense, then, to start with splitting the e-mail address around the at sign separator. Requirements 2–5 apply to the local part, and 6–10 apply to the domain.

The at sign can be escaped in the local part. Examples are Abc\@def@example.com and "Abc@def"@example.com. This means an explode on the at sign, $split = explode("@", $email);, or a similar trick to separate the local and domain parts will not always work. We could try removing escaped at signs, $cleanat = str_replace("\\@", "", $email);, but that will miss pathological cases, such as Abc\\@example.com. Fortunately, escaped at signs are not allowed in the domain part, so the last occurrence of the at sign must be the separator. The way to separate the local and domain parts, then, is to use the strrpos function to find the last at sign in the e-mail string.

Listing 3 provides a better method for splitting the local part and domain of an e-mail address. The strrpos function returns boolean false if the at sign does not occur in the e-mail string.
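The split on the last at sign can be sketched as follows. This is an illustration rather than the article's own listing, and the function name splitAddress is hypothetical.

```php
<?php
// Hypothetical helper: returns [local, domain], or null when the
// string contains no at sign.
function splitAddress(string $email): ?array
{
    // strrpos finds the LAST at sign, which must be the separator.
    $atIndex = strrpos($email, '@');
    // Compare with === because a valid index of 0 is also "falsy".
    if ($atIndex === false) {
        return null;
    }
    $local  = substr($email, 0, $atIndex);
    $domain = substr($email, $atIndex + 1);
    return [$local, $domain];
}
```

With the address "Abc@def"@example.com, explode("@", $email) yields three pieces, while this sketch yields the local part "Abc@def" (quotes included) and the domain example.com.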

Let's start with the easy stuff. Checking the lengths of the local part and domain is simple. If those tests fail, there's no need to do the more complicated tests. Listing 4 shows the code for making the length tests.
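A minimal sketch of the length tests, assuming the local part and domain have already been separated, might look like this. The function name lengthsAreValid is hypothetical; the limits come from requirements 5 and 9.

```php
<?php
// Hypothetical helper: applies the length limits from requirements 5 and 9.
// strlen counts bytes, which matches the seven-bit encoding the standard assumes.
function lengthsAreValid(string $local, string $domain): bool
{
    $localLen  = strlen($local);
    $domainLen = strlen($domain);
    // Requirement 5: the local part is 1 to 64 characters long.
    if ($localLen < 1 || $localLen > 64) {
        return false;
    }
    // Requirement 9: the domain is 1 to 255 characters long.
    if ($domainLen < 1 || $domainLen > 255) {
        return false;
    }
    return true;
}
```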

Now, the local part has one of two forms. It may have a begin and end quote with no unescaped embedded quotes. The local part "Doug \"Ace\" L." is an example. The second form for the local part is (a+(\.a+)*), where a stands for any one of a whole slew of allowable characters. The second form is more common than the first, so check for that first. Look for the quoted form after failing the unquoted form.

Characters quoted using the back slash (\@) pose a problem. This form allows doubling the back-slash character to get a back-slash character in the interpreted result (\\). This means we need to check for an odd number of back-slash characters quoting a non-back-slash character. We need to allow \\\\\@ and reject \\\\@.

It is possible to write a regular expression that finds an odd number of back slashes before a non-back-slash character. It is possible, but not pretty. The appeal is further reduced by the fact that the back-slash character is an escape character in PHP strings and an escape character in regular expressions. We need to write four back-slash characters in the PHP string representing the regular expression to show the regular expression interpreter a single back slash.

A more appealing solution is simply to strip all pairs of back-slash characters from the test string before checking it with the regular expression. The str_replace function fits the bill. Listing 5 shows a test for the content of the local part.
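The strip-then-match approach can be sketched as below. This is not the article's Listing 5: the function name localPartLooksValid and the exact patterns are assumptions made for illustration. The sketch accepts quoted pairs in the unquoted form, following requirement 4.

```php
<?php
// Hypothetical helper: tests the content of a local part.
function localPartLooksValid(string $local): bool
{
    // Remove pairs of back slashes, so any back slash that remains
    // is one that quotes the character after it.
    $stripped = str_replace('\\\\', '', $local);

    // One "a" from the form a+(\.a+)*: a quoted pair or an allowed character.
    $atom = '(?:\\\\.|[A-Za-z0-9!#$%&\'*+\/=?^_`{|}~-])';

    // Unquoted form: runs of atoms joined by single dots.
    if (preg_match('/^' . $atom . '+(?:\.' . $atom . '+)*\z/', $stripped)) {
        return true;
    }
    // Quoted form: begin and end quote, no unescaped embedded quotes.
    return preg_match('/^"(?:\\\\.|[^"\\\\])+"\z/', $stripped) === 1;
}
```

After stripping, five back slashes followed by an at sign reduce to \@, which is a quoted pair and passes. Four back slashes followed by an at sign reduce to a bare @, which fails both forms.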

______________________

Comments

Great article, just a slight fix

David's picture

This is terrific.

There is some sort of typo in the part of the code in Listing 9 where you check the A and MX DNS records, which makes it break as written.

Changing:
if ($isValid && !(checkdnsrr($domain,"MX") ||
↪checkdnsrr($domain,"A")))

To:
if ($isValid && !((checkdnsrr($domain,"MX")) ||
(checkdnsrr($domain,"A"))))

seems to make it work.

your fix works for me too

cruzanmo's picture

Thanks for the awesome script!

I ran into the same error with that line, and your fix made it work for me too!

Your format validation code

Anonymous's picture

Your format validation code will inappropriately permit an all numeric TLD.
“There is an additional rule that essentially requires that top-level domain names not be all-numeric.” - RFC 3696, section 2

http://SimonSlick.com/VEAF/ValidateEmailAddressFormat.html

Sure, your DNS lookup would fail, but what is the point of validating the format if you are just going to do a DNS lookup anyway for a domain name that should have already been deemed invalid by the format validation code?

Format validation and existence verification (DNS lookup) serve different purposes, and just because a domain name does not exist does not mean the format is not valid.

There are so many holes in your code, whoever paid you for this write-up is highly deserving of a total refund. If you are going to title such an article as "... the Right Way", you could at least do it the Right Way.

The code at http://SimonSlick.com/VEAF/ValidateEmailAddressFormat.html is actually better, and even includes code for verifying actual existence of an eMailbox.

simonslick.com/veaf is busted

Anonymous's picture

The code at simonslick.com is wrong -- it does not seem to match the RFC at all. Just try the examples given in this article as well as more common cases like:

foo+bar@example.com
foo%bar@example.com
foo <bar@example.com>
(foo) bar@example.com

Working Code & Extensive Regular Expressions

Anonymous's picture

Working Code with Extensive use of Regular Expressions for validating email address format.

Check it out and see if you can find any faults.

http://SimonSlick.com/VEAF/ValidateEmailAddressFormat.html

Email address validation head-to-head

Dominic Sayers's picture

Yes, there are some faults with the Simon Slick code. It's also worth pointing out that both Simon Slick and Doug Lovell's code is copyright All Rights Reserved. You can't use it in your project.

I've written about some public-domain validation functions here: http://www.dominicsayers.com/isemail/

The Simon Slick code fails on some of the examples in RFC3696.

As some of these comments have pointed out, there are a lot of RFCs that cover this ground. For what it's worth, I believe my function complies with RFCs 1123, 2396, 3696, 4291, 4343, 5321 & 5322.

RFC Compliance

NOYB's picture

Backslash is not an RFC compliant component of a non-quoted email address local-part. It may have been in the past, but not anymore, and has not been since the publication of RFC 2822 (2001). Move on folks.

This is also reinforced by RFC 3696 (2004).
http://tools.ietf.org/html/rfc3696

3. Restrictions on email addresses

Without quotes, local-parts may consist of any combination of
alphabetic characters, digits, or any of the special characters

! # $ % & ' * + - / = ? ^ _ ` . { | } ~

period (".") may also appear, but may not be used to start or end the
local part, nor may two or more consecutive periods appear. Stated
differently, any ASCII graphic (printing) character other than the
at-sign ("@"), backslash, double quote, comma, or square brackets may
appear without quoting. If any of that list of excluded characters
are to appear, they must be quoted.

Also see the RFC3696 Errata
http://www.rfc-editor.org/cgi-bin/errataSearch.pl?rfc=3696

These are not RFC compliant:
Fred\ Bloggs@example.com
Joe.\\Blow@example.com

And should have read as:
"Fred\ Bloggs"@example.com
"Joe.\\Blow"@example.com

Also, "the upper limit on address lengths (local-part@domain-part) should normally be considered to be 256."

And as someone already alluded to, the domain name is now, for quite some time I might add, allowed to begin with a digit.

You need to update your code, test data and this article.

RFC Compliance

NOYB's picture

Also, the quoted string check appears to allow null (x00). According to RFC 2822 3.2.5. Quoted strings and 3.2.1. Primitive Tokens, the permitted NO-WS-CTL characters are x01-x08, x0B, x0C, x0E-x1F, x7F. This does not include the null character x00.

RFC 2822
4.1. Miscellaneous obsolete tokens
The obs-char and obs-qp elements each add ASCII value 0.

Appendix B. Differences from earlier standards
Items marked with an asterisk (*) below are items which
appear in section 4 of this document and therefore can no longer be
generated.
12. ASCII 0 (null) removed.*

Challenge

Tom Burt's picture

So should the e-mail address someone@3com.com be accepted or not? It fails to satisfy requirement #7 above but I guess the code in Listing 9 would accept it.
Tom

It is working good

coderbari's picture

I tested it with many options and it's working fine :D
