Regular expressions for street addresses

Glossary

regex    (noun) \ˈɹɛɡˌɛks\

"Regex" or "regexp" is short for regular expression, a special string that identifies patterns in text. The ability to take any amount of text, look for certain patterns, and manipulate or extract the text in certain regions has been of great value, especially in the human genome project.1

Sample

A regex which attempts to match an email address looks like this:2

\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

It looks like gibberish to human eyes, but a regex engine reads it and knows exactly what kind of pattern to look for in a body of text.
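
As a minimal sketch of how the sample pattern might be used, the Python snippet below runs it over a short string. The sample text and the variable names are made up for illustration. Note one assumption: the pattern lists only uppercase letters, so it is compiled case-insensitively here; without that flag it would miss lowercase addresses.

```python
import re

# The pattern from the sample above. Its character classes contain only
# uppercase letters, so we compile it with IGNORECASE (an assumption about
# how the pattern is meant to be used).
EMAIL_RE = re.compile(
    r"\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
    re.IGNORECASE,
)

# Illustrative input; the addresses are placeholders.
text = "Write to jane.doe@example.com or SALES@EXAMPLE.ORG before Friday."

matches = EMAIL_RE.findall(text)
print(matches)  # ['jane.doe@example.com', 'SALES@EXAMPLE.ORG']
```

The engine finds the shape of an email address, but it has no idea whether either mailbox actually exists. The same limitation shows up with street addresses.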

Regular expressions and street addresses

Here are some common ways developers try to use regular expressions on street addresses, and why each one falls short.

Componentizing

A common task related to street addresses is called "componentizing," or breaking the address into components. Given a line that looks like the following, the goal is to determine what the primary (house) number, street, city, state, and ZIP are.

123 Main St. Louisville, OH 43071

Essentially, this string needs to be parsed. There are no guarantees that it comes in this form. It may or may not have punctuation or line breaks, and who knows about capitalization. This is a common need, however, which has been discussed time and time again. Many developers rely wholly on the graces of a regular expression to save them from this potentially bottomless parsing pit of despair. We propose that regular expressions are not the answer.

This simple example makes the problem apparent. Regular expressions look for patterns in text; they have no way of knowing what each part of a string means. A regex cannot know whether "Louisville" or "St. Louisville" is the name of the city (Ohio has both), and the answer changes the street name: is it "Main St." or just "Main," which could just as well be "Main Avenue"?
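
To make the ambiguity concrete, here is a small Python sketch with two hypothetical patterns, one for each reading of the example line. Neither pattern comes from a real address library; they are written only to show that both readings match equally well.

```python
import re

ADDRESS = "123 Main St. Louisville, OH 43071"

# Reading A (hypothetical pattern): the street keeps its suffix ("Main St.")
# and the city is "Louisville".
STREET_WITH_SUFFIX = re.compile(
    r"^(?P<number>\d+) (?P<street>.+ St\.) (?P<city>.+), "
    r"(?P<state>[A-Z]{2}) (?P<zip>\d{5})$"
)

# Reading B (hypothetical pattern): the street is one word ("Main")
# and the city is "St. Louisville".
ONE_WORD_STREET = re.compile(
    r"^(?P<number>\d+) (?P<street>\S+) (?P<city>.+), "
    r"(?P<state>[A-Z]{2}) (?P<zip>\d{5})$"
)

reading_a = STREET_WITH_SUFFIX.match(ADDRESS).groupdict()
reading_b = ONE_WORD_STREET.match(ADDRESS).groupdict()

print(reading_a["street"], "|", reading_a["city"])  # Main St. | Louisville
print(reading_b["street"], "|", reading_b["city"])  # Main | St. Louisville
```

Both patterns match the whole line, and nothing in the text itself says which result is right. Choosing between them requires knowing which streets and cities actually exist.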

Regular expressions do not parse street addresses effectively.

Extracting addresses

Extracting street addresses from text is another common task. This is useful for applications that highlight addresses in a document body, or use the address to put the content into context with some location or place.

Regardless of the use case, we've found regular expressions to fall short in this task. Let's look at an example:

Let's meet tomorrow at Independence Court for lunch at 4; Sloan St. and 315 Freedom Ct. I'll bring peanut butter, and you bring the jelly.

That's a casual message from one PB&J lover to another. Again, we have no guarantee of punctuation, capitalization, or use of abbreviations. Let's look at the text without most of that:

lets meet tomorrow at independence court for lunch at 4 sloan st and 315 freedom ct ill bring peanut butter and you bring the jelly

We strip the punctuation because it is not preferred in street addresses.3 Even a human would have a hard time deciding whether "4 sloan st" is an address in its own right or a time followed by a street name. Similarly, is "independence court" a street or a plaza/business name? Is "315 freedom ct" an address or a time and a business (like "3:15 @ Freedom Ct")?
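
The sketch below runs a deliberately naive, hypothetical extraction pattern (a number, one word, then a street suffix) over the stripped-down message. It is not a recommended pattern; it only illustrates what pattern matching can and cannot tell us here.

```python
import re

MESSAGE = (
    "lets meet tomorrow at independence court for lunch at 4 sloan st "
    "and 315 freedom ct ill bring peanut butter and you bring the jelly"
)

# Hypothetical, deliberately naive pattern:
# a number, a single word, and one of a few street suffixes.
NAIVE_ADDRESS = re.compile(r"\b\d+ \w+ (?:st|ct|court|ave|rd)\b")

candidates = NAIVE_ADDRESS.findall(MESSAGE)
print(candidates)  # ['4 sloan st', '315 freedom ct']
```

The pattern reports "4 sloan st" as an address even though the 4 may be a time, treats "315 freedom ct" the same way, and skips "independence court" entirely because it has no number. The regex sees shapes, not meaning.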

Again, there is too much ambiguity to do this with regular expressions.

Standardizing

Addresses come in many forms and styles from different sources. Often they include unexpected data such as secondary (apartment or suite) numbers. Some are rural routes or post office boxes, and some are military addresses, which look very different from regular street addresses.

Developers who attempt to standardize (normalize) address data with a regex will probably find it a painful task. (As the saying goes, when you try to solve a problem with a regex, you now have two problems.)
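
As one illustration of that pain, consider a hypothetical one-line rule that rewrites the suffix "St." as "ST". The function name and the rule are assumptions made up for this sketch, not part of any real standardization tool.

```python
import re


def naive_standardize(address: str) -> str:
    """Hypothetical rule: rewrite every standalone "St" or "St." as "ST"."""
    return re.sub(r"\bSt\b\.?", "ST", address)


# The suffix is rewritten as intended...
print(naive_standardize("45 Elm St. Louisville, OH 43071"))
# 45 Elm ST Louisville, OH 43071

# ...but when the city is "St. Louisville", the city name is rewritten too.
print(naive_standardize("45 Elm St. St. Louisville, OH 43071"))
# 45 Elm ST ST Louisville, OH 43071
```

The substitution cannot tell a street suffix from the "St." in a city name, so every new rule tends to need another exception, and the exceptions need exceptions of their own.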

The Solution

While extracting addresses from blocks of text is still a difficult task (modern parsers use complex NLP techniques), there are much easier ways to standardize, verify, and correct addresses. The best way to do this is with LiveAddress because it is affordable, easy, and efficient. It doesn't rely on pattern matching; instead, it employs our custom, systematic algorithm for parsing and interpreting address data. Because every address is checked against the official USPS file, results are guaranteed to be correct and valid.

Sometimes your address data isn't split into components, and the whole address is on one line. No problem! LiveAddress is capable of what we call freeform address processing, which turns your single-line addresses into fully validated, componentized, and standardized addresses. Just submit the entire address in the street field of the API.

  1. Jing-Jing Li, "Characterizing human gene splice sites using evolved regular expressions," Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05), 31 July-4 Aug. 2005
  2. Jan Goyvaerts, Regular-Expressions.info <http://www.regular-expressions.info/email.html>
  3. Publication 28 Section 222, USPS <http://pe.usps.gov/text/pub28/28c2_007.htm>