regex (noun) \ˈɹɛɡˌɛks\
"Regex" or "regexp" is short for regular expression, a special sequence of characters that forms a search pattern to identify patterns in text. The ability to take any amount of text, look for certain patterns, and manipulate or extract the text in certain regions is of great value in scientific and software applications.
Should you use regular expressions to parse street addresses?
While regular expressions are useful in carefully controlled environments where the input is predictable and the language is regular, they have theoretical and practical limitations that make them unsuitable for parsing or correcting street addresses.
The format for standardized street addresses as defined by USPS Publication 28 is not known to be a regular language (no regular grammar is known to express it), so a regular expression cannot be relied on to parse it successfully.
You can attempt to match a small subset of valid street addresses using regular expressions, but that assumes the user's input arrives in a standard format that can be expressed with a regular grammar. In our experience, regular expressions have been useful, but not sufficient, in parsing street addresses.
A common task related to street addresses is componentizing—breaking the address into components. Given a line that looks like the following, the goal is to determine what the primary (house) number, street, city, state, and ZIP Code are:
123 Main St. Louisville, OH 43071
Essentially, this string needs to be parsed. There are no guarantees that it comes in this form. It may or may not have punctuation or line breaks, and who knows about capitalization. This is a common need, however, which has been discussed time and time again. Many developers rely wholly on the graces of a regular expression to save them from this potentially bottomless parsing pit of despair. Unless all your addresses look like this, regular expressions are not the answer.
In this simple example, the problem is obvious: regular expressions look for patterns in text, but they have no way of knowing what each part of a string means. From the text alone, there's no way to know whether the city is "Louisville" or "St. Louisville" (Ohio has both). The answer also changes the street field: is it "Main St.", or just "Main", which could just as well be "Main Avenue"?
This is just one example, but it quickly becomes clear that regular expressions cannot parse street addresses effectively.
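The Louisville ambiguity can be reproduced in a few lines of Python. This is a minimal sketch, not a recommended parser: both patterns and the `componentize` helper are hypothetical, written only to show that two equally plausible regular expressions split the same line differently.

```python
import re

ADDRESS = "123 Main St. Louisville, OH 43071"

# Hypothetical pattern A: assumes every street name ends in a suffix.
SUFFIX_ENDS_STREET = re.compile(
    r"^(?P<number>\d+)\s+(?P<street>.+?\b(?:St|Ave|Rd)\.?)\s+"
    r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$"
)

# Hypothetical pattern B: allows the city name to begin with "St."
SAINT_STARTS_CITY = re.compile(
    r"^(?P<number>\d+)\s+(?P<street>.+?)\s+"
    r"(?P<city>(?:St\.?\s+)?[A-Z][a-z]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$"
)

def componentize(pattern, line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = pattern.match(line)
    return match.groupdict() if match else None

parse_a = componentize(SUFFIX_ENDS_STREET, ADDRESS)
parse_b = componentize(SAINT_STARTS_CITY, ADDRESS)

print(parse_a["street"], "|", parse_a["city"])  # Main St. | Louisville
print(parse_b["street"], "|", parse_b["city"])  # Main | St. Louisville
```

Both patterns match, and nothing in the text says which result is right. Choosing between them requires knowing which cities and streets actually exist, which is information a pattern does not have.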
Extracting street addresses from text is another common task. This is useful for applications that highlight addresses in a document body or use the address to put the content into context with some location or place.
For most use cases, we've found that regular expressions fall short in this task. Let's look at an example:
Let's meet tomorrow at Independence Court for lunch at 4; Sloan St. and 315 Freedom Ct. I'll bring peanut butter, and you bring the jelly.
Again, we have no guarantee of punctuation, capitalization, or use of abbreviations, but if we trust the user's input, we can at least use those cues to infer some things about an address. Unfortunately, if the punctuation and spacing are used improperly, they can deceive our parsers. If we eliminate the punctuation to avoid that pitfall, we end up with:
lets meet tomorrow at independence court for lunch at 4 sloan st and 315 freedom ct ill bring peanut butter and you bring the jelly
This is still difficult to interpret. Even a human reader would have a hard time determining whether "4 sloan st" is an address or a time followed by a street name. Similarly, is "independence court" a street or a plaza/business name? Is "315 freedom ct" an address or a time and a business (like "3:15 @ Freedom Ct")?
So again, there are too many ambiguities to leave street addresses in the hands of regular expressions.
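To make this concrete, here is a rough sketch of what a pattern-based extractor does with the stripped-down sentence. The `NAIVE_ADDRESS` pattern and the `find_candidates` helper are hypothetical and deliberately simple: a number, one or two lowercase words, then a street suffix.

```python
import re

TEXT = (
    "lets meet tomorrow at independence court for lunch at 4 sloan st "
    "and 315 freedom ct ill bring peanut butter and you bring the jelly"
)

# Hypothetical, deliberately naive pattern: number + 1-2 words + suffix.
NAIVE_ADDRESS = re.compile(r"\b\d+\s+(?:[a-z]+\s+){1,2}?(?:st|ct|court|ave)\b")

def find_candidates(text):
    """Return every substring that looks like an address to the pattern."""
    return [m.group(0) for m in NAIVE_ADDRESS.finditer(text)]

print(find_candidates(TEXT))  # ['4 sloan st', '315 freedom ct']
```

The pattern reports "4 sloan st" as confidently as "315 freedom ct", even though the 4 may be a lunch time, and it skips "independence court" entirely because no number precedes it. Loosening the pattern to catch the latter would also match ordinary phrases that happen to end in "court" or "st".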
Addresses come in many forms and various styles from different sources. Often, they include unexpected data such as secondary numbers, or the address is a rural route, post office box, or military address (military addresses in particular look very different from regular street addresses). Further, a single address line can sometimes contain two addresses, which is called a dual address.
Trust us when we say that developers who try to standardize or normalize address data with a regex will find it more painful than it's worth. Plus, hardcoding the necessary abbreviations, keywords, and misspellings will make your regular expression thousands of bytes long, which makes it slow to compile and execute and very hard to maintain.
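The growth is easy to see with even a tiny sample. The sketch below assumes a hand-picked table of three suffixes and a few of their common variants (`SUFFIX_VARIANTS` is illustrative, not a complete or authoritative list); `build_suffix_pattern` and `normalize_suffix` are hypothetical helpers.

```python
import re

# Illustrative sample only: a real table would cover far more suffixes.
SUFFIX_VARIANTS = {
    "STREET": ["street", "strt", "str", "st"],
    "AVENUE": ["avenue", "avenu", "avnue", "aven", "ave", "avn", "av"],
    "COURT": ["court", "ct"],
}

def build_suffix_pattern(variants):
    """Join every known spelling into one alternation, longest first."""
    words = sorted(
        {w for forms in variants.values() for w in forms},
        key=lambda w: (-len(w), w),
    )
    body = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + body + r")\b\.?", re.IGNORECASE)

def normalize_suffix(word, variants=SUFFIX_VARIANTS):
    """Map one spelling to its standard name, or None if it is unknown."""
    key = word.rstrip(".").lower()
    for standard, forms in variants.items():
        if key in forms:
            return standard
    return None

SUFFIX_PATTERN = build_suffix_pattern(SUFFIX_VARIANTS)
print(len(SUFFIX_PATTERN.pattern))  # pattern length for just three suffixes
print(normalize_suffix("Strt."))
```

Three suffixes already need thirteen alternatives. Every additional suffix, directional, unit designator, and common misspelling adds more, and the table still cannot say whether a given "st" means "street" or "saint".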
A Better Way
While extracting addresses from blocks of text is still a difficult task (modern parsers use complex NLP techniques), there are much easier ways to standardize, verify, and correct addresses. We may be biased, but the best way to do this is with SmartyStreets because it is affordable, easy, and efficient. It doesn't rely on pattern matching, but rather employs our custom, systematic algorithm for parsing and interpreting address data. Checked against the USPS official file, verified addresses are guaranteed to be correct and valid.
Sometimes your address data isn't spliced into components, and the whole address is on one line. Our API is capable of what we call freeform address processing to turn your single-line addresses into fully validated, componentized, and standardized addresses. Just submit the entire address into the street field of the API.
Motivation for This Article
We get a lot of questions from programmers about parsing addresses. We see a lot of people trying to use regular expressions for street addresses, and as the address user experience experts, we cringe whenever another programmer falls prey to this trap. We hope that this information will save you some trouble, and if your searching is in vain, please feel free to ask us any questions you have about addresses.