Last Thursday morning, October 14, we experienced a significant address verification outage lasting approximately five hours. The outage was incredibly frustrating for our customers on several levels. The primary service that was affected was our LiveAddress API address verification web service. This post is our attempt to detail exactly what happened and what we plan on doing about it.
Our primary billing/accounting server suffered a failure due to the automatic installation of scheduled operating system updates. Specifically, the server automatically installed the updates and rebooted as part of the update process. This process typically takes about two minutes, but this time the server failed to reboot properly. It hung during the shutdown sequence after the updates had been applied.
The main problem was how we were monitoring the status of the machine. According to the statistics gathered, the machine was online, functioning, and responding to pings. But in reality, the machine was hung in its shutdown sequence. After a few reboots, we managed to get things back into an operational state, and things are currently operating normally.
Lack of Communication
One of the most infuriating aspects of this outage from a customer's perspective was communication, in both directions. Once we were aware of the outage, it wasn't long before we were fully operational again. Unfortunately, this meant that all of our resources were dedicated to bringing everything back into an operational state rather than communicating to our customers what was happening. That was bad. The other aspect of this that was even more upsetting to customers was their inability to contact us in a time-critical manner. That was even worse. Opening the channels of communication alone could have cut the length of the outage dramatically. We pride ourselves on responding quickly to customer needs and being available to answer their questions. But this experience has clearly demonstrated to us that the way things currently operate with regard to communication is not even close to sufficient. In fact, it's just plain short-sighted. So we're making changes. Big ones.
Technical Solutions (The Dirty Details)
The way our LiveAddress API was designed, when a verification request is received, it makes a synchronous call to our billing mechanism to determine whether you have an active subscription and whether your account has sufficient queries available to fulfill the request. The architecture behind the LiveAddress system is a typical web services architecture. This type of service is advocated as a best practice by the major software vendors, including Microsoft, Sun, and Oracle. Yet it was this very architecture that led directly to the outage. Recently there has been significant movement in the programming community toward better architectures, such as those heavily utilized by companies like Amazon, Google, Facebook, eBay, and Twitter. By following this newer architectural guidance (breaking things apart slightly and moving toward the AP side of Eric Brewer's CAP theorem, rather than the CA side), we gain tremendous advantages for ourselves and our customers.
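To make the coupling concrete, here is a minimal sketch (with hypothetical function names, not our actual code) of why a synchronous billing check in the request path turns a billing outage into a verification outage:

```python
# Hypothetical sketch of the original synchronous design: every
# verification request blocks on a remote billing check, so a
# billing failure becomes an address-verification failure.

def check_subscription_remote(account_id):
    """Stand-in for a network call to the billing server.

    In a real system this would be an RPC/HTTP call; here we
    simulate a billing outage by raising an error.
    """
    raise ConnectionError("billing server unreachable")

def verify_address(account_id, address):
    # Synchronous call in the request path: if billing is down,
    # this raises before any verification work happens.
    check_subscription_remote(account_id)
    return {"address": address, "status": "verified"}

try:
    verify_address("acct-1", "1 Main St")
    result = "request succeeded"
except ConnectionError:
    result = "request failed"
```

The verification logic itself is perfectly healthy here; the request fails anyway because of the upstream dependency.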
Over the weekend and during the early part of this week, we worked on splitting the LiveAddress codebase apart in order to introduce what is known in computing terms as eventual consistency into the billing/accounting process associated with each LiveAddress verification lookup. Specifically, we will now perform each lookup against a slightly out-of-date copy of the subscription data in a local store. This means our accounting system, whose database is located on a separate machine, may be a few hundred milliseconds behind (or perhaps even a few hours behind in the case of a significant billing/accounting outage), while the LiveAddress system remains completely unaffected, simply reading an older copy of the subscription information. In a worst-case scenario you, our customers, may receive more queries than your subscription originally purchased, all at no extra charge. That's completely okay, because the LiveAddress system will remain online servicing requests, and it will cost us far less in terms of customer aggravation and frustration.
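The idea can be sketched as follows. This is an illustrative model, not our production code: a background sync copies subscription data to a node-local store, and lookups read only that local copy, so a billing outage simply means the copy grows stale rather than requests failing:

```python
import time

# Hypothetical sketch of the eventually consistent design: each
# verification node keeps a local copy of subscription data that a
# background job refreshes. Lookups never touch the billing server.

local_subscriptions = {}   # node-local store (possibly slightly stale)
last_sync = 0.0

def sync_from_billing(snapshot):
    """Run periodically by a background job, not in the request path.

    If the billing server is down, this simply doesn't run, and the
    node keeps serving from its last good copy.
    """
    global last_sync
    local_subscriptions.update(snapshot)
    last_sync = time.time()

def verify_address(account_id, address):
    # Read-only lookup against local data; no remote dependency.
    sub = local_subscriptions.get(account_id)
    if sub is None or sub["queries_remaining"] <= 0:
        return {"status": "rejected"}
    # Decrement locally; reconciliation with billing happens later,
    # which is why a customer might briefly exceed their quota.
    sub["queries_remaining"] -= 1
    return {"address": address, "status": "verified"}

sync_from_billing({"acct-1": {"queries_remaining": 2}})
r1 = verify_address("acct-1", "1 Main St")
```

The trade-off is exactly the one described above: the local count can drift from the billing system's count, and we absorb that drift rather than passing an outage on to you.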
This new architecture yields a number of positive consequences. First, each node is now entirely self-contained, requiring no outside information beyond its local database to service each request. Another positive consequence is that we can now more fully geo-distribute our system. In fact, we are already moving quickly toward new instances of the software on servers located in Washington DC and Seattle, in addition to extra capacity at our facility in Dallas.
We are also changing how we monitor our system. Instead of basic hardware and software monitoring, we're moving to have the health of the application itself monitored in a much more thorough manner. We are also looking at third-party providers who offer out-of-band health/status pages that would remain available and reachable for users in the case of outages on our part.
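The distinction matters because of exactly what happened on Thursday: the host answered pings while the application was hung. A rough sketch of what application-level monitoring means (hypothetical names, not any particular monitoring product):

```python
# Hypothetical sketch of application-level monitoring: rather than
# pinging the host (which succeeded even while the machine was hung),
# the monitor exercises the real service path and checks the answer.

def probe_service(do_lookup):
    """Run a known lookup through the full code path.

    `do_lookup` stands in for a complete round-trip request; the
    probe only reports "up" if the service returns a correct answer.
    """
    try:
        result = do_lookup("1600 Pennsylvania Ave")
    except Exception:
        return "down"
    return "up" if result.get("status") == "verified" else "degraded"

# A host can respond to pings while the application is broken;
# a ping-based check would report "up" here, the probe does not.
status = probe_service(lambda addr: {"status": "error"})
```

A ping tells you the kernel's network stack is alive; a probe like this tells you the service can actually do its job.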
By far, the biggest takeaway in all of this is how we communicate with our customers and how our customers communicate with us. We have determined that you need multiple ways to reach us that work for you.
As a result, we are opening our communication channels more fully. Here's how you can get in touch with us:
By introducing these technical changes and providing open channels of communication, we hope to never again experience down-time to the extent that we experienced on Thursday. We know that many people and companies rely on us to keep our systems operational, and we are committed to keeping them up and running. This is why we have no long-term contracts that keep you from going elsewhere. We know that if we have the best service at the best price, the customers will come. If you're ever unsatisfied, we invite you to let us know, but we won't force you to stay if you'd like to go elsewhere.
At the same time, there are methods of architecting your external service calls such that down-time on the part of the service provider (us or others) does not affect the normal workflow of your application. If you are interested in these methods of external interaction, please contact us and we'll be happy to help you understand what they are and how they can benefit your application relative to all third-party services, not just ours.
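As a taste of what we mean, here is a minimal sketch of one such pattern (hypothetical names, and a simplified stand-in for a real retry queue): wrap the external call so that a provider outage degrades gracefully instead of blocking your workflow:

```python
# Hypothetical sketch of a graceful-degradation wrapper: if the
# external provider fails, queue the work for later and return a
# fallback so the caller's workflow continues uninterrupted.

def with_fallback(call, fallback):
    """Wrap `call`; on any failure, record the payload for a later
    retry (out of band) and return `fallback` immediately."""
    pending = []

    def wrapped(payload):
        try:
            return call(payload)
        except Exception:
            pending.append(payload)  # retry later, out of band
            return fallback

    wrapped.pending = pending        # expose the retry queue
    return wrapped

def flaky_verify(address):
    # Stand-in for an external verification call during an outage.
    raise TimeoutError("provider down")

verify = with_fallback(flaky_verify, {"status": "deferred"})
result = verify("1 Main St")
```

Real implementations add timeouts, retry schedules, and durable queues, but the principle is the same: your application should not stop because ours (or anyone's) did.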
We thank you for your patience with us, and we hope to continue having the privilege of earning and keeping your business in the future. Don't hesitate to let us know about your questions and concerns. We'll be here.
-- Jonathan Oliver, CTO, Qualified Address