What are Service Level Agreements?

The "Fast Lane" Answer

An SLA (Service Level Agreement) is a part of a standardized service contract in which specific aspects of a service are defined by a service provider. Often, it specifies which things are the company's fault, which things are not the company's fault, and what kind of "Whoops—our bad" compensation you're guaranteed if the company doesn't meet their own standards.

If you've ever heard a business make a promise before, something like "Get your pizza in 20 minutes or less, or it's free", then you've encountered a basic example of a Service Level Agreement. SLAs are things you're likely to see amongst tech companies in general. As an address validation service, we provide one regarding our services (like our USPS APIs), our services' functionality, and what you can expect when you use them.

Our SLA here at Smarty essentially guarantees three things:

Sub-500 millisecond response times on requests we receive (internet latency not included)
At least 99.98% uptime in any given month
We promise to credit your account with free service if we ever fail to fulfill either promise.

So whether your SLA is for a pizza-delivery time, or for how fast you can validate an address in Istanbul, Service Level Agreements tell you what you can expect from a service provider.

The "Scenic Route" Answer

The Good News

"Hello, my name is Inigo Montoya…"

The first—perhaps most important—thing an SLA does is tell you what the company is going to do. It is a contract: the company gives you a written statement, telling you just what you can expect out of them. This is done by way of clearly defined "metrics," measurements of performance that the provider feels firmly confident they can consistently meet and frequently exceed.

Some of the most common metrics will be measurements of uptime and speed, as well as approximations of how frequently the provider expects downtime and how long it will take to restore service. Exactly what is covered, and the performance promised, will vary by provider, but the theme is pretty consistent: an SLA tells you what services you're getting and what level of quality you can expect from them.

Now if a company guarantees you awesomeness 98% of the time, that means 98% is what you can expect. They're making themselves accountable for anything at 97% or below. They're not holding themselves accountable for failing to hit 99% or 100% when they didn't promise it. Because no one can guarantee perfection 100% of the time, service providers have to draw the line somewhere.

The good news is, though, that what IS being guaranteed is usually exceptionally good. And, if what's being provided to you dips below that stellar-service guarantee, you have permission to give that service provider a piece of your mind, and expect a recompense of some kind.

Below is a discussion of three of the main metrics included in an SLA: speed (response time), downtime (MTBF and MTTR), and uptime.

Response Time

At Smarty, the first metric we give a guarantee on, and one you're likely to see listed in other companies' SLAs, is speed. In this particular case, we call it "response time." That means how long it takes from the time we receive your request to the time we send back an answer. This is a big one, since speed's pretty important to us over here. We've made it a priority of our service, and our guarantee reflects that.

Disclaimer: our speed guarantee does not account for your hardware or for network latency (internet lag). Our metric is not a measurement of time between when you hit "enter" to send the request to when your system receives the answer. It's a measurement of the time between the request arriving at our doorstep to when we put the response back in the pipeline. In short, it's how fast things happen on the end we can control.
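To make the difference concrete, here's a minimal sketch of the two measurements. The timestamps, the RequestTimeline class, and its field names are all made up for illustration; only the 500 millisecond ceiling comes from the guarantee described here.

```python
from dataclasses import dataclass

SLA_LIMIT_MS = 500  # guaranteed ceiling on response time, from the article


@dataclass
class RequestTimeline:
    """Hypothetical timestamps, in milliseconds, for a single request."""
    client_sent: float      # you hit "enter"
    server_received: float  # the request arrives at the provider's doorstep
    server_replied: float   # the provider puts the response back in the pipeline
    client_received: float  # your system receives the answer

    def response_time_ms(self) -> float:
        # The part the provider controls, and the only part the SLA covers.
        return self.server_replied - self.server_received

    def round_trip_ms(self) -> float:
        # What you experience, including network latency in both directions.
        return self.client_received - self.client_sent

    def meets_sla(self) -> bool:
        return self.response_time_ms() < SLA_LIMIT_MS


# Example: 120 ms of lag each way wrapped around 30 ms of actual processing.
timeline = RequestTimeline(
    client_sent=0.0,
    server_received=120.0,
    server_replied=150.0,
    client_received=270.0,
)
print(timeline.response_time_ms(), timeline.round_trip_ms(), timeline.meets_sla())
```

In this made-up case the round trip is nine times longer than the response time, and none of that extra time counts against the provider.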

Internet connection speeds are fickle and subject to multitudinous factors. If you've ever streamed a movie, or played online games, this variation in speed is all too apparent. Service providers like ourselves cannot be held accountable for the quality of your hardlines, your wireless signal, or your ISP connection to the internet at large. Likewise, we can't be blamed for issues that arise due to the limitations on your hardware or because of how you're using your machine.

Even with external factors slowing down the process, though, Smarty can get your address processed like we're making the Kessel Run. We guarantee sub-500 millisecond response times on requests, though we average closer to sub-30 millisecond speeds. Here's some perspective on those numbers, since they can be hard to conceptualize: an eyeblink lasts anywhere from 300 to 400 milliseconds (or about a third of a second). The fastest speed of human thought is 13 milliseconds, according to a new study from MIT (it's really cool; go look it up sometime).

So we can guarantee you a processing turnaround that's about the blink of an eye, though you may experience speeds as fast as the speed of human thought. That's right, we return your answers in less time than it took you to comprehend that last sentence.

MTBF and MTTR

Mean Time Between Failures (MTBF) is a metric used to average the time between expected failures in the system. It's a mathematical estimate that approximates how often the system will break down or something will go wrong. When you complain to your buddy that the car you bought for $500 isn't ever operational for longer than a 4-day stretch, you're offering a rudimentary MTBF.

Exact meanings of MTBF depend on the company offering the statistic, as definitions of "failure" vary from place to place. The two most common definitions of failure are A) the failure of the entire system or part of the system that renders the whole inoperable, and B) the failure of a part of the system, independent of the functionality of the whole. The CPU on your computer going up in smoke is an "A" failure, while the LED power light on the computer failing is a "B" failure since the rest of the machine still functions. Knowing the definition of failure is critical to understanding the MTBF stat, so don't be afraid to ask follow-up questions.

MTBF doesn't include scheduled maintenance and does not always account for the time it takes from initial failure to system recovery (read MTTR below). MTBF just gives a yardstick by which a given company's SLA can be compared against another; for instance, a company with an MTBF of three days is less impressive than a company that has an MTBF of two weeks. The longer your system can go without a cessation of service, the more reliable it is.

Our SLA on the Smarty website doesn't have a listed MTBF time. That's due, at least in part, to our fully-redundant system: we've backed up our backups. We have multiple, identical systems running in a number of diverse geographical locations across the country. A full system failure at Smarty would require something on the scale of an alien invasion, a robot uprising, or a nation-wide blackout.

The other half of this statistic, Mean Time To Recovery (MTTR), is approximated from the beginning of the failure to the end of repairs and the return to service. Unlike MTBF, you want this number to be short, to indicate that problems with the system won't hold you down for long. Think of it like getting sick: MTBF is how long you can go without needing to take a sick day, while MTTR is how many sick days you'll need to take before going back to work.

We also don't list an MTTR on our Smarty SLA, in part because the redundancies we use translate into an instant recovery. So much of the redundant system would have to break at once that experiencing a failure is almost an impossibility.

Uptime

Uptime is the opposite of downtime. Uptime is time spent functional, operational, and accessible. Downtime is time when the service is inaccessible, whether due to scheduled maintenance, failed hardware, or external problems like power outages and wandering monsters. Simply put, uptime is when you can use the service, and downtime is when you can't.

Uptime is usually represented as a percentage, as in "Our uptime is 99%," or, "99% of the time our system is up and running." As you probably already assumed, higher uptime percentage is better.

Now, you may have picked up on a bit of overlap between the "Mean Time" metrics and uptime. That's because they're two different methods of answering the same question: how often and how long will I be denied service? "Uptime" lumps everything into one number, whereas "Mean Time" metrics break it into pieces like "How long until the next outage?" and "How long will the outage last?"
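Here's a rough sketch of how the two views connect. It uses the common steady-state approximation, availability = MTBF / (MTBF + MTTR), which is a textbook formula rather than anything taken from our SLA. The outage history is invented for illustration, and since MTBF leaves out scheduled maintenance, so does this estimate.

```python
from statistics import mean


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, using MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Made-up history for a flaky system: how long it ran before each failure,
# and how long each repair took.
hours_between_failures = [90.0, 100.0, 98.0]
hours_to_recover = [20.0, 28.0, 24.0]

mtbf = mean(hours_between_failures)  # "How long until the next outage?"
mttr = mean(hours_to_recover)        # "How long will the outage last?"
uptime_percent = 100 * availability(mtbf, mttr)

print(f"MTBF {mtbf:.0f} h, MTTR {mttr:.0f} h, uptime about {uptime_percent:.0f}%")
```

Two separate averages go in, and one uptime figure comes out.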

Likewise, while "mean" by definition indicates an estimate or average, uptime percentage can be given as a guarantee: "We guarantee our service will be available to you at least 99% of the time." For reasons like this (in addition to reasons mentioned above), Smarty uses uptime instead of MTBF and MTTR—it allows us to guarantee service levels. Many tech-based and information-based services will do the same, listing uptime instead of "Mean Times," which are used more frequently with services where maintenance might actually involve using an actual wrench.

That said, you can't be sure which one (or both) you will be seeing on the SLA, so we thought a description of both methods was appropriate.

Another important detail: uptime doesn't measure performance; it measures reliability (technically, this goes for "Mean Times" as well). The service you're receiving could hypothetically be feeding you bad data or could be running slower than a herd of snails. So while uptime metrics are important, they're only part of the puzzle if you want to know how good the service really is.

Here at Smarty, we guarantee an uptime of 99.98%, even though our average is closer to 100%. (We gave ourselves a little wiggle room just in case our servers transform into robot alien racecars and drive away. We like to be prepared.)
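If percentages with that many nines are hard to picture, a little arithmetic turns them into minutes. The sketch below assumes a 30-day month and counts every minute of the month. Both are simplifications chosen for illustration, not the way our SLA defines the measurement window.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime a month can absorb before the guarantee is broken."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (100.0 - uptime_percent) / 100.0


def measured_uptime_percent(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Uptime percentage for a month with the given minutes of downtime."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


budget = allowed_downtime_minutes(99.98)
print(f"99.98% uptime leaves about {budget:.2f} minutes of downtime per 30-day month")
print(f"99% uptime leaves about {allowed_downtime_minutes(99.0) / 60:.1f} hours")
```

Under those assumptions, a 99.98% guarantee leaves under nine minutes of wiggle room in a month, while a 99% guarantee leaves several hours.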

And speaking of preparation...

Finding Your Inner Boy Scout

Let's face it, it pays to expect the unexpected. Whether you're fending off a killer robo-shark or just trying to keep your website up and running, you can never be too careful when considering potential problems that might arise in the course of your day.

That's why we, like Batman, are always crazy-prepared.

By that we mean that our redundant systems are ready for anything because they're so redundant. In short, we've got our system "geo-distributed between multiple data centers strategically located around the United States." We set this up to ensure that there's never a time that you can't access the service you need.

The key word here is "redundancy." The heart and soul of redundancy is having more than one of something. Copies, backups, replacements, etc. Your spare tire in your car? It's a redundancy. The spare key you keep under the mat? It's a redundancy. The extra roll of toilet paper in the bathroom? Mission critical redundancy. You have these things in case the first one fails, runs out, or goes missing. It's the same way with computers.

Redundancy is, at its core, a concept regarding system infrastructure that's used to prevent data loss. It does this primarily by spreading data out and duplicating it. There are various ways to make a system redundant, but potential setups all use similar tools: things like mirroring, striping, duplicates and backups. Regardless of the actual system design, the goal is reliability optimized for the organization in question.

The first step is called mirroring. Mirroring is simply copying—keeping multiple versions of the same data in different drives. That way, should one drive fail or become corrupted, the data is still available and intact. Only the simultaneous failure of every drive containing the applicable data will render the data unavailable.
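As a toy illustration of mirroring, here's a sketch that writes every value to several pretend drives and reads from whichever healthy one still has it. The Drive and MirroredStore classes are hypothetical, in-memory stand-ins, not a description of how our systems are actually built, and the sketch skips real-world chores like resynchronizing a drive that comes back online.

```python
class Drive:
    """Hypothetical in-memory stand-in for a physical drive."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self._data: dict[str, str] = {}

    def has(self, key: str) -> bool:
        return key in self._data

    def write(self, key: str, value: str) -> None:
        self._data[key] = value

    def read(self, key: str) -> str:
        return self._data[key]


class MirroredStore:
    """Keeps a full copy of every value on each healthy drive."""

    def __init__(self, drives: list[Drive]) -> None:
        self.drives = list(drives)

    def write(self, key: str, value: str) -> None:
        healthy = [drive for drive in self.drives if drive.healthy]
        if not healthy:
            raise OSError("no healthy drive available for writing")
        for drive in healthy:
            drive.write(key, value)

    def read(self, key: str) -> str:
        # Any single surviving copy is enough to serve the read.
        for drive in self.drives:
            if drive.healthy and drive.has(key):
                return drive.read(key)
        raise OSError(f"no healthy mirror holds {key!r}")


drives = [Drive("a"), Drive("b"), Drive("c")]
store = MirroredStore(drives)
store.write("address", "1 Main St")

drives[0].healthy = False  # one drive fails...
drives[1].healthy = False  # ...then another
print(store.read("address"))  # still served by the last healthy mirror
```

Only when the third drive also fails does the read finally give up.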

You can scale up, though, and duplicate things larger than drives. We don't just add backup drives; we add backup machines (the things that run the drives). That way, if a machine goes down and its data becomes unavailable (even if the drives themselves are still technically viable), there are other machines that can still be accessed.

But we also go a step further. We don't just have backup drives in case one goes down, or backup machines in case drives or machines go kaput. We also have redundant locations, setting up our data across the country. That way, our data center in Metropolis can keep us running, even if there's a power outage in Gotham, shutting down our servers there.

Why are we telling you all of this? We want you to feel as safe as we do with our system. We're experts, and we use the tools of experts to create something of quality, something reliable. Like that collector's edition lightsaber you own, our service doesn't just do its job, it does it with style.

The Guarantees

If a service provider has an SLA, they will most likely map out exactly what they will guarantee regarding their level of performance, and what they won't. Let's discuss failing to meet expectations, whose fault it is, and what to do when it happens. Here's a summarized breakdown:

If the problem can legitimately be blamed on the service provider, there will usually be a protocol for compensation. If that ever happened in our case, we'd issue a credit to your account.
If the problem is not the provider's fault but also not yours, no compensation will be given, but you will get recommendations on how to optimize service.
If the problem is your fault, you will be redirected to this page.

Uptime Guarantees

Here's an example of an SLA guarantee, what performance levels are promised, and what obligations the user has. In this example, if the response time of the service provider on a request exceeds 500 milliseconds, or both ingress and egress cease for at least five minutes, then the promises made in the SLA have been broken. The following excerpt from our SLA should give a good example of where the line of demarcation is regarding what is not the provider's fault.

"The uptime and response guarantees set forth above do not apply to any unavailability, suspension, or termination of Service, or any other Service performance issues:

that result from a suspension of an account;

caused by factors outside of our reasonable control, including any force majeure event or Internet access or related problems beyond the demarcation point of the Service;

that result from any action or inaction of you or any third party;

that result from your equipment, software, other technology, and/or third party equipment (other than third party equipment within our direct control);

arising from our suspension or termination of your right to use the Service in accordance with the Terms of Service; or

that result from your failure to adhere to the Technical Requirements page listed in our documentation."

According to the terms above, anything that happens outside the above list is the fault of the provider, and that means it's their job to compensate you for the loss of service. Here's how we do it at Smarty (if we ever had a service outage, that is): we credit accounts with a period of free service, depending on the severity of the lapse in service. A lapse in response time earns a one-day credit. A service outage that drops uptime below the guarantee but keeps it at 99% or above earns a one-week credit. Uptime of 98.99% or below earns a one-month credit.
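The credit tiers can be sketched as a small lookup. Treat this as an illustration only: the sla_credit function and its parameter names are hypothetical, and it makes a few assumptions the text above doesn't spell out. It assumes that only the largest applicable credit is awarded, that any uptime under 99% falls in the one-month tier, and that a response-time lapse means a response slower than 500 milliseconds.

```python
from typing import Optional

GUARANTEED_UPTIME_PERCENT = 99.98
GUARANTEED_RESPONSE_MS = 500


def sla_credit(uptime_percent: float, slowest_response_ms: float) -> Optional[str]:
    """Return the credit owed for a month, or None if every guarantee was met.

    Assumption: the tiers do not stack; the most severe lapse decides the credit.
    """
    if uptime_percent < 99.0:
        return "one month"  # assumption: covers everything under 99%
    if uptime_percent < GUARANTEED_UPTIME_PERCENT:
        return "one week"
    if slowest_response_ms > GUARANTEED_RESPONSE_MS:
        return "one day"
    return None


print(sla_credit(99.99, 30))   # guarantees met, so no credit
print(sla_credit(99.99, 620))  # slow response only
print(sla_credit(99.50, 30))   # uptime below the guarantee but at least 99%
print(sla_credit(98.00, 30))   # uptime below 99%
```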

SLA Limitation of Liability

This one includes service issues:

caused by factors outside of our reasonable control, including any force majeure event or Internet access or related problems beyond the demarcation point of the Service;
that result from any action or inaction of you (see below) or any third party;

Our SLA puts it beautifully: "We guarantee performance for equipment we control, not for networks we can't control. We can't control Internet latency because we don't own the Internet (yet)." It's no more fair to make us take the blame for external problems than it is for us to hand the blame for our failures off to you, saying "Tough breaks, kid." We claim what's ours, and we kindly request that you not blame us for stuff we don't control (yet).

That said, we do have some advice on how to avoid problems and get the most out of our service. Here it is, as listed in the SLA:

Reuse established HTTPS connections because it's way faster. It's not our fault if you don't.
HTTPS isn't very reliable by itself. Be prepared to send a request more than once to make sure it arrives.
For best results, don't send requests to IP addresses directly; let DNS resolve the best server for you, since specific IPs may go down from time to time or be swapped out entirely.
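For the folks who do count from zero, the three tips above might look something like this in practice. It's a minimal sketch using only Python's standard library. The make_sender and send_with_retries helpers, the retry count, and the backoff delays are assumptions made for illustration, and any hostname or path you pass in is up to you. This is not an official client.

```python
import http.client
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def make_sender(host: str, path: str) -> Callable[[], bytes]:
    """Build a sender that reuses one HTTPS connection across requests.

    Pass a hostname, not an IP address, so DNS can pick the server.
    """
    connection = http.client.HTTPSConnection(host, timeout=5)  # opens lazily

    def send() -> bytes:
        try:
            connection.request("GET", path)
            return connection.getresponse().read()
        except OSError:
            connection.close()  # reset so the next attempt reconnects cleanly
            raise

    return send


def send_with_retries(
    send: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() until it succeeds, backing off a little longer each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return send()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller see the failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A caller would build one sender with make_sender and hand it to send_with_retries for each request, so the connection stays warm between calls and a dropped request gets another try.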

In the event that the above suggestions read like gibberish to you, A) we have some articles for you, and B) you can talk to the guy or gal in the office who starts counting at "0" instead of "1" and knows what "Stack Overflow" is. And don't feel bad; the monkey we gave the typewriter to doesn't know what any of that means, either.

User Errors and SLAs

Listed here are the fun problems, "fun" because they're the "Darwin Award section" of an SLA. This section excludes system problems:

that result from a suspension of an account;
that result from your equipment, software, other technology, and/or third party (see above) equipment (other than third party equipment within our direct control);
arising from our suspension or termination of your right to use the Service in accordance with the Terms of Service; or
that result from your failure to adhere to the Technical Requirements page listed in our documentation.

There's the very real possibility that the problem exists between your chair and keyboard, and that the failure in service is strictly due to problems on your end. There's...not a lot we can do about that. We can't upgrade your hardware, or force you to adhere to our Terms of Service. So we invite you to call our patient and understanding customer support crew.

The Weather Report

There are a number of other things that an SLA might contain that we haven't covered yet, but there are only two more that we'd like to touch on, for the sake of brevity: how metrics are obtained, and maintenance schedules (and how you will be notified about said maintenance).

Metrics

We make mention of where we get our metrics because we feel it's important for you to know we're not just making our numbers up. Third-party monitoring services keep tabs on us and give us feedback on how we're doing. It's their supervision that's going to get you your credits in the event something goes wrong. So keep that in mind—you're not just in our capable hands, you're also in the hands of capable strangers.

Maintenance

While the actual schedule is not likely to be on the SLA (the schedule is subject to change, while the SLA is not), there might be notes on what you can expect as far as maintenance procedures, including, but not limited to, how the company will notify you when scheduled downtime is coming up. It's in the company's best interest to keep you as apprised as possible regarding these outages or changes in service.

Th-th-th-that's All, Folks!

Hopefully, we've been able to turn our own SLA into an effective teaching tool, demonstrating the basic structures and habits of the format. SLAs are important documents, and knowing what's in them can pay off (or save you a lot of trouble and headache when talking to customer support). So familiarize yourself with them, and then hold your provider to their promises. They deserve to be kept on their toes; they should be ready for anything. Like Batman.