Data Management:
When Good Data Goes Bad

The "Fast Lane" Answer

"Data management" is the process of controlling and facilitating the flow of data, from its creation through processing, access, and deletion. Everything from research statistics to the contact information for Billy in accounting is governed by data management, because definitively knowing a thing is very important. But knowing can be dangerous and costly if what you "know" turns out to be wrong.

Nowhere is this more critical than with address data. A mistake in address data can cost an organization up to $100 per occurrence, and with over a quarter of all mail in the US being addressed improperly, there's plenty of room for things to go wrong.

If you're not sure how to manage your data and ensure its accuracy, look no further. Below we've compiled some useful tips on how to prevent bad data and avoid some of its frightening costs.

The "Scenic Route" Answer

Bad Data: When Information Misbehaves

"Bad data" is what you have when you think you know something but you're actually wrong. It has many possible causes, but the end result is the same: one day you run into the unhappy surprise that everything you know is a lie. Or at least, a particular piece of information is incorrect.

Bad data can come from just about anywhere. Here are a few examples:

It's unavoidable. You're going to run into bad data.

The Costs of Bad Data

Let's cut to the chase: Bad data is going to cost you. A lot. The Data Warehousing Institute recently released an estimate that over $600 billion is lost by businesses per year due to bad data (that's all bad data, including bad addresses). That's kind of a big number, and it's kind of a wide net we're casting, so let's see if we can break that down into something more applicable to your business.

Let's start with how much data is bad. EBiz [1] estimates that around 20% of an average database is bad data. That's one-fifth. One out of every five data entries is wrong. Still don't think it's a problem? Consider that the cost of a single bad data entry (including addresses) could be as much as $100, according to Sirius Decisions [2].

So let's do the math: let's say you're an ecommerce business, selling a $15 item to 100,000 people. That's $1,500,000 in gross revenue. Now let's take the estimates above and say that 20,000 of the addresses you have on file are wrong in some way, and each is going to cost you the full $100 (just so we're using round numbers). At that amount, you're losing $2,000,000, which is more than the entire $1,500,000 you brought in, just because someone misspelled their street name or an address was formatted incorrectly.

An infographic over at Lemonly.com [3] explains it this way: bad data could be costing you anywhere from 10% to 25% of your revenue. If that's money you want back, you're going to have to start doing something about your dirty data before it becomes a problem. On average, it costs $1 to prevent a dirty entry, $10 to correct an incorrect record, and $100 to deal with it after it's become a problem [4]. So if you're looking to avoid the costs, you'll have to be proactive and practice good data management.
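The $1 / $10 / $100 comparison above can be worked through in a few lines of Python. This is only an illustration using the round numbers quoted in this article (100,000 records and a 20% bad-data rate). The variable names are ours, and charging the $1 prevention cost only for the records that would otherwise go bad is a simplifying assumption.

```python
# Worked version of the estimates quoted above. All figures are the
# article's round numbers, not measured data.
records = 100_000   # addresses on file
bad_rate = 0.20     # estimated share of entries that are bad

# Estimated cost per bad record, depending on when you deal with it.
cost_per_stage = {
    "prevent": 1,    # catch it at entry
    "correct": 10,   # clean it up later
    "failure": 100,  # deal with it after it has caused a problem
}

bad_records = round(records * bad_rate)


def total_cost(stage: str) -> int:
    """Total cost if every bad record is handled at the given stage."""
    return bad_records * cost_per_stage[stage]


for stage in cost_per_stage:
    print(f"{stage:>8}: ${total_cost(stage):,}")
```

With these inputs, the same 20,000 bad records cost a hundred times more to deal with after the fact than to prevent.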

What is Data Management?

Data, as far as our conversation today is concerned, is information (statistics, contact information, customer input or responses, basically anything you would keep a record of), especially information stored in a digital format. Data management is the upkeep of records, information, and data.

"Data management" goes by a few aliases. Sometimes it goes by "data administration." It can also commonly be called "data resource management" (or DRM). It can even be called "master data management" (or MDM). Regardless of the name, the concepts in question and the issues at hand are the same.

Data Management Best Practices

At this point, it's probably important to talk about some best practices for keeping your data in good shape. These are some tips and advice used by the pros to turn "What now?" data into "Oh, wow!" data. For those of you not interested in reading the whole rundown, here's a spiffy bullet list:

First and foremost, the best piece of advice we can give you is to make a Data Management Plan (DMP). Data management plans are comprehensive "battle plans," if you will, that set out guidelines and rules for the management of your data. A DMP often covers the entire data life cycle, from the actual creation and aggregation of data to how it's being stored a decade later. It gives the data a focused purpose, and gives those who are stewards over it an idea of what to do with it.

Typically, a DMP is put together prior to the entry of the first piece of data. That doesn't mean you can't institute one on a database that's already in place. Better late than never, and even a lackluster plan is better than no plan at all. So confer with your generals, and put together a war strategy for your data.

Tied in along with that is what your data looks like, meaning how it's formatted. Just like with standardization and address validation, it's easier to compare things if you minimize the differences first. It's hard to use a computer search to find a specific date if every file or line of data in the database uses a different format for the date. A consistent, standard format will help with the management of your data. Moreover, it will help with the retrieval and use of necessary information.
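To make the date example concrete, here is a minimal sketch of normalizing mixed date strings to a single format (ISO 8601, YYYY-MM-DD). The list of input formats is a hypothetical one for illustration, and it assumes US-style month-first dates for slash-separated values; a real database would need its own list.

```python
from datetime import datetime

# Hypothetical list of formats found in the data; extend it to match yours.
# Slash-separated dates are assumed to be month-first (US style).
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]


def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if unrecognized."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized date format: {raw!r}")


for value in ["03/14/2015", "14 Mar 2015", "March 14, 2015"]:
    print(value, "->", normalize_date(value))
```

Raising an error on unrecognized input, rather than guessing, keeps questionable entries visible so someone can fix them at the source.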

Use a consistent naming convention for your files, so that they're easily identifiable, and easily located. Have the same layout for each entry, being sure to use keywords and consistent spelling. Really, the word "consistent" is the key here, since (depending on the data you're aggregating) the specific format of the data is less important than being consistent. In short, the practice of being consistent will prove a consistent benefit to your data.

(We realize the joke is getting a bit tired, but we're just trying to be...you get what we mean.)

In some cases, like with addresses (we'll discuss them more below), the format does matter, since there are official standards. While things might get a little more complicated for a database that tracks an international list, for those that only contain domestic US addresses it's as simple as doing as the Romans do. And by "Romans," we mean the USPS. They have an official standard format for US addresses, and "standardizing" (we mentioned that above) to that format will not only help streamline your database, but will make shipping easier too.

(By the way, you don't have to do that work by yourself. An address validation provider, like SmartyStreets for instance, can do that for you. And we can do it really quickly.)

And of course, when you set and use a standard, be sure that your team is properly trained on its implementation. All it takes is one guy getting crazy with the date and time, and suddenly your records are a mess.

Once you've got a system in place, occasionally check on your data to make sure things are going smoothly. You can do minor surveys now and then, but be sure you do at least one Data Quality Assessment. These are comprehensive. They check the data all the way through, locating and identifying each piece of information that is incorrect, incomplete, doesn't meet the standard, and so on.

A data quality assessment tracks the details of said bad data, so you can better identify where it's happening and why. It gives you rigorously obtained statistics, so that you can quantify just how bad your data is, how much it's costing you, and what you can do about it. Once you have an idea of where the problem areas are, you can start putting together a battle plan specifically to deal with the data that's already bad, and how to prevent the production of more bad data in the future.
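As a rough idea of what the counting side of an assessment might look like, the sketch below tallies problems by type across a list of records. The field names and the ZIP Code pattern are assumptions made for this example; a real assessment would check against your own schema and standards and would go much further.

```python
import re
from collections import Counter

# Hypothetical schema for a contact record.
REQUIRED_FIELDS = ("name", "street", "city", "state", "zip")
# Five-digit ZIP Code, optionally followed by a four-digit add-on.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")


def assess(records):
    """Return (issue counts by type, number of records with any issue)."""
    issues = Counter()
    bad_records = 0
    for rec in records:
        found = []
        for field in REQUIRED_FIELDS:
            if not str(rec.get(field) or "").strip():
                found.append(f"missing {field}")
        zip_code = str(rec.get("zip") or "").strip()
        if zip_code and not ZIP_PATTERN.match(zip_code):
            found.append("nonstandard zip")
        issues.update(found)
        if found:
            bad_records += 1
    return issues, bad_records
```

Grouping the counts by issue type is what lets you see where the bad data is coming from, not just how much of it there is.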

In the process of monitoring your data, make sure you're trimming the fat. Keep your database efficient and streamlined. Identify and remove duplicates (that's called "deduplication"). Use proper archiving techniques, so that you can minimize the amount of storage space you need to use. And above all, don't store and maintain data that's not useful.
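Deduplication can be sketched simply if you assume two records are duplicates when their address fields match after ignoring case and extra whitespace. That matching rule and the field names are assumptions for this example; real-world duplicates are often fuzzier than this.

```python
def dedupe(records, key_fields=("street", "city", "state", "zip")):
    """Keep the first record for each normalized key.

    Returns (unique_records, duplicate_records) so duplicates can be
    reviewed before anything is deleted.
    """
    seen = set()
    unique, duplicates = [], []
    for rec in records:
        # Normalize: uppercase and collapse runs of whitespace.
        key = tuple(
            " ".join(str(rec.get(field, "")).upper().split())
            for field in key_fields
        )
        if key in seen:
            duplicates.append(rec)
        else:
            seen.add(key)
            unique.append(rec)
    return unique, duplicates
```

Returning the duplicates instead of silently dropping them leaves the final call on deletion to a person.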

You also don't want your data to become outdated or incompatible. To avoid both problems, save your information in open, non-proprietary formats. For instance, use formats like .txt, .csv, and .zip, and avoid formats like .docx, .xls, and .rar.
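As one example of saving to an open format, here is a minimal sketch that writes records out as CSV using Python's standard csv module. The function name and the field names in the usage note are placeholders of ours.

```python
import csv


def export_csv(records, fieldnames, out_file):
    """Write a list of dict records to an open text file as CSV."""
    # extrasaction="ignore" skips any keys not listed in fieldnames.
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
```

To use it, open the destination with `open("contacts.csv", "w", newline="", encoding="utf-8")` and pass the file object in. The resulting file can be read by practically any spreadsheet, database, or programming language.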

Along those same lines, be sure you're backing up your data. Nothing is worse than having a huge lump of data and hard work on your hard drive, then losing it because you accidentally deleted it, or the computer was struck by lightning or dropped out of a plane. Creating backups regularly can help prevent problems like this, but it can also protect against another problem: data corruption. Whether by human or computer error, sometimes things get screwed up in the little bits and bytes of a file. Having a recent, healthy version you can restore can prevent the total loss of data, should someone decide not to remove their USB drive safely.
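One way to make sure a backup really is a "healthy version" is to compare checksums of the original and the copy. The sketch below does that for a single file with Python's standard library. The function names are ours, and a real backup routine would also handle scheduling, versioning, and off-site copies.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and verify the copy matches the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copies contents and file metadata
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target
```

Storing the checksum alongside the backup also lets you detect later whether the backup itself has been corrupted.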

Lastly, let's talk about security. Besides keeping people from taking what's yours, good security can also protect against sabotage (which is not always intentional). Strict regulations, like limiting access, can limit mistakes by having fewer people handle the data less often.

Now, a lot of the nitty-gritty on this one will need to be handled by the guys and gals on your crew who know their DOS from a Unix in the ground. But that doesn't mean you can't know what to ask of them. So here are a few tips that both the businessmen and the code monkeys can agree on:

Obviously there are other, more specific tactics (many of which depend on the industry and the type of data being stored), but these general tips should get you started.

What is Address Data Management?

Now we'd like to talk more specifically about the kind of bad data and data management that we're most familiar with. That's bad addresses and address data management, respectively. A bad address is one you can't mail to, and address data management is the same as the aforementioned data management, but directly applied to addresses.

Aggregating address data happens for any number of reasons. Maybe you're a political campaign that's polling potential voters. Maybe you're an ecommerce business that needs to ship product to customers. Regardless of the reason, the addresses in your database need to be kept ship-shape after they're obtained for them to be worth anything.

There are four things that need to be done to addresses to make sure they're everything you need them to be:

Checking an address for accuracy (a process that's called "address validation") involves comparing an address to an authoritative database and seeing if it's listed in that database. If the database has the address listed, then it's a real address, and it's considered "valid." If the address is not in the database, then there's a problem with the address, and it's marked "invalid." An invalid address may or may not be a correct or real address, but a valid address is always a real address.
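In its simplest form, that comparison is a lookup. The toy sketch below uses a small in-memory set as a stand-in for an authoritative database. The entries are made-up placeholders, and real validation runs against a far larger, regularly updated data source.

```python
# Hypothetical stand-in for an authoritative address database. The entries
# are made-up placeholders, stored in standardized (uppercase) form.
AUTHORITATIVE = {
    ("123 MAIN ST", "ANYTOWN", "UT", "84000"),
    ("45 ELM AVE", "ANYTOWN", "UT", "84000"),
}


def validate(street, city, state, zip_code, database=AUTHORITATIVE):
    """Return "valid" if the address is listed in the database, else "invalid"."""
    # Uppercase and collapse whitespace so trivial differences don't matter.
    key = tuple(
        " ".join(part.upper().split())
        for part in (street, city, state, zip_code)
    )
    return "valid" if key in database else "invalid"


print(validate("123 main st", "Anytown", "ut", "84000"))
print(validate("999 Nowhere Ln", "Anytown", "UT", "84000"))
```

Note the asymmetry described above: "invalid" here only means the address wasn't found in the database, not that it can't exist.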

Address validation also solves, at least partially, the problems of out-of-date and missing information. Since validating the address compares it against the most recent and updated version of the database, an address won't validate if it's no longer active on the list. If an address no longer exists, or is no longer receiving mail, it's out of date, and thus it's invalid.

Validation also helps with gaps in information, since "standardizing" an address is a prerequisite. You can't validate an address without standardizing it first. Standardizing (which is often coupled with address parsing) is when addresses are cleaned up and made to match the format of the database in question. For example, standardizing a US address involves reformatting it to match the standards set by the USPS. This is a process that sometimes includes filling in gaps in the address (adding the proper street designation, supplying missing ZIP Code details, etc.).
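To give a feel for the reformatting half of standardization, here is a deliberately small sketch that uppercases a street line, strips punctuation, and abbreviates a trailing street suffix and a leading directional. The abbreviation tables cover only a handful of common cases, and the function name is ours. It does not parse addresses properly or fill in missing details the way a real standardization service does.

```python
# A few common street suffix and directional abbreviations (not a full list).
SUFFIXES = {
    "STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD",
    "ROAD": "RD", "DRIVE": "DR", "LANE": "LN",
}
DIRECTIONALS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}


def standardize_street(line: str) -> str:
    """Uppercase a street line and abbreviate its suffix and leading directional."""
    words = line.upper().replace(".", "").replace(",", "").split()
    # Abbreviate the suffix only when it is the last word of the line.
    if words and words[-1] in SUFFIXES:
        words[-1] = SUFFIXES[words[-1]]
    # Abbreviate a directional right after the house number, but only when a
    # street name still follows it (so "123 North Street" keeps "NORTH").
    if len(words) >= 4 and words[0].isdigit() and words[1] in DIRECTIONALS:
        words[1] = DIRECTIONALS[words[1]]
    return " ".join(words)


print(standardize_street("123 north main street."))
print(standardize_street("45 Elm Avenue"))
```

Once every address is in the same shape, comparing it against the authoritative database becomes a much simpler problem.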

We don't mention standardization as its own bullet-list item up above simply because it's part of the validation process. You can't check an address's accuracy unless you can properly compare it to the database.

As for checking for duplicate entries, a good validation system can do that too, either removing troublesome data entries or flagging them for deletion by whoever supplied the data (typically the latter). So all in all, validation is the way to go.

[1] http://www.ebizq.net/blogs/integrationedge/2012/01/fixing-a-3-trillion-dirty-data-problem-with-crowd-computing.php
[2] https://www.siriusdecisions.com/TheImpactofBadDataonDemandCreation.aspx
[3] http://lemonly.com/work/the-cost-of-bad-data/
[4] https://www.ringlead.com/blog/cost-of-bad-data/