
Data Management: When Good Data Goes Bad

The "Fast Lane" Answer

"Data management" is the process of facilitating data control and flow, from the data's creation to processing, accessing, and deleting it. Everything from research statistics to the contact information for Billy in accounting is governed and facilitated by data management because, definitively knowing a thing is very important. But knowing can be dangerous and costly if what you "know" turns out to be wrong.

Nowhere is this more critical than with address data. A mistake in address data can cost an organization up to $100 per occurrence, and with over a quarter of all mail in the US being addressed improperly, there's plenty of room for things to go wrong.

If you're not sure how to manage your data and ensure its accuracy, look no further. Below we've compiled some useful tips on how to prevent bad data and avoid some of its frightening costs:

  • Make a Data Management Plan
  • Use a Consistent, Standard Format
  • Do a Data Quality Assessment
  • Keep Your Database Efficient and Streamlined
  • Save in Open, Non-proprietary Formats
  • Back Up Your Data
  • Practice Good Data Security

The "Scenic Route" Answer

Bad Data: When Information Misbehaves

"Bad data" is what's happening when you think you know something, when really, you're wrong. It has many possible causes, but the end result is the same: one day you run into the unhappy surprise that everything you know is a lie. Or at least, a particular piece of information is incorrect.

Bad data can come from just about anywhere. Here are a few examples:

  • Customer/user input the information incorrectly
  • Computer glitch altered the data
  • Autocorrect
  • Falsified information
  • Information was factual, but is now out of date
  • Content is accurate, but for whatever reason, the entry doesn't adhere to proper formats and standards

It's unavoidable. You're going to run into bad data.

The Costs of Bad Data

Let's cut to the chase: Bad data is going to cost you. A lot. The Data Warehousing Institute recently released an estimate that over $600 billion is lost by businesses per year due to bad data (that's all bad data, including bad addresses). That's kind of a big number, and it's kind of a wide net we're casting, so let's see if we can break that down into something more applicable to your business.

Let's start with how much data is bad. EBiz estimates that around 20% of an average database is bad data. That's one-fifth. One out of every five data entries is wrong. Still don't think it's a problem? Consider that the cost of a single bad data entry (including addresses) could be as much as $100, according to Sirius Decisions.

So let's do the math: let's say you're an ecommerce business, selling a $150 item to 100,000 people, for a gross revenue of $15,000,000. Now let's take the estimates above and say that 20,000 of the addresses you have on file are wrong in some way, and each is going to cost you the full $100 (just so we're using round numbers). At that rate, you're losing $2,000,000 of your $15,000,000 in revenue, just because someone misspelled their street name or an address was formatted incorrectly.
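That back-of-the-envelope math is easy to run on your own numbers. The figures below are illustrative estimates (a $150 item, 100,000 orders, the 20% bad-data rate and $100-per-record cost cited above), not measured data:

```python
# Back-of-the-envelope cost of bad address data.
# All inputs are illustrative estimates, not measured figures.
customers = 100_000     # orders on file
item_price = 150.00     # revenue per order (assumed)
bad_rate = 0.20         # estimate: ~20% of records are bad
cost_per_bad = 100.00   # estimate: up to $100 per bad record

revenue = customers * item_price
bad_records = int(customers * bad_rate)
loss = bad_records * cost_per_bad

print(f"Revenue:     ${revenue:,.2f}")
print(f"Bad records: {bad_records:,}")
print(f"Loss:        ${loss:,.2f} ({loss / revenue:.1%} of revenue)")
```

Swap in your own order count, price, and bad-data rate to see where your business falls in the 10-25% range discussed below.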

An infographic over at Lemonly.com explains it this way: bad data could be costing you anywhere from 10-25% of your revenue. If that's money you want back, you're going to have to start doing something about your dirty data before it becomes a problem. On average, it costs $1 to prevent a dirty entry, $10 to correct an incorrect record, and $100 to deal with it after it's become a problem. So if you're looking to avoid the costs, you'll have to be proactive and practice good data management.

What is Data Management?

Data, as far as our conversation today is concerned, is information (statistics, contact information, customer input or responses, basically anything you would keep a record of), especially information stored in a digital format. Data management is the upkeep of records, information, and data.

"Data management" goes by a few aliases. Sometimes it goes by "data administration." It can also commonly be called "data resource management" (or DRM). It can even be called "master data management" (or MDM). Regardless of the name, the concepts in question and the issues at hand are the same.

Data Management Best Practices

At this point, it's probably important to talk about some best practices for keeping your data in good shape. These are some tips and advice used by the pros to turn "What now?" data into "Oh, wow!" data. For those of you not interested in reading the whole rundown, here's a spiffy bullet list:

  • Make a Data Management Plan
  • Use a Consistent, Standard Format
  • Do a Data Quality Assessment
  • Keep Your Database Efficient and Streamlined
  • Save in Open, Non-proprietary Formats
  • Back Up Your Data
  • Practice Good Data Security

First and foremost, the best piece of advice we can give you is to make a Data Management Plan (DMP). Data management plans are comprehensive "battle plans," if you will, that set out guidelines and rules for the management of your data. They often cover the entire data life cycle, from the actual creation and aggregation of data to how it's being stored a decade later. A DMP gives the data a focused purpose, and gives those who steward it an idea of what to do with it.

Typically, a DMP is put together prior to the entry of the first piece of data. That doesn't mean you can't institute one on a database that's already in place. Better late than never, and even a lackluster plan is better than no plan at all. So confer with your generals, and put together a war strategy for your data.

Tied in with that is what your data looks like, meaning how it's formatted. Just like with standardization and address validation, it's easier to compare things if you minimize the differences first. It's hard to use a computer search to find a specific date if every file or line of data in the database uses a different format for the date. A consistent, standard format will help with the management of your data. Moreover, it will help with the retrieval and use of necessary information.

Use a consistent naming convention for your files, so that they're easily identifiable, and easily located. Have the same layout for each entry, being sure to use keywords and consistent spelling. Really, the word "consistent" is the key here, since (depending on the data you're aggregating) the specific format of the data is less important than being consistent. In short, the practice of being consistent will prove a consistent benefit to your data.
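To make the date example concrete, here's a hypothetical snippet that normalizes a few common date formats into one standard (ISO 8601). The list of input formats is an assumption about what might turn up in a messy database; a real cleanup job would expand it:

```python
from datetime import datetime

# Formats we assume might appear in a messy database (hypothetical list).
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]

def normalize_date(raw: str) -> str:
    """Return the date in ISO 8601 (YYYY-MM-DD), or raise ValueError."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(normalize_date("07/04/2015"))    # 2015-07-04
print(normalize_date("4 Jul 2015"))    # 2015-07-04
print(normalize_date("July 4, 2015"))  # 2015-07-04
```

Once every date is stored the same way, searching and sorting become trivial, which is the whole point of a consistent format.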

(We realize the joke is getting a bit tired, but we're just trying to be...you get what we mean.)

In some cases, like with addresses (we'll discuss them more below), the format does matter, since there are official standards. While things might get a little more complicated for a database that tracks an international list, for those that only contain domestic US addresses it's as simple as doing as the Romans do. And by "Romans," we mean the USPS. They have an official standard format for US addresses, and "standardizing" (we mentioned that above) to that format will not only help streamline your database, but will make shipping easier too.

(By the way, you don't have to do that work by yourself. An address validation provider, like SmartyStreets for instance, can do that for you. And we can do it really quickly.)

And of course, when you set and use a standard, be sure that your team is properly trained on its implementation. All it takes is one guy getting crazy with the date and time, and suddenly your records are a mess.

Once you've got a system in place, occasionally check on your data to make sure things are going smoothly. You can do minor surveys now and then, but be sure you do at least one Data Quality Assessment. These are comprehensive. They check the data all the way through, locating and identifying each piece of information that is incorrect, incomplete, doesn't meet the standard, and so on.

A data quality assessment tracks the details of said bad data, so you can better identify where it's happening and why. It gives you rigorously obtained statistics, so that you can quantify just how bad your data is, how much it's costing you, and what you can do about it. Once you have an idea of where the problem areas are, you can start putting together a battle plan specifically to deal with the data that's already bad, and how to prevent the production of more bad data in the future.

In the process of monitoring your data, make sure you're trimming the fat. Keep your database efficient and streamlined. Identify and remove duplicates (that's called "deduplication"). Use proper archiving techniques, so that you can minimize the amount of storage space you need to use. And above all, don't store and maintain data that's not useful.
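Deduplication, in its simplest form, is just collapsing entries that agree on a chosen key. Here's a toy sketch; the field names and the choice of key are made up for illustration, and real deduplication usually involves fuzzier matching:

```python
def deduplicate(records, key=lambda r: (r["name"].lower(), r["email"].lower())):
    """Keep the first record seen for each key; drop later duplicates."""
    seen = set()
    unique = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique

rows = [
    {"name": "Billy", "email": "billy@example.com"},
    {"name": "BILLY", "email": "Billy@Example.com"},  # duplicate, different case
    {"name": "Jane",  "email": "jane@example.com"},
]
print(len(deduplicate(rows)))  # 2
```

Note that the key lowercases its fields, so casing differences alone don't let a duplicate slip through.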

You also don't want to outdate your data, or risk incompatibility. To avoid both problems, save your information in open, non-proprietary formats, and avoid the proprietary ones. For instance, use formats like .txt, .csv, and .zip, and avoid proprietary formats like .doc, .xls, and .rar.

Along those same lines, be sure you're backing up your data. Nothing is worse than having a huge lump of data and hard work on your hard drive, then losing it because you accidentally deleted it, or the computer was struck by lightning or dropped out of a plane. Creating backups regularly can help prevent problems like this, but it can also protect against another problem: data corruption. Whether by human or computer error, sometimes things get screwed up in the little bits and bytes of a file. Having a recent, healthy version you can restore can prevent the total loss of data, should someone decide not to remove their USB drive safely.

Lastly, let's talk about security. Besides keeping people from taking what's yours, good security can also protect against sabotage (which is not always intentional). Strict controls, like limiting access, can reduce mistakes simply because fewer people are handling the data.

Now, a lot of the nitty-gritty on this one will need to be handled by the guys and gals on your crew who know their DOS from a Unix in the ground. But that doesn't mean you can't know what to ask of them. So here are a few tips that both the businessmen and the code monkeys can agree on:

  • Lock the Doors—there's more than a few reasons that you don't want someone coming into your business and walking off with a laptop or hard drive, but data propriety is one of them.
  • Lock the Computer—passwords, passwords, passwords. If you have any information of a sensitive nature, lock it down and password-protect it. But make sure they're good passwords. None of this "P@s5word" nonsense.
  • Encryption—for all the sensitive information you need to store, don't just save it and lock it up. Encrypt it. Encryption doesn't necessarily protect the data from theft, but it can make that data useless to a thief. Make sure the cipher you use is a good one; pig latin's not going to "ooh day uch may." Also, don't forget to encrypt your emails and email attachments. You don't want some filthy packet sniffer stealing your stuff while it's en route.
  • Hashing—besides being delicious, hashes can take your security one step further than encryption, for any information you don't specifically need to recall. Hashing can be used to digitally "shred" information you're disposing of, but it can also be used for things like storing passwords. That way, someone who hacks in won't find passwords; they'll find hashes they can't eat, or use.
  • Host Your Website on HTTPS—we're going to go out on a limb and assume that some of your information comes in through the internet (and not just via the aforementioned emails). If that's the case, you'll probably want to host your website over Hypertext Transfer Protocol Secure, or HTTPS. It's a pretty solid system that's really tough to break, and it'll keep your client/server exchanges discreet and private. Just don't forget to host all of your website on HTTPS. Hosting even part of it over HTTP leaves the back door open.
  • Limit Access—it's the same principle behind only giving the manager a key to the register. Limit access to the sensitive data to those who need access in order to do the work you trust them to do. "Trustworthy" is important. "Need" is important. Don't give access to anyone to whom one or both does not apply.
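The "Hashing" bullet above is worth a concrete sketch. Storing a salted hash instead of the password itself means a stolen table reveals no passwords directly. This example uses Python's standard-library PBKDF2; the iteration count and salt size are illustrative choices, not a security recommendation:

```python
import hashlib
import hmac
import os

def hash_password(password: str, salt: bytes = b"") -> tuple[bytes, bytes]:
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 with a random salt."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Recompute the hash and compare in constant time."""
    return hmac.compare_digest(hash_password(password, salt)[1], digest)

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("P@s5word", salt, digest))                      # False
```

The attacker who steals the table gets salts and digests, and has to brute-force every password individually, which is exactly the point.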

Obviously there are other, more specific tactics (many of which depend on the industry and the type of data being stored), but these general tips should get you started.

What is Address Data Management?

Now we'd like to talk more specifically about the kind of bad data and data management that we're most familiar with. That's bad addresses and address data management, respectively. A bad address is one you can't mail to, and address data management is the same as the aforementioned data management, but directly applied to addresses.

Aggregating address data happens for any number of reasons. Maybe you're a political campaign that's polling potential voters. Maybe you're an ecommerce business that needs to ship product to customers. Regardless of the reason, the addresses in your database need to be kept ship-shape after they're obtained for them to be worth anything.

There are four things that need to be done to addresses to make sure they're everything you need them to be:

  • Checking the addresses for accuracy
  • Checking the addresses for missing information
  • Checking to see if the addresses are up to date
  • Checking the address list for duplicate entries

Checking an address for accuracy (a process that's called "address validation") involves comparing an address to an authoritative database and seeing if it's listed in that database. If the database has the address listed, then it's a real address, and it's considered "valid." If the address is not in the database, then there's a problem with the address, and it's marked "invalid." An invalid address may or may not be a correct or real address, but a valid address is always a real address.
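At its core, that lookup step is a membership test against an authoritative list. The toy "database" below holds just two hand-standardized addresses; a real service checks against something like the full USPS database and does far more normalization first, so this is only the shape of the idea:

```python
# Toy "authoritative database" -- a handful of standardized addresses.
AUTHORITATIVE = {
    "1600 PENNSYLVANIA AVE NW, WASHINGTON, DC 20500",
    "350 5TH AVE, NEW YORK, NY 10118",
}

def is_valid(address: str) -> bool:
    """An address is 'valid' only if it appears in the authoritative list."""
    return address.strip().upper() in AUTHORITATIVE

print(is_valid("1600 Pennsylvania Ave NW, Washington, DC 20500"))  # True
print(is_valid("123 Fake St, Springfield, ZZ 00000"))              # False
```

Notice the asymmetry the paragraph above describes: a hit proves the address is real, but a miss only tells you something is off, not what.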

Address validation also solves, at least partially, the problems of out-of-date and missing information. Since validating the address compares it against the most recent and updated version of the database, an address won't validate if it's no longer active on the list. If an address no longer exists, or is no longer receiving mail, it's out of date, and thus it's invalid.

Validation also helps with gaps in information, since "standardizing" an address is a prerequisite. You can't do the former without doing the latter first. Standardizing (which is often coupled with address parsing) is when addresses are cleaned up and made to match the format of the database in question. For example, standardizing a US address involves reformatting it to match the standards set by the USPS. This is a process that sometimes includes filling in gaps in the address (adding the proper street designation, supplying missing zip code details, etc.).
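Here's a heavily simplified sketch of one piece of that standardization step: abbreviating street suffixes. The mapping below covers only a few USPS suffix abbreviations, and real standardization also parses the address into components, corrects ZIP codes, and more:

```python
# A few USPS standard suffix abbreviations (real tables are much longer).
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD", "DRIVE": "DR"}

def standardize(line: str) -> str:
    """Uppercase the street line, strip periods, abbreviate known suffixes."""
    words = line.upper().replace(".", "").split()
    return " ".join(SUFFIXES.get(w, w) for w in words)

print(standardize("123 Main Street"))       # 123 MAIN ST
print(standardize("45 Sunset boulevard."))  # 45 SUNSET BLVD
```

Only after every entry has been pushed through this kind of normalization can it be meaningfully compared against the authoritative database.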

We don't mention standardization as its own bullet-list item up above simply because it's part of the validation process. You can't check an address's accuracy unless you can properly compare it to the database.

As for checking for duplicate entries, a good validation system can do that too, either removing troublesome entries or flagging them for deletion by whoever supplied the data (typically the latter). So all in all, validation is the way to go.