Garbage in, garbage out — if your input data is garbage, then your output data will be just as much rubbish. That’s something enterprises have to consider closely when the time comes to consolidate their data in a new storage system. The fastest, most reliable and shiniest disk tower or network storage
system won’t make any difference if your data is dumpster quality.
Frank Dravis, vice-president of data quality at Firstlogic, doesn’t mince words. “As soon as a SAN system or some other data consolidation system is implemented, you have to run sanity checks on your data to ensure that no defects are instigated,” he says. “If people do not believe the data, they won’t use it.”
It’s enough to make you wonder whether consolidating data on a networked storage system or data warehouse is a blessing or a curse. “It’s absolutely both,” says John Bair, senior principal at Knightsbridge Solutions LLP, a Chicago-based data warehousing consultancy. “We are still saddled with all of the systems we’ve built and created over the last thirty years. They were not designed to share information in the way we do now.”
The problem is that new applications that no one could have imagined when many companies invested in their first RAID arrays have begun to put unexpected stresses on enterprise data. Often, says Bair, these applications are the principal driver for data consolidation, and that’s where the problem is. Web services are getting data to talk to each other in ways they never have before. Data warehouses have centralized enterprise data to enable thorough and complete reporting. Enterprise information integration has helped to bring every pair of eyes and every department’s special applications to bear on everyone else’s dirty laundry.
“The problem with bringing data together is that you bring together all the problems that are tolerable in isolated systems, but they become intolerable when you concentrate them,” Bair says. “We need to talk about how the data is being used and whether it is good enough for those uses.”
The bottom line is that data that is good enough for marketing might not be good enough for accounting, and you don’t want to find that out when the accounting department chokes on the marketers’ data. That becomes a major issue with the deployment of a storage-area network, for example.
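As a rough illustration of that fitness-for-purpose problem, consider the same records run through two sets of checks, one per department. The sketch below is hypothetical — the field names and rules are invented, not drawn from Firstlogic or Knightsbridge — but it shows how data that clears marketing’s bar can still fail accounting’s.

```python
# Hypothetical records and rules: the same data judged by two departments.
records = [
    {"customer": "Acme Corp", "postal_code": "", "invoice_amount": "1200.50"},
    {"customer": "Globex", "postal_code": "60601", "invoice_amount": "n/a"},
]

def marketing_ok(rec):
    # Marketing only needs a contactable name; gaps elsewhere are tolerable.
    return bool(rec["customer"].strip())

def accounting_ok(rec):
    # Accounting needs an amount it can actually post to the ledger.
    try:
        float(rec["invoice_amount"])
        return True
    except ValueError:
        return False

for rec in records:
    print(rec["customer"],
          "| marketing:", marketing_ok(rec),
          "| accounting:", accounting_ok(rec))
```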
“Legacy data that has problems in it is going to have greater visibility, and that’s a big problem,” Dravis says. “Core people have been working with this data for years and have workarounds. But the wider enterprise audience doesn’t.
“That’s why big events like data consolidation tend to be catalysts for data quality and cleansing projects,” Dravis says. “This is the ideal opportunity to address legacy issues that could come back and haunt you later on.”
How good or bad your data is has a whole lot to do with where it lives, Dravis says. “When thinking about data quality, you have to be aware of where the data resides and what applications are going to use it,” he says. “When you’re building a new SAN system, you not only have to consider that the data needs cleansing, but you also have to know where the problems are.”
Consequently, consolidating storage can be an easy project to under-scope. It’s not just a hardware issue. It means taking the time to go through every field of every record of every database with a fine-tooth comb before you open the Fibre Channel gates to that brand-new network-attached 20TB disk tower. In other words, as good as your IT guys are at building a high-performance SAN, they have to bear in mind that data quality, and thus storage, is at least in part a business issue.
IT knows the systems, but the business users know the data. “They need to get back to the business users,” Dravis says. “All data is accessed through business applications. IT has to say that this is an opportunity to improve performance.”
This, in fact, is where the data quality curse does become an opportunity. It is the one point where you can take your data aside, since you are moving it anyway, and clean it up. It’s a chance to start afresh.
There are a number of proven methodologies for data cleansing, but they all start with a data quality assessment. Noting that practitioners often don’t know exactly where to start, Bair’s advice is to begin “by working on less complex data quality issues, such as data type and domain, gradually building the foundation for addressing more complex aspects of data quality, such as business rule conformance.”
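To make that concrete, here is a minimal sketch of what such a first-pass assessment might look like, checking only data type and domain conformance before tackling business rules. The field names, expected types and allowed values below are hypothetical, not taken from any particular methodology.

```python
# Profile each field for type conformance and domain membership.
rows = [
    {"order_id": "1001", "status": "SHIPPED", "qty": "3"},
    {"order_id": "1002", "status": "UNKNOWN", "qty": "two"},
    {"order_id": "", "status": "PENDING", "qty": "1"},
]

expected = {
    "order_id": {"type": int, "domain": None},
    "status": {"type": str, "domain": {"PENDING", "SHIPPED", "CANCELLED"}},
    "qty": {"type": int, "domain": None},
}

def assess(rows, expected):
    issues = []
    for i, row in enumerate(rows):
        for field, spec in expected.items():
            value = row.get(field, "")
            if value == "":
                issues.append((i, field, "missing value"))
                continue
            # Type check: can the raw string be coerced to the expected type?
            try:
                typed = spec["type"](value)
            except ValueError:
                issues.append((i, field, f"not a valid {spec['type'].__name__}"))
                continue
            # Domain check: is the value in the allowed set, if one is defined?
            if spec["domain"] and typed not in spec["domain"]:
                issues.append((i, field, f"outside domain: {typed!r}"))
    return issues

for row_idx, field, problem in assess(rows, expected):
    print(f"row {row_idx}, field {field!r}: {problem}")
```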
After that, Bair recommends that enterprises take what he calls an architectural approach to data quality: a 12-step process that begins with the definition of reference data and extraction, and passes through a transformation phase where the data is massaged into the structures appropriate for the centralized storage system. From there, the data goes through certification and auditing before it is published, archived and cleansed.
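The full 12 steps aren’t spelled out here, so the skeleton below only strings together the phases the article names: extraction, transformation against reference data, certification and auditing, then publication. Every function is a placeholder standing in for real work, and the reference data is invented for illustration.

```python
# Hypothetical reference data the central store is expected to conform to.
REFERENCE_STATUSES = {"PENDING", "SHIPPED", "CANCELLED"}

def extract(source):
    """Pull raw records out of the legacy source system."""
    return list(source)

def transform(records):
    """Massage records into the structure the central store expects."""
    return [{**r, "status": r.get("status", "").upper()} for r in records]

def certify(records):
    """Gate the load: only records that conform to reference data pass."""
    return [r for r in records if r["status"] in REFERENCE_STATUSES]

def audit(records, certified):
    """Record what was rejected so the gaps are visible, not silent."""
    rejected = len(records) - len(certified)
    print(f"certified {len(certified)} of {len(records)}; rejected {rejected}")

def publish(records):
    """Hand the certified records to the consolidated store (stubbed)."""
    return records

source = [{"status": "shipped"}, {"status": "lost in transit"}]
raw = extract(source)
shaped = transform(raw)
good = certify(shaped)
audit(shaped, good)
published = publish(good)
```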
It’s a big process, and it can add significant cost to the storage migration plan. But, as Dravis points out, “if you don’t address all of those legacy issues when you have the chance, what’s the point of building the SAN at all?” At the end of the day, he says that it pays to be proactive. “It’s an opportunity for IT to be the heroes,” he says.