data quality


		Business Intelligence Professionals

Introduction

It is often argued that good BI is dependent on good data quality. This is essentially a truism which it is hard to dispute. However, in this piece I will argue that it is helpful to look at this the other way round. My assertion is that good BI can have a major impact on driving improvements in data quality.

Food for thought from LinkedIn.com

Again this article is inspired by some discussions on a Linkedin.com group, this time Business Intelligence Professionals (as ever you need to be a member of both LinkedIn.com and the group to read the discussions). The specific thread asked about how to prove the source and quality of data that underpins BI and suggested that ISO 8000 could help.

I made what I hope are some pragmatic comments as follows:

My experience is that the best way to implement a data quality programme is to take a warts-and-all approach to your BI delivery. If data errors stand out like a sore thumb on senior management reports, then they tend to get fixed at source. If instead the BI layer massages away all such unpleasantness, then the errors persist. Having an “unknown” or, worse, “all others” category in a report is an abomination. If oranges show up in a report about apples, then this should not be swept under the carpet (apologies for the mixed metaphor).

Of course it is helpful to attack the beast in other ways: training for staff, extra validation in front-end systems, audit reports and so on; but I have found that nothing quite gets data fixed as quickly as the bad entries ending up on something the CEO sees.

Taking a more positive approach; if a report adds value, but has obvious data flaws, then it is clear to all that it is worth investing in fixing these. As data quality improves over time, the report becomes more valuable and we have established a virtuous circle. I have seen this happen many times and I think it is a great approach.

and, in response to a follow-up about exception reports:

Exception reports are an important tool, but I was referring to showing up bad data in actual reports (or cubes, or dashboards) read by executives.

So if product X should only really appear in department Y’s results, but has been miscoded, then erroneous product X entries should still be shown on department Z’s reports, rather than being suppressed in an “all others” or “unknown” line. That way wherever the problem is that led to its inclusion (user error, a lack of validation in a front-end system, a problem with one or more interface, etc.) can get fixed, as opposed to being ignored.

The same comments would apply to missing data (no product code), invalid data (a product code that doesn’t exist) or the lazy person’s approach to data (‘x’ being entered in a descriptive field rather than anything meaningful as the user just wants to get the transaction off their hands).

If someone senior enough wants these problems fixed, they tend to get fixed. If they are kept blissfully unaware, then the problem is perpetuated.

I thought that it was worth trying to lay out more explicitly what I think is the best strategy for improving data quality.

The four pillars of a data quality improvement programme

I have run a number of programmes specifically targeted at improving data quality that focussed on training and auditing progress. I have also delivered acclaimed BI systems that led to a measurable improvement in data quality. Experience has taught me that there are a number of elements that combine to improve the quality of data:

Improve how the data is entered
Make sure your interfaces aren’t the problem
Check how the data is entered / interfaced
Don’t suppress bad data in your BI

As with any strategy, it is ideal to have the support of all four pillars. However, I have seen greater and quicker improvements through the fourth element than with any of the others. I’ll now touch on each area briefly.

(if you are less interesting in my general thoughts on data quality and instead want to cut to the chase, then just click on the link to point 4. above)

1. Improve how the data is entered

Of course if there are no problems with how data is entered then (taking interface issues to one side) there should be no problems with the information that is generated from it. Problems with data entry can take many forms. Particularly where legacy systems are involved, it can sometimes be harder to get data right than it is to make a mistake. With more modern systems, one would hope that all fields are validated and that many only provide you with valid options (say in a drop down). However validating each field is only the start, entries that are valid may be nonsensical. A typical example here is with dates. It is unlikely that an order was placed in 1901 for example, or – maybe more typically – that an item was delivered before it was ordered. This leads us into the issue of combinations of fields.

Two valid entries may make no sense whatsoever in combination (e.g. a given product may not be sold in a given territory). Business rules around this area can be quite complex. Ideally, fields that limit the values of other fields should appear first. Drop downs on the later fields should then only show values which work in combination with the earlier ones. This speeds user entry as well as hopefully improving accuracy.

However what no system can achieve, no matter how good its validation, is ensuring that what is recorded 100% reflects the actual event. If some information is not always available but important if known, it is difficult to police that it is entered. If ‘x’ will suffice as a textual description of a business event, do not be surprised if it is used. If there is a default for a field, which will pass validation, then it is likely that a significant percentage of records will have this value. At the point of entry, these types of issues can be best addressed by training. This should emphasise to the people using the system what are the most important fields and why they are important.

Some errors and omissions can also be picked up in audit reports, which is the subject of the section 3. below. But valid data in one system can still be mangled before it gets to the next one and I will deal with this issue next.

2. Make sure your interfaces aren’t the problem

In many organisations the IT architecture is much more complex than a simple flow of data from a front-end system to BI and other reporting applications (e.g. Accounts). History often means that modern front-end systems wrap older (often mainframe-based) legacy systems that are too expensive to replace, or too embedded into the fabric of an organisation’s infrastructure. Also, there may be a number of different systems dealing with different parts of a business transaction. In Insurance, an industry in which I have worked for the last 12 years, the chain might look like this:

Simplified schematic of selected systems and interfaces within an Insurance company

Of course two or more of the functions that I have shown separately may be supported in a single, integrated system, but it is not likely that all of them will be. Also the use of “System(s)” in the diagram is telling. It is not atypical for each line of business to have its own suite of systems, or for these to change from region to region. Hopefully the accounting system is an area of consistency, but this is not always the case. Even legacy systems may vary, but one of the reasons that interfaces are maintained to these is that they may be one place where data from disparate front-end systems is collated. I have used Insurance here as an example, but you could draw similar diagrams for companies of a reasonable size in most industries.

There are clearly many problems that can occur in such an architecture and simplifying diagrams like the one above has been an aim of many IT departments in recent years. What I want to focus on here is the potential impact on data quality.

Where (as is typical) there is no corporate lexicon defining the naming and validation of fields across all systems, then the same business data and business events will be recorded differently in different systems. This means that data not only has to be passed between systems but mappings have to be made. Often over time (and probably for reasons that were valid at every step along the way) these mappings can become Byzantine in their complexity. This leads to a lack of transparency between what goes in to one end of the “machine” and what comes out of the other. It also creates multiple vulnerabilities to data being corrupted or meaning being lost along the way.

Let’s consider System A which has direct entry of data and System B which receives data from System A by interface. If the team supporting System A have a “fire and forget” attitude to what happens to their data once it leaves their system, this is a recipe for trouble. Equally if the long-suffering and ill-appreciated members of System B’s team lovingly fix the problems they inherit from System A, then this is not getting to the heart of the problem. Also, if System B people lack knowledge of the type of business events supported in System A and essentially guess how to represent these, then this can be a large issue. Things that can help here include: making sure that there is ownership of data and that problems are addressed at source; trying to increase mutual understanding of systems across different teams; and judicious use of exception reports to check that interfaces are working. This final area is the subject of the next section.

3. Check how the data is entered / interfaced

Exception or audit reports can be a useful tool in picking up on problems with data entry (or indeed interfaces). However, they need to be part of a rigorous process of feedback if they are to lead to improvement (such feedback would go to the people entering data or those building interfaces as is appropriate). If exception reports are simply produced and circulated, it is unlikely that anything much will change. Their content needs to be analysed to identify trends in problems. These in turn need to drive training programmes and systems improvements.

At the end of the day, if problems persist with a particular user (or IT team), then this needs to be taken up with them in a firm manner, supported by their management. Data quality is either important to the organisation, or it is not. There is either a unified approach by all management, or we accept that our data quality will always be poor. In summary, there needs to be a joined-up approach to the policing of data and people need to be made accountable for their own actions in this area.

4. Don’t suppress bad data in your BI

I have spent some time covering the previous three pillars. In my career I have run data quality improvement programmes that have essentially relied solely on these three approaches. While I have generally had success operating in this way, progress has generally been slow and vast reserves of perseverance have been necessary.

More recently, in BI programmes I have led, improvements in data quality have been quicker, easier and almost just a by-product of the BI work itself. Why has this happened?

The key is to always highlight data quality problems in your BI. The desire can be to deliver a flawless BI product and data that is less than pristine can compromise this. However the temptation to sanitise bad data, to exclude it from reports, to incorporate it in an innocuous “all others” line, or to try to guess which category it really should sit in are all to be resisted. As I mention in my LinkedIn.com comments, while this may make your BI system appear less trustworthy (is it built on foundations of sand?) any other approach is guaranteeing that it actually is untrustworthy. If you stop and think about it, the very act of massaging bad source data in a BI system is suppressing the truth. Perhaps is it a lesser crime than doctoring your report and accounts, but it is not far short. You are giving senior management an erroneous impression of what is happening in the company, the precise opposite of what good BI should do.

So the “warts and all” approach is the right one to adopt ethically (if that does not sound too pretentious), but I would argue that it is the right approach practically as well. When data quality issues are evident on the reports and analysis tools that senior management use and when they reduce the value or credibility of these, then there is likely to be greater pressure applied to resolve them. If senior management are deprived of the opportunity to realise that there are problems, how are they meant to focus their attention on resolving them or to lend their public support to remedial efforts?

This is not just a nice theoretical argument. I have seen the quality of data dramatically improve in a matter of weeks when Executives become aware of how it impinges on the information they need to run a business. Of course a prerequisite for this is that senior management places value on the BI they receive and realise the importance of its accuracy. However, if you cannot rely on these two things in your organisation, then your BI project has greater challenges to face that the quality of data it is based upon.