Using BI to drive improvements in data quality


Introduction

It is often argued that good BI is dependent on good data quality. This is essentially a truism which it is hard to dispute. However, in this piece I will argue that it is helpful to look at this the other way round. My assertion is that good BI can have a major impact on driving improvements in data quality.
 
 
Food for thought from LinkedIn.com

Again, this article is inspired by some discussions on a LinkedIn.com group, this time Business Intelligence Professionals (as ever, you need to be a member of both LinkedIn.com and the group to read the discussions). The specific thread asked about how to prove the source and quality of the data that underpins BI and suggested that ISO 8000 could help.

I made what I hope are some pragmatic comments as follows:

My experience is that the best way to implement a data quality programme is to take a warts-and-all approach to your BI delivery. If data errors stand out like a sore thumb on senior management reports, then they tend to get fixed at source. If instead the BI layer massages away all such unpleasantness, then the errors persist. Having an “unknown” or, worse, “all others” category in a report is an abomination. If oranges show up in a report about apples, then this should not be swept under the carpet (apologies for the mixed metaphor).

Of course it is helpful to attack the beast in other ways: training for staff, extra validation in front-end systems, audit reports and so on; but I have found that nothing quite gets data fixed as quickly as the bad entries ending up on something the CEO sees.

Taking a more positive approach: if a report adds value but has obvious data flaws, then it is clear to all that it is worth investing in fixing these. As data quality improves over time, the report becomes more valuable and we have established a virtuous circle. I have seen this happen many times and I think it is a great approach.

and, in response to a follow-up about exception reports:

Exception reports are an important tool, but I was referring to showing up bad data in actual reports (or cubes, or dashboards) read by executives.

So if product X should only really appear in department Y’s results, but has been miscoded, then the erroneous product X entries should still be shown on department Z’s reports, rather than being suppressed in an “all others” or “unknown” line. That way, whatever problem led to their inclusion (user error, a lack of validation in a front-end system, a problem with one or more interfaces, etc.) can be fixed, as opposed to being ignored.

The same comments would apply to missing data (no product code), invalid data (a product code that doesn’t exist) or the lazy person’s approach to data (‘x’ being entered in a descriptive field rather than anything meaningful as the user just wants to get the transaction off their hands).

If someone senior enough wants these problems fixed, they tend to get fixed. If they are kept blissfully unaware, then the problem is perpetuated.

I thought that it was worth trying to lay out more explicitly what I think is the best strategy for improving data quality.
 
 
The four pillars of a data quality improvement programme

I have run a number of programmes specifically targeted at improving data quality that focussed on training and auditing progress. I have also delivered acclaimed BI systems that led to a measurable improvement in data quality. Experience has taught me that there are a number of elements that combine to improve the quality of data:

  1. Improve how the data is entered
  2. Make sure your interfaces aren’t the problem
  3. Check how the data is entered / interfaced
  4. Don’t suppress bad data in your BI

As with any strategy, it is ideal to have the support of all four pillars. However, I have seen greater and quicker improvements through the fourth element than with any of the others. I’ll now touch on each area briefly.

(if you are less interested in my general thoughts on data quality and instead want to cut to the chase, then just click on the link to point 4 above)
 
 
1. Improve how the data is entered

Of course, if there are no problems with how data is entered, then (leaving interface issues to one side) there should be no problems with the information that is generated from it. Problems with data entry can take many forms. Particularly where legacy systems are involved, it can sometimes be harder to get data right than it is to make a mistake. With more modern systems, one would hope that all fields are validated and that many only provide you with valid options (say in a drop down). However, validating each field is only the start: entries that are valid may still be nonsensical. A typical example here is with dates. It is unlikely that an order was placed in 1901, for example, or – maybe more typically – that an item was delivered before it was ordered. This leads us into the issue of combinations of fields.

Two valid entries may make no sense whatsoever in combination (e.g. a given product may not be sold in a given territory). Business rules around this area can be quite complex. Ideally, fields that limit the values of other fields should appear first. Drop downs on the later fields should then only show values which work in combination with the earlier ones. This speeds user entry as well as hopefully improving accuracy.
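
To make this concrete, here is a minimal sketch of the kind of single-field and cross-field validation described above; the field names, territories and product codes are hypothetical and not taken from any particular system.

```python
from datetime import date

# Hypothetical reference data: which products may be sold in which territories
VALID_PRODUCTS_BY_TERRITORY = {
    "UK": {"P100", "P200"},
    "FR": {"P100"},
}

def validate_order(order: dict) -> list:
    """Return a list of validation errors; an empty list means the order passes."""
    errors = []

    # Single-field sanity check: a date can be valid yet implausible
    if order["order_date"].year < 2000:
        errors.append("Order date is implausibly old")

    # Cross-field checks: two individually valid entries may clash in combination
    if order["delivery_date"] < order["order_date"]:
        errors.append("Item delivered before it was ordered")

    allowed = VALID_PRODUCTS_BY_TERRITORY.get(order["territory"], set())
    if order["product_code"] not in allowed:
        errors.append(
            f"Product {order['product_code']} is not sold in {order['territory']}"
        )

    return errors

# Two individually valid fields that fail in combination
print(validate_order({
    "order_date": date(2010, 3, 1),
    "delivery_date": date(2010, 2, 27),
    "product_code": "P200",
    "territory": "FR",
}))
```

The same rules could also drive dependent drop downs in the front end, so that users are only ever offered combinations that will pass.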

However, no system, no matter how good its validation, can ensure that what is recorded reflects the actual event 100%. If some information is not always available, but important if known, it is difficult to police that it is entered. If ‘x’ will suffice as a textual description of a business event, do not be surprised if it is used. If there is a default for a field which will pass validation, then it is likely that a significant percentage of records will have this value. At the point of entry, these types of issues are best addressed by training. This should emphasise to the people using the system which fields are most important and why.

Some errors and omissions can also be picked up in audit reports, which are the subject of section 3 below. But valid data in one system can still be mangled before it gets to the next one, and I will deal with this issue next.
 
 
2. Make sure your interfaces aren’t the problem

In many organisations the IT architecture is much more complex than a simple flow of data from a front-end system to BI and other reporting applications (e.g. Accounts). History often means that modern front-end systems wrap older (often mainframe-based) legacy systems that are too expensive to replace, or too embedded into the fabric of an organisation’s infrastructure. Also, there may be a number of different systems dealing with different parts of a business transaction. In Insurance, an industry in which I have worked for the last 12 years, the chain might look like this:

[Figure: Simplified schematic of selected systems and interfaces within an Insurance company]

Of course two or more of the functions that I have shown separately may be supported in a single, integrated system, but it is not likely that all of them will be. Also the use of “System(s)” in the diagram is telling. It is not atypical for each line of business to have its own suite of systems, or for these to change from region to region. Hopefully the accounting system is an area of consistency, but this is not always the case. Even legacy systems may vary, but one of the reasons that interfaces are maintained to these is that they may be one place where data from disparate front-end systems is collated. I have used Insurance here as an example, but you could draw similar diagrams for companies of a reasonable size in most industries.

There are clearly many problems that can occur in such an architecture and simplifying diagrams like the one above has been an aim of many IT departments in recent years. What I want to focus on here is the potential impact on data quality.

Where (as is typical) there is no corporate lexicon defining the naming and validation of fields across all systems, the same business data and business events will be recorded differently in different systems. This means that data not only has to be passed between systems, but mappings also have to be made between them. Often over time (and probably for reasons that were valid at every step along the way) these mappings can become Byzantine in their complexity. This leads to a lack of transparency between what goes in to one end of the “machine” and what comes out of the other. It also creates multiple vulnerabilities to data being corrupted, or to meaning being lost, along the way.

Let’s consider System A, which has direct entry of data, and System B, which receives data from System A by interface. If the team supporting System A have a “fire and forget” attitude to what happens to their data once it leaves their system, this is a recipe for trouble. Equally, if the long-suffering and ill-appreciated members of System B’s team lovingly fix the problems they inherit from System A, then this is not getting to the heart of the problem. Also, if System B’s people lack knowledge of the types of business events supported in System A and essentially guess how to represent these, then this can be a large issue. Things that can help here include: making sure that there is ownership of data and that problems are addressed at source; trying to increase mutual understanding of systems across different teams; and judicious use of exception reports to check that interfaces are working. This final area is the subject of the next section.
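
To make the mapping side of this concrete, the sketch below shows a hypothetical interface from System A to System B that flags unmapped values rather than guessing; the product codes are invented for illustration.

```python
# Hypothetical mapping from System A's product codes to System B's
A_TO_B_PRODUCT_MAP = {
    "MOTOR-01": "MTR",
    "HOME-01": "HOM",
}

def translate_record(record: dict) -> dict:
    """Map a System A record into System B's terms, flagging anything unmapped."""
    code_b = A_TO_B_PRODUCT_MAP.get(record["product_code"])
    if code_b is None:
        # Do not guess and do not default to "unknown": pass the problem through
        # visibly so it can be fixed at source by the team that owns System A's data.
        return {**record, "product_code_b": "UNMAPPED", "dq_issue": True}
    return {**record, "product_code_b": code_b, "dq_issue": False}

records = [
    {"id": 1, "product_code": "MOTOR-01"},
    {"id": 2, "product_code": "MARINE-01"},  # no mapping agreed between the two teams
]
translated = [translate_record(r) for r in records]
issues = [r for r in translated if r["dq_issue"]]
print(f"{len(issues)} record(s) need attention:", issues)
```

Records flagged in this way then feed naturally into the exception reports described below.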
 
 
3. Check how the data is entered / interfaced

Exception or audit reports can be a useful tool in picking up problems with data entry (or indeed with interfaces). However, they need to be part of a rigorous process of feedback if they are to lead to improvement (such feedback would go to the people entering data, or to those building interfaces, as appropriate). If exception reports are simply produced and circulated, it is unlikely that anything much will change. Their content needs to be analysed to identify trends in problems. These in turn need to drive training programmes and systems improvements.
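
As a minimal sketch of what such a report might count (the field names, placeholder values and sample rows are assumptions for illustration):

```python
from collections import Counter

# Hypothetical placeholder values that pass validation but carry no meaning
PLACEHOLDER_DESCRIPTIONS = {"x", "n/a", "tbc", "."}

def exception_report(rows):
    """Count data quality exceptions by the user who entered them and by issue type,
    so that trends can feed back into training and system improvements."""
    exceptions = Counter()
    for row in rows:
        if not row.get("product_code"):
            exceptions[(row["entered_by"], "missing product code")] += 1
        if row.get("description", "").strip().lower() in PLACEHOLDER_DESCRIPTIONS:
            exceptions[(row["entered_by"], "placeholder description")] += 1
    return exceptions

rows = [
    {"entered_by": "user1", "product_code": "", "description": "Renewal premium"},
    {"entered_by": "user2", "product_code": "MTR", "description": "x"},
]
for (user, issue), count in exception_report(rows).most_common():
    print(f"{user}: {issue} x{count}")
```

The point is less the report itself than the feedback loop: breaking the counts down by issue type and by who entered the data is what allows training and system fixes to be targeted.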

At the end of the day, if problems persist with a particular user (or IT team), then this needs to be taken up with them in a firm manner, supported by their management. Data quality is either important to the organisation, or it is not. There is either a unified approach by all management, or we accept that our data quality will always be poor. In summary, there needs to be a joined-up approach to the policing of data and people need to be made accountable for their own actions in this area.
 
 
4. Don’t suppress bad data in your BI

I have spent some time covering the previous three pillars. In my career I have run data quality improvement programmes that have essentially relied solely on these three approaches. While I have generally had success operating in this way, progress has been slow and vast reserves of perseverance have been necessary.

More recently, in BI programmes I have led, improvements in data quality have been quicker, easier and almost just a by-product of the BI work itself. Why has this happened?

The key is always to highlight data quality problems in your BI. The desire can be to deliver a flawless BI product, and data that is less than pristine can compromise this. However, the temptations to sanitise bad data, to exclude it from reports, to incorporate it in an innocuous “all others” line, or to try to guess which category it should really sit in are all to be resisted. As I mention in my LinkedIn.com comments, while this may make your BI system appear less trustworthy (is it built on foundations of sand?), any other approach guarantees that it actually is untrustworthy. If you stop and think about it, the very act of massaging bad source data in a BI system is suppressing the truth. Perhaps it is a lesser crime than doctoring your report and accounts, but it is not far short. You are giving senior management an erroneous impression of what is happening in the company, the precise opposite of what good BI should do.
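
As a sketch of the difference in practice (the products, department and figures are hypothetical), compare a summary that lumps miscoded rows into an “all others” line with one that leaves them visible:

```python
from collections import defaultdict

# Hypothetical transactions reported under department Z; product "X" should only
# appear under department Y, so its presence here is a data quality problem.
transactions = [
    {"product": "A", "amount": 100.0},
    {"product": "B", "amount": 250.0},
    {"product": "X", "amount": 75.0},  # miscoded
]
VALID_PRODUCTS_FOR_DEPT_Z = {"A", "B"}

def summarise(rows, suppress_bad_data):
    totals = defaultdict(float)
    for row in rows:
        if row["product"] in VALID_PRODUCTS_FOR_DEPT_Z:
            totals[row["product"]] += row["amount"]
        elif suppress_bad_data:
            totals["All others"] += row["amount"]  # the problem vanishes from view
        else:
            totals["MISCODED: " + row["product"]] += row["amount"]  # the problem stays visible
    return dict(totals)

print(summarise(transactions, suppress_bad_data=True))   # sanitised view
print(summarise(transactions, suppress_bad_data=False))  # warts-and-all view
```

In the first version the miscoded entry is invisible to anyone reading the report; in the second it is impossible to miss.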

So the “warts and all” approach is the right one to adopt ethically (if that does not sound too pretentious), but I would argue that it is the right approach practically as well. When data quality issues are evident on the reports and analysis tools that senior management use and when they reduce the value or credibility of these, then there is likely to be greater pressure applied to resolve them. If senior management are deprived of the opportunity to realise that there are problems, how are they meant to focus their attention on resolving them or to lend their public support to remedial efforts?

This is not just a nice theoretical argument. I have seen the quality of data dramatically improve in a matter of weeks when Executives become aware of how it impinges on the information they need to run the business. Of course a prerequisite for this is that senior management place value on the BI they receive and realise the importance of its accuracy. However, if you cannot rely on these two things in your organisation, then your BI project has greater challenges to face than the quality of the data it is based upon.
 

18 thoughts on “Using BI to drive improvements in data quality”

  1. Peter,

    Excellent and pragmatic post that highlights the relationship between information quality and Business Intelligence. I particularly like your view that the problem data shouldn’t be masked in the BI reports as it just hides the data quality problem at the expense of trustworthy reporting.

    One of the key challenges in ‘selling’ a data quality/information quality improvement initiative is getting managers and senior executives to accept that the problem is as significant as it is. Hiding the problem in the BI view of that data simply encourages the shared delusion that everything is alright (because the reports say so).

    As a corollary, I’ve used a business rules based approach to identifying and resolving data quality issues in a telco. One problem we have had is a business area producing reports suggesting there is a big compliance problem – but only looking at half the data needed to reflect the business process being looked at.

    Ensuring that your BI reports are accurate reflections of the business context that is being examined is an important step to ensuring the quality of your BI and the quality of your data.

  2. Peter,

    You made some excellent points. I agree that bad data shouldn’t be “hidden” by cleansing or some other means. I also agree that it is prudent to audit bad data issues. However, I’m not sure that it would be prudent to allow bad data to be saved in a production OLAP system. My concern is that it would be very difficult to address and/or correct bad data at such a point. Certainly, all of the relevant stakeholders should be informed of any data quality issues. But I don’t think that such issues should be revealed after data has already been loaded into production.

  3. Hi Duane,

    Thanks for the comment. I guess we disagree on this issue. Maybe something that I should have made clear here is my assumption that the data warehouse gets refreshed when problems are sorted out in source systems. This is reliant on either a regular rebuild, or on such fixes appearing in deltas (my experience has been that they generally do). So your live data issue disappears soon after the problem is addressed.

    There is the issue of very old data, which you might want to hold for comparison purposes, but which is unlikely to ever get tidied up due to the amount of effort required. This is the only circumstance in which a degree of massaging might be appropriate in your BI system.

  4. Peter,

    From an ETL standpoint, addressing bad data that has already been loaded into a data warehouse could be very daunting. I think of it as attempting to “unscramble an egg” because records may already be aggregated, merged and/or joined. Thus, the only sure-fire way to correct the data warehouse would be to reload the entire thing from scratch. For a very large data warehouse (which isn’t uncommon), embarking upon such a task could be prohibitively time consuming and/or expensive.

  5. I can appreciate that in some designs this would be a problem, but it’s not one I have experienced. If the changed data comes across as a delta, running it through the same ETL cascades should result in the old data being backed out and the new data reinstated, even if there is aggregation. The warehouse that I was most recently accountable for had over 60 steps in some areas of ETL; when bad data was fixed, it flowed quite happily through to the live reports and analysis apps.
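
    As a toy sketch of the back-out-and-reinstate pattern I am describing (the keys and measures are invented and a real warehouse would involve far more steps):

    ```python
    # Current warehouse state: fact rows keyed by transaction id, plus a simple aggregate
    facts = {1: {"product": "UNKNOWN", "amount": 100.0}}
    totals = {"UNKNOWN": 100.0}

    def apply_delta(txn_id, new_row):
        """Back out the old version of a fact row and reinstate the corrected one,
        keeping the aggregate consistent without rebuilding the warehouse."""
        old_row = facts.get(txn_id)
        if old_row is not None:
            totals[old_row["product"]] -= old_row["amount"]
            if totals[old_row["product"]] == 0:
                del totals[old_row["product"]]
        facts[txn_id] = new_row
        totals[new_row["product"]] = totals.get(new_row["product"], 0.0) + new_row["amount"]

    # The source-system fix arrives as a delta: the transaction now has a real product code
    apply_delta(1, {"product": "MTR", "amount": 100.0})
    print(totals)  # {'MTR': 100.0} - the bad entry has flowed out of the live figures
    ```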

  6. Peter – a very insightful and innovative post.

    I’m sure the point of allowing poor DQ to flow through to execs will attract a lot of polarised views but I for one certainly agree with you.

    Actually, the way I’ve approached this in the past is to integrate DQ metrics into the report or intelligence dashboard. The business instantly then has a flag or metric to indicate the quality of information they may be basing a decision on.

    I think your point about trust is spot on. If you’re hiding the problems of upstream DQ then no-one benefits, least of all the execs. Transparency is vital.

    Senior management are perfectly placed to help you resolve turf-wars so any approach that gets them motivated and engaged should definitely be encouraged.

    Great post.

  7. Dylan,

    Thanks for your comments. I’ve used colour-coding before in cubes to show how much reliance can be placed on aggregate figures, but this has been more to do with where extrapolations have been made based on data with a high standard deviation. Sounds like the approach might work well with data quality too, as you suggest.

  8. Providing insight into data quality issues in BI can be touchy if not done appropriately. Depending on the environment, poor quality can undermine trust in processes and departments. I’ve lived through this. While it certainly gets attention, how you expose issues is as important as what is exposed. Linking to business outcome is the key. However, many dashboards today don’t really show business outcomes but rather metrics. The other aspect is that sometimes exposure of the data quality issue is irrelevant to the report. Where you provide the insight may count more.

  9. I think it helps if what you are highlighting is a glaring error in an otherwise well-behaved and well-reconciled set of data. Of course if everything is out then you have a major problem.

  10. […] A number of things can impact accuracy, including, but not limited to: how data has been entered into systems; how that data is transformed by interfaces; differences between terminology and calculation methods in different data sources; misunderstandings by IT people about the meaning of business data; errors in the extract transform and load logic that builds BI solutions; and sometimes even the decisions about how information is portrayed in BI tools themselves. I cover some of these in my previous piece Using BI to drive improvements in data quality. […]

  11. I love this article and we have actually had reports where we have included bad data in the past. The problem is that I work in a matrix organization where my direct reports are not my customers. I complain to them that we need more of a focus on data quality and all I get is lip service. For my customers (or dotted lines), they fully support my data quality initiatives, but can’t give me any additional resources to perform the work.

    • Thanks for taking the time to post – it is tough where accountability for data quality is separate from accountability for reporting and there is no over-arching group that can take an active role.

      Peter

