# More Definitions in the Data and Analytics Dictionary

The peterjamesthomas.com Data and Analytics Dictionary is an active document and I will continue to issue revised versions of it periodically. Here are 20 new definitions, including the first from other contributors (thanks Tenny!):

Remember that The Dictionary is a free resource and quoting contents (ideally with acknowledgement) and linking to its entries (via the buttons provided) are both encouraged.

People are now also welcome to contribute their own definitions. You can use the comments section here, or the dedicated form. Submissions will be subject to editorial review and are not guaranteed to be accepted.

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

# The Latest from the Maths & Science Section

This site has always had a strong flavour of both Mathematics and Science [1]; sometimes overt as in How to be Surprisingly Popular, Toast, Data Visualisation – A Scientific Treatment, Hurricanes and Data Visualisation: Part II – Map Reading or Patterns Patterns Everywhere; sometimes implicit as in Analogies or A Single Version of the Truth. These articles typically blend Business and Scientific or Mathematical ideas, seeking to find learnings from one applicable to the other. More recently, I have started writing some entirely Maths or Science-focused articles, starting with my book about Group Theory and Particle Physics, Glimpses of Symmetry [2]. These are collected in the Maths & Science Section, which can be accessed from the site menu; it’s right next to the Data & Analytics Dictionary as per:

Recent trends have begun to drive a confluence between Mathematics and aspects of the data arena, notably to do with Data Science and Artificial Intelligence approaches like Machine Learning [3]. For this reason, I will periodically share selected articles from the Maths & Science Section here on the main site. I hope that they are of interest to at least some regular readers. I’ll kick this off with three, interconnecting articles about one of the most important numbers in Mathematics, $e$, one of the most misunderstood, $\pi$, and finally their beautiful relationship to each other [4].

 1 Euler’s Number A long and winding road with the destination being, e, probably the most important number in Mathematics. 2 The Irrational Ratio The number π is surrounded by a fog of misunderstanding and even mysticism. This article seeks to address some common misconceptions about π, to show that in many ways it is just like any other number, but also to demonstrate some of its less common properties. 3 The Equation Deriving (and demystifying) one of the most famous equations in Mathematics.

I will occasionally share further content from the Maths & Science Section, particularly when there is some linkage to data matters or the other subjects that I cover in the main site.

Notes

 [1] No doubt related to its author’s background and indeed that of his wife. [2] Glimpses of Symmetry is about 75% finished at the time of writing. [3] A cornerstone of Machine Learning is of course Linear Algebra, a topic running throughout Glimpses of Symmetry from Chapter 6 onwards. [4] As may be obvious from elements of their content, these articles are presented out of chronological order, but the overall flow makes more sense this way round.

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

# Data Science Challenges – It’s Deja Vu all over again!

The rather famous tautology, “It’s déjà vu all over again”, has of course been ascribed to that darling of malapropisms, baseball catcher Yogi Berra [1]. The phrase came to mind for me today when coming across the following exhibit:

© Business Over Broadway (2018). Based on Kaggle’s State of Data Science Survey 2017 (Sample size: 10,153).

The text in the above exhibit is not that clear [2], so here are the 20 top challenges [3] faced by those running Data Science teams in human-readable form:

 # Challenge Cited by 1 Dirty Data 35.9% 2 Lack of Data Science talent in the organization 30.2% 3 Company politics / Lack of management/financial support for a Data Science team 27.0% 4 The lack of a clear question to be answering or a clear direction to go with available data 22.1% 5 Unavailability of/difficult access to data 22.0% 6 Data Science results not used by business decision makers 17.7% 7 Explaining Data Science to others 16.0% 8 Privacy issues 14.4% 9 Lack of significant domain expert input 14.2% 10 Organization is small and cannot afford a Data Science team 13.0% 11 Team using multiple ad hoc development environments such as Python/R/Java etc. 12.7% 12 Limitations of tools 12.0% 13 Need to coordinate with IT 11.8% 14 Maintaining responsible expectations about the potential impact of Data Science projects 11.5% 15 Inability to integrate findings into organization’s decision-making process 9.8% 16 Lack of funds to buy useful datasets from external sources 9.6% 17 Difficulties in deployment/scoring 8.6% 18 Scaling Data Science solution up to full database 8.4% 19 Limitations in the state of the art in machine learning 7.7% 20 Did not instrument data useful for scientific analysis and decision-making 6.5% 21 I prefer not to say 4.8% 22 Other 2.9%

The table above is a transcription of a transcription, so it would be remarkable if no Data Quality issues had crept in, however let’s assume that the figures are robust enough for our purposes. Of course the people surveyed will have reported multiple issues, so the percentages above are not additive. Nevertheless there are some very obvious comments to be made (some of the above items are pertinent to more than one of the points I would like to make):

• Data Quality / Availability remain major issues(1, 5, and 8)

It is indeed true that Machine Learning can be quite good at dealing with some types or bad or missing data. But no technology or approach is going to be able to paper over all of the cracks if you data is essentially incomplete and of poor quality. This point (together with some others below) speaks to the need to not approach Data Science on a stand-alone basis, but as part of a more holistic approach to data matters [4].

• The Human angle and a focus on Culture are imperative(3, 6, 7, 14 and 15)

Findings are one thing; using these to take action is quite another. At the end of the day, most ventures are successful or fail because of people; the people conducting the venture, the people receiving its intended benefits and so on. Ignore this dimension of data work (or any type of work) at your peril [5].

While in some circumstances the data can indeed “speak for itself”, it makes a lot more sense for Data Scientists to partner with business colleagues to both get direction and to help ensure that their findings lead to action [6].

• Tools & Technology typically Trumped(11, 12 and 18)

These first appear outside of the Top 10 (and 11 is a bit dubious to include here – it relates more to a proliferation of tools than to issues with any of them). I would never say that tools and technology are unimportant, but they are typically much less important than other considerations [7].

The overriding point is of course that – much as I noted out recently in Convergent Evolution – there is little new under the Sun. A survey of Business Intelligence / Data Warehousing professionals back in 2010 would have generated something very like the list above. A survey of EIS [8] professionals back in 2000 would have done the same.

The important things to do – regardless of the technologies and approaches employed – are to:

1. Understand what questions are key to the running of an organisation [9]

2. Determine what data is available to support decisions in these key areas

3. Ensure that the data is in a “good enough” state, appropriately consolidated / made consistent, augmented / corrected by any useful external data and made available to the right people in a timely manner

4. Focus on the human aspects of acting on what data is telling us and how to use data outputs to drive positive actions

Here too, little is new under the Sun. I have been referring to essentially these same four pillars of good practice since the mid 2000s. Some of our technological advances since then have been amazing. The prospect of leveraging the power of both Data Science and Artificial Intelligence in a business context is very exciting. But to truly succeed with these newer approaches, it helps to recall the eternal verities that have always underpinned good data-centric work [10]. The survey above makes this point crystal clear.

A final corollary to this observation is something I covered in A truth universally acknowledged…. The replies to the Kaggle survey highlight the fact that, much like the conductor of an orchestra does not need to be able to play the violin to a virtuoso level, people leading Data Science teams (and broader Data Functions) need a set of rounded skills, ones honed to address the types of issues appearing in the exhibits above. The skill-set that makes for an excellent Data Scientist does not necessarily help so much with many of the less technical issues that will determine the success or failure of Data Science teams.

Notes

 [1] Other Yogi-isms included, “Always go to other people’s funerals; otherwise they won’t go to yours”, “You can observe a lot by watching” and “If you can’t imitate him, don’t copy him”. [2] A Data Visualisation challenge to include that much text I realise. I think I might have been tempted to come up with pithier categories to aid legibility. [3] Ignoring “I prefer not to say” and “Other”. [4] As laid out in my many articles about the importance of Cultural Transformation. [5] See: Building Momentum – How to begin becoming a Data-driven Organisation. [6] I make precisely this point in my recent interview for Venturi Voice (starting just after 31:38). [7] I make this point most forcibly back in: A bad workman blames his [Business Intelligence] tools. The technology may be different, but the learnings are just as relevant today. [8] Executive Information Systems for those of tender years. [9] Machine learning techniques can clearly help here, but only if in concert with dialogue with people actually on the front-line and leading business areas. [10] In your search for such eternal verities, you could do much worse than starting with: 20 Risks that Beset Data Programmes.

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

# In-depth with CDO Christopher Bannocks

Part of the In-depth series of interviews

 Today I am talking to Christopher Bannocks, who is Group Chief Data Officer at ING. ING is a leading global financial institution, headquartered in the Netherlands. As stressed in other recent In-depth interviews [1], data is a critical asset in banking and related activities, so Christopher’s role is a pivotal one. I’m very glad that he has been able to find time in his busy calendar to speak to us.
 Hello Christopher, can you start by providing readers with a flavour of your career to date and perhaps also explain why you came to focus on the data arena. Sure, it’s probably right to say I didn’t start out here, data was not my original choice, and for anyone of a similar age to me, data wasn’t a choice, when I started out, in that respect it’s a “new segment”. I started out on a management development programme in a retail bank in the UK, after which I moved to be an operations manager in investment banking. As part of that time in my career, post Euro migration and Y2K (yes I am genuinely that old, I also remember Vinyl records and Betamax video!) [2] I was asked to help solve the data problem. What I recognised very quickly was this was an area with under-investment, that was totally central the focus of that time – STP (Straight Through Processing). Equally it provided me with much broader perspectives, connections to all parts of the organisation that I previously didn’t have and it was at that point, some 20 years ago, that I decided this was the thing for me! I have since run and driven transformation in Reference Data, Master Data, KYC [3], Customer Data, Data Warehousing and more recently Data Lakes and Analytics, constantly building experience and capability in the Data Governance, Quality and data services domains, both inside banks, as a consultant and as a vendor.
 I am trying to get a picture of the role and responsibilities of the typical CDO (not that there appears to be such a thing), so would you mind touching on the span of your work at ING? I know you have a strong background in Enterprise Data Management, how does the CDO role differ from this area? I guess that depends on how you determine the scope of Enterprise Data Management. However, in reality, the CDO role encompasses Enterprise Data Management, although generally speaking the EDM role includes responsibility for the day to day operations of the collection processes, which in my current role I don’t have. I have accountability for the governance and quality through those processes and for making the data available for downstream consumers, like Analytics, Risk, Finance and HR. My role encompasses being the business driver for the data platform that we are rolling out across the organisation and its success in terms of the data going onto the platform and the curation of that data in a governed state, depending on the consumer requirements. My role today boils down to 4 key objectives – data availability, data transparency, data quality and data control.
 I know that ING consists of many operating areas and has adopted a federated structure with respect to data matters. What are the strengths of this approach and how does it work on a day-to-day basis? This approach ensures that the CDO role (I have a number of CDOs functionally reporting to me) remains close to the business and the local entity it supports, it ensures that my management team is directly connected to the needs of the business locally, and that the local businesses have a direct connection to the global strategy. What I would say is that there is no “one size fits all” approach to the CDO organisation model. It depends on the company culture and structure and it needs to fit with the stated objectives of the role as designed. On a day to day basis, we are aligned with the business units and the functional units so we have CDOs in all of these areas. Additionally I have a direct set of reports who drive the standard solutions around tooling, governance, quality, data protection, Data Ethics, Metadata and data glossary and models.
 Helping organisations become “data-centric” is a key part of what you do. I often use this phrase myself; but was recently challenged to elucidate its meaning. What does a “data-centric” organisation look like to you? What sort of value does data-centricity release in your experience? Data centric is a cultural shift, in the structures of the past where we have technology people and process, we now have data that touches all three. You know if you have reached the right place when data becomes part of the decision making process across the organisation, when decisions are only made when data is presented to support it and this is of the requisite quality. This doesn’t mean all decisions require data, some decisions don’t have data and that’s where leaderships decisions can be made, but for those decisions that have good data to support them, these can be made easily and at a lower level in the organisation. Hence becoming data centric supports an agile organisation and servant / leadership principles, utilising data makes decisions faster and outcomes better.
 I am on record multiple times [4] stating that technology choices are much less important than other aspects of data work. However, it is hard to ignore the impact that Big Data and related technologies have had. A few years into the cycle of Big Data adoption, do you see the tools and approaches yielding the expected benefits? Should I revisit my technology-agnostic stance? I have also been on record multiple times saying that every data problem is a people problem in disguise. I still hold that this is true today although potentially this is changing. The problems of the past and still to this day originate with poor data stewardship, I saw it happening in front of my eyes last week in Heathrow when I purchased something in a well known electronics store. Because I have an overseas postcode the guy at the checkout put dummy data into all the fields to get through the process quickly and not impact my customer experience, I desperately wanted to stop him but also wanted to catch my plane. This is where the process efficiency impacts good data collection. If the software that supports the process isn’t flexible, the issue won’t be fixed without technology intervention, this is often true in data quality problems which have knock on effects to customers, which at the end of the day are why we are all here. This is a people problem (because who is taking responsibility here for fixing it, or educating that guy at the checkout) AND it’s a technology problem, caused by inflexible or badly implemented systems. However, in the future, with more focus on customer driven checkout, digital channels and better customer experience, better interface driven data controls and robotics and AI, it may become further nuanced. People are still involved, communication remains critical but we cannot ignore technology in the digital age. For a long time, data groups have struggled with getting access to good tools and technology, now this technology domain is growing daily, and the tools are improving all the time. What we can do now with data at a significantly lower cost than ever before is amazing, and continues to improve all the time. Hence ignoring technology can be costly when extending capabilities to your stakeholders and could be a serious mistake, however focusing only on technology and ignoring people, process, communication etc is also a serious mistake. Data Leaders have to be multi-disciplinary today, and be able to keep up with the pace of change.
 I have heard you talk about “data platforms”, what do you mean by this and how do these contrast with another perennial theme, that of data democratisation? How does a “data platform” relate to – say – Data Science teams? Data democratisation is enabled by the data platform. The data platform is the technology enablement of the four pillars I mentioned before, availability, transparency, quality and control. The platform is a collection of technologies that standardise the approach and access to well governed data across the organisation. Data Democratisation is simply making data available and abstracting away from siloed storage mechanisms, but the platform wraps the implementation of quality, controls and structure to the way that happens. Data Science teams then get the data they need, including data curation services to find the data they need quickly, for governed and structured data, Data Science teams can utilise the glossary to identify what they need and understand the level of quality based on consumer views, they also have access to metadata in standard forms. This empowers the analytics capability to move faster, spend less time on data discovery and curation, structure and quality and more time on building analytics.
 I mentioned the federated CDO team at ING above and assume this is reflected in the rest of the organisation structure. ING also has customers in 40 countries and I know first-hand that a global footprint adds complexity. What are the challenges in being a CDO in such an environment? Does this put a higher premium on influencing skills for example? I am not sure it puts a higher premium on influencing skills, these have a high premium in any CDO role, even if you don’t have a federated structure, the reality is if you are in a data role you have more stakeholders than anyone else in the company, so influencing skills remain premium. A global footprint means complexity for sure, it means differences in a world where you are trying to standardise and it means you have to be tuned in to cultural differences and boundaries. It also means a great deal of variety, opportunities to learn new cultures and approaches, it means you have to listen and understand and flex your style and it means pragmatism plays an important part in your decision making process. At ING we have an amazing team of people who collaborate in a way I have never experienced before, supported by a strong attachment and commitment to the success of the business and our customers. This makes dealing with the complexity a team effort, with great energy and a fantastic working environment. In an organisation without the drive and passion we have here it would present challenges, with the support of the board and being a core part of the overall strategy, it ensures broad alignment to the goal, which makes the challenge easier for the organisation to solve, not easy, but easier and more fun.
 Building on the last point, every CDO I have interviewed has stressed the importance of relationships; something that chimes with my own experience. How do you go about building strong relationships and maintaining them when inevitable differences of opinion or clashes in interests arise? I touched on this a little earlier. Pragmatism over purism. I see purist everywhere in data, with views that are so rigid that the execution of them is doomed because purism doesn’t build relationships. Relationships are built based on what you bring and give up, on what you can give, not on what you can get. I try every day to achieve this, but I am human too, so I don’t always get it right, I hope I get it right more than I get it wrong and where I get it wrong I hope I can be forgiven for my intention is pure. We owe it to our customers to work together for their benefit, where we have differences the customer outcomes should drive our decisions, in that we have a common goal. Disagreements can be helped and supported by identifying a common goal, this starts to align people behind a common outcome. Individual interests can be put aside in preference of the customer interest.
 I know that you are very interested in data ethics and feel that this is an important area for CDOs to consider. Can you tell the readers a bit more about data ethics and why they should be central to an organisation’s approach to data? In an increasingly digital world, the use of data is becoming widespread and the pace at which it is used is increasing daily, our compute power grows exponentially as does the availability of data. Given this, we need an ethical framework to help us make good decisions with our customers and stakeholders in mind. How do you ensure that decisions in your organisation about how you use data are ethical? What are ethical decisions in your organisation and what are the guiding principles? If this isn’t clear and communicated to help all staff make good decisions, or have good discussions there is a real danger that decisions may not be properly socialised before all angles are considered. Just meeting the bar of privacy regulation may not be enough, you can still meet that bar and do things that your customers may disagree with of find “creepy” so the correct thought needs to be applied and the organisation engaged to ensure the correct conversations take place, and there is a place to go to discuss ethics. I am not saying that there is a silver bullet to solve this problem, but the conversation and the ability to have the conversation in a structured way helps the organisation understand its approach and make good decisions in this respect. That’s why CDOs should consider this an important part of the role and a critical engagement with users of data across the organisation.
 Finally, I have worked for businesses with a presence in the Netherlands on a number of occasions. As a Brit living abroad, how have you found Amsterdam. What – if any – adaptations have you had to make to your style to thrive in a somewhat different culture? Having lived in India, I thought my move to the Netherlands could only be easy. I arrived thinking that a 45 minute flight could not possibly provide as many challenges as an 11 hour flight, especially from a cultural perspective. Of course I was wrong because any move to a different culture provides challenges you could never have expected and it’s the small adjustments that take you by surprise the most. It’s always a hugely enjoyable learning experience though. London is a more top down culture whereas in the Netherlands it’s a much flatter approach, my experience here is positive although it does require an adjustment. I work in Amsterdam but live in a small village, chosen deliberately to integrate faster. It’s harder, more of a challenge but helps you understand the culture as you make friends with local people and get closer to the culture. My wife and I have never been a fan of the expat scene, we prefer to integrate, however more difficult this feels at first, it’s worth it in the long run. I must admit though that I haven’t conquered the language yet, it’s a real work in progress!
 Christopher, I really enjoyed our chat, which I believe will also be of great interest to readers. Thank you.

Christoper Bannocks can be reached at via his LinkedIn profile.

Disclosure: At the time of publication, neither peterjamesthomas.com Ltd. nor any of its Directors had any shared commercial interests with Christopher Bannocks, ING or any entities associated with either of these.

 If you are a Chief Data Officer, a Chief Analytics Officer, a Director of Data, or hold some other “Top Data Job” and would like to share your thoughts with the readers of this site in an interview like this one, please get in contact.

Notes

 [1] Specifically: An in-depth interview with experienced Chief Data Officer Roberto Maranca – Previously of Lloyds Banking Group   In-depth with CDO Jo Coutuer – Of BNP Paribas Fortis [2] So does the interviewer. [3] Know your customer. [4] Most directly in: A bad workman blames his [Business Intelligence] tools

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

# Latest Interviews / Podcasts

The interviews that I conduct with leaders in their fields as part of my “In-depth” series have hopefully brought a new and interesting aspect to this site. However, often the boot is on the other foot and I am the person being interviewed about my experience and expertise in the data field and related matters [1]. Maybe interviewing other people helps me when I am in turn interviewed, maybe it’s the other way round. Whatever the case, I enjoyed recording the two conversations appearing below (thanks to the interviewers in both cases) and hope that the content is of interest to readers.

In both instances a link to the site originally publishing the interview is followed by a locally hosted version of the audio track and then a download option. I’d encourage readers to explore the other excellent interviews contained on both sites.

Enterprise Management 360° Podcast – 31st July 2018

Venturi Voice 3650° Podcast – 22nd April 2018

If you would like to interview me for your site or periodical, of if you are just interested in further exploring some of the themes I discuss in these two interviews, then please feel free to get in contact.

Notes

 [1] A list of other video interviews and podcasts I have taken part in can be viewed in the Media section of this site.

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

# As Nice as Pie

Work by the inimitable Randall Munroe, author of long-running web-comic, xkcd.com, has been featured (with permission) multiple times on these pages [1]. The above image got me thinking that I had not penned a data visualisation article since the series starting with Hurricanes and Data Visualisation: Part I – Rainbow’s Gravity nearly a year ago. Randall’s perspective led me to consider that staple of PowerPoint presentations, the humble and much-maligned Pie Chart.

While the history is not certain, most authorities credit the pioneer of graphical statistics, William Playfair, with creating this icon, which appeared in his Statistical Breviary, first published in 1801 [2]. Later Florence Nightingale (a statistician in case you were unaware) popularised Pie Charts. Indeed a Pie Chart variant (called a Polar Chart) that Nightingale compiled appears at the beginning of my article Data Visualisation – A Scientific Treatment.

I can’t imagine any reader has managed to avoid seeing a Pie Chart before reading this article. But, just in case, here is one (Since writing Rainbow’s Gravity – see above for a link – I have tried to avoid a rainbow palette in visualisations, hence the monochromatic exhibit):

The above image is a representation of the following dataset:

 Label Count A 4,500 B 3,000 C 3,000 D 3,000 E 4,500 Total 18,000

The Pie Chart consists of a circle divided in to five sectors, each is labelled A through E. The basic idea is of course that the amount of the circle taken up by each sector is proportional to the count of items associated with each category, A through E. What is meant by the innocent “amount of the circle” here? The easiest way to look at this is that going all the way round a circle consumes 360°. If we consider our data set, the total count is 18,000, which will equate to 360°. The count for A is 4,500 and we need to consider what fraction of 18,000 this represents and then apply this to 360°:

$\dfrac{4,500}{18,000}\times 360^o=\dfrac{1}{4}\times 360^o=90^o$

So A must take up 90°, or equivalently one quarter of the total circle. Similarly for B:

$\dfrac{3,000}{18,000}\times 360^o=\dfrac{1}{6}\times 360^o=60^o$

Or one sixth of the circle.

If we take this approach then – of course – the sum of all of the sectors must equal the whole circle and neither more nor less than this (pace Randall). In our example:

 Label Degrees A 90° B 60° C 60° D 60° E 90° Total 360°

So far, so simple. Now let’s consider a second data-set as follows:

 Label Count A 9,480,301 B 6,320,201 C 6,320,200 D 6,320,201 E 9,480,301 Total 37,921,204

What does its Pie Chart look like? Well it’s actually rather familiar, it looks like this:

This observation stresses something important about Pie Charts. They show how a number of categories contribute to a whole figure, but they only show relative figures (percentages of the whole if you like) and not the absolute figures. The totals in our two data-sets differ by a factor of over 2,100 times, but their Pie Charts are identical. We will come back to this point again later on.

Pie Charts have somewhat fallen into disrepute over the years. Some of this is to do with their ubiquity, but there is also at least one more substantial criticism. This is that the human eye is bad at comparing angles, particularly if they are not aligned to some reference point, e.g. a vertical. To see this consider the two Pie Charts below (please note that these represent a different data set from above – for starters, there are only four categories plotted as opposed to five earlier on):

The details of the underlying numbers don’t actually matter that much, but let’s say that the left-hand Pie Chart represents annual sales in 2016, broken down by four product lines. The right-hand chart has the same breakdown, but for 2017. This provides some context to our discussions.

Suppose what is of interest is how the sales for each product line in the 2016 chart compare to their counterparts in the right-hand one; e.g. A and A’, B and B’ and so on. Well for the As, we have the helpful fact that they both start from a vertical line and then swing down and round, initially rightwards. This can be used to gauge that A’ is a bit bigger than A. What about B and B’? Well they start in different places and end in different places, looking carefully, we can see that B’ is bigger than B. C and C’ are pretty easy, C is a lot bigger. Then we come to D and D’, I find this one a bit tricky, but we can eventually hazard a guess that they are pretty much the same.

So we can compare Pie Charts and talk about how sales change between two years, what’s the problem? The issue is that it takes some time and effort to reach even these basic conclusions. How about instead of working out which is bigger, A or A’, I ask the reader to guess by what percentage A’ is bigger. This is not trivial to do based on just the charts.

If we really want to look at year-on-year growth, we would prefer that the answer leaps off the page; after all, isn’t that the whole point of visualisations rather than tables of numbers? What if we focus on just the right-hand diagram? Can you say with certainty which is bigger, A or C, B or D? You can work to an answer, but it takes longer than should really be the case for a graphical exhibit.

 Aside: There is a further point to be made here and it relates to what we said Pie Charts show earlier in this piece. What we have in our two Pie Charts above is the make-up of a whole number (in the example we have been working through, this is total annual sales) by categories (product lines). These are percentages and what we have been doing above is to compare the fact that A made up 30% of the total sales in 2016 and 33% in 2017. What we cannot say based on just the above exhibits is how actual sales changed. The total sales may have gone up or down, the Pie Chat does not tell us this, it just deals in how the make-up of total sales has shifted. Some people try to address this shortcoming, which can result in exhibits such as: Here some attempt has been made to show the growth in the absolute value of sales year on year. The left-hand Pie Chart is smaller and so we assume that annual sales have increased between 2016 and 2017. The most logical thing to do would be to have the change in total area of the two Pie Charts to be in proportion to the change in sales between the two years (in this case – based on the underlying data – 2017 sales are 69% bigger than 2016 sales). However, such an approach, while adding information, makes the task of comparing sectors from year to year even harder.

The general argument is that Nested Bar Charts are better for the type of scenario I have presented and the types of questions I asked above. Looking at the same annual sales data this way we could generate the following graph:

 Aside: While Bar Charts are often used to show absolute values, what we have above is the same “percentage of the whole” data that was shown in the Pie Charts. We have already covered the relative / absolute issue inherent in Pie Charts, from now on, each new chart will be like a Pie Chart inasmuch as it will contain relative (percentage of the whole) data, not absolute. Indeed you could think about generating the bar graph above by moving the Pie Chart sectors around and squishing them into new shapes, while preserving their area.

The Bar Chart makes the yearly comparisons a breeze and it is also pretty easy to take a stab at percentage differences. For example B’ looks about a fifth bigger than B (it’s actually 17.5% bigger) [3]. However, what I think gets lost here is a sense of the make-up of the elements of the two sets. We can see that A is the biggest value in the first year and A’ in the second, but it is harder to gauge what percentage of the overall both A and A’ represent.

To do this better, we could move to a Stacked Bar Chart as follows (again with the same sales data):

 Aside: Once more, we are dealing with how proportions have changed – to put it simply the height of both “skyscrapers” is the same. If we instead shifted to absolute values, then our exhibit might look more like:

The observant reader will note that I have also added dashed lines linking the same category for each year. These help to show growth. Regardless of what angle to the horizontal the lower line for a category makes, if it and the upper category line diverge (as for B and B’), then the category is growing; if they converge (as for C and C’), the category is shrinking [4]. Parallel lines indicate a steady state. Using this approach, we can get a better sense of the relative size of categories in the two years.

However, here – despite the dashed lines – we lose at least some of of the year-on-year comparative power of the Nested Bar Chart above. In turn the Nested Bar Chart loses some of the attributes of the original Pie Chart. In truth, there is no single chart which fits all purposes. Trying to find one is analogous to trying to find a planar projection of a sphere that preserves angles, distances and areas [5].

Rather than finding the Philosopher’s Stone [6] of an all-purpose chart, the challenge for those engaged in data visualisation is to anticipate the central purpose of an exhibit and to choose a chart type that best resonates with this. Sometimes, the Pie Chart can be just what is required, as I found myself in my article, A Tale of Two [Brexit] Data Visualisations, which closed with the following image:

Or, to put it another way:

You may very well be well bred
But there’s always some special case, time or place
To replace perfect taste

For instance…

Never cry ’bout a Chart of Pie
You can still do fine with a Chart of Pie
People may well laugh at this humble graph
But it can be just the thing you need to help the staff

Never cry ’bout a Chart of Pie
Though without due care things can go awry
Bars are fine, Columns shine
Boxes fly, but never cry about a Chart of Pie

With apologies to the Disney Corporation!

It was pointed out to me by Adam Carless that I had omitted the following thing of beauty from my Pie Chart menagerie. How could I have forgotten?

It is claimed that some Theoretical Physicists (and most Higher Dimensional Geometers) can visualise in four dimensions. Perhaps this facility would be of some use in discerning meaning from the above exhibit.

Notes

 [1] Including: [2] Playfair also most likely was the first to introduce line, area and bar charts. [3] Recall again we are comparing percentages, so 50% is 25% bigger than 40%. [4] This assertion would not hold for absolute values, or rather parallel lines would indicate that the absolute value of sales (not the relative one) had stayed constant across the two years. [5] A little-known Mathematician, going by the name of Gauss, had something to say about this back in 1828 – Disquisitiones generales circa superficies curvas. I hope you read Latin. [6] No, not that one!.

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

# Convergent Evolution

No this article has not escaped from my Maths & Science section, it is actually about data matters. But first of all, channeling Jennifer Aniston [1], “here comes the Science bit – concentrate”.

Shared Shapes

The Theory of Common Descent holds that any two organisms, extant or extinct, will have a common ancestor if you roll the clock back far enough. For example, each of fish, amphibians, reptiles and mammals had a common ancestor over 500 million years ago. As shown below, the current organism which is most like this common ancestor is the Lancelet [2].

To bring things closer to home, each of the Great Apes (Orangutans, Gorillas, Chimpanzees, Bonobos and Humans) had a common ancestor around 13 million years ago.

So far so simple. As one would expect, animals sharing a recent common ancestor would share many attributes with both it and each other.

Convergent Evolution refers to something else. It describes where two organisms independently evolve very similar attributes that were not features of their most recent common ancestor. Thus these features are not inherited, instead evolutionary pressure has led to the same attributes developing twice. An example is probably simpler to understand.

The image at the start of this article is of an Ichthyosaur (top) and Dolphin. It is striking how similar their body shapes are. They also share other characteristics such as live birth of young, tail first. The last Ichthyosaur died around 66 million years ago alongside many other archosaurs, notably the Dinosaurs [3]. Dolphins are happily still with us, but the first toothed whale (not a Dolphin, but probably an ancestor of them) appeared around 30 million years ago. The ancestors of the modern Bottlenose Dolphins appeared a mere 5 million years ago. Thus there is tremendous gap of time between the last Ichthyosaur and the proto-Dolphins. Ichthyosaurs are reptiles, they were covered in small scales [4]. Dolphins are mammals and covered in skin not massively different to our own. The most recent common ancestor of Ichthyosaurs and Dolphins probably lived around quarter of a billion years ago and looked like neither of them. So the shape and other attributes of Ichthyosaurs do not come from a common ancestor, they have developed independently (and millions of years apart) as adaptations to similar lifestyles as marine hunters. This is the essence of Convergent Evolution.

That was the Science, here comes the Technology…

A Brief Hydrology of Data Lakes

From 2000 to 2015, I had some success [5] with designing and implementing Data Warehouse architectures much like the following:

As a lot of my work then was in Insurance or related fields, the Analytical Repositories tended to be Actuarial Databases and / or Exposure Management Databases, developed in collaboration with such teams. Even back then, these were used for activities such as Analytics, Dashboards, Statistical Modelling, Data Mining and Advanced Visualisation.

Overlapping with the above, from around 2012, I began to get involved in also designing and implementing Big Data Architectures; initially for narrow purposes and later Data Lakes spanning entire enterprises. Of course some architectures featured both paradigms as well.

One of the early promises of a Data Lake approach was that – once all relevant data had been ingested – this would be directly leveraged by Data Scientists to derive insight.

Over time, it became clear that it would be useful to also have some merged / conformed and cleansed data structures in the Data Lake. Once the output of Data Science began to be used to support business decisions, a need arose to consider how it could be audited and both data privacy and information security considerations also came to the fore.

Next, rather than just being the province of Data Scientists, there were moves to use Data Lakes to support general Data Discovery and even business Reporting and Analytics as well. This required additional investments in metadata.

The types of issues with Data Lake adoption that I highlighted in Draining the Swamp earlier this year also led to the advent of techniques such as Data Curation [6]. In parallel, concerns about expensive Data Science resource spending 80% of their time in Data Wrangling [7] led to the creation of a new role, that of Data Engineer. These people take on much of the heavy lifting of consolidating, fixing and enriching datasets, allowing the Data Scientists to focus on Statistical Analysis, Data Mining and Machine Learning.

All of which leads to a modified Big Data / Data Lake architecture, embodying people and processes as well as technology and looking something like the exhibit above.

This is where the observant reader will see the concept of Convergent Evolution playing out in the data arena as well as the Natural World.

In Closing

Lest it be thought that I am saying that Data Warehouses belong to a bygone era, it is probably worth noting that the archosaurs, Ichthyosaurs included, dominated the Earth for orders of magnitude longer that the mammals and were only dethroned by an asymmetric external shock, not any flaw their own finely honed characteristics.

Also, to be crystal clear, much as while there are similarities between Ichthyosaurs and Dolphins there are also clear differences, the same applies to Data Warehouse and Data Lake architectures. When you get into the details, differences between Data Lakes and Data Warehouses do emerge; there are capabilities that each has that are not features of the other. What is undoubtedly true however is that the same procedural and operational considerations that played a part in making some Warehouses seem unwieldy and unresponsive are also beginning to have the same impact on Data Lakes.

If you are in the business of turning raw data into actionable information, then there are inevitably considerations that will apply to any technological solution. The key lesson is that shape of your architecture is going to be pretty similar, regardless of the technical underpinnings.

Notes

 [1] The two of us are constantly mistaken for one another. [2] To be clear the common ancestor was not a Lancelet, rather Lancelets sit on the branch closest to this common ancestor. [3] Ichthyosaurs are not Dinosaurs, but a different branch of ancient reptiles. [4] This is actually a matter of debate in paleontological circles, but recent evidence suggests small scales. [5] See: [6] A term that is unaccountably missing from The Data & Analytics Dictionary – something to add to the next release. UPDATE: Now remedied here. [7] Ditto. UPDATE: Now remedied here

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

# Version 2 of The Anatomy of a Data Function

Between November and December 2017, I published the three parts of my Anatomy of a Data Function. These were cunningly called Part I, Part II and Part III. Eight months is a long time in the data arena and I have now issued an update.

The changes in Version 2 are confined to the above organogram and Part I of the text. They consist of the following:

1. Split Artificial Intelligence out of Data Science in order to better reflect the ascendancy of this area (and also its use outside of Data Science).

2. Change Data Science to Data Science / Engineering in order to better reflect the continuing evolution of this area.

My aim will be to keep this trilogy up-to-date as best practice Data Functions change their shapes and contents.

If you would like help building or running your Data Function, or would just like to have an informal chat about the area, please get in touch

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

# Fact-based Decision-making

This article is about facts. Facts are sometimes less solid than we would like to think; sometimes they are downright malleable. To illustrate, consider the fact that in 98 episodes of Dragnet, Sergeant Joe Friday never uttered the words “Just the facts Ma’am”, though he did often employ the variant alluded to in the image above [1]. Equally, Rick never said “Play it again Sam” in Casablanca [2] and St. Paul never suggested that “money is the root of all evil” [3]. As Michael Caine never said in any film, “not a lot of people know that” [4].

 Up-front Acknowledgements These normally appear at the end of an article, but it seemed to make sense to start with them in this case: Recently I published Building Momentum – How to begin becoming a Data-driven Organisation. In response to this, one of my associates, Olaf Penne, asked me about my thoughts on fact-base decision-making. This piece was prompted by both Olaf’s question and a recent article by my friend Neil Raden on his Silicon Angle blog, Performance management: Can you really manage what you measure? Thanks to both Olaf and Neil for the inspiration.

Fact-based decision making. It sounds good doesn’t it? Especially if you consider the alternatives: going on gut feel, doing what you did last time, guessing, not taking a decision at all. However – as is often the case with issues I deal with on this blog – fact-based decision-making is easier to say than it is to achieve. Here I will look to cover some of the obstacles and suggest a potential way to navigate round them. Let’s start however with some definitions.

 Fact NOUN A thing that is known or proved to be true. (Oxford Dictionaries) Decision NOUN A conclusion or resolution reached after consideration. (Oxford Dictionaries)

So one can infer that fact-based decision-making is the process of reaching a conclusion based on consideration of things that are known to be true. Again, it sounds great doesn’t it? It seems that all you have to do is to find things that are true. How hard can that be? Well actually quite hard as it happens. Let’s cover what can go wrong (note: this section is not intended to be exhaustive, links are provided to more in-depth articles where appropriate):

Accuracy of Data that is captured

A number of factors can play into the accuracy of data capture. Some systems (even in 2018) can still make it harder to capture good data than to ram in bad. Often an issue may also be a lack of master data definitions, so that similar data is labelled differently in different systems.

A more pernicious problem is combinatorial data accuracy, two data items are both valid, but not in combination with each other. However, often the biggest stumbling block is a human one, getting people to buy in to the idea that the care and attention they pay to data capture will pay dividends later in the process.

These and other areas are covered in greater detail in an older article, Using BI to drive improvements in data quality.

Honesty of Data that is captured

Data may be perfectly valid, but still not represent reality. Here I’ll let Neil Raden point out the central issue in his customary style:

People find the most ingenious ways to distort measurement systems to generate the numbers that are desired, not only NOT providing the desired behaviors, but often becoming more dysfunctional through the effort.

[…] voluntary compliance to the [US] tax code encourages a national obsession with “loopholes”, and what salesman hasn’t “sandbagged” a few deals for next quarter after she has met her quota for the current one?

Where there is a reward to be gained or a punishment to be avoided, by hitting certain numbers in a certain way, the creativeness of humans often comes to the fore. It is hard to account for such tweaking in measurement systems.

Timing issues with Data

Timing is often problematic. For example, a transaction completed near the end of a period gets recorded in the next period instead, one early in a new period goes into the prior period, which is still open. There is also (as referenced by Neil in his comments above) the delayed booking of transactions in order to – with the nicest possible description – smooth revenues. It is not just hypothetical salespeople who do this of course. Entire organisations can make smoothing adjustments to their figures before publishing and deferral or expedition of obligations and earnings has become something of an art form in accounting circles. While no doubt most of this tweaking is done with the best intentions, it can compromise the fact-based approach that we are aiming for.

Reliability with which Data is moved around and consolidated

In our modern architectures, replete with web-services, APIs, cloud-based components and the quasi-instantaneous transmission of new transactions, it is perhaps not surprising that occasionally some data gets lost in translation [5] along the way. That is before data starts to be Sqooped up into Data Lakes, or other such Data Repositories, and then otherwise manipulated in order to derive insight or provide regular information. All of these are processes which can introduce their own errors. Suffice it to say that transmission, collation and manipulation of data can all reduce its accuracy.

Again see Using BI to drive improvements in data quality for further details.

Pertinence and fidelity of metrics developed from Data

Here we get past issues with data itself (or how it is handled and moved around) and instead consider how it is used. Metrics are seldom reliant on just one data element, but are often rather combinations. The different elements might come in because a given metric is arithmetical in nature, e.g.

$\text{Metric X} = \dfrac{\text{Data Item A}+\text{Data Item B}}{\text{Data Item C}}$

Choices are made as to how to construct such compound metrics and how to relate them to actual business outcomes. For example:

$\text{New Biz Growth} = \dfrac{(\text{Sales CYTD}-\text{Repeat CYTD})-(\text{Sales PYTD}-\text{Repeat PYTD})}{(\text{Sales PYTD}-\text{Repeat PYTD})}$

Is this a good way to define New Business Growth? Are there any weaknesses in this definition, for example is it sensitive to any glitches in – say – the tagging of Repeat Business? Do we need to take account of pricing changes between Repeat Business this year and last year? Is New Business Growth something that is even worth tracking; what will we do as a result of understanding this?

The above is a somewhat simple metric, in a section of Using historical data to justify BI investments – Part I, I cover some actual Insurance industry metrics that build on each other and are a little more convoluted. The same article also considers how to – amongst other things – match revenue and outgoings when the latter are spread over time. There are often compromises to be made in defining metrics. Some of these are based on the data available. Some relate to inherent issues with what is being measured. In other cases, a metric may be a best approximation to some indication of business health; a proxy used because that indication is not directly measurable itself. In the last case, staff turnover may be a proxy for staff morale, but it does not directly measure how employees are feeling (a competitor might be poaching otherwise happy staff for example).

Robustness of extrapolations made from Data

I have used the above image before in these pages [6]. The situation it describes may seem farcical, but it is actually not too far away from some extrapolations I have seen in a business context. For example, a prediction of full-year sales may consist of this year’s figures for the first three quarters supplemented by prior year sales for the final quarter. While our metric may be better than nothing, there are some potential distortions related to such an approach:

1. Repeat business may have fallen into Q4 last year, but was processed in Q3 this year. This shift in timing would lead to such business being double-counted in our year end estimate.

2. Taking point 1 to one side, sales may be growing or contracting compared to the previous year. Using Q4 prior year as is would not reflect this.

3. It is entirely feasible that some market event occurs this year ( for example the entrance or exit of a competitor, or the launch of a new competitor product) which would render prior year figures a poor guide.

Of course all of the above can be adjusted for, but such adjustments would be reliant on human judgement, making any projections similarly reliant on people’s opinions (which as Neil points out may be influenced, conciously or unconsciously, by self-interest). Where sales are based on conversions of prospects, the quantum of prospects might be a more useful predictor of Q4 sales. However here a historical conversion rate would need to be calculated (or conversion probabilities allocated by the salespeople involved) and we are back into essentially the same issues as catalogued above.

I explore some similar themes in a section of Data Visualisation – A Scientific Treatment

Integrity of statistical estimates based on Data

Having spent 18 years working in various parts of the Insurance industry, statistical estimates being part of the standard set of metrics is pretty familiar to me [7]. However such estimates appear in a number of industries, sometimes explicitly, sometimes implicitly. A clear parallel would be credit risk in Retail Banking, but something as simple as an estimate of potentially delinquent debtors is an inherently statistical figure (albeit one that may not depend on the output of a statistical model).

The thing with statistical estimates is that they are never a single figure but a range. A model may for example spit out a figure like £12.4 million ± £0.5 million. Let’s unpack this.

Well the output of the model will probably be something analogous to the above image. Here a distribution has been fitted to the business event being modelled. The central point of this (the one most likely to occur according to the model) is £12.4 million. The model is not saying that £12.4 million is the answer, it is saying it is the central point of a range of potential figures. We typically next select a symmetrical range above and below the central figure such that we cover a high proportion of the possible outcomes for the figure being modelled; 95% of them is typical [8]. In the above example, the range extends plus £0. 5 million above £12.4 million and £0.5 million below it (hence the ± sign).

Of course the problem is then that Financial Reports (or indeed most Management Reports) are not set up to cope with plus or minus figures, so typically one of £12.4 million (the central prediction) or £11.9 million (the most conservative estimate [9]) is used. The fact that the number itself is uncertain can get lost along the way. By the time that people who need to take decisions based on such information are in the loop, the inherent uncertainty of the prediction may have disappeared. This can be problematic. Suppose a real result of £12.4 million sees an organisation breaking even, but one of £11.9 million sees a small loss being recorded. This could have quite an influence on what course of action managers adopt [10]; are they relaxed, or concerned?

Beyond the above, it is not exactly unheard of for statistical models to have glitches, sometimes quite big glitches [11].

This segment could easily expand into a series of articles itself. Hopefully I have covered enough to highlight that there may be some challenges in this area.

And so what?

Even if we somehow avoid all of the above pitfalls, there remains one booby-trap that is likely to snare us, absent the necessary diligence. This was alluded to in the section about the definition of metrics:

Is New Business Growth something that is even worth tracking; what will we do as a result of understanding this?

Unless a reported figure, or output of a model, leads to action being taken, it is essentially useless. Facts that never lead to anyone doing anything are like lists learnt by rote at school and regurgitated on demand parrot-fashion; they demonstrate the mechanism of memory, but not that of understanding. As Neil puts it in his article:

[…] technology is never a solution to social problems, and interactions between human beings are inherently social. This is why performance management is a very complex discipline, not just the implementation of dashboard or scorecard technology.

How to Measure the Unmeasurable

Our dream of fact-based decision-making seems to be crumbling to dust. Regular facts are subject to data quality issues, or manipulation by creative humans. As data is moved from system to system and repository to repository, the facts can sometimes acquire an “alt-” prefix. Timing issues and the design of metrics can also erode accuracy. Then there are many perils and pitfalls associated with simple extrapolation and less simple statistical models. Finally, any fact that manages to emerge from this gantlet [12] unscathed may then be totally ignored by those whose actions it is meant to guide. What can be done?

As happens elsewhere on this site, let me turn to another field for inspiration. Not for the first time, let’s consider what Science can teach us about dealing with such issues with facts. In a recent article [13] in my Maths & Science section, I examined the nature of Scientific Theory and – in particular – explored the imprecision inherent in the Scientific Method. Here is some of what I wrote:

It is part of the nature of scientific theories that (unlike their Mathematical namesakes) they are not “true” and indeed do not seek to be “true”. They are models that seek to describe reality, but which often fall short of this aim in certain circumstances. General Relativity matches observed facts to a greater degree than Newtonian Gravity, but this does not mean that General Relativity is “true”, there may be some other, more refined, theory that explains everything that General Relativity does, but which goes on to explain things that it does not. This new theory may match reality in cases where General Relativity does not. This is the essence of the Scientific Method, never satisfied, always seeking to expand or improve existing thought.

I think that the Scientific Method that has served humanity so well over the centuries is applicable to our business dilemma. In the same way that a Scientific Theory is never “true”, but instead useful for explaining observations and predicting the unobserved, business metrics should be judged less on their veracity (though it would be nice if they bore some relation to reality) and instead on how often they lead to the right action being taken and the wrong action being avoided. This is an argument for metrics to be simple to understand and tied to how decision-makers actually think, rather than some other more abstruse and theoretical definition.

A proxy metric is fine, so long as it yields the right result (and the right behaviour) more often than not. A metric with dubious data quality is still useful if it points in the right direction; if the compass needle is no more than a few degrees out. While of course steps that improve the accuracy of metrics are valuable and should be undertaken where cost-effective, at least equal attention should be paid to ensuring that – when the metric has been accessed and digested – something happens as a result. This latter goal is a long way from the arcana of data lineage and metric definition, it is instead the province of human psychology; something that the accomploished data professional should be adept at influencing.

I have touched on how to positively modify human behaviour in these pages a number of times before [14]. It is a subject that I will be coming back to again in coming months, so please watch this space.

Notes

[1]

According to Snopes, the phrase arose from a spoof of the series.

[2]

The two pertinent exchanges were instead:

 Ilsa: Play it once, Sam. For old times’ sake. Sam: I don’t know what you mean, Miss Ilsa. Ilsa: Play it, Sam. Play “As Time Goes By” Sam: Oh, I can’t remember it, Miss Ilsa. I’m a little rusty on it. Ilsa: I’ll hum it for you. Da-dy-da-dy-da-dum, da-dy-da-dee-da-dum… Ilsa: Sing it, Sam.

and

 Rick: You know what I want to hear. Sam: No, I don’t. Rick: You played it for her, you can play it for me! Sam: Well, I don’t think I can remember… Rick: If she can stand it, I can! Play it!

[3]

Though he, or whoever may have written the first epistle to Timothy, might have condemned the “love of money”.

[4]

The origin of this was a Peter Sellers interview in which he impersonated Caine.

[5]

One of my Top Ten films.

[6]

Especially for all Business Analytics professionals out there (2009).

[7]

See in particular my trilogy:

[8]

Without getting into too many details, what you are typically doing is stating that there is a less than 5% chance that the measurements forming model input match the distribution due to a fluke; but this is not meant to be a primer on null hypotheses.

[9]

Of course, depending on context, £12.9 million could instead be the most conservative estimate.

[10]

This happens a lot in election polling. Candidate A may be estimated to be 3 points ahead of Candidate B, but with an error margin of 5 points, it should be no real surprise when Candidate B wins the ballot.

[11]

Try googling Nobel Laureates Myron Scholes and Robert Merton and then look for references to Long-term Capital Management.

[12]

Yes I meant “gantlet” that is the word in the original phrase, not “gauntlet” and so connections with gloves are wide of the mark.

[13]

Finches, Feathers and Apples (2018).

[14]

For example:

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

# In-depth with CDO Jo Coutuer

Part of the In-depth series of interviews

 Today’s guest on In-depth is Jo Coutuer, Chief Data Officer and Member of the Executive Committee of BNP Paribas Fortis, a leading Belgian bank. Given the importance of the CDO role in Financial Services, I am very happy that Jo has managed to spare us some of his valuable time to talk.
 Jo, you have had an interesting career in a variety of organisations from consultancies to start-ups, from government to major companies. Can you give readers a pen-picture of the journey that has taken you to your current role? For me, the variety of contexts has been the most rewarding. I started in an industry that has now sharply declined in Europe (Telco Manufacturing), continued in the consulting world of ERP tools, switched into a very interesting job for the government, became an entrepreneur and co-created a data company for 13 years, merged that data company into a big 4 consultancy and finally decided to apply my life’s learnings to the fascinating industry of banking. The most remarkable aspect of my career is the fact that my current role and the attention to data that goes with it, did not exist when I started my career. It illustrates how young people today can also build a future, without really knowing what lies ahead. All it takes is the mental flexibility to switch contexts when it is needed.
 Do you collaborate with other Executives in the data arena, or is the CDO primus inter pares when it comes to data matters? I would not speak of a hierarchical order when it comes to data. It helps to distinguish three identities of a Data department. The first one is the identity of the “Governor”. In that identity, peers accept that the CDO translates external duties into internal best practices, as long as this happens in a co-creation mode. We have established a “College of Data Managers”, who are 13 senior managers, representing each a specific “data perimeter”, which in its turn rather well maps to our fields of business or our internal functions. These senior managers intimately link the Data activities to the day-to-day business functions and their respective executives. A second identity is that of the “Expert”. In that identity, we offer expertise in fields of data integration, data warehousing, reporting, visualisation, data science, … It means that I see my fellow executives as clients and partners and the Data department helps them achieve their business objectives. Mentally (and sometimes practically), we measure up to external professional services or IT companies. A third identity is that of the “Integrator”. As an integrator, we actively make the link between the business of today, the technological and data potential of today and the business of tomorrow. We actively try to question existing practices and we introduce new concepts for a variety of business applications. And although we are more driving in this role than we are in the role of the “Expert”, we still are fully at the service of our clients.
 More generally, how do you see the CDO role changing in coming years, what would 2020’s CDO be doing? Will we even need CDOs in 2020? Ahah! One of the most frequently asked questions on CDO related social media! If previous two years are any predictor of the future, I would say that the CDO of 2020 is one who has solidly matured the governance aspects of Data, just like the CFO and CRO have done that for financial management or risk management. Let’s say that Data has become “routine”. At the same time, the 2020 CDO will need to offer to his peers, the technical and expert capabilities that are data centric and essential to running a digital business. And on top of that, I believe that 2020 will be the timeframe in which data valorisation will become an active topic. I explicitly do not use the word “monetisation” because we currently associate data to often with “selling data for advertising purposes”. In our industry, PSD2 [1] will define our duties to be able to exchange data with third party service providers, at the explicit request of our clients. From that new reality, an API-driven ecosystem will surface in which data will be actively valorised, to the direct service of our clients, not to the indirect service of our marketing departments. The 2020 CDO will be instrumental in shaping his or her company’s ecosystem to make sure this happens in a well governed, trusted and safe way. Clients will seek that reassurance and will reward companies who take data management seriously.
 Of course, senior roles tend to exist because they add value to their organisations, what do you feel is the value that a CDO brings to the table? I have already mentioned the CDO’s challenge to be schizophrenic ally split between his or her various identities. But it is exactly that breadth of scope that can add value. The CDO should be an “executive integrator”. He can employ “governors” and “experts”, but his or her role in the peer team of executives is to represent the transversality of data’s nature. Data “flows”, data “unites”. More than it is “oil”, data is “water”. It flows through the company’s ecosystem and it nourishes the business and the future business potential. As such, the CDO needs to keep the water clean and make sure it gets pumped across the organisation, so that others can benefit from the nutrients it. And while doing so, the CDO has a duty to add nutrients to the water, in the form of analytical or artificial intelligence induced insights.
 Focussing on Analytics, I know you have written about how to build the ideal Analytics team and have mentioned that “purple people” are the key. Can you explain more about this? Purple people are people that integrate the skills of “red” people and “blue” people. Red people bring the scientific data methodologies to the table. Blue people bring the solid frameworks of the business. Data people as individuals and a Data department as an entity, must have as a mission to be “purple” and to actively bridge the gap between the fast growing set of data technologies and methodologies on the one hand and the rapidly evolving and transforming business challenges on the other hand. And of course, if you like Prince [2] as a musician, that can be an asset too!
 In my discussions with other CDOs [3] and indeed in my own experience, it seems that teamwork is crucial for a CDO. Of course, this is important for many senior roles, but it does seem central to what a CDO does. My perspective is that both a CDO’s own team and the virtual teams that he or she forms with colleagues are going to have a big say in whether things go well or not. What are your views on this topic? You are absolutely right. A CDO or data function cannot exist in isolation. At some times, transversality feels a burden because it imposes a daily attention to stakeholders. However, in reality, it’s exactly the transversal effect that can generate the added value to an organisation. At the end of the day, the integration aspects between departments and people will generate positive side effects, above and beyond the techniques of data management.
 Artificial Intelligence in its various guises has been the topic of conversation recently. This is something with strong linkage to the data field. Obviously without divulging any commercial secrets, what role do you see AI playing in banking going forwards? What about in our lives in general? It’s funny that AI is being discovered as a new topic. I remember writing my Master thesis on the topic a long time ago. Of course, things have evolved since the 90s, with a storage and computing capacity that is approximately 50,000 times stronger for the same price point. This capacity explosion, combined with the connectivity of the internet and the cloud, combined with the increased awareness that data and algorithms have become central elements in a many business strategies, has fundamentally re-calibrated the potential of AI. In banking, AI and Analytics will soon help clients understand their finances better, will help them to take better and faster decisions, will generate a better (less friction) client experience for “the easy stuff” and it will allow the banks to put humans on “the hard stuff” or on those interactions with their clients that require true human interaction. Behind the scenes, Analytics and AI are already helping to prevent fraud, monitoring suspicious transactions to detect crime, money laundering and fraud. And even deeper inside the mechanics of a bank, Analytics and AI are helping prevent cyber-crimes and are monitoring the stability of the technological platforms onto which our modern financial and societal system is built. I am convinced that the societal role of banks will continue to exists, despite innovative peer-to-peer or blockchain driven schemes. As such, Analytics and AI will contribute to society as a whole, through their contribution to a reliable and stable financial services system.
 With GDPR [4] coming into force only a couple of months ago, the subject of customer data and how it is used is a topical one. Taking BNP Paribas Fortis to one side, what are your thoughts on the balance between data privacy and the “free” services that we all pay for by allowing our data to be sold? I believe that GDPR is both important legislation and brings benefits to customers. First of all, we have good historical reasons to care about our privacy. In times of societal crises or wars, it is the first weapon that is used against society and its citizens. So we should care for it deeply. Second, being in an industry for which “trust” is the most essential element of identity, protecting and respecting the data and the privacy of clients is a natural reflex. And putting the banking question aside for a moment, we should continue to educate aggressively about the fact that services never come for free. As long as consumers are well informed that they pay for their convenience with their data, there is no fundamental concern. But because there is still no real “paid” economy surfacing, the consumer does not really have a choice between “pay-for-service” or “give-data-for-service”. I believe that the market potential for paid services, that guarantee non-exploitation of personal data, is quietly growing. And when it finally appears, consumers will start making choices. Personally, I admit to having moved from being on all possible digital channels and tools, towards being much more selective. And I must admit that digital life with a privacy aware mind is still possible and still fun.
 It seems to me that a key capability of a CDO is as an influencer. Influence can take many shapes, from being an acknowledged expert in an area, to the softer skills of being someone that others can talk to openly. Do you agree about this observation? If so, how do you seek to be an influencer? It’s a thin line to walk and it depends on the type of CDO that you are and the mandate that you have. If you have a mandate to do “governance only”, then you should have the confidence of delivering on your mandate, just like a CRO or a CFO does. For that I always revert to the phrase: “we agreed that data is a valuable asset, just like money or people or buildings, … so let’s then act like it.” If you have mandate to “change”, to “create value”, then you have to be an integrator and influencer because you can never change an organisation and its people on your own.
 Before letting you go, a quick personal question. I know you spent some time at the University of Cambridge. I lived in this town while my wife was working on her PhD. Like Cambridge, Leuven [5] is a historic town just outside of a major capital city. What parallels do you see between the two and what did you think of the locals? Cambridge is famous for its “punts”, Leuven for its Stella Artois “pints”. And both central churches (or chapels) are home to iconic paintings by Flemish masters, Rubens in Cambridge and Bouts in Leuven. Visit both!
 Jo, thank you so much for talking to me and giving readers the benefit of your ideas and experience.

Jo Coutuer can be reached at via his LinkedIn profile.

Disclosure: At the time of publication, neither peterjamesthomas.com Ltd. nor any of its Directors had any shared commercial interests with Jo Coutuer, BNP Paribas Fortis or any entities associated with either of these.

 If you are a Chief Data Officer, a Chief Analytics Officer, a Director of Data, or hold some other “Top Data Job” and would like to share your thoughts with the readers of this site in an interview like this one, please get in contact.

Notes

 [1] Payment Services Directive 2. [2] Prince Rogers Nelson. [3] Two recent examples include: [4] General Data Protection Regulation. [5] Leuven.

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases