An expanded and more mobile-friendly version of the Data & Analytics Dictionary

4 August 20204 August 2020 Peter James Thomas big data, business intelligence, chief data officer, data governance, data quality, data science, data visualisation, data warehousing, machine learning, management information

A revised and expanded version of the peterjamesthomas.com Data and Analytics Dictionary has been published.

The previous Dictionary was not the easiest to read on mobile devices. Because of this, the layout has been amended in this release and the mobile experience should now be greatly enhanced. Any feedback on usability would be welcome.

The new Dictionary includes 22 additional definitions, bringing the total number of entries to 220, totalling well over twenty thousand words. As usual, the new definitions range across the data arena: from Data Science and Machine Learning; to Information and Reporting; to Data Governance and Controls. They are as follows:

Please remember that The Dictionary is a free resource and quoting contents (ideally with acknowledgement) and linking to its entries (via the buttons provided) are both encouraged.

If you would like to contribute a definition, which will of course be acknowledged, you can use the comments section here, or the dedicated form, we look forward to hearing from you ^[1].

If you have found The Data & Analytics Dictionary helpful, we would love to learn more about this. Please post something in the comments section or contact us and we may even look to feature you in a future article.

The Data & Analytics Dictionary will continue to be expanded in coming months.

Notes

^[1]	Please note that any submissions will be subject to editorial review and are not guaranteed to be accepted.

Another article from peterjamesthomas.com. The home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases.

Follow @peterjthomas

Put our Knowledge and Writing Skills to Work for you

8 January 202015 March 2022 Peter James Thomas business analytics, chief data officer, data architecture, data quality, project management neal analytics, white paper

As well as consultancy, research and interim work, peterjamesthomas.com Ltd. helps organisations in a number of other ways. The recently launched Data Strategy Review Service is just one example.

Another service we provide is writing White Papers for clients. Sometimes the labels of these are white ^[1] as well as the paper. Sometimes Peter James Thomas is featured as the author. White Papers can be based on themes arising from articles published here, they can feature findings from de novo research commissioned in the data arena, or they can be on a topic specifically requested by the client.

Seattle-based Data Consultancy, Neal Analytics, is an organisation we have worked with on a number of projects and whose experience and expertise dovetails well with our own. They recently commissioned a White Paper expanding on our 2018 article, Building Momentum – How to begin becoming a Data-driven Organisation. The resulting paper, The Path to Data-Driven, has just been published on Neal Analytics’ site (they have a lot of other interesting content, which I would recommend checking out):

Neal Analytics White Paper - The Path to Data-Driven — Clicking on the above image will take you to Neal Analytics’ site, where the White Paper may be downloaded for free and without registration

If you find the articles published on this site interesting and relevant to your work, then perhaps – like Neal Analytics – you would consider commissioning us to write a White Paper or some other document. If so, please just get in contact. We have a degree of flexibility on the commercial side and will most likely be able to come up with an approach that fits within your budget. Although we are based in the UK, commissions – like Neal Analytics’s one – from organisations based in other countries are welcome.

Notes

^[1]	White-label Product – Wikipedia

Another article from peterjamesthomas.com. The home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases.

Follow @peterjthomas

The latest edition of The Data & Analytics Dictionary is now out

2 August 20192 August 2019 Peter James Thomas big data, business analytics, business intelligence, chief data officer, dashboards, data architecture, data governance, data management, data quality, data science, data visualisation, data warehousing, infographics, machine learning, management information

After a hiatus of a few months, the latest version of the peterjamesthomas.com Data and Analytics Dictionary is now available. It includes 30 new definitions, some of which have been contributed by people like Tenny Thomas Soman, George Firican, Scott Taylor and and Taru Väre. Thanks to all of these for their help.

Analysis
Application Programming Interface (API)
Business Glossary (contributor: Tenny Thomas Soman)
Chart (Graph)
Data Architecture – Definition (2)
Data Catalogue
Data Community
Data Domain (contributor: Taru Väre)
Data Enrichment
Data Federation
Data Function
Data Model
Data Operating Model
Data Scrubbing
Data Service
Data Sourcing
Decision Model
Embedded BI / Analytics
Genetic Algorithm
Geospatial Data
Infographic
Insight
Management Information (MI)
Master Data – additional definition (contributor: Scott Taylor)
Optimisation
Reference Data (contributor: George Firican)
Report
Robotic Process Automation
Statistics
Self-service (BI or Analytics)

Remember that The Dictionary is a free resource and quoting contents (ideally with acknowledgement) and linking to its entries (via the buttons provided) are both encouraged.

If you would like to contribute a definition, which will of course be acknowledged, you can use the comments section here, or the dedicated form, we look forward to hearing from you ^[1].

The Data & Analytics Dictionary will continue to be expanded in coming months.

Notes

^[1]	Please note that any submissions will be subject to editorial review and are not guaranteed to be accepted.

Another article from peterjamesthomas.com. The home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases.

Follow @peterjthomas

Why do data migration projects have such a high failure rate?

25 July 20192 August 2019 Peter James Thomas data management, data quality, project management data migration

Similar to its predecessor, Why are so many businesses still doing a poor job of managing data in 2019? this brief article has its genesis in the question that appears in its title, something that I was asked to opine on recently. Here is an expanded version of what I wrote in reply:

Well the first part of the answer is based on consideing activities which have at least moderate difficulty and complexity associated with them. The majority of such activities that humans attempt will end in failure. Indeed I think that the oft-reported failure rate, which is in the range 60 – 70%, is probably a fundamental Physical constant; just like the speed of light in a vacuum ^[1], the rest mass of a proton ^[2], or the fine structure constant ^[3].

$\alpha=\dfrac{e^2}{4\pi\varepsilon_0d}\bigg/\dfrac{hc}{\lambda}=\dfrac{e^2}{4\pi\varepsilon_0d}\cdot\dfrac{2\pi d}{hc}=\dfrac{e^2}{4\pi\varepsilon_0d}\cdot\dfrac{d}{\hbar c}=\dfrac{e^2}{4\pi\varepsilon_0\hbar c}$

For more on this, see the preambles to both Ever tried? Ever failed? and Ideas for avoiding Big Data failures and for dealing with them if they happen.

Beyond that, what I have seen a lot is Data Migration being the poor relation of programme work-streams. Maybe the overall programme is to implement a new Transaction Platform, integrated with a new Digital front-end; this will replace 5+ legacy systems. When the programme starts the charter says that five years of history will be migrated from the 5+ systems that are being decommissioned.

Then the costs of the programme escallate ^[4] and something has to give to stay on budget. At the same time, when people who actually understand data make a proper assessment of the amount of work required to consolidate and conform the 5+ disparate data sets, it is found that the initial estimate for this work ^[5] was woefully inadequate. The combination leads to a change in migration scope, just two years historical data will now be migrated.

Rinse and repeat…

The latest strategy is to not migrate any data, but instead get the existing data team to build a Repository that will allow users to query historical data from the 5+ systems to be decommissioned. This task will fall under BAU ^[6] activities (thus getting programme expenditure back on track).

The slight flaw here is that building such a Repository is essentially a big chunk of the effort required for Data Migration and – of course – the BAU budget will not be enough for this quantum work. Oh well, someone else’s problem, the programme budget suddenly looks much rosier, only 20% over budget now…

Note: I may have exaggerated a bit to make a point, but in all honesty, not really by that much.

Notes

^[1]	$c\approx299,792,458\text{ }ms^{-1}$
^[2]	$m_p\approx1.6726 \times 10^{-27}\text{ }kg$
^[3]	$\alpha\approx0.0072973525693$ – which doesn’t have a unit (it’s dimensionless)
^[4]	Probably because they were low-balled at first to get it green-lit; both internal and external teams can be guilty of this.
^[5]	Which was do doubt created by a generalist of some sort; or at the very least an incurable optimist.
^[6]	BAU of course stands for Basically All Unfunded.

Another article from peterjamesthomas.com. The home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases.

Follow @peterjthomas

A Simple Data Capability Framework

3 May 20199 March 2020 Peter James Thomas business analytics, business intelligence competency centres, chief data officer, data architecture, data governance, data management, data quality, data science, strategy

Larger PDF version (opens in a new tab).

Introduction

As part of my consulting business, I end up thinking about Data Capability Frameworks quite a bit. Sometimes this is when I am assessing current Data Capabilities, sometimes it is when I am thinking about how to transition to future Data Capabilities. Regular readers will also recall my tripartite series on The Anatomy of a Data Function, which really focussed more on capabilities than purely organisation structure ^[1].

Detailed frameworks like the one contained in Anatomy are not appropriate for all audiences. Often I need to provide a more easily-absorbed view of what a Data Function is and what it does. The exhibit above is one that I have developed and refined over the last three or so years and which seems to have resonated with a number of clients. It has – I believe – the merit of simplicity. I have tried to distil things down to the essentials. Here I will aim to walk the reader through its contents, much of which I hope is actually self-explanatory.

The overall arrangement has been chosen intentionally, the top three areas are visible activities, the bottom three are more foundational areas ^[2], ones that are necessary for the top three boxes to be discharged well. I will start at the top left and work across and then down.

Collation of Data to provide Information

This area includes what is often described as “traditional” reporting ^[3], Dashboards and analysis facilities. The Information created here is invaluable for both determining what has happened and discerning trends / turning points. It is typically what is used to run an organisation on a day-to-day basis. Absence of such Information has been the cause of underperformance (or indeed major losses) in many an organisation, including a few that I have been brought in to help. The flip side is that making the necessary investments to provide even basic information has been at the heart of the successful business turnarounds that I have been involved in.

The bulk of Business Intelligence efforts would also fall into this area, but there is some overlap with the area I next describe as well.

Leverage of Data to generate Insight

In this second area we have disciplines such as Analytics and Data Science. The objective here is to use a variety of techniques to tease out findings from available data (both internal and external) that go beyond the explicit purpose for which it was captured. Thus data to do with bank transactions might be combined with publically available demographic and location data to build an attribute model for both existing and potential clients, which can in turn be used to make targeted offers or product suggestions to them on Digital platforms.

It is my experience that work in this area can have a massive and rapid commercial impact. There are few activities in an organisation where a week’s work can equate to a percentage point increase in profitability, but I have seen insight-focussed teams deliver just that type of ground-shifting result.

Control of Data to ensure it is Fit-for-Purpose

This refers to a wide range of activities from Data Governance to Data Management to Data Quality improvement and indeed related concepts such as Master Data Management. Here as well as the obvious policies, processes and procedures, together with help from tools and technology, we see the need for the human angle to be embraced via strong communications, education programmes and aligning personal incentives with desired data quality outcomes.

The primary purpose of this important work is to ensure that the information an organisation collates and the insight it generates are reliable. A helpful by-product of doing the right things in these areas is that the vast majority of what is required for regulatory compliance is achieved simply by doing things that add business value anyway.

Data Architecture / Infrastructure

Best practice has evolved in this area. When I first started focussing on the data arena, Data Warehouses were state of the art. More recently Big Data architectures, including things like Data Lakes, have appeared and – at least in some cases – begun to add significant value. However, I am on public record multiple times stating that technology choices are generally the least important in the journey towards becoming a data-centric organisation. This is not to say such choices are unimportant, but rather that other choices are more important, for example how best to engage your potential users and begin to build momentum ^[4].

Having said this, the model that seems to have emerged of late is somewhat different to the single version of the truth aspired to for many years by organisations. Instead best practice now encompasses two repositories: the first Operational, the second Analytical. At a high-level, arrangements would be something like this:

The Operational Repository would contain a subset of corporate data. It would be highly controlled, highly reconciled and used to support both regular reporting and a large chunk of dashboard content. It would be designed to also feed data to other areas, notably Finance systems. This would be complemented by the Analytical Repository, into which most corporate data (augmented by external data) would be poured. This would be accessed by a smaller number of highly skilled staff, Data Scientists and Analytics experts, who would use it to build models, produce one off analyses and to support areas such as Data Visualisation and Machine Learning.

It is not atypical for Operational Repositories to be SQL-based and Analytical Repsoitories to be Big Data-based, but you could use SQL for both or indeed Big Data for both according to the circumstances of an organisation and its technical expertise.

Data Operating Model / Organisation Design

Here I will direct readers to my (soon to be updated) earlier work on The Anatomy of a Data Function. However, it is worth mentioning a couple of additional points. First an Operating Model for data must encompass the whole organisation, not just the Data Function. Such a model should cover how data is captured, sourced and used across all departments.

Second I think that the concept of a Data Community is important here, a web of like-minded Data Scientists and Analytics people, sitting in various business areas and support functions, but linked to the central hub of the Data Function by common tooling, shared data sets (ideally Curated) and aligned methodologies. Such a virtual data team is of course predicated on an organisation hiring collaborative people who want to be part of and contribute to the Data Community, but those are the types of people that organisations should be hiring anyway ^[5].

Data Strategy

Our final area is that of Data Strategy, something I have written about extensively in these pages ^[6] and a major part of the work that I do for organisations.

It is an oft-repeated truism that a Data Strategy must reflect an overarching Business Strategy. While this is clearly the case, often things are less straightforward. For example, the Business Strategy may be in flux; this is particularly the case where a turn-around effort is required. Also, how the organisation uses data for competitive advantage may itself become a central pillar of its overall Business Strategy. Either way, rather than waiting for a Business Strategy to be finalised, there are a number of things that will need to be part of any Data Strategy: the establishment of a Data Function; a focus on making data fit-for-purpose to better support both information and insight; creation of consistent and business-focussed reporting and analysis; and the introduction or augmentation of Data Science capabilities. Many of these activities can help to shape a Business Strategy based on facts, not gut feel.

More broadly, any Data Strategy will include: a description of where the organisation is now (threats and opportunities); a vision for commercially advantageous future data capabilities; and a path for moving between the current and the future states. Rather than being PowerPoint-ware, such a strategy needs to be communicated assiduously and in a variety of ways so that it can be both widely understood and form a guide for data-centric activities across the organisation.

Summary

As per my other articles, the data capabilities that a modern organisation needs are broader and more detailed than those I have presented here. However, I have found this simple approach a useful place to start. It covers all the basic areas and provides a scaffold off of which more detailed capabilities may be hung.

The framework has been informed by what I have seen and done in a wide range of organisations, but of course it is not necessarily the final word. As always I would be interested in any general feedback and in any suggestions for improvement.

Notes

^[1]	In passing, Anatomy is due for its second refresh, which will put greater emphasis on Data Science and its role as an indispensable part of a modern Data Function. Watch this space.
^[2]	Though one would hope that a Data Strategy is also visible!
^[3]	Though nowadays you hear “traditional” Analytics and “traditional” Big Data as well (on the latter see Sic Transit Gloria Magnorum Datorum), no doubt “traditional” Machine Learning will be with us at some point, if it isn’t here already.
^[4]	See also Building Momentum – How to begin becoming a Data-driven Organisation.
^[5]	I will be revisiting the idea of a Data Community in coming months, so again watch this space.
^[6]	Most explicitly in my three-part series: Forming an Information Strategy: Part I – General Strategy Forming an Information Strategy: Part II – Situational Analysis Forming an Information Strategy: Part III – Completing the Strategy

Another article from peterjamesthomas.com. The home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases.

Follow @peterjthomas

The Chief Marketing Officer and the CDO – A Modern Fable

30 October 20181 November 2018 Peter James Thomas chief data officer, data architecture, data governance, data management, data quality, data science marketing week

This Fox has a longing for grapes:
He jumps, but the bunch still escapes.
So he goes away sour;
And, ’tis said, to this hour
Declares that he’s no taste for grapes.

— W.J.Linton (after Aesop)

Note:

Not all of the organisations I have worked with or for have had a C-level Executive accountable primarily for Marketing. Where they have, I have normally found the people holding these roles to be better informed about data matters than their peers. I have always found it easy and enjoyable to collaborate with such people. The same goes in general for Marketing Managers. This article is not about Marketing professionals, it is about poorly researched journalism.

Prelude…

I recently came across an article in Marketing Week with the clickbait-worthy headline of Why the rise of the chief data officer will be short-lived (their choice of capitalisation). The subhead continues in the same vein:

Chief data officers (ditto) are becoming increasingly common, but for a data strategy to work their appointments can only ever be a temporary fix.

Intrigued, I felt I had to avail myself of the wisdom and domain expertise contained in the article (the clickbait worked of course). The first few paragraphs reveal the actual motivation. The piece is a reaction ^[1] to the most senior Marketing person at easyJet being moved out of his role, which is being abolished, and – as part of the same reorganisation – a Chief Data Officer (CDO) being appointed. Now the first thing to say, based on the article’s introductory comments, is that easyJet did not have a Chief Marketing Officer. The role that was abolished was instead Chief Commercial Officer, so there was no one charged full-time with Marketing anyway. The Marketing responsibilities previously supported part-time by the CCO have now been spread among other executives.

The next part of the article covers the views of a Marketing Week columnist (pause for irony) before moving on to arrangements for the management of data matters in three UK-based organisations:

Camelot – who run the UK National Lottery
Mumsnet – which is a web-site for UK parents
Flubit – a growing on-line marketplace aiming to compete with Amazon

The first two of these have CDOs (albeit with one doing the role alongside other responsibilities). Both of these people:

[…] come at data as people with backgrounds in its use in marketing

Flubit does not have a CDO, which is used as supporting evidence for the superfluous nature of the role ^[2].

Suffice it to say that a straw poll consisting of the handful of organisations that the journalist was able to get a comment from is not the most robust of approaches ^[3]. Most of the time, the article does nothing more than to reflect the continuing confusion about whether or not organisations need CDOs and – assuming that they do – what their remit should be and who they should report to ^[4].

But then, without it has to be said much supporting evidence, the piece goes on to add that:

Most [CDOs – they would probably style it “Cdos”] are brought in to instill a data strategy across the business; once that is done their role should no longer be needed.

Now as a Group Theoretician, I am a great fan of symmetry. Symmetry relates to properties that remain invariant when something else is changed. Archetypally, an equilateral triangle is still an equilateral triangle when rotated by 120° ^[5]. More concretely, the laws of motion work just fine if we wind the clock forward 10 seconds (which incidentally leads to the principle of conservation of energy ^[6]).

Let’s assume that the Marketing Week assertion is true. I claim therefore that it must be still be true under the symmetry of changing the C-level role. This would mean that the following also has to be true:

Most [Chief marketing officers] are brought in to instill a marketing strategy across the business; once that is done their role should no longer be needed.

Now maybe this statement is indeed true. However, I can’t really see the guys and gals at Marketing Week agreeing with this. So maybe it’s false instead. Then – employing reductio ad absurdum – the initial statement is also false ^[7].

If you don’t work in Marketing, then maybe a further transformation will convince you:

Most [Chief financial officers] are brought in to instill a finance strategy across the business; once that is done their role should no longer be needed.

I could go on, but this is already becoming as tedious to write as it was to read the original Marketing Week claim. The closing sentence of the article is probably its most revealing and informative:

[…] marketers must make sure they are leading [the data] agenda, or someone else will do it for them.

I will leave readers to draw their own conclusions on the merits of this piece and move on to other thoughts that reading it spurred in me.

…and Fugue

Sometimes buried in the strangest of places you can find something of value, even if the value is different to the intentions of the person who buried it. Around some of the CDO forums that I attend ^[8] there is occasionally talk about just the type of issue that Marketing Week raises. An historical role often comes up in these discussions is that of Chief Electrification Officer ^[9]. This supposedly was an Executive role in organisations as the 19th Century turned into the 20th and electricity grids began to be created. The person ostensibly filling this role would be responsible for shepherding the organisation’s transition from earlier forms of power (e.g. steam) to the new-fangled streams of electrons. Of course this role would be very important until the transition was completed, after that redundancy surely beckoned.

Well to my way of thinking, there are a couple of problems here. The first one of these is alluded to by my choice of the words “supposedly” and “ostensibly” above. I am not entirely sure, based on my initial research ^[10], that this role ever actually existed. All the references I can find to it are modern pieces comparing it to the CDO role, so perhaps it is apochryphal.

The second is somewhat related. Electrification was an engineering problem, indeed it the [US] National Academy of Engineering called it “the greatest engineering achievement of the 20th Century”. Surely the people tackling this would be engineers, potentially led by a Chief Engineer. Did the completion of electrification mean that there was no longer a need for engineers, or did they simply move on to the next engineering problem ^[11]?

Extending this analogy, I think that Chief Data Officers are more like Chief Engineers than Chief Electrification Officers, assuming that the latter even exists. Why the confusion? Well I think part of it is because, over the last decade and a bit, organisations have been conditioned to believe the one dimensional perspective that everything is a programme or a project ^[12]. I am less sure that this applies 100% to the CDO role.

It may well be that one thing that a CDO needs to get going is a data transformation programme. This may purely be focused on cultural aspects of how an organisation records, shares and otherwise uses data. It may be to build a new (or a first) Data Architecture. It may be to remediate issues with an existing Data Architecture. It may be to introduce or expand Data Governance. It may be to improve Data Quality. Or (and, in my experience, this is often the most likely) a combination of all these five, plus other work, such as rapid tactical or interim deliveries. However, there is also a large element of data-centric work which is not project-based and instead falls into the category often described as “business as usual” (I loathe this term – I think that Data Operations & Technology is preferable). A handful of examples are as follows (this is not meant to be an exhaustive list) ^[13]:

Addressing architectural debt that results from neglect of a Data Assets or the frequently deleterious impact of improperly governed change portfolios ^[14]. This is often a series of small to medium-sized changes, rather than a project with a discrete scope and start and end dates.
More positively, engaging proactively in the change process in an attempt to act as a steward of Data Assets.
Establishing a regular Data Audit.
Regular Data Management activities.
Providing tailored Analytics to help understand some unscheduled or unexpected event.
Establishment of a data “SWAT team” to respond to urgent architecture, quality or reporting needs.
Running a Data Governance committee and related activities.
Creating and managing a Data Science capability.
Providing help and advice to those struggling to use Data facilities.
Responding to new Data regulations.
Creating and maintaining a target operating model for Data and is use.
Supporting Data Services to aid systems integration.
Production of regular reports and refreshing self-serve Data Repositories.
Testing and re-testing of Data facilities subject to change or change in source Data.
Providing training in the use of Data facilities or the importance of getting Data right-first-time.

The above all point to the need for an ongoing Data Function to meet these needs (and to form the core resources of any data programme / project work). I describe such a function in my series about The Anatomy of a Data Function.

There are of course many other such examples, but instead of cataloguing each of them, let’s return to what Marketing Week describe as the central responsibility of a CDO, to formulate a Data Strategy. Surely this is a one-off activity, right?

Well is the Marketing strategy set once and then never changed? If there is some material shift in the overall Business strategy, might the Marketing strategy change as a result? What would be the impact on an existing Marketing strategy of insight showing that this was being less than effective; might this lead to the development of a new Marketing strategy? Would the Marketing strategy need to be revised to cater for new products and services, or new segments and territories? What would be the impact on the Marketing strategy of an acquisition or divestment?

As anyone who has spent significant time in the strategy arena will tell you, it is a fluid area. Things are never set in stone and strategies may need to be significantly revised or indeed abandoned and replaced with something entirely new as dictated by events. Strategy is not a fire and forget exercise, not if you want it to be relevant to your business today, as opposed to a year ago. Specifically with Data Strategy (as I explain in Building Momentum – How to begin becoming a Data-driven Organisation), I would recommend keeping it rather broad brush at the begining of its development, allowing it to be adpated based on feedback from initial interim work and thus ensuring it better meets business needs.

So expecting that a Data Strategy (or any other type of strategy) to be done and dusted, with the key strategist dispensed with, is probably rather naive.

Coda

It would be really nice to think that sorting out their Data problems and seizing their Data opportunities are things that organisations can do once and then forget about. With twenty years experience of helping organisations to become more Data-centric, often with technical matters firmly in the background, I have to disabuse people of this all too frequent misconception. To adapt the National Canine Defence League’s ^[15 long-lived slogan from 1978:

A Chief Data Officer is for life, not just for Christmas.

With that out of the way, I’m off to write a well-informed and insightful article about how Marketing Departments should go about their business. Wish me luck!

Notes

^[1]	I first wrote “knee-jerk reaction” and then thought that maybe I was being unkind. “When they go low, we go high” is a better maxim. Note: link opens a YouTube video.
^[2]	I am sure that I read somewhere about the importance of the number of data points in any analysis, maybe I should ask a Data Scientist to remind me about this.
^[3]	For a more balanced view of what real CDOs do, please take a look at my ongoing series of in-depth interviews.
^[4]	As discussed in: Wanted – Chief Data Officer Alphabet Soup A Sweeter Spot for the CDO? A truth universally acknowledged… The CDO – A Dilemma or The Next Big Thing? and Offence, Defence and the Top Data Job
^[5]	See Glimpses of Symmetry, Chapter 3 – Shifting Shapes for more on the properties of equilateral triangles.
^[6]	As demonstrated by Emmy Noether in 1915.
^[7]	At this point I think I am meant to say “Fake news! SAD!!!”
^[8]	The [informal] proceedings of some of these may be viewed at: 5 Themes from a Chief Data Officer Forum 5 More Themes from a Chief Data Officer Forum Themes from a Chief Data Officer Forum – the 180 day perspective The Chief Data Officer “Sweet Spot”
^[9]	Or Chief Electrical Officer, or Chief Electricity Officer.
^[10]	I am doing some more digging and will of course update this piece should I find the evidence that has so far been elusive.
^[11]	Self-driving electric cars come to mind of course. That or running a Starship.
^[12]	As an aside, where do Programme Managers go when (or should that be if) their Programmes finish?
^[13]	It might be argued that some of these operational functions could be handed to IT. However, given that some elements of data functions have probably been carved out of IT in the past, this might be a retrograde step.
^[14]	See Bumps in the Road.
^[15]	Now Dogs Trust.

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

Follow @peterjthomas

More Definitions in the Data and Analytics Dictionary

19 September 2018 Peter James Thomas big data, business analytics, business intelligence, chief data officer, dashboards, data governance, data management, data quality, data science, data visualisation, data warehousing, Statistics

The peterjamesthomas.com Data and Analytics Dictionary is an active document and I will continue to issue revised versions of it periodically. Here are 20 new definitions, including the first from other contributors (thanks Tenny!):

Artificial Intelligence Platform
Data Asset
Data Audit
Data Classification
Data Consistency
Data Controls
Data Curation (contributor: Tenny Thomas Soman)
Data Democratisation
Data Dictionary
Data Engineering
Data Ethics
Data Integrity
Data Lineage
Data Platform
Data Strategy
Data Wrangling (contributor: Tenny Thomas Soman)
Explainable AI (contributor: Tenny Thomas Soman)
Information Governance
Referential Integrity
Testing Data (Training Data)

Remember that The Dictionary is a free resource and quoting contents (ideally with acknowledgement) and linking to its entries (via the buttons provided) are both encouraged.

People are now also welcome to contribute their own definitions. You can use the comments section here, or the dedicated form. Submissions will be subject to editorial review and are not guaranteed to be accepted.

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

Follow @peterjthomas

Data Science Challenges – It’s Deja Vu all over again!

30 August 201824 October 2018 Peter James Thomas cultural transformation, data quality, data science Business Over Broadway, Kaggle, Yogi Berra

The rather famous tautology, “It’s déjà vu all over again”, has of course been ascribed to that darling of malapropisms, baseball catcher Yogi Berra ^[1]. The phrase came to mind for me today when coming across the following exhibit:

The text in the above exhibit is not that clear ^[2], so here are the 20 top challenges ^[3] faced by those running Data Science teams in human-readable form:

#	Challenge	Cited by
1	Dirty Data	35.9%
2	Lack of Data Science talent in the organization	30.2%
3	Company politics / Lack of management/financial support for a Data Science team	27.0%
4	The lack of a clear question to be answering or a clear direction to go with available data	22.1%
5	Unavailability of/difficult access to data	22.0%
6	Data Science results not used by business decision makers	17.7%
7	Explaining Data Science to others	16.0%
8	Privacy issues	14.4%
9	Lack of significant domain expert input	14.2%
10	Organization is small and cannot afford a Data Science team	13.0%
11	Team using multiple ad hoc development environments such as Python/R/Java etc.	12.7%
12	Limitations of tools	12.0%
13	Need to coordinate with IT	11.8%
14	Maintaining responsible expectations about the potential impact of Data Science projects	11.5%
15	Inability to integrate findings into organization’s decision-making process	9.8%
16	Lack of funds to buy useful datasets from external sources	9.6%
17	Difficulties in deployment/scoring	8.6%
18	Scaling Data Science solution up to full database	8.4%
19	Limitations in the state of the art in machine learning	7.7%
20	Did not instrument data useful for scientific analysis and decision-making	6.5%
21	I prefer not to say	4.8%
22	Other	2.9%

The table above is a transcription of a transcription, so it would be remarkable if no Data Quality issues had crept in, however let’s assume that the figures are robust enough for our purposes. Of course the people surveyed will have reported multiple issues, so the percentages above are not additive. Nevertheless there are some very obvious comments to be made (some of the above items are pertinent to more than one of the points I would like to make):

Data Quality / Availability remain major issues – (1, 5, and 8)
It is indeed true that Machine Learning can be quite good at dealing with some types or bad or missing data. But no technology or approach is going to be able to paper over all of the cracks if your data is essentially incomplete and of poor quality. This point (together with some others below) speaks to the need to not approach Data Science on a stand-alone basis, but as part of a more holistic approach to data matters ^[4].
The Human angle and a focus on Culture are imperative – (3, 6, 7, 14 and 15)
Findings are one thing; using these to take action is quite another. At the end of the day, most ventures are successful or fail because of people; the people conducting the venture, the people receiving its intended benefits and so on. Ignore this dimension of data work (or any type of work) at your peril ^[5].
Business Questions amd Business Involvement matter – (4, 6, 9 and 15)
While in some circumstances the data can indeed “speak for itself”, it makes a lot more sense for Data Scientists to partner with business colleagues to both get direction and to help ensure that their findings lead to action ^[6].
Tools & Technology typically Trumped – (11, 12 and 18)
These first appear outside of the Top 10 (and 11 is a bit dubious to include here – it relates more to a proliferation of tools than to issues with any of them). I would never say that tools and technology are unimportant, but they are typically much less important than other considerations ^[7].

The overriding point is of course that – much as I noted out recently in Convergent Evolution – there is little new under the Sun. A survey of Business Intelligence / Data Warehousing professionals back in 2010 would have generated something very like the list above. A survey of EIS ^[8] professionals back in 2000 would have done the same.

The important things to do – regardless of the technologies and approaches employed – are to:

Understand what questions are key to the running of an organisation ^[9]
Determine what data is available to support decisions in these key areas
Ensure that the data is in a “good enough” state, appropriately consolidated / made consistent, augmented / corrected by any useful external data and made available to the right people in a timely manner
Focus on the human aspects of acting on what data is telling us and how to use data outputs to drive positive actions

Here too, little is new under the Sun. I have been referring to essentially these same four pillars of good practice since the mid 2000s. Some of our technological advances since then have been amazing. The prospect of leveraging the power of both Data Science and Artificial Intelligence in a business context is very exciting. But to truly succeed with these newer approaches, it helps to recall the eternal verities that have always underpinned good data-centric work ^[10]. The survey above makes this point crystal clear.

A final corollary to this observation is something I covered in A truth universally acknowledged…. The replies to the Kaggle survey highlight the fact that, much like the conductor of an orchestra does not need to be able to play the violin to a virtuoso level, people leading Data Science teams (and broader Data Functions) need a set of rounded skills, ones honed to address the types of issues appearing in the exhibits above. The skill-set that makes for an excellent Data Scientist does not necessarily help so much with many of the less technical issues that will determine the success or failure of Data Science teams.

Notes

^[1]	Other Yogi-isms included, “Always go to other people’s funerals; otherwise they won’t go to yours”, “You can observe a lot by watching” and “If you can’t imitate him, don’t copy him”.
^[2]	A Data Visualisation challenge to include that much text I realise. I think I might have been tempted to come up with pithier categories to aid legibility.
^[3]	Ignoring “I prefer not to say” and “Other”.
^[4]	As laid out in my many articles about the importance of Cultural Transformation.
^[5]	See: Building Momentum – How to begin becoming a Data-driven Organisation.
^[6]	I make precisely this point in my recent interview for Venturi Voice (starting just after 31:38).
^[7]	I make this point most forcibly back in: A bad workman blames his [Business Intelligence] tools. The technology may be different, but the learnings are just as relevant today.
^[8]	Executive Information Systems for those of tender years.
^[9]	Machine learning techniques can clearly help here, but only if in concert with dialogue with people actually on the front-line and leading business areas.
^[10]	In your search for such eternal verities, you could do much worse than starting with: 20 Risks that Beset Data Programmes.

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

Follow @peterjthomas

Convergent Evolution

18 August 201810 April 2019 Peter James Thomas Biology, business analytics, business intelligence, dashboards, data architecture, data quality, data science, data visualisation, data warehousing, Statistics dolphin, ichthyosaur

No this article has not escaped from my Maths & Science section, it is actually about data matters. But first of all, channeling Jennifer Aniston ^[1], “here comes the Science bit – concentrate”.

Shared Shapes

The Theory of Common Descent holds that any two organisms, extant or extinct, will have a common ancestor if you roll the clock back far enough. For example, each of fish, amphibians, reptiles and mammals had a common ancestor over 500 million years ago. As shown below, the current organism which is most like this common ancestor is the Lancelet ^[2].

To bring things closer to home, each of the Great Apes (Orangutans, Gorillas, Chimpanzees, Bonobos and Humans) had a common ancestor around 13 million years ago.

So far so simple. As one would expect, animals sharing a recent common ancestor would share many attributes with both it and each other.

Convergent Evolution refers to something else. It describes where two organisms independently evolve very similar attributes that were not features of their most recent common ancestor. Thus these features are not inherited, instead evolutionary pressure has led to the same attributes developing twice. An example is probably simpler to understand.

The image at the start of this article is of an Ichthyosaur (top) and Dolphin. It is striking how similar their body shapes are. They also share other characteristics such as live birth of young, tail first. The last Ichthyosaur died around 66 million years ago alongside many other archosaurs, notably the Dinosaurs ^[3]. Dolphins are happily still with us, but the first toothed whale (not a Dolphin, but probably an ancestor of them) appeared around 30 million years ago. The ancestors of the modern Bottlenose Dolphins appeared a mere 5 million years ago. Thus there is tremendous gap of time between the last Ichthyosaur and the proto-Dolphins. Ichthyosaurs are reptiles, they were covered in small scales ^[4]. Dolphins are mammals and covered in skin not massively different to our own. The most recent common ancestor of Ichthyosaurs and Dolphins probably lived around quarter of a billion years ago and looked like neither of them. So the shape and other attributes shared by Ichthyosaurs and Dolphins do not come from a common ancestor, they have developed independently (and millions of years apart) as adaptations to similar lifestyles as marine hunters. This is the essence of Convergent Evolution.

That was the Science, here comes the Technology…

A Brief Hydrology of Data Lakes

From 2000 to 2015, I had some success ^[5] with designing and implementing Data Warehouse architectures much like the following:

As a lot of my work then was in Insurance or related fields, the Analytical Repositories tended to be Actuarial Databases and / or Exposure Management Databases, developed in collaboration with such teams. Even back then, these were used for activities such as Analytics, Dashboards, Statistical Modelling, Data Mining and Advanced Visualisation.

Overlapping with the above, from around 2012, I began to get involved in also designing and implementing Big Data Architectures; initially for narrow purposes and later Data Lakes spanning entire enterprises. Of course some architectures featured both paradigms as well.

One of the early promises of a Data Lake approach was that – once all relevant data had been ingested – this would be directly leveraged by Data Scientists to derive insight.

Over time, it became clear that it would be useful to also have some merged / conformed and cleansed data structures in the Data Lake. Once the output of Data Science began to be used to support business decisions, a need arose to consider how it could be audited and both data privacy and information security considerations also came to the fore.

Next, rather than just being the province of Data Scientists, there were moves to use Data Lakes to support general Data Discovery and even business Reporting and Analytics as well. This required additional investments in metadata.

The types of issues with Data Lake adoption that I highlighted in Draining the Swamp earlier this year also led to the advent of techniques such as Data Curation ^[6]. In parallel, concerns about expensive Data Science resource spending 80% of their time in Data Wrangling ^[7] led to the creation of a new role, that of Data Engineer. These people take on much of the heavy lifting of consolidating, fixing and enriching datasets, allowing the Data Scientists to focus on Statistical Analysis, Data Mining and Machine Learning.

All of which leads to a modified Big Data / Data Lake architecture, embodying people and processes as well as technology and looking something like the exhibit above.

This is where the observant reader will see the concept of Convergent Evolution playing out in the data arena as well as the Natural World.

In Closing

Lest it be thought that I am saying that Data Warehouses belong to a bygone era, it is probably worth noting that the archosaurs, Ichthyosaurs included, dominated the Earth for orders of magnitude longer that the mammals and were only dethroned by an asymmetric external shock, not any flaw their own finely honed characteristics.

Also, to be crystal clear, much as while there are similarities between Ichthyosaurs and Dolphins there are also clear differences, the same applies to Data Warehouse and Data Lake architectures. When you get into the details, differences between Data Lakes and Data Warehouses do emerge; there are capabilities that each has that are not features of the other. What is undoubtedly true however is that the same procedural and operational considerations that played a part in making some Warehouses seem unwieldy and unresponsive are also beginning to have the same impact on Data Lakes.

If you are in the business of turning raw data into actionable information, then there are inevitably considerations that will apply to any technological solution. The key lesson is that shape of your architecture is going to be pretty similar, regardless of the technical underpinnings.

Notes

^[1]	The two of us are constantly mistaken for one another.
^[2]	To be clear the common ancestor was not a Lancelet, rather Lancelets sit on the branch closest to this common ancestor.
^[3]	Ichthyosaurs are not Dinosaurs, but a different branch of ancient reptiles.
^[4]	This is actually a matter of debate in paleontological circles, but recent evidence suggests small scales.
^[5]	See: Financial Sector Technology Award Cognos UK Award and EMIR User Feedback
^[6]	A term that is unaccountably missing from The Data & Analytics Dictionary – something to add to the next release. UPDATE: Now remedied here.
^[7]	Ditto. UPDATE: Now remedied here

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

Follow @peterjthomas

Fact-based Decision-making

13 August 201813 August 2018 Peter James Thomas business analytics, data quality, data science, Mathematics & Science, Statistics cassablanca, dragnet

This article is about facts. Facts are sometimes less solid than we would like to think; sometimes they are downright malleable. To illustrate, consider the fact that in 98 episodes of Dragnet, Sergeant Joe Friday never uttered the words “Just the facts Ma’am”, though he did often employ the variant alluded to in the image above ^[1]. Equally, Rick never said “Play it again Sam” in Casablanca ^[2] and St. Paul never suggested that “money is the root of all evil” ^[3]. As Michael Caine never said in any film, “not a lot of people know that” ^[4].

Up-front Acknowledgements

These normally appear at the end of an article, but it seemed to make sense to start with them in this case:

Recently I published Building Momentum – How to begin becoming a Data-driven Organisation. In response to this, one of my associates, Olaf Penne, asked me about my thoughts on fact-base decision-making. This piece was prompted by both Olaf’s question and a recent article by my friend Neil Raden on his Silicon Angle blog, Performance management: Can you really manage what you measure? Thanks to both Olaf and Neil for the inspiration.

Fact-based decision making. It sounds good doesn’t it? Especially if you consider the alternatives: going on gut feel, doing what you did last time, guessing, not taking a decision at all. However – as is often the case with issues I deal with on this blog – fact-based decision-making is easier to say than it is to achieve. Here I will look to cover some of the obstacles and suggest a potential way to navigate round them. Let’s start however with some definitions.

Fact	NOUN A thing that is known or proved to be true. (Oxford Dictionaries)
Decision	NOUN A conclusion or resolution reached after consideration. (Oxford Dictionaries)

So one can infer that fact-based decision-making is the process of reaching a conclusion based on consideration of things that are known to be true. Again, it sounds great doesn’t it? It seems that all you have to do is to find things that are true. How hard can that be? Well actually quite hard as it happens. Let’s cover what can go wrong (note: this section is not intended to be exhaustive, links are provided to more in-depth articles where appropriate):

Accuracy of Data that is captured

A number of factors can play into the accuracy of data capture. Some systems (even in 2018) can still make it harder to capture good data than to ram in bad. Often an issue may also be a lack of master data definitions, so that similar data is labelled differently in different systems.

A more pernicious problem is combinatorial data accuracy, two data items are both valid, but not in combination with each other. However, often the biggest stumbling block is a human one, getting people to buy in to the idea that the care and attention they pay to data capture will pay dividends later in the process.

These and other areas are covered in greater detail in an older article, Using BI to drive improvements in data quality.

Honesty of Data that is captured

Data may be perfectly valid, but still not represent reality. Here I’ll let Neil Raden point out the central issue in his customary style:

People find the most ingenious ways to distort measurement systems to generate the numbers that are desired, not only NOT providing the desired behaviors, but often becoming more dysfunctional through the effort.

[…] voluntary compliance to the [US] tax code encourages a national obsession with “loopholes”, and what salesman hasn’t “sandbagged” a few deals for next quarter after she has met her quota for the current one?

Where there is a reward to be gained or a punishment to be avoided, by hitting certain numbers in a certain way, the creativeness of humans often comes to the fore. It is hard to account for such tweaking in measurement systems.

Timing issues with Data

Timing is often problematic. For example, a transaction completed near the end of a period gets recorded in the next period instead, one early in a new period goes into the prior period, which is still open. There is also (as referenced by Neil in his comments above) the delayed booking of transactions in order to – with the nicest possible description – smooth revenues. It is not just hypothetical salespeople who do this of course. Entire organisations can make smoothing adjustments to their figures before publishing and deferral or expedition of obligations and earnings has become something of an art form in accounting circles. While no doubt most of this tweaking is done with the best intentions, it can compromise the fact-based approach that we are aiming for.

Reliability with which Data is moved around and consolidated

In our modern architectures, replete with web-services, APIs, cloud-based components and the quasi-instantaneous transmission of new transactions, it is perhaps not surprising that occasionally some data gets lost in translation ^[5] along the way. That is before data starts to be Sqooped up into Data Lakes, or other such Data Repositories, and then otherwise manipulated in order to derive insight or provide regular information. All of these are processes which can introduce their own errors. Suffice it to say that transmission, collation and manipulation of data can all reduce its accuracy.

Again see Using BI to drive improvements in data quality for further details.

Pertinence and fidelity of metrics developed from Data

Here we get past issues with data itself (or how it is handled and moved around) and instead consider how it is used. Metrics are seldom reliant on just one data element, but are often rather combinations. The different elements might come in because a given metric is arithmetical in nature, e.g.

$\text{Metric X} = \dfrac{\text{Data Item A}+\text{Data Item B}}{\text{Data Item C}}$

Choices are made as to how to construct such compound metrics and how to relate them to actual business outcomes. For example:

$\text{New Biz Growth} = \dfrac{(\text{Sales CYTD}-\text{Repeat CYTD})-(\text{Sales PYTD}-\text{Repeat PYTD})}{(\text{Sales PYTD}-\text{Repeat PYTD})}$

Is this a good way to define New Business Growth? Are there any weaknesses in this definition, for example is it sensitive to any glitches in – say – the tagging of Repeat Business? Do we need to take account of pricing changes between Repeat Business this year and last year? Is New Business Growth something that is even worth tracking; what will we do as a result of understanding this?

The above is a somewhat simple metric, in a section of Using historical data to justify BI investments – Part I, I cover some actual Insurance industry metrics that build on each other and are a little more convoluted. The same article also considers how to – amongst other things – match revenue and outgoings when the latter are spread over time. There are often compromises to be made in defining metrics. Some of these are based on the data available. Some relate to inherent issues with what is being measured. In other cases, a metric may be a best approximation to some indication of business health; a proxy used because that indication is not directly measurable itself. In the last case, staff turnover may be a proxy for staff morale, but it does not directly measure how employees are feeling (a competitor might be poaching otherwise happy staff for example).

Robustness of extrapolations made from Data

I have used the above image before in these pages ^[6]. The situation it describes may seem farcical, but it is actually not too far away from some extrapolations I have seen in a business context. For example, a prediction of full-year sales may consist of this year’s figures for the first three quarters supplemented by prior year sales for the final quarter. While our metric may be better than nothing, there are some potential distortions related to such an approach:

Repeat business may have fallen into Q4 last year, but was processed in Q3 this year. This shift in timing would lead to such business being double-counted in our year end estimate.
Taking point 1 to one side, sales may be growing or contracting compared to the previous year. Using Q4 prior year as is would not reflect this.
It is entirely feasible that some market event occurs this year ( for example the entrance or exit of a competitor, or the launch of a new competitor product) which would render prior year figures a poor guide.

Of course all of the above can be adjusted for, but such adjustments would be reliant on human judgement, making any projections similarly reliant on people’s opinions (which as Neil points out may be influenced, conciously or unconsciously, by self-interest). Where sales are based on conversions of prospects, the quantum of prospects might be a more useful predictor of Q4 sales. However here a historical conversion rate would need to be calculated (or conversion probabilities allocated by the salespeople involved) and we are back into essentially the same issues as catalogued above.

I explore some similar themes in a section of Data Visualisation – A Scientific Treatment

Integrity of statistical estimates based on Data

Having spent 18 years working in various parts of the Insurance industry, statistical estimates being part of the standard set of metrics is pretty familiar to me ^[7]. However such estimates appear in a number of industries, sometimes explicitly, sometimes implicitly. A clear parallel would be credit risk in Retail Banking, but something as simple as an estimate of potentially delinquent debtors is an inherently statistical figure (albeit one that may not depend on the output of a statistical model).

The thing with statistical estimates is that they are never a single figure but a range. A model may for example spit out a figure like £12.4 million ± £0.5 million. Let’s unpack this.

Well the output of the model will probably be something analogous to the above image. Here a distribution has been fitted to the business event being modelled. The central point of this (the one most likely to occur according to the model) is £12.4 million. The model is not saying that £12.4 million is the answer, it is saying it is the central point of a range of potential figures. We typically next select a symmetrical range above and below the central figure such that we cover a high proportion of the possible outcomes for the figure being modelled; 95% of them is typical ^[8]. In the above example, the range extends plus £0. 5 million above £12.4 million and £0.5 million below it (hence the ± sign).

Of course the problem is then that Financial Reports (or indeed most Management Reports) are not set up to cope with plus or minus figures, so typically one of £12.4 million (the central prediction) or £11.9 million (the most conservative estimate ^[9]) is used. The fact that the number itself is uncertain can get lost along the way. By the time that people who need to take decisions based on such information are in the loop, the inherent uncertainty of the prediction may have disappeared. This can be problematic. Suppose a real result of £12.4 million sees an organisation breaking even, but one of £11.9 million sees a small loss being recorded. This could have quite an influence on what course of action managers adopt ^[10]; are they relaxed, or concerned?

Beyond the above, it is not exactly unheard of for statistical models to have glitches, sometimes quite big glitches ^[11].

This segment could easily expand into a series of articles itself. Hopefully I have covered enough to highlight that there may be some challenges in this area.

And so what?

Even if we somehow avoid all of the above pitfalls, there remains one booby-trap that is likely to snare us, absent the necessary diligence. This was alluded to in the section about the definition of metrics:

Is New Business Growth something that is even worth tracking; what will we do as a result of understanding this?

Unless a reported figure, or output of a model, leads to action being taken, it is essentially useless. Facts that never lead to anyone doing anything are like lists learnt by rote at school and regurgitated on demand parrot-fashion; they demonstrate the mechanism of memory, but not that of understanding. As Neil puts it in his article:

[…] technology is never a solution to social problems, and interactions between human beings are inherently social. This is why performance management is a very complex discipline, not just the implementation of dashboard or scorecard technology.

How to Measure the Unmeasurable

Our dream of fact-based decision-making seems to be crumbling to dust. Regular facts are subject to data quality issues, or manipulation by creative humans. As data is moved from system to system and repository to repository, the facts can sometimes acquire an “alt-” prefix. Timing issues and the design of metrics can also erode accuracy. Then there are many perils and pitfalls associated with simple extrapolation and less simple statistical models. Finally, any fact that manages to emerge from this gantlet ^[12] unscathed may then be totally ignored by those whose actions it is meant to guide. What can be done?

As happens elsewhere on this site, let me turn to another field for inspiration. Not for the first time, let’s consider what Science can teach us about dealing with such issues with facts. In a recent article ^[13] in my Maths & Science section, I examined the nature of Scientific Theory and – in particular – explored the imprecision inherent in the Scientific Method. Here is some of what I wrote:

It is part of the nature of scientific theories that (unlike their Mathematical namesakes) they are not “true” and indeed do not seek to be “true”. They are models that seek to describe reality, but which often fall short of this aim in certain circumstances. General Relativity matches observed facts to a greater degree than Newtonian Gravity, but this does not mean that General Relativity is “true”, there may be some other, more refined, theory that explains everything that General Relativity does, but which goes on to explain things that it does not. This new theory may match reality in cases where General Relativity does not. This is the essence of the Scientific Method, never satisfied, always seeking to expand or improve existing thought.

I think that the Scientific Method that has served humanity so well over the centuries is applicable to our business dilemma. In the same way that a Scientific Theory is never “true”, but instead useful for explaining observations and predicting the unobserved, business metrics should be judged less on their veracity (though it would be nice if they bore some relation to reality) and instead on how often they lead to the right action being taken and the wrong action being avoided. This is an argument for metrics to be simple to understand and tied to how decision-makers actually think, rather than some other more abstruse and theoretical definition.

A proxy metric is fine, so long as it yields the right result (and the right behaviour) more often than not. A metric with dubious data quality is still useful if it points in the right direction; if the compass needle is no more than a few degrees out. While of course steps that improve the accuracy of metrics are valuable and should be undertaken where cost-effective, at least equal attention should be paid to ensuring that – when the metric has been accessed and digested – something happens as a result. This latter goal is a long way from the arcana of data lineage and metric definition, it is instead the province of human psychology; something that the accomploished data professional should be adept at influencing.

I have touched on how to positively modify human behaviour in these pages a number of times before ^[14]. It is a subject that I will be coming back to again in coming months, so please watch this space.

Further reading on this subject:

Using BI to drive improvements in data quality

Accuracy

Notes

^[1]

According to Snopes, the phrase arose from a spoof of the series.

^[2]

The two pertinent exchanges were instead:

Ilsa: Play it once, Sam. For old times’ sake.

Sam: I don’t know what you mean, Miss Ilsa.

Ilsa: Play it, Sam. Play “As Time Goes By”

Sam: Oh, I can’t remember it, Miss Ilsa. I’m a little rusty on it.

Ilsa: I’ll hum it for you. Da-dy-da-dy-da-dum, da-dy-da-dee-da-dum…

Ilsa: Sing it, Sam.

and

Rick: You know what I want to hear.

Sam: No, I don’t.

Rick: You played it for her, you can play it for me!

Sam: Well, I don’t think I can remember…

Rick: If she can stand it, I can! Play it!

^[3]

Though he, or whoever may have written the first epistle to Timothy, might have condemned the “love of money”.

^[4]

The origin of this was a Peter Sellers interview in which he impersonated Caine.

^[5]

One of my Top Ten films.

^[6]

Especially for all Business Analytics professionals out there (2009).

^[7]

See in particular my trilogy:

^[8]

Without getting into too many details, what you are typically doing is stating that there is a less than 5% chance that the measurements forming model input match the distribution due to a fluke; but this is not meant to be a primer on null hypotheses.

^[9]

Of course, depending on context, £12.9 million could instead be the most conservative estimate.

^[10]

This happens a lot in election polling. Candidate A may be estimated to be 3 points ahead of Candidate B, but with an error margin of 5 points, it should be no real surprise when Candidate B wins the ballot.

^[11]

Try googling Nobel Laureates Myron Scholes and Robert Merton and then look for references to Long-term Capital Management.

^[12]

Yes I meant “gantlet” that is the word in the original phrase, not “gauntlet” and so connections with gloves are wide of the mark.

^[13]

Finches, Feathers and Apples (2018).

^[14]

For example:

From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

Follow @peterjthomas

	Welcome to: peterjamesthomas.com, a site which covers my thoughts on the confluence of business, technology and change.
	[Web-site Contact]
	[Creative Commons]
	[Report Problems]

Peter James Thomas

Data & Analytics: Consultancy, Interim Services and Research

data quality

An expanded and more mobile-friendly version of the Data & Analytics Dictionary

Like this:

Put our Knowledge and Writing Skills to Work for you

Like this:

The latest edition of The Data & Analytics Dictionary is now out

Like this:

Why do data migration projects have such a high failure rate?

Like this:

A Simple Data Capability Framework

Like this:

The Chief Marketing Officer and the CDO – A Modern Fable

Like this:

More Definitions in the Data and Analytics Dictionary

Like this:

Data Science Challenges – It’s Deja Vu all over again!

Like this:

Convergent Evolution

Like this:

Fact-based Decision-making

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this: