The revised and expanded Data and Analytics Dictionary

The Data and Analytics Dictionary

Since its launch in August of this year, the peterjamesthomas.com Data and Analytics Dictionary has received a welcome amount of attention with various people on different social media platforms praising its usefulness, particularly as an introduction to the area. A number of people have made helpful suggestions for new entries or improvements to existing ones. I have also been rounding out the content with some more terms relating to each of Data Governance, Big Data and Data Warehousing. As a result, The Dictionary now has over 80 main entries (not including ones that simply refer the reader to another entry, such as Linear Regression, which redirects to Model).

The most recently added entries are as follows:

  1. Anomaly Detection
  2. Behavioural Analytics
  3. Complex Event Processing
  4. Data Discovery
  5. Data Ingestion
  6. Data Integration
  7. Data Migration
  8. Data Modelling
  9. Data Privacy
  10. Data Repository
  11. Data Virtualisation
  12. Deep Learning
  13. Flink
  14. Hive
  15. Information Security
  16. Metadata
  17. Multidimensional Approach
  18. Natural Language Processing (NLP)
  19. On-line Transaction Processing
  20. Operational Data Store (ODS)
  21. Pig
  22. Sentiment Analysis
  23. Table
  24. Text Analytics
  25. View

It is my intention to continue to revise this resource. Adding some more detail about Machine Learning and related areas is probably the next focus.

As ever, ideas for what to include next would be more than welcome (any suggestions used will also be acknowledged).
 


 


 

Ever Tried? Ever Failed?

Regular readers may recall my March 2017 article [1], which started by exploring failure rates of Big Data implementations. In this, amongst other facts, we learnt that between a half and two-thirds of a range of major business transformations fail to deliver lasting value [2]. After recently reading a pair of Harvard Business Review articles from back in 2016 [3], I can now add Analytics initiatives to that list. Here is a salient quote from the second article:

Only a little more than one in three of the three-dozen companies that we studied met the objectives of their analytics initiatives over the long term. Clearly, driving major innovations with analytics was harder than many executives expected.

Once more we see what appears to be a fundamental constant emerge: around 60% of most major business endeavours cannot be classified as unqualified successes. I feel that we should come up with a name for this figure and ideally use a Greek letter to denote it, maybe φ, which is as close to “F” for failure as the Greek alphabet gets [4].

Unbalanced C-suite

The authors based their study on 20 years of research spanning 36 client companies. They drew a surprising conclusion:

Efforts to adopt analytics upset the balance of power in the C-suite, and this shift often had a negative impact on analytics initiatives.

As ever (and as indeed I concluded in my previous article), reasons for failure have little to do with technology and everything to do with humans and how they interact with each other. This is one of the reasons I get incensed by Analytics teams saying things like “the business didn’t know what they wanted” or “adoption wasn’t strong enough” when their programmes fail.

For a start, Analytics is a business discipline and the Analytics team should view themselves as a business team. Second, to me it is pretty clear that a core activity for such teams is working with stakeholders to form an appreciation of their products or services, their competitive landscape, the markets they operate in, their day-to-day challenges and, on top of all this, what they want from data; even if this requires some teasing out (e.g. spending time shadowing people or using mock-ups or prototypes to show the art of the possible). Also Analytics teams must take accountability for driving adoption themselves, rather than assuming that someone else will deal with this, or worse, that “if we build it, they will come” [5].

Handshake

The C-suite aspect is tougher, but in my own work I try to spend time with Executives to understand their world views and to make sure I align what I am doing with their priorities. Building relationships here can help to reduce the likelihood of Executive strife impacting on an Analytics programme. However, I do also agree with the authors that the CEO has a key role to play here in ensuring that his or her team embrace becoming a data-driven organisation, even if this means changes in roles and responsibilities for some.

I’d encourage readers to take a look at the original HBR material; it contains a number of other pertinent observations above and beyond the ones I have highlighted here. When either looking to prevent issues from arising, or trying to mitigate them once they do, my article, 20 Risks that Beset Data Programmes, can also be a useful reference.

Beyond this, my simplest advice is to always remember the human angle in any Analytics programme. This is more likely to determine success or failure than technical excellence, or embracing the latest and greatest Data Visualisation or Analysis tools [6].
 


 
Notes

 
[1]
 
Ideas for avoiding Big Data failures and for dealing with them if they happen.

This also includes a quote from Samuel Beckett, which provided the inspiration for the title of this article.

 
[2]
 
The specifics were Big Data implementations, Data Warehousing, ERP systems and Mergers and Acquisitions; please see the earlier article for the source of the figures.

To this you could add any number of technology-based programmes, such as CRM implementations, Digital Transformation and even outsourcing. The main message is that doing such things successfully is hard.

 
[3]
 
The articles are:

  1. How CEOs Can Keep Their Analytics Programs from Being a Waste of Time
  2. The Reason So Many Analytics Efforts Fall Short

— by Chris McShea, Dan Oakley and Chris Mazzei, all from EY.

 
[4]
 
No doubt φ can be shown to be a transcendental number that can be linked to π, e and i by some elegant formula.

Rather annoyingly, φ is already the label we attach to the Golden Ratio, or (1 + √5)/2, but maybe I can repurpose this as I did π back in A quantised approach to formal group interactions of hominidae (size > 2).

 
[5]
 
Also see Ideas for avoiding Big Data failures and for dealing with them if they happen for the provenance of this misquote.
 
[6]
 
See also: A bad workman blames his [Business Intelligence] tools, which is as pertinent today as when I wrote it back in 2009.

 


 

Hurricanes and Data Visualisation: Part I(b) – The Mona Lisa

La Gioconda – by Leonardo da Vinci
(painted some time between 1503 and 1506)

The first half of my planned thoughts on Hurricanes and Data Visualisation is called Rainbow’s Gravity and was published earlier this week. Part two, Map Reading, is in post-production and will appear some time next week. Here is an unplanned post slotting into the gap between the two.
 
 
The image above is iconic enough to require no introduction. In response to my article about the use of a rainbow palette, Quora user Hyunjun Ji decided to illustrate the point using this famous painting. Here is the Mona Lisa rendered using a rainbow colour map:

Mona Lisa Rainbow

Here is the same image using the viridis colormap [1]:

Mona Lisa Viridis

The difference in detail conveyed between these two images is vast. I’ll let Hyunjun explain in his own words [2]:

In these images, the rainbow color map might look colorful, but for example, if you take a look at the neck and forehead, you observe a very rapid red to green color change.

Another thing about the rainbow colormap is that it is not uniform, especially in terms of brightness. When you go from small to large data, its brightness does not monotonically increase or decrease. Instead, it goes up and down, confusing human perception.

To emphasise his point, Hyunjun then converted the rainbow Mona Lisa back to greyscale; this final image really brings home how much information is lost by adopting a rainbow palette.

Mona Lisa Rainbow Greyscale
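Both of Hyunjun’s observations can be checked numerically. The following is a minimal sketch (in Python with matplotlib and NumPy, a representative choice rather than anything used in Hyunjun’s original analysis) that samples the rainbow-style “jet” colour map and viridis, then plots an approximate perceived brightness for each; jet’s curve rises and falls as data values increase, while viridis climbs steadily:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample both colour maps at 256 evenly spaced data values
values = np.linspace(0.0, 1.0, 256)
jet_rgb = plt.get_cmap("jet")(values)[:, :3]          # drop the alpha channel
viridis_rgb = plt.get_cmap("viridis")(values)[:, :3]

# Approximate perceived brightness using Rec. 709 luma weights
def luminance(rgb):
    return rgb @ np.array([0.2126, 0.7152, 0.0722])

plt.plot(values, luminance(jet_rgb), label="jet (rainbow)")
plt.plot(values, luminance(viridis_rgb), label="viridis")
plt.xlabel("Data value")
plt.ylabel("Approximate luminance")
plt.legend()
plt.show()
```

Converting an image to greyscale amounts to keeping only this luminance, which is why the greyscale rainbow Mona Lisa above loses so much detail.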

Hyunjun’s points were striking enough for me to want to share them with a wider audience and I thank him for providing this pithy insight.
 


 
Notes

 
[1]
 
viridis is an add-in package for the R statistical language, based on a colour map originally developed for Python; see https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html.

According to its creators, viridis is designed to be:

  • Colorful, spanning as wide a palette as possible so as to make differences easy to see,
  • Perceptually uniform, meaning that values close to each other have similar-appearing colors and values far away from each other have more different-appearing colors, consistently across the range of values,
  • Robust to colorblindness, so that the above properties hold true for people with common forms of colorblindness, as well as in grey scale printing, and
  • Pretty, oh so pretty
 
[2]
 
Note also that the Mona Lisa idea comes from a presentation by the creators of viridis, Stéfan van der Walt and Nathaniel Smith.

 


 

Hurricanes and Data Visualisation: Part I – Rainbow’s Gravity

The Gravity of Rainbows

This is the first of two articles whose genesis was the nexus of hurricanes and data visualisation. The second article, Part II – Map Reading, is forthcoming.
 
 
Introduction

This first article is not a critique of Thomas Pynchon‘s celebrated work; instead it refers to a grave malady that can afflict otherwise healthy data visualisations: the use and abuse of rainbow colours. This is an area that some data visualisation professionals can get somewhat hot under the collar about; there is even a Twitter hashtag devoted to opposing this colour choice, #endtherainbow.

Hurricane Irma

The [mal-] practice has come under additional scrutiny in recent weeks due to the major meteorological events causing so much damage and even loss of life in the Caribbean and southern US: hurricanes Harvey and Irma. Of course the most salient point about these two megastorms is their destructive capability. However, the observations that data visualisers make about how information about hurricanes is conveyed do carry some weight in two areas: how the public perceives these phenomena and how it perceives scientific findings in general [1]. The issues at stake are ones of both clarity and inclusiveness. Some in the data visualisation community felt that salt was rubbed into the wound when the US National Weather Service, avid users of rainbows [2], had to add another colour to their normal palette for Harvey:

NWS Harvey

In 2015, five scientists collectively wrote a letter to Nature entitled “Scrap rainbow colour scales” [3]. In this they state:

It is time to clamp down on the use of misleading rainbow colour scales that are increasingly pervading the literature and the media. Accurate graphics are key to clear communication of scientific results to other researchers and the public — an issue that is becoming ever more important.

© NPG. Used under license 4186731223352 Copyright Clearance Center

At this point I have to admit to using rainbow colour schemes myself professionally and personally [4]; it is often the path of least resistance. I do however think that the #endtherainbow advocates have a point, one that I will try to illustrate below.
 
 
Many Marvellous Maps

Let’s start by introducing the idyllic coastal county of Thomasshire, a map of which appears below:

Coastal Map 1

Of course this is a cartoon map; it might be more typical to start with an actual map from Google Maps or some other provider [5], but this doesn’t matter to the argument we will construct here. Let’s suppose that – rather than anything as potentially catastrophic as a hurricane – the challenge is simply to record the rainfall due to a nasty storm that passed through this shire [6]. Based on readings from various weather stations (augmented perhaps by information drawn from radar), rainfall data would be captured and used to build up a rain contour map, much like the elevation contour maps that many people will recall from Geography lessons at school [7].
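As an aside, here is one way such a contour grid might be built up from scattered station readings. This is a minimal sketch in Python using SciPy’s griddata; the station coordinates and rainfall figures are invented purely for illustration:

```python
import numpy as np
from scipy.interpolate import griddata

# Invented readings from five weather stations: (x, y) positions in km,
# with total rainfall in cm recorded at each
stations = np.array([[2, 3], [8, 1], [5, 7], [9, 8], [1, 9]])
rainfall = np.array([12.0, 34.0, 55.0, 20.0, 8.0])

# Interpolate the point readings onto a regular grid covering the county;
# contour lines then simply link grid cells with equal rainfall
grid_x, grid_y = np.meshgrid(np.linspace(0, 10, 200),
                             np.linspace(0, 10, 200))
grid_rain = griddata(stations, rainfall, (grid_x, grid_y), method="cubic")
```

The choice of palette then determines how this grid is turned into the coloured bands shown in the maps that follow.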

If we were to adopt a rainbow colour scheme, then such a map might look something like the one shown below:

Coastal Map 2

Here all areas coloured purple will have received between 0 and 10 cm of rain, blue between 10 and 20 cm of rain and so on.

At this point I apologise to any readers who suffer from migraine. An obvious drawback of this approach is how garish it is. Also, the solid colours block out details of the underlying map. Well, something can be done about both of these issues by making the contour colours transparent. This both tones them down and allows map details to remain at least semi-visible. This gets us a new map:

Coastal Map 3
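In charting terms, this softening is often a single parameter. Continuing the illustrative grid from the sketch above (and assuming matplotlib), the alpha argument controls how transparent the filled contour bands are:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# The underlying county map would be drawn first, e.g. via ax.imshow(...)

# Discrete 10 cm rainfall bands; alpha=0.4 lets the base map show through
levels = list(range(0, 81, 10))
bands = ax.contourf(grid_x, grid_y, grid_rain, levels=levels,
                    cmap="jet", alpha=0.4)
fig.colorbar(bands, label="Rainfall (cm)")
plt.show()
```

Here “jet” is matplotlib’s rainbow-style palette, so this reproduces the toned-down rainbow map above.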

Here we get into the core of the argument about the suitability of a rainbow palette. Again quoting from the Nature letter:

[…] spectral-type colour palettes can introduce false perceptual thresholds in the data (or hide genuine ones); they may also mask fine detail in the data. These palettes have no unique perceptual ordering, so they can de-emphasize data extremes by placing the most prominent colour near the middle of the scale.

[…]

Journals should not tolerate poor visual communication, particularly because better alternatives to rainbow scales are readily available (see NASA Earth Observatory).

© NPG. Used under license 4186731223352 Copyright Clearance Center

In our map, what we are looking to do is to show increasing severity of the deluge as we pass from purple (indigo / violet) up to red. But the ROYGBIV [8] colours of the spectrum are ill-suited to this. Our eyes react differently to different colours and will not immediately infer the gradient in rainfall that the image is aiming to convey. The NASA article the authors cite above uses a picture to paint a thousand words:

NASA comparison of colour palettes
Compared to a monochromatic or grayscale palette the rainbow palette tends to accentuate contrast in the bright cyan and yellow regions, but blends together through a wide range of greens.
Sourced from NASA

Another salient point is that a relatively high proportion of people suffer from one or other of the various forms of colour blindness [9]. Even the most tastefully pastel rainbow chart will disadvantage such people seeking to derive meaning from it.
 
 
Getting Over the Rainbow

So what could be another approach? Well one idea is to show gradients of whatever the diagram is tracking using gradients of colour; this is the essence of the NASA recommendation. I have attempted to do just this in the next map.

Coastal Map 4

I chose a bluey-green tone both as it was to hand in the Visio palette I was using and also to avoid confusion with the blue sea (more on this later). Rather than different colours, the idea is to map intensity of rainfall to intensity of colour. This should address both colour-blindness issues and the problems mentioned above with discriminating between ROYGBIV colours. I hope that readers will agree that it is easier to grasp what is happening at a glance when looking at this chart than in the ones that preceded it.
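The equivalent change in code is simply a different choice of colour map: a sequential, single-hue scale running from near-white up to a saturated blue-green, so that darker means wetter. A hedged sketch, again reusing the grid and levels from the earlier snippets (the two end-point colours are illustrative):

```python
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Single-hue sequential scale: pale to saturated blue-green
teal_scale = LinearSegmentedColormap.from_list(
    "teal_scale", ["#f0faf8", "#00695c"])

fig, ax = plt.subplots()
bands = ax.contourf(grid_x, grid_y, grid_rain, levels=levels,
                    cmap=teal_scale)
fig.colorbar(bands, label="Rainfall (cm)")
plt.show()
```

Built-in sequential maps such as matplotlib’s “Greens” or “PuBu” (or viridis itself) would serve equally well.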

However, from a design point of view, there is still one issue here: the sea. There are too many bluey colours here for my taste, so let’s remove the sea colouration to get:

Coastal Map 5

Some purists might suggest also turning the land white (or maybe a shade of grey); others would mention that the grid-lines add little value (especially as they are not numbered). Both would probably have a point; however, I think that one can also push minimalism too far. I am pretty happy that our final map delivers the information it is intended to convey much more accurately and more immediately than any of its predecessors.

Comparing the first two rainbow maps to this last one, it is perhaps easy to see why so many people engaged in the design of data visualisations want to see an end to ROYGBIV palettes. In the saying, there is a pot of gold at the end of the rainbow, but of course this can never be reached. I strongly suspect that, despite the efforts of the #endtherainbow crowd, an end to the usage of this particular palette will be equally out of reach. However I hope that this article is something that readers will bear in mind when next deciding on how best to colour their business graph, diagram or data visualisation. I am certainly going to try to modify my approach as well.
 
 
The story of hurricanes and data visualisation will continue in Part II – Map Reading, which is forthcoming.
 


 
Notes

 
[1]
 
For some more thoughts on the public perception of science, see Toast.
 
[2]
 
I guess it’s appropriate from at least one point of view.
 
[3]
 
Scrap rainbow colour scales. Nature (519, 219, 2015)

  • Ed Hawkins – National Centre for Atmospheric Science, University of Reading, UK (@ed_hawkins)
  • Doug McNeall – Met Office Hadley Centre, Exeter, UK (@dougmcneall)
  • Jonny Williams – University of Bristol, UK (LinkedIn page)
  • David B. Stephenson – University of Exeter, UK (Academic page)
  • David Carlson – World Meteorological Organization, Geneva, Switzerland (retired June 2017).
 
[4]
 
I did also go through a brief monochromatic phase, but it didn’t last long.
 
[5]
 
I guess it might take some time to find Thomasshire on Google Maps.
 
[6]
 
Based on the data I am graphing here, it was a very nasty storm indeed! In this article, I am not looking for realism, just to make some points about the design of diagrams.
 
[7]
 
Contour Lines
Sourced from UK Ordnance Survey

Whereas contours on a physical geography map (see above) link areas with the same elevation above sea level, rainfall contour lines would link areas with the same precipitation.

 
[8]
 
Red, Orange, Yellow, Green, Blue, Indigo, Violet.
 
[9]
 
Red–green colour blindness, the most common sort, affects 80 in 1,000 males and 4 in 1,000 females of Northern European descent.

 


 

A truth universally acknowledged…

£10 note

  “It is a truth universally acknowledged, that an organisation in possession of some data, must be in want of a Chief Data Officer”

— Growth and Governance, by Jane Austen (1813) [1]

 

I wrote about a theoretical job description for a Chief Data Officer back in November 2015 [2]. While I have been on “paternity leave” following the birth of our second daughter, a couple of genuine CDO job specs landed in my inbox. Although unable to respond for the aforementioned reasons, I did leaf through the documents. Something immediately struck me: they were essentially wish-lists covering a number of data-related fields, rather than a description of what a CDO might actually do. Clearly I’m not going to cite the actual text here, but the following is representative of what appeared in both requirement lists:

CDO wishlist

Mandatory Requirements:

Highly Desirable Requirements:

  • PhD in Mathematics or a numerical science (with a strong record of highly-cited publications)
  • MBA from a top-tier Business School
  • TOGAF certification
  • PRINCE2 and Agile Practitioner
  • Invulnerability and X-ray vision [3]
  • Mastery of the lesser incantations and a cloak of invisibility [3]
  • High midi-chlorian reading [3]
  • Full, clean driving licence

Your common or garden CDO

The above list may have descended into farce towards the end, but I would argue that the problems started to occur much earlier. The above is not a description of what is required to be a successful CDO; it’s a description of a Swiss Army Knife. There is also the minor practical point that, out of a world population of around 7.5 billion, there may well be no one who ticks all the boxes [4].

Let’s make the fallacy of this type of job description clearer by considering what a similar approach would look like if applied to what is generally the most senior role in an organisation: the CEO. Whoever drafted the above list of requirements would probably characterise a CEO as follows:

  • The best salesperson in the organisation
  • The best accountant in the organisation
  • The best M&A person in the organisation
  • The best customer service operative in the organisation
  • The best facilities manager in the organisation
  • The best janitor in the organisation
  • The best purchasing clerk in the organisation
  • The best lawyer in the organisation
  • The best programmer in the organisation
  • The best marketer in the organisation
  • The best product developer in the organisation
  • The best HR person in the organisation, etc., etc., …

Of course a CEO needs to be none of the above; they need to be a superlative leader who is expert at running an organisation (even then, they may focus on plotting the way forward and leave the day-to-day running to others). For the avoidance of doubt, I am not saying that a CEO requires no domain knowledge and has no expertise; they would need both. However, they don’t have to know every aspect of company operations better than the people who do it.

The same argument applies to CDOs. Domain knowledge probably should span most of what is in the job description (save for maybe the three items with footnotes), but knowledge is different to expertise. As CDOs don’t grow on trees, they will most likely be experts in one or a few of the areas cited, but not all of them. Successful CDOs will know enough to be able to talk to people in the areas where they are not experts. They will have to be competent at hiring experts in every area of a CDO’s purview. But they do not have to be able to do the job of every data-centric staff member better than the person could do themselves. Even if you could identify such a CDO, they would probably lose their best staff very quickly due to micromanagement.

Conducting the data orchestra

A CDO has to be a conductor of both the data function orchestra and of the use of data in the wider organisation. This is a talent in itself. An internationally renowned conductor may have previously been a violinist, but it is unlikely they were also a flautist and a percussionist. They do however need to be able to tell whether or not the second trumpeter is any good; this is not the same as being able to play the trumpet yourself, of course. The conductor’s key skill is in managing the efforts of a large group of people to create a cohesive – and harmonious – whole.

The CDO is of course still a relatively new role in mainstream organisations [5]. Perhaps these job descriptions will become more realistic as the role becomes more familiar. It is to be hoped so, else many a search for a new CDO will end in disappointment.

Having twisted her text to my own purposes at the beginning of this article, I will leave the last words to Jane Austen:

  “A scheme of which every part promises delight, can never be successful; and general disappointment is only warded off by the defence of some little peculiar vexation.”

— Pride and Prejudice, by Jane Austen (1813)

 

 
Notes

 
[1]
 
Well if a production company can get away with Pride and Prejudice and Zombies, then I feel I am on reasonably solid ground here with this title.

I also seem to be riffing on JA rather a lot at present; I used Rationality and Reality as the title of one of the chapters in my [as yet unfinished] Mathematical book, Glimpses of Symmetry.

 
[2]
 
Wanted – Chief Data Officer.
 
[3]
 
Most readers will immediately spot the obvious mistake here. Of course all three of these requirements should be mandatory.
 
[4]
 
To take just one example, gaining a PhD in a numerical science, building a track record of highly-cited papers and also obtaining an MBA would take most people at least a few weeks of effort. Is it likely that such a person would next focus on a PRINCE2 or TOGAF qualification?
 
[5]
 
I discuss some elements of the emerging consensus on what a CDO should do in: 5 Themes from a Chief Data Officer Forum and 5 More Themes from a Chief Data Officer Forum.

 


 

The impact of bad information on organisations

The impact of poor information on organisations


My objective in this brief article is to compare how much time and effort is spent on certain information-related activities in an organisation that has adopted best practice with what is typical in all too many organisations. For the avoidance of doubt, when I say people here I am focussing on staff who would ostensibly be the consumers of information, not data professionals who are engaged in providing such information. What follows relates to the end users of information.

What I have done at the top of the above exhibit (labelled “Activity” on the left-hand side) is to lay out the different types of information-related work that end users engage in, splitting these into low, medium and high value-added components as we scan across the page.

  • Low value is number crunching and prettifying exhibits for publication
     
  • Medium value is analysis and interpretation of information
     
  • High value is taking action based on insights and then monitoring to check whether the desired outcome has been achieved

In the centre of the diagram (labelled “Ideal Time Allocation”), I have shown what I believe is a best practice allocation of time to these activities. It is worth pointing out that I am recommending that significant time (60%) is spent on analysis and interpretation; while tagged as of medium value, this type of work is a prerequisite for the higher-value activities, so you cannot really avoid it. Despite this, there is still 30% of time devoted to the high-value activities of action and monitoring of results. The remaining 10% is expended on low-value activities.

At the bottom of the chart (labelled “Actual Time Allocation”), I have tried to estimate how people’s time is actually spent in organisations where insufficient attention has been paid to the information landscape; a large number of organisations fit into this category in my experience. I am not trying to be 100% precise here, but I believe that the figures are representative of what I have seen in several organisations. In fact I think that the estimated amount of time spent on low-value activities is probably greater than 70% in many cases; however, I don’t want to be accused of exaggeration.

Clearly a lack of robust, reliable and readily available information can mean that highly skilled staff spend their time generating information rather than analysing and interpreting it and then using such insights as the basis for action. This results in the bulk of their work being low value-added. The medium- and high-value activities are squeezed out as there are only so many hours in the day.

It is obvious that such a state of affairs is sub-optimal and needs to be addressed. My experience of using diagrams like the one shown here is that they can be very valuable in explaining what is wrong with current information arrangements and highlighting the need for change.

An interesting exercise is to estimate what the bottom of the diagram would look like for your organisation. Are you close to best practice, or some way from this and in need of urgent change?
 


 

The peterjamesthomas.com Data and Analytics Dictionary

The Data and Analytics Dictionary

I find myself frequently being asked questions around terminology in Data and Analytics and so thought that I would try to define some of the more commonly used phrases and words. My first attempt to do this can be viewed in a new page added to this site (this also appears in the site menu):

The Data and Analytics Dictionary

I plan to keep this up-to-date as the field continues to evolve.

I hope that my efforts to explain some concepts in my main area of specialism are both of interest and utility to readers. Any suggestions for new entries or comments on existing ones are more than welcome.