The latest edition of The Data & Analytics Dictionary is now out

The Data and Analytics Dictionary

After a hiatus of a few months, the latest version of the peterjamesthomas.com Data and Analytics Dictionary is now available. It includes 30 new definitions, some of which have been contributed by Tenny Thomas Soman, George Firican, Scott Taylor and Taru Väre. Thanks to all of them for their help.

  1. Analysis
  2. Application Programming Interface (API)
  3. Business Glossary (contributor: Tenny Thomas Soman)
  4. Chart (Graph)
  5. Data Architecture – Definition (2)
  6. Data Catalogue
  7. Data Community
  8. Data Domain (contributor: Taru Väre)
  9. Data Enrichment
  10. Data Federation
  11. Data Function
  12. Data Model
  13. Data Operating Model
  14. Data Scrubbing
  15. Data Service
  16. Data Sourcing
  17. Decision Model
  18. Embedded BI / Analytics
  19. Genetic Algorithm
  20. Geospatial Data
  21. Infographic
  22. Insight
  23. Management Information (MI)
  24. Master Data – additional definition (contributor: Scott Taylor)
  25. Optimisation
  26. Reference Data (contributor: George Firican)
  27. Report
  28. Robotic Process Automation
  29. Statistics
  30. Self-service (BI or Analytics)

Remember that The Dictionary is a free resource and quoting contents (ideally with acknowledgement) and linking to its entries (via the buttons provided) are both encouraged.

If you would like to contribute a definition, which will of course be acknowledged, you can use the comments section here or the dedicated form; we look forward to hearing from you [1].

If you have found The Data & Analytics Dictionary helpful, we would love to hear about it. Please post something in the comments section or contact us and we may even look to feature you in a future article.

The Data & Analytics Dictionary will continue to be expanded in coming months.
 


Notes

 
[1]
 
Please note that any submissions will be subject to editorial review and are not guaranteed to be accepted.


 

In praise of Jam Doughnuts or: How I learned to stop worrying and love Hybrid Data Organisations

The above infographic is the work of Management Consultants Oxbow Partners [1] and employs a novel taxonomy to categorise data teams. First up, I would of course agree with Oxbow Partners’ statement that:

Organisation of data teams is a critical component of a successful Data Strategy

Indeed I cover elements of this in two articles [2]. So the structure of data organisations is a subject which, in my opinion, merits some consideration.

Oxbow Partners draw distinctions between organisations where the Data Team is separate from the broader business, ones where data capabilities are entirely federated with no discernible “centre” and hybrids between the two. The imaginative names for these are, respectively, The Burger, The Smoothie and The Jam Doughnut. In this article, I review Oxbow Partners’ model and offer some of my own observations.



The Burger – Centralised

The Burger

Having historically recommended something along the lines of The Burger, not least when an organisation’s data capabilities are initially somewhere between non-existent and very immature, I have changed my views over time, much as the characteristics of the data arena have also altered. I think that The Burger still has a role, in particular in a first phase where data capabilities need to be constructed from scratch, but it has some weaknesses. These include:

  1. The pace of change in organisations has increased in recent years. Also, many organisations have separate divisions or product lines and / or separate geographic territories. Change can be happening in sometimes radically different ways in each of these as market conditions may vary considerably between Division A’s operations in Switzerland and Division B’s operations in Miami. It is hard for a wholly centralised team to react with speed in such a scenario. Even if they are aware of the shifting needs, capacity may not be available to work on multiple areas in parallel.
     
  2. Again in the above scenario, it is also hard for a central team to develop deep expertise in a range of diverse businesses spread across different locations (even if within just one country). A central team member who has to understand the needs of 12 different business units will necessarily be at a disadvantage when considering any single unit compared to a colleague who focuses on that unit and nothing else.
     
  3. A further challenge presented here is maintaining the relationships with colleagues in different business units that are typically a prerequisite for – for example – driving adoption of new data capabilities.


The Smoothie – Federated

The Smoothie

So – to address these shortcomings – maybe The Smoothie is a better organisational design. Well maybe, but also maybe not. Problems with these arrangements include:

  1. Probably biggest of all, it is an extremely high-cost approach. The smearing out of work on data capabilities inevitably leads to duplication of effort with – for example – the same data sourced or combined by different people in parallel. The pace of change in organisations may have increased, but I know few that are happy to bake large costs into their structures as a way to cope with this.
     
  2. The same duplication referred to above creates another problem: the way that data is processed can vary (maybe substantially) between different people and different teams. This leads to the nightmare scenario where people spend all their time arguing about whose figures are right, rather than focussing on what the figures say is happening in the business [3]. Such arrangements can generate business risk as well. In particular, in highly regulated industries, heterogeneous treatment of the same data tends to be frowned upon in external reviews.
     
  3. The wholly federated approach also limits both opportunities for economies of scale and identification of areas where data capabilities can meet the needs of more than one business unit.
     
  4. Finally, data resources who are fully embedded in different parts of a business may become isolated and may not benefit from the exchange of ideas that happens when other similar people are part of the immediate team.

So to summarise we have:

Burger vs Smoothie



The Jam Doughnut – Hybrid

The Jam Doughnut

Which leaves us with The Jam Doughnut. In my opinion, this is a Goldilocks approach that captures as much as possible of the advantages of the other two set-ups, while mitigating their drawbacks. It is the approach that tends to be my recommendation for most organisations nowadays. Let me spend a little more time describing its attributes.

I see the best way of implementing a Jam Doughnut approach as being via a hub-and-spoke model. The hub is a central Data Team; the spokes are data-centric staff in different parts of the business (Divisions, Functions, Geographic Territories etc.).

Data Hub and Spoke

It is important to stress that each spoke is not simply a smaller copy of the central Data Team. Some roles will be more federated, some more centralised, according to what makes sense. Let’s consider a few different roles to illustrate this:

  • Data Scientist – I would see a strong central group of these, developing methodologies and tools, but also that many business units would have their own dedicated people; “spoke”-based people could also develop new tools and new approaches, which could be brought into the “hub” for wider dissemination
     
  • Analytics Expert – Similar to the Data Scientists, centralised “hub” staff might work more on standards (e.g. for Data Visualisation), developing frameworks to be leveraged by others (e.g. a generic harness for dashboards that can be leveraged by “spoke” staff), or selecting tools and technologies; “spoke”-based staff would be more into the details of meeting specific business needs
     
  • Data Engineer – Some “spoke” people may be hybrid Data Scientists / Data Engineers and some larger “spoke” teams may have dedicated Data Engineers, but the needle moves more towards centralisation with this role
     
  • Data Architect – Probably wholly centralised, but some “spoke” staff may have an architecture string to their bow, which would of course be helpful
     
  • Data Governance Analyst – Also probably wholly centralised, this is not to downplay the need for people in the “spokes” to take accountability for Data Governance and Data Quality improvement, but these are likely to be part-time roles in the “spokes”, whereas the “hub” will need full-time Data Governance people

It is also important to stress that the various spokes should be in contact with each other, swapping successful approaches, sharing ideas and so on. Indeed, you could almost see the spokes beginning to merge together somewhat to form a continuum around the Data Team. Maybe the merged spokes could form the “dough”, with the Data Team being the “jam”, something like this:

Data Hub and Spoke

I label these types of arrangements a Data Community and this is something that I have looked to establish and foster in a few recent assignments. Broadly, a Data Community is something that all data-centric staff would feel part of; they are obviously part of their own segment of the organisation, but the Data Community is also part of their corporate identity. The Data Community facilitates best practice approaches, the sharing of ideas, help with specific problems and general discourse between its members. I will be revisiting the concept of a Data Community in coming weeks. For now, I would say that one thing that can help it to function as envisaged is sharing common tooling. Again, this is a subject that I will return to shortly.

I’ll close by thanking Oxbow Partners for some good mental stimulation – I look forward to their next data-centric publication.
 


 

Disclosure:

It is peterjamesthomas.com’s policy to disclose any connections with organisations or individuals mentioned in articles.

Oxbow Partners are an advisory firm for the insurance industry covering Strategy, Digital and M&A. Oxbow Partners and peterjamesthomas.com Ltd. have a commercial association and peterjamesthomas.com Ltd. was also engaged by one of Oxbow Partners’ principals, Christopher Hess, when he was at a former organisation.

 
Notes

 
[1]
 
Though the author might have had a minor role in developing some elements of it as well.
 
[2]
 
The Anatomy of a Data Function and A Simple Data Capability Framework.
 
[3]
 
See also The impact of bad information on organisations.


 
 

A Brief History of Databases

A Brief History of Databases

Larger PDF version (opens in a new tab)

The pace of change in the field of database technology seems to be constantly accelerating. No doubt in five years’ time [1], Big Data and the Hadoop suite [2] will seem as old-fashioned as earlier technologies can appear to some people nowadays. Today there is a great variety of database technologies in use in different organisations for different purposes. There are also a lot of vendors, some of whom have more than one type of database product. I think that it is worthwhile considering both the genesis of databases and some of the major developments that have occurred between then and now.

The infographic appearing at the start of this article seeks to provide just such a perspective. It presents an abridged and simplified perspective on the history of databases from the 1960s to the late 2010s. It is hard to make out the text in the above diagram, so I would recommend that readers click on the link provided in order to view a much larger version with bigger and more legible text.

The infographic references a number of terms. Below I provide links to definitions of several of these, which are taken from The Data and Analytics Dictionary. The list progresses from the top of the diagram downwards, but starts with a definition of “database” itself:

To my mind, it is interesting to see just how long we have been grappling with the best way to set up databases. Also of note is that some of the Big Data technologies are actually relatively venerable, dating to the mid-to-late 2000s (some elements are even older, consisting of techniques for handling flat files on UNIX or Mainframe computers back in the day).

I hope that both the infographic and the definitions provided above contribute to the understanding of the history of databases and also that they help to elucidate the different types of database that are available to organisations today.
 


 
Acknowledgements

The following people’s input is acknowledged on the document itself, but my thanks are also repeated here:

Of course any errors and omissions remain the responsibility of the author.


 
Notes

 
[1]
 
If not significantly before then.
 
[2]
 
One of J K Rowling’s lesser-known works.


 

Hurricanes and Data Visualisation: Part II(b) – Ooops!

Ooops!

The first half of my planned thoughts on Hurricanes and Data Visualisation, Rainbow’s Gravity, was published back in September. Part two, Map Reading, joined it this month. In between, the first hurricane-centric article acquired an addendum, The Mona Lisa. With this post, the same has happened to the second article. Apparently you can’t keep a good hurricane story down.
 
 
One of our Hurricanes is missing

When I started writing about Hurricanes back in September of this year, it was in the aftermath of Harvey and Irma, both of which were safely far away from my native United Kingdom. Little did I think that, in closing this mini-series, Hurricane Ophelia (or at least the remnants of it) would be heading for these shores; I hope this is coincidence and not karma for me criticising the US National Weather Service’s diagrams!

As we batten down here, an odd occurrence was brought to my attention by Bill McKibben (@billmckibben), someone I connected with while working on this set of articles. Here is what he tweeted:

Ooops!

I am sure that inhabitants of both the Shetland Islands and the East Midlands will be breathing sighs of relief!

Clearly both the northward and eastward extents of Ophelia were outside the scope of either the underlying model or the mapping software. A useful reminder to data professionals to ensure that we set the boundaries of both modelling and visualisation work appropriately.

As an aside, this image is another for the Hall of Infamy, relying as it does on the less than helpful rainbow palette we critiqued all the way back in the first article.

I’ll hope to be writing again soon – hurricanes allowing!
 


 

Hurricanes and Data Visualisation: Part I – Rainbow’s Gravity

The Gravity of Rainbows

This is the first of two articles whose genesis was the nexus of hurricanes and data visualisation. The second article, Part II – Map Reading, has now been published.
 
 
Introduction

This first article is not a critique of Thomas Pynchon‘s celebrated work; instead it refers to a grave malady that can afflict otherwise healthy data visualisations: the use and abuse of rainbow colours. This is an area that some data visualisation professionals can get somewhat hot under the collar about; there is even a Twitter hashtag devoted to opposing this colour choice, #endtherainbow.

Hurricane Irma

The [mal-] practice has come under additional scrutiny in recent weeks due to the major meteorological events causing so much damage and even loss of life in the Caribbean and southern US: hurricanes Harvey and Irma. Of course the most salient point about these two megastorms is their destructive capability. However, the observations that data visualisers make about how information about hurricanes is conveyed do carry some weight in two areas: how the public perceives these phenomena and how they perceive scientific findings in general [1]. The issues at stake are ones of both clarity and inclusiveness. Some data visualisers felt that salt was rubbed in the wound when the US National Weather Service, avid users of rainbows [2], had to add another colour to their normal palette for Harvey:

NWS Harvey

In 2015, five scientists collectively wrote a letter to Nature entitled “Scrap rainbow colour scales” [3]. In this they state:

It is time to clamp down on the use of misleading rainbow colour scales that are increasingly pervading the literature and the media. Accurate graphics are key to clear communication of scientific results to other researchers and the public — an issue that is becoming ever more important.

© NPG. Used under license 4186731223352 Copyright Clearance Center

At this point I have to admit to using rainbow colour schemes myself professionally and personally [4]; it is often the path of least resistance. I do however think that the #endtherainbow advocates have a point, one that I will try to illustrate below.
 
 
Many Marvellous Maps

Let’s start by introducing the idyllic coastal county of Thomasshire, a map of which appears below:

Coastal Map 1

Of course this is a cartoon map, it might be more typical to start with an actual map from Google Maps or some other provider [5], but this doesn’t matter to the argument we will construct here. Let’s suppose that – rather than anything as potentially catastrophic as a hurricane – the challenge is simply to record the rainfall due to a nasty storm that passed through this shire [6]. Based on readings from various weather stations (augmented perhaps by information drawn from radar), rainfall data would be captured and used to build up a rain contour map, much like the elevation contour maps that many people will recall from Geography lessons at school [7].

If we were to adopt a rainbow colour scheme, then such a map might look something like the one shown below:

Coastal Map 2

Here all areas coloured purple will have received between 0 and 10 cm of rain, blue between 10 and 20 cm of rain and so on.

At this point I apologise to any readers who suffer from migraine. An obvious drawback of this approach is how garish it is. Also the solid colours block out details of the underlying map. Well something can be done about both of these issues by making the contour colours transparent. This both tones them down and allows map details to remain at least semi-visible. This gets us a new map:

Coastal Map 3

Here we get into the core of the argument about the suitability of a rainbow palette. Again quoting from the Nature letter:

[…] spectral-type colour palettes can introduce false perceptual thresholds in the data (or hide genuine ones); they may also mask fine detail in the data. These palettes have no unique perceptual ordering, so they can de-emphasize data extremes by placing the most prominent colour near the middle of the scale.

[…]

Journals should not tolerate poor visual communication, particularly because better alternatives to rainbow scales are readily available (see NASA Earth Observatory).

© NPG. Used under license 4186731223352 Copyright Clearance Center

In our map, what we are looking to do is to show increasing severity of the deluge as we pass from purple (indigo / violet) up to red. But the ROYGBIV [8] colours of the spectrum are ill-suited to this. Our eyes react differently to different colours and will not immediately infer the gradient in rainfall that the image is aiming to convey. The NASA article the authors cite above uses a picture to paint a thousand words:

NASA comparison of colour palettes
Compared to a monochromatic or grayscale palette the rainbow palette tends to accentuate contrast in the bright cyan and yellow regions, but blends together through a wide range of greens.
Sourced from NASA

Another salient point is that a relatively high proportion of people suffer from one or other of the various forms of colour blindness [9]. Even the most tastefully pastel rainbow chart will disadvantage such people seeking to derive meaning from it.
 
 
Getting Over the Rainbow

So what could be another approach? Well, one idea is to show gradations of whatever the diagram is tracking using gradients of colour; this is the essence of the NASA recommendation. I have attempted to do just this in the next map.

Coastal Map 4

I chose a bluey-green tone both as it was to hand in the Visio palette I was using and also to avoid confusion with the blue sea (more on this later). Rather than different colours, the idea is to map intensity of rainfall to intensity of colour. This should address both colour-blindness issues and the problems mentioned above with discriminating between ROYGBIV colours. I hope that readers will agree that it is easier to grasp what is happening at a glance when looking at this chart than in the ones that preceded it.
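
For readers who would like to experiment with this principle themselves, here is a minimal Python sketch (using matplotlib, which is not the tool I used to draw the maps in this article) that contrasts a rainbow palette with a single-hue sequential one on some fabricated rainfall data; the colormap names are standard matplotlib ones and the data is entirely synthetic:

# Contrast a rainbow palette with a single-hue sequential palette on a
# fabricated rainfall field. Illustrative only - the maps in this article
# were drawn by hand, not generated from data.
import numpy as np
import matplotlib.pyplot as plt

# Synthetic rainfall (cm) over a 100 x 100 grid, peaking near one spot
x, y = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
rainfall = 80 * np.exp(-((x - 0.6) ** 2 + (y - 0.4) ** 2) / 0.05)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, cmap in zip(axes, ["jet", "BuGn"]):  # rainbow vs sequential
    im = ax.contourf(x, y, rainfall, levels=8, cmap=cmap)
    ax.set_title("cmap='" + cmap + "'")
    fig.colorbar(im, ax=ax, label="Rainfall (cm)")
plt.tight_layout()
plt.show()

The sequential palette maps more rain to more colour, so the ordering is immediately apparent; the rainbow version asks the reader to remember where each hue sits in the ROYGBIV sequence.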

However, from a design point of view, there is still one issue here: the sea. There are too many bluey colours here for my taste, so let’s remove the sea colouration to get:

Coastal Map 5

Some purists might suggest also turning the land white (or maybe a shade of grey); others would mention that the grid-lines add little value (especially as they are not numbered). Both would probably have a point; however, I think that one can also push minimalism too far. I am pretty happy that our final map delivers the information it is intended to convey much more accurately and more immediately than any of its predecessors.

Comparing the first two rainbow maps to this last one, it is perhaps easy to see why so many people engaged in the design of data visualisations want to see an end to ROYGBIV palettes. In the saying, there is a pot of gold at the end of the rainbow, but of course this can never be reached. I strongly suspect that, despite the efforts of the #endtherainbow crowd, an end to the usage of this particular palette will be equally out of reach. However I hope that this article is something that readers will bear in mind when next deciding on how best to colour their business graph, diagram or data visualisation. I am certainly going to try to modify my approach as well.
 
 
The story of hurricanes and data visualisation will continue in Part II – Map Reading.
 


 
Notes

 
[1]
 
For some more thoughts on the public perception of science, see Toast.
 
[2]
 
I guess it’s appropriate from at least one point of view.
 
[3]
 
“Scrap rainbow colour scales”, Nature 519, 219 (2015). The authors were:

  • Ed Hawkins – National Centre for Atmospheric Science, University of Reading, UK (@ed_hawkins)
  • Doug McNeall – Met Office Hadley Centre, Exeter, UK (@dougmcneall)
  • Jonny Williams – University of Bristol, UK (LinkedIn page)
  • David B. Stephenson – University of Exeter, UK (Academic page)
  • David Carlson – World Meteorological Organization, Geneva, Switzerland (retired June 2017).
 
[4]
 
I did also go through a brief monochromatic phase, but it didn’t last long.
 
[5]
 
I guess it might take some time to find Thomasshire on Google Maps.
 
[6]
 
Based on the data I am graphing here, it was a very nasty storm indeed! In this article, I am not looking for realism, just to make some points about the design of diagrams.
 
[7]
 
Contour Lines (click for a larger version)
Click to view a larger version.
Sourced from UK Ordnance Survey

Whereas contours on a physical geography map (see above) link areas with the same elevation above sea level, rainfall contour lines would link areas with the same precipitation.

 
[8]
 
Red, Orange, Yellow, Green, Blue, Indigo, Violet.
 
[9]
 
Red–green colour blindness, the most common sort, affects 80 in 1,000 males and 4 in 1,000 females of Northern European descent.

 


 

The impact of bad information on organisations

The impact of poor information on organisations

Larger PDF version (opens in a new tab)

My objective in this brief article is to compare how much time and effort is spent on certain information-related activities in an organisation that has adopted best practice with what is typical in all too many organisations. For the avoidance of doubt, when I say people here I am focussing on staff who would ostensibly be the consumers of information, not the data professionals who are engaged in providing it. What follows relates to the end users of information.

What I have done at the top of the above exhibit (labelled “Activity” on the left-hand side) is to lay out the different types of information-related work that end users engage in, splitting these into low, medium and high value-added components as we scan across the page.

  • Low value is number crunching and prettifying exhibits for publication
     
  • Medium value is analysis and interpretation of information
     
  • High value is taking action based on insights and then monitoring to check whether the desired outcome has been achieved

In the centre of the diagram (labelled “Ideal Time Allocation”), I have shown what I believe is a best practice allocation of time to these activities. It is worth pointing out that I am recommending that significant time (60%) is spent on analysis and interpretation; while tagged as of medium value, this type of work is a prerequisite for the higher-value activities, so you cannot really avoid it. Despite this, there is still 30% of time devoted to the high-value activities of action and monitoring of results. The remaining 10% is expended on low-value activities.

At the bottom of the chart (labelled “Actual Time Allocation”), I have tried to estimate how people’s time is actually spent in organisations where insufficient attention has been paid to the information landscape; a large number of organisations fit into this category in my experience. I am not trying to be 100% precise here, but I believe that the figures are representative of what I have seen in several organisations. In fact I think that the estimated amount of time spent on low value activities is probably greater than 70% in many cases; however I don’t want to be accused of exaggeration.

Clearly a lack of robust, reliable and readily available information can mean that highly skilled staff spend their time generating information rather than analysing and interpreting it and then using such insights as the basis for action. This results in the bulk of their work being low value-added. The medium and high value activities are squeezed out as there are only so many hours in the day.

It is obvious that such a state of affairs is sub-optimal and needs to be addressed. My experience of using diagrams like the one shown here is that they can be very valuable in explaining what is wrong with current information arrangements and highlighting the need for change.

An interesting exercise is to estimate what the bottom of the diagram would look like for your organisation. Are you close to best practice, or some way from this and in need of urgent change?
 


 

The Big Data Universe

The Royal Society - Big Data Universe (Click to view a larger version in a new window)

The above image is part of a much bigger infographic produced by The Royal Society about machine learning. You can view the whole image here.

I felt that this component was interesting in a stand-alone capacity.

The legend explains that a petabyte (Pb) is equal to a million gigabytes (Gb) [1], or 1 Pb = 10⁶ Gb. A gigabyte itself is a billion bytes, or 1 Gb = 10⁹ bytes. Recalling how we multiply indices, we can see that 1 Pb = 10⁶ × 10⁹ bytes = 10⁶⁺⁹ bytes = 10¹⁵ bytes. 10¹⁵ also has a name: it’s called a quadrillion. Written out long hand:

1 quadrillion = 1,000,000,000,000,000

The estimate of the amount of data held by Google is fifteen thousand petabytes, let’s write that out long hand as well:

15,000 Pb = 15,000,000,000,000,000,000 bytes

That’s a lot of zeros. As is traditional with big numbers, let’s try to put this in context.

  1. The average size of a photo on an iPhone 7 is about 3.5 megabytes (1 Mb = 1,000,000 bytes), so Google could store about 4.3 trillion of such photos.

    iPhone 7 photo

  2. Stepping it up a bit, the average size of a high quality photo stored in CR2 format from a Canon EOS 5D Mark IV is ten times bigger at 35 Mb, so Google could store a mere 430 billion of these.

    Canon EOS 5D

  3. A high definition (1080p) movie is on average around 6 Gb, so Google could store the equivalent of 2.5 billion movies.

    The Complete Indiana Jones (helpful for Data Management professionals)

  4. If Google employees felt that this resolution wasn’t doing it for them, they could upgrade to 150 million 4K movies at around 100 Gb each.

    4K TV

  5. If instead they felt like reading, they could hold the equivalent of The Library of Congress print collections a mere 75 thousand times over [2].

    Library of Congress

  6. Rather than talking about bytes, 15,000 petametres is equivalent to about 1,600 light years and at this distance from us we find Messier Object 47 (M47), a star cluster which was first described an impressively long time ago in 1654.

    Messier 47

  7. If instead we consider 15,000 peta-miles, then this is around 2.5 million light years, which gets us all the way to our nearest neighbour, the Andromeda Galaxy [3].

    Andromeda

    The fastest that humankind has got anything bigger than a handful of sub-atomic particles to travel is the 17 kilometres per second (11 miles per second) at which Voyager 1 is currently speeding away from the Sun. At this speed, it would take the probe about 43 billion years to cover the 15,000 peta-miles to Andromeda. This is over three times longer than our best estimate of the current age of the Universe.

  8. Finally a more concrete example. If we consider a small cube, made of – well – concrete, and with dimensions of 1 cm in each direction, how big would a stack of 15,000 quadrillion of them be? Well, if arranged into a cube, each of the sides would be just under 25 km (15 and a bit miles) long. That’s a pretty big cube.

    Big cube (plan)

    If the base was placed in the vicinity of New York City, it would comfortably cover Manhattan, plus quite a bit of Brooklyn and The Bronx, plus most of Jersey City. It would extend up to Hackensack in the North West and almost reach JFK in the South East. The top of the cube would plough through the Troposphere and get half way through the Stratosphere before topping out. It would vie with Mars’s Olympus Mons for the title of highest planetary structure in the Solar System [4].

It is probably safe to say that 15,000 Pb is an astronomical figure.
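
None of the above needs anything more sophisticated than a calculator, but for the curious, here is a short Python sketch that reproduces the byte-based comparisons; the item sizes are the rough averages quoted in the list above, so the results are order-of-magnitude only:

# Reproduce the storage comparisons above using the rough average sizes
# quoted in the text. All figures are approximate.
GOOGLE_BYTES = 15_000 * 10**15               # 15,000 petabytes

items = {
    "iPhone 7 photos (3.5 MB)": 3.5 * 10**6,
    "Canon CR2 photos (35 MB)": 35 * 10**6,
    "1080p movies (6 GB)": 6 * 10**9,
    "4K movies (100 GB)": 100 * 10**9,
    "Libraries of Congress (200 TB)": 200 * 10**12,
}

for name, size in items.items():
    print(name + ":", format(GOOGLE_BYTES / size, ",.0f"))

# Side of a cube built from 15,000 quadrillion 1 cubic-centimetre blocks
side_km = (15_000 * 10**15) ** (1 / 3) / (100 * 1000)
print("Cube side:", format(side_km, ",.1f"), "km")   # just under 25 km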

Google played a central role in the initial creation of the collection of technologies that we now use the term Big Data to describe. The image at the beginning of this article perhaps explains why this was the case (and indeed why they continue to be at the forefront of developing newer and better ways of dealing with large data sets).

As a point of order, when people start talking about “big data”, it is worth recalling just how big “big data” really is.
 


 Notes

 
[1]
 
In line with The Royal Society, I’m going to ignore the fact that these definitions were originally all in powers of 2 not 10.
 
[2]
 
The size of The Library of Congress print collections seems to have become irretrievably connected with the figure 10 terabytes (10 × 10¹² bytes) for some reason. No one knows precisely, but 200 Tb seems to be a more reasonable approximation.
 
[3]
 
Applying the unimpeachable logic of eminent pseudoscientist and numerologist Erich von Däniken, what might be passed over as a mere coincidence by lesser minds, instead presents incontrovertible proof that Google’s PageRank algorithm was produced with the assistance of extraterrestrial life; which, if you think about it, explains quite a lot.
 
[4]
 
Though I suspect not for long, unless we choose some material other than concrete. Then again, I’m not a materials scientist, so what do I know?

 

 

Curiouser and Curiouser – The Limits of Brexit Voting Analysis

An original illustration from Charles Lutwidge Dodgson's seminal work would have been better, but sadly none such seems to be extant
 
Down the Rabbit-hole

When I posted my Brexit infographic reflecting the age of voters, an obvious extension was to add an indication of the number of people in each age bracket who did not vote as well as those who did. This seemed a relatively straightforward task, but actually proved to be rather troublesome (this may be an example of British understatement). Maybe the caution I gave about statistical methods having a large impact on statistical outcomes in An Inconvenient Truth should have led me to expect such issues. In any case, I thought that it would be instructive to talk about the problems I stumbled across and to – once again – emphasise the perils of over-extending statistical models.

Brexit ages infographic
Click to download a larger PDF version in a new window.

Regular readers will recall that my Brexit Infographic (reproduced above) leveraged data from an earlier article, A Tale of two [Brexit] Data Visualisations. As cited in this article, the numbers used were from two sources:

  1. The UK Electoral Commission – I got the overall voting numbers from here.
  2. Lord Ashcroft’s Polling organisation – I got the estimated distribution of votes by age group from here.

In the notes section of A Tale of two [Brexit] Data Visualisations I [prophetically] stated that the breakdown of voting by age group was just an estimate. Based on what I have discovered since, I’m rather glad that I made this caveat explicit.
 
 
The Pool of Tears

In order to work out the number of people in each age bracket who did not vote, an obvious starting point would be the overall electorate, which the UK Electoral Commission stated as being 46,500,001. As we know that 33,551,983 people voted (an actual figure rather than an estimate), this is where the turnout percentage of 72.2% (actually 72.1548%) came from (33,551,983 / 46,500,001).

A clarifying note: the electorate figures above refer to people who are eligible to vote. Specifically, in order to vote in the UK Referendum, people had to meet the following eligibility criteria (again drawn from the UK Electoral Commission):

To be eligible to vote in the EU Referendum, you must be:

  • A British or Irish citizen living in the UK, or
  • A Commonwealth citizen living in the UK who has leave to remain in the UK or who does not require leave to remain in the UK, or
  • A British citizen living overseas who has been registered to vote in the UK in the last 15 years, or
  • An Irish citizen living overseas who was born in Northern Ireland and who has been registered to vote in Northern Ireland in the last 15 years.

EU citizens are not eligible to vote in the EU Referendum unless they also meet the eligibility criteria above.

So far, so simple. The next thing I needed to know was how the electorate was split by age. This is where we begin to run into problems. One place to start is the actual population of the UK as at the last census (2011). This is as follows:
 

Ages (years) Population % of total
0–4 3,914,000 6.2
5–9 3,517,000 5.6
10–14 3,670,000 5.8
15–19 3,997,000 6.3
20–24 4,297,000 6.8
25–29 4,307,000 6.8
30–34 4,126,000 6.5
35–39 4,194,000 6.6
40–44 4,626,000 7.3
45–49 4,643,000 7.3
50–54 4,095,000 6.5
55–59 3,614,000 5.7
60–64 3,807,000 6.0
65–69 3,017,000 4.8
70–74 2,463,000 3.9
75–79 2,006,000 3.2
80–84 1,496,000 2.4
85–89 918,000 1.5
90+ 476,000 0.8
Total 63,183,000 100.0

 
If I roll up the above figures to create the same age groups as in the Ashcroft analysis (something that requires splitting the 15-19 range, which I have assumed can be done uniformly), I get:
 

Ages (years) Population % of total
0-17 13,499,200 21.4
18-24 5,895,800 9.3
25-34 8,433,000 13.3
35-44 8,820,000 14.0
45-54 8,738,000 13.8
55-64 7,421,000 11.7
65+ 10,376,000 16.4
Total 63,183,000 100.0

 
The UK Government isn’t interested in the views of people under 18[citation needed], so eliminating this row we get:
 

Ages (years) Population % of total
18-24 5,895,800 11.9
25-34 8,433,000 17.0
35-44 8,820,000 17.8
45-54 8,738,000 17.6
55-64 7,421,000 14.9
65+ 10,376,000 20.9
Total 49,683,800 100.0

 
As mentioned, the above figures are from 2011 and the UK population has grown since then. Web-site WorldOMeters offers an extrapolated population of 65,124,383 for the UK in 2016 (this is as at 12th July 2016; if extrapolation and estimates make you queasy, I’d suggest closing this article now!). I’m going to use a rounder figure of 65,125,000 people; there is no point pretending that precision exists where it clearly doesn’t. Making the assumption that such growth is uniform across all age groups (please refer to my previous bracketed comment!), then the above exhibit can also be extrapolated to give us:
 

Ages (years) Population % of total
18-24 6,077,014 11.9
25-34 8,692,198 17.0
35-44 9,091,093 17.8
45-54 9,006,572 17.6
55-64 7,649,093 14.9
65+ 10,694,918 20.9
Total 51,210,887 100.0
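
As a sanity check, this uniform scaling can be expressed in a few lines of Python; the 2011 figures are those from the rolled-up table above and, as already flagged, uniform growth across age groups is an assumption:

# Scale the rolled-up 2011 census figures (18+) uniformly to an assumed
# 2016 population of 65,125,000. Uniform growth across age groups is an
# assumption, as noted in the text.
POP_2011_TOTAL = 63_183_000
POP_2016_TOTAL = 65_125_000
growth = POP_2016_TOTAL / POP_2011_TOTAL

pop_2011_18plus = {
    "18-24": 5_895_800,
    "25-34": 8_433_000,
    "35-44": 8_820_000,
    "45-54": 8_738_000,
    "55-64": 7_421_000,
    "65+": 10_376_000,
}

pop_2016_18plus = {age: round(n * growth) for age, n in pop_2011_18plus.items()}
print(pop_2016_18plus)                  # 18-24 comes out at roughly 6.08 million
print(sum(pop_2016_18plus.values()))    # about 51.2 million in total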

 
 
Looking Glass House

So our – somewhat fabricated – figure for the 18+ UK population in 2016 is 51,210,887; let’s just call this 51,200,000. As noted at the beginning of this article, the electorate for the 2016 UK Referendum was 46,500,000 (dropping off the one person, with apologies to him or her). The difference is explicable based on the eligibility criteria quoted above. I now have a rough age-group breakdown of the 51.2 million population; how best to apply this to the 46.5 million electorate?

I’ll park this question for the moment and instead look to calculate a different figure. Based on the Ashcroft model, what percentage of the UK population (i.e. the 51.2 million) voted in each age group? We can work this one out without many complications as follows:
 

Ages (years) Population (A) Voted (B) Turnout % (B/A)
18-24 6,077,014 1,701,067 28.0
25-34 8,692,198 4,319,136 49.7
35-44 9,091,093 5,656,658 62.2
45-54 9,006,572 6,535,678 72.6
55-64 7,649,093 7,251,916 94.8
65+ 10,694,918 8,087,528 75.6
Total 51,210,887 33,551,983 65.5

(B) = Size of each age group in the Ashcroft sample as a percentage multiplied by the total number of people voting (see A Tale of two [Brexit] Data Visualisations).
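
Expressed as code, the calculation behind the table is simply the Ashcroft-derived vote estimates divided by the extrapolated population of each age group; both inputs are, of course, estimates:

# Implied turnout by age group: estimated votes (from the Ashcroft
# percentages applied to the 33,551,983 total) divided by the
# extrapolated 2016 population of each group.
population = {
    "18-24": 6_077_014,
    "25-34": 8_692_198,
    "35-44": 9_091_093,
    "45-54": 9_006_572,
    "55-64": 7_649_093,
    "65+": 10_694_918,
}
voted = {
    "18-24": 1_701_067,
    "25-34": 4_319_136,
    "35-44": 5_656_658,
    "45-54": 6_535_678,
    "55-64": 7_251_916,
    "65+": 8_087_528,
}

for age in population:
    print(age + ":", format(100 * voted[age] / population[age], ".1f") + "%")
# The 55-64 line comes out at about 94.8%, the figure queried below.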
 
Remember here that actual turnout figures have electorate as the denominator, not population. As the electorate is less than the population, this means that all of the turnout percentages should actually be higher than the ones calculated (e.g. the overall turnout with respect to electorate is 72.2% whereas my calculated turnout with respect to population is 65.5%). So given this, how to explain the 94.8% turnout of 55-64 year olds? To be sure this group does reliably turn out to vote, but did essentially all of them (remembering that the figures in the above table are too low) really vote in the referendum? This seems less than credible.

The turnout for 55-64 year olds in the 2015 General Election has been estimated at 77%, against an overall turnout of 66.1% (web-site UK Political Info; once more these figures will have been created using techniques similar to the ones I am using here). If we assume a uniform uplift across age ranges (that “assume” word again!) then one might deduce that an increase in overall turnout from 66.1% to 72.2% might lead to the turnout in the 55-64 age bracket increasing from 77% to 84%. A turnout of 84% is still very high, but it is at least feasible; close to 100% turnout from this age group seems beyond the realms of likelihood.

So what has gone wrong? Well, so far the only culprit I can think of is the distribution of voting by age group in the Ashcroft poll. To be clear here, I’m not accusing Lord Ashcroft and his team of sloppy work. Instead I’m calling out that the way that I have extrapolated their figures may not be sustainable. Indeed, if my extrapolation is valid, this would imply that the Ashcroft model overestimated the proportion of 55-64 year olds voting. Thus it must have underestimated the proportion of voters in some other age group. Putting aside the fact that I have probably used their figures in an unintended manner, could it be that the much-maligned turnout of younger people has been misrepresented?

To test the validity of this hypothesis, I turned to a later poll by Omnium. To be sure this was based on a sample size of around 2,000 as opposed to Ashcroft’s 12,000, but it does paint a significantly different picture. Their distribution of voter turnout by age group was as follows:
 

Ages (years) Turnout %
18-24 64
25-39 65
40-54 66
55-64 74
65+ 90

 
I have to say that the Omnium age groups are a bit idiosyncratic, so I have taken advantage of the fact that the figures for 25-54 are essentially the same to create a schedule that matches the Ashcroft groups as follows:
 

Ages (years) Turnout %
18-24 64
25-34 65
35-44 65
45-54 65
55-64 74
65+ 90

 
The Omnium model suggests that younger voters may have turned out in greater numbers than might be thought based on the Ashcroft data. In turn this would suggest that a much greater percentage of 18-24 year olds turned out for the Referendum (64%) than for the last General Election (43%); contrast this with an estimated 18-24 turnout figure of 47% based on just the increase in overall turnout between the General Election and the Referendum (43% × 72.2 / 66.1 ≈ 47%). The Omnium estimates do still however recognise that turnout was greater in the 55+ brackets, which supports the pattern seen in other elections.
 
 
Humpty Dumpty

While it may well be that the Leave / Remain splits based on the Ashcroft figures are reasonable, I’m less convinced that extrapolating these same figures to make claims about actual voting numbers by age group (as I have done) is tenable. Perhaps it would be better to view each age cohort as a mini sample to be treated independently. Based on the analysis above, I doubt that the turnout figures I have extrapolated from the Ashcroft breakdown by age group are robust. However, that is not the same as saying that the Ashcroft data is flawed, or that the Omnium figures are correct. Indeed the Omnium data (at least those elements published on their web-site) don’t include an analysis of whether the people in their sample voted Leave or Remain, so direct comparison is not going to be possible. Performing calculation gymnastics such as using the Omnium turnout for each age group in combination with the Ashcroft voting splits for Leave and Remain for the same age groups actually leads to a rather different Referendum result, so I’m not going to plunge further down this particular rabbit hole.

In summary, my supposedly simple trip to the destination of an enhanced Brexit Infographic has proved unexpectedly arduous, winding and beset by troubles. These challenges have proved so great that I’ve abandoned the journey and will instead be heading for home.
 
 
Which dreamed it?

Based on my work so far, I have severe doubts about the accuracy of some of the age-based exhibits I have published (versions of which have also appeared on many web-sites – the BBC, to offer just one example; scroll down to “How different age groups voted” and note that the percentages cited reconcile to mine). I believe that my logic and calculations are sound, but it seems that I am making too many assumptions about how I can leverage the Ashcroft data. After posting this article, I will accordingly go back and annotate each of my previous posts and link them to these later findings.

I think the broader lesson to be learnt is that estimates are just that, attempts (normally well-intentioned of course) to come up with figures where the actual numbers are not accessible. Sometimes this is a very useful – indeed indispensable – approach, sometimes it is less helpful. In either case estimation should always be approached with caution and the findings ideally sense-checked in the way that I have tried to do above.

Occam’s razor would suggest that when the stats tell you something that seems incredible, then 99 times out of 100 there is an error or inaccurate assumption buried somewhere in the model. This applies when you are creating the model yourself and doubly so where you are relying upon figures calculated by other people. In the latter case not only is there the risk of their figures being inaccurate, there is the incremental risk that you interpret them wrongly, or stretch their broader application to breaking point. I was probably guilty of one or more of the above sins in my earlier articles. I’d like my probable misstep to serve as a warning to other people when they too look to leverage statistics in new ways.

A further point is that the most advanced concepts I have applied in my calculations above are addition, subtraction, multiplication and division. If these basic operations – even in the hands of someone like me who is relatively familiar with them – can lead to the issues described above, just imagine what could result from the more complex mathematical techniques (e.g. ambition, distraction, uglification and derision) used by even entry-level data scientists. This perhaps suggests an apt aphorism: Caveat calculator!

Beware the Jabberwock, my son! // The jaws that bite, the claws that catch! // Beware the Jubjub bird, and shun // The frumious Bandersnatch!

 

How Age was a Critical Factor in Brexit

Brexit ages infographic
Click to download a larger PDF version in a new window.

In my last article, I looked at a couple of ways to visualise the outcome of the recent UK Referendum on European Union membership. There I was looking at how different visual representations highlight different attributes of data.

I’ve had a lot of positive feedback about my previous Brexit exhibits and I thought that I’d capture the zeitgeist by offering a further visual perspective, perhaps one more youthful than the venerable pie chart; namely an infographic. My attempt to produce one of these appears above and a full-size PDF version is also just a click away.

For caveats on the provenance of the data, please also see the previous article’s notes section.
 

Addendum

I have leveraged age group distributions from the Ashcroft Polling organisation to create this exhibit. Other sites – notably the BBC – have done the same and my figures reconcile to the interpretations in other places. However, based on further analysis, I have some reason to think that either there are issues with the Ashcroft data, or that I have leveraged it in ways that the people who compiled it did not intend. Either way, the Ashcroft numbers lead to the conclusion that close to 100% of 55-64 year olds voted in the UK Referendum, which seems very, very unlikely. I have contacted the Ashcroft Polling organisation about this and will post any reply that I receive.

– Peter James Thomas, 14th July 2016

 

Data Management as part of the Data to Action Journey

Data Information Insight Action (w700)

| Larger Version | Detailed and Annotated Version (as PDF) |

This brief article is actually the summation of considerable thought and reflects many elements that I covered in my last two pieces (5 Themes from a Chief Data Officer Forum and 5 More Themes from a Chief Data Officer Forum), in particular both the triangle I used as my previous Data Management visualisation and Peter Aiken’s original version, which he kindly allowed me to reproduce on this site (see here for more information about Peter).

What I began to think about was that both of these earlier exhibits (and indeed many that I have seen pertaining to Data Management and Data Governance) suggest that the discipline forms a solid foundation upon which other areas are built. While there is a lot of truth in this view, I have come round to thinking that Data Management may alternatively be thought of as actively taking part in a more dynamic process; specifically the same iterative journey from Data to Information to Insight to Action and back to Data again that I have referenced here several times before. I have looked to combine both the static, foundational elements of Data Management and the dynamic, process-centric ones in the diagram presented at the top of this article; a more detailed and annotated version of which is available to download as a PDF via the link above.

I have also introduced the alternative path from Data to Insight; the one that passes through Statistical Analysis. Data Management is equally critical to the success of this type of approach. I believe that the schematic suggests some of the fluidity that is a major part of effective Data Management in my experience. I also hope that the exhibit supports my assertion that Data Management is not an end in itself, but instead needs to be considered in terms of the outputs that it helps to generate. Pristine data is of little use to an organisation if it is not then exploited to form insights and drive actions. As ever, this need to drive action necessitates a focus on cultural transformation, an area that is covered in many other parts of this site.

This diagram also calls to mind the subject of where and how the roles of Chief Analytics Officer and Chief Data Officer intersect and whether indeed these should be separate roles at all. These are questions to which – as promised on several previous occasions – I will return in future articles. For now, maybe my schematic can give some data and information practitioners a different way to view their craft and the contributions that it can make to organisational success.