Knowing what you do not Know

Measure twice, cut once

As readers will have noticed, my wife and I have spent a lot of time talking to medical practitioners in recent months. The same readers will also know that my wife is a Structural Biologist, whose work I have featured before in Data Visualisation – A Scientific Treatment [1]. Some of our previous medical interactions had led to me thinking about the nexus between medical science and statistics [2]. More recently, my wife had a discussion with a doctor which brought to mind some of her own previous scientific work. Her observations about the connections between these two areas have formed the genesis of this article. While the origins of this piece are in science and medicine, I think that the learnings have broader applicability.


So the general context is a medical test, the result of which was my wife being told that all was well [3]. Given that humans are complicated systems (to say the very least), my wife was less than convinced that just because reading X was OK it meant that everything else was also necessarily OK. She contrasted the physician’s approach with something from her own experience, in particular one of the experiments that formed part of her PhD thesis. I’m going to try to share the central point she was making without going into all of the scientific details [4]. However, to do this I need to provide at least some high-level background.

Structural Biology is broadly the study of the structure of large biological molecules, which mostly means proteins and protein assemblies. What is important is not the chemical make-up of these molecules (how many carbon, hydrogen, oxygen, nitrogen and other atoms they consist of), but how these atoms are arranged to create three dimensional structures. An example of this appears below:

The 3D structure of a bacterial Ribosome

This image is of a bacterial Ribosome. Ribosomes are miniature machines which assemble amino acids into proteins as part of the chain which converts information held in DNA into useful molecules [5]. Ribosomes are themselves made up of a number of different proteins as well as RNA.

In order to determine the structure of a given protein, it is necessary to first isolate it in sufficient quantity (i.e. to purify it) and then subject it to some form of analysis, for example X-ray crystallography, electron microscopy or a variety of other biophysical techniques. Depending on the analytical procedure adopted, further work may be required, such as growing crystals of the protein. Something that is generally very important in this process is to increase the stability of the protein that is being investigated [6]. The type of protein that my wife was studying [7] is particularly unstable as its natural home is as part of the wall of cells – removed from this supporting structure these types of proteins quickly degrade.

So one of my wife’s tasks was to better stabilise her target protein. This can be done in a number of ways [8] and I won’t get into the technicalities. After one such attempt, my wife looked to see whether her work had been successful. In her case, the relative stability of her protein before and after modification was determined by a test called a Thermostability Assay.

Sigmoidal Dose Response Curve A
© University of Cambridge – reproduced under a Creative Commons 2.0 licence

In the image above, you can see the combined results of several such assays carried out on both the unmodified and modified protein. Results for the unmodified protein are shown as a green line [9] and those for the modified protein as a blue line [10]. The fact that the blue line (and more particularly the section which rapidly slopes down from the higher values to the lower ones) is to the right of the green one indicates that the modification has been successful in increasing thermostability.
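For readers curious about how such curves are typically analysed, here is a minimal sketch of fitting a sigmoidal curve to assay readings and extracting the midpoint (an apparent melting temperature) for comparison. To be clear, this is not my wife’s actual analysis and the data below are invented purely for illustration.

```python
# A minimal, illustrative sketch of analysing a thermostability assay: fit a
# sigmoidal (logistic) curve to readings taken across a range of temperatures
# and compare the fitted midpoints (apparent melting temperatures, Tm).
# All numbers here are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(t, top, bottom, tm, slope):
    """Four-parameter logistic: signal falls from `top` to `bottom` around `tm`."""
    return bottom + (top - bottom) / (1.0 + np.exp((t - tm) / slope))

temperatures = np.arange(20, 81, 5, dtype=float)          # degrees Celsius
unmodified = sigmoid(temperatures, 1.0, 0.1, 45.0, 3.0)   # fake "green line" data
modified = sigmoid(temperatures, 1.0, 0.1, 52.0, 3.0)     # fake "blue line" data
rng = np.random.default_rng(0)
unmodified += rng.normal(0, 0.02, temperatures.size)      # add measurement noise
modified += rng.normal(0, 0.02, temperatures.size)

def fit_tm(signal):
    """Fit the logistic curve and return the midpoint temperature."""
    params, _ = curve_fit(sigmoid, temperatures, signal, p0=[1.0, 0.0, 50.0, 3.0])
    return params[2]

print(f"Apparent Tm (unmodified): {fit_tm(unmodified):.1f} °C")
print(f"Apparent Tm (modified):   {fit_tm(modified):.1f} °C")
```

A higher fitted midpoint for the modified protein corresponds to the blue line sitting to the right of the green one in the exhibit above.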

So my wife had done a great job – right? Well, things were not as simple as they might at first have seemed. There are two different protocols for carrying out this thermostability assay. These basically involve doing some of the required steps in a different order. So if the steps are A, B, C and D, then protocol #1 consists of A ↦ B ↦ C ↦ D and protocol #2 consists of A ↦ C ↦ B ↦ D. My wife was thorough enough to also use this second protocol, with the results shown below:

Sigmoidal Dose Response Curve B
© University of Cambridge – reproduced under a Creative Commons 2.0 licence

Here we have the opposite finding: the same modification to the protein now seems to have decreased its stability. There are some good reasons why this type of discrepancy might have occurred [11], but overall my wife could not conclude that this attempt to increase stability had been successful. This sort of thing happens all the time and she moved on to the next idea. It is all part of the rather messy process of conducting science [12].

I’ll let my wife explain her perspective on these results in her own words:

In general you can’t explain everything about a complex biological system with one set of data or the results of one test. It will seldom be the whole picture. Protocol #1 for the thermostability assay was the gold standard in my lab before the results I obtained above. Now protocol #1 is used in combination with another type of assay whose efficacy I also explored. Together these give us an even better picture of stability. The gold standard shifted. However, not even this bipartite test tells you everything. In any complex system (be that Biological or a complicated dataset) there are always going to be unknowns. What I think is important is knowing what you can and can’t account for. In my experience in science, there is generally much much more that can’t be explained than can.

Belt and Braces [or suspenders if you are from the US, which has quite a different connotation in the UK!]

As ever, translating all of this to a business context is instructive. Conscientious Data Scientists or business-focussed Statisticians who come across something interesting in a model or analysis will always try (where feasible) to corroborate it by other means; they will try to perform a second “experiment” to verify their initial findings. They will also realise that even two supporting results obtained in different ways will not in general be 100% conclusive. However, the highest levels of conscientiousness may be more honoured in the breach than in the observance [13]. Also there may not be an alternative “experiment” that can easily be run. Whatever the motivations or circumstances, it is not beyond the realm of possibility that some Data Science findings are true only in the same way that my wife thought she had successfully stabilised her protein before carrying out the second assay.

I would argue that business often has much to learn from the levels of rigour customary in most scientific research [14]. It would be nice to think that the same rigour is always applied in commercial matters as in academic ones. Unfortunately experience would tend to suggest that the contrary is sometimes the case. However, it would also be beneficial if people working on statistical models in industry went out of their way to stress not only what phenomena these models can explain, but also what they are unable to explain. Knowing what you don’t know is the first step towards further enlightenment.
 


 
Notes

 
[1]
 
Indeed this previous article had a sub-section titled Rigour and Scrutiny, echoing some of the themes in this piece.
 
[2]
 
See More Statistics and Medicine.
 
[3]
 
As in the earlier article, apologies for the circumlocution. I’m both looking to preserve some privacy and save the reader from boredom.
 
[4]
 
Anyone interested in more information is welcome to read her thesis which is in any case in the public domain. It is 188 pages long, which is reasonably lengthy even by my standards.
 
[5]
 
They carry out translation which refers to synthesising proteins based on information carried by messenger RNA, mRNA.
 
[6]
 
Some proteins are naturally stable, but many are not and will not survive purification or later steps in their native state.
 
[7]
 
G Protein-coupled Receptors or GPCRs.
 
[8]
 
Chopping off flexible sections, adding other small proteins which act as scaffolding, getting antibodies or other biological molecules to bind to the protein and so on.
 
[9]
 
Actually a sigmoidal dose-response curve.
 
[10]
 
For anyone with colour perception problems, the green line has markers which are diamonds and the blue line has markers which are triangles.
 
[11]
 
As my wife writes [with my annotations]:

A possible explanation for this effect was that while T4L [the protein she added to try to increase stability – T4 Lysozyme] stabilised the binding pocket, the other domains of the receptor were destabilised. Another possibility was that the introduction of T4L caused an increase in the flexibility of CL3, thus destabilising the receptor. A method for determining whether this was happening would be to introduce rigid linkers at the AT1R-T4L junction [AT1R was the protein she was studying, angiotensin II type 1 receptor], or other placements of T4L. Finally AT1R might exist as a dimer and the addition of T4L might inhibit the formation of dimers, which could also destabilise the receptor.

© University of Cambridge – reproduced under a Creative Commons 2.0 licence

 
[12]
 
See also Toast.
 
[13]
 
Though to be fair, the way that this phrase is normally used today is probably not what either Hamlet or Shakespeare intended by it back around 1600.
 
[14]
 
Of course there are sadly examples of specific scientists falling short of the ideals I have described here.

 

 

Elephants’ Graveyard?

Elephants' Graveyard
 
Introduction

My young daughter is very fond of elephants [1], as indeed am I, so I need to tread delicately here. In recent years, the world has been consumed with Big Data Fever [2] and this has been intimately entwined with Hadoop of yellow elephant fame. Clearly there are very many other products such as Apache [insert random word here] [3] which are part of the Big Data ecosystem, but it is Hadoop that has become synonymous with Big Data and indeed conflated with many of the other Big Data technologies.

Hadoop the Elephant

I have seen some successful and innovative Big Data projects and there are clearly many benefits associated with the cluster of technologies that this term is used to describe. There are also any number of paeans to this new paradigm a mouse click, or finger touch, away [4]; indeed I have featured some myself in these pages [5]. However, what has struck me of late is that a few less positive articles have been appearing. I come neither to bury nor to praise Hadoop [6], but merely to reflect on this development. I will also touch on recent rumours that one of the Apache tribe [7], specifically Spark, may be seeking an amicable divorce from Hadoop proper [8].

In doing this, I am going to draw on two articles in particular. The first is Hadoop Is Falling by George Hill (@IE_George) on The Innovation Enterprise. The second is The Hadoop Honeymoon is Over [9] by Martyn Richard Jones (@GoodStratTweet) on LinkedIn.

However, before I leap into analysing other people’s thoughts I will present some of my own [very basic] research, care of Google Trends.
 
 
Eine Kleine Nachtgoogling

Below I display two charts (larger versions are but a click away) tracking the volume of queries in the 2014-16 period for two terms: “hadoop” and “apache spark” [10]. On the assumption that California tends to lead trends more than it follows, I have focussed in on this part of the US.

Hadoop searches

Spark searches

Note on axes: On this blog I have occasionally spoken about the ability of images to conceal information as well as to reveal it [11]. Lest I be accused of making the same mistake, normalising both sets of data in the above graphs could give the misleading impression that the peak volumes of queries for “hadoop” and “apache spark” are equivalent. This is not so. The maximum number of weekly queries for “apache spark” in the three years examined is just under a fifth of the maximum number of queries for “hadoop” [12]. So, applying a rather broad rule of thumb, people searched for “hadoop” around five times more often. However, it was not the absolute number of queries that interested me, but how these change over time, so I think the approach I have taken is justified. If I had not normalised, it would have been difficult to pick out the “apache spark” trend in a combined graph.
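The normalisation itself is simple enough to sketch in a few lines of code. The weekly volumes below are invented, but the approach – scaling each series to its own maximum, as Google Trends does – is the one described above.

```python
# An illustrative sketch of the normalisation discussed above: each query-volume
# series is scaled to its own maximum, so that the shape of the trend is visible
# even when absolute volumes differ by a factor of roughly five.
# The weekly volumes here are invented for illustration only.
hadoop_weekly = [1000, 950, 900, 870, 820, 780]   # hypothetical raw query counts
spark_weekly = [60, 90, 120, 150, 170, 165]       # hypothetical raw query counts

def normalise(series):
    """Scale a series so that its maximum becomes 100, as per Google Trends."""
    peak = max(series)
    return [round(100 * value / peak, 1) for value in series]

print("hadoop (normalised):      ", normalise(hadoop_weekly))
print("apache spark (normalised):", normalise(spark_weekly))

# Without normalisation the spark trend would be dwarfed: its peak is under a
# fifth of hadoop's, so on a shared absolute axis it would appear almost flat.
ratio = max(spark_weekly) / max(hadoop_weekly)
print(f"Peak ratio (spark / hadoop): {ratio:.0%}")
```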

The obvious inference to be drawn is that searches for Hadoop (in California at least) are declining and those for Spark are increasing; though maybe with a bit of a fall-off in volume recently. Making a cast-iron connection between trends in search and trends in industry is probably a mistake [13], but the discrepancies between the two trends are at least suggestive. In the Application Development Trends article I reference (note [8]), the author states:

The Spark momentum is so great that the technology — originally positioned as a replacement for MapReduce with added real-time capabilities and in-memory processing — could break free from the reins of the Hadoop universe and become its own independent tool.

This chimes with the AtScale findings I also reported here (note [5]), which included the observation that:

Organizations who have deployed Spark in production are 85% more likely to achieve value.

One conclusion (albeit a rather tentative one) could be that while Spark is on an upward trajectory and perhaps likely to step out of the Hadoop shadow, interest in Hadoop itself is at best plateauing and possibly declining. It is against this backdrop that I’ll now consider the two articles I introduced earlier.
 
 
Trouble with Trunks

Bad Elephant!

In his article, George Hill begins by noting that:

[Hadoop] adoption appears to have more or less stagnated, leading even James Kobielus [@jameskobielus], Big Data Evangelist at IBM Analytics [14], to claim that “Hadoop declined more rapidly in 2016 from the big-data landscape than I expected” [15]

In search of reasons behind this apparent stagnation, he hypothesises that:

[A] cause for concern is simply that one man’s big data is another man’s small data. Hadoop is designed for huge amounts of data, and as Kashif Saiyed [@rizkashif] wrote on KD Nuggets [16] “You don’t need Hadoop if you don’t really have a problem of huge data volumes in your enterprise, so hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters – Hadoop technology just doesn’t shine at this scale.”

Most companies do not currently have enough data to warrant a Hadoop rollout, but did so anyway because they felt they needed to keep up with the Joneses. After a few years of experimentation and working alongside genuine data scientists, they soon realize that their data works better in other technologies.

Martyn Richard Jones weighs in on this issue in more provocative style when he says:

Hadoop has grown, feature by feature, as a response to specific technical challenges in specific and somewhat peculiar businesses. When it all kicked off, the developers weren’t thinking about creating a new generic data management architecture, one for handling massive amounts of data. They were thinking of how to solve specific problems. Then it rather got out of hand, and the piecemeal scope grew like topsy as did the multifarious ways to address the product backlog.

and aligns himself with Kashif Saiyed’s comments by adding:

It also turns out that, in spite of the babbling of the usual suspects, Big Data is not for everyone, not everyone needs it, and even if some businesses benefit from analysing their data, they can do smaller Big Data using conventional rock-solid, high-performance and proven database technologies, well-architected and packaged technologies that are in wide use.

I have been around the data space long enough to have seen a number of technologies emerge, each of which was touted as solving all known problems. These included Executive Information Systems, Relational Databases, Enterprise Resource Planning, Data Warehouses, OLAP, Business Intelligence Suites and Customer Relationship Management systems. All are useful tools, I have successfully employed each of them, but at the end of the day, they are all technologies and technologies don’t sort out problems, people do [17]. Big Data enables us to address some new problems (and revisit some old ones) in novel ways and lets us do things we could not do before. However, it is no more a universal panacea than anything that has preceded it.

Gartner Hype Cycle

Big Data seems to have disappeared from the Gartner hype cycle in 2016, perhaps as it is now viewed as having become mainstream. However, back in August 2015, it was heading downhill fast towards the rather cataclysmically named Trough of Disillusionment [18]. This reflects the fact that no technology ever lives up to its initial hype. Instead, after a period of being over-sold and an inevitable reaction to this, technologies settle down and begin to be actually useful. It seems that Gartner believes that Big Data has already gone through this rite of passage; they may well be correct in this assertion.

Hill references this himself in one of his closing comments, while ending on a more positive note:

[…] it is not the platform in itself that has caused the current issues. Instead it is perhaps the hype and association of Big Data that has done the real damage. Companies have adopted the platform without understanding it and then failed to get the right people or data to make it work properly, which has led to disillusionment and its apparent stagnation. There is still a huge amount of life in Hadoop, but people just need to understand it better.

For me there are loud and clear echoes of other technologies “failing” in the past in what Hill says [19]. My experience in these other cases is that, while technologies may not have lived up to implausible initial claims, when they do genuinely fail, it is often for reasons that are all too human [20].
 
 
Summary

A racquet is a tool, right?

I had considered creating more balance in this article by adding a section making the case for the defence. I then realised that this was actually a pretty pointless exercise. Not because Hadoop is in terminal decline and denial of this would be indefensible. Not because it must be admitted that Big Data is over-hyped and under-delivers. Cases could be made that both of those statements are either false, or at least do not tell the whole story. However I think that arguments like these are the wrong things to focus on. Let me try to explain why.

Back in 2009 I wrote an article with the title A bad workman blames his [Business Intelligence] tools. This considered the all-too-prevalent practice in rock climbing and bouldering circles of buying the latest and greatest kit and assuming that performance gains would follow from this, as opposed to doing the hard work of training and practice (the same phenomenon occurs in other sports of course). I compared this to BI practitioners relying on technology as a crutch rather than focussing on four much more important things:

  1. Determining what information is necessary to drive key business decisions.
     
  2. Understanding the various data sources that are available and how they relate to each other.
     
  3. Transforming the data to meet the information needs.
     
  4. Managing the embedding of BI in the corporate culture.

I am often asked how relevant my heritage articles are to today’s world of analytics, data management, machine learning and AI. My reply is generally that what has changed is technology and little else [21]. This means that what was relevant back in 2009 remains relevant today; sometimes more so. The only area with a strong technological element in the list of four I cite above is number 3. I would agree that a lot has happened in the intervening years around how this piece can be effected. However, nothing has really changed in the other areas. We may call business questions use cases or user stories today, but they are the same thing. You still can’t really leverage data without attempting to understand it first. The need for good communication about data projects, high-quality education and strong follow-up is just as essential as it ever was.

Below I have taken the liberty of editing my own text, replacing the terms that were prevalent in data and information circles then, with the current ones.

Well if you want people to actually use analytics capabilities, it helps if the way that the technology operates is not a hindrance to this. Ideally the ease-of-use and intuitiveness of the analytical platform deployed should be a plus point for you. However, if you have the ultimate in data technology, but your analytics do not highlight areas that business people are interested in, do not provide information that influences actual decision-making, or contain numbers that are inaccurate, out-of-date, or unreconciled, then they will not be used.

I stand by these sentiments seven or eight years later. Over time the technology and terminology we use both change. I would argue that the essentials that determine success or failure seldom do.

Let’s take the undeniable hype cycle effect to one side. Let’s also discount overreaching claims that Hadoop and its related technologies are Swiss Army Knives, capable of dealing with any data situation. Let’s also set aside the string of technical objections that Martyn Richard Jones raises. My strong opinion is that when Hadoop (or Spark or the next great thing) fails, it will again most likely be a case of bad workmen blaming their tools; just as they did back in 2009.
 


 
Notes

 
[1]
 
As was Doug Cutting’s son back in 2006. Rather than being yellow, my daughter’s favourite pachyderm is blue and called “Dee”; my wife and I have no idea why.
 
[2]
 
WHO have described the Big Data Fever situation as follows:

Phase 6, the pandemic phase, is characterized by community level outbreaks in at least one other country in a different WHO region in addition to the criteria defined in Phase 5. Designation of this phase will indicate that a global pandemic is under way.

 
[3]
 
Pick any one of: Cassandra, Flink, Flume, HBase, Hive, Impala, Kafka, Oozie, Phoenix, Pig, Spark, Sqoop, Storm and ZooKeeper.
 
[4]
 
You could start with the LinkedIn Big Data Channel.
 
[5]
 
Do any technologies grow up or do they only come of age?
 
[6]
 
The evil that open-source frameworks do lives after them; The good is oft interred with their source code; So let it be with Hadoop.
 
[7]
 
Perhaps not very respectful to Native American sensibilities, but hard to resist. No offence is intended.
 
[8]
 
Spark Poised To Break from Hadoop, Move to Cloud, Survey Says, Application Development Trends.
 
[9]
 
While the link was functioning at the point that this article was originally written, it now appears that Martyn Richard Jones’s LinkedIn account has been suspended and the article I refer to is no longer available. The original URL was https://www.linkedin.com/pulse/hadoop-honeymoon-over-martyn-jones. I’m not sure what the issue is and whether or not the article may reappear at some later point.
 
[10]
 
A couple of points here. As “spark” is a word in common usage, the qualifier of “apache” is necessary. By contrast, “hadoop” is not a name that is used for much beyond yellow elephants and so no qualifier is required. I could have used “apache hadoop” as the comparator, but instances of this are less frequent than for just “hadoop”. For what it is worth, although the number of queries for “apache hadoop” is smaller, the trend over time is pretty much the same as for just “hadoop”.
 
[11]
 
For example:

 
[12]
 
18% to be precise.
 
[13]
 
Though quite a few people make a nice living doing just that.
 
[14]
 
“IBM Software” in the original article, corrected to “IBM Analytics” here.
 
[15]
 
Big Data: Main Developments in 2016 and Key Trends in 2017, KD Nuggets.
 
[16]
 
Why Not So Hadoop?, KD Nuggets.
 
[17]
 
Though admittedly nowadays people sometimes sort problems by writing algorithms for machines to run, which then come up with the answer.
 
[18]
 
Which has always felt to me as if it should appear on a papyrus map next to a “here be dragons” legend.
 
[19]
 
For example as in “Why Business Intelligence projects fail”.
 
[20]
 
It’s worth counting how many of the risks I enumerate in 20 Risks that Beset Data Programmes are human-centric (hint: it’s a multiple of ten bigger than 15 and smaller than 25).
 
[21]
 
I might be tempted to answer a little differently when it comes to Artificial Intelligence.

 

 

Bigger and Better (Data)?

Is bigger really better

I was browsing Data Science Central [1] recently and came across an article by Bill Vorhies, President & Chief Data Scientist of Data-Magnum. The piece was entitled 7 Cases Where Big Data Isn’t Better and is worth a read in full. Here I wanted to pick up on just a couple of Bill’s points.

In his preamble, he states:

Following the literature and the technology you would think there is universal agreement that more data means better models. […] However […] it’s always a good idea to step back and examine the premise. Is it universally true that our models will be more accurate if we use more data? As a data scientist you will want to question this assumption and not automatically reach for that brand new high-performance in-memory modeling array before examining some of these issues.

Bill goes on to make several pertinent points including: that if your data is bad, having more of it is not necessarily a solution; that attempting to create a gigantic and all-purpose model may well be inferior to multiple, more targeted models on smaller sub-sets of data; and that there exist specific instances where a smaller data set yields greater accuracy [2]. However I wanted to pick up directly on Bill’s point 6 of 7, in which he also references Larry Greenemeier (@lggreenemeier) of Scientific American.

  Bill Vorhies   Larry Greenemeier  

6. Sometimes We Get Hypnotized By the Overwhelming Volume of the Data and Forget About Data Provenance and Good Project Design

A few months back I reviewed an article by Larry Greenemeier [3] about the failure of Google Flu Trend analysis to predict the timing and severity of flu outbreaks based on social media scraping. It was widely believed that this Big Data volume of data would accurately predict the incidence of flu but the study failed miserably missing timing and severity by a wide margin.

Says Greenemeier, “Big data hubris is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. The mistake of many big data projects, the researchers note, is that they are not based on technology designed to produce valid and reliable data amenable for scientific analysis. The data comes from sources such as smartphones, search results and social networks rather than carefully vetted participants and scientific instruments”.

Perhaps more pertinent to a business environment, Greenemeier’s article also states:

Context is often lacking when info is pulled from disparate sources, leading to questionable conclusions.

Ruler

Neither of these authors is saying that having greater volumes of data is a definitively bad thing; indeed Vorhies states:

In general would I still prefer to have more data than less? Yes, of course.

They are however both pointing out that, in some instances, more traditional statistical methods, applied to smaller data sets, yield superior results. This is particularly the case where data are repurposed and the use to which they are put differs from the purpose for which they were originally collected; something which is arguably more likely to happen where general-purpose Big Data sets are leveraged without reference to other information.

Also, when large data sets are collated from many places, the data from each place can have different characteristics. If this variation is not controlled for in models, it may well lead to erroneous findings.
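As a small, hedged illustration of that last point, the sketch below (with invented data and an invented "site" variable; this is just one of several ways to handle such variation) includes the source of each record as a term in a regression, so that systematic between-source differences are not mistaken for a real effect.

```python
# An invented illustration of controlling for the source of each record when data
# are collated from several places. Each site measures y with its own systematic
# offset, and the sites also differ in their typical x values, so a naive pooled
# fit confuses between-site differences with the true within-site relationship.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
frames = []
for source, x_mean, offset in [("site_a", -2.0, -2.0), ("site_b", 0.0, 0.0), ("site_c", 2.0, 2.0)]:
    x = rng.normal(loc=x_mean, scale=1.0, size=200)
    y = 0.5 * x + offset + rng.normal(scale=0.3, size=200)   # true within-site slope is 0.5
    frames.append(pd.DataFrame({"x": x, "y": y, "source": source}))
data = pd.concat(frames, ignore_index=True)

naive = smf.ols("y ~ x", data=data).fit()                 # ignores where the data came from
adjusted = smf.ols("y ~ x + C(source)", data=data).fit()  # controls for the source

print("Naive slope estimate:   ", round(naive.params["x"], 3))   # inflated by the site offsets
print("Adjusted slope estimate:", round(adjusted.params["x"], 3))  # close to the true 0.5
```

The naive estimate overstates the relationship because the site offsets line up with the sites’ differing x values; once the source is accounted for, something close to the true within-site slope is recovered.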

Statistical Methods

Their final observation is that sound statistical methodology needs to be applied to big data sets just as much as to more regular ones. The hope that design flaws will simply evaporate when data sets get large enough may be seductive, but it is also dangerously wrong.

Vorhies and Greenemeier are not suggesting that Big Data has no value. However they state that one of its most potent uses may well be as a supplement to existing methods, perhaps extending them, or bringing greater granularity to results. I view such introspection in Data Science circles as positive, likely to lead to improved methods and an indication of growing maturity in the field. It is however worth noting that, in some cases, leverage of Small-but-Well-Designed Data [4] is not only effective, but actually a superior approach. This is certainly something that Data Scientists should bear in mind.
 


 
Notes

 
[1]
 
I’d recommend taking a look at this site regularly. There is a high volume of articles and the quality is variable, but often there are some stand-out pieces.
 
[2]
 
See the original article for the details.
 
[3]
 
The article was in Scientific American and entitled Why Big Data Isn’t Necessarily Better Data.
 
[4]
 
I may have to copyright this term and of course the very elegant abridgement, SBWDD.

 

 

Predictions about Prediction

2017 the Road Ahead [Borrowed from Eckerson Group]

   
“Prediction and explanation are exactly symmetrical. Explanations are, in effect, predictions about what has happened; predictions are explanations about what’s going to happen.”

– John Rogers Searle

 

The above image is from Eckerson Group‘s article Predictions for 2017. Eckerson Group’s Founder and Principal Consultant, Wayne Eckerson (@weckerson), is someone whose ideas I have followed on-line for several years; indeed I’m rather surprised I have not posted about his work here before today.

As was possibly said by a variety of people, “prediction is very difficult, especially about the future” [1]. I did turn my hand to crystal ball gazing back in 2009 [2], but the Eckerson Group’s attempt at futurology is obviously much more up-to-date. As per my review of Bruno Aziza’s thoughts on the AtScale blog, I’m not going to cut and paste the text that Wayne and his associates have penned wholesale; instead I’d recommend reading the original article.

Here though are a number of points that caught my eye, together with some commentary of my own (the latter appears in italics below). I’ll split these into the same groups that Wayne & Co. use and also stick to their indexing, hence the occasional gaps in numbering. Where I have elided text, I trust that I have not changed the intended meaning:
 
 
Data Management

Data Management

1. The enterprise data marketplace becomes a priority. As companies begin to recognize the undesirable side effects of self-service they are looking for ways to reap self-service benefits without suffering the downside. […] The enterprise data marketplace returns us to the single-source vision that was once touted as the real benefit of Enterprise Data Warehouses.
  I’ve always thought of self-service as something of a cop-out. It tends to spare data teams from doing anything as arduous (and in some cases outside their comfort zone) as understanding what makes a business tick and getting to grips with the key questions that an organisation needs to answer in order to be successful [3]. With this messy and human-centric stuff out of the way, the data team can retreat into the comfort of nice, orderly technological matters or friendly statistical models.

However, what Eckerson Group describe here is “an Amazon-like data marketplace”, which it seems to me has more of a chance of being successful. However, such a marketplace will only function if it embodies the same focus on key business questions and how they are answered. The paradigm within which such questions are framed may be different, more community based and more federated for example, but the questions will still be of paramount importance.

 
3.
 
New kinds of data governance organizations and practices emerge. Long-standing, command-and-control data governance practices fail to meet the challenges of big data and of data democratization. […]
  I think that this is overdue. To date Data Governance, where it is implemented at all, tends to be too police-like. I entirely agree that there are circumstances in which a Data Governance team or body needs to be able to put its foot down [4], but if all that Data Governance does is police-work, then it will ultimately fail. Instead good Data Governance needs to recognise that it is part of a much more fluid set of processes [5], whose aim is to add business value; to facilitate things being done as well as sometimes to stop the wrong path being taken.

 
Data Science

Data Science

1. Self-service and automated predictive analytics tools will cause some embarrassing mistakes. Business users now have the opportunity to use predictive models but they may not recognize the limits of the models themselves. […]
  I think this is a very valid point. As well as the limits of the models themselves not being understood [6], there is no widespread understanding of statistics in many areas of business. The concept of a central prediction surrounded by a range of other outcomes with different probabilities is seldom seen in commercial circles [7]. In addition there seems to be a lack of appreciation of how big an impact the statistical methodology employed can have on what a model tells you [8].
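To make the “central prediction surrounded by other outcomes” point concrete, here is a small sketch of presenting a forecast as a set of quantiles rather than a single number. The figures and the sales scenario are invented purely for illustration.

```python
# An invented sketch of presenting a forecast as a central estimate plus a range
# of outcomes with attached probabilities, rather than as a single number.
import numpy as np

rng = np.random.default_rng(42)
# Pretend these are 10,000 simulated outcomes for next quarter's sales (in £k),
# e.g. drawn from a fitted model; here they are simply generated for illustration.
simulated_sales = rng.normal(loc=1200, scale=150, size=10_000)

central = np.median(simulated_sales)
lower_50, upper_50 = np.percentile(simulated_sales, [25, 75])
lower_90, upper_90 = np.percentile(simulated_sales, [5, 95])

print(f"Central prediction:      £{central:,.0f}k")
print(f"50% of outcomes fall in: £{lower_50:,.0f}k – £{upper_50:,.0f}k")
print(f"90% of outcomes fall in: £{lower_90:,.0f}k – £{upper_90:,.0f}k")
```

Presented this way, the Bank of England-style fan chart mentioned in note [7] is simply these widening ranges drawn over time.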

 
Business Analytics

Business Analytics

1. Modern analytic platforms dominate BI. Business intelligence (BI) has evolved from purpose-built tools in the 1990s to BI suites in the 2000s to self-service visualization tools in the 2010s. Going forward, organizations will replace tools and suites with modern analytics platforms that support all modes of BI and all types of users […]
  Again, if it comes to fruition, such consolidation is overdue. Ideally the tools and technologies will blend into the background; good data-centric work is never about the technology and always about the content and the efforts involved in ensuring that it is relevant, accurate, consistent and timely [9]. Also, information is often of most use when it is made available to people taking decisions at the precise point that they need it. This observation highlights the need for data to be integrated into systems and digital estates instead of simply being bound to an analytical hub.

 
So some food for thought from Wayne and his associates. The points they make (including those which I haven’t featured in this article) are serious and well-thought-out ones. It will be interesting to see how things have moved on by the beginning of 2018.
 


 
Notes

 
[1]
 
According to WikiQuotes, this has most famously been attributed to Danish theoretical physicist and father of Quantum Mechanics, Niels Bohr (in Teaching and Learning Elementary Social Studies (1970) by Arthur K. Ellis, p. 431). However, it has also been ascribed to various humourists, to the Danish poet Piet Hein: “det er svært at spå – især om fremtiden” (“it is difficult to prophesy – especially about the future”) and to the Danish cartoonist Storm P (Robert Storm Petersen). Perhaps it is best to say that a Dane made the comment and leave it at that.

Of course similar words have also been said to have been originated by Yogi Berra, but then that goes for most malapropisms you could care to mention. As Mr Berra himself says “I really didn’t say everything I said”.

 
[2]
 
See Trends in Business Intelligence. I have to say that several of these have come to pass, albeit sometimes in different ways to the ones I envisaged back then.
 
[3]
 
For a brief review of what is necessary see What should companies consider before investing in a Business Intelligence solution?
 
[4]
 
I wrote about the unpleasant side effects of a Change Programme unfettered by appropriate Data Governance in Bumps in the Road, for example.
 
[5]
 
I describe such a set of processes in Data Management as part of the Data to Action Journey.
 
[6]
 
I explore some similar territory to that presented by Eckerson Group in Data Visualisation – A Scientific Treatment.
 
[7]
 
My favourite counterexample is provided by The Bank of England.

The Old Lady of Threadneedle Street is clearly not a witch
An inflation prediction from The Bank of England
Illustrating the fairly obvious fact that uncertainty increases the further into the future one looks.
 
[8]
 
This is an area I cover in An Inconvenient Truth.
 
[9]
 
I cover this assertion more fully in A bad workman blames his [Business Intelligence] tools.

 

 

Toast

Acrylamide [borrowed from Wikipedia]

Foreword

This blog touches on a wide range of topics, including social media, cultural transformation, general technology and – last but not least – sporting analogies. However, its primary focus has always been on data and information-centric matters in a business context. Having said this, all but the more cursory of readers will have noted the prevalence of pieces with a Mathematical or Scientific bent. To some extent this is a simple reflection of the author’s interests and experience, but a stronger motivation is often to apply learnings from different fields to the business data arena. This article is probably more scientific in subject matter than most, but I will also look to highlight some points pertinent to commerce towards the end.
 
 
Introduction

In Science We Trust?

The topic I want to turn my attention to in this article is public trust in science. This is a subject that has consumed many column inches in recent years. One particular area of focus has been climate science, which, for fairly obvious political reasons, has come in for even more attention than other scientific disciplines of late. It would be distracting to get into the arguments about climate change and humanity’s role in it here [1] and in a sense this is just the latest in a long line of controversies that have somehow become attached to science. An obvious second example here is the misinformation circling around both the efficacy and side effects of vaccinations [2]. In both of these cases, it seems that at least a sizeable minority of people are willing to query well-supported scientific findings. In some ways, this is perhaps linked to the general mistrust of “experts” and “elites” [3] that was explicitly to the fore in the UK’s European Union Referendum debate [4].

“People in this country have had enough of experts”

– Michael Gove [5], at this point UK Justice Secretary and one of the main proponents of the Leave campaign, speaking on Sky News, June 2016.

Mr Gove was talking about economists who held a different point of view to his own. However, his statement has wider resonance and cannot be simply dismissed as the misleading sound-bite of an experienced politician seeking to press his own case. It does indeed appear that in many places around the world experts are trusted much less than they used to be and that includes scientists.

“Many political upheavals of recent years, such as the rise of populist parties in Europe, Donald Trump’s nomination for the American presidency and Britain’s vote to leave the EU, have been attributed to a revolt against existing elites.”

The Buttonwood column, The Economist, September 2016.

Why has this come to be?
 
 
A Brief [6] History of the Public Perception of Science

Public Perception

Note: This section is focussed on historical developments in the public’s trust in science. If the reader would like to skip on to more toast-centric content, then please click here.

Answering questions about the erosion of trust in politicians and the media is beyond the scope of this humble blog. Wondering what has happened to trust in science is firmly in its crosshairs. One part of the answer is that – for some time – scientists were held in too much esteem and the pendulum was inevitably going to swing back the other way. For a while the pace of scientific progress and the miracles of technology which this unleashed placed science on a pedestal from which there was only one direction of travel. During this period in which science was – in general – uncritically held in great regard, the messy reality of actual science was never really highlighted. The very phrase “scientific facts” is actually something of an oxymoron. What we have is instead scientific theories. Useful theories are consistent with existing observations and predict new phenomena. However – as I explained in Patterns patterns everywhere – a theory is only as good as the latest set of evidence and some cherished scientific theories have been shown to be inaccurate; either in general, or in some specific circumstances [7]. However saying “we have a good model that helps us explain many aspects of a phenomenon and predict more, but it doesn’t cover everything and there are some uncertainties” is a little more of a mouthful than “we have discovered that…”.

There have been some obvious landmarks along the way to science’s current predicament. The unprecedented destruction unleashed by the team working on the Manhattan Project at first made the scientists involved appear God-like. It also seemed to suggest that the path to Great Power status was through growing or acquiring the best Physicists. However, as the prolonged misery caused in Japan by the twin nuclear strikes became more apparent and as the Cold War led to generations living under the threat of mutually assured destruction, the standing attached by the general public to Physicists began to wane; the God-like mantle began to slip. While much of our modern world and its technology was created off the back of now fairly old theories like Quantum Chromodynamics and – most famously – Special and General Relativity, the actual science involved became less and less accessible to the man or woman in the street. For all the (entirely justified) furore about the detection of the Higgs Boson, few people would be able to explain much about what it is and how it fits into the Standard Model of particle physics.

In the area of medicine and pharmacology, the Thalidomide tragedy, where a drug prescribed to help pregnant women suffering from morning sickness instead led to terrible birth defects in more than 10,000 babies, may have led to more stringent clinical trials, but also punctured the air of certainty that had surrounded the development of the latest miracle drug. While medical science and related disciplines have vastly improved the health of much of the globe, the glacial progress in areas such as oncology has served as a reminder of the fallibility of some scientific endeavours. In a small way, the technical achievements of that apogee of engineering, NASA, were undermined by the loss of craft and astronauts. Most notably the Challenger and Columbia fatalities served to further remove the glossy veneer that science had acquired in the 1940s to 1960s.

Lest it be thought at this point that I am decrying science, or even being anti-scientific, nothing could be further from the truth. I firmly believe that the ever growing body of scientific knowledge is one of humankind’s greatest achievements, if not its greatest. From our unpromising vantage point on an unremarkable little planet in our equally common-or-garden galaxy we have been able to grasp many of the essential truths about the whole Universe, from the incomprehensibly gigantic to the most infinitesimal constituent of a sub-atomic particle. However, it seems that many people do not fully embrace the grandeur of our achievements, or indeed in many cases the unexpected beauty and harmony that they have revealed [8]. It is to the task of understanding this viewpoint that I am addressing my thoughts.

More recently, the austerity that has enveloped much of the developed world since the 2008 Financial Crisis has had two reinforcing impacts on science in many countries. First funding has often been cut, leading to pressure on research programmes and scientists increasingly having to make an economic case for their activities; a far cry from the 1950s. Second, income has been effectively stagnant for the vast majority of people, this means that scientific expenditure can seem something of a luxury and also fuels the anti-elite feelings cited by The Economist earlier in this article.

Anita Makri

Into this seeming morass steps Anita Makri, “editor/writer/producer and former research scientist”. In a recent Nature article she argues that the form of science communicated in popular media leaves the public vulnerable to false certainty. I reproduce some of her comments here:

“Much of the science that the public knows about and admires imparts a sense of wonder and fun about the world, or answers big existential questions. It’s in the popularization of physics through the television programmes of physicist Brian Cox and in articles about new fossils and quirky animal behaviour on the websites of newspapers. It is sellable and familiar science: rooted in hypothesis testing, experiments and discovery.

Although this science has its place, it leaves the public […] with a different, outdated view to that of scientists of what constitutes science. People expect science to offer authoritative conclusions that correspond to the deterministic model. When there’s incomplete information, imperfect knowledge or changing advice — all part and parcel of science — its authority seems to be undermined. […] A popular conclusion of that shifting scientific ground is that experts don’t know what they’re talking about.”

– Anita Makri, Give the public the tools to trust scientists, Nature, January 2017.

I’ll come back to Anita’s article again later.
 
 
Food Safety – The Dangers Lurking in Toast

Food Safety

After my speculations about the reasons why science is held in less esteem than once was the case, I’ll return to more prosaic matters; namely food and specifically that humble staple of many a breakfast table, toast. Food science has often fared no better than its brother disciplines. The scientific guidance issued to people wanting to eat healthily can sometimes seem to gyrate wildly. For many years fat was the source of all evil, more recently sugar has become public enemy number one. Red wine was meant to have beneficial effects on heart health, then it was meant to be injurious; I’m not quite sure what the current advice consists of. As Makri states above, when advice changes as dramatically as it can do in food science, people must begin to wonder whether the scientists really know anything at all.

So where does toast fit in? Well the governmental body charged with providing advice about food in the UK is called the Food Standards Agency. They describe their job as “using our expertise and influence so that people can trust that the food they buy and eat is safe and honest.” While the FSA do sterling work in areas such as publicly providing ratings of food hygiene for restaurants and the like, their most recent campaign is one which seems at best ill-advised and at worst another nail in the public perception of the reliability of scientific advice. Such things matter because they contribute to the way that people view science in general. If scientific advice about food is seen as unsound, surely there must be questions around scientific advice about climate change, or vaccinations.

Before I am accused of belittling the FSA’s efforts, let’s consider the campaign in question, which is called Go for Gold and encourages people to consume less acrylamide. Here is some of what the FSA has to say about the matter:

“Today, the Food Standards Agency (FSA) is launching a campaign to ‘Go for Gold’, helping people understand how to minimise exposure to a possible carcinogen called acrylamide when cooking at home.

Acrylamide is a chemical that is created when many foods, particularly starchy foods like potatoes and bread, are cooked for long periods at high temperatures, such as when baking, frying, grilling, toasting and roasting. The scientific consensus is that acrylamide has the potential to cause cancer in humans.

[…]

as a general rule of thumb, aim for a golden yellow colour or lighter when frying, baking, toasting or roasting starchy foods like potatoes, root vegetables and bread.”

– Food Standards Agency, Families urged to ‘Go for Gold’ to reduce acrylamide consumption, January 2017.

The Go for Gold campaign was picked up by various media outlets in the UK. For example the BBC posted an article on its web-site which opened by saying:

Dangerous Toast [borrowed from the BBC]

“Bread, chips and potatoes should be cooked to a golden yellow colour, rather than brown, to reduce our intake of a chemical which could cause cancer, government food scientists are warning.”

– BBC, Browned toast and potatoes are ‘potential cancer risk’, say food scientists, January 2017.

The BBC has been obsessed with neutrality on all subjects recently [9], but in this case they did insert the reasonable counterpoint that:

“However, Cancer Research UK [10] said the link was not proven in humans.”

Acrylamide is certainly a nasty chemical. Amongst other things, it is used in polyacrylamide gel electrophoresis, a technique used in biochemistry. If biochemists mix and pour their own gels, they have to monitor their exposure and there are time-based and lifetime limits as to how often they can do such procedures [11]. Acrylamide has also been shown to lead to cancer in mice. So what could be more reasonable than the FSA’s advice?
 
 
Food Safety – A Statistical / Risk Based Approach

David Spiegelhalter

Earlier I introduced Anita Makri; it is now time to meet our second protagonist, David Spiegelhalter, Winton Professor for the Public Understanding of Risk in the Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge [12]. Professor Spiegelhalter has penned a response to the FSA’s Go for Gold campaign. I feel that this merits reading in its entirety, but here are some highlights:

“Very high doses [of Acrylamide] have been shown to increase the risk of mice getting cancer. The IARC (International Agency for Research on Cancer) considers it a ‘probable human carcinogen’, putting it in the same category as many chemicals, red meat, being a hairdresser and shift-work.

However, there is no good evidence of harm from humans consuming acrylamide in their diet: Cancer Research UK say that ‘At the moment, there is no strong evidence linking acrylamide and cancer.’

This is not for want of trying. A massive report from the European Food Standards Agency (EFSA) lists 16 studies and 36 publications, but concludes

  ‘In the epidemiological studies available to date, AA intake was not associated with an increased risk of most common cancers, including those of the GI or respiratory tract, breast, prostate and bladder. A few studies suggested an increased risk for renal cell, and endometrial (in particular in never-smokers) and ovarian cancer, but the evidence is limited and inconsistent. Moreover, one study suggested a lower survival in non-smoking women with breast cancer with a high pre-diagnostic exposure to AA but more studies are necessary to confirm this result. (p185)’

[…]

[Based on the EFSA study] adults with the highest consumption of acrylamide could consume 160 times as much and still only be at a level that toxicologists think unlikely to cause increased tumours in mice.

[…]

This all seems rather reassuring, and may explain why it’s been so difficult to observe any effect of acrylamide in diet.”

– David Spiegelhalter, Opinion: How dangerous is burnt toast?, University of Cambridge, January 2017.

Indeed, Professor Spiegelhalter, an esteemed statistician, also points out that most studies will adopt the standard criteria for statistical significance. Given that such significance levels are often set at 5%, this means that:

“[As] each study is testing an association with a long list of cancers […], we would expect 1 in 20 of these associations to be positive by chance alone.”
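To put a rough number on that multiple-testing point, here is a back-of-the-envelope sketch (mine, not Professor Spiegelhalter’s calculation) of what a 5% significance level implies when twenty associations are tested.

```python
# A back-of-the-envelope illustration of the multiple-testing point quoted above:
# if 20 independent associations are each tested at the conventional 5% level,
# a "positive" finding is quite likely to appear by chance alone, even when no
# real effect exists.
alpha = 0.05
tests = 20

expected_false_positives = alpha * tests
prob_at_least_one = 1 - (1 - alpha) ** tests

print(f"Expected false positives out of {tests} tests: {expected_false_positives:.0f}")
print(f"Chance of at least one false positive: {prob_at_least_one:.0%}")
# Roughly one false positive is expected, and there is about a 64% chance of
# seeing at least one.
```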

He closes his article by stating – not unreasonably – that the FSA’s time and attention might be better spent on areas where causality between an agent and morbidity is well-established, for example obesity. My assumption is that the FSA has a limited budget and has to pick and choose which food issues to weigh in on. Even if we accept for the moment that there is some slight chance of a causal link between the consumption of low levels of acrylamide and cancer, there are plenty of other areas in which causality is firmly established: obesity, as mentioned by Professor Spiegelhalter, excessive use of alcohol, even basic kitchen hygiene. It is hard to understand why the FSA did not put more effort into these and instead focussed on an area where the balance of scientific judgement is that there is unlikely to be an issue.

Having a mathematical background perhaps biases me, but I tend to side with Professor Spiegelhalter’s point of view. I don’t want to lay the entire blame for the poor view that some people have of science at the FSA’s door, but I don’t think campaigns like Go for Gold help very much either. The apocryphal rational man or woman will probably deduce that there is not an epidemic of acrylamide poisoning in progress. This means that they may question what the experts at the FSA are going on about. In turn this reduces respect for other – perhaps more urgent – warnings about food and drink. Such a reaction is also likely to colour how the same rational person thinks about “expert” advice in general. All of this can contribute to further cracks appearing in the public edifice of science, an outcome I find very unfortunate.

So what is to be done?
 
 
A Call for a New and More Honest Approach to Science Communications

Honesty is the Best Policy

As promised I’ll return to Anita Makri’s thoughts in the same article referenced above:

“It’s more difficult to talk about science that’s inconclusive, ambivalent, incremental and even political — it requires a shift in thinking and it does carry risks. If not communicated carefully, the idea that scientists sometimes ‘don’t know’ can open the door to those who want to contest evidence.

[…]

Scientists can influence what’s being presented by articulating how this kind of science works when they talk to journalists, or when they advise on policy and communication projects. It’s difficult to do, because it challenges the position of science as a singular guide to decision making, and because it involves owning up to not having all of the answers all the time while still maintaining a sense of authority. But done carefully, transparency will help more than harm. It will aid the restoration of trust, and clarify the role of science as a guide.”

The scientific method is meant to be about honesty. You record what you see, not what you want to see. If the data don’t support your hypothesis, you discard or amend your hypothesis. The peer-review process is meant to hold scientists to the highest levels of integrity. What Makri seems to be suggesting is for scientists to turn their lenses on themselves and how they communicate their work. Being honest where there is doubt may be scary, but not as scary as being caught out pushing certainty where no certainty is currently to be had.
 


 
Epilogue

At the beginning of this article, I promised that I would bring things back to a business context. With lots of people with PhDs in numerate sciences now plying their trade as data scientists and the like, there is an attempt to make commerce more scientific [13]. Understandably, the average member of a company will have less of an appreciation of statistics and statistical methods than their data scientists do. This can lead to data science seeming like magic; the philosopher’s stone [14]. There are obvious parallels here with how Physicists were seen in the period immediately after the Second World War.

Earlier in the text, I mused about what factors may have led to a deterioration in how the public views science and scientists. I think that there is much to be learnt from the issues I have covered in this article. If data scientists begin to try to peddle absolute truth and perfect insight (both of which, it is fair to add, are often expected from them by non-experts), as opposed to ranges of outcomes and probabilities, then the same decline in reputation probably awaits them. Instead it would be better if data scientists heeded Anita Makri’s words and tried to always be honest about what they don’t know as well as what they do.
 


 
Notes

 
[1]
 
Save to note that there really is no argument in scientific circles.

As ever Randall Munroe makes the point pithily in his Earth Temperature Timeline – https://xkcd.com/1732/.

For a primer on the area, you could do worse than watching The Royal Society‘s video:

 
[2]
 
For the record, my daughter has had every vaccine known to the UK and US health systems and I’ve had a bunch of them recently as well.
 
[3]
 
Most scientists I know would be astonished that they are considered part of the amorphous, ill-defined and obviously malevolent global “elite”. Then again, “elite” is just one more proxy for “the other”, something which it is not popular to be in various places in the world at present.
 
[4]
 
Or what passed for debate in these post-truth times.
 
[5]
 
Mr Gove studied English at Lady Margaret Hall, Oxford, where he was also President of the Oxford Union. Clearly Oxford produces fewer experts than it did in previous eras.
 
[6]
 
One that is also probably wildly inaccurate and certainly incomplete.
 
[7]
 
So Newton’s celebrated theory of gravitation is “wrong”, but actually works perfectly well in most circumstances. The Rutherford–Bohr model, in which atoms are little Solar Systems with the nucleus circled by electrons much as the planets circle the Sun, is also “wrong”, but does serve to explain a number of things; if sadly not the orbital angular momentum of electrons.
 
[8]
 
Someone should really write a book about that – watch this space!
 
[9]
 
Not least in the aforementioned EU Referendum, where it felt the need to follow the views of the vast majority of economists with those of the tiny minority, implying that the same weight should be attached to both points of view. For example, 99.9999% of people believe the world to be round, but in the interests of balance my mate Jim reckons it is flat.
 
[10]
 
According to their web-site: “the world’s leading charity dedicated to beating cancer through research”.
 
[11]
 
As attested to personally by the only proper scientist in our family.
 
[12]
 
Unlike Oxford (according to Mr Gove anyway), Cambridge clearly still aspires to creating experts.
 
[13]
 
By this I mean proper science and not pseudo-science like management theory and the like.
 
[14]
 
In the original, non-J.K. Rowling sense of the phrase.

 

 

Metamorphosis


No, neither my observations on the work of Kafka, nor on that of Escher [1]. Instead, some musings on how to transform a bare-bones and unengaging chart into something that both captures the attention of the reader and better informs them of the message that the data is relaying. Let’s consider an example:

Before:

Before

After:

After

The two images above are both renderings of the same dataset, which tracks the degree of fragmentation of the Israeli parliament – the Knesset – over time [2]. They are clearly rather different and – I would argue – the latter makes it a lot easier to absorb information and thus to draw inferences.

Boris Gorelik

Both are the work of Boris Gorelik, a data scientist at Automattic, the company best known for creating the freemium SaaS blogging platform WordPress.com and the open source blogging software WordPress [3].

Data for breakfast

I have been a contented WordPress.com user since the inception of this blog back in November 2008, so it was with interest that I learnt that Automattic have their own data-focussed blog, Data for Breakfast, unsurprisingly hosted on WordPress.com. It was on Data for Breakfast that I found Boris’s article, Evolution of a Plot: Better Data Visualization, One Step at a Time. In this he takes the reader step by step through what he did to transform his data visualisation from the ugly duckling “before” exhibit to the beautiful swan “after” exhibit.
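Boris’s post includes his actual Python; purely to give a flavour of the approach, here is a minimal, hypothetical sketch of that sort of incremental clean-up using matplotlib. The data and the specific styling steps below are made up for illustration and are not taken from his article:

```python
import matplotlib.pyplot as plt

# Hypothetical data standing in for a series like the Knesset fragmentation
# index that Boris charts - these numbers are invented for illustration only.
years = [1996, 1999, 2003, 2006, 2009, 2013, 2015]
values = [6.1, 7.3, 6.8, 7.8, 6.9, 7.3, 6.9]

fig, ax = plt.subplots(figsize=(8, 4))

# The "before" state: matplotlib's defaults.
ax.plot(years, values)

# The kind of incremental clean-up the article walks through:
ax.spines["top"].set_visible(False)      # remove frame lines that carry no data
ax.spines["right"].set_visible(False)
ax.tick_params(length=0)                 # drop tick marks, keep the labels
ax.set_ylim(0, 10)                       # a stable baseline avoids exaggeration
ax.set_title("Fragmentation of the Knesset over time", loc="left")
ax.annotate("2006 peak", xy=(2006, 7.8), xytext=(2008, 8.6),
            arrowprops=dict(arrowstyle="->"))   # direct labelling beats a legend

fig.tight_layout()
fig.savefig("after.png", dpi=150)
```

Essentially every step in the second half either removes something that carries no information or adds something that guides the reader’s eye, which is the general recipe behind most before-and-after transformations of this kind.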

Boris is using Python and various related libraries to do his data visualisation work. Given that I stopped programming commercially sometime around 2009 (admittedly with a few lapses since), I typically use the much more quotidian Excel for most of the charts that appear on peterjamesthomas.com [4]. Sometimes, where warranted, I enhance these using Visio and / or PaintShop Pro.

For example, the three [5] visualisations featured in A Tale of Two [Brexit] Data Visualisations were produced this way. Despite the use of Calibri, which is probably something of a giveaway, I hope that none of these resembles a straight-out-of-the-box Excel graph [6].
 

Brexit Bar
UK Referendum on EU Membership – Percentage voting by age bracket (see notes)

 
Brexit Bar 2
UK Referendum on EU Membership – Numbers voting by age bracket (see notes)

 
Brexit Flag
UK Referendum on EU Membership – Number voting by age bracket (see notes)

 
While, in the above, I have not gone to the lengths that Boris has in transforming his initial, raw chart into something much more readable, I do my best to make my Excel charts look at least semi-professional. My reasoning is that when the author of a chart has clearly put some effort into how it looks, and has at least attempted to consider how it will be read, this is a strong signal that the subject matter merits closer consideration.

Next time I develop a chart for posting on these pages, I may take Boris’s lead and also publish how I went about creating it.
 


 Notes

 
[1]
 
Though the latter’s work has adorned these pages on several occasions and indeed appears in my seminar decks.
 
[2]
 
Boris has charted a metric derived from how many parties there have been and how many representatives of each. See his article itself for further background.
 
[3]
 
You can learn more about the latter at WordPress.org.
 
[4]
 
Though I have also used GraphPad Prism for producing more scientific charts such as the main one featured in Data Visualisation – A Scientific Treatment.
 
[5]
 
Yes I can count. I have certificates which prove this.
 
[6]
 
Indeed the final one was designed to resemble a fractured British flag. I’ll leave readers to draw their own conclusions here.

 

 

Curiouser and Curiouser – The Limits of Brexit Voting Analysis

An original illustration from Charles Lutwidge Dodgson's seminal work would have been better, but sadly none such seems to be extant
 
Down the Rabbit-hole

When I posted my Brexit infographic reflecting the age of voters, an obvious extension was to add an indication of the number of people in each age bracket who did not vote, as well as those who did. This seemed a relatively straightforward task, but actually proved to be rather troublesome (this may be an example of British understatement). Maybe the caution I gave in An Inconvenient Truth about statistical methods having a large impact on statistical outcomes should have led me to expect such issues. In any case, I thought that it would be instructive to talk about the problems I stumbled across and to – once again – emphasise the perils of over-extending statistical models.

Brexit ages infographic
Click to download a larger PDF version in a new window.

Regular readers will recall that my Brexit Infographic (reproduced above) leveraged data from an earlier article, A Tale of two [Brexit] Data Visualisations. As cited in this article, the numbers used were from two sources:

  1. The UK Electoral Commission – I got the overall voting numbers from here.
  2. Lord Ashcroft’s Polling organisation – I got the estimated distribution of votes by age group from here.

In the notes section of A Tale of two [Brexit] Data Visualisations I [prophetically] stated that the breakdown of voting by age group was just an estimate. Based on what I have discovered since, I’m rather glad that I made this caveat explicit.
 
 
The Pool of Tears

In order to work out the number of people in each age bracket who did not vote, an obvious starting point would be the overall electorate, which the UK Electoral Commission stated as being 46,500,001. As we know that 33,551,983 people voted (an actual figure rather than an estimate), this is where the turnout percentage of 72.2% (actually 72.1548%) comes from (33,551,983 / 46,500,001).
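As a quick sanity check, the same division in code, using exactly the figures quoted above:

```python
electorate = 46_500_001   # UK Electoral Commission electorate for the Referendum
votes_cast = 33_551_983   # total votes cast

print(f"Turnout: {votes_cast / electorate:.4%}")   # => Turnout: 72.1548%
```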

A clarifying note, the electorate figures above refer to people who are eligible to vote. Specifically, in order to vote in the UK Referendum, people had to meet the following eligibility criteria (again drawn from the UK Electoral Commission):

To be eligible to vote in the EU Referendum, you must be:

  • A British or Irish citizen living in the UK, or
  • A Commonwealth citizen living in the UK who has leave to remain in the UK or who does not require leave to remain in the UK, or
  • A British citizen living overseas who has been registered to vote in the UK in the last 15 years, or
  • An Irish citizen living overseas who was born in Northern Ireland and who has been registered to vote in Northern Ireland in the last 15 years.

EU citizens are not eligible to vote in the EU Referendum unless they also meet the eligibility criteria above.

So far, so simple. The next thing I needed to know was how the electorate was split by age. This is where we begin to run into problems. One place to start is the actual population of the UK as at the last census (2011). This is as follows:
 

Ages (years) Population % of total
0–4 3,914,000 6.2
5–9 3,517,000 5.6
10–14 3,670,000 5.8
15–19 3,997,000 6.3
20–24 4,297,000 6.8
25–29 4,307,000 6.8
30–34 4,126,000 6.5
35–39 4,194,000 6.6
40–44 4,626,000 7.3
45–49 4,643,000 7.3
50–54 4,095,000 6.5
55–59 3,614,000 5.7
60–64 3,807,000 6.0
65–69 3,017,000 4.8
70–74 2,463,000 3.9
75–79 2,006,000 3.2
80–84 1,496,000 2.4
85–89 918,000 1.5
90+ 476,000 0.8
Total 63,183,000 100.0

 
If I roll up the above figures to create the same age groups as in the Ashcroft analysis (something that requires splitting the 15-19 range, which I have assumed can be done uniformly), I get:
 

Ages (years) Population % of total
0-17 13,499,200 21.4
18-24 5,895,800 9.3
25-34 8,433,000 13.3
35-44 8,820,000 14.0
45-54 8,738,000 13.8
55-64 7,421,000 11.7
65+ 10,376,000 16.4
Total 63,183,000 100.0
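For anyone who wants to reproduce this roll-up, here is a minimal Python sketch of the calculation just described. The census figures are in thousands, as per the first table, and the 15-19 band is split uniformly so that three fifths of it falls below voting age:

```python
# 2011 census population by five-year band, in thousands (as per the table above).
census = {
    "0-4": 3914, "5-9": 3517, "10-14": 3670, "15-19": 3997,
    "20-24": 4297, "25-29": 4307, "30-34": 4126, "35-39": 4194,
    "40-44": 4626, "45-49": 4643, "50-54": 4095, "55-59": 3614,
    "60-64": 3807, "65-69": 3017, "70-74": 2463, "75-79": 2006,
    "80-84": 1496, "85-89": 918, "90+": 476,
}

# Split the 15-19 band uniformly: ages 15, 16 and 17 sit below the voting age.
under_18_share = 3 / 5

ashcroft_bands = {
    "0-17":  census["0-4"] + census["5-9"] + census["10-14"]
             + census["15-19"] * under_18_share,
    "18-24": census["15-19"] * (1 - under_18_share) + census["20-24"],
    "25-34": census["25-29"] + census["30-34"],
    "35-44": census["35-39"] + census["40-44"],
    "45-54": census["45-49"] + census["50-54"],
    "55-64": census["55-59"] + census["60-64"],
    "65+":   sum(census[k] for k in
                 ("65-69", "70-74", "75-79", "80-84", "85-89", "90+")),
}

total = sum(ashcroft_bands.values())
for band, pop in ashcroft_bands.items():
    print(f"{band:>5}: {pop * 1000:>12,.0f}  ({pop / total:.1%})")
# 18-24 comes out at 5,895,800 (9.3%), matching the table above.
```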

 
The UK Government isn’t interested in the views of people under 18[citation needed], so eliminating this row we get:
 

Ages (years) Population % of total
18-24 5,895,800 11.9
25-34 8,433,000 17.0
35-44 8,820,000 17.8
45-54 8,738,000 17.6
55-64 7,421,000 14.9
65+ 10,376,000 20.9
Total 49,683,800 100.0

 
As mentioned, the above figures are from 2011 and the UK population has grown since then. Web-site WorldOMeters offers an extrapolated population of 65,124,383 for the UK in 2016 (this is as at 12th July 2016; if extrapolation and estimates make you queasy, I’d suggest closing this article now!). I’m going to use a rounder figure of 65,125,000 people; there is no point pretending that precision exists where it clearly doesn’t. If we also assume that such growth is uniform across all age groups (please refer to my previous bracketed comment!), then the above exhibit can be extrapolated to give us:
 

Ages (years) Population % of total
18-24 6,077,014 11.9
25-34 8,692,198 17.0
35-44 9,091,093 17.8
45-54 9,006,572 17.6
55-64 7,649,093 14.9
65+ 10,694,918 20.9
Total 51,210,887 100.0
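The extrapolation itself is just one scaling factor applied to every band; a sketch, under the same uniform-growth assumption:

```python
# 18+ population by band from the 2011 census roll-up above.
adults_2011 = {
    "18-24": 5_895_800, "25-34": 8_433_000, "35-44": 8_820_000,
    "45-54": 8_738_000, "55-64": 7_421_000, "65+": 10_376_000,
}

uk_population_2011 = 63_183_000
uk_population_2016 = 65_125_000          # rounded WorldOMeters extrapolation

# Assume uniform growth across all age groups (the big assumption flagged above).
growth = uk_population_2016 / uk_population_2011

adults_2016 = {band: pop * growth for band, pop in adults_2011.items()}

print(f"18+ population, 2016 estimate: {sum(adults_2016.values()):,.0f}")
# => roughly 51.2 million, within rounding of the 51,210,887 in the table above.
```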

 
 
Looking Glass House

So our – somewhat fabricated – figure for the 18+ UK population in 2016 is 51,210,887; let’s just call this 51,200,000. As at the beginning of this article, the electorate for the 2016 UK Referendum was 46,500,000 (dropping off the 1 person, with apologies to him or her). The difference is explicable based on the eligibility criteria quoted above. I now have a rough age group breakdown of the 51.2 million population; how best to apply this to the 46.5 million electorate?

I’ll park this question for the moment and instead look to calculate a different figure. Based on the Ashcroft model, what percentage of the UK population (i.e. the 51.2 million) voted in each age group? We can work this one out without many complications as follows:
 

Ages (years) Population (A) Voted (B) Turnout % (B/A)
18-24 6,077,014 1,701,067 28.0
25-34 8,692,198 4,319,136 49.7
35-44 9,091,093 5,656,658 62.2
45-54 9,006,572 6,535,678 72.6
55-64 7,649,093 7,251,916 94.8
65+ 10,694,918 8,087,528 75.6
Total 51,210,887 33,551,983 65.5

(B) = Size of each age group in the Ashcroft sample as a percentage multiplied by the total number of people voting (see A Tale of two [Brexit] Data Visualisations).
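In code, the whole table boils down to a few lines (population and voted figures exactly as tabulated above):

```python
# Population (A) is the extrapolated 2016 figure from the previous table; voted
# (B) is each age group's share of the Ashcroft sample multiplied by the
# 33,551,983 total votes cast.
population = {            # (A)
    "18-24": 6_077_014, "25-34": 8_692_198, "35-44": 9_091_093,
    "45-54": 9_006_572, "55-64": 7_649_093, "65+": 10_694_918,
}
voted = {                 # (B)
    "18-24": 1_701_067, "25-34": 4_319_136, "35-44": 5_656_658,
    "45-54": 6_535_678, "55-64": 7_251_916, "65+": 8_087_528,
}

for band in population:
    print(f"{band:>5}: {voted[band] / population[band]:.1%}")

print(f"Total: {sum(voted.values()) / sum(population.values()):.1%}")
# The 55-64 line comes out at 94.8% - the implausibly high figure discussed below.
```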
 
Remember here that actual turnout figures have electorate as the denominator, not population. As the electorate is less than the population, this means that all of the turnout percentages should actually be higher than the ones calculated (e.g. the overall turnout with respect to electorate is 72.2% whereas my calculated turnout with respect to population is 65.5%). So given this, how to explain the 94.8% turnout of 55-64 year olds? To be sure this group does reliably turn out to vote, but did essentially all of them (remembering that the figures in the above table are too low) really vote in the referendum? This seems less than credible.

The turnout for 55-64 year olds in the 2015 General Election has been estimated at 77%, based on an overall turnout of 66.1% (web-site UK Political Info; once more these figures will have been created using techniques similar to the ones I am employing here). If we assume a uniform uplift across age ranges (that “assume” word again!), then an increase in overall turnout from 66.1% to 72.2% might lead to the turnout in the 55-64 age bracket increasing from 77% to around 84%. An 84% turnout is still very high, but it is at least feasible; close to 100% turnout from this age group seems beyond the realms of likelihood.
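The back-of-the-envelope uplift calculation, for completeness (the uniform-uplift assumption is, once again, mine):

```python
ge2015_overall_turnout = 66.1    # %, 2015 General Election (UK Political Info)
referendum_turnout = 72.2        # %, 2016 EU Referendum
ge2015_turnout_55_64 = 77.0      # %, estimated 55-64 turnout at the General Election

# Assume the increase in overall turnout applies uniformly to every age group.
uplift = referendum_turnout / ge2015_overall_turnout
print(f"Implied 55-64 Referendum turnout: {ge2015_turnout_55_64 * uplift:.0f}%")
# => 84%, a long way short of the 94.8% implied by my Ashcroft extrapolation.
```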

So what has gone wrong? Well, so far the only culprit I can think of is the distribution of voting by age group in the Ashcroft poll. To be clear, I’m not accusing Lord Ashcroft and his team of sloppy work; rather, I’m calling out that the way I have extrapolated their figures may not be sustainable. Indeed, if my extrapolation is valid, this would imply that the Ashcroft model overestimated the proportion of 55-64 year olds voting, and thus underestimated the proportion of voters in some other age group. Putting aside the fact that I have probably used their figures in an unintended manner, could it be that the much-maligned turnout of younger people has been misrepresented?

To test the validity of this hypothesis, I turned to a later poll by Omnium. To be sure, this was based on a sample size of around 2,000, as opposed to Ashcroft’s 12,000, but it does paint a significantly different picture. Their distribution of voter turnout by age group was as follows:
 

Ages (years) Turnout %
18-24 64
25-39 65
40-54 66
55-64 74
65+ 90

 
I have to say that the Omnium age groups are a bit idiosyncratic, so I have taken advantage of the fact that the figures for 25-54 are essentially the same to create a schedule that matches the Ashcroft groups as follows:
 

Ages (years) Turnout %
18-24 64
25-34 65
35-44 65
45-54 65
55-64 74
65+ 90

 
The Omnium model suggests that younger voters may have turned out in greater numbers than the Ashcroft data would imply. In turn this would suggest that a much greater percentage of 18-24 year olds turned out for the Referendum (64%) than for the last General Election (43%); contrast this with an estimated 18-24 turnout figure of 47% based on just the increase in overall turnout between the General Election and the Referendum. The Omnium estimates do, however, still show that turnout was greater in the 55+ brackets, which is consistent with the pattern seen in other elections.
 
 
Humpty Dumpty

While it may well be that the Leave / Remain splits based on the Ashcroft figures are reasonable, I’m less convinced that extrapolating these same figures to make claims about actual voting numbers by age group (as I have done) is tenable. Perhaps it would be better to view each age cohort as a mini sample to be treated independently. Based on the analysis above, I doubt that the turnout figures I have extrapolated from the Ashcroft breakdown by age group are robust. However, that is not the same as saying that the Ashcroft data is flawed, or that the Omnium figures are correct. Indeed the Omnium data (at least those elements published on their web-site) don’t include an analysis of whether the people in their sample voted Leave or Remain, so direct comparison is not going to be possible. Performing calculation gymnastics such as using the Omnium turnout for each age group in combination with the Ashcroft voting splits for Leave and Remain for the same age groups actually leads to a rather different Referendum result, so I’m not going to plunge further down this particular rabbit hole.

In summary, my supposedly simple trip to the destination of an enhanced Brexit Infographic has proved unexpectedly arduous, winding and beset by troubles. These challenges have proved so great that I’ve abandoned the journey and will instead be heading for home.
 
 
Which dreamed it?

Based on my work so far, I have severe doubts about the accuracy of some of the age-based exhibits I have published (versions of which have also appeared on many web-sites; the BBC is just one example – scroll down to “How different age groups voted” and note that the percentages cited reconcile to mine). I believe that my logic and calculations are sound, but it seems that I have made too many assumptions about how I can leverage the Ashcroft data. After posting this article, I will accordingly go back and annotate each of my previous posts, linking them to these later findings.

I think the broader lesson to be learnt is that estimates are just that, attempts (normally well-intentioned of course) to come up with figures where the actual numbers are not accessible. Sometimes this is a very useful – indeed indispensable – approach, sometimes it is less helpful. In either case estimation should always be approached with caution and the findings ideally sense-checked in the way that I have tried to do above.

Occam’s razor would suggest that when the stats tell you something that seems incredible, then 99 times out of 100 there is an error or inaccurate assumption buried somewhere in the model. This applies when you are creating the model yourself and doubly so where you are relying upon figures calculated by other people. In the latter case not only is there the risk of their figures being inaccurate, there is the incremental risk that you interpret them wrongly, or stretch their broader application to breaking point. I was probably guilty of one or more of the above sins in my earlier articles. I’d like my probable misstep to serve as a warning to other people when they too look to leverage statistics in new ways.

A further point is that the most advanced concepts I have applied in my calculations above are addition, subtraction, multiplication and division. If these basic operations – even in the hands of someone like me who is relatively familiar with them – can lead to the issues described above, just imagine what could result from the more complex mathematical techniques (e.g. ambition, distraction, uglification and derision) used by even entry-level data scientists. This perhaps suggests an apt aphorism: Caveat calculator!

Beware the Jabberwock, my son! // The jaws that bite, the claws that catch! // Beware the Jubjub bird, and shun // The frumious Bandersnatch!