This article was originally intended for publication late in the year it reviews, but, as they say, the best-laid schemes o’ mice an’ men gang aft agley…
In 2017 I wrote more articles than in any year since 2009, which was the first full year of this site’s existence. Some were viewed by thousands of people, others received less attention. Here I am going to ignore the metric of popular acclaim and instead highlight a few of the articles that I enjoyed writing most, or sometimes enjoyed re-reading a few months later. Given the breadth of subject matter that appears on peterjamesthomas.com, I have split this retrospective into six areas, which are presented in decreasing order of the number of 2017 articles I wrote in each. These are as follows:
In each category, I will pick out two or three pieces which I feel are both representative of my overall content and worth a read. I would be more than happy to receive any feedback on my selections, or suggestions for different choices.
Two articles on how Data Visualisation is used in Meteorology. Part I provides a worked example illustrating some of the problems that can arise when adopting a rainbow colour palette in data visualisation. Part II grapples with hurricane prediction and covers some issues with data visualisations that are intended to convey safety information to the public.
What links Climate Change, the Manhattan Project, Brexit and Toast? How do these relate to the public’s trust in Science? What does this mean for Data Scientists?
Answers provided by Nature, The University of Cambridge and the author.
The wisdom of the crowd relies upon essentially democratic polling of a large number of respondents; an approach that has several shortcomings, not least the lack of weight attached to people with specialist knowledge. The Surprisingly Popular algorithm addresses these shortcomings and so far has out-performed existing techniques in a range of studies.
The 2017 Nobel Prize for Chemistry was awarded to Structural Biologist Richard Henderson and two other co-recipients. What can Machine Learning practitioners learn from Richard’s observations about how to generate images from Cryo-Electron Microscopy data?
Many Chief Data Officer job descriptions have a list of requirements that resemble Swiss Army Knives. This article argues that the CDO must be the conductor of an orchestra, not someone who is a virtuoso in every single instrument.
Paul Barsch (EY & Teradata) provides some insight into why Big Data projects fail, what you can do about this and how best to treat any such projects that head off the rails. With additional contributions from Big Data gurus Albert Einstein, Thomas Edison and Samuel Beckett.
Thoughts on trends in interest in Hadoop and Spark, featuring George Hill, James Kobielus, Kashif Saiyed and Martyn Richard Jones, together with the author’s perspective on the importance of technology in data-centric work.
I would like to close this review of 2017 with a final article, one that somehow defies classification:
In the first article in this mini-series we looked at alternative approaches to colour and how these could inform or mislead in data visualisations relating to weather events. In particular we discussed drawbacks of using a rainbow palette in such visualisations and some alternatives. Here we move into much more serious territory: how best to inform the public about what a specific hurricane will do next and the risks that it poses. It would not be an exaggeration to say that this can sometimes be a matter of life and death. As with rainbow-coloured maps of weather events, some aspects of how the estimated future courses of hurricanes are communicated and understood leave much to be desired.
The above diagram is called the cone of uncertainty of a hurricane. “Cone of uncertainty” sounds like an odd term. What does it mean? Let’s start by offering a historical perspective on hurricane modelling.
Well, like any other type of weather prediction, determining the future direction and speed of a hurricane is not an exact science. In the earlier days of hurricane modelling, Meteorologists used to employ statistical models. These were built based on detailed information about previous hurricanes, took as input many data points about the history of a current hurricane’s evolution and provided as output a prediction of what it could do in coming days.
There were a variety of statistical models but, when used for hurricane prediction, their output fell into two types.
First, the model could have generated a single prediction (the centre of the hurricane will be at 32.3078° N, 64.7505° W tomorrow) and supplemented this with an error measure; let’s call this Type A. The error measure would have been based on historical hurricane data and related to how far out prior predictions had been on average; this measure would have been in kilometres. It would have been typical to employ some fraction of the error measure to define a “circle of uncertainty” around the central prediction; 80% in the example directly above (compared to two thirds in the NWS exhibit at the start of the article).
Second, the model could have generated a large number of mini-predictions, each of which would have had a probability associated with it (e.g. the first two estimates of location could be that the centre of the hurricane is at 32.3078° N, 64.7505° W with a 5% chance, or a mile away at 32.3223° N, 64.7505° W with a 2% chance and so on); let’s call this Type B. In general, if you had picked the “centre of gravity” of the Type B output, it would have been analogous to the single prediction of the Type A output. The spread of point predictions in the second method would also have been analogous to the error measure of the first. Drawing a circle around the centroid would have captured a percentage of the mini-predictions, once more 80% in the example immediately above and two thirds in the NWS chart, generating another “circle of uncertainty”.
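To make the second approach concrete, here is a minimal sketch in Python. The point predictions and weights are entirely made up for illustration (a Gaussian scatter around the coordinates quoted above), and distances are treated as planar kilometres, ignoring the convergence of lines of longitude; a real model would be far more careful on both counts.

```python
import math
import random

random.seed(42)

# Hypothetical Type B output: 100 point predictions for the hurricane's
# centre, each with an equal probability weight. The scatter around
# 32.31 N, 64.75 W is illustrative only.
KM_PER_DEG = 111.0  # ~111 km per degree; longitude convergence ignored
preds = [(32.3078 + random.gauss(0, 0.5), -64.7505 + random.gauss(0, 0.5))
         for _ in range(100)]
weights = [1.0 / len(preds)] * len(preds)

# "Centre of gravity" of the mini-predictions - analogous to the single
# central prediction of a Type A model.
lat_c = sum(w * p[0] for w, p in zip(weights, preds))
lon_c = sum(w * p[1] for w, p in zip(weights, preds))

# Radius of the circle about the centroid capturing 80% of the
# probability mass - the "circle of uncertainty".
dists = sorted(
    (math.hypot((p[0] - lat_c) * KM_PER_DEG,
                (p[1] - lon_c) * KM_PER_DEG), w)
    for p, w in zip(preds, weights))
cum, radius80 = 0.0, 0.0
for d, w in dists:
    cum += w
    radius80 = d
    if cum >= 0.8:
        break

print(f"Centroid: {lat_c:.4f} N, {abs(lon_c):.4f} W; "
      f"80% circle radius ~ {radius80:.0f} km")
```

Varying the capture fraction (two thirds rather than 80%, say) simply stops the accumulation loop earlier, giving a smaller circle.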
Here comes the Science
That was then of course; nowadays the statistical element of hurricane models is less significant. With increased processing power and the ability to store and manipulate vast amounts of data, most hurricane models instead rely upon scientific models; let’s call this Type C.
As the air is a fluid, its behaviour falls into the area of study known as fluid dynamics. If we treat the atmosphere as being viscous, then the appropriate equation governing fluid dynamics is the Navier-Stokes equation, which is itself derived from the Cauchy Momentum equation:
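For reference, one common form of the incompressible Navier-Stokes momentum equation is the following, where $\mathbf{u}$ is the fluid velocity, $p$ the pressure, $\rho$ the density, $\mu$ the dynamic viscosity and $\mathbf{f}$ the body forces acting on the fluid:

```latex
\rho \left( \frac{\partial \mathbf{u}}{\partial t}
  + \mathbf{u} \cdot \nabla \mathbf{u} \right)
  = -\nabla p + \mu \nabla^{2} \mathbf{u} + \mathbf{f}
```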
If viscosity is taken as zero (as a simplification), instead the Euler equations apply:
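For reference, dropping the viscous term gives the incompressible Euler momentum equation, where $\mathbf{u}$ is the fluid velocity, $p$ the pressure, $\rho$ the density and $\mathbf{g}$ the body force per unit mass:

```latex
\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u}
  = -\frac{\nabla p}{\rho} + \mathbf{g}
```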
The reader may be glad to know that I don’t propose to talk about any of the above equations any further.
To get back to the model, in general the atmosphere will be split into a three-dimensional grid (the atmosphere has height as well). The current temperature, pressure, moisture content etc. are fed in (or sometimes interpolated) at each point and equations such as the ones above are used to determine the evolution of fluid flow at a given grid element. Of course – as is typical in such situations – approximations of the equations are used and there is some flexibility over which approximations to employ. Also, there may be uncertainty about the input parameters, so statistics does not disappear entirely. Leaving this to one side, how the atmospheric conditions change over time at each grid point rolls up to provide a predictive basis for what a hurricane will do next.
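The following is a drastically simplified sketch of the grid idea: a single scalar field (say temperature) carried along one dimension by a constant wind, stepped forward with an upwind finite-difference scheme. Every number here is made up for illustration; real numerical weather prediction models solve the full three-dimensional equations with enormously more sophistication.

```python
N = 50              # number of grid cells
dx = 10_000.0       # grid spacing in metres (10 km)
dt = 60.0           # time step in seconds
wind = 20.0         # constant wind speed in m/s, blowing in +x

# Initial condition: a warm "bump" in the middle of the domain.
temp = [15.0 + (10.0 if 20 <= i < 25 else 0.0) for i in range(N)]

def step(field):
    """One upwind advection step for du/dt = -wind * du/dx
    (periodic boundary for simplicity)."""
    c = wind * dt / dx  # Courant number; must be <= 1 for stability
    return [field[i] - c * (field[i] - field[i - 1])
            for i in range(len(field))]

for _ in range(60):  # simulate one hour
    temp = step(temp)

# The warm bump has drifted downwind (and diffused a little - a known
# artefact of the first-order upwind scheme).
peak = max(range(N), key=lambda i: temp[i])
print(f"Warm bump now centred near cell {peak}")
```

Repeating such an update for temperature, pressure, moisture and wind at every grid element, hour after hour, is in essence what rolls up into the hurricane track prediction.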
Although the methods are very different, the output of these scientific models will be pretty similar, qualitatively, to the Type A statistical model above. In particular, uncertainty will be delineated based on how well the model performed on previous occasions. For example, what was the average difference between prediction and fact after 6 hours, 12 hours and so on. Again, the uncertainty will have similar characteristics to that of Type A above.
A Section about Conics
In all of the cases discussed above, we have a central prediction (which may be an average of several predictions as per Type B) and a circular distribution around this indicating uncertainty. Let’s consider how these predictions might change as we move into the future.
If today is Monday, then there will be some uncertainty about what the hurricane does on Tuesday. For Wednesday, the uncertainty will be greater than for Tuesday (the “circle of uncertainty” will have grown) and so on. With the Type A and Type C outputs, the error measure will increase with time. With the Type B output, if the model spits out 100 possible locations for the hurricane on a specific day (complete with the likelihood of each of these occurring), then these will be fairly close together on Tuesday and further apart on Wednesday. In all cases, uncertainty about the location of the hurricane becomes smeared out over time, resulting in a larger area where it is likely to be located and a bigger “circle of uncertainty”.
This is where the circles of uncertainty combine to become a cone of uncertainty. For the same example, on each day, the meteorologists will plot the central prediction for the hurricane’s location and then draw a circle centred on this which captures the uncertainty of the prediction. For the same reason as stated above, the size of the circle will (in general) increase with time; Wednesday’s circle will be bigger than Tuesday’s. Also each day’s central prediction will be in a different place from the previous day’s as the hurricane moves along. Joining up all of these circles gives us the cone of uncertainty.
If the central predictions imply that a hurricane is moving with constant speed and direction, then its cone of uncertainty would look something like this:
In this diagram, broadly speaking, on each day, there is a 67% probability that the centre of the hurricane will be found within the relevant circle that makes up the cone of uncertainty. We will explore the implications of the underlined phrase in the next section.
Of course hurricanes don’t move in a single direction at an unvarying pace (see the actual NWS exhibit above as opposed to my idealised rendition), so part of the purpose of the cone of uncertainty diagram is to elucidate this.
The Central Issue
So hopefully the intent of the NWS chart at the beginning of this article is now clearer. What is the problem with it? Well, I’ll go back to the words I highlighted a couple of paragraphs back:
There is a 67% probability that the centre of the hurricane will be found within the relevant circle that makes up the cone of uncertainty
So the cone helps us with where the centre of the hurricane may be. A reasonable question is, what about the rest of the hurricane?
For ease of reference, here is the NWS exhibit again:
Let’s first of all pause to work out how big some of the NWS “circles of uncertainty” are. To do this we can note that the grid lines (though not labelled) are clearly at 5° intervals. The distance between two lines of latitude (ones drawn parallel to the equator) that are 1° apart from each other is a relatively consistent number; approximately 111 km. This means that the lines of latitude on the page are around 555 km apart. Using this as a reference, the “circle of uncertainty” labelled “8 PM Sat” has a diameter of about 420 km (260 miles).
Let’s now consider how big Hurricane Irma was.
Aside: I’d be remiss if I didn’t point out here that RMS have selected what seems to me to be a pretty good colour palette in the chart above.
Well, there is no defined sharp edge of a hurricane; rather the speed of winds tails off, as may be seen in the above diagram. In order to get some sense of the size of Irma, I’ll use the dashed line in the chart that indicates where wind speeds drop below those classified as a tropical storm (65 kmph or 40 mph). The storm is not uniform, but measures around 580 km (360 miles) wide.
There are two issues here, which are illustrated in the above diagram.
Irma was actually bigger than at least some of the “circles of uncertainty”. A cursory glance at the NWS exhibit would probably give the sense that the cone of uncertainty represents the extent of the storm; it doesn’t. In our example, Irma extends 80 km beyond the “circle of uncertainty” we measured above. If you thought you were safe because you were 50 km from the edge of the cone, then this was probably an erroneous conclusion.
Even more perniciously, because each “circle of uncertainty” delineates an area within which the centre of the hurricane could be situated, this includes cases where the centre of the hurricane sits on the circumference of the “circle of uncertainty”. This, together with the size of the storm, means that someone 290 km from the edge of the “circle of uncertainty” could suffer 65 kmph (40 mph) winds. Again, based on the diagram, if you felt that you were guaranteed to be OK if you were 250 km away from the edge of the cone, you could get a nasty surprise.
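Both of the distances quoted above follow from simple arithmetic on the measurements made earlier in the article:

```python
# Checking the two distances quoted above.
circle_diameter_km = 420.0   # the "8 PM Sat" circle of uncertainty
storm_width_km = 580.0       # Irma, out to tropical-storm-force winds

storm_radius = storm_width_km / 2         # 290 km

# 1. The storm is wider than the circle, overhanging it on each side by:
overhang = (storm_width_km - circle_diameter_km) / 2
print(f"Storm extends {overhang:.0f} km beyond the circle")  # 80 km

# 2. Worst case: the centre sits on the circle's circumference, so
# tropical-storm-force winds can reach one storm radius beyond the edge:
print(f"Winds possible up to {storm_radius:.0f} km outside the circle")
```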
These are not academic distinctions; the real danger that hurricane cones were being misinterpreted led the NWS to start labelling their charts with “This cone DOES NOT REPRESENT THE SIZE OF THE STORM!!”.
Even Florida senator Marco Rubio got in on the act, tweeting:
When you need a politician to help you avoid misinterpreting a data visualisation, you know that there is something amiss.
The last thing I want to do is to appear critical of the men and women of the US National Weather Service. I’m sure that they do a fine job. If anything, the issues we have been dissecting here demonstrate that even highly expert people with a strong motivation to communicate clearly can still find it tough to select the right visual metaphor for a data visualisation; particularly when there is a diverse audience consuming the results. It also doesn’t help that there are many degrees of uncertainty here: where might the centre of the storm be? how big might the storm be? how powerful might the storm be? in which direction might the storm move? Layering all of these onto a single exhibit while still rendering it both legible and of some utility to the general public is not a trivial exercise.
The cone of uncertainty is a precise chart, so long as the reader understands what it is showing and what it is not. Perhaps the issue lies more in the eye of the beholder. However, having to annotate your charts to explain what they are not is never a good look. The NWS are clearly aware of the issues and I look forward to viewing whatever creative solution they come up with later this hurricane season.
I would like to thank Dr Steve Smith, Head of Catastrophic Risk at Fractal Industries, for reviewing this piece and putting me right on some elements of modern hurricane prediction. I would also like to thank my friend and former colleague, Dr Raveem Ismail, also of Fractal Industries, for introducing me to Steve. Despite the input of these two experts, responsibility for any errors or omissions remains mine alone.
I don’t mean to imply by this that the estimation process is unscientific of course. Indeed, as we will see later, hurricane prediction is becoming more scientific all the time.
If both methods were employed in parallel, it would not be too surprising if their central predictions were close to each other.
A gas or a liquid.
A shape traced out by a particle travelling with constant speed, with a circle of increasing radius inscribed around it, would be a cone.
The distance between lines of longitude varies between 111 km at the equator and 0 km at either pole. This is because lines of longitude are great circles (or meridians) that meet at the poles. Lines of latitude are parallel circles (parallels) progressing up and down the globe from the equator.
At a point in time of course. Hurricanes change in size over time as well as in their direction/speed of travel and energy.
I am rounding here. The actual threshold values are 63 kmph and 39 mph.
Using the definition of size that we have adopted above.
Their use of capitals, bold and multiple exclamation marks.
I have mentioned the LMB in these pages before. In my earlier article, which focussed on Data Visualisation in science, I also provided a potted history of X-ray crystallography, which included the following paragraph:
Today, X-ray crystallography is one of many tools available to the structural biologist with other approaches including Nuclear Magnetic Resonance Spectroscopy, Electron Microscopy and a range of biophysical techniques.
I have highlighted the term Electron Microscopy above and it was for his immense contributions to the field of Cryo-electron Microscopy (Cryo-EM) that Richard was awarded his Nobel Prize; more on this shortly.
First of all some disclosure. The LMB is also my wife’s alma mater; she received her PhD for work she did there between 2010 and 2014. Richard was one of two people who examined her as she defended her thesis. As Venki initially interviewed her for the role, the bookends of my wife’s time at the LMB were formed by two Nobel laureates; a notable symmetry.
The press release about Richard’s Nobel Prize includes the following text:
The Nobel Prize in Chemistry 2017 is awarded to Jacques Dubochet, Joachim Frank and Richard Henderson for the development of cryo-electron microscopy, which both simplifies and improves the imaging of biomolecules. This method has moved biochemistry into a new era.
Electron microscopes were long believed to only be suitable for imaging dead matter, because the powerful electron beam destroys biological material. But in 1990, Richard Henderson succeeded in using an electron microscope to generate a three-dimensional image of a protein at atomic resolution. This breakthrough proved the technology’s potential.
Electron microscopes work by passing a beam of electrons through a thin film of the substance being studied. The electrons interact with the constituents of the sample and go on to form an image which captures information about these interactions (nowadays mostly on an electronic detector of some sort). Because the wavelength of electrons is so much shorter than that of light, much finer detail can be obtained using electron microscopy than with light microscopy. Indeed electron microscopes can be used to “see” structures at the atomic scale. Of course it is not quite as simple as printing out the image snapped by your smartphone. The data obtained from electron microscopy needs to be interpreted by software; again we will come back to this point later.
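The wavelength difference is easy to quantify. The sketch below computes the de Broglie wavelength (with the relativistic correction) of an electron accelerated through 300 kV, a typical voltage for a modern Cryo-EM instrument, and compares it with visible light:

```python
import math

# De Broglie wavelength of an electron accelerated through voltage V,
# including the relativistic correction.
h = 6.626e-34    # Planck constant, J s
m0 = 9.109e-31   # electron rest mass, kg
e = 1.602e-19    # elementary charge, C
c = 2.998e8      # speed of light, m/s

def electron_wavelength(volts):
    ev = e * volts  # kinetic energy gained, in joules
    return h / math.sqrt(2 * m0 * ev * (1 + ev / (2 * m0 * c ** 2)))

lam = electron_wavelength(300e3)
print(f"300 kV electron wavelength: {lam * 1e12:.2f} pm")  # ~1.97 pm

# Visible light is around 500 nm - some quarter of a million times
# longer - which is why electron microscopes resolve atomic detail.
print(f"Ratio to 500 nm light: {500e-9 / lam:.0f}x")
```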
Cryo-EM refers to how the sample being examined is treated prior to (and during) microscopy. Here a water-suspended sample of the substance is frozen (to put it mildly) in liquid ethane to temperatures around -183 °C and maintained at that temperature during the scanning procedure. The idea here is to protect the sample from the damaging effects of the cathode rays it is subjected to during microscopy.
A Matter of Interpretation
On occasion, I write articles which are entirely scientific or mathematical in nature, but more frequently I bring observations from these fields back into my own domain, that of data, information and insight. This piece will follow the more typical course. To do this, I will rely upon a perspective that Richard Henderson wrote for the Proceedings of the National Academy of Sciences back in 2013.
Here we come back to the interpretation of Cryo-EM data in order to form an image. In the article, Richard refers to:
[Some researchers] who simply record images, follow an established (or sometimes a novel or inventive) protocol for 3D map calculation, and then boldly interpret and publish their map without any further checks or attempts to validate the result. Ten years ago, when the field was in its infancy, referees would simply have to accept the research results reported in manuscripts at face value. The researchers had recorded images, carried out iterative computer processing, and obtained a map that converged, but had no way of knowing whether it had converged to the true structure or some complete artifact. There were no validation tests, only an instinct about whether a particular map described in the publication looked right or wrong.
The title of Richard’s piece includes the phrase “Einstein from noise”. This refers to an article published in the Journal of Structural Biology in 2009. Here the authors provided pure white noise (i.e. a random set of black and white points) as the input to an algorithm which is intended to produce EM maps and – after thousands of iterations – ended up with the following iconic image:
Richard lists occurrences of meaning being erroneously drawn from EM data from his own experience of reviewing draft journal articles and cautions scientists to hold themselves to the highest standards in this area, laying out meticulous guidelines for how the creation of EM images should be approached, checked and rechecked.
The obvious parallel here is to areas of Data Science such as Machine Learning. Here again algorithms are applied iteratively to data sets with the objective of discerning meaning. Here too conscious or unconscious bias on behalf of the people involved can lead to the business equivalent of Einstein ex machina. It is instructive to see the level of rigour which a Nobel Laureate views as appropriate in an area such as the algorithmic processing of data. Constantly questioning your results and validating that what emerges makes sense and is defensible is just one part of what can lead to gaining a Nobel Prize. The opposite approach will invariably lead to disappointment in either academia or in business.
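The “Einstein from noise” effect can be demonstrated in miniature. The toy below (a one-dimensional analogue I have invented for illustration, not the actual EM reconstruction algorithm) aligns traces of pure random noise against a target template and averages them; the template duly emerges from data that contained no signal at all:

```python
import random

random.seed(0)

N = 64        # length of each 1-D "image"
M = 400       # number of pure-noise inputs

# A target "structure" we will (wrongly) use to align pure noise.
template = [1.0 if 24 <= i < 40 else -1.0 for i in range(N)]

def best_shift(trace, ref):
    """Circular shift of `trace` maximising correlation with `ref`."""
    def corr(shift):
        return sum(ref[i] * trace[(i + shift) % N] for i in range(N))
    return max(range(N), key=corr)

aligned_sum = [0.0] * N
for _ in range(M):
    noise = [random.gauss(0, 1) for _ in range(N)]
    s = best_shift(noise, template)
    for i in range(N):
        aligned_sum[i] += noise[(i + s) % N]

average = [x / M for x in aligned_sum]

# The average of aligned *pure noise* now correlates strongly with the
# template: apparent "structure" manufactured by the alignment step.
similarity = sum(a * t for a, t in zip(average, template))
print(f"Correlation of noise average with template: {similarity:.2f}")
```

Averaging the same traces without the alignment step would give a correlation near zero; it is the biased selection of shifts, not the data, that creates the “structure”. The same trap awaits any iterative fitting procedure in Machine Learning.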
Having introduced a strong cautionary note, I’d like to end this article with a much more positive tone by extending my warm congratulations to Richard both for his well-deserved achievement, but more importantly for his unwavering commitment to rolling back the bounds of human knowledge.
If you are interested in learning more about Cryo-Electron Microscopy, the following LMB video, which features Richard Henderson and colleagues, may be of interest:
Her thesis was passed without correction – an uncommon occurrence – and her contribution to the field was described as significant in the formal documentation.
More precisely this description applies to Transmission Electron Microscopes, which are the type of kit used in Cryo-EM.
The wave-particle duality that readers may be familiar with when speaking about light waves / photons also applies to all sub-atomic particles. Electrons have both a wave and a particle nature and so, in particular, have wavelengths.
This is still the case even if ultraviolet or more energetic light is used instead of visible light.
Cathode rays are of course just beams of electrons.
Since its launch in August of this year, the peterjamesthomas.com Data and Analytics Dictionary has received a welcome amount of attention with various people on different social media platforms praising its usefulness, particularly as an introduction to the area. A number of people have made helpful suggestions for new entries or improvements to existing ones. I have also been rounding out the content with some more terms relating to each of Data Governance, Big Data and Data Warehousing. As a result, The Dictionary now has over 80 main entries (not including ones that simply refer the reader to another entry, such as Linear Regression, which redirects to Model).
“It is a truth universally acknowledged, that an organisation in possession of some data, must be in want of a Chief Data Officer”
— Growth and Governance, by Jane Austen (1813) 
I wrote about a theoretical job description for a Chief Data Officer back in November 2015. While I have been on “paternity leave” following the birth of our second daughter, a couple of genuine CDO job specs landed in my inbox. While unable to respond for the aforementioned reasons, I did leaf through the documents. Something immediately struck me: they were essentially wish-lists covering a number of data-related fields, rather than a description of what a CDO might actually do. Clearly I’m not going to cite the actual text here, but the following is representative of what appeared in both requirement lists:
Solid commercial understanding and 5 years spent in [insert industry sector here]
The above list may have descended into farce towards the end, but I would argue that the problems started to occur much earlier. It is not a description of what is required to be a successful CDO; it’s a description of a Swiss Army Knife. There is also the minor practical point that, out of a World population of around 7.5 billion, there may well be no one who ticks all the boxes.
Let’s make the fallacy of this type of job description clearer by considering what a similar approach would look like if applied to what is generally the most senior role in an organisation, the CEO. Whoever drafted the above list of requirements would probably characterise a CEO as follows:
The best salesperson in the organisation
The best accountant in the organisation
The best M&A person in the organisation
The best customer service operative in the organisation
The best facilities manager in the organisation
The best janitor in the organisation
The best purchasing clerk in the organisation
The best lawyer in the organisation
The best programmer in the organisation
The best marketer in the organisation
The best product developer in the organisation
The best HR person in the organisation, etc., etc., …
Of course a CEO needs to be none of the above, they need to be a superlative leader who is expert at running an organisation (even then, they may focus on plotting the way forward and leave the day to day running to others). For the avoidance of doubt, I am not saying that a CEO requires no domain knowledge and has no expertise, they would need both, however they don’t have to know every aspect of company operations better than the people who do it.
The same argument applies to CDOs. Domain knowledge probably should span most of what is in the job description (save for maybe the three items with footnotes), but knowledge is different to expertise. As CDOs don’t grow on trees, they will most likely be experts in one or a few of the areas cited, but not all of them. Successful CDOs will know enough to be able to talk to people in the areas where they are not experts. They will have to be competent at hiring experts in every area of a CDO’s purview. But they do not have to be able to do the job of every data-centric staff member better than the person could do themselves. Even if you could identify such a CDO, they would probably lose their best staff very quickly due to micromanagement.
A CDO has to be a conductor of both the data function orchestra and of the use of data in the wider organisation. This is a talent in itself. An internationally renowned conductor may have previously been a violinist, but it is unlikely they were also a flautist and a percussionist. They do however need to be able to tell whether or not the second trumpeter is any good; this is not the same as being able to play the trumpet yourself of course. The conductor’s key skill is in managing the efforts of a large group of people to create a cohesive – and harmonious – whole.
The CDO is of course still a relatively new role in mainstream organisations. Perhaps these job descriptions will become more realistic as the role becomes more familiar. It is to be hoped so, else many a search for a new CDO will end in disappointment.
Having twisted her text to my own purposes at the beginning of this article, I will leave the last words to Jane Austen:
“A scheme of which every part promises delight, can never be successful; and general disappointment is only warded off by the defence of some little peculiar vexation.”
Most readers will immediately spot the obvious mistake here. Of course all three of these requirements should be mandatory.
To take just one example, gaining a PhD in a numerical science, a track record of highly-cited papers and also obtaining an MBA would take most people at least a few weeks of effort. Is it likely that such a person would next focus on a PRINCE2 or TOGAF qualification?
I find myself frequently being asked questions around terminology in Data and Analytics and so thought that I would try to define some of the more commonly used phrases and words. My first attempt to do this can be viewed in a new page added to this site (this also appears in the site menu):
I plan to keep this up-to-date as the field continues to evolve.
I hope that my efforts to explain some concepts in my main area of specialism are both of interest and utility to readers. Any suggestions for new entries or comments on existing ones are more than welcome.
I have some form when it comes to getting irritated by quasi-mathematical social media memes (see Facebook squares “puzzle” for example). Facebook, which I find myself using less and less frequently these days, has always been plagued by clickbait articles. Some of these can be rather unsavoury. One that does not have this particular issue, but which more than makes up for this in terms of general annoyance, is the many variants of:
Only a math[s] genius can solve [insert some dumb problem here] – can u?
Life is too short to complain about Facebook content, but this particular virus now seems to have infected LinkedIn (aka MicrosoftedIn) as well. Indeed as LinkedIn’s current “strategy” seems to be to ape what Facebook was doing a few years ago, perhaps this is not too surprising. Nevertheless, back in the day, LinkedIn used to be a reasonably serious site dedicated to networking and exchanging points of view with fellow professionals.
Those days appear to be fading fast, something I find sad. It seems that a number of people agree with me as – at the time of writing – over 9,000 people have viewed a LinkedIn article I briefly penned bemoaning this development. While some of the focus inevitably turned to general scorn being heaped on the new LinkedIn user experience (UX), it seemed that most people are of the same opinion as I am.
However, I suspect that there is little to be done and the folks at LinkedIn probably have their hands full trying to figure out how to address their UX catastrophe. Given this, I thought that if you can’t beat them, join them. So above appears my very own Mathematical meme, maybe it will catch on.
It should be noted that in this case “Less than 1% can do it!!!” is true, in the strictest sense. Unlike the original meme, so is the first piece of text!
After 100s of views on my blog, 1,000s of views on LinkedIn and 10,000s of views on Twitter, it took Neil Raden (@NeilRaden) to point out that in the original image I had the sum running from n=0 as opposed to n=1. The former makes no sense whatsoever. I guess his company is called Hired Brains for a reason! This was meant to be a humorous post, but at least part of the joke is now on me.