Hurricanes and Data Visualisation: Part II – Map Reading

Confusing map reading

This is the second of two articles whose genesis was the nexus of hurricanes and data visualisation. The first article was Part I – Rainbow’s Gravity [1].
 
 
Introduction

In the first article in this mini-series we looked at alternative approaches to colour and how these could inform or mislead in data visualisations relating to weather events. In particular we discussed the drawbacks of using a rainbow palette in such visualisations and some alternatives. Here we move into much more serious territory: how best to inform the public about what a specific hurricane will do next and the risks that it poses. It would not be an exaggeration to say that this can sometimes be a matter of life and death. As with rainbow-coloured maps of weather events, some aspects of how the estimated future course of a hurricane is communicated and understood leave much to be desired.

Hurricane Irma cone
Source: US National Weather Service (NWS)

The above diagram is called the cone of uncertainty of a hurricane. “Cone of uncertainty” sounds like an odd term. What does it mean? Let’s start by offering a historical perspective on hurricane modelling.
 
 
Paleomodelling

Well, like any other type of weather prediction, determining the future direction and speed of a hurricane is not an exact science [2]. In the earlier days of hurricane modelling, meteorologists employed statistical models; these were built from detailed information about previous hurricanes, took as input many data points about the history of a current hurricane’s evolution, and provided as output a prediction of what it could do in coming days.

Black Box

There were a variety of statistical models, but their output fell into two broad types when used for hurricane prediction.

Type A and Type B
A) A prediction of the next location of the hurricane was made. This included an average error (in km) based on the inaccuracy of previous hurricane predictions. A percentage of this, say 80%, gave a “circle of uncertainty”, x km around the prediction.

B) Many predictions were generated and plotted; each had a location and a probability associated with it. The centroid of these (adjusted by the probability associated with each location) was calculated and used as the central prediction (cf. A). A circle was drawn (y km from the centroid) so that a percentage of the predictions fell within this. Above, there are 100 estimates and the chosen percentage is 80%; so 80 points lie within the red “circle of uncertainty”.

Type A

First, the model could have generated a single prediction (the centre of the hurricane will be at 32.3078° N, 64.7505° W tomorrow) and supplemented this with an error measure. The error measure would have been based on historical hurricane data and related to how far out prior predictions had been on average; this measure would have been in kilometres. It would have been typical to employ some fraction of the error measure to define a “circle of uncertainty” around the central prediction; 80% in the example directly above (compared to two thirds in the NWS exhibit at the start of the article).

Type B

Second, the model could have generated a large number of mini-predictions, each of which would have had a probability associated with it (e.g. the first two estimates of location could be that the centre of the hurricane is at 32.3078° N, 64.7505° W with a 5% chance, or a mile away at 32.3223° N, 64.7505° W with a 2% chance and so on). In general if you had picked the “centre of gravity” of the second type of output, it would have been analogous to the single prediction of the first type of output [3]. The spread of point predictions in the second method would have also been analogous to the error measure of the first. Drawing a circle around the centroid would have captured a percentage of the mini-predictions, once more 80% in the example immediately above and two thirds in the NWS chart, generating another “circle of uncertainty”.
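To make the Type B mechanics concrete, here is a minimal Python sketch (all numbers invented for illustration): given a set of mini-predictions with associated probabilities, it computes the probability-weighted centroid (the analogue of the Type A central prediction) and the radius that captures 80% of the mini-predictions, i.e. a “circle of uncertainty”.

```python
import numpy as np

# Hypothetical Type B output: 100 mini-predictions of tomorrow's hurricane
# centre (latitude, longitude in degrees), each with an associated probability.
rng = np.random.default_rng(42)
points = rng.normal(loc=[32.31, -64.75], scale=0.5, size=(100, 2))
probs = rng.random(100)
probs /= probs.sum()                     # normalise so the probabilities sum to 1

# Probability-weighted centroid: the analogue of the single Type A prediction.
centroid = (points * probs[:, None]).sum(axis=0)

# Radius (kept in degrees here for simplicity) such that 80% of the
# mini-predictions fall inside it: the "circle of uncertainty".
distances = np.linalg.norm(points - centroid, axis=1)
radius_80 = np.quantile(distances, 0.80)

print(f"Central prediction: {centroid}, 80% radius: {radius_80:.2f} degrees")
```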
 
 
Here comes the Science

Fluid Dynamics

That was then, of course; nowadays the statistical element of hurricane models is less significant. With increased processing power and the ability to store and manipulate vast amounts of data, most hurricane models instead rely upon scientific models; let’s call this Type C.

Type C

As air is a fluid [4], its behaviour falls into the area of study known as fluid dynamics. If we treat the atmosphere as viscous, then the appropriate governing equation is the Navier-Stokes equation, which is itself derived from the Cauchy momentum equation:

\displaystyle\frac{\partial}{\partial t}(\rho \boldsymbol{u}) + \nabla \cdot (\rho \boldsymbol{u}\otimes \boldsymbol{u})=-\nabla\cdot p\boldsymbol{I}+\nabla\cdot\boldsymbol{\tau} + \rho\boldsymbol{g}

If viscosity is taken as zero (as a simplification), instead the Euler equations apply:

\displaystyle\left\{\begin{array}{lr}\displaystyle\frac{\partial\boldsymbol{u}}{\partial t} + \nabla \cdot (\boldsymbol{u}\otimes \boldsymbol{u} + w\boldsymbol{I}) = \boldsymbol{g} \\ \\ \nabla \cdot \boldsymbol{u}= 0\end{array}\right.

The reader may be glad to know that I don’t propose to talk about any of the above equations any further.

3D Grid

To get back to the model, in general the atmosphere will be split into a three dimensional grid (the atmosphere has height as well). The current temperature, pressure, moisture content etc. are fed in (or sometimes interpolated) at each point and equations such as the ones above are used to determine the evolution of fluid flow at a given grid element. Of course – as is typical in such situations – approximations of the equations are used and there is some flexibility over which approximations to employ. Also, there may be uncertainty about the input parameters, so statistics does not disappear entirely. Leaving this to one side, how the atmospheric conditions change over time at each grid point rolls up to provide a predictive basis for what a hurricane will do next.
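As an aside, the toy sketch below (my own illustration, not any actual numerical weather prediction code) shows the general “update every grid cell, then step forward in time” structure of such models, using nothing more than a single scalar field advected across a 2D grid by a fixed wind; real models work in 3D with far richer physics and inputs.

```python
import numpy as np

# Toy grid model: one scalar field (think of it as a pressure anomaly) blown
# across a 2D grid by a constant wind. All values are made up for illustration.
nx, ny, steps = 100, 100, 50
dx = dt = 1.0                          # grid spacing (both directions) and time step
u, v = 0.4, 0.2                        # assumed constant wind components (both > 0)

field = np.zeros((nx, ny))
field[45:55, 45:55] = 1.0              # an initial disturbance in the middle

for _ in range(steps):
    # First-order upwind finite differences approximate the spatial gradients.
    dfdx = (field - np.roll(field, 1, axis=0)) / dx
    dfdy = (field - np.roll(field, 1, axis=1)) / dx
    # Advance every grid cell one time step: simple advection by the wind.
    field = field - dt * (u * dfdx + v * dfdy)

print("Disturbance centre has drifted to grid cell",
      np.unravel_index(field.argmax(), field.shape))
```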

Although the methods are very different, the output of these scientific models will be pretty similar, qualitatively, to the Type A statistical model above. In particular, uncertainty will be delineated based on how well the model performed on previous occasions. For example, what was the average difference between prediction and fact after 6 hours, 12 hours and so on. Again, the uncertainty will have similar characteristics to that of Type A above.
 
 
A Section about Conics

An advanced statistical approach

In all of the cases discussed above, we have a central prediction (which may be an average of several predictions as per Type B) and a circular distribution around this indicating uncertainty. Let’s consider how these predictions might change as we move into the future.

If today is Monday, then there will be some uncertainty about what the hurricane does on Tuesday. For Wednesday, the uncertainty will be greater than for Tuesday (the “circle of uncertainty” will have grown) and so on. With the Type A and Type C outputs, the error measure will increase with time. With the Type B output, if the model spits out 100 possible locations for the hurricane on a specific day (complete with the likelihood of each of these occurring), then these will be fairly close together on Tuesday and further apart on Wednesday. In all cases, uncertainty about the location of the hurricane becomes smeared out over time, resulting in a larger area where it is likely to be located and a bigger “circle of uncertainty”.

This is where the circles of uncertainty combine to become a cone of uncertainty. Continuing the same example, on each day the meteorologists will plot the central prediction for the hurricane’s location and then draw a circle centred on this which captures the uncertainty of the prediction. For the same reason as stated above, the size of the circle will (in general) increase with time; Wednesday’s circle will be bigger than Tuesday’s. Also, each day’s central prediction will be in a different place from the previous day’s as the hurricane moves along. Joining up all of these circles gives us the cone of uncertainty [5].

If the central predictions imply that a hurricane is moving with constant speed and direction, then its cone of uncertainty would look something like this:

Cone of uncertainty

In this diagram, broadly speaking, on each day, there is a 67% probability that the centre of the hurricane will be found within the relevant circle that makes up the cone of uncertainty. We will explore the implications of the phrase “the centre of the hurricane” in the next section.

Of course hurricanes don’t move in a single direction at an unvarying pace (see the actual NWS exhibit above as opposed to my idealised rendition), so part of the purpose of the cone of uncertainty diagram is to elucidate this.
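For readers who like to tinker, here is a minimal matplotlib sketch of the idealised version (all distances and growth rates invented for illustration): a straight track with a “circle of uncertainty” that grows with forecast lead time. Joining up the circles traces out the cone.

```python
import numpy as np
import matplotlib.pyplot as plt

# Idealised cone of uncertainty: constant speed and direction, with the
# uncertainty radius growing each day. All numbers are invented.
days = np.arange(1, 6)                  # forecast days 1 to 5
track_x = 150 * days                    # km travelled along a straight track
track_y = np.zeros_like(track_x)
radii = 60 + 70 * days                  # uncertainty radius grows with lead time

fig, ax = plt.subplots(figsize=(8, 3))
for x, y, r in zip(track_x, track_y, radii):
    ax.add_patch(plt.Circle((x, y), r, fill=False, color="red"))
ax.plot(track_x, track_y, "k--o", label="central prediction")
ax.set_xlim(0, 1200)
ax.set_ylim(-500, 500)
ax.set_aspect("equal")
ax.set_xlabel("km east of current position")
ax.legend()
ax.set_title("Idealised cone of uncertainty")
plt.show()
```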
 
 
The Central Issue

So hopefully the intent of the NWS chart at the beginning of this article is now clearer. What is the problem with it? Well I’ll go back to the words I highlighted a couple of paragraphs back:

There is a 67% probability that the centre of the hurricane will be found within the relevant circle that makes up the cone of uncertainty

So the cone helps us with where the centre of the hurricane may be. A reasonable question is, what about the rest of the hurricane?

For ease of reference, here is the NWS exhibit again:

Hurricane Irma cone

Let’s first of all pause to work out how big some of the NWS “circles of uncertainty” are. To do this we can note that the grid lines (though not labelled) are clearly at 5° intervals. The distance between two lines of latitude (ones drawn parallel to the equator) that are 1° apart from each other is a relatively consistent number; approximately 111 km [6]. This means that the lines of latitude on the page are around 555 km apart. Using this as a reference, the “circle of uncertainty” labelled “8 PM Sat” has a diameter of about 420 km (260 miles).
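Here is the back-of-the-envelope arithmetic as a snippet; the fraction of a grid interval spanned by the circle is my own reading of the chart by eye.

```python
# Back-of-the-envelope scale check using the figures quoted above.
km_per_degree_lat = 111            # approximate spacing of 1 degree of latitude
grid_interval_deg = 5              # the unlabelled grid lines are 5 degrees apart
grid_interval_km = km_per_degree_lat * grid_interval_deg    # about 555 km

# The "8 PM Sat" circle spans roughly three quarters of one grid interval
# (my estimate, read off the NWS exhibit by eye).
circle_fraction_of_interval = 0.76
circle_diameter_km = grid_interval_km * circle_fraction_of_interval
print(round(circle_diameter_km), "km, i.e. roughly 260 miles")   # ~ 420 km
```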

Let’s now consider how big Hurricane Irma was [7].

Size of Irma
Source: RMS

Aside: I’d be remiss if I didn’t point out here that RMS have selected what seems to me to be a pretty good colour palette in the chart above.

Well, there is no defined sharp edge to a hurricane; rather, the speed of its winds tails off, as may be seen in the above diagram. In order to get some sense of the size of Irma, I’ll use the dashed line in the chart that indicates where wind speeds drop below those classified as tropical-storm force (65 kmph or 40 mph [8]). This area is not uniform, but measures around 580 km (360 miles) wide.

Misplaced hurricane
A) The size of the hurricane is greater than the size of the “circle of uncertainty”; the former extends some 80 km beyond the circumference of the latter in all directions.

B) The “circle of uncertainty” captures the area within which the hurricane’s centre is likely to fall. But this includes cases where the centre of the hurricane is on the circumference of the “circle of uncertainty”. This means that the furthermost edge of the hurricane could be up to 290 km outside of the “circle of uncertainty”.

There are two issues here, which are illustrated in the above diagram.

Issue A

Irma was actually bigger [9] than at least some of the “circles of uncertainty”. A cursory glance at the NWS exhibit would probably give the sense that the cone of uncertainty represents the extent of the storm; it doesn’t. In our example, Irma extends 80 km beyond the “circle of uncertainty” we measured above. If you thought you were safe because you were 50 km from the edge of the cone, then this was probably an erroneous conclusion.

Issue B

Even more pernicious, because each “circle of uncertainty” provides an area within which the centre of the hurricane could be situated, this includes cases where the centre of the hurricane sits on the circumference of the “circle of uncertainty”. This, together with the size of the storm, means that someone 290 km from the edge of the “circle of uncertainty” could suffer 65 kmph (40 mph) winds. Again, based on the diagram, if you felt that you were guaranteed to be OK if you were 250 km away from the edge of the cone, you could get a nasty surprise.
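The two figures quoted above can be reconciled with some very simple arithmetic, sketched below using the rounded distances from earlier in the article.

```python
# All distances in km, rounded, and taken from the measurements above.
circle_diameter = 420     # the "8 PM Sat" circle of uncertainty
storm_width = 580         # Irma, out to tropical-storm-force winds

circle_radius = circle_diameter / 2       # 210 km
storm_radius = storm_width / 2            # 290 km

# Issue A: even with the storm centred in the middle of the circle,
# it overhangs the circle's edge by this much in every direction.
overhang_if_centred = storm_radius - circle_radius        # 80 km

# Issue B: the circle only bounds the storm's *centre*, which could sit on the
# circumference; the storm's edge could then reach this far beyond the circle.
max_reach_beyond_circle = storm_radius                    # 290 km

print(overhang_if_centred, max_reach_beyond_circle)
```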

Hurricane Season

These are not academic distinctions; the real danger that hurricane cones were being misinterpreted led the NWS to start labelling their charts with “This cone DOES NOT REPRESENT THE SIZE OF THE STORM!!” [10].

Even Florida senator Marco Rubio got in on the act, tweeting:

Rubio tweet

When you need a politician to help you avoid misinterpreting a data visualisation, you know that there is something amiss.
 
 
In Summary

Could do better

The last thing I want to do is to appear critical of the men and women of the US National Weather Service. I’m sure that they do a fine job. If anything, the issues we have been dissecting here demonstrate that even highly expert people with a strong motivation to communicate clearly can still find it tough to select the right visual metaphor for a data visualisation; particularly when there is a diverse audience consuming the results. It also doesn’t help that there are many degrees of uncertainty here: where might the centre of the storm be? how big might the storm be? how powerful might the storm be? in which direction might the storm move? Layering all of these onto a single exhibit while still rendering it both legible and of some utility to the general public is not a trivial exercise.

The cone of uncertainty is a precise chart, so long as the reader understands what it is showing and what it is not. Perhaps the issue lies more in the eye of the beholder. However, having to annotate your charts to explain what they are not is never a good look on anyone. The NWS are clearly aware of the issues; I look forward to viewing whatever creative solution they come up with later this hurricane season.
 


 
Acknowledgements

I would like to thank Dr Steve Smith, Head of Catastrophic Risk at Fractal Industries, for reviewing this piece and putting me right on some elements of modern hurricane prediction. I would also like to thank my friend and former colleague, Dr Raveem Ismail, also of Fractal Industries, for introducing me to Steve. Despite the input of these two experts, responsibility for any errors or omissions remains mine alone.
 


 
Notes

 
[1]
 
I also squeezed Part I(b) – The Mona Lisa in between the two articles I originally planned.
 
[2]
 
I don’t mean to imply by this that the estimation process is unscientific of course. Indeed, as we will see later, hurricane prediction is becoming more scientific all the time.
 
[3]
 
If both methods were employed in parallel, it would not be too surprising if their central predictions were close to each other.
 
[4]
 
A gas or a liquid.
 
[5]
 
Cone

The shape traced out by a circle of steadily increasing radius, centred on a particle travelling at constant speed in a fixed direction, would be a cone.

 
[6]
 
Latitude and Longitude

The distance between lines of longitude varies between 111 km at the equator and 0 km at either pole. This is because lines of longitude are great circles (or meridians) that meet at the poles. Lines of latitude are parallel circles (parallels) progressing up and down the globe from the equator.

 
[7]
 
At a point in time of course. Hurricanes change in size over time as well as in their direction/speed of travel and energy.
 
[8]
 
I am rounding here. The actual threshold values are 63 kmph and 39 mph.
 
[9]
 
Using the definition of size that we have adopted above.
 
[10]
 
Their use of capitals, bold and multiple exclamation marks.

 

From: peterjamesthomas.com, home of The Data and Analytics Dictionary

 

The revised and expanded Data and Analytics Dictionary

The Data and Analytics Dictionary

Since its launch in August of this year, the peterjamesthomas.com Data and Analytics Dictionary has received a welcome amount of attention with various people on different social media platforms praising its usefulness, particularly as an introduction to the area. A number of people have made helpful suggestions for new entries or improvements to existing ones. I have also been rounding out the content with some more terms relating to each of Data Governance, Big Data and Data Warehousing. As a result, The Dictionary now has over 80 main entries (not including ones that simply refer the reader to another entry, such as Linear Regression, which redirects to Model).

The most recently added entries are as follows:

  1. Anomaly Detection
  2. Behavioural Analytics
  3. Complex Event Processing
  4. Data Discovery
  5. Data Ingestion
  6. Data Integration
  7. Data Migration
  8. Data Modelling
  9. Data Privacy
  10. Data Repository
  11. Data Virtualisation
  12. Deep Learning
  13. Flink
  14. Hive
  15. Information Security
  16. Metadata
  17. Multidimensional Approach
  18. Natural Language Processing (NLP)
  19. On-line Transaction Processing
  20. Operational Data Store (ODS)
  21. Pig
  22. Table
  23. Sentiment Analysis
  24. Text Analytics
  25. View

It is my intention to continue to revise this resource. Adding some more detail about Machine Learning and related areas is probably the next focus.

As ever, ideas for what to include next would be more than welcome (any suggestions used will also be acknowledged).
 


 


 

A truth universally acknowledged…

£10 note

  “It is a truth universally acknowledged, that an organisation in possession of some data, must be in want of a Chief Data Officer”

— Growth and Governance, by Jane Austen (1813) [1]

 

I wrote about a theoretical job description for a Chief Data Officer back in November 2015 [2]. While I have been on “paternity leave” following the birth of our second daughter, a couple of genuine CDO job specs landed in my inbox. While unable to respond for the aforementioned reasons, I did leaf through the documents. Something immediately struck me; they were essentially wish-lists covering a number of data-related fields, rather than a description of what a CDO might actually do. Clearly I’m not going to cite the actual text here, but the following is representative of what appeared in both requirement lists:

CDO wishlist

Mandatory Requirements:

Highly Desirable Requirements:

  • PhD in Mathematics or a numerical science (with a strong record of highly-cited publications)
  • MBA from a top-tier Business School
  • TOGAF certification
  • PRINCE2 and Agile Practitioner
  • Invulnerability and X-ray vision [3]
  • Mastery of the lesser incantations and a cloak of invisibility [3]
  • High midi-chlorian reading [3]
  • Full, clean driving licence

Your common-or-garden CDO

The above list may have descended into farce towards the end, but I would argue that the problems started to occur much earlier. The above is not a description of what is required to be a successful CDO; it’s a description of a Swiss Army Knife. There is also the minor practical point that, out of a world population of around 7.5 billion, there may well be no one who ticks all the boxes [4].

Let’s make the fallacy of this type of job description clearer by considering what a similar approach would look like if applied to what is generally the most senior role in an organisation, the CEO. Whoever drafted the above list of requirements would probably characterise a CEO as follows:

  • The best salesperson in the organisation
  • The best accountant in the organisation
  • The best M&A person in the organisation
  • The best customer service operative in the organisation
  • The best facilities manager in the organisation
  • The best janitor in the organisation
  • The best purchasing clerk in the organisation
  • The best lawyer in the organisation
  • The best programmer in the organisation
  • The best marketer in the organisation
  • The best product developer in the organisation
  • The best HR person in the organisation, etc., etc., …

Of course a CEO needs to be none of the above; they need to be a superlative leader who is expert at running an organisation (even then, they may focus on plotting the way forward and leave the day-to-day running to others). For the avoidance of doubt, I am not saying that a CEO requires no domain knowledge or expertise; they would need both. However, they don’t have to know every aspect of company operations better than the people who do it.

The same argument applies to CDOs. Domain knowledge probably should span most of what is in the job description (save for maybe the three items with footnotes), but knowledge is different to expertise. As CDOs don’t grow on trees, they will most likely be experts in one or a few of the areas cited, but not all of them. Successful CDOs will know enough to be able to talk to people in the areas where they are not experts. They will have to be competent at hiring experts in every area of a CDO’s purview. But they do not have to be able to do the job of every data-centric staff member better than the person could do themselves. Even if you could identify such a CDO, they would probably lose their best staff very quickly due to micromanagement.

Conducting the data orchestra

A CDO has to be a conductor, both of the data function orchestra and of the use of data in the wider organisation. This is a talent in itself. An internationally renowned conductor may have previously been a violinist, but it is unlikely they were also a flautist and a percussionist. They do however need to be able to tell whether or not the second trumpeter is any good; this is not the same as being able to play the trumpet yourself, of course. The conductor’s key skill is in managing the efforts of a large group of people to create a cohesive – and harmonious – whole.

The CDO is of course still a relatively new role in mainstream organisations [5]. Perhaps these job descriptions will become more realistic as the role becomes more familiar. It is to be hoped so, else many a search for a new CDO will end in disappointment.

Having twisted her text to my own purposes at the beginning of this article, I will leave the last words to Jane Austen:

  “A scheme of which every part promises delight, can never be successful; and general disappointment is only warded off by the defence of some little peculiar vexation.”

— Pride and Prejudice, by Jane Austen (1813)

 

 
Notes

 
[1]
 
Well if a production company can get away with Pride and Prejudice and Zombies, then I feel I am on reasonably solid ground here with this title.

I also seem to be riffing on JA rather a lot at present; I used Rationality and Reality as the title of one of the chapters in my [as yet unfinished] Mathematical book, Glimpses of Symmetry.

 
[2]
 
Wanted – Chief Data Officer.
 
[3]
 
Most readers will immediately spot the obvious mistake here. Of course all three of these requirements should be mandatory.
 
[4]
 
To take just one example, gaining a PhD in a numerical science, a track record of highly-cited papers and also obtaining an MBA would take most people at least a few weeks of effort. Is it likely that such a person would next focus on a PRINCE2 or TOGAF qualification?
 
[5]
 
I discuss some elements of the emerging consensus on what a CDO should do in: 5 Themes from a Chief Data Officer Forum and 5 More Themes from a Chief Data Officer Forum.

 


 

The peterjamesthomas.com Data and Analytics Dictionary

The Data and Analytics Dictionary

I find myself frequently being asked questions around terminology in Data and Analytics and so thought that I would try to define some of the more commonly used phrases and words. My first attempt to do this can be viewed in a new page added to this site (this also appears in the site menu):

The Data and Analytics Dictionary

I plan to keep this up-to-date as the field continues to evolve.

I hope that my efforts to explain some concepts in my main area of specialism are both of interest and utility to readers. Any suggestions for new entries or comments on existing ones are more than welcome.
 

 

Knowing what you do not Know

Measure twice, cut once

As readers will have noticed, my wife and I have spent a lot of time talking to medical practitioners in recent months. The same readers will also know that my wife is a Structural Biologist, whose work I have featured before in Data Visualisation – A Scientific Treatment [1]. Some of our previous medical interactions had led to me thinking about the nexus between medical science and statistics [2]. More recently, my wife had a discussion with a doctor which brought to mind some of her own previous scientific work. Her observations about the connections between these two areas have formed the genesis of this article. While the origins of this piece are in science and medicine, I think that the learnings have broader applicability.


So the general context is a medical test, the result of which was my wife being told that all was well [3]. Given that humans are complicated systems (to say the very least), my wife was less than convinced that just because reading X was OK it meant that everything else was also necessarily OK. She contrasted the approach of the physician with something from her own experience and in particular one of the experiments that formed part of her PhD thesis. I’m going to try to share the central point she was making with you without going into all of the scientific details [4]. However to do this I need to provide at least some high-level background.

Structural Biology is broadly the study of the structure of large biological molecules, which mostly means proteins and protein assemblies. What is important is not the chemical make up of these molecules (how many carbon, hydrogen, oxygen, nitrogen and other atoms they consist of), but how these atoms are arranged to create three dimensional structures. An example of this appears below:

The 3D structure of a bacterial Ribosome

This image is of a bacterial Ribosome. Ribosomes are miniature machines which assemble amino acids into proteins as part of the chain which converts information held in DNA into useful molecules [5]. Ribosomes are themselves made up of a number of different proteins as well as RNA.

In order to determine the structure of a given protein, it is necessary to first isolate it in sufficient quantity (i.e. to purify it) and then subject it to some form of analysis, for example X-ray crystallography, electron microscopy or a variety of other biophysical techniques. Depending on the analytical procedure adopted, further work may be required, such as growing crystals of the protein. Something that is generally very important in this process is to increase the stability of the protein that is being investigated [6]. The type of protein that my wife was studying [7] is particularly unstable as its natural home is as part of the wall of cells – removed from this supporting structure these types of proteins quickly degrade.

So one of my wife’s tasks was to better stabilise her target protein. This can be done in a number of ways [8] and I won’t get into the technicalities. After one such attempt, my wife looked to see whether her work had been successful. In her case the relative stability of her protein before and after modification is determined by a test called a Thermostability Assay.

Sigmoidal Dose Response Curve A
© University of Cambridge – reproduced under a Creative Commons 2.0 licence

In the image above, you can see the combined results of several such assays carried out on both the unmodified and modified protein. Results for the unmodified protein are shown as a green line [9] and those for the modified protein as a blue line [10]. The fact that the blue line (and more particularly the section which rapidly slopes down from the higher values to the lower ones) is to the right of the green one indicates that the modification has been successful in increasing thermostability.
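For readers curious about how such curves are compared in practice, here is a hedged sketch with invented data (my own illustration, not my wife’s actual analysis): fit a sigmoid to each assay with scipy and read off the midpoint, an apparent melting temperature; the modification looks successful if the midpoint for the modified protein is higher.

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy thermostability comparison with made-up data: fit a sigmoid to each
# assay and compare the midpoints (apparent melting temperatures).
def sigmoid(t, top, bottom, t_mid, slope):
    return bottom + (top - bottom) / (1 + np.exp((t - t_mid) / slope))

temps = np.linspace(20, 80, 25)
rng = np.random.default_rng(1)
unmodified = sigmoid(temps, 1.0, 0.05, 45, 3) + rng.normal(0, 0.02, temps.size)
modified = sigmoid(temps, 1.0, 0.05, 52, 3) + rng.normal(0, 0.02, temps.size)

p0 = [1.0, 0.0, 50, 3]                     # rough starting guesses for the fit
tm_unmod = curve_fit(sigmoid, temps, unmodified, p0=p0)[0][2]
tm_mod = curve_fit(sigmoid, temps, modified, p0=p0)[0][2]

# A higher midpoint for the modified protein suggests increased thermostability,
# at least according to this one assay.
print(f"Apparent Tm: unmodified {tm_unmod:.1f} C, modified {tm_mod:.1f} C")
```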

So my wife had done a great job – right? Well things were not so simple as they might first seem. There are two different protocols relating to how to carry out this thermostability assay. These basically involve doing some of the required steps in a different order. So if the steps are A, B, C and D, then protocol #1 consists of A ↦ B ↦ C ↦ D and protocol #2 consists of A ↦ C ↦ B ↦ D. My wife was thorough enough to also use this second protocol with the results shown below:

Sigmoidal Dose Response Curve B
© University of Cambridge – reproduced under a Creative Commons 2.0 licence

Here we have the opposite finding: the same modification to the protein now seems to have decreased its stability. There are some good reasons why this type of discrepancy might have occurred [11], but overall my wife could not conclude that this attempt to increase stability had been successful. This sort of thing happens all the time and she moved on to the next idea. This is all part of the rather messy process of conducting science [12].

I’ll let my wife explain her perspective on these results in her own words:

In general you can’t explain everything about a complex biological system with one set of data or the results of one test. It will seldom be the whole picture. Protocol #1 for the thermostability assay was the gold standard in my lab before the results I obtained above. Now protocol #1 is used in combination with another type of assay whose efficacy I also explored. Together these give us an even better picture of stability. The gold standard shifted. However, not even this bipartite test tells you everything. In any complex system (be that Biological or a complicated dataset) there are always going to be unknowns. What I think is important is knowing what you can and can’t account for. In my experience in science, there is generally much much more that can’t be explained than can.

Belt and Braces [or suspenders if you are from the US, which has quite a different connotation in the UK!]

As ever, translating all of this to a business context is instructive. Conscientious Data Scientists or business-focussed Statisticians who come across something interesting in a model or analysis will always try (where feasible) to corroborate this by other means; they will try to perform a second “experiment” to verify their initial findings. They will also realise that even two supporting results obtained in different ways will not in general be 100% conclusive. However the highest levels of conscientiousness may be more honoured in the breach than the observance [13]. Also there may not be an alternative “experiment” that can be easily run. Whatever the motivations or circumstances, it is not beyond the realm of possibility that some Data Science findings are true only in the same way that my wife thought she had successfully stabilised her protein before carrying out the second assay.

I would argue that business will often have much to learn from the levels of rigour customary in most scientific research [14]. It would be nice to think that the same rigour is always applied in commercial matters as academic ones. Unfortunately experience would tend to suggest the contrary is sometimes the case. However, it would also be beneficial if people working on statistical models in industry went out of their way to stress not only what phenomena these models can explain, but what they are unable to explain. Knowing what you don’t know is the first step towards further enlightenment.
 


 
Notes

 
[1]
 
Indeed this previous article had a sub-section titled Rigour and Scrutiny, echoing some of the themes in this piece.
 
[2]
 
See More Statistics and Medicine.
 
[3]
 
As in the earlier article, apologies for the circumlocution. I’m both looking to preserve some privacy and save the reader from boredom.
 
[4]
 
Anyone interested in more information is welcome to read her thesis which is in any case in the public domain. It is 188 pages long, which is reasonably lengthy even by my standards.
 
[5]
 
They carry out translation, which refers to synthesising proteins based on information carried by messenger RNA (mRNA).
 
[6]
 
Some proteins are naturally stable, but many are not and will not survive purification or later steps in their native state.
 
[7]
 
G Protein-coupled Receptors or GPCRs.
 
[8]
 
Chopping off flexible sections, adding other small proteins which act as scaffolding, getting antibodies or other biological molecules to bind to the protein and so on.
 
[9]
 
Actually a sigmoidal dose-response curve.
 
[10]
 
For anyone with colour perception problems, the green line has markers which are diamonds and the blue line has markers which are triangles.
 
[11]
 
As my wife writes [with my annotations]:

A possible explanation for this effect was that while T4L [the protein she added to try to increase stability – T4 Lysozyme] stabilised the binding pocket, the other domains of the receptor were destabilised. Another possibility was that the introduction of T4L caused an increase in the flexibility of CL3, thus destabilising the receptor. A method for determining whether this was happening would be to introduce rigid linkers at the AT1R-T4L junction [AT1R was the protein she was studying, angiotensin II type 1 receptor], or other placements of T4L. Finally AT1R might exist as a dimer and the addition of T4L might inhibit the formation of dimers, which could also destabilise the receptor.

© University of Cambridge – reproduced under a Creative Commons 2.0 licence

 
[12]
 
See also Toast.
 
[13]
 
Though to be fair, the way that this phrase is normally used today is probably not what either Hamlet or Shakespeare intended by it back around 1600.
 
[14]
 
Of course there are sadly examples of specific scientists falling short of the ideals I have described here.

 

 

Bigger and Better (Data)?

Is bigger really better

I was browsing Data Science Central [1] recently and came across an article by Bill Vorhies, President & Chief Data Scientist of Data-Magnum. The piece was entitled 7 Cases Where Big Data Isn’t Better and is worth a read in full. Here I wanted to pick up on just a couple of Bill’s points.

In his preamble, he states:

Following the literature and the technology you would think there is universal agreement that more data means better models. […] However […] it’s always a good idea to step back and examine the premise. Is it universally true that our models will be more accurate if we use more data? As a data scientist you will want to question this assumption and not automatically reach for that brand new high-performance in-memory modeling array before examining some of these issues.

Bill goes on to make several pertinent points including: that if your data is bad, having more of it is not necessarily a solution; that attempting to create a gigantic and all-purpose model may well be inferior to multiple, more targeted models on smaller sub-sets of data; and that there exist specific instances where a smaller data set yields greater accuracy [2]. However I wanted to pick up directly on Bill’s point 6 of 7, in which he also references Larry Greenemeier (@lggreenemeier) of Scientific American.

  Bill Vorhies   Larry Greenemeier  

6. Sometimes We Get Hypnotized By the Overwhelming Volume of the Data and Forget About Data Provenance and Good Project Design

A few months back I reviewed an article by Larry Greenemeier [3] about the failure of Google Flu Trend analysis to predict the timing and severity of flu outbreaks based on social media scraping. It was widely believed that this Big Data volume of data would accurately predict the incidence of flu but the study failed miserably missing timing and severity by a wide margin.

Says Greenemeier, “Big data hubris is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. The mistake of many big data projects, the researchers note, is that they are not based on technology designed to produce valid and reliable data amenable for scientific analysis. The data comes from sources such as smartphones, search results and social networks rather than carefully vetted participants and scientific instruments”.

Perhaps more pertinent to a business environment, Greenemeier’s article also states:

Context is often lacking when info is pulled from disparate sources, leading to questionable conclusions.

Ruler

Neither of these authors is saying that having greater volumes of data is a definitively bad thing; indeed Vorhies states:

In general would I still prefer to have more data than less? Yes, of course.

They are however both pointing out that, in some instances, more traditional statistical methods, applied to smaller data sets, yield superior results. This is particularly the case where data are repurposed and the use to which they are put differs from the considerations in play when they were collected; something which is arguably more likely where general-purpose Big Data sets are leveraged without reference to other information.

Also, when large data sets are collated from many places, the data from each place can have different characteristics. If this variation is not controlled for in models, it may well lead to erroneous findings.
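A toy illustration of that last point, in Python with fabricated numbers: two sources in which the relationship between input and outcome is identical, but whose differing baselines flip the apparent sign of the relationship when the data are naively pooled without controlling for source.

```python
import numpy as np

# Two data sources, each with the *same* positive relationship (slope +2),
# but with different baselines and different ranges of the input variable.
rng = np.random.default_rng(0)

x_a = rng.uniform(0, 1, 500)
y_a = 5 + 2 * x_a + rng.normal(0, 0.2, 500)     # source A: high baseline
x_b = rng.uniform(2, 3, 500)
y_b = 0 + 2 * x_b + rng.normal(0, 0.2, 500)     # source B: low baseline

slope_a = np.polyfit(x_a, y_a, 1)[0]            # roughly +2
slope_b = np.polyfit(x_b, y_b, 1)[0]            # roughly +2
slope_pooled = np.polyfit(np.concatenate([x_a, x_b]),
                          np.concatenate([y_a, y_b]), 1)[0]

# The pooled slope comes out negative: a conclusion neither source supports.
print(slope_a, slope_b, slope_pooled)
```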

Statistical Methods

Their final observation is that sound statistical methodology needs to be applied to big data sets just as much as to more regular ones. The hope that design flaws will simply evaporate when data sets get large enough may be seductive, but it is also dangerously wrong.

Vorhies and Greenemeier are not suggesting that Big Data has no value. However they state that one of its most potent uses may well be as a supplement to existing methods, perhaps extending them, or bringing greater granularity to results. I view such introspection in Data Science circles as positive, likely to lead to improved methods and an indication of growing maturity in the field. It is however worth noting that, in some cases, leverage of Small-but-Well-Designed Data [4] is not only effective, but actually a superior approach. This is certainly something that Data Scientists should bear in mind.
 


 
Notes

 
[1]
 
I’d recommend taking a look at this site regularly. There is a high volume of articles and the quality is variable, but often there are some stand-out pieces.
 
[2]
 
See the original article for the details.
 
[3]
 
The article was in Scientific American and entitled Why Big Data Isn’t Necessarily Better Data.
 
[4]
 
I may have to copyright this term and of course the very elegant abridgement, SBWDD.

 

 

How to be Surprisingly Popular

Popular with the Crowd
 
Introduction

This article is about the wisdom of the crowd [1], or more particularly its all too frequent foolishness. I am going to draw on a paper recently published in Nature by a cross-disciplinary team from the Massachusetts Institute of Technology and Princeton University. The authors are Dražen Prelec, H. Sebastian Seung and John McCoy. The paper’s title is A solution to the single-question crowd wisdom problem [2]. Rather than reinvent the wheel, here is a section from the abstract (with my emphasis):

Once considered provocative, the notion that the wisdom of the crowd is superior to any individual has become itself a piece of crowd wisdom, leading to speculation that online voting may soon put credentialed experts out of business. Recent applications include political and economic forecasting, evaluating nuclear safety, public policy, the quality of chemical probes, and possible responses to a restless volcano. Algorithms for extracting wisdom from the crowd are typically based on a democratic voting procedure. […] However, democratic methods have serious limitations. They are biased for shallow, lowest common denominator information, at the expense of novel or specialized knowledge that is not widely shared.

 
 
The Problems

The authors describe some compelling examples of where a crowd-based approach ignores the aforementioned specialised knowledge. I’ll cover a couple of these in a second, but let me first add my own.

How heavy is a proton?

Suppose we ask 1,000 people to come up with an estimate of the mass of a proton. One of these people happens to have won the Nobel Prize for Physics the previous year. Is the average of the estimates provided by the 1,000 people likely to be more accurate, or is the estimate of the one particularly qualified person going to be superior? There is an obvious answer to this question [3].

Lest it be thought that the above flaw in the wisdom of the crowd is confined to populations including a Nobel Laureate, I’ll reproduce a much more quotidian example from the Nature paper [4].

Philadelphia or Harrisburg?

[..] imagine that you have no knowledge of US geography and are confronted with questions such as: Philadelphia is the capital of Pennsylvania, yes or no? And, Columbia is the capital of South Carolina, yes or no? You pose them to many people, hoping that majority opinion will be correct. [in an actual exercise the team carried out] this works for the Columbia question, but most people endorse the incorrect answer (yes) for the Philadelphia question. Most respondents may only recall that Philadelphia is a large, historically significant city in Pennsylvania, and conclude that it is the capital. The minority who vote no probably possess an additional piece of evidence, that the capital is Harrisburg. A large panel will surely include such individuals. The failure of majority opinion cannot be blamed on an uninformed panel or flawed reasoning, but represents a defect in the voting method itself.

I’m both a good and bad example here. I know the capital of Pennsylvania is Harrisburg because I have specialist knowledge [5]. However my acquaintance with South Carolina is close to zero. I’d therefore get the first question right and have a 50 / 50 chance on the second (all other things being equal of course). My assumption is that Columbia is, in general, much more well-known than Harrisburg for some reason.

Confidence Levels

The authors go on to cover the technique that is often used to try to address this type of problem in surveys. Respondents are also asked how confident they are about their answer. Thus a tentative “yes” carries less weight than a definitive “yes”. However, as the authors point out, such an approach only works if correct responses are strongly correlated with respondent confidence. As is all too evident from real life, people are often both wrong and very confident about their opinion [6]. The authors extended their Philadelphia / Columbia study to apply confidence weightings, but with no discernible improvement.
 
 
A Surprisingly Popular Solution

As well as identifying the problem, the authors suggest a solution and later go on to demonstrate its efficacy. Again quoting from the paper’s abstract:

Here we propose the following alternative to a democratic vote: select the answer that is more popular than people predict. We show that this principle yields the best answer under reasonable assumptions about voter behaviour, while the standard ‘most popular’ or ‘most confident’ principles fail under exactly those same assumptions.

Let’s use the examples of capitals of states again here (as the authors do in the paper). As well as asking respondents, “Philadelphia is the capital of Pennsylvania, yes or no?” you also ask them “What percentage of people in this survey will answer ‘yes’ to this question?” The key is then to compare the actual survey answers with the predicted survey answers.

Columbia and Philadelphia [click to view a larger version in a new tab]

As shown in the above exhibit, in the authors’ study, when people were asked whether or not Columbia is the capital of South Carolina, those who replied “yes” felt that the majority of respondents would agree with them. Those who replied “no” symmetrically felt that the majority of people would also reply “no”. So no surprises there. Both groups felt that the crowd would agree with their response.

However, in the case of whether or not Philadelphia is the capital of Pennsylvania there is a difference. While those who replied “yes” also felt that the majority of people would agree with them, amongst those who replied “no”, there was a belief that the majority of people surveyed would reply “yes”. This is a surprise. People who make the correct response to this question feel that the wisdom of the crowd will be incorrect.

In the Columbia example, what people predict will be the percentage of people replying “yes” tracks with the actual response rate. In the Philadelphia example, what people predict will be the percentage of people replying “yes” is significantly less than the actual proportion of people making this response [7]. Thus a response of “no” to “Philadelphia is the capital of Pennsylvania, yes or no?” is surprisingly popular. The methodology that the authors advocate would then lead to the surprisingly popular answer (i.e. “no”) actually being correct; as indeed it is. Because there is no surprisingly popular answer in the Columbia example, then the result of a democratic vote stands; which is again correct.

To reiterate: a surprisingly popular response will overturn the democratic verdict, if there is no surprisingly popular response, the democratic verdict is unmodified.
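For the binary yes/no case, the decision rule is simple enough to sketch in a few lines of Python. This is my own reading of the paper’s principle, with invented numbers loosely in the spirit of the Philadelphia example, not the authors’ code.

```python
import numpy as np

def surprisingly_popular(votes, predicted_yes_share):
    """Single yes/no question: pick the answer whose actual share of votes
    exceeds the share that respondents, on average, predicted it would get.

    votes               -- array of 0/1 answers (1 = "yes")
    predicted_yes_share -- each respondent's prediction of the fraction of
                           the panel answering "yes" (values in 0..1)
    """
    actual_yes = np.mean(votes)
    predicted_yes = np.mean(predicted_yes_share)
    # "Yes" is surprisingly popular if it beats its predicted popularity;
    # otherwise "no" is (relatively) more popular than predicted.
    return "yes" if actual_yes > predicted_yes else "no"

# Invented numbers: 65% vote "yes" (incorrectly), but on average respondents
# predict that around 75% will say "yes". "Yes" under-performs its prediction,
# so "no" is the surprisingly popular (and here, correct) answer.
votes = np.array([1] * 65 + [0] * 35)
predictions = np.array([0.80] * 65 + [0.65] * 35)
print(surprisingly_popular(votes, predictions))   # -> "no"
```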

Discriminating about Art

As well as confirming the superiority of the surprisingly popular approach (as opposed to either weighted or non-weighted democratic votes) with questions about state capitals, the authors went on to apply their new technique in a range of other areas [8].

  • Study 1 used 50 US state capitals questions, repeating the format [described above] with different populations [9].
     
  • Study 2 employed 80 general knowledge questions.
     
  • Study 3 asked professional dermatologists to diagnose 80 skin lesion images as benign or malignant.
     
  • Study 4 presented 90 20th century artworks [see the images above] to laypeople and art professionals, and asked them to predict the correct market price category.

Taking all responses across the four studies into account [10], the central findings were as follows [11]:

We first test pairwise accuracies of four algorithms: majority vote, surprisingly popular (SP), confidence-weighted vote, and max. confidence, which selects the answer endorsed with highest average confidence.

  • Across all items, the SP algorithm reduced errors by 21.3% relative to simple majority vote (P < 0.0005 by two-sided matched-pair sign test).
     
  • Across the items on which confidence was measured, the reduction was:
    • 35.8% relative to majority vote (P < 0.001),
    • 24.2% relative to confidence-weighted vote (P = 0.0107) and
    • 22.2% relative to max. confidence (P < 0.13).

The authors go on to further kick the tyres [12] on these results [13] without drawing any conclusions that deviate considerably from the ones they first present and which are reproduced above. The surprising finding is that the surprisingly popular algorithm significantly out-performs the algorithms normally used in wisdom of the crowd polling. This is a major result, in theory at least.
 
 
Some Thoughts

Tools and Toolbox

At the end of the abstract, the authors state that:

Like traditional voting, [the surprisingly popular algorithm] accepts unique problems, such as panel decisions about scientific or artistic merit, and legal or historical disputes. The potential application domain is thus broader than that covered by machine learning […].

Given the – justified – attention that has been given to machine learning in recent years, this is a particularly interesting claim. More broadly, SP seems to bring much needed nuance to the wisdom of the crowd. It recognises that the crowd may often be right, but also allows better informed minorities to override the crowd opinion in specific cases. It does this robustly in all of the studies that the authors conducted. It will be extremely interesting to see this novel algorithm deployed in anger, i.e. in a non-theoretical environment. If its undoubted promise is borne out – and the evidence to date suggests that it will be – then statisticians will have a new and powerful tool in their arsenal and a range of predictive activities will be improved.

The scope of applicability of the SP technique is as wide as that of any wisdom of the crowd approach and, to repeat the comments made by the authors in their abstract, has recently included:

[…] political and economic forecasting, evaluating nuclear safety, public policy, the quality of chemical probes, and possible responses to a restless volcano

If the authors’ initial findings are repeated in “live” situations, then the refinement to the purely democratic approach that SP brings should elevate an already useful approach to being an indispensable one in many areas.

I will let the authors have a penultimate word [14]:

Although democratic methods of opinion aggregation have been influential and productive, they have underestimated collective intelligence in one respect. People are not limited to stating their actual beliefs; they can also reason about beliefs that would arise under hypothetical scenarios. Such knowledge can be exploited to recover truth even when traditional voting methods fail. If respondents have enough evidence to establish the correct answer, then the surprisingly popular principle will yield that answer; more generally, it will produce the best answer in light of available evidence. These claims are theoretical and do not guarantee success in practice, as actual respondents will fall short of ideal. However, it would be hard to trust a method [such as majority vote or confidence-weighted vote] if it fails with ideal respondents on simple problems like [the Philadelphia one]. To our knowledge, the method proposed here is the only one that passes this test.

US Presidential Election Polling [borrowed from Wikipedia]

The ultimate thought I will present in this article is an entirely speculative one. The authors posit that their method could be applied to “potentially controversial topics, such as political and environmental forecasts”, while cautioning that manipulation should be guarded against. Their suggestion leads me to wonder what impact on the results of opinion polls a suitably formed surprisingly popular questionnaire would have had in the run up to both the recent UK European Union Referendum and the plebiscite for the US Presidency. Of course it is now impossible to tell, but maybe some polling organisations will begin to incorporate this new approach going forward. It can hardly make things worse.
 


 
Notes

 
[1]
 
According to Wikipedia, the phenomenon that:

A large group’s aggregated answers to questions involving quantity estimation, general world knowledge, and spatial reasoning has generally been found to be as good as, and often better than, the answer given by any of the individuals within the group.

The authors of the Nature paper question whether this is true in all circumstances.

 
[2]
 
Prelec, D., Seung, H.S., McCoy, J., (2017). A solution to the single-question crowd wisdom problem. Nature 541, 532–535.

You can view a full version of this paper care of Springer Nature SharedIt at the following link. SharedIt is Springer’s content sharing initiative.

Direct access to the article on Nature’s site (here) requires a subscription to the journal.

 
[3]
 
This example is perhaps an interesting rejoinder to the increasing lack of faith in experts in the general population, something I covered in Toast.

Of course the answer is approximately 1.6726219 × 10⁻²⁷ kg.

 
[4]
 
I have lightly edited this section but abjured the regular bracketed ellipses (more than one […] as opposed to conic sections as I note elsewhere). This is both for reasons of readability and also as I have not yet got to some points that the authors were making in this section. The original text is a click away.
 
[5]
 
My wife is from this state.
 
[6]
 
Indeed it sometimes seems that the more wrong the opinion, the more certainly people believe it to be right.

Here the reader is free to insert whatever political example fits best with their worldview.

 
[7]
 
Because many people replying “no” felt that a majority would disagree with them.
 
[8]
 
Again I have lightly edited this text.
 
[9]
 
To provide a bit of detail, here the team created a questionnaire with 50 separate question sets of the type:

  1. {Most populous city in a state} is the capital of {state}: yes or no?
     
  2. How confident are you in your answer (50 – 100%)?
     
  3. What percentage of people surveyed will respond “yes” to this question? (1 – 100%)

This was completed by 83 people split between groups of undergraduate and graduate students at both MIT and Princeton. Again see the paper for further details.

 
[10]
 
And eliding some nuances such as some responses being binary (yes/no) and others a range (e.g. the dermatologists were asked to rate the chance of malignancy on a six point scale from “absolutely uncertain to absolutely certain”). Also respondents were asked to provide their confidence in some studies and not others.
 
[11]
 
Once more with some light editing.
 
[12]
 
This is a technical term employed in scientific circles and I apologise if my use of jargon confuses some readers.
 
[13]
 
Again please see the actual paper for details.
 
[14]
 
Modified very slightly by my last piece of editing.