Since its launch in August of this year, the peterjamesthomas.com Data and Analytics Dictionary has received a welcome amount of attention with various people on different social media platforms praising its usefulness, particularly as an introduction to the area. A number of people have made helpful suggestions for new entries or improvements to existing ones. I have also been rounding out the content with some more terms relating to each of Data Governance, Big Data and Data Warehousing. As a result, The Dictionary now has over 80 main entries (not including ones that simply refer the reader to another entry, such as Linear Regression, which redirects to Model).
“It is a truth universally acknowledged, that an organisation in possession of some data, must be in want of a Chief Data Officer”
— Growth and Governance, by Jane Austen (1813) [1]
I wrote about a theoretical job description for a Chief Data Officer back in November 2015 [2]. While I have been on “paternity leave” following the birth of our second daughter, a couple of genuine CDO job specs landed in my inbox. While unable to respond for the aforementioned reasons, I did leaf through the documents. Something immediately struck me; they were essentially wish-lists covering a number of data-related fields, rather than a description of what a CDO might actually do. Clearly I’m not going to cite the actual text here, but the following is representative of what appeared in both requirement lists:
Mandatory Requirements:
Solid commercial understanding and 5 years spent in [insert industry sector here]
The above list may have descended into farce towards the end, but I would argue that the problems started to occur much earlier. The above is not a description of what is required to be a successful CDO, it’s a description of a Swiss Army Knife. There is also the minor practical point that, out of a World population of around 7.5 billion, there may well be no one who ticks all the boxes [4].
Let’s make the fallacy of this type of job description clearer by considering what a similar approach would look like if applied to what is generally the most senior role in an organisation, the CEO. Whoever drafted the above list of requirements would probably characterise a CEO as follows:
The best salesperson in the organisation
The best accountant in the organisation
The best M&A person in the organisation
The best customer service operative in the organisation
The best facilities manager in the organisation
The best janitor in the organisation
The best purchasing clerk in the organisation
The best lawyer in the organisation
The best programmer in the organisation
The best marketer in the organisation
The best product developer in the organisation
The best HR person in the organisation, etc., etc., …
Of course a CEO needs to be none of the above, they need to be a superlative leader who is expert at running an organisation (even then, they may focus on plotting the way forward and leave the day to day running to others). For the avoidance of doubt, I am not saying that a CEO requires no domain knowledge and has no expertise, they would need both, however they don’t have to know every aspect of company operations better than the people who do it.
The same argument applies to CDOs. Domain knowledge probably should span most of what is in the job description (save for maybe the three items with footnotes), but knowledge is different to expertise. As CDOs don’t grow on trees, they will most likely be experts in one or a few of the areas cited, but not all of them. Successful CDOs will know enough to be able to talk to people in the areas where they are not experts. They will have to be competent at hiring experts in every area of a CDO’s purview. But they do not have to be able to do the job of every data-centric staff member better than the person could do themselves. Even if you could identify such a CDO, they would probably lose their best staff very quickly due to micromanagement.
A CDO has to be a conductor of both the data function orchestra and of the use of data in the wider organisation. This is a talent in itself. An internationally renowned conductor may have previously been a violinist, but it is unlikely they were also a flautist and a percussionist. They do however need to be able to tell whether or not the second trumpeter is any good; this is not, of course, the same as being able to play the trumpet yourself. The conductor’s key skill is in managing the efforts of a large group of people to create a cohesive – and harmonious – whole.
The CDO is of course still a relatively new role in mainstream organisations [5]. Perhaps these job descriptions will become more realistic as the role becomes more familiar. It is to be hoped so, else many a search for a new CDO will end in disappointment.
Having twisted her text to my own purposes at the beginning of this article, I will leave the last words to Jane Austen:
“A scheme of which every part promises delight, can never be successful; and general disappointment is only warded off by the defence of some little peculiar vexation.”
— Pride and Prejudice, by Jane Austen (1813)
Notes
[1]
Well if a production company can get away with Pride and Prejudice and Zombies, then I feel I am on reasonably solid ground here with this title.
I also seem to be riffing on JA rather a lot at present, I used Rationality and Reality as the title of one of the chapters in my [as yet unfinished] Mathematical book, Glimpses of Symmetry.
[3]
Most readers will immediately spot the obvious mistake here. Of course all three of these requirements should be mandatory.
[4]
To take just one example, gaining a PhD in a numerical science, a track record of highly-cited papers and also obtaining an MBA would take most people at least a few weeks of effort. Is it likely that such a person would next focus on a PRINCE2 or TOGAF qualification?
I find myself frequently being asked questions around terminology in Data and Analytics and so thought that I would try to define some of the more commonly used phrases and words. My first attempt to do this can be viewed in a new page added to this site (this also appears in the site menu):
I plan to keep this up-to-date as the field continues to evolve.
I hope that my efforts to explain some concepts in my main area of specialism are both of interest and utility to readers. Any suggestions for new entries or comments on existing ones are more than welcome.
I have some form when it comes to getting irritated by quasi-mathematical social media memes (see Facebook squares “puzzle” for example). Facebook, which I find myself using less and less frequently these days, has always been plagued by clickbait articles. Some of these can be rather unsavoury. One that does not have this particular issue, but which more than makes up for this in terms of general annoyance, is the many variants of:
Only a math[s] genius can solve [insert some dumb problem here] – can u?
Life is too short to complain about Facebook content, but this particular virus now seems to have infected LinkedIn (aka MicrosoftedIn) as well. Indeed as LinkedIn’s current “strategy” seems to be to ape what Facebook was doing a few years ago, perhaps this is not too surprising. Nevertheless, back in the day, LinkedIn used to be a reasonably serious site dedicated to networking and exchanging points of view with fellow professionals.
Those days appear to be fading fast, something I find sad. It seems that a number of people agree with me as – at the time of writing – over 9,000 people have viewed a LinkedIn article I briefly penned bemoaning this development. While some of the focus inevitably turned to general scorn being heaped on the new LinkedIn user experience (UX), it seemed that most people are of the same opinion as I am.
However, I suspect that there is little to be done and the folks at LinkedIn probably have their hands full trying to figure out how to address their UX catastrophe. Given this, I thought that if you can’t beat them, join them. So above appears my very own Mathematical meme, maybe it will catch on.
It should be noted that in this case “Less than 1% can do it!!!” is true, in the strictest sense. Unlike the original meme, so is the first piece of text!
Erratum:
After 100s of views on my blog, 1,000s of views on LinkedIn and 10,000s of views on Twitter, it took Neil Raden (@NeilRaden) to point out that in the original image I had the sum running from n=0 as opposed to n=1. The former makes no sense whatsoever. I guess his company is called Hired Brains for a reason! This was meant to be a humorous post, but at least part of the joke is now on me.
As readers will have noticed, my wife and I have spent a lot of time talking to medical practitioners in recent months. The same readers will also know that my wife is a Structural Biologist, whose work I have featured before in Data Visualisation – A Scientific Treatment [1]. Some of our previous medical interactions had led to me thinking about the nexus between medical science and statistics [2]. More recently, my wife had a discussion with a doctor which brought to mind some of her own previous scientific work. Her observations about the connections between these two areas have formed the genesis of this article. While the origins of this piece are in science and medicine, I think that the learnings have broader applicability.
So the general context is a medical test, the result of which was my wife being told that all was well [3]. Given that humans are complicated systems (to say the very least), my wife was less than convinced that just because reading X was OK it meant that everything else was also necessarily OK. She contrasted the approach of the physician with something from her own experience and in particular one of the experiments that formed part of her PhD thesis. I’m going to try to share the central point she was making with you without going into all of the scientific details [4]. However to do this I need to provide at least some high-level background.
Structural Biology is broadly the study of the structure of large biological molecules, which mostly means proteins and protein assemblies. What is important is not the chemical make up of these molecules (how many carbon, hydrogen, oxygen, nitrogen and other atoms they consist of), but how these atoms are arranged to create three dimensional structures. An example of this appears below:
This image is of a bacterial Ribosome. Ribosomes are miniature machines which assemble amino acids into proteins as part of the chain which converts information held in DNA into useful molecules [5]. Ribosomes are themselves made up of a number of different proteins as well as RNA.
In order to determine the structure of a given protein, it is necessary to first isolate it in sufficient quantity (i.e. to purify it) and then subject it to some form of analysis, for example X-ray crystallography, electron microscopy or a variety of other biophysical techniques. Depending on the analytical procedure adopted, further work may be required, such as growing crystals of the protein. Something that is generally very important in this process is to increase the stability of the protein that is being investigated [6]. The type of protein that my wife was studying [7] is particularly unstable as its natural home is as part of the wall of cells – removed from this supporting structure these types of proteins quickly degrade.
So one of my wife’s tasks was to better stabilise her target protein. This can be done in a number of ways [8] and I won’t get into the technicalities. After one such attempt, my wife looked to see whether her work had been successful. In her case the relative stability of her protein before and after modification is determined by a test called a Thermostability Assay.
In the image above, you can see the combined results of several such assays carried out on both the unmodified and modified protein. Results for the unmodified protein are shown as a green line [9] and those for the modified protein as a blue line [10]. The fact that the blue line (and more particularly the section which rapidly slopes down from the higher values to the lower ones) is to the right of the green one indicates that the modification has been successful in increasing thermostability.
So my wife had done a great job – right? Well things were not so simple as they might first seem. There are two different protocols relating to how to carry out this thermostability assay. These basically involve doing some of the required steps in a different order. So if the steps are A, B, C and D, then protocol #1 consists of A ↦ B ↦ C ↦ D and protocol #2 consists of A ↦ C ↦ B ↦ D. My wife was thorough enough to also use this second protocol with the results shown below:
Here we have the opposite finding: the same modification to the protein now seems to have decreased its stability. There are some good reasons why this type of discrepancy might have occurred [11], but overall my wife could not conclude that this attempt to increase stability had been successful. This sort of thing happens all the time and she moved on to the next idea. This is all part of the rather messy process of conducting science [12].
I’ll let my wife explain her perspective on these results in her own words:
In general you can’t explain everything about a complex biological system with one set of data or the results of one test. It will seldom be the whole picture. Protocol #1 for the thermostability assay was the gold standard in my lab before the results I obtained above. Now protocol #1 is used in combination with another type of assay whose efficacy I also explored. Together these give us an even better picture of stability. The gold standard shifted. However, not even this bipartite test tells you everything. In any complex system (be that Biological or a complicated dataset) there are always going to be unknowns. What I think is important is knowing what you can and can’t account for. In my experience in science, there is generally much much more that can’t be explained than can.
As ever, translating all of this to a business context is instructive. Conscientious Data Scientists or business-focussed Statisticians who come across something interesting in a model or analysis will always try (where feasible) to corroborate this by other means; they will try to perform a second “experiment” to verify their initial findings. They will also realise that even two supporting results obtained in different ways will not in general be 100% conclusive. However the highest levels of conscientiousness may be more honoured in the breach than in the observance [13]. Also there may not be an alternative “experiment” that can be easily run. Whatever the motivations or circumstances, it is not beyond the realm of possibility that some Data Science findings are true only in the same way that my wife thought she had successfully stabilised her protein before carrying out the second assay.
I would argue that business will often have much to learn from the levels of rigour customary in most scientific research [14]. It would be nice to think that the same rigour is always applied in commercial matters as academic ones. Unfortunately experience would tend to suggest the contrary is sometimes the case. However, it would also be beneficial if people working on statistical models in industry went out of their way to stress not only what phenomena these models can explain, but what they are unable to explain. Knowing what you don’t know is the first step towards further enlightenment.
Notes
[1]
Indeed this previous article had a sub-section titled Rigour and Scrutiny, echoing some of the themes in this piece.
[3]
As in the earlier article, apologies for the circumlocution. I’m both looking to preserve some privacy and save the reader from boredom.
[4]
Anyone interested in more information is welcome to read her thesis which is in any case in the public domain. It is 188 pages long, which is reasonably lengthy even by my standards.
[5]
They carry out translation which refers to synthesising proteins based on information carried by messenger RNA, mRNA.
[6]
Some proteins are naturally stable, but many are not and will not survive purification or later steps in their native state.
[8]
Chopping off flexible sections, adding other small proteins which act as scaffolding, getting antibodies or other biological molecules to bind to the protein and so on.
[9]
Actually a sigmoidal dose-response curve.
[10]
For anyone with colour perception problems, the green line has markers which are diamonds and the blue line has markers which are triangles.
[11]
As my wife writes [with my annotations]:
A possible explanation for this effect was that while T4L [the protein she added to try to increase stability – T4 Lysozyme] stabilised the binding pocket, the other domains of the receptor were destabilised. Another possibility was that the introduction of T4L caused an increase in the flexibility of CL3, thus destabilising the receptor. A method for determining whether this was happening would be to introduce rigid linkers at the AT1R-T4L junction [AT1R was the protein she was studying, angiotensin II type 1 receptor], or other placements of T4L. Finally AT1R might exist as a dimer and the addition of T4L might inhibit the formation of dimers, which could also destabilise the receptor.
Following the literature and the technology you would think there is universal agreement that more data means better models. […] However […] it’s always a good idea to step back and examine the premise. Is it universally true that our models will be more accurate if we use more data? As a data scientist you will want to question this assumption and not automatically reach for that brand new high-performance in-memory modeling array before examining some of these issues.
Bill goes on to make several pertinent points including: that if your data is bad, having more of it is not necessarily a solution; that attempting to create a gigantic and all-purpose model may well be inferior to multiple, more targeted models on smaller sub-sets of data; and that there exist specific instances where a smaller data set yields greater accuracy [2]. However I wanted to pick up directly on Bill’s point 6 of 7, in which he also references Larry Greenemeier (@lggreenemeier) of Scientific American.
6. Sometimes We Get Hypnotized By the Overwhelming Volume of the Data and Forget About Data Provenance and Good Project Design
A few months back I reviewed an article by Larry Greenemeier [3] about the failure of Google Flu Trend analysis to predict the timing and severity of flu outbreaks based on social media scraping. It was widely believed that this Big Data volume of data would accurately predict the incidence of flu but the study failed miserably missing timing and severity by a wide margin.
Says Greenemeier, “Big data hubris is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. The mistake of many big data projects, the researchers note, is that they are not based on technology designed to produce valid and reliable data amenable for scientific analysis. The data comes from sources such as smartphones, search results and social networks rather than carefully vetted participants and scientific instruments”.
Perhaps more pertinent to a business environment, Greenemeier’s article also states:
Context is often lacking when info is pulled from disparate sources, leading to questionable conclusions.
Neither of these authors is saying that having greater volumes of data is a definitively bad thing; indeed Vorhies states:
In general would I still prefer to have more data than less? Yes, of course.
They are however both pointing out that, in some instances, more traditional statistical methods, applied to smaller data sets, yield superior results. This is particularly the case where data are repurposed and the use to which they are put differs from the purpose for which they were originally collected; something which is arguably more likely to happen where general purpose Big Data sets are leveraged without reference to other information.
Also, when large data sets are collated from many places, the data from each place can have different characteristics. If this variation is not controlled for in models, it may well lead to erroneous findings.
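To make the point about uncontrolled source-to-source variation concrete, here is a small illustrative sketch of my own (not drawn from either article, and with invented numbers) of Simpson’s paradox: within each of two hypothetical sources the relationship between x and y is positive, yet naively pooling the data reverses the sign of the fitted slope.

```python
# A minimal sketch (illustrative numbers only) of how pooling data from two
# sources with different characteristics can reverse an apparent relationship
# if the source is not controlled for in the model.
import numpy as np

rng = np.random.default_rng(42)

# Source A: low x values but a high baseline for y.
x_a = rng.uniform(0, 5, 500)
y_a = 20 + 1.0 * x_a + rng.normal(0, 1, 500)

# Source B: high x values but a low baseline for y.
x_b = rng.uniform(10, 15, 500)
y_b = 2 + 1.0 * x_b + rng.normal(0, 1, 500)

def slope(x, y):
    """Ordinary least squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

print("Slope within source A:", slope(x_a, y_a))  # roughly +1
print("Slope within source B:", slope(x_b, y_b))  # roughly +1
print("Slope in pooled data: ", slope(np.concatenate([x_a, x_b]),
                                      np.concatenate([y_a, y_b])))  # negative
```

Including the source as a variable in the model (or fitting per-source models) recovers the genuine positive relationship; ignoring it yields exactly the sort of erroneous finding described above.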
Their final observation is that sound statistical methodology needs to be applied to big data sets just as much as more regular ones. The hope that design flaws will simply evaporate when data sets get large enough may be seductive, but it is also dangerously wrong.
Vorhies and Greenemeier are not suggesting that Big Data has no value. However they state that one of its most potent uses may well be as a supplement to existing methods, perhaps extending them, or bringing greater granularity to results. I view such introspection in Data Science circles as positive, likely to lead to improved methods and an indication of growing maturity in the field. It is however worth noting that, in some cases, leverage of Small-but-Well-Designed Data [4] is not only effective, but actually a superior approach. This is certainly something that Data Scientists should bear in mind.
Notes
[1]
I’d recommend taking a look at this site regularly. There is a high volume of articles and the quality is variable, but often there are some stand-out pieces.
This article is about the wisdom of the crowd [1], or more particularly its all too frequent foolishness. I am going to draw on a paper recently published in Nature by a cross-disciplinary team from the Massachusetts Institute of Technology and Princeton University. The authors are Dražen Prelec, H. Sebastian Seung and John McCoy. The paper’s title is A solution to the single-question crowd wisdom problem [2]. Rather than reinvent the wheel, here is a section from the abstract (with my emphasis):
Once considered provocative, the notion that the wisdom of the crowd is superior to any individual has become itself a piece of crowd wisdom, leading to speculation that online voting may soon put credentialed experts out of business. Recent applications include political and economic forecasting, evaluating nuclear safety, public policy, the quality of chemical probes, and possible responses to a restless volcano. Algorithms for extracting wisdom from the crowd are typically based on a democratic voting procedure. […] However, democratic methods have serious limitations. They are biased for shallow, lowest common denominator information, at the expense of novel or specialized knowledge that is not widely shared.
The Problems
The authors describe some compelling examples of where a crowd-based approach ignores the aforementioned specialised knowledge. I’ll cover a couple of these in a second, but let me first add my own.
Suppose we ask 1,000 people to come up with an estimate of the mass of a proton. One of these people happens to have won the Nobel Prize for Physics the previous year. Is the average of the estimates provided by the 1,000 people likely to be more accurate, or is the estimate of the one particularly qualified person going to be superior? There is an obvious answer to this question [3].
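To see why averaging does not rescue the crowd here, consider a small simulation of my own (the numbers are invented purely for illustration): 999 lay respondents guess anywhere across many orders of magnitude, while one expert answers almost exactly.

```python
# A hedged illustration (not from the paper): averaging wildly spread lay
# guesses of the proton mass versus taking the single expert's answer.
import numpy as np

rng = np.random.default_rng(0)
true_mass = 1.6726219e-27                        # kg

# 999 lay guesses, uniform on a log scale between 1e-30 and 1e-20 kg.
lay_guesses = 10 ** rng.uniform(-30, -20, 999)
expert_guess = 1.67e-27                          # the Nobel laureate's estimate

crowd_average = np.mean(np.append(lay_guesses, expert_guess))

print(f"Crowd average:   {crowd_average:.3e} kg")  # dominated by the largest guesses
print(f"Expert estimate: {expert_guess:.3e} kg")
print(f"True value:      {true_mass:.3e} kg")
```

The arithmetic mean is dominated by the largest guesses, so the crowd lands several orders of magnitude away from the true value, while the expert is essentially spot on.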
Lest it be thought that the above flaw in the wisdom of the crowd is confined to populations including a Nobel Laureate, I’ll reproduce a much more quotidian example from the Nature paper [4].
[..] imagine that you have no knowledge of US geography and are confronted with questions such as: Philadelphia is the capital of Pennsylvania, yes or no? And, Columbia is the capital of South Carolina, yes or no? You pose them to many people, hoping that majority opinion will be correct. [in an actual exercise the team carried out] this works for the Columbia question, but most people endorse the incorrect answer (yes) for the Philadelphia question. Most respondents may only recall that Philadelphia is a large, historically significant city in Pennsylvania, and conclude that it is the capital. The minority who vote no probably possess an additional piece of evidence, that the capital is Harrisburg. A large panel will surely include such individuals. The failure of majority opinion cannot be blamed on an uninformed panel or flawed reasoning, but represents a defect in the voting method itself.
I’m both a good and bad example here. I know the capital of Pennsylvania is Harrisburg because I have specialist knowledge [5]. However my acquaintance with South Carolina is close to zero. I’d therefore get the first question right and have a 50 / 50 chance on the second (all other things being equal of course). My assumption is that Columbia is, in general, much more well-known than Harrisburg for some reason.
The authors go on to cover the technique that is often used to try to address this type of problem in surveys. Respondents are also asked how confident they are about their answer. Thus a tentative “yes” carries less weight than a definitive “yes”. However, as the authors point out, such an approach only works if correct responses are strongly correlated with respondent confidence. As is all too evident from real life, people are often both wrong and very confident about their opinion [6]. The authors extended their Philadelphia / Columbia study to apply confidence weightings, but with no discernible improvement.
A Surprisingly Popular Solution
As well as identifying the problem, the authors suggest a solution and later go on to demonstrate its efficacy. Again quoting from the paper’s abstract:
Here we propose the following alternative to a democratic vote: select the answer that is more popular than people predict. We show that this principle yields the best answer under reasonable assumptions about voter behaviour, while the standard ‘most popular’ or ‘most confident’ principles fail under exactly those same assumptions.
Let’s use the examples of capitals of states again here (as the authors do in the paper). As well as asking respondents, “Philadelphia is the capital of Pennsylvania, yes or no?” you also ask them “What percentage of people in this survey will answer ‘yes’ to this question?” The key is then to compare the actual survey answers with the predicted survey answers.
As shown in the above exhibit, in the authors’ study, when people were asked whether or not Columbia is the capital of South Carolina, those who replied “yes” felt that the majority of respondents would agree with them. Those who replied “no” symmetrically felt that the majority of people would also reply “no”. So no surprises there. Both groups felt that the crowd would agree with their response.
However, in the case of whether or not Philadelphia is the capital of Pennsylvania there is a difference. While those who replied “yes” also felt that the majority of people would agree with them, amongst those who replied “no”, there was a belief that the majority of people surveyed would reply “yes”. This is a surprise. People who make the correct response to this question feel that the wisdom of the crowd will be incorrect.
In the Columbia example, what people predict will be the percentage of people replying “yes” tracks with the actual response rate. In the Philadelphia example, what people predict will be the percentage of people replying “yes” is significantly less than the actual proportion of people making this response [7]. Thus a response of “no” to “Philadelphia is the capital of Pennsylvania, yes or no?” is surprisingly popular. The methodology that the authors advocate would then lead to the surprisingly popular answer (i.e. “no”) actually being correct, as indeed it is. Because there is no surprisingly popular answer in the Columbia example, the result of a democratic vote stands, which is again correct.
To reiterate: a surprisingly popular response will overturn the democratic verdict; if there is no surprisingly popular response, the democratic verdict is unmodified.
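As I read the paper, the binary version of this rule can be captured in a few lines of code. The sketch below is my own paraphrase, with invented toy numbers for the Philadelphia and Columbia questions; it is not the authors’ implementation.

```python
# A minimal sketch of the "surprisingly popular" (SP) decision rule for a
# yes/no question: choose the answer that does better in the actual vote than
# respondents, on average, predicted it would.
def surprisingly_popular(votes_yes, predicted_yes_pct):
    """votes_yes: list of booleans (True if the respondent answered 'yes').
    predicted_yes_pct: each respondent's prediction (0-100) of the percentage
    of 'yes' answers in the survey."""
    actual_yes = 100.0 * sum(votes_yes) / len(votes_yes)
    predicted_yes = sum(predicted_yes_pct) / len(predicted_yes_pct)
    # If "yes" outperforms its predicted share it is surprisingly popular;
    # otherwise "no" is.
    return "yes" if actual_yes > predicted_yes else "no"

# Philadelphia question (toy numbers): 65% vote "yes", but even the "no" voters
# expect a "yes" majority, so "no" is surprisingly popular and wins.
print(surprisingly_popular([True] * 65 + [False] * 35,
                           [80] * 65 + [70] * 35))   # -> "no"

# Columbia question (toy numbers): predictions roughly track the actual vote,
# so SP agrees with the democratic verdict.
print(surprisingly_popular([True] * 70 + [False] * 30,
                           [72] * 70 + [65] * 30))   # -> "yes"
```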
As well as confirming the superiority of the surprisingly popular approach (as opposed to either weighted or non-weighted democratic votes) with questions about state capitals, the authors went on to apply their new technique in a range of other areas [8].
Study 1 used 50 US state capitals questions, repeating the format [described above] with different populations [9].
Study 2 employed 80 general knowledge questions.
Study 3 asked professional dermatologists to diagnose 80 skin lesion images as benign or malignant.
Study 4 presented 90 20th century artworks [see the images above] to laypeople and art professionals, and asked them to predict the correct market price category.
Taking all responses across the four studies into account [10], the central findings were as follows [11]:
We first test pairwise accuracies of four algorithms: majority vote, surprisingly popular (SP), confidence-weighted vote, and max. confidence, which selects the answer endorsed with highest average confidence.
Across all items, the SP algorithm reduced errors by 21.3% relative to simple majority vote (P < 0.0005 by two-sided matched-pair sign test).
Across the items on which confidence was measured, the reduction was:
35.8% relative to majority vote (P < 0.001),
24.2% relative to confidence-weighted vote (P = 0.0107) and
22.2% relative to max. confidence (P < 0.13).
The authors go on to further kick the tyres [12] on these results [13] without drawing any conclusions that deviate considerably from the ones they first present and which are reproduced above. The surprising finding is that the surprisingly popular algorithm significantly out-performs the algorithms normally used in wisdom of the crowd polling. This is a major result, in theory at least.
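For readers unfamiliar with the matched-pair sign test quoted in the results above, here is a hedged sketch of the general idea using made-up win/loss counts rather than the authors’ data: count the items on which one algorithm beat the other, drop the ties, and test the remaining signs against a fair coin.

```python
# Illustrative only: a two-sided matched-pair sign test on invented counts.
from scipy.stats import binomtest

sp_wins, sp_losses = 60, 30   # hypothetical item-level wins/losses (ties dropped)
result = binomtest(sp_wins, sp_wins + sp_losses, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")
```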
Some Thoughts
At the end of the abstract, the authors state that:
Like traditional voting, [the surprisingly popular algorithm] accepts unique problems, such as panel decisions about scientific or artistic merit, and legal or historical disputes. The potential application domain is thus broader than that covered by machine learning […].
Given the – justified – attention that has been given to machine learning in recent years, this is a particularly interesting claim. More broadly, SP seems to bring much needed nuance to the wisdom of the crowd. It recognises that the crowd may often be right, but also allows better informed minorities to override the crowd opinion in specific cases. It does this robustly in all of the studies that the authors conducted. It will be extremely interesting to see this novel algorithm deployed in anger, i.e. in a non-theoretical environment. If its undoubted promise is borne out – and the evidence to date suggests that it will be – then statisticians will have a new and powerful tool in their arsenal and a range of predictive activities will be improved.
The scope of applicability of the SP technique is as wide as that of any wisdom of the crowd approach and, to repeat the comments made by the authors in their abstract, has recently included:
[…] political and economic forecasting, evaluating nuclear safety, public policy, the quality of chemical probes, and possible responses to a restless volcano
If the authors’ initial findings are repeated in “live” situations, then the refinement to the purely democratic approach that SP brings should elevate an already useful approach to being an indispensable one in many areas.
I will let the authors have a penultimate word [14]:
Although democratic methods of opinion aggregation have been influential and productive, they have underestimated collective intelligence in one respect. People are not limited to stating their actual beliefs; they can also reason about beliefs that would arise under hypothetical scenarios. Such knowledge can be exploited to recover truth even when traditional voting methods fail. If respondents have enough evidence to establish the correct answer, then the surprisingly popular principle will yield that answer; more generally, it will produce the best answer in light of available evidence. These claims are theoretical and do not guarantee success in practice, as actual respondents will fall short of ideal. However, it would be hard to trust a method [such as majority vote or confidence-weighted vote] if it fails with ideal respondents on simple problems like [the Philadelphia one]. To our knowledge, the method proposed here is the only one that passes this test.
The ultimate thought I will present in this article is an entirely speculative one. The authors posit that their method could be applied to “potentially controversial topics, such as political and environmental forecasts”, while cautioning that manipulation should be guarded against. Their suggestion leads me to wonder what impact on the results of opinion polls a suitably formed surprisingly popular questionnaire would have had in the run up to both the recent UK European Union Referendum and the plebiscite for the US Presidency. Of course it is now impossible to tell, but maybe some polling organisations will begin to incorporate this new approach going forward. It can hardly make things worse.
Notes
[1]
According to Wikipedia, the phenomenon that:
A large group’s aggregated answers to questions involving quantity estimation, general world knowledge, and spatial reasoning has generally been found to be as good as, and often better than, the answer given by any of the individuals within the group.
The authors of the Nature paper question whether this is true in all circumstances.
[2]
Prelec, D., Seung, H.S., McCoy, J., (2017). A solution to the single-question crowd wisdom problem. Nature 541, 532–535.
You can view a full version of this paper care of Springer Nature SharedIt at the following link. SharedIt is Springer’s content-sharing initiative.
Direct access to the article on Nature’s site (here) requires a subscription to the journal.
[3]
This example is perhaps an interesting rejoinder to the increasing lack of faith in experts in the general population, something I covered in Toast.
Of course the answer is approximately: 1.6726219 × 10⁻²⁷ kg.
[4]
I have lightly edited this section but abjured the regular bracketed ellipses (more than one […] as opposed to conic sections as I note elsewhere). This is both for reasons of readability and also as I have not yet got to some points that the authors were making in this section. The original text is a click away.
[5]
My wife is from this state.
[6]
Indeed it sometimes seems that the more wrong the opinion, the more certain that people believe it to be right.
Here the reader is free to insert whatever political example fits best with their worldview.
[7]
Because many people replying “no” felt that a majority would disagree with them.
[8]
Again I have lightly edited this text.
[9]
To provide a bit of detail, here the team created a questionnaire with 50 separate question sets of the type:
{Most populous city in a state} is the capital of {state}: yes or no?
How confident are you in your answer (50 – 100%)?
What percentage of people surveyed will respond “yes” to this question? (1 – 100%)
This was completed by 83 people split between groups of undergraduate and graduate students at both MIT and Princeton. Again see the paper for further details.
[10]
And eliding some nuances such as some responses being binary (yes/no) and others a range (e.g. the dermatologists were asked to rate the chance of malignancy on a six point scale from “absolutely uncertain to absolutely certain”). Also respondents were asked to provide their confidence in some studies and not others.
[11]
Once more with some light editing.
[12]
This is a technical term employed in scientific circles and I apologise if my use of jargon confuses some readers.
[13]
Again please see the actual paper for details.
[14]
Modified very slightly by my last piece of editing.
This blog touches on a wide range of topics, including social media, cultural transformation, general technology and – last but not least – sporting analogies. However, its primary focus has always been on data and information-centric matters in a business context. Having said this, all but the most cursory of readers will have noted the prevalence of pieces with a Mathematical or Scientific bent. To some extent this is a simple reflection of the author’s interests and experience, but a stronger motivation is often to apply learnings from different fields to the business data arena. This article is probably more scientific in subject matter than most, but I will also look to highlight some points pertinent to commerce towards the end.
Introduction
The topic I want to turn my attention to in this article is public trust in science. This is a subject that has consumed many column inches in recent years. One particular area of focus has been climate science, which, for fairly obvious political reasons, has come in for even more attention than other scientific disciplines of late. It would be distracting to get into the arguments about climate change and humanity’s role in it here [1] and in a sense this is just the latest in a long line of controversies that have somehow become attached to science. An obvious second example here is the misinformation circling around both the efficacy and side effects of vaccinations [2]. In both of these cases, it seems that at least a sizeable minority of people are willing to query well-supported scientific findings. In some ways, this is perhaps linked to the general mistrust of “experts” and “elites” [3] that was explicitly to the fore in the UK’s European Union Referendum debate [4].
“People in this country have had enough of experts”
– Michael Gove [5], at this point UK Justice Secretary and one of the main proponents of the Leave campaign, speaking on Sky News, June 2016.
Mr Gove was talking about economists who held a different point of view to his own. However, his statement has wider resonance and cannot be simply dismissed as the misleading sound-bite of an experienced politician seeking to press his own case. It does indeed appear that in many places around the world experts are trusted much less than they used to be and that includes scientists.
“Many political upheavals of recent years, such as the rise of populist parties in Europe, Donald Trump’s nomination for the American presidency and Britain’s vote to leave the EU, have been attributed to a revolt against existing elites.”
– The Economist
A Brief [6] History of the Public Perception of Science
Note: This section is focussed on historical developments in the public’s trust in science. If the reader would like to skip on to more toast-centric content, then please click here.
Answering questions about the erosion of trust in politicians and the media is beyond the scope of this humble blog. Wondering what has happened to trust in science is firmly in its crosshairs. One part of the answer is that – for some time – scientists were held in too much esteem and the pendulum was inevitably going to swing back the other way. For a while the pace of scientific progress and the miracles of technology which this unleashed placed science on a pedestal from which there was only one direction of travel. During this period in which science was – in general – uncritically held in great regard, the messy reality of actual science was never really highlighted. The very phrase “scientific facts” is actually something of an oxymoron. What we have instead are scientific theories. Useful theories are consistent with existing observations and predict new phenomena. However – as I explained in Patterns patterns everywhere – a theory is only as good as the latest set of evidence and some cherished scientific theories have been shown to be inaccurate; either in general, or in some specific circumstances [7]. Yet saying “we have a good model that helps us explain many aspects of a phenomenon and predict more, but it doesn’t cover everything and there are some uncertainties” is a little more of a mouthful than “we have discovered that…”.
There have been some obvious landmarks along the way to science’s current predicament. The unprecedented destruction unleashed by the team working on the Manhattan Project at first made the scientists involved appear God-like. It also seemed to suggest that the path to Great Power status was through growing or acquiring the best Physicists. However, as the prolonged misery caused in Japan by the twin nuclear strikes became more apparent and as the Cold War led to generations living under the threat of mutually assured destruction, the standing attached by the general public to Physicists began to wane; the God-like mantle began to slip. While much of our modern world and its technology was created off the back of now fairly old theories like Quantum Chromodynamics and – most famously – Special and General Relativity, the actual science involved became less and less accessible to the man or woman in the street. For all the (entirely justified) furore about the detection of the Higgs Boson, few people would be able to explain much about what it is and how it fits into the Standard Model of particle physics.
In the area of medicine and pharmacology, the Thalidomide tragedy, where a drug prescribed to help pregnant women suffering from morning sickness instead led to terrible birth defects in more than 10,000 babies, may have led to more stringent clinical trials, but also punctured the air of certainty that had surrounded the development of the latest miracle drug. While medical science and related disciplines have vastly improved the health of much of the globe, the glacial progress in areas such as oncology has served as a reminder of the fallibility of some scientific endeavours. In a small way, the technical achievements of that apogee of engineering, NASA, were undermined by the loss of spacecraft and astronauts. Most notably the Challenger and Columbia fatalities served to further remove the glossy veneer that science had acquired in the 1940s to 1960s.
Lest it be thought at this point that I am decrying science, or even being anti-scientific, nothing could be further from the truth. I firmly believe that the ever growing body of scientific knowledge is one of humankind’s greatest achievements, if not its greatest. From our unpromising vantage point on an unremarkable little planet in our equally common-or-garden galaxy we have been able to grasp many of the essential truths about the whole Universe, from the incomprehensibly gigantic to the most infinitesimal constituent of a sub-atomic particle. However, it seems that many people do not fully embrace the grandeur of our achievements, or indeed in many cases the unexpected beauty and harmony that they have revealed [8]. It is to the task of understanding this viewpoint that I am addressing my thoughts.
More recently, the austerity that has enveloped much of the developed world since the 2008 Financial Crisis has had two reinforcing impacts on science in many countries. First, funding has often been cut, leading to pressure on research programmes and scientists increasingly having to make an economic case for their activities; a far cry from the 1950s. Second, income has been effectively stagnant for the vast majority of people, which means that scientific expenditure can seem something of a luxury and also fuels the anti-elite feelings cited by The Economist earlier in this article.
Into this seeming morass steps Anita Makri, “editor/writer/producer and former research scientist”. In a recent Nature article she argues that the form of science communicated in popular media leaves the public vulnerable to false certainty. I reproduce some of her comments here:
“Much of the science that the public knows about and admires imparts a sense of wonder and fun about the world, or answers big existential questions. It’s in the popularization of physics through the television programmes of physicist Brian Cox and in articles about new fossils and quirky animal behaviour on the websites of newspapers. It is sellable and familiar science: rooted in hypothesis testing, experiments and discovery.
Although this science has its place, it leaves the public […] with a different, outdated view to that of scientists of what constitutes science. People expect science to offer authoritative conclusions that correspond to the deterministic model. When there’s incomplete information, imperfect knowledge or changing advice — all part and parcel of science — its authority seems to be undermined. […] A popular conclusion of that shifting scientific ground is that experts don’t know what they’re talking about.”
After my speculations about the reasons why science is held in less esteem than once was the case, I’ll return to more prosaic matters; namely food and specifically that humble staple of many a breakfast table, toast. Food science has often fared no better than its brother disciplines. The scientific guidance issued to people wanting to eat healthily can sometimes seem to gyrate wildly. For many years fat was the source of all evil, more recently sugar has become public enemy number one. Red wine was meant to have beneficial effects on heart health, then it was meant to be injurious; I’m not quite sure what the current advice consists of. As Makri states above, when advice changes as dramatically as it can do in food science, people must begin to wonder whether the scientists really know anything at all.
So where does toast fit in? Well the governmental body charged with providing advice about food in the UK is called the Food Standards Agency. They describe their job as “using our expertise and influence so that people can trust that the food they buy and eat is safe and honest.” While the FSA do sterling work in areas such as publicly providing ratings of food hygiene for restaurants and the like, their most recent campaign is one which seems at best ill-advised and at worst another nail in the public perception of the reliability of scientific advice. Such things matter because they contribute to the way that people view science in general. If scientific advice about food is seen as unsound, surely there must be questions around scientific advice about climate change, or vaccinations.
Before I am accused of belittling the FSA’s efforts, let’s consider the campaign in question, which is called Go for Gold and encourages people to consume less acrylamide. Here is some of what the FSA has to say about the matter:
“Today, the Food Standards Agency (FSA) is launching a campaign to ‘Go for Gold’, helping people understand how to minimise exposure to a possible carcinogen called acrylamide when cooking at home.
Acrylamide is a chemical that is created when many foods, particularly starchy foods like potatoes and bread, are cooked for long periods at high temperatures, such as when baking, frying, grilling, toasting and roasting. The scientific consensus is that acrylamide has the potential to cause cancer in humans.
[…]
as a general rule of thumb, aim for a golden yellow colour or lighter when frying, baking, toasting or roasting starchy foods like potatoes, root vegetables and bread.”
The Go for Gold campaign was picked up by various media outlets in the UK. For example the BBC posted an article on its web-site which opened by saying:
“Bread, chips and potatoes should be cooked to a golden yellow colour, rather than brown, to reduce our intake of a chemical which could cause cancer, government food scientists are warning.”
The BBC has been obsessed with neutrality on all subjects recently [9], but in this case they did insert the reasonable counterpoint that:
“However, Cancer Research UK [10] said the link was not proven in humans.”
Acrylamide is certainly a nasty chemical. Amongst other things, it is used in polyacrylamide gel electrophoresis, a technique used in biochemistry. If biochemists mix and pour their own gels, they have to monitor their exposure and there are time-based and lifetime limits as to how often they can do such procedures [11]. Acrylamide has also been shown to lead to cancer in mice. So what could be more reasonable than the FSA’s advice?
Food Safety – A Statistical / Risk Based Approach
Earlier I introduced Anita Makri, it is time to meet our second protagonist, David Spiegelhalter, Winton Professor for the Public Understanding of Risk in the Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge [12]. Professor Spiegelhalter has penned a response to the FSA’s Go for Gold campaign. I feel that this merits reading in entirety, but here are some highlights:
“Very high doses [of Acrylamide] have been shown to increase the risk of mice getting cancer. The IARC (International Agency for Research on Cancer) considers it a ‘probable human carcinogen’, putting it in the same category as many chemicals, red meat, being a hairdresser and shift-work.
However, there is no good evidence of harm from humans consuming acrylamide in their diet: Cancer Research UK say that ‘At the moment, there is no strong evidence linking acrylamide and cancer.’
This is not for want of trying. A massive report from the European Food Safety Authority (EFSA) lists 16 studies and 36 publications, but concludes
‘In the epidemiological studies available to date, AA intake was not associated with an increased risk of most common cancers, including those of the GI or respiratory tract, breast, prostate and bladder. A few studies suggested an increased risk for renal cell, and endometrial (in particular in never-smokers) and ovarian cancer, but the evidence is limited and inconsistent. Moreover, one study suggested a lower survival in non-smoking women with breast cancer with a high pre-diagnostic exposure to AA but more studies are necessary to confirm this result. (p185)’
[…]
[Based on the EFSA study] adults with the highest consumption of acrylamide could consume 160 times as much and still only be at a level that toxicologists think unlikely to cause increased tumours in mice.
[…]
This all seems rather reassuring, and may explain why it’s been so difficult to observe any effect of acrylamide in diet.”
Indeed, Professor Spiegelhalter, an esteemed statistician, also points out that most studies will adopt the standard criteria for statistical significance. Given that such significance levels are often set at 5%, then this means that:
“[As] each study is testing an association with a long list of cancers […], we would expect 1 in 20 of these associations to be positive by chance alone.”
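The arithmetic behind that remark is worth spelling out. The back-of-the-envelope calculation below is my own illustration, assuming (purely for simplicity) 20 independent tests per study at the conventional 5% significance level.

```python
# If each study tests associations with, say, 20 cancer types at a 5%
# significance level, spurious "positive" findings are expected even when no
# real effect exists (assuming independent tests).
n_tests, alpha = 20, 0.05

expected_false_positives = n_tests * alpha        # = 1.0
p_at_least_one = 1 - (1 - alpha) ** n_tests       # ≈ 0.64

print(f"Expected false positives per study: {expected_false_positives:.1f}")
print(f"Chance of at least one false positive: {p_at_least_one:.0%}")
```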
He closes his article by stating – not unreasonably – that the FSA’s time and attention might be better spent on areas where causality between an agent and morbidity is well-established, for example obesity. My assumption is that the FSA has a limited budget and has to pick and choose what food issues to weigh in on. Even if we accept for the moment that there is some slight chance of a causal link between the consumption of low levels of acrylamide and cancer, there are plenty of other areas in which causality is firmly established; obesity as mentioned by Professor Spiegelhalter, excessive use of alcohol, even basic kitchen hygiene. It is hard to understand why the FSA did not put more effort into these and instead focussed on an area where the balance of scientific judgement is that there is unlikely to be an issue.
Having a mathematical background perhaps biases me, but I tend to side with Professor Spiegelhalter’s point of view. I don’t want to lay the entire blame for the poor view that some people have of science at the FSA’s door, but I don’t think campaigns like Go for Gold help very much either. The apocryphal rational man or woman will probably deduce that there is not an epidemic of acrylamide poisoning in progress. This means that they may question what the experts at the FSA are going on about. In turn this reduces respect for other – perhaps more urgent – warnings about food and drink. Such a reaction is also likely to colour how the same rational person thinks about “expert” advice in general. All of this can contribute to further cracks appearing in the public edifice of science, an outcome I find very unfortunate.
So what is to be done?
A Call for a New and More Honest Approach to Science Communications
As promised I’ll return to Anita Makri’s thoughts in the same article referenced above:
“It’s more difficult to talk about science that’s inconclusive, ambivalent, incremental and even political — it requires a shift in thinking and it does carry risks. If not communicated carefully, the idea that scientists sometimes ‘don’t know’ can open the door to those who want to contest evidence.
[…]
Scientists can influence what’s being presented by articulating how this kind of science works when they talk to journalists, or when they advise on policy and communication projects. It’s difficult to do, because it challenges the position of science as a singular guide to decision making, and because it involves owning up to not having all of the answers all the time while still maintaining a sense of authority. But done carefully, transparency will help more than harm. It will aid the restoration of trust, and clarify the role of science as a guide.”
The scientific method is meant to be about honesty. You record what you see, not what you want to see. If the data don’t support your hypothesis, you discard or amend your hypothesis. The peer-review process is meant to hold scientists to the highest levels of integrity. What Makri seems to be suggesting is for scientists to turn their lenses on themselves and how they communicate their work. Being honest where there is doubt may be scary, but not as scary as being caught out pushing certainty where no certainty is currently to be had.
Epilogue
At the beginning of this article, I promised that I would bring things back to a business context. With lots of people with PhDs in numerate sciences now plying their trade as data scientists and the like, there is an attempt to make commerce more scientific [13]. Understandably, the average member of a company will have less of an appreciation of statistics and statistical methods than their data scientists do. This can lead to data science seeming like magic; the philosopher’s stone [14]. There are obvious parallels here with how Physicists were seen in the period immediately after the Second World War.
Earlier in the text, I mused about what factors may have led to a deterioration in how the public views science and scientists. I think that there is much to be learnt from the issues I have covered in this article. If data scientists begin to try to peddle absolute truth and perfect insight (both of which, it is fair to add, are often expected from them by non-experts), as opposed to ranges of outcomes and probabilities, then the same decline in reputation probably awaits them. Instead it would be better if data scientists heeded Anita Makri’s words and tried to always be honest about what they don’t know as well as what they do.
Notes
[1]
Save to note that there really is no argument in scientific circles.
As ever Randall Munroe makes the point pithily in his Earth Temperature Timeline – https://xkcd.com/1732/.
For a primer on the area, you could do worse than watching The Royal Society‘s video:
[2]
For the record, my daughter has had every vaccine known to the UK and US health systems and I’ve had a bunch of them recently as well.
[3]
Most scientists I know would be astonished that they are considered part of the amorphous, ill-defined and obviously malevolent global “elite”. Then again, “elite” is just one more proxy for “the other”, something which it is not popular to be in various places in the world at present.
[4]
Or what passed for debate in these post-truth times.
[5]
Mr Gove studied English at Lady Margaret Hall, Oxford, where he was also President of the Oxford Union. Clearly Oxford produces fewer experts than it used to in previous eras.
[6]
One that is also probably wildly inaccurate and certainly incomplete.
[7]
So Newton’s celebrated theory of gravitation is “wrong” but actually works perfectly well in most circumstances. The Rutherford–Bohr model, where atoms are little Solar Systems with the nucleus circled by electrons much as the planets circle the Sun, is “wrong”, but actually does serve to explain a number of things; if sadly not the orbital angular momentum of electrons.
[8]
Someone should really write a book about that – watch this space!
[9]
Not least in the aforementioned EU Referendum where it felt the need to follow the views of the vast majority of economists with those of the tiny minority, implying that the same weight be attached to both points of view. For example, 99.9999% of people believe the world to be round, but in the interests of balance my mate Jim reckons it is flat.
[10]
According to their web-site: “the world’s leading charity dedicated to beating cancer through research”.
[11]
As attested to personally by the only proper scientist in our family.
[12]
Unlike Oxford (according to Mr Gove anyway), Cambridge clearly still aspires to creating experts.
[13]
By this I mean proper science and not pseudo-science like management theory and the like.
[14]
In the original, non-J.K. Rowling sense of the phrase.
The Periodic Table is one of the truly iconic scientific images [1], albeit one with a variety of forms. In the picture above, the normal Periodic Table has been repurposed in a novel manner to illuminate a different field of scientific enquiry. This version was created by Professor Jennifer Johnson (@jajohnson51) of The Ohio State University and the Sloan Digital Sky Survey (SDSS). It comes from an article on the SDSS blog entitled Origin of the Elements in the Solar System; I’d recommend reading the original post.
The historical perspective
A modern rendering of the Periodic Table appears above. It probably is superfluous to mention, but the Periodic Table is a visualisation of an underlying principle about elements; that they fall into families with similar properties and that – if appropriately arranged – patterns emerge with family members appearing at regular intervals. Thus the Alkali Metals [2], all of which share many important characteristics, form a column on the left-hand extremity of the above Table; the Noble Gases [3] form a column on the far right; and, in between, other families form further columns.
Given that the underlying principle driving the organisation of the Periodic Table is essentially a numeric one, we can readily see that it is not just a visualisation, but a data visualisation. This means that Professor Johnson and her colleagues are using an existing data visualisation to convey new information, a valuable technique to have in your arsenal.
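To make this a little more tangible, here is a minimal sketch (entirely my own, not drawn from the SDSS article) of how those regular intervals can be read off from atomic numbers alone:

```python
# A purely illustrative sketch of the "regular intervals" idea: membership of
# two of the families discussed in the text can be read off from the atomic
# number alone.

NOBLE_GASES = {2: "Helium", 10: "Neon", 18: "Argon",
               36: "Krypton", 54: "Xenon", 86: "Radon"}
ALKALI_METALS = {3: "Lithium", 11: "Sodium", 19: "Potassium",
                 37: "Rubidium", 55: "Caesium", 87: "Francium"}

def family(atomic_number: int) -> str:
    """Classify an element into one of the two columns mentioned above."""
    if atomic_number in NOBLE_GASES:
        return "Noble Gas (far right column)"
    if atomic_number in ALKALI_METALS:
        return "Alkali Metal (far left column)"
    return "Another family"

# Each Alkali Metal sits one place after a Noble Gas closes a row -
# the periodicity that the Table visualises.
for z in sorted(NOBLE_GASES):
    print(f"{NOBLE_GASES[z]} ({z}) is followed by {ALKALI_METALS[z + 1]} ({z + 1})")
```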
One of the original forms of the Periodic Table appears above, alongside its inventor, Dmitri Mendeleev.
As with most things in science [4], my beguilingly straightforward formulation of “its inventor” is rather less clear-cut in practice. Mendeleev’s work – like Newton’s before him – rested “on the shoulders of giants” [5]. However, as with many areas of scientific endeavour, the chain of contributions winds its way back a long way and specifically to one of the greatest exponents of the scientific method [6], Antoine Lavoisier. The later Law of Triads [7] was another significant step along the path and – to mix a metaphor – many other scientists provided pieces of the jigsaw puzzle that Mendeleev finally assembled. Indeed around the same time as Mendeleev published his ideas [8], so did the much less celebrated Julius Meyer; Meyer and Mendeleev’s work shared several characteristics.
The epithet of inventor attached itself to Mendeleev for two main reasons: his leaving of gaps in his table, pointing the way to as yet undiscovered elements; and his ordering of table entries according to family behaviour rather than atomic mass [9]. None of this is to take away from Mendeleev’s seminal work; it is wholly appropriate that his name will always be linked with his most famous insight. Instead, my intention is to demonstrate that the course of true science never did run smooth [10].
The Johnson perspective
Since its creation – and during its many reformulations – the Periodic Table has acted as a pointer for many areas of scientific enquiry. Why do elements fall into families in this way? How many elements are there? Is it possible to achieve the Alchemists’ dream and transmute one element into another? However, the question which Professor Johnson’s diagram addresses is another one, Why is there such an abundance of elements and where did they all come from?
The term nucleosynthesis that appears in the title of this article covers processes by which different atoms are formed from either base nucleons (protons and neutrons) or the combination of smaller atoms. It is nucleosynthesis which attempts to answer the question we are now considering. There are different types.
Our current perspective on where everything in the observable Universe came from is of course the Big Bang [11]. This rather tidily accounts for the abundance of element 1, Hydrogen, and much of that of element 2, Helium. This is our first type of nucleosynthesis, Big Bang nucleosynthesis. However, it does not explain where all of the heavier elements came from [12]. The first part of the answer is from processes of nuclear fusion in stars. The most prevalent form of this is the fusion of Hydrogen to form Helium (accounting for the remaining Helium atoms), but this process continues creating heavier elements, albeit in ever decreasing quantities. This is stellar nucleosynthesis and refers to those elements created in stars during their normal lives.
While readers may be ready to accept the creation of these heavier elements in stars, an obvious question is How come they aren’t in stars any longer? The answer lies in what happens at the end of the life of a star. This is something that depends on a number of factors, but particularly its mass and also whether or not it is associated with another star, e.g. in a binary system.
Broadly speaking, higher mass stars tend to go out with a bang [13], lower mass ones with various kinds of whimpers. The exception to the latter is where the low mass star is coupled to another star, an arrangement which can also lead to a considerable explosion [14]. Of whatever type, violent or passive, star deaths create all of the rest of the heavier elements. Supernovae are also responsible for releasing many heavy elements into interstellar space, and this process is tagged explosive nucleosynthesis.
Into this relatively tidy model of nucleosynthesis intrudes the phenomenon of cosmic ray fission, by which cosmic rays [15] impact on heavier elements causing them to split into smaller constituents. We believe that this process is behind most of the Beryllium and Boron in the Universe as well as some of the Lithium. There are obviously other mechanisms at work like radioactive decay, but the vast majority of elements are created either in stars or during the death of stars.
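The broad-brush picture above can be jotted down as a simple lookup; this is a deliberate over-simplification on my part, as Professor Johnson’s diagram splits many elements’ abundances across several processes:

```python
# A deliberately simplified lookup of the broad-brush assignments described
# above; the real picture apportions many elements across several processes.

DOMINANT_SOURCE = {
    "Hydrogen":  "Big Bang nucleosynthesis",
    "Helium":    "Big Bang nucleosynthesis plus stellar fusion of Hydrogen",
    "Lithium":   "Big Bang nucleosynthesis plus cosmic ray fission (in part)",
    "Beryllium": "cosmic ray fission",
    "Boron":     "cosmic ray fission",
}

def source_of(element: str) -> str:
    # Everything heavier is, broadly, down to stellar nucleosynthesis during
    # a star's life or explosive nucleosynthesis at its death.
    return DOMINANT_SOURCE.get(
        element, "stellar and/or explosive nucleosynthesis")

print(source_of("Boron"))   # cosmic ray fission
print(source_of("Iron"))    # stellar and/or explosive nucleosynthesis
```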
I have elided many of the details of nucleosynthesis here; it is a complicated and evolving field. What Professor Johnson’s graphic achieves is to reflect current academic thinking around which elements are produced by which type of process. The diagram certainly highlights the fact that the genesis of the elements is a complex story. Perhaps less prosaically, it also encapsulates Carl Sagan‘s famous aphorism, the one that Professor Johnson quotes at the beginning of her article and which I will use to close mine.
Notes
[2]
Lithium, Sodium, Potassium, Rubidium, Caesium and Francium (Hydrogen is sometimes shown as topping this list as well).
[3]
Helium, Neon, Argon, Krypton, Xenon and Radon.
[4]
Watch this space for an article pertinent to this very subject.
[5]
Isaac Newton, in a letter to Robert Hooke dated 15th February 1676; but employing a turn of phrase which had been in use for many years.
[6]
And certainly the greatest scientist ever to be beheaded.
[7]
Döbereiner, J. W. (1829) “An Attempt to Group Elementary Substances according to Their Analogies”. Annalen der Physik und Chemie.
[8]
In truth somewhat earlier.
[9]
The emergence of atomic number as the organising principle behind the ordering of elements happened somewhat later, vindicating Mendeleev’s approach.
We have:
atomic mass ≅ number of protons in the nucleus of an element + number of neutrons
whereas:
atomic number = number of protons only
The number of neutrons can jump about between successive elements, meaning that arranging them in order of atomic mass can give a different result from arranging them in order of atomic number.
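A quick illustration of this point (the approximate atomic weights are figures I have supplied, not part of the original article) shows how the two orderings diverge for a few neighbouring pairs:

```python
# Illustrative only: approximate standard atomic weights supplied for the
# purpose. Sorting by atomic mass swaps some pairs that atomic number (and
# chemical behaviour) keeps in the familiar order.

elements = [
    # (name, atomic number, approximate atomic mass)
    ("Argon",     18, 39.95),
    ("Potassium", 19, 39.10),
    ("Cobalt",    27, 58.93),
    ("Nickel",    28, 58.69),
    ("Tellurium", 52, 127.60),
    ("Iodine",    53, 126.90),
]

by_number = [name for name, z, _ in sorted(elements, key=lambda e: e[1])]
by_mass   = [name for name, _, m in sorted(elements, key=lambda e: e[2])]

print(by_number)  # Argon, Potassium, Cobalt, Nickel, Tellurium, Iodine
print(by_mass)    # Potassium, Argon, Nickel, Cobalt, Iodine, Tellurium
```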
[10]
With apologies to The Bard.
[11]
I really can’t conceive that anyone who has read this far needs the Big Bang further expounded to them, but if so, then GIYF.
[12]
We think that the Big Bang also created some quantities of Lithium and several other heavier elements, as covered in Professor Johnson’s diagram.
[14]
Type-Ia supernovae are a phenomenon that allows us to accurately measure the size of the Universe and how this is changing.
[15]
Cosmic rays are very high energy particles that originate from outside of the Solar System and consist mostly of very fast moving protons (aka Hydrogen nuclei) and other atomic nuclei similarly stripped of their electrons.
The above image is part of a much bigger infographic produced by The Royal Society about machine learning. You can view the whole image here.
I felt that this component was interesting in a stand-alone capacity.
The legend explains that a petabyte (Pb) is equal to a million gigabytes (Gb) [1], or 1 Pb = 10⁶ Gb. A gigabyte itself is a billion bytes, or 1 Gb = 10⁹ bytes. Recalling how we multiply indices, we can see that 1 Pb = 10⁶ × 10⁹ bytes = 10⁶⁺⁹ bytes = 10¹⁵ bytes. 10¹⁵ also has a name; it’s called a quadrillion. Written out long hand:
1 quadrillion = 1,000,000,000,000,000
The estimate of the amount of data held by Google is fifteen thousand petabytes, let’s write that out long hand as well:
15,000 Pb = 15,000,000,000,000,000,000 bytes
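For anyone who would like to check the index arithmetic, a few lines of Python (my own sketch, sticking with the decimal definitions above) reproduce both figures:

```python
# Sticking with the decimal definitions used above (powers of 10, not of 2).
GIGABYTE = 10**9                      # bytes
PETABYTE = 10**6 * GIGABYTE           # = 10**15 bytes, i.e. a quadrillion
GOOGLE_ESTIMATE = 15_000 * PETABYTE   # The Royal Society's figure for Google

print(f"{PETABYTE:,}")                # 1,000,000,000,000,000
print(f"{GOOGLE_ESTIMATE:,}")         # 15,000,000,000,000,000,000
```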
That’s a lot of zeros. As is traditional with big numbers, let’s try to put this in context.
The average size of a photo on an iPhone 7 is about 3.5 megabytes (1 Mb = 1,000,000 bytes), so Google could store about 4.3 trillion of such photos.
Stepping it up a bit, the average size of a high quality photo stored in CR2 format from a Canon EOS 5D Mark IV is ten times bigger at 35 Mb, so Google could store a mere 430 billion of these.
A high definition (1080p) movie is on average around 6 Gb, so Google could store the equivalent of 2.5 billion movies.
If Google employees felt that this resolution wasn’t doing it for them, they could upgrade to 150 million 4K movies at around 100 Gb each.
If instead they felt like reading, they could hold the equivalent of The Library of Congress print collections a mere 75 thousand times over [2].
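For completeness, the storage comparisons above can be reproduced from the same assumed file sizes:

```python
# Rough sanity checks on the comparisons above, using the same assumed sizes.
MB, GB, TB = 10**6, 10**9, 10**12
GOOGLE = 15_000 * 10**15               # bytes

print(GOOGLE // int(3.5 * MB))         # ~4.3 trillion iPhone photos
print(GOOGLE // (35 * MB))             # ~430 billion CR2 photos
print(GOOGLE // (6 * GB))              # 2.5 billion HD movies
print(GOOGLE // (100 * GB))            # 150 million 4K movies
print(GOOGLE // (200 * TB))            # 75,000 Libraries of Congress
```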
Rather than talking about bytes, let’s talk about metres: 15,000 petametres is equivalent to about 1,600 light years, and at this distance from us we find Messier Object 47 (M47), a star cluster which was first described an impressively long time ago, in 1654.
If instead we consider 15,000 peta-miles, then this is around 2.5 million light years, which gets us all the way to our nearest large galactic neighbour, the Andromeda Galaxy [3].
The fastest that humankind has got anything bigger than a handful of sub-atomic particles to travel is the 17 kilometres per second (11 miles per second) at which Voyager 1 is currently speeding away from the Sun. At this speed, it would take the probe about 43 billion years to cover the 15,000 peta-miles to Andromeda. This is over three times longer than our best estimate of the current age of the Universe.
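The astronomical comparisons work the same way; the conversion factors below are round figures that I have plugged in for illustration:

```python
# Round-figure conversion factors, plugged in for illustration.
LIGHT_YEAR_IN_METRES = 9.46e15
METRES_PER_MILE = 1_609.344
SECONDS_PER_YEAR = 3.156e7
PETA = 1e15

# 15,000 petametres expressed in light years (the distance to M47)
print(15_000 * PETA / LIGHT_YEAR_IN_METRES)                    # ~1,600

# 15,000 peta-miles expressed in light years (the distance to Andromeda)
print(15_000 * PETA * METRES_PER_MILE / LIGHT_YEAR_IN_METRES)  # ~2.5 million

# Voyager 1 at roughly 11 miles per second, as quoted above
years = (15_000 * PETA / 11) / SECONDS_PER_YEAR
print(years / 1e9)                                             # ~43 billion years
```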
Finally, a more concrete example. If we consider a small cube, made of, well, concrete, and with dimensions of 1 cm in each direction, how big would a stack of 15,000 quadrillion of them be? Well, if arranged into a cube, each of the sides would be just under 25 km (15 and a bit miles) long. That’s a pretty big cube.
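And the side of that cube is just a cube root away:

```python
# Side length of a cube built from 15,000 quadrillion 1 cm cubes.
volume_cm3 = 15_000 * 10**15
side_in_cm = volume_cm3 ** (1 / 3)
print(side_in_cm / 100_000)      # ~24.7 km per side, i.e. just under 25 km
```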
If the base was placed in the vicinity of New York City, it would comfortably cover Manhattan, plus quite a bit of Brooklyn and The Bronx, plus most of Jersey City. It would extend up to Hackensack in the North West and almost reach JFK in the South East. The top of the cube would plough through the Troposphere and get half way through the Stratosphere before topping out. It would vie with Mars’s Olympus Mons for the title of highest planetary structure in the Solar System [4].
It is probably safe to say that 15,000 Pb is an astronomical figure.
Google played a central role in the initial creation of the collection of technologies that we now use the term Big Data to describe. The image at the beginning of this article perhaps explains why this was the case (and indeed why they continue to be at the forefront of developing newer and better ways of dealing with large data sets).
As a point of order, when people start talking about “big data”, it is worth recalling just how big “big data” really is.
Notes
[1]
In line with The Royal Society, I’m going to ignore the fact that these definitions were originally all in powers of 2 not 10.
[2]
The size of The Library of Congress print collections seems to have become irretrievably connected with the figure 10 terabytes (10 × 10¹² bytes) for some reason. No one knows precisely, but 200 Tb seems to be a more reasonable approximation.
[3]
Applying the unimpeachable logic of eminent pseudoscientist and numerologist Erich von Däniken, what might be passed over as a mere coincidence by lesser minds, instead presents incontrovertible proof that Google’s PageRank algorithm was produced with the assistance of extraterrestrial life; which, if you think about it, explains quite a lot.
[4]
Though I suspect not for long, unless we chose some material other than concrete. Then again, I’m not a materials scientist, so what do I know?