The pace of change in the field of database technology seems to be constantly accelerating. No doubt in five year’s time , Big Data and the Hadoop suite  will seem to be as old-fashioned as earlier technologies can appear to some people nowadays. Today there is a great variety of database technologies that are in use in different organisations for different purposes. There are also a lot of vendors, some of whom have more than one type of database product. I think that it is worthwhile considering both the genesis of databases and some of the major developments that have occurred between then and now.
The infographic appearing at the start of this article seeks to provide just such a perspective. It presents an abridged and simplified perspective on the history of databases from the 1960s to the late 2010s. It is hard to make out the text in the above diagram, so I would recommend that readers click on the link provided in order to view a much larger version with bigger and more legible text.
The infographic references a number of terms. Below I provide links to definitions of several of these, which are taken from The Data and Analytics Dictionary. The list progresses from the top of the diagram downwards, but starts with a definition of “database” itself:
To my mind, it is interesting to see just how long we have been grappling with the best way to set up databases. Also of note is that some of the Big Data technologies are actually relatively venerable, dating to the mid-to-late 2000s (some elements are even older, consisting of techniques for handling flat files on UNIX or Mainframe computers back in the day).
I hope that both the infographic and the definitions provided above contribute to the understanding of the history of databases and also that they help to elucidate the different types of database that are available to organisations today.
The following people’s input is acknowledged on the document itself, but my thanks are also repeated here:
The first half of my planned thoughts on Hurricanes and Data Visualisation, Rainbow’s Gravity and was published earlier back in September. Part two, Map Reading, joined it this month. In between, the first hurricane-centric article acquired an addendum, The Mona Lisa. With this post, the same has happened to the second article. Apparently you can’t keep a good hurricane story down.
One of our Hurricanes is missing
When I started writing about Hurricanes back in September of this year, it was in the aftermath of Harvey and Irma, both of which were safely far away from my native United Kingdom. Little did I think that in closing this mini-series Hurricane Ophelia (or at least the remnants of it) would be heading for these shores; I hope this is coincidence and not karma for me criticising the US National Weather Service’s diagrams!
As we batten down here, an odd occurrence was brought to my attention by Bill McKibben (@billmckibben), someone I connected with while working on this set of articles. Here is what he tweeted:
I am sure that inhabitants of both the Shetland Islands and the East Midlands will be breathing sighs of relief!
Clearly both the northward and eastward extent of Ophelia was outside of the scope of either the underlying model or the mapping software. A useful reminder to data professionals to ensure we set the boundaries of both modelling and visualisation work appropriately.
As an aside, this image is another for the Hall of Infamy, relying as it does on the less than helpful rainbow palette we critiqued all the way back in the first article.
I’ll hope to be writing again soon – hurricanes allowing!
This first article is not a critique of Thomas Pynchon‘s celebrated work, instead it refers to a grave malady that can afflict otherwise health data visualisations; the use and abuse of rainbow colours. This is an area that some data visualisation professionals can get somewhat hot under the collar about; there is even a Twitter hashtag devoted to opposing this colour choice, #endtherainbow.
The [mal-] practice has come under additional scrutiny in recent weeks due to the major meteorological events causing so much damage and even loss of life in the Caribbean and southern US; hurricanes Harvey and Irma. Of course the most salient point about these two megastorms is their destructive capability. However the observations that data visualisers make about how information about hurricanes is conveyed do carry some weight in two areas; how the public perceives these phenomena and how they perceive scientific findings in general . The issues at stake are ones of both clarity and inclusiveness. Some of these people felt that salt was rubbed in the wound when the US National Weather Service, avid users of rainbows , had to add another colour to their normal palette for Harvey:
In 2015, five scientists collectively wrote a letter to Nature entitled “Scrap rainbow colour scales” . In this they state:
It is time to clamp down on the use of misleading rainbow colour scales that are increasingly pervading the literature and the media. Accurate graphics are key to clear communication of scientific results to other researchers and the public — an issue that is becoming ever more important.
At this point I have to admit to using rainbow colour schemes myself professionally and personally ; it is often the path of least resistance. I do however think that the #endtherainbow advocates have a point, one that I will try to illustrate below.
Many Marvellous Maps
Let’s start by introducing the idyllic coastal county of Thomasshire, a map of which appears below:
Of course this is a cartoon map, it might be more typical to start with an actual map from Google Maps or some other provider , but this doesn’t matter to the argument we will construct here. Let’s suppose that – rather than anything as potentially catastrophic as a hurricane – the challenge is simply to record the rainfall due to a nasty storm that passed through this shire . Based on readings from various weather stations (augmented perhaps by information drawn from radar), rainfall data would be captured and used to build up a rain contour map, much like the elevation contour maps that many people will recall from Geography lessons at school .
If we were to adopt a rainbow colour scheme, then such a map might look something like the one shown below:
Here all areas coloured purple will have received between 0 and 10 cm of rain, blue between 10 and 20 cm of rain and so on.
At this point I apologise to any readers who suffer from migraine. An obvious drawback of this approach is how garish it is. Also the solid colours block out details of the underlying map. Well something can be done about both of these issues by making the contour colours transparent. This both tones them down and allows map details to remain at least semi-visible. This gets us a new map:
Here we get into the core of the argument about the suitability of a rainbow palette. Again quoting from the Nature letter:
[…] spectral-type colour palettes can introduce false perceptual thresholds in the data (or hide genuine ones); they may also mask fine detail in the data. These palettes have no unique perceptual ordering, so they can de-emphasize data extremes by placing the most prominent colour near the middle of the scale.
Journals should not tolerate poor visual communication, particularly because better alternatives to rainbow scales are readily available (see NASA Earth Observatory).
In our map, what we are looking to do is to show increasing severity of the deluge as we pass from purple (indigo / violet) up to red. But the ROYGBIV  colours of the spectrum are ill-suited to this. Our eyes react differently to different colours and will not immediately infer the gradient in rainfall that the image is aiming to convey. The NASA article the authors cite above uses a picture to paint a thousand words:
Another salient point is that a relatively high proportion of people suffer from one or other of the various forms of colour blindness . Even the most tastefully pastel rainbow chart will disadvantage such people seeking to derive meaning from it.
Getting Over the Rainbow
So what could be another approach? Well one idea is to show gradients of whatever the diagram is tracking using gradients of colour; this is the essence of the NASA recommendation. I have attempted to do just this in the next map.
I chose a bluey-green tone both as it was to hand in the Visio palette I was using and also to avoid confusion with the blue sea (more on this later). Rather than different colours, the idea is to map intensity of rainfall to intensity of colour. This should address both colour-blindness issues and the problems mentioned above with discriminating between ROYGBIV colours. I hope that readers will agree that it is easier to grasp what is happening at a glance when looking at this chart than in the ones that preceded it.
However, from a design point of view, there is still one issue here; the sea. There are too many bluey colours here for my taste, so let’s remove the sea colouration to get:
Some purists might suggest also turning the land white (or maybe a shade of grey), others would mention that the grid-lines add little value (especially as they are not numbered). Both would probably have a point, however I think that use can also push minimalism too far. I am pretty happy that our final map delivers the information it is intended to convey much more accurately and more immediately than any of its predecessors.
Comparing the first two rainbow maps to this last one, it is perhaps easy to see why so many people engaged in the design of data visualisations want to see an end to ROYGBIV palettes. In the saying, there is a pot of gold at the end of the rainbow, but of course this can never be reached. I strongly suspect that, despite the efforts of the #endtherainbow crowd, an end to the usage of this particular palette will be equally out of reach. However I hope that this article is something that readers will bear in mind when next deciding on how best to colour their business graph, diagram or data visualisation. I am certainly going to try to modify my approach as well.
The story of hurricanes and data visualisation will continue in Part II – Map Reading, which is currently forthcoming.
For some more thoughts on the public perception of science, see Toast.
I guess it’s appropriate from at least one point of view.
My objective in this brief article is to compare how much time and effort is spent on certain information-related activities in an organisation that has adopted best practice, compared to what is typical in all too many organisations. For the avoidance of doubt, when I say people here I am focussing on staff who would ostensibly be the consumers of information, not data professionals who are engaged in providing such information. What follows relates to the end users of information.
What I have done at the top of the above exhibit (labelled “Activity” on the left-hand side) is to lay out the different types of information-related work that end users engage in, splitting these into low, medium and high valued-added components as we scan across the page.
Low value is number crunching and prettifying exhibits for publication
Medium value is analysis and interpretation of information
High value is taking action based on insights and then monitoring to check whether the desired outcome has been achieved
In the centre of the diagram (labelled “Ideal Time Allocation”), I have shown what I believe is a best practice allocation of time to these activities. It is worth pointing out that that I am recommending that significant time (60%) is spent on analysis and interpretation; while tagged as of medium-value, this type of work is a prerequisite for the higher value activities, you cannot really avoid it. Despite this, there is a still 30% of time devoted to the high-value activities of action and monitoring of results. The remaining 10% is expended on low-value activities.
At the bottom of the chart (labelled “Actual Time Allocation”), I have tried to estimate how people’s time is actually spent in organisations where insufficient attention has been paid to the information landscape; a large number of organisations fit into this category in my experience. I am not trying to be 100% precise here, but I believe that the figures are representative of what I have seen in several organisations. In fact I think that the estimated amount of time spent on low value activities is probably greater than 70% in many cases; however I don’t want to be accused of exaggeration.
Clearly a lack of robust, reliable and readily available information can mean that highly skilled staff spend their time generating information rather than analysing and interpreting it and then using such insights as the basis for action. This results in the bulk of their work being low valued-added. The medium and high value activities are squeezed out as there are only so many hours in the day.
It is obvious that such a state of affairs is sub-optimal and needs to be addressed. My experience of using diagrams like the one shown here is that they can be very valuable in explaining what is wrong with current information arrangements and highlighting the need for change.
An interesting exercise is to estimate what the bottom of the diagram would look like for your organisation. Are you close to best practice, or some way from this and in need of urgent change?
The above image is part of a much bigger infographic produced by The Royal Society about machine learning. You can view the whole image here.
I felt that this component was interesting in a stand-alone capacity.
The legend explains that a petabyte (Pb) is equal to a million gigabytes (Gb) , or 1 Pb = 106 Gb. A gigabyte itself is a billion bytes, or 1 Gb = 109 bytes. Recalling how we multiply indeces we can see that 1 Pb = 106 × 109 bytes = 106 + 9 bytes = 1015 bytes. 1015 also has a name, it’s called a quadrillion. Written out long hand:
1 quadrillion = 1,000,000,000,000,000
The estimate of the amount of data held by Google is fifteen thousand petabytes, let’s write that out long hand as well:
15,000 Pb = 15,000,000,000,000,000,000 bytes
That’s a lot of zeros. As is traditional with big numbers, let’s try to put this in context.
The average size of a photo on an iPhone 7 is about 3.5 megabytes (1 Mb = 1,000,000 bytes), so Google could store about 4.3 trillion of such photos.
Stepping it up a bit, the average size of a high quality photo stored in CR2 format from a Canon EOS 5D Mark IV is ten times bigger at 35 Mb, so Google could store a mere 430 billion of these.
A high definition (1080p) movie is on average around 6 Gb, so Google could store the equivalent of 2.5 billion movies.
If Google employees felt that this resolution wasn’t doing it for them, they could upgrade to 150 million 4K movies at around 100 Gb each.
If instead they felt like reading, they could hold the equivalent of The Library of Congress print collections a mere 75 thousand times over .
Rather than talking about bytes, 15,000 petametres is equivalent to about 1,600 light years and at this distance from us we find Messier Object 47 (M47), a star cluster which was first described an impressively long time ago in 1654.
If instead we consider 15,000 peta-miles, then this is around 2.5 million light years, which gets us all the way to our nearest neighbour, the Andromeda Galaxy.
The fastest that humankind has got anything bigger than a handful of sub-atomic particles to travel is the 17 kilometres per second (11 miles per second) at which Voyager 1 is currently speeding away from the Sun. At this speed, it would take the probe about 43 billion years to cover the 15,000 peta-miles to Andromeda. This is over three times longer than our best estimate of the current age of the Universe.
Finally a more concrete example. If we consider a small cube, made of well concrete, and with dimensions of 1 cm in each direction, how big would a stack of 15,000 quadrillion of them be? Well, if arranged into a cube, each of the sides would be just under 25 km (15 and a bit miles) long. That’s a pretty big cube.
If the base was placed in the vicinity of New York City, it would comfortably cover Manhattan, plus quite a bit of Brooklyn and The Bronx, plus most of Jersey City. It would extend up to Hackensack in the North West and almost reach JFK in the South East. The top of the cube would plough through the Troposphere and get half way through the Stratosphere before topping out. It would vie with Mars’s Olympus Mons for the title of highest planetary structure in the Solar System .
It is probably safe to say that 15,000 Pb is an astronomical figure.
Google played a central role in the initial creation of the collection of technologies that we now use the term Big Data to describe The image at the beginning of this article perhaps explains why this was the case (and indeed why they continue to be at the forefront of developing newer and better ways of dealing with large data sets).
As a point of order, when people start talking about “big data”, it is worth recalling just how big “big data” really is.
In line with The Royal Society, I’m going to ignore the fact that these definitions were originally all in powers of 2 not 10.
The size of The Library of Congress print collections seems to have become irretrievably connected with the figure 10 terabytes (10 × 1012 bytes) for some reason. No one knows precisely, but 200 Tb seems to be a more reasonable approximation.
Applying the unimpeachable logic of eminent pseudoscientist and numerologist Erich von Däniken, what might be passed over as a mere coincidence by lesser minds, instead presents incontrovertible proof that Google’s PageRank algorithm was produced with the assistance of extraterrestrial life; which, if you think about it, explains quite a lot.
Though I suspect not for long, unless we chose some material other than concrete. Then I’m not a materials scientist, so what do I know?
When I posted my Brexit infographic reflecting the age of voters an obvious extension was to add an indication of the number of people in each age bracket who did not vote as well as those who did. This seemed a relatively straightforward task, but actually proved to be rather troublesome (this may be an example of British understatement). Maybe the caution I gave about statistical methods having a large impact on statistical outcomes in An Inconvenient Truth should have led me to expect such issues. In any case, I thought that it would be instructive to talk about the problems I stumbled across and to – once again – emphasise the perils of over-extending statistical models.
Regular readers will recall that my Brexit Infographic (reproduced above) leveraged data from an earlier article, A Tale of two [Brexit] Data Visualisations. As cited in this article, the numbers used were from two sources:
In the notes section of A Tale of two [Brexit] Data Visualisations I [prophetically] stated that the breakdown of voting by age group was just an estimate. Based on what I have discovered since, I’m rather glad that I made this caveat explicit.
The Pool of Tears
In order to work out the number of people in each age bracket who did not vote, an obvious starting point would be the overall electorate, which the UK Electoral Commission stated as being 46,500,001. As we know that 33,551,983 people voted (an actual figure rather than an estimate), then this is where the turnout percentage of 72.2% (actually 72.1548%) came from (33,551,983 / 45,500,001).
A clarifying note, the electorate figures above refer to people who are eligible to vote. Specifically, in order to vote in the UK Referendum, people had to meet the following eligibility criteria (again drawn from the UK Electoral Commission):
To be eligible to vote in the EU Referendum, you must be:
A British or Irish citizen living in the UK, or
A Commonwealth citizen living in the UK who has leave to remain in the UK or who does not require leave to remain in the UK, or
A British citizen living overseas who has been registered to vote in the UK in the last 15 years, or
An Irish citizen living overseas who was born in Northern Ireland and who has been registered to vote in Northern Ireland in the last 15 years.
EU citizens are not eligible to vote in the EU Referendum unless they also meet the eligibility criteria above.
So far, so simple. The next thing I needed to know was how the electorate was split by age. This is where we begin to run into problems. One place to start is the actual population of the UK as at the last census (2011). This is as follows:
% of total
If I roll up the above figures to create the same age groups as in the Ashcroft analysis (something that requires splitting the 15-19 range, which I have assumed can be done uniformly), I get:
% of total
The UK Government isn’t interested in the views of people under 18, so eliminating this row we get:
% of total
As mentioned, the above figures are from 2011 and the UK population has grown since then. Web-site WorldOMeters offers an extrapolated population of 65,124,383 for the UK in 2016 (this is as at 12th July 2016; if extrapolation and estimates make you queasy, I’d suggest closing this article now!). I’m going to use a rounder figure of 65,125,000 people; there is no point pretending that precision exists where it clearly doesn’t. Making the assumption that such growth is uniform across all age groups (please refer to my previous bracketed comment!), then the above exhibit can also be extrapolated to give us:
% of total
Looking Glass House
So our – somewhat fabricated – figure for the 18+ UK population in 2016 is 51,210,887, let’s just call this 51,200,000. As at the beginning of this article the electorate for the 2016 UK Referendum was 45,500,000 (dropping off the 1 person with apologies to him or her). The difference is explicable based on the eligibility criteria quoted above. I now have a rough age group break down of the 51.2 million population, how best to apply this to the 45.5 million electorate?
I’ll park this question for the moment and instead look to calculate a different figure. Based on the Ashcroft model, what percentage of the UK population (i.e. the 51.2 million) voted in each age group? We can work this one out without many complications as follows:
Turnout % (B/A)
(B) = Size of each age group in the Ashcroft sample as a percentage multiplied by the total number of people voting (see A Tale of two [Brexit] Data Visualisations).
Remember here that actual turnout figures have electorate as the denominator, not population. As the electorate is less than the population, this means that all of the turnout percentages should actually be higher than the ones calculated (e.g. the overall turnout with respect to electorate is 72.2% whereas my calculated turnout with respect to population is 65.5%). So given this, how to explain the 94.8% turnout of 55-64 year olds? To be sure this group does reliably turn out to vote, but did essentially all of them (remembering that the figures in the above table are too low) really vote in the referendum? This seems less than credible.
The turnout for 55-64 year olds in the 2015 General Election has been estimated at 77%, based on an overall turnout of 66.1% (web-site UK Political Info; once more these figures will have been created based on techniques similar to the ones I am using here). If we assume a uniform uplift across age ranges (that “assume” word again!) then one might deduce that an increase in overall turnout from 66.1% to 72.2%, might lead to the turnout in the 55-64 age bracket increasing from 77% to 84%. 84% turnout is still very high, but it is at least feasible; close to 100% turnout in from this age group seems beyond the realms of likelihood.
So what has gone wrong? Well so far the only culprit I can think of is the distribution of voting by age group in the Ashcroft poll. To be clear here, I’m not accusing Lord Ashcroft and his team of sloppy work. Instead I’m calling out that the way that I have extrapolated their figures may not be sustainable. Indeed, if my extrapolation is valid, this would imply that the Ashcroft model over estimated the proportion of 55-64 year olds voting. Thus it must have underestimated the proportion of voters in some other age group. Putting aside the likely fact that I have probably used their figures in an unintended manner, could it be that the much-maligned turnout of younger people has been misrepresented?
To test the validity of this hypothesis, I turned to a later poll by Omnium. To be sure this was based on a sample size of around 2,000 as opposed to Ashcroft’s 12,000, but it does paint a significantly different picture. Their distribution of voter turnout by age group was as follows:
I have to say that the Omnium age groups are a bit idiosyncratic, so I have taken advantage of the fact that the figures for 25-54 are essentially the same to create a schedule that matches the Ashcroft groups as follows:
The Omnium model suggests that younger voters may have turned out in greater numbers than might be thought based on the Ashcroft data. In turn this would suggest that a much greater percentage of 18-24 year olds turned out for the Referendum (64%) than for the last General Election (43%); contrast this with an estimated 18-24 turnout figure of 47% based on the just increase in turnout between the General Election and the Referendum. The Omnium estimates do still however recognise that turnout was still greater in the 55+ brackets, which supports the pattern seen in other elections.
While it may well be that the Leave / Remain splits based on the Ashcroft figures are reasonable, I’m less convinced that extrapolating these same figures to make claims about actual voting numbers by age group (as I have done) is tenable. Perhaps it would be better to view each age cohort as a mini sample to be treated independently. Based on the analysis above, I doubt that the turnout figures I have extrapolated from the Ashcroft breakdown by age group are robust. However, that is not the same as saying that the Ashcroft data is flawed, or that the Omnium figures are correct. Indeed the Omnium data (at least those elements published on their web-site) don’t include an analysis of whether the people in their sample voted Leave or Remain, so direct comparison is not going to be possible. Performing calculation gymnastics such as using the Omnium turnout for each age group in combination with the Ashcroft voting splits for Leave and Remain for the same age groups actually leads to a rather different Referendum result, so I’m not going to plunge further down this particular rabbit hole.
In summary, my supposedly simple trip to the destitution of an enhanced Brexit Infographic has proved unexpectedly arduous, winding and beset by troubles. These challenges have proved so great that I’ve abandoned the journey and will be instead heading for home.
Which dreamed it?
Based on my work so far, I have severe doubts about the accuracy of some of the age-based exhibits I have published (versions of which have also appeared on many web-sites, the BBC to offer just one example, scroll down to “How different age groups voted” and note that the percentages cited reconcile to mine). I believe that my logic and calculations are sound, but it seems that I am making too many assumptions about how I can leverage the Ashcroft data. After posting this article, I will accordingly go back and annotate each of my previous posts and link them to these later findings.
I think the broader lesson to be learnt is that estimates are just that, attempts (normally well-intentioned of course) to come up with figures where the actual numbers are not accessible. Sometimes this is a very useful – indeed indispensable – approach, sometimes it is less helpful. In either case estimation should always be approached with caution and the findings ideally sense-checked in the way that I have tried to do above.
Occam’s razor would suggest that when the stats tell you something that seems incredible, then 99 times out of 100 there is an error or inaccurate assumption buried somewhere in the model. This applies when you are creating the model yourself and doubly so where you are relying upon figures calculated by other people. In the latter case not only is there the risk of their figures being inaccurate, there is the incremental risk that you interpret them wrongly, or stretch their broader application to breaking point. I was probably guilty of one or more of the above sins in my earlier articles. I’d like my probable misstep to serve as a warning to other people when they too look to leverage statistics in new ways.
A further point is the most advanced concepts I have applied in my calculations above are addition, subtraction, multiplication and division. If these basic operations – even in the hands of someone like me who is relatively familiar with them – can lead to the issues described above, just imagine what could result from the more complex mathematical techniques (e.g. ambition, distraction, uglification and derision) used by even entry-level data scientists. This perhaps suggests an apt aphorism: Caveat calculator!
In my last article, I looked at a couple of ways to visualise the outcome of the recent UK Referendum on Europen Union membership. There I was looking at how different visual representations highlight different attributes of data.
I’ve had a lot of positive feedback about my previous Brexit exhibits and I thought that I’d capture the zeitgeist by offering a further visual perspective, perhaps one more youthful than the venerable pie chart; namely an infographic. My attempt to produce one of these appears above and a full-size PDF version is also just a click away.
For caveats on the provenance of the data, please also see the previous article’s notes section.
I have leveraged age group distributions from the Ascroft Polling organisation to create this exhibits. Other sites – notably the BBC – have done the same and my figures reconcile to the interpretations in other places. However, based on further analysis, I have some reason to think that either there are issues with the Ashcroft data, or that I have leveraged it in ways that the people who compiled it did not intend. Either way, the Ashcroft numbers lead to the conclusion that close to 100% of 55-64 year olds voted in the UK Referendum, which seems very, very unlikely. I have contacted the Ashcroft Polling organisation about this and will post any reply that I receive.