The Big Data Universe

The Royal Society - Big Data Universe (Click to view a larger version in a new window)

The above image is part of a much bigger infographic produced by The Royal Society about machine learning. You can view the whole image here.

I felt that this component was interesting in a stand-alone capacity.

The legend explains that a petabyte (Pb) is equal to a million gigabytes (Gb) [1], or 1 Pb = 106 Gb. A gigabyte itself is a billion bytes, or 1 Gb = 109 bytes. Recalling how we multiply indeces we can see that 1 Pb = 106 × 109 bytes = 106 + 9 bytes = 1015 bytes. 1015 also has a name, it’s called a quadrillion. Written out long hand:

1 quadrillion = 1,000,000,000,000,000

The estimate of the amount of data held by Google is fifteen thousand petabytes, let’s write that out long hand as well:

15,000 Pb = 15,000,000,000,000,000,000 bytes

That’s a lot of zeros. As is traditional with big numbers, let’s try to put this in context.

  1. The average size of a photo on an iPhone 7 is about 3.5 megabytes (1 Mb = 1,000,000 bytes), so Google could store about 4.3 trillion of such photos.

    iPhone 7 photo

  2. Stepping it up a bit, the average size of a high quality photo stored in CR2 format from a Canon EOS 5D Mark IV is ten times bigger at 35 Mb, so Google could store a mere 430 billion of these.

    Canon EOS 5D

  3. A high definition (1080p) movie is on average around 6 Gb, so Google could store the equivalent of 2.5 billion movies.

    The Complete Indiana Jones (helpful for Data Management professionals)

  4. If Google employees felt that this resolution wasn’t doing it for them, they could upgrade to 150 million 4K movies at around 100 Gb each.

    4K TV

  5. If instead they felt like reading, they could hold the equivalent of The Library of Congress print collections a mere 75 thousand times over [2].

    Library of Congress

  6. Rather than talking about bytes, 15,000 petametres is equivalent to about 1,600 light years and at this distance from us we find Messier Object 47 (M47), a star cluster which was first described an impressively long time ago in 1654.

    Messier 47

  7. If instead we consider 15,000 peta-miles, then this is around 2.5 million light years, which gets us all the way to our nearest neighbour, the Andromeda Galaxy [3].

    Andromeda

    The fastest that humankind has got anything bigger than a handful of sub-atomic particles to travel is the 17 kilometres per second (11 miles per second) at which Voyager 1 is currently speeding away from the Sun. At this speed, it would take the probe about 43 billion years to cover the 15,000 peta-miles to Andromeda. This is over three times longer than our best estimate of the current age of the Universe.

  8. Finally a more concrete example. If we consider a small cube, made of well concrete, and with dimensions of 1 cm in each direction, how big would a stack of 15,000 quadrillion of them be? Well, if arranged into a cube, each of the sides would be just under 25 km (15 and a bit miles) long. That’s a pretty big cube.

    Big cube (plan)

    If the base was placed in the vicinity of New York City, it would comfortably cover Manhattan, plus quite a bit of Brooklyn and The Bronx, plus most of Jersey City. It would extend up to Hackensack in the North West and almost reach JFK in the South East. The top of the cube would plough through the Troposphere and get half way through the Stratosphere before topping out. It would vie with Mars’s Olympus Mons for the title of highest planetary structure in the Solar System [4].

It is probably safe to say that 15,000 Pb is an astronomical figure.

Google played a central role in the initial creation of the collection of technologies that we now use the term Big Data to describe The image at the beginning of this article perhaps explains why this was the case (and indeed why they continue to be at the forefront of developing newer and better ways of dealing with large data sets).

As a point of order, when people start talking about “big data”, it is worth recalling just how big “big data” really is.
 


 Notes

 
[1]
 
In line with The Royal Society, I’m going to ignore the fact that these definitions were originally all in powers of 2 not 10.
 
[2]
 
The size of The Library of Congress print collections seems to have become irretrievably connected with the figure 10 terabytes (10 × 1012 bytes) for some reason. No one knows precisely, but 200 Tb seems to be a more reasonable approximation.
 
[3]
 
Applying the unimpeachable logic of eminent pseudoscientist and numerologist Erich von Däniken, what might be passed over as a mere coincidence by lesser minds, instead presents incontrovertible proof that Google’s PageRank algorithm was produced with the assistance of extraterrestrial life; which, if you think about it, explains quite a lot.
 
[4]
 
Though I suspect not for long, unless we chose some material other than concrete. Then I’m not a materials scientist, so what do I know?

 

 

Curiouser and Curiouser – The Limits of Brexit Voting Analysis

An original illustration from Charles Lutwidge Dodgson's seminal work would have been better, but sadly none such seems to be extant
 
Down the Rabbit-hole

When I posted my Brexit infographic reflecting the age of voters an obvious extension was to add an indication of the number of people in each age bracket who did not vote as well as those who did. This seemed a relatively straightforward task, but actually proved to be rather troublesome (this may be an example of British understatement). Maybe the caution I gave about statistical methods having a large impact on statistical outcomes in An Inconvenient Truth should have led me to expect such issues. In any case, I thought that it would be instructive to talk about the problems I stumbled across and to – once again – emphasise the perils of over-extending statistical models.

Brexit ages infographic
Click to download a larger PDF version in a new window.

Regular readers will recall that my Brexit Infographic (reproduced above) leveraged data from an earlier article, A Tale of two [Brexit] Data Visualisations. As cited in this article, the numbers used were from two sources:

  1. The UK Electoral Commission – I got the overall voting numbers from here.
  2. Lord Ashcroft’s Poling organisation – I got the estimated distribution of votes by age group from here.

In the notes section of A Tale of two [Brexit] Data Visualisations I [prophetically] stated that the breakdown of voting by age group was just an estimate. Based on what I have discovered since, I’m rather glad that I made this caveat explicit.
 
 
The Pool of Tears

In order to work out the number of people in each age bracket who did not vote, an obvious starting point would be the overall electorate, which the UK Electoral Commission stated as being 46,500,001. As we know that 33,551,983 people voted (an actual figure rather than an estimate), then this is where the turnout percentage of 72.2% (actually 72.1548%) came from (33,551,983 / 45,500,001).

A clarifying note, the electorate figures above refer to people who are eligible to vote. Specifically, in order to vote in the UK Referendum, people had to meet the following eligibility criteria (again drawn from the UK Electoral Commission):

To be eligible to vote in the EU Referendum, you must be:

  • A British or Irish citizen living in the UK, or
  • A Commonwealth citizen living in the UK who has leave to remain in the UK or who does not require leave to remain in the UK, or
  • A British citizen living overseas who has been registered to vote in the UK in the last 15 years, or
  • An Irish citizen living overseas who was born in Northern Ireland and who has been registered to vote in Northern Ireland in the last 15 years.

EU citizens are not eligible to vote in the EU Referendum unless they also meet the eligibility criteria above.

So far, so simple. The next thing I needed to know was how the electorate was split by age. This is where we begin to run into problems. One place to start is the actual population of the UK as at the last census (2011). This is as follows:
 

Ages (years) Population % of total
0–4 3,914,000 6.2
5–9 3,517,000 5.6
10–14 3,670,000 5.8
15–19 3,997,000 6.3
20–24 4,297,000 6.8
25–29 4,307,000 6.8
30–34 4,126,000 6.5
35–39 4,194,000 6.6
40–44 4,626,000 7.3
45–49 4,643,000 7.3
50–54 4,095,000 6.5
55–59 3,614,000 5.7
60–64 3,807,000 6.0
65–69 3,017,000 4.8
70–74 2,463,000 3.9
75–79 2,006,000 3.2
80–84 1,496,000 2.4
85–89 918,000 1.5
90+ 476,000 0.8
Total 63,183,000 100.0

 
If I roll up the above figures to create the same age groups as in the Ashcroft analysis (something that requires splitting the 15-19 range, which I have assumed can be done uniformly), I get:
 

Ages (years) Population % of total
0-17 13,499,200 21.4
18-24 5,895,800 9.3
25-34 8,433,000 13.3
35-44 8,820,000 14.0
45-54 8,738,000 13.8
55-64 7,421,000 11.7
65+ 10,376,000 16.4
Total 63,183,000 100.0

 
The UK Government isn’t interested in the views of people under 18[citation needed], so eliminating this row we get:
 

Ages (years) Population % of total
18-24 5,895,800 11.9
25-34 8,433,000 17.0
35-44 8,820,000 17.8
45-54 8,738,000 17.6
55-64 7,421,000 14.9
65+ 10,376,000 20.9
Total 49,683,800 100.0

 
As mentioned, the above figures are from 2011 and the UK population has grown since then. Web-site WorldOMeters offers an extrapolated population of 65,124,383 for the UK in 2016 (this is as at 12th July 2016; if extrapolation and estimates make you queasy, I’d suggest closing this article now!). I’m going to use a rounder figure of 65,125,000 people; there is no point pretending that precision exists where it clearly doesn’t. Making the assumption that such growth is uniform across all age groups (please refer to my previous bracketed comment!), then the above exhibit can also be extrapolated to give us:
 

Ages (years) Population % of total
18-24 6,077,014 11.9
25-34 8,692,198 17.0
35-44 9,091,093 17.8
45-54 9,006,572 17.6
55-64 7,649,093 14.9
65+ 10,694,918 20.9
Total 51,210,887 100.0

 
 
Looking Glass House

So our – somewhat fabricated – figure for the 18+ UK population in 2016 is 51,210,887, let’s just call this 51,200,000. As at the beginning of this article the electorate for the 2016 UK Referendum was 45,500,000 (dropping off the 1 person with apologies to him or her). The difference is explicable based on the eligibility criteria quoted above. I now have a rough age group break down of the 51.2 million population, how best to apply this to the 45.5 million electorate?

I’ll park this question for the moment and instead look to calculate a different figure. Based on the Ashcroft model, what percentage of the UK population (i.e. the 51.2 million) voted in each age group? We can work this one out without many complications as follows:
 

Ages (years)
 
Population
(A)
Voted
(B)
Turnout %
(B/A)
18-24 6,077,014 1,701,067 28.0
25-34 8,692,198 4,319,136 49.7
35-44 9,091,093 5,656,658 62.2
45-54 9,006,572 6,535,678 72.6
55-64 7,649,093 7,251,916 94.8
65+ 10,694,918 8,087,528 75.6
Total 51,210,887 33,551,983 65.5

(B) = Size of each age group in the Ashcroft sample as a percentage multiplied by the total number of people voting (see A Tale of two [Brexit] Data Visualisations).
 
Remember here that actual turnout figures have electorate as the denominator, not population. As the electorate is less than the population, this means that all of the turnout percentages should actually be higher than the ones calculated (e.g. the overall turnout with respect to electorate is 72.2% whereas my calculated turnout with respect to population is 65.5%). So given this, how to explain the 94.8% turnout of 55-64 year olds? To be sure this group does reliably turn out to vote, but did essentially all of them (remembering that the figures in the above table are too low) really vote in the referendum? This seems less than credible.

The turnout for 55-64 year olds in the 2015 General Election has been estimated at 77%, based on an overall turnout of 66.1% (web-site UK Political Info; once more these figures will have been created based on techniques similar to the ones I am using here). If we assume a uniform uplift across age ranges (that “assume” word again!) then one might deduce that an increase in overall turnout from 66.1% to 72.2%, might lead to the turnout in the 55-64 age bracket increasing from 77% to 84%. 84% turnout is still very high, but it is at least feasible; close to 100% turnout in from this age group seems beyond the realms of likelihood.

So what has gone wrong? Well so far the only culprit I can think of is the distribution of voting by age group in the Ashcroft poll. To be clear here, I’m not accusing Lord Ashcroft and his team of sloppy work. Instead I’m calling out that the way that I have extrapolated their figures may not be sustainable. Indeed, if my extrapolation is valid, this would imply that the Ashcroft model over estimated the proportion of 55-64 year olds voting. Thus it must have underestimated the proportion of voters in some other age group. Putting aside the likely fact that I have probably used their figures in an unintended manner, could it be that the much-maligned turnout of younger people has been misrepresented?

To test the validity of this hypothesis, I turned to a later poll by Omnium. To be sure this was based on a sample size of around 2,000 as opposed to Ashcroft’s 12,000, but it does paint a significantly different picture. Their distribution of voter turnout by age group was as follows:
 

Ages (years) Turnout %
18-24 64
25-39 65
40-54 66
55-64 74
65+ 90

 
I have to say that the Omnium age groups are a bit idiosyncratic, so I have taken advantage of the fact that the figures for 25-54 are essentially the same to create a schedule that matches the Ashcroft groups as follows:
 

Ages (years) Turnout %
18-24 64
25-34 65
35-44 65
45-54 65
55-64 74
65+ 90

 
The Omnium model suggests that younger voters may have turned out in greater numbers than might be thought based on the Ashcroft data. In turn this would suggest that a much greater percentage of 18-24 year olds turned out for the Referendum (64%) than for the last General Election (43%); contrast this with an estimated 18-24 turnout figure of 47% based on the just increase in turnout between the General Election and the Referendum. The Omnium estimates do still however recognise that turnout was still greater in the 55+ brackets, which supports the pattern seen in other elections.
 
 
Humpty Dumpty

While it may well be that the Leave / Remain splits based on the Ashcroft figures are reasonable, I’m less convinced that extrapolating these same figures to make claims about actual voting numbers by age group (as I have done) is tenable. Perhaps it would be better to view each age cohort as a mini sample to be treated independently. Based on the analysis above, I doubt that the turnout figures I have extrapolated from the Ashcroft breakdown by age group are robust. However, that is not the same as saying that the Ashcroft data is flawed, or that the Omnium figures are correct. Indeed the Omnium data (at least those elements published on their web-site) don’t include an analysis of whether the people in their sample voted Leave or Remain, so direct comparison is not going to be possible. Performing calculation gymnastics such as using the Omnium turnout for each age group in combination with the Ashcroft voting splits for Leave and Remain for the same age groups actually leads to a rather different Referendum result, so I’m not going to plunge further down this particular rabbit hole.

In summary, my supposedly simple trip to the destitution of an enhanced Brexit Infographic has proved unexpectedly arduous, winding and beset by troubles. These challenges have proved so great that I’ve abandoned the journey and will be instead heading for home.
 
 
Which dreamed it?

Based on my work so far, I have severe doubts about the accuracy of some of the age-based exhibits I have published (versions of which have also appeared on many web-sites, the BBC to offer just one example, scroll down to “How different age groups voted” and note that the percentages cited reconcile to mine). I believe that my logic and calculations are sound, but it seems that I am making too many assumptions about how I can leverage the Ashcroft data. After posting this article, I will accordingly go back and annotate each of my previous posts and link them to these later findings.

I think the broader lesson to be learnt is that estimates are just that, attempts (normally well-intentioned of course) to come up with figures where the actual numbers are not accessible. Sometimes this is a very useful – indeed indispensable – approach, sometimes it is less helpful. In either case estimation should always be approached with caution and the findings ideally sense-checked in the way that I have tried to do above.

Occam’s razor would suggest that when the stats tell you something that seems incredible, then 99 times out of 100 there is an error or inaccurate assumption buried somewhere in the model. This applies when you are creating the model yourself and doubly so where you are relying upon figures calculated by other people. In the latter case not only is there the risk of their figures being inaccurate, there is the incremental risk that you interpret them wrongly, or stretch their broader application to breaking point. I was probably guilty of one or more of the above sins in my earlier articles. I’d like my probable misstep to serve as a warning to other people when they too look to leverage statistics in new ways.

A further point is the most advanced concepts I have applied in my calculations above are addition, subtraction, multiplication and division. If these basic operations – even in the hands of someone like me who is relatively familiar with them – can lead to the issues described above, just imagine what could result from the more complex mathematical techniques (e.g. ambition, distraction, uglification and derision) used by even entry-level data scientists. This perhaps suggests an apt aphorism: Caveat calculator!

Beware the Jabberwock, my son! // The jaws that bite, the claws that catch! // Beware the Jubjub bird, and shun // The frumious Bandersnatch!

 

How Age was a Critical Factor in Brexit

Brexit ages infographic
Click to download a larger PDF version in a new window.

In my last article, I looked at a couple of ways to visualise the outcome of the recent UK Referendum on Europen Union membership. There I was looking at how different visual representations highlight different attributes of data.

I’ve had a lot of positive feedback about my previous Brexit exhibits and I thought that I’d capture the zeitgeist by offering a further visual perspective, perhaps one more youthful than the venerable pie chart; namely an infographic. My attempt to produce one of these appears above and a full-size PDF version is also just a click away.

For caveats on the provenance of the data, please also see the previous article’s notes section.
 

Addendum

I have leveraged age group distributions from the Ascroft Polling organisation to create this exhibits. Other sites – notably the BBC – have done the same and my figures reconcile to the interpretations in other places. However, based on further analysis, I have some reason to think that either there are issues with the Ashcroft data, or that I have leveraged it in ways that the people who compiled it did not intend. Either way, the Ashcroft numbers lead to the conclusion that close to 100% of 55-64 year olds voted in the UK Referendum, which seems very, very unlikely. I have contacted the Ashcroft Polling organisation about this and will post any reply that I receive.

– Peter James Thomas, 14th July 2016

 

Data Management as part of the Data to Action Journey

Data Information Insight Action (w700)

| Larger Version | Detailed and Annotated Version (as PDF) |

This brief article is actually the summation of considerable thought and reflects many elements that I covered in my last two pieces (5 Themes from a Chief Data Officer Forum and 5 More Themes from a Chief Data Officer Forum), in particular both the triangle I used as my previous Data Management visualisation and Peter Aiken’s original version, which he kindly allowed me to reproduce on this site (see here for more information about Peter).

What I began to think about was that both of these earlier exhibits (and indeed many that I have seen pertaining to Data Management and Data Governance) suggest that the discipline forms a solid foundation upon which other areas are built. While there is a lot of truth in this view, I have come round to thinking that Data Management may alternatively be thought of as actively taking part in a more dynamic process; specifically the same iterative journey from Data to Information to Insight to Action and back to Data again that I have referenced here several times before. I have looked to combine both the static, foundational elements of Data Management and the dynamic, process-centric ones in the diagram presented at the top of this article; a more detailed and annotated version of which is available to download as a PDF via the link above.

I have also introduced the alternative path from Data to Insight; the one that passes through Statistical Analysis. Data Management is equally critical to the success of this type of approach. I believe that the schematic suggests some of the fluidity that is a major part of effective Data Management in my experience. I also hope that the exhibit supports my assertion that Data Management is not an end in itself, but instead needs to be considered in terms of the outputs that it helps to generate. Pristine data is of little use to an organisation if it is not then exploited to form insights and drive actions. As ever, this need to drive action necessitates a focus on cultural transformation, an area that is covered in many other parts of this site.

This diagram also calls to mind the subject of where and how the roles of Chief Analytics Officer and Chief Data Officer intersect and whether indeed these should be separate roles at all. These are questions to which – as promised on several previous occasions – I will return to in future articles. For now, maybe my schematic can give some data and information practitioners a different way to view their craft and the contributions that it can make to organisational success.
 

 

Scienceogram.org’s Infographic to celebrate the Philae landing

Scienceogram.org Rosetta Infographic - Click to view the original in a new tab

© scienceogram.org 2014
Reproduced on this site under a Creative Commons licence
Original post may be viewed here

As a picture is said to paint a thousand words, I’ll (mostly) leave it to Scienceogram’s infographic to deliver the message.

However, The Center for Responsive Politics (I have no idea whether or not they have a political affiliation, they claim to be nonpartisan) estimates the cost of the recent US Congressional elections at around $3.67 bn (€2.93 bn). I found a lower (but still rather astonishing) figure of $1.34 bn (€1.07 bn) at the Federal Election Commission web-site, but suspect that this number excludes Political Action Committees and their like.

To make a European comparisson to a European space project, the Common Agriculture Policy cost €57.5 bn ($72.0 bn) in 2013 according to the BBC. Given that Rosetta’s costs were spread over nearly 20 years, it makes sense to move the decimal point rightwards one place in both the euro and dollar figures and then to double the resulting numbers before making comparisons (this is left as an exercise for the reader).

Of course I am well aware that a quick Google could easily produce figures (such as how many meals, or vaccinations, or so on you could get for €1.4 bn) making points that are entirely antipodal to the ones presented. At the end of the day we landed on a comet and will – fingers crossed – begin to understand more about the formation of the Solar System and potentially Life on Earth itself as a result. Whether or not you think that is good value for money probably depends mostly on what sort of person you are. As I relate in a previous article, infographics only get you so far.
 


 
Scienceogram provides précis [correct plural] of UK science spending, giving overviews of how investment in science compares to the size of the problems it’s seeking to solve.