astronomy

This article was originally intended for publication late in the year it reviews, but, as they ^[1] say, the best-laid schemes o’ mice an’ men gang aft agley…

In 2017 I wrote more articles ^[2] than in any year since 2009, which was the first full year of this site’s existence. Some were viewed by thousands of people, others received less attention. Here I am going to ignore the metric of popular acclaim and instead highlight a few of the articles that I enjoyed writing most, or sometimes re-reading a few months later ^[3]. Given the breadth of subject matter that appears on peterjamesthomas.com, I have split this retrospective into six areas, which are presented in decreasing order of the number of 2017 articles I wrote in each. These are as follows:

General Data Articles
Data Visualisation
Statistics & Data Science
CDO perspectives
Programme Advice
Analytics & Big Data

In each category, I will pick out two or three of pieces which I feel are both representative of my overall content and worth a read. I would be more than happy to receive any feedback on my selections, or suggestions for different choices.


General Data Articles

	August The Data and Analytics Dictionary
My attempt to navigate the maze of data and analytics terminology. Everything from Algorithm to Web Analytics.
	November & December The Anatomy of a Data Function: Part I, Part II and Part III
Three articles focussed on the structure and components of a modern Data Function and how its components interact with both each other and the wider organisation in order to support business goals.

Data Visualisation

	January Nucleosynthesis and Data Visualisation
How one of the most famous scientific data visualisations, the Periodic Table, has been repurposed to explain where the atoms we are all made of come from via the processes of nucleosynthesis.
	September & October Hurricanes and Data Visualisation: Part I – Rainbow’s Gravity and Part II – Map Reading
Two articles on how Data Visualisation is used in Meteorology. Part I provides a worked example illustrating some of the problems that can arise when adopting a rainbow colour palette in data visualisation. Part II grapples with hurricane prediction and covers some issues with data visualisations that are intended to convey safety information to the public.

Statistics & Data Science

	February Toast
What links Climate Change, the Manhattan Project, Brexit and Toast? How do these relate to the public’s trust in Science? What does this mean for Data Scientists? Answers provided by Nature, The University of Cambridge and the author.
	February How to be Surprisingly Popular
The wisdom of the crowd relies upon essentially democratic polling of a large number of respondents; an approach that has several shortcomings, not least the lack of weight attached to people with specialist knowledge. The Surprisingly Popular algorithm addresses these shortcomings and so far has out-performed existing techniques in a range of studies.
	October A Nobel Laureate’s views on creating Meaning from Data
The 2017 Nobel Prize for Chemistry was awarded to Structural Biologist Richard Henderson and two other co-recipients. What can Machine Learning practitioners learn from Richard’s observations about how to generate images from Cryo-Electron Microscopy data?

CDO Perspectives

	January Alphabet Soup
Musings on the overlapping roles of Chief Analytics Officer and Chief Data Officer and thoughts on whether there should be just one Top Data Job in an organisation.
	February A Sweeter Spot for the CDO?
An extension of my concept of the Chief Data Officer sweet spot, inspired by Bruno Aziza of AtScale.
	September A truth universally acknowledged…
Many Chief Data Officer job descriptions have a list of requirements that resemble Swiss Army Knives. This article argues that the CDO must be the conductor of an orchestra, not someone who is a virtuoso in every single instrument.

Programme Advice

	January Bumps in the Road
What the aftermath of repeated roadworks can tell us about the potentially deleterious impact of Change Programmes on Data Landscapes.
	February 20 Risks that Beset Data Programmes
A review of 20 risks that can plague data programmes. How effectively these are managed / mitigated can make or break your programme.
	March Ideas for avoiding Big Data failures and for dealing with them if they happen
Paul Barsch (EY & Teradata) provides some insight into why Big Data projects fail, what you can do about this and how best to treat any such projects that head off the rails. With additional contributions from Big Data gurus Albert Einstein, Thomas Edison and Samuel Beckett.

Analytics & Big Data

	February Bigger and Better (Data)?
Some examples of where bigger data is not necessarily better data. Provided by Bill Vorhies and Larry Greenemeier .
	March Elephants’ Graveyard?
Thoughts on trends in interest in Hadoop and Spark, featuring George Hill, James Kobielus, Kashif Saiyed and Martyn Richard Jones, together with the author’s perspective on the importance of technology in data-centric work.

and Finally…

I would like to close this review of 2017 with a final article, one that somehow defies classification:


	April 25 Indispensable Business Terms
An illustrated Buffyverse take on Business gobbledygook – What would Buffy do about thinking outside the box? To celebrate 20 years of Buffy the Vampire Slayer and 1st April 2017.

Notes

^[1]	“They” here obviously standing for Robert Burns.
^[2]	Thirty-four articles and one new page.
^[3]	Of course some of these may also have been popular, I’m not being masochistic here!

From: peterjamesthomas.com, home of The Data and Analytics Dictionary

Follow @peterjthomas

The above image is part of a much bigger infographic produced by The Royal Society about machine learning. You can view the whole image here.

I felt that this component was interesting in a stand-alone capacity.

The legend explains that a petabyte (Pb) is equal to a million gigabytes (Gb) ^[1], or 1 Pb = 10⁶ Gb. A gigabyte itself is a billion bytes, or 1 Gb = 10⁹ bytes. Recalling how we multiply indeces we can see that 1 Pb = 10⁶ × 10⁹ bytes = 10^{6 + 9} bytes = 10¹⁵ bytes. 10¹⁵ also has a name, it’s called a quadrillion. Written out long hand:

1 quadrillion = 1,000,000,000,000,000

The estimate of the amount of data held by Google is fifteen thousand petabytes, let’s write that out long hand as well:

15,000 Pb = 15,000,000,000,000,000,000 bytes

That’s a lot of zeros. As is traditional with big numbers, let’s try to put this in context.

The average size of a photo on an iPhone 7 is about 3.5 megabytes (1 Mb = 1,000,000 bytes), so Google could store about 4.3 trillion of such photos.
Stepping it up a bit, the average size of a high quality photo stored in CR2 format from a Canon EOS 5D Mark IV is ten times bigger at 35 Mb, so Google could store a mere 430 billion of these.
A high definition (1080p) movie is on average around 6 Gb, so Google could store the equivalent of 2.5 billion movies.
If Google employees felt that this resolution wasn’t doing it for them, they could upgrade to 150 million 4K movies at around 100 Gb each.
If instead they felt like reading, they could hold the equivalent of The Library of Congress print collections a mere 75 thousand times over ^[2].
Rather than talking about bytes, 15,000 petametres is equivalent to about 1,600 light years and at this distance from us we find Messier Object 47 (M47), a star cluster which was first described an impressively long time ago in 1654.
If instead we consider 15,000 peta-miles, then this is around 2.5 million light years, which gets us all the way to our nearest neighbour, the Andromeda Galaxy ^[3].

The fastest that humankind has got anything bigger than a handful of sub-atomic particles to travel is the 17 kilometres per second (11 miles per second) at which Voyager 1 is currently speeding away from the Sun. At this speed, it would take the probe about 43 billion years to cover the 15,000 peta-miles to Andromeda. This is over three times longer than our best estimate of the current age of the Universe.
Finally a more concrete example. If we consider a small cube, made of well concrete, and with dimensions of 1 cm in each direction, how big would a stack of 15,000 quadrillion of them be? Well, if arranged into a cube, each of the sides would be just under 25 km (15 and a bit miles) long. That’s a pretty big cube.

If the base was placed in the vicinity of New York City, it would comfortably cover Manhattan, plus quite a bit of Brooklyn and The Bronx, plus most of Jersey City. It would extend up to Hackensack in the North West and almost reach JFK in the South East. The top of the cube would plough through the Troposphere and get half way through the Stratosphere before topping out. It would vie with Mars’s Olympus Mons for the title of highest planetary structure in the Solar System ^[4].

It is probably safe to say that 15,000 Pb is an astronomical figure.

Google played a central role in the initial creation of the collection of technologies that we now use the term Big Data to describe The image at the beginning of this article perhaps explains why this was the case (and indeed why they continue to be at the forefront of developing newer and better ways of dealing with large data sets).

As a point of order, when people start talking about “big data”, it is worth recalling just how big “big data” really is.

Notes

^[1]	In line with The Royal Society, I’m going to ignore the fact that these definitions were originally all in powers of 2 not 10.
^[2]	The size of The Library of Congress print collections seems to have become irretrievably connected with the figure 10 terabytes (10 × 10¹² bytes) for some reason. No one knows precisely, but 200 Tb seems to be a more reasonable approximation.
^[3]	Applying the unimpeachable logic of eminent pseudoscientist and numerologist Erich von Däniken, what might be passed over as a mere coincidence by lesser minds, instead presents incontrovertible proof that Google’s PageRank algorithm was produced with the assistance of extraterrestrial life; which, if you think about it, explains quite a lot.
^[4]	Though I suspect not for long, unless we chose some material other than concrete. Then I’m not a materials scientist, so what do I know?