The Big Data Universe

The Royal Society - Big Data Universe

The above image is part of a much bigger infographic produced by The Royal Society about machine learning. You can view the whole image here.

I felt that this component was interesting in a stand-alone capacity.

The legend explains that a petabyte (Pb) is equal to a million gigabytes (Gb) [1], or 1 Pb = 10^6 Gb. A gigabyte itself is a billion bytes, or 1 Gb = 10^9 bytes. Recalling how we multiply indices, we can see that 1 Pb = 10^6 × 10^9 bytes = 10^(6 + 9) bytes = 10^15 bytes. 10^15 also has a name: it’s called a quadrillion. Written out long hand:

1 quadrillion = 1,000,000,000,000,000

The estimate of the amount of data held by Google is fifteen thousand petabytes, let’s write that out long hand as well:

15,000 Pb = 15,000,000,000,000,000,000 bytes
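For anyone who would rather let a computer count the zeros, here is a minimal Python sketch of the same arithmetic. The constant names are my own invention for illustration; the only inputs are the decimal definitions from the legend and the 15,000 Pb estimate.

```python
# Decimal unit sizes, following the legend's definitions.
GIGABYTE = 10**9               # bytes in 1 Gb
PETABYTE = 10**6 * GIGABYTE    # 1 Pb = 10^6 Gb = 10^(6 + 9) bytes

# The estimate quoted for Google in the infographic.
GOOGLE_PETABYTES = 15_000
google_bytes = GOOGLE_PETABYTES * PETABYTE

# Python integers are arbitrary precision, so nothing is rounded here.
print(f"1 Pb      = {PETABYTE:,} bytes")
print(f"15,000 Pb = {google_bytes:,} bytes")
```

The `:,` format specifier simply inserts the thousands separators, which makes it easy to compare against the long-hand figures above.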

That’s a lot of zeros. As is traditional with big numbers, let’s try to put this in context.

  1. The average size of a photo on an iPhone 7 is about 3.5 megabytes (1 Mb = 1,000,000 bytes), so Google could store about 4.3 trillion of such photos.

    iPhone 7 photo

  2. Stepping it up a bit, the average size of a high quality photo stored in CR2 format from a Canon EOS 5D Mark IV is ten times bigger at 35 Mb, so Google could store a mere 430 billion of these.

    Canon EOS 5D

  3. A high definition (1080p) movie is on average around 6 Gb, so Google could store the equivalent of 2.5 billion movies.

    The Complete Indiana Jones (helpful for Data Management professionals)

  4. If Google employees felt that this resolution wasn’t doing it for them, they could upgrade to 150 million 4K movies at around 100 Gb each.

    4K TV

  5. If instead they felt like reading, they could hold the equivalent of The Library of Congress print collections a mere 75 thousand times over [2].

    Library of Congress

  6. Rather than talking about bytes, 15,000 petametres is equivalent to about 1,600 light years and at this distance from us we find Messier Object 47 (M47), a star cluster which was first described an impressively long time ago in 1654.

    Messier 47

  7. If instead we consider 15,000 peta-miles, then this is around 2.5 million light years, which gets us all the way to our nearest neighbour, the Andromeda Galaxy [3].


    The fastest that humankind has got anything bigger than a handful of sub-atomic particles to travel is the 17 kilometres per second (11 miles per second) at which Voyager 1 is currently speeding away from the Sun. At this speed, it would take the probe about 43 billion years to cover the 15,000 peta-miles to Andromeda. This is over three times longer than our best estimate of the current age of the Universe.

  8. Finally a more concrete example. If we consider a small cube, made of, well, concrete, and with dimensions of 1 cm in each direction, how big would a stack of 15,000 quadrillion of them be? Well, if arranged into a cube, each of the sides would be just under 25 km (15 and a bit miles) long. That’s a pretty big cube.

    Big cube (plan)

    If the base was placed in the vicinity of New York City, it would comfortably cover Manhattan, plus quite a bit of Brooklyn and The Bronx, plus most of Jersey City. It would extend up to Hackensack in the North West and almost reach JFK in the South East. The top of the cube would plough through the Troposphere and get half way through the Stratosphere before topping out. It would vie with Mars’s Olympus Mons for the title of highest planetary structure in the Solar System [4].
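The comparisons in the list above can be roughly reproduced with the short Python sketch below. The item sizes and the Voyager 1 speed are the figures quoted in the list; the length of a light year, the mile-to-metre conversion and the length of a year are my own assumed constants rather than anything taken from the infographic, and all the names are hypothetical.

```python
PETA = 10**15
GOOGLE_BYTES = 15_000 * PETA   # 15,000 Pb in bytes

# Item sizes in bytes, as quoted in the list above (decimal units).
ITEM_SIZES = {
    "iPhone 7 photo (3.5 Mb)": 3.5e6,
    "Canon CR2 photo (35 Mb)": 35e6,
    "1080p movie (6 Gb)": 6e9,
    "4K movie (100 Gb)": 100e9,
    "Library of Congress print collections (200 Tb)": 200e12,
}


def how_many(item_bytes: float, total_bytes: float = GOOGLE_BYTES) -> float:
    """Number of items of the given size that fit into total_bytes."""
    return total_bytes / item_bytes


for name, size in ITEM_SIZES.items():
    print(f"{name}: {how_many(size):,.0f}")

# Assumed physical constants (not from the article).
METRES_PER_LIGHT_YEAR = 9.4607e15
METRES_PER_MILE = 1609.344
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# 15,000 petametres and 15,000 peta-miles, both expressed in light years.
ly_from_metres = 15_000 * PETA / METRES_PER_LIGHT_YEAR
ly_from_miles = 15_000 * PETA * METRES_PER_MILE / METRES_PER_LIGHT_YEAR

# Voyager 1 at the quoted 11 miles per second, covering 15,000 peta-miles.
voyager_years = 15_000 * PETA / 11 / SECONDS_PER_YEAR

# Side of a cube built from 15,000 quadrillion 1 cm cubes, in km.
CM_PER_KM = 100_000
cube_side_km = (15_000 * PETA) ** (1 / 3) / CM_PER_KM

print(f"15,000 petametres : {ly_from_metres:,.0f} light years")
print(f"15,000 peta-miles : {ly_from_miles:,.0f} light years")
print(f"Voyager 1 trip    : {voyager_years:,.0f} years")
print(f"Cube side         : {cube_side_km:.1f} km")
```

With these assumed constants the results land close to the rounded figures in the text: a little under 1,600 light years, a little over 2.5 million light years, roughly 43 billion years and a cube side just under 25 km.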

It is probably safe to say that 15,000 Pb is an astronomical figure.

Google played a central role in the initial creation of the collection of technologies that we now use the term Big Data to describe. The image at the beginning of this article perhaps explains why this was the case (and indeed why they continue to be at the forefront of developing newer and better ways of dealing with large data sets).

As a point of order, when people start talking about “big data”, it is worth recalling just how big “big data” really is.


[1] In line with The Royal Society, I’m going to ignore the fact that these definitions were originally all in powers of 2 not 10.
[2] The size of The Library of Congress print collections seems to have become irretrievably connected with the figure 10 terabytes (10 × 10^12 bytes) for some reason. No one knows precisely, but 200 Tb seems to be a more reasonable approximation.
[3] Applying the unimpeachable logic of eminent pseudoscientist and numerologist Erich von Däniken, what might be passed over as a mere coincidence by lesser minds, instead presents incontrovertible proof that Google’s PageRank algorithm was produced with the assistance of extraterrestrial life; which, if you think about it, explains quite a lot.
[4] Though I suspect not for long, unless we choose some material other than concrete. Then again, I’m not a materials scientist, so what do I know?
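The note above about powers of 2 versus powers of 10 matters more as the units get bigger. As a rough illustration (a sketch with names of my own choosing, not anything from The Royal Society), here is how far apart the two conventions are at the petabyte scale; the binary version, 2^50 bytes, is what the IEC calls a pebibyte.

```python
DECIMAL_PETABYTE = 10**15   # the convention used in this article
BINARY_PETABYTE = 2**50     # 1,024^5 bytes, a pebibyte (PiB) in IEC terms

ratio = BINARY_PETABYTE / DECIMAL_PETABYTE

print(f"decimal: {DECIMAL_PETABYTE:,} bytes")
print(f"binary : {BINARY_PETABYTE:,} bytes")
print(f"binary is {100 * (ratio - 1):.1f}% larger")
```

A gap of around an eighth is easy to ignore when the aim is only to convey a sense of scale, which is presumably why the infographic sticks to powers of 10.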



You have to love Google

…well if you used to be a Number Theorist that is.

Google / Fermat

It’s almost enough to make me forgive them for Gmail’s consider including “feature”. Almost!


LinkedIn does what it says on the can

Referring domains
An analysis of traffic based on linking site

I suppose, given that this is essentially a professional blog, I should not be surprised that LinkedIn dominates traffic for me, dwarfing even the mighty Google and Twitter (incidentally Facebook was in 13th place, below Microsoft – a verdict of “could do better”, but then Facebook is only semi-pro for me).

It is also worth noting that traffic from all WordPress blogs (not included in the 4% WordPress figure above) amounted to 3% of traffic. Adding in all other non-corporate blogs got this to 5% (and a notional 4th place).

It is also notable that StumbleUpon outdid all other social bookmarking sites, with Reddit next in a lowly 23rd place.
Some selected top threes…

Please note that the only criterion here is quantum of traffic.
The Social Media “Big Three”

  1. LinkedIn
  2. Twitter
  3. Facebook


  1. Microsoft
  2. SAS
  3. IBM


  1. Oracle Business Intelligence 101
  2. Judith Hurwitz
  3. Merv Adrian

Social Bookmarking

  1. StumbleUpon
  2. Reddit
  3. Delicious

Blog Readers

  1. Bloglines (now sadly defunct)
  2. Netvibes
  3. Google Reader

Technology News / Communities

  1. Smart Data Collective
  2. IT Business Edge
  3. Joint: IT Finance Connection & Social Media Today


  1. CIO Magazine
  2. The Economist
  3. Computing


I should point out that the figures presented above are all-time, rather than say the last six months. It would be interesting to do some trending, but this is a bit more clunky to achieve than one might expect.

Four [Social Media] Failures and a Success

Four Social Media Failures and a Success - with apologies to Mike Newell


The internet is full of articles claiming to transform the reader into the Social Media equivalent of Charles Atlas. I have written some of them myself (though hopefully while highlighting that things are seldom as simple as ticking a set of boxes). Bearing in mind the old adage that you learn more from your mistakes than your successes, here are some thoughts on Social Media failures; the first three are mine and the fourth a failure that seems very widespread. Lest this article becomes too depressing, I will close with a more positive piece of Social Media news.
Failure 1 – Thinking that you can dip in and out of Social Media

Articles per month

I recently came across Ken Mueller’s blog via a LinkedIn Group (see the segment of New Adventures in WiFi that relates to LinkedIn for some thoughts on groups). In one of his articles he lays out what he sees as the factors that have led to him tripling his blog traffic. Foremost amongst these is consistency:

I’ve been doing this every day for about 2 years now. Some of the growth that I’m seeing is due to just plugging away and forcing myself to blog every day, hopefully creating good, relevant content that people want to read. If I take a day off, I notice a drop in traffic. In fact, I always see a drop in my November traffic because I go away for Thanksgiving to an area with no Internet access.

A quick look at the above chart, which shows the number of articles I have published each month since founding this blog back in November 2008, will reveal that consistency hasn’t been my middle name.

For a variety of reasons, I have had periods where I have sustained a high output of articles (without, it is to be hoped, quantity compromising quality) and periods where my writing has slowed to a barely perceptible trickle. To take an ultra-prosaic example, I started writing this piece while commuting by train and my recent output is highly correlated with my method of transportation.

Now what shall I blog about today? ... Sadly I don't travel too much on the London Tube nowadays - odd the things that you miss

Coming out of some of the troughs in writing, I have sometimes felt that I could simply pick up where I left off. This is probably the case with some niche readers who may visit this site; this is precisely because at least some of my content is directly pertinent to them from time to time. However, after a while, even they may have looked elsewhere for their regular fix of the topics I cover here. Beyond this, there is equally likely to be a second cohort of casual readers who will quickly move on to pastures new if the grass here does not re-grow apace [note to self, I am meant to be restraining myself from overly liberal use of analogies, must try harder!].

Even if an author has written several articles that have proved popular with a number of people; after anything more than a few weeks’ lay-off, it can almost be like starting again from scratch. To employ a too widely-used phrase, you are only as good as your last month’s (or maybe week’s, or maybe day’s) output.

7th November 2002 - Brisbane Cricket Ground, Queensland, Australia. England's Simon Jones ruptures a cruciate ligament. It took him until 11th March 2004 to play for England again.

Disregarding for the moment my own parenthetic advice from the end of the paragraph before last, this feels rather familiar. It is very like trying to get fit again after an injury or time away from a sport. It doesn’t really matter if you had attained a certain level of fitness a year ago; what is relevant today is your current level of fitness and the gap between the two. Sometimes recalling just how long it took them to achieve a previous standard can be quite de-motivating to an athlete returning from a break. Once fit, it is a lot easier to stay fit than it is to regain lost fitness. The same applies to audiences and this is why – as Ken suggests in his article – at least periodic blogging (assuming that it is of a standard) is essential.

My learning here is both to make time to write and also to re-engage with my readers.

[Perhaps ironically this article itself has been in gestation for a few weeks]
Failure 2 – Assuming that what has worked before will work again

Michael Schumacher's comeback - or how to dim a glistening reputation

I have a specific example in mind here and it relates to a blog post that precedes this one. In turn this goes back to a survey of senior IT people that I carried out predominantly via LinkedIn back in January 2009. This related to their view on the top priorities that they faced in their jobs. Recently I thought that it would be interesting to update this and – no doubt naturally – I also thought that I would adopt the same modus operandi; i.e. LinkedIn. I even targeted the same Group – that of CIO Magazine.

linkedin CIO Magazine CIO Magazine forum

Sad to say, while I had dozens of responses last time round, there has been little or no response at all when I attempted to refresh the findings. I have been thinking about why this might be. Of course my musings are pure speculation, but a few ideas come to mind:

  1. The output of the last survey was not of much interest / didn’t tell people anything that they didn’t already know and so it was not worth the effort of replying again.
  2. The people frequenting the CIO Magazine LinkedIn Group back in 2009 were a very different set of people to now. Back then we were in the aftermath of the global banking crisis and perhaps a number of good people had more time on their hands than would normally be the case. Today, while the good times are not exactly rolling, I hope that a large tranche of these people are once more gainfully employed.
  3. It could be (as I have mentioned before) that the wild proliferation of LinkedIn groups means that people’s time and energy are spread over a wider set of these, with less time to devote to specific questions. I have no access to LinkedIn statistics, but would like to bet that while overall Group-based activity has no doubt increased, activity per group may well have decreased.
  4. Variants of the same question may have been asked so often that people have grown tired of answering it.
  5. This could be one of the early signs of general Social Media fatigue.

By way of contrast – and perhaps tapping into my thoughts about variants of the same question having been asked many times before – the same Group has a thread asking members to state in one word what their key challenge is. Although many of the replies are somewhat trite and there is a limit to how much information a single word can convey, it is instructive to think that an innovative approach (and one that requires little time typing a response) has been successful where my attempt to repeat a previous exercise has failed.

My learning here is to think of new ways to approach old material, rather than simply believing that you can repeat past successes.

[UPDATE: I posted on the original CIO Magazine Group threads to change its status to publicly available and started to receive new thoughts on this. Another thought – perhaps people are just more comfortable contributing to discussions that others have already engaged in, rather than being the first to comment?]
Failure 3 – Ascribing [as yet] unwarranted maturity to Social Media

Starting them young...

I religiously refrain from blogging about current work projects; however, the following was 100% in the public domain by its very nature.

I have recently been doing some recruitment and – given both the increasing use of LinkedIn by recruitment firms in their work and that I have a pretty extensive network – thought that it would be worth trying to leverage Social Media to reach out to potential candidates. I did this via a status update, rather than taking the perhaps more obvious path of using the various job sections. My logic here was that I would potentially reach a wider audience in one go than via several postings within pertinent groups. I was also pursuing my recruitment through more traditional channels, so this idea could simply be viewed as a Social Media experiment.

As with any honest scientist, it is important that I state my negative results as well as positive. In this case, though I was contacted by many recruitment agencies, I didn’t get any feedback from actual candidates themselves at all. It could be argued that the failure was in the way I approached the experiment, or the narrowness of the channel that I selected. While both of these are true observations, the whole point of Social Media in business (if there is one) is to make either organisation-to-person, or person-to-person contact ridiculously easy and immediate. Regardless of my level of ineptitude, it wasn’t easy to achieve what I wanted to achieve and I abandoned my experiment after a week or so.

My learning here is not to refrain from business / Social Media experimentation, but also not to expect too much from what is after all an emerging area.
Failure 4 – Vendor employees not “getting” Social Media

Clueless about Social Media

I have often used this column to talk about my opinion that your choice of Business Intelligence tool is one of the least important factors in a BI/DW project. In the article I link to in the previous sentence, I quote from an interview I gave in which I compare the market for BI tools with that for cars. There is no definitive answer to the question “what is the best car?” and in the same way there is no “best BI tool”. Going further than this, there are many other areas of a BI/DW project which, if done well, will come close to guaranteeing your success regardless of which BI tool you select; but, if done badly, will come close to guaranteeing your failure with any BI tool.

I have also previously contrasted my opinion with the surprisingly large number of discussion threads on LinkedIn that have as a title some variant of “Please, please, please, please, please tell me which is the best BI tool”. I worry about people making quite significant purchasing decisions based on replies posted in an internet forum, but that is perhaps a topic for another day. The particular failure I wanted to highlight is of people posting on these types of thread who work for Big BI Corporation Inc. Of course everyone is entitled to their opinion, but I am not sure that many readers would be swayed by:

I highly recommend Object Explorer Studio+ for all your BI needs

– Joe Blogs

Particularly where one click reveals that Joe Blogs is either employed by the owners of OES+ or a consultant whose company seems to exclusively do OES+ implementations. I hate to single out one vendor, but a particularly egregious reply to one of these “Which BI Tool?” threads that I saw recently consisted of one word:


– Jimmy Blogs

As I say, on the very same thread there were examples of employees of many other big and small BI vendors doing just the same, but most of them at least provided more than one word. In the cause of balance, the same thread also contained some thoughts along the lines of:

I can heartily recommend Oracle BI, OBIEE+ is great because [sales pitch deleted]. If you would like to know more drop me a line at

– Jeff Blogs

I still wonder whether Jeff got any e-mails. At least he flagged his connection with Oracle; I don’t recall many other vendor employees being honest enough to do the same.

Lest I be accused of bias, there were also not too dissimilar postings from people strongly associated with SAP, IBM, QlikTech, Pentaho and a sprinkling of BI start-ups. I should perhaps also note that SAS was not a culprit (at least to date), but then maybe this was because the question was about BI, something they abjure. MicroStrategy was also honourably notable for its lack of replies containing naive self-promotion, but perhaps this was simply an oversight.

The above rather bizarre behaviour leads to two questions:

  1. Why do the people making these types of posting think that they will be taken seriously?
  2. Why do the vendors themselves not offer better guidance to their employees about avoiding crass and counter-productive social media advertising of a sort that is more likely to tarnish reputations than enhance sales?

Maybe here again we have an issue of social media maturity. Many people are perhaps struggling as much to get their message across effectively as they did with say the advent of television advertising.

My learning here is that I should curb my rather obsessive compulsion to “out” vendors promoting their own products under the guise of neutral advice-giving.

[not sure that I am going to take much notice of this one however]
Success – The Accidental Search Engine Optimiser

After covering three of my own failures and one of the BI vendor community (though I am sure the phenomenon is not restricted to BI or even technology vendors), I will close with one of my successes, albeit an unintentional one. I noticed a strange result the other day when looking at the following (I was actually looking for something else believe it or not):

Business Intelligence Expert

I believe that my elevated ranking is probably correlated to recent changes in Google’s algorithms that take greater account of social media. Certainly I don’t recall placing on the first page for any Google search before, let alone ranking #1. I suppose that I might have a degree of technical satisfaction if this was the result of months of assiduous search engine optimisation. However the truth is that the result appears to be the unintended by-product of doing lots of things that I wanted to do anyway, like writing about topics I am interested in and trying to engage with a wide group of people in a number of different ways. In a sense the fact that this achievement was accidental (or at least collateral) makes it more pleasing. Maybe the secret to Social Media success is simply to not worry about it and just get on with expressing yourself.

My learning here is that providing content that is of interest to your target audience and being clear about who you are and what you do is going to be an approach that trumps any more mechanistic approach to SEO.
Closing thoughts

I believe that I have learnt something from my three failures above (and that vendors should learn something from the fourth), but the single success encourages me to persevere. My aim in sharing these experiences is to encourage other Social Media ingénues like myself in the same way. I hope that I have at least partially achieved this.

Consider including…

Gmail logo

Let me get something out of the way straight up. I am a fan of Google. Are their services and products flawless? Probably not. Did they live up to their stated objective of “don’t be evil”? Well I guess the Chinese difficulties didn’t exactly paint them in the best light, nevertheless I can think of less savoury technology companies. On the plus side, I have used Google’s services and, in particular, their cloud-based e-mail – Gmail – for years and been very happy with them. If I explain that my smart phone is a Nexus One, you will probably get the general idea.

Gmail fail?
Image edited and truncated to fit page

However, Google have introduced a “feature” into Gmail which leads me to question what on earth they were thinking. This is the “Consider including” function. When you type an e-mail, Gmail comes up with a list of people that you may like to also copy it to. Let’s pause and just think about this. You are writing an e-mail; generally the first thing that you do is to type in the address of the person (or people) you are writing to. Gmail has a useful feature that scans your previous mails, so typing “Pe” will bring up “Peter Thomas” as an option. So far so good. But then, based solely on this first e-mail address entered (not even on the subject), the bar highlighted in pale yellow above appears with a list of people that you may consider including on the mail.

Google’s algorithms may be great at figuring out which context-based ads to display alongside the advertising-supported Gmail (though I must admit to never having clicked on any of these and to generally mentally filtering them out), but how does an algorithm know better than me who I want to send an e-mail to? I suppose we could give the geniuses at Google the benefit of the doubt, maybe they do know.

Sadly empirical evidence is that the software doesn’t have a clue. In the example above, the contacts “J”, “L” and “R” (the names have been anonymised to protect those irrelevant to the context) have nothing whatsoever to do with the e-mail recipient (again anonymised) that I started writing. Aside from perhaps once being cc’ed in an e-mail sent to the person whose address I typed in, they have no relation to either the intended recipient, or indeed to each other. As to content, at this point there isn’t any, so it is anyone’s guess how Google generates the list; an even more worrying question is why do they?

Not only does the feature fail to work, it is also totally asinine. It might make some sense for say Facebook to suggest people with whom you might want to share a link. However, there are people who you might e-mail twice a year for very specific purposes, that still get suggested in a “Consider including”. Google plainly doesn’t know better than me to whom I actually want to send an e-mail. A worry is that a stray click and a lack of attention could send an e-mail to someone who is not intended to see it. Given the fact that many small businesses and sole-trader consultants rely on Gmail, then – in extremis – this could lead to commercially sensitive (or indeed personally private) information being sent to the wrong person. The feature is clearly ill-advised and – worst of all – you cannot (at present) turn it off.

In searching (via Google) for tips on how to get rid of this truly abysmal piece of functionality I came across two things: screeds of people just like me asking what Google was thinking and an article entitled: Gmail’s Most Ridiculous, Idiotic, Intrusive, Useless Feature Ever by Zoli Erdos, which covers the problems and potential implications of “Consider including” in more depth. Here is a pithy quote:

I’ve never thought the day would come I would write the words utterly ridiculous, iditiotic, intrusive, with absolute certainly about a Google feature

This “feature” is bad enough to have merited me writing to Google asking them to remove it, or at least make it optional. Their support forums are full of people saying the same. It will be interesting to see whether or not they listen.

[Disclosure: I have more than one Gmail account and also use Google apps from time to time, as stated above, I also use Feedburner and have a Google smart phone. Other than this I have no commercial relationship with Google and have never bought or recommended their services in a business context]

Google Fools Day

Happy April Fools Day from Google

A nice touch – pointed out by @CurtMonash (who seems to be cropping up on my blog quite a bit at the moment):

  • Results 1 – 10 of about 63,300,000 for peter thomas. (2.00 shakes of a lamb’s tail)
  • Results 1 – 10 of about 63,300,000 for peter thomas. (0.10 microfortnights)
  • Results 1 – 10 of about 63,300,000 for peter thomas. (1.21 gigawatts)

and so on…

Try it yourself here.

Though I suspect you have only a few hours left.

Also worth checking out:

  1. The burgeoning NoData movement, led by revolutionary in chief @merv.
  2. The cutting-edge concept of Subterranean Computing, championed by @ocdqblog – so much more substantive than The Cloud.




As might be inferred from my last post, certain sporting matters have been on my mind of late. However, as is becoming rather a theme on this blog, these have also generated some business-related thoughts.

On Friday evening, the Australian cricket team finished the second day of the second Test Match on a score of 156 runs for the loss of 8 (out of 10) first innings wickets. This was still 269 runs behind the England team’s total of 425.

In scanning what I realise must have been a hastily assembled end-of-day report on the web-site of one of the UK’s leading quality newspapers, a couple of glaring errors stood out. First, the Australian number four batsman Michael Hussey was described as having “played-on” to a delivery from England’s shy-and-retiring Andrew Flintoff. Second, the journalist wrote that Australia’s number six batsman, Marcus North, had been “clean-bowled” by James Anderson.

I appreciate that not all readers of this blog will be cricket aficionados and also that the mysteries of this most complex of games are unlikely to be made plain by a few brief words from me. However, “played on” means that the ball has hit the batsman’s bat and deflected to break his wicket (or her wicket – as I feel I should mention as a staunch supporter of the all-conquering England Women’s team, a group that I ended up meeting at a motorway service station just recently).

By contrast, “clean-bowled” means that the ball broke the batsman’s wicket without hitting anything else. If you are interested in learning more about the arcane rules of cricket (and let’s face it, how could you not be interested) then I suggest taking a quick look here. The reason for me bothering to go into this level of detail is that, having watched the two dismissals live myself, I immediately thought that the journalist was wrong in both cases.

It may be argued that the camera sometimes lies, but the caption (whence these images are drawn) hardly ever does. The following two photographs show what actually happened:

Michael Hussey leaves one and is bowled, England v Australia, 2nd Test, Lord's, 2nd day, July 17, 2009
Marcus North drags James Anderson into his stumps, England v Australia, 2nd Test, Lord's, 2nd day, July 17, 2009

As hopefully many readers will be able to ascertain, Hussey raised his bat aloft, a defensive technique employed to avoid edging the ball to surrounding fielders, but misjudged its direction. It would be hard to “play on” from a position such as he adopted. The ball arced in towards him and clipped the top of his wicket. So, in fact he was the one who was “clean-bowled”; a dismissal that was qualified by him having not attempted to play a stroke.

North on the other hand had been at the wicket for some time and had already faced 13 balls without scoring. Perhaps in frustration at this, he played an overly-ambitious attacking shot (one not a million miles from a baseball swing), the ball hit the under-edge of his horizontal bat and deflected down into his wicket. So it was North, not Hussey, who “played on” on this occasion.

So, aside from saying that Hussey had been adjudged out “handled the ball” and North dismissed “obstructed the field” (two of the ten ways in which a batsman’s innings can end – see here for a full explanation), the journalist in question could not have been more wrong.

As I said, the piece was no doubt composed quickly in order to “go to press” shortly after play had stopped for the day. Maybe these are minor slips, but surely the core competency of a sports journalist is to record what happened accurately. If they can bring insights and colour to their writing, so much the better, but at a minimum they should be able to provide a correct description of events.

Everyone makes mistakes. Most of my blog articles contain at least one typographical or grammatical error. Some of them may include errors of fact, though I do my best to avoid these. Where I offer my opinions, it is possible that some of these may be erroneous, or that they may not apply in different situations. However, we tend to expect professionals in certain fields to be held to a higher standard.


For a molecular biologist, the difference between a 0.20 micro-molar solution and a 0.19 one may be massive. For a team of experimental physicists, unbelievably small quantities may mean the difference between confirming the existence of the Higgs Boson and just some background noise.

In business, it would be unfortunate (to say the least) if auditors overlooked major assets or liabilities. One would expect that law-enforcement agents did not perjure themselves in court. Equally politicians should never dissemble, prevaricate or mislead. OK, maybe I am a little off track with the last one. But surely it is not unreasonable to expect that a cricket journalist should accurately record how a batsman got out.
Twitter and Truth

I made something of a leap from these sporting events to the more tragic news of Michael Jackson’s recent demise. I recall first “hearing” rumours of this on Twitter. At this point, no news sites had much to say about the matter. As the evening progressed, the self-styled celebrity gossip site TMZ was the first to announce Jackson’s death. Other news outlets either said “Jackson taken to hospital” or (perhaps hedging their bets) “US web-site reports Jackson dead”.

By this time the twitterverse was experiencing a cosmic storm of tweets about the “fact” of Jackson’s passing. A comparably large number of comments lamented how slow “old media” was to acknowledge this “fact”. Eventually of course the dinosaurs of traditional news and reporting lumbered to the same conclusion as the more agile mammals of Twitter.

In this case social media was proved to be both quick and accurate, so why am I now going to offer a defence of the world’s news organisations? Well I’ll start with a passage from one of my all-time favourite satires, Yes Minister, together with its sequel Yes Prime Minister.

In the following brief excerpt Sir Geoffrey Hastings (the head of MI5, the British domestic intelligence service) is speaking to The Right Honourable James Hacker (the British Prime Minister). Their topic of conversation is the recently revealed news that a senior British Civil Servant had in fact been a Russian spy:

Yes Prime Minister

Hastings: Things might get out. We don’t want any more irresponsible ill-informed press speculation.
Hacker: Even if it’s accurate?
Hastings: Especially if it’s accurate. There is nothing worse than accurate irresponsible ill-informed press speculation.

Yes Prime Minister, Vol. I by J. Lynn and A. Jay

Was the Twitter noise about Jackson’s death simply accurate ill-informed speculation? It is difficult to ask this question as, sadly, the tweets (and TMZ) proved to be correct. However, before we garland new media with too many wreaths, it is perhaps salutary to recall that there was a second rumour of a celebrity death circulating in the febrile atmosphere of Twitter on that day. As far as I am aware, Pittsburgh’s finest – Jeff Goldblum – is alive and well as we speak. Rumours of his death (in an accident on a New Zealand movie set) proved to be greatly exaggerated.

The difference between a reputable news outlet and hordes of twitterers is that the former has a reputation to defend. While the average tweep will simply shrug their shoulders at RTing what they later learn is inaccurate information, misrepresenting the facts is a cardinal sin for the best news organisations. Indeed reputation is the main thing that news outlets have going for them. This inevitably includes annoying and time-consuming things such as checking facts and validating sources before you publish.

With due respect to Mr Jackson, an even more tragic set of events also sparked some similar discussions: the aftermath of the Iranian election. The Economist published an interesting article comparing old and new media responses to this, entitled: Twitter 1, CNN 0. Their final comments on this area were:

[…]the much-ballyhooed Twitter swiftly degraded into pointlessness. By deluging threads like Iranelection with cries of support for the protesters, Americans and Britons rendered the site almost useless as a source of information—something that Iran’s government had tried and failed to do. Even at its best the site gave a partial, one-sided view of events. Both Twitter and YouTube are hobbled as sources of news by their clumsy search engines.

Much more impressive were the desk-bound bloggers. Nico Pitney of the Huffington Post, Andrew Sullivan of the Atlantic and Robert Mackey of the New York Times waded into a morass of information and pulled out the most useful bits. Their websites turned into a mish-mash of tweets, psephological studies, videos and links to newspaper and television reports. It was not pretty, and some of it turned out to be inaccurate. But it was by far the most comprehensive coverage available in English. The winner of the Iranian protests was neither old media nor new media, but a hybrid of the two.

Aside from the IT person in me noticing the opportunity to increase the value of Twitter via improved text analytics (see my earlier article, Literary calculus?), these types of issues raise concerns in my mind. To balance this slightly negative perspective it is worth noting that both accurate and informed tweets have preceded several business events, notably the recent closure of BI start-up LucidEra.

Also, mainstream media seem to have swallowed the line that Google has developed its own operating system in Chrome OS (rather than lashing the pre-existing Linux kernel on to its browser); maybe it just makes a better story. Blogs and Twitter were far more incisive in their commentary about this development.

Considering the pros and cons, on balance the author remains something of a doubting Thomas (by name as well as nature) about placing too much reliance on Twitter for news; at least as yet.
Accuracy and Business Intelligence

A balancing act

Some business thoughts leaked into the final paragraph of the Introduction above, but I am interested more in the concept of accuracy as it pertains to one of my core areas of competence – business intelligence. Here there are different views expressed. Some authorities feel that the most important thing in BI is to be quick with information that is good-enough; the time taken to achieve undue precision being the enemy of crisp decision-making. Others insist that small changes can tip finely-balanced decisions one way or another and so precision is paramount. In a way that is undoubtedly familiar to regular readers, I straddle these two opinions. With my dislike for hard-and-fast recipes for success, I feel that circumstances should generally dictate the approach.

There are of course different types of accuracy. There is that which insists that business information reflects actual business events (often more a case for work in front-end business systems rather than BI). There is also that which dictates that BI systems reconcile to the penny to perhaps less functional, but pre-existing scorecards (e.g. the financial results of an organisation).
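To make the second type of accuracy a little more concrete, here is a minimal sketch of what reconciling “to the penny” might look like. It is an illustration only: the function name reconcile, the account names and the figures are all hypothetical, and a real reconciliation would be driven by an organisation’s own ledger and BI data rather than by hard-coded dictionaries.

```python
from decimal import Decimal


def reconcile(bi_totals, ledger_totals, tolerance=Decimal("0.00")):
    """Return the accounts whose BI total differs from the ledger total.

    Both inputs are hypothetical dictionaries mapping an account name to a
    Decimal amount. An account missing from one side is treated as zero.
    The result maps each mismatched account to (BI total - ledger total).
    """
    mismatches = {}
    for account in sorted(set(bi_totals) | set(ledger_totals)):
        bi_amount = bi_totals.get(account, Decimal("0"))
        ledger_amount = ledger_totals.get(account, Decimal("0"))
        difference = bi_amount - ledger_amount
        # Only report differences that exceed the permitted tolerance.
        if abs(difference) > tolerance:
            mismatches[account] = difference
    return mismatches


# Illustrative figures only; Decimal avoids binary floating-point rounding.
bi = {"premiums": Decimal("1250000.10"), "claims": Decimal("830000.00")}
ledger = {"premiums": Decimal("1250000.10"), "claims": Decimal("830000.01")}

print(reconcile(bi, ledger))  # {'claims': Decimal('-0.01')}
```

Setting the tolerance to zero corresponds to the “to the penny” standard, whereas a non-zero tolerance is one way of expressing a “good enough” approach. Amounts are held as Decimal rather than float so that a one-penny difference is neither hidden nor invented by rounding.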

A number of things can impact accuracy, including, but not limited to: how data has been entered into systems; how that data is transformed by interfaces; differences between terminology and calculation methods in different data sources; misunderstandings by IT people about the meaning of business data; errors in the extract, transform and load logic that builds BI solutions; and sometimes even the decisions about how information is portrayed in BI tools themselves. I cover some of these in my previous piece Using BI to drive improvements in data quality.

However, one thing that I think differentiates enterprise BI from departmental BI (or indeed predictive models or other types of analytics), is a greater emphasis on accuracy. If enterprise BI is to aspire to becoming the single version of the truth for an organisation, then much more emphasis needs to be placed on accuracy. For information that is intended to be the yardstick by which a business is measured, good enough may fall short of the mark. This is particularly the case where a series of good enough solutions are merged together; the whole may be even less than the sum of its parts.

A focus on accuracy in BI also achieves something else. It stresses an aspiration to excellence in the BI team. Such aspirations tend to be positive for groups of people in business, just as they are for sporting teams. Not everyone who dreams of winning an Olympic gold medal will do so, but trying to make such dreams a reality generally leads to improved performance. If the central goal of BI is to improve corporate performance, then raising the bar for the BI team’s own performance is a great place to start and aiming for accuracy is a great way to move forward.

A final thought: England went on to beat Australia by precisely 115 runs in the second Test at Lord’s; the final result coming today at precisely 12:42 pm British Summer Time. The accuracy of England’s bowling was a major factor. Maybe there is something to learn here.