I was flattered to be included in the recent list of the 23 most influential BI bloggers published by BI Software Insight. To be 100% honest, I was also a little surprised as, due to other commitments, this blog has received very little of my attention in recent years. Taking a glass half full approach, maybe my content stands the test of time; it would be nice to think so.
It was also good to be in the company of various members of the BI community whose work I respect and several of whom I have got to know on-line or in person. These include (as per the original article, in no particular order):
* You can see Bruno and me talking on Microsoft’s YouTube channel here.
BI Software Insight helps organizations make smarter purchasing decisions on Business Intelligence Software. Their team of experts helps organizations find the right BI solution with expert reviews, objective resource guides, and insights on the latest BI news and trends.
This article is the first of three which address how to formulate an Information Strategy. I have written a number of other articles which touch on this subject and have also spoken about the topic. However I realised that I had never posted an in-depth review of this important area. This series of articles seeks to remedy this omission.
Part I – Generalities explores the nature of strategy, lays some foundations and presents a framework of questions which will need to be answered in order to form any general strategy. The forthcoming Part II – Situational Analysis adapts the first part of this general framework – The Situational Analysis – to the task of starting to form an Information Strategy. The final chapter, Part III – Completing the Strategy, rounds out this process by working through the rest of the general framework and explaining how this can be used to produce a fully-formed Information Strategy.
As with all of my other articles, this essay is not intended as a recipe for success, a set of instructions which – if slavishly followed – will guarantee the desired outcome. Instead the reader is invited to view the following as a set of observations based on what I have learnt during a career in which the development of both Information Strategies and technology strategies in general have played a major role.
What is a Strategy?
This would seem to be a relatively easy question to answer as the word is used (more likely over-used) in many areas of human endeavour and in business in particular. Let’s start by seeing if we can reach a consensus by the power of Google:
Strategy (from Greek στρατηγία stratēgia, “art of troop leader; office of general, command, generalship”) is a high level plan to achieve one or more goals under conditions of uncertainty. [and also later in the same article] Max McKeown (2011) argues that “strategy is about shaping the future” and is the human attempt to get to “desirable ends with available means”.
/stráttiji/ n.1 the art of war. 2 a the management of an army or armies in a campaign. b the art of moving troops, ships, aircraft, etc. into favourable positions (cf. TACTICS). c an instance of this or a plan formed according to it. 3 a plan of action or policy in business or politics etc. (economic strategy) [F stratégie f. Gk stratēgia generalship f. stratēgos]
[…] a method or plan chosen to bring about a desired future, such as achievement of a goal or solution to a problem. […]
So – assuming we decide, in the context of this blog, that the objective is probably not to better order military affairs, or to teach infantry men and women to write MDX – some sort of loose consensus emerges from the various definitions above. It seems that a strategy is something which seeks to influence the future, to bring about some conditions or cause an event, neither of which would manifest themselves without some action being taken. I am going to adopt the definition that a strategy is a method to achieve some future objective; or at least to make the realisation of this aim more likely. This means that a strategy implies change. If the situation is now X, then after the strategy has been successfully enacted the situation will be Y.
A Metaphor for Strategy
The role of change in strategy leads me to think about strategy formulation in the following way. I think of situation X (the current one) as a place on a map. Then situation Y (the desired one) is a second place on the same map. We are at X and we want to get to Y; we have a starting point and a destination, an origin and a terminus. The shortest distance between two points is of course a straight line. However a straight line between X and Y may not exist (there could be a lake in between with no method to traverse this), or it might not be the quickest route (if the line passes over an intervening mountain, which could instead be more quickly circumnavigated). In general there may be more than one route between X and Y and each may have its advantages and disadvantages. I tend to think of strategy formation as the process by which the best (or, if this is all that is achievable, least bad) route is established.
Of course a challenge here is that – outside the realms of mathematics (or indeed SatNav) – there may not be an optimum route and equally there may be no optimum strategy. Even if an optimum strategy does exist, the strategist may not have enough information to hand to discern this. Also, while effecting change is the objective of a strategy, this aim may itself be impacted by change; to employ our metaphor of travel, change to the destination, or to the territory in between. This may mean that the route (the strategy) must be adjusted, or in some cases wholly abandoned in favour of a different approach. Strategy formulation has some science-like qualities and I will focus on some of these shortly. However, for the reasons just put forward (and indeed others we will examine later), elements of the strategy formation process can sometimes be more of an art form.
Of course another problem could be that you don’t have a map!
Having introduced a geographic quality to describing strategy formation, I’ll leverage this analogy for the rest of the article. However, first a slight detour to establish the credentials of your guide to the terrain of Information Strategy; namely me. Any readers who are already familiar with my work are encouraged to scroll past the next section.
So what do I know about Information Strategies anyway?
I have worked in IT for over a quarter of a century with much of that related to turning data into information. Indeed one of my early tasks during my first job at a software house was to help design and develop the automated Balance Sheet and Profit and Loss statements provided as part of the company’s flagship product. These took the transactions entered into the company’s General Ledger system and assembled them into sensible Financial statements, which could be sliced and diced by period, cost centre or project code. However, my full initiation into the related areas of Business Intelligence and Data Warehousing did not come until the beginning of 2000, when I was asked to establish a Management Information function for a pan-European insurance organisation. This means that I don’t reach my 15-year BI/DW milestone until New Year (actually probably some point in the middle of January 2015).
Having both developed and executed an Information Strategy for the European part of this company, I extended both of these processes to encompass Latin America. I then developed a broader Information Strategy which included all of their International operations. It is gratifying to note that this strategy still guides information provision at this organisation to this day. After this, I went on to shape Information Strategies for other companies in sectors such as Manufacturing, Retail and back to Reinsurance / Insurance again. In each of these cases, I either saw the execution of these strategies through to at least their first delivery, or the programmes of work that I crafted were then executed by the teams that I had built.
There are many good resources available in printed form and on-line for those who want to understand various approaches to general strategy formulation. For readers who are interested in strategy outside of a technology context, and specifically outside of the area of Information Strategy, Google is your friend. For anyone who is still with us: while I would not claim to be an all-purpose strategy guru, I think that it is worth starting by presenting some general questions that pertain to the area of strategy formation. I am going to cast these in the shape of the geographic / journey metaphor that I developed above. Adopting this framework, any general strategy will have to answer the following questions:
Where are we?
Answering this question is the province of a Situational Analysis. Such a study will highlight what is good about the current situation as well as what needs to be changed.
Where do we want to be instead and why?
Here it is useful to consider two things: first Drivers for Change (which may emerge from the Situational Analysis); second a further question, What does good look like? This area is thus a mixture of what is wrong with the current situation and what would be good about the one proposed as the objective of the strategy.
How do we get there, how long will it take and what will it cost?
Thinking of the most perfect of destinations is going to be of little use if it costs too much to get there or the journey time is prohibitive. Here the strategist needs to get more concrete and consider realistic estimates of time and money.
Will the trip be worth it?
There is a relationship here to areas covered under the earlier bullet points, but answering this question will normally require some sort of cost/benefit analysis. In describing what good looks like, many potential benefits may be articulated; here there is a need to quantify them as far as possible.
What else can we do along the way?
Some might quibble at the inclusion of this item. However I think that the metaphor of a journey lends itself to considering what tactical work can help buttress the central activities of the strategy.
The framing of the above in terms pertinent to a journey may not be familiar, but I think that it is useful. This metaphor also has the benefit of alluding to what is inevitably the case with each of strategy development, strategy execution and the most worthwhile of journeys; they seldom happen overnight.
Having laid some general foundations, the next article in this series, Part II [to be published], will begin to be more specific and consider how these questions can be applied to forming the first element of an Information Strategy, a Situational Analysis.
Assuming Euclidean Geometry, if not then maybe try this instead.
Here I am using the term generally rather than in the sense of information generated by systems, which even today remains mostly numeric; albeit that other forms of data streaked ahead of dowdy old numbers some time ago. Numeric data tends to be somewhat easier to transform into information than non-numeric; at least for now.
Perhaps it is worth introducing a note of caution about the over-extension of analogies here – I do this in an earlier article bearing the same name.
To this day, I have a compulsion to write “dice and slice” as opposed to “slice and dice”, despite the latter being a more logical sequence of events when approaching – say – a butternut squash.
I am looking forward to my engraved TDWI decanter immensely.
Sometimes the current situation is so bad that simply addressing its shortcomings is enough work for a strategy to consider. More often a strategy will look to add value beyond just remediating current issues.
Though I can hardly claim to be the first person to come up with this metaphor.
It is so often stated as to have become a truism of sorts that on-line interactions, particularly those via social media, displace what is termed “real world” or “face to face” interactions. My view is that this perspective, rather than being self-evidently true, is actually apocryphal. I am sure that there are examples of people who have become more isolated (in a physical sense) through use of social media; those who are engaged in a zero-sum game where time spent on-line is at the expense of being around other humans. However, most communications media can be accused of the same thing, though I am not aware that anyone ever told Jane Austen to stop wasting her time writing letters and instead get out and meet people. It wasn’t so long ago that people, particularly younger people, were berated for spending so much time on the ’phone; even back when those were connected to a wall socket by a wire. The same barbs were thrown (and still are) at what we now call Video Games; another area which I admit has occupied a lot of my time in other periods of my life.
There is however a different way of looking at this supposed issue. As I explain in my now rather antiquated review of the Twitterverse:
I have been involved in running web-sites and various on-line communities since 1999.
I think that Twitter.com can be an extremely useful way of interacting with people, expanding your network and coming into contact with interesting new people.
I have indeed come into contact with a wide range of different people through my, admittedly rather intermittent, use of what we now call social media. Importantly, a lot of these people are based in parts of the world, or even parts of my own country, where our paths would have been unlikely to cross. I suppose that a case could be made that any time I spend writing or reading blog articles, or talking to people on Twitter or LinkedIn, could instead have been more profitably employed sitting on a barstool; perhaps in the hope that someone with complementary interests would start talking to me. However, this does seem to be a doubtful assertion to make. As with most things in life (except chocolate of course) balance is the key. If you spend all of your time on social media (or indeed all of your time in bars) you will rule out some social experiences. If instead you spend some time on social media as part of a healthy, balanced diet, then this should lead to a wider range of associates and sometimes even friends. It is also a pretty frictionless way to find people who are passionate about the things that you are passionate about; or indeed to find out why people are passionate about areas that you think might be interesting.
I mention above that – despite the observations I make later in the same paragraph – my own use of social media has been sporadic. Having made some progress in understanding elements of the area at an earlier stage of its evolution, I find that jumping back in now can feel a little daunting. These fears have been somewhat allayed by reconnecting with a lot of people who still seem interested in me and what I have to say. I have also connected with some new people and acknowledging this second occurrence is the actual purpose of this article.
First, I’d like to offer thanks to Ontario-based Pauline Cabrera (@twelveskip) of twelveskip.com. Pauline describes herself thus on Twitter:
Savvy Digital Strategist / Blogger / Web Designer / Virtual Assistant (http://GeekyVA.com). I dig #SEO, blogging, social media & content marketing.
I found Pauline’s web-site when I was thinking about sprucing up my Twitter header and looking for some advice. Pauline’s observations were clear and helpful, but while I get by OK in creating images (both in a business context and with many of the diagrams on this site), I am not a graphic designer. Given Pauline’s greater experience, I decided to reach out to her. The fruits of this interaction can now be viewed on my Twitter site, @peterjthomas.
Pauline and I reached a commercial arrangement, so I’m not suggesting that the kindness of strangers always means doing stuff for free. However, while I am sure many other people provide the services that Pauline does, I’m equally confident that very few do so with such speed and professionalism. When you couple these attributes with her being ultra-friendly and displaying an evident delight in doing what she does, you end up with someone it is a pleasure to do business with.
I mentioned that Pauline resides in Canada and I live in the UK; we wouldn’t have bumped into each other without those modern inventions of the Internet, search engines, web-sites and (the subject of the search that allowed me to find Pauline) Twitter.
My main work-related areas of interest are in developing self-service interactive, dynamic reports for Web and Mobile (most notably iPad). I currently develop using MicroStrategy in the Cloud with Netezza.
Michael and I also share a mutual connection in Cindi Howson (@BIScorecard) of BI Scorecard. Despite this, I had not been aware of Michael’s work until recently. I did however connect with him via his web-site. Today he has been kind enough to feature the data visualisation piece I wrote on his blog. It is always gratifying when a fellow professional thinks that your work merits sharing with their network.
In this case, Michael is based in Arizona. The chances of us bumping into each other, except through us both blogging, would have been slim as well.
The kindness that I wanted to point out here is the diligence with which Simon responds to comments on his site. Of course, on a personal note, there is always a frisson of excitement when someone whose work you admire and who is also something of a public figure in the UK replies to you directly as Simon has to me. Politeness and consideration for others pre-date the Internet of course, but treating people reasonably gets you a long way in social media. As Simon seems to do this naturally, I am sure this characteristic will stand him in good stead.
I can’t claim that Simon lives a long way from me; his home in Norfolk is pretty close to my current one in Cambridge. However, despite having read his articles for years, it was only once Simon established a web presence that the opportunity to correspond opened up.
So, in the couple of weeks during which I have dipped my toe back into the social media water, I have had the privilege to connect (in a number of different ways) with the three people that I mention above. Each of Pauline, Michael and Simon is on-line for different reasons and each has different things to say about very different areas. However, I am interested in what each of them does, as are many other people around the world. It’s hard to imagine an easier way in which I could have formed connections with these three people, one from Canada, one from the US and one from my native UK, than via the Internet and – in these cases – Twitter and Blogging. I think these are useful facts to remember in the face of accusations that social media makes people insular, closed-off and lonely. It may do that to some people, but this is a million miles away from my own experiences and – I strongly suspect – those of many of the people who are now able to access a wider world through their keyboards or touchscreens.
As a picture is said to paint a thousand words, I’ll (mostly) leave it to Scienceogram’s infographic to deliver the message.
However, The Center for Responsive Politics (I have no idea whether or not they have a political affiliation; they claim to be nonpartisan) estimates the cost of the recent US Congressional elections at around $3.67 bn (€2.93 bn). I found a lower (but still rather astonishing) figure of $1.34 bn (€1.07 bn) at the Federal Election Commission web-site, but suspect that this number excludes Political Action Committees and their like.
To make a European comparison to a European space project, the Common Agriculture Policy cost €57.5 bn ($72.0 bn) in 2013 according to the BBC. Given that Rosetta’s costs were spread over nearly 20 years, it makes sense to move the decimal point rightwards one place in both the euro and dollar figures and then to double the resulting numbers before making comparisons (this is left as an exercise for the reader).
Of course I am well aware that a quick Google could easily produce figures (such as how many meals, or vaccinations, or so on you could get for €1.4 bn) making points that are entirely antipodal to the ones presented. At the end of the day we landed on a comet and will – fingers crossed – begin to understand more about the formation of the Solar System and potentially Life on Earth itself as a result. Whether or not you think that is good value for money probably depends mostly on what sort of person you are. As I relate in a previous article, infographics only get you so far.
Scienceogram provides précis [correct plural] of UK science spending, giving overviews of how investment in science compares to the size of the problems it’s seeking to solve.
Having enjoyed Simon’s sports journalism (particularly his insightful and amusing commentary on Test Match cricket) for many years, I was interested to learn about this new book via his web-site. As an avid consumer of pop-science literature and already being aware of Simon’s considerable abilities as a writer, I was keen to read Ten Million Aliens. To be brief, I would recommend the book to anyone with an enquiring mind, an interest in the natural world and its endless variety, or just an affection for good science writing. My only sadness was that the number of phyla eventually had to come to an end. I laughed in places; in others I finished a chapter better informed than when I started it. The autobiographical anecdotes and other general commentary on the state of our stewardship of the planet added further dimensions. I look forward to Simon’s next book.
Instead this piece contains some general musings which came to mind while reading Ten Million Aliens and – as is customary – applies some of these to my own fields of professional endeavour.
Regular readers of this blog will be aware of my affection for Cricket and also my interest in Science. Simon Barnes’s work spans both of these passions. I became familiar with Simon’s journalism when he was Chief Sports Writer for The Times, an organ for which he wrote for over 32 years. Given my own sporting interests, I first read his articles specifically about Cricket and sometimes Rugby Union, but began to appreciate his writing in general and to consume his thoughts on many other sports.
There is something about Simon’s writing which I (and no doubt many others) find very engaging. He manages to be both insightful and amusing and displays both elegance of phrase and erudition without ever seeming to show off, or to descend into the overly-florid prose of which I can sometimes (OK often) be guilty. It also helps that we seem to share a favourite cricketer in the shape of David Gower, who appears above and was the most graceful batsman to have played for England in the last forty years. However, it is not Simon’s peerless sports writing that I am going to focus on here. For several years he also penned a wildlife column for The Times and is a patron of a number of wildlife charities. He has written books on, amongst other topics, birds, horses, his safari experiences and conservation in general.
My own interest in science merges into an appreciation of the natural world, perhaps partly also related to the amount of time I have spent in remote and wild places rock-climbing and bouldering. As I started to write this piece, some welcome November Cambridge sun threw shadows of the Greenfinches and Great Tits on our feeders across the monitor. Earlier in the day, my wife and I managed to catch a Lesser Spotted Woodpecker helping itself to our peanuts. Last night we stood on our balcony listening to two Tawny Owls serenading each other. Our favourite Corvidae family are also very common around here and we have had each of the birds appearing in the bottom row of the above image on our balcony at some point. My affection for living dinosaurs also extends to their cousins, the herpetiles, but that is perhaps a topic for another day.
Ten Million Aliens has the modest objectives, revealed by its sub-title, of saying something interesting about each of the (at the last count) thirty-five phyla of the Animal Kingdom and of providing some insights into a few of the thousands of families and species that make these up. Simon’s boundless enthusiasm for the life he sees around him (and indeed the life that is often hidden from all bar the most intrepid of researchers), his ability to bring even what might be viewed as ostensibly dull subject matter to life and a seemingly limitless trove of pertinent personal anecdotes, all combine to ensure not only that he achieves these objectives, but that he does so with some élan.
Classifications and Hierarchies
Well, having said that this article wasn’t going to be a book review, I guess it has borne a striking resemblance to one so far. Now to take a different tack; one which relates to three of the words that I referenced and provided links to in the last paragraph of the previous section: phylum, family and species. These are all levels in the general classification of life. At least one version of where these three levels fit into the overall scheme of things appears in the image above. Some readers may even be able to recall a related mnemonic from years gone by: Kings Play Chess on Fine Green Sand.
The father of modern taxonomy, Carl Linnaeus, founded his original biological classification – not unreasonably – on the shared characteristics of organisms; things that look similar are probably related. Relations mean that like things can be collected together into groups and that the groups can be further consolidated into super-groups. This approach served science well for a long time. However when researchers began to find more and more examples of convergent evolution, Linnaeus’s rule of thumb was seen to not always apply and complementary approaches also began to be adopted.
One of these approaches, called Cladistics, focuses on common ancestors rather than shared physical characteristics. Breakthroughs in understanding the genetic code provided impetus to this technique. The above diagram, referred to as a cladogram, represents one school of thought about the relationship between avian dinosaurs, non-avian dinosaurs and various other reptiles that I mentioned above.
It is at this point that the Business Intelligence professional may begin to detect something somewhat familiar. I am of course talking about both dimensions and organising these into hierarchies. Dimensions are the atoms of Business Intelligence and Data Warehousing. In Biological Classification: H. sapiens is part of Homo, which is part of Hominidae, which is part of Primates, which is part of Mammalia, which is part of Chordata, which then gets us back up to Animalia. In Business Intelligence: Individuals make up Teams, which make up Offices, which make up Countries and Regions.
Above I referenced different approaches to Biological Classification, one based on shared attributes, the other on homology of DNA. This also reminds me of the multiple ways to roll up dimensions. To pick the most obvious, Day rolls up to Month, Quarter, Half-Year and Year; but also in a different manner to Week and then Year. Given that the aforementioned DNA evidence has caused a reappraisal of the connections between many groups of animals, the structures of Biological Classification are not rigid and instead can change over time. Different approaches to grouping living organisms can provide a range of perspectives, each with its own benefits. In a similar way, good BI/DW design practices should account for both dimensions changing and the fact that different insights may well be provided by parallel dimension hierarchies.
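To make the idea of parallel hierarchies a little more concrete, here is a minimal sketch in Python (using pandas; the column names and sales figures are invented for illustration rather than drawn from any real schema) of a date dimension supporting both the calendar path (Day → Month → Quarter → Year) and the weekly path (Day → Week → Year); rolling a measure up either path then becomes just a different group-by:

```python
import numpy as np
import pandas as pd

# A miniature date dimension: each day carries attributes for two
# parallel hierarchies - the calendar path (month / quarter / year)
# and the weekly path (week / year).
days = pd.date_range("2014-01-01", "2014-12-31", freq="D")
date_dim = pd.DataFrame({
    "day": days,
    "month": days.to_period("M").astype(str),    # Day -> Month
    "quarter": days.to_period("Q").astype(str),  # Day -> Quarter
    "week": days.to_period("W").astype(str),     # Day -> Week (the other path)
    "year": days.year,                           # both paths meet again at Year
})

# A toy fact table of daily sales keyed on the same dates.
rng = np.random.default_rng(0)
facts = pd.DataFrame({"day": days, "sales": rng.integers(100, 200, len(days))})

# Join facts to the dimension, then roll up along either hierarchy.
joined = facts.merge(date_dim, on="day")
by_quarter = joined.groupby("quarter")["sales"].sum()  # calendar roll-up
by_week = joined.groupby("week")["sales"].sum()        # weekly roll-up
print(by_quarter)
print(by_week.head())
```

The same fact rows answer both questions; only the route up the dimension differs, which is precisely the point about parallel hierarchies above.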
In summary, I suppose what I am saying is that BI/DW practitioners, as well as studying the works of Inmon and Kimball, might want to consider expanding their horizons to include Barnes; to say nothing of Linnaeus. They might find something instructive in these other taxonomical works.
Articles from this blog in which I intertwine Cricket and aspects of business, technology and change include (in chronological order):
Articles on this site which reference either Science or Mathematics are far too numerous to list in full. A short selection of the ones I enjoyed writing most would include (again in chronological order):
Though this elides both Domains and Johnny-come-latelies like super-families, sub-genuses and hyper-orders [I may have made that last one up of course].
For example the wings of Pterosaurs, Birds and Bats.
No pun intended.
This metaphor becomes rather cumbersome when one tries to extend it to cover measures. It’s tempting to perhaps align these with fundamental forces, and thus bosons as opposed to combinations of fermions, but the analogy breaks down pretty quickly, so let’s conveniently forget that multidimensional data structures have fact tables at their hearts for now.
Here I am going to strive manfully to avoid getting embroiled in discussions about domains, superregnums, superkingdoms, empires, or regios and instead leave the interested reader to explore these areas themselves if they so desire. Ten Million Aliens itself could be one good starting point, as could the following link.
Science is yet to determine whether these slowly changing dimensions are of Type 1, 2, 3 or 4 (it has however been definitively established that they are not Type 6 / Hybrid).
The above diagram was compiled by Florence Nightingale, who was – according to The Font – “a celebrated English social reformer and statistician, and the founder of modern nursing”. It is gratifying to see her less high-profile role as a number-cruncher acknowledged up-front and central; particularly as she died in 1910, eight years before women in the UK were first allowed to vote and eighteen before universal suffrage. This diagram is one of two which are generally cited in any article on Data Visualisation. The other is Charles Minard’s exhibit detailing the advance on, and retreat from, Moscow of Napoleon Bonaparte’s Grande Armée in 1812 (Data Visualisation had a military genesis in common with – amongst many other things – the internet). I’ll leave the reader to look at this second famous diagram if they want to; it’s just a click away.
While there are more elements of numeric information in Minard’s work (what we would now call measures), there is a differentiating point to be made about Nightingale’s diagram. This is that it was specifically produced to aid members of the British parliament in their understanding of conditions during the Crimean War (1853-56); particularly given that such non-specialists had struggled to understand traditional (and technical) statistical reports. Again, rather remarkably, we have here a scenario where the great and the good were listening to the opinions of someone who was barred from voting on the basis of lacking a Y chromosome. Perhaps more pertinently to this blog, this scenario relates to one of the objectives of modern-day Data Visualisation in business; namely explaining complex issues, which don’t leap off of a page of figures, to busy decision makers, some of whom may not be experts in the specific subject area (another is of course allowing the expert to discern less than obvious patterns in large or complex sets of data). Fortunately most business decision makers don’t have to grapple with the progression in number of “deaths from Preventible or Mitigable Zymotic diseases” versus “deaths from wounds” over time, but the point remains.
Data Visualisation in one branch of Science
Coming much more up to date, I wanted to consider a modern example of Data Visualisation. As with Nightingale’s work, this is not business-focused, but contains some elements which should be pertinent to the professional considering the creation of diagrams in a business context. The specific area I will now consider is Structural Biology. For the incognoscenti (no advert for IBM intended!), this area of science is focussed on determining the three-dimensional shape of biologically relevant macro-molecules, most frequently proteins or protein complexes. The history of Structural Biology is intertwined with the development of X-ray crystallography by Max von Laue and father and son team William Henry and William Lawrence Bragg; its subsequent application to organic molecules by a host of pioneers including Dorothy Crowfoot Hodgkin, John Kendrew and Max Perutz; and – of greatest resonance to the general population – Francis Crick, Rosalind Franklin, James Watson and Maurice Wilkins’s joint determination of the structure of DNA in 1953.
X-ray diffraction image of the double helix structure of the DNA molecule (commonly referred to as “Photo 51”), taken in 1952 by Raymond Gosling during work by Rosalind Franklin on the structure of DNA
While the masses of data gathered in modern X-ray crystallography need computer software to extrapolate them to physical structures, things were more accessible in 1953. Indeed, it could be argued that Gosling and Franklin’s famous image, its characteristic “X” suggestive of two helices and thus driving Crick and Watson’s model building, is another notable example of Data Visualisation; at least in the sense of a picture (rather than numbers) suggesting some underlying truth. In this case, the production of Photo 51 led directly to the creation of the even more iconic image below (which was drawn by Francis Crick’s wife Odile and appeared in his and Watson’s seminal Nature paper):
It is probably fair to say that the visualisation of data which is displayed above has had something of an impact on humankind in the sixty years since it was first drawn.
Modern Structural Biology
Today, X-ray crystallography is one of many tools available to the structural biologist with other approaches including Nuclear Magnetic Resonance Spectroscopy, Electron Microscopy and a range of biophysical techniques which I will not detain the reader by listing. The cutting edge is probably represented by the X-ray Free Electron Laser, a device originally created by repurposing the linear accelerators of the previous generation’s particle physicists. In general Structural Biology has historically sat at an intersection of Physics and Biology.
However, before trips to synchrotrons can be planned, the Structural Biologist often faces the prospect of stabilising their protein of interest, ensuring that they can generate sufficient quantities of it, successfully isolating the protein and finally generating crystals of appropriate quality. This process often consumes years, in some cases decades. As with most forms of human endeavour, there are few short-cuts and the outcome is at least loosely correlated to the amount of time and effort applied (though sadly with no guarantee that hard work will always be rewarded).
From the general to the specific
At this point I should declare a personal interest, the example of Data Visualisation which I am going to consider is taken from a paper recently accepted by the Journal of Molecular Biology (JMB) and of which my wife is the first author. Before looking at this exhibit, it’s worth a brief detour to provide some context.
In recent decades, the exponential growth in the breadth and depth of scientific knowledge (plus of course the velocity with which this can be disseminated), coupled with the increase in the range and complexity of techniques and equipment employed, has led to the emergence of specialists. In turn this means that, in a manner analogous to the early production lines, science has become a very collaborative activity; the expert in stage one hands over the fruits of their labour to the expert in stage two, and so on. For this reason the typical scientific paper (and certainly those in Structural Biology) will have several authors, often spread across multiple laboratory groups and frequently in different countries. By way of example, the previous paper my wife worked on had 16 authors (including a Nobel Laureate). In this context, the fact that the paper I will now reference was authored by just my wife and her group leader is noteworthy.
The reader may at this point be relieved to learn that I am not going to endeavour to explain the subject matter of my wife’s paper, nor the general area of biology to which it pertains (the interested are recommended to Google “membrane proteins” or “G Protein Coupled Receptors” as a starting point). Instead let’s take a look at one of the exhibits.
The above diagram (in common with Nightingale’s much earlier one) attempts to show a connection between sets of data, rather than just the data itself. I’ll elide the scientific specifics here and focus on more general issues.
First the grey upper section with the darker blots on it – which is labelled (a) – is an image of a biological assay called a Western Blot (for the interested, details can be viewed here); each vertical column (labelled at the top of the diagram) represents a sub-experiment on protein drawn from a specific sample of cells. The vertical position of a blot indicates the size of the molecules found within it (in kilodaltons); the intensity of a given blot indicates how much of the substance is present. Aside from the headings and labels, the upper part of the figure is a photographic image and so essentially analogue data. So, in summary, this upper section represents the findings from one set of experiments.
At the bottom – and labelled (b) – appears an artefact familiar to anyone in business, a bar-graph. This presents results from a parallel experiment on samples of protein from the same cells (for the interested, this set of data relates to the degree to which proteins in the samples bind to a specific radiolabelled ligand). The second set of data is taken from what I might refer to as a “counting machine” and is thus essentially digital. To be 100% clear, the bar chart is not a representation of the data in the upper part of the diagram, it pertains to results from a second experiment on the same samples. As indicated by the labelling, for a given sample, the column in the bar chart (b) is aligned with the column in the Western Blot above (a), connecting the two different sets of results.
Taken together the upper and lower sections establish a relationship between the two sets of data. Again I’ll skip the specifics, but the general point is that while the Western Blot (a) and the binding assay (b) tell us the same story, the Western Blot is a much more straightforward and speedy procedure. The relationship that the paper establishes means that just the Western Blot can be used to perform a simple new assay which will save significant time and effort for people engaged in the determination of the structures of membrane proteins; a valuable new insight. Clearly the relationships that have been inferred could equally have been presented in a tabular form instead and be just as relevant. It is however testament to the more atavistic side of humans that – in common with many relationships between data – a picture says it more surely and (to mix a metaphor) more viscerally. This is the essence of Data Visualisation.
What learnings can Scientific Data Visualisation provide to Business?
Using the JMB exhibit above, I wanted to now make some more general observations and consider a few questions which arise out of comparing scientific and business approaches to Data Visualisation. I think that many of these points are pertinent to analysis in general.
Normalisation

Broadly, normalisation consists of defining results in relation to some established yardstick (or set of yardsticks); displaying relative, as opposed to absolute, numbers. In the JMB exhibit above, the amount of protein solubilised in various detergents is shown with reference to the un-solubilised amount found in native membranes; these reference figures appear as 100% columns to the right and left extremes of the diagram.
The most common usage of normalisation in business is growth percentages. Here the fact that London business has grown by 5% can be compared to Copenhagen having grown by 10%, despite total London business being 20 times the volume of Copenhagen’s. A related business example, depending on implementation details, could be comparing foreign currency amounts at a fixed exchange rate to remove the impact of currency fluctuation.
Normalised figures are very typical in science, but, aside from the growth example mentioned above, considerably less prevalent in business. In both avenues of human endeavour, the approach should be used with caution; something that increases 200% from a very small starting point may not be relevant, be that the result of an experiment or weekly sales figures. Bearing this in mind, normalisation is often essential when looking to present data of different orders of magnitude on the same graph; the alternative often being that smaller data is swamped by larger, which is not always desirable.
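By way of a worked example, here is a minimal sketch in Python of the growth-percentage flavour of normalisation; the figures are invented, but they reproduce the London / Copenhagen comparison above – expressing each city’s sales relative to its own prior-period baseline lets a 5% and a 10% sit comfortably side by side despite the 20-fold difference in absolute volume:

```python
# Invented figures: London is roughly 20 times Copenhagen in absolute
# volume, which would swamp Copenhagen on any shared absolute-scale graph.
sales = {
    "London":     {"prior": 20_000_000, "current": 21_000_000},
    "Copenhagen": {"prior":  1_000_000, "current":  1_100_000},
}

for city, figures in sales.items():
    # Normalise to the city's own baseline: a relative, not absolute, number.
    growth = (figures["current"] - figures["prior"]) / figures["prior"] * 100
    print(f"{city}: {growth:+.1f}% growth")

# London: +5.0% growth
# Copenhagen: +10.0% growth
```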
Controls

I’ll use an anecdote to illustrate this area from a business perspective. Imagine an organisation which (as you would expect) tracks the volume of sales of a product or service it provides via a number of outlets. Imagine further that it launches some sort of promotion, perhaps valid only for a week, and notices an uptick in these sales. It is extremely tempting to state that the promotion has resulted in increased sales.
However this cannot always be stated with certainty. Sales may have increased for some totally unrelated reason such as (depending on what is being sold) good or bad weather, a competitor increasing prices or closing one or more of their comparable outlets and so on. Equally perniciously, the promotion may have simply moved sales in time – people may have been going to buy the organisation’s product or service in the weeks following a promotion, but have brought the expenditure forward to take advantage of it. If this is indeed the case, an uptick in sales may well be due to the impact of a promotion, but will be offset by a subsequent decrease.
In science, it is this type of problem that the concept of control tests is designed to combat. As well as testing a result in the presence of substance or condition X, a well-designed scientific experiment will also be carried out in the absence of substance or condition X, the latter being the control. In the JMB exhibit above, the controls appear in the columns with white labels.
There are ways to make the business “experiment” I refer to above more scientific of course. In retail business, the current focus on loyalty cards can help, assuming that these can be associated with the relevant transactions. If the business is on-line then historical records of purchasing behaviour can be similarly referenced. In the above example, the organisation could decide to offer the promotion at only a subset of its outlets, allowing a comparison to those where no promotion applied. This approach may improve rigour somewhat, but of course it does not cater for purchases transferred from a non-promotion outlet to a promotion one (unless a whole raft of assumptions are made). There are entire industries devoted to helping businesses deal with these rather messy scenarios, but it is probably fair to say that it is normally easier to devise and carry out control tests in science.
The general take away here is that a graph which shows some change in a business output (say sales or profit) correlated to some change in a business input (e.g. a promotion, a new product launch, or a price cut) would carry a lot more weight if it also provided some measure of what would have happened without the change in input (not that this is always easy to measure).
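For the sake of illustration, the following minimal sketch (Python again; the outlets and figures are wholly invented) shows the shape that such a business “control” calculation might take. A real analysis would, as noted above, also need to worry about purchases transferred between outlets and much else besides:

```python
# Invented weekly sales, before and during a one-week promotion.
# Half of the outlets ran the promotion; the other half act as controls.
outlets = {
    # name:      (baseline week, promotion week, ran promotion?)
    "Oxford St": (1000, 1180, True),
    "Camden":    ( 800,  950, True),
    "Richmond":  ( 900,  930, False),
    "Greenwich": ( 700,  720, False),
}

def pct_change(before, after):
    return (after - before) / before * 100

promo = [pct_change(b, a) for b, a, ran in outlets.values() if ran]
control = [pct_change(b, a) for b, a, ran in outlets.values() if not ran]

# The control outlets estimate what would have happened anyway (weather,
# competitors and so on); the difference is the uplift more plausibly
# attributable to the promotion itself.
avg_promo = sum(promo) / len(promo)
avg_control = sum(control) / len(control)
print(f"Promotion outlets: {avg_promo:+.1f}%")
print(f"Control outlets:   {avg_control:+.1f}%")
print(f"Estimated uplift:  {avg_promo - avg_control:+.1f} percentage points")
```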
Rigour and Scrutiny
I mention in the footnotes that the JMB paper in question includes versions of the exhibit presented above for four other membrane proteins, this being in order to firmly establish a connection. Looking at just the figure I have included here, each element of the data presented in the lower bar-graph area is based on duplicated or triplicated tests, with average results (and error bars – see the next section) being shown. When you consider that upwards of three months’ preparatory work could have gone into any of these elements and that a mistake at any stage during this time would have rendered the work useless, some impression of the level of rigour involved emerges. The result of this assiduous work is that the authors can be confident that the exhibits they have developed are accurate and will stand up to external scrutiny. Of course such external scrutiny is a key part of the scientific process and the manuscript of the paper was reviewed extensively by independent experts before being accepted for publication.
In the business world, such external scrutiny tends to apply most frequently to publicly published figures (such as audited Financial Accounts); of course external financial analysts will also look to dig into figures. There may be some internal scrutiny around both the additional numbers used to run the business and the graphical representations of these (and indeed some companies take this area very seriously), but not every internal KPI is vetted the way that the report and accounts are. Particularly in the area of Data Visualisation, there is a tension here. Graphical exhibits can have a lot of impact if they relate to the current situation or present trends; contrariwise, if they are substantially out-of-date, people may question their relevance. There is sometimes the expectation that a dashboard is just like its aeronautical counterpart, showing real-time information about what is going on now. However a lot of the value of Data Visualisation is not about the here and now so much as trends and explanations of the factors behind the here and now. A well-thought-out graph can tell a very powerful story, more powerful for most people than a table of figures. However a striking graph based on poor quality data, data which has been combined in the wrong way, or even – as sometimes happens – the wrong datasets entirely, can tell a very misleading story and lead to the wrong decisions being taken.
I am not for a moment suggesting here that every exhibit produced using Data Visualisation tools must be subject to months of scrutiny. As referenced above, in the hands of an expert such tools have the value of sometimes quickly uncovering hidden themes or factors. However, I would argue that – as in science – if the analyst involved finds something truly striking, an association which he or she feels will really resonate with senior business people, then double- or even triple-checking the data would be advisable. Asking a colleague to run their eye over the findings and to then probe for any obvious mistakes or weaknesses sounds like an appropriate next step. Internal Data Visualisations are never going to be subject to peer-review, however their value in taking sound business decisions will be increased substantially if their production reflects at least some of the rigour and scrutiny which are staples of the scientific method.
Dealing with Uncertainty
In the previous section I referred to the error bars appearing on the JMB figure above. Error bars are acknowledgements that what is being represented is variable and they indicate the extent of such variability. When dealing with a physical system (be that mechanical or – as in the case above – biological), behaviour is subject to many factors, not all of which can be eliminated or adjusted for and not all of which are predictable. This means that repeating an experiment under ostensibly identical conditions can lead to different results. If the experiment is well-designed and if the experimenter is diligent, then such variability is minimised, but never eliminated. Error bars are a recognition of this fundamental aspect of the universe as we understand it.
While de rigueur in science, error bars seldom make an appearance in business, even – in my experience – in estimates of business measures which emerge from statistical analyses. Even outside the realm of statistically generated figures, more business measures are subject to uncertainty than might initially be thought. An example here might be a comparison (perhaps as part of the externally scrutinised report and accounts) of the current quarter’s sales to the previous one (or the same one last year). In companies where sales may be tied to – for example – the number of outlets, care is taken to make these figures like-for-like. This might include only showing numbers for outlets which were in operation in the prior period and remain in operation now (i.e. excluding sales from both closed outlets and newly opened ones). However, outside the area of high-volume low-value sales where the Law of Large Numbers rules, other factors could substantially skew a given quarter’s results for many organisations. Something as simple as a key customer delaying a purchase (so that it fell in Q3 this year instead of Q2 last) could have a large impact on quarterly comparisons. Again companies will sometimes look to include adjustments to cater for such timing or related issues, but this cannot be a precise process.
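As a minimal sketch of the like-for-like filtering just described (again in Python, with invented outlets and numbers), only outlets trading in both periods contribute to the comparison:

```python
# Invented quarterly sales by outlet; None marks a period in which an
# outlet was not trading (closed down, or not yet opened).
sales = {
    # outlet:  (same quarter last year, this quarter)
    "Leeds":   (500, 540),
    "Bristol": (450, 430),
    "York":    (None, 300),  # newly opened - excluded
    "Swansea": (380, None),  # closed - excluded
}

# Keep only outlets with figures in both periods.
like_for_like = {k: v for k, v in sales.items() if None not in v}

prior = sum(v[0] for v in like_for_like.values())
current = sum(v[1] for v in like_for_like.values())
print(f"Like-for-like growth: {(current - prior) / prior:+.1%} "
      f"across {len(like_for_like)} outlets")
```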
The main point I am making here is that many aspects of the information produced in companies are uncertain. The cash transactions in a quarter are of course the cash transactions in a quarter, but the above scenario suggests that they may not always 100% reflect actual business conditions (and you cannot adjust for everything). Equally, where you get into figures that would be part of most companies’ financial results – outstanding receivables and allowance for bad debts – the spectre of uncertainty arises again without a statistical model in sight. In many industries, regulators are pushing for companies to include more forward-looking estimates of future assets and liabilities in their Financials. While this may be a sensible reaction to recent economic crises, the approach inevitably leads to more figures being produced from models. Even when these models are subject to external review, as is the case with most regulatory-focussed ones, they are still models and there will be uncertainty around the numbers that they generate. While companies will often provide a range of estimates for things like guidance on future earnings per share, providing a range of estimates for historical financial exhibits is not really a mainstream activity.
Which perhaps gets me back to the subject of error bars on graphs. In general I think that their presence in Data Visualisations can only add value, not subtract it. In my article entitled Limitations of Business Intelligence I include the following passage which contains an exhibit showing how the Bank of England approaches communicating the uncertainty inevitably associated with its inflation estimates:
Business Intelligence is not a crystal ball, Predictive Analytics is not a crystal ball either. They are extremely useful tools […] but they are not universal panaceas.
[…] Statistical models will never give you precise answers to what will happen in the future – a range of outcomes, together with probabilities associated with each is the best you can hope for (see above). Predictive Analytics will not make you prescient, instead it can provide you with useful guidance, so long as you remember it is a prediction, not fact.
While I can’t see them figuring in formal financial statements any time soon, perhaps there is a case for more business Data Visualisations to include error bars.
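To close the point with something concrete, here is a minimal sketch (Python with matplotlib; the outlet-level figures are invented) of how error bars might be added to an everyday business bar chart, with bar heights as means across outlets and error bars as sample standard deviations. Whether standard deviations, standard errors or confidence intervals are the right choice would itself need some thought:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sales figures: rows are outlets, columns are quarters.
quarters = ["Q1", "Q2", "Q3", "Q4"]
outlet_sales = np.array([
    [100, 115,  95, 130],
    [ 90, 120,  85, 125],
    [110, 105, 100, 140],
])

means = outlet_sales.mean(axis=0)
spread = outlet_sales.std(axis=0, ddof=1)  # sample standard deviation

fig, ax = plt.subplots()
# The error bars acknowledge that each quarterly figure is the centre
# of a spread of outlet-level results, not a single immutable number.
ax.bar(quarters, means, yerr=spread, capsize=4)
ax.set_ylabel("Sales (hypothetical units)")
ax.set_title("Quarterly sales with error bars")
plt.show()
```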
So, as is often the case, I have embarked on a journey. I started with an early example of Data Visualisation, diverted in to a particular branch of science with which I have some familiarity and hopefully returned, again as is often the case, to make some points which I think are pertinent to both the Business Intelligence practitioner and the consumers (and indeed commissioners) of Data Visualisations. Back in “All that glisters is not gold” – some thoughts on dashboards I made some more general comments about the best Data Visualisations having strong informational foundations underpinning them. While this observation remains true, I do see a lot of value in numerically able and intellectually curious people using Data Visualisation tools to quickly make connections which had not been made before and to tease out patterns from large data sets. In addition there can be great value in using Data Visualisation to present more quotidian information in a more easily digestible manner. However I also think that some of the learnings from science which I have presented in this article suggest that – as with all powerful tools – appropriate discretion on the part of the people generating Data Visualisation exhibits and on the part of the people consuming such content would be prudent. In particular the business equivalents of establishing controls, applying suitable rigour to data generation / combination and including information about uncertainty on exhibits where appropriate are all things which can help make Data Visualisation more honest and thus – at least in my opinion – more valuable.
The list of scientists involved in the development of X-ray Crystallography and Structural Biology which was presented earlier in the text encompasses a further nine such laureates (four of whom worked at my wife’s current research institute), though sadly this number does not include Rosalind Franklin. Over 20 Nobel Prizes have been awarded to people working in the field of Structural Biology; you can view an interactive timeline of these here.
The intensity, size and position of blots are often digitised by specialist software, but this is an aside for our purposes.
Plus four other analogous exhibits which appear in the paper and relate to different proteins.
Normalisation has a precise mathematical meaning, actually (somewhat ironically for that most precise of activities) more than one. Here I am using the term more loosely.
That’s assuming you don’t want to get into log scales, something I have only come across once in over 25 years in business.
The uptick could be as compared to the week before, or to some other week (e.g. the same one last year or last month maybe) or versus an annual weekly average. The change is what is important here, not what the change is with respect to.
Of course some element of real-time information is indeed both feasible and desirable; for more analytic work (which encompasses many aspects of Data Visualisation) what is normally more important is sufficient historical data of good enough quality.
The incomparable Randall Munroe from xkcd.com has just knocked my earlier work into a cocked hat with his (perhaps unsurprisingly) much more laconic observations from last Friday, which are instead inspired by the recent cold snaps in the US: