A Picture Paints a Thousand Numbers

Charts

Introduction

The recent update of The Data & Analytics Dictionary featured an entry on Charts. Entries in The Dictionary are intended to be relatively brief [1] and also the layout does not allow for many illustrations. Given this, I have used The Dictionary entries as a basis for this slightly expanded article on the subject of chart types.

A Chart is a way to organise and Visualise Data with the general objective of making it easier to understand and – in particular – to discern trends and relationships. This article will cover some of the most frequently used Chart types, which appear in alphabetical order.

Note:
 
Categories, Values and Axes

Here an “axis” is a fixed reference line (sometimes invisible for stylistic reasons) which typically goes vertically up the page or horizontally from left to right across the page (but see also Radar Charts). Categories and values (see below) are plotted on axes. Most charts have two axes.

Throughout I use the word “category” to refer to something discrete that is plotted on an axis, for example France, Germany, Italy and The UK, or 2016, 2017, 2018 and 2019. I use the word “value” to refer to something more continuous plotted on an axis, such as sales or number of items etc. With a few exceptions, the Charts described below plot values against categories. Both Bubble Charts and Scatter Charts plot values against other values.

Series

I use “series” to mean sets of categories and values. So if the categories are France, Germany, Italy and The UK; and the values are sales; then different series may pertain to sales of different products by country.
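To make the terminology concrete, here is a minimal Python sketch of how categories, values and series might be held in memory. The product names and sales figures are invented purely for illustration, and `value_for` is a hypothetical helper, not part of any charting library.

```python
# Categories: discrete labels plotted along one axis.
categories = ["France", "Germany", "Italy", "The UK"]

# Each series pairs every category with a value (here, hypothetical sales).
series = {
    "Product A": [120, 150, 90, 110],
    "Product B": [80, 95, 130, 70],
}

def value_for(series_name: str, category: str) -> int:
    """Look up the value that a given series holds for a given category."""
    return series[series_name][categories.index(category)]

print(value_for("Product B", "Italy"))  # 130
```

Most charting tools expect data in roughly this shape: one shared list of categories and one list of values per series.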


Index

Bar & Column Charts
Bubble Charts
Cartograms
Histograms
Line Charts
Map Charts
Pie Charts
Radar Charts / Spider Charts
Scatter Charts
Tree Maps


Bar & Column Charts
Clustered Bar Charts, Stacked Bar Charts

Bar Chart

Bar Chart is the generic term, but this is sometimes reserved for charts where the categories appear on the vertical axis, with Column Charts being those where categories appear on the horizontal axis. In either case, the chart has a series of categories along one axis. Extending rightwards (or upwards) from each category is a rectangle whose width (height) is proportional to the value associated with this category. For example, if the categories related to products, then the size of the rectangle appearing against Product A might be proportional to the number sold, or the value of such sales.


|  © JMB (2014)  |  Used under a Creative Commons licence  |

The exhibit above, which is excerpted from Data Visualisation – A Scientific Treatment, is a compound one in which two bar charts feature prominently.

Clustered Column Chart with Products

Sometimes the bars are clustered to allow multiple series to be charted side-by-side, for example yearly sales for 2015 to 2018 might appear against each product category. Or – as above – sales for Product A and Product B may both be shown by country.

Stacked Bar Chart

Another approach is to stack bars or columns on top of each other, something that is sometimes useful when comparing how the make-up of something has changed.

See also: a section of As Nice as Pie



Bubble Charts

Bubble Chart

Bubble Charts are used to display three dimensions of data on a two dimensional chart. A circle is placed with its centre at a value on the horizontal and vertical axes according to the first two dimensions of data, but then the area (or less commonly the diameter [2]) of the circle reflects the third dimension. The result is reminiscent of a glass of champagne (then again, maybe this says more about the author than anything else).
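The difference between scaling by area and scaling by diameter is easy to get wrong, so here is a small sketch of the area-based approach. The function name `bubble_radius` and the `area_per_unit` parameter are my own inventions for illustration.

```python
import math

def bubble_radius(value: float, area_per_unit: float = 1.0) -> float:
    """Radius of a circle whose AREA is proportional to value."""
    if value < 0:
        raise ValueError("value must be non-negative")
    return math.sqrt(value * area_per_unit / math.pi)

# Quadrupling the value only doubles the radius, because it is the area
# (not the diameter) that carries the comparison.
small = bubble_radius(10)
large = bubble_radius(40)
print(large / small)
```

Had the diameter been made proportional to the value instead, the larger bubble would have sixteen times the area of the smaller one, which tends to exaggerate differences.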

Bubble Planets

You can also use bubble charts in a quite visceral way, as exemplified by the chart above. The vertical axis plots the number of satellites of the four giant planets in the Solar System. The horizontal axis plots the closest that they ever come to the Sun. The size of each bubble is proportional to the relative size of the planet it represents.

See also: Data Visualisation according to a Four-year-old



Cartograms

Cartogram

There does not seem to be a generally accepted definition of Cartograms. Some authorities describe them as any diagram using a map to display statistical data; I cover this type of general chart in Map Charts below. Instead I will define a Cartogram more narrowly as a geographic map where areas of map sections are changed to be proportional to some other value; resulting in a distorted map. So, in a map of Europe, the size of countries might be increased or decreased so that their new areas are proportional to each country’s GDP.

Cartogram

Alternatively the above cartogram of the United States has been distorted (and coloured) to emphasise the population of each state. The dark blue of California and the slightly less dark blues of Texas, Florida and New York dominate the map.



Histograms

Histogram

A Histogram is a type of Bar Chart (typically with categories along the horizontal axis) where the categories are bins (or buckets) and the bars are proportional to the number of items falling into a bin. For example, the bins might be ranges of ages, say 0 to 19, 20 to 39, 40 to 59 and 60+, and the bars appearing against each might be the UK female population falling into each bin.
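As a rough illustration of the binning step, the sketch below counts items into non-overlapping age bins. The bin edges, labels and sample ages are all assumptions made for the example; `bin_counts` is a hypothetical helper.

```python
from bisect import bisect_right

# Lower edge of each bin; the last bin is open-ended.
bin_edges = [0, 20, 40, 60]
bin_labels = ["0 to 19", "20 to 39", "40 to 59", "60+"]

def bin_counts(ages):
    """Count how many ages fall into each bin."""
    counts = [0] * len(bin_edges)
    for age in ages:
        if age < bin_edges[0]:
            raise ValueError(f"age {age} is below the first bin")
        # bisect_right finds the first edge greater than age;
        # the bin we want is the one just before it.
        counts[bisect_right(bin_edges, age) - 1] += 1
    return dict(zip(bin_labels, counts))

sample_ages = [3, 19, 20, 35, 41, 59, 60, 87]  # invented sample
print(bin_counts(sample_ages))
```

The important property is that every item lands in exactly one bin, which is why the bins must neither overlap nor leave gaps.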

Brexit Bar
UK Referendum on EU Membership – Voting by age bracket

The diagram above is a bipartite quasi-histogram [3] that I created to illustrate another article. It is not a true histogram as it shows percentages for and against in each bin rather than overall frequencies.

Brexit Bar 2
UK Referendum on EU Membership – Numbers voting by age bracket

In the same article, I addressed this shortcoming with a second view of the same data, which is more histogram-like (apart from having a total category) and appears above. The point that I was making related to how Data Visualisation can both inform and mislead depending on the presentational choices taken.



Line Charts
Fan Charts, Area Charts

Line Chart

These typically have categories across the horizontal axis and could be considered as a set of line segments joining up the tops of what would be the rectangles on a Bar Chart. Clearly multiple lines, associated with multiple series, can be plotted simultaneously without the need to cluster rectangles as is required with Bar Charts. Lines can also be used to join up the points on Scatter Charts assuming that these are sufficiently well ordered to support this.

Line (Fan) Chart

Adaptations of Line Charts can also be used to show the probability of uncertain future events as per the exhibit above. The single red line shows the actual value of some metric up to the middle section of the chart. Thereafter it is the central prediction of a range of possible values. Lying above and below it are shaded areas which show bands of probability. For example it may be that the probability of the actual value falling within the area that has the darkest shading is 50%. A further example is contained in Limitations of Business Intelligence. Such charts are sometimes called Fan Charts.

Area Chart

Another type of Line Chart is the Area Chart. If we can think of a regular Line Chart as linking the tops of an invisible Bar Chart, then an Area Chart links the tops of an invisible Stacked Bar Chart. The effect is that how a band expands and contracts as we move across the chart shows how the contribution this category makes to the whole changes over time (or whatever other category we choose for the horizontal axis).
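The stacking behind an Area Chart amounts to a running total per point on the horizontal axis. Here is a minimal sketch with invented sales figures; `stacked_boundaries` is a hypothetical name.

```python
from itertools import accumulate

# Hypothetical sales per product line for each year (the horizontal axis).
years = [2016, 2017, 2018]
sales = {
    "A": [30, 35, 40],
    "B": [20, 20, 25],
    "C": [10, 15, 5],
}

def stacked_boundaries(series):
    """Return, per year, the running totals that form each band's upper edge."""
    per_year = zip(*series.values())  # one tuple of values per year
    return [list(accumulate(vals)) for vals in per_year]

for year, tops in zip(years, stacked_boundaries(sales)):
    print(year, tops)
```

Each band is then drawn between one running total and the next, so its thickness at any year is the value of that category alone.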

See also: The first exhibit in New Thinking, Old Thinking and a Fairytale



Map Charts

Cartogram

These place data on top of geographic maps. If we consider the canonical example of a map of the US divided into states, then the degree of shading of each state could be proportional to some state-related data (e.g. average income quartile of residents). Or more simply, figures could appear against each state. Bubbles could be placed at the location of major cities (or maybe a bubble per country or state etc.) with their size relating to some aspect of the locale (e.g. population). An example of this approach might be a map of US states with their relative populations denoted by Bubble area.

Rainfall intensity

Also data could be overlaid on a map, for example – as shown above – coloured bands corresponding to different intensities of rainfall in different areas. This exhibit is excerpted from Hurricanes and Data Visualisation: Part I – Rainbow’s Gravity.



Pie Charts

Pie Chart

These circular charts normally display a single series of categories with values, showing the proportion each category contributes to the total. For example a series might be the nations that make up the United Kingdom and their populations: England 55.62 million people, Scotland 5.43 million, Wales 3.13 million and Northern Ireland 1.87 million.

UK Population by Nation

The whole circle represents the total of all the category values (e.g. the UK population of 66.05 million people [4]). The ratio of a segment’s angle to 360° (i.e. the whole circle) is equal to the percentage of the total represented by the linked category’s value (e.g. Scotland is 8.2% of the UK population and so will have a segment with an angle of just under 30°).
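The angle calculation can be checked with a few lines of Python. The population figures are those quoted above; `segment_angles` is simply a name I have chosen for the sketch.

```python
populations_m = {  # millions of people, figures as quoted in the text
    "England": 55.62,
    "Scotland": 5.43,
    "Wales": 3.13,
    "Northern Ireland": 1.87,
}

def segment_angles(values):
    """Map each category to the angle (in degrees) of its pie segment."""
    total = sum(values.values())
    return {name: 360 * v / total for name, v in values.items()}

angles = segment_angles(populations_m)
for nation, angle in angles.items():
    print(f"{nation}: {angle:.1f} degrees")
```

Scotland comes out at a little under 30°, as stated, and the four angles sum to the full 360°.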

Brexit Flag
UK Referendum on EU Membership – Number voting by age bracket (see notes)

Sometimes – as illustrated above – the segments are “exploded” away from each other. This is taken from the same article as the other voting analysis exhibits.

See also: As Nice as Pie, which examines the pros and cons of this type of chart in some depth.



Radar Charts / Spider Charts

Radar Chart

Radar Charts are used to plot one or more series of categories with values that fall into the same range. If there are six categories, then each has its own axis called a radius and the six of these radiate at equal angles from a central point. The calibration of each radial axis is the same. For example Radar Charts are often used to show ratings (say from 5 = Excellent to 1 = Poor) so each radius will have five points on it, typically with low ratings at the centre and high ones at the periphery. Lines join the values plotted on each adjacent radius, forming a jagged loop. Where more than one series is plotted, the relative scores can be easily compared. A sense of aggregate ratings can also be garnered by seeing how much of the plot of one series lies inside or outside of another.
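For anyone wanting to draw one by hand, the geometry is straightforward. The sketch below converts a set of ratings into the corner points of the jagged loop; the ratings are invented and the choice to start at 12 o'clock and proceed clockwise is just one convention.

```python
import math

def radar_points(ratings):
    """Convert ratings (one per radial axis) into (x, y) vertices.

    Axes are spaced at equal angles, starting at 12 o'clock and moving
    clockwise; the rating is the distance from the centre.
    """
    n = len(ratings)
    points = []
    for i, rating in enumerate(ratings):
        angle = 2 * math.pi * i / n  # equal angular spacing
        points.append((rating * math.sin(angle), rating * math.cos(angle)))
    return points

# Hypothetical ratings on a 1 (Poor) to 5 (Excellent) scale, six categories.
for x, y in radar_points([5, 3, 4, 2, 4, 1]):
    print(f"({x:+.2f}, {y:+.2f})")
```

Joining consecutive points (and the last back to the first) gives the loop for one series; a second series is plotted on the same axes in the same way.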

Part of a Data Capability Assessment

I use Radar Charts myself extensively when assessing organisations’ data capabilities. The above exhibit shows how an organisation ranks in five areas relating to Data Architecture compared to the best in their industry sector [5].



Scatter Charts

Scatter Chart

In most of the cases we have dealt with to date, one axis has contained discrete categories and the other continuous values (though our rating example for the Radar Chart had discrete categories and values). For a Scatter Chart both axes plot values, either continuous or discrete. A series would consist of a set of pairs of values, one to be plotted on the horizontal axis and one to be plotted on the vertical axis. For example a series might be a number of pairs of midday temperature (to be plotted on the horizontal axis) and sales of ice cream (to be plotted on the vertical axis). As may be deduced from the example, often the intention is to establish a link between the pairs of values – do ice cream sales increase with temperature? This aspect can be highlighted by drawing a line of best fit on the chart; one that minimises the total distance between each plotted point and the line. Further series, say sales of coffee versus midday temperature, can be added.
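There is more than one way to define a line of best fit. A common choice, and the one sketched below, is ordinary least squares, which minimises the sum of the squared vertical distances between the points and the line. The temperature and sales figures are invented, and `least_squares_line` is a hypothetical helper rather than a library function.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data: midday temperature (degrees C) and ice creams sold.
temperatures = [14, 18, 21, 25, 30]
ice_creams = [38, 55, 60, 75, 86]
slope, intercept = least_squares_line(temperatures, ice_creams)
print(f"sales = {slope:.2f} x temperature + {intercept:.2f}")
```

A positive slope here would suggest that sales rise with temperature, though a fitted line on its own says nothing about cause.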

Here is a further example, which illustrates potential correlation between two sets of data, one on the x-axis and the other on the y-axis [6]:

Climate Change Scatter Chart

As always a note of caution must be introduced when looking to establish correlations using scatter graphs. The inimitable Randall Munroe of xkcd.com [7] explains this pithily as follows:

By the third trimester there will be hundreds of babies inside you.

|  © Randall Munroe, xkcd.com (2009)  |  Excerpted from: Extrapolating  |

See also: Linear Regression



Tree Maps

Tree Map

Tree Maps require a little bit of explanation. The best way to understand them is to start with something more familiar, a hierarchy diagram with three levels (i.e. something like an organisation chart). Consider a cafe that sells beverages, so we have a top level box labelled Beverages. The Beverages box splits into Hot Beverages and Cold Beverages at level 2. At level 3, Hot Beverages splits into Tea, Coffee, Herbal Tea and Hot Chocolate; Cold Beverages splits into Still Water, Sparkling Water, Juices and Soda. So there is one box at level 1, two at level 2 and eight at level 3. As ever a picture paints a thousand words:

Hierarchy Diagram

Next let’s also label each of the boxes with the value of sales in the last week. If we add up the sales for Tea, Coffee, Herbal Tea and Hot Chocolate, we obviously get the sales for Hot Beverages.

Labelled Hierarchy Diagram

A Tree Map takes this idea and expands on it. A Tree Map using the data from our example above might look like this:

Treemap

First, instead of being linked by lines, boxes at level 3 (leaves let’s say) appear within their parent box at level 2 (branches maybe) and the level 2 boxes appear within the overall level 1 box (the whole tree); so everything is nested. Sometimes, as is the case above, rather than having the level 2 boxes drawn explicitly, the level 3 boxes might be colour coded. So above Tea, Coffee, Herbal Tea and Hot Chocolate are mid-grey and the rest are dark grey.

Next, the size of each box (at whatever level) is proportional to the value associated with it. In our example, 66.7% of sales (\frac{1000}{1500}) are of Hot Beverages. Then two-thirds of the Beverages box will be filled with the Hot Beverages box and one-third (\frac{500}{1500}) with the Cold Beverages box. If 20% of Cold Beverages sales (\frac{100}{500}) are Still Water, then the Still Water box will fill one fifth of the Cold Beverages box (or one fifteenth – \frac{100}{1500} – of the top level Beverages box).

It is probably obvious from the above, but it is non-trivial to find a layout that has all the boxes at the right size, particularly if you want to do something else, like have the size of boxes increase from left to right. This is a task generally best left to some software to figure out.
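To give a flavour of what such software does, here is a sketch of about the simplest possible layout, usually called slice-and-dice: split the top-level box into vertical strips by branch, then split each strip into horizontal bands by leaf. Real tools generally use more sophisticated layouts that keep boxes closer to square. The branch totals (1000 and 500) and the Still Water figure (100) follow the example above; the remaining leaf values are invented so that each branch adds up.

```python
def slice_rect(rect, weights, horizontal):
    """Split rect = (x, y, w, h) into strips whose areas follow weights."""
    x, y, w, h = rect
    total = sum(weights)
    strips = []
    for weight in weights:
        share = weight / total
        if horizontal:  # strips side by side, left to right
            strips.append((x, y, w * share, h))
            x += w * share
        else:           # strips stacked one above the other
            strips.append((x, y, w, h * share))
            y += h * share
    return strips

tree = {
    "Hot Beverages": {"Tea": 300, "Coffee": 500,
                      "Herbal Tea": 50, "Hot Chocolate": 150},
    "Cold Beverages": {"Still Water": 100, "Sparkling Water": 100,
                       "Juices": 150, "Soda": 150},
}

def treemap(tree, rect=(0.0, 0.0, 1.0, 1.0)):
    """Two-level slice-and-dice layout; returns {leaf: (x, y, w, h)}."""
    branch_totals = [sum(leaves.values()) for leaves in tree.values()]
    branch_rects = slice_rect(rect, branch_totals, True)
    layout = {}
    for leaves, branch_rect in zip(tree.values(), branch_rects):
        leaf_rects = slice_rect(branch_rect, list(leaves.values()), False)
        layout.update(zip(leaves, leaf_rects))
    return layout

layout = treemap(tree)
x, y, w, h = layout["Still Water"]
print(w * h)  # Still Water's share of the whole box, about one fifteenth
```

The layout is drawn in a unit square, so each box's area is directly its share of total sales.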


In Closing

The above review of various chart types is not intended to be exhaustive. For example, it doesn’t include Waterfall Charts [8], Stock Market Charts (or Open / High / Low / Close Charts [9]), or 3D Surface Charts [10] (which seldom are of much utility outside of Science and Engineering in my experience). There are also a number of other more recherché charts that may be useful in certain niche areas. However, I hope we have covered some of the more common types of charts and provided some helpful background on both their construction and usage.
 


Notes

 
[1]
 
Certainly by my normal standards!
 
[2]
 
Research suggests that humans are more attuned to comparing areas of circles than say their diameters.
 
[3]
 
© peterjamesthomas.com Ltd. (2019).
 
[4]
 
Excluding overseas territories.
 
[5]
 
This has been suitably redacted of course. Typically there are four other such exhibits in my assessment pack: Data Strategy, Data Organisation, MI & Analytics and Data Controls, together with a summary radar chart across all five lower level ones.
 
[6]
 
The atmospheric CO2 records were sourced from the US National Oceanographic and Atmospheric Administration’s Earth System Research Laboratory and relate to concentrations measured at their Mauna Loa station in Hawaii. The Global Average Surface Temperature records were sourced from the Earth Policy Institute, based on data from NASA’s Goddard Institute for Space Studies and relate to measurements from the latter’s Global Historical Climatology Network. This exhibit is meant to be a basic illustration of how a scatter chart can be used to compare two sets of data. Obviously actual climatological research requires a somewhat more rigorous approach than the simplistic one I have employed here.
 
[7]
 
Randall’s drawings are used (with permission) liberally throughout this site.

 
[8]
 
Waterfall Chart – Wikipedia.
 
[9]
 
Open-High-Low-Close Chart – Wikipedia.
 
[10]
 
Surface Chart – AnyCharts.

peterjamesthomas.com

Another article from peterjamesthomas.com. The home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases.

 

As Nice as Pie

If you can't get your graphing tool to do the shading, just add some clip art of cosmologists discussing the unusual curvature of space in the area.

© Randall Munroe of xkcd.com – Image adjusted to fit dimensions of this page

Work by the inimitable Randall Munroe, author of long-running web-comic, xkcd.com, has been featured (with permission) multiple times on these pages [1]. The above image got me thinking that I had not penned a data visualisation article since the series starting with Hurricanes and Data Visualisation: Part I – Rainbow’s Gravity nearly a year ago. Randall’s perspective led me to consider that staple of PowerPoint presentations, the humble and much-maligned Pie Chart.


 
While the history is not certain, most authorities credit the pioneer of graphical statistics, William Playfair, with creating this icon, which appeared in his Statistical Breviary, first published in 1801 [2]. Later Florence Nightingale (a statistician in case you were unaware) popularised Pie Charts. Indeed a Pie Chart variant (called a Polar Chart) that Nightingale compiled appears at the beginning of my article Data Visualisation – A Scientific Treatment.

I can’t imagine any reader has managed to avoid seeing a Pie Chart before reading this article. But, just in case, here is one (since writing Rainbow’s Gravity – see above for a link – I have tried to avoid a rainbow palette in visualisations, hence the monochromatic exhibit):

Basic Pie Chart

The above image is a representation of the following dataset:

 
Label Count
A 4,500
B 3,000
C 3,000
D 3,000
E 4,500
Total 18,000
 

The Pie Chart consists of a circle divided into five sectors, labelled A through E. The basic idea is of course that the amount of the circle taken up by each sector is proportional to the count of items associated with each category, A through E. What is meant by the innocent “amount of the circle” here? The easiest way to look at this is that going all the way round a circle consumes 360°. If we consider our data set, the total count is 18,000, which will equate to 360°. The count for A is 4,500 and we need to consider what fraction of 18,000 this represents and then apply this to 360°:

\dfrac{4{,}500}{18{,}000}\times 360^\circ=\dfrac{1}{4}\times 360^\circ=90^\circ

So A must take up 90°, or equivalently one quarter of the total circle. Similarly for B:

\dfrac{3{,}000}{18{,}000}\times 360^\circ=\dfrac{1}{6}\times 360^\circ=60^\circ

Or one sixth of the circle.

If we take this approach then – of course – the sum of all of the sectors must equal the whole circle and neither more nor less than this (pace Randall). In our example:

 
Label Degrees
A 90°
B 60°
C 60°
D 60°
E 90°
Total 360°
 

So far, so simple. Now let’s consider a second data-set as follows:

 
Label Count
A 9,480,301
B 6,320,201
C 6,320,200
D 6,320,201
E 9,480,301
Total 37,921,204
 

What does its Pie Chart look like? Well, it’s actually rather familiar; it looks like this:

Basic Pie Chart

This observation stresses something important about Pie Charts. They show how a number of categories contribute to a whole figure, but they only show relative figures (percentages of the whole if you like) and not the absolute figures. The totals in our two data-sets differ by a factor of over 2,100, but their Pie Charts are, to the eye, identical. We will come back to this point again later on.
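A quick way to convince yourself of this is to compute the sector angles for both data-sets. The sketch below uses the counts from the two tables; `sector_degrees` is just a name chosen for the example.

```python
def sector_degrees(counts):
    """Angle in degrees of each sector, given a {label: count} mapping."""
    total = sum(counts.values())
    return {label: 360 * count / total for label, count in counts.items()}

first = {"A": 4_500, "B": 3_000, "C": 3_000, "D": 3_000, "E": 4_500}
second = {"A": 9_480_301, "B": 6_320_201, "C": 6_320_200,
          "D": 6_320_201, "E": 9_480_301}

for label in first:
    a = sector_degrees(first)[label]
    b = sector_degrees(second)[label]
    print(f"{label}: {a:.4f} vs {b:.4f} degrees")
```

The angles agree to within a tiny fraction of a degree, far less than any chart could show, even though the totals are wildly different.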


 
Pie Charts have somewhat fallen into disrepute over the years. Some of this is to do with their ubiquity, but there is also at least one more substantial criticism. This is that the human eye is bad at comparing angles, particularly if they are not aligned to some reference point, e.g. a vertical. To see this consider the two Pie Charts below (please note that these represent a different data set from above – for starters, there are only four categories plotted as opposed to five earlier on):

Comparative Pie Charts

The details of the underlying numbers don’t actually matter that much, but let’s say that the left-hand Pie Chart represents annual sales in 2016, broken down by four product lines. The right-hand chart has the same breakdown, but for 2017. This provides some context to our discussions.

Suppose what is of interest is how the sales for each product line in the left-hand (2016) chart compare to their counterparts in the right-hand (2017) one; e.g. A and A’, B and B’ and so on. Well, for the As, we have the helpful fact that they both start from a vertical line and then swing down and round, initially rightwards. This can be used to gauge that A’ is a bit bigger than A. What about B and B’? Well, they start in different places and end in different places; looking carefully, we can see that B’ is bigger than B. C and C’ are pretty easy: C is a lot bigger. Then we come to D and D’. I find this one a bit tricky, but we can eventually hazard a guess that they are pretty much the same.

So we can compare Pie Charts and talk about how sales change between two years; what’s the problem? The issue is that it takes some time and effort to reach even these basic conclusions. How about instead of working out which is bigger, A or A’, I ask the reader to guess by what percentage A’ is bigger. This is not trivial to do based on just the charts.

If we really want to look at year-on-year growth, we would prefer that the answer leaps off the page; after all, isn’t that the whole point of visualisations rather than tables of numbers? What if we focus on just the right-hand diagram? Can you say with certainty which is bigger, A or C, B or D? You can work to an answer, but it takes longer than should really be the case for a graphical exhibit.

Aside:

There is a further point to be made here and it relates to what we said Pie Charts show earlier in this piece. What we have in our two Pie Charts above is the make-up of a whole number (in the example we have been working through, this is total annual sales) by categories (product lines). These are percentages and what we have been doing above is comparing, for example, the 30% of total sales that A made up in 2016 with the 33% it made up in 2017. What we cannot say based on just the above exhibits is how actual sales changed. The total sales may have gone up or down; the Pie Chart does not tell us this, it just deals in how the make-up of total sales has shifted.

Some people try to address this shortcoming, which can result in exhibits such as:

Comparative Pie Charts - with Growth

Here some attempt has been made to show the growth in the absolute value of sales year on year. The left-hand Pie Chart is smaller and so we assume that annual sales have increased between 2016 and 2017. The most logical thing to do would be for the total areas of the two Pie Charts to be in proportion to total sales in the two years (in this case – based on the underlying data – 2017 sales are 69% bigger than 2016 sales). However, such an approach, while adding information, makes the task of comparing sectors from year to year even harder.
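If area is to track sales, then the radius has to grow with the square root of the sales ratio, not with the ratio itself. A tiny sketch, using the 69% growth figure from the text (the function name `radius_ratio` is my own):

```python
import math

def radius_ratio(sales_ratio: float) -> float:
    """How much bigger the second pie's radius must be for AREA to track sales."""
    return math.sqrt(sales_ratio)

# 2017 sales are 69% bigger than 2016 sales, i.e. a ratio of 1.69.
print(f"{radius_ratio(1.69):.2f}")  # 1.30
```

So the 2017 Pie Chart should have a radius only 30% larger than the 2016 one, which is part of why the eye struggles to read growth from exhibits like this.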


 
The general argument is that Nested Bar Charts are better for the type of scenario I have presented and the types of questions I asked above. Looking at the same annual sales data this way we could generate the following graph:

Comparative Bar Charts

Aside:

While Bar Charts are often used to show absolute values, what we have above is the same “percentage of the whole” data that was shown in the Pie Charts. We have already covered the relative / absolute issue inherent in Pie Charts; from now on, each new chart will be like a Pie Chart inasmuch as it will contain relative (percentage of the whole) data, not absolute. Indeed you could think about generating the bar graph above by moving the Pie Chart sectors around and squishing them into new shapes, while preserving their area.

The Bar Chart makes the yearly comparisons a breeze and it is also pretty easy to take a stab at percentage differences. For example B’ looks about a fifth bigger than B (it’s actually 17.5% bigger) [3]. However, what I think gets lost here is a sense of the make-up of the elements of the two sets. We can see that A is the biggest value in the first year and A’ in the second, but it is harder to gauge what percentage of the overall both A and A’ represent.
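Since we are comparing percentages with percentages, it is worth being explicit about the arithmetic. The sketch below uses the illustrative shares of 40% and 50% rather than the actual chart data; `relative_increase` is a hypothetical helper.

```python
def relative_increase(old_share: float, new_share: float) -> float:
    """Percentage by which new_share exceeds old_share (both shares of a whole)."""
    return (new_share - old_share) / old_share * 100

# A share moving from 40% to 50% is up 10 percentage points,
# but it is 25% bigger in relative terms.
print(relative_increase(40, 50))  # 25.0
```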

To do this better, we could move to a Stacked Bar Chart as follows (again with the same sales data):

Stacked Bar Chart

Aside:

Once more, we are dealing with how proportions have changed – to put it simply the height of both “skyscrapers” is the same. If we instead shifted to absolute values, then our exhibit might look more like:

Stacked Bar Chart (Absolute Values)

The observant reader will note that I have also added dashed lines linking the same category for each year. These help to show growth. Regardless of what angle to the horizontal the lower line for a category makes, if it and the upper category line diverge (as for B and B’), then the category is growing; if they converge (as for C and C’), the category is shrinking [4]. Parallel lines indicate a steady state. Using this approach, we can get a better sense of the relative size of categories in the two years.


 
However, here – despite the dashed lines – we lose at least some of the year-on-year comparative power of the Nested Bar Chart above. In turn the Nested Bar Chart loses some of the attributes of the original Pie Chart. In truth, there is no single chart which fits all purposes. Trying to find one is analogous to trying to find a planar projection of a sphere that preserves angles, distances and areas [5].

Rather than finding the Philosopher’s Stone [6] of an all-purpose chart, the challenge for those engaged in data visualisation is to anticipate the central purpose of an exhibit and to choose a chart type that best resonates with this. Sometimes, the Pie Chart can be just what is required, as I found myself in my article, A Tale of Two [Brexit] Data Visualisations, which closed with the following image:

Brexit Flag
UK Referendum on EU Membership – Number voting by age bracket (see caveats in original article)

Or, to put it another way:

You may very well be well bred
Chart aesthetics filling your head
But there’s always some special case, time or place
To replace perfect taste

For instance…

Never cry ’bout a Chart of Pie
You can still do fine with a Chart of Pie
People may well laugh at this humble graph
But it can be just the thing you need to help the staff

Never cry ’bout a Chart of Pie
Though without due care things can go awry
Bars are fine, Columns shine
Lines are ace, Radars race
Boxes fly, but never cry about a Chart of Pie

With apologies to the Disney Corporation!


 
Addendum:

It was pointed out to me by Adam Carless that I had omitted the following thing of beauty from my Pie Chart menagerie. How could I have forgotten?

3D Pie Chart

It is claimed that some Theoretical Physicists (and most Higher Dimensional Geometers) can visualise in four dimensions. Perhaps this facility would be of some use in discerning meaning from the above exhibit.
 


 
Notes

 
[1]
 
Including:

 
[2]
 
Playfair also most likely was the first to introduce line, area and bar charts.
 
[3]
 
Recall again we are comparing percentages, so 50% is 25% bigger than 40%.
 
[4]
 
This assertion would not hold for absolute values, or rather parallel lines would indicate that the absolute value of sales (not the relative one) had stayed constant across the two years.
 
[5]
 
A little-known Mathematician, going by the name of Gauss, had something to say about this back in 1828 – Disquisitiones generales circa superficies curvas. I hope you read Latin.
 
[6]
 
The Philosopher's Stone

No, not that one!

 



 

Indiana Jones and The Anomalies of Data

One of an occasional series [1] highlighting the genius of Randall Munroe. Randall is a prominent member of the international data community and apparently also writes some sort of web-comic as a side line [2].

I didn't even realize you could HAVE a data set made up entirely of outliers.
Copyright xkcd.com

Data and Indiana Jones, these are a few of my favourite things… [3] Indeed I must confess to having used a variant of the image below in both my seminar deck and – on this site back in 2009 – a previous article, A more appropriate metaphor for Business Intelligence projects.

Raiders of the Lost Ark II would have been a much better title than Temple of Doom IMO

In both cases I was highlighting that data-centric work is sometimes more like archaeology than the frequently employed metaphor of construction. To paraphrase myself, you never know what you will find until you start digging. The image suggested the unfortunate results of not making this distinction when approaching data projects.

So, perhaps I am arguing for fewer Data Architects and more Data Archaeologists; the whip and fedora are optional of course!
 


 Notes

 
[1]
 
Well not that occasional as, to date, the list extends to:

  1. Patterns patterns everywhere – The Sequel
  2. An Inconvenient Truth
  3. Analogies, the whole article is effectively an homage to xkcd.com
  4. A single version of the truth?
  5. Especially for all Business Analytics professionals out there
  6. New Adventures in Wi-Fi – Track 1: Blogging
  7. Business logic [My adaptation]
  8. New Adventures in Wi-Fi – Track 2: Twitter
  9. Using historical data to justify BI investments – Part III
 
[2]
 
xkcd.com if you must ask.
 
[3]
 
Though in this case, my enjoyment would have been further enhanced by the use of “artefacts” instead.

 

 

An Inconvenient Truth

Frequentists vs. Bayesians - © xkcd.com
© xkcd.com (adapted from the original to fit the dimensions of this page)

No, not a polemic about climate change, but instead some observations on the influence of statistical methods on statistical findings. It is clearly a truism to state that there are multiple ways to skin a cat; what is perhaps less well-understood is that not all methods of flaying will end up with a cutaneously-challenged feline and some may result in something altogether different.

So much for an opaque introduction; let me try to shed some light instead. While the points I am going to make here are ones that any statistical practitioner would (or certainly should) know well, they are perhaps less widely appreciated by a general audience. I returned to thinking about this area based on an article by Raphael Silberzahn and Eric Uhlmann in Nature [1], but one which I have to admit first came to my attention via The Economist [2].

Messrs Silberzahn and Uhlmann were propounding a crowd-sourced approach to statistical analysis in science, in particular the exchange of ideas about a given analysis between (potentially rival) groups before conclusions are reached and long before the customary pre- and post-publication reviews. While this idea may well have a lot of merit, I’m instead going to focus on the experiment that the authors performed, some of its results and their implications for more business-focussed analysis teams and individuals.

The interesting idea here was that Silberzahn and Uhlmann provided 29 different teams of researchers the same data set and asked them to investigate the same question. The data set was a sporting one covering the number of times that footballers (association in this case, not American) were dismissed from the field of play by an official. The data set included many attributes from the role of the player, to when the same player / official encountered each other, to demographics of the players themselves. The question was – do players with darker skins get dismissed more often than their fairer teammates?

Leaving aside the socio-political aspects that this problem brings to mind, the question is one that, at least on first glance, looks as if it should be readily susceptible to statistical analysis and indeed the various researchers began to develop their models and tests. A variety of methodologies was employed, “everything from Bayesian clustering to logistic regression and linear modelling” (the authors catalogued the approaches as well as the results) and clearly each team took decisions as to which data attributes were the most significant and how their analyses would be parameterised. Silberzahn and Uhlmann then compared the results.

Below I’ll simply repeat part of their comments (with my highlighting):

Of the 29 teams, 20 found a statistically significant correlation between skin colour and red cards […]. The median result was that dark-skinned players were 1.3 times more likely than light-skinned players to receive red cards. But findings varied enormously, from a slight (and non-significant) tendency for referees to give more red cards to light-skinned players to a strong trend of giving more red cards to dark-skinned players.

This diversity in findings is neatly summarised in the following graph (please click to view the original on Nature’s site):

Nature Graph

© NPG. Used under license 3741390447060 Copyright Clearance Center

To be clear here, the unanimity of findings that one might have expected from analysing what is essentially a pretty robust and conceptually simple data set was essentially absent. What does this mean aside from potentially explaining some of the issues with repeatability that have plagued some parts of science in recent years?

Well the central observation is that precisely the same data set can lead to wildly different insights dependent on how it is analysed. It is not necessarily the case that one method is right and others wrong, indeed in review of the experiment, the various research teams agreed that the approaches taken by others were also valid. Instead it is extremely difficult to disentangle results from the algorithms employed to derive them. In this case methodology had a bigger impact on findings than any message lying hidden in the data.
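To make this concrete, here is a minimal sketch of just one such analytical choice: whether or not to take a player's position into account. The counts are entirely hypothetical (I have invented them for illustration; they are not the study's data) and the function names are my own. It is not a claim about which of the 29 teams' approaches was preferable, merely a demonstration that one modelling decision can move the answer.

```python
# Hypothetical counts, invented purely for illustration (NOT the study's data).
# Each entry is (red cards, player-matches) for one group in one playing position.
data = {
    "defender": {"dark": (120, 6000), "light": (40, 2000)},
    "forward": {"dark": (40, 4000), "light": (80, 8000)},
}


def rate(cards, matches):
    """Red cards per player-match."""
    return cards / matches


def crude_ratio(data):
    """Ratio of red-card rates (dark / light), ignoring playing position."""
    rates = {}
    for group in ("dark", "light"):
        cards = sum(data[pos][group][0] for pos in data)
        matches = sum(data[pos][group][1] for pos in data)
        rates[group] = rate(cards, matches)
    return rates["dark"] / rates["light"]


def ratio_by_position(data):
    """The same ratio, calculated separately within each position."""
    return {
        pos: rate(*groups["dark"]) / rate(*groups["light"])
        for pos, groups in data.items()
    }


print(round(crude_ratio(data), 2))  # 1.33
print(ratio_by_position(data))      # {'defender': 1.0, 'forward': 1.0}
```

With these made-up figures, ignoring position suggests a ratio of about 1.33, whereas looking within each position suggests no difference at all. Both calculations are arithmetically correct; they simply embody different decisions about which attributes matter.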

Here we are talking about leading scientific researchers, whose prowess in statistics is a core competency. Let’s now return to the more quotidian world of the humble data scientist engaged in helping an organisation to take better decisions through statistical modelling. Well the same observations apply. In many cases, insight will be strongly correlated with how the analysis is performed and the choices that the analyst has made. Also, it may not be that there is some objective truth hidden in a dataset, instead only a variety of interpretations of this.

Now this sounds like a call to abandon all statistical models. Nothing could be further from my point of view [3]. However caution is required. In particular those senior business people who place reliance on the output of models, but who maybe do not have a background in statistics, should perhaps ask themselves whether what their organisation’s models tell them is absolute truth, or instead simply more of an indication. They should also ask whether a different analysis methodology might have yielded a different result and thus dictated different business action.

At the risk of coming over all Marvel, the great power of statistical modelling comes with great responsibility.

In 27 years in general IT and 15 in the data/information space (to say nothing of my earlier Mathematical background) I have not yet come across a silver bullet. My strong suspicion is that they don’t exist. However, I’d need to carry out some further analysis to reach a definitive conclusion; now what methodology to employ…?
 


 
Notes

 
[1]
 
Crowdsourced research: Many hands make tight work. Raphael Silberzahn & Eric L. Uhlmann. Nature. 07 October 2015.
 
[2]
 
On the other hands – Honest disagreement about methods may explain irreproducible results. The Economist 10th Oct 2015.
 
[3]
 
See the final part of my trilogy on using historical data to justify BI investments for a better representation of my actual views.

Analogies

Disaster Area's chief research accountant has recently been appointed Professor of Neomathematics at the University of Maximegalon, in recognition of both his General and his Special Theories of Disaster Area Tax Returns, in which he proves that the whole fabric of the space- time continuum is not merely curved, it is in fact totally bent.

Note: In the following I have used the abridgement Maths when referring to Mathematics. I appreciate that this may be jarring to US readers; omitting the ‘s’ is jarring to me, so please accept my apologies in advance.

Introduction

Regular readers of this blog will be aware of my penchant for analogies. Dominant amongst these have been sporting ones, which have formed a major part of articles such as:

Rock climbing: Perseverance
A bad workman blames his [BI] tools
Running before you can walk
Feasibility studies continued…
Incremental Progress and Rock Climbing
Cricket: Accuracy
The Big Picture
Mountain Biking: Mountain Biking and Systems Integration
Football (Soccer): “Big vs. Small BI” by Ann All at IT Business Edge

I have also used other types of analogy from time to time, notably scientific ones such as in the middle sections of Recipes for Success?, or A Single Version of the Truth? – I was clearly feeling quizzical when I wrote both of those pieces! Sometimes these analogies have been buried in illustrations rather than the text as in:

Synthesis RNA Polymerase transcribing DNA to produce RNA in the first step of protein synthesis
The Business Intelligence / Data Quality symbiosis A mitochondrion, the possible product of endosymbiosis of proteobacteria and eukaryotes
New Adventures in Wi-Fi – Track 2: Twitter Paul Dirac, the greatest British Physicist since Newton

On other occasions I have posted overtly Mathematical articles such as Patterns, patterns everywhere, The triangle paradox and the final segment of my recently posted trilogy Using historical data to justify BI investments.

Jim Harris' OCDQ Blog

Jim Harris (@ocdqblog) frequently employs analogies on his excellent Obsessive Compulsive Data Quality blog. If there is a way to form a title “The X of Data Quality”, and relate this in a meaningful way back to his area of expertise, Jim’s creative brain will find it. So it is encouraging to feel that I am not alone in adopting this approach. Indeed I see analogies employed increasingly frequently in business and technology blogs, to say nothing of in day-to-day business life.

However, recently two things have given me pause for thought. The first was the edition of Randall Munroe’s highly addictive webcomic, xkcd.com, that appeared on 6th May 2011, entitled “Teaching Physics”. The second was a blog article I read which likened a highly abstract research topic in one branch of Theoretical Physics to what BI practitioners do in their day job.

An homage to xkcd.com

Let’s consider xkcd.com first. Anyone who finds some nuggets of interest in the type of – generally rather oblique – references to matters Mathematical or Scientific that I mention above is likely to fall in love with xkcd.com. Indeed anyone who did a numerate degree, works in a technical role, or is simply interested in Mathematics, Science or Engineering would as well – as Randall says in a footnote:

“this comic occasionally contains […] advanced mathematics (which may be unsuitable for liberal-arts majors)”

Although Randall’s main aim is to entertain – something he manages to excel at – his posts can also be thought-provoking, bitter-sweet and even resonate with quite profound experiences and emotions. Who would have thought that some stick figures could achieve all that? It is perhaps indicative of the range of topics dealt with on xkcd.com that I have used it to illustrate no fewer than seven of my articles (including this one, a full list appears at the end of the article). It is encouraging that Randall’s team of corporate lawyers has generally viewed my requests to republish his work favourably.

The example of Randall’s work that I wanted to focus on is as follows.

Space-time is like some simple and familiar system which is both intuitively understandable and precisely analogous, and if I were Richard Feynman I’d be able to come up with it.
© xkcd.com (adapted from the original to fit the dimensions of this page)

It is worth noting that often the funniest / most challenging xkcd.com observations appear in the mouse-over text of comic strips (alt or title text for any HTML heads out there – assuming that there are any of us left). I’ll reproduce this below as it is pertinent to the discussion:

Space-time is like some simple and familiar system which is both intuitively understandable and precisely analogous, and if I were Richard Feynman I’d be able to come up with it.

If anyone needs some background on the science referred to, then have a skim of this article; if you need some background on the scientist mentioned (who has also made an appearance on peterjamesthomas.com in Presenting in Public), then glance through this second one.

Here comes the Science…

Randall points out the dangers of over-extending an analogy. While it has always helped me to employ the rubber-sheet analogy of warped space-time when thinking about the area, it is rather tough (for most people) to extrapolate a 2D surface being warped to a 4D hyperspace experiencing the same thing. As an erstwhile Mathematician, I find it easy enough to cope with the following generalisation:

S(1) = The set of all points defined by one variable (x1)
– i.e. a straight line
S(2) = The set of all points defined by two variables (x1, x2)
– i.e. a plane
S(3) = The set of all points defined by three variables (x1, x2, x3)
– i.e. “normal” 3-space
S(4) = The set of all points defined by four variables (x1, x2, x3, x4)
– i.e. 4-space
” ” ” “
S(n) = The set of all points defined by n variables (x1, x2, … , xn)
– i.e. n-space

As we increase the dimensions, the Maths continues to work and you can do calculations in n-space (e.g. to determine the distance between two points) just as easily (OK with some more arithmetic) as in 3-space; Pythagoras still holds true. However, actually visualising say 7-space might be rather taxing for even a Fields Medallist or Nobel-winning Physicist.
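As a small illustration of Pythagoras continuing to hold, the sketch below calculates the straight-line distance between two points in n-space. The function name is my own choice and the points are arbitrary; the same few lines serve for 2-space, 3-space or 7-space.

```python
import math


def distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("both points must live in the same n-space")
    # Pythagoras generalised: sum the squared differences, then take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(distance((0, 0), (3, 4)))       # 5.0, the familiar 3-4-5 triangle in 2-space
print(distance((0,) * 7, (1,) * 7))   # the square root of 7, in 7-space
```

The arithmetic grows with n, but the method does not change; it is only the picture in one's head that gives out.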

… and the Maths

More importantly while you can – for example – use 3-space as an analogue for some aspects of 4-space, there are also major differences. To pick on just one area, some pieces of string that are irretrievably knotted in 3-space can be untangled with ease in 4-space.

To briefly reference a probably familiar example, starting with 2-space we can look at what is clearly a family of related objects:

2-space: A square has 4 vertexes, 4 edges joining them and 4 “faces” (each consisting of a line – so the same as edges in this case)
3-space: A cube has 8 vertexes, 12 edges and 6 “faces” (each consisting of a square)
4-space: A tesseract (or 4-hypercube) has 16 vertexes, 32 edges and 8 “faces” (each consisting of a cube)
Note: The reason that faces appears in inverted commas is that the physical meaning changes; only in 3-space does this have the normal connotation of a surface with two dimensions. Instead of faces, one would normally talk about the bounding cubes of a tesseract forming its cells.

Even without any particular insight into multidimensional geometry, it is not hard to see from the way that the numbers stack up that:

n-space: An n-hypercube has 2^n vertexes, n × 2^(n-1) edges and 2n “faces” (each consisting of an (n-1)-hypercube)
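For anyone who would rather check than trust, here is a short sketch that takes the general counts to be 2^n vertexes, n × 2^(n-1) edges and 2n bounding “faces”, and compares the first two against a brute-force enumeration. The function names are mine; the brute-force approach treats each vertex as a string of n zeros and ones and counts an edge wherever two vertexes differ in exactly one coordinate.

```python
from itertools import combinations, product


def hypercube_counts(n):
    """Closed-form counts for an n-hypercube: (vertexes, edges, bounding 'faces')."""
    return 2 ** n, n * 2 ** (n - 1), 2 * n


def brute_force_counts(n):
    """Count vertexes and edges directly by enumerating the corners."""
    vertexes = list(product((0, 1), repeat=n))
    edges = sum(
        1
        for u, v in combinations(vertexes, 2)
        if sum(a != b for a, b in zip(u, v)) == 1  # neighbours differ in one coordinate
    )
    return len(vertexes), edges


for n in (2, 3, 4):
    print(n, hypercube_counts(n), brute_force_counts(n))
# 2 (4, 4, 4) (4, 4)
# 3 (8, 12, 6) (8, 12)
# 4 (16, 32, 8) (16, 32)
```

The square, cube and tesseract figures quoted above drop out as expected.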

Again, while the Maths is compelling, it is pretty hard to visualise a tesseract. If you think that a drawing of a cube is an attempt to render a 3D object on a 2D surface, then a picture of a tesseract would be a projection of a projection. The French (with a proud history of Mathematics) came up with a solution: just do one projection by building a 3D “picture” of a tesseract.

La Grande Arche de la Défense

As an aside it could be noted that the above photograph is of course a 2D projection of a 3D building, which is in turn a projection of a 4D shape; however recursion can sometimes be pushed too far!

Drawing multidimensional objects in 2D, or even building them in 3D, is perhaps a bit like employing an analogy (this sentence being of course a meta-analogy). You may get some shadowy sense of what the true object is like in n-space, but the projection can also mask essential features, or even mislead. For some things, this shadowy sense may be more than good enough and even allow you to better understand the more complex reality. However, a 2D projection will not be good enough (indeed cannot be good enough) to help you understand all properties of the 3D, let alone the 4D. Hopefully, I have used one element of the very subject matter that Randall raises in his webcomic to further bolster what I believe are a few of the general points that he is making, namely:

  1. Analogies only work to a degree and you over-extend them at your peril
  2. Sometimes the wholly understandable desire to make a complex subject accessible by comparing it to something simpler can confuse rather than illuminate
  3. There are subject areas that very manfully resist any attempts to approach them in a manner other than doing the hard yards – not everything is like something less complex

Why BI is not [always] like Theoretical Physics

Hand with reflecting sphere - Maurits Cornelis Escher (1935). This is your only clue.

Having hopefully supported these points, I’ll move on to the second thing that I mentioned reading; a BI-related blog also referencing Theoretical Physics. I am not going to name the author, mention where I read their piece, state what the title was, or even cite the precise area of Physics they referred to. If you are really that interested, I’m sure that the nice people at Google can help to assuage your curiosity. With that out of the way, what were the concerns that reading this piece raised in my mind?

Well first of all, from the above discussion (and indeed the general tone of this blog), you might think that such an article would be right up my street. Sadly I came away feeling that the connection made was tenuous at best and rather unhelpful (it didn’t really tell you anything about Business Intelligence), and that it also exhibited a lack of anything bar a superficial understanding of the scientific theory involved.

The analogy had been drawn based on a single word which is used in both some emerging (but as yet unvalidated) hypotheses in Theoretical Physics and in Business Intelligence. While, just like the 2D projection of a 4D shape, there are some elements in common between the two, there are some fundamental differences. This is a general problem in Science and Mathematics: everyday words are used because they have some connection with the concept in hand, but this does not always imply as close a relationship as the casual reader might infer. Some examples:

  1. In Pure Mathematics, the members of a group may be associative, but this doesn’t mean that they tend to hang out together.
  2. In Particle Physics, an object may have spin, but this does not mean that it has been bowled by Murali
  3. In Structural Biology, a residue is not precisely what a Chemist might mean by one, let alone a lay-person

Part of the blame for what was, in my opinion, an erroneous connection between things that are not actually that similar lies with something that, in general, I view more positively; the popular science book. The author of the BI/Physics blog post referred to just such a tome in making his argument. I have consumed many of these books myself and I find them an interesting window into areas in which I do not have a background. The danger with them lies when – in an attempt to convey meaning that is only truly embodied (if that is the word) in Mathematical equations – our good friend the analogy is employed again. When done well, this can be very powerful and provide real insight for the non-expert reader (often the writers of pop-science books are better at this kind of thing than the scientists themselves). When done less well, this can do more than fail to illuminate, it can confuse, or even in some circumstances leave people with the wrong impression.

Tridimensional realisation of the Riemann Zeta function
© Jean-François Colonna

During my MSc, I spent a year studying the Riemann Hypothesis and the myriad of results that are built on the (unproven) assumption that it is true. Before this I had spent three years obtaining a Mathematics BSc. Before this I had taken two Maths A-levels (national exams taken in the UK during and at the end of what would equate to High School in the US), plus (less relevantly perhaps) Physics and Chemistry. One way or another I had been studying Maths for probably 15 plus years before I encountered this most famous and important of ideas.

So what is the Riemann Hypothesis? A statement of it is as follows:

The real part of all non-trivial zeros of the Riemann Zeta function is equal to one half

There! Are you any the wiser? If I wanted to explain this statement to those who have not studied Pure Mathematics at a graduate level, how would I go about it? Maybe my abilities to think laterally and be creative are not well-developed, but I struggle to think of an easily accessible way to rephrase the proposal. I could say something gnomic such as, “it is to do with the distribution of prime numbers” (while trying to avoid the heresy of adding that prime numbers are important because of cryptography – I believe that they are important because they are prime numbers!).

I spent a humble year studying this area, after years of preparation. Some of the finest Mathematical minds of the last century (sadly not a set of which I am a member) have spent vast chunks of their careers trying to inch towards a proof. The Riemann Hypothesis is not like something from normal experience; it is complicated. Some things are complicated and not easily susceptible to analogy.

Equally – despite how interesting, stimulating, rewarding and even important Business Intelligence can be – it is not Theoretical Physics and ne’er the twain shall meet.

And so what?

So after this typically elliptical journey through various parts of Science and Mathematics, what have I learnt? Mainly that analogies must be treated with care and not over-extended lest they collapse in a heap. Will I therefore stop filling these pages with BI-related analogies, both textual and visual? Probably not, but maybe I’ll think twice before hitting the publish key in future!

Euler's product formula for the Riemann Zeta function


Chronological list of articles using xkcd.com illustrations:

  1. A single version of the truth?
  2. Especially for all Business Analytics professionals out there
  3. New Adventures in Wi-Fi – Track 1: Blogging
  4. Business logic [My adaptation]
  5. New Adventures in Wi-Fi – Track 2: Twitter
  6. Using historical data to justify BI investments – Part III

 

Using historical data to justify BI investments – Part III

The earliest recorded surd

This article completes the three-part series which started with Using historical data to justify BI investments – Part I and continued (somewhat inevitably) with Using historical data to justify BI investments – Part II. Having presented a worked example, which focused on using historical data both to develop a profit-enhancing rule and then to test its efficacy, this final section considers the implications for justifying Business Intelligence / Data Warehouse programmes and touches on some more general issues.
 
 
The Business Intelligence angle

In my experience when talking to people about the example I have just shared, there can be an initial “so what?” reaction. It can maybe seem that we have simply adopted the all-too-frequently-employed business ruse of accentuating the good and down-playing the bad. Who has not heard colleagues say “this was a great month excluding the impact of X, Y and Z”? Of course the implication is that when you include X, Y and Z, it would probably be a much less great month; but this is not what we have done.

One goal of business intelligence is to help in estimating what is likely to happen in the future and guiding users in taking decisions today that will influence this. What we have really done in the above example is as follows:

Look out Morlocks, here I come... [alumni of Imperial College London are so creative aren't they?]

  1. shift “now” back two years in time
  2. pretend we know nothing about what has happened in these most recent two years
  3. develop a predictive rule based solely on the three years preceding our back-shifted “now”
  4. then use the most recent two years (the ones we have metaphorically been covering with our hand) to see whether our proposed rule would have been efficacious

For the avoidance of doubt, in the previously attached example, the losses incurred in 2009 – 2010 have absolutely no influence on the rule we adopt; this is based solely on 2006 – 2008 losses. All the 2009 – 2010 losses are used for is to validate our rule.
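The four steps can be sketched in a few lines of code. Please note that the policies below are hypothetical and invented for this illustration (they are not the figures from the worked example), and that the 70% cut-off and all of the names are assumptions of mine. The point is purely structural: the rule is developed using only the earlier period and is then judged using only the later one.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    name: str
    early_premium: float  # 2006 - 2008
    early_losses: float
    late_premium: float   # 2009 - 2010
    late_losses: float


# Hypothetical book of business, invented for illustration only.
POLICIES = [
    Policy("A", 300, 120, 200, 80),
    Policy("B", 300, 270, 200, 150),
    Policy("C", 300, 90, 200, 60),
    Policy("D", 300, 240, 200, 70),   # bad business turning good
    Policy("E", 300, 150, 200, 140),  # good business turning bad
]

THRESHOLD = 0.7  # assumed cut-off: do not renew above a 70% early loss ratio


def select_renewals(policies, threshold=THRESHOLD):
    """Develop the rule with our hand over the later years: early data only."""
    return [p for p in policies if p.early_losses / p.early_premium <= threshold]


def late_loss_ratio(policies):
    """Take the hand away: validate using the later years only."""
    return sum(p.late_losses for p in policies) / sum(p.late_premium for p in policies)


kept = select_renewals(POLICIES)
print([p.name for p in kept])               # ['A', 'C', 'E']
print(round(late_loss_ratio(POLICIES), 3))  # 0.5 for the whole book
print(round(late_loss_ratio(kept), 3))      # 0.467 if the rule had been applied
```

Note that the rule gets policies D and E wrong, yet the retained segment still performs better than the book as a whole in the later period.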

We have therefore achieved two things:

  1. Established that better decisions could have been taken historically at the juncture of 2008 and 2009
  2. Devised a rule that would have been more effective and displayed at least some indication that this could work going forward in 2011 and beyond

From a Business Intelligence / Data Warehousing perspective, the general pitch is then something like:

Eight out of ten cats said that their owners got rid of stubborn stains no other technology could shift with BI - now with added BA

  1. If we can mechanically take such decisions, based on a very unsophisticated analysis of data, then making even simple information available to the humans taking decisions (i.e. basic BI) will surely improve the quality of their decision-making
  2. If we go beyond this to provide more sophisticated analyses (e.g. including industry segmentation, analysis of insured attributes, specific products sold etc., i.e. regular BI) then we can – by extrapolation from the example – better shape the evolution of the performance of whole books of business
  3. We can also monitor the decisions taken to determine the relative effectiveness of individuals and teams and compare these to their peers – ideally these comparisons would also be made available to the individuals and teams themselves, allowing them to assess their relative performance (again regular BI)
  4. Finally, we can also use more sophisticated approaches, such as statistical modelling to tease out trends and artefacts that would not be easily apparent when using a standard numeric or graphical approach (i.e. sophisticated BI, though others might use the terms “data mining”, “pattern recognition” or the now ubiquitous marketing term “analytics”)

The example also says something else – although we may already have reporting tools, analysis capabilities and even people dabbling in statistical modelling, it appears that there is room for improvement in our approach. The 2009 – 2010 loss ratio was 54% and it could have been closer to 40%. Thus what we are doing now is demonstrably not as good as it could be and the monetary value of making a stepped change in information capabilities can be estimated.

The generation of which should be the object of any BI/DW project worth its salt - thinking of which, maybe a mound of salt would also have worked as an illustration

In the example, we are talking about £1m of biannual premium and £88k of increased profit. What would be the impact of better information on an annual book of £1bn premium? Assuming a linear relationship and using some advanced Mathematics, we might suggest £44m. What is more, these gains would not be one-off, but repeatable every year. Even if we moderate our projected payback to a more conservative figure, our exercise implies that we would be not out of line to suggest say an ongoing annual payback of £10m. These are numbers and concepts which are likely to resonate with Executive decision-makers.

To put it even more directly an increase of £10m a year in profits would quickly swamp the cost of a BI/DW programme in very substantial benefits. These are payback ratios that most IT managers can only dream of.

As an aside, it may have occurred to readers that the mechanistic rule is actually rather good and – if so – why exactly do we need the underwriters? Taking to one side examples of solely rule-based decision-making going somewhat awry (LTCM anyone?) the human angle is often necessary in messy things like business acquisition and maintaining relationships. Maybe because of this, very few insurance organisations are relying on rules to take all decisions. However it is increasingly common for rules to play some role in their overall approach. This is likely to take the form of triage of some sort. For example:

  1. A rule – maybe not much more sophisticated than the one I describe above – is established and run over policies before renewal.
  2. This is used to score policies as maybe having green, amber or red lights associated with them.
  3. Green policies may be automatically renewed with no intervention from human staff
  4. Amber policies may be looked at by junior staff, who may either OK the renewal if they satisfy themselves that the issues picked up are minor, or refer it to more senior and experienced colleagues if they remain concerned
  5. Red policies go straight to the most experienced staff for their close attention

In this way process efficiencies are gained. Staff time is only applied where it is necessary and the most expensive resources are applied to those cases that most merit their abilities.
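A minimal sketch of such a triage follows. The score is assumed here to be a historical loss ratio and the two thresholds are arbitrary values that I have picked for illustration; a real organisation would set both the scoring rule and the cut-offs for itself.

```python
# Who deals with each colour of policy, as per the steps described above.
ROUTING = {
    "green": "renew automatically",
    "amber": "review by junior staff, who may refer upwards",
    "red": "refer straight to the most experienced staff",
}


def triage(score, amber_from=0.5, red_from=0.8):
    """Map a policy's score (assumed: historical loss ratio) to a traffic light.

    The thresholds are hypothetical defaults, not recommended values.
    """
    if score >= red_from:
        return "red"
    if score >= amber_from:
        return "amber"
    return "green"


for policy_score in (0.3, 0.6, 0.95):
    light = triage(policy_score)
    print(policy_score, light, "->", ROUTING[light])
```

The design choice worth noting is that the rule does not take the renewal decision for amber and red policies; it only decides whose desk they land on.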

 
Correlation

From the webcomic of the inimitable Randall Munroe - his mouse-over text is a lot better than mine BTW
© xkcd.com

Let’s pause for a moment and consider the Insurance example a little more closely. What has actually happened? Well we seem to have established that performance of policies in 2006 – 2008 is at least a reasonable predictor of performance of the same policies in 2009 – 2010. Taking the mutual fund vendors’ constant reminder that past performance does not indicate future performance to one side, what does this actually mean?

What we have done is to establish a loose correlation between 2006 – 2008 and 2009 – 2010 loss ratios. But I also mentioned a while back that I had fabricated the figures, so how does that work? In the same section, I also said that the figures contained an intentional bias. I didn’t adjust my figures to make the year-on-year comparison work out. However, at the policy level, I was guilty of making the numbers look like the type of results that I have seen with real policies (albeit of a specific type). Hopefully I was reasonably realistic about this. If every policy that was bad in 2006 – 2008 continued in exactly the same vein in 2009 – 2010 (and vice versa) then my good segment would have dropped from an overall loss ratio of 54% to considerably less than 40%. The actual distribution of losses is representative of real Insurance portfolios that I have analysed. It is worth noting that only a small bias towards policies that start bad continuing to be bad is enough for our rule to work and profits to be improved. Close scrutiny of the list of policies will reveal that I intentionally introduced several counter-examples to our rule; good business going bad and vice versa. This is just as it would be in a real book of business.

Not strongly correlated

Rather than continuing to justify my methodology, I’ll make two statements:

  1. I have carried out the above sort of analysis on multiple books of Insurance business and come up with comparable results; sometimes the implied benefit is greater, sometimes it is less, but it has been there without exception (of course statistics being what it is, if I did the analysis frequently enough I would find just such an exception!).
  2. More mathematically speaking, the actual figure for the correlation between the two sets of years is a less than stellar 0.44. Of course a figure of 1 (or indeed -1) would imply total correlation, and one of 0 would imply a complete lack of correlation, so I am not working with doctored figures. Even a very mild correlation in data sets (one much less than the threshold for establishing statistical dependence) can still yield a significant impact on profit.
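For readers who would like to see how a correlation figure of this kind is arrived at, here is a sketch of the standard Pearson calculation. I am assuming that this is the type of correlation meant above; the small data sets in the usage lines are invented simply to show the extremes of 1, -1 and 0, and do not reproduce the 0.44 figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of the two series, scaled by how much each varies on its own.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3], [2, 4, 6]))          # total positive correlation
print(pearson([1, 2, 3], [6, 4, 2]))          # total negative correlation
print(pearson([1, 2, 3, 4], [1, -1, -1, 1]))  # no linear correlation
```

In the Insurance setting, xs and ys would be the loss ratios of the same policies in the earlier and later periods respectively.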

 
Closing thoughts

Ground floor: Perfumery, Stationery and leather goods, Wigs and haberdashery, Kitchenware and food…. Going up!

Having gone into a lot of detail over the course of these three articles, I wanted to step back and assess what we have covered. Although the worked-example was drawn from my experience in Insurance, there are some generic learnings to be made.

Broadly I hope that I have shown that – at least in Insurance, but I would argue with wider applicability – it is possible to use the past to infer what actions we should take in the future. By a slight tweak of timeframes, we can even take some steps to validate approaches suggested by our information. It is important that we remember that the type of basic analysis I have carried out is not guaranteed to work. The same can be said of the most advanced statistical models; both will give you some indication of what may happen and how likely this is to occur, but neither of them is foolproof. However, either of these approaches has more chance of being valuable than, for example, solely applying instinct, or making decisions at random.

In Patterns, patterns everywhere, I wrote about the dangers associated with making predictions about events that are essentially unpredictable. This is another caveat to be borne in mind. However, to balance this it is worth reiterating that even partial correlation can lead to establishing rules (or more sophisticated models) that can have a very positive impact.

While any approach based on analysis or statistics will have challenges and need careful treatment, I hope that my example shows that the option of doing nothing, of continuing to do things how they have been done before, is often fraught with even more problems. In the case of Insurance at least – and I suspect in many other industries – the risks associated with using historical data to make predictions about the future are, in my opinion, outweighed by the risks of not doing this; on average of course!

But then 1=2 for very large values of 1
 

New Adventures in Wi-Fi – Track 2: Twitter

New Adventures in Wi-Fi (with apologies to R.E.M.)

Forming the second part of the trilogy that commenced with:

New Adventures in Wi-Fi – Track 1: Blogging

 
Introduction

To tweet, or not to tweet. That is the question.

First of all some caveats:

I am not a social media expert, nor any of its many variants.
I do not work in marketing or PR.
I will not be encouraging you to unleash the power of FaceTube/YouSpace/MyBook to make the world a better place (and your bank vault a fuller one), or to sell a million more of your product.
I can not claim to have some secret formula for success in the world of on-line communication (indeed I tend to be allergic to such things as per Recipes for Success?).

If you want all the answers, then please look elsewhere. Good luck with your search!

However:

I am an IT person, with a reasonable degree of commercial awareness and a background in sales and sales support.
I have been involved in running web-sites and various on-line communities since 1999.
I do author a business, technology and change blog that has been relatively well-received (why else would you be reading this?)
I think that Twitter.com can be an extremely useful way of interacting with people, expanding your network and coming into contact with interesting new people.

This is the middle chapter of a series of articles about the experiences of a neophyte in the sometimes confusing world of social media. View this article as akin to Herodotus describing crocodiles and you won’t go far wrong. If you learn something useful, then that’s great. If not, I hope that my adventures prove a harmless diversion for the reader.
 
 
Origins

I thought of adding a fourth zero, but that seemed much too applied. For the avoidance of doubt this illustration should not be taken as an endorsement of Ab Initio.

I covered some of my previous forays in what has now come to be called social media in my earlier article, so I won’t revisit them here. The main focus of this piece is Twitter, a service that I joined back in December 2008, a couple of months after establishing this blog. It took me some time to figure Twitter out and I am still not sure that I entirely “get it”.

In a recent article – How I write – I referred to many of my blog posts flowing quickly and easily. I must admit that writing this piece is proving to be something more of a struggle. Perhaps this reflects the fact that making progress on Twitter was also anything but easy. Indeed I felt that for a long time I was blundering about without any real idea about how to use the medium, or what I wanted to use it for. It also probably reflects my admitted lack of expertise in social media.
 

An aside for fellow pedants:

One in a million

Twitter is positioned as a micro-blogging service. This terminology offends the scientific bent of my mind. Micro (μικρός) implies 10⁻⁶ or one millionth. I tend to write relatively long blog posts and the average size of one of my articles is about 1,200 words; this equates to just over 7,000 characters. Twitter’s 140 character limit (originally set as the length of an SMS) is one fiftieth of this figure, so a more accurate description of Twitter would be a centi-blogging service; for less verbose bloggers maybe deci-blogging would also work.
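For pedants who like to check the arithmetic, the ratio above can be reproduced in a few lines of Python. This is only a sketch: the 7,000 character article length and the 140 character limit are the figures quoted above, while the function name `nearest_prefix` and the small table of prefixes are my own illustrative choices.

```python
import math

ARTICLE_CHARS = 7_000   # approximate length of a typical article, from the text
TWEET_CHARS = 140       # Twitter's character limit, from the text

# A few SI prefixes, keyed by their power of ten (illustrative selection)
SI_PREFIXES = {-1: "deci", -2: "centi", -3: "milli", -6: "micro"}

def nearest_prefix(ratio):
    """Return the SI prefix whose power of ten is closest to ratio on a log scale."""
    exponent = math.log10(ratio)
    closest = min(SI_PREFIXES, key=lambda power: abs(power - exponent))
    return SI_PREFIXES[closest]

ratio = TWEET_CHARS / ARTICLE_CHARS
print(f"A tweet is 1/{ARTICLE_CHARS // TWEET_CHARS} of an article")
print(f"Nearest prefix: {nearest_prefix(ratio)}-blogging")
```

Strictly speaking one fiftieth is two hundredths, so "centi" is simply the closest power of ten; a blogger averaging around 1,400 characters would land on "deci" by the same measure.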

 
Many aficionados of Twitter claim that it is the ideal way to promote your product, your service and/or yourself (or all three at the same time). The same people also say that it is a great tool for listening to existing and potential customers, obtaining information about what they like and dislike and picking up on trends. All this may very well be true, but this is not how I have come to use Twitter and I will not be covering any of these aspects here.

For me the facility is not really about reaching a wide audience – however much I may be passionate about areas such as Business Intelligence, I realise that not everyone will feel the same. Instead it has been a great way to discover the members of a broader worldwide technology community focussed on areas such as databases and data warehousing, BI tools and approaches, numerical and text-based analysis and general technology industry issues.
 
 
So what is all the fuss about anyway?

How come it doesn't recognise twttr.com?

Twitter started as a way to post updates from your mobile ‘phone by texting a message to a number (07624 801 423 here in the UK). The messages would generally be about the sorts of things that you would be doing when you don’t have access to a PC, but do to a mobile ‘phone. For example:

  • “I’m standing in line at the grocery”,
  • “It is raining outside”,
  • “The girl opposite on the bus is looking at me”,
  • “Oh dear so is her boyfriend and he seems less friendly”

These messages were then posted on-line and could be read by other people. If these people found your output interesting (and let’s face it who could not be captivated by the examples I quote above), then they could subscribe to your posts (or follow your tweets in the lingo). When someone follows you, you are notified and can return the compliment if you wish. In this way the network of people with whom you can share your updates grows.

At some point people began to realise that you could skip the mobile bit and use your computer to post tweets directly on-line. This opened up the entire:

  • "I was surfing the Net and found this cool site http://…"

type post and the rest, as they say, is history. I have tweeted via my mobile ‘phone recently, but only by first loading Opera Mini and going to twitter.com. I suspect that there are people out there who have never sent an SMS to their Twitter account.

Hamlet Act 2, scene 2, 86–92

A relic of this history is the aforementioned 140 character limit. Because there is not much room to type, there is a limit to the length of thought that you can share. In turn this means that a defining characteristic of Twitter is brevity. For someone such as me who is not known for having this quality as a core characteristic, this presents something of a challenge. However when you have something exceeding 140 characters to say, the Twitter limit forces the approach of writing it down somewhere else (e.g. on a blog) and then posting a link. A lot of my Twitter posts contain either links to this site or to interesting articles that I have found elsewhere. In this way, Twitter has some attributes akin to a more dynamic version of a social bookmarking site (such as reddit.com or del.icio.us).

The other key characteristic of Twitter is interaction. Most of my other tweets are either passing on comments made by other people, or links posted by them – of course this type of behaviour tends to lead to reciprocation, which binds people together (in a positive sense) and also potentially widens the network available to both. The balance of my tweeting is made up of chatting with people (tweeps if you must) either about industry issues, or – probably more frequently – just shooting the breeze.

To me rather than [insert appropriate negative power of 10 here]-blogging Twitter is much more akin to its historical roots of public texting. Instead of SMSing one person, or a small group, you share your abbreviated pearls of wisdom with potentially thousands of people; these people also have a much easier way of following your train of thought. Of course there is no guarantee that they put the same care and attention into reading your tweets as you did into writing them; more on this later.
 
 
Some suggestions for blissful tweeting

Blue skies / Smiling at me / Nothing but blue skies / Do I see ||  Bluebirds / Singing a song / Nothing but bluebirds / All day long

These are some things that have worked for me and seem to make sense. There are lots of alternative perspectives out there, just a google away:

  1. Go to twitter.com and sign-up for an account.

    Unless you want to stay anonymous, I would suggest using your real name and a user name that is close to this: I’m @peterjthomas for example.

  2. Fill in your profile and tell people a bit about yourself.

    There is nothing more off-putting than being followed by someone, clicking on their page and finding… nothing. Why would anyone want to listen to what you have to say if you don’t lay down some markers here? While you are at it, think about customising your page to make it a bit more distinctive. But don’t go to town; at least at present, it is not that easy to come up with a scheme that will work on multiple screen resolutions.

  3. Find some people to follow.

    This can be a little easier said than done. What you are most likely looking for is people with similar interests to yourself. There are a number of approaches.

    1. You may already know some peers who use Twitter. As well as following them, go to their page (www.twitter.com/their-account-name) and see who they interact with when speaking about subjects that you also want to talk about. If they don’t have thousands of followers, take a look at the list and also look at who they follow.
    2. Many people in the blogosphere (as well as many corporations) have a Twitter presence and will often advertise this fact. If you have found an interesting blog article – say this one – then scan the site to see if there is a Twitter link; more often than not there will be.
    3. If you end up following someone that you view as being influential in your area, then take a look at the people that he or she tweets with – they will probably also be worth following.
    4. You can also use twitter search to see what other people are talking about that might be of interest – the following link looks for references to business intelligence: https://twitter.com/search?q=%23businessintelligence (more on how to tag your tweets later). It may be that some of the people that come up in a search list are worth following.
    5. Finally you can let other people do the hard work for you every Friday. Follow Friday is a Twitter tradition in which people give recommendations of tweeps that they feel others may want to follow. This can be gold-dust for someone hoping to find like-minded people.
       
  4. Think about how to get people to follow you.

    Maybe a good way to think about this is to consider the exercise that you have just completed to look for people to follow. What would make your Twitter account come into focus in such a process? Whatever you are looking for in someone to follow, similar people will also be looking for, so try to fit the bill.

    If you are looking for people who share cool articles, then share cool articles. If you are looking for people who express opinions about things that are important to you, then express opinions; either on Twitter, or via a blog and post links on Twitter. If you are looking for people who engage with others, then engage with others yourself. You can reference people who are not following you (and indeed who you are not following) just by putting an ‘@’ in front of their name.

    For example even if you are not following me and you post:

    “Wow! that @peterjthomas really knows his business intelligence”

    then first of all I will notice (as you reference me) and second I’m as human as the next person and am likely to at least consider following you, or at the very least sharing your comment with my followers.

    An aside on sharing tweets:

    Twitter etiquette is that you don’t share other people’s tweets without referencing them. So in the above example I might re-tweet your kind comment as:

    “RT @your-name Wow! that @peterjthomas really knows his business intelligence << Thanks”

    The RT stands for re-tweet and the << indicates my additional comments, in this case to say thank you – people do the latter in a number of ways. An alternative to using RT is as follows:

    “Wow! that @peterjthomas really knows his business intelligence (via @your-name)”

    Not only is this polite, but now @your-name and @peterjthomas are linked – if I am worth following, then having me mention you is a worthwhile objective.

    Of course the other two keys to gaining followers are the same as for getting people to read your blog: first share links that are worthwhile sharing (particularly if they are your own work) and second try to engage with people and refrain from being a passive by-stander.

One thing that is probably dawning on any Twitter novices right now is that the above are not discrete activities that you do once and then are finished with. If you want to get the most out of Twitter, then you will have to keep doing them.
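For the technically minded, the two re-tweet conventions described above can be sketched in Python. This is purely illustrative string handling, not a call to any Twitter API; the function names `retweet_rt`, `retweet_via` and `fits` are hypothetical, and `your-name` is the same placeholder account used in the examples above.

```python
TWEET_LIMIT = 140  # Twitter's character limit, as discussed earlier

def retweet_rt(author, text, comment=""):
    """Build an 'RT @author ...' style re-tweet, with an optional '<<' comment."""
    tweet = f"RT @{author} {text}"
    if comment:
        tweet += f" << {comment}"
    return tweet

def retweet_via(author, text):
    """Build the alternative '(via @author)' style re-tweet."""
    return f"{text} (via @{author})"

def fits(tweet):
    """Check that a tweet is within the character limit."""
    return len(tweet) <= TWEET_LIMIT

original = "Wow! that @peterjthomas really knows his business intelligence"
for tweet in (retweet_rt("your-name", original, "Thanks"),
              retweet_via("your-name", original)):
    print(len(tweet), fits(tweet), tweet)
```

Either form credits the original author; the length check simply makes the point that the credit and any comment of your own have to share the same 140 characters as the text being passed on.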
 
 
More advanced techniques

Paul Dirac - the UK's greatest physicist since Newton

Unless you are looking to create a social media presence for a Fortune 500 company (assuming that there are any left who have not already created such a thing), then the above pointers are probably more than enough to get you started. Like me you may then just muddle through, hopefully learning from your mistakes. Alternatively, there are any number of guides out there which may or may not strike a chord with you and suit your personal style; just search for them.
 
Be yourself

On the subject of personal style, I’d suggest (as I also suggested in my article on blogging) that you be yourself on Twitter. Even within 140 characters, trying to be something that you are not comes across as fake; people aren’t impressed. On the same subject, treat people as you would face-to-face. If you are trying to sell something – even just your personal brand – then would you ram this down people’s throats in person? If not, then why would it be OK to do this on Twitter? A more low key approach is likely to lead to engagement and a better outcome than blowing your trumpet from the roof-tops (I know, I have tried the latter and it doesn’t work too well).
 
Use hash-tags

Above I mentioned tagging your posts. So if you write something about cloud computing, you might want to tag it with a key word, e.g. “cloud”. Though Twitter’s own search engine and the various other tools that you can employ on Twitter data will search for any occurrence of specified text, it is still traditional to use hash tags, so in the above example a tweet might look like:

“I have just come across a great article summarising new developments in cloud computing – http://link.here #cloud”

As ever the incomparable xkcd.com has a view on this world that is both acerbic and insightful:

I learnt everything I know about title/alt text from Randall Munroe
© xkcd.com

To see a slightly more positive use of Twitter search and hash-tags, try looking up coverage of a recent Teradata analyst event using #td3pi.
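A hash-tag search link like the one quoted earlier can be assembled by hand; the only wrinkle is that the `#` character has to be percent-encoded as `%23` inside a URL query. Here is a minimal Python sketch, assuming the search address quoted earlier in this article; the function name `hashtag_search_url` is my own invention.

```python
from urllib.parse import quote

# Search address as quoted earlier in this article
SEARCH_BASE = "https://twitter.com/search?q="

def hashtag_search_url(tag):
    """Return a Twitter search link for a hash-tag, percent-encoding the '#'."""
    hashtag = tag if tag.startswith("#") else "#" + tag
    return SEARCH_BASE + quote(hashtag, safe="")

print(hashtag_search_url("businessintelligence"))
print(hashtag_search_url("#td3pi"))
```

The function accepts a tag with or without its leading `#`, so both calls above produce links of the same shape.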
 
Shorten your URLs

On the subject of links, the 140 character limitation means that you don’t want to waste space with long URLs. Using a URL shortener is mandatory – I use http://bit.ly but there are many other such tools out there.
 
Check out the wide range of Twitter-related tools

Now that the subject of tools has come up, there is an entire hinterland of Twitter-related tools that can do a wide range of things to help you. These include:
 

  1. Twitter platforms

    These help to manage your entire Twitter experience from reading other people’s posts, to making your own (sometimes doing link shortening for you automatically). If you are successful in finding people to follow and attracting people to follow you, then there will come a time when the noise level becomes unmanageable. This type of tool can help by providing filters and groups, which enable you to make sense of a tsunami of tweets, organise them and prioritise your time.

    I use TweetDeck, but again there are many alternatives.

  2. Twitter add-ins

    These are generally what you would employ on your blog or other internet site to allow people to easily tweet your content. There are several very slick and attractive looking options out there, just take a look at a handful of sites and take your pick. I’m staying old-school for the present and hand-coding my Twitter links (as at the end of this article).

  3. Twitter analytics

    This is rather a grand name which covers everything from the trend of how many people are following you through to quite sophisticated analysis. Rather than provide a list, take a look at one that Pam Dyer has put together here.

  4. Other

    There are a lot of fun Twitter-related applications out there. Just one example to whet your appetite is the following app, written by @petewarden which graphs your relations to other people on Twitter and gives a very visual perspective on the totality of your tweeting:

    XXX
    © http://twitter.mailana.com

 
 
In closing

I chose to close this article with the above image for a reason. To me it captures the essence of what Twitter is about; forming a network of associations with people who can enrich your understanding, provide you with fresh perspectives, or even simply make you smile. The diagram looks awfully like a community doesn’t it? If you enjoy reading this blog and are looking for people to follow who might share your world-view, then clicking on the above graphic and checking out some of the people I interact with most may be a good starting point.

If you choose to take the plunge with Twitter then good luck and I hope that you get as much out of it as I have. You can also then do me a favour and use the handy link just below to share this article with your followers!
 


 
The New Adventures in Wi-Fi series of articles on Social Media concludes with a piece on professional networking and LinkedIn here.
 

Business logic

The dot product of the original sketch and my plagiarism of it is 0

With enormous apologies to Randall Munroe of xkcd.com fame; from whose much funnier, and obviously more original, sketch entitled “GOTO” the above was shamelessly adapted.
 


 
Comic strip adapted with the kind permission of the copyright holder.
 

New Adventures in Wi-Fi – Track 1: Blogging

New Adventures in Wi-Fi (with apologies to R.E.M.)
 
Introduction

I established this blog back in November 2008 – shortly after this I joined twitter.com in December 2008 – I had already been a member of LinkedIn.com since July 2005. However, my involvement with what is now collectively called social media goes back a lot further than this. Back then we tended to use the phrase on-line communities to describe what we were engaged in.

My first foray into this new world was in 1998/99 when I joined a, now defunct, discussion forum (then known as a Bulletin Board). This was focused on computer games. I wasn’t terribly into such games at the time; I didn’t own a console and my PC was used for more prosaic purposes. Nevertheless, for reasons that I will not bore the reader with, I signed up. Since then I have been a member of a number of on-line forums, mostly with some sporting element, for example rock climbing.

Yahoo! Geocities

In May 1999, my forum activities led me to create my first web-site (again now also defunct). I started on Geocities (another chance to use the word “defunct”) and then moved to having my own domain and an agreement with a hosting company. I even ended up jointly running a very successful forum with an on-line friend from Australia. Back then the men were real men, the women were real women and the HTML was real HTML. However this article is not about ancient history, but rather about my more recent experiences in social media.

Nowadays, nobody seems to think of it as being odd that you regularly “speak” to people you have never met and who inhabit countries on the other side of the world. People do not slowly back away from you at parties if you drop the fact that you have your own web-site into a conversation (though maybe one reason that the portmanteau of web-log became socially acceptable is that its abridgement to blog sounds the opposite of technological). It was not always thus and maybe I retain something of the spirit of those pioneering days. For example, I am currently typing these words into the HTML pane of WordPress.com. Old habits die hard and WYSIWYG is for softies!

Social media is now mainstream – in fact you could argue that it is real life that has become a minority activity – and things are a lot easier. Although I doggedly insist on still cutting HTML, you can be up and running with a fairly professional-looking blog on WordPress in minutes and without having to know much about any of the technical underpinnings. Software as a Service certainly works really well as an approach to blogging.

Over a number of articles, I am going to touch upon my recent experience of Social Media in the three areas that I first mentioned at the beginning: blogging, micro-blogging and professional networking. Without fully revealing the denouement of this series, I will state now that one of the most interesting things is how well these three areas work in combination and how mutually reinforcing they have become for me. The sequence starts with my thoughts on blogging.
 
 
WordPress and Motivation

WordPress.com

I suppose I have to thank my partner for getting me into this area as she started her blog long before any of mine. However, having suffered a couple of climbing-related injuries I started my own training blog, both to chart my recovery and to act as a motivational tool.

I started out using Blogger as that was what my partner had used, but got rather frustrated with its lack of support for some basic HTML constructs (e.g. tables). A friend suggested WordPress instead and this became the venue for my training blog. Somewhat amazingly this is not defunct. However, after a period when I religiously posted at least once or twice every week, I haven’t updated it in a long while.

When I wanted to start a professional blog, WordPress seemed the way to go and I have been mostly happy with my choice. But what were my motivations for blogging about business-related issues? I guess that there were a few of these, in no particular order:

  1. I wanted to build upon the public profile that appearing in press articles and speaking at seminars had afforded me.
  2. I like writing and the idea of doing this in a more general context than internal strategy papers and memoranda seemed appealing.
  3. Based on the feedback I had received from my public speaking, I believed that I had quite a lot of relevant experience to draw on which might make interesting reading; at least for a niche audience.
  4. Although it would be fair to say that I started writing mostly for myself, over time the idea of building a blog following seemed like a challenge and I like challenges.
  5. In this same category of emergent motivation, after a short while the notion of establishing a corpus of work, spanning my ideas about a range of issues also became a factor. Maybe some element of Narcissism is present in most blogging.
  6. There was a big slice of simple curiosity about the area, how it worked and how I could be a part of it. You get some interaction in public speaking, but I was intrigued by the idea of getting the benefit of the input of a wider range of people.

So I leapt in with both feet and my first article was based on some reflections on attending a Change Management seminar. It was entitled Business is from Mars and IT is from Venus and dealt with what I see as an artificial divide between IT and business groups. I suppose it makes sense to start as you mean to go on and IT / Business alignment has been a theme running through much of what I have written.
 
 
Things that I have learnt so far

In a subsequent piece, Recipes for success?, I expressed my scepticism about articles of the type “My Top Ten Tips for Successful Blogging”, so the following is not meant to be a set of precepts to be followed to the last letter. Instead, with the benefit of over 60,000 page views (small beer compared to many blogs), here are some things that have worked for me. If some of these chime with your own experience, then great. If others are not pertinent to you, then this is only to be expected.

Finally I should also stress that these observations relate mostly to professional blogs; for personal blogs there are essentially no constraints on your creativity (assuming that the results of this are legal of course).

  1. Write about areas that you know something about. You don’t have to be a world authority, but on a professional blog, no one is going to be that interested in your fevered speculations on something that you know nothing about. This is one of many reasons that you will never see me blogging about IT Infrastructure!
  2. When you blog about an area of personal expertise, then you can be pretty free in expressing your opinions, though [note to self] a dose of humility never did anyone any harm.

    If you know as much as him, then knock yourself out. Else proceed with caution!

    When the subject is one in which your own knowledge is less well-developed (for me something like text analytics would fall into this category), then seek out the opinions of experts in the field and quote these (even if you disagree with them). Linking to the places that experts have expressed their thoughts also expands your network and increases the utility of your blog, which becomes part of a wider world.

  3. It helps if you are interested in the majority of the topics that you cover. If you are unmotivated about something, then why write about it? If you decide to do so for some reason (maybe because you haven’t written anything else this week, or because a piece of news is “hot” at present) then your personal ennui will seep into your words and be evident to your readers. No doubt it will generate similar feelings in them.
  4. Beyond the previous point, I would go further and say that it is crucial that you are truly passionate about at least one thing that you write about and ideally several. Expressing strong opinions is fine, assuming that you have some reason for holding them and that you remain open to the ideas of other people. For me, these areas of passion are Business Intelligence, its intimate connection with Cultural Transformation and the related area of IT / Business Alignment.

    Passion is not only important because it will hopefully infuse your words, but because it will sustain you returning to write about these areas over a long period of time. There are an awful lot of blogs out there where a bright beginning has petered out because the author had nothing left to say, or has lost interest.

  5. For the same reasons relating to sustaining your blog, I would recommend being yourself. If you really want to present an alternative personality to the world, then good luck to you (and your therapist); you will have to possess enormous perseverance and be a very talented actor.

    Not an ideal way to write your blog

    For me this means the presence of strong elliptical and eclectic qualities to my articles. I can do terse and to the point when it is necessary, but circumlocution is more my stock-in-trade. I’m more comfortable being myself and if this means my audience is one composed of people yearning for elliptic, eclectic, circumlocutory writing, then so be it!

  6. To me being yourself extends to the quantity of your writing. In an era sometimes characterised as one of short attention spans and instant gratification, the orthodox advice is to be punchy and direct. Sometimes the point I want to make in one of my articles (assuming that I can remember what this is by the time I get to the end of writing it) takes some time to develop – like a fine wine I like to think (or a mould the less kind might add).

    Not my target audience

    This means that my writing tends to resemble the River Amazon in both its meandering nature and length. I appreciate that this narrows my potential audience, but hope that it also means that at least a few people get some more out of it than they would from the CliffsNotes version.

  7. Blogging should also be about interaction. If you simply want to broadcast your incredibly wise thoughts, then write a book. I hope that some of the pieces that I write spur others to record their own thoughts, either as comments here, or in their own blog articles. If some of my ideas make it into other people’s PowerPoint decks or project proposals, then I am honoured.

    Equally, virtually everything that I write has been inspired to some degree by other people: co-workers, authors, the people that I come into contact with on the Internet and in real life on a daily basis and so on. I try to explicitly acknowledge (and link to) what has inspired me when I write, but I am sure that thousands of unconscious influencers go un-credited.

  8. While passion and having opinions contribute to developing your own voice, it is important to never think that you have all the answers. In a blogging context this means treating anyone who has taken the time to comment on your writing with the respect that this act deserves. While starting a conversation is clearly the best outcome of someone commenting on your blog, a simple ‘thank you’ from the author should be the very least that you can offer (when people whinge about the England cricket team having cheated their way to victory, this is an obvious exception to the rule).
    What do you want me to do? LEAVE? Then they'll keep being wrong!
    © xkcd.com

    In this area I also try to avoid deleting comments that are derogatory about my ideas. The approach I take is rather to either seek further clarification on why the contributor thinks this way, or to politely argue why I still believe that the points that I have made are valid. Of course I have not always 100% lived up to this aspiration!

  9. As in virtually every aspect of life, treating others as you would like to be treated yourself is not a bad approach. If you enjoy people commenting on your articles or linking to your blog, then maybe proactively doing these things yourself is a good idea. I don’t mean adding comments purely for the sake of it; that sounds awfully like spam. But if you read something that you find interesting, then thank the author.

    Better still, augment what they have written with your own ideas – either on their blog or in a piece on your own site that links back to their article. Even in this day and age, it is amazing how far being nice to people can get you. For the same reason, try to be as polite on-line as you would be in your more traditional professional life.

  10. [Yes I am aware of the irony of having ten bullet points here!]

    Finally, I mentioned the Narcissistic tendencies that can either be a cause or effect of blogging. I think that trying to not take yourself too seriously is a must as an antidote to this. Both the medium and my prose can veer towards the preachy sometimes, so some well-placed self-deprecation to balance this never goes amiss.

I hope that some readers will have been interested in my observations and that they will have helped a further subset of these in their blogging. For those who are pondering whether to join the blogosphere, my simple advice is give it a go. You will either hate it or love it, but at least you won’t die wondering “what if?”
 


 
The New Adventures in Wi-Fi series of articles on Social Media continues by discussing the relatively new world of micro-blogging and the phenomenon that is Twitter here.
 


 

Especially for all Business Analytics professionals out there

Last week I was being interviewed by a journalist about Business Analytics amongst other things. I found myself speaking about the perils faced in extrapolation that are significantly less scary when merely interpolating.

Serendipity had led to the following cartoon appearing on the web-site of that doyen of scientific humour Randall Munroe, namely xkcd.com.

By the third trimester, there will be hundreds of babies inside you.
A lesson for us all - © xkcd.com

I’m sure this drawing must have appeared on some other BI blogs, but what the hell, it merits posting again in my opinion.