# The Anatomy of a Data Function – Part II

 Part I Part II Part III

This is the second part of my review of the anatomy of a Data Function; the artfully named Part I may be viewed here. As seems to happen all too often to me, this series will now extend to having a Part III, which will be published in coming weeks.

In the first article, I introduced the following Data Function organogram:

and went on to cover each of Data Strategy, Analytics & Insight and Data Operations & Technology. In Part II, I will consider the two remaining Data Function areas of Data Architecture and Data Management. Covering Related Areas, and presenting some thoughts on how to go about setting up a Data Function and the pitfalls to be faced along the way, will together form the third and final part of this trilogy.

As in Part I, unless otherwise stated, text indented as a quotation is excerpted from the Data and Analytics Dictionary.

Data Architecture

To be somewhat self-referential, this area acts as a cornerstone for the rest of the Data Function. While sometimes non-Data architects can seem to inhabit a loftier plane than most mere mortals, Data Architects (who definitively must be part of the Data Function and none of the Business, Enterprise or Solutions Architecture groups) tend to be more practical sorts with actual hands-on technical skills. Perhaps instead of the title “Architect”, “Structural Engineer” would be more appropriate. When a Data Architect draws a diagram with connected boxes, he or she generally understands how the connections work and could probably take a fair stab at implementing the linkages themselves. The other denizens of this area, such as Data Business Analysts, are also essentially pragmatic people, focused on real business outcomes. Data Architecture is a non-theoretical discipline and here I present some of the real-world activities that its members are often engaged in.

Change Portfolio Engagement

One of the most important services that a good Data Function can perform is to act as a moderator for the otherwise deleterious impact that uncontrolled (and uncoordinated) Change portfolios can have on even the best of data landscapes [1]. As I mention in another article:

Over the last decade or so, the delivery of technological change has evolved to the point where many streams of parallel work are run independently of each other with each receiving very close management scrutiny in order to ensure delivery on-time and on-budget. It should be recognised that some of this shift in modus operandi has been as a result of IT departments running projects that have spiralled out of control, or where delivery has been significantly delayed or compromised. The gimlet-like focus of Change on delivery “come Hell or High-water” represents the pendulum swinging to the other extreme.

What this shift in approach means in practice is that – as is often the case – when things go wrong or take longer than anticipated, areas of work are de-scoped to secure delivery dates. In my experience, 9 times out of 10 one of the things that gets thrown out is data-related work; be that not bothering to develop reporting on top of new systems, not integrating new data into existing repositories, not complying with data standards, or not implementing master data management.

As well as the danger of skipping necessary data related work, if some data-related work is actually undertaken, then corners may be cut to meet deadlines and budgets. It is not atypical for instance that a Change Programme, while adding their new capabilities to interfaces or ETL, compromises or overwrites existing functionality. This can mean that data-centric code is in a worse state after a Change Programme than before. My roadworks anecdote begins to feel all too apt a metaphor to employ.

Looking more broadly at Change Programmes, even without the curse of de-scopes, their focus is seldom data and the expertise of Change staff is not often in data matters. Because of this, such work can indeed seem to be analogous to continually digging up the same stretch of road for different purposes, combined with patching things up again in a manner that can sometimes be barely adequate. Extending our metaphor, the result of Change that is not controlled from a data point of view can be a landscape with lumps, bumps and pot-holes. Maybe the sewer was re-laid on time and to budget, but the road has been trashed in the process. Perhaps a new system was shoe-horned in to production, but rendered elements of an Analytical Repository useless in the process.

Excerpted from: Bumps in the Road

A primary responsibility of a properly constituted Data Function is to lean hard against the prevailing winds of Change in order to protect existing data capabilities that would otherwise likely be blown away [2]. Given the gargantuan size of most current Change teams, it makes sense to have at least a reasonable amount of Data Function resource applied to this area. Hopefully early interventions in projects and programmes can mitigate any potentially adverse impacts and perhaps even lead to Change being accretive to data landscapes, as it really ought to be.

The best approach, as with most human endeavours, is a collaborative one, with Data Function staff (probably Data Architects) getting involved in new Change projects and programmes at an early stage and shaping them to be positive from a Data dimension. However, there also need to be teeth in the process; on occasion the Data Function must be able to prevent work that would cause true damage from going ahead. Hopefully these are powers that are used more in breach than observance.

Data Modelling

It is in this area that the practical bent of Data Architects and Data Business Analysts is seen very clearly. Data modelling mirrors the realities of systems and databases the way that Theoretical Physicists use Mathematics to model the Natural World [3]. In both cases, while there may be a degree of abstraction, the end purpose is to achieve something more concrete. A definition is as follows:

[Data Modelling is] the process of examining data sets (e.g. the database underpinning a system) in order to understand how they are structured, the relationships between their various parts and the business entities and transactions they represent. While system data will have a specific Physical Data Model (the tables it contains and their linkages), Data Modelling may instead look to create a higher-level and more abstract set of pseudo-tables, which would be easier to relate to for non-technical staff and would more closely map to business terms and activities; this is known as a Conceptual Data Model. Sitting somewhere between the two may be found Logical Data Models. There are several specific documents produced by such work, one of the most common being an Entity-Relationship diagram, e.g. a sales order has a customer and one or more line items, each of which has a product.

Data and Analytics Dictionary entry: Data Modelling
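To make the Entity-Relationship example in the definition a little more concrete, here is a minimal sketch of the sales order model it describes. The class and field names (`Customer`, `Product`, `LineItem`, `SalesOrder`, `quantity`, `unit_price` and so on) are illustrative assumptions and do not come from the Dictionary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str


@dataclass(frozen=True)
class Product:
    product_id: int
    description: str
    unit_price: float


@dataclass
class LineItem:
    product: Product  # each line item has exactly one product
    quantity: int

    def value(self) -> float:
        return self.quantity * self.product.unit_price


@dataclass
class SalesOrder:
    order_id: int
    customer: Customer  # a sales order has exactly one customer
    line_items: List[LineItem] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the "one or more line items" cardinality
        if not self.line_items:
            raise ValueError("a sales order needs at least one line item")

    def total(self) -> float:
        return sum(item.value() for item in self.line_items)


# Made-up example data
widget = Product(100, "Widget", 2.5)
order = SalesOrder(1, Customer(10, "Acme Ltd"), [LineItem(widget, 4)])
print(order.total())  # 10.0
```

A Conceptual Data Model would express the same relationships in business terms; a Physical Data Model would typically turn each class into a table, with the references becoming foreign keys.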

Another critical role is that of the Data Business Analyst. In my long experience of both setting up Data Functions and running Data Programmes, having good Data Business Analysts on board is often the difference between success and failure. I cannot stress enough how important this role is.

Data Business Analysts are neither regular Business Analysts, nor just Data Analysts, but rather a combination of the best of both. They do have all the requirement gathering skills of the best BAs, but complement these with Data Modelling abilities, always seeking to translate new requirements into expanded or refined Data Models. Also the way that they approach business requirements will be very specific. The optimal way to do this is by teasing out (and then collating and categorising) business questions and then determining the information needed to answer these. A good Data Business Analyst will also have strong Data Analysis skills, being able to work with unfamiliar and lightly-documented datasets to discern meaning and link this to business concepts. A definition is as follows:

A person who has extensive understanding of both business processes and the data necessary to support these. A Business Analyst is expert at discerning what people need to do. A Data Analyst is adept at working with datasets and extracting meaning from them. A Data Business Analyst can work equally happily in both worlds at the same time. When they talk to people about their requirements for information, they are simultaneously updating mental models of the data necessary to meet these needs. When they are considering how lightly-documented datasets hang together, they constantly have in mind the business purpose to which such resources may be bent.

Data and Analytics Dictionary entry: Data Business Analyst

Data Management

Again, it is worth noting that I have probably defined this area more narrowly than many. It could be argued that it should encompass the work I have under Data Architecture and maybe much of what is under Data Operations & Technology. The actual hierarchy is likely to be driven by factors like the nature of the organisation and the seniority of Managers in the Data Function. For good or ill, I have focussed Data Management more on the care and feeding of Data Assets in my recommended set-up. A definition is as follows:

The day-to-day management of data within an organisation, which encompasses areas such as Data Architecture, Data Quality, Data Governance (normally on behalf of a Data Governance Committee) and often some elements of data provision and / or regular reporting. The objective is to appropriately manage the lifecycle of data throughout the entire organisation, which both ensures the reliability of data and enables it to become a valuable and strategic asset.

In some organisations, Data Management and Analytics are part of the same organisation, in others they are separate but work closely together to achieve shared objectives.

Data and Analytics Dictionary entry: Data Management

Data Governance

There is a clear link here with some of the Data Architecture activities, particularly the Change Portfolio Engagement work-area. Governance should represent the strategic management of the data component of Change (i.e. most of Change); day-to-day collaboration would sit more in the Data Architecture area.

The management processes and policies necessary to ensure that data captured or generated within a company is of an appropriate standard to use, represents actual business facts and has its integrity preserved when transferred to repositories (e.g. Data Lakes and / or Data Warehouses, General Ledgers etc.), especially when this transfer involves aggregation or merging of different data sets. The activities that Data Governance has oversight of include the operation of and changes to Systems of Record and the activities of Data Management and Analytics departments (which may be merged into one unit, or discrete but with close collaboration).

Data Governance has a strategic role, often involving senior management. Day-to-day tasks supporting Data Governance are often carried out by a Data Management team.

Data and Analytics Dictionary entry: Data Governance

Data Definitions & Metadata

This is a relatively straightforward area to conceptualise. Rigorous and consistent definitions of master data and calculated data are indispensable in all aspects of how a Data Function operates and how an organisation both leverages and protects its data. Focusing on Metadata, a definition would be as follows:

[Metadata is] data about data. So descriptions of what appears in fields, how these relate to other fields and what concepts bigger constructs like Tables embody. This helps people unfamiliar with a dataset to understand how it hangs together and is good practice in the same way that documentation of any other type of code is good practice. Metadata can be used to support some elements of Data Discovery by less technical people. It is also invaluable when there is a need for Data Migration.

Data and Analytics Dictionary entry: Metadata

Data Audit

One of the challenges in driving Data Quality improvements in organisations is actually highlighting the problems and their impacts. Often poor Data Quality is a hidden cost, spread across many people taking longer to do their jobs than is necessary, or specific instances where interactions with business counterparties (including customers) are compromised. Organisations obviously cope – at least in general – with these issues, but they are a drag on efficiency and, in extremis, can lead to incidents which can cause significant financial loss and/or reputational damage. A way to make such problems more explicit is via a regular Data Audit, a review of data in source systems and as it travels through various data repositories. This would include some assessment of the completeness and overall quality of data, highlighting areas of particular concern. So one component might include the percentage of active records which suffer from a significant data quality issue.

It is important that any such issues are categorised. Are they the result of less than perfect data entry procedures, which could be tightened up? Are they due to deficient validation in transactional systems, where this could be improved and there may be a role for Master Data Management? Are data interfaces between systems to blame, where these need to be reengineered or potentially replaced? Are there architectural issues with systems or repositories, which will require remedial work to address?

This information needs to be rolled up and presented in an accessible manner so that those responsible for systems and processes can understand where issues lie. Data Audits, even if partially automated, take time and effort, so it may be appropriate to carry them out quarterly. In this case, it is valuable to understand how the situation is changing over time and also to track the – hopefully positive – impact of any remedial action. Experienced Data Analysts with a good appreciation of how business is conducted in the organisation are the type of resource best suited to Data Audit work.
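As a rough illustration of the headline measure described above, the following sketch calculates the percentage of active records with a significant issue, splits it by root cause and compares two quarterly audits. The record structure, the category names and the sample data are all hypothetical; a real Data Audit would draw on checks specific to each system.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical root-cause categories, mirroring the questions posed above
CATEGORIES = ("data_entry", "validation", "interface", "architecture")


@dataclass
class AuditedRecord:
    record_id: int
    active: bool
    issue: Optional[str] = None  # None means no significant issue was found


def audit_summary(records: List[AuditedRecord]) -> Dict[str, float]:
    """Percentage of active records with an issue, overall and per category."""
    active = [r for r in records if r.active]
    if not active:
        return {"overall": 0.0, **{c: 0.0 for c in CATEGORIES}}
    counts = Counter(r.issue for r in active if r.issue is not None)
    summary = {"overall": 100.0 * sum(counts.values()) / len(active)}
    for category in CATEGORIES:
        summary[category] = 100.0 * counts[category] / len(active)
    return summary


def change_between_audits(previous: Dict[str, float],
                          current: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point movement per measure; negative values are improvements."""
    return {key: current[key] - previous[key] for key in current}


# Made-up results from two consecutive quarterly audits
q1 = [AuditedRecord(1, True, "data_entry"), AuditedRecord(2, True, "interface"),
      AuditedRecord(3, True), AuditedRecord(4, True),
      AuditedRecord(5, False, "validation")]  # inactive, so excluded
q2 = [AuditedRecord(1, True), AuditedRecord(2, True, "interface"),
      AuditedRecord(3, True), AuditedRecord(4, True)]

print(audit_summary(q1)["overall"])  # 50.0
print(change_between_audits(audit_summary(q1), audit_summary(q2))["overall"])  # -25.0
```

Keeping the per-category figures alongside the overall one makes it easier to see whether remedial action in one area, say tighter data entry procedures, is actually what moved the headline number.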

Data Quality

Much that needs to be said here is covered in the previous section about Data Audit. Data Quality can be defined as follows:

The characteristics of data that cover how accurately and completely it mirrors real world events and thereby how much reliance can be placed on it for the purpose of generating information and insight. Enhancing Data Quality should be a primary objective of Data Management teams.

Data and Analytics Dictionary entry: Data Quality

A Data Quality team, which would work closely with Data Audit colleagues, would be focussed on helping to drive improvements. The details of such work are covered in an earlier article, from which the following is excerpted:

There are a number of elements that combine to improve the quality of data:

As with any strategy, it is ideal to have the support of all four pillars. However, I have seen greater and quicker improvements through the fourth element than with any of the others.

Excerpted from: Using BI to drive improvements in data quality

Master Data Management

There is some overlap here with Data Definitions & Metadata as mentioned above. Master Data Management has also been mentioned here in the context of Data Quality initiatives. However this specialist area tends to demand dedicated staff. A definition is as follows:

Master Data Management is the term used to both describe the set of processes by which Master Data is created, changed and deleted in an organisation and also the technological tools that can facilitate these processes. There is a strong relation here to Data Governance, an area which also encompasses broader objectives. The aim of MDM is to ensure that the creation of business transactions results in valid data, which can then be leveraged confidently to create Information.

Many of the difficulties in MDM arise from items of Master Data that can change over time; for example when one counterparty is acquired by another, or an organisational structure is changed (maybe creating new departments and consolidating old ones). The challenges here include how to report historical transactions that are tagged with Master Data that has now changed.

Data and Analytics Dictionary entry: Master Data Management
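One common way of coping with Master Data that changes over time is to hold effective-dated versions of each item, so that a historical transaction can be reported either under the structure in force when it was booked or under the current one. The sketch below illustrates the idea only; it is not drawn from the Dictionary, and the names (`DepartmentVersion`, `department_as_at`, the cost centre `CC1`) are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DepartmentVersion:
    cost_centre: str           # stable key carried on transactions
    department: str            # Master Data attribute that can change
    valid_from: date           # inclusive
    valid_to: Optional[date]   # exclusive; None marks the current version


def department_as_at(versions: List[DepartmentVersion],
                     cost_centre: str, as_at: date) -> str:
    """Return the department a cost centre belonged to on a given date."""
    for v in versions:
        if (v.cost_centre == cost_centre
                and v.valid_from <= as_at
                and (v.valid_to is None or as_at < v.valid_to)):
            return v.department
    raise LookupError(f"no version of {cost_centre} covers {as_at}")


# Made-up reorganisation: CC1 moves from Sales to Commercial on 1 July 2017
history = [
    DepartmentVersion("CC1", "Sales", date(2015, 1, 1), date(2017, 7, 1)),
    DepartmentVersion("CC1", "Commercial", date(2017, 7, 1), None),
]

transaction_date = date(2016, 3, 15)
as_booked = department_as_at(history, "CC1", transaction_date)  # structure at the time
as_current = department_as_at(history, "CC1", date.today())     # structure today
print(as_booked, as_current)  # Sales Commercial
```

Which of the two views a report should use is a business decision rather than a technical one, which is part of why MDM sits so close to Data Governance.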

At this point, we have covered all of the work-areas within our idealised Data Function. In the third and final piece (which is yet to be published), we will consider the right-hand column of Related Areas, ones that a Data Function must collaborate with. Having covered these, the trilogy will close by offering some thoughts on the challenges of setting up a Data Function and how these may be overcome.


Notes

 [1] I am old enough to recall a time before Change portfolios; I can recall no organisation in which I have worked over the last 20 years in which Change portfolios have had a positive impact on data assets. Maybe I have just been unlucky, but it begins to feel more like a fundamental Physical Law.

 [2] I have clearly been writing about hurricanes too much recently!

 [3] As is seen, for example, in the Introduction to my [as yet unfinished] book on the role of Group Theory in Theoretical Physics, Glimpses of Symmetry.

From: peterjamesthomas.com, home of The Data and Analytics Dictionary

# The Anatomy of a Data Function – Part I

 Part I Part II Part III

Back in Alphabet Soup, I presented a diagram covering what I think are good and bad approaches to organising Analytics and Data Management. I wanted to offer an expanded view [1] of the good organisation chart and to talk a bit about each of its components. Originally, I planned to address these objectives across two articles. As happens to me all too frequently, the piece has now expanded to become three parts. The second may be read here. The third is forthcoming.

Let’s leap right in and look at my suggested chart:

I appreciate that the above is a lot of boxes! I can feel Finance and HR staff reaching for their FTE calculators as I write. A few things to note:

1. I have avoided the temptation to add the titles of executives, managers or team leaders. Alphabet Soup itself pointed out how tough it can be to wrestle with the nomenclature. Instead I have just focussed on areas of work.

2. The term “work areas” is intentional. In larger organisations, there may be teams or individuals corresponding to each box. In smaller ones Data Function staff will wear many hats and several work areas may be covered by one person.

3. In some places, a number of work areas that I have tagged as Data Function ones may be performed in other parts of the organisation, though it is to be hoped with collaboration and coordination.

Having dealt with these caveats, let’s provide some colour on each of these progressing from top to bottom and left to right. In this first article we will consider the Data Strategy, Analytics & Insight and Data Operations & Technology areas. The second part will cover the remaining elements of Data Architecture and Data Management. The final article, when published, will consider Related Areas before also covering some of the challenges that may be faced in setting up a Data Function.

In what follows, unless otherwise stated, text indented as a quotation is excerpted from the Data and Analytics Dictionary.

Data Strategy

A clear strategy is obviously most important to establish in the early days of a Data Function. Indeed a Data Strategy may well call for the creation of a Data Function where none currently exists. For anyone interested in this process, I recommend my series of three articles on this subject [2]. However, a Data Strategy is not something carved in stone; it will need to be revisited and adapted (maybe significantly) as circumstances change (e.g. after an acquisition, a change in market conditions or potentially due to the emergence of some new technology). There is thus a need for ongoing work in this area. However, as demand for strategic work will tend to be lumpy, I suggest amalgamating Data Strategy with the following two sub-areas.

Data Comms & Education

Elsewhere on this site, I have highlighted the need for effective communication, education and assiduous follow-up in data programmes [3]. Education on data matters does not stop when a data quality drive is successfully completed, or when a new set of analytical capabilities is introduced; there is a need for an ongoing commitment here. Activities falling into this work area include: publishing regular data newsletters and infographics, designing and helping to deliver training programmes, providing follow-up and support to aid the embedded use of new capabilities or to ingrain new behaviours.

Relationship Management

There is a need for all Data Function staff to establish and maintain good working relations with any colleagues they come into contact with, regardless of their level or influence. However, the nature of, generally hierarchical, organisations is that it is often prudent to pay special attention to more senior staff, or to the type of person (common in many companies) who may not be that senior, but whose opinion is influential. In aggregate these two groups of people are often described as stakeholders. Providing regular updates to stakeholders and ensuring both that they are comfortable with Data Function work and that this is aligned with their priorities can be invaluable [4]. Having senior, business-savvy Data Function people available to do this work is the most likely path to success.

Analytics & Insight

Broadly speaking the Analytics area and its sub-areas are focussed more on one-off analyses rather than the recurrent production of information [5], the latter being more the preserve of the Data Operations & Technology area. There is also more of a statistical flavour to the work carried out here.

[Analytics relates to] deriving insights from data which are generally beyond the purpose for which the data was originally captured – to be contrasted with Information which relates to the meaning inherent in data (i.e. the reason that it was captured in the first place). Analytics often employ advanced statistical techniques (logistic regression, multivariate regression, time series analysis etc.) to derive meaning from data.

Data and Analytics Dictionary entry: Analytics

Data Science

I have Data Science as a sub-area of Analytics. As with most terminology used in the data arena and most organisational units that exist in Data Functions, some people might argue that I have this the wrong way round and that Data Science should be preeminent. Reconciling different points of view is not my objective here; I think most people will agree that both work areas should be covered. This comment pertains to many other parts of this article. Here is a definition of the area (or rather the people who populate it):

[Data Scientists are people who are] au fait with exploiting data in many formats from Flat Files to Data Warehouses to Data Lakes. Such individuals possess equal abilities in the data technologies (such as Big Data) and how to derive benefit from these via statistical modelling. Data Scientists are often lapsed actual scientists.

Data and Analytics Dictionary entry: Data Scientist

Data Visualisation

There is an overlap here with both the Data Science team within the Analytics & Insight area and the Business Intelligence team in the Data Operations & Technology area. Many of the outputs of a good Data Function will include graphs, charts and other such exhibits. However, here would be located the real specialists, the people who would set standards for the presentation of visual data across the Data Function and be the most able in leveraging visualisation tools. A definition of Data Visualisation is as follows:

Techniques – such as graphs – for presenting complex information in a manner in which it can be more easily digested by human observers. Based on the concept that a picture paints a thousand words (or a dozen Excel sheets).

Data and Analytics Dictionary entry: Data Visualisation

Predictive Analytics

Gartner refer to four types of Analytics: descriptive, diagnostic, predictive and prescriptive analytics. In an article I referred to these as:

• What happened?
• Why did it happen?
• What is going to happen next?
• What should we be doing?

Data and Analytics Dictionary entry: Analytics

Predictive analytics is that element of the Analytics function that aims to predict the future, “What is going to happen next?” in the above list. This can be as simple as extrapolating data based on a trend line, or can involve more sophisticated techniques such as Time Series Analysis. As with most elements of the Data Function, there is overlap between Predictive Analytics and both Data Science and Business Intelligence.
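The simplest case mentioned above, extrapolating along a trend line, can be sketched in a few lines. This is a minimal illustration using an ordinary least-squares fit over equally spaced periods; the function names and the sales figures are made up, and real work would normally rely on a statistics library and pay attention to seasonality and uncertainty.

```python
from typing import List, Sequence, Tuple


def fit_trend_line(ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(ys: Sequence[float], periods_ahead: int) -> List[float]:
    """Project the fitted line forward by the requested number of periods."""
    slope, intercept = fit_trend_line(ys)
    n = len(ys)
    return [intercept + slope * (n + k) for k in range(periods_ahead)]


monthly_sales = [100.0, 110.0, 120.0, 130.0]  # made-up figures
forecast = extrapolate(monthly_sales, 2)
print(forecast)  # [140.0, 150.0]
```

Time Series Analysis extends this basic idea by modelling features such as seasonality and autocorrelation rather than assuming a straight line.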

“Skunkworks”

As with Data Strategy, state-of-the-art in Analytics & Insight will continue to evolve. This part of the Data Function will aim to keep current with the latest developments and to try out new techniques and new technologies that may later be adopted more widely by Data Function colleagues. The “skunkworks” team would be staffed by capable programmers / data scientists / statisticians.

Data Operations & Technology

It could be reasonably argued that this area is part of Data Management; I probably would not object too strongly to this suggestion. However, there are some benefits to considering it separately. This is the most IT-like of the areas considered here. It recognises that data technology (be it the Hadoop suite, Data Warehouse technology, or combinations of both) is different to many other forms of technology and needs its own specialists to focus on it. It is likely that the staff in this area will also collaborate closely with IT (see the final work area in Part II) or, in some cases, supervise work carried out by IT. As well as directly creating data capabilities, Data Operations & Technology staff would be active in the day-to-day running of these; again in collaboration with colleagues from both inside and outside of the Data Function.

Business Intelligence

There is no ISO definition, but I use this term as a catch-all to describe the transformation of raw data into information that can be disseminated to business people to support decision-making.

Data and Analytics Dictionary entry: Business Intelligence

This sub-area focusses on the relatively mature task of providing Business Intelligence solutions to organisations and working with IT to support and maintain these. Good BI tools work best on a sound underlying information architecture and so there would need to also be close collaboration with Data Infrastructure staff within Data Operations & Technology as well as colleagues from Data Architecture and also Analytics & Insight.

Regular Reporting

If BI provides interactive capabilities to support decision making, Regular Reporting is about the provision of specific key reports to relevant parties on a periodic basis; daily, weekly, monthly etc. These may be burst out to people’s e-mail accounts, provided at some central location, or both. While this is an area that is ideally automated, there will still be significant need for human monitoring and to support the inevitable changes.

Data Service

One of the things that any part of a Data Function will find itself doing on a very regular basis is crafting ad hoc data extracts for other departments, e.g. Marketing, Risk & Compliance etc. Sometimes such a need will be on an ongoing basis and a web-service or some other Data Integration mechanism will need to be set up. Rather than having this be something that is supported out of the general running costs of the Data Function, it makes sense to have a specific unit whose role is to fulfil these needs. Even so, there may be a need for queuing and prioritisation of requests.

Data Infrastructure

This relates to the physical architecture of the data landscape (for various flavours of logical architectures, see Data Architecture in Part II). While some of the tasks here may be carried out by (or in collaboration with) IT, the Data Infrastructure team will be expert at the care and feeding of Hadoop and related technologies and have experience in the fine-tuning of Data Warehouses and Data Marts.

SWAT Team

While (as both mentioned above and also covered in Part III of this article) some of the heavy lifting in data matters will be carried out by an organisation’s IT team and / or its external partners, the process for getting things done in this way can be slow, tortuous and expensive [6]. It is important that a Data Function has its own capability to make at least minor technological changes, or to build and deploy helpful data facilities without having to engage with the overall bureaucracy. The SWAT Team will have a small number of very capable and business-knowledgeable programmers, capable of quickly generating robust and functional code.

The second part of this piece will pick up where I have left off here and first consider Data Architecture.


Notes

 [1] I have added some functions that were absent in the previous one, mostly as they were not central to the points I was making in the previous article.

 [2] My trilogy on Formatting a Data / Information Strategy has the following parts:

 [3] While this theme runs through most of my writing, it is most explicitly referenced in the following three articles:

 [4] It should be noted that the relationship management described here is not the same as a Project Manager covering progress against plan. This is more of a two way conversation to ensure that the Data Function remains cognisant of stakeholder needs.

 [5] Though of course sometimes one-off analyses have value on an ongoing basis and so need to be productionised. In such cases the Analytics & Insight team would work with the Data Operations & Technology team to achieve this.

 [6] No citation needed.


# Hurricanes and Data Visualisation: Part II(b) – Ooops!

The first half of my planned thoughts on Hurricanes and Data Visualisation, Rainbow’s Gravity, was published back in September. Part two, Map Reading, joined it this month. In between, the first hurricane-centric article acquired an addendum, The Mona Lisa. With this post, the same has happened to the second article. Apparently you can’t keep a good hurricane story down.

One of our Hurricanes is missing

When I started writing about Hurricanes back in September of this year, it was in the aftermath of Harvey and Irma, both of which were safely far away from my native United Kingdom. Little did I think that in closing this mini-series Hurricane Ophelia (or at least the remnants of it) would be heading for these shores; I hope this is coincidence and not karma for me criticising the US National Weather Service’s diagrams!

As we batten down here, an odd occurrence was brought to my attention by Bill McKibben (@billmckibben), someone I connected with while working on this set of articles. Here is what he tweeted:

I am sure that inhabitants of both the Shetland Islands and the East Midlands will be breathing sighs of relief!

Clearly both the northward and eastward extent of Ophelia was outside of the scope of either the underlying model or the mapping software. A useful reminder to data professionals to ensure we set the boundaries of both modelling and visualisation work appropriately.

As an aside, this image is another for the Hall of Infamy, relying as it does on the less than helpful rainbow palette we critiqued all the way back in the first article.

I hope to be writing again soon – hurricanes allowing!


# Hurricanes and Data Visualisation: Part II – Map Reading

This is the second of two articles whose genesis was the nexus of hurricanes and data visualisation. The first article was Part I – Rainbow’s Gravity [1].

Introduction

In the first article in this mini-series we looked at alternative approaches to colour and how these could inform or mislead in data visualisations relating to weather events. In particular we discussed drawbacks of using a rainbow palette in such visualisations and some alternatives. Here we move into much more serious territory, how best to inform the public about what a specific hurricane will do next and the risks that it poses. It would not be an exaggeration to say that sometimes this area may be a matter of life and death. As with rainbow-coloured maps of weather events, some aspects of how the estimated future course of hurricanes are communicated and understood leave much to be desired.

The above diagram is called the cone of uncertainty of a hurricane. Cone of uncertainty sounds like an odd term. What does it mean? Let’s start by offering a historical perspective on hurricane modelling.

Paleomodelling

Well like any other type of weather prediction, determining the future direction and speed of a hurricane is not an exact science [2]. In the earlier days of hurricane modelling, Meteorologists used to employ statistical models, which were built based on detailed information about previous hurricanes, took as input many data points about the history of a current hurricane’s evolution and provided as output a prediction of what it could do in coming days.

There were a variety of statistical models, but the output of them was split into two types when used for hurricane prediction.

Type A

First, the model could have generated a single prediction (the centre of the hurricane will be at 32.3078° N, 64.7505° W tomorrow) and supplemented this with an error measure. The error measure would have been based on historical hurricane data and related to how far out prior predictions had been on average; this measure would have been in kilometres. It would have been typical to employ some fraction of the error measure to define a “circle of uncertainty” around the central prediction; 80% in the example directly above (compared to two thirds in the NWS exhibit at the start of the article).
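As a rough illustration of the Type A idea, the sketch below takes a list of historical forecast errors and picks the radius that would have enclosed a chosen share of them. The error values, the function name `uncertainty_radius_km` and the nearest-rank percentile rule are all assumptions I have made for the example; this is one plausible reading of "some fraction of the error measure", not the method of any particular forecasting agency.

```python
import math

def uncertainty_radius_km(historical_errors_km, fraction):
    """Radius that would have contained `fraction` of past forecast errors.

    Uses the nearest-rank percentile rule: sort the errors and take the
    smallest one that covers the requested share of the sample.
    """
    if not historical_errors_km:
        raise ValueError("need at least one historical error")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(historical_errors_km)
    rank = math.ceil(fraction * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical 24-hour forecast errors (km) from ten earlier hurricanes.
past_errors_km = [35, 48, 52, 60, 71, 80, 95, 110, 140, 210]

radius_80 = uncertainty_radius_km(past_errors_km, 0.80)
radius_two_thirds = uncertainty_radius_km(past_errors_km, 2 / 3)
print(radius_80, radius_two_thirds)  # 110 95
```

Note how the 80% circle is larger than the two-thirds circle: asking to capture more of the historical errors always pushes the radius outwards.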

Type B

Second, the model could have generated a large number of mini-predictions, each of which would have had a probability associated with it (e.g. the first two estimates of location could be that the centre of the hurricane is at 32.3078° N, 64.7505° W with a 5% chance, or a mile away at 32.3223° N, 64.7505° W with a 2% chance and so on). In general if you had picked the “centre of gravity” of the second type of output, it would have been analogous to the single prediction of the first type of output [3]. The spread of point predictions in the second method would have also been analogous to the error measure of the first. Drawing a circle around the centroid would have captured a percentage of the mini-predictions, once more 80% in the example immediately above and two thirds in the NWS chart, generating another “circle of uncertainty”.
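The Type B output can be sketched in the same spirit. The code below computes the probability-weighted "centre of gravity" of a handful of mini-predictions and then the smallest radius around it that captures a given share of the probability. Only the first two points come from the text; the remaining points, the function names and the flat-earth distance approximation (111 km per degree, scaled by the cosine of latitude for east-west distances) are assumptions for illustration.

```python
import math

def weighted_centroid(points):
    """Probability-weighted mean of (lat, lon, probability) triples."""
    total = sum(p for _, _, p in points)
    lat = sum(la * p for la, _, p in points) / total
    lon = sum(lo * p for _, lo, p in points) / total
    return lat, lon

def approx_distance_km(lat1, lon1, lat2, lon2):
    """Flat-earth distance; adequate only over short ranges."""
    km_per_degree = 111.0
    mean_lat = math.radians((lat1 + lat2) / 2)
    dy = (lat2 - lat1) * km_per_degree
    dx = (lon2 - lon1) * km_per_degree * math.cos(mean_lat)
    return math.hypot(dx, dy)

def enclosing_radius_km(points, coverage):
    """Smallest radius around the centroid holding `coverage` of the probability."""
    c_lat, c_lon = weighted_centroid(points)
    total = sum(p for _, _, p in points)
    by_distance = sorted(
        (approx_distance_km(c_lat, c_lon, la, lo), p) for la, lo, p in points
    )
    captured = 0.0
    for distance, p in by_distance:
        captured += p
        if captured >= coverage * total - 1e-12:  # tolerate rounding
            return distance
    return by_distance[-1][0]

# First two points are from the text; the rest are hypothetical.
mini_predictions = [
    (32.3078, -64.7505, 0.05),
    (32.3223, -64.7505, 0.02),
    (32.2900, -64.7800, 0.04),
    (32.3500, -64.7000, 0.03),
    (32.2500, -64.8200, 0.01),
]

print(weighted_centroid(mini_predictions))
print(enclosing_radius_km(mini_predictions, 0.80))
```

As a sanity check, `approx_distance_km(32.3078, -64.7505, 32.3223, -64.7505)` comes out at roughly 1.6 km, which agrees with the "a mile away" in the example above.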

Here comes the Science

That was then of course, nowadays the statistical element of hurricane models is less significant. With increased processing power and the ability to store and manipulate vast amounts of data, most hurricane models instead rely upon scientific models; let’s call this Type C.

Type C

As the air is a fluid [4], its behaviour falls into the area of study known as fluid dynamics. If we treat the atmosphere as being viscous, then the appropriate equation governing fluid dynamics is the Navier-Stokes equation, which is itself derived from the Cauchy Momentum equation:

$\displaystyle\frac{\partial}{\partial t}(\rho \boldsymbol{u}) + \nabla \cdot (\rho \boldsymbol{u}\otimes \boldsymbol{u})=-\nabla\cdot p\boldsymbol{I}+\nabla\cdot\boldsymbol{\tau} + \rho\boldsymbol{g}$

If viscosity is taken as zero (as a simplification), instead the Euler equations apply:

$\displaystyle\left\{\begin{array}{lr}\displaystyle\frac{\partial\boldsymbol{u}}{\partial t} + \nabla \cdot (\boldsymbol{u}\otimes \boldsymbol{u} + w\boldsymbol{I}) = \boldsymbol{g} \\ \\ \nabla \cdot \boldsymbol{u}= 0\end{array}\right.$

The reader may be glad to know that I don’t propose to talk about any of the above equations any further.

To get back to the model, in general the atmosphere will be split into a three dimensional grid (the atmosphere has height as well). The current temperature, pressure, moisture content etc. are fed in (or sometimes interpolated) at each point and equations such as the ones above are used to determine the evolution of fluid flow at a given grid element. Of course – as is typical in such situations – approximations of the equations are used and there is some flexibility over which approximations to employ. Also, there may be uncertainty about the input parameters, so statistics does not disappear entirely. Leaving this to one side, how the atmospheric conditions change over time at each grid point rolls up to provide a predictive basis for what a hurricane will do next.

Although the methods are very different, the output of these scientific models will be pretty similar, qualitatively, to the Type A statistical model above. In particular, uncertainty will be delineated based on how well the model performed on previous occasions. For example, what was the average difference between prediction and fact after 6 hours, 12 hours and so on? Again, the uncertainty will have similar characteristics to that of Type A above.

In all of the cases discussed above, we have a central prediction (which may be an average of several predictions as per Type B) and a circular distribution around this indicating uncertainty. Let’s consider how these predictions might change as we move into the future.

If today is Monday, then there will be some uncertainty about what the hurricane does on Tuesday. For Wednesday, the uncertainty will be greater than for Tuesday (the “circle of uncertainty” will have grown) and so on. With the Type A and Type C outputs, the error measure will increase with time. With the Type B output, if the model spits out 100 possible locations for the hurricane on a specific day (complete with the likelihood of each of these occurring), then these will be fairly close together on Tuesday and further apart on Wednesday. In all cases, uncertainty about the location of the hurricane becomes smeared out over time, resulting in a larger area where it is likely to be located and a bigger “circle of uncertainty”.

This is where the circles of uncertainty combine to become a cone of uncertainty. For the same example, on each day, the meteorologists will plot the central prediction for the hurricane’s location and then draw a circle centered on this which captures the uncertainty of the prediction. For the same reason as stated above, the size of the circle will (in general) increase with time; Wednesday’s circle will be bigger than Tuesday’s. Also each day’s central prediction will be in a different place from the previous day’s as the hurricane moves along. Joining up all of these circles gives us the cone of uncertainty [5].
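The construction just described can be sketched in a few lines. The code below builds one circle per forecast day for a hurricane moving at constant speed and direction, with the radius growing by a fixed amount each day. The starting point, velocity, radii and the assumption of linear growth are all invented for the example; real error growth need not be linear.

```python
from dataclasses import dataclass

@dataclass
class ForecastCircle:
    day: int            # days after the forecast was issued
    centre_km: tuple    # (x, y) on a flat map, in km
    radius_km: float    # radius of the circle of uncertainty

def build_cone(start_km, velocity_km_per_day, first_radius_km,
               growth_km_per_day, days):
    """One circle per forecast day; joining them up traces out the cone."""
    circles = []
    for day in range(1, days + 1):
        centre = (start_km[0] + velocity_km_per_day[0] * day,
                  start_km[1] + velocity_km_per_day[1] * day)
        radius = first_radius_km + growth_km_per_day * (day - 1)
        circles.append(ForecastCircle(day, centre, radius))
    return circles

# Hypothetical hurricane: 300 km east and 100 km north per day.
cone = build_cone(start_km=(0, 0), velocity_km_per_day=(300, 100),
                  first_radius_km=60, growth_km_per_day=50, days=5)
for circle in cone:
    print(circle.day, circle.centre_km, circle.radius_km)
```

The sketch only stores the daily circles; drawing the cone itself means tracing the outline that encloses all of them.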

If the central predictions imply that a hurricane is moving with constant speed and direction, then its cone of uncertainty would look something like this:

In this diagram, broadly speaking, on each day, there is a 67% probability that the centre of the hurricane will be found within the relevant circle that makes up the cone of uncertainty. We will explore the implications of the underlined phrase in the next section.

Of course hurricanes don’t move in a single direction at an unvarying pace (see the actual NWS exhibit above as opposed to my idealised rendition), so part of the purpose of the cone of uncertainty diagram is to elucidate this.

The Central Issue

So hopefully the intent of the NWS chart at the beginning of this article is now clearer. What is the problem with it? Well I’ll go back to the words I highlighted a couple of paragraphs back:

There is a 67% probability that the centre of the hurricane will be found within the relevant circle that makes up the cone of uncertainty

So the cone helps us with where the centre of the hurricane may be. A reasonable question is, what about the rest of the hurricane?

For ease of reference, here is the NWS exhibit again:

Let’s first of all pause to work out how big some of the NWS “circles of uncertainty” are. To do this we can note that the grid lines (though not labelled) are clearly at 5° intervals. The distance between two lines of latitude (ones drawn parallel to the equator) that are 1° apart from each other is a relatively consistent number; approximately 111 km [6]. This means that the lines of latitude on the page are around 555 km apart. Using this as a reference, the “circle of uncertainty” labelled “8 PM Sat” has a diameter of about 420 km (260 miles).
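The arithmetic behind the grid spacing is simple enough to check in code. The sketch below uses the approximate figure of 111 km per degree of latitude quoted above and, as an assumption on my part, a spherical-Earth cosine rule for the spacing of lines of longitude; the function names are hypothetical.

```python
import math

KM_PER_DEGREE_LATITUDE = 111.0  # approximate figure quoted in the text

def latitude_spacing_km(degrees):
    """North-south distance covered by `degrees` of latitude."""
    return degrees * KM_PER_DEGREE_LATITUDE

def longitude_spacing_km(degrees, at_latitude_deg):
    """East-west distance covered by `degrees` of longitude.

    Lines of longitude converge towards the poles, so the spacing is
    scaled by the cosine of the latitude (spherical-Earth assumption).
    """
    return (degrees * KM_PER_DEGREE_LATITUDE
            * math.cos(math.radians(at_latitude_deg)))

print(latitude_spacing_km(5))        # 555.0
print(longitude_spacing_km(5, 0))    # 555.0 at the equator
print(longitude_spacing_km(5, 25))   # noticeably less at 25 degrees north
```

This is why the lines of latitude, rather than those of longitude, make the better ruler when reading distances off such a chart.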

Let’s now consider how big Hurricane Irma was [7].

Aside: I’d be remiss if I didn’t point out here that RMS have selected what seems to me to be a pretty good colour palette in the chart above.

Well there is no defined sharp edge of a hurricane, rather the speed of winds tails off as may be seen in the above diagram. In order to get some sense of the size of Irma, I’ll use the dashed line in the chart that indicates where wind speeds drop below that classified as a tropical storm (65 kmph or 40 mph [8]). This area is not uniform, but measures around 580 km (360 miles) wide.

There are two issues here, which are illustrated in the above diagram.

Issue A

Irma was actually bigger [9] than at least some of the “circles of uncertainty”. A cursory glance at the NWS exhibit would probably give the sense that the cone of uncertainty represents the extent of the storm; it doesn’t. In our example, Irma extends 80 km beyond the “circle of uncertainty” we measured above. If you thought you were safe because you were 50 km from the edge of the cone, then this was probably an erroneous conclusion.

Issue B

Even more pernicious, because each “circle of uncertainty” provides an area within which the centre of the hurricane could be situated, this includes cases where the centre of the hurricane sits on the circumference of the “circle of uncertainty”. This, together with the size of the storm, means that someone 290 km from the edge of the “circle of uncertainty” could suffer 65 kmph (40 mph) winds. Again, based on the diagram, if you felt that you were guaranteed to be OK if you were 250 km away from the edge of the cone, you could get a nasty surprise.
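The two figures quoted in Issue A and Issue B follow directly from the measurements above, as the sketch below shows. It treats both the storm and the circle as perfect circles, which is a simplification (the wind field is not uniform), and the function names are my own.

```python
def overhang_km(storm_width_km, circle_diameter_km):
    """Issue A: how far the storm reaches beyond the circle when its
    centre sits at the centre of the circle."""
    return max(0.0, (storm_width_km - circle_diameter_km) / 2)

def worst_case_reach_km(storm_width_km):
    """Issue B: how far the storm reaches beyond the circle when its
    centre sits on the circle's circumference."""
    return storm_width_km / 2

storm_width_km = 580      # tropical-storm-force wind field, from the text
circle_diameter_km = 420  # the "8 PM Sat" circle, from the text

print(overhang_km(storm_width_km, circle_diameter_km))  # 80.0
print(worst_case_reach_km(storm_width_km))              # 290.0
```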

These are not academic distinctions; the real danger that hurricane cones were being misinterpreted led the NWS to start labelling their charts with “This cone DOES NOT REPRESENT THE SIZE OF THE STORM!!” [10].

Even Florida senator Marco Rubio got in on the act, tweeting:

When you need a politician to help you avoid misinterpreting a data visualisation, you know that there is something amiss.

In Summary

The last thing I want to do is to appear critical of the men and women of the US National Weather Service. I’m sure that they do a fine job. If anything, the issues we have been dissecting here demonstrate that even highly expert people with a strong motivation to communicate clearly can still find it tough to select the right visual metaphor for a data visualisation; particularly when there is a diverse audience consuming the results. It also doesn’t help that there are many degrees of uncertainty here: where might the centre of the storm be? how big might the storm be? how powerful might the storm be? in which direction might the storm move? Layering all of these onto a single exhibit while still rendering it both legible and of some utility to the general public is not a trivial exercise.

The cone of uncertainty is a precise chart, so long as the reader understands what it is showing and what it is not. Perhaps the issue lies more in the eye of the beholder. However, having to annotate your charts to explain what they are not is never a good look on anyone. The NWS are clearly aware of the issues; I look forward to viewing whatever creative solution they come up with later this hurricane season.

Acknowledgements

I would like to thank Dr Steve Smith, Head of Catastrophic Risk at Fractal Industries, for reviewing this piece and putting me right on some elements of modern hurricane prediction. I would also like to thank my friend and former colleague, Dr Raveem Ismail, also of Fractal Industries, for introducing me to Steve. Despite the input of these two experts, responsibility for any errors or omissions remains mine alone.

Notes

[1] I also squeezed Part I(b) – The Mona Lisa in between the two articles I originally planned.

[2] I don’t mean to imply by this that the estimation process is unscientific of course. Indeed, as we will see later, hurricane prediction is becoming more scientific all the time.

[3] If both methods were employed in parallel, it would not be too surprising if their central predictions were close to each other.

[4] A gas or a liquid.

[5] A shape traced out by a particle traveling with constant speed and with a circle of increasing radius inscribed around it would be a cone.

[6] The distance between lines of longitude varies between 111 km at the equator and 0 km at either pole. This is because lines of longitude are great circles (or meridians) that meet at the poles. Lines of latitude are parallel circles (parallels) progressing up and down the globe from the equator.

[7] At a point in time of course. Hurricanes change in size over time as well as in their direction/speed of travel and energy.

[8] I am rounding here. The actual threshold values are 63 kmph and 39 mph.

[9] Using the definition of size that we have adopted above.

[10] Their use of capitals, bold and multiple exclamation marks.

# A Nobel Laureate’s views on creating Meaning from Data

Praise for the Praiseworthy

Today the recipients of the 2017 Nobel Prize for Chemistry were announced [1]. I was delighted to learn that one of the three new Laureates was Richard Henderson, former Director of the UK Medical Research Council’s Laboratory of Molecular Biology in Cambridge; an institute universally known as the LMB. Richard becomes the fifteenth Nobel Prize winner who worked at the LMB. The fourteenth was Venkatraman Ramakrishnan in 2009. Venki was joint Head of Structural Studies at the LMB, prior to becoming President of the Royal Society [2].

I have mentioned the LMB in these pages before [3]. In my earlier article, which focussed on Data Visualisation in science, I also provided a potted history of X-ray crystallography, which included the following paragraph:

Today, X-ray crystallography is one of many tools available to the structural biologist with other approaches including Nuclear Magnetic Resonance Spectroscopy, Electron Microscopy and a range of biophysical techniques.

I have highlighted the term Electron Microscopy above and it was for his immense contributions to the field of Cryo-electron Microscopy (Cryo-EM) that Richard was awarded his Nobel Prize; more on this shortly.

First of all some disclosure. The LMB is also my wife’s alma mater; she received her PhD for work she did there between 2010 and 2014. Richard was one of two people who examined her as she defended her thesis [4]. As Venki initially interviewed her for the role, the bookends of my wife’s time at the LMB were formed by two Nobel laureates; a notable symmetry.

The press release about Richard’s Nobel Prize includes the following text:

The Nobel Prize in Chemistry 2017 is awarded to Jacques Dubochet, Joachim Frank and Richard Henderson for the development of cryo-electron microscopy, which both simplifies and improves the imaging of biomolecules. This method has moved biochemistry into a new era.

[…]

Electron microscopes were long believed to only be suitable for imaging dead matter, because the powerful electron beam destroys biological material. But in 1990, Richard Henderson succeeded in using an electron microscope to generate a three-dimensional image of a protein at atomic resolution. This breakthrough proved the technology’s potential.

Electron microscopes [5] work by passing a beam of electrons through a thin film of the substance being studied. The electrons interact with the constituents of the sample and go on to form an image which captures information about these interactions (nowadays mostly on an electronic detector of some sort). Because the wavelength of electrons [6] is so much shorter than that of light [7], much finer detail can be obtained using electron microscopy than with light microscopy. Indeed electron microscopes can be used to “see” structures at the atomic scale. Of course it is not quite as simple as printing out the image snapped by your smartphone. The data obtained from electron microscopy needs to be interpreted by software; again we will come back to this point later.

Cryo-EM refers to how the sample being examined is treated prior to (and during) microscopy. Here a water-suspended sample of the substance is frozen (to put it mildly) in liquid ethane to temperatures around -183 °C and maintained at that temperature during the scanning procedure. The idea here is to protect the sample from the damaging effects of the cathode rays [8] it is subjected to during microscopy.

A Matter of Interpretation

On occasion, I write articles which are entirely scientific or mathematical in nature, but more frequently I bring observations from these fields back into my own domain, that of data, information and insight. This piece will follow the more typical course. To do this, I will rely upon a perspective that Richard Henderson wrote for the Proceedings of the National Academy of Sciences back in 2013 [9].

Here we come back to the interpretation of Cryo-EM data in order to form an image. In the article, Richard refers to:

[Some researchers] who simply record images, follow an established (or sometimes a novel or inventive [10]) protocol for 3D map calculation, and then boldly interpret and publish their map without any further checks or attempts to validate the result. Ten years ago, when the field was in its infancy, referees would simply have to accept the research results reported in manuscripts at face value. The researchers had recorded images, carried out iterative computer processing, and obtained a map that converged, but had no way of knowing whether it had converged to the true structure or some complete artifact. There were no validation tests, only an instinct about whether a particular map described in the publication looked right or wrong.

The title of Richard’s piece includes the phrase “Einstein from noise”. This refers to an article published in the Journal of Structural Biology in 2009 [11]. Here the authors provided pure white noise (i.e. a random set of black and white points) as the input to an Algorithm which is intended to produce EM maps and – after thousands of iterations – ended up with the following iconic image:

Richard lists occurrences of meaning being erroneously drawn from EM data from his own experience of reviewing draft journal articles and cautions scientists to hold themselves to the highest standards in this area, laying out meticulous guidelines for how the creation of EM images should be approached, checked and rechecked.

The obvious correlation here is to areas of Data Science such as Machine Learning. Here again algorithms are applied iteratively to data sets with the objective of discerning meaning. Here too conscious or unconscious bias on behalf of the people involved can lead to the business equivalent of Einstein ex machina. It is instructive to see the level of rigour which a Nobel Laureate views as appropriate in an area such as the algorithmic processing of data. Constantly questioning your results and validating that what emerges makes sense and is defensible is just one part of what can lead to gaining a Nobel Prize [12]. The opposite approach will invariably lead to disappointment in either academia or in business.
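To make the danger concrete, here is a toy, one-dimensional analogue of “Einstein from noise”. It is emphatically not the algorithm from the paper: the template shape, the signal length, the number of signals and the function names are all assumptions for illustration. Each pure-noise signal is shifted to whichever position makes it look most like a template the analyst expects to see, and the shifted signals are then averaged.

```python
import math
import random

def circular_shift(signal, shift):
    """Rotate a list to the right by `shift` positions."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    return dot(da, db) / math.sqrt(dot(da, da) * dot(db, db))

def average_after_alignment(template, n_signals, rng):
    """Align pure-noise signals to the template, then average them."""
    length = len(template)
    total = [0.0] * length
    for _ in range(n_signals):
        noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
        # Pick the shift that makes this noise look most like the template.
        best = max(range(length),
                   key=lambda s: dot(circular_shift(noise, s), template))
        aligned = circular_shift(noise, best)
        total = [t + a for t, a in zip(total, aligned)]
    return [t / n_signals for t in total]

rng = random.Random(0)
# Hypothetical "structure" the analyst expects to find: a single bump.
template = [math.exp(-((i - 16) ** 2) / 8.0) for i in range(32)]

biased = average_after_alignment(template, 400, rng)
print(round(pearson(biased, template), 2))
```

The input contains no structure whatsoever, yet the average ends up strongly correlated with the template, simply because the alignment step was steered by what we hoped to find. Any iterative fitting process that is guided by the expected answer deserves the same suspicion.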

Having introduced a strong cautionary note, I’d like to end this article with a much more positive tone by extending my warm congratulations to Richard both for his well-deserved achievement, but more importantly for his unwavering commitment to rolling back the bounds of human knowledge.

If you are interested in learning more about Cryo-Electron Microscopy, the following LMB video, which features Richard Henderson and colleagues, may be of interest:

Notes

[1] The Nobel Prize in Chemistry 2017.

[2] Both Richard and Venki remain Group Leaders at the LMB and are actively involved in new scientific research.

[3] Data Visualisation – A Scientific Treatment.

[4] Her thesis was passed without correction – an uncommon occurrence – and her contribution to the field was described as significant in the formal documentation.

[5] More precisely this description applies to Transmission Electron Microscopes, which are the type of kit used in Cryo-EM.

[6] The wave-particle duality that readers may be familiar with when speaking about light waves / photons also applies to all sub-atomic particles. Electrons have both a wave and a particle nature and so, in particular, have wavelengths.

[7] This is still the case even if ultraviolet or more energetic light is used instead of visible light.

[8] Cathode rays are of course just beams of electrons.

[9] Henderson, R. (2013). Avoiding the pitfalls of single particle cryo-electron microscopy: Einstein from noise. PNAS (this opens a PDF).

[10] This is an example of Richard being very, very polite.

[11] Shatsky, M., Hall, R.J., Brenner, S.E., Glaeser, R.M. (2009). A method for the alignment of heterogeneous macromolecules from electron microscopy. JSB (this article is behind a paywall).

[12] There are a couple of other things you need to do as well I believe.

# The revised and expanded Data and Analytics Dictionary

Since its launch in August of this year, the peterjamesthomas.com Data and Analytics Dictionary has received a welcome amount of attention with various people on different social media platforms praising its usefulness, particularly as an introduction to the area. A number of people have made helpful suggestions for new entries or improvements to existing ones. I have also been rounding out the content with some more terms relating to each of Data Governance, Big Data and Data Warehousing. As a result, The Dictionary now has over 80 main entries (not including ones that simply refer the reader to another entry, such as Linear Regression, which redirects to Model).

The most recently added entries are as follows:

It is my intention to continue to revise this resource. Adding some more detail about Machine Learning and related areas is probably the next focus.

As ever, ideas for what to include next would be more than welcome (any suggestions used will also be acknowledged).

# Ever tried? Ever failed?

Regular readers may recall my March 2017 article [1] which started by exploring failure rates of Big Data implementations. In this, amongst other facts, we learnt that between a half and two-thirds of a range of major business transformations fail to deliver lasting value [2]. After recently reading a pair of Harvard Business Review articles from back in 2016 [3], I can also add Analytics. Here is a salient quote from the second article:

Only a little more than one in three of the three-dozen companies that we studied met the objectives of their analytics initiatives over the long term. Clearly, driving major innovations with analytics was harder than many executives expected.

Once more we see what appears to be a fundamental constant emerge, around 60% of most major business endeavours cannot be classified as unqualified successes. I feel that we should come up with a name for this figure and ideally use a Greek letter to denote it, maybe φ which is as close to “F” for failure as the Greek alphabet gets [4].

The authors based their study on 20 years of research spanning 36 client companies. They drew a surprising conclusion:

Efforts to adopt analytics upset the balance of power in the C-suite, and this shift often had a negative impact on analytics initiatives.

As ever (and as indeed I concluded in my previous article) reasons for failure have little to do with technology and everything to do with humans and how they interact with each other. This is one of the reasons I get incensed by Analytics teams saying things like “the business didn’t know what they wanted” or “adoption wasn’t strong enough” when their programmes fail.

For a start, Analytics is a business discipline and the Analytics team should view themselves as a business team. Second, to me it is pretty clear that a core activity for such teams is working with stakeholders to form an appreciation of their products or services, their competitive landscape, the markets they operate in, their day-to-day challenges and, on top of all this, what they want from data; even if this requires some teasing out (e.g. spending time shadowing people or using mock-ups or prototypes to show the art of the possible). Also Analytics teams must take accountability for driving adoption themselves, rather than assuming that someone else will deal with this, or worse, that “if we build it, they will come” [5].

The C-suite aspect is tougher, but in my own work I try to spend time with Executives to understand their world views and to make sure I align what I am doing with their priorities. Building relationships here can help to reduce the likelihood of Executive strife impacting on an Analytics programme. However, I do also agree with the authors that the CEO has a key role to play here in ensuring that his or her team embrace becoming a data-driven organisation, even if this means changes in roles and responsibilities for some.

I’d encourage readers to take a look at the original HBR material; it contains a number of other pertinent observations above and beyond the ones I have highlighted here. When either looking to prevent issues from arising, or trying to mitigate them once they do, my article, 20 Risks that Beset Data Programmes, can also be a useful reference.

Beyond this, my simplest advice is to always remember the human angle in any Analytics programme. This is more likely to determine success or failure than technical excellence, or embracing the latest and greatest Data Visualisation or Analysis tools [6].

Notes

[1] Ideas for avoiding Big Data failures and for dealing with them if they happen. This also includes a quote from Samuel Beckett, which provided the inspiration for the title of this article.

[2] The specifics were, Big Data implementations, Data Warehousing, ERP systems and Mergers and Acquisitions; please see the earlier article for the source of the figures. To this you could add any number of technology-based programmes, such as CRM implementations, Digital Transformation and even outsourcing. The main message is doing some things successfully is hard.

[3] The articles are: — by Chris McShea, Dan Oakley and Chris Mazzei, all from EY.

[4] No doubt φ can be shown to be a transcendental number that can be linked to π, e and i by some elegant formula. Rather annoyingly, φ is already the label we attach to the Golden Ratio, or (1 + √5)/2, but maybe I can repurpose this as I did π back in A quantised approach to formal group interactions of hominidae (size > 2).

[5] Also see Ideas for avoiding Big Data failures and for dealing with them if they happen for the provenance of this misquote.

[6] See also: A bad workman blames his [Business Intelligence] tools, which is as pertinent today as when I wrote it back in 2009.
