Data Science Challenges – It’s Deja Vu all over again!

The late Yogi Berra

The rather famous tautology, “It’s déjà vu all over again”, has of course been ascribed to that darling of malapropisms, baseball catcher Yogi Berra [1]. The phrase came to mind for me today when coming across the following exhibit:

Business Over Broadway - Kaggle Survey

© Business Over Broadway (2018). Based on Kaggle’s State of Data Science Survey 2017 (Sample size: 10,153).

The text in the above exhibit is not that clear [2], so here are the 20 top challenges [3] faced by those running Data Science teams in human-readable form:

# Challenge Cited by
1 Dirty Data 35.9%
2 Lack of Data Science talent in the organization 30.2%
3 Company politics / Lack of management/financial support for a Data Science team 27.0%
4 The lack of a clear question to be answering or a clear direction to go with available data 22.1%
5 Unavailability of/difficult access to data 22.0%
6 Data Science results not used by business decision makers 17.7%
7 Explaining Data Science to others 16.0%
8 Privacy issues 14.4%
9 Lack of significant domain expert input 14.2%
10 Organization is small and cannot afford a Data Science team 13.0%
11 Team using multiple ad hoc development environments such as Python/R/Java etc. 12.7%
12 Limitations of tools 12.0%
13 Need to coordinate with IT 11.8%
14 Maintaining reasonable expectations about the potential impact of Data Science projects 11.5%
15 Inability to integrate findings into organization’s decision-making process 9.8%
16 Lack of funds to buy useful datasets from external sources 9.6%
17 Difficulties in deployment/scoring 8.6%
18 Scaling Data Science solution up to full database 8.4%
19 Limitations in the state of the art in machine learning 7.7%
20 Did not instrument data useful for scientific analysis and decision-making 6.5%
21 I prefer not to say 4.8%
22 Other 2.9%

The table above is a transcription of a transcription, so it would be remarkable if no Data Quality issues had crept in. However, let’s assume that the figures are robust enough for our purposes. Of course, the people surveyed will have reported multiple issues, so the percentages above are not additive. Nevertheless, there are some very obvious comments to be made (some of the above items are pertinent to more than one of the points I would like to make):

  • Data Quality / Availability remain major issues (1, 5 and 8)

    It is indeed true that Machine Learning can be quite good at dealing with some types of bad or missing data. But no technology or approach is going to be able to paper over all of the cracks if your data is essentially incomplete and of poor quality. This point (together with some others below) speaks to the need to not approach Data Science on a stand-alone basis, but as part of a more holistic approach to data matters [4].
     

  • The Human angle and a focus on Culture are imperative (3, 6, 7, 14 and 15)

    Findings are one thing; using these to take action is quite another. At the end of the day, most ventures are successful or fail because of people; the people conducting the venture, the people receiving its intended benefits and so on. Ignore this dimension of data work (or any type of work) at your peril [5].
     

  • Business Questions and Business Involvement matter (4, 6, 9 and 15)

    While in some circumstances the data can indeed “speak for itself”, it makes a lot more sense for Data Scientists to partner with business colleagues to both get direction and to help ensure that their findings lead to action [6].
     

  • Tools & Technology typically Trumped (11, 12 and 18)

    These first appear outside of the Top 10 (and 11 is a bit dubious to include here – it relates more to a proliferation of tools than to issues with any of them). I would never say that tools and technology are unimportant, but they are typically much less important than other considerations [7].
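
As a quick sanity check on the groupings above, here is a minimal sketch in Python. The percentages are copied from the table; the names cited_by, themes and theme_summary, and the short theme labels, are my own illustrative choices rather than anything from the survey. It shows that the percentages sum to far more than 100% (which is why they cannot be added together) and reports, for each theme, the best-ranked item and the range of shares.

```python
# Rank -> percentage of respondents citing the challenge (from the table above).
cited_by = {
    1: 35.9, 2: 30.2, 3: 27.0, 4: 22.1, 5: 22.0, 6: 17.7, 7: 16.0,
    8: 14.4, 9: 14.2, 10: 13.0, 11: 12.7, 12: 12.0, 13: 11.8, 14: 11.5,
    15: 9.8, 16: 9.6, 17: 8.6, 18: 8.4, 19: 7.7, 20: 6.5, 21: 4.8, 22: 2.9,
}

# Theme -> item numbers, as grouped in the bullets above (labels are illustrative).
themes = {
    "Data Quality / Availability": [1, 5, 8],
    "Human angle / Culture": [3, 6, 7, 14, 15],
    "Business Questions / Involvement": [4, 6, 9, 15],
    "Tools & Technology": [11, 12, 18],
}

# Respondents could cite several challenges, so the total exceeds 100%.
total = round(sum(cited_by.values()), 1)


def theme_summary(ranks):
    """Summarise a theme without adding its (non-additive) percentages."""
    shares = [cited_by[r] for r in ranks]
    return {
        "best_rank": min(ranks),
        "highest_share": max(shares),
        "lowest_share": min(shares),
    }


summaries = {name: theme_summary(ranks) for name, ranks in themes.items()}

print(f"Sum of all percentages: {total}%")
for name, s in summaries.items():
    print(f"{name}: best rank {s['best_rank']}, "
          f"shares from {s['lowest_share']}% to {s['highest_share']}%")
```

Note that the sketch deliberately avoids summing shares within a theme: because one respondent may have cited several items in the same group, such a sum would overstate how many people raised that theme.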

The overriding point is of course that – much as I noted recently in Convergent Evolution – there is little new under the Sun. A survey of Business Intelligence / Data Warehousing professionals back in 2010 would have generated something very like the list above. A survey of EIS [8] professionals back in 2000 would have done the same.

The important things to do – regardless of the technologies and approaches employed – are to:

  1. Understand what questions are key to the running of an organisation [9]
     
  2. Determine what data is available to support decisions in these key areas
     
  3. Ensure that the data is in a “good enough” state, appropriately consolidated / made consistent, augmented / corrected by any useful external data and made available to the right people in a timely manner
     
  4. Focus on the human aspects of acting on what data is telling us and how to use data outputs to drive positive actions

Here too, little is new under the Sun. I have been referring to essentially these same four pillars of good practice since the mid-2000s. Some of our technological advances since then have been amazing. The prospect of leveraging the power of both Data Science and Artificial Intelligence in a business context is very exciting. But to truly succeed with these newer approaches, it helps to recall the eternal verities that have always underpinned good data-centric work [10]. The survey above makes this point crystal clear.

A final corollary to this observation is something I covered in A truth universally acknowledged…. The replies to the Kaggle survey highlight the fact that, much as the conductor of an orchestra does not need to be able to play the violin to a virtuoso level, people leading Data Science teams (and broader Data Functions) need a set of rounded skills, ones honed to address the types of issues appearing in the exhibits above. The skill-set that makes for an excellent Data Scientist does not necessarily help so much with many of the less technical issues that will determine the success or failure of Data Science teams.
 


 
Notes

 
[1]
 
Other Yogi-isms included, “Always go to other people’s funerals; otherwise they won’t go to yours”, “You can observe a lot by watching” and “If you can’t imitate him, don’t copy him”.
 
[2]
 
A Data Visualisation challenge to include that much text, I realise. I think I might have been tempted to come up with pithier categories to aid legibility.
 
[3]
 
Ignoring “I prefer not to say” and “Other”.
 
[4]
 
As laid out in my many articles about the importance of Cultural Transformation.
 
[5]
 
See: Building Momentum – How to begin becoming a Data-driven Organisation.
 
[6]
 
I make precisely this point in my recent interview for Venturi Voice (starting just after 31:38).
 
[7]
 
I make this point most forcibly back in: A bad workman blames his [Business Intelligence] tools. The technology may be different, but the learnings are just as relevant today.
 
[8]
 
Executive Information Systems for those of tender years.
 
[9]
 
Machine learning techniques can clearly help here, but only if in concert with dialogue with people actually on the front-line and leading business areas.
 
[10]
 
In your search for such eternal verities, you could do much worse than starting with: 20 Risks that Beset Data Programmes.

 


From: peterjamesthomas.com, home of The Data and Analytics Dictionary, The Anatomy of a Data Function and A Brief History of Databases

 
