Homegeneraltechnologytext analytics

Accuracy

20 July 20093 November 2014 Peter James Thomas business, business intelligence, data quality, google, social media, technology, text analytics, twitter cricket, jeff goldblum, michael jackson, the ashes, yes prime minister

As might be inferred from my last post, certain sporting matters have been on my mind of late. However, as is becoming rather a theme on this blog, these have also generated some business-related thoughts.

Introduction

On Friday evening, the Australian cricket team finished the second day of the second Test Match on a score of 152 runs for the loss of 8 (out of 10) first innings wickets. This was still 269 runs behind the England team‘s total of 425.

In scanning what I realise must have been a hastily assembled end-of-day report on the web-site of one of the UK’s leading quality newspapers, a couple are glaring errors stood out. First, the Australian number 4 batsman Michael Hussey was described as having “played-on” to a delivery from England’s shy-and-retiring Andrew Flintoff. Second, the journalist wrote that Australia’s number six batsman, Marcus North, had been “clean-bowled” by James Anderson.

I appreciate that not all readers of this blog will be cricket aficionados and also that the mysteries of this most complex of games are unlikely to be made plain by a few brief words from me. However, “played on” means that the ball has hit the batsman’s bat and deflected to break his wicket (or her wicket – as I feel I should mention as a staunch supporter of the all-conquering England Women’s team, a group that I ended up meeting at a motorway service station just recently).

By contrast, “clean-bowled” means that the ball broke the batsman’s wicket without hitting anything else. If you are interested in learning more about the arcane rules of cricket (and let’s face it, how could you not be interested) then I suggest taking a quick look here. The reason for me bothering to go into this level of detail is that, having watched the two dismissals live myself, I immediately thought that the journalist was wrong in both cases.

It may be argued that the camera sometimes lies, but the cricinfo.com caption (whence these images are drawn) hardly ever does. The following two photographs show what actually happened:

Michael Hussey leaves one and is bowled, England v Australia, 2nd Test, Lord's, 2nd day, July 17, 2009

Marcus North drags James Anderson into his stumps, England v Australia, 2nd Test, Lord's, 2nd day, July 17, 2009

As hopefully many readers will be able to ascertain, Hussey raised his bat aloft, a defensive technique employed to avoid edging the ball to surrounding fielders, but misjudged its direction. It would be hard to “play on” from a position such as he adopted. The ball arced in towards him and clipped the top of his wicket. So, in fact he was the one who was “clean-bowled”; a dismissal that was qualified by him having not attempted to play a stroke.

North on the other hand had been at the wicket for some time and had already faced 13 balls without scoring. Perhaps in frustration at this, he played an overly-ambitious attacking shot (one not a million miles from a baseball swing), the ball hit the under-edge of his horizontal bat and deflected down into his wicket. So it was North, not Hussey, who “played on” on this occasion.

So, aside from saying that Hussey had been adjudged out “handled the ball” and North dismissed “obstructed the field” (two of the ten ways in which a batsman’s innings can end – see here for a full explanation), the journalist in question could not have been more wrong.

As I said, the piece was no doubt composed quickly in order to “go to press” shortly after play had stopped for the day. Maybe these are minor slips, but surely the core competency of a sports journalist is to record what happened accurately. If they can bring insights and colour to their writing, so much the better, but at a minimum they should be able to provide a correct description of events.

Everyone makes mistakes. Most of my blog articles contain at least one typographical or grammatical error. Some of them may include errors of fact, though I do my best to avoid these. Where I offer my opinions, it is possible that some of these may be erroneous, or that they may not apply in different situations. However, we tend to expect professionals in certain fields to be held to a higher standard.

For a molecular biologist, the difference between a 0.20 micro-molar solution and a 0.19 one may be massive. For a team of experimental physicists, unbelievably small quantities may mean the difference between confirming the existence of the Higgs Boson and just some background noise.

In business, it would be unfortunate (to say the least) if auditors overlooked major assets or liabilities. One would expect that law-enforcement agents did not perjure themselves in court. Equally politicians should never dissemble, prevaricate or mislead. OK, maybe I am a little off track with the last one. But surely it is not unreasonable to expect that a cricket journalist should accurately record how a batsman got out.

Twitter and Truth

I made something of a leap from these sporting events to the more tragic news of Michael Jackson’s recent demise. I recall first “hearing” rumours of this on twitter.com. At this point, no news sites had much to say about the matter. As the evening progressed, the self-styled celebrity gossip site TMZ was the first to announce Jackson’s death. Other news outlets either said “Jackson taken to hospital” or (perhaps hedging their bets) “US web-site reports Jackson dead”.

By this time the twitterverse was experiencing a cosmic storm of tweets about the “fact” of Jackson’s passing. A comparably large number of comments lamented how slow “old media” was to acknowledge this “fact”. Eventually of course the dinosaurs of traditional news and reporting lumbered to the same conclusion as the more agile mammals of Twitter.

In this case social media was proved to be both quick and accurate, so why am I now going to offer a defence of the world’s news organisations? Well I’ll start with a passage from one of my all-time favourite satires, Yes Minister, together with its sequel Yes Prime Minister.

In the following brief excerpt Sir Geoffrey Hastings (the head of MI5, the British domestic intelligence service) is speaking to The Right Honourable James Hacker (the British Prime Minister). Their topic of conversation is the recently revealed news that a senior British Civil Servant had in fact been a Russian spy:

Hastings: Things might get out. We don’t want any more irresponsible ill-informed press speculation.

Hacker: Even if it’s accurate?

Hastings: Especially if it’s accurate. There is nothing worse than accurate irresponsible ill-informed press speculation.

Yes Prime Minister, Vol. I by J. Lynn and A. Jay

Was the twitter noise about Jackson’s death simply accurate ill-informed speculation? It is difficult to ask this question as, sadly, the tweets (and TMZ) proved to be correct. However, before we garland new media with too many wreaths, it is perhaps salutary to recall that there was a second rumour of a celebrity death circulating in the febrile atmosphere of Twitter on that day. As far as I am aware, Pittsburgh’s finest – Jeff Goldblum – is alive and well as we speak. Rumours of his death (in an accident on a New Zealand movie set) proved to be greatly exaggerated.

The difference between a reputable news outlet and hordes of twitterers is that the former has a reputation to defend. While the average tweep will simply shrug their shoulders at RTing what they later learn is inaccurate information, misrepresenting the facts is a cardinal sin for the best news organisations. Indeed reputation is the main thing that news outlets have going for them. This inevitably includes annoying and time-consuming things such as checking facts and validating sources before you publish.

With due respect to Mr Jackson, an even more tragic set of events also sparked some similar discussions; the aftermath of the Iranian election. The Economist published an interesting artilce comparing old and new media responses to this entitiled: Twitter 1, CNN 0. Their final comments on this area were:

[…]the much-ballyhooed Twitter swiftly degraded into pointlessness. By deluging threads like Iranelection with cries of support for the protesters, Americans and Britons rendered the site almost useless as a source of information—something that Iran’s government had tried and failed to do. Even at its best the site gave a partial, one-sided view of events. Both Twitter and YouTube are hobbled as sources of news by their clumsy search engines.

Much more impressive were the desk-bound bloggers. Nico Pitney of the Huffington Post, Andrew Sullivan of the Atlantic and Robert Mackey of the New York Times waded into a morass of information and pulled out the most useful bits. Their websites turned into a mish-mash of tweets, psephological studies, videos and links to newspaper and television reports. It was not pretty, and some of it turned out to be inaccurate. But it was by far the most comprehensive coverage available in English. The winner of the Iranian protests was neither old media nor new media, but a hybrid of the two.

Aside from the IT person in me noticing the opportunity to increase the value of Twitter via improved text analytics (see my earlier article, Literary calculus?), these types of issues raise concerns in my mind. To balance this slightly negative perspective it is worth noting that both accurate and informed tweets have preceded several business events, notably the recent closure of BI start-up LucidEra.

Also main stream media seem to have swallowed the line that Google has developed its own operating system in Chrome OS (rather than lashing the pre-existing Linux kernel on to its browser); maybe it just makes a better story. Blogs and Twitter were far more incisive in their commentary about this development.

Considering the pros and cons, on balance the author remains something of a doubting Thomas (by name as well as nature) about placing too much reliance on Twitter for news; at least as yet.

Accuracy an Business Intelligence

Some business thoughts leaked into the final paragraph of the Introduction above, but I am interested more in the concept of accuracy as it pertains to one of my core areas of competence – business intelligence. Here there are different views expressed. Some authorities feel that the most important thing in BI is to be quick with information that is good-enough; the time taken to achieve undue precision being the enemy of crisp decision-making. Others insist that small changes can tip finely-balanced decisions one way or another and so precision is paramount. In a way that is undoubtedly familiar to regular readers, I straddle these two opinions. With my dislike for hard-and-fast recipes for success, I feel that circumstances should generally dictate the approach.

There are of course different types of accuracy. There is that which insists that business information reflects actual business events (often more a case for work in front-end business systems rather than BI). There is also that which dictates that BI systems reconcile to the penny to perhaps less functional, but pre-existing scorecards (e.g. the financial results of an organisation).

A number of things can impact accuracy, including, but not limited to: how data has been entered into systems; how that data is transformed by interfaces; differences between terminology and calculation methods in different data sources; misunderstandings by IT people about the meaning of business data; errors in the extract transform and load logic that builds BI solutions; and sometimes even the decisions about how information is portrayed in BI tools themselves. I cover some of these in my previous piece Using BI to drive improvements in data quality.

However, one thing that I think differentiates enterprise BI from departmental BI (or indeed predictive models or other types of analytics), is a greater emphasis on accuracy. If enterprise BI is to aspire to becoming the single version of the truth for an organisation, then much more emphasis needs to be placed on accuracy. For information that is intended to be the yardstick by which a business is measured, good enough may fall short of the mark. This is particularly the case where a series of good enough solutions are merged together; the whole may be even less than the sum of its parts.

A focus on accuracy in BI also achieves something else. It stresses an aspiration to excellence in the BI team. Such aspirations tend to be positive for groups of people in business, just as they are for sporting teams. Not everyone who dreams of winning an Olympic gold medal will do so, but trying to make such dreams a reality generally leads to improved performance. If the central goal of BI is to improve corporate performance, then raising the bar for the BI team’s own performance is a great place to start and aiming for accuracy is a great way to move forward.

A final thought: England went on to beat Australia by precisely 115 runs in the second Test at Lord’s; the final result coming today at precisely 12:42 pm British Summer Time. The accuracy of England’s bowling was a major factor. Maybe there is something to learn here.

Literary calculus?

18 June 20093 November 2014 Peter James Thomas business, technology, text analytics information technology, jean-michel texier, nstein, semantic web, seth grimes, web 3.0



	@sethgrimes	@jmtexier

As mentioned in my earlier article, A first for me…, I was lucky enough to secure an invitation to an Nstein seminar held in London’s Covent Garden today. The strap-line for the meeting was Media Companies: The Most to Gain from Web 3.0 and the two speakers appear above (some background on them is included at the foot of this article). I have no intention here of rehashing everything that Seth and Jean-Michel spoke about, try to catch one or both of them speaking some time if you want the full details, but I will try to pick up on some of their themes.

Seth spoke first and explained that, rather than having the future Web 3.0 as the centre of the session, he was going to speak more about some of the foundational elements that he saw as contributing to this, in particular text mining and semantics. I have to admit to being a total neophyte when it comes to these areas and Seth provided a helpful introduction including the thoughts of such early luminaries as Hans Peter Luhn and drawing on sources of even greater antiquity. An interesting observation in this section was that Business Intelligence was initially envisaged as encompassing documents and text, before it evolved into the more numerically-focused discipline that we know today.

Seth moved on to speak about the concept of the semantic web where all data and text is accompanied by contextual information that allows people (or machines) to use it; enabling a greatly increased level of “data, information, and knowledge exchange.” The deficiencies of attempting to derive meaning from text, based solely on statistical analysis were covered and, adopting a more linguistic approach, the issue of homonyms, where meaning is intrinsicly linked to context, was also raised. The dangers of a word-by-word approach to understanding text can perhaps be illustrated by reference to the title of this article.

Such problems can be seen in the results that are obtained when searching for certain terms, with some items being wholly unrelated to the desired information and others related, but only in such a way that their value is limited. However some interesting improvements in search were also highlighted where the engines can nowadays recognise such diverse entities as countries, people and mathematical formulae and respond accordingly; e.g.

http://www.google.co.uk/search?&q=age+of+the+pope.

Extending this theme, Seth quoted the following definition (while stating that there were many alternatives):

Web 3.0 = Web 2.0 + Semantic Web + Semantic Tools

One way of providing semantic information about content is of course by humans tagging it; either the author of the content, or subsequent reviewers. However there are limitations to this. As Jean-Michel later pointed out, how is the person tagging today meant to anticipate future needs to access the information? In this area, text mining or text analytics can enable Web 3.0 by the automatic allocation of tags; such an approach being more exhaustive and consistent than one based solely on human input.

Seth reported that the text analytics market has been holding up well, despite the current economic difficulties. In fact there was significant growth (approx. 40%) in 2008 and a good figure (approx. 25%) is also anticipated in 2009. These strong figures are driven by businesses beginning to realise the value that this area can release.

Seth next went through some of the high-level findings of a survey he had recently conducted (partially funded by Nstein). Amongst other things, this covers the type of text sources that organisations would like to analyse and the reasons that they would like to do this. I will leave readers to learn more about this area for themselves as this paper is due to be published in the near future. However, a stand-out finding was the level of satisfaction of users of text analytics. Nearly 75% of users described themselves as either very satisfied or satisfied. Only 4% said that they were dissatisfied. Seth made the comment, with which I concur, that these are extraordinarily high figures for a technology.

Jean-Michel took over at the half way point. Understandably a certain amount of his material was more focussed on the audience and his company’s tools, whereas Seth’s talk had been more conceptual in nature. However, he did touch on some of the technological components of the semantic web, including Resource Description Framework (RDF), Microformat, Web Ontology Language (OWL – you have to love Winnie the Pooh references don’t you?) and SPARQL. I’ll cover Jean-Michel’s comments in less detail. However a few things stuck in my mind, the first of these being:

Web 1.0 was for authors

Web 2.0 is for users (and includes the embracement of interaction)

Web 3.0 is also for machines (opening up a whole range of possibilities)

Second Jean-Michel challenged the adage that “Content is King” suggesting that this was slowly, but surely morphing into “Context is King”, offering some engaging examples, which I will not plagiarise here. He was however careful to stress that “content will remain key”.

All-in-all the two-hour session was extremely interesting. Both speakers were well-informed and engaging. Also, at least for a novice in the area like me, some of the material was very thought-provoking. As some one who is steeped in the numeric aspects of business intelligence, I think that I have maybe had my horizons somewhat broadened as a result of attending the seminar. It is difficult to think of a better outcome for such a gathering to achieve.

UPDATE: Seth has also written about his presentations on his BeyeNetwork blog. You can read his comments and find a link to a recording of the presentations here.


	Seth Grimes is an analytics strategy consultant, a recognized expert on business intelligence and text analytics. He is contributing editor at Intelligent Enterprise magazine, founding chair of the Text Analytics Summit, Data Warehousing Institute (TDWI) instructor, and text analytics channel expert at the Business Intelligence Network. Seth founded Washington DC-based Alta Plana Corporation in 1997. He consults, writes, and speaks on information-systems strategy, data management and analysis systems, industry trends, and emerging analytical technologies.

	Jean-Michel Texier has been building digital solutions for media companies since the early days of the Internet. He founded Eurocortex, in France, where he built content management solutions specifically for press and media companies. When the company was acquired by Nstein Technologies in 2006, Texier took over as CTO and chief visionary, helping companies organize, package and monetize content through semantic analysis.

	Nstein Technologies (TSX-V: EIN) develops and markets multilingual solutions that power digital publishing for the most prestigious newspapers, magazines, and content-driven organizations. Nstein’s solutions generate new revenue opportunities and reduce operational costs by enabling the centralization, management and automated indexing of digital assets. Nstein partners with clients to design a complete digital strategy for success using publishing industry best practices for the implementation of its Web Content Management, Digital Asset Management, Text Mining Engine and Picture Management Desk products. www.nstein.com

	Welcome to: peterjamesthomas.com, a site which covers my thoughts on the confluence of business, technology and change.
+44 (0) 2088 956 826
	[Web-site Contact]
	[Commercial Contact]
	[Creative Commons]
	[Report Problems]

Peter James Thomas

Data & Analytics: Consultancy, Interim Services and Research

text analytics

Literary calculus?

Like this:

Hastings:	Things might get out. We don’t want any more irresponsible ill-informed press speculation.

Hacker:	Even if it’s accurate?

Hastings:	Especially if it’s accurate. There is nothing worse than accurate irresponsible ill-informed press speculation.

Share this:

Like this:

Share this:

Like this: