Literary calculus?

Seth Grimes Jean-Michel Texier
@sethgrimes @jmtexier

As mentioned in my earlier article, A first for me…, I was lucky enough to secure an invitation to an Nstein seminar held in London’s Covent Garden today. The strap-line for the meeting was “Media Companies: The Most to Gain from Web 3.0” and the two speakers appear above (some background on them is included at the foot of this article). I have no intention here of rehashing everything that Seth and Jean-Michel spoke about (try to catch one or both of them speaking some time if you want the full details), but I will try to pick up on some of their themes.

Seth spoke first and explained that, rather than having the future Web 3.0 as the centre of the session, he was going to speak more about some of the foundational elements that he saw as contributing to this, in particular text mining and semantics. I have to admit to being a total neophyte when it comes to these areas and Seth provided a helpful introduction including the thoughts of such early luminaries as Hans Peter Luhn and drawing on sources of even greater antiquity. An interesting observation in this section was that Business Intelligence was initially envisaged as encompassing documents and text, before it evolved into the more numerically-focused discipline that we know today.

Seth moved on to speak about the concept of the semantic web, where all data and text are accompanied by contextual information that allows people (or machines) to use them, enabling a greatly increased level of “data, information, and knowledge exchange.” The deficiencies of attempting to derive meaning from text based solely on statistical analysis were covered and, adopting a more linguistic approach, the issue of homonyms, where meaning is intrinsically linked to context, was also raised. The dangers of a word-by-word approach to understanding text can perhaps be illustrated by reference to the title of this article.
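To make the homonym point a little more concrete, here is a toy Python sketch of my own (it is not something that was shown at the seminar). It assumes a tiny, hand-written list of cue words for two senses of the word “bank”; the names `SENSE_CUES`, `word_counts` and `guess_sense` are purely illustrative, and a real text analytics engine would rely on far richer linguistic resources.

```python
from collections import Counter

# Hypothetical cue words for two senses of "bank"; a real system would
# derive these from linguistic resources rather than hard-code them.
SENSE_CUES = {
    "bank/finance": {"money", "loan", "account", "deposit"},
    "bank/river": {"river", "water", "fishing", "shore"},
}


def tokenize(text):
    """Lower-case the text, split on whitespace and strip simple punctuation."""
    return [w.strip(".,;:!?\"'").lower() for w in text.split()]


def word_counts(text):
    """Word-by-word view: every occurrence of a word looks identical."""
    return Counter(tokenize(text))


def guess_sense(text, target="bank"):
    """Pick the sense whose cue words overlap most with the other words."""
    words = set(tokenize(text))
    if target not in words:
        return None
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # Return None when no cue word was seen, rather than guessing.
    return best if scores[best] > 0 else None


a = "She paid the money into her account at the bank."
b = "He went fishing on the bank of the river."

# Counting words alone cannot tell the two sentences apart on "bank".
print(word_counts(a)["bank"], word_counts(b)["bank"])
# Looking at the surrounding words separates the two senses.
print(guess_sense(a), guess_sense(b))
```

The word counts treat both occurrences of “bank” as the same thing; only the neighbouring words reveal which meaning is intended, which is the sense in which meaning is linked to context.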

Such problems can be seen in the results that are obtained when searching for certain terms, with some items being wholly unrelated to the desired information and others related, but only in such a way that their value is limited. However, some interesting improvements in search were also highlighted: the engines can nowadays recognise such diverse entities as countries, people and mathematical formulae and respond accordingly.

Extending this theme, Seth quoted the following definition (while stating that there were many alternatives):

Web 3.0 = Web 2.0 + Semantic Web + Semantic Tools

One way of providing semantic information about content is of course by humans tagging it; either the author of the content, or subsequent reviewers. However, there are limitations to this. As Jean-Michel later pointed out, how is the person tagging today meant to anticipate future needs to access the information? In this area, text mining or text analytics can enable Web 3.0 by the automatic allocation of tags, such an approach being more exhaustive and consistent than one based solely on human input.
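As a rough illustration of what automatic allocation of tags might look like in its very simplest form, here is a minimal Python sketch. It is my own toy example rather than anything Nstein’s products do: the tag vocabulary `TAG_TERMS` and the function `auto_tag` are hypothetical, and the matching is nothing more than keyword overlap.

```python
# Hypothetical tag vocabulary: each tag is triggered by a set of terms.
# A real text mining engine would use linguistic analysis, not a lookup.
TAG_TERMS = {
    "finance": {"market", "growth", "revenue"},
    "semantic-web": {"rdf", "owl", "sparql", "ontology"},
    "text-analytics": {"text", "mining", "analytics"},
}


def auto_tag(text, min_hits=1):
    """Return, in sorted order, every tag with at least min_hits trigger terms in the text."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    return sorted(
        tag for tag, terms in TAG_TERMS.items() if len(terms & words) >= min_hits
    )


doc = "The text analytics market saw growth of about 40% in 2008."
print(auto_tag(doc))
```

Even something this crude shows the two properties mentioned above: every document is checked against every tag (exhaustive) and the same text always receives the same tags (consistent). When needs change, the vocabulary can be extended and old content simply re-tagged.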

Seth reported that the text analytics market has been holding up well, despite the current economic difficulties. In fact there was significant growth (approx. 40%) in 2008 and a good figure (approx. 25%) is also anticipated in 2009. These strong figures are driven by businesses beginning to realise the value that this area can release.

Seth next went through some of the high-level findings of a survey he had recently conducted (partially funded by Nstein). Amongst other things, this covered the types of text source that organisations would like to analyse and the reasons that they would like to do this. I will leave readers to learn more about this area for themselves as this paper is due to be published in the near future. However, a stand-out finding was the level of satisfaction of users of text analytics. Nearly 75% of users described themselves as either very satisfied or satisfied. Only 4% said that they were dissatisfied. Seth made the comment, with which I concur, that these are extraordinarily high figures for a technology.

Jean-Michel took over at the halfway point. Understandably a certain amount of his material was more focussed on the audience and his company’s tools, whereas Seth’s talk had been more conceptual in nature. However, he did touch on some of the technological components of the semantic web, including Resource Description Framework (RDF), microformats, Web Ontology Language (OWL – you have to love Winnie the Pooh references, don’t you?) and SPARQL. I’ll cover Jean-Michel’s comments in less detail. However, a few things stuck in my mind, the first of these being:

  • Web 1.0 was for authors
  • Web 2.0 is for users (and embraces interaction)
  • Web 3.0 is also for machines (opening up a whole range of possibilities)

Second, Jean-Michel challenged the adage that “Content is King”, suggesting that this was slowly but surely morphing into “Context is King” and offering some engaging examples, which I will not plagiarise here. He was however careful to stress that “content will remain key”.

All in all, the two-hour session was extremely interesting. Both speakers were well-informed and engaging. Also, at least for a novice in the area like me, some of the material was very thought-provoking. As someone who is steeped in the numeric aspects of business intelligence, I think that I have maybe had my horizons somewhat broadened as a result of attending the seminar. It is difficult to think of a better outcome for such a gathering to achieve.

UPDATE: Seth has also written about his presentations on his BeyeNetwork blog. You can read his comments and find a link to a recording of the presentations here.

Seth Grimes is an analytics strategy consultant, a recognized expert on business intelligence and text analytics. He is contributing editor at Intelligent Enterprise magazine, founding chair of the Text Analytics Summit, Data Warehousing Institute (TDWI) instructor, and text analytics channel expert at the Business Intelligence Network. Seth founded Washington DC-based Alta Plana Corporation in 1997. He consults, writes, and speaks on information-systems strategy, data management and analysis systems, industry trends, and emerging analytical technologies.

Jean-Michel Texier has been building digital solutions for media companies since the early days of the Internet. He founded Eurocortex, in France, where he built content management solutions specifically for press and media companies. When the company was acquired by Nstein Technologies in 2006, Texier took over as CTO and chief visionary, helping companies organize, package and monetize content through semantic analysis.

Nstein Technologies (TSX-V: EIN) develops and markets multilingual solutions that power digital publishing for the most prestigious newspapers, magazines, and content-driven organizations. Nstein’s solutions generate new revenue opportunities and reduce operational costs by enabling the centralization, management and automated indexing of digital assets. Nstein partners with clients to design a complete digital strategy for success using publishing industry best practices for the implementation of its Web Content Management, Digital Asset Management, Text Mining Engine and Picture Management Desk products.


10 thoughts on “Literary calculus?”

  1. […] to involvement. It may seem that I am splitting hairs on this issue (maybe this is a byproduct of the things that I learnt about semantics yesterday), but I have seen BI projects fail to deliver on their promise specifically because the […]

  2. The big thing with the semantic web is something not many of the current people involved in Web 3.0 understand: that the first real-life application will be in business intelligence.

    OWL describes relationships between objects. Hey, I think I’ve seen that one somewhere before… except now we have freeform relationships instead of the hardcoded ones in ER-diagrams. But restrict them a bit, say, with OWL-DL, and they are machine-parsable. Now add a sniff of FCO-IM (the follow-up to NIAM, Nijssen’s information modelling method) and suddenly we can go from a description of the business model straight down to a database diagram for the physical layer.

    People are working on this already; IBM Research China actually implemented half of it in 2007.

    The semantic web is both overhyped and underrated at the same time – a weird combination. But I am pretty sure we will be hearing a lot about it in the next few years.

  3. […] This seems to be just another of those annoying facts of life. I should be used to it by now, after all “Akismet has protected your site from 2,607 spam comments already.” However it seems to me that the spammers could perhaps do a better job of targeting their work. Maybe this is a breakthrough area for text analytics. […]
