The Data and Analytics Dictionary

This dictionary covers terms commonly used in the data and analytics field. It is not intended to be exhaustive. Instead I have focussed on a few terms I feel to be pertinent, perhaps particularly to those with less of a background in the area. I have avoided covering the second-level terms that are related to most of the definitions below (so I reference Cassandra, Flink, Hadoop, Hive, Pig and Spark, but none of Flume, HBase, Impala, Kafka, Oozie, Phoenix, Sqoop, Storm or ZooKeeper), instead trying to focus on the big picture. Similarly, I have not included basic statistical terms such as Standard Deviation or p-value, mostly to avoid the dictionary becoming too large.

— Peter James Thomas, August 2017

 
– Index –
A B C D E F G
H I J K L M N
O P Q R S T U
V W X Y Z    
 
– A – Return to Index
 
Algorithm A set of instructions (frequently Mathematical in nature), written down as a series of steps, which are generally iterated through many times in order to achieve some task or find some result. Computers are good at running algorithms.
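
By way of illustration, here is a minimal sketch in Python (the example and function name are my own, not drawn from the text) of an algorithm that repeats a simple step many times to approximate a square root:

```python
def approximate_sqrt(value, tolerance=1e-9):
    """Newton's method: repeat a simple refinement step until the guess is good enough."""
    guess = value / 2.0 if value > 1 else 1.0
    while abs(guess * guess - value) > tolerance:
        guess = (guess + value / guess) / 2.0  # each iteration improves the estimate
    return guess

print(approximate_sqrt(2.0))  # roughly 1.4142135
```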
 
Analytics Deriving insights from data which are generally beyond the purpose for which the data was originally captured – to be contrasted with Information which relates to the meaning inherent in data (i.e. the reason that it was captured in the first place). Analytics often employ advanced statistical techniques (logistic regression, multivariate regression, time series analysis etc.) to derive meaning from data.

Gartner refer to four types of Analytics: descriptive, diagnostic, predictive and prescriptive. In an article I referred to these as:

  1. What happened?
  2. Why did it happen?
  3. What is going to happen next?
  4. What should we be doing?

An Analytics department, together with a Data Management department, may be part of a broader data function, which would sometimes report to a CDO. Alternatively it may be more independent and perhaps be headed by a Chief Analytics Officer (CAO).

See also: An in-depth Interview with Allan Engelhardt about Analytics

 
Anomaly Detection (Outlier Detection) Techniques, used in Data Mining and elsewhere, for identifying points in a set of data that do not conform to the general characteristics of the rest of the data. A classic example would be detecting fraudulent transactions amongst the overwhelmingly larger number of legitimate transactions processed by a bank or credit card company. More seriously, it could be identifying which of a number of photographs of chronic skin lesions might instead show a malignant tumour. Equally the objective could be to discard anomalous data that is otherwise skewing a population (e.g. the one multi-billionaire in a sample of 5,000 people).
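
As a hedged sketch (the transaction values are invented), anomaly detection can be as simple as flagging values that sit far from a robust average, here using only Python's standard library:

```python
import statistics

transactions = [120.0, 95.5, 101.3, 87.9, 110.2, 99.1, 15000.0]  # one suspicious value

median = statistics.median(transactions)
# Median absolute deviation is robust to the very outliers we are trying to find
mad = statistics.median(abs(t - median) for t in transactions)

outliers = [t for t in transactions if abs(t - median) > 10 * mad]
print(outliers)  # [15000.0]
```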
 
Artificial Intelligence (AI) A scientific discipline devoted to the in silico creation of cognitive agents which mimic (or exceed) the behaviour of self-aware systems (such as humans). Milestones in AI include beating humans at Chess (IBM Deep Blue 1997), Jeopardy (IBM Watson 2011) and Go (Google AlphaGo 2016). Self-driving cars are another current example of AI in action.
 
– B – Return to Index
 
Behavioural Analytics A discipline focussed on how users interact with mobile applications and web-sites in order to anticipate and meet their future wants and needs. A very basic example is a web grocer reminding you to purchase items that you generally buy, but which are not in your current shopping basket. Slightly more advanced would be an online store suggesting items that you might be interested in, based on those you have both looked at and purchased historically (“if you liked X, you might like Y”; “people who bought W, also bought Z” etc.). Behavioural Analytics can also help to better design web-sites so that people reaching them as a result of searching for a product have that product prominently displayed to them.
 
Big Data A suite of Open Source technologies (clustered around the Apache Hadoop platform) which leverage multiple commodity servers to spread the load of storing and processing very large data sets (such as those created by the Internet of Things) and provide a range of tools and software that support advanced statistical analysis. These servers can be on-premises or cloud-based with associated security. Big Data technologies are particularly adept at handling Unstructured Data.

See also: Do any technologies grow up or do they only come of age?

 
Big Table Google’s proprietary distributed database platform, which underpins Gmail, Google Maps, YouTube and many other well-known services. The papers Google published about it (and about related technologies such as the Google File System and MapReduce) were precursors to Big Data technologies and Hadoop in particular.
 
Business Intelligence There is no ISO definition, but I use this term as a catch-all to describe the transformation of raw data into information that can be disseminated to business people to support decision-making.

See also: Keynote Articles Section 1.

 
– C – Return to Index
 
Cassandra More properly Apache Cassandra. Cassandra is a distributed database which, like HDFS, spreads its data across many servers, but which is targeted more at operational scenarios. As such it supports higher availability than HDFS. Unlike many Big Data technologies, which trace their origins to Google, Cassandra was originally a Facebook technology.
 
Chief Data Officer (CDO) The person fulfilling the top data job in an organisation, accountable for how data is both controlled and leveraged in pursuit of executing the organisation’s strategy. The CDO is a business role, but one that requires significant technical experience. It most typically reports to the Chief Operating Officer, but other potential structures could see the role reporting to any number of other top-level CxO roles, including directly to the CEO.

While the CDO retains the accountability described above, they must collaborate with peers across the organisation in order to create step changes in how data is treated and to promote a culture of reliance on data to support business decision-making.

See also: Keynote Articles Section 8

 
Clustering See Model.
 
Columnar Database (Column-oriented Database) Consider sales data: in a Relational Database, all sales would be gathered into a single Table (or set of related tables). The row is the primary entity in such a database. Each row of our sales table would have columns such as customer name, product purchased etc. A columnar version of the same database swaps the emphasis. One structure would have all the customer entries stored together, another all the products. This greatly speeds up the look-up of specific attributes.

Columnar databases generally support SQL for querying in the same way as Relational ones. An example of a Columnar Database is Vertica.
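
The sketch below (with invented data) contrasts the two layouts in Python; in a real Columnar Database the column-wise storage happens on disk, but the effect on attribute look-ups is analogous:

```python
# Row-oriented: each record keeps all of its columns together
rows = [
    {"customer": "Ann",  "product": "Widget", "amount": 10.0},
    {"customer": "Bob",  "product": "Gadget", "amount": 25.0},
    {"customer": "Cara", "product": "Widget", "amount": 12.5},
]

# Column-oriented: each column is stored (and scanned) as a contiguous list
columns = {
    "customer": ["Ann", "Bob", "Cara"],
    "product":  ["Widget", "Gadget", "Widget"],
    "amount":   [10.0, 25.0, 12.5],
}

# Summing one attribute only touches one structure in the columnar layout
print(sum(columns["amount"]))  # 47.5
```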

 
Complex Event Processing This is a way to monitor streams of data from more than one source, often in real-time, in order to identify threats, opportunities or simply patterns. An example might be connecting Social Media Sentiment Analysis with the content of call-centre conversations. This might lead to the deduction that there is either an uptick in demand for an organisation’s products or services or that there might be some reputational issue that could have an adverse impact. Some automated trading algorithms in Financial Services might use elements of CEP to identify buy or sell opportunities related to real world events.

Some organisations even layer Business Intelligence solutions over a CEP platform. Flink is an example of CEP from within the Hadoop stable.

 
Customer Relationship Management (CRM) A class of systems designed to support interactions with customers (often in a call-centre and/or sales organisation setting). Because Customer Data is just one class of data, CRM systems need to be closely integrated with both Systems of Record and data repositories in order to be accurate and to avoid duplicate entry of data.
 
Cube See Multidimensional Approach.
 
– D – Return to Index
 
Dashboard A Dashboard is a single page or pane (often a web-page) which simultaneously displays multiple different measurements of the performance of an organisation or sub-unit of this (division, department, geographic territory). The information is generally at a high-level (with the ability to look into more details if required) and at least some of it may be presented in graphical form (e.g. charts, traffic lights, dials and more advanced Data Visualisations).

The content of Dashboards is frequently drawn from an organisation’s KPIs. Dashboards may represent what is happening at a point in time (e.g. month-end) or what is happening now; or both of these perspectives may be mixed (e.g. profit last month versus new business booked up to 5 minutes ago). The term is taken from the instrumentation of cars and aeroplanes.

See also: “All that glisters is not gold” – some thoughts on dashboards

 
Data Architecture The practice of determining business needs, modelling these and mapping them to appropriate data structures, systems landscapes and integration / interface strategies. Data Architects generally have an in-depth understanding of data and its usage and so can be more technologically experienced than some architects in other fields.

Data Architects tend to engage with Change programmes, both to instill good data practice and also to guard against the typically deleterious impact of such programmes on data assets over time. They are also heavily involved in the design and implementation of data-centric capabilities and the delivery of data to other teams (e.g. Digital).

See also: Bumps in the Road

 
Database While it could be argued that this term could be applied to analogue systems such as index cards at a physical library, it is generally taken to refer to software that enables the storage and retrieval of numbers and text in digital format. Databases differ from Flat-files in that they often contain structures (e.g. Tables, Views and indices) intended to facilitate these tasks and come with tools that enable the efficient management and manipulation of the data they contain.

Some types of Databases include:

  1. hierarchical
  2. relational
  3. graph
  4. columnar
  5. in-memory
  6. distributed
  7. NoSQL.

Some of these attributes overlap with each other, i.e. a database could be both columnar and in-memory, another could be both distributed and NoSQL.

 
Data Cleansing At the positive end of things, this can refer to the harmless (and useful) activity of de-duplicating records, fixing inconsistencies in capitalisation, quarantining data with issues for later review and so on. At the other end of the spectrum we have less helpful (and often harmful) activities that could include picking values to fill empty fields, permanently excluding records, or recalculating figures according to some improved formula. In the author’s opinion these all present a slippery slope leading to data massaging and should be either used very sparingly or avoided altogether.

This advice stands for the statistical use of data as well as operational. Some of the above activities can skew models and create selection bias.

See also: Using BI to drive improvements in data quality

 
Data Discovery A process by which generally numerate, but not necessarily technical, staff are able to explore (generally curated) datasets, perform their own analyses and create their own reports. This term can also refer to interactions with facilities such as highly parameterisable reports or customisable Dashboards in order to develop a personally-relevant view of information. Data Discovery often relies upon good Metadata.
 
Data Governance The management processes and policies necessary to ensure that data captured or generated within a company is of an appropriate standard to use, represents actual business facts and has its integrity preserved when transferred to repositories (e.g. Data Lakes and / or Data Warehouses, General Ledgers etc.), especially when this transfer involves aggregation or merging of different data sets. The activities that Data Governance has oversight of include the operation of and changes to Systems of Record and the activities of Data Management and Analytics departments (which may be merged into one unit, or discrete but with close collaboration).

Data Governance has a strategic role, often involving senior management. Day-to-day tasks supporting Data Governance are often carried out by a Data Management team.

See also: 5 More Themes from a Chief Data Officer Forum and Bumps in the Road

 
Data Ingestion The process of bringing raw, untransformed data into a Big Data repository, such as a Data Lake.
 
Data Integration Bringing together data from different sources into a cohesive whole. This can involve processes like Extract Transform and Load, creating Views combining multiple tables in a Relational Database, or generating new physical data structures duplicating several existing ones in Hadoop.
 
Data Lake A Big Data repository into which copies of source systems are periodically replicated. The Data Lake is one of the resources that Data Scientists leverage to create insight.
 
Data Mart Part of a Data Warehouse devoted to a specific subject area, e.g. Finance, Sales etc.
 
Data Management The day-to-day management of data within an organisation, which encompasses areas such as Data Architecture, Data Quality, Data Governance (normally on behalf of a Data Governance Committee) and often some elements of data provision and / or regular reporting. The objective is to appropriately manage the lifecycle of data throughout the entire organisation, which both ensures the reliability of data and enables it to become a valuable and strategic asset.

In some organisations, Data Management and Analytics are part of the same organisation, in others they are separate but work closely together to achieve shared objectives.

See also: Alphabet Soup and Data Management as part of the Data to Action Journey

 
Data Migration The process of moving data from one place to another, often from a legacy system to a new one, or from old Data Repositories to a new one. This requires a very good understanding of the structure of data in both the source and target systems and may involve elements of Data Integration.
 
Data Mining The process of sifting through generally large data sets to discern meaning, uncover new facts and relationships and establish useful patterns. There is a connection here to some of the activities carried out by Data Scientists, though some aspects of Data Mining may be automated. Data Mining may leverage Big Data implementations, but has been carried out successfully on other types of Databases for many years before the advent of these.
 
Data Modelling The process of examining data sets (e.g. the database underpinning a system) in order to understand how they are structured, the relationships between their various parts and the business entities and transactions they represent. While system data will have a specific Physical Data Model (the tables it contains and their linkages), Data Modelling may instead look to create a higher-level and more abstract set of pseudo-tables, which would be easier to relate to for non-technical staff and would more closely map to business terms and activities; this is known as a Conceptual Data Model. There are several specific documents produced by such work, one of the most common being an Entity-Relationship diagram, e.g. a sales order has a customer and one or more line items, each of which has a product.
 
Data Privacy All B2C organisations hold data about their customers (or potential customers, e.g. those who have made an enquiry). This can range from actual transactions with the company, to non-transactional contact (e.g. queries placed with a call centre), to web-site interactions. While it is necessary to hold at least some of this information in order to properly service the customer, privacy laws (and general ethics) dictate that it should be used in an appropriate way (generally defined as one that the customer has explicitly sanctioned) and not released to either any third party or people in the B2C organisation who have no need to know such details. In general B2C organisations are also meant to retain customer data only so long as it is pertinent to servicing the customer’s needs. Similar arguments pertain to B2B organisations and the details that they hold of partner organisations, but this is less subject to regulation than customer data.

One use of such customer data is to perform Analytics or Statistical Modelling on it in order to better understand customer behaviour and preferences so as to aid retention, increase new business and offer more pertinent and useful products and services. Data Privacy generally dictates that such work must be on data sets that are aggregated, with any information that could potentially be used to identify individual customers being anonymised. However, models in this area may be used to segment existing customers into cohorts (e.g. higher risk appetite, outdoor fan, etc.), which does not in general infringe privacy law.

Laws to protect Data Privacy are becoming more stringent, the potential sanctions more material and the risk of major reputational damage more real. All of this has led organisations to invest time and resource into policies, practices and systems designed to bake Data Privacy compliance into day-to-day operations.

There is some overlap with Information Security, but the two areas are essentially distinct with different priorities and objectives.

 
Data Scientist Someone au fait with exploiting data in many formats from Flat Files to Data Warehouses to Data Lakes. Such individuals possess equal abilities in the data technologies (such as Big Data) and how to derive benefit from these via statistical modelling. Data Scientists are often lapsed actual scientists.

See also: Knowing what you do not Know

 
Data Quality The characteristics of data that cover how accurately and completely it mirrors real world events and thereby how much reliance can be placed on it for the purpose of generating information and insight. Enhancing Data Quality should be a primary objective of Data Management teams. Ways that this can be achieved include:

  1. Data Audits – so long as the loop is closed when issues are discovered
  2. Data Education – to explain to people entering data how it is used and its importance to the organisation
  3. Data Validation – improving how systems validate input or interfaced data, potentially in combination with an approach to Master Data Management
  4. Data Architecture – improving how systems are designed and talk to each other
  5. Data Transparency – taking a “warts and all” approach to how bad data is included in reporting and dashboards

See also: Using BI to drive improvements in data quality and Who should be accountable for data quality?

 
Data Repository A generic term for any structure for holding a collection of, normally related, data. This would encompass Databases, Data Lakes, Data Marts and Data Warehouses.
 
Data Virtualisation If the Data Warehouse paradigm is to gather all source data together in one place, Data Virtualisation instead leaves it where it was (or – more likely – in mirror copies of each system’s data, in order to prevent transaction processing from being impacted by queries) and brings the data together only when it is read. The term Virtualisation arises because this is like creating a virtual Data Warehouse.

A primary advantage of Data Virtualisation is that it can utilise quasi-real-time data (as up-to-date as the mirror databases are). This can be helpful for potentially rapidly changing data, like customer data.

 
Data Visualisation Techniques – such as graphs – for presenting complex information in a manner in which it can be more easily digested by human observers. Based on the concept that a picture paints a thousand words (or a dozen Excel sheets).

See also: Data Visualisation – A Scientific Treatment

 
Data Warehouse A Database holding data from many disparate sources in a common format which allows the comparison of apples and oranges. A regular warehouse is a big building in which many things may be stored, but which has an indexing system which allows them to be located and retrieved easily. A data warehouse is essentially the same concept. Good Data Warehouses have business meaning “baked” into them. Data Warehouses generally follow a Multidimensional Paradigm (related to OLAP) where data is held in Fact Tables (tables covering numbers such as revenue or costs) and Dimensions (things we want to view the facts by, such as region, office, or week).

See also: Using multiple business intelligence tools in an implementation – Part I and Part II

 
Data Warehouse Appliance A dedicated server which is tuned to carry out Analytical tasks very quickly. Transactional servers will be tuned to either create new records or update existing ones. Appliances are tuned to select all records with a given attribute quickly. This is often achieved by using Massively Parallel Processing. Products from IBM Netezza and Teradata are examples of Data Warehouse Appliances.
 
Deep Learning (Deep Neural Network) Deep Learning is a subset of Machine Learning. Its main feature, the use of many layered (hierarchical) neural networks, leads to the alternative description of Hierarchical Learning.

Animal brains consist of neurons, each of which receives input signals via one or more dendrites and passes potentially modified output signals on via one or more axons, which in turn connect with the dendrites of other neurons. In this way, when one neuron “fires”, it may have an impact on several other downstream neurons. Within AI, the creation of physical neural networks and latterly in silico ones has been an active area of study; particularly in fields such as image recognition and interpretation.

Deep Learning utilises the hierarchical arrangements of neural networks, where each neuron is a virtual processor (or analogous arrangements where mini-algorithms play a role similar to these artificial neurons), but on the larger scale possible with more modern computers. The general intention is that data – whose underlying structure is unknown – is fed into the basal layer and, as output from this iteratively becomes input to higher layers, meaning – i.e. a structure to the input data – begins to emerge.

The general concept is that complicated insights (high up the hierarchy) are built up from simpler ones (derived lower down). The number of layers can vary and different layers can be supervised or unsupervised by human “trainers”.

 
Dimension See Multidimensional Approach.
 
Distributed Data Data held across multiple computers, generally coupled with software which makes this seem as if it was held in one place.
 
– E – Return to Index
 
Extract Load Transform (ELT) Rather than the Extract Transform Load paradigm inherent in Data Warehousing, Big Data implementations tend to change the order of these processes. Data is “lifted and shifted” wholesale to a Big Data repository (e.g. a Data Lake) and held in original format. It is only transformed “on the fly” when needed by Data Scientists.

In practice, things are not so black and white. Many Data Lakes contain intermediate merged and transformed data structures, not least to ensure that each Data Scientist doesn’t repeat the same work, or carry it out in a different way.

 
Enterprise Resource Planning (ERP) Systems (or a single system) covering the Financial aspects of an organisation’s activities from Accounts Payable and Receivable to Fixed Asset Management, Cash Management and General Ledger functions / Corporate Consolidation.

See also: “Why do CFOs and CEOs hate IT? – ERP” – Thomas Wailgum at CIO.com

 
Extract Transform Load (ETL) Extract Transform and Load is the term applied to a class of software tools which pretty much do what you would expect from the name. They are designed to manage the process of taking data from one Database, manipulating it so that it is consistent with a pre-defined structure, potentially combining it with data from other databases and loading it into areas of a Data Warehouse or Data Mart from which it can be leveraged by reporting and analysis tools.

The work that ETL tools do was previously carried out in a more manual manner, normally involving SQL. ETL tools allow their users to take more of a top-down design-driven approach and facilitate the documentation and control of code. They also generally have highly specialised elements designed to carry out the merging and aggregation of data quickly and efficiently.
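
A minimal, illustrative ETL flow might look like the following Python sketch using the pandas and SQLAlchemy libraries; the file names, column names and connection string are hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: read raw sales and customer data (file names are hypothetical)
sales = pd.read_csv("sales_extract.csv")         # columns: customer_id, product, amount
customers = pd.read_csv("customer_extract.csv")  # columns: customer_id, region

# Transform: standardise values and combine the two sources
sales["amount"] = sales["amount"].fillna(0).round(2)
combined = sales.merge(customers, on="customer_id", how="left")
by_region = combined.groupby(["region", "product"], as_index=False)["amount"].sum()

# Load: write the conformed data to a warehouse staging table (connection string is illustrative)
engine = create_engine("postgresql://user:password@warehouse/sales_mart")
by_region.to_sql("stg_sales_by_region", engine, if_exists="replace", index=False)
```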

See also: The need for collaboration between teams using the same data in different ways

 
– F – Return to Index
 
Fact Table See Multidimensional Approach.
 
Flat-file A simple text file stored directly on disk with no other structure (i.e. the file is not part of a Database). If you output an Excel sheet in comma separated format, this is one type of flat-file.
 
Flink More properly Apache Flink. This is a framework for processing separate streams of data, often for the purpose of supporting Complex Event Processing.
 
– G – Return to Index
 
Graph Database Graph Databases consist of networks of nodes and relationships between them (mirroring the concept of a Graph in discrete mathematics, which is somewhat different to an Excel chart). The nodes contain information about entities, e.g. trading companies, and the relationships capture how nodes are connected in different ways, e.g. supplier / customer. This way of modelling data is beneficial when complex data (e.g. all elements of a multi-tiered hierarchy) is to be retrieved en masse.

Graph Databases represent one type of NoSQL database. An example of a Graph Database is neo4j.

 
– H – Return to Index
 
Hadoop More properly Apache Hadoop. This is the most widely used framework for Big Data and is derived from Google’s initial work in the area. It is an Open Source data platform with an increasingly rich hinterland of add-on capabilities, e.g. in the areas of Statistical Analysis and Data Visualisation. Hadoop represents one type of NoSQL database.

See also: Elephants’ Graveyard?

 
HDFS Hadoop Distributed File System. The underpinning storage mechanism of Hadoop, a descendant of the Google File System (GFS).
 
Hierarchical Database Hierarchical Databases store data in structures that resemble inverted trees (maybe family trees is a better analogy). These were among the first commercially available databases, having been developed in the 1960s. Hierarchical Databases have typically been found on large mainframe computers. Although perhaps perceived as legacy software, many companies still operate these databases, which are reliable and highly performant. Issues arise due to the lack of flexibility with which data can be extracted, something that tends to require specialist programming.

An example of a Hierarchical Database is IBM’s IMS.

 
Hive (HiveQL) More properly Apache Hive. This is part of the Hadoop suite, which aims to deliver Data Warehouse-like functionality, including Tables and a SQL-like query language, HiveQL.
 
– I – Return to Index
 
Information Security Information Security consists of the steps that are necessary to make sure that any data or information, particularly sensitive information (trade secrets, financial information, intellectual property, employee details, customer and supplier details and so on), is protected from unauthorised access or use. Threats to be guarded against would include everything from intentional industrial espionage, to ad hoc hacking, to employees releasing or selling company information. The practice of Information Security also applies to the (nowadays typical) situation where some elements of internal information are made available via the internet. There is a need here to ensure that only those people who are authenticated to access such information can do so.

Information Security can also pertain to issues where there is no hostile actor, such as the accidental release of information to too wide an audience (e.g. a confidential internal mail being sent to external parties by mistake), information becoming inaccurate over time (perhaps due to some systems no longer being updated with the most recent details), or the loss of sensitive data through issues with internal systems and infrastructure (e.g. a catastrophic systems failure where insufficient attention has been paid to backup and recovery).

There is a connection with Data Privacy, not least the risk of reputational damage, but the two areas are somewhat different, are approached in different ways and often have different people responsible for them.

 
In-memory In a computer, data is typically stored on some long-term media (a hard disk or solid-state drive), but needs to be brought into short-term memory in order to be accessed by the processing unit. So a cycle might be: read two records from disk, load them into memory, change some value on each of them and write them back to disk. There are inherent time lags in such arrangements (even with solid-state drives).

With drops in the price of memory chips and the increasing ability of processors and operating systems to address more memory, it has been possible to load significant amounts of data into memory and keep it there. By eliminating repeated read / write cycles, this allows lightning fast access to data.

Different types of databases and tools can run in-memory. Thus there are in-memory Columnar Databases and on-disk Columnar Databases. SAP’s HANA is an example of an In-memory database.

 
Internet of Things A term related to the increasingly vast amounts of data generated by machines and the linkage of this to the Internet (e.g. Tesla self-driving cars rely upon data captured by other Tesla cars as well as many other data sources).
 
– K – Return to Index
 
Key Performance Indicator (KPI) Those measurements of performance that are most crucial to understanding and monitoring the performance of a whole organisation or of parts of it. The “K” is important: measurements that are merely interesting or of some use are Performance Indicators (PIs), not KPIs; though a Divisional or Country KPI may be just a PI at Group or Head Office level. Ideally the number of KPIs should be kept relatively small to aid monitoring and also communication about why they are important. KPIs often form the basis of Dashboards.
 
– L – Return to Index
 
Linear Regression See Model.
 
Logistic Regression See Model.
 
– M – Return to Index
 
Machine Learning An Artificial Intelligence technique, which employs large data sets (thus the link to Big Data) to “train” Algorithms to recognise patterns and draw inferences. An example would be interpreting what is going on in a photo, e.g. “a boy is throwing a red ball”. Algorithms tuned by Machine Learning already routinely achieve super-human abilities, albeit in narrow contexts, such as face recognition. Generalised Machine Learning (where the code can adapt to learning many different things), like Generalised AI (effectively the creation of conscious AI entities), is some way off. Machine Learning typically leverages an area of mathematics called Linear Algebra.
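
As an illustrative sketch only (using the scikit-learn library and one of its bundled sample datasets, rather than anything referenced above), training and testing a simple Machine Learning model might look like this in Python:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A small, well-known dataset of hand-written digit images
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# "Train" an algorithm on labelled examples, then check how well it generalises
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"Accuracy on unseen examples: {model.score(X_test, y_test):.2f}")
```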

See also the addendum.

 
MapReduce Google’s original process (now leveraged in Hadoop and elsewhere) for breaking a task into component parts, running each of these (together these comprise the Map element) and combining the results (the Reduce element). Comparable to Massively Parallel Processing. Google has since moved on to more advanced processes for its Big Data work.
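
The classic word-count example below is a toy Python sketch of the same Map and Reduce idea; in Hadoop the mapped work would be distributed across many servers rather than run in a single list comprehension:

```python
from collections import Counter
from functools import reduce

documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map: each document is processed independently (in Hadoop, on separate nodes)
partial_counts = [Counter(doc.split()) for doc in documents]

# Reduce: the partial results are combined into a single answer
total_counts = reduce(lambda a, b: a + b, partial_counts)
print(total_counts)  # Counter({'the': 3, 'quick': 2, 'dog': 2, ...})
```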
 
Materialised View See View.
 
Massively Parallel Processing (MPP) A technique (and hardware embodying this) which splits a task into segments, each of which is carried out by a single processing unit, with the process being coordinated and results consolidated by controlling software.
 
Master Data Master Data can perhaps be best described as reference data, though other epithets are often applied, such as lookup data (which probably makes most sense to those with an Excel background), standing data and standard data. Some data will tell you about dynamic things, such as the product prices associated with a sale; here price could take a very wide range of values. By way of contrast, Master Data tells you about the more permanent things that the sale is associated with. These might include the office and salesperson, the type of product sold, the customer, and so on. Each of these will have a much smaller range of possible values than our example of price and it is generally important that correct values are assigned.

Taking a more accountancy-centric view, a General Ledger journal will have an amount (which is again very variable) and often a cost centre and account, both of which will fall into relatively small sets of possible values.

Despite changing much less frequently, Master Data is not immutable. The details of customers may actually be quite volatile for example, the attributes of offices much less so. This leads to a need for approaches to controlling this area and the field of Master Data Management.

 
Master Data Management (MDM) Master Data Management is the term used to describe both the set of processes by which Master Data is created, changed and deleted in an organisation and the technological tools that can facilitate these processes. There is a strong relation here to Data Governance, an area which also encompasses broader objectives. The aim of MDM is to ensure that the creation of business transactions results in valid data, which can then be leveraged confidently to create Information.

Many of the difficulties in MDM arise from items of Master Data that can change over time; for example when one counterparty is acquired by another, or an organisational structure is changed (maybe creating new departments and consolidating old ones). The challenges here include how to report historical transactions that are tagged with Master Data that has now changed.

 
Measure See Multidimensional Approach.
 
Metadata Data about data. So descriptions of what appears in fields, how these relate to other fields and what concepts bigger constructs like Tables embody. This helps people unfamiliar with a dataset to understand how it hangs together and is good practice in the same way that documentation of any other type of code is good practice. Metadata can be used to support some elements of Data Discovery by less technical people. It is also invaluable when there is a need for Data Migration.
 
Model (Statistical Model) A mathematical representation of some physical process (e.g. the mortality and morbidity of a population with certain attributes). Some Models may have no statistical element, e.g. cash-flow models in Finance. However many models use statistical methods to derive approximations of future events based on past events, or to better understand the underlying mechanics of past events.

This dictionary could easily double in length if I included many of the statistical terms associated with modelling. Instead here are some selected mini-definitions:

Clustering

An iterative method of dividing a population into groups in such a way that the members of each group share more with each other than with the members of any other group. This approach is often used in segmenting customers by particular attributes or baskets of attributes (e.g. demographics, behavioural attitude etc.).
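
As a hedged illustration (the customer attributes are invented), a k-means clustering of a tiny dataset using the scikit-learn library might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented customer attributes: [age, annual spend]
customers = np.array([
    [22, 300], [25, 350], [27, 320],     # younger, lower spend
    [45, 1500], [50, 1600], [48, 1550],  # older, higher spend
])

# Ask for two clusters; the algorithm iteratively refines the group boundaries
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(model.labels_)  # e.g. [0 0 0 1 1 1] - one cluster label per customer
```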

Linear Regression

If data consists of pairs of input and output values, then these can be plotted on respectively the x (input) and y (output) axes of a graph. In simple terms, linear regression is equivalent to drawing a line of best fit on the resulting x-y graph. The line of best fit is the one which minimises the overall distance between itself and all of the plotted points.

This approach can scale up to more than one input value, so long as there remains a single output value; so x axis (input 1), y axis (input 2), z axis (output).
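
A minimal sketch of fitting such a line in Python, with invented figures, might use NumPy's polyfit:

```python
import numpy as np

# Invented (input, output) pairs: advertising spend vs. sales
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Fit a straight line y = slope * x + intercept to the points
slope, intercept = np.polyfit(spend, sales, deg=1)
print(f"sales is roughly {slope:.2f} * spend + {intercept:.2f}")
```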

Logistic Regression

A form of regression where there are again (perhaps multiple) input values and one output value, but the output can take only discrete values (e.g. pass or fail, or a vote for one of candidates A, B or C), rather than continuous ones.
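
A small illustrative example (with invented data) using the scikit-learn library:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: hours of revision vs. whether the exam was passed (1) or failed (0)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)
print(model.predict([[3.5]]))        # predicted discrete outcome: pass or fail
print(model.predict_proba([[3.5]]))  # the probability of each outcome
```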

Multivariate Analysis

This covers the case where there is more than one output value – i.e. more than one variable, hence multivariate.

Sensitivity Analysis

Determining how variability in the output of a model is related to variability in its inputs. For example, how a change in product price might impact profitability (fewer sales pushing in one direction, more value per sale pushing in the other).

Time Series Analysis

A Time Series is a set of data which can be ordered according to time; e.g. sales per day, closing stock price per day, average temperature by month and so on. Time Series Analysis is the process of discerning trends and patterns in this type of data, normally with the objective of making predictions for how the data will behave in the future. The term can cover anything from drawing a simple line chart with time as one axis to using a mathematical technique called Fourier Analysis to generate the spectrum of a Time Series which possesses some cyclical element (e.g. seasonality).
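
As a simple sketch (the figures are invented), a rolling average in pandas illustrates the most basic form of trend extraction from a Time Series:

```python
import pandas as pd

# Invented daily sales figures with a weekly pattern
sales = pd.Series(
    [10, 12, 11, 13, 30, 35, 12, 11, 13, 12, 14, 32, 36, 13],
    index=pd.date_range("2017-08-01", periods=14, freq="D"),
)

# A 7-day rolling average smooths out the weekly cycle to reveal the underlying trend
trend = sales.rolling(window=7).mean()
print(trend.dropna())
```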

See also: Keynote Articles Section 9.

 
Multi-dimensional Approach A Multidimensional Database is one which is structured in a way that most directly supports the related concept of On-line Analytical Processing (or OLAP). The multidimensional approach concentrates transactional data (or sometimes this data aggregated into balances) into a single Table called a Fact Table, which contains all pertinent Measures for an area of analysis together with links to relevant Dimensions (see below for definitions of these terms). Creating a Multidimensional Database from source data that is structured differently is the province of Extract Transform and Load tools. Once built, information may be retrieved very quickly and users may flexibly manipulate and filter the data in a way that makes sense to them. By way of very direct analogy, it may take quite some time to construct all of the data that goes into an Excel pivot table, but then the pivot table can be used in a number of different ways.
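
Continuing the pivot table analogy, the following Python sketch (with invented data) uses pandas to slice a toy Fact Table's Measure by two Dimensions:

```python
import pandas as pd

# A miniature fact table: one row per sale, with links to dimension values
facts = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["Widget", "Gadget", "Widget", "Widget", "Gadget"],
    "revenue": [100, 250, 80, 120, 300],
})

# Slice and dice the revenue Measure by the region and product Dimensions
cube = pd.pivot_table(facts, values="revenue", index="region",
                      columns="product", aggfunc="sum", fill_value=0)
print(cube)
```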

A selected set of definitions relating to the Multidimensional Approach appear below:

Dimension

Elements that you want to analyse data by. So country, branch, product and so on. Dimensions are sometimes arranged into hierarchies, so Region, Country, Town or Year, Month, Day.

Measure

Numeric quantities that you want to analyse. For example, counts like number of sales orders or number of customers; monetary values like sales revenue or costs; or percentages such as growth or profit margin.

Fact Table

For a particular multidimensional structure, all relevant Measures will be gathered into a central table called a Fact Table. The same Fact Table will let you look up any Measure or combination of Measures or allow you to aggregate these.

Star Schema

An arrangement where a central Fact Table is surrounded by a number of Dimension Tables allowing the Measures in the Fact Table to be “sliced and diced” by the Dimensions. For example, new customers and value of orders they have placed by country and product type.

Cube

Multidimensional Databases are often referred to as Cubes, particularly where they are saved using proprietary technology (such as Microsoft SQL Server Analysis Services).

 
Multivariate Analysis See Model.
 
– N – Return to Index
 
Natural Language Processing (NLP) An area of Artificial Intelligence research that is also a subset of Linguistics. It focusses on two main topics. First, developing software that can parse human speech or writing (see Text Analytics) in order to form an understanding of the meaning it conveys. Second, the same software generating credible and pertinent responses to such input. Nirvana for NLP would be a software agent that was indistinguishable from a human in conversation (also known as passing the Turing Test).
 
Neural Network See Deep Learning.
 
NoSQL Databases that are designed to store and retrieve very large volumes of data and – as a result – do not use the more general SQL query language to read and write data, but instead their own specialist approaches. The various tools used in a Big Data implementation are examples of NoSQL.
 
– O – Return to Index
 
On-line Analytical Processing (OLAP) A rather clumsy term that arose out of On-line Transaction Processing, the latter being what most systems support. OLAP describes structuring data in a manner which facilitates analysis. This consists of defining Dimensions (the things we want to view information by, such as Product, Month, Customer etc.) and Measures (the numbers we want to view, such as Sales, Returns, EBITDA etc.). This drives the Multidimensional Approach, which also underpins the design of Data Warehouses.
 
On-line Transaction Processing (OLTP) On-line Transaction Processing is a term applied to what most people would view as the central purpose of IT systems; i.e. recording business transactions. Most commercially available software systems fall into the OLTP category. OLTP databases are optimised for entering data, not for querying it, something that leads to the alternative On-line Analytical Processing paradigm.
 
Operational Data Store (ODS) A structure, generally part of an overall Data Warehouse architecture, that holds all unaggregated data from source systems. This data would be transformed (by ETL) to be self-consistent and may also have elements of a Multidimensional Approach applied to it. However it would differ from the main Data Warehouse structures by a) not being aggregated, b) being updated more frequently, or close to real-time and c) often maintaining less history. The ODS is often the main source from which a Data Warehouse is built. An ODS can be used both to support operational reporting, which tends to be more detailed, and as a mechanism to provide granular data to other applications that need it, e.g. Digital front-ends.
 
Outlier Detection See Anomaly Detection.
 
– P – Return to Index
 
Pig A high-level language used for creating MapReduce (or Spark) routines. The term “Pig” comes from Pig Latin. A Pig script might load a file, label the columns it contains, filter it by the contents of one or more of these and write the resulting data into a new file. An example of its usage might be some code to determine how many occurrences of a particular word there are in a given set of files. Pig has some elements in common with SQL, but does not attempt to be SQL-like in the way that HiveQL does.
 
Python An Open Source programming language that is often the tool of choice for Data Scientists manipulating large data sets. Some reasons for this include its ease of use and its extensive libraries of statistical capabilities.
 
– R – Return to Index
 
R A powerful Open Source statistical programming language, often employed on top of a Big Data solution, but also on more traditional Relational Databases and other data sources.
 
Relational Database A Database that stores data in Tables with relations between them. E.g. there may be a Customer table with contact details and marketing preferences and a Sales table linked to this with products, prices and purchase dates. Relational Databases have their contents created, modified and read using the industry standard SQL language. Relational Databases have been the de facto standard for supporting all IT systems over the last 40 years. They remain the standard for transactional systems (web-sites, sales systems, General Ledgers) and for statutory, regulatory and publicly available information. Since the mid-2000s they have been challenged in the insight and analytics space by Big Data solutions.

The acronym RDBMS stands for Relational Database Management System and is close to synonymous with Relational Database. Examples of Relational Databases include Oracle and Microsoft’s SQL Server.

 
– S – Return to Index
 
Sensitivity Analysis See Model.
 
Sentiment Analysis This has the objective of understanding the state of mind of individuals, often customers, or potential customers, and how this might impinge on their relationship with an organisation. For example, do they have a positive view of the organisation’s brand? Did they react positively or negatively to a recent news story about the organisation? This discipline often uses Social Media sources as a primary input, though survey feedback and comments made to customer service representatives may also feature. Sentiment Analysis draws on elements of both Text Analytics and Natural Language Processing. Objectives might include determining which product or service to offer on a web-page or in advertising, or forestalling a good customer leaving for a competitor because of some shift in their opinion about the organisation.
 
Single Customer View (SCV) The concept of a Single View of the Truth more narrowly applied to just customer data. Historically this has been achieved via Customer Relationship Management systems, but other data repositories such as Data Lakes and Data Warehouses have sometimes played this role.

An Analytical SCV is one which is complete and accurate enough to support meaningful statistical analysis. An Operational SCV is one which is complete and accurate enough to support individual customer interactions, be this by ‘phone or e-mail, or via Digital media such as web-sites and smart ‘phones.

 
Single Version of the Truth (SVT) Generally an aspirational state where all decision making in an organisation relies upon the same source of data, normally a central repository of some sort such as a Data Lake (more accurately the models that Data Scientists build on this) or Data Warehouse. SVT is different to all numbers used in an organisation being identical. For example a monthly sales report might treat new business in a different way to a Finance report. The important point is that both figures ought to originate from the same place, even if they are treated differently later (e.g. cut-off dates being different for different purposes).

See also: A single version of the truth?

 
Spark More properly Apache Spark. Spark was developed to improve upon the speed of the MapReduce approach where the same data is accessed many times, as can happen in some queries and algorithms. This is achieved in part by holding some or all of the data to be accessed In-memory. Spark works with HDFS and also other distributed file systems, such as Apache Cassandra.

See also: Elephants’ Graveyard?

 
SQL Short for Structured [English] Query Language. A standardised language for asking questions of Relational Databases. SQL has been the standard for interacting with databases for the last four decades.
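
A minimal, self-contained illustration (the table and data are invented) using Python's built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("Ann", "Widget", 10.0), ("Bob", "Gadget", 25.0), ("Ann", "Gadget", 5.0)])

# A declarative question: total spend per customer, highest first
for row in conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY 2 DESC"
):
    print(row)
```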
 
Star Schema See Multidimensional Approach.
 
Statistical Model See Model.
 
Structured Data A high proportion of the data captured by most organisations. Anything that can be captured in a Table with rows and columns (and thus reproduced in Excel).
 
System of Record
A system which captures transactions material to the Statutory Financial results or Regulatory submissions of an organisation. A System of Record will be the definitive source of data about transactions.
 
– T – Return to Index
 
Table The fundamental component of many Databases. Used to store specific sets of data, generally to do with a single area, e.g. a customer address table. A database will generally contain many tables. In a Relational Database, each table is in turn made up of columns and rows in a way that is analogous to an Excel spreadsheet. Each column holds a different type of data element (e.g. Column 1 = First Name, Column 2 = Second Name, Column 3 = Date of Birth), whereas each row gathers together one set of values for each of the columns (e.g. Row 1 = “Bill”, “Gates”, “28-OCT-1955”, … and Row 2 = “Elon”, “Musk”, “28-JUN-1971”, …).

Note: HDFS does not have the concept of a table (the FS stands for File System) but rather files, which can be stored in a variety of formats, the simplest being plain text. Other parts of the Hadoop suite, e.g. Hive, do support structures analogous to tables.

 
Text Analytics (Text Mining) The process of parsing text to derive meaning and identify patterns. This could be based on context provided alongside text (semantics) or working on raw text alone. Text Analytics may employ techniques drawn from Machine Learning and – for obvious reasons – Natural Language Processing. One objective of Text Analytics may be Sentiment Analysis.

See also: Literary calculus?

 
Time Series Analysis See Model.
 
Transactional System Related to Systems of Record. A system into which transactions (e.g. new purchase orders, updates to existing purchase orders, returns etc.) are entered by humans, created from data interfaced from other systems, generated on web-sites, or sent from sensors. The main attribute of a transactional system is that historical transactions are not themselves changed; instead they are adjusted by new transactions. This allows historical positions to be recreated.
 
– U – Return to Index
 
Unstructured Data Natural language text, video, photographs / other images, telephony capture, social media / web-site interactions, some aspects of the Internet of Things etc.
 
– V – Return to Index
 
View (Materialised View) A construct in a Relational Database that is analogous to a virtual Table. When someone queries a View, this actually executes a piece of SQL, which could reference several tables at once. This means that a person interacting with a View can have some complexities of the source data hidden from them. A View might be used to combine data from two dissimilar tables, making them seem as one. It might allow access to only a subset of the contents of a table (certain columns and/or certain rows for example) according to security requirements or just the needs of a user. It might supplement raw data by including calculations (Field A × Field B) or adding looked-up descriptions from another table.

In some Relational Databases, instead of being virtual, the output of a View’s SQL is saved to speed access; this is known as a Materialised View.
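
A small sketch (with an invented table) using Python's built-in sqlite3 module shows how a View hides the underlying SQL from whoever queries it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("Ann", 10.0, "OPEN"), ("Bob", 25.0, "CANCELLED"), ("Ann", 5.0, "OPEN")])

# The View wraps the filtering and aggregation; users simply query it like a table
conn.execute("""
    CREATE VIEW open_order_totals AS
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE status = 'OPEN'
    GROUP BY customer
""")

print(conn.execute("SELECT * FROM open_order_totals").fetchall())  # [('Ann', 15.0)]
```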

 
Addendum

This image originally appeared in the Machine Learning section, but messed up the page when viewed on mobile devices. It was too good to drop entirely, so I’ve moved it here:

[xkcd cartoon on Machine Learning]
© Randall Munroe of xkcd.com

 
Acknowledgements

I owe thank yous to the following, in chronological order:

 


Text & Images: © Peter James Thomas 2017.
Published under a Creative Commons Attribution 4.0 International License.