Know (and use) what you know: Competing on Analytics with Knowledge Graphs #2
by Dr. Alessandro Negro
In this three-part series, “Know what you know: Competing on Analytics with Knowledge Graphs”, Dr. Alessandro Negro, Chief Scientist at GraphAware, walks you through analytics, knowledge graphs and their “competition”. In the first part, we discussed the recipe for successfully competing on analytics and shared some success stories.
None of the big changes happened overnight. According to the authors of Competing on Analytics, in 10 years or so, we went through a revolution in the way in which companies see and leverage their data.
They identified four major eras, or stages, of this revolution, summarised in the following table.
| Era | Name | Period | Description |
|---|---|---|---|
| 1.0 | Describe your world | ~2005 - 2008 | This era focused mostly on “descriptive analytics”, i.e. reports and visuals explaining what happened in the past. Data warehouses and business intelligence were the key tools and techniques for so-called decision support. They coped mainly with relational data sources (structured and numerical data) and organised them so that quantitative analysts could get a broader view of the past. These analytics were used only to support internal decisions, and in practice they were largely ignored: decisions continued to be made on intuition and gut feeling. |
| 2.0 | Big data dawns in the Valley | ~2008 - 2011 | As usual, the early adopters of new disruptive trends resided in Silicon Valley. In the late 2000s, the leading firms in the online industry (Google, LinkedIn, PayPal and so forth) started adopting a new analytics paradigm to make sense of the data produced by their huge user bases. This new type of data was voluminous, fast-moving and fast-changing, and rarely came in rows and columns as in relational data sources. The term “Big Data” entered the mainstream and appeared in countless articles. The four Vs (volume, velocity, variety and veracity, with more added later) summed up the main challenges of dealing with such data. Many companies looked at this data with goals other than supporting decision making: the purpose was instead to improve the services they offered to end users. Analytics became more strategic and core to the organisation: Google had PageRank, LinkedIn offered “People You May Know”, and so on. These companies productised their data to strengthen their service offering, and it was a great success. Apache Hadoop MapReduce was born and quickly became part of a collection of open-source technologies for processing big data: Pig, Hive, Python, Spark, R. Data scientist was dubbed the sexiest job of the 21st century. |
| 3.0 | Big (and small) data go mainstream | ~2011 - 2015 | The evolution from 1.0 to 2.0 was an easy transition for these innovative-by-definition companies. They were online-based companies collecting data from clickstreams and user interactions in order to provide services to those users. As you know, evolution often takes the simplest and shortest path. After 2010, it became clear that big data was not a fad. The technologies around big data were more mature and stable, and important lessons had been learned from the early movers. Many companies, even traditional ones, started looking at their data from a different perspective and discovered that they were collecting an enormous amount of information. This information was distributed across multiple data silos and, since each organisation was collecting its own data independently of the others, some of it was small data and some was big data. In the attempt to collect all data in a single place, a new concept was introduced: the data lake. The term “data lake” was introduced for the first time in 2011 by James Dixon, Chief Technology Officer of Pentaho, to describe a data repository that stores a pool of large and varied data coming from multiple sources in its natural state (original data format): no filtering, no pre-processing, no packaging. All data is kept when using a data lake; none of it is removed or filtered prior to storage. The data might be used many times for different purposes, or never at all. That is why the data is kept raw until it is ready to be used, an approach called “schema on read”. Data scientists can access the raw data when they need it, using more advanced analytics tools or predictive modelling on platforms like Hadoop, which had meanwhile evolved quite significantly. Regardless of the approach used, in Analytics 3.0 data and analytics became mainstream business resources. United Healthcare, for example, created a business unit called Optum that generates 11-figure annual revenues from selling data, analytics and information systems. Climate Corporation, acquired by Monsanto, launched Climate Pro, an application that uses weather, crop and soil data to tell farmers the optimal times to plant and harvest. |
| 4.0 | The rise of autonomous analytics | ~2015 - 2020 | In the first three eras of analytics, the human was at the centre of the process: analysts or data scientists gathered the data, created hypotheses and instructed the computer on what to do. The fourth era, starting in the mid-2010s, completely changed this approach by trying to remove the human variable from the equation, or at least limit its role. Cognitive technologies, artificial intelligence and machine learning, together with their related tools, became very common and were rapidly climbing the hype cycle of emerging technologies. The rise of machine learning can be considered a direct consequence of the rapid growth of data, the availability of software and the increased power of computing architectures. No analyst or data scientist can manually process all the data available in a data lake. In this era, I am also including self-driving cars, conversational AI and many other technologies in which machines serve humans, offering services for which human involvement is unnecessary (or not strictly necessary). |
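To make the “schema on read” idea from the Analytics 3.0 era a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample events, the field names and the helper `read_with_schema` are all hypothetical and do not come from any particular data lake product. The point it tries to show is that the raw records are stored untouched, and each consumer applies its own schema only at read time.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (one JSON document per line).
# Nothing is filtered or reshaped before storage.
RAW_EVENTS = """\
{"user": "alice", "action": "click", "ts": "2014-03-01T10:00:00", "page": "/home"}
{"user": "bob", "action": "purchase", "ts": "2014-03-01T10:05:00", "amount": "19.90"}
{"user": "alice", "action": "purchase", "ts": "2014-03-01T10:07:00", "amount": "5.00", "coupon": "SPRING"}
"""


def read_with_schema(raw, schema):
    """Yield one dict per raw record, keeping only the fields named in `schema`.

    `schema` maps a field name to a converter function. Fields missing from a
    record come back as None. The raw data itself is never modified.
    """
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Two consumers read the same raw data through two different schemas.
revenue_schema = {"user": str, "amount": float}
activity_schema = {"user": str, "action": str, "page": str}

revenue_rows = [
    row for row in read_with_schema(RAW_EVENTS, revenue_schema)
    if row["amount"] is not None
]
activity_rows = list(read_with_schema(RAW_EVENTS, activity_schema))

total_revenue = sum(row["amount"] for row in revenue_rows)
print(f"purchases: {len(revenue_rows)}, total revenue: {total_revenue:.2f}")
print(activity_rows[0])
```

In this sketch the revenue view and the activity view never had to agree on a common structure up front, which is the trade-off a data lake makes compared with a data warehouse: storage is cheap and flexible, while the cost of interpreting the data moves to whoever reads it.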
What a journey so far! Analytics started by collecting just numerical data and letting decision makers decide by looking at pie charts of aggregated values, and ended up able to replace us in many basic tasks, letting us focus on the business or our family. Of course, this didn’t come without costs. Paraphrasing Spider-Man’s uncle, “from increasing power derives increasing responsibilities”. Analytics are now part of our lives, not only because they suggest a book to read or a movie to watch, but also because they help scientists find new treatments for cancer and other terrible diseases and help physicians deliver precision medicine. So, in many cases, the great responsibility comes in the form of transparency: we can trust the machines only up to a certain level. This realisation is feeding a new era that is becoming more and more relevant, not for all the applications we mentioned, but certainly for the mission-critical ones.
In the next and last part of the blog series “Know what you know: Competing on Analytics with Knowledge Graphs”, we will move to the fifth era of analytics, Explainable AI and Knowledge Graphs, and explain why you should compete on analytics with knowledge graphs.
Thomas H. Davenport and Jeanne G. Harris. 2017. Competing on Analytics: The New Science of Winning (2nd ed.). Harvard Business School Press, USA.
Eric Siegel. 2016. Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (2nd ed.). Wiley Publishing, USA.