A Technical Article Series for Data Architects
This multi-part article series is intended for data architects and anyone else interested in learning how to design modern real-time data analytics solutions. It explores key principles and implications of event streaming and streaming analytics, and concludes that the biggest opportunity to derive meaningful value from data – and gain continuous intelligence about the state of things – lies in the ability to analyze, learn and predict from real-time events in concert with contextual, static and dynamic data. This article series places continuous intelligence in an architectural context, with reference to established technologies and use cases in place today.
Part 4: Streaming Analytics Helps
Event streaming is a useful decoupler between the real world and applications, but storage and queueing introduce unpredictable delays. If we follow the typical “stream processor” approach, which relies on stateless microservice instances, the developer must use a database to represent application state, which introduces still more delay. Application developers want to deliver meaningful results – from analysis, learning and prediction – at scale, continuously, in response to the continuous updates conveyed in events. Variable application- and infrastructure-layer delays are a big problem.
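As a rough sketch of that pattern (the StateStore class and handle_event function below are hypothetical stand-ins, with sleeps simulating database round trips), every event forces a read and a write against external state, so each hop adds its own latency:

```python
import time

# Hypothetical external store standing in for a database; in a real
# stateless microservice each call would be a network round trip.
class StateStore:
    def __init__(self):
        self._rows = {}

    def get(self, key):
        time.sleep(0.002)          # simulate ~2 ms read latency
        return self._rows.get(key, 0)

    def put(self, key, value):
        time.sleep(0.002)          # simulate ~2 ms write latency
        self._rows[key] = value

store = StateStore()

def handle_event(event):
    """Stateless handler: all context must be fetched, then written back."""
    count = store.get(event["sensor_id"])      # read current state
    store.put(event["sensor_id"], count + 1)   # write updated state

start = time.perf_counter()
for i in range(100):
    handle_event({"sensor_id": "sensor-1", "value": i})
print(f"100 events took {time.perf_counter() - start:.2f}s")  # dominated by store latency
```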
The field of streaming analytics takes a “top down”, management-style view of this challenge and can serve use cases that do not demand a real-time response. Best known for delivering dashboard widgets and KPIs, streaming analytics applications periodically (or when prodded via a GUI) compute insights and deliver results to users. Many tools exist in this category, both general-purpose and vertically integrated. At the specific end of the scale, the Prometheus open source project provides infrastructure for application performance management in the cloud: a simple schema lets a time-series database capture a log of performance metrics for many instances, and a simple UI in Grafana or another toolkit can query the database and draw widgets for managers. This use case is a simple and very limited subset of continuous intelligence: summary widgets also need to evolve in real time.
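For illustration, here is a minimal Python sketch of that pattern using the prometheus_client package; the metric name, labels, values and scrape interval are invented for the example. Prometheus would scrape the exposed endpoint into its time-series database, and Grafana would chart the stored samples:

```python
import random
import time

from prometheus_client import Gauge, start_http_server

# One metric name plus a label per instance is the whole "schema":
# Prometheus scrapes this endpoint and stores the samples as a time series.
CPU_USAGE = Gauge("cpu_usage_percent", "CPU usage per instance", ["instance"])

start_http_server(8000)  # exposes /metrics for Prometheus to scrape

while True:
    for instance in ("web-1", "web-2", "web-3"):
        CPU_USAGE.labels(instance=instance).set(random.uniform(10, 90))
    time.sleep(15)  # dashboards only see the state as of the last scrape
```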
Managers of large systems – whether containers in the cloud or fleets of trucks – need detailed performance metrics specific to their use cases. They are operations- and outcome-focused: they want to “zoom in” on issues to see problems, often in real time. The meaning of data is critical to understanding system performance. So, whereas event streaming systems impose no meaning on events, in streaming analytics the meaning of an event is always well defined: “issues” and “problems” are shorthand for hard-won expertise encoded in an application.
Managers aren’t the only users, though: modern SaaS applications need to deliver contextually useful responses to end users. Timely, granular, personalized, local responses are crucial, even for applications at massive scale. This kind of capability is entirely beyond the scope of streaming analytics applications.
The meaning of data is crucial both for the systems that generate events (“temperature: 5” could mean the temperature is 5 degrees, or that it increased by 5 degrees) and in the context of a model of the system as a whole (thus: “if the temperature in nearby sensors is also increasing, and coolant levels are low, this could cause a fire”). Usually a system model is expressed as a set of relationships between entities (a schema in a relational database or a graph of entity relationships in a graph database). Either way, there is no escaping the need for a semantically rich, stateful model of the system that evolves over time in response to state changes communicated by events, and application logic that interprets those changes as they occur.
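A minimal sketch of such a model in Python might look like the following; the Sensor and CoolingUnit entities, the reports_delta flag, and the fire-risk thresholds are all hypothetical, but they show how the same payload acquires meaning only inside a stateful model of related entities:

```python
from dataclasses import dataclass, field

@dataclass
class Sensor:
    sensor_id: str
    temperature: float = 0.0
    rising: bool = False
    neighbors: list = field(default_factory=list)   # relationships in the system model

@dataclass
class CoolingUnit:
    coolant_level: float = 1.0   # 0.0 .. 1.0

def apply_event(sensor: Sensor, event: dict, reports_delta: bool) -> None:
    """The same payload means different things for different devices."""
    value = event["temperature"]
    new_temp = sensor.temperature + value if reports_delta else value
    sensor.rising = new_temp > sensor.temperature
    sensor.temperature = new_temp

def fire_risk(sensor: Sensor, cooling: CoolingUnit) -> bool:
    """Interpretation in context: neighbors rising too, and coolant low."""
    neighbors_rising = all(n.rising for n in sensor.neighbors)
    return sensor.rising and neighbors_rising and cooling.coolant_level < 0.2

s1, s2 = Sensor("s1"), Sensor("s2")
s1.neighbors.append(s2)
apply_event(s2, {"temperature": 5}, reports_delta=True)   # a 5-degree increase
apply_event(s1, {"temperature": 5}, reports_delta=True)
print(fire_risk(s1, CoolingUnit(coolant_level=0.1)))       # True
```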
We need to differentiate between data and state. Analysis, whether logical or mathematical, evaluates, predicts, or modifies the state of one or more entities, given raw event data (a traffic light is simply green, even if it sends an “I’m green” event every 100 ms). The confusion between data and state lies at the heart of many challenges that application engineers face today.
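A tiny sketch of the distinction, with invented traffic-light telemetry: the stream carries many events, but the state of the light changes only when an event differs from what the model already knows:

```python
# Raw data: one event every 100 ms. State: the light's colour, which only
# changes when an event differs from what the model already knows.
events = ["green"] * 20 + ["amber"] * 5 + ["red"] * 30   # a few seconds of telemetry

state = None
state_changes = 0
for colour in events:
    if colour != state:       # only a genuine transition changes state
        state = colour
        state_changes += 1

print(f"{len(events)} events, {state_changes} state changes")  # 55 events, 3 changes
```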
To tackle this problem, streaming analytics applications use a defined schema or model of the system, combined with business logic that translates event data into state changes in the system model. A single event could trigger state changes for many entities (e.g., the bus arrives, so waiting riders can board). The meaning of state changes triggered by events is the “business logic” part of an application and has a specific interpretation given the use case. For this reason streaming analytics applications are often domain-specific (for example, application performance management) or designed for a business-specific purpose, where changing the schema or the use cases means changing (a lot of) code.
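As a hedged illustration of that translation step (the Bus and Rider entities and the on_bus_arrival handler are invented for the example), a single arrival event updates the state of several entities at once:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Bus:
    bus_id: str
    vacant_seats: int

@dataclass
class Rider:
    rider_id: str
    on_bus: Optional[str] = None

def on_bus_arrival(bus: Bus, waiting_riders: List[Rider]) -> None:
    """Business logic: one 'bus arrived' event changes the state of several entities."""
    for rider in waiting_riders:
        if bus.vacant_seats == 0:
            break
        rider.on_bus = bus.bus_id     # the rider's state changes
        bus.vacant_seats -= 1         # the bus's state changes

bus = Bus("route-7", vacant_seats=2)
riders = [Rider("a"), Rider("b"), Rider("c")]
on_bus_arrival(bus, riders)
print(bus.vacant_seats, [r.on_bus for r in riders])   # 0 ['route-7', 'route-7', None]
```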
But Users Want Continuous Intelligence
Businesses want to track, analyze, learn and predict based on live event data from all their systems and at every moment. They want to easily develop, deploy and operate applications that deliver real-time, contextual responses to live events. And as their systems grow, the ability of an application to deliver real-time responses to streaming data continuously at scale must be ensured. They need to deliver real-time responses and drive automation at a local level, to any user or system, any time.
Building an application that continuously delivers real-time insights from streaming and contextual data is difficult. There are three main challenges:
- Stateless, database-backed microservice application architectures make it impossible to deliver real-time responses.
- Analyzing streaming data on-the-fly demands changes to traditional analytical algorithms that work on complete data sets.
- Processing streaming data in concert with contextual data from static data sources is even more difficult: with enough data volume, databases cannot keep up.
Analysis, learning, prediction, and other insights rely on powerful operators like joins, maps, reductions, unions, intersections and Cartesian products, plus a host of mathematical functions and other algorithms such as machine learning. Some predicate must be evaluated for every event that could change state, in every relevant context. That sounds heavy, so let’s take it slowly.
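A simplified per-event pipeline in Python, with invented stop locations and rider subscriptions standing in for the contextual data, shows these operators in miniature: each event is joined against context, mapped, and then tested against a predicate:

```python
# Contextual data the stream is joined against (would normally live in a database).
stop_locations = {"stop-12": (37.7749, -122.4194)}
subscriptions = {"stop-12": ["rider-42", "rider-99"]}

def process(event: dict):
    """Per-event pipeline: join -> map -> filter (predicate) -> emit."""
    stop = stop_locations.get(event["next_stop"])                    # join with context
    if stop is None:
        return []
    enriched = {**event, "stop_lat": stop[0], "stop_lon": stop[1]}   # map
    if enriched["eta_seconds"] < 120:                                # predicate per event
        return [(rider, enriched) for rider in subscriptions[event["next_stop"]]]
    return []

print(process({"bus_id": "route-7", "next_stop": "stop-12", "eta_seconds": 90}))
```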
Every arriving event must be analyzed in the context of the state of the entity that generated it, and the states of all related entities and their contexts. Context changes fast: a bus might be “near” a bus stop but then rapidly move away. There’s no point telling a rider that the bus is approaching a stop when it has already departed, so analysis has to be timely to be useful. The analysis (e.g., “near” – a geo-fencing calculation) must be re-evaluated for every relevant new event within the context of the rider, the bus stop and the bus, and potentially all other vehicles and pedestrians in the vicinity, for every bus, concurrently. Each event could affect not only the intended entity, but any other entity in its vicinity.
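A minimal sketch of such a re-evaluation, using a haversine distance for “near” and invented coordinates and thresholds: the predicate runs on every position event, and only the transitions (approaching, departed) are worth telling the rider about:

```python
import math

def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine)."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

STOP = (37.7749, -122.4194)   # hypothetical bus stop
GEOFENCE_M = 200

def on_position_event(bus_id: str, lat: float, lon: float, was_near: bool) -> bool:
    """Re-evaluated for every position event; only transitions are interesting."""
    near = distance_m(lat, lon, *STOP) < GEOFENCE_M
    if near and not was_near:
        print(f"{bus_id} is approaching the stop")
    elif was_near and not near:
        print(f"{bus_id} has departed")   # don't tell riders it's still coming
    return near

near = False
for lat, lon in [(37.7800, -122.4194), (37.7760, -122.4194), (37.7730, -122.4194)]:
    near = on_position_event("route-7", lat, lon, near)
```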
For a real-time response, the analysis must be executed at least as fast as new events arrive. If the analysis is slower than real time, the results will be useless and the application will quickly fall behind: events may be dropped or delayed, and inaccuracies will mushroom. It’s easy to see how, in a model with millions of entities, each in its own context, the contextual meaning of events can cause cascading delays and performance impacts that cannot be solved with more hardware.
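A back-of-envelope sketch with purely illustrative numbers makes the scaling pressure concrete: arrival rate times per-event cost times contextual fan-out has to stay below the available compute, or the backlog grows without bound:

```python
# Illustrative numbers only: if per-event evaluation cannot keep pace with the
# arrival rate, the backlog grows without bound, and the contextual fan-out
# (related entities re-evaluated per event) multiplies the work.
entities = 1_000_000          # assumed size of the system model
events_per_entity_per_sec = 1
per_event_eval_sec = 50e-6    # assumed 50 microseconds per evaluation
fanout = 4                    # assumed related entities re-evaluated per event

arrival_rate = entities * events_per_entity_per_sec          # events per second
work_per_sec = arrival_rate * per_event_eval_sec * fanout    # CPU-seconds per second
print(f"{arrival_rate:,} events/s need ~{work_per_sec:.0f} cores of evaluation")
```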
There is one last challenge: an event that is analyzed in the immediate context in which it was generated also affects every other context at a lesser degree of detail (as we zoom out). For example, a rider boarding a bus affects the bus and the rider (the number of vacant seats on the bus decreases, and the speed and route of the rider change), but it also affects a KPI at a higher level, increasing the number of riders currently on all buses in the city. So streaming analytics use cases are a subset of continuous intelligence, to the extent that a real-time view of the system is required. More importantly, each event naturally causes a cascade of re-evaluations across the system model.
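A small sketch, with invented bus and route identifiers, of how one boarding event cascades from the most detailed context up to a city-wide KPI:

```python
from collections import defaultdict

# Aggregates at each "zoom level" of the system model.
riders_on_bus = defaultdict(int)
riders_on_route = defaultdict(int)
riders_in_city = 0

def on_board(bus_id: str, route: str) -> None:
    """One boarding event cascades upward through every coarser context."""
    global riders_in_city
    riders_on_bus[bus_id] += 1      # most detailed context
    riders_on_route[route] += 1     # zoomed out one level
    riders_in_city += 1             # city-wide KPI

on_board("bus-17", "route-7")
on_board("bus-21", "route-7")
print(riders_on_bus["bus-17"], riders_on_route["route-7"], riders_in_city)  # 1 2 2
```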
Swim re-imagines the entire software stack to solve problems that require data-driven applications at scale to deliver real-time responses – continuously and in context.
To read parts 1 to 3 of this guest article series, please visit the Swim blog.
About the Author
Simon Crosby is CTO at Swim. Swim offers the first open core, enterprise-grade platform for continuous intelligence at scale, providing businesses with complete situational awareness and operational decision support at every moment. Simon co-founded Bromium (now HP SureClick) in 2010 and currently serves as a strategic advisor. Previously, he was the CTO of the Data Center and Cloud Division at Citrix Systems; founder, CTO, and vice president of strategy and corporate development at XenSource; and a principal engineer at Intel, as well as a faculty member at Cambridge University, where he led the research on network performance and control and multimedia operating systems.
Simon is an equity partner at DCVC, serves on the board of Cambridge in America, and is an investor in and advisor to numerous startups. He is the author of 35 research papers and patents on a number of data center and networking topics, including security, network and server virtualization, and resource optimization and performance. He holds a PhD in computer science from the University of Cambridge, an MSc from the University of Stellenbosch, South Africa, and a BSc (with honors) in computer science and mathematics from the University of Cape Town, South Africa.