In this special guest feature, Rob Malnati, COO of thatDot, explains the debate between stream and batch processing. Rob is a repeat entrepreneur focused on enterprise infrastructure as software. He has defined, evangelized, and managed product and go-to-market strategies for forward-looking service, software, and cloud-based SaaS providers.
Without resorting to your favorite search engine, do you know who Clive Humby is? Don’t worry if you don’t. I didn’t until I started writing this. And if you do, I want you on my bar trivia team.
It was Mr. Humby, an English-born mathematician and entrepreneur, who in 2006 coined the phrase “data is the new oil” while helping launch a shopper loyalty card in partnership with Tesco. He recognized that new data sources combined with technological advances made it possible to create more personalized experiences that would lead to better business outcomes. He understood that data had intrinsic value.
Everyone knows what happened next. Driven by companies hungry to capitalize on this new resource, infrastructure evolved to consume exabytes, zettabytes, and soon yottabytes of data. Data became big, and those who could amass lots of this valuable resource thrived.
But not all data was the same. Not just intelligence but insights lay somewhere submerged within vast lakes of data, waiting to be discovered by data scientists using neural networks and algorithms. Those who could uncover the patterns and apply them to their business flourished.
Through most of this evolution, infrastructure followed a standard blueprint: find ways to store data as cheaply as possible and, when you needed to extract value, process as much of it all at once as budget, time, and technology would allow.
This is batch processing. Typically scheduled as part of a well-defined workflow, batch processing is deeply ingrained in the business practices and infrastructure of today’s enterprises.
Examples of batch processing range from simple bi-weekly payroll runs to jobs that run many times a day, synthesizing numerous sources to create personalized user profiles for ad targeting or eCommerce. One thing all batch jobs have in common is that they produce a fixed or static result that won’t change until the next time the batch job is rerun.
Unfortunately, the next stage in data’s evolution will render static views obsolete. In addition to being valuable, big, and smart, data has become fast: speed-of-decision fast. Click-of-the-mouse fast. Faster than a human can complete a thought.
The Quick and the Dead
Batch processing is, and will remain, enormously useful for many everyday tasks. However, for all its utility, batch processing is at odds with how the world works. Whether you are talking about financial transactions, social media feeds, or clicks on news sites, data is being generated continuously. It streams past. And once it is gone, your ability to act on it in the moment is also gone.
The ability to act on data in real time when the consequences are most significant is driving the move to streaming data processing—for example, stopping an insider trade before it happens or recommending a product that pairs well with something in a shopping cart before someone checks out.
Real-time stream processing architectures differ significantly from batch processing: whereas batch processing demands massive storage plus enough CPU and memory to churn through hours, days, or months of collected data, stream processing emphasizes the network throughput, memory, and CPU needed to compute results the instant data appears.
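To make the contrast concrete, here is a minimal sketch in plain Python. The event data and the running-total logic are invented purely for illustration: the batch function only produces an answer once the full dataset has been collected, while the streaming version keeps a small amount of running state and can act on every event the moment it arrives.

```python
from collections import defaultdict

# Hypothetical events; in practice these would come from logs, transactions, clicks, etc.
events = [
    {"user": "alice", "amount": 40.0},
    {"user": "bob",   "amount": 15.5},
    {"user": "alice", "amount": 9.0},
]

def batch_totals(stored_events):
    """Batch: process the full, already-collected dataset in one pass."""
    totals = defaultdict(float)
    for e in stored_events:
        totals[e["user"]] += e["amount"]
    return dict(totals)  # a static result, stale until the next scheduled run

def stream_totals(event_source):
    """Streaming: keep running state and emit an updated answer per event."""
    totals = defaultdict(float)
    for e in event_source:                    # in production, a live feed
        totals[e["user"]] += e["amount"]
        yield e["user"], totals[e["user"]]    # actionable the instant data appears

print(batch_totals(events))
for user, running_total in stream_totals(iter(events)):
    print(user, running_total)
```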
One of the knocks against stream processing was its inability to find complex patterns. Batch jobs had the luxury of time when traversing joins between data sets and were considered best suited for ETL jobs.
This is no longer the case. While streaming platforms like Kafka and Kinesis allowed users to organize and manage high-volume data flows, the arrival of stream processing engines like Flink and ksqlDB made it possible to combine those flows and extract insights from them in real time.
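As a rough sketch of what that looks like in practice, the snippet below uses the kafka-python client to tally clicks per user as they arrive and flag anything unusual. The topic name, broker address, message shape, and threshold are all assumptions for illustration; an engine like Flink or ksqlDB would express the same rolling aggregation declaratively while managing state, windowing, and fault tolerance for you.

```python
import json
from collections import Counter

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; adjust to your own environment.
consumer = KafkaConsumer(
    "page_clicks",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

clicks_per_user = Counter()

for record in consumer:                  # blocks, yielding events as they arrive
    user = record.value["user_id"]
    clicks_per_user[user] += 1
    if clicks_per_user[user] > 100:      # illustrative threshold
        print(f"possible bot or hot lead: {user}")  # act now, not in tomorrow's batch
```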
In practical terms, companies can’t afford not to embrace real-time stream processing.
Whether it is detecting a complicated password spraying attack before you are compromised, diagnosing a service disruption caused by degraded performance somewhere in a sprawling video streaming infrastructure before viewers (and advertisers) notice, or anticipating the failure of vital factory equipment and avoiding costly production halts, acting in the moment is just good business.
Put another way, can you think of a single example where it was better to wait to make money or to fix a problem? Where it didn’t matter if you got an answer tomorrow?
Batch or Stream?
Batch processing isn’t going anywhere. It is ideal for non-urgent operations that require a relatively fixed set of information. Used well, it is an efficient way to drive down costs.
But in today’s always-connected, instant-answer world, competitive advantage increasingly comes from the speed of action and the quality of an experience. Data pipelines are business pipelines, and when they can harness more data more quickly to make better decisions, the business moves and grows faster.
As businesses demand faster execution, stream processing will continue to grow and eventually dominate. Evolve or perish.