The Rise of Streaming Data and Its Cost Efficiency – How Did We Get Here?

As we settle into a post-pandemic way of working, organizations are restructuring their technology stacks and software strategies for a new, distributed workforce. Real-time data streaming is emerging as a necessary and cost-efficient way for enterprises to scale with agility. Last year, IDC predicted that 90% of the world’s largest companies will use real-time intelligence to improve customer experience and key services by 2025. The firm also expects the stream processing market to grow at a compound annual growth rate (CAGR) of 21.5% from 2022 to 2028.

The rise of data streaming is undeniable, and the technology is becoming ubiquitous. There are two sides to the data streaming coin, each carrying its own cost advantage – architectural and operational:

From an architectural perspective, it streamlines the dataflow. Previously, data was received from applications and stored directly in databases; analyzing it required an Extract, Transform, Load (ETL) process, typically performed once a day in batches. The emergence of data streaming platforms has revolutionized this architecture. Irrespective of the data source, setting up topics on the streaming platform is straightforward, and connecting to those topics enables seamless data flow. Consuming the data is equally simple: you establish a destination for the data and specify the delivery frequency – whether real-time or in batches. Centralized data streaming platforms like Apache Pulsar enable dynamic scaling and consolidate previously siloed data pipelines into unified multi-tenant platform services, reducing the over-provisioning waste that comes with maintaining disparate pipelines.
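To make the flow concrete, here is a minimal sketch using the Apache Pulsar Python client (the pulsar-client package); the broker address, topic name, and payloads are placeholder examples, not a prescribed setup:

    import pulsar

    # Connect to a Pulsar cluster (the service URL is a placeholder).
    client = pulsar.Client('pulsar://localhost:6650')

    # Produce: any application can publish events to a topic.
    producer = client.create_producer('orders')
    producer.send(b'{"order_id": 1042, "status": "shipped"}')

    # Consume: a downstream service subscribes and receives events as they arrive.
    consumer = client.subscribe('orders', subscription_name='analytics')
    msg = consumer.receive()
    print(msg.data())
    consumer.acknowledge(msg)  # mark the message as processed

    client.close()

The same topic can feed an always-on consumer for real-time delivery or a scheduled job that drains it in batches, so the delivery frequency becomes a consumer-side choice.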

The operational advantages are equally persuasive. By centralizing the entire data flow, data streaming eliminates the need to manage multiple technologies for moving data from point A to point B. Data becomes more easily accessible as a result, particularly with employees dispersed across homes and offices. Data streaming has evolved into the connective thread that synchronizes the activities and insights of distributed teams: streaming data outputs as live, accessible data products throughout the organization empowers real-time remote collaboration.

How Did We Get Here?

How did we arrive at this pivotal moment for streaming data? The technology’s roots date back to the early 2000s and the rise of message queues, which culminated in high-throughput platforms like Apache Kafka. These systems let organizations consume real-time data through an architecture that decouples data producers from consumers, creating reliable, scalable data pipelines. However, this architecture was built mainly for data ingestion. Initial streaming use cases were fairly narrow – ingesting high-volume data for interactive real-time dashboards, sending sensor data for predictive maintenance, and powering real-time applications like stock tickers or betting engines. Most enterprises felt that real-time data was necessary only for niche use cases and still relied on batch data extracts for operational analytics. Yet there were significant use cases that demanded real-time data. For instance, self-driving cars require instantaneous data to make split-second decisions. Location analysis, such as maps and navigation, relies on real-time data to signal accidents and delays promptly. And many marketing technologies, like ad bidding and customer sentiment analysis, depend on real-time data within their transactional systems.

As cloud adoption accelerated in the 2010s, streaming data’s architectural value proposition grew clearer. Businesses began consolidating on-premises message queues and proprietary data streaming silos onto cloud-based streaming services, centralizing dozens of pipelines per business unit onto multi-tenant data streaming platforms.
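As a rough illustration of that consolidation, the sketch below uses Pulsar’s persistent://tenant/namespace/topic naming hierarchy to show two business units sharing a single cluster; the tenant and namespace names are hypothetical, and in practice they would be provisioned beforehand (typically with the pulsar-admin tool):

    import pulsar

    client = pulsar.Client('pulsar://localhost:6650')

    # One multi-tenant cluster hosts what used to be separate silos.
    # The tenant and namespace names below are hypothetical examples.
    marketing = client.create_producer('persistent://marketing/clickstream/page-views')
    finance = client.create_producer('persistent://finance/payments/transactions')

    marketing.send(b'{"page": "/pricing", "user": "u-123"}')
    finance.send(b'{"txn_id": "t-987", "amount_usd": 42.5}')

    client.close()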

Modern cloud-native data streaming platforms also made it easier for enterprises to operationalize streaming data products for internal stakeholders and customers. No longer purely a back-end data transport, streaming morphed into an efficient distribution mechanism for provisioning continuously updated data access.
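One way to picture such a data product is sketched below, again with hypothetical topic and subscription names: each stakeholder attaches its own subscription to a shared topic, and because Pulsar tracks a separate cursor per subscription, teams consume the same live stream independently:

    import pulsar

    client = pulsar.Client('pulsar://localhost:6650')

    # One published stream serves many independent consumers; each
    # subscription keeps its own position in the topic.
    topic = 'persistent://sales/orders/events'
    bi_team = client.subscribe(topic, subscription_name='bi-dashboards')
    ml_team = client.subscribe(topic, subscription_name='ml-feature-pipeline')

    msg = bi_team.receive()   # the BI team reads at its own pace...
    bi_team.acknowledge(msg)

    msg = ml_team.receive()   # ...without moving the ML team's cursor
    ml_team.acknowledge(msg)

    client.close()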

COVID-19 Enters the Chat

The final catalyst was the increased need for real-time business agility and operational resilience during the pandemic’s disruptions. Enterprises rushed to cloud data streaming platforms to enable remote analytics, AI/ML, and operational use cases powered by always-current data. During COVID-19, Cyber Monday revenue surpassed that of Black Friday. In peak shopping seasons or during flash sales, e-commerce platforms depend on real-time data to manage inventory efficiently, optimize pricing strategies, and deliver personalized recommendations instantly. In the post-COVID era, there has been a significant shift toward online data consumption, with higher demands than ever for faster data delivery.

Today, data streaming has become the de facto method for provisioning and sharing data in the modern, distributed enterprise. Its cost efficiency comes from streamlining formerly disparate pipelines while fully leveraging the cloud’s elasticity and economies of scale.

As enterprises’ remote infrastructures solidify and event-driven cloud architectures proliferate, streaming will only grow more ubiquitous and mission-critical. Apache Pulsar and similar platforms will play a crucial role in this continued rise. The ascent of streaming was a two-decade journey of technological evolution meeting newfound operational urgency.

About the Author

Sijie Guo is the Founder and CEO of StreamNative. Sijie’s journey with Apache Pulsar began at Yahoo!, where he was part of the team working to develop a global messaging platform for the company. He then went to Twitter, where he led the messaging infrastructure group and co-created DistributedLog and Twitter EventBus. In 2017, he co-founded Streamlio, which was acquired by Splunk, and in 2019 he founded StreamNative. He is one of the original creators of Apache Pulsar and Apache BookKeeper, and remains VP of Apache BookKeeper and a PMC member of Apache Pulsar.
