Data lakehouse architectures promise the combined strengths of data lakes and data warehouses, but one question arises: why do we still find the need to transfer data from these lakehouses to proprietary data warehouses? In this article, we’ll explore how to maximize the efficiency of lakehouses, eliminate data in motion, and streamline data management processes.
The Status Quo for Data Lakehouses
Many businesses have been quick to adopt data lakehouses for their flexibility, scalability, and cost efficiency. Yet despite these advertised benefits, there remains a notable gap in performance: current lakehouse query engines fall short in efficiently handling modern analytical workloads that require low latency and high concurrency.
Consequently, data engineers are forced to move both data and workloads from their data lakehouses to high-performance data warehouses purely to improve query speeds. While this approach addresses query performance, it incurs hidden costs that can outweigh the initial benefits:
Cost Factor #1: The Hidden Cost of Data Ingestion
Copying data to a warehouse may appear simple, yet the reality is quite complex. Ingestion means rewriting the data in the data warehouse’s own file format, a process that consumes substantial computing power. The resulting duplicate copy also has to be stored, which drives up hardware costs and creates storage redundancy.
Beyond the hardware expenses, the labor involved should not be underestimated. Seemingly simple tasks, like ensuring data type or schema consistency across systems, can exhaust significant engineering time and resources. Moreover, the very act of ingesting data often introduces delays, compromising the timeliness and relevance of the data.
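To make the schema-consistency chore concrete, here is a minimal Python sketch of the kind of check engineers end up maintaining when the same table lives in two systems. The column names, the type strings, and the `schema_drift` helper are all hypothetical and exist only for illustration; a real check would read both schemas from the catalogs of the systems involved.

```python
def schema_drift(source: dict[str, str], target: dict[str, str]) -> dict[str, list]:
    """Compare two {column: type} mappings and report where they disagree."""
    shared = set(source) & set(target)
    return {
        # Columns the lakehouse table has but the warehouse copy lacks.
        "missing_in_target": sorted(set(source) - set(target)),
        # Columns that exist only in the warehouse copy.
        "extra_in_target": sorted(set(target) - set(source)),
        # Columns present in both systems but declared with different types.
        "type_mismatch": sorted(
            (col, source[col], target[col])
            for col in shared
            if source[col] != target[col]
        ),
    }


# Hypothetical schemas for the same table in two systems.
lake = {"order_id": "bigint", "amount": "decimal(18,2)", "created_at": "timestamp"}
warehouse = {"order_id": "bigint", "amount": "double", "loaded_at": "timestamp"}

drift = schema_drift(lake, warehouse)
for kind, items in drift.items():
    print(kind, items)
```

Even this toy version has to be rerun after every schema change on either side, which is where the engineering time goes.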
Cost Factor #2: Data Ingestion and Its Governance Pitfalls
Maintaining data integrity and accuracy is crucial for any business, and a data lakehouse architecture supports this by offering a single source of truth for your data. Copying data into another system undermines that guarantee and raises critical questions about data governance: How can we guarantee that all data replicas remain synchronized? What measures can prevent inconsistencies between these copies? Addressing these issues demands extensive technical expertise and, if not managed properly, can jeopardize the reliability of data-driven decision-making.
The Future Without Data In Motion
The costs associated with using a data warehouse for accelerating data lake queries are pushing enterprises to seek alternative solutions. Newer-generation query engines provide a way forward: equipped with deeper optimizations and features specifically designed to streamline data lake queries, they enable data lakehouses to support more demanding workloads. These next-generation features include:
- MPP Architecture with In-Memory Data Shuffling: Traditional data lake query engines are optimized for batch analytics by persisting intermediate query results on disk. MPP query engines are optimized for low-latency workloads by supporting in-memory data shuffling to enable efficient query execution.
- Well-Architected Caching Framework: Efficient data lakehouse queries require a caching framework to avoid bottlenecks in data lake storage as well as reduce network overhead.
- Further System-Level Optimizations: SIMD (single instruction, multiple data) optimizations enhance performance by applying one instruction to multiple data values at once, which is especially useful for complex OLAP queries involving JOINs and high-cardinality aggregations, both common in data lakehouse queries.
- Open Architecture: Open source solutions offer flexibility and adaptability for the data lakehouse architecture, making components like query engines interchangeable, further enhancing agility.
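As an illustration of the caching point in the list above, the following Python sketch shows a read-through LRU cache sitting in front of slow object storage. It is a simplified assumption of how such a layer might behave, not the design of any particular query engine: `ReadThroughCache`, `fake_object_store`, and the file name are hypothetical, and real engines typically cache on local disk and memory with far more elaborate eviction and consistency rules.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """LRU cache keyed by (file path, block index), bounded by block count."""

    def __init__(self, fetch_block: Callable[[str, int], bytes], capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._fetch_block = fetch_block  # slow path: remote object storage
        self._capacity = capacity
        self._blocks = OrderedDict()     # oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, path: str, block: int) -> bytes:
        key = (path, block)
        if key in self._blocks:
            # Cache hit: mark the block as most recently used.
            self._blocks.move_to_end(key)
            self.hits += 1
            return self._blocks[key]
        # Cache miss: go to remote storage, then keep a local copy.
        self.misses += 1
        data = self._fetch_block(path, block)
        self._blocks[key] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used block
        return data


def fake_object_store(path: str, block: int) -> bytes:
    """Hypothetical stand-in for a network read from data lake storage."""
    return f"{path}#{block}".encode()


cache = ReadThroughCache(fake_object_store, capacity=2)
cache.read("orders.parquet", 0)  # miss: fetched remotely
cache.read("orders.parquet", 0)  # hit: served locally
cache.read("orders.parquet", 1)  # miss
cache.read("orders.parquet", 2)  # miss, evicts block 0
cache.read("orders.parquet", 0)  # miss again, block 0 was evicted
print(cache.hits, cache.misses)
```

Every hit is a request that never reaches data lake storage or the network, which is the bottleneck the caching framework is meant to relieve.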
Eliminating data in motion is not just theoretical; it’s a strategy actively being implemented by industry leaders. Trip.com’s reporting platform Artnova recently made the jump, transitioning to the open-source query engine StarRocks. While their original solution could handle a range of queries, the most demanding scenarios still relied on a proprietary data warehouse for query acceleration, which delayed data freshness and increased data pipeline complexity. The switch to a next-generation query engine allowed Artnova to eliminate its data warehouse dependency, streamlining its data pipeline, reducing operational complexity, and improving data freshness.
To Move Forward, Just Stop
Imagine a future where data ingestion is redundant. With all workloads run on the data lakehouse, organizations can benefit from cost savings, enhanced data integrity, and the ability to perform real-time analytics directly on their data lakehouses. The solution to data in motion is clear: just stop. By focusing on optimizing data lakehouse architectures, we can eliminate the need for costly, complex, and inefficient data ingestion processes.
About the Author
Sida Shen is a product marketing manager at CelerData. An engineer with a background in building machine learning and big data infrastructure, he oversees the company’s market research and works closely with engineers and developers across the analytics industry to tackle challenges related to real-time analytics.