Maximizing Data Lake Utility with Query Optimization 


Of all the user personas across the data landscape, the data consumer is arguably the most difficult to appease. Often, such users are unaware of the backend processes required to deliver the data they need to do their jobs better.

They simply want to probe their data at will, get rapid responses to questions, and apply them to better fulfill their business objectives.

Starburst’s recent acquisition of Varada was calculated to do just that, particularly in data lake settings in which organizations have tremendous quantities of data. According to Russell Christopher, Director of Product Strategy at Starburst, the company’s acquisition of Varada benefits data consumers like “the analyst who now can ask more questions, and more complete questions, because all the data on the lake is now accessible in a performant way.”

Varada provides Starburst’s platform with two pivotal benefits. On the one hand, it employs cognitive computing to intelligently index data at scale. On the other, it has caching capabilities that make queries even more responsive, swiftly retrieving answers for informed decision-making, analytics, and applications.

Such query acceleration methods can mean the difference between simply accumulating data and actually using data.

“In my experience, because I’ve always been in analytics, [if] you make users wait too long or make it more cumbersome, they just stop asking questions, and there’s huge risks in that for the organization,” Christopher cautioned.

Intelligent Indexing

Varada equips Starburst’s compute engine with a largely automated form of indexing that removes much of the manual work from this task to expedite query responses. It uses statistical Artificial Intelligence techniques to assess which data should be indexed, then implements those indexes accordingly. A similar approach determines which data to cache.

“Varada has, essentially, a machine learning looping mechanism that is watching all the queries that are executed on the lake,” Christopher explained. “Based on the columns that are being accessed and the tables that are being accessed the most frequently, it actually generates instructions for what should be cached and indexed, and how.”

The underlying system relies on a variety of indexing schemes, including trees. Moreover, it enables organizations to eschew manual approaches to indexing, which are typically time-consuming. “The cost in people time, to bring people in and say what are the important data, how do you use it, and show me how, and then using that information to try and index and cache, you don’t have to do that anymore,” Christopher revealed.
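The idea of deriving index and cache instructions from observed query patterns can be illustrated with a toy sketch. Varada's actual mechanism is proprietary; the function below merely counts how often each table/column pair appears in a hypothetical query log and nominates the hottest pairs, which is the general shape of frequency-driven recommendation Christopher describes.

```python
from collections import Counter

def recommend_candidates(query_log, index_top_n=3, cache_top_n=2):
    """Toy sketch: tally (table, column) accesses across a query log and
    recommend the most frequently accessed pairs for indexing and caching.
    Illustrative only -- not Varada's actual algorithm."""
    access_counts = Counter()
    for query in query_log:
        for table, column in query["columns_accessed"]:
            access_counts[(table, column)] += 1
    # most_common() orders pairs from hottest to coldest
    hottest = [pair for pair, _ in access_counts.most_common()]
    return {
        "index": hottest[:index_top_n],  # broad benefit: index these
        "cache": hottest[:cache_top_n],  # hottest subset: cache these
    }

# Hypothetical log of column accesses observed on the lake
log = [
    {"columns_accessed": [("sales", "region"), ("sales", "amount")]},
    {"columns_accessed": [("sales", "region"), ("customers", "id")]},
    {"columns_accessed": [("sales", "region")]},
]
plan = recommend_candidates(log)
# ("sales", "region") is accessed most, so it tops both lists
```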

Cache Rules

It’s not uncommon for Varada to store both its indexes and caches on SSDs. The latter involves what Christopher characterized as a “proprietary format, columnar format.” Thus, instead of having to constantly rescan a data lake to retrieve information to answer queries for a particular use case, firms can access that data via the cache, reducing the time, resources, and cost of employing data for business insights.
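The economics of that cache-first pattern come from avoiding repeated lake scans. The minimal sketch below uses an in-memory dictionary as a stand-in for the SSD columnar store; the class and function names are invented for illustration and say nothing about Varada's internals.

```python
class LakeCache:
    """Minimal cache-first retrieval sketch: serve repeated reads from a
    local store instead of rescanning the data lake each time."""

    def __init__(self, scan_lake):
        self._scan_lake = scan_lake  # expensive full-scan fallback
        self._store = {}             # stand-in for the SSD columnar store
        self.lake_scans = 0          # counts how often the lake is touched

    def read(self, table, column):
        key = (table, column)
        if key not in self._store:   # cache miss: scan the lake once
            self.lake_scans += 1
            self._store[key] = self._scan_lake(table, column)
        return self._store[key]      # hits never touch the lake

cache = LakeCache(lambda t, c: f"data for {t}.{c}")
cache.read("sales", "region")
cache.read("sales", "region")  # served from cache; only one scan occurred
```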

The performance benefits of automatically indexing the most widely used data and caching it to hasten query times are formidable. Giving data consumers faster query results decreases the compute resources required for such information retrieval. Consequently, this approach can substantially “reduce cloud compute costs,” mentioned Matt Fuller, Starburst co-founder and VP of Product. “For the machines that are running, in our experience, we see about 40 percent in terms of cost savings. In terms of productivity, we’re seeing response times around 7 times [faster].”


Another boon of this query optimization method is that firms can tailor it to meet the specific needs of individual users, departments, and deployments. It’s possible to prioritize queries according to these considerations and others so the C-suite’s queries about monthly reports, for example, are answered before those of other users. Additionally, users can attach what’s essentially metadata to queries to make them more useful to the enterprise.

“You can create groups of queries based on the users that are executing them or even, this again is one of the things that I think is kind of fun, [based on] free text that just happens to be sitting in the query,” Christopher divulged. “Like maybe someone puts a comment in the query saying this is a great query for the marketing team.” This functionality goes beyond simply speeding up queries; it makes them more useful as well, multiplying the value of data-centric processes altogether.
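Grouping queries by free text sitting in the SQL can be sketched in a few lines. The `-- team:` comment convention below is invented for illustration; the article only says queries can be grouped by users or by free text found in the query itself.

```python
import re

def group_queries(queries):
    """Sketch: bucket SQL strings by a free-text comment such as
    '-- team: marketing'. The tag format is a hypothetical convention."""
    groups = {}
    for sql in queries:
        match = re.search(r"--\s*team:\s*(\w+)", sql)
        team = match.group(1) if match else "untagged"
        groups.setdefault(team, []).append(sql)
    return groups

queries = [
    "SELECT region, SUM(amount) FROM sales GROUP BY region -- team: marketing",
    "SELECT id FROM customers -- team: finance",
    "SELECT * FROM inventory",
]
buckets = group_queries(queries)
# queries land in 'marketing', 'finance', and 'untagged' buckets
```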

Human-Curated Automation 

The configurable nature of Starburst’s pairing with Varada also provides a degree of human control over the underlying automation. Machine learning automates the indexing and caching that accelerate queries, while people group, prioritize, and annotate those queries with metadata; in this way, humans oversee the impact of these AI models.

The resulting combination is beneficial for optimizing queries to consistently deliver the best results. “The exciting part is the technology is ‘set it and forget it’, and it does what you need it to do without pulling in all the data and subject matter experts to be the brains behind it,” Christopher concluded.

About the Author

Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance, and analytics.
