From Data Warehouses and Data Lakes to Data Fabrics for Analytics

The evolution of data architecture for analytics began in earnest with relational data warehouses. Although these systems were good at generating insights from historical data, and thus offered some basis for predictive modeling, they are ultimately not agile or responsive enough for the volume and variety of data that enterprises face today.

The data lake was the next step in the progression of analytics architecture, mostly because it readily accommodated the diverse schemas and data types organizations were dealing with at scale. But the way it accommodated this diversity left a lot to be desired. Because they are fundamentally enterprise file systems, data lakes typically turn into ungoverned data swamps that require extensive engineering before organizations can connect and query their data. As a result, a great deal of time is spent wrangling data that, while physically colocated in the lake, remains unconnected with respect to business meaning; productivity suffers and novel insights are missed.

And while data lakehouses combine some of the best properties of both data warehouses and data lakes, it is too early to make a sober judgment on their utility. Because they are in the end indistinguishable from relational systems, they ultimately suffer from the same inability to cope with the enterprise data diversity problem: relational data models just aren't very good at handling data diversity.

Properly implemented data fabrics represent the latest evolution of analytics architecture, greatly reducing the effort data engineers, data scientists, and data modelers spend preparing data compared with the aforementioned approaches, all of which are based on the physical consolidation of data. With an artful combination of semantic data models, knowledge graphs, and data virtualization, a data fabric enables data to remain where it lives natively while providing uniform access to that data, now connected according to its business meaning, for timely query answering across clouds, on-premises environments, business units, and organizations.

This method reduces the complexity of data pipelines, lowers DataOps costs, and dramatically shortens the time to analytic insight.

Knowledge Graphs

Knowledge graphs play a vital role in the enhanced analytics that comprehensive data fabrics can provide. Their graph underpinnings are critical for discerning and representing complex relationships between the most diverse datasets, drastically improving insight. They also readily align structured, semi-structured, and unstructured data within a universal graph construct, giving organizations sane, rationalized access to the mass of data they're contending with.

When querying customer data for appropriate training datasets for machine learning models, for example, knowledge graphs can detect relationships between individual and collective attributes that elude conventional relational approaches. They can also draw intelligent inferences from semantic facts and enterprise knowledge to create additional knowledge about a specific domain, such as dependency relationships between supply chain partners. The combination of these capabilities means firms know more about their data's significance to specific business processes, outcomes, and analytic concerns, like why certain products sell more in the summer in specific regions than others do, which inherently creates more relevant, meaningful results.
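
To make the inference point concrete, here is a minimal sketch using the open source rdflib Python library; the supplier names and the ex:suppliesTo predicate are hypothetical stand-ins rather than anything prescribed by a particular product. A SPARQL 1.1 property path surfaces transitive supply chain dependencies that were never stated as explicit facts:

```python
# Minimal knowledge-graph sketch with rdflib. The suppliers and the
# ex:suppliesTo predicate are invented for illustration.
from rdflib import Graph, Namespace

EX = Namespace("http://example.com/supply#")
g = Graph()
g.bind("ex", EX)

# Direct supply-chain facts, one (subject, predicate, object) triple each.
g.add((EX.MineCo, EX.suppliesTo, EX.SmelterCo))
g.add((EX.SmelterCo, EX.suppliesTo, EX.PartsCo))
g.add((EX.PartsCo, EX.suppliesTo, EX.AssemblerCo))

# The SPARQL 1.1 property path operator (+) walks suppliesTo transitively,
# inferring that AssemblerCo ultimately depends on MineCo even though no
# triple says so directly.
results = g.query("""
    PREFIX ex: <http://example.com/supply#>
    SELECT ?upstream WHERE { ?upstream ex:suppliesTo+ ex:AssemblerCo }
""")
for row in results:
    print(row.upstream)  # MineCo, SmelterCo, PartsCo (as full IRIs)
```

The same question in a relational system typically requires recursive self-joins; in the graph model, transitivity is expressed directly in the query.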

Expressive Semantic Modeling

Semantic data models, which are richly detailed with real-world knowledge about concepts, occurrences, and problems in terms business users understand, are the backbone of semantic knowledge graphs, as is their ability to determine relationships between data and to draw intelligent inferences about it. They also simplify the kinds of schema concerns that monopolize the time of data modelers and engineers when they're preparing data for analytics with other approaches.

These data models naturally expand to accommodate new types of data or business requirements, whereas relational data models, for example, typically require modelers to create new or updated schemas and then physically migrate data accordingly, a process that increases rather than decreases time to insight. This advantage not only addresses data wrangling concerns but also enhances the real-world knowledge depicted in data models, which in turn improves business users' understanding of the data's relevance to analytics use cases.
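
As a hedged illustration of that flexibility, the following sketch again uses rdflib; the Customer and Partner classes and their properties are invented for the example. The point is that existing facts need no migration when the model grows:

```python
# Schema flexibility in a semantic model, sketched with rdflib.
# The CRM classes and properties below are hypothetical.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.com/crm#")
g = Graph()
g.bind("ex", EX)

# Original model: customers with a name.
g.add((EX.Customer, RDF.type, RDFS.Class))
g.add((EX.acme, RDF.type, EX.Customer))
g.add((EX.acme, EX.name, Literal("Acme Corp")))

# New business requirement: some customers are also partners with a tier.
# The new class and facts are simply added; there is no ALTER TABLE step
# and no physical rewrite of the existing data.
g.add((EX.Partner, RDFS.subClassOf, EX.Customer))
g.add((EX.acme, RDF.type, EX.Partner))
g.add((EX.acme, EX.partnerTier, Literal("gold")))

# Queries written against the original model keep working untouched.
for s in g.subjects(RDF.type, EX.Customer):
    print(s)  # http://example.com/crm#acme
```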

Data Virtualization

Finally, virtualization technologies are at the heart of enterprise data fabrics. They provide coherent representations of data, uniformly accessed through an abstraction layer, regardless of where the data is physically located. In this approach the data remains in its source systems yet can still be accessed and queried in a single place, the virtualized framework or data fabric. This greatly reduces the need to replicate data, which is cost prohibitive and laborious when done with the typical ETL jobs and complicated data pipelines that drive up DataOps costs.
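
The following toy facade illustrates the idea without claiming to be how any particular product implements it: two in-memory SQLite databases stand in for a CRM and an order store, and one function answers a cross-source question at query time rather than copying either dataset through an ETL pipeline. All names are illustrative:

```python
# Toy data-virtualization facade. Two in-memory SQLite databases stand in
# for separate source systems; nothing is replicated into a warehouse.
import sqlite3

crm = sqlite3.connect(":memory:")       # "source system" 1: customer master
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme Corp"), (2, "Globex")])

orders = sqlite3.connect(":memory:")    # "source system" 2: order store
orders.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
orders.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 120.0), (1, 80.0), (2, 42.5)])

def order_totals_by_customer():
    """Join the two sources at access time; the data stays where it lives."""
    names = dict(crm.execute("SELECT id, name FROM customers"))
    totals = orders.execute(
        "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id")
    return {names[cid]: total for cid, total in totals}

print(order_totals_by_customer())  # {'Acme Corp': 200.0, 'Globex': 42.5}
```

A production fabric would of course add query planning, pushdown, caching, and governance on top of this pattern, but the principle is the same: the join happens at the computation layer, not by consolidating storage.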

Because the data fabric approach is based on the business meaning of data rather than on its location in the storage layer, it also diminishes the spread of, and reliance upon, data silo culture. It does this in part by connecting rather than consolidating data across complex enterprises and by supporting multi-tenancy at the schema layer, as sketched below. For example, individual business units like marketing and sales teams can access their data through the same enterprise fabric with their analytics tool of choice, without creating additional silos and without moving or copying data through expensive data pipelines. Such functionality promotes data sharing, data reuse, and more comprehensive insight by connecting data throughout the enterprise: across multiple clouds, on-premises, and edge environments.
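
One way to picture schema-layer multi-tenancy, assuming named graphs serve as per-team views (the graph names and metrics are invented for the example), is this short rdflib sketch:

```python
# Multi-tenant views over one shared store, sketched with rdflib named
# graphs. The marketing/sales graph names and metrics are hypothetical.
from rdflib import Dataset, Namespace, Literal

EX = Namespace("http://example.com/org#")
ds = Dataset()

marketing = ds.graph(EX.marketingGraph)  # marketing's view
sales = ds.graph(EX.salesGraph)          # sales' view

marketing.add((EX.campaign1, EX.reached, Literal(50000)))
sales.add((EX.campaign1, EX.closedRevenue, Literal(125000.0)))

# Each team works against its own graph...
print(len(marketing), len(sales))  # 1 1

# ...while an enterprise-wide query spans every named graph, connecting
# the two views of campaign1 without copying data between teams.
q = """PREFIX ex: <http://example.com/org#>
       SELECT ?p ?o WHERE { GRAPH ?g { ex:campaign1 ?p ?o } }"""
for row in ds.query(q):
    print(row.p, row.o)
```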

The Epitome of Analytics Architecture

The end state we all want is one in which business users can marshal data in support of ever-greater insight. An analytics architecture based fundamentally on where data lives in the storage layer is hard-pressed to take the enterprise there, particularly given the rapid growth of hybrid and multicloud environments. Data fabrics are the cornerstone of modern analytics architecture, especially when implemented with data virtualization, knowledge graphs, and expressive data models. This mix of capabilities eliminates much of the time, cost, and work spent piecing together data pipelines and DataOps. It does so by connecting data at the computation layer rather than consolidating it physically at the storage layer, which yields more and better-contextualized meaning and thus greater speed to insight.

About the Author

Kendall Clark is founder and CEO of Stardog, a leading Enterprise Knowledge Graph (EKG) platform provider.
