2020 Trends in Big Data: The Integration Agenda

Aside from the resurgence of Artificial Intelligence in its various forms, the single most meaningful development in the big data space over the past several years is the growing distribution of data assets.

Whereas once those assets were safely confined within the enterprise, the confluence of mobile technologies, the cloud, the Internet of Things, edge computing, containerization, social media, and big data itself has shifted the onus of data management to external, decentralized sources.

The ramifications of this reality are manifold. Organizations can now access the diversity of data required for meaningful machine learning results. The overhead of operating in hybrid and multi-cloud environments has fallen. The very worth of big data has increased with novel opportunities to comprehensively analyze business problems from an array of sources not previously available.

However, these benefits are only realized if organizations can successfully deal with the greatest consequence of the dispersal of data to heterogeneous settings: the heightened emphasis it places on data integrations.

“This post big data architecture has a focus on the integration of data,” Cambridge Semantics CTO Sean Martin observed. “The Hadoop family of technologies was pretty good at aggregating a lot of data in data lakes, but they weren’t really good at integrating that data and often, there was chaos.”

The challenges of integrating big data at scale mean much more than simply automating transformation processes. To reduce the chaos Martin described, organizations must also account for the demands of data discovery, semantic or business understanding of data, metadata management, structured and unstructured data, and transformation.

Surmounting these obstacles enables organizations to swiftly cull, understand, deploy, and reuse data for competitive advantage—at will—from the full range of sources available to the modern enterprise.

Structured, Unstructured, and Semi-Structured Data

Just as data have become more distributed across the computational landscape, the meaning of data integrations has similarly expanded. Cambridge Semantics VP of Marketing John Rueter noted that because of data’s increasing distribution, modern integration efforts involve “not only data, but [different] tools and technologies”. Implicit in this array of integration factors are the assortments of structured, unstructured, and semi-structured data that complicate integration attempts, forcing organizations to ask, as Paxata SVP of Global Marketing Piet Loubser put it, “how do you operate with data where the structures and the content of the data is not that well known?” The answer takes many forms, including:

  • Data Virtualization: According to Denodo CMO Ravi Shankar, virtualization technologies can produce data integrations that are “format agnostic. So the format could be structured data, unstructured data, or semi-structured data like XML and so on.”
  • Data Fabrics: Although definitions of data fabrics differ, the most comprehensive of these platforms consolidate any data management tool, technique, or approach under a semantic layer that reconciles modeling and semantic differences to speed integrations.
  • GraphQL and JSON: Organizations can also connect to various sources across multiple data management platforms. By relying on GraphQL (a query language for APIs) and the schema-on-demand flexibility of JSON, organizations can swiftly assemble a connected architecture with “that flexibility that gives them the integrations and basically what they need,” maintained TopQuadrant CEO Irene Polikoff. (A minimal query sketch follows this list.)
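
As a rough illustration of the GraphQL-and-JSON approach, the Python sketch below posts a single GraphQL query to a hypothetical endpoint and reads the connected result back as JSON. The endpoint URL, schema, and field names are invented for the example and are not drawn from TopQuadrant's or any other vendor's product.

```python
# Minimal sketch: querying a hypothetical GraphQL endpoint that fronts
# several back-end sources and returns the joined result as JSON.
import requests

GRAPHQL_URL = "https://example.com/graphql"  # invented endpoint

# One query traverses customer -> orders -> product, even though those
# entities may live in different underlying systems behind the API.
QUERY = """
query CustomerOrders($id: ID!) {
  customer(id: $id) {
    name
    orders {
      orderDate
      product { sku title }
    }
  }
}
"""

def fetch_customer_orders(customer_id: str) -> dict:
    """POST the GraphQL query and return the decoded JSON payload."""
    response = requests.post(
        GRAPHQL_URL,
        json={"query": QUERY, "variables": {"id": customer_id}},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"]

if __name__ == "__main__":
    print(fetch_customer_orders("42"))
```

Because the JSON response mirrors the shape of the query, callers define the structure they need at request time rather than agreeing on a fixed schema in advance, which is the schema-on-demand flexibility referenced above.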

Integrating unstructured and semi-structured data enables organizations to work with modern data sources pertaining to text, images, and video.

Dynamic Semantics

The heterogeneity of integrations in the post-big data, Artificial Intelligence age also reinforces the need for semantic understanding of data stemming from diverse tools and locations. A lack of clear, well-defined semantics (in addition to the tendency to flout governance protocols) contributed to the failure of most generalized data lake implementations—especially for self-service, business user access. The semantic comprehension of data fueling downstream necessities like data discovery is complicated by the emergent reality that for many users, “the semantics of what they’re looking at is going to be changing based on the context of who I am as this person interacting with the data, and also potentially the question that I might be having,” Loubser revealed.

Organizations can better understand data’s meaning when integrating disparate data sources via smart data technologies including uniform data models, vocabularies, and taxonomies that “blend the semantics—the business meaning of the data with the data—to make it easier to discover and easier to use,” Martin said. The evolution of data’s meaning based on use cases “puts a lot more focus on dynamic semantic construction as I’m accessing data to help me understand and define a semantic context for the data that fits the purposes of my analytics,” Loubser added.
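
A minimal sketch of that idea in Python, using invented source schemas and an invented shared vocabulary (nothing here comes from Cambridge Semantics or Paxata tooling): each source's local field names are mapped onto common business terms so records from different systems can be discovered and compared under one meaning.

```python
# Minimal sketch: mapping heterogeneous source fields to a shared
# business vocabulary so integrated records carry uniform semantics.
# The source schemas and vocabulary terms below are invented examples.

SHARED_VOCABULARY = {"customer_name", "customer_email", "order_total"}

# Per-source mappings from local field names to the shared terms.
FIELD_MAPPINGS = {
    "crm_system": {"FullName": "customer_name", "Email": "customer_email"},
    "billing_db": {"cust_nm": "customer_name", "amt_due": "order_total"},
}

def to_shared_terms(source: str, record: dict) -> dict:
    """Re-key a source record using the shared business vocabulary."""
    mapping = FIELD_MAPPINGS[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

if __name__ == "__main__":
    crm_row = {"FullName": "Ada Lovelace", "Email": "ada@example.com"}
    billing_row = {"cust_nm": "Ada Lovelace", "amt_due": 120.50}
    # Both rows now answer to the same terms, whatever system they came from.
    print(to_shared_terms("crm_system", crm_row))
    print(to_shared_terms("billing_db", billing_row))
```

In practice this mapping is the job of a model, vocabulary, or taxonomy maintained alongside the data, and, per Loubser, it may need to be assembled dynamically per user and per question rather than fixed up front.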

Data Discovery

Once sets of big data are integrated—regardless of structure—and understood by users, the data discovery process is vital to loading data for analytics or application use. There are innate data discovery benefits to understanding what data mean prior to analytics; synthesizing semantic understanding with the integration process provides an ideal layer for determining relationships among disparate data to maximize their deployment. Holistic data discovery across the enterprise is an indicator of successful integration and a point of departure from simply collocating data, in which older methods “got all the data in one place and made it available—if you could find it,” Martin said.

Dedicated data discovery solutions frequently invoke machine learning to determine relationships in data and their relevance for particular use cases. Advancements in this domain include the use of enterprise search capabilities involving machine learning and Natural Language Processing to augment discovery functionality. This capacity “lets the end users search the data assets that have been discovered by these smart discovery techniques in a pretty easy manner,” commented Io-Tahoe CTO Rohit Mahajan. Intelligent data discovery is pivotal for finding datasets on which to train cognitive computing models.
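
As a rough stand-in for the kind of search-driven discovery Mahajan describes, the following Python sketch ranks a handful of invented catalog entries against a natural-language query using TF-IDF and cosine similarity. It assumes scikit-learn is installed; production tools index far richer metadata and apply more sophisticated NLP.

```python
# Minimal sketch: keyword search over discovered data assets using
# TF-IDF and cosine similarity (requires scikit-learn). The catalog
# entries are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

CATALOG = {
    "sales_orders_2019": "daily retail sales orders with store and product ids",
    "support_tickets":   "customer support tickets with free-text descriptions",
    "sensor_readings":   "iot sensor readings from factory edge devices",
}

def search_catalog(query, top_n=2):
    """Rank catalog entries by similarity to a natural-language query."""
    names = list(CATALOG)
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(CATALOG.values())
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_matrix).ravel()
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

if __name__ == "__main__":
    print(search_catalog("customer complaints text"))
```

The example is deliberately simplified; the point is that discovery becomes a search problem over indexed metadata rather than a manual hunt through repositories.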

Transformation

Transformation is an integral aspect of every data integration; it rectifies the disparities in data schemas and formatting that are amplified in distributed computing settings. According to Shankar, virtualization-based integrations are useful in this respect because “across these three: location, format, and latency, we provide that uniformity of standardization through transformation by making [data available] in a format that you can actually pick it up in whatever way you want.”

Although many are still in place, conventional Extract, Transform, and Load (ETL) methods are considered less efficient than Extract, Load, and Transform (ELT) methods that utilize the underlying repositories—typically a cloud store—for transformation. This difference saves time and costs otherwise allocated to dedicated data staging tools. Still, self-service data preparation instruments that automate code and leverage intelligent algorithms for transformation “let the business go in and transform the data for their purposes, and that sort of contextual semantic description we talked about earlier is now instantiated by that process,” Loubser mentioned.
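
To make the ETL-versus-ELT distinction concrete, here is a small Python sketch in which the standard-library sqlite3 module stands in for the underlying repository: raw records are loaded untouched, and the transformation then runs as SQL inside the target store rather than in a separate staging tool. Table and column names are illustrative only.

```python
# Minimal ELT sketch: load raw rows as-is, then transform inside the
# target store with SQL. sqlite3 stands in for a cloud data store.
import sqlite3

raw_rows = [
    ("2019-12-01", " 1,250.00 ", "US"),
    ("2019-12-02", "   980.50 ", "us"),
]

conn = sqlite3.connect(":memory:")

# Load step: land the data exactly as it arrived.
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, amount TEXT, country TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Transform step: clean types and casing using the store's own engine.
conn.execute("""
    CREATE TABLE sales AS
    SELECT sale_date,
           CAST(REPLACE(TRIM(amount), ',', '') AS REAL) AS amount,
           UPPER(country) AS country
    FROM raw_sales
""")

for row in conn.execute("SELECT * FROM sales"):
    print(row)
```

Pushing the transform into the store is what eliminates the separate staging layer; a cloud warehouse plays the same role sqlite3 plays here.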

Metadata Management

Metadata is a major enabler of timely, optimized integrations. Owing partly to metadata’s data lineage capabilities, integration tools “all have some kind of a metadata layer where what happens is, they would get metadata from source A and then metadata from source B and then they use that to transform information from source A to source B,” Polikoff explained. Metadata’s utility in this regard is part of a wider trend in which its historic provenance capabilities are morphing into present, active, and forward-looking ones. Deriving timely action from metadata is central to unifying distributed data settings. “There’s a fair degree of metadata, sometimes known as active metadata, that helps you assemble and automate an enterprise data fabric,” Martin reflected.

Metadata is also instrumental in transporting resources between hybrid and multi-cloud environments. According to Franz CEO Jans Aasman, it’s particularly helpful with “multi-cloud environments, partly in Google, partly in Amazon, partly in Azure. You can have a library of virtual machines and a library of applications that you need to run and a library of databases that contain data. If you turn all of that into a metadata graph about your digital assets…you could apply [this] asset management to a company’s cloud strategy.”
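
A minimal sketch of the metadata-graph idea Aasman outlines, assuming the open-source rdflib library and an invented namespace: a few digital assets are described as triples, recording what they are and which cloud hosts them, and a SPARQL query then answers an asset-management question across clouds. The asset names and properties are hypothetical, not Franz's schema.

```python
# Minimal sketch: a metadata graph of digital assets across clouds,
# built with rdflib and queried with SPARQL. Names are illustrative.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.com/assets#")  # invented namespace

g = Graph()
g.bind("ex", EX)

# Describe a few assets: their type, which cloud hosts them, what they hold.
g.add((EX.orders_db, RDF.type, EX.Database))
g.add((EX.orders_db, EX.hostedOn, EX.AWS))
g.add((EX.ml_vm, RDF.type, EX.VirtualMachine))
g.add((EX.ml_vm, EX.hostedOn, EX.GoogleCloud))
g.add((EX.billing_db, RDF.type, EX.Database))
g.add((EX.billing_db, EX.hostedOn, EX.Azure))
g.add((EX.billing_db, EX.containsDomain, Literal("finance")))

# Ask the graph: which databases run on Azure?
results = g.query("""
    PREFIX ex: <http://example.com/assets#>
    SELECT ?asset WHERE {
        ?asset a ex:Database ;
               ex:hostedOn ex:Azure .
    }
""")

for row in results:
    print(row.asset)
```

Because the graph describes assets rather than containing them, the same structure can point at virtual machines, applications, and databases wherever they run, which is what makes it usable for the cloud-strategy questions Aasman raises.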

Larger Implications

Metadata’s role in ameliorating the difficulties of the distributed data landscape is twofold. On the one hand, it delivers an accurate roadmap for optimizing integrations and transformations while reinforcing pertinent, expedient data discovery. On the other, it’s the means of controlling distributed data assets across clouds, virtual machines, and even on-premises environments. Therefore, it not only typifies the redoubled integration needs of the sprawling big data ecosystem, but provides the foundation for navigating those distributed settings to position and shift data assets at will for optimal computational and pricing opportunities.

Managing metadata—and acting on it—is the crux of redressing big data integration necessities stemming from the distribution of data assets. However, it’s equally indispensable for availing the enterprise of opportunities related to the IoT, edge computing, blockchain, and AI in the coming decade. Organizations can transition from coping with data’s distribution to capitalizing on it in the near future because, as Martin anticipated with active metadata, “it’s not too hard to see how you can move the data processing, or where the processing’s performed, around happily and easily.”

About the Author

Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.
