The big data ecosystem is constantly expanding, gravitating ever further from the four walls of the traditional centralized enterprise with a burgeoning array of external sources, services, and systems.
Capitalizing on this phenomenon requires horizontal visibility into what data mean for individual use cases, whether building predictive models, meeting regulatory requirements, or assembling comprehensive customer views, across a wide assortment of platforms, tools, and techniques.
The capacity to unlock the collective worth of such decentralized resources lies in standardizing these data as though they were all in the same place, paradoxical as that may sound. Accounting for the inevitable differences in schema, terminology, and data representation requires uniform data modeling across settings and sources alike.
Failing to do so perpetuates data silos, invites regulatory penalties, and squanders data-centric investments.
Consequently, centralization efforts involving “data mesh and data fabric are something we’re seeing quite a lot of out there,” reflected Alation Director of Data and Analytics Julie Smith. “Because of data mesh, data fabric, and that whole method of work that they involve, it’s going to make your data modeling practices become incredibly pragmatic and have to evolve.”
A range of methods, including data fabrics, revamped cloud-native Master Data Management capabilities, and governance frameworks that apply cognitive computing to point at sources and catalog their data in detail, are viable means of implementing data models across the heterogeneity of the modern enterprise’s data.
Succeeding means more than fulfilling the foregoing business imperatives; it ultimately propels organizations ever closer to systemic interoperability, meeting every use case with the most appropriate data regardless of location or source.
Schema Differentiation
The path towards data interoperability for business cases spanning the breadth of organizations’ data assets typically involves some form of centralization such as data fabrics or data meshes. “Data meshes and data fabrics are very similar,” Smith specified. “Both are approaches where you’re getting the data moving through different places rather than trying to have it in one place and bring it together.” Nonetheless, resolving schema differences remains a longstanding hardship of integrating or aggregating varied data sources for any central application. However, data fabrics implemented with data virtualization, query federation, and what Stardog CEO Kendall Clark termed a “graph query model” remove this impediment in several ways and deliver the following benefits (a brief sketch of querying a shared graph through department-specific vocabularies follows the list):
- Schema Multi-Tenancy: Because data fabrics enable organizations to leave data in place but access them as though they’re collocated, firms can dynamically select their schema at query time. Respective departments can utilize different schema and terminology for individual queries instead of “deciding one version of the truth and one schema to structure this data,” Clark revealed, an exercise that is highly time-consuming and resource-intensive.
- Accurate Representations: Multi-tenant schema produces better data models with more realistic, detailed depictions of business concepts and their context. It creates “more flexibility and agility throughout organizations,” Clark stipulated. “You can more accurately represent the complexity of the world without internal battles between [business units] about schema.”
- Schema Unification: Moreover, since the underlying Resource Description Framework (RDF) knowledge graph data model naturally evolves to include new business requirements or sources, firms can create cross-departmental schema or holistic enterprise ones for use cases demanding such interoperability. In this and other instances “data models will evolve,” Smith acknowledged. “That’s why you need cataloging to tell you what’s where, what overlap there might be, and what usage is happening.”
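To make the graph query model more concrete, the following is a minimal, purely illustrative sketch of schema multi-tenancy on an RDF knowledge graph. It uses the open-source rdflib Python library on a small in-memory graph rather than a federated data fabric or Stardog itself, and every namespace, URI, and field is hypothetical: the same customer resource is described in both a sales vocabulary and a marketing vocabulary, and each department queries it through its own terms.

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF

g = Graph()
SALES = Namespace("http://example.com/sales#")      # hypothetical sales vocabulary
MKTG = Namespace("http://example.com/marketing#")   # hypothetical marketing vocabulary

# One customer resource, described with two departmental schemas at once
cust = URIRef("http://example.com/customer/42")
g.add((cust, RDF.type, SALES.Account))
g.add((cust, SALES.accountName, Literal("Acme Corp")))
g.add((cust, RDF.type, MKTG.Prospect))
g.add((cust, MKTG.segment, Literal("Enterprise")))

# Sales queries the graph with its own terminology...
sales_query = """
SELECT ?name WHERE {
    ?c a <http://example.com/sales#Account> ;
       <http://example.com/sales#accountName> ?name .
}
"""
for row in g.query(sales_query):
    print("Sales view:", row.name)

# ...while marketing queries the very same data through its own schema
marketing_query = """
SELECT ?segment WHERE {
    ?c a <http://example.com/marketing#Prospect> ;
       <http://example.com/marketing#segment> ?segment .
}
"""
for row in g.query(marketing_query):
    print("Marketing view:", row.segment)
```

Neither department had to agree on a single canonical schema before asking its question, which is the flexibility Clark describes.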
Data Cataloging
The data model evolution implicit to multi-tenant schema and its innumerable combinations is considerably aided by data cataloging, which in turn informs the data discovery process for devising interoperable data models with the most meaningful information. Today’s catalogs rely on machine learning to point at sources and illustrate what Smith characterized as “the current reality: these are the fields, entities, relationships, and this is how they’re being used.” This basic understanding is critical for initiating schemas, revising them, and understanding the conditions required to combine them for individual use cases. According to Profisee VP of Product Marketing Martin Boyd, data modeling best practices involve “looking at all the different places schema for a specific domain exists, then pulling it to create the schema.”
Data cataloging enhances this step in several ways, foremost among them its ability to centralize information about data in distributed sources. In addition to valuable metadata, statistical information stemming from data profiling, and subject matter expert input, catalogs also house the collective “knowledge any number of users gained around a system or dataset,” Smith mentioned. They also provide lineage and other annotations about how datasets, and specific schemas, were used. Collectively, this documentation allows users “to look at data models and where we’re going to take something,” Smith observed. “So, information from this data catalog can feed into the evolution of that data model.”
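As a rough illustration of the field-level facts a catalog harvests automatically, the sketch below profiles a hypothetical CRM extract with pandas, recording each field’s inferred type, null rate, distinct count, and sample values. Production catalogs point at live sources and layer on ML-driven classification and usage metadata; this shows only the basic idea, and the source and field names are invented.

```python
import pandas as pd

def profile_source(df: pd.DataFrame, source_name: str) -> list[dict]:
    """Collect simple field-level facts a data catalog might record."""
    entries = []
    for col in df.columns:
        entries.append({
            "source": source_name,
            "field": col,
            "inferred_type": str(df[col].dtype),
            "null_rate": round(float(df[col].isna().mean()), 3),
            "distinct_values": int(df[col].nunique()),
            "sample_values": df[col].dropna().head(3).tolist(),
        })
    return entries

# Hypothetical CRM extract; a real catalog would point at the live source instead
crm = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "full_name": ["Acme Corp", "Globex", None],
    "region": ["EMEA", "AMER", "AMER"],
})

for entry in profile_source(crm, "crm_accounts"):
    print(entry)
```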
Entity Modeling
Entity modeling and creating master data models for individual domains advance the interoperability benefits of reusable schema, comprehensive insight across sources for analytics, and increased adaptability. Multidomain Master Data Management plays an invaluable role in modeling entities by using fuzzy logic, cognitive computing, and other approaches to automate matching records of entities and merging them as needed. Such automation is beneficial for completing these data modeling facets at scale because this aspect of data management “is a process,” Boyd noted. “Once you’ve established that process and the rules, the system keeps on enforcing them.”
As previously mentioned, standardizing data’s representation in sources is a precursor to coalescing them for horizontal use cases, especially when the results of those entity models are pushed back to sources. Once users “standardize all of that information from a format perspective, that makes it more interoperable,” Boyd maintained. “So now different systems holding data in different formats can speak to each other, contribute to the master data model, and share that information back to them.” It subsequently becomes much easier to use such an array of distributed sources for data science attempts to build predictive models or to create applications across departments, sources, and domains for things like customer loyalty programs or security analytics.
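A bare-bones sketch of the matching and merging that multidomain MDM automates appears below. It relies on Python’s standard-library difflib for a crude fuzzy score in place of the richer matching rules, fuzzy logic, and survivorship policies a real MDM platform applies; the record layouts, threshold, and merge logic are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude fuzzy score; MDM platforms use far richer, configurable match rules."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical records for the same customer held in two differently shaped sources
crm_records = [{"id": "C-1", "name": "Acme Corporation", "phone": "555-0100"}]
erp_records = [{"id": "E-7", "name": "ACME Corp.", "address": "1 Main St"}]

MATCH_THRESHOLD = 0.6  # tuned per domain in practice

golden_records = []
for crm in crm_records:
    for erp in erp_records:
        if similarity(crm["name"], erp["name"]) >= MATCH_THRESHOLD:
            # Survivorship: merge attributes from both sources into one master record
            golden_records.append({
                "master_id": f"{crm['id']}|{erp['id']}",
                "name": crm["name"],            # keep the fuller name
                "phone": crm.get("phone"),
                "address": erp.get("address"),
            })

print(golden_records)
```

The resulting master record can then be shared back to the contributing systems, the round trip Boyd describes.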
Data Quality
The data quality rules for the sources informing different aspects of data modeling (such as entity modeling, logical modeling, and conceptual modeling) are critical for providing the standardization at the base of any interoperability effort. Once organizations discern which sources have attributes or data impacting these modeling dimensions, they must homogenize how those attributes appear in the sources so that “each field has data quality rules that mandate how they should [appear],” Boyd noted.
Oftentimes, formulating those rules for standardizing data is a cooperative process involving subject matter experts, data governance personnel, and other stakeholders. The result is that data become standardized across sources according to “data quality rules, consistency rules, there’s referential integrity, and all the things you expect in a normal database design,” Boyd explained. The ultimate benefit of building data quality into the fundamentals of data modeling is quality assurance and “how much you can trust that information and the different sources it comes from,” Smith added.
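The kind of field-level rules Boyd describes can be expressed declaratively, as in the sketch below: each field gets a rule mandating how its values should appear, and records are checked against those rules before they feed a model. The fields, formats, and reference values here are hypothetical stand-ins for what subject matter experts and governance personnel would actually agree on.

```python
import re

# Hypothetical field-level quality rules of the sort stakeholders agree on together
RULES = {
    "customer_id": lambda v: v is not None and str(v).strip() != "",  # completeness
    "phone": lambda v: v is None
        or re.fullmatch(r"\+\d{1,3}-\d{3}-\d{4}", v) is not None,     # format standardization
    "country": lambda v: v in {"US", "GB", "DE", "FR"},               # stand-in for referential integrity
}

def validate(record: dict) -> list[str]:
    """Return the names of fields that violate their quality rule."""
    return [field for field, rule in RULES.items() if not rule(record.get(field))]

record = {"customer_id": "C-42", "phone": "+1-555-0100", "country": "USA"}
print(validate(record))  # -> ['country']: value not in the agreed reference list
```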
Interoperability Possibilities
The possibilities for making data more interoperable by combining data models or crafting unified models across use cases, departments, and domains are significant for a host of reasons. First, doing so enables the enterprise to incorporate more of these lucrative data assets into everyday business applications, increasing the ROI on substantial data management expenditures by using all the resources, or the best of them, for individual deployments. It’s also a reliable way to tame the escalating disorder that otherwise accompanies greater decentralization of where data are accessed, stored, and required, which is why the data fabric concept has persisted.
This methodology “leads to interoperability and interoperability at the data layer,” Clark commented. “At the data layer there’s one big pool, fabric, or graph of data. That doesn’t just mean you dump the data in one location, but that it’s connected so applications become simpler and easier to build, you can reuse the business logic, and you can reuse these connections that become like different views of a comprehensive fabric of data.”
Modeling Tomorrow Today
Regardless of which approach to centralization is deployed, organizations must adopt some method for countering the silos that otherwise occur with the dispersal of data across different clouds, on-premises settings, and geographic regions. Preparing for systemic interoperability at the granular level of data modeling, by overcoming schema differentiation with effective data cataloging, entity modeling, and data quality mechanisms, positions organizations well for whatever challenges lie ahead.
“There’s two ways to do future-proofing: the shrewd and the dumb way,” Clark propounded. “The first is to future-proof by doing data modeling that describes parts of your larger business that’s enduring over time. Schema reuse across different parts of the business gets you future-proof benefits with high quality assurance and ROI. The dumb way is for smart people to do data modeling for data modeling’s sake, which leads to nowhere.”
About the Author
Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.