2020 Trends in Data Modeling: Unparalleled Advancement

The data modeling space is advancing rapidly into the coming decade, perhaps more so than any other facet of data management. The discipline is evolving so swiftly and dramatically that, in a couple of years, it will likely bear little resemblance to the data modeling chores of the present decade.

Whereas this domain was once largely a matter of modeling static, structured data sets of mostly internal data, today’s data modeling demands are shaped by a number of vectors that drastically complicate the process. Various dimensions of the Internet of Things, cognitive computing, predictive modeling, three-dimensional models, cloud computing, and expanding varieties of schema have fundamentally altered modeling tasks.

Consequently, data modeling for 2020 and beyond will increasingly become characterized by data shapes, digital twins, ensemble modeling, ongoing model management, and model validation measures to satisfy what is quickly becoming the most demanding task in the data sphere.

Data Shapes

Traditional approaches in which data modelers leveraged relational techniques to conform the schema of all data for a particular repository are quickly becoming outmoded. Developments in big data, the IoT, cloud data warehousing, ELT, machine learning training data, and schema-on-read formats such as Avro, Parquet, JSON, and JSON-LD are besieging organizations with more schema variations than ever before. Organizations can quickly address these data modeling differences via:

  • Data Shapes: With data shapes, schema is expressed as a representation of how the data in a particular source, and in the sources with which it’s aligned, actually appears. Schema thus expands to describe the data rather than forcing the data to adhere to pre-defined schema.
  • Ontologies: Ontologies are naturally evolving data models based on knowledge graph technologies that model data according to universal standards regardless of differences in origination, type of structure, or schema. Their capacity to expand schema to include additional data sources or types is critical for keeping organizations from manually “changing schema constantly,” observed Smartest Systems Principal Consultant Julian Loren.
  • Shapes Constraint Language (SHACL): SHACL is a W3C standard that assists with data modeling by describing the various shapes of data in knowledge graph settings; those shape descriptions enable organizations to automate “the validation of your data,” remarked Franz CEO Jans Aasman. SHACL operates at a granular level involving classifications and specific data properties (a minimal validation sketch follows this list).
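
To make the data-shape and SHACL ideas concrete, the sketch below defines a minimal SHACL shape and validates a small data graph against it in Python. It assumes the third-party pySHACL package and its validate() helper (which accepts Turtle text); the ex: namespace, Customer class, and hasEmail property are invented for illustration, so treat this as a sketch of shape-based validation rather than a reference implementation.

```python
# A sketch of shape-based schema validation with SHACL, assuming the
# third-party pySHACL package (pip install pyshacl). The ex: namespace,
# Customer class, and hasEmail property are hypothetical.
from pyshacl import validate

# Shapes graph: every ex:Customer must carry exactly one string ex:hasEmail.
shapes_ttl = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

ex:CustomerShape
    a sh:NodeShape ;
    sh:targetClass ex:Customer ;
    sh:property [
        sh:path ex:hasEmail ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
"""

# Data graph: one conforming customer and one that violates the shape.
data_ttl = """
@prefix ex: <http://example.org/> .

ex:alice a ex:Customer ; ex:hasEmail "alice@example.org" .
ex:bob   a ex:Customer .
"""

conforms, _report_graph, report_text = validate(
    data_ttl,
    shacl_graph=shapes_ttl,
    data_graph_format="turtle",
    shacl_graph_format="turtle",
)
print("Conforms:", conforms)  # False: ex:bob has no ex:hasEmail
print(report_text)            # human-readable validation report
```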

Digital Twins

As digital twins move beyond their initial deployments in the Industrial Internet and smart manufacturing, their array of use cases is gradually expanding to include healthcare, property management, and the hospitality industry. Digital twins are three-dimensional models of IoT data stemming from physical devices. The diversity of data involved with digital twins is essential to the accuracy with which they replicate the physical world; in some instances, they can include sources outside the IoT. Digital twin utility is largely based on the celerity with which their data are modeled. The three main types of digital twins include:

  • Asset Digital Twins: As the name suggests, this type of digital twin focuses on data emitted from a specific equipment asset, such as oil processing equipment in the Industrial Internet. According to Loren, however, “even asset [digital twins] will start to creep into operational factors.” (A minimal asset-twin sketch follows this list.)
  • Operational Digital Twins: Operational digital twins extend the asset digital twin concept to include numerous assets and other apposite factors in production environments. At even broader scale, organizations can “digitize an entire supply chain network,” reflected Joe Bellini, One Network COO. “That physical and logical representation of the supply network now allows you to make better decisions and serve the end consumer demand.”
  • Future Digital Twins: Whereas the previous two digital twins provide diagnostic information about current physical conditions, future digital twins rely on predictive modeling for accurate forecasts weeks in advance. Their real-time streaming data is particularly well suited to cognitive statistical models such as neural networks. “It’s a predictive future digital twin of operations that lines up with your planning horizons,” Loren said.
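
To make the asset-twin concept concrete, the hypothetical sketch below maintains a digital state for a single piece of equipment as streaming sensor readings arrive and flags readings that fall outside an expected operating band. Every name here (AssetTwin, SensorReading, pump-17) is invented for illustration; production digital twin platforms model far richer geometry, physics, and operational context.

```python
# A minimal, hypothetical sketch of an asset digital twin: a digital state
# that mirrors a physical device as streaming sensor readings arrive.
# All class, field, and asset names are invented for illustration.
from dataclasses import dataclass, field
from datetime import datetime
from statistics import mean


@dataclass
class SensorReading:
    sensor_id: str
    value: float
    timestamp: datetime


@dataclass
class AssetTwin:
    asset_id: str
    expected_range: tuple[float, float]           # normal operating band
    history: list[SensorReading] = field(default_factory=list)

    def ingest(self, reading: SensorReading) -> None:
        """Update the twin's state with a new reading from the physical asset."""
        self.history.append(reading)

    def current_state(self) -> float:
        """Average of the most recent readings as a crude stand-in for 'state'."""
        recent = self.history[-10:]
        return mean(r.value for r in recent) if recent else float("nan")

    def anomalies(self) -> list[SensorReading]:
        """Readings outside the expected operating band (a simple diagnostic)."""
        low, high = self.expected_range
        return [r for r in self.history if not low <= r.value <= high]


# Usage: a twin of a hypothetical pump, fed three temperature readings.
pump = AssetTwin(asset_id="pump-17", expected_range=(20.0, 80.0))
for value in (45.2, 47.9, 93.4):
    pump.ingest(SensorReading("temp-1", value, datetime.now()))
print(pump.current_state())   # mean of recent readings
print(pump.anomalies())       # the 93.4 reading is flagged
```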

Ensemble Modeling

As machine learning continues to revolutionize professional and consumer IT applications, ensemble modeling is becoming more viable for the enterprise. With this statistical Artificial Intelligence technique, the predictions of multiple individual models are combined so that the ensemble exceeds the accuracy of any single model. Ensembles may combine many models of the same type (each focused on different features or data samples) or a mix of model types. Common ensemble modeling techniques, several of which are illustrated in the sketch after this list, include:

  • Ensemble management: According to Loren, ensemble management involves deciding that you’re “going to give this type of math a little more weight in this kind of context, or this type of algorithm in this kind of context.” Notably, even ensemble modeling bereft of ensemble management usually produces greater predictive accuracy than not ensembling at all.
  • Voting: With this technique, users take a democratic approach to the prediction of machine learning models. When ensembling four models to issue predictions for whether or not to approve a loan applicant, “If this is saying yes, yes, yes, no, then I will say yes,” explained Ilknur Kabul, SAS Senior Manager of AI and Machine Learning Research and Development.
  • Averaging: Averaging is somewhat similar to voting, although the individual predictions of models are averaged for the ensemble model’s output.
  • Stacking: With this method, “you create an ensemble, and you get outputs which are inputs to other models, and you stack those,” Kabul said. “It’s like deep learning, but the stacking of models.”
  • Bagging: This technique builds models in parallel from different random samples of the same dataset (and often different subsets of features). Random forest is the most popular example of bagging.
  • Gradient Boosting: Gradient boosting builds models sequentially, with each new model trained to correct the errors of those before it. The iterative learning process is what matters; models can initially perform poorly and the ensemble still delivers value.
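
The sketch below illustrates several of these techniques on a synthetic, loan-style classification task using scikit-learn, which ships off-the-shelf voting, bagging (random forest), gradient boosting, and stacking estimators. The dataset and every parameter choice are placeholders; this is a sketch of the APIs involved, not a tuned model.

```python
# A sketch of common ensembling techniques using scikit-learn.
# The synthetic dataset and all parameters are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a loan-approval style dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: random forest trains many trees in parallel on random samples/features.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: each new tree corrects the errors of the ones before it.
boosted = GradientBoostingClassifier(random_state=0)

# Voting: heterogeneous models "vote" on the final prediction.
voter = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("forest", forest),
        ("boosted", boosted),
    ],
    voting="soft",  # "hard" = majority vote; "soft" averages predicted probabilities
)

# Stacking: base-model outputs become inputs to a final meta-model.
stacker = StackingClassifier(
    estimators=[("forest", forest), ("boosted", boosted)],
    final_estimator=LogisticRegression(),
)

for name, model in [("voting", voter), ("stacking", stacker)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```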

Although ensemble modeling enhances predictive accuracy, it can complicate model management by reducing explainability.

Model Management

With surging regulatory pressure across verticals, organizations must manage models across all phases of the analytics lifecycle. This ongoing process runs from “collecting the data, preparing the data, visualizing the data, exploring, modeling, identifying what model to use, and then deploying it,” revealed SAS CTO and COO Oliver Schabenberger. Once models are put into production, model management concerns revolve around ascertaining “how do I know I need to change the model?” Schabenberger reflected. “How do I know the data hasn’t shifted, or in two months that model needs to be refreshed?” Once data preparation and feature engineering are complete, some of the more salient facets of model management involve:

  • Model Discovery: Model discovery involves determining which model is most appropriate for a specific use case. Oftentimes, machine learning assists with this process. “People have different definitions; I usually say deep learning is for model discovery for discovering new models,” Loren commented.
  • Model Assessment: Once models have been built and tested, it’s necessary to examine them prior to operationalizing them to ensure there’s sufficient interpretability and explainability, particularly in heavily regulated verticals. “We have to assess those models and interpret them,” Kabul said. The assessment process is iterative and may involve returning to the data preparation or feature engineering phase to refine models (one common interpretability check is sketched after this list).
  • Risk Management: Risk management is integral to ensuring models function in accordance with governance standards for minimizing risk once they’re in production. Regulatory compliance issues mean organizations must ensure models are fair and, ideally, transparent. “Much of the innovation and the new concepts and models that are happening in the industry, we’ve got to bring our regulators along,” remarked Amanda Norton, Wells Fargo Senior EVP and CRO.
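
One widely used way to probe a trained model’s behavior during assessment is permutation importance, which measures how much performance drops when each input feature is shuffled. The sketch below uses scikit-learn’s inspection module on a placeholder dataset; it illustrates the idea only and is not a complete assessment workflow.

```python
# A sketch of one interpretability check used during model assessment:
# permutation importance via scikit-learn. Dataset and model are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and record how much accuracy drops;
# large drops indicate features the model leans on heavily.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {idx}: importance {result.importances_mean[idx]:.3f}")
```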

Model Validation

Model validation operates at the intersection of regulatory compliance, risk management, and model assessment. It’s pivotal for deploying models and rectifying discrepancies between training data, which is often synthetic, and production data. Nexsan CTO Surya Varanasi described a fintech use case in which a company was utilizing machine learning to automate trading and hedge fund opportunities. In this example, continual model validation means “making models accurate by saying, here’s all the historical trends we saw, here’s all the inputs we saw, and here’s last day’s trading,” Varanasi noted.

In most cases, model validation requires scrutinizing model results to ensure performance is as it was during training. The fintech use case involved repeatedly refreshing the data models are trained on to “munge on it, get your models accurate, and then get them into the next trading day,” Varanasi said. Model validation is also instrumental for optimization and increasing confidence scores. In Varanasi’s example, this model validation dimension involves determining “would those algorithms make the right picks if they were running today, and optimize them and build confidence so that eventually, these machines can actually do some of this work for you,” he maintained.
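
A simple way to operationalize both questions raised above, whether the input data has shifted and whether the model still performs, is sketched below. It compares a feature’s training-time distribution against its production distribution with SciPy’s two-sample Kolmogorov-Smirnov test; the arrays, threshold, and the commented-out re-scoring step (deployed_model, X_yesterday, y_yesterday) are hypothetical stand-ins, not part of any vendor’s workflow described here.

```python
# A sketch of two basic model validation checks: (1) has the input data
# drifted from the training distribution, and (2) does the model still
# perform on the most recent labeled data? All data here is synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Placeholder feature values captured at training time and in production.
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
production_feature = rng.normal(loc=0.4, scale=1.0, size=5000)  # shifted

# (1) Drift check: a two-sample Kolmogorov-Smirnov test compares distributions.
statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:  # threshold is an assumption; tune per use case
    print(f"Possible drift detected (KS statistic={statistic:.3f})")

# (2) Performance check: re-score the deployed model on yesterday's labeled data.
# `deployed_model`, `X_yesterday`, and `y_yesterday` are hypothetical stand-ins.
# accuracy = deployed_model.score(X_yesterday, y_yesterday)
# if accuracy < acceptance_threshold:
#     trigger_retraining()
```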

2020 Modeling

As 2020 dawns, data modeling has advanced to reflect the dynamic, fluid forms of data everywhere. It encompasses versatile methods for adapting to modern demands of variegated schema, real-time streaming data, predictive models, and pressing regulatory concerns. Organizations are tasked with becoming well versed in these modeling techniques or, quite possibly, falling prey to competitors who already have.

About the Author

Jelani Harper is an editorial consultant servicing the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance and analytics.
