Data strategy is the substrate upon which the very notion of data management, and all its dimensions, depends. Developments in this facet of the data sphere directly impact all others, from the rudiments of data governance to advanced analytics spurring time-sensitive action.
In 2022, data strategy trends will focus on several movements that have been gaining momentum, including Artificial Intelligence governance, data privacy, and data literacy. However, the irresistible force they must primarily counter is the natural outcome of each of these considerations and the premier data management challenge of our time.
In a work-from-home world characterized by poly-cloud deployments, edge computing, the Internet of Things, and the marked distribution of data assets throughout these and conventional on-premises settings, the most obstinate impediment to managing data is the challenge of centralizing such management in a decentralized, if not fragmented, landscape.
The time-honored, knee-jerk reaction to this phenomenon is to reposition or replicate data for centralized access, governance, and oversight, an approach that simply isn’t tenable or sustainable.
That’s why, according to Talend CTO Krishna Tammana, the biggest data strategy trend is “data management will shift organizations’ focus from the mechanics of moving and storing information to focusing on business outcomes. In the quest to drive business outcomes by being data driven, businesses realize the need to go beyond the mechanics of moving data.”
Curtailing the endless movement of data with a comprehensive data strategy that still facilitates centralized data management will surely require a variety of strategic measures, from data fabrics and data meshes to top-down, bottom-up, offensive, and defensive postures.
Or, as Stardog CEO Kendall Clark put it, “we’re in a corner; the only way out is alternatives to the physical consolidation of data.”
Surmounting Decentralization Woes
As Clark implied, most firms’ data management strategies involve copying or consolidating data for any use case, whether data warehousing, loading applications, or building machine learning models. The issues with this approach are nearly as plentiful as the means of overcoming it. “Long-term trends show data growth is up and to the right, but the underlying network performance is fixed,” Clark revealed. “We’re closer to the maximum limit our networks can perform than we are to data growth and volumes slowing.” To Clark’s point, IDC reported that well over 60 zettabytes of data were either created or replicated in 2020, with a disproportionate amount of that figure due to replication.
Alternatives to physically consolidating data for integration purposes include architectures such as a data mesh or data fabric and technologies like data virtualization and query federation methods. Each of these options enables organizations to do what Denodo CMO Ravi Shankar termed “connect to data sources” for centralized access, instead of moving data. When properly implemented, they enable organizations to get unified views of their distributed data. Plus, they counteract the fact that “at the data management layer, we’ve got a monoculture,” Clark propounded. “We only do the one thing [move data] and that part’s easiest to change and make a leap forward for time-to-insight and better run businesses, organizations, and society by making a strategic change.”
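To make the “connect to data sources” idea concrete, here is a minimal Python sketch of query federation. It uses two in-memory SQLite databases as stand-ins for separate sources; the table names, columns, and join key are hypothetical illustrations, not any vendor’s API. The principle is that each source answers its own sub-query, and only the small result sets travel for unification.

```python
# Minimal sketch of query federation: the data stays in each source;
# only the sub-query results travel over the network. Source names,
# schemas, and the join key are hypothetical.
import sqlite3

def fetch(conn, sql):
    """Push a sub-query down to a source and return only its result set."""
    return conn.execute(sql).fetchall()

# Two independent "sources" (stand-ins for, say, a warehouse and a CRM).
sales = sqlite3.connect(":memory:")
sales.execute("CREATE TABLE orders (customer_id INT, amount REAL)")
sales.executemany("INSERT INTO orders VALUES (?, ?)",
                  [(1, 120.0), (2, 75.5), (1, 40.0)])

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INT, region TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "West"), (2, "East")])

# Federated query: push the aggregation down to each source, then join
# the (small) results here instead of replicating either table.
totals = dict(fetch(sales,
    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"))
regions = dict(fetch(crm, "SELECT id, region FROM customers"))

unified = [(cid, regions.get(cid), total) for cid, total in totals.items()]
print(unified)  # e.g. [(1, 'West', 160.0), (2, 'East', 75.5)]
```

In a real data fabric or virtualization layer, the query pushdown, cataloging, and optimization are handled by the platform; the point of the sketch is simply that neither table is copied or moved.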
Data Quality
Impaired data quality, partly due to the proliferation of silo culture, is another drawback to interminably replicating data for data management purposes. “As more individuals have access to increasing volumes and sources of data alongside the ability to create their own data lakes and warehouses, data governance practices and teams will be challenged in new ways,” Tammana commented. Whether concentrating on offensive, defensive, bottom-up, or top-down stratagems, data quality is indispensable for ensuring success. Its eight dimensions, several of which are illustrated in the code sketch after this list, include:
- Syntax: Syntax involves how data are presented and what EWSolutions President David Marco called “the precision of data. How many places to the right of the decimal are we going to go; that’s a syntax example.”
- Format: Format is based on a uniform representation of data, such as specifying whether or not social security numbers will have dashes, slashes, spaces, etc.
- Accuracy: Accuracy involves aspects of validity and entails the rectitude of data. “If we’ve got a social security number that’s all nines, that’s not too accurate,” Marco said.
- Reasonableness: The crux of reasonableness is making sure “the data value conforms to its domain,” Marco mentioned. Ages, for example, shouldn’t include negative numbers.
- Currency: Currency, or data timeliness, concerns how recent data are and whether their age impacts their accuracy or necessitates updates.
- Consistency: Consistency involves not deviating from how data elements or entities are represented. For example, ‘California’ shouldn’t be referred to in systems as ‘CA’, ‘Cal’, or ‘Cali’.
- Authority: According to Marco, authority means “is this data backed by some kind of credibility? For example, a data stewardship team should certify data as being accurate. That accreditation process is a good example of authority.”
- Definition: This dimension extends beyond syntax and requires concrete definitions (or glossaries) for business terms.
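As a hedged illustration, the following Python sketch encodes a few of these dimensions as simple validators. The field names, rules, and thresholds are hypothetical examples drawn from the list above, not a complete data quality framework.

```python
# Illustrative validators for several of the dimensions above. The field
# names, rules, and thresholds are hypothetical, not a full framework.
import re
from decimal import Decimal

CANONICAL_STATES = {"California", "Texas", "New York"}  # consistency reference

def check_format_ssn(value: str) -> bool:
    """Format: one uniform representation, here NNN-NN-NNNN with dashes."""
    return bool(re.fullmatch(r"\d{3}-\d{2}-\d{4}", value))

def check_accuracy_ssn(value: str) -> bool:
    """Accuracy: a social security number of all nines is not a real value."""
    return value.replace("-", "") != "9" * 9

def check_reasonableness_age(age: int) -> bool:
    """Reasonableness: the value must conform to its domain (no negative ages)."""
    return 0 <= age <= 130

def check_syntax_price(value: str, places: int = 2) -> bool:
    """Syntax: enforce a fixed precision to the right of the decimal point."""
    exponent = Decimal(value).as_tuple().exponent
    return isinstance(exponent, int) and -exponent <= places

def check_consistency_state(value: str) -> bool:
    """Consistency: only the canonical spelling, never 'CA', 'Cal', or 'Cali'."""
    return value in CANONICAL_STATES

record = {"ssn": "999-99-9999", "age": -4, "price": "19.999", "state": "Cali"}
print(check_format_ssn(record["ssn"]))           # True: the format itself is fine
print(check_accuracy_ssn(record["ssn"]))         # False: all nines
print(check_reasonableness_age(record["age"]))   # False: negative age
print(check_syntax_price(record["price"]))       # False: three decimal places
print(check_consistency_state(record["state"]))  # False: non-canonical name
```

Dimensions such as currency, authority, and definition are organizational rather than purely programmatic, which is why automated checks like these are only one part of a data quality program.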
AI Governance
Another undesirable outcome of constantly copying data and creating silos is undermining data governance for AI deployments, resulting in inaccuracies, regulatory complications, and legal action. Monitaur CEO Anthony Habayeb noted organizations can reinforce this aspect of data governance by documenting specific procedures for risk management and model management via:
- Risk Assessment: Involving risk and compliance personnel is pivotal for governing AI models. Evaluating risk is integral to this process and includes “what model do I select, what are the tradeoffs in terms of transparency versus accuracy,” Habayeb remarked. “What data do I select; how do I confirm that data’s fit for use, has an appropriate distribution, and the data itself’s not biased?” Risk personnel should obtain this information from data scientists devising models, which informs their use. “Enriching data with context and commentary will help other users in an organization understand what data they can trust and how to use it best to achieve actionable insights,” Tammana explained.
- Model Management: The overall lifecycle management of AI models is indispensable for ModelOps. This step may begin with data engineers documenting what was done to training data and can include ongoing measures for continually monitoring models. Habayeb referenced automation “that ensures models are being monitored for drift or outliers or statistical biases. You should be testing your model for drift, bias, and outliers.”
- Consequential Model Governance: This powerful capability is of interest to both regulators and auditors. “There should be an ability to test a decision,” Habayeb maintained. “How would a model perform if [someone] was 55 instead of 45, or 33 instead of three? What would a model do in that situation? Counterfactual testing of decisions allows a non-technical person to do that.” This capacity is assisted by traceability measures throughout a model’s lifecycle for auditing; a minimal sketch of such a counterfactual test follows this list.
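As a sketch of the counterfactual testing Habayeb describes, the following Python example re-runs a model on a copy of an input with one attribute changed and reports whether the decision flips. The loan model, feature names, and thresholds are invented stand-ins, not Monitaur’s implementation.

```python
# Minimal counterfactual decision test: change one attribute, hold the
# rest constant, and compare outcomes. Model and features are hypothetical.
from copy import deepcopy

def approve_loan(applicant: dict) -> bool:
    """Stand-in model: a toy rule, not a real credit policy."""
    return applicant["income"] > 40_000 and applicant["age"] >= 21

def counterfactual_test(model, applicant, feature, new_value):
    """Would the decision change if only `feature` were different?"""
    variant = deepcopy(applicant)
    variant[feature] = new_value
    baseline, counterfactual = model(applicant), model(variant)
    return {"baseline": baseline,
            "counterfactual": counterfactual,
            "decision_changed": baseline != counterfactual}

applicant = {"age": 45, "income": 52_000}
# "How would a model perform if [someone] was 55 instead of 45?"
print(counterfactual_test(approve_loan, applicant, "age", 55))
# {'baseline': True, 'counterfactual': True, 'decision_changed': False}
```

Because the test needs only the model’s inputs and outputs, a non-technical auditor can run it without understanding the model’s internals, which is precisely the appeal Habayeb highlights.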
Data Strategy Archetypes
A quintessential data strategy typically involves measures for both offense and defense: the former is about boosting revenues and productivity, while the latter is often about mitigating risk. For example, “By building more repeatable, comprehensive governance programs your time to market accelerates,” Habayeb reflected. “If you’re in pharma, your six months faster time to market could be worth hundreds of millions of dollars.” There are also top-down, bottom-up, and combined approaches. The first is dictated by executives, boards of directors, and top shareholders.
Marco provided an example of this approach by recounting an anecdote about a manufacturing software company in which “the finance group was led by somebody a little more sophisticated and he said we are not going to modify JD Edwards; we will instead modify our business processes to match JD Edwards.” Bottom-up data strategies are often based on the needs of individual business units and their use cases at the time, which are then extended throughout the rest of the organization. Admixtures of the two involve determining the needs of executives and data end users alike, identifying what issues the former has in fulfilling the latter’s goals, then developing rules for implementing policies devised from the top.
Data Privacy, As Well
One of the more pressing concerns of managing data by continually copying them pertains to data privacy legislation, which is only expanding alongside consumer demands for it. “As new data privacy laws continue to emerge and increase compliance complexity, data governance will take a central role in strategic data management models,” Tammana indicated. Silos of data copies can easily defy such governance, whereas centralization efforts bereft of data copies (involving data fabrics, data meshes, data virtualization, and query federation) will simplify that governance, along with data quality and aspects of AI governance.
Moreover, selecting the right data strategy that’s applicable to a specific organization is no longer a point of academic interest. “It’s increasingly hard to tell the difference between your data management strategy and your corporate strategy,” Clark observed. “Investors are learning that how you manage your data has to do with how you manage your business.”
The choice, then, seems clear.
About the Author
Jelani Harper is an editorial consultant serving the information technology market. He specializes in data-driven applications focused on semantic technologies, data governance, and analytics.