Over the years, I’ve been hearing phrases like “data is the new oil” or “data is the new gold.” Yet the more we examine how data is actually managed and used, the more a different, more accurate comparison emerges: data is like radioactive material.
Much like radioactive substances, data holds immense potential for creating positive change and innovation. However, it also carries inherent risks that must be carefully managed. Just as mishandling radioactive materials can lead to catastrophic consequences, negligent handling of data can result in severe harm.
As AI builders and users, we must treat data the way we would radioactive materials: acknowledging its potential for both good and harm, and taking proactive measures to ensure its responsible and beneficial use.
The Evolution of Data and AI
In the 2010s, the era of Big Data emerged, marked by an unprecedented influx of information. Large-scale models depended on vast amounts of data, which drove organizations to collect as much as they could. As we transitioned into the 2020s, however, the focus shifted toward collecting the right data for specific use cases, emphasizing quality over quantity and the value of targeted data acquisition.
Even more recently, the rise of generative AI (GenAI) has shifted the sort of content we consider to be data. No longer confined to spreadsheets and structured datasets, data now includes articles, videos, and more.
While this expansion broadens the scope of possibilities for AI initiatives, it also introduces new complexities and risks. With content as data, not only will the intricacy of AI projects increase, but so too will the potential for data to become a liability for companies.
When Data Is an Asset vs. a Liability
While data can be a valuable asset that drives tangible business results, it has serious limitations and can become a huge liability if not managed well.
This is especially true in the wake of GenAI and maturing privacy regulations. To quote Dominique Shelton-Leipzig’s book Trust, “a recalibration is necessary to avoid the collision course between data innovation and data privacy. If Data Breach were a country and the $6 trillion losses were GDP, the country of Data Breach would be the third largest GDP in the world behind the United States and China.” Gone are the days of retention by default, especially if that data isn’t generating value.
Even organizations that have a good handle on data governance are generally poorly prepared to apply the same level of governance to the masses of new content data sources available today in the form of reports, PDFs, meeting recordings, presentations, and other multimedia assets.
Here are some scenarios where we’ve seen data become a liability for companies:
- Collecting data without a purpose or using data for multiple purposes. For example, data might originally be collected for a transactional purpose (e.g., capturing physician notes in the patient record to document diagnoses and treatment plans), but trying to use that same data for a different, unstated purpose doesn’t always work.
- Storing mass amounts of data. Data requires vast amounts of energy to store, secure, and process, resulting in an increased carbon footprint.
- Data poses security risks. Cybercriminals are drawn to organizations that have large volumes of data. As the volume of data you store grows, are you prepared to mitigate the additional risk that comes with it?
- Poor data quality leads to poorly trained models. AI and ML rely on clean data to function properly. Without it, companies could face costly errors.
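One way to act on the last point is to gate model training behind an automated data-quality check. The sketch below is illustrative only: the field names, records, and the 5% missing-data threshold are assumptions for the example, not recommendations from the article.

```python
# Minimal sketch of a data-quality gate run before model training.
# Field names, records, and the threshold are hypothetical.

def quality_report(records, required_fields, max_missing_rate=0.05):
    """Return per-field missing rates and whether the batch passes the gate."""
    total = len(records)
    missing = {
        f: sum(1 for r in records if r.get(f) in (None, ""))
        for f in required_fields
    }
    rates = {f: missing[f] / total for f in required_fields}
    passed = all(rate <= max_missing_rate for rate in rates.values())
    return {"rates": rates, "passed": passed}

# Two toy records; the second is missing a diagnosis.
records = [
    {"patient_id": "a1", "diagnosis": "J45", "treatment": "inhaler"},
    {"patient_id": "a2", "diagnosis": "", "treatment": "rest"},
]
report = quality_report(records, ["patient_id", "diagnosis", "treatment"])
```

A gate like this turns “clean data” from an aspiration into an enforced precondition: if the batch fails, training simply doesn’t run, and the report points to which fields need attention.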
Luckily, there are several strategies out there to avoid these data pitfalls.
Strategies to Make Data an Asset
Examine Flaws Introduced at Data Creation
Data subject to the strictest protection guidelines is often human originated—whether you’re observing human users, capturing information on transactions, building conversational agents, or any other human-centric ML activity. Humans are complex and sometimes silly and unreliable, which means data reflects some of these mistakes.
As Dun & Bradstreet says, “When data is dirty, there is typically an underlying business process issue to address.” In other words, inaccurate or incomplete data is often a result of poor data collection practices, a lack of data governance, and misalignment between IT and business goals. Don’t assume that what you’ve captured is an accurate representation of the world.
Real-world Application
In my experience working with hospitals, it’s not uncommon to see patient cases revisited and updated with new data because an incorrect diagnosis was applied, or because lab work done outside the health system needed to be added to the record.
When working with the primary data, that’s fine. But there’s a cascade effect: models built on the original incomplete or uncorrected data inherit its flaws. While data may never be perfect, make sure your data hygiene processes target not only the data itself, but also the models that subscribe to it.
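One simple way to catch that cascade is to record which source records each model was trained on, and flag any model whose training run predates a correction to one of those records. The sketch below is a hypothetical illustration; the record IDs, model names, and data structures are invented for the example.

```python
# Illustrative sketch: when a source record is corrected, find which
# downstream models were trained on the stale version and should be
# refreshed. All names and structures here are hypothetical.
from datetime import datetime

def stale_models(models, corrections):
    """Return names of models trained before a correction to a record they used."""
    stale = []
    for m in models:
        for c in corrections:
            if (c["record_id"] in m["source_records"]
                    and m["trained_at"] < c["corrected_at"]):
                stale.append(m["name"])
                break
    return stale

# A patient record corrected in March 2024.
corrections = [{"record_id": "pt-102", "corrected_at": datetime(2024, 3, 10)}]

# One model trained before the correction, one after.
models = [
    {"name": "readmission-risk", "trained_at": datetime(2024, 1, 5),
     "source_records": {"pt-102"}},
    {"name": "los-forecast", "trained_at": datetime(2024, 4, 1),
     "source_records": {"pt-102"}},
]

flagged = stale_models(models, corrections)
```

The key design choice is the lineage link (`source_records`): without it, corrections to primary data fix the record but silently leave every downstream model trained on the old version.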
Weigh the Risks
Every time you choose to collect new data, weigh the risk of (1) collecting the data and (2) holding onto the data. Will it only increase the liability for your company or is it connected to a permitted use and therefore worth storing (read: protecting)?
Perfection Doesn’t Exist
Don’t be the company that strives for perfect data. Often, building a model through rapid prototyping will yield the nature of the data that’s missing and give you a head start on capturing the right data for the right purpose.
In general, we must stop treating data as valuable by default. Cassie Kozyrkov wrote it best on LinkedIn: “I wish we’d all stop pronouncing data with a capital ‘D’. Data isn’t magic — just because you have a spreadsheet full of numbers doesn’t guarantee that you’ll be able to get anything useful out of it.”
Good data happens as a function of a process. As the volume of data necessary to leverage the power of GenAI increases, it’s never been more important to invest in data quality. Data is only made valuable through process and mindful investment. It may not be gold waiting to be found, but rather a diamond formed through process.
About the Author
Cal Al-Dhubaib is a globally recognized data scientist and AI strategist in trustworthy artificial intelligence, as well as the Head of AI and Data Science at Further, a data, cloud, and AI company focused on helping make sense of raw data.