Who Done It? 3 Possible Suspects in this Halloween’s Bad Data Horror Movie, And How Data Teams Can Make It Out Alive

We all know the tell-tale signs that something is about to go horribly awry in a horror movie.

The getaway car won’t start. Every entrance in the mansion is locked aside from the backdoor. The basement light flickers and the room goes silent. 

While moviegoers can hide behind their buckets of popcorn or yell at the protagonist to “get away from the door!” data engineers are not so lucky when horror strikes. And for data engineers, that “horror” is more often than not bad data. 

According to a recent study from Wakefield Research, data teams spend 40 percent or more of their time tackling poor data quality, and those quality issues impact 26 percent of their companies’ total revenue. Talk about a horror story.

What constitutes a data horror story, you might ask? Here are a few examples.

On a dark and stormy night in 2022 (just kidding, we don’t know what time of day it was), gaming software company Unity Technologies’ Audience Pinpointer tool, designed to aid game developers in targeted player acquisition and advertising, ingested bad data from a large customer. The bad data caused major inaccuracies in the training sets for its predictive ML algorithms and a subsequent dip in performance. The data downtime incident sent the company’s stock plummeting by 36%, costing Unity upwards of $110 million in lost revenue. 

Or take Equifax, which issued inaccurate credit scores to millions of its customers back in the summer of 2022, all due to bad data on a legacy on-prem server.

So, how can you evade your own data quality horror stories? We share three common causes of data downtime and walk through how you can escape them. 

Is the Call Coming From Inside The House? 3 Root Causes of Data Downtime

The 2023 edition of the same Wakefield data quality survey found that the time it takes to detect and resolve a given data incident rose by an astounding 166% year-over-year. In the case of our horror movie, that’s like taking an additional week to figure out who the killer is.

To trim that time-to-resolution down (and save some fictional lives), it’s critical to understand more about the root causes of data anomalies. And while there are a near-infinite number of root causes for each type of anomaly, they all stem from issues across three layers of your data infrastructure. 

Understanding these layers and how they produce data anomalies can provide structure to your incident resolution process.

System root causes

System or operational issues occur when an error is introduced by the systems or tools applied to the data during the extraction, loading, and transformation processes. One example could be an Airflow check that took too long to run, causing a data freshness anomaly. Another could be a job that relies on a particular schema in Snowflake but lacks the permissions to access that schema.
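
For reference, here is a minimal sketch of the kind of freshness check that can hit this failure mode, written against Airflow’s TaskFlow API (2.x). The `analytics.orders` table, the two-hour staleness threshold, and the `get_warehouse_conn()` helper are hypothetical stand-ins for your own warehouse and connection logic.

```python
# Minimal sketch of an Airflow freshness check (Airflow 2.x TaskFlow API).
# `analytics.orders` and `get_warehouse_conn()` are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(schedule="@hourly", start_date=datetime(2023, 10, 31), catchup=False)
def orders_freshness():
    # If this task keeps exceeding its timeout, the check never completes and
    # downstream consumers quietly end up with stale data: a freshness anomaly.
    @task(execution_timeout=timedelta(minutes=10))
    def check_orders_freshness():
        conn = get_warehouse_conn()  # hypothetical: returns a DB-API connection
        cur = conn.cursor()
        cur.execute("SELECT MAX(updated_at) FROM analytics.orders")
        last_update = cur.fetchone()[0]
        if last_update is None or datetime.utcnow() - last_update > timedelta(hours=2):
            raise ValueError(f"analytics.orders looks stale (last update: {last_update})")

    check_orders_freshness()


orders_freshness()
```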

Code root causes

The second type of data incident root cause is code-related. Is there anything wrong with your SQL or engineering code? An improper JOIN statement that produced unwanted or unfiltered rows, perhaps? Or a dbt model that accidentally added a very restrictive WHERE clause, reducing the number of output rows and triggering a volume anomaly?
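
To make the volume-anomaly scenario concrete, here is a minimal sketch of a row-count comparison between a source table and the model built from it. The table names, the `conn` DB-API connection, and the 50% drop threshold are assumptions for illustration only.

```python
# Minimal sketch of a volume check; table names and threshold are assumptions.
def row_count(conn, table: str) -> int:
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")  # table names are trusted input here
    return cur.fetchone()[0]


def check_volume(conn, source: str, model: str, max_drop: float = 0.5) -> None:
    # A bad JOIN or an overly restrictive WHERE clause in the model's SQL shows
    # up here as an unexpected drop in output rows relative to the source.
    src_rows, model_rows = row_count(conn, source), row_count(conn, model)
    if src_rows and model_rows < src_rows * (1 - max_drop):
        raise ValueError(
            f"Volume anomaly: {model} has {model_rows} rows vs {src_rows} in {source}"
        )
```

Calling something like `check_volume(conn, "raw.orders", "analytics.orders")` after each run would flag the hypothetical dbt model above if it silently dropped more than half of its input rows.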

Data root causes

System and code issues are also typical in software engineering, but in the wonderful world of data engineering, there can also be issues that arise in the data itself, making it a more dynamic variable. For example, a consumer application might receive customer input that is just plain wacky. Let’s say you are an online pet retailer and someone enters that their dog weighs 500 pounds instead of just 50, which results in a field health anomaly. 
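
A minimal sketch of a field health check for that pet-retailer example might look like the following. The `pets` table, `weight_lbs` column, and 250-pound ceiling are all hypothetical, and the `%s` parameter style assumes a driver such as the Snowflake or Postgres connector.

```python
# Minimal sketch of a field health check; table, column, and threshold are
# hypothetical, and the %s paramstyle depends on your database driver.
def check_dog_weights(conn, max_weight_lbs: float = 250.0) -> None:
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM pets WHERE species = 'dog' AND weight_lbs > %s",
        (max_weight_lbs,),
    )
    bad_rows = cur.fetchone()[0]
    if bad_rows:
        # The 500-pound "dog" lands here: the pipeline ran fine and the code is
        # correct, but the data itself is off.
        raise ValueError(f"{bad_rows} dog records exceed {max_weight_lbs} lbs")
```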

Your Ticket to Data Horror Story Survival

Since data anomalies can originate in any component of your data environment, as well as in the data itself, incident resolution gets messy, and it becomes trickier to nab our killer. 

Data teams may have tabs open for Fivetran, Databricks, Snowflake, Airflow, and dbt, while simultaneously reviewing logs and error traces in their ETL engine and running multiple queries to segment the data. And on top of all of this, the massive pressure on data teams to focus on generative AI has caused the production of data and data pipelines to go into hyperdrive, only exacerbating the shakiness of manual and reactive data quality processes. 

Proactive data monitoring and observability can help consolidate and automate these processes by letting you see any change in your data stack, regardless of whether the cause is code, data, or system. Not only that, it provides lineage for the issue, down to the field, at just the click of a mouse. 
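
To illustrate what “consolidate and automate” looks like at the smallest possible scale, here is a sketch that runs the earlier checks as one monitor and collects failures for alerting. It reuses the hypothetical helpers from the sketches above; a real observability platform would additionally learn thresholds and trace field-level lineage for you.

```python
# Minimal sketch of one consolidated monitor; reuses the hypothetical helpers
# (get_warehouse_conn, check_volume, check_dog_weights) from earlier sketches.
def run_monitors() -> list[str]:
    conn = get_warehouse_conn()  # hypothetical connection factory
    checks = {
        "volume:analytics.orders": lambda: check_volume(conn, "raw.orders", "analytics.orders"),
        "field_health:pets.weight_lbs": lambda: check_dog_weights(conn),
    }
    failures = []
    for name, check in checks.items():
        try:
            check()
        except ValueError as exc:
            failures.append(f"{name}: {exc}")
    return failures  # route these to Slack, PagerDuty, or wherever your team screams
```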

No cliffhangers in this data horror movie! Unless………

About the Author

Lior Gavish is CTO and Co-Founder of Monte Carlo, a data reliability company backed by Accel, Redpoint Ventures, GGV, ICONIQ Growth, Salesforce Ventures, and IVP. Prior to Monte Carlo, Lior co-founded cybersecurity startup Sookasa, which was acquired by Barracuda in 2016. At Barracuda, Lior was SVP of Engineering, launching award-winning ML products for fraud prevention. Lior holds an MBA from Stanford and an MSc in Computer Science from Tel Aviv University.
