Big Data or Bad Data? Survey Shows Enterprises Struggle to Manage Big Data Flows


StreamSets, the company that delivers performance management for data flows, today announced results from a global survey of more than 300 data management professionals conducted by independent research firm Dimensional Research®. The study showed that enterprises of all sizes face challenges on a range of key data performance management issues, from stopping bad data to keeping data flows operating effectively. In particular, 87 percent of respondents report flowing bad data into their data stores, while just 12 percent consider themselves good at the key aspects of data flow performance management.

The survey reveals pervasive data pollution, which implies analytic results may be wrong, leading to false insights that drive poor business decisions. Even when companies can detect their bad data, cleaning it after the fact wastes data scientists' time and delays the data's use, which is deadly in a world increasingly reliant on real-time analysis.

Despite Constant Cleansing, Bad Data Is Polluting Data Stores and Is Difficult to Detect

Respondents cited ensuring data quality as the most common challenge they face when managing big data flows (selected by 68 percent). In addition to bad data flowing into stores, 74 percent of organizations reported currently having bad data in their stores, despite cleansing data throughout the data lifecycle. While 69 percent of organizations consider the ability to detect diverging data values in flow as “valuable” or “very valuable,” only 34 percent rated themselves as “good” or “excellent” at detecting those changes.

Broad Challenges to Performance Managing Data Flows

While detecting bad data is a critical aspect of data flow performance, the survey showed that enterprise struggles are much broader. In fact, only 12 percent of respondents rate themselves as “good” or “excellent” across five key performance management areas, namely detecting the following events: pipeline down, throughput degradation, error rate increases, data value divergence and personally identifiable information (PII) violations.

Respondents felt weakest at detecting throughput degradation (44%), error rate increases (44%) and divergent data (34%). Detecting a "pipeline down" event was the only metric where a large majority felt positively about their capabilities (66%). Across each key performance management area, there was a very large gap between respondents' self-reported capabilities and how valuable they considered each competency.

Fragile Hand Coding Plus Data Drift Is a Dangerous Combination

These quality and performance management issues may be driven by the reality of data drift — unexpected changes in data structure or semantics — combined with the continued use of outdated methods to design data flows, such as low-level coding or schema-driven ETL tools. Making frequent changes to pipelines using these inflexible approaches is not only highly inefficient but also error-prone. Moreover, these tools do not let you observe data in motion, which means you are flying blind and cannot detect data quality or data flow issues.

  • Constant tweaking of pipelines due to data drift: Eighty-five percent said that unexpected changes to data structure or semantics create a substantial operational impact. Over half (53%) reported that they have to alter each data flow pipeline several times a month, with 23% making changes several times a week or more.
  • The prominence of hand coding and legacy ETL tools: Nearly two-thirds of respondents use ETL/data integration tools and 77 percent use hand coding to design their data pipelines.
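To make the data drift problem concrete, the following is a minimal sketch of what in-flow drift detection might look like: checking each incoming record against an expected structure before it lands in a data store. The field names and expected schema here are hypothetical illustrations, not part of any product described in the survey.

```python
# Hypothetical expected schema: field name -> expected Python type.
EXPECTED_FIELDS = {"user_id": int, "event": str, "amount": float}

def check_drift(record: dict) -> list:
    """Return a list of drift issues found in a single record.

    Flags three kinds of drift: missing fields, changed value types,
    and fields that were never part of the expected schema.
    """
    issues = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            issues.append("missing field: " + name)
        elif not isinstance(record[name], expected_type):
            issues.append(
                "type change: %s is %s" % (name, type(record[name]).__name__)
            )
    for name in record:
        if name not in EXPECTED_FIELDS:
            issues.append("new field: " + name)
    return issues

# A record matching the schema produces no issues; a drifted one is flagged.
ok = check_drift({"user_id": 1, "event": "click", "amount": 2.5})
drifted = check_drift({"user_id": "abc", "event": "click",
                       "amount": 2.5, "session": "xyz"})
```

Running a check like this inline, rather than cleansing after the fact, is the kind of "data in motion" visibility the article argues schema-driven ETL tools and hand-coded pipelines lack.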

The Need for a New Paradigm

With the emergence of fast data and streaming analytics, the operational risk has shifted from data at rest to data in motion. However, enterprises overwhelmingly report that they struggle to manage their data flows. What is required is a new organizational discipline around performance management of data flows with the goal of ensuring that next-generation applications are fed quality data continuously.

“In today’s world of real-time analytics, data flows are the lifeblood of an enterprise,” said Girish Pancha, CEO, StreamSets. “The industry has long been fixated on managing data at rest, and this myopia creates a real risk for enterprises as they attempt to harness big and fast data. It is imperative that we shift our mindset towards building continuous data operations capabilities that are in tune with the time-sensitive, dynamic nature of today’s data.”

For more information and complete survey results, please visit


