In this special guest feature, Michael Hyatt of Metanautix delves into the hurdles organizations must overcome to gain optimal insights from their big data. Challenges addressed include the proliferation of data sources and types, and how organizations must be able to seamlessly analyze and process data across the entire organizational architecture, “correlating and querying every bit of data we have, whatever and wherever it is.” Michael is Product Evangelist at Metanautix, a big data analytics company founded by data scientists from Facebook and Google.
You see the headlines every day, throughout the technology and industry media, and increasingly in the general media. “Big Data,” they wonder, “is it hype or huge?” There is an endless, and often pointless, debate about what it is, why it is, whether it has value, and if so, what that value is. We’ve seen this movie before, with “The Cloud”, and before that with “Virtualization”. Real, genuinely important concepts are co-opted by the marketers and applied to any product, service or solution that might benefit from the loose association.
But the same thing always happens with these kinds of technological concepts. While the world argues about what they are, what they mean and who will benefit, companies make real investments and people are assigned real roles implementing those investments. And every one of those concepts comes with its own unique challenges.
With Big Data, there are a number of challenges, but the first and most vexing problem the people responsible for utilizing Big Data must overcome is the proliferation of data sources and types. Different databases, different file formats, all stored in different silos in departments, data centers and clouds all over the globe. Analytics run against only one data source or type are useless – the value of Big Data is the ‘Big’ part. And our data is ‘Big’ only if we can look across the entire organizational architecture, correlating and querying every bit of data we have, whatever and wherever it is.
Oh sure. We do that now. We have ‘ETL’, we have ‘normalization’, we have ‘parsers’, we have field mapping, we have tools and scripts and grep and sed and awk and Perl, and we put them together in some kind of fragile system, utterly different for each data type, that can (usually) output some data in a usable format to yet another database. The problems with this approach are virtually endless. Sometimes our convoluted process fails. Updating the data usually requires some kind of intervention. The output formats are inconsistent, and are usually proprietary. And it will take months to get a new data type included in the analytics process (honestly, can your analysts query Avro and .csv files at the same time with the same tool?).
But at the end of the day, if you’ve got “Big Data” in your title, a budget and a staff and a whole bunch of projects, demands and initiatives, the LAST thing you want to be struggling with is integration and normalization. The LAST thing you need is a whole bunch of tickets requesting access to certain data assets. Wouldn’t it be great if you could just connect all the data sources and types to a single intelligent software layer that would handle all the access requirements, delivering the data in a standardized, normalized format – ANSI SQL – that everybody can use with all their tools? Wouldn’t it be even greater if that intelligent software layer was aware of the users’ permission levels and credentials, and could automatically mask PII? And what if any business-side user could connect to any combination of enterprise data assets, with no intervention from the tech side whatsoever?
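To make the idea concrete, here is a minimal sketch of what “one SQL layer over several formats, with permission-aware masking” could look like. Everything in it is an assumption for illustration: the sample data, the table names, and the helpers `build_layer`, `mask_email` and `spend_by_customer` are hypothetical, JSON lines stand in for a second file format such as Avro, and only the email column is treated as PII. It also copies the data into an in-memory SQLite database, which is itself a small ETL step; a real federation layer would presumably query the sources in place.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data standing in for two silos with different formats.
ORDERS_CSV = "order_id,customer_id,amount\n1,100,25.50\n2,101,40.00\n3,100,10.00\n"
CUSTOMERS_JSONL = (
    '{"customer_id": 100, "name": "Ada", "email": "ada@example.com"}\n'
    '{"customer_id": 101, "name": "Bob", "email": "bob@example.com"}\n'
)


def build_layer():
    """Expose both sources as SQL tables in one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER, name TEXT, email TEXT)"
    )
    # Source 1: CSV text.
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)",
            (int(row["order_id"]), int(row["customer_id"]), float(row["amount"])),
        )
    # Source 2: JSON lines (a stand-in here for a format such as Avro).
    for line in CUSTOMERS_JSONL.splitlines():
        rec = json.loads(line)
        conn.execute(
            "INSERT INTO customers VALUES (?, ?, ?)",
            (rec["customer_id"], rec["name"], rec["email"]),
        )
    return conn


def mask_email(email):
    """Hide the local part of an address, keeping only the domain."""
    _local, _sep, domain = email.partition("@")
    return "***@" + domain


def spend_by_customer(conn, can_see_pii):
    """Join across both sources with plain SQL; mask email unless permitted."""
    rows = conn.execute(
        "SELECT c.name, c.email, SUM(o.amount) AS total "
        "FROM customers c JOIN orders o ON o.customer_id = c.customer_id "
        "GROUP BY c.customer_id, c.name, c.email "
        "ORDER BY c.customer_id"
    ).fetchall()
    if can_see_pii:
        return rows
    return [(name, mask_email(email), total) for name, email, total in rows]


if __name__ == "__main__":
    layer = build_layer()
    print(spend_by_customer(layer, can_see_pii=True))
    print(spend_by_customer(layer, can_see_pii=False))
```

The point of the sketch is the shape, not the plumbing: the analyst writes one SQL query, never touches the file formats, and the layer decides what that particular user is allowed to see.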
Think about it: If you’re a Big Data professional in 2015, how much of your time ought to be spent addressing 1980s ETL challenges? And how much does that part of the job contribute to your bonus?