Hortonworks Sees Continued Growth Ahead

HADOOP SUMMIT 2015 RECAP

At the Hadoop Summit 2015 in San Jose last summer, I had the opportunity to sit down with Tim Hall, VP of Product Management at Hortonworks, to discuss his company’s position in the Hadoop marketplace and what’s in store for the newly public company. Hortonworks develops, distributes and supports the Hortonworks Data Platform (HDP) – the completely open-source Apache Hadoop data platform, architected for the enterprise.

Daniel – Managing Editor, insideAI News

insideAI News: What would you say is in the HDP 2.3 feature set that might attract companies that find themselves at a technology pivot point and need new solutions to meet new demands?

Tim Hall: First and foremost, people are a little bit scared of Hadoop in certain contexts. There’s been a lot of press and analyst commentary of late saying that it’s hard or there is a skillset mismatch, so that’s been shaping our opinion of what we need to do to take this to the next level.

We think we’ve crossed the chasm in terms of customer adoption. We’re at 450+ customers, including 204 added in the last two quarters. You have to realize that we’ve only had product in the market for the last ten quarters. That’s not that long, but we’re feeding the hunger of the folks who want this platform – those who’ve recognized, or heard, that it can transform their businesses. The challenge is that it’s technical: there’s a lot of command-line interface and data-wrangling work that needs to be done. This is shaping our opinion of what we need to do to reach the next large section of the adoption curve, i.e. to bring more ease of use and to simplify all of the moving parts of Hadoop.

My background as VP of Engineering and Greg Pavlik’s background – we both came from enterprise software companies; Oracle and HP are two places we both worked – so that forms our view of how we’re selling and who we’re selling these technologies to. Not everybody is in the early-adopter camp. Not everyone is a LinkedIn, TrueCar or Webtrends. Instead, there’s State Farm, Aetna and other traditional businesses that have been around for a long time. The question is what we can do to help them adopt. They do have smart, brilliant technologists, but the bar has to come down in terms of the complexity of everything – the data wrangling, the management, the operations, the governance, the security – to make it easier to consume. That’s a big part of what we’re focused on.

insideAI News: How is Hortonworks evolving its product line to address the needs of traditional enterprise Hadoop adopters?

Tim Hall: We work with banks and financial services institutions adopting our technology, for example, and some of the struggles we have with them – between customer and vendor – often sound like, “How come this doesn’t work like an Oracle database?” We can look at lessons learned from previous technology maturation cycles and apply those to Hadoop in a more expedient way. That’s one of the great things about working at Hortonworks: we have a whole community of people who came from Yahoo, like our founders, who are used to and comfortable with working at web-scale companies. You then supplement that experience with people from enterprise software backgrounds. It’s a good marriage, and it’s what makes Hortonworks great.

insideAI News: How has going public changed the way prospective customers view Hortonworks?

Tim Hall: One of the reasons we went public was to provide transparency for customers. There are a number of vendors working in the Hadoop space, and you want to remove the fear, uncertainty and doubt about how we’re doing as a business; going public forces you to lay those things bare. Our recent business results confirm this customer attitude, with Hortonworks feeding the hunger of the Hadoop market.

insideAI News: How does Hortonworks view the merging of HPC and big data with Hadoop as the conduit, especially with the scientific research community?

Tim Hall: My background is in scientific research – I was a physicist by education and I worked at JPL. Yes, the research community is aware of Hadoop and its ability to store and analyze data cost-effectively. Where it tends toward HPC environments is in the amount of compute resources used to run the analysis. Originally Hadoop was known as a batch processing system, but as we’ve been making performance improvements and looking at how to utilize more of the compute resources at our disposal, we’ve been able to move up to interactive and even real-time processing speeds.

Some of it has to do with how we’re using memory. HDFS as a file system was originally oriented entirely toward spinning disk, but you can change that mindset once you consider the kinds of storage available now, like flash arrays. We’ve been investing in HDFS core to move away from the notion that HDFS is about spinning disk and toward “heterogeneous storage tiering,” so you can decide where the various blocks of information are placed as they land within HDFS – where HDFS is an abstraction above things like flash arrays, spinning disk, high-performance disk arrays and everything in between. You can decide how you want to handle your hottest data sets and keep them as close to compute as possible, potentially even in memory, so you’re eliminating all kinds of latency.
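For readers who want to experiment with this, HDFS exposes tiering through named storage policies that can be set per path. Below is a minimal sketch using the standard Hadoop FileSystem API; the paths and policy choices are hypothetical illustrations, not configurations Hall prescribed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StorageTieringSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        // (e.g. hdfs://namenode:8020 -- a placeholder, not a real host).
        FileSystem fs = FileSystem.get(new Configuration());

        // Pin a hot, frequently queried data set to SSD-backed volumes.
        fs.setStoragePolicy(new Path("/data/hot"), "ALL_SSD");

        // Keep scratch replicas in RAM, trading durability for the
        // lowest read latency -- the "in memory" tier Hall mentions.
        fs.setStoragePolicy(new Path("/data/scratch"), "LAZY_PERSIST");

        // Archive a rarely touched data set to dense spinning disk.
        fs.setStoragePolicy(new Path("/data/cold"), "COLD");

        fs.close();
    }
}
```

The same policies can also be applied from the command line with `hdfs storagepolicies -setStoragePolicy`.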

So it’s that kind of intersection that could be valuable to the research community, especially if they want to run fast simulations or iterative processing. That’s where you see Spark emerging as well, for things like machine learning, where you’re trying to leverage memory as the primary compute resource. If you want to run a very large Monte Carlo simulation, for example, Spark is great.
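As a concrete taste of that memory-centric style, here is a minimal Monte Carlo estimate of pi in Spark’s Java API – a toy stand-in for the much larger simulations Hall has in mind, with the sample count chosen arbitrarily for illustration.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class MonteCarloPiSketch {
    public static void main(String[] args) {
        // The master (e.g. --master yarn) is supplied via spark-submit,
        // which is how Spark runs as a YARN-native engine on HDP.
        SparkConf conf = new SparkConf().setAppName("MonteCarloPi");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            int samples = 1_000_000; // arbitrary size for illustration
            List<Integer> trials = new ArrayList<>(samples);
            for (int i = 0; i < samples; i++) trials.add(i);

            // Each trial throws a random dart at the unit square and
            // counts it if it lands inside the quarter circle.
            long hits = sc.parallelize(trials, 100)
                    .filter(i -> {
                        double x = ThreadLocalRandom.current().nextDouble();
                        double y = ThreadLocalRandom.current().nextDouble();
                        return x * x + y * y <= 1.0;
                    })
                    .count();

            // The quarter circle covers pi/4 of the unit square.
            System.out.println("pi ~= " + 4.0 * hits / samples);
        }
    }
}
```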

One of our partners is Ayasdi, a group out of Stanford with a focus on topological data analysis. They’re taking a page out of the cartography book, along with mathematical equations, to map relationships in data, so it fits nicely on top of Hadoop as a data science platform. It is very powerful for visualizing insights into data: they run the data through topological data analysis in a number of different ways, try to find the strength of correlations in the data, and plot the results visually. The idea behind their technology is the visualization of complex data sets. The big challenge for the scientific research community is that they have all this data but don’t know what questions to ask – you need some intuition about what it might mean. One example is breast cancer research, where they came up with a useful visualization. All of this works on top of HDP.

insideAI News: What’s the story behind Spark becoming an important part of the Hortonworks big data solution?

Tim Hall: The story behind it is customer demand. We were working on HBase, Hive, Storm and a number of other data access engines on top of YARN and HDFS when Spark emerged as a new player in the data science community. We started getting requests from customers asking whether we were going to support Spark, what it would take to run it as a YARN-native engine, and what we could do to accelerate that.

From a Hortonworks perspective, we don’t randomly drop technology into HDP. There’s a very methodical and thoughtful approach that’s based on our business model: how do we provide the best support on planet Earth for the components that make up HDP? To do that, we have to have committers. A committer is someone who works within those Apache projects and can change and manipulate the code. We have the largest number of committers across the projects that we support. What this means for our customers is that we have a deep understanding of all the technology, because we have the people who are writing the code – so when we find a defect or an issue that needs to be addressed, we have people who can actually change the code. This is different from other vendors who might not be as deep into the projects. To get to this level with Spark, we needed Spark committers. There are two ways to go about that. One is to hire people who already have that status. The other is to go through the “minting process,” where you contribute your code back to the community, reach “rock star” status, and the community votes you in as a committer. We’ve been working both of those approaches for some time. Our goal is to have 4-5 Spark committers by the end of 2015.

People always ask me what it’s like being a product manager at an open-source company as opposed to a closed-source one, and I say it’s actually very similar. I am the steward of requirements from our customers and partners, feeding those back into our engineering team, who then take them to the Apache Software Foundation and the various projects run there. I’m the conduit for requests from our executive team and the sales team, for what we want to do strategically as a company, for what customers are trying to do with the platform, and for what partners like HP, Microsoft and Teradata are trying to do and how we can make the technologies easier to use and adopt. I put all of that in a blender and make a delicious smoothie every six months!

It’s a matter of prioritizing what we want to do first, second and third, getting the engineers to scope those things, and seeing whether we can land them on a six-month or annual basis. The question we ask ourselves is how to deliver those things in sequence, because dependencies sometimes emerge in between them, so we have to drill through all of these items, get them landed in the community and get to work.

insideAI News: Looking out to the future, say a year or so, are there any important initiatives to look forward to?

Tim Hall: Yes, lots of things! One of the things we’re going to push hard on is user experience – improvements for the Hadoop operator, the developer and the data steward. We’re also going to continue to accelerate the performance and SQL breadth that Hive, the de facto SQL engine on Hadoop, offers. We have an initiative underway to drive query latency below the five-second range and move toward sub-second, even for relatively large data sets. The effort to track is called LLAP – not “live long and prosper,” as Mr. Spock would say, but “Live Long and Process”; you’ll often see #LLAP in conjunction with Hadoop. The idea is to take the best of what we’ve learned about data processing at scale and use in-memory caching to make it as high-performance as possible. We’re removing the serialization and deserialization penalty. We spill to disk when data is not being accessed. We’re holding things in memory, sort of like a column server, so you know exactly which columns of data are present and can rapidly bring those things together.
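Whatever LLAP ends up doing under the hood, applications will still reach interactive Hive the way they do today – over JDBC against HiveServer2. A minimal sketch follows; the host name, credentials and table are placeholders, not details from the interview.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (shipped with HDP clients).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Standard HiveServer2 URL; host, port and database are
        // placeholders for a real cluster.
        String url = "jdbc:hive2://hiveserver.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // "web_logs" is a hypothetical table used for illustration.
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```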

We’re also going to continue to push the big three around enterprise readiness: security, governance and operations. We started with operations because of the scale of Hadoop – deploying across 1,000 nodes is where tools matter. If you have to do one thing to 1,000 machines, that’s still 1,000 steps, so automation around that was key early on. We’re getting a lot of assistance from our customers in taking Apache Ambari to the next level of scalability. We now see support for 1,000-2,000-node clusters, and the idea is to optimize that and drive it even higher; we’re starting to get customers with even larger clusters. When you look at where we are on the curve with the adoption of the first 450+ customers, nobody starts with a 1,000-node cluster. A year out, I’d expect to see Ambari handle multi-thousand-node clusters with ease, including rolling-upgrade capabilities. You want to keep the platform running; as more and more mission-critical workloads land on top, you don’t want downtime, ever.
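The automation Hall describes is driven through Ambari’s REST API, so one operation against 1,000 machines becomes a single call to the management layer. Below is a hedged sketch that lists a cluster’s hosts using Java 11’s built-in HTTP client; the server, the cluster name “prod” and the credentials are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AmbariHostsSketch {
    public static void main(String[] args) throws Exception {
        // Ambari listens on port 8080 by default; the host, cluster
        // name and admin credentials below are placeholders.
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://ambari.example.com:8080"
                        + "/api/v1/clusters/prod/hosts"))
                .header("Authorization", "Basic " + auth)
                .GET()
                .build();

        // The response is a JSON document listing every host Ambari
        // manages in the cluster.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```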

Security will continue to deepen, especially around this notion of transparent data encryption.

And governance is sort of the new frontier. We have stable footing around operations and security now, and governance is the next thing that needs to be addressed, because people are landing more and more data sets and they need a discovery mechanism to classify and find the data that’s already landed. Say I have a new user coming in who wants to do a particular use case but doesn’t know whether the data has already landed in the lake – can they do a search to find it?
