Overlooking the expansive grand ballroom of the San Jose Convention Center from the so-called “sky box” observation room, I was waiting for my next interview appointment with Peter Crossley of Webtrends. I found myself reflecting back to the late 90s, when I ran a web development firm and used the premier weblog analysis tool of the day, Webtrends Log Analyzer, for all my client deployments. Webtrends pioneered web analytics, and it was heartening to see how the company has stayed strong throughout the years. I was eager to learn how it is using big data to solve important business problems.
The interview that follows is with Peter Crossley, Director of Product Architecture at Webtrends, discussing his company's data platform centered on Hadoop and Spark.
Daniel – Managing Editor, insideAI News
insideAI News: Webtrends is an old company. What is it doing these days?
Peter Crossley: Web logs are our legacy, but that's just one property. Other areas like IoT represent even more data – data sets coming from phones and other sources. What Webtrends does today is provide digital marketing solutions for all of these digital properties. We take data from sites, phones, and anything with a sensor that can emit data – IoT.
We've built a data pipeline so that at the point something emits data – it could be a click (one row in a weblog represents a click), it could be a video being watched, whatever – we capture it. From the point that we receive that event, to the time we persist it, to the point where we can turn around and start serving it back in queryable form is about 40ms. That includes processing the data set – understanding device information, geo data, and so on. We can also augment it with other external data sources. We have the ability to recognize whether it's the same person, so we can get the visitor's lifetime order value and history. And we can do all of this within the session, while the visitor is interacting with the property, phone, device, whatever it may be.
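To make the enrichment step concrete, here is a minimal sketch in Scala of the kind of per-event augmentation described above – device, geo, and visitor-history lookups attached to a raw click. The event fields and lookup functions are hypothetical stand-ins, not Webtrends' actual schema or services.

```scala
// Hypothetical event shapes; not Webtrends' actual schema.
case class RawEvent(visitorId: String, ip: String, userAgent: String,
                    action: String, timestampMs: Long)

case class EnrichedEvent(raw: RawEvent, country: String, deviceType: String,
                         lifetimeOrderValue: Double)

object Enricher {
  // Stand-in lookups; in practice these would query a geo-IP database,
  // a user-agent parser, and a visitor-history store.
  def geo(ip: String): String = "US"
  def device(ua: String): String =
    if (ua.contains("Mobile")) "phone" else "desktop"
  def lifetimeValue(visitorId: String): Double = 0.0

  // Attach device, geo, and lifetime-value context to a single event --
  // the kind of work that has to fit inside the ~40ms budget described above.
  def enrich(e: RawEvent): EnrichedEvent =
    EnrichedEvent(e, geo(e.ip), device(e.userAgent), lifetimeValue(e.visitorId))
}
```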
Webtrends has gone from batch log files to acting as things happen – taking actual insights from the data and redirecting to email remarketing vendors and interstitial website optimization campaigns. We know, historically and from some machine learning algorithms we're working on, that the 7-minute mark of a customer's engagement on a website is pretty critical to whether they purchase something. So let's do something to retain them and actually have them convert – take those data points and understand what the visitor is doing. That's where Webtrends is now: we look at visitors holistically, see how they can be better engaged, and be contextually relevant to them.
insideAI News: Can you describe your current product offering?
Peter Crossley: We have multiple products. Last July 14 we soft-launched a new product called Webtrends Explore, which gives you the ability to explore your data ad hoc, in any fashion or form, with unlimited data shape or size. It's step one of a many-step process we're going through right now. We've built a new Infinity Engine, with which we ingest all of this data in real time, versus our old Log Analyzer-type product, which was a batch log process – 12 to 24 hours of processing to see static reports. That doesn't work in the world today, so we had to develop a new solution. The Log Analyzer product has been around for a while, and after many years of enhancements some form of it is still in production today. That said, it still dealt with aggregates. We needed to get down to the visitor and to the event level. We needed to know what was happening for that person – they go from a phone, to a device, to a laptop; how is that multi-channel attribution model applied? Hadoop gives us the ability to do that through technologies like Kafka, Spark, Storm, and Samza.
insideAI News: When did that pivot point occur, when you decided you needed something like Hadoop?
Peter Crossley: About four years ago we were at a customer conference in London, and our company officials were discussing how we needed to change the way we deliver data. We saw that people wanted more ad hoc capabilities. The most important step we took was that we didn't worry about data persistence; we worried about data consumption. The path from when we get the event to the time we can process it needs to be as quick as possible. That's where we spent our efforts first.
We delivered a product called Streams, which had the ability to take and visualize email retargeting use cases, abandoned cart value, what is happening on your site right now – and see it in real time. “Real time” is such a bad term, because some people's real time is five minutes and other people's real time is something very different. When we say real time, it means right now. Streams provided that framework, and we extended it using Kafka technology. We were using Kafka before it was part of the Apache project. We took that technology and then added persistence. Originally we were just streaming the data out, but now we have applied the persistence aspect.
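As a rough illustration of the streaming side, here is a minimal Scala sketch that publishes a click event to Kafka, keyed by visitor so one partition sees a visitor's session in order – downstream consumers can then both stream it out (the Streams use case) and persist it. It uses today's Apache Kafka client API rather than the pre-Apache API Webtrends would have started with, and the broker address, topic, and payload are assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object StreamPublisher {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed local broker
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Key by visitor ID so all of a visitor's events land on one partition,
    // preserving session order for both live consumers and persistence.
    producer.send(new ProducerRecord("events", "visitor-123",
      """{"action":"click","ts":1445000000000}"""))
    producer.close()
  }
}
```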
insideAI News: Is your use of Hadoop technology in production, part of your product?
Peter Crossley: Yes.
insideAI News: How did you decide which Hadoop distribution to use?
Peter Crossley: The way we approached that problem was a little unique. We had some technologists – myself and some others – and we were originally running Hadoop 1 from Apache source, in a limited fashion, for some heatmap capabilities that we had acquired. Through that process, we wanted to do more and expand. Because we were running from source, true to the main development line, it didn't make sense for us to go anywhere outside of that line. As we started investigating which distribution to use – we didn't want to build it ourselves every time we deployed – we realized right away that in order to be close to where the new capabilities were happening, we had to stay close to the source, and Hortonworks provides that by contributing back. We really didn't spend time looking outside. It was purely a technology decision.
We also chose services to support our operations. We wanted to have someone to call if we needed help with an upgrade, a migration, and so on. Hortonworks provided the best capabilities there.
insideAI News: Did Webtrends have Hadoop skillsets before you engaged Hortonworks?
Peter Crossley: Yes, we're pretty advanced technologically, and we understand these concepts. We have developers who had never touched MapReduce now learning it, all within six months. Some of that training came from Hortonworks, which has also given us guidelines for operational best practices. We have people who are willing to learn this technology. We are a technology company, and we invest heavily in people to develop those skills; our developers are constantly evolving. Webtrends prides itself on its ability to be flexible, and to adjust and pivot when we need to.
Until about five years ago, when we took on big data and started to run Hadoop, we were running a lot of things on .NET, Microsoft SQL Server, and Windows boxes. We went from that to open source Hadoop. When I first started pushing the transition, it was not an easy pill to swallow. But we had the right support, and the writing was on the wall: if we wanted to be able to do this, we had to make that jump, so we invested in a large-scale cluster. When you're growing at a pace of 500 TB per quarter – a petabyte every six months – you're committed.
insideAI News: Was Hortonworks there for you when you were climbing this mountain?
Peter Crossley: Yes. We had the technical chops to do it, but we wanted a partner to give us assistance if something catastrophic happened. They helped us with the DR cluster – we run two clusters, a primary cluster and a DR cluster – and they've been working with us to determine the different application strategies. We run 60+ nodes of Hadoop, plus another 30-40 nodes of Spark, and we're going to migrate that onto YARN to make one 150-node cluster.
Our demands are changing constantly, and the Hadoop infrastructure gives you the ability to change: you can classify machines differently, and you can add new roles onto them. You build a data pipeline – you move data, you adjust data, you export data – and then you have all these processes in the pipeline, but where do you put the data? It turns out it's Hadoop: HDFS, Hive, and HBase. It's the ability to constantly refine your data set, and to keep multiple copies of it because each copy solves a different business purpose. That's the key.
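Here is a minimal Spark sketch of what "multiple copies for different business purposes" can look like, under assumed paths and column names: the same raw events are written once partitioned by date for ad hoc Hive-style reporting, and once laid out by visitor for per-visitor history lookups (a simplified stand-in for a visitor-keyed HBase table).

```scala
import org.apache.spark.sql.SparkSession

object RefinePipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("refine").getOrCreate()

    // Raw events persisted in HDFS exactly as received (assumed path/schema).
    val raw = spark.read.json("hdfs:///data/raw/events")

    // Copy 1: partitioned by date for ad hoc reporting and Hive queries.
    raw.write.mode("overwrite")
      .partitionBy("date")
      .parquet("hdfs:///data/refined/by-date")

    // Copy 2: clustered by visitor for per-visitor history lookups,
    // standing in here for a visitor-keyed HBase layout.
    raw.repartition(raw("visitorId"))
      .write.mode("overwrite")
      .parquet("hdfs:///data/refined/by-visitor")

    spark.stop()
  }
}
```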
insideAI News: When did you make the transition to using Spark?
Peter Crossley: It was on the day of the Red Bull jump in October 2012. That was the first day we really dug into Spark, because we needed a way to process the volume of data that came back from Red Bull in a very short period of time – millions of events. Normally it would take a couple of days to process that amount of data. The 0.7 version of Spark did it in 28 minutes. When we figured this out, we told ourselves that this technology was powerful, and we saw that Spark was a way to express our data set. We didn't go to production that day, but it made us start thinking differently about how to look at our data. That was the big pivot point.
insideAI News: Were you using Spark at Hortonworks' direction?
Peter Crossley: No, Hortonworks wasn't even in the picture yet. We were already using Hadoop – HDP via the open source distribution – before we were a customer. But then we thought: we're going into production now, so we're going to need a partner to provide assistance.
We're pleased that Hortonworks has climbed on the Spark bandwagon. We had an early committer on Spark, before it became hot: Sean McNamara has worked closely with the Berkeley AMPLab and has done work on Spark Streaming for production use.
So we're really pushing the envelope. We see Spark as the way to express and explore the data that's available in HDFS, Hadoop, and other data sources, and we've put a lot of emphasis on expressing data out of Spark. We don't do a lot of data cleansing on the input: what the customer sends us is what we persist. We ask for a specific structure, but it isn't necessarily sanitized. We can do transformations on the data as a real-time stream.
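A minimal sketch of that idea – transforming unsanitized input as a stream rather than cleansing it at ingest – using Spark Streaming's DStream API. The socket source and the normalization rules are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamTransform {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("transform").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Raw, unsanitized lines arrive as-is (assumed socket source for the demo).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Persist what was sent, but derive a normalized view on the fly:
    // drop empty rows, trim whitespace, lowercase for consistent matching.
    val normalized = lines.filter(_.trim.nonEmpty).map(_.trim.toLowerCase)
    normalized.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```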
insideAI News: What is Webtrends' use of machine learning?
Peter Crossley: We use LDA and cluster analysis so that, as the data is happening, we can start classifying people's behavior by their propensity to take certain actions – this person is going to buy, abandon, or convert. So let's rank them, give them values, and when we want to act on those items, engage them with emails or display ad offers, or send them to different optimization campaigns through the testing and targeting solutions we offer.
So across this whole real-time effort, we're using models, streaming the data back in, and having queries running through Spark emit that data so we can take action on it while the user is interacting with the site. We don't have to wait 24 hours for that data to be generated. This is why we put so much emphasis on the data pipeline: if we didn't have the data in a timely fashion, we wouldn't be able to make decisions fast enough.
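For the cluster-analysis side of this, here is a minimal Spark MLlib sketch that groups visitors by behavioral features so each cluster can be mapped to an action (email, display ad, optimization campaign). The feature names and toy data are hypothetical, and k-means stands in here for the LDA and clustering work Peter describes.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object PropensityClusters {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("propensity")
      .master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy behavioral features: minutes on site, pages viewed, cart adds.
    val visitors = Seq(
      ("v1", 2.0, 3.0, 0.0),
      ("v2", 9.0, 14.0, 2.0),
      ("v3", 7.5, 10.0, 1.0)
    ).toDF("visitorId", "minutes", "pages", "cartAdds")

    val features = new VectorAssembler()
      .setInputCols(Array("minutes", "pages", "cartAdds"))
      .setOutputCol("features")
      .transform(visitors)

    // Three clusters as stand-ins for buy / abandon / convert propensities;
    // each cluster ID would then be mapped to a remarketing action.
    val model = new KMeans().setK(3).setSeed(42L).fit(features)
    model.transform(features).select("visitorId", "prediction").show()
  }
}
```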
insideAI News: What does the future hold for Webtrends?
Peter Crossley: That's my job: knowing where we are now and bridging the gap to the future. IoT is critical for us – sensor data and the ability to have a meaningful impact on users. We have refrigerators with data sensors now. We live in a data-rich environment; we can collect anything, so let's take that to the next level and be able to make predictions.
Technology needs to be as simple as a toaster, and we need to make our technology, and the data we receive from it, almost that simple. It shouldn't matter who is consuming the data or how – it can be processed, and you can take action on it.
We talked to a high-end retailer who was interested in putting sensors on clothing hangers: when you walk into the store and buy something, they know which hanger was removed when you tried something on, so when you go home you get an email about items now in stock that you didn't try on. Having that physical-to-digital merge is critical.