Interview: Matt Winkler, Group Program Manager for Machine Learning at Microsoft

In this podcast interview we caught up with Matt Winkler, Group Program Manager for Machine Learning at Microsoft, where he leads a team crafting tools and services to enable data scientists and developers to do more with their data. Originally from St. Louis, Matt has been at Microsoft for 11 years working on developer tools and cloud services such as the .NET Framework, Visual Studio, Azure Websites, Data Lake and HDInsight. Matt has a bachelor’s degree from Denison University and an MBA from Washington University in St. Louis.

insideAI News: Welcome to the insideAI News Podcast, I’m Rich Brueckner with insideAI News and today my guest is Matt Winkler from Microsoft. Matt, how are you doing today?

Matt Winkler: I’m doing great!

insideAI News: Great. I’m here at a GPU conference learning all about deep learning, machine learning, and some of the stuff your company is up to in that space. Lots of developments.

Matt Winkler: Awesome.

insideAI News: Matt, let’s start with you. Can you tell us a little bit about your background and your day job at Microsoft?

Matt Winkler: Sure. I’ve been at the company for about 11 years and I’ve been doing all sorts of stuff, but working in a lot of the cloud data space for the last six or seven years. Most recently, I’ve moved to focus on machine learning and Azure: some of the things you mentioned about deep learning, and GPU acceleration, and all of that cool stuff. I work on the product management team, working to understand what’s the right product we should go build and then doing that.

insideAI News: You bring cloud and Azure, and I was wondering, how do you think that’s affected this field of data science that’s out there?

Matt Winkler: Yes, I have a pretty obvious bias towards the cloud. To me, I think it’s affected it really in a couple of different ways. I think one, fundamentally it’s made a number of scenarios that were simply not possible, or not accessible, available to more people. The story around GPUs is a really interesting use-case for that. A lot of folks aren’t in a position to go and provision a cluster that’s running in their data center that’s got a ton of GPUs in it. It’s not cost-effective. They’re not running a workload where they need those GPUs running at 100%, 24 hours a day, 365 days a year. So the cloud really lets them get that capacity on demand. I think one of the interesting things we’ve seen is people that go and start small: they start on their local machine, and now they get something that’s kind of interesting and they want to try that at scale. Or they want to train on more data. They’re able to do that very, very easily. The other thing that’s important is flexibility, which is where you can rapidly create and tear down new environments – that really opens up the ability to play with new technology a lot faster. For instance, “Hey, I’d like to try this thing I saw about Spark.” “Great, you can get a Spark cluster.” “Oh, now I’d like to do TensorFlow or the Cognitive Toolkit.” “Great, you can go try that.” You don’t necessarily have to have the, “Oh, hey, now I need IT to go and provision a new cluster for me, and they have to tear down the old one, but somebody’s using it.” You don’t really have to mess around with any of that.

insideAI News: Boy, that can take months to spin up that kind of hardware, right? I mean, get it all provisioned and such. So yeah, the flexibility is there. I’m just curious, Matt, when I log into Azure, is all that waiting to go, and I just click on the options I want, and my cluster is up waiting for data?

Matt Winkler: Yes. There’s a couple different things that we’ve got in Azure depending upon what you want to do. One of the easiest ways to start off with is we’ve got something that we call the data science VM. It’s just a VM that you can deploy on a little machine or a big machine or a GPU machine that’s set up and pre-configured with a bunch of machine learning toolkits and frameworks, ready for you to go and try out. If it’s GPU-enabled, we’ve got all of the right CUDA drivers set up for the GPUs. So you can, kind of, just get up and go. We’ve also got the Azure Machine Learning Service, which is a more full-featured managed service for building experiments. Then when you get a model that’s really interesting, you can publish it as a web service. There’s two other things when you want to scale out. The first is a managed Spark offering. Tell me how many nodes you want and then in 10 or so minutes, you’ll get that cluster up and running. The final one is a service that we call Azure Batch, which is really about getting you a lot of compute on demand, and we’ve got recipes for that. If you want to use CNTK, or TensorFlow, or Caffe to do massive scale-out deep learning on GPU machines, you can do that very easily.

insideAI News: So you guys run a lot of your own jobs on this Azure Cloud and lift lots of data, I would imagine, doing it.

Matt Winkler: Oh yes. If you look at the heritage of data compute, and you rewind the clock 15 years, the companies that were doing a ton of innovation in the space of taking large amounts of unstructured data and deriving insight from that were really built around the internet search problem, which is: how do I cheaply, reliably, durably capture a copy of the Internet, and then ask it a bunch of questions? If you think about that, that’s what sparked the Google File System paper and the MapReduce paper. Inside Microsoft, we were doing those things as well on top of a system that we call Cosmos, which is really the data lake for Microsoft.

insideAI News: I don’t know if you’re eating your own dog food here, but it sounds like you’re giving Microsoft a competitive advantage, because not many folks have access to that kind of compute obviously. But along the way, you’re able to learn and offer that to your customers.

Matt Winkler: It’s been fascinating, because the Cosmos system at Microsoft has been kind of foundational in transforming the company to be data-driven. In some ways, it gives us a way to look ahead because we’ve got this size and scale of data problem. If you think about IoT scenarios where you’re capturing sensor data, people are going to have these volumes of data. It gives us a way to look at where we think our customers will be going as they find themselves in a similar situation to get this type of insight and be data-driven and operate on top of these large, large amounts of data.

insideAI News: What would you say are the big data challenges for analytics in the cloud? I mean, you haven’t cracked all the nuts, I wouldn’t think, at this point. It’s too new.

Matt Winkler: Honestly, in some ways, I think it’s very much the same problems as on prem. I actually think the cloud makes a number of things about big data and machine learning challenges much, much easier because it’s easier to get hardware. It’s easier to scale. It’s easier to integrate and separate out compute and storage. But the fundamental problem that we see with all machine learning and big data engagements is about whether the project is set up for success. I like to joke, when I was on our big data team, I would show up to do briefings for customers. There would be folks who would say, “I just bought some Hadoop. Help me make it do something.” And these systems – it’s not that you throw in millions of documents, then it just starts telling you, “Hey! Did you realize that in Topeka you’re having trouble selling widgets?” You have to have a methodology. You have to define a problem. Where we see a lot of projects fail is when people don’t have a really clearly defined problem. You could kind of say, “Hey, this is just like a lot of other software projects.” You have to be really clear about what’s the problem you’re trying to solve. What’s the insight you want to get? So my advice is always about starting small around a simple use-case of understanding customer churn or predicting inventory replenishment levels or something like that. There’s a lot of prior patterns you can follow to start getting some early wins and start building up the muscle on how to solve problems there. But in most cases, I actually see that customers can get there faster with the cloud because they don’t have to – in some ways they don’t have to prove out the value to justify the investment. Because you can just say, “Hey, for two weeks we’re going to run a cluster in the cloud. And if this thing doesn’t work out then we’re out what we spent on compute and that’s it.” You don’t have to wait for – let’s get the cluster set up, then let’s try and do something and if it doesn’t work then we’re going to be in big trouble.

insideAI News: Well, Matt, how hands-on do you have to be with a customer like that, who wants to do this large instance or set of instances, right? Do you have to spend weeks working with them to get ready or is it more simple than that?

Matt Winkler: It depends. We find some folks that want to just do data warehousing but they want to do it bigger and cheaper. Those are usually pretty straightforward. In the machine learning space, it’s interesting. You can get to insight very quickly, on the order of days or in the first week sometimes. Where we see folks spend a ton of time is not even on the machine learning part of it; it’s much more on: where do I go find the data? How do I get the data in the right shape? How do I clean it? That’s where we see a ton of time spent.

insideAI News: Oh sure, sure. What are the five Vs for data, right? It’s an interesting set of problems. But Microsoft with this Azure capability I think is– it’s really kind of mind-boggling because suddenly there’s no roof, right? I could only afford $1 million worth of servers before. But now, if I only need it for two weeks, I could literally have thousands and thousands of servers at my disposal in a short amount of time.

Matt Winkler: Yes, and we see a lot of folks who want to take advantage of that. The other thing is it gives you the ability to partition out your work differently. “Hey, I’ve been really happy with using Hive, but now I want to try SparkR. Now I want to try Flink.” Okay, that’s cool. It can still operate on the same data and access all of the same data, so you can much more easily play around with different approaches and find the one that’s going to work best. One of the things we’ll talk about a lot with customers is the rate of experimentation. In some ways, it’s super valuable for me to get to a wrong answer faster because then I can learn something to get close to the right one faster.

insideAI News: Well, I can’t let you go, Matt, without asking the data movement question, because some of these jobs spin up more data as you compute and how do you get that out of the cloud? Moving it over a wire is painful.

Matt Winkler: Physics can be a challenge. I think we see a couple of different patterns. Oftentimes, we will see the cloud being used to land the big data, and a big data system will be used to either pre-process that or train a machine learning model. The resulting output is ridiculously smaller than the input. A classic example that we’ll see is IoT scenarios where they’re uploading data at – there’s a record for every 25 milliseconds for every sensor on a device times hundreds of devices. They want to put that into their data warehouse, but nobody actually needs to access the 25 millisecond granularity when they’re sitting in Excel or Power BI. So what they’ll do is, they’ll use a Hadoop cluster or Spark cluster to perform the initial aggregation to go from 25 milliseconds to hourly roll-ups. That’s the data that they’ll pull into the data warehouse that sits on prem. I don’t want to underrepresent the data movement challenge, because it certainly is something that you have to think about, but a lot of times what it comes down to is folks being able to use the different pieces and parts for what they’re good for and kind of reduce the data down so that the wire’s not the bottleneck. Then there are other cases where if you need to move around a few terabytes of data, you’re going to have to move that around and it’s going to take a little while.
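The roll-up Matt describes, going from one record every 25 milliseconds to one row per sensor per hour, can be sketched in a few lines. This is only an illustration of the idea using the Python standard library, not the Hadoop or Spark job a customer would actually run; the record layout `(device_id, sensor_id, timestamp, value)`, the function names, and the synthetic readings are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def hourly_rollup(records):
    """Collapse (device_id, sensor_id, timestamp, value) records into
    one summary row per device, sensor, and hour."""
    buckets = defaultdict(
        lambda: {"count": 0, "total": 0.0, "min": float("inf"), "max": float("-inf")}
    )
    for device_id, sensor_id, ts, value in records:
        # Truncate the timestamp to the start of its hour to form the bucket key.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        b = buckets[(device_id, sensor_id, hour)]
        b["count"] += 1
        b["total"] += value
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
    return {
        key: {
            "count": b["count"],
            "mean": b["total"] / b["count"],
            "min": b["min"],
            "max": b["max"],
        }
        for key, b in buckets.items()
    }


def synthetic_readings(device_id, sensor_id, start, hours):
    """Hypothetical data source: one made-up reading every 25 ms."""
    step = timedelta(milliseconds=25)
    n = hours * 3600 * 1000 // 25  # 144,000 readings per hour
    for i in range(n):
        yield (device_id, sensor_id, start + i * step, float(i % 100))


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary start time
    rows = hourly_rollup(synthetic_readings("dev-1", "temp", start, hours=2))
    print(len(rows), "hourly rows from", 2 * 144_000, "raw records")
```

At 25 milliseconds per record, a single sensor produces 144,000 records an hour, so each hourly row stands in for 144,000 raw ones. That reduction is why the aggregated output is so much cheaper to move over the wire than the input.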

insideAI News: Well Matt, before you go, just one more question about where do you see this stuff going? I mean, you see the obstacles of what your customers want to get done. Is it more accelerators? Is it containers? What are the trends you’re seeing?

Matt Winkler: I think there’s a couple of things that are happening that are interesting. I think there’s kind of a resurgence in the hardware space, kind of between the CPU, GPU and FPGA and custom ASICs, that really seems to be interesting, where there’s a ton of innovation happening. Ultimately, I think the customer wins on that because things will get faster in lots of different dimensions for you. I think the other space is that I do see a ton of folks that are looking at containerization and how that layers into their architecture to give them more flexibility. I think in some ways, folks are learning from containerization some of the benefits that they get from the cloud in terms of, “Hey, you can pretty easily create a new container that is shaped like this with processor and memory. You can do the same thing with containers,” and so you’re starting to kind of see some of those patterns pop up anywhere you can run containers. So we see a lot there. I think there’s a huge explosion in innovation in the machine learning and artificial intelligence space. That’s kind of exploding along both the dimension of making impossible things possible as well as making things easier to do. I think if you look at things like the cognitive services that we have in Azure, it’s a really cool way of letting people who don’t know how to build a computer vision model put computer vision capabilities into their application with just a couple lines of code. So, I think, that’s one of these ways that you’ll see more of this become more broadly available to folks, without kind of giving them the, “Oh, by the way, you have to go learn how to do computer vision.”

insideAI News: Yeah, no small task, yeah. Well, so Matt this has been fascinating and I’m really glad that we could get together and talk about this today. And thanks for coming on the show.

Matt Winkler: Yeah, definitely and I look forward to chatting with you again sometime.

insideAI News: You bet. Okay folks, this is Rich Brueckner for Inside Big Data and I’ll give out a big “Power to the data.” We’ll see you next time.
