Interview: Michael Stonebraker, Adjunct Professor, MIT Computer Science and AI Laboratory (CSAIL)

Today, big data has implications across all industries, from healthcare, automotive, and telecom to IoT and security. As the data deluge continues, we are finding newer ways of managing and analyzing data to gather actionable insights and grapple with the challenges of security and privacy.

The Association for Computing Machinery (ACM) just concluded a celebration of 50 years of the ACM A.M. Turing Award (commonly known as the “Nobel Prize of computing”) with a two-day conference in San Francisco. The conference brought together some of the brightest minds in computing to explore how computing has evolved and where the field is headed. Big data was the focus of a number of panels and discussions at the conference. The following is a discussion with Michael Stonebraker, Adjunct Professor, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); ACM 2014 Turing Award Recipient.

Question: Gartner estimates there are currently about 4.9 billion connected devices (cars, homes, appliances, industrial equipment, etc.) generating data. This is expected to reach 25 billion by 2020. What do you see as some of the primary challenges and opportunities this wave of data will create?

Michael Stonebraker: From my point of view, there are three potential problems with Big Data. These can be broken into the three “V’s.” It can be a volume problem, meaning you have too much data; the data is coming at you too fast and it’s a velocity problem; or there is data coming at you from too many sources and it’s a variety problem.

Let’s take a look at each one of the three V’s respectively.

If you have a volume problem and you’re interested solely in running SQL-style business intelligence on a lot of data, in the data warehouse market, there are at least a few dozen production petascale warehouses that do exactly this day in and day out. In that regard, if you just want to do business intelligence, the volume problem is basically solved and shouldn’t get much harder in the future.

The second “V,” velocity, is also fairly straightforward if all you’re looking to do is process data. If you want to process a million messages a second, current stream processing engines can do this quite easily. I’m not aware of anybody that wants to go faster than that. Though there may be applications that will require much faster velocity in the future, I don’t consider the velocity issue to be all that difficult.

Now we come to the third “V” and the one that I think creates the real problem, variety. When you have data coming at you from too many different sources, you run into a data integration challenge. As near as I can tell, that is what is causing problems for nearly every major enterprise on the planet. Most enterprises are siloed, meaning they have independently constructed data stores, perhaps for each business unit. The problem comes when these business units want to integrate their data. Perhaps the business units are storing customer data, and they want to identify common customers for cross-selling purposes. Merging that data would allow for better insights that could lead to cost savings and operational efficiencies. But with each silo having its own data store, merging the data becomes quite difficult, because there are typically no cross-unit customer identifiers. I think what is going to kill everybody isn’t necessarily the number of connected devices but the variety of independently constructed data sources that enterprises are going to want to put together. Whether you’re talking about healthcare, manufacturing, or financial services, all of these independently structured databases are going to be a killer. I consider this to be the 800-lb. gorilla in the corner.
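
As a rough illustration of the cross-silo matching problem described above, the following is a minimal sketch, not a production technique. It assumes two hypothetical business-unit customer lists (`sales` and `support`) that share no identifier, and it pairs records by fuzzy string similarity on name and city using Python's standard `difflib`. The record fields, the helper names, and the 0.85 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase and drop punctuation so trivially different spellings compare equal."""
    text = f"{record['name']} {record['city']}".lower()
    kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized customer records."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_customers(unit_a, unit_b, threshold=0.85):
    """Pair each record in unit_a with its most similar record in unit_b.

    A pair is kept only if the similarity reaches the (assumed) threshold.
    """
    matches = []
    for a in unit_a:
        best = max(unit_b, key=lambda b: similarity(a, b), default=None)
        if best is not None and similarity(a, best) >= threshold:
            matches.append((a["id"], best["id"]))
    return matches


# Hypothetical silos: each unit assigns its own, unrelated customer IDs.
sales = [
    {"id": "S-1", "name": "Acme Corp.", "city": "Boston"},
    {"id": "S-2", "name": "Globex Inc", "city": "Chicago"},
]
support = [
    {"id": "T-9", "name": "ACME Corp", "city": "Boston"},
    {"id": "T-7", "name": "Initech", "city": "Austin"},
]

matches = match_customers(sales, support)
print(matches)
```

Even in this toy case the result depends on a hand-picked threshold and on which fields are compared, and every additional silo brings its own formats and encodings, which hints at why the problem is hard at enterprise scale.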

Question: What kinds of problems does a siloed approach create in terms of the individual right to privacy, the types of information being gathered, and how that information is being analyzed?

Michael Stonebraker: Privacy is a really good Big Data question. Imagine this simple example: you show up at your doctor’s office and have an x-ray done and you want the doctor to run a query that shows who else has x-rays that look like yours, what was their diagnosis and what was the morbidity of the patients. That requires integrating essentially the country’s entire online medical databases and presumably would extend to multiple countries as well. While that is a daunting data integration challenge, because every hospital chain stores its data with different formats, different encodings for common terms, etc., the social value gained from solving it is just huge. But that also creates an incredibly difficult privacy problem, one that I believe is not a technical issue. Because by and large, if you’re looking for an interesting medical query, you’re not looking for common events; you’re looking for rare events, and at least to my knowledge, there aren’t any technical solutions that will allow access to rare events without indirectly disclosing who in fact the events belong to.

I view the privacy problem to be basically a legal problem. We have to have legal remedies in this area. I think there are tons of examples of data that can be assembled right now that will compromise privacy. Unfortunately, the social value to compromising privacy is pretty substantial. So, you can argue that technology has rendered privacy a moot question. Or you can argue it’s a legislative issue to figure out how to preserve privacy in the current world.

Question: What moral quandaries do you see arising from the increasing use of predictive data analytics? How do we overcome these challenges? We already talked about the privacy aspect of it; is there anything else?

Michael Stonebraker: The trouble with predictive models is that they are built by humans and humans by nature are prone to bias. If we look at the most recent presidential election, we see a spectacular failure of existing polling models. Twenty-twenty hindsight shows that nobody thought Trump could actually win, when in reality, it is far more likely the polling models were subtly biased against him.

Another great example of the problem with predictive analysis involved a school district that decided to evaluate teachers based on the compulsory tests administered to all students. The idea was to take the test scores of the students at the beginning of the year and the test scores at the beginning of the next year and compare them. A delta was computed, and teachers were graded based on the results. Unfortunately, one of the teachers in this school district received his results and found that on a scale from 0 to 100, where 100 is the best teacher and 0 is the worst, he scored a 6. This teacher was considered by parents, students and colleagues alike to be a spectacular teacher, yet he scored a 6.

After a lot of sleuthing, it turned out many of the students came from a particular classroom whose teacher had been faking test scores. The teacher in grade N had been altering the test scores to make them look better; this of course messed up the results for the teacher in grade N + 1. The students came in with artificially high test scores and, of course, the result in year N + 1 was not good. So, one year the teacher gets a 6, the next year he gets a 97.

So, the problem with predictive models is the models themselves. If they don’t account for fraud, bias, etc., they can give very bad answers. One has to take predictive models with a grain of salt. The problem with our society is we put way too much faith in predictive modeling.
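
The arithmetic behind the teacher-evaluation example can be sketched in a few lines. This is only an illustration under stated assumptions: the interview does not describe the district's actual scoring formula, so the sketch uses the simplest possible "delta" (the average per-student change between the two test sittings), and every score below is hypothetical. It shows how inflating the starting scores by a fixed amount turns a solid gain into an apparent loss.

```python
def growth_score(start_scores, next_start_scores):
    """Average per-student change between two test sittings (assumed scoring rule)."""
    deltas = [after - before for before, after in zip(start_scores, next_start_scores)]
    return sum(deltas) / len(deltas)


# Hypothetical data: what the students really knew entering grade N + 1,
# and what they scored at the beginning of the following year.
true_start = [60, 65, 70, 55]
next_year_start = [68, 72, 79, 63]

# The grade N teacher altered the recorded scores; assume a flat +15 inflation.
inflated_start = [score + 15 for score in true_start]

honest = growth_score(true_start, next_year_start)
distorted = growth_score(inflated_start, next_year_start)

print(f"delta with honest starting scores:   {honest:+.1f}")
print(f"delta with inflated starting scores: {distorted:+.1f}")
```

The model itself is computed correctly in both cases; only the input was corrupted, which is why a result like this cannot be caught by checking the formula alone.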

Question: Security is a hot topic regarding big data. To what extent will big data be responsible for new security problems and challenges? What are some of the immediate challenges in fixing those? Where does the responsibility lie? Who should be doing it? What are the biggest problems there?

Michael Stonebraker: Historically, most of the data breaches have been inside jobs (the Edward Snowden effect, for example). Although intrusion from afar gets a lot of the press, most of the actual data losses are inside jobs, meaning deliberate swiping of data by people who have access to it as part of their job, or employees that put their name and password on a yellow sticky on their terminal, which allows another employee to use their credentials. The trouble with the people protecting real-world data centers is they’re not paid all that well. I personally believe that paying top dollar for security personnel in corporate data centers would make a big difference and would remediate at least some of the insider-type data breaches. That’s not to say there isn’t an interesting technical problem, which is how do you guard against denial of service attacks and bad guys from afar attacking you, but to me, that’s not the biggest issue; the biggest issue is the people problem.

Question: In what ways can Big Data be better utilized for greater public benefit? Are there examples other than the medical x-ray one?

Michael Stonebraker: There are numerous examples of how Big Data could provide greater social benefit. Whether through increased competitiveness, substantially improved medical services, etc., I think most of these examples have some type of tie to a greater social good, but the issue that will need to be resolved is the privacy concern: how to get all of this without destroying individuals’ right to privacy, and I’m not sure I have an answer for that.

Question: Final thoughts?

Michael Stonebraker: Going back to what I said earlier about the 800-lb. gorilla in the corner: all of the fancy social benefit we expect from Big Data depends on seamless data integration. Solving the problem of how to improve data integration is going to be key in getting the most benefit from all the data being created.

 
