Interview: Daphne Koller, Chief Computing Officer, Calico; Adjunct Professor of Computer Science, Stanford University

Today, big data has implications across all industries, from healthcare, automotive, and telecom to IoT and security. As the data deluge continues, we are finding newer ways of managing and analyzing data to gather actionable insights and grapple with the challenges of security and privacy.

The Association for Computing Machinery (ACM) just concluded a celebration of 50 years of the ACM A.M. Turing Award (commonly known as the “Nobel Prize of computing”) with a two-day conference in San Francisco. The conference brought together some of the brightest minds in computing to explore how computing has evolved and where the field is headed. Big data was the focus of a number of panels and discussions at the conference. The following is a discussion with Daphne Koller, Chief Computing Officer, Calico Labs; Adjunct Professor of Computer Science, Stanford University; and recipient of the 2007 ACM-Infosys Foundation Award.

Question: There is an estimate from Gartner that says there are currently about 4.9 billion connected devices (cars, appliances, industrial equipment, etc.) that are generating data. The number of devices is expected to reach 25 billion within the next few years. What do you see as some of the primary challenges and opportunities that will come from this massive amount of data?

Daphne Koller: This is a really exciting time, but we need to be thoughtful about how we approach things because, while the opportunities are so great, there are also some important challenges.

Looking at the opportunities, it’s clear that we will see an impact from big data across nearly every industry. In my own focus area of healthcare, I expect we will see major changes in the way we approach personal health and the ability to monitor people as they go through their daily lives. We’ll likely be able to observe whether patients are taking their medication and whether they are beginning to exhibit symptoms or early warning signs of a disease. This will allow us to intervene early, and hopefully make a much bigger difference to outcomes. In some areas, we’re already beginning to see the benefits of data. Examples include navigating and routing automobile traffic more efficiently, smarter energy grids, and many more. Each of these examples represents an enormous opportunity, but I think we’re really only starting to glimpse the future potential.

Of course, there are challenges that come with these advancements as well, not the least of which are data privacy and cybersecurity. Another major challenge is the data itself. With so much new data out there, how do we look through it for the information that’s really valuable and significant? And then, once you have the data, how do you avoid the confounders and the biases that may not be immediately apparent but that might lead to misleading conclusions? Each of these challenges needs to be approached with a degree of thoughtfulness.

Question: Does the potential benefit of this data change the individual’s right to privacy, or have we lost the right to privacy?

Daphne Koller: I certainly would not say that an individual has lost their right to privacy; that is a fundamental and extremely important right. At the same time, we could be much better about maintaining privacy through processes of informed consent. In the case of healthcare, this is an area where there are strong norms and regulations regarding informing a patient about how their data are going to be used. That is not to say that informed consent is a perfect solution. In fact, in some cases informed consent can be written to be unnecessarily narrow, thereby reducing the usefulness of the data to other patients. But I believe there is now a movement to go back to patients and say, “Would you mind if your data were used to the benefit of other patients?” The exciting thing about this approach is that many people are actually eager to have their data be used to help others. There are so many people who are donating their data voluntarily to the benefit of science, and therefore, hopefully to the benefit of other patients. Fundamental to the whole issue of privacy is the notion of consent, and making sure that the users whose data are being used in this fashion are informed in advance and that they have the opportunity to opt out of having their data used. These measures would ensure that their privacy is not violated without their knowledge.

Informed consent also brings up some interesting high-level questions. For example, should we use an opt-in or an opt-out model for data sharing? Some will tell you that it has to be an opt-in model, and others will note that the opt-in rate is considerably lower than what you get if you ask people to opt out. A good example of this is organ donation. In countries that have moved to an opt-out model for organ donation, as opposed to the opt-in model used in the U.S., donation rates are considerably higher. This is a perfect case where you might, without violating the individual’s right to privacy, think about the tradeoff between the individual good and the public good and ask whether an opt-in or opt-out model is appropriate. It might be different in different scenarios, but it’s a question that should be asked.

Question: As predictive models are increasingly used, how do we avoid the biases you mention when interpreting and using data?

Daphne Koller: Bias will always be a challenge, and there isn’t a single, magic solution. The bigger question is: “How do we disentangle correlation from causation?” Again, I’ll turn to healthcare for an example: the gold standard in the medical field is the randomized controlled trial. In the case of web data, it’s called A/B testing, which is basically tech industry jargon for the same type of control. Although not perfect, a randomized controlled trial, or A/B test, is about as good a tool as we’ve been able to develop for addressing some of the confounders. Unfortunately, this type of control is not feasible in all cases. In cases where it’s not, processes must be carefully scrutinized to check for different confounders and to look for any and all correlations that give rise to the phenomenon being viewed. It’s a process that requires a lot of thought and a lot of care, and its importance cannot be overstated.
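To make the confounding problem concrete, here is a minimal sketch in Python. It is an editorial illustration, not something from the interview: the counts are invented, and the "severity" variable is an assumed confounder. Because sicker patients are more likely to be treated in this made-up observational data, the pooled comparison points the opposite way from the comparison within each severity group. Randomized assignment is meant to prevent exactly this kind of imbalance.

```python
# Invented counts for illustration only: (recovered, total) per group.
# "Severity" plays the role of a confounder: severe cases are treated more often.
COUNTS = {
    "mild": {"treated": (90, 100), "untreated": (800, 1000)},
    "severe": {"treated": (300, 1000), "untreated": (20, 100)},
}


def pooled_rate(arm):
    """Recovery rate for one arm, ignoring severity (the naive comparison)."""
    recovered = sum(COUNTS[severity][arm][0] for severity in COUNTS)
    total = sum(COUNTS[severity][arm][1] for severity in COUNTS)
    return recovered / total


def stratum_rates(arm):
    """Recovery rate for one arm within each severity group."""
    rates = {}
    for severity, arms in COUNTS.items():
        recovered, total = arms[arm]
        rates[severity] = recovered / total
    return rates


if __name__ == "__main__":
    print("pooled:", pooled_rate("treated"), pooled_rate("untreated"))
    print("by severity (treated):", stratum_rates("treated"))
    print("by severity (untreated):", stratum_rates("untreated"))
```

In this toy data the treated group looks worse overall, yet does better than the untreated group among both mild and severe cases. The naive correlation reflects who received treatment rather than what the treatment did.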

For example, sometimes there are biases in the data that are reflected in the conclusions that are drawn from the data. An interesting example of this relates to searches on certain sites, where “Steph” autocompletes to “Stephen” rather than “Stephanie” because Stephen was a more common search term. Some would say this is a gender bias and it should be eliminated. As a woman in tech, I can certainly relate to and understand that perspective. Some would also say that the data are what they are, and if Stephen is a more common search term than Stephanie, then do we really want to make the algorithm do something other than what is best for user efficiency? It’s a real quandary, and one can make legitimate arguments either way.
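The autocomplete behavior described above can be sketched in a few lines of Python. This is a hypothetical, frequency-only suggester written for illustration; the query counts are invented, and real search systems are far more involved. It shows how a rule that simply returns the most frequent match will reproduce whatever imbalance exists in the query log.

```python
# Invented query counts for illustration only.
QUERY_COUNTS = {
    "stephen": 120,
    "stephanie": 80,
    "stella": 50,
}


def complete(prefix, counts=QUERY_COUNTS):
    """Return the most frequently searched term starting with `prefix`, or None."""
    matches = [term for term in counts if term.startswith(prefix.lower())]
    if not matches:
        return None
    # Pure frequency ranking: whatever was searched most often wins.
    return max(matches, key=lambda term: counts[term])


if __name__ == "__main__":
    print(complete("Steph"))
    print(complete("Stepha"))
```

Any alternative, such as rotating suggestions or reweighting the counts, trades some of this efficiency for a different notion of fairness, which is the quandary raised in the answer.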

Question: What role can Big Data and machine learning play in helping scientists understand the data (for example, in the Human Genome Project) and bring forth some potential real-world opportunities in health and medicine?

Daphne Koller: One of the main reasons I came back to the healthcare field is that I think the opportunity here is so tremendous. As costs go down, our ability to sequence new genomes increases dramatically. And it’s not just genomes; it’s transcriptomes and proteomes and many other data modalities. When we combine that with the types of wearable devices I mentioned above that allow you to see the effect of phenotypes, there is an amazing explosion of data that we could access. One reason this is beneficial is that it will improve our ability to determine the genetic factors that cause certain diseases. Yes, we could do that before, but when faced with tens of millions of variations in the genome and only a couple hundred examples to use, it’s really difficult to extract much out of that except the very strongest signals. There are countless hypotheses, and it’s never clear which genetic factor actually caused the phenotype that you’re measuring. With the amount of data that we’re starting to see, you can start to evaluate specific rare mutations and their effect on a disease. This could allow us to identify genetic factors that cause the disease, but also factors that are protective.

For instance, there’s a subpopulation of individuals infected with HIV who won’t get AIDS; they are entirely asymptomatic. Is there something in their genome that causes that type of protection, and if so, what is it, and can we use it in developing better medicines? Both harmful and protective factors can provide insights that are helpful in developing new therapeutic interventions.
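The point in this answer about tens of millions of variants and "countless hypotheses" can be illustrated with simple arithmetic. The sketch below is an editorial addition and uses the Bonferroni correction, a standard adjustment that is not named in the interview. The variant count of 10 million and the significance level of 0.05 are assumptions chosen for illustration.

```python
# Assumed values for illustration; neither figure comes from the interview.
N_VARIANTS = 10_000_000  # stand-in for "tens of millions of variations"
ALPHA = 0.05             # conventional significance level for a single test


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at `alpha`."""
    return alpha / n_tests


def expected_false_positives(alpha, n_tests):
    """Expected number of false hits if every test is null and none are corrected."""
    return alpha * n_tests


if __name__ == "__main__":
    print("uncorrected false positives:", expected_false_positives(ALPHA, N_VARIANTS))
    print("corrected per-test cutoff:", bonferroni_threshold(ALPHA, N_VARIANTS))
```

Under these assumptions, testing every variant at 0.05 would be expected to flag about half a million variants by chance alone, while the corrected cutoff is about 5e-9. Reaching a cutoff that strict with only a couple hundred samples requires a very large effect, which is why only the strongest signals survive.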

Question: Are there any practical or potential breakthroughs on the horizon that are particularly exciting to you?

Daphne Koller: There are two sides that we can look at: one is data utilization and the other is data generation. On the data utilization side, we are starting to see the use of much more sophisticated machine learning techniques in a variety of domains where Big Data is actually being generated. For instance, look at the recent work from Google on machine translation, which basically took a step back and said: “We’re going to apply the most modern machine learning techniques because we now have enough data to really train very complex deep network models.” By doing so, they were able to do something that’s never been done before, which is to translate between two languages without ever having seen paired sentences from that pair of languages. It’s an amazing achievement. Essentially, the computer invented its own internal semantic representation for language, a kind of internal language that it uses as an intermediate point to convert between any pair of languages. The use of new machine learning algorithms is going to be especially transformative in biology as we start to get enough data to make this type of machine learning algorithm feasible.

On the other side, we’re starting to see technology that can generate very large amounts of data, for instance, single-cell RNA sequencing. It allows you to dig into the variability of individual cells and look at what makes a tumor cell different relative to the other cells around it. Equally important is the fact that by looking at individual cells, you can now obtain thousands of data points per sample as opposed to a single data point, which is the average of all of those cells. Again, this creates an opportunity for the kinds of large datasets to which machine learning can be applied.
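The contrast between one averaged data point and many per-cell data points can be shown with a toy example. This Python sketch is an editorial illustration with invented expression values for a single hypothetical gene; it is not a model of any real sequencing pipeline.

```python
from statistics import mean

# Invented expression levels for one hypothetical gene across 100 cells:
# 90 ordinary cells and 10 cells that express the gene very strongly.
CELL_EXPRESSION = [1.0] * 90 + [51.0] * 10


def bulk_measurement(expression):
    """A bulk assay reports one number per sample: the average over all cells."""
    return mean(expression)


def high_expressing_cells(expression, threshold):
    """A per-cell view can pick out the cells that exceed a threshold."""
    return [level for level in expression if level > threshold]


if __name__ == "__main__":
    print("bulk average:", bulk_measurement(CELL_EXPRESSION))
    print("cells above 10.0:", len(high_expressing_cells(CELL_EXPRESSION, 10.0)))
```

The single averaged value gives no hint that a small, distinct subpopulation exists, whereas the per-cell values both reveal it and supply 100 data points from the same sample.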

Question: In what ways can Big Data be better utilized for greater public benefit? Are there any other examples that come to mind that we haven’t touched on that you’d like to throw out there?

Daphne Koller: There are so many applications that it’s impossible to list all of them. One that’s not as commonly discussed is centered around agriculture. For example, you could measure crop yield at the individual field level. You could also tune your fertilizer schedule, your pesticide schedule, and your watering schedule for the needs of that specific field. By doing so you could save water and reduce the amount of pesticide that’s being used, as well as reduce the runoff of pesticide-polluted waters into the waterways. Once you have measurement devices that can provide the right data for machine learning algorithms, there will be many industries that will be ripe for transformation. In 10 or 15 years, our lives are just going to be completely different because of the combination of very dense, continuous measurements with really smart data analytics.

 
