Apache Spark is the Smartphone of Big Data

In this special guest feature, Denny Lee of Databricks talks about the versatility of Spark, essentially comparing it to the Swiss Army knife you bring along on that camping trip called Big Data/Analytics. Hadoop is often used in tandem with sister data analysis tools: tools that let you rapidly examine "real-time" data such as tweets, or ask questions of data via the familiar SQL query language. Spark lets you do all of this from a single piece of software. Denny is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud environments. His key focus is solving complex large-scale data problems, providing not only architectural direction but also the hands-on implementation of these systems. He has extensive experience building greenfield teams as well as acting as a turnaround/change catalyst. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).

Similar to the way the smartphone changed how we communicate, far beyond its original goal of mobile voice telephony, Apache Spark is revolutionizing Big Data. While portability may have been the catalyst of the mobile revolution, it was the ability of one device to perform multiple tasks well, combined with the ease of building and using a diverse range of applications, that made it ubiquitous. Ultimately, with the smartphone we have a general platform that has changed the way we communicate, socialize, work, and play. The smartphone has not only replaced older technologies but combined them in a way that led to new types of user experiences. Applying this analogy to the Big Data space: Apache Spark seeks to be the general platform that is changing the way we work with and understand data.

The need for speed

No, this is not a Fast & Furious reference

The genesis of Apache Spark was the desire to provide the flexibility and extensibility of Hadoop MapReduce at significantly faster speeds. The problem dates back to the inefficiencies of running machine learning algorithms at UC Berkeley. But this was not only about time lost to long-running executions, or about inefficiencies such as errors in long-running jobs revealing themselves only at late stages and forcing a fix and a complete re-run; a crucial component is that long-running queries interrupt your train of thought.

When a data scientist, data engineer, or analyst needs the results from labor-intensive number crunching, the usual time window is "now." But what is the definition of "now"? Loosely, in terms of analytics, "now" means about seven pieces of information in less than 30 seconds. From a short-term memory perspective, this definition is based on Miller's Law and on the Atkinson–Shiffrin model of short-term memory, respectively. The concept is that your short-term memory can only hold a finite amount of information for a short period of time. Miller's Law holds that the number of objects the average person can hold in working memory is about seven, which is often cited as the reason US phone numbers were seven digits.

This is particularly important with data exploration because, after this short period of time, you start losing your train of thought. Faster response times are not just about efficiency; they make it easier for data explorers to retain and test their connections and hypotheses.

An integrated platform for data science

Apache Spark was conceived to meet the need for iterative applications and interactive queries. But it quickly changed course once it became clear that Spark's value extended well beyond high-performance machine learning.

"We quickly shifted our focus to design a multi-purpose framework that was a generalization of MapReduce, as opposed to something you'd only run alongside it. The reason is that we saw we could do a lot of other workloads beyond machine learning and that it was much easier for users to have a single API and engine to combine them in." — Matei Zaharia, Co-founder and CTO of Databricks, Creator of Spark

What we have today is an ecosystem that can solve many different data problems, including machine learning (via MLlib), graph computation (via GraphX), streaming (via Spark Streaming), and interactive query processing (via Spark SQL and DataFrames). The ability to use the same framework to solve many different problems and use cases allows data professionals to focus on solving their data problems instead of learning and maintaining a different tool for each use case.

With MLlib, Spark users can utilize a scalable machine learning library to build predictive models, as in Toyota's Customer 360 Insights Platform. GraphX enables Spark users to interactively build, transform, and reason about graph-structured data at scale, as in how Alibaba solves its complex network problems. Spark Streaming provides the ability to build powerful streaming applications, as in how Netflix handles real-time data monitoring. And with Spark SQL and DataFrames, Spark users can build a just-in-time data warehouse for business analytics, such as the one built by Edmunds.
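To make the "one framework, many workloads" idea concrete, here is a minimal PySpark sketch that runs a SQL aggregation and then trains an MLlib clustering model against the same data on the same engine. The events data and column names here are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("one-engine-many-workloads").getOrCreate()

# Hypothetical purchase events; in practice this would come from your data source.
events = spark.createDataFrame(
    [(1, 20.0, 3), (1, 35.0, 5), (2, 8.0, 1), (3, 50.0, 7)],
    ["user_id", "amount", "visits"])

# Workload 1: interactive SQL over the data.
events.createOrReplaceTempView("events")
spark.sql("SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id").show()

# Workload 2: machine learning (MLlib) on the very same DataFrame.
features = VectorAssembler(inputCols=["amount", "visits"],
                           outputCol="features").transform(events)
model = KMeans(k=2, seed=42).fit(features)
print(model.clusterCenters())
```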

[Figure: the Spark ecosystem]

This is all built on top of Spark Core, the underlying general execution engine for the Spark platform. It provides in-memory computing capabilities to deliver speed, and a generalized execution model to support a wide variety of applications. What makes Spark even more powerful is that you can use your language of choice (Scala, Python, Java, SQL, or R), so your focus is on making sense of your data instead of learning a new tool and language.
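As a small illustration of that language flexibility, the same aggregation can be expressed through the Python DataFrame API or through SQL, and both run on the same engine. A sketch with made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("language-of-choice").getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["key", "value"])

# The same logic, twice: once in Python...
df.groupBy("key").sum("value").show()

# ...and once in SQL, against the very same DataFrame.
df.createOrReplaceTempView("t")
spark.sql("SELECT key, SUM(value) AS sum_value FROM t GROUP BY key").show()
```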

Changing the way we look at data

With Spark, your focus is on making sense of your data rather than on the technologies used to mine it. While the technologies are important, it is what those technologies can do for you, namely making it easier and faster to make sense of your data, that is the heart of the Big Data revolution. Instead of having many disparate pieces of technology stitched together in a custom or proprietary fashion, Spark provides you a general platform and the necessary tools so that your focus stays on your data problems.

But the impact is not limited to individual data scientists, data engineers, or analysts; it can have a profound effect on the way organizations work. For example, when working with traditional data solutions, it is common for different teams to work in silos. While this organizational structure allows those teams to focus on their areas of domain expertise, they rarely collaborate with each other, often missing key data learnings or techniques. In large organizations, it is not uncommon to see multiple data warehousing, analyst, and data science teams unknowingly working on the same set of data. Even when they are working on the same project, there are strict boundaries between the data warehousing, ETL, and BI teams based on the technologies they have deployed. Within this environment, you could build a machine learning team that focuses on just ML technologies and a separate team for graph-structured data. Each would build with its own technologies and often work in isolation, so basic business rules and definitions would be lost or unknown as the data progresses from one team to another.

Instead of these silos, Spark can help your organization break down these barriers, because those same data scientists, data engineers, and analysts are using the same platform. They can more quickly iterate through their ideas to discover what models work, and more importantly, what models do not. For example, a data scientist can build a recommendation engine utilizing a set of machine learning algorithms in Python or R. At the same time, an analyst can use those same algorithms through Spark SQL and DataFrames without necessarily understanding the details behind them. With both the analyst and the data scientist using the same platform, they can quickly access each other's analysis in their own language of choice (in this case SQL and Python); there is no need to learn a new technology, platform, or language.
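A minimal sketch of that hand-off, assuming a hypothetical ratings dataset: the data scientist trains a collaborative-filtering model with MLlib's ALS in Python and publishes the results as a view, and the analyst consumes them in plain SQL.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("shared-platform").getOrCreate()

# Hypothetical (user, item, rating) rows standing in for real interaction logs.
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 2.0), (1, 10, 5.0), (1, 12, 3.0), (2, 11, 4.0)],
    ["user", "item", "rating"])

# Data scientist: build the recommendation model in Python.
model = ALS(userCol="user", itemCol="item", ratingCol="rating",
            rank=5, maxIter=5, seed=42).fit(ratings)

# Expose the top-2 recommendations per user as a view...
model.recommendForAllUsers(2).createOrReplaceTempView("recommendations")

# ...so an analyst can work with the results in SQL alone.
spark.sql("SELECT * FROM recommendations").show(truncate=False)
```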

This ability to do many things on the same platform is not just about efficiency; it changes the way we make data products. For example, in the upcoming webinar, Transitioning from Traditional DW to Spark in OR Predictive Modeling, Beth Israel Deaconess Medical Center (a Harvard academic teaching hospital) describes how they transitioned from their traditional data warehousing solutions to Spark SQL. Their traditional data warehousing specialists can continue to build their SQL solutions while their analysts continue to use their existing BI tools. But now, their data scientists can use their language of choice to run a logistic regression to model operating room schedules. Instead of extracting data out of one tool to be pushed into another, the data scientists can make use of the existing schemas and processes developed by their colleagues and extend them. Just as important, the data scientists can make these algorithms and functions available back to the data warehousing specialists and analysts so they can continue to use SQL while taking advantage of advanced analytics algorithms. Using the same platform, they can all work together to solve a diverse set of data problems.
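A hedged sketch of that pattern, assuming a hypothetical or_schedule table maintained by the warehousing team; the table and its column names are invented for illustration, not taken from the webinar.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("or-predictive-modeling").getOrCreate()

# Reuse the warehousing team's existing table instead of exporting the data.
# "or_schedule" and its columns are assumptions for this sketch.
cases = spark.table("or_schedule")

# Assemble features and fit a logistic regression in the data scientist's
# language of choice (here, Python). "overran" is an assumed 0/1 label column.
features = VectorAssembler(
    inputCols=["scheduled_minutes", "surgeon_case_load"],
    outputCol="features").transform(cases)
model = LogisticRegression(labelCol="overran", featuresCol="features").fit(features)

# Publish the scored rows back as a view so analysts and BI tools
# can keep querying the results with plain SQL.
model.transform(features).createOrReplaceTempView("or_overrun_scores")
```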

The smartphone replaced many standalone navigation devices thanks to its embedded GPS and app store applications. It replaced the standalone point-and-shoot digital camera with its yearly megapixel improvements. Tasks ranging from note taking to calendar scheduling to social media are now mobile first. The smartphone is at the center of a rich ecosystem, serving as a powerful platform that can do it all. While Spark is fast and helps us understand our data, more fundamentally it is about how Spark changes the way we do data engineering and data science. It allows us to solve a diverse range of data problems, from machine learning to streaming, in the language of our choice. Analogous to smartphone apps, Spark also offers a diverse library of Spark Packages that let you share and extend the platform. Just as the smartphone revolutionized telephony, Spark is revolutionizing Big Data.


Download insideAI News: An Insider’s Guide to Apache Spark