Apache Spark – An Overview
Apache Spark is an open-source cluster computing framework originally developed in 2009 at the AMPLab at the University of California, Berkeley, and donated to the Apache Software Foundation in 2013, where it remains today. Spark allows for quick analysis and model development, and it provides access to the full data set, avoiding the need to subsample as is often required in environments like R. Spark also supports streaming, which can be used to build real-time models over full data sets. When a task is too big to process on a laptop or a single server, Spark lets you divide it into more manageable pieces; it then runs those pieces in memory on a cluster of servers, taking advantage of the cluster's collective memory. Spark processing is based on its Resilient Distributed Dataset (RDD) application programming interface (API).
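To make the RDD model concrete, here is a minimal word-count sketch in Scala, assuming a local cluster; the application name and the input path "input.txt" are placeholders, not anything prescribed by Spark:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal word count over an RDD; "input.txt" is a placeholder path.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")      // read lines into an RDD
      .flatMap(line => line.split("\\s+"))     // split each line into words
      .map(word => (word, 1))                  // pair each word with a count of 1
      .reduceByKey(_ + _)                      // sum the counts per word

    counts.take(10).foreach(println)           // print a sample of the results
    sc.stop()

Each transformation (flatMap, map, reduceByKey) produces a new RDD whose partitions are processed in parallel across the cluster's memory; nothing executes until an action such as take triggers the computation.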
Spark has gained considerable popularity in the big data world thanks to its computing performance and its wide array of libraries, including Spark SQL (with DataFrames), Spark Streaming, MLlib (machine learning), and GraphX. Spark SQL provides structured data processing: it offers a programming abstraction called DataFrames and can also act as a distributed SQL query engine. Spark Streaming enables scalable, high-throughput, fault-tolerant stream processing of live data streams. MLlib aims to make practical machine learning scalable and easy; it consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction. GraphX is Spark's component for graphs and graph-parallel computation.
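As a brief illustration of the DataFrame abstraction and the distributed SQL engine, the sketch below uses the Spark 2.x SparkSession entry point; the file "people.json" and its name and age columns are hypothetical:

    import org.apache.spark.sql.SparkSession

    // Minimal DataFrame sketch; the input file and its columns are illustrative.
    val spark = SparkSession.builder()
      .appName("DataFrameExample")
      .master("local[*]")
      .getOrCreate()

    val people = spark.read.json("people.json")   // schema inferred from the JSON

    // Query through the DataFrame API
    people.filter(people("age") > 21).select("name", "age").show()

    // The same query through the distributed SQL engine
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()

    spark.stop()

The same logical query can be expressed through the DataFrame API or as SQL against a temporary view; both are optimized into the same distributed execution plan.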