Data Science 101: Metaprogramming Python for Big Data

For many companies, understanding what is going on in the business involves lots of data. But how do you query tens of billions of data points? How can a company begin to make sense of so much information? The mainstream paradigms for processing large amounts of data, such as MapReduce and NoSQL, are based on distributed computing and massive horizontal scalability. Yet since the publication of Google's original MapReduce paper in 2004, the performance of a single high-end server has grown by a factor of 50.

The video presentation below comes from our friends at the San Francisco Python Meetup group. The talk discusses how AdRoll uses Python to squeeze every last bit of performance out of a single high-end server for interactive analysis of terabyte-scale data sets. This feat is made possible by Numba, a NumPy-aware dynamic Python compiler based on LLVM. Thanks to Python, the system can provide a very expressive and developer-friendly API while keeping the complexity of the implementation in check. The talk should be relevant to anyone interested in Big Data and High-Performance Computing with Python.
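To give a flavor of the approach, here is a minimal sketch (not AdRoll's actual code) of how Numba's @jit decorator compiles a NumPy-heavy Python loop to native machine code via LLVM, eliminating interpreter overhead on hot inner loops; the function and data shown are purely illustrative:

```python
import numpy as np
from numba import jit

@jit(nopython=True)  # compile to machine code, no Python object fallback
def count_above(values, threshold):
    # A tight scalar loop over a large array: in plain Python this would be
    # slow, but Numba compiles it to native code comparable to C.
    total = 0
    for i in range(values.shape[0]):
        if values[i] > threshold:
            total += 1
    return total

data = np.random.rand(10_000_000)
print(count_above(data, 0.5))
```

The first call triggers compilation; subsequent calls run the cached native code, which is what makes interactive queries over very large in-memory arrays practical on a single machine.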

The presenter is Ville Tuulos, Principal Engineer at AdRoll, a company that produces tons of big data. Previously, Ville was the CEO of Bitdeli, a big data startup that provided a platform for analyzing event streams using user-defined Python code. Ville is also the original author of the open-source Disco MapReduce framework, which has been powering Python-based big data processing at various companies since 2008.
