ACM SIGMOD City, Country, Year

Introduction to Spark 2.0 for Database Researchers [Tutorial 8]

FRIDAY, July 1, 2016 (1:30pm - 5:00pm)

Abstract: Originally started as an academic research project at UC Berkeley, Apache Spark is one of the most popular open source projects for big data analytics. Over 1000 volunteers have contributed code to the project, it is supported by virtually every commercial vendor, and many universities now offer courses on Spark.

Spark has evolved significantly since the 2010 research paper: its foundational APIs are becoming more relational and structural with the introduction of the Catalyst relational optimizer, and its execution engine is developing quickly to adopt the latest research advances in database systems such as whole-stage code generation.

This tutorial is designed for database researchers (graduate students, faculty members, and industrial researchers) interested in a brief hands-on overview of Spark. It covers the core APIs for using Spark 2.0, including DataFrames, Datasets, SQL, streaming and machine learning pipelines. Each topic includes slide and lecture content along with hands-on use of a Spark cluster through a web-based notebook environment.

In addition, we will dive into the engine internals to discuss architectural design choices and their implications in practice. We will guide the audience to "hack" Spark by extending its query optimizer to speed up distributed join execution.

URL for the Slides: -- Registration Required


Michael Armbrust leads the development of Spark SQL, a project he created in 2014 that became the most popular component of Spark. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.

Doug Bateman is the Director of Training at Databricks and teaches advanced Spark classes for Databricks' customers. He has taught 800+ classes on Java, Spring, Hibernate, Python, and Android, and has 15+ years of software engineering experience.

Reynold Xin is a cofounder and Chief Architect for Spark at Databricks. Prior to Databricks, he was pursuing a PhD at the UC Berkeley AMPLab, advised by Michael Franklin. While at Berkeley, he worked on CrowdDB, Shark and GraphX. He also set the 2014 world record in 100 TB sorting (Daytona Gray), beating the previous 2013 Hadoop record by 30X on per-node efficiency.

Matei Zaharia is an assistant professor of computer science at MIT and CTO of Databricks, the company commercializing Apache Spark. He started the Spark project during his PhD at UC Berkeley. He is broadly interested in large-scale computer systems and networks, and has also contributed to projects including Mesos, Hadoop, Tachyon and Shark. Matei received the 2014 ACM Doctoral Dissertation award for his research on Spark, as well as best paper awards at NSDI and SIGCOMM.
