Tuesday, September 29 • 16:00 - 16:50
Netflix: Integrating Spark at Petabyte Scale - Cheolsoo Park, Netflix and Ashwin Shankar, Netflix

The Big Data Platform team at Netflix maintains a cloud-based data warehouse with over 10 petabytes of data stored predominantly in Parquet format. Our platform has traditionally leveraged Pig for ETL processing, Hive for large analytic workloads, and Presto for interactive and exploratory use cases. Spark long seemed an attractive complement to our platform, but technical gaps prevented effective use at scale in our environment. Recent improvements have allowed us to add Spark to our cloud data architecture and have it interoperate seamlessly with the other tools and services in our stack.

We will go into detail about our deployment configuration and what it takes to run Spark alongside traditional workloads on YARN. We will share examples of a few of our largest workflows translated to Spark for comparison in terms of both performance and complexity.

Cheolsoo Park

Senior Software Engineer, Netflix
Cheolsoo Park is an Apache Pig PMC member and Spark contributor. He is also a senior software engineer at Netflix and works on cloud-based big data analytics infrastructure that leverages open source technologies including Hadoop, Hive, Pig, and Spark.

Ashwin Shankar

Ashwin Shankar is an Apache Hadoop and Spark contributor. He is a senior software engineer at Netflix and is passionate about developing features and debugging problems in large scale distributed systems. Ashwin holds a Master's degree in Computer Science from University of Illinois...

Tuesday September 29, 2015 16:00 - 16:50 CEST
