
Spark - Flink - Tajo - Cascading
Monday, September 28


Cascading 3 and Beyond - André Kelpe, Concurrent
Cascading is a mature, robust, and proven open source Java framework for developing data-driven applications. Cascading focuses on developer productivity by providing an easy-to-use API that enables developers to solve business problems without needing to become distributed-systems experts.

In this presentation, André Kelpe will introduce the Cascading ecosystem and will focus on Cascading 3, the new major version of Cascading. Cascading 3 features a brand-new query planner and rule engine, which enable Cascading to run on Apache Tez and make it possible to port Cascading to other computational platforms such as Apache Flink, Hazelcast, and others. Changing the computational platform enables developers to benefit from newer developments in the Big Data space without having to rewrite their applications.


André Kelpe

Senior Software Engineer, Concurrent Inc
André Kelpe works as a Senior Software Engineer at Concurrent Inc., the company behind Cascading and Driven. He works on all open source projects sponsored by Concurrent including Cascading itself and projects from the Cascading community. Previously André worked at TomTom, where...

Monday September 28, 2015 10:30 - 11:20


Apache Tez - Helping You Build Your Hadoop Big Data Engines - Bikas Saha, Hortonworks
YARN has opened up Hadoop to a variety of high-performance, purpose-built applications specialized for specific domains. Many of these need a common set of capabilities like scheduling, fault tolerance & scalability while not giving up on important aspects like multi-tenancy & security. We will provide an overview of how Apache Tez provides these capabilities via a dataflow-based API to model these applications and an extensible orchestration framework for optimal performance. We will cover broad ecosystem adoption by Apache Hive, Pig, Cascading, Scalding, Flink & commercial vendors and provide some experimental results. We will look at the Tez Web UI for progress monitoring and performance-debugging tools. Finally, we will look ahead at upcoming Tez features like hybrid execution, which enables new types of integration with existing systems.
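
Tez's dataflow API models an application as vertices (processing steps) connected by edges and executed as a DAG. As a rough, hypothetical illustration of that idea (in Python rather than Tez's Java API):

```python
from collections import defaultdict, deque

def run_dag(vertices, edges):
    """Run a tiny dataflow DAG in dependency order.

    vertices: dict of name -> function(list_of_inputs) -> output
    edges:    list of (src, dst) pairs; dst consumes src's output
    """
    indegree = {name: 0 for name in vertices}
    downstream = defaultdict(list)
    for src, dst in edges:
        indegree[dst] += 1
        downstream[src].append(dst)

    ready = deque(name for name, d in indegree.items() if d == 0)
    inputs = defaultdict(list)
    results = {}
    while ready:
        name = ready.popleft()
        results[name] = vertices[name](inputs[name])
        for dst in downstream[name]:
            inputs[dst].append(results[name])
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return results
```

A real engine such as Tez adds scheduling, data movement, fault tolerance, and multi-tenancy on top of this ordering; the sketch only shows the dataflow shape.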


Bikas Saha

Bikas is an active Apache community member and has contributed to the Apache Hadoop and Tez projects and focuses mainly on the distributed compute stack on Hadoop. He works for Hortonworks, a company that supports an open source based Apache Hadoop distribution. Bikas has spoken widely...

Monday September 28, 2015 11:30 - 12:20


Magellan: Geospatial Analytics on Spark - Ram Sriharsha, Hortonworks
Geospatial data is pervasive, and spatial context is a very rich signal of user intent and relevance in search and targeted advertising and an important variable in many predictive analytics applications. In this talk, we describe the motivation and the internals of an open source library that we are building for Geospatial Analytics using Spark SQL, DataFrames and Catalyst as the underlying engine. We outline how we leverage Catalyst’s pluggable optimizer to efficiently execute spatial joins, how SparkSQL’s powerful operators allow us to express geometric queries in a natural DSL, and discuss some of the geometric algorithms that we implemented in the library. We also describe the Python bindings that we expose, leveraging Pyspark’s Python integration.
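
The geometric core such a library rests on can be illustrated with a naive ray-casting point-in-polygon test and a brute-force spatial join in plain Python; Magellan's actual implementation (spatial joins optimized through Catalyst) is of course far more sophisticated:

```python
def point_in_polygon(x, y, polygon):
    # Ray casting: count how often a horizontal ray from (x, y)
    # crosses a polygon edge; an odd count means "inside".
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def spatial_join(points, polygons):
    # Naive O(n*m) join of points against named polygons.
    return [(p, name) for p in points
            for name, poly in polygons.items()
            if point_in_polygon(p[0], p[1], poly)]
```

The abstract's point is precisely that a pluggable optimizer like Catalyst can replace this quadratic loop with an efficient physical plan.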


Ram Sriharsha

Senior Member of Technical Staff, Hortonworks
Ram is currently Product Manager for Apache Spark at Databricks. Prior to joining Databricks, he was Principal Research Scientist at Yahoo Research where he worked on large scale machine learning algorithms and systems related to login risk detection, sponsored search advertising...

Monday September 28, 2015 11:30 - 12:20


Leveraging the Power of SOLR with SPARK - Johannes Weigend, QAware GmbH
SOLR is a distributed NoSQL database with impressive search capabilities. SPARK is the new star in the distributed computing universe. In this code-intense session we show how to combine both to solve real-time search and processing problems. We show how to set up a SOLR/SPARK combination from scratch and develop first jobs that run distributed on shared SOLR data. We also show how to use this combination for your next-generation BI platform.


Johannes Weigend

CTO, QAware GmbH
Johannes has worked as a software architect with Java since 1999 and was honoured as a "Java Rockstar" at JavaOne 2015. He is a lecturer at the University of Applied Sciences in Rosenheim, Germany, and technical director at QAware, a decorated software engineering company located in Munich...

Monday September 28, 2015 14:00 - 14:50


Apache Zeppelin - The Missing Component for the Spark Ecosystem - DuyHai Doan, Datastax
If you are interested in Big Data, you have surely heard about Spark, but do you know Apache Zeppelin? Did you know that it is possible to draw beautiful graphs from your Spark RDDs using a user-friendly interface?

In this session, I will introduce Zeppelin through live-coding examples and highlight its modular architecture, which allows you to plug in an interpreter for the back-end of your choice.
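
Zeppelin's modularity comes from routing each paragraph's `%interpreter` prefix to a pluggable back-end. A toy sketch of that dispatch (hypothetical, not Zeppelin's Java interpreter API):

```python
class Interpreter:
    def interpret(self, code):
        raise NotImplementedError

class PythonInterpreter(Interpreter):
    def interpret(self, code):
        return str(eval(code))  # toy: expressions only

class EchoInterpreter(Interpreter):
    def interpret(self, code):
        return code

class Notebook:
    def __init__(self):
        self.interpreters = {}

    def register(self, name, interpreter):
        self.interpreters[name] = interpreter

    def run(self, paragraph):
        # Zeppelin-style "%name" prefix selects the back-end.
        prefix, _, body = paragraph.partition(" ")
        return self.interpreters[prefix.lstrip("%")].interpret(body)
```

Swapping in a Spark, shell, or SQL interpreter is then just another `register` call, which is the architectural point the talk makes.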


DuyHai Doan

Technical Advocate, Datastax
DuyHai DOAN is an Apache Cassandra Evangelist at DataStax and a committer on Apache Zeppelin. He divides his time between technical presentations/meetups on Cassandra, coding on open source projects like Achilles and Apache Zeppelin to support the community, and helping companies...

Monday September 28, 2015 15:00 - 15:50


Integrating Apache Spark with an Enterprise Data Warehouse - Michael Wurst, IBM
This session will discuss the challenges and opportunities of integrating Apache Spark with enterprise data warehouses, especially the impact of columnar storage, using IBM DB2 and IBM dashDB as examples. We will show how columnar storage can help to increase scalability and reduce response time, especially when pushing down the processing of projections and aggregates to the database instead of processing them natively in Spark. Key takeaways from the session are: (1) how to benefit from the features of closed-source data warehouses from Spark without access to internal data structures; (2) the role of storage when working with large warehouses from Spark; (3) the opportunities of columnar vs. row-based storage; (4) how such an integration impacts end-to-end analytics based on Spark MLlib.


Michael Wurst

Architect / Senior Software Developer, IBM Research & Development
Michael Wurst, Ph.D. is a senior software engineer and architect at the IBM Research & Development Lab in Germany. He holds a Ph.D. in computer science and is responsible for the integration of open source analytics based on R, Python or Spark into IBM's Datawarehouse portfolio. Prior...

Monday September 28, 2015 16:00 - 16:50
Tuesday, September 29


Architecture of Flink's Streaming Runtime - Robert Metzger
Apache Flink is an open-source framework for parallel data analysis. The core of Flink is a distributed stream processing engine that provides exactly-once semantics, low-latency processing, and system-managed operator state. Flink's high-level programming APIs and support for batch processing make Flink a good choice for real-time data analysis. Apache Flink is one of the most active big data projects in the Apache Software Foundation and has more than 100 contributors.

This talk presents Flink's architecture and the design decisions that result in Flink's unique set of features. It discusses the pipelined execution engine for low-latency processing, operator state management, and fault tolerance mechanisms. It will also cover master high availability and system monitoring, and show a performance evaluation of the system.
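
The interaction of checkpointed operator state with source replay — the mechanism behind exactly-once results — can be caricatured in a few lines. This toy counter is only an analogy for Flink's distributed snapshots, not its implementation:

```python
class CountingOperator:
    """Toy operator whose state is checkpointed together with a
    source offset, so a crash + replay yields exactly-once counts."""
    def __init__(self):
        self.counts = {}
        self.position = 0            # offset into the source
        self._checkpoint = ({}, 0)

    def process(self, stream):
        # Consume only records not yet processed (from self.position on).
        for record in stream[self.position:]:
            self.counts[record] = self.counts.get(record, 0) + 1
            self.position += 1

    def checkpoint(self):
        # Snapshot state and offset atomically.
        self._checkpoint = (dict(self.counts), self.position)

    def restore(self):
        # After a crash, roll back to the snapshot and replay from there.
        state, pos = self._checkpoint
        self.counts, self.position = dict(state), pos
```

Because state and offset are snapshotted together, replaying the source after `restore()` never double-counts a record.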


Robert Metzger

Co-Founder and Software Engineer, data Artisans
Robert Metzger is a PMC member at Apache Flink and co-founder and software engineer at data Artisans. Robert studied Computer Science at TU Berlin and worked at IBM Germany and at the IBM Almaden Research Center in San Jose.

Tuesday September 29, 2015 10:30 - 11:20


Spark and Machine Learning to the aid of the Data Scientist - Frank Ketelaars, IBM and Andrey Vykhodtsev, IBM
In this talk, Andrey Vykhodtsev and Frank Ketelaars will discuss how data scientists benefit from the scalable machine learning capabilities of Spark. They will demonstrate a machine learning workflow and how it is executed using Apache Spark and MLlib. The ability to answer the needs of data scientists will improve further with the contribution of IBM Research's machine learning system (also known as SystemML) to the Apache Spark project, making it easier and quicker to express machine learning algorithms.
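
As a stand-in for the kind of algorithm such a workflow scales out, here is a self-contained gradient-descent linear regression; MLlib (and SystemML) distribute essentially this computation over partitioned data:

```python
def fit_linear(xs, ys, lr=0.05, epochs=1000):
    # Plain full-batch gradient descent on y = w*x + b, a single-machine
    # stand-in for MLlib-style regression over an RDD of labeled points.
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def predict(model, x):
    w, b = model
    return w * x + b
```

In a distributed setting the two gradient sums are exactly what each partition computes locally before a global reduce, which is why this algorithm maps so naturally onto Spark.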


Frank Ketelaars

Big Data Leader, IBM
Frank Ketelaars works as part of a European team focused on IBM Big Data solutions, including Hadoop and real-time analytical processing. In this capacity, Frank leads the European technical community and conducts Big Data architecture sessions with customers and business partners...

Andrey Vykhodtsev

Big Data Solution Architect, IBM
Andrey has broad expertise in analytics. He is a Big Data Solution Architect at IBM, responsible for the IBM Big Data product stack, which includes many open source components. Andrey educates and consults customers and IBM Business Partners across Central & Eastern Europe...

Tuesday September 29, 2015 11:30 - 12:20


Netflix: Integrating Spark at Petabyte Scale - Cheolsoo Park, Netflix and Ashwin Shankar, Netflix
The Big Data Platform team at Netflix maintains a cloud-based data warehouse with over 10 petabytes of data stored predominantly in Parquet format. Our platform has traditionally leveraged Pig for ETL processing, Hive for large analytic workloads, and Presto for interactive and exploratory use cases. For a long time, Spark seemed attractive to complement our platform, but technical gaps prevented effective use at scale in our environment. Recent improvements have allowed us to add Spark to our cloud data architecture and interoperate seamlessly with the other tools and services in our stack.

We will go into detail about our deployment configuration and what it takes to run Spark alongside traditional workloads on YARN. We will share examples of a few of our largest workflows translated to Spark for comparison in terms of both performance and complexity.


Cheolsoo Park

Senior Software Engineer, Netflix
Cheolsoo Park is an Apache Pig PMC member and Spark contributor. He is also a senior software engineer at Netflix and works on cloud-based big data analytics infrastructure that leverages open source technologies including Hadoop, Hive, Pig, and Spark.

Ashwin Shankar

Ashwin Shankar is an Apache Hadoop and Spark contributor. He is a senior software engineer at Netflix and is passionate about developing features and debugging problems in large scale distributed systems. Ashwin holds a Master's degree in Computer Science from University of Illinois...

Tuesday September 29, 2015 16:00 - 16:50
Wednesday, September 30


Configuring and Optimizing Spark Applications with Ease - Nishkam Ravi, Cloudera
The Spark API exports intuitive and performant one-liners for data processing, which hide complexity and allow applications to be developed quickly. As an in-memory system, Spark has to be configured properly for performance and stability, which can sometimes be challenging. Based on internal deployments and interactions with customers, we conclude that (i) most Spark woes can be traced back to misconfiguration, and (ii) there is a need for tools that can aid configuration, performance optimization, and debugging. In this talk, we will discuss common Spark configuration pitfalls and show how they can be avoided with the help of the auto-configuration and optimization tool being developed at Cloudera.
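
One pitfall of this kind is executor memory sizing on YARN: the container must fit the executor heap plus an off-heap overhead (by default, if memory serves, the larger of 384 MB or roughly 10% of the heap, controlled via spark.yarn.executor.memoryOverhead). A back-of-the-envelope check under those assumed defaults:

```python
def executor_fits(heap_mb, container_limit_mb,
                  overhead_fraction=0.10, overhead_min_mb=384):
    # A frequent misconfiguration: requesting a heap equal to the YARN
    # container size and forgetting the off-heap overhead added on top,
    # which gets the executor killed by YARN.
    overhead = max(overhead_min_mb, int(heap_mb * overhead_fraction))
    return heap_mb + overhead <= container_limit_mb
```

An 8 GB heap in an 8 GB container fails this check, while leaving headroom for the overhead passes; auto-configuration tools encode many such rules.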


Nishkam Ravi

Software Engineer, Cloudera
Nishkam is a Software Engineer at Cloudera. His current focus is Spark and MapReduce performance. Nishkam got his B.Tech from IIT-Bombay and PhD from Rutgers. His first job was with Intel as a compiler engineer. Prior to joining Cloudera, Nishkam was a Research Staff Member at NEC...

Wednesday September 30, 2015 10:00 - 10:50


Shared Memory Layer for Spark Applications - Dmitriy Setrakyan, GridGain
In this presentation we will talk about the need to share state across different Spark jobs and applications and several technologies that make this possible, including Tachyon and Apache Ignite. We will dive into the importance of in-memory file systems and shared in-memory RDDs with Apache Ignite, as well as present a hands-on demo demonstrating the advantages and disadvantages of one approach over another. We will also discuss the requirements of storing data off-heap in order to achieve large horizontal and vertical scale of applications using Spark and Ignite.
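
The core idea — state produced by one job surviving in memory for the next — can be mimicked with a trivial shared store. Tachyon and Ignite do this across processes and machines (and off-heap), which this sketch deliberately ignores:

```python
class SharedStore:
    """Stand-in for a shared in-memory layer: state written by one
    'job' survives and is visible to the next one."""
    _data = {}  # class-level, so every instance sees the same copy

    def put(self, key, value):
        SharedStore._data[key] = value

    def get(self, key):
        return SharedStore._data.get(key)

def job_a(store):
    # First job caches an expensive result.
    store.put("model", [1, 2, 3])

def job_b(store):
    # Second job reuses it without recomputation.
    return store.get("model")
```

Without such a layer, each Spark application would have to rebuild or re-read the data, which is exactly the overhead shared RDDs and in-memory file systems remove.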


Dmitriy Setrakyan

EVP Engineering, GridGain
Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior...

Wednesday September 30, 2015 11:00 - 11:50


Apache Ignite: In-Memory Data Fabric in Action - Dmitriy Setrakyan, GridGain
In this talk Dmitriy will dissect the Apache Ignite architecture. We will focus on how Apache Ignite data partitioning and replication work, and how computations are distributed and failed over in case of crashes. We will also talk about in-memory streaming in Ignite and various techniques we can employ to make it fault tolerant. To demonstrate how easy it is to get started with Ignite, Dmitriy will also run several Ignite coding examples live during the presentation.
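
Partitioning with backup copies can be sketched by hashing each key to a primary node and taking the next nodes on the ring as replicas; this is an illustrative simplification, not Ignite's actual affinity function:

```python
import hashlib

def partition(key, num_partitions):
    # Stable hash so every node maps a key to the same partition.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_partitions

def owners(key, nodes, backups=1):
    """Primary node plus `backups` replicas for a key, chosen by
    walking the node ring from the key's partition."""
    num = len(nodes)
    start = partition(key, num)
    return [nodes[(start + i) % num] for i in range(backups + 1)]
```

Because the mapping is deterministic, any node can locate a key's primary and backups without coordination, and a crashed primary's backup can take over.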


Dmitriy Setrakyan

EVP of Engineering, GridGain Systems
Dmitriy Setrakyan is founder and EVP of Engineering at GridGain Systems. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems...

Wednesday September 30, 2015 12:00 - 12:50


Realtime Reactive Apps with Actor Model and Apache Spark - Rahul Kumar, Sigmoid Analytics
Developing applications with big data is really challenging work; scaling, fault tolerance, and responsiveness are some of the biggest challenges. A real-time big data application with self-healing features is a dream these days. Apache Spark is a fast in-memory data processing system that provides a good backend for real-time applications. In this talk I will show how to use a reactive platform, the actor model, and the Apache Spark stack to develop a system that is responsive, resilient, fault-tolerant, and message-driven.
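
The actor model's building block — a mailbox drained by a single thread, so handler state never needs locks — can be sketched with the standard library; Akka adds supervision, routing, and distribution on top of this idea:

```python
import queue
import threading

class Actor:
    """Minimal mailbox actor: messages are processed one at a time
    by a dedicated thread, so the handler's state needs no locks."""
    def __init__(self, handler):
        self.handler = handler
        self.mailbox = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:          # poison pill stops the actor
                break
            self.handler(msg)

    def tell(self, msg):
        # Asynchronous, message-driven: tell() never blocks on the handler.
        self.mailbox.put(msg)

    def stop(self):
        self.mailbox.put(None)
        self.thread.join()
```

Pairing such actors with Spark (for example, actors feeding a streaming job) is the combination the talk describes; this sketch only shows the mailbox discipline.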


Rahul Kumar

Technical Lead, Sigmoid
Rahul Kumar works as a Technical Lead with Sigmoid. He has more than 4 years of experience in data-driven distributed application development with Java, Scala, and the Akka toolkit. He has developed various real-time data analytics applications using Apache Hadoop and Mesos ecosystem projects...

Wednesday September 30, 2015 12:00 - 12:50


Faster ETL Workflows Using Apache Pig & Spark - Praveen Rachabattuni, Sigmoid Analytics
Pig on Spark aims to combine the simplicity of Pig with the faster Spark execution engine and make Pig more appealing to developers. Currently, various contributors are working on the project toward a release-quality build. With Pig on Spark, significant performance benefits have already been observed in ETL workflows running on MapReduce: our initial benchmarks have shown a 2x-5x improvement over MapReduce. For a benchmarking test, we considered the ‘distinct’ operation. We used the wikistats dump for 25 days, with a size of 270 GB, on a cluster of one master and four worker machines (16 cores and 64 GB RAM each). It took about 14 minutes with Pig on Spark, compared to about 30 minutes on MapReduce. In this talk, Praveen will share the progress of the project with the community and help people take advantage of Pig on Spark in their workflows.
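
The quoted figures are internally consistent with the claimed range, as a quick back-of-the-envelope check shows:

```python
def speedup(baseline_minutes, new_minutes):
    # 30 min on MapReduce vs 14 min on Spark -> roughly 2.1x
    return baseline_minutes / new_minutes

def throughput_gb_per_min(size_gb, minutes):
    # 270 GB in 14 min -> roughly 19 GB/min with Pig on Spark
    return size_gb / minutes
```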


Praveen Rachabattuni

Technical Team Lead, SigmoidAnalytics
Praveen Rachabattuni is a technical team lead at Sigmoid Analytics. His areas of expertise include real-time big data analytics using open source technologies like Apache Spark, Shark, and Pig on Spark. He is working as a committer on the Apache Pig project and contributing to Pig...

Wednesday September 30, 2015 14:30 - 15:20


Apache Spark for High-Throughput Systems - Michael Starch, NASA Jet Propulsion Laboratory
Data systems are increasingly expected to support data rates nearing network bandwidth limitations of around 10 Gb/s. Apache Spark is capable of high throughput via distributed computing and is thus a good choice to support a data system in this environment; however, most technologies break down under these conditions. It is therefore essential that Apache Spark be characterized for production use at these scales.

This talk will discuss the approach to running Apache Spark at throughputs on the order of 10 Gb/s while performing non-trivial processing. This will give users a feel for Apache Spark's performance under the most demanding conditions. The setup of Apache Spark, the configuration used, and the resource requirements to process at this scale will be discussed. In addition, concrete takeaways will be provided for users desiring to push Apache Spark to this scale.
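
For orientation, the target rate converts as follows (payload only, ignoring protocol overhead); the per-node figure assumes a hypothetical even split across executors:

```python
def gbit_per_s_to_mbyte_per_s(gbits):
    # 10 Gb/s divided by 8 bits per byte = 1.25 GB/s = 1250 MB/s
    return gbits * 1000 / 8

def per_node_mbyte_per_s(gbits, nodes):
    # Even split of the aggregate rate across the cluster's nodes.
    return gbit_per_s_to_mbyte_per_s(gbits) / nodes
```

So a 10-node cluster must sustain on the order of 125 MB/s per node end to end, which is why configuration and resource sizing dominate at this scale.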


Michael Starch

Computer Engineer in Applications, NASA Jet Propulsion Laboratory
Michael Starch has been employed by the Jet Propulsion Laboratory for the past 5 years. His primary responsibilities include: engineering big data processing systems for handling scientific data, researching the next generation of big data technologies, and helping infuse these systems...

Wednesday September 30, 2015 15:30 - 16:20


Introduction to Apache Tajo: Data Warehouse for Big Data - Jihoon Son, Gruter
Apache Tajo is a data warehouse system for Web-scale data. It provides virtual integration of a multitude of diverse data sources, thereby facilitating easy and rapid data integration, which has been regarded as an essential but heavy step in business intelligence. In addition, it has a fault-tolerant distributed query engine for accelerating query speed. With its “query federation” and “distributed processing” capabilities, Tajo is able to provide users with reliable and efficient analysis of Web-scale data spread across multiple sources.

Jihoon Son will introduce Apache Tajo, including its overall architecture, current state, and challenges, and discuss the advantages that Tajo can bring to users. In addition, he will give a demo of integrated data analysis with Tajo.
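
Reduced to its essence, query federation means normalizing each source into a common row format and joining across them. A minimal illustration (CSV text standing in for one source, an in-memory list for another — not Tajo's actual engine):

```python
import csv
import io

def read_csv_source(text):
    # Source 1: CSV data, as if pulled from a file system.
    return list(csv.DictReader(io.StringIO(text)))

def federated_join(left_rows, right_rows, key):
    # A federated engine joins across sources after normalizing
    # each one into a common row (dict) representation.
    index = {r[key]: r for r in right_rows}
    return [{**l, **index[l[key]]} for l in left_rows if l[key] in index]
```

The heavy lifting in a real system — distributed execution, fault tolerance, and optimizing which work to push to which source — is exactly what the talk covers.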


Jihoon Son

Software Engineer, Gruter
Dr. Jihoon Son is a distributed systems engineer at Gruter, a Hadoop-based big data infrastructure company in South Korea. He is one of the co-founders of the Apache Tajo project, and now works on distributed query processing and query optimization in Tajo. He has several speaking...

Wednesday September 30, 2015 15:30 - 16:20