
Streaming - Pipelining - IoT
Monday, September 28


Large-Scale Stream Processing in the Hadoop Ecosystem - Gyula Fóra, SICS and Márton Balassi, Hungarian Academy of Sciences
Distributed stream processing is one of the hot topics in big data analytics today. An increasing number of applications are shifting from traditional static data sources to processing incoming data in real time. Large-scale stream processing and analysis require specialized tools and techniques, which have become publicly available in the last couple of years.

This talk will give a deep, technical overview of the top-level Apache stream processing landscape. We compare several frameworks including Spark, Storm, Samza and Flink. Our goal is to highlight the strengths and weaknesses of the individual systems in a project-neutral manner to help select the best tools for specific applications. We will touch on API expressivity, runtime architecture, performance, fault tolerance and strong use cases for the individual frameworks.


Márton Balassi

Solutions Architect, Cloudera
Márton Balassi is a Solutions Architect at Cloudera and a PMC member at Apache Flink. He focuses on Big Data application development, especially in the streaming space. Márton is a regular contributor to open source and has been a speaker at a number of Big Data related conferences...

Gyula Fóra

Researcher, Distributed Systems, SICS
Gyula is a committer and PMC member for the Apache Flink project, currently working as a researcher at the Swedish Institute of Computer Science. His main expertise and interest are real-time distributed data processing frameworks and their connections to other big data applications...

Monday September 28, 2015 10:30 - 11:20


Apache Kafka for High-Throughput Systems - Jane Wyngaard, Jet Propulsion Laboratory
While designed to provide high-throughput, low-latency real-time data feeds of relatively small messages on a large scale, Apache Kafka could offer a valuable service to alternative communities if operated at an even greater scale, managing streams on the order of 10Gb/s.

While others have achieved this level of throughput via topic scaling using greater node counts, this presentation will discuss achieving such rates over a single topic and in settings where hardware resource scaling is more limited.


Jane Wyngaard

PostDoc, University of Southern California
Currently a postdoctoral scholar at the University of Southern California, I primarily work on Big Data tools for Earth Science (currently Apache Kafka and OODT). But with a background in Mechatronics and a PhD in microprocessor design, I am most excited about the potential combining...

Monday September 28, 2015 14:00 - 14:50
Tuesday, September 29


High-Throughput Processing With Streaming-OODT - Michael Starch, NASA Jet Propulsion Laboratory
Upcoming customers of Streaming-OODT have predicted that their system will operate at data throughputs of 10Gb/s. To use Streaming-OODT at these throughputs and support customers’ needs, the system must be characterized and well understood. This presentation discusses Streaming-OODT’s performance when processing non-trivial data at these scales and the lessons learned from operation in this environment.

Streaming-OODT uses various other Apache technologies to support its mission to provide a high-performance data system. These technologies include Apache Kafka, Apache Spark, Spark Streaming, and Apache Mesos. This presentation will therefore discuss the performance of these technologies working together at high throughputs and the lessons learned orchestrating them for high performance as part of the Streaming-OODT system.


Michael Starch

Computer Engineer in Applications, NASA Jet Propulsion Laboratory
Michael Starch has been employed by the Jet Propulsion Laboratory for the past 5 years. His primary responsibilities include: engineering big data processing systems for handling scientific data, researching the next generation of big data technologies, and helping infuse these systems...

Tuesday September 29, 2015 10:30 - 11:20


The Best of Apache Kafka Architecture - Ranganathan B, ThoughtWorks
Big data event streaming is a very common part of any big data architecture. Of the available open source big data streaming technologies, Apache Kafka stands out because of its real-time, distributed, and reliable characteristics. These are made possible by the Kafka architecture. This talk highlights those features.


Ranganathan Balashanmugam

Head of Engineering - India, Aconex
Ranganathan has nearly twelve years of experience developing awesome products and loves to work on the full stack, from front end to back end and scale. He is Head of Engineering - India at Aconex and prior to that was Technology Lead at ThoughtWorks. He is a Microsoft MVP for Data...

Tuesday September 29, 2015 11:30 - 12:20


Being Ready for Apache Kafka: Today's Ecosystem and Future Roadmap - Michael Noll, Confluent
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others. After a brief introduction to Kafka, this talk will provide an update on the growth and status of the Kafka project community. The rest of the talk will focus on walking the audience through what's required to put Kafka in production. We’ll give an overview of the current Kafka ecosystem, including client libraries for creating your own apps, operational tools, and peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will also cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.


Michael Noll

Developer Evangelist, Confluent
Michael Noll is the developer evangelist of Confluent, the US startup founded in 2014 by the creators of Apache Kafka, who developed Kafka while at LinkedIn. Previously Michael was the technical lead of the Big Data platform of .COM/.NET DNS operator Verisign, where he grew the...

Tuesday September 29, 2015 14:00 - 14:50


Building a Highly-Scalable Open-Source Real-time Streaming Analytics System Using Spark SQL, Apache Geode (incubating), Spring XD and Apache Zeppelin (incubating) - Fred Melo, Pivotal
The Internet of Things requires new applications to consume data that streams in from connected devices and apply advanced real-time analytics. It also demands the ability to scale horizontally in order to support a large number of devices, while keeping extremely low latency for immediate data insights. How can you leverage open source software like Apache Geode (incubating), Spring XD, Docker, Apache Zeppelin (incubating), Apache Spark and Cloud Foundry/Lattice to quickly build a complete IoT solution? This presentation will walk you through the construction of a system leveraging these technologies and a Raspberry Pi with sensors, including a live demo that performs some real-time analytics on data captured during the conference.


Tuesday September 29, 2015 15:00 - 15:50


Fly the Coop! - Getting Big Data to Soar With Apache Falcon - Michael Miklavcic, Hortonworks
Getting your Data Lake to function like a reservoir doesn't happen by accident. From ETL to analytics, all enterprise-level big data jobs eventually need a reliable platform for automation and data lifecycle management. In this presentation we walk you through Apache Falcon and show real working code examples of data pipelines in action.


Michael Miklavcic

Systems Architect, Hortonworks
Michael is a software engineer with over ten years of industry experience and has been a Systems Architect with Hortonworks for the past two years. He is a code contributor to the Apache Falcon project and works directly with clients to implement solutions using Hadoop. For over 2...

Tuesday September 29, 2015 16:00 - 16:50
Wednesday, September 30


Integrating Fully-Managed Data Streaming Services with Apache Samza - Renato Marroquin, ETH Zurich
Interest in highly scalable stream processing engines has risen recently, and many projects have appeared as a result. Apache Samza is a distributed stream-processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance and resource management. It is one of the most popular stream processing engines and is used by many high-profile companies. Amazon Kinesis, on the other hand, is a fully managed service for real-time processing of streaming data that lets users scale the amount of data ingested without worrying about infrastructure details. This presentation gives a brief introduction to the very popular Samza-Kafka integration, then focuses on the new Samza-Kinesis integration and explains the new opportunities it opens up for users.


Renato Marroquin

PhD student, ETH Zurich
PhD student at ETH Zurich working with distributed databases. Computer Science Master's from the Pontifical University of Rio de Janeiro; worked with Apache Pig. Google Summer of Code participant, Apache Gora PMC Member and Committer, Open Source and Big Data Enthusiast. Renato has spoken...

Wednesday September 30, 2015 14:30 - 15:20


Deploying Spark Streaming with Kafka: Gotchas and Performance Analysis - Nishkam Ravi, Cloudera
Apache Spark is an in-memory compute engine that supports real-time data processing through the streaming API. Kafka is a popular publish-subscribe messaging system used for data ingest and distribution. The performance of Spark Streaming with Kafka is not well understood. In this talk, we will discuss different Spark Streaming APIs that can be used for receiving data from Kafka and evaluate their performance for complex event processing. We will also highlight some caveats and corresponding workarounds for best performance. We find that Spark+Kafka yields high throughput and sub-second latencies for complex events when configured properly.


Nishkam Ravi

Software Engineer, Cloudera
Nishkam is a Software Engineer at Cloudera. His current focus is Spark and MapReduce performance. Nishkam got his B.Tech from IIT-Bombay and PhD from Rutgers. His first job was with Intel as a compiler engineer. Prior to joining Cloudera, Nishkam was a Research Staff Member at NEC...

Wednesday September 30, 2015 15:30 - 16:20


Near Real Time Indexing Kafka Messages to Apache Blur using Spark Streaming - Dibyendu Bhattacharya, Pearson North America
Pearson is building a next-generation adaptive learning platform, and its Near Real Time (NRT) architecture is powered by Kafka and Spark Streaming. Pearson is also building a search infrastructure to index various learner data into Apache Blur, a Lucene-based distributed search solution on Hadoop. To support NRT indexing into Apache Blur, Pearson has designed a fault-tolerant and reliable low-level Kafka Consumer for Spark Streaming. This talk will cover why Pearson chose Apache Blur and how they designed this Kafka Consumer for Spark, which enabled NRT indexing into Blur. It will also cover the implementation details of the Spark-to-Blur connector for bulk indexing into Apache Blur using the Spark Hadoop API. The Spark-Blur connector has been contributed to the Apache Blur project (http://bit.ly/1HVWk7G) and the Kafka-Spark consumer has been contributed to spark-packages (http://bit.ly/1PRNNtM).


Dibyendu Bhattacharya

Big Data Architect, Pearson North America
Holds an MS in Software Systems and a B.Tech in Computer Science. Experience in building applications and products leveraging distributed computing and big data technologies. Working as Big Data Architect at Pearson, building adaptive learning platform to capture behavioral data across...

Wednesday September 30, 2015 16:30 - 17:20