
Monday, September 28
 

08:00

Registration
For those staying in the hotel, please proceed to the Brasserie Restaurant for breakfast. For those not staying in the hotel, continental breakfast will be available in the Valletta II Foyer and Attendee Lounge daily.

Monday September 28, 2015 08:00 - 09:00
Valletta 1 Foyer

09:00

Keynote: The State of the Feather - Shane Curcuru, VP Brand Management, The Apache Software Foundation
An update on the Apache Software Foundation and an overview of its projects. A quick look at how the ASF works and the key events affecting the foundation in recent history.

 

Speakers
Shane Curcuru

VP, Brand Management, The Apache Software Foundation
Shane serves as V.P. of Brand Management for the ASF, setting trademark and brand policy for all 250+ Apache projects, and has served five times as Director and as a member and mentor for Conferences and the Incubator. Shane's Punderthings consultancy is here to help both companies and FOSS communities understand how to work together better. At home, Shane is: a father and husband, a Member of the ASF, a BMW driver and punny guy. Oh, and we…


Monday September 28, 2015 09:00 - 09:20
Grand Ballroom

09:20

Keynote: Big Science and Big Data at CERN - Dirk Duellmann, CERN
Dirk Duellmann, deputy leader of the data and storage services group in the IT department at CERN, will discuss the large scale (100 PB) data management and analysis infrastructure for physics data at the Large Hadron Collider (LHC).

He will compare the established scientific workflow with more recent deployments of Hadoop ecosystem components at CERN, which are rapidly gaining popularity for computing infrastructure analytics/optimisation studies and as an alternative to traditional database deployments. The presentation will conclude with an outlook on how big data analysis experience from scientific and mainstream applications may further complement each other.

Speakers
Dirk Duellmann

Deputy Leader, Data and Storage Services Group, IT Department, CERN


Monday September 28, 2015 09:20 - 09:40
Grand Ballroom

09:40

Keynote: Apache's Key Role in the Big Data Industry - Arun Murthy, Hortonworks
Speakers
Arun Murthy

Co-Founder, Hortonworks
I am a Founder and Architect of Hortonworks Inc., a software company that is helping to accelerate the development and adoption of Apache Hadoop. Hortonworks was formed in June 2011 by the key architects and core Hadoop committers from the Yahoo! Hadoop software engineering team. Funded by Yahoo! and Benchmark Capital, one of the preeminent technology…


Monday September 28, 2015 09:40 - 10:10
Grand Ballroom

10:10

Morning Break
Monday September 28, 2015 10:10 - 10:30
Valletta 2 Foyer

10:30

Data Science in the Travel Industry: Real-World Experience with Current Leading Frameworks - Paul Balm, Amadeus IT Group
Amadeus IT Group is a leading IT provider in the travel industry, processing 525 million bookings per year and boarding 700 million passengers on its airline IT systems. The Travel Intelligence Unit was formed in 2013 with the objective to organize the travel information of the world. Amadeus Travel Intelligence is leveraging big data to help all parties in the travel industry make more effective and quicker decisions.
We will show how Amadeus Travel Intelligence employs open source technologies to achieve its objective: processing based on Hadoop; visualization layer based on Ruby-on-Rails and HTML5; streaming based on Spark and Flink; and API level access through web-services. We will review typical project requirements and our experiences, such as the pitfalls of immature projects, missing functionalities, and communities that have moved on.

Speakers
Paul Balm

Data Scientist, Amadeus IT Group
Paul Balm joined Amadeus as a Data Scientist in September 2014. Before joining the Travel Intelligence unit at Amadeus, he worked on data processing systems for the European Space Agency since 2005. Paul holds a Ph.D. in particle physics from Fermi National Accelerator Laboratory in Chicago, IL (USA).



Monday September 28, 2015 10:30 - 11:20
Arany

10:30

Apache Bigtop: Where it Came From and Where It's Going - Nate DAmico
Following the mantra, “best tool for the job,” you seldom use a single open source tool for data processing. The more tools you use, however, the more you start to realize the difficulties of managing dependencies and configuring packages across components, projects, and versions. This is where the Apache Bigtop project and community come in. Come get an overview of the origins of Apache Bigtop, why organizations like Cloudera, Wandisco, and Amazon Web Services rely on Bigtop for their own big data component distribution efforts, and where the project is going after its summer 1.0 release.

Speakers
Nate DAmico

Nate has been working in the enterprise and mobile software industry for 14 years in various capacities. In recent years his tech efforts have focused around areas of mobile computer vision as well as the rise of the consumerization of IT Operations. Three years ago he started Reactor8, creating a set of open source tooling called DTK, to ease the pain of infrastructure/service developers and make advanced automation more approachable. Nate is…


Monday September 28, 2015 10:30 - 11:20
Krudy/Jokai

10:30

Geospatial querying in Apache Marmotta - Sergio Fernández, Redlink GmbH
Apache Marmotta provides different means of querying: SPARQL, LDPath, LDP, etc. GeoSPARQL extends the SPARQL constructs to represent and query geospatial data. The talk will present the ongoing effort to add GeoSPARQL support in Marmotta, going through the challenges and potential of this new set of features and demoing some of them during the talk.

The work is currently being developed in the context of the Google Summer of Code 2015, further details at https://wiki.apache.org/marmotta/GSoC/2015/MARMOTTA-584

Speakers
Sergio Fernández

Software Engineer, Redlink GmbH
I'm a software engineer specialized in innovation, with a focus on data architectures. My interests include distributed architectures, data integration, Linked Data and systems engineering. I've worked as a software engineer and project manager in different industries, but always somehow close to science, because I strongly believe that innovation can be achieved by equally using research and engineering. Therefore all my scientific contributions…


Monday September 28, 2015 10:30 - 11:20
Tohotom

10:30

HBase: State of the Database - Nick Dimiduk, Hortonworks
HBase is a mature, low-latency, distributed "big data" store. It is used in production by companies large and small, in all manner of industries. A vibrant and active developer and user community supports HBase, which means it is constantly improving, adapting to user needs and challenging deployments. In this talk, Nick provides an update on the latest happenings in core HBase and its recent and pending releases.

Speakers
Nick Dimiduk

Hortonworks
Nick Dimiduk is a committer and PMC member on both Apache HBase and Apache Phoenix. He's Release Manager for the HBase 1.1 branch and an author of the book HBase in Action, from Manning Press. Nick has also contributed to a number of Apache projects around HBase, including HTrace and Calcite. Nick works on the HBase team at Hortonworks, where his focus is on operability and performance.



Monday September 28, 2015 10:30 - 11:20
Huba

10:30

OpenPower, OpenStack & Big Data - Luis Ramirez, OpenCloud.es
This session shows how to deploy a big data solution where all components (infrastructure, OS, hypervisor, frameworks and apps) are open source and integrated with the cloud. In this lab Luis will show a different approach to these kinds of solutions and how other architectures can be used to match or improve the performance of a Big Data environment.

Speakers
Luis Ramirez

CEO, OpenCloud.es
IT professional specialized in open source, virtualization, HPC and cloud computing technology solutions, with extensive experience in pre-sales positions, development, implementation and project management, creation of departments and team management. With experience in international projects across EMEA and LATAM, I developed my career at IBM, Sun Microsystems, Oracle and Dell. Currently working for OpenCloud.ES, an open source services provider. Speaker…


Monday September 28, 2015 10:30 - 11:20
Tas

10:30

Cascading 3 and Beyond - André Kelpe, Concurrent
Cascading is a mature, robust and proven open source Java framework for developing data-driven applications. Cascading focuses on developer productivity by providing an easy-to-use API which enables developers to solve business problems without needing to become distributed systems experts.

In this presentation, André Kelpe will introduce the Cascading ecosystem and focus on Cascading 3, the new major version of Cascading. Cascading 3 features a brand-new query planner and rule engine, which enable Cascading to run on Apache Tez and make it possible to port it to other computational platforms such as Apache Flink, Hazelcast and others. Changing the computational platform enables developers to benefit from newer developments in the Big Data space without having to rewrite their applications.

Speakers
André Kelpe

Senior Software Engineer, Concurrent Inc
André Kelpe works as a Senior Software Engineer at Concurrent Inc., the company behind Cascading and Driven. He works on all open source projects sponsored by Concurrent including Cascading itself and projects from the Cascading community. Previously André worked at TomTom, where he introduced various BigData technologies into the process of digital mapping. André is based in Berlin, Germany.


Monday September 28, 2015 10:30 - 11:20
Dery/Mikszath

10:30

Large-Scale Stream Processing in the Hadoop Ecosystem - Gyula Fóra, SICS and Márton Balassi, Hungarian Academy of Sciences
Distributed stream processing is one of the hot topics in big data analytics today. An increasing number of applications are shifting from traditional static data sources to processing the incoming data in real-time. Performing large scale stream processing or analysis requires specialized tools and techniques which have become publicly available in the last couple of years.

This talk will give a deep, technical overview of the top-level Apache stream processing landscape. We compare several frameworks including Spark, Storm, Samza and Flink. Our goal is to highlight the strengths and weaknesses of the individual systems in a project-neutral manner to help select the best tools for specific applications. We will touch on API expressivity, runtime architecture, performance, fault tolerance and strong use cases for the individual frameworks.
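The core abstraction all of these frameworks share — keyed aggregation over windows of an unbounded stream — can be pictured with a toy single-process sketch. The function and event format below are invented for illustration; the real frameworks add distribution, fault tolerance and far richer APIs:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Group (timestamp_ms, key) events into tumbling windows and count per key.

    A toy stand-in for what Flink/Spark/Storm/Samza do at scale.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        windows[ts // window_ms][key] += 1  # integer division assigns the window
    return {w: dict(counts) for w, counts in sorted(windows.items())}

events = [(10, "click"), (40, "click"), (60, "view"), (120, "click")]
result = tumbling_window_counts(events, window_ms=100)
# window 0 (0-99 ms) holds three events; window 1 (100-199 ms) holds one
```

The talk's comparison is largely about how each framework expresses this same pattern (and recovers it after failures), not about the pattern itself.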

Speakers
Márton Balassi

Solutions Architect, Cloudera
Márton Balassi is a Solutions Architect at Cloudera and a PMC member at Apache Flink. He focuses on Big Data application development, especially in the streaming space. Márton is a regular contributor to open source and has been a speaker at a number of Big Data related conferences and meetups, including Hadoop Summit and Apache Big Data…
Gyula Fóra

Researcher, Distributed Systems, SICS
Gyula is a committer and PMC member for the Apache Flink project, currently working as a researcher at the Swedish Institute of Computer Science. His main expertise and interests are real-time distributed data processing frameworks and their connections to other big data applications. He is a core architect of Apache Flink Streaming. His current work includes research and development on several aspects of stream processing, including…


Monday September 28, 2015 10:30 - 11:20
Petofi

11:30

Data Science Lifecycle with Apache Zeppelin (incubating) - Moon soo Lee, NFLabs and Alexander Bezzubov
Apache Zeppelin (incubating) is an interactive data analytics environment for distributed data processing systems. It provides a beautiful interactive web-based interface, data visualization, a collaborative work environment and many other nice features to make your data analytics more fun and enjoyable. Moon soo Lee will demo Zeppelin's features to show how it helps the data science lifecycle.

Zeppelin provides a pluggable architecture for backend integration, visualization and notebook persistence storage. This presentation will describe how this pluggable architecture works and how your project can leverage it.

We will also discuss the future roadmap.

Speakers
Moon soo Lee

CTO, NFLabs
Moon soo Lee is the creator of Apache Zeppelin and a Co-Founder and CTO at NFLabs. For the past few years he has been working on bootstrapping the Zeppelin project and its community. His recent focus is growing the Zeppelin community and driving adoption.
Alexander Bezzubov

Software Engineer, NFLabs
Alexander Bezzubov is an Apache Zeppelin contributor, PMC member and software engineer at NFLabs. Previous speaking experience includes Apache BigData NA 2016 in Vancouver, FOSSASIA 2016 in Singapore and Apache BigData EU 2015 in Budapest.


Monday September 28, 2015 11:30 - 12:20
Arany

11:30

One-Click Hadoop Clusters - Anywhere (Using Docker) - Janos Matyas, Hortonworks
This session presents the provisioning of Hadoop clusters running inside Docker containers on different environments - be it public/private cloud or bare metal. We share the same processes, automations and zero-configuration approach across all environments and allow users to spin up SLA-policy-based autoscaling clusters of arbitrary sizes in minutes - all built exclusively on open source components. We will discuss the architecture, the main building blocks (Docker, Consul, Apache Ambari, YARN) and the tools we made available (API, CLI and UI). The session will end with a quick demonstration. Be your own Hadoop as a Service provider.

Speakers
Janos Matyas

Janos is a Sr. Director of Engineering at Hortonworks and former CTO at SequenceIQ (acquired by Hortonworks) - a young startup with the mission statement of simplifying the provisioning, development and SLA policy based autoscaling on Hadoop. Before co-founding SequenceIQ he was a Solutions Architect at EPAM Systems. He is an open source advocate and Apache Ambari committer, a Hadoop YARN evangelist and a keen surfer and freeskier. He holds a…


Monday September 28, 2015 11:30 - 12:20
Krudy/Jokai

11:30

Combining Solr and Elasticsearch to Improve Autosuggestion for Local Search on Mobile - Toan Luu, local.ch AG
Many search applications simply suggest queries or data items with a prefix match on what the user is typing, sorted by popularity. With the advent of the mobile era, local.ch became one of the most used applications for local search in Switzerland. Some characteristics of a mobile application, e.g. (a) the difficulty of typing on a small screen and (b) more personalized data available from users (e.g. location), created challenges for us when implementing the autosuggest feature. We finally brought a better autosuggest experience to our users based on their search history, the popularity of internal and external data, user context awareness (language, location), a spellchecker and more.
In this talk we will describe the architecture of our autosuggest feature and how we used Solr as a search component and Elasticsearch as a data aggregation component to improve autosuggest in our mobile application.
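The ranking idea — prefix matching ordered by popularity, with a context-aware boost — can be sketched in a few lines of pure Python. The entries, field names and boost factor below are invented for illustration; local.ch's actual pipeline runs on Solr and Elasticsearch:

```python
def suggest(prefix, entries, user_lang=None, limit=3):
    """Return up to `limit` suggestions: prefix matches ranked by popularity,
    with entries matching the user's language boosted."""
    matches = [e for e in entries if e["text"].startswith(prefix.lower())]

    def score(e):
        boost = 1.5 if user_lang and e["lang"] == user_lang else 1.0
        return e["popularity"] * boost

    return [e["text"] for e in sorted(matches, key=score, reverse=True)[:limit]]

entries = [
    {"text": "pizzeria roma", "popularity": 80, "lang": "it"},
    {"text": "pizza kurier", "popularity": 90, "lang": "de"},
    {"text": "pizza delivery", "popularity": 70, "lang": "en"},
]
```

A real deployment adds spell correction, geo-distance boosts and per-user history on top of this basic shape.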

Speakers
Toan Luu

Senior Search Engineer, local.ch AG
Toan Vinh Luu obtained his PhD in the domain of decentralized information retrieval systems from the Swiss Institute of Technology in Lausanne (EPFL) in 2007. After a year working as a PostDoc at the EPFL, he joined local.ch ag, where he has worked in the company's search team. His main missions at local.ch are making new data sources searchable, improving ranking and matching algorithms, and exploring different data sources (e.g. user logs…


Monday September 28, 2015 11:30 - 12:20
Tohotom

11:30

Managing Distributed Databases with Apache Mesos - Chris Ward, Crate.IO
Apache Mesos is a fantastic tool for abstracting CPU, memory, storage, and other compute resources away from machines (physical or virtual). Alongside these features is Mesos-DNS, which provides service discovery for all applications and services running in a Mesos cluster. Combined, these features let you program against your datacenter as if it were a single pool of resources - a useful capability when building highly scalable application stacks.

In this presentation we will show how to use Mesos so you can treat a distributed database of any size as if it were one instance.

We will cover:
- Installation and configuration of a Mesos Cluster
- How to create and manage cluster sizes
- Managing upgrades across a cluster
- Setting data locations
- Managing compute resources
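At the heart of any Mesos framework scheduler is the decision of which resource offers to accept for a task. A toy sketch of that decision follows; the offer format is simplified (real offers are structured messages delivered through the Mesos scheduler API):

```python
def pick_offer(offers, cpus_needed, mem_needed):
    """Return the id of the first resource offer that can fit the task,
    or None if every offer should be declined - a toy version of a Mesos
    framework's accept/decline logic."""
    for offer in offers:
        if offer["cpus"] >= cpus_needed and offer["mem"] >= mem_needed:
            return offer["id"]
    return None

offers = [
    {"id": "o1", "cpus": 1.0, "mem": 512},    # small agent
    {"id": "o2", "cpus": 4.0, "mem": 8192},   # large agent
]
```

A database framework on Mesos layers placement constraints (data locality, rack awareness) on top of this basic fit check.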

Speakers
Chris Ward

Developer Advocate, Crate.IO
Developer Relations, Technical Writing and Editing, (Board) Game Design, Education, Explanation and always more to come.


Monday September 28, 2015 11:30 - 12:20
Tas

11:30

Hadoop Elephant in Active Directory Forest - Marek Gawiński, Allegro Group Sp. z o.o.
Active Directory (AD) is a well known industry standard to authenticate employees in back office services. It assures password management and clear policies for requesting and gaining access to secured resources.
Integrating AD with Hadoop infrastructure brings those benefits to the Big Data world. It also includes other features that make big data developers’ tasks much easier. For example, our developers can submit Spark applications that use HDFS, YARN and Hive directly from their IDE.
In this talk we provide technical details which include:
- Making AD users and groups visible to Linux via the System Security Services Daemon.
- Integrating new Linux servers automatically with the AD forest at the Kerberos level, with all credentials needed.
- Making the whole architecture resilient to AD service outages.
- Auto-deployment and autoconfiguration of Hadoop client software on users' desktops.

Speakers
Marek Gawiński

Senior Data Engineer, Allegro Group Sp. z o.o.
For the past five years he has worked in the Infrastructure and Services Maintenance Team, where he takes care of technical support for the scrum teams and maintenance of multiple services in the Allegro Group's portfolio. He is now developing big data solutions. Passionate about web technologies and open source. As a Senior Data Engineer he deals with the Hadoop ecosystem in Allegro Group. His responsibilities include maintenance of private and public Hadoop…
Arkadiusz Osinski

Senior Data Engineer, Allegro Group Sp. z o.o.
Works at Allegro Group as a senior data engineer. From the beginning he has been involved in building and maintaining the Hadoop infrastructure within Allegro Group. Previously he was responsible for maintaining large-scale database systems. Passionate about new technologies and cycling.


Monday September 28, 2015 11:30 - 12:20
Kond

11:30

Apache Tez - Helping You Build Your Hadoop Big Data Engines - Bikas Saha, Hortonworks
YARN has opened up Hadoop to a variety of high performance purpose-built applications specialized for specific domains. Many of these need a common set of capabilities like scheduling, fault tolerance & scalability while not giving up on important aspects like multi-tenancy & security. We will provide an overview of how Apache Tez provides these capabilities via a dataflow based API to model these applications and an extensible orchestration framework for optimal performance. We will cover broad ecosystem adoption by Apache Hive, Pig, Cascading, Scalding, Flink & commercial vendors and provide some experiment results. We will look at the Tez Web UI for progress monitoring and performance debugging tools. Finally, we will look ahead at upcoming Tez features like hybrid execution which enables new types of integration with existing systems.
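A Tez-style engine executes a user-supplied dataflow DAG, launching each vertex once all of its upstream inputs are complete. A minimal sketch of that ordering logic follows; the vertex names are invented, and Tez itself adds container reuse, pipelining and failure handling on top:

```python
from collections import deque

def execution_order(vertices, edges):
    """Topologically order a dataflow DAG (edges map vertex -> downstream
    vertices), the way an orchestrator decides what is ready to run."""
    indegree = {v: 0 for v in vertices}
    for dsts in edges.values():
        for d in dsts:
            indegree[d] += 1
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for d in edges.get(v, []):  # completing v may unblock downstream vertices
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    return order

dag_vertices = ["map1", "map2", "reduce1", "join"]
dag_edges = {"map1": ["reduce1"], "map2": ["join"], "reduce1": ["join"]}
order = execution_order(dag_vertices, dag_edges)
```

The value of expressing jobs as such a DAG is exactly what the abstract describes: the engine, not the application, owns scheduling, fault tolerance and scalability.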

Speakers
Bikas Saha

Bikas is an active Apache community member and has contributed to the Apache Hadoop and Tez projects and focuses mainly on the distributed compute stack on Hadoop. He works for Hortonworks, a company that supports an open source based Apache Hadoop distribution. Bikas has spoken widely on the Hadoop compute stack over the last few years at conferences in America, SIGMOD 2015 Australia, ApacheCon Europe and the Big Data Technology Conference in China…


Monday September 28, 2015 11:30 - 12:20
Dery/Mikszath

11:30

Magellan: Geospatial Analytics on Spark - Ram Sriharsha, Hortonworks
Geospatial data is pervasive, and spatial context is a very rich signal of user intent and relevance in search and targeted advertising and an important variable in many predictive analytics applications. In this talk, we describe the motivation and the internals of an open source library that we are building for Geospatial Analytics using Spark SQL, DataFrames and Catalyst as the underlying engine. We outline how we leverage Catalyst’s pluggable optimizer to efficiently execute spatial joins, how SparkSQL’s powerful operators allow us to express geometric queries in a natural DSL, and discuss some of the geometric algorithms that we implemented in the library. We also describe the Python bindings that we expose, leveraging Pyspark’s Python integration.
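At the bottom of any spatial join sits a geometric predicate evaluated per candidate pair, such as point-in-polygon. Here is a pure-Python ray-casting sketch of that predicate, purely for illustration (Magellan evaluates such predicates inside Spark SQL via Catalyst, with indexing to prune candidate pairs):

```python
def point_in_polygon(pt, polygon):
    """Ray-casting test: cast a ray from pt in the +x direction and count
    edge crossings; an odd count means the point is inside."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y-coordinate
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
```

A spatial join then amounts to evaluating this predicate (or containment/intersection variants) over point-polygon pairs, which is why pushing it into the query optimizer matters for performance.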

Speakers
Ram Sriharsha

Senior Member of Technical Staff, Hortonworks
Ram is currently Product Manager for Apache Spark at Databricks. Prior to joining Databricks, he was Principal Research Scientist at Yahoo Research, where he worked on large-scale machine learning algorithms and systems related to login risk detection, sponsored search advertising and advertising effectiveness measurement. Prior talks include ApacheCon BigData 2015 and Spark Summit…


Monday September 28, 2015 11:30 - 12:20
Petofi

12:20

Lunch
Monday September 28, 2015 12:20 - 14:00
Brasserie Restaurant

14:00

Catch Them in the Act: Fraud Detection in Real-Time - Seshika Fernando, WSO2
Fraud is getting more complex and dangerous every minute, with fraudsters countering anti-fraud measures through technology and advanced statistical models. Conversely, overprotective fraud solutions are driving customers away. Finding the right level of fraud prevention is more an art than a science. As data scientists, our duty is not to master the art but to enable our customers to draw this fine line in a simple yet effective manner.

In this session, Seshika will take you through:
• How to detect anomalies in real time using Complex Event Processing
• Why Markov modelling is great at detecting rare activity sequences
• How scoring functions can be used to reduce false positives
• How machine learning can be used to intensify fraud detection
• What visualizations will enable analysts to further crack down on relationships in large fraud rings
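To give a flavor of the Markov-modelling bullet above, the sketch below scores an activity sequence by its average transition log-likelihood and flags rare sequences. The transition probabilities, event names and threshold are all invented for illustration, not taken from the talk:

```python
from math import log

# Transition probabilities "learned" from historical sessions (made-up numbers).
TRANSITIONS = {
    ("login", "browse"): 0.70,
    ("login", "transfer"): 0.05,
    ("browse", "purchase"): 0.40,
    ("browse", "transfer"): 0.10,
    ("transfer", "transfer"): 0.02,
}
FLOOR = 0.001  # probability assigned to transitions never seen in training

def sequence_score(events):
    """Average log-likelihood per transition; more negative = rarer sequence."""
    pairs = list(zip(events, events[1:]))
    if not pairs:
        return 0.0
    return sum(log(TRANSITIONS.get(p, FLOOR)) for p in pairs) / len(pairs)

def is_suspicious(events, threshold=-3.0):
    return sequence_score(events) < threshold

normal = ["login", "browse", "purchase"]
odd = ["login", "transfer", "transfer", "transfer"]
```

Layering a scoring function like this over raw anomaly alerts is one way to reduce false positives: a single rare transition is tolerated, but a whole improbable sequence is flagged.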

Speakers
Seshika Fernando

Senior Technical Lead, WSO2
Seshika is a Senior Technical Lead at WSO2 and focuses on the applications of WSO2’s middleware platform in financial markets. Throughout her career, she has had extensive experience in providing technology for stock exchanges, regulators and investment banks from across the globe. Her current area of interest is real-time anomaly detection and its use in e-commerce. She holds a BSc (Hons) in Computer Science from the University of…


Monday September 28, 2015 14:00 - 14:50
Arany

14:00

Dynamics of Benchmarking Distributed Key-Value (KV) Stores (HBase, Cassandra, Accumulo, Hypertable, Aerospike) for Hosting Terabytes of Data - Pracheer Agarwal, Inmobi & Kunal Gautam, Inmobi
Identifying a KV store to host terabytes of data, from a wide range of choices, for a given set of use cases is a daunting proposition. Even after crossing that fearsome first step and narrowing down the candidates, it is non-trivial to explain and reconcile the actual results of benchmarking experiments with the expected results. There are multiple variables at play, and it is often unclear how they interact at run time.
In this talk, we present our experiences and a methodology for analyzing and effectively benchmarking a distributed KV store. This involves monitoring and characterizing key server parameters such as RAM, CPU, network, storage, page cache, IO scheduler, JVM size and GC tuning, and logically reasoning out their effects on the overall performance and capabilities of the underlying KV store. These parameters were monitored with utilities such as iostat, dstat, iftop, jstat and cachestat.
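The client side of such a benchmark reduces to sampling per-operation latencies and summarizing their distribution. A toy harness is sketched below, using an in-memory dict as a stand-in for the KV store; a real benchmark (e.g. YCSB-style) also coordinates load phases, concurrency and the server-side metrics the abstract lists:

```python
import statistics
import time

def benchmark(op, n=10_000):
    """Time n calls of op(i) and report p50/p99 latencies in microseconds."""
    samples = []
    for i in range(n):
        start = time.perf_counter()
        op(i)
        samples.append((time.perf_counter() - start) * 1e6)
    samples.sort()
    return {"p50": statistics.median(samples), "p99": samples[int(n * 0.99) - 1]}

# Stand-in "KV store": a plain dict receiving 100-byte values.
store = {}
stats = benchmark(lambda i: store.__setitem__(i, b"x" * 100))
```

Reporting percentiles rather than averages is the point: tail latencies (p99) are where GC pauses, page-cache misses and IO-scheduler effects show up.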

Speakers
Kunal Gautam

Senior Software Engineer, Inmobi
Kunal Gautam is a strong thought leader in the field of Big Data and has hands-on experience using the Hadoop framework. He has received several awards for proposing ideas and implementing them in working products. Kunal has been working with distributed architectures for over 5 years. He has expertise in using Big Data to build user profiling (real-time/batch) from scratch at scale (handling about 2 billion users and 200+ TB of data). He…


Monday September 28, 2015 14:00 - 14:50
Krudy/Jokai

14:00

What's New in Apache HTrace - Colin McCabe, Cloudera
Apache HTrace is a distributed tracing framework, currently in the incubator, which makes it easier to monitor and understand the performance of distributed systems. In this talk, I'll give an overview of the HTrace project. I'll also talk about how developers can engage with the HTrace community, and potentially integrate HTrace into their own projects.

The last few months have been an exciting time in the HTrace project. I'll talk about the new web interface, the htraced trace sink, improvements to the client API, and other exciting new work.

Finally, I'll give a demo of using HTrace to find problems and optimize performance in a Hadoop cluster.
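The shape of a tracing client API can be sketched in a few lines: named spans with parents and durations, collected by a sink. This is an invented mini-API for illustration only, not HTrace's actual Java interface:

```python
import time
from contextlib import contextmanager

SPANS = []  # collected spans; a real sink would ship these to a trace daemon

@contextmanager
def trace_span(name, parent=None):
    """Record a named, timed span with an optional parent span name."""
    span = {"name": name, "parent": parent, "start": time.time()}
    try:
        yield span
    finally:
        span["duration"] = time.time() - span["start"]
        SPANS.append(span)

# Nested spans: the child is recorded inside its parent's lifetime.
with trace_span("read_file") as root:
    with trace_span("open_block", parent=root["name"]):
        time.sleep(0.01)
```

Stitching such parent/child links across process boundaries is exactly the hard part a framework like HTrace handles for distributed systems.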

Speakers
Colin McCabe

Software Engineer, Cloudera
Colin McCabe is a Platform Software Engineer at Cloudera, where he works on HDFS and related technologies. He is a committer on HDFS. Prior to joining Cloudera, he worked on the Ceph Distributed Filesystem, and the Linux kernel, among other things. He studied Computer Science and Computer Engineering at Carnegie Mellon.


Monday September 28, 2015 14:00 - 14:50
Tohotom

14:00

Spark/Cassandra Integration, Theory and Practice - DuyHai DOAN, Datastax
Apache Spark is a general data processing framework which allows you to perform map-reduce tasks (but not only) in memory. Apache Cassandra is a highly available and massively scalable NoSQL data store.

By combining Spark's flexible API with Cassandra's performance, we get an interesting alternative to the Hadoop ecosystem for both real-time and batch processing.

During this talk we will highlight the tight integration between Spark & Cassandra and demonstrate some usages with live code demo using Apache Zeppelin.

Speakers
DuyHai Doan

Technical Advocate, Datastax
DuyHai Doan is an Apache Cassandra evangelist and Apache Zeppelin committer. He spends his time between technical presentations/meetups on Cassandra, coding on open source projects to support the community, and helping companies using Cassandra to make their projects successful. He also takes an interest in the whole ecosystem around Cassandra (Spark, Zeppelin, ...). Previously he worked as a freelance Java/Cassandra consultant.


Monday September 28, 2015 14:00 - 14:50
Huba

14:00

Protecting Enterprise Data In Apache Hadoop - Owen O'Malley, Hortonworks
Hadoop has long had strong authentication via integration with Kerberos, authorization via User/Group/Other HDFS permissions, and auditing via the audit log. Recent developments in Hadoop have added HDFS file access control lists, pluggable encryption key provider APIs, HDFS snapshots, and HDFS encryption zones. These features combine to give important new data protection features that every company should be using to protect their data. This talk will cover what the new features are and when and how to use them in enterprise production environments. Upcoming features including columnar encryption in the ORC columnar format will also be covered. By encrypting particular columns, enterprises can control which users have access to particularly sensitive columns that contain personally identifiable information or financial information.

Speakers
Owen O’Malley

Co-founder & Sr Architect, Hortonworks
Owen O’Malley is a co-founder and architect at Hortonworks, which develops the completely open source Hortonworks Data Platform (HDP). HDP includes Hadoop and the large ecosystem of big data tools that enterprises need for data analytics. Owen has been working on Hadoop since 2006 at Yahoo, and was the first committer added to the project. In the last 10 years, he has been the architect of MapReduce, Security, Hive, and Orc.


Monday September 28, 2015 14:00 - 14:50
Kond

14:00

Leveraging the Power of SOLR with SPARK - Johannes Weigend, QAware GmbH
SOLR is a distributed NoSQL database with impressive search capabilities. SPARK is the new star in the distributed computing universe. In this code-intense session we show how to combine both to solve real-time search and processing problems. We show how to set up a SOLR/SPARK combination from scratch and develop first jobs that run distributed on shared SOLR data. We also show how to use this combination for your next-generation BI platform.

Speakers
Johannes Weigend

CTO, QAware GmbH
Johannes has worked as a software architect with Java since 1999 and was honoured as a "Java Rockstar" at JavaOne 2015. He is a lecturer at the University of Applied Sciences in Rosenheim, Germany and technical director at QAware, a decorated software engineering company located in Munich, Germany. QAware works for enterprises like BMW, Allianz, German Telecom and others.



Monday September 28, 2015 14:00 - 14:50
Dery/Mikszath

14:00

Apache Kafka for High-Throughput Systems - Jane Wyngaard, Jet Propulsion Laboratory
While designed to provide high-throughput, low-latency real-time data feeds of relatively small messages at large scale, Apache Kafka could offer a valuable service to other communities if operated at even greater scale, managing streams on the order of 10 Gb/s.

While others have achieved this level of throughput via topic scaling using greater node counts, this presentation will discuss achieving such rates over a single topic and where hardware resource scaling is more limited.
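One way to appreciate the challenge is back-of-the-envelope sizing: how many partitions does a 10 Gb/s aggregate rate imply, and hence how much must a single topic with few partitions carry per partition? The message size and per-partition rate below are purely illustrative assumptions, not measurements from the talk:

```python
def partitions_needed(target_gbps, msg_bytes, msgs_per_partition_per_sec):
    """Rough partition count for a target aggregate rate.

    Illustrative sizing only: real throughput depends on batching,
    compression, acks, replication and hardware.
    """
    target_bytes_per_sec = target_gbps * 1e9 / 8        # bits/s -> bytes/s
    per_partition_bytes = msg_bytes * msgs_per_partition_per_sec
    return -(-target_bytes_per_sec // per_partition_bytes)  # ceiling division

# 10 Gb/s of 1 KB messages, assuming ~50k msgs/s per partition:
n = partitions_needed(10, 1000, 50_000)
```

If the topic count and partition count are fixed, the same arithmetic says each partition must sustain far more than the assumed baseline, which is precisely the single-topic, limited-hardware regime the talk addresses.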

Speakers
Jane Wyngaard

PostDoc, University of Southern California
Currently a postdoctoral scholar at the University of Southern California, I primarily work on Big Data tools for Earth Science (Apache Kafka and OODT currently). But with a background in mechatronics and a PhD in microprocessor design, I am most excited about the potential that combining IoT, drones, and Big Data analytics has for opening new avenues to data capture and analysis for the Earth Sciences.


Monday September 28, 2015 14:00 - 14:50
Petofi

14:00

Overcoming the Many-to-Many Data Mapping Mess With Apache Streams - Steve Blackmon, People Pattern
These days we have the tools and resources to collect and wrangle data at unprecedented scale, yet we remain plagued by compatibility gaps and semantic nuances with every new source we invite into our domain. Despite the best efforts of well-meaning folks for decades, data integration remains a many-to-many problem.

Apache Streams (incubating) is an open-source real-time reference implementation for the Activity Streams specification. Streams contains libraries and patterns for specifying, publishing, and inter-linking schemas, and assists with conversion of activities and objects between the representation, format, and encoding preferred by supported data providers, processors, and indexes.

In this talk I will explain what Streams does, how it works (more or less), and how it can be used to compile a real-time, multi-network, polyglot content repository of profiles, posts, etc.

Speakers
avatar for Steve Blackmon

Steve Blackmon

VP Technology, People Pattern
VP Technology at People Pattern, previously Director of Data Science at W2O Group, co-founder of Ravel, stints at Boeing, Lockheed Martin, and Accenture. Committer and PMC for Apache Streams (incubating). Experienced user of Spark, Storm, Hadoop, Pig, Hive, Nutch, Cassandra, Tinkerpop, and more.


Monday September 28, 2015 14:00 - 14:50
Tas

15:00

Apache Bigtop Unconference (Everybody welcome)
Speakers
avatar for Konstantin Boudnik

Konstantin Boudnik

CEO, Memcore
Dr. Konstantin Boudnik, co-founder and CEO of Memcore Inc, is one of the early developers of Hadoop and a co-author of Apache Bigtop, the open source framework and community around the creation of software stacks for data processing projects. With more than 20 years of experience in software development, big- and fast-data analytics, Git, distributed systems and more, Dr. Boudnik has authored 16 US patents in distributed computing. Dr. Boudnik... Read More →


Monday September 28, 2015 15:00 - 15:50
Kond

15:00

IPython Notebook as a Unified Data Science Interface for Hadoop - Casey Stella, Hortonworks
Data science on Hadoop can be a daunting journey, as you are generally spanning multiple tools and different interfaces. Furthermore, while there are people out there doing data science, worked examples are few and far between.

As part of the Social Security Act, the Centers for Medicare and Medicaid Services has begun to publish data detailing the relationship between physicians and medical institutions. This data has been analyzed cursorily in the press, but an in-depth outlier and Benford's law analysis hasn't been attempted (to my knowledge).

I will present an example of using Apache Spark and Hive on Hadoop to do the above analysis without leaving the IPython notebook. This should motivate IPython and the Python bindings of Spark as a fantastic environment for data science.
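For readers unfamiliar with the technique, a Benford's-law check compares the distribution of leading digits in a data set against log10(1 + 1/d). A minimal, self-contained sketch (plain Python with invented data; not the speaker's Spark/Hive code):

```python
import math
from collections import Counter

def benford_expected(d):
    # Benford's law: P(first digit = d) = log10(1 + 1/d)
    return math.log10(1 + 1 / d)

def first_digit(x):
    # Leading significant digit of a number, e.g. 0.042 -> 4, 305 -> 3
    s = str(abs(x)).lstrip("0.")
    return int(s[0])

def benford_deviation(values):
    # Sum of absolute deviations of observed leading-digit frequencies
    # from the Benford distribution; close to 0 means "Benford-like".
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts.get(d, 0) / n - benford_expected(d))
               for d in range(1, 10))

# Multiplicative growth data tends to follow Benford's law...
geometric = [1.05 ** i for i in range(1, 500)]
# ...while uniformly distributed data does not.
uniform = list(range(1, 1000))
print(benford_deviation(geometric), benford_deviation(uniform))
```

An outlier screen of the kind the abstract describes would flag the partitions of the payment data whose deviation from the expected distribution is unusually large.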

Speakers
CS

Casey Stella

Principal Architect, Hortonworks
I am a principal architect focusing on Data Science in the consulting organization at Hortonworks. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle, and as a research geophysicist in the oil & gas industry. Before that, I was a poor graduate student in Math at Texas A&M. I primarily work with the Apache Hadoop software stack. I... Read More →


Monday September 28, 2015 15:00 - 15:50
Arany

15:00

Leveraging Ambari to Build Comprehensive Management UIs For Your Hadoop Applications - Christian Tzolov, Pivotal
This presentation will demonstrate how to leverage modern HTML5 technologies and the flexibility of Apache Ambari to build comprehensive, responsive and attractive management interfaces for your Hadoop applications. In the process we will walk you through the reference implementation of a management interface for a SQL-on-Hadoop application and its integration with Apache Ambari. We will share our experience in using technologies like Google Polymer, Spring Boot and Apache Ambari.

Speakers
avatar for Christian Tzolov

Christian Tzolov

Pivotal Inc
Christian Tzolov, Pivotal technical architect, BigData and Hadoop specialist, contributing to various open source projects. In addition to being an Apache® Committer and Apache Crunch PMC Member, he has spent over a decade working with various Java and Spring projects and has led several enterprises on large scale artificial intelligence, data science, and Apache Hadoop® projects. twitter: @christzolov blog: http://blog.tzolov.net


Monday September 28, 2015 15:00 - 15:50
Krudy/Jokai

15:00

Apache Tika for Enabling Metadata Interoperability - Michael Starch, NASA Jet Propulsion Laboratory and Nick Burch
Apache Tika is the de facto standard technology for extracting textual content and metadata from over a thousand different file types. Given the growing importance of metadata, Tika has become a fundamental tool, providing support for many metadata models. However, enabling uniform access to very large sets of heterogeneous documents requires accurate interoperability techniques, such as metadata mapping. In this talk, Michael and Nick will review existing Tika-based solutions that make it possible to obtain consistent metadata across file formats (i.e., TikaCoreProperties, Solr's ExtractingRequestHandler) and then present a new component for Tika. This integration extends the Metadata object to achieve metadata interoperability through a highly configurable, fine-grained mapping technique that subsumes schema mapping and instance transformation.

This work has been proposed by Giuseppe Totaro (“Sapienza" University of Rome) and Chris Mattmann (NASA JPL). 

Speakers
NB

Nick Burch

CTO, Apache Software Foundation
Nick began contributing to Apache projects in 2003, and hasn't looked back since! He's mostly involved in "Content" projects like Apache POI, Apache Tika and Apache Chemistry, as well as foundation-wide activities like Conferences and Travel Assistance. Nick is CTO at Quanticate, a Clinical Research Organisation (CRO) with a strong focus on data and statistics. Nick has spoken at most ApacheCons since 2007, as well as many... Read More →
MS

Michael Starch

Computer Engineer in Applications, NASA Jet Propulsion Laboratory
Michael Starch has been employed by the Jet Propulsion Laboratory for the past 5 years. His primary responsibilities include engineering big data processing systems for handling scientific data, researching the next generation of big data technologies, and helping infuse these systems into the mission world. He is a committer and PMC member on Apache OODT and has spoken about his work at the Southern California Linux Expo and ApacheCon North America.


Monday September 28, 2015 15:00 - 15:50
Tohotom

15:00

A Tale of Two Graphs: Property Graphs and RDF - Andy Seaborne and Paolo Castagna, Cloudera
Property Graphs and the Resource Description Framework (RDF) are both graph data models. Property Graphs originated with data practitioners, while RDF was developed at the W3C as an information model for the web. Both graph data models are "schema-neutral": there is no rigid organization of data up-front. Applications decide which part of the data graph to use and how to view it. New data and new applications can be introduced at any time without disturbing existing usage.

In this talk we will introduce both data models and look at some uses to show where they are (and are not) used. We will look at contrasting features of the data models by looking at use cases.

We will then look at two systems, Apache Spark/GraphX, for property graphs, and Apache Jena, for RDF databases, and how they deal with graph structure and how they can scale to big data.
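As a rough illustration of the difference between the two models (an invented mini-example, not material from the talk): a property graph attaches key/value maps to vertices and edges, while RDF breaks everything into triples, so a property on an edge needs an extra statement node.

```python
# One fact -- "alice (age 34) knows bob since 2010" -- in both models.
# All identifiers below are invented for illustration.

# Property graph: vertices and edges each carry a key/value property map.
vertices = {
    "v1": {"label": "Person", "name": "alice", "age": 34},
    "v2": {"label": "Person", "name": "bob"},
}
edges = [("v1", "knows", "v2", {"since": 2010})]

# RDF: everything is a (subject, predicate, object) triple; the edge
# property needs an extra resource (a simple reification-style node).
EX = "http://example.org/"
triples = {
    (EX + "alice", EX + "name", "alice"),
    (EX + "alice", EX + "age", 34),
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "bob", EX + "name", "bob"),
    (EX + "stmt1", EX + "subject", EX + "alice"),
    (EX + "stmt1", EX + "predicate", EX + "knows"),
    (EX + "stmt1", EX + "object", EX + "bob"),
    (EX + "stmt1", EX + "since", 2010),
}

# A minimal "query" in each model: who does alice know?
pg_answer = [dst for (src, lbl, dst, props) in edges
             if vertices[src]["name"] == "alice" and lbl == "knows"]
rdf_answer = [o for (s, p, o) in triples
              if s == EX + "alice" and p == EX + "knows"]
print(pg_answer, rdf_answer)
```

Both answers identify bob; the models differ in where the structure lives, not in what can be expressed.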

Speakers
avatar for Paolo Castagna

Paolo Castagna

Systems Engineer, Cloudera
Paolo works as a Systems Engineer at Cloudera (EMEA). Before joining Cloudera he worked at HP Labs. Paolo is a PMC member and committer of the Apache Jena project (http://jena.apache.org/), 'addicted' to data (mostly RDF or graph shaped datasets).
avatar for Andy Seaborne

Andy Seaborne

Andy works on infrastructure for linked data graph systems. He was lead editor for SPARQL, the RDF query language. Andy developed the ARQ query engine, which is released as part of Apache Jena, where he is a committer. He has spoken at conferences and at developer events about linked data and RDF. Andy has given tutorials on SPARQL at the International Semantic Web Conference (invited) and at the Semantic Technologies Conference, and teaches courses on... Read More →


Monday September 28, 2015 15:00 - 15:50
Huba

15:00

Apache Zeppelin - The Missing Component for the Spark Ecosystem - DuyHai Doan, Datastax
If you are interested in Big Data, you have surely heard about Spark, but do you know Apache Zeppelin? Do you know that it is possible to draw beautiful graphs from your Spark RDDs using a user-friendly interface?

In this session, I will introduce Zeppelin through live coding examples and highlight its modular architecture, which allows you to plug in an interpreter for the back-end of your choice.

Speakers
avatar for DuyHai Doan

DuyHai Doan

Technical Advocate, Datastax
DuyHai Doan is an Apache Cassandra evangelist and Apache Zeppelin committer. He spends his time between technical presentations/meetups on Cassandra, coding on open source projects to support the community, and helping companies using Cassandra to make their projects successful. He also takes an interest in the ecosystem around Cassandra (Spark, Zeppelin, ...). Previously he worked as a freelance Java/Cassandra consultant.


Monday September 28, 2015 15:00 - 15:50
Dery/Mikszath

15:00

Unified Access to All Your Data Points With Apache MetaModel - Kasper Sørensen, Human Inference
The wave of Big Data has overwhelming potential, but it has also revealed an overwhelming challenge: the need to combine multiple sources. The variety of data representations is growing just as immensely as the amount of data; you might very well be ingesting sources as varied as relational databases, NoSQL stores, Hadoop, XML/JSON/CSV files, cloud/SaaS systems and search indexes. In this presentation, Kasper Sørensen will introduce Apache MetaModel. In this project metadata comes first, and querying is built on that concept too. Apache MetaModel allows for a uniform view of data from many sources, but just as importantly it also enables Data Federation and Data Integration patterns that automatically adapt based on the metadata available in the source. The talk will be practically oriented, showing running code with MetaModel and examples of production usage in multiple business cases.
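The "uniform view over heterogeneous sources" idea can be sketched in miniature (a hypothetical toy abstraction, not the MetaModel API): each source adapter normalizes its format into the same table shape, so one query shape works across formats.

```python
import csv
import io
import json

class Table:
    # Hypothetical mini-abstraction: every source becomes columns + rows.
    def __init__(self, columns, rows):
        self.columns, self.rows = columns, rows

    def select(self, column, where=lambda row: True):
        i = self.columns.index(column)
        return [row[i] for row in self.rows if where(row)]

def from_csv(text):
    # CSV adapter: first record is the header, the rest are rows.
    rows = list(csv.reader(io.StringIO(text)))
    return Table(rows[0], rows[1:])

def from_json(text):
    # JSON adapter: derive the columns from the record keys.
    records = json.loads(text)
    cols = sorted(records[0])
    return Table(cols, [[r[c] for c in cols] for r in records])

csv_src = from_csv("name,city\nalice,Berlin\nbob,Budapest\n")
json_src = from_json('[{"name": "carol", "city": "Berlin"}]')

# The same query shape runs unchanged against both sources.
for table in (csv_src, json_src):
    print(table.select("name"))
```

MetaModel generalizes this far beyond two formats, and derives the schema from each source's own metadata rather than hard-coding it.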

Speakers
avatar for Kasper Sørensen

Kasper Sørensen

Principal Tech Lead, Human Inference / Neopost
Kasper Sørensen is a PMC member of Apache MetaModel and Principal Tech Lead at Human Inference, a Neopost company. Having founded several open source projects, including Apache MetaModel and DataCleaner, he is passionate about building and sharing products for the Data Quality, Big Data and Analytics space. This is the first major conference talk of Kasper Sørensen, but with experience in teaching, training and mentoring developers it is not an... Read More →


Monday September 28, 2015 15:00 - 15:50
Tas

16:00

R as a Language For Big Data Analytics - Andrie de Vries, Microsoft
R is the language of data science, used by more than 2 million statisticians, data scientists and quantitative analysts around the world.

Many projects and companies have implemented libraries and solutions to make R available to the data scientist working with big data in Hadoop. Foremost among these is Revolution Analytics / Microsoft that sponsored the popular RHadoop project.

In this talk, I'll present
- A high level overview of R: its history, capabilities and community
- An introduction to predictive analytics, and some applications from industry, especially some examples from inside Microsoft
- Connections between R and big-data platforms including Hadoop and Spark
- The Revolution R Open and Revolution R Enterprise distributions, and the unique capabilities they bring to R.
- Using R in the Azure cloud and (coming soon) within the SQL Server 2016 database.

Speakers
AD

Andrie de Vries

Senior Programme Manager, Microsoft
Andrie is a senior programme manager at Microsoft, responsible for community projects and evangelization in Europe of Microsoft's contribution to the open source R language. He is co-author of the very popular title "R for Dummies" and a top contributor to the Q&A website StackOverflow. Andrie is an experienced speaker at technology conferences and community events. For example he ran an RHadoop tutorial at useR! 2015 and gave a... Read More →


Monday September 28, 2015 16:00 - 16:50
Arany

16:00

Testing Big Data Pipelines Made Super Easy - Pallavi Rao, Inmobi & Pavan Kumar Kolamuri, Inmobi
Your company has developed a system that crunches humongous data from multiple data sources. It involves multiple and varied processing modules. Each of the individual modules has been well tested. But when you try to deploy these modules and connect them in an integration or staging environment, you face issues. Debugging these errors and re-testing requires setting up a mirror environment and is time-consuming. So you write complex integration tests and equally complex setup scripts. Apache Falcon and Falcon Unit to your rescue! While Apache Falcon alleviates some of the problems of pipeline orchestration, Falcon Unit, a feature of Falcon, helps users test their entire pipeline and data lifecycle without even setting up a test environment. This talk will outline the capabilities of Falcon Unit and how it helps users test data pipelines early in the development phase.

Speakers
PR

Pallavi Rao

Pallavi is an Architect at InMobi. She has been working on big data technologies for nearly 4 years. She has deep knowledge of the Hadoop ecosystem, especially YARN, Pig, Oozie, HBase, Hive and Storm. For the past 6 months she has been actively contributing to Apache Falcon. She has spoken at conferences such as the Annual RFID Conference, the Information Management Technical Conference and the Grace Hopper Conference.


Monday September 28, 2015 16:00 - 16:50
Krudy/Jokai

16:00

Recent Evolution of Standards for Geospatial Applications and Their Implementation in Apache SIS - Martin Desruisseaux, Geomatys
Apache Spatial Information System (SIS) is a Java library for developing geospatial applications which conform to international standards. SIS implements interfaces derived from UML published jointly by the Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO). At ApacheCon North America 2014 we introduced the OGC vision and how SIS implements the "geospatial metadata" and "referencing by coordinates" international standards. In this presentation we will cover what is new on the standards front: the "metadata" revision published in 2014, and the "Well-Known Text 2" format, which can be seen as an evolution of "referencing by coordinates". We will present which problems the new standards solve, and how SIS preserves compatibility between the old and new standards. Finally we will present the plan for upcoming SIS developments.

Speakers
MD

Martin Desruisseaux

Developer, Geomatys
Martin holds a Ph.D. in oceanography, but has continuously developed tools to support his analysis work. He used C/C++ before switching to Java in 1997. He has been developing geospatial libraries since then, initially as a personal project, then as a GeoTools contributor until 2008, and as an Apache SIS contributor since 2013. Martin attends Open Geospatial Consortium (OGC) meetings about twice per year in the hope of following closely... Read More →


Monday September 28, 2015 16:00 - 16:50
Tohotom

16:00

S2Graph : A Large-Scale Graph Database with HBase - Doyung Yoon, Daumkakao
As a dominant social network service provider, Daumkakao confronted several technical challenges in storing and traversing large graph data.
First, our social network has 10 billion edges and 200 million vertices, and users create 1 billion new edges every day by interacting with our services, so our system needed to be distributed and scalable.
Second, our system needed to provide low latency and high concurrency to meet our quality of service.
Third, for viral effect, users' activities should be delivered to the right place at the right time, in real time, so a simple result cache can't be used.
Lastly, Daumkakao operates about a hundred services, and our system needed to provide a common way to store and traverse data for synergy between those services.
S2Graph successfully solved these technical challenges, so we'd like to introduce the methodology and architecture we used.

Speakers
avatar for Daewon Jeong

Daewon Jeong

Programmer, kakao
Works on S2Graph team
avatar for Doyung Yoon

Doyung Yoon

Software Engineer, Kakao
Doyung works in a distributed graph database team at Kakao as software engineer, where his focus is on performance and usability. He developed Apache S2Graph, an open-source distributed graph database, and has previously presented it at ApacheCon BigData Europe and ApacheCon BigData North America.



Monday September 28, 2015 16:00 - 16:50
Huba

16:00

Encryption and Anonymization in Hadoop - Current and Future - Balaji Ganesan, Hortonworks and Don Bosco Durai
As enterprises expand usage of Hadoop as a platform to store and process data, data security and compliance needs in the platform are becoming more pertinent.
Beyond the traditional Hadoop security controls of authentication through Kerberos and access management through Apache Ranger, users are increasingly asking for data to be encrypted while it is being transmitted and while it is stored on disk. In parallel, there is a movement towards anonymizing data by tokenizing or masking it, with the intent of using the data in query processing while protecting its sensitivity by hiding the original value from the end user.
The community has recently built encryption into the HDFS file system. In this talk we look at the current encryption capabilities in Hadoop and the areas where the community needs to focus to further enhance Hadoop as an enterprise-ready data platform.
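The two anonymization styles mentioned above, masking and tokenization, can be illustrated with a toy sketch (all names and the key below are invented; real deployments would use Ranger/HDFS facilities and a key from a managed KMS, not a hard-coded secret):

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key"  # invented for illustration; never hard-code keys

def mask(value, keep_last=4):
    # Masking: hide all but the last few characters, preserving length.
    return "*" * (len(value) - keep_last) + value[-keep_last:]

def tokenize(value):
    # Tokenization: replace the value with a deterministic keyed token, so
    # equality joins and group-bys still work without exposing the original.
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

ssn = "078-05-1120"
print(mask(ssn))                       # *******1120
print(tokenize(ssn) == tokenize(ssn))  # True: stable across queries
```

Masking is for display to end users; tokenization keeps the column usable in query processing, which is exactly the trade-off the abstract describes.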

Speakers
DB

Don Bosco Durai

Security Architect, Hortonworks
Bosco Durai is an Apache committer and currently working at Hortonworks, focused on enabling enterprise grade security within Hadoop platform. Bosco brings years of experience building and managing enterprise data security products. Before Hortonworks, Bosco was the co-founder and Chief Security Architect of big data security startup, XA Secure. XA Secure was built ground up to address the unique security challenges that big data environments... Read More →
BG

Balaji Ganesan

Hortonworks In
Balaji Ganesan is part of the enterprise security team at Hortonworks, where he is leading and executing the vision to bring comprehensive enterprise security into Apache Hadoop. He came to Hortonworks through its acquisition of his security startup, XA Secure. As the Senior Director of Enterprise Security and Strategy, he is responsible to bring together top big data and security talent, build enterprise-grade products, and redefine how data... Read More →


Monday September 28, 2015 16:00 - 16:50
Kond

16:00

Integrating Apache Spark with an Enterprise Data Warehouse - Michael Wurst, IBM
This session will discuss the challenges and opportunities of integrating Apache Spark with enterprise data warehouses, especially the impact of columnar storage, using the example of IBM DB2 and IBM dashDB. We will show how columnar storage can help to increase scalability and reduce response time, especially when pushing down processing of projections and aggregates to the database instead of processing them in Spark natively. Key takeaways from the session are: 1. How to benefit from the features of closed-source data warehouses from Spark without access to internal data structures; 2. The role of storage when working with large warehouses from Spark; 3. Opportunities of columnar storage vs. row-based storage; 4. How such an integration impacts end-to-end analytics based on Spark MLlib.
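Why aggregate pushdown to a columnar store helps can be sketched in a few lines (illustrative Python only, not DB2/dashDB internals): the pushed-down aggregate scans one contiguous column instead of materializing whole rows in the engine.

```python
# Row store: each record is a full tuple (id, name, amount).
rows = [(i, "user%d" % i, i * 1.5) for i in range(1000)]

# Columnar layout: one contiguous sequence per column.
ids, names, amounts = (list(col) for col in zip(*rows))

def sum_engine_side(rows):
    # No pushdown: every full row is shipped to and scanned by the engine,
    # even though the query only needs the amount column.
    return sum(r[2] for r in rows)

def sum_pushed_down(amounts):
    # Pushdown: the warehouse scans only the single column the query needs.
    return sum(amounts)

assert sum_engine_side(rows) == sum_pushed_down(amounts)
print(sum_pushed_down(amounts))  # 749250.0
```

The result is identical either way; the difference is how much data crosses the boundary between warehouse and engine, which is where the scalability and response-time gains come from.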

Speakers
avatar for Michael Wurst

Michael Wurst

Architect / Senior Software Developer, IBM Research & Development
Michael Wurst, Ph.D. is a senior software engineer and architect at the IBM Research & Development Lab in Germany. He holds a Ph.D. in computer science and is responsible for the integration of open source analytics based on R, Python or Spark into IBM's Datawarehouse portfolio. Prior to joining IBM, Michael worked as a co-developer for the RapidMiner open source data mining software. Michael presented at a wide range of conferences, including... Read More →


Monday September 28, 2015 16:00 - 16:50
Dery/Mikszath

16:50

Pivotal Drinks Reception
Monday September 28, 2015 16:50 - 18:30
Pivotal Hacker Lounge

17:00

18:30

Budapest Big Data Meetup

The Apache Big Data Europe conference is in town, and we are organizing an event where you can listen to and meet with the international speakers and attendees. 

We plan to have the following talks:

Dive deeper, Soar higher: MADlib + HAWQ for advanced SQL machine learning on Hadoop 

The growing Apache ecosystem just got bigger and better -- now with the ability to crunch vast volumes of data using fully ANSI-compliant SQL and at-scale machine learning algorithms.

Apache HAWQ has been years in the making and derives its heritage from Greenplum Database and PostgreSQL. HAWQ enables developers, analysts, data scientists and engineers to run advanced SQL queries, transform data sets of extreme size, visualize data with standard tools, and seamlessly run R and Python in a highly distributed fashion, all in the same environment. Invoke powerful machine learning and advanced statistical functions using Apache MADlib, and build models on billions of rows of data.

Speakers from Pivotal and Hortonworks will discuss the following:

- Introduction to Apache HAWQ & Apache MADlib 
- All about the Open Data Platform initiative 
- Data science in the Hadoop ecosystem 
- Live, end-to-end data science demo using Apache HAWQ, Apache MADLib, and Hortonworks Data Platform


Our Speakers:

Caleb Welton is Director for SQL on Hadoop at Pivotal covering the Pivotal HAWQ database. He has spent the last 18 years developing database technology for Oracle, Greenplum, EMC and Pivotal. In addition to his contributions in database technology he is one of the founding members of the open source MADlib project for in-database machine learning. Caleb is named inventor for 11 patents in database technology and has presented papers at SIGMOD, VLDB and KDD.

Michael Natusch leads Pivotal's Data Science team in EMEA. His experience lies in predictive analytics and his area of specialization is the application of statistical methods to large-scale data sets, in particular through the application of machine learning algorithms. Michael holds a PhD in theoretical physics from the University of Cambridge and an MBA. He is a Fellow of the Royal Statistical Society and lectures at the Open University. 

Janos Matyas is a Sr. Director of Engineering at Hortonworks and former CTO at SequenceIQ (acquired by Hortonworks). Before co-founding SequenceIQ he was a Solutions Architect at EPAM Systems. He is an open source advocate and Apache Ambari committer, a Hadoop YARN evangelist and a keen surfer and freeskier. He holds a Master's Degree in Computer Science, specialized on distributed systems. 

 

Planned schedule: 
18:30 Doors open  
19:00 Talks begin  
21:00 Meetup finishes 

Afterparty: 
After the meetup we'll visit a nearby pub (exact location to be announced). Join us there, have some drinks and talk data (or anything else) even if you can't make it to the meetup!

This will be an English speaking event.  The meetup will be hosted by LogMeIn.  


Speakers
avatar for Bence Arató

Bence Arató

Managing Director, BI Consulting
Managing Director of BI Consulting Hungary. He has been in the BI industry since 1995 as an analyst, architect and consultant. He advises companies on general BI strategy, project and architecture planning, and vendor and tool selection. Also provides QA and on-the-job mentoring services. He leads the research activities of the yearly run BI-TREK and DW-TREK surveys collecting information and user feedback about the local BI&DW. He also teaches... Read More →


Monday September 28, 2015 18:30 - 21:00
LogMeIn 1061 Paulay Ede u. 12., Ground floor, Budapest

21:00

Big Data Meetup AfterParty

International and Hungarian friends of Big Data, unite!

After our September meetup on Monday evening,  we are heading for drinks and discussions to An'kert, a fine example of the world-famous ruin pubs of Budapest. 

This AfterParty is open to everyone who loves Big Data, and we are especially looking forward to meeting and greeting the attendees of the Apache Big Data Europe conference.

If you are staying at Hotel Corinthia - the venue of the conference - then you can easily walk to the pub. Here's a Google map link showing the route and distance: goo.gl/maps/RWtbxGN27qT2 

If you are also attending the earlier Big Data Meetup at LogMeIn, then getting to An'kert is even simpler: just walk a block (200m) on Paulay street.


Monday September 28, 2015 21:00 - 23:00
Ankert 1061, Paulay Ede u. 33., Budapest
 
Tuesday, September 29
 

08:00

Registration
For those staying in the hotel, please proceed to the Brasserie Restaurant for breakfast. For those not staying in the hotel, continental breakfast will be available in the Valletta II Foyer and Attendee Lounge daily.

Tuesday September 29, 2015 08:00 - 09:00
Valletta 1 Foyer

09:00

Keynote: How Apache Drives Spotify's Music Recommendations - Josh Baer, Spotify
Hear from Josh Baer, Hadoop Product Owner at Spotify on how Apache drives Spotify's music recommendations.

Speakers
avatar for Josh Baer

Josh Baer

Hadoop Product Owner, Spotify
Since 2013, Josh has been working on Spotify's Hadoop infrastructure, growing their cluster from 190 nodes to 1700 and obsessing about reliability as the Hadoop team product owner in Stockholm, Sweden. In early September, he began a team in NYC which aims to build and support near real-time infrastructure used by the Spotify application. Prior to that, he was an engineer at AT&T working on big data problems in the advertising space. He has... Read More →


Tuesday September 29, 2015 09:00 - 09:20
Grand Ballroom

09:20

Keynote: Keeping the Elephant in the Room - Gary Richardson, KPMG UK
Gary Richardson, Head of Data Engineering for KPMG in the UK, will discuss strategies for keeping the elephant in the room: moving from proof of concept to full-scale enterprise Hadoop adoption.

Speakers
avatar for Gary Richardson

Gary Richardson

Director, Head of Data Engineering, KPMG UK
Gary leads a team of data scientists and data engineers in the agile development of big data science solutions. The focus of the team is raising the bar in terms of industrialising big data science solutions and getting the science into business as usual functions.  He believes mainstream enterprise adoption of machine learning is the key to accelerating innovation in the usability and productivity of the data... Read More →


Tuesday September 29, 2015 09:20 - 09:40
Grand Ballroom

09:40

Keynote Panel: ODPi: Advancing Open Data for the Enterprise Panel - Anjul Bhambri, IBM; Konstantin Boudnik, WANdisco; Owen O'Malley, Hortonworks; Roman Shaposhnik, Pivotal - Moderated by Jim Zemlin, The Linux Foundation
This panel will be an opportunity for members of the Open Data Platform Initiative to share the benefits of ODP with the Apache community.

Moderators
Speakers
AB

Anjul Bhambri

VP, Big Data, IBM
Anjul Bhambhri is the Vice President of Big Data Products at IBM. She was previously the Director of IBM Optim application and data life cycle management tools. She is a seasoned professional with over twenty-two years in the database industry. Over this time, Anjul has held various engineering and management positions at IBM, Informix and Sybase. Prior to her assignment in tools, Anjul spearheaded the development of XML capabilities in IBM's DB2... Read More →
avatar for Konstantin Boudnik

Konstantin Boudnik

CEO, Memcore
Dr. Konstantin Boudnik, co-founder and CEO of Memcore Inc, is one of the early developers of Hadoop and a co-author of Apache Bigtop, the open source framework and community around the creation of software stacks for data processing projects. With more than 20 years of experience in software development, big- and fast-data analytics, Git, distributed systems and more, Dr. Boudnik has authored 16 US patents in distributed computing. Dr. Boudnik... Read More →
avatar for Owen O’Malley

Owen O’Malley

Co-founder & Sr Architect, Hortonworks
Owen O’Malley is a co-founder and architect at Hortonworks, which develops the completely open source Hortonworks Data Platform (HDP). HDP includes Hadoop and the large ecosystem of big data tools that enterprises need for data analytics. Owen has been working on Hadoop since 2006 at Yahoo, and was the first committer added to the project. In the last 10 years, he has been the architect of MapReduce, Security, Hive, and Orc.


Tuesday September 29, 2015 09:40 - 10:10
Grand Ballroom

10:10

Morning Break
Tuesday September 29, 2015 10:10 - 10:30
Valletta 2 Foyer

10:30

Synthetic Data Generation for Realistic Analytics Examples and Testing - RJ Nowling, Red Hat
Big Data users are faced with an enormous gap between trivial tutorial applications and real-world analytics pipelines. Word count and TeraSort have limited value as blueprints and may not exercise enough of the data processing stack to be useful for testing deployments. Since real data are typically encumbered by privacy or intellectual property concerns, tutorials and test cases often use small or unrepresentative data sets. Generative models can enable a new class of realistic example and test applications by synthesizing rich and complex data sets. Furthermore, synthetic data can be scaled from a single laptop to data centers. We will present on data generators, such as BigPetStore from Apache BigTop, influenced by data we’ve analyzed in the Emerging Technologies team at Red Hat. We also discuss realistic example applications and usage for smoke-testing deployments.
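As a miniature of the generator idea (a hypothetical toy in the spirit of BigPetStore, with invented stores, products and weights, not the actual BigPetStore code): a seeded generative model draws records from simple distributions, so the same generator yields reproducible data at any scale.

```python
import random

# Invented dimension tables and purchase weights, for illustration only.
STORES = ["Berlin", "Budapest", "London"]
PRODUCTS = [("dog-food", 18.99), ("cat-litter", 12.49), ("leash", 7.95)]

def generate(n, seed=42):
    # Seeding makes the synthetic data reproducible across runs; the same
    # model can emit 10 records on a laptop or billions in a data center.
    rng = random.Random(seed)
    for i in range(n):
        name, price = rng.choices(PRODUCTS, weights=[5, 3, 1])[0]
        yield {
            "txn": i,
            "store": rng.choice(STORES),
            "product": name,
            "total": round((1 + rng.randrange(3)) * price, 2),
        }

for record in generate(3):
    print(record)
```

Because the output follows chosen distributions rather than copying real data, it sidesteps the privacy and intellectual-property concerns the abstract mentions while still exercising the full processing stack.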

Speakers
RN

RJ Nowling

Software Engineer, Red Hat, Inc.
RJ Nowling is a Software Engineer in Emerging Technology at Red Hat, Inc., where he is part of a data science team that consults for internal customers. RJ is a committer on Apache BigTop, a contributor to Apache Spark, and co-lead of the BigPetStore family of big data example applications. Before joining Red Hat, RJ focused on academic research in the fields of computational physics, bioinformatics, and distributed systems. He is currently a PhD... Read More →


Tuesday September 29, 2015 10:30 - 11:20
Krudy/Jokai

10:30

Search-Based Business Intelligence and Reverse Data Engineering with Apache Solr - Mario-Leander Reimer, QAware GmbH
We are searching the unknown. How can you find hidden and unknown relationships in unrelated data silos? How can you find relevant information in a 10^56-dimensional space? Sounds impossible? This talk will present a case study and success story about how Apache Solr has been used to build a search-based business intelligence and information research application that answers these questions for a major German car manufacturer.

Speakers
avatar for Mario-Leander Reimer

Mario-Leander Reimer

Chief Technologist, QAware GmbH
M.-Leander Reimer studied computer science at Rosenheim and Staffordshire University and now works as chief technologist for QAware GmbH. He is a senior Java developer with several years of experience in designing complex and large-scale system architectures. He is continuously looking for innovations and ways to combine state-of-the-art technology and open source software components to be successfully applied in real-world customer... Read More →



Tuesday September 29, 2015 10:30 - 11:20
Tohotom

10:30

Architecture of Flink's Streaming Runtime - Robert Metzger
Apache Flink is an open-source framework for parallel data analysis. The core of Flink is a distributed stream processing engine that provides exactly-once semantics, low-latency processing, and system-managed operator state. Flink's high-level programming APIs and support for batch processing make Flink a good choice for real-time data analysis. Apache Flink is one of the most active big data projects in the Apache Software Foundation and has more than 100 contributors.

This talk presents Flink's architecture and the design decisions behind its unique set of features. It discusses the pipelined execution engine for low-latency processing, operator state management, and the fault tolerance mechanisms. It will also cover master high availability and system monitoring, and show a performance evaluation of the system.

Speakers

Robert Metzger

Co-Founder and Software Engineer, data Artisans
Robert Metzger is a PMC member at Apache Flink and co-founder and software engineer at data Artisans. | Robert studied Computer Science at TU Berlin and worked at IBM Germany and at the IBM Almaden Research Center in San Jose.


Tuesday September 29, 2015 10:30 - 11:20
Dery/Mikszath

10:30

Apache Phoenix: The Evolution of a Relational Database Layer over HBase - Nick Dimiduk, Hortonworks
This presentation will begin by giving a "State of the Union" of Apache Phoenix, a relational database layer on top of HBase for low latency applications, with a brief overview of new and existing features. Next, the approach for transaction support, a work in progress, will be discussed. Lastly, the current means of integrating with the rest of the Hadoop ecosystem will be examined, including the vision for how this will evolve going forward.

Speakers

Nick Dimiduk

Hortonworks
Nick Dimiduk is a committer and PMC member on both Apache HBase and Apache Phoenix. He's Release Manager for the HBase 1.1 branch and an author of the book HBase in Action, on Manning Press. Nick has also contributed to a number of Apache projects around HBase, including, HTrace, and Calcite. Nick works on the HBase team at Hortonworks where his focus is on operability and performance.



Tuesday September 29, 2015 10:30 - 11:20
Petofi

10:30

High-Throughput Processing With Streaming-OODT - Michael Starch, NASA Jet Propulsion Laboratory
Upcoming customers of Streaming-OODT have predicted that their systems will operate at data throughputs of 10 Gb/s. To use Streaming-OODT at these throughputs, the system must be characterized and well understood in order to support customers’ needs. This presentation discusses Streaming-OODT’s performance processing non-trivial data at these scales and the lessons learned from operating in this environment.

Streaming-OODT uses various other Apache technologies to support its mission of providing a high-performance data system, including Apache Kafka, Apache Spark, Spark Streaming, and Apache Mesos. This presentation will therefore discuss the performance of these technologies working together at high throughput and the lessons learned orchestrating them for high performance as part of the Streaming-OODT system.

Speakers

Michael Starch

Computer Engineer in Applications, NASA Jet Propulsion Laboratory
Michael Starch has been employed by the Jet Propulsion Laboratory for the past 5 years. His primary responsibilities include engineering big data processing systems for handling scientific data, researching the next generation of big data technologies, and helping infuse these systems into the mission world. He is a committer and PMC member on Apache OODT and has spoken about his work at the Southern California Linux Expo and ApacheCon North America.


Tuesday September 29, 2015 10:30 - 11:20
Tas

10:30

Data Ethics - Louis Suárez-Potts, Age of Peers, Inc.
I examine the ethics of Big Data in several ongoing projects and the possibilities of engaging subject communities in the processes and projects. As background: The ethics of data, especially "Big Data," can be considered as the linked ethics of gathering the data and then interpreting it. Big Data--the data and interpretation dyad--complicates this otherwise dull-as-dishwater process in part by obscuring acquisition and reading it as discovery, and in part by abstracting the particular elements making up the data even as those may refer to persons and their doings. That is: were an experiment conducted on any population, the persons objectified would likely have to sign their consent. This talk looks at ways to engage (and so form) communities as subjects and not just objects of Big Data projects. Apache's important projects are key here.

Speakers

Louis Suárez-Potts

Community Strategist, Age of Peers, Inc.
Louis Suárez-Potts is the community strategist for Age of Peers, a consultancy he co-founded in 2011. He also participates on the Project Membership Committee for Apache OpenOffice. From 2000 to 2011, Suárez-Potts was the Community Manager for OpenOffice.org, a role that entailed considerable public speaking at international developer and marketing conferences, as well as more focused events. The role was partly subsidized by Sun Microsystems... Read More →


Tuesday September 29, 2015 10:30 - 11:20
Huba

11:30

Spark and Machine Learning to the aid of the Data Scientist - Frank Ketelaars, IBM and Andrey Vykhodtsev,IBM
In this talk, Andrey Vykhodtsev and Frank Ketelaars will talk about how data scientists benefit from the scalable machine learning capabilities of Spark. They will demonstrate a machine learning workflow and how it is executed using Apache Spark and MLlib. The ability to answer the needs of data scientists will improve further with the contribution of IBM Research's machine learning system (also known as SystemML) to the Apache Spark project, making it easier and quicker to express machine learning algorithms.

Speakers

Frank Ketelaars

Big Data Leader, IBM
Frank Ketelaars is working as part of a European team focused on IBM Big Data  | solutions, including Hadoop and Real-time Analytical Processing. In his capacity,  | Frank leads the European technical community and conducts Big Data architecture  | sessions with customers and business partners across all industries. | Prior to his current role, Frank has fulfilled various national and international  | assignments, being... Read More →

Andrey Vykhodtsev

Big Data Solution Architect, IBM
Andrey has broad expertise in analytics. He is a Big Data Solution Architect at IBM, responsible for the IBM Big Data product stack, which includes many open source components. Andrey educates and consults customers and IBM Business Partners across Central & Eastern Europe on Big Data, machine learning, data science, and related areas.


Tuesday September 29, 2015 11:30 - 12:20
Dery/Mikszath

11:30

How Bigtop Leveraged Docker for Build Automation and One-Click Hadoop Provisioning - Evans Ye, Trend Micro
Apache Bigtop, an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build their own customized big data platform as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and numerous Hadoop components must be verified to work together as well. In this presentation, we'll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. The core innovation here is the newly developed Hadoop provisioner that leverages Docker for infrastructure automation. The talk covers the technical details of the Bigtop Hadoop provisioner, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.

Speakers

Yu-Hsin Yeh

Sr. Software Engineer, Trend Micro
Evans Ye (Yu-Hsin Yeh) is currently a Committer and PMC member of Apache Bigtop. He works at Trend Micro developing big data infrastructure and applications. He loves to code, automate things, and develop big data solutions. Aside from engineering, he is also an enthusiast for giving talks to share software innovations and cutting-edge technologies. Evans talked about Bigtop’s recent improvements leveraging Docker at Apache: Big Data EU 2015. In... Read More →


Tuesday September 29, 2015 11:30 - 12:20
Krudy/Jokai

11:30

What's New With Apache Tika? - Nick Burch
Apache Tika detects and extracts metadata and text from a huge range of file formats and types. From Search to Big Data, single file to internet scale, if you've got files, Tika can help you get out useful information!

Apache Tika has been around for nearly 10 years now, and in that time a lot has changed. Not only has the number of supported formats gone up and up, but the ways of using Tika have expanded, and some of the philosophies on the best way to handle things have been refined with experience. Tika has gained support for a wide range of programming languages too, and, more recently, Big Data-scale support and ways to automatically compare the effects of changes to the library.

Whether you're an old-hand with Tika looking to know what's hot or different, or someone new looking to learn more about the power of Tika, this talk will have something in it for you!

Speakers

Nick Burch

CTO, Quanticate
Nick began contributing to Apache projects in 2003, and hasn't looked back since! He's mostly involved in "Content" projects like Apache POI, Apache Tika and Apache Chemistry, as well as foundation-wide activities like Conferences and Travel Assistance. | | Nick is CTO at Quanticate, a Clinical Research Organisation (CRO) with a strong focus on data and statistics. | | Nick has spoken at most ApacheCons since 2007, as well as many... Read More →


Tuesday September 29, 2015 11:30 - 12:20
Tohotom

11:30

Adding Insert, Update, and Delete to Apache Hive - Owen O'Malley, Hortonworks
Apache Hive provides a convenient SQL query engine and table abstraction for data stored in Hadoop. Hive uses Hadoop to provide highly scalable bandwidth to the data, but until recently did not support updates, deletes, or transaction isolation. This has prevented many desirable use cases such as updating of dimension tables or doing data cleanup. We have implemented the standard SQL commands insert, update, and delete allowing users to insert new records as they become available, update changing dimension tables, repair incorrect data, and remove individual records. This also allows very low latency ingestion of streaming data from tools like Storm and Flume. Additionally, we have added ACID-compliant snapshot isolation between queries so that queries will see a consistent view of the committed transactions when they are launched.
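The statements added to Hive are the standard SQL DML verbs. As an illustration only (using Python's stdlib sqlite3 as a runnable stand-in, since Hive itself requires a Hadoop cluster; the `dim_customer` table is hypothetical), the dimension-table maintenance pattern the abstract describes looks like:

```python
import sqlite3

# Illustrative only: Hive runs these same SQL verbs at Hadoop scale;
# sqlite3 stands in here so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# insert new records as they become available
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "Acme", "Budapest"), (2, "Globex", "Valletta")])

# update a changing dimension table
conn.execute("UPDATE dim_customer SET city = 'Vienna' WHERE id = 1")

# remove an individual, incorrect record
conn.execute("DELETE FROM dim_customer WHERE id = 2")
conn.commit()

rows = conn.execute("SELECT id, name, city FROM dim_customer").fetchall()
```

In Hive, snapshot isolation additionally guarantees that a query launched before the `UPDATE` committed would keep seeing the pre-update rows for its whole run.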

Speakers

Owen O’Malley

Co-founder & Sr Architect, Hortonworks
Owen O’Malley is a co-founder and architect at Hortonworks, which develops the completely open source Hortonworks Data Platform (HDP). HDP includes Hadoop and the large ecosystem of big data tools that enterprises need for data analytics. Owen has been working on Hadoop since 2006 at Yahoo, and was the first committer added to the project. In the last 10 years, he has been the architect of MapReduce, Security, Hive, and Orc.


Tuesday September 29, 2015 11:30 - 12:20
Petofi

11:30

The Best of Apache Kafka Architecture - Ranganathan B, ThoughtWorks
Big data event streaming is a very common part of any big data architecture. Among the available open source big data streaming technologies, Apache Kafka stands out because of its real-time, distributed, and reliable characteristics, all of which are made possible by Kafka's architecture. This talk highlights those architectural features.
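The core architectural idea behind those characteristics, a partitioned append-only log with consumer-managed offsets, can be modeled in a few lines (an illustrative sketch, not Kafka's actual API):

```python
class PartitionedLog:
    """Toy model of Kafka's core abstraction: an append-only log per
    partition, with consumers tracking their own read offsets."""

    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Kafka-style keyed partitioning: the same key always lands in
        # the same partition, preserving per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, partition, offset, max_records=10):
        # Consumers pull from an offset they manage themselves; the
        # broker never deletes records on read, which enables replay.
        records = self.partitions[partition][offset:offset + max_records]
        return records, offset + len(records)

log = PartitionedLog()
p, _ = log.produce("sensor-1", "t=20C")
log.produce("sensor-1", "t=21C")
records, next_offset = log.consume(p, 0)
```

Because reads are just offset lookups and never mutate the log, many independent consumers can process the same stream at their own pace, one of the properties the talk attributes to Kafka's design.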

Speakers

Ranganathan Balashanmugam

Technology Lead, ThoughtWorks
Ranganathan has eleven-plus years of experience developing products and loves to work across the full stack, from front end to back end and scaling. He works for ThoughtWorks as a Technology Lead and is a Microsoft MVP in Data Platform. He runs one of the top technology meetups in Hyderabad, the Hyderabad Scalability Meetup. He is very interested in exploring big data technologies and is a regular speaker.



Tuesday September 29, 2015 11:30 - 12:20
Tas

11:30

Deriving Business Value From Large Image Collections on Hadoop - Michael Natusch, Pivotal
Image collections are rapidly growing in size. Efficient image management is necessary for large image collections to ensure easy searching and browsing. In this talk, we will describe how large image collections can be efficiently managed by presenting a content-based image retrieval (CBIR) system built on Hadoop. A CBIR system takes as an input a query image and returns images depicting content most similar to the input query image. Putting together a CBIR system involves building many components: the image collection, a feature extractor, and machine learning models for mining similar images. In this talk, we will present how a CBIR system can be easily and efficiently realized using Hadoop and SQL on Hadoop technologies. The system we present here discovers latent visual topics associated with each image and retrieves images based on similarity between corresponding visual topics.

Speakers

Tuesday September 29, 2015 11:30 - 12:20
Huba

12:20

Lunch
Tuesday September 29, 2015 12:20 - 14:00
Brasserie Restaurant

14:00

An Introduction to Apache Geode (incubating) - William Markito Oliveira, Pivotal
Companies using Apache Geode (incubating), previously GemFire, have deployed it in some of the most mission-critical, time-sensitive applications in their enterprises, making sure tickets are purchased in a timely fashion, hotel rooms are booked, financial trades are made, and credit card transactions are cleared. Come to this session to learn about becoming a contributor to this powerful and fascinating technology: a brief history of Geode; architecture and use cases; why we are going open source; design philosophy and principles; a code walk-through; how to contribute; and how to create your first application.

Speakers

William Markito Oliveira

Enterprise Architect, Pivotal
William Markito Oliveira is a solution architect of enterprise applications with focus on system integration and highly distributed systems. He has large Java platform experience, solid skills in development and architecture of SOA, Big Data, EAI, and web services-based applications. William currently works at Pivotal and was formerly at Oracle, BEA, and Ericsson. He co-authored a few books (ISBN: 978-0321980083,ISBN: 978-0137081868, ISBN... Read More →


Tuesday September 29, 2015 14:00 - 14:50
Tohotom

14:00

CouchDB 2.0: The Awkward Bits - Mike Wallace, IBM
Apache CouchDB 2.0 will bring Dynamo-style clustering to CouchDB, allowing data and applications to scale out over hundreds of nodes for increased throughput and storage. While operation of a single-node CouchDB instance will remain largely unaffected, the addition of clustering and sharding introduces some new annoyances which have the potential to complicate operational life. In this talk, Mike will identify the parts of clustered CouchDB 2.0 that could be considered to be particularly awkward from an operations perspective so that we can be better prepared when things start to get real.

Speakers

Mike Wallace

Software Engineer and Systems Operator, IBM
Mike Wallace is a software engineer and systems operator with a particular interest in distributed systems and the many and varied ways they can fail. He has been an engineer at IBM (formerly Cloudant) for the last two years as both a developer and operator of their globally distributed Database-as-a-Service platform (based on Apache CouchDB) and is a CouchDB committer. | | Although far from a regular public speaker he has some experience... Read More →


Tuesday September 29, 2015 14:00 - 14:50
Krudy/Jokai

14:00

Apache Sentry (incubating) : Fine-Grained Access Control to Hadoop Ecosystem - Sravya Tirukkovalur, Cloudera
Historically, each Hadoop component offers its own method of access control, so each one needs its own set of permission rules - even when they are accessing the same data in Hadoop. This is an administrative nightmare that slows the adoption of Hadoop when sensitive data is involved. Apache Sentry is a framework that enables fine-grained, role-based authorization for multiple Hadoop ecosystem components. It is a highly modular system that supports authorization for various data models, such as database-style schemas and search indexes. It comes with out-of-the-box support for SQL query frameworks like Apache Hive and Cloudera Impala, extending table privileges to the underlying HDFS storage, as well as for the open source search framework Apache Solr. This session will present an overview of Apache Sentry.

Speakers

Sravya Tirukkovalur

Software Engineer, Cloudera
Sravya Tirukkovalur is a software engineer at Cloudera working on Hadoop security. She is one of the active contributors to the Apache Sentry project and also the PMC Chair. She got her Masters degree from The Ohio State University, with her research focus on High performance and Distributed computing. She is passionate about social impact through technology and volunteers outside of her day job. See... Read More →


Tuesday September 29, 2015 14:00 - 14:50
Dery/Mikszath

14:00

Drilling into Data with Apache Drill - Tugdual Grall, MapR Technologies
Apache Drill is a next-generation SQL engine for Hadoop and NoSQL. Its unique schema-free approach enables self-service data exploration with the agility that organizations need in this new era of rapidly growing and evolving data.

In this talk, based on demonstrations, you will learn the key features and architecture of Apache Drill. You will also see how to get started with Drill and start querying, using SQL, various data sources such as HBase, Hive, Parquet, and Avro, as well as more complex data structures stored in JSON documents.

Speakers

Tugdual Grall

Technical Evangelist, MapR
Tugdual Grall is Chief Technical Evangelist EMEA at MapR. He works with customers and European developer communities to facilitate the adoption of MapR, Hadoop, and NoSQL. | | Before joining MapR, "Tug" was a Technical Evangelist at MongoDB and Couchbase. Tug worked as CTO at eXo Platform, and as a Product Manager and Developer on Oracle's Java/JavaEE platform... Read More →


Tuesday September 29, 2015 14:00 - 14:50
Petofi

14:00

Being Ready for Apache Kafka: Today's Ecosystem and Future Roadmap - Michael Noll, Confluent
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others. After a brief introduction to Kafka, this talk will provide an update on the growth and status of the Kafka project community. The rest of the talk will focus on walking the audience through what's required to put Kafka in production. We’ll give an overview of the current ecosystem of Kafka, including: client libraries for creating your own apps; operational tools; and peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.

Speakers

Michael Noll

Developer Evangelist, Confluent
Michael Noll is the developer evangelist of Confluent, the US startup founded in 2014 by the creators of Apache Kafka, who developed Kafka while at LinkedIn. Previously Michael was the technical lead of the Big Data platform of .COM/.NET DNS operator Verisign, where he grew the Hadoop, Kafka, and Storm based infrastructure from zero to petabyte-sized production clusters spanning multiple data centers, one of the largest Big Data... Read More →


Tuesday September 29, 2015 14:00 - 14:50
Tas

14:00

More Data, More Problems - A Practical Guide to Testing on Hadoop - Michael Miklavcic, Hortonworks
Just because the data is big doesn't mean you can't test. We believe automating tests for your Hive, Pig, and MapReduce code is one of the most critical time investments you can make in your SDLC on Hadoop. We provide a soup-to-nuts, practical exposition on testing a variety of Hadoop application types to enable you to get better results faster.

Speakers

Michael Miklavcic

Systems Architect, Hortonworks
Michael is a software engineer with over ten years of industry experience and has been a Systems Architect with Hortonworks for the past two years. He is a code contributor to the Apache Falcon project and works directly with clients to implement solutions using Hadoop. For over 2 years he has guided many Big Data and Hadoop projects at large enterprises to success. Michael has degrees in computer science and computer information systems from... Read More →



Tuesday September 29, 2015 14:00 - 14:50
Huba

15:00

Apache Ignite - JCache and Beyond - Dmitriy Setrakyan, GridGain
This presentation will provide a good overview of the Apache Ignite project, including a detailed look at the distributed in-memory Data Grid, Compute Grid, Streaming, in-memory SQL, and many other components provided by Apache Ignite. We will also go into detail on how existing in-memory caching products and data grids can be used to share memory across Apache Spark jobs and applications, and we will present a hands-on demo demonstrating the performance benefits of querying shared memory using SQL.

Speakers

Dmitriy Setrakyan

EVP Engineering, GridGain
Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior to GridGain, Dmitriy worked at eBay where he was responsible for the architecture of an ad-serving system processing several billion hits a day. Currently Dmitriy... Read More →


Tuesday September 29, 2015 15:00 - 15:50
Tohotom

15:00

Decentralized Document Delivery - Benjamin Young, The Hypothesis Project
Apache CouchDB is a document-centric database. It also replicates: it can make exact copies of a database and keep them in sync--even as the network comes and goes. Two (or more) databases can be actively written to, with changes synchronized across them. There is no center.

Add PouchDB, a CouchDB compatible database that lives inside the browser or node.js, and you have an architecture ready to survive the fickleness of technology, businesses, and other regimes. The end result is data where you need it.
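The sync behavior described above can be sketched as a toy model (illustrative Python only; real CouchDB tracks full revision trees, checkpoints, and deterministic conflict resolution rather than a bare revision counter):

```python
def replicate(source, target):
    """One-way, incremental sync in the spirit of CouchDB replication.

    Documents are {'_id': ..., '_rev': int, ...}; a doc is copied when
    the target has never seen it or holds an older revision, so
    repeating the call transfers nothing new (replication is
    incremental and idempotent).
    """
    for doc_id, doc in source.items():
        existing = target.get(doc_id)
        if existing is None or existing["_rev"] < doc["_rev"]:
            target[doc_id] = dict(doc)
    return target

# A laptop edited doc "a" while offline; the server gained doc "b".
laptop = {"a": {"_id": "a", "_rev": 2, "title": "edited offline"}}
server = {"a": {"_id": "a", "_rev": 1, "title": "original"},
          "b": {"_id": "b", "_rev": 1, "title": "server only"}}

# Push laptop -> server, then pull server -> laptop: both converge.
replicate(laptop, server)
replicate(server, laptop)
```

Running replication in both directions is exactly the "no center" property: either side can accept writes, and a later sync in each direction brings the databases to the same state.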

We'll take a look at architecting for a decentralized future built of documents delivered democratically across all the divides.

Speakers

Benjamin Young

Web & Developer Advocate, BigBlueHat
Benjamin Young is a Developer, Web, and Open Source Advocate. Benjamin's focus is on content and how we human beings interface with it and each other around it. He currently explores the edges of a re-decentralized Web leveraging annotation--additional content added by anyone. Benjamin is currently an Invited Expert in the Annotation and Digital Publishing Working Groups at the W3C. He has previously worked as an inventor and evangelist for IBM's... Read More →



Tuesday September 29, 2015 15:00 - 15:50
Krudy/Jokai

15:00

Hive on Spark: What It Means to You? - Xuefu Zhang, Cloudera
Apache Hive has wide use cases for batch-oriented SQL workloads for ETL and data analytics in the Hadoop ecosystem. Up to now, most of these workloads are still executed by a 10-year-old technology, MapReduce. On the other hand, Apache Spark, a general open-source data processing framework, is positioned to replace MapReduce with faster data processing and more efficient memory utilization.

The Hive on Spark initiative introduced Spark as Hive's new execution engine, providing faster SQL on Hadoop while maintaining Hive's feature richness. With a joint effort from the Hive community and feedback from early adopters and beta users, Hive on Spark is ready for production deployment!

This presentation will share with you the motivation, architecture, deployment practice, and performance tuning. A live demo will be given to conclude the presentation.

Speakers

Xuefu Zhang

Software Engineer, Uber Technologies
Xuefu Zhang has over 10 years' experience in software development. Earlier this year he joined Uber as a software engineer from Cloudera, where he spent most of his effort on Apache Hive and Pig. He also worked on the Hadoop team at Yahoo when the majority of Hadoop development still happened there. In addition, he spent his early career at Informatica, gaining important experience in enterprise software development, especially in ETL and... Read More →


Tuesday September 29, 2015 15:00 - 15:50
Petofi

15:00

Building a Highly-Scalable Open-Source Real-time Streaming Analytics System Using Spark SQL, Apache Geode (incubating), SpringXD and Apache Zeppelin (incubating) - Fred Melo, Pivotal
The Internet of Things requires new applications to consume data that streams in from connected devices and apply advanced real-time analytics. It also demands the ability to scale horizontally in order to support a large number of devices, while keeping extremely low latency for immediate data insights. How can you leverage open source software like Apache Geode (incubating), Spring XD, Docker, Apache Zeppelin (incubating), Apache Spark and Cloud Foundry/Lattice to quickly build a complete IoT solution? This presentation will walk you through the construction of a system leveraging these technologies and a Raspberry Pi with sensors, including a live demo that captures data during the conference and performs some real-time analytics.

Speakers

Tuesday September 29, 2015 15:00 - 15:50
Tas

15:00

Unified Analytics @InMobi Through Apache Lens - Amareshwari Sriramadasu, Inmobi
Apache Lens enables multi-dimensional queries in a unified way over datasets stored in multiple warehouses. It allows queries to be executed where the data resides, providing a logical data cube abstraction. In a typical enterprise, multiple data warehouses co-exist, as a single one does not address all workload requirements in a cost-effective way. Apache Lens unifies the underlying storages, allows multiple execution engines to access the underlying data, and picks the right engine for execution at query time. In this talk, the speakers will share their experience of running Apache Lens in production and discuss upcoming features in Apache Lens.

Speakers

Amareshwari Sriramadasu

Architect, Inmobi
Amareshwari currently works as an Architect on the data team at Inmobi, where she works on Hadoop and related projects for data collection and analytics. She is a member of the ASF, the Apache Incubator PMC, the Apache Hadoop PMC, the Apache Lens PMC and the Apache Falcon PMC, and is an Apache Hive committer. She has been working on Hadoop and its ecosystem since 2007. Prior to Inmobi, she worked with Yahoo! in the core Hadoop team. She has spoken at Hadoop Summit... Read More →


Tuesday September 29, 2015 15:00 - 15:50
Huba

16:00

Open-Source In-Memory Platforms - Konstantin Boudnik, WANdisco
Apache Bigtop has created the de-facto standard for how Hadoop-based stacks are developed, delivered, and managed. We are at it again! The track will present the composition of the next generation of the in-memory computing stack, built entirely from open-source components. The next generation of the Apache data processing stack will focus on in-memory and transactional processing of large amounts of data. We will also talk about the performance benefits that legacy data-processing software based on MapReduce, Hive, and the like can derive from in-memory computing. This session will discuss and analyze the benefits of practicing Fast Data in the open.

Speakers

Konstantin Boudnik

CEO, Memcore
Dr. Konstantin Boudnik, co-founder and CEO of Memcore Inc, is one of the early developers of Hadoop and a co-author of Apache BigTop, the open source framework and community around the creation of software stacks for data processing projects. With more than 20 years of experience in software development, big- and fast-data analytics, Git, distributed systems and more, Dr. Boudnik has authored 16 US patents in distributed computing. Dr. Boudnik... Read More →


Tuesday September 29, 2015 16:00 - 16:50
Tohotom

16:00

Hands-On with Apache CouchDB 2.0 - Mike Wallace, IBM; Michelle Phung, IBM; Glynn Bird, IBM
This is a hands-on introduction to Apache CouchDB. We'll tour the user-facing API and Fauxton dashboard while looking at the concepts behind it. You'll learn what it means to build an application on top of CouchDB and make it shine within an hour's time.

You will learn basic data storage and retrieval, data-design, querying, replication and various neat features on the edges of CouchDB. If you are coming from the relational world, this talk will help you understand how to "think in CouchDB".

Speakers

Glynn Bird

Before joining IBM Cloud Data Services, Glynn served as the Head of IT and Development for Central Index, creating a white-label frontend for a NoSQL business directory (using PHP, Node.js, MySQL, Redis, Cloudant, and Redshift). His experience includes writing CRM systems, "find my nearest" indexes, e-commerce platforms and a phone tracking app. He also built a transport route-planning system in Java. Glynn got his start in Research and... Read More →

Michelle Phung

UX Engineer, IBM, Cloudant
Technologist at IBM Cloudant. | | Big fan of graphic design and UX | Recent and active committer to Apache CouchDB | Contributes to user interface for Project Fauxton | Created a visual guide for Fauxton | Spoken once previously, at a CouchDB Meetup in Boston, MA

Mike Wallace

Software Engineer and Systems Operator, IBM
Mike Wallace is a software engineer and systems operator with a particular interest in distributed systems and the many and varied ways they can fail. He has been an engineer at IBM (formerly Cloudant) for the last two years as both a developer and operator of their globally distributed Database-as-a-Service platform (based on Apache CouchDB) and is a CouchDB committer. | | Although far from a regular public speaker he has some experience... Read More →


Tuesday September 29, 2015 16:00 - 16:50
Krudy/Jokai

16:00

Netflix: Integrating Spark at Petabyte Scale - Cheolsoo Park, Netflix and Ashwin Shankar, Netflix
The Big Data Platform team at Netflix maintains a cloud-based data warehouse with over 10 petabytes of data stored predominantly in Parquet format. Our platform has traditionally leveraged Pig for ETL processing, Hive for large analytic workloads, and Presto for interactive and exploratory use cases. For a long time, Spark seemed attractive to complement our platform, but technical gaps prevented effective use at scale in our environment. Recent improvements have allowed us to add Spark to our cloud data architecture and interoperate seamlessly with the other tools and services in our stack.

We will go into detail about our deployment configuration and what it takes to run Spark alongside traditional workloads on YARN. We will share examples of a few of our largest workflows translated to Spark for comparison in terms of both performance and complexity.

Speakers

Cheolsoo Park

Senior Software Engineer, Netflix
Cheolsoo Park is an Apache Pig PMC member and Spark contributor. He is also a senior software engineer at Netflix and works on cloud-based big data analytics infrastructure that leverages open source technologies including Hadoop, Hive, Pig, and Spark.

Ashwin Shankar

Ashwin Shankar is an Apache Hadoop and Spark contributor. He is a senior software engineer at Netflix and is passionate about developing features and debugging problems in large scale distributed systems. Ashwin holds a Master's degree in Computer Science from the University of Illinois at Urbana-Champaign.


Tuesday September 29, 2015 16:00 - 16:50
Dery/Mikszath

16:00

Help Build the most Advanced SQL Database on Hadoop: HAWQ - Lei Chang, Pivotal
HAWQ is a massively parallel processing SQL engine sitting on top of HDFS. As a hybrid of an MPP database and Hadoop, it inherits the merits of both. It is standard SQL compliant, extremely fast and scalable, and, unlike other SQL engines on Hadoop, it is fully transactional. HAWQ is currently being proposed as an Apache incubating project.

In this talk, Dr. Lei Chang will give an overview of the HAWQ architecture and the major areas that are soliciting contributions from the open source community. He will also introduce the easiest way for contributors to work with HAWQ developers to bring their innovative ideas into the HAWQ kernel.

Speakers

Lei Chang

Engineering Director for Apache HAWQ, Pivotal
Dr. Lei Chang is the Engineering Director for Apache HAWQ at Pivotal Inc. He is the co-creator and architect of HAWQ. Before joining Pivotal, he was a senior research scientist at EMC. His main research areas include parallel databases, data analytics and cloud computing. He has published widely on data management and data mining in refereed journals and conferences, and holds dozens of US patents. He obtained his PhD... Read More →


Tuesday September 29, 2015 16:00 - 16:50
Petofi

16:00

Fly the Coop! - Getting Big Data to Soar With Apache Falcon - Michael Miklavcic, Hortonworks
Getting your Data Lake to function like a reservoir doesn't happen by accident. From ETL to analytics, all enterprise-level big data jobs eventually need a reliable platform for automation and data lifecycle management. In this presentation we walk you through Apache Falcon and show real working code examples of data pipelines in action.

Speakers

Michael Miklavcic

Systems Architect, Hortonworks
Michael is a software engineer with over ten years of industry experience and has been a Systems Architect with Hortonworks for the past two years. He is a code contributor to the Apache Falcon project and works directly with clients to implement solutions using Hadoop. For over 2 years he has guided many Big Data and Hadoop projects at large enterprises to success. Michael has degrees in computer science and computer information systems from... Read More →



Tuesday September 29, 2015 16:00 - 16:50
Tas

16:00

Collecting User's Data in a Socially-Responsible Manner - Konark Modi, Cliqz
Data from users is needed to build great products. Google, Facebook and DoubleClick would not be able to offer their services unless they had tons of data.

Cliqz is no exception: it needs massive amounts of query logs, browsing patterns, etc. to build its search engine and phishing protection. Such data is also collected by other search engines like Google, Yandex and Bing. The industry standard is to send raw data and sanitize and filter it at the backend; however, this approach implies absolute trust in the company's good intentions, and there is always the risk of a data breach or a government subpoena.

Cliqz developed a framework called 'Human Web' that combines algorithms and an open infrastructure to collect data anonymously, removing any trace of user identifiability. The 'Human Web' will be open-sourced to encourage others to collect users' data in a safer manner.
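The "sanitize before sending" idea can be contrasted with the raw-data approach in a small sketch: strip identifiers and identifying query fragments on the user's machine, so each record stands alone. This is purely illustrative; the regexes, field names and `sanitize` helper are hypothetical, not Cliqz's actual Human Web code.

```python
import re

# Illustrative client-side sanitization: drop identifiers and obviously
# identifying query fragments before anything leaves the user's machine.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
LONG_NUMBER = re.compile(r"\d{6,}")  # phone numbers, account IDs, etc.

def sanitize(record: dict) -> dict:
    query = record["query"]
    query = EMAIL.sub("<email>", query)
    query = LONG_NUMBER.sub("<number>", query)
    # no user id, no session id -- each emitted record stands alone
    return {"query": query}

print(sanitize({"user_id": "u-123",
                "query": "tickets for jane.doe@example.com 4915771234567"}))
```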

Speakers

Konark Modi

Software Engineer, Cliqz
Konark Modi works as a Software Engineer at Cliqz on projects related to data collection and safe web principles. Cliqz is a novel search engine embedded in the browser with a very strong focus on privacy. Cliqz for Firefox (Germany only) has more than 3M users, with more than 500K daily active users. Prior to Cliqz he worked as a Sr. Engineer on the Data Platform team at MakeMyTrip.com. | He has also presented at numerous... Read More →


Tuesday September 29, 2015 16:00 - 16:50
Huba

17:00

BoFs: Next Generation Data Processing

This session is an informal meeting about post-MapReduce frameworks such as Spark or Flink. We will also talk about the ecosystem, architectural patterns (e.g. Lambda & Kappa), programming (Scala et al.) and abstraction/SQL frameworks on general-purpose data engines.

After hours of listening, it is about time that you had a chance to talk. Share your thoughts, ideas and questions. Remember, there is no such thing as a stupid question. This is also the perfect place to ask questions about session topics that came up after the session ended.

Agenda

1. Recap and introduction to the topic

2. General discussion in the big room

3. Fork into sub-BoFs in smaller rooms on demand (if you want to talk about the details of a certain topic and have a deep dive into technical details, we invite you to gather some people and create a "sub-BoF")

 

Speakers

Stefan Papp

I am an Apache Hadoop Evangelist and my focus is data processing on distributed platforms. My core interests right now are Apache Flink and Apache Spark. In addition, I love to program; Scala is my favourite language, as it is perfectly designed for distributed data processing. Besides Scala, I invest my time in Java, R and Python.


Tuesday September 29, 2015 17:00 - 19:00
Tas

18:00

Apache HAWQ's Nest Community Meeting
This meeting will also be available online at https://pivotalcommunity.adobeconnect.com/hawqnest/

Tuesday September 29, 2015 18:00 - 19:00
Pivotal Hacker Lounge
 
Wednesday, September 30
 

08:00

Registration
For those staying in the hotel, please proceed to the Brasserie Restaurant for breakfast. For those not staying in the hotel, continental breakfast will be available in the Valletta II Foyer and Attendee Lounge daily.

Wednesday September 30, 2015 08:00 - 09:30
Valletta 1 Foyer

09:00

Keynote: Creating a Market for Rapid Shared Innovation in Analytics - Mark Shuttleworth, Canonical
The focus of this talk will be to help the community and the many startups around the field
reduce the friction in the ecosystem, spreading their ideas and their code faster, and reducing the time to commercial success. 


Wednesday September 30, 2015 09:00 - 09:20
Grand Ballroom

09:20

Keynote: It Takes a Community to Build a Platform - Greg Chase, Pivotal
This era of IT is full of stories of fast-growing startups disrupting industry stalwarts by leveraging big data and new user experiences built from open source software. While these startups merely need to find engineers with the talent to build these new services, existing enterprises need to find this expertise while bridging their existing business and customer relationships into these new experiences.

Large enterprise customers turn to Pivotal to create innovation platforms because they recognize Pivotal's leadership in open source and openly governed communities, ranging from the self-governed SpringSource community, to creating the Cloud Foundry Foundation, and most recently making significant contributions of new projects to the ASF.

Hear the latest about how Pivotal is working with the ASF to grow in-memory computing with Apache Geode, SQL on Hadoop with Apache HAWQ, and data science with Apache MADlib. Find out about the Open Data Platform (ODP) and how this community of communities enhances the core mission of the ASF for the betterment of big data technology users.

Speakers

Greg Chase

Biography coming soon.


Wednesday September 30, 2015 09:20 - 09:30
Grand Ballroom

09:30

Apache Spark - Making the Unthinkable Possible - Anjul Bhambhri, Vice President, Big Data, IBM Silicon Valley Lab
In this keynote, IBM's Anjul Bhambhri will highlight how Spark is stretching the boundaries of big data thinking, and how IBM is contributing to the movement.

Speakers

Anjul Bhambhri

Anjul Bhambhri is IBM’s Vice President of Big Data Products, overseeing product strategy, development and business partnerships. Previously at IBM, Anjul focused on application and data lifecycle management tools and spearheaded the development of XML capabilities in DB2 database server. She has 25 years of experience in the database industry and has held engineering and management positions at IBM, Informix and Sybase. In 2009, she... Read More →


Wednesday September 30, 2015 09:30 - 09:40
Grand Ballroom

09:40

Morning Break
Wednesday September 30, 2015 09:40 - 10:00
Valletta 2 Foyer

10:00

Upholstering Apache CouchDB - Benjamin Young, The Hypothesis Project
Apache CouchDB does two things other databases don't: it replicates, and it speaks HTTP as its primary protocol. These unique qualities allow you to build applications that are "of the Web" but that can also move "off the Web"--into your local network, your farm equipment, or the phone you're carrying.

In this tutorial we'll take a look at CouchApps. CouchApps are application logic (index definitions, document and result templates, validation functions) that live inside a CouchDB database along with the static HTML, JS, CSS, and images needed for the UI.

We'll look at various tools for building, integrating, and deploying CouchApps, and take a deep dive into building one: both the thought process and the code. Near the end, we'll throw a replication party--moving the app between attendee devices, sharing the app and its accumulated data.
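For readers new to CouchApps, the application logic lives in a design document stored as JSON in the database. A minimal sketch follows; the database and view names are invented for illustration, though the `_id`, `views`, and `validate_doc_update` fields follow CouchDB's documented design-document layout.

```python
import json

# A minimal CouchApp-style design document (hypothetical example):
# index definitions and validation logic stored alongside the data,
# with static UI assets typically added as attachments.
design_doc = {
    "_id": "_design/blog",
    "views": {
        "by_date": {
            # CouchDB map functions are JavaScript, stored as strings
            "map": "function(doc) { if (doc.type === 'post') emit(doc.date, doc.title); }"
        }
    },
    "validate_doc_update":
        "function(newDoc, oldDoc, userCtx) {"
        " if (newDoc.type === 'post' && !newDoc.title)"
        " throw({forbidden: 'posts need a title'}); }",
}

print(json.dumps(design_doc, indent=2))
```

Because the design document is itself just a document, it replicates with the data, which is what makes the "replication party" possible.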

Speakers

Benjamin Young

Web & Developer Advocate, BigBlueHat
Benjamin Young is a Developer, Web, and Open Source Advocate. Benjamin's focus is on content and how we human beings interface with it and each other around it. He currently explores the edges of a re-decentralized Web leveraging annotation--additional content added by anyone. Benjamin is currently an Invited Expert in the Annotation and Digital Publishing Working Groups at the W3C. He has previously worked as an inventor and evangelist for IBM's... Read More →


Wednesday September 30, 2015 10:00 - 10:50
Huba

10:00

How to Deploy a Secure, High-Available, Hadoop Platform - Olaf Flebbe, science+computing ag
We demonstrate the fully automatic installation of a Hadoop cluster, including infrastructure. The basic building blocks of the demonstration are the Debian distribution (including Puppet), configuration with Hiera, a list of community Puppet modules, and deploy scripts and packages from the Apache Bigtop distribution. The automatically installed cluster sports an HA MIT Kerberos and OpenLDAP setup, Apache ZooKeeper fencing, and HA Hadoop (journalling and RM). Web GUIs are authenticated with SPNEGO, Hive is configured with standard SQL authorization, and Hue is provided as the frontend.

One of the advanced features is the use of Puppet's CA via PKINIT for preauthentication and for bootstrapping a secure Kerberos KDC and securing Hadoop with it.

Speakers

Olaf Flebbe

Chief Software Architect
Dr. Olaf Flebbe received his PhD in computational physics in Tübingen, Germany. He works as the chief software architect at science+computing ag. He is a member of the PMC of Apache Bigtop. Occasionally he gives talks about random projects at various conferences.


Wednesday September 30, 2015 10:00 - 10:50
Tohotom

10:00

Configuring and Optimizing Spark Applications with Ease - Nishkam Ravi, Cloudera
The Spark API exports intuitive and performant one-liners for data processing, which hide complexity and allow applications to be developed quickly. As an in-memory system, Spark has to be configured properly for performance and stability, which can sometimes be challenging. Based on internal deployments and interaction with customers, we conclude that (i) most Spark woes can be traced back to misconfiguration, and (ii) there is a need for tools that can aid configuration, performance optimization and debugging. In this talk, we will discuss common Spark configuration pitfalls and show how they can be avoided with the help of the auto-configuration and optimization tool being developed at Cloudera.
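To make the pitfalls concrete, here is an illustrative handful of spark-defaults entries that are frequent sources of misconfiguration. The values are arbitrary examples for a hypothetical cluster, not recommendations from the talk.

```python
# Illustrative spark-defaults settings that commonly cause trouble when
# left at (or tuned to) the wrong values:
spark_conf = {
    "spark.executor.memory": "4g",                # JVM heap per executor
    "spark.yarn.executor.memoryOverhead": "768",  # off-heap headroom; too low and YARN kills containers
    "spark.executor.cores": "2",                  # concurrent tasks per executor
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",  # faster than Java serialization
}

for key, value in sorted(spark_conf.items()):
    print(f"{key}={value}")
```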

Speakers

Nishkam Ravi

Software Engineer, Cloudera
Nishkam is a Software Engineer at Cloudera. His current focus is Spark and MapReduce performance. Nishkam got his B.Tech from IIT-Bombay and PhD from Rutgers. His first job was with Intel as a compiler engineer. Prior to joining Cloudera, Nishkam was a Research Staff Member at NEC Labs where he developed an optimizing compiler for MapReduce. He has presented in numerous peer reviewed AI and systems conferences in the past. | | Hari is a... Read More →


Wednesday September 30, 2015 10:00 - 10:50
Dery/Mikszath

10:00

Apache Kylin - Extreme OLAP engine for Hadoop - Seshu Adunuthula, eBay Cloud Services
Apache Kylin is an open source distributed analytics engine, contributed by eBay Inc., that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. Kylin's pre-built MOLAP cubes, distributed architecture, and high concurrency help users analyze multidimensional queries through Kylin's SQL interface as well as via other BI tools like Tableau and MicroStrategy. Kylin is successfully deployed and used at eBay for a variety of production use cases, including web traffic analysis and geographical expansion analysis. It was open sourced on Oct 1, 2014 and has 320 stars and 125 forks. Kylin was accepted as an Apache Incubator project on Nov 25, 2014.
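The idea behind pre-built MOLAP cubes can be sketched in a few lines: precompute aggregates for every combination of dimensions (the cuboids), so a GROUP BY query becomes a lookup. This is a toy illustration with invented data and dimension names, not Kylin code.

```python
from itertools import combinations
from collections import defaultdict

# Toy fact table and dimensions
rows = [
    {"country": "DE", "device": "web", "sales": 10},
    {"country": "DE", "device": "app", "sales": 5},
    {"country": "HU", "device": "web", "sales": 7},
]
dimensions = ("country", "device")

# Pre-aggregate one "cuboid" per dimension subset
cube = {}
for r in range(len(dimensions) + 1):
    for dims in combinations(dimensions, r):
        agg = defaultdict(int)
        for row in rows:
            key = tuple(row[d] for d in dims)
            agg[key] += row["sales"]
        cube[dims] = dict(agg)

# A "SELECT country, SUM(sales) GROUP BY country" is now a lookup:
print(cube[("country",)])  # {('DE',): 15, ('HU',): 7}
```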

Speakers

Seshu Adunuthula

Sr. Director of Analytics Infrastructure, eBay
Seshu Adunuthula is Sr Director of Analytics Infrastructure at eBay responsible for managing some of the world’s largest deployments of Hadoop, Teradata and ETL Ingest platforms. He is an industry veteran with over 20 years of Distributed Computing and Analytics Experience. Prior to eBay he was managing the Engineering team at MapR responsible for MapReduce, MapR-DB and MapR Control System Products. Seshu also held various Engineering... Read More →

Qianhao Zhou

Qianhao Zhou is a Sr. Software Engineer at eBay CCOE and a core developer of Apache Kylin at eBay, working on different components of Kylin, including the Job Engine and Streaming Engine. He is now working on Kylin on Spark, to enable fast cubing on Spark for Kylin's cube build process.


Wednesday September 30, 2015 10:00 - 10:50
Petofi

10:00

How We Use Kappa Architecture in all of Our Projects - Juantomas Garcia, Aspgems
As the CDO of ASPGems, I will explain what Kappa Architecture is, how we have used it in every project over the last 15 months, and what kinds of problems we are solving in real projects, from small ones to very big ones (millions of records per second).
We will also explain why Scala + Kafka + Spark are the key technologies that help us deliver successful projects.
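In miniature, the pattern looks like this: there is no separate batch layer, and every view is produced by replaying one append-only log. A toy sketch, under the assumption of a single in-memory log standing in for a Kafka topic; all names are invented.

```python
from collections import defaultdict

log = []  # stand-in for an append-only Kafka topic

def append(event):
    log.append(event)

def replay(process, from_offset=0):
    """Rebuild a view by reprocessing the log with the given function."""
    state = defaultdict(int)
    for event in log[from_offset:]:
        process(state, event)
    return dict(state)

def count_by_user(state, event):
    state[event["user"]] += 1

append({"user": "ana", "action": "click"})
append({"user": "bob", "action": "click"})
append({"user": "ana", "action": "view"})

# The serving view, and any later reprocessing (e.g. with fixed code),
# come from the same log replay:
print(replay(count_by_user))  # {'ana': 2, 'bob': 1}
```

Reprocessing after a code change is just another `replay` over the same log, which is the Kappa alternative to maintaining a parallel batch pipeline.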

Speakers

Juantomas Garcia

Data Solutions Manager, Open Sistemas
President of Hispalinux (the Spanish Linux user group) from 1999 to 2007. Author of the book "La Pastilla Roja" (2004), the first book in Spanish about free software. More than 200 lectures around the world. Now CDO of Open Sistemas and an advocate of Apache Spark and Kappa Architecture. Organizer of the Machine Learning Spain meetup.


Wednesday September 30, 2015 10:00 - 10:50
Tas

10:00

Tutorial: DIY Continuous Delivery Pipeline for Big/Fast Data Apps - Nate DAmico
You have a data processing service you want or need to deploy, you have picked the Apache Software Foundation components you want to leverage, and you begin developing and deploying your apps. As you iterate, you quickly learn that ASF projects and best practices move fast, and you have to deal with changes in versions and configurations, not only in the ASF components you depend on but also in your application itself.

This tutorial session will walk users through Apache Bigtop tooling and reference examples provided by the community to put together a framework and processes that empower users to better handle change management and improve confidence when deploying their data applications/services.

General knowledge of software testing and tools such as Jenkins is helpful, but not required.

Speakers

Nate DAmico

Nate has been working in the enterprise and mobile software industry for 14 years in various capacities. In recent years his tech efforts have focused around areas of mobile computer vision as well as the rise of the consumerization of IT Operations. Three years ago he started Reactor8, creating a set of open source tooling called DTK, to ease the pain of infrastructure/service developers and make advanced automation more approachable. Nate is... Read More →


Wednesday September 30, 2015 10:00 - 11:50
Krudy/Jokai

10:00

BarCampApache
Join us for an ‘unconference’ with no set schedule, facilitated by those involved in various Apache projects. More details and registration information can be found here:
https://wiki.apache.org/apachecon/BarCampBudapest

Wednesday September 30, 2015 10:00 - 14:00
Kond

11:00

HBase and Drill: How Loosely Typed SQL is Ideal for NoSQL - Tugdual Grall, MapR Technologies
The Apache HBase approach to data has a huge potential for expressing NoSQL-y, non-relational programs. Apache Drill supports SQL for non-relational data. Paradoxically, combining this NoSQL with this SQL tool results in something even better.

Using concrete examples such as Time Series and Music Database applications, I will show how and why you should combine HBase and Drill to create highly scalable and available applications exposing NoSQL data to any SQL compliant tool.
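As a flavor of the time-series example, here is a hypothetical HBase-style composite row key: metric id plus a reversed timestamp, so the newest readings for a metric sort first and scan contiguously. The metric name, constant and layout are illustrative, not from the talk; Drill would then query the resulting table with plain SQL.

```python
# Illustrative HBase-style row-key design for time-series data.
MAX_TS = 10**13  # sentinel larger than any epoch-millis timestamp we use

def row_key(metric: str, ts_millis: int) -> bytes:
    # Reverse the timestamp so newer rows sort first within a metric
    # under HBase's lexicographic byte ordering.
    return f"{metric}:{MAX_TS - ts_millis:013d}".encode()

keys = sorted(row_key("sensor-42", ts) for ts in (1_000, 2_000, 3_000))
# lexicographic order == newest first for the same metric
print([k.decode() for k in keys])
```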

Speakers

Tugdual Grall

Technical Evangelist, MapR
Tugdual Grall is Chief Technical Evangelist EMEA at MapR. He works with European customers and developer communities to ease the adoption of MapR, Hadoop and NoSQL. | | Before working at MapR, "Tug" was a Technical Evangelist at MongoDB and Couchbase. Tug worked as CTO at eXo Platform, and as Product Manager and Developer on Oracle's Java/JavaEE platform... Read More →


Wednesday September 30, 2015 11:00 - 11:50
Huba

11:00

Hadoop and Kerberos: the Madness Beyond the Gate - Steve Loughran, Hortonworks
When HP Lovecraft wrote of forbidden knowledge about non-human deities, knowledge which would reduce the reader to insanity, most people assumed that he was making up a fantasy world. In fact he was documenting Kerberos and its Hadoop integration.

There are some things humanity was not meant to know. Most people are better off living lives of naive innocence, never having to see an error message about SASL or GSS, never staring in terror at classes whose initials, UGI, are only ever spoken aloud - or more accurately, whispered.

This talk goes into the depths, to the knowledge which you need to write applications in a secure Hadoop cluster, knowledge that may drive you insane. Forever more, you shall fear voices calling out in the night, voices saying things like "we have an urgent Kerberos-related support call -can you help?"

Speakers

Steve Loughran

Member of Technical Staff, Hortonworks
Steve Loughran is a developer at Hortonworks, where he works on leading-edge Hadoop applications, most recently on Apache Slider and on Apache Spark's integration with Hadoop and YARN, and Hadoop's S3A connector to Amazon S3. He's the author of Ant in Action, a member of the Apache Software Foundation, and a committer on the Hadoop core since 2009. He lives and works in Bristol, England.


Wednesday September 30, 2015 11:00 - 11:50
Tohotom

11:00

Shared Memory Layer for Spark Applications - Dmitriy Setrakyan, GridGain
In this presentation we will talk about the need to share state across different Spark jobs and applications, and about several technologies that make it possible, including Tachyon and Apache Ignite. We will dive into the importance of in-memory file systems and shared in-memory RDDs with Apache Ignite, and present a hands-on demo of the advantages and disadvantages of one approach over another. We will also discuss the requirements of storing data off-heap in order to achieve large horizontal and vertical scale for applications using Spark and Ignite.

Speakers

Dmitriy Setrakyan

EVP Engineering, GridGain
Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior to GridGain, Dmitriy worked at eBay where he was responsible for the architecture of an add-serving system processing several billion hits a day. Currently Dmitriy... Read More →


Wednesday September 30, 2015 11:00 - 11:50
Dery/Mikszath

11:00

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA - Christian Tzolov, Pivotal
In the space of Big Data, two powerful data processing tools complement each other: HAWQ and Geode. HAWQ is a scalable OLAP SQL-on-Hadoop system, while Geode is an OLTP-like, in-memory data grid and event processing system. This presentation will show different approaches that allow integration and data exchange between HAWQ and Geode. It will walk you through the implementation of the different integration strategies, demonstrating the power of combining various OSS technologies for processing big and fast data. The presentation will touch upon OSS technologies like HAWQ, Geode, Spring XD, Hadoop and Spring Boot.

Speakers

Christian Tzolov

Pivotal Inc
Christian Tzolov, Pivotal technical architect, BigData and Hadoop specialist, contributing to various open source projects. In addition to being an Apache® Committer and Apache Crunch PMC Member, he has spent over a decade working with various Java and Spring projects and has led several enterprises on large scale artificial intelligence, data science, and Apache Hadoop® projects. twitter: @christzolov blog: http://blog.tzolov.net


Wednesday September 30, 2015 11:00 - 11:50
Petofi

11:00

Profiting From Apache Projects Without Losing Your Soul - Shane Curcuru, Apache Software Foundation
Does your company want to capitalize on the Apache brand? Are you interested in seeing how closely you can tie your marketing into the latest Apache projects? Do you recognize the importance of supporting the Apache ecosystem, not just with code contributions but other actions? As VP of Brand Management for all Apache projects, Shane can help show business and technical leaders some of the ways they can respectfully and successfully market and position their own services and products in relation to Apache project brands. The key message is: Apache project governance is independent; but we are happy to have businesses build their software and services on any Apache software products. You may incorporate Apache brands within your brands, but in specific ways that still give our communities credit. We're here to help!

Speakers

Shane Curcuru

VP, Brand Management, The Apache Software Foundation
Shane serves as V.P. of Brand Management for the ASF, setting trademark and brand policy for all 250+ Apache projects, and has served as five-time Director, and member and mentor for Conferences and the Incubator. | | Shane's Punderthings consultancy is here to help both companies and FOSS communities understand how to work together better. At home, Shane is: a father and husband, a Member of the ASF, a BMW driver and punny guy. Oh, and we... Read More →


Wednesday September 30, 2015 11:00 - 11:50
Tas

12:00

SAMOA: A Platform for Mining Big Data Streams - Nicolas Kourtellis, Telefonica I+D, Barcelona
In this talk, Nicolas Kourtellis will introduce Apache SAMOA (Scalable Advanced Massive Online Analysis), an open-source platform for mining big data streams (http://samoa.incubator.apache.org). Apache SAMOA provides a collection of distributed streaming algorithms for data mining tasks such as classification, regression, and clustering. The models built can be updated as new data arrive without the need to define data batches or update frequencies. The platform features a pluggable architecture that can run on existing and well-tested distributed stream processing engines such as Storm, S4, Samza and Flink, for scalability and fault tolerance.
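The instance-at-a-time update style that distinguishes streaming algorithms from batch ones can be illustrated with a running mean and variance (Welford's algorithm). This is a generic sketch of online model updates, not SAMOA's actual API.

```python
# Online statistics updated one instance at a time -- no batches,
# no update frequency to configure (Welford's algorithm).
class RunningStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # sample variance; defined once two instances have arrived
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 6.0]:  # instances arriving from the stream
    stats.update(x)
print(stats.mean, stats.variance)  # 4.0 4.0
```

Distributed streaming learners apply the same principle, with the per-instance updates partitioned across the processing engine's workers.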

Speakers
avatar for Nicolas Kourtellis

Nicolas Kourtellis

Researcher, Telefonica I+D
Nicolas Kourtellis is a Researcher at Telefonica Research. Previously he was a Researcher in the Web Mining Research Group at Yahoo Labs, Barcelona. He holds a Ph.D. in Computer Science and Engineering from the University of South Florida (2012), a MSc in Computer Science from the University of South Florida (2008), and a BSc/MEng in Electrical and Computer Engineering from the National Technical University of Athens, Greece (2006). He is... Read More →


Wednesday September 30, 2015 12:00 - 12:50
Huba

12:00

Securing Hadoop in an Enterprise Context - Hellmar Becker, ING
Hadoop clusters can be secured using Kerberos and LDAP, and tools like Ranger and Sentry facilitate security administration. How do you connect a cluster to an enterprise directory with 100,000+ users and centralized role and access management? Hellmar will present ING's approach to synchronize Hadoop role management with the central repository, emphasizing aspects of performance and system stability. He will discuss specific changes to the Ranger security tool that ING introduced to mitigate directory server load, and general aspects of the security model.

Speakers

Hellmar Becker

Sr. IT Specialist, ING
Hellmar has worked in a number of positions in big data analytics and digital analytics. Currently working at ING Bank, implementing Datalake Foundation project (based on Hadoop) within Client Information management. Long standing experience in advanced analytics and data management. Speaker engagements at Hadoop Summit Brussels 2015 (https://www.youtube.com/watch?v=AhT-nxoEkbg) and at various industry events in Germany, including Online Value... Read More →


Wednesday September 30, 2015 12:00 - 12:50
Tohotom

12:00

Apache Ignite: In-Memory Data Fabric in Action - Dmitriy Setrakyan, GridGain
In this talk Dmitriy will dissect the Apache Ignite architecture. We will focus on how Apache Ignite data partitioning and replication works, how computations are distributed and failed over in case of crashes. We will also talk about in-memory streaming in Ignite and various techniques we can employ to make it fault tolerant. To demonstrate how easy it is to get started with Ignite, Dmitriy will also run several Ignite coding examples live during the presentation.

Speakers

Dmitriy Setrakyan

EVP of Engineering, GridGain Systems
Dmitriy Setrakyan is founder and EVP of Engineering at GridGain Systems. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior to GridGain, Dmitriy worked at eBay where he was responsible for the architecture of an add-serving system processing several billion hits a day. Currently... Read More →


Wednesday September 30, 2015 12:00 - 12:50
Krudy/Jokai

12:00

Realtime Reactive Apps with Actor Model and Apache Spark - Rahul Kumar, Sigmoid Analytics
Developing applications with Big Data is really challenging work; scaling, fault tolerance and responsiveness are some of the biggest challenges. A real-time big data application with self-healing features is a dream these days. Apache Spark is a fast in-memory data processing system that makes a good backend for real-time applications. In this talk I will show how to use a reactive platform, the Actor model and the Apache Spark stack to develop a system that is responsive, resilient, fault-tolerant and message-driven.

Speakers

Rahul Kumar

Technical Lead, Sigmoid
Rahul Kumar works as a Technical Lead with Sigmoid. He has more than 4 years of experience in data-driven distributed application development with Java, Scala, and the Akka toolkit. He has developed various real-time data analytics applications using Apache Hadoop, Mesos ecosystem projects, and Apache Spark. He has given a couple of talks on Apache Spark, reactive dashboards and the Actor model at LinuxCon North America & Apache Big Data Europe 2015.


Wednesday September 30, 2015 12:00 - 12:50
Dery/Mikszath

12:00

Introduction to Pivotal HAWQ[1]: A Deep Dive Into the Architecture of an Advanced SQL Engine - Caleb Welton, Pivotal
The Pivotal HAWQ[1] project, planned for incubation into an Apache project, is designed to provide a highly performant ANSI SQL compliant query engine supporting a sophisticated resource management model, transactional DML and DDL operations, window functions, grouping sets, complex sub-queries, common table expressions, and strong extensibility capabilities for customized analytics and machine learning.

In this session we will present an overview of the product features, describe the key architectural components, and walk through the overall project structure.

[1] Project incubation and Apache project name pending approval by the Apache Foundation.

Speakers

Caleb Welton

Director, Pivotal
Caleb Welton is Director for SQL on Hadoop at Pivotal covering the Pivotal HAWQ database. He has spent the last 18 years developing database technology for Oracle, Greenplum, EMC and Pivotal. In addition to his contributions in database technology he is one of the founding members of the open source MADlib project for in-database machine learning. Caleb is named inventor for 11 patents in database technology and has presented papers at SIGMOD... Read More →


Wednesday September 30, 2015 12:00 - 12:50
Petofi

12:00

Data Quality on Mars - ISO 80000 and Other Standards - Werner Keil
Big Data without Data Quality becomes messy and meaningless in most cases. Therefore, data and measurements have to be stored and transferred in a standard way.
We all know that when representing a temperature, for example, we normally have it as decimal/float. But, is this float in Celsius? Fahrenheit? Kelvin?

One of the most vivid examples was Mars Climate Orbiter being lost as the spacecraft went into orbital insertion, due to ground-based computer software which produced output in non-SI units of pound-seconds (lbf×s) instead of the metric units of newton-seconds (N×s) specified in the contract between NASA and Lockheed.

In this session we're going to explore data quality and measurement standards like ISO 80000 or UCUM (Unified Code for Units of Measure), Unit support for programming languages and APIs plus projects using them like Apache SIS, Performance Co-Pilot or uDig.
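The failure mode is easy to reproduce, and to prevent, when values carry their units. A tiny sketch in that spirit; the `Impulse` class and unit strings are invented for illustration and are not the ISO 80000 or UCUM API. The conversion factor 1 lbf·s ≈ 4.4482216 N·s follows from 1 lbf ≈ 4.4482216 N.

```python
NEWTON_SECONDS_PER_LBF_SECOND = 4.4482216  # 1 lbf*s in N*s

class Impulse:
    """A value that carries its unit instead of being a bare float."""
    def __init__(self, value: float, unit: str):
        if unit not in ("N*s", "lbf*s"):
            raise ValueError(f"unknown unit {unit!r}")
        self.value, self.unit = value, unit

    def to_newton_seconds(self) -> float:
        if self.unit == "N*s":
            return self.value
        return self.value * NEWTON_SECONDS_PER_LBF_SECOND

    def __add__(self, other: "Impulse") -> "Impulse":
        # Addition always normalizes to SI, so lbf*s can never silently
        # masquerade as N*s -- the Mars Climate Orbiter failure mode.
        return Impulse(self.to_newton_seconds() + other.to_newton_seconds(), "N*s")

ground = Impulse(1.0, "lbf*s")   # what the ground software emitted
expected = Impulse(1.0, "N*s")   # what the spacecraft expected
print(round((ground + expected).to_newton_seconds(), 4))  # 5.4482
```

With a bare float, `1.0 + 1.0` would have produced a plausible-looking but meaningless `2.0`; making the unit part of the type surfaces the mismatch instead.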

Speakers

Werner Keil

Director, Creative Arts & Technologies
Werner Keil is an Agile Coach and Java and IoT/Embedded expert, helping Global 500 enterprises across industries and leading IT vendors. He has worked for over 25 years as Program Manager, Coach, SW architect and consultant in the finance, mobile, media, transport and public sectors. Werner is an Eclipse and Apache committer and JCP member in JSRs like 354 (Money), 358/364 (JCP.next), Java ME 8, 362 (Portlet 3), 363 (Units, also Spec Lead), 365 (CDI 2), 366 (Java... Read More →


Wednesday September 30, 2015 12:00 - 12:50
Tas

12:50

Lunch
Wednesday September 30, 2015 12:50 - 14:20
Brasserie Restaurant

14:30

Implementing a Highly Scalable In-Memory Stock Prediction System with Apache Geode (incubating), R and Spring XD - William Markito Oliveira, Pivotal and Fred Melo, Pivotal
Financial market prediction has always been one of the hottest topics in data science and machine learning. However, the prediction algorithm is just a small piece of the puzzle. Building a data stream pipeline that constantly combines the latest price information with high-volume historical data is extremely challenging on traditional platforms, requiring a lot of code and careful thought about how to scale or move into the cloud.

This session will walk through the architecture and implementation details of an application built on top of open source tools, demonstrating how to easily build a stock prediction solution with almost no source code - except a few lines of R and a UI based on JavaFX - using Apache Geode for fast data and real-time notifications, and combining streaming with distributed processing for the stock indicator algorithms.

Speakers
avatar for William Markito Oliveira

William Markito Oliveira

Enterprise Architect, Pivotal
William Markito Oliveira is a solution architect of enterprise applications with focus on system integration and highly distributed systems. He has large Java platform experience, solid skills in development and architecture of SOA, Big Data, EAI, and web services-based applications. William currently works at Pivotal and was formerly at Oracle, BEA, and Ericsson. He co-authored a few books (ISBN: 978-0321980083,ISBN: 978-0137081868, ISBN... Read More →


Wednesday September 30, 2015 14:30 - 15:20
Tohotom

14:30

Using Natural Language Processing on Non-Textual Data with MLLib - Casey Stella, Hortonworks
Natural language processing techniques are well established due to their obvious utility. Further, the rise in unstructured textual data has given rise to mature, distributed and scalable implementations. While textual data is extremely common, there is also apparently unstructured data that has underlying structure, in the same way that the words composing sentences have an underlying grammatical structure. This talk explores borrowing natural language processing techniques to analyze the structure in non-textual data.

In particular, we consider the Word2Vec implementation in MLlib to help us organize and analyze non-textual clinical event data (e.g. diagnoses, drugs prescribed, etc.). We will explore connections between diseases and drugs in an unsupervised way with Python, Spark and MLlib.
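The key preparation step in this approach can be sketched in pure Python: each patient's clinical events, ordered by time, become a "sentence" of event tokens, which is the shape of input MLlib's Word2Vec expects (a collection of token sequences). The record layout and event codes below are hypothetical, not taken from the talk.

```python
# Turn per-patient clinical event logs into Word2Vec-style "sentences".
from collections import defaultdict

def events_to_sentences(events):
    """events: iterable of (patient_id, timestamp, event_code) tuples.

    Returns one time-ordered token sequence per patient, so event codes
    play the role of words and patient histories the role of sentences.
    """
    by_patient = defaultdict(list)
    for patient_id, ts, code in events:
        by_patient[patient_id].append((ts, code))
    return [[code for _, code in sorted(history)]
            for history in by_patient.values()]

# Hypothetical events: diagnoses (DX:) and prescriptions (RX:).
events = [
    ("p1", 2, "RX:metformin"),
    ("p1", 1, "DX:diabetes"),
    ("p2", 1, "DX:hypertension"),
    ("p2", 2, "RX:lisinopril"),
]
sentences = events_to_sentences(events)
assert ["DX:diabetes", "RX:metformin"] in sentences
```

With the data in this form, the sequences could be fed to a Word2Vec trainer so that events co-occurring in patient histories end up with nearby embeddings.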

Speakers
CS

Casey Stella

Principal Architect, Hortonworks
I am a principal architect focusing on Data Science in the consulting organization at Hortonworks. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle and as a Research Geophysicist in the Oil & Gas industry. Before that, I was a poor graduate student in Math at Texas A&M. | | I primarily work with the Apache Hadoop software stack. I... Read More →


Wednesday September 30, 2015 14:30 - 15:20
Huba

14:30

Faster ETL Workflows Using Apache Pig & Spark - Praveen Rachabattuni, Sigmoid Analytics
Pig on Spark aims to combine the simplicity of Pig with the faster Spark execution engine, making Pig more attractive to developers. Various contributors are currently working, with the help of the Apache foundation, toward a release-quality build. With Pig on Spark, significant performance benefits have already been observed in ETL workflows previously running on MapReduce: our initial benchmarks have shown a 2x-5x improvement over MapReduce. For one benchmarking test, we considered the ‘distinct’ operation. We used the wikistats dump for 25 days, with a size of 270 GB, on a cluster of one master and four worker machines (16 cores and 64 GB RAM each). It took about 14 minutes with Pig on Spark, compared to about 30 minutes on MapReduce. In this talk, Praveen will share the progress of the project with the community and help people take advantage of Pig on Spark in their workflows.

Speakers
avatar for Praveen Rachabattuni

Praveen Rachabattuni

Technical Team Lead, SigmoidAnalytics
Praveen Rachabattuni is a technical team lead at Sigmoid Analytics. His areas of expertise include real-time Big Data analytics using open source technologies like Apache Spark, Shark and Pig on Spark. He is a committer on the Apache Pig project, contributing to Pig on Spark. He has also worked on building JSON APIs for Spark task data, consumable by custom dashboards and tools.


Wednesday September 30, 2015 14:30 - 15:20
Dery/Mikszath

14:30

Apache Trafodion (incubating) brings operational workloads to Hadoop - Rohit Jain, Esgyn

Trafodion is a world-class transactional SQL RDBMS running on HBase/Hadoop, currently in Apache incubation.

In this talk we will discuss:

  • How operational workloads are different from BI and analytical workloads
  • The operational (OLTP & Operational Data Store) use cases Trafodion addresses
  • Why Trafodion is the right solution for these use cases. That is, what is the recipe for a world-class database engine, and how Trafodion implements the ingredients that make up that recipe:
  1. Time, money, and talent!
  2. World-class query optimizer
  3. World-class parallel data flow execution engine
  4. World-class distributed transaction management system
  • Other important aspects such as performance, scale, availability, and future directions


Speakers
RJ

Rohit Jain

CTO, Esgyn
Rohit Jain is Co-Founder and CTO at Esgyn, an open source database company. Rohit provided the vision behind Apache Trafodion, an enterprise-class MPP SQL Database for Big Data, donated to the Apache Software Foundation by HP in 2015. A veteran database technologist over the past 28 years, Rohit has worked for Tandem, Compaq, and Hewlett-Packard in application and database development. His experience spans online transaction processing... Read More →


Wednesday September 30, 2015 14:30 - 15:20
Petofi

14:30

Integrating Fully-Managed Data Streaming Services with Apache Samza - Renato Marroquin, ETH Zurich
Recently, interest in highly scalable stream processing engines has risen, and many projects have appeared as a result. Apache Samza is a distributed stream processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance and resource management. It is one of the most popular stream processing engines, used by many high-profile companies. Amazon Kinesis, on the other hand, is a fully managed service for real-time processing of streaming data, which lets users scale the amount of data ingested without worrying about infrastructure details. This presentation gives a brief introduction to the popular Samza-Kafka integration, then focuses on the new Samza-Kinesis integration and the opportunities it opens up for users.

Speakers
avatar for Renato Marroquin

Renato Marroquin

PhD student, ETH Zurich
PhD student at ETH Zurich working on distributed databases. He holds a Master's in Computer Science from the Pontifical University of Rio de Janeiro, where he worked with Apache Pig. Google Summer of Code participant, Apache Gora PMC member and committer, and an open source and Big Data enthusiast. Renato has spoken at both open source and academic conferences.


Wednesday September 30, 2015 14:30 - 15:20
Krudy/Jokai

14:30

Hot 100 on Spark - Analyzing Trends in the Billboard Charts - Michael Miklavcic, Hortonworks
Are you a fan of data and music? It may be common knowledge that Taylor Swift and Katy Perry land a lot of number one singles, but are there other more subtle truths that we can find if we dig a little deeper? In this talk we dive into the Billboard charts using Spark and Spark SQL to look for trends and chart outliers using popular statistical analysis techniques like median absolute deviation (MAD).
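The abstract's outlier technique, median absolute deviation (MAD), can be sketched in a few lines. The chart data below is made up for illustration, and the 3.5 threshold on the modified z-score is a common convention, not something stated in the talk.

```python
# Self-contained MAD-based outlier detection.
def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def mad_outliers(xs, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    med = median(xs)
    mad = median([abs(x - med) for x in xs])
    if mad == 0:
        return []
    # 0.6745 scales MAD so the score is comparable to a standard z-score
    # for normally distributed data.
    return [x for x in xs if abs(0.6745 * (x - med) / mad) > threshold]

# Weeks-on-chart for a hypothetical set of singles; 87 is the outlier.
weeks = [10, 12, 11, 9, 13, 10, 87]
assert mad_outliers(weeks) == [87]
```

Because MAD uses medians rather than means, a single extreme chart run does not inflate the spread estimate, which is why it works well for skewed data like chart longevity.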

Speakers
avatar for Michael Miklavcic

Michael Miklavcic

Systems Architect, Hortonworks
Michael is a software engineer with over ten years of industry experience and has been a Systems Architect with Hortonworks for the past two years. He is a code contributor to the Apache Falcon project and works directly with clients to implement solutions using Hadoop. For over 2 years he has guided many Big Data and Hadoop projects at large enterprises to success. Michael has degrees in computer science and computer information systems from... Read More →


Wednesday September 30, 2015 14:30 - 15:20
Tas

15:30

HDFS 2015: Past, Present and Future - Akira Ajisaka, NTT DATA
The Hadoop Distributed File System (HDFS) has plenty of functions for collecting and processing big data, and is therefore used by many companies. As heavy users have gained experience with Hadoop and HDFS, they have developed new demands around scalability, resource efficiency, and security. To satisfy these demands, heterogeneous storage, object storage, data encryption, and many other features have been developed.
This presentation will introduce these new features, developed in 2015, from a developer's vendor-neutral point of view. For each new feature, the talk will cover the main purpose (what problem does it solve?), the architecture (how does it solve the problem?), and the development progress (when will users be able to use it?).

Speakers
avatar for Akira Ajisaka

Akira Ajisaka

Software Engineer, NTT DATA Corporation
Akira Ajisaka is a software engineer working at NTT DATA, Japan. He belongs to the OSS Professional Services team and deploys and operates Hadoop clusters for customers, sometimes troubleshooting them by investigating source code and creating patches. He is an Apache Hadoop committer/PMC member and is involved in various components of Hadoop, improving usability and supportability. He wrote blog posts about activities of the development of Apache... Read More →


Wednesday September 30, 2015 15:30 - 16:20
Huba

15:30

Apache Spark for High-Throughput Systems - Michael Starch, NASA Jet Propulsion Laboratory
Data systems are increasingly expected to support data rates approaching network bandwidth limitations of around 10 Gb/s. Apache Spark achieves high throughput via distributed computing and is thus a good choice for a data system in this environment; however, most technologies break down under these conditions, so it is essential that Apache Spark be characterized for production use at these scales.

This talk will discuss an approach to running Apache Spark at throughputs on the order of 10 Gb/s while performing non-trivial processing, giving users a feel for Apache Spark's performance under the most demanding conditions. The setup of Apache Spark, the configuration used, and the resource requirements to process at this scale will be discussed. In addition, concrete takeaways will be provided for users who want to push Apache Spark to this scale.

Speakers
MS

Michael Starch

Computer Engineer in Applications, NASA Jet Propulsion Laboratory
Michael Starch has been employed by the Jet Propulsion Laboratory for the past 5 years. His primary responsibilities include engineering big data processing systems for handling scientific data, researching the next generation of big data technologies, and helping infuse these systems into the mission world. He is a committer and PMC member on Apache OODT and has spoken about his work at the Southern California Linux Expo and ApacheCon North America.


Wednesday September 30, 2015 15:30 - 16:20
Petofi

15:30

Introduction to Apache Tajo: Data Warehouse for Big Data - Jihoon Son, Gruter
Apache Tajo is a data warehouse system for Web-scale data. It provides virtual integration of a multitude of diverse data sources, thereby facilitating the easy and rapid data integration that has been regarded as an essential but heavy step in business intelligence. In addition, it has a fault-tolerant distributed query engine for accelerating query speed. With these “query federation” and “distributed processing” capabilities, Tajo can provide users with reliable and efficient analysis of Web-scale data spread across multiple sources.

Jihoon Son will introduce Apache Tajo, including its overall architecture, current state and challenges, and discuss the advantages Tajo can bring to users. In addition, he will give a demo of integrated data analysis with Tajo.

Speakers
JS

Jihoon Son

Software Engineer, Gruter
Dr. Jihoon Son is a distributed systems engineer at Gruter, a Hadoop-based big data infrastructure company in South Korea. He is one of the co-founders of the Apache Tajo project and now works on distributed query processing and query optimization in Tajo. He has spoken at international conferences such as the ACM International Workshop on Data Engineering for Wireless and Mobile Access (MobiDE) and International... Read More →


Wednesday September 30, 2015 15:30 - 16:20
Dery/Mikszath

15:30

Deploying Spark Streaming with Kafka: Gotchas and Performance Analysis - Nishkam Ravi, Cloudera
Apache Spark is an in-memory compute engine that supports real-time data processing through its streaming API. Kafka is a popular publish-subscribe messaging system used for data ingest and distribution. The performance of Spark Streaming with Kafka is not well understood. In this talk, we will discuss the different Spark Streaming APIs that can be used to receive data from Kafka and evaluate their performance for complex event processing. We will also highlight some caveats and the corresponding workarounds for best performance. We find that Spark+Kafka yields high throughput and sub-second latencies for complex events when configured properly.

Speakers
NR

Nishkam Ravi

Software Engineer, Cloudera
Nishkam is a Software Engineer at Cloudera. His current focus is Spark and MapReduce performance. Nishkam got his B.Tech from IIT-Bombay and PhD from Rutgers. His first job was with Intel as a compiler engineer. Prior to joining Cloudera, Nishkam was a Research Staff Member at NEC Labs, where he developed an optimizing compiler for MapReduce. He has presented at numerous peer-reviewed AI and systems conferences in the past. | | Hari is a... Read More →


Wednesday September 30, 2015 15:30 - 16:20
Krudy/Jokai

15:30

Hadoop Backup and Scaling in Hybrid Environment - Pawel Leszczynski, Robert Mroczkowski, Mariusz Strzelecki, Allegro Group
In an event sourcing architecture there is a single source of truth, and Hadoop is the tool that fulfils this role. We use Apache Kafka and the Hermes message bus as a single entry point for events. There is no efficient solution for live backup of data with CRUD operations enabled; however, when handling immutable events, we can back up data live to multiple locations, such as a Hadoop cluster in another data center or any storage provider that supports the S3 API.

Storing exact copies of data in different locations allows us to extend the compute power of a private data center with a public platform provider. Such a hybrid solution benefits from cloud elasticity, so we can easily scale on demand.

In this presentation, architectural design patterns for backup and compute power scaling will be presented. We will also focus on the technical aspects of our architecture, built on top of open source software.

Speakers
avatar for Pawel Leszczynski

Pawel Leszczynski

Hadoop Product Owner, Allegro Group
Paweł holds a PhD in distributed databases and his interests focus on making Big Data easy. He has 7 years of technical experience at Allegro and currently works as Hadoop Product Owner in the Big Data Solutions Team. The team develops and maintains a petabyte-scale Hadoop cluster with endpoints such as Apache Kafka messaging.
avatar for Robert Mroczkowski

Robert Mroczkowski

Senior Data Engineer, Allegro Group
Robert graduated with a Master's in Computer Science from Nicolaus Copernicus University in 2006. From 2006 to 2011 he was a PhD student in Computer Science, researching Computer Science applied to Bioinformatics. He gained Hadoop experience building and maintaining a cluster for Allegro. Every day he works with modern high-performance and highly available technologies, centrally managed in a cloud environment. In 2015 he became Senior Data... Read More →
avatar for Mariusz Strzelecki

Mariusz Strzelecki

Senior Data Engineer, Allegro Group
A software developer with 5+ years of professional experience. Now working as a Senior Data Engineer in Allegro Group, developing tools that support internal Big Data ecosystem and contributing to Open Source.


Wednesday September 30, 2015 15:30 - 16:20
Tas

15:30

Leveraging Arm64 for Big Data Scale Out - Martin Stadtler, Linaro
ARM 64-bit servers are a true implementation of the scale-out architecture and a very good fit for distributed processing frameworks like Hadoop, Spark and big data analytics in general.

The session will provide a summary of the workloads running on ARM servers and the status of AArch64 support in JDK 9, and will describe the setup, build and testing of Hadoop on ARM, the optimizations achieved so far, plans to be a reference citizen in the big data analytics community, collaborations with the ecosystem, and next steps.

Speakers
avatar for Martin Stadtler

Martin Stadtler

Director, Enterprise Group, Linaro
Martin Stadtler leads the Enterprise Group at Linaro.org. He has over 20 years of experience with Open Source in the Enterprise and Telecom fields and is now focused on ARM server adoption.


Wednesday September 30, 2015 15:30 - 16:20
Tohotom

16:00

Apache Geode Clubhouse Community Meeting
This community meeting will also be available online at https://pivotalcommunity.adobeconnect.com/clubhouse/

Wednesday September 30, 2015 16:00 - 17:00
Pivotal Hacker Lounge

16:30

Data Science: A View from the Trenches - Ram Sriharsha, Hortonworks and Vinay Shukla, Hortonworks
At Hortonworks, we have been working closely with a group of customers to onboard Data Science and Predictive Analytics applications on the Hortonworks Data Platform using Spark. In this talk, we discuss the use cases and explore some of the challenges and solutions that arise in building real-world predictive analytics systems on top of Spark. These include useful tricks for memory savings, tradeoffs in the choice of learning and feature engineering algorithms, tips for avoiding performance bottlenecks, methods for assessing and visualizing accuracy, methods for providing confidence estimates for predictions, and methods for managing retraining and deployment of trained models to production.

Speakers
VS

Vinay Shukla

Director of Product Management, Hortonworks
Vinay Shukla is the Director of Product Management for Spark and Data Science at Hortonworks. Vinay is a veteran of enterprise software. Previously, Vinay has worked as Product Manager, Developer, and Security Architect. When not in front of a computer, Vinay enjoys being on a Yoga mat or on a hiking trail.
RS

Ram Sriharsha

Senior Member of Technical Staff, Hortonworks
Ram is currently Product Manager for Apache Spark at Databricks. Prior to joining Databricks, he was Principal Research Scientist at Yahoo Research, where he worked on large-scale machine learning algorithms and systems related to login risk detection, sponsored search advertising and advertising effectiveness measurement. | Prior talks include talks at ApacheCon BigData 2015 and Spark Summit... Read More →


Wednesday September 30, 2015 16:30 - 17:20
Huba

16:30

How Cognitive Computing is Changing Data Science for the Better - Michael Ludden, IBM
During this session, attendees will learn about the emerging field of Cognitive Computing and how it can assist humans in crunching through big data sets, ultimately making deep insights accessible to the next generation of app developers without requiring massive resources.

Speakers
ML

Michael Ludden

Michael is an IBMer in Developer Relations at Watson. Previously, Michael was Developer Marketing Manager Lead at Google, Head of Developer Marketing at Samsung, a Developer Evangelist at HTC, Global Director of Developer Relations at startups Quixey & Nexmo, and was involved at various times in development, product marketing, co-founding startups, tech show hosting, and even cruise-ship singing (don’t ask). Michael has a degree from... Read More →


Wednesday September 30, 2015 16:30 - 17:20
Dery/Mikszath

16:30

Apache Ignite: The Journey from Incubation to Graduation - Konstantin Boudnik, WANdisco & Roman Shaposhnik, Pivotal
Following up on the recent graduation of Apache Ignite to a TLP, Konstantin and Roman will talk about the Apache incubation process from a mentor's point of view. What were the challenges and bumps in helping people adopt and appreciate the "Apache Way"? Was it worthwhile to bring a successful commercial platform to open source, and what were the motivations? What could have been done differently? How did the open source model change the mindset of the project's contributors and customers?

Speakers
avatar for Konstantin Boudnik

Konstantin Boudnik

CEO, Memcore
Dr. Konstantin Boudnik, co-founder and CEO of Memcore Inc, is one of the early developers of Hadoop and a co-author of Apache Bigtop, the open source framework and community around the creation of software stacks for data processing projects. With more than 20 years of experience in software development, big and fast data analytics, Git, distributed systems and more, Dr. Boudnik has authored 16 US patents in distributed computing. Dr. Boudnik... Read More →


Wednesday September 30, 2015 16:30 - 17:20
Petofi

16:30

Near Real Time Indexing Kafka Messages to Apache Blur using Spark Streaming - Dibyendu Bhattacharya, Pearson North America
Pearson is building a next generation adaptive learning platform whose near real time architecture is powered by Kafka and Spark Streaming. Pearson is also building a search infrastructure to index various learner data into Apache Blur, a Lucene-based distributed search solution on Hadoop. To support NRT indexing into Apache Blur, Pearson has designed a fault-tolerant and reliable low-level Kafka consumer for Spark Streaming. This talk will cover why Pearson chose Apache Blur and how it designed this Kafka consumer for Spark, which enables NRT indexing into Blur. The talk will also cover the implementation details of the Spark-to-Blur connector for doing bulk indexing into Apache Blur using the Spark Hadoop API. The Spark-Blur connector has been contributed to the Apache Blur project (http://bit.ly/1HVWk7G) and the Kafka-Spark consumer has been contributed to spark-packages (http://bit.ly/1PRNNtM).

Speakers
avatar for Dibyendu Bhattacharya

Dibyendu Bhattacharya

Big Data Architect, Pearson North America
Dibyendu holds an MS in Software Systems and a B.Tech in Computer Science, with experience in building applications and products leveraging distributed computing and big data technologies. He works as Big Data Architect at Pearson, building an adaptive learning platform to capture behavioral data across Pearson learning applications, which will help build student analytics across products and institution boundaries that deliver efficacy insights. Invited as... Read More →



Wednesday September 30, 2015 16:30 - 17:20
Krudy/Jokai

16:30

How to Transform Data into Money Using Big Data Technologies - Jorge Lopez-Malla, Stratio
We are used to hearing that we live in the Age of Data, but we have to face the truth: we live in the Age of “Big Data”. Companies are starting to realize that traditional technologies are not enough to accomplish their usual tasks with the massive amount of information we generate every day. Big Data processes are not as brand new as people think. Nonetheless, what developers and companies alike are not used to seeing is getting real value out of their own data.

To illustrate this, we will show a successful use case in which, using Apache Spark, HDFS and Apache Parquet, a Middle East telco company was able not only to start a new business line, extracting valuable information from its own data for third parties, but also to improve its network coverage through the analysis of that data.

Speakers
avatar for Jorge Lopez-Malla

Jorge Lopez-Malla

Jorge has been involved in the inception and implementation of projects in several fields such as digital media, telcos, banks and insurance companies. He is in charge of Stratio’s Big Data training, having been one of the first engineers to become Spark certified. | Previous speaking experience: Spark for Hadoop users. Spark for intermediates. How to make your Spark jobs fly. | | We at Stratio have been working with Big Data for... Read More →


Wednesday September 30, 2015 16:30 - 17:20
Tas

18:00