Data Science
Monday, September 28


Data Science in the Travel Industry: Real-World Experience with Current Leading Frameworks - Paul Balm, Amadeus IT Group
Amadeus IT Group is a leading IT provider in the travel industry, processing 525 million bookings per year and boarding 700 million passengers on its airline IT systems. The Travel Intelligence Unit was formed in 2013 with the objective of organizing the travel information of the world. Amadeus Travel Intelligence leverages big data to help all parties in the travel industry make quicker and more effective decisions.
We will show how Amadeus Travel Intelligence employs open source technologies to achieve its objective: processing based on Hadoop; visualization layer based on Ruby-on-Rails and HTML5; streaming based on Spark and Flink; and API level access through web-services. We will review typical project requirements and our experiences, such as the pitfalls of immature projects, missing functionalities, and communities that have moved on.

Paul Balm

Data Scientist, Amadeus IT Group
Paul Balm joined Amadeus as a Data Scientist in September 2014. Before joining the Travel Intelligence unit at Amadeus, he had worked on data processing systems for the European Space Agency since 2005. Paul holds a Ph.D. in particle physics from Fermi National Accelerator Laboratory...

Monday September 28, 2015 10:30 - 11:20


Data Science Lifecycle with Apache Zeppelin (incubating) - Moon soo Lee, NFLabs and Alexander Bezzubov
Apache Zeppelin (incubating) is an interactive data analytics environment for distributed data processing systems. It provides a beautiful interactive web-based interface, data visualization, a collaborative work environment and many other nice features to make your data analytics more fun and enjoyable. Moon soo Lee will demo Zeppelin's features to show how it supports the data science lifecycle.

Zeppelin provides a pluggable architecture for backend integration, visualization, and notebook persistence storage.
This presentation will describe how this pluggable architecture works and how your project can leverage it.

We will also discuss the future roadmap.


Alexander Bezzubov

Software Engineer, NFLabs
Alexander Bezzubov is an Apache Zeppelin contributor, PMC member and software engineer at NFLabs. His previous speaking experience includes Apache BigData NA 2016 in Vancouver, FOSSASIA 2016 in Singapore, and Apache BigData EU 2015 in Budapest.

Moon soo Lee

CTO, NFLabs
Moon soo Lee is the creator of Apache Zeppelin and a Co-Founder and CTO at NFLabs. For the past few years he has been working on bootstrapping the Zeppelin project and its community. His recent focus is growing the Zeppelin community and driving adoption.

Monday September 28, 2015 11:30 - 12:20


Catch Them in the Act: Fraud Detection in Real-Time - Seshika Fernando, WSO2
Fraud is getting more complex and dangerous every minute, with fraudsters countering anti-fraud measures through technology and advanced statistical models. Conversely, overprotective fraud solutions are driving customers away. Finding the right level of fraud prevention is more an art than a science. As data scientists our duty is not to master the art, but to enable our customers to draw this fine line in a simple yet effective manner.

In this session, Seshika will take you through
• How to detect anomalies in real time using Complex Event Processing
• Why Markov Modelling is great for detecting rare activity sequences
• How Scoring Functions can be used to reduce False Positives
• How Machine Learning can be used to intensify fraud detection
• What visualizations will enable Analysts to crack down further on relationships in large fraud rings
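
As a rough illustration of the Markov modelling point above, the sketch below fits a first-order Markov chain to historical event sequences and scores a new sequence by its average log transition probability, so that unusual sequences get low scores. It is not taken from the session: the event names, the smoothing constant and the helper names `fit_transitions` and `avg_log_likelihood` are all assumptions made for the example.

```python
import math
from collections import defaultdict


def fit_transitions(sequences, smoothing=1.0):
    """Estimate first-order transition probabilities with additive smoothing."""
    states = sorted({s for seq in sequences for s in seq})
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a in states:
        total = sum(counts[a].values()) + smoothing * len(states)
        probs[a] = {b: (counts[a][b] + smoothing) / total for b in states}
    return probs


def avg_log_likelihood(seq, probs, floor=1e-6):
    """Average log probability per transition; lower means a rarer sequence."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return 0.0
    # Transitions involving states never seen in training fall back to `floor`.
    total = sum(math.log(probs.get(a, {}).get(b, floor)) for a, b in pairs)
    return total / len(pairs)


# Hypothetical activity history: mostly ordinary sessions, a few address changes.
history = (
    [["login", "browse", "pay", "logout"]] * 40
    + [["login", "browse", "browse", "logout"]] * 10
    + [["login", "change_address", "logout"]] * 2
)
model = fit_transitions(history)

typical_score = avg_log_likelihood(["login", "browse", "pay", "logout"], model)
rare_score = avg_log_likelihood(["login", "change_address", "pay", "pay"], model)
```

In a setup like this, a sequence would be flagged when its score falls below a threshold chosen from historical data; where to put that threshold is exactly the trade-off between missed fraud and false positives that the abstract describes.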

Seshika Fernando

Senior Technical Lead, WSO2
Seshika is a Senior Technical Lead at WSO2 and focuses on the applications of WSO2’s middleware platform in Financial Markets. Throughout her career, she has had extensive experience in providing technology for Stock Exchanges, Regulators and Investment Banks from across the globe...

Monday September 28, 2015 14:00 - 14:50


IPython Notebook as a Unified Data Science Interface for Hadoop - Casey Stella, Hortonworks
Data Science on Hadoop can be a daunting journey as you generally are spanning multiple tools and different interfaces. Furthermore, while there are people out there doing data science, worked examples are few and far between.

As part of the Social Security Act, the Centers for Medicare and Medicaid Services has begun to publish data detailing the relationship between physicians and medical institutions. This data has been analyzed cursorily in the press, but an in-depth outlier and Benford's law analysis hasn't been attempted (to my knowledge).

I will present an example of using Apache Spark and Hive on Hadoop to do the above analysis without leaving IPython notebook. This should motivate IPython and the Python bindings of Spark as a fantastic environment to do data science.
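
For readers unfamiliar with the Benford's law analysis mentioned above: the law predicts that the leading digit d of many naturally occurring amounts appears with frequency log10(1 + 1/d). The following is a minimal plain-Python sketch of such a check, not the speaker's Spark and Hive implementation; the function names and the sample amounts are made up for illustration.

```python
import math
from collections import Counter


def benford_expected(d):
    """Expected share of values whose leading digit is d (1-9) under Benford's law."""
    return math.log10(1 + 1 / d)


def leading_digit(x):
    """First significant digit of a nonzero number, read off scientific notation."""
    return int(f"{abs(x):e}"[0])


def benford_chi_square(values):
    """Chi-square statistic comparing observed leading digits with Benford's law."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * benford_expected(d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat


# Hypothetical inputs: powers of two track Benford's law closely,
# while amounts clustered just above 900 clearly do not.
benford_like = [2 ** k for k in range(1, 200)]
clustered = [900 + i for i in range(99)]

stat_benford_like = benford_chi_square(benford_like)
stat_clustered = benford_chi_square(clustered)
```

With nine digit classes the statistic has 8 degrees of freedom, so values far above the 5% critical value of about 15.5 indicate a digit distribution that departs from Benford's law and may deserve a closer look.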


Casey Stella

Principal Architect, Hortonworks
I am a principal architect focusing on Data Science in the consulting organization at Hortonworks. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle and as a Research Geophysicist...

Monday September 28, 2015 15:00 - 15:50


R as a Language For Big Data Analytics - Andrie de Vries, Microsoft
R is the language of data science, used by more than 2 million statisticians, data scientists and quantitative analysts around the world.

Many projects and companies have implemented libraries and solutions to make R available to the data scientist working with big data in Hadoop. Foremost among these is Revolution Analytics / Microsoft, which sponsored the popular RHadoop project.

In this talk, I'll present
- A high level overview of R: its history, capabilities and community
- An introduction to predictive analytics, and some applications from industry, especially some examples from inside Microsoft
- Connections between R and big-data platforms including Hadoop and Spark
- The Revolution R Open and Revolution R Enterprise distributions, and the unique capabilities they bring to R.
- Using R in the Azure cloud and (coming soon) within the SQL Server 2016 database.

Andrie de Vries

Senior Programme Manager, Microsoft
Andrie is a senior programme manager at Microsoft, responsible for community projects and evangelization of Microsoft's contribution in Europe to the open source R language. He is co-author of the very popular title "R for Dummies" and a top contributor to the Q&A website, StackOverflow...

Monday September 28, 2015 16:00 - 16:50

Wednesday, September 30


SAMOA: A Platform for Mining Big Data Streams - Nicolas Kourtellis, Telefonica I+D, Barcelona
In this talk, Nicolas Kourtellis will introduce Apache SAMOA (Scalable Advanced Massive Online Analysis), an open-source platform for mining big data streams (http://samoa.incubator.apache.org). Apache SAMOA provides a collection of distributed streaming algorithms for data mining tasks such as classification, regression, and clustering. The models built can be updated as new data arrive without the need to define data batches or update frequencies. The platform features a pluggable architecture that can run on existing and well-tested distributed stream processing engines such as Storm, S4, Samza and Flink, for scalability and fault tolerance.

Nicolas Kourtellis

Researcher, Telefonica I+D
Nicolas Kourtellis is a Researcher at Telefonica Research. Previously he was a Researcher in the Web Mining Research Group at Yahoo Labs, Barcelona. He holds a Ph.D. in Computer Science and Engineering from the University of South Florida (2012), a MSc in Computer Science from the...

Wednesday September 30, 2015 12:00 - 12:50


Implementing a Highly Scalable In-Memory Stock Prediction System with Apache Geode (incubating), R and Spring XD - William Markito Oliveira, Pivotal and Fred Melo, Pivotal
Financial market prediction has always been one of the hottest topics in Data Science and Machine Learning. However, the prediction algorithm is just a small piece of the puzzle. Building a data stream pipeline that constantly combines the latest price info with high-volume historical data is extremely challenging on traditional platforms, requiring a lot of code and careful thought about how to scale or move into the cloud.

This session will walk through the architecture and implementation details of an application built on top of open-source tools, demonstrating how to easily build a stock prediction solution with almost no source code, except a few lines of R and a UI based on JavaFX. It uses Apache Geode for fast data and real-time notifications, combining streaming and distributed processing for stock indicator algorithms.

William Markito Oliveira

Enterprise Architect, Red Hat
William Markito Oliveira is a solution architect of enterprise applications with a focus on system integration and highly distributed systems. He has extensive Java platform experience and solid skills in development and architecture of SOA, Big Data, EAI, and web services-based applications...

Wednesday September 30, 2015 14:30 - 15:20


Using Natural Language Processing on Non-Textual Data with MLLib - Casey Stella, Hortonworks
Natural language processing techniques are well established due to their obvious utility. Further, the rise in unstructured textual data has led to mature, distributed and scalable implementations. While textual data is extremely common, there is also apparently unstructured data that has underlying structure, in the same way that the words composing a sentence have an underlying grammatical structure. This talk explores borrowing some natural language processing techniques to analyze the structure in non-textual data.

In particular, we consider the Word2Vec implementation in MLlib to help us organize and analyze non-textual clinical event data (i.e., diagnoses, drugs prescribed, etc.). We will explore connections between diseases and drugs in an unsupervised way with Python, Spark and MLlib.


Casey Stella

Principal Architect, Hortonworks
I am a principal architect focusing on Data Science in the consulting organization at Hortonworks. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle and as a Research Geophysicist...

Wednesday September 30, 2015 14:30 - 15:20


Data Science: A View from the Trenches - Ram Sriharsha, Hortonworks and Vinay Shukla, Hortonworks
At Hortonworks, we have been working closely with a group of customers to onboard Data Science and Predictive Analytics applications on the Hortonworks Data Platform using Spark. In this talk, we discuss the use cases and explore some of the challenges and solutions that arise in building real-world predictive analytics systems on top of Spark. These include useful tricks for memory savings, tradeoffs in the choice of learning and feature engineering algorithms, tips and tricks to avoid performance bottlenecks, methods for assessing and visualizing accuracy, methods for providing confidence estimates for predictions, and methods for managing retraining and deployment of trained models to production.


Vinay Shukla

Director of Product Management, Hortonworks
Vinay Shukla is the Director of Product Management for Spark and Data Science at Hortonworks. Vinay is a veteran of enterprise software. Previously, Vinay has worked as Product Manager, Developer, and Security Architect. When not in front of a computer, Vinay enjoys being on a Yoga...

Ram Sriharsha

Senior Member of Technical Staff, Hortonworks
Ram is currently Product Manager for Apache Spark at Databricks. Prior to joining Databricks, he was Principal Research Scientist at Yahoo Research where he worked on large scale machine learning algorithms and systems related to login risk detection, sponsored search advertising...

Wednesday September 30, 2015 16:30 - 17:20


How Cognitive Computing is Changing Data Science for the Better - Michael Ludden, IBM
During this session, attendees will learn about the emerging field of Cognitive Computing and how it can assist humans in crunching through big data sets, ultimately making deep insights accessible to the next generation of app developers without requiring massive resources.

Michael Ludden

IBM Watson Developer Labs Program Director, IBM
Michael Ludden is the IBM Watson Developer Labs Program Director and Senior Product Manager. Previously, Michael was Lead Developer Marketing Manager at Google, Head of Developer Marketing at Samsung, a Developer Evangelist at HTC, Global Director of Developer Relations at startups...

Wednesday September 30, 2015 16:30 - 17:20