Monday, September 28 • 15:00 - 15:50
IPython Notebook as a Unified Data Science Interface for Hadoop - Casey Stella, Hortonworks

Data science on Hadoop can be a daunting journey because it generally spans multiple tools and interfaces. Furthermore, although people are doing data science, worked examples are few and far between.

As part of the Social Security Act, the Centers for Medicare & Medicaid Services has begun to publish data detailing the relationship between physicians and medical institutions. This data has been analyzed cursorily in the press, but an in-depth outlier and Benford's law analysis hasn't been attempted (to my knowledge).

I will present an example of using Apache Spark and Hive on Hadoop to do the above analysis without leaving the IPython notebook. This should motivate IPython and the Python bindings of Spark as a fantastic environment for data science.
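
For readers unfamiliar with Benford's law, here is a minimal sketch of the first-digit check it implies. This is not the code from the talk: it uses plain Python rather than Spark or Hive, the function names are hypothetical, and the payment amounts are invented for illustration. Under Benford's law, the expected share of values with leading digit d is log10(1 + 1/d).

```python
import math
from collections import Counter


def benford_expected(digit):
    """Expected share of values with leading digit `digit` (1-9) under Benford's law."""
    return math.log10(1 + 1 / digit)


def leading_digit(value):
    """First significant digit of a nonzero number."""
    # Scientific notation puts the first significant digit first, e.g. '3.4...e+02'.
    return int(f"{abs(value):.15e}"[0])


def first_digit_shares(values):
    """Observed share of each leading digit 1-9 among the nonzero values."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one nonzero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def benford_deviation(values):
    """Sum of absolute gaps between observed and expected shares (0 means a perfect fit)."""
    observed = first_digit_shares(values)
    return sum(abs(observed[d] - benford_expected(d)) for d in range(1, 10))


# Hypothetical payment amounts, invented purely for illustration.
sample_payments = [12.50, 130.00, 1999.99, 18.75, 245.10, 31.00, 1.20, 960.00]
print(first_digit_shares(sample_payments))
print(round(benford_deviation(sample_payments), 3))
```

A large deviation on a real dataset would only flag a group of payments for closer inspection; it is not evidence of anything on its own, and a sample this small says nothing either way.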


Casey Stella

Principal Architect, Hortonworks
I am a principal architect focusing on Data Science in the consulting organization at Hortonworks. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle and as a Research Geophysicist in the Oil & Gas industry. Before that, I was a poor graduate student in Math at Texas A&M. I primarily work with the Apache Hadoop software stack.

Monday September 28, 2015 15:00 - 15:50

