Wednesday, September 30 • 14:30 - 15:20
Faster ETL Workflows Using Apache Pig & Spark - Praveen Rachabattuni, Sigmoid Analytics

Pig on Spark aims to combine the simplicity of Pig with the faster Spark execution engine, making Pig more attractive to developers. Currently, with the help of the Apache Software Foundation, various contributors are working on the project toward a release-quality build. With Pig on Spark, a significant performance benefit has been observed in ETL workflows that already run on MapReduce. Our initial benchmarks have shown a 2x-5x improvement over MapReduce. For one benchmarking test, we considered the ‘distinct’ operation. We used 25 days of the wikistats dump, 270 GB in size, on a cluster of one master and four worker machines (16 cores and 64 GB RAM each). The job took about 14 minutes with Pig on Spark, compared to about 30 minutes on MapReduce. In this talk, Praveen will share the progress of the project with the community and help people take advantage of Pig on Spark in their workflows.

Speakers
Praveen Rachabattuni

Technical Team Lead, Sigmoid Analytics
Praveen Rachabattuni is a technical team lead at Sigmoid Analytics. His areas of expertise include real-time big data analytics using open source technologies such as Apache Spark, Shark, and Pig on Spark. He is a committer on the Apache Pig project and contributes to Pig on Spark. He has also worked on building JSON APIs for Spark task data, consumable by custom dashboards or tools.


Wednesday September 30, 2015 14:30 - 15:20
Dery/Mikszath
