Driving Data Decisions with Real-time Analytics

Vijay Agneeswaran (Walmart Labs), Pranay Tonpay (Impetus)
Hardcore Data Science Gramercy Suite
Average rating: 2.71 (7 ratings)
Slides:   1-PPTX 

We live in an “instant” world and are already very good at predicting “when” an event occurs.

Examples:
A potential customer is browsing your website now.
A competitor just reduced the price of your best-selling item.
Stock prices are dwindling due to market issues.

Merely knowing about an event has very little significance unless we “act and react” to it instantaneously. Real-time analytics can help in proactively monitoring and analyzing events and data on the fly.

Real-time processing (3 steps):
1. Data relevant to real time is collected [this is already happening “instantaneously”, as mentioned above].
2. Data is analyzed.
3. A decision is made using the analysis.

Steps 2 and 3 should happen in real time (or near real time, as it is called for practical reasons), which makes this harder than batch analytics, where some latency can be tolerated.
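The three steps above can be sketched as a small micro-batch loop. This is only an illustration in plain Python, not the Spark Streaming code from the talk; the competitor-price scenario, the function names, the window size, and the 5% tolerance are all assumptions made for the example.

```python
from collections import deque
from statistics import mean

def micro_batches(events, batch_size):
    """Step 1: group an incoming event stream into small batches."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def analyze(batch, window):
    """Step 2: fold the batch into a sliding window and return its mean."""
    window.extend(batch)  # deque(maxlen=...) drops the oldest values
    return mean(window)

def decide(avg_price, our_price, tolerance=0.05):
    """Step 3: act if competitors undercut us by more than `tolerance`."""
    if avg_price < our_price * (1 - tolerance):
        return "reprice"
    return "hold"

def run(events, our_price, batch_size=2, window_size=4):
    window = deque(maxlen=window_size)
    decisions = []
    for batch in micro_batches(events, batch_size):
        decisions.append(decide(analyze(batch, window), our_price))
    return decisions

# Hypothetical stream of competitor prices for one item.
competitor_prices = [100.0, 101.0, 99.0, 98.0, 90.0, 88.0, 85.0, 84.0]
print(run(competitor_prices, our_price=100.0))
# prints ['hold', 'hold', 'reprice', 'reprice']
```

The analysis and the decision both run once per micro-batch, so the latency of a decision is bounded by the batch interval plus the processing time; this is the "near real time" behavior described above.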

Given the importance of real-time analytics, we have implemented several machine learning algorithms over Spark Streaming to allow real-time processing: Naïve Bayes and Logistic Regression for classification, and k-means for clustering. We have also implemented PMML support for some of these algorithms, so that models can be shared between PMML-compliant applications. PMML support for ML algorithms in Spark provides a very flexible way to import a model into Spark and evaluate its performance.
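To give a feel for what running a clustering algorithm over a stream involves, here is a minimal sketch of a sequential (online) k-means update in plain Python. It is not the implementation presented in the talk and it does not use any Spark API; the function names, initial centers, and sample points are assumptions chosen for illustration.

```python
import math

def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def update_centers(centers, counts, batch):
    """Fold one micro-batch into the running cluster centers.

    Each center is maintained as a running mean of the points assigned
    to it, so earlier points never need to be stored or revisited.
    """
    for point in batch:
        i = nearest(point, centers)
        counts[i] += 1
        step = 1.0 / counts[i]
        centers[i] = [c + step * (p - c) for c, p in zip(centers[i], point)]
    return centers, counts

centers = [[0.0, 0.0], [10.0, 10.0]]  # hypothetical initial centers
counts = [1, 1]                       # each initial center counts as one point
stream = [
    [[1.0, 1.0], [9.0, 9.0]],         # micro-batch 1
    [[2.0, 0.0], [11.0, 10.0]],       # micro-batch 2
]
for batch in stream:
    centers, counts = update_centers(centers, counts, batch)
print(centers, counts)
```

Because the model is updated per batch rather than retrained from scratch, the cost of each update depends only on the batch size, which is what makes this style of algorithm a natural fit for streaming.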

We chose Spark because it is designed for a specific type of cluster-computing workload: one that reuses a working set of data across parallel operations. Machine learning implementations are perfect examples of such workloads. For these use cases, Spark outperforms Hadoop by providing primitives for in-memory cluster computing, thereby avoiding the I/O bottleneck between the individual jobs of an iterative MapReduce workflow that repeatedly performs computations on the same working set. As a result, we are able to run machine learning algorithms very fast and provide a path to real-time analytics.
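The benefit of keeping the working set in memory can be illustrated without a cluster. The sketch below is plain Python, not Spark or Hadoop code: `Source` is a hypothetical stand-in for slow storage that counts its reads, and the toy data set and learning rate are assumptions. It trains the same 1-D logistic regression twice, once reloading the data on every iteration and once loading it a single time.

```python
import math

class Source:
    """Stand-in for slow storage; counts how many times it is read."""
    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._rows)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(w, b, rows, lr):
    """One pass of batch gradient descent for 1-D logistic regression."""
    gw = gb = 0.0
    for x, y in rows:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    n = len(rows)
    return w - lr * gw / n, b - lr * gb / n

def train(source, iterations, cache, lr=0.5):
    w = b = 0.0
    cached = source.load() if cache else None  # load the working set once
    for _ in range(iterations):
        rows = cached if cache else source.load()  # or re-read every pass
        w, b = gradient_step(w, b, rows, lr)
    return w, b

rows = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # toy (feature, label) pairs
reloading, cached = Source(rows), Source(rows)
model_a = train(reloading, iterations=50, cache=False)
model_b = train(cached, iterations=50, cache=True)
print(reloading.reads, cached.reads)  # prints 50 1
```

Both runs produce the same model; only the number of reads differs. In an iterative MapReduce workflow each of those reads goes through disk, which is the I/O bottleneck the paragraph above refers to.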

The takeaways will include:

  1. Use cases for real-time analytics – illustration of web traffic analysis and manufacturing use cases.
  2. Discussion of the throughput vs. accuracy trade-off – challenges of real-time analytics.
  3. Implementation of ML algorithms over Spark streaming – code snippets.
  4. PMML support for ML algorithms – traditional R/SAS models can be run over Spark in real-time.
  5. Performance comparison of ML algorithms over Spark streaming and over Storm.
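As a rough illustration of takeaway 4, the sketch below reads a logistic regression model from a PMML fragment and scores one record in plain Python. It is not the PMML support described in the talk. The fragment is a stripped-down assumption (a complete PMML document also carries a header, data dictionary, and mining schema), the field names `visits` and `minutes` are invented, and only the table for one target category is evaluated.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PMML fragment exported by another tool.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <RegressionModel functionName="classification" normalizationMethod="logit">
    <RegressionTable intercept="-1.0" targetCategory="buy">
      <NumericPredictor name="visits" coefficient="0.5"/>
      <NumericPredictor name="minutes" coefficient="0.1"/>
    </RegressionTable>
    <RegressionTable intercept="0.0" targetCategory="leave"/>
  </RegressionModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_1"}

def load_logistic(pmml_text, target):
    """Return (intercept, {field: coefficient}) for one target category."""
    root = ET.fromstring(pmml_text)
    for table in root.iterfind("pmml:RegressionModel/pmml:RegressionTable", NS):
        if table.get("targetCategory") == target:
            coefs = {p.get("name"): float(p.get("coefficient"))
                     for p in table.iterfind("pmml:NumericPredictor", NS)}
            return float(table.get("intercept")), coefs
    raise ValueError(f"no RegressionTable for target {target!r}")

def score(intercept, coefs, record):
    """Probability of the target category under a logit link."""
    z = intercept + sum(c * record[name] for name, c in coefs.items())
    return 1.0 / (1.0 + math.exp(-z))

intercept, coefs = load_logistic(PMML_DOC, "buy")
p = score(intercept, coefs, {"visits": 2.0, "minutes": 0.0})
print(p)  # z = -1.0 + 0.5*2.0 + 0.1*0.0 = 0.0, so this prints 0.5
```

The point of the exercise is that the model parameters travel as data: a model fitted in R or SAS and exported as PMML can be scored by any engine that reads the same format.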

Vijay Agneeswaran

Walmart Labs

Dr. Vijay Srinivas Agneeswaran has a Bachelor’s degree in Computer Science & Engineering from SVCE, Madras University (1998), an MS (By Research) from IIT Madras (2001), and a PhD from IIT Madras (2008). He was a post-doctoral research fellow in the LSIR Labs, Swiss Federal Institute of Technology, Lausanne (EPFL) for a year. He did an internship at Siemens Corporate Research in Bangalore and spent three years at Oracle, another product development company. He subsequently spent a year as principal architect with GTO, the research arm of Cognizant in Chennai, where he led the Extreme Processing group within the High Performance Computing Centre of Excellence and created intellectual property in the Big-Data space. He has now taken up the position of Director Technology/Principal Architect and head of Big-Data R&D at Impetus. He has been a professional member of the ACM and the IEEE for the last 7+ years. He has filed patents with the US and European patent offices (with one accepted US patent) and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems (cloud, grid, and peer-to-peer computing) as well as machine learning for Big-Data and other emerging technologies.

Pranay Tonpay

Impetus

Pranay works as a Senior Architect at Impetus.

Contact Us

View a complete list of Strata + Hadoop World 2013 contacts