Office Hour with Paco Nathan

Expo Hall (Table D)

How to build Enterprise data workflows for Apache Hadoop with Cascading (Java), Cascalog (Clojure), or Scalding (Scala), with best practices for simple, robust apps that run efficiently in parallel at scale, including techniques for test-driven development with Big Data. We can review examples from several open source apps in Java, Clojure, and Scala.

Using ANSI SQL and PMML to migrate Enterprise apps with predictive models from SAS, R, etc., to run cost-effectively at scale on Apache Hadoop clusters — for example, anti-fraud classifiers based on Random Forest or Logistic Regression, which can be built using tools such as RStudio, then exported to run on very large-scale customer 360 data sets. We can look at a variety of models in R and compare their operation on Hadoop.

Paco Nathan

derwen.ai

Data Scientist for Concurrent in SF and a committer on the Cascading open source project. 10+ years leading innovative Data teams, 25+ years in the tech industry overall. Background in math/stats and distributed computing. Expertise in Hadoop, R, AWS, predictive analytics, machine learning, and NLP.
