We present a Distributed Parallel GBM. GBM is a state-of-the-art Machine Learning algorithm used to win many recent Kaggle competitions. It is well known both for its high-quality results and for being very difficult to parallelize. H2O is 0xdata's high-performance parallel distributed Math platform – single-node performance often meets or exceeds C or Fortran codes, and it runs scale-out as well.
We’ll lightly cover the math of GBM itself, then look at the details.
GBM requires tree-building, and that in turn requires building distributed histograms on big data, which in turn has interesting memory, CPU and network implications. For instance, compressing histograms before they cross the network beats the brute-force approach of shipping them raw.
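To make the distributed-histogram idea concrete, here is a minimal Java sketch (not H2O's actual implementation; the class and field names are illustrative): each node bins its local rows into a fixed-size histogram, then histograms are merged – a network reduce in the real system – before split points are chosen. Only the small bin arrays cross the wire, never the raw rows.

```java
// Illustrative sketch of per-node histogram build-and-merge for
// distributed tree building; names are hypothetical, not H2O API.
public class SplitHistogram {
    final double min, max;   // feature value range, known up front
    final long[] counts;     // rows falling in each bin
    final double[] sums;     // sum of the response per bin

    SplitHistogram(double min, double max, int nbins) {
        this.min = min; this.max = max;
        counts = new long[nbins];
        sums   = new double[nbins];
    }

    // Bin one local row: feature value x, response y.
    void add(double x, double y) {
        int b = (int) ((x - min) / (max - min) * counts.length);
        if (b == counts.length) b--;   // x == max lands in the last bin
        counts[b]++;
        sums[b] += y;
    }

    // Merge a histogram computed on another node. This is the only
    // data that has to move across the network.
    void merge(SplitHistogram other) {
        for (int i = 0; i < counts.length; i++) {
            counts[i] += other.counts[i];
            sums[i]   += other.sums[i];
        }
    }
}
```

After the merge, every node holds identical global histograms, so each can pick the same best split without any further communication.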
This talk will demo building and scoring a GBM model on a 100-million-plus-row dataset on a large cluster. We will also cover some GBM concepts, distributed tree-building concepts, and some discussion of programming for distributed computation in general.
H2O is a pure-Java open-source product; the source code is available on GitHub.
Cliff Click is the CTO and Co-Founder of 0xdata, a firm dedicated to creating a new way to think about web-scale math and real-time analytics. He wrote his first compiler at age 15 (Pascal to TRS Z-80!), although his best-known compiler is the HotSpot Server Compiler (the Sea of Nodes IR). He helped Azul Systems build an 864-core pure-Java mainframe that keeps GC pauses on 500 GB heaps under 10 ms, and worked on all aspects of that JVM. Before that he worked on HotSpot at Sun Microsystems, and is at least partially responsible for bringing Java into the mainstream.
He is regularly invited to speak at industry and academic conferences and has published many papers about HotSpot technology. He holds a PhD in Computer Science from Rice University and about 15 patents.