In the age of big data, there has been unprecedented growth in the amount of data available for analysis, but handling unstructured and semistructured data is a challenging task that prompts organizations to discard a substantial amount of data.
Artificial neural networks (ANNs) have been successfully used for imposing structure over unstructured data, by means of unsupervised feature extraction and nonlinear pattern detection. Restricted Boltzmann machines (RBMs), for example, have been shown to have a wide range of applications in this context: they can be used as generative models for dimensionality reduction, classification, collaborative filtering, extraction of semantic document representation, and more. RBMs are also used as building blocks for the multilayer learning architecture of deep belief networks.
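For readers new to the model, the standard formulation of a binary RBM can be written down as follows. The abstract does not define it, so the notation is an assumption introduced here for illustration: v and h are the visible and hidden unit vectors, W is the weight matrix, a and b are the visible and hidden biases, and σ is the logistic function.

```latex
% Energy of a joint configuration (v, h) of a binary RBM
E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^{\top}\mathbf{v} - \mathbf{b}^{\top}\mathbf{h} - \mathbf{v}^{\top} W \mathbf{h}

% Joint distribution, with Z the partition function (sum over all configurations)
P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\bigl(-E(\mathbf{v}, \mathbf{h})\bigr)

% Conditionals factorize because there are no intra-layer connections
P(h_j = 1 \mid \mathbf{v}) = \sigma\Bigl(b_j + \sum_i v_i W_{ij}\Bigr), \qquad
P(v_i = 1 \mid \mathbf{h}) = \sigma\Bigl(a_i + \sum_j W_{ij} h_j\Bigr)
```

The hidden-unit probabilities P(h_j = 1 | v) are what is typically used as the extracted feature representation of an input v.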
Training RBMs against a big dataset, however, is problematic. With millions or even billions of parameters, parameter estimation for a conventional, nonparallelized RBM can take weeks. In addition, fitting the model on a single machine caps the available memory and compute, which further limits scalability.
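To make the cost of parameter estimation concrete, here is a minimal sketch of a sequential RBM trained with one-step contrastive divergence (CD-1), a common training procedure for RBMs. It is not the implementation discussed in this session: the class name `TinyRBM`, its methods, the learning rate, and the toy data are all hypothetical and chosen only for illustration. Every update touches each visible-hidden weight once per training example, which is why the work grows quickly with model and dataset size.

```python
import math
import random


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Illustrative binary RBM trained with CD-1 (hypothetical, not the talk's code)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        # W[i][j] connects visible unit i to hidden unit j.
        self.W = [[self.rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.a = [0.0] * n_visible  # visible biases
        self.b = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        """P(h_j = 1 | v) for every hidden unit; usable as extracted features."""
        return [sigmoid(self.b[j] + sum(v[i] * self.W[i][j]
                                        for i in range(self.n_visible)))
                for j in range(self.n_hidden)]

    def visible_probs(self, h):
        """P(v_i = 1 | h) for every visible unit."""
        return [sigmoid(self.a[i] + sum(h[j] * self.W[i][j]
                                        for j in range(self.n_hidden)))
                for i in range(self.n_visible)]

    def sample(self, probs):
        """Draw a binary vector from independent Bernoulli probabilities."""
        return [1.0 if self.rng.random() < p else 0.0 for p in probs]

    def cd1_update(self, v0, lr=0.1):
        """One CD-1 parameter update for a single training vector."""
        # Positive phase: hidden activations driven by the data.
        ph0 = self.hidden_probs(v0)
        h0 = self.sample(ph0)
        # Negative phase: one Gibbs step to get a reconstruction.
        v1 = self.sample(self.visible_probs(h0))
        ph1 = self.hidden_probs(v1)
        # Move parameters toward data statistics, away from reconstruction statistics.
        for i in range(self.n_visible):
            for j in range(self.n_hidden):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        for i in range(self.n_visible):
            self.a[i] += lr * (v0[i] - v1[i])
        for j in range(self.n_hidden):
            self.b[j] += lr * (ph0[j] - ph1[j])

    def fit(self, data, epochs=50, lr=0.1):
        """Sequentially sweep the dataset; this loop is the single-machine bottleneck."""
        for _ in range(epochs):
            for v in data:
                self.cd1_update(v, lr)
        return self


# Toy usage: two made-up binary patterns, repeated.
toy_data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10
rbm = TinyRBM(n_visible=6, n_hidden=2).fit(toy_data)
features = [rbm.hidden_probs(v) for v in toy_data]  # inputs for a downstream classifier
```

In practice the inner loops are expressed as matrix products over mini-batches, and those matrix operations are the natural place to distribute the computation.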
Numerous attempts have been made to overcome these limitations, most of them relying on GPU computation. Studies have shown that this approach can reduce the training time for an RBM-based deep belief network from several weeks to a single day. GPU-based training presents challenges of its own, however. A GPU limits the amount of memory available for the computation, which in turn limits the size of the model. Stacking multiple GPUs together is inefficient because of communication overhead and increased cost. There are also limitations arising from memory transfer times and thread synchronization.
Nikolay Manchev explores an implementation of a CPU-based, parallelized version of the restricted Boltzmann machine created as a collaboration between IBM and City University London. The research team created a custom implementation of a restricted Boltzmann machine that runs on top of Apache SystemML, a declarative large-scale machine-learning platform, and carried out a number of tests with various datasets, using RBMs as feature extractors and feeding the outputs to different classification algorithms (support vector machines, decision trees, multinomial logistic regression, etc.). Nikolay offers an overview of the research and the current state of this stochastic ANN model in the context of big data, as well as future plans. Along the way, he also discusses how SystemML alleviates certain big data challenges (e.g., using cost-based optimization for distributed matrix operations) and why the team chose it as a foundation for its machine-learning problem.
Nikolay Manchev is a data scientist on IBM’s Big Data technical team. He specializes in machine learning, data science, and big data. He is a speaker, blogger, and the organizer of the London Machine Learning Study Group meetup. Nikolay holds an MSc in software technologies and an MSc in data science, both from City University London.