The increasing complexity of learning algorithms and deep neural networks, combined with size of data and parameters, has made it challenging to exploit existing large-scale data processing pipelines for training and inference. Vartika Singh, Marton Balassi, Steven Totman, and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks. Vartika, Marton, Steven, and Juan walk you through different tools and frameworks, ranging from Spark for preprocessing to deep learning frameworks for training and inference, targeting the nuances in the datasets as they relate to algorithm optimization techniques, frameworks, and scale.
Vartika Singh is a field data science architect at Cloudera. Previously, Vartika was a data scientist applying machine learning algorithms to real-world use cases ranging from clickstream to image processing. She has 12 years of experience designing and developing solutions and frameworks utilizing machine learning techniques.
Juan Yu is a software engineer at Cloudera working on the Impala project, where she helps customers investigate, troubleshoot, and resolve escalations and analyzes performance issues to identify bottlenecks, failure points, and security holes. Juan also implements enhancements in Impala to improve customer experience. Previously, Juan was a software engineer at Interactive Intelligence and held developer positions at Bluestreak, Gameloft, and Engenuity.
Marton Balassi is a solutions architect at Cloudera, where he focuses on data science and stream processing with big data tools. Marton is a PMC member at Apache Flink and a regular contributor to open source. He is a frequent speaker at big data-related conferences and meetups, including Hadoop Summit, Spark Summit, and Apache Big Data.
Steven Totman is the financial services industry lead for Cloudera’s Field Technology Office, where he helps companies monetize their big data assets using Cloudera’s Enterprise Data Hub. Prior to Cloudera, Steve ran strategy for a mainframe-to-Hadoop company and drove product strategy at IBM for DataStage and Information Server after joining with the Ascential acquisition. He architected IBM’s Infosphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents for data-integration and governance/metadata-related designs.
Comments on this page are now closed.
©2018, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org