Presented by O'Reilly and Cloudera
Make Data Work
July 12-13, 2017: Training
July 13-15, 2017: Tutorials & Conference
Beijing, China

Building deep learning-powered big data analytics using Apache Spark and BigDL

This will be presented in Chinese

Yiheng Wang (Intel)
13:30–17:00 Thursday, 2017-07-13
Spark & beyond
Location: Function Room 3
Level: Intermediate

Prerequisite Knowledge

A basic understanding of deep learning and Apache Spark

Materials or downloads needed in advance

A laptop (macOS or Linux). Instructions will be posted to the course GitHub page.

Hardware and/or installation requirements

A laptop (macOS or Linux). Instructions will be posted to the course GitHub page soon.

What you'll learn

Learn how to apply deep learning, a state-of-the-art machine learning technique, to your big data workloads driven by Apache Spark.

Description

Deep learning offers state-of-the-art performance in many domains (e.g., computer vision, NLP, and speech recognition) and has great potential value to industry. It is also tightly connected with big data. First, deep learning models need to be trained on massive amounts of data, which is why the field only began to flourish in the big data era. Second, the majority of today's big data is video, audio, or text, which is well suited to deep learning algorithms. To unlock its true power, deep learning must be put in a big data context.

The industry has already built a rich big data ecosystem, from distributed data storage to high-velocity streaming systems to data processing engines. Apache Spark, a well-known, fast engine for big data processing, provides a complete framework that unifies different big data workloads (SQL, streaming, machine learning, etc.), and a large number of big data applications have already been built on it.

Yiheng Wang offers an overview of BigDL, a distributed deep learning framework built for big data platforms using Apache Spark. BigDL combines the benefits of high-performance computing and big data architecture. It provides native support for deep learning in Spark, an orders-of-magnitude single-node speed-up over out-of-the-box open source deep learning frameworks such as Caffe and Torch (by leveraging Intel MKL), and the ability to scale out deep learning workloads on the Spark architecture.

Yiheng explores the functionality of BigDL, walks you through development with examples, and shares how users have adopted BigDL in deep learning applications (image recognition, object detection, NLP, etc.). These use cases show that an existing big data platform, such as Apache Hadoop or Spark, can serve as a unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.
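
The description mentions scaling out deep learning workloads across a cluster but does not say how. One common way to do this is synchronous data-parallel training: each worker computes a gradient on its own partition of the data, the partial gradients are averaged, and every worker applies the same update. The sketch below illustrates that general idea in plain Python on a tiny linear model. It is an assumption made for illustration, not BigDL code and not a statement of how BigDL is implemented; the names `partition`, `local_gradient`, and `train` are hypothetical, and the list of partitions merely stands in for workers in a cluster.

```python
# Illustrative only: plain Python standing in for a cluster. Not BigDL code.
from typing import List, Tuple

Sample = Tuple[float, float]  # (feature x, label y)


def partition(data: List[Sample], num_parts: int) -> List[List[Sample]]:
    """Split the data round-robin, as a cluster spreads records over workers."""
    return [data[i::num_parts] for i in range(num_parts)]


def local_gradient(w: float, b: float, part: List[Sample]) -> Tuple[float, float, int]:
    """Summed squared-error gradient for y ~ w*x + b on one partition."""
    gw = sum(2.0 * (w * x + b - y) * x for x, y in part)
    gb = sum(2.0 * (w * x + b - y) for x, y in part)
    return gw, gb, len(part)


def train(data: List[Sample], num_parts: int = 4,
          lr: float = 0.1, steps: int = 2000) -> Tuple[float, float]:
    """Synchronous data-parallel gradient descent on a linear model."""
    parts = partition(data, num_parts)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # "Map": each worker computes a gradient on its own partition.
        grads = [local_gradient(w, b, p) for p in parts]
        # "Reduce": combine the partial gradients into one global average.
        n = sum(g[2] for g in grads)
        gw = sum(g[0] for g in grads) / n
        gb = sum(g[1] for g in grads) / n
        # Every worker applies the same update, so the model stays in sync.
        w -= lr * gw
        b -= lr * gb
    return w, b


# Synthetic data drawn from y = 3x + 1, so training should recover w=3, b=1.
data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(20)]
w, b = train(data)
```

Because the averaged gradient equals the gradient over the full dataset, the number of partitions changes how the work is divided but not the model that is learned, up to floating-point rounding.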

Yiheng Wang

Intel

Yiheng Wang is a software development engineer on the Big Data Technology team at Intel, working in the area of big data analytics. Yiheng and his colleagues are developing and optimizing distributed machine learning algorithms (e.g., neural networks and logistic regression) on Apache Spark. He also helps Intel customers build and optimize their big data analytics applications.
