Motivated by making technology more accessible, Anirudh Koul and Saqib Shaikh explore how deep learning can enrich image understanding and, in turn, enable the blind community to experience and interact with the physical world more holistically than has ever been possible. The intersection of vision and language is a ripe area of research and, fueled by advances in deep learning, is shaping the future of artificial intelligence.
Anirudh and Saqib trace how computer vision has evolved and outline cutting-edge research in the field, particularly in object recognition, image captioning, visual question answering, and emotion recognition. Using a 152-layer neural network, they first discuss the successes and pitfalls of object recognition. Going beyond object classification, they then attempt to understand objects in context, including the relationships between them, and describe them in a sentence. Drawing on the winning entries from Microsoft researchers at the ImageNet Large Scale Visual Recognition Challenge and the COCO Captioning Challenge, Anirudh and Saqib demonstrate how developers can use these state-of-the-art techniques in their own projects. For example, it is now possible to generate very detailed descriptions of images, such as “I see a young man on a sofa reading a book” or “I see people jogging at the beach.” This research can be extremely useful to the blind and also benefits businesses that rely on image search.
Anirudh and Saqib conclude by examining the exciting area of visual question answering, which enables blind users to get answers to questions about their surroundings. They also briefly cover Microsoft’s Cognitive Services, a set of machine learning APIs for vision, speech, face, and emotion recognition that are now available for developers to use. These APIs make it straightforward for developers to integrate state-of-the-art image understanding into their own applications. By the end of the session, you’ll have developed an intuition for what works and what doesn’t, understand the practical limitations you may encounter during development, and know how to use these techniques in your own applications.
Anirudh Koul is a senior data scientist at Microsoft Research and the founder of Seeing AI, a talking camera app for the blind community. Anirudh brings over a decade of production-oriented applied research experience on petabyte-scale datasets, with features shipped to about a billion people. An entrepreneur at heart, he has run ministartup teams within Microsoft, prototyping ideas using computer vision and deep learning techniques for augmented reality, productivity, and accessibility and building tools for communities with visual, hearing, and mobility impairments. A regular at hackathons, Anirudh has won close to three dozen awards, including top-three finishes for four consecutive years in the world’s largest private hackathon, with 18,000 participants. Some of his recent work, which IEEE has called “life changing,” has been showcased at a White House event, on Netflix, and in National Geographic and has received awards from the American Foundation for the Blind and Mobile World Congress.
Saqib Shaikh is a software engineer at Microsoft, where he has worked for 10 years. Saqib has developed a variety of Internet-scale services and data pipelines powering Bing, Cortana, Edge, MSN, and various mobile apps. Being blind, Saqib is passionate about accessibility and universal design; he serves as an internal consultant for teams including Windows, Office, Skype, and Visual Studio and has spoken at several international conferences. Saqib has won three Microsoft hackathons in the past year. His current interests focus on the intersection between AI and HCI and the application of technology for social good.
©2016, O'Reilly Media, Inc.