Motivated by making technology more accessible, Anirudh Koul and Saqib Shaikh explore how deep learning can enrich image understanding and, in turn, enable the blind community to experience and interact with the physical world more holistically than was previously possible. The intersection of vision and language is a ripe area of research and, fueled by advances in deep learning, is shaping the future of artificial intelligence.
Anirudh and Saqib explore how computer vision has evolved and outline cutting-edge research, especially in object recognition, image captioning, visual question answering, and emotion recognition. Using a 152-layer neural network, they first discuss the successes and pitfalls of object recognition. Going beyond object classification, they attempt to understand objects in context (as well as their relationships) and describe them in a sentence. Drawing on the winning entry from Microsoft researchers at the ImageNet Large Scale Visual Recognition Challenge and COCO Captioning Challenge, Anirudh and Saqib demonstrate how developers can use these state-of-the-art techniques in their own projects. For example, it is now possible to generate very detailed descriptions of images, such as “I see a young man on a sofa reading a book” or “I see people jogging at the beach.” This research can be extremely useful to the blind and also benefits businesses that rely on image search.
Anirudh and Saqib conclude by examining the exciting area of visual question answering, which enables blind users to get answers to questions about their surroundings. They also briefly cover Microsoft’s Cognitive Services, a set of machine-learning APIs for vision, speech, facial, and emotion recognition that are now open for use. This makes it straightforward for developers to integrate state-of-the-art image understanding into their own applications. By the end of the session, you’ll develop an intuition for what works and what doesn’t, understand the practical limitations you may encounter during development, and know how to use these techniques in your own applications.
Anirudh Koul is the head of AI and research at Aira, noted by Time magazine as one of the best inventions of 2018. He’s a noted AI expert and O’Reilly author whose books include the upcoming Practical Deep Learning for Cloud and Mobile. Previously, he was a scientist at Microsoft AI, where he founded Seeing AI, the most-used technology among the blind community after the iPhone. With features shipped to a billion users, he brings over a decade of production-oriented applied research experience on petabyte-scale datasets. He’s been developing technologies using AI techniques for augmented reality, robotics, speech, productivity, and accessibility. Some of his recent work, which IEEE has called “life-changing,” has been honored by CES, FCC, Cannes Lions, and the American Council of the Blind; showcased at events by the UN, the White House, the House of Lords, the World Economic Forum, Netflix, and National Geographic; and applauded by world leaders including Justin Trudeau and Theresa May.
Saqib Shaikh is a software engineer at Microsoft, where he has worked for 10 years. Saqib has developed a variety of Internet-scale services and data pipelines powering Bing, Cortana, Edge, MSN, and various mobile apps. Being blind, Saqib is passionate about accessibility and universal design; he serves as an internal consultant for teams including Windows, Office, Skype, and Visual Studio and has spoken at several international conferences. Saqib has won three Microsoft hackathons in the past year. His current interests focus on the intersection of AI and HCI and the application of technology for social good.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.