We’ve seen significant progress in infrastructure for using data effectively in the last half-decade, but this hasn’t applied to all types of data equally. Unstructured text, in particular, has been slower to yield to the kinds of analysis that many businesses are starting to take for granted. Rather than being limited by what we can collect, we are now constrained by the tools, time, and techniques needed to make good use of it. Even so, we are beginning to gain the ability to do remarkable things with unstructured text data.
Michael Williams explores text summarization (taking text in and returning a shorter document that contains the same information), covering both single-document and multidocument summarization. Michael demonstrates ways to solve the summarization problem that range from extremely simple algorithms that date back to the 1950s to the latest recurrent neural networks, explains how to choose between these approaches, and shows working prototype products for each.
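To give a feel for the simple end of that range, here is a minimal sketch of a frequency-based extractive summarizer for a single document. The abstract does not say which algorithms the talk covers, so this is an assumption: it is a simplified heuristic in the spirit of early word-frequency methods, not necessarily what is presented. The function names, the tiny stopword list, and the naive sentence splitter are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "was", "by", "this", "be"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Document-wide word frequencies stand in for word "importance".
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score; ties keep document order (stable sort).
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    sample = "Cats chase mice. Cats sleep often. Dogs bark loudly at night."
    print(summarize(sample, max_sentences=2))
```

A sketch like this only selects existing sentences; it cannot rephrase or merge them, which is one reason the more recent neural approaches mentioned above are of interest.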
Summarizing tens or hundreds of thousands of articles at once represents an entirely new capability. But this capability is a solution to a bigger problem: it’s a gateway to quantified representations of text. The breakthrough capabilities realized by the application of sentence embedding and recurrent neural networks to the semantic meaning of text are poised to transform all the ways in which computers process language.
Mike Lee Williams is a research engineer at Cloudera Fast Forward Labs, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Cloudera’s customers understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.
©2016, O'Reilly Media, Inc. All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.