The abundance of unstructured information increases the need for automatic systems that can “condense” information from multiple documents into a shorter, readable summary. Such summaries may also need to cover a specific information need (e.g., summarizing web search results, medical records, or question-answering results). Research on automatic document summarization began decades ago, but in recent years it has gained popularity and become a building block in many AI technologies across domains. Automatic summarization is considered a very hard problem (some would even claim it is still an open problem), and as such it attracts a great deal of research. To make informed decisions, users such as financial analysts, policy makers, and medical practitioners must explore, extract, and analyze large amounts of information. This is nearly impossible without automatic systems that can identify what is essential and condense the information in a coherent way.
Summarization technologies typically include the following functionalities: identification of important information in source documents, identification and removal of duplicate information, optimization of summary information coverage, and content ordering and rewriting to enhance readability. Various methods have been proposed for the summarization task. These methods can be categorized along two main dimensions: extractive versus abstractive and supervised versus unsupervised. Extractive methods generate a summary using only text fragments extracted from the source document(s), while abstractive methods may also synthesize new text. Supervised methods try to fit a model that learns to select or generate “relevant” text fragments for a summary based on training data. While supervised methods may provide better quality, they require more domain knowledge than their unsupervised counterparts; generalizing them to new datasets, domains, and languages therefore remains a great challenge. Abstractive methods hold a lot of promise from a summary readability perspective (coherency, fluency, and focus of text), but most current abstractive methods require heavy supervision and are thus less practical.

The most common evaluation method for summary quality is the ROUGE metric. This metric evaluates the quality of a machine-generated summary against human-written summaries (ideal reference summaries) by counting the n-gram overlap between the two. Although it has some limitations, studies have shown a significant correlation between ROUGE scores and human-assigned scores.
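To make the n-gram overlap idea concrete, the following is a minimal sketch of ROUGE-N recall: the fraction of the reference summary's n-grams that also appear in the machine-generated summary. Real ROUGE implementations add options such as stemming, stopword removal, and multiple references; the function and variable names here are illustrative, not from any particular toolkit.

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams (as tuples) from a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    # ROUGE-N recall: clipped n-gram matches / total reference n-grams.
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n(candidate, reference, n=1))  # 5 of 6 reference unigrams match
```

With n=2 the score drops, since word order now matters: only the bigrams shared between the two sentences count as matches.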
Guy Feigenblat offers an overview of unsupervised automatic summarization techniques. Guy begins by reviewing several popular datasets, such as the Document Understanding Conference (DUC) collections, CNN/DailyMail, and WebAP, which are often used for evaluating summarization quality. He then explores key research, focusing on techniques and results, and shares a novel query-focused multidocument summarization technology developed by IBM Research AI, detailing its unique approach, various components, and architecture. Guy concludes with best practices for developing a user interface for summarization and presents a web client developed by IBM Research AI.
Guy Feigenblat is a research staff member in the AI Language Department of the Haifa Research Lab, where he leads research on automatic document summarization. He’s also an adjunct faculty member at Haifa University. Previously, he was involved in various machine learning and AI projects focused on developing cognitive bots that can express and predict human emotions. He has published several papers and patents. Guy holds a PhD in computer science from Bar-Ilan University, where he worked under the supervision of Ely Porat. His main academic interests include machine learning, AI, information retrieval, data mining, and data structures.
©2018, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.