Language Technologies for a Connected World: Processing and Visualizing Unstructured Text in 5000 Languages

Design Ballroom CD

Big data analytics has blossomed because there is too much information for people to process manually. It is probably less widely known that, as of the last couple of years, the average person could not understand the majority of the world’s data even if there weren’t so much of it: the majority of the world’s data is now in non-English unstructured text and speech, in around 5000 languages.

Unstructured text, especially outside of English, is one of the least understood areas of information processing. English is a linguistic outlier in a number of ways: strictness of word order, simplicity of prefixing and suffixing, size of vocabulary, and standardization of spelling. From search engines to social media firehoses, English-centric design decisions have left us unprepared to accurately process information from more typical languages. For example, assumptions about the utility of ‘keywords’ might not apply when every word in a language has several prefixes, suffixes or spelling variations, or when we cannot accurately break a sentence into words in the first place. These problems exist in languages that account for about 25% of the world’s digital data.
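As a rough illustration of the keyword problem described above, the following minimal Python sketch counts exact keyword matches after splitting text on whitespace. The function name and the sample sentences are hypothetical, chosen only for illustration; they are not taken from the talk, and real language processing systems use far more sophisticated segmentation and morphological analysis.

```python
def keyword_hits(text: str, keyword: str) -> int:
    """Count tokens that exactly equal `keyword` after whitespace splitting."""
    return sum(1 for token in text.split() if token == keyword)


# English: words are separated by spaces and the keyword appears unchanged.
english = "the house is near the other house"

# Turkish: 'ev' (house) appears only with suffixes attached:
# evler (houses), evde (in the house), evlerimizde (in our houses).
turkish = "evler evde evlerimizde"

# Chinese: no spaces between words, so the whole sentence is one "token".
# The sentence means "I like this house"; the word for house is the last two characters.
chinese = "我喜欢这个房子"
chinese_keyword = "房子"

english_count = keyword_hits(english, "house")
turkish_count = keyword_hits(turkish, "ev")
chinese_count = keyword_hits(chinese, chinese_keyword)

print(english_count, turkish_count, chinese_count)
```

Under these assumptions the English keyword is found, while the Turkish and Chinese keywords are missed even though the concept is present in both sentences: the Turkish word never appears without a suffix, and the Chinese sentence is never split into words at all.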

This talk will give a high-level overview of how languages vary, what kinds of communications are the most widely used in different languages, what current language processing technologies can (and cannot) achieve, and how we can process and visualize this information at scale.

Robert Munro


Big data for a better world! Rob is the CEO of Idibon, which is tackling the problem of extracting information from unstructured speech and text in the world’s connected languages, all 5000 of them. His background includes building infrastructure in Sierra Leone and Liberia, running crowdsourced translation platforms for Haiti, and work in language processing technologies that support a number of Silicon Valley search engines and start-ups. He has a PhD from Stanford University.

