This presentation outlines several new academic developments in large data that you haven’t heard of yet but that have immediate applications in industry. We discuss industry applications, like search, question-answering, and distributed computing, that could be improved immensely using these techniques.
I will discuss each technique for five to eight minutes. The techniques are:
Semantic hashing (Salakhutdinov + Hinton, 2007)
Keyword search and its variants, like Google's, can easily scale to billions of documents, but can often miss relevant results. What if your search is missing relevant results because simple keyword matching skips documents that don't contain the exact keywords? This issue is especially acute for short text, like tweets. Tweets about the MTV Video Music Awards, for example, rarely contain the term VMA or the hashtag #vma. But wouldn't it be useful to retrieve all relevant results?
Semantic hashing lets you search just as fast as keyword matching, but it performs semantic search, finding relevant documents that don't necessarily contain the search keywords. It is also completely automatic, requiring no ontologies or other human annotation. And it scales to billions of documents, just like keyword search.
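To make the retrieval mechanics concrete, here is a minimal sketch in Python. The real method (Salakhutdinov + Hinton, 2007) learns compact binary codes with a deep autoencoder; this sketch substitutes random-hyperplane hashing (SimHash) over bags of words as a stand-in, but it illustrates the key idea: every document maps to a short binary code, and search is a fast scan for codes within a small Hamming distance of the query's code.

```python
# Toy sketch of semantic-hashing-style retrieval. The learned autoencoder
# codes of the actual paper are replaced here by SimHash over bags of
# words; only the retrieval mechanics (binary codes + Hamming distance)
# are illustrative of the technique.
import hashlib

NUM_BITS = 16  # code length; the paper uses short codes of similar scale

def token_signature(token):
    """Deterministic pseudo-random +/-1 vector for a token."""
    digest = hashlib.md5(token.encode()).digest()
    return [1 if (digest[i // 8] >> (i % 8)) & 1 else -1
            for i in range(NUM_BITS)]

def simhash(text):
    """Hash a document to a NUM_BITS-long binary code."""
    totals = [0] * NUM_BITS
    for token in text.lower().split():
        totals = [t + s for t, s in zip(totals, token_signature(token))]
    return tuple(1 if t > 0 else 0 for t in totals)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def search(query, index, max_distance=4):
    """Return documents whose codes lie near the query's code."""
    qcode = simhash(query)
    return [doc for doc, code in index.items()
            if hamming(qcode, code) <= max_distance]

docs = [
    "mtv video music awards red carpet highlights",
    "mtv video music awards best performances",
    "quarterly earnings report for acme corp",
]
index = {doc: simhash(doc) for doc in docs}
print(search("mtv music awards", index))
```

Because the index stores only short binary codes, lookup cost is independent of vocabulary and near-constant per document, which is what makes the approach competitive with keyword matching in speed.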
GraphLab, a new parallelism abstraction (Low et al., 2010)
There are two ways to achieve significant improvements in predictive analytics and ML tasks like recommendation, sentiment analysis, credit risk assessment, and financial forecasting: you can throw more data at the problem, or you can use more sophisticated learning algorithms.
MapReduce, and its implementation Hadoop, have been highly successful at promoting distributed computing. MapReduce is good for single-iteration, embarrassingly parallel distributed tasks like feature processing, which means that far more data can be processed. However, MapReduce is too high-level an abstraction to implement sophisticated learning algorithms.
What kind of gains could you see if you had the best of both worlds: large data AND sophisticated learning algorithms? GraphLab might offer those gains.
GraphLab is only slightly lower-level than MapReduce, but significantly more powerful. It is good for iterative algorithms with computational dependencies or complex asynchronous schedules, and has been tested on a variety of sophisticated machine learning algorithms.
Source code implementing GraphLab is available.
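As a rough illustration of the programming model (this is a toy sketch, not GraphLab's actual API): GraphLab structures computation as update functions that each run on a single vertex, read that vertex's neighborhood, and re-schedule neighbors whose values they may have affected. The real engine executes these updates in parallel with consistency guarantees; the sequential sketch below runs PageRank in that vertex-centric, dynamically scheduled style.

```python
# Toy sketch of GraphLab's vertex-centric model: an update function runs
# on one vertex, reads its in-neighbors, and re-schedules downstream
# vertices only when its value changed. Names and structure are
# illustrative, not GraphLab's real API.
from collections import deque

DAMPING = 0.85
TOLERANCE = 1e-6

def pagerank_update(vertex, rank, in_neighbors, out_degree):
    """Recompute one vertex's rank from its local scope."""
    total = sum(rank[u] / out_degree[u] for u in in_neighbors[vertex])
    return (1 - DAMPING) + DAMPING * total

def run(edges):
    vertices = {v for e in edges for v in e}
    in_neighbors = {v: [u for (u, w) in edges if w == v] for v in vertices}
    out_degree = {v: sum(1 for (u, w) in edges if u == v) or 1
                  for v in vertices}
    rank = {v: 1.0 for v in vertices}
    schedule = deque(vertices)       # vertices awaiting an update
    scheduled = set(vertices)
    while schedule:
        v = schedule.popleft()
        scheduled.discard(v)
        new_rank = pagerank_update(v, rank, in_neighbors, out_degree)
        changed = abs(new_rank - rank[v]) > TOLERANCE
        rank[v] = new_rank
        if changed:
            # Only downstream neighbors are affected; re-schedule them.
            for (u, w) in edges:
                if u == v and w not in scheduled:
                    schedule.append(w)
                    scheduled.add(w)
    return rank

print(run([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]))
```

The dynamic schedule is the point: unlike a MapReduce job, which must rerun every vertex each iteration, only vertices whose inputs actually changed get recomputed.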
Unsupervised Semantic Parsing (Poon + Domingos, 2009+2010)
A lot of work has gone into building natural-language search engines and question-answering systems, but these efforts have been only moderately successful. In particular, previous approaches (like those of Powerset and Wolfram Alpha) required sophisticated linguistic expertise and extensive ontology and knowledge-base construction. Essentially, there has been a lot of human engineering in the loop, and these techniques still don't work very well.
Unsupervised semantic parsing is a highly ambitious and successful technique that attacks the problem of reading text and understanding its meaning. It requires no human annotation; it simply learns by reading text. It has been applied to question-answering and is far more successful than competing academic baselines. By combining this automatic technique with current human-engineered approaches, one could significantly improve deployed NL search and question-answering systems.
Source code implementing this technique is available.
Conclusion and question period
I conclude by summarizing the techniques and the applications they address. During the question period, I will specifically solicit audience questions about the technical applications and problems that are important to them, and will point to further relevant academic developments that didn't make it into the main talk.
Joseph Turian, Ph.D., heads MetaOptimize LLC, which consults on predictive analytics, business intelligence, NLP, ML, and data strategy. He also runs the MetaOptimize Q&A site, where Machine Learning and Natural Language Processing experts share their knowledge. He specializes in large data sets.
Joseph Turian received his Ph.D. in computer science (with a focus on Machine Learning and Natural Language Processing) from New York University in 2007. During his graduate studies, he developed a fast, large-scale machine learning method for parsing natural language. He received his AB from Harvard University in 2001.
As a scientist, Joseph Turian has over 14 refereed publications in top NLP + ML conferences. His team submitted the best parser in the EVALITA 2009 Main + Pilot tasks. He is an advocate for open-notebook science, releasing his research code on GitHub, and for broader scientific collaboration through the internet.