New Developments in Large Data Techniques

Location: Mission City M
Average rating: ****.
(4.35, 23 ratings)

This presentation outlines several new academic developments in large data that you haven’t heard of yet but that have immediate applications in industry. We discuss industry applications, like search, question-answering, and distributed computing, that could be improved immensely using these techniques.

These techniques include:

  • deep learning + semantic hashing
  • GraphLab, a new parallelism abstraction
  • unsupervised semantic parsing

I will discuss each technique for about five to eight minutes.

Semantic hashing (Salakhutdinov + Hinton, 2007)

Keyword search and its variants, like that done by Google, can easily scale to billions of documents, but can often miss relevant results.

What if your search is missing relevant results because simple keyword matching misses documents that don’t contain the exact keywords? This issue is especially acute for short text, like tweets. Tweets about the MTV music awards, for example, rarely contain the term VMA or the hash tag #vma. But wouldn’t it be useful to retrieve all relevant results?

Semantic hashing allows you to search just as fast as keyword matching, but it performs semantic search and finds relevant documents that don’t necessarily contain the search keywords. It is also completely automatic and doesn’t require ontologies or other human annotation. And it can scale to billions of documents, like keyword search.
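The retrieval side of this idea can be illustrated with a minimal sketch. It assumes each document has already been mapped to a short binary code by some learned encoder (the deep model itself is not shown), and the 8-bit codes and document names below are hypothetical. Documents whose codes differ from the query code by only a few bits are treated as semantically close, and they are found by direct hash-table lookups rather than by scanning the collection.

```python
from collections import defaultdict
from itertools import combinations


def hamming_ball(code, n_bits, radius):
    """Yield every n_bits-wide code within `radius` bit flips of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p  # flip bit p
            yield flipped


def build_index(doc_codes):
    """Map each binary code to the list of documents stored at that code."""
    index = defaultdict(list)
    for doc_id, code in doc_codes.items():
        index[code].append(doc_id)
    return index


def search(index, query_code, n_bits, radius=1):
    """Return documents whose code is within `radius` bits of the query code."""
    hits = []
    for candidate in hamming_ball(query_code, n_bits, radius):
        hits.extend(index.get(candidate, []))
    return sorted(hits)


# Hypothetical 8-bit codes; a real system would get these from a trained model.
doc_codes = {
    "tweet_vma_1": 0b10110010,
    "tweet_vma_2": 0b10110011,   # one bit away from tweet_vma_1
    "tweet_sports": 0b01001100,  # far away in Hamming distance
}
index = build_index(doc_codes)
results = search(index, 0b10110010, n_bits=8, radius=1)
print(results)
```

The cost of a query depends on the code length and the radius, not on the number of documents, which is why this style of lookup can be as fast as keyword matching.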

GraphLab, a new parallelism abstraction (Low et al., 2010)

There are two ways to achieve significant improvements in predictive analytics and ML tasks like recommendation, sentiment analysis, credit risk assessment, and financial forecasting: you can throw more data at the problem, or you can use more sophisticated learning algorithms.

MapReduce, and its implementation Hadoop, have been highly successful at promoting distributed computing. MapReduce is good for single-iteration and embarrassingly parallel distributed tasks like feature processing, which means that a lot more data can be processed. However, MapReduce is too high-level to implement sophisticated learning algorithms.
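As a rough illustration of the single-pass pattern, here is a toy, in-memory sketch of map, shuffle, and reduce applied to token counting, a simple form of feature processing. It is not Hadoop code; the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def map_phase(doc):
    """Emit a (token, 1) pair for every token; each document is independent."""
    return [(token.lower(), 1) for token in doc.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key; each key is independent."""
    return key, sum(values)


docs = ["big data big models", "big clusters"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

Every map call and every reduce call is independent, so one pass parallelizes trivially. An algorithm that must repeat such passes until convergence, with each pass depending on the last, fits this model less comfortably.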

What kind of gains could you see if you could have the best of both worlds? Large data AND sophisticated learning algorithms? GraphLab might offer those gains.

GraphLab is only slightly lower-level than MapReduce, but significantly more powerful. It is good for iterative algorithms with computational dependencies or complex asynchronous schedules, and has been tested on a variety of sophisticated machine learning algorithms.
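To make "iterative algorithms with computational dependencies" concrete, here is a single-machine sketch in the spirit of a vertex-update abstraction. It does not use the GraphLab API; the function name, scheduler, and example graph are assumptions chosen for illustration. Each vertex recomputes its value from its neighbors, and a vertex is rescheduled only when one of its inputs changed noticeably, which is the kind of dynamic, asynchronous schedule that is awkward to express as MapReduce passes.

```python
from collections import deque


def pagerank_async(out_links, damping=0.85, tol=1e-6):
    """PageRank with a work queue: only vertices whose inputs changed are rerun.

    Assumes every vertex appears as a key in `out_links` and has at least
    one outgoing link.
    """
    nodes = list(out_links)
    in_links = {v: [] for v in nodes}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    rank = {v: 1.0 for v in nodes}
    queue = deque(nodes)   # vertices waiting for an update
    queued = set(nodes)    # membership check to avoid duplicate scheduling

    while queue:
        v = queue.popleft()
        queued.discard(v)
        incoming = sum(rank[u] / len(out_links[u]) for u in in_links[v])
        new_rank = (1 - damping) + damping * incoming
        changed = abs(new_rank - rank[v]) > tol
        rank[v] = new_rank
        if changed:
            # Neighbors that read this vertex's value must be recomputed.
            for w in out_links[v]:
                if w not in queued:
                    queue.append(w)
                    queued.add(w)
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_async(graph)
print(ranks)
```

Updates read the freshest neighbor values instead of waiting for a global barrier, and work stops on parts of the graph that have already converged.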

Source code is available that implements GraphLab.

Unsupervised Semantic Parsing (Poon + Domingos, 2009+2010)

A lot of work has gone into building natural language search engines and question-answering systems. However, these efforts have been only moderately successful. In particular, previous approaches (like those of Powerset and Wolfram Alpha) have required sophisticated linguistic expertise and extensive ontology and knowledge-base construction. Essentially, there has been a lot of human engineering in the loop, and these techniques still don’t work so well.

Unsupervised semantic parsing is a highly ambitious and successful technique that attacks the problem of reading text and understanding its meaning. It requires no human annotation, and just learns by reading text. It has been applied to question-answering and is far more successful than competing academic baselines. By combining this automatic technique with current human-engineered tricks, one could significantly improve deployed NL search and question-answering systems.

Source code is available that implements this technique.

Conclusion and question period

I conclude by summarizing the techniques and the applications that they address. During the question period, I will specifically solicit audience questions about more technical applications and problems that are important to them. I will note more academic developments that are relevant to these audience questions but didn’t make it into the main talk.


Joseph Turian


Joseph Turian, Ph.D., heads MetaOptimize LLC, which consults on predictive analytics, business intelligence, NLP, ML, and data strategy. He also runs the MetaOptimize Q&A site, where Machine Learning and Natural Language Processing experts share their knowledge. He specializes in large data sets.

Joseph Turian received his Ph.D. in computer science (with a focus on Machine Learning and Natural Language Processing) from New York University in 2007. During his graduate studies, he developed a fast, large-scale machine learning method for parsing natural language. He received his AB from Harvard University in 2001.

As a scientist, Joseph Turian has over 14 refereed publications in top NLP + ML conferences. His team submitted the best parser in EVALITA 2009 Main+Pilot tasks. He is an advocate for open-notebook science, releasing his research code on his GitHub, and for broader scientific collaboration through the internet.



Alex Pinkin
02/08/2011 2:40pm PST

Joseph, thanks for the great talk! I had very little experience with ML, and you pushed me to do more research after the talk.

Regarding Google Pregel: it appears that there is an open source implementation called Golden Orb which is going through Apache incubator right now. We should learn more about it at next week’s Austin Hadoop user’s group.

Håkan Jonsson
02/07/2011 4:32pm PST

One of the best sessions

Joseph Turian
02/07/2011 6:43am PST

I’m glad people enjoyed my talk.

You can see a video of my talk, which I taped and uploaded to YouTube. You can find my slides here. Silicon Angle interviewed me, and I gave a very high-level summary of my talk.

For implementations:

  • Theano is a Python math compiler with implementations of deep learning algorithms, and tutorials for deep learning. You write math at a numpy-like level of abstraction, and it automatically gets translated to C and compiled to CPU or GPU native code. The mailing list is very helpful.
  • Semantic Hashing has no public implementations. There is discussion on reddit.
  • The GraphLab implementation is available.
  • The Google Pregel implementation is not available. Phoebus is the (only?) open-source implementation, and it is in Erlang.
  • Unsupervised semantic parsing code is available from Hoifung Poon, the author of the work. He has not released code for the more recent USP work that induces an ontology.

02/05/2011 1:59am PST

Very insightful presentation. Overview of a promising research path and some pointers to expand my knowledge on the subject.

Michael Cariaso
02/03/2011 5:56pm PST

For me, this was the best session of the conference. I felt like someone with a deep knowledge of their field was doing a good job of distilling down the key points, which would be otherwise inaccessible to an outsider. From this session I learned enough to grasp what might be just over my horizon, with enough references to enable me to go further when I might need to apply this.

armen donigian
02/03/2011 2:52pm PST

great presentation, towards the end of the talk, you mentioned there are open source tools for the algorithms discussed with the exception of one.

1. Where can we find implementations of the algorithms?
2. Which one isn’t implemented yet?

Federico Brubacher
02/03/2011 7:05am PST

Fun and revealing, would love to have more general ML talks like this one and less field applications.

Tim Dysinger
02/03/2011 6:53am PST

This was one of the top presentations at the conference. I really enjoyed it.

Mikael Huss
02/03/2011 6:02am PST

It was nice to hear a talk with useful tips about actual (and novel) machine learning techniques as opposed to applications and computing platforms. While Strata is geared towards applications, there is a place for discussing novel ML methods as well. Maybe the title and abstract were a bit hyperbolic, but I don’t really mind that.

Joel Westerberg
02/03/2011 5:55am PST

Really enjoyed this visionary talk. Interesting to learn about some new algorithms for the future and not just what everybody’s using right now. Most interesting presentation at strataconf I think.

Joseph Turian
02/03/2011 2:56am PST

Anthony: You seem to have wanted a negatively scoped talk about the hard problems in AI. Mine was an optimistic talk about recent advances towards AI. Deep training techniques are an undeniable step forward, because deep models are a necessary (but not sufficient) condition for AI. Similarly, actually creating a working technique for unsupervised semantic parsing is a necessary (but not sufficient) condition for natural language understanding. These algorithms represent genuine progress towards AI. Additional evidence that supports these algorithms is that they have beaten the state-of-the-art on some of the hardest benchmark tasks. That’s not hyperbole, that’s evidence in publication. You are correct that many hard problems remain.

Anthony Cassandra
02/02/2011 2:38pm PST

I found a lot of the claims of breakthroughs in AI to be hyperbole based more on wishful thinking than evidence. Slightly different ways to think about statistical analysis, covariances and data organization in AI are good incremental advancements, but none of them have any support that they are any more likely to solve the underlying problem of context, knowledge and semantics than the dozens of AI techniques before them. Thus, this was more of an editorial session about the hope in the potential of a few new approaches. Often glossed over or ignored were the truly hard parts of the AI problem, which these techniques still have not addressed except in some fairly restricted ways.


