Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Strata + Hadoop World 2016 Sessions

Wednesday, March 30

11:00am–11:40am Wednesday, 03/30/2016
Location: 210 D/H
Eric Tschetter (Yahoo)
Average rating: ****.
(4.29, 7 ratings)
Yahoo uses Druid to provide visibility into the actions of its billions of users and developed a new type of sketch called a Theta Sketch to enable this analysis. Eric Tschetter discusses how Yahoo leverages Druid and Theta Sketches together to enable user-level understanding of its billions of users. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: LL21 C/D
Denise McInerney (Intuit)
Average rating: ***..
(3.50, 12 ratings)
The most valuable people in your organization combine business acumen with data savviness. But these data heroes are rare. Denise McInerney describes how she has empowered business users at Intuit to make better decisions with data and explains how you can do the same thing in your organization. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: LL21 B
Tags: media
Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science)), Cody Rioux (Netflix (Real-time Analytics))
Average rating: ***..
(3.83, 12 ratings)
In the era of large-volume security applications, false positives, as Gartner says, can make the difference between building an "indicator machine" and an "answering machine." Ram Shankar and Cody Rioux explore how to suppress false positives in security monitoring systems through use cases from Microsoft and Netflix. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: 230 A
Hiren Shah (Microsoft), Anand Subbaraj (Microsoft)
Average rating: **...
(2.67, 3 ratings)
Whether you want to extend your on-prem data lake with workflows that leverage the benefits of the cloud’s elastic scale, or you have sensitive data that you need to anonymize and aggregate on-prem before sending to the cloud, you need a hybrid data-integration solution for Hadoop. Hiren Shah and Anand Subbaraj show how to build hybrid data flows with Microsoft HDInsight and Azure Data Factory. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: LL20 D
Chris Sanden (Netflix), Christopher Colburn (Netflix)
Average rating: ****.
(4.50, 20 ratings)
Chris Sanden and Christopher Colburn outline a shared infrastructure for doing anomaly detection. Chris and Christopher explain how their solution addresses both real-time and batch use cases and offer a framework for performance evaluation. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: LL20 C
Tags: travel
Average rating: ****.
(4.40, 10 ratings)
Panoramix makes it easy to slice, dice, and visualize your data. Point it to Druid (or almost any other database) and navigate through your data at the speed of thought. Maxime Beauchemin outlines the features and use cases for Panoramix. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: 210 C/G
Tags: real-time
Jay Kreps (Confluent)
Average rating: ****.
(4.17, 24 ratings)
The world is moving to real-time data, and much of that data flows through Apache Kafka. Jay Kreps explores how Kafka forms the basis for our modern stream-processing architecture. He covers some of the pros and cons of different frameworks and approaches and discusses the recent APIs Kafka has added to allow direct stream processing of Kafka data. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: 210 A/E
Reynold Xin (Databricks)
Average rating: ****.
(4.36, 28 ratings)
Reynold Xin reviews Spark’s adoption and development in 2015. Reynold then looks to the future to outline three major technology trends—the integration of streaming systems and enterprise data infrastructure, cloud computing and elasticity, and the rise of new hardware—discuss the major efforts to address these trends, and explore their implications for Spark users. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: 230 C
Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Mike Cafarella (University of Michigan)
Average rating: **...
(2.87, 15 ratings)
Ben Lorica hosts a conversation with Doug Cutting and Mike Cafarella, the cofounders of Apache Hadoop. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: LL21 E/F
Average rating: ****.
(4.46, 24 ratings)
Organizations do not need a big data strategy. They need a business strategy that incorporates big data. Most organizations lack a roadmap for using big data to uncover new business opportunities. Bill Schmarzo explains how to explore, justify, and plan big data projects with business management. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: 211 A-C
Jake Porway (DataKind), Rachel Quint (Hewlett Foundation), Sue-Ann Ma, Jeremy Anderson (IBM)
Average rating: ****.
(4.00, 2 ratings)
So many of the data projects making headlines—from a new app for finding public services to a new probabilistic model for predicting weather patterns for subsistence farmers—are great accomplishments but don’t seem to have end users in mind. Discover how organizations are designing with, not for, people, accounting for what drives them in order to make long-lasting impact. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: LL20 A
Robert Nishihara (University of California, Berkeley)
Average rating: ****.
(4.59, 17 ratings)
Robert Nishihara offers an overview of SparkNet, a framework for training deep networks in Spark using existing deep learning libraries (such as Caffe) for the backend. SparkNet gets an order of magnitude speedup from distributed training relative to Caffe on a single GPU, even in the regime in which communication is extremely expensive. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: LL20 B
Tags: real-time, iot
Yvonne Quacken (Siemens), Allen Hoem (Teradata)
Average rating: ****.
(4.50, 4 ratings)
Yvonne Quacken and Allen Hoem explore the business and technical challenges that Siemens faced capturing continuous data from millions of sensors across different areas and explain how Teradata Listener helped Siemens simplify this data-capture process with a single, central service to ingest multiple real-time data streams simultaneously in a reliable fashion. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: 230 B
Mario Inchiosa (Microsoft), Roni Burd (Microsoft)
Average rating: ***..
(3.50, 8 ratings)
Hadoop is famously scalable, as is cloud computing. R, the thriving and extensible open source data science software... not so much. Mario Inchiosa and Roni Burd outline how to seamlessly combine Hadoop, cloud computing, and R to create a scalable data science platform that lets you explore, transform, model, and score data at any scale from the comfort of your favorite R environment. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: 210 B/F
Tags: real-time
Eric Frenkiel (MemSQL), JR Cahill (Kellogg)
Average rating: **...
(2.86, 7 ratings)
To win in the on-demand economy, businesses must embrace real-time analytics. Eric Frenkiel demos an enterprise approach to data solutions for predictive analytics. Eric is joined by JR Cahill, who outlines Kellogg's approach to advanced analytics with MemSQL, including moving from overnight to intraday analytics and integrating directly with business intelligence tools like Tableau. Read more.
11:00am–11:40am Wednesday, 03/30/2016
Location: LL21 A
Jagane Sundar (WANdisco)
Average rating: ***..
(3.00, 3 ratings)
Jagane Sundar discusses the unique challenges of hybrid big data deployments and outlines strategies to address them. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: 210 D/H
Tags: real-time
Helena Edelson (Apple), Evan Chan (Tuplejump)
Average rating: ***..
(3.85, 13 ratings)
Developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda. Helena Edelson and Evan Chan highlight a much simpler approach using the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka) plus FiloDB, a new entrant to the distributed-database world, which combines streaming and ad hoc analytics. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: 210 A/E
Vinoth Chandar (Apache Hudi)
Average rating: ****.
(4.18, 17 ratings)
Vinoth Chandar explains how Uber revamped its foundational data infrastructure with Hadoop as the source-of-truth data lake, sharing lessons from the experience. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: LL21 C/D
Lior Abraham (Interana)
Average rating: **...
(2.18, 11 ratings)
Lior Abraham explores how Tinder reinvented its behavioral analytics approach with Interana to tune matchmaking and business operations. Lior discusses strategies for behavioral analytics and explains how they can be applied at your company to increase conversion, improve engagement, and maximize retention. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: LL20 A
Tags: ai, ecommerce
Eric Colson (Stitch Fix)
Average rating: ****.
(4.42, 24 ratings)
Recommender systems use machine-learning algorithms to surface relevant products to consumers. While they are extremely effective, they cannot fully replace human interpretation. The two have very different capabilities that are additive. Eric Colson shows what's possible when the unique contributions of machines are combined with those of human experts to create a truly personalized experience. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: LL20 D
Average rating: ***..
(3.23, 13 ratings)
If you consider user click paths a process, you can apply process mining. Process mining models users based on their actual behavior, which allows us to compare new clicks with modeled behavior and report any inconsistencies. Bolke de Bruin and Hylke Hendriksen explain how ING implemented process mining on Spark Streaming, enabling real-time fraud detection. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: 210 C/G
Tags: real-time
Ted Dunning (MapR, now part of HPE)
Average rating: ***..
(3.78, 9 ratings)
Application messaging isn’t new—solutions include IBM MQ, RabbitMQ, and ActiveMQ. Apache Kafka is a high-performance, high-scalability alternative that integrates well with Hadoop. Can modern distributed messaging systems like Kafka be considered a legacy replacement, or are they purely complementary? Ted Dunning outlines Kafka's architectural benefits and tradeoffs to find the answer. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: LL20 C
Tags: real-time
Leo Meyerovich (Graphistry), Joshua Patterson (NVIDIA), Mike Wendt (NVIDIA)
Average rating: ****.
(4.17, 6 ratings)
“Assuming breach” led to centralizing all logs (SIEMs), but incident response and forensics are still behind on the analytics side. Leo Meyerovich, Mike Wendt, and Joshua Patterson share how Graphistry and Accenture Technology Labs are rethinking data engineering and data analysis and modernizing end-to-end architectures. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: LL21 E/F
Alex Gorelik (Waterline Data)
Average rating: ***..
(3.70, 23 ratings)
It is fashionable today to declare doom and gloom for the data lake. Alex Gorelik discusses best practices for Hadoop data lake success and provides real-world examples of successful data lake implementations in a non-vendor-specific talk. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: 230 A
Tags: real-time
Bin Fan (Alluxio), Haojun Wang (Baidu)
Average rating: ****.
(4.25, 8 ratings)
Baidu runs Alluxio in production with hundreds of nodes managing petabytes of data. Bin Fan and Haojun Wang demonstrate how Alluxio improves big data analytics (ad hoc query)—Baidu experienced a 30x performance improvement—and explain how Baidu leverages Alluxio in its machine-learning architecture and how it uses Alluxio to manage heterogeneous storage resources. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: LL21 B
Tags: real-time
Jun Rao (Confluent)
Average rating: ****.
(4.33, 15 ratings)
With Apache Kafka 0.9, the community has introduced a number of features to make data streams secure. Jun Rao explains the motivation for making these changes, discusses the design of Kafka security, and demonstrates how to secure a Kafka cluster. Jun also covers common pitfalls in securing Kafka and ongoing security work. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: 230 C
Jennifer Wu (Cloudera), James Malone (Google)
Average rating: ***..
(3.75, 4 ratings)
Jennifer Wu and James Malone offer an insider look at how Google has integrated Hadoop components like HDFS, Impala, and Apache Spark with Google Cloud Platform technologies like Google Compute Engine (GCE), Bigtable, BigQuery, and Cloud Storage. Jennifer and James also explore the importance of Google’s growing collaboration with open source communities. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: 211 A-C
Jake Porway (DataKind), Daniella Perlroth (Lyra Health), Tim Hwang (ROFLCon / The Web Ecology Project), Lucy Bernholz (Stanford University)
Average rating: ***..
(3.83, 6 ratings)
So many of the data projects making headlines—from a new app for finding public services to a new probabilistic model for predicting weather patterns for subsistence farmers—are great accomplishments but don’t seem to have end users in mind. Discover how organizations are designing with, not for, people, accounting for what drives them in order to make long-lasting impact. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: 210 B/F
Nidhi Aggarwal (Tamr, Inc.)
Average rating: ***..
(3.00, 2 ratings)
Data scientists have career-making opportunities to use more diverse datasets to deliver bigger business returns. Nidhi Aggarwal demonstrates how Tamr, a machine-driven, human-guided approach to finding, integrating, and preparing data, enables new levels of insight into corporate spend compared with previous analytics tools—in one case identifying new savings opportunities worth more than $100M. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: 230 B
Tags: iot
Chris Rawles (Pivotal)
Average rating: ***..
(3.80, 10 ratings)
The Internet of Things (IoT) continues to provide value and hold promise for both the consumer and enterprise alike. To succeed, an IoT project must concern itself with how to ingest data, build actionable models, and react in real time. Chris Rawles describes approaches to addressing these concerns through a deep dive into an interactive demo centered around classification of human activities. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: LL20 B
Average rating: ***..
(3.00, 2 ratings)
An interactive panel, hosted by Dell's Armando Acosta, explores how business units have taken advantage of Hadoop's strengths to quickly identify and implement solutions that deal with massive amounts of data to deliver valuable results across the business. Read more.
11:50am–12:30pm Wednesday, 03/30/2016
Location: LL21 A
Wei Zheng (Trifacta), Mohan Sadashiva (Waterline Data), Mark Donsky (Okera)
Average rating: ***..
(3.00, 5 ratings)
Wei Zheng, Mohan Sadashiva, and Mark Donsky explain how data-wrangling tools not only enable users to work with a variety of new or complex sources of data in Hadoop but also ensure that the data lineage and metadata created through the process are appropriately catalogued and made available to others in the organization. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: LL20 A
John Berryman (Eventbrite)
Average rating: ****.
(4.00, 7 ratings)
At Eventbrite, users can serendipitously discover events they will love. But making this possible isn't easy. Events are short lived, and by the time Eventbrite can build an adequate collaborative-filtering model, the event is already over. John Berryman explains how Eventbrite overcomes these technical challenges with a combination of collaborative-filtering and content-based methods. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: 230 C
Tags: real-time
Todd Lipcon (Cloudera)
Average rating: ****.
(4.68, 19 ratings)
Todd Lipcon explores the tradeoffs between real-time transactional access and fast analytic performance from the perspective of storage-engine internals. Todd also outlines Kudu, the new addition to the open source Hadoop ecosystem that complements HDFS and HBase to provide a new option for achieving fast scans and fast random access from a single API. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: 210 C/G
Moty Fania (Intel)
Average rating: ***..
(3.20, 5 ratings)
Moty Fania shares Intel’s IT experience implementing an on-premises big data IoT platform for internal use cases. This unique platform was built on top of several open source technologies and enables highly scalable stream analytics with a stack of algorithms such as multisensor change detection, anomaly detection, and more. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: LL21 B
Pratik Verma (BlueTalon), Paulo Pereira (GE)
Average rating: ***..
(3.88, 8 ratings)
Pratik Verma and Paulo Pereira share three security architecture principles for Hadoop to protect sensitive data without disrupting users: modifying requests to filter content makes security transparent to users; centralizing data-access decisions and distributing enforcement makes security scalable; and using metadata instead of files or tables ensures systematic protection of sensitive data. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: 211 A-C
Mike Lee Williams (Cloudera Fast Forward Labs)
Average rating: ****.
(4.67, 6 ratings)
Machines are not objective, and big data is not fair. Michael Williams uses sentiment analysis to show that supervised machine learning has the potential to amplify the voices of the most privileged people in society, violate the spirit and letter of civil rights law, and make your product suck. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: 210 A/E
Tags: featured
Dean Wampler (Anyscale)
Average rating: ****.
(4.57, 23 ratings)
The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: LL21 C/D
Michael Dauber (Amplify Partners), Shivon Zilis (Bloomberg Beta), Cack Wilhelm (Scale Venture Partners), Roseanne Wincek (Institutional Venture Partners), Kristina Bergman (Ignition Partners)
Average rating: ***..
(3.87, 15 ratings)
In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road. Join us as Shivon Zilis, Cack Wilhelm, Michael Dauber, Kristina Bergman, and Roseanne Wincek talk about trends that everyone is seeing and areas for investment that they find exciting. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: 210 D/H
Joe Hellerstein (UC Berkeley), Vikram Sreekanti (Berkeley AMP Lab)
Average rating: ****.
(4.00, 7 ratings)
Metadata services are a critical missing piece of the current open source ecosystem for big data. Joe Hellerstein and Vikram Sreekanti give an overview of their vendor-neutral metadata services layer, Ground, through two reference use cases at UC Berkeley: genomics research driven by Spark and courseware using Jupyter Notebooks. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: LL21 E/F
Tags: ecommerce
Debora Seys (eBay)
Average rating: ***..
(3.00, 8 ratings)
Autofill, spellcheck, and turn-by-turn directions provide just-in-time suggestions. What if guiding users to accurate data were as simple? Debora Seys explains how eBay is delivering self-service analytics by moving from heavily engineered metadata systems to the new world of machine-learned guidance and asynchronous collaboration. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: LL20 D
Tags: featured
Average rating: ****.
(4.57, 23 ratings)
Data scientists inhabit such an ever-changing landscape of languages, packages, and frameworks that it can be easy to succumb to tool fatigue. If this sounds familiar, you may have missed the increasing popularity of Linux containers in the DevOps world, in particular Docker. Michelangelo D'Agostino demonstrates why Docker deserves a place in every data scientist’s toolkit. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: 230 A
Thomas Phelan (HPE BlueData), Joel Baxter (BlueData)
Average rating: ****.
(4.40, 10 ratings)
Thomas Phelan and Joel Baxter investigate the advantages and disadvantages of running specific Hadoop workloads in different infrastructure environments. Thomas and Joel then provide a set of rules to help users evaluate big data runtime environments and deployment options to determine which is best suited for a given application. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: LL20 C
Nick Turner (Markerstudy)
Average rating: ****.
(4.33, 3 ratings)
Nick Turner offers a case study of Markerstudy, an insurance and insurance-related-services company based in the UK that recreated its data platform around Hadoop. Dubbed the Big Data Insight project, the new platform features near real-time reporting and self-service exploration and has resulted in reduced claims costs, better fraud detection, and increased customer-retention rates. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: LL21 A
Emma McGrattan (Actian)
Average rating: ***..
(3.80, 5 ratings)
Hadoop can bring great value to businesses but also big headaches. Some solutions that provide SQL access to Hadoop data mean changing your business processes to overcome limitations in the technologies. Emma McGrattan explains how users can unlock tremendous business value through SQL-driven Hadoop solutions. Emma outlines what should be on your checklist and the pitfalls to avoid. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: LL20 B
Tags: real-time, iot
John Hugg (VoltDB)
Average rating: ****.
(4.00, 2 ratings)
In the race to pair streaming systems with stateful systems, the winners will be stateful systems that process streams natively. These systems remove the burden on application developers to be distributed systems experts and enable new applications to be both powerful and robust. John Hugg describes what’s possible when integrated systems apply a transactional approach to event processing. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: 230 B
Don Perigo (GE Power)
Average rating: ***..
(3.93, 15 ratings)
Applying big data to an internal business use case is challenging and requires expertise and focus. Even harder is scaling it out across a global enterprise. Don Perigo explains how GE Power Services has been able to deliver results in an uncertain world by leveraging big data and scaling its platform across a global employee base that spans over 25 countries. Read more.
1:50pm–2:30pm Wednesday, 03/30/2016
Location: 210 B/F
Keith Manthey (Dell EMC)
Average rating: ***..
(3.71, 7 ratings)
Many companies have created extremely powerful Hadoop use cases with highly valuable outcomes. The diverse adoption and application of Hadoop is producing an extremely robust ecosystem. However, teams often create silos around their Hadoop, forgetting some of the hard-learned lessons IT has gained over the years. Keith Manthey discusses one such often overlooked feature—governance. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 210 A/E
Sandy Ryza (Clover Health)
Average rating: ***..
(3.48, 27 ratings)
Want to build models over data every second from millions of sensors? Dig into the histories of millions of financial instruments? Sandy Ryza discusses the unique challenges of time series data and explains how to work with it at scale. Sandy then introduces the open source Spark-Timeseries library, which provides a natural way of munging, manipulating, and modeling time series data. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: LL21 B
Chao Sun (Cloudera), Alex Leblang (Cloudera)
Average rating: ***..
(3.40, 5 ratings)
Chao Sun and Alex Leblang explore RecordService, a new solution that provides an API to read data from Hadoop storage managers and return them as canonical records. This eliminates the need for components to support individual file formats, handle security, perform auditing, and implement sophisticated IO scheduling and other common processing that is at the bottom of any computation. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 210 C/G
Tags: iot
Brandon Rohrer (Microsoft)
Average rating: ****.
(4.00, 11 ratings)
Modern houses and robots have a lot in common. Both have a lot of sensors and have to make a lot of decisions. However, unlike houses, robots adapt and perform helpful tasks. Brandon Rohrer details an algorithm specifically designed to help houses, buildings, roads, and stores learn to actively help the people that use them. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 230 A
Tags: telecom
Average rating: ***..
(3.82, 11 ratings)
Phillip Radley explores how to use an “accumulation of marginal gains” approach to achieve success with an Apache Hadoop-based enterprise data hub (EDH), drawing on a set of design patterns built up over five years establishing BT’s EDH. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: LL21 C/D
Jin Zhang (CA Technologies), Jerry Overton (DXC), Michele Chambers (Continuum Analytics)
Average rating: ***..
(3.14, 7 ratings)
Data has become a hot career choice, but some fear that a career in data is highly stressful or simply boring. Jin Zhang, Jerry Overton, and Michele Chambers give an overview of the field and its various specializations with the hope that this understanding will eliminate any fear and empower attendees to pursue a career in data. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: LL20 A
Robert Grossman (University of Chicago)
Average rating: ***..
(3.86, 14 ratings)
There is a big difference between running a machine-learning algorithm manually from time to time and building a production system that runs thousands of machine-learning algorithms each day on petabytes of data, while also dealing with all the edge cases that arise. Robert Grossman discusses some of the lessons learned when building such a system and explores the tools that made the job easier. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: LL21 E/F
Vishal Bamba (Transamerica), Nitin Prabhu (Transamerica), Jeremy Beck, Amy Wang (H2O.ai)
Average rating: ***..
(3.80, 5 ratings)
Transamerica built a product recommendation system that can be leveraged across multiple distribution channels to recommend products, serve customer needs, and reduce complexity. Vishal Bamba, Nitin Prabhu, Jeremy Beck, and Amy Wang highlight the machine-learning technology, models, and architecture behind Transamerica's product recommendation platform. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: LL20 D
Tags: featured
Average rating: ****.
(4.78, 9 ratings)
BayesDB enables rapid prototyping and incremental refinement of statistical models by combining a model-independent declarative query language, BQL, with machine-assisted modeling and compositional models. Richard Tibbetts and Vikash Mansinghka explore the applications of BayesDB for analyzing and understanding developmental economics data in collaboration with the Gates Foundation. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 211 A-C
Louis Suarez-Potts (Age of Peers, Inc.)
Average rating: **...
(2.80, 5 ratings)
2015 saw an increased urgency in the ethics of big data, as the UN began to adopt civil-society partnerships with big data organizations. But what, if anything, are we supposed to do with the data we acquire, interpret, and label big data? Louis Suarez-Potts examines big data ethics to explain best practices for putting to use the information gained by big data methodology. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 230 C
Wes McKinney (Two Sigma Investments), Jacques Nadeau (Dremio)
Average rating: ****.
(4.07, 15 ratings)
Hadoop’s traditional batch technologies are quickly being supplanted by in-memory columnar execution to drive faster data-to-value. Wes McKinney and Jacques Nadeau provide an overview of in-memory columnar execution, survey key related technologies, including Kudu, Ibis, Impala, and Drill, and cover a sample use case using Ibis in conjunction with Apache Drill to deliver real-time conclusions. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 210 D/H
Tags: real-time
Calvin Jia (Alluxio), Jiri Simsa (Alluxio)
Not all storage resources are equal. Alluxio has developed Alluxio tiered storage to achieve highly efficient utilization of memory, SSDs, and HDDs that is completely transparent to computation frameworks and user applications. Calvin Jia and Jiri Simsa outline the features and use cases of Alluxio tiered storage. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: LL20 C
Christopher Nguyen (Arimo), Anh Trinh (Arimo, Inc.)
Average rating: ***..
(3.29, 7 ratings)
Most people think of data visualizations as charts or graphs with perhaps some interactivity. Christopher Nguyen and Anh Trinh present a new approach that considers visualizations to be first-class objects that also act as data sources and sinks. This enables powerful collaboration where thousands of users can build on the work of one another by sharing these visualization objects. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: LL20 B
Carlos Guestrin (Dato Inc.)
Average rating: ****.
(4.20, 5 ratings)
Machine learning is a hot topic. Recommenders, sentiment analysis, churn and click-through prediction, image recognition, and fraud detection are at the core of intelligent applications. However, developing these models is laborious. Carlos Guestrin shares a new approach to leverage massive amounts of data and applied machine learning at scale to create intelligent applications. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 210 B/F
Grega Kešpret (Celtra Inc.)
Average rating: ****.
(4.50, 2 ratings)
Celtra provides a platform for customers like Porsche and Fox to create, track, and analyze digital display advertising. Celtra's platform processes billions of ad events daily to give analysts fast and easy access to reports and ad hoc analytics. Grega Kešpret outlines Celtra’s data-pipeline challenges and explains how it solved them by combining Snowflake's cloud data warehouse with Spark. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: LL21 A
Wei Wang (Hortonworks), Scott Gnau (Hortonworks)
Average rating: ****.
(4.00, 1 rating)
Join Hortonworks to discuss transformational use cases from Hortonworks customers that manage data in motion and data at rest. Hortonworks's Wei Wang and Scott Gnau explore the modern data applications being built and deployed in 2016 that are driving new frontiers in information technology. Read more.
2:40pm–3:20pm Wednesday, 03/30/2016
Location: 230 B
Dave Wells (Paxata), Nenshad Bardoliwalla (Paxata), Travis Ringger (PwC), Conrad Mulcahy (K2 Intelligence)
Average rating: **...
(2.75, 8 ratings)
In a conversation moderated by Nenshad Bardoliwalla, analytic leaders Conrad Mulcahy, Travis Ringger, and Dave Wells share real-world data-preparation challenges and discuss new technologies, including Spark-powered machine learning, latent semantic indexing, statistical pattern recognition, and text analytics techniques, that accelerate the ability to transform data into usable information. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 210 C/G
Tags: real-time, ai
Alex Ingerman (Amazon Web Services)
Average rating: ***..
(3.62, 8 ratings)
Alex Ingerman explains how several AWS services, including Amazon Machine Learning, Amazon Kinesis, AWS Lambda, and Amazon Mechanical Turk, can be tied together to build a predictive application to power a real-time customer-service use case. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: LL21 C/D
Andreas Schmidt (Blue Yonder)
Average rating: ***..
(3.67, 6 ratings)
While many companies struggle to adopt big data, a number of industry leaders are leapfrogging big data adoption by going straight to automating core business processes. Andreas Schmidt presents examples from leading European companies that have overcome cultural, technical, and scientific challenges and unlocked the potential of big data in an entirely different way. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 210 D/H
Moderated by:
Derrick Harris (Mesosphere)
Panelists:
Rob Peglar (Micron Technology, Inc.), Milind Bhandarkar (Ampool, Inc.), Richard Probst (SAP), Todd Lipcon (Cloudera)
Average rating: ****.
(4.00, 5 ratings)
Years of research in nonvolatile memory systems are being productized and have started coming to market. These exciting new technologies promise lower power consumption and higher density for persistent storage. Will these hardware advances revolutionize the data ecosystem as we know it? This compelling panel of data-infrastructure thought leaders discusses the possibilities. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: LL20 D
Brandon Ballinger (Cardiogram), Johnson Hsieh (Cardiogram)
Average rating: ****.
(4.67, 9 ratings)
Each year, 15 million people suffer strokes, and at least a fifth of those are due to atrial fibrillation, the most common heart arrhythmia. Brandon Ballinger reports on a collaboration between UCSF cardiologists and ex-Google data scientists that detects atrial fibrillation with deep learning. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 230 A
John Omernik (MapR Technologies)
John Omernik walks attendees through Operation Ababil's 2013 DDoS attacks to understand how banks were able to implement controls to protect their networks. Using subject-matter experts, Hadoop, and low-friction access to data, members of the US banking industry were able to come up with new models to protect their networks from distributed denial-of-service attacks. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 210 A/E
Tags: real-time
Alex Silva (Pluralsight)
Average rating: ***..
(3.94, 16 ratings)
Alex Silva outlines the implementation of a real-time analytics platform using microservices and a Scala stack that includes Kafka, Spark Streaming, Spray, and Akka. This infrastructure can process vast amounts of streaming data, ranging from video events to clickstreams and logs. The result is a powerful real-time data pipeline capable of flexible data ingestion and fast analysis. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: LL21 E/F
Vida Ha (Databricks)
Average rating: ****.
(4.13, 23 ratings)
Apache Spark is a versatile big data processing framework, but just because you can program in SQL for Spark does not mean Spark is a database. For an optimal big data infrastructure, you may still need a distributed file system, databases (SQL or NoSQL), message queues, and specialized systems such as Elasticsearch. Vida Ha explains how to design architecture for different use cases. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 211 A-C
Tags: media, telecom
Jonathan King (Ericsson)
Average rating: ****.
(4.33, 6 ratings)
Jonathan King outlines ethical best practices for big data and explores the difficult questions emerging from missteps that have caused public outcry, as well as the legal, ethical, and regulatory frameworks that are just beginning to take shape around big data. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 230 C
Ted Dunning (MapR, now part of HPE)
Average rating: ****.
(4.12, 8 ratings)
SQL is normally a very static language that assumes a fixed and well-known schema. Apache Drill breaks these assumptions by restructuring the execution of queries so optimizations and type resolution can be done just in time. This has profound consequences for how applicable SQL is in the big data world. Ted Dunning walks attendees through Drill and explores its implications for big data. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: LL20 C
Irene Ros (Bocoup)
Average rating: ****.
(4.18, 11 ratings)
Data visualization is everywhere—it communicates meaningful data, finds insights through exploratory interfaces, and informs people through data-driven content. More and more, consumers expect to interact with the data, not just consume it. Irene Ros explains how to employ techniques from user-centered design to build better data-visualization interfaces. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: LL20 A
Erik Andrejko (The Climate Corporation)
Average rating: ****.
(4.50, 4 ratings)
Best practices from scientific research can significantly increase the pace and quality of data science projects. Erik Andrejko discusses the benefits and challenges of reproducibility and collaboration, including review and inter-team communication, for data science work at the Climate Corporation. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: LL21 A
Mok Choe (TD Bank Group), Paul Barth (Podium Data)
Average rating: ***..
(3.67, 9 ratings)
Learn how TD Bank is creating the bank of the future through IT 3.0. Central to this is business agility, fueled by secure, self-service access to enterprise and market data. Mok Choe and Paul Barth detail the fundamentals for success in this transformation, which started with rapid consolidation of hundreds of data sources onto a Hadoop enterprise data provisioning platform. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 230 B
Bob Hansen (HPE)
Average rating: ***..
(3.00, 1 rating)
Bob Hansen outlines the latest innovations from HPE for SQL on Hadoop. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: LL20 B
Amit Walia (Informatica), Badhrinath Krishnamoorthy (Cognizant)
Average rating: **...
(2.50, 2 ratings)
Amit Walia, chief product officer of Informatica, hosts a discussion with industry experts on how big data management can enable organizations to deliver faster, more flexible, and more repeatable big data projects while ensuring security and governance. Learn how organizations are using big data management to be more successful with their big data initiatives. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: LL21 B
Tags: real-time
Yinglian Xie (DataVisor)
Average rating: **...
(2.88, 8 ratings)
Yinglian Xie describes the anatomy of modern online services, where large armies of malicious accounts hide among legitimate users and conduct a variety of attacks. Yinglian demonstrates how the Spark framework can facilitate early detection of these types of attacks by analyzing billions of user actions. Read more.
4:20pm–5:00pm Wednesday, 03/30/2016
Location: 210 B/F
Patrick Hall (SAS), Paul Kent (SAS)
Average rating: ***..
(3.60, 10 ratings)
Although it’s been around for decades, machine learning is currently thriving, and organizations are looking to benefit from it. Patrick Hall and Paul Kent offer 10 crucial tips to know before venturing into the mix—a personal survival guide from the creators of a solution that was there in the beginning and continues to drive the industry today. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 211 A-C
Tags: iot
Alysa Z. Hutnik (Kelley Drye & Warren LLP), Kristi Wolff (Kelley Drye & Warren LLP)
Average rating: ****.
(4.17, 6 ratings)
In the current explosion of the Internet of Things, big data, and mobile, compliance often takes a back seat. But the failure to address legal privacy and consumer-protection considerations has landed many in hot water, resulting in potential legal settlements and business failures. Alysa Hutnik and Kristi Wolff discuss flash points and proactive strategies to avoid becoming a target. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 210 A/E
Holden Karau (Independent)
Average rating: ****.
(4.11, 19 ratings)
Apache Spark is a fast, general engine for big data processing. As Spark jobs are used for more mission-critical tasks, it is important to have effective tools for testing and validation. Holden Karau details reasonable validation rules for production jobs and best practices for creating effective tests, as well as options for generating test data. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 230 C
Tags: real-time
Jean-Marc Spaggiari (Cloudera), Kevin O'Dell (Rocana)
Average rating: ****.
(4.20, 5 ratings)
Most already know HBase, but many don't know that it can be coupled with other tools from the ecosystem to increase efficiency. Jean-Marc Spaggiari and Kevin O'Dell walk attendees through some real-life HBase use cases and demonstrate how they have been efficiently implemented. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: LL21 C/D
Benedikt Koehler (DataLion)
Average rating: ****.
(4.00, 5 ratings)
Benedikt Koehler offers approaches to analyzing and visualizing bitcoin data—accessing and downloading the blockchain, transforming the data into a networked data format, identifying hubs and clusters, and visualizing the results as dynamic network graphs—so that typical patterns and anomalies can quickly be identified. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 210 C/G
Tags: real-time
Todd Palino (LinkedIn), Gwen Shapira (Confluent)
Average rating: ****.
(4.62, 13 ratings)
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira explore how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: LL21 B
Don Bosco Durai (Privacera)
Average rating: ****.
(4.50, 10 ratings)
Bosco Durai offers a top-down view of security in the Hadoop ecosystem. Bosco explores the right way to protect your data based on your enterprise's security requirements, as he covers the available mechanisms to achieve your information security goals. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: LL20 C
Sebastien Pierre (FFunction)
Average rating: ****.
(4.50, 6 ratings)
Big data is great for feeding ML algorithms, but you quickly face a bandwidth issue when interfacing with humans. The brain is a fantastic information-processing machine and has an unparalleled, innate ability to detect patterns. Sébastien Pierre explains what designers can teach engineers about creating new ways to make large volumes of data understandable at the human level. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 210 D/H
Tags: real-time
Ted Dunning (MapR, now part of HPE)
Average rating: ****.
(4.11, 9 ratings)
Until recently, batch processing has been the standard model for big data. Today, many have shifted to streaming architectures that offer large benefits in simplicity and robustness, but this isn't your father’s complex event processing. Ted Dunning explores the key design techniques used in modern systems, including percolators, replayable queues, state-point queuing, and microarchitectures. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: LL21 E/F
Aaron Kalb (Alation)
Average rating: ****.
(4.50, 4 ratings)
A data catalog provides context to help data analysts, data scientists, and other data consumers (including those with little technical background) find a relevant dataset, determine if it can be trusted, understand what it means, and utilize it to make better products and better decisions. Aaron Kalb explores how enterprises build interfaces that make sourcing data as easy as shopping on Amazon. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: LL20 D
Josh Patterson (Patterson Consulting), Dave Kale (Skymind), Zachary Lipton (University of California, San Diego)
Average rating: ****.
(4.00, 11 ratings)
Time series data is increasingly ubiquitous with both the adoption of electronic health record (EHR) systems in hospitals and clinics and the proliferation of wearable sensors. Josh Patterson, David Kale, and Zachary Lipton bring the open source deep learning library DL4J to bear on the challenge of analyzing clinical time series using recurrent neural networks (RNNs). Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: LL20 A
Moderated by:
Michael Dauber (Amplify Partners)
Panelists:
Yael Garten (LinkedIn), Monica Rogati (Data Natives), Daniel Tunkelang (Various)
Average rating: ****.
(4.14, 7 ratings)
We’ve all heard that rare breed the data scientist described as a unicorn. In building your DS team, should you hold out for that unicorn or create groups of specialists who can work together? Michael Dauber, Yael Garten, Monica Rogati, and Daniel Tunkelang discuss the pros and cons of various team models to help you decide what works best for your particular situation and organization. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 230 A
Steven Totman (Cloudera), Nick Curcuru (Mastercard), Robert Bagley (ClickFox), Lori Bieda (Bank of Montreal)
Average rating: ***..
(3.75, 8 ratings)
In a panel discussion, Cloudera's Steve Totman talks about the practicalities and realities of big data-based customer 360 with big data experts Lori Bieda, Nick Curcuru, and Robert Bagley. Attend if you have challenges implementing big data-based customer 360 or just want to learn from the panel's real-world experiences. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 230 B
Sandy Steier (1010data), Dennis Gleeson (1010data)
Average rating: ****.
(4.00, 1 rating)
Sandy Steier and Dennis Gleeson explain how the promise of easy data sharing and collaborative analysis—on petabyte-scale data—can fundamentally change business culture in the same way that the Internet has changed our consumer culture. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: 210 B/F
Tags: real-time
Siva Raghupathy (Amazon Web Services), Manjeet Chayel (Amazon Web Services)
Average rating: ****.
(4.50, 6 ratings)
Analyzing real-time streams of data is becoming increasingly important to remain competitive. Siva Raghupathy and Manjeet Chayel guide attendees through some of the proven architectures for processing streaming data using a combination of cloud and open source tools such as Apache Spark. Watch a live demo and learn how you can easily scale your applications with Amazon Web Services. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: LL20 B
Partha Seetala (Robin Systems)
Average rating: ***..
(3.50, 2 ratings)
Containers have taken the world by storm by radically transforming the way applications are built and deployed. But many fail to appreciate how powerful containers can be for performance-sensitive data applications. Partha Seetala explains how containers can help you "virtualize" your mission-critical enterprise applications, simplify application life cycles, and increase data-center efficiency. Read more.
5:10pm–5:50pm Wednesday, 03/30/2016
Location: LL21 A
Sudipto Dasgupta (Infosys Limited), Ganesan Pandurangan (Infosys Limited)
Sudipto Dasgupta and Ganesan Pandurangan offer a case study of a large multinational imaging and electronics company that migrated accounts receivable reports to the Hadoop-based open source Infosys Information Platform, which implemented dynamic age bucketing capabilities and reduced the number of end-user views from over 400 to 50. Read more.

Thursday, March 31

11:00am–11:40am Thursday, 03/31/2016
Location: LL21 C/D
Adam Sugano (Autodesk)
Average rating: ***..
(3.83, 6 ratings)
Autodesk's transition to a subscription business model has caused the company to rethink how it interacts with and engages its customers. Adam Sugano details how, in a short period of time, Autodesk has executed numerous data science projects that have enhanced its capabilities to acquire, retain, and provide more value to its customers. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: 210 C/G
Tags: real-time
Ted Malaska (Capital One), Jeff Holoman (Cloudera)
Average rating: ****.
(4.50, 10 ratings)
Ted Malaska and Jeff Holoman explain how to go from zero to full-on time series and mutable-profile systems in 40 minutes. Ted and Jeff cover code examples of ingestion from Kafka and Spark Streaming and access through SQL, Spark, and Spark SQL to explore the underlying theories and design patterns that will be common for most solutions with Kudu. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: 210 D/H
Tags: real-time
Costin Leau (Elastic)
Average rating: ****.
(4.15, 13 ratings)
Costin Leau offers an overview of Elastic’s current efforts to enhance Elasticsearch's existing integration with Spark, going beyond Spark core and Spark SQL by focusing on text processing and machine learning to allow data processing and tokenizing to be combined with Spark's MLlib algorithms. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: 230 A
Chang She (Cloudera)
Average rating: ***..
(3.00, 2 ratings)
Many third-party apps are built on top of the Hadoop platform for data ingest, ETL, analytics, and predictive modeling. These services/apps need a data-governance layer for security and compliance, but it is often burdensome for each individual app to build its own. Chang She describes the challenges in building an extensible metadata layer that serves common governance needs for Hadoop. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: LL21 E/F
Tags: media
Daniel Weeks (Netflix)
Average rating: ****.
(4.56, 27 ratings)
Netflix is exploring new avenues for data processing where traditional approaches fail to scale. Daniel Weeks explains how Netflix has enhanced its 25+ petabyte warehouse by combining Parquet's features with Presto and Spark to boost both ETL and interactive queries. Daniel explores how these approaches offer new ways to look at the relationship between storage and compute. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: LL20 D
Rohit Jain (Esgyn)
Average rating: ****.
(4.50, 2 ratings)
Companies are looking for a single database engine that can address all their varied needs—from transactional to analytical workloads, against structured, semistructured, and unstructured data, leveraging graph databases, document stores, text search engines, column stores, key value stores, and wide column stores. Rohit Jain discusses the challenges one faces on the path to this nirvana. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: LL21 B
Donald Miner (Miner & Kasch)
Average rating: ****.
(4.50, 10 ratings)
Figuring out Hadoop is daunting. However, understanding a set of basic yet important principles is all you need to cut through the hype and make intelligent enterprise decisions. Donald Miner breaks down modern Hadoop into 10 important principles you need to know to understand what Hadoop is and how it is different from the old way of doing things. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: 230 C
Abin Shahab (Altiscale)
Abin Shahab walks attendees through Altiscale's Docker deployment strategy, describes the design decisions behind it, and discusses the issues encountered and fixed along the way. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: LL20 A
Chi-Yi Kuan (LinkedIn), Weidong Zhang (LinkedIn), Tiger Zhang (LinkedIn)
Average rating: ****.
(4.29, 24 ratings)
Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang explain how LinkedIn has built a "voice of member" platform to analyze hundreds of millions of text documents. Chi-Yi, Weidong, and Yongzheng illustrate the critical components of this platform and showcase how LinkedIn leverages it to derive insights such as customer value propositions from an enormous amount of unstructured data. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: LL20 C
Noah Iliinsky (Amazon Web Services)
Average rating: ***..
(3.83, 18 ratings)
Noah Iliinsky surveys the state of visualization, outlines the major trends in the field, and explores the directions that visualization is headed. Noah also dives into the assorted tool domains—from enterprise to desktop to code-based—and discusses the pros and cons and use cases of each. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: LL21 A
Kevin Goode (Inmar)
Inmar handles 3.7 billion transactions annually. Kevin Goode explains Inmar's transformation, starting in 2012, from a business-services company to a data-driven enterprise using Hadoop. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: LL20 B
Average rating: ****.
(4.00, 5 ratings)
Did you know Apache Spark is helping transform industries, companies, and your everyday life? David Taieb and Mythili Venkatakrishnan demonstrate two use cases of how Apache Spark is being used to harness valuable insights from complex data across cloud and hybrid environments. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: 230 B
Bob Rogers (Intel)
Average rating: *****
(5.00, 1 rating)
Join Bob Rogers, Intel’s chief data scientist for big data solutions, and special guests to see how Intel’s open source Trusted Analytics Platform has accelerated and simplified the development of powerful analytics that are changing the game. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: 210 A/E
Michael Armbrust (Databricks)
Average rating: *****
(5.00, 11 ratings)
Michael Armbrust explores real-time analytics with Spark from interactive queries to streaming. Read more.
11:00am–11:40am Thursday, 03/31/2016
Location: 210 B/F
Tags: real-time
Steve Wooledge (MapR Technologies)
Average rating: ***..
(3.67, 3 ratings)
In order to remain competitive, you need to be able to respond to changing conditions in the moment. New stream-based technologies allow you to build applications that incorporate low-latency processing so you can stream data immediately or whenever you’re ready. Steve Wooledge explores how new streaming technologies make this approach work and how they can be applied in many industries. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: LL21 E/F
Tags: education
Roshan Sumbaly (Facebook), Pierre Barthelemy (Coursera)
Average rating: ***..
(3.90, 10 ratings)
Coursera's platform allows 15 million learners to take courses from the best universities. Roshan Sumbaly and Pierre Barthelemy outline the pieces of Coursera's data infrastructure (streaming, data warehouse) that support its growing semi- and unstructured data requirements and explain how this ecosystem allows Coursera to build various instructor- and learner-side data products. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: LL21 C/D
Jeffrey Shmain (Cloudera), Mohammad Quraishi (Cigna)
Average rating: ****.
(4.67, 3 ratings)
How do you implement Apache Hadoop in a large healthcare company with a mature data-analysis infrastructure? Jeffrey Shmain and Mohammad Quraishi describe Cigna's journey toward big data and Hadoop, including an overview of new Hadoop capabilities like heterogeneous data integration and large-scale machine learning. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: 230 A
Tags: telecom
Amy O'Connor (Cloudera)
Average rating: ***..
(3.89, 9 ratings)
Telcos are graduating from exploring Hadoop’s technical capabilities to implementing full-blown, multiworkload data hubs at the heart of their operations. The world’s leading telcos are delivering compelling results for strategic use cases that leverage big data solutions. Amy O'Connor explores three key case studies that showcase these successes. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: 210 D/H
Tags: real-time
Joey Echeverria (Rocana)
Average rating: *****
(5.00, 2 ratings)
Real-time analysis starts with transforming raw data into structured records. Typically this is done with bespoke business logic custom written for each use case. Joey Echeverria presents a configuration-based, reusable library for data transformation that can be embedded in real-time stream-processing systems and demonstrates its real-world use cases with Apache Kafka and Apache Hadoop. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: LL20 D
Prat Moghe (Cazena)
Average rating: ***..
(3.86, 7 ratings)
Many big data projects work in the lab yet never make it to full-scale production. Lengthy deployments and expertise shortages hinder enterprise adoption. Cloud deployments help but create challenges with security, integration, and costs. Prat Moghe outlines best practices for leveraging the modern big data stack and public cloud infrastructure while maintaining enterprise-grade standards. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: 210 C/G
Tags: real-time
Guozhang Wang (Confluent)
Average rating: ***..
(3.80, 5 ratings)
You may have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center. But what if one data center is not enough? Guozhang Wang offers an overview of best practices for multi-data-center deployments, architecture guidelines for data replication, and disaster scenarios. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: LL21 B
Tags: health, iot, energy
David Beyer (Amplify Partners)
Average rating: ***..
(3.82, 11 ratings)
Over the past decade, machine learning has become intertwined with newer, Internet-born businesses. This despite the fact that the vast majority of global GDP turns on larger, less visible industries like energy and construction. David Beyer explores the ways these backbone industries are adopting machine-intelligent applications and the trends underlying this shift. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: 230 C
Sumeet Singh (Yahoo), Mridul Jain (Yahoo)
Average rating: ***..
(3.80, 5 ratings)
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: LL20 A
Travis Oliphant (Continuum Analytics)
Average rating: ****.
(4.19, 21 ratings)
Despite Python's popularity throughout the data-engineering and data science workflow, the principles behind its performance and scaling behavior are less understood. Travis Oliphant explains best practices and modern tools to scale Python to larger-than-memory and distributed workloads without sacrificing its ease of use or being forced to adopt heavyweight frameworks. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: 210 A/E
Tags: real-time
Tathagata Das (Databricks)
Average rating: ****.
(4.54, 13 ratings)
Tathagata Das introduces Streaming DataFrames, the next evolution of Spark Streaming. Streaming DataFrames unifies an additional dimension: interactive analysis. In addition, it provides enhanced support for out-of-order (delayed) data, zero-latency decision making, and integration with existing enterprise data warehouses. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: LL20 C
Jeremy Howard (fast.ai | USF | doc.ai and platform.ai)
Average rating: ****.
(4.00, 7 ratings)
In his 20+ years of applying machine learning and data analysis to a wide range of industries, Jeremy Howard never felt that his work really changed anyone's life in a deep and positive way, so he spent a year researching ways he might effect real change. Jeremy outlines the impact that deep learning is going to make on the world and explains how you too can make a difference. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: 210 B/F
Peter Prettenhofer (DataRobot), Owen Zhang (DataRobot)
Average rating: ****.
(4.44, 9 ratings)
Effective and efficient model selection and tuning is crucial for building machine-learning systems, but large-scale machine-learning problems require us to rethink the model-selection and tuning process. Peter Prettenhofer and Owen Zhang outline the tradeoffs we need to make and demonstrate how to efficiently search and tune complex machine-learning pipelines in MLlib. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: LL20 B
Tags: iot, streaming
Pat McGarry (Ryft)
Average rating: ****.
(4.50, 2 ratings)
High-velocity, high-volume, and high-variety data streams challenge analytics organizations because the ability to get critical insights often decays rapidly. Pat McGarry explains how organizations that embrace heterogeneous computing techniques can overcome hurdles to real-time insights, thereby gaining significant competitive advantages. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: LL21 A
Ben Sharma (Zaloni)
Average rating: ****.
(4.00, 12 ratings)
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started. Read more.
11:50am–12:30pm Thursday, 03/31/2016
Location: 230 B
Chuck Yarbrough (Pentaho), Mark Burnette (Pentaho, a Hitachi Group Company)
Average rating: **...
(2.75, 4 ratings)
A major challenge in today’s world of big data is getting data into the data lake in a simple, automated way. Coding scripts for disparate sources is time consuming and difficult to manage. Developers need a process that supports disparate sources by detecting and passing metadata automatically. Chuck Yarbrough and Mark Burnette explain how to simplify and automate your data ingestion process. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: LL21 C/D
Linus Liang (Embrace), Brad Allen (Silicon Valley Data Science)
Average rating: ****.
(4.29, 7 ratings)
Linus Liang and Brad Allen explain how big data is helping Embrace save millions of babies around the world. Embrace invented the world's most affordable infant incubator, but the data it collects—from the hardest to reach and most rural parts of the world—will actually save more lives than the device will. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: LL20 D
Tags: media
Roopa Tangirala (Netflix)
Average rating: ****.
(4.16, 19 ratings)
Roopa Tangirala details Netflix's migration from Oracle to Cassandra, covering the problems encountered, what worked and what didn't, and lessons learned along the way. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: LL20 A
Marcel Kornacker (Cloudera), Alexander Behm (Cloudera)
Average rating: ***..
(3.86, 7 ratings)
Marcel Kornacker explains how to use nested data structures to increase analytic productivity. Marcel uses the well-known TPC-H schema to demonstrate how to simplify analytic workloads with nested schemas. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: 210 D/H
Ilya Ganelin (Capital One Data Innovation Lab)
Average rating: ***..
(3.33, 6 ratings)
What if we have reached the point where open source can handle massively difficult streaming problems with enterprise-grade durability? Ilya Ganelin presents Capital One’s novel solution for real-time decisioning on Apache Apex. Ilya shows how Apex provides unique capabilities that ensure less than 2 ms latency in an enterprise-grade solution on Hadoop. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: LL21 B
Scott Donaldson (FINRA), Matt Cardillo (FINRA)
Average rating: ****.
(4.25, 4 ratings)
Scott Donaldson and Matt Cardillo detail the security measures and system architecture needed to bring alive a multipetabyte data warehouse via interactive analytics and directed graphs from several trillions of market events, using HBase, EMR, Hive, Redshift, and S3 technologies in a cost-efficient manner. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: 210 A/E
Neelesh Salian (Stitch Fix)
Average rating: **...
(2.93, 14 ratings)
Spark has been growing in deployments for the past year. Neelesh Srinivas Salian explores common issues observed in a cluster environment set up with Apache Spark and offers guidelines to help set up a real-world environment when planning an Apache Spark deployment in a cluster. Attendees can use these observations to improve the usability and supportability of Apache Spark in their projects. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: 230 A
Silvia Oliveros (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers)
Average rating: ***..
(3.58, 12 ratings)
You have your Hadoop cluster, and you are ready to fill it up with data. But wait! Which format should you use to store your data? Should you store it in plain text, SequenceFile, Avro, or Parquet? (And should you compress it?) Silvia Oliveros and Stephen O'Sullivan cover the hows, whys, and whens of choosing one format over another and take a closer look at some of the tradeoffs each offers. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: LL21 E/F
Joe Hellerstein (UC Berkeley), Seshadri Mahalingam (Trifacta)
Average rating: ****.
(4.00, 3 ratings)
Seshadri Mahalingam and Joe Hellerstein discuss Photon, a high-performance data-transformation engine that provides immediacy to the data-wrangling experience, and demonstrate how to make the most of modern processors from both the browser and the desktop, with a focus on issues specific to the variety of big raw data, including heavy string manipulation and statistical data profiling. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: 210 C/G
Sean Murphy (PingThings)
Average rating: ***..
(3.20, 5 ratings)
Sean Murphy demonstrates how and why the power grid and other legacy industrials built on traditional engineering will be transformed from deterministic machines described by mathematical equations to probabilistic systems requiring streaming data and analytics. Sean demonstrates how to take an agile approach to the scientific method with big data and fuse the two approaches. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: LL20 C
Aneesh Karve (Quilt)
Average rating: ****.
(4.62, 8 ratings)
Seemingly harmless choices in visualization, design, and content selection can distort your data and lead to false conclusions. Aneesh Karve presents a framework for identifying and overcoming these distortions by drawing upon research in human perception, focus and context, and mobile design. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: LL21 A
Tags: real-time
TJ Potter (Lucidworks)
Average rating: *****
(5.00, 2 ratings)
Solr has been adopted by all major Hadoop platform vendors as the de facto standard for big data search. Timothy Potter introduces an open source project that exposes Solr as a SparkSQL datasource. Timothy offers common use cases, access to open source code, and performance metrics to help you develop your own large-scale search and discovery solution. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: LL20 B
Kaz Sato (Google), Amy Unruh (Google)
Average rating: ****.
(4.21, 14 ratings)
Kazunori Sato and Amy Unruh explore how you can use TensorFlow to drive large-scale distributed machine learning against your analytic data sitting in Google BigQuery, with data preprocessing driven by Dataflow (now Apache Beam). Kazunori and Amy dive into practical examples of how these technologies can work together to enable a powerful workflow for distributed machine learning. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: 210 B/F
Average rating: **...
(2.00, 1 rating)
The Defense Advanced Research Projects Agency (DARPA) is synonymous with transformational change, developing the seemingly impossible into the practical. Matthew van Adelsberg demonstrates how collaborative teams of SMEs, data scientists, and engineers have been organized to achieve “DARPA hard” results for nearly a decade and offers insights into how companies can do the same. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: 230 B
Tags: real-time
Matt Olson (CenturyLink)
Software-defined networking (SDN) and network functions virtualization (NFV) hold tremendous potential to enable efficiency and flexibility in service delivery, but SDN/NFV environments are also highly complex and multilayered. Matt Olson explains why effective support for SDN/NFV services requires leveraging the tremendous amount of service and data streaming from the platform. Read more.
1:50pm–2:30pm Thursday, 03/31/2016
Location: 230 C
Tags: real-time, iot
Karthik Ramasamy (Twitter)
Average rating: *****
(5.00, 1 rating)
Heron, Twitter's streaming system, has been in production nearly two years and is widely used by several teams for diverse use cases. Karthik Ramasamy discusses Twitter's operating experiences and shares the challenges of running Heron at scale as well as the approaches that Twitter took to solve them. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: 230 A
Spencer Kimball (Cockroach Labs)
Average rating: ***..
(3.50, 2 ratings)
Often without realizing it, companies spend significant resources engineering new databases. The need to combine traditional relational datasets with new operational and historical data leads to sharded RDBMS or hybridized RDBMS and NoSQL systems, typically leaving few of the constituent database guarantees intact. Spencer Kimball introduces CockroachDB, an open source, scale-out SQL database. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: 210 C/G
Sijie Guo (StreamNative)
Average rating: ***..
(3.50, 2 ratings)
DistributedLog is a high-performance replicated log service built on top of Apache BookKeeper that is the foundation of publish-subscribe at Twitter, serving traffic from transactional databases to real-time data analytic pipelines. Sijie Guo offers an overview of DistributedLog, detailing the technical decisions and challenges behind its creation and how it is used at Twitter. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: LL21 E/F
Tags: real-time
Fangjin Yang (Imply)
Average rating: ***..
(3.25, 4 ratings)
Running distributed systems in production can be tremendously challenging. Fangjin Yang covers common problems and failures with distributed systems and discusses design patterns that can be used to maintain data integrity and availability when everything goes wrong. Fangjin uses Druid as a real-world case study of how these patterns are implemented in an open source technology. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: LL20 A
Tags: science
Siddha Ganju (NVIDIA)
Average rating: ***..
(3.64, 11 ratings)
Siddha Ganju explains how CERN uses machine-learning models to predict which datasets will become popular over time. This helps to replicate the datasets that are most heavily accessed, which improves the efficiency of physics analysis in CMS. Analyzing this data leads to useful information about the physical processes. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: LL21 C/D
Tags: media
Christopher Berry (Canadian Broadcasting Corporation)
Average rating: ****.
(4.00, 4 ratings)
The Canadian Broadcasting Corporation broadcasts a lot of digital content. And Canadians create a huge amount of data about that content. So how does a public broadcaster, of all entities, broadcast its data exhaust? Christopher Berry details the CBC's early experiments with importing a variant of the lean startup into a 79-year-old institution. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: 210 D/H
Tags: real-time
Kostas Tzoumas (data Artisans)
Average rating: ****.
(4.41, 17 ratings)
Apache Flink is a full-featured streaming framework with high throughput, millisecond latency, strong consistency, support for out-of-order streams, and support for batch as a special case of streaming. Kostas Tzoumas gives an overview of Flink and its streaming-first philosophy, as well as the project roadmap and vision: fully unifying the worlds of “batch” and “streaming” analytics. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: 230 C
Brian Clark (Objectivity), Marco Ippolito (CGG GeoSoftware)
Average rating: *....
(1.00, 1 rating)
Oil and gas organizations are at the forefront of big data, adopting technologies such as Hadoop and Spark to develop next-generation fusion systems. Brian Clark and Marco Ippolito introduce a case study from CGG, a builder of common data models to drive analytics of sensor data and associated metadata from fast-changing big data streams, to show how to derive richer value from big data assets. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: LL21 B
Tags: real-time
Yann Landrin-Schweitzer (Autodesk), Charlie Crocker (Autodesk)
Average rating: ***..
(3.67, 6 ratings)
Autodesk's next-gen analytics pipeline, based on SDKs, Kafka, Spark, and containers, will solve the problems of platform and product fragmentation, instrumentation quality, and ease of access to analytics. Yann Landrin-Schweitzer and Charlie Crocker explore the features that will enable teams to build reliable, high-quality usage analytics for Autodesk's products, autonomously and in mere minutes. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: LL20 D
Jacques Nadeau (Dremio)
Average rating: ****.
(4.55, 22 ratings)
There are (too?) many options for BI on Hadoop. Some are great at exploration, some are great at OLAP, some are fast, and some are flexible. Understanding the options and how they work with Hadoop systems is a key challenge for many organizations. Jacques Nadeau provides a survey of the main options, both traditional (Tableau, Qlik, etc.) and new (Platfora, Datameer, etc.). Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: LL20 C
Tags: travel
Bill Hinderman (Vaystays)
Average rating: ***..
(3.54, 13 ratings)
With more than 1.4 billion smartphones and at least half that many tablets in use, there is a tremendous need for responsive web design in the data-visualization sphere. Bill Hinderman explains the principles of responsive data visualization, which allows you to respond to screen conditions as well as data conditions. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: 210 A/E
Kelvin Chu (Uber), Evan Richards (Uber)
Average rating: ****.
(4.07, 15 ratings)
Schema plays a key role in the Hadoop architecture at Uber. Kelvin Chu and Evan Richards explain why schema is important and how it can make your Hadoop and Spark application more reliable and efficient. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: LL21 A
Joe Goldberg (BMC Software)
Joseph Goldberg discusses the attributes required of a batch management platform that can accelerate development by enabling programmers to generate workflows as code, support continuous deployment with rich APIs and lightweight workflow-scheduling infrastructure, and optimize production with comprehensive enterprise operational capabilities like SLA management and full log and output management. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: 230 B
Jeff Pohlmann (Oracle)
Jeff Pohlmann explores the skills, challenges, and solutions necessary to turn big data into big results. Learn more effective ways to increase productivity and decrease costs, aid in the allocation of key personnel and resources, better determine the true sentiment of customers, determine the impact of changing processes on production, and help solve a host of other needs. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: 210 B/F
Martin Yip (VMware), Justin Murray (VMware)
Average rating: ***..
(3.00, 2 ratings)
Martin Yip and Justin Murray explore the benefits of virtualization of Hadoop on vSphere and delve into three different examples of real-world deployments—at small, medium, and large scales—to demonstrate how enterprises are currently deploying Hadoop differently on virtual machines. Read more.
2:40pm–3:20pm Thursday, 03/31/2016
Location: LL20 B
Tags: real-time
Average rating: ****.
(4.00, 1 rating)
Join the SAP team for a demonstration of how OLAP on Hadoop and real-time query federation help unify enterprise and big data, using SAP's new big data solution, SAP HANA Vora. Amit Satoor and Balalji Krishna explore real-world use cases where instant insights from a combination of operational and Hadoop data impact core business operations. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: 210 D/H
Tags: real-time
Jim Scott (NVIDIA)
Average rating: ****.
(4.67, 3 ratings)
The Zeta Architecture is an enterprise architecture to move beyond the data lake. The most logical way to scale applications across tiers is to put a messaging platform in between the tiers, which makes it far simpler to scale communication between applications. Jim Scott covers the benefits of this model and offers an example of data-center monitoring. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: LL21 C/D
Tags: telecom
John Belchamber (Telefónica), Arturo Canales (Telefónica)
Average rating: ***..
(3.00, 6 ratings)
Increasing competition and technological change are impelling the telco industry toward a new model of analytics. Telefónica has been at the front of this change, driving business transformation to a digital telco. John Belchamber and Arturo Canales tell the story of that transformation and detail the pitfalls and challenges faced by teams looking to follow a similar journey. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: 210 C/G
Tags: real-time
Tony Ng (WeWork)
Average rating: ****.
(4.11, 9 ratings)
Enterprises are increasingly demanding real-time analytics and insights. Tony Ng offers an overview of Pulsar, an open source real-time streaming system used at eBay, which can scale to millions of events per second with 4GL SQL-like language support. Tony explains how Pulsar integrates Kafka, Kylin, and Druid to provide flexibility and scalability in event and metrics consumption. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: LL20 A
Scott Draves (Two Sigma Open Source)
Average rating: ***..
(3.40, 5 ratings)
Scott Draves gives an overview of the Beaker notebook, a new open source tool for data scientists. Beaker was designed to be polyglot: a single notebook may contain cells from multiple languages that communicate with one another through a unique feature called autotranslation. Scott discusses motivations for the design, reviews the architecture, and gives a demo of Beaker in action. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: 210 A/E
Tags: health
Timothy Danford (Tamr, Inc.)
Average rating: ****.
(4.57, 7 ratings)
To keep up with the DNA-sequencing-technology revolution, bioinformaticians need more-scalable tools for genomics analysis. Timothy Danford outlines one possible solution in a case study of a cancer genomics analysis pipeline implemented as part of the open source genomics software project, ADAM, which uses Apache Spark-generated abstractions executed on commodity computing infrastructure. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: LL21 E/F
Tags: featured
Joseph Turian (Workday), Alex Nisnevich (Bayes Impact)
Average rating: ****.
(4.78, 9 ratings)
Next-gen UIs will allow people to use plain English to interact with software. However, current published research focuses on abstract understanding, not on translating English into concrete software actions. Joseph Turian and Alex Nisnevich outline UPSHOT's English-to-SQL semantic parser and demonstrate how to build your own English-to-“your software application” parser. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: LL21 B
Sreeni Iyer (Quad Analytix), Anurag Bhardwaj (Quad Analytix)
Average rating: *****
(5.00, 6 ratings)
Typically, 8–10% of product URLs in ecommerce sites are misclassified. Sreeni Iyer and Anurag Bhardwaj discuss a machine-learning-based solution that relies on an innovative fusion of classifiers that are both text- and image-based, along with human touch to handle edge cases, to automatically classify product URLs according to a canonical taxonomic organization with a high F-score. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: 230 A
Krishnan Venkata (LatentView Analytics), Jose Abelenda (Hotwire)
Average rating: ***..
(3.00, 1 rating)
While organizations understand the importance of customer satisfaction, quantifying its impact on future engagement is a surprisingly hard analytical problem (most rely on Net Promoter Scores). Krishnan Venkata and Jose Abelenda explain how Hotwire used big data to put a dollar figure on promoter/detractor behavior to help the organization objectively prioritize customer-engagement initiatives. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: 230 C
Average rating: ****.
(4.80, 5 ratings)
Traditional data-warehousing techniques are sometimes limited by the scalability of the implementation tools themselves. Arun Thangamani explains how the advanced architectural approaches taken by tools like Apache Phoenix and HBase allow new, highly scalable live-analytics solutions using the same traditional techniques and showcases a successful implementation at CDK. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: LL20 D
Bill Loconzolo (Intuit)
Average rating: ****.
(4.33, 12 ratings)
Data initiatives are often approached with a feast-or-famine mentality: go big and do it all or go home. Bill Loconzolo explains how established enterprises can build scalable, secure data pipelines that create connections between central data and product teams and enable business results that matter. Learn the framework Bill developed to realize your big data vision. Read more.
4:20pm–5:00pm Thursday, 03/31/2016
Location: LL20 C
Jeroen Janssens (Data Science Workshops)
Average rating: ****.
(4.50, 4 ratings)
Vowpal Wabbit (VW) is a fast out-of-core learning system that pushes the frontier of machine learning. Jeroen Janssens offers a practical introduction to VW from both RStudio and the Unix command line and demonstrates how it can be used to perform tasks such as classification, regression, matrix factorization, and topic modeling. Read more.