Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Sessions

Wednesday, September 28

11:20am–12:00pm Wednesday, 09/28/2016
Location: 1 E 07/1 E 08 Level: Intermediate
Yonik Seeley (Cloudera)
Average rating: ****.
(4.00, 2 ratings)
Yonik Seeley explores recent Apache Solr features in the areas of faceting and analytics, including parallel SQL, streaming expressions, distributed join, and distributed graph queries, as well as the trade-offs of different approaches and strategies for maximizing scalability. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 1 E 10/1 E11 Level: Intermediate
Melissa Santos (Big Cartel)
Average rating: *****
(5.00, 4 ratings)
Whether we're talking about spam emails, merging records, or investigating clusters, there are many times when having a measure of how alike things are makes them easier to work with (e.g., with unstructured data that isn't incorporated into your data models). Melissa Santos offers a practical approach to creating a distance metric and validating with business owners that it provides value. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: Hall 1B Level: Beginner
Ram Sriharsha (Databricks)
Average rating: **...
(2.88, 17 ratings)
Ram Sriharsha reviews major developments in Apache Spark 2.0 and discusses future directions for the project to make Spark faster and easier to use for a wider array of workloads, with an emphasis on API evolution, single-node performance (Project Tungsten Phase 3), and Structured Streaming. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 1 E 12/1 E 13 Level: Intermediate
Sridhar Alla (BlueWhale), Kiran Muglurmath (Comcast)
Average rating: ****.
(4.17, 6 ratings)
Sridhar Alla and Kiran Muglurmath explain how real-time analytics on Comcast Xfinity set-top boxes (STBs) help drive several customer-facing and internal data-science-oriented applications and how Comcast uses Kudu to fill the gaps in batch and real-time storage and computation needs, allowing Comcast to process the high-speed data without the elaborate solutions needed till now. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 1 E 15/1 E 16 Level: Intermediate
Tags: ecommerce
Jeremy Stanley (Instacart)
Average rating: ****.
(4.67, 3 ratings)
Fifteen years ago, Webvan spectacularly failed to bring grocery delivery online. Speculation has been high that the current wave of on-demand grocery delivery startups will meet similar fates. Jeremy Stanley explains why this time the story will be different—data science is the key. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 3D 12 Level: Beginner
Emil Eifrem (Neo Technology), Tim Williamson (Monsanto)
Average rating: ***..
(3.89, 9 ratings)
Tim Williamson and Emil Eifrem explain how organizations can use graph databases to operationalize insights from big data, drawing on the real-life example of Monsanto’s use of graph databases to conduct real-time graph analysis of the company’s data to transform the business in ways that were previously impossible. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 3D 10 Level: Intermediate
Owen O'Malley (Cloudera)
Average rating: ****.
(4.92, 12 ratings)
Picking the best data format depends on what kind of data you have and how you plan to use it. Owen O'Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: Hall 1C Level: Intermediate
Carlos Guestrin (Apple | University of Washington)
Average rating: ****.
(4.45, 20 ratings)
Despite widespread adoption, machine-learning models remain mostly black boxes, making it very difficult to understand the reasons behind a prediction. Such understanding is fundamentally important to assess trust in a model before we take actions based on a prediction or choose to deploy a new ML service. Carlos Guestrin offers a general approach for explaining predictions made by any ML model. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 1 E 14
Cheryl Wiebe (Think Big, a Teradata Company)
Average rating: ****.
(4.80, 5 ratings)
The IoT is fundamentally transforming industries and reconfiguring the technology landscape, but challenges exist for enterprises to effectively realize the value from this next wave of information and opportunity. Cheryl Wiebe explores how leading companies harness the IoT by putting IoT data in context, fostering collaboration between IT and OT, and enabling a new breed of scalable analytics. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 1B 01/02
Carey James (BlueTalon)
Average rating: ***..
(3.67, 3 ratings)
Big data and analytics is a team sport empowering companies of all kinds to achieve business outcomes faster and with greater levels of success. Carey James explains how the formation of Dell Technologies and Dell EMC can help you on your data analytics journey and how you can turn actionable insights into new business opportunities. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: River Pavilion
JeanCarlo Bonilla (DataKind), Susan Sun (DataKind), Caitlin Augustin (DataKind)
Average rating: *****
(5.00, 1 rating)
JeanCarlo Bonilla, Susan Sun, and Caitlin Augustin explore how DataKind volunteer teams navigate the road to social impact by automating evidence collection for conservationists and helping expand the reach of mobile surveys so that more voices can be heard. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 3D 08 Level: Intermediate
Navdeep Alam (IMS Health)
Average rating: ****.
(4.50, 12 ratings)
The need to find efficiencies in healthcare is becoming paramount as our society and the global population continue to grow and live longer. Navdeep Alam shares his experience and reviews current and emerging technologies in the marketplace that handle working with unbounded, de-identified patient datasets in the billions of rows in an efficient and scalable way. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 1 E 09
Crystal Valentine (MapR Technologies)
Average rating: ***..
(3.15, 20 ratings)
Crystal Valentine draws on lessons learned from companies like Uber and Ericsson to outline the key principles for developing a microservices application. Along the way, Crystal describes how certain next-gen application areas—such as machine learning—are particularly well suited to implementation in a microservices architecture rather than a legacy application paradigm. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 1 C04 / 1 C05
Kyle Ambert (Intel)
Average rating: ***..
(3.00, 2 ratings)
Creating production-ready analytical pipelines can be a messy, error-prone undertaking. Kyle Ambert explores the Trusted Analytics Platform, an open source-based platform that enables data scientists to ask bigger questions of their data and carry out principled data science experiments—all while engaging in iterative, collaborative development of production solutions with application developers. Read more.
11:20am–12:00pm Wednesday, 09/28/2016
Location: 1B 03/04
Average rating: ****.
(4.50, 2 ratings)
Guy Levy-Yurista explains the unexpected consequences of making big data processing significantly more agile than ever before and the impact it's having on human insight consumption. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 1 E 07/1 E 08 Level: Beginner
Tags: pydata
Brian Granger (Cal Poly San Luis Obispo), Sylvain Corlay (QuantStack), Jason Grout (Bloomberg LP)
Average rating: ****.
(4.75, 12 ratings)
Brian Granger, Sylvain Corlay, and Jason Grout offer an overview of JupyterLab, the next-generation user interface for Project Jupyter, which places Jupyter Notebooks in an environment where the building blocks of interactive computing can be assembled to support a wide range of interactive workflows used in data science. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 1 E 10/1 E11 Level: Advanced
Brad Sarsfield (Microsoft HoloLens)
Average rating: ****.
(4.00, 11 ratings)
Data visualizations using interactive holograms help us make smarter decisions and explore ideas faster by inspecting every vantage point of our data and interacting with it in new, more personal and human ways. There are new rules for the new world. Join Brad Sarsfield as he explores and experiments with the possibilities of the next generation of data visualization experiences. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: Hall 1B Level: Advanced
Ted Malaska (Capital One), Mark Grover (Lyft)
Average rating: ***..
(3.92, 12 ratings)
Ted Malaska and Mark Grover cover the top five things that prevent Spark developers from getting the most out of their Spark clusters. When these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster with the same clusters and the same data, using just a different approach. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 1 E 12/1 E 13 Level: Beginner
Tags: real-time
Neha Narkhede (Confluent)
Average rating: ****.
(4.92, 12 ratings)
Neha Narkhede explains how Apache Kafka serves as a foundation for streaming data applications that consume and process real-time data streams and introduces Kafka Connect, a system for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library. Neha also describes the lessons companies like LinkedIn learned building massive streaming data architectures. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 1 E 15/1 E 16 Level: Non-technical
Michael Dauber (Amplify Partners), Shivon Zilis (Bloomberg Beta), Sarah Guo (Greylock), Matt Witheiler, Sam Pullara (Sutter Hill Ventures)
Average rating: **...
(2.67, 3 ratings)
In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road. Join us to hear about the trends that everyone is seeing and areas for investment that they find exciting. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 3D 12 Level: Beginner
Taposh Roy (Kaiser Permanente), Rajiv Synghal (Kaiser Permanente), Sabrina Dahlgren (Kaiser Permanente)
Average rating: ****.
(4.57, 7 ratings)
While other industries have embraced the digital era, healthcare is still playing catch-up. Kaiser Permanente has been a leader in healthcare technology and first started using computing to improve healthcare results in the 1960s. Taposh Roy, Rajiv Synghal, and Sabrina Dahlgren offer an overview of Kaiser’s big data strategy and explain how other organizations can adopt similar strategies. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 3D 10 Level: Intermediate
Marcel Kornacker (Cloudera), Mostafa Mokhtar (Cloudera)
Average rating: ****.
(4.80, 5 ratings)
Performance tuning your SQL-on-Hadoop deployment may seem overwhelming at times, especially for BI workloads that need interactive response times with high concurrency. Marcel Kornacker and Mostafa Mokhtar simplify the process and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: Hall 1C Level: Intermediate
Average rating: ****.
(4.38, 8 ratings)
Data science has always been a focus at eHarmony, but recently more business units have needed data-driven models. Jonathan Morra introduces Aloha, an open source project that allows the modeling group to quickly deploy type-safe accurate models to production, and explores how eHarmony creates models with Apache Spark and how it uses them. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 1 E 09
Scott Anderson (ClearStory Data), Andrew Yeung (ClearStory Data)
Average rating: ***..
(3.50, 4 ratings)
More data exists than ever before and in more disparate silos. Getting the insights you need, sifting through data, and answering new questions have all been complex, hairy tasks that only data jocks have been able to do. Andrew Yeung and Scott Anderson explore new ways to challenge the status quo and speed insights on diverse sources and demonstrate real customer use cases. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 1B 01/02
Ingo Mierswa (RapidMiner)
The flux capacitor was the core component that made time travel possible in Back to the Future, processing garbage as a power source. Did you know that you can achieve the same effect in machine learning? Ingo Mierswa demonstrates how you can power through your analytics faster than ever before using the knowledge of 250K data scientists. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 1 E 14
Viral Shah (Asurion Services)
Average rating: **...
(2.50, 4 ratings)
Viral Shah explains how enterprises like Asurion Services are leveraging big data management solutions to accelerate enterprise data lake initiatives for business value. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 1B 03/04
Jack Gudenkauf (Hewlett Packard Enterprise)
Average rating: ****.
(4.00, 3 ratings)
Jack Gudenkauf explores how organizations have successfully deployed tiered hyperscale architecture for real-time streaming with Spark, Kafka, Hadoop, and Vertica and discusses how advancements in hardware technologies such as nonvolatile memory, SSDs, and accelerators are changing the role of big data and big analytics platforms in an overall enterprise-data-platform strategy. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: River Pavilion Level: Intermediate
Alex Bordei (Bigstep)
Average rating: ***..
(3.67, 6 ratings)
Alex Bordei walks you through the steps required to build a data lake in the cloud and connect it to on-premises environments, covering best practices in architecting cloud data lakes and key aspects such as performance, security, data lineage, and data maintenance. The technologies presented range from basic HDFS storage to real-time processing with Spark Streaming. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 3D 08 Level: Beginner
Jun Liu (Intel), Zhaojuan Bian (Intel)
Average rating: **...
(2.00, 2 ratings)
Many challenges exist in designing an SQL-on-Hadoop cluster for production in a multiuser environment with heterogeneous and concurrent query workloads. Jun Liu and Zhaojuan Bian draw on their personal experience to address these challenges, explaining how to determine the right size of your cluster with different combinations of hardware and software resources using a simulation-based approach. Read more.
1:15pm–1:55pm Wednesday, 09/28/2016
Location: 1 C04 / 1 C05
Darryl Smith (Dell)
Average rating: ***..
(3.50, 4 ratings)
Hear the Chief Data Platform Architect of Dell Technologies outline streaming principles. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 1 E 07/1 E 08 Level: Beginner
Stuart Lynn (CartoDB), Andy Eschbacher (CARTO)
Average rating: ****.
(4.14, 7 ratings)
Geospatial analysis can provide deep insights into many datasets. Unfortunately, the key tools for unlocking these insights—geospatial statistics, machine learning, and meaningful cartography—remain inaccessible to nontechnical audiences. Stuart Lynn and Andy Eschbacher explore the design challenges in making these tools accessible and integrated in an intuitive location intelligence platform. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 1 E 10/1 E11 Level: Intermediate
Average rating: **...
(2.50, 4 ratings)
FINRA ingests over 50 billion records of stock market trading data daily into multipetabyte databases. Janaki Parameswaran and Kishore Ramachandran explain how FINRA technology integrates data feeds from disparate systems to provide analytics and visuals for regulating equities, options, and fixed-income markets. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: Hall 1B Level: Intermediate
Average rating: ***..
(3.75, 4 ratings)
Spark's efficiency and speed can help reduce the TCO of existing clusters. This is because Spark's performance advantages allow it to complete processing in drastically shorter batch windows with higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of an alternating least squares-based matrix factorization workload able to improve runtimes by a factor of 2.22. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 1 E 12/1 E 13 Level: Intermediate
Slava Chernyak (Google)
Average rating: ****.
(4.60, 10 ratings)
Watermarks are a system for measuring progress and completeness in out-of-order streaming systems and are utilized to emit correct results in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 1 E 15/1 E 16 Level: Non-technical
Brian Hopkins (Forrester Research)
Average rating: ****.
(4.71, 7 ratings)
Uber, Netflix, LinkedIn, Tesla, Stitch Fix, Earnest—the list of digital disruptors using data to steal customers grows every month. But is it just that these firms are data driven? Is it because they have smart data scientists and Hadoop? The secret to their success is that these firms go further in order to be insight driven. Brian Hopkins explains what they're doing and how to join them. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 3D 12 Level: Intermediate
Nandu Jayakumar (Oracle)
Average rating: ***..
(3.00, 1 rating)
Visa, the world’s largest electronic payments network, is transforming the way it manages data: database appliances are giving way to Hadoop and HBase; proprietary ETL technologies are being replaced by Spark; and enterprise warehouse data models will be complemented by flexible data schemas. Nandu Jayakumar explores the adoption of big data practices at a conservative, financial enterprise. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 3D 10 Level: Intermediate
Adam Bordelon (Mesosphere), Mohit Soni (Mesosphere)
Average rating: ****.
(4.33, 3 ratings)
Adam Bordelon and Mohit Soni demonstrate how projects like Apache Myriad (incubating) can install Hadoop on Mesosphere DC/OS alongside other data center-scale applications, enabling efficient resource sharing and isolation across a variety of distributed applications on the same cluster and breaking down silos. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: Hall 1C Level: Beginner
June Andrews (Wise / GE Digital)
Average rating: ****.
(4.31, 13 ratings)
Clustering algorithms produce vectors of information, which are almost surely difficult to interpret. These are then laboriously translated by data scientists into insights for influencing product and executive decisions. June Andrews offers an overview of a human-in-the-loop method used at Pinterest and LinkedIn that has led to fast, accurate, and pertinent human-readable insights. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 1 E 09
Ben Sharma (Zaloni)
Average rating: ****.
(4.23, 13 ratings)
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 1 E 14
Tags: iot
Reiner Kappenberger (HPE Security–Data Security)
Average rating: ***..
(3.00, 1 rating)
Reiner Kappenberger explores the new standards and innovations enabling architects and developers to take a “build it in” approach to security in early design phases for big data and IoT systems, explaining why emerging technologies such as format-preserving encryption are rapidly delivering more trusted big data and IoT ecosystems without altering application behavior or device functionality. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 1B 01/02
Chuck Yarbrough (Pentaho)
Average rating: **...
(2.50, 2 ratings)
It’s hard to get data into a data lake. Organizations hand-code their way through this, but with hundreds of data sources, it soon becomes unmanageable. Chuck Yarbrough offers a solution that uses metadata to autogenerate ingestion processes. Teams can drive hundreds of Hadoop onboarding processes through just a few templates, reducing development time and risk. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 1B 03/04
Connor Carreras (Trifacta)
Average rating: ***..
(3.67, 3 ratings)
Connor Carreras offers an in-depth review of the most popular use cases for data wrangling solutions among enterprise organizations, drawing on real customer deployments to explain how data wrangling has enabled them to accelerate analysis and uncover new sources of business value. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: River Pavilion Level: Non-technical
Kristi Wolff (Kelley Drye & Warren LLP), Crystal Skelton (Kelley Drye & Warren LLP)
Average rating: ****.
(4.00, 2 ratings)
Companies making data-driven decisions must consider critical legal obligations that may apply to the collection and use of data. Failing to do so has landed many tech stars and startups in hot legal water. Attorneys Kristi Wolff and Crystal Skelton discuss privacy, data security, and other legal considerations for using data across several industry types. Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 3D 08 Level: Beginner
Marcel Kornacker (Cloudera), Todd Lipcon (Cloudera)
Average rating: ****.
(4.50, 8 ratings)
Todd Lipcon and Marcel Kornacker explain how to simplify Hadoop-based data-centric applications with the CRUD (create, read, update, and delete) and interactive analytic functionality of Apache Impala (incubating) and Apache Kudu (incubating). Read more.
2:05pm–2:45pm Wednesday, 09/28/2016
Location: 1 C04 / 1 C05
Shankar Ganapathy (Paxata), Mark Nelson (Standard Chartered Bank), Veronica Liwak (Polaris)
Average rating: **...
(2.00, 7 ratings)
Join data experts from Citi, Standard Chartered Bank, and Polaris for a panel discussion moderated by Shankar Ganapathy. Learn about the principles, technologies, and processes they have used to design a highly efficient information management pipeline architected around the Hadoop ecosystem. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 1 E 07/1 E 08 Level: Advanced
Tags: real-time
Julien Le Dem (WeWork), Jacques Nadeau (Dremio)
Average rating: ****.
(4.33, 6 ratings)
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 1 E 10/1 E11 Level: Intermediate
Sean Kandel (Trifacta)
Average rating: *****
(5.00, 1 rating)
Traditional ways of visualizing data lineage provide a static mapping of source datasets to various targets or outputs. As the breadth of analysis occurring in schema-on-read environments increases, tracking how elements of the data were derived is critical. Sean Kandel introduces a new way to visualize data lineage that gives stakeholders a transparent view into their data. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: Hall 1B Level: Beginner
Praveen Murugesan (Uber Technologies Inc)
Average rating: ***..
(3.67, 12 ratings)
Praveen Murugesan explains how Uber leverages Hadoop and Spark as the cornerstones of its data infrastructure. Praveen details the current data architecture at Uber and outlines some of the unique challenges with data processing Uber faced as well as its approach to solving some key issues in order to continue to power Uber's real-time marketplace. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 1 E 12/1 E 13 Level: Advanced
Kenneth Knowles (Google)
Average rating: *****
(5.00, 2 ratings)
Triggers specify when a stage of computation should emit output. With a small language of primitive conditions, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources. Kenneth Knowles delves into the details of language- and runner-independent semantics for triggers in Apache Beam and explores real-world implementations in Google Cloud Dataflow. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 1 E 15/1 E 16 Level: Intermediate
Jack Norris (MapR Technologies)
Average rating: ***..
(3.83, 6 ratings)
Leading companies that are getting the most out of their data are not focusing on queries and data lakes; they are actively integrating analytics into their operations. Jack Norris reviews three customer case studies in ad/media, financial services, and healthcare to show how a focus on real-time data streams can transform the development, deployment, and future agility of applications. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 3D 12 Level: Intermediate
Shirshanka Das (LinkedIn), Yael Garten (LinkedIn)
Average rating: ****.
(4.73, 11 ratings)
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes, such as client-side activity tracking, a unified reporting platform, and data virtualization techniques to simplify migration, that enable LinkedIn to roll out future product innovations with minimal downstream impact. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 3D 10 Level: Intermediate
Zhe Zhang (LinkedIn), Uma Maheswara Rao G (Intel)
Average rating: **...
(2.67, 3 ratings)
The new erasure coding feature in Apache Hadoop (HDFS-EC) reduces the storage cost by ~50% compared with 3x replication. Zhe Zhang and Uma Maheswara Rao G present the first-ever performance study of HDFS-EC and share insights on when and how to use the feature. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: Hall 1C Level: Beginner
Eui-Hong Han (The Washington Post), Shuguang Wang (The Washington Post)
Average rating: ****.
(4.20, 5 ratings)
Predicting which stories will become popular is an invaluable tool for newsrooms. Eui-Hong Han and Shuguang Wang explain how the Washington Post predicts what stories on its site will be popular with readers and share the challenges they faced in developing the tool and metrics on how they refined the tool to increase accuracy. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 1 C04 / 1 C05
Jeremy Achin (DataRobot), Tom de Godoy (DataRobot)
Average rating: ****.
(4.33, 6 ratings)
In today's world, executives need to be the drivers for data science solutions. Data analysis has moved from the domain of data scientists to the forefront of core strategic initiatives. Are you empowering your team to identify and execute on every opportunity to optimize business with machine learning? In this session, you will learn how executives are transforming business with machine learning. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 1 E 09
Jonathan Gray (Cask)
Average rating: ****.
(4.33, 3 ratings)
Building, running, and governing a data lake on Hadoop is often a difficult process filled with slow development cycles and painful operations. Jonathan Gray proposes a modern, unified integration architecture that helps IT mitigate these issues while enabling businesses to reduce time to insights and make decisions faster through a modern self-service environment. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 1 E 14
Jonathon Whitton (PRGX USA)
Jonathon Whitton details how PRGX is using Talend and Cloudera to load two million annual client flat files into a Hadoop cluster and perform recovery audit services in order to help clients detect, find, and fix leakage in their procurement and payment processes. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 1B 03/04
Johan Bjerke (Splunk Inc)
Average rating: ***..
(3.00, 3 ratings)
Machine data is growing at an exponential rate, and a key driver for this growth is the Internet of Things (IoT) revolution. Johan Bjerke explains how to find value in and make use of the unstructured machine data that plays an important role in the new connected world. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: River Pavilion Level: Non-technical
Brett Goldstein (University of Chicago)
Average rating: *****
(5.00, 3 ratings)
How can we usher in a future of data-driven decision making that is characterized by more—not less—accountability and accessibility? Brett Goldstein discusses the imperative to couple new developments in data science with a renewed commitment to transparency and open source—with a particular focus on open source models to optimize deployment of policing resources. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 3D 08 Level: Beginner
Kaushik Deka (Novantas), Phil Jarymiszyn (Novantas)
Kaushik Deka and Phil Jarymiszyn discuss the benefits of a Spark-based feature store, a library of reusable features that allows data scientists to solve business problems across the enterprise. Kaushik and Phil outline three challenges they faced—semantic data integration within a data lake, high-performance feature engineering, and metadata governance—and explain how they overcame them. Read more.
2:55pm–3:35pm Wednesday, 09/28/2016
Location: 1B 01/02
Anthony Dina (Dell EMC)
Average rating: ***..
(3.00, 1 rating)
Mastercard's Nick Curcuru hosts an interactive fireside chat with Anthony Dina from Dell to explore how the flexibility, scalability, and agility of Hadoop big data solutions allow one of the world’s leading organizations to innovate, enable, and enhance the customer experience while still expanding emerging opportunities. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 1 E 07/1 E 08 Level: Beginner
Himanshu Gupta (Yahoo)
Average rating: ***..
(3.33, 3 ratings)
Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 1 E 10/1 E11 Level: Beginner
Mark Turner (Teradata)
Average rating: ****.
(4.00, 1 rating)
Which suppliers are most likely to have delivery or quality issues? Does service, product placement, or price make the biggest difference in customer sentiment? Text data from sources like email and social media can give answers. Mark Turner explains how to see the associations between any two variables in text data by combining text analytics and the bipartite graph visualization technique. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: Hall 1B Level: Intermediate
François Garillot (Swisscom)
Average rating: ***..
(3.75, 4 ratings)
Swisscom, the leading mobile service provider in Switzerland, also provides data-driven intelligence through the analysis of its mobile network. Its Mobility Insights team works to help administrators understand the flow of people through their location of interest. François Garillot explores the platform, tooling, and choices that help achieve this service and some challenges the team has faced. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 1 E 12/1 E 13 Level: Intermediate
Tags: real-time, iot
Ira Cohen (Anodot)
Average rating: ****.
(4.00, 5 ratings)
Time series and event data form the basis for real-time insights about the performance of businesses such as ecommerce, the IoT, and web services, but gaining these insights involves designing a learning system that scales to millions and billions of data streams. Ira Cohen outlines a system that performs real-time machine learning and analytics on streams at massive scale. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 1 E 15/1 E 16 Level: Non-technical
Tags: ecommerce
Daniel Mintz (Looker)
Daniel Mintz dives into case studies from three companies—ThredUp, Twilio, and Warby Parker—that use data to generate sustainable competitive advantages in their industries. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 3D 12 Level: Intermediate
Siva Raghupathy (Amazon Web Services)
Average rating: ****.
(4.85, 13 ratings)
Siva Raghupathy demonstrates how to use Hadoop innovations in conjunction with Amazon Web Services (cloud) innovations. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 3D 10 Level: Beginner
Crystal Valentine (MapR Technologies)
Average rating: ****.
(4.50, 4 ratings)
Crystal Valentine explains how the large graph-processing frameworks that run on Hadoop can be used to detect significantly mutated protein signaling pathways in cancer genomes through a probabilistic analysis of large protein-protein interaction networks, using techniques similar to those used in social network analysis algorithms. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: Hall 1C Level: Intermediate
Josh Patterson (Patterson Consulting), Dave Kale (Skymind)
Average rating: *****
(5.00, 1 rating)
Can machines be creative? Josh Patterson and David Kale offer a practical demonstration—an interactive Twitter bot that users can ping to receive a response dynamically generated by a conditional recurrent neural net implemented using DL4J—that suggests the answer may be yes. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 1 E 09
Peter Wang (Anaconda)
Average rating: ***..
(3.33, 3 ratings)
Although Python and R promise powerful data science insights, they can also be complex to manage and deploy with Hadoop infrastructure. Peter Wang distills the vast array of Hadoop and data science tools and architectures down to the essentials that deliver a powerful and lightweight stack quickly so that you can accelerate time to value while meeting your data science, governance, and IT needs. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 1 E 14
Martin Yip (VMware)
Average rating: ****.
(4.00, 1 rating)
The trend of deploying Hadoop on virtual infrastructure is rapidly increasing. Martin Yip explores the benefits of virtualizing Hadoop through the lens of three real-world examples. You'll leave with the confidence to deploy your Hadoop clusters using virtualization. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 1 C04 / 1 C05
Moderated by:
Edd Wilder-James (Google)
Panelists:
Maksim Pecherskiy (City of San Diego), Robert Stratton (Neustar), Chris Kakkanatt (Pfizer)
Average rating: *....
(1.00, 1 rating)
Analytic discovery is a team sport; the lone hero data scientist is a thing of the past. John Akred of Silicon Valley Data Science leads a panel of analytics and data experts from Pfizer, the City of San Diego, and Neustar that explores how these businesses were changed through analytic collaboration. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 1B 03/04
Jake Dolezal (McKnight Consulting Group Global Services)
Average rating: *****
(5.00, 1 rating)
Jake Dolezal shares research into the performance of data quality and data management workloads on Hadoop clusters. Jake discusses a YARN-based approach to data management and outlines highly effective IT resource utilization techniques to achieve extreme agility for organizations and performance gains in Hadoop. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: River Pavilion Level: Non-technical
Sara Watson (Digital Asia Hub)
Average rating: *****
(5.00, 3 ratings)
How are users meant to interpret the influence of big data and personalization in their targeted experiences? What signals do we have to show us how our data is used, how it improves or constrains our experience? Sara Watson explains that in order to develop normative opinions to shape policy and practice, users need means to guide their experience—the personalization spectrum. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 3D 08 Level: Intermediate
Jasjeet Thind (Zillow)
Average rating: ***..
(3.75, 8 ratings)
Zillow pioneered providing access to unprecedented information about the housing market. Long gone are the days when you needed an agent to get comparables and prior sale and listing data. And with more data, data science has enabled more use cases. Jasjeet Thind explains how Zillow uses Spark and machine learning to transform real estate. Read more.
4:35pm–5:15pm Wednesday, 09/28/2016
Location: 1B 01/02
Antonio Rosales (Canonical)
Average rating: ****.
(4.00, 1 rating)
Antonio Rosales offers an overview of Juju, an open source method to distill the best practices and operations needed to use interconnected big data solutions. By providing an open source means to describe services and solutions, users can focus on using the science, and developers can focus on delivering best practices. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 1 E 07/1 E 08 Level: Intermediate
Kurt Brown (Netflix)
Average rating: ****.
(4.22, 9 ratings)
The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 1 E 10/1 E11 Level: Intermediate
Leo Meyerovich (Graphistry)
Visual analysis is changing in the era of GPU clusters. Now that compute at scale is easier, the bottleneck is mapping data to visualizations and intelligently interacting with them. Using datasets uploaded to Graphistry, Leo Meyerovich provides a glimpse into the emerging workflows for graph and linked event analysis and offers common tricks for success. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: Hall 1B Level: Intermediate
Neelesh Salian (Stitch Fix)
Average rating: ***..
(3.00, 7 ratings)
Drawing on his experiences across 150+ production deployments, Neelesh Srinivas Salian focuses on five common issues observed in a cluster environment set up with Apache Spark (Core, Streaming, and SQL) to help you improve the usability and supportability of Apache Spark and avoid such issues in future deployments. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 1 E 12/1 E 13 Level: Intermediate
Tags: iot
Ted Dunning (MapR, now part of HPE)
Average rating: ****.
(4.00, 2 ratings)
Modern cars produce data. Lots of data. And Formula 1 cars produce more than their fair share. Ted Dunning presents a demo of how data streaming can be applied to the analytics problems posed by modern motorsports. Although he won't be bringing Formula 1 cars to the talk, Ted demonstrates a physics-based simulator to analyze realistic data from simulated cars. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 1 E 15/1 E 16 Level: Intermediate
Alexander Dean (Snowplow Analytics Ltd)
Average rating: ****.
(4.00, 1 rating)
In 1853, Britain’s workshops built 90 new gunboats for the Royal Navy in just 90 days—an astonishing feat of engineering made possible by industrial standardization. Snowplow's Alexander Dean argues that data-sophisticated corporations need a new standardization of their own, in the form of schema registries like Confluent Schema Registry or Snowplow’s own Iglu. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 3D 12 Level: Intermediate
Tags: cloud
Henry Robinson (Cloudera), Justin Erickson (Cloudera)
Average rating: **...
(2.50, 2 ratings)
Henry Robinson and Justin Erickson explain how to best take advantage of the flexibility and cost-effectiveness of the cloud with your BI and SQL analytic workloads using Apache Hadoop and Apache Impala (incubating), covering the architectural considerations, best practices, tuning, and functionality available when deploying or migrating BI and SQL analytic workloads to the cloud. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 3D 10 Level: Non-technical
Tags: ai
Mike Lee Williams (Cloudera Fast Forward Labs)
Average rating: ****.
(4.80, 10 ratings)
Our ability to extract meaning from unstructured text data has not kept pace with our ability to produce and store it, but recent breakthroughs in recurrent neural networks are allowing us to make exciting progress in computer understanding of language. Building on these new ideas, Michael Williams explores three ways to summarize text and presents prototype products for each approach. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: Hall 1C Level: Intermediate
Martin Wicke (Google)
Average rating: ***..
(3.50, 2 ratings)
Much of the success of deep learning in recent years can be attributed to scale (bigger datasets and more computing power), but scale can quickly become a problem. Distributed, asynchronous computing in heterogeneous environments is complex, hard to debug, and hard to profile and optimize. Martin Wicke demonstrates how to automate or abstract away such complexity, using TensorFlow as an example. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 1 C04 / 1 C05
Amit Vij (Kinetica), Mark Brooks (Kinetica DB, Inc.)
Data lakes provide large-scale data processing and storage at low cost but struggle to deliver real-time analytics without investment in large clusters. If you need subsecond analytic response on streaming data, consider a GPU database. Amit Vij and Mark Brooks outline the dramatic performance benefits a GPU database offers and explain how to integrate it with Hadoop. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 1 E 09
Matt Turck (FirstMark Capital), Einat Burshtine (Credit Suisse), Shui Cheung Yip (Pershing LLC (Bank of New York Mellon)), Alasdair Anderson (Nordea)
Average rating: ****.
(4.00, 4 ratings)
What's the point at which Hadoop tips from a Swiss-army knife of use cases to a new foundation that rearranges how the financial services marketplace turns data into profit and competitive advantage? This panel of expert practitioners looks into the near future to see if the inflection point is at hand. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 1 E 14
Thomas Place (First Data)
Average rating: ****.
(4.33, 3 ratings)
Thomas Place explores the big data journey of the world’s biggest payment processor, which came dangerously close to building a data swamp before pivoting to embrace governance and quality-first patterns. This case study includes patterns, partners, successes, failures, and lessons learned to date and reviews the journey ahead. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 1B 01/02
Jim McHugh (NVIDIA)
Average rating: ****.
(4.50, 2 ratings)
Customers are looking to extend the benefits beyond big data with the power of the deep learning and accelerated analytics ecosystems. Jim McHugh explains how customers are leveraging deep learning and accelerated analytics to turn insights into AI-driven knowledge and covers the growing ecosystem of solutions and technologies that are delivering on this promise. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 1B 03/04
Amar Arsikere (infoworks.io)
Average rating: *....
(1.00, 1 rating)
Current data warehouse technologies are increasingly challenged to handle the growth in data volume, new data types, and multiple analytics types. Hadoop has the potential to address these issues, but you need to solve several complexities before you can realize its full benefits. Amar Arsikere showcases the business and technical aspects of augmenting and modernizing data warehouses on Hadoop. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: River Pavilion Level: Beginner
Tony Ng (WeWork)
Average rating: ****.
(4.00, 1 rating)
Enterprises are increasingly demanding real-time analytics and insights. Tony Ng offers an overview of Pulsar, an open source real-time streaming system used at eBay. Tony explains how Pulsar integrates Kafka, Kylin, and Druid to provide flexibility and scalability in event and metrics consumption. Read more.
5:25pm–6:05pm Wednesday, 09/28/2016
Location: 3D 08 Level: Non-technical
Bas Geerdink (Aizonic)
Average rating: ****.
(4.67, 3 ratings)
Bas Geerdink offers an overview of the evolution that the Hadoop ecosystem has taken at ING. Since 2013, ING has invested heavily in a central data lake and data management practice. Bas shares historical lessons and best practices for enterprises that are incorporating Hadoop into their infrastructure landscape. Read more.

Thursday, September 29

11:20am–12:00pm Thursday, 09/29/2016
Location: 1 E 07/1 E 08 Level: Intermediate
Ryan Blue (Netflix)
Average rating: ****.
(4.71, 7 ratings)
Netflix is exploring new avenues for data processing where traditional approaches fail to scale. Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet's features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he's learned, creating the missing guide you need. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 1 E 10/1 E11 Level: Intermediate
Average rating: ****.
(4.50, 2 ratings)
Airbnb developed Caravel to provide all employees with interactive access to data while minimizing friction. Caravel's main goal is to make it easy to slice, dice, and visualize data. Maxime Beauchemin explains how Caravel empowers each and every employee to perform analytics at the speed of thought. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: Hall 1B Level: Beginner
Tags: real-time
Ram Sriharsha (Databricks)
Average rating: ***..
(3.25, 8 ratings)
Structured Streaming is a new effort in Apache Spark to make stream processing simple without the need to learn a new programming paradigm or system. Ram Sriharsha offers an overview of Structured Streaming, discussing its support for event-time, out-of-order/delayed data, sessionization, and integration with the batch data stack to show how it simplifies building powerful continuous applications. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 1 E 12/1 E 13 Level: Beginner
Jim Scott (NVIDIA)
Average rating: ***..
(3.80, 5 ratings)
Jim Scott outlines the core tenets of a message-driven architecture and explains its importance in real-time big data-enabled distributed systems within the realm of finance. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 1 E 15/1 E 16
Susan Etlinger (Altimeter Group)
Average rating: *....
(1.00, 1 rating)
The history of the digital age is being written in photographs. To innovate in the visual age, we have to crack the visual code. Susan Etlinger explores why the ability to understand why one photo resonates and one doesn’t can make or break reputations, spark new products or lines of business, and make or save millions of dollars. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 3D 12 Level: Intermediate
Ihab Ilyas (University of Waterloo)
Average rating: *****
(5.00, 2 ratings)
Machine-learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 3D 10 Level: Intermediate
Terry Mcfadden (P&G), Priyank Patel (Arcadia Data)
Average rating: ****.
(4.00, 3 ratings)
Terry Mcfadden and Priyank Patel discuss Procter and Gamble's three-year journey to enable production applications with on-cluster BI technology, exploring in detail the architecture challenges and choices made by the team along this journey. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: Hall 1C Level: Beginner
Yishay Carmiel (IntelligentWire)
Average rating: ****.
(4.43, 7 ratings)
Deep learning has taken us a few steps further toward achieving AI for a man-machine interface. However, deep learning technologies like speech recognition and natural language processing remain a mystery to many. Yishay Carmiel reviews the history of deep learning, the impact it's made, recent breakthroughs, interesting solved and open problems, and what's in store for the future. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 1 C04 / 1 C05
John Hugg (VoltDB)
VoltDB promises full ACID with strong serializability in a fault-tolerant, distributed SQL platform, as well as higher throughput than other systems that promise much less. But why should users believe this? John Hugg discusses VoltDB's internal testing and support processes, its work with Kyle Kingsbury on the VoltDB Jepsen testing project, and where VoltDB will continue to improve. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 1 E 09
Tags: cloud
Rimma Nehme (Microsoft)
Average rating: *****
(5.00, 1 rating)
The amount of cutting-edge technology that Azure puts at your fingertips is incredible. Artificial intelligence is no exception. Azure enables sophisticated capabilities in artificial intelligence, machine learning, deep learning, cognitive services, and advanced analytics. Rimma Nehme explains why Azure is the next AI supercomputer and how this vision is being implemented in reality. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 1 E 14
Rajesh Shroff (Cisco Systems Inc)
Rajesh Shroff reviews the big data and analytics landscape, lessons learned in enterprise over the last few years, and some of the key considerations while designing a big data system. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 1B 03/04
Douglas Liming (SAS Institute Inc.)
Average rating: ****.
(4.00, 1 rating)
Ready to take a deeper look at how Hadoop and its ecosystem have a widespread impact on analytics? Douglas Liming explains where SAS fits into the open ecosystem, why you no longer have to choose between analytics languages like Python, R, or SAS, and how a single, unified open analytics architecture empowers you to literally have it all. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: River Pavilion Level: Intermediate
Fang Yu (DataVisor)
Average rating: ****.
(4.67, 3 ratings)
The value of online user accounts has led to a significant increase in account takeover (ATO) attacks. Cyber criminals create armies of compromised accounts to perform attacks including fraudulent transactions, bank withdrawals, reward program theft, and more. Fang Yu explains how the latest in big data technology is helping turn the tide on ATO campaigns. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 3D 08 Level: Non-technical
Tags: iot
Mike Stringer (Datascope Analytics)
Average rating: *....
(1.89, 9 ratings)
We're likely just at the beginning of data science. The people and things that are starting to be equipped with sensors will enable entirely new classes of problems that will have to be approached more scientifically. Mike Stringer outlines some of the issues that may arise for business, for data scientists, and for society. Read more.
11:20am–12:00pm Thursday, 09/29/2016
Location: 1B 01/02
Average rating: ****.
(4.50, 2 ratings)
With so much variance across Hadoop distributions, ODPi was established to create standards for both Hadoop components and testing applications on those components. Join John Mertic and Berni Schiefer to learn how application developers and companies considering Hadoop can benefit from ODPi. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 1 E 07/1 E 08 Level: Beginner
Tags: real-time
Tyler Akidau (Google)
Average rating: ****.
(4.67, 3 ratings)
Tyler Akidau offers a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, comparing and contrasting systems at Google with popular open source systems in use today. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 1 E 10/1 E11
Uma Raghavan (Integris Software)
Average rating: ***..
(3.50, 2 ratings)
Uma Raghavan explains why you're about to see companies whose business models depend on using their customers' data, like Facebook, Google, and many others, scramble to keep up with the flood of new and evolving laws on data privacy. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: Hall 1B Level: Beginner
Yuhao Yang (Intel)
Average rating: ****.
(4.00, 5 ratings)
Through collaboration with some of the top payments companies around the world, Intel has developed an end-to-end solution for building fraud detection applications. Yuhao Yang explains how Intel used and extended Spark DataFrames and ML Pipelines to build the tool chain for financial fraud detection and shares the lessons learned during development. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 1 E 12/1 E 13 Level: Intermediate
Tags: real-time
Ewen Cheslack-Postava (Confluent)
Average rating: ***..
(3.33, 3 ratings)
You may have successfully made the transition from single machines and one-off solutions to large, distributed stream infrastructures in your data center. But what if one data center is not enough? Ewen Cheslack-Postava explores resilient multi-data-center architecture with Apache Kafka, sharing best practices for data replication and mirroring as well as disaster scenarios and failure handling. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 1 E 15/1 E 16 Level: Intermediate
Rupert Steffner (Otto GmbH & Co. KG)
Average rating: ***..
(3.67, 3 ratings)
Today’s online storefronts are good at procuring transactions but poor at managing customers. Rupert Steffner explains why online retailers must build a complementary intelligence to perceive and reason on customer signals to better manage opportunities and risks along the customer journey. Individually managed customer experience is retailers' next challenge, and fueling AI is the right answer. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 3D 12
Karthik Ramasamy (Twitter)
Average rating: ***..
(3.20, 5 ratings)
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Karthik Ramasamy offers an overview of the end-to-end real-time stack Twitter designed in order to meet this challenge, consisting of DistributedLog (the distributed and replicated messaging system) and Heron (the streaming system for real-time computation). Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 3D 10 Level: Intermediate
Tags: r-lang
Xiangrui Meng (Databricks)
Average rating: ****.
(4.00, 2 ratings)
Xiangrui Meng explores recent community efforts to extend SparkR for scalable advanced analytics—including summary statistics, single-pass approximate algorithms, and machine-learning algorithms ported from Spark MLlib—and shows how to integrate existing R packages with SparkR to accelerate existing R workflows. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: Hall 1C Level: Intermediate
Tags: ai
David Talby (Pacific AI), Claudiu Branzan (Accenture)
Average rating: ****.
(4.00, 1 rating)
David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 1 E 14
Average rating: *....
(1.00, 1 rating)
Big data is a critical part of the enterprise data fabric and must meet the critical enterprise criteria of correctness, quality, consistency, compliance, and traceability. Michael Eacrett explains how companies are using big data infrastructures, asynchronously and in real time, to actively solve information governance and data-quality challenges. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 1B 01/02
Joe Goldberg (BMC Software)
Average rating: **...
(2.00, 1 rating)
Joe Goldberg explores how companies like GoPro, Produban, Navistar, and others have taken a platform approach to managing their workflows; how they are using workflows to power data ingest, ETL, and data integration processing; how an end-to-end view of workflows has reduced issue resolution time; and how these companies are achieving success in their data warehouse modernization projects. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: River Pavilion Level: Beginner
Joshua Patterson (NVIDIA), Mike Wendt (NVIDIA), Keith Kraus (NVIDIA)
Average rating: ****.
(4.00, 2 ratings)
Cybersecurity has become a data problem and thus needs the best-in-breed big data tools. Joshua Patterson, Michael Wendt, and Keith Kraus explain how Accenture Labs's Cybersecurity team is using Apache Kafka, Spark, and Flink to stream data into Blazegraph and Datastax Graph to accelerate cyber defense. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 3D 08 Level: Non-technical
Brian Kahn (Climate Central), Edward Wisniewski (Radish Lab)
Average rating: ****.
(4.50, 2 ratings)
Radish Lab teamed up with science news nonprofit Climate Central to transform temperature data from 1,001 US cities into a compelling, simple interactive that received more than 1 million views within three days of launch. Alana Range and Brian Kahn offer an overview of the process of creating a viral, interactive data visualization with teams that regularly produce powerful data stories. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 1B 03/04
Sherri Adame (Cigna)
Average rating: ****.
(4.57, 7 ratings)
Launched in late 2015, Cigna's enterprise data lake project is taking the company on a data governance journey. Sherri Adame offers an overview of the project, providing insights into some of the business pain points and key drivers, how it has led to organizational change, and the best practices associated with Cigna’s new data governance process. Read more.
1:15pm–1:55pm Thursday, 09/29/2016
Location: 1 C04 / 1 C05
Richard Langlois (IT Architecture & Strategy)
The self-service YP Analytics application allows advertisers to understand their digital presence and ROI. Richard Langlois explains how Yellow Pages used this expertise for an internal use case that delivers real-time analytics and data exploration with Tableau, using OLAP on Hadoop enabled by a stack that includes HDFS, Parquet, Hive, Impala, and AtScale. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 1 E 07/1 E 08 Level: Intermediate
Xavier Léauté (Confluent)
Ever wondered what it takes to scale Kafka, Samza, and Druid to handle complex, heterogeneous analytics workloads at petabyte size? Xavier Léauté discusses his experience scaling Metamarkets's real-time processing to over 3 million events per second and shares the challenges encountered and lessons learned along the way. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 1 E 10/1 E11 Level: Beginner
Amit Kapoor (narrativeVIZ)
Average rating: ****.
(4.67, 3 ratings)
Though visualization is used in data science to understand the shape of the data, it's not widely used for statistical models, which are evaluated based on numerical summaries. Amit Kapoor explores model visualization, which aids in understanding the shape of the model, the impact of parameters and input data on the model, the fit of the model, and where it can be improved. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: Hall 1B Level: Intermediate
Holden Karau (Independent), Seth Hendrickson (Cloudera)
Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark's new Structured Streaming and walk you through creating your own streaming model. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 1 E 12/1 E 13 Level: Intermediate
Venkatesh Sivasubramanian (GE Digital), Luis Ramos (GE Digital)
Average rating: ***..
(3.50, 2 ratings)
Opportunities in the industrial world are expected to outpace consumer business cases. Time series data is growing exponentially as new machines get connected. Venkatesh Sivasubramanian and Luis Ramos explain how GE makes it faster and easier for systems to access (using a common layer) and perform analytics on a massive volume of time series data by combining Apache Apex, Spark, and Kudu. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 1 E 15/1 E 16 Level: Non-technical
Danielle Dean (iRobot), Amy O'Connor (Cloudera)
Average rating: ***..
(3.17, 6 ratings)
At Strata + Hadoop World 2012, Amy O'Connor and her daughter Danielle Dean shared how they learned and built data science skills at Nokia. This year, Amy and Danielle explore how the landscape in the world of data science has changed in the past four years and explain how to be successful deriving value from data today. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 3D 12 Level: Beginner
Thomas Phelan (HPE BlueData)
Average rating: ****.
(4.11, 18 ratings)
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale environments poses new challenges, especially for big data applications like Hadoop. Thomas Phelan shares lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 3D 10 Level: Beginner
Amitai Armon (Intel), Nir Lotan (Intel)
Average rating: ****.
(4.50, 2 ratings)
Amitai Armon and Nir Lotan outline a new, free software tool that enables the creation of deep learning models quickly and easily. The tool is based on existing deep learning frameworks and incorporates extensive optimizations that provide high performance on standard CPUs. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: Hall 1C Level: Intermediate
Tags: media, politics
Amir Hajian (Thomson Reuters), Khaled Ammar (Thomson Reuters), Alex Constandache (Thomson Reuters)
Average rating: ***..
(3.75, 4 ratings)
Amir Hajian, Khaled Ammar, and Alex Constandache offer an approach to mining a large dataset to predict the electability of hypothetical candidates in the US presidential election race, using machine learning, natural language processing, and deep learning on an infrastructure that includes Spark and Elasticsearch, which serves as the backbone of the mobile game White House Run. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 1B 01/02
Scott Gnau (Hortonworks)
Average rating: ***..
(3.00, 1 rating)
Scott Gnau provides unique insights into the tipping point for data, how enterprises are now rethinking everything from their IT architecture and software strategies to data governance and security, and the cultural shifts CIOs must grapple with when supporting a business using real-time data to scale and grow. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 1 C04 / 1 C05
John Morrell (Datameer)
A panel of practitioners from Dell, National Instruments, and Citi (companies that are gaining real value from big data analytics) explores their companies' big data journeys, explaining how analytics can answer groundbreaking new questions about business and create a path to becoming a data-driven organization. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 1 E 09
Tags: cloud
Jonathan Fritz (Amazon Web Services)
Average rating: *****
(5.00, 1 rating)
Running Hadoop, Spark, and Presto can be as fast and inexpensive as ordering a latte at your favorite coffee shop. Jonathan Fritz explains how organizations are deploying these and other big data frameworks with Amazon Web Services (AWS) and how you too can quickly and securely run Spark and Presto on AWS. Jonathan shows you how to get started and shares best practices and common use cases. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 1 E 14
Steven Touw (Immuta)
Average rating: *****
(5.00, 3 ratings)
Sharing your valuable data internally or with third-party consumers can be risky due to data privacy regulations and IP considerations, but sharing can also generate revenue or help nonprofits succeed at world-changing missions. Steve Touw explores real-world examples of how a proper data architecture enables philanthropic missions and offers ideas for how to better share your data. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 1B 03/04
Joe Caserta (Caserta Concepts)
Average rating: *****
(5.00, 1 rating)
Joe Caserta explores how a leading membership interest group is utilizing a data lake to track its members’ path-to-purchase touch points across multiple channels by matching and mastering individuals using Spark GraphFrames and stitching together website, marketing, email, and transaction data to discover the most effective way to attract new members and retain existing high-value members. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: River Pavilion Level: Intermediate
Jun Rao (Confluent)
Average rating: *****
(5.00, 1 rating)
With Apache Kafka 0.9, the community has introduced a number of features to make data streams secure. Jun Rao explains the motivation for making these changes, discusses the design of Kafka security, and demonstrates how to secure a Kafka cluster. Jun also covers common pitfalls in securing Kafka and talks about ongoing security work. Read more.
2:05pm–2:45pm Thursday, 09/29/2016
Location: 3D 08 Level: Intermediate
Tags: real-time
Kostas Tzoumas (data Artisans)
Average rating: ****.
(4.00, 2 ratings)
Apache Flink has seen incredible growth during the last year, both in development and usage, driven by the fundamental shift from batch to stream processing. Kostas Tzoumas demonstrates how Apache Flink enables real-time decisions, makes infrastructure less complex, and enables extremely efficient, accurate, and fault-tolerant streaming applications. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 1 E 07/1 E 08 Level: Non-technical
Bart van Leeuwen (Netage)
Average rating: *****
(5.00, 2 ratings)
Smart data allows fire services to better protect the people they serve and keep their firefighters safe. The combination of open and nonpublic data used in a smart way generates new insights both in preparation and operations. Bart van Leeuwen discusses how the fire service is benefiting from open standards and best practices. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 1 E 10/1 E11 Level: Intermediate
Tags: ai
Stephen Pratt (Noodle.ai)
Average rating: ***..
(3.00, 3 ratings)
Stephen Pratt, the CEO of Noodle.ai and former head of Watson for IBM GBS, presents a shareholder value perspective on why enterprise artificial intelligence (eAI) will be the single largest competitive differentiator in business over the next five years—and what you can do to end up on top. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: Hall 1B Level: Intermediate
Narasimhan Sampath (Choice Hotels International), Avinash Ramineni (Clairvoyant)
Average rating: ****.
(4.00, 1 rating)
Narasimhan Sampath and Avinash Ramineni share how Choice Hotels International used Spark Streaming, Kafka, Spark, and Spark SQL to create an advanced analytics platform that enables business users to be self-reliant by accessing the data they need from a variety of sources to generate customer insights and property dashboards and enable data-driven decisions with minimal IT engagement. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 1 E 12/1 E 13 Level: Intermediate
Yaron Haviv (iguaz.io)
Average rating: **...
(2.00, 1 rating)
Yaron Haviv explains how to design real-time IoT and FSI applications, leveraging Spark with advanced data frame acceleration. Yaron then presents a detailed, practical use case, diving deep into the architectural paradigm shift that makes the powerful processing of millions of events both efficient and simple to program. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 1 E 15/1 E 16 Level: Non-technical
Tanya Cashorali (TCB Analytics)
Given the recent demand for data analytics and data science skills, adequately testing and qualifying candidates can be a daunting task. Interviewing hundreds of individuals of varying experience and skill levels requires a standardized approach. Tanya Cashorali explores strategies, best practices, and deceptively simple interviewing techniques for data analytics and data science candidates. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 3D 12 Level: Intermediate
Rick McFarland (Hearst Corp)
Average rating: **...
(2.00, 2 ratings)
Rick McFarland explains how the Hearst Corporation utilizes big data and analytics tools like Spark and Kinesis to stream click data in real time from its 300+ websites worldwide. This streaming process feeds an editorial tool called Buzzing@Hearst, which provides instant feedback to authors on what is trending across the Hearst network. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 3D 10 Level: Intermediate
Brendan Herger (Capital One)
Average rating: ****.
(4.80, 5 ratings)
Many areas of applied machine learning require models optimized for rare occurrences, such as class imbalances, and users actively attempting to subvert the system (adversaries). Brendan Herger offers an overview of multiple published techniques that specifically attempt to address these issues and discusses lessons learned by the Data Innovation Lab at Capital One. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: Hall 1C Level: Intermediate
Danielle Dean (iRobot), Shaheen Gauher (Microsoft)
Average rating: ****.
(4.20, 5 ratings)
In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Unless the data collection has been taking place over a long period of time, the data will have very few of these events or, in the worst case, none at all. Danielle Dean and Shaheen Gauher discuss the various ways of building and evaluating models for such data. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 1B 03/04
Mariusz Gądarowski (deepsense.io)
Mariusz Gądarowski offers an overview of Neptune, deepsense.io’s new IT platform-based machine-learning experiment management solution for data scientists. Neptune enhances the management of machine-learning tasks such as dependent computational processes, code versioning, comparing achieved results, monitoring tasks and progress, sharing infrastructure among teammates, and many others. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: River Pavilion Level: Beginner
Tags: cloud
Li Li (Google), Hao Hao (Cloudera)
Average rating: ***..
(3.50, 2 ratings)
Li Li and Hao Hao elaborate on the architecture of Apache Sentry + RecordService for Hadoop in the cloud, which provides unified, fine-grained authorization via role- and attribute-based access control, and encourage attendees to adopt Apache Sentry and RecordService to protect sensitive data in the multitenant cloud across the Hadoop ecosystem. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 3D 08 Level: Intermediate
Kaz Sato (Google)
Average rating: *****
(5.00, 4 ratings)
The largest challenge for deep learning is scalability. Google has built a large-scale neural network in the cloud and is now sharing that power. Kazunori Sato introduces pretrained ML services, such as the Cloud Vision API and the Speech API, and explores how TensorFlow and Cloud Machine Learning can accelerate custom model training 10x–40x with Google's distributed training infrastructure. Read more.
2:55pm–3:35pm Thursday, 09/29/2016
Location: 1 C04 / 1 C05 Level: Beginner
Haoyuan Li (Alluxio)
Average rating: ****.
(4.00, 1 rating)
Haoyuan Li offers an overview of Alluxio (formerly Tachyon), a memory-speed virtual distributed storage system. In the past year, the Alluxio project experienced a tremendous improvement in performance and scalability and was extended with key new features. This year, the goal is to make Alluxio accessible to an even wider set of users through a focus on security, new language bindings, and APIs. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: 1 E 07/1 E 08 Level: Intermediate
Tags: real-time
Fangjin Yang (Imply)
Average rating: *****
(5.00, 4 ratings)
Cluster computing frameworks such as Hadoop or Spark are tremendously beneficial in processing and deriving insights from data. However, long query latencies make these frameworks suboptimal choices to power interactive applications. Fangjin Yang discusses using Druid for analytics and explains why the architecture is well suited to power analytic dashboards. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: 1 E 10/1 E11 Level: Non-technical
Cameron Turner (The Data Guild), Brad Sarsfield (Microsoft HoloLens), Hanna Kang-Brown (R/GA), Evan Macmillan (Gridspace)
Average rating: ***..
(3.50, 2 ratings)
Data should be something you can see, feel, hear, taste, and touch. Drawing on real-world examples, Cameron Turner, Brad Sarsfield, Hanna Kang-Brown, and Evan Macmillan cover the emerging field of sensory data visualization, including data sonification, and explain where it's headed in the future. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: Hall 1B Level: Intermediate
Tags: real-time
Jesse Anderson (Big Data Institute)
Average rating: *****
(5.00, 2 ratings)
Although Spark gets a lot of attention, we only think about two languages being supported—Python and Scala. Jesse Anderson proves that Java works just as well. With lambdas, we even get syntax comparable to Scala, so Java developers get the best of both worlds without having to learn Scala. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: 1 E 12/1 E 13 Level: Intermediate
Moty Fania (Intel)
Moty Fania shares Intel’s IT experience implementing an on-premises IoT platform for internal use cases. The platform was designed as a multitenant platform with built-in analytical capabilities and based on open source big data technologies and containers. Moty highlights the lessons learned from this journey with a thorough review of the platform’s architecture. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: 1 E 15/1 E 16 Level: Beginner
Tags: iot, energy
Kim Montgomery (GridCure)
Average rating: **...
(2.80, 5 ratings)
With the advent of smart grid technology, the quantity of data collected by electrical utilities has increased by 3–5 orders of magnitude. To make full use of this data, utilities must expand their analytical capabilities and develop new analytical techniques. Kim Montgomery discusses some ways that big data tools are advancing the practice of preventative maintenance in the utility industry. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: 3D 12 Level: Non-technical
Tags: ai
David Beyer (Amplify Partners)
Average rating: ****.
(4.33, 3 ratings)
Society is standing at the gates of what promises to be a profound transformation in the nature of work, the role of data, and the future of the world's major industries. Intelligent machines will play a variety of roles in every sector of the economy. David Beyer explores a number of key industries and their idiosyncratic journeys on the way to adopting AI. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: 3D 10 Level: Intermediate
Jeffrey Carpenter (DataStax)
Average rating: ****.
(4.33, 3 ratings)
Jeff Carpenter describes how data modeling can be a key enabler of microservice architectures for transactional and analytics systems, including service identification, schema design, and event streaming. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: Hall 1C Level: Intermediate
Josh Lemaitre (Thomson Reuters)
Average rating: *****
(5.00, 1 rating)
How can the value of a patent be quantified? Josh Lemaitre explores how Thomson Reuters Labs approached this problem by applying machine learning to the patent corpus in an effort to predict those most likely to be enforced via litigation. Josh covers infrastructure, methods, challenges, and opportunities for future research. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: River Pavilion Level: Intermediate
Todd Lipcon (Cloudera)
Apache Kudu was first announced as a public beta release at Strata NYC 2015 and recently reached 1.0. This conference marks its one-year anniversary as a public open source project. Todd Lipcon offers a very brief refresher on the goals and feature set of the Kudu storage engine, covering the development that has taken place over the last year. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: 3D 08 Level: Beginner
Roy Ben-Alta (Amazon Web Services)
Average rating: *****
(5.00, 1 rating)
Roy Ben-Alta explores the Amazon Kinesis platform in detail and discusses best practices for scaling your core streaming data ingestion pipeline as well as real-world customer use cases and design pattern integration with Amazon Elasticsearch, AWS Lambda, and Apache Spark. Read more.
4:35pm–5:15pm Thursday, 09/29/2016
Location: 1 C04 / 1 C05 Level: Beginner
Vinayak Borkar (X15 Software)
Average rating: ***..
(3.50, 2 ratings)
Starting from first principles, Vinayak Borkar defines the requirements for a modern operational data store and explores some possible architectures to support those requirements. Read more.