Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Speakers

Experts and innovators from around the world share their insights and best practices. New speakers are added regularly. Please check back to see the latest updates to the agenda.

Filter

Search Speakers

Bill Chambers is a product manager at Databricks, where he works on Structured Streaming and data science products. He is the lead author of Spark: The Definitive Guide (O'Reilly), coauthored with Matei Zaharia.

Presentations

Streaming Big Data in the Cloud: What to Consider and Why Session

This talk covers two core topics: first, the motivation and basics of the Structured Streaming processing engine in Apache Spark; second, the core lessons we've learned running hundreds of Structured Streaming workloads in the cloud.

Mohamed AbdelHady is a senior data scientist on the algorithms and data science (ADS) team within the AI+R Group at Microsoft, where he focuses on machine learning applications for text analytics and natural language processing. Mohamed works with Microsoft product teams and external customers to deliver advanced technologies that extract useful and actionable insights from unstructured free text such as search queries, social network messages, product reviews, and customer feedback. Previously, he spent three years at Microsoft Research's Advanced Technology Labs. He holds a PhD in machine learning from the University of Ulm in Germany.

Presentations

Deep learning for domain-specific entity extraction from unstructured text Session

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.

Vijay Srinivas Agneeswaran is director of technology at SapientNitro. Vijay has spent the last 10 years creating intellectual property and building products in the big data area at Oracle, Cognizant, and Impetus, including building PMML support into Spark/Storm and implementing several machine learning algorithms, such as LDA and random forests, over Spark. He also led a team that built a big data governance product for role-based, fine-grained access control inside of Hadoop YARN and built the first distributed deep learning framework on Spark. Earlier in his career, Vijay was a postdoctoral research fellow at the LSIR Labs within the Swiss Federal Institute of Technology, Lausanne (EPFL). He is a senior member of the IEEE and a professional member of the ACM. He holds four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, cloud, grid, peer-to-peer computing, machine learning for big data, and other emerging technologies. Vijay holds a bachelor's degree in computer science and engineering from SVCE, Madras University, an MS (by research) from IIT Madras, and a PhD from IIT Madras.

Presentations

Achieving GDPR Compliance and Data Privacy Using Blockchain Technology Session

We explain how we have used open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR). Key takeaways include an introduction to GDPR, a step further on data privacy; why blockchain is a suitable candidate for implementing GDPR; and lessons learned from our blockchain implementation of GDPR compliance.

Deep Learning Based Search and Recommendation Systems Using TensorFlow Tutorial

Key takeaways include an introduction to deep learning, covering networks such as RBMs, convolutional networks, and autoencoders; an introduction to recommendation systems, explaining why deep learning is required for hybrid systems; a complete hands-on TensorFlow tutorial, including TensorBoard; and an end-to-end view of deep learning-based recommendation and learning-to-rank systems.

John Mark leads a team that is expanding the machine learning and artificial intelligence capabilities of Azure at Microsoft, which he recently joined and, by his own admission, should have joined earlier in a career that has involved working with startups and labs in the Bay Area on such projects as "The Connected Car 2025" at Toyota ITC, peer-to-peer malware detection at Intel, and automated planning at SRI. His dedication to probability and AI led him to found an annual applications workshop for the Uncertainty in AI conference, where he has been a fixture since its inception in 1985. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area's Slavic music chorus.

Presentations

Distributed Clinical Models: Inference without Sharing Patient Data Session

Clinical collaboration benefits from pooling data to learn models from large datasets, but it's hampered by concerns about sharing data. We've developed a privacy-preserving alternative that creates statistical models equivalent to those learned from the entire pooled dataset. We've built this as a cloud application: each collaborator installs their own instance, and the installations self-assemble into a star network.
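One standard way to get a model equivalent to one fit on the entire dataset, without any site sharing patient rows, is to exchange only aggregate sufficient statistics. The NumPy sketch below illustrates that general idea for ordinary linear regression; it is an illustrative assumption about the technique, not the presenters' actual protocol (the site count, model family, and all names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical hospitals, each holding private patient data (X, y).
true_beta = np.array([2.0, -1.0, 0.5])
sites = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = X @ true_beta + rng.normal(scale=0.1, size=100)
    sites.append((X, y))

# Each site shares only aggregate sufficient statistics, never raw rows.
XtX = sum(X.T @ X for X, _ in sites)
Xty = sum(X.T @ y for X, y in sites)
beta_federated = np.linalg.solve(XtX, Xty)

# Identical to fitting on the pooled dataset, which is never assembled
# in the privacy-preserving setting; we build it here only to check.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
beta_pooled, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

assert np.allclose(beta_federated, beta_pooled)
```

Summing per-site statistics works because the normal equations are linear in the data; more complex models typically require iterative exchanges, but the no-raw-rows principle is the same.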

Using R and Python for Scalable Data Science, Machine Learning, and AI Tutorial

Join us to learn how to do scalable, end-to-end data science in R and Python on single machines as well as on Spark clusters and cloud-based infrastructure. You'll be assigned an individual virtual machine with all contents preloaded and software installed and use it to gain experience building and operationalizing machine learning models using distributed functions in both R and Python.

A leading data scientist in infrastructure optimization, Ritesh Agrawal leads the Intelligent Infrastructure Systems team at Uber. The team focuses on scaling data infrastructure for Uber's growing business needs, now and in the foreseeable future. Before Uber, Ritesh specialized in predictive and ranking models at Netflix, AT&T Labs, and Yellow Pages, where he built scalable machine learning infrastructure with technologies such as Docker, Hadoop, and Spark. Ritesh holds a PhD in environmental earth science from Pennsylvania State University (State College), where his thesis focused on computational tools and technologies such as concept map ontologies.

Recent Speaking History:

  • “Smart Data Infrastructure: Using Machine Learning for Infrastructure Optimization”, LCInnovate, San Francisco, CA, 2017
  • “Simple Strategies For Faster Knowledge Discovery In Big Data”, Data Summit, New York, NY, 2017 [Slides: https://www.slideshare.net/slideshow/embed_code/key/JLgqISni6rRnZB]
  • “Efficient Disk Space Utilization In Fast Analytic Engines”, Data Innovation Summit, San Francisco, CA, 2017 [Video: https://ieondemand.com/presentations/future-of-computing-and-the-role-of-iot]

Patents:

  • Ritesh Agrawal, I King, Remi Zajac, “Methods, Systems, And Computer Program Products For Integrated Web Query Classification”. (US Patent 20140032536)
  • Ritesh Agrawal, James Shanahan, “Systems and Methods To Facilitate Local Searches Via Location Disambiguation”. (US Patent 20120117007 A1)

Publications

  • Ritesh Agrawal, Xiaofeng Yu, Irwin King, Remi Zajac. (2011). “Enrichment and Reductionism: Two Approaches for Web Query Classification“. International Conference on Neural Information Processing (ICONIP 2011). Shanghai, China.
  • Ritesh Agrawal, James Shanahan. (2010). “Location Disambiguation in Local Searches Using Gradient Boosted Decision Trees“. ACM GIS (SIGSPATIAL) 2010. San Jose, CA [PDF].
  • Mark Gahegan, Ritesh Agrawal, Anuj Jaiswal, Junyan Luo, Kean-Huat Soon. (2008). “A Platform for Visualizing and Experimenting with Measures of Semantic Similarity in Ontologies and Concept Maps”. Transactions in GIS. 12(6):713-732
  • Ritesh Agrawal, William Pike. (2008). “Capturing, visualizing and sharing the process of data analysis“. DHS Summit, Washington, DC [PDF].
  • Mark Gahegan, Ritesh Agrawal, Tawan Banchue, David DiBiase. (2007). “Building rich, semantic descriptions of learning activities to facilitate reuse in digital libraries“. International Journal on Digital Libraries. 7(1-2):81-97 [PDF].
  • Mark Gahegan, Ritesh Agrawal, Anuj Jaiswal, Kean-Huat Soon. (2007). “Measures of similarity for integrating conceptual geographical knowledge: some ideas and some questions”. International Conference on Spatial Information Theory (COSIT 2007) [PDF].
  • Stephen Weaver, Ritesh Agrawal. (2007). “On the Brink: Using Visual Analytics to Explore Decisions Made During the Cuban Missile Crisis”. Annual Meeting of American Association of Geographers. [LINK].
  • Ritesh Agrawal, Donna Peuquet. (2006). “A unified task taxonomy of spatial analytical and visualization operations”, GIScience 2006.
  • Ritesh Agrawal. (2003). “Space time analysis in an Enterprise GIS”, University Consortium for Geographic Information Systems.
  • Ritesh Agrawal, Deepa Bansod. (2002). “Design and Analysis of Water Distribution System”, Nirmittee 2002. Pune, India.
  • Ritesh Agrawal. (2002). “Sanitary Landfill”. Institute of Engineers, Annual Technical Paper Meet. Nagpur, India.
  • Ritesh Agrawal. (2002). “Rehabilitation of Bridges”. Institute of Engineers, Annual Technical Paper Meet. Nagpur, India

Presentations

Presto Query Gate: Identifying and stopping rogue queries Session

Presto has emerged as the de facto query engine for quickly processing petabytes of data. A few rogue SQL queries, however, can waste a significant amount of critical compute resources and reduce Presto's throughput. At Uber, we use machine learning to identify such rogue queries and stop them early, which has led to significant savings in both computational power and money.

Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads technical infrastructure's internal data processing teams for MillWheel and Flume. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging of the two. He is the author of the 2015 "Dataflow Model" paper and the "Streaming 101" and "Streaming 102" blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Foundations of streaming SQL or: how I learned to love stream & table theory Session

What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Architecting A Data Platform Tutorial

What are the essential components of a data platform? This tutorial explains how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Developing a Modern Enterprise Data Strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those key aspirations that will define an organization’s future vision. In this tutorial, we explain how to create a modern data strategy that powers data-driven business.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time systems with Spark Streaming and Kafka 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Michael Armbrust is the lead developer of the Spark SQL and Structured Streaming projects at Databricks. Michael’s interests broadly include distributed systems, large-scale structured storage, and query optimization. Michael holds a PhD from UC Berkeley, where his thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence.

Presentations

Streaming Big Data in the Cloud: What to Consider and Why Session

This talk covers two core topics: first, the motivation and basics of the Structured Streaming processing engine in Apache Spark; second, the core lessons we've learned running hundreds of Structured Streaming workloads in the cloud.

Shivnath Babu is an adjunct professor of computer science at Duke University, where his research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. He is also the CTO at Unravel Data Systems, the company he cofounded to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has received a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award. He has given talks and distinguished lectures at many research conferences and universities worldwide. Shivnath has also spoken at industry conferences, such as the Hadoop Summit.

Presentations

Using Machine Learning to Simplify Kafka Operations Session

Getting the best performance, predictability, and reliability for Kafka-based applications is an art today. We aim to change that by leveraging recent advances in machine learning and AI. This talk will describe our methodology of applying statistical learning to the rich and diverse monitoring data that is available from Kafka.

Dorna Bandari is the director of algorithms at AI-driven prediction platform Jetlore, where she leads development of large-scale machine learning models and machine learning infrastructure. Previously, she was a lead data scientist at Pinterest and the founder of ML startup Penda. Dorna holds a PhD in electrical engineering from UCLA.

Presentations

Building​ ​a​ ​flexible​ ​ML​ ​pipeline​ ​at​ ​a​ ​B2B​ ​AI​ ​start​up Session

Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth.

Roger Barga is general manager and director of development at Amazon Web Services, where he is responsible for Kinesis data streaming services. Before joining Amazon, Roger was in the Cloud Machine Learning group at Microsoft, where he was responsible for product management of the Azure Machine Learning service. Roger is also an affiliate professor at the University of Washington, where he is a lecturer in the Data Science and Machine Learning programs. Roger holds a PhD in computer science, has been granted over 30 patents, has published over 100 peer-reviewed technical papers and book chapters, and has authored a book on predictive analytics.

Presentations

Continuous Machine Learning over Streaming Data Session

In this talk, we present continuous machine learning algorithms that discover useful information in streaming data. We focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs. We describe the algorithms, their implementation, and their application in real customer use cases.

James Bednar is a senior solutions architect at Anaconda, Inc. Previously, Jim was a lecturer and researcher in Informatics at the University of Edinburgh, Scotland, and a software and hardware engineer at National Instruments. He manages the open source Python projects datashader, HoloViews, GeoViews, ImaGen, and Param. He has published more than 50 papers and books about the visual system, data visualization, and software development. Jim holds a PhD in computer science from the University of Texas as well as degrees in electrical engineering and philosophy.

Presentations

Custom interactive visualizations and dashboards for one billion datapoints on a laptop in 30 lines of Python Tutorial

Python lets you solve data-science problems by stitching together packages from the Python ecosystem, but it can be difficult to choose packages that work well together. Here we take you through a small number of lines of Python code that provide a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints.
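The trick that makes billion-point interactivity feasible is rasterization: instead of drawing every point, aggregate the points into a fixed-size grid of counts and colormap that small image. The NumPy sketch below is a simplified stand-in for the datashader/HoloViews stack the presenter maintains, with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000  # one million points keeps the demo fast; the same idea
               # scales to a billion by aggregating in chunks
x = rng.normal(size=n)
y = rng.normal(size=n)

# The core rasterization step: bin all points into a width x height grid
# of per-pixel counts. The result is tiny regardless of n.
width, height = 300, 200
counts, _, _ = np.histogram2d(x, y, bins=[width, height])

assert counts.shape == (width, height)
assert counts.sum() == n  # every point lands in exactly one pixel
```

From here, libraries like datashader apply a nonlinear colormap (e.g., histogram equalization) to `counts` so both sparse and dense regions stay visible, and re-run the aggregation on every zoom or pan; only the small raster ever reaches the browser.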

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer experience. Tim can frequently be found speaking at conferences in the United States and internationally. He is the copresenter of various O'Reilly training videos on topics ranging from Git to distributed systems and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at Timberglund.com, and is the cohost of the DevRel Radio Podcast. He lives in Littleton, Colorado, with the wife of his youth and their youngest child, the other two having mostly grown up.

Presentations

Stream processing with Kafka Tutorial

Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.

Ray Bernard is the founder and chief architect at SuprFanz.com. Previously, Ray was an adjunct professor at Columbia University and worked for technology giants like Compaq, Dell, and EMC. As leader of the Cosmic Blues Band, he performs regularly at the BB King Blues Club & Grill in New York City.

Presentations

Data science in practice: Examining events in social media Media and Advertising

Ray Bernard and Jennifer Webb explain how cloud-based marketing company SuprFanz uses data science techniques and graph theory with Neo4j to generate live event attendance from social media platforms, email, and SMS.

Brian Bloechle is an industrial mathematician and data scientist.

Presentations

Data science and machine learning with Apache Spark 2-Day Training

Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

Ron Bodkin is technical director for applied artificial intelligence at Google, where he helps Global Fortune 500 enterprises unlock strategic value with AI, acts as executive sponsor for Google product and engineering teams to deliver value from AI solutions, and leads strategic initiatives working with customers and partners. Previously, Ron was vice president and general manager of artificial intelligence at Teradata; the founding CEO of Think Big Analytics (acquired by Teradata in 2014), which provides end-to-end support for enterprise big data, including data science, data engineering, advisory and managed services, and frameworks such as Kylo for enterprise data lakes; vice president of engineering at Quantcast, where he led the data science and engineering teams that pioneered the use of Hadoop and NoSQL for batch and real-time decision making; founder of enterprise consulting firm New Aspects; and cofounder and CTO of B2B applications provider C-Bridge. Ron holds a BS in math and computer science with honors from McGill University and a master's degree in computer science from MIT.

Presentations

Deploying deep learning with TensorFlow Tutorial

TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

Fidan Boylu Uz holds a PhD in decision sciences and has over 10 years of technical experience in machine learning and business intelligence. She is a former professor at the University of Connecticut, where she conducted research and taught courses on machine learning theory and its business applications. She has a number of academic publications in the areas of machine learning and optimization. She is currently a senior data scientist on the algorithms and data science team, responsible for the successful delivery of end-to-end advanced analytics solutions. She has worked on projects in multiple domains, such as predictive maintenance, fraud detection, mathematical optimization, and deep learning.

Presentations

Operationalize Deep Learning: How to deploy and consume your LSTM networks for predictive maintenance scenarios Session

Deep learning has shown superior performance in domains such as object recognition and image classification, and it also excels in domains where time-series data plays an important role. Predictive maintenance is one such domain: data is collected over time to monitor the state of an asset in order to predict failures. In this talk, we show how to operationalize LSTM networks that predict the remaining useful life of aircraft engines.

Joseph Bradley is a software engineer and Apache Spark PMC member working on machine learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his PhD in machine learning from Carnegie Mellon in 2013.

Presentations

Productionizing Apache Spark MLlib Models: Best Practices Session

We discuss common paths to productionizing Apache Spark MLlib models: engineering challenges and corresponding best practices. We cover several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving.

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked at Atigeo, building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. This is a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Kurt Brown leads the Data Platform team at Netflix. Kurt’s group architects and manages the technical infrastructure underpinning the company’s analytics, which includes various big data technologies like Hadoop, Spark, and Presto, Netflix open-sourced applications and services such as Genie and Lipstick, and traditional BI tools including Tableau and Redshift.

Presentations

20 principles & practices (Netflix-style!) to get the most out of your data platform Session

How can you get the most out of your data infrastructure? Come and find out what we do at Netflix and why. We'll run through 20 principles & practices that we've refined and embraced over time. For each one, we'll weave in how it interplays with the technologies we use at Netflix (e.g., S3, Spark, Presto, Druid, Python, Jupyter, ...).

Anne Buff is a business solutions manager and thought leader for SAS Best Practices, a thought leadership organization at SAS Institute. As a speaker and author, she specializes in analytic strategy and culture, governance, change management, and fostering data-driven organizations. Anne has specialized in data and analytics for almost 20 years and has developed courseware for a wide range of technical concepts and software, including SAS Data Management. In her current role, she leverages her training and consulting experience and her data savviness to lead best-practices workshops and facilitate intra-team dialogues that help companies realize their full data and analytics potential.

Presentations

Progressive Data Governance for Emerging Technologies Session

Emerging technologies such as IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize their potential, the approach to governance must radically shift. This session looks at what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement.

Noah is a software engineer at Salesforce on the Intelligence Services team. He holds a PhD in decision and risk analysis from Stanford University, where his research simplified complex decision-making techniques for application in everyday life. At Salesforce, he focuses on applying artificial intelligence to improve the quality of the decisions his customers make every day in their businesses.

Presentations

Building a Contacts Graph from activity data Session

In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Salesforce is using data science and engineering to enable salespeople to monitor their emails in real time, surfacing insights and recommendations from a graph that models contextual data.

Yuri Bykov leads data science at Dice.com. Yuri's team leverages machine learning, NLP, big data, information retrieval, and other scientific disciplines to research and build innovative data products and services that help tech professionals manage their careers. Yuri started his career as a software developer before moving into BI and data analytics and finding his passion in data science. He holds an MBA/MIS from the University of Iowa.

Presentations

Building Career Advisory Tools for the Tech Sector using Machine Learning Session

At Dice.com we recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. We will discuss how we applied different machine learning algorithms to solve each of these problems, and the technologies used to build, deploy and monitor these solutions in production.

Henry Cai is a software engineer on the data engineering team at Pinterest, where he designs large-scale big data infrastructures. Previously, he worked at LinkedIn. Henry is the maintainer and contributor of many open source data ingestion systems, including Camus, Kafka, Gobblin, and Secor.

Presentations

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously Session

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.

Presentations

Pipeline testing with Great Expectations Session

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.

Yishay Carmiel is the head of Spoken Labs, the strategic artificial intelligence and machine learning research arm of Spoken Communications. Spoken Labs develops and implements industry-leading deep learning and AI technologies for speech recognition (ASR), natural language processing (NLP), and advanced voice data extraction. Yishay and his team are currently working on bleeding-edge innovations that make the real-time customer experience a reality at scale. Yishay has nearly 20 years' experience as an algorithm scientist and technology leader building large-scale machine learning algorithms and serving as a deep learning expert.

Presentations

Executive Briefing: The conversational AI revolution Session

For years, one of the most important tasks of AI has been to understand humans. People want machines to understand not only what they say but also what they mean, and to take particular actions based on that information. This goal is the essence of conversational AI. We will explore the latest breakthroughs and revolutions in this field and the challenges that are still to come.

Michelle Casbon is director of data science at Qordoba. Michelle’s development experience spans more than a decade across various industries, including media, investment banking, healthcare, retail, and geospatial services. Previously, she was a senior data science engineer at Idibon, where she built tools for generating predictions on text datasets. She loves working with open source projects and has contributed to Apache Spark and Apache Flume. Her writing has been featured in the AI section of O’Reilly Radar. Michelle holds a master’s degree from the University of Cambridge, focusing on NLP, speech recognition, speech synthesis, and machine translation.

Presentations

Continuous Delivery for NLP on Kubernetes: Lessons Learned Session

Michelle Casbon describes how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux. Her lessons-learned approach details how to build resilient systems so that engineers and data scientists can spend more of their time on product improvement rather than triage and uptime.

Rachita Chandra is a solutions architect at IBM Watson Health, where she builds end-to-end machine learning solutions in healthcare. She has experience implementing large-scale distributed machine learning algorithms. She holds a master's degree and a bachelor's degree in electrical and computer engineering from Carnegie Mellon.

Presentations

Transforming a machine learning prototype to a deployable solution leveraging Spark in healthcare Session

This session describes the challenges and considerations involved in transforming a research prototype built for a single machine into a deployable healthcare solution that leverages Spark in a distributed environment.

Diane Chang is a Distinguished Data Scientist at Intuit, where she has worked on many interesting business problems that depend on machine learning, behavioral analysis, and risk prediction. Diane has a PhD in operations research from Stanford and has worked for a small “mathematical” consulting firm and a startup in the online advertising space. Prior to joining Intuit, Diane was a stay-at-home mom for six years.

Presentations

Want to build a better chatbot? Start with your data. Session

When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for end users. In this session, Intuit’s Diane Chang shares how her team preps, cleans, organizes, and augments its training data, along with some of the best practices she has learned along the way.

Sugreev analyzes data from all of Thorn’s products to help law enforcement rescue victims of sexual exploitation faster. He has years of analytics experience in experimental and computational fusion energy physics and more recently was a data scientist working with real-time sensor data for healthcare and defense applications. He holds a B.A. in physics and applied mathematics from UC Berkeley and is pursuing a Ph.D. in engineering physics at UC San Diego.

Presentations

Fighting Sex Trafficking with Data Science Session

Thorn is a nonprofit that uses technology to fight online child sexual exploitation. This talk describes Spotlight, a tool created by Thorn that allows law enforcement to process millions of escort ads per month in an effort to fight sex trafficking. Graph analysis, time series analysis and NLP techniques are used to surface important networks of ads and characterize their behavior over time.

Anny (Yunzhu) Chen is a senior data scientist at Uber interested in applying statistical and machine learning models to real business problems. She is currently working on time series anomaly detection and forecasting.

Prior to joining Uber, she was a data scientist at Adobe, where she worked on digital attribution modeling for customer conversion data. She received her MS in statistics from Stanford University in 2013 and her BS in probability and statistics from Peking University in 2011. She is passionate about applying statistics to real datasets and about big data technology.

Presentations

Detecting time series anomalies at Uber scale with recurrent neural networks Session

Time series forecasting and anomaly detection is of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.
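The general shape of forecast-based anomaly detection alluded to in this abstract can be sketched in a few lines. Everything below (the toy data, the tolerance band, the function name) is illustrative only, not Uber's implementation; in production the forecasts would come from a trained model such as an RNN.

```python
# Hypothetical sketch: flag anomalies by comparing observations against
# a model's forecast, using a tolerance band derived from the spread of
# the forecast errors (residuals).
from statistics import mean, stdev

def detect_anomalies(observed, forecast, num_sigmas=2.0):
    """Return indices where |observed - forecast| falls outside the band."""
    residuals = [o - f for o, f in zip(observed, forecast)]
    mu, sigma = mean(residuals), stdev(residuals)
    return [i for i, r in enumerate(residuals)
            if abs(r - mu) > num_sigmas * sigma]

observed = [100, 102, 98, 101, 250, 99, 103]   # spike at index 4
forecast = [100, 101, 100, 100, 100, 100, 101]
print(detect_anomalies(observed, forecast))     # -> [4]
```

In practice the tolerance band would be calibrated per metric, since a fixed sigma multiplier trades off false positives against missed anomalies.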

April Chen is a lead data scientist on the R&D team at Civis Analytics. She develops software to automate statistical modeling workflows to help organizations from Fortune 500 companies to non-profits understand and leverage their data. April’s background is in economics. Prior to joining Civis, she worked as an analytics consultant.

Presentations

Show Me the Money: Understanding Causality for Ad Attribution Media and Advertising

Which of your ad campaigns lead to the most sales? In the absence of A/B testing, marketers often turn to simple touch attribution models. This talk covers the shortcomings of these models and proposes a new approach that uses matching methods from causal inference to more accurately measure marketing effectiveness.
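The matching idea behind this approach can be sketched minimally: pair each ad-exposed user with the unexposed user closest in a pre-exposure covariate, then average the outcome differences. The data, covariate, and function below are invented for illustration and are not the speakers' method.

```python
# Hypothetical sketch of matching for ad attribution: nearest-neighbor
# matching (with replacement) on a single pre-exposure covariate.
def matched_effect(treated, control):
    """treated/control: lists of (covariate, outcome) tuples.
    Returns the average treated-minus-matched-control outcome."""
    diffs = []
    for cov_t, out_t in treated:
        # match each treated unit to the control unit nearest in covariate
        _, out_c = min(control, key=lambda c: abs(c[0] - cov_t))
        diffs.append(out_t - out_c)
    return sum(diffs) / len(diffs)

treated = [(10, 5.0), (20, 9.0)]              # (prior spend, sales)
control = [(11, 4.0), (19, 7.0), (50, 30.0)]  # unexposed users
print(matched_effect(treated, control))        # -> 1.5
```

Real systems match on many covariates (e.g., via propensity scores) precisely because a single covariate rarely removes all confounding.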

Shuyi Chen is a senior software engineer at Uber working on building scalable real-time data solutions. He built Uber’s real-time complex event processing platform for Marketplace, which powers 100+ production real-time use cases, and he leads the streaming analytics team at Uber. He has years of experience in storage infrastructure, data infrastructure, and Android/iOS development at Google and Uber.

Presentations

Streaming SQL to unify Batch and Stream Processing - Theory and Practice with Apache Flink at Uber Session

This talk discusses SQL in the world of streaming data and its implementation in Apache Flink. We cover the core concepts (streaming semantics, event time, incremental results) and share practical experiences of using Flink SQL in production at Uber, including how Uber leverages Flink SQL to solve its unique business challenges.

Weiting is a senior software engineer in Intel’s Software Service Group, where he has worked on big data in the cloud solutions for several years. Part of his role is consulting with customers on integrating big data solutions into their cloud infrastructure. He has been a contributor to the OpenStack Sahara project for the past two years and currently focuses on Docker container-based big data solutions in the cloud.

Presentations

Spark on Kubernetes: A case study at JD.com Session

Using JD.com as an example, this session explains how the company runs Spark on Kubernetes in a production environment and why it chose Spark on Kubernetes for its AI workloads. You’ll learn how to run Spark with Kubernetes and the advantages you can gain from doing so.

Nic Chakhani is the product leader responsible for defining the data product roadmap at Weight Watchers.

Presentations

How Weight Watchers embraced modern data practices during its transformation from a legacy IT shop to a modern technology organization Session

For organizations stuck in a myriad of legacy infrastructure, the path to AI and deep learning can seem impossible. In this talk, we discuss how Weight Watchers moved from a traditional BI organization to one that uses data effectively, looking at where we were, what our needs were, the changes that were required, and the technologies and architecture we used to achieve our goals.

Pramit Choudhary is a lead data scientist at DataScience.com, where he focuses on optimizing and applying classical machine learning and Bayesian design strategy to solve real-world problems. Currently, he is leading initiatives to find better ways to explain a model’s learned decision policies, reducing the chaos in building effective models and closing the gap between a prototype and an operationalized model.

Presentations

Human in the loop: Bayesian rules enabling explainable AI Session

Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of “if…and else” statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation.
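To make the "decision list" structure concrete, here is a hedged sketch of how such a list of ordered if/and rules is evaluated once learned; the rules, attribute names, and function are invented for illustration, and the Bayesian inference that selects high-posterior lists is out of scope.

```python
# Hypothetical sketch: an ordered decision list of "if <cond> and <cond>
# ... else" rules, evaluated top to bottom; the first rule whose
# conditions all hold determines the prediction.
def predict(decision_list, x, default):
    for conditions, label in decision_list:
        if all(cond(x) for cond in conditions):
            return label
    return default  # the trailing "else" clause

rules = [
    ([lambda x: x["age"] < 25, lambda x: x["risk"] > 0.5], "deny"),
    ([lambda x: x["risk"] > 0.9], "deny"),
]
print(predict(rules, {"age": 30, "risk": 0.95}, default="approve"))  # -> deny
print(predict(rules, {"age": 30, "risk": 0.2}, default="approve"))   # -> approve
```

The appeal of this representation is that each prediction can be traced to a single human-readable rule.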

Michael Chui is a San Francisco-based partner in the McKinsey Global Institute, where he directs research on the impact of disruptive technologies, such as big data, social media, and the internet of things, on business and the economy. Previously, as a McKinsey consultant, Michael served clients in the high-tech, media, and telecom industries on multiple topics. Prior to joining McKinsey, he was the first chief information officer of the City of Bloomington, Indiana, and was the founder and executive director of HoosierNet, a regional internet service provider. Michael is a frequent speaker at major global conferences and his research has been cited in leading publications around the world. He holds a BS in symbolic systems from Stanford University and a PhD in computer science and cognitive science and an MS in computer science, both from Indiana University.

Presentations

Executive Briefing: Artificial intelligence—The next digital frontier? Session

After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge.

Garner Chung is the engineering manager of the human computation team and the data science team supporting core product, growth, and infrastructure at Pinterest. Previously, he managed the data science team at Opower, where he drove efforts to research and productionize predictive models for all of product and engineering. Many years ago, he studied film at UC Berkeley, where he learned to deconstruct and complicate misleadingly simple narratives. Over the course of his 20 years in the tech industry, he has witnessed exuberance over technology’s great promise ebb and flow, all the while remaining steadfast in his gratitude for having played some small part. As a leader, Garner has learned to drive teams that privilege responsibility and end-to-end ownership over arbitrary commitments.

Presentations

Humans versus the machines: Using human-based computation to improve machine learning Session

Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform.

Eric is the Chief Algorithms Officer at Stitch Fix, leading a team of 80+ data scientists. He is responsible for the multitude of algorithms that pervade nearly every function of the company: merchandise, inventory, marketing, forecasting and demand, operations, and the styling recommender system. Prior to joining Stitch Fix, he was the vice president of data science and engineering at Netflix. Eric holds a B.A. in economics, an M.S. in information systems, and an M.S. in management science and engineering.

Presentations

Differentiating via Data Science Keynote

Data Science has historically been leveraged as a supportive function. But for some business models and companies, Data Science can be the primary means for competitive differentiation. This requires a different way of working and organizing. For it to thrive, Data Science needs its own department reporting directly to the CEO with a workflow completely different from any other department.

Mike Conover builds machine learning technologies that leverage the behavior and relationships of hundreds of millions of people. An AI engineer at SkipFlag, Mike previously led news relevance research and development at LinkedIn. He has a PhD in complex systems analysis with a focus on information propagation in large-scale social networks. His work has appeared in the New York Times, the Wall Street Journal, and on National Public Radio.

Presentations

Fast & Effective: Natural Language Understanding Session

Cut to the chase with an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis.

Ian Cook is a data scientist at Cloudera and the author of several R packages including implyr. Ian is co-founder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina area, where he lives with his wife and two young children. Previously, Ian was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

sparklyr, implyr, and more: dplyr interfaces to large-scale data Session

The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.

Dan Crankshaw is a PhD student in the CS Department at UC Berkeley, where he works in the RISELab. After cutting his teeth doing large-scale data analysis on cosmology simulation data and building systems for distributed graph analysis, Dan has turned his attention to machine learning systems. His current research interests include systems and techniques for serving and deploying machine learning, with a particular emphasis on low-latency and interactive applications.

Presentations

Deploying and monitoring interactive machine learning applications with Clipper Session

Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Mauro Damo works with organizations to help them identify, develop, and implement analytical solutions in big data environments, focusing on solving business problems.

As a senior data scientist, he has developed and implemented a wide range of analytical projects for customers across the Americas in industries including mortgage insurance, financial brokerage, cable, nongovernmental organizations, healthcare, and supply chain.

His experience spans supervised and unsupervised models, including time series models, graph analysis, optimization models, deep learning models such as convolutional neural networks, neural networks, clustering, dimensionality reduction, tree algorithms, frequent pattern mining, ensemble models, Markov chains, and gradient descent.

He has written patents and several papers and has spoken at seminars and in classrooms. Mr. Damo holds a master of science in business, an MBA in finance, an undergraduate degree in business, and an associate degree in computer science. His main programming languages are R, Python, and SQL.

Presentations

Bladder Cancer Diagnosis using Deep Learning Session

Image recognition and classification of diseases is expected to improve and support physicians’ decisions. Applying deep learning techniques to recognize diseases in organs will minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis.

John Davis is a data scientist on the R&D team at Civis Analytics. He spends his time writing tools that automate causal inference analyses. John holds a PhD in statistics from the University of Wisconsin-Madison, where he taught biostatistics to premed students and did research in mathematical statistics.

Presentations

Show Me the Money: Understanding Causality for Ad Attribution Media and Advertising

Which of your ad campaigns lead to the most sales? In the absence of A/B testing, marketers often turn to simple touch attribution models. This talk covers the shortcomings of these models and proposes a new approach that uses matching methods from causal inference to more accurately measure marketing effectiveness.

Rahim Daya is head of search products at Pinterest. Previously, he led search and recommendation product teams at LinkedIn and Groupon.

Presentations

Personalization at scale: Mastering the challenges of personalization to create compelling user experiences Session

Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user.

Danielle Dean is a Principal Data Scientist lead at Microsoft in the Cloud AI Platform group, where she leads a team of data scientists and engineers on end-to-end analytics projects using Microsoft’s Cortana Intelligence Suite—from automating the ingestion of data to analysis and implementation of algorithms, creating web services of these implementations, and using those to integrate into customer solutions or build end-user dashboards and visualizations. Danielle holds a PhD in quantitative psychology from the University of North Carolina at Chapel Hill, where she studied the application of multilevel event history models to understand the timing and processes leading to events between dyads within social networks.

Presentations

How Does a Big Data Professional get Started with AI? Session

Artificial intelligence (AI) has tremendous potential to extend our capabilities and to empower organizations to accelerate their digital transformation by infusing apps and experiences with AI. This session will help big data professionals demystify AI and show how they can leverage and evolve their valuable big data skills toward doing AI.

Jeff Dean is a Google senior fellow in Google’s Knowledge Group. During his time at Google, Jeff has codesigned and implemented five generations of Google’s crawling, indexing, and query serving systems, major pieces of Google’s initial advertising and AdSense for content systems, and Google’s distributed computing infrastructure, including the MapReduce, BigTable and Spanner systems, protocol buffers, LevelDB, systems infrastructure for statistical machine translation, and a variety of internal and external libraries and developer tools. He is currently working on large-scale distributed systems for training deep neural models for speech, vision, and text understanding. Jeff is a fellow of the ACM and the AAAS, a member of the US National Academy of Engineering, and a recipient of the ACM-Infosys Foundation Award in the Computing Sciences.

Presentations

Deep learning for tackling important problems Keynote

Keynote with Jeff Dean

A seasoned data science and analytics leader with experience building and managing high-performing teams to support strategic decision making, business analytics, marketing analytics, product analytics, predictive modeling, reporting, and executive communication. Extensive experience building analytics capabilities from scratch and forming and growing teams of well-skilled analysts and scientists. An innovative leader in managing large teams and driving analytics in organizations, widely recognized for superior skills in leadership, team building, employee development, and creating significant business relationships.

Presentations

Presto Query Gate: Identifying and stopping rogue queries. Session

Presto has emerged as the de facto query engine for quickly processing petabytes of data. However, a few rogue SQL queries can waste a significant amount of critical compute resources and reduce Presto's throughput. At Uber, we use machine learning to identify such rogue queries and stop them early. This has led to significant savings in both computational power and money.

Alex Deng is a principal data scientist manager on the Microsoft Analysis and Experimentation team, where he and his team work on methodological improvements to the experimentation platform as well as related engineering challenges. His work in this area has been published in conference proceedings such as KDD, WWW, and WSDM, as well as in statistical journals. He co-lectured a tutorial on A/B testing at JSM 2015. Alex received a PhD in statistics from Stanford University in 2010 and a BS in mathematics from Zhejiang University in 2006.

Presentations

A/B Testing at Scale: Accelerating Software Innovation Tutorial

Controlled experiments, including A/B tests, have revolutionized the way software is developed, with new ideas objectively evaluated with real users. We provide an introduction and lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft and executing over 10,000 experiments per year.
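At its core, an A/B test readout compares conversion rates between two variants and asks whether the difference could be chance. The sketch below shows a standard two-proportion z-test using only the standard library; the sample numbers are invented, and production platforms like the one described add much more (variance reduction, sequential monitoring, guardrail metrics).

```python
# Hypothetical sketch: two-proportion z-test on conversion counts.
from math import sqrt, erf

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: the two rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)         # rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Control converts at 10%, treatment at 13.5%, 1,000 users each.
z, p = two_proportion_ztest(conv_a=100, n_a=1000, conv_b=135, n_b=1000)
print(z, p)  # z is positive and p is below the usual 0.05 threshold
```

Running thousands of such tests per year is exactly why platforms also need mechanisms against false positives from repeated looks at the data.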

Matt Derda first discovered Trifacta at PepsiCo, where he was a CPFR (collaborative planning, forecasting, and replenishment) analyst. With Trifacta, Matt’s team accelerated the preparation of customer supply chain data to forecast sales more accurately and quickly. He became a huge advocate for Trifacta, even telling his story at Strata + Hadoop World, and later joined Trifacta as a member of the customer success team.

Presentations

Data wrangling for retail giants Session

PRGX is a global leader in Recovery Audit and Source-to-Pay (S2P) Analytics services, serving around 75% of the top 20 global retailers. During this session, PRGX will explain how they’ve adopted Trifacta and Cloudera to scale their current processes, and increase revenue for the products and services they offer clients.

Ding Ding is a senior software engineer on Intel’s big data technology team, where she works on developing and optimizing distributed machine learning and deep learning algorithms on Apache Spark, focusing particularly on large-scale analytical applications and infrastructure on Spark.

Presentations

Accelerating Deep Learning on Apache Spark with Coarse Grained Scheduling Session

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. This talk proposes a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models such as Inception and VGG.

Pavel Dmitriev is a Principal Data Scientist with Microsoft’s Analysis and Experimentation team. He was previously a Researcher at Yahoo! Labs. Pavel has been working in the field of web mining, search, and experimentation for close to 15 years. He published a number of papers at top Data Mining conferences including KDD, WWW, CIKM, ICDM, BigData. He was an invited lecturer at Russian Summer School on Information Retrieval in 2007 and 2009, taught a tutorial at WWW 2010, and was an invited speaker at University of Pittsburgh Big Data colloquium in 2016. Pavel received a Ph.D. degree in Computer Science from Cornell University in 2008, and a B.S. degree in Applied Mathematics from Moscow State University in 2002.

Presentations

A/B Testing at Scale: Accelerating Software Innovation Tutorial

Controlled experiments, including A/B tests, have revolutionized the way software is developed, with new ideas objectively evaluated with real users. We provide an introduction and lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft and executing over 10,000 experiments per year.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive Briefing: GDPR - Getting Your Data Ready for Heavy New EU Privacy Regulations Session

General Data Protection Regulation (GDPR) will go into effect in May 2018 for firms doing any business in the EU. Yet many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). This session will explore the capabilities your data environment needs in order to simplify GDPR compliance, as well as future regulations.

Mike Driscoll founded Metamarkets in 2010 after spending more than a decade developing data analytics solutions for online retail, life sciences, digital media, insurance, and banking. Prior to Metamarkets, Mike successfully founded and sold two companies: Dataspora, a life science analytics company, and CustomInk, an early pioneer in customized apparel. He began his career as a software engineer for the Human Genome Project. Mike holds an A.B. in Government from Harvard and a Ph.D. in Bioinformatics from Boston University.

Presentations

Human Eyes on AI Session

There’s a make-or-break step ahead for AI development: translating data from machine learning models into beautiful, intuitive visuals. AI tools shouldn’t be designed to replace humans; they should be built with human eyes in mind. This session offers advice for creators of next-generation predictive algorithms, drawn from our experiences turning big data into interactive visualizations.

Ted Dunning is chief applications architect at MapR Technologies. He’s also a board member for the Apache Software Foundation, a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects, and a mentor for various incubator projects. Ted contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library. He also designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date. He holds a PhD in computing science from the University of Sheffield. He is on Twitter as @ted_dunning.

Presentations

Better machine learning logistics with the rendezvous architecture Session

Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot zero latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime.
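The routing idea at the heart of the rendezvous approach can be sketched very compactly: every request is scored by all candidate models, only the designated primary answers live, and all results are logged so challengers can be compared on real traffic before promotion. The models, names, and logging scheme below are a toy illustration, not the actual rendezvous architecture implementation.

```python
# Hypothetical sketch of the rendezvous routing idea.
def rendezvous(models, primary, request, log):
    # score the request with every registered model
    results = {name: model(request) for name, model in models.items()}
    log.append((request, results))   # retained for offline model comparison
    return results[primary]          # only the primary's answer is served

models = {
    "v1": lambda x: x * 2,       # current production model
    "v2": lambda x: x * 2 + 1,   # challenger, warmed up on live traffic
}
log = []
print(rendezvous(models, "v1", 21, log))  # -> 42
```

Because the challenger already processes production data, "rolling out" v2 is just switching the primary name, which is what makes zero-latency rollout and rollback possible.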

Zoran Dzunic is a data scientist on the algorithms and data science (ADS) team within the AI+R Group at Microsoft, where he focuses on machine learning applications for text analytics and natural language processing. He holds a PhD and a master’s degree from MIT, where he focused on Bayesian probabilistic inference, and a bachelor’s degree from the University of Nis in Serbia.

Presentations

Deep learning for domain-specific entity extraction from unstructured text Session

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train a LSTM recurrent neural network for entity extraction.
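As a concrete illustration of the final step of such a pipeline, here is a hedged sketch of decoding BIO tags (a common output format for LSTM sequence taggers) into entity spans; the tokens, tags, and function are invented for illustration, and the embedding training and LSTM itself are omitted.

```python
# Hypothetical sketch: collect (entity_text, entity_type) spans from a
# token sequence and its BIO tags (B- begins a span, I- continues it,
# O is outside any entity).
def decode_bio(tokens, tags):
    entities, current, etype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close any open span
                entities.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:                                # "O" ends any open span
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

tokens = ["Aspirin", "reduces", "myocardial", "infarction", "risk"]
tags   = ["B-DRUG", "O", "B-DISEASE", "I-DISEASE", "O"]
print(decode_bio(tokens, tags))
# -> [('Aspirin', 'DRUG'), ('myocardial infarction', 'DISEASE')]
```

The tagger's quality lives in the model that predicts the tags; the decoding step is deliberately simple so that extracted spans remain auditable.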

Nick Elprin is the CEO and cofounder of Domino Data Lab, a data science platform that enterprises use to accelerate research and more rapidly integrate predictive models into their business. Nick has over a decade of experience working with quantitative researchers and data scientists, stemming from his time as a senior technologist at Bridgewater Associates, where his team designed and built the firm’s next-generation research platform.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Sergey Ermolin is a Silicon Valley veteran with a passion for machine learning and artificial intelligence. His interest in neural networks goes back to 1996, when he used them to predict the aging behavior of quartz crystals and cesium atomic clocks made by Hewlett-Packard at its Santa Clara campus. Sergey is currently a solutions architect on the big data technologies team at Intel, working on Apache Spark and distributed deep learning projects. He holds an MSEE and a Mining Massive Data Sets certificate from Stanford, as well as a BS in physics and a BS in mechanical engineering from California State University, Sacramento.

Presentations

Accelerating Deep Learning on Apache Spark with Coarse Grained Scheduling Session

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. This talk proposes a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models such as Inception and VGG.

Improving user-merchant propensity modeling using BigDL’s RNN/LSTMs at scale Session

This session demonstrates the use of RNNs in BigDL to predict a user’s probability of shopping at a particular offer merchant during a “campaign period,” comparing and contrasting the RNN-based method with traditional ones such as logistic regression and random forests.

Lenny Evans is a data scientist at Uber focused on the applications of unsupervised methods and deep learning to fraud prevention, specifically developing anomaly detection models to prevent account takeovers and computer vision models for verifying possession of credit cards.

Presentations

Using computer vision to combat stolen credit card fraud Session

Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities.

Stephan Ewen is one of the originators and committers of the Apache Flink project and CTO at data Artisans, leading the development of large-scale data stream processing technology. He is also a PMC member of Apache Beam, a project to create a unified abstraction for batch and stream data processing. Stephan coauthored the Stratosphere system and has worked on data processing technologies at IBM and Microsoft. Stephan holds a PhD from the Berlin University of Technology.

Presentations

Unified and elastic batch and stream processing with Pravega and Apache Flink Session

We present an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The combination of these two systems offers an unprecedented way of handling “everything as a stream,” with unbounded streaming storage, a unified batch and streaming abstraction, and the ability to dynamically accommodate workload variations in a novel way.

Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Previously, Bin worked at Google building next-generation storage infrastructure, where he won Google’s Technical Infrastructure award. He holds a PhD in computer science from Carnegie Mellon University.

Presentations

Powering robotics clouds with Alluxio Session

Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.

Zhen is a software development engineer at JD.com. He joined JD.com four years ago and focuses on machine learning platform development and management.

Presentations

Spark on Kubernetes: A case study at JD.com Session

Using JD.com as an example, this session explains how the company uses Spark on Kubernetes in a production environment and why it chose Spark on Kubernetes for its AI workloads. You will learn how to run Spark with Kubernetes and the advantages you can gain from Spark on Kubernetes.

Keno Fischer leads Julia Computing’s efforts in the compiler and developer tools space. He has been a core developer of the Julia language for more than five years. Prior to joining Julia Computing, Keno attended Harvard University, obtaining an AM degree in physics as well as an AB degree in physics, mathematics, and computer science.

Presentations

Cataloging the Visible Universe through Bayesian Inference at Petascale in Julia Session

Julia is rapidly becoming a prevalent language at the forefront of scientific discovery. This talk highlights one of the most ambitious recent use cases for Julia: using machine learning to derive a catalog of astronomical objects from multi-terabyte astronomical image datasets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing.

As CTO, Tom works with enterprise customers to ensure they take full advantage of MapR technology. He also leads initiatives to advance the company’s innovation agenda globally.
Tom was previously with Oracle, where he was a senior executive in engineering and operations for over five years, supporting the company’s top 40 cloud customers globally. He was also Oracle’s senior vice president and CIO for global commercial cloud services, focusing on improving service delivery through automation and direct action with customers. Prior to Oracle, Tom served as CIO and vice president of cloud computing at SuccessFactors (now SAP), where he ran cloud operations as well as emerging technologies in product engineering. Additionally, Tom led technology teams at Qualcomm as CIO of CDMA technologies and with eBay Inc. where he was vice president and acting CTO.

Presentations

Cloud, Multi-Cloud and the Data Refinery Session

The monolithic cloud is dying. The challenge today is delivering capabilities across multiple clouds while simultaneously transitioning to next-generation platforms and applications. This presentation walks through technological approaches and solutions that make this possible while delivering data-driven applications and operations.

I am a seasoned marketing strategy and analytics leader with experience combining disparate information sources to make the right business decisions. My experience has given me a unique skill set for running all aspects of a marketing analytics function, including managing methods, technology, and people. With skills spanning strategy, analytics, statistics, and financial engineering, I have had the privilege of leading diverse teams that have helped over 100 Fortune 1000 companies, across virtually every industry including banking, maximize their investments in marketing.

RELEVANT EXPERIENCE:

General Mills
Senior Manager Decision Sciences
4/2013-Current
• Started Big Data at GMI as a team of 1; now lead a team of 35 globally across Data Science & Analytics, Visualization & Data Stewards
• Led company to bring in Hadoop and our analytic tools
• Developed Paid Owned Earned measurement platform
• Lead analytics team for Owned Marketing

Carlson Marketing now called Aimia
Last Position: Director Strategy & Analytics
8/2003-4/2013
• Started as an analyst and promoted 4 times up to a senior leader in the analytics group
• Led customer relationship marketing analytics on the behalf of clients (see page 2 for select examples)
• Part of leadership team that standardized and trained analytics staff on techniques

KPMG and 2 other Consulting Firms
6/1999 – 3/2013
• Performed statistical and financial modeling for acquisition analysis (purchase price)
• This was primarily an international role with extensive work in Asia and Europe

ANALYTICS SKILLS
Languages: R, Python, SQL
Platforms: Oracle, SQL, Hadoop
Analytic techniques: Clustering, Statistical Inference, Structural Equation Modeling, Monte Carlo Simulation, Regression

EDUCATION

University of Minnesota
B.S. Chemical Engineering – 1999

University of Minnesota
Master of Business Administration – 2008

Presentations

Automating Business Insights through Artificial Intelligence Data Case Studies

Decision Makers are busy. Businesses can hire people to analyze data for them, but most companies are resource constrained and can’t hire a small army to look through all their data. In this session, General Mills will share how we built automation so Decision Makers can quickly focus on metrics that matter and cut through everything else that does not.

Eugene Fratkin is a director of engineering at Cloudera, leading cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Eugene Fratkin, and Jennifer Wu lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Michael J. Freedman is a Professor in the Computer Science Department at Princeton University, as well as the co-founder and CTO of Timescale, building an open-source database that scales out SQL for time-series data. His work broadly focuses on distributed systems, networking, and security.

Freedman developed CoralCDN (a decentralized content distribution network serving millions of daily users) and Ethane (which formed the basis for the OpenFlow / software-defined networking architecture). He co-founded Illuminics Systems around IP geolocation and intelligence, which was acquired by Quova (now part of Neustar). Freedman is also a technical advisor to Blockstack, building a more decentralized Internet leveraging the blockchain.

Honors include a Presidential Early Career Award for Scientists and Engineers (PECASE, given by President Obama), SIGCOMM Test of Time Award, Caspar Bowden Award for Privacy Enhancing Technologies, Sloan Fellowship, NSF CAREER Award, Office of Naval Research Young Investigator Award, DARPA Computer Science Study Group membership, and multiple award publications. Prior to joining Princeton in 2007, he received his Ph.D. in computer science from NYU’s Courant Institute and his S.B. and M.Eng. degrees from MIT.

Presentations

TimescaleDB: Re-engineering PostgreSQL as a time-series database Session

I offer an overview of TimescaleDB, a new open-source database designed for time series workloads, engineered up as a plugin to PostgreSQL. Unlike most time-series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries. This enables developers to avoid today’s polyglot architectures and their corresponding operational and application complexity.

Siddha graduated from Carnegie Mellon University with a master’s in computational data science. Her work ranges from visual question answering to generative adversarial networks to gathering insights from CERN’s petabyte-scale data and has been published at top-tier conferences like CVPR. She is a frequent speaker at Strata+Hadoop and Strata AI conferences and advises the Data Lab at NASA. When not working, you might catch her hiking!

Presentations

Being smarter than dinosaurs: How NASA uses Deep Learning for Planetary Defense Session

We will talk about how the FDL lab at NASA uses artificial intelligence to (1) improve and automate the identification of meteors above human-level performance using meteor shower images and (2) recover known meteor shower streams and characterize previously unknown meteor showers using orbital data. This work is aimed at providing more warning time for long-period comet impacts.

Programmer, father, husband, avid reader, author, occasional speaker at technology conferences, and Seinfeld fanboy. Debasish is a senior member of the ACM and loves spending time with his beautiful family. He is passionate about technology and open source, loves functional programming, and has been trying to learn math and machine learning.

He has authored two books:

  • DSLs In Action (http://manning.com/ghosh), Manning, December 2010
  • Functional & Reactive Domain Modeling (http://manning.com/ghosh2), Manning, October 2016

Debasish is an occasional speaker at technology conferences worldwide, including QCon, PhillyETE, Codemesh, ScalaWorld, FunctionalConf, and GOTO. Some of his presentations can be found on his LinkedIn profile: https://linkedin.com/in/debasishgh.

Presentations

Approximation Data Structures in Streaming Data Processing Session

This session discusses the role that approximation data structures like Bloom filters, sketches, and HyperLogLog play in processing streaming data. Streams are typically unbounded in space and time, so any processing has to be done online using sublinear space. I discuss the probabilistic bounds that these data structures offer and how they can be used to implement solutions for fast and streaming architectures.

In the last eight years, I have been responsible for shipping more than 10 new products at multiple companies, products that generated millions of dollars of revenue and operated at global scale. I have also written production machine learning models in Python and R for the last five years. In my last position, I helped build the company from scratch, created the first product, and hired and managed all employees over three years. These new products have come in varying conditions:

° products I wrote myself 100% from scratch
° teams I hired and managed to write them
° deeply troubled products that were full of technical debt when I inherited them and fixed them

★ Year – Product – Technologies ★
° 2017 – Shopify App
° 2017 – AI Pipeline on Google Cloud
° 2016 – VR Pipeline on AWS – Python, Lambda, AWS
° 2015 – Sports Social Network (iOS,Android,Web) – Erlang/Elixir/C#/Swift/Python/R
° 2014 – Daily Fantasy Sports iOS App – Erlang/C#/Xamarin
° 2013 – Versu iOS Mobile Game – Erlang/C#/Xamarin
° 2013 – Blocksworld iOS Game – Python/C#/Unity 3D
° 2012 – DIO Fan Fiction Social Network – Rails/RabbitMQ
° 2011 – Monthly Subscription E-Commerce System – Python
° 2011 – Mac Desktop Application – Objective-C
° 2010 – Centralized Asset Management System – Python/SQL
° 2009 – Link Sharing Social Network – Python/Django
° 2009 – Content Management System – Python/Django

I am an adaptable technical leader, entrepreneur, and software developer/architect/engineer with over 20 years of experience in leadership and engineering, including P&L responsibility. I am also a data-driven leader who is comfortable using mathematical modeling to solve complex problems.

★ Specialties ★

° Building Companies
° Shipping new Products
° Solving interesting (tough and scary) problems, in any language/environments.
° Leading and growing engineering teams
° Production Machine Learning
° Advising Early Stage Startups/Consulting CTO services
° Distributed Systems and Scalability

Presentations

What is the relationship between social influence and the NBA? Media and Advertising

Explore NBA team valuation and attendance, as well as individual player performance, using data science and machine learning. Questions that will be discussed: What drives the valuation of teams — attendance or the local real estate market? Does winning bring more fans to games? Does salary correlate with social media performance?

Clare Gollnick is the CTO and chief data scientist at Terbium Labs, an information security startup based in Baltimore, Maryland. As a statistician and engineer, Clare designs the algorithms that direct Terbium’s automated crawl of the dark web and leads the crawler engineering team. Previously, Clare was a neuroscientist. Her academic publications focus on information processing within neural networks and validation of new statistical methods. Clare holds a PhD in biomedical engineering from Georgia Tech and a BS in bioengineering from UC Berkeley.

Presentations

The limits of inference: What data scientists can learn from the reproducibility crisis in science Session

At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project.

Abe Gong is CEO and cofounder at Superconductive Health. A seasoned entrepreneur, Abe has been leading teams using data and technology to solve problems in health care, consumer wellness, and public policy for over a decade. Previously, he was chief data officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe holds a PhD in public policy, political science, and complex systems from the University of Michigan. He speaks and writes regularly on data science, healthcare, and the internet of things.

Presentations

Pipeline testing with Great Expectations Session

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.

Matthew Granade is a cofounder of Domino Data Lab, which makes a workbench for data scientists to run, scale, share, and deploy analytical models, where he works with companies such as Quantopian, Premise, and Orbital Insight. He also invests in, advises, and serves on the boards of startups in data, data analysis, finance, and financial tech. Previously, Matthew was co-head of research at Bridgewater Associates, where he built and managed teams that ensured Bridgewater’s understanding of the global economy, created new systems for generating alpha, produced daily trading signals, and published Bridgewater’s market commentary, and an engagement manager at McKinsey & Company. He holds an undergraduate degree from Harvard University, where he was president of the Harvard Crimson, the university’s daily newspaper, and an MBA with highest honors from Harvard Business School.

Presentations

Managing data science at scale Session

Predictive analytics and artificial intelligence have become critical competitive capabilities. Yet IT teams struggle to provide the support data science teams need to succeed. Matthew Granade explains how leading banks, insurance and pharmaceutical companies, and others manage data science at scale.

Adam Greenhall is a data scientist at Lyft.

Presentations

Simulation in a two-sided transportation marketplace Session

Adam Greenhall explains how Lyft uses simulation to test out new algorithms, help develop new features, and study the economics of ride-sharing markets as they grow.

Mark Grover is a Product Manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry, and he has contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and also wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data at various national and international conferences. He occasionally blogs on topics related to technology.

Presentations

Dogfooding our data at Lyft Session

This talk gives an overview of how we leverage application metrics, logs, and auditing to monitor and troubleshoot our data platform at Lyft. We share how we dogfood our own platform to provide security, auditing, alerting, and replayability, and we detail some of the services and tools we have developed internally to make our data more robust, scalable, and self-serving.

Sudipto studies the design and implementation of a wide range of computational systems, from resource constrained devices, such as sensors, up through massively parallel and distributed systems. Using an algorithmic framework, Sudipto seeks to design systems that are correct, efficient, and optimized, despite their bidirectional asymptotic scale and seeming lack of similarity to human information processes. Sudipto’s recent work focuses on clustering and location theory, statistics and learning theory, database query optimization and mining, approximation algorithms for stochastic control, communication complexity and data stream algorithms.

Presentations

Continuous Machine Learning over Streaming Data Session

In this talk, we present continuous machine learning algorithms that discover useful information in streaming data. We focus on explainable machine learning, including anomaly detection with attribution, ability to reduce false positives through user feedback, and detection of anomalies in directed graphs. We describe the algorithms, implementation, and application in real customer use cases.

Debraj GuhaThakurta is a senior data scientist lead in AI & Research, Cloud Data Platform, Algorithms and Data Science, where he focuses on developing the Team Data Science Process and the use of different Microsoft data platforms and toolkits (Spark, SQL Server, ADL, Hadoop, DL toolkits) for creating scalable and operationalized analytical processes. He has a PhD in chemistry and biophysics and many years of experience in data science and machine learning applications, particularly in the biomedical and forecasting domains. Debraj has published more than 25 peer-reviewed papers, book chapters, and patents.

Presentations

Using R and Python for Scalable Data Science, Machine Learning, and AI Tutorial

Join us to learn how to do scalable, end-to-end data science in R and Python on single machines as well as on Spark clusters and cloud-based infrastructure. You'll be assigned an individual virtual machine with all contents preloaded and software installed and use it to gain experience building and operationalizing machine learning models using distributed functions in both R and Python.

Alexandra Gunderson is a data scientist at Arundo Analytics. Her background is in mechanical engineering and applied numerical methods.

Presentations

Machine Learning to tackle Industrial Data Fusion Session

Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks or even months to build a comprehensive dataset from all of the various data sources. We will discuss the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources.

Sijie Guo is the cofounder of Streamlio, a company focused on building a next-generation real-time data stack. Previously, he was the tech lead for the messaging group at Twitter, where he cocreated Apache DistributedLog, and he worked on push notification infrastructure at Yahoo. He is the PMC chair of Apache BookKeeper.

Presentations

Modern Real Time Streaming Architectures Tutorial

Across diverse industry segments, there has been a shift in focus from big data to fast data. This stems in part from the deluge of high-velocity data streams and, more importantly, the need for instant data-driven insights. In this tutorial, we walk the audience through state-of-the-art streaming systems and provide an in-depth review of modern streaming algorithms.

Stream Storage with Apache BookKeeper Session

Apache BookKeeper is a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads. It is widely adopted, including by enterprises like Twitter, Yahoo, and Salesforce, to store and serve mission-critical data. We will present how Apache BookKeeper satisfies the needs of stream storage.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning, so be sure to share your use case with him on Twitter.

Presentations

Getting started with TensorFlow Tutorial

Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from a conceptual overview all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry.

Jordan Hambleton is a senior solutions architect for Cloudera, based in the San Francisco office. At Cloudera, his focus has been partnering with customers to build and manage scalable enterprise products on the Hadoop stack. Prior to Cloudera, Jordan was a member of technical staff at NetApp, where he designed and implemented the NRT operational data store that continually manages automated support for all of NetApp’s customers’ production systems.

Presentations

How to build leak-proof stream processing pipelines with Apache Kafka and Apache Spark​ Session

Streaming data continuously from Kafka allows users to gain insights faster, but failures can leave users panicked about data loss when restarting their application. Offset management gives users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failures, and improve the accuracy of results.

Chris Harland is Director of Data Engineering at Textio, an augmented writing platform. Previously, he was a data scientist and machine learning engineer at Versive (formerly Context Relevant) and a data scientist at Microsoft, working on problems in Bing search, Xbox, Windows, and MSN.

He holds a PhD in Physics from the University of Oregon and has worked in a wide variety of fields spanning elementary science education, cutting edge biophysical research, and recommendation/personalization engines.

Every year Chris thinks “this is the year I’m going to stop thinking SQL is the best query language ever” and every year he’s wrong.

Presentations

Data products should be as simple as possible, but not simpler Session

The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models. This creates a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, we'll go end to end in building a data product.

Patrick started and leads the Data Science team at S&P Global Market Intelligence (S&P MI), a business and financial intelligence firm and data provider. Working in a Fortune 500 company for which data is both the primary raw material and the finished product provides an unusually broad, target-rich environment for data scientists. The Data Science team at S&P MI employs a wide variety of data science tools and techniques, including machine learning, natural language processing, recommender systems, and graph analytics.

Patrick is the co-author of the forthcoming book Deep Learning with Text from O’Reilly Media, along with Matthew Honnibal, creator of spaCy, the industrial-strength natural language processing software library. Patrick is a founding organizer of a Machine Learning Conference in Charlottesville, Virginia and is actively involved in building both regional and global data science communities.

Prior to joining S&P MI, Patrick received degrees in Economics (BA) and Systems Engineering (MS), both from the University of Virginia. His graduate research focused on complex systems and agent-based modeling.

Presentations

Word Embeddings Under the Hood: How Neural Networks Learn from Language Session

Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. In this talk, we’ll open up the black box of a popular word embedding algorithm and embark on an end-to-end walk-through of how it works its magic. Along the way, we’ll dig into many core neural network concepts, including hidden layers, loss gradients, backpropagation, and more.

Frances Haugen is a data product manager at Pinterest focusing on ranking content in the home feed and related pins and the challenges of driving immediate user engagement without harming the long-term health of the Pinterest content ecosystem. Previously, Frances worked at Google, where she founded the Google+ search team, built the first non-quality-based search experience at Google, and cofounded the Google Boston search team. She loves user-facing big data applications and finding ways to make mountains of information useful and delightful to the user. Frances was a member of the founding class of Olin College and holds a master’s degree from Harvard.

Presentations

Executive Briefing: Building effective heterogeneous data communities—Driving organizational outcomes with broad-based data science Session

Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge often don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights.

Or Herman-Saffar is a data scientist at Dell. She holds an MS in biomedical engineering, for which her research focused on breast cancer detection using breath signals and machine learning algorithms, and a BS in biomedical engineering specializing in signal processing, both from Ben-Gurion University, Israel.

Presentations

AI Powered Crime Prediction Session

What if we could predict when and where the next crimes will be committed? Crimes in Chicago is a publicly available dataset reflecting reported incidents of crime that have occurred in Chicago since 2001. Using this data, we would like not only to explore specific crimes to find interesting trends but also to predict how many crimes will take place next week, or even next month.

Szehon is a staff software engineer on the analytics data storage team at Criteo, where he is driving Criteo’s Hive platform to the next level. Previously, he was a software engineer on the Hive team at Cloudera, where he was a committer and PMC member in the Apache Hive open source community, working on features such as Hive on Spark and Hive monitoring and metrics, among others.

Presentations

Hive as a Service Session

Hundreds of analysts and thousands of automated jobs run Hive queries at Criteo every day. As Hive is the main data transformation tool at Criteo, we spent a year evolving Hive’s platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle our growing load.

Bob Horton is a senior data scientist with the AI&R Data Group Deep Partner Engagement Team, where he helps Independent Software Vendors build and deploy machine learning solutions for their customers. He holds an adjunct faculty appointment in Health Informatics at the University of San Francisco, and has a particular interest in educational simulations.

Presentations

Using R and Python for Scalable Data Science, Machine Learning, and AI Tutorial

Join us to learn how to do scalable, end-to-end data science in R and Python on single machines as well as on Spark clusters and cloud-based infrastructure. You'll be assigned an individual virtual machine with all contents preloaded and software installed and use it to gain experience building and operationalizing machine learning models using distributed functions in both R and Python.

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and currently spends a lot of time writing a book, Stream Processing with Apache Flink.

Presentations

Streaming SQL to unify Batch and Stream Processing - Theory and Practice with Apache Flink at Uber Session

In this talk, we discuss SQL in the world of streaming data and its implementation in Apache Flink. We cover the concepts (streaming semantics, event time, incremental results) and share practical experiences of using Flink SQL in production at Uber, including how Uber leverages Flink SQL to solve its unique business challenges.

Simon is currently the Chief Data Scientist at Dice.com, the technology professional recruiting site, and a PhD candidate at DePaul University, studying machine learning and natural language processing. At Dice, he has developed multiple recommender engines for matching job seekers with jobs and has optimized the accuracy and relevancy of Dice.com’s job and candidate search. More recently, Simon has been instrumental in building the machine intelligence behind the “Career Explorer” portion of Dice’s website, which allows users to gauge their market value and explore potential career paths. In his academic research, Simon is investigating machine learning approaches for determining causal relations in student essays, with a view to building more intelligent essay-grading software.

Presentations

Building Career Advisory Tools for the Tech Sector using Machine Learning Session

At Dice.com, we recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. We will discuss how we applied different machine learning algorithms to solve each of these problems and the technologies used to build, deploy, and monitor these solutions in production.

Alysa Z. Hutnik is a partner at Kelley Drye & Warren LLP in Washington, DC, where she delivers comprehensive expertise in all areas of privacy, data security, and advertising law. Alysa’s experience ranges from counseling to defending clients in FTC and state attorneys general investigations, consumer class actions, and commercial disputes. Much of her practice is focused on the digital and mobile space in particular, including the cloud, mobile payments, calling and texting practices, and big data-related services. Ranked as a leading practitioner in the privacy and data security area by Chambers USA, Chambers Global, and Law360, Alysa has received accolades for the dedicated and responsive service she provides to clients. The US Legal 500 notes that she provides “excellent, fast, efficient advice” regarding data privacy matters. In 2013, she was one of just three attorneys under 40 practicing in the area of privacy and consumer protection law to be recognized as a rising star by Law360.

Presentations

Executive Briefing: Legal best practices for making data work Session

Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.”

Mario Inchiosa’s passion for data science and high-performance computing drives his work as Principal Software Engineer in Microsoft AI & Research, where he focuses on delivering advances in scalable advanced analytics, machine learning, and AI. Previously, Mario served as Revolution Analytics’ Chief Scientist and as Analytics Architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R. Prior to that, Mario was US Chief Scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances. He also served as US Chief Science Officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining, and Senior Scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds Bachelor’s, Master’s, and PhD degrees in Physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning Publication of the Year and Open Literature Publication Excellence awards.

Presentations

Using R and Python for Scalable Data Science, Machine Learning, and AI Tutorial

Join us to learn how to do scalable, end-to-end data science in R and Python on single machines as well as on Spark clusters and cloud-based infrastructure. You'll be assigned an individual virtual machine with all contents preloaded and software installed and use it to gain experience building and operationalizing machine learning models using distributed functions in both R and Python.

Founder and chief architect of ShiftLeft; creator of the industry’s first open source Lambda framework; formerly an infrastructure engineer at Google and VMware; coauthor of the RabbitMQ Erlang client.

Presentations

Code Property Graph: A modern, queryable data storage for source code Session

In the earlier days, code generated data; with the Code Property Graph (CPG), we now generate data about the code itself so that we can understand it better.
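As a toy illustration of the idea of generating queryable data about code, the snippet below builds a minimal property graph over three hypothetical methods and asks a reachability question of the kind static analysis poses. All names are invented for illustration; this is not ShiftLeft's actual CPG schema or query API.

```python
# A toy property graph over code: nodes carry properties, edges carry labels.
nodes = {
    1: {"type": "METHOD", "name": "readInput"},
    2: {"type": "METHOD", "name": "sanitize"},
    3: {"type": "METHOD", "name": "execQuery"},
}
edges = [
    (1, 2, "CALLS"),  # readInput calls sanitize
    (2, 3, "CALLS"),  # sanitize calls execQuery
]

def reachable(src, dst, label):
    """Is dst reachable from src via edges with the given label?"""
    frontier, seen = [src], set()
    while frontier:
        n = frontier.pop()
        if n == dst:
            return True
        if n in seen:
            continue
        seen.add(n)
        frontier.extend(t for (s, t, lbl) in edges if s == n and lbl == label)
    return False

# "Does untrusted input flow (via calls) into the query sink?"
flows = reachable(1, 3, "CALLS")
```

A real CPG merges ASTs, control flow, and data flow into one such graph, so security questions become graph queries.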

Kinnary Jangla is a senior software engineer at Pinterest, where she works as a backend engineer on the machine learning infrastructure team for the homefeed. Kinnary has worked in the industry for 10+ years. Previously, she worked on maps and international growth at Uber and on Bing search at Microsoft. Kinnary holds an MS in computer science from the University of Illinois and a BE from the University of Mumbai.

Presentations

Accelerating development velocity of production ML systems with Docker Session

Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.

Sumit Jindal is an experienced data engineer who has developed big data solutions in the telecom, finance, and internet domains. He enjoys working on the architecture, design, and implementation of scalable, parallel, distributed web-scale systems. He is a committer on Aerospike and has worked extensively with Kafka and NoSQL systems.

Presentations

Using Machine Learning to Simplify Kafka Operations Session

Getting the best performance, predictability, and reliability for Kafka-based applications is an art today. We aim to change that by leveraging recent advances in machine learning and AI. This talk will describe our methodology of applying statistical learning to the rich and diverse monitoring data that is available from Kafka.

Flavio Junqueira leads the Pravega team at Dell EMC. He is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, he held an engineering position at Confluent and research positions at Yahoo Research and Microsoft Research. Flavio is an active contributor to Apache projects such as Apache ZooKeeper (PMC and committer), Apache BookKeeper (PMC and committer), and Apache Kafka, and he coauthored O’Reilly’s ZooKeeper book. Flavio holds a PhD in computer science from the University of California, San Diego.

Presentations

Unified and elastic Batch- and Stream Processing with Pravega and Apache Flink Session

We present an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The combination of these two systems offers an unprecedented way of handling “everything as a stream,” with unbounded stream storage, a unified batch and streaming abstraction, and a novel way of dynamically accommodating workload variations.

Tomer Kaftan is a second-year PhD student at the University of Washington, working with Prof. Balazinska and Prof. Cheung. His research interests are in machine learning systems, distributed systems, and query optimization.  Previously, Tomer was a staff engineer in UC Berkeley’s AMPLab, working on systems for large scale machine learning. He holds a degree in EECS from UC Berkeley. He is also a recipient of an NSF Graduate Research Fellowship.

Presentations

Cuttlefish: Lightweight primitives for online tuning Session

Cuttlefish is a lightweight framework, prototyped in Apache Spark, for developers to adaptively improve the performance of data processing applications. Developers use Cuttlefish by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.
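A minimal sketch of what such a tuning primitive might look like, using a simple epsilon-greedy bandit to pick between two alternative implementations. The class and its API are invented for illustration and are not Cuttlefish's actual interface.

```python
import random

class TuningPrimitive:
    """Epsilon-greedy 'choose' primitive: pick among alternative
    implementations, observe a reward, and adapt over time."""
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.totals = {a: 0.0 for a in arms}
        self.counts = {a: 0 for a in arms}

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore occasionally
        # Exploit the best observed average reward (untried arms first).
        return max(self.arms, key=lambda a: self.totals[a] / self.counts[a]
                   if self.counts[a] else float("inf"))

    def observe(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward

tuner = TuningPrimitive(["hash_join", "sort_merge_join"], seed=42)
# Simulated feedback: hash_join is faster (higher reward) on this workload.
for _ in range(200):
    arm = tuner.choose()
    tuner.observe(arm, 1.0 if arm == "hash_join" else 0.2)
```

Over time, exploitation concentrates calls on the faster alternative, while occasional exploration lets the primitive adapt if the workload shifts.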

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Presentations

Semi-automated analytic pipeline creation and validation using active learning Session

Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Kandel discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines.

Holden Karau is a transgender Canadian Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally about Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She holds a bachelor of mathematics in computer science from the University of Waterloo.

Presentations

Playing Well Together: Big Data beyond the JVM w/Spark & friends. Session

This talk will explore the state of the current big data ecosystem and how to best work with it in non-JVM languages. Since the presenter works extensively on PySpark, much of the focus will be on Python + Spark, but the talk will also include interesting anecdotes about how this applies to other systems (including Kafka).

Brian is a data scientist at Pinterest. Previously, he was Senior Data Analyst at the NYU Furman Center, working on housing and urban policy issues. He also served as a Research Fellow at Stanford Law School, where he helped research the effects of workplace safety and health policy.

Presentations

Trapped by the Present - Estimating Long-Term Impact from A/B Experiments Session

When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. This talk will show how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at non-core users.

Sagar Kewalramani is an enterprise data architect at Meijer, where he leads efforts in building an enterprise data lake with Hadoop. He is primarily focused on building business use cases; high-volume real-time data ingestion, transformation, and movement; and data lineage and discovery, and he has also led the discovery and development of big data and machine learning applications to accelerate digital business and simplify data management and analytics. Sagar has wide experience building data architectures that integrate multiple systems using ETL tools, relational databases, and big data technologies, and he specializes in architecture design and administration roles for ETL tools like DataStage, Alteryx, and Talend; relational databases like Teradata and Oracle; and big data distributions like MapR and Hortonworks. He is part of the core organizing committee of Big Data Ignite, Michigan’s premier conference on big data, the IoT, and cloud computing, along with meetup groups in Grand Rapids, MI, where he’s a frequent speaker on Hadoop and big data.

Presentations

Architecting an open source enterprise data lake Session

With so many business intelligence tools in the Hadoop ecosystem and no common measure to identify the efficiency of each tool, where do you begin to build or modify your enterprise data lake strategy? Sagar Kewalramani shares real-world BI problems and how they were resolved with Hadoop tools and demonstrates how to build an effective data lake strategy with open source tools and components.

Eugene is a Staff Software Engineer on the Cloud Dataflow team at Google, currently working on the Apache Beam programming model and APIs. Previously he worked on Cloud Dataflow’s autoscaling and straggler elimination techniques. He is interested in programming language theory, data visualization, and machine learning.

Presentations

Radically modular data ingestion APIs in Apache Beam Session

Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. One key aspect is Beam's DoFn operation for data integration, which makes it suitable for fully unified batch/streaming data ingestion and enables new, highly modular data ingestion design patterns.

Ronny Kohavi is a Microsoft Distinguished Engineer and the general manager of Microsoft’s Analysis and Experimentation team in Microsoft’s Applications and Services Group. Previously, he was partner architect at Bing, part of the Online Services Division at Microsoft. He joined Microsoft in 2005 and founded the Experimentation Platform team in 2006. Earlier, he was the director of data mining and personalization at Amazon.com and the vice president of business intelligence at Blue Martini Software, which went public in 2000 and was later acquired by Red Prairie. Prior to joining Blue Martini, Kohavi managed the MineSet project, Silicon Graphics’ award-winning product for data mining and visualization. He joined Silicon Graphics after earning a PhD in machine learning from Stanford University, where he led the MLC++ project, the machine learning library in C++ used in MineSet and at Blue Martini Software. Kohavi received his BA from the Technion, Israel. He was the general chair for KDD 2004, cochair of KDD 99’s industrial track with Jim Gray, and cochair of the KDD Cup 2000 with Carla Brodley. He was an invited speaker at the National Academy of Engineering in 2000, a keynote speaker at PAKDD 2001, an invited speaker at KDD 2001’s industrial track, and a keynote speaker at EC 10 (2010) and RecSys 2012. His papers have over 24,000 citations, and three of them are among the top 1,000 most-cited papers in computer science.

Presentations

A/B Testing at Scale: Accelerating Software Innovation Tutorial

Controlled experiments, including A/B tests, have revolutionized the way software is being developed, with new ideas objectively evaluated with real users. We provide an intro and lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft and executing over 10K experiments/year.
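For a concrete flavor of the analysis behind controlled experiments, here is a standard two-proportion z-test in plain Python, the textbook check for whether a lift in conversion rate is statistically significant. This is a generic sketch, not Microsoft's experimentation platform.

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for a difference in
    conversion rates between control (a) and treatment (b)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Treatment lifted conversion from 10.0% to 11.5% with 10,000 users per arm.
z, p = two_proportion_z(1000, 10000, 1150, 10000)
```

With these inputs the lift is comfortably significant at the usual 0.05 threshold; the hard parts at scale, which the tutorial addresses, are running thousands of such tests reliably and avoiding the pitfalls of misinterpretation.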

Sanjeev Kulkarni is the cofounder of Streamlio, a company focused on building a next-generation real-time stack. Previously, he was the technical lead for real-time analytics at Twitter, where he cocreated Twitter Heron; worked at Locomatix handling the company’s engineering stack; and led several initiatives for the AdSense team at Google. Sanjeev holds an MS in computer science from the University of Wisconsin, Madison.

Presentations

Effectively once, Exactly once, and more in Heron Session

Stream processing systems need to support different types of processing semantics due to the diverse nature of streaming applications. In this talk, we methodically examine effectively-once and exactly-once semantics and different types of state and consistency, how they are implemented in Heron, and how applications can benefit.
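Effectively-once semantics can be illustrated with a small sketch: under at-least-once delivery, messages may be redelivered, but deduplicating on a message ID makes the state update idempotent, so the effect is applied exactly once. This is a general pattern for illustration only, not Heron's internal implementation.

```python
class DedupCounter:
    """State whose update is idempotent per message ID: redelivered
    messages are detected and skipped, so the effect happens once."""
    def __init__(self):
        self.seen = set()
        self.count = 0

    def process(self, msg_id, value):
        if msg_id in self.seen:
            return  # duplicate delivery: skip the side effect
        self.seen.add(msg_id)
        self.count += value

c = DedupCounter()
# An at-least-once stream with retries: IDs 1 and 2 are delivered twice.
for msg in [(1, 10), (2, 5), (1, 10), (3, 1), (2, 5)]:
    c.process(*msg)
# count reflects each message exactly once: 10 + 5 + 1 = 16
```

Production systems combine this kind of idempotence with checkpointed state so that the dedup set itself survives failures.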

Modern Real Time Streaming Architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from Big Data to Fast Data. This, in part, stems from the deluge of high velocity data streams and, more importantly, the need for instant data-driven insights. In this tutorial, we walk the audience through the state-of-the-art streaming systems and provide an in-depth review of modern streaming algorithms.

Abhishek is a manager of data science at Sapient.

Presentations

Deep Learning Based Search and Recommendation Systems Using TensorFlow Tutorial

The key takeaways are: (1) an introduction to deep learning, covering different networks such as RBMs, convnets, and autoencoders; (2) an introduction to recommendation systems and why deep learning is needed for hybrid systems; (3) a complete hands-on TensorFlow tutorial, including TensorBoard; and (4) an end-to-end view of deep learning-based recommendation and learning-to-rank systems.

Philip Langdale is the engineering lead for cloud at Cloudera. He joined the company as one of the first engineers building Cloudera Manager and served as an engineering lead for that project until moving to working on cloud products. Previously, Philip worked at VMware, developing various desktop virtualization technologies. Philip holds a bachelor’s degree with honors in electrical engineering from the University of Texas at Austin.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Eugene Fratkin, and Jennifer Wu lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Francesca Lazzeri is a Data Scientist II at Microsoft, AI Research, where she is part of the Algorithms and Data Science team. Francesca is passionate about innovations in big data technologies and applications of advanced analytics to real-world problems. Her work focuses on the deployment of machine learning algorithms and web service-based solutions to solve real business problems for customers in the energy, retail, and HR analytics sectors. Previously, she was a research fellow in business economics at Harvard Business School. She holds a PhD in Innovation Management.

Presentations

Operationalize Deep Learning: How to deploy and consume your LSTM networks for predictive maintenance scenarios Session

Deep learning has shown superior performance in domains such as object recognition and image classification, where time-series data plays an important role. Predictive Maintenance is also a domain where data is collected over time to monitor the state of an asset to predict failures. In this talk we show how to operationalize LSTM networks that predict remaining useful life of aircraft engines.

Mike Lee Williams is a research engineer at Fast Forward Labs, an applied machine intelligence lab in New York City, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Fast Forward Labs’s clients understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.

Presentations

Interpretable machine learning products Session

Interpretable models result in more accurate, safer, and more profitable machine learning products, but interpretability can be hard to ensure. In this talk, we'll look closely at the growing business case for interpretability and at concrete applications including churn, finance, and healthcare, and we'll demonstrate the use of LIME, an open source, model-agnostic tool you can apply to your models today.
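The intuition behind LIME can be sketched in a few lines for a single feature: sample perturbations near the point being explained, weight them by proximity, and fit a weighted linear surrogate whose slope serves as the local feature importance. This is a deliberately simplified illustration; the real LIME library handles many features and arbitrary models.

```python
import random

def local_linear_explanation(f, x0, radius=0.5, n=200, seed=0):
    """LIME-style sketch for one feature: the returned slope of a
    proximity-weighted least-squares fit approximates the black-box
    model's local sensitivity to the feature at x0."""
    rng = random.Random(seed)
    xs = [x0 + rng.uniform(-radius, radius) for _ in range(n)]
    ys = [f(x) for x in xs]
    ws = [1.0 / (1.0 + abs(x - x0)) for x in xs]  # simple proximity kernel
    wsum = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / wsum
    ybar = sum(w * y for w, y in zip(ws, ys)) / wsum
    num = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    den = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    return num / den  # local slope, roughly f'(x0)

# Explain a black-box model locally: f(x) = x^2 has slope ~2 near x0 = 1.
slope = local_linear_explanation(lambda x: x * x, x0=1.0)
```

The surrogate is only valid locally, which is exactly the point: it answers "why this prediction?" rather than "how does the whole model work?"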

Steven is a delivery-focused, hands-on architect (Scala, Java, data, and cloud) capable of architecting, developing, and leading the delivery of reactive, pragmatic, on-time, on-budget, customer-focused solutions. He enjoys solving big-picture problems, like choosing architectures, cloud providers, programming languages, and frameworks. Once the big picture has been mapped out, he likes to sit down and be a key part of its implementation.

Steven is well versed in Scala technologies like Akka, Play, and Streams; Java technologies like Java 8, Javaslang, jOOX, Spring, Hibernate, JBoss, Tomcat, and Maven; and big data technologies like Spark, Cassandra, Kafka, Kinesis, EMR, and Redshift. He has extensive experience building and deploying RESTful services to cloud providers like AWS, Cloud Foundry, and OpenStack.

Presentations

How Weight Watchers embraced modern data practices during its transformation from a legacy IT shop to a modern Technology organization. Session

For organizations stuck in a myriad of legacy infrastructure, the path to AI and deep learning seems impossible. In this talk, we discuss how Weight Watchers was able to move from a traditional BI organization to one that uses data effectively, looking at where we were, what our needs were, the changes that were required, and the technologies and architecture we use to achieve our goals.

Dr. Wei Lin is responsible for planning Dell EMC’s data science strategy and leads data science services delivery for the Dell EMC Professional Services big data practice. He has authored over 100 private and public publications, including the 2016 EMC award-winning “Cardholder Attrition Analysis and Treatments Framework,” the 2015 “Analytical Case Study of Casino and Resort,” and the 2014 “Conceptual Data Mining Framework for Mortgage Default Propensity.”

Dr. Lin has over 20 years of full-lifecycle project experience in predictive analytics, including analytical modeling, architecture design, data warehousing, reporting, and marketing. Over his professional career he has published over 100 papers (25 public and 105 proprietary) across the healthcare, telecom, financial services, banking, education, gaming (gaming floor, hotel, retail, restaurant, and entertainment), energy, hospital, entertainment (music, movie, and TV), automobile, retail, and government industries. Wei’s work has been reported in professional journals, BusinessWeek, and Forbes.

As the chief data scientist for Dell EMC’s big data practice, Dr. Lin is responsible for leading data science project delivery and the hiring, training, and certification of new data scientists. He also hosts Dell EMC’s data science mentorship program to share data scientists’ engagement findings, industry experience, techniques, and trends. Wei developed Dell EMC’s data science field consulting methodology, Descriptive, Exploration, Predictive, and Prescriptive (DEPP), which provides a practical analytics roadmap and approaches for organizations’ business initiatives and data and analytics requirements.

Previously, Dr. Lin was a principal consultant in analytics R&D and professional consulting services at IBM, PwC, and Coopers & Lybrand.

Dr. Lin holds a PhD in electrical engineering (specializing in artificial intelligence) and a master of science in electrical engineering from the State University of New York at Binghamton, and a bachelor of science in electrical engineering from the National Taipei Institute of Technology, Taiwan.

Presentations

Bladder Cancer Diagnosis using Deep Learning Session

Image recognition and classification of diseases is expected to improve and support physicians’ decisions. Applying deep learning techniques to recognize diseases in organs will minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis.

Shaoshan Liu is the cofounder and president of PerceptIn, a company working on developing a next-generation robotics platform. Previously, he worked on autonomous driving and deep learning infrastructure at Baidu USA. Shaoshan holds a PhD in computer engineering from the University of California, Irvine.

Presentations

Powering robotics clouds with Alluxio Session

Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience and enjoys intelligent design and engaging storytelling. He is passionate about data, music, and nature.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Edwina Lu is a software engineer on LinkedIn’s Hadoop infrastructure development team, currently focused on supporting Spark on the company’s clusters. Previously, she worked at Oracle on database replication.

Presentations

Spark for everyone: Self-service monitoring and tuning Session

Spark applications need to be well tuned so that individual applications run quickly and reliably, and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

Nancy Lublin does not sleep very much. She is currently the Founder & CEO of Crisis Text Line, which has processed over 50 million messages in 4 years and is one of the first “big data for good” orgs. She was CEO of DoSomething.org for 12 years, taking it from bankruptcy to the largest organization for teens and social change in the world. Her first venture was Dress for Success, which helps women transition from welfare to work in almost 150 cities in 22 countries. She founded this organization with a $5,000 inheritance from her great-grandfather. Before leading three of the most popular charity brands in America, she was a bookworm. She studied politics at Brown University, political theory at Oxford University (as a Marshall Scholar), and has a law degree from New York University. She is the author of 4 books and is a board member of McGraw Hill Education. Nancy was a judge for 2017’s Miss USA Pageant (she thought that was hilarious.) Nancy is a Young Global Leader of the World Economic Forum (attending Davos multiple times), was named Schwab Social Entrepreneur of the Year in 2014, and has been named in the NonProfit Times Power and Influence Top 50 list 3 times. She is married to Jason Diaz and has two children who have never tasted Chicken McNuggets.

Presentations

Keynote with Nancy Lublin Keynote

Keynote with Nancy Lublin

Boris Lublinsky has been a full-time enterprise architecture professional for the last 15 years. He has been accountable for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps, including big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies and Professional Hadoop Solutions, both from Wiley. He is also a cofounder of and frequent speaker at several Chicago user groups.

Presentations

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams Tutorial

This hands-on tutorial builds several streaming applications as "microservices" based on Kafka with Akka Streams and Kafka Streams for data processing. We'll assess the strengths and weaknesses of each tool for particular needs, so you'll feel better informed when choosing tools for your needs. We'll also contrast them with Spark Streaming and Flink, including when to chose them instead.

Dan Lurie leads the Strategic Analytics team at Pinterest. The group is a mix of technical-business hybrids who combine deep data skills with strategic thinking to help Pinterest’s product team grow the user base, develop new features, and increase engagement. The team’s work ranges from understanding product performance via A/B experiment analysis to identifying and sizing market opportunities to defining and tracking success through metrics. Prior to Pinterest, Dan led analytics for a sales-focused business line at LinkedIn; before that, he was in consulting.

Presentations

Breaking Up the Block - Using Heterogenous Population Modeling to Drive Growth Session

All successful startups thrive on tight product-market fit, which can produce homogeneous initial user bases. To become the next big thing, your user base will need to diversify and your product will need to change to accommodate new needs. This talk discusses how Pinterest leveraged external data to begin measuring racial and income diversity in its user base and changed user modeling to drive growth.

As SVP, Data Science, Digital Technology, Kevin is responsible for leading the vision and execution of Nielsen Marketing Cloud’s analytics and data optimization activities.

Prior to Nielsen Marketing Cloud, Kevin served as VP Analytics and Business Intelligence at x+1, a leader in audience targeting that leverages sophisticated statistical modeling to surpass traditional online marketing techniques. In this capacity, Kevin strove to maximize profitable website user behavior via analytics and real-time decisioning. Before x+1, Kevin spent over a decade as a vice president responsible for web and marketing analytics at QualityHealth.com, a leading website providing consumer health news and information, and at Harte-Hanks, a large marketing service provider. Earlier in his career, Kevin served in account management at Grey Direct.

Kevin has a BA in Russian Language and Eastern European Studies from the University of Illinois at Urbana-Champaign, an MA in Medieval History from The Ohio State University and an MA in Applied Statistics from the City University of New York-Hunter College.

Presentations

Case Study: How InvestingChannel is using AI to help advertisers reach intended audiences with 100% accuracy. Media and Advertising

Consumer behavior is in a constant state of flux. Adapting to these changes is especially hard given the staggering amount of “Big Data” marketers need to understand & act on. Learn how financial news publisher InvestingChannel and its technology partner, Nielsen, are using an advanced form of AI, online machine learning, to adapt to real-time changes in audience behavior & market conditions.

Michael is the chief technology officer at Weight Watchers, with global responsibility for all aspects of technology. Since joining Weight Watchers in 2014, Michael has led a major digital transformation, first within engineering and subsequently for all of technology. Prior to Weight Watchers, Michael joined SecondMarket (now Nasdaq Private Market) in 2009 as the vice president of engineering, where he led the design, development, delivery, and quality of SecondMarket’s software products. Before SecondMarket, Michael spent 12 years as an independent consultant at institutions around the world, building highly scalable web-based products for startups, banks, and telecom companies.

Presentations

How Weight Watchers embraced modern data practices during its transformation from a legacy IT shop to a modern Technology organization. Session

For organizations stuck in a myriad of legacy infrastructure, the path to AI and deep learning seems impossible. In this talk, we discuss how Weight Watchers was able to move from a traditional BI organization to one that uses data effectively, looking at where we were, what our needs were, the changes that were required, and the technologies and architecture we use to achieve our goals.

Angie Ma is cofounder and COO of ASI Data Science, a London-based AI tech startup that offers data science as a service, which has completed more than 120 commercial data science projects in multiple industries and sectors and is regarded as the EMEA-based leader in data science. Angie is passionate about real-world applications of machine learning that generate business value for companies and organizations and has experience delivering complex projects from prototyping to implementation. A physicist by training, Angie was previously a researcher in nanotechnology working on developing optical detection for medical diagnostics.

Presentations

Data science for managers 2-Day Training

Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Madhav Madaboosi is a digital business and technology professional in the strategy, architecture, and planning group at BP, where he leads a number of global innovation initiatives in the areas of robotic process automation, AI, big data and data lakes, and industrial IoT. His emphasis is on enabling innovation in business technology and data across the BP Group globally through strategic planning initiatives. Previously, he was the interface to several business portfolios within BP as a business information manager. Before BP, his background was primarily in management consulting for a number of Fortune 100 firms. Madhav holds a degree in business and has completed executive programs at the Kellogg Institute of Management.

Presentations

Meta your Data, Drain the Big Data Swamp Data Case Studies

A self-service operational data lake to improve operational efficiency, boost productivity through fully identifiable data, and reduce the risk of a data swamp: these were the objectives that drove BP to create a strategic and methodical approach to data lake architecture. Through this approach, BP provides a template for turning insights, hidden risks, and unseen opportunities into actionable solutions.

Mark Madsen is a research analyst at Third Nature, where he advises companies on data strategy and technology planning. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide. He focuses on two types of work: the business applications of data and guiding the construction of data infrastructure. As a result, Mark does as much information strategy and IT architecture work as he does analytics.

Presentations

Executive Briefing: BI on Big Data Session

If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools on the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. This briefing presents the tradeoffs between different architectures for providing self-service access to data.

Arup Malakar is a Software Engineer at Lyft.

Presentations

Dogfooding our data at Lyft Session

This talk gives an overview of how we leverage application metrics, logs, and auditing to monitor and troubleshoot our data platform at Lyft. We share how we dogfood our own platform to provide security, auditing, alerting, and replayability, and we detail some of the services and tools we have developed internally to make our data more robust, scalable, and self-serving.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and how to build teams that can effectively leverage them and achieve ROI with your data initiatives.

Jules Malin is a manager of product analytics and data science at GoPro, where he leads a team responsible for discovering product and behavioral insights from GoPro’s growing family and ecosystem of smart devices and driving product and user experience improvements, including influencing and refining data pipelines in Hadoop/Spark and developing scalable machine learning data products, metrics, and visualizations that produce actionable insights. Previously, Jules worked in product management and analytics engineering at Intel and Shutterfly. He holds a master’s degree in predictive analytics from Northwestern University.

Presentations

Drone data analytics using Spark, Python, and Plotly Data Case Studies

Drones and smart devices are generating billions of event logs for companies, presenting the opportunity to discover insights that inform product, engineering, and marketing team decisions. Jules Malin explains how technologies like Spark and analytics and visualization tools like Python and Plotly enable those insights to be discovered in the data.

Tracy Malingo is president of Next IT, the provider of conversational AI for the enterprise. Tracy brings a wealth of experience to Next IT, leading the executive team and company in providing strategic and operational vision and helping to shape and define the company today and for the future. Her compelling blend of business acumen and technical expertise enables her to relate to all elements of the industry, and she was instrumental in introducing many of the company’s largest clients and early adopters to the world of conversational AI. When she’s not leading the conversational AI company for the Global 5000, Tracy is the proud owner of the world’s most awesome dog and is never afraid to go “all in” with a jack-ten pocket hand at the table.

Presentations

Your Enterprise AI Is Only As Good as Your Data Data Case Studies

AI is transformative for business, but it’s not magic; it’s data. Tracy Malingo, president of Next IT, shares recent work with global enterprise customers, showing how Next IT has helped transform their businesses with AI solutions. She’ll outline how companies should build AI strategies, utilize data to develop and evolve conversational intelligence and business intents, and ultimately increase ROI.

Veronica Mapes is a technical program manager focused on human evaluation and computation at Pinterest, where she manages Pinterest’s internal human evaluation platform, which she matured from just an idea into a self-service platform with an annual run rate of 10 million tasks less than six months after launch, as well as third-party communities of crowdsourced raters. She also hires, trains, and manages high-quality content evaluators and tests template and worker quality to ensure the delivery of highly accurate data for time series measurement and training machine learning models.

Presentations

Humans versus the machines: Using human-based computation to improve machine learning Session

Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgment reproducibility and explain how Pinterest integrated end-user-sourced judgments into its in-house platform.

Brian McMahan is a research engineer at Joostware, a San Francisco-based company specializing in consulting and building intellectual property in natural language processing and deep learning. He is also a cofounder of R7 Speech Sciences, a company focused on understanding spoken conversations. Brian is wrapping up his PhD in computer science at Rutgers University, where his research focuses on Bayesian and deep learning models for grounding perceptual language in the visual domain. Brian has also conducted research in reinforcement learning and various aspects of dialogue systems.

Presentations

Machine learning with PyTorch 2-Day Training

PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.

Guru Medasani is a senior solutions architect at Cloudera focusing on big data and data science. For the past three years, he has helped several Fortune 500 companies build big data platforms and leverage technologies like Apache Hadoop and Apache Spark to solve complex business problems. The business applications he has worked on include collecting, storing, and processing large amounts of machine and sensor data, image processing applications on Hadoop, machine learning models to predict consumer demand, and tools for advanced analytics on large volumes of data stored in Hadoop. Prior to Cloudera, he built research applications as a big data engineer at Monsanto Research and Development. He currently lives in Chicago.

Presentations

How to build leak-proof stream processing pipelines with Apache Kafka and Apache Spark​ Session

Streaming data continuously from Kafka allows users to gain insights faster, but application failures can leave users panicked about data loss when restarting their application. Offset management gives users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failures, and improve the accuracy of results.

Dong Meng is a data scientist at MapR, where he helps customers solve their business problems through big data ecosystems, translating the value in customers’ data into actionable insights or machine learning products. His recent work includes integrating open source machine learning frameworks like PredictionIO and XGBoost with the MapR platform; he also created the time series QSS and deep learning QSS MapR service offerings.

Dong has several years of experience in statistical machine learning, data mining, and big data product development. Previously, Dong was a senior data scientist at ADP, where he built machine learning pipelines and data products on HR and payroll data to power ADP Analytics. Prior to ADP, Dong was a staff software engineer at IBM SPSS, where he was part of the team that built Watson Analytics. During his graduate study, he served as a research assistant at the Ohio State University, where he concentrated on compressive sensing and solving point estimation problems from a Bayesian perspective.

Presentations

Distributed Deep Learning with Containers on Heterogeneous GPU Clusters Session

Deep learning model performance relies on the underlying data. We use a converged data platform as data infrastructure, providing a distributed filesystem, key-value storage, and streams, with Kubernetes as the orchestration layer managing containers that train and deploy DL models on GPU clusters. We also publish and subscribe to streams on the platform to build next-generation applications with DL models.

Peng Meng is a senior software engineer on the big data and cloud team at Intel, where he focuses on Spark and MLlib optimization. Peng is interested in machine learning algorithm optimization and large-scale data processing. He holds a PhD from the University of Science and Technology of China.

Presentations

Spark ML optimization at Intel: A case study Session

Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share the work Intel has been doing on Spark ML and introduce the methodology behind Intel’s Spark ML optimizations.

Gian Merlino is CTO and co-founder of Imply, and is one of the original committers of the Druid project. Prior to Imply, he worked at Metamarkets and Yahoo. He holds a B.S. in computer science from the California Institute of Technology.

Presentations

NoSQL no more: SQL on Druid with Apache Calcite Session

In this talk we'll discuss the SQL layer recently added to the open-source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database". We'll discuss how Druid and Calcite are integrated, and how you too can learn to stop worrying and love relational algebra in your own projects.

John Mertic is director of program management for ODPi and Open Mainframe Project at the Linux Foundation. John comes from a PHP and open source background. Previously, he was director of business development software alliances at Bitnami, a developer, evangelist, and partnership leader at SugarCRM, board member at OW2, president of OpenSocial, and a frequent conference speaker around the world. As an avid writer, John has published articles on IBM Developerworks, Apple Developer Connection, and PHP Architect and authored The Definitive Guide to SugarCRM: Better Business Applications and Building on SugarCRM.

Presentations

The Rise of Big Data Governance: Insight on this Emerging Trend from Active Open Source Initiatives Session

In this joint presentation, John Mertic, director of ODPi, and Ferd Scheepers, global chief information architect at ING, address the benefits of a vendor-neutral approach to data governance and the need for an open metadata standard, along with insight into how companies such as ING, IBM, and Hortonworks are delivering solutions to this challenge as an open source initiative.

Thomas W. Miller, Ph.D., is faculty director of the data science program at Northwestern University. He has designed and taught many courses in the program. He is the author of six books in the field of data science. Miller has consulted with many businesses, providing advice on performance and value measurement, data science methods, information technology, and best practices in building teams of data scientists and data engineers.

Presentations

Working with the Data of Sports Data Case Studies

Sports analytics today is more than a matter of analyzing box scores and play-by-play statistics. Faced with detailed on-field or on-court data from every game, teams seek competitive advantage through data science and data engineering. We review the data challenges that teams face and the information technologies useful in addressing those challenges.

My research interests are in data science, data mining, web search, machine learning, and privacy. I have many years of experience leading projects in industry at Amazon, Microsoft Research, and HP Labs, as well as in academia as an associate professor at the University of Virginia and acting faculty at Stanford University.

The projects I pursue encompass the design and evaluation of new data mining algorithms on real, colossal-sized datasets. I have authored roughly 50 publications in top venues, including WWW, WSDM, and SIGIR (web search); ICML, NIPS, AAAI, and COLT (machine learning); VLDB and PODS (databases); CRYPTO and EUROCRYPT (cryptography); and FOCS and SODA (theory). My publications have received external recognition, including a best paper award nomination, an algorithm described in Wikipedia, and coverage in graduate courses around the world. My research has also had product implications at Microsoft, specifically in the Bing search engine, and has been featured in external press coverage, including New Scientist, ACM TechNews, IEEE Computing Now, Search Engine Land, and Microsoft Research. I have been granted 14 patents, with a dozen more still in the application stage. I’ve had the distinct privilege of helping others advance in their careers, including 15 summer interns and many full-time researchers.

My service to the community includes serving on the editorial boards of Machine Learning, the Journal of Privacy and Confidentiality, IEEE Transactions on Knowledge and Data Engineering, and IEEE Intelligent Systems; chairing the premier machine learning conference, ICML, in 2003; and serving on numerous program committees for web search, data mining, and machine learning conferences. I was awarded an NSF grant as a principal investigator and have served on eight PhD dissertation committees. I have taught several courses at Stanford University and the University of Virginia.

Presentations

Continuous Machine Learning over Streaming Data Session

In this talk, we present continuous machine learning algorithms that discover useful information in streaming data. We focus on explainable machine learning, including anomaly detection with attribution, ability to reduce false positives through user feedback, and detection of anomalies in directed graphs. We describe the algorithms, implementation, and application in real customer use cases.

Rajat Monga leads TensorFlow, an open source machine learning library and the center of Google’s efforts at scaling up deep learning. He is one of the founding members of the Google Brain team and is interested in pushing machine learning research forward toward general AI. Previously, Rajat was the chief architect and director of engineering at Attributor, where he led the labs and operations and built out the engineering team. A veteran developer, Rajat has worked at eBay, Infosys, and a number of startups.

Presentations

The current state of TensorFlow and where it's headed in 2018 Session

Rajat Monga offers an overview of TensorFlow progress and adoption in 2017 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas.

Ajay is an architect on the data technologies team at Sapient.

Presentations

Achieving GDPR Compliance and Data Privacy using BlockChain Technology Session

We explain how we have used open source blockchain technologies such as Hyperledger to implement the European Union’s General Data Protection Regulation (GDPR). The key takeaways are: an introduction to GDPR, a step further on data privacy; why blockchain is a suitable candidate for implementing GDPR; and lessons learned in our blockchain implementation of GDPR compliance.

Manu has a background in cloud computing and big data, handling billions of transactions per day in real time. He enjoys building and architecting scalable, highly available data solutions, and has extensive experience working in online advertising and social media.

Presentations

Machine Learning vs Machine Learning in Production Session

Most machine learning talks are about the algorithm itself; this talk is about how you take that algorithm, scale it, and make it production grade: how the training and test sets are generated and annotated; how the model is pushed to production, evaluated automatically, and finally used in production; and how the model works for other countries and languages.

Jacques Nadeau is the CTO and co-founder of Dremio. He is also the PMC Chair of the open source Apache Arrow project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Apache Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and co-founder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (ADBE), and aQuantive (MSFT).

Presentations

Reflections vs. Copies: Making Data Fast and Easy to Use Without Making Copies Session

Most organizations manage 5–15 copies of their data in multiple systems and formats to support different analytical use cases, including BI and machine learning. In this talk, we introduce a new approach called data reflections, which dramatically reduces the need for data copies. We demonstrate an open source implementation built with Apache Calcite and explore two production case studies.

Balasubramanian Narasimhan is a senior research scientist in the Department of Statistics and the Department of Biomedical Data Sciences, where he is also director of the Data Coordinating Center. His research areas are statistical computing, distributed computing, clinical trial design, and reproducible research. Together with John Chambers, an inventor of the S language, he teaches a Computing for Data Science course at Stanford.

Presentations

Distributed Clinical Models: Inference without Sharing Patient Data Session

Clinical collaboration benefits from pooling data to learn models from large datasets, but it’s hampered by concerns about sharing data. We’ve developed a privacy-preserving alternative that creates statistical models equivalent to one trained on the entire dataset. We’ve built this as a cloud application: each collaborator installs their own instance, and the installations self-assemble into a star network.

Paco Nathan leads the Learning Group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

Human in the loop: A design pattern for managing teams working with machine learning Session

Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media.

Ann evangelizes design for impact. At Whole Whale, she leads the tech and design team with a little help from Puppy Whaler to build meaningful digital products for nonprofits. She has designed and managed the execution of multiple websites, including the LAMP, Opportunities for a Better Tomorrow, and Breakthrough. Ann is always challenging designs with A/B testing. She bets $1 on every experiment that she runs and to date has accumulated a decent sum.

Before joining Whole Whale, Ann worked with a wide range of organizations from the Ford Foundation to Bitly. She is Google Analytics and Optimizely Platform certified. Ann is a regular speaker on nonprofit design and strategy, recently presenting at the DMA Nonprofit Conference and teaching at Sarah Lawrence College. Outside of Whole Whale, Ann enjoys multisensory art, comedy shows, fitness, and making cocktails, ideally all together.

Presentations

Using ML to improve UX and literacy for young poets Data Case Studies

Power Poetry is the largest online platform for young poets, with over 350K users. In 2017, we started building the Poetry Genome, a series of ML tools that analyze and break down similarity scores for the poems added to the site. The most recent is a rap poetry similarity tool that matches a young poet’s work to rap artists and then shows them the educational value of that connection.

Ryan Nienhuis is a senior technical product manager on the Amazon Kinesis team, where he defines products and features that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.

Presentations

Continuous Machine Learning over Streaming Data Session

In this talk, we present continuous machine learning algorithms that discover useful information in streaming data. We focus on explainable machine learning, including anomaly detection with attribution, ability to reduce false positives through user feedback, and detection of anomalies in directed graphs. We describe the algorithms, implementation, and application in real customer use cases.

Berk Norman is a data scientist at UC San Francisco. He works on constructing deep learning models for the Department of Radiology and Biomedical Imaging at UCSF.

Presentations

Automatic 3D MRI Knee Damage Classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage offers the advantage of quicker and more accurate diagnosis at the time of an MRI scan. We will talk about building this classification system with 3D convolutional neural networks using BigDL on Apache Spark.

Meagan O’Leary considers herself a full-stack leader focused on delivering results through her education, skills, and experience in business, technology, strategy, and design thinking. She engages the world with curiosity, energy, and enthusiasm to unleash capabilities in others and to generate value. Faced with seeming complexity, Meagan drives for clarity while empowering teams and individuals to do more than they thought possible. She has successfully implemented a diverse portfolio of solutions, including enterprise resource planning (SAP), e-commerce, business performance management, financial business intelligence, and most recently artificial intelligence and intelligent automation. Meagan is passionate about improving business and human performance and has done so across industries, in Fortune 500 organizations, nonprofits, and startups.

Presentations

How to Successfully Reinvent Productivity in Finance with Machine Learning. Hint: Machine Learning is only part of it. Data Case Studies

Microsoft’s finance organization is reinventing forecasting using machine learning that its leaders describe as “game changing.” This session covers the lessons the data science and finance teams learned in bringing machine learning forecasting to the office of the CFO, improving forecast accuracy and frequency and driving cultural change through an ML-in-finance center of excellence.

A leading expert on big data architecture and Hadoop, Stephen O’Sullivan has 20 years of experience creating scalable, high-availability data and applications solutions. A veteran of @WalmartLabs, Sun, and Yahoo, Stephen leads data architecture and infrastructure at Silicon Valley Data Science.

Presentations

Architecting A Data Platform Tutorial

What are the essential components of a data platform? This tutorial explains how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Enough data engineering for a Data Scientist - “How I Learned to Stop Worrying and Love the Data Scientists” Session

So how much data engineering should a data scientist know? Before getting to the fun part of the job, the data science itself, a data scientist normally has to do a fair amount of data engineering: onboarding data, doing a little “wrangling,” and so on. In most cases this consumes 50–80% of their time.

With a background in computer engineering and visual analytics, Silvia Oliveros has worked on several projects helping clients explore and analyze their data. Silvia is interested in building and optimizing the infrastructure and data pipelines used to gather insights from various datasets.

Presentations

Enough data engineering for a Data Scientist - “How I Learned to Stop Worrying and Love the Data Scientists” Session

So how much data engineering should a data scientist know? Before getting to the fun part of the job, the data science itself, a data scientist normally has to do a fair amount of data engineering: onboarding data, doing a little “wrangling,” and so on. In most cases this consumes 50–80% of their time.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Executive Briefing: Machine learning—Why you need it, why it's hard, and what to do about it Session

Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production (the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and the skills and agility needed to get data and answers into the right hands), and outlines proven ways to meet them.

Andrea Pasqua is a data science manager at Uber, where he leads the time series forecasting and anomaly detection teams. Previously, Andrea was director of data science at Radius Intelligence, a company spearheading the use of machine learning in the marketing space; a financial analyst at MSCI, a leading company in the field of risk analysis; and a postdoctoral fellow in biophysics at UC Berkeley. He holds a PhD in physics from UC Berkeley.

Presentations

Detecting time series anomalies at Uber scale with recurrent neural networks Session

Time series forecasting and anomaly detection is of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.

Mo Patel is an independent deep learning consultant, advising individuals, startups, and enterprises on strategic and technical AI topics. Previously, Mo was practice director for AI and deep learning at Think Big Analytics, a Teradata company, where he mentored and advised Think Big clients and provided guidance on ongoing deep learning projects. Mo has successfully managed and executed data science projects with clients across several industries, including cable, auto manufacturing, medical device manufacturing, technology, and car insurance. Earlier in his career, Mo was a management consultant and a software engineer. A continuous learner, Mo conducts research on applications of deep learning, reinforcement learning, and graph analytics to existing and novel business problems and brings a diversity of educational and hands-on expertise connecting business and technology. He holds an MBA, a master’s degree in computer science, and a bachelor’s degree in mathematics.

Presentations

Learning PyTorch by building a recommender system Tutorial

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model.

Neejole Patel is a sophomore at Virginia Tech, where she is pursuing a BS in computer science with a focus on machine learning, data science, and artificial intelligence. In her free time, Neejole completes independent big data projects, including one that tests the Broken Windows theory using DC crime data. She recently completed an internship at a major home improvement retailer.

Presentations

Learning PyTorch by building a recommender system Tutorial

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model.

Rizwan Patel is senior director of big data, innovation, and emerging technology at Caesars Entertainment. A senior technologist with strong leadership skills coupled with hands-on application and system expertise, Rizwan has a proven track record of delivering large-scale, mission-critical projects on time and budget using leading-edge technologies to solve critical business problems as well as extensive experience in managing client relations at all levels, including senior executives.

Presentations

Big data applicability to the gaming industry Media and Advertising

Rizwan Patel explains how the gaming industry can leverage Cloudera’s big data platform to adapt to the change in patron dynamics (both in terms of demographics as well as in spending patterns) to create a new paradigm for customer (micro) segmentation.

Vanja Paunić is a data scientist with the Business Operations and Economics group within AI&R at Microsoft. Previously, Vanja worked as a research scientist in the field of bioinformatics, where she published on uncertainty in genetic data, genetic admixture, and prediction of genes. She holds a PhD in computer science with a focus on data mining from the University of Minnesota.

Presentations

Using R and Python for Scalable Data Science, Machine Learning, and AI Tutorial

Join us to learn how to do scalable, end-to-end data science in R and Python on single machines as well as on Spark clusters and cloud-based infrastructure. You'll be assigned an individual virtual machine with all contents preloaded and software installed and use it to gain experience building and operationalizing machine learning models using distributed functions in both R and Python.

Valentina Pedoia, PhD, is a specialist in the Musculoskeletal and Imaging Research Group at UCSF. She is a data scientist whose main interest is developing algorithms for advanced computer vision and machine learning to improve the use of noninvasive imaging as a diagnostic and prognostic tool. She earned her doctoral degree in computer science working on feature extraction from functional and structural brain MRI in subjects with glial tumors. After graduating in 2013, she joined the Musculoskeletal and Imaging Research Group at UCSF as a postdoctoral fellow, where she provided support and expertise in medical computer vision, with a focus on reducing human effort and extracting semantic features from MRI to study degenerative joint disease. Her current research explores the role of machine learning in extracting contributors to osteoarthritis (OA): analytics that model the complex interactions between the morphological, biochemical, and biomechanical aspects of the knee joint as a whole, and deep convolutional neural networks for musculoskeletal tissue segmentation and for extracting silent features from quantitative relaxation maps for a comprehensive study of the biochemical composition of articular cartilage, with the ultimate goal of a completely data-driven model that extracts imaging features and uses them to identify risk factors and predict outcomes. Dr. Pedoia’s recent work on machine learning applied to OA was selected as an annual scientific highlight of the 25th conference of the International Society for Magnetic Resonance in Medicine (ISMRM 2017) and as best paper presented at the MRI drug discovery study group.

Presentations

Automatic 3D MRI Knee Damage Classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage poses the advantage for quicker and more accurate diagnosis at the time of an MRI scan. We will talk about building this classification system with 3D convolutional neural networks using BigDL on Apache Spark.

Thomas Phelan is cofounder and chief architect of BlueData. Prior to BlueData, Tom was an early employee at VMware and as senior staff engineer was a key member of the ESX storage architecture team. During his 10-year stint at VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular “pluggable storage architecture.” He went on to lead teams working on many key storage initiatives, such as the cloud storage gateway and vFlash. Earlier, Tom was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit file system.

Presentations

How to Protect Big Data in a Containerized Environment Session

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE), but TDE can be difficult to configure and manage, and these issues are only compounded when running on Docker containers. This session discusses those challenges and how to overcome them.

Patrick Phelps is the lead data scientist on ads at Pinterest, focusing on auction dynamics and advertiser success. Previously, Patrick was the lead data scientist at Yelp, leading a team focusing on projects as diverse as search, ads, delivery operations, and HR. He has an engineering background in traffic quality (the art of distinguishing automated systems and malicious actors from legitimate users across a variety of platforms) and held an Insight Data Science fellowship. Patrick is passionate about the ability of data to provide key, quantitative insights to businesses during the decision-making process and is an advocate for data science education across all layers of a company. Patrick holds a PhD in experimental high-energy particle astrophysics.

Presentations

Executive Briefing: Building effective heterogeneous data communities—Driving organizational outcomes with broad-based data science Session

Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge often don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights.

Marcin is a data scientist and the leader of Ryanair's Data & Analytics department. He has around 14 years of professional experience in the aviation, telco, and financial industries, with architecture experience spanning data science, big data solutions, and data warehouses.

Presentations

Data driven fuel management at Ryanair Session

Managing fuel at a company flying 120 million passengers yearly is not a trivial task. This session highlights the main aspects of fuel management at a modern airline and provides an overview of machine learning methods supporting long-term planning and daily decisions.

Dr. Jennifer Prendki is the head of data science at Atlassian, where she leads all search and machine learning initiatives and is in charge of leveraging the massive amount of data collected by the company to load the suite of Atlassian products with smart features. She received her PhD in particle physics from University UPMC – La Sorbonne in 2009 and has since worked as a data scientist in many different industries. Prior to joining Atlassian, Jennifer was a senior data science manager on the search team at Walmart eCommerce. She enjoys addressing both technical and nontechnical audiences at conferences and sharing her knowledge and experience with aspiring data scientists.

Presentations

The Science of Patchy Data Session

This talk reviews options for developing machine learning models even when the data is protected by privacy and compliance laws and cannot be used without anonymization. Dr. Prendki discusses how techniques ranging from contextual bandits to document vector representations offer data scientists the opportunity to build models even when the data can't be used in its entirety.

Michael Prorock is CTO and founder at mesur.io. Michael is an expert in systems and analytics, as well as in building teams that deliver results. Most recently, he served as director of emerging technologies for the Bardess Group, where he defined and implemented a technology strategy that enabled Bardess to scale its business to new verticals across a variety of clients. He enabled marquee customers to reinvent strategically important areas of their business through data analysis and decision making, leveraging new technologies and business approaches, and worked closely to build and maintain key partnerships. Prior to Bardess, Michael was a seasoned analytics veteran who worked directly with customers such as Raytheon, Cisco, and IBM. He has filed multiple patents related to heuristics, media analysis, and speech recognition. In his spare time, Michael applies his findings and environmentally conscious methods on his small farm.

Presentations

Smart Agriculture: Blending IoT Sensor Data with Advanced Analytics Data Case Studies

mesur.io is transforming the agricultural and turf management market with a combination of IoT sensor technology, an advanced analytic platform, and self-service visualization. Growers are able to monitor areas of concern, from water conservation to soil conditions and beyond. This is a climate awareness solution for managing the modern farm, plantation, or golf course.

Jiangjie Qin is on the Data Infrastructure team at LinkedIn. He works on Apache Kafka and is a Kafka committer and PMC member. Previously, he worked at IBM, where he managed IBM’s zSeries platform for banking clients. Jiangjie holds a master’s degree in information networking from Carnegie Mellon’s Information Networking Institute.

Presentations

Secret Sauce behind Self Managing Kafka Clusters at LinkedIn Session

LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages per day. Running Kafka at such a scale makes automated operations a necessity. We will share the lessons we learned operating Kafka at scale with minimal human intervention.

Paul Raff is a principal data scientist manager on Microsoft’s Analysis and Experimentation team. Previously, he was a supply chain researcher at Amazon. Paul and his team work to enable scalable experimentation for teams around Microsoft, including Windows 10, Office Online, Exchange Online, and Cortana. His team also focuses on experiment quality, ensuring that all experiments operate as intended and in a way that allows the appropriate conclusions to be drawn. Paul received a PhD in mathematics from Rutgers University in 2009; prior to that, he received degrees in mathematics and computer science from Carnegie Mellon University.

Presentations

A/B Testing at Scale: Accelerating Software Innovation Tutorial

Controlled experiments, including A/B tests, have revolutionized the way software is being developed, with new ideas objectively evaluated with real users. We provide an intro and lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft and executing over 10K experiments/year.
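The statistical core of such an experiment, deciding whether a treatment's conversion rate differs from control's by more than chance, reduces to a two-proportion z-test. A minimal sketch with invented numbers:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic comparing conversion rates of control (A) and treatment (B)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: 10,000 users per variant, 10% vs. 11% conversion.
z = two_proportion_z(conv_a=1000, n_a=10000, conv_b=1100, n_b=10000)
print(round(z, 2))  # → 2.31; |z| > 1.96 is significant at the 5% level
```

At the scale described in the session, the hard parts are not this formula but randomization, instrumentation, and guarding experiment quality.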

Greg is a director of product management at Cloudera, where he is responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over the past 20 years, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and, most recently, product management, giving him a holistic view of and expertise in the database market. Previously, Greg was part of the esteemed Real-World Performance group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Analytics in the Cloud - Building a Modern Cloud-based Big Data Warehouse Session

For many organizations, the cloud will likely be the destination of their next big data warehouse. The speakers discuss considerations for evaluating the cloud for analytics and big data warehousing, steering attendees down the path to success so they can get the most from the cloud. Attendees will leave with an understanding of different architectural approaches and their impacts.

Arunkumar is a senior architect with the data team at Sapient.

Presentations

Achieving GDPR Compliance and Data Privacy using BlockChain Technology Session

We explain how we used open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR). Key takeaways include an introduction to GDPR, a step further on data privacy; why blockchain is a suitable candidate for implementing GDPR; and lessons learned in our blockchain implementation of GDPR compliance.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from UW Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Effectively once, Exactly once, and more in Heron Session

Stream processing systems need to support different types of processing semantics due to the diverse nature of streaming applications. In this talk, we methodically cover effectively-once and exactly-once semantics and different types of state and consistency, how they are implemented in Heron, and how applications can benefit.

Modern Real Time Streaming Architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from Big Data to Fast Data. This, in part, stems from the deluge of high velocity data streams and, more importantly, the need for instant data-driven insights. In this tutorial, we walk the audience through the state-of-the-art streaming systems and provide an in-depth review of modern streaming algorithms.

Karthik Ramasamy leads a data science team at Uber focusing on solving fraud problems using machine learning. His team builds advanced machine learning models like semisupervised and deep learning models to detect account takeovers and stolen credit cards. Previously, Karthik was a cofounder of LogBase, where he worked on real-time analytics infrastructure and built models to rate drivers based on their driving behavior, and a founding member of the LinkedIn security team, where he developed various security products, with a particular focus on anti-automation efforts.

Presentations

Using computer vision to combat stolen credit card fraud Session

Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities.

Rishi Ranjan is the director of big data and analytics at Freddie Mac, where he focuses on providing big data solutions for analytics and data science. Rishi has over 20 years of experience managing data and database platforms.

Presentations

From big data to good data: How Apache NiFi and Apache Atlas eased Dataflow management at Freddie Mac with better data governance and reduced data latency Data Case Studies

Rishi Ranjan explains how Freddie Mac used Apache NiFi and Apache Atlas to build a centralized production operational data store on a Hadoop cluster. NiFi reduced the time to build a new data pipeline from months to hours and provided a robust data governance capability at the same time.

Delip Rao is the founder of R7 Speech Science, a San Francisco-based company focused on building innovative products on spoken conversations. Previously, Delip was the founder of Joostware, which specialized in consulting and building IP in natural language processing and deep learning. Delip is a well-cited researcher in natural language processing and machine learning and has worked at Google Research, Twitter, and Amazon (Echo) on various NLP problems. He is interested in building cost-effective, state-of-the-art AI solutions that scale well. Delip has an upcoming book on NLP and deep learning from O’Reilly.

Presentations

Going beyond Words: Understand what your spoken conversation data can do for you Session

Spoken conversations have rich information beyond what was said in words. Delip Rao details the potential of spoken conversational datasets, including identifying speakers and their demographic attributes, understanding intent and dynamics between speakers, and so on. Delip also discusses some of the latest science, including some of the work developed at R7.

Machine learning with PyTorch 2-Day Training

PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.

Dr. Andrew Ray is a Senior Technical Expert at Sam’s Club Technology. He is passionate about big data and has extensive experience working with Apache Spark and Hadoop. Andrew is an active contributor to the Apache Spark project including SparkSQL and GraphX. At Walmart Andrew built an analytics platform on Hadoop that integrated data from multiple retail channels using fuzzy matching and distributed graph algorithms. Andrew also led the adoption of Spark at Walmart from proof-of-concept to production. Andrew earned his Ph.D. in Mathematics from the University of Nebraska, where he worked on extremal graph theory.

Presentations

Writing Distributed Graph Algorithms Session

This talk will give a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX. We will discuss the implementations of three key examples in each abstraction and provide historical context for the evolution between these three abstractions.
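The vertex-centric, message-passing abstraction these systems share can be sketched in a few lines of plain Python. Here is single-source shortest paths over an illustrative toy graph (GraphX exposes the same pattern through its `pregel` operator; the graph and variable names are made up):

```python
import math

# Minimal Pregel-style SSSP: each vertex holds a value, receives messages,
# updates its state, and sends messages along its out-edges each superstep.
edges = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}
dist = {v: math.inf for v in edges}          # vertex state

msgs = {0: 0}                                # initial message to the source
while msgs:                                  # one iteration == one superstep
    new_msgs = {}
    for v, m in msgs.items():
        if m < dist[v]:                      # vertex program: keep the min
            dist[v] = m
            for dst, w in edges[v]:          # scatter along out-edges
                cand = dist[v] + w
                if cand < new_msgs.get(dst, math.inf):
                    new_msgs[dst] = cand     # message combiner: min
    msgs = new_msgs

print(dist)  # → {0: 0, 1: 3, 2: 1, 3: 4}
```

The algorithm halts when no messages remain, mirroring Pregel's vote-to-halt semantics; PowerGraph's gather-apply-scatter decomposition restructures the same computation for high-degree vertices.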

Joseph (Joey) Richards is VP of Data & Analytics at GE Digital and head of the Wise.io Data Science Applications team. His team is responsible for defining and implementing machine learning applications on behalf of GE and its customers. Prior to joining GE, he was Co-founder and Chief Data Scientist at Wise.io (acquired by GE in 2016), where he built and deployed high-value ML applications for dozens of customers. In his academic life, Joey was an NSF postdoctoral researcher in the Statistics and Astronomy departments at UC Berkeley and a Fulbright Scholar whose research focused on application of supervised and semi-supervised learning for problems in astrophysics; he holds a PhD in Statistics from Carnegie Mellon University.

Presentations

Machine Learning Applications for the Industrial Internet Session

Deploying ML software applications for use cases in the Industrial Internet presents a unique set of challenges. Data-driven problems at GE require approaches that are highly accurate, robust, fast, scalable and fault tolerant. I'll discuss our approach to building production-grade ML applications and will talk about our work across GE in industries such as Power, Aviation and Oil & Gas.

At Salesforce, Alexis manages a team of data scientists and machine learning engineers focused on deriving intelligence from activity data for the Einstein platform. He has over twenty years of experience, the last five focused on large-scale data science and engineering applications using vast amounts of data. Alexis has built and led teams building Spark-based production applications for the last three years using Scala, Spark batch and streaming, GraphX, NLP, and machine learning. Previously, Alexis worked for Radius Intelligence, Concurrent Inc., and Couchbase, spent thirteen years at Sun Microsystems/Oracle, and worked at two large SIs in Europe. Alexis holds a master’s degree in CS with a focus on cognitive sciences and has done countless online trainings in data science and engineering.

Presentations

Building a Contacts Graph from activity data Session

In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Salesforce is using data science and engineering to enable salespeople to monitor their emails in real time, surfacing insights and recommendations using a graph that models contextual data.

Jeff Rosenberg is responsible for the overall technology direction of business intelligence, data and product analytics, big data, data quality management, and data science at Hulu. With a background in technical program management and software development, Jeff’s experience in platform and device development at companies including Warner Bros., DirecTV, and Sony, which drive billions of experiences and data points, has served him well in understanding the needs of customers, business users, and technologists alike.

Presentations

Hulu: Unlocking the Power of Hadoop with Interactive Business Intelligence Media and Advertising

During Hulu’s journey to becoming a major player in the subscription video-on-demand industry, with yearly revenues over $1B, the company also began an all-too-familiar big data journey. Hulu will discuss its modern approach to delivering business intelligence, in which users can do targeted analysis on a large Hadoop data lake while supporting high concurrency in a self-service manner across the globe.

Mike Ruberry is a senior associate of data science at ZestFinance, where his research interests include explainability and generative models. He has four degrees in computer science, including a PhD from Harvard University. During and since his doctorate, Mike has worked on several machine learning models and tools, including deploying automated models that process terabytes of data daily. Before specializing in machine learning, he worked on Windows as a program manager at Microsoft.

Presentations

Explaining Machine Learning Models Session

What does it mean to explain a machine learning model, and why is it important? Mike Ruberry of ZestFinance will address those questions while discussing several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques offers a different perspective, and their clever application can reveal new insights and solve business requirements.

Daniel L. Rubin is Associate Professor of Biomedical Data Science, Radiology, and Medicine (Biomedical Informatics Research) at Stanford University and Director of Biomedical Informatics for the Stanford Cancer Institute. His NIH-funded research program focuses on quantitative imaging, integrating imaging with clinical and molecular data, and mining these Big Data to discover imaging phenotypes that can predict disease biology, define disease subtypes, and personalize treatment. He is applying these methods for distributed computation of decision support models. He has over 240 scientific publications in biomedical imaging informatics and medical imaging.

Presentations

Distributed Clinical Models: Inference without Sharing Patient Data Session

Clinical collaboration benefits from pooling data to learn models from large datasets, but it's hampered by concerns about sharing data. We've developed a privacy-preserving alternative that creates statistical models equivalent to one learned from the entire dataset. We've built this as a cloud application: each collaborator installs their own instance, and the installations self-assemble into a star network.
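As a flavor of how a model can match the pooled-data fit without any site sharing patient-level rows, consider the classic sufficient-statistics trick for linear regression. This NumPy sketch is an illustration of the general idea under simplifying assumptions, not the speakers' actual system; all data and names are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

def site_stats(X, y):
    """Local computation at each site: only these aggregates leave the site."""
    return X.T @ X, X.T @ y

# Three hypothetical hospitals, each with private data from the same model.
beta_true = np.array([2.0, -1.0, 0.5])
sites = []
for _ in range(3):
    X = rng.normal(size=(50, 3))
    y = X @ beta_true + 0.01 * rng.normal(size=50)
    sites.append((X, y))

# The central aggregator sums the statistics and solves the normal equations.
XtX = sum(site_stats(X, y)[0] for X, y in sites)
Xty = sum(site_stats(X, y)[1] for X, y in sites)
beta_hat = np.linalg.solve(XtX, Xty)

# The result is identical to fitting on all 150 rows pooled together.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
beta_pooled, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)
print(np.allclose(beta_hat, beta_pooled))  # → True
```

Only the d-by-d matrix X'X and the d-vector X'y cross site boundaries, which is the sense in which the statistical model equals the pooled fit while the patient rows stay put.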

Philipp Rudiger is a software developer at Anaconda, Inc., where he develops open source and client-specific software solutions for data management, visualization, and analysis. Philipp holds a PhD in computational modeling of the visual system.

Presentations

Custom interactive visualizations and dashboards for one billion datapoints on a laptop in 30 lines of Python Tutorial

Python lets you solve data-science problems by stitching together packages from the Python ecosystem, but it can be difficult to choose packages that work well together. Here we take you through a small number of lines of Python code that provide a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints.
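The key to interactivity at a billion points is rasterization: aggregate the points into a fixed-size grid of counts and colormap the grid, rather than drawing each point. The tutorial uses the Datashader/HoloViews stack for this; the aggregation step itself can be illustrated with plain NumPy (parameters here are arbitrary):

```python
import numpy as np

# A million random points, standing in for a much larger dataset.
rng = np.random.default_rng(42)
n = 1_000_000
x = rng.normal(0, 1, n)
y = rng.normal(0, 1, n)

# Rasterize: a 300x300 grid of per-pixel counts (what Datashader's Canvas
# does), then log-shade so dense and sparse regions are both visible.
counts, _, _ = np.histogram2d(x, y, bins=300)
img = np.log1p(counts)

print(counts.sum() == n, img.shape)  # → True (300, 300)
```

Because the grid size is fixed, rendering cost no longer grows with the number of points; panning or zooming just re-aggregates the points falling in the new viewport.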

Ferd Scheepers is the global chief information architect of ING and has been driving ING’s journey to becoming a data-driven company for the last five years. He has published on data lakes and is a frequent speaker at both major vendor conferences and open source summits. Currently, he is championing the open metadata initiative, including Apache Atlas. Passionate about data, both its opportunities and its risks, Ferd loves to share his vision and ideas on what data will mean for both companies and individuals.

Presentations

The Rise of Big Data Governance: Insight on this Emerging Trend from Active Open Source Initiatives Session

In this joint presentation, John Mertic, director of ODPi, and Ferd Scheepers, global chief information architect of ING, address the benefits of a vendor-neutral approach to data governance and the need for an open metadata standard, along with insight into how companies such as ING, IBM, and Hortonworks are delivering solutions to this challenge as an open source initiative.

Michael Schrenk has developed software that collects and processes information for some of the biggest news agencies in Europe, has lectured at journalism conferences in Belgium and the Netherlands, and has created several weekend data workshops for the Centre for Investigative Journalism in London.

Mike also consults on information security everywhere from Moscow to Silicon Valley and most places in between. Along the way, he has been interviewed by the BBC, the Christian Science Monitor, National Public Radio, and many others. In addition to his interest in journalism, Mike runs a competitive intelligence consultancy in Las Vegas and is the author of Webbots, Spiders, and Screen Scrapers (San Francisco: No Starch Press, 2012). He is also an eight-time speaker at the notorious DEF CON hacking conference. He may be best known for software that, over a period of a few months, autonomously purchased over $13 million worth of cars by adapting to real-time market conditions.

Presentations

Understanding Metadata Session

Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. This talk describes how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Hands-on data science with Python 2-Day Training

Robert Schroll offers an introduction to machine learning in Python, as he walks you through building an anomaly detection model and a recommendation engine. You'll gain hands-on experience from prototyping to production, and everything in between, including data cleaning, feature engineering, model building and evaluation, and deployment.

Baron is the founder and CEO of VividCortex, the best way to see what your production database servers are doing. Baron has written a lot of open source software and several books, including High Performance MySQL. He has focused his career on learning and teaching about the performance and observability of systems generally (including the view that teams are systems and culture influences their performance) and databases specifically.

Presentations

Why Nobody Cares About Your Anomaly Detection Session

Anomaly detection is super hot in my industry (~monitoring). I've built anomaly detection, I've seen customers not really understand or care about it, and I've seen others repeat the same pattern many times. Why? And what can we do about it? This is my story of arriving at a "post-anomaly-detection" point of view.

Jonathan Seidman is a software engineer on the Partner Engineering team at Cloudera. Previously, he was a lead engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementation. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

The future of ETL isn’t what it used to be Session

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.

Min Shen is an engineer on LinkedIn’s Hadoop infrastructure development team, helping to build next-generation Hadoop infrastructure with better performance and manageability. Min holds a PhD in computer science from the University of Illinois, with a research interest in distributed computing.

Presentations

Spark for everyone: Self-service monitoring and tuning Session

Spark applications need to be well tuned so that individual applications run quickly and reliably, and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

Tomer Shiran is the CEO and co-founder of Dremio. Prior to Dremio, he was VP Product and employee #5 at MapR, where he was responsible for product strategy, roadmap and new feature development. As a member of the executive team, Tomer helped grow the company from five employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion – Israel Institute of Technology, as well as five U.S. patents.

Presentations

Reflections vs. Copies: Making Data Fast and Easy to Use Without Making Copies Session

Most organizations manage 5-15 copies of their data in multiple systems and formats to support different analytical use cases, including BI and machine learning. In this talk we introduce a new approach called Data Reflections, which dramatically reduces the need for data copies. We demonstrate an open source implementation built with Apache Calcite and explore two production case studies.

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Understanding data at scale leveraging Spark and deep learning frameworks Tutorial

We go through approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, and text), leveraging Spark, its extended ecosystem of libraries, and deep learning frameworks. We use sample data and code for each to understand implementation nuances, and subsequently highlight the bottlenecks and solutions for data and models at scale.

Vartika Singh is a Field Data Science Architect at Cloudera. Previously, Vartika was a data scientist applying machine-learning algorithms to real-world use cases, ranging from clickstream to image processing. She has 12 years of experience designing and developing solutions and frameworks utilizing machine-learning techniques.

Presentations

Understanding data at scale leveraging Spark and deep learning frameworks Tutorial

We go through approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, and text), leveraging Spark, its extended ecosystem of libraries, and deep learning frameworks. We use sample data and code for each to understand implementation nuances, and subsequently highlight the bottlenecks and solutions for data and models at scale.

Tomas Singliar is a data scientist in the AI and Research group at Microsoft. He studied machine learning at the University of Pittsburgh. He has published a dozen papers in, and serves as a reviewer for, several top-tier AI conferences (AAAI, UAI, etc.). He holds four patents in intent recognition through inverse reinforcement learning. Tomas's favorite hammer is probabilistic and Bayesian modeling, which he applies analytically and predictively to business data.

Presentations

Using R and Python for Scalable Data Science, Machine Learning, and AI Tutorial

Join us to learn how to do scalable, end-to-end data science in R and Python on single machines as well as on Spark clusters and cloud-based infrastructure. You'll be assigned an individual virtual machine with all contents preloaded and software installed and use it to gain experience building and operationalizing machine learning models using distributed functions in both R and Python.

Ram Shankar is a security data wrangler in Azure Security Data Science, where he works on the intersection of ML and security. Ram's work at Microsoft includes a slew of patents in the intrusion detection space (called "fundamental and groundbreaking" by evaluators). In addition, he has given talks at internal conferences and received Microsoft's Engineering Excellence award. Ram has previously spoken at data-analytics-focused conferences like Strata San Jose and the Practice of Machine Learning as well as at security-focused conferences like BlueHat, DerbyCon, FireEye Security Summit (MIRCon), and Infiltrate. He is the organizer of the Security Data Science Colloquium, an effort to bring together security analysts, engineers, and applied ML engineers working in the security analytics area. Ram graduated from Carnegie Mellon University with master's degrees in both ECE and innovation management.

Presentations

Failed experiments in infrastructure security analytics and lessons learned from fixing them Session

How do you debug a security data science system when it doesn't work as intended: change the ML approach, redefine the security scenario, or start from scratch? We answer this question by sharing failed experiments and lessons learned while building ML detections for three security scenarios: cloud lateral movement, identifying anomalous executables, and automating the incident response process.

Crystal Skelton is an associate in Kelley Drye & Warren’s Los Angeles office, where she represents a wide array of clients from tech startups to established companies in privacy and data security, advertising and marketing, and consumer protection matters. Crystal advises clients on privacy, data security, and other consumer protection matters, specifically focusing on issues involving children’s privacy, mobile apps, data breach notification, and other emerging technologies and counsels clients on conducting practices in compliance with the FTC Act, the Children’s Online Privacy Protection Act (COPPA), the Gramm-Leach-Bliley Act, the GLB Safeguards Rule, Fair Credit Reporting Act (FCRA), the Fair and Accurate Credit Transactions Act (FACTA), and state privacy and information security laws. She regularly drafts privacy policies and terms of use for websites, mobile applications, and other connected devices.

Crystal also helps advertisers and manufacturers balance legal risks and business objectives to minimize the potential for regulator, competitor, or consumer challenge while still executing a successful campaign. Her advertising and marketing experience includes counseling clients on issues involved in environmental marketing, marketing to children, online behavioral advertising (OBA), commercial email messages, endorsements and testimonials, food marketing, and alcoholic beverage advertising. She represents clients in advertising substantiation proceedings and other matters before the Federal Trade Commission (FTC), the US Food and Drug Administration (FDA), and the Alcohol and Tobacco Tax and Trade Bureau (TTB) as well as in advertiser or competitor challenges before the National Advertising Division (NAD) of the Council of Better Business Bureaus. In addition, she assists clients in complying with accessibility standards and regulations implementing the Americans with Disabilities Act (ADA), including counseling companies on website accessibility and advertising and technical compliance issues for commercial and residential products. Prior to joining Kelley Drye, Crystal practiced privacy, advertising, and transactional law at a highly regarded firm in Washington, DC, and as a law clerk at a well-respected complex commercial and environmental litigation law firm in Los Angeles, CA. Previously, she worked at the law firm featured in the movie Erin Brockovich, where she worked directly with Erin Brockovich and the firm’s name partner to review potential new cases.

Presentations

Executive Briefing: Legal best practices for making data work Session

Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road are key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.”

Lead Enterprise Architect, MasterCard

Presentations

Improving user-merchant propensity modeling using BigDL’s RNN/LSTMs at scale Session

We are going to demonstrate the use of RNNs on BigDL to predict a user's probability of shopping at a particular offer merchant during a "campaign period." We will compare and contrast the RNN-based method with traditional ones, such as logistic regression and random forests.

Ram Sriharsha is the product manager for Apache Spark at Databricks and an Apache Spark committer and PMC member. Previously, Ram was architect of Spark and data science at Hortonworks and principal research scientist at Yahoo Labs, where he worked on scalable machine learning and data science. He holds a PhD in theoretical physics from the University of Maryland and a BTech in electronics from the Indian Institute of Technology, Madras.

Presentations

Magellan: Scalable and Fast Geospatial Analytics Session

How do you scale geospatial analytics on big data? And while we are at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Join us to learn about the internals of Magellan and how it provides scalability and performance without sacrificing simplicity.

Seth Stephens-Davidowitz uses data from the internet (particularly Google searches) to get new insights into the human psyche, measuring racism, self-induced abortion, depression, child abuse, hateful mobs, the science of humor, sexual preference, anxiety, son preference, and sexual insecurity, among many other topics. His 2017 book, Everybody Lies, published by HarperCollins, was a New York Times best seller. Seth is also a contributing op-ed writer for the New York Times. Previously, he was a data scientist at Google and a visiting lecturer at the Wharton School at the University of Pennsylvania. He holds a BA in philosophy (Phi Beta Kappa) from Stanford and a PhD in economics from Harvard. In high school, Seth wrote obituaries for his local newspaper, the Bergen Record, and was a juggler in theatrical shows. He now lives in Brooklyn and is a passionate fan of the Mets, Knicks, Jets, Stanford football, and Leonard Cohen.

Presentations

Keynote with Seth Stephens-Davidowitz Keynote

Keynote with Seth Stephens-Davidowitz

Kapil Surlaker leads the data and analytics team at LinkedIn, where he’s responsible for core analytics infrastructure platforms including Hadoop, Spark, other computation frameworks such as Gobblin and Pinot, an OLAP serving store, and XLNT, LinkedIn’s experimentation platform. Previously, Kapil led the development of Databus, a database change capture platform that forms the backbone of LinkedIn’s online data ecosystem, Espresso, a distributed document store that powers many applications on the site, and Helix, a generic cluster management framework that manages multiple infrastructure deployments at LinkedIn. Prior to LinkedIn, Kapil held senior technical leadership positions at Kickfire (acquired by Teradata) and Oracle. Kapil holds a BTech in computer science from IIT, Bombay, and an MS from the University of Minnesota.

Presentations

If you can’t measure it, you can’t improve it: How reporting and experimentation fuels product innovation at LinkedIn Session

Metrics measurement and experimentation play a crucial role in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data.

Rajiv Synghal is an accomplished strategic thinker and adviser to senior management on issues around growth, profitability, competition, and innovation. He is equally outstanding in presenting a value proposition to top management and doing a deep dive with fellow engineers. Rajiv is the rare kind of technology professional who carries within him the pragmatism of business urgency and the will to find a way to solve a problem no matter what it takes. Rajiv has a long career history in Fortune 100 organizations (Visa, Nokia) and startups (Kivera), in both delivery and architecture roles. He has demonstrated an uncanny ability to learn and teach new concepts, easily adapt to change, and manage multiple concurrent tasks. Rajiv is currently advising a number of startups in the big data space that are developing technologies that will provide strategic solutions to many challenges in the healthcare field.

Presentations

Building a flu predictor model for improved patient care Data Case Studies

As healthcare data becomes increasingly digitized, medical centers are leveraging data in new ways to improve patient care. At Kaiser Permanente, one such initiative is focused on the flu, which kills as many as 49,000 people each year in the US alone. Kaiser will discuss how it developed a sophisticated flu predictor model to better determine where resources were needed and how to reduce outbreaks.

Pawel is a software engineer on the Analytics Data Storage team at Criteo. His primary tasks include adding new features to Vertica and making Hive more reliable and efficient. Previously, he worked at CERN as a system performance engineer, focusing on modern computing architectures and scientific software optimization.

Presentations

Hive as a Service Session

Hundreds of analysts and thousands of automated jobs run Hive queries at Criteo every day. As Hive is the main data transformation tool at Criteo, we spent a year evolving Hive's platform from an error-prone add-on installed on some spare machines, to a best-in-class installation capable of self-healing and automatically scaling to handle our growing load.

Ran holds a PhD in computer science from Ben-Gurion University, Israel, specializing in artificial intelligence; his research focused mainly on automated planning. Ran also served as a lecturer for the design of algorithms course and other CS theory courses for CS bachelor's students at BGU and has thorough expertise in algorithm design.

Ran joined Dell EMC in early 2016, shortly after completing his PhD, as a senior data scientist on the data-science-as-a-service team in Dell's IT organization.

Over the last 16 months, Ran has led various data science projects, especially in the domain of hardware failure prediction. In parallel, he has played a key role in designing the team's engagement models and work structure, serving as a consultant to EMC's business data lake team.

Ran is responsible for the team's academic relations and continues, from time to time, to teach theory courses for CS students.

Presentations

AI Powered Crime Prediction Session

What if we could predict when and where the next crimes will be committed? Crimes in Chicago is a publicly available dataset reflecting reported incidents of crime in Chicago since 2001. Using this data, we would like not only to explore specific crimes to find interesting trends but also to predict how many crimes will take place next week, and even next month.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe. Earlier, he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine learned models crash & burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. This is a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
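The word embeddings the tutorial applies map each word to a dense vector so that semantically similar words end up close together; closeness is usually measured by cosine similarity. The following is a minimal illustrative sketch of that idea in plain Python, with tiny made-up 3-dimensional vectors (real embeddings from Spark ML or TensorFlow typically have hundreds of dimensions); it is not taken from the tutorial's own code.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" for illustration only.
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

# Related words score near 1.0; unrelated words score much lower.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

A trained model learns such vectors from co-occurrence statistics in a large corpus; downstream components (like an entity-extraction pipeline) then consume them as features.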

Yulia Tell is a Technical Program Manager in Big Data Technologies team within Software and Services Group at Intel. She works on several open source projects and partner engagements in the big data domain. Her work is focused specifically on Apache Hadoop and Apache Spark, including big data analytics applications that use machine learning and deep learning.

Presentations

Automatic 3D MRI Knee Damage Classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage poses the advantage for quicker and more accurate diagnosis at the time of an MRI scan. We will talk about building this classification system with 3D convolutional neural networks using BigDL on Apache Spark.

Daniel Templeton has a long history in high-performance computing, open source communities, and technology evangelism. Today Daniel works on the YARN development team at Cloudera, focused on the resource manager, fair scheduler, and Docker support.

Presentations

What's new in Hadoop 3.0 Session

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Siddharth Teotia is a software engineer at Dremio and a contributor to Apache Arrow project. Previously, Siddharth was on the database kernel team at Oracle, where he worked on storage, indexing, and the in-memory columnar query processing layers of Oracle RDBMS. He holds an MS in software engineering from CMU and a BS in information systems from BITS Pilani, India. During his studies, Siddharth focused on distributed systems, databases, and software architecture.

Presentations

Vectorized query processing using Apache Arrow Session

Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.
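The core contrast behind vectorized execution is row-at-a-time versus column-at-a-time processing. Below is a hedged, pure-Python sketch of that difference (the data and function names are invented for illustration; it does not use or depict Dremio's or Arrow's actual code): the columnar version iterates over one contiguous array per column, which is the memory layout that lets a real engine apply tight loops, SIMD, and null bitmaps.

```python
# Row-oriented layout: each record is a tuple; a scan touches every field
# of every row even though the query only needs the "amount" column.
rows = [(1, 10.0, "a"), (2, 25.0, "b"), (3, 40.0, "a"), (4, 55.0, "b")]

def row_at_a_time_sum(rows, threshold):
    total = 0.0
    for _id, amount, _tag in rows:  # per-row interpretation overhead
        if amount > threshold:
            total += amount
    return total

# Columnar layout: each column is a contiguous array (as in Arrow's
# in-memory format), so an operator streams through one column in batches.
amounts = [10.0, 25.0, 40.0, 55.0]

def vectorized_sum(amounts, threshold):
    # A single tight loop over one contiguous column; a real engine would
    # additionally use SIMD instructions and validity bitmaps here.
    return sum(a for a in amounts if a > threshold)

# Both plans compute the same answer; the columnar one does less work per value.
assert row_at_a_time_sum(rows, 20.0) == vectorized_sum(amounts, 20.0) == 120.0
```

The performance gap widens with table width: the row-based loop must deserialize fields it never uses, while the columnar loop reads only the bytes of the column the query touches.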

Meena has extensive experience leading application, technology, data, and infrastructure teams, developing strategy, architecture, implementations, and IT operational services. He focuses on leveraging technical advancements and industry reference architectures to define a data delivery platform. A big data and analytics evangelist, he concentrates on defining strategy for accelerating the digitization journey for oil and gas clients.

Most recently, he delivered the functional and technical architecture for a one-stop self-service data and information portal for an oil and gas independent.

He has extensive leadership, management, and engineering experience focused on innovation strategy. A design thinker and innovator and a resilient, adaptive, and creative leader, he is passionate about creating human-centered experiences.

Presentations

Meta your Data, Drain the Big Data Swamp Data Case Studies

A self-service operational data lake to improve operational efficiency, boost productivity through fully identifiable data, and reduce the risk of a data swamp: these were the objectives that drove BP to create a strategic and methodical approach to data lake architecture. Through this approach, BP provides a template for turning insights, hidden risks, and unseen opportunities into actionable solutions.

Alex Thomas is a data scientist at Indeed. Over his career, Alex has used natural language processing (NLP) and machine learning with clinical data, identity data, and (now) employer and jobseeker data. He has worked with Apache Spark since version 0.9 as well as NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. This is a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Dr. Cindi Thompson is the head of data science at Silicon Valley Data Science. She has over fifteen years of experience in research and applications of machine learning and natural language processing across academia and industry. She holds a PhD and an MA in computer sciences from UT Austin and a BS in computer science from NCSU. She has dozens of publications in journals and refereed conferences and is the co-inventor of three patents. She has also collaborated extensively to solve problems by bridging technical and business concerns, using strong communication and facilitation skills.

Presentations

Developing a Modern Enterprise Data Strategy Tutorial

Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those key aspirations that will define an organization’s future vision. In this tutorial, we explain how to create a modern data strategy that powers data-driven business.

Wee Hyong Tok is a principal data science manager at Microsoft, where he works with teams to cocreate new value and turn each of the challenges facing organizations into compelling data stories that can be concretely realized using proven enterprise architecture. Wee Hyong has worn many hats in his career, including developer, program/product manager, data scientist, researcher, and strategist, and his range of experience has given him unique super powers to nurture and grow high-performing innovation teams that enable organizations to embark on their data-driven digital transformations using artificial intelligence. He has a passion for leading artificial intelligence-driven innovations and working with teams to envision how these innovations can create new competitive advantage and value for their business and strongly believes in story-driven innovation.

Presentations

How Does a Big Data Professional get Started with AI? Session

Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation by infusing apps and experiences with AI. This session will help big data professionals demystify AI and show how they can leverage and evolve their valuable big data skills toward doing AI.

Head of Data Science and Analytics at Pirelli Tyre.
A well-established, accomplished data scientist with experience ranging across various areas of computer science and information technology. Strong background in data modeling, data analysis, and data engineering, with extensive experience using Python in the data science space (pandas, SciPy, scikit-learn). A skilled manager with a record of successful teams built from scratch, and a strong communicator with the ability to bring together technical and business profiles on challenging issues.

Presentations

Pirelli Connesso: when the road meets the cloud Session

In this talk we explore the architectural challenges we faced in building Pirelli Connesso, an IoT cloud-based system providing information on tyre operating conditions, consumption, and maintenance. We will highlight the operative approaches that enabled the integration of different contributions across cross-functional teams.

Amy Unruh is a developer programs engineer for the Google Cloud Platform, where she focuses on machine learning and data analytics as well as other Cloud Platform technologies. Amy has an academic background in CS/AI and has also worked at several startups, done industrial R&D, and published a book on App Engine.

Presentations

Getting started with TensorFlow Tutorial

Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from a conceptual overview all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry.

Ayin Vala is the cofounder and chief data scientist of the nonprofit organization Foundation for Precision Medicine, where he and his research and development team work on statistical analysis and machine learning, pharmacogenetics, molecular medicine, and sciences relevant to the advancement of medicine and healthcare delivery. Ayin has won several awards and patents in the healthcare, aerospace, energy, and education sectors. He also volunteers at DataKind, where he leads machine learning efforts in humanitarian projects. Ayin holds master’s degrees in information management systems from Harvard University and mechanical engineering from Georgia Tech.

Presentations

Reinventing healthcare: Early detection of Alzheimer’s disease with deep learning Session

Complex diseases like Alzheimer’s cannot be cured by pharmaceutical or genetic sciences alone, and current treatments and therapies lead to mixed successes. Ayin Vala explains how to use the power of big data and AI to treat challenging diseases with personalized medicine, which takes into account individual variability in medicine intake, lifestyle, and genetic factors for each patient.

Crystal Valentine is the vice president of technology strategy at MapR Technologies. She has nearly two decades’ experience in big data research and practice. Previously, Crystal was a consultant at Ab Initio, where she worked with Fortune 500 companies to design and implement high-throughput, mission-critical applications and with equity investors as a technical expert on competing technologies and market trends. She was also a tenure-track professor in the Department of Computer Science at Amherst College. She is the author of several academic publications in the areas of algorithms, high-performance computing, and computational biology and holds a patent for extreme virtual memory. Crystal was a Fulbright Scholar in Italy and holds a PhD in computer science from Brown University as well as a bachelor’s degree from Amherst College.

Presentations

DataOps: An Agile methodology for data-driven organizations Session

DataOps—a methodology for developing and deploying data-intensive applications, especially those involving data science and machine learning pipelines—supports cross-functional collaboration and fast time to value with an Agile, self-service workflow. Crystal Valentine offers an overview of this emerging field and explains how to implement a DataOps process.

Vinithra Varadharajan is an engineering manager in the cloud organization at Cloudera, where she is responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Eugene Fratkin, and Jennifer Wu lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Emre Velipasaoglu is a Principal Data Scientist at Lightbend. He has worked in various labs and start-ups as a researcher, scientist, engineer and manager for over 20 years, including several years at Yahoo! Labs as a senior scientist, where Emre and his team built the machine learning ranking models for Yahoo! Search. His expertise spans machine learning, information retrieval, natural language processing and signal processing. He holds a Ph.D. in Electrical and Computer Engineering from Purdue University.

Presentations

Machine Learned Model Quality Monitoring in Fast Data and Streaming Applications Session

Most machine learning algorithms are designed to work on stationary data. Yet, real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Here, we review the monitoring methods and evaluate them for applicability in modern fast data and streaming applications.
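A simple instance of the monitoring the abstract describes is tracking prediction accuracy over a sliding window of recently labeled outcomes and alerting when it drops below a threshold, which can then trigger retraining. The sketch below is a minimal stdlib-only illustration of that pattern (the class name, window size, and threshold are invented for the example, not drawn from any specific framework); production systems would also handle label delay and use statistical drift tests rather than a fixed cutoff.

```python
from collections import deque

class AccuracyMonitor:
    """Track model accuracy over a sliding window of labeled outcomes
    and flag possible drift when accuracy falls below a threshold."""

    def __init__(self, window_size=100, threshold=0.8):
        self.window = deque(maxlen=window_size)  # 1 = correct, 0 = wrong
        self.threshold = threshold

    def record(self, predicted, actual):
        self.window.append(1 if predicted == actual else 0)

    def accuracy(self):
        return sum(self.window) / len(self.window) if self.window else None

    def drifted(self):
        # Only alert once the window is full, to avoid noisy early readings.
        return (len(self.window) == self.window.maxlen
                and self.accuracy() < self.threshold)

monitor = AccuracyMonitor(window_size=10, threshold=0.8)
for predicted, actual in [(1, 1)] * 9 + [(0, 1)]:  # 9 correct, 1 wrong
    monitor.record(predicted, actual)
print(monitor.accuracy())  # 0.9, still above the 0.8 threshold
print(monitor.drifted())   # False
```

In a streaming deployment this check would run continuously as labels arrive, so retraining decisions are driven by measured degradation instead of a fixed schedule.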

Shivaram Venkataraman is currently a post-doctoral researcher at Microsoft Research, Redmond and starting in Fall 2018, an assistant professor in Computer Science at the University of Wisconsin, Madison. He received his PhD at the University of California, Berkeley, where he was advised by Mike Franklin and Ion Stoica. His work spans distributed systems, operating systems and machine learning, and his recent research has looked at designing systems and algorithms for large scale data analysis.

Presentations

Accelerating Deep Learning on Apache Spark with Coarse Grained Scheduling Session

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. In this talk we propose a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.

Dean Wampler, Ph.D., is the VP of Fast Data Engineering at Lightbend, leading the development of Lightbend Fast Data Platform, a scalable, distributed stream data processing stack using Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly Media. He is a contributor to several open source projects, a frequent Strata speaker, and the co-organizer of several conferences around the world and several user groups in Chicago. Dean can be found embarrassing himself on Twitter as @deanwampler.

Presentations

Kafka Streaming Applications with Akka Streams and Kafka Streams Session

This talk uses two "microservice" streaming applications based on Kafka to compare and contrast using Akka Streams and Kafka Streams for data processing. I'll discuss the strengths and weaknesses of each tool for particular design needs, so you'll feel better informed when making choices. I'll also contrast them with Spark Streaming and Flink, including when to choose them instead.

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams Tutorial

This hands-on tutorial builds several streaming applications as "microservices" based on Kafka with Akka Streams and Kafka Streams for data processing. We'll assess the strengths and weaknesses of each tool for particular needs, so you'll feel better informed when choosing tools for your projects. We'll also contrast them with Spark Streaming and Flink, including when to choose them instead.

Andrew Wang is a software engineer on the HDFS team at Cloudera. Previously, he was a graduate student in the AMPLab at the University of California, Berkeley, advised by Ion Stoica, where he worked on research related to in-memory caching and quality of service. In his spare time, he enjoys going on bike rides, cooking, and playing guitar.

Presentations

What's new in Hadoop 3.0 Session

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Jiao (Jennie) Wang is a software engineer on the Big Data Technology team at Intel working in the area of big data analytics. She is engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.

Presentations

Automatic 3D MRI Knee Damage Classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage offers the advantage of quicker and more accurate diagnosis at the time of an MRI scan. We will talk about building this classification system with 3D convolutional neural networks using BigDL on Apache Spark.

Rachel Warren is a programmer, data analyst, adventurer, and aspiring data scientist. After spending a semester helping teach algorithms and software engineering in Africa, Rachel has returned to the Bay Area, where she is looking for work as a data scientist or programmer. Previously, Rachel worked as an analyst for both Pandora and the Political Science department at Wesleyan. She is currently interested in pursuing a more technical, algorithmic approach to data science and is particularly passionate about dynamic learning algorithms (ML) and text analysis. Rachel holds a BA in computer science from Wesleyan University, where she completed two senior projects: an application that uses machine learning and text analysis for the Computer Science department and a critical essay exploring the implications of machine learning on the analytic philosophy of language for the Philosophy department.

Presentations

Playing Well Together: Big Data beyond the JVM w/Spark & friends Session

This talk will explore the state of the current big data ecosystem and how best to work with it in non-JVM languages. Since the presenter works extensively on PySpark, much of the focus will be on Python + Spark, but the talk will also include interesting anecdotes about how this applies to other systems (including Kafka).

Jennifer Webb is vice president of development and operations at SuprFanz. Jennifer has over 10 years' experience as a website and application developer for large and small companies, including major banks, and as a keyboardist in rock bands in Toronto, Calgary, and Vancouver.

Presentations

Data science in practice: Examining events in social media Media and Advertising

Ray Bernard and Jennifer Webb explain how cloud-based marketing company SuprFanz uses data science techniques and graph theory with Neo4j to generate live event attendance from social media platforms, email, and SMS.

Brooke Wenig is a consultant for Databricks and a teaching associate at UCLA, where she has taught graduate machine learning, senior software engineering, and introductory programming courses. Previously, Brooke worked at Splunk and Under Armour as a KPCB fellow. She holds an MS in computer science with highest honors from UCLA with a focus on distributed machine learning. Brooke speaks Mandarin Chinese fluently and enjoys cycling.

Presentations

Apache Spark programming 2-Day Training

Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.

Jonathon Whitton serves as director of data services for PRGX, a global account services company that helps clients better manage and leverage their AP and supplier data. He has over 20 years of experience in technology, specializing in big data, Hadoop, process transformation, migration, and business analysis. Jonathon currently leads a big data initiative that has resulted in 10 times faster processing of clients' data, lowering storage costs and increasing data availability for business partners in the Recovery Audit, Profit Optimization, Fraud Prevention, Healthcare, and Oil & Gas business lines. Prior to working at PRGX, Jonathon passed the Series 7/63 exams and was licensed in NY, NJ, and CT to provide insurance-related advice as a financial planner; he was also a top-rated technical instructor with ExecuTrain and served in the 1/75 Ranger Regiment. Jonathon holds an MBA from Kennesaw State University and a bachelor's degree from Duke University.

Presentations

Data wrangling for retail giants Session

PRGX is a global leader in Recovery Audit and Source-to-Pay (S2P) Analytics services, serving around 75% of the top 20 global retailers. During this session, PRGX will explain how it has adopted Trifacta and Cloudera to scale its current processes and increase revenue for the products and services it offers clients.

Josh Wills is a software engineer on Slack’s search, learning, and intelligence team. Previously, Josh built data teams, products, and infrastructure at Google and Cloudera. He is the founder and vice president of the Apache Crunch project for creating optimized MapReduce pipelines in Java and lead developer of Cloudera ML, a set of open source libraries and command-line tools for building machine learning models on Hadoop. Josh is a coauthor of Advanced Analytics with Spark. He is also known for his pithy definition of a data scientist as “someone who is better at software engineering than any statistician and better at statistics than any software engineer.”

Presentations

Data science at Slack Session

Josh Wills describes recent data science and machine learning projects at Slack.

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud services and data engineering. Previously, Jennifer was a product line manager at VMware, where she worked on the vSphere and Photon system management platforms.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, Eugene Fratkin, and Jennifer Wu lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Vincent Xie is a software engineer at Intel, where he works on machine learning- and big data-related domains. He holds a master’s degree in engineering from Shanghai Jiaotong University.

Presentations

Spark ML optimization at Intel: A case study Session

Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on in Spark ML and introduce the methodology behind Intel's Spark ML optimization work.

Ya Xu is principal staff engineer and statistician at LinkedIn, where she leads a team of engineers and data scientists building a world-class online A/B testing platform. She also spearheads taking LinkedIn’s A/B testing culture to the next level by evangelizing best practices and pushing for broad-based platform adoption. She holds a PhD in statistics from Stanford University.

Presentations

If you can’t measure it, you can’t improve it: How reporting and experimentation fuels product innovation at LinkedIn Session

Metrics measurement and experimentation play a crucial role in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data.

Dr. Yu Xu is the founder and CEO of GraphSQL, the world's first Native Parallel Graph database. He received his PhD in computer science and engineering from the University of California, San Diego. He is an expert in big data and parallel database systems and holds over 26 patents in parallel data management and optimization. Prior to founding GraphSQL, Dr. Xu worked on Twitter's data infrastructure for massive data analytics. Before that, he was Teradata's Hadoop architect, where he led the company's big data initiatives.

Presentations

Real Time Deep Link Analytics: the next stage of Graph Analytics Session

Graph databases are the fastest-growing category in all of data management. However, most graph queries traverse only two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. We present a real-world fraud detection system managing 100 billion graph elements to detect risk and fraudulent groups.

Chief Scientist at ShiftLeft; PhD, Institute of System Security, TU Braunschweig; security consultant and vulnerability researcher

  • CAST/GI Dissertation Award IT-Security – 2015/16
  • DIMVA Best Paper Award – 2016

Presentations

Code Property Graph: A modern, queryable data store for source code Session

While in earlier days code would generate data, with the code property graph (CPG) we now generate data for the code, so that we can understand it better.

Yi Yin is a software engineer on the data engineering team at Pinterest, working on Kafka-to-S3 persisting tools and schema generation of Pinterest’s data.

Presentations

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously Session

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.

Juan Yu is a software engineer at Cloudera working on the Impala project, where she helps customers investigate, troubleshoot, and resolve escalations and analyzes performance issues to identify bottlenecks, failure points, and security holes. Juan also implements enhancements in Impala to improve customer experience. Previously, Juan was a software engineer at Interactive Intelligence and held developer positions at Bluestreak, Gameloft, and Engenuity.

Presentations

How to use Impala query plan and profile to fix performance issues Tutorial

Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine and a fundamental component of the big data software stack. Juan Yu explores the cost model the Impala planner uses, how Impala optimizes queries, how to identify performance bottlenecks through query plans and profiles, and how to drive Impala to its full potential.

Ali is a data scientist in the Microsoft AI and Research organization, where he spends his day trying to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. Ali studied statistics at the University of Toronto, and computer science at Stanford University.

Presentations

Using R and Python for Scalable Data Science, Machine Learning, and AI Tutorial

Join us to learn how to do scalable, end-to-end data science in R and Python on single machines as well as on Spark clusters and cloud-based infrastructure. You'll be assigned an individual virtual machine with all contents preloaded and software installed and use it to gain experience building and operationalizing machine learning models using distributed functions in both R and Python.

Ye Zhou is a software engineer on LinkedIn's Hadoop infrastructure development team, where he mostly focuses on Hadoop YARN- and Spark-related projects. Ye holds a master's degree in computer science from Carnegie Mellon University.

Presentations

Spark for everyone: Self-service monitoring and tuning Session

Spark applications need to be well tuned so that individual applications run quickly and reliably, and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

Angela Zutavern is a leading expert on machine intelligence and coauthor, with Josh Sullivan, of The Mathematical Corporation: Where Machine Intelligence and Human Ingenuity Achieve the Impossible (PublicAffairs; June 6, 2017). They have radically transformed how a wide array of Fortune 500 companies, nonprofits, and major government agencies approach and use data.

Zutavern pioneered the application of machine intelligence to organizational leadership and strategy. She has worked with clients in every major U.S. cabinet-level department, advised many Fortune 500 companies and led teams across every major industry. She also helped create the Data Science Bowl—a first-of-its-kind, world-class competition that solves global issues through machine intelligence—and is an enthusiastic champion of women in data science.

Presentations

The Mathematical Corporation: A New Leadership Mindset for the Machine Intelligence Era Session

How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. In this talk, Angela Zutavern will share insights from her work with pioneering companies, government agencies and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.”