Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Speakers

Experts and innovators from around the world share their insights and best practices. New speakers are added regularly. Please check back to see the latest updates to the agenda.

William Chambers is a product manager at Databricks, where he works on Structured Streaming and data science products. He is lead author of Spark: The Definitive Guide, coauthored with Matei Zaharia. Bill also created SparkTutorials.net as a way to teach Apache Spark basics. Bill holds a master’s degree in information management and systems from UC Berkeley’s School of Information. During his time at school, Bill also created the Data Analysis in Python with pandas course for Udemy and was cocreator of and first instructor for Python for Data Science, part of UC Berkeley’s Master of Data Science program.

Presentations

Streaming big data in the cloud: What to consider and why Session

William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud.

Mohamed AbdelHady is a senior data scientist on the algorithms and data science (ADS) team within the AI+R Group at Microsoft, where he focuses on machine learning applications for text analytics and natural language processing. Mohamed works with Microsoft product teams and external customers to deliver advanced technologies that extract useful and actionable insights from unstructured free text such as search queries, social network messages, product reviews, customer feedback. Previously, he spent three years at Microsoft Research’s Advanced Technology Labs. He holds a PhD in machine learning from the University of Ulm in Germany.

Presentations

Deep learning for domain-specific entity extraction from unstructured text Session

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.

Vijay Srinivas Agneeswaran is director of technology at SapientNitro. Vijay has spent the last 10 years creating intellectual property and building products in the big data area at Oracle, Cognizant, and Impetus, including building PMML support into Spark/Storm and implementing several machine learning algorithms, such as LDA and random forests, over Spark. He also led a team that built a big data governance product for role-based, fine-grained access control inside of Hadoop YARN and built the first distributed deep learning framework on Spark. Earlier in his career, Vijay was a postdoctoral research fellow at the LSIR Labs within the Swiss Federal Institute of Technology, Lausanne (EPFL). He is a senior member of the IEEE and a professional member of the ACM. He holds four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, cloud, grid, peer-to-peer computing, machine learning for big data, and other emerging technologies. Vijay holds a bachelor’s degree in computer science and engineering from SVCE, Madras University, an MS (by research) from IIT Madras, and a PhD from IIT Madras.

Presentations

Achieving GDPR compliance and data privacy using blockchain technology Session

Ajay Mothukuri, Arunkumar Ramanatha, and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR).

Deep learning-based search and recommendation systems using TensorFlow Tutorial

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.

John Mark Agosta leads a team that is expanding the machine learning and artificial intelligence capabilities of Azure at Microsoft. Previously, John worked with startups and labs in the Bay Area, including “The Connected Car 2025” at Toyota ITC, peer-to-peer malware detection at Intel, and automated planning at SRI. His dedication to probability and AI led him to found an annual applications workshop for the Uncertainty in AI conference. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.

Presentations

Distributed clinical models: Inference without sharing patient data Session

Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Daniel Rubin outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset.

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Ritesh Agrawal leads the intelligent infrastructure systems team at Uber, which focuses on scaling data infrastructure for Uber’s growing business needs, both now and in the foreseeable future. A leading data scientist in infrastructure optimization, Ritesh previously specialized in predictive and ranking models at Netflix, AT&T Labs, and Yellow Pages, where he built scalable machine learning infrastructure with technologies such as Docker, Hadoop, and Spark. He holds a PhD in environmental earth science from Pennsylvania State University, where his thesis focused on computational tools and technologies such as concept map ontologies.

Presentations

Presto query gate: Identifying and stopping rogue queries Session

Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money.

Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads the internal technical infrastructure data processing teams responsible for MillWheel and Flume. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging of the two. He is the author of the 2015 “Dataflow Model” paper and the “Streaming 101” and “Streaming 102” blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Foundations of streaming SQL; or, How I learned to love stream and table theory Session

What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.

Presentations

Improving the customer experience via clickthrough analytics Media and Advertising

In this talk, we briefly explore some of the technologies and methodologies we can use to gain insight into the customer experience on the platform, understand which content works better than others, and personalize content to enhance the customer experience.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time systems with Spark Streaming and Kafka 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Michael Armbrust is the lead developer of the Spark SQL and Structured Streaming projects at Databricks. Michael’s interests broadly include distributed systems, large-scale structured storage, and query optimization. Michael holds a PhD from UC Berkeley, where his thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence.

Presentations

Streaming big data in the cloud: What to consider and why Session

William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud.

Shivnath Babu is an associate professor of computer science at Duke University, where his research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. He is also the chief scientist at Unravel Data Systems, the company he cofounded to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has received a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award. He has given talks and distinguished lectures at many research conferences and universities worldwide. Shivnath has also spoken at industry conferences, such as the Hadoop Summit.

Presentations

Using machine learning to simplify Kafka operations Session

Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Sumit Jindal explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka.

Dorna Bandari is the director of algorithms at AI-driven prediction platform Jetlore, where she leads development of large-scale machine learning models and machine learning infrastructure. Previously, she was a lead data scientist at Pinterest and the founder of ML startup Penda. Dorna holds a PhD in electrical engineering from UCLA.

Presentations

Building a flexible ML pipeline at a B2B AI startup Session

Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth.

Burcu Baran is a senior data scientist on LinkedIn’s analytics data mining team. Burcu is passionate about bringing mathematical solutions to business problems using machine learning techniques. Previously, she worked on predictive modeling at a B2B business intelligence company and was a postdoc in the mathematics departments at both Stanford and the University of Michigan. Burcu holds a PhD in number theory.

Presentations

Big data analytics and machine learning techniques to drive and grow business Tutorial

Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.

Roger Barga is general manager and director of development at Amazon Web Services, where he is responsible for Kinesis data streaming services. Previously, Roger was in the Cloud Machine Learning Group at Microsoft, where he was responsible for product management of the Azure Machine Learning service. Roger is also an affiliate professor at the University of Washington, where he is a lecturer in the Data Science and Machine Learning programs. Roger holds a PhD in computer science, has been granted over 30 patents, has published over 100 peer-reviewed technical papers and book chapters, and has authored a book on predictive analytics.

Presentations

Continuous machine learning over streaming data Session

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data, focusing on explainable machine learning, including anomaly detection with attribution, ability to reduce false positives through user feedback, and detection of anomalies in directed graphs.

James Bednar is a senior solutions architect at Anaconda. Previously, Jim was a lecturer and researcher in computational neuroscience at the University of Edinburgh, Scotland, and a software and hardware engineer at National Instruments. He manages the open source Python projects datashader, HoloViews, GeoViews, ImaGen, and Param. He has published more than 50 papers and books about the visual system, data visualization, and software development. Jim holds a PhD in computer science from the University of Texas as well as degrees in electrical engineering and philosophy.

Presentations

Custom interactive visualizations and dashboards for one billion datapoints on a laptop in 30 lines of Python Tutorial

Python lets you solve data science problems by stitching together packages from its ecosystem, but it can be difficult to choose packages that work well together. James Bednar and Philipp Rudiger walk you through a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints—all in just 30 lines of Python code.

Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services, where he focuses on AI and real-time streaming technologies and working with AWS customers to build data-driven products (whether batch or real time) and create solutions powered by ML in the cloud. Roy has worked in the data and analytics industry for over a decade and has helped hundreds of customers bring compelling data-driven products to the market. He serves on the advisory board of Applied Mathematics and Data Science at Post University in Connecticut. Roy holds a BSc in information systems and an MBA from the University of Georgia.

Presentations

The real-time journey from raw streaming data to AI-based analytics Session

Many domains, including mobile, web, the IoT, and ecommerce, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta shares a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution.

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer experience. Tim can frequently be found speaking at conferences in the United States and internationally. He is the copresenter of various O’Reilly training videos on topics ranging from Git to distributed systems and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at Timberglund.com, and is the cohost of the DevRel Radio Podcast. He lives in Littleton, Colorado, with the wife of his youth and their youngest child, the other two having mostly grown up.

Presentations

Stream processing with Kafka Tutorial

Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.

Ray Bernard is the founder and chief architect at SuprFanz.com. Previously, Ray was an adjunct professor at Columbia University and worked for technology giants like Compaq, Dell, and EMC. As leader of the Cosmic Blues Band, he performs regularly at the BB King Blues Club & Grill in New York City.

Presentations

Data science in practice: Examining events in social media Media and Advertising

Ray Bernard and Jennifer Webb explain how cloud-based marketing company SuprFanz uses data science techniques and graph theory with Neo4j to generate live event attendance from social media platforms, email, and SMS.

Brian Bloechle is an industrial mathematician, data scientist, and technical instructor at Cloudera.

Presentations

Data science and machine learning with Apache Spark 2-Day Training

Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.

Ron Bodkin is technical director for applied artificial intelligence at Google, where he helps Global Fortune 500 enterprises unlock strategic value with AI, acts as executive sponsor for Google product and engineering teams to deliver value from AI solutions, and leads strategic initiatives working with customers and partners. Previously, Ron was vice president and general manager of artificial intelligence at Teradata; the founding CEO of Think Big Analytics (acquired by Teradata in 2014), which provides end-to-end support for enterprise big data, including data science, data engineering, advisory and managed services, and frameworks such as Kylo for enterprise data lakes; vice president of engineering at Quantcast, where he led the data science and engineering teams that pioneered the use of Hadoop and NoSQL for batch and real-time decision making; founder of enterprise consulting firm New Aspects; and cofounder and CTO of B2B applications provider C-Bridge. Ron holds a BS in math and computer science with honors from McGill University and a master’s degree in computer science from MIT.

Presentations

Deploying deep learning with TensorFlow Tutorial

TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

Fidan Boylu Uz is a senior data scientist on the algorithms and data science team at Microsoft, where she is responsible for successful delivery of end-to-end advanced analytics solutions. Fidan has 10+ years of technical experience in machine learning and business intelligence and has worked on projects in multiple domains, such as predictive maintenance, fraud detection, mathematical optimization, and deep learning. She is a former professor at the University of Connecticut, where she conducted research and taught courses on machine learning theory and its business applications. She has authored a number of academic publications in the areas of machine learning and optimization. Fidan holds a PhD in decision sciences.

Presentations

Operationalize deep learning: How to deploy and consume your LSTM networks for predictive maintenance scenarios Session

Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance.

Joseph Bradley is a software engineer working on machine learning at Databricks. Joseph is an Apache Spark committer and PMC member. Previously, he was a postdoc at UC Berkeley. Joseph holds a PhD in machine learning from Carnegie Mellon University, where he focused on scalable learning for probabilistic graphical models, examining trade-offs between computation, statistical efficiency, and parallelization.

Presentations

Best practices for productionizing Apache Spark MLlib models Session

Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving.

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Kurt Brown leads the data platform team at Netflix, which architects and manages the technical infrastructure underpinning the company’s analytics, including various big data technologies like Hadoop, Spark, and Presto, Netflix open-sourced applications and services such as Genie and Lipstick, and traditional BI tools including Tableau and Redshift.

Presentations

20 Netflix-style principles and practices to get the most out of your data platform Session

Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.

Anne Buff is a business solutions manager and thought leader for SAS Best Practices, a thought leadership organization within the SAS Institute, where she leverages her training and consulting experience and her data savviness to lead best practices workshops and facilitate intrateam dialogues to help companies realize their full data and analytics potential. As a speaker and author, Anne specializes in analytic strategy and culture, governance, change management, and fostering data-driven organizations. She has been a specialist in the world of data and analytics for almost 20 years and has developed courseware for a wide range of technical concepts and software, including SAS Data Management.

Presentations

Progressive data governance for emerging technologies Session

Emerging technologies such as the IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize the potential of these technologies, businesses must radically shift their approach to governance. Anne Buff explains what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement.

Noah Burbank is a software engineer on Salesforce’s intelligence services team, where he focuses on the application of artificial intelligence to improve the quality of decisions that his customers can make every day in their businesses. He holds a PhD in decision and risk analysis from Stanford University, where his research simplified complex decision-making techniques for application in everyday life.

Presentations

Building a contacts graph from activity data Session

In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time to surface insights and recommendations using a graph modeling contextual data.

Yuri Bykov is director of data science at Dice.com, where he and his team leverage machine learning, NLP, big data, information retrieval, and other scientific disciplines to research and build innovative data products and services that help tech professionals manage their careers. Yuri started his career as a software developer, moving into BI and data analytics before finding his passion in data science. He holds an MBA and MIS from the University of Iowa.

Presentations

Building career advisory tools for the tech sector using machine learning Session

Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production.

Henry Cai is a software engineer on the data engineering team at Pinterest, where he designs large-scale big data infrastructures. Previously, he worked at LinkedIn. Henry is a maintainer of and contributor to many open source data ingestion systems, including Camus, Kafka, Gobblin, and Secor.

Presentations

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously Session

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.

James Campbell is a senior data scientist and researcher at the Laboratory for Analytical Sciences (LAS), a collaborative public/private research and development organization housed at NC State University. His current work focuses on measuring and enhancing analytic quality by weaving together traditional, human-centric, analytic processes with predictive, model-driven analytic tools. He is one of the core contributors to the Great Expectations project.

James has worked in government for more than a decade, leading significant data science tradecraft development efforts. He has managed multiple data science teams tackling a wide range of topics, including counterterrorism and information operations. His prior analytical experience includes strategic cyber threat intelligence research and economic analysis for litigation.

James earned his bachelor’s degree in math and philosophy from Yale and his master’s degree in security studies from Georgetown. James lives in Cary, North Carolina, with his wife, two daughters, and dog. He speaks Russian, enjoys running and cycling, and designs mathematical sculpture.

Presentations

Pipeline testing with Great Expectations Session

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.

Yishay Carmiel is the founder of IntelligentWire, a company that develops and implements industry-leading deep learning and AI technologies for automatic speech recognition (ASR), natural language processing (NLP), and advanced voice data extraction, and the head of Spoken Labs, the strategic artificial intelligence and machine learning research arm of Spoken Communications. Yishay and his teams are currently working on bleeding-edge innovations that make the real-time customer experience a reality—at scale. Yishay has nearly 20 years’ experience as an algorithm scientist and technology leader building large-scale machine learning algorithms and serving as a deep learning expert.

Presentations

Executive Briefing: The conversational AI revolution Session

One of the most important tasks of AI has been to understand humans. People want machines to understand not only what they say but also what they mean and to take particular actions based on that information. This goal is the essence of conversational AI. Yishay Carmiel explores the latest breakthroughs and revolutions in this field and the challenges still to come.

Michelle Casbon is director of data science at Qordoba. Michelle’s development experience spans more than a decade across various industries, including media, investment banking, healthcare, retail, and geospatial services. Previously, she was a senior data science engineer at Idibon, where she built tools for generating predictions on textual datasets. She loves working with open source projects and has contributed to Apache Spark and Apache Flume. Her writing has been featured in the AI section of O’Reilly Radar. Michelle holds a master’s degree from the University of Cambridge, focusing on NLP, speech recognition, speech synthesis, and machine translation.

Presentations

Continuous delivery for NLP on Kubernetes: Lessons learned Session

Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime.

Rachita Chandra is a solutions architect at IBM Watson Health, where she brings together end-to-end machine learning solutions in healthcare. She has experience implementing large-scale, distributed machine learning algorithms. Rachita holds both a master’s and bachelor’s degree in electrical and computer engineering from Carnegie Mellon.

Presentations

Transforming a machine learning prototype to a deployable solution leveraging Spark in healthcare Session

Rachita Chandra outlines challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment.

Diane Chang is a distinguished data scientist at Intuit, where she has worked on many interesting business problems that depend on machine learning, behavioral analysis, and risk prediction. Previously, Diane worked for a small “mathematical” consulting firm and a startup in the online advertising space and was a stay-at-home mom for six years. She holds a PhD in operations research from Stanford.

Presentations

Want to build a better chatbot? Start with your data. Session

When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Diane Chang shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices she's learned along the way.

Sugreev Chawla is a data scientist at Thorn, where he analyzes data from all of Thorn’s products to help law enforcement rescue victims of sexual exploitation faster. Sugreev has years of analytics experience in the field of experimental and computational fusion energy physics. Previously, he was a data scientist working with real-time sensor data for healthcare and defense applications. He holds a BA in physics and applied mathematics from UC Berkeley and is pursuing a PhD in engineering physics at UC San Diego.

Presentations

Fighting sex trafficking with data science Session

Sugreev Chawla offers an overview of Spotlight, a tool created by Thorn, a nonprofit that uses technology to fight online child sexual exploitation. Spotlight allows law enforcement to process millions of escort ads per month in an effort to fight sex trafficking, using graph analysis, time series analysis, and NLP techniques to surface important networks of ads and characterize their behavior over time.

Anny (Yunzhu) Chen is a senior data scientist at Uber, where she works on time series anomaly detection and forecasting. She is interested in applying statistical and machine learning models to real business problems.

Previously, Anny was a data scientist at Adobe, where she worked on digital attribution modeling for customer conversion data. She holds an MS in statistics from Stanford University (2013) and a BS in probability and statistics from Peking University (2011). She is passionate about applying statistics to real datasets and about big data technology.

Presentations

Detecting time series anomalies at Uber scale with recurrent neural networks Session

Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.

April Chen is a lead data scientist on the R&D team at Civis Analytics, where she develops software to automate statistical modeling workflows to help organizations from Fortune 500 companies to nonprofits understand and leverage their data. April’s background is in economics. Previously, she worked as an analytics consultant.

Presentations

Show me the money: Understanding causality for ad attribution Media and Advertising

Which of your ad campaigns leads to the most sales? In the absence of A/B testing, marketers often turn to simple touch attribution models. April Chen and John Davis detail the shortcomings of these models and propose a new approach that uses matching methods from causal inference to more accurately measure marketing effectiveness.

Shuyi Chen is a senior software engineer at Uber working on building scalable real-time data solutions. He built Uber’s real-time complex event processing platform for the marketplace, which powers 100+ production real-time use cases. Currently, he is the tech lead of Uber’s stream processing platform team. Shuyi has years of experience in storage infrastructure, data infrastructure, and Android and iOS development at both Google and Uber.

Presentations

Streaming SQL to unify batch and stream processing: Theory and practice with Apache Flink at Uber Session

Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink, covering fundamental concepts such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.

Wei Ting Chen is a senior software engineer in Intel’s Software Service Group, where he works on big data on cloud solutions. One of his responsibilities is helping customers integrate big data solutions into their cloud infrastructure. Wei Ting is a contributor to the OpenStack Sahara project.

Presentations

Spark on Kubernetes: A case study from JD.com Session

Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.

Nic Chikhani leads product for multiple areas at Weight Watchers, including data and analytics, CRM, pricing and billing, and AI.

Presentations

How Weight Watchers embraced modern data practices during its transformation from a legacy IT shop to a modern technology organization Session

For organizations mired in legacy infrastructure, the path to AI and deep learning can seem impossible. Michael Lysaght, Steven Levine, and Nicolas Chikhani discuss Weight Watchers's transition from a traditional BI organization to one that uses data effectively, covering the company's needs, the changes that were required, and the technologies and architecture used to achieve its goals.

Pramit Choudhary is a lead data scientist at DataScience.com, where he focuses on optimizing and applying classical machine learning and Bayesian design strategy to solve real-world problems. Currently, he is leading initiatives on figuring out better ways to explain a model’s learned decision policies to reduce the chaos in building effective models and close the gap between a prototype and operationalized model.

Presentations

Human in the loop: Bayesian rules enabling explainable AI Session

Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if…else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation.

Michael Chui is a San Francisco-based partner in the McKinsey Global Institute, where he directs research on the impact of disruptive technologies, such as big data, social media, and the internet of things, on business and the economy. Previously, as a McKinsey consultant, Michael served clients in the high-tech, media, and telecom industries on multiple topics. Prior to joining McKinsey, he was the first chief information officer of the City of Bloomington, Indiana, and was the founder and executive director of HoosierNet, a regional internet service provider. Michael is a frequent speaker at major global conferences and his research has been cited in leading publications around the world. He holds a BS in symbolic systems from Stanford University and a PhD in computer science and cognitive science and an MS in computer science, both from Indiana University.

Presentations

Executive Briefing: Artificial intelligence—The next digital frontier? Session

After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge.

Garner Chung is the engineering manager of the human computation team and the data science team supporting core product, growth, and infrastructure at Pinterest. Previously, he managed the data science team at Opower, where he drove efforts to research and productionize predictive models for all of product and engineering. Many years ago, he studied film at UC Berkeley, where he learned to deconstruct and complicate misleadingly simple narratives. Over the course of his 20 years in the tech industry, he has witnessed exuberance over technology’s great promise ebb and flow, all the while remaining steadfast in his gratitude for having played some small part. As a leader, Garner has learned to drive teams that privilege responsibility and end-to-end ownership over arbitrary commitments.

Presentations

Humans versus the machines: Using human-based computation to improve machine learning Session

Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgment reproducibility and explain how Pinterest integrated end-user-sourced judgments into its in-house platform.

Eric Colson is chief algorithms officer at Stitch Fix, where he leads a team of 80+ data scientists and is responsible for the multitude of algorithms that are pervasive to nearly every function of the company, from merchandise, inventory, and marketing to forecasting and demand, operations, and the styling recommender system. He’s also an advisor to several big data startups. Previously, Eric was vice president of data science and engineering at Netflix. He holds a BA in economics from SFSU, an MS in information systems from GGU, and an MS in management science and engineering from Stanford.

Presentations

Differentiating via data science Keynote

While companies often use data science as a supportive function, the emergence of new business models has made it possible for some companies to differentiate via data science. Eric Colson explores what it means to differentiate via data science and explains why companies must now think very differently about the role and placement of data science in the organization.

Mike Conover is an AI engineer at SkipFlag, where he builds machine learning technologies that leverage the behavior and relationships of hundreds of millions of people. Previously, Mike led news relevance research and development at LinkedIn. His work has appeared in the New York Times, the Wall Street Journal, and on National Public Radio. Mike holds a PhD in complex systems analysis with a focus on information propagation in large-scale social networks.

Presentations

Fast and effective natural language understanding Session

Mike Conover offers an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis.

Ian Cook is a data scientist at Cloudera and the author of several R packages including implyr. Previously, Ian was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

sparklyr, implyr, and more: dplyr interfaces to large-scale data Session

The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.

Dan Crankshaw is a PhD student in the CS Department at UC Berkeley, where he works in the RISELab. After cutting his teeth doing large-scale data analysis on cosmology simulation data and building systems for distributed graph analysis, Dan has turned his attention to machine learning systems. His current research interests include systems and techniques for serving and deploying machine learning, with a particular emphasis on low-latency and interactive applications.

Presentations

Deploying and monitoring interactive machine learning applications with Clipper Session

Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Opening remarks Tutorial

Strata Data Conference program chair Alistair Croll welcomes you to the Data Case Studies tutorial.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Mauro Damo is a senior data scientist at Dell, where he helps organizations identify, develop, and implement analytical solutions in big data environments, focusing on solving business problems. He has developed and implemented a wide range of analytical projects for companies in a number of industries, including mortgage insurance, financial brokerage, cable, nongovernmental organizations, healthcare, and supply chain. He has experience with a wide range of supervised and unsupervised techniques, including time series, graph analysis, optimization models, deep learning models such as convolutional neural networks, clustering, dimensionality reduction, tree algorithms, frequent pattern mining, ensemble models, Markov chains, and gradient descent. Mauro holds patents, has authored several papers, and speaks at conferences, seminars, and classes. His main programming languages are R, Python, and SQL. He holds an MS in business, an MBA in finance, an undergraduate degree in business, and an associate’s degree in computer science.

Presentations

Bladder cancer diagnosis using deep learning Session

Image recognition classification of diseases can minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identifying bladder cancer in patients using unsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive.

John Davis is a data scientist on the R&D team at Civis Analytics, where he spends his time writing tools that automate causal inference analyses. John holds a PhD in statistics from the University of Wisconsin-Madison, where he taught biostatistics to pre-med students and did research in mathematical statistics.

Presentations

Show me the money: Understanding causality for ad attribution Media and Advertising

Which of your ad campaigns leads to the most sales? In the absence of A/B testing, marketers often turn to simple touch attribution models. April Chen and John Davis detail the shortcomings of these models and propose a new approach that uses matching methods from causal inference to more accurately measure marketing effectiveness.

Rahim Daya is head of search products at Pinterest. Previously, he led search and recommendation product teams at LinkedIn and Groupon.

Presentations

Personalization at scale: Mastering the challenges of personalization to create compelling user experiences Session

Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user.

Danielle Dean is a principal data scientist lead at Microsoft in the Algorithms and Data Science Group within the Artificial Intelligence and Research Division, where she leads a team of data scientists and engineers building predictive analytics and machine learning solutions with external companies utilizing Microsoft’s Cloud AI Platform. Previously, she was a data scientist at Nokia, where she produced business value and insights from big data through data mining and statistical modeling on data-driven projects that impacted a range of businesses, products, and initiatives. Danielle holds a PhD in quantitative psychology from the University of North Carolina at Chapel Hill, where she studied the application of multilevel event history models to understand the timing and processes leading to events between dyads within social networks.

Presentations

How does a big data professional get started with AI? Session

Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI.

Jeff Dean joined Google in 1999 and is currently a Google Senior Fellow in Google’s Research Group, where he co-founded and leads the Google Brain team, Google’s deep learning and artificial intelligence research team. He and his collaborators are working on systems for speech recognition, computer vision, language understanding, and various other machine learning tasks. He has co-designed/implemented many generations of Google’s crawling, indexing, and query serving systems, and co-designed/implemented major pieces of Google’s initial advertising and AdSense for Content systems. He is also a co-designer and co-implementor of Google’s distributed computing infrastructure, including the MapReduce, BigTable and Spanner systems, protocol buffers, the open-source TensorFlow system for machine learning, and a variety of internal and external libraries and developer tools.

Jeff received a PhD in computer science from the University of Washington in 1996, working with Craig Chambers on whole-program optimization techniques for object-oriented languages, and a BS in computer science and economics from the University of Minnesota in 1990. He is a member of the National Academy of Engineering and the American Academy of Arts and Sciences, a fellow of the Association for Computing Machinery (ACM) and the American Association for the Advancement of Science (AAAS), and a winner of the ACM Prize in Computing.

Presentations

Deep learning for tackling important problems Keynote

Keynote with Jeff Dean

Anirban Deb is a data science manager at Uber. A seasoned data science and analytics leader, Anirban has extensive experience building and managing high-performing teams to support strategic decision making, business analytics, marketing analytics, product analytics, predictive modeling, reporting, and executive communication.

Presentations

Presto query gate: Identifying and stopping rogue queries Session

Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money.

Alex Deng is a principal data scientist manager on Microsoft’s analysis and experimentation team, where he and his team work on methodological improvements to the experimentation platform as well as related engineering challenges. Alex has published his work in conference proceedings such as KDD, WWW, and WSDM and in statistical journals. He colectured a tutorial on A/B testing at JSM 2015. Alex holds a PhD in statistics from Stanford University and a BS in mathematics from Zhejiang University.

Presentations

A/B testing at scale: Accelerating software innovation Tutorial

Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Pavel Dmitriev, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.

Matt Derda is a customer success manager at Trifacta. Previously, Matt was a CPFR (collaborative planning, forecasting, and replenishment) analyst at PepsiCo, where he worked with Trifacta to accelerate the preparation of customer supply chain data to more accurately and quickly forecast sales.

Presentations

Data wrangling for retail giants Session

PRGX is a global leader in recovery audit and source-to-pay (S2P) analytics services, serving around 75% of the top 20 global retailers. Matt Derda and Jonathon Whitton explain how PRGX uses Trifacta and Cloudera to scale current processes and increase revenue for the products and services it offers clients.

Wei Di is a staff member on LinkedIn’s business analytics data mining team. She is passionate about creating smart and scalable solutions that can impact millions of individuals and empower successful businesses. Her interests span artificial intelligence, machine learning, and computer vision. Previously, Wei worked with eBay Human Language Technology and eBay Research Labs, where she focused on large-scale image understanding and joint learning from visual and text information, and at Ancestry.com in the areas of record linkage and search relevance. Wei holds a PhD from Purdue University.

Presentations

Big data analytics and machine learning techniques to drive and grow business Tutorial

Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.

Ding Ding is a senior software engineer on Intel’s big data technology team, where she works on developing and optimizing distributed machine learning and deep learning algorithms on Apache Spark, focusing particularly on large-scale analytical applications and infrastructure on Spark.

Presentations

Accelerating deep learning on Apache Spark with coarse-grained scheduling Session

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman, Sergey Ermolin, and Ding Ding outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky outlines the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Mike Driscoll is the founder and CEO of Metamarkets. Previously, Mike spent more than a decade focused on making the most of data to help companies grow and developed data analytics solutions for online retail, life sciences, digital media, insurance, and banking. He also successfully founded and sold two companies: Dataspora, a life science analytics company, and CustomInk, an early pioneer in customized apparel. Mike began his career as a software engineer for the Human Genome Project. He holds an AB in government from Harvard and a PhD in bioinformatics from Boston University.

Presentations

Human eyes on AI Session

There’s a make-or-break step ahead for AI development. AI tools shouldn’t be designed to replace humans; they should be built with them in mind. We need to focus on translating data from machine learning models into beautiful, intuitive visuals. Mike Driscoll shares advice for creators of next-gen predictive algorithms from his experience turning big data into interactive visualizations.

Ted Dunning is chief applications architect at MapR Technologies. He’s also a board member for the Apache Software Foundation, a PMC member and committer of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects, and a mentor for various incubator projects. Ted contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library. He also designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date. He holds a PhD in computing science from the University of Sheffield. He is on Twitter as @ted_dunning.

Presentations

Better machine learning logistics with the rendezvous architecture Session

Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot, zero-latency rollout and rollback of new models and supports extensive metrics and diagnostics, so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime.

Zoran Dzunic is a data scientist on the algorithms and data science (ADS) team within the AI+R Group at Microsoft, where he focuses on machine learning applications for text analytics and natural language processing. He holds a PhD and a master’s degree from MIT, where he focused on Bayesian probabilistic inference, and a bachelor’s degree from the University of Nis in Serbia.

Presentations

Deep learning for domain-specific entity extraction from unstructured text Session

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.

Nick Elprin is the CEO and cofounder of Domino Data Lab, a data science platform that enterprises use to accelerate research and more rapidly integrate predictive models into their business. Nick has over a decade of experience working with quantitative researchers and data scientists, stemming from his time as a senior technologist at Bridgewater Associates, where his team designed and built the firm’s next-generation research platform.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Sergey Ermolin is a technical program manager for deep learning, Spark analytics, and big data technologies at Intel. A Silicon Valley veteran with a passion for machine learning and artificial intelligence, Sergey has been interested in neural networks since 1996, when he used them to predict aging behavior of quartz crystals and cesium atomic clocks made by Hewlett-Packard. Sergey holds an MSEE and a certificate in mining massive datasets from Stanford and BS degrees in both physics and mechanical engineering from California State University, Sacramento.

Presentations

Accelerating deep learning on Apache Spark with coarse-grained scheduling Session

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman, Sergey Ermolin, and Ding Ding outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.

Improving user-merchant propensity modeling using Neural Collaborative Filtering and Wide-and-Deep models on Spark BigDL at scale Session

Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL Wide-and-Deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by a classical MLlib alternating least squares (ALS) approach.

Lenny Evans is a data scientist at Uber focused on the applications of unsupervised methods and deep learning to fraud prevention, specifically developing anomaly detection models to prevent account takeovers and computer vision models for verifying possession of credit cards.

Presentations

Using computer vision to combat stolen credit card fraud Session

Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities.

Both as Founder of Harris Data Consulting and COO of BrightHive, Ms. Evans Harris has dedicated over 16 years to driving the strategic use of data to answer some of our nation’s toughest questions and driving organizational success; Working with a broad network of academic institutions, data science organization, application developers, and foundations to increase the use of accessible data standards, APIs, and ethical algorithms in scaling data science efforts that directly benefit people receiving social services. Most recently bringing together Bloomberg, Data For Democracy and BrightHive to lead the development of a Data Science Code of Ethics through the Community-driven Principles for Ethical Data Sharing (CPEDS) Initiative.

Previously, as a senior policy advisor to the US chief technology officer in the Obama administration, Ms. Evans Harris founded the Data Cabinet, a federal data science community of practice with over 200 active members across more than 40 federal agencies. She co-led a cohort of federal, nonprofit, and for-profit organizations to develop data-driven tools through the Opportunity Project and established the Open Skills Community through the Workforce Data Initiative.

She also led an analytics development center for the National Security Agency (NSA) that served as the foundation for the agency’s enterprise data science development program and became a model for other intelligence community agencies. With experience on both the offensive and defensive sides of the mission, she served as a project manager, operations lead, and organizational manager. Her achievements resulted in her being NSA’s sole selection to spend a year on Capitol Hill as a Brookings Legislative Fellow. As a member of Senator Cory A. Booker’s (NJ) legislative team, she spent a year focused on cyber and governmental affairs issues, serving as his lead technical and policy advisor on bills such as the Cyber Intelligence Sharing and Protection Act (CISPA).

She holds a master’s degree in public administration from George Washington University and BS degrees in computer science and sociology from the University of Maryland Eastern Shore.

Presentations

Keynote with Natalie Evans Harris Keynote

Natalie Evans Harris, COO and VP of Ecosystem Development, BrightHive, Inc.

Stephan Ewen is one of the originators and committers of the Apache Flink project and CTO at data Artisans, where he leads the development of large-scale data stream processing technology. He is also a PMC member of Apache Beam, a project to create a unified abstraction for batch and stream data processing. He coauthored the Stratosphere system and has worked on data processing technologies at IBM and Microsoft. Stephan holds a PhD from the Berlin University of Technology.

Presentations

Unified and elastic batch and stream processing with Pravega and Apache Flink Session

Stephan Ewen and Flavio Junqueira detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams) that offers an unprecedented way of handling “everything as a stream”: it provides unbounded stream storage, unifies the batch and streaming abstractions, and dynamically accommodates workload variations in a novel way.

Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Previously, Bin worked at Google building next-generation storage infrastructure, where he won Google’s Technical Infrastructure award. He holds a PhD in computer science from Carnegie Mellon University.

Presentations

Powering robotics clouds with Alluxio Session

Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.

Li Fan is the Senior Vice President of Engineering at Pinterest, where she leads the company’s technical direction and oversees a team of 400+ engineers building a visual discovery engine. Prior to joining Pinterest, she led image search at Google as the Senior Director of Engineering. From 2012 to 2014, she was Vice President of Engineering for Baidu, where she was responsible for product design and development at China’s largest search engine. Li began her career in software development and engineering management at Cisco and Ingrian Networks before joining Google in 2002. She holds a Master’s in Computer Science from the University of Wisconsin-Madison, and a BS in Computer Science from Fudan University in Shanghai.

Presentations

Keynote with Li Fan Keynote

Li Fan, Senior Vice President, Engineering, Pinterest.

Zhen Fan is a software development engineer at JD.com, where he focuses on machine learning platform development and management.

Presentations

Spark on Kubernetes: A case study from JD.com Session

Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.

Keno Fischer is CTO of Julia Computing, where he leads the company’s efforts in the compiler and developer tools space. Keno has been a core developer of the Julia Language for more than five years. Keno holds an AM in physics and an AB in physics, mathematics, and computer science from Harvard University.

Presentations

Cataloging the visible universe through Bayesian inference at petascale in Julia Session

Julia is rapidly becoming a popular language at the forefront of scientific discovery. Keno Fischer explores one of the most ambitious use cases for Julia: using machine learning to derive a catalog of astronomical objects from multiterabyte-scale astronomical image datasets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing.

Tom Fisher is CTO at MapR Technologies, where he helps enterprise customers take full advantage of MapR technology and leads initiatives to advance the company’s innovation agenda globally. Previously, Tom was a senior executive in engineering and operations at Oracle, where he supported the company’s top 40 cloud customers globally and served as senior vice president and CIO for global commercial cloud services focusing on improving service delivery through automation and direct action with customers; CIO and vice president of cloud computing at SuccessFactors (now SAP), where he ran cloud operations and emerging technologies in product engineering; CIO of CDMA technologies at Qualcomm; and vice president and acting CTO at eBay.

Presentations

Cloud, multicloud, and the data refinery Session

The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to the next generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations.

Wayde Fleener is senior manager of decision sciences at General Mills. A seasoned marketing strategy and analytics leader with experience combining disparate information sources to make the right business decisions, Wayde has run all aspects of a marketing analytics function, including managing methods, technology, and people. Over his career, he has led diverse teams helping over 100 Fortune 1000 companies across virtually every industry maximize their investments in marketing.

Presentations

Automating business insights through artificial intelligence Data Case Studies

Decision makers are busy. Businesses can hire people to analyze data for them, but most companies are resource constrained and can’t hire a small army to look through all their data. Wayde Fleener explains how General Mills implemented automation to enable decision makers to quickly focus on the metrics that matter and cut through everything else that does not.

Eugene Fratkin is a director of engineering at Cloudera leading cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, and Eugene Fratkin lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Michael J. Freedman is a professor in the Computer Science Department at Princeton University and the cofounder and CTO of TimescaleDB, which provides an open source time series database optimized for fast ingest and complex queries. His research broadly focuses on distributed systems, networking, and security. He developed and operates several self-managing systems, including CoralCDN (a decentralized content distribution network) and DONAR (a server resolution system that powered the FCC’s Consumer Broadband Test), both of which serve millions of users daily. Michael’s other research has included software-defined and service-centric networking, cloud storage and data management, untrusted cloud services, fault-tolerant distributed systems, virtual world systems, peer-to-peer systems, and various privacy-enhancing and anticensorship systems. Michael’s work on IP geolocation and intelligence led him to cofound Illuminics Systems, which was acquired by Quova (now part of Neustar). His work on programmable enterprise networking (Ethane) helped form the basis for the OpenFlow/software-defined networking (SDN) architecture. His honors include the Presidential Early Career Award for Scientists and Engineers (PECASE), a Sloan fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, DARPA Computer Science Study Group membership, and multiple award publications. Michael holds a PhD in computer science from NYU’s Courant Institute and both an SB and an MEng degree from MIT.

Presentations

TimescaleDB: Reengineering PostgreSQL as a time series database Session

Michael Freedman offers an overview of TimescaleDB, a new scale-out database designed for time series workloads that is nonetheless open source and engineered as a plugin to PostgreSQL. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries.

Siddha Ganju is a data scientist at Deep Vision, where she works on building deep learning models and software for embedded devices. Siddha is interested in problems that connect natural languages and computer vision using deep learning. Her work ranges from visual question answering to generative adversarial networks to gathering insights from CERN’s petabyte scale data and has been published at top tier conferences like CVPR. She is a frequent speaker at conferences and advises the Data Lab at NASA. Siddha holds a master’s degree in computational data science from Carnegie Mellon University, where she worked on multimodal deep learning-based question answering. When she’s not working, you might catch her hiking.

Presentations

Being smarter than dinosaurs: How NASA uses deep learning for planetary defense Session

Siddha Ganju explains how the FDL lab at NASA uses artificial intelligence to improve and automate the identification of meteors above human-level performance using meteor shower images and to recover known meteor shower streams and characterize previously unknown meteor showers using orbital data. This research aims to provide more warning time for long-period comet impacts.

Debasish Ghosh is principal software engineer at Lightbend. Passionate about technology and open source, he loves functional programming and has been trying to learn math and machine learning. Debasish is an occasional speaker at technology conferences worldwide, including QCon, Philly ETE, Code Mesh, Scala World, Functional Conf, and GOTO. He is the author of DSLs in Action and Functional & Reactive Domain Modeling. Debasish is a senior member of the ACM. He’s also a father, husband, avid reader, and Seinfeld fanboy who loves spending time with his beautiful family.

Presentations

Approximation data structures in streaming data processing Session

Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Streams are typically unbounded in space and time, so processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and how they can be used to implement solutions for fast data and streaming architectures.
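As a flavor of the sublinear-space summaries this session covers, here is a minimal Count-Min sketch in Python. This is an illustrative sketch, not code from the talk; the width, depth, and hash construction are arbitrary choices for the example.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts for an unbounded stream in fixed space.

    A depth x width table of counters replaces per-item counts. Estimates
    can only overcount (collisions add, never subtract); with width ~ e/eps
    and depth ~ ln(1/delta), the overcount is at most eps*N with
    probability 1 - delta, where N is the stream length."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One independent-ish hash per row, derived by salting with the row index.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.width

    def add(self, item):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += 1

    def estimate(self, item):
        # The minimum over rows is the least-inflated counter.
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))

cms = CountMinSketch()
for word in ["spark"] * 100 + ["flink"] * 3:
    cms.add(word)
print(cms.estimate("spark"))  # always >= 100; typically exactly 100
```

The same salted-hash pattern underlies related structures such as Bloom filters and HyperLogLog, which trade exactness for space in the same spirit.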

Noah Gift is consulting CTO and lecturer at UC Davis. An adaptable technical leader, entrepreneur, software developer, architect, and engineer with over 20 years of experience in leadership and engineering (including P&L responsibility), Noah has shipped more than 10 new products at multiple companies over the past eight years, generating millions of dollars of revenue at global scale. Previously, Noah helped build Sqor Sports from scratch, creating the company’s first product and hiring and managing all employees. He has also written production machine learning models in Python and R. Noah is the author of the forthcoming book Pragmatic AI: An Introduction to Cloud-Based Machine Learning as well as a number of articles.

Presentations

What is the relationship between social influence and the NBA? Media and Advertising

Noah Gift uses data science and machine learning to explore NBA team valuation and attendance as well as individual player performance. Questions include: What drives the valuation of teams (e.g., attendance, local real estate market)? Does winning bring more fans to games? Does salary correlate with social media performance?

Zachary Glassman studied physics and mathematics as an undergraduate at Pomona College before completing a master’s degree in atomic physics at the University of Maryland. Having realized he has a passion for building data tools and teaching others Python, he made the switch to data science. He is currently a data scientist in residence at the Data Incubator.

Presentations

Hands-on data science with Python 2-Day Training

Instructors from the Data Incubator demonstrate how to build intelligent business applications using machine learning, taking you through each step in developing a machine learning pipeline, from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend your knowledge by building two applications from real-world datasets.

Clare Gollnick is the CTO and chief data scientist at Terbium Labs, an information security startup based in Baltimore, Maryland. As a statistician and engineer, Clare designs the algorithms that direct Terbium’s automated crawl of the dark web and leads the crawler engineering team. Previously, Clare was a neuroscientist. Her academic publications focus on information processing within neural networks and validation of new statistical methods. Clare holds a PhD in biomedical engineering from Georgia Tech and a BS in bioengineering from UC Berkeley.

Presentations

The limits of inference: What data scientists can learn from the reproducibility crisis in science Session

At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project.

Abe Gong is CEO and cofounder at Superconductive Health. A seasoned entrepreneur, Abe has been leading teams using data and technology to solve problems in health care, consumer wellness, and public policy for over a decade. Previously, he was chief data officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe holds a PhD in public policy, political science, and complex systems from the University of Michigan. He speaks and writes regularly on data science, healthcare, and the internet of things.

Presentations

Pipeline testing with Great Expectations Session

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.

Matthew Granade is a cofounder of Domino Data Lab, which makes a workbench for data scientists to run, scale, share, and deploy analytical models, where he works with companies such as Quantopian, Premise, and Orbital Insights. He also invests in, advises, and serves on the boards of startups in data, data analysis, finance, and financial tech. Previously, Matthew was co-head of research at Bridgewater Associates, where he built and managed teams that ensured Bridgewater’s understanding of the global economy, created new systems for generating alpha, produced daily trading signals, and published Bridgewater’s market commentary, and an engagement manager at McKinsey & Company. He holds an undergraduate degree from Harvard University, where he was president of the Harvard Crimson, the university’s daily newspaper, and an MBA with highest honors from Harvard Business School.

Presentations

Managing data science at scale Session

Predictive analytics and artificial intelligence have become critical competitive capabilities, yet IT teams struggle to provide the support data science teams need to succeed. Matthew Granade explains how leading banks, insurance and pharmaceutical companies, and others manage data science at scale.

Adam Greenhall is a data scientist at Lyft.

Presentations

Simulation in a two-sided transportation marketplace Session

Adam Greenhall explains how Lyft uses simulation to test out new algorithms, help develop new features, and study the economics of ride-sharing markets as they grow.

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Dogfooding data at Lyft Session

Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving.

Goodman Xiaoyuan Gu is head of marketing data engineering at Atlassian, where he leads product strategy and engineering of marketing and growth data pipelines as well as customer acquisition and retention machine learning capabilities. Previously, he was vice president of technology at CPXi, director of engineering at Dell, and general manager at Amazon, where he built marketing and analytics applications. He has served on technical program committees of two IEEE flagship conferences and is the author of over a dozen academic publications in high-profile IEEE and ACM journals and conferences. Goodman holds a degree in engineering and management from MIT.

Presentations

Not your parents' machine learning: How to ship an XGBoost churn prediction app in under four weeks with Amazon SageMaker Session

Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker.

Sudipto Guha is principal scientist at Amazon Web Services, where he studies the design and implementation of a wide range of computational systems, from resource constrained devices, such as sensors, up through massively parallel and distributed systems. Using an algorithmic framework, Sudipto seeks to design systems that are correct, efficient, and optimized, despite their bidirectional asymptotic scale and seeming lack of similarity to human information processes. His recent work focuses on clustering and location theory, statistics and learning theory, database query optimization and mining, approximation algorithms for stochastic control, communication complexity, and data stream algorithms.

Presentations

Continuous machine learning over streaming data Session

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data, focusing on explainable machine learning, including anomaly detection with attribution, ability to reduce false positives through user feedback, and detection of anomalies in directed graphs.

Debraj GuhaThakurta is a senior data scientist lead in AI and Research (cloud data platform, algorithms, and data science) at Microsoft, where he focuses on developing the team data science process and on the use of different Microsoft data platforms and toolkits (Spark, SQL Server, ADL, Hadoop, deep learning toolkits, etc.) to create scalable and operationalized analytical processes. He has many years of experience in data science and machine learning applications, particularly in biomedical and forecasting domains, and has published more than 25 peer-reviewed papers, book chapters, and patents. Debraj holds a PhD in chemistry and biophysics.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

I’m a fourth-year PhD student in the Computer Science Department at the University of California, Los Angeles, where I’m advised by Miryung Kim. I obtained my undergraduate degree in computer science from the Lahore University of Management Sciences (LUMS) SBASSE in Pakistan, where I was mentored by Fareed Zaffar. Before that, I was a GCSE A Level student at Lahore Grammar School.
My research interests lie at the intersection of software engineering and big data systems. Specifically, I am interested in supporting interactive debugging in big data processing frameworks and providing efficient ways to perform automated fault localization in big data applications.

Presentations

Who are we? The largest scale study of professional data scientists Session

Even though we know that there are more data scientists in the workforce today, what those data scientists actually do, and what we even mean by “data scientist,” has not been studied quantitatively. This session presents a large-scale survey of 793 professional data scientists; the findings should inform managers on how to leverage data science capabilities effectively within their teams.

Alexandra Gunderson is a data scientist at Arundo Analytics. Her background is in mechanical engineering and applied numerical methods.

Presentations

Machine learning to tackle industrial data fusion Session

Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks and even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources.

Sijie Guo is the cofounder of Streamlio, a company focused on building a next-generation real-time data stack. Previously, he was the tech lead for the messaging group at Twitter, where he cocreated Apache DistributedLog, and worked on push notification infrastructure at Yahoo. He is the PMC chair of Apache BookKeeper.

Presentations

Modern real-time streaming architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Stream storage with Apache BookKeeper Session

Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning, so be sure to share your use case with him on Twitter.

Presentations

Getting started with TensorFlow Tutorial

Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from a conceptual overview all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry.

Jordan Hambleton is a solutions architect at Cloudera, where he partners with customers to build and manage scalable enterprise products on the Hadoop stack. Previously, Jordan was a member of technical staff at NetApp, where he designed and implemented the NRT operational data store that continually manages automated support for all of the company’s customers’ production systems.

Presentations

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark​ Session

When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.
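The restart behavior described above can be sketched with a toy model. No real Kafka client or API is used here; the partition log, offset store, and handler are in-memory stand-ins invented for illustration. The key idea is that committing an offset only after its record has been processed gives at-least-once delivery across failures.

```python
class OffsetStore:
    """Durable record of the last processed offset per partition (stand-in)."""
    def __init__(self):
        self._committed = {}

    def commit(self, partition, offset):
        self._committed[partition] = offset

    def last_committed(self, partition):
        return self._committed.get(partition, -1)

def process_partition(log, partition, store, handler, crash_at=None):
    """Resume from the last committed offset and commit only AFTER the
    handler succeeds. A crash before commit means the record is replayed
    on restart (at-least-once) instead of being lost."""
    start = store.last_committed(partition) + 1
    for offset in range(start, len(log)):
        if crash_at is not None and offset == crash_at:
            raise RuntimeError("simulated failure before commit")
        handler(log[offset])
        store.commit(partition, offset)

log = ["a", "b", "c", "d"]   # one partition's records
seen = []
store = OffsetStore()
try:
    process_partition(log, 0, store, seen.append, crash_at=2)
except RuntimeError:
    pass                     # crashed after committing offsets 0 and 1
process_partition(log, 0, store, seen.append)  # restart resumes at offset 2
print(seen)  # ['a', 'b', 'c', 'd']
```

Committing before processing would instead give at-most-once semantics: the crashed record would be skipped on restart. Which trade-off is right depends on whether the downstream sink can tolerate duplicates.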

Chris Harland is director of data engineering at augmented writing platform Textio. Over his career, Chris has worked in a wide variety of fields spanning elementary science education, cutting-edge biophysical research, and recommendation and personalization engines. Previously, he was a data scientist and machine learning engineer at Versive (formerly Context Relevant) and a data scientist at Microsoft working on problems in Bing search, Xbox, Windows, and MSN. Chris holds a PhD in physics from the University of Oregon. Every year he thinks, “This is the year I’m going to stop thinking SQL is the best query language ever,” and every year he’s wrong.

Presentations

Data products should be as simple as possible Session

The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product.

Patrick Harrison started and leads the data science team at S&P Global Market Intelligence (S&P MI), a business and financial intelligence firm and data provider, where the team employs a wide variety of data science tools and techniques, including machine learning, natural language processing, recommender systems, and graph analytics. Patrick is the coauthor of the forthcoming book Deep Learning with Text from O’Reilly Media, along with Matthew Honnibal, creator of spaCy, the industrial-strength natural language processing software library, and is a founding organizer of a machine learning conference in Charlottesville, Virginia. He is actively involved in building both regional and global data science communities. Patrick holds a BA in economics and an MS in systems engineering, both from the University of Virginia. His graduate research focused on complex systems and agent-based modeling.

Presentations

Word embeddings under the hood: How neural networks learn from language Session

Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Along the way, Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more.
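To make those concepts concrete, here is a toy NumPy sketch of one gradient step of skip-gram with negative sampling, one popular way such embeddings are trained. This is an illustrative simplification, not the session's code; the vocabulary size, dimensions, word indices, and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 10, 8
W_in = rng.normal(scale=0.1, size=(vocab, dim))   # the "hidden layer": one vector per word
W_out = rng.normal(scale=0.1, size=(vocab, dim))  # output-side context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.5):
    """One stochastic gradient step of skip-gram with negative sampling.

    Pushes the center word's vector toward its observed context word and
    away from randomly drawn negative words. Returns the loss before the
    update so progress can be tracked."""
    v = W_in[center]
    loss, grad_v = 0.0, np.zeros(dim)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]
        p = sigmoid(v @ u)                      # predicted P(word is a true context)
        loss += -np.log(p if label else 1.0 - p)
        g = p - label                           # loss gradient w.r.t. the score (backprop)
        grad_v += g * u
        W_out[word] -= lr * g * v
    W_in[center] -= lr * grad_v
    return loss

# Repeat the same (center, context) pair: the loss should shrink.
losses = [sgns_step(center=3, context=7, negatives=[1, 4]) for _ in range(20)]
print(losses[-1] < losses[0])  # True
```

The rows of `W_in` are the word vectors you keep after training; similar words end up with similar rows because they appear in similar contexts.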

Frances Haugen is a data product manager at Pinterest focusing on ranking content in the home feed and related pins and the challenges of driving immediate user engagement without harming the long-term health of the Pinterest content ecosystem. Previously, Frances worked at Google, where she founded the Google+ search team, built the first non-quality-based search experience at Google, and cofounded the Google Boston search team. She loves user-facing big data applications and finding ways to make mountains of information useful and delightful to the user. Frances was a member of the founding class of Olin College and holds a master’s degree from Harvard.

Presentations

Executive Briefing: Building effective heterogeneous data communities—Driving organizational outcomes with broad-based data science Session

Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge often don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights.

Or Herman-Saffar is a data scientist at Dell. She holds an MSc in biomedical engineering, for which her research focused on breast cancer detection using breath signals and machine learning algorithms, and a BS in biomedical engineering with a specialization in signal processing from Ben-Gurion University in Israel.

Presentations

AI-powered crime prediction Session

What if we could predict when and where next crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly published dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes to find interesting trends and make predictions for the future.

Szehon Ho is a staff software engineer on the analytics data storage team at Criteo, where he works on Criteo’s Hive platform. Previously, he was a software engineer on the Hive team at Cloudera. He was a committer and PMC member in the Apache Hive open source community, working on features like Hive on Spark and Hive monitoring and metrics, among others.

Presentations

Hive as a service Session

Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.

Bob Horton is a senior data scientist on Microsoft’s AI and Research Group deep partner engagement team, where he helps independent software vendors build and deploy machine learning solutions for their customers. Previously, he worked on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento. Bob currently holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and is currently spending a lot of his time writing a book, Stream Processing with Apache Flink.

Presentations

Streaming SQL to unify batch and stream processing: Theory and practice with Apache Flink at Uber Session

Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.

Simon Hughes is the chief data scientist at technology professional recruiting site Dice.com, where he develops multiple recommender engines for matching job seekers with jobs and optimizes the accuracy and relevancy of Dice.com’s job and candidates search. More recently, Simon has been instrumental in building the machine intelligence behind the Career Explorer portion of Dice’s website, which allows users to gauge their market value and explore potential career paths. Simon is a PhD candidate in machine learning and natural language processing at DePaul University, where he is researching machine learning approaches for determining causal relations in student essays, with the view to building more intelligent essay-grading software.

Presentations

Building career advisory tools for the tech sector using machine learning Session

Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production.

Alysa Z. Hutnik is a partner at Kelley Drye & Warren LLP in Washington, DC, where she delivers comprehensive expertise in all areas of privacy, data security, and advertising law. Alysa’s experience ranges from counseling to defending clients in FTC and state attorneys general investigations, consumer class actions, and commercial disputes. Much of her practice is focused on the digital and mobile space in particular, including the cloud, mobile payments, calling and texting practices, and big data-related services. Ranked as a leading practitioner in the privacy and data security area by Chambers USA, Chambers Global, and Law360, Alysa has received accolades for the dedicated and responsive service she provides to clients. The US Legal 500 notes that she provides “excellent, fast, efficient advice” regarding data privacy matters. In 2013, she was one of just three attorneys under 40 practicing in the area of privacy and consumer protection law to be recognized as a rising star by Law360.

Presentations

Executive Briefing: Legal best practices for making data work Session

Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.”

Mario Inchiosa is a principal software engineer at Microsoft, where he focuses on delivering parallelized, scalable advanced analytics integrated with the R language. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Vlad Ionescu is founder and chief architect of ShiftLeft. Vlad is the creator of the industry’s first open source lambda framework. Previously, he worked at Google and VMware as an infrastructure engineer. Vlad is the coauthor of RabbitMQ’s Erlang client.

Presentations

Code Property Graph: A modern, queryable data storage for source code Session

Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing it to be accessed rapidly.

Kinnary Jangla is a senior software engineer on the homefeed team at Pinterest, where she works as a backend engineer on machine learning infrastructure. Kinnary has worked in the industry for 10+ years. Previously, she worked on maps and international growth at Uber and on Bing search at Microsoft. Kinnary holds an MS in computer science from the University of Illinois and a BE from the University of Mumbai.

Presentations

Accelerating development velocity of production ML systems with Docker Session

Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.

Sumit Jindal is a senior engineer at Unravel Data Systems. An experienced data engineer who has developed big data solutions for the telecom, finance, and internet domains, Sumit likes working on the architecture, design, and implementation of scalable, parallel, distributed web-scale systems. He is a committer on Aerospike and has worked extensively with Kafka and NoSQL systems.

Presentations

Using machine learning to simplify Kafka operations Session

Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Sumit Jindal explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka.

Flavio Junqueira is senior director of software engineering at Dell EMC, where he leads the Pravega team. He is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, Flavio held an engineering position with Confluent and research positions with Yahoo Research and Microsoft Research. He is an active contributor to Apache projects, including Apache ZooKeeper (as PMC and committer), Apache BookKeeper (as PMC and committer), and Apache Kafka. Flavio coauthored the O’Reilly ZooKeeper book. He holds a PhD in computer science from the University of California, San Diego.

Presentations

Unified and elastic batch and stream processing with Pravega and Apache Flink Session

Stephan Ewen and Flavio Junqueira detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams) that offers an unprecedented way of handling “everything as a stream” that includes unbounded streaming storage and unified batch and streaming abstraction and dynamically accommodates workload variations in a novel way.

Tomer Kaftan is a second-year PhD student at the University of Washington, working with Magdalena Balazinska and Alvin Cheung. His research interests include machine learning systems, distributed systems, and query optimization. Previously, Tomer was a staff engineer in UC Berkeley’s AMPLab, working on systems for large-scale machine learning. He holds a degree in EECS from UC Berkeley. He is a recipient of an NSF Graduate Research Fellowship.

Presentations

Cuttlefish: Lightweight primitives for online tuning Session

Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Presentations

Semi-automated analytic pipeline creation and validation using active learning Session

Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Kandel discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines.

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work she enjoys playing with fire, riding scooters, and dancing.

Presentations

Playing well together: Big data beyond the JVM with Spark and friends Session

Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).

Brian Karfunkel is a data scientist at Pinterest. Previously, he was senior data analyst at the NYU Furman Center, where he worked on housing and urban policy issues, and a research fellow at Stanford Law School, where he helped research the effects of workplace safety and health policy.

Presentations

Trapped by the present: Estimating long-term impact from A/B experiments Session

When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. Brian Karfunkel explains how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at noncore users.

Sagar Kewalramani is an enterprise data architect at Meijer, where he leads efforts in building an enterprise data lake with Hadoop. He is primarily focused on building business use cases; high-volume real-time data ingestion, transformation, and movement; and data lineage and discovery but has also led the discovery and development of big data and machine learning applications to accelerate digital business and simplify data management and analytics. Sagar has wide experience in building data architectures integrating multiple systems using ETL tools, relational databases, and big data technologies and specializes in architecture design and administration roles for ETL tools like DataStage, Alteryx, and Talend; relational databases like Teradata and Oracle; and big data distributions like MapR and Hortonworks. He is part of the core organizing committee of Big Data Ignite, Michigan’s premier conference on big data, the IoT, and cloud computing, along with meetup groups in Grand Rapids, MI, where he’s a frequent speaker on Hadoop and big data.

Presentations

Architecting an open source enterprise data lake Session

With so many business intelligence tools in the Hadoop ecosystem and no common measure to identify the efficiency of each tool, where do you begin to build or modify your enterprise data lake strategy? Sagar Kewalramani shares real-world BI problems and how they were resolved with Hadoop tools and demonstrates how to build an effective data lake strategy with open source tools and components.

Miryung Kim is an associate professor in the Department of Computer Science at UCLA as well as the cofounder of MK.Collective. She builds automated software tools, including debuggers, testing tools, refactoring engines, and code analytics, for improving data scientist productivity and efficiency in developing big data analytics. She conducts empirical studies of professional software engineers and data scientists in the wild and uses the resulting insights to design novel software engineering tools.

Honors include an NSF CAREER award, a Microsoft Software Engineering Innovation Foundation Award, an IBM Jazz Innovation Award, a Google Faculty Research Award, an Okawa Foundation Research Grant Award, and an ACM SIGSOFT Distinguished Paper Award. Prior to joining UCLA, she was an assistant professor in the Department of Electrical and Computer Engineering at the University of Texas at Austin from 2009 to 2014. She also spent time as a visiting researcher in the Research in Software Engineering (RiSE) group at Microsoft Research. She received her BS in computer science from the Korea Advanced Institute of Science and Technology in 2001 and her MS and PhD in computer science and engineering from the University of Washington. She received the Korean Ministry of Education, Science, and Technology Award, the highest honor given to an undergraduate student in Korea.

Presentations

Who are we? The largest scale study of professional data scientists Session

Even though we know that there are more data scientists in the workforce today, what those data scientists actually do, and what we even mean by “data scientist,” has not been studied quantitatively. Miryung Kim presents the results of a large-scale survey of 793 professional data scientists, offering findings that should inform managers on how to leverage data science capability effectively within their teams.

Eugene Kirpichov is a staff software engineer on the Cloud Dataflow team at Google, where he works on the Apache Beam programming model and APIs. Previously, he worked on Cloud Dataflow’s autoscaling and straggler elimination techniques. Eugene is interested in programming language theory, data visualization, and machine learning.

Presentations

Radically modular data ingestion APIs in Apache Beam Session

Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive SplittableDoFn.

Ronny Kohavi is a Microsoft distinguished engineer and the general manager for the analysis and experimentation team within Microsoft’s Artificial Intelligence and Research Group. Previously, he was partner architect at Bing and founder of the experimentation platform team. Prior to Microsoft, he was the director of data mining and personalization at Amazon; the vice president of business intelligence at Blue Martini Software (acquired by Red Prairie); and manager of the MineSet project, Silicon Graphics’ award-winning product for data mining and visualization. Ronny was the general chair for KDD 2004, cochair of KDD 99’s industrial track with Jim Gray, and cochair of the KDD Cup 2000 with Carla Brodley and has been an invited or keynote speaker at a number of conferences around the world. His papers have over 34,000 citations; three of them are in the top 1,000 most-cited papers in computer science. In 2016, he was named the fifth-most-influential scholar in AI and the twenty-sixth most influential scholar in machine learning. Ronny holds a PhD in machine learning from Stanford University, where he led the MLC++ project (the machine learning library in C++ used in MineSet and at Blue Martini Software), and a BA from the Technion, Israel.

Presentations

A/B testing at scale: Accelerating software innovation Tutorial

Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Pavel Dmitriev, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.

Chi-Yi Kuan is director of business analytics at LinkedIn. He has over 15 years of extensive experience in applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.

Presentations

Big data analytics and machine learning techniques to drive and grow business Tutorial

Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.

Sanjeev Kulkarni is the cofounder of Streamlio, a company focused on building a next-generation real-time stack. Previously, he was the technical lead for real-time analytics at Twitter, where he cocreated Twitter Heron; worked at Locomatix handling the company’s engineering stack; and led several initiatives for the AdSense team at Google. Sanjeev holds an MS in computer science from the University of Wisconsin-Madison.

Presentations

Effectively once, exactly once, and more in Heron Session

Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and discuss how applications can benefit.

Modern real-time streaming architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Abhishek Kumar is a manager of data science in Sapient’s Bangalore office, where he looks after scaling up the data science practice by applying machine learning and deep learning techniques to domains such as retail, ecommerce, marketing, and operations. Abhishek is an experienced data science professional and technical team lead specializing in building and managing data products from conceptualization to deployment phase and interested in solving challenging machine learning problems. Previously, he worked in the R&D center for the largest power-generation company in India on various machine learning projects involving predictive modeling, forecasting, optimization, and anomaly detection and led the center’s data science team in the development and deployment of data science-related projects in several thermal and solar power plant sites. Abhishek is a technical writer and blogger as well as a Pluralsight author and has created several data science courses. He is also a regular speaker at various national and international conferences and universities. Abhishek holds a master’s degree in information and data science from the University of California, Berkeley.

Presentations

Deep learning-based search and recommendation systems using TensorFlow Tutorial

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.

Philip Langdale is the engineering lead for cloud at Cloudera. He joined the company as one of the first engineers building Cloudera Manager and served as an engineering lead for that project until moving to cloud products. Previously, Philip worked at VMware, developing various desktop virtualization technologies. Philip holds a bachelor’s degree with honors in electrical engineering from the University of Texas at Austin.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, and Eugene Fratkin lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Francesca Lazzeri is a data scientist at Microsoft, where she is part of the algorithms and data science team. Francesca is passionate about innovations in big data technologies and the applications of advanced analytics to real-world problems. Her work focuses on the deployment of machine learning algorithms and web service-based solutions to solve real business problems for customers in the energy, retail, and HR analytics sectors. Previously, she was a research fellow in business economics at Harvard Business School. She holds a PhD in innovation management.

Presentations

Operationalize deep learning: How to deploy and consume your LSTM networks for predictive maintenance scenarios Session

Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance.

Mike Lee Williams is director of research at Fast Forward Labs, an applied machine intelligence lab in New York City, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Fast Forward Labs’s clients understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.

Presentations

Interpretable machine learning products Session

Interpretable models result in more accurate, safer, and more profitable machine learning products. But interpretability can be hard to ensure. Michael Lee Williams explores the growing business case for interpretability and its concrete applications, including churn, finance, and healthcare. Along the way, Michael offers an overview of the open source, model-agnostic tool LIME.

Steven Levine is director of data engineering and principal architect at Weight Watchers. A delivery-focused, hands-on architect, Steve is known for developing and leading the delivery of reactive, pragmatic, on-time, on-budget, customer-focused solutions. He enjoys solving big-picture problems such as choosing architectures, cloud providers, programming languages, and frameworks. Steven is well versed in Scala, Java, and big data technologies. His previous experience includes building and deploying RESTful services to cloud providers like AWS, Cloud Foundry, and OpenStack.

Presentations

How Weight Watchers embraced modern data practices during its transformation from a legacy IT shop to a modern technology organization Session

For organizations stuck in a myriad of legacy infrastructure, the path to AI and deep learning seems impossible. Michael Lysaght, Steven Levine, and Nicolas Chikhani discuss Weight Watchers's transition from a traditional BI organization to one that uses data effectively, covering the company's needs, the changes that were required, and the technologies and architecture used to achieve its goals.

Michael Li is head of analytics at LinkedIn, where he helps define what big data means for LinkedIn’s business and how it can drive business value through the EOI analytics framework. Michael is passionate about solving complicated business problems with a combination of superb analytical skills and sharp business instincts. His specialties include building and leading high-performance teams to quickly meet the needs of fast-paced, growing companies. Michael has a number of years’ experience in big data innovation, business analytics, business intelligence, predictive analytics, fraud detection, analytics, operations, and statistical modeling across financial, ecommerce, and social networks.

Presentations

Big data analytics and machine learning techniques to drive and grow business Tutorial

Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.

Wei Lin is a senior manager at Dell EMC and chief data scientist for the company’s Big Data practice, where he is responsible for planning the company’s data science strategy and leads data science services delivery for Dell EMC Professional Services’s big data practice. Wei is also responsible for leading data scientist project delivery as well as the hiring, training, and certification of new data scientists. He hosts Dell EMC’s data science mentorship program, which shares data scientists’ engagement findings, industry experience, techniques, and trends. His successes include developing Dell EMC’s data science field consulting methodology, Descriptive, Exploration, Predictive and Prescriptive (DEPP), which provides a practical analytics roadmap and approaches for an organization’s business initiatives and data and analytic requirements. Wei has over 20 years of experience in predictive analytics, including analytical modeling, architecture design, data warehousing, reporting, and marketing. Previously, he was the principal consultant at IBM, PwC, and Coopers & Lybrand. He has authored over 100 papers, and his work has been published or reported on in professional journals as well as Businessweek and Forbes. Wei holds both a PhD in electrical engineering, specializing in artificial intelligence, and an MA in electrical engineering from the State University of New York at Binghamton and a BS in electrical engineering from National Taipei Institute of Technology, Taiwan.

Presentations

Bladder cancer diagnosis using deep learning Session

Image recognition classification of diseases will minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identify bladder cancer on patients using nonsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive.

Shaoshan Liu is the cofounder and president of PerceptIn, a company working on developing a next-generation robotics platform. Previously, he worked on autonomous driving and deep learning infrastructure at Baidu USA. Shaoshan holds a PhD in computer engineering from the University of California, Irvine.

Presentations

Powering robotics clouds with Alluxio Session

Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience and enjoys intelligent design and engaging storytelling. He is passionate about data, music, and nature.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Edwina Lu is a software engineer on LinkedIn’s Hadoop infrastructure development team, currently focused on supporting Spark on the company’s clusters. Previously, she worked at Oracle on database replication.

Presentations

Spark for everyone: Self-service monitoring and tuning Session

Spark applications need to be well tuned so that individual applications run quickly and reliably, and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

Nancy Lublin does not sleep very much. She is currently the founder and CEO of Crisis Text Line, which has processed over 50 million messages in 4 years and is one of the first “big data for good” orgs. She was CEO of DoSomething.org for 12 years, taking it from bankruptcy to the largest organization for teens and social change in the world. Her first venture was Dress for Success, which helps women transition from welfare to work in almost 150 cities in 22 countries. She founded this organization with a $5,000 inheritance from her great-grandfather. Before leading three of the most popular charity brands in America, she was a bookworm. She studied politics at Brown University, political theory at Oxford University (as a Marshall Scholar), and has a law degree from New York University. She is the author of 4 books and is a board member of McGraw Hill Education. Nancy was a judge for 2017’s Miss USA Pageant (she thought that was hilarious). Nancy is a Young Global Leader of the World Economic Forum (attending Davos multiple times), was named Schwab Social Entrepreneur of the Year in 2014, and has been named in the NonProfit Times Power and Influence Top 50 list 3 times. She is married to Jason Diaz and has two children who have never tasted Chicken McNuggets.

Presentations

Keynote with Nancy Lublin Keynote

Keynote with Nancy Lublin

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services.
Boris has over 30 years’ experience in enterprise architecture and has been accountable for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies and Professional Hadoop Solutions, both from Wiley, and Serving Machine Learning Models, from O’Reilly. He is also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams Tutorial

Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.

Daniel Lurie leads the product analytics and science team at Pinterest, a group that mixes deep data skills with strategic thinking to help Pinterest’s product team grow the company’s user base, develop new features, and increase engagement. The team’s work ranges from understanding product performance via A/B experiment analysis to identifying and sizing market opportunities to defining and tracking success through metrics. Previously, Dan led analytics for a sales-focused business line at LinkedIn and worked in consulting.

Presentations

Breaking up the block: Using heterogeneous population modeling to drive growth Session

All successful startups thrive on tight product-market fit, which can produce homogeneous initial user bases. To become the next big thing, your user base will need to diversify, and your product must change to accommodate new needs. Daniel Lurie explains how Pinterest leverages external data to measure racial and income diversity in its user base and how it changed user modeling to drive growth.

Kevin Lyons is senior vice president of data science for digital technology at Nielsen, where he is responsible for leading the vision and execution of Nielsen Marketing Cloud’s analytics and data optimization activities. Previously, Kevin was vice president of analytics and business intelligence at x+1, a leader in audience targeting that leverages sophisticated statistical modeling to surpass traditional online marketing techniques, where he strove to maximize profitable website user behavior via analytics and real-time decisioning; spent over a decade as a vice president responsible for web and marketing analytics at QualityHealth.com, a leading website providing consumer health news and information, and at Harte-Hanks, a large marketing service provider; and served in account management at Grey Direct. Kevin holds a BA in Russian language and Eastern European studies from the University of Illinois at Urbana-Champaign, an MA in medieval history from the Ohio State University, and an MA in applied statistics from Hunter College.

Presentations

Marketing at future speed Media and Advertising

Consumer behavior is in a constant state of flux. Adapting to these changes is especially hard given the staggering amount of big data marketers need to understand and act on. Kevin Lyons introduces online learning, an AI technique that uses event-level data streams to build and adapt models in real time.
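
For readers unfamiliar with the term, online learning updates a model one event at a time rather than retraining in batch. The sketch below is a generic illustration in Python (a plain stochastic gradient descent step for a linear model; the data and learning rate are invented for the example, and this is not Nielsen's implementation):

```python
def sgd_update(w, b, x, y, lr=0.01):
    """One online learning step: adjust a linear model on a single event."""
    pred = sum(wi * xi for wi, xi in zip(w, x)) + b
    err = pred - y
    # Nudge each weight against the gradient of the squared error.
    w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    b = b - lr * err
    return w, b

# Stream events one at a time; the model adapts after every event.
w, b = [0.0, 0.0], 0.0
for x, y in [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0)] * 200:
    w, b = sgd_update(w, b, x, y)
print(w[0] + 2 * w[1] + b)  # prediction for [1, 2] approaches the target 5.0
```

Because each update touches only one event, the model can keep adapting as the stream drifts, with no retraining pass over historical data.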

Angie Ma is cofounder and COO of ASI Data Science, a London-based AI tech startup that offers data science as a service, has completed more than 120 commercial data science projects in multiple industries and sectors, and is regarded as a leader in data science in EMEA. Angie is passionate about real-world applications of machine learning that generate business value for companies and organizations and has experience delivering complex projects from prototyping to implementation. A physicist by training, Angie was previously a researcher in nanotechnology working on developing optical detection for medical diagnostics.

Presentations

Data science for managers 2-Day Training

Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.

Madhav Madaboosi is a digital business and technology strategist within the Strategy, Architecture, and Planning Group at BP, where he leads a number of global innovation initiatives in the areas of robotic process automation, AI, big data, data lakes, and the industrial IoT. Previously, Madhav was the interface to several business portfolios within BP as a business information manager. Prior to BP, he worked in management consulting for a number of Fortune 100 firms. Madhav holds a degree in business and has completed executive programs at the Kellogg Institute of Management.

Presentations

Meta your data; drain the big data swamp Data Case Studies

Madhav Madaboosi and Meenakshisundaram Thandavarayan offer an overview of BP's self-service operational data lake, which improved operational efficiency, boosting productivity through fully identifiable data and reducing risk of a data swamp. They cover the path and big data technologies that BP chose, lessons learned, and pitfalls encountered along the way.

Mark Madsen is a research analyst at Third Nature, where he advises companies on data strategy and technology planning. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide. He focuses on two types of work: the business applications of data and guiding the construction of data infrastructure. As a result, Mark does as much information strategy and IT architecture work as he does analytics.

Presentations

Executive Briefing: BI on big data Session

If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools on the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. A panel of experts details the trade-offs between a number of architectures that provide self-service access to data.

Arup Malakar is a software engineer at Lyft.

Presentations

Dogfooding data at Lyft Session

Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.

Time series data: Architecture and use cases Tutorial

If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data.
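
As a point of reference, a tumbling window simply buckets events into fixed, non-overlapping time intervals. The toy Python sketch below illustrates the idea outside any streaming framework (the event tuples and 60-second window are invented for the example):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count (timestamp, key) events per fixed, non-overlapping window."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "a"), (30, "a"), (65, "b"), (70, "a"), (130, "b")]
print(tumbling_window_counts(events, 60))
# {(0, 'a'): 2, (60, 'b'): 1, (60, 'a'): 1, (120, 'b'): 1}
```

Sessionization differs in that window boundaries are driven by gaps in each key's activity rather than by the clock, but the bucketing-and-aggregating structure is the same.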

Jules Malin is a manager of product analytics and data science at GoPro, where he leads a team responsible for discovering product and behavioral insights from GoPro’s growing family and ecosystem of smart devices and driving product and user experience improvements, including influencing and refining data pipelines in Hadoop/Spark and developing scalable machine learning data products, metrics, and visualizations that produce actionable insights. Previously, Jules worked in product management and analytics engineering at Intel and Shutterfly. He holds a master’s degree in predictive analytics from Northwestern University.

Presentations

Drone data analytics using Spark, Python, and Plotly Data Case Studies

Drones and smart devices are generating billions of event logs for companies, presenting the opportunity to discover insights that inform product, engineering, and marketing team decisions. Jules Malin explains how technologies like Spark and analytics and visualization tools like Python and Plotly enable those insights to be discovered in the data.

Katie Malone is director of data science at Civis Analytics, a data science software and services company, where she leads a diverse team of data scientists who serve as technical and methodological advisors to the Civis consulting team and write the core machine learning and data science software that underpins the Civis Data Science Platform. Previously, she completed a PhD in physics at Stanford, working at CERN on Higgs boson searches. She was also the instructor of Udacity’s Introduction to Machine Learning course and hosts Linear Digressions, a weekly podcast on data science and machine learning.

Presentations

Building a data science idea factory: How to prioritize the portfolio of a large, diverse, opinionated data science team Session

A huge challenge for data science managers is determining priorities for their team. Every data science team has more good ideas than time, so it’s critical to quickly prioritize the highest-impact projects. Katie Malone shares a framework that Civis Analytics’s large, diverse, and opinionated data science team uses to identify, discuss, select, and manage a data science portfolio for a fast-moving startup.

From the presidential campaign trail to the enterprise: Building effective data-driven teams Data Case Studies

The 2012 Obama campaign ran the first personalized presidential campaign in history, with a data team made up of people from diverse backgrounds who embraced data science in service of the goal. Civis Analytics was started from this team and today enables organizations to use many of the same methods outside politics. Along the way, the team has learned a thing or two about building effective teams.

Veronica Mapes is a technical program manager focused on human evaluation and computation at Pinterest, where she manages Pinterest’s internal human evaluation platform, which she matured from an idea into a self-service platform with an annual run rate of 10 million tasks less than six months after launch, as well as third-party communities of crowdsourced raters. She also hires, trains, and manages high-quality content evaluators and tests template and worker quality to ensure the delivery of highly accurate data for time series measurement and training machine learning models.

Presentations

Humans versus the machines: Using human-based computation to improve machine learning Session

Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgment reproducibility and explain how Pinterest integrated end-user-sourced judgments into its in-house platform.

Dana Mastropole is a data scientist in residence at the Data Incubator and contributes to curriculum development and instruction. Previously, Dana taught elementary school science after completing MIT’s Kaufman teaching certificate program. She studied physics as an undergraduate student at Georgetown University and holds a master’s in physical oceanography from MIT.

Presentations

Machine learning with TensorFlow 2-Day Training

The TensorFlow library enables the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs, making it ideal for implementing neural networks and other machine learning algorithms. Robert Schroll and Dana Mastropole demonstrate TensorFlow's capabilities and walk you through building machine learning models on real-world data.

Brian McMahan is a research engineer at Joostware, a San Francisco-based company specialized in consulting and building intellectual property in natural language processing and deep learning. He is also a cofounder at R7 Speech Sciences, a company focused on understanding spoken conversations. Brian is wrapping up his PhD in computer science from Rutgers University, where his research focuses on Bayesian and deep learning models for grounding perceptual language in the visual domain. Brian has also conducted research in reinforcement learning and various aspects of dialogue systems.

Presentations

Machine learning with PyTorch 2-Day Training

PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.

Guru Medasani is a senior solutions architect at Cloudera, where he helps customers build big data platforms and leverage technologies like Apache Hadoop and Apache Spark to solve complex business problems. Business applications he’s worked on include collecting, storing, and processing huge amounts of machine and sensor data; image processing on Hadoop; machine learning models to predict consumer demand; and tools to perform advanced analytics on large volumes of data stored in Hadoop. Previously, Guru built research applications as a big data engineer at Monsanto Research and Development.

Presentations

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark​ Session

When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.
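
The core idea of offset management can be sketched in a few lines of framework-free Python: commit an offset only after the corresponding record has been processed, and resume from the last committed offset on restart. This is an illustrative simulation, not the Kafka or Spark API (the `OffsetStore` class and its in-memory dict stand in for a durable offset store):

```python
class OffsetStore:
    """Stand-in for a durable offset store (in Kafka, a committed offset)."""
    def __init__(self):
        self.committed = {}  # partition -> next offset to read

    def commit(self, partition, offset):
        self.committed[partition] = offset

    def restore(self, partition):
        return self.committed.get(partition, 0)

def process_partition(records, store, partition, sink):
    start = store.restore(partition)  # resume where the last run stopped
    for offset in range(start, len(records)):
        sink.append(records[offset])          # process the record first...
        store.commit(partition, offset + 1)   # ...then commit its offset

records = ["r0", "r1", "r2", "r3"]
store, sink = OffsetStore(), []
process_partition(records[:2], store, 0, sink)  # first run sees two records
process_partition(records, store, 0, sink)      # restart resumes at offset 2
print(sink)  # ['r0', 'r1', 'r2', 'r3'] -- no loss, no duplicates
```

Committing after processing gives at-least-once semantics: a crash between the two steps replays that one record rather than silently dropping it.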

Dong Meng is a data scientist at MapR, where he helps customers solve their business problems with big data, translating the value in customers’ data into actionable insights and machine learning products. His recent work includes integrating open source machine learning frameworks like PredictionIO and XGBoost with MapR’s platform, and he created the time series QSS and deep learning QSS MapR service offerings. Dong has several years of experience in statistical machine learning, data mining, and big data product development. Previously, he was a senior data scientist with ADP, where he built machine learning pipelines and data products for HR using payroll data to power ADP Analytics, and a staff software engineer with IBM SPSS, where he was part of the team that built Watson Analytics. During his graduate study at the Ohio State University, Dong served as a research assistant, concentrating on compressive sensing and solving point estimation problems from a Bayesian perspective.

Presentations

Distributed deep learning with containers on heterogeneous GPU clusters Session

Deep learning model performance relies on the underlying data. Dong Meng offers an overview of a converged data platform that serves as data infrastructure, providing a distributed filesystem, key-value storage, and streams, with Kubernetes as the orchestration layer managing containers that train and deploy deep learning models using GPU clusters.

Peng Meng is a senior software engineer on the big data and cloud team at Intel, where he focuses on Spark and MLlib optimization. Peng is interested in machine learning algorithm optimization and large-scale data processing. He holds a PhD from the University of Science and Technology of China.

Presentations

Spark ML optimization at Intel: A case study Session

Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share the work Intel has been doing on Spark ML and introduce the methodology behind Intel's Spark ML optimizations.

Gian Merlino is CTO and cofounder of Imply and is one of the original committers of the Druid project. Previously, he worked at Metamarkets and Yahoo. Gian holds a BS in computer science from the California Institute of Technology.

Presentations

NoSQL no more: SQL on Druid with Apache Calcite Session

Gian Merlino discusses the SQL layer recently added to the open source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database." Gian explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects.

John Mertic is director of program management for ODPi and the Open Mainframe Project at the Linux Foundation. John comes from a PHP and open source background. Previously, he was director of business development software alliances at Bitnami, a developer, evangelist, and partnership leader at SugarCRM, board member at OW2, president of OpenSocial, and a frequent speaker at conferences around the world. As an avid writer, John has published articles on IBM Developerworks, Apple Developer Connection, and PHP Architect and authored The Definitive Guide to SugarCRM: Better Business Applications and Building on SugarCRM.

Presentations

The rise of big data governance: Insight on this emerging trend from active open source initiatives Session

John Mertic and Ferd Scheepers detail the benefits of a vendor-neutral approach to data governance, explain the need for an open metadata standard, and share insight around how companies ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative.

Thomas W. Miller is faculty director of the data science program at Northwestern University. He has developed and taught many courses in the program, including practical machine learning, web information retrieval, and network data science. Miller has written six books about data science and has consulted with many businesses, providing advice on performance and value measurement, data science methods, information technology, and best practices in building teams of data scientists and data engineers.

Presentations

Working with the data of sports Data Case Studies

Sports analytics today is more than a matter of analyzing box scores and play-by-play statistics. Faced with detailed on-field or on-court data from every game, sports teams face challenges in data management, data engineering, and analytics. This study addresses challenges faced by a Major League Baseball team as it seeks competitive advantage through data science and deep learning.

Nina Mishra is a principal scientist at Amazon Web Services, where she focuses on data science, data mining, web search, machine learning, and privacy. Nina has many years of experience leading projects at Amazon, Microsoft Research, and HP Labs. She was also an associate professor at the University of Virginia and an acting faculty member at Stanford University. Nina’s research encompasses the design and evaluation of new data mining algorithms on real, colossal-sized datasets. She has authored almost 50 publications in top venues, including WWW, WSDM, SIGIR, ICML, NIPS, AAAI, COLT, VLDB, PODS, CRYPTO, EUROCRYPT, FOCS, and SODA, which have been recognized with best paper award nominations. Nina’s research was central to the Bing search engine and has been widely featured in external press coverage. Nina holds 14 patents, with a dozen more still in the application stage. She has had the distinct privilege of helping others advance in their careers, including 15 summer interns and many full-time researchers. Nina’s service to the community includes serving on the editorial boards of the journals Machine Learning, the Journal of Privacy and Confidentiality, IEEE Transactions on Knowledge and Data Engineering, and IEEE Intelligent Systems; chairing the premier machine learning conference, ICML, in 2003; and serving on numerous program committees for web search, data mining, and machine learning conferences. She was awarded an NSF grant as a principal investigator and has served on eight PhD dissertation committees.

Presentations

Continuous machine learning over streaming data Session

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data, focusing on explainable machine learning, including anomaly detection with attribution, ability to reduce false positives through user feedback, and detection of anomalies in directed graphs.
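
As a simple point of comparison for anomaly detection over streams, the sketch below flags points that deviate sharply from a rolling window of recent history using a z-score test (a generic Python illustration; the window size, threshold, and data are invented, and this is not the algorithm the speakers present):

```python
from collections import deque
from statistics import mean, stdev

def streaming_anomalies(stream, window=20, threshold=3.0):
    """Flag indices whose value deviates sharply from recent history."""
    history = deque(maxlen=window)  # bounded memory: only recent points kept
    flagged = []
    for i, x in enumerate(stream):
        if len(history) >= 2:  # need at least two points for a stdev
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                flagged.append(i)
        history.append(x)
    return flagged

data = [10.0, 10.2, 9.9, 10.1, 10.0, 50.0, 10.1, 9.8]
print(streaming_anomalies(data, window=5))  # [5] -- the 50.0 spike
```

Each point is scored against only a bounded window of history, so memory stays constant no matter how long the stream runs; the session's algorithms go further by adding attribution and feedback-driven false-positive reduction.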

Rajat Monga leads TensorFlow, an open source machine learning library and the center of Google’s efforts at scaling up deep learning. He is one of the founding members of the Google Brain team and is interested in pushing machine learning research forward toward general AI. Previously, Rajat was the chief architect and director of engineering at Attributor, where he led the labs and operations and built out the engineering team. A veteran developer, Rajat has worked at eBay, Infosys, and a number of startups.

Presentations

The current state of TensorFlow and where it's headed in 2018 Session

Rajat Monga offers an overview of TensorFlow progress and adoption in 2017 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas.

Ajay Mothukuri is an architect on the data technologies team at Sapient.

Presentations

Achieving GDPR compliance and data privacy using blockchain technology Session

Ajay Mothukuri, Arunkumar Ramanatha, and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR) regulation.

Manu Mukerji works on machine learning and AI at Criteo. Manu has a background in cloud computing and big data, working on systems handling billions of transactions per day in real time. He enjoys building and architecting scalable, highly available data solutions and has extensive experience working in online advertising and social media.

Presentations

Machine learning versus machine learning in production Session

Criteo is a global leader in commerce marketing. Manu Mukerji walks you through Criteo's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated, how the model is pushed to production, evaluated (automatically), and used, production issues that arise when applying ML at scale in production, lessons learned, and more.

Ash Munshi is CEO of Pepperdata. Previously, Ash was executive chairman for deep learning startup Marianas Labs (acquired by Askin in 2015); CEO of big data storage startup Graphite Systems (acquired by EMC DSSD in 2015); CTO of Yahoo; and CEO of a number of other public and private companies. He serves on the board of several technology startups.

Presentations

Classifying job execution using deep learning Session

Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network that helps operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and represent the first approach to classifying multivariate time series in this way.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Data reflections: Making data fast and easy to use without making copies Session

Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies.

Balasubramanian Narasimhan is a senior research scientist in the Department of Statistics and the Department of Biomedical Data Sciences at Stanford University and the director of the Data Coordinating Center within the Department of Biomedical Data Sciences. His research areas include statistical computing, distributed computing, clinical trial design, and reproducible research. Balasubramanian coteaches a computing for data science course with John Chambers, an inventor of the S language.

Presentations

Distributed clinical models: Inference without sharing patient data Session

Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Daniel Rubin outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset.

Paco Nathan leads the Learning Group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

Human in the loop: A design pattern for managing teams working with machine learning Session

Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media.

Ann Nguyen evangelizes design for impact at Whole Whale, where she leads the tech and design team in building meaningful digital products for nonprofits. She has designed and managed the execution of multiple websites for organizations including the LAMP, Opportunities for a Better Tomorrow, and Breakthrough. Ann is always challenging designs with A/B testing. She bets $1 on every experiment that she runs and to date has accumulated a decent sum. Previously, Ann worked with a wide range of organizations from the Ford Foundation to Bitly. She is Google Analytics and Optimizely Platform certified. Ann is a regular speaker on nonprofit design and strategy and recently presented at the DMA Nonprofit Conference. She has also taught at Sarah Lawrence College. Outside of work, Ann enjoys multisensory art, comedy shows, fitness, and making cocktails, ideally all together.

Presentations

Using ML to improve UX and literacy for young poets Data Case Studies

Power Poetry is the largest online platform for young poets, with over 350K users. Ann Nguyen explains how Power Poetry is extending the learning potential with machine learning and covers the technical elements of the Poetry Genome, a series of ML tools to analyze and break down similarity scores of the poems added to the site.

Ryan Nienhuis is a senior technical product manager on the Amazon Kinesis team, where he defines products and features that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.

Presentations

Continuous machine learning over streaming data Session

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data, focusing on explainable machine learning, including anomaly detection with attribution, ability to reduce false positives through user feedback, and detection of anomalies in directed graphs.

Berk Norman is a data scientist in the Department of Radiology and Biomedical Imaging at UC San Francisco, where he works on constructing deep learning models.

Presentations

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.

Meagan O’Leary is director of finance business intelligence at Microsoft. Meagan considers herself a full-stack leader focused on delivering results through her education, skills, and experience in business, technology, strategy, and design thinking. She engages the world with curiosity, energy, and enthusiasm to unleash capabilities in others and to generate value. Faced with the seemingly complex, Meagan drives for clarity while empowering teams and individuals to do more than they thought possible. She has successfully implemented a diverse portfolio of solutions, including enterprise resource planning (SAP), ecommerce, business performance management, financial business intelligence, and most recently, artificial intelligence and intelligent automation. Meagan is passionate about improving business and human performance and has done so in Fortune 500 organizations, nonprofits, and startups across a number of industries.

Presentations

How to successfully reinvent productivity in finance with machine learning (Hint: machine learning is only part of it.) Data Case Studies

Microsoft’s finance organization is reinventing forecasting using machine learning that its leaders describe as game changing. Meagan O'Leary shares the lessons the data sciences and finance teams learned while bringing machine learning forecasting to the office of the CFO by improving forecast accuracy and frequency and driving cultural change through a finance center of excellence.

A leading expert on big data architectures, Stephen O’Sullivan has 25 years of experience creating scalable, high-availability data and application solutions. A veteran of Silicon Valley Data Science, @WalmartLabs, Sun, and Yahoo, Stephen is an independent adviser to enterprises on all things data.

Presentations

Enough data engineering for a data scientist; or, How I learned to stop worrying and love the data scientists Session

Stephen O'Sullivan takes you along the data science journey, from onboarding data (using a number of data/object stores) to understanding and choosing the right data format for your data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Executive Briefing: Machine learning—Why you need it, why it's hard, and what to do about it Session

Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production—including the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands—and outlines proven ways to meet them.

Andrea Pasqua is a data science manager at Uber, where he leads the time series forecasting and anomaly detection teams. Previously, Andrea was director of data science at Radius Intelligence, a company spearheading the use of machine learning in the marketing space; a financial analyst at MSCI, a leading company in the field of risk analysis; and a postdoctoral fellow in biophysics at UC Berkeley. He holds a PhD in physics from UC Berkeley.

Presentations

Detecting time series anomalies at Uber scale with recurrent neural networks Session

Time series forecasting and anomaly detection is of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.

Mo Patel is an independent deep learning consultant advising individuals, startups, and enterprise clients on strategic and technical AI topics. Mo has successfully managed and executed data science projects with clients across several industries, including cable, auto manufacturing, medical device manufacturing, technology, and car insurance. Previously, he was practice director for AI and deep learning at Think Big Analytics, a Teradata Company, where he mentored and advised Think Big clients and provided guidance on ongoing deep learning projects, as well as a management consultant and a software engineer earlier in his career. A continuous learner, Mo conducts research on applications of deep learning, reinforcement learning, and graph analytics toward solving existing and novel business problems and brings a diversity of educational and hands-on expertise connecting business and technology. He holds an MBA, a master’s degree in computer science, and a bachelor’s degree in mathematics.

Presentations

Learning PyTorch by building a recommender system Tutorial

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model.

Neejole Patel is a sophomore at Virginia Tech, where she is pursuing a BS in computer science with a focus on machine learning, data science, and artificial intelligence. In her free time, Neejole completes independent big data projects, including one that tests the Broken Windows theory using DC crime data. She recently completed an internship at a major home improvement retailer. (Twitter: datajolie)

Presentations

Learning PyTorch by building a recommender system Tutorial

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model.

Rizwan Patel is senior director of big data, innovation, and emerging technology at Caesars Entertainment. A senior technologist with strong leadership skills coupled with hands-on application and system expertise, Rizwan has a proven track record of delivering large-scale, mission-critical projects on time and budget using leading-edge technologies to solve critical business problems as well as extensive experience in managing client relations at all levels, including senior executives.

Presentations

Big data applicability to the gaming industry Media and Advertising

Rizwan Patel explains how the gaming industry can leverage Cloudera’s big data platform to adapt to the change in patron dynamics (both in terms of demographics as well as in spending patterns) to create a new paradigm for customer (micro) segmentation.

Vanja Paunić is a data scientist on the Azure Machine Learning team at Microsoft. Previously, Vanja worked as a research scientist in the field of bioinformatics, where she published on uncertainty in genetic data, genetic admixture, and prediction of genes. She holds a PhD in computer science with a focus on data mining from the University of Minnesota.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

K. Wayne Peacock is vice president of global insights at Blizzard Entertainment, where he is responsible for teams of business analysts, data scientists, software engineers, and data engineers that deliver advanced analytics, strategic insights, predictive models, real-time data services, and analytics platforms to power Blizzard’s global business. He joined Blizzard in 2017 and brings over 25 years of technology and analytical leadership experience from such well-known brands as Capital One, Netflix, Visa, and The Walt Disney Company. The vision of the Global Insights team is to blend a judicious view of the business and a rich understanding of its players with a deep passion for its games to help craft epic entertainment experiences.

Presentations

Keynote with Wayne Peacock Keynote

Wayne Peacock, Vice President, Global Insights for Blizzard Entertainment.

Valentina Pedoia is a specialist in the Musculoskeletal and Imaging Research Group at UCSF and a data scientist focused on developing advanced computer vision and machine learning algorithms to improve the use of noninvasive imaging as a diagnostic and prognostic tool. Her current research explores the role of machine learning in identifying contributors to osteoarthritis (OA). She is studying analytics to model the complex interactions between the morphological, biochemical, and biomechanical aspects of the knee joint as a whole, as well as deep convolutional neural networks for musculoskeletal tissue segmentation and for extracting silent features from quantitative relaxation maps for a comprehensive study of biochemical articular cartilage composition, with the ultimate goal of developing a completely data-driven model that can extract imaging features and use them to identify risk factors and predict outcomes. Previously, she was a postdoc in the Musculoskeletal and Imaging Research Group, where she provided support and expertise in medical computer vision with a focus on reducing human effort and extracting semantic features from MRIs to study degenerative joint disease. Valentina’s recent work on machine learning applied to OA was selected as one of the annual scientific highlights of the 25th conference of the International Society for Magnetic Resonance in Medicine (ISMRM 2017) and as best paper presented at the MRI drug discovery study group. Valentina holds a PhD in computer science; her doctoral research focused on feature extraction from functional and structural brain MRI in subjects with glial tumors.

Presentations

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.

Thomas Phelan is cofounder and chief architect of BlueData. Previously, Tom was an early employee at VMware; as senior staff engineer, he was a key member of the ESX storage architecture team. During his 10-year stint at VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular “pluggable storage architecture.” He went on to lead teams working on many key storage initiatives, such as the cloud storage gateway and vFlash. Earlier, Tom was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit file system.

Presentations

How to protect big data in a containerized environment Session

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with transparent data encryption (TDE). However, TDE can be difficult to configure and manage; issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them.

Patrick Phelps is the lead data scientist on ads at Pinterest, focusing on auction dynamics and advertiser success. Previously, Patrick was the lead data scientist at Yelp, leading a team focusing on projects as diverse as search, ads, delivery operations, and HR. He has an engineering background in traffic quality (the art of distinguishing automated systems and malicious actors from legitimate users across a variety of platforms) and held an Insight Data Science fellowship. Patrick is passionate about the ability of data to provide key, quantitative insights to businesses during the decision-making process and is an advocate for data science education across all layers of a company. Patrick holds a PhD in experimental high-energy particle astrophysics.

Presentations

Executive Briefing: Building effective heterogeneous data communities—Driving organizational outcomes with broad-based data science Session

Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights.

Marcin Pilarczyk is a data scientist and the leader of Ryanair’s Data and Analytics Department. He has around 14 years of professional experience in the aviation, telco, and financial industries, working on data science, big data solutions, and data warehouses.

Presentations

Data-driven fuel management at Ryanair Session

Managing fuel at a company flying 120 million passengers yearly is not a trivial task. Marcin Pilarczyk explores the main aspects of fuel management at a modern airline and offers an overview of machine learning methods supporting long-term planning and daily decisions.

Jennifer Prendki is the head of data science at Atlassian, where she leads all search and machine learning initiatives and is in charge of leveraging the massive amount of data collected by the company to load the suite of Atlassian products with smart features. Jennifer has worked as a data scientist in many different industries. Previously, she was a senior data science manager on the search team at Walmart eCommerce. Jennifer enjoys addressing both technical and nontechnical audiences at conferences and sharing her knowledge and experience with aspiring data scientists. She holds a PhD in particle physics from UPMC-La Sorbonne.

Presentations

The science of patchy data Session

Jennifer Prendki explains how to develop machine learning models even if the data is protected by privacy and compliance laws and cannot be used without anonymizing, covering techniques ranging from contextual bandits to document vector representation.

Michael Prorock is founder and CTO at mesur.io. Michael is an expert in systems and analytics, as well as in building teams that deliver results. Previously, he was director of emerging technologies for the Bardess Group, where he defined and implemented a technology strategy that enabled Bardess to scale its business to new verticals across a variety of clients, and worked in analytics for Raytheon, Cisco, and IBM, among others. He has filed multiple patents related to heuristics, media analysis, and speech recognition. In his spare time, Michael applies his findings and environmentally conscious methods on his small farm.

Presentations

Smart agriculture: Blending IoT sensor data with visual analytics Data Case Studies

Mike Prorock offers an overview of mesur.io, a game-changing climate awareness solution that combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market. Mesur.io enables growers to monitor areas of concern, providing immediate benefits to crop yield, supply costs, farm labor overhead, and water consumption.

Jiangjie Qin is a software engineer on the data infrastructure team at LinkedIn, where he works on Apache Kafka. Previously, Jiangjie worked at IBM, where he managed IBM’s zSeries platform for banking clients. He is a Kafka PMC member. Jiangjie holds a master’s degree in information networking from Carnegie Mellon’s Information Networking Institute.

Presentations

The secret sauce behind LinkedIn's self-managing Kafka clusters Session

LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention.

Paul Raff is a principal data scientist manager on Microsoft’s analysis and experimentation team, where he and his team work to enable scalable experimentation for teams around Microsoft, including Windows 10, Office Online, Exchange Online, and Cortana, focusing on experiment quality and ensuring that all experiments are operating as intended and in a way that allows for the appropriate conclusions to be made. Previously, he was a supply chain researcher at Amazon. Paul holds a PhD in mathematics from Rutgers University as well as degrees in mathematics and computer science from Carnegie Mellon University.

Presentations

A/B testing at scale: Accelerating software innovation Tutorial

Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Pavel Dmitriev, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.

Greg Rahn is a director of product management at Cloudera, where he is responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over his 20-year career, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently, product management, to provide a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Analytics in the cloud: Building a modern cloud-based big data warehouse Session

For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud.

Arunkumar Ramanatha is a senior architect on the data team at Sapient.

Presentations

Achieving GDPR compliance and data privacy using blockchain technology Session

Ajay Mothukuri, Arunkumar Ramanatha, and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR) regulation.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Effectively once, exactly once, and more in Heron Session

Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and discuss how applications can benefit.

Modern real-time streaming architectures Tutorial

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Karthik Ramasamy leads a data science team at Uber focusing on solving fraud problems using machine learning. His team builds advanced machine learning models like semisupervised and deep learning models to detect account takeovers and stolen credit cards. Previously, Karthik was a cofounder of LogBase, where he worked on real-time analytics infrastructure and built models to rate drivers based on their driving behavior, and a founding member of the LinkedIn security team, where he developed various security products, with a particular focus on anti-automation efforts.

Presentations

Using computer vision to combat stolen credit card fraud Session

Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities.

Rishi Ranjan is the director of big data and analytics at Freddie Mac, where he focuses on providing big data solutions for analytics and data science. Rishi has over 20 years of experience managing data and database platforms.

Presentations

From big data to good data: How Apache NiFi and Apache Atlas eased dataflow management at Freddie Mac with better data governance and reduced data latency Data Case Studies

Rishi Ranjan explains how Freddie Mac used Apache NiFi and Apache Atlas to build a centralized production operational data store on a Hadoop cluster. NiFi reduced the time to build a new data pipeline from months to hours and provided a robust data governance capability at the same time.

Delip Rao is the founder of R7 Speech Science, a San Francisco-based company focused on building innovative products on spoken conversations. Previously, Delip was the founder of Joostware, which specialized in consulting and building IP in natural language processing and deep learning. Delip is a well-cited researcher in natural language processing and machine learning and has worked at Google Research, Twitter, and Amazon (Echo) on various NLP problems. He is interested in building cost-effective, state-of-the-art AI solutions that scale well. Delip has an upcoming book on NLP and deep learning from O’Reilly.

Presentations

Going beyond words: Understand what your spoken conversation data can do for you Session

Spoken conversations have rich information beyond what was said in words. Delip Rao details the potential of spoken conversational datasets, including identifying speakers and their demographic attributes, understanding intent and dynamics between speakers, and so on. Delip also discusses some of the latest science, including some of the work developed at R7.

Machine learning with PyTorch 2-Day Training

PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.

Andrew Ray is a senior technical expert at Sam’s Club Technology. He is passionate about big data and has extensive experience working with Apache Spark and Hadoop. Previously, at Walmart, Andrew built an analytics platform on Hadoop that integrated data from multiple retail channels using fuzzy matching and distributed graph algorithms and led the adoption of Spark from proof of concept to production. He is an active contributor to the Apache Spark project, including SparkSQL and GraphX. Andrew holds a PhD in mathematics from the University of Nebraska, where he worked on extremal graph theory.

Presentations

Writing distributed graph algorithms Session

Andrew Ray offers a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX, drawing on real-world examples, and provides historical context for the evolution between these three abstractions.

Joseph (Joey) Richards is vice president of data and analytics at GE Digital and head of the Wise.io data science applications team, which is responsible for defining and implementing machine learning applications on behalf of GE and its customers. Previously, he was cofounder and chief data scientist at Wise.io (acquired by GE in 2016), where he built and deployed high-value ML applications for dozens of customers; an NSF postdoctoral researcher in the Statistics and Astronomy Departments at UC Berkeley; and a Fulbright Scholar whose research focused on application of supervised and semisupervised learning for problems in astrophysics. Joey holds a PhD in statistics from Carnegie Mellon University.

Presentations

Machine learning applications for the industrial internet Session

Deploying ML software applications for use cases in the industrial internet presents a unique set of challenges. Data-driven problems require approaches that are highly accurate, robust, fast, scalable, and fault tolerant. Joseph Richards shares GE's approach to building production-grade ML applications and explores work across GE in industries such as power, aviation, and oil and gas.

Alexis Roos is a senior engineering manager at Salesforce, where he leads a team of data engineers and scientists focused on deriving intelligence from activity data for the Einstein platform using streaming, batch, and graph data processing along with natural language processing and AI (machine learning and deep learning). He also leads presentations and online trainings on data science and engineering. Alexis has over 20 years of software engineering experience, with the last five years focused on large-scale data science and engineering, working for systems integrators in Europe, Sun Microsystems/Oracle, and several startups, including Radius Intelligence, Concurrent, and Couchbase. Alexis holds a master’s degree in computer science with a focus on cognitive sciences.

Presentations

Building a contacts graph from activity data Session

In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time, surfacing insights and recommendations using a graph that models contextual data.

Mike Ruberry is a senior associate of data science at ZestFinance, where his research interests include explainability and generative models. Mike has worked on several machine learning models and tools, including deploying automated models that process terabytes of data daily. Before specializing in machine learning, he worked on Windows as a program manager at Microsoft. Mike holds four degrees in computer science, including a PhD from Harvard University.

Presentations

Explaining machine learning models Session

What does it mean to explain a machine learning model, and why is it important? Mike Ruberry offers an overview of several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques presents a different perspective, and their clever application can reveal new insights and solve business requirements.

Daniel L. Rubin is associate professor of biomedical data science, radiology, and medicine (biomedical informatics research) at Stanford University and director of Biomedical Informatics for the Stanford Cancer Institute. Daniel’s NIH-funded research program focuses on quantitative imaging, integrating imaging with clinical and molecular data, and mining the data to discover imaging phenotypes that can predict disease biology, define disease subtypes, and personalize treatment. He is applying these methods for distributed computation of decision support models. Daniel has over 240 scientific publications in biomedical imaging informatics and medical imaging.

Presentations

Distributed clinical models: Inference without sharing patient data Session

Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Daniel Rubin outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset.

Philipp Rudiger is a software developer at Anaconda, where he develops open source and client-specific software solutions for data management, visualization, and analysis. Philipp holds a PhD in computational modeling of the visual system.

Presentations

Custom interactive visualizations and dashboards for one billion datapoints on a laptop in 30 lines of Python Tutorial

Python lets you solve data science problems by stitching together packages from its ecosystem, but it can be difficult to choose packages that work well together. James Bednar and Philipp Rudiger walk you through a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints—all in just 30 lines of Python code.

Ferd Scheepers is the global chief information architect at ING, where he drives ING’s journey to becoming a data-driven company. Ferd has published on data lakes and is a frequent speaker at both major vendor conferences and open source summits. Currently, he is championing the open metadata initiative, including Apache Atlas. Passionate about data (both its opportunities and risks), Ferd loves to share his vision and ideas on what data will mean for both companies and individuals.

Presentations

The rise of big data governance: Insight on this emerging trend from active open source initiatives Session

John Mertic and Ferd Scheepers detail the benefits of a vendor-neutral approach to data governance, explain the need for an open metadata standard, and share insight into how companies including ING, IBM, and Hortonworks are delivering solutions to this challenge as an open source initiative.

Michael Schrenk has developed software that collects and processes information for some of the biggest news agencies in Europe and leads a competitive intelligence consultancy in Las Vegas, consulting on information security everywhere from Moscow to Silicon Valley and most places in between. Mike is the author of Webbots, Spiders, and Screen Scrapers. He has lectured at journalism conferences in Belgium and the Netherlands and has created several weekend data workshops for the Centre for Investigative Journalism in London. Along the way, he’s been interviewed by the BBC, the Christian Science Monitor, National Public Radio, and many others. Mike is also an eight-time speaker at the notorious DEF CON hacking conference. He may be best known for building software that, over a period of a few months, autonomously purchased over $13 million worth of cars by adapting to real-time market conditions.

Presentations

Understanding metadata Session

Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine learning with TensorFlow 2-Day Training

The TensorFlow library enables the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs, making it ideal for implementing neural networks and other machine learning algorithms. Robert Schroll and Dana Mastropole demonstrate TensorFlow's capabilities and walk you through building machine learning models on real-world data.
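The data flow graph model the training covers can be sketched in a few lines of plain Python. This is a hypothetical toy, not the TensorFlow API: operations become nodes, edges carry values, and nothing is computed until the graph is evaluated.

```python
# Toy data-flow graph (illustrative only; TensorFlow adds tensors,
# automatic differentiation, and parallel execution across CPUs/GPUs).
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def eval(self):
        # Recursively evaluate upstream nodes, then apply this node's op.
        return self.op(*(n.eval() for n in self.inputs))

def constant(value):
    # A source node with no inputs.
    return Node(lambda: value)

# Build the graph for (3 * 4) + 2 without computing anything yet...
product = Node(lambda a, b: a * b, constant(3.0), constant(4.0))
total = Node(lambda a, b: a + b, product, constant(2.0))

# ...then run it, much as a TensorFlow session would.
print(total.eval())  # 14.0
```

Because the graph is declared before it runs, a framework can analyze it as a whole and schedule independent subgraphs in parallel, which is the property the blurb above refers to.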

Machine learning with TensorFlow Training Day 2

The instructors demonstrate TensorFlow's capabilities through its Python interface and explore TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine learning models on real-world data.

Baron Schwartz is the founder and CEO of VividCortex, the best way to see what your production database servers are doing. Baron has written a lot of open source software and several books, including High Performance MySQL. He has focused his career on learning and teaching about the performance and observability of systems generally (including the view that teams are systems whose culture influences their performance) and databases specifically.

Presentations

Why nobody cares about your anomaly detection Session

Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view.
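For context, the baseline that much of the monitoring industry reimplements is a deviation threshold against recent history. The sliding-window z-score sketch below is an illustrative assumption about that common pattern, not Schwartz's own method.

```python
# Minimal sliding-window anomaly detector: flag points more than
# `threshold` standard deviations from the mean of the preceding window.
from statistics import mean, stdev

def anomalies(series, window=10, threshold=3.0):
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat history (sigma == 0) to avoid division-style blowups.
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

metrics = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 95, 10, 11]
print(anomalies(metrics))  # [10] — the spike at index 10 is flagged
```

The weakness of this pattern, and one reason "nobody cares," is that once the spike enters the window it inflates the standard deviation, masking subsequent anomalies.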

Skipper Seabold is director of data science at Civis Analytics, a data science software and services company. He leads product-facing data science research and development, directing a large and diverse team of data scientists in building core exploratory and modeling software for the Civis Data Science Platform. Prior to Civis, Skipper worked at DataPad (acquired by Cloudera). He’s an economist by training and has a decade of experience working in the Python data community. He started and led the statsmodels Python project, was formerly on the core pandas team, and has contributed to many packages in the Python data stack. He holds strong opinions about writing and barbeque.

Presentations

Building a data science idea factory: How to prioritize the portfolio of a large, diverse, opinionated data science team Session

A huge challenge for data science managers is determining priorities for their team. Every data science team has more good ideas than time to execute them, so it’s critical to quickly prioritize the highest-impact projects. Skipper Seabold shares the framework that Civis Analytics’s large and diverse data science team uses to identify, discuss, select, and manage a data science portfolio for a fast-moving startup.

Jonathan Seidman is a software engineer on the partner engineering team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage them and achieve ROI with your data initiatives.

Janelle Shane’s neural network blog at aiweirdness.com features computer programs that try to invent human things like recipes and paint colors and Halloween costumes. Her blog has been covered in The Guardian, The Atlantic, NBC News, and Slate, and was even featured as a recent quiz question on Wait Wait, Don’t Tell Me. Dr. Shane also works as a research scientist in Colorado, where she makes computer-controlled holograms for studying the brain. She has only made a neural network recipe once, and discovered that horseradish brownies are about as terrible as you might imagine.

Presentations

Sprouted clams and stanky bean: When machine learning makes mistakes Keynote

At AIweirdness.com Janelle Shane posts the results of neural network experiments gone delightfully wrong. But machine learning mistakes can also be very embarrassing, or even dangerous. Using silly datasets as examples, Shane talks about some ways that algorithms fail.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementation. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

The future of ETL isn’t what it used to be Session

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.

Min Shen is an engineer on LinkedIn’s Hadoop infrastructure development team, where he helps build next-generation Hadoop infrastructure with better performance and manageability. Min holds a PhD in computer science from the University of Illinois, with a research interest in distributed computing.

Presentations

Spark for everyone: Self-service monitoring and tuning Session

Spark applications need to be well tuned so that individual applications run quickly and reliably, and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

Tomer Shiran is the CEO and cofounder of Dremio. Previously, he was vice president of product at MapR, where he was responsible for product strategy, roadmap, and new feature development, and as a member of the executive team, helped grow the company from 5 to 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. Tomer is the founder of the open source Apache Drill project. He holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion, the Israel Institute of Technology. He has authored five US patents.

Presentations

Data reflections: Making data fast and easy to use without making copies Session

Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies.

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of security trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 out of 10 of the world’s largest investment banks.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Jeffrey Shmain outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Vartika Singh is a field data science architect at Cloudera. Previously, Vartika was a data scientist applying machine learning algorithms to real-world use cases, ranging from clickstream to image processing. She has 12 years of experience designing and developing solutions and frameworks utilizing machine learning techniques.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Jeffrey Shmain outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Tomas Singliar is a data scientist in Microsoft’s AI and Research Group. Tomas’s favorite hammer is probabilistic and Bayesian modeling, which he applies analytically and predictively to business data. He has published a dozen papers in, and serves as a reviewer for, several top-tier AI conferences, including AAAI and UAI, and holds four patents in intent recognition through inverse reinforcement learning. Tomas studied machine learning at the University of Pittsburgh.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Ram Shankar is a security data wrangler in Azure Security Data Science, where he works on the intersection of ML and security. Ram’s work at Microsoft includes a slew of patents in the intrusion detection space (called “fundamental and groundbreaking” by evaluators). In addition, he has given talks at internal conferences and received Microsoft’s Engineering Excellence award. Ram has previously spoken at data-analytics-focused conferences like Strata San Jose and the Practice of Machine Learning as well as at security-focused conferences like BlueHat, DerbyCon, FireEye Security Summit (MIRCon), and Infiltrate. Ram graduated from Carnegie Mellon University with master’s degrees in both ECE and innovation management.

Presentations

Failed experiments in infrastructure security analytics and lessons learned from fixing them Session

How should you best debug a security data science system: change the ML approach, redefine the security scenario, or start over from scratch? Ram Shankar answers this question by sharing the results of failed experiments and the lessons learned when building ML detections for cloud lateral movement, identifying anomalous executables, and automating the incident response process.

Crystal Skelton is an associate in Kelley Drye & Warren’s Los Angeles office, where she represents a wide array of clients, from tech startups to established companies, in privacy and data security, advertising and marketing, and consumer protection matters. Crystal advises clients on privacy, data security, and other consumer protection matters, specifically focusing on issues involving children’s privacy, mobile apps, data breach notification, and other emerging technologies. She counsels clients on complying with the FTC Act, the Children’s Online Privacy Protection Act (COPPA), the Gramm-Leach-Bliley Act, the GLB Safeguards Rule, the Fair Credit Reporting Act (FCRA), the Fair and Accurate Credit Transactions Act (FACTA), and state privacy and information security laws. She regularly drafts privacy policies and terms of use for websites, mobile applications, and other connected devices.

Crystal also helps advertisers and manufacturers balance legal risks and business objectives to minimize the potential for regulator, competitor, or consumer challenge while still executing a successful campaign. Her advertising and marketing experience includes counseling clients on issues involved in environmental marketing, marketing to children, online behavioral advertising (OBA), commercial email messages, endorsements and testimonials, food marketing, and alcoholic beverage advertising. She represents clients in advertising substantiation proceedings and other matters before the Federal Trade Commission (FTC), the US Food and Drug Administration (FDA), and the Alcohol and Tobacco Tax and Trade Bureau (TTB) as well as in advertiser or competitor challenges before the National Advertising Division (NAD) of the Council of Better Business Bureaus. In addition, she assists clients in complying with accessibility standards and regulations implementing the Americans with Disabilities Act (ADA), including counseling companies on website accessibility and advertising and technical compliance issues for commercial and residential products. Prior to joining Kelley Drye, Crystal practiced privacy, advertising, and transactional law at a highly regarded firm in Washington, DC, and worked as a law clerk at a well-respected complex commercial and environmental litigation firm in Los Angeles, CA. Previously, she worked at the law firm featured in the movie Erin Brockovich, where she worked directly with Erin Brockovich and the firm’s name partner to review potential new cases.

Presentations

Executive Briefing: Legal best practices for making data work Session

Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.”

Suqiang Song is lead enterprise architect at MasterCard.

Presentations

Improving user-merchant propensity modeling using Neural Collaborative Filtering and Wide-and-Deep models on Spark BigDL at scale Session

Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL Wide-and-Deep and Neural Collaborative Filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by a classical MLlib alternating least squares (ALS) approach.

Ram Sriharsha is the product manager for Apache Spark at Databricks and an Apache Spark committer and PMC member. Previously, Ram was architect of Spark and data science at Hortonworks and principal research scientist at Yahoo Labs, where he worked on scalable machine learning and data science. He holds a PhD in theoretical physics from the University of Maryland and a BTech in electronics from the Indian Institute of Technology, Madras.

Presentations

Magellan: Scalable and fast geospatial analytics Session

How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity.
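A sense of what a geospatial engine must execute at scale comes from its most basic primitive, the point-in-polygon test. The ray-casting sketch below is a textbook toy; Magellan's contribution is layering spatial indexing and query optimization over such primitives in Spark, not this function itself.

```python
# Ray-casting point-in-polygon test: cast a horizontal ray to the right
# and count edge crossings; an odd count means the point is inside.
def point_in_polygon(x, y, polygon):
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the ray's y-coordinate?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the ray's line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(point_in_polygon(0.5, 0.5, unit_square))  # True
print(point_in_polygon(1.5, 0.5, unit_square))  # False
```

Running this naively for every (point, polygon) pair is quadratic, which is why a spatial join engine prunes candidates with an index before applying the exact test.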

Seth Stephens-Davidowitz uses data from the internet (particularly Google searches) to get new insights into the human psyche, measuring racism, self-induced abortion, depression, child abuse, hateful mobs, the science of humor, sexual preference, anxiety, son preference, and sexual insecurity, among many other topics. His 2017 book, Everybody Lies, published by HarperCollins, was a New York Times best seller. Seth is also a contributing op-ed writer for the New York Times. Previously, he was a data scientist at Google and a visiting lecturer at the Wharton School at the University of Pennsylvania. He holds a BA in philosophy (Phi Beta Kappa) from Stanford and a PhD in economics from Harvard. In high school, Seth wrote obituaries for his local newspaper, the Bergen Record, and was a juggler in theatrical shows. He now lives in Brooklyn and is a passionate fan of the Mets, Knicks, Jets, Stanford football, and Leonard Cohen.

Presentations

Keynote with Seth Stephens-Davidowitz Keynote

Keynote with Seth Stephens-Davidowitz

Kapil Surlaker leads the data and analytics team at LinkedIn, where he’s responsible for core analytics infrastructure platforms, including Hadoop and Spark; other computation frameworks, such as Gobblin; Pinot, an OLAP serving store; and XLNT, LinkedIn’s experimentation platform. Previously, Kapil led the development of Databus, a database change capture platform that forms the backbone of LinkedIn’s online data ecosystem; Espresso, a distributed document store that powers many applications on the site; and Helix, a generic cluster management framework that manages multiple infrastructure deployments at LinkedIn. Prior to LinkedIn, Kapil held senior technical leadership positions at Kickfire (acquired by Teradata) and Oracle. Kapil holds a BTech in computer science from IIT, Bombay, and an MS from the University of Minnesota.

Presentations

If you can’t measure it, you can’t improve it: How reporting and experimentation fuels product innovation at LinkedIn Session

Metrics measurement and experimentation play a crucial role in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data.

Rajiv Synghal is principal of big data strategy at Kaiser Permanente. Previously, he held delivery and architecture roles in Fortune 100 organizations, including Visa and Nokia, and startups, such as Kivera. An accomplished strategic thinker and adviser to senior management on issues around growth, profitability, competition, and innovation, Rajiv is equally adept at presenting value propositions to top management and doing deep dives with fellow engineers. Rajiv is the rare kind of technology professional who carries within him the pragmatism of business urgency and the will to find a way to solve a problem no matter what it takes. He has demonstrated an uncanny ability to learn and teach new concepts, easily adapt to change, and manage multiple concurrent tasks. Rajiv is currently advising a number of startups in the big data space that are developing technologies to provide strategic solutions to challenges in the healthcare field.

Presentations

Building a flu predictor model for improved patient care Data Case Studies

As healthcare data becomes increasingly digitized, medical centers are able to leverage data in new ways to improve patient care. Each year, as many as 49,000 people die of the flu in the US alone. Rajiv Synghal explains how Kaiser Permanente developed a sophisticated flu predictor model to better determine where resources were needed and how to reduce outbreaks.

Pawel Szostek is a senior software engineer on Criteo’s analytics data storage team, where he works on various projects, including implementing an improved HyperLogLog algorithm. Previously, he was a researcher at CERN in Geneva.
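The HyperLogLog algorithm mentioned above can be sketched in a few lines of pure Python. This is the textbook version, not Criteo's improved variant: hash each item, use the low bits to pick a register, and keep the longest run of zero bits seen in the remaining hash bits per register.

```python
# Toy HyperLogLog cardinality estimator (no small/large-range corrections).
import hashlib

def hll_estimate(items, p=10):
    m = 1 << p                      # number of registers (1024 for p=10)
    registers = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = h & (m - 1)           # low p bits choose the register
        w = h >> p                  # remaining hash bits
        rank = 1                    # position of the first 1 bit in w
        while w & 1 == 0 and rank <= 64:
            rank += 1
            w >>= 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)          # bias correction for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)

# Estimates ~10,000 distinct values using only 1,024 small registers;
# expected relative error is about 1.04 / sqrt(m), roughly 3% here.
print(round(hll_estimate(range(10_000))))
```

The appeal for a platform like Criteo's is that the registers are tiny and mergeable, so per-node sketches can be combined into a cluster-wide distinct count without shipping raw data.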

Presentations

Hive as a service Session

Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.

Ran Taig is a senior data scientist at Dell-EMC, where he leads data science projects, especially in the domain of hardware failure prediction, and plays a key role in designing the team’s engagement models and work structure, serving as a consultant to EMC’s business data lake team. Ran is also responsible for the team’s academic relations and continues to teach theory courses for CS students. Previously, Ran taught the design of algorithms course and other CS theory courses to undergraduates at Ben-Gurion University. He holds a PhD in computer science from Ben-Gurion University, Israel, where he specialized in artificial intelligence. His research mainly focused on automated planning.

Presentations

AI-powered crime prediction Session

What if we could predict when and where the next crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly available dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes to find interesting trends and make predictions for the future.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Spark NLP in Action: Improving Patient Flow Forecasting at Kaiser Permanente Session

This is a real-world case study of applying Spark NLP, the open source NLP library for Apache Spark, to one of the most common challenges in applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline.

Yulia Tell is a technical program manager on the big data technologies team within the Software and Services Group at Intel, where she is working on several open source projects and partner engagements in the big data domain. Yulia’s work is focused specifically on Apache Hadoop and Apache Spark, including big data analytics applications that use machine learning and deep learning.

Presentations

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.

Daniel Templeton has a long history in high-performance computing, open source communities, and technology evangelism. Today Daniel works on the YARN development team at Cloudera, focused on the resource manager, fair scheduler, and Docker support.

Presentations

What's new in Hadoop 3.0 Session

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Siddharth Teotia is a software engineer at Dremio and a contributor to the Apache Arrow project. Previously, Siddharth was on the database kernel team at Oracle, where he worked on storage, indexing, and the in-memory columnar query processing layers of Oracle RDBMS. He holds an MS in software engineering from CMU and a BS in information systems from BITS Pilani, India. During his studies, Siddharth focused on distributed systems, databases, and software architecture.

Presentations

Vectorized query processing using Apache Arrow Session

Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.
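The row-at-a-time versus vectorized contrast at the heart of this line of work can be illustrated in plain Python. This is a pedagogical toy with hypothetical column names; Dremio and Arrow operate on typed columnar buffers with SIMD-friendly layouts, not Python lists.

```python
# Classic Volcano-style iterator model: one call round trip per row.
def row_at_a_time(rows):
    total = 0
    for row in rows:
        if row["price"] > 10:
            total += row["price"]
    return total

# Vectorized model: operate on a column one batch at a time, amortizing
# per-call overhead and keeping a tight inner loop over contiguous values.
def vectorized(price_column, batch_size=4):
    total = 0
    for start in range(0, len(price_column), batch_size):
        batch = price_column[start:start + batch_size]
        total += sum(p for p in batch if p > 10)
    return total

prices = [5, 12, 30, 8, 11, 2, 40]
rows = [{"price": p} for p in prices]
print(row_at_a_time(rows), vectorized(prices))  # 93 93
```

In a real engine the batch loop compiles to branch-light code over cache-resident column vectors, which is where the CPU efficiency gains the session discusses come from.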

Meena Thandavarayan is a practice lead at Infosys, where he focuses on leveraging technical advancements and industry reference architectures for defining a data delivery platform. Meena has extensive experience leading application, technology, data, and infrastructure teams developing strategy, architecture, implementation, and IT operational services. A big data and analytics evangelist, he specializes in strategy for accelerating the digitization journey for oil and gas clients: most recently, he delivered functional and technical architecture for a one-stop self-service data and information portal.

Presentations

Meta your data; drain the big data swamp Data Case Studies

Madhav Madaboosi and Meenakshisundaram Thandavarayan offer an overview of BP's self-service operational data lake, which improved operational efficiency, boosting productivity through fully identifiable data and reducing risk of a data swamp. They cover the path and big data technologies that BP chose, lessons learned, and pitfalls encountered along the way.

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Ganesh is an executive director of enterprise architecture at Kaiser Permanente and leads its Systems of Intelligence program. He is responsible for establishing the vision, laying the foundation, and setting the technology direction for data and analytics capabilities across the enterprise. The Systems of Intelligence architecture is key to leveraging 100+ petabytes of data as a strategic asset to transform healthcare and to support the 14,000 data analysts inside Kaiser Permanente. Ganesh has over a decade of experience in healthcare IT innovation and was previously the head of enterprise architecture at Blue Shield of California.

Presentations

Spark NLP in Action: Improving Patient Flow Forecasting at Kaiser Permanente Session

This is a real-world case study of applying Spark NLP, the open source NLP library for Apache Spark, to one of the most common challenges in applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline.

Wee Hyong Tok is a principal data science manager at Microsoft, where he works with teams to cocreate new value and turn each of the challenges facing organizations into compelling data stories that can be concretely realized using proven enterprise architecture. Wee Hyong has worn many hats in his career, including developer, program and product manager, data scientist, researcher, and strategist, and his range of experience has given him unique superpowers to nurture and grow high-performing innovation teams that enable organizations to embark on their data-driven digital transformations using artificial intelligence. He strongly believes in story-driven innovation and has a passion for leading artificial intelligence-driven innovations and working with teams to envision how these innovations can create new competitive advantage and value for their business. He coauthored one of the first books on Azure Machine Learning, Predictive Analytics Using Azure Machine Learning, and authored another demonstrating how database professionals can do AI with databases, Doing Data Science with SQL Server.

Presentations

How does a big data professional get started with AI? Session

Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI.

Carlo Torniai is head of data science and analytics at Pirelli. An accomplished data scientist with experience ranging across various areas of computer science and information technology, Carlo has extensive experience in data modeling, data analysis, and data engineering, as well as with Python in the data science space (e.g., pandas, scipy, scikit-learn). Previously, he was a staff data scientist at Tesla Motors. He received his PhD in informatics from the Università degli Studi di Firenze, Italy.

Presentations

Pirelli Connesso: Where the road meets the cloud Session

Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of different contributions across cross-functional teams.

Amy Unruh is a developer programs engineer for the Google Cloud Platform, where she focuses on machine learning and data analytics as well as other Cloud Platform technologies. Amy has an academic background in CS/AI and has also worked at several startups, done industrial R&D, and published a book on App Engine.

Presentations

Getting started with TensorFlow Tutorial

Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from a conceptual overview all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry.

Ayin Vala is the cofounder and chief data scientist of the nonprofit organization Foundation for Precision Medicine, where he and his research and development team work on statistical analysis and machine learning, pharmacogenetics, molecular medicine, and sciences relevant to the advancement of medicine and healthcare delivery. Ayin has won several awards and patents in the healthcare, aerospace, energy, and education sectors. He also volunteers at DataKind, where he leads machine learning efforts in humanitarian projects. Ayin holds master’s degrees in information management systems from Harvard University and mechanical engineering from Georgia Tech.

Presentations

Reinventing healthcare: Early detection of Alzheimer’s disease with deep learning Session

Complex diseases like Alzheimer’s cannot be cured by pharmaceutical or genetic sciences alone, and current treatments and therapies lead to mixed successes. Ayin Vala explains how to use the power of big data and AI to treat challenging diseases with personalized medicine, which takes into account individual variability in medicine intake, lifestyle, and genetic factors for each patient.

Crystal Valentine is the vice president of technology strategy at MapR Technologies. She has nearly two decades’ experience in big data research and practice. Previously, Crystal was a consultant at Ab Initio, where she worked with Fortune 500 companies to design and implement high-throughput, mission-critical applications and with equity investors as a technical expert on competing technologies and market trends. She was also a tenure-track professor in the Department of Computer Science at Amherst College. She is the author of several academic publications in the areas of algorithms, high-performance computing, and computational biology and holds a patent for extreme virtual memory. Crystal was a Fulbright Scholar in Italy and holds a PhD in computer science from Brown University as well as a bachelor’s degree from Amherst College.

Presentations

DataOps: An Agile methodology for data-driven organizations Session

DataOps—a methodology for developing and deploying data-intensive applications, especially those involving data science and machine learning pipelines—supports cross-functional collaboration and fast time to value with an Agile, self-service workflow. Crystal Valentine offers an overview of this emerging field and explains how to implement a DataOps process.

Vinithra Varadharajan is an engineering manager in the cloud organization at Cloudera, where she is responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

A deep dive into running data analytic workloads in the cloud Tutorial

Vinithra Varadharajan, Philip Langdale, and Eugene Fratkin lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.

Emre Velipasaoglu is principal data scientist at Lightbend. A machine learning expert, Emre previously served as principal scientist and senior manager at Yahoo! Labs. He has authored 23 peer-reviewed publications and nine patents in search, machine learning, and data mining. Emre holds a PhD in electrical and computer engineering from Purdue University and completed postdoctoral training at Baylor College of Medicine.

Presentations

Machine-learned model quality monitoring in fast data and streaming applications Session

Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications.

Shivaram Venkataraman is a postdoctoral researcher at Microsoft Research, Redmond. Starting in fall 2018, he will be an assistant professor in computer science at the University of Wisconsin, Madison. Shivaram holds a PhD from the University of California, Berkeley, where he was advised by Mike Franklin and Ion Stoica. His work spans distributed systems, operating systems, and machine learning, and his recent research has looked at designing systems and algorithms for large-scale data analysis.

Presentations

Accelerating deep learning on Apache Spark with coarse-grained scheduling Session

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman, Sergey Ermolin, and Ding Ding outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the creation of the Lightbend Fast Data Platform, a streaming data platform built on the Lightbend Reactive Platform, Kafka, Spark, Flink, and Mesosphere DC/OS. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He is a contributor to several open source projects. He’s also the co-organizer of several conferences around the world and several user groups in Chicago.

Presentations

Kafka streaming applications with Akka Streams and Kafka Streams Session

Dean Wampler explores two microservice streaming applications based on Kafka to compare and contrast using Akka Streams and Kafka Streams for data processing. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead.

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams Tutorial

Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.

Andrew Wang is a software engineer on the HDFS team at Cloudera. Previously, he was a graduate student in the AMPLab at the University of California, Berkeley, advised by Ion Stoica, where he worked on research related to in-memory caching and quality of service. In his spare time, he enjoys going on bike rides, cooking, and playing guitar.

Presentations

What's new in Hadoop 3.0 Session

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

Jiao (Jennie) Wang is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She is engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.

Presentations

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark Session

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.

Rachel Warren is a programmer, data analyst, adventurer, and aspiring data scientist. After spending a semester helping teach algorithms and software engineering in Africa, Rachel has returned to the Bay Area, where she is looking for work as a data scientist or programmer. Previously, Rachel worked as an analyst for both Pandora and the Political Science department at Wesleyan. She is currently interested in pursuing a more technical, algorithmic approach to data science and is particularly passionate about dynamic learning algorithms (ML) and text analysis. Rachel holds a BA in computer science from Wesleyan University, where she completed two senior projects: an application that uses machine learning and text analysis for the Computer Science department and a critical essay exploring the implications of machine learning on the analytic philosophy of language for the Philosophy department.

Presentations

Playing well together: Big data beyond the JVM with Spark and friends Session

Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).

Jennifer Webb is vice president of development and operations at SuprFanz. Jennifer has over 10 years' experience as a website and application developer for large and small companies, including major banks, and as a keyboardist in rock bands in Toronto, Calgary, and Vancouver.

Presentations

Data science in practice: Examining events in social media Media and Advertising

Ray Bernard and Jennifer Webb explain how cloud-based marketing company SuprFanz uses data science techniques and graph theory with Neo4j to generate live event attendance from social media platforms, email, and SMS.

Brooke Wenig is a consultant for Databricks and a teaching associate at UCLA, where she has taught graduate machine learning, senior software engineering, and introductory programming courses. Previously, Brooke worked at Splunk and Under Armour as a KPCB fellow. She holds an MS in computer science with highest honors from UCLA with a focus on distributed machine learning. Brooke speaks Mandarin Chinese fluently and enjoys cycling.

Presentations

Apache Spark programming 2-Day Training

Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.

Jonathon Whitton is director of data services at PRGX, a global account services company that helps clients better manage and leverage their AP and supplier data. There he leads a big data initiative that has resulted in 10x faster processing of client data, lowering the cost of storage and increasing the availability of data to business partners in the recovery audit, profit optimization, fraud prevention, healthcare, and oil and gas business lines. Jonathon has over 20 years of experience in technology, specializing in big data, Hadoop, process transformation, migration, and business analysis. Previously, he was licensed in NY, NJ, and CT to provide insurance-related advice as a financial planner; was a top-rated technical instructor with ExecuTrain; and served in the 1/75 Ranger Regiment. Jonathon holds an MBA from Kennesaw State University and a bachelor's degree from Duke University.

Presentations

Data wrangling for retail giants Session

PRGX is a global leader in recovery audit and source-to-pay (S2P) analytics services, serving around 75% of the top 20 global retailers. Matt Derda and Jonathon Whitton explain how PRGX uses Trifacta and Cloudera to scale current processes and increase revenue for the products and services it offers clients.

Josh Wills is a software engineer on Slack’s search, learning, and intelligence team. Previously, Josh built data teams, products, and infrastructure at Google and Cloudera. He is the founder and vice president of the Apache Crunch project for creating optimized MapReduce pipelines in Java and lead developer of Cloudera ML, a set of open source libraries and command-line tools for building machine learning models on Hadoop. Josh is a coauthor of Advanced Analytics with Spark. He is also known for his pithy definition of a data scientist as “someone who is better at software engineering than any statistician and better at statistics than any software engineer.”

Presentations

Data science at Slack Session

Josh Wills describes recent data science and machine learning projects at Slack.

Vincent Xie is a software engineer at Intel, where he works on machine learning- and big data-related domains. He holds a master’s degree in engineering from Shanghai Jiaotong University.

Presentations

Spark ML optimization at Intel: A case study Session

Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on in Spark ML and introduce the methodology behind Intel's work on Spark ML optimization.

Ya Xu is principal staff engineer and statistician at LinkedIn, where she leads a team of engineers and data scientists building a world-class online A/B testing platform. She also spearheads taking LinkedIn’s A/B testing culture to the next level by evangelizing best practices and pushing for broad-based platform adoption. She holds a PhD in statistics from Stanford University.

Presentations

If you can’t measure it, you can’t improve it: How reporting and experimentation fuels product innovation at LinkedIn Session

Metrics measurement and experimentation play a crucial role in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data.

Yu Xu is the founder and CEO of TigerGraph, the world’s first native parallel graph database. He is an expert in big data and parallel database systems and has over 26 patents in parallel data management and optimization. Previously, Yu worked on Twitter’s data infrastructure for massive data analytics and was Teradata’s Hadoop architect leading the company’s big data initiatives. Yu holds a PhD in computer science and engineering from the University of California, San Diego.

Presentations

Real-time deep link analytics: The next stage of graph analytics Session

Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups.

Fabian Yamaguchi is the chief scientist at ShiftLeft. Fabian has over 10 years of experience in the security domain, where he has worked as a security consultant and researcher focusing on manual and automated vulnerability discovery. He has identified previously unknown vulnerabilities in popular system components and applications such as the Microsoft Windows kernel, the Linux kernel, the Squid proxy server, and the VLC media player. Fabian is a frequent speaker at major industry conferences such as Black Hat USA, DEF CON, FIRST, and CCC and renowned academic security conferences such as ACSAC, Security and Privacy, and CCS. He holds a master's degree in computer engineering from Technical University Berlin and a PhD in computer science from the University of Goettingen.

Presentations

Code Property Graph: A modern, queryable data storage for source code Session

Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably in distributed graph databases over the web while remaining rapidly accessible.

Yi Yin is a software engineer on the data engineering team at Pinterest, working on Kafka-to-S3 persisting tools and schema generation of Pinterest’s data.

Presentations

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously Session

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.

Juan Yu is a software engineer at Cloudera working on the Impala project, where she helps customers investigate, troubleshoot, and resolve escalations and analyzes performance issues to identify bottlenecks, failure points, and security holes. Juan also implements enhancements in Impala to improve customer experience. Previously, Juan was a software engineer at Interactive Intelligence and held developer positions at Bluestreak, Gameloft, and Engenuity.

Presentations

How to use Impala query plan and profile to fix performance issues Tutorial

Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu explores the cost model the Impala planner uses, how Impala optimizes queries, how to identify performance bottlenecks through the query plan and profile, and how to drive Impala to its full potential.

Ali Zaidi is a data scientist in Microsoft's AI and Research Group, where he spends his day trying to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. Previously, Ali was a research associate at NERA (National Economic Research Associates), providing statistical expertise on financial risk, securities valuation, and asset pricing. He studied statistics at the University of Toronto and computer science at Stanford University.

Presentations

Using R and Python for scalable data science, machine learning, and AI Tutorial

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.

Ye Zhou is a software engineer on LinkedIn's Hadoop infrastructure development team, focusing mostly on Hadoop YARN and Spark-related projects. Ye holds a master's degree in computer science from Carnegie Mellon University.

Presentations

Spark for everyone: Self-service monitoring and tuning Session

Spark applications need to be well tuned so that individual applications run quickly and reliably, and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

Angela Zutavern is a vice president at Booz Allen Hamilton, where she focuses on next-gen analytics and has helped radically transform how a wide array of companies, nonprofits, and major government agencies approach and use data. A leading expert on machine intelligence, Angela has worked with clients in every major US cabinet-level department, advised Fortune 500 companies, and led teams across every major industry. She pioneered the application of machine intelligence to organizational leadership and strategy and helped create the Data Science Bowl—a first-of-its-kind world-class competition that solves global issues through machine intelligence. She is an enthusiastic champion of women in data science. Angela is coauthor, along with Josh Sullivan, of The Mathematical Corporation: Where Machine Intelligence and Human Ingenuity Achieve the Impossible.

Presentations

The mathematical corporation: A new leadership mindset for the machine intelligence era Session

How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. Angela Zutavern shares insights from her work with pioneering companies, government agencies, and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.”