Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Speakers

Hear from innovative CxOs, talented data practitioners, and senior engineers who are leading the data industry. More speakers will be announced; please check back for updates.

As a director of data science at Salesforce Einstein, Sarah works to democratize machine learning by building products that allow customers to deploy predictive apps in a few short clicks. Her team builds automated machine learning pipelines and products like Einstein Prediction Builder. Prior to Salesforce she led teams in healthcare and life sciences at Pivotal building models for customers. She is a recovering academic and entrepreneur, with a PhD in Biomedical Informatics from Stanford University and a passion for building diverse teams, education and exploration.

Presentations

Automated Machine Learning for Agile Data Science At Scale Session

How does Salesforce manage to make data science an agile partner to over 100,000 customers? We will share the nuts and bolts of the platform and our agile process. From our open-source AutoML library (TransmogrifAI) and experimentation to deployment and monitoring, we will cover how these tools make it possible for our data scientists to rapidly iterate and adopt a truly agile methodology.

Vijay Srinivas Agneeswaran is a senior director of technology at SapientRazorfish. Vijay has spent the last 10 years creating intellectual property and building products in the big data area at Oracle, Cognizant, and Impetus, including building PMML support into Spark/Storm and implementing several machine learning algorithms, such as LDA and random forests, over Spark. He also led a team that built a big data governance product for role-based, fine-grained access control inside of Hadoop YARN and built the first distributed deep learning framework on Spark. Earlier in his career, Vijay was a postdoctoral research fellow at the LSIR Labs within the Swiss Federal Institute of Technology, Lausanne (EPFL). He is a senior member of the IEEE and a professional member of the ACM. He holds four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, cloud, grid, peer-to-peer computing, machine learning for big data, and other emerging technologies. Vijay holds a bachelor’s degree in computer science and engineering from SVCE, Madras University, an MS (by research) from IIT Madras, and a PhD from IIT Madras.

Presentations

The Hitchhiker's Guide to Deep Learning Based Recommenders in Production Tutorial

This tutorial describes deep learning based recommender and personalisation systems that we have built for clients. It primarily presents TensorFlow Serving and MLflow for end-to-end productionization, including model serving, dockerization, reproducibility, and experimentation, plus how to use Kubernetes for deployment and orchestration of ML-based microservice architectures.

Jaipaul Agonus is a director in FINRA’s Market Regulation Technology organization. He is a big data engineering leader with around 18 years of IT industry experience, specializing in big data analytics and cloud-based solutions, and is currently building next-generation big data market analytics platforms with machine learning, advanced visualization, and contextual access across applications.

Presentations

Scaling Visualization for Big Data and Analytics in the Cloud Session

This talk will focus on big data analytics and visualization practices and tools used by FINRA to support machine learning and other surveillance activities that the Market Regulation department conducts in the AWS cloud.

Shradha is currently working as a data scientist at Adobe Systems in San Jose. Prior to that, she completed her master’s in computer science with an AI/ML track at the University of California, San Diego. She has published papers and patent applications to her credit.

Presentations

Efficient Multi-armed Bandit with Thompson Sampling for applications with Delayed feedback Session

Decision making often struggles with the exploration-exploitation dilemma, and multi-armed bandits (MAB) are a popular reinforcement learning approach for tackling it. However, increasing the number of decision criteria leads to an exponential blowup in the complexity of MAB, and observational delays prevent optimal performance. This talk will introduce MAB and explain how to overcome these challenges.
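As a flavor of the technique (a minimal sketch, not the speaker’s implementation), Thompson sampling for a Bernoulli bandit keeps a Beta posterior per arm and pulls the arm with the highest posterior draw each round:

```python
import random

def thompson_sampling(true_rates, rounds=5000, seed=0):
    """Bernoulli multi-armed bandit solved with Thompson sampling.

    Each arm keeps a Beta(wins+1, losses+1) posterior over its reward
    rate; every round we sample one draw per arm and pull the arm with
    the highest draw, so exploration fades as evidence accumulates.
    Returns the number of pulls per arm.
    """
    rng = random.Random(seed)
    n = len(true_rates)
    wins = [0] * n
    losses = [0] * n
    for _ in range(rounds):
        # Sample a plausible reward rate for each arm from its posterior.
        draws = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n)]
        arm = max(range(n), key=lambda i: draws[i])
        # Observe a Bernoulli reward and update that arm's posterior.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [wins[i] + losses[i] for i in range(n)]

pulls = thompson_sampling([0.2, 0.5, 0.8])
print(pulls)  # the 0.8 arm should receive most pulls
```

The delayed-feedback variant discussed in the session would defer the posterior updates until rewards actually arrive.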

Sridhar Alla is director of data science and engineering at Comcast. He works on solving very large-scale problems in data processing and machine learning pipelines over terabytes of data, using big data technologies such as Spark, AWS, and Azure, as well as GPU-based deep learning built on TensorFlow.

Presentations

Anomaly detection using deep learning to measure quality of Large Datasets Session

Any business, big or small, depends on analytics, whether the goal is revenue generation, churn reduction, or sales and marketing. No matter the algorithms and techniques used, the result depends on the accuracy and consistency of the data being processed. In this talk, we will present some techniques used to evaluate the quality of data and the means to detect anomalies in the data.
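To make the idea concrete (a deliberately simple statistical stand-in for the deep learning detectors the session covers), the same pattern at toy scale is: model what “normal” data looks like and flag large deviations, e.g. with z-scores over daily row counts:

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=3.0):
    """Return the values whose z-score exceeds the threshold.

    Computes the mean and sample standard deviation of the series and
    flags points that deviate by more than `threshold` standard
    deviations -- a crude but illustrative data-quality check.
    """
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if sigma and abs(v - mu) / sigma > threshold]

# Hypothetical daily row counts of an ingested table; the final day
# shows a sudden drop that a quality monitor should catch.
daily_row_counts = [1000, 1020, 980, 1010, 995, 1005, 40]
print(zscore_anomalies(daily_row_counts, threshold=2.0))
```

A deep learning detector replaces the mean/stdev model of “normal” with a learned one, but the flagging logic is analogous.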

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. He works with companies ranging from startups to Fortune 100 companies on Big Data. This includes training on cutting edge technologies like Apache Kafka, Apache Hadoop and Apache Spark. He has taught over 30,000 people the skills to become data engineers. He is widely regarded as an expert in the field and for his novel teaching practices. Jesse has published with O’Reilly and Pragmatic Programmers. He has been covered in prestigious publications such as The Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Creating a Data Engineering Culture at USAA Session

What happens when you have a data science organization, but no data engineering organization? This is what happened at USAA. In this session, we will share what happened without data engineering, how we fixed it, and the results.

Professional Kafka Development 2-Day Training

This training takes participants through an in-depth look at Apache Kafka. We show how Kafka works and how to create real-time systems with it, including how to create consumers and publishers. Then we look at Kafka’s ecosystem and how each component is used, showing how to use Kafka Streams, Kafka Connect, and KSQL.

Professional Kafka Development (Day 2) Training Day 2

This training takes participants through an in-depth look at Apache Kafka. We show how Kafka works and how to create real-time systems with it, including how to create consumers and publishers. Then we look at Kafka’s ecosystem and how each component is used, showing how to use Kafka Streams, Kafka Connect, and KSQL.

Zachery Anderson is the Chief Analytics Officer and Senior Vice President at Electronic Arts (EA), the world’s largest video game company. He is responsible for leading Consumer Insights, UX Research, Data Science, Studio Analytics, and Marketing Analytics for EA. His team uses in-game behavioral data, traditional consumer research, lab work, and online advertising data to provoke and inspire EA’s development and marketing teams to think and act “Player First.” Prior to joining EA in 2007, Zachery was head of consulting and modeling for J.D. Power and Associates’ PIN group, Corporate Economist for Nissan North America, and Economist for the private investment company Fremont Group.

Zachery’s work has been highlighted in the Harvard Business Review and the MIT Sloan Management Review. His work has won many awards, including the INFORMS Marketing Science Practice Prize, and while at Nissan he was recognized by the U.S. Federal Reserve for the Best Industry Forecast. He is a member of the University of California Master of Science in Business Analytics Industry Advisory Board.

Zachery’s undergraduate degree in Political Science and Communications is from Southern Illinois University. His graduate work was at UCLA, in Economics and Political Science, where he studied game theory with Nobel Prize Winner Lloyd Shapley.

Presentations

Purchase, Play, and Upgrade Data for Video Game Players Session

A case study presented by leadership at the Wharton Customer Analytics Initiative and Electronic Arts about the WCAI Research Opportunity process and how some of EA’s business problems were solved using their data by 11 teams of researchers from around the world.

Keynote with Zachery Anderson Keynote

Zachery Anderson, SVP & Chief Analytics Officer, Electronic Arts

Eva Nahari has been working with Java virtual machine technologies, SOA, cloud, and other enterprise middleware solutions for the past 15+ years, including a past as a developer of JRockit (the world’s fastest JVM). Eva has been awarded two patents on garbage collection heuristics and algorithms, and she pioneered deterministic garbage collection. She has managed many technical partnerships, among others with Sun, Intel, Dell, and Red Hat, as well as multi-component software integration projects (JRockit/Coherence/WebLogic, Zing/RHEL, Cloudera Search). After productizing the world’s only pauseless JVM, Zing, at Azul Systems, she joined Cloudera in 2012. Since then, she has helped drive the future of distributed data processing and machine learning applications through Cloudera’s Distribution of Hadoop and expedited the next generation of integrated search engines. She has an M.Sc. in artificial intelligence and autonomous systems from the Royal Institute of Technology in Stockholm, Sweden.

Presentations

How to survive the future data warehousing challenges with the help of hybrid cloud Session

In this talk, you will learn how Cloudera’s finance department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. Learn from our experience some guidelines for deploying modern data warehousing in a hybrid cloud environment: When should you choose private vs. public cloud services? What options are there? What are the dos and don’ts?

June Andrews is a Principal Data Scientist at GE building a machine learning platform used for monitoring the health of airplanes and power plants around the world. Previously, she worked at Pinterest spearheading the Data Trustworthiness and Signals Program to create a healthy data ecosystem for machine learning. She has also led efforts at LinkedIn on growth, engagement, and social network analysis to increase economic opportunity for professionals. June holds degrees in applied mathematics, computer science, and electrical engineering from UC Berkeley and Cornell.

Presentations

Critical Turbine Maintenance: Monitoring & Diagnosing Planes and Power Plants in Real Time Session

GE produces a third of the world's power and 60% of airplane engines. These engines form a critical portion of the world's infrastructure and require meticulous monitoring of the hundreds of sensors streaming data from each turbine. Here, we share the case study of releasing into production the first real-time ML systems used to determine turbine health by GE's monitoring and diagnostics teams.

From Data Driven to Data Competitive Data Case Studies

Companies have adopted data into their DNA in a variety of ways, including data driven, data enabled, and data informed. However, many implementations have fallen short of the promised ROI between the cost of investing in people and infrastructure and the business value delivered at the end of the day. Here we take a structured look at the ROI of using data and introduce being data competitive.

Tim Armstrong is an engineer at Cloudera, working on Apache Impala. He focuses on making Impala faster and more robust via improvements to query execution and resource management. Previously he completed a Ph.D. working at the intersection of high-performance computing and programming language implementation.

Presentations

When SQL Users Run Wild: Resource Management Features & Techniques to Tame Apache Impala Session

As the popularity and utilization of Apache Impala deployments increases, often clusters become victims of their own success when demand for resources exceeds the supply. This talk will dive into the latest resource management features in Impala to maintain high cluster availability and optimal performance as well as provide examples of how to configure them in your Impala deployment.

Shirya works on the data engineering team for Personalization at Netflix, which, among other things, delivers recommendations made for each user. The team is responsible for the data that goes into training and scoring of the various machine learning models that power the Netflix homepage. They have been working on moving some of the team’s core datasets from being processed in a once-a-day batch ETL to being processed in near real time using Apache Flink. Before Netflix, she was at Walmart Labs, where she helped build and architect the new generation of item setup, moving from batch processing to streaming. They used Storm and Kafka to enable a microservices architecture that allows products to be updated in near real time, as opposed to the once-a-day updates of the legacy framework.

Presentations

Taming large-state to join datasets for Personalization Session

With so much data being generated in real time, what if we could combine these high-volume data streams and provide near-real-time feedback for model training, improving personalization and recommendations and taking the customer experience on the product to a whole new level? It is possible to tame large-state joins for exactly that purpose using Flink’s keyed state.
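The core mechanic can be sketched in plain Python (a toy in-memory simulation, not Flink code): each side of the join buffers its events per key in state, and every arrival is matched against the other side’s buffered state, much as a Flink CoProcessFunction with keyed state would:

```python
from collections import defaultdict

def streaming_join(events):
    """Toy simulation of a keyed stream-stream join.

    `events` is a list of (side, key, value) tuples interleaved in
    arrival order, where side is "left" or "right". Each side keeps
    per-key buffered state, and each new event joins against the
    other side's state for the same key.
    """
    left_state = defaultdict(list)   # per-key state for the left stream
    right_state = defaultdict(list)  # per-key state for the right stream
    joined = []
    for side, key, value in events:
        if side == "left":
            left_state[key].append(value)
            joined.extend((key, value, r) for r in right_state[key])
        else:
            right_state[key].append(value)
            joined.extend((key, l, value) for l in left_state[key])
    return joined

# Hypothetical personalization events: impressions joined to plays.
events = [
    ("left", "user1", "impression_a"),
    ("right", "user1", "play_x"),
    ("left", "user2", "impression_b"),
    ("right", "user1", "play_y"),
]
print(streaming_join(events))
```

At production scale the "state" lives in Flink’s fault-tolerant keyed state backend with TTLs to bound its size, which is precisely the taming the session is about.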

David Arpin is a data scientist at Amazon Web Services.

Presentations

Building a recommender system with Amazon SageMaker Tutorial

Learn how to use the Amazon SageMaker platform to build a machine learning model to recommend products to customers based on their past preferences.

Dr. Aschbacher is both a Data Scientist and licensed Clinical Psychologist, with a specialty in Behavioral Medicine. One of her passions is to bridge the worlds of behavior change and data science, in order to transform health. She works at UCSF as an Associate Professor in Cardiology, and she serves as the Data Team Lead on the Health eHeart (HeH)/Eureka Digital Research Platform. Before that, she worked as a Data Scientist at the Silicon Valley start-up Jawbone, where she helped design, test, and analyze mini-interventions to help users make healthier behavior choices and lose weight. At UCSF, she continues to build active partnerships with companies in the behavior change and lifestyle medicine space. She enjoys finding creative ways to take knowledge from psychology, neuroscience, and biology and apply it to discover new insights in large datasets. When she is not at work, she enjoys being a mother to her two children, biking and dancing, and learning to speak Mandarin.

Dr. Aschbacher wants to acknowledge and thank her UCSF coauthors on this work: R Avram, G Tison, K Rutledge, M Pletcher, J Olgin, G Marcus.

Presentations

Machine Learning Prediction of Blood Alcohol Content: A Digital Signature of Behavior Session

Some people use digital devices to track their blood alcohol content (BAC) – for example, to avoid driving drunk. If a BAC-tracking App could anticipate when a person is likely to have a high BAC, it might offer coaching in a time of need. We offer a machine learning approach that predicts user BAC levels with good precision based on minimal information, thereby enabling targeted interventions.

Jitender Aswani supports the infrastructure and security data engineering teams at Netflix. His team designs, builds, and deploys scalable big data architecture and solutions to enable business and operations teams to achieve consistent capacity, reliability, and security gains. Jitender is a life-long student of smart data products and data science solutions that push organizations to make data-inspired decisions and adopt an analytics-first approach.

Presentations

Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability & Efficiency Session

Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. This session discusses Netflix’s internal data lineage service, aimed at establishing end-to-end lineage across millions of data artifacts, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency.

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Presentations

Automation of Root Cause Analysis for Big Data Stack Applications Session

This session describes an automated technique for root cause analysis (RCA) for big data stack applications using deep learning techniques. Spark and Impala will be used as examples, but the concepts generalize to the big data stack.

Kamil is co-founder and CTO of Starburst, the enterprise Presto company. Prior to co-founding Starburst, Kamil was the Chief Architect at the Teradata Center for Hadoop in Boston, focusing on the open source SQL engine Presto. Previously, he was the co-founder and chief software architect of Hadapt, the first SQL-on-Hadoop company, acquired by Teradata in 2014. Kamil began his journey with Hadoop and modern MPP SQL architectures about 10 years ago during a doctoral program at Yale University where he co-invented HadoopDB, the original foundation of Hadapt’s technology.

Kamil holds an M.S. in Computer Science from Wroclaw University of Technology, as well as an M.S. and an M.Phil. in Computer Science from Yale University.

Presentations

Presto: Tuning Performance of SQL-on-Anything Analytics Session

Presto, an open source distributed SQL engine, is designed for interactive queries and the ability to query multiple data sources. With an ever-growing list of connectors (e.g., Apache Kudu, Pulsar, Netflix Iceberg, Elasticsearch), the recently introduced cost-based optimizer in Presto must account for heterogeneous data sources with incomplete statistics and new use cases such as geospatial analytics.

Ambal is a technology-focused product management and marketing leader with a strong track record of leading strategic growth. She leads digital transformation and growth of the Data Organization segment of the IBM Analytics portfolio by creating and bringing to market digital and SaaS product offerings. Before that, she was the worldwide content marketing leader for Systems Hybrid Cloud DevOps, where she managed content marketing strategy and digital content design and production to drive business results. At Cisco, she focused on worldwide marketing and positioning of Cisco’s cloud and data center switching business. She brings both strong engineering and marketing skills, with verticals experience from many different industries.

Ambal received her Masters in Computer Science from Purdue University and an MBA in Marketing, Strategy and Entrepreneurship from the Wharton School of the University of Pennsylvania.

Presentations

How to extract stories from your data and tell them visually? It Can Be Done. We Will Show You How. Data Case Studies

Whether you are a tech or business professional, you must master the art of visual storytelling with data. To do that, you must learn how to find the story worth telling that is hidden in your data. As with many things in life, visual storytelling with data takes practice. But that doesn’t mean you can’t accelerate your learning from others’ mistakes and successes.

A technical evangelist and trainer for data Artisans, and author of the “Introduction of Hadoop Security” O’Reilly course.

Presentations

Introduction to Flink via Flink SQL Tutorial

This hands-on session introduces Flink via the SQL interface. You will receive an overview of stream processing, and a survey of Apache Flink with its various modes of use. Then we’ll use Flink to run SQL queries on data streams and contrast this with the Flink data stream API.
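To preview the kind of query the tutorial runs (a batch stand-in in plain Python, not Flink itself), a Flink SQL statement like `SELECT page, COUNT(*) FROM clicks GROUP BY TUMBLE(rowtime, INTERVAL '10' SECOND), page` continuously computes per-window counts; the equivalent aggregation over a finite event list is:

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Per-(window, key) counts, the aggregation a Flink SQL query
    with a TUMBLE group window computes continuously over a stream.

    `events` is a list of (timestamp_seconds, key) pairs; each event
    is assigned to the tumbling window containing its timestamp.
    """
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical click events: (event-time seconds, page key).
events = [(1, "page_a"), (4, "page_a"), (7, "page_b"), (12, "page_a")]
print(tumbling_window_counts(events, window_seconds=10))
```

The crucial difference in Flink is that results are emitted incrementally as watermarks advance, rather than after the input ends.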

Maxime Beauchemin works as a Senior Software Engineer at Lyft, where he develops open source products that reduce friction and help generate insights from data. He is the creator and a lead maintainer of Apache Airflow (incubating), a data pipeline workflow engine, and Apache Superset (incubating), a data visualization platform, and is recognized as a thought leader in the data engineering field. Before Lyft, Maxime worked at Airbnb on the Analytics & Experimentation Products team. Previously, he worked at Facebook on computation frameworks powering engagement and growth analytics, on clickstream analytics at Yahoo, and as a data warehouse architect at Ubisoft.

Presentations

Apache Superset - an Open Source Data Visualization Platform Session

This presentation will introduce Superset to the audience through a live demo. We will also discuss many aspects of the project, such as its open source development dynamics, security, architecture, underlying technologies, and the key items on the project’s roadmap.

John Bennett has been writing code for almost 20 years. Past stints include Blizzard and IGN. Bennett leads the data engineering efforts of Netflix’s Cloud Infrastructure Analytics team with a focus on security. For the past three years he has built large-scale data processing systems that provide anomaly detection, network visibility, and dependency insights. John is currently developing a template-driven platform that enables engineers to rapidly build streaming and batch ETL pipelines for detection purposes.

Presentations

Building and Scaling a Security Detection Platform, a Netflix Original Session

Data has become a foundational pillar for security teams operating in organizations of all shapes and sizes. This new norm has created a need for platforms that enable engineers to harness data for various security purposes. This talk introduces our internal platform aimed at quickly deploying data-based detection capabilities in the Netflix corporate environment.

Till Bergmann is a Senior Data Scientist at Salesforce Einstein, building platforms to make it easier to integrate machine learning into Salesforce products, with a focus on automating many of the laborious steps in the machine learning pipeline. Before Salesforce, he obtained a PhD in Cognitive Science at the University of California, Merced, where he studied collaboration patterns of academics using NLP techniques.

Presentations

How to train your model (and catch label leakage) Session

A common problem in predictive modeling is label leakage. At enterprise companies such as Salesforce, this problem takes on monstrous proportions, as the data is populated by diverse business processes, making it hard to distinguish cause from effect. We will describe how we tackled this problem at Salesforce, where we need to churn out thousands of customer-specific models for any given use case.
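A minimal sketch of why leakage is so insidious (illustrative only, not Salesforce’s detection method): a feature that a downstream business process fills in *after* the outcome is known will look like the best predictor and score perfectly, even though it is useless at prediction time:

```python
import random

def train_and_score(features, labels):
    """One-rule 'model': pick the single feature that best matches the
    label and report its training accuracy (enough to expose leakage)."""
    n_features = len(features[0])
    best = max(range(n_features),
               key=lambda j: sum(f[j] == y for f, y in zip(features, labels)))
    acc = sum(f[best] == y for f, y in zip(features, labels)) / len(labels)
    return best, acc

rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(200)]
# Feature 0 is genuinely predictive but noisy (agrees with the label
# 70% of the time); feature 1 is populated after the outcome by a
# hypothetical business process -- pure label leakage.
features = [[y if rng.random() < 0.7 else 1 - y, y] for y in labels]
best, acc = train_and_score(features, labels)
print(best, acc)  # the leaked feature wins with perfect accuracy
```

Real leakage detectors look for exactly this signature: features whose association with the label is implausibly strong given when they are populated.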

Ron Bodkin is a technical director on the applied artificial intelligence team at Google, where he provides leadership for AI success for customers in Google’s Cloud CTO office. Ron engages deeply with Global F500 enterprises to unlock strategic value with AI, acts as executive sponsor with Google product and engineering to deliver value from AI solutions, and leads strategic initiatives working with customers and partners. Previously, Ron was the founding CEO of Think Big Analytics, a company that provides end-to-end support for enterprise big data, including data science, data engineering, advisory, and managed services and frameworks such as Kylo for enterprise data lakes. When Think Big was acquired by Teradata, Ron led global growth, the development of the Kylo open source data lake framework, and the company’s expansion to architecture consulting; he also created Teradata’s artificial intelligence incubator.

Presentations

Applying Deep Learning at Google for Recommendations Session

Google uses deep learning extensively in new and existing products. Come learn how Google has used deep learning for recommendations at YouTube, in the Play store, and for customers in Google Cloud. Learn about the role of embeddings, recurrent networks, contextual variables, and wide-and-deep learning, and how to do both candidate generation and ranking with deep learning.

Dhruba Borthakur is co-founder and CTO at Rockset. Rockset builds software for enabling data powered applications.

Dhruba was the founding engineer of the open source RocksDB database at Facebook and one of the founding engineers of the Hadoop File System at Yahoo. He was also an early contributor to the open source Apache HBase project.

Earlier, he was a Senior Engineer at Veritas Software (since acquired by Symantec), where he was responsible for the development of VxFS, the Veritas SanPointDirect Storage System, and the Veritas San File System, and he was the team lead for the Mendocino Continuous Data Protection software appliance at a startup named Mendocino Software. Prior to that, he was co-founder and chief architect at Oreceipt.com, an e-commerce startup based in Sunnyvale, California, and a Senior Engineer at IBM-Transarc Labs, where he contributed to the development of the Andrew File System (AFS), part of IBM’s e-commerce initiative WebSphere. Before his time in the United States, Dhruba developed call-processing software for digital switching systems at C-DOT Delhi.

Dhruba has an M.S. in Computer Science from the University of Wisconsin, Madison and a B.S. in Computer Science from the Birla Institute of Technology and Science (BITS), Pilani, India. He has 25 issued patents.

Presentations

Rockset: Design and Implementation of a data system for low latency queries for Search and Analytics Session

Most existing big data systems prefer sequential scans for processing queries. We challenge this view and present converged indexing: a single system, called Rockset, that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.
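The core idea can be sketched in a few lines (a toy illustration of the concept, not Rockset’s implementation, which builds these structures on a write-optimized storage engine): maintain all three index types for every ingested document, so point lookups, column scans, and search-style predicates each have a fast path:

```python
def converged_index(docs):
    """Build the three index types behind converged indexing:
    a row (document) store for point lookups, a columnar store for
    analytical scans, and an inverted index for search predicates."""
    row_store = {}     # doc id -> full document
    column_store = {}  # field  -> list of values, in doc-id order
    inverted = {}      # (field, value) -> sorted list of doc ids
    for doc_id, doc in enumerate(docs):
        row_store[doc_id] = doc
        for field, value in doc.items():
            column_store.setdefault(field, []).append(value)
            inverted.setdefault((field, value), []).append(doc_id)
    return row_store, column_store, inverted

# Hypothetical documents.
docs = [{"city": "SF", "temp": 60}, {"city": "NY", "temp": 40},
        {"city": "SF", "temp": 62}]
rows, cols, inv = converged_index(docs)
print(inv[("city", "SF")])  # search: which docs mention SF?
print(cols["temp"])         # analytics: scan a single column
```

The trade-off, as the session notes, is write amplification (every document updates three structures), which is what elastic cloud resources and write-optimized engines make affordable.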

Eric T. Bradlow is currently Chairperson, Wharton Marketing Department, The K.P. Chao Professor, Professor of Marketing, Statistics, Economics and Education and Faculty Director of the Wharton Customer Analytics Initiative at The Wharton School of the University of Pennsylvania. He earned a Bachelor of Science in Economics from The Wharton School in 1988, an A.M. in Mathematical Statistics in 1990 and a Ph.D. in Mathematical Statistics in 1994 from Harvard University. He joined the Wharton faculty in 1996.

From 2008 to 2011, Eric was Editor-in-Chief of Marketing Science, the premier academic journal in Marketing. He was recently named one of eight inaugural University of Pennsylvania Fellows, a Fellow of the American Statistical Association, a Fellow of the American Education Research Association, a Fellow of the Wharton Risk Center, a Senior Fellow of the Leonard Davis Institute for Health Economics, is past chair of the American Statistical Association Section on Statistics in Marketing, is a statistical Fellow of Bell Labs, and was previously named DuPont Corporation’s best young researcher. His academic research interests include Bayesian modeling, statistical computing, and developing new methodology for unique data structures with application to business problems, education and psychometrics and health outcomes. He has won research awards in Marketing, Statistics, Psychology, Education and Medicine. His personal interests include his wife Laura, his sons Ethan, Zach, and Ben, and his love of sports and movies.

Presentations

Purchase, Play, and Upgrade Data for Video Game Players Session

A case study presented by leadership at the Wharton Customer Analytics Initiative and Electronic Arts about the WCAI Research Opportunity process and how some of EA’s business problems were solved using their data by 11 teams of researchers from around the world.

Claudiu Branzan is the vice president of data science and engineering at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

This is a hands-on tutorial for scalable NLP using the highly performant, highly scalable open-source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Mark Brine is the director of finance at Cloudera.

Presentations

How to survive the future data warehousing challenges with the help of hybrid cloud Session

In this talk, you will learn how Cloudera’s finance department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. Learn from our experience some guidelines for deploying modern data warehousing in a hybrid cloud environment: When should you choose private vs. public cloud services? What options are there? What are the dos and don’ts?

Kurt Brown leads the data platform team at Netflix. This team architects and manages the infrastructure for Netflix analytics, including various big data technologies (e.g. Spark, Flink, and Presto), machine learning infrastructure for Netflix data scientists, and some “traditional” BI tools (e.g. Tableau).

Presentations

Netflix - The journey towards a self-service data platform Session

The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. It includes big data technologies (e.g. Spark and Flink), enabling services (e.g. federated metadata management), and machine learning support. But with power comes complexity. I'll talk through how we are investing towards an easier, "self-service" data platform without sacrificing our enabling capabilities.

Stuart Buck is the Vice President of Research at the Laura and John Arnold Foundation, one of the leading funders of research to inform public policy. He has a Ph.D. in education policy from the University of Arkansas, where he studied econometrics, statistics, and program evaluation; a J.D. with honors from Harvard Law School, where he was an editor of the Harvard Law Review; and bachelor’s and master’s degrees in music performance from the University of Georgia. He has given advice to IARPA (the CIA’s research arm) and the White House Social and Behavioral Sciences Team on rigorous research processes. He has sponsored major efforts showing that even the best scientific research is often irreproducible; this work has been featured in Wired, the Economist, the New York Times, and the Atlantic. He has also published in top journals (such as Science and BMJ) on how to make research more accurate.

Presentations

What the Reproducibility Problem Means for Your Business Session

Academic research has been plagued by a reproducibility crisis in fields ranging from medicine to psychology. This talk will explain how to take precautions in your data analysis and experiments so as to avoid those reproducibility problems.

Andrew Burt is chief privacy officer and legal engineer at Immuta, the data management platform for the world’s most secure organizations. He is also a visiting fellow at Yale Law School’s Information Society Project. Previously, Andrew was a special advisor for policy to the head of the FBI Cyber Division, where he served as lead author on the FBI’s after-action report on the 2014 attack on Sony. A leading authority on the intersection of machine learning, regulation, and law, Andrew has published articles on technology, history, and law in the New York Times, the Financial Times, Slate, and the Yale Journal of International Affairs. His book, American Hysteria: The Untold Story of Mass Political Extremism in the United States, was called “a must-read book dealing with a topic few want to tackle” by Nobel laureate Archbishop Emeritus Desmond Tutu. Andrew holds a JD from Yale Law School and a BA from McGill University. He is a term member of the Council on Foreign Relations, a member of the Washington, DC, and Virginia State Bars, and a Global Information Assurance Certified (GIAC) cyber incident response handler.

Presentations

Manage the Risks of ML - In Practice! Tutorial

This tutorial will provide a hands-on overview of how to train, validate, and audit machine learning (ML) models across the enterprise. As ML becomes increasingly important, managing its risks is quickly becoming one of the biggest challenges to the technology’s widespread adoption. Join us to walk through practical tools and best practices that help safely deploy ML.

Igor is a core developer of the open source RocksDB project.

Presentations

ROCKSET: Design and Implementation of a data system for low latency queries for Search and Analytics Session

Most existing big data systems prefer sequential scans for processing queries. We challenge this view and present converged indexing: a single system, called ROCKSET, that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.

Data science expert and software system architect with expertise in machine learning and big data systems. Rich experience leading innovation projects and R&D activities to promote data science best practices within large organizations. Deep domain knowledge of various vertical use cases (finance, telco, healthcare, etc.). Currently pushing the cutting-edge application of AI at the intersection of high-performance databases and IoT, focusing on unleashing the value of spatial-temporal data. Also a frequent speaker at technology conferences, including the O’Reilly Strata and AI Conferences, NVIDIA GPU Technology Conference, Hadoop Summit, DataWorks Summit, Amazon re:Invent, Global Big Data Conference, Global AI Conference, World IoT Expo, and Intel Partner Summit, presenting keynote talks and sharing thoughts on technology leadership.

Received a Ph.D. from the Department of Computer and Information Science (CIS), University of Pennsylvania, advised by Professor Insup Lee (ACM Fellow, IEEE Fellow). Has published and presented research papers and posters at many top-tier conferences and journals, including ACM Computing Surveys, ACSAC, CEAS, EuroSec, FGCS, HiCoNS, HSCC, IEEE Systems Journal, MASHUPS, PST, SSS, TRUST, and WiVeC. Has served as a reviewer for many highly reputable international journals and conferences.

Presentations

Building the AI Engine for Retail in the New Era Session

We share the design of the AI engine on the Alibaba TSDB service, which enables fast and complex analytics of large-scale retail data, along with a successful case study of Fresh Hema Supermarket, a major “New Retail” platform operated by Alibaba Group. We will highlight our solutions to the major technical challenges in data cleaning, storage, and processing.

Haifeng Chen is a senior software architect at Intel’s Asia Pacific R&D Center. He has more than 12 years’ experience in software design and development, big data, and security, with a particular interest in image processing. Haifeng is the author of image browsing, editing, and processing software ColorStorm.

Presentations

Spark Adaptive Execution: Unleash the Power of Spark SQL Session

Spark SQL is widely used today. However, it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. To address these challenges, we introduced the Spark adaptive execution engine, which handles task parallelism, join conversion, and data skew dynamically at runtime, guaranteeing that the best plan is chosen using runtime statistics.

Jeff Chen is BEA’s Chief Innovation Officer, responsible for integrating advancements in data science and machine learning to advance the Bureau’s capabilities. A statistician and data scientist, Jeff has extensive experience launching and leading data science initiatives in over 40 domains, working with diverse stakeholders such as firefighters, climatologists, and technologists to introduce data science and new technologies that advance their missions. Before coming to BEA, he served as the U.S. Department of Commerce’s Chief Data Scientist; as a White House Presidential Innovation Fellow with NASA and the White House Office of Science and Technology Policy, focused on data science for the environment; and as the first Director of Analytics at the NYC Fire Department, where he engineered pioneering algorithms for fire prediction. He was also among the first data scientists at the NYC Mayor’s Office under then-Mayor Mike Bloomberg. Jeff started his career as an econometrician at an international engineering consultancy, where he developed forecasting and prediction models supporting large-scale infrastructure investment projects. In the evenings, he is an adjunct professor of data science at Georgetown University. He holds a bachelor’s in economics from Tufts University and a master’s in applied statistics from Columbia University.

Presentations

Deploying Data Science for National Economic Statistics Session

Jeff Chen presents strategies for overcoming time series challenges at the intersection of macroeconomics and data science, drawing from machine learning research conducted at the Bureau of Economic Analysis aimed at improving its flagship product, the gross domestic product (GDP).

Roger Chen is cofounder and CEO of Computable Labs and program chair for the O’Reilly Artificial Intelligence Conference. Previously, he was a principal at O’Reilly AlphaTech Ventures (OATV), where he invested in and worked with early-stage startups primarily in the realms of data, machine learning, and robotics. Roger has a deep and hands-on history with technology. Before startups and venture capital, he was an engineer at Oracle, EMC, and Vicor. He also developed novel nanoscale and quantum optics technology as a PhD researcher at UC Berkeley. Roger holds a BS from Boston University and a PhD from UC Berkeley, both in electrical engineering.

Presentations

New models for generating training data for AI


Data scientist with deep knowledge of large-scale machine learning algorithms. Has partnered with several Fortune 500 companies, advising leadership on making data-driven strategic decisions, and has provided software-based data analytics consulting services to seven global firms across multiple industries, including financial services, automotive, telecommunications, and retail.

Presentations

Building the AI Engine for Retail in the New Era Session

We focus on sharing the design of the AI Engine on Alibaba TSDB service that enables fast and complex analytics of large-scale retail data. A successful case study of the Fresh Hema Supermarket, a major “New Retail” platform operated by Alibaba Group. We will highlight our solutions to the major technical challenges in data cleaning, storage and processing.

Tim Chen is a software engineer at Cloudera leading cloud initiatives for its enterprise machine learning platform. Prior to joining Cloudera, Tim was cofounder and CEO of Hyperpilot, a startup focused on applying machine learning to improve the performance and cost efficiency of container clusters and big data workloads. He helped initiate the Spark-on-Kubernetes project and led development of Mesos support for Spark, and he is a PMC member and committer on Apache Drill and Apache Mesos. Prior to founding Hyperpilot, he worked at Mesosphere, leading containerization development and design, and at VMware and Microsoft.

Presentations

Cloud-Native Machine Learning: Emerging Trends and the Road Ahead Session

Data platforms are being asked to support an ever-increasing range of workloads and compute environments, including machine learning and elastic cloud platforms. In this talk, we will discuss emerging capabilities, including running machine learning and Spark workloads on autoscaling container platforms, and share our vision of the road ahead for ML and AI in the cloud.

Chakri Cherukuri is a senior researcher in the Quantitative Financial Research group at Bloomberg LP in NYC. His research interests include quantitative portfolio management, algorithmic trading strategies and applied machine learning. He has extensive experience in scientific computing and software development. Previously, he built analytical tools for the trading desks at Goldman Sachs and Lehman Brothers. He holds an undergraduate degree from the Indian Institute of Technology (IIT) Madras, India and an MS in computational finance from Carnegie Mellon University.

Presentations

Applied Machine Learning In Finance Session

In this talk, we will see how machine learning and deep learning techniques can be applied in the field of quantitative finance. We will look at a few use cases in detail and see how machine learning techniques can supplement, and sometimes even improve upon, existing statistical models. We will also look at novel visualizations that help us better understand and interpret these models.

Alan Choi is a software engineer at Cloudera working on the Impala project. Before joining Cloudera, he worked at Greenplum on the Greenplum-Hadoop integration. Prior to that, Alan worked extensively on PL/SQL and SQL at Oracle.

Presentations

How to survive future data warehousing challenges with the help of hybrid cloud Session

In this talk, you will learn how Cloudera’s finance department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. Learn from our experience some guidelines for deploying modern data warehousing in a hybrid cloud environment: When should you choose private versus public cloud services? What options are there? What are the dos and don’ts?

A computer science engineer turned decision scientist turned data scientist, she has about four years of experience unveiling the wonders of data using data science.
Having worked closely with the boards of directors of three startups in India and Indonesia, she is known for her business understanding, problem-solving approach, machine learning and NLP skills, and for driving data science problems through to final execution.

Personal:
A yoga lover, painter, poet, and avid trekker and wanderer who is at her best talking to people and learning about them.

Presentations

From an archived data field to GOJEK’s world class product feature for customer experience Session

Who would have imagined that a random chat message, or a note written in a local language, sent by customers to their drivers while waiting for a ride to arrive could be used to carve out unparalleled information about pickup points, including names that sometimes even Google Maps has no idea of, and ultimately help create a world-class customer pickup experience feature?

Eric is the Chief Algorithms Officer at Stitch Fix, leading a team of 100+ data scientists. He is responsible for the multitude of algorithms that are pervasive throughout nearly every function of the company: merchandise, inventory, marketing, forecasting and demand, operations, and the styling recommender system. Prior to joining Stitch Fix, he was the Vice President of Data Science & Engineering at Netflix. Eric holds a B.A. in Economics, an M.S. in Information Systems, and an M.S. in Management Science & Engineering.

Presentations

How to Make Fewer Bad Decisions Session

A/B testing has revealed the fallibility of the human intuition that typically drives business decisions. We describe some types of systematic errors domain experts commit. In this interactive session, we demonstrate and discuss how cognitive biases arise from heuristic reasoning processes. Finally, we propose several mechanisms to mitigate these human limitations and improve our decision-making.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, Ian was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow 2-Day Training

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow (Day 2) Training Day 2

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Will Crichton is a PhD student in Computer Science at Stanford, advised by Prof. Pat Hanrahan. He creates systems that merge research in parallel computing and programming language design to solve impactful problems. Will’s current focus is on tools to enable large-scale visual data analysis, or processing massive collections of images and videos, including published work at SIGGRAPH.

Presentations

Scanner: Efficient Video Analysis at Scale Session

Systems like Spark made it possible to process big numerical/textual data on hundreds of machines. Today, the majority of data in the world is video. Scanner is the first open-source distributed system for building large-scale video processing applications. Scanner is being used at Stanford for analyzing TBs of film with deep learning on GCP, and at Facebook for synthesizing VR video on AWS.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the first day of keynotes.

Dillon Cullinan is a data engineering cybersecurity specialist at Accenture Cyber Labs in the Washington, DC area. At Accenture, Dillon focuses on building big data solutions for the cybersecurity realm to enable large-scale analytics and visualizations.

Presentations

Using Graph Metrics to detect Lateral Movement in Enterprise Cybersecurity Data Session

In this talk, we will show how Accenture's Cyber Security Lab built security analytics models to detect attempted lateral movement in networks by transforming enterprise-scale security data into a graph format, generating graph analytics for individual users, and building time series detection models that visualize the changing graph metrics for security operators.

Nick Curcuru is vice president of enterprise information management at Mastercard, where he is responsible for leading a team that works with organizations to generate revenue through smart data, architect next-generation technology platforms, and protect data assets from cyberattacks by leveraging Mastercard’s information technology and information security resources and creating peer-to-peer collaboration with their clients. Nick brings over 20 years of global experience successfully delivering large-scale advanced analytics initiatives for such companies as the Walt Disney Company, Capital One, Home Depot, Burlington Northern Railroad, Merrill Lynch, Nordea Bank, and GE. He frequently speaks on big data trends and data security strategy at conferences and symposiums, has published several articles on security, revenue management, and data security, and has contributed to several books on the topic of data and analytics.

Presentations

Executive Briefing: Forcing the legal and ethical hands of companies that collect, use, and analyze data Session

In recent years, security breaches have happened to a number of household names, and users feel violated. People around the world have shared their valuable, personally identifiable information with companies they trusted, and many of those companies didn’t guard that information appropriately.

Paul Curtis is a principal solutions engineer at MapR, where he provides pre- and postsales technical support to MapR’s worldwide systems engineering team. Previously, Paul was senior operations engineer for Unami, a startup founded to deliver on the promise of interactive TV for consumers, networks, and advertisers, and a systems manager for Spiral Universe, a company providing school administration software as a service. He also held senior support engineer positions at Sun Microsystems, enterprise account technical management positions for both Netscape and FileNet, and positions in application development at Applix, IBM Service Bureau, and Ticketron. Paul got started in the ancient personal computing days; he began his first full-time programming job on the day the IBM PC was introduced.

Presentations

Clusters in Kubernetes on a Cluster: Building a Multi-Tenant Environment for the Field Session

Just like almost everybody else, we needed a way for ordinary users to stand up applications on top of Kubernetes, but we had additional requirements, and we had to meet them without breaking the bank. Our field sales engineering force of sixty engineers around the globe can now spin our technology up and down quickly and simply using Kubernetes, the cloud, and shared data storage.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the first day of keynotes.

Sabrina Dahlgren is a director in charge of strategic analysis at Kaiser Permanente. Her expertise ranges from statistics and economics to project management and computer science. Sabrina has 20 years of experience in leadership and analytical roles, including vice president of marketing and product development and manager of CRM and customer segmentation, at technology companies including Vodafone. Sabrina has twice won the Innovation Award at Kaiser, most recently in the category of broadly applicable technology for big data analytics.

Presentations

AutoML & Interpretability in Healthcare Data Case Studies

Healthcare data is usually plagued with missing information and incorrect values; however, the need for accuracy and highly interpretable models is also very high. To the rescue come two new areas: AutoML and automated model interpretability. With the advent of new tools for AutoML and interpretability, led by a few niche companies such as H2O, these capabilities have become critical for healthcare modeling.

Jason Dai is a senior principal engineer and chief architect for big data technologies at Intel, where he leads the development of advanced big data analytics, including distributed machine learning and deep learning. Jason is an internationally recognized expert on big data, the cloud, and distributed machine learning; he is the cochair of the Strata Data Conference in Beijing, a committer and PMC member of the Apache Spark project, and the chief architect of BigDL, a distributed deep learning framework on Apache Spark.

Presentations

Analytics Zoo: Distributed Tensorflow and Keras on Apache Spark Tutorial

In this tutorial, we will show how to build and productionize deep learning applications for big data using Analytics Zoo (https://github.com/intel-analytics/analytics-zoo), a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline, drawing on real-world use cases from JD.com, MLSListings, the World Bank, Baosight, and Midea/KUKA.

Founder at Raised Real

PhD in Nutrition from New York University

RD (registered dietitian) from University of California, San Francisco.

Expert contributor to MyDomaine, PopSugar, and MotherMag.

Presentations

Nutrition Data Science Session

Learn how to explore exciting ideas in nutrition using data science. In this presentation, we analyze the detrimental relationship of sugar to longevity, obesity, and chronic disease.

Julien Delange is a staff software engineer at Twitter, working on infrastructure services. He was a senior software engineer at Amazon Web Services, a senior member of the technical staff at Carnegie Mellon University and a software engineer at the European Space Agency.

Julien received his PhD in computer science from TELECOM ParisTech and a master’s degree in computer science from Université Pierre et Marie Curie.

Presentations

Real-time monitoring of Twitter network infrastructure with Heron Session

This session presents how Twitter uses the Heron data processing engine to monitor and analyze its network infrastructure. Within two months, infrastructure engineers implemented a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. The talk focuses on the key technologies used, the architecture, and the challenges involved.

Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Prior to Manifold, Sourav led teams to build data products across the technology stack, from smart thermostats and security cams (Google/Nest) to power grid forecasting (AutoGrid) to wireless communication chips (Qualcomm). He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He earned his PhD, MS, and BS degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology (MIT).

Presentations

Applications of Mixed Effect Random Forests Session

Clustered data is all around us. The best way to attack it? Mixed effect models. This talk explains how the Mixed Effects Random Forests (MERF) model and its Python package marry the world of classical mixed effect modeling with modern machine learning algorithms, and how they can be extended to work with other advanced modeling techniques like gradient boosting machines and deep learning.

Streamlining a Machine Learning Project Team Tutorial

Many teams are still run as if data science were mainly about experimentation, but those days are over. Now it must be turnkey to take models into production. Sourav Dey and Alex Ng explain how to streamline a machine learning project and help your engineers work as an integrated part of your production teams, using a Lean AI process and the Orbyter package for Docker-first data science.

Rohan Dhupelia leads the analytics platform team at Atlassian, a team that focuses its efforts on further democratising data in the company and providing a world-class, highly innovative data platform.

The primary focus of Rohan’s career for the last 10+ years has been the data space, across a variety of industries including FMCGs, property, and technology. During this time he has held a number of roles, doing everything from BI report development to data warehousing and data engineering.

Presentations

Transforming behavioural analytics at Atlassian Session

Analytics is easy; good analytics is hard. Here at Atlassian we know this all too well with our push to become a truly data-driven organisation. To achieve this, we've transformed the way we think about behavioural analytics, from how we define our events all the way to how we ingest and analyse them.

Wei Di is a Staff Machine Learning Scientist on LinkedIn’s business analytics data mining team. Wei is passionate about creating smart and scalable solutions that can impact millions of individuals and empower successful businesses. She has wide interests covering artificial intelligence, machine learning, and computer vision. Previously, Wei worked with eBay Human Language Technology and eBay Research Labs, where she focused on large-scale image understanding and joint learning from visual and textual information, and at Ancestry.com in the areas of record linkage and search relevance. Wei holds a PhD from Purdue University. Her past talks include the Strata 2018 tutorial “Big data analytics and machine learning techniques to drive and grow business” and the Spark Summit 2017 session “Transforming b2b with Spark Powered Sales Intelligence.”

Presentations

Full Spectrum of Data Science to Drive Business Decisions Tutorial

Thanks to the rapid growth in data resources, business leaders increasingly appreciate the challenge and importance of mining information from data. In this tutorial, a group of well-respected data scientists share their experiences and successes leveraging emerging techniques to assist intelligent decisions, leading to impactful outcomes at LinkedIn.

Louis DiValentin is a security data scientist at Accenture Cyber Labs, located in the Washington, DC area. His research focuses on security analytics modeling, graph analytics, and big data.

Presentations

Using Graph Metrics to detect Lateral Movement in Enterprise Cybersecurity Data Session

In this talk, we will show how Accenture's Cyber Security Lab built security analytics models to detect attempted lateral movement in networks by transforming enterprise-scale security data into a graph format, generating graph analytics for individual users, and building time series detection models that visualize the changing graph metrics for security operators.

Thomas is a data science product manager responsible for maintaining and building machine learning models from ideation to full implementation. His experience spans marketing, user acquisition, finance, and product, and he has spent seven years in the gaming industry. He wrote the underlying algorithm for KIXEYE’s offer recommendation engine and is currently an MBA candidate at Berkeley-Haas.

Presentations

Recommendation Engines & Mobile Gaming Session

As fully closed model economies, games offer a unique opportunity to use analytics to create unique purchase opportunities for customers. We’ll cover how KIXEYE was able to use machine learning to create personalized offer recommendations for our customers, resulting in significantly increased monetization and retention. We’ll also go over some of the important choices to make and pitfalls to avoid.

Xiaojing Dong is Head of Marketing Science and a Staff Data Scientist on the LinkedIn analytics team. She is also an Associate Professor of Marketing and Business Analytics at Santa Clara University, where she led the effort to design and launch a popular Master of Science program in Business Analytics and served as its founding director. She helps translate business and marketing problems into data questions and applies analytical techniques to solve them in support of business decisions.

Presentations

Full Spectrum of Data Science to Drive Business Decisions Tutorial

Thanks to the rapid growth in data resources, business leaders increasingly appreciate the challenge and importance of mining information from data. In this tutorial, a group of well-respected data scientists share their experiences and successes leveraging emerging techniques to assist intelligent decisions, leading to impactful outcomes at LinkedIn.

Mark Donsky leads product management at Okera, a software company that provides discovery, access control, and governance at scale for today’s modern heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera. Mark has held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive briefing: big data in the era of heavy worldwide privacy regulations Session

The General Data Protection Regulation (GDPR) went into effect in 2018, and California is following suit with the California Consumer Privacy Act (CCPA) in 2020. However, many companies aren't prepared for the strict regulation or the fines for noncompliance. This session will explore the capabilities your data environment needs in order to simplify CCPA and GDPR compliance, as well as compliance with other regulations.

Ted Dunning is chief application architect at MapR. He’s also a board member for the Apache Software Foundation, a PMC member and committer on many Apache projects, and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He has contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Online Evaluation of Machine Learning Models Session

Evaluating machine learning models is surprisingly hard. It gets even harder because these systems interact in very subtle ways. I will break the problem of evaluation apart into operational and functional evaluation and show how each can be done without unnecessary pain and suffering. In particular, I will show some exciting visualization techniques that help make differences strikingly apparent.

Taposh Roy leads the innovation team of the decision support group at Kaiser Permanente. His work focuses on journey analytics, deep learning, data science architecture, and strategy. Prior to KP, Taposh was head of ad products at a couple of startups, Inpowered and NetShelter (sold to Ziff Davis). Before that, he worked as a senior associate consultant at Sapient, an MIT-based consulting company. He was the cofounder of a biotech company, Bio-Integrated Solutions, which developed DNA sequencers and liquid-handling devices for proteomics.
He has a unique combination of experience in product, technology and strategy consulting, data science, and startups. He is a consumer-focused machine learning and data science geek.

Presentations

AutoML & Interpretability in Healthcare Data Case Studies

Healthcare data is usually plagued with missing information and incorrect values. At the same time, the need for accuracy and highly interpretable models is very high. To the rescue come two new areas: AutoML and automated model interpretability. With the advent of new tools for AutoML and interpretability, led by a few niche companies such as H2O, these capabilities have become critical for healthcare modeling.

Bysshe is the director of analytics at KIXEYE, bringing his experience in using a combination of data analysis, economics, intuition, and game design sense to solve monetization and content delivery problems in games.
As director of analytics, Bysshe manages the insights and implementation of the machine learning offer system for War Commander: Rogue Assault.

Presentations

Recommendation Engines & Mobile Gaming Session

As fully closed model economies, games offer a unique opportunity to use analytics to create tailored purchase opportunities for customers. We'll cover how KIXEYE used machine learning to create personalized offer recommendations for our customers, resulting in significantly increased monetization and retention. We'll also go over some of the important choices to make and pitfalls to avoid.

Wenchen Fan is a software engineer at Databricks, working on Spark Core and Spark SQL. He mainly focuses on the Apache Spark open source community, leading the discussion and reviews of many features/fixes in Spark. He is a Spark committer and a Spark PMC member.

Presentations

Apache Spark 2.4 and Beyond Session

This talk will provide an overview of the major features and enhancements in the Apache Spark 2.4 release and upcoming releases, followed by a Q&A session.

Zhen Fan is a senior software development engineer at JD.com, where he focuses on big data platform development and management.

Presentations

Optimizing Computing Cluster Resource Utilization with an In-Memory Distributed File System Session

JD.com has designed a brand-new architecture to optimize its Spark computing clusters. We will show the problems we faced before and how we benefit from the in-memory distributed filesystem now.

Tao Feng is a software engineer on the data platform team at Lyft, working on data products. Tao is a committer and PPMC member for Apache Airflow. Previously, Tao worked at LinkedIn and Oracle on data infrastructure, tooling, and performance.

Presentations

Democratize Data Discovery at Lyft Data Case Studies

The number of data resources at Lyft is increasing constantly. It is becoming harder for users to figure out which table to use, to find information about these resources (e.g., owners, frequent users, how a table got populated, data lineage), and to gain trust in these data resources. In this talk, we will discuss how we built the data portal product to democratize data discovery at Lyft.

Disrupting Data Discovery Session

In this talk, we'll discuss how Lyft has reduced the time taken to discover data by 10x by building its own data portal, Amundsen. We will give a demo of Amundsen, take a deep dive into its architecture, and discuss how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. We will close with the future roadmap, unsolved problems, and the collaboration model.

Rustem Feyzkhanov is a machine learning engineer who creates analytical models for the manufacturing industry at Instrumental. Rustem is passionate about serverless infrastructure (and AI deployments on it) and has ported several packages to AWS Lambda, ranging from TensorFlow, Keras, and scikit-learn for machine learning to PhantomJS, Selenium, and WRK for web scraping.

Presentations

Serverless workflows for orchestrating hybrid cluster-based and serverless processing Session

Serverless implementation of core processing is becoming a production-ready option for many companies, but companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless world and the cluster world to reap the benefits of both approaches. My talk will show how serverless workflows change our perception of software architecture.

Ilan Filonenko is a member of the Data Science Infrastructure team at Bloomberg, where he has designed and implemented distributed systems at both the application and infrastructure levels. He is one of the principal contributors to Spark on Kubernetes, primarily focusing on the effort to enable secure HDFS interaction and non-JVM support. Previously, Ilan was an engineering consultant and technical lead at various startups and research divisions across multiple industry verticals, including medicine, hospitality, finance, and music. Ilan's research has focused on algorithmic, software, and hardware techniques for high-performance machine learning, with a particular interest in optimizing stochastic algorithms, convolutional sequence-to-sequence models, multitask learning for deep text recommendations, and model management.

Presentations

Cross-Cloud Model Training & Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-prem). We'll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.

Jonathan Francis is VP of marketing analytics and optimization at Starbucks.

Presentations

Improving AI solutions for personalization with continuous experimentation & learning Data Case Studies

In this talk, Jon Francis (VP of marketing analytics, Starbucks) and Arun Veettil (founder and CEO at Skellam AI) discuss how they improved the performance of an AI-based personalization solution by 2x through continuous AI-enabled experimentation and learning.

Bill Franks is chief analytics officer for the International Institute for Analytics (IIA). His work has spanned clients in a variety of industries for companies ranging in size from Fortune 100 companies to small nonprofit organizations. Bill is the author of Taming the Big Data Tidal Wave and The Analytics Revolution. Just prior to IIA, Bill was Chief Analytics Officer at Teradata. You can learn more on his website.

Presentations

The Ethics Of Analytics Session

Concerns are constantly being raised today about what data is appropriate to collect and how (or if) it should be analyzed. There are many ethical, privacy, and legal issues to consider, and in many cases no clear standards exist as to what is fair and what is foul. This talk will discuss a variety of dilemmas and provide some guidance on how to approach them.

Michael J. Freedman is the cofounder and CTO of TimescaleDB, an open source database that scales SQL for time series data, and a professor of computer science at Princeton University, where his research focuses on distributed systems, networking, and security. Previously, Michael developed CoralCDN (a decentralized CDN serving millions of daily users) and Ethane (the basis for OpenFlow and software-defined networking) and cofounded Illuminics Systems (acquired by Quova, now part of Neustar). He is a technical advisor to Blockstack. Michael’s honors include the Presidential Early Career Award for Scientists and Engineers (PECASE, given by President Obama), the SIGCOMM Test of Time Award, the Caspar Bowden Award for Privacy Enhancing Technologies, a Sloan Fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, a DARPA Computer Science Study Group membership, and multiple award publications. He holds a PhD in computer science from NYU’s Courant Institute and bachelor’s and master’s degrees from MIT.

Presentations

Performant time-series data management and analytics with Postgres Session

In this talk, I focus on two newly-released features of TimescaleDB (automated adaptation of time-partitioning intervals and continuous aggregations in near-real-time), and discuss how these capabilities ease time-series data management. I discuss how these capabilities have been leveraged across several different use cases, including in use with other technologies such as Kafka.

Cynthia Freeman is a Research Engineer at Verint, a developer of conversational AI systems. She is currently pursuing her PhD in computer science at the University of New Mexico, where she works on time series analysis and developing new anomaly detection methods. She holds an MS in applied mathematics from the University of Washington and a BS in mathematics from Gonzaga University.

Presentations

How to Determine the Optimal Anomaly Detection Method For Your Application Session

An anomaly is a pattern not conforming to past, expected behavior. Its detection has many applications such as tracking business KPIs or fraud spotting in credit card transactions. Unfortunately, there is no one best way to detect anomalies across a variety of domains. We introduce a framework to determine the best anomaly detection method for the application based on time series characteristics.
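To make the setting concrete, here is one simple baseline of the kind such a framework might compare against others: a rolling z-score detector that flags points deviating sharply from recent history. This is an illustrative sketch only, not a method from the talk; the function name, window, and threshold are invented for the example.

```python
import statistics

def rolling_zscore_anomalies(series, window=10, threshold=3.0):
    """Flag indices whose value deviates from the trailing window's
    mean by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history) or 1e-9  # guard against zero variance
        if abs(series[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# A flat series with a single spike: only the spike should be flagged.
data = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9,
        10.0, 10.2, 50.0, 10.1, 9.8]
print(rolling_zscore_anomalies(data))  # → [12]
```

Even this toy detector illustrates why no single method wins everywhere: it assumes roughly stationary data, so series with trend or seasonality (two of the time series characteristics the talk's framework considers) would need a different approach.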

Matt is a cofounder at Starburst, the Presto company. Matt has worked in various engineering roles in the data warehousing and analytics space for the past 10 years. Prior to Starburst, Matt was director of engineering at Teradata, leading engineering teams working on Presto, and was part of the team that led the initiative to bring open source, in particular Presto, to Teradata's products. Prior to joining Teradata, Matt architected and led development efforts for the next-generation distributed SQL engine at Hadapt, which was acquired by Teradata in 2014.

Prior to Hadapt, Matt was an early engineer at Vertica Systems, which was acquired by HP (NYSE: HPQ). At Vertica, Matt worked on the query optimizer.

Presentations

Learning Presto: SQL-on-Anything Tutorial

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL-on-Anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from gigabytes to petabytes. In this tutorial, attendees will learn Presto usage and best practices, with optional hands-on exercises.

Mei Lin Fung is a technology pioneer working to ensure that technology works for humanity as the next 3.9 billion people come online. In 1989 she was part of the 2-person skunkworks team that developed “OASIS,” the first customer relationship management (CRM) system. She later served as socio-technical lead for the US Dept. of Defense’s Federal Health Futures initiative. In 2015, she joined “father of the Internet,” Vint Cerf, to co-found the “People-Centered Internet,” which maintains a global network of “positive change agents” committed to ensuring that technology is developed with a “people-centered” focus – increasing access while ensuring equality, protecting the vulnerable, and prioritizing human well-being. She is a member of the World Economic Forum (WEF)’s Global Future Council on Digital Economy and Society, serving on the Steering Committee for Internet for All. She is vice-chair for Internet Inclusion within the Institute for Electrical and Electronic Engineers (IEEE) Internet Initiative, 3i.

Presentations

Community and Regional Data Sharing Policy Frameworks: Frontier Stories Session

Data sharing requires stakeholders and populations of people to come together and learn the benefits, risks, challenges, and the known and unknown "unknowns." Data sharing policies and policy frameworks require increasing levels of trust, which takes time to build. Trailbreaking stories from Solano County, California, and ASEAN (Southeast Asia) offer important insights.

Li Gao is the tech lead for the cloud native Spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions, focusing on cloud native and hybrid cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, and Apache Hive.

Internal meetup (2018) slides on building cloud native data platform using spark and presto:

https://www.slideshare.net/LiGao1/cloud-native-data-platform-113940460

Internal meetup (2016) on Presto connector extensions:

https://twitter.com/AvikonHadoop/status/760242170796253184

Presentations

Scaling Apache Spark on Kubernetes at Lyft Session

In this talk, Li Gao and Bill Graham will talk about challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale.

Adem Efe Gencer received his PhD in computer science from Cornell University in 2017. His PhD research focused on improving the scalability of blockchain technologies. The protocols introduced in his research were adopted by Waves Platform, Aeternity, Cypherium, Enecuum, Legalthings, and Ergo Platform, and are actively being developed into other systems. His papers have received over 500 citations and a best student paper award at the Middleware Conference. Efe serves as a reviewer for top-tier journals and conferences, including Communications of the ACM and ACM Computing Surveys.

Efe develops Apache Kafka and the ecosystem around it, and supports their operation at LinkedIn. In particular, he works on the design, development, and maintenance of Cruise Control — a system for alleviating the management overhead of large-scale Kafka clusters at LinkedIn.

Presentations

Cruise Control: Effortless Management of Kafka Clusters Session

This talk will describe our work and experiences towards alleviating the management overhead of large-scale Kafka clusters using Cruise Control at LinkedIn.

Noah Gift is a lecturer and consultant in both the UC Davis Graduate School of Management MSBA program and the graduate data science program (MSDS) at Northwestern. He teaches and designs graduate machine learning, AI, and data science courses and consults on machine learning and cloud architecture for students and faculty. These responsibilities include leading a multicloud certification initiative for students. He has published close to 100 technical publications, including two books, on subjects ranging from cloud machine learning to DevOps. Gift received an MBA from UC Davis, an MS in computer information systems from Cal State Los Angeles, and a BS in nutritional science from Cal Poly San Luis Obispo.

Noah is a Python Software Foundation Fellow, an AWS Subject Matter Expert (SME) on machine learning, an AWS Certified Solutions Architect and AWS Academy Accredited Instructor, a Google Certified Professional Cloud Architect, and a Microsoft MTA on Python, and he has published books on cloud machine learning and DevOps. He writes and publishes content for publications including Forbes, IBM, Red Hat, Microsoft, O'Reilly, and Pearson, and has given workshops and talks around the world for organizations including NASA, PayPal, PyCon, Strata, and FooCamp. As an SME on machine learning for AWS, he helped create the AWS Machine Learning certification.

He has worked in roles ranging from CTO, general manager, consulting CTO, and consulting chief data scientist to cloud architect, with a wide variety of companies including ABC, Caltech, Sony Imageworks, Disney Feature Animation, Weta Digital, AT&T, Turner Studios, and Linden Lab. In the last ten years, he has been responsible for shipping many new products at multiple companies that generated millions of dollars of revenue and had global scale. Currently, as the founder of Pragmatic AI Labs, he consults with startups and other companies on machine learning and cloud architecture and provides CTO-level consulting. His most recent book is Pragmatic AI: An Introduction to Cloud-Based Machine Learning (Pearson, 2018). His most recent video series is Essential Machine Learning and AI with Python and Jupyter Notebook LiveLessons, also available on Safari Online.

Presentations

Nutrition Data Science Session

Learn how to explore exciting ideas in nutrition using data science. In this presentation, we analyze sugar's detrimental relationship to longevity, obesity, and chronic diseases.

Zachary Glassman is a data scientist in residence at the Data Incubator. Zachary has a passion for building data tools and teaching others to use Python. He studied physics and mathematics as an undergraduate at Pomona College and holds a master’s degree in atomic physics from the University of Maryland.

Presentations

Hands-On Data Science with Python 2-Day Training

We will walk through all the steps - from prototyping to production - of developing a machine learning pipeline. We’ll look at data cleaning, feature engineering, model building/evaluation, and deployment. Students will extend these models into two applications from real-world datasets. All work will be done in Python.
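A minimal sketch of the prototype stage of such a pipeline, using scikit-learn on synthetic data (illustrative only; the training's actual datasets, features, and models are not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real-world dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature engineering and model building chained into one pipeline,
# so the same transformations apply at training and deployment time
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Wrapping the steps in a single `Pipeline` object is what makes the prototype-to-production move cheap: the fitted object can be serialized and deployed as one unit.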

Benjamin Glicksberg, PhD is a post-doctoral scholar in the lab of Dr. Atul Butte in the Bakar Computational Health Sciences Institute at the University of California, San Francisco. His work involves utilizing state-of-the-art computational methods, including Artificial Intelligence algorithms, on bio- and clinical informatics frameworks to make discoveries to push forward precision medicine. His work often ties together multi-omic data types ranging from genomics to clinical data in the form of Electronic Health Records (EHR). Dr. Glicksberg has also built software, tools, and applications for interacting with and visualizing EHR data across patients in the UC Health system, with a particular emphasis on interoperable common data model formats. He obtained a PhD from the Icahn School of Medicine at Mount Sinai in 2017.

Presentations

Sharing Cancer Genomic Data from Clinical Sequencing Using Blockchain Data Case Studies

Sequencing cancer genomes has transformed how we diagnose and treat the deadliest disease in America: cancer. We are enabling patients to share these data using the blockchain.

Sean is a Senior Software Engineer on the Fast Data Platform team at Lightbend where he specializes in Kubernetes, Apache Kafka and its ecosystem. Sean enjoys building Fast Data platforms, reactive distributed systems, and contributing to open source projects.

Presentations

Put Kafka in jail with Strimzi Session

Introducing Strimzi, a Kafka project for Kubernetes. The best way to run stateful services with complex operational needs like Kafka is to use the operator pattern. This talk will review a popular new open source operator-based Apache Kafka implementation on Kubernetes called the Strimzi Kafka Operator.

Sharad Goel is an Assistant Professor at Stanford University in the Department of Management Science & Engineering, and is the founder and director of the Stanford Computational Policy Lab. He also holds courtesy appointments in Computer Science, Sociology, and the Law School. In his research, Sharad looks at public policy through the lens of computer science, bringing a new, computational perspective to a diverse range of contemporary social issues, including policing, incarceration, and elections. Before joining the Stanford faculty, Sharad completed a Ph.D. in Applied Mathematics at Cornell University and worked as a senior researcher at Yahoo and Microsoft in New York City.

Presentations

The Measure and Mismeasure of Fairness in Machine Learning Session

By highlighting these challenges in the foundation of fair machine learning, I hope to help researchers and practitioners productively advance the area.

Goldwasser is the RSA Professor of Electrical Engineering and Computer Science at MIT and has recently been named the incoming director of the Simons Institute for the Theory of Computing at UC Berkeley. She is also a professor of computer science and applied mathematics at the Weizmann Institute of Science in Israel. Goldwasser received a BS in applied mathematics from Carnegie Mellon University in 1979 and MS and PhD degrees in computer science from the University of California, Berkeley, in 1984.

Goldwasser's pioneering contributions include the introduction of probabilistic encryption, interactive zero-knowledge protocols, elliptic curve primality testing, hardness-of-approximation proofs for combinatorial problems, and combinatorial property testing.

She was the recipient of the ACM Turing Award for 2012, the Gödel Prize in 1993 and again in 2001, the ACM Grace Murray Hopper Award in 1996, the RSA Award in Mathematics in 1998, the ACM Athena Award for women in computer science in 2008, the Benjamin Franklin Medal in 2010, the IEEE Emanuel R. Piore Award in 2011, the Simons Foundation Investigator Award in 2012, and the BBVA Foundation Frontiers of Knowledge Award in 2018. She is a member of the NAS, NAE, AAAS, the Russian Academy of Science, the Israeli Academy of Science, and the London Royal Mathematical Society. She holds honorary degrees from Ben Gurion University, Bar Ilan University, and Haifa University, and has received a Berkeley Distinguished Alumnus Award and the Barnard College Medal of Distinction.

Presentations

AI and Cryptography: Challenges and Opportunities Keynote

Shafi Goldwasser, Director of the Simons Institute for Theory of Computing | RSA Professor of EECS | Computer Science and Applied Mathematics, University of California, Berkeley | MIT | Weizmann Institute of Science

Data and analytics director at USAA, focused on our members' interaction data and on research with universities to help provide new insights and build pipelines of talent.

Presentations

Creating a Data Engineering Culture at USAA Session

What happens when you have a data science organization but no data engineering organization? This is what happened at USAA. In this session, we will share what happened without data engineering, how we fixed it, and what the results were.

Alex Gorbachev is the head of enterprise data science at Pythian. His mission is to help clients around the world build applied AI solutions and democratize data science. Over the course of his 12 years at Pythian, Alex has held many roles, including chief technology officer and chief digital officer. His deep technological roots and industry vision have helped Pythian get to the forefront of the emerging cloud and data markets. Alex is a highly sought-after speaker at industry conferences and user groups around the world. His past accomplishments include achieving the prestigious Oracle ACE Director designation from Oracle and being named a "Big Data Champion" by Cloudera.

Presentations

Machine Learning for Preventive Maintenance of Mining Haul Trucks Session

Using the example of a mining haul truck at a leading Canadian mining company, we will cover mapping preventive maintenance needs to supervised machine learning problems, creating labeled datasets, feature engineering from sensor and alert data, and evaluating models, and then converting it all into a complete AI solution on Google Cloud Platform integrated with existing on-premises systems.

Martin is passionate about science, technology, coding, algorithms, and everything in between. He graduated from Mines ParisTech, enjoyed his first engineering years in the computer architecture group of STMicroelectronics, and then spent the next 11 years shaping the nascent ebook market, starting with the Mobipocket startup, which later became the software part of the Amazon Kindle and its mobile variants. He joined Google Developer Relations in 2011 and now focuses on parallel processing and machine learning. He is the author of the successful "TensorFlow without a PhD" series.

Presentations

Recurrent Neural Networks without a PhD workshop Tutorial

Hands-on with recurrent neural networks and TensorFlow. Discover what makes RNNs so powerful for time series analysis.

Dr. Denise Gosnell leads a team at DataStax which builds some of the largest, distributed graph applications in the world. Her passion centers on examining, applying, and evangelizing the applications of graph data and complex graph problems.

As an NSF Fellow, Dr. Gosnell earned her PhD in computer science from the University of Tennessee. Her research coined the concept of "social fingerprinting" by applying graph algorithms to predict user identity from social media interactions. Since then, Dr. Gosnell has built, published, patented, and spoken on dozens of topics related to graph theory, graph algorithms, graph databases, and applications of graph data across all industry verticals.

Presentations

Taking Graph Applications to Production Session

The graph community has spent years defining and describing our passion - applying graph thinking to solve difficult problems. This talk will leverage years of experience from shipping large scale applications built on graph databases. We’ll discuss some practical and tangible decisions that come into play when designing and delivering distributed graph applications … or playing SimCity 2000.

Bill Graham is an architect on the Data Platform team at Lyft. Bill’s primary area of focus is on data processing applications and analytics infrastructure. Previously he was a staff engineer on the Data Platform team at Twitter, where he built streaming compute, interactive query, batch query, ETL and data management systems. Before Twitter Bill was a Principal Engineer at CBS Interactive and CNET Networks where he developed ad targeting and content publishing infrastructure. Before CBSi he was a Senior Engineer at Logitech focusing on webcam streaming and messaging applications. He’s contributed to a number of open-source projects including Apache HBase, Apache Hive and Presto and he’s an Apache Pig and Apache Heron (Incubating) PMC member.

Videos:

From Rivulets to Rivers: Elastic Stream Processing in Heron
Strata + Hadoop World, San Jose, March 2017
https://www.safaribooksonline.com/library/view/strata-hadoop/9781491976166/video302414.html

Heron at Twitter – The Hive Meetup, August, 2016
minute 48:30
https://www.youtube.com/watch?v=FRvmeoJCZKU&feature=youtu.be

Presto at Twitter – Facebook Presto Meetup, March 2016
Minute 24:20
https://www.facebook.com/prestodb/videos/531276353732033/

Intro to Hadoop – UC Berkeley Information School, September 2012
http://blogs.ischool.berkeley.edu/i290-abdt-s12/2012/08/31/video-lecture-posted-intro-to-hadoop/

Hadoop Summit 2011 – Using a Hadoop Data Pipeline to Build a Graph of Users and Data, August, 2011
https://www.youtube.com/watch?v=wGXZmTt1p38

Presentations

Scaling Apache Spark on Kubernetes at Lyft Session

In this talk, Li Gao and Bill Graham will talk about challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale.

Trevor Grant is a committer on Apache Mahout, a contributor to the Apache Streams (incubating), Apache Zeppelin, and Apache Flink projects, and an Open Source Technical Evangelist at IBM. In former roles he called himself a data scientist, but the term is so overused these days. He holds an MS in applied math and an MBA from Illinois State University. Trevor is an organizer of the newly formed Chicago Apache Flink meetup and has presented at Flink Forward, ApacheCon, Apache Big Data, and other meetups nationwide.

Trevor was a combat medic in Afghanistan in 2009 and wrote an award-winning undergraduate thesis between missions. He has a dog, a cat, and a '64 Ford, and he loves them all very much.

Presentations

Cross-Cloud Model Training & Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-prem). We'll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.

Michael Gregory is a Systems Engineering Manager for Cloudera in EMEA.

In addition to leading a team of world-class engineers Michael has designed and implemented big data solutions and evangelized the power of data to transform organizations across industries in the US and Europe. With GDPR a major concern for EMEA organizations, Michael has guided data-driven companies in implementing data products with GDPR in mind.

Presentations

Machine Learning and GDPR Session

The General Data Protection Regulation (GDPR) enacted by the European Union can restrict the use of Machine Learning practices in many cases. This presentation will provide an overview of the regulations, important considerations for both EU and non-EU organizations and tools and technologies to ensure that ML applications can appropriately be used to drive continued transformation and insights.

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Disrupting Data Discovery Session

In this talk, we'll discuss how Lyft has reduced the time taken to discover data by 10x by building its own data portal, Amundsen. We will give a demo of Amundsen, take a deep dive into its architecture, and discuss how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. We will close with the future roadmap, unsolved problems, and the collaboration model.

Sijie Guo is the cofounder of Streamlio, a company focused on building a next-generation real-time data stack. Previously, he was the tech lead for the Messaging Group at Twitter, where he cocreated Apache DistributedLog, and worked on push notification infrastructure at Yahoo. He is the PMC chair of Apache BookKeeper.

Presentations

How zhaopin.com built its enterprise event bus using Apache Pulsar Session

Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. This talk will focus on the event bus requirements of Zhaopin.com, one of the biggest Chinese online recruitment services providers, and why they chose Apache Pulsar.

Chunky Gupta is a member of technical staff at Mist Systems, where he works on scaling their cloud infrastructure. He received his MS in computer science from Texas A&M University in 2014. He worked at Yelp for two years as a software engineer, where he developed an autoscaling engine, FleetMiser, that intelligently autoscales Yelp's Mesos cluster and saved millions of dollars; he presented FleetMiser at re:Invent 2016. He also scaled Yelp's in-house distributed and reliable task runner, Seagull, which he wrote about on the Yelp Engineering Blog. Earlier, he built a Hadoop-based data warehouse system at Vizury.

Presentations

Live-Aggregators: A Scalable, Cost Effective and Reliable Way of Aggregating Billions of Messages in Realtime Session

Live Aggregators (LA) is a highly reliable and scalable in-house real-time aggregation system that can autoscale in response to sudden changes in load. LA consumes billions of Kafka messages and performs over 1.5 billion writes to Cassandra per day. It is 80% cheaper than competing streaming solutions because it runs on AWS spot instances and sustains 70% CPU utilization.

Kapil Gupta is a data science leader at Airbnb in San Francisco, where he leads the data science team focused on launching new travel verticals, like Experiences, and establishing Airbnb as an end-to-end travel platform. At Airbnb, he has worked on many challenging machine learning and personalization problems arising in search, pricing, and risk. Previously, he worked at PayPal and Duff & Phelps. He holds a PhD in operations research from Georgia Tech and a BTech from the Indian Institute of Technology (IIT) Madras, India.

Presentations

Personalizing the Guest Booking Experience at Airbnb Session

In this talk, we will present how we use machine learning to personalize travelers' booking experience.

Sonal Gupta is a research scientist at Facebook working on conversational AI systems. Before joining Facebook, she developed deep learning natural language understanding models for conversational AI systems at Viv, a startup later acquired by Samsung. She completed her PhD at Stanford University in 2015 on weakly supervised and interpretable information extraction. Prior to that, she did her master's at the University of Texas at Austin on combining language and vision modalities for information extraction.

Presentations

Natural Language Understanding in Task Oriented Conversational AI Session

In this talk, I will describe practical systems for building a conversational AI system for task-oriented queries, including a way to do more advanced compositional understanding of cross-domain queries using hierarchical representations.

Juan Paulo Gutierrez is a senior software engineer at Rakuten, where he leads data architecture, data engineering and data visualization teams. Paulo contributes to open source projects through code, documentation, feature requests and discussions. Previously, he was the product development lead in Media Links’ Network Management Software.

Presentations

Building Rakuten Analytics: A Story of Evolutions Session

Learn about how a small team in Tokyo went through several evolutions as they built an analytics service to help 200+ businesses accelerate their decision-making process. This presentation will cover the background, challenges, architecture, success stories, and best practices as they built and productionalized Rakuten Analytics.

Barkha is a principal at GV (formerly Google Ventures), where she sources startups and helps portfolio companies with their analytical problems. Prior to that, she worked on the Google finance team, where she tackled hard data problems and was chief of staff for the team overseeing Google's financial systems strategy.

Presentations

Executive Briefing: Upskilling your business teams to scale analytics in your organization Session

How do you decide if you should invest in upskilling business teams? The question is no longer if, but when and how, and I'm going to share a framework for answering it. In my time at GE, Google, and GV, I have created and conducted multiple analytics trainings for nontechnical users, resulting in increased productivity as well as greater work satisfaction.

John Haddad is senior director of big data product marketing at Informatica Corporation. He has over 25 years' experience developing and marketing enterprise applications and advises organizations on big data best practices from a management and technology perspective. Prior to Informatica, John was director of product marketing and management at Right Hemisphere (acquired by SAP) and held various positions in R&D and business development at Oracle Corporation. John holds an AB in applied mathematics from UC Berkeley.

Presentations

Understanding the Data Universe with a Data Catalog Session

Before tackling any project, it's always prudent to first take inventory of what's available; this helps you plan and execute toward timely and efficient completion. Just as a powerful space telescope scans the universe, a data catalog scans the data universe to help data scientists and analysts find data, collaborate, and curate data for analytics and data governance projects.

Patrick Hall is a senior director for data science products at H2O.ai, where he focuses mainly on model interpretability and model management. Patrick is also currently an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning. Previously, Patrick held global customer-facing and R&D research roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the 11th person worldwide to become a Cloudera Certified Data Scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Practical Techniques for Interpretable Machine Learning Tutorial

If machine learning can lead to financial gains for your organization, why isn't everyone doing it? One reason is that training machine learning systems with transparent inner workings and auditable predictions is difficult. This talk presents the good, the bad, and the downright ugly lessons learned from the presenters' years of experience implementing solutions for interpretable machine learning.

Yaron Haviv is a serial entrepreneur who has deep technological experience in the fields of big data, cloud, storage and networking. Prior to iguazio, Yaron was the Vice President of Datacenter Solutions at Mellanox, where he led technology innovation, software development and solution integrations. He was also the CTO and Vice President of R&D at Voltaire, a high performance computing, IO and networking company. Yaron is a CNCF member and one of the authors in the CNCF working group. He tweets as @yaronhaviv.

Presentations

Goodbye, Data Lake: Why Continuous Analytics Yield Higher ROI Session

Faced with the need to handle increasing volumes of data, alternative datasets ("alt data"), and AI, many enterprises are working to design or redesign their big data architectures. While traditional batch platforms fail to generate sufficient ROI, Yaron Haviv suggests a continuous analytics approach that yields faster answers for the business while remaining simpler and less expensive for IT.

Norris is a senior research and data scientist with 19 years of real-world experience converting data into knowledge, spanning many areas of natural language processing, knowledge systems, cleaning and normalizing messy data, and rigorous accuracy measurement. Norris has published several papers in health informatics and general knowledge management. She has worked at Lockheed Martin for a very long time, across multiple business areas ranging from public sector contracts to advanced R&D to internal business process support. An alumna of both Temple University and the University of Pennsylvania, she currently lives in Wilmington, Delaware, with her husband, two daughters, and two cats. She likes to eat and talk about food.

Presentations

NLP from Scratch: Solving the Cold Start Problem for Natural Language Processing Session

How do you train a machine learning model with no training data? We will present our journey implementing multiple solutions to bootstrapping training data in the NLP domain. We will cover topics including weak supervision, building an active learning framework, and annotation adjudication for Named Entity Recognition.
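The weak supervision approach mentioned above can be sketched in a few lines: several noisy heuristic "labeling functions" vote on each token, and the majority label (ignoring abstentions) becomes a bootstrapped annotation. This is an illustrative sketch, not Lockheed Martin's actual system; the heuristics and labels are invented for the example.

```python
# Bootstrapping NER-style training labels with weak supervision:
# noisy labeling functions vote; majority (ignoring abstentions) wins.
from collections import Counter

ABSTAIN = None

def lf_title_case(token):
    # Heuristic: capitalized tokens may be entities.
    return "ENT" if token[:1].isupper() else ABSTAIN

def lf_known_names(token):
    # Heuristic: a tiny gazetteer of known entity names.
    return "ENT" if token in {"Alice", "Acme"} else ABSTAIN

def lf_lowercase_common(token):
    # Heuristic: all-lowercase tokens are probably not entities.
    return "O" if token.islower() else ABSTAIN

def weak_label(token, lfs):
    """Majority vote over labeling functions, ignoring abstentions."""
    votes = [lf(token) for lf in lfs if lf(token) is not ABSTAIN]
    if not votes:
        return ABSTAIN  # no signal; leave unlabeled
    return Counter(votes).most_common(1)[0][0]

lfs = [lf_title_case, lf_known_names, lf_lowercase_common]
tokens = ["Alice", "works", "at", "Acme"]
labels = [weak_label(t, lfs) for t in tokens]
print(labels)  # ['ENT', 'O', 'O', 'ENT']
```

Real systems (e.g., Snorkel-style pipelines) replace the majority vote with a learned generative model of labeling-function accuracies, but the voting structure is the same.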

Michael is a software engineer at Cloudera. He has contributed to various parts of the Apache Impala query execution engine, such as codegen, query operators, expression evaluation, and most recently the distributed execution aspects of Impala. Before his tenure at Cloudera, he spent almost eight years building hypervisors at VMware.

Presentations

Accelerating Analytical Antelopes: Integrating Apache Kudu's RPC into Apache Impala Session

In recent years, Apache Impala has been deployed to clusters large enough to hit architectural limitations in the stack. Our talk will cover the efforts, and the results, of addressing the scalability limitations of the now-legacy Thrift RPC framework by adopting Apache Kudu's RPC framework, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos.

Dr. Horton is a senior data scientist with the Microsoft Knowledge Graph Team, where he analyzes customer data and helps to design and evaluate approaches for knowledge extraction. He holds an adjunct faculty appointment in Health Informatics at the University of San Francisco and has a particular interest in educational simulations.

Presentations

Building High-Performance Text Classifiers on a Limited Labeling Budget Session

We show how three cutting-edge machine learning techniques can be used together to up your modeling game: (1) transfer learning from pretrained language models, (2) active learning to make more effective use of a limited labeling budget, and (3) hyperparameter tuning to maximize model performance. We will apply these techniques to a growing business challenge: moderating public discussions.
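The active learning idea in the abstract can be illustrated with uncertainty sampling: with a limited labeling budget, send human annotators only the unlabeled examples the current model is least confident about. The probabilities below are stand-ins for a real classifier's outputs, not the presenters' data.

```python
# Uncertainty sampling: pick the `budget` examples whose top-class
# probability is lowest, i.e., where the model is least sure.

def least_confident(probs, budget):
    """Return indices of the `budget` least-confident examples."""
    confidences = [max(p) for p in probs]
    ranked = sorted(range(len(probs)), key=lambda i: confidences[i])
    return ranked[:budget]

# Predicted class distributions for five unlabeled comments.
probs = [
    [0.98, 0.02],  # confidently benign
    [0.55, 0.45],  # borderline -> worth labeling
    [0.90, 0.10],
    [0.51, 0.49],  # most uncertain
    [0.70, 0.30],
]
to_label = least_confident(probs, budget=2)
print(to_label)  # [3, 1]
```

In practice this selection step alternates with retraining: label the chosen examples, retrain, re-score the pool, and repeat until the budget is exhausted.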

Jeremy Howard is an entrepreneur, business strategist, developer, and educator. Jeremy is a founding researcher at fast.ai, a research institute dedicated to making deep learning more accessible. He is also a Distinguished Research Scientist at the University of San Francisco, a faculty member at Singularity University, and a Young Global Leader with the World Economic Forum.

Jeremy's most recent startup, Enlitic, was the first company to apply deep learning to medicine and has been selected as one of the world's top 50 smartest companies by MIT Tech Review two years running. He was previously president and chief scientist of the data science platform Kaggle, where he was the top-ranked participant in international machine learning competitions two years running. He was the founding CEO of two successful Australian startups (FastMail and Optimal Decisions Group, purchased by LexisNexis). Before that, he spent eight years in management consulting at McKinsey & Co. and A.T. Kearney. Jeremy has invested in, mentored, and advised many startups and contributed to many open source projects.

He has many television and other video appearances, including as a regular guest on Australia’s highest-rated breakfast news program, a popular talk on TED.com, and data science and web development tutorials and discussions.

Presentations

Deep Learning Applications for Non-Engineers Session

When deep learning can be easily applied by non-engineers who possess extensive domain expertise, we can accelerate not only the pace of industry adoption but also the rate at which we uncover interesting and relevant research problems.

Chenhui Hu is a data scientist in the Cloud & AI division of Microsoft. His current interests include retail forecasting, inventory optimization, IoT data, and deep learning. He received his PhD from Harvard University, where his thesis focused on biomedical imaging data mining. He also has research experience in wireless networks and network data analysis and is a recipient of the third IEEE ComSoc Asia-Pacific Outstanding Paper Award.

Presentations

Dilated Neural Networks for Time Series Forecasting Session

Dilated neural networks are a class of recently developed neural networks that achieve promising results in time series forecasting. We introduce representative network architectures of dilated neural networks. Then, we demonstrate their advantages in terms of training efficiency and forecast accuracy by applying them to solve sales forecasting and financial time series forecasting problems.
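The core building block of these networks is the dilated causal convolution: each output looks back at inputs spaced `dilation` steps apart, never into the future, so stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially. A minimal sketch (toy filter and data, not from the session):

```python
# Dilated causal 1-D convolution: y[t] = sum_k kernel[k] * x[t - k*dilation],
# with implicit zero padding on the left so no output peeks at the future.

def dilated_causal_conv(series, kernel, dilation):
    out = []
    for t in range(len(series)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # left-pad with zeros: skip out-of-range taps
                acc += w * series[idx]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# Two-tap difference filter at dilation 2: y[t] = x[t] - x[t-2]
y = dilated_causal_conv(x, kernel=[1.0, -1.0], dilation=2)
print(y)  # [1.0, 2.0, 2.0, 2.0, 2.0, 2.0]
```

With a kernel of size k, a stack of layers with dilations d_1, ..., d_L sees a receptive field of 1 + (k-1) * (d_1 + ... + d_L), which is why doubling dilations per layer covers long histories cheaply.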

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and is currently spending a lot of his time writing a book, Stream Processing with Apache Flink.

Presentations

Flink SQL in Action Session

Processing streaming data with SQL is gaining a lot of attention. In this talk, Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He will also present a selection of common use cases and demonstrate how easily they can be addressed with Flink SQL.

Cory Ilo is a computer vision engineer in the Automotive Solutions group at Intel. He helps prototype and research the feasibility of various computer vision solutions in relation to privacy, ethics, deep learning, and autonomous vehicles. In his spare time, Cory focuses on his passion for fitness, video games, and wanderlust, in addition to finding ways on how they tie into computer vision.

Presentations

AI Privacy and Ethical Compliance Toolkit Tutorial

From healthcare to smart home to autonomous vehicles, new applications of autonomous systems are raising ethical concerns including bias, transparency, and privacy. In this tutorial, we will demonstrate tools and capabilities that can help data scientists address these concerns. The tools help bridge the gap between ethicists and regulators, and machine learning practitioners.

Dr. Inchiosa is a principal software engineer at Microsoft, where he focuses on scalable machine learning and AI. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.

Presentations

Building High-Performance Text Classifiers on a Limited Labeling Budget Session

We show how three cutting-edge machine learning techniques can be used together to up your modeling game: (1) transfer learning from pretrained language models, (2) active learning to make more effective use of a limited labeling budget, and (3) hyperparameter tuning to maximize model performance. We will apply these techniques to a growing business challenge: moderating public discussions.

Alex Ingerman is a product manager at Google AI, focusing on federated learning and other privacy-preserving technologies. His mission is to enable all ML practitioners to protect their users’ privacy by default. Prior to joining Google, Alex worked on ML-as-a-service platforms for developers, web-scale search, content recommendation systems and immersive data-exploration and visualization. Alex lives in Seattle, where as a frequent bike and occasional kayak commuter, he has fully embraced the rain. Alex holds a BS in computer science and an MS in medical engineering.

Presentations

The future of machine learning is decentralized Session

Federated Learning is the approach of training ML models across a fleet of participating devices, without collecting their data in a central location. Alex Ingerman introduces Federated Learning, compares the traditional and federated ML workflows, and explores the current and upcoming use cases for decentralized machine learning, with examples from Google's deployment of this technology.
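The aggregation step at the heart of federated learning, federated averaging (FedAvg), can be sketched concisely: each device trains locally and ships only its model weights, and the server combines them weighted by how many local examples each device trained on. Toy two-element weight vectors stand in for real model parameters; this is a conceptual sketch, not Google's implementation.

```python
# Federated averaging: combine client model weights, weighted by each
# client's number of local training examples. Raw data never leaves devices.

def federated_average(client_updates):
    """client_updates: list of (weights, n_examples) pairs.
    Returns the example-weighted average of the weight vectors."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    avg = [0.0] * dim
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            avg[i] += w * n / total
    return avg

updates = [
    ([1.0, 0.0], 10),  # client A trained on 10 local examples
    ([0.0, 1.0], 30),  # client B trained on 30 local examples
]
print(federated_average(updates))  # [0.25, 0.75]
```

A full round repeats this: the server broadcasts the averaged model, a sampled fleet of devices trains locally, and their updates are averaged again.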

Maryam is a research scientist at TapRecruit, a company that develops software tools for evidence-based recruiting. TapRecruit's research program integrates recent advances in NLP, data science, and decision science to identify robust methods to reduce bias in talent decision-making and attract more qualified and diverse candidate pools. In a past life, Maryam was a cancer scientist researching how growing organs 'know' they've reached the right size. She is originally from Melbourne, Australia.

Presentations

Shortcuts that shortcircuit talent pipelines: Data-driven optimization of hiring Session

Hiring teams largely rely on both intuition and experience to scout talent for data science and data engineering roles. Drawing on results from analyzing over 15 million jobs and their outcomes, Maryam Jahanshahi interrogates these “common sense” judgements to determine whether they help or hurt hiring of data scientists and engineers.

Michael is a senior data scientist at Lockheed Martin Corporation. He has done data science and analytics work in fields including manufacturing optimization, semiconductor reliability, and human resources, with a focus on time series forecasting and simulation. Recently he has focused on applying cutting-edge deep learning algorithms to NLP domains.

Presentations

NLP from Scratch: Solving the Cold Start Problem for Natural Language Processing Session

How do you train a machine learning model with no training data? We will present our journey implementing multiple solutions to bootstrapping training data in the NLP domain. We will cover topics including weak supervision, building an active learning framework, and annotation adjudication for Named Entity Recognition.

Ken Johnston is a frequent keynote presenter, trainer, blogger, and author. Currently he is the principal data science manager for the Microsoft 360 Business Intelligence Group (M360 BIG). Since joining Microsoft in 1998, Johnston has shipped many products, including Commerce Server, Office 365, Bing Local and Segments, and Windows, and for two and a half years (2004-2006) he served as Microsoft's director of test excellence. He earned his MBA from the University of Washington and is a coauthor of How We Test Software at Microsoft and a contributing author to Experiences of Test Automation: Case Studies of Software Test Automation. For more information, contact him on Twitter (@rkjohnston) or read his blog posts on data science management on LinkedIn (https://www.linkedin.com/in/rkjohnston/).

Presentations

Executive Briefing: The 6 Keys to Successful Data Spelunking Session

At the rate data sources are multiplying, business value can often be developed faster by joining data sources than by mining a single source to the very end. This presentation covers four years of hands-on practical experience sourcing and integrating massive numbers of data sources to build the Microsoft Business Intelligence Graph (M360 BIG).

Infinite Segmentation: Scalable Mutual Information Ranking on real world graphs Session

These days it's not about normal growth; it's about driving hockey-stick levels of growth. Sales and marketing orgs are looking to AI to help them growth-hack their way into new markets and segments. We have used mutual information for many years to help filter out noise and find the critical insights into new cohorts of users, businesses, and networks, and now we can do it at scale across massive data sources.
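Mutual information ranking of the kind described above scores each feature by how much knowing it reduces uncertainty about the outcome; informative features score high, pure noise scores zero. A hedged illustration with toy binary counts (invented, not Microsoft's data):

```python
# Rank binary features by mutual information with a binary outcome.
from math import log2

def mutual_information(joint):
    """joint[x][y]: counts of (feature=x, outcome=y) for x, y in {0, 1}.
    Returns MI in bits."""
    total = sum(sum(row) for row in joint)
    px = [sum(row) / total for row in joint]                       # P(feature)
    py = [sum(joint[x][y] for x in range(2)) / total for y in range(2)]  # P(outcome)
    mi = 0.0
    for x in range(2):
        for y in range(2):
            pxy = joint[x][y] / total
            if pxy > 0:
                mi += pxy * log2(pxy / (px[x] * py[y]))
    return mi

# Feature A perfectly predicts the outcome; feature B is independent noise.
feature_a = [[50, 0], [0, 50]]
feature_b = [[25, 25], [25, 25]]
print(mutual_information(feature_a))  # 1.0 (bits)
print(mutual_information(feature_b))  # 0.0
```

Computing these contingency counts is a simple group-by, which is what makes the ranking easy to push down into a distributed engine and run "at scale across massive data sources."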

Jowanza Joseph is a principal software engineer at One Click Retail. His work focuses on distributed stream processing and distributed data storage.

Presentations

Reducing Stream Processing Complexity by using Apache Pulsar Functions Session

After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, we evaluated our solution and decided to explore a new platform that would (1) take advantage of Kubernetes and (2) support a simpler data processing DSL. We settled on Apache Pulsar because of its native support for Kubernetes and Pulsar Functions, a serverless functions model on top of Pulsar.
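To give a flavor of the serverless model: in Pulsar's language-native Python interface, any plain function that takes a message payload and returns a value can be deployed as a Pulsar Function, with Pulsar wiring the input and output topics around it. The transformation below is an illustrative toy, not One Click Retail's code.

```python
# Minimal Pulsar Function (language-native Python interface): consume a
# string message and return the value to publish to the output topic.

def exclamation(input):
    """Append '!' to the incoming message; Pulsar publishes the return
    value to the configured output topic."""
    return input + "!"

# Locally it behaves like any ordinary Python function:
print(exclamation("hello"))  # hello!
```

Deployment is handled by the CLI, e.g. `pulsar-admin functions create --py exclamation.py --classname exclamation --inputs in-topic --output out-topic` (topic names here are illustrative; exact invocation depends on your cluster).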

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Cross-Cloud Model Training & Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-premises). We'll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links provided.

Understanding Spark Tuning with Auto Tuning (or how to stop your pager going off at 2am*) Session

Apache Spark is an amazing distributed system, but part of the bargain we've all made with the infrastructure demons involves providing the correct set of magic numbers (aka tuning), or our jobs may be eaten by Cthulhu. This talk will look at auto-tuning jobs using historical and static job information, with systems like Mahout and internal Spark ML jobs as workloads, including new settings in Spark 2.4.

Alon Kaufman is the CEO and co-founder of Duality Technologies. Prior to founding Duality, he was RSA’s global Director of Data Science and Innovation, leading data science for RSA across its full portfolio. Alon has over 20 years of experience in technology and innovation management in hi-tech companies, dealing with various aspects of artificial intelligence. He earned his PhD in computational neuroscience and machine learning from the Hebrew University and an MBA from the Tel Aviv University.

Presentations

Machine Learning on Encrypted Data: Challenges and Opportunities Session

In this talk, we will discuss the challenges and opportunities of machine learning on encrypted data and describe the state of the art in this space.

Anish Kejariwal is the Director of the Analytics Engineering group at Roche. At Roche, Anish leads the analytics architecture to support Roche’s NAVIFY platform, which aims to support oncology care teams to review, discuss, and align on treatment decisions for the patient. Anish has extensive experience building large-scale big data cloud architecture platforms in the life sciences and health care space.

Presentations

Spark NLP: How Roche Automates Knowledge Extraction from Pathology & Radiology Reports Session

We'll show how Roche applies Spark NLP for Healthcare to extract clinical facts from pathology and radiology reports, and the design of the deep learning pipelines used to simplify training, optimization, and inference of such domain-specific models at scale.

Until recently, Arun Kejariwal was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers and worked on research and development of novel techniques for install and click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns. In addition, his team built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high-performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Architecture and Algorithms for End-to-End Streaming Data Processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). In this tutorial, we lead the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline (messaging, compute, and storage) for real-time data, along with algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.
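As a taste of the stream algorithms mentioned above, here is a sketch of Misra-Gries heavy hitters, which finds every item occurring more than n/k times in a stream of n items using only k-1 counters and a single pass. The stream and parameters are toy examples, not from the tutorial.

```python
# Misra-Gries summary: any item with frequency > n/k is guaranteed to
# survive in the counter set after one pass over the stream.

def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k-1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Counter set full: decrement all, drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = list("aababcabad")  # 'a' appears 5 times out of 10
candidates = misra_gries(stream, k=3)
print("a" in candidates)  # True: 'a' occurs more than 10/3 times
```

The returned counts are underestimates (off by at most n/k), so a second pass, or exact counting of just the candidates, refines them; quantile sketches like t-digest follow the same small-summary, single-pass pattern.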

- Committer on Apache Impala (May 2018-present)
- Senior software engineer at SK Telecom (Mar 2017-present)
  Leads the scrum for cloud platform development using Kubernetes, Docker, Apache Druid, and Apache Hadoop.
  Designed and implemented a Dockerized DevOps framework.
- Senior software engineer at SAP Labs (Apr 2014-Feb 2017)
  Developed the SAP HANA in-memory engine.
- Software engineer at SAP Labs (Jan 2008-Mar 2014)
  Developed the SAP HANA in-memory engine.
- Intern at Samsung Electronics (Mar 2003-Dec 2005)

Presentations

Apache Druid auto scale-out/in for streaming data ingestion on Kubernetes Session

Druid supports an autoscaling feature for data ingestion, but it is only available on AWS EC2, so we cannot rely on it in our private cloud. In this talk, we introduce auto scale-out/in on Kubernetes. We will show the benefits of our approach and where they come from, and share our development of a Druid Helm chart, rolling updates, and custom metric usage for horizontal autoscaling.

Alex Kira is an engineering tech lead on Uber's data workflow management team. His team provides a data infrastructure platform for thousands of engineers, data scientists, and city ops teams, empowering them to own and manage their data pipelines.

During his 19-year career, he has worked at various startups as well as Apple and Oracle. He has experience in several software disciplines, including data engineering, infrastructure, DevOps, and full stack development, which allows him to bring a holistic systems view to his projects. He received his undergraduate degree in computer science from the University of Miami and his master's degree from the Georgia Institute of Technology. When not slinging code, Alex enjoys hiking around the Bay Area, rock climbing, and traveling internationally.

Presentations

Managing Uber's Data Workflows at Scale Session

Uber operates at scale: thousands of microservices serve millions of rides a day, generating more than a hundred petabytes of data. We will describe our journey toward a unified and scalable data workflow system used at Uber to manage this data, the challenges we faced, and how we re-architected the system to make it highly available and horizontally scalable.

Tobi Knaup is the CTO and cofounder of Mesosphere, a hybrid cloud platform company that helps companies like NBCUniversal, Deutsche Telekom, and Royal Caribbean adopt transformative technologies like machine learning and real-time analytics with ease. He was one of the first engineers and a tech lead at Airbnb, where he wrote large parts of the infrastructure, including the search and fraud prediction services, helped scale the site to millions of users, and helped build a world-class engineering team. Tobi is the main author of Marathon, Mesosphere's container orchestrator.

Presentations

Deep learning beyond the learning Session

There are many great tutorials for training your deep learning models using TensorFlow, Keras, Spark, or one of the many other frameworks, but training is only a small part of the overall deep learning pipeline. This talk gives an overview of building a complete automated deep learning pipeline, starting with exploratory analysis and continuing through training, model storage, model serving, and monitoring.

Jari is VP of technology at FICO. He holds a PhD in computer science from the Royal Institute of Technology in Stockholm, Sweden. Jari led research in computer languages and distributed computing at Ericsson Labs (Stockholm and Los Angeles) and Hewlett-Packard Laboratories (Palo Alto). Between 1998 and 2002, he led development of the flagship product MarketSite at Commerce One. After Commerce One, Jari worked as CTO and CEO with investors turning around and growing companies in online gaming and smart metering. In 2006, Jari founded two companies, both in the emerging space of cloud computing: Qrodo.com (Singapore), which delivers an elastic platform for broadcasting sports events live on the internet, where he served as CSO/CTO, and Groupswim.com (San Francisco), an early social enterprise collaboration company, where he was CTO. In 2009, Salesforce.com acquired Groupswim.com, at which point Jari moved into leading the development of Chatter, Salesforce.com's social enterprise application and platform. At FICO, Jari leads the development of AI- and analytics-driven automated decisions for the financial industry and beyond. He also teaches data science at UC Berkeley.

Presentations

Interpretable and Resilient AI for Financial Services Session

Financial services firms are increasingly deploying AI services for a wide range of applications, such as the credit life cycle, fraud, and financial crimes. Such deployments require models to be interpretable, explainable, and resilient to adversarial attacks, as regulatory requirements prohibit the application of black-box machine learning models. This talk describes what FICO has developed to support these needs.

Jing (Nicole) is a data scientist experienced with a range of machine learning and deep learning models. She works with big data and transforms data and models into products and services that drive business.

Presentations

User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Session

User-based real-time recommendation systems have become an important topic in ecommerce. This talk demonstrates how to build deep learning algorithms using Analytics Zoo and BigDL on Apache Spark and create an end-to-end system to serve real-time product recommendations.

Chi-Yi Kuan is director of business analytics at LinkedIn. He has over 15 years of extensive experience in applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.

Presentations

Full Spectrum of Data Science to Drive Business Decisions Tutorial

Thanks to the rapid growth in data resources, business leaders now appreciate the challenge and importance of mining information from data. In this tutorial, a group of well-respected data scientists share their experiences and successes leveraging emerging techniques to support intelligent decisions that lead to impactful outcomes at LinkedIn.

Aleksandra is head of product at Astro Digital, a platform for fast and easy access to satellite imagery. She was a cofounder of ImageAiry, an online marketplace for satellite imaging services. She is passionate about data pipelines and analytics on top of satellite imagery.

Presentations

Understanding World Food Economy with Satellite Images and AI Data Case Studies

It has become possible to observe food crops from satellites daily, at a global scale, to derive agriculture-specific insights and predict productivity. This talk covers the publicly available satellite imagery data, how to inject it into your data pipeline, and how to train and deploy AI/ML models based on it.

Abhishek Kumar is a manager of data science in Sapient’s Bangalore office, where he looks after scaling up the data science practice by applying machine learning and deep learning techniques to domains such as retail, ecommerce, marketing, and operations. Abhishek is an experienced data science professional and technical team lead specializing in building and managing data products from conceptualization to deployment phase and interested in solving challenging machine learning problems. Previously, he worked in the R&D center for the largest power-generation company in India on various machine learning projects involving predictive modeling, forecasting, optimization, and anomaly detection and led the center’s data science team in the development and deployment of data science-related projects in several thermal and solar power plant sites. Abhishek is a technical writer and blogger as well as a Pluralsight author and has created several data science courses. He is also a regular speaker at various national and international conferences and universities. Abhishek holds a master’s degree in information and data science from the University of California, Berkeley.

Presentations

The Hitchhiker's Guide to Deep Learning Based Recommenders in Production Tutorial

This tutorial describes deep learning-based recommender and personalization systems we have built for clients. It focuses on TensorFlow Serving and MLflow for end-to-end productionalization, including model serving, Dockerization, reproducibility, and experimentation, plus how to use Kubernetes for deployment and orchestration of ML-based microarchitectures.

Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering at the University of California, San Diego. He is a member of the Database Lab and CNS and an affiliate member of the AI Group. His primary research interests are in data management and systems for machine learning/artificial intelligence-based data analytics. Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, Microsoft, and other companies. He is a recipient of the ACM SIGMOD 2014 Best Paper Award, the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS, and a 2016 Google Faculty Research Award.

Presentations

Faster ML over Joins of Tables Session

This talk presents a couple of recent research techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, we show how to avoid joins before ML to reduce runtimes and memory/storage footprints. Open source software prototypes and sample ML code in both R and Python will also be shown.
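The core idea of avoiding joins before ML can be illustrated with a toy factorized aggregation (an invented example, not code from the talk): a statistic over the joined table can be computed from key frequencies alone, without materializing the join.

```python
from collections import Counter

# Dimension table R: key -> feature value x
R = {"a": 2.0, "b": 3.0, "c": 5.0}

# Fact table S: rows carrying a foreign key into R
S = ["a", "a", "b", "c", "c", "c"]

# Naive approach: materialize the join, then sum the feature.
joined = [R[fk] for fk in S]          # 6 replicated feature values
naive_sum = sum(joined)

# Factorized approach: count key frequencies in S, then combine
# with R once per key -- no replication of R's features.
freq = Counter(S)
factorized_sum = sum(cnt * R[k] for k, cnt in freq.items())

assert naive_sum == factorized_sum    # 2+2+3+5+5+5 = 22
print(factorized_sum)
```

The same trick generalizes to the sums, counts, and inner products that many ML training algorithms are built from, which is what lets the join be avoided entirely.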

Rakesh Kumar is a software engineer on the pricing team at Lyft. He has a diverse background: he started his career as an embedded software engineer for mobile devices, then moved to server-side engineering to tackle bigger challenges in distributed systems. Lately, he has focused on machine learning and streaming systems.

Presentations

The magic behind your Lyft ride prices - a case study of Machine Learning and Streaming Session

At the core of Lyft is how we dynamically price our rides: a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability, which allows the pricing system to adapt to real-world changes. The streaming platform powers pricing by bringing together the best of both worlds: ML algorithms in Python and a JVM-based streaming engine.

Ram Shankar is a Data Cowboy on the Azure Security Data Science team at Microsoft, where his team's primary focus is modeling massive amounts of security logs to surface malicious activity. His work has appeared at industry conferences like DEF CON, BSides, BlueHat, DerbyCon, MIRcon, Infiltrate, and Strata + Hadoop World, as well as academic conferences like NIPS and ACM CCS. Ram graduated from Carnegie Mellon University, focusing on machine learning and security. He is also currently an affiliate at the Berkman Klein Center at Harvard, exploring the intersection of machine learning and security.

Presentations

Framework to quantitatively assess ML Safety – Technical Implementation & Best Practices Session

How can we guarantee to our customers that the ML systems we develop are adequately protected from adversarial manipulation? Data scientists, program managers, and security experts will take away a framework and corresponding best practices to quantitatively assess the safety of their ML systems.

Santosh is a Senior Product Manager at Cloudera and he leads SDX – Cloudera’s Shared Data eXperience offering. Before joining Cloudera, he was a Data Scientist at Facebook. Prior to that he was a software engineer at Yahoo and Akamai. He received his BS in Computer Science from IIT Kanpur, India and an MBA from Insead, France.

Presentations

Hands on with Cloudera SDX: Setting up your own Shared Data eXperience Tutorial

Cloudera SDX provides unified metadata control, simplifies administration, and maintains context as well as data lineage across storage services, workloads, and operating environments. In this three-hour tutorial, we cover the background to SDX before diving deep into its moving parts and getting hands-on with setup. You'll leave with the skills and experience you need to set up your own SDX.

To be added

Presentations

Combining deep learning and Gaussian processes Session

Machine learning is delivering immense value across industries. However, in some instances machine learning models can produce over-confident results - with the potential for catastrophic outcomes. In this talk, we'll describe how to address this challenge through Bayesian machine learning, and highlight real-world examples to illustrate its benefits.
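As a small illustration of the calibration this talk advocates (an invented example, not session material), a minimal NumPy Gaussian-process regression shows predictive uncertainty shrinking near observed data and reverting to the prior far from it:

```python
import numpy as np

# Minimal GP regression with an RBF kernel. Data and
# hyperparameters are illustrative.
def rbf(a, b, length=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

X = np.array([-2.0, 0.0, 2.0])         # training inputs
y = np.sin(X)                          # noiseless observations
Xs = np.array([0.0, 10.0])             # test points: near vs. far

K = rbf(X, X) + 1e-8 * np.eye(len(X))  # jitter for stability
Ks = rbf(X, Xs)
Kss = rbf(Xs, Xs)

# Standard GP posterior mean and covariance
alpha = np.linalg.solve(K, y)
mean = Ks.T @ alpha
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
std = np.sqrt(np.maximum(np.diag(cov), 0.0))

# Near the data the model is confident (std ~ 0); far away it
# reverts to the prior (std ~ 1 for this kernel).
print(std)
```

A plain point predictor would emit an answer at x = 10 with no signal that it is extrapolating; the posterior standard deviation makes that over-confidence visible.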

A collaborative machine learning leader who combines expert-level statistical modeling and computer science knowledge to deliver powerful models used in making strategic decisions across industries. Able to lead teams and work independently under demanding time constraints, with a passion for designing high-performance AI systems grounded in a deep understanding of the underlying data.

Presentations

ML & AI @ Scale at PayPal Session

PayPal's data ecosystem is fairly large, with over 250 PB of data and transactions in over 200 countries. Given this massive scale and complexity, discovering and accessing the right data sets in a frictionless environment is a major challenge. PayPal's Data Platform team is helping solve this problem holistically with a combination of self-service, integrated, and interoperable products.

JoLynn Lavin is a manager of decision sciences and analytics at General Mills, where she leads a team of analysts focused on unleashing the power of data to drive consumer-led decision making. Previously, JoLynn was a loyalty marketing consultant, helping clients acquire, retain, and build profitable relationships with their customers across virtually every industry. JoLynn holds a master's degree in agricultural and consumer economics from the University of Illinois at Urbana-Champaign.

Presentations

Voice of the Customer: a Case Study in how Machine Learning can Automate Consumer Insights Data Case Studies

General Mills engages millions of consumers in conversations every year: through traditional 1-800 numbers and text messaging with call center agents, along with online conversations on social media, comments on our recipe websites, and chatbots. This session highlights the application of machine learning to listen to the voice of our customers, arguably the most powerful force in today's market.

Francesca Lazzeri, PhD, is an AI and machine learning scientist at Microsoft on the Cloud Developer Advocacy team. Francesca is passionate about innovations in big data technologies and the application of machine learning-based solutions to real-world problems. Her work covers a wide range of industries, including energy, oil and gas, retail, aerospace, healthcare, and professional services.
Before joining Microsoft, she was Research Fellow in Business Economics at Harvard Business School, where she performed statistical and econometric analysis within the Technology and Operations Management Unit. At Harvard Business School, she worked on multiple patent data-driven projects to investigate and measure the impact of external knowledge networks on companies’ competitiveness and innovation.
Francesca is a mentor for PhD and Postdoc students at the Massachusetts Institute of Technology and enjoys speaking at academic and industry conferences to share her knowledge and passion for AI, machine learning, and coding.

Presentations

Cross-Cloud Model Training & Serving with Kubeflow Tutorial

This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-prem). We'll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud, then look at how to set up serving in another cluster/cloud. We will start with a simple model and provide follow-up links.

Forecasting Financial Time Series with Deep Learning on Azure 2-Day Training

Francesca Lazzeri will walk you through the core steps for using Azure Machine Learning services to train your machine learning models both locally and on remote compute resources.

Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Julien is a principal engineer at WeWork. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

From flat files to deconstructed database: The evolution and future of the big data ecosystem Session

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.

Mike Lee Williams is a research engineer at Cloudera Fast Forward Labs, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Cloudera’s customers understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.

Presentations

Federated learning Session

Imagine building a model whose training data is collected on edge devices such as cell phones or sensors. Each device collects data unlike any other, and the data cannot leave the device because of privacy concerns or unreliable network access. This challenging situation is known as federated learning. In this talk we’ll cover the algorithmic solutions and the product opportunities.
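The core loop of federated averaging can be sketched in a few lines (a toy simulation with invented data and a linear model, not the talk's code): each device computes a local update on data that never leaves it, and the server averages only the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

def device_update(w, n=100, lr=0.1, steps=10):
    # Local gradient descent on data that never leaves the device.
    X = rng.normal(size=(n, 2))
    y = X @ true_w
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / n
        w = w - lr * grad
    return w

w_global = np.zeros(2)
for round_ in range(5):
    # Each device trains locally; only weights travel to the server.
    local = [device_update(w_global.copy()) for _ in range(3)]
    w_global = np.mean(local, axis=0)   # server averages weights only

print(w_global)   # approaches [2, -1]
```

Real federated learning adds device sampling, secure aggregation, and communication-efficiency tricks on top of this loop, but the privacy property is visible already: the server never sees raw data, only parameters.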

Christopher is a Senior Data Scientist at idealo.de where he works on computer vision problems to improve the product search experience. In previous positions he applied machine learning methods to fMRI as well as financial data. Christopher holds a Master’s degree in statistics from Humboldt Universität Berlin.

Presentations

Using Deep Learning to automatically rank millions of hotel images Session

At idealo.de we trained convolutional neural networks (CNNs) for aesthetic and technical image quality prediction. We will present our training approach, share practical insights, and shed some light on what the trained models actually learned by visualising the convolutional filter weights and output nodes of our trained models.

Jimmy Li is a versatile principal developer at Atlassian who has a breadth of experience from working in a variety of teams and countries. He's been part of several key initiatives, ranging from single sign-on to segmentation and targeted in-product messaging. In his most recent role he was the technical lead on an initiative to make Atlassian a more data-driven organization by transforming the company's behavioral analytics solution.

Presentations

Transforming behavioural analytics at Atlassian Session

Analytics is easy; good analytics is hard. Here at Atlassian we know this all too well from our push to become a truly data-driven organisation. To achieve this we've transformed the way we think about behavioural analytics, from how we define our events all the way to how we ingest and analyse them.

Tianhui Michael Li is the founder and CEO of the Data Incubator. Michael has worked as a data scientist lead at Foursquare, a quant at D.E. Shaw and JPMorgan, and a rocket scientist at NASA. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup that lets him focus on what he really loves. He did his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall scholar.

Presentations

Big Data for Managers 2-Day Training

In this course, the instructors will be offering a non-technical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and utilize their input and analysis for your business’s strategic priorities and decision making.

Executive Briefing: How Organizations Scale Along the Data and AI Maturity Curve Session

As their data and AI teams scale from one to thousands of employees and the maturity of their analytics capabilities evolve, companies find that the analytics journey is not always smooth. Drawing on experiences gleaned from dozens of clients, we present organizational growing pains and the best practices that successful executives have adopted to scale and grow their team.

Tommy Li is a software developer at IBM focusing on cloud, container, and infrastructure technology. He has worked on various developer journeys, which provide use cases on cloud-computing solutions, such as Kubernetes, microservices, and hybrid cloud deployments. He is passionate about machine learning and big data.

Presentations

Use Jupyter notebook to integrate adversarial attacks into a model training pipeline to detect vulnerabilities Session

In this talk we discuss how to implement many state-of-the-art methods for attacking and defending classifiers using the open source Adversarial Robustness Toolbox. For AI developers, the library provides interfaces that support the composition of comprehensive defense systems using individual methods as building blocks.
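To give a flavor of the attacks such a toolbox implements, here is a minimal fast-gradient-sign (FGSM) evasion attack against a fixed logistic classifier, written in plain NumPy. This is not the Adversarial Robustness Toolbox API, just an illustrative sketch with invented weights and inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, -2.0])     # classifier weights (assumed known)
b = 0.0
x = np.array([1.0, 0.0])      # a point classified as positive
y = 1.0                       # its true label

# Gradient of the logistic loss with respect to the *input* x
grad_x = (sigmoid(w @ x + b) - y) * w

# FGSM step: perturb x in the direction that increases the loss
eps = 0.6
x_adv = x + eps * np.sign(grad_x)

print(sigmoid(w @ x + b) > 0.5)       # original prediction: positive
print(sigmoid(w @ x_adv + b) > 0.5)   # adversarial input flips it
```

Libraries like the one discussed in the talk wrap this kind of attack (and many stronger ones) behind uniform interfaces, alongside defenses such as adversarial training and input preprocessing.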

Xiao Li is a software engineer, Apache Spark committer, and PMC member at Databricks. His main interests are Spark SQL, data replication, and data integration. Previously, he was an IBM Master Inventor and an expert on asynchronous database replication and consistency verification. He received his PhD from the University of Florida in 2011.

Presentations

Apache Spark 2.4 and Beyond Session

This talk will provide an overview of the major features and enhancements in Apache Spark 2.4 release and the upcoming releases and will be followed by a Q&A session.

Di Lin is a senior data engineer on the Infrastructure and Information Security team at Netflix, where his primary focus is building and scaling complex data systems to help infrastructure teams improve reliability and efficiency. Previously, Di was a data engineer at Facebook, where he built various company-wide data products related to identity and subscriber growth.

Presentations

Scaling Data Lineage at Netflix to Improve Data Infrastructure Reliability & Efficiency Session

Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. This session discusses Netflix's internal data lineage service, aimed at establishing end-to-end lineage across millions of data artifacts, which was essential for enhancing the platform's reliability, increasing trust in data, and improving data infrastructure efficiency.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience and enjoys intelligent design and engaging storytelling. He is passionate about data, music, and nature.

Presentations

Building a Serverless Big Data Application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this workshop, we show you how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You will build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the first day of keynotes.

Neng is a software engineer at Twitter with a broad interest in distributed systems and real-time analytics. He has worked on Twitter's key-value storage system (Manhattan), its monitoring system (Cuckoo), and its real-time processing system (Heron). He is currently a core member of Twitter's realtime-compute team and a core committer on the Apache Incubator Heron project. He holds an MS in computer science from UCLA and a bachelor's degree in computer science from Zhejiang University.

Presentations

Real-time monitoring of Twitter network infrastructure with Heron Session

This presentation shows how Twitter uses the Heron data processing engine to monitor and analyze its network infrastructure. Within two months, infrastructure engineers implemented a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. The talk focuses on the key technologies used, the architecture, and the challenges.

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Over his career, he has been responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He is also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on Machine Learning with Kafka-based Streaming Pipelines Tutorial

This hands-on tutorial examines production use of ML in streaming data pipelines: how to do periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.
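The retrain-while-scoring pattern at the heart of such pipelines can be sketched schematically (the stream, model, and thresholds below are all invented stand-ins; a real pipeline would use Kafka plus a streaming engine):

```python
# Low-latency scoring of a live stream with periodic retraining.
# A plain Python iterator stands in for a Kafka topic, and the
# "model" is a running mean used for a distance-based anomaly score.

def stream():
    # Stand-in for a topic whose values shift upward partway through.
    for i in range(1000):
        yield float(i % 50) + (i // 500) * 100

class MeanModel:
    def __init__(self, data):
        self.mean = sum(data) / len(data)
    def score(self, x):
        return abs(x - self.mean)        # distance from "normal"

window, RETRAIN_EVERY = [], 200
model = None
flagged = 0
for event in stream():
    window.append(event)
    # Hot path: score with the current model, no blocking work.
    if model is not None and model.score(event) > 80:
        flagged += 1
    # Periodic retrain on the most recent window of events.
    if len(window) % RETRAIN_EVERY == 0:
        model = MeanModel(window[-RETRAIN_EVERY:])

print(flagged)
```

The design point the tutorial explores is exactly the tension visible here: the scoring path must stay fast, while retraining happens on a slower cadence and only briefly swaps the model in.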

Adrian Lungu is a computer scientist at Adobe working on Audience Manager, a leading solution in the DMP market. Since joining the team over four years ago, he has focused on its Cassandra clusters, building a scalable architecture that keeps up with the exponential growth of the product. Adrian holds a degree in computer science and engineering from Politehnica University of Bucharest and is a DataStax Certified Apache Cassandra Professional.

Presentations

Database migrations don't have to be painful, but the road will be bumpy Session

Inspired by the blue/green deployment technique, the Adobe Audience Manager team developed an active/passive database migration procedure that allows us to test our database clusters in production, minimising risk without compromising innovation. We have successfully applied this approach twice to upgrade the entire technology stack, but it was never a smooth move.

Zhenxiao Luo is an engineering manager at Uber, where he runs the interactive analytics team. Previously, he led the development and operations of Presto at Netflix and worked on big data and Hadoop-related projects at Facebook, Cloudera, and Vertica. He holds a master’s degree from the University of Wisconsin-Madison and a bachelor’s degree from Fudan University.

Presentations

Real Time Analytics at Uber: bring SQL into everything Session

From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Inside Uber, analysts want to run analytics on any data source, in real time. This talk shares Uber's engineering effort to run real-time analytics on any data source on the fly, without any data copies.

Real Time Analytics on Deep Learning: when Tensorflow meets Presto at Uber Session

From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Inside Uber, analysts use deep learning and big data to train models, make predictions, and run analytics in real time. This talk shares Uber's engineering effort to run real-time analytics with deep learning.

Mark Madsen is the global head of architecture at Think Big Analytics, where he is responsible for understanding, forecasting, and defining the analytics landscape and architecture. Previously, he was CEO of Third Nature, where he advised companies on data strategy and technology planning and vendors on product management. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and HearthStone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache Yarn, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Foundations for Successful Data Projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation we'll provide guidance and best practices, from planning to implementation, based on years of experience working with companies to deliver successful data projects.

Jon Merriman is a senior software engineer and researcher at Verint Intelligent Self-Service, where he works on core natural language understanding capabilities for dialog systems. His primary focus is on algorithms and machine learning theory for text and speech analysis.

Presentations

How to Determine the Optimal Anomaly Detection Method For Your Application Session

An anomaly is a pattern that does not conform to past, expected behavior. Detecting anomalies has many applications, such as tracking business KPIs or spotting fraud in credit card transactions. Unfortunately, there is no single best way to detect anomalies across a variety of domains. We introduce a framework to determine the best anomaly detection method for an application based on its time series characteristics.
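A tiny example of that premise, with invented data and thresholds (not the framework from the talk): inspect a characteristic of the series (here, autocorrelation at a candidate seasonal lag) and let it choose the method (seasonal differencing before a z-score rule versus a plain z-score).

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(240)
# Hourly-style series with a 24-step season plus noise
series = 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, 240)
series[100] += 15                       # injected anomaly

def autocorr(x, lag):
    x = x - x.mean()
    return (x[:-lag] @ x[lag:]) / (x @ x)

LAG = 24
if autocorr(series, LAG) > 0.5:
    # Strongly seasonal: remove the season before scoring,
    # otherwise the seasonal swings themselves look anomalous.
    resid = series[LAG:] - series[:-LAG]
else:
    resid = series - series.mean()

z = (resid - resid.mean()) / resid.std()
anomalies = np.flatnonzero(np.abs(z) > 4)
print(anomalies)
```

The injected spike at index 100 shows up twice in the differenced residuals (once against the prior season, once against the next), which is exactly the kind of method-specific behavior a selection framework has to account for.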

Patrick is a data scientist at Civis Analytics, specializing in survey data analysis, causal inference, and production R. You can usually find him at his desk drinking tea and listening to Sufjan Stevens. Before coming to Civis, Patrick finished a PhD in quantitative psychology, where he studied applications of machine learning to the analysis of psychological and behavioral data.

Presentations

Testing ad content with survey experiments. Session

Brands that test the content of ads before they are shown to an audience can avoid spending resources on the 11% of ads that cause backlash. Using a survey experiment to choose the best ad typically improves effectiveness of marketing campaigns by 13% on average, and up to 37% for particular demographics. We discuss data collection and statistical methods for analysis and reporting.
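A minimal version of the kind of statistical comparison involved, with invented counts (not the talk's data): a two-proportion z-test between two ad variants.

```python
from math import sqrt, erf

def two_prop_ztest(success_a, n_a, success_b, n_b):
    # Pooled two-proportion z-test, two-sided p-value via erf.
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)      # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_two_sided

# Variant A: 120/1000 favorable responses; variant B: 90/1000
z, p = two_prop_ztest(120, 1000, 90, 1000)
print(round(z, 2), round(p, 4))
```

With these illustrative counts the difference is statistically significant at the 5% level; survey experiments in practice layer weighting and subgroup (demographic) analysis on top of this basic comparison.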

Siamac Mirzaie is an applied Machine Learning practitioner in the Security space. Over the past several years, his work at Netflix has revolved around building end-to-end anomaly detection systems for corporate security. Prior to Netflix, Siamac was a Data Scientist at Facebook HQ and Director of Analytics at Everquote in Boston. He received a Masters in EECS from Ecole Supérieure d’Electricité and a Masters in Financial Engineering from the University of Michigan.

Presentations

Building and Scaling a Security Detection Platform, a Netflix Original Session

Data has become a foundational pillar for security teams operating in organizations of all shapes and sizes. This new norm has created a need for platforms that enable engineers to harness data for various security purposes. This talk introduces our internal platform aimed at quickly deploying data-based detection capabilities in the Netflix corporate environment.

Daniel Monteiro works in FINRA's Market Regulation Technology group, focused on solutions for market surveillance analytics and visualizations. Daniel has developed solutions for various business areas over his 15 years of software development experience and is currently the principal developer of the Surveillance Review and Feedback visualization tool at FINRA.

Presentations

Scaling Visualization for Big Data and Analytics in the Cloud Session

This talk focuses on the big data analytics and visualization practices and tools FINRA uses to support machine learning and other surveillance activities that the Market Regulation department conducts in the AWS Cloud.

Kevin is a senior data scientist at Salesforce where he works on automated machine learning pipelines to generate and deploy customized models for a wide variety of customers and use cases. He has a PhD in astrophysics and prior to becoming a data scientist he worked on modeling how stars evolve and eventually explode. When not stirring piles of linear algebra, he can usually be found snowboarding, brewing beer, or gaming.

Presentations

Point, Click, Predict. Session

In this talk, I walk through how our open source AutoML library built on Spark, TransmogrifAI, automatically generates customized predictive models and provides insights into why a model makes the predictions it does.

Francesco Mucio is a BI architect at Zalando. The first time Francesco met the word data, it was just the plural of datum. Now he’s helping to redraw Zalando’s data architecture. He likes to draw data models and optimize queries. He spends his free time with his daughter, who, for some reason, speaks four languages.

Presentations

Scaling data infrastructure in the fashion world or “What is this? Business intelligence for ants?” Session

Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Previously, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive.

Presentations

Loosely Coupled Data with Apache Arrow Flight Session

Apache Arrow Flight is a new initiative focused on providing high performance communication within data engineering and data science infrastructure. This talk will discuss how Flight works and where it has been integrated. We’ll also discuss how Flight can be used to abstract physical data management from logical access. We’ll then share benchmarks of workloads that have been improved by Flight.

Syed Nasar is a solutions architect at Cloudera. As a big data and machine learning professional, his expertise extends to artificial intelligence, machine learning, and computer vision; he has worked with a number of enterprises to bridge big data technologies with advanced statistical analysis, machine learning, and deep learning to create high-quality data products and intelligent systems that drive strategy and investment decisions. Syed is the founder of the Nashville Artificial Intelligence Society. His research interests include NLP, deep learning (mainly RNNs and GANs), distributed systems, machine learning at scale, and emerging technologies. He holds a master's degree in interactive intelligence from the Georgia Institute of Technology.

Presentations

Anomaly detection using deep learning to measure quality of Large Datasets​ Session

Any business, big or small, depends on analytics, whether the goal is revenue generation, churn reduction, or sales and marketing. No matter the algorithms and techniques used, the result depends on the accuracy and consistency of the data being processed. In this talk, we present techniques used to evaluate the quality of data and to detect anomalies in it.

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly Media and director of community evangelism at Databricks and Apache Spark. Paco is the co-chair of JupyterCon, and an advisor for Amplify Partners, Deep Learning Analytics, and Recognai. He was named one of the top 30 people in big data and analytics in 2015 by Innovation Enterprise.

Presentations

Executive Briefing: Overview of Data Governance Session

Data governance is an almost overwhelming topic. This talk surveys its history and themes, along with tools, processes, and standards. Mistakes lead to data quality issues, lack of availability, and other risks that prevent organizations from leveraging their data; compliance efforts, on the other hand, aim to prevent the risks of leveraging data inappropriately. Ultimately, risk management is the "thin edge of the wedge" for data governance in the enterprise.

Michael serves as the architect and director for Yahoo’s next-generation stream processing, batch processing, experimentation, and general data tools. The problems he deals with focus on increasing scale, reducing latency, improving operability, and ensuring customer satisfaction, while driving quality in data and engineering best practices.

Presentations

Bullet: Querying Streaming Data in Transit with Sketches Session

Bullet is a scalable, pluggable, lightweight, multi-tenant query system for any data flowing through a streaming system, without storing it. Bullet queries are submitted first and operate on data flowing through the system from the point of submission. Bullet efficiently supports otherwise intractable operations like top-K, counting distincts, and windowing without any storage, using sketch-based algorithms.
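To give a flavor of the sketch idea the abstract refers to (an illustrative toy, not Bullet's actual implementation), here is a minimal Flajolet-Martin-style distinct-count sketch in Python: it answers "how many distinct items?" in constant memory, without storing the stream.

```python
import hashlib

class FMSketch:
    """A minimal Flajolet-Martin-style distinct-count sketch.

    It tracks the maximum number of leading zero bits observed in the
    hashes of added items and estimates the distinct count as
    2 ** max_zeros. Production systems combine many such estimators
    (as in HyperLogLog) to reduce variance, but the core idea is the
    same: constant memory, no storage of the stream itself.
    """

    BITS = 64

    def __init__(self):
        self.max_zeros = 0

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        value = int.from_bytes(digest[:8], "big")
        leading_zeros = self.BITS - value.bit_length()
        self.max_zeros = max(self.max_zeros, leading_zeros)

    def estimate(self):
        return 2 ** self.max_zeros

sketch = FMSketch()
for i in range(10_000):
    sketch.add(f"user-{i % 1000}")  # only 1,000 distinct users, seen repeatedly

print(sketch.estimate())  # a rough power-of-two estimate of the distinct count
```

Note that duplicates never move the estimate, which is exactly what makes such sketches usable on unbounded streams.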

Alexander Ng is a Senior Data Engineer at Manifold. His previous work includes a stint as an engineer and technical lead doing DevOps at Kyruus, as well as engineering work for the Navy. He holds a BS in electrical engineering from Boston University.

Presentations

Streamlining a Machine Learning Project Team Tutorial

Many teams are still run as if data science is mainly about experimentation, but those days are over. Taking models into production must now be turnkey. Sourav Dey and Alex Ng explain how to streamline a machine learning project and help your engineers work as an integrated part of your production teams, using a Lean AI process and the Orbyter package for Docker-first data science.

Kyungtaak leads the Metatron project team within product development at SK Telecom (SKT), South Korea’s largest wireless communications provider, where he is responsible for managing product development, with a focus on visualization and integration with big data applications.
He draws on 10+ years of experience as a software engineer building big data platforms for groupware, semiconductors, finance, and telecommunications. Based on this experience, Kyungtaak is focused on developing Metatron Discovery, a big data analysis product.

Presentations

When Self-Service BI Meets Geospatial Analysis Data Case Studies

In the analysis of the mobile world, everyone starts with the question "where." We have built a quick and easy self-service BI tool called Metatron Discovery, and we are now extending it to answer that question as well. In this session, we will explain how we provide geospatial analysis easily and quickly, and how we process geospatial data fast through Druid combined with Lucene.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Executive Briefing: From the Edge to AI - Taking Control of your Data for Fun and Profit Session

It's easier than ever to collect data, but managing it securely, in compliance with regulations and legal constraints, is harder. There are plenty of tools that promise to bring machine learning techniques to your data, but choosing the right tools, and managing models and applications in compliance with regulation and law, is quite difficult.

Diego Oppenheimer is the founder and CEO of Algorithmia. An entrepreneur and product developer with extensive background in all things data, Diego has designed, managed, and shipped some of Microsoft’s most used data analysis products including Excel, Power Pivot, SQL Server, and Power BI. Diego holds a bachelor’s degree in information systems and a master’s degree in business intelligence and data analytics from Carnegie Mellon University.

Presentations

Automating DevOps for Machine Learning Session

You've invested heavily in cleaning your data, feature engineering, training, and tuning your model, but now you have to deploy it into production and you discover that's a huge challenge. In this talk, you'll learn common architectural patterns and best practices from the most advanced organizations deploying their models for scalability and accessibility.

Richard Ott is a data scientist in residence at the Data Incubator, where he gets to combine his interest in data with his love of teaching. Previously, he was a data scientist and software engineer at Verizon. Rich holds a PhD in particle physics from the Massachusetts Institute of Technology, which he followed with postdoctoral research at the University of California, Davis.

Presentations

Big Data for Managers 2-Day Training

In this course, the instructors will be offering a non-technical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and utilize their input and analysis for your business’s strategic priorities and decision making.

Marc Paradis’s career has spanned a variety of healthcare companies in data-related roles, where he continuously seeks to unlock the hidden drivers of profit, efficiency, and value by applying the rigor and discipline of the scientific method to business datasets. Over the past two years, he has built out Data Science University, training the next generation of UnitedHealth Group’s data science and machine learning experts in the tools, techniques, and technologies of the discipline, as well as in the architecture, content, and ontology of UnitedHealth Group’s uniquely integrated claims, clinical, and pharmacy data assets. He received his SM in brain and cognitive sciences from the Massachusetts Institute of Technology.

Presentations

Data Science University: Transforming a Fortune 5 Workforce Session

Data Science University (DSU) was established to bring analytics education to UnitedHealth Group, the world’s largest healthcare company with over 270,000 employees. In an era of rapidly changing analytics technology and capability in an industry ripe for disruption, this session will cover how DSU has been built out over time, the challenges faced, and lessons learned.

I am a Senior Data Engineer at Netflix focused on building high quality data assets which drive innovation in delivering performant, consistent and reliable user experiences for Netflix members. Prior to joining Netflix I spent several years in financial technology helping build various large scale cloud and data solutions.

Presentations

How Netflix measures app performance on 250 million unique devices across 190 countries Session

Netflix has over 125 million members spread across 191 countries. Each day our members interact with our client applications on 250 million+ devices under highly variable network conditions. These interactions result in over 200 billion daily data points. In this session, we will highlight the data engineering and architecture which enables application performance measurement at this scale.

Joshua Patterson is the director of applied solutions engineering at NVIDIA. Previously, Josh worked with leading experts across the public and private sectors and academia to build a next-generation cyberdefense platform. He was also a White House Presidential Innovation Fellow. His current passions are graph analytics, machine learning, and GPU data acceleration. Josh also loves storytelling with data and creating interactive data visualizations. He holds a BA in economics from the University of North Carolina at Chapel Hill and an MA in economics from the University of South Carolina’s Moore School of Business.

Presentations

The Next Step in The Evolution of Data Science with GoAI Session

The next big step in data science combines the ease of use of common Python APIs with the power and scalability of GPUs. This session highlights the progress that has been made on PyGDF, the first step in giving data scientists access to familiar APIs while increasing speed. We also discuss how to get started doing data science on the GPU and provide use cases involving graph analytics.

Thomas Phelan is cofounder and chief architect of BlueData. Previously, Tom was an early employee at VMware; as senior staff engineer, he was a key member of the ESX storage architecture team. During his 10-year stint at VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular pluggable storage architecture. He went on to lead teams working on many key storage initiatives, such as the cloud storage gateway and vFlash. Earlier, Tom was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit filesystem.

Presentations

How to Protect Big Data in a Containerized Environment Session

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). TDE is difficult to configure and manage, even more so when run in Docker containers. This session will discuss these challenges and how to overcome them.

Josh Poduska is the Chief Data Scientist with Domino Data Lab. He has 17 years of experience in analytics. His work experience includes leading the statistical practice at one of Intel’s largest manufacturing sites, working on smarter cities data science projects with IBM, and leading data science teams and strategy with several big data software companies. Josh has a Masters in Applied Statistics from Cornell University.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending: accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. You’ll learn how leading organizations take a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Gungor Polatkan is a machine learning expert and engineering leader with experience building massive-scale distributed data pipelines serving personalized content at LinkedIn and Twitter. Most recently, he led the design and implementation of the AI backend for LinkedIn Learning and ramped the recommendation engine from scratch to hyper-personalized models learning billions of coefficients for 500M+ users. He deployed some of the first deep ranking models for search verticals at LinkedIn, improving Talent Search. He enjoys leading teams, mentoring engineers, and fostering a culture of technical rigor and craftsmanship while iterating fast. He worked in several notable applied research groups at Twitter, Princeton, Google, MERL, and UC Berkeley before joining LinkedIn. He has published and refereed papers at top-tier ML and AI venues such as UAI, ICML, and PAMI.

Presentations

Towards Deep and Representation Learning for Talent Search at LinkedIn Session

Talent search systems at LinkedIn strive to match potential candidates to the hiring needs a recruiter expresses in a search query. In this talk, we present the results of deploying deep learning models on a real-world production system serving 500M+ users through LinkedIn Recruiter. The challenges and approaches discussed generalize to any multifaceted search engine.

Alex Poms is a CS Ph.D. student at Stanford, advised by Prof. Kayvon Fatahalian, and research contractor for Oculus/Facebook. Alex’s Ph.D. research focuses on designing algorithms and programmable systems for efficiently analyzing video. They have published and presented work at SIGGRAPH and CVPR on systems for large-scale video analysis and efficient 3D reconstruction using deep learning.

Presentations

Scanner: Efficient Video Analysis at Scale Session

Systems like Spark made it possible to process big numerical/textual data on hundreds of machines. Today, the majority of data in the world is video. Scanner is the first open-source distributed system for building large-scale video processing applications. Scanner is being used at Stanford for analyzing TBs of film with deep learning on GCP, and at Facebook for synthesizing VR video on AWS.

Mohammad Quraishi has worked in the healthcare industry for 23 years. He is a Senior Principal Technologist at Cigna Corporation within the Data & Analytics organization. He graduated with a BS in computer science and engineering from the University of Connecticut at Storrs.
Currently the lead engineer in the Big Data Guild, his primary focus is on Hadoop and streaming architectures.

Presentations

Enabling Insights and Analytics with Data Streaming Architectures and Pipelines using Kafka and Hadoop Session

In a large global health service company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast, from relevant sources, so they can act on it quickly and share the resulting insights with consumers with the same speed and urgency. Streaming data architectures are a necessity; Kafka and Hadoop are key.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Architecture and Algorithms for End-to-End Streaming Data Processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). In this tutorial, we lead the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline (messaging, compute, and storage) for real-time data, along with algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.
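To give a taste of the "heavy hitters" class of stream algorithms the abstract mentions (an illustrative example, not material from the tutorial itself), here is the classic Misra-Gries summary in Python, a one-pass, constant-memory technique:

```python
def misra_gries(stream, k):
    """Misra-Gries heavy-hitters summary.

    After one pass, every item occurring more than n/k times in a stream
    of length n is guaranteed to be among the surviving counters. Uses at
    most k - 1 counters, so memory stays constant regardless of stream
    length; the stream itself is never stored.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 50 + ["b"] * 40 + ["c"] * 5 + ["e"] * 5  # n = 100
candidates = misra_gries(stream, k=3)
# "a" (50 occurrences) and "b" (40) both exceed n/k ~= 33,
# so both are guaranteed to appear among the candidates.
```

A second pass over the stream (or an exact count of just the few candidates) turns the candidate set into exact heavy-hitter frequencies.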

Reducing Stream Processing Complexity by using Apache Pulsar Functions Session

After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, we evaluated our solution and decided to explore a new platform that would (1) take advantage of Kubernetes and (2) support a simpler data processing DSL. We settled on Apache Pulsar because of its native support for Kubernetes and Pulsar Functions, a serverless functions model on top of Pulsar.

Nancy Rausch is a Senior Manager at SAS Institute. Nancy has been involved for many years in the design and development of SAS’ data warehouse and data management products, working closely with customers and authoring a number of papers on SAS data management products and best-practice design principles for data management solutions. She holds a Master of Science in computer engineering from Duke University, where she specialized in statistical signal processing, and a Bachelor of Science in electrical engineering from Michigan Technological University. She has recently returned to school and is pursuing a Master of Science in analytics from Capella University.

Presentations

Bringing Data to Life: Combining Machine Learning and Art to tell a Data Story Session

For data to be meaningful, it needs to be presented in a way that people can relate to. In this session, we explain how we combined machine learning and art to tell a compelling data story: we used streaming data from a solar array and machine learning techniques to create a live-action art piece. This approach helped bring the data to life in a fun and compelling way.

Bartley Richardson is a Senior Data Scientist on the AI Infrastructure team at NVIDIA. Bartley’s focus at NVIDIA is the research and application of GPU-accelerated methods that can help solve today’s information security and cybersecurity challenges. Prior to joining NVIDIA, Bartley was a technical lead and performer on multiple DARPA research projects, where he applied data science and machine learning algorithms at scale to solve large cybersecurity problems. He was also the principal investigator of an Internet of Things research project that focused on applying machine and deep learning techniques to large amounts of IoT data to provide intelligence value relating to form, function, and pattern of life. His primary research areas involve NLP and sequence-based methods applied to cyber network datasets, as well as cross-domain applications of machine and deep learning solutions to tackle the growing number of cybersecurity threats. He loves using data and visualizations to tell stories and help make complex concepts more relatable. Bartley holds a PhD in computer science and engineering from the University of Cincinnati, with a focus on loosely structured and unstructured query optimization, and a BS in computer engineering with a focus on software design and AI.

Presentations

The Next Step in The Evolution of Data Science with GoAI Session

The next big step in data science combines the ease of use of common Python APIs with the power and scalability of GPUs. This session highlights the progress that has been made on PyGDF, the first step in giving data scientists access to familiar APIs while increasing speed. We also discuss how to get started doing data science on the GPU and provide use cases involving graph analytics.

Kelley is an engineering manager at Stripe, where she leads the data infrastructure group. As an engineer, she built Stripe’s first real-time machine learning evaluation of user risk. Prior to joining Stripe, she completed a PhD at Stanford and worked on nanophotonics and 3D imaging as a researcher at HP Labs.

Presentations

Scaling model training: from flexible training APIs to resource management with Kubernetes Session

Production ML applications benefit from reproducible, automated retraining and deployment of ever-more predictive models trained on ever-increasing amounts of data. In this talk, I’ll describe how Stripe built a flexible API for training machine learning models that we use to train thousands of models per week on Kubernetes, supporting automated deployment of new models with improved performance.

David Rodriguez is a Senior Research Engineer at Cisco Umbrella (formerly OpenDNS). He has co-authored multiple pending patents with Cisco Systems, Inc. in distributed machine learning applications centered around deep learning and behavioral analytics. He has an MA in Mathematics from San Francisco State University and has previously spoken about machine learning in cybersecurity at Flink Forward, Black Hat, Flocon, Virus Bulletin, and HitBSEC.

Presentations

Masquerading Malicious DNS Traffic Session

Malicious DNS traffic patterns are inconsistent, ranging from periodic to sporadic, and typically thwart anomaly detection. Using Apache Spark and Stripe’s Bayesian inference software, Rainier, we fit the underlying time-series distribution for millions of domains and outline techniques to identify artificial traffic volumes related to spam, malvertising, and botnets, which we call masquerading traffic.

Pierre Romera is the chief technology officer at the International Consortium of Investigative Journalists (ICIJ), where he manages a team of programmers working on the platforms that enabled more than 300 journalists to collaborate on the Paradise Papers and Panama Papers investigations. Previously, he cofounded Journalism++, the Franco-German data journalism agency behind the Migrant Files, a project that won the European Press Prize in 2015 for Innovation. He is one of the pioneers of data journalism in France.

Presentations

The Paradise Papers and West Africa Leaks: Behind the scenes with the ICIJ Session

Pierre Romera, the ICIJ’s chief technology officer, offers a behind-the-scenes look into the process and explores the challenges of handling 1.4 TB of data (in many different formats) and making it available securely to journalists all over the world. The ICIJ was the team behind the Panama Papers and Paradise Papers.

Craig seeks to Inform the Art of Business through Data “and” Science. His goal as a Data Solutions Architect and Scientist is to sift through petabytes of information to find the “Little Data”, those nuggets of information and insight, that help Inform the Art and Influence the Artists to deliver the Right Products and Experiences to the Right Consumers at the Right Time. And more importantly, to Do the Right Thing for Consumers by protecting their Right to Privacy on a global, comprehensive, and even predictive scale. Luck and great opportunities have helped him deliver patents on Streaming Behavioral Data models, Adaptive Augmented Reality applications, Location-aware Engagement Intelligence, In-the-moment predictive algorithms for wearable devices, Fraud Detection/Prediction Platforms, and Personalized/Connection-based Recommenders.

Specialties:

  • Petabyte Scale Dataflow Execution Architectures to enable full-loop Data Science
  • Data Exploration and Modeling of Semi/Un/Structured Data
  • Inventing New Machine Learning Algorithms, Heuristics, and Applications
  • Science Fiction Movies and Books; Video games, of course; Learning

Presentations

Informing the Art of Business with Data and Science Data Case Studies

Few Analytics organizations are successfully delivering Actionable Insights that make it further than a Keynote or PowerPoint presentation. In this session we will focus on understanding why the "Human" element must be considered in a successful Analytics project.

John builds machine learning applications and helps develop Wise’s data science platform. Prior to joining GE, he was the data scientist for an energy efficiency startup, where he headed algorithm development and exploratory analyses. He holds physics, mathematics, and astrophysics degrees from Stanford and MIT.

Presentations

Critical Turbine Maintenance: Monitoring & Diagnosing Planes and Power Plants in Real Time Session

GE produces a third of the world's power and 60% of airplane engines. These engines form a critical portion of the world's infrastructure and require meticulous monitoring of the hundreds of sensors streaming data from each turbine. Here, we share the case study of releasing into production the first real-time ML systems used to determine turbine health by GE's monitoring and diagnostics teams.

Iman Saleh is a Research Scientist with the Automotive Solutions group. She holds a PhD in computer science from Virginia Tech, a master’s degree in computer science from Alexandria University, Egypt, and a master’s degree in software engineering from Virginia Tech. Dr. Saleh has 30+ technical publications in the areas of big data, formal data specification, service-oriented computing, and privacy-preserving data mining. Her research interests include ethical AI, machine learning, privacy-preserving solutions, software engineering, data modeling, web services, formal methods, and cryptography.

Presentations

AI Privacy and Ethical Compliance Toolkit Tutorial

From healthcare to smart home to autonomous vehicles, new applications of autonomous systems are raising ethical concerns including bias, transparency, and privacy. In this tutorial, we will demonstrate tools and capabilities that can help data scientists address these concerns. The tools help bridge the gap between ethicists and regulators, and machine learning practitioners.

Neelesh Srinivas Salian is a Software Engineer on the Data Platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists.

Presentations

Data Lineage - Why and How Session

This talk describes the data lineage system we built at Stitch Fix and the journey of building it from the ground up.

Akshai is a Principal Software Engineer working in Big Data, ETL, analytics and distributed computing at Oath. He enjoys dealing with problems at scale, decreasing latency, improving quality and creating systems that handle billions of events and terabytes of data, both streaming and batch.

Presentations

Bullet: Querying Streaming Data in Transit with Sketches Session

Bullet is a scalable, pluggable, lightweight, multi-tenant query system for any data flowing through a streaming system, without storing it. Bullet queries are submitted first and operate on data flowing through the system from the point of submission. Bullet efficiently supports otherwise intractable operations like top-K, counting distincts, and windowing without any storage, using sketch-based algorithms.

Osman Sarood leads the infrastructure team at Mist Systems, where he helps Mist scale the Mist Cloud in a cost-effective and reliable manner. Osman has published more than 20 research papers in highly rated journals, conferences, and workshops and has presented his research at several academic conferences. He has over 400 citations along with an i10-index and h-index of 12. Previously, he was a software engineer at Yelp, where he prototyped, architected, and implemented several key production systems and architected and authored Yelp’s autoscaled spot infrastructure, fleet_miser. Osman holds a PhD in high-performance computing from the University of Illinois Urbana-Champaign, where he focused on load balancing and fault tolerance.

Presentations

Live-Aggregators: A Scalable, Cost Effective and Reliable Way of Aggregating Billions of Messages in Realtime Session

Live Aggregators (LA) is a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA consumes billions of Kafka messages and does over 1.5 billion writes to Cassandra per day. It is 80% cheaper than competing streaming solutions because it runs on AWS spot instances at 70% CPU utilization.

Jörg is a software engineer at Mesosphere in Hamburg. In his previous life, he implemented distributed and in-memory databases and conducted research in the Hadoop and cloud area. His speaking experience includes various meetups, international conferences, and lecture halls.

Presentations

Deep learning beyond the learning Session

There are many great tutorials for training your deep learning models using TensorFlow, Keras, Spark, or one of the many other frameworks. But training is only a small part of the overall deep learning pipeline. This talk gives an overview of building a complete automated deep learning pipeline, starting with exploratory analysis and continuing through training, model storage, model serving, and monitoring.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine Learning from Scratch in TensorFlow 2-Day Training

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. This training will introduce TensorFlow's capabilities in Python. It will move from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow with several hands-on applications.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Foundations for Successful Data Projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation, we’ll provide guidance and best practices, from planning to implementation, based on years of experience working with companies to deliver successful data projects.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

Cloud Native Data Pipelines with Apache Kafka Session

As microservices, data services and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. In this presentation, we’ll discuss how data engineering requirements changed in a cloud-native world and share architectural patterns that are commonly used to build flexible, scalable and reliable data pipelines.

Sonali Sharma is a data engineer on the personalization team at Netflix, which, among other things, delivers the recommendations made for each user. The team is responsible for the data that goes into training and scoring the various machine learning models that power the Netflix homepage. They have been working on moving some of Netflix’s core datasets from a once-a-day batch ETL to near-real-time processing using Apache Flink.
A UC Berkeley graduate, Sonali has worked on a variety of problems involving big data. Before Netflix, she was at Yahoo on the mail monetization and data insights engineering team, building data-driven products for large-scale unstructured data extraction, recommendation systems, and audience insights for targeting, using technologies such as Spark, the Hadoop ecosystem (Pig, Hive, MapReduce), Solr, Druid, and Elasticsearch.

Presentations

Taming large-state to join datasets for Personalization Session

With so much data being generated in real time, what if we could combine these high-volume data streams in real time and provide near-real-time feedback for model training, improving personalization and recommendations and taking the customer experience on the product to a whole new level? It is possible to tame large-state joins for exactly that purpose using Flink's keyed state.
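The mechanics behind a keyed stateful join can be sketched outside of Flink: buffer each stream's latest record per key and emit a joined record once both sides have arrived. This pure-Python illustration is a conceptual sketch only; the function and event names are ours, not Flink's API.

```python
def keyed_stream_join(events):
    """Conceptual sketch of a keyed stateful join: keep per-key state for
    each side of the join and emit a joined record as soon as both sides
    have been seen. (Illustrative only -- not the Flink API.)"""
    left_state = {}   # key -> latest record from the left stream
    right_state = {}  # key -> latest record from the right stream
    joined = []
    for side, key, value in events:
        if side == "left":
            left_state[key] = value
            if key in right_state:
                joined.append((key, value, right_state[key]))
        else:
            right_state[key] = value
            if key in left_state:
                joined.append((key, left_state[key], value))
    return joined

# Interleaved events from two hypothetical streams, keyed by user.
events = [
    ("left", "user1", "impression"),
    ("right", "user1", "play"),
    ("right", "user2", "play"),
    ("left", "user2", "impression"),
]
print(keyed_stream_join(events))
# [('user1', 'impression', 'play'), ('user2', 'impression', 'play')]
```

In Flink, the per-key dictionaries would be managed keyed state, which is what lets joins of this shape scale to very large state sizes.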

Aashish Sheshadri is a research engineer at PayPal. He graduated with an MS in computer science from the University of Texas at Austin, where his research focused on active learning with human-in-the-loop systems. He currently ideates and applies deep learning to new avenues at PayPal and actively contributes to the Jupyter ecosystem and the SEIF Project.

Presentations

On a Deep Journey Towards Five Nines Session

Deep learning using sequence-to-sequence (Seq2Seq) networks has demonstrated unparalleled success in neural machine translation. Forecasting, a less explored but highly sought-after area, can leverage recent gains made in Seq2Seq networks. This talk introduces the application of deep networks to monitoring and alerting intelligence at PayPal.

Daragh leads a team of data scientists who use algorithms and the scientific method to optimize the portfolio of products stocked in Stitch Fix’s inventory.

Prior to Stitch Fix, Daragh spent a decade in academia, where he developed neural network models of human language acquisition and tested their predictions with behavioral and neuroimaging experiments.

Presentations

How to Make Fewer Bad Decisions Session

A/B testing has revealed the fallibility of the human intuition that typically drives business decisions. We describe some types of systematic errors domain experts commit. In this interactive session, we demonstrate and discuss how cognitive biases arise from heuristic reasoning processes. Finally, we propose several mechanisms to mitigate these human limitations and improve our decision-making.
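One common guard against intuition-driven reads of experiment results is a significance test. As a hedged illustration (the numbers below are invented, not from the session), a two-proportion z-test for comparing A/B conversion rates needs only the standard library:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for an A/B conversion comparison.
    Returns the z statistic and a two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 120/2400 conversions in A vs. 150/2400 in B.
z, p = two_proportion_z(120, 2400, 150, 2400)
print(round(z, 2), round(p, 3))
```

A p-value near 0.06 here is exactly the kind of ambiguous result where the cognitive biases the session discusses tend to take over, which is why pre-registered decision thresholds help.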

Rachel Silver is the product lead for ML & AI at MapR Data Technologies. Rachel also manages the MapR Ecosystem Packs. She is passionate about open source technologies. Previously, Rachel was a solutions architect and applications engineer with a focus on search technology.

Presentations

Persistent Storage for Machine Learning in KubeFlow Session

KubeFlow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. This talk will explore the problems of state and storage and how distributed persistent storage can logically extend the compute flexibility provided by KubeFlow.

Dr. Alkis Simitsis is a chief scientist for cybersecurity analytics at Micro Focus. He has more than 15 years of experience in multiple roles building innovative information and data management solutions in areas like real-time business intelligence, security, massively parallel processing, systems optimization, data warehousing, graph processing, and web services. Alkis holds 26 US patents and has filed over 50 patent applications in the US and worldwide, has published more than 100 papers in refereed international journals and conferences (top publications cited 5,000+ times), and frequently serves in various roles on the program committees of top-tier international scientific conferences. He is an IEEE senior member and a member of the ACM.

Presentations

Automation of Root Cause Analysis for Big Data Stack Applications Session

This session describes an automated technique for root cause analysis (RCA) of big data stack applications using deep learning. Spark and Impala are used as examples, but the concepts generalize to the rest of the big data stack.

Peter Warren Singer is a strategist at New America and an editor at Popular Science magazine. He has been named by the Smithsonian as one of the nation’s 100 leading innovators, by Defense News as one of the 100 most influential people in defense issues, by Foreign Policy to its Top 100 Global Thinkers list, as an official “mad scientist” for the US Army’s Training and Doctrine Command, and by Onalytica social media data analysis as one of the ten most influential voices in the world on cybersecurity and the 25th most influential in the field of robotics. Peter’s award-winning books include Corporate Warriors: The Rise of the Privatized Military Industry; Children at War; Wired for War: The Robotics Revolution and Conflict in the 21st Century; Cybersecurity and Cyberwar: What Everyone Needs to Know; and Ghost Fleet: A Novel of the Next World War, a technothriller crossed with nonfiction research that has been endorsed by people ranging from the chairman of the Joint Chiefs to a co-inventor of the internet to a writer of HBO’s Game of Thrones. His latest book is LikeWar (HMH, October 2018), which explores how social media has changed war and politics, and how war and politics have changed social media. It was named an Amazon book of the month and a New York Times “new and notable” title, and was reviewed by Booklist as a book that “should be required reading for everyone living in a democracy and all who aspire to.”
His past work includes serving at the Office of the Secretary of Defense and Harvard University, and as the founding director of the Center for 21st Century Security and Intelligence at Brookings, where he was the youngest person named senior fellow in its 100-year history.

Presentations

The CyberThreatscape: What are the Key Trends in Cybersecurity? Keynote

From social media operations and ransomware to the collapse of cyber deterrence, a series of new threats are changing the cybersecurity landscape.

Animesh Singh is an STSM and lead for IBM Watson and Cloud Platform, where he leads machine learning and deep learning initiatives on IBM Cloud and works with communities and customers to design and implement deep learning, machine learning, and cloud computing frameworks. He has a proven track record of driving design and implementation of private and public cloud solutions from concept to production. In his decade-plus at IBM, Animesh has worked on cutting-edge projects for IBM enterprise customers in the telco, banking, and healthcare industries, particularly focusing on cloud and virtualization technologies, and led the design and development of the first IBM public cloud offering.

Presentations

Use Jupyter notebook to integrate adversarial attacks into a model training pipeline to detect vulnerabilities Session

In this talk, we discuss how to implement many state-of-the-art methods for attacking and defending classifiers using the open source Adversarial Robustness Toolbox. For AI developers, the library provides interfaces that support the composition of comprehensive defense systems using individual methods as building blocks.

Guoqiong Song is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She is engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.

Presentations

Analytics Zoo: Distributed Tensorflow and Keras on Apache Spark Tutorial

In this tutorial, we will show how to build and productionize deep learning applications for big data using Analytics Zoo (https://github.com/intel-analytics/analytics-zoo), a unified analytics + AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline, using real-world use cases (such as JD.com, MLSListings, World Bank, Baosight, and Midea/KUKA).

Paul Spiegelhalter is a data scientist and deep learning specialist at Pythian. He holds a PhD in mathematics from the University of Illinois at Urbana-Champaign and is recognized for his deep expertise in utilizing cutting-edge advances in artificial intelligence and machine learning to transform groundbreaking research into usable algorithms. Paul’s expertise in predictive analytics and algorithmic modeling spans a number of industries, including computer vision, predictive maintenance, online advertising and user analysis, medical diagnostics, natural language processing, and anomaly detection.

Presentations

Machine Learning for Preventive Maintenance of Mining Haul Trucks Session

Using the example of a mining haul truck at a leading Canadian mining company, we will cover mapping preventive maintenance needs to supervised machine learning problems, creating labeled datasets, feature engineering from sensor and alert data, and evaluating models, then converting it all into a complete AI solution on Google Cloud Platform that is integrated with existing on-premises systems.
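One of the steps the session mentions, creating labeled datasets, is commonly done by labeling each sensor reading according to whether a failure follows within some horizon. The sketch below assumes hypothetical data shapes; the function and field names are ours, not the production system's.

```python
def label_windows(readings, failure_times, horizon=24):
    """Turn maintenance history into a supervised dataset: label a reading
    1 if a failure occurs within the next `horizon` time units, else 0.
    (Illustrative sketch under assumed data shapes.)"""
    return [
        (t, value, int(any(0 < f - t <= horizon for f in failure_times)))
        for t, value in readings
    ]

# Hypothetical hourly vibration readings and one recorded failure at t=50.
readings = [(0, 1.0), (10, 1.2), (30, 3.5), (40, 3.9)]
failures = [50]
print(label_windows(readings, failures))
# [(0, 1.0, 0), (10, 1.2, 0), (30, 3.5, 1), (40, 3.9, 1)]
```

The choice of horizon trades off early warning against label noise, and is usually tuned against the maintenance lead time the operations team actually needs.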

Ankit Srivastava is a senior data scientist on the core data science team in the Azure Cloud + AI Platform division at Microsoft. He was a summer intern at Microsoft in 2012 and joined full-time in January 2013. He earlier worked as a developer on the Data Integration & Insights team before moving to a data scientist role in core infrastructure research and development for data coming from Windows clients all over the world. He focuses on commercial and education segment data science projects within Microsoft and has built several production-scale ML enrichments that are leveraged for sales compensation and senior leadership team metrics.

Presentations

Executive Briefing: The 6 Keys to Successful Data Spelunking Session

At the rate data sources are multiplying, business value can often be developed faster by joining data sources rather than mining a single source to the very end. This presentation covers four years of hands-on practical experience sourcing and integrating massive numbers of data sources to build the Microsoft Business Intelligence Graph (M360 BIG).

Infinite Segmentation: Scalable Mutual Information Ranking on real world graphs Session

These days it’s not about normal growth; it’s about driving hockey-stick levels of growth. Sales and marketing orgs are looking to AI to help growth hack their way into new markets and segments. We have used mutual information for many years to help filter out noise and find the critical insights into new cohorts of users, businesses, and networks, and now we can do it at scale across massive data sources.
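Mutual information itself is straightforward to compute for discrete signals, and ranking candidate signals by it is the core of the approach. A toy sketch with standard-library code only; the data and names are ours for illustration, not the production pipeline:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two discrete sequences --
    a common score for ranking which signals best separate a cohort."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint / (p_x * p_y), written with counts to avoid extra divisions.
        mi += p_joint * math.log2(p_joint * n * n / (px[x] * py[y]))
    return mi

# Rank toy features by how informative they are about conversion.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
feats = {
    "a": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly informative (MI = 1 bit)
    "b": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label (MI = 0)
}
ranking = sorted(feats, key=lambda f: -mutual_information(feats[f], labels))
print(ranking)  # ['a', 'b']
```

At scale the same ranking is computed from aggregated count tables rather than raw rows, which is what makes the approach tractable across massive data sources.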

Dave Stuart is a senior product manager within the US Department of Defense. Dave currently leads a large-scale effort to transform the workflows of thousands of enterprise business analysts through Jupyter and Python adoption, making tradecraft more efficient, sharable, and repeatable. Prior to this focus, Dave led multiple grass-roots technology adoption efforts, developing innovative training methods which tangibly increased the technical proficiency of a large, non-coding enterprise workforce.

Presentations

An Alternative Approach to Adding Data Science to an Organization: Use Jupyter and Start with the Domain Experts. Session

Many organizations look to add data science to their skill portfolios through the hiring of data science experts. We explore a complementary way to build a data science savvy workforce that nets tremendous value by using Jupyter to add introductory data science practices to domain experts and business analysts.

Patrick is a member of the research staff at IBM Research Zurich. His research interests are in distributed systems, networking, and operating systems. Patrick graduated with a PhD from ETH Zurich in 2008 and spent two years (2008-2010) as a postdoc at Microsoft Research Silicon Valley. The general theme of his work is to explore how modern networking and storage hardware can be exploited in distributed systems. Patrick is the creator of several open source projects, such as DiSNI (RDMA for Java), DaRPC (low-latency RPC), and Apache Crail (incubating).

Presentations

Data processing at the speed of 100 Gbps using Apache Crail Session

Modern networking and storage technologies like RDMA and NVMe are finding their way into the data center. Apache Crail (incubating) is a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. In this talk, I will present Apache Crail, what it does, and how workloads based on TensorFlow or Spark can benefit from it.

Václav Surovec joined T-Mobile Czech Republic five years ago, while he was still a student at the Czech Technical University in Prague. He first came into contact with big data and Hadoop during his master’s thesis. After a few successful PoCs and pilots, the big data concept was successfully integrated into the company’s roots. He now co-manages the big data department, which has more than 45 big data engineers and delivers big data projects to Germany, the Netherlands, and the Czech Republic. One of the projects he led was the Commercial Roaming project.

Presentations

Data Science in Deutsche Telekom - Predicting global travel patterns and network demand Session

Knowledge of the location and travel patterns of customers is important for many companies, among them the German telco operator Deutsche Telekom. The Commercial Roaming project, built on Cloudera Hadoop, helped the company better analyze the behavior of its customers from 13 countries, in a very secure way, and provide better predictions and visualizations for senior management.

Shubham Tagra is a senior staff engineer at Qubole, working on Presto and Hive development and making these solutions cloud ready. Previously, Shubham worked at NetApp on its storage area network. Shubham holds a bachelor’s degree in computer engineering from the National Institute of Technology, Karnataka, India.

Presentations

Cost Effective Presto on AWS with Spot Nodes Session

Running Presto on AWS at a tenth of the cost with AWS Spot nodes can be achieved with a few architectural enhancements to Presto. This talk explains the gaps in the Presto architecture for using Spot nodes, covers these enhancements, and showcases the improvements in reliability and TCO achieved through them.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with Spark NLP Tutorial

This is a hands-on tutorial for scalable NLP using the highly performant, highly scalable open-source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Spark NLP: How Roche Automates Knowledge Extraction from Pathology & Radiology Reports Session

We’ll show how Roche applies Spark NLP for Healthcare to extract clinical facts from pathology and radiology reports, and the design of the deep learning pipelines used to simplify training, optimization, and inference of such domain-specific models at scale.

Big data storage optimization and development engineer at Intel.

Presentations

Spark-PMoF: Accelerating Bigdata Analytics with Persistent Memory over Fabric Session

We introduce Spark-PMoF and explain how it improves Spark analytics performance.

I lead large, organization-wide transformations that drive innovation and accelerate business delivery.

Currently, I lead strategy and product for data platforms and infrastructure at PayPal.

Our team manages and propels the data platforms that power PayPal’s core customers, processing over 250 PB of data, and builds products that cater to a community of over 5,000 PayPal developers, analysts, and data scientists. Our mission is not just to enable this community but also to drive efficiency, reduce friction, and reduce their time to market, which in turn drives PayPal’s growth.

Presentations

ML & AI @ Scale at PayPal Session

The PayPal data ecosystem is fairly large, with over 250 PB of data transacted in over 200 countries. Given this massive scale and complexity, discovering and accessing the right datasets in a frictionless environment is a massive challenge. PayPal’s data platform team is helping solve this problem holistically with a combination of self-service, integrated, and interoperable products.

James is a software engineer in the Data Infrastructure group at Lyft working on various big data systems. Prior to that he was an architect at Salesforce where he founded the Apache Phoenix project and led its development. He also worked at BEA Systems on federated query processing systems and event driven programming platforms.

Presentations

Adaptive ETL to Optimize Query Performance at Lyft Session

This talk will provide details of an automated feedback loop at Lyft to adapt ETL based on the aggregate cost of queries run across the cluster. In addition, future work will be outlined to enhance the system through the use of materialized views to reduce the number of ad hoc joins and sorting performed by the most expensive queries by transparently rewriting queries when possible.

Serban Teodorescu is an SRE at Adobe, where he is part of a small team that manages 20+ Cassandra clusters for Adobe Audience Manager. Before this he was a Python programmer, and he’s still trying to find out how a developer that preferred SQL databases ended up as an SRE for a Cassandra team. Apart from Cassandra and Python he’s also interested in automating infrastructure provisioning with Terraform.

Presentations

Database migrations don't have to be painful, but the road will be bumpy Session

Inspired by the blue/green deployment technique, the Adobe Audience Manager team developed an active/passive database migration procedure that allows us to test our database clusters in production, minimising the risks without compromising innovation. We successfully applied this approach twice to upgrade the entire technology stack, but it was never a smooth move.

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

This is a hands-on tutorial for scalable NLP using the highly performant, highly scalable open-source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Steve Touw is the cofounder and CTO of Immuta. Steve has a long history of designing large-scale geotemporal analytics across the US intelligence community, including some of the very first Hadoop analytics as well as frameworks to manage complex multitenant data policy controls. He and his cofounders at Immuta drew on this real-world experience to build a software product to make data experimentation easier. Previously, Steve was the CTO of 42Six Solutions (acquired by Computer Sciences Corporation), where he led a large big data services engineering team. Steve holds a BS in geography from the University of Maryland.

Presentations

Manage the Risks of ML - In Practice! Tutorial

This tutorial provides a hands-on overview of how to train, validate, and audit machine learning (ML) models across the enterprise. As ML becomes increasingly important, managing its risks is quickly becoming one of the biggest challenges to the technology’s widespread adoption. Join us to walk through practical tools and best practices to help safely deploy ML.

Martin is a cofounder of Presto and a software engineer at Facebook, where he leads the Presto development team. Previously, he was an architect at Proofpoint and Ning.

Presentations

Presto: Tuning Performance of SQL-on-Anything Analytics Session

Presto, an open source distributed SQL engine, is designed for interactive queries and the ability to query multiple data sources. With the ever-growing list of connectors (e.g., Apache Kudu, Pulsar, Netflix Iceberg, Elasticsearch), the recently introduced cost-based optimizer in Presto must account for heterogeneous data sources with incomplete statistics and new use cases such as geospatial analytics.

Cindy Tseng is a research scientist with the Intel Applied Research in Automotive Driving group. She holds a master’s degree in electrical and computer engineering from Carnegie Mellon University and a bachelor’s degree in electrical engineering and computer science from the University of Michigan at Ann Arbor, and is currently enrolled part-time in the master’s in data science program at the University of Illinois at Urbana-Champaign. Cindy has worked in the space of high-throughput computing and deep learning hardware accelerators. Her recent work in the Applied Research group covers bias detection in convolutional neural nets.

Presentations

AI Privacy and Ethical Compliance Toolkit Tutorial

From healthcare to smart home to autonomous vehicles, new applications of autonomous systems are raising ethical concerns including bias, transparency, and privacy. In this tutorial, we will demonstrate tools and capabilities that can help data scientists address these concerns. The tools help bridge the gap between ethicists and regulators, and machine learning practitioners.

Sandeep Uttamchandani is the hands-on chief data architect at Intuit. He is currently leading the cloud transformation of the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Prior to Intuit, Sandeep played various engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep’s experience uniquely combines building enterprise data products with operational expertise in managing petabyte-scale data and analytics platforms in production for IBM’s federal and Fortune 100 customers. Sandeep has received several excellence awards and holds over 40 issued patents, with 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions, gives guest lectures for university courses, and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and is a past associate editor for ACM Transactions on Storage. He blogs on LinkedIn and Wrong Data Fabric (his personal blog). Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.

Presentations

How we reduced Time-to-Reliable-Insights for Data Pipelines Session

How efficient is your data platform? The single metric we use is Time-to-Reliable-Insights — total of time spent to ingest, transform, catalog, analyze, and publish. There are three elephants-in-the-room when it comes to Time-to-Reliable-insights — time-to-discover, time-to-catalog, and time-to-debug for data quality. This talk covers three design patterns and/or frameworks we have implemented.

Vinod Vaikuntanathan is an associate professor of computer science at MIT, a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a cofounder of Duality Technologies. His research focuses on lattice-based cryptography and the theory and practice of computing on encrypted data. Vinod earned his PhD in computer science from MIT, receiving the George M. Sprowls Award for the best computer science thesis. His teaching and research in cybersecurity were recently recognized with MIT’s Harold E. Edgerton Faculty Achievement Award, a Sloan Faculty Fellowship, a Microsoft Faculty Fellowship, and a DARPA Young Faculty Award.

Presentations

Machine Learning on Encrypted Data: Challenges and Opportunities Session

In this talk, we will discuss the challenges and opportunities of machine learning on encrypted data and describe the state of the art in this space.
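The core idea in this space, computing on data that no single party can read, can be illustrated with additive secret sharing, a technique related to (but much simpler than) the homomorphic encryption the talk covers. This is a toy sketch for intuition only, not a secure implementation and not the talk's method.

```python
import random

MODULUS = 2**61 - 1  # arithmetic is done modulo a large prime

def share(value, n_parties):
    """Split a value into n additive shares; any n-1 shares alone
    reveal nothing about the value."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    """Recover the hidden value by summing all shares modulo the prime."""
    return sum(shares) % MODULUS

# Each of 3 parties holds one share per secret. Summing shares column-wise
# computes the sum of the secrets without any party seeing the raw data.
secrets = [42, 7, 100]
all_shares = [share(s, 3) for s in secrets]
per_party_sums = [sum(col) for col in zip(*all_shares)]
print(reconstruct(per_party_sums))  # 149
```

Homomorphic encryption generalizes this idea: it supports additions and multiplications directly on ciphertexts, which is what makes training and scoring ML models on encrypted data possible.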

Vinithra Varadharajan is a senior engineering manager in the cloud organization at Cloudera, where she is responsible for the cloud portfolio products, including Altus Data Engineering, Altus Analytic Database, Altus SDX, and Cloudera Director. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges from re-architecting to be cloud-native, to data context consistency across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover in depth cloud architecture and challenges; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

As a storage expert and senior director of product marketing and strategic alliances for Western Digital’s Data Center Systems business unit, Stefaan Vervaet is responsible for leading marketing and business development efforts to deliver advanced object storage-based systems and emerging storage solutions for today’s at-scale enterprise and cloud workloads, including big data analytics, virtualization, application acceleration and long-term backup/active archive. Vervaet brings 15 years of experience in the data storage and backup industry. As a business-focused technologist with an extensive start-up background, he brings a unique perspective to Western Digital. His background includes product management and go-to-market positions in the backup space (Veritas), as well as technical sales and support executive positions in the object storage world (Amplidata). As an innovator with a proven track record, Vervaet successfully helped build startup companies like DataCenter Technologies, a dedupe technology (acquired by Veritas in 2005) and Amplidata, a leading object storage vendor (acquired by HGST in 2015, a Western Digital Company). Immediately before joining HGST, he established and built the U.S. office running technical sales, support and operations worldwide for Amplidata. Vervaet holds a master’s degree in Applied Informatics from the University of Ghent, Belgium and is currently based out of Western Digital’s San Jose, CA headquarters.

Presentations

Immersive 3D VR to three-geo archive, EPFL captures the feel of the Montreux Jazz Festival Session

The École Polytechnique Fédérale de Lausanne (EPFL) spearheaded the official digital archival of 15,000+ hours of A/V content captured from the Montreux Jazz Festival since 1967, and most recently, created an immersive 3D VR experience. From capture, store, delivery and experience, this case study focuses on the evolution of M&E workflow– from camera to cloud – that made it all possible.

Lars is a software engineer at Cloudera. He has worked on various parts of Apache Impala including crash handling, its Parquet scanners, and scan range scheduling. Most recently he worked on integrating Kudu’s RPC framework into Impala. Before his time at Cloudera he worked on various databases at SAP.

Presentations

Accelerating Analytical Antelopes: Integrating Apache Kudu's RPC into Apache Impala Session

In recent years, Apache Impala has been deployed to clusters that are large enough to hit architectural limitations in the stack. Our talk covers the efforts, and the results, of addressing the scalability limitations of the now-legacy Thrift RPC framework by adopting Apache Kudu’s RPC, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos.

Todd Walter is chief technologist and fellow at Teradata, where he helps business leaders, analysts, and technologists better understand all of the astonishing possibilities of big data and analytics in view of emerging and existing capabilities of information infrastructures. Todd has been with Teradata for more than 30 years. He’s a sought-after speaker and educator on analytics strategy, big data architecture, and exposing the virtually limitless business opportunities that can be realized by architecting with the most advanced analytic intelligence platforms and solutions. Todd holds more than a dozen patents.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the Lightbend Fast Data Platform project, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He is a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago.

Presentations

Executive Briefing: What it takes to use machine learning in fast data pipelines Session

Your team is building Machine Learning capabilities. I'll discuss how you can integrate these capabilities in streaming data pipelines so you can leverage the results quickly and update them as needed. There are big challenges. How do you build long-running services that are very reliable and scalable? How do you combine a spectrum of very different tools, from data science to operations?

Hands-on Machine Learning with Kafka-based Streaming Pipelines Tutorial

This hands-on tutorial examines production use of ML in streaming data pipelines; how to do periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, pros and cons of microservices vs. systems like Spark and Flink, tips for Tensorflow and SparkML, performance considerations, model metadata tracking, and other techniques.

Jason Wang is a software engineer at Cloudera focusing on the cloud.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges from re-architecting to be cloud-native, to data context consistency across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover in depth cloud architecture and challenges; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Jiao (Jennie) Wang is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She is engaged in developing and optimizing distributed deep learning framework on Apache Spark.

Presentations

Analytics Zoo: Distributed Tensorflow and Keras on Apache Spark Tutorial

In this tutorial, we will show how to build and productionize deep learning applications for big data using Analytics Zoo (https://github.com/intel-analytics/analytics-zoo), a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline, drawing on real-world use cases from JD.com, MLSListings, the World Bank, Baosight, Midea/KUKA, and others.

User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Session

User-based real-time recommendation systems have become an important topic in e-commerce. This talk demonstrates how to build deep learning algorithms using Analytics Zoo with BigDL on Apache Spark and create an end-to-end system to serve real-time product recommendations.

Julie Wang is senior manager of the business analytics team at LinkedIn, with 10 years of experience in predictive analytics, risk management, and data product development across industries including finance, consulting, e-commerce, social networks, and SaaS. Julie is passionate about translating business problems into quantitative questions, solving them by synthesizing and mining large-scale data, and driving business decisions in a scalable way. She is also passionate about developing the data science community through mentoring, volunteering, and evangelizing a data-informed culture for all organizations.

Presentations

Full Spectrum of Data Science to Drive Business Decisions Tutorial

Thanks to the rapid growth in data resources, business leaders increasingly appreciate both the challenge and the importance of mining information from data. In this tutorial, a group of well-respected data scientists will share their experiences and successes leveraging emerging techniques to support intelligent decisions that lead to impactful outcomes at LinkedIn.

Luyang Wang is a data scientist and big data engineer at Office Depot, with a strong background in system architecture and software development.

Presentations

User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Session

User-based real-time recommendation systems have become an important topic in e-commerce. This talk demonstrates how to build deep learning algorithms using Analytics Zoo with BigDL on Apache Spark and create an end-to-end system to serve real-time product recommendations.

Olivia leads the development of new products in fraud detection at DataVisor. Her team has built an industry-leading unsupervised machine learning framework that automatically detects coordinated fraud attacks. Prior to joining DataVisor, she worked in the financial industry. She received her PhD from Cornell University.

Presentations

Detecting Coordinated Fraud Attacks Using Deep Learning Session

Online fraud flourishes as online services become ubiquitous in our daily lives. This talk will discuss how DataVisor leverages cutting-edge deep learning technologies to address the challenges of large-scale fraud detection.

Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.

Presentations

Understanding Spark Tuning with Auto Tuning (or how to stop your pager going off at 2am*) Session

Apache Spark is an amazing distributed system, but part of the bargain we've all made with the infrastructure demons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. This talk will look at auto-tuning jobs using historical and static job information, with systems like Mahout and internal Spark ML jobs as workloads, including new settings in Spark 2.4.

Cory Watson leads the Observability team at Stripe, helping engineers be more confident in their work so they can ship reliable, safe and performant features for Stripe’s products. Previously he managed the Observability team at Twitter and has spent over 20 years as a leader, software engineer, SRE, and OSS contributor.

Presentations

Veneur: Global observability data with style Session

How Stripe uses data sketching and off-the-shelf parts to build a novel observability pipeline that unifies measurements across our infrastructure, both to improve reliability and to keep vendor costs down.

Robin Way is a faculty member at the International Institute of Analytics and the founder and president of the management analytics consultancy Corios. He is the author of Skate Where the Puck's Headed: A Playbook for Scoring Big with Predictive Analytics, available from Amazon in hardcopy and on Kindle.

Robin has over 25 years of experience in the design, development, execution, and improvement of applied analytics models for clients in the credit, payments, lending, brokerage, insurance, and energy industries. Previously, Robin was a managing analytics consultant in SAS Institute’s Financial Services Business Unit for 12 years and spent another 10+ years in analytic management roles for several client-side and consulting firms.

Robin’s professional passion is devoted to democratizing and demystifying the science of applied analytics. His contributions to the field correspondingly emphasize statistical visualization, analytical data preparation, predictive modeling, time series forecasting, mathematical optimization applied to marketing, and risk management strategies.

Robin holds an undergraduate degree from the University of California at Berkeley; his subsequent graduate-level coursework emphasized the analytical modeling of human and consumer behavior. He lives in Portland, Oregon, with his wife, Melissa, and two sons, Colin and Liam. In his spare time, Robin plays soccer and holds a black belt in taekwondo.

Presentations

Organic Intelligence: Telling a story about the Human Experience with Math Data Case Studies

Why do we call it "artificial" intelligence? Did AI write itself? No, of course it didn't. We invented the math, the computer technology, and harnessed the data sources. I propose we re-position what we do as "organic intelligence": we apply math and computers to data to tell a story about the human experience. And we've designed some fun exercises for the audience to decipher what it's all about.

Thomas Weise is a software engineer for the streaming platform at Lyft. Thomas is a PMC member of Apache Apex and Apache Beam and has contributed to several other ecosystem projects. He has worked on distributed systems for over 20 years, including at a number of technology companies in the San Francisco Bay Area, such as DataTorrent, where he was a cofounder of the Apex project. Thomas is a frequent speaker at international big data conferences and the author of the book Learning Apache Apex.

Presentations

The magic behind your Lyft ride prices - a case study of Machine Learning and Streaming Session

At the core of Lyft is how we dynamically price our rides: a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability. This allows the pricing system to be more adaptable to real-world changes. The streaming platform powers pricing by bringing together the best of both worlds: ML algorithms in Python and a JVM-based streaming engine.

Melinda Han Williams is the VP of Data Science and Analytics at Dstillery. Before joining the ad tech industry, Melinda worked as a physicist developing third generation photovoltaics and studying electronic transport in nanostructured graphene devices. Her peer-reviewed journal publications have been cited over 7,000 times. Melinda holds bachelor’s degrees in Applied Math and Engineering Physics from the University of California at Berkeley, and earned her Ph.D. in Applied Physics with distinction from Columbia University, where she held a National Science Foundation Graduate Research Fellowship.

Presentations

Artificial intelligence on human behavior: New insights into customer segmentation Session

Customer segmentation based on coarse survey data has long been a staple of traditional market research. We use deep learning to model the digital pathways of over a hundred million consumers and use this embedding to cluster customer populations into fine-grained behavioral segments and inform smarter consumer insights. Along the way, we create a map of the internet.

Tony Wu manages the Altus core engineering team at Cloudera. Previously, Tony was a team lead for the partner engineering team at Cloudera. He is responsible for Microsoft Azure integration for Cloudera Director.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges, from re-architecting to be cloud-native to keeping data context consistent across workloads that span multiple clusters on-premises and in the cloud. First, we'll cover cloud architecture and its challenges in depth; second, you'll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Boris Yakubchik is a data scientist at Forbes. He creates user-facing products that use machine learning and builds systems end to end, from the servers that clean and process data to the frontend that users interact with.

Presentations

Our New Publishing Platform Will Make You A Better Writer: Using AI To Assist The Newsroom Session

Introducing Bertie, our new publishing platform at Forbes. Bertie is an AI assistant that learns from writers at all times and suggests improvements along the way. We will discuss Bertie's features, architecture, and ultimate goals, giving special attention to how we implement an ensemble of machine learning models that together make up the skill set and personality of the AI assistant.

Yuhao Yang is a senior software engineer on the Intel big data team, focusing on deep learning algorithms and applications. His area of focus is distributed deep learning and machine learning, and he has accumulated rich experience with solutions including fraud detection, recommendation, speech recognition, and visual perception. He's also an active contributor to Apache Spark MLlib (GitHub: hhbyyh).

Presentations

Analytics Zoo: Distributed Tensorflow and Keras on Apache Spark Tutorial

In this tutorial, we will show how to build and productionize deep learning applications for big data using Analytics Zoo (https://github.com/intel-analytics/analytics-zoo), a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline, drawing on real-world use cases from JD.com, MLSListings, the World Bank, Baosight, Midea/KUKA, and others.

Analytics Zoo: Distributed TensorFlow in Production on Apache Spark Session

This talk introduces how to run distributed TensorFlow on Apache Spark with the open source software package Analytics Zoo. Compared to other solutions, Analytics Zoo is built for production environments and encourages more industry users to run deep learning applications within the big data ecosystem.

Jeffrey is the chief data scientist at AllianceBernstein, a global asset-management and research firm, where he leads all of the data science efforts. Prior to AllianceBernstein, he was the vice president and head of data science at Silicon Valley Data Science (SVDS), where he led a team of PhD data scientists helping companies transform their businesses using advanced data science techniques and emerging technology. He is active in the data science community and often speaks at data science conferences and local events. Jeffrey has many years of experience applying a wide range of econometric and machine learning techniques to create analytic solutions for financial institutions, policy institutions, and businesses. He has expertise in combining high-performance computing and big data technology to generate analytic insights for strategic decision making. Prior to SVDS, Jeffrey held various positions, including head of risk analytics at Charles Schwab Corporation, director of financial risk management consulting at KPMG, assistant director at Moody's Analytics, and assistant professor of economics at Virginia Tech. Jeffrey holds a PhD and an MA in economics from the University of Pennsylvania and a BS in mathematics and economics from UCLA.

Presentations

Time Series Forecasting using Statistical and Machine Learning Models: When and How Session

Time series forecasting techniques are applied in a wide range of scientific disciplines, business scenarios, and policy settings. This presentation discusses the application of statistical time series models, such as ARIMA, VAR, and regime-switching models, and machine learning models, such as random forests and neural network-based models, to forecasting problems.

Ting-Fang Yen is director of research at DataVisor, a leading fraud detection solution with a mission of building and restoring trust online. She has over 10 years of experience in applying big data analytics and machine learning to tackle problems in cybersecurity. Ting-Fang holds a PhD in electrical and computer engineering from Carnegie Mellon University.

Presentations

Talking to the Machines: Monitoring production machine learning systems Session

We describe a monitor for production machine learning systems that handle billions of requests daily. Our approach discovers detection anomalies, such as spurious false positives, as well as gradual concept drift, where the model no longer captures the target concept. This session presents new tools for detecting undesirable model behaviors early in large-scale online ML systems.

Ali Zaidi is a PhD student in Statistics at UC Berkeley. Previously, he was a data scientist in Microsoft’s AI and Research Group, where he worked to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. Before that, Ali was a research associate at NERA (National Economic Research Associates), providing statistical expertise on financial risk, securities valuation, and asset pricing. He studied statistics at the University of Toronto and computer science at Stanford University.

Presentations

Building High-Performance Text Classifiers on a Limited Labeling Budget Session

We show how three cutting-edge machine learning techniques can be used together to up your modeling game: (1) transfer learning from pre-trained language models, (2) active learning to make more effective use of a limited labeling budget, and (3) hyperparameter tuning to maximize model performance. We will apply these techniques to a growing business challenge: moderating public discussions.

Tristan Zajonc is CTO for Machine Learning at Cloudera. Tristan previously led engineering for Cloudera Data Science Workbench and was the cofounder and CEO of Sense, an enterprise data science platform that was acquired by Cloudera in 2016. He has over 15 years experience in applied data science, machine learning, and machine learning systems development across academia and industry and holds a PhD from Harvard University.

Presentations

Cloud-Native Machine Learning: Emerging Trends and the Road Ahead Session

Data platforms are being asked to support an ever increasing range of workloads and compute environments, including machine learning and elastic cloud platforms. In this talk, we will discuss some emerging capabilities, including running machine learning and Spark workloads on autoscaling container platforms, and share our vision of the road ahead for ML and AI in the cloud.

Jian Zhang is a software engineering manager at Intel, where he and his team primarily focus on open source storage development and optimization on Intel platforms and build reference solutions for customers. He has 10 years of experience in performance analysis and optimization for many open source projects, such as Xen, KVM, Swift, Ceph, and HDFS, and benchmarking workloads like SPEC and TPC. Jian has a master's degree in computer science and engineering from Shanghai Jiao Tong University.

Presentations

Spark-PMoF: Accelerating Bigdata Analytics with Persistent Memory over Fabric Session

We introduce Spark-PMoF and explain how it improves Spark analytics performance.

Yongzheng Zhang is a Sr. Manager, Data Mining at LinkedIn and an active researcher and practitioner of text mining and machine learning. He has developed many practical and scalable solutions for utilizing unstructured data for ecommerce and social-networking applications, including search, merchandising, social commerce, and customer-service excellence. Yongzheng is a highly regarded expert in text mining and has published and presented many papers in top journals and at conferences. He is also actively organizing tutorials and workshops on sentiment analysis at prestigious conferences. He holds a PhD in computer science from Dalhousie University in Canada.

Presentations

Full Spectrum of Data Science to Drive Business Decisions Tutorial

Thanks to the rapid growth in data resources, business leaders increasingly appreciate both the challenge and the importance of mining information from data. In this tutorial, a group of well-respected data scientists will share their experiences and successes leveraging emerging techniques to support intelligent decisions that lead to impactful outcomes at LinkedIn.

Yuan Zhou is a senior software development engineer in the Software and Services Group at Intel, where he works on the Open Source Technology Center team, primarily focused on big data storage software. He has worked on databases, virtualization, and cloud computing for most of his 7+ year career at Intel.

Presentations

Spark-PMoF: Accelerating Bigdata Analytics with Persistent Memory over Fabric Session

We introduce Spark-PMoF and explain how it improves Spark analytics performance.

Penghui is a backend engineer at zhaopin.com, where he leads Apache Pulsar adoption and contributes heavily to the Pulsar open source project. Penghui has 5+ years of experience in developing message queues and microservices.

Presentations

How zhaopin.com built its enterprise event bus using Apache Pulsar Session

Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a particular set of features. This talk will focus on the event bus requirements at Zhaopin.com, one of the biggest Chinese online recruitment service providers, and why it chose Apache Pulsar.