Mar 15–18, 2020

Speakers

Hear from innovative programmers, talented managers, and senior executives who are doing amazing things with data and AI. More speakers will be announced; please check back for updates.


Arpit Agarwal is an engineer in the storage team at Cloudera and an active HDFS/Hadoop committer since 2013.

Presentations

It's 2020 now: Apache Hadoop 3.x state of the union & upgrade guidance Session

In 2020, Hadoop is still evolving fast. You'll learn the current status of the Apache Hadoop community and the exciting present and future of Hadoop 3.x. Wangda Tan and Arpit Agarwal cover new features like Hadoop in the cloud, GPU support, NameNode federation, Docker support, 10x scheduling improvements, and Ozone, and they offer upgrade guidance for moving from 2.x to 3.x.

John-Mark Agosta is a principal data scientist in IMML at Microsoft. Previously, he worked with startups and labs in the Bay Area, including the original Knowledge Industries, and was a researcher at Intel Labs, where he was awarded a Santa Fe Institute Business Fellowship in 2007, and at SRI International after receiving his PhD from Stanford. He has participated in the annual Uncertainty in AI conference since its inception in 1985, proving his dedication to probability and its applications. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.

Presentations

Machine learning for managers Tutorial

Bob Horton, Mario Inchiosa, and John-Mark Agosta offer an overview of the fundamental concepts of machine learning (ML) for business and healthcare decision makers and software product managers, so you'll be able to make more effective use of ML results and be better able to evaluate opportunities to apply ML in your industry.

Mudasir Ahmad is a distinguished engineer and senior director at Cisco. He’s been involved with design and algorithms for 17 years. Mudasir leads the Center of Excellence for Numerical Analysis, developing new analytical and stochastic algorithms. He’s also involved with implementing IoT, artificial intelligence, and big data analytics to streamline supply chain operations. Mudasir has delivered several invited talks on leading technology solutions internationally. He has over 30 publications on microelectronic packaging, two book chapters, and 13 US patents. He received the internationally renowned Outstanding Young Engineer Award from the IEEE in 2012. He earned an MS in management science and engineering from Stanford University, an MS in mechanical engineering from the Georgia Institute of Technology, and a bachelor's degree from Ohio University.

Presentations

Real-life application of AI in supply chain operations Session

Artificial intelligence (AI) is a natural fit for supply chain operations, where decisions and actions need to be taken daily or even hourly about delivery, manufacturing, quality, logistics, and planning. Mudasir Ahmad explains how AI can be implemented in a scalable and cost-effective way in your business' supply chain operations, and he identifies benefits and potential challenges.

Subutai Ahmad is the vice president of research at Numenta, where he brings his experience across real-time systems, computer vision, and learning. Previously, he was vice president of engineering at YesVideo, where he helped grow the company from a three-person startup to a leader in automated digital media authoring; cofounded ePlanet Interactive, a spin-off of Interval Research, which developed the IntelPlay Me2Cam, the first computer vision product developed for consumers; and was a key researcher at Interval Research. Subutai earned his BS in computer science from Cornell University and a PhD in computer science from the University of Illinois at Urbana-Champaign. While pursuing his PhD, Subutai completed a thesis on computational neuroscience models of visual attention.

Presentations

How can we be so dense? The benefits of using sparse representations Session

Given that today's machine learning systems can't come close to the flexibility and generality of the brain, it's normal to ask how we can learn from the brain to improve them. Sparsity provides a great starting point. Subutai Ahmad explains how sparsity works in the brain and how applying sparsity to artificial neural networks provides significant advantages.
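
As a rough illustration (not Numenta's actual method), one simple way to impose sparsity on a layer's activations is a k-winners-take-all rule that keeps only the largest few units active. The sketch below uses NumPy with made-up activation values and an assumed 5% sparsity level.

    import numpy as np

    def k_winners(x, sparsity=0.05):
        """Keep only the top-k activations in each row; zero out the rest."""
        k = max(1, int(sparsity * x.shape[-1]))
        thresh = np.partition(x, -k, axis=-1)[..., -k, None]
        return np.where(x >= thresh, x, 0.0)

    activations = np.random.randn(4, 100)    # hypothetical dense-layer outputs
    sparse_acts = k_winners(activations, sparsity=0.05)
    print((sparse_acts != 0).mean(axis=-1))  # roughly 5% of units remain active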

Presentations

Fighting Fire with Fire: Anatomy of an AI-Enhanced Fraud Detection Solution Intel® AI Builders Showcase

In this session, we explore through a live demonstration how to move from credit card transaction data to a complete AI fraud detection solution using the Darwin data science automation platform, powered by 2nd Generation Intel® Xeon® Scalable processors and the Intel® MPI Library.

Brigitte Alexander is the managing director of artificial intelligence (AI) partner programs for Intel, where she’s responsible for creating a scalable and vibrant global AI partner ecosystem on Intel AI technology by attracting, recruiting, and maintaining relationships with best-of-breed enterprise independent software vendors, system integrators, and original equipment manufacturers. Previously, Brigitte led ecosystem and global marketing for Vuforia, an augmented reality platform owned by Qualcomm and then sold to PTC, and held a variety of positions, including director of partnerships, partner marketing, and product management, at such companies as Yahoo and Infospace. Brigitte holds an MBA from the Thunderbird School of Global Management and a BA from the University of California, Santa Barbara.

Presentations

Welcome and AIB overview and growth: Impact on AI ecosystem Intel® AI Builders Showcase

Welcome to AI in the Enterprise: The Intel® AI Builders Showcase Event

Alasdair Allan is a director at Babilim Light Industries and a scientist, author, hacker, maker, and journalist. An expert on the internet of things and sensor systems, he’s famous for hacking hotel radios, deploying mesh networked sensors through the Moscone Center during Google I/O, and for being behind one of the first big mobile privacy scandals when, back in 2011, he revealed that Apple’s iPhone was tracking user location constantly. He’s written eight books and writes regularly for Hackster.io, Hackaday, and other outlets. A former astronomer, he also built a peer-to-peer autonomous telescope network that detected what was, at the time, the most distant object ever discovered.

Presentations

Dealing with data on the edge Session

Much of the data we collect is thrown away, but that's about to change; the power envelope needed to run machine learning models on embedded hardware has fallen dramatically, enabling you to put the smarts on the device rather than in the cloud. Alasdair Allan explains how the data you throw away can be processed in real time at the edge, and this has huge implications for how you deal with data.

Shradha Ambekar is a staff software engineer in the Small Business Data Group at Intuit, where she’s the technical lead for the lineage framework (SuperGLUE) and real-time analytics. She has made several key contributions in building solutions around the data platform and has contributed to the spark-cassandra-connector. She has experience with the Hadoop distributed file system (HDFS), Hive, MapReduce, Hadoop, Spark, Kafka, Cassandra, and Vertica. Previously, she was a software engineer at Rearden Commerce. Shradha spoke at the O’Reilly Open Source Conference in 2019. She holds a bachelor’s degree in electronics and communication engineering from NIT Raipur, India.

Presentations

Always accurate business metrics through lineage-based anomaly tracking Session

Debugging data pipelines is nontrivial, and finding the root cause can take hours to days. Shradha Ambekar and Sunil Goplani outline how Intuit built a self-serve tool that automatically discovers data pipeline lineage and applies anomaly detection to detect and help debug issues in minutes—establishing trust in metrics and improving developer productivity by 10x–100x.

David Anderson is a training coordinator at Ververica, the original creators of Apache Flink. He’s delivered training and consulting to many of the world’s leading banks, telecommunications providers, and retailers. Previously, he led the development of data-intensive applications for companies across Europe.

Presentations

Apache Flink developer training 2-Day Training

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Apache Flink developer training (Day 2) Training Day 2

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Event-driven applications made easy with Apache Flink Tutorial

David Anderson and Seth Wiesman demonstrate how building and managing scalable, stateful, event-driven applications can be easier and more straightforward than you might expect. You'll go hands-on to implement a ride-sharing application together.

Jesse Anderson is a big data engineering expert and trainer.

Presentations

Professional Kafka development 2-Day Training

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.
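
For a minimal flavor of producing and consuming before the training, here is a sketch using the third-party kafka-python client against a hypothetical local broker and "orders" topic (the course itself may use different clients and languages).

    from kafka import KafkaConsumer, KafkaProducer

    # Publish a few events to a hypothetical "orders" topic on a local broker.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for i in range(3):
        producer.send("orders", key=b"user-1", value=("order-%d" % i).encode())
    producer.flush()

    # Consume them from the beginning as part of a consumer group.
    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        group_id="order-processors",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,
    )
    for message in consumer:
        print(message.key, message.value, message.offset)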

Professional Kafka development (Day 2) Training Day 2

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.

Apoorva Ankad graduated in electronics and communications engineering from Karnatak University in 2001 and spent 15 years with Toshiba Software India Private Limited in various capacities. At Toshiba, he was instrumental in the development and porting of audio and video codecs, including MP3, AAC, and H.264, for Toshiba's mobile and DTV platforms, and he later headed the advanced driver assistance systems (ADAS) division, building state-of-the-art solutions for pedestrian and vehicle detection. After his time at Toshiba, as part of the founding team of Deevia Software, he has been instrumental in driving the vision of making Deevia Software a pioneer in vision-based AI. At Deevia Software, he heads the computer vision and AI division, targeting sectors including automotive, industrial, sports, and medical, and working with partners such as Hitachi, Toshiba, Intel, Tata Steel, Valeo, and NTT Data. His other interests include hiking, singing, cooking, and training young minds on futuristic technologies and life skills.

Presentations

AI-powered People Activity Monitoring System Intel® AI Builders Showcase

The safety and productivity of workers are critical on industrial assembly lines, and monitoring workers is an important use case for improving safety. Recent technology advancements enable posture detection-based analytics; however, camera-based posture detection is performance intensive, typically requiring high-end GPUs.

Eitan Anzenberg is the chief data scientist at Bill.com and has many years of experience as a scientist and researcher. His recent focus is in machine learning, deep learning, applied statistics, and engineering. Previously, Eitan was a postdoctoral scholar at Lawrence Berkeley National Lab; he earned his PhD in physics from Boston University and his BS in astrophysics from the University of California, Santa Cruz. Eitan has 2 patents and 11 publications to date and has spoken about data at various conferences around the world.

Presentations

Beyond OCR: Using deep learning to understand documents Session

Although the field of optical character recognition (OCR) has been around for half a century, document parsing and field extraction from images remains an open research topic. Eitan Anzenberg leads a deep dive into a learning architecture that leverages document understanding to extract fields of interest.

Shilpa Arora is a principal data scientist at Atlan, a data product company. She’s also a guest lecturer for applied econometrics at IIM-Kashipur College in India. She loves designing and building scalable data products with features that look and feel customized to every user.

Presentations

Predicting malaria using ML and satellite data Health Data Day

Nearly half the world's population lives in areas at risk of malaria. Malaria is highly dependent on the environment, demographics, and infrastructure of the region. Shilpa Arora leads a deep dive into these indicators and a time series analysis of malaria cases, which can help you identify the problem factors and predict the expected cases if there's no external intervention.

Muhammad Asfar is a PhD candidate at Airlangga University and a political consultant at Pusdeham. He’s written more than 200 academic papers, magazine articles, and books related to politics. He’s fully aware that technology has created a new spectrum in how politics works now and will work in the future, so he’s passionately learning about big data and new technology adoption in politics to fill the gap that voting behavior theories cannot yet explain.

Presentations

Political mapping with big data: Indonesia's presidential election 2019 Session

Since the disclosure of the Cambridge Analytica scandal, political practitioners have started to adopt big data technology for better understanding and management of data. Qorry Asfar and Muhammad Asfar provide a big data case study on developing political strategy and examine how technological adoption will shape a better political landscape.

Qorry Asfar is a data manager at Pusdeham Prodata Indonesia and has 2+ years of experience in voting behavior, political campaigns, and political advisory. She’s passionate about learning new ways to creatively use political data for better management and strategy.

Presentations

Political mapping with big data: Indonesia's presidential election 2019 Session

Since the disclosure of the Cambridge Analytica scandal, political practitioners have started to adopt big data technology for better understanding and management of data. Qorry Asfar and Muhammad Asfar provide a big data case study on developing political strategy and examine how technological adoption will shape a better political landscape.

Jatinder Assi is a data engineering manager at GumGum and is enthusiastic about building scalable distributed applications and business-driven data products.

Presentations

Real-time forecasting at scale using Delta Lake Session

GumGum receives 30 billion programmatic inventory impressions amounting to 25 TB of data per day. By generating near-real-time inventory forecasts based on campaign-specific targeting rules, GumGum enables users to set up successful future campaigns. Rashmina Menon and Jatinder Assi highlight the architecture that enables forecasting in less than 30 seconds with Delta Lake and Databricks Delta caching.

Utkarsh B. is the technology advisor to the CEO, a distinguished architect, and a senior principal architect at Flipkart. He’s been driving architectural blueprints and coherence across diverse platforms at Flipkart through multiple generations of their evolution, leveraging technology to solve for scale, resilience, business continuity, and disaster recovery. He has extensive experience (18+ years) in building platforms across a wide spectrum of technical and functional problem domains.

Presentations

Architecture patterns for BCP and DR at enterprise scale at Flipkart Session

Utkarsh B. and Giridhar Yasa lead a deep dive into architectural patterns and the solutions Flipkart developed to ensure business continuity to millions of online customers, and how it leveraged technology to avert or mitigate risks from catastrophic failures. Solving for business continuity requires investments in applications, data management, and infrastructure.

Giriraj Bagdi is a DevOps leader of cloud and data at Intuit, where he leads infrastructure engineering and SRE teams in delivering technology and functional capabilities for online platforms. He drove and managed large, complex initiatives in cloud data infrastructure, automation engineering, big data, and database transactional platforms. Giriraj has extensive knowledge of building engineering solutions and platforms to improve the operational efficiency of cloud infrastructure in the areas of command and control and data reliability for big data, high-transaction, high-volume, and high-availability environments. He drives the initiative in transforming big data engineering and migrating to AWS big data technologies such as EMR, Athena, and QuickSight. He’s an innovative, energetic, and goal-oriented technologist and a team player with strong problem-solving skills.

Presentations

10 lead indicators before data becomes a mess Session

Data quality metrics focus on quantifying whether data is a mess, but you need to identify lead indicators before data becomes a mess. Sandeep Uttamchandani, Giriraj Bagadi, and Sunil Goplani explore developing lead indicators for data quality for Intuit's production data pipelines. You'll learn about the details of lead indicators, optimization tools, and lessons that moved the needle on data quality.

Bahman Bahmani is the vice president of data science and engineering at Rakuten (the seventh-largest internet company in the world), managing an AI organization with engineering and data science managers, data scientists, machine learning engineers, and data engineers globally distributed across three continents, and he’s in charge of the end-to-end AI systems behind the Rakuten Intelligence suite of products. Previously, Bahman built and managed engineering and data science teams across industry, academia, and the public sector in areas including digital advertising, consumer web, cybersecurity, and nonprofit fundraising, where he consistently delivered substantial business value. He also designed and taught courses, led an interdisciplinary research lab, and advised theses in the Computer Science Department at Stanford University, where he also did his own PhD focused on large-scale algorithms and machine learning, topics on which he’s a published author.

Presentations

AI in the new era of personal data protection Session

With the California Consumer Privacy Act (CCPA) looming, Europe’s GDPR still sending shockwaves, and public awareness of privacy breaches heightening, we're in the early days of a new era of personal data protection. Bahman Bahmani explores the challenges and opportunities for AI in this new era and provides actionable insights for you to navigate your path to AI success.

Kamil Bajda-Pawlikowski is a cofounder and CTO of the enterprise Presto company Starburst. Previously, Kamil was the chief architect at the Teradata Center for Hadoop in Boston, focusing on the open source SQL engine Presto, and the cofounder and chief software architect of Hadapt, the first SQL-on-Hadoop company (acquired by Teradata). Kamil began his journey with Hadoop and modern MPP SQL architectures about 10 years ago during a doctoral program at Yale University, where he co-invented HadoopDB, the original foundation of Hadapt’s technology. He holds an MS in computer science from Wroclaw University of Technology and both an MS and an MPhil in computer science from Yale University.

Presentations

Presto on Kubernetes: Query anything, anywhere Session

Kamil Bajda-Pawlikowski explores Presto, an open source SQL engine, featuring low-latency queries, high concurrency, and the ability to query multiple data sources. With Kubernetes, you can easily deploy and manage Presto clusters across hybrid and multicloud environments with built-in high availability, autoscaling, and monitoring.

Lee is helping people predict and shape their future using machine intelligence. He's the commercial director at Seldon, where he’s responsible for global partnerships, client development, and go-to-market strategy. Passionate about the confluence of business and technology, Lee helps enterprise customers like Barclays realize success through the industrialization of their machine learning models. Seldon is an alumnus of the Barclays Techstars Accelerator 2016.

Presentations

Machine Learning Model Deployment and Inference Intel® AI Builders Showcase

Model inferencing use cases are becoming a requirement as models move into the next phase of production deployment. More and more users are encountering use cases around canary deployments, scale-to-zero, and serverless characteristics. We'll demonstrate where KFServing currently is and where it's heading.

Claudiu Barbura is a director of engineering at Blueprint, where he oversees product engineering and builds large-scale advanced analytics pipelines, IoT, and data science applications for customers in the oil and gas, energy, and retail industries. Previously, he was the vice president of engineering at UBIX.AI, automating data science at scale, and senior director of engineering, xPatterns platform services at Atigeo, building several advanced analytics platforms and applications in the healthcare and financial industries. Claudiu is a hands-on architect, dev manager, and executive with 20+ years of experience in open source, big data science, and Microsoft technology stacks and a frequent speaker at data conferences.

Presentations

The power of GPUs for data virtualization in Tableau, PowerBI & beyond Session

Claudiu Barbura presents a tech stack for BI tools and data science notebooks, using live demos to explain the lessons learned using Spark (CPU), BlazingSQL and Rapids.ai (GPU), and Apache Arrow in the quest to exponentially increase the performance of the data virtualizer, which enables real-time access to data sources across different cloud providers and on-premises databases and APIs.

Chris Bartholomew has been working on high-performance pub-sub systems for over a dozen years. During that time, he has tested, supported, and operated messaging systems deployed in the banking, capital markets, and transportation industries. Recently, he founded Kafkaesque, a managed streaming and queuing service based on the open source Apache Pulsar project.

Presentations

Getting started streaming and queuing in Apache Pulsar Interactive session

Chris Bartholomew walks you through the architecture and important concepts of Apache Pulsar. You'll set up a local Apache Pulsar environment and use the Python API to do publish/subscribe (pub/sub) message streaming, fanning out messages to multiple consumers.
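
For a sense of what the hands-on portion involves, here is a minimal pub/sub sketch using the pulsar-client Python library against a hypothetical local broker and topic; a shared subscription is one way to fan messages out across multiple consumers.

    import pulsar

    client = pulsar.Client('pulsar://localhost:6650')

    # Publish a few messages to a topic.
    producer = client.create_producer('my-topic')
    for i in range(3):
        producer.send(('hello-%d' % i).encode('utf-8'))

    # A shared subscription lets several consumers split the message stream.
    consumer = client.subscribe('my-topic', subscription_name='my-sub',
                                consumer_type=pulsar.ConsumerType.Shared)
    msg = consumer.receive()
    print(msg.data())
    consumer.acknowledge(msg)

    client.close()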

Benjamin Batorsky is the Associate Director of Data Science at MIT Sloan, where he leads data science projects for the Food Supply Chain and Analytics group. Previously he worked on the data science team at ThriveHive, where he scoped and built data products by leveraging multi-modal datasets on small businesses and their customers. In his work, he is often posed difficult business questions and is able to develop and execute a strategy for answering them with either one-off analytic products or production-ready prototypes. He earned his PhD in Policy Analysis from the RAND Corporation, working on analytics projects in the areas of health, policy and infrastructure.

Presentations

Named-entity recognition from scratch with spaCy Session

Identifying and labeling named entities such as companies or people in text is a key part of text processing pipelines. Benjamin Batorsky outlines how to train, test, and implement a named entity recognition (NER) model with spaCy. You'll get a sneak peek at how to use these techniques with large, non-English corpora.
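
As a minimal sketch of the kind of training loop the session covers (using spaCy's v2-style API and a couple of made-up training sentences; the exact calls differ by spaCy version):

    import random
    import spacy

    # Hypothetical training examples: (text, character-offset entity spans).
    TRAIN_DATA = [
        ("Acme Corp opened a new office", {"entities": [(0, 9, "ORG")]}),
        ("She joined Acme Corp in 2019", {"entities": [(11, 20, "ORG")]}),
    ]

    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    ner.add_label("ORG")

    optimizer = nlp.begin_training()
    for epoch in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)

    doc = nlp("He consults for Acme Corp")
    print([(ent.text, ent.label_) for ent in doc.ents])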

Steven Beales is a senior vice president of IT at WCG. He has 25 years of experience in IT and has spent over 16 years in the pharmaceutical industry. He led implementation of the clinical trial portal at Genentech across 100+ countries and of the clinical trial safety portal at a top-5 pharma organization, which included a data-driven rules engine configured with safety regulations from those countries, saving this organization hundreds of millions of dollars. Over 50 million safety alerts have been distributed by these two portals via the cloud. Previously, he was the chief software architect at mdlogix, where he led the implementation of the CTMS systems for Johns Hopkins University, Washington University at St. Louis, the University of Pittsburgh, and the Interactive Autism Network for Autism Speaks.

Presentations

Pragmatic artificial intelligence in biopharmaceutical industry Session

Steven Beales describes applications of NLP, machine learning, and the data-driven rules that generate significant productivity and quality improvements in the complex business workflows of drug safety and pharmacovigilance without large upfront investment. Pragmatic use of AI allows organizations to create immediate value and ROI before widening adoption as their capabilities with AI increase.

Ian Beaver is a chief scientist at Verint, a provider of conversational AI systems for enterprise businesses. Ian has been publishing discoveries in the field of AI since 2005 on topics surrounding human–computer interactions such as gesture recognition, user preference learning, and communication with multimodal automated assistants. Ian has presented his work at various academic and industry conferences and authored over 30 patents within the field of human language technology. His extensive experience and access to large volumes of real-world, human–machine conversation data for his research has made him a leading voice in conversational analysis of dialog systems. Ian currently leads a team in finding ways to optimize human productivity by way of automation and augmentation, using symbiotic relationships with machines.

Presentations

Chatbots and conversation analysis: Learn what customers want to know Data Case Studies

Chatbots are increasingly used in customer service as a first tier of support. Through deep analysis of conversation logs, you can learn real user motivations and where company improvements can be made. Ian Beaver and Aryn Sargent make a build or buy comparison for deploying self-service bots, cover motivations and techniques for deep conversational analysis, and discuss real-world discoveries.

Peyman Behbahani is a senior AI architect at Wipro, helping various industries build real-world, large-scale AI applications in their businesses. He earned his PhD in electronic engineering at City, University of London in 2011. His main research and development interests are in AI, computer vision, mathematical modeling, and forecasting.

Presentations

An approach to automate time and motion analysis Session

Studying time and motion in manufacturing operations on a shop floor is traditionally carried out through manual observation, which is time-consuming and subject to human error and limitations. Sundar Varadarajan and Peyman Behbahani detail a new approach of video analytics combined with time series analysis to automate activity identification and timing measurements.

Daniel Beltran-Villegas is a member of the Commercial Data Sciences team in the Technology organization of Janssen Pharmaceuticals. In his previous work, Daniel developed predictive models for the production of nanomaterials using clustering, Bayesian inference, and regression. At Janssen, Daniel is involved in developing predictive models across the pharma value chain.

Presentations

Industrializing machine learning: Use Case, Challenges & Learnings (Sponsored by Dataiku) Session

Establishing best practices to enable data science solutions at scale can be difficult in highly matrixed environments. Join us to learn about the evolution at Janssen US and the framework used for industrializing advanced analytics capabilities for applications such as predictive targeting.

Austin Bennett is a data engineer at Sling Media (a DISH company), where he develops systems and mentors aspiring data scientists. Austin is a cognitive linguist and researcher with an interest in multimodal communication, largely through Redhenlab.org. He’s enthusiastic about the promise of Apache Beam, is very active with the community, and has trained people around the world how to use and contribute to the open source project.

Presentations

Unifying batch and stream processing with Apache Beam Interactive session

Austin Bennett offers hands-on training with the Apache Beam programming model. Beam is an open source unified model for batch and stream data processing that runs on execution engines like Google Cloud Dataflow, Apache Flink, and Apache Spark.
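
As a small taste of the Beam model, here is a word count sketch using the Beam Python SDK; the same pipeline code can run locally on the DirectRunner or on runners such as Dataflow, Flink, and Spark by changing pipeline options.

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["the quick brown fox", "the lazy dog"])
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )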

William Benton is an engineering manager and senior principal software engineer at Red Hat, where he leads a team of data scientists and engineers. He’s applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His focus is investigating the best ways to build and deploy intelligent applications in cloud native environments, but he’s also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.

Presentations

What nobody told you about machine learning in the hybrid cloud Session

Cloud native infrastructure like Kubernetes has obvious benefits for machine learning systems, allowing you to scale out experiments, train on specialized hardware, and conduct A/B tests. What isn’t obvious are the challenges that come up on day two. Sophie Watson and William Benton share their experience helping end users navigate these challenges and make the most of new opportunities.

Lukas Biewald is the founder and CEO of Weights & Biases, his second major contribution to advances in the machine learning field. Previously, Lukas founded Figure Eight, formerly CrowdFlower, which was acquired by Appen in 2019. Lukas has dedicated his career to optimizing ML workflows and teaching ML practitioners, making machine learning more accessible to all.

Presentations

Preparing and Standardizing Data for Machine Learning Interactive session

Join expert Lukas Biewald to learn how to build and augment a convolutional neural network (CNN) using Keras. Along the way, you’ll explore common issues and bugs that are often glossed over in other courses, as well as some useful approaches to troubleshooting. You can’t become a deep learning expert in a day, but you'll leave able to build and deploy useful real-world CNNs.

Using Keras to classify text with LSTMs and other ML techniques Tutorial

Join Lukas Biewald to build and deploy long short-term memories (LSTMs), gated recurrent units (GRUs), and other text classification techniques using Keras and scikit-learn.
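
A minimal sketch of the kind of model the tutorial builds, using tf.keras with a tiny made-up dataset (real work would use a proper corpus and tuning):

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    texts = ["great product", "terrible support", "love it", "worst ever"]
    labels = np.array([1, 0, 1, 0])  # toy sentiment labels

    # Tokenize and pad the text to fixed-length integer sequences.
    tokenizer = keras.preprocessing.text.Tokenizer(num_words=1000)
    tokenizer.fit_on_texts(texts)
    x = keras.preprocessing.sequence.pad_sequences(
        tokenizer.texts_to_sequences(texts), maxlen=10)

    model = keras.Sequential([
        layers.Embedding(input_dim=1000, output_dim=16),
        layers.LSTM(32),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(x, labels, epochs=5, verbose=0)
    print(model.predict(x).round())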

Sarah Bird is a principal program manager at Microsoft, where she leads research and emerging technology strategy for Azure AI. Sarah works to accelerate the adoption and impact of AI by bringing together the latest innovations from research with the best of open source and product expertise to create new tools and technologies. She leads the development of responsible AI tools in Azure Machine Learning. She’s also an active member of the Microsoft Aether committee, where she works to develop and drive company-wide adoption of responsible AI principles, best practices, and technologies. Previously, Sarah was one of the founding researchers in the Microsoft FATE research group and worked on AI fairness at Facebook. She’s an active contributor to the open source ecosystem; she cofounded ONNX, an open source standard for machine learning models, and was a leader in the PyTorch 1.0 project. She was an early member of the machine learning systems research community and has been active in growing and forming the community. She cofounded the SysML research conference and the Learning Systems workshops. She holds a PhD in computer science from the University of California, Berkeley, advised by Dave Patterson, Krste Asanovic, and Burton Smith.

Presentations

An overview of responsible artificial intelligence Tutorial

Mehrnoosh Sameki and Sarah Bird examine six core principles of responsible AI with a focus on transparency, fairness, and privacy. You'll discover best practices and state-of-the-art open source toolkits that empower researchers, data scientists, and stakeholders to build trustworthy AI systems.

Levan Borchkhadze is a senior data scientist at TBC Bank, where his main responsibility is to supervise multiple data science projects. He earned BBA and MBA degrees from Georgian American University and has a wide variety of work experience in different industries as a financial analyst, business process analyst, and ERP systems implementation specialist. Levan earned his master’s degree in big data solutions from Barcelona Technology School.

Presentations

A novel approach of recommender systems in retail banking Session

TBC Bank is in transition from a product-centric to a client-centric model, and an obvious application of analytics is developing personalized next-best product recommendations for clients. George Chkadua and Levan Borchkhadze explain why the bank decided to implement the ALS user-item matrix factorization method and a demographic model. As a result, the pilot increased sales conversion rates by 70%.
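
For context, ALS user-item matrix factorization is widely available off the shelf; a minimal sketch with Spark MLlib and made-up interaction data (not TBC Bank's actual pipeline) looks like this:

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("als-demo").getOrCreate()

    # Hypothetical implicit-feedback data: (client, product, interaction strength).
    interactions = spark.createDataFrame(
        [(1, 10, 3.0), (1, 11, 1.0), (2, 10, 5.0), (3, 12, 2.0)],
        ["userId", "itemId", "rating"],
    )

    als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
              implicitPrefs=True, rank=8, regParam=0.1)
    model = als.fit(interactions)

    # Top-3 next-best product recommendations per client.
    model.recommendForAllUsers(3).show(truncate=False)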

Dhruba Borthakur is cofounder and CTO at Rockset, a company building software to enable data-powered applications. Previously, Dhruba was the founding engineer of the open source RocksDB database at Facebook and one of the founding engineers of the Hadoop file system at Yahoo; an early contributor to the open source Apache HBase project; a senior engineer at Veritas, where he was responsible for the development of VxFS and the Veritas SanPointDirect storage system; the cofounder of Oreceipt.com, an ecommerce startup based in Sunnyvale; and a senior engineer at IBM-Transarc Labs, where he contributed to the development of Andrew File System (AFS), a part of IBM’s ecommerce initiative, WebSphere. Dhruba holds an MS in computer science from the University of Wisconsin-Madison and a BS in computer science from BITS Pilani, India. He has 25 issued patents.

Presentations

Building live dashboards on Amazon DynamoDB using Rockset Session

Rockset is a serverless search and analytics engine that enables real-time search and analytics on raw data from Amazon DynamoDB—with full featured SQL. Kshitij Wadhwa and Dhruba Borthakur explore how Rockset takes an entirely new approach to loading, analyzing, and serving data so you can run powerful SQL analytics on data from DynamoDB without ETL.

Mario Bourgoin is a senior data scientist at Microsoft, where he helps the company’s efforts to democratize AI. He's a mathematician, data scientist, and statistician with a broad and deep knowledge of machine learning, artificial intelligence, data mining, statistics, and computational mathematics. Previously, he taught at several institutions and joined a Boston-area startup, where he worked on medical and business applications. He earned his PhD in mathematics from Brandeis University in Waltham, Massachusetts.

Presentations

Using the cloud to scale up hyperparameter optimization for ML Session

Hyperparameter optimization for machine learning is complex, requires advanced optimization techniques, and can be implemented as a generic framework decoupled from the specific details of algorithms. Fidan Boylu Uz, Mario Bourgoin, and George Iordanescu apply such a framework to tasks like object detection and text matching in a transparent, scalable, and easy-to-manage way in a cloud service.
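
The same decoupling can be seen in miniature with local tools; here is a sketch with scikit-learn's RandomizedSearchCV (not the cloud framework the speakers describe), where the search space and budget are specified independently of the model:

    from scipy.stats import loguniform
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV

    X, y = load_digits(return_X_y=True)

    # The search space and budget are defined separately from the estimator,
    # so the same harness works for any model or metric.
    search = RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        param_distributions={"C": loguniform(1e-3, 1e2)},
        n_iter=20, cv=3, scoring="accuracy", n_jobs=-1, random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)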

Fidan Boylu Uz is a senior data scientist at Microsoft, where she’s responsible for the successful delivery of end-to-end advanced analytic solutions. She’s also worked on a number of projects on predictive maintenance and fraud detection. Fidan has 10+ years of technical experience on data mining and business intelligence. Previously, she was a professor conducting research and teaching courses on data mining and business intelligence at the University of Connecticut. She has a number of academic publications on machine learning and optimization and their business applications and holds a PhD in decision sciences.

Presentations

Using the cloud to scale up hyperparameter optimization for ML Session

Hyperparameter optimization for machine learning is complex, requires advanced optimization techniques, and can be implemented as a generic framework decoupled from the specific details of algorithms. Fidan Boylu Uz, Mario Bourgoin, and George Iordanescu apply such a framework to tasks like object detection and text matching in a transparent, scalable, and easy-to-manage way in a cloud service.

Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies using big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, Claudiu Branzan, and Veysel Kocaman detail applying advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Navinder Pal Singh Brar is a senior data engineer at Walmart Labs, where he’s been working with the Kafka ecosystem for the last couple of years, especially Kafka Streams, and created a Customer Data Platform on top of it to suit the company’s needs to process billions of customer events per day in real time and trigger certain machine learning models on each event. He’s been active in contributing back to Kafka Streams and filed three patents last year. Navinder is a regular speaker at local and international events on real-time stream processing, data platforms, and Kafka.

Presentations

Real-time fraud detection with Kafka Streams Session

One of the major use cases for stream processing is real-time fraud detection. Ecommerce has to deal with fraud on a wider scale as more and more companies try to provide customers with incentives such as free shipping by moving to subscription-based models. Navinder Pal Singh Brar dives into the architecture, problems faced, and lessons from building such a pipeline.

Jay Budzik is the chief technology officer at ZestFinance, where he oversees Zest’s product and engineering teams. His passion for inventing new technologies—particularly in data mining and AI—has played a central role throughout his career. Previously, he held various positions, including founding an AI enterprise search company, helping major media organizations apply AI and machine learning to expand their audiences and revenue, and developing systems that process tens of trillions of data points. Jay has a PhD in computer science from Northwestern University.

Presentations

Introducing GIG: A new method for explaining any ensemble ML model Session

More companies are adopting machine learning (ML) to run key business functions. The best-performing models combine diverse model types into stacked ensembles, but explaining these hybrid models has been impossible—until now. Jay Budzik details a new technique, generalized integrated gradients (GIG), to explain complex ensembled ML models that are safe to use in high-stakes applications.

Patrick Buehler is a principal data scientist in the Cloud AI Group at Microsoft. He has over 15 years of work experience in academic settings and with various external customers, spanning a wide range of computer vision problems. He earned his PhD in computer vision from Oxford with Andrew Zisserman.

Presentations

Solving real-world computer vision problems using open source Interactive session

In recent years, computer vision (CV) has seen quick growth in quality and usability, driving business adoption of AI solutions. Patrick Buehler offers a comprehensive introduction to deep learning models for CV. You'll be able to get your hands dirty training and evaluating CV models with prepared examples and exercises.
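
As a flavor of the hands-on exercises, here is a transfer-learning sketch with PyTorch and torchvision on a dummy batch (the session's own examples and datasets may differ):

    import torch
    import torch.nn as nn
    from torchvision import models

    # Start from an ImageNet-pretrained backbone and retrain only the head.
    model = models.resnet18(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 5)  # e.g., 5 custom classes

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    # One training step on a dummy batch (replace with a real DataLoader).
    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, 5, (8,))
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    print(float(loss))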

Paris Buttfield-Addison is a cofounder of Secret Lab, a game development studio based in beautiful Hobart, Australia. Secret Lab builds games and game development tools, including the multi-award-winning ABC Play School iPad games, the BAFTA- and IGF-winning Night in the Woods, the Qantas airlines Joey Playbox games, and the Yarn Spinner narrative game framework. Previously, Paris was a mobile product manager for Meebo (acquired by Google). Paris particularly enjoys game design, statistics, blockchain, machine learning, and human-centered technology. He researches and writes technical books on mobile and game development (more than 20 so far) for O’Reilly and is writing Practical AI with Swift and Head First Swift. He holds a degree in medieval history and a PhD in computing. You can find him on Twitter as @parisba.

Presentations

Swift for TensorFlow in 3 hours Tutorial

Mars Geldard, Tim Nugent, and Paris Buttfield-Addison are here to prove Swift isn't just for app developers. Swift for TensorFlow provides the power of TensorFlow with all the advantages of Python (and complete access to Python libraries) and Swift—the safe, fast, incredibly capable open source programming language; Swift for TensorFlow is the perfect way to learn deep learning and Swift.

Vinoth Chandar is the co-creator of the Hudi project at Uber and the PMC lead of Apache Hudi (incubating). Previously, he was a senior staff engineer at Uber, where he led projects across technology areas like data infrastructure, data architecture, and mobile/network performance. Vinoth has a keen interest in unified architectures for data analytics and processing. Earlier, he was the LinkedIn lead on Voldemort and worked on Oracle Server’s replication engine, HPC, and stream processing.

Presentations

Bring stream processing to batch data using Apache Hudi (incubating) Session

Batch processing can benefit immensely from adopting some techniques from the streaming processing world. Balaji Varadarajan shares how Apache Hudi (incubating), an open source project created at Uber and currently incubating with the ASF, can bridge this gap and enable more productive, efficient batch data engineering.

Praveen Chandra is the head of the Data and Analytics Practice at GSPANN where he spends his time wondering about the nuances and intricacies of the data around us. He’s extremely passionate about data in all its forms and has spent his time building analytical platforms and wrangling with data at GAP and Macy’s before plunging back into consulting. He and his team have been tied at the hip with Kohl’s team in shaping and engineering the data platform on Google Cloud Platform (GCP) with the objective of surfacing the right insights at the right time. When he isn’t wondering about what his customers want to do with the deluge of data, he likes to read and swim.

Presentations

Revitalizing Kohl’s marketing and experience personalized ecosystem on Google Cloud Platform (sponsored by GSPANN) Session

Praveen Chandra and Shailendra Maktedar describe the challenges Kohl's faced with its legacy marketing analytics platform and how it leveraged Google Cloud Platform (GCP) and BigQuery to provide better and more consistent customer insights to the marketing analytics business team.

Sathya Chandran is a security research scientist at DataVisor. He’s an expert in applying big data and unsupervised machine learning to fraud detection, specializing in the financial, ecommerce, social, and gaming industries. Previously, Sathya was at HP Labs and Honeywell Labs. Sathya holds a PhD in CS from the University of South Florida.

Presentations

Mobility fingerprinting: A novel approach to detect account takeovers Session

Sathya Chandran shares key insights into current trends of account takeover fraud by analyzing 52 billion events generated by 1.1 billion users and developing a set of user mobility features to capture suspicious device and IP-switching patterns. You'll learn to incorporate mobility features into an anomaly detection solution to detect suspicious account activity in real time.

Diane Chang is a distinguished data scientist at Intuit, where she powers the prosperity of consumers and small businesses with machine learning, behavioral analysis, and risk prediction. Diane initially worked on TurboTax, looking at the effectiveness of Intuit's digital marketing campaigns, understanding user behavior in the product, and analyzing how customers get help when they need it. She also helped launch QuickBooks Capital, predicting outcomes for loan applicants. She's currently applying AI/ML techniques to security, risk, and fraud. Diane has a PhD in operations research from Stanford. She previously worked for a small mathematical consulting firm and a startup in the online advertising space. Prior to joining Intuit, Diane was a stay-at-home mom for 6 years.

Presentations

Explainable AI: Your model is only as good as your explanation Session

Explainable AI (XAI) has gained industry traction, given the importance of explaining ML-assisted decisions in human terms and detecting undesirable ML defects before systems are deployed. Joy Rimchala and Diane Chang delve into XAI techniques, advantages and drawbacks of black box versus glass box models, concept-based diagnostics, and real-world examples using design thinking principles.

Jin Hyuk Chang is a software engineer on the data platform team at Lyft, working on various data products. Jin is a main contributor to Apache Gobblin and Azkaban. Previously, Jin worked at LinkedIn and Amazon Web Services, where he focused on big data and service-oriented architecture.

Presentations

Amundsen: An open source data discovery and metadata platform Session

Jin Hyuk Chang and Tao Feng offer a glimpse of Amundsen, an open source data discovery and metadata platform from Lyft. Since it was open-sourced, Amundsen has been used and extended by many different companies within the community.

Michael Chang recently joined Supermicro Computer as the head of AI strategy and solutions. He has more than 20 years of experience in the cloud computing, enterprise, storage, and machine learning markets. Previously, he was cofounder and vice president of products at Silicon Valley AI startup NVXL Technology and, before that, product marketing director at LSI. Michael holds a master's degree in business administration from the Haas School of Business at the University of California, Berkeley, and a bachelor's degree in electrical engineering from National Chiao Tung University, and he completed an advanced leadership course at the Stanford University School of Business.

Presentations

Conversational AI platform for Fast Food Restaurants Intel® AI Builders Showcase

In this presentation, we illustrate the innovation behind the autonomous voice order-taking solution, which uses artificial intelligence powered by Intel Xeon D CPUs and Movidius VPUs to drive more intelligent responses, improve order accuracy, reduce wait time, and increase store revenue per hour.

Yue “Cathy” Chang is a partner at TutumGene, a technology company that aims to accelerate disease curing by providing solutions for gene therapy and regulation of gene expression. She’s a business executive recognized for sales, business development, and product marketing in high technology. Previously, she was with Silicon Valley Data Science, a startup (acquired by Apple) that provided business transformation consulting to enterprises and other organizations using data science- and engineering-based solutions; employee #1 hired by the CEO at venture-funded software startup Rocana (acquired by Splunk), where she served as senior director of business development focusing on building and growing long-term relationships, and notably increased sales leads 2x through building and managing indirect revenue channels; held multiple strategic roles at blue chip software enterprise companies as well as startups, including corporate and business development at Feedzai and Datameer; senior product management, product marketing, and sales at Symantec and IBM; and strategic sourcing improvement consulting at Honeywell. Cathy holds MS and BS degrees in electrical and computer engineering from Carnegie Mellon University, MBA and MS degrees as a Leaders for Global Operations (LGO) dual-degree fellow from MIT, and two patents for her early work in microprocessor logic design.

Presentations

AI meets genomics: Genetics and genome editing revolutionize medicine Health Data Day

Genome editing has been dubbed a top technology that could create trillion-dollar markets. Yue "Cathy" Chang explains how recent advances in applying AI to genomic editing accelerate the transformation of medicine. You'll learn how AI is applied to genome sequencing and editing, explore the potential to correct mutations, and explore questions on using genome editing to optimize human health.

Apply oversight and domain insight to AI and ML to increase success Tutorial

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a "science." As data science practitioners, reducing this failure rate is a priority. Jike Chong and Yue "Cathy" Chang explain the three key steps of applying data science technology to business problems and three concerns for applying domain insights in AI and ML initiatives.

Technology oversight to reduce data science project failure rate Session

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a science. Jike Chong and Yue "Cathy" Chang outline how you can reduce this failure rate and improve teams' confidence in executing successful data science projects by applying data science technology to business problems: scenario mapping, pattern discovery, and success evaluation.

Jeff Chao is a senior software engineer at Netflix, where he works on stream processing engines and observability platforms. Jeff builds and maintains Mantis, an open source platform that makes it easy for developers to build cost-effective, real-time, operations-focused applications. Previously, he was at Heroku, offering a fully managed Apache Kafka service.

Presentations

Cost-effective, real-time operational insights at Netflix Session

Netflix has experienced an unprecedented global increase in membership over the last several years. Production outages today have greater impact in less time than years before. Jeff Chao details the open-sourced Mantis, which allows Netflix to continue providing great experiences for its members, enabling it to get real-time, granular, cost-effective operational insights.

Chanchal Chatterjee is a cloud AI leader at Google Cloud Platform with a focus on the financial services and energy market verticals. He’s held several leadership roles focusing on machine learning, deep learning, and real-time analytics. Previously, he was chief architect at EMC in the CTO office, where he led end-to-end deep learning and machine learning solutions for data centers, smart buildings, and smart manufacturing for leading customers, and he was instrumental in the Industrial Internet Consortium, where he published an AI framework for large enterprises. Chanchal has received several awards, including the Outstanding Paper Award from the IEEE Neural Network Council for adaptive learning algorithms recommended by MIT professor Marvin Minsky. Chanchal founded two tech startups between 2008 and 2013. He has 29 granted or pending patents and over 30 publications. Chanchal earned MS and PhD degrees in electrical and computer engineering from Purdue University.

Presentations

Solving financial services ML problems with explainable ML models Session

Financial services companies use machine learning models to solve critical business use cases; regulators demand model explainability. Chanchal Chatterjee shares how Google solved financial services's business-critical problems such as credit card fraud, anti-money laundering, lending risk, and insurance loss using complex machine learning models you can explain to regulators.

As VP of products at Senscape Technologies, Kevin leads Senscape’s embedded edge AI product definition and roadmap in addition to partnership development. Previously, Kevin held senior technical and management positions at leading companies in Silicon Valley, including Xperi, Tessera, Intel, and Dell. He has extensive experience in computer vision, image processing, chip design, and system development. He has been awarded 16 US patents.

Presentations

Multi-function Smart Street Post Management System to Enable Smart Spaces Intel® AI Builders Showcase

Senscape’s multi-function smart street post system, based on Intel® FPGAs and Intel® Movidius™ Myriad™ VPUs, integrates many sensors and processes the data from these sensors using its proprietary algorithm to monitor, alert and improve public security.

Michael Chertushkin is a senior data scientist at John Snow Labs. He graduated from Ural Federal University's Radiotechnical Faculty in 2012 and worked as a software developer. In 2014, recognizing the growing interest in machine learning, he decided to shift to data science. He has successfully completed several projects in this field and, seeking more fundamental skills, graduated from the Yandex Data School, the best educational center in Russia for preparing highly skilled professionals in data science.

Presentations

A unified CV, OCR, and NLP model pipeline for scalable document understanding at DocuSign Session

Roshan Satish and Michael Chertushkin lead you through a real-world case study about applying state-of-the-art deep learning techniques to a pipeline that combines computer vision (CV), optical character recognition (OCR), and natural language processing (NLP) at DocuSign. You'll discover how the project delivered on its extreme interpretability, scalability, and compliance requirements.

Amanda “Mandy” Chessell is a master inventor, fellow of the Royal Academy of Engineering, and a distinguished engineer at IBM, where she’s driving IBM’s strategic move to open metadata and governance through the Apache Atlas open source project. Mandy is a trusted advisor to executives from large organizations and works with them to develop strategy and architecture relating to the governance, integration, and management of information. You can find out more information on her blog.

Presentations

Creating an ecosystem on data governance in the ODPi Egeria project Session

Building on its success at establishing standards in the Apache Hadoop data platform, the ODPi (Linux Foundation) turns its focus to the next big data challenge—enabling metadata management and governance at scale across the enterprise. Mandy Chessell and John Mertic discuss how the ODPi's guidance on governance (GoG) aims to create an open data governance ecosystem.

George Chkadua is a data scientist at TBC Bank. His main focus is machine learning and its applications in industry from a mathematics and business perspective. He earned a PhD in mathematics from King’s College London. George has published various articles in peer-reviewed journals and has been invited to speak at many scientific conferences and seminars.

Presentations

A novel approach of recommender systems in retail banking Session

TBC Bank is in transition from a product-centric to a client-centric model, and an obvious application of analytics is developing personalized next-best product recommendations for clients. George Chkadua and Levan Borchkhadze explain why the bank decided to implement the ALS user-item matrix factorization method and a demographic model. As a result, the pilot increased sales conversion rates by 70%.

Sowmiya Chocka Narayanan is the cofounder and CTO of Lily AI, an emotional intelligence-powered shopping experience that helps brands understand their consumers’ purchase behavior. She’s focused on decoding user behavior and building deep product understanding by applying deep learning techniques. Previously, she worked at different levels of the tech stack at Box, leading initiatives in building SDKs, applications for industry verticals, and MDM solutions, and was also an early engineer at Pocket Gems, where she worked on the core game engine and built acquisition and retention strategies for the number one and number four top-grossing gaming apps. Sowmiya earned her master’s degree in electrical and computer engineering from The University of Texas at Austin.

Presentations

Personalization powered by unlocking deep product & consumer features Session

Digital brands focus heavily on personalizing consumers' experience at every single touchpoint. In order to engage with consumers in the most relevant ways, Lily AI helps brands dissect and understand how their consumers interact with their products, more specifically with the product features. Sowmiya Chocka Narayanan explores the lessons learned building AI-powered personalization for fashion.

Jike Chong is the director of data science, hiring marketplace at LinkedIn. He’s an accomplished executive and professor with experience across industry and academia. Previously, he was the chief data scientist at Acorns, the leading microinvestment app in the US with over four million verified investors, which uses behavioral economics to help the up-and-coming save and invest for a better financial future; was the chief data scientist at Yirendai, an online P2P lending platform with more than $7B in loans originated and the first of its kind from China to go public on the NYSE; established and headed the data science division at SimplyHired, a leading job search engine in Silicon Valley; advised the Obama administration on using AI to reduce unemployment; and led quantitative risk analytics at Silver Lake Kraftwerk, where he was responsible for applying big data techniques to risk analysis of venture investments. Jike is also an adjunct professor and PhD advisor in the Department of Electrical and Computer Engineering at Carnegie Mellon University, where he established the CUDA Research Center and CUDA Teaching Center, which focus on the application of GPUs for machine learning. Recently, he developed and taught a new graduate-level course on machine learning for internet finance at Tsinghua University in Beijing, China, where he’s an adjunct professor. Jike holds MS and BS degrees in electrical and computer engineering from Carnegie Mellon University and a PhD from the University of California, Berkeley. He holds 11 patents (six granted, five pending).

Presentations

Apply oversight and domain insight to AI and ML to increase success Tutorial

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a "science." As data science practitioners, reducing this failure rate is a priority. Jike Chong and Yue "Cathy" Chang explain the three key steps of applying data science technology to business problems and three concerns for applying domain insights in AI and ML initiatives.

Technology oversight to reduce data science project failure rate Session

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a "science." Jike Chong and Yue "Cathy" Chang outline how you can reduce this failure rate and improve teams' confidence in executing successful data science projects by applying data science technology to business problems: scenario mapping, pattern discovery, and success evaluation.

Nicholas Cifuentes-Goodbody is a data scientist in residence at the Data Incubator. He’s taught English in France, Spanish in Qatar, and now data science all over the world. Previously, he was at Williams College, Hamad bin Khalifa University (Qatar), and the University of Southern California. He earned his PhD at Yale University. He lives in Los Angeles with his amazing wife and their adorable pit bull.

Presentations

Deep learning with TensorFlow 2-Day Training

The TensorFlow library provides computational graphs with automatic parallelization across resources, an ideal architecture for implementing neural networks. Nicholas Cifuentes-Goodbody walks you through TensorFlow's capabilities in Python, from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow, with several hands-on applications.
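
For readers new to the library, here's a minimal sketch of the tf.keras API the training covers, on a synthetic binary classification task; this is illustrative only, not the course's actual exercises.

```python
# Tiny tf.keras example: define, compile, train, and evaluate a small network.
import numpy as np
import tensorflow as tf

# Synthetic data: 1,000 samples with 20 features and a binary label.
X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("int32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))
```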

Ira Cohen is a cofounder and chief data scientist at Anodot, where he’s responsible for developing and inventing the company’s real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

Herding cats: Product management in the machine learning era Tutorial

While the role of the manager doesn't require deep knowledge of ML algorithms, it does require understanding how ML-based products should be developed. Ira Cohen explores the cycle of developing ML-based capabilities (or entire products) and the role of the (product) manager in each step of the cycle.

Nicola Corradi is a research scientist at DataVisor, where he draws on his extensive experience with neural networks to design and train deep learning models that recognize malicious patterns in user behavior. He earned a PhD in cognitive science from the University of Padua and completed a postdoc at Cornell in computational neuroscience and computer vision, focusing on integrating computational models of neurons with neural networks.

Presentations

A deep learning model to detect coordinated content abuse Session

Fraudulent attacks such as application fraud, fake reviews, and promotion abuse have to automate the generation of user content to scale; this creates latent patterns shared among the coordinated malicious accounts. Nicola Corradi digs into a deep learning model to detect such patterns for the identification of coordinated content abuse attacks on social, ecommerce, financial platforms, and more.

A DL model to detect coordinated frauds using patterns in user content Session

Fraudulent attacks like fake reviews, application fraud, and promotion abuse create a common pattern shared within coordinated malicious accounts. Nicola Corradi explains novel deep learning (DL) models that learned to detect suspicious patterns, leading to the identification of coordinated fraud attacks on social, dating, ecommerce, financial, and news aggregator services.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Healthcare Data Day opening remarks Tutorial

Strata Data & AI program chair Alistair Croll welcomes you to the day-long Healthcare Data Day tutorial.

Remembering forever Keynote

Alistair Croll takes you on a fast-paced journey through cognition, ethics, and how we're remaking what it means to be human.

Tuesday keynotes Keynote

Strata Data & AI program chairs, Rachel Roumeliotis and Alistair Croll, welcome you to the first day of keynotes.

Wednesday keynotes Keynote

Strata program chairs Rachel Roumeliotis and Alistair Croll welcome you to the second day of keynotes.

Robert Crowe is a data scientist and TensorFlow Developer Advocate at Google with a passion for helping developers quickly learn what they need to be productive. He’s used TensorFlow since the very early days and is excited about how it’s evolving quickly to become even better than it already is. Previously, Robert deployed production ML applications and led software engineering teams for large and small companies, always focusing on clean, elegant solutions to well-defined needs. In his spare time, Robert sails, surfs occasionally, and raises a family.

Presentations

ML in production: Getting started with TensorFlow Extended (TFX) Tutorial

Putting together an ML production pipeline for training, deploying, and maintaining ML and deep learning applications is much more than just training a model. Robert Crowe outlines what's involved in creating a production ML pipeline and walks you through working code.

Michelangelo D’Agostino is the vice president of data science and engineering at ShopRunner, where he leads a team that develops statistical models and writes software that leverages their unique cross-retailer ecommerce dataset. Previously, Michelangelo led the data science R&D team at Civis Analytics, a Chicago-based data science software and consulting company that spun out of the 2012 Obama reelection campaign, and was a senior analyst in digital analytics with the 2012 Obama reelection campaign, where he helped to optimize the campaign’s email fundraising juggernaut and analyzed social media data. Michelangelo has been a mentor with the Data Science for Social Good Fellowship. He holds a PhD in particle astrophysics from UC Berkeley and got his start in analytics sifting through neutrino data from the IceCube experiment. Accordingly, he spent two glorious months at the South Pole, where he slept in a tent salvaged from the Korean War and enjoyed the twice-weekly shower rationing. He’s also written about science and technology for the Economist.

Presentations

The care and feeding of data scientists Session

Data science is relatively young, and the job of managing data scientists is younger still. Many people undertake this management position without the tools, mentorship, or role models they need to do it well. Katie Malone and Michelangelo D'Agostino review key themes from a recent Strata report that examines the steps necessary to build, manage, sustain, and retain a growing data science team.

Ahmed Datoo is the cofounder and COO of Mesmer. His experience in the technology industry spans software engineering, product management, marketing, and strategy. Prior to Mesmer, Ahmed was on the founding team of Zenprise, where he was SVP of product and marketing, grew revenues from $0 to $250 million, built an innovative product routinely praised by Gartner and Forrester, and coined the term and created the MDM (mobile device management) category. Prior to Zenprise, he held engineering management and product management positions at Loudcloud/EDS. He started his career as a strategy consultant at Accenture. Ahmed has appeared in articles in the WSJ, NYTimes, Economist, Network World, and USA Today and has spoken at events at Stanford University, Interop, Gartner, and VentureBeat. He holds an MBA, MA, and BA from Stanford University.

Presentations

Accelerating Software Development via Deep Learning Intel® AI Builders Showcase

This is a practical session on how deep learning can solve engineering problems, starting with mobile app testing. You'll learn how convolutional neural networks, recurrent neural networks, and a variety of other ML models enabled a large insurance company to improve mobile developer productivity 4x and increase release velocity by 67%.

Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Previously, Sourav led teams building data products across the technology stack, from smart thermostats and security cams at Google Nest to power grid forecasting at AutoGrid to wireless communication chips at Qualcomm. He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He holds PhD, MS, and BS degrees in electrical engineering and computer science from MIT.

Presentations

Efficient ML engineering: Tools and best practices Tutorial

ML engineers work at the intersection of data science and software engineering—that is, MLOps. Sourav Dey and Alex Ng highlight the six steps of the Lean AI process and explain how it helps ML engineers work as an integrated part of development and production teams. You'll go hands-on with real-world data so you can get up and running seamlessly.

Gonzalo Diaz is a data scientist in residence at the Data Incubator, where he teaches the data science fellowship and online courses; he also develops the curriculum to include the latest data science tools and technologies. Previously, he was a web developer at an NGO and a researcher at IBM TJ Watson Research Center. He has a PhD in computer science from the University of Oxford.

Presentations

Big data for managers 2-Day Training

Gonzalo Diaz and Michael Li provide a nontechnical overview of AI and data science and teach common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Big data for managers (Day 2) Training Day 2

Gonzalo Diaz and Michael Li provide a nontechnical overview of AI and data science and teach common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Victor Dibia is a research engineer at Cloudera’s Fast Forward Labs, where his work focuses on prototyping state-of-the-art machine learning algorithms and advising clients. He’s passionate about community work and serves as a Google Developer Expert in machine learning. Previously, he was a research staff member at the IBM TJ Watson Research Center. His research interests are at the intersection of human-computer interaction, computational social science, and applied AI. He’s a senior member of IEEE and has published research papers at conferences such as AAAI Conference on Artificial Intelligence and ACM Conference on Human Factors in Computing Systems. His work has been featured in outlets such as the Wall Street Journal and VentureBeat. He holds an MS from Carnegie Mellon University and a PhD from City University of Hong Kong.

Presentations

Deep learning for anomaly detection Session

In many business use cases, it's frequently desirable to automatically identify and respond to abnormal data. This process can be challenging, especially when working with high-dimensional, multivariate data. Nisha Muktewar and Victor Dibia explore deep learning approaches (sequence models, VAEs, GANs) for anomaly detection, performance benchmarks, and product possibilities.
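
As one hedged illustration of the general idea (not the presenters' models), an autoencoder can flag anomalies by reconstruction error: samples the network reconstructs poorly are unlike the data it was trained on.

```python
# Autoencoder-based anomaly scoring sketch (synthetic data, illustrative only).
import numpy as np
import tensorflow as tf

X_train = np.random.normal(size=(5000, 30)).astype("float32")  # assumed "normal" data

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(30,)),
    tf.keras.layers.Dense(4, activation="relu"),   # bottleneck
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(30),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=10, batch_size=64, verbose=0)

def anomaly_score(x):
    # Higher reconstruction error suggests the sample differs from the training data.
    recon = autoencoder.predict(x, verbose=0)
    return np.mean((x - recon) ** 2, axis=1)

print(anomaly_score(np.random.normal(size=(10, 30)).astype("float32")))
```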

Dominic Divakaruni is a principal group program manager at Microsoft working on the Azure Machine Learning platform. His current areas of focus include applying and managing data for machine learning, including data access, exploratory data analysis, data lineage, and data drift. Dom’s prior work includes building tools to help customers deploy models to production, deep learning frameworks, accelerated computing, and GPUs.

Presentations

Data lineage enables reproducible and reliable ML at scale Session

Data scientists need a way to ensure result reproducibility. Sihui "May" Hu and Dominic Divakaruni unpack how to retrieve data-to-data, data-to-model, and model-to-deployment lineages in one graph to achieve reproducible and reliable machine learning at scale. You'll discover effective ways to track the full lineage from data preparation to model training to inference.

Mark Donsky leads product management at Okera, a software provider of data discovery, access control, and governance at scale for today’s heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera, and he’s held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions, saving millions of dollars annually. He holds a BS with honors in computer science from Western University in Ontario, Canada.

Presentations

CCPA, GDPR, and NYPA: Big data in the era of heavy privacy regulation Session

Privacy regulation is increasing worldwide with Europe's GDPR, the California Consumer Privacy Act (CCPA), and the New York Privacy Act (NYPA). Penalties for noncompliance are stiff, but many companies still aren't prepared. Mark Donsky shares how to establish best practices for holistic privacy readiness as part of your data strategy.

Jozo Dujmović is a professor of computer science at San Francisco State University, where he teaches and researches soft computing, decision engineering, software metrics, and computer performance evaluation. He’s the founder and principal of SEAS, a San Francisco company established in 1997 specializing in soft computing, decision engineering, and software support of the LSP decision method. Jozo developed the LSP decision method and has written more than 170 refereed publications. His latest book is Soft Computing Evaluation Logic: The LSP Decision Method and Its Applications (John Wiley and IEEE Press, 2018). His first industrial experience was at Institute M. Pupin, Belgrade. Previously, he was a professor in the School of Electrical Engineering at the University of Belgrade and a professor of computer science at the University of Florida, Gainesville; the University of Texas at Dallas; and Worcester Polytechnic Institute in Worcester. Jozo has received three best-paper awards, served as general chair of IEEE and ACM conferences, and has been an invited speaker at conferences in the US and Europe. Jozo earned his BSEE, MS, and ScD degrees in computer engineering from the University of Belgrade.

Presentations

Monitoring patient disability and disease severity using AI Health Data Day

Jozo Dujmović details a soft computing method for quantifying disease severity and patient disability. Personalized healthcare requires such models, which must be supported by AI software tools. Jozo introduces a case study of peripheral neuropathy to illustrate the methodology and outlines a related decision problem: the optimum timing of risky therapy.

Michael Dulin is the director of the Academy for Population Health Innovation at the University of North Carolina at Charlotte (UNC), a collaboration designed to advance community and population health, and he’s chief medical officer at Gray Matter Analytics. Previously, Michael was an electrical and biomedical engineer; a community-based private practice primary care physician in Harrisburg, North Carolina; research director and chair of the Carolinas Healthcare System Department of Family Medicine, where he directed a primary care practice–based research network (MAPPR); and chief clinical officer for outcomes research and analytics at Atrium Health. He’s a nationally recognized leader in the field of health information technology and the application of analytics and outcomes research to improve care delivery and advance population health. Michael has led projects funded by the Agency for Healthcare Research and Quality (AHRQ), the Robert Wood Johnson Foundation, the Duke Endowment, National Institutes of Health (NIH), and Patient-Centered Outcomes Research Institute (PCORI). His work has been recognized by the Charlotte Business Journal, North Carolina Healthcare Information & Communications Alliance (NCHICA), and Cerner. His work to build a healthcare data and analytics team was featured as a case study by the Harvard Business School and the Harvard T.H. Chan School of Public Health. He’s a member of the American Academy of Family Physicians, Society for Teachers of Family Medicine, North American Primary Care Research Group, and Alpha Omega Alpha. He’s a recipient of the North Carolina Medical Society’s Community Practitioner Program, a participant in the Center for International Understanding Latino Initiative, and recognized as one of Charlotte’s best doctors. Michael earned his PhD in neurophysiology and his medical degree from the University of Texas Medical School at Houston, and he completed his residency training at Carolinas Medical Center in Charlotte, North Carolina.

Presentations

Organizational culture’s role transforms healthcare with data and AI Health Data Day

Despite advances like cloud computing, healthcare providers struggle to apply data and analytics to essential functions. This delay is driven by organizational culture—particularly in large or complex organizations. Michael Dulin offers you an overview of common implementation barriers and approaches you need to succeed in the transformation process.

Ted Dunning is the chief technology officer at MapR, an HPE company. He’s also a board member of the Apache Software Foundation and a PMC member and committer on a number of projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He’s contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library, and he designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Kubernetes versus reality: What are the gaps? (sponsored by HPE) Session

Ted Dunning illustrates the situation of three real Kubernetes users, two in production and one in planning, who are building end-to-end containerized machine learning data pipelines. The users have different levels of experience with Kubernetes, very different requirements, and very different levels of development resources. You'll learn the strong commonalities in how to meet their needs.

Thomas Endres is a partner and IT consultant at TNG Technology Consulting in Munich. Alongside his work for the company and its customers, he builds prototypes such as a telepresence robotics system that lets you see reality through the eyes of a robot and an augmented reality AI that shows the world from the perspective of an artist. He works on applications in AR/VR, AI, and gesture control, putting them to use, for example, in autonomous and gesture-controlled drones. He's also involved in open source projects written in Java, C#, and various JavaScript languages.

Thomas studied IT at TU Munich and is passionate about software development and all other aspects of technology. As an Intel Software Innovator and Black Belt, he promotes new technologies such as AI, AR/VR, and robotics around the world, for which he has received, among other honors, a JavaOne Rockstar award.

Presentations

Deepfakes 2.0: How neural networks are changing our world Session

Imagine looking into a mirror, but not seeing your own face. Instead, you're looking in the eyes of Barack Obama or Angela Merkel. Your facial expressions are seamlessly transferred to the other person's face in real time. Martin Förtsch and Thomas Endres dig into a prototype from TNG that transfers faces from one person to another in real time based on deepfakes.

Dillon Erb is the CEO and cofounder of Paperspace (www.paperspace.com), a leading provider of cloud solutions for machine intelligence. Prior to Paperspace, Dillon worked on projects ranging from robotic fabrication toolstacks to HPC applications for topology optimization. Today, Dillon spends his time defining cloud machine learning pipelines and accelerating the adoption of AI by developers.

Presentations

End-to-end Machine Learning Pipelines in Minutes Intel® AI Builders Showcase

This presentation demonstrates how an end-to-end machine learning pipeline can be constructed in minutes to support a state-of-the-art bacterial classifier. The technology demo features Intel OpenVINO and the AI-accelerated instructions available on modern Intel CPUs.

Dr. Stephan Erberich is the chief data officer of Children’s Hospital Los Angeles and a professor of research radiology at the University of Southern California. He’s a computer scientist specializing in medical informatics and an AI practitioner in healthcare with a focus on image processing and computer vision ML.

Presentations

Semisupervised AI for automated categorization of medical images Session

Annotating radiological images by category at scale is a critical step for analytical ML. Supervised learning is challenging because image metadata doesn't reliably identify image content and manually labeling images for AI algorithms isn't feasible. Stephan Erberich, Kalvin Ogbuefi, and Long Ho share an approach for automated categorization of radiological images based on content category.

Moty Fania is a principal engineer and the CTO of the Advanced Analytics Group at Intel, which delivers AI and big data solutions across Intel. Moty has rich experience in ML engineering, analytics, data warehousing, and decision-support solutions. He led the architecture work and development of various AI and big data initiatives such as IoT systems, predictive engines, online inference systems, and more.

Presentations

Practical methods for continuous delivery & sustainability for AI Session

Moty Fania shares key insights from implementing and sustaining hundreds of ML models in production, including continuous delivery of ML models and systematic measures to minimize the cost and effort required to sustain them in production. You'll learn from examples from different business domains and deployment scenarios (on-premises, the cloud) covering the architecture and related AI platforms.

Tao Feng is a software engineer on the data platform team at Lyft. Tao is a committer and PMC member on Apache Airflow. Previously, Tao worked on data infrastructure, tooling, and performance at LinkedIn and Oracle.

Presentations

Amundsen: An open source data discovery and metadata platform Session

Jin Hyuk Chang and Tao Feng offer a glimpse of Amundsen, an open source data discovery and metadata platform from Lyft. Since it was open-sourced, Amundsen has been used and extended by many different companies within the community.

Rustem Feyzkhanov is a machine learning engineer at Instrumental, where he creates analytical models for the manufacturing industry. Rustem is passionate about serverless infrastructure (and AI deployments on it) and is the author of the course and book Serverless Deep Learning with TensorFlow and AWS Lambda.

Presentations

Serverless architecture for AI applications Session

Machine learning (ML) and deep learning (DL) are becoming more and more essential for businesses in internal and external use; one of the main issues with deployment is finding the right way to train and operationalize the model. Rustem Feyzkhanov digs into how to use AWS infrastructure for a serverless approach to deep learning, providing a cheap, simple, scalable, and reliable architecture.
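
A hedged sketch of the serverless inference pattern the talk describes: an AWS Lambda handler that loads a saved model once per warm container and serves predictions per invocation. The model path, packaging, and event shape here are hypothetical, not the speaker's setup.

```python
# Illustrative AWS Lambda handler for model inference (not the speaker's code).
import json
import joblib

MODEL = None  # cached across warm invocations of the same container

def handler(event, context):
    global MODEL
    if MODEL is None:
        # Hypothetical path; the model could be shipped in the deployment package or a layer.
        MODEL = joblib.load("/opt/model/model.joblib")
    features = json.loads(event["body"])["features"]
    prediction = MODEL.predict([features]).tolist()
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```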

Lutz Finger is a data scientist and product manager at Google, focusing on the intersection of predictions and data to change healthcare. Using the power of AI, he and his team are committed to improving patient outcomes, predicting quality issues, and much more. He also teaches at Cornell University and writes for Forbes. Lutz is an authority in data. Previously, he built data-driven products for LinkedIn and Snap, and cofounded Fisheye Analytics, a data-mining company whose products supported governments and NGOs. Lutz wrote the book Ask Measure Learn (O’Reilly).

Presentations

Deep learning with electronic health records Health Data Day

Lutz Finger dives into how deep learning, specifically applied to electronic health records, produces actionable predictions that could greatly improve patient care and turn clinicians’ attention back on patients. You'll explore the potential benefits to such predictions and the potential challenges to adoption and use.

Martin Förtsch is an IT consultant at TNG Technology Consulting, based in Unterföhring near Munich, who studied computer science. His focus areas are agile development (mainly in Java), search engine technologies, information retrieval, and databases. As an Intel Software Innovator and Intel Black Belt Software Developer, he’s strongly involved in developing open source software for gesture control with 3D cameras like Intel RealSense and has built an augmented reality wearable prototype device with his team based on this technology. He gives talks at national and international conferences about AI, the internet of things, 3D camera technologies, augmented reality, and test-driven development. He has received the JavaOne Rockstar award.

Presentations

Deepfakes 2.0: How neural networks are changing our world Session

Imagine looking into a mirror, but not seeing your own face. Instead, you're looking in the eyes of Barack Obama or Angela Merkel. Your facial expressions are seamlessly transferred to the other person's face in real time. Martin Förtsch and Thomas Endres dig into a prototype from TNG that transfers faces from one person to another in real time based on deepfakes.

Ben Fowler is a machine learning technical leader at Southeast Toyota Finance, where he leads end-to-end model development. He’s been in the field of data science for over five years. Ben has been a guest speaker at Southern Methodist University's data science program multiple times, has spoken at the PyData Miami 2019 conference, and has presented multiple times at the West Palm Beach Data Science Meetup. He earned a master of science in data science from Southern Methodist University.

Presentations

Evaluation of traditional and novel feature selection approaches Session

Selecting the optimal set of features is a key step in the machine learning modeling process. Ben Fowler shares research that tested five approaches to feature selection, including widely used methods as well as novel approaches based on open source libraries, evaluated by building a classification model on the Lending Club dataset.
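
As a minimal sketch of one widely used baseline in this space (not the study's actual code or data), recursive feature elimination from scikit-learn picks features by repeatedly fitting a model and dropping the weakest ones.

```python
# Recursive feature elimination sketch on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)

print("Selected feature indices:",
      [i for i, keep in enumerate(selector.support_) if keep])
```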

Don Fox is a data scientist in residence in Boston for The Data Incubator. Previously, Don developed numerical models for a geothermal energy startup. Born and raised in South Texas, Don earned a PhD in chemical engineering, during which he researched renewable energy systems and developed computational tools to analyze their performance.

Presentations

Hands-on data science with Python 2-Day Training

Don Fox teaches you all the steps—from prototyping to production—of developing a machine learning pipeline. After looking at data cleaning, feature engineering, model building and evaluation, and deployment, you'll extend these models into two applications from real-world datasets. All your work will be done in Python.

Hands-on data science with Python (Day 2) Training Day 2

Don Fox teaches you all the steps—from prototyping to production—of developing a machine learning pipeline. After looking at data cleaning, feature engineering, model building and evaluation, and deployment, you'll extend these models into two applications from real-world datasets. All your work will be done in Python.

Michael J. Freedman is the cofounder and CTO of TimescaleDB and a full professor of computer science at Princeton University. His work broadly focuses on distributed and storage systems, networking, and security, and his publications have more than 12,000 citations. He developed CoralCDN (a decentralized content distribution network serving millions of daily users) and helped design Ethane (which formed the basis for OpenFlow and software-defined networking). Previously, he cofounded Illuminics Systems (acquired by Quova, now part of Neustar) and served as a technical advisor to Blockstack. Michael’s honors include a Presidential Early Career Award for Scientists and Engineers (given by President Obama), the SIGCOMM Test of Time Award, a Sloan Fellowship, an NSF CAREER award, the Office of Naval Research Young Investigator award, and support from the DARPA Computer Science Study Group. He earned his PhD at NYU and Stanford and his undergraduate and master’s degrees at MIT.

Presentations

Building a distributed time series database on PostgreSQL Session

Time series data tends to accumulate very quickly across DevOps, IoT, industrial and energy, finance, and other domains, with monitoring and IoT applications generating tens of millions of metrics per second and petabytes of data. Michael Freedman shows you how to build a distributed time series database that offers the power of full SQL at scale.
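
A hedged sketch of the basic pattern behind a time-series-optimized PostgreSQL schema, assuming the TimescaleDB extension is installed; the connection string, table, and columns are hypothetical, and the distributed setup discussed in the session is omitted.

```python
# Create a time-partitioned table ("hypertable") in PostgreSQL with TimescaleDB.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")  # hypothetical connection
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ NOT NULL,
        device_id   TEXT NOT NULL,
        temperature DOUBLE PRECISION
    );
""")
# Partition by time so high-rate inserts and time-range queries stay fast at scale.
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")
conn.commit()
conn.close()
```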

Chris Fregly is a senior developer advocate focused on AI and machine learning at Amazon Web Services (AWS). Chris shares knowledge with fellow developers and data scientists through his Advanced Kubeflow AI Meetup and regularly speaks at AI and ML conferences across the globe. Previously, Chris was a founder at PipelineAI, where he worked with many startups and enterprises to deploy machine learning pipelines using many open source and AWS products including Kubeflow, Amazon EKS, and Amazon SageMaker.

Presentations

Continuous ML and AI: Hands-on with Kubeflow and MLflow pipelines Interactive session

Join in to build real-world, distributed machine learning (ML) pipelines with Chris Fregly using Kubeflow, MLflow, TensorFlow, Keras, and Apache Spark in a Kubernetes environment.
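
As a small, hedged taste of one of the tools used (MLflow tracking, not the tutorial's actual pipeline), a single training run can log its parameters, metrics, and model artifact like this:

```python
# MLflow experiment tracking sketch for one scikit-learn training run.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

with mlflow.start_run(run_name="rf-baseline"):
    n_estimators = 100
    model = RandomForestClassifier(n_estimators=n_estimators).fit(X, y)

    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # stored as a run artifact
```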

Martin Frigaard is a cofounder of and data scientist at Aperture Marketing, which provides workshops, writes guides and tutorials, and builds web applications for individuals, businesses, and organizations. He has over nine years of experience with data analysis, statistics, and research, and he’s a fully certified RStudio tidyverse trainer.

Presentations

Stories and data: Use narrative and visualizations effectively Interactive session

Martin Frigaard not only outlines how to collect, manipulate, summarize, and visualize data, but also explores how to communicate your findings in a convincing way your audience will understand and appreciate.

Krishna Gade is the cofounder and CEO of Fiddler Labs, an enterprise startup building an explainable AI engine to address problems regarding bias, fairness, and transparency in AI. Previously, he led the team that built Facebook’s explainability feature “Why am I seeing this?” He’s an entrepreneur with a technical background, experience creating scalable platforms, and expertise in converting data into intelligence. Having held senior engineering leadership roles at Facebook, Pinterest, Twitter, and Microsoft, he’s seen the effects that bias has on AI and machine learning decision-making processes. With Fiddler, his goal is to enable enterprises across the globe to solve this problem.

Presentations

The art of explainability: Removing the bias from AI Session

Krishna Gade outlines how "explainable AI" fills a critical gap in operationalizing AI and adopting an explainable approach into the end-to-end ML workflow from training to production. You'll discover the benefits of explainability such as the early identification of biased data and better confidence in model outputs.

Ben Galewsky is a research programmer at the National Center for Supercomputing Applications at the University of Illinois. He’s an experienced data engineering consultant whose career has spanned high-frequency trading systems to global investment bank enterprise architecture to big data analytics for large consumer goods manufacturers. He’s a member of the Institute for Research and Innovation in Software for High Energy Physics, which funds his development of scalable systems for the Large Hadron Collider.

Presentations

Data engineering at the Large Hadron Collider Session

Building a data engineering pipeline for serving segments of a 200 PB dataset to particle physicists around the globe poses many challenges, some unique to high energy physics and some common to big science projects across disciplines. Ben Galewsky, Gray Lindsey, and Andrew Melo highlight how much of it can inform industry data science at scale.

Triveni Gandhi is a data scientist at Dataiku, where she works with clients to deploy custom AI solutions and find meaning from complex data science pipelines. Her current work focuses on making responsible AI a part of the discourse and practices for enterprise data science. Triveni is also a cohost of the Banana Data Podcast, which highlights new developments and challenges in the world of AI. Previously, she served as a data analyst for a large education nonprofit in New York City, where she developed data pipelines and analyses to support the work of educators across the city. Triveni holds a PhD in political science from Cornell University.

Presentations

The buck stops here: Create accountability systems for responsible AI (sponsored by Dataiku) Keynote

What happens when AI goes awry? In the more sensational cases, like a driverless vehicle killing a pedestrian, there's a flurry of media reporting and backlash. Triveni Gandhi breaks down why, once the craze dies down, organizations need a way to learn from their mistakes and prevent harm going forward.

Eric Gardner is the director of sales enablement for the Artificial Intelligence Products Group (AIPG) within the Data Center Group (DCG) at Intel. Since joining Intel through the Accelerated Leadership Program (ALP), Eric has held product marketing, outbound marketing, and strategic planning roles spanning several business units. Previously, he worked for almost five years in silicon development and management at IBM. Eric holds an MBA from the University of Chicago’s Booth School of Business and a BSE in electrical and computer engineering from Duke University. He’s passionate about family, technology, sports, and the great outdoors.

Presentations

Welcome and AIB overview and growth: Impact on AI ecosystem Intel® AI Builders Showcase

Welcome to AI in the Enterprise: The Intel® AI Builders Showcase Event

Meg Garlinghouse is the head of social impact at LinkedIn. She’s passionate about connecting people with opportunities to use their skills and experience to transform the world. She has more than 20 years of experience working at the intersection of nonprofits and corporations, developing strategic and mutually beneficial partnerships. She has particular expertise in leveraging media and technology to meet the marketing, communications, and brand goals of respective clients. Meg has a passion for developing innovative social campaigns that have a business benefit.

Presentations

Fairness through experimentation at LinkedIn Session

Most companies want to ensure their products and algorithms are fair. Guillaume Saint-Jacques and Meg Garlinghouse share LinkedIn's A/B testing approach to fairness and describe new methods that detect whether an experiment introduces bias or inequality. You'll learn about a scalable implementation on Spark and discover examples of use cases and impact at LinkedIn.

William “Will” Gatehouse is the chief solutions architect for Accenture’s industry X.0 platforms, including solutions for oil and gas, chemicals, and smart grid. Will has over 20 years’ experience implementing industrial platforms and has a reputation for applying emerging technology at enterprise scale, as he’s demonstrated with streaming analytics, semantic models, and edge analytics. When not at work, Will is an avid sailor.

Presentations

Building the digital twin: IoT and unconventional data Session

The digital twin presents a problem of data and models at scale—how to mobilize IT and OT data, AI, and engineering models that work across lines of business and even across partners. Teresa Tung and William Gatehouse share their experience of implementing digital twins use cases that combine IoT, AI models, engineering models, and domain context.

Lior Gavish is a senior vice president of engineering at Barracuda, where he coleads the email security business. Lior developed AI solutions that were recognized by industry and academia, including a Distinguished Paper Award at USENIX Security 2019. Lior joined Barracuda through the acquisition of Sookasa, an Accel-backed startup where he was a cofounder and vice president of engineering. Previously, Lior led startup engineering teams building machine learning, web and mobile technologies. Lior holds a BSc and MSc in computer science from Tel-Aviv University and an MBA from Stanford University.

Presentations

High-precision detection of business email compromise Session

Lior Gavish breaks down a machine learning (ML)-based system that detects a highly evasive type of email-based fraud. The system combines innovative techniques for labeling and classifying highly unbalanced datasets with a distributed cloud application capable of processing high-volume communication in real time.

Marina (Mars) Rose Geldard is a researcher from Down Under in Tasmania. Entering the world of technology relatively late as a mature-age student, she’s found her place in the world: an industry where she can apply her lifelong love of mathematics and optimization. When she’s not busy being the most annoyingly eager researcher ever, she compulsively volunteers at industry events, dabbles in research, and serves on the executive committee for her state’s branch of the Australian Computer Society (ACS). She’s currently writing Practical Artificial Intelligence with Swift for O’Reilly.

Presentations

Swift for TensorFlow in 3 hours Tutorial

Mars Geldard, Tim Nugent, and Paris Buttfield-Addison are here to prove Swift isn't just for app developers. Swift for TensorFlow provides the power of TensorFlow with all the advantages of Python (and complete access to Python libraries) and Swift—the safe, fast, incredibly capable open source programming language; Swift for TensorFlow is the perfect way to learn deep learning and Swift.

Lars George is the principal solutions architect at Okera. Lars has been involved with Hadoop and HBase since 2007 and became a full HBase committer in 2009. Previously, Lars was the EMEA chief architect at Cloudera, acting as a liaison between the Cloudera professional services team and customers as well as partners in and around Europe, building the next data-driven solutions, and a cofounding partner of OpenCore, a Hadoop and emerging data technologies advisory firm. He’s spoken at many Hadoop User Group meetings as well as at conferences such as ApacheCon, FOSDEM, QCon, and Hadoop World and Hadoop Summit. He started the Munich OpenHUG meetings. He’s the author of HBase: The Definitive Guide (O’Reilly).

Presentations

Conquering the AWS IAM conundrum Session

With various levels of security layers and different departments responsible for data, there are a number of challenges with managing security and governance within AWS identity and access management (IAM). Lars George identifies the security layers, why there’s such a conundrum with IAM, if IAM actually slows down data projects, and the access control requirements needed in data lakes.

Dan Gifford is a senior data scientist responsible for creating data products at Getty Images in Seattle, Washington. Dan works at the intersection of science and creativity and builds products that improve the workflows of Getty Images photographers and customers. He’s the lead researcher on visual intelligence at Getty Images and is developing innovative new ways for customers to discover content. Previously, he was a data scientist on the ecommerce analytics team at Getty Images, where he modernized testing frameworks and analysis tools, in addition to modeling content relationships for the creative research team. Dan earned his PhD in astronomy and astrophysics from the University of Michigan in 2015, where he developed new algorithms for estimating the size of galaxy clusters. He also engineered a new image analysis pipeline for an instrument on a telescope used by the department at the Kitt Peak National Observatory.

Presentations

At a loss for words: How ML bridges the creative language gap Data Case Studies

Computer vision has made great strides toward human-level accuracy of describing and identifying images, but there often aren’t words to describe what we want algorithms to predict. Dan Gifford explores this paradox, the limitations of text-based image search, and how creative AI is challenging the way we view human creativity.

Navdeep Gill is a senior data scientist and software engineer at H2O.ai, where he focuses mainly on machine learning interpretability and has previously worked on GPU-accelerated machine learning, automated machine learning, and the core H2O-3 platform. Previously, Navdeep focused on data science and software development at Cisco and was a researcher and analyst in several neuroscience labs at California State University, East Bay; the University of California, San Francisco; and the Smith Kettlewell Eye Research Institute. Navdeep earned an MS in computational statistics, a BS in statistics, and a BA in psychology (with a minor in mathematics) from California State University, East Bay.

Presentations

Debugging machine learning models Session

Like all good software, machine learning models should be debugged to discover and remediate errors. Navdeep Gill explores several standard techniques in the context of model debugging—disparate impact, residual, and sensitivity analysis—and introduces novel applications such as global and local explanation of model residuals.
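
To make one of the named techniques concrete, here's a hedged sketch of simple residual analysis: slice a model's errors by a feature to spot segments the model serves poorly. The data and columns are synthetic, not the session's examples.

```python
# Residual analysis sketch: group regression errors by a binned feature.
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)
model = LinearRegression().fit(X, y)

df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df["residual"] = y - model.predict(X)

# Large or skewed residuals within a bin point at segments worth debugging.
df["f0_bin"] = pd.qcut(df["f0"], q=4)
print(df.groupby("f0_bin")["residual"].agg(["mean", "std", "count"]))
```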

Ilana Golbin is a director in PwC’s emerging technologies practice and globally leads PwC’s research and development of responsible AI. Ilana has almost a decade of experience as a data scientist helping clients make strategic business decisions through data-informed decision making, simulation, and machine learning.

Presentations

A practical guide to responsible AI: Build robust, secure, and safe AI Session

Join in for a practitioner’s overview of the risks of AI and depiction of responsible AI deployment within an organization. You'll discover how to ensure the safety, security, standardized testing, and governance of systems and how models can be fooled or subverted. Ilana Golbin and Anand Rao illustrate how organizations safeguard AI applications and vendor solutions to mitigate AI risks.

Bruno Gonçalves is a chief data scientist at Data For Science, working at the intersection of data science and finance. Previously, he was a data science fellow at NYU’s Center for Data Science while on leave from a tenured faculty position at Aix-Marseille Université. Since completing his PhD in the physics of complex systems in 2008, he’s been pursuing the use of data science and machine learning to study human behavior. Using large datasets from Twitter, Wikipedia, web access logs, and Yahoo! Meme, he studied how we can observe both large-scale and individual human behavior in an unobtrusive and widespread manner. The main applications have been to the study of computational linguistics, information diffusion, behavioral change, and epidemic spreading. In 2015, he was awarded the Complex Systems Society’s 2015 Junior Scientific Award for “outstanding contributions in complex systems science,” and in 2018 he was named a science fellow of the Institute for Scientific Interchange in Turin, Italy.

Presentations

Time series modeling: ML and deep learning approaches 2-Day Training

Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Bruno Gonçalves explains a broad range of traditional machine learning (ML) and deep learning techniques to model and analyze time series datasets with an emphasis on practical applications.
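
As a toy illustration of the traditional end of the spectrum covered here (not the course material itself), an ARIMA model from statsmodels can be fit to a series and used for short-horizon forecasts.

```python
# ARIMA forecasting sketch on synthetic daily data (illustrative only).
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = pd.date_range("2019-01-01", periods=200, freq="D")
y = pd.Series(np.cumsum(np.random.normal(size=200)), index=rng)  # random-walk-like series

model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.forecast(steps=7))  # seven-day-ahead forecast
```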

Time series modeling: ML and deep learning approaches (Day 2) Training Day 2

Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Bruno Gonçalves explains a broad range of traditional machine learning (ML) and deep learning techniques to model and analyze time series datasets with an emphasis on practical applications.

Abe Gong is CEO and cofounder at Superconductive Health. A seasoned entrepreneur, Abe has been leading teams using data and technology to solve problems in healthcare, consumer wellness, and public policy for over a decade. Previously, he was chief data officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe holds a PhD in public policy, political science, and complex systems from the University of Michigan. He speaks and writes regularly on data science, healthcare, and the internet of things.

Presentations

Fighting pipeline debt with Great Expectations Session

Data organizations everywhere struggle with pipeline debt: untested, unverified assumptions that corrupt data quality, drain productivity, and erode trust in data. Abe Gong shares best practices gathered from across the data community in the course of developing a leading open source library for fighting pipeline debt and ensuring data quality: Great Expectations.
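
For a flavor of the declarative checks Great Expectations supports, here's a minimal sketch using its pandas-backed API; the file and column names are hypothetical, and real deployments typically manage expectations in suites rather than inline.

```python
# Great Expectations sketch: declarative data quality checks on a CSV (hypothetical file).
import great_expectations as ge

df = ge.read_csv("orders.csv")

# Each call returns a validation result describing whether the data met the expectation.
print(df.expect_column_values_to_not_be_null("order_id"))
print(df.expect_column_values_to_be_between("amount", min_value=0, max_value=10000))
print(df.expect_column_values_to_be_in_set("status", ["open", "shipped", "returned"]))
```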

Srikanth Gopalakrishnan is a senior data scientist at Gramener's Bangalore office. He applies deep learning, machine learning, and probabilistic modeling in diverse fields. He comes from a solid mechanics background, with a master’s degree in simulation sciences from RWTH Aachen University, Germany. After a short stint in the aeronautics department at Purdue University, he returned to India and transitioned to data science.

Presentations

Sizing biological cells and saving lives using AI Session

AI techniques are finding applications in a wide range of fields. Crowd-counting deep learning models have been used to count people, animals, and microscopic cells. Srikanth Gopalakrishnan introduces novel crowd-counting techniques and their applications, including a pharma case study showing how the approach was used in drug discovery to deliver roughly 98% savings in drug characterization efforts.

Sunil Goplani is a group development manager at Intuit, where he leads the big data platform. Sunil has played key architecture and leadership roles building solutions around data platforms, big data, BI, data warehousing, and MDM for startups and enterprises. Previously, Sunil served in key engineering positions at Netflix, Chegg, Brand.net, and a few other startups. Sunil holds a master’s degree in computer science.

Presentations

10 lead indicators before data becomes a mess Session

Data quality metrics focus on quantifying if data is a mess. But you need to identify lead indicators before data becomes a mess. Sandeep Uttamchandani, Giriraj Bagadi, and Sunil Goplani explore developing lead indicators for data quality for Intuit's production data pipelines. You'll learn about the details of lead indicators, optimization tools, and lessons that moved the needle on data quality.

Always accurate business metrics through lineage-based anomaly tracking Session

Debugging data pipelines is nontrivial, and finding the root cause can take hours to days. Shradha Ambekar and Sunil Goplani outline how Intuit built a self-serve tool that automatically discovers data pipeline lineage and applies anomaly detection to detect and help debug issues in minutes—establishing trust in metrics and improving developer productivity by 10x–100x.

As the Chief Data Officer of DataStax, Dr. Denise Koessler Gosnell applies her experiences as a machine learning and graph data practitioner to make more informed decisions with data. Prior to this role, Dr. Gosnell joined DataStax to create and lead the Global Graph Practice, a team that builds some of the largest distributed graph applications in the world. Dr. Gosnell earned her Ph.D. in Computer Science from the University of Tennessee as an NSF Fellow. Her research coined the concept “social fingerprinting” by applying graph algorithms to predict user identity from social media interactions.

Dr. Gosnell’s career centers on her passion for examining, applying, and advocating the applications of graph data. She has patented, built, published, and spoken on dozens of topics related to graph theory, graph algorithms, graph databases, and applications of graph data across all industry verticals. Prior to her roles with DataStax, Gosnell worked in the healthcare industry, where she contributed to software solutions for permissioned blockchains, machine learning applications of graph analytics, and data science.

Presentations

How does graph data help inform a self-organizing network? Session

Self-organizing networks rely on sensor communication and a centralized mechanism, like a cell tower, for transmitting the network's status. Denise Gosnell walks you through what happens if the tower goes down and how a graph data structure gets involved in the network's healing process. You'll see graphs in this dynamic network and how path information helps sensors come back online.
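
A hedged, toy sketch of the underlying graph idea (using networkx rather than the talk's actual stack): model sensors and the tower as vertices, and use path queries to find an alternative route when a link fails. The topology and names are made up.

```python
# Graph path sketch: find a sensor's route back to the tower after a link failure.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("tower", "s1"), ("s1", "s2"), ("s2", "s3"),
    ("tower", "s4"), ("s4", "s3"),
])

g.remove_edge("s1", "s2")  # simulate a failed link
print(nx.shortest_path(g, source="s3", target="tower"))  # e.g., ['s3', 's4', 'tower']
```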

Gray Lindsey is a staff scientist at Fermi National Accelerator Laboratory studying Higgs and electroweak physics. He’s focused on developing software and detectors to address the challenge of the high-luminosity upgrade for the Large Hadron Collider and the corresponding upgrade of the Compact Muon Solenoid (CMS) experiment. He’s developed a variety of pattern recognition techniques to demonstrate and help realize new detector systems to efficiently assemble physics data from upgrades to the CMS detector. He also leads the development to make the analysis of those data more efficient and scalable using modern big data technologies.

Presentations

Data engineering at the Large Hadron Collider Session

Building a data engineering pipeline for serving segments of a 200 PB dataset to particle physicists around the globe poses many challenges, some unique to high energy physics and some common to big science projects across disciplines. Ben Galewsky, Gray Lindsey, and Andrew Melo highlight how much of it can inform industry data science at scale.

Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Reducing data lag from 24+ hours to 5 mins at Lyft scale Session

Mark Grover and Dev Tagare offer you a glimpse into the end-to-end data architecture Lyft uses to reduce data lag in its analytical systems from 24+ hours to less than 5 minutes. You'll learn the what and why of tech choices, monitoring, and best practices. They outline Lyft's use cases, especially in ML model performance and evaluation.

Sarah Guido is a senior data scientist at InVision, where she studies user collaboration through data. She's an accomplished conference speaker and O’Reilly Media author and enjoys making data science as accessible as possible to a broad audience. Sarah attended graduate school at the University of Michigan’s School of Information.

Presentations

Preparing and standardizing data for machine learning Interactive session

Getting your data ready for modeling is the essential first step in the machine learning process. Sarah Guido outlines the basics of preparing and standardizing data for use in machine learning models.

Ananth Kalyan Chakravarthy Gundabattula is a senior application architect on the decisioning and advanced analytics engineering team at the Commonwealth Bank of Australia (CBA). Previously, he was an architect at ThreatMetrix, where he was part of the core team that scaled the ThreatMetrix architecture to 100 million transactions per day at very low latencies using Cassandra, ZooKeeper, and Kafka, and where he migrated the ThreatMetrix data warehouse to a next-generation architecture based on Hadoop and Impala. Before that, he was at IBM Software Labs and IBM CIO Labs, enabling some of the first IBM CIO projects to onboard the HBase, Hadoop, and Mahout stack. Ananth is a committer for Apache Apex and works on the next-generation architectures for CBA's fraud platform and the Advanced Analytics Omnia platform at CBA. He has presented at a number of conferences, including YOW! Data and the DataWorks Summit in Australia. Ananth holds a PhD in computer science with a focus on security. He’s interested in all things data, including low-latency distributed processing systems, machine learning, and data engineering. He holds three patents and has one application pending.

Presentations

Automated feature engineering using dask and featuretools Session

Feature engineering can make or break a machine learning model. The featuretools package and associated algorithm accelerate the way features are built. Ananth Kalyan Chakravarthy Gundabattula explains a Dask- and Prefect-based framework that addresses challenges and opportunities using this approach in terms of lineage, risk, ethics, and automated data pipelines for the enterprise.
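As a flavor of what automated feature synthesis looks like in practice, here's a minimal, self-contained sketch using featuretools on a toy dataset (the entity set, column names, and the featuretools 1.x API used below are illustrative assumptions, not the speaker's pipeline, which adds Dask and Prefect for scale and orchestration):

import pandas as pd
import featuretools as ft

# Toy transactional data (illustrative only)
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "join_date": pd.to_datetime(["2019-01-01", "2019-06-15"]),
})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 2],
    "amount": [25.0, 40.0, 10.0, 5.0],
    "time": pd.to_datetime(["2019-02-01", "2019-03-01", "2019-07-01", "2019-08-01"]),
})

# Describe the tables and their relationship, then let Deep Feature Synthesis
# derive aggregate and transform features automatically
es = ft.EntitySet(id="retail")
es.add_dataframe(dataframe_name="customers", dataframe=customers, index="customer_id")
es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                 index="transaction_id", time_index="time")
es.add_relationship("customers", "customer_id", "transactions", "customer_id")

feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers", max_depth=2)
print(feature_matrix.head())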

Sijie Guo is the founder and CEO of StreamNative, a data infrastructure startup offering a cloud native event streaming platform based on Apache Pulsar for enterprises. Previously, he was the tech lead for the Messaging Group at Twitter and worked on push notification infrastructure at Yahoo. He’s also the VP of Apache BookKeeper and PMC Member of Apache Pulsar.

Presentations

Transactional event streaming with Apache Pulsar Session

Sijie Guo and Yong Zhang lead a deep dive into the details of Pulsar transactions and how they can be used in Pulsar Functions and other processing engines to achieve transactional event streaming.

Dr. Alberto Gutierrez has over 25 years of experience as an analytics and data science leader, making contributions to products in IoT, telecommunications, semiconductors, mobile applications, call centers, customer engagement, and manufacturing. Specific application areas include predictive analytics, statistical process control, deep learning (DL) NLP language models, and DL computer vision. He is actively engaged in the open source community for deep learning and data science. Dr. Gutierrez holds a PhD in communications and information theory from New Mexico State University, an MSEE in semiconductors from Purdue University, and an MBA from Arizona State University's Carey School of Business.

Presentations

Industry 4.0: Wind Turbine Defect Detection Intel® AI Builders Showcase

Reduction of the levelized cost of energy (LCoE) remains the priority in the development of the wind energy sector. The structure and working mechanism of wind turbines pose a huge barrier to conducting periodic inspection and maintenance tasks. This solution addresses the problem by combining drone and AI technology to augment site engineers with useful insights.

Patrick Hall is a principal scientist at bnh.ai, a D.C.-based boutique law firm; a senior director of product at H2O.ai, a leading Silicon Valley machine learning software company; and a lecturer in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning. At both bnh.ai and H2O.ai, he works to mitigate AI risks and advance the responsible practice of machine learning. Previously, Patrick held global customer-facing and R&D research roles at SAS. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the 11th person worldwide to become a Cloudera Certified Data Scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Model debugging strategies Tutorial

Even if you've followed current best practices for model training and assessment, machine learning models can be hacked, socially discriminatory, or just plain wrong. Patrick Hall breaks down model debugging strategies to test and fix security vulnerabilities, unwanted social biases, and latent inaccuracies in models.

Luke (Qing) Han is a cofounder and CEO of Kyligence, cocreator and PMC chair of Apache Kylin, the leading open source OLAP for big data, and a Microsoft regional director and MVP. Luke has 10+ years’ experience in data warehouses, business intelligence, and big data. Previously, he was big data product lead at eBay and chief consultant of Actuate China.

Presentations

Apache Kylin: 5 years and still going strong (sponsored by Kyligence) Session

Luke Han and Debashis Saha delve into the technical history of Apache Kylin, share how it's being used inside some of the world's largest organizations, and provide a road map of what lies ahead for this popular open source project.

Hannes Hapke is a senior data scientist at SAP ConcurLabs. He’s been a machine learning enthusiast for many years and is a Google Developer Expert for machine learning. Hannes has applied deep learning to a variety of computer vision and natural language problems, but his main interest is in machine learning engineering and automating model workflows. Hannes is a coauthor of the deep learning publication Natural Language Processing in Action and he’s working on a book about Building Machine Learning Pipelines with TensorFlow Extended (O’Reilly). When he isn’t working on a deep learning project, you’ll find him outdoors running, hiking, or enjoying a good cup of coffee with a great book.

Presentations

Analyzing and deploying your machine learning model Tutorial

Most deep learning models never get analyzed, validated, or deployed. Catherine Nelson and Hannes Hapke explain the necessary steps to release machine learning models for real-world applications. You'll view an example project using the TensorFlow ecosystem, focusing on how to analyze models and deploy them efficiently.

Getting the most out of your AI projects with model feedback loops Session

Measuring your machine learning model’s performance is key for every successful data science project. Therefore, model feedback loops are essential to capture feedback from users and expand your model’s training dataset. Hannes Hapke and Catherine Nelson explore the concept of model feedback and guide you through a framework for increasing the ROI of your data science project.

Joe is a partner leading PwC’s Artificial Intelligence Center of Enablement, with a primary focus on natural language processing and model development. In this role, he leads the encapsulation of models into PwC’s Data Sieve platform and has served as one of the chief architects of the platform since its inception. He also has an information security background, which he uses to bring a security-by-design focus to the applications developed on the platform, given the regulated nature of the institutions he works with. Joe joined PwC in February 2014 in Assurance. He earned a degree in international finance from Merrimack College.

Presentations

Cool AI for practical problems (sponsored by PwC) Session

While many of the problems to which AI can be applied involve exciting technologies, AI will arguably have a greater transformational impact on more mundane problems. Anand Rao offers an overview of how PwC has developed and applied innovative AI solutions to common, practical problems across several domains, such as tax, accounting, and management consulting.

Matt Harrison is a corporate trainer and consultant at MetaSnake, which specializes in Python and data science. He teaches companies big and small how to leverage Python and data science for great good.

Presentations

Mastering pandas Tutorial

You can use pandas to load data, inspect it, tweak it, visualize it, and do analysis with only a few lines of code. Matt Harrison leads a deep dive into plotting and Matplotlib integration, data quality, and issues such as missing data. Matt uses the split-apply-combine paradigm with groupby and pivot tables and explains stacking and unstacking data.
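As a taste of the split-apply-combine and reshaping patterns the tutorial covers, here's a minimal pandas sketch (the columns and values are illustrative, not the tutorial's dataset):

import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF"],
    "year": [2018, 2019, 2018, 2019],
    "sales": [100.0, 120.0, 80.0, None],   # includes a missing value
})

df["sales"] = df["sales"].fillna(df["sales"].mean())            # handle missing data
totals = df.groupby("city")["sales"].sum()                      # split-apply-combine
wide = df.pivot(index="city", columns="year", values="sales")   # reshape long to wide
long_again = wide.stack()                                       # ...and back to long
print(totals, wide, long_again, sep="\n\n")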

Zak Hassan is a software engineer on the data analytics team working on data science and machine learning at Red Hat. Previously, Zak was a software consultant in the financial services and insurance industry, building end-to-end software solutions for clients.

Presentations

Log anomaly detector with NLP and unsupervised machine learning Session

The number of logs increases constantly and no human can monitor them all. Zak Hassan employs natural language processing (NLP) for text encoding and machine learning (ML) methods for automated anomaly detection to construct a tool to help developers perform root cause analysis more quickly. He provides a means to give feedback to the ML algorithm to learn from false positives.
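As a rough illustration of the general pattern (a generic sketch, not the presenter's implementation), one common approach is to encode each log line as a text vector and score outliers with an unsupervised model:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import IsolationForest

logs = [
    "INFO user login succeeded",
    "INFO user login succeeded",
    "INFO request served in 12ms",
    "ERROR database connection refused after 3 retries",  # the odd one out
]

# Encode log lines as TF-IDF vectors, then flag anomalies without any labels
X = TfidfVectorizer().fit_transform(logs).toarray()
model = IsolationForest(contamination=0.25, random_state=0).fit(X)
scores = model.decision_function(X)  # lower scores are more anomalous
for line, score in zip(logs, scores):
    print(f"{score:+.3f}  {line}")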

Brittan Heller has structured her practice at Foley Hoag around the areas of law, technology and human rights. She specializes in advising companies on privacy, freedom of expression, content moderation, online harassment, disinformation, civic engagement, cyberhate and hate speech, and online extremism.

As the founding director of the Center on Technology and Society for the Anti-Defamation League, Brittan proposed new policies and implemented programs to prevent bias, racism, discrimination, and the spread of disinformation, with a focus on protecting minority populations. She also collaborated with major online platforms and gaming companies to combat cyberhate, and produced and launched new technology for good, in mediums like AI, VR/AR/XR and data visualization.

Presentations

How to Stop Online Harassment: Perspective of the First Jane Doe Keynote

How to Stop Online Harassment: Perspective of the First Jane Doe - Brittan Heller, Counsel, Business and Human Rights, Foley Hoag LLP

Thomas Henson is a data engineering advocate and senior systems engineer for the unstructured data solutions team at Dell EMC. Thomas has been involved in many different big data, analytics, and artificial intelligence projects throughout his career, with a focus on distributed systems. He’s a proud alumnus of the University of North Alabama, where he earned his undergraduate and graduate degrees in computer information systems. Thomas is an accomplished speaker in the artificial intelligence and big data ecosystem at various conferences.

Presentations

Make impossible real: AI architecture from sandbox to production Session

Thomas Henson details key business and architectural requirements for storage, compute, and GPU acceleration to explore how you can achieve maximum benefit from AI platforms aligning with these requirements. You'll discover the Dell Technologies solution portfolio, including Isilon and ECS storage arrays, Ready Solutions, and PowerEdge Servers.

Long Van Ho is a data scientist at Children’s Hospital Los Angeles with over five years of experience applying advanced machine learning techniques in healthcare and defense applications. His work includes developing the machine learning framework that enables data science at the Virtual Pediatric Intensive Care Unit and researching applications of artificial intelligence to improve care in ICUs. His background in particle beam physics research at UCLA and Stanford University has provided a strong research foundation for his career as a data scientist. His goal is to bridge the potential of machine learning with practical applications in health.

Presentations

Semisupervised AI for automated categorization of medical images Session

Annotating radiological images by category at scale is a critical step for analytical ML. Supervised learning is challenging because image metadata doesn't reliably identify image content, and manually labeling images for AI algorithms isn't feasible. Stephan Erberich, Kalvin Ogbuefi, and Long Ho share an approach for automated categorization of radiological images based on content category.

Ana Hocevar is a data scientist in residence at the Data Incubator, where she combines her love for coding and teaching. Ana has more than a decade of experience in physics and neuroscience research and over five years of teaching experience. Previously, she was a postdoctoral fellow at the Rockefeller University, where she worked on developing and implementing an underwater touch screen for dolphins. She holds a PhD in physics.

Presentations

Deep learning with PyTorch 2-Day Training

PyTorch is a machine learning library for Python that allows you to build deep neural networks with great flexibility. Its easy-to-use API and seamless use of GPUs make it a sought-after tool for deep learning. Join in to get the knowledge you need to build deep learning models using real-world datasets and PyTorch with Ana Hocevar.
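To give a sense of the API the training works with, here's a minimal PyTorch snippet that defines and trains a small network on random data (purely illustrative; the course itself uses real-world datasets):

import torch
from torch import nn

# A tiny fully connected network for a 3-class problem
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(128, 20)          # stand-in features
y = torch.randint(0, 3, (128,))   # stand-in labels

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # forward pass and loss
    loss.backward()               # backpropagation
    optimizer.step()              # parameter update
    print(f"epoch {epoch}: loss={loss.item():.4f}")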

Donagh Horgan is a principal engineer at Extreme Networks, where he designs data-driven solutions for smarter and more secure networks as part of the Cloud Technology Adoption Group. Previously, he led and contributed to applied machine learning research at a number of Fortune 500 companies with applications in the areas of converged physical security, asset microlocation and infrastructure performance monitoring. Donagh holds a BEng in microelectronic engineering and a PhD in electrical and electronic engineering from University College Cork, Ireland.

Presentations

What machines say when nobody's looking: Track IoT security with NLP Session

Machines talk among themselves, but it's possible to understand their behavior by analyzing their language. Donagh Horgan outlines a lightweight approach for securing large internet of things (IoT) deployments by leveraging modern natural language processing (NLP) techniques. Rather than attempting cumbersome firewall rules, IoT deployments can be efficiently secured by online behavioral modeling.

Bob Horton is a senior data scientist on the user understanding team at Bing. Bob holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects. Previously, he was on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento.

Presentations

Machine learning for managers Tutorial

Bob Horton, Mario Inchiosa, and John-Mark Agosta offer an overview of the fundamental concepts of machine learning (ML) for business and healthcare decision makers and software product managers so you'll be able to make a more effective use of ML results and be better able to evaluate opportunities to apply ML in your industries.

Jeremy Howard is an entrepreneur, business strategist, developer, and educator. Jeremy is a founding researcher at fast.ai, a research institute dedicated to making deep learning more accessible. He is also a Distinguished Research Scientist at the University of San Francisco, a faculty member at Singularity University, and a Young Global Leader with the World Economic Forum. Jeremy’s most recent startup, Enlitic, was the first company to apply deep learning to medicine and was selected one of the world’s top 50 smartest companies by MIT Tech Review two years running. Previously, he was the president and chief scientist at the data science platform Kaggle, where he was the top ranked participant in international machine learning competitions two years running; was the founding CEO of successful Australian startups FastMail and Optimal Decisions Group (acquired by Lexis-Nexis); and spent eight years in management consulting at McKinsey & Co. and AT Kearney. Jeremy has invested in, mentored, and advised many startups and contributed to many open source projects. He has made a number of television and video appearances, including as a regular guest on Australia’s highest-rated breakfast news program, a popular talk on TED.com, and data science and web development tutorials and discussions.

Presentations

Practical Deep Learning Keynote

Jeremy Howard argues that deep learning is a practical tool for practical people, well within reach of anyone with basic coding skills and the passion and tenacity to solve real problems.

Sihui “May” Hu (she/her) is a program manager at Microsoft, focused on creating data management and data lineage solutions for the Azure Machine Learning service. Previously, she had two years of working experience in the ecommerce industry and several internships in product management. She graduated from Carnegie Mellon University, studying information systems management.

Presentations

Data lineage enables reproducible and reliable ML at scale Session

Data scientists need a way to ensure result reproducibility. Sihui "May" Hu and Dominic Divakaruni unpack how to retrieve data-to-data, data-to-model, and model-to-deployment lineages in one graph to achieve reproducible and reliable machine learning at scale. You'll discover effective ways to track the full lineage from data preparation to model training to inference.

Erich S. Huang is a codirector of Duke Forge and assistant dean for biomedical informatics. He was the first faculty recruit to a new initiative and new division at Duke in the Department of Biostatistics and Bioinformatics, as well as a principal investigator on an NIH-funded project under the Big Data to Knowledge (BD2K) RFAs. Erich is a faculty lead for informatics on the Google Life Sciences-funded Baseline Study. He’s leading a Duke University School of Medicine-wide initiative for a data service for biomedical researchers and leading projects on applied machine learning, user interfaces, and visualization of surgical outcomes (Clinical & Analytic Learning Platform for Surgical Outcomes, CALYPSO) and a chronic kidney disease “early warning” system. His overarching aim is to create a data science culture and infrastructure for biomedical and healthcare research.

Presentations

Deep care management: ML value-based care to Medicare patients Health Data Day

A patient with a history of heart attack, cardiac bypass, congestive heart failure, diabetes, and substance abuse is admitted to the emergency department. What next? Duke Connected Care is an accountable care organization, a model incentivizing value-based healthcare, targeting resources to those who need it most. Erich Huang explores deep care management, an ML workflow helping do just that.

Mario Inchiosa’s passion for data science and high-performance computing drives his work as principal software engineer in Microsoft Cloud + AI, where he focuses on delivering advances in scalable advanced analytics, machine learning, and AI. Previously, Mario served as chief scientist of Revolution Analytics; analytics architect in the big data organization at IBM, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist at Netezza, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning Publication of the Year and Open Literature Publication Excellence awards.

Presentations

Machine learning for managers Tutorial

Bob Horton, Mario Inchiosa, and John-Mark Agosta offer an overview of the fundamental concepts of machine learning (ML) for business and healthcare decision makers and software product managers so you'll be able to make a more effective use of ML results and be better able to evaluate opportunities to apply ML in your industries.

George Iordanescu is a data scientist on the algorithms and data science team for Microsoft’s Cortana Intelligence Suite. Previously, he was a research scientist in academia, a consultant in the healthcare and insurance industry, and a postdoctoral visiting fellow in computer-assisted detection at the National Institutes of Health (NIH). His research interests include semisupervised learning and anomaly detection. George holds a PhD in EE from Politehnica University in Bucharest, Romania.

Presentations

Using the cloud to scale up hyperparameter optimization for ML Session

Hyperparameter optimization for machine learning is complex, requires advanced optimization techniques, and can be implemented as a generic framework decoupled from the specific details of algorithms. Fidan Boylu Uz, Mario Bourgoin, and George Iordanescu apply such a framework to tasks like object detection and text matching in a transparent, scalable, and easy-to-manage way in a cloud service.
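The session is about running this at cloud scale; for intuition on what's being scaled, a single-machine hyperparameter search can be as simple as the following generic scikit-learn sketch (not the framework being presented):

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Randomized search over a log-uniform range for the regularization strength C
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=20, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)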

Ankit Jain is a senior research scientist at Uber AI Labs, the machine learning research arm of Uber. His primary research areas include graph neural networks, meta-learning, and forecasting. Previously, he worked in a variety of data science roles at Bank of America, Facebook, and other startups. He coauthored a book on machine learning, TensorFlow Machine Learning Projects. He’s been a featured speaker at many of the top AI conferences and universities across the US, including UC Berkeley and the O’Reilly AI Conference. He earned his MS from UC Berkeley and BS from IIT Bombay (India).

Presentations

Enhance recommendations in Uber Eats with graph convolutional networks Session

Ankit Jain and Piero Molino detail how to generate better restaurant and dish recommendations in Uber Eats by learning entity embeddings using graph convolutional networks implemented in TensorFlow.
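For readers new to graph convolutional networks, a single GCN layer is essentially a neighborhood-averaged dense layer. A toy TensorFlow sketch of one propagation step (illustrative math only, not Uber's implementation):

import numpy as np
import tensorflow as tf

# Toy graph: 4 nodes with 8-dimensional features; adjacency includes self-loops
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=np.float32)
X = np.random.rand(4, 8).astype(np.float32)

# Symmetric normalization: D^(-1/2) A D^(-1/2)
d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))

# One GCN layer: aggregate neighbor features, then apply a learned projection
W = tf.Variable(tf.random.normal([8, 16]))
H = tf.nn.relu(tf.constant(A_norm) @ tf.constant(X) @ W)
print(H.shape)  # (4, 16): one 16-dimensional embedding per node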

Shubhankar Jain (he/him) is a machine learning engineer at SurveyMonkey, where he develops and implements machine learning systems for its products and teams. He’s excited to bring his expertise in and passion for data and AI systems to the rest of the industry. In his free time, he likes hiking with his dog and accelerating his hearing loss at live music shows.

Presentations

Accelerate your organization: Make data optimal for ML Session

Every organization leverages ML to increase value to customers and understand its business. You may have created models, but now you need to scale. Shubhankar Jain, Aliaksandr Padvitselski, and Jin Yang use a case study to teach you how to pinpoint inefficiencies in your ML data flow, how SurveyMonkey tackled this, and how to make your data more usable to accelerate ML model development.

Ram Janakiraman is a distinguished engineer at the Aruba CTO Office working on machine intelligence for enterprise security. His recent focus has been on simplifying the building of behavior models by leveraging approaches from NLP and representation learning. He hopes to improve end-user product engagement through a visual representation of entity interactions without compromising the privacy of the network entities. Ram has earned numerous patents across a variety of areas over the course of his career. Previously, he worked at various startups and was a cofounding member of Niara, Inc., working on security analytics with a focus on threat detection and investigation before it was acquired by Aruba, an HPE company. He’s also an avid scuba diver, always eager to explore the next reef or kelp forest, and an FAA-certified drone pilot, capturing the beauty of dive destinations on his trips.

Presentations

Preserving privacy in behavioral models with semantic learning in NLP Session

Devices discover their way around the network and proxy the intent of the users behind them; leveraging this information for behavior analytics can raise privacy concerns. A selective use of embedding models on a crafted corpus from anonymized data can address these concerns. Ramsundar Janakiraman details a way to build representations with behavioral insights that also preserves user identity.

Dan Jeffries is the chief technology evangelist at Pachyderm. He’s also an author, engineer, futurist, pro blogger, and he’s given talks all over the world on AI and cryptographic platforms. He’s spent more than two decades in IT as a consultant and at open source pioneer Red Hat. His articles have held the number one writer’s spot on Medium for artificial intelligence, bitcoin, cryptocurrency and economics more than 25 times. His breakout AI tutorial series, “Learning AI if You Suck at Math” along with his explosive pieces on cryptocurrency, "Why Everyone Missed the Most Important Invention of the Last 500 Years” and "Why Everyone Missed the Most Mind-Blowing Feature of Cryptocurrency,” are shared hundreds of times daily all over social media and have been read by more than 5 million people worldwide.

Presentations

When AI goes wrong and how to fix it with real-world AI auditing Session

With algorithms making more and more decisions in our lives, from who gets hired and fired to who goes to jail, it’s more critical than ever that we make AI auditable and explainable in the real world. Daniel Jeffries breaks down how you can make your AI and ML systems auditable and transparent right now with a few classic IT techniques your team already knows well.

Amit Kapoor is a data storyteller at narrativeViz, where he uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Interested in learning and teaching the craft of telling visual stories with data, Amit also teaches storytelling with data for executive courses as a guest faculty member at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. Previously, he gained more than 12 years of management consulting experience with A.T. Kearney in India, Booz & Company in Europe, and startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi, and a PGDM (MBA) from IIM, Ahmedabad.

Presentations

Deep learning for recommendation systems 2-Day Training

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains. You'll get an end-to-end overview of deep learning-based recommendation and learning-to-rank systems to understand practical considerations and guidelines for building and deploying RecSys.

Deep learning for recommendation systems (Day 2) Training Day 2

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains. You'll get an end-to-end overview of deep learning-based recommendation and learning-to-rank systems to understand practical considerations and guidelines for building and deploying RecSys.

Holden Karau is a transgender Canadian software engineer working in the bay area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Ship it: A guide to model management and deployment with Kubeflow Session

Holden Karau is here to make sure you can get and keep your models in production with Kubeflow.

Arun Kejariwal is an independent lead engineer. Previously, he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers working on novel techniques for install-and-click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns; his team also built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Before that, he developed and open-sourced techniques for anomaly detection and breakout detection at Twitter. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. You'll get an overview of the inception and growth of the serverless paradigm and explore Apache Pulsar, which provides native serverless support in the form of Pulsar functions.

Anurag Khandelwal is an assistant professor in the Department of Computer Science at Yale University. Previously, Anurag did a short postdoc at Cornell Tech, where he worked with Tom Ristenpart and Rachit Agarwal. He received his PhD from the University of California, Berkeley, at the RISELab, where he was advised by Ion Stoica. Anurag earned his BTech in computer science and engineering from the Indian Institute of Technology, Kharagpur. His research interests span distributed systems, networking, and algorithms. In particular, his research focuses on addressing core challenges in distributed systems through novel algorithm and data structure design. During his PhD, Anurag built large-scale data-intensive systems such as Succinct and Confluo that led to deployments in several production clusters.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. You'll get an overview of the inception and growth of the serverless paradigm and explore Apache Pulsar, which provides native serverless support in the form of Pulsar functions.

Veysel Kocaman is a senior data scientist and ML engineer at John Snow Labs. He has a decade’s experience in the industry and provides hands-on consulting services in ML and AI, statistics, data science, and operations research to several startups and companies around the globe. Previously, Veysel has been a CTO, head of AI, and principal data scientist, among other titles. He earned his PhD in computer science at Leiden University (the Netherlands) and an MS degree in operations research from Penn State University.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, Claudiu Branzan, and Veysel Kocaman detail applying advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.
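For a sense of how little code a pipeline takes, here's a minimal Spark NLP example (the pretrained pipeline name below is an assumption for illustration; the tutorial builds pipelines stage by stage and covers training your own models):

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Starts a Spark session with the Spark NLP jars on the classpath
spark = sparknlp.start()

# "explain_document_dl" is one of the published pretrained pipelines; it runs
# tokenization, spell checking, POS tagging, and named entity recognition
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("John Snow Labs develops Spark NLP in Python and Scala.")
print(result["entities"])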

Madhu Kochar is the vice president of product development in the data and AI business unit at IBM. She leads the large portfolio of Watson AI applications and the data platform. Over the last few years, Madhu has distinguished herself by successfully establishing a world-class service delivery and engineering team that transformed IBM analytics business on the IBM public and private cloud. She has extensive experience in software development, DevOps, data ops, hybrid cloud, machine learning, and AI. In 2012, Madhu was named one of the outstanding executive women from Silicon Valley and a recipient of the prestigious Tribute to Women (TWIN) award, a recognition of her role as a woman executive and an inspiration for other women in technology fields. Madhu has also represented IBM in the science, technology, engineering and mathematics (STEM) summit panels and is on the IBM corporate Asian Council. She also leads the local IBM Women Charter, whose goal is to grow the next generation of leaders. Madhu is based in San Jose, California.

Presentations

Beyond the Model: Real-Time, Real-World Hardened AI (Sponsored by IBM) Session

You’ve deployed your model, and now reality hits: the market changes, customers react, and the relationship between input and output data is not what it was. Join data and AI experts as they discuss how to create hardened AI solutions that can dynamically adapt and evolve.

Turning AI aspirations into real-world business outcomes (sponsored by IBM) Keynote

Operationalized AI is no longer a far-off dream but an absolute necessity for enterprise success. Madhu Kochar explains why, to put AI to work, businesses need an integrated set of data management, analytics, and development tools that provide an enterprise-class platform for building systems of insight.

David Kohn is an R&D engineer at TimescaleDB, which he joined after a BS in environmental engineering at Yale, founding an electrochemistry startup, joining a battery startup, and doing crazy things with PostgreSQL for Moat (an ad-analytics company). He also cooks, does pottery, and builds furniture.

Presentations

Simplify data analytics by creating continuously up-to-date aggregates Session

The sheer volume of time series data from servers, applications, or IoT devices introduces performance challenges, both to insert data at high rates and to process aggregates for subsequent understanding. David Kohn demonstrates how systems can continuously maintain up-to-date aggregates, correctly handling late or out-of-order data, to simplify data analysis.
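As a flavor of the approach, a continuous aggregate is declared once and the database keeps it current as new (or late) rows arrive. The sketch below assumes TimescaleDB 2.x syntax and a hypothetical conditions hypertable, not the session's exact schema:

import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")  # hypothetical connection string
conn.autocommit = True  # continuous aggregates can't be created inside a transaction
with conn.cursor() as cur:
    # Hourly rollups that stay up to date as raw rows arrive, including
    # late or out-of-order data within the refresh window
    cur.execute("""
        CREATE MATERIALIZED VIEW conditions_hourly
        WITH (timescaledb.continuous) AS
        SELECT time_bucket('1 hour', time) AS bucket,
               device_id,
               avg(temperature) AS avg_temp
        FROM conditions
        GROUP BY bucket, device_id;
    """)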

Ravi Krishnaswamy is the director of software architecture in the AutoCAD Group at Autodesk. He has a passion for technology and has implemented a wide range of solutions for products at Autodesk from analytics and database applications to mobile graphics. His current projects involve analytics solutions on product usage data that leverage graph databases and machine learning techniques on graphs.

Presentations

Collaboration insights through data access graphs Session

Today’s applications interact with data in a distributed and decentralized world. Using graphs at scale, you can infer communities and interactions by tracking access to common data across users and applications. Ravi Krishnaswamy presents a real-world product example with millions of users that uses the combined powers of Spark and graph databases to gain insights into customer workflows.

Akshay Kulkarni is a senior data scientist on the core AI and data science team at Publicis Sapient, where he’s part of strategy and transformation interventions through AI, manages high-priority growth initiatives around data science, and works on various machine learning, deep learning, natural language processing, and artificial intelligence engagements by applying state-of-the-art techniques. He’s a renowned AI and machine learning evangelist, author, and speaker, and was recently recognized as one of the “Top 40 under 40 Data Scientists” in India by Analytics India Magazine. He’s consulted with several Fortune 500 and global enterprises on driving AI and data science-led strategic transformation. Akshay has rich experience in building and scaling AI and machine learning businesses and creating significant client impact. He’s actively involved in next-gen AI research and is part of the next-gen AI community. Previously, he was at Gartner and Accenture, where he scaled the AI and data science business. He’s a regular speaker at major data science conferences and recently gave a talk on “Sequence Embeddings for Prediction Using Deep Learning” at GIDS. He’s the author of a book on NLP with Apress and is writing a couple more books with Packt on deep learning and next-gen NLP. Akshay is visiting faculty (industry expert) at a few of the top universities in India. In his spare time, he enjoys reading, writing, coding, and helping aspiring data scientists.

Presentations

Attention networks all the way to production using Kubeflow Tutorial

Vijay Srinivas Agneeswaran, Pramod Singh, and Akshay Kulkarni demonstrate the in-depth process of building a text summarization model with an attention network using TensorFlow 2.0. You'll gain the practical hands-on knowledge to build and deploy a scalable text summarization model on top of Kubeflow.

Dinesh Kumar is a product engineer at Gojek. He has experience building high-scale distributed systems and working with event-driven systems and components around Kafka.

Presentations

BEAST: Building an event processing library for millions of events Session

Maulik Soneji and Dinesh Kumar explore Gojek's event-processing library, which consumes events from Kafka and pushes them to BigQuery. All of Gojek's services are event sourced; the company handles hundreds of topics, with some topics carrying loads as high as 21K messages per second.

Saewon Kye is a lead researcher in the medical division at JLK Inspection, a South Korean company developing AI-based diagnosis support systems for medical professionals. He manages the team responsible for developing the AI platform and solutions for deploying various AI models and for data annotation.

Saewon holds a bachelor’s degree in business and computer science from the University of British Columbia and a master’s degree from Yonsei University, where he specialized in physiological data acquisition and AI-based data processing.

Presentations

AIHuB with Jviewer Intel® AI Builders Showcase

Saewon Kye, lead engineer at JLK Inspection Inc., demonstrates how JLK Inspection uses Intel’s cutting-edge technology for deployment in developing countries, mobile environments, and places with limited internet connectivity. The talk also covers how JLK Inspection has centralized the data acquired from edge devices in JLK’s universal viewer, JViewer.

Tianhui Michael Li is the founder and president of the Data Incubator, a data science training and placement firm. Michael bootstrapped the company and navigated it to a successful sale to the Pragmatic Institute. Previously, he headed monetization data science at Foursquare and has worked at Google, Andreessen Horowitz, J.P. Morgan, and D.E. Shaw. He’s a regular contributor to the Wall Street Journal, Tech Crunch, Wired, Fast Company, Harvard Business Review, MIT Sloan Management Review, Entrepreneur, Venture Beat, Tech Target, and O’Reilly. Michael was a postdoc at Cornell Tech, a PhD at Princeton, and a Marshall Scholar in Cambridge.

Presentations

Big data for managers 2-Day Training

Gonzalo Diaz and Michael Li provide a nontechnical overview of AI and data science and teach common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Big data for managers (Day 2) Training Day 2

Gonzalo Diaz and Michael Li provide a nontechnical overview of AI and data science and teach common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Nong Li cofounded Okera in 2016 with Amandeep Khurana and serves as the company’s CEO. Previously, he was on the engineering team at Databricks, where he led performance engineering for Spark core and SparkSQL, and was tech lead for the Impala project at Cloudera and the author of the Record Service project. Nong is also one of the original authors of the Apache Parquet project and mentors several Apache projects, including Apache Arrow. Nong has a degree in computer science from Brown University.

Presentations

Data versus metadata: Overcoming challenges to secure the modern data lake Session

The evolution from storing data in a warehouse to a hybrid infrastructure of on-premises and cloud data lakes has enabled agility and scale. Nong Li looks at the problems between data and metadata, the privacy and security risks associated with them, how to avoid the pitfalls of these challenges, and why companies need to get it right by enforcing security and privacy consistently across all applications.

Penghui Li is a software engineer at Zhaopin and an Apache Pulsar committer.

Presentations

Zhaopin simplifies stream processing using Pulsar Functions and SQL Session

Penghui Li and Neng Lu walk you through building an event streaming platform based on Apache Pulsar and simplifying a stream processing pipeline with Pulsar Functions, Pulsar Schema, and Pulsar SQL.
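For context, a Pulsar function is a small callback invoked once per message on an input topic, with the return value published to an output topic. In the Python SDK it can be as simple as the following illustrative function (not Zhaopin's code):

from pulsar import Function

class EnrichEvent(Function):
    # Runs once for every message consumed from the function's input topic;
    # whatever is returned is published to the configured output topic.
    def process(self, input, context):
        context.get_logger().info("processing one event")
        return input.upper()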

Julie Lockner is the director of portfolio optimization, DataOps, and customer experience for IBM Data and AI. Julie is responsible for executing on the Data and AI offering strategy, leading DataOps internally to support data-driven decisions, ensuring the best customer experience, and leading the Data and AI division's entire portfolio offering management operations. Prior to IBM, Julie led global data platform product and partner programs for InterSystems, a data management healthcare company. Before that, she was vice president of global market development and product marketing at Informatica, where she also led global go-to-market for its data security and archive business unit. Previously, she was an industry analyst and vice president at Enterprise Strategy Group (ESG) and founder and CEO of CentricInfo, a data management and governance consulting firm specializing in implementing data governance programs for Fortune 1000 firms. She has held sales, management, and technology roles at EMC, Oracle, and several startups. She is the chairman of 17Minds, a startup focusing on wearable devices and therapeutic plans for children with special needs, and is an advisor for several startups in the data management and wearable device markets. Julie holds an MBA from MIT Sloan and a BSEE from Worcester Polytechnic Institute.

Presentations

Introduction to DataOps in Action - Accelerating Business Value through Automated and Governed Data Pipelines (Sponsored by IBM) Session

DataOps is an emerging and highly successful data management practice that focuses on ensuring high-quality data is available to consumers quickly and consistently, while ensuring data privacy and governance policies are met.

Shondria Lopez-Merlos is a data specialist for the Florida Conference of The United Methodist Church. After making a suggestion in a meeting, Shondria was challenged to learn more about coding and automation. She subsequently taught herself Python and has begun learning HTML/CSS, SQL, and VBA. Shondria is a former O’Reilly Scholarship recipient. Additionally, she’s a member of Women Who Code and Women in STEAM.

Presentations

Start simple: Coding and automation at nonprofits and small businesses Data Case Studies

Small data teams that work for small businesses or nonprofits often want to use programming and automation but don't know where to start. Shondria Lopez-Merlos explores how to learn simple Python programs and incorporate them to help streamline workflow and, hopefully, lead to additional, increasingly complex projects.

Grace Lu is a software engineer on the data platform team at Robinhood.

Presentations

Usability first: The evolution of Robinhood’s data platform Data Case Studies

The data platform at Robinhood has evolved considerably as the scale of its data and its needs have evolved. Shuo Xiang and Grace Lu share the stories behind the evolution of its platform, how it aligns with the company's business use cases, and the challenges encountered and lessons learned.

Neng Lu is a staff software engineer at StreamNative, where he drives the development of Apache Pulsar and its integrations with the big data ecosystem. Previously, he was a senior software engineer at Twitter, where he was a core committer to the Heron project and the lead engineer for Heron development; he also worked on Twitter’s monitoring and key-value storage systems. Before joining Twitter, he earned a master’s degree from UCLA and a bachelor’s degree from Zhejiang University.

Presentations

Zhaopin simplifies stream processing using Pulsar Functions and SQL Session

Penghui Li and Neng Lu walk you through building an event streaming platform based on Apache Pulsar and simplifying a stream processing pipeline with Pulsar Functions, Pulsar Schema, and Pulsar SQL.

Boris Lublinsky is a principal architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Previously, he was responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural road maps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He’s also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Model governance Tutorial

Machine learning (ML) models are data, which means they require the same data governance considerations as the rest of your data. Boris Lublinsky and Dean Wampler outline metadata management for model serving and explain what information about running systems you need and why it's important. You'll also learn how Apache Atlas can be used for storing and managing this information.

Understanding data governance for machine learning models Session

Production deployment of machine learning (ML) models requires data governance because models are data. Dean Wampler and Boris Lublinsky justify that claim and explore its implications and techniques for satisfying the requirements. Using motivating examples, you'll explore reproducibility, security, traceability, and auditing, plus some unique characteristics of models in production settings.

Willy Lulciuc is a data engineer on the project Marquez team at WeWork in San Francisco, where he and his team make datasets discoverable and meaningful. Previously, he worked on the real-time streaming data platform powering BounceX, and before that, designed and scaled sensor data streams at Canary.

Presentations

Data lineage with Apache Airflow using Marquez Interactive session

Willy Lulciuc explains how lineage metadata in conjunction with a data catalog helps improve the overall quality of data. You'll dive into complex inter-DAGs dependencies in Airflow and get a hands-on introduction to data lineage using Marquez. You'll also develop strong debugging techniques and learn how to effectively apply them.

Anand Madhavan is the vice president of engineering at Narvar. Previously, he was head of engineering for the Discover product at Snapchat and director of engineering at Twitter, where he worked on building out the ad serving system for Twitter Ads. He earned an MS in computer science from Stanford University.

Presentations

Using Apache Pulsar functions for data workflows at Narvar Session

Narvar originally used a large collection of point technologies such as AWS Kinesis, Lambda, and Apache Kafka to satisfy its requirements for pub/sub messaging, message queuing, logging, and processing. Karthik Ramasamy and Anand Madhavan walk you through how Narvar moved away from this slew of technologies and consolidated its use cases using Apache Pulsar.

Shailendra Maktedar is a senior engineering manager on the data analytics team at Kohl’s, where he’s leading the transformation of the legacy marketing analytics and personalization platform to the cloud, specifically the Google Cloud Platform (GCP). Shailendra’s passion is applying big data analytics to drive engineering innovation that solves challenging business problems. He’s a seasoned leader with 15 years of industry experience building and leading teams of talented data architects, engineers, and product managers. He has a proven track record of building scalable data platforms and large-scale data products, improving operational process efficiency, and enhancing user experience. Shailendra is a hands-on practitioner with excellent communication skills. He loves to play volleyball and spend time with his kids when not thinking about GCP.

Presentations

Revitalizing Kohl’s marketing and experience personalized ecosystem on Google Cloud Platform (sponsored by GSPANN) Session

Praveen Chandra and Shailendra Maktedar describe the challenges Kohl's faced with its legacy marketing analytics platform and how it leveraged Google Cloud Platform (GCP) and BigQuery to provide better and more consistent customer insights to the marketing analytics business team.

Suneeta Mall is a senior data scientist at Nearmap, where she leads the engineering efforts of the Artificial Intelligence Division. She led the efforts of migrating Nearmap’s backend services to Kubernetes. In her 12 years of software industry experience, she’s worked on solving a variety of challenging technical and business problems in the fields of big data, machine learning, GIS, travel, DevOps, and telecommunications. She earned her PhD from the University of Sydney and a bachelor’s degree in computer science and engineering.

Presentations

Deep learning meets Kubernetes: Massively parallel inference pipelines Session

Using Kubernetes as the backbone of AI infrastructure, Nearmap built a fully automated deep learning inference pipeline that's highly resilient, scalable, and massively parallel. Using this system, Nearmap ran semantic segmentation over tens of quadrillions of pixels. Suneeta Mall demonstrates the solution by using Kubernetes in big data crunching and machine learning at scale.

Katie Malone is director of data science at data science software and services company Civis Analytics, where she leads a team of diverse data scientists who serve as technical and methodological advisors to the Civis consulting team and write the core machine learning and data science software that underpins the Civis Data Science Platform. Previously, she worked at CERN on Higgs boson searches and was the instructor of Udacity’s Introduction to Machine Learning course. Katie hosts Linear Digressions, a weekly podcast on data science and machine learning. She holds a PhD in physics from Stanford.

Presentations

The care and feeding of data scientists Session

Data science is relatively young, and the job of managing data scientists is younger still. Many people undertake this management position without the tools, mentorship, or role models they need to do it well. Katie Malone and Michelangelo D'Agostino review key themes from a recent Strata report that examines the steps necessary to build, manage, sustain, and retain a growing data science team.

Sukanya Mandal is a data scientist at Capgemini. She has extensive experience building various solutions with IoT and most enjoys working at the intersection of IoT and data science. She also leads the PyData Mumbai and PyLadies Mumbai chapters. Besides work and community efforts, she loves to explore new tech and pursue research. She’s published a couple of white papers with IEEE, and a couple more are in the pipeline.

Presentations

Machine learning on resource-constrained IoT edge devices Session

Heavy ML computation on resource-constrained IoT devices is a challenge: IoT demands near-zero latency, high bandwidth, continuous and seamless availability, and privacy, and the right infrastructure drives the right ROI. This is where the edge and the cloud come in. Sukanya Mandal explains how training ML models in the cloud and running inference at the edge makes many IoT use cases viable.
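One common way to realize the train-in-the-cloud, infer-at-the-edge split is to convert a trained model into a compact on-device format. A TensorFlow Lite sketch (illustrative only; the session isn't tied to a specific toolchain):

import tensorflow as tf

# Define (and, in practice, train) a model in the cloud...
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# ...then convert it to TensorFlow Lite for inference on a constrained edge device
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
with open("model.tflite", "wb") as f:
    f.write(converter.convert())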

Nayad Manukian is a member of the Commercial Data Sciences team in the Technology organization of Janssen Pharmaceuticals. Nayad solves problems across the Pharma value chain using disruptive methodologies and technologies, including machine learning, natural language processing, and graph analytics.

Presentations

Industrializing machine learning: Use Case, Challenges & Learnings (Sponsored by Dataiku) Session

Establishing best practices to enable data science solutions at scale can be difficult in highly matrixed environments. Join us to learn about the evolution at Janssen US and the framework used for industrializing advanced analytics capabilities for applications such as predictive targeting.

Jaya Mathew is a senior data scientist on the artificial intelligence and research team at Microsoft, where she focuses on the deployment of AI and ML solutions to solve real business problems for customers in multiple domains. Previously, she worked on analytics and machine learning at Nokia and Hewlett Packard Enterprise. Jaya holds an undergraduate degree in mathematics and a graduate degree in statistics from the University of Texas at Austin.

Presentations

Machine translation helps break our language barrier Session

With the need to cater to a global audience, there's a growing demand for applications to support speech identification, translation, and transliteration from one language to another. Jaya Mathew explores this topic and how to quickly use some of the readily available APIs to identify, translate, or even transliterate speech or text within your application.
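As one example of such a readily available API, a REST translation call can be this small. The sketch below assumes the Azure Translator v3 endpoint and a valid subscription key (other providers look similar; depending on your resource, a region header may also be required):

import requests

endpoint = "https://api.cognitive.microsofttranslator.com/translate"
headers = {"Ocp-Apim-Subscription-Key": "<YOUR_KEY>", "Content-Type": "application/json"}
params = {"api-version": "3.0", "to": ["fr", "de"]}
body = [{"text": "Machine translation helps break our language barrier."}]

response = requests.post(endpoint, params=params, headers=headers, json=body)
for translation in response.json()[0]["translations"]:
    print(translation["to"], "->", translation["text"])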

Kevin McFaul, global product management, IBM Analytics and AI Applications, has been a key driver in the evolution of IBM’s analytics portfolio, applying AI to transform core business domains. His commitment to improving the business user experience is evident in his championing of self-learning, intelligently automated AI workflows that empower users to communicate their data and findings in a more effective and engaging manner. He currently manages the product team responsible for Planning Analytics and Cognos Analytics, with the goal of combining analytics with budgeting and forecasting capabilities to drive better business outcomes by infusing AI into the organization.

Presentations

AI-Infused Planning, Forecasting and Business Intelligence (Sponsored by IBM) Session

Learn how Planning Analytics and Cognos Analytics leverage AI and machine learning to build more accurate plans and forecasts, and discover hidden insights that drive better business decisions.

Andrew Melo is a research professor of physics and a big data application developer at Vanderbilt University. He’s spent the last decade developing and implementing large-scale data workflows for the Large Hadron Collider. Recently his focus has been reimplementing these physics workflows with Apache Spark.

Presentations

Data engineering at the Large Hadron Collider Session

Building a data engineering pipeline for serving segments of a 200 PB dataset to particle physicists around the globe poses many challenges, some unique to high energy physics and some common to big science projects across disciplines. Ben Galewsky, Lindsey Gray, and Andrew Melo highlight how much of this work can inform industry data science at scale.

Rashmina Menon is a senior data engineer at GumGum, a computer vision company. She’s passionate about building distributed and scalable systems and end-to-end data pipelines that provide visibility to meaningful data through machine learning and reporting applications.

Presentations

Real-time forecasting at scale using Delta Lake Session

GumGum receives 30 billion programmatic inventory impressions amounting to 25 TB of data per day. By generating near-real-time inventory forecasts based on campaign-specific targeting rules, GumGum enables users to set up successful future campaigns. Rashmina Menon and Jatinder Assi highlight the architecture that enables forecasting in less than 30 seconds with Delta Lake and Databricks Delta caching.

John Mertic is the director of program management for the Linux Foundation. Under his leadership, he’s helped ASWF, ODPi, Open Mainframe Project, and R Consortium accelerate open source innovation and transform industries. John has an open source career spanning two decades, both as a contributor to projects such as SugarCRM and PHP and in open source leadership roles at SugarCRM, OW2, and OpenSocial. With an extensive open source background, he’s a regular speaker at various Linux Foundation and other industry trade shows each year. John’s also an avid writer and has authored two books, The Definitive Guide to SugarCRM: Better Business Applications and Building on SugarCRM, as well as published articles on IBM developerWorks, Apple Developer Connection, and PHP Architect.

Presentations

Creating an ecosystem on data governance in the ODPi Egeria project Session

Building on its success at establishing standards in the Apache Hadoop data platform, the ODPi (Linux Foundation) turns its focus to the next big data challenge—enabling metadata management and governance at scale across the enterprise. Mandy Chessell and John Mertic discuss how the ODPi's guidance on governance (GoG) aims to create an open data governance ecosystem.

Minal Mishra is an engineering manager at Netflix, where he’s part of an effort to improve the software delivery of Netflix’s streaming player. Previously, he was with Xbox Live ecommerce and music and video services teams at Microsoft. Outside work, he enjoys playing tennis.

Presentations

Data powering frequent updates of Netflix's video player Session

Minal Mishra walks you through Netflix's video player release process, the challenges with deriving time series metrics from a firehose of events, and some of the oddities in running analysis on real-time metrics.

Sanjeev Mohan leads big data research for technical professionals at Gartner, where he researches trends and technologies for relational and NoSQL databases, object stores, and cloud databases. His areas of expertise span the end-to-end data pipeline, including ingestion, persistence, integration, transformation, and advanced analytics. Sanjeev is a well-respected speaker on big data and data governance. His research includes machine learning and the IoT. He also serves on a panel of judges for many Hadoop distribution organizations, such as Cloudera and Hortonworks.

Presentations

Modern data management essentials for hybrid multicloud journey Session

The accelerating migration of workloads to the cloud isn’t a binary journey: some workloads will remain on-premises, and some will run on multiple cloud providers. Sanjeev Mohan identifies key data and analytics considerations in modern data architectures, including strategies to handle data latency, data gravity, ingress transformation, compliance and governance needs, and data orchestration.

Piero Molino is a cofounder and senior research scientist at Uber AI Labs, where he works on natural language understanding and conversational AI. He’s the author of the open source platform Ludwig, a code-free deep learning toolbox.

Presentations

Enhance recommendations in Uber Eats with graph convolutional networks Session

Ankit Jain and Piero Molino detail how to generate better restaurant and dish recommendations in Uber Eats by learning entity embeddings using graph convolutional networks implemented in TensorFlow.

Keith Moore is the director of product management at SparkCognition and is responsible for the development of the Darwin automated model building project. He specializes in applying advanced data science and natural language processing algorithms to complex datasets. Previously, he was with test and measurement giant National Instruments as an analog-to-digital converter and vibration software product manager and developed client software solutions for major oil and gas, aerospace, and semiconductor organizations. He served as a board member of Pi Kappa Phi fraternity, is still a volunteer on the alumni engagement committee, and volunteers as the president of the Austin Volunteers Alumni Club. Keith earned a degree in mechanical engineering, summa cum laude, from the University of Tennessee.

Presentations

Neuroevolution-based automated model building: Create better models Session

AutoML brings acceleration and democratization of data science, but in the game of accuracy and flexibility, using predefined blueprints to find adequate algorithms falls short. Carlos Pazos and Keith Moore shine a spotlight on a neuroevolutionary approach to AutoML to custom build novel, sophisticated neural networks that perfectly represent the relationships in your dataset.

Philipp Moritz is a PhD candidate in the Electrical Engineering and Computer Sciences (EECS) Department at the University of California, Berkeley, with broad interests in artificial intelligence, machine learning, and distributed systems. He’s a member of the Statistical AI Lab and the RISELab.

Presentations

Using Ray to scale Python and machine learning Tutorial

There's no easy way to scale up Python applications to the cloud. Ray is an open source framework for parallel and distributed computing, making it easy to program and analyze data at any scale by providing general-purpose high-performance primitives. Robert Nishihara, Ion Stoica, and Philipp Moritz demonstrate how to use Ray to scale up Python applications, data processing, and machine learning.
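
For context on the programming model the tutorial covers, here is a minimal, illustrative Ray sketch (not taken from the tutorial materials): an ordinary Python function becomes a parallel task via the @ray.remote decorator, and results are gathered with ray.get.

    import ray

    ray.init()  # start a local Ray runtime; on a cluster you would pass the head node's address

    @ray.remote
    def square(x):
        # A plain Python function, now schedulable as a distributed task
        return x * x

    # Launch eight tasks in parallel and block until all results are ready
    futures = [square.remote(i) for i in range(8)]
    print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]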

Barr Moses is cofounder and CEO at Monte Carlo. Previously, she was VP at Gainsight (an enterprise customer data platform) where she helped scale the company 10x in revenue and worked with hundreds of clients on delivering reliable data, a management consultant at Bain & Company, and a research assistant at the Statistics Department at Stanford University. She also served in the Israeli Air Force as a commander of an intelligence data analyst unit. Barr graduated from Stanford with a BSc in mathematical and computational science.

Presentations

Introducing data downtime: From firefighting to winning Session

Ever had your CEO or customer look at your report and tell you the numbers look way off? Barr Moses defines data downtime—periods of time when your data is partial, erroneous, missing, or otherwise inaccurate. Data downtime is highly costly for organizations, yet is often addressed ad hoc. You'll explore why data downtime matters to the data industry and how best-in-class teams address it.

Nisha Muktewar is a research engineer at Cloudera Fast Forward Labs, an applied machine intelligence research and advising group within Cloudera. She works with organizations to help build data science solutions and spends time researching new tools, techniques, and libraries in this space. Previously, she was a manager in Deloitte’s actuarial, advanced analytics, and modeling practice, leading teams in designing, building, and implementing predictive modeling solutions for pricing, consumer behavior, marketing mix, and customer segmentation use cases for insurance, retail, and consumer businesses.

Presentations

Deep learning for anomaly detection Session

In many business use cases, it's frequently desirable to automatically identify and respond to abnormal data. This process can be challenging, especially when working with high-dimensional, multivariate data. Nisha Muktewar and Victor Dibia explore deep learning approaches (sequence models, VAEs, GANs) for anomaly detection, performance benchmarks, and product possibilities.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Building a cloud data lake: Ingest, process & analyze big data on AWS Session

Data lakes are hot again; with S3 from AWS as the data lake storage, the modern data lake architecture separates compute from storage. You can choose from a variety of elastic, scalable, and cost-efficient technologies when designing a cloud data lake. Tomer Shiran and Jacques Nadeau share best practices for building a data lake on AWS, as well as various services and open source building blocks.

Fastest data lake: Comparing Azure, Google, and AWS storage Session

Jacques Nadeau leads a deep dive into important considerations when choosing between data lake storage options—speed, cost, and consistency. You'll learn about these differences and how caching and ephemeral storage can affect these trade-offs. Jacques demonstrates technologies that improve analytical experience by compensating for slow reads.

Moin Nadeem is an undergraduate at MIT, where he studies computer science with a minor in negotiations. His research broadly studies applications of natural language processing. Most recently, he performed an extensive study on bias in language models, culminating in the release of the world’s largest dataset on bias in NLP. Previously, he cofounded the Machine Intelligence Community at MIT, which aims to democratize machine learning among undergraduates on campus, and he received the Best Undergraduate Paper award at MIT.

Presentations

How biased is your natural language model? Assessing fairness in NLP Session

Real-world data is highly biased, yet we still train AI models on it. This leads to models that can be highly offensive and discriminatory. For instance, models have learned that male engineers are preferable and therefore discriminate when used in hiring. Moin Nadeem explores how to assess the social biases that popular models exhibit and how to leverage this to create a fairer model.

Catherine Nelson is a senior data scientist for Concur Labs at SAP Concur, where she explores innovative ways to use machine learning to improve the experience of a business traveller. She’s particularly interested in privacy-preserving ML and applying deep learning to enterprise data. Previously, she was a geophysicist and studied ancient volcanoes and explored for oil in Greenland. Catherine has a PhD in geophysics from Durham University and a master’s degree in earth sciences from Oxford University.

Presentations

Analyzing and deploying your machine learning model Tutorial

Most deep learning models never get analyzed, validated, or deployed. Catherine Nelson and Hannes Hapke explain the necessary steps to release machine learning models for real-world applications. You'll view an example project using the TensorFlow ecosystem, focusing on how to analyze models and deploy them efficiently.

Getting the most out of your AI projects with model feedback loops Session

Measuring your machine learning model’s performance is key for every successful data science project. Therefore, model feedback loops are essential to capture feedback from users and expand your model’s training dataset. Hannes Hapke and Catherine Nelson explore the concept of model feedback and guide you through a framework for increasing the ROI of your data science project.

Joseph Nelson is a cofounder and machine learning engineer at Roboflow, a tool for accelerating computer vision model development. Previously, he cofounded Represently, “the Zendesk for Congress,” reducing the time the US House of Representatives takes to respond to constituent messages with natural language processing (NLP). He’s taught over 2,000 hours of data science instruction in Python with General Assembly and the George Washington University. Joseph is dedicated to making machine learning more accessible. He’s from Iowa.

Presentations

Using TensorFlow Lite for computer vision Interactive session

Computer vision gives you the ability to make anything in the real world read/write from your phone. Joseph Nelson walks you through the end-to-end flow required to train a model for mobile deployment, including image collection, preprocessing and augmenting considerations, model training, and saving the TensorFlow Lite (TFLite) model in an appropriate format for deployment.
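
As a rough illustration of the final step the session describes, this sketch (not from the session materials) converts a trained Keras model to a .tflite file; the MobileNetV2 model here is just a stand-in for whatever model you train.

    import tensorflow as tf

    # Stand-in for a model trained on your own images
    model = tf.keras.applications.MobileNetV2(weights="imagenet")

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional post-training quantization
    tflite_model = converter.convert()

    # The resulting flat buffer can be bundled with an Android or iOS app
    with open("model.tflite", "wb") as f:
        f.write(tflite_model)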

Alexander Ng is a director of infrastructure and DevOps at Manifold. Previously, he had a stint as an engineer and technical lead doing DevOps at Kyruus and engineering work for the Navy. He holds a BS in electrical engineering from Boston University.

Presentations

Efficient ML engineering: Tools and best practices Tutorial

ML engineers work at the intersection of data science and software engineering—that is, MLOps. Sourav Dey and Alex Ng highlight the six steps of the Lean AI process and explain how it helps ML engineers work as an integrated part of development and production teams. You'll go hands-on with real-world data so you can get up and running seamlessly.

Dave Nielsen is the head of community and ecosystem programs at Redis Labs and the cofounder of CloudCamp, a series of unconferences about cloud computing. Over his 19-year career, he’s been a web developer, systems architect, technical trainer, developer evangelist, and startup entrepreneur. Dave resides in Mountain View with his wife, Erika, to whom he proposed in his coauthored book PayPal Hacks.

Presentations

Redis plus Spark Structured Streaming: The perfect way to scale out your continuous app Session

Redis Streams enables you to collect data in a time series format while matching the data processing rate of your continuous application. Apache Spark’s Structured Streaming API enables real-time decision making for your continuous data. Dave Nielsen demonstrates how to integrate open source Redis with Apache Spark’s Structured Streaming API using the Spark-Redis library.
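
To give a feel for the integration pattern the session demonstrates, here is a minimal sketch, assuming the spark-redis package is on the classpath and producers are writing a Redis stream named "clicks" with asset and cost fields (all names here are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = (SparkSession.builder
             .appName("redis-structured-streaming")
             .config("spark.redis.host", "localhost")
             .config("spark.redis.port", "6379")
             .getOrCreate())

    schema = StructType([StructField("asset", StringType()),
                         StructField("cost", DoubleType())])

    # Read the Redis stream "clicks" as a streaming DataFrame
    clicks = (spark.readStream
              .format("redis")
              .option("stream.keys", "clicks")
              .schema(schema)
              .load())

    # Continuously count events per asset and print the running totals
    query = (clicks.groupBy("asset").count()
             .writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()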

Robert Nishihara is a fourth-year PhD student working with Michael Jordan in the RISELab at the University of California, Berkeley. He works on machine learning, optimization, and artificial intelligence.

Presentations

Using Ray to scale Python and machine learning Tutorial

There's no easy way to scale up Python applications to the cloud. Ray is an open source framework for parallel and distributed computing, making it easy to program and analyze data at any scale by providing general-purpose high-performance primitives. Robert Nishihara, Ion Stoica, and Philipp Moritz demonstrate how to use Ray to scale up Python applications, data processing, and machine learning.

Tim Nugent pretends to be a mobile app developer, game designer, tools builder, researcher, and tech author. When he isn’t busy avoiding being found out as a fraud, Tim spends most of his time designing and creating little apps and games he won’t let anyone see. He also spent a disproportionately long time writing his tiny little bio, most of which was taken up trying to stick a witty sci-fi reference in…before he simply gave up. He’s writing Practical Artificial Intelligence with Swift for O’Reilly and building a game for a power transmission company about a naughty quoll. (A quoll is an Australian animal.)

Presentations

Swift for TensorFlow in 3 hours Tutorial

Mars Geldard, Tim Nugent, and Paris Buttfield-Addison are here to prove Swift isn't just for app developers. Swift for TensorFlow provides the power of TensorFlow with all the advantages of Python (and complete access to Python libraries) and Swift—the safe, fast, incredibly capable open source programming language; Swift for TensorFlow is the perfect way to learn deep learning and Swift.

Kalvin Ogbuefi is a data scientist at the Children’s Hospital Los Angeles (CHLA). Previously, he was a project assistant at the USC Stevens Neuroimaging and Informatics Institute in Marina del Rey, working on radiology image analysis. His extensive research experience comprises projects in deep learning, statistical modeling, and computer simulations at Lawrence Livermore National Laboratory and other major research institutions. He earned an MS in applied statistics from California State University, Long Beach, and a BS in applied mathematics from the University of California, Merced.

Presentations

Semisupervised AI for automated categorization of medical images Session

Annotating radiological images by category at scale is a critical step for analytical ML. Supervised learning is challenging because image metadata doesn't reliably identify image content, and manually labeling images for AI algorithms isn't feasible. Stephan Erberich, Kalvin Ogbuefi, and Long Ho share an approach for automated categorization of radiological images based on content category.

Patryk Oleniuk is a data engineer at Virgin Hyperloop One, a company building the fifth mode of transportation. Previously, he was at CERN, where he wrote test software for the world’s biggest particle accelerator, as well as at National Instruments and Samsung R&D. He graduated from EPFL (the Swiss Federal Institute of Technology in Lausanne) with an information technologies major. When he isn’t glued to a computer screen, he spends time road-tripping California with his friends.

Presentations

Flexible and fast simulation analytics in Hyperloop, a growing company Data Case Studies

To substantiate the key business and safety propositions necessary to establish a new mode of transportation, Virgin Hyperloop One (VHO) implemented a complex, large-scale, and highly configurable simulation. Each simulation run needs to be analyzed and assessed on several KPIs. Sandhya Raghavan and Patryk Oleniuk highlight how VHO successfully reduced the time to insight from days to hours.

Modern ML architecture: Predicting hourly rider demand for Hyperloop Session

Patryk Oleniuk and Sandhya Raghavan investigate how to use demand data to improve the design of the fifth mode of transport, Hyperloop. They discuss the passenger demand prediction methods and the tech stack (Spark, Koalas, Keras, MLflow) used to build a deep neural network (DNN)-based near-future demand prediction for simulation purposes.

Aliaksandr Padvitselski (he/him) is a machine learning engineer at SurveyMonkey, where he works on building the machine learning platform and helping to integrate machine learning systems into SurveyMonkey’s products. He’s worked on a variety of projects related to the data business and personalization at SurveyMonkey. Previously, he worked mostly in the finance industry, contributing to backend services and building a data warehouse for BI systems.

Presentations

Accelerate your organization: Make data optimal for ML Session

Every organization leverages ML to increase value to customers and understand its business. You may have created models, but now you need to scale. Shubhankar Jain, Aliaksandr Padvitselski, and Jin Yang use a case study to teach you how to pinpoint inefficiencies in your ML data flow, how SurveyMonkey tackled this, and how to make your data more usable to accelerate ML model development.

Deepak Pai is a manager of AI and machine learning core services at Adobe, where he manages a team of data scientists and engineers developing core ML services. The services are used by various Adobe Sensei services that are part of the Experience Cloud. He holds master’s and bachelor’s degrees in computer science from a leading university in India. He’s published papers in top peer-reviewed conferences and has been granted patents.

Presentations

A graph neural network approach for time evolving fraud networks Data Case Studies

Sriram Ravindran, Deepak Pai, and Shubranshu Shekhar discuss developing a fraud detection model using state-of-the-art graph neural networks. This model can be used to detect card testing, trial abuse, seat addition, etc.

CrEIO: Critical Events Identification for Online purchase funnel Session

Identifying customer stages in a buying cycle enables you to perform personalized targeting based on the stage. Shankar Venkitachalam, Megahanath Macha Yadagiri, and Deepak Pai identify ML techniques to analyze a customer's clickstream behavior to find the different stages of the buying cycle and quantify the critical click events that help transition a user from one stage to another.

Subhankar Pal is an assistant vice president (AVP) of technology at Altran and has close to 20 years of professional experience in the IT and telecommunications industry. Altran is a global leader in engineering and R&D services, with a presence in more than 30 countries. At Altran, he’s part of the Research & Innovation organization, and his responsibilities include technology product incubation, consulting, solution creation, and service offering definition in the area of machine learning and artificial intelligence for the IoT and telecommunications markets. The products he has incubated have won top global awards, most recently “Best AI/ML Application” in the “Innovation & Technology” category of the Layer123 Network Transformation Awards, the industry’s premier awards for global leadership and achievement in advancing the industry to the next generation of networks. Prior to joining Altran, he worked at Nokia Networks and C-DOT. Subhankar has extensive experience speaking at international conferences and presenting technical papers in various forums, and he has whitepapers published with IEEE and at other international events.

Presentations

Realizing an intelligent mobile network with Intel-optimized advanced machine learning Intel® AI Builders Showcase

Subhankar Pal presents a predictive 4G/LTE radio network health analytics solution that applies advanced deep learning models to channel quality prediction.

Ramesh Panuganty is the founder and CEO of MachEye. He’s a creative technology pioneer (12 patents and several publications) and entrepreneur (he launched and exited three startups). His projects include SelectQ, an ed-tech platform that generates SAT questions on the fly using AI and NLG, ratcheting up complexity until the student is fully prepared; Drastin (acquired by Splunk in 2017), where he created “conversational analysis” as a new BI market category and was recognized among the top five AI platforms by Gartner; and Cloud360 Hyperplatform (acquired by Cognizant in 2012), where he created “cloud management platforms” as a new market category and built a multimillion-dollar ARR business.

Presentations

3 steps to implement AI architecture for autonomous intelligence (sponsored by MachEye) Session

Users don't speak SQL and data doesn't speak English. It's time to bridge the gap. Ramesh Panuganty details teaching machines how to tell data stories, humanizing UX through interactive audiovisuals, and leveraging ML to automatically surface and deliver insights. You'll learn from industry experts how the largest energy drink manufacturer and student loan company solve these challenges.

Carlos Pazos is a senior product marketing manager at SparkCognition, responsible for automated model building and natural language processing solutions. He specializes in the real-world implementation of AI-based technologies for the oil and gas, utilities, aerospace, finance, and defense sectors.

Previously, Carlos worked for National Instruments as an IIoT embedded software and distributed systems product marketing manager, specializing in real-time systems, heterogeneous computing architectures, industrial communication protocols, and analytics at the edge.

Presentations

Neuroevolution-based automated model building: Create better models Session

AutoML brings acceleration and democratization of data science, but in the game of accuracy and flexibility, using predefined blueprints to find adequate algorithms falls short. Carlos Pazos and Keith Moore shine a spotlight on a neuroevolutionary approach to AutoML to custom build novel, sophisticated neural networks that perfectly represent the relationships in your dataset.

Nick Pentreath is a principal engineer at the Center for Open Source Data & AI Technologies (CODAIT) at IBM, where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations, and was at Goldman Sachs, Cognitive Match, and Mxit. He’s a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Deploying end-to-end deep learning pipelines with ONNX Session

The common perception of deep learning is that it results in a fully self-contained model. However, in most cases, these models have similar requirements for data preprocessing as the more "traditional" machine learning. Despite this, there are few standard solutions for deploying end-to-end deep learning. Nick Pentreath explores how the ONNX format and ecosystem addresses this challenge.

Alexander Pierce is a field engineer at Pepperdata.

Presentations

Autoscaling big data operations in the cloud Session

Alex Pierce evaluates Amazon Elastic MapReduce (EMR), Azure HDInsight, and Google Cloud Dataproc, offerings from three leading cloud service providers, with respect to Hadoop and big data autoscaling capabilities and offers guidance to help you determine the flavor of autoscaling that best fits your business needs.

Nick Pinckernell is a senior research engineer on the applied AI research team at Comcast, where he works on ML platforms for model serving and feature pipelining. He has focused on software development, big data, distributed computing, and telecommunications research for many years. He’s pursuing an MS in computer science at the University of Illinois at Urbana-Champaign, and when he’s free, he enjoys IoT.

Presentations

Feature engineering pipelines five ways with Kafka, Redis, Spark, Dask, Airflow, and more Session

With model serving becoming easier thanks to tools like Kubeflow, the focus is shifting to feature engineering. Nick Pinckernell reviews five ways to get your raw data into engineered features (and eventually to your model) with open source tools, flexible components, and various architectures.

Arvind Prabhakar is cofounder and CTO of StreamSets, provider of the industry’s first DataOps platform for modern data integration. He’s an Apache Software Foundation member and a PMC member on Flume, Sqoop, Storm, and MetaModel projects. Previously, Arvind held many roles at Cloudera, ranging from software engineer to director of engineering.

Presentations

Deploying DataOps for analytics agility Session

DataOps helps enterprises improve their business and drive future revenue streams and competitive differentiation, which is why so many businesses are rethinking their data strategy. Arvind Prabhakar explains how DataOps addresses the problems that come with managing data movement at scale.

Sandhya Raghavan is a senior data engineer at Virgin Hyperloop One, where she helps build the data analytics platform for the organization. She has 13 years of experience working with leading organizations to build scalable data architectures, integrating relational and big data technologies. She also has experience implementing large-scale, distributed machine learning algorithms. Sandhya holds a bachelor’s degree in computer science from Anna University, India. When Sandhya isn’t building data pipelines, you can find her traveling the world with her family or pedaling a bike.

Presentations

Flexible and fast simulation analytics in Hyperloop, a growing company Data Case Studies

To substantiate the key business and safety propositions necessary to establish a new mode of transportation, Virgin Hyperloop One (VHO) implemented a complex, large-scale, and highly configurable simulation. Each simulation run needs to be analyzed and assessed on several KPIs. Sandhya Raghavan and Patryk Oleniuk highlight how VHO successfully reduced the time to insight from days to hours.

Modern ML architecture: Predicting hourly rider demand for Hyperloop Session

Patryk Oleniuk and Sandhya Raghavan investigate how to use demand data to improve the design of the fifth mode of transport, Hyperloop. They discuss the passenger demand prediction methods and the tech stack (Spark, Koalas, Keras, MLflow) used to build a deep neural network (DNN)-based near-future demand prediction for simulation purposes.

Anand Raman is the chief of staff for the AI CTO office at Microsoft. Previously, he was the chief of staff for the Microsoft Azure Data Group, covering data platforms and machine learning, and ran product management and development for Azure Data Services as well as the Visual Studio and Windows Server user experience teams; he also worked for several years as a researcher before joining Microsoft. Anand holds a PhD in computational fluid mechanics.

Presentations

Anomaly detection algorithm inspired by computer vision and RL Session

Anomaly detection may sound old-fashioned, but it's super important in many industrial applications. Tony Xing and Anand Raman outline a novel anomaly detection algorithm based on spectral residual (SR) and convolutional neural networks (CNNs) and how this novel method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.
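
The spectral residual idea, borrowed from visual saliency detection, can be sketched in a few lines of NumPy (a simplified illustration of the general technique, not Microsoft's production code): each point's saliency comes from the gap between the log amplitude spectrum and its local average.

    import numpy as np

    def spectral_residual_saliency(x, window=3):
        # Fourier transform of the time series
        fft = np.fft.fft(x)
        amplitude = np.abs(fft)
        phase = np.angle(fft)

        # Spectral residual: log amplitude minus its moving average
        log_amp = np.log(amplitude + 1e-8)
        avg_log_amp = np.convolve(log_amp, np.ones(window) / window, mode="same")
        residual = log_amp - avg_log_amp

        # Back to the time domain; large values mark salient (potentially anomalous) points
        return np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))

    series = np.sin(np.linspace(0, 20, 200))
    series[120] += 5.0  # inject a spike
    print(spectral_residual_saliency(series).argmax())  # index of the most salient point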

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); worked briefly on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper. He’s the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin–Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. You'll get an overview of the inception and growth of the serverless paradigm and explore Apache Pulsar, which provides native serverless support in the form of Pulsar functions.

Using Apache Pulsar functions for data workflows at Narvar Session

Narvar originally used a large collection of point technologies such as AWS Kinesis, Lambda, and Apache Kafka to satisfy its requirements for pub/sub messaging, message queuing, logging, and processing. Karthik Ramasamy and Anand Madhavan walk you through how Narvar moved away from using a slew of technologies and consolidated its use cases using Apache Pulsar.

Anand Rao is a partner in PwC’s Advisory Practice and the innovation lead for the Data and Analytics Group, where he leads the design and deployment of artificial intelligence and other advanced analytical techniques and decision support systems for clients, including natural language processing, text mining, social listening, speech and video analytics, machine learning, deep learning, intelligent agents, and simulation. Anand is also responsible for open source software tools related to Apache Hadoop and packages built on top of Python and R for advanced analytics; research and commercial relationships with academic institutions and startups; and the research, development, and commercialization of innovative AI, big data, and analytic techniques. Previously, Anand was the chief research scientist at the Australian Artificial Intelligence Institute; program director for the Center of Intelligent Decision Systems at the University of Melbourne, Australia; and a student fellow at IBM’s T.J. Watson Research Center. He has held a number of board positions at startups and currently serves as a board member for a not-for-profit industry association. Anand has coedited four books and published over 50 papers in refereed journals and conferences. He was awarded the most influential paper award for the decade in 2007 from Autonomous Agents and Multi-Agent Systems (AAMAS) for his work on intelligent agents. He’s a frequent speaker on AI, behavioral economics, autonomous cars and their impact, analytics, and technology topics in academic and trade forums. Anand holds an MSc in computer science from Birla Institute of Technology and Science in India, a PhD in artificial intelligence from the University of Sydney, where he was awarded the university postgraduate research award, and an MBA with distinction from Melbourne Business School.

Presentations

A practical guide to responsible AI: Build robust, secure, and safe AI Session

Join in for a practitioner’s overview of the risks of AI and depiction of responsible AI deployment within an organization. You'll discover how to ensure the safety, security, standardized testing, and governance of systems and how models can be fooled or subverted. Ilana Golbin and Anand Rao illustrate how organizations safeguard AI applications and vendor solutions to mitigate AI risks.

Cool AI for practical problems (sponsored by PwC) Session

While many of the solutions to which AI can be applied involve exciting technologies, AI will arguably have greater transformational impact on more mundane problems. Anand Rao offers an overview of how PwC has developed and applied innovative AI solutions to common, practical problems across several domains, such as tax, accounting, and management consulting.

ML models are not software: Why organizations need dedicated operations to address the b Session

Anand Rao and Joseph Voyles introduce you to the core differences between software and machine learning model life cycles. They demonstrate how AI’s success also limits its scale and detail leading practices for establishing AIOps to overcome limitations by automating CI/CD, supporting continuous learning, and enabling model safety.

Delip Rao is the vice president of research at the AI Foundation, where he leads speech, language, and vision research efforts for generating and detecting artificial content. Previously, he founded the AI research consulting company Joostware and the Fake News Challenge, an initiative to bring AI researchers across the world together to work on fact checking-related problems, and he was at Google and Twitter. Delip is the author of a recent book on deep learning and natural language processing. His attitude toward production NLP research is shaped by the time he spent at Joostware working for enterprise clients, as the first machine learning researcher on the Twitter antispam team, and as an early researcher at Amazon Alexa.

Presentations

Natural language processing with deep learning 2-Day Training

Delip Rao explores natural language processing (NLP) using a set of machine learning techniques known as deep learning. He walks you through neural network architectures and NLP tasks and teaches you how to apply these architectures for those tasks.

Gayathri Rau is a senior manager of analytics product management at Dell Technologies, where she manages a team of product owners, data scientists, and BI developers to build AI and machine learning products to help solve customer and business problems. She has 19+ years in the data and technology field, with experience collaborating with business and technology architecture teams and enabling platform capabilities and innovation on enterprise data platforms. She’s a co-inventor on a patent pending for the innovative use of machine learning models at Dell Technologies.

Presentations

Data science + domain experts = exponentially better products Data Case Studies

To deliver best-in-class data science products, solutions must evolve through partnerships between data scientists and domain experts. Jeffrey Vah and Gayathri Rau detail the product lifecycle journey while integrating business expertise with data scientists and technologists. You'll discover best practices and pitfalls when digitally transforming your business through AI and machine learning.

Nancy Rausch is a senior manager at SAS. Nancy has been involved for many years in the design and development of SAS’s data warehouse and data management products, working closely with customers and authoring a number of papers on SAS data management products and best-practice design principles for data management solutions. She holds an MS in computer engineering from Duke University, where she specialized in statistical signal processing, and a BS in electrical engineering from Michigan Technological University. She has recently returned to college and is pursuing an MS in analytics from Capella University.

Presentations

A study of bees: Using AI and art to tell a data story Session

For data to be meaningful, it needs to be presented in a way people can relate to. Nancy Rausch explains how SAS combined AI and art to tell a compelling data story, using streaming data from local beehives to forecast hive health. SAS visualized this data in a live-action art sculpture, which helped bring the data to life in a fun and compelling way.

Meghana Ravikumar is a machine learning engineer at SigOpt with a particular focus on novel applications of deep learning across academia and industry. Meghana explores the impact of hyperparameter optimization and other techniques on model performance and evangelizes these practical lessons for the broader machine learning community. Previously, she was in biotech, employing natural language processing to mine and classify biomedical literature. She holds a BS in bioengineering from UC Berkeley. When she’s not reading papers, developing models and tools, or trying to explain complicated topics, she enjoys doing yoga, traveling, and hunting for the perfect chai latte.

Presentations

Optimized image classification on the cheap Session

Meghana Ravikumar anchors on building an image classifier trained on the Stanford Cars dataset to evaluate fine tuning, feature extraction, and the impact of hyperparameter optimization, then tune image transformation parameters to augment the model. The goal is to answer: how can resource-constrained teams make trade-offs between efficiency and effectiveness using pretrained models?
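
For readers new to the distinction the session draws, here is a minimal feature-extraction sketch in PyTorch (an illustration only, not SigOpt's code): the pretrained backbone is frozen and only a new classification head is trained; unfreezing the backbone instead would be fine-tuning, and the head's learning rate is exactly the kind of hyperparameter being optimized.

    import torch
    import torch.nn as nn
    from torchvision import models

    num_classes = 196  # the Stanford Cars dataset has 196 classes

    model = models.resnet50(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False  # freeze the pretrained backbone (feature extraction)
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new, trainable head

    # Only the head's parameters are optimized; its learning rate is a tunable hyperparameter
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)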

Sriram Ravindran is a data scientist at Adobe, where he’s building a platform called Fraud AI. Fraud AI is a solution designed to meet Adobe’s fraud detection needs. Previously, he was a graduate research student at University of California, San Diego, where he worked on applying deep learning to EEG (brain activity) data.

Presentations

A graph neural network approach for time evolving fraud networks Data Case Studies

Sriram Ravindran, Deepak Pai, and Shubranshu Shekhar discuss developing a fraud detection model using state-of-the-art graph neural networks. This model can be used to detect card testing, trial abuse, seat addition, etc.

Kasie Richards is the national director for situational awareness and decision support for the American Red Cross. She provides holistic data support for national emergency operations, including embedding in field operations to provide real-time data support from data collection through visualization. She also provides national coordination and support for data analytics associated with emergency preparedness and response. Kasie holds a doctorate in public health and an award for her data visualization work in hurricane planning.

Presentations

Supporting disasters with data Health Data Day

With the current frequency of disasters and other emergencies, how we prepare and optimize data to support affected communities and individuals is a priority for the preparedness sector. Kasie Richards offers a holistic approach to data during emergencies and shares strategies for implementing a data-inclusive emergency preparedness strategy.

Joy Rimchala is a data scientist in Intuit’s Machine Learning Futures Group working on ML problems in limited-label data settings. Joy holds a PhD from MIT, where she spent five years doing biological object tracking experiments and modeling them using Markov decision processes.

Presentations

Explainable AI: Your model is only as good as your explanation Session

Explainable AI (XAI) has gained industry traction, given the importance of explaining ML-assisted decisions in human terms and detecting undesirable ML defects before systems are deployed. Joy Rimchala and Diane Chang delve into XAI techniques, advantages and drawbacks of black box versus glass box models, concept-based diagnostics, and real-world examples using design thinking principles.

Kelley Rivoire is the head of data infrastructure at Stripe, where she leads the Data Infrastructure Group. As an engineer, she built Stripe’s first real-time machine learning evaluation of user risk. Previously, she worked on nanophotonics and 3D imaging as a researcher at HP Labs. She holds a PhD from Stanford.

Presentations

Production ML outside black box: Repeatable input, explainable output Session

Tools for training and optimizing models have become more prevalent and easier to use; however, these are insufficient for deploying ML in critical production applications. Kelley Rivoire dissects how Stripe approached challenges in developing reliable, accurate, and performant ML applications that affect hundreds of thousands of businesses.

Paige Roberts is an open source relations manager at Vertica, where she promotes understanding of Vertica, MPP data processing, open source, and how the analytics revolution is changing the world. In two decades in the data management industry, she’s worked as an engineer, a trainer, a marketer, a product manager, and a consultant.

Presentations

Architecting production IoT analytics Session

What works in production is the only technology criterion that matters. Companies with successful high-scale production IoT analytics programs like Philips, Anritsu, and OptimalPlus show remarkable similarities. IoT at production scale requires certain technology choices. Paige Roberts drills into the architectures of successful production implementations to identify what works and what doesn’t.

Klaus Roder is the lead program director for Cloud Pak for Data System, a data and AI platform that helps companies manage their data and AI needs end to end, from collecting, organizing, and analyzing data to infusing it in a governed way, helping customers climb the AI ladder. His passion is working with customers and bringing new innovations into products.

Prior to his current role, Klaus led the IBM Entity Analytics business, providing deep entity insights to customers. He was also the lead product manager for IBM’s big data business, which he led from its inception to a multimillion-dollar business. Before that, Klaus was a key member of the IBM Information Management CTO office.

Klaus holds a business degree and a master’s in computer science from the University of Applied Sciences in Würzburg, Germany, and multiple Stanford Continuing Studies certifications.

Presentations

Machine Learning and Deep Learning with IBM Cloud Pak for Data System Intel® AI Builders Showcase

IBM Cloud Pak for Data System is an Intel-based hybrid cloud Data & AI platform delivering an information architecture for AI. With IBM Cloud Pak for Data System, you can unlock the value of all your data on a unified, cloud-native platform to automate how your organization turns data into insights.

Lisa Joy Rosner is the CMO at Otonomo, an automotive data services platform, where she drives global development of the company’s marketplace. She’s an award-winning and patented executive with over 20 years of experience marketing big data and analytics solutions at both public and startup technology companies. Previously, Rosner was CMO at Neustar, leading a major brand transformation as the company entered the security and marketing data services markets; launched the social intelligence company NetBase, where she worked with 5 of the top 10 CPGs as they adopted a new approach to real-time marketing; was vice president of marketing at MyBuys (sold to Magnetic); was vice president of worldwide marketing at BroadVision; and held positions at data warehousing companies Brio (sold to Hyperion), DecisionPoint (sold to Teradata), and Oracle. Lisa Joy was named a 2013 Silicon Valley Woman of Influence and 2014 B2B Marketer of the Year by the Sage Group and the Wall Street Journal, and she was a Top 100 Women in Marketing honoree by Brand Innovators in 2015. She’s been a guest lecturer at the Haas School of Business, the Tuck School of Business, and Stanford University. She earned a bachelor’s degree (summa cum laude) in English literature from the University of Maryland. She sits on the marketing advisory boards of Mintigo, The Big Flip, Fyber, and PLAE Shoes, and on the board of trustees for UC Merced. She’s the mother of four young children.

Presentations

Data privacy compliance and the future of mobility Session

As cars gain more advanced features, the role of customer privacy and responsible data stewardship becomes an important focus for auto manufacturers and drivers. Lisa Joy Rosner discusses the future of connected vehicles, data compliance measures, and the impact of related policies like GDPR and the California Consumer Privacy Act (CCPA).

Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.

Presentations

Building a secure, scalable, and transactional data lake on AWS 2-Day Training

Nikki Rouda walks you through building a data lake on Amazon S3 using different ingestion mechanisms, performing incremental data processing on the data lake to support transactions on S3, and securing the data lake with fine-grained access control policies.

Building a secure, scalable, and transactional data lake on AWS (Day 2) Training Day 2

Nikki Rouda walks you through building a data lake on Amazon S3 using different ingestion mechanisms, performing incremental data processing on the data lake to support transactions on S3, and securing the data lake with fine-grained access control policies.

Rachel Roumeliotis is a strategic content director at O’Reilly, where she leads an editorial team that covers a wide variety of programming topics ranging from full stack to open source in the enterprise to emerging programming languages. Rachel is a programming chair of OSCON and O’Reilly’s Software Architecture Conference. She has been working in technical publishing for 10 years, acquiring content in many areas including mobile programming, UX, computer security, and AI.

Presentations

Tuesday keynotes Keynote

Strata Data & AI program chairs, Rachel Roumeliotis and Alistair Croll, welcome you to the first day of keynotes.

Wednesday keynotes Keynote

Strata program chairs Rachel Roumeliotis and Alistair Croll welcome you to the second day of keynotes.

Armand Ruiz is the director of the IBM Data Science & AI Elite, where he leads a team of 100 data scientists who help customers get started with their digital transformation and AI adoption. Previously, Armand led the product management team in charge of the data science and machine learning offerings at IBM. He holds a master’s in electrical engineering from the Université catholique de Louvain and a bachelor’s in telecommunications engineering from the Universitat Politècnica de Catalunya.

Presentations

Beyond the Model: Real-Time, Real-World Hardened AI (Sponsored by IBM) Session

You’ve deployed your model; now reality hits. The market changes, customers react, and the relationship between input and output data is not what it was. Join data and AI experts as they discuss how to create hardened AI solutions that can dynamically adapt and evolve.

Ebrahim Safavi is a senior data scientist at Mist, focusing on knowledge discovery from big data using machine learning and large-scale data mining, where he developed and implemented several key production components, including the company’s chatbot inference engine and anomaly detection. He won a Microsoft research award for his work on information retrieval and recommendation systems in graph-structured networks. Ebrahim earned a PhD in cognitive learning networks from Stevens Institute of Technology.

Presentations

Automated pipeline for large-scale neural network training, inference Session

Anomaly detection models are essential to run data-driven businesses intelligently. At Mist Systems, the need for accuracy and the scale of the data pose challenges for building and automating ML pipelines. Ebrahim Safavi and Jisheng Wang explain how recurrent neural networks and novel statistical models allow Mist Systems to build a cloud-native solution and automate the anomaly detection workflow.

Debashis Saha is the vice president of engineering at AppZen. Previously, he was vice president of platforms for Intuit, where he led engineering that enabled developers to be productive, data driven, and innovative; executive vice president and chief technology officer for Jiff, a healthcare startup focused on increasing health and happiness; and vice president of commerce platform infrastructure at eBay, where he led the engineering teams that built, managed, and operated the platforms and infrastructure services and his portfolio included eBay’s entire infrastructure, ranging from data centers to cloud, platforms, frameworks, data and analytics, engineering services, and operations. Debashis earned an MS in electrical engineering and computer science from MIT and a bachelor’s of technology in computer science and engineering from the Indian Institute of Technology, Kharagpur.

Presentations

Apache Kylin: 5 years and still going strong (sponsored by Kyligence) Session

Luke Han and Debashis Saha delve into the technical history of Apache Kylin, share how it's being used inside some of the world's largest organizations, and provide a road map of what lies ahead for this popular open source project.

Guillaume Saint-Jacques is the tech lead of computational social science at LinkedIn. Previously, he was the technical lead of the LinkedIn experimentation science team. He holds a PhD in management research from the MIT Sloan School of Management, a master’s degree in economics from the Paris École Normale Supérieure and the Paris School of Economics, and a master’s degree in entrepreneurship from HEC Paris.

Presentations

Fairness through experimentation at LinkedIn Session

Most companies want to ensure their products and algorithms are fair. Guillaume Saint-Jacques and Meg Garlinghouse share LinkedIn's A/B testing approach to fairness and describe new methods that detect whether an experiment introduces bias or inequality. You'll learn about a scalable implementation on Spark and discover examples of use cases and impact at LinkedIn.

Mehrnoosh Sameki is a technical program manager at Microsoft, responsible for leading the product efforts on machine learning interpretability within the Azure Machine Learning platform. Previously, she was a data scientist at Rue Gilt Groupe, incorporating data science and machine learning in the retail space to drive revenue and enhance customers’ personalized shopping experiences. She earned her PhD degree in computer science at Boston University.

Presentations

An overview of responsible artificial intelligence Tutorial

Mehrnoosh Sameki and Sarah Bird examine six core principles of responsible AI with a focus on transparency, fairness, and privacy. You'll discover best practices and state-of-the-art open source toolkits that empower researchers, data scientists, and stakeholders to build trustworthy AI systems.

Aryn Sargent is a data analyst at Verint, where she leads strategic accounts and clients in identifying and defining intelligent virtual assistant (IVA) understanding and knowledge areas through the use of proprietary AI-powered tools to analyze unstructured conversational data. She’s responsible for client automation strategies; evaluating, measuring, and growing their success; and defining tactical knowledge areas to achieve a long-term vision. Previously, Aryn held numerous roles within Verint, including key positions within product management, product strategy, and data analysis. She has over six years of experience leading the identification and acceleration of successful solutions for enterprise conversational AI and IVAs. When she’s not working with datasets, she’s well known for her green thumb in the garden and her love of dogs, fostering dogs in need until they find a loving forever home.

Presentations

Chatbots and conversation analysis: Learn what customers want to know Data Case Studies

Chatbots are increasingly used in customer service as a first tier of support. Through deep analysis of conversation logs, you can learn real user motivations and where company improvements can be made. Ian Beaver and Aryn Sargent make a build or buy comparison for deploying self-service bots, cover motivations and techniques for deep conversational analysis, and discuss real-world discoveries.

Roshan Satish is a product manager who has been involved with artificial intelligence initiatives at DocuSign since their inception. He came to the company through an acquisition of a CLM startup, SpringCM, and worked with product leadership across the organization to formalize an AI vision before beginning to scale out the team. His job has been to create a robust, enterprise-grade deep learning platform that enables intelligence and insights across the DocuSign Agreement Cloud. Understandably, many of the use cases center around document understanding and natural language processing (NLP) and natural language understanding (NLU)—but they’ve also explored features leveraging CNNs, as well as classical machine learning models. One of the major challenges has been working with a bare metal tech stack while emphasizing scalability and modularity of DocuSign’s AI services.

Presentations

A unified CV, OCR, and NLP model pipeline for scalable document understanding at DocuSign Session

Roshan Satish and Michael Chertushkin lead you through a real-world case study about applying state-of-the-art deep learning techniques to a pipeline that combines computer vision (CV), optical character recognition (OCR), and natural language processing (NLP) at DocuSign. You'll discover how the project delivered on its extreme interpretability, scalability, and compliance requirements.

Danilo Sato is a principal consultant at ThoughtWorks with more than 17 years of experience in many areas of architecture and engineering: software, data, infrastructure, and machine learning. Balancing strategy with execution, Danilo helps clients refine their technology strategy while adopting practices to reduce the time between having an idea, implementing it, and running it in production using the cloud, DevOps, and continuous delivery. He is the author of DevOps in Practice: Reliable and Automated Software Delivery, is a member of ThoughtWorks’ Technology Advisory Board and Office of the CTO, and is an experienced international conference speaker.

Presentations

CD for ML: Automating the end-to-end lifecycle Tutorial

Danilo Sato leads you through applying continuous delivery (CD) to data science and machine learning (ML). Join in to learn how to make changes to your models while safely integrating and deploying them into production using testing and automation techniques to release reliably at any time and with a high frequency.

Liqun Shao is a data scientist in the AI Development Acceleration Program at Microsoft. She completed her first rotational project, “Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-Based Platforms,” with a paper published at SoCC 2019, and her second, “Azure Machine Learning Text Analytics Best Practices,” with contributions to the public NLP repo; the resulting talk was accepted at Strata Data 2020. She’s now working on her third rotational project, with MSAI on SmartCompose. She earned her bachelor’s degree in computer science in China and her doctorate in computer science from the University of Massachusetts. Her research focuses on natural language processing, data mining, and machine learning, particularly title generation, summarization, and classification.

Presentations

Distributed training in the cloud for production-level NLP models Session

Liqun Shao leads you through a new GitHub repository to show you how data scientists without NLP knowledge can quickly train, evaluate, and deploy state-of-the-art NLP models. She focuses on two use cases with distributed training on Azure Machine Learning with Horovod: GenSen for sentence similarity and BERT for question-answering using Jupyter notebooks for Python.

Shubranshu Shekhar is a PhD student at Carnegie Mellon University.

Presentations

A graph neural network approach for time evolving fraud networks Data Case Studies

Sriram Ravindran, Deepak Pai, and Shubranshu Shekhar discuss developing a fraud detection model using state-of-the-art graph neural networks. This model can be used to detect card testing, trial abuse, seat addition, etc.

Mehul Sheth is a senior performance engineer in the Performance Labs at Druva, where he’s responsible for the performance of the CloudApps product of Druva InSync. He has more than 13 years of experience in development and performance engineering, over which he’s ensured the production performance of thousands of applications. Mehul loves to tackle unsolved problems and strives to bring a simple solution to the table rather than trying complex approaches.

Presentations

Realistic synthetic data at scale: Influenced by production data Session

Any software product needs to be tested against data, and it's difficult to have a random but realistic dataset representing production data. Mehul Sheth highlights using production data to generate models. Production data is accessed without exposing it or violating any customer agreements on privacy, and the models then generate test data at scale in lower environments.

Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, road map, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers; and he held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.

Presentations

Building a cloud data lake: Ingest, process & analyze big data on AWS Session

Data lakes are hot again: with Amazon S3 as the data lake storage, the modern data lake architecture separates compute from storage. You can choose from a variety of elastic, scalable, and cost-efficient technologies when designing a cloud data lake. Tomer Shiran and Jacques Nadeau share best practices for building a data lake on AWS, as well as various services and open source building blocks.

Laura Schornack is an expert engineer and lead design architect for shared services at JPMorgan Chase. Previously, she worked for world-renowned organizations such as IBM and Nokia. She holds a degree in computer science from the University of Illinois at Urbana-Champaign.

Presentations

Architecting and deploying an ML model to the private cloud Interactive session

Many pieces go into integrating machine learning models into an application. Laura Schornack details how to create the architecture for each piece so it can be delivered in an agile manner. Along the way, you'll learn how to integrate these pieces into an existing application.

Pramod Singh is a senior machine learning engineer at Walmart Labs. He has extensive hands-on experience in machine learning, deep learning, AI, data engineering, designing algorithms, and application development. He has spent more than 10 years working on multiple data projects at different organizations. He’s the author of three books: Machine Learning with PySpark, Learn PySpark, and Learn TensorFlow 2.0. He’s also a regular speaker at major conferences such as the O’Reilly Strata Data and AI Conferences. Pramod holds a BTech in electrical engineering from BATU and an MBA from Symbiosis University. He also holds a data science certification from IIM Calcutta. He lives in Bangalore with his wife and three-year-old son. In his spare time, he enjoys playing guitar, coding, reading, and watching football.

Presentations

Attention networks all the way to production using Kubeflow Tutorial

Vijay Srinivas Agneeswaran, Pramod Singh, and Akshay Kulkarni demonstrate the in-depth process of building a text summarization model with an attention network using TensorFlow 2.0. You'll gain the practical hands-on knowledge to build and deploy a scalable text summarization model on top of Kubeflow.

Joseph Sirosh is the chief technology officer at Compass. Previously, he was the corporate vice president of the Cloud AI Platform at Microsoft, where he led the company’s enterprise AI strategy and products such as Azure Machine Learning, Azure Cognitive Services, Azure Search, and Bot Framework; the corporate vice president for Microsoft’s Data Platform; the vice president for Amazon’s Global Inventory Platform, responsible for the science and software behind Amazon’s supply chain and order fulfillment systems, as well as the central Machine Learning Group, which he built and led; and the vice president of research and development at Fair Isaac Corp., where he led R&D projects for DARPA, Homeland Security, and several other government organizations. He’s passionate about machine learning and its applications and has been active in the field since 1990. Joseph holds a PhD in computer science from the University of Texas at Austin and a BTech in computer science and engineering from the Indian Institute of Technology Chennai.

Presentations

Compass uses Amazon to simplify and modernize home search Session

Compass is changing real estate by leveraging its industry-leading software to build search and analytical tools that help real estate professionals find, market, and sell homes. Joseph Sirosh details how Compass leverages AWS services, including Amazon Elasticsearch Service, to deliver a complete, scalable home-search solution.

Divya Sivasankaran is a machine learning scientist at integrate.ai, where she focuses on building out FairML capabilities within its products. Previously, she worked for a startup that partnered with government organizations in policing and healthcare to build AI capabilities intended to bring about positive change. These experiences shaped her thinking around the larger ethical implications of AI in the wild and the need for ethical considerations to be brought forward at the design stage (proactive versus reactive).

Presentations

FairML from theory to practice: Our journey to build a fair product Session

In recent years, there's been a lot of attention on the need for ethical considerations in ML, as well as different ways to address bias in different stages of the ML pipeline. However, there hasn't been a lot of focus on how to bring fairness to ML products. Divya Sivasankaran explores the key challenges (and how to overcome them) in operationalizing fairness and bias in ML products.

Sam Small is the chief security officer at ZeroFOX, where he helps customers implement world-class social media protection programs and supports ZeroFOX in continuously advancing its role as the innovation leader in social media and collaborative-technology security solutions. After earning his doctorate in computer science from Johns Hopkins University, Sam was a lecturer, led an academic security research lab, and launched two security-industry startups, including Fast Orientation, where he most recently served as CEO and continues to hold a non-operational role as chairman. In addition to his technical and entrepreneurial pursuits, Sam has provided expert technology and security assessments of dozens of organizations and vendor products and has conducted due diligence assessments for more than a dozen software-industry investments, mergers, and acquisitions. He has also served as an expert witness in several high-profile security, software, and network-related lawsuits. His work and research have been covered in publications including Wired, The New York Times, The Washington Post, New Scientist, CNET, ZDNet, and Slashdot.

Presentations

Accelerate Threat & Object Detection on Digital Platforms with AI Intel® AI Builders Showcase

Many of the biggest security risks for businesses today come through an expanding attack surface: social and digital platforms. Running its AI on Intel technologies helps ZeroFOX analyze more images and alert on more threats for more customers.

Jason “Jay” Smith is a Cloud customer engineer at Google. He spends his day helping enterprises find ways to expand their workload capabilities on Google Cloud. He’s on the Kubeflow go-to-market team and provides code contributions to help people build an ecosystem for their machine learning operations. His passions include big data, ML, and helping organizations find a way to collect, store, and analyze information.

Presentations

Using serverless Spark on Kubernetes for data streaming and analytics Session

Data is a valuable resource, but collecting and analyzing the data can be challenging. And the cost of resource allocation often prohibits the speed at which you can analyze the data. Jay Smith and Remy Welch break down how serverless architecture can improve the portability and scalability of streaming event-driven Apache Spark jobs and perform ETL tasks using serverless frameworks.

Maulik Soneji is a product engineer at Gojek, where he works with different parts of data pipelines for a hypergrowth startup. Outside of learning about mature data systems, he’s interested in Elasticsearch, Go, and Kubernetes.

Presentations

BEAST: Building an event processing library for millions of events Session

Maulik Soneji and Dinesh Kumar explore Gojek's event-processing library, which consumes events from Kafka and pushes them to BigQuery. All of Gojek's services are event sourced, and it handles hundreds of topics, with some carrying loads as high as 21K messages per second.

Colin Spikes is a senior manager of solution engineering at Algorithmia and an experienced solution consultant with an extensive background in all things data. Previously, Colin managed a team of data solution architects at Socrata, assisting cities, states, and federal agencies worldwide to unlock the power of data to better understand and communicate conditions in their communities.

Presentations

The OS for AI: How serverless computing enables the next generation of ML Session

ML is advancing rapidly, but only a few contributors focus on the infrastructure and scaling challenges that come with it. Colin Spikes explores why ML is a natural fit for serverless computing, a general architecture for scalable ML, and common issues when implementing on-demand scaling over GPU clusters. He provides general solutions and describes a vision for the future of cloud-based ML.

Vijay Srinivas Agneeswaran is a director of data sciences at Walmart Labs in India, where he heads the machine learning platform development and data science foundation teams, which provide platform and intelligent services for Walmart businesses around the world. He’s spent the last 18 years creating intellectual property and building data-based products in industry and academia. Previously, he led the team that delivered real-time hyperpersonalization for a global automaker, as well as other work for various clients across domains such as retail, banking and finance, telecom, and automotive; he built PMML support into Spark and Storm and implemented several machine learning algorithms, such as LDA and random forests, over Spark; he led a team that designed and implemented a big data governance product for role-based, fine-grained access control inside Hadoop YARN; and he and his team built the first distributed deep learning framework on Spark. He’s been a professional member of the ACM and the IEEE (senior) for the last 10+ years. He has five full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, artificial intelligence, and big data and other emerging technologies. Vijay has a bachelor’s degree in computer science and engineering from SVCE, Madras University, an MS (by research) from IIT Madras, and a PhD from IIT Madras, and he held a postdoctoral research fellowship in the LSIR Labs, Swiss Federal Institute of Technology, Lausanne (EPFL).

Presentations

Attention networks all the way to production using Kubeflow Tutorial

Vijay Srinivas Agneeswaran, Pramod Singh, and Akshay Kulkarni demonstrate the in-depth process of building a text summarization model with an attention network using TensorFlow 2.0. You'll gain the practical hands-on knowledge to build and deploy a scalable text summarization model on top of Kubeflow.

Ion Stoica is a professor in the Electrical Engineering and Computer Sciences (EECS) Department at the University of California, Berkeley, where he researches cloud computing and networked computer systems. Previously, he worked on dynamic packet state, chord DHT, internet indirection infrastructure (i3), declarative networks, and large-scale systems, including Apache Spark, Apache Mesos, and Alluxio. He’s the cofounder of Databricks—a startup to commercialize Apache Spark—and Conviva—a startup to commercialize technologies for large-scale video distribution. Ion is an ACM fellow and has received numerous awards, including inclusion in the SIGOPS Hall of Fame (2015), the SIGCOMM Test of Time Award (2011), and the ACM doctoral dissertation award (2001).

Presentations

Using Ray to scale Python and machine learning Tutorial

There's no easy way to scale up Python applications to the cloud. Ray is an open source framework for parallel and distributed computing, making it easy to program and analyze data at any scale by providing general-purpose high-performance primitives. Robert Nishihara, Ion Stoica, and Philipp Moritz demonstrate how to use Ray to scale up Python applications, data processing, and machine learning.

Dave Stuart is a senior technical executive within the US Department of Defense, where he’s leading a large-scale effort to transform the workflows of thousands of enterprise business analysts through Jupyter and Python adoption, making tradecraft more efficient, sharable, and repeatable. Previously, Dave led multiple grassroots technology adoption efforts, developing innovative training methods that tangibly increased the technical proficiency of a large noncoding enterprise workforce.

Presentations

Jupyter as an enterprise DIY analytic platform Session

Dave Stuart takes a look at how the US Intelligence Community (IC) uses Jupyter and Python to harness the subject matter expertise of analysts in a DIY analytic movement. You'll cover the technical and cultural challenges the community encountered in its quest to find success at a large scale and the strategies used to mitigate those challenges.

Bargava Subramanian is a cofounder and deep learning engineer at Binaize in Bangalore, India. He has 15 years’ experience delivering business analytics and machine learning solutions to B2B companies. He mentors organizations in their data science journey. He holds a master’s degree from the University of Maryland, College Park. He’s an ardent NBA fan.

Presentations

Deep learning for recommendation systems 2-Day Training

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains. You'll get an end-to-end overview of deep learning-based recommendation and learning-to-rank systems to understand practical considerations and guidelines for building and deploying RecSys.

Deep learning for recommendation systems (Day 2) Training Day 2

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains. You'll get an end-to-end overview of deep learning-based recommendation and learning-to-rank systems to understand practical considerations and guidelines for building and deploying RecSys.

Dev Tagare is an engineering manager at Lyft. He has hands-on experience in building end-to-end data platforms for high-velocity and large data volume use cases. Previously, Dev spent 10 years leading engineering functions for companies including Oracle and Twitter with a focus on areas including open source; big data; low-latency, high-scalability design; data structures; design patterns; and real-time analytics.

Presentations

Reducing data lag from 24+ hours to 5 mins at Lyft scale Session

Mark Grover and Dev Tagare offer you a glimpse into the end-to-end data architecture Lyft uses to reduce data lag in its analytical systems from 24+ hours to less than 5 minutes. You'll learn the what and why of tech choices, monitoring, and best practices. They outline Lyft's use cases, especially in ML model performance and evaluation.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe with Microsoft’s Bing Group, and he built and ran distributed teams that helped scale Amazon’s financial systems in both Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, Claudiu Branzan, and Veysel Kocaman detail applying advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Model governance: A checklist for getting AI safely to production Session

The industry has about 40 years of experience forming best practices and tools for storing, versioning, collaborating, securing, testing, and building software source code—but only about 4 years doing so for AI models. David Talby catches you up on current best practices and freely available tools so your team can go beyond experimentation to successfully deploy models.

Ankur Taly is the head of data science at Fiddler, where he’s responsible for developing, productionizing, and evangelizing core explainable AI technology. Previously, he was a staff research scientist at Google Brain, where he carried out research in explainable AI and is best known for his contribution to developing and applying integrated gradients, a new interpretability algorithm for deep networks. His research in this area has resulted in publications at top-tier machine learning conferences and prestigious journals like the American Academy of Ophthalmology (AAO) and Proceedings of the National Academy of Sciences (PNAS). Besides explainable AI, Ankur has a broad research background and has published 25+ papers in areas including computer security, programming languages, formal verification, and machine learning. He’s served on several academic conference program committees (PLDI, POST, and PLAS), delivered several invited lectures at universities and various industry venues, and instructed short courses at summer schools and conferences. Ankur earned his PhD in computer science from Stanford University and a BTech in CS from IIT Bombay.

Presentations

Slice and explain: A unified paradigm for explaining ML models Session

Ankur Taly showcases a new paradigm for model explanations called "slice and explain" that unifies several existing explanation tools into a single framework. You'll learn how to leverage the framework as a data scientist, business user, and regulator to successfully analyze models.

Wangda Tan is a project management committee (PMC) member of Apache Hadoop and engineering manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-premises use cases of Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He’s also led efforts in the Hadoop YARN community on features like resource scheduling, GPU isolation, node labeling, and resource preemption. Previously, he worked on the integration of OpenMPI and GraphLab with Hadoop YARN at Pivotal and participated in creating a large-scale machine learning, matrix, and statistics computation program using MapReduce and MPI at Alibaba.

Presentations

It's 2020 now: Apache Hadoop 3.x state of the union & upgrade guidance Session

2020 Hadoop is still evolving fast. You'll learn the current status of Apache Hadoop community and the exciting present and future of Hadoop 3.x. Wangda Tan and Arpit Agarwal cover new features like Hadoop on Cloud, GPU support, NameNode federation, Docker, 10X scheduling improvements, OZone, etc. And they offer you upgrade guidance from 2.x to 3.x.

Cathy Tanimura is the senior director of analytics and data science at Strava. She has a passion for leveraging data in multiple ways: to help people make better decisions, to tell stories about companies and industries, and to develop great product experiences. Previously, she built and led data teams at several high-growth technology companies, including Okta, Zynga, and StubHub.

Presentations

The power of visualizing health, fitness, and community impact Health Data Day

Pulling from specific product innovations and applications like relative effort, cumulative stats, Strava's Year in Sport, heat maps, and Metro, Cathy Tanimura shares best practices for creating effective data visualizations that help improve the health and fitness of individuals and the well-being of communities.

Fatma Tarlaci is a data science fellow at Quansight, where she focuses on creating training materials in AI and contributes to data science and machine learning projects. She received her PhD in humanities from the University of Texas at Austin followed by a master’s degree in computer science from Stanford University. Her work and research specialize in deep learning and data science.

Presentations

Natural language processing with open source Tutorial

Language is at the heart of everything we do. Natural language processing (NLP) is one of the most challenging tasks of artificial intelligence, mainly due to the difficulty of detecting nuances and common sense reasoning in natural language. Fatma Tarlaci invites you to learn more about NLP and explore a complete hands-on implementation of an NLP deep learning model.

Ben Taylor is a cofounder, along with David Gonzalez, of Zeff, where he pursues deep learning for image, audio, video, and text for the enterprise. He’s a thought leader in AI with over 16 years of machine learning experience. Previously, Ben was at Intel and Micron, working in photolithography, process control, and yield prediction; at an AI hedge fund (AIQ) as its high-performance computing (HPC)/AI expert, where he built models using a 600-GPU cluster to predict stock movements based on the news, pursuing his love for HPC and predictive modeling; and at a young HR startup called HireVue, where he built out its data science group, filed seven patents, and helped launch its AI insights product using video and audio from candidate interviews, allowing his team of PhD scientists to help pioneer antibias mitigation strategies for AI. He studied chemical engineering.

Presentations

Deep Xbox: Ramifications for AI and rapid acceleration into the future (sponsored by Dell Technologies) Keynote

AI does incredible things; it amplifies human expertise. Ben Taylor demonstrates this with an AI that learns to play a stock Xbox using pixels and highlights the rapid increase in data complexity and scale required. You'll identify the implications of this complexity trend across multiple industries.

Maureen Teyssier is a chief data scientist at Reonomy, a property intelligence company that is transforming the world’s largest asset class: commercial real estate. Maureen has run simulations and transformed data for almost 20 years. She has a breadth of knowledge on a variety of data, including location data, click data, image data, streaming data, public and simulated data, and experience working with data at scale, managing datasets ranging from kilobytes to terabytes. Previously, she drove technological and process advancements that resulted in 500% year-over-year B2B contract growth at Enigma, a data-as-a-service company headquartered in New York City; delivered smart technology that anticipates human behavior and needs at Axon Vibe; created a smartwatch app recommender in the Insight Data Science Fellows Program; and researched galactic shapes due to the interplay between dark matter and stellar evolution as a postdoctoral associate at Rutgers University. Maureen earned her PhD in computational astrophysics from Columbia University, where she studied the evolution of galaxies by running cosmological simulations on supercomputers.

Presentations

How ML is creating a new category in commercial real estate Data Case Studies

Despite being one of America’s largest industries, commercial real estate professionals lack insights and opportunities due to the fragmented, disparate nature of real estate information. The market is still predominantly facilitated by paper agreements, phone calls, and in-person transactions. Maureen Teyssier showcases how to use knowledge graphs to empower informed, strategic decisions.

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, Claudiu Branzan, and Veysel Kocaman detail applying advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Sherin Thomas is a software engineer at Lyft. In her career, spanning eight years, she’s worked on most parts of the tech stack but enjoys the challenges in data science and machine learning the most. Most recently she’s been focused on building products that facilitate advances in artificial intelligence and machine learning through streaming. She’s passionate about getting more people, especially women, interested in this field and has been trying her best to share her work with the community through tech talks and panel discussions. Recently, she gave a talk about machine learning infrastructure and streaming at Beam Summit and Flink Forward in Berlin. In her free time she loves to read and paint. She’s also the president of the Russian Hill book club based in San Francisco and loves to organize events for her local library.

Presentations

A self-service platform for continuous, real-time feature generation Data Case Studies

In the world of ride-sharing, decisions such as matching a passenger to the nearest driver, pricing, ETA, etc. need to be made in real time, making it imperative to build the most up-to-date view of the world. However, gleaning information from high-volume streaming data is tricky, and often solutions are hard to use. Sherin Thomas explains how Lyft attempted to solve this problem with Flink.

Jameson Toole is the cofounder and CEO of Fritz AI, a company building tools to help developers optimize, deploy, and manage machine learning models on mobile devices. Previously, he built analytics pipelines for Google X’s Project Wing and ran the data science team at Boston technology startup Jana Mobile. He holds undergraduate degrees in physics, economics, and applied mathematics from the University of Michigan and both an MS and PhD in engineering systems from MIT, where he worked on applications of big data and machine learning to urban and transportation planning at the Human Mobility and Networks Lab.

Presentations

Create smaller, faster, production-worthy mobile ML models Session

Getting machine learning (ML) models ready for use on device is a major challenge. Jameson Toole explains optimization, pruning, and compression techniques that keep app sizes small and inference speeds high. You'll learn to apply these techniques using mobile ML frameworks such as Core ML and TensorFlow Lite.

Teresa Tung is a managing director at Accenture, where she’s responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture’s clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture’s Applied Intelligence Platform. Teresa is Accenture’s most prolific inventor, with 170+ patents and applications. She holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

Building the digital twin IoT and unconventional data Session

The digital twin presents a problem of data and models at scale—how to mobilize IT and OT data, AI, and engineering models that work across lines of business and even across partners. Teresa Tung and William Gatehouse share their experience implementing digital twin use cases that combine IoT, AI models, engineering models, and domain context.

Sandeep Uttamchandani is the hands-on chief data architect and head of data platform engineering at Intuit, where he’s leading the cloud transformation of the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep held engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep’s experience uniquely combines building enterprise data products and operational expertise in managing petabyte-scale data and analytics platforms in production for IBM’s federal and Fortune 100 customers. Sandeep has received several excellence awards. He has over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He blogs on LinkedIn and his personal blog, Wrong Data Fabric. Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.

Presentations

10 lead indicators before data becomes a mess Session

Data quality metrics focus on quantifying if data is a mess. But you need to identify lead indicators before data becomes a mess. Sandeep Uttamchandani, Giriraj Bagadi, and Sunil Goplani explore developing lead indicators for data quality for Intuit's production data pipelines. You'll learn about the details of lead indicators, optimization tools, and lessons that moved the needle on data quality.

Jeffrey Vah is a senior principal test engineer at Dell Technologies. He has 26+ years of experience in test engineering and 7 years in engineering management, test design, tools, and automation. He’s the coauthor of the article, “The Intersection of Predictive Analytics, Predictive Repair, and Reverse Supply Chain.” He’s a co-inventor of five pending patents for innovative use of machine learning models at Dell Technologies, and a co-inventor of a patent for innovative hardware design at Microsoft.

Presentations

Data science + domain experts = exponentially better products Data Case Studies

To deliver best-in-class data science products, solutions must evolve through partnerships between data scientists and domain experts. Jeffrey Vah and Gayathri Rau detail the product lifecycle journey while integrating business expertise with data scientists and technologists. You'll discover best practices and pitfalls when digitally transforming your business through AI and machine learning.

Balaji Varadarajan is a senior software engineer at Uber, where he works on the Hudi project and oversees data engineering broadly across the network performance monitoring domain. Previously, he was one of the lead engineers on LinkedIn’s databus change capture system as well as the Espresso NoSQL store. Balaji’s interests lie in distributed data systems.

Presentations

Bring stream processing to batch data using Apache Hudi (incubating) Session

Batch processing can benefit immensely from adopting some techniques from the streaming processing world. Balaji Varadarajan shares how Apache Hudi (incubating), an open source project created at Uber and currently incubating with the ASF, can bridge this gap and enable more productive, efficient batch data engineering.

Sundar Varadarajan is a consulting partner on AI and ML at Wipro and plays an advisory role on edge AI and ML solutions. He’s an industry expert in the field of analytics, machine learning, and AI, having ideated, architected, and implemented innovative AI solutions across multiple industry verticals. Sundar can be reached at sundar.varadarajan@wipro.com.

Presentations

An approach to automate time and motion analysis Session

Studying time and motion in manufacturing operations on a shop floor is traditionally carried out through manual observation, which is time consuming and involves human errors and limitations. Sundar Varadarajan and Peyman Behbahani detail a new approach of video analytics combined with time series analysis to automate activity identification and timing measurements.

Paroma Varma is a cofounder at Snorkel and completed a PhD at Stanford, advised by Professor Christopher Ré and affiliated with the DAWN, SAIL, and StatML groups, where she was supported by the Stanford Graduate Fellowship and the National Science Foundation Graduate Research Fellowship. Her research interests revolve around weak supervision or using high-level knowledge in the form of noisy labeling sources to efficiently label massive datasets required to train machine learning models.

Presentations

Programmatically building and managing training datasets with Snorkel Tutorial

Paroma Varma teaches you how to build and manage training datasets programmatically with Snorkel, an open source framework developed at the Stanford AI Lab, and demonstrates how this can lead to more efficiently building and managing machine learning (ML) models in a range of practical settings.

Manasi Vartak is the founder and CEO of Verta.ai (www.verta.ai), an MIT spinoff building software to enable high-velocity machine learning. The Verta platform enables data scientists and ML engineers to robustly version ML models, collaborate and share ML knowledge, and, when models are ready for graduation, deploy and monitor them in production environments. Verta grew out of Manasi’s PhD work at MIT on ModelDB, the first open source model management system deployed at Fortune 500 companies. Manasi previously worked on deep learning for content recommendation as part of the feed-ranking team at Twitter and on dynamic ad targeting at Google. Manasi is passionate about building intuitive data tools, helping companies become AI-first, and figuring out how data scientists and the organizations they support can be more effective. Manasi has spoken at several top research and industry conferences, such as the O’Reilly AI Conference, SIGMOD, VLDB, Spark Summit, and AnacondaCon, and has authored a course on model management.

Presentations

Robust MLOps with open-source: ModelDB, Jenkins, and Prometheus Session

A key part of any Ops toolchain is the versioning system. While code versioning systems are ubiquitous, model versioning systems are largely absent, making current MLOps unreliable. Manasi Vartak introduces ModelDB, built at MIT as the first open source system for model management and now a complete model versioning system.

Shankar Venkitachalam is a data scientist on the Experience Cloud research and Sensei team at Adobe. He holds a master’s degree in computer science from the University of Massachusetts Amherst. He’s passionate about machine learning, probabilistic graphical models, and natural language processing.

Presentations

CrEIO: Critical Events Identification for Online purchase funnel Session

Identifying customer stages in a buying cycle enables you to perform personalized targeting based on the stage. Shankar Venkitachalam, Megahanath Macha Yadagiri, and Deepak Pai identify ML techniques to analyze a customer's clickstream behavior to find the different stages of the buying cycle and quantify the critical click events that help transition a user from one stage to another.

Sumeet Vij is a director in the Strategic Innovation Group (SIG) at Booz Allen Hamilton, where he leads multiple client engagements, research, and strategic partnerships in the field of AI, digital personalization, recommendation systems, chatbots, digital assistants, and conversational commerce. Sumeet is also the practice lead for next-generation digital experiences powered by AI and data science, helping with the large-scale analysis of data and its use to quickly provide deeper insights, create new capabilities, and drive down costs.

Presentations

Weak supervision, strong models: Increase strength with noisy data Session

Weak supervision allows you to use noisy sources to provide supervision signals for labeling large amounts of training data. Sumeet Vij showcases an approach combining a Snorkel weak supervision framework with denoising labeling functions, a generative model, and AI-powered search to train classifiers leveraging enterprise knowledge, without the need for tens of thousands of hand-labeled examples.

Jorge Villamariona is a senior technical marketing engineer on the product marketing team at Qubole. Over the years, Jorge has acquired extensive experience in relational databases, business intelligence, big data engines, ETL, and CRM systems. He enjoys complex data challenges and helping customers gain greater insight and value from their existing data.

Presentations

Data engineering workshop 2-Day Training

Jorge Villamariona outlines how organizations that use a single platform for processing all types of big data workloads are able to manage growth and complexity, react faster to customer needs, and improve collaboration—all at the same time. You'll leverage Apache Spark and Hive to build an end-to-end solution to address business challenges common in retail and ecommerce.

Data engineering workshop (Day 2) Training Day 2

Jorge Villamariona outlines how organizations that use a single platform for processing all types of big data workloads are able to manage growth and complexity, react faster to customer needs, and improve collaboration—all at the same time. You'll leverage Apache Spark and Hive to build an end-to-end solution to address business challenges common in retail and ecommerce.

Mario Vinasco has over 15 years of progressive experience in data-driven analytics, with an emphasis on machine learning and data science programming creatively applied to ecommerce, advertising, customer acquisition and retention, and marketing investment. Mario specializes in developing and applying leading-edge business analytics to complex business problems using big data and predictive modeling platforms.

Mario holds a master’s in engineering economics from Stanford University and currently works as director of analytics and data science at Credit Sesame, a disruptive fintech company in the San Francisco Bay Area, where he’s responsible for customer management, retention, and prediction.

Until recently, Mario worked at Uber Technologies, applying data science to marketing investment optimization and advanced segmentation of customers by propensity to act, churn, or open email, and setting up sophisticated experiments to test and validate hypotheses.

At Facebook, in the marketing analytics group, he was responsible for improving the effectiveness of Facebook’s own consumer-facing campaigns. Key projects included ad-effectiveness measurement of Facebook’s brand marketing activities and product campaigns for key product priorities, using advanced experimentation techniques.

Prior roles included VP of business intelligence at a digital textbook startup, people analytics manager at Google, and senior ecommerce manager at Symantec.

Presentations

Optimization of digital spend using machine learning in PyTorch Session

Uber spends hundreds of millions of dollars on marketing and constantly optimizes the allocation of these budgets. It deploys complex models built with Python and PyTorch, borrowing techniques from machine learning (ML) to speed up the solvers that optimize marketing investment. Mario Vinasco explains the framework of the marketing spend problem and how it was implemented.

Joseph Voyles is a director in PwC’s Emerging Tech practice and leads the US AI Innovation Lab. He has experience applying simulation modeling, optimization, natural language processing, and machine learning to help clients explore disruptive technologies, develop strategies for new business models, and analyze the implications of policy change. In addition, he has worked with clients to develop intelligent systems and platforms that integrate AI to support task automation and decision support. Beyond supporting model development, he has helped the firm develop its operational practices, infrastructure, and tooling to support the development, deployment, and maintenance of models at scale.

Presentations

ML models are not software: Why organizations need dedicated operations to address the b Session

Anand Rao and Joseph Voyles introduce you to the core differences between software and machine learning model life cycles. They demonstrate how AI’s success also limits its scale and detail leading practices for establishing AIOps to overcome limitations by automating CI/CD, supporting continuous learning, and enabling model safety.

Kshitij Wadhwa is a software engineer at Rockset, where he works on the platform engineering team. Previously, Kshitij was an engineer at NetApp on the filesystem and protocols team in Cloud Backup Service. Kshitij holds a master’s degree in computer science from North Carolina State University.

Presentations

Building live dashboards on Amazon DynamoDB using Rockset Session

Rockset is a serverless search and analytics engine that enables real-time search and analytics on raw data from Amazon DynamoDB—with full featured SQL. Kshitij Wadhwa and Dhruba Borthakur explore how Rockset takes an entirely new approach to loading, analyzing, and serving data so you can run powerful SQL analytics on data from DynamoDB without ETL.

Kai Waehner is a technology evangelist at Confluent. Kai’s areas of expertise include big data analytics, machine learning, deep learning, messaging, integration, microservices, the internet of things, stream processing, and blockchain. He’s a regular speaker at international conferences such as JavaOne, O’Reilly Software Architecture, and ApacheCon and has written a number of articles for professional journals. Kai also shares his experiences with new technologies on his blog.

Presentations

Stream microservice architectures w/ Apache Kafka & Istio service mesh Session

Apache Kafka became the de facto standard for microservice architectures, which also introduced new challenges. Kai Wähner explores the problems of distributed microservices communication and how Kafka and a service mesh like Istio address them. You'll learn approaches for combining them to build a reliable and scalable microservice architecture with decoupled and secure microservices.

Dean Wampler is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He’s head of developer relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he’s the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He earned his PhD in physics from the University of Washington.

Presentations

Model governance Tutorial

Machine learning (ML) models are data, which means they require the same data governance considerations as the rest of your data. Boris Lublinsky and Dean Wampler outline metadata management for model serving and explain what information about running systems you need and why it's important. You'll also learn how Apache Atlas can be used for storing and managing this information.

Understanding data governance for machine learning models Session

Production deployment of machine learning (ML) models requires data governance because models are data. Dean Wampler and Boris Lublinsky justify that claim and explore its implications and techniques for satisfying the requirements. Using motivating examples, you'll explore reproducibility, security, traceability, and auditing, plus some unique characteristics of models in production settings.

Kelly Wan is a senior data scientist at LinkedIn, Sunnyvale. She’s a technology and data science evangelist. Previously, Kelly worked in investment banking for five years in New York City before transforming her career into data science in Silicon Valley. Kelly obtained her master’s degree in computer science from Columbia University and her bachelor’s degree from Southeast University in China.

Presentations

LinkedIn end-to-end data product to measure customer happiness Session

Studies show that good customer service accelerates customers' cohesion toward a product, which increases product engagement and revenue spending. It's traditional to use customer surveys to measure how customers feel about services and products. Kelly Wan, Jason Wang, and Lili Zhou examine LinkedIn's innovative data product for measuring customer happiness.

Haopei Wang is a research scientist at DataVisor. Previously, he earned his PhD from the Department of Computer Science and Engineering at Texas A&M University. His research includes big data security and system security.

Presentations

Feature engineering from digital identifiers for fraud detection Session

Haopei Wang details the design and implementation of a system that automatically extracts fraud-related features for digital identifiers commonly collected by online services. You'll be able to address real-time feature computation and create templates for feature generations. The system has been applied successfully to fraud detection and good user analysis.

Harrison Wang is a backend software engineer for LiveRamp and was responsible for coordinating the cloud migration for the activations team.

Presentations

Truth and reality of cloud migration for big data processing workflows Session

A migration to a new environment is never easy. You'll learn how LiveRamp tackled migrating its large-scale production workflows from its private data center to the cloud while maintaining high uptime. Harrison Wang examines the high-level steps and decisions involved, lessons learned, and what to realistically expect out of a migration.

Chih-Hui “Jason” Wang is a data scientist on the global customer operations (GCO) data science team at LinkedIn. At LinkedIn, he uses data to advocate the voices of customers and members. Previously, he was a data scientist at LeanTaaS, where he helped transform healthcare operations through data science. He holds a master’s degree in statistics from the University of California, Berkeley.

Presentations

LinkedIn end-to-end data product to measure customer happiness Session

Studies show that good customer service accelerates customers' cohesion toward a product, which increases product engagement and revenue spending. It's traditional to use customer surveys to measure how customers feel about services and products. Kelly Wan, Jason Wang, and Lili Zhou examine LinkedIn's innovative data product for measuring customer happiness.

Jiao (Jennie) Wang is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She’s engaged in developing and optimizing a distributed deep learning framework on Apache Spark.

Presentations

Real-time recommendation using attention networks with Analytics Zoo on Apache Spark Session

Luyang Wang and Jennie Wang explain how to build a real-time menu recommendation system that leverages attention networks using Spark, Analytics Zoo, and MXNet in the cloud. You'll learn how to deploy the model and serve real-time recommendations using both cloud and on-device infrastructure in Burger King’s production environment.

Jisheng Wang is the head of data science at Mist Systems, where he leads the development of Marvis—the first AI-driven virtual network assistant that automates the visibility, troubleshooting, reporting, and maintenance of enterprise networking. He has 10+ years of experience applying state-of-the-art big data and data science technologies to solve challenging enterprise problems including security, networking, and IoT. Previously, Jisheng was the senior director of data science in the CTO office of Aruba, a Hewlett-Packard Enterprise company since its acquisition of Niara in February 2017, where he led the overall innovation and development effort in big data infrastructure and data science and invented the industry’s first modular and data-agnostic User and Entity Behavior Analytics (UEBA) solution, which is widely deployed today among global enterprises; and he was a technical lead at Cisco responsible for various security products. Jisheng earned his PhD in electrical engineering from Penn State University. He’s a frequent speaker at AI and ML conferences, including O’Reilly Strata AI, Frontier AI, Spark Summit, Hadoop Summit, and BlackHat.

Presentations

Automated pipeline for large-scale neural network training, inference Session

Anomaly detection models are essential to run data-driven businesses intelligently. At Mist Systems, the need for accuracy and the scale of the data impose challenges to build and automate ML pipelines. Ebrahim Safavi and Jisheng Wang explain how recurrent neural networks and novel statistical models allow Mist Systems to build a cloud native solution and automate the anomaly detection workflow.

Luyang Wang is a senior manager on the Burger King guest intelligence team at Restaurant Brands International, where he works on machine learning and big data analytics. He’s engaged in developing distributed machine learning applications and real-time web services for the Burger King brand. Previously, Luyang Wang was at Philips Big Data and AI Lab and Office Depot.

Presentations

Real-time recommendation using attention networks with Analytics Zoo on Apache Spark Session

Luyang Wang and Jennie Wang explain how to build a real-time menu recommendation system that leverages attention networks using Spark, Analytics Zoo, and MXNet in the cloud. You'll learn how to deploy the model and serve real-time recommendations using both cloud and on-device infrastructure in Burger King’s production environment.

Pete Warden is the technical lead of the mobile and embedded TensorFlow Group on Google’s Brain team.

Presentations

Machine Learning Magic with Microcontrollers Keynote

Pete Warden shows how machine learning on embedded chips opens up whole new kinds of applications. By fitting within a few tens of kilobytes of memory, deep learning algorithms running on battery-powered or energy-harvesting devices can make sense of microphone, accelerometer, and even visual data, turning raw streams into actionable information.

Prashant Warier is the CEO of Qure.ai, a healthcare business that uses deep learning to automatically interpret X-rays, CT scans, and MRIs, and a chief data scientist at Fractal Analytics. Prashant has 16 years of experience in architecting and developing data science solutions. Previously, he founded AI-powered personalized digital marketing firm Imagna Analytics, which was acquired by Fractal, and worked with SAP and was instrumental in building its data science practice. He earned his PhD and MS in operations research from the Georgia Institute of Technology and his BTech from IIT Delhi. He’s passionate about using artificial intelligence for global good, and through Qure.ai is working toward making healthcare accessible and affordable using the power of machine learning and artificial intelligence.

Presentations

AI at the point of care revolutionizes diagnostics Health Data Day

If AI can automate the interpretation of abnormalities at the point of care (POC) for at-risk populations, it can eliminate diagnosis delays, speeding up time to treatment, and saving lives. Prashant Warier illustrates the technology required for this healthcare revolution to become reality, sharing case studies of machine learning deployed in poverty-stricken areas.

Sophie Watson is a senior data scientist at Red Hat, where she helps customers use machine learning to solve business problems in the hybrid cloud. She’s a frequent public speaker on topics including machine learning workflows on Kubernetes, recommendation engines, and machine learning for search. Sophie earned her PhD in Bayesian statistics.

Presentations

What nobody told you about machine learning in the hybrid cloud Session

Cloud native infrastructure like Kubernetes has obvious benefits for machine learning systems, allowing you to scale out experiments, train on specialized hardware, and conduct A/B tests. What isn’t obvious are the challenges that come up on day two. Sophie Watson and William Benton share their experience helping end users navigate these challenges and make the most of new opportunities.

Dennis Wei is a research staff member with IBM Research AI. He holds a PhD degree in electrical engineering from the Massachusetts Institute of Technology (MIT). His recent research interests center around trustworthy machine learning, including explainability and interpretability, fairness, and causality.

Presentations

Introducing the AI Explainability 360 open source toolkit Tutorial

As AI and ML make inroads into society, calls increase to explain their outputs, but stakeholders have different requirements for explanations. Dennis Wei teaches you to use and contribute to AI Explainability 360 so you can address these needs. You'll get a look at the first comprehensive toolkit for explainable AI and learn the new developments from research labs.

Seth is a data scientist at Sentilink, where he works on its core models to detect various forms of synthetic identity fraud.

He’s also the author of Deep Learning from Scratch: Building with Python from First Principles, published by O’Reilly in 2019.

Presentations

Machine learning for fraud detection with partial labels Session

Many data science problems don't begin with a large, labeled dataset, yet such problems receive much less attention than strictly supervised ones. Seth covers how Sentilink builds machine learning models to detect synthetic identity fraud, with data scientists partnering with a team of fraud analysts who manually label cases, creating an active learning-style feedback loop.

Josh Weisberg is a senior director on the 3D and computer vision team at Zillow Group. Previously, he led the AI camera and computational photography team at Microsoft Research, spent several years at Apple, and worked at four early-stage startups. He’s written four books on imaging and color. Josh studied digital imaging at the Rochester Institute of Technology and holds a bachelor of science degree from the University of San Francisco.

Presentations

Designing a virtual tour application with computer vision and edge computing Session

Computer vision and deep learning enable new technologies to mimic how the human brain interprets images and create interactive shopping experiences. This progress has major implications for businesses providing customers with the information they need to make a purchase decision. Josh Weisberg offers an overview of implementing computer vision to create rich media experiences.

Remy Welch is a data analytics specialist at Google Cloud, where she works with enterprises in San Francisco on best practices for collecting and analyzing data. Remy has particular expertise in the gaming industry, helping gaming companies better handle data ingestion, storage, and analytics.

Presentations

Using serverless Spark on Kubernetes for data streaming and analytics Session

Data is a valuable resource, but collecting and analyzing it can be challenging, and the cost of resource allocation often limits how quickly you can analyze it. Jay Smith and Remy Welch break down how serverless architecture can improve the portability and scalability of streaming, event-driven Apache Spark jobs and how to perform ETL tasks using serverless frameworks.

Seth Wiesman is a senior solutions architect at Ververica, consulting with clients to maximize the benefits of real-time data processing for their business. He supports customers in the areas of application design, system integration, and performance tuning.

Presentations

Apache Flink developer training 2-Day Training

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Apache Flink developer training (Day 2) Training Day 2

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Event-driven applications made easy with Apache Flink Tutorial

David Anderson and Seth Wiesman demonstrate how building and managing scalable, stateful, event-driven applications can be easier and more straightforward than you might expect. You'll go hands-on to implement a ride-sharing application together.

Aaron Williams is the VP of community at OmniSci, responsible for OmniSci’s developer, user, and open source communities. He comes to OmniSci with more than two decades of success building ecosystems around some of software’s most familiar platforms. Most recently, he ran the global community for Mesosphere, including leading the launch and growth of DC/OS as an open source project. Before that, he led the Java Community Process at Sun Microsystems and ecosystem programs at SAP. Aaron has also served as the founding CEO of two startups in the entertainment space. He holds an MS in computer science and a BS in computer engineering from Case Western Reserve University.

Presentations

GPU acceleration to interact with Open Street Map at planet scale Data Case Studies

Aaron Williams explores the explosive growth in the quantity of geospatial data and how this fuels the need to more frequently join geospatial data with traditional data.

Jacob Wilson is a technology principal within PwC Labs, focused on delivering business-led and citizen-led AI capabilities to the U.S. firm and its clients, and the lead solution architect for PwC Labs across AI, data, and automation capabilities. Jacob is a results-driven leader with more than 13 years of experience in enterprise development, business intelligence, and emerging technologies at PwC and prior consulting firms. The PwC Labs team uses innovative technology to automate processes and unlock insights from data through cutting-edge analytics. As part of that team, Jacob works with client and internal teams at PwC to digitize and automate processes, delivering significant time savings and increased efficiency, and he shares his passion for innovation with colleagues drawn from the firm’s leading technologists, data scientists, and data, automation, and AI experts across all of PwC’s lines of service (tax, assurance, advisory, and internal services).

Presentations

Cool AI for practical problems (sponsored by PwC) Session

While many of the solutions to which AI can be applied involve exciting technologies, AI will arguably have greater transformational impact on more mundane problems. Anand Rao offers an overview of how PwC has developed and applied innovative AI solutions to common, practical problems across several domains, such as tax, accounting, and management consulting.

Sean Wiltshire is a senior vice president of solutions at Data Sciences. First and foremost, he’s a scientist. He’s also the director of data and analytics for the Liberal Party of Canada (LPC), which recently went from holding 34 seats (11%) in Canada’s parliament to 184 seats (54%). As the Atlantic noted, this historic win was “…one of the most surprising upsets in Canadian electoral history. Never before has the third-ranked party in one Canadian parliament won a majority government in the next one.” Sean is largely credited with transforming the LPC into a data-driven organization. He came to the LPC after receiving his PhD from the department of human genetics at McGill University in Montreal. As a published author of several scientific manuscripts and one book chapter, Sean came to believe that science needed more friends in government. He decided to get involved with the Liberal Party because his local LPC representative was an astronaut, a reasonable indication that the LPC was pro-science. In its first five months, the new Liberal government invested more new money in basic research than any Canadian government in the last decade, so Sean considers the project a success. When he’s not science-ing, Sean and his wife, Amelia, spend their time looking after their 3-year-old son, Felix.

Presentations

Keynote with Sean Wiltshire Keynote

Sean Wiltshire, Senior Vice President of Solutions at Data Sciences

Kathy Winger is a business, corporate, real estate, banking, and data security attorney representing companies and individuals in commercial and corporate transactions; she’s a solo practitioner in Tucson. She has more than 20 years of experience as an attorney in the private sector, practicing corporate, business, banking, regulatory, compliance, real estate, and consumer and commercial lending law. Previously, she served as in-house counsel to a national bank and financial services company.

Kathy frequently gives presentations on cybersecurity issues for businesses and has spoken to CFOs, financial executives, lawyers, insurance brokers, business owners, and technology professionals, and to groups such as Financial Executives and Affiliates of Tucson, the National Bank of Arizona Women’s Financial Group, and the Automotive Service Association, among others. Nationally, Kathy has spoken about cybersecurity and data breaches, most recently at the Wall Street Journal Pro Cyber Security Symposium in San Diego, Cybersecurity Atlanta (2018), Data Center World (2019), and the Channel Partners Conference & Expo (2019), among many others. She has written articles on cybersecurity and banking topics for national publications and has been interviewed for articles and radio shows discussing cybersecurity, banking, and business topics.

Kathy is the president of the board of directors for the BSA Catalina Council and serves on the advisory board of the National Bank of Arizona Women’s Financial Group. She also serves on the board of directors of the Southern Arizona Children’s Advocacy Center and is a member of the Better Business Bureau of Southern Arizona.

Presentations

Cybersecurity & data breaches from a business lawyer's perspective Session

Kathy Winger walks you through what business owners and technology professionals need to know about potential risks in the cybersecurity arena. You'll learn about current legal and data security issues and practices, along with what’s happening on the regulatory front. Along the way, you'll learn how to mitigate the risks you face.

Paul Wolmering is the vice president of worldwide sales engineering for Actian’s hybrid data solution. Paul has over 30 years of experience in the enterprise data ecosystem, including parallel databases, distributed computing, big data, cloud computing, and supporting production systems. Previously, he led field engineering teams for Informix, Netezza, ParAccel, Pivotal, Cazena, and others.

Presentations

Next-gen hybrid data architecture strategy for data center and cloud (sponsored by Actian) Session

IT use cases have blurred the boundary between operational and analytical workloads, and it's challenging for legacy databases and data warehouses to keep up with the demand for insights. Paul Wolmering cuts through the hype about the next "new thing" to show how real-world, data-driven companies bridge the gap between existing IT investments and a next-gen cloud data warehouse, across data centers and clouds.

Micah Wylde is a software engineer on the streaming compute team at Lyft, focused on the development of Apache Flink and Apache Beam. Previously, he built data infrastructure for fighting internet fraud at SIFT and real-time bidding infrastructure for ads at Quantcast.

Presentations

How Lyft built a streaming data platform on Kubernetes Session

Lyft processes millions of events per second in real time to compute prices, balance marketplace dynamics, and detect fraud, among many other use cases. Micah Wylde showcases how Lyft uses Kubernetes along with Flink, Beam, and Kafka to enable service engineers and data scientists to easily build real-time data applications.

Shuo Xiang is a software engineer on the data platform team at Robinhood.

Presentations

Usability first: The evolution of Robinhood’s data platform Data Case Studies

The data platform at Robinhood has evolved considerably as the scale of its data and its needs have evolved. Shuo Xiang and Grace Lu share the stories behind the evolution of its platform, how it aligns with the company's business use cases, and the challenges encountered and lessons learned.

Huangming Xie is a senior manager of data science at LinkedIn, where he leads the infrastructure data science team to drive resource intelligence, optimize compute and storage efficiency, and automate capacity forecasting for better scalability, as well as to improve site availability for a pleasant member and customer experience. Huangming is an expert at converting data into actionable recommendations that shape strategy and generate direct business impact. Previously, he led initiatives to enable data-driven product decisions at scale and build a great product for more than 600 million LinkedIn members worldwide.

Presentations

Get a CLUE: Optimizing big data compute efficiency Session

Compute efficiency optimization is of critical importance in the big data era, as data science and ML algorithms become increasingly complex and data size increases exponentially over time. Opportunities exist throughout the resource use funnel, which Zhe Zhang and Huangming Xie characterize using a CLUE framework.

Tony Xing is a senior product manager on the AI, data, and infrastructure (AIDI) team within Microsoft’s AI and Research Organization. Previously, he was a senior product manager on the Skype data team within Microsoft’s Application and Service Group, where he worked on products for data ingestion, real-time data analytics, and the data quality platform.

Presentations

Anomaly detection algorithm inspired by computer vision and RL Session

Anomaly detection may sound old-fashioned, but it's critically important in many industrial applications. Tony Xing and Anand Raman outline a novel anomaly detection algorithm based on spectral residual (SR) and convolutional neural networks (CNNs) and explain how it was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.
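
For intuition on the spectral residual piece, here's a rough NumPy sketch of the SR saliency map the approach builds on; the window size, toy series, and injected spike are illustrative assumptions, not Microsoft's production implementation, and in the full SR-CNN method a CNN is then trained on such saliency maps to flag anomalies.

    import numpy as np

    def spectral_residual_saliency(x, window=3):
        """Return a saliency map in which unusual points stand out."""
        fft = np.fft.fft(x)
        amplitude = np.abs(fft)
        phase = np.angle(fft)
        log_amp = np.log(amplitude + 1e-8)
        # Spectral residual = log spectrum minus its local (moving) average.
        kernel = np.ones(window) / window
        avg_log_amp = np.convolve(log_amp, kernel, mode="same")
        residual = log_amp - avg_log_amp
        # Back to the time domain, keeping the original phase.
        return np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))

    series = np.sin(np.linspace(0, 20, 200)) + np.random.normal(0, 0.05, 200)
    series[120] += 3.0  # inject a spike
    saliency = spectral_residual_saliency(series)
    print(int(np.argmax(saliency)))  # the injected spike should dominate the map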

Chendi Xue is a software engineer on the data analytics team at Intel. She has more than five years of experience in big data and cloud system optimization, focusing on storage and network software stack performance analysis and optimization. Her development work includes Spark shuffle optimization, Spark SQL columnar-based execution, compute-side cache implementation, and a storage benchmark tool. Previously, she worked on Linux device mapper optimization and iSCSI optimization during her master's degree studies.

Presentations

Accelerating Spark-SQL with AVX-supported vectorization Session

Chendi Xue and Jian Zhang explore how Intel accelerated Spark SQL with AVX-supported vectorization technology. They outline the design and evaluation, including how to enable columnar processing in Spark SQL, how to use Arrow as the intermediate data format, how to leverage AVX-enabled Gandiva for data processing, and a performance analysis with system metrics and breakdowns.
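
The session covers Intel's own plugin work, but as a generic illustration of the underlying idea (keeping data columnar and exchanging it through Apache Arrow so vectorized kernels can process whole batches), here's a minimal PySpark sketch; it assumes Spark 3.x with PyArrow installed and is not the Gandiva-based implementation the speakers describe.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = (SparkSession.builder
             .appName("arrow-columnar-demo")
             # Use Arrow for Spark <-> pandas conversions instead of row-by-row pickling.
             .config("spark.sql.execution.arrow.pyspark.enabled", "true")
             .getOrCreate())

    @pandas_udf("double")
    def to_fahrenheit(celsius: pd.Series) -> pd.Series:
        # Operates on Arrow-backed column batches, not one row at a time.
        return celsius * 9.0 / 5.0 + 32.0

    df = spark.range(1_000_000).withColumnRenamed("id", "celsius")
    df.select(to_fahrenheit("celsius").alias("fahrenheit")).show(5)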

Megahanath Macha Yadagiri is a graduate research assistant at Carnegie Mellon University.

Presentations

CrEIO: Critical Events Identification for Online purchase funnel Session

Identifying customer stages in a buying cycle enables you to perform personalized targeting based on the stage. Shankar Venkitachalam, Megahanath Macha Yadagiri, and Deepak Pai identify ML techniques to analyze a customer's clickstream behavior to find the different stages of the buying cycle and quantify the critical click events that help transition a user from one stage to another.

Binwei Yang is a software architect at Intel, focusing on performance optimization of big data software, accelerator design and utilization in big data frameworks, and big data and HPC framework integration. Previously, Binwei worked on Intel's microarchitecture team, focusing on performance simulation and analysis.

Presentations

Accelerating Spark-SQL with AVX-supported vectorization Session

Chendi Xue and Jian Zhang explore how Intel accelerated Spark SQL with AVX-supported vectorization technology. They outline the design and evaluation, including how to enable columnar processing in Spark SQL, how to use Arrow as the intermediate data format, how to leverage AVX-enabled Gandiva for data processing, and a performance analysis with system metrics and breakdowns.

Jin Yang (she/her) is a data scientist at SurveyMonkey, where she’s working on a variety of projects that leverage the power of machine learning models to improve customer experience. She loves data and considers an efficient and accurate data pipeline to be critical for successful model development and deployment. She’s very excited to share her experience on the journey of building a data interface that’s fully oriented toward supporting an effective data science workflow. In her free time, she loves running and hiking and will try to finish her first marathon this year.

Presentations

Accelerate your organization: Make data optimal for ML Session

Organizations increasingly leverage ML to increase value to customers and understand their business. You may have created models, but now you need to scale. Shubhankar Jain, Aliaksandr Padvitselski, and Jin Yang use a case study to teach you how to pinpoint inefficiencies in your ML data flow, how SurveyMonkey tackled this, and how to make your data more usable to accelerate ML model development.

Giridhar Yasa is a principal architect at Flipkart. He’s a technology leader with a consistent track record of leading teams from concept to successful delivery of complex software products, strong team-building and mentoring skills, multiple cited peer-reviewed journal and conference papers, and patents. His specialties are distributed systems, scalable software system architecture, storage software, networking and internet protocols, mobile communication protocols, system performance, free and open source software, languages and tools (C, C++, Python), Unix-like operating systems, and Debian.

Presentations

Architecture patterns for BCP and DR at enterprise scale at Flipkart Session

Utkarsh B. and Giridhar Yasa lead a deep dive into architectural patterns and the solutions Flipkart developed to ensure business continuity for millions of online customers, and how it leveraged technology to avert or mitigate risks from catastrophic failures. Solving for business continuity requires investments in applications, data management, and infrastructure.

Wenming Ye is an AI and ML solutions architect at Amazon Web Services, helping researchers and enterprise customers use cloud-based machine learning services to rapidly scale their innovations. Previously, Wenming had diverse R&D experience at Microsoft Research, an SQL engineering team, and successful startups.

Presentations

Put DL to work: A practical introduction using Amazon Web Services 2-Day Training

Machine learning (ML) and deep learning (DL) projects are becoming increasingly common at enterprises and startups alike and have been a key innovation engine for Amazon businesses such as Go, Alexa, and Robotics. Wenming Ye demonstrates a practical next step in DL learning with instructions, demos, and hands-on labs.

Put DL to work: A practical introduction using Amazon Web Services Training Day 2

Machine learning (ML) and deep learning (DL) projects are becoming increasingly common at enterprises and startups alike and have been a key innovation engine for Amazon businesses such as Go, Alexa, and Robotics. Wenming Ye demonstrates a practical next step in DL learning with instructions, demos, and hands-on labs.

Jian Zhang is a senior software engineering manager at Intel, where he and his team primarily focus on open source storage development and optimization on Intel platforms and build reference solutions for customers. He has 10 years of experience doing performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph, working with the Hadoop distributed file system (HDFS), and benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.

Presentations

Accelerating Spark-SQL with AVX-supported vectorization Session

Chendi Xue and Jian Zhang explore how Intel accelerated Spark SQL with AVX-supported vectorization technology. They outline the design and evaluation, including how to enable columnar processing in Spark SQL, how to use Arrow as the intermediate data format, how to leverage AVX-enabled Gandiva for data processing, and a performance analysis with system metrics and breakdowns.

Yong Zhang is a software engineer at StreamNative. He’s also an Apache Pulsar and Apache BookKeeper contributor, focusing on Pulsar transactions, storage, and tooling.

Presentations

Transactional event streaming with Apache Pulsar Session

Sijie Guo and Yong Zhang lead a deep dive into the details of Pulsar transactions and how they can be used in Pulsar Functions and other processing engines to achieve transactional event streaming.

Zhe Zhang is a senior manager of core big data infrastructure at LinkedIn, where he leads an excellent engineering team to provide big data services (Hadoop distributed file system (HDFS), YARN, Spark, TensorFlow, and beyond) to power LinkedIn’s business intelligence and relevance applications. Zhe’s an Apache Hadoop PMC member; he led the design and development of HDFS Erasure Coding (HDFS-EC).

Presentations

Get a CLUE: Optimizing big data compute efficiency Session

Compute efficiency optimization is of critical importance in the big data era, as data science and ML algorithms become increasingly complex and data size increases exponentially over time. Opportunities exist throughout the resource use funnel, which Zhe Zhang and Huangming Xie characterize using a CLUE framework.

Alice Zhao is a senior data scientist at Metis, where she teaches 12-week data science bootcamps. Previously, she was the first data scientist at Cars.com, supporting functions from marketing to technology; cofounded a data science education startup, Best Fit Analytics Workshop, where she taught weekend courses to professionals at 1871 in Chicago; was an analyst at Redfin; and was a consultant at Accenture. She blogs about analytics and pop culture on A Dash of Data; her post “How Text Messages Change From Dating to Marriage” made the front page of Reddit, gaining over half a million views in its first week. She’s passionate about teaching and mentoring and loves using data to tell fun and compelling stories. She has an MS in analytics and a BS in electrical engineering, both from Northwestern University.

Presentations

Introduction to natural language processing in Python Tutorial

Data scientists crunch numbers. But you may run into text data. Alice Zhao teaches you how to turn text data into a format a machine can understand, identifies the most popular text analytics techniques, and showcases natural language processing (NLP) libraries in Python, including the natural language toolkit (NLTK), TextBlob, spaCy, and gensim.
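
As a small taste of what the tutorial covers, here's a self-contained sketch of turning raw text into tokens and a document-term matrix a model can consume; the example sentences are illustrative, it assumes nltk, textblob, and scikit-learn are installed with NLTK's punkt tokenizer downloaded, and spaCy and gensim are omitted here for brevity.

    import nltk
    from textblob import TextBlob
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "Data scientists crunch numbers.",
        "But sometimes they crunch text instead of numbers!",
    ]

    # Tokenization with NLTK (run nltk.download("punkt") once beforehand).
    print(nltk.word_tokenize(docs[1].lower()))

    # Quick polarity score with TextBlob (-1.0 negative to +1.0 positive).
    print(TextBlob(docs[1]).sentiment.polarity)

    # Bag-of-words document-term matrix: the numeric format a model understands.
    vectorizer = CountVectorizer(stop_words="english")
    dtm = vectorizer.fit_transform(docs)
    print(sorted(vectorizer.vocabulary_))
    print(dtm.toarray())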

Alice Zheng is a senior manager of applied science on the machine learning optimization team on Amazon’s advertising platform. She specializes in research and development of machine learning methods, tools, and applications and is the author of Feature Engineering for Machine Learning. Previously, Alice worked at GraphLab, Dato, and Turi, where she led the machine learning toolkits team and spearheaded user outreach, and was a researcher in the machine learning group at Microsoft Research, Redmond. Alice holds PhD and BA degrees in computer science and a BA in mathematics, all from UC Berkeley.

Presentations

Lessons learned from building large ML systems Session

Alice Zheng shares four lessons learned from building and operating large-scale, production-grade machine learning systems at Amazon, useful for practitioners and would-be practitioners in the field.

Lili Zhou is a manager on the data science team at LinkedIn. Lili has extensive experience in customer operations, billing and collection, risk management, fraud detection, revenue forecasting, and online gaming. She’s passionate about leveraging large-scale data analytics and modeling to drive insights and business value.

Presentations

LinkedIn end-to-end data product to measure customer happiness Session

Studies show that good customer service strengthens customers' attachment to a product, which increases product engagement and revenue. Traditionally, customer surveys are used to measure how customers feel about services and products. Kelly Wan, Jason Wang, and Lili Zhou examine LinkedIn's innovative data product for measuring customer happiness.

Contact us

confreg@oreilly.com

For conference registration information and customer service

partners@oreilly.com

For more information on community discounts and trade opportunities with O’Reilly conferences

Become a sponsor

For information on exhibiting or sponsoring a conference

pr@oreilly.com

For media/analyst press inquiries