Mar 15–18, 2020

Speakers

Hear from innovative programmers, talented managers, and senior executives who are doing amazing things with data and AI. More speakers will be announced; please check back for updates.


Arpit Agarwal is an engineer in the storage team at Cloudera and an active HDFS/Hadoop committer since 2013.

Presentations

It’s 2020 now: Apache Hadoop 3.x state of the union and upgrade guidance Session

In 2020, Hadoop is still evolving fast. You'll learn the current status of the Apache Hadoop community and the exciting present and future of Hadoop 3.x. Wangda Tan and Arpit Agarwal cover new features like Hadoop in the cloud, GPU support, NameNode federation, Docker support, 10x scheduling improvements, and Ozone, and they offer guidance on upgrading from 2.x to 3.x.

John-Mark Agosta is a principal data scientist in IMML at Microsoft. Previously, he worked with startups and labs in the Bay Area, including the original Knowledge Industries, and was a researcher at Intel Labs, where he was awarded a Santa Fe Institute Business Fellowship in 2007, and at SRI International after receiving his PhD from Stanford. He has participated in the annual Uncertainty in AI conference since its inception in 1985, proving his dedication to probability and its applications. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.

Presentations

Machine learning for managers Tutorial

Bob Horton, Mario Inchiosa, and John-Mark Agosta offer an overview of the fundamental concepts of machine learning (ML) for business and healthcare decision makers and software product managers, so you'll be able to make more effective use of ML results and better evaluate opportunities to apply ML in your industry.

Mudasir Ahmad is a distinguished engineer and senior director at Cisco. He’s been involved with design and algorithms for 17 years. Mudasir leads the Center of Excellence for Numerical Analysis, developing new analytical and stochastic algorithms. He’s also involved with implementing IoT, artificial intelligence, and big data analytics to streamline supply chain operations. Mudasir has delivered several invited talks internationally on leading technology solutions. He has over 30 publications on microelectronic packaging, two book chapters, and 13 US patents. In 2012, he received the IEEE's internationally renowned Outstanding Young Engineer Award. He earned an MS in management science and engineering from Stanford University, an MS in mechanical engineering from the Georgia Institute of Technology, and a bachelor's degree from Ohio University.

Presentations

Executive Briefing: Real-life application of artificial intelligence in supply chain operations Session

Artificial intelligence (AI) is a natural fit for supply chain operations, where decisions and actions need to be taken daily or even hourly about delivery, manufacturing, quality, logistics, and planning. Mudasir Ahmad explains how AI can be implemented in a scalable and cost-effective way in your business' supply chain operations, and he identifies benefits and potential challenges.

Subutai Ahmad is the VP of research at Numenta, bringing experience across real-time systems, computer vision, and machine learning. Previously, he was VP of engineering at YesVideo, where he helped grow the company from a three-person startup to a leader in automated digital media authoring. In 1997, Subutai cofounded ePlanet Interactive, a spin-off from Interval Research; ePlanet developed the IntelPlay Me2Cam, the first computer vision product developed for consumers. He has also served as a key researcher at Interval Research.

Subutai holds a BS in computer science from Cornell University and a PhD in computer science from the University of Illinois at Urbana-Champaign, where his thesis explored computational neuroscience models of visual attention.

Presentations

How can we be so dense? The benefits of using highly sparse representations Session

Given that today's machine learning systems can't come close to the brain's flexibility and generality, it's natural to ask what we can learn from the brain to improve them. Sparsity provides a great starting point. Subutai Ahmad covers both how sparsity works in the brain and how applying sparsity to artificial neural networks provides significant advantages.

Alasdair Allan is a director at Babilim Light Industries and a scientist, author, hacker, maker, and journalist. An expert on the internet of things and sensor systems, he’s famous for hacking hotel radios, deploying mesh networked sensors through the Moscone Center during Google I/O, and for being behind one of the first big mobile privacy scandals when, back in 2011, he revealed that Apple’s iPhone was tracking user location constantly. He’s written eight books and writes regularly for Hackster.io, Hackaday, and other outlets. A former astronomer, he also built a peer-to-peer autonomous telescope network that detected what was, at the time, the most distant object ever discovered.

Presentations

Dealing with data on the edge Session

Much of the data we collect is thrown away, but that's about to change; the power envelope needed to run machine learning models on embedded hardware has fallen dramatically, enabling you to put the smarts on the device rather than in the cloud. Alasdair Allan explains how the data you throw away can be processed in real time at the edge, which has huge implications for how you deal with data.

Shradha Ambekar is a staff software engineer in the Small Business Data Group at Intuit, where she’s the technical lead for the lineage framework (SuperGLUE) and real-time analytics. She has made several key contributions to solutions built around the data platform and has contributed to the spark-cassandra-connector. She has experience with the Hadoop Distributed File System (HDFS), Hive, MapReduce, Spark, Kafka, Cassandra, and Vertica. Previously, she was a software engineer at Rearden Commerce. Shradha spoke at the O’Reilly Open Source Conference in 2019. She holds a bachelor’s degree in electronics and communication engineering from NIT Raipur, India.

Presentations

Always accurate business metrics through lineage-based anomaly tracking Session

Debugging data pipelines is nontrivial and finding the root cause can take hours to days. Shradha Ambekar and Sunil Goplani outline how Intuit built a self-serve tool that automatically discovers data pipeline lineage and applies anomaly detection to detect and help debug issues in minutes—establishing trust in metrics and improving developer productivity by 10x–100x.

David Anderson is a training coordinator at Ververica, the original creators of Apache Flink. He’s delivered training and consulting to many of the world’s leading banks, telecommunications providers, and retailers. Previously, he led the development of data-intensive applications for companies across Europe.

Presentations

Apache Flink developer training 2-Day Training

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Apache Flink developer training (Day 2) Training Day 2

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Event-driven applications made easy with Apache Flink Tutorial

David Anderson and Seth Wiesman demonstrate how building and managing scalable, stateful, event-driven applications can be easier and more straightforward than you might expect. You'll go hands-on to implement a ride-sharing application together.

Jesse Anderson is a big data engineering expert and trainer.

Presentations

Professional Kafka development 2-Day Training

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.

Professional Kafka development (Day 2) Training Day 2

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.

Manohar Angani is a machine learning engineer at SurveyMonkey, where he works on productionizing models and integrating them with SurveyMonkey products. Previously, he worked in other groups within the company, such as growth. In his free time, he likes to hang out with his family and explore the Bay Area.

Presentations

Accelerating your organization: Making data optimal for machine learning Session

Every organization leverages ML to increase value to customers and understand their business. You may have created models, but now you need to scale. Shubhankar Jain, Aliaksandr Padvitselski, and Manohar Angani use a case study to teach you how to pinpoint inefficiencies in your ML data flow, how SurveyMonkey tackled this, and how to make your data more usable to accelerate ML model development.

Eitan Anzenberg is the director of data science at Bill.com and has many years of experience as a scientist and researcher. His recent focus is in machine learning, deep learning, applied statistics, and engineering. Previously, Eitan was a postdoctoral scholar at Lawrence Berkeley National Lab, received his PhD in physics from Boston University, and his BS in astrophysics from University of California, Santa Cruz. Eitan has 2 patents and 11 publications to date and has spoken about data at various conferences around the world.

Presentations

Beyond OCR: Using deep learning to understand documents Session

Although the field of optical character recognition (OCR) has been around for half a century, document parsing and field extraction from images remain an open research topic. Eitan Anzenberg leads a deep dive into a learning architecture that leverages document understanding to extract fields of interest.

Using deep learning to understand documents Session

Although the field of optical character recognition (OCR) has been around for almost half a century, document parsing and field extraction from images remain an open research topic. Eitan Anzenberg digs into using an end-to-end deep learning and OCR architecture to predict regions of interest within documents and automatically extract their text.

She is the principal data scientist at Atlan, a data product company, and a guest lecturer in applied econometrics at IIM Kashipur in India. She loves designing and building scalable data products with features that look and feel customized to every user.

Presentations

Predicting malaria using ML and satellite data Session

Malaria incidence is highly dependent on the environmental, demographic, and infrastructural conditions of the affected region. Time series analysis of malaria cases against these variables can help identify the problematic factors as well as predict the expected number of cases absent any external intervention. This session dives deep into these indicators and the prediction model.

Muhammad Asfar is a PhD candidate at Airlangga University and a political consultant at Pusdeham. He’s written more than 200 academic papers, magazine articles, and books related to politics. He’s fully aware that technology is shaping a new spectrum in politics, both now and in the future, so he’s passionately learning about big data and new technology adoption in politics to fill the gap that voting behavior theories cannot yet explain.

Presentations

Political mapping with big data: Indonesia’s presidential election 2019 case Session

With the disclosure of the Cambridge Analytica scandal, political practitioners have started to adopt big data technology to give them better understanding and management of data. Qorry Asfar and Muhammad Asfar provide a big data case study to develop political strategy and examine how technological adoption will shape a better political landscape.

Qorry Asfar is a data analyst at Pusdeham Prodata Indonesia and has two years’ experience in voting behavior, political campaigns, and political advisory. She’s passionate about learning new ways to creatively use political data for better management and strategy. She occasionally attends big data conferences to gain insight into technological adoption that can be used with political data.

Presentations

Political mapping with big data: Indonesia’s presidential election 2019 case Session

With the disclosure of the Cambridge Analytica scandal, political practitioners have started to adopt big data technology to give them better understanding and management of data. Qorry Asfar and Muhammad Asfar provide a big data case study to develop political strategy and examine how technological adoption will shape a better political landscape.

Jatinder Assi is a data engineering manager at GumGum and is enthusiastic about building scalable distributed applications and business-driven data products.

Presentations

Real-time forecasting at scale using Delta Lake Session

GumGum receives 30 billion programmatic inventory impressions amounting to 25 TB of data per day. By generating near-real-time inventory forecast based on campaign-specific targeting rules, it enables users to set up successful future campaigns. Rashmina Menon and Jatinder Assi highlight the architecture that enables forecasting in less than 30 seconds with Delta Lake and Databricks Delta caching.

Utkarsh B. is the technology advisor to the CEO, a distinguished architect, and a senior principal architect at Flipkart. He’s been driving architectural blueprints and coherence across diverse platforms in Flipkart through multiple generations of their evolution and leveraging technology to solve for scale, resilience, business continuity, and disaster recovery. He has extensive experience (18+ years) in building platforms across a wide spectrum of technical and functional problem domains.

Presentations

Architectural patterns for business continuity and disaster recovery: Applied to Flipkart Session

Utkarsh B. and Giridhar Yasa lead a deep dive into architectural patterns and the solutions Flipkart developed to ensure business continuity for millions of online customers, and how it leveraged technology to avert or mitigate risks from catastrophic failures. Solving for business continuity requires investments in applications, data management, and infrastructure.

Giriraj Bagdi is a DevOps leader of cloud and data at Intuit, where he leads infrastructure engineering and SRE teams in delivering technology and functional capabilities for online platforms. He has driven and managed large, complex initiatives in cloud data infrastructure, automation engineering, big data, and database transactional platforms. Giriraj has extensive knowledge of building engineering solutions and platforms that improve the operational efficiency of cloud infrastructure in the areas of command and control and data reliability for big data, high-transaction, high-volume, and high-availability environments. He drives the initiative to transform big data engineering and migrate to AWS big data technologies such as EMR, Athena, and QuickSight. He’s an innovative, energetic, and goal-oriented technologist and a team player with strong problem-solving skills.

Presentations

10 lead indicators before data becomes a mess Session

Data quality metrics focus on quantifying whether data is a mess, but you need lead indicators before data becomes a mess. Sandeep Uttamchandani, Giriraj Bagdi, and Sunil Goplani explore developing lead indicators of data quality for Intuit's production data pipelines. You'll learn about the details of lead indicators, optimization tools, and the lessons that moved the needle on data quality.

Bahman Bahmani is the vice president of data science and engineering at Rakuten (the seventh-largest internet company in the world), managing an AI organization with engineering and data science managers, data scientists, machine learning engineers, and data engineers globally distributed across three continents, and he’s in charge of the end-to-end AI systems behind the Rakuten Intelligence suite of products. Previously, Bahman built and managed engineering and data science teams across industry, academia, and the public sector in areas including digital advertising, consumer web, cybersecurity, and nonprofit fundraising, where he consistently delivered substantial business value. He also designed and taught courses, led an interdisciplinary research lab, and advised theses in the Computer Science Department at Stanford University, where he also did his own PhD focused on large-scale algorithms and machine learning, topics on which he’s a published author.

Presentations

AI in the new era of personal data protection Session

With California Consumer Privacy Act (CCPA) looming near, Europe’s GDPR still sending shockwaves, and public awareness of privacy breaches heightening, we're in the early days of a new era of personal data protection. Bahman Bahmani explores the challenges and opportunities for AI in this new era and provides actionable insights for you to navigate your path to AI success.

Kamil Bajda-Pawlikowski is a cofounder and CTO of the enterprise Presto company Starburst. Previously, Kamil was the chief architect at the Teradata Center for Hadoop in Boston, focusing on the open source SQL engine Presto, and the cofounder and chief software architect of Hadapt, the first SQL-on-Hadoop company (acquired by Teradata). Kamil began his journey with Hadoop and modern MPP SQL architectures about 10 years ago during a doctoral program at Yale University, where he co-invented HadoopDB, the original foundation of Hadapt’s technology. He holds an MS in computer science from Wroclaw University of Technology and both an MS and an MPhil in computer science from Yale University.

Presentations

Presto on Kubernetes: Query anything, anywhere Session

Kamil Bajda-Pawlikowski explores Presto, an open source SQL engine, featuring low-latency queries, high concurrency, and the ability to query multiple data sources. With Kubernetes, you can easily deploy and manage Presto clusters across hybrid and multicloud environments with built-in high availability, autoscaling, and monitoring.

Claudiu Barbura is a director of engineering at Blueprint, and he oversees product engineering, where he builds large-scale advanced analytics pipelines, IoT, and data science applications for customers in oil and gas, energy, and retail industries. Previously, he was the vice president of engineering at UBIX.AI, automating data science at scale, and senior director of engineering, xPatterns platform services at Atigeo, building several advanced analytics platforms and applications in healthcare and financial industries. Claudiu is a hands-on architect, dev manager, and executive with 20+ years of experience in open source, big data science and Microsoft technology stacks and a frequent speaker at data conferences.

Presentations

The power of GPUs for data virtualization in Tableau and PowerBI and beyond Session

Claudiu Barbura walks through a tech stack for BI tools and data science notebooks, using live demos to share the lessons learned using Spark (CPU), BlazingSQL and Rapids.ai (GPU), and Apache Arrow in the quest to exponentially increase the performance of the data virtualizer, which enables real-time access to data sources across cloud providers, on-premises databases, and APIs.

Chris Bartholomew has been working on high-performance pub-sub systems for over a dozen years. During that time he has tested, supported, and operated messaging systems deployed in the banking, capital markets, and transportation industries. Recently, he founded Kafkaesque, a managed streaming and queuing service based on the Apache Pulsar open source project.

Presentations

Getting started streaming and queuing in Apache Pulsar Interactive session

Chris Bartholomew walks you through the architecture and important concepts of Apache Pulsar. You'll set up a local Apache Pulsar environment and use the Python API to do publish/subscribe (pub/sub) message streaming, fanning out messages to multiple consumers.

Benjamin Batorsky is the associate director of data science at MIT Sloan, where he leads data science projects for the Food Supply Chain and Analytics group. Previously, he worked on the data science team at ThriveHive, where he scoped and built data products by leveraging multimodal datasets on small businesses and their customers. In his work, he's often posed difficult business questions and develops and executes strategies for answering them with one-off analytic products or production-ready prototypes. He earned his PhD in policy analysis from the RAND Corporation, working on analytics projects in the areas of health, policy, and infrastructure.

Presentations

Named-entity recognition from scratch with spaCy Session

Identifying and labeling named entities such as companies or people in text is a key part of text processing pipelines. Benjamin Batorsky outlines how to train, test, and implement a named entity recognition (NER) model with spaCy. You'll get a sneak peek at how to use these techniques with large, non-English corpora.

Steven Beales is a senior vice president of IT at WCG. He has 25 years of experience in IT and has spent over 16 years in the pharmaceutical industry. He led implementation of the clinical trial portal at Genentech across 100+ countries and of the clinical trial safety portal at a top-5 pharma organization, which included a data-driven rules engine configured with safety regulations from those countries, saving the organization hundreds of millions of dollars. Over 50 million safety alerts have been distributed by these two portals via the cloud. Previously, he was the chief software architect at mdlogix, where he led the implementation of CTMS systems for Johns Hopkins University, Washington University in St. Louis, the University of Pittsburgh, and the Interactive Autism Network for Autism Speaks.

Presentations

Pragmatic artificial intelligence in the biopharmaceutical industry Session

Steven Beales describes applications of NLP, machine learning, and the data-driven rules that generate significant productivity and quality improvements in the complex business workflows of drug safety and pharmacovigilance without large upfront investment. Pragmatic use of AI allows organizations to create immediate value and ROI before widening adoption as their capabilities with AI increase.

Ian Beaver, PhD, is the chief scientist at Verint Intelligent Self Service, a provider of conversational AI systems for enterprise businesses. Ian has been publishing discoveries in the field of AI since 2005 on topics surrounding human-computer interaction, such as gesture recognition, user preference learning, and communication with multimodal automated assistants. He has presented his work at academic and industry conferences and authored over 30 patents in the field of human language technology. His extensive experience and access to large volumes of real-world human-machine conversation data have made him a leading voice in the conversational analysis of dialog systems. Ian currently leads a team finding ways to optimize human productivity through automation and augmentation, using symbiotic relationships with machines.

Presentations

Deploying chatbots and conversational analysis: Learn what customers really want to know Data Case Studies

Chatbots are increasingly used in customer service as a first tier of support. Through deep analysis of conversation logs, you can learn real user motivations and where company improvements can be made. Ian Beaver compares building versus buying when deploying self-service bots, covers motivations and techniques for deep conversational analysis, and discusses real-world discoveries.

Peyman Behbahani is a senior AI architect at Wipro, helping various industries build real-world, large-scale AI applications for their businesses. He earned his PhD in electronic engineering at City, University of London in 2011. His main research and development interests are AI, computer vision, mathematical modeling, and forecasting.

Presentations

An approach to automate time and motion analysis Session

Time and motion study of manufacturing operations on a shop floor is traditionally carried out through manual observation, which is time-consuming and subject to human error and limitations. Sundar Varadarajan and Peyman Behbahani detail a new approach that combines video analytics with time series analysis to automate activity identification and timing measurements.

Austin Bennett is a data engineer at Sling Media (a DISH company), where he develops systems and mentors aspiring data scientists. Austin is a cognitive linguist and researcher with an interest in multimodal communication, largely through Redhenlab.org. He’s enthusiastic about the promise of Apache Beam, is very active in the community, and has trained people around the world to use and contribute to the open source project.

Presentations

Unifying batch and stream processing with Apache Beam Interactive session

Austin Bennett offers hands-on training with the Apache Beam programming model. Beam is an open source unified model for batch and stream data processing that runs on execution engines like Google Cloud Dataflow, Apache Flink, and Apache Spark.

William Benton is an engineering manager and senior principal software engineer at Red Hat, where he leads a team of data scientists and engineers. He’s applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His focus is investigating the best ways to build and deploy intelligent applications in cloud native environments, but he’s also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.

Presentations

What nobody told you about machine learning in the hybrid cloud Session

Cloud native infrastructure like Kubernetes has obvious benefits for machine learning systems, allowing you to scale out experiments, train on specialized hardware, and conduct A/B tests. What isn’t obvious are the challenges that come up on day two. Sophie Watson and William Benton share their experience helping end users navigate these challenges and make the most of new opportunities.

Lukas Biewald is the founder and CEO of Weights & Biases, his second major contribution to advances in the machine learning field. In 2009, Lukas founded Figure Eight, formerly CrowdFlower, which was acquired by Appen in 2019. Lukas has dedicated his career to optimizing ML workflows and teaching ML practitioners, making machine learning more accessible to all.

Presentations

Using Keras to classify text with LSTMs and other ML techniques Tutorial

Join Lukas Biewald to build and deploy long short-term memory networks (LSTMs), gated recurrent units (GRUs), and other text classification techniques using Keras and scikit-learn.

Sarah Bird is a principal program manager at Microsoft, where she leads research and emerging technology strategy for Azure AI. Sarah works to accelerate the adoption and impact of AI by bringing together the latest research innovations with the best of open source and product expertise to create new tools and technologies. She leads the development of responsible AI tools in Azure Machine Learning. She’s also an active member of the Microsoft Aether committee, where she works to develop and drive company-wide adoption of responsible AI principles, best practices, and technologies. Previously, Sarah was one of the founding researchers in the Microsoft FATE research group and worked on AI fairness at Facebook. She’s an active contributor to the open source ecosystem: she cofounded ONNX, an open source standard for machine learning models, and was a leader in the PyTorch 1.0 project. She was an early member of the machine learning systems research community and has been active in growing and forming it; she cofounded the SysML research conference and the Learning Systems workshops. She holds a PhD in computer science from the University of California, Berkeley, advised by Dave Patterson, Krste Asanovic, and Burton Smith.

Presentations

An overview of responsible artificial intelligence Tutorial

Mehrnoosh Sameki and Sarah Bird examine six core principles of responsible AI: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability, focusing on transparency, fairness, and privacy. You'll discover best practices and state-of-the-art open source toolkits that empower researchers, data scientists, and stakeholders to build trustworthy AI systems.

Levan Borchkhadze is a senior data scientist at TBC Bank, where his main responsibility is supervising multiple data science projects. He earned BBA and MBA degrees from Georgian American University and has a wide variety of work experience across industries as a financial analyst, business process analyst, and ERP systems implementation specialist. Levan earned his master’s degree in big data solutions from Barcelona Technology School.

Presentations

A novel approach of recommender systems in retail banking Session

TBC Bank is in transition from a product-centric to a client-centric model. An obvious application of analytics is developing personalized next-best-product recommendations for clients. George Chkadua and Levan Borchkhadze explain why the bank decided to implement the ALS user-item matrix factorization method and a demographic model. As a result, the pilot increased sales conversion rates by 70%.

Dhruba Borthakur is cofounder and CTO at Rockset, a company building software to enable data-powered applications. Previously, Dhruba was the founding engineer of the open source RocksDB database at Facebook and one of the founding engineers of the Hadoop file system at Yahoo; an early contributor to the open source Apache HBase project; a senior engineer at Veritas, where he was responsible for the development of VxFS and the Veritas SanPointDirect storage system; the cofounder of Oreceipt.com, an ecommerce startup based in Sunnyvale; and a senior engineer at IBM-Transarc Labs, where he contributed to the development of the Andrew File System (AFS), a part of IBM’s ecommerce initiative, WebSphere. Dhruba holds an MS in computer science from the University of Wisconsin-Madison and a BS in computer science from BITS Pilani, India. He has 25 issued patents.

Presentations

Building live dashboards on Amazon DynamoDB using Rockset Session

Rockset is a serverless search and analytics engine that enables real-time search and analytics on raw data from Amazon DynamoDB—with full featured SQL. Kshitij Wadhwa and Dhruba Borthakur explore how Rockset takes an entirely new approach to loading, analyzing, and serving data so you can run powerful SQL analytics on data from DynamoDB without ETL.

Mario Bourgoin is a senior data scientist at Microsoft, where he helps the company’s efforts to democratize AI, and a mathematician, data scientist, and statistician with a broad and deep knowledge of machine learning, artificial intelligence, data mining, statistics, and computational mathematics. Previously, he taught at several institutions and joined a Boston-area startup, where he worked on medical and business applications. He earned his PhD in mathematics from Brandeis University in Waltham, Massachusetts.

Presentations

Using the cloud to scale up hyperparameter optimization for machine learning Session

Hyperparameter optimization for machine learning is complex, requires advanced optimization techniques, and can be implemented as a generic framework decoupled from specific details of algorithms. Fidan Boylu Uz, Mario Bourgoin, and George Iordanescu apply such a framework to tasks like object detection and text matching in a transparent, scalable, and easy-to-manage way in a cloud service.
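To illustrate what "a generic framework decoupled from specific details of algorithms" means, here is a minimal random-search sketch: the framework only sees a scoring callable and a parameter space, never the model itself. This is an assumption-laden toy, not the speakers' cloud implementation; the name `random_search` and its signature are invented for the example.

```python
import random

def random_search(train_eval, space, n_trials=20, seed=0):
    """Generic random-search hyperparameter optimization.

    train_eval: callable(params) -> score (higher is better).
    space: dict mapping parameter name to a list of candidate values.
    The framework knows nothing about the underlying algorithm, so the
    same loop can tune an object detector or a text-matching model.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Sample one candidate configuration from the space.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = train_eval(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In a cloud setting, the `train_eval` calls inside the loop are what gets fanned out to parallel workers; the search logic itself stays unchanged.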

Fidan Boylu Uz is a senior data scientist at Microsoft, where she’s responsible for the successful delivery of end-to-end advanced analytic solutions. She’s also worked on a number of projects on predictive maintenance and fraud detection. Fidan has 10+ years of technical experience on data mining and business intelligence. Previously, she was a professor conducting research and teaching courses on data mining and business intelligence at the University of Connecticut. She has a number of academic publications on machine learning and optimization and their business applications and holds a PhD in decision sciences.

Presentations

Using the cloud to scale up hyperparameter optimization for machine learning Session

Hyperparameter optimization for machine learning is complex, requires advanced optimization techniques, and can be implemented as a generic framework decoupled from specific details of algorithms. Fidan Boylu Uz, Mario Bourgoin, and George Iordanescu apply such a framework to tasks like object detection and text matching in a transparent, scalable, and easy-to-manage way in a cloud service.

Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies using big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, and Claudiu Branzan detail the application of the latest advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Navinder Pal Singh Brar is a senior data engineer at Walmart Labs, where he’s worked with the Kafka ecosystem, especially Kafka Streams, for the last couple of years. He built a customer data platform on top of it to meet the company’s need to process billions of customer events per day in real time and trigger machine learning models on each event. He’s an active contributor to Kafka Streams and filed three patents last year. Navinder is a regular speaker at local and international events on real-time stream processing, data platforms, and Kafka.

Presentations

Real-time fraud detection with Kafka Streams Session

One of the major use cases for stream processing is real-time fraud detection. Ecommerce companies must deal with fraud at a wider scale as more of them move to subscription-based models that offer customers incentives such as free shipping. Navinder Pal Singh Brar dives into the architecture, problems faced, and lessons learned from building such a pipeline.

Jay Budzik is the chief technology officer at ZestFinance, where he oversees Zest’s product and engineering teams. His passion for inventing new technologies—particularly in data mining and AI—has played a central role throughout his career. Previously, he held various positions, including founding an AI enterprise search company, helping major media organizations apply AI and machine learning to expand their audiences and revenue, and developing systems that process tens of trillions of data points. Jay has a PhD in computer science from Northwestern University.

Presentations

Introducing GIG: A new method for explaining any ensemble ML model Session

More companies are adopting machine learning (ML) to run key business functions. The best-performing models combine diverse model types into stacked ensembles, but explaining these hybrid models has been impossible—until now. Jay Budzik details a new technique, generalized integrated gradients (GIG), to explain complex ensembled ML models that are safe to use in high-stakes applications.

Paris Buttfield-Addison is a cofounder of Secret Lab, a game development studio based in beautiful Hobart, Australia. Secret Lab builds games and game development tools, including the multi-award-winning ABC Play School iPad games, the BAFTA- and IGF-winning Night in the Woods, the Qantas airlines Joey Playbox games, and the Yarn Spinner narrative game framework. Previously, Paris was a mobile product manager for Meebo (acquired by Google). Paris particularly enjoys game design, statistics, blockchain, machine learning, and human-centered technology. He researches and writes technical books on mobile and game development (more than 20 so far) for O’Reilly and is writing Practical AI with Swift and Head First Swift. He holds a degree in medieval history and a PhD in computing. You can find him on Twitter as @parisba.

Presentations

Swift for TensorFlow in 3 hours Tutorial

Mars Geldard, Tim Nugent, and Paris Buttfield-Addison are here to prove Swift isn't just for app developers. Swift for TensorFlow provides the power of TensorFlow with all the advantages of Python (and complete access to Python libraries) and Swift—the safe, fast, incredibly capable open source programming language; Swift for TensorFlow is the perfect way to learn deep learning and Swift.

Vinoth Chandar is the cocreator of the Hudi project at Uber and the PMC lead of Apache Hudi (incubating). Previously, he was a senior staff engineer at Uber, where he led projects across technology areas like data infrastructure, data architecture, and mobile and network performance; the LinkedIn lead on Voldemort; and a contributor to Oracle Server’s replication engine, HPC, and stream processing. Vinoth has a keen interest in unified architectures for data analytics and processing.

Presentations

Bringing stream processing to batch data using Apache Hudi (incubating) Session

Batch processing can benefit immensely from adopting some techniques from the streaming processing world. Balaji Varadarajan shares how Apache Hudi (incubating), an open source project created at Uber and currently incubating with the ASF, can bridge this gap and enable more productive, efficient batch data engineering.

Praveen heads the data and analytics practice at GSPANN, where he spends his time pondering the nuances and intricacies of the data around us. He’s extremely passionate about data in all its forms and spent his time building analytical platforms and wrangling data at Gap and Macy’s before plunging back into consulting. He and his team have been joined at the hip with Kohl’s team in shaping and engineering its data platform on GCP with the objective of surfacing the right insights at the right time. When he isn’t wondering what his customers will want to do with the deluge of data, he likes to read and swim.

Presentations

Revitalizing Kohl’s Marketing and Experience Personalized Ecosystem on Google Cloud Platform (Sponsored by GSPANN) Session

Praveen presents the challenges faced with Kohl’s legacy marketing analytics platform and how the team leveraged Google Cloud Platform and BigQuery to overcome them, providing better and more consistent customer insights to the marketing analytics business team.

Sathya Chandran is a security research scientist at DataVisor. He’s an expert in applying big data and unsupervised machine learning to fraud detection, specializing in the financial, ecommerce, social, and gaming industries. Previously, Sathya was at HP Labs and Honeywell Labs. Sathya holds a PhD in computer science from the University of South Florida.

Presentations

Mobility behavior fingerprinting: A new tool for detecting account takeover attacks Session

Sathya Chandran shares key insights into current trends of account takeover fraud by analyzing 52 billion events generated by 1.1 billion users and developing a set of user mobility features to capture suspicious device and IP-switching patterns. You'll learn to incorporate mobility features into an anomaly detection solution to detect suspicious account activity in real time.

Jin Hyuk Chang is a software engineer on the data platform team at Lyft, working on various data products. Jin is a main contributor to Apache Gobblin and Azkaban. Previously, Jin worked at Linkedin and Amazon Web Services, focused on big data and service-oriented architecture.

Presentations

Amundsen: An open source data discovery and metadata platform Session

Jin Hyuk Chang and Tao Feng offer a glimpse of Amundsen, an open source data discovery and metadata platform from Lyft. Since it was open-sourced, Amundsen has been used and extended by many different companies within the community.

Yue Cathy Chang is a business executive recognized for sales, business development, and product marketing in high technology.

Cathy cofounded and is currently the CEO of TutumGene, a technology company that aims to accelerate the curing of disease by providing solutions for gene therapy and the regulation of gene expression. Most recently, she was with Silicon Valley Data Science, a startup (acquired by Apple) that provided business transformation consulting to enterprises and other organizations using data science- and engineering-based solutions. Before that, Cathy was employee #1, hired by the CEO, at the venture-funded software startup Rocana (acquired by Splunk), where she served as senior director of business development, focused on building and growing long-term relationships, and notably increased sales leads 2x by building and managing indirect revenue channels.

Prior to Rocana, Cathy held multiple strategic roles at blue-chip enterprise software companies as well as startups, including corporate and business development at Feedzai and Datameer; senior product management, product marketing, and sales at Symantec and IBM; and strategic sourcing improvement consulting at Honeywell.

Cathy holds MS and BS degrees in electrical and computer engineering from Carnegie Mellon University, MBA and MS degrees as a Leaders for Global Operations (LGO) dual-degree fellow from MIT, and two patents for her early work in microprocessor logic design.

Presentations

AI meets genomics: understand and use genetics & genome editing to revolutionize medicine Session

Genome editing has been dubbed a top technology that could create trillion-dollar markets. Learn how recent advancements in the application of AI to genomic editing are accelerating transformation of medicine with Yue Cathy Chang as she explores how AI is applied to genome sequencing and editing, the potential to correct mutations, and questions on using genome editing to optimize human health.

Applying technology oversight and domain insights in AI and ML initiatives to increase success Tutorial

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a "science." As data science practitioners, reducing this failure rate is a priority. Jike Chong and Yue Cathy Chang explain the three key steps of applying data science technology to business problems and three concerns for applying domain insights in AI and ML initiatives.

Executive Briefing: Technology oversight to reduce data science project failure rate Session

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a science. Jike Chong and Yue "Cathy" Chang outline how you can reduce this failure rate and improve teams' confidence in executing successful data science projects by applying data science technology to business problems: scenario mapping, pattern discovery, and success evaluation.

Jeff Chao is a senior software engineer at Netflix, where he works on stream processing engines and observability platforms. Jeff builds and maintains Mantis, an open source platform that makes it easy for developers to build cost-effective, real-time, operations-focused applications. Previously, he was at Heroku, offering a fully managed Apache Kafka service.

Presentations

Cost-effective, real-time operational insights into production systems at Netflix Session

Netflix has experienced an unprecedented global increase in membership over the last several years. Production outages today have greater impact in less time than years before. Jeff Chao details the open-sourced Mantis, which allows Netflix to continue providing great experiences for its members, enabling it to get real-time, granular, cost-effective operational insights.

Chanchal Chatterjee is a cloud AI leader at Google Cloud Platform with a focus on financial services and energy market verticals. He’s held several leadership roles focusing on machine learning, deep learning, and real-time analytics. Previously, he was chief architect of EMC at the CTO office, where he led end-to-end deep learning and machine learning solutions for data centers, smart buildings, and smart manufacturing for leading customers; was instrumental in the Industrial Internet Consortium, where he published an AI framework for large enterprises. Chanchal received several awards, including the Outstanding Paper Award from IEEE Neural Network Council for adaptive learning algorithms recommended by MIT professor Marvin Minsky. Chanchal founded two tech startups between 2008 and 2013. He has 29 granted or pending patents and over 30 publications. Chanchal earned MS and PhD degrees in electrical and computer engineering from Purdue University.

Presentations

Solving financial services machine learning problems with explainable ML models Session

Financial services companies use machine learning models to solve critical business use cases. Regulators demand model explainability. Chanchal Chatterjee shares how Google solved financial services business critical problems such as credit card fraud, anti-money laundering, lending risk, and insurance loss using complex machine learning models you can explain to regulators.

Michael Chertushkin is a senior data scientist at John Snow Labs. He graduated from the RadioTechnical Faculty of Ural Federal University in 2012 and worked as a software developer. In 2014, recognizing the growing interest in machine learning, he decided to shift to data science. After successfully completing several projects in the field, he decided to strengthen his fundamentals, graduating from the Yandex Data School, a leading Russian center for training highly skilled data science professionals.

Presentations

A unified CV, OCR, and NLP model pipeline for scalable document understanding at DocuSign Session

Roshan Satish and Michael Chertushkin lead you through a real-world case study about applying state-of-the-art deep learning techniques to a pipeline that combines computer vision (CV), optical character recognition (OCR), and natural language processing (NLP) at DocuSign. You'll discover how the project delivered on its extreme interpretability, scalability, and compliance requirements.

Amanda “Mandy” Chessell is a master inventor, fellow of the Royal Academy of Engineering, and a distinguished engineer at IBM, where she’s driving IBM’s strategic move to open metadata and governance through the Apache Atlas open source project. Mandy is a trusted advisor to executives from large organizations and works with them to develop strategy and architecture relating to the governance, integration, and management of information. You can find out more information on her blog.

Presentations

Creating an ecosystem on data governance in the ODPi Egeria project Session

Building on its success at establishing standards in the Apache Hadoop data platform, the ODPi (Linux Foundation) turns its focus to the next big data challenge—enabling metadata management and governance at scale across the enterprise. Mandy Chessell and John Mertic discuss how the ODPi's guidance on governance (GoG) aims to create an open data governance ecosystem.

George Chkadua is a data scientist at TBC Bank. His main focus is machine learning and its applications in industry from a mathematics and business perspective. He earned a PhD in mathematics from King’s College London. George has published various articles in peer-reviewed journals and has been invited to speak at many scientific conferences and seminars.

Presentations

A novel approach of recommender systems in retail banking Session

TBC Bank is transitioning from a product-centric to a client-centric model. An obvious application of analytics is developing personalized next-best product recommendations for clients. George Chkadua and Levan Borchkhadze explain why the bank decided to implement the ALS user-item matrix factorization method and a demographic model. As a result, the pilot increased sales conversion rates by 70%.

Sowmiya Chocka Narayanan is the cofounder and CTO of Lily AI, an emotional intelligence-powered shopping experience that helps brands understand their consumers’ purchase behavior. She’s focused on decoding user behavior and building deep product understanding by applying deep learning techniques. Previously, she worked at different levels of the tech stack at Box, leading initiatives in building SDKs, applications for industry verticals, and MDM solutions, and was also an early engineer at Pocket Gems, where she worked on the core game engine and built acquisition and retention strategies for the number one and number four top-grossing gaming apps. Sowmiya earned her master’s degree in electrical and computer engineering from The University of Texas at Austin.

Presentations

Personalization powered by unlocking deep product and consumer features Session

Digital brands focus heavily on personalizing consumers' experience at every single touchpoint. In order to engage with consumers in the most relevant ways, Lily AI helps brands dissect and understand how their consumers interact with their products, more specifically with the product features. Sowmiya Chocka Narayanan explores the lessons learned building AI-powered personalization for fashion.

Jike Chong is the director of data science, hiring marketplace, at LinkedIn. He’s an accomplished executive and professor with experience across industry and academia. Previously, he was the chief data scientist at Acorns, the leading microinvestment app in the US with over four million verified investors, which uses behavioral economics to help the up-and-coming save and invest for a better financial future; the chief data scientist at Yirendai, an online P2P lending platform with more than $7B in loans originated and the first of its kind from China to go public on the NYSE; the founder and head of the data science division at SimplyHired, a leading job search engine in Silicon Valley; an advisor to the Obama administration on using AI to reduce unemployment; and the lead for quantitative risk analytics at Silver Lake Kraftwerk, where he was responsible for applying big data techniques to risk analysis of venture investment. Jike is also an adjunct professor and PhD advisor in the Department of Electrical and Computer Engineering at Carnegie Mellon University, where he established the CUDA Research Center and CUDA Teaching Center, which focus on the application of GPUs for machine learning. Recently, he developed and taught a new graduate-level course on machine learning for internet finance at Tsinghua University in Beijing, China, where he serves as an adjunct professor. Jike holds MS and BS degrees in electrical and computer engineering from Carnegie Mellon University and a PhD from the University of California, Berkeley. He holds 11 patents (six granted, five pending).

Presentations

Applying technology oversight and domain insights in AI and ML initiatives to increase success Tutorial

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a "science." As data science practitioners, reducing this failure rate is a priority. Jike Chong and Yue Cathy Chang explain the three key steps of applying data science technology to business problems and three concerns for applying domain insights in AI and ML initiatives.

Executive Briefing: Technology oversight to reduce data science project failure rate Session

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a science. Jike Chong and Yue "Cathy" Chang outline how you can reduce this failure rate and improve teams' confidence in executing successful data science projects by applying data science technology to business problems: scenario mapping, pattern discovery, and success evaluation.

Nicholas Cifuentes-Goodbody is a data scientist in residence at the Data Incubator. He’s taught English in France, Spanish in Qatar, and now data science all over the world. Previously, he was at Williams College, Hamad bin Khalifa University (Qatar), and the University of Southern California. He earned his PhD at Yale University. He lives in Los Angeles with his amazing wife and their adorable pit bull.

Presentations

Deep learning with TensorFlow 2-Day Training

The TensorFlow library provides computational graphs with automatic parallelization across resources—an ideal architecture for implementing neural networks. You'll walk through TensorFlow's capabilities in Python, from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow, with several hands-on applications.

Ira Cohen is a cofounder and chief data scientist at Anodot, where he’s responsible for developing and inventing the company’s real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

Herding cats: Product management in the machine learning era Tutorial

While the role of the manager doesn't require deep knowledge of ML algorithms, it does require understanding how ML-based products should be developed. Ira Cohen explores the cycle of developing ML-based capabilities (or entire products) and the role of the (product) manager in each step of the cycle.

Nicola Corradi is a research scientist at DataVisor, where he uses his extensive experience with neural networks to design and train deep learning models that recognize malicious patterns in user behavior. He earned a PhD in cognitive science from the University of Padua and did a postdoc at Cornell in computational neuroscience and computer vision, focusing on integrating computational models of neurons with neural networks.

Presentations

A deep learning model to detect coordinated content abuse Session

Fraudulent attacks such as application fraud, fake reviews, and promotion abuse have to automate the generation of user content to scale; this creates latent patterns shared among the coordinated malicious accounts. Nicola Corradi digs into a deep learning model to detect such patterns for the identification of coordinated content abuse attacks on social, ecommerce, financial platforms, and more.

A deep learning model to detect coordinated frauds using patterns in user content Session

Fraudulent attacks like fake reviews, application fraud, and promotion abuse create a common pattern shared within coordinated malicious accounts. Nicola Corradi explains novel deep learning models that learned to detect suspicious patterns, leading to the identification of coordinated fraud attacks on social, dating, ecommerce, financial, and news aggregator services.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Tuesday keynotes Keynote

Strata program chairs Rachel Roumeliotis and Alistair Croll welcome you to the first day of keynotes.

Wednesday keynotes Keynote

Strata program chairs Rachel Roumeliotis and Alistair Croll welcome you to the second day of keynotes.

Robert Crowe is a data scientist at Google with a passion for helping developers quickly learn what they need to be productive. A TensorFlow addict, he’s used TensorFlow since the very early days and is excited about how it’s evolving quickly to become even better than it already is. Previously, Robert led software engineering teams for large and small companies, always focusing on clean, elegant solutions to well-defined needs. In his spare time, Robert sails, surfs occasionally, and raises a family.

Presentations

ML in production: Getting started with TensorFlow Extended (TFX) Tutorial

Putting together an ML production pipeline for training, deploying, and maintaining ML and deep learning applications is much more than just training a model. Robert Crowe outlines what's involved in creating a production ML pipeline and walks you through working code.

Michelangelo D’Agostino is the senior director of data science at ShopRunner, where he leads a team that develops statistical models and writes software that leverages their unique cross-retailer ecommerce dataset. Previously, Michelangelo led the data science R&D team at Civis Analytics, a Chicago-based data science software and consulting company that spun out of the 2012 Obama reelection campaign, and was a senior analyst in digital analytics with the 2012 Obama reelection campaign, where he helped to optimize the campaign’s email fundraising juggernaut and analyzed social media data. Michelangelo has been a mentor with the Data Science for Social Good Fellowship. He holds a PhD in particle astrophysics from UC Berkeley and got his start in analytics sifting through neutrino data from the IceCube experiment. Accordingly, he spent two glorious months at the South Pole, where he slept in a tent salvaged from the Korean War and enjoyed the twice-weekly shower rationing. He’s also written about science and technology for the Economist.

Presentations

Executive Briefing: The care and feeding of data scientists Session

Data science is relatively young, and the job of managing data scientists is younger still. Many people undertake this management position without the tools, mentorship, or role models they need to do it well. Katie Malone and Michelangelo D'Agostino review key themes from a recent Strata report that examines the steps necessary to build, manage, sustain, and retain a growing data science team.

Leslie De Jesus is the chief innovation officer at Wovenware. With more than 20 years of expertise in software, product development, and data science, Leslie drives disruptive strategies and solutions, including AI and enterprise cloud solutions, for clients in a variety of markets, from healthcare and telco to the insurance, education, and defense industries. Leslie is responsible for designing advanced deep learning, machine learning, and chatbot solutions, including patented groundbreaking products. One of her biggest strengths is team building, which is the foundation of repeatability in the product creation process. Previously, Leslie held positions such as senior software product architect, CTO, and vice president of product development at key firms.

Presentations

Going beyond the textbook: Best practices for creating a DL churn model in healthcare Session

Considering the cost of customer acquisition and the importance of making decisions based on customer data, churn prediction is key for retaining customers and anticipating future trends. Leslie De Jesus describes a case study of how a healthcare insurance provider reduced customer churn and examines three key considerations when creating the DL model to be a tool for preemptive decision making.

Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Previously, Sourav led teams building data products across the technology stack, from smart thermostats and security cams at Google Nest to power grid forecasting at AutoGrid to wireless communication chips at Qualcomm. He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He holds PhD, MS, and BS degrees in electrical engineering and computer science from MIT.

Presentations

Efficient ML engineering: Tools and best practices Tutorial

ML engineers work at the intersection of data science and software engineering—that is, MLOps. Sourav Dey and Alex Ng highlight the six steps of the Lean AI process and explain how it helps ML engineers work as an integrated part of development and production teams. You'll go hands-on with real-world data so you can get up and running seamlessly.

Gonzalo Diaz is a data scientist in residence at the Data Incubator, where he teaches the data science fellowship and online courses; he also develops the curriculum to include the latest data science tools and technologies. Previously, he was a web developer at an NGO and a researcher at IBM TJ Watson Research Center. He has a PhD in computer science from the University of Oxford.

Presentations

Big data for managers 2-Day Training

The instructors provide a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Big data for managers (Day 2) Training Day 2

Rich Ott and Michael Li provide a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Victor Dibia is a research engineer at Cloudera’s Fast Forward Labs, where his work focuses on prototyping state-of-the-art machine learning algorithms and advising clients. He’s passionate about community work and serves as a Google Developer Expert in machine learning. Previously, he was a research staff member at the IBM TJ Watson Research Center. His research interests are at the intersection of human-computer interaction, computational social science, and applied AI. He’s a senior member of IEEE and has published research papers at conferences such as AAAI Conference on Artificial Intelligence and ACM Conference on Human Factors in Computing Systems. His work has been featured in outlets such as the Wall Street Journal and VentureBeat. He holds an MS from Carnegie Mellon University and a PhD from City University of Hong Kong.

Presentations

Deep learning for anomaly detection Session

In many business use cases, it's frequently desirable to automatically identify and respond to abnormal data. This process can be challenging, especially when working with high-dimensional, multivariate data. Nisha Muktewar and Victor Dibia explore deep learning approaches (sequence models, VAEs, GANs) for anomaly detection, performance benchmarks, and product possibilities.

Dominic Divakaruni is a principal group program manager at Microsoft working on the Azure Machine Learning platform. His current areas of focus include applying and managing data for machine learning, including data access, exploratory data analysis, data lineage, and data drift. Dom's prior work includes building tools to help customers deploy models to production, deep learning frameworks, accelerated computing, and GPUs.

Presentations

Data lineage enables reproducible and reliable machine learning at scale Session

Data scientists need a way to ensure result reproducibility. Sihui "May" Hu and Dominic Divakaruni unpack how to retrieve data-to-data, data-to-model, and model-to-deployment lineages in one graph to achieve reproducible and reliable machine learning at scale. You'll discover effective ways to track the full lineage from data preparation to model training to inference.

Mark Donsky is a director of product management at Okera, a software provider of data discovery, access control, and governance at scale for today's heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera, and he's held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions, saving millions of dollars annually. He holds a BS with honors in computer science from Western University in Ontario, Canada.

Presentations

Executive briefing on CCPA, GDPR, and NYPA: Big data in the era of heavy privacy regulation Session

Privacy regulation is increasing worldwide with Europe's GDPR, the California Consumer Privacy Act (CCPA), and the New York Privacy Act (NYPA). Penalties for noncompliance are stiff, but many companies still aren't prepared. Mark Donsky shares how to establish best practices for holistic privacy readiness as part of your data strategy.

Jozo Dujmović received BSEE, MS, and ScD degrees in computer engineering from the University of Belgrade. He is a professor of computer science and former chair of the Computer Science Department at San Francisco State University, where he teaches and researches soft computing, decision engineering, software metrics, and computer performance evaluation. His first industrial experience was at the Institute "M. Pupin" in Belgrade, followed by a professorship at the School of Electrical Engineering at the University of Belgrade. Before joining San Francisco State University, he was a professor of computer science at the University of Florida, Gainesville; the University of Texas at Dallas; and Worcester Polytechnic Institute in Worcester, Massachusetts. He is the author of the LSP decision method and more than 170 refereed publications. Jozo has received three best paper awards, served as general chair of IEEE and ACM conferences, and been an invited keynote speaker at conferences in the US and Europe. He is the founder and principal of SEAS, a San Francisco company established in 1997 that specializes in soft computing, decision engineering, and software support for the LSP decision method. His latest book, Soft Computing Evaluation Logic, was published by John Wiley and IEEE Press in 2018.

Presentations

Monitoring patient disability and disease severity using AI Session

Jozo Dujmović presents soft computing models for evaluating patient disability and disease severity. Such models are necessary in personalized healthcare and must be supported by AI software tools. The methodology is illustrated with a case study of peripheral neuropathy, along with the related decision problem of the optimal timing of risky therapy.

Michael Dulin, MD, PhD, is the director of the Academy for Population Health Innovation at UNC Charlotte, a collaboration designed to advance community and population health, and serves as the chief medical officer for Gray Matter Analytics. Dulin started his career as an electrical and biomedical engineer and then received his PhD studying neurophysiology. His medical degree is from the University of Texas Medical School at Houston. He completed his residency training at Carolinas Medical Center in Charlotte and then entered private practice in Harrisburg, North Carolina, where he worked as a community-based primary care physician before returning to academics. He went on to become the research director and chair of the Carolinas Healthcare System's Department of Family Medicine, where he directed a primary care practice-based research network (MAPPR). Immediately prior to joining UNC Charlotte and Gray Matter Analytics, he served as the chief clinical officer for outcomes research and analytics at Atrium Health.

Dulin is a nationally recognized leader in the field of health information technology and the application of analytics and outcomes research to improve care delivery and advance population health. He has led projects in this domain funded by AHRQ, the Robert Wood Johnson Foundation, the Duke Endowment, NIH, and PCORI. His work has been recognized by the Charlotte Business Journal, NCHICA, and Cerner, and his work to build a healthcare data and analytics team was featured as a published case study by Harvard Business School and the Harvard T.H. Chan School of Public Health.

Dulin is a member of the American Academy of Family Physicians, the Society of Teachers of Family Medicine, the North American Primary Care Research Group, and Alpha Omega Alpha. He is a recipient of the North Carolina Medical Society's Community Practitioner Program, a participant in the Center for International Understanding Latino Initiative, and has been recognized as one of Charlotte's Best Doctors.

Presentations

Organizational culture's key role in transforming healthcare using data and AI Session

Despite advances in technology like cloud computing, healthcare providers struggle with the basics of applying data and analytics to essential functions, a delay driven by organizational culture, particularly in large, complex organizations. Michael Dulin and Shannon Fuller review common implementation barriers and the approaches needed to succeed in the transformation process.

Thomas Endres is a partner and IT consultant at TNG Technology Consulting in Munich. Besides his normal work for the company and its customers, he creates various prototypes, such as a telepresence robotics system that lets you see reality through the eyes of a robot and an augmented reality AI that shows the world from the perspective of an artist. He works on applications in the fields of AR/VR, AI, and gesture control, putting them to use in, for example, autonomous and gesture-controlled drones. He's also involved in other open source projects written in Java, C#, and various flavors of JavaScript.

Thomas studied IT at TU Munich and is passionate about software development and all other aspects of technology. As an Intel Software Innovator and Black Belt, he promotes new technologies like AI, AR/VR, and robotics around the world, for which he has received, among other honors, a JavaOne Rockstar award.

Presentations

Deepfakes 2.0: How neural networks are changing our world Session

Imagine looking into a mirror, but not seeing your own face. Instead, you're looking in the eyes of Barack Obama or Angela Merkel. Your facial expressions are seamlessly transferred to the other person's face in real time. Martin Förtsch and Thomas Endres dig into a prototype from TNG that transfers faces from one person to another in real time based on deepfakes.

Stephan Erberich is the chief data officer of Children's Hospital Los Angeles and a professor of research radiology at the University of Southern California. He's a computer scientist specializing in medical informatics and an AI practitioner in healthcare, with a focus on image processing and computer vision ML.

Presentations

Semisupervised AI approach for automated categorization of medical images Session

Annotating radiological images by category at scale is a critical step for analytical ML. Supervised learning is challenging because image metadata doesn't reliably identify image content, and manually labeling images for AI algorithms isn't feasible. Stephan Erberich, Kalvin Ogbuefi, and Long Ho share an approach for automated categorization of radiological images based on content category.

Moty Fania is a principal engineer and the CTO of the Advanced Analytics Group at Intel, which delivers AI and big data solutions across Intel. Moty has rich experience in ML engineering, analytics, data warehousing, and decision-support solutions. He led the architecture work and development of various AI and big data initiatives such as IoT systems, predictive engines, online inference systems, and more.

Presentations

Practical methods to enable continuous delivery and sustainability for AI Session

Moty Fania shares key insights from implementing and sustaining hundreds of ML models in production, including continuous delivery of ML models and systematic measures to minimize the cost and effort required to sustain them in production. You'll learn from examples from different business domains and deployment scenarios (on-premises, the cloud) covering the architecture and related AI platforms.

Tao Feng is a software engineer on the data platform team at Lyft. Tao is a committer and PMC member on Apache Airflow. Previously, Tao worked on data infrastructure, tooling, and performance at LinkedIn and Oracle.

Presentations

Amundsen: An open source data discovery and metadata platform Session

Jin Hyuk Chang and Tao Feng offer a glimpse of Amundsen, an open source data discovery and metadata platform from Lyft. Since it was open-sourced, Amundsen has been used and extended by many different companies within the community.

Rustem Feyzkhanov is a machine learning engineer at Instrumental, where he creates analytical models for the manufacturing industry. Rustem is passionate about serverless infrastructure (and AI deployments on it) and is the author of the course and book Serverless Deep Learning with TensorFlow and AWS Lambda.

Presentations

Serverless architecture for AI applications Session

Machine learning (ML) and deep learning (DL) are becoming more and more essential for businesses in internal and external use; one of the main issues with deployment is finding the right way to train and operationalize the model. Rustem Feyzkhanov digs into how to use AWS infrastructure to take a serverless approach to deep learning, providing a cheap, simple, scalable, and reliable architecture.

Martin Förtsch is an IT consultant at TNG, based in Unterföhring near Munich, who studied computer science. His focus areas are agile development (mainly in Java), search engine technologies, information retrieval, and databases. As an Intel Software Innovator and Intel Black Belt Software Developer, he's strongly involved in developing open source software for gesture control with 3D cameras like Intel RealSense, and with his team he's built an augmented reality wearable prototype device based on this technology. He gives many talks at national and international conferences about AI, the internet of things, 3D camera technologies, augmented reality, and test-driven development, and he has been awarded the JavaOne Rockstar award.

Presentations

Deepfakes 2.0: How neural networks are changing our world Session

Imagine looking into a mirror, but not seeing your own face. Instead, you're looking in the eyes of Barack Obama or Angela Merkel. Your facial expressions are seamlessly transferred to the other person's face in real time. Martin Förtsch and Thomas Endres dig into a prototype from TNG that transfers faces from one person to another in real time based on deepfakes.

Ben Fowler is a machine learning technical leader at Southeast Toyota Finance, where he leads end-to-end model development. He's been in the field of data science for over five years. Ben has been a guest speaker at the Southern Methodist University program multiple times, has spoken at the PyData Miami 2019 conference, and has spoken multiple times at the West Palm Beach Data Science Meetup. He earned a master of science in data science from Southern Methodist University.

Presentations

Evaluation of traditional and novel feature selection approaches Session

Selecting the optimal set of features is a key step in the machine learning modeling process. Ben Fowler shares research that tested five approaches for feature selection. The approaches included current widely used methods, along with novel approaches for feature selection using open source libraries, building a classification model using the Lending Club dataset.

Don Fox is a data scientist in residence in Boston for the Data Incubator. Previously, Don developed numerical models for a geothermal energy startup. Born and raised in South Texas, Don holds a PhD in chemical engineering, for which he researched renewable energy systems and developed computational tools to analyze the performance of these systems.

Presentations

Hands-on data science with Python 2-Day Training

You'll walk through all the steps—from prototyping to production—of developing a machine learning pipeline. After looking at data cleaning, feature engineering, model building and evaluation, and deployment, you'll extend these models into two applications from real-world datasets. All your work will be done in Python.

Hands-on data science with Python (Day 2) Training Day 2

You'll walk through all the steps—from prototyping to production—of developing a machine learning pipeline. After looking at data cleaning, feature engineering, model building and evaluation, and deployment, you'll extend these models into two applications from real-world datasets. All your work will be done in Python.

Michael J. Freedman is the cofounder and CTO of TimescaleDB and a full professor of computer science at Princeton University. His work broadly focuses on distributed and storage systems, networking, and security, and his publications have more than 12,000 citations. He developed CoralCDN (a decentralized content distribution network serving millions of daily users) and helped design Ethane (which formed the basis for OpenFlow and software-defined networking). Previously, he cofounded Illuminics Systems (acquired by Quova, now part of Neustar) and served as a technical advisor to Blockstack. Michael’s honors include a Presidential Early Career Award for Scientists and Engineers (given by President Obama), the SIGCOMM Test of Time Award, a Sloan Fellowship, an NSF CAREER award, the Office of Naval Research Young Investigator award, and support from the DARPA Computer Science Study Group. He earned his PhD at NYU and Stanford and his undergraduate and master’s degrees at MIT.

Presentations

Building a distributed time series database on PostgreSQL Session

Time series data is everywhere, across DevOps, IoT, industrial and energy, finance, and other domains, and it tends to accumulate very quickly, with monitoring and IoT applications generating tens of millions of metrics per second and petabytes of data. Michael Freedman shows you how to build a distributed time series database that offers the power of full SQL at scale.

Chris Fregly is an AWS technical evangelist for machine learning, founder of the Advanced KubeFlow Meetup, and author of the O’Reilly video series High Performance TensorFlow in Production. Previously, Chris was founder and product manager at PipelineAI, where he worked with many small startups and large enterprises to optimize and tune their ML and AI pipelines.

Presentations

Continuous machine learning and AI: Hands-on learning with Kubeflow and MLflow pipelines Interactive session

Join in to build real-world, distributed machine learning (ML) pipelines with Chris Fregly using Kubeflow, MLflow, TensorFlow, Keras, and Apache Spark in a Kubernetes environment.

Martin Frigaard is a cofounder of and data scientist at Aperture Marketing, which provides workshops, writes guides and tutorials, and builds web applications for individuals, businesses, and organizations. He has over nine years of experience with data analysis, statistics, and research, and he’s a fully certified RStudio tidyverse trainer.

Presentations

Data wrangling, visualizations, storytelling Interactive session

Martin Frigaard not only outlines how to collect, manipulate, summarize, and visualize data, but also explores how to communicate your findings in a convincing way your audience will understand and appreciate.

Shannon Fuller is an expert in developing and implementing data management and governance processes. He was instrumental in building a data and analytics center of excellence within a large integrated healthcare system, where he led the development of a data governance strategy. A creative problem solver and strategic thinker, he has the ability to create a vision and share it with the business in order to gain support and functional commitment on projects. Shannon is nationally recognized for his expertise in the application of data governance, with a focus on developing internal policies and change management designed to recognize data as an enterprise asset. He speaks nationally on this topic, leads an academic advisory board on big data applications in industry, and frequently guest lectures at UNC Charlotte.

Presentations

Organizational culture's key role in transforming healthcare using data and AI Session

Despite advances in technology like cloud computing, healthcare providers struggle with the basics of applying data and analytics to essential functions, a delay driven by organizational culture, particularly in large, complex organizations. Michael Dulin and Shannon Fuller review common implementation barriers and the approaches needed to succeed in the transformation process.

Krishna Gade is the cofounder and CEO of Fiddler Labs, an enterprise startup building an explainable AI engine to address problems of bias, fairness, and transparency in AI. Previously, he led the team that built Facebook's explainability feature "Why am I seeing this?" He's an entrepreneur with a technical background, with experience creating scalable platforms and expertise in converting data into intelligence. Having held senior engineering leadership roles at Facebook, Pinterest, Twitter, and Microsoft, he's seen the effects that bias has on AI and machine learning decision-making processes. With Fiddler, his goal is to enable enterprises across the globe to solve this problem.

Presentations

The art of explainability: Removing the bias from AI Session

Krishna Gade outlines how "explainable AI" fills a critical gap in operationalizing AI and adopting an explainable approach into the end-to-end ML workflow from training to production. You'll discover the benefits of explainability such as the early identification of biased data and better confidence in model outputs.

Ben Galewsky is a research programmer at the National Center for Supercomputing Applications at the University of Illinois. He’s an experienced data engineering consultant whose career has spanned high-frequency trading systems to global investment bank enterprise architecture to big data analytics for large consumer goods manufacturers. He’s a member of the Institute for Research and Innovation in Software for High Energy Physics, which funds his development of scalable systems for the Large Hadron Collider.

Presentations

Data engineering at the Large Hadron Collider Session

Building a data engineering pipeline for serving segments of a 200 PB dataset to particle physicists around the globe poses many challenges, some unique to high energy physics and some common to big science projects across disciplines. Ben Galewsky, Gray Lindsey, and Andrew Melo highlight how much of it can inform industry data science at scale.

Meg Garlinghouse is the head of social impact at LinkedIn. She’s passionate about connecting people with opportunities to use their skills and experience to transform the world. She has more than 20 years of experience working at the intersection of nonprofits and corporations, developing strategic and mutually beneficial partnerships. She has particular expertise in leveraging media and technology to meet the marketing, communications, and brand goals of respective clients. Meg has a passion for developing innovative social campaigns that have a business benefit.

Presentations

Fairness through experimentation at LinkedIn Session

Most companies want to ensure their products and algorithms are fair. Guillaume Saint-Jacques and Meg Garlinghouse share LinkedIn's A/B testing approach to fairness and describe new methods that detect whether an experiment introduces bias or inequality. You'll learn about a scalable implementation on Spark and discover examples of use cases and impact at LinkedIn.

William "Will" Gatehouse is the chief solutions architect for Accenture's industry X.0 platforms, including solutions for oil and gas, chemicals, and smart grid. Will has over 20 years' experience implementing industrial platforms and has a reputation for applying emerging technology at enterprise scale, as he's demonstrated for streaming analytics, semantic models, and edge analytics. When not at work, Will is an avid sailor.

Presentations

Building the digital twin: IoT and unconventional data Session

The digital twin presents a problem of data and models at scale: how to mobilize IT and OT data, AI, and engineering models that work across lines of business and even across partners. Teresa Tung and William Gatehouse share their experience implementing digital twin use cases that combine IoT, AI models, engineering models, and domain context.

Lior Gavish is a senior vice president of engineering at Barracuda, where he coleads the email security business. Lior developed AI solutions that were recognized by industry and academia, including a Distinguished Paper Award at USENIX Security 2019. Lior joined Barracuda through the acquisition of Sookasa, an Accel-backed startup where he was a cofounder and vice president of engineering. Previously, Lior led startup engineering teams building machine learning, web and mobile technologies. Lior holds a BSc and MSc in computer science from Tel-Aviv University and an MBA from Stanford University.

Presentations

High-precision detection of business email compromise Session

Lior Gavish breaks down a machine learning (ML)-based system that detects a highly evasive type of email-based fraud. The system combines innovative techniques for labeling and classifying highly unbalanced datasets with a distributed cloud application capable of processing high-volume communication in real time.

Marina (Mars) Rose Geldard is a researcher from Down Under in Tasmania. Entering the world of technology relatively late as a mature-age student, she’s found her place in the world: an industry where she can apply her lifelong love of mathematics and optimization. When she’s not busy being the most annoyingly eager researcher ever, she compulsively volunteers at industry events, dabbles in research, and serves on the executive committee for her state’s branch of the Australian Computer Society (ACS). She’s currently writing Practical Artificial Intelligence with Swift for O’Reilly Media.

Presentations

Swift for TensorFlow in 3 hours Tutorial

Mars Geldard, Tim Nugent, and Paris Buttfield-Addison are here to prove Swift isn't just for app developers. Swift for TensorFlow provides the power of TensorFlow with all the advantages of Python (and complete access to Python libraries) and Swift—the safe, fast, incredibly capable open source programming language; Swift for TensorFlow is the perfect way to learn deep learning and Swift.

Lars George is the principal solutions architect at Okera. Lars has been involved with Hadoop and HBase since 2007 and became a full HBase committer in 2009. Previously, Lars was the EMEA chief architect at Cloudera, acting as a liaison between the Cloudera professional services team and customers as well as partners in and around Europe, building the next data-driven solutions, and a cofounding partner of OpenCore, a Hadoop and emerging data technologies advisory firm. He’s spoken at many Hadoop User Group meetings as well as at conferences such as ApacheCon, FOSDEM, QCon, and Hadoop World and Hadoop Summit. He started the Munich OpenHUG meetings. He’s the author of HBase: The Definitive Guide (O’Reilly).

Presentations

Conquering the AWS IAM conundrum Session

With various levels of security layers and different departments responsible for data, there are a number of challenges with managing security and governance within AWS identity and access management (IAM). Lars George identifies the security layers, why there’s such a conundrum with IAM, if IAM actually slows down data projects, and the access control requirements needed in data lakes.

Dan Gifford is a senior data scientist responsible for creating data products at Getty Images in Seattle, Washington. Dan works at the intersection of science and creativity and builds products that improve the workflows of both Getty Images photographers and customers. Currently, he is the lead researcher on visual intelligence at Getty Images and is developing innovative new ways for customers to discover content. Previously, he worked as a data scientist on the ecommerce analytics team at Getty Images, where he modernized the testing frameworks and analysis tools used by Getty Images analysts, in addition to modeling content relationships for the creative research team. Dan earned a PhD in astronomy and astrophysics from the University of Michigan in 2015, where he developed new algorithms for estimating the size of galaxy clusters. He also engineered a new image analysis pipeline for an instrument on a telescope used by the department at the Kitt Peak National Observatory.

Presentations

At a Loss for Words: How ML Bridges the Creative Language Gap Data Case Studies

Computer vision has made great strides toward human-level accuracy in describing and identifying images, but there often aren't words to describe what we want algorithms to predict. Getty Images senior data scientist Dan Gifford explores this paradox, the limitations of text-based image search, and how creative AI is challenging the way we view human creativity.

Navdeep Gill is a senior data scientist and software engineer at H2O.ai, where he focuses mainly on machine learning interpretability and has previously focused on GPU-accelerated machine learning, automated machine learning, and the core H2O-3 platform. Previously, Navdeep focused on data science and software development at Cisco and was a researcher and analyst in several neuroscience labs at California State University, East Bay; the University of California, San Francisco; and the Smith-Kettlewell Eye Research Institute. Navdeep earned an MS in computational statistics, a BS in statistics, and a BA in psychology (with a minor in mathematics) from California State University, East Bay.

Presentations

Debugging machine learning models Session

Like all good software, machine learning models should be debugged to discover and remediate errors. Navdeep Gill explores several standard techniques in the context of model debugging—disparate impact, residual, and sensitivity analysis—and introduces novel applications such as global and local explanation of model residuals.

Ilana Golbin is a director in PwC’s emerging technologies practice and globally leads PwC’s research and development of responsible AI. Ilana has almost a decade of experience as a data scientist helping clients make strategic business decisions through data-informed decision making, simulation, and machine learning.

Presentations

A practical guide to responsible AI: Building robust, secure, and safe AI Session

Join in for a practitioner’s overview of the risks of AI and depiction of responsible AI deployment within an organization. You'll discover how to ensure the safety, security, standardized testing, and governance of systems and how models can be fooled or subverted. Ilana Golbin and Anand Rao illustrate how organizations safeguard AI applications and vendor solutions to mitigate AI risks.

Bruno Gonçalves is a chief data scientist at Data For Science, working at the intersection of data science and finance. Previously, he was a data science fellow at NYU's Center for Data Science while on leave from a tenured faculty position at Aix-Marseille Université. Since completing his PhD in the physics of complex systems in 2008, he's been pursuing the use of data science and machine learning to study human behavior. Using large datasets from Twitter, Wikipedia, web access logs, and Yahoo! Meme, he has studied how we can observe both large-scale and individual human behavior in an unobtrusive and widespread manner. The main applications have been the study of computational linguistics, information diffusion, behavioral change, and epidemic spreading. In 2015, he was awarded the Complex Systems Society's Junior Scientific Award for "outstanding contributions in complex systems science," and in 2018 he was named a science fellow of the Institute for Scientific Interchange in Turin, Italy.

Presentations

Time series modeling: ML and deep learning approaches 2-Day Training

Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Bruno Goncalves explains a broad range of traditional machine learning (ML) and deep learning techniques to model and analyze time series datasets with an emphasis on practical applications.

Time series modeling: ML and deep learning approaches (Day 2) Training Day 2

Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Bruno Goncalves explains a broad range of traditional machine learning (ML) and deep learning techniques to model and analyze time series datasets with an emphasis on practical applications.

Abe Gong is CEO and cofounder at Superconductive Health. A seasoned entrepreneur, Abe has been leading teams using data and technology to solve problems in healthcare, consumer wellness, and public policy for over a decade. Previously, he was chief data officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe holds a PhD in public policy, political science, and complex systems from the University of Michigan. He speaks and writes regularly on data science, healthcare, and the internet of things.

Presentations

Fighting pipeline debt with Great Expectations Session

Data organizations everywhere struggle with pipeline debt: untested, unverified assumptions that corrupt data quality, drain productivity, and erode trust in data. Abe Gong shares best practices gathered from across the data community in the course of developing a leading open source library for fighting pipeline debt and ensuring data quality: Great Expectations.

Srikanth Gopalakrishnan is a senior data scientist in Gramener's Bangalore office. He works on applying deep learning, machine learning, and probabilistic modeling in diverse fields. He comes from a solid mechanics background, with a master's degree in simulation sciences from RWTH Aachen University, Germany. After a short stint in the Aeronautics Department at Purdue University, he returned to India and transitioned to data science.

Presentations

Sizing biological cells and saving lives using AI Session

AI techniques are finding applications in a wide range of fields. Crowd-counting deep learning models have been used to count people, animals, and microscopic cells. Srikanth Gopalakrishnan introduces novel crowd-counting techniques and their applications, including a pharma case study showing how they were used in drug discovery to bring about 98% savings in drug characterization efforts.

Sunil Goplani is a group development manager at Intuit, leading the big data platform. Sunil has played key architecture and leadership roles in building solutions around data platforms, big data, BI, data warehousing, and MDM for startups and enterprises. Previously, Sunil served in key engineering positions at Netflix, Chegg, Brand.net, and a few other startups. Sunil has a master's degree in computer science.

Presentations

10 lead indicators before data becomes a mess Session

Data quality metrics focus on quantifying whether data is already a mess, but you need lead indicators before data becomes a mess. Sandeep Uttamchandani, Giriraj Bagadi, and Sunil Goplani explore developing lead indicators of data quality for Intuit's production data pipelines. You'll learn the details of these lead indicators, optimization tools, and the lessons that moved the needle on data quality.

Always accurate business metrics through lineage-based anomaly tracking Session

Debugging data pipelines is nontrivial and finding the root cause can take hours to days. Shradha Ambekar and Sunil Goplani outline how Intuit built a self-serve tool that automatically discovers data pipeline lineage and applies anomaly detection to detect and help debug issues in minutes—establishing trust in metrics and improving developer productivity by 10x–100x.

As chief data officer of DataStax, Denise Koessler Gosnell applies her experience as a machine learning and graph data practitioner to make more informed decisions with data. She joined DataStax to create and lead the global graph practice, a team that builds some of the largest distributed graph applications in the world. Denise earned her PhD in computer science from the University of Tennessee as an NSF fellow; her research coined the concept of "social fingerprinting" by applying graph algorithms to predict user identity from social media interactions.

Her career centers on her passion for examining, applying, and advocating for applications of graph data. She has patented, built, published, and spoken on dozens of topics related to graph theory, graph algorithms, graph databases, and applications of graph data across industry verticals. Before DataStax, she worked in the healthcare industry, where she contributed to software solutions for permissioned blockchains, machine learning applications of graph analytics, and data science.

Presentations

How does graph data help inform a self-organizing network? Session

Self-organizing networks rely on sensor communication and a centralized mechanism, like a cell tower, for transmitting the network's status. Denise Gosnell walks you through what happens if the tower goes down and how a graph data structure gets involved in the network's healing process. You'll see graphs in this dynamic network and how path information helps sensors come back online.

Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Reducing data lag from 24+ hours to 5 mins at Lyft scale Session

Mark Grover and Dev Tagare offer you a glimpse at the end-to-end data architecture Lyft uses to reduce data lag appearing in its analytical systems from 24+ hours to under 5 minutes. You'll learn the what and why of tech choices, monitoring, and best practices. They outline the use cases Lyft has enabled, especially in ML model performance and evaluation.

Sarah Guido is a data scientist at Reonomy, where she’s helping to build disruptive tech in the commercial real estate industry in New York City. Three of her favorite things are Python, data, and machine learning.

Presentations

Preparing and standardizing data for machine learning Interactive session

Getting your data ready for modeling is the essential first step in the machine learning process. Sarah Guido outlines the basics of preparing and standardizing data for use in machine learning models.
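Standardization itself is easy to sketch. Here's a minimal, library-free illustration of the z-score scaling such a preparation step typically applies (the data and function name are hypothetical; a real project would more likely use a tool such as scikit-learn's StandardScaler):

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

raw = [1.0, 2.0, 3.0, 4.0, 5.0]
z = standardize(raw)
print(z)  # values centered on 0, in units of standard deviation
```

Standardizing features this way keeps models that are sensitive to scale (linear models, k-nearest neighbors, neural networks) from being dominated by whichever column happens to have the largest raw magnitudes.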

Ananth Kalyan Chakravarthy Gundabattula is a senior application architect on the decisioning and advanced analytics engineering team at the Commonwealth Bank of Australia (CBA), where he works on next-generation architectures for CBA's fraud platform and the Advanced Analytics Omnia platform. Previously, he was an architect at ThreatMetrix, where he was a member of the core team that scaled the ThreatMetrix architecture to 100 million transactions per day at very low latencies using Cassandra, ZooKeeper, and Kafka, and migrated the ThreatMetrix data warehouse to a next-generation architecture based on Hadoop and Impala. Before that, he was at IBM software labs and IBM CIO labs, enabling some of the first IBM CIO projects onboarding the HBase, Hadoop, and Mahout stack. Ananth is a committer for Apache Apex and has presented at a number of conferences, including YOW! Data and the DataWorks Summit in Australia. He holds a PhD in computer science security and is interested in all things data, including low-latency distributed processing systems, machine learning, and data engineering. He holds three patents and has one patent application pending.

Presentations

Automated feature engineering for the modern enterprise using Dask and featuretools Session

Feature engineering can make or break a machine learning model. The featuretools package and associated algorithm accelerate the way features are built. Ananth Kalyan Chakravarthy Gundabattula explains a Dask- and Prefect-based framework that addresses challenges and opportunities using this approach in terms of lineage, risk, ethics, and automated data pipelines for the enterprise.

Sijie Guo is the founder and CEO of StreamNative, a data infrastructure startup offering a cloud native event streaming platform based on Apache Pulsar for enterprises. Previously, he was the tech lead for the Messaging Group at Twitter and worked on push notification infrastructure at Yahoo. He’s also the VP of Apache BookKeeper and PMC Member of Apache Pulsar.

Presentations

Transactional event streaming with Apache Pulsar Session

Sijie Guo and Yong Zhang lead a deep dive into the details of Pulsar transactions and how they can be used in Pulsar Functions and other processing engines to achieve transactional event streaming.

Patrick Hall is a senior director for data science products at H2O.ai, where he focuses mainly on model interpretability and model management. Patrick is also an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning. Previously, Patrick held global customer-facing and R&D research roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the eleventh person worldwide to become a Cloudera Certified Data Scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Model debugging strategies Tutorial

Even if you've followed current best practices for model training and assessment, machine learning models can be hacked, socially discriminatory, or just plain wrong. Patrick Hall breaks down model debugging strategies to test and fix security vulnerabilities, unwanted social biases, and latent inaccuracies in models.

Hannes Hapke is a senior data scientist at SAP ConcurLabs. He's been a machine learning enthusiast for many years and is a Google Developer Expert for machine learning. Hannes has applied deep learning to a variety of computer vision and natural language problems, but his main interest is in machine learning infrastructure and automating model workflows. Hannes is a coauthor of the deep learning publication Natural Language Processing in Action, and he's currently working on a book about TensorFlow Extended, Building Machine Learning Pipelines (O'Reilly). When he isn't working on a deep learning project, you'll find him outdoors running, hiking, or enjoying a good cup of coffee with a great book.

Presentations

Analyzing and deploying your machine learning model Tutorial

Most deep learning models don’t get analyzed, validated, and deployed. Catherine Nelson and Hannes Hapke explain the necessary steps to release machine learning models for real-world applications. You'll view an example project using the TensorFlow ecosystem, focusing on how to analyze models and deploy them efficiently.

Getting the most out of your AI projects with model feedback loops Session

Measuring your machine learning model’s performance is key for every successful data science project. Therefore, model feedback loops are essential to capture feedback from users and expand your model’s training dataset. Hannes Hapke and Catherine Nelson explore the concept of model feedback and guide you through a framework for increasing the ROI of your data science project.

Matt Harrison is a corporate trainer and consultant at MetaSnake, which specializes in Python and data science. He teaches companies big and small how to leverage Python and data science for great good.

Presentations

Mastering pandas Tutorial

You can use pandas to load data, inspect it, tweak it, visualize it, and analyze it with only a few lines of code. Matt Harrison leads a deep dive into plotting and Matplotlib integration, data quality, and issues such as missing data. Matt applies the split-apply-combine paradigm with groupby and pivot operations and explains stacking and unstacking data.
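As a taste of the split-apply-combine paradigm the tutorial covers, here is a minimal pandas sketch on hypothetical sales data (the column names and values are illustrative, not from the tutorial itself):

```python
import pandas as pd

# Hypothetical sales data
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [100, 120, 90, 110],
})

# Split by region, apply a sum, combine into one result
totals = df.groupby("region")["sales"].sum()

# Pivot: regions as rows, quarters as columns
table = df.pivot(index="region", columns="quarter", values="sales")

print(totals)
print(table)
```

The groupby call handles the split and combine steps implicitly, while pivot reshapes the same long-format data into a wide table for side-by-side comparison.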

Zak Hassan is a software engineer on the data analytics team working on data science and machine learning at Red Hat. Previously, Zak was a software consultant in the financial services and insurance industry, building end-to-end software solutions for clients.

Presentations

Log anomaly detector with NLP and unsupervised machine learning Session

The number of logs increases constantly, and no human can monitor them all. Zak Hassan employs natural language processing (NLP) for text encoding and machine learning (ML) methods for automated anomaly detection to construct a tool that helps developers perform root cause analysis more quickly. He also provides a means to give feedback to the ML algorithm so it can learn from false positives.

Long Van Ho is a data scientist at Children's Hospital Los Angeles with over five years of experience applying advanced machine learning techniques in healthcare and defense. His work includes developing the machine learning framework that enables data science at the Virtual Pediatric Intensive Care Unit and researching applications of artificial intelligence to improve care in ICUs. His background in particle beam physics research at UCLA and Stanford University provided a strong research foundation for his career as a data scientist. His goal is to bridge the potential of machine learning with practical applications in health.

Presentations

Semisupervised AI approach for automated categorization of medical images Session

Annotating radiological images by category at scale is a critical step for analytical ML. Supervised learning is challenging because image metadata doesn't reliably identify image content, and manually labeling images for AI algorithms isn't feasible. Stephan Erberich, Kalvin Ogbuefi, and Long Ho share an approach for automated categorization of radiological images by content category.

Ana Hocevar is a data scientist in residence at the Data Incubator, where she combines her love for coding and teaching. Ana has more than a decade of experience in physics and neuroscience research and over five years of teaching experience. Previously, she was a postdoctoral fellow at the Rockefeller University, where she worked on developing and implementing an underwater touch screen for dolphins. She holds a PhD in physics.

Presentations

Deep learning with PyTorch 2-Day Training

PyTorch is a machine learning library for Python that allows you to build deep neural networks with great flexibility. Its easy-to-use API and seamless use of GPUs make it a sought-after tool for deep learning. Join in to get the knowledge you need to build deep learning models using real-world datasets and PyTorch with Ana Hocevar.

Donagh Horgan is a principal engineer at Extreme Networks, where he designs data-driven solutions for smarter and more secure networks as part of the Cloud Technology Adoption Group. Previously, he led and contributed to applied machine learning research at a number of Fortune 500 companies, with applications in converged physical security, asset microlocation, and infrastructure performance monitoring. Donagh holds a BEng in microelectronic engineering and a PhD in electrical and electronic engineering from University College Cork, Ireland.

Presentations

What do machines say when nobody’s looking? Tracking IoT security with NLP Session

Machines talk among themselves, but you may understand their behavior by analyzing their language. Giacomo Bernardi outlines a lightweight approach for securing large internet of things (IoT) deployments by leveraging modern natural language processing (NLP) techniques. Rather than attempting cumbersome firewall rules, IoT deployments can be efficiently secured by online behavioral modeling.

Bob Horton is a senior data scientist on the user understanding team at Bing. Bob holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects. Previously, he was on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento.

Presentations

Machine learning for managers Tutorial

Bob Horton, Mario Inchiosa, and John-Mark Agosta offer an overview of the fundamental concepts of machine learning (ML) for business and healthcare decision makers and software product managers, so you'll be able to make more effective use of ML results and better evaluate opportunities to apply ML in your industry.

Sihui “May” Hu (she/her) is a program manager at Microsoft, focused on creating data management and data lineage solutions for the Azure Machine Learning service. Previously, she had two years of working experience in the ecommerce industry and several internships in product management. She graduated from Carnegie Mellon University, studying information systems management.

Presentations

Data lineage enables reproducible and reliable machine learning at scale Session

Data scientists need a way to ensure result reproducibility. Sihui "May" Hu and Dominic Divakaruni unpack how to retrieve data-to-data, data-to-model, and model-to-deployment lineages in one graph to achieve reproducible and reliable machine learning at scale. You'll discover effective ways to track the full lineage from data preparation to model training to inference.

He is codirector of Duke Forge and assistant dean for biomedical informatics, and was the first faculty recruit to a new initiative and new division in Duke's Department of Biostatistics & Bioinformatics. He's a principal investigator on an NIH-funded project under the Big Data to Knowledge (BD2K) RFAs and the faculty lead for informatics on the Google Life Sciences-funded Baseline Study. He leads a Duke University School of Medicine-wide initiative to build a data service for biomedical researchers, along with projects on applied machine learning, user interfaces, and visualization of surgical outcomes (the Clinical & Analytic Learning Platform for Surgical Outcomes, CALYPSO) and a chronic kidney disease early-warning system.

His overarching aim is to create a data science culture and infrastructure for biomedical and healthcare research.

Presentations

Deep Care Management: Using ML to provide value-based care to Medicare patients Session

Duke Connected Care is an "accountable care organization," an innovation model under the Affordable Care Act that incentivizes value-based healthcare. Ultimately, this is about appropriately targeting health resources to those who need them most. Deep Care Management is an ML workflow that helps Duke do just that.

Mario Inchiosa’s passion for data science and high-performance computing drives his work as principal software engineer in Microsoft Cloud + AI, where he focuses on delivering advances in scalable advanced analytics, machine learning, and AI. Previously, Mario served as chief scientist of Revolution Analytics; analytics architect in the big data organization at IBM, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist at Netezza, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning Publication of the Year and Open Literature Publication Excellence awards.

Presentations

Machine learning for managers Tutorial

Bob Horton, Mario Inchiosa, and John-Mark Agosta offer an overview of the fundamental concepts of machine learning (ML) for business and healthcare decision makers and software product managers, so you'll be able to make more effective use of ML results and better evaluate opportunities to apply ML in your industry.

George Iordanescu is a data scientist on the algorithms and data science team for Microsoft’s Cortana Intelligence Suite. Previously, he was a research scientist in academia, a consultant in the healthcare and insurance industry, and a postdoctoral visiting fellow in computer-assisted detection at the National Institutes of Health (NIH). His research interests include semisupervised learning and anomaly detection. George holds a PhD in EE from Politehnica University in Bucharest, Romania.

Presentations

Using the cloud to scale up hyperparameter optimization for machine learning Session

Hyperparameter optimization for machine learning is complex, requires advanced optimization techniques, and can be implemented as a generic framework decoupled from the specific details of algorithms. Fidan Boylu Uz, Mario Bourgoin, and George Iordanescu apply such a framework to tasks like object detection and text matching in a transparent, scalable, and easy-to-manage way in a cloud service.

Ankit Jain is a senior research scientist at Uber AI Labs, the machine learning research arm of Uber. His primary research areas include graph neural networks, meta-learning, and forecasting. Previously, he worked in a variety of data science roles at Bank of America, Facebook, and other startups. He coauthored a book on machine learning, TensorFlow Machine Learning Projects. He's been a featured speaker at many of the top AI conferences and universities across the US, including UC Berkeley and the O'Reilly AI Conference. He earned his MS from UC Berkeley and BS from IIT Bombay (India).

Presentations

Enhance recommendations in Uber Eats with graph convolutional networks Session

Ankit Jain and Piero Molino detail how to generate better restaurant and dish recommendations in Uber Eats by learning entity embeddings using graph convolutional networks implemented in TensorFlow.

Shubhankar Jain (he/him) is a machine learning engineer at SurveyMonkey, where he develops and implements machine learning systems for its products and teams. He's excited to bring his expertise and passion for data and AI systems to the rest of the industry. In his free time, he likes hiking with his dog and accelerating his hearing loss at live music shows.

Presentations

Accelerating your organization: Making data optimal for machine learning Session

Every organization leverages ML to increase customer value and understand its business. You may have created models, but now you need to scale. Shubhankar Jain, Aliaksandr Padvitselski, and Manohar Angani use a case study to teach you how to pinpoint inefficiencies in your ML data flow, how SurveyMonkey tackled this, and how to make your data more usable to accelerate ML model development.

Ram Janakiraman is a distinguished engineer at the Aruba CTO Office working on machine intelligence for enterprise security. His recent focus has been on simplifying the building of behavior models by leveraging approaches in NLP and representation learning. He hopes to improve end user product engagement through a visual representation of entity interactions without compromising the privacy of the network entities. Ram has numerous patents from a variety of areas during the course of his career. Previously, he’s been in various startups and was a cofounding member of Niara, Inc., working on security analytics with a focus on threat detection and investigation before it was acquired by Aruba, a HPE Company. He’s also an avid scuba diver, always eager to explore the next reef or kelp. He’s an FAA Certified Drone Pilot, capturing the beauty of dive destinations on his trips.

Presentations

Preserving privacy in behavioral analytics using semantic learning techniques in NLP Session

Devices discover their way around the network and proxy the intent of the users behind them; leveraging this information for behavior analytics can raise privacy concerns. A selective use of embedding models on a crafted corpus from anonymized data can address these concerns. Ramsundar Janakiraman details a way to build representations with behavioral insights that also preserves user identity.

Dan Jeffries is the chief technology evangelist at Pachyderm. He's also an author, engineer, futurist, and pro blogger, and he's given talks all over the world on AI and cryptographic platforms. He's spent more than two decades in IT as a consultant and at open source pioneer Red Hat. His articles have held the number one writer's spot on Medium for artificial intelligence, bitcoin, cryptocurrency, and economics more than 25 times. His breakout AI tutorial series, "Learning AI if You Suck at Math," along with his explosive pieces on cryptocurrency, "Why Everyone Missed the Most Important Invention of the Last 500 Years" and "Why Everyone Missed the Most Mind-Blowing Feature of Cryptocurrency," are shared hundreds of times daily all over social media and have been read by more than 5 million people worldwide.

Presentations

When AI goes wrong and how to fix it with real-world AI auditing Session

With algorithms making more and more decisions in our lives, from who gets hired and fired to who goes to jail, it's more critical than ever that we make AI auditable and explainable in the real world. Daniel Jeffries breaks down how you can make your AI and ML systems auditable and transparent right now with a few classic IT techniques your team already knows well.

Amit Kapoor is a data storyteller at narrativeViz, where he uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Interested in learning and teaching the craft of telling visual stories with data, Amit also teaches storytelling with data for executive courses as a guest faculty member at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. Previously, he gained more than 12 years of management consulting experience with A.T. Kearney in India, Booz & Company in Europe, and startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi, and a PGDM (MBA) from IIM, Ahmedabad.

Presentations

Deep learning for recommendation systems 2-Day Training

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains. You'll get an end-to-end overview of deep learning-based recommendation and learning-to-rank systems to understand practical considerations and guidelines for building and deploying RecSys.

Deep learning for recommendation systems (Day 2) Training Day 2

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains. You'll get an end-to-end overview of deep learning-based recommendation and learning-to-rank systems to understand practical considerations and guidelines for building and deploying RecSys.

Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that's a bit more out of date. She's a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Ship it: A practitioner's guide to model management and deployment with Kubeflow Session

Trevor Grant and Holden Karau make sure you can get and keep your models in production with Kubeflow.

Arun Kejariwal is an independent lead engineer. Previously, he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers working on novel techniques for install-and-click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns; his team also built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Before that, he developed and open sourced techniques for anomaly detection and breakout detection at Twitter. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. You'll get an overview of the inception and growth of the serverless paradigm and explore Apache Pulsar, which provides native serverless support in the form of Pulsar functions.

Anurag Khandelwal is an assistant professor in the Department of Computer Science at Yale University. His research interests span distributed systems, networking, and algorithms; in particular, his research focuses on addressing core challenges in distributed systems through novel algorithm and data structure design. During his PhD, Anurag built large-scale data-intensive systems such as Succinct and Confluo, which have been deployed in several production clusters.

Before starting at Yale, Anurag did a short postdoc at Cornell Tech, where he worked with Tom Ristenpart and Rachit Agarwal. He received his PhD from the University of California, Berkeley, at the RISELab, where he was advised by Ion Stoica. Anurag completed his bachelor's degree (BTech in computer science and engineering) at the Indian Institute of Technology, Kharagpur.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. You'll get an overview of the inception and growth of the serverless paradigm and explore Apache Pulsar, which provides native serverless support in the form of Pulsar functions.

Veysel is a senior data scientist and ML engineer with a decade of industry experience. He holds a PhD in computer science from Leiden University (NL) and an MS in operations research from Penn State University (USA).

Veysel has worked as a CTO, head of AI, and principal data scientist, among other roles, and also provides hands-on consulting services in machine learning and AI, statistics, data science, and operations research to startups and companies around the globe.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, and Claudiu Branzan detail the application of the latest advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

After earning a BS in environmental engineering at Yale, founding an electrochemistry startup, joining a battery startup, and doing crazy things with PostgreSQL for Moat (an ad analytics company), David joined Timescale to focus on research and development. He also cooks, does pottery, and builds furniture.

Presentations

Simplifying data analytics by creating continuously up-to-date aggregates Session

The sheer volume of time series data from servers, applications, and IoT devices introduces performance challenges, both in inserting data at high rates and in processing aggregates for subsequent analysis. David Kohn demonstrates how systems can continuously maintain up-to-date aggregates, correctly handling even late or out-of-order data, to simplify data analysis.

Ravi Krishnaswamy is the director of software architecture in the AutoCAD Group at Autodesk. He has a passion for technology and has implemented a wide range of solutions for products at Autodesk from analytics and database applications to mobile graphics. His current projects involve analytics solutions on product usage data that leverage graph databases and machine learning techniques on graphs.

Presentations

Collaboration insights through data access graphs Session

Today's applications interact with data in a distributed and decentralized world. Using graphs at scale, you can infer communities and interactions by tracking access to common data across users and applications. Ravi Krishnaswamy presents a real-world product example, with millions of users, that uses the combined power of Spark and graph databases to gain insights into customer workflows.

Akshay Kulkarni is a senior data scientist on the core AI and data science team at Publicis Sapient, where he’s part of strategy and transformation interventions through AI, manages high-priority growth initiatives around data science, and works on various machine learning, deep learning, natural language processing, and artificial intelligence engagements, applying state-of-the-art techniques. He’s a renowned AI and machine learning evangelist, author, and speaker, and was recently recognized as one of the “Top 40 under 40 Data Scientists” in India by Analytics India Magazine. He’s consulted with several Fortune 500 and global enterprises, driving AI and data science-led strategic transformation, and has rich experience building and scaling AI and machine learning businesses that create significant client impact. He’s actively involved in next-gen AI research and is part of the next-gen AI community. Previously, he was at Gartner and Accenture, where he scaled the AI and data science business. He’s a regular speaker at major data science conferences and recently gave a talk on sequence embeddings for prediction using deep learning at GIDS. He’s the author of a book on NLP with Apress and is writing two more books with Packt on deep learning and next-gen NLP. Akshay is a visiting faculty member (industry expert) at a few of the top universities in India. In his spare time, he enjoys reading, writing, coding, and helping aspiring data scientists.

Presentations

Attention networks all the way to production using Kubeflow Tutorial

Vijay Srinivas Agneeswaran, Pramod Singh, and Akshay Kulkarni demonstrate the in-depth process of building a text summarization model with an attention network using TensorFlow (TF) 2.0. You'll gain the practical hands-on knowledge to build and deploy a scalable text summarization model on top of Kubeflow.

Dinesh Kumar is a product engineer at Gojek. He has experience building high-scale distributed systems and working with event-driven systems and components around Kafka.

Presentations

BEAST: Building an event processing library to handle millions of events Session

Maulik Soneji and Dinesh Kumar explore Gojek’s event-processing library, which consumes events from Kafka and pushes them to BigQuery. All of Gojek’s services are event sourced; the company handles hundreds of topics, with some topics carrying loads as high as 21K messages per second.

Tianhui Michael Li is the founder and president of the Data Incubator, a data science training and placement firm. Michael bootstrapped the company and navigated it to a successful sale to the Pragmatic Institute. Previously, he headed monetization data science at Foursquare and worked at Google, Andreessen Horowitz, J.P. Morgan, and D.E. Shaw. He’s a regular contributor to the Wall Street Journal, TechCrunch, Wired, Fast Company, Harvard Business Review, MIT Sloan Management Review, Entrepreneur, VentureBeat, TechTarget, and O’Reilly. Michael was a postdoc at Cornell Tech, earned a PhD at Princeton, and was a Marshall Scholar at Cambridge.

Presentations

Big data for managers 2-Day Training

The instructors provide a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Big data for managers (Day 2) Training Day 2

Rich Ott and Michael Li provide a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Nong Li cofounded Okera in 2016 with Amandeep Khurana and serves as the company’s CEO. Previously, he was on the engineering team at Databricks, where he led performance engineering for Spark core and SparkSQL, and was tech lead for the Impala project at Cloudera and the author of the Record Service project. Nong is also one of the original authors of the Apache Parquet project and mentors several Apache projects, including Apache Arrow. Nong has a degree in computer science from Brown University.

Presentations

Data versus metadata: Overcoming the challenges to securing the modern data lake Session

The evolution from storing data in a warehouse to a hybrid infrastructure of on-premises and cloud data lakes has enabled agility and scale. Nong Li looks at the problems between data and metadata, the privacy and security risks associated with them, how to avoid the pitfalls of these challenges, and why companies need to get it right by enforcing security and privacy consistently across all applications.

Penghui Li is a software engineer at Zhaopin and an Apache Pulsar committer.

Presentations

Life beyond pub/sub: How Zhaopin simplifies stream processing using Pulsar Functions and SQL Session

Penghui Li and Jia Zhai walk you through building an event streaming platform based on Apache Pulsar and simplifying a stream processing pipeline with Pulsar Functions, Pulsar Schema, and Pulsar SQL.

Gray Lindsey is a staff scientist at Fermi National Accelerator Laboratory studying Higgs and electroweak physics. He’s focused on developing software and detectors to address the challenge of the high-luminosity upgrade for the Large Hadron Collider and the corresponding upgrade of the Compact Muon Solenoid (CMS) experiment. He’s developed a variety of pattern recognition techniques to demonstrate and help realize new detector systems to efficiently assemble physics data from upgrades to the CMS detector. He also leads the development to make the analysis of those data more efficient and scalable using modern big data technologies.

Presentations

Data engineering at the Large Hadron Collider Session

Building a data engineering pipeline for serving segments of a 200 PB dataset to particle physicists around the globe poses many challenges, some unique to high energy physics and some common to big science projects across disciplines. Ben Galewsky, Gray Lindsey, and Andrew Melo highlight how much of it can inform industry data science at scale.

Shondria Lopez-Merlos is a data specialist for the Florida Conference of The United Methodist Church. After making a suggestion in a meeting, Shondria was challenged to learn more about coding and automation; she subsequently taught herself Python and has begun learning HTML/CSS, SQL, and VBA. Shondria is a former O’Reilly scholarship recipient and a member of Women Who Code and Women in STEAM.

Presentations

Starting Simple: How to Use Coding and Automation at Non-Profits and Small Businesses Data Case Studies

Small data teams at small businesses or nonprofits often want to use programming and automation but don’t know where to start. Shondria Lopez-Merlos discusses how to write simple Python programs and incorporate them into daily workflows, ideally leading to additional, increasingly complex projects.
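A typical first automation of the kind described here is cleaning up a contact list. The sketch below is a hypothetical example (the file contents and field names are made up) using only the Python standard library:

```python
import csv
import io

def dedupe_contacts(csv_text, key="email"):
    """Remove duplicate rows from a contact list, keeping the first
    occurrence of each key (here: email address, case-insensitive)."""
    seen = set()
    unique = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        k = row[key].strip().lower()
        if k not in seen:
            seen.add(k)
            unique.append(row)
    return unique

# Toy data standing in for an exported membership spreadsheet.
raw = """name,email
Ana,ana@example.org
Bo,bo@example.org
Ana Maria,ANA@example.org
"""
print(len(dedupe_contacts(raw)))  # -> 2 (the third row is a duplicate)
```

In practice you would read from a real file with `open(...)` and write the cleaned rows back out with `csv.DictWriter`, but the whole task stays under a dozen lines.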

Boris Lublinsky is a principal architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Previously, he was responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural road maps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He’s also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Model governance Tutorial

Machine learning (ML) models are data, which means they require the same data governance considerations as the rest of your data. Boris Lublinsky and Dean Wampler outline metadata management for model serving and explore what information about running systems you need and why it's important. You'll also learn how Apache Atlas can be used for storing and managing this information.

Understanding data governance for machine learning models Session

Production deployment of machine learning (ML) models requires data governance, because models are data. Dean Wampler and Boris Lublinsky justify that claim and explore its implications and techniques for satisfying the requirements. Using motivating examples, you'll explore reproducibility, security, traceability, and auditing, plus some unique characteristics of models in production settings.

Willy Lulciuc is a data engineer on the Marquez project team at WeWork in San Francisco, where he and his team make datasets discoverable and meaningful. Previously, he worked on the real-time streaming data platform powering BounceX, and before that, he designed and scaled sensor data streams at Canary.

Presentations

Data lineage with Apache Airflow using Marquez Interactive session

Willy Lulciuc explains how lineage metadata, in conjunction with a data catalog, helps improve the overall quality of data. You'll dive into complex inter-DAG dependencies in Airflow and get a hands-on introduction to data lineage using Marquez. You'll also develop strong debugging techniques and learn how to apply them effectively.

Anand Madhavan is the vice president of engineering at Narvar. Previously, he was head of engineering for the Discover product at Snapchat and director of engineering at Twitter, where he worked on building out the ad serving system for Twitter Ads. He earned an MS in computer science from Stanford University.

Presentations

Using Apache Pulsar functions for data workflows at Narvar Session

Narvar originally used a large collection of point technologies, such as AWS Kinesis, AWS Lambda, and Apache Kafka, to satisfy its requirements for pub/sub messaging, message queuing, logging, and processing. Karthik Ramasamy and Anand Madhavan walk you through how Narvar moved away from this slew of technologies, consolidating its use cases on Apache Pulsar.

Shailendra joined the data analytics team at Kohl’s three years ago and currently leads the transformation of the legacy marketing analytics and personalization platform to the cloud, focusing on Google Cloud Platform. Shailendra’s passion is applying big data analytics and driving engineering innovation to solve challenging business problems. He’s a seasoned leader with 15 years of industry experience building and leading teams of talented data architects, engineers, and product managers, and he has a proven track record of building scalable data platforms and large-scale data products, improving operational process efficiency, and enhancing user experience. Shailendra is a hands-on practitioner with excellent communication skills. He loves to play volleyball and spend time with his kids when not thinking about GCP.

Presentations

Revitalizing Kohl’s Marketing and Experience Personalized Ecosystem on Google Cloud Platform (Sponsored by GSPANN) Session

Shailendra presents the challenges Kohl’s faced with its legacy marketing analytics platform and how the team leveraged Google Cloud Platform and BigQuery to overcome them, providing better and more consistent customer insights to the marketing analytics business team.

Suneeta Mall is a senior data scientist at Nearmap, where she leads the engineering efforts of the artificial intelligence division, and she led the migration of Nearmap’s backend services to Kubernetes. In her 12 years of software industry experience, she’s worked on solving a variety of challenging technical and business problems in big data, machine learning, GIS, travel, DevOps, and telecommunications. She earned her PhD from the University of Sydney and her bachelor’s in computer science and engineering.

Presentations

Deep learning meets Kubernetes: Running massively parallel inference pipelines efficiently Session

Using Kubernetes as the backbone of AI infrastructure, Nearmap built a fully automated deep learning inference pipeline that's highly resilient, scalable, and massively parallel. Using this system, Nearmap ran semantic segmentation over tens of quadrillions of pixels. Suneeta Mall demonstrates the solution using Kubernetes in big data crunching and machine learning at scale.

Katie Malone is director of data science at data science software and services company Civis Analytics, where she leads a team of diverse data scientists who serve as technical and methodological advisors to the Civis consulting team and write the core machine learning and data science software that underpins the Civis Data Science Platform. Previously, she worked at CERN on Higgs boson searches and was the instructor of Udacity’s Introduction to Machine Learning course. Katie hosts Linear Digressions, a weekly podcast on data science and machine learning. She holds a PhD in physics from Stanford.

Presentations

Executive Briefing: The care and feeding of data scientists Session

Data science is relatively young, and the job of managing data scientists is younger still. Many people undertake this management position without the tools, mentorship, or role models they need to do it well. Katie Malone and Michelangelo D'Agostino review key themes from a recent Strata report that examines the steps necessary to build, manage, sustain, and retain a growing data science team.

Sukanya Mandal is a data scientist at Capgemini. She has extensive experience building various solutions with IoT and most enjoys working at the intersection of IoT and data science. She also leads the PyData Mumbai and PyLadies Mumbai chapters. Besides work and community efforts, she loves to explore new tech and pursue research; she’s published a couple of white papers with IEEE, and a couple more are in the pipeline.

Presentations

Machine learning on resource-constrained IoT edge devices Session

Heavy ML computation on resource-constrained IoT devices is a challenge. IoT demands near-zero latency, high bandwidth, continuous and seamless availability, and privacy, and the right infrastructure drives the right ROI. This is where edge and cloud computing come in. Sukanya Mandal explains how training ML models in the cloud and running inference at the edge has made many IoT use cases feasible.

Jaya Mathew is a senior data scientist on the artificial intelligence and research team at Microsoft, where she focuses on the deployment of AI and ML solutions to solve real business problems for customers in multiple domains. Previously, she worked on analytics and machine learning at Nokia and Hewlett Packard Enterprise. Jaya holds an undergraduate degree in mathematics and a graduate degree in statistics from the University of Texas at Austin.

Presentations

Machine translation helps break our language barrier Session

With the need to cater to a global audience, there's a growing demand for applications to support speech identification, translation, and transliteration from one language to another. Jaya Mathew explores this topic and how to quickly use some of the readily available APIs to identify, translate, or even transliterate speech or text within your application.

Andrew Melo is a research professor of physics and a big data application developer at Vanderbilt University. He’s spent the last decade developing and implementing large-scale data workflows for the Large Hadron Collider. Recently his focus has been reimplementing these physics workflows with Apache Spark.

Presentations

Data engineering at the Large Hadron Collider Session

Building a data engineering pipeline for serving segments of a 200 PB dataset to particle physicists around the globe poses many challenges, some unique to high energy physics and some common to big science projects across disciplines. Ben Galewsky, Gray Lindsey, and Andrew Melo highlight how much of it can inform industry data science at scale.

Rashmina Menon is a senior data engineer at GumGum, a computer vision company. She’s passionate about building distributed, scalable systems and end-to-end data pipelines that provide visibility into meaningful data through machine learning and reporting applications.

Presentations

Real-time forecasting at scale using Delta Lake Session

GumGum receives 30 billion programmatic inventory impressions, amounting to 25 TB of data, per day. By generating near-real-time inventory forecasts based on campaign-specific targeting rules, GumGum enables users to set up successful future campaigns. Rashmina Menon and Jatinder Assi highlight the architecture that enables forecasting in less than 30 seconds with Delta Lake and Databricks Delta caching.

John Mertic is the director of program management for the Linux Foundation. Under his leadership, he’s helped ASWF, ODPi, the Open Mainframe Project, and the R Consortium accelerate open source innovation and transform industries. John has an open source career spanning two decades, both as a contributor to projects such as SugarCRM and PHP and in open source leadership roles at SugarCRM, OW2, and OpenSocial. With this extensive open source background, he’s a regular speaker at various Linux Foundation and industry trade shows each year. John is also an avid writer who has authored two books, The Definitive Guide to SugarCRM: Better Business Applications and Building on SugarCRM, and has published articles on IBM developerWorks, Apple Developer Connection, and PHP Architect.

Presentations

Creating an ecosystem on data governance in the ODPi Egeria project Session

Building on its success at establishing standards in the Apache Hadoop data platform, the ODPi (Linux Foundation) turns its focus to the next big data challenge—enabling metadata management and governance at scale across the enterprise. Mandy Chessell and John Mertic discuss how the ODPi's guidance on governance (GoG) aims to create an open data governance ecosystem.

Minal Mishra is an engineering manager at Netflix, where he’s part of an effort to improve the software delivery of Netflix’s streaming player. Previously, he was with Xbox Live ecommerce and music and video services teams at Microsoft. Outside work, he enjoys playing tennis.

Presentations

Data powering frequent updates of Netflix's video player Session

Minal Mishra walks you through Netflix's video player release process, the challenges with deriving time series metrics from a firehose of events, and some of the oddities in running analysis on real-time metrics.

Sanjeev Mohan leads big data research for technical professionals at Gartner, where he researches trends and technologies for relational and NoSQL databases, object stores, and cloud databases. His areas of expertise span the end-to-end data pipeline, including ingestion, persistence, integration, transformation, and advanced analytics. Sanjeev is a well-respected speaker on big data and data governance. His research includes machine learning and the IoT. He also serves on a panel of judges for many Hadoop distribution organizations, such as Cloudera and Hortonworks.

Presentations

Executive Briefing: Data is leaving your door—Essentials for embarking on hybrid multicloud journey Session

The acceleration of the migration of workloads to the cloud isn't a binary journey. Some workloads will still be on-premises and some will be on multiple cloud providers. Sanjeev Mohan identifies key data and analytics considerations in modern data architectures, including strategies to handle data latency, gravity, ingress transformation, compliance, and governance needs and data orchestration.

Piero Molino is a cofounder and senior research scientist at Uber AI Labs, where he works on natural language understanding and conversational AI. He’s the author of the open source platform Ludwig, a code-free deep learning toolbox.

Presentations

Enhance recommendations in Uber Eats with graph convolutional networks Session

Ankit Jain and Piero Molino detail how to generate better restaurant and dish recommendations in Uber Eats by learning entity embeddings using graph convolutional networks implemented in TensorFlow.

Keith Moore is the director of product management at SparkCognition, responsible for the development of the Darwin automated model building product. He specializes in applying advanced data science and natural language processing algorithms to complex datasets.

Moore has multiple patents in the space of data science automation software, and has been recognized by Hart Energy as an influencer in the energy innovation space. Moore previously worked for National Instruments as a data acquisition and vibration software product manager. Prior to that, he developed client software solutions for major oil and gas, aerospace, and semiconductor organizations.

Moore has served as a board member of the Pi Kappa Phi fraternity and still volunteers on its alumni engagement committee. He graduated from the University of Tennessee with a BS in mechanical engineering and serves on the alumni board of advisors for the Austin, Texas, area.

Presentations

Neuroevolution-based automated model building: How to create better models Session

AutoML brings acceleration and democratization of data science, but in the game of accuracy and flexibility, using predefined blueprints to find adequate algorithms falls short. Carlos Pazos and Keith Moore shine a spotlight on a neuroevolutionary approach to AutoML to custom build novel, sophisticated neural networks that perfectly represent the relationships in your dataset.

Philipp Moritz is a PhD candidate in the Electrical Engineering and Computer Sciences (EECS) Department at the University of California, Berkeley, with broad interests in artificial intelligence, machine learning, and distributed systems. He’s a member of the Statistical AI Lab and the RISELab.

Presentations

Using Ray to scale Python, data processing, and machine learning Tutorial

There's no easy way to scale up Python applications to the cloud. Ray is an open source framework for parallel and distributed computing, making it easy to program and analyze data at any scale by providing general-purpose high-performance primitives. Robert Nishihara, Ion Stoica, and Philipp Moritz demonstrate how to use Ray to scale up Python applications, data processing, and machine learning.

Barr Moses is the CEO and cofounder of Monte Carlo. Previously, she was a VP at Gainsight (an enterprise customer data platform), where she helped scale the company 10x in revenue and worked with hundreds of clients on delivering reliable data; a management consultant at Bain & Company; and a research assistant in the statistics department at Stanford University. She also served in the Israeli Air Force as the commander of an intelligence data analyst unit. Barr graduated from Stanford with a BSc in mathematical and computational science.

Presentations

Executive Briefing: Introducing data downtime—From firefighting to winning Session

Ever had your CEO or customer look at your report and tell you the numbers look way off? Barr Moses defines data downtime—periods of time when your data is partial, erroneous, missing, or otherwise inaccurate. Data downtime is highly costly for organizations, yet is often addressed ad hoc. You'll explore why data downtime matters to the data industry and how best-in-class teams address it.

Nisha Muktewar is a research engineer at Cloudera Fast Forward Labs, an applied machine intelligence research and advising group within Cloudera. She works with organizations to help build data science solutions and spends time researching new tools, techniques, and libraries in this space. Previously, she was a manager in Deloitte’s actuarial, advanced analytics, and modeling practice, leading teams in designing, building, and implementing predictive modeling solutions for pricing, consumer behavior, marketing mix, and customer segmentation use cases in the insurance, retail, and consumer businesses.

Presentations

Deep learning for anomaly detection Session

In many business use cases, it's frequently desirable to automatically identify and respond to abnormal data. This process can be challenging, especially when working with high-dimensional, multivariate data. Nisha Muktewar and Victor Dibia explore deep learning approaches (sequence models, VAEs, GANs) for anomaly detection, performance benchmarks, and product possibilities.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Building a cloud data lake: Ingesting, processing, and analyzing big data on AWS Session

Data lakes are hot again; with S3 from AWS as the data lake storage, the modern data lake architecture separates compute from storage. You can choose from a variety of elastic, scalable, and cost-efficient technologies when designing a cloud data lake. Tomer Shiran and Jacques Nadeau share best practices for building a data lake on AWS, as well as various services and open source building blocks.

The fastest data lake: Comparing data lake storage on Azure, Google, and AWS Session

Jacques Nadeau leads a deep dive into important considerations when choosing between data lake storage options: speed, cost, and consistency. You'll learn about these differences and how caching and ephemeral storage affect the trade-offs. Jacques demonstrates technologies that improve the analytical experience by compensating for slow reads.

Moin Nadeem is an undergraduate at MIT, where he studies computer science with a minor in negotiations. His research broadly studies applications of natural language. Most recently, he performed an extensive study on bias in language models, culminating with the release of the largest dataset on bias in NLP in the world. Previously, he cofounded the Machine Intelligence Community at MIT, which aims to democratize machine learning across undergraduates on campus, and received the Best Undergraduate Paper award at MIT.

Presentations

How biased is your natural language model? Assessing fairness in NLP Session

The real world is highly biased, yet we still train AI models on data from it. This leads to models that are highly offensive and discriminatory; for instance, models have learned that male engineers are preferable and therefore discriminate when used in hiring. Moin Nadeem explores how to assess the social biases that popular models exhibit and how to leverage those assessments to create a fairer model.
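One common way to assess this kind of bias is to compare how a model scores stereotypical versus anti-stereotypical sentence pairs. The sketch below illustrates the metric only; the sentences and scores are made-up stand-ins for a real language model's log-likelihoods, not output from any actual model:

```python
def stereotype_score(pairs, score):
    """Fraction of (stereotypical, anti-stereotypical) sentence pairs
    where the model scores the stereotypical variant higher.
    A value near 0.5 would indicate no systematic preference."""
    prefer = sum(1 for stereo, anti in pairs if score(stereo) > score(anti))
    return prefer / len(pairs)

# Toy stand-in for a language model's sentence log-likelihoods.
toy_scores = {
    "The engineer fixed his code.": -1.2,
    "The engineer fixed her code.": -1.9,
    "The nurse took her break.": -1.1,
    "The nurse took his break.": -1.8,
}
pairs = [
    ("The engineer fixed his code.", "The engineer fixed her code."),
    ("The nurse took her break.", "The nurse took his break."),
]
print(stereotype_score(pairs, toy_scores.get))  # -> 1.0 (fully biased toy model)
```

With a real model, `score` would be a function that computes the sentence's log-likelihood under the model, and the pairs would come from a curated bias benchmark rather than hand-written examples.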

Catherine Nelson is a senior data scientist for Concur Labs at SAP Concur, where she explores innovative ways to use machine learning to improve the experience of a business traveller. She’s particularly interested in privacy-preserving ML and applying deep learning to enterprise data. Previously, she was a geophysicist and studied ancient volcanoes and explored for oil in Greenland. Catherine has a PhD in geophysics from Durham University and a master’s of earth sciences from Oxford University.

Presentations

Analyzing and deploying your machine learning model Tutorial

Most deep learning models don’t get analyzed, validated, and deployed. Catherine Nelson and Hannes Hapke explain the necessary steps to release machine learning models for real-world applications. You'll view an example project using the TensorFlow ecosystem, focusing on how to analyze models and deploy them efficiently.

Getting the most out of your AI projects with model feedback loops Session

Measuring your machine learning model’s performance is key for every successful data science project. Therefore, model feedback loops are essential to capture feedback from users and expand your model’s training dataset. Hannes Hapke and Catherine Nelson explore the concept of model feedback and guide you through a framework for increasing the ROI of your data science project.

Alexander Ng is a director of infrastructure and DevOps at Manifold. Previously, he was an engineer and technical lead doing DevOps at Kyruus and did engineering work for the Navy. He holds a BS in electrical engineering from Boston University.

Presentations

Efficient ML engineering: Tools and best practices Tutorial

ML engineers work at the intersection of data science and software engineering—that is, MLOps. Sourav Dey and Alex Ng highlight the six steps of the Lean AI process and explain how it helps ML engineers work as an integrated part of development and production teams. You'll go hands-on with real-world data so you can get up and running seamlessly.

Dave Nielsen is the head of community and ecosystem programs at Redis Labs and the cofounder of CloudCamp, a series of unconferences about cloud computing. Over his 19-year career, he’s been a web developer, systems architect, technical trainer, developer evangelist, and startup entrepreneur. Dave resides in Mountain View with his wife, Erika, to whom he proposed in his coauthored book PayPal Hacks.

Presentations

Redis plus Spark Structured Streaming: The perfect way to scale out your continuous app Session

Redis Streams enables you to collect data in a time series format while matching the data processing rate of your continuous application. Apache Spark’s Structured Streaming API enables real-time decision making for your continuous data. Dave Nielsen demonstrates how to integrate open source Redis with Apache Spark’s Structured Streaming API using the Spark-Redis library.

Robert Nishihara is a fourth-year PhD student working in the University of California, Berkeley, RISELab with Michael Jordan. He works on machine learning, optimization, and artificial intelligence.

Presentations

Using Ray to scale Python, data processing, and machine learning Tutorial

There's no easy way to scale up Python applications to the cloud. Ray is an open source framework for parallel and distributed computing, making it easy to program and analyze data at any scale by providing general-purpose high-performance primitives. Robert Nishihara, Ion Stoica, and Philipp Moritz demonstrate how to use Ray to scale up Python applications, data processing, and machine learning.

Tim Nugent pretends to be a mobile app developer, game designer, tools builder, researcher, and tech author. When he isn’t busy avoiding being found out as a fraud, Tim spends most of his time designing and creating little apps and games he won’t let anyone see. He also spent a disproportionately long time writing his tiny little bio, most of which was taken up trying to stick a witty sci-fi reference in…before he simply gave up. He’s writing Practical Artificial Intelligence with Swift for O’Reilly and building a game for a power transmission company about a naughty quoll. (A quoll is an Australian animal.)

Presentations

Swift for TensorFlow in 3 hours Tutorial

Mars Geldard, Tim Nugent, and Paris Buttfield-Addison are here to prove Swift isn't just for app developers. Swift for TensorFlow provides the power of TensorFlow with all the advantages of Python (and complete access to Python libraries) and of Swift, the safe, fast, incredibly capable open source programming language. Swift for TensorFlow is the perfect way to learn both deep learning and Swift.

Kalvin Ogbuefi is a data scientist at the Children’s Hospital Los Angeles (CHLA). Previously, he was a project assistant in the USC Stevens Neuroimaging and Informatics Institute, Marina del Rey, on radiology image analysis. His extensive research experience comprises projects in deep learning, statistical modeling, and computer simulations at Lawrence Livermore National Laboratories and other major research institutions. He earned an MS in applied statistics from California State University, Long Beach and a BS in applied mathematics from University of California, Merced.

Presentations

Semisupervised AI approach for automated categorization of medical images Session

Annotating radiological images by category at scale is a critical step for analytical ML. Supervised learning is challenging because image metadata doesn't reliably identify image content, and manually labeling images for AI algorithms isn't feasible. Stephan Erberich, Kalvin Ogbuefi, and Long Ho share an approach for automated categorization of radiological images based on content category.

Patryk Oleniuk grew up in Poland, where he completed his undergraduate education in communication and computer engineering at the Warsaw University of Technology. Straight out of college, he was picked up by CERN in Geneva, Switzerland, where he wrote test software for the world’s biggest particle accelerator.

After his work at CERN, he attended graduate school at EPFL in Switzerland, where he earned his MSc in information technologies. He wrote his thesis in collaboration with Virgin Hyperloop One, an American company building a new mode of transportation.

He continues to build software for Hyperloop as a data engineer.

Presentations

Flexible and fast simulation analytics in a growing company: A Hyperloop case study Data Case Studies

To substantiate the key business and safety propositions necessary to establish a new mode of transportation, Virgin Hyperloop One (VHO) has implemented a complex, large-scale, and highly configurable simulation. Each simulation run needs to be analyzed and assessed on several KPIs. This session highlights how we successfully reduced the time to insight of our analyses from days to hours.

ML architecture using newest tools: Predicting near-future passenger demand for Hyperloop Session

Patryk Oleniuk and Sandhya Raghavan investigate how to use demand data to improve the design of the fifth mode of transport—Hyperloop. They discuss passenger demand prediction methods and the tech stack (Spark, Koalas, Keras, MLflow) used to build a deep neural network (DNN)-based near-future demand prediction for simulation purposes.

Richard Ott obtained his PhD in particle physics from the Massachusetts Institute of Technology, followed by postdoctoral research at the University of California, Davis. He then decided to work in industry, taking a role as a data scientist and software engineer at Verizon for two years. When the opportunity to combine his interest in data with his love of teaching arose at The Data Incubator, he joined and has been teaching there ever since.

Presentations

Deep learning with PyTorch (Day 2) Training Day 2

PyTorch is a machine learning library for Python that allows you to build deep neural networks with great flexibility. Its easy-to-use API and seamless use of GPUs make it a sought-after tool for deep learning. Join in to get the knowledge you need to build deep learning models using real-world datasets and PyTorch with Rich Ott.

Aliaksandr Padvitselski (he/him) is a machine learning engineer at SurveyMonkey, where he works on building the machine learning platform and helping to integrate machine learning systems to SurveyMonkey’s products. He worked on a variety of projects related to data business and personalization at SurveyMonkey. Previously, he mostly worked in the finance industry contributing to backend services and building a data warehouse for BI systems.

Presentations

Accelerating your organization: Making data optimal for machine learning Session

Every organization leverages ML to increase value to customers and understand their business. You may have created models, but now you need to scale. Shubhankar Jain, Aliaksandr Padvitselski, and Manohar Angani use a case study to teach you how to pinpoint inefficiencies in your ML data flow, how SurveyMonkey tackled this, and how to make your data more usable to accelerate ML model development.

Deepak Pai is a manager of AI and machine learning core services at Adobe, where he manages a team of data scientists and engineers developing core ML services. These services are used by various Adobe Sensei services that are part of Experience Cloud. He holds a master’s and a bachelor’s degree in computer science from a leading university in India. He’s published papers in top peer-reviewed conferences and has been granted patents.

Presentations

A graph neural network approach for time evolving fraud networks Data Case Studies

Deepak Pai and Sriram Ravindran share how they developed a fraud detection model using state-of-the-art graph neural networks. The model can be used to detect card testing, trial abuse, seat addition, and other abuse patterns.

A machine learning approach to customer profiling by identifying purchase lifecycle stages Session

Identifying customer stages in a buying cycle enables you to perform personalized targeting depending on the stage. Shankar Venkitachalam, Megahanath Macha Yadagiri, and Deepak Pai identify ML techniques to analyze a customer's clickstream behavior to find the different stages of the buying cycle and quantify the critical click events that help transition a user from one stage to another.

Ramesh Panuganty is the founder and CEO of MachEye. He’s a creative technology pioneer (with 12 patents and several publications) and an entrepreneur who has launched and exited three startups. His projects include:

SelectQ: An ed-tech platform that generates SAT questions on the fly using AI and NLG, ratcheting up complexity until the student is fully prepared.

Drastin (acquired by Splunk in 2017): With this, Ramesh created “Conversational Analytics” as a new BI market category. Drastin was recognized in the top five AI platforms by Gartner.

Cloud360 Hyperplatform (acquired by Cognizant in 2012): With this, Ramesh created “Cloud Management Platforms” as a new market category and built a multi-million dollar ARR business.

Presentations

3 steps to implement AI architecture for autonomous analytics (sponsored by MachEye) Session

Users don't speak SQL and data doesn't speak English. It's time to bridge the gap. Ramesh Panuganty details teaching machines how to tell data stories, humanizing UX through interactive audiovisuals, and leveraging ML to automatically surface and deliver insights. You'll learn from industry experts how the largest energy drink manufacturer and student loan company solve these challenges.

Carlos Pazos is a senior product marketing manager at SparkCognition, responsible for automated model building and natural language processing solutions. He specializes in the real-world implementation of AI-based technologies for the oil and gas, utilities, aerospace, finance, and defense sectors. Previously, he was with National Instruments as an IIoT embedded software and distributed systems product marketing manager. He specialized in real-time systems, heterogeneous computing architectures, industrial communication protocols, and analytics at the edge.

Presentations

Neuroevolution-based automated model building: How to create better models Session

AutoML brings acceleration and democratization of data science, but in the game of accuracy and flexibility, using predefined blueprints to find adequate algorithms falls short. Carlos Pazos and Keith Moore shine a spotlight on a neuroevolutionary approach to AutoML to custom build novel, sophisticated neural networks that perfectly represent the relationships in your dataset.

Alexander Pierce is a field engineer at Pepperdata.

Presentations

Autoscaling big data operations in the cloud Session

Alex Pierce evaluates Amazon Elastic MapReduce (EMR), Azure HDInsight, and Google Cloud DataProc, three leading cloud service providers, with respect to Hadoop and big data autoscaling capabilities and offers guidance to help you determine the flavor of autoscaling to best fit your business needs.

Nick Pinckernell is a senior research engineer for the applied AI research team at Comcast, where he works on ML platforms for model serving and feature pipelining. He’s focused on software development, big data, distributed computing, and research in telecommunications for many years. He’s pursuing his MS in computer science at the University of Illinois at Urbana-Champaign, and when free, he enjoys IoT.

Presentations

Feature engineering pipelines five ways with Kafka, Redis, Spark, Dask, Airflow, and more Session

With model serving becoming easier thanks to tools like Kubeflow, the focus is shifting to feature engineering. Nick Pinckernell reviews five ways to get your raw data into engineered features (and eventually to your model) with open source tools, flexible components, and various architectures.

Arvind Prabhakar is cofounder and CTO of StreamSets, provider of the industry’s first DataOps platform for modern data integration. He’s an Apache Software Foundation member and a PMC member on Flume, Sqoop, Storm, and MetaModel projects. Previously, Arvind held many roles at Cloudera, ranging from software engineer to director of engineering.

Presentations

Deploying DataOps for analytics agility Session

DataOps is the best approach for enterprises to improve their business and drive future revenue streams and competitive differentiation, which is why so many businesses are rethinking their data strategy. Arvind Prabhakar explains how DataOps solves the problems that come with managing data movement at scale.

Sandhya Raghavan is a senior data engineer at Virgin Hyperloop One, where she helps build the data analytics platform for the organization. She has 13 years of experience working with leading organizations to build scalable data architectures integrating relational and big data technologies, and she has implemented large-scale, distributed machine learning algorithms. Sandhya holds a bachelor’s degree in computer science from Anna University, India. When she’s not building data pipelines, you can find her traveling the world with her family or pedaling a bike.

Presentations

Flexible and fast simulation analytics in a growing company: A Hyperloop case study Data Case Studies

To substantiate the key business and safety propositions necessary to establish a new mode of transportation, Virgin Hyperloop One (VHO) has implemented a complex, large-scale, and highly configurable simulation. Each simulation run needs to be analyzed and assessed on several KPIs. This session highlights how we successfully reduced the time to insight of our analyses from days to hours.

ML architecture using newest tools: Predicting near-future passenger demand for Hyperloop Session

Patryk Oleniuk and Sandhya Raghavan investigate how to use demand data to improve the design of the fifth mode of transport—Hyperloop. They discuss passenger demand prediction methods and the tech stack (Spark, Koalas, Keras, MLflow) used to build a deep neural network (DNN)-based near-future demand prediction for simulation purposes.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); worked briefly on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper. He’s the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin–Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. You'll get an overview of the inception and growth of the serverless paradigm and explore Apache Pulsar, which provides native serverless support in the form of Pulsar functions.

Using Apache Pulsar functions for data workflows at Narvar Session

Narvar originally used a large collection of point technologies such as AWS Kinesis, Lambda, and Apache Kafka to satisfy its requirements for pub/sub messaging, message queuing, logging, and processing. Karthik Ramasamy and Anand Madhavan walk you through how Narvar moved away from this slew of technologies and consolidated its use cases using Apache Pulsar.

Anand Rao is a partner in PwC’s Advisory Practice and the innovation lead for the Data and Analytics Group, where he leads the design and deployment of artificial intelligence and other advanced analytical techniques and decision support systems for clients, including natural language processing, text mining, social listening, speech and video analytics, machine learning, deep learning, intelligent agents, and simulation. Anand is also responsible for open source software tools related to Apache Hadoop and packages built on top of Python and R for advanced analytics; for research and commercial relationships with academic institutions and startups; and for the research, development, and commercialization of innovative AI, big data, and analytics techniques. Previously, Anand was the chief research scientist at the Australian Artificial Intelligence Institute; program director for the Center of Intelligent Decision Systems at the University of Melbourne, Australia; and a student fellow at IBM’s T.J. Watson Research Center. He has held a number of board positions at startups and currently serves as a board member for a not-for-profit industry association. Anand has coedited four books and published over 50 papers in refereed journals and conferences. He was awarded the most influential paper award for the decade in 2007 from Autonomous Agents and Multi-Agent Systems (AAMAS) for his work on intelligent agents. He’s a frequent speaker on AI, behavioral economics, autonomous cars and their impact, analytics, and technology topics in academic and trade forums. Anand holds an MSc in computer science from Birla Institute of Technology and Science in India, a PhD in artificial intelligence from the University of Sydney, where he was awarded the university postgraduate research award, and an MBA with distinction from Melbourne Business School.

Presentations

A practical guide to responsible AI: Building robust, secure, and safe AI Session

Join in for a practitioner’s overview of the risks of AI and depiction of responsible AI deployment within an organization. You'll discover how to ensure the safety, security, standardized testing, and governance of systems and how models can be fooled or subverted. Ilana Golbin and Anand Rao illustrate how organizations safeguard AI applications and vendor solutions to mitigate AI risks.

Cool AI for Practical Problems (Sponsored by PwC) Session

While many of the solutions to which AI can be applied involve exciting technologies, AI will arguably have greater transformational impact on more mundane problems. This session provides an overview of how PwC has developed and applied innovative AI solutions to common, practical problems across several domains, such as tax, accounting, and management consulting.

ML models are not software: Why organizations need dedicated operations to address the b Session

Anand Rao and Joseph Voyles introduce you to the core differences between software and machine learning model life cycles. They demonstrate how AI’s success also limits its scale and detail leading practices for establishing AIOps to overcome limitations by automating CI/CD, supporting continuous learning, and enabling model safety.

Delip Rao is the vice president of research at the AI Foundation, where he leads speech, language, and vision research efforts for generating and detecting artificial content. Previously, he founded the AI research consulting company Joostware and the Fake News Challenge, an initiative to bring AI researchers across the world together to work on fact checking-related problems, and he was at Google and Twitter. Delip is the author of a recent book on deep learning and natural language processing. His attitude toward production NLP research is shaped by the time he spent at Joostware working for enterprise clients, as the first machine learning researcher on the Twitter antispam team, and as an early researcher at Amazon Alexa.

Presentations

Natural language processing with deep learning 2-Day Training

Delip Rao explores natural language processing (NLP) using a set of machine learning techniques known as deep learning. He walks you through neural network architectures and NLP tasks and teaches you how to apply these architectures for those tasks.

A data and technology leader with 19+ years of experience collaborating with business and technology architecture teams and enabling platform capabilities and innovation on an enterprise data platform. Currently manages a team of product owners, data scientists, and BI developers building AI and machine learning products that help solve customer and business problems.

Patents: Co-inventor on a patent pending for an innovative use of machine learning models at Dell Technologies.

Presentations

Data science + domain experts = exponentially better products Data Case Studies

To deliver best-in-class data science products, solutions must evolve through strong partnerships between data scientists and domain experts. We describe the product lifecycle journey we took as we integrated business expertise with data scientists and technologists, highlighting best practices and pitfalls to avoid when digitally transforming your business through AI and machine learning.

Nancy Rausch is a senior manager at SAS. Nancy has been involved for many years in the design and development of SAS’s data warehouse and data management products, working closely with customers and authoring a number of papers on SAS data management products and best-practice design principles for data management solutions. She holds an MS in computer engineering from Duke University, where she specialized in statistical signal processing, and a BS in electrical engineering from Michigan Technological University. She has recently returned to college and is pursuing an MS in analytics from Capella University.

Presentations

A study of bees: Using AI and art to tell a data story Session

For data to be meaningful, it needs to be presented in a way people can relate to. Nancy Rausch explains how SAS combined AI and art to tell a compelling data story, using streaming data from local beehives to forecast hive health. SAS visualized this data in a live-action art sculpture, which helped bring the data to life in a fun and compelling way.

Meghana Ravikumar is a machine learning engineer at SigOpt with a particular focus on novel applications of deep learning across academia and industry. In particular, Meghana explores the impact of hyperparameter optimization and other techniques on model performance and evangelizes these practical lessons for the broader machine learning community. Previously, she was in biotech, employing natural language processing to mine and classify biomedical literature. She holds a BS degree in bioengineering from UC Berkeley. When she’s not reading papers, developing models and tools, or trying to explain complicated topics, she enjoys doing yoga, traveling, and hunting for the perfect chai latte.

Presentations

Optimized image classification on the cheap Session

Meghana Ravikumar anchors on building an image classifier trained on the Stanford Cars dataset to evaluate fine tuning, feature extraction, and the impact of hyperparameter optimization, then tune image transformation parameters to augment the model. The goal is to answer: how can resource-constrained teams make trade-offs between efficiency and effectiveness using pretrained models?

Sriram Ravindran is a data scientist at Adobe, where he’s building a platform called Fraud AI, a solution designed to meet Adobe’s fraud detection needs. Previously, he was a graduate researcher at the University of California, San Diego, where he worked on applying deep learning to EEG (brain activity) data.

Presentations

A graph neural network approach for time evolving fraud networks Data Case Studies

Deepak Pai and Sriram Ravindran share how they developed a fraud detection model using state-of-the-art graph neural networks. The model can be used to detect card testing, trial abuse, seat addition, and other abuse patterns.

Joy Rimchala is a data scientist in Intuit’s Machine Learning Futures Group working on ML problems in limited-label data settings. Joy holds a PhD from MIT, where she spent five years doing biological object tracking experiments and modeling them using Markov decision processes.

Presentations

Explainable AI: Your model is only as good as your explanation Session

Explainable AI (XAI) has gained industry traction, given the importance of explaining ML-assisted decisions in human terms and detecting undesirable ML defects before systems are deployed. Talia Tron and Joy Rimchala delve into XAI techniques, advantages and drawbacks of black box versus glass box models, concept-based diagnostics, and real-world examples using design thinking principles.

Kelley Rivoire is the head of data infrastructure at Stripe, where she leads the Data Infrastructure Group. As an engineer, she built Stripe’s first real-time machine learning evaluation of user risk. Previously, she worked on nanophotonics and 3D imaging as a researcher at HP Labs. She holds a PhD from Stanford.

Presentations

Production ML outside the black box: From repeatable inputs to explainable outputs Session

Tools for training and optimizing models have become more prevalent and easier to use; however, these are insufficient for deploying ML in critical production applications. Kelley Rivoire dissects how Stripe approached challenges in developing reliable, accurate, and performant ML applications that affect hundreds of thousands of businesses.

Paige Roberts is an open source relations manager at Vertica, where she promotes understanding of Vertica, MPP data processing, open source, and how the analytics revolution is changing the world. In two decades in the data management industry, she’s worked as an engineer, a trainer, a marketer, a product manager, and a consultant.

Presentations

Architecting production IoT analytics Session

What works in production is the only technology criterion that matters. Companies with successful high-scale production IoT analytics programs like Philips, Anritsu, and OptimalPlus show remarkable similarities. IoT at production scale requires certain technology choices. Paige Roberts drills into the architectures of successful production implementations to identify what works and what doesn’t.

Lisa Joy Rosner is the CMO at Otonomo, an automotive data services platform, where she drives global development of the company’s marketplace. She’s an award-winning and patented executive with over 20 years of experience marketing big data and analytics solutions at both public and startup technology companies. Previously, Rosner was CMO at Neustar, leading a major brand transformation as the company entered the security and marketing data services markets; launched social intelligence company NetBase, where she worked with 5 of the top 10 CPGs as they adopted a new approach to real-time marketing; was vice president of marketing at MyBuys (sold to Magnetic); was vice president of worldwide marketing at BroadVision; and held positions at data warehousing companies Brio (sold to Hyperion), DecisionPoint (sold to Teradata), and Oracle. Lisa Joy was named a 2013 Silicon Valley Woman of Influence and 2014 B2B Marketer of the Year by the Sage Group and the Wall Street Journal, and was a Top 100 Women in Marketing honoree by Brand Innovators in 2015. She’s been a guest lecturer at the Haas School of Business, the Tuck School of Business, and Stanford University. She earned a bachelor’s degree (summa cum laude) in English literature from the University of Maryland. She sits on the marketing advisory boards of Mintigo, The Big Flip, Fyber, and PLAE Shoes, and the board of trustees for UC Merced. She’s the mother of four young children.

Presentations

Navigating compliance in the future of mobility while protecting driver privacy Session

As cars gain more advanced features, the role of customer privacy and responsible data stewardship becomes an important focus for auto manufacturers and drivers. Lisa Joy Rosner discusses the future of connected vehicles, data compliance measures, and the impact of related policies like GDPR and the California Consumer Privacy Act (CCPA).

Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.

Presentations

Building a secure, scalable, and transactional data lake on AWS 2-Day Training

In this workshop, we walk you through the steps of building a data lake on Amazon S3: using different ingestion mechanisms, performing incremental data processing on the data lake to support transactions on S3, and securing the data lake with fine-grained access control policies.

Building a serverless big data application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join Nikki Rouda to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Rachel Roumeliotis is a strategic content director at O’Reilly, where she leads an editorial team that covers a wide variety of programming topics, ranging from full stack to open source in the enterprise to emerging programming languages. Rachel is a program chair of OSCON and O’Reilly’s Software Architecture Conference. She has been working in technical publishing for 10 years, acquiring content in many areas, including mobile programming, UX, computer security, and AI.

Presentations

Tuesday keynotes Keynote

Strata program chairs Rachel Roumeliotis and Alistair Croll welcome you to the first day of keynotes.

Wednesday keynotes Keynote

Strata program chairs Rachel Roumeliotis and Alistair Croll welcome you to the second day of keynotes.

Ebrahim Safavi is a senior data scientist at Mist, focusing on knowledge discovery from big data using machine learning and large-scale data mining, where he developed and implemented several key production components, including the company’s chatbot inference engine and anomaly detection. He won a Microsoft research award for his work on information retrieval and recommendation systems in graph-structured networks. Ebrahim earned a PhD in cognitive learning networks from Stevens Institute of Technology.

Presentations

Scalable and automated pipeline for large-scale neural network training and inference Session

Anomaly detection models are essential to run data-driven businesses intelligently. At Mist Systems, the need for accuracy and the scale of the data pose challenges in building and automating ML pipelines. Ebrahim Safavi and Jisheng Wang explain how recurrent neural networks and novel statistical models allow Mist Systems to build a cloud native solution and automate the anomaly detection workflow.

Guillaume Saint-Jacques is the tech lead of computational social science at LinkedIn. Previously, he was the technical lead of the LinkedIn experimentation science team. He holds a PhD in management research from the MIT Sloan School of Management, a master’s degree in economics from the Paris École Normale Supérieure and the Paris School of Economics, and a master’s degree in entrepreneurship from HEC Paris.

Presentations

Fairness through experimentation at LinkedIn Session

Most companies want to ensure their products and algorithms are fair. Guillaume Saint-Jacques and Meg Garlinghouse share LinkedIn's A/B testing approach to fairness and describe new methods that detect whether an experiment introduces bias or inequality. You'll learn about a scalable implementation on Spark and discover examples of use cases and impact at LinkedIn.

Mehrnoosh Sameki is a technical program manager at Microsoft, responsible for leading the product efforts on machine learning interpretability within the Azure Machine Learning platform. Previously, she was a data scientist at Rue Gilt Groupe, incorporating data science and machine learning in the retail space to drive revenue and enhance customers’ personalized shopping experiences. She earned her PhD degree in computer science at Boston University.

Presentations

An overview of responsible artificial intelligence Tutorial

Mehrnoosh Sameki and Sarah Bird examine six core principles of responsible AI: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability, focusing on transparency, fairness, and privacy. You'll discover best practices and state-of-the-art open source toolkits that empower researchers, data scientists, and stakeholders to build trustworthy AI systems.

Aryn Sargent is a data analyst with over six years of experience leading the identification and acceleration of successful solutions for enterprise conversational AI and intelligent virtual assistants (IVAs). In that time, she has held numerous roles within Verint Next IT, including key positions in product management, product strategy, and data analysis. Today, Aryn leads strategic accounts and clients in identifying and defining IVA understanding and knowledge areas through the use of proprietary AI-powered tools to analyze unstructured conversational data. She’s responsible for clients’ automation strategies and for evaluating, measuring, and growing their success, defining tactical knowledge areas to achieve a long-term vision. When she’s not working with datasets, she’s well known for her green thumb in the garden and her love of dogs, fostering dogs in need until they find a loving forever home.

Presentations

Deploying chatbots and conversational analysis: Learn what customers really want to know Data Case Studies

Chatbots are increasingly used in customer service as a first tier of support. Through deep analysis of conversation logs, you can learn real user motivations and where company improvements can be made. This talk compares building versus buying when deploying self-service bots, covers motivations and techniques for deep conversational analysis, and discusses real-world discoveries.

Roshan Satish is a product manager who has been involved with artificial intelligence initiatives at DocuSign since their inception. He came to the company through an acquisition of a CLM startup, SpringCM, and worked with product leadership across the organization to formalize an AI vision before beginning to scale out the team. His job has been to create a robust, enterprise-grade deep learning platform that enables intelligence and insights across the DocuSign Agreement Cloud. Understandably, many of the use cases center around document understanding and natural language processing (NLP) and natural language understanding (NLU)—but they’ve also explored features leveraging CNNs, as well as classical machine learning models. One of the major challenges has been working with a bare metal tech stack while emphasizing scalability and modularity of DocuSign’s AI services.

Presentations

A unified CV, OCR, and NLP model pipeline for scalable document understanding at DocuSign Session

Roshan Satish and Michael Chertushkin lead you through a real-world case study about applying state-of-the-art deep learning techniques to a pipeline that combines computer vision (CV), optical character recognition (OCR), and natural language processing (NLP) at DocuSign. You'll discover how the project delivered on its extreme interpretability, scalability, and compliance requirements.

Danilo Sato is a principal consultant at ThoughtWorks with more than 17 years of experience in many areas of architecture and engineering: software, data, infrastructure, and machine learning. Balancing strategy with execution, Danilo helps clients refine their technology strategy while adopting practices to reduce the time between having an idea, implementing it, and running it in production using the cloud, DevOps, and continuous delivery. He is the author of DevOps in Practice: Reliable and Automated Software Delivery, is a member of ThoughtWorks’ Technology Advisory Board and Office of the CTO, and is an experienced international conference speaker.

Presentations

Continuous delivery for machine learning: Automating the end-to-end lifecycle Tutorial

Danilo Sato leads you through applying continuous delivery (CD) to data science and machine learning (ML). Join in to learn how to make changes to your models while safely integrating and deploying them into production, using testing and automation techniques to release reliably at any time and with high frequency.

Laura Schornack is an expert engineer and lead design architect for shared services at JPMorgan Chase. Previously, she worked for world-renowned organizations such as IBM and Nokia. She holds a degree in computer science from University of Illinois at Urbana-Champaign.

Presentations

Architecting and deploying an ML model to the private cloud Interactive session

Many pieces go into integrating machine learning models into an application. Laura Schornack details how to create the architecture for each piece so it can be delivered in an agile manner. Along the way, you'll learn how to integrate these pieces into an existing application.

Liqun Shao is a data scientist in the AI Development Acceleration Program at Microsoft. Her first rotational project, “Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-Based Platforms,” resulted in a paper published at SoCC 2019; her second, on Azure Machine Learning text analytics best practices, contributed to the public NLP repository. She earned a bachelor’s degree in computer science in China and a doctorate in computer science from the University of Massachusetts. Her research focuses on natural language processing, data mining, and machine learning, especially title generation, summarization, and classification.

Presentations

Distributed training in the cloud for production-level NLP models Session

Liqun Shao leads you through a new GitHub repository to show you how data scientists without NLP knowledge can quickly train, evaluate, and deploy state-of-the-art NLP models. She focuses on two use cases with distributed training on Azure Machine Learning with Horovod: GenSen for sentence similarity and BERT for question-answering using Jupyter notebooks for Python.

Presentations

A graph neural network approach for time evolving fraud networks Data Case Studies

This session details developing a fraud detection model using state-of-the-art graph neural networks. The model can be used to detect card testing, trial abuse, seat addition, and other evolving fraud patterns.

Mehul Sheth is a senior performance engineer in the Performance Labs at Druva, where he is responsible for the performance of the CloudApps product of Druva InSync. He has more than 13 years of experience in development and performance engineering, where he’s ensured production performance of thousands of applications. Mehul loves to tackle unsolved problems and strives to bring a simple solution to the table, rather than trying complex things.

Presentations

Realistic synthetic data at scale: Influenced by, but not, production data Session

Any software product needs to be tested against data, and it's difficult to have a random but realistic dataset representing production data. Mehul Sheth highlights using production data to generate models. Production data is accessed without exposing it or violating any customer agreements on privacy, and the models then generate test data at scale in lower environments.

Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, road map, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers; and he held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.

Presentations

Building a cloud data lake: Ingesting, processing, and analyzing big data on AWS Session

Data lakes are hot again, and with Amazon S3 as the data lake storage, the modern data lake architecture separates compute from storage. You can choose from a variety of elastic, scalable, and cost-efficient technologies when designing a cloud data lake. Tomer Shiran and Jacques Nadeau share best practices for building a data lake on AWS, as well as various services and open source building blocks.

Pramod Singh is a senior machine learning engineer at Walmart Labs. He has extensive hands-on experience in machine learning, deep learning, AI, data engineering, designing algorithms, and application development. He has spent more than 10 years working on multiple data projects at different organizations. He’s the author of three books: Machine Learning with PySpark, Learn PySpark, and Learn TensorFlow 2.0. He’s also a regular speaker at major conferences such as the O’Reilly Strata Data and AI Conferences. Pramod holds a BTech in electrical engineering from BATU and an MBA from Symbiosis University, and he has earned a data science certification from IIM Calcutta. He lives in Bangalore with his wife and three-year-old son. In his spare time, he enjoys playing guitar, coding, reading, and watching football.

Presentations

Attention networks all the way to production using Kubeflow Tutorial

Vijay Srinivas Agneeswaran, Pramod Singh, and Akshay Kulkarni demonstrate the in-depth process of building a text summarization model with an attention network using TensorFlow (TF) 2.0. You'll gain the practical hands-on knowledge to build and deploy a scalable text summarization model on top of Kubeflow.

Joseph Sirosh is the chief technology officer at Compass. Previously, he was the corporate vice president of the Cloud AI Platform at Microsoft, where he led the company’s enterprise AI strategy and products such as Azure Machine Learning, Azure Cognitive Services, Azure Search, and Bot Framework; the corporate vice president for Microsoft’s Data Platform; the vice president for Amazon’s Global Inventory Platform, responsible for the science and software behind Amazon’s supply chain and order fulfillment systems, as well as the central Machine Learning Group, which he built and led; and the vice president of research and development at Fair Isaac Corp., where he led R&D projects for DARPA, Homeland Security, and several other government organizations. He’s passionate about machine learning and its applications and has been active in the field since 1990. Joseph holds a PhD in computer science from the University of Texas at Austin and a BTech in computer science and engineering from the Indian Institute of Technology Chennai.

Presentations

Compass uses Amazon to simplify and modernize home search Session

Compass is changing real estate by leveraging its industry-leading software to build search and analytical tools that help real estate professionals find, market, and sell homes. Joseph Sirosh details how Compass leverages AWS services, including Amazon Elasticsearch Service, to deliver a complete, scalable home-search solution.

Divya Sivasankaran is a machine learning scientist at integrate.ai, where she focuses on building out FairML capabilities within its products. Previously, she worked for a startup that partnered with government organizations, including police forces and healthcare providers, to build AI capabilities aimed at positive change. These experiences shaped her thinking around the larger ethical implications of AI in the wild and the need for ethical considerations to be brought forward at the design thinking stage (proactive versus reactive).

Presentations

FairML from theory to practice: Lessons drawn from our journey to build a fair product Session

In recent years, there's been a lot of attention on the need for ethical considerations in ML, as well as different ways to address bias in different stages of the ML pipeline. However, there hasn't been a lot of focus on how to bring fairness to ML products. Divya Sivasankaran explores the key challenges (and how to overcome them) in operationalizing fairness and bias in ML products.

Jason “Jay” Smith is a Cloud customer engineer at Google. He spends his day helping enterprises find ways to expand their workload capabilities on Google Cloud. He’s on the Kubeflow go-to-market team and provides code contributions to help people build an ecosystem for their machine learning operations. His passions include big data, ML, and helping organizations find a way to collect, store, and analyze information.

Presentations

Using serverless Spark on Kubernetes for data streaming and analytics Session

Data is a valuable resource, but collecting and analyzing the data can be challenging. And the cost of resource allocation often prohibits the speed at which you can analyze the data. Jay Smith and Remy Welch break down how serverless architecture can improve the portability and scalability of streaming event-driven Apache Spark jobs and perform ETL tasks using serverless frameworks.

Maulik Soneji is a product engineer at Gojek, where he works with different parts of data pipelines for a hypergrowth startup. Outside of learning about mature data systems, he’s interested in Elasticsearch, Go, and Kubernetes.

Presentations

BEAST: Building an event processing library to handle millions of events Session

Maulik Soneji and Dinesh Kumar explore Gojek’s event-processing library, which consumes events from Kafka and pushes them to BigQuery. All of Gojek’s services are event sourced; it handles hundreds of topics, with loads as high as 21K messages per second on a few of them.

Colin Spikes is a senior manager of solution engineering at Algorithmia and an experienced solution consultant with an extensive background in all things data. Previously, Colin managed a team of data solution architects at Socrata, assisting cities, states, and federal agencies worldwide in unlocking the power of data to better understand and communicate conditions in their communities.

Presentations

The OS for AI: How serverless computing enables the next gen of machine learning Session

ML is advancing rapidly, but only a few contributors focus on the infrastructure and scaling challenges that come with it. Jonathan Peck explores why ML is a natural fit for serverless computing, a general architecture for scalable ML, and common issues when implementing on-demand scaling over GPU clusters. He provides general solutions and describes a vision for the future of cloud-based ML.

Vijay Srinivas Agneeswaran is a director of data sciences at Walmart Labs in India, where he heads the machine learning platform development and data science foundation teams, which provide platform and intelligent services for Walmart businesses around the world. He’s spent the last 18 years creating intellectual property and building data-based products in industry and academia. Previously, he led the team that delivered real-time hyperpersonalization for a global automaker, as well as other work for various clients across domains such as retail, banking and finance, telecom, and automotive; he built PMML support into Spark and Storm and implemented several machine learning algorithms, such as LDA and random forests, over Spark; he led a team that designed and implemented a big data governance product for role-based, fine-grained access control inside Hadoop YARN; and he and his team built the first distributed deep learning framework on Spark. He’s been a professional member of the ACM and the IEEE (senior) for the last 10+ years. He has five full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, artificial intelligence, and big data and other emerging technologies. Vijay has a bachelor’s degree in computer science and engineering from SVCE, Madras University; an MS (by research) from IIT Madras; and a PhD from IIT Madras, and he held a postdoctoral research fellowship in the LSIR Labs at the Swiss Federal Institute of Technology, Lausanne (EPFL).

Presentations

Attention networks all the way to production using Kubeflow Tutorial

Vijay Srinivas Agneeswaran, Pramod Singh, and Akshay Kulkarni demonstrate the in-depth process of building a text summarization model with an attention network using TensorFlow (TF) 2.0. You'll gain the practical hands-on knowledge to build and deploy a scalable text summarization model on top of Kubeflow.

Ion Stoica is a professor in the Electrical Engineering and Computer Sciences (EECS) Department at the University of California, Berkeley, where he researches cloud computing and networked computer systems. Previously, he worked on dynamic packet state, chord DHT, internet indirection infrastructure (i3), declarative networks, and large-scale systems, including Apache Spark, Apache Mesos, and Alluxio. He’s the cofounder of Databricks—a startup to commercialize Apache Spark—and Conviva—a startup to commercialize technologies for large-scale video distribution. Ion is an ACM fellow and has received numerous awards, including inclusion in the SIGOPS Hall of Fame (2015), the SIGCOMM Test of Time Award (2011), and the ACM doctoral dissertation award (2001).

Presentations

Using Ray to scale Python, data processing, and machine learning Tutorial

There's no easy way to scale up Python applications to the cloud. Ray is an open source framework for parallel and distributed computing, making it easy to program and analyze data at any scale by providing general-purpose high-performance primitives. Robert Nishihara, Ion Stoica, and Philipp Moritz demonstrate how to use Ray to scale up Python applications, data processing, and machine learning.

Dave Stuart is a senior technical executive within the US Department of Defense, where he’s leading a large-scale effort to transform the workflows of thousands of enterprise business analysts through Jupyter and Python adoption, making tradecraft more efficient, sharable, and repeatable. Previously, Dave led multiple grassroots technology adoption efforts, developing innovative training methods that tangibly increased the technical proficiency of a large noncoding enterprise workforce.

Presentations

Jupyter as an enterprise DIY analytic platform Session

Dave Stuart takes a look at how the US Intelligence Community (IC) uses Jupyter and Python to harness the subject matter expertise of analysts in a DIY analytic movement. He covers the technical and cultural challenges the community encountered in its quest to find success at large scale and the strategies used to mitigate them.

Bargava Subramanian is a cofounder and deep learning engineer at Binaize in Bangalore, India. He has 15 years’ experience delivering business analytics and machine learning solutions to B2B companies, and he mentors organizations in their data science journey. He holds a master’s degree from the University of Maryland, College Park. He’s an ardent NBA fan.

Presentations

Deep learning for recommendation systems 2-Day Training

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains. You'll get an end-to-end overview of deep learning-based recommendation and learning-to-rank systems to understand practical considerations and guidelines for building and deploying RecSys.

Deep learning for recommendation systems (Day 2) Training Day 2

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains. You'll get an end-to-end overview of deep learning-based recommendation and learning-to-rank systems to understand practical considerations and guidelines for building and deploying RecSys.

Dev Tagare is an engineering manager at Lyft. He has hands-on experience in building end-to-end data platforms for high-velocity and large data volume use cases. Previously, Dev spent 10 years leading engineering functions for companies including Oracle and Twitter with a focus on areas including open source; big data; low-latency, high-scalability design; data structures; design patterns; and real-time analytics.

Presentations

Reducing data lag from 24+ hours to 5 mins at Lyft scale Session

Mark Grover and Dev Tagare offer you a glimpse at the end-to-end data architecture Lyft uses to reduce data lag appearing in its analytical systems from 24+ hours to under 5 minutes. You'll learn the what and why of tech choices, monitoring, and best practices. They outline the use cases Lyft has enabled, especially in ML model performance and evaluation.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe with Microsoft’s Bing Group and built and ran distributed teams that helped scale Amazon’s financial systems with Amazon in both Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, and Claudiu Branzan detail the application of the latest advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Model governance: A checklist for getting AI safely to production Session

The industry has about 40 years of experience forming best practices and tools for storing, versioning, collaborating, securing, testing, and building software source code—but only about 4 years doing so for AI models. David Talby catches you up on current best practices and freely available tools so your team can go beyond experimentation to successfully deploy models.

Ankur Taly is the head of data science at Fiddler Labs, where he is responsible for developing, productizing, and evangelizing core explainable AI technology. Previously, he was a staff research scientist at Google Brain, where he carried out research in explainable AI and was best known for his contribution to developing and applying Integrated Gradients, a new interpretability algorithm for deep networks. His research in this area has resulted in publications at top-tier machine learning conferences and prestigious journals like the American Academy of Ophthalmology (AAO) and Proceedings of the National Academy of Sciences (PNAS). Besides explainable AI, Ankur has a broad research background and has published 25+ papers in several areas, including computer security, programming languages, formal verification, and machine learning. He has served on several academic conference program committees (PLDI 2014 and 2019, POST 2014, PLAS 2013), delivered invited lectures at universities and various industry venues, and instructed short courses at summer schools and conferences. Ankur obtained his PhD in computer science from Stanford University in 2012 and a BTech in computer science from IIT Bombay in 2007.

Presentations

Slice and Explain: A Unified Paradigm for Explaining Machine Learning Models Session

Ankur Taly presents a new paradigm for model explanations called “Slice and Explain” that unifies several existing explanation tools into a single framework. You’ll learn how the framework can be leveraged by data scientists, business users, and regulators to successfully analyze models.

Wangda Tan is a product management committee (PMC) member of Apache Hadoop and engineering manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-premises use cases of Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop submarine project (running a deep learning workload across YARN and Kubernetes). He’s also led features such as resource scheduling, GPU isolation, node labeling, and resource preemption in the Hadoop YARN community. Previously, he worked on the integration of OpenMPI and GraphLab with Hadoop YARN at Pivotal and participated in creating a large-scale machine learning, matrix, and statistics computation program using MapReduce and MPI at Alibaba.

Presentations

It’s 2020 now: Apache Hadoop 3.x state of the union and upgrade guidance Session

In 2020, Hadoop is still evolving fast. You'll learn the current status of the Apache Hadoop community and the exciting present and future of Hadoop 3.x. Wangda Tan and Arpit Agarwal cover new features like Hadoop on cloud, GPU support, NameNode federation, Docker support, 10x scheduling improvements, and Ozone, and they offer guidance for upgrading from 2.x to 3.x.

Cathy Tanimura is senior director of analytics and data science at Strava. She has a passion for leveraging data in multiple ways: to help people make better decisions, to tell stories about companies and industries, and to develop great product experiences. She previously built and led data teams at several high-growth technology companies, including Okta, Zynga, and StubHub.

Presentations

The Power of Visualizing Health, Fitness, and Community Impact Session

Drawing on specific product innovations and applications like Relative Effort, cumulative stats, Year in Sport, Heatmaps, and Metro, Cathy Tanimura, senior director of analytics at Strava, shares best practices for creating effective data visualizations that help improve the health and fitness of individuals, as well as the well-being of communities.

Fatma Tarlaci is a data science fellow at Quansight, where she focuses on creating training materials in AI and contributes to data science and machine learning projects. She received her PhD in humanities from the University of Texas at Austin, followed by a master’s degree in computer science from Stanford University. Her work and research specialize in deep learning and data science.

Presentations

Natural language processing with open source Tutorial

Language is at the heart of everything we—humans—do. Natural language processing (NLP) is one of the most challenging tasks of artificial intelligence, mainly due to the difficulty of detecting nuances and common sense reasoning in natural language. Fatma Tarlaci invites you to learn more about NLP and get a complete hands-on implementation of an NLP deep learning model.

Maureen is chief data scientist at Reonomy, a property intelligence company that is transforming the world’s largest asset class: commercial real estate. Maureen has run simulations and transformed data for almost 20 years. She has a breadth of knowledge across a variety of data, including location data, click data, image data, streaming data, and public and simulated data, as well as experience working with data at scale, managing datasets ranging from kilobytes to terabytes.

In previous roles, Maureen drove technological and process advancements that resulted in 500% year-over-year B2B contract growth at Enigma, a data-as-a-service company headquartered in New York City. She delivered smart technology that anticipates human behavior and needs at Axon Vibe, created a smartwatch app recommender in the Insight Data Science Fellows Program, and researched galactic shapes arising from the interplay between dark matter and stellar evolution as a postdoctoral associate at Rutgers University. Maureen holds a PhD in computational astrophysics from Columbia University, where she studied the evolution of galaxies by running cosmological simulations on supercomputers.

Presentations

How Machine Learning is Creating a New Category in Commercial Real Estate Data Case Studies

Although commercial real estate is one of America’s largest industries, its professionals have been deprived of insights and opportunities due to the fragmented, disparate nature of real estate information. The $15 trillion market is still predominantly facilitated by paper agreements, phone calls, and in-person transactions.

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, and Claudiu Branzan detail the application of the latest advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Sherin is a software engineer at Lyft. In her career spanning eight years, she has worked on most parts of the tech stack but enjoys the challenges of data science and machine learning the most. Most recently, she has focused on building products that facilitate advances in artificial intelligence and machine learning through streaming.

She is passionate about getting more people, especially women, interested in the field and shares her work with the community through tech talks and panel discussions, most recently talks on machine learning infrastructure and streaming at Beam Summit and at Flink Forward in Berlin.

In her free time, she loves to read and paint. She is also the president of the Russian Hill book club in San Francisco and loves to organize events for her local library.

Presentations

Building a self-service platform for continuous, real-time feature generation Data Case Studies

In the world of ride sharing, decisions such as matching a passenger to the nearest driver, pricing, and ETAs need to be made in real time. For this, it is imperative to build the most up-to-date view of the world using data. However, gleaning information from high-volume streaming data is tricky, and solutions are often hard to use. Lyft has attempted to solve this problem with Flink.

Jameson Toole is the cofounder and CEO of Fritz AI, a company building tools to help developers optimize, deploy, and manage machine learning models on mobile devices. Previously, he built analytics pipelines for Google X’s Project Wing and ran the data science team at Boston technology startup Jana Mobile. He holds undergraduate degrees in physics, economics, and applied mathematics from the University of Michigan and both an MS and PhD in engineering systems from MIT, where he worked on applications of big data and machine learning to urban and transportation planning at the Human Mobility and Networks Lab.

Presentations

Creating smaller, faster, production-worthy mobile machine learning models Session

Getting machine learning (ML) models ready for use on device is a major challenge. Jameson Toole explains optimization, pruning, and compression techniques that keep app sizes small and inference speeds high. You'll learn to apply these techniques using mobile ML frameworks such as Core ML and TensorFlow Lite.

Talia Tron is a senior data scientist on the ML technologies futures team at Intuit, where she leads the effort on explainable AI. She worked on the security risk and fraud team, where she used ML and AI solutions to detect threats and fraud in Intuit’s products. She’s the leader of Intuit’s innovation catalyst local community, pushing toward customer obsession and design thinking across the Israeli site. Previously, she was a data scientist in Microsoft’s advanced threat analytics group (ATA R&D), developed customized elearning tools in the Microsoft Education Group, and cofounded the interdisciplinary psychiatry group, which brings together clinicians, neuroscientists, and data scientists to advance brain-related psychiatric evaluation and treatment. Talia holds a PhD in computational neuroscience from the Hebrew University, where she developed automatic tools for analyzing facial expressions and motor behavior in schizophrenia. She conducted research in collaboration with the Sheba Medical Center Innovation Center, using ML to explore and predict treatment outcomes and develop medical decision support systems.

Presentations

Explainable AI: Your model is only as good as your explanation Session

Explainable AI (XAI) has gained industry traction, given the importance of explaining ML-assisted decisions in human terms and detecting undesirable ML defects before systems are deployed. Talia Tron and Joy Rimchala delve into XAI techniques, advantages and drawbacks of black box versus glass box models, concept-based diagnostics, and real-world examples using design thinking principles.

Teresa Tung is a managing director at Accenture, where she’s responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture’s clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture’s Applied Intelligence Platform. Teresa is Accenture’s most prolific inventor, with 170+ patents and applications. She holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

Building the digital twin: IoT and unconventional data Session

The digital twin presents a problem of data and models at scale—how to mobilize IT and OT data, AI, and engineering models that work across lines of business and even across partners. Teresa Tung and William Gatehouse share their experience of implementing digital twins use cases that combine IoT, AI models, engineering models, and domain context.

Sandeep Uttamchandani is the hands-on chief data architect and head of data platform engineering at Intuit, where he’s leading the cloud transformation of the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep held engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep’s experience uniquely combines building enterprise data products and operational expertise in managing petabyte-scale data and analytics platforms in production for IBM’s federal and Fortune 100 customers. Sandeep has received several excellence awards. He has over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He blogs on LinkedIn and his personal blog, Wrong Data Fabric. Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.

Presentations

10 lead indicators before data becomes a mess Session

Data quality metrics focus on quantifying whether data is already a mess, but you need lead indicators before data becomes a mess. Sandeep Uttamchandani, Giriraj Bagadi, and Sunil Goplani explore developing lead indicators of data quality for Intuit's production data pipelines. You'll learn the details of lead indicators, optimization tools, and lessons that moved the needle on data quality.

A senior principal test engineer at Dell EMC in Round Rock, Texas, with more than 26 years of expertise in test engineering, including 7 years in engineering management, test design, and tools and automation. Publications include a coauthored article in CIOReview (November 2018). Coinventor of four pending patents for the innovative use of machine learning models at Dell Technologies and of US patent 9,050,529 for innovative hardware design at Microsoft (2012).

Presentations

Data science + domain experts = exponentially better products Data Case Studies

To deliver best-in-class data science products, solutions must evolve through strong partnerships between data scientists and domain experts. We describe the product lifecycle journey we took as we integrated business expertise with data science and technology, highlighting best practices and pitfalls to avoid when digitally transforming your business through AI and machine learning.

Balaji Varadarajan is a senior software engineer at Uber, where he works on the Hudi project and oversees data engineering broadly across the network performance monitoring domain. Previously, he was one of the lead engineers on LinkedIn’s databus change capture system as well as the Espresso NoSQL store. Balaji’s interests lie in distributed data systems.

Presentations

Bringing stream processing to batch data using Apache Hudi (incubating) Session

Batch processing can benefit immensely from adopting some techniques from the streaming processing world. Balaji Varadarajan shares how Apache Hudi (incubating), an open source project created at Uber and currently incubating with the ASF, can bridge this gap and enable more productive, efficient batch data engineering.

Sundar Varadarajan is a consulting partner on AI and ML at Wipro and plays an advisory role on edge AI and ML solutions. He’s an industry expert in the field of analytics, machine learning, and AI, having ideated, architected, and implemented innovative AI solutions across multiple industry verticals. Sundar can be reached at sundar.varadarajan@wipro.com.

Presentations

An approach to automate time and motion analysis Session

Time and motion study of manufacturing operations on a shop floor is traditionally carried out through manual observation, which is time-consuming and subject to human error and limitations. Sundar Varadarajan and Peyman Behbahani detail a new approach of video analytics combined with time series analysis to automate the process of activity identification and timing measurements.

Paroma Varma is a cofounder at Snorkel and completed a PhD at Stanford, advised by Professor Christopher Ré and affiliated with the DAWN, SAIL, and StatML groups, where she was supported by the Stanford Graduate Fellowship and the National Science Foundation Graduate Research Fellowship. Her research interests revolve around weak supervision or using high-level knowledge in the form of noisy labeling sources to efficiently label massive datasets required to train machine learning models.

Presentations

Programmatically building and managing training datasets with Snorkel Tutorial

Paroma Varma teaches you how to build and manage training datasets programmatically with Snorkel, an open source framework developed at the Stanford AI Lab, and demonstrates how this can lead to more efficiently building and managing machine learning (ML) models in a range of practical settings.

Shankar Venkitachalam is a data scientist on the experience cloud research and sensei team at Adobe. He holds a master’s degree in computer science from the University of Massachusetts Amherst. He’s passionate about machine learning, probabilistic graphical models, and natural language processing.

Presentations

A machine learning approach to customer profiling by identifying purchase lifecycle stages Session

Identifying customer stages in a buying cycle enables you to perform personalized targeting depending on the stage. Shankar Venkitachalam, Megahanath Macha Yadagiri, and Deepak Pai identify ML techniques to analyze a customer's clickstream behavior to find the different stages of the buying cycle and quantify the critical click events that help transition a user from one stage to another.

Sumeet Vij is a director in the Strategic Innovation Group (SIG) at Booz Allen Hamilton, where he leads multiple client engagements, research, and strategic partnerships in the field of AI, digital personalization, recommendation systems, chatbots, digital assistants, and conversational commerce. Sumeet is also the practice lead for next-generation digital experiences powered by AI and data science, helping with the large-scale analysis of data and its use to quickly provide deeper insights, create new capabilities, and drive down costs.

Presentations

Weak supervision for stronger models: Increasing classification strength using noisy data Session

Weak supervision allows the use of noisy sources to provide supervision signals for labeling large amounts of training data. Sumeet Vij showcases an approach combining a Snorkel weak supervision framework with denoising labeling functions, a generative model, and AI-powered search to train classifiers leveraging enterprise knowledge, without the need for tens of thousands of hand-labeled examples.

Jorge Villamariona is a senior technical marketing engineer on the product marketing team at Qubole. Over the years, Jorge has acquired extensive experience in relational databases, business intelligence, big data engines, ETL, and CRM systems. He enjoys complex data challenges and helping customers gain greater insight and value from their existing data.

Presentations

Data engineering workshop 2-Day Training

Jorge Villamariona outlines how organizations using a single platform for processing all types of big data workloads are able to manage growth and complexity, react faster to customer needs, and improve collaboration—all at the same time. You'll leverage Apache Spark and Hive to build an end-to-end solution to address business challenges common in retail and ecommerce.

Data engineering workshop (Day 2) Training Day 2

Jorge Villamariona outlines how organizations using a single platform for processing all types of big data workloads are able to manage growth and complexity, react faster to customer needs, and improve collaboration—all at the same time. You'll leverage Apache Spark and Hive to build an end-to-end solution to address business challenges common in retail and ecommerce.

Mario Vinasco has over 15 years of progressive experience in data-driven analytics with an emphasis on machine learning and data science programming, creatively applied to ecommerce, advertising, customer acquisition and retention, and marketing investment. Mario specializes in developing and applying leading-edge business analytics to complex business problems using big data and predictive modeling platforms.

Mario holds a master’s degree in engineering economics from Stanford University and is currently the director of analytics and data science at Credit Sesame, a disruptive fintech company in the San Francisco Bay Area, where he’s responsible for customer management, retention, and prediction.

Until recently, Mario worked at Uber Technologies, applying data science to marketing investment optimization and advanced segmentation of customers by propensity to act, churn, and open emails, and setting up sophisticated experiments to test and validate hypotheses.

At Facebook, in the marketing analytics group, he was responsible for improving the effectiveness of Facebook’s own consumer-facing campaigns. Key projects included ad-effectiveness measurement of Facebook’s brand marketing activities and product campaigns for key product priorities using advanced experimentation techniques.

Prior roles included VP of business intelligence at a digital textbook startup, people analytics manager at Google, and senior ecommerce manager at Symantec.

Presentations

Optimization of digital spend using machine learning in PyTorch Session

Uber spends hundreds of millions of dollars on marketing and constantly optimizes the allocation of these budgets. It deploys complex models built with Python and PyTorch, borrowing from machine learning (ML) to speed up the solvers that optimize marketing investment. Mario Vinasco explains the framework of the marketing spend problem and how it was implemented.

Joseph Voyles is a director at PwC.

Presentations

ML models are not software: Why organizations need dedicated operations to address the b Session

Anand Rao and Joseph Voyles introduce you to the core differences between software and machine learning model life cycles. They demonstrate how AI’s success also limits its scale and detail leading practices for establishing AIOps to overcome limitations by automating CI/CD, supporting continuous learning, and enabling model safety.

Kshitij Wadhwa is a software engineer at Rockset, where he works on the platform engineering team. Previously, Kshitij was an engineer at NetApp on the filesystem and protocols team in Cloud Backup Service. Kshitij holds a master’s degree in computer science from North Carolina State University.

Presentations

Building live dashboards on Amazon DynamoDB using Rockset Session

Rockset is a serverless search and analytics engine that enables real-time search and analytics on raw data from Amazon DynamoDB—with full-featured SQL. Kshitij Wadhwa and Dhruba Borthakur explore how Rockset takes an entirely new approach to loading, analyzing, and serving data so you can run powerful SQL analytics on data from DynamoDB without ETL.

Kai Waehner is a technology evangelist at Confluent. Kai’s areas of expertise include big data analytics, machine learning, deep learning, messaging, integration, microservices, the internet of things, stream processing, and blockchain. He’s a regular speaker at international conferences such as JavaOne, O’Reilly Software Architecture, and ApacheCon and has written a number of articles for professional journals. Kai also shares his experiences with new technologies on his blog.

Presentations

Streaming microservice architectures with Apache Kafka and Istio service mesh Session

Apache Kafka has become the de facto standard for microservice architectures, which has also introduced new challenges. Kai Waehner explores the problems of distributed microservices communication and how Kafka and a service mesh like Istio address them. You'll learn approaches for combining them to build a reliable and scalable microservice architecture with decoupled and secure microservices.

Dean Wampler is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He is Head of Developer Relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he is the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He has a Ph.D. in Physics from the University of Washington.

Presentations

Model governance Tutorial

Machine learning (ML) models are data, which means they require the same data governance considerations as the rest of your data. Boris Lublinsky and Dean Wampler outline metadata management for model serving and explore what information about running systems you need and why it's important. You'll also learn how Apache Atlas can be used for storing and managing this information.

Understanding data governance for machine learning models Session

Production deployment of machine learning (ML) models requires data governance, because models are data. Dean Wampler and Boris Lublinsky justify that claim and explore its implications and techniques for satisfying the requirements. Using motivating examples, you'll explore reproducibility, security, traceability, and auditing, plus some unique characteristics of models in production settings.

Kelly Wan is a senior data scientist at LinkedIn in Sunnyvale and a technology and data science evangelist. Previously, Kelly worked in investment banking in New York City for five years before transitioning into data science in Silicon Valley. Kelly holds a master’s degree in computer science from Columbia University and a bachelor’s degree from Southeast University in China.

Presentations

LinkedIn end-to-end data product to measure customer happiness Session

Studies show that good customer service strengthens customers' cohesion with a product, which increases product engagement and revenue. Customer surveys are the traditional way to measure how customers feel about services and products. Kelly Wan, Jason Wang, and Lili Zhou examine LinkedIn's innovative data product for measuring customer happiness.

Haopei Wang is a research scientist at DataVisor. Previously, he earned his PhD from the Department of Computer Science and Engineering at Texas A&M University. His research includes big data security and system security.

Presentations

Efficient feature engineering from digital identifiers for online fraud detection Session

Haopei Wang details the design and implementation of a system that automatically extracts fraud-related features from digital identifiers commonly collected by online services. You'll be able to address real-time feature computation and create templates for feature generation. The system has been applied successfully to fraud detection and good-user analysis.

Harrison Wang is a backend software engineer for LiveRamp and was responsible for coordinating the cloud migration for the activations team.

Presentations

Truth and reality of a cloud migration for large-scale data processing workflows Session

A migration to a new environment is never easy. You'll learn how LiveRamp tackled migrating its large-scale production workflows from its private data center to the cloud while maintaining high uptime. Harrison Wang examines the high-level steps and decisions involved, lessons learned, and what to realistically expect out of a migration.

Chih-Hui “Jason” Wang is a data scientist on the global customer operations (GCO) data science team at LinkedIn. At LinkedIn, he uses data to advocate the voices of customers and members. Previously, he was a data scientist at LeanTaaS where he helped transform healthcare operations through data science. He holds a master’s degree in statistics from the University of California, Berkeley.

Presentations

LinkedIn end-to-end data product to measure customer happiness Session

Studies show that good customer service strengthens customers' cohesion with a product, which increases product engagement and revenue. Customer surveys are the traditional way to measure how customers feel about services and products. Kelly Wan, Jason Wang, and Lili Zhou examine LinkedIn's innovative data product for measuring customer happiness.

Jiao (Jennie) Wang is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She’s engaged in developing and optimizing distributed deep learning framework on Apache Spark.

Presentations

Real-time recommendation using attention networks with Analytics Zoo on Apache Spark Session

Luyang Wang and Jennie Wang explain how to build a real-time menu recommendation system that leverages attention networks using Spark, Analytics Zoo, and MXNet in the cloud. You'll learn how to deploy the model and serve real-time recommendations using both cloud and on-device infrastructure in Burger King’s production environment.

Jisheng Wang is the head of data science at Mist Systems, where he leads the development of Marvis, the first AI-driven virtual network assistant that automates the visibility, troubleshooting, reporting, and maintenance of enterprise networking. He has 10+ years of experience applying state-of-the-art big data and data science technologies to solve challenging enterprise problems including security, networking, and IoT. Previously, Jisheng was the senior director of data science in the CTO office of Aruba, a Hewlett Packard Enterprise company, which he joined through its acquisition of Niara in February 2017; there he led the overall innovation and development effort in big data infrastructure and data science and invented the industry’s first modular and data-agnostic user and entity behavior analytics (UEBA) solution, which is widely deployed today among global enterprises. Before that, he was a technical lead at Cisco responsible for various security products. Jisheng earned his PhD in electrical engineering from Penn State University. He’s a frequent speaker at AI and ML conferences, including O’Reilly Strata AI, Frontier AI, Spark Summit, Hadoop Summit, and BlackHat.

Presentations

Scalable and automated pipeline for large-scale neural network training and inference Session

Anomaly detection models are essential to running data-driven businesses intelligently. At Mist Systems, accuracy requirements and the scale of the data pose challenges for building and automating ML pipelines. Ebrahim Safavi and Jisheng Wang explain how recurrent neural networks and novel statistical models allow Mist Systems to build a cloud-native solution and automate the anomaly detection workflow.

Luyang Wang is a senior manager on the Burger King guest intelligence team at Restaurant Brands International, where he works on machine learning and big data analytics. He’s engaged in developing distributed machine learning applications and real-time web services for the Burger King brand. Previously, Luyang Wang was at Philips Big Data and AI Lab and Office Depot.

Presentations

Real-time recommendation using attention networks with Analytics Zoo on Apache Spark Session

Luyang Wang and Jennie Wang explain how to build a real-time menu recommendation system that leverages attention networks using Spark, Analytics Zoo, and MXNet in the cloud. You'll learn how to deploy the model and serve real-time recommendations using both cloud and on-device infrastructure in Burger King’s production environment.

Prashant Warier is the CEO of Qure.ai and chief data scientist at Fractal Analytics, with 16 years of experience architecting and developing data science solutions. Prashant founded the AI-powered personalized digital marketing firm Imagna Analytics, which was acquired by Fractal in 2015. Earlier, he worked at SAP, where he was instrumental in building its data science practice. He now heads Qure.ai, a healthcare business that uses deep learning to automatically interpret X-rays, CT scans, and MRIs. He holds a PhD and an MS in operations research from the Georgia Institute of Technology and a BTech from IIT Delhi.
He’s passionate about using artificial intelligence for global good and, through Qure.ai, is working toward making healthcare accessible and affordable using the power of machine learning and artificial intelligence.

Presentations

AI at the point of care revolutionizes diagnostics Session

If AI can automate the interpretation of abnormalities at the point of care for at-risk populations, it can eliminate delays in diagnosis, speeding time to treatment and saving lives. We detail the technology required for this healthcare revolution to become reality, sharing case studies of machine learning deployed in poverty-stricken areas.

Sophie Watson is a senior data scientist at Red Hat, where she helps customers use machine learning to solve business problems in the hybrid cloud. She’s a frequent public speaker on topics including machine learning workflows on Kubernetes, recommendation engines, and machine learning for search. Sophie earned her PhD in Bayesian statistics.

Presentations

What nobody told you about machine learning in the hybrid cloud Session

Cloud native infrastructure like Kubernetes has obvious benefits for machine learning systems, allowing you to scale out experiments, train on specialized hardware, and conduct A/B tests. What isn’t obvious are the challenges that come up on day two. Sophie Watson and William Benton share their experience helping end users navigate these challenges and make the most of new opportunities.

Dennis Wei is a research staff member with IBM Research AI. He holds a PhD degree in electrical engineering from the Massachusetts Institute of Technology (MIT). His recent research interests center around trustworthy machine learning, including explainability and interpretability, fairness, and causality.

Presentations

Introducing the AI Explainability 360 open source toolkit Tutorial

As AI and ML make inroads into society, calls increase to explain their outputs. Dennis Wei teaches you to use and contribute to the new open source Python package AI Explainability 360. Dennis translates new developments from research labs. You'll get a look at the first comprehensive toolkit for explainable AI, including eight diverse and state-of-the-art methods from IBM Research.

Josh Weisberg is a senior director on the 3D and computer vision team for Zillow Group. Previously, he led the AI camera and computational photography team at Microsoft Research, spent several years at Apple, and was at four early-stage startups. He’s written four books on imaging and color. Josh studied digital imaging at the Rochester Institute of Technology and holds a bachelor’s of science degree from the University of San Francisco.

Presentations

Designing a virtual tour application with computer vision and edge computing Session

Computer vision and deep learning enable new technologies to mimic how the human brain interprets images and create interactive shopping experiences. This progress has major implications for businesses providing customers with the information they need to make a purchase decision. Josh Weisberg offers an overview of implementing computer vision to create rich media experiences.

Remy Welch is a data analytics specialist at Google Cloud. She works with enterprises in San Francisco to understand best practices for collecting and analyzing data. Remy has expertise in the gaming industry, helping gaming companies better handle data ingestion, storage, and analytics.

Presentations

Using serverless Spark on Kubernetes for data streaming and analytics Session

Data is a valuable resource, but collecting and analyzing it can be challenging, and the cost of resource allocation often limits the speed at which you can analyze it. Jay Smith and Remy Welch break down how serverless architecture can improve the portability and scalability of streaming event-driven Apache Spark jobs and perform ETL tasks using serverless frameworks.

Seth Wiesman is a senior solutions architect at Ververica, consulting with clients to maximize the benefits of real-time data processing for their business. He supports customers in the areas of application design, system integration, and performance tuning.

Presentations

Apache Flink developer training 2-Day Training

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Apache Flink developer training (Day 2) Training Day 2

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Event-driven applications made easy with Apache Flink Tutorial

David Anderson and Seth Wiesman demonstrate how building and managing scalable, stateful, event-driven applications can be easier and more straightforward than you might expect. You'll go hands-on to implement a ride-sharing application together.

Aaron is the VP of community at OmniSci, responsible for OmniSci’s developer, user, and open source communities. He comes to OmniSci with more than two decades of success building ecosystems around some of software’s most familiar platforms. Most recently, he ran the global community for Mesosphere, including leading the launch and growth of DC/OS as an open source project. Before that, he led the Java Community Process at Sun Microsystems and ecosystem programs at SAP. Aaron has also served as the founding CEO of two startups in the entertainment space. He holds an MS in computer science and a BS in computer engineering from Case Western Reserve University.

Presentations

Using GPU acceleration to interact with OpenStreetMap at planet scale Data Case Studies

In this talk, we’ll explore the explosive growth in the quantity of geospatial data and how this growth is fueling the need to more frequently join geospatial data with traditional data.

Kathy Winger is a business, corporate, real estate, banking, and data security attorney representing companies and individuals in commercial and corporate transactions, where she’s a solo practitioner in Tucson. She has more than 20 years of experience as an attorney in the private sector, practicing corporate, business, banking, regulatory, compliance, real estate, and consumer and commercial lending law. Previously, she served as in-house counsel to a national bank and financial services company.

Kathy frequently gives presentations addressing cybersecurity issues for businesses and has spoken to CFOs, financial executives, lawyers, insurance brokers, business owners, and technology professionals, as well as groups such as Financial Executives and Affiliates of Tucson, the National Bank of Arizona Women’s Financial Group, and the Automotive Service Association. Nationally, Kathy has spoken about cybersecurity and data breaches, most recently at the Wall Street Journal Pro Cyber Security Symposium in San Diego, Cybersecurity Atlanta (2018), Data Center World (2019), and the Channel Partners Conference & Expo (2019), among many others. Kathy has written articles on cybersecurity and banking topics that have appeared in national publications and has been interviewed for articles and radio shows discussing cybersecurity, banking, and business topics.

Kathy is the president of the board of directors for the BSA Catalina Council and serves on the advisory board for the National Bank of Arizona Women’s Financial Group. She also serves on the board of directors of the Southern Arizona Children’s Advocacy Center and is a member of the Better Business Bureau of Southern Arizona.

Presentations

Executive Briefing: Cybersecurity and data breaches from a business lawyer's perspective Session

Kathy Winger walks you through what business owners and technology professionals need to know about potential risks in the cybersecurity arena. You'll learn the current legal and data security issues and practices along with what’s happening on the regulatory front. Along the way, you'll learn to mitigate the risks you face.

Micah Wylde is a software engineer on the streaming compute team at Lyft, focused on the development of Apache Flink and Apache Beam. Previously, he built data infrastructure for fighting internet fraud at SIFT and real-time bidding infrastructure for ads at Quantcast.

Presentations

How Lyft built a streaming data platform on Kubernetes Session

Lyft processes millions of events per second in real time to compute prices, balance marketplace dynamics, and detect fraud, among many other use cases. Micah Wylde showcases how Lyft uses Kubernetes along with Flink, Beam, and Kafka to enable service engineers and data scientists to easily build real-time data applications.

Shuo is a software engineer on Robinhood’s data platform team.

Presentations

Usability First: the Evolution of Robinhood’s Data Platform Data Case Studies

The data platform at Robinhood has evolved considerably as the scale of our data and the needs of the company have grown. In this talk, we share the stories behind the evolution of our platform, explain how it aligns with our business use cases, and discuss in detail the challenges we encountered and the lessons we learned.

Huangming Xie is a senior manager of data science at LinkedIn, where he leads the infrastructure data science team to drive resource intelligence, optimize compute and storage efficiency, and automate capacity forecasting for better scalability, as well as improve site availability for a pleasant member and customer experience. Huangming is an expert at converting data into actionable recommendations that impact strategy and generate direct business impact. Previously, he led initiatives to enable data-driven product decisions at scale and built a great product for more than 600 million LinkedIn members worldwide.

Presentations

Get a CLUE: Optimizing big data compute efficiency Session

Compute efficiency optimization is of critical importance in the big data era, as data science and ML algorithms become increasingly complex and data size increases exponentially over time. Opportunities exist throughout the resource use funnel, which Zhe Zhang and Huangming Xie characterize using a CLUE framework.

Tony Xing is a senior product manager on the AI, data, and infrastructure (AIDI) team within Microsoft’s AI and Research Organization. Previously, he was a senior product manager on the Skype data team within Microsoft’s Application and Service Group, where he worked on products for data ingestion, real-time data analytics, and the data quality platform.

Presentations

Introducing a new anomaly detection algorithm inspired by computer vision and RL Session

Anomaly detection may sound old-fashioned, yet it's super important in many industrial applications. Tony Xing outlines a novel anomaly detection algorithm based on spectral residual (SR) and convolutional neural networks (CNNs) and how this novel method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.

Chendi Xue is a software engineer on the data analytics team at Intel. She has more than five years’ experience in big data and cloud system optimization, focusing on storage and network software stack performance analysis and optimization. She has participated in development work including Spark shuffle optimization, Spark SQL columnar-based execution, compute-side cache implementation, and storage benchmark tool implementation. Previously, she worked on Linux device mapper optimization and iSCSI optimization during her master’s degree studies.

Presentations

Accelerating Spark-SQL with AVX-supported vectorization Session

Chendi Xue and Jian Zhang explore how Intel accelerated Spark SQL with AVX-supported vectorization technology. They outline the design and evaluation, including how to enable columnar process in Spark SQL, how to use Arrow as intermediate data, how to leverage AVX-enabled Gandiva for data processing, and performance analysis with system metrics and breakdown.

Megahanath Macha Yadagiri is a graduate research assistant at Carnegie Mellon University.

Presentations

A machine learning approach to customer profiling by identifying purchase lifecycle stages Session

Identifying a customer's stage in the buying cycle enables you to perform personalized targeting depending on that stage. Shankar Venkitachalam, Megahanath Macha Yadagiri, and Deepak Pai present ML techniques that analyze a customer's clickstream behavior to find the different stages of the buying cycle and quantify the critical click events that help transition a user from one stage to another.

Giridhar Yasa is a principal architect at Flipkart. He’s a technology leader with a consistent track record of leading teams from concept to successful delivery of complex software products, with strong team-building and mentoring skills, multiple peer-reviewed journal and conference papers, and several patents. His specialties include distributed systems, scalable software system architecture, storage software, networking and internet protocols, mobile communication protocols, system performance, free and open source software, languages and tools (C, C++, Python), Unix-like operating systems, and Debian.

Presentations

Architectural patterns for business continuity and disaster recovery: Applied to Flipkart Session

Utkarsh B. and Giridhar Yasa lead a deep dive into the architectural patterns and solutions Flipkart developed to ensure business continuity for millions of online customers, and how it leveraged technology to avert or mitigate risks from catastrophic failures. Solving for business continuity requires investments in applications, data management, and infrastructure.

Wenming Ye is an AI and ML solutions architect at Amazon Web Services, helping researchers and enterprise customers use cloud-based machine learning services to rapidly scale their innovations. Previously, Wenming gained diverse R&D experience at Microsoft Research, on an SQL engineering team, and at successful startups.

Presentations

Put deep learning to work: A practical introduction using Amazon Web Services 2-Day Training

Machine learning (ML) and deep learning (DL) projects are becoming increasingly common at enterprises and startups alike and have been a key innovation engine for Amazon businesses such as Go, Alexa, and Robotics. Wenming Ye demonstrates a practical next step in DL learning with instructions, demos, and hands-on labs.

Put deep learning to work: A practical introduction using Amazon Web Services (Day 2) Training Day 2

Machine learning (ML) and deep learning (DL) projects are becoming increasingly common at enterprises and startups alike and have been a key innovation engine for Amazon businesses such as Go, Alexa, and Robotics. Wenming Ye demonstrates a practical next step in DL learning with instructions, demos, and hands-on labs.

Jia Zhai is a founding engineer at StreamNative and a PMC member of both Apache Pulsar and Apache BookKeeper, and he continues to contribute to both projects.

Presentations

Life beyond pub/sub: How Zhaopin simplifies stream processing using Pulsar Functions and SQL Session

Penghui Li and Jia Zhai walk you through building an event streaming platform based on Apache Pulsar and simplifying a stream processing pipeline with Pulsar Functions, Pulsar Schema, and Pulsar SQL.

Jian Zhang is a senior software engineering manager at Intel, where he and his team primarily focus on open source storage development and optimization on Intel platforms and build reference solutions for customers. He has 10 years of experience doing performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph, working with the Hadoop distributed file system (HDFS), and benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.

Presentations

Accelerating Spark-SQL with AVX-supported vectorization Session

Chendi Xue and Jian Zhang explore how Intel accelerated Spark SQL with AVX-supported vectorization technology. They outline the design and evaluation, including how to enable columnar processing in Spark SQL, how to use Arrow as the intermediate data format, how to leverage AVX-enabled Gandiva for data processing, and performance analysis with system metrics and breakdowns.

Yong Zhang is a software engineer at StreamNative. He’s also an Apache Pulsar and BookKeeper contributor, focusing on Pulsar transactions, storage, and tooling.

Presentations

Transactional event streaming with Apache Pulsar Session

Sijie Guo and Yong Zhang lead a deep dive into the details of Pulsar transactions and how they can be used in Pulsar Functions and other processing engines to achieve transactional event streaming.

Zhe Zhang is a senior manager of core big data infrastructure at LinkedIn, where he leads an engineering team providing big data services (the Hadoop distributed file system (HDFS), YARN, Spark, TensorFlow, and beyond) to power LinkedIn’s business intelligence and relevance applications. Zhe is an Apache Hadoop PMC member; he led the design and development of HDFS Erasure Coding (HDFS-EC).

Presentations

Get a CLUE: Optimizing big data compute efficiency Session

Compute efficiency optimization is of critical importance in the big data era, as data science and ML algorithms become increasingly complex and data size increases exponentially over time. Opportunities exist throughout the resource use funnel, which Zhe Zhang and Huangming Xie characterize using a CLUE framework.

Alice Zhao is a senior data scientist at Metis, where she teaches 12-week data science bootcamps. Previously, she was the first data scientist at Cars.com, supporting multiple functions from marketing to technology; cofounded Best Fit Analytics Workshop, a data science education startup where she taught weekend courses to professionals at 1871 in Chicago; was an analyst at Redfin; and was a consultant at Accenture. She blogs about analytics and pop culture on A Dash of Data. Her blog post “How Text Messages Change From Dating to Marriage” made it onto the front page of Reddit, gaining over half a million views in the first week. She’s passionate about teaching and mentoring, and loves using data to tell fun and compelling stories. She holds an MS in analytics and a BS in electrical engineering, both from Northwestern University.

Presentations

Introduction to natural language processing in Python Tutorial

Data scientists are known for crunching numbers, but you may also run into text data. Alice Zhao teaches you how to turn text data into a format that a machine can understand, identifies some of the most popular text analytics techniques, and showcases several natural language processing (NLP) libraries in Python, including the Natural Language Toolkit (NLTK), TextBlob, spaCy, and Gensim.

Alice Zheng is a senior manager of applied science on the machine learning optimization team for Amazon’s advertising platform. She specializes in research and development of machine learning methods, tools, and applications, and is the author of Feature Engineering for Machine Learning. Previously, Alice worked at GraphLab, Dato, and Turi, where she led the machine learning toolkits team and spearheaded user outreach, and was a researcher in the Machine Learning Group at Microsoft Research Redmond. Alice holds a PhD and a BA in computer science and a BA in mathematics, all from UC Berkeley.

Presentations

Lessons learned from building large ML systems Session

Alice Zheng shares four lessons learned from building and operating large-scale, production-grade machine learning systems at Amazon, useful for practitioners and would-be practitioners in the field.

Lili Zhou is a manager of the data science team at LinkedIn. Lili has extensive experience in customer operations, billing and collection, risk management, fraud detection, revenue forecasting, and online gaming. She’s passionate about leveraging large-scale data analytics and modeling to drive insights and business value.

Presentations

LinkedIn end-to-end data product to measure customer happiness Session

Studies show that good customer service strengthens customers' attachment to a product, which increases product engagement and revenue. Traditionally, customer surveys are used to measure how customers feel about services and products. Kelly Wan, Jason Wang, and Lili Zhou examine LinkedIn's innovative data product for measuring customer happiness.

Contact us

confreg@oreilly.com

For conference registration information and customer service

partners@oreilly.com

For more information on community discounts and trade opportunities with O’Reilly conferences

Become a sponsor

For information on exhibiting or sponsoring a conference

pr@oreilly.com

For media/analyst press inquiries