Mar 15–18, 2020

Speakers

Hear from innovative programmers, talented managers, and senior executives who are doing amazing things with data and AI. More speakers will be announced; please check back for updates.


Arpit Agarwal is a senior engineering manager on the storage engineering team at Cloudera and has been an active HDFS/Hadoop committer since 2013.

Presentations

It’s 2020 now: Apache Hadoop 3.x state of the union and upgrade guidance Session

In 2020, Hadoop is still evolving fast. Wangda Tan and Arpit Agarwal survey the current state of the Apache Hadoop community and the exciting present and future of Hadoop 3.x, covering new features such as Hadoop on cloud, GPU support, NameNode federation, Docker support, 10x scheduling improvements, and Ozone. They also offer guidance on upgrading from 2.x to 3.x.

John-Mark Agosta is a principal data scientist in IMML at Microsoft. Previously, he worked with startups and labs in the Bay Area, including the original Knowledge Industries, and was a researcher at Intel Labs, where he was awarded a Santa Fe Institute Business Fellowship in 2007, and at SRI International after receiving his PhD from Stanford. He has participated in the annual Uncertainty in AI conference since its inception in 1985, proving his dedication to probability and its applications. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus.

Presentations

Machine learning for managers Tutorial

Robert Horton, Mario Inchiosa, and John-Mark Agosta offer an overview of the fundamental concepts of machine learning (ML) to business and healthcare decision makers and software product managers so you'll be able to make more effective use of ML results and be better able to evaluate opportunities to apply ML in your industry.

Mudasir Ahmad is a distinguished engineer and senior director at Cisco. He’s been involved with design and algorithms for 17 years. Mudasir leads the Center of Excellence for Numerical Analysis, developing new analytical and stochastic algorithms. He’s also involved with implementing IoT, artificial intelligence, and big data analytics to streamline supply chain operations. Mudasir has delivered several invited talks on leading technology solutions internationally. He has over 30 publications on microelectronic packaging, two book chapters, and 13 US patents. He received the IEEE’s internationally renowned Outstanding Young Engineer Award in 2012. He earned an MS in management science and engineering at Stanford University, an MS in mechanical engineering from the Georgia Institute of Technology, and a bachelor’s degree from Ohio University.

Presentations

Real-life application of artificial intelligence in supply chain operations Session

Artificial intelligence (AI) is a natural fit for supply chain operations, where decisions and actions need to be taken daily or even hourly, relating to delivery, manufacturing, quality, logistics, and planning. Mudasir Ahmad explains how AI can be implemented in a scalable and cost-effective way in your business' supply chain operations. You'll identify benefits and potential challenges.

Alasdair Allan is a director at Babilim Light Industries and a scientist, author, hacker, maker, and journalist. An expert on the internet of things and sensor systems, he’s famous for hacking hotel radios, deploying mesh networked sensors through the Moscone Center during Google I/O, and for being behind one of the first big mobile privacy scandals when, back in 2011, he revealed that Apple’s iPhone was tracking user location constantly. He’s written eight books and writes regularly for Hackster.io, Hackaday, and other outlets. A former astronomer, he also built a peer-to-peer autonomous telescope network that detected what was, at the time, the most distant object ever discovered.

Presentations

Dealing with data on the edge Session

Much of the data we collect is thrown away, but that's about to change; the power envelope needed to run machine learning models on embedded hardware has fallen dramatically, enabling you to put the smarts on the device rather than in the cloud. Alasdair Allan explains how the data you threw away can be processed in real time at the edge, and this has huge implications for how you deal with data.

Shradha Ambekar is a staff software engineer in the Small Business Data Group at Intuit, where she’s the technical lead for the lineage framework (SuperGLUE) and real-time analytics. She’s made several key contributions to solutions built around the data platform and has contributed to spark-cassandra-connector. She has experience with the Hadoop Distributed File System (HDFS), Hive, MapReduce, Hadoop, Spark, Kafka, Cassandra, and Vertica. Previously, she was a software engineer at Rearden Commerce. Shradha spoke at the O’Reilly Open Source Conference in 2019. She holds a bachelor’s degree in electronics and communication engineering from NIT Raipur, India.

Presentations

Always accurate business metrics through lineage-based anomaly tracking Session

Debugging data pipelines is nontrivial and finding the root cause can take hours to days. Shradha Ambekar and Sunil Goplani outline how Intuit built a self-serve tool that automatically discovers data pipeline lineage and applies anomaly detection to detect and help debug issues in minutes, establishing trust in metrics and improving developer productivity by 10-100X.
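The kind of anomaly detection the abstract describes can be sketched minimally. This is a hypothetical illustration using a rolling z-score over a pipeline metric, not Intuit's actual implementation; the metric, window, and threshold are invented:

```python
# Hypothetical sketch: flag anomalous daily values of a pipeline metric
# (e.g., rows loaded) when they deviate sharply from recent history.
from statistics import mean, stdev

def find_anomalies(values, window=7, threshold=3.0):
    """Return indices whose value deviates more than `threshold` standard
    deviations from the mean of the preceding `window` observations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

daily_row_counts = [100, 102, 98, 101, 99, 103, 100, 5]  # last load failed
print(find_anomalies(daily_row_counts))  # [7]: the sudden drop is flagged
```

Combined with lineage, a flag like this can be traced upstream to the pipeline stage that first went wrong.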

David Anderson is a training coordinator at Ververica, the original creators of Apache Flink. He’s delivered training and consulting to many of the world’s leading banks, telecommunications providers, and retailers. Previously, he led the development of data-intensive applications for companies across Europe.

Presentations

Apache Flink developer training 2-Day Training

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Apache Flink developer training (Day 2) Training Day 2

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Event-driven applications made easy with Apache Flink Tutorial

David Anderson and Seth Wiesman demonstrate how building and managing scalable, stateful, event-driven applications can be easier and more straightforward than you might expect. You'll go hands-on to implement a ride-sharing application together.

Jesse Anderson is a big data engineering expert and trainer.

Presentations

Professional Kafka development 2-Day Training

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.

Professional Kafka development (Day 2) Training Day 2

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.

Manohar Angani is a machine learning engineer at SurveyMonkey, where he works on productionalizing models and integrating them with SurveyMonkey products. Previously, he worked in different groups within the company, like growth. In his free time, he likes to hang out with his family and explore the Bay Area.

Presentations

Accelerating your organization: Making data optimal for machine learning Session

Every organization leverages ML to increase value to customers and understand their business. You may have created models, but now you need to scale. Shubhankar Jain, Aliaksandr Padvitselski, and Manohar Angani use a case study to teach you how to pinpoint inefficiencies in your ML data flow, how SurveyMonkey tackled this, and how to make your data more usable to accelerate ML model development.

Eitan Anzenberg is the director of data science at Bill.com and has many years of experience as a scientist and researcher. His recent focus is on machine learning, deep learning, applied statistics, and engineering. Previously, Eitan was a postdoctoral scholar at Lawrence Berkeley National Lab; he received his PhD in physics from Boston University and his BS in astrophysics from the University of California, Santa Cruz. Eitan has 2 patents and 11 publications to date and has spoken about data at various conferences around the world.

Presentations

Beyond OCR: Using deep learning to understand documents Session

Although the field of optical character recognition (OCR) has been around for half a century, document parsing and field extraction from images remains an open research topic. Eitan Anzenberg leads a deep dive into a learning architecture that leverages document understanding to extract fields of interest.

Using deep learning to understand documents Session

Although the field of optical character recognition (OCR) has been around for almost half a century, document parsing and field extraction from images remain an open research topic. Eitan Anzenberg digs into using an end-to-end deep learning and OCR architecture to predict regions of interest within documents and automatically extract their text.

She is the principal data scientist at Atlan, a data product company, and a guest lecturer in applied econometrics at IIM Kashipur in India. She loves designing and building scalable data products with features that look and feel customised to every user.

Presentations

Predicting malaria using ML and satellite data Session

Malaria incidence is highly dependent on the environmental, demographic, and infrastructural conditions of the affected region. Time series analysis of malaria cases against these variables can help identify the problematic factors and predict the expected number of cases in the absence of external intervention. This session dives deep into these indicators and the prediction model.
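As a rough illustration of regressing case counts against an environmental variable, here is a minimal sketch using ordinary least squares on a lagged predictor. The data, the one-month lag, and the choice of rainfall as the variable are all invented for illustration, not the speaker's model:

```python
# Toy example: fit cases ~ a + b * rainfall(previous month) by OLS.
def ols_fit(xs, ys):
    """Closed-form simple linear regression: return (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

rainfall = [10, 30, 50, 40, 20, 60]     # mm, months 1..6 (made up)
cases    = [None, 25, 65, 105, 85, 45]  # cases lag rainfall by one month
xs, ys = rainfall[:-1], cases[1:]       # align lag-1 pairs
a, b = ols_fit(xs, ys)
print(round(a, 2), round(b, 2))         # fitted intercept and slope
next_month = a + b * rainfall[-1]       # forecast from latest rainfall
```

A real model would combine several such indicators (demographic, infrastructural) and use proper time series methods rather than a single lagged regressor.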

Muhammad Asfar is a PhD candidate at Airlangga University and a political consultant at Pusdeham. He’s written more than 200 academic papers, magazine articles, and books related to politics. He’s fully aware that technology is shaping a new spectrum in politics, now and in the future. Therefore, he’s passionately learning about big data and new technology adoption in politics to fill the gap that voting behavior theories cannot yet explain.

Presentations

Political mapping with big data: Indonesia’s presidential election 2019 case Session

With the disclosure of the Cambridge Analytica scandal, political practitioners have started to adopt big data technology to gain better understanding and management of data. Qorry Asfar and Muhammad Asfar provide a big data case study used to develop political strategy and examine how technological adoption will shape a better political landscape.

Qorry Asfar is a data analyst at Pusdeham Prodata Indonesia and has 2 years’ experience in voting behavior, political campaigns, and political advisory. She’s passionate about learning new ways to creatively use political data for better management and strategy. She occasionally attends big data conferences to gain insight into technological adoption that can be applied to political data.

Presentations

Political mapping with big data: Indonesia’s presidential election 2019 case Session

With the disclosure of the Cambridge Analytica scandal, political practitioners have started to adopt big data technology to gain better understanding and management of data. Qorry Asfar and Muhammad Asfar provide a big data case study used to develop political strategy and examine how technological adoption will shape a better political landscape.

Jatinder Assi is a data engineering manager at GumGum and is enthusiastic about building scalable distributed applications and business-driven data products.

Presentations

Real-time forecasting at scale using Delta Lake Session

GumGum receives 30 billion programmatic inventory impressions amounting to 25 TB of data per day. By generating near-real-time inventory forecasts subject to campaign-specific targeting rules, GumGum enables users to set up successful future campaigns. Rashmina Menon and Jatinder Assi highlight the architecture enabling forecasting in less than 30 seconds with Delta Lake and Databricks Delta caching.

Utkarsh B. is the technology advisor to the CEO, a distinguished architect, and a senior principal architect at Flipkart. He’s been driving architectural blueprints and coherence across Flipkart’s diverse platforms through multiple generations of their evolution, leveraging technology to solve for scale, resilience, business continuity, and disaster recovery. He has extensive experience (18+ years) building platforms across a wide spectrum of technical and functional problem domains.

Presentations

Architectural patterns for business continuity and disaster recovery: Applied to Flipkart Session

Utkarsh B. and Giridhar Yasa lead a deep dive into the architectural patterns and solutions Flipkart developed to ensure business continuity for millions of online customers and how it leveraged technology to avert or mitigate risks from catastrophic failures. Solving for business continuity requires investments in applications, data management, and infrastructure.

Giriraj Bagdi is a DevOps leader of cloud and data at Intuit, where he leads infrastructure engineering and SRE teams in delivering technology and functional capabilities for online platforms. He has driven and managed large, complex initiatives in cloud data infrastructure, automation engineering, big data, and database transactional platforms. Giriraj has extensive knowledge of building engineering solutions and platforms to improve the operational efficiency of cloud infrastructure in the areas of command and control and data reliability for big data, high-transaction, high-volume, and high-availability environments. He drives the initiative to transform big data engineering and migrate to AWS big data technologies such as EMR, Athena, and QuickSight. He’s an innovative, energetic, and goal-oriented technologist and a team player with strong problem-solving skills.

Presentations

10 lead indicators before data becomes a mess Session

Data quality metrics focus on quantifying whether data is a mess. But you need to identify lead indicators before data becomes a mess. Sandeep U, Giriraj Bagdi, and Sunil Goplani explore developing lead indicators of data quality for Intuit's production data pipelines. You'll learn about the details of lead indicators, optimization tools, and lessons that moved the needle on data quality.

Bahman is a VP of data science and engineering at Rakuten (the seventh-largest internet company in the world), where he manages an AI organization of engineering and data science managers, data scientists, machine learning engineers, and data engineers globally distributed across three continents, and is in charge of the end-to-end AI systems behind Rakuten’s Americas businesses such as Rakuten Intelligence (B2B) and Rakuten Rewards (B2C). Previously, Bahman built and managed engineering and data science teams across industry, academia, and the public sector in areas including digital advertising, consumer web, cybersecurity, and nonprofit fundraising, where he consistently delivered substantial business value. He also designed and taught courses, led an interdisciplinary research lab, and advised theses in the computer science department at Stanford University. He earned his own PhD there, advised by the legendary late Rajeev Motwani, Prabhakar Raghavan (currently SVP of ads at Google), and Ashish Goel, focusing on large-scale algorithms and machine learning, topics on which he is a well-published author.

Presentations

AI in the new era of personal data protection Session

With California’s CCPA looming, Europe’s GDPR still sending shockwaves, and public awareness of privacy breaches heightening, we’re in the early days of a new era of personal data protection. Bahman explores the challenges and opportunities for AI in this new era and provides actionable insights to help you navigate your path to AI success in this brave new world of data privacy.

Kamil Bajda-Pawlikowski is a cofounder and CTO of the enterprise Presto company Starburst. Previously, Kamil was the chief architect at the Teradata Center for Hadoop in Boston, focusing on the open source SQL engine Presto, and the cofounder and chief software architect of Hadapt, the first SQL-on-Hadoop company (acquired by Teradata). Kamil began his journey with Hadoop and modern MPP SQL architectures about 10 years ago during a doctoral program at Yale University, where he co-invented HadoopDB, the original foundation of Hadapt’s technology. He holds an MS in computer science from Wroclaw University of Technology and both an MS and an MPhil in computer science from Yale University.

Presentations

Presto on Kubernetes: Query anything, anywhere Session

Kamil Bajda-Pawlikowski explores Presto, an open source SQL engine, featuring low-latency queries, high concurrency, and the ability to query multiple data sources. With Kubernetes, you can easily deploy and manage Presto clusters across hybrid and multicloud environments with built-in high availability, autoscaling, and monitoring.

Claudiu Barbura is a director of engineering at Blueprint, and he oversees product engineering, where he builds large-scale advanced analytics pipelines, IoT, and data science applications for customers in oil and gas, energy, and retail industries. Previously, he was the vice president of engineering at UBIX.AI, automating data science at scale, and senior director of engineering, xPatterns platform services at Atigeo, building several advanced analytics platforms and applications in healthcare and financial industries. Claudiu is a hands-on architect, dev manager, and executive with 20+ years of experience in open source, big data science and Microsoft technology stacks and a frequent speaker at data conferences.

Presentations

The power of GPUs for data virtualization in Tableau and PowerBI and beyond Session

Claudiu Barbura uses live demos to share the lessons Blueprint learned using Spark (CPU), BlazingSQL and Rapids.ai (GPU), and Apache Arrow in its quest to exponentially increase the performance of its data virtualizer, which exposes a tech stack to consumers in BI tools and data science notebooks and enables real-time access to data sources across different cloud providers, on-premises databases, and APIs.

Jimmy Bates is a field engineer at Pepperdata. He’s been in big data for nine years and has worked across all industries on Apache, Cloudera, Hortonworks, and MapR-based deployments. Previously, he was at MapR. He’s worked in the roles of developer, professional services, solutions architect, and product manager.

Presentations

Autoscaling big data operations in the cloud Session

Jimmy Bates offers an impartial evaluation of Amazon Elastic MapReduce (EMR), Azure HDInsight, and Google Cloud Dataproc, the managed big data services of three leading cloud providers, with respect to Hadoop and big data autoscaling capabilities and provides guidance to help you determine the flavor of autoscaling that best fits your business needs.

Benjamin Batorsky is an associate director of data science at MIT Sloan, where he works to derive insight from a rich dataset on small businesses and their customers. Previously, he was lead data scientist at the small business marketing analytics company ThriveHive, and he worked in the areas of health, policy, and infrastructure. He earned a PhD from the RAND Corporation in policy analysis.

Presentations

Named-entity recognition from scratch with spaCy Session

Identifying and labeling named entities such as companies or people in text is a key part of text processing pipelines. Benjamin Batorsky outlines how to train, test, and implement a named entity recognition (NER) model with spaCy. You'll get a sneak peek at how to use these techniques with large, non-English corpora.
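To make the task concrete, here is a toy sketch of what NER produces, using a hand-built gazetteer rather than spaCy's statistical model; it emits the same (start, end, label) character-span format spaCy uses for training annotations. The gazetteer entries and text are invented:

```python
# Toy gazetteer-based entity tagger: find known names in text and emit
# (start_char, end_char, label) spans, the format spaCy training data uses.
import re

GAZETTEER = {"Acme Corp": "ORG", "Jane Doe": "PERSON"}  # made-up entries

def tag_entities(text):
    spans = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

print(tag_entities("Jane Doe joined Acme Corp last year."))
# [(0, 8, 'PERSON'), (16, 25, 'ORG')]
```

A statistical model like spaCy's generalizes beyond a fixed list, which is exactly why it must be trained and tested rather than enumerated.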

An expert in the field of safety reporting technology, Mr. Beales has 25 years of experience in IT and has spent over 16 years in the pharmaceutical industry. He joined WCG ePharmaSolutions in 2009 and led implementation of the company’s Clinical Trial Portal at Genentech across 100+ countries. In 2015, he led implementation of the Clinical Trial Safety Portal at a top 5 pharma organization, which included a data-driven rules engine configured with safety regulations from those countries and saved the organization hundreds of millions of dollars. Over 50 million safety alerts have been distributed by these two portals via the cloud. Prior to joining WCG ePharmaSolutions, Mr. Beales was the chief software architect at mdlogix, where he led the implementation of CTMS systems for Johns Hopkins University, Washington University in St. Louis, the University of Pittsburgh, and the Interactive Autism Network for Autism Speaks.

Presentations

Pragmatic Artificial Intelligence in Biopharmaceutical Industry Session

Explore applications of NLP, machine learning, and data-driven rules that generate significant productivity and quality improvements in the complex business workflows of drug safety and pharmacovigilance without large upfront investment. Pragmatic use of AI allows organizations to create immediate value and ROI before widening adoption as their capabilities with AI increase.

Ian Beaver, PhD is the Chief Scientist at Verint Intelligent Self Service, a provider of conversational AI systems for enterprise businesses. Ian has been publishing discoveries in the field of AI since 2005 on topics surrounding human-computer interactions such as gesture recognition, user preference learning, and communication with multi-modal automated assistants. Ian has presented his work at various academic and industry conferences and authored over 30 patents within the field of human language technology. His extensive experience and access to large volumes of real-world, human-machine conversation data for his research has made him a leading voice in conversational analysis of dialog systems. Ian currently leads a team in finding ways to optimize human productivity by way of automation and augmentation, using symbiotic relationships with machines.

Presentations

Deploying chatbots and conversational analysis: learn what customers really want to know Data Case Studies

Chatbots are increasingly used in customer service as a first tier of support. Through deep analysis of conversation logs, you can learn real user motivations and where company improvements can be made. Ian Beaver makes a build-or-buy comparison for deploying self-service bots, covers motivations and techniques for deep conversational analysis, and discusses real-world discoveries.

Dr. Peyman Behbahani completed his PhD in electronic engineering at City, University of London, in 2011. His main research and development interests are AI, computer vision, mathematical modelling, and forecasting.

Peyman currently works as an AI architect, helping various industries build real-world, large-scale AI applications for their businesses.

Presentations

An approach to automating time and motion analysis Session

Time and motion study of manufacturing operations on a shop floor is traditionally carried out through manual observation, which is time consuming and subject to human error and limitations. Peyman Behbahani introduces a new approach combining video analytics with time series analysis to automate the process of activity identification and timing measurement.

William Benton is an engineering manager and senior principal software engineer at Red Hat, where he leads a team of data scientists and engineers and has applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy intelligent applications in cloud native environments, but he’s also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.

Presentations

What nobody told you about machine learning in the hybrid cloud Session

Cloud native infrastructure like Kubernetes has obvious benefits for machine learning systems, allowing you to scale out experiments, train on specialized hardware, and conduct A/B tests. What isn’t obvious are the challenges that come up on day two. Sophie Watson and William Benton share their experience helping end-users navigate these challenges and make the most of new opportunities.

Giacomo Bernardi is a distinguished engineer at Extreme Networks, where he works on multiple science-heavy projects for traffic engineering and network traffic visibility analytics and leads a global team of data scientists and machine learning engineers. A self-proclaimed networking nerd, Giacomo was previously CTO of a large internet service provider, where he built a custom software-defined platform. He holds a PhD in wireless networking from the University of Edinburgh (UK), an MSc from Trinity College Dublin (Ireland), and a BSc from the University of Milan (Italy).

Presentations

What do machines say when nobody’s looking? Tracking IoT security with NLP Session

Machines talk among themselves! What can we learn about their behavior by analyzing their "language"? Giacomo Bernardi presents a lightweight approach to securing large internet of things (IoT) deployments by leveraging modern natural language processing techniques. Rather than attempting cumbersome firewall rules, he argues that IoT deployments can be efficiently secured by online behavioral modeling.
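As a minimal sketch of the idea, assuming we treat each device's protocol messages as "words": learn the bigrams observed during normal operation and score new sessions by the fraction of previously unseen bigrams. Real deployments would use far richer NLP models; the message vocabulary here is invented:

```python
# Toy behavioral model: learn bigrams of device messages seen during
# normal operation, then flag sessions dominated by novel bigrams.
from collections import Counter

def learn_bigrams(sessions):
    seen = Counter()
    for msgs in sessions:
        for pair in zip(msgs, msgs[1:]):
            seen[pair] += 1
    return seen

def score_session(msgs, seen):
    """Fraction of bigrams in this session never seen in training."""
    pairs = list(zip(msgs, msgs[1:]))
    novel = sum(1 for p in pairs if p not in seen)
    return novel / len(pairs)

normal = [["HELLO", "AUTH", "DATA", "ACK"],
          ["HELLO", "AUTH", "PING", "ACK"]]
model = learn_bigrams(normal)
print(score_session(["HELLO", "AUTH", "DATA", "ACK"], model))   # 0.0
print(score_session(["HELLO", "EXEC", "EXFIL", "ACK"], model))  # 1.0
```

A high novelty score for a device that normally "speaks" a fixed dialect is a cheap, firewall-free signal that its behavior has changed.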

Lukas Biewald is the founder and CEO of Weights & Biases, his second major contribution to advances in the machine learning field. In 2009, Lukas founded Figure Eight, formerly CrowdFlower, which was acquired by Appen in 2019. Lukas has dedicated his career to optimizing ML workflows and teaching ML practitioners, making machine learning more accessible to all.

Presentations

Using Keras to classify text with LSTMs and other ML techniques Tutorial

Join Lukas Biewald to build and deploy long short-term memory (LSTM) networks, gated recurrent units (GRUs), and other text classification techniques using Keras and scikit-learn.

Sarah Bird is a principal program manager at Microsoft, where she leads research and emerging technology strategy for Azure AI. Sarah works to accelerate the adoption and impact of AI by bringing together the latest research innovations with the best of open source and product expertise to create new tools and technologies. She leads the development of responsible AI tools in Azure Machine Learning. She’s also an active member of the Microsoft Aether committee, where she works to develop and drive company-wide adoption of responsible AI principles, best practices, and technologies. Previously, Sarah was one of the founding researchers in the Microsoft FATE research group and worked on AI fairness at Facebook. She’s an active contributor to the open source ecosystem; she cofounded ONNX, an open source standard for machine learning models, and was a leader in the PyTorch 1.0 project. She was an early member of the machine learning systems research community and has been active in growing and forming the community. She cofounded the SysML research conference and the Learning Systems workshops. She holds a PhD in computer science from the University of California, Berkeley, advised by Dave Patterson, Krste Asanovic, and Burton Smith.

Presentations

An overview of responsible artificial intelligence Tutorial

Mehrnoosh Sameki and Sarah Bird examine six core principles of responsible AI: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. They focus on transparency, fairness, and privacy, and they cover best practices and state-of-the-art open source toolkits that empower researchers, data scientists, and stakeholders to build trustworthy AI systems.

Levan Borchkhadze is a senior data scientist at TBC Bank, where his main responsibility is supervising multiple data science projects. He earned BBA and MBA degrees from Georgian American University and has a wide variety of work experience in different industries as a financial analyst, business process analyst, and ERP systems implementation specialist. After earning a master’s degree in big data solutions at Barcelona Technology School, Levan joined TBC Bank.

Presentations

A novel approach of recommender systems in retail banking Session

TBC Bank is in transition from a product-centric to a client-centric company, and an obvious application of analytics is developing personalized next-best-product recommendations for clients. George Chkadua and Levan Borchkhadze explain why the bank decided to implement the ALS user-item matrix factorization method and a demographic model. As a result, the pilot increased sales conversion rates by 70%.
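For illustration, the ALS method the abstract names can be sketched in miniature. This is a toy rank-1 alternating least squares factorization of an invented dense ratings matrix, not TBC Bank's production recommender:

```python
# Toy ALS: alternate between solving user factors (items fixed) and
# item factors (users fixed) by least squares, with one latent factor.
def als_rank1(R, iters=20):
    n_users, n_items = len(R), len(R[0])
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        # Fix v, solve each user factor in closed form.
        u = [sum(R[i][j] * v[j] for j in range(n_items)) /
             sum(v[j] ** 2 for j in range(n_items)) for i in range(n_users)]
        # Fix u, solve each item factor in closed form.
        v = [sum(R[i][j] * u[i] for i in range(n_users)) /
             sum(u[i] ** 2 for i in range(n_users)) for j in range(n_items)]
    return u, v

R = [[4, 2], [2, 1]]        # a rank-1 "ratings" matrix (made up)
u, v = als_rank1(R)
pred = u[0] * v[0]          # reconstructed rating for user 0, item 0
print(round(pred, 3))       # close to the observed value 4
```

Production recommenders use many latent factors, regularization, and implicit-feedback weighting (e.g., Spark MLlib's ALS), but the alternating structure is the same.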

Dhruba Borthakur is cofounder and CTO at Rockset, a company building software to enable data-powered applications. Previously, Dhruba was the founding engineer of the open source RocksDB database at Facebook and one of the founding engineers of the Hadoop file system at Yahoo; an early contributor to the open source Apache HBase project; a senior engineer at Veritas, where he was responsible for the development of VxFS and the Veritas SanPointDirect storage system; the cofounder of Oreceipt.com, an ecommerce startup based in Sunnyvale; and a senior engineer at IBM-Transarc Labs, where he contributed to the development of the Andrew File System (AFS), a part of IBM’s ecommerce initiative, WebSphere. Dhruba holds an MS in computer science from the University of Wisconsin-Madison and a BS in computer science from BITS Pilani, India. He has 25 issued patents.

Presentations

Building live dashboards on Amazon DynamoDB using Rockset Session

Rockset is a serverless search and analytics engine that enables real-time search and analytics on raw data from Amazon DynamoDB—with full featured SQL. Kshitij Wadhwa and Dhruba Borthakur explore how Rockset takes an entirely new approach to loading, analyzing, and serving data so you can run powerful SQL analytics on data from DynamoDB without ETL.

Mario Bourgoin is a senior data scientist at Microsoft, where he helps the company’s efforts to democratize AI, and a mathematician, data scientist, and statistician with a broad and deep knowledge of machine learning, artificial intelligence, data mining, statistics, and computational mathematics. Previously, he taught at several institutions and joined a Boston-area startup, where he worked on medical and business applications. He earned his PhD in mathematics from Brandeis University in Waltham, Massachusetts.

Presentations

Using the cloud to scale up hyperparameter optimization for machine learning Session

Hyperparameter optimization for machine learning is a complex task that requires advanced optimization techniques and can be implemented as a generic framework decoupled from the specific details of algorithms. Fidan Boylu Uz, Mario Bourgoin, and Gheorghe Iordanescu apply such a framework to tasks like object detection and text matching in a transparent, scalable, and easy-to-manage way in a cloud service.
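The notion of a tuning framework decoupled from the algorithm can be sketched as random search over a parameter space, taking any objective function as input. This is a hedged illustration, not the framework presented in the session; the search space and stand-in objective are invented:

```python
# Generic random search: the framework knows nothing about the model,
# only an objective(params) -> score callable and a search space.
import random

def random_search(objective, space, n_trials=50, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in objective with its minimum at lr=0.1, reg=1.0; in practice
# this would train a model and return a validation loss.
objective = lambda p: (p["lr"] - 0.1) ** 2 + (p["reg"] - 1.0) ** 2
space = {"lr": (0.0, 1.0), "reg": (0.0, 2.0)}
best, score = random_search(objective, space, n_trials=200)
print(best, score)
```

Because the framework only sees a callable, the same loop can tune object detection or text matching models; a cloud service simply runs the trials in parallel.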

Fidan Boylu Uz is a senior data scientist at Microsoft, where she’s responsible for the successful delivery of end-to-end advanced analytic solutions. She’s also worked on a number of projects on predictive maintenance and fraud detection. Fidan has 10+ years of technical experience on data mining and business intelligence. Previously, she was a professor conducting research and teaching courses on data mining and business intelligence at the University of Connecticut. She has a number of academic publications on machine learning and optimization and their business applications and holds a PhD in decision sciences.

Presentations

Using the cloud to scale up hyperparameter optimization for machine learning Session

Hyperparameter optimization for machine learning is a complex process that requires advanced optimization techniques, and it can be implemented as a generic framework decoupled from the specific details of algorithms. Fidan Boylu Uz, Mario Bourgoin, and Gheorghe Iordanescu apply such a framework to tasks like object detection and text matching in a transparent, scalable, and easy-to-manage way in a cloud service.

Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies using big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, and Claudiu Branzan detail the application of the latest advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Navinder Pal Singh Brar is a senior software engineer at Walmart Labs, where he’s been working with the Kafka ecosystem for the last couple of years, especially Kafka Streams, and created a new platform on top of it to suit the company’s needs to process billions of events per day in real time and trigger models on each event. He’s been active in contributing back to Kafka Streams and has patented a few features. He’s interested in solving complex problems in distributed systems. In his spare time, Navinder likes to hit the gym and the boxing ring.

Presentations

Real-time fraud detection with Kafka Streams Session

One of the major use cases for stream processing is real-time fraud detection. Ecommerce companies have to deal with fraud on a wider scale as more and more of them move to subscription-based models and offer customers incentives such as free shipping. Navinder Pal Singh Brar dives into the architecture, problems faced, and lessons learned from building such a pipeline.

Jay Budzik is the chief technology officer at ZestFinance, where he oversees Zest’s product and engineering teams. His passion for inventing new technologies—particularly in data mining and AI—has played a central role throughout his career. Previously, he held various positions, including founding an AI enterprise search company, helping major media organizations apply AI and machine learning to expand their audiences and revenue, and developing systems that process tens of trillions of data points. Jay has a PhD in computer science from Northwestern University.

Presentations

Introducing GIG: A new method for explaining any ensemble ML model Session

More companies are adopting machine learning (ML) to run key business functions. The best performing models combine diverse model types into stacked ensembles, but explaining these hybrid models has been impossible—until now. Jay Budzik details a new technique, generalized integrated gradients (GIG), to explain complex ensembled ML models that are safe to use in high-stakes applications.

Paris Buttfield-Addison is a cofounder of Secret Lab, a game development studio based in beautiful Hobart, Australia. Secret Lab builds games and game development tools, including the multi-award-winning ABC Play School iPad games, the BAFTA- and IGF-winning Night in the Woods, the Qantas airlines Joey Playbox games, and the Yarn Spinner narrative game framework. Previously, Paris was a mobile product manager for Meebo (acquired by Google). Paris particularly enjoys game design, statistics, blockchain, machine learning, and human-centered technology. He researches and writes technical books on mobile and game development (more than 20 so far) for O’Reilly and is writing Practical AI with Swift and Head First Swift. He holds a degree in medieval history and a PhD in computing. You can find him on Twitter as @parisba.

Presentations

Swift for TensorFlow in 3 hours Tutorial

Mars Geldard, Tim Nugent, and Paris Buttfield-Addison are here to prove Swift isn't just for app developers. Swift for TensorFlow provides the power of TensorFlow with all the advantages of Python (and complete access to Python libraries) and Swift—the safe, fast, incredibly capable open source programming language; Swift for TensorFlow is the perfect way to learn deep learning and Swift.

Sathya Chandran is a security research scientist at DataVisor. He’s an expert in applying big data and unsupervised machine learning to fraud detection, specializing in the financial, ecommerce, social, and gaming industries. Previously, Sathya was at HP Labs and Honeywell Labs. Sathya holds a PhD in CS from the University of South Florida.

Presentations

Mobility behavior fingerprinting: A new tool for detecting account takeover attacks Session

Sathya Chandran explains key insights into current trends of account takeover fraud by analyzing 52 billion events generated by 1.1 billion users and developing a set of features called user mobility features to capture suspicious device and IP-switching patterns. You'll learn to incorporate mobility features into an anomaly detection solution to detect suspicious account activity in real time.

Jin Hyuk Chang is a software engineer on the data platform team at Lyft, working on various data products. Jin is a main contributor to Apache Gobblin and Azkaban. Previously, Jin worked at LinkedIn and Amazon Web Services, where he focused on big data and service-oriented architecture.

Presentations

Amundsen: An open source data discovery and metadata platform Session

Jin Hyuk Chang and Tao Feng offer a glimpse of Amundsen, an open source data discovery and metadata platform from Lyft. Since it was open-sourced, Amundsen has been used and extended by many different companies within the community.

Yue Cathy Chang is a business executive recognized for sales, business development, and product marketing in high technology.

Cathy co-founded and is currently the CEO of TutumGene, a technology company that aims to accelerate disease curing by providing solutions for gene therapy and regulation of gene expression. She was most recently with Silicon Valley Data Science, a startup (acquired by Apple) that provided business transformation consulting to enterprises and other organizations using data science- and engineering-based solutions. Prior to that, Cathy was employee #1 hired by the CEO at venture-funded software startup Rocana (acquired by Splunk), where she served as Senior Director of Business Development focusing on building and growing long-term relationships, and notably increased sales leads 2x through building and managing indirect revenue channels.

Prior to Rocana, Cathy held multiple strategic roles at blue chip software enterprise companies as well as startups, including Corporate and Business Development at FeedZai and Datameer; Senior product management, product marketing and sales at Symantec and IBM; and Strategic Sourcing Improvement Consulting at Honeywell.

Cathy holds MS and BS degrees in Electrical and Computer Engineering from Carnegie Mellon University, MBA and MS degrees as a Leaders for Global Operations (LGO) dual-degree fellow from MIT, and two patents for her early work in microprocessor logic design.

Presentations

AI meets genomics: understand and use genetics & genome editing to revolutionize medicine Session

Genome editing has been dubbed a top technology that could create trillion-dollar markets. Learn how recent advancements in the application of AI to genome editing are accelerating the transformation of medicine with Yue Cathy Chang as she explores how AI is applied to genome sequencing and editing, the potential to correct mutations, and questions on using genome editing to optimize human health.

Applying technology oversight and domain insights in AI and ML initiatives to increase success Tutorial

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a "science." As data science practitioners, reducing this failure rate is a priority. Jike Chong and Yue Cathy Chang explain the three key steps of applying data science technology to business problems and three concerns for applying domain insights in AI and ML initiatives.

Executive Briefing: Technology oversight to reduce data science project failure rate Session

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a science. Jike Chong and Yue "Cathy" Chang outline how you can reduce this failure rate and improve teams' confidence in executing successful data science projects by applying data science technology to business problems: scenario mapping, pattern discovery, and success evaluation.

Jeff Chao is a senior software engineer at Netflix, where he works on stream processing engines and observability platforms. Jeff builds and maintains Mantis, an open source platform that makes it easy for developers to build cost-effective, real-time, operations-focused applications. Previously, he was at Heroku offering a fully managed Apache Kafka service.

Presentations

Cost-effective, real-time operational insights into production systems at Netflix Session

Netflix has experienced an unprecedented global increase in membership over the last several years. Production outages today have greater impact in less time than years before. Jeff Chao details the open source Mantis platform, which enables Netflix to get real-time, granular, cost-effective operational insights and continue providing great experiences for its members.

Chanchal Chatterjee is a cloud AI leader at Google Cloud Platform with a focus on the financial services and energy market verticals. He’s held several leadership roles focusing on machine learning, deep learning, and real-time analytics. Previously, he was chief architect at EMC’s CTO office, where he led end-to-end deep learning and machine learning solutions for data centers, smart buildings, and smart manufacturing for leading customers and was instrumental in the Industrial Internet Consortium, where he published an AI framework for large enterprises. Chanchal has received several awards, including the Outstanding Paper Award from the IEEE Neural Network Council for adaptive learning algorithms recommended by MIT professor Marvin Minsky. Chanchal founded two tech startups between 2008 and 2013. He has 29 granted or pending patents and over 30 publications. Chanchal earned MS and PhD degrees in electrical and computer engineering from Purdue University.

Presentations

Solving financial services machine learning problems with explainable ML models Session

Financial services companies use machine learning models to solve critical business use cases. Regulators demand model explainability. Chanchal Chatterjee shares how Google solved financial services business critical problems such as credit card fraud, anti-money laundering, lending risk, and insurance loss using complex machine learning models that can be explained to the regulators.

Michael graduated from the RadioTechnical Faculty of Ural Federal University in 2012 and has worked as a software developer since then. In 2014, recognizing the growing interest in machine learning, he decided to shift to data science. After successfully completing several projects in the field, he set out to build more fundamental skills, which led him to graduate from the Yandex Data School, one of Russia's top educational centers for preparing highly skilled data science professionals.

Presentations

A Unified CV, OCR & NLP Model Pipeline for Scalable Document Understanding at DocuSign Session

This is a real-world case study about applying state-of-the-art deep learning techniques to a pipeline that combines computer vision, OCR, and natural language processing at DocuSign, the world's largest eSignature provider. It also covers how the project delivered on its extreme interpretability, scalability, and compliance requirements.

Amanda “Mandy” Chessell is a master inventor, fellow of the Royal Academy of Engineering, and a distinguished engineer at IBM, where she’s driving IBM’s strategic move to open metadata and governance through the Apache Atlas open source project. Mandy is a trusted advisor to executives from large organizations and works with them to develop strategy and architecture relating to the governance, integration, and management of information. You can find out more information on her blog.

Presentations

Creating an ecosystem on data governance in the ODPi Egeria project Session

Building on its success at establishing standards in the Apache Hadoop data platform, the ODPi (Linux Foundation) now turns its focus to the next big data challenge—enabling metadata management and governance at scale across the enterprise. Amanda Chessell and John Mertic discuss how the ODPi's guidance on governance (GoG) aims to create an open data governance ecosystem.

George Chkadua is a data scientist at TBC Bank. His main focus is machine learning and its applications in industry, from both a mathematics and a business perspective. He earned a PhD in mathematics from King’s College London. George has published various articles in peer-reviewed journals and has been an invited speaker at many scientific conferences and seminars.

Presentations

A novel approach of recommender systems in retail banking Session

TBC Bank is in transition from a product-centric to a client-centric company. An obvious application of analytics is developing personalized next-best-product recommendations for clients. George Chkadua and Levan Borchkhadze explain why the bank decided to implement the ALS user-item matrix factorization method and a demographic model. As a result, the pilot increased sales conversion rates by 70%.

Sowmiya is the cofounder and CTO of Lily AI, an emotional-intelligence-powered shopping experience that helps brands understand their consumers’ purchase behavior. At Lily, she is focused on decoding user behavior and building deep product understanding by applying deep learning techniques. Prior to Lily, she worked at different levels of the tech stack at Box, leading initiatives in building SDKs, applications for industry verticals, and MDM solutions. She was also an early engineer at Pocket Gems, where she worked on the core game engine and built acquisition and retention strategies for the #1 and #4 top-grossing gaming apps. Sowmiya is a UT Austin grad with a master's in electrical and computer engineering.

Presentations

Personalization powered by unlocking deep product and consumer features Session

Digital brands are focusing heavily on personalizing the consumer experience at every single touchpoint. To engage with consumers in the most relevant ways, we help brands dissect and understand how their consumers are interacting with their products, specifically with product features.

Jike Chong is the director of data science, hiring marketplace at LinkedIn. He’s an accomplished executive and professor with experience across industry and academia. Previously, he was the chief data scientist at Acorns, the leading microinvestment app in the US with over four million verified investors, which uses behavioral economics to help the up-and-coming save and invest for a better financial future; was the chief data scientist at Yirendai, an online P2P lending platform with more than $7B in loans originated and the first of its kind from China to go public on the NYSE; established and headed the data science division at SimplyHired, a leading job search engine in Silicon Valley; advised the Obama administration on using AI to reduce unemployment; and led quantitative risk analytics at Silver Lake Kraftwerk, where he was responsible for applying big data techniques to risk analysis of venture investment. Jike is also an adjunct professor and PhD advisor in the Department of Electrical and Computer Engineering at Carnegie Mellon University, where he established the CUDA Research Center and CUDA Teaching Center, which focus on the application of GPUs for machine learning. Recently, he developed and taught a new graduate-level course on machine learning for internet finance at Tsinghua University in Beijing, China, where he is serving as an adjunct professor. Jike holds MS and BS degrees in electrical and computer engineering from Carnegie Mellon University and a PhD from the University of California, Berkeley. He holds 11 patents (six granted, five pending).

Presentations

Applying technology oversight and domain insights in AI and ML initiatives to increase success Tutorial

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a "science." As data science practitioners, reducing this failure rate is a priority. Jike Chong and Yue Cathy Chang explain the three key steps of applying data science technology to business problems and three concerns for applying domain insights in AI and ML initiatives.

Executive Briefing: Technology oversight to reduce data science project failure rate Session

More than 85% of data science projects fail. This high failure rate is a main reason why data science is still a science. Jike Chong and Yue "Cathy" Chang outline how you can reduce this failure rate and improve teams' confidence in executing successful data science projects by applying data science technology to business problems: scenario mapping, pattern discovery, and success evaluation.

Ira Cohen is a cofounder and chief data scientist at Anodot, where he’s responsible for developing and inventing the company’s real-time multivariate anomaly detection algorithms that work with millions of time series signals. He holds a PhD in machine learning from the University of Illinois at Urbana-Champaign and has over 12 years of industry experience.

Presentations

Herding cats: Product management in the machine learning era Tutorial

While the role of the manager doesn't require deep knowledge of ML algorithms, it does require understanding how ML-based products should be developed. Ira Cohen explores the cycle of developing ML-based capabilities (or entire products) and the role of the (product) manager in each step of the cycle.

After earning his PhD in cognitive science at the University of Padua, Nicola Corradi did his postdoc at Cornell in computational neuroscience and computer vision, focusing on the integration of computational models of neurons with neural networks (work licensed to Ford for use in its self-driving car project).

He now applies his extensive experience with neural networks at DataVisor, designing and training deep learning models to recognize malicious patterns in user behavior.

Presentations

A deep learning model to detect coordinated content abuse Session

Fraudulent attacks such as application fraud, fake reviews, and promotion abuse have to automate the generation of user content to scale; this creates latent patterns shared among the coordinated malicious accounts. Nicola Corradi describes a deep learning model designed to detect such patterns, leading to the identification of coordinated content abuse attacks on social, ecommerce, and financial platforms, and more.

A deep learning model to detect coordinated frauds using patterns in user content Session

Fraudulent attacks like fake reviews, application fraud, and promotion abuse create a common pattern shared within coordinated malicious accounts. Nicola Corradi explains novel deep learning models that learned to detect suspicious patterns, leading to the identification of coordinated fraud attacks on social, dating, ecommerce, financial, and news aggregator services.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Tuesday keynotes Keynote

Strata program chairs Rachel Roumeliotis and Alistair Croll welcome you to the first day of keynotes.

Wednesday keynotes Keynote

Strata program chairs Rachel Roumeliotis and Alistair Croll welcome you to the second day of keynotes.

Robert Crowe is a data scientist at Google with a passion for helping developers quickly learn what they need to be productive. A TensorFlow addict, he’s used TensorFlow since the very early days and is excited about how it’s evolving quickly to become even better than it already is. Previously, Robert led software engineering teams for large and small companies, always focusing on clean, elegant solutions to well-defined needs. In his spare time, Robert sails, surfs occasionally, and raises a family.

Presentations

ML in production: Getting started with TensorFlow Extended (TFX) Tutorial

Putting together an ML production pipeline for training, deploying, and maintaining ML and deep learning applications is much more than just training a model. Robert Crowe outlines what's involved in creating a production ML pipeline and walks you through working code.

Michelangelo D’Agostino is the senior director of data science at ShopRunner, where he leads a team that develops statistical models and writes software that leverages their unique cross-retailer e-commerce dataset. Previously, Michelangelo led the data science R&D team at Civis Analytics, a Chicago-based data science software and consulting company that spun out of the 2012 Obama reelection campaign, and was a senior analyst in digital analytics with the 2012 Obama reelection campaign, where he helped to optimize the campaign’s email fundraising juggernaut and analyzed social media data. Michelangelo has been a mentor with the Data Science for Social Good Fellowship. He holds a PhD in particle astrophysics from UC Berkeley and got his start in analytics sifting through neutrino data from the IceCube experiment. Accordingly, he spent two glorious months at the South Pole, where he slept in a tent salvaged from the Korean War and enjoyed the twice-weekly shower rationing. He’s also written about science and technology for the Economist.

Presentations

The Care and Feeding of Data Scientists Session

As a discipline, data science is relatively young, but the job of managing data scientists is younger still. Many people undertake this management position without the tools, mentorship, or role models they need to do it well. This session will review key themes from a recent Strata report, which examines the steps necessary to build, manage, sustain, and retain a growing data science team.

Leslie De Jesus is the chief innovation officer at Wovenware. With more than 20 years of expertise in software, product development, and data science, Leslie drives disruptive strategies and solutions, including AI and enterprise cloud solutions, for clients in a variety of markets, from healthcare and telco to the insurance, education, and defense industries. Leslie is responsible for designing advanced deep learning, machine learning, and chatbot solutions, including patented groundbreaking products. One of her biggest strengths is team building, which is the foundation of repeatability in the product creation process. Previously, Leslie has held positions such as senior software product architect; CTO; and vice president, product development for key firms.

Presentations

Going beyond the textbook: Best practices for creating a DL churn model in healthcare Session

Considering the cost of customer acquisition and the importance of making decisions based on customer data, churn prediction is a key tool for retaining customers and anticipating future trends. In this case study of how a healthcare insurance provider reduced customer churn with a DL model, we'll examine three key considerations in creating a model that serves as a tool for preemptive decision-making.

Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Previously, Sourav led teams building data products across the technology stack, from smart thermostats and security cams at Google Nest to power grid forecasting at AutoGrid to wireless communication chips at Qualcomm. He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He holds PhD, MS, and BS degrees in electrical engineering and computer science from MIT.

Presentations

Efficient ML engineering: Tools and best practices Tutorial

Today, ML engineers are working at the intersection of data science and software engineering—that is, MLOps. Sourav Dey and Alex Ng highlight the six steps of the Lean AI process and explain how it helps ML engineers work as an integrated part of development and production teams. You'll go hands-on using real-world data so you can get up and running seamlessly.

Victor Dibia is a research engineer at Cloudera’s Fast Forward Labs, where his work focuses on prototyping state-of-the-art machine learning algorithms and advising clients. He’s passionate about community work and serves as a Google Developer Expert in machine learning. Previously, he was a research staff member at the IBM TJ Watson Research Center. His research interests are at the intersection of human-computer interaction, computational social science, and applied AI. He’s a senior member of IEEE and has published research papers at conferences such as AAAI Conference on Artificial Intelligence and ACM Conference on Human Factors in Computing Systems. His work has been featured in outlets such as the Wall Street Journal and VentureBeat. He holds an MS from Carnegie Mellon University and a PhD from City University of Hong Kong.

Presentations

Deep learning for anomaly detection Session

In many business use cases, it's frequently desirable to automatically identify and respond to abnormal data. This process can be challenging, especially when working with high-dimensional, multivariate data. Nisha Muktewar and Victor Dibia explore deep learning approaches (sequence models, VAEs, GANs) for anomaly detection, performance benchmarks, and product possibilities.

Dominic Divakaruni is a principal product leader at Microsoft with a track record of launching and driving noteworthy products across vastly varied technical domains. He’s devoted to a sense of ownership—delivering revenue-generating, high-growth products that meet the highest standards for customer satisfaction.

Presentations

Data lineage enables reproducible and reliable machine learning at scale Session

You'll discover effective ways to track the full lineage from data preparation to model training to inference. Sihui Hu and Dominic Divakaruni unpack how to retrieve data-to-data, data-to-model, and model-to-deployment lineages in one graph to achieve reproducible and reliable machine learning at scale.

Mark Donsky is a director of product management at Okera, a software company that provides data discovery, access control, and governance at scale for today’s modern heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera and held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions, saving millions of dollars annually. He holds a BS with honors in computer science from Western University in Ontario, Canada.

Presentations

Executive briefing on CCPA, GDPR, and NYPA: Big data in the era of heavy privacy regulation Session

Privacy regulation is increasing worldwide with Europe's GDPR, the California Consumer Privacy Act (CCPA), and the New York Privacy Act (NYPA). Penalties for noncompliance are stiff, but many companies still aren't prepared. Mark Donsky shares how to establish best practices for holistic privacy readiness as part of your data strategy.

Jozo Dujmović received BSEE, MS, and ScD degrees in computer engineering from the University of Belgrade. He is a professor of computer science and former chair of the computer science department at San Francisco State University, where he teaches and researches soft computing, decision engineering, software metrics, and computer performance evaluation. His first industrial experience was at the Institute “M. Pupin” in Belgrade, followed by a professorship at the School of Electrical Engineering at the University of Belgrade. Before his current position at San Francisco State University, he was a professor of computer science at the University of Florida, Gainesville; the University of Texas at Dallas; and Worcester Polytechnic Institute in Worcester, Massachusetts. He is the author of the LSP decision method and more than 170 refereed publications. Jozo has received three best paper awards, served as general chair of IEEE and ACM conferences, and been an invited keynote speaker at conferences in the USA and Europe. He is the founder and principal of SEAS, a San Francisco company established in 1997 that specializes in soft computing, decision engineering, and software support for the LSP decision method. His latest book, Soft Computing Evaluation Logic, was published by John Wiley and IEEE Press in 2018.

Presentations

Monitoring patient disability and disease severity using AI Session

Jozo Dujmović presents soft computing models for the evaluation of patient disability and disease severity. Such models are necessary in personalized healthcare and must be supported by AI software tools. The methodology is illustrated with a case study of peripheral neuropathy, along with the related decision problem of the optimal timing of risky therapy.

Michael Dulin, MD, PhD, is the director of the Academy for Population Health Innovation at UNC Charlotte, a collaboration designed to advance community and population health, and serves as the chief medical officer for Gray Matter Analytics. Dulin started his career as an electrical and biomedical engineer and then received his PhD studying neurophysiology. His medical degree is from the University of Texas Medical School at Houston. He completed his residency training at Carolinas Medical Center in Charlotte and then entered private practice in Harrisburg, North Carolina, where he worked as a community-based primary care physician prior to returning to academics. He then became the research director and served as the chair of the Carolinas Healthcare System’s Department of Family Medicine, where he directed a primary care practice-based research network (MAPPR). Immediately prior to joining UNC Charlotte and Gray Matter Analytics, he served as the chief clinical officer for outcomes research and analytics at Atrium Health.

Dulin is a nationally recognized leader in the field of health information technology and the application of analytics and outcomes research to improve care delivery and advance population health. He has led projects in this domain funded by AHRQ, the Robert Wood Johnson Foundation, The Duke Endowment, NIH, and PCORI. His work has been recognized by the Charlotte Business Journal, NCHICA, and Cerner. His work to build a healthcare data and analytics team was featured by the Harvard Business School and the Harvard T.H. Chan School of Public Health as a published case study.

Dr. Dulin is a member of the American Academy of Family Physicians, Society for Teachers of Family Medicine, North American Primary Care Research Group, and Alpha Omega Alpha. He is a recipient of the North Carolina Medical Society’s Community Practitioner Program; a participant in the Center for International Understanding Latino Initiative; and he was recognized as one of Charlotte’s Best Doctors.

Presentations

Organizational Culture’s Key Role in Transforming Healthcare Using Data and A.I. Session

Despite advances in technology like cloud computing, healthcare providers struggle with basics around applying data/analytics to essential functions. This delay is driven by organizational culture – particularly in large/complex organizations. This presentation will review common implementation barriers and approaches needed to succeed in the transformation process.

In his role as an associate partner at TNG Technology Consulting in Munich, Thomas Endres works as an IT consultant. Besides his regular work for the company and its customers, he creates various prototypes, such as a telepresence robotics system that lets you see reality through the eyes of a robot and an augmented reality AI that shows the world from the perspective of an artist. He works on applications in the fields of AR/VR, AI, and gesture control, putting them to use in, for example, autonomous or gesture-controlled drones. He is also involved in other open source projects written in Java, C#, and all kinds of JavaScript languages.

Thomas studied IT at TU Munich and is passionate about software development and all other aspects of technology. As an Intel Software Innovator and Black Belt, he promotes new technologies such as AI, AR/VR, and robotics around the world. For this work he has received, among other honors, a JavaOne Rockstar award.

Presentations

Deep Fakes 2.0 - How Neural Networks are Changing our World Session

Imagine looking into a mirror, but not to see your own face. Instead, you are looking in the eyes of Barack Obama or Angela Merkel. Your facial expressions are seamlessly transferred to that other person's face in real time. The TNG Hardware Hacking Team has managed to create a prototype that transfers faces from one person to another in real time based on Deep Fakes.

Dr. Erberich is a computer scientist specializing in medical informatics. He is an AI practitioner in healthcare with a focus on radiological image processing and computer vision ML. Dr. Erberich is the chief data officer of Children’s Hospital Los Angeles and a professor of research radiology at the University of Southern California (USC).

Presentations

Semi-Supervised AI Approach for Automated Categorization of Medical Images Session

Annotating radiological images by category at scale is a critical step for further analytical ML. However, supervised learning is challenging because image metadata does not reliably identify image content and manual labeling of enough images for AI algorithms is not feasible. Here we present a semi-supervised approach for automated categorization of radiological images based on content category.

Moty Fania is a principal engineer and the CTO of the Advanced Analytics Group at Intel, which delivers AI and big data solutions across Intel. Moty has rich experience in ML engineering, analytics, data warehousing, and decision-support solutions. He led the architecture work and development of various AI and big data initiatives such as IoT systems, predictive engines, online inference systems, and more.

Presentations

Practical methods to enable continuous delivery and sustainability for AI Session

Moty Fania shares key insights from implementing and sustaining hundreds of ML models in production, including continuous delivery of ML models and systematic measures to minimize the cost and effort required to sustain them in production. You'll learn from examples from different business domains and deployment scenarios (on-premises, the cloud) covering the architecture and related AI platforms.

Tao Feng is a software engineer on the data platform team at Lyft. Tao is a committer and PMC member on Apache Airflow. Previously, Tao worked on data infrastructure, tooling, and performance at LinkedIn and Oracle.

Presentations

Amundsen: An open source data discovery and metadata platform Session

Jin Hyuk Chang and Tao Feng offer a glimpse of Amundsen, an open source data discovery and metadata platform from Lyft. Since it was open-sourced, Amundsen has been used and extended by many different companies within the community.

Rustem Feyzkhanov is a machine learning engineer at Instrumental, where he creates analytical models for the manufacturing industry. Rustem is passionate about serverless infrastructure (and AI deployments on it) and is the author of the course and book Serverless Deep Learning with TensorFlow and AWS Lambda.

Presentations

Serverless architecture for AI applications Session

Machine learning and deep learning are becoming more and more essential for businesses, for both internal and external use; one of the main issues with deployment is finding the right way to train and operationalize the model. A serverless approach to deep learning provides a cheap, simple, scalable, and reliable architecture for it. Rustem Feyzkhanov digs into how to do so within AWS infrastructure.
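As a rough illustration of the pattern, a serverless inference endpoint boils down to a stateless handler function. The sketch below is hypothetical (the "model" is a stub linear scorer, and in a real AWS Lambda deployment the weights would be loaded from S3 or bundled with the function package), but it shows the shape of the approach:

```python
import json

# Hypothetical stand-in for a trained model; a real deployment would
# deserialize weights outside the handler so they are reused across
# warm invocations.
MODEL_WEIGHTS = [0.4, -0.2, 0.1]

def predict(features):
    """Score a feature vector with the stub linear model."""
    return sum(w * x for w, x in zip(MODEL_WEIGHTS, features))

def handler(event, context=None):
    """Lambda-style entry point: parse the request, run inference, respond."""
    body = json.loads(event["body"])
    score = predict(body["features"])
    return {"statusCode": 200, "body": json.dumps({"score": score})}
```

Because the handler holds no state, the platform can scale it from zero to thousands of parallel invocations without any server management.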

Martin Förtsch is an IT consultant at TNG Technology Consulting GmbH, based in Unterföhring near Munich, and studied computer science.

His professional focus areas are agile development (mainly in Java), search engine technologies, information retrieval, and databases. As an Intel Software Innovator and Intel Black Belt Software Developer, he is strongly involved in developing open source software for gesture control with 3D cameras such as Intel RealSense, and with his team he has built an augmented reality wearable prototype device based on this technology.

Furthermore, he gives many talks at national and international conferences about AI, the internet of things, 3D camera technologies, augmented reality, and test-driven development. He has been awarded the JavaOne Rockstar award.

Presentations

Deep Fakes 2.0 - How Neural Networks are Changing our World Session

Imagine looking into a mirror, but not to see your own face. Instead, you are looking in the eyes of Barack Obama or Angela Merkel. Your facial expressions are seamlessly transferred to that other person's face in real time. The TNG Hardware Hacking Team has managed to create a prototype that transfers faces from one person to another in real time based on Deep Fakes.

Ben Fowler has worked in data science for over five years. In his current role at Southeast Toyota Finance, Ben leads the end-to-end model development process to solve the business problem of interest.

This process involves:
• Discussing and defining the problems of interest for the business
• Designing and implementing a plan for sampling of the data and conducting experiments
• Exploratory data analysis
• Data cleaning and preparation
• Feature engineering and feature selection
• Model evaluation and selection
• Model interpretability
• Presenting results
• Production code for reproducibility and serving of models in deployment on new data
• Model documentation
• Researching novel technologies and machine learning methodologies

Ben holds a Master of Science in Data Science from Southern Methodist University, graduating in 2017. Since graduating, Ben has been a guest speaker for the SMU program multiple times. Additionally, Ben has spoken at the PyData Miami 2019 conference and multiple times at the West Palm Beach Data Science Meetup.

Presentations

Evaluation of Traditional and Novel Feature Selection Approaches Session

Selecting the optimal set of features is a key step in the ML modeling process. This talk presents research that tested five approaches to feature selection, including current widely used methods along with novel approaches using open source libraries, by building a classification model on the Lending Club dataset.

Don Fox is a data scientist in residence in Boston for The Data Incubator. Previously, Don developed numerical models for a geothermal energy startup. Born and raised in South Texas, Don holds a PhD in chemical engineering, for which he researched renewable energy systems and developed computational tools to analyze the performance of these systems.

Presentations

Hands-on data science with Python (Day 2) Training Day 2

Don Fox walks you through all the steps—from prototyping to production—of developing a machine learning pipeline. After looking at data cleaning, feature engineering, model building/evaluation, and deployment, you'll extend these models into two applications from real-world datasets. All your work will be done in Python.

Michael J. Freedman is the cofounder and CTO of TimescaleDB and a full professor of computer science at Princeton University. His work broadly focuses on distributed and storage systems, networking, and security, and his publications have more than 12,000 citations. He developed CoralCDN (a decentralized content distribution network serving millions of daily users) and helped design Ethane (which formed the basis for OpenFlow and software-defined networking). Previously, he cofounded Illuminics Systems (acquired by Quova, now part of Neustar) and served as a technical advisor to Blockstack. Michael’s honors include a Presidential Early Career Award for Scientists and Engineers (given by President Obama), the SIGCOMM Test of Time Award, a Sloan Fellowship, an NSF CAREER award, the Office of Naval Research Young Investigator award, and support from the DARPA Computer Science Study Group. He earned his PhD at NYU and Stanford and his undergraduate and master’s degrees at MIT.

Presentations

Building a distributed time series database on PostgreSQL Session

Time series data tends to accumulate very quickly, across DevOps, IoT, industrial and energy, finance, and other domains. Time series data is everywhere, with monitoring and IoT applications generating tens of millions of metrics per second and petabytes of data. Michael Freedman shows you how to build a distributed time series database that offers the power of full SQL at scale.

Shannon Fuller is an expert in developing and implementing data management and governance processes. He was instrumental in building a data and analytics center of excellence within a large integrated healthcare system, where he led the development of a data governance strategy. He is a creative problem solver and strategic thinker with the ability to create a vision and share it with the business to gain support and functional commitment on projects. Shannon is nationally recognized for his expertise in the application of data governance, with a focus on developing internal policies and change management designed to recognize data as an enterprise asset. He speaks nationally on this topic, leads an academic advisory board on big data applications in industry, and frequently guest lectures at UNC Charlotte.

Presentations

Organizational Culture’s Key Role in Transforming Healthcare Using Data and A.I. Session

Despite advances in technology like cloud computing, healthcare providers struggle with basics around applying data/analytics to essential functions. This delay is driven by organizational culture – particularly in large/complex organizations. This presentation will review common implementation barriers and approaches needed to succeed in the transformation process.

Srikanth is a senior data scientist at Gramener’s Bangalore office. He comes from a solid mechanics background, with a master’s in simulation sciences from RWTH Aachen University, Germany. After a short stint in the aeronautics department at Purdue University, he returned to India and transitioned to data science.

He is currently working on applying deep learning and machine learning approaches in diverse fields.

Presentations

Sizing biological cells and saving lives using AI Session

AI techniques are finding use in a wide range of applications. Crowd counting deep learning models have been used to count people, animals, and microscopic cells. This talk introduces some novel crowd counting techniques and their applications. A pharma case study shows how crowd counting was applied in drug discovery to achieve about 98% savings in drug characterization efforts.

Krishna is the cofounder and CEO of Fiddler Labs, an enterprise startup building an explainable AI engine to address problems of bias, fairness, and transparency in AI. At Facebook, he led the team that built Facebook’s explainability feature “Why am I seeing this?” He’s an entrepreneur with a technical background, with experience creating scalable platforms and expertise in converting data into intelligence. Having held senior engineering leadership roles at Facebook, Pinterest, Twitter, and Microsoft, he’s seen the effects that bias has on AI and machine learning decision-making processes, and with Fiddler, his goal is to enable enterprises across the globe to solve this problem.

Presentations

The Art of Explainability: Removing the bias from AI Session

This session looks at how explainable AI fills a critical gap in operationalizing AI and how to adopt an explainable approach across the end-to-end ML workflow, from training to production. We'll discuss the benefits of explainability, such as early identification of biased data and better confidence in model outputs.

Ben Galewsky is a research programmer at the National Center for Supercomputing Applications at the University of Illinois. He’s an experienced data engineering consultant whose career has spanned high-frequency trading systems to global investment bank enterprise architecture to big data analytics for large consumer goods manufacturers. He’s a member of the Institute for Research and Innovation in Software for High Energy Physics, which funds his development of scalable systems for the Large Hadron Collider.

Presentations

Data engineering at the Large Hadron Collider Session

Building a data engineering pipeline for serving segments of a 200 PB dataset to particle physicists around the globe poses many challenges, some of which are unique to high energy physics and some of which apply to big science projects across disciplines. Ben Galewsky, Lindsey Gray, and Andrew Melo highlight how much of it can inform industry data science at scale.

Meg Garlinghouse is the head of social impact at LinkedIn. She’s passionate about connecting people with opportunities to use their skills and experience to transform the world. She has more than 20 years of experience working at the intersection of nonprofits and corporations, developing strategic and mutually beneficial partnerships. She has particular expertise in leveraging media and technology to meet the marketing, communications, and brand goals of respective clients. Meg has a passion for developing innovative social campaigns that have a business benefit.

Presentations

Fairness through experimentation at LinkedIn Session

Most companies want to ensure their products and algorithms are fair. Guillaume Saint-Jacques and Meg Garlinghouse share LinkedIn's A/B testing approach to fairness and describe new methods that detect whether an experiment introduces bias or inequality. You'll learn about a scalable implementation on Spark and examples of use cases and impact at LinkedIn.

William “Will” Gatehouse is the chief solutions architect for Accenture’s industry X.0 platforms, including solutions for oil and gas, chemicals, and smart grid. Will has over 20 years’ experience with implementing industrial platforms and has a reputation for applying emerging technology at enterprise scale as he’s proven for streaming analytics, semantic models, and edge analytics. When not at work, Will is an avid sailor.

Presentations

Building the digital twin: IoT and unconventional data Session

The digital twin presents a problem of data and models at scale—how to mobilize IT and OT data, AI, and engineering models that work across lines of business and even across partners. Teresa Tung and William Gatehouse share their experience of implementing digital twins use cases that combine IoT, AI models, engineering models, and domain context.

Lior Gavish is a senior vice president of engineering at Barracuda, where he coleads the email security business. Lior developed AI solutions that were recognized by industry and academia, including a Distinguished Paper Award at USENIX Security 2019. Lior joined Barracuda through the acquisition of Sookasa, an Accel-backed startup where he was a cofounder and vice president of engineering. Previously, Lior led startup engineering teams building machine learning, web and mobile technologies. Lior holds a BSc and MSc in computer science from Tel-Aviv University and an MBA from Stanford University.

Presentations

High-precision detection of business email compromise Session

Lior Gavish breaks down a machine learning (ML)-based system that detects a highly evasive type of email-based fraud. The system combines innovative techniques for labeling and classifying highly unbalanced datasets with a distributed cloud application capable of processing high-volume communication in real time.

Marina (Mars) Rose Geldard is a researcher from Down Under in Tasmania. Entering the world of technology relatively late as a mature-age student, she’s found her place in the world: an industry where she can apply her lifelong love of mathematics and optimization. When she’s not busy being the most annoyingly eager researcher ever, she compulsively volunteers at industry events, dabbles in research, and serves on the executive committee for her state’s branch of the Australian Computer Society (ACS). She’s currently writing Practical Artificial Intelligence with Swift for O’Reilly Media.

Presentations

Swift for TensorFlow in 3 hours Tutorial

Mars Geldard, Tim Nugent, and Paris Buttfield-Addison are here to prove Swift isn't just for app developers. Swift for TensorFlow provides the power of TensorFlow with all the advantages of Python (and complete access to Python libraries) and Swift—the safe, fast, incredibly capable open source programming language; Swift for TensorFlow is the perfect way to learn deep learning and Swift.

Lars George is the principal solutions architect at Okera. Lars has been involved with Hadoop and HBase since 2007 and became a full HBase committer in 2009. Previously, Lars was the EMEA chief architect at Cloudera, acting as a liaison between the Cloudera professional services team and customers as well as partners in and around Europe, building the next data-driven solutions, and a cofounding partner of OpenCore, a Hadoop and emerging data technologies advisory firm. He’s spoken at many Hadoop User Group meetings as well as at conferences such as ApacheCon, FOSDEM, QCon, and Hadoop World and Hadoop Summit. He started the Munich OpenHUG meetings. He’s the author of HBase: The Definitive Guide (O’Reilly).

Presentations

Conquering the AWS IAM conundrum Session

With various levels of security layers and different departments responsible for data, there are a number of challenges with managing security and governance within AWS identity and access management (IAM). Lars George identifies the security layers, why there’s such a conundrum with IAM, whether IAM actually slows down data projects, and the access control requirements needed in data lakes.

Dan Gifford is a Senior Data Scientist responsible for creating data products at Getty Images in Seattle, Washington. Dan works at the intersection between science and creativity and builds products that improve the workflows of both Getty Images photographers and customers. Currently, he is the lead researcher on visual intelligence at Getty Images and is developing innovative new ways for customers to discover content. Prior to this, he worked as a Data Scientist on the Ecommerce Analytics team at Getty Images where he modernized testing frameworks and analysis tools used by Getty Images Analysts in addition to modeling content relationships for the Creative Research team. Dan earned a Ph.D. in Astronomy and Astrophysics from the University of Michigan in 2015 where he developed new algorithms for estimating the size of galaxy clusters. He also engineered a new image analysis pipeline for an instrument on a telescope used by the department at the Kitt Peak National Observatory.

Presentations

At a Loss for Words: How ML Bridges the Creative Language Gap Data Case Studies

Computer vision has made great strides towards human-level accuracy of describing and identifying images, but there often aren’t words to describe what we want algorithms to predict. In this session, Getty Images’ Senior Data Scientist Dan Gifford will explore this paradox, limitations of text-based image search, and how creative AI is challenging the way we view human creativity.

Navdeep Gill is a senior data scientist/software engineer at H2O.ai, where he focuses mainly on machine learning interpretability; previously he focused on GPU-accelerated machine learning, automated machine learning, and the core H2O-3 platform.

Prior to joining H2O.ai, Navdeep worked at Cisco, focusing on data science and software development. Before that, he was a researcher/analyst in several neuroscience labs at California State University, East Bay; the University of California, San Francisco; and the Smith-Kettlewell Eye Research Institute.

Navdeep graduated from California State University, East Bay, with an M.S. in computational statistics, a B.S. in statistics, and a B.A. in psychology (with a minor in mathematics).

Presentations

Debugging Machine Learning Models Session

Like all good software, machine learning models should be debugged to discover and remediate errors. This presentation goes over several standard techniques in the context of model debugging (disparate impact, residual, and sensitivity analysis) and introduces novel applications such as global and local explanation of model residuals.

Ilana is a director in PwC’s Emerging Technologies practice and leads PwC’s global research and development of responsible AI. Ilana has almost a decade of experience as a data scientist, helping clients make strategic business decisions through data-informed decision making, simulation, and machine learning.

Presentations

A Practical Guide to Responsible AI: Building robust, secure and safe AI Session

A practitioner’s overview of the risks of AI and a depiction of responsible AI deployment within an organization. How do we ensure the safety, security, standardized testing, and governance of systems? How can models be fooled or subverted? We showcase client examples to illustrate how organizations safeguard their AI applications and vendor solutions to mitigate the risks AI may present.

Bruno Gonçalves is a chief data scientist at Data For Science, working at the intersection of data science and finance. Previously, he was a data science fellow at NYU’s Center for Data Science while on leave from a tenured faculty position at Aix-Marseille Université. Since completing his PhD in the physics of complex systems in 2008, he’s been pursuing the use of data science and machine learning to study human behavior. Using large datasets from Twitter, Wikipedia, web access logs, and Yahoo! Meme, he studied how we can observe both large-scale and individual human behavior in an unobtrusive and widespread manner. The main applications have been to the study of computational linguistics, information diffusion, behavioral change, and epidemic spreading. In 2015, he was awarded the Complex Systems Society’s 2015 Junior Scientific Award for “outstanding contributions in complex systems science,” and in 2018 he was named a science fellow of the Institute for Scientific Interchange in Turin, Italy.

Presentations

Time series modeling: ML and deep learning approaches 2-Day Training

Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Bruno Goncalves explains a broad range of traditional machine learning (ML) and deep learning techniques to model and analyze time series datasets with an emphasis on practical applications.

Time series modeling: ML and deep learning approaches (Day 2) Training Day 2

Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Bruno Goncalves explains a broad range of traditional machine learning (ML) and deep learning techniques to model and analyze time series datasets with an emphasis on practical applications.
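The temporal correlations mentioned above can be made concrete with a small example: the lag-k autocorrelation of a series measures how strongly values depend on their own past. The stdlib-only sketch below is illustrative and not taken from the course materials:

```python
from statistics import mean

def autocorrelation(series, lag):
    """Sample autocorrelation at `lag`: covariance of the series with a
    shifted copy of itself, normalized by the overall variance."""
    m = mean(series)
    var = sum((x - m) ** 2 for x in series)
    cov = sum((series[t] - m) * (series[t - lag] - m)
              for t in range(lag, len(series)))
    return cov / var

# A periodic signal correlates strongly with itself at its period (lag 4)
# and anti-correlates at half its period (lag 2).
signal = [0, 1, 0, -1] * 25
```

Structure like this, visible in the autocorrelation at different lags, is exactly what both classical models and deep sequence models are built to exploit.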

Abe Gong is CEO and cofounder at Superconductive Health. A seasoned entrepreneur, Abe has been leading teams using data and technology to solve problems in healthcare, consumer wellness, and public policy for over a decade. Previously, he was chief data officer at Aspire Health, the founding member of the Jawbone data science team, and lead data scientist at Massive Health. Abe holds a PhD in public policy, political science, and complex systems from the University of Michigan. He speaks and writes regularly on data science, healthcare, and the internet of things.

Presentations

Fighting pipeline debt with Great Expectations Session

Data organizations everywhere struggle with pipeline debt: untested, unverified assumptions that corrupt data quality, drain productivity, and erode trust in data. Abe Gong shares best practices gathered from across the data community in the course of developing a leading open source library for fighting pipeline debt and ensuring data quality: Great Expectations.
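The core idea behind Great Expectations is declaring testable assertions ("expectations") about data and validating each batch against them before it flows downstream. The sketch below illustrates that idea with the standard library only; it is not the library's actual API, and the function names are made up for the example:

```python
def expect_values_not_null(rows, column):
    """Expectation: every row has a non-null value in `column`."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"success": not bad, "unexpected_rows": bad}

def expect_values_between(rows, column, low, high):
    """Expectation: all values in `column` fall within [low, high].
    Null values count as failures here, for simplicity."""
    bad = [i for i, row in enumerate(rows)
           if row.get(column) is None or not (low <= row[column] <= high)]
    return {"success": not bad, "unexpected_rows": bad}

# Validate a batch against a small suite before it moves downstream;
# failures surface pipeline debt early instead of corrupting reports.
batch = [{"age": 34}, {"age": None}, {"age": 210}]
results = [
    expect_values_not_null(batch, "age"),
    expect_values_between(batch, "age", 0, 120),
]
```

Turning implicit assumptions into executable checks like these is what converts untested pipeline debt into something a team can monitor and pay down.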

Sunil Goplani is a group development manager at Intuit, leading the big data platform. Sunil has played key architecture and leadership roles in building solutions around data platforms, big data, BI, data warehousing, and MDM for startups and enterprises. Previously, Sunil served in key engineering positions at Netflix, Chegg, Brand.net, and a few other startups. Sunil has a master’s degree in computer science.

Presentations

10 lead indicators before data becomes a mess Session

Data quality metrics focus on quantifying whether data is a mess. But you need to identify lead indicators before data becomes a mess. Sandeep U, Giriraj Bagadi, and Sunil Goplani explore developing lead indicators for data quality for Intuit's production data pipelines. You'll learn about the details of lead indicators, optimization tools, and lessons that moved the needle on data quality.

Always accurate business metrics through lineage-based anomaly tracking Session

Debugging data pipelines is nontrivial, and finding the root cause can take hours to days. Shradha Ambekar and Sunil Goplani outline how Intuit built a self-serve tool that automatically discovers data pipeline lineage and applies anomaly detection to detect and help debug issues in minutes, establishing trust in metrics and improving developer productivity by 10-100X.
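To give a flavor of the anomaly detection side (the lineage discovery is the harder part), a metric such as a table's daily row count can be screened with a simple leave-one-out z-score test. This is an illustrative stdlib sketch, not Intuit's implementation:

```python
from statistics import mean, stdev

def find_anomalies(series, threshold=3.0):
    """Flag points lying more than `threshold` standard deviations away
    from the mean of the remaining points (leave-one-out z-score)."""
    flagged = []
    for i, x in enumerate(series):
        rest = series[:i] + series[i + 1:]
        m, s = mean(rest), stdev(rest)
        if s > 0 and abs(x - m) / s > threshold:
            flagged.append(i)
    return flagged

# Daily row counts for a pipeline table: the sudden drop on the last day
# is the kind of signal to trace upstream through the lineage graph.
daily_rows = [1000, 1020, 980, 1010, 990, 120]
```

Once a metric is flagged, lineage tells you which upstream tables and jobs feed it, narrowing hours of manual digging to a few candidate causes.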

Denise Gosnell is a global graph practice lead at DataStax, which builds some of the largest distributed graph applications in the world. Her passion centers on examining, applying, and evangelizing the applications of graph data and complex graph problems. An NSF fellow, Denise holds a PhD in computer science from the University of Tennessee, where her research coined the concept of “social fingerprinting” by applying graph algorithms to predict user identity from social media interactions.​ ​Since then, she’s built, published on, patented, and spoken about dozens of topics related to graph theory, graph algorithms, graph databases, and applications of graph data across all industry verticals.

Presentations

How does graph data help inform a self-organizing network? Session

Self-organizing networks rely on sensor communication and a centralized mechanism, like a cell tower, for transmitting the network's status. Denise Gosnell walks you through what happens if the tower goes down and how a graph data structure gets involved in the network's healing process. You'll see graphs in this dynamic network and how path information helps sensors come back online.
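As a toy illustration of the role path data can play, the sketch below finds a route from a sensor back to a gateway with breadth-first search over an adjacency map. The names and topology are hypothetical, and this is not DataStax's implementation:

```python
from collections import deque

def path_to_gateway(adjacency, start, gateway):
    """Breadth-first search for a route from a sensor to the gateway:
    the kind of path information a healing mesh can exploit."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == gateway:
            return path
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route; the sensor stays offline

# A tiny mesh: sensors s1-s3 and a gateway "gw".
mesh = {"s1": ["s2", "s3"], "s2": ["gw"], "s3": ["s2"]}
```

In a real self-organizing network the graph is dynamic, so stored paths like these have to be recomputed or repaired as sensors and towers drop out and rejoin.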

Trevor Grant is an Apache Software Foundation member involved in multiple projects such as Mahout, Streams, and SDAP (incubating), just to name a few. He holds an MS in applied math and an MBA from Illinois State University. He speaks about computer stuff internationally. He has taken numerous classes in stand-up and improv comedy to make his talks more pleasant for you, the listener.

Presentations

Ship it! A practitioner's guide to model management and deployment with Kubeflow. Session

We'll show you a way to get & keep your models in production with Kubeflow.

Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Reducing data lag from 24+ hours to 5 mins at Lyft scale Session

Mark Grover and Dev Tagare offer you a glimpse at the end-to-end data architecture Lyft uses to reduce data lag appearing in its analytical systems from 24+ hours to under 5 minutes. You'll learn the what and why of tech choices, monitoring, and best practices. They outline the use cases Lyft has enabled, especially in ML model performance and evaluation.

Ananth is a senior application architect on the decisioning and advanced analytics engineering team at Commonwealth Bank of Australia (CBA). Ananth holds a PhD in computer security and is interested in all things data, including low latency distributed processing systems, machine learning, and data engineering. He holds three patents granted by the USPTO and has one application pending. Prior to joining CBA, he was an architect at ThreatMetrix and a member of the core team that scaled the ThreatMetrix architecture to 100 million transactions per day, running at very low latencies using Cassandra, ZooKeeper, and Kafka. He also migrated the ThreatMetrix data warehouse to a next-generation architecture based on Hadoop and Impala. Before ThreatMetrix, he worked for IBM software labs and IBM CIO labs, enabling some of the first IBM CIO projects onboarding the HBase, Hadoop, and Mahout stack.

Ananth is a committer for Apache Apex and is currently working on the next-generation architectures for the CBA fraud platform and the Advanced Analytics Omnia platform at CBA. Ananth has presented at a number of conferences, including the YOW! Data conference and the DataWorks Summit in Australia.

Presentations

Automated feature engineering for the modern enterprise using Dask and featuretools Session

Feature engineering can make or break a machine learning model. The featuretools package and its associated algorithm accelerate the way features are built. Ananth Kalyan Chakravarthy Gundabattula explains a Dask- and Prefect-based framework that addresses the challenges and opportunities of this approach in terms of lineage, risk, ethics, and automated data pipelines for the enterprise.
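The kind of automation featuretools provides can be pictured as deriving aggregate features across related tables. The stdlib sketch below mimics that idea for a toy customers/orders pair; the labels echo featuretools-style feature names, but the code is not the package's API:

```python
def aggregate_features(parents, children, key):
    """For each parent row, derive COUNT and MEAN aggregates over its
    child rows, in the spirit of automated feature synthesis."""
    feats = {}
    for p in parents:
        amounts = [c["amount"] for c in children if c[key] == p[key]]
        feats[p[key]] = {
            "COUNT(orders)": len(amounts),
            "MEAN(orders.amount)": sum(amounts) / len(amounts) if amounts else 0.0,
        }
    return feats

# Two customers with orders; the derived aggregates become model features.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]
features = aggregate_features(customers, orders, "customer_id")
```

Generating such features by hand across many tables is tedious and error-prone, which is why automating it, while tracking lineage and risk, matters at enterprise scale.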

Sijie Guo is the founder and CEO of StreamNative, a data infrastructure startup offering a cloud-native event streaming platform based on Apache Pulsar for enterprises. Previously, he was the tech lead for the messaging group at Twitter and worked on push notification infrastructure at Yahoo. He’s also the VP of Apache BookKeeper and a PMC member of Apache Pulsar.

Presentations

Transactional event streaming with Apache Pulsar Session

Sijie Guo and Yong Zhang lead a deep dive into the details of Pulsar transactions and how they can be used in Pulsar Functions and other processing engines to achieve transactional event streaming.

Patrick Hall is a senior director for data science products at H2O.ai, where he focuses mainly on model interpretability and model management. Patrick is also an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning. Previously, Patrick held global customer-facing and R&D roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the 11th person worldwide to become a Cloudera Certified Data Scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Model debugging strategies Tutorial

Even if you've followed current best practices for model training and assessment, machine learning models can be hacked, socially discriminatory, or just plain wrong. Patrick Hall breaks down model debugging strategies to test and fix security vulnerabilities, unwanted social biases, and latent inaccuracies in models.

Hannes Hapke is a machine learning enthusiast and a Google Developer Expert for machine learning. He’s applied deep learning to a variety of computer vision and natural language problems, but his main interest is in machine learning infrastructure and automating model workflows. Hannes is a coauthor of Natural Language Processing in Action and is working on Building Machine Learning Pipelines with TensorFlow for O’Reilly. When he isn’t working on a deep learning project, you’ll find him running long distances, hiking, or enjoying a book with a good cup of coffee.

Presentations

Analyzing and deploying your machine learning model Tutorial

Most deep learning models don’t get analyzed, validated, and deployed. Catherine Nelson and Hannes Hapke explain the steps necessary to release machine learning models for real-world applications. You'll work through an example project using the TensorFlow ecosystem, focusing on how to analyze models and deploy them efficiently.

Getting the most out of your AI Projects with Model Feedback Loops Session

Measuring a machine learning model’s performance is key to every successful data science project, and model feedback loops are essential for capturing feedback from users and expanding your model’s training dataset. Hannes Hapke introduces the concept of model feedback and guides you through a framework for increasing the ROI of your data science projects.

Matt Harrison is a consultant at MetaSnake, a Python training and consultancy shop. He’s a Python user, presenter, author, and user group organizer. He helps run the Utah Python user group, and he wrote Treading on Python volumes 1 and 2. His work experience covers search, business intelligence, and data science.

Presentations

Mastering pandas Tutorial

You can use pandas to load data, inspect it, tweak it, visualize it, and do analysis with only a few lines of code. Matt Harrison leads a deep dive into plotting and Matplotlib integration, data quality, and issues such as missing data. Matt applies the split-apply-combine paradigm with groupby and pivot and explains stacking and unstacking data.
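A minimal sketch of the operations the tutorial covers, using a small hypothetical sales dataset (the column names are illustrative, not from the tutorial itself):

```python
import pandas as pd

# Hypothetical dataset: sales by region and quarter
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [100, 150, 200, 250],
})

# Split-apply-combine: split by region, apply a sum, combine the results
totals = df.groupby("region")["sales"].sum()

# Pivot: reshape the long data into a region-by-quarter table
table = df.pivot(index="region", columns="quarter", values="sales")

# Stacking folds the quarter columns back into a row MultiIndex;
# unstack() would reverse it
stacked = table.stack()
```

Here `totals["East"]` is 250 and `stacked[("East", "Q1")]` recovers the original 100, showing that pivot/stack round-trips the data.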

Zak Hassan is a software engineer on the data analytics team working on data science and machine learning at Red Hat. Previously, Zak was a software consultant in the financial services and insurance industry, building end-to-end software solutions for clients.

Presentations

Log anomaly detector with NLP and unsupervised machine learning Session

The number of logs grows constantly, and no human can monitor them all. Zak Hassan employs NLP for text encoding and machine learning methods for automated anomaly detection in an effort to construct a tool that helps developers perform root cause analysis more quickly on failing applications. He also provides a means to give feedback to the ML algorithm so it can learn from false positives.

Long Van Ho is a data scientist with over five years of experience applying advanced machine learning techniques to healthcare and defense applications. His work includes developing the machine learning framework that enables data science at the Virtual Pediatric Intensive Care Unit and researching applications of artificial intelligence to improve care in ICUs. His research background in particle beam physics at UCLA and Stanford has provided a strong foundation for his career as a data scientist. His goal is to bridge the potential of machine learning with practical applications in health.

Presentations

Semi-Supervised AI Approach for Automated Categorization of Medical Images Session

Annotating radiological images by category at scale is a critical step for further analytical ML. However, supervised learning is challenging because image metadata does not reliably identify image content, and manually labeling enough images for AI algorithms is not feasible. Long Van Ho presents a semi-supervised approach for automated categorization of radiological images by content category.

Bob Horton is a senior data scientist on the user understanding team at Bing. Bob holds an adjunct faculty appointment in health informatics at the University of San Francisco, where he gives occasional lectures and advises students on data analysis and simulation projects. Previously, he was on the professional services team at Revolution Analytics. Long before becoming a data scientist, he was a regular scientist (with a PhD in biomedical science and molecular biology from the Mayo Clinic). Some time after that, he got an MS in computer science from California State University, Sacramento.

Presentations

Machine learning for managers Tutorial

Robert Horton, Mario Inchiosa, and John-Mark Agosta offer an overview of the fundamental concepts of machine learning (ML) to business and healthcare decision makers and software product managers so you'll be able to make a more effective use of ML results and be better able to evaluate opportunities to apply ML in your industries.

Sihui “May” Hu (she/her) is a program manager at Microsoft, focused on creating data management and data lineage solutions for the Azure Machine Learning service. Previously, she had two years of working experience in the ecommerce industry and several internships in product management. She graduated from Carnegie Mellon University, studying information systems management.

Presentations

Data lineage enables reproducible and reliable machine learning at scale Session

You'll discover effective ways to track the full lineage from data preparation to model training to inference. Sihui Hu and Dominic Divakaruni unpack how to retrieve data-to-data, data-to-model, and model-to-deployment lineages in one graph to achieve reproducible and reliable machine learning at scale.

Mario Inchiosa’s passion for data science and high-performance computing drives his work as principal software engineer in Microsoft Cloud + AI, where he focuses on delivering advances in scalable advanced analytics, machine learning, and AI. Previously, Mario served as chief scientist of Revolution Analytics; analytics architect in the big data organization at IBM, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist at Netezza, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning Publication of the Year and Open Literature Publication Excellence awards.

Presentations

Machine learning for managers Tutorial

Robert Horton, Mario Inchiosa, and John-Mark Agosta offer an overview of the fundamental concepts of machine learning (ML) to business and healthcare decision makers and software product managers so you'll be able to make a more effective use of ML results and be better able to evaluate opportunities to apply ML in your industries.

George Iordanescu is a data scientist on the algorithms and data science team for Microsoft’s Cortana Intelligence Suite. Previously, he was a research scientist in academia, a consultant in the healthcare and insurance industry, and a postdoctoral visiting fellow in computer-assisted detection at the National Institutes of Health (NIH). His research interests include semisupervised learning and anomaly detection. George holds a PhD in EE from Politehnica University in Bucharest, Romania.

Presentations

Using the cloud to scale up hyperparameter optimization for machine learning Session

Hyperparameter optimization for machine learning is a complex task that requires advanced optimization techniques, and it can be implemented as a generic framework decoupled from the specific details of algorithms. Fidan Boylu Uz, Mario Bourgoin, and Gheorghe Iordanescu apply such a framework to tasks like object detection and text matching in a transparent, scalable, and easy-to-manage way in a cloud service.

Shubhankar Jain (he/him) is a machine learning engineer at SurveyMonkey, where he develops and implements machine learning systems for its products and teams. He’s excited to bring his expertise and passion for data and AI systems to the rest of the industry. In his free time, he likes hiking with his dog and accelerating his hearing loss at live music shows.

Presentations

Accelerating your organization: Making data optimal for machine learning Session

Many organizations leverage ML to increase value to customers and understand their business. You may have created models, but now you need to scale. Shubhankar Jain, Aliaksandr Padvitselski, and Manohar Angani use a case study to show you how to pinpoint inefficiencies in your ML data flow, how SurveyMonkey tackled them, and how to make your data more usable to accelerate ML model development.

Ram Janakiraman is a distinguished engineer in the Aruba CTO office, working on machine intelligence for enterprise security. His recent focus has been on simplifying the building of behavior models by leveraging approaches from NLP and representation learning. He hopes to improve end-user product engagement through a visual representation of entity interactions without compromising the privacy of the network entities. Ram has earned numerous patents in a variety of areas over the course of his career.

Ram has worked at various startups and was a cofounding member of Niara, where he worked on security analytics with a focus on threat detection and investigation before the company was acquired by Aruba, an HPE company. He’s an avid scuba diver, always eager to explore the next reef or kelp forest, and an FAA-certified drone pilot who captures the beauty of dive destinations on his trips.

Presentations

Preserving Privacy in Behavioral Analytics using Semantic Learning techniques in NLP Session

Devices discover their way around the network and proxy the intent of the users behind them. Leveraging this information for behavior analytics can raise privacy concerns; selective use of embedding models on a corpus crafted from anonymized data can address them. Ram Janakiraman presents an approach for building representations with behavioral insights that also preserves user privacy.

Dan Jeffries is chief technology evangelist at Pachyderm. He’s also an author, engineer, futurist, and pro blogger who’s given talks all over the world on AI and cryptographic platforms. He’s spent more than two decades in IT as a consultant and at open source pioneer Red Hat.

His articles have held the number one writer’s spot on Medium for artificial intelligence, Bitcoin, cryptocurrency, and economics more than 25 times. His breakout AI tutorial series “Learning AI If You Suck at Math,” along with his explosive pieces on cryptocurrency, “Why Everyone Missed the Most Important Invention of the Last 500 Years” and “Why Everyone Missed the Most Mind-Blowing Feature of Cryptocurrency,” are shared hundreds of times daily all over social media and have been read by more than 5 million people worldwide.

Presentations

When AI Goes Wrong and How to Fix It With Real World AI Auditing Session

With algorithms making more and more decisions in our lives, from who gets hired and fired to who goes to jail, it’s more critical than ever that we make AI auditable and explainable in the real world. Dan Jeffries shows how you can make your AI/ML systems auditable and transparent right now with a few classic IT techniques your team already knows well.

Amit Kapoor is a data storyteller at narrativeViz, where he uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Interested in learning and teaching the craft of telling visual stories with data, Amit also teaches storytelling with data for executive courses as a guest faculty member at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. Previously, he gained more than 12 years of management consulting experience with A.T. Kearney in India, Booz & Company in Europe, and startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi, and a PGDM (MBA) from IIM, Ahmedabad.

Presentations

Deep learning for recommendation systems 2-Day Training

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains and give an end-to-end overview of deep learning-based recommendation and learning-to-rank systems, so you'll understand practical considerations and guidelines for building and deploying recsys.

Deep Learning for Recommendation Systems (Day 2) Training Day 2

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains and give an end-to-end overview of deep learning-based recommendation and learning-to-rank systems, so you'll understand practical considerations and guidelines for building and deploying recsys.

Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Ship it! A practitioner's guide to model management and deployment with Kubeflow. Session

Holden Karau shows you a way to get and keep your models in production with Kubeflow.

Arun Kejariwal is an independent lead engineer. Previously, he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers working on novel techniques for install-and-click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns; his team also built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Before that, he developed and open-sourced techniques for anomaly detection and breakout detection at Twitter. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. You'll get an overview of the inception and growth of the serverless paradigm. They explore Apache Pulsar, which provides native serverless support in the form of Pulsar Functions.

Anurag Khandelwal is a PhD candidate at the RISELab, UC Berkeley, advised by Professor Ion Stoica, and will be joining Yale as an assistant professor in the spring of 2020. His research interests span distributed systems, networking, and algorithms. In particular, his research focuses on addressing core challenges in distributed systems through novel algorithm and data structure design. During his PhD, Anurag built large-scale data-intensive systems such as Succinct and Confluo, that led to deployments in several production clusters. Anurag earned his bachelor’s degree in computer science from the Indian Institute of Technology Kharagpur in 2013.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. You'll get an overview of the inception and growth of the serverless paradigm. They explore Apache Pulsar, which provides native serverless support in the form of Pulsar Functions.

After earning a BS in environmental engineering at Yale, founding an electrochemistry startup, joining a battery startup, and doing crazy things with PostgreSQL at Moat (an ad analytics company), David Kohn joined Timescale to focus on research and development. He also cooks, does pottery, and builds furniture.

Presentations

Simplifying data analytics by creating continuously up-to-date aggregates Session

The sheer volume of time series data from servers, applications, and IoT devices introduces performance challenges, both for inserting data at high rates and for computing aggregates for subsequent analysis. David Kohn demonstrates how systems can continuously maintain up-to-date aggregates, correctly handling even late or out-of-order data, to simplify data analysis.
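The core idea of a continuously maintained aggregate can be sketched in a few lines: keep incremental per-bucket state (sum and count) so that a late or out-of-order event simply updates the bucket it belongs to. This is a hypothetical illustration of the concept, not the session's actual implementation:

```python
from collections import defaultdict

class ContinuousAggregate:
    """Maintain per-time-bucket sums and counts incrementally.
    Late or out-of-order events update whichever bucket they fall into,
    so aggregates stay correct without reprocessing raw data."""

    def __init__(self, bucket_seconds):
        self.bucket_seconds = bucket_seconds
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def insert(self, timestamp, value):
        # Truncate the timestamp to its bucket start
        bucket = timestamp - (timestamp % self.bucket_seconds)
        self.sums[bucket] += value
        self.counts[bucket] += 1

    def average(self, bucket_start):
        n = self.counts[bucket_start]
        return self.sums[bucket_start] / n if n else None

agg = ContinuousAggregate(bucket_seconds=60)
agg.insert(0, 10.0)
agg.insert(65, 5.0)
agg.insert(30, 20.0)  # arrives out of order, still lands in bucket 0
```

Because state is keyed by bucket rather than by arrival order, the out-of-order insert at timestamp 30 is folded into the correct minute's average.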

Ravi Krishnaswamy is the director of software architecture in the AutoCAD Group at Autodesk. He has a passion for technology and has implemented a wide range of solutions for products at Autodesk from analytics and database applications to mobile graphics. His current projects involve analytics solutions on product usage data that leverage graph databases and machine learning techniques on graphs.

Presentations

Collaboration insights through data access graphs Session

Today’s applications interact with data in a distributed and decentralized world. Using graphs at scale, you can infer communities and their interactions by tracking access to common data across users and applications. Ravi Krishnaswamy presents a real-world product example, with millions of users, that uses the combined powers of Spark and graph databases to gain insights into customer workflows.

Akshay Kulkarni is a senior data scientist on the core AI and data science team at Publicis Sapient, where he’s part of strategy and transformation interventions through AI, manages high-priority growth initiatives around data science, and works on various machine learning, deep learning, natural language processing, and artificial intelligence engagements by applying state-of-the-art techniques. He’s a renowned AI and machine learning evangelist, author, and speaker, and was recently recognized as one of the “top 40 under 40 data scientists” in India by Analytics India Magazine. He’s consulted for several Fortune 500 and global enterprises, driving AI and data science-led strategic transformation. Akshay has rich experience building and scaling AI and machine learning businesses and creating significant client impact. He’s actively involved in next-gen AI research and is part of the next-gen AI community. Previously, he was at Gartner and Accenture, where he scaled the AI and data science business. He’s a regular speaker at major data science conferences and recently gave a talk on “Sequence Embeddings for Prediction Using Deep Learning” at GIDS. He’s the author of a book on NLP with Apress and is writing a couple more books with Packt on deep learning and next-gen NLP. Akshay is a visiting faculty member (industry expert) at a few of the top universities in India. In his spare time, he enjoys reading, writing, coding, and helping aspiring data scientists.

Presentations

Attention networks all the way to production using Kubeflow Tutorial

Pramod Singh and Akshay Kulkarni walk you through the in-depth process of building a text summarization model with an attention network using TensorFlow (TF) 2.0. You'll gain the practical hands-on knowledge to build and deploy a scalable text summarization model on top of Kubeflow.

Dinesh Kumar is a product engineer at Gojek. He has experience building high-scale distributed systems and working with event-driven systems and components around Kafka.

Presentations

BEAST: Building an event processing library to handle millions of events Session

Maulik Soneji and Dinesh Kumar explore Gojek's event-processing library, which consumes events from Kafka and pushes them to BigQuery. All of Gojek's services are event sourced, and the company handles a load of 21,000 messages per second on some of its hundreds of topics.

Tianhui Michael Li is the founder and president of the Data Incubator, a data science training and placement firm. Michael bootstrapped the company and navigated it to a successful sale to the Pragmatic Institute. Previously, he headed monetization data science at Foursquare and has worked at Google, Andreessen Horowitz, J.P. Morgan, and D.E. Shaw. He’s a regular contributor to the Wall Street Journal, TechCrunch, Wired, Fast Company, Harvard Business Review, MIT Sloan Management Review, Entrepreneur, VentureBeat, TechTarget, and O’Reilly. Michael was a postdoc at Cornell Tech, earned his PhD at Princeton, and was a Marshall Scholar at Cambridge.

Presentations

Big data for managers (Day 2) Training Day 2

Rich Ott and Michael Li provide a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Nong Li cofounded Okera in 2016 with Amandeep Khurana and serves as the company’s CEO. Previously, he was on the engineering team at Databricks, where he led performance engineering for Spark core and SparkSQL, and was tech lead for the Impala project at Cloudera and the author of the Record Service project. Nong is also one of the original authors of the Apache Parquet project and mentors several Apache projects, including Apache Arrow. Nong has a degree in computer science from Brown University.

Presentations

Data versus metadata: Overcoming the challenges to securing the modern data lake Session

The evolution from storing data in a warehouse to a hybrid infrastructure of on-premises and cloud data lakes has enabled agility and scale. Nong Li looks at the problems between data and metadata, the privacy and security risks associated with them, how to avoid the pitfalls of these challenges, and why companies need to get it right by enforcing security and privacy consistently across all applications.

Penghui Li is a software engineer at Zhaopin and an Apache Pulsar committer.

Presentations

Life beyond pub/sub: How Zhaopin simplifies stream processing using Pulsar Functions and SQL Session

Penghui Li and Jia Zhai walk you through building an event streaming platform based on Apache Pulsar and simplifying a stream processing pipeline with Pulsar Functions, Pulsar Schema, and Pulsar SQL.

Gray Lindsey is a staff scientist at Fermi National Accelerator Laboratory studying Higgs and electroweak physics. He’s focused on developing software and detectors to address the challenge of the high-luminosity upgrade for the Large Hadron Collider and the corresponding upgrade of the Compact Muon Solenoid (CMS) experiment. He’s developed a variety of pattern recognition techniques to demonstrate and help realize new detector systems to efficiently assemble physics data from upgrades to the CMS detector. He also leads the development to make the analysis of those data more efficient and scalable using modern big data technologies.

Presentations

Data engineering at the Large Hadron Collider Session

Building a data engineering pipeline that serves segments of a 200 PB dataset to particle physicists around the globe poses many challenges; some are unique to high energy physics, and some apply to big science projects across disciplines. Ben Galewsky, Gray Lindsey, and Andrew Melo highlight how much of this work can inform industry data science at scale.

Peng Liu is a software engineer on Robinhood’s data platform team.

Presentations

Usability First: the Evolution of Robinhood’s Data Platform Data Case Studies

The data platform at Robinhood has evolved considerably as the scale of its data and the needs of the company have grown. Peng Liu and Grace Lu share the stories behind the evolution of the platform, explain how it aligns with Robinhood's business use cases, and discuss in detail the challenges they encountered and the lessons they learned.

Shondria Lopez-Merlos is a data specialist for the Florida Conference of The United Methodist Church. After making a suggestion in a meeting, Shondria was challenged to learn more about coding and automation. She subsequently taught herself Python and has begun learning HTML/CSS, SQL, and VBA. Shondria is a former O’Reilly scholarship recipient and a member of Women Who Code and Women in STEAM.

Presentations

Starting Simple: How to Use Coding and Automation at Non-Profits and Small Businesses Data Case Studies

Small data teams at small businesses or nonprofits often want to use programming and automation but don’t know where to start. Shondria Lopez-Merlos discusses how to write simple Python programs and incorporate them to streamline workflows and, hopefully, lead to additional, increasingly complex projects.

Grace Lu is a software engineer on Robinhood’s data platform team.

Presentations

Usability First: the Evolution of Robinhood’s Data Platform Data Case Studies

The data platform at Robinhood has evolved considerably as the scale of its data and the needs of the company have grown. Peng Liu and Grace Lu share the stories behind the evolution of the platform, explain how it aligns with Robinhood's business use cases, and discuss in detail the challenges they encountered and the lessons they learned.

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Previously, he was responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural road maps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He’s also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Model governance Tutorial

Machine learning (ML) models are data, which means they require the same data governance considerations as the rest of your data. Boris Lublinsky and Dean Wampler outline metadata management for model serving and explore what information about running systems you need and why it's important. You'll also learn how Apache Atlas can be used for storing and managing this information.

Understanding data governance for machine learning models Session

Production deployment of machine learning (ML) models requires data governance, because models are data. Dean Wampler and Boris Lublinsky justify that claim and explore its implications and techniques for satisfying the requirements. Using motivating examples, you'll explore reproducibility, security, traceability, and auditing, plus some unique characteristics of models in production settings.

Anand Madhavan is the vice president of engineering at Narvar. Previously, he was head of engineering for the Discover product at Snapchat and director of engineering at Twitter, where he worked on building out the ad serving system for Twitter Ads. He earned an MS in computer science from Stanford University.

Presentations

Using Apache Pulsar functions for data workflows at Narvar Session

Narvar originally used a large collection of point technologies, such as AWS Kinesis, Lambda, and Apache Kafka, to satisfy its requirements for pub/sub messaging, message queuing, logging, and processing. Karthik Ramasamy and Anand Madhavan walk you through how Narvar moved away from this slew of technologies and consolidated its use cases on Apache Pulsar.

Suneeta Mall is a senior data scientist at Nearmap, where she leads the engineering efforts of the artificial intelligence division. She led the migration of Nearmap’s backend services to Kubernetes. In her 12 years in the software industry, she’s worked on solving a variety of challenging technical and business problems in big data, machine learning, GIS, travel, DevOps, and telecommunications. She earned her PhD from the University of Sydney and holds a bachelor’s degree in computer science and engineering.

Presentations

Deep learning meets Kubernetes: Running massively parallel inference pipelines efficiently Session

Using Kubernetes as the backbone of its AI infrastructure, Nearmap built a fully automated deep learning inference pipeline that's highly resilient, scalable, and massively parallel. Using this system, Nearmap has run semantic segmentation over tens of quadrillions of pixels. Suneeta Mall walks you through the solution, demonstrating the use of Kubernetes for big data crunching and machine learning at scale.

Katie Malone is director of data science at data science software and services company Civis Analytics, where she leads a team of diverse data scientists who serve as technical and methodological advisors to the Civis consulting team and write the core machine learning and data science software that underpins the Civis Data Science Platform. Previously, she worked at CERN on Higgs boson searches and was the instructor of Udacity’s Introduction to Machine Learning course. Katie hosts Linear Digressions, a weekly podcast on data science and machine learning. She holds a PhD in physics from Stanford.

Presentations

The Care and Feeding of Data Scientists Session

As a discipline, data science is relatively young, but the job of managing data scientists is younger still. Many people undertake this management position without the tools, mentorship, or role models they need to do it well. This session will review key themes from a recent Strata report, which examines the steps necessary to build, manage, sustain, and retain a growing data science team.

Sukanya is a data science practitioner at Capgemini with extensive experience building a variety of IoT solutions. She most enjoys working at the intersection of IoT and data science. She also leads the PyData Mumbai chapter and loves exploring new tech in her free time.

Presentations

Machine Learning on resource-constrained IoT edge devices Session

Heavy ML computation on resource-constrained IoT devices is a challenge. IoT also demands near-zero latency, high bandwidth, continuous and seamless availability, and privacy. The right infrastructure delivers the right ROI, and this is where the edge and the cloud come in. Training ML models in the cloud and running inference at the edge has made many real-world IoT use cases feasible.
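One common way to fit a cloud-trained model onto a constrained edge device, in the spirit of this session's train-in-cloud, infer-at-edge pattern, is shrinking the model's weights before deployment. Below is a minimal, hypothetical sketch of affine 8-bit post-training quantization; the weights and threshold values are invented for illustration, and real edge toolchains do considerably more:

```python
# Minimal affine (scale/zero-point) int8 quantization of float weights.
# Illustrative sketch only; production toolchains handle per-channel
# scales, calibration data, and operator fusion on top of this idea.

def quantize(weights, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / (qmax - qmin) or 1.0  # avoid divide-by-zero
    zero_point = round(qmin - w_min / scale)
    q = [max(qmin, min(qmax, round(w / scale + zero_point))) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

weights = [-0.52, 0.0, 0.31, 1.27]  # toy float weights
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
# Each restored weight is within one quantization step of the original.
assert all(abs(a - b) <= scale for a, b in zip(weights, restored))
```

The int8 array is a quarter the size of float32 weights, which is the kind of saving that makes edge inference on constrained hardware practical.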

Jaya Mathew is a senior data scientist on the artificial intelligence and research team at Microsoft, where she focuses on the deployment of AI and ML solutions to solve real business problems for customers in multiple domains. Previously, she worked on analytics and machine learning at Nokia and Hewlett Packard Enterprise. Jaya holds an undergraduate degree in mathematics and a graduate degree in statistics from the University of Texas at Austin.

Presentations

Machine translation helps break our language barrier Session

With the need to cater to a global audience, there's a growing demand for applications to support speech identification, translation, and transliteration from one language to another. Jaya Susan Mathew explores this topic and how to quickly use some of the readily available APIs to identify, translate, or even transliterate speech or text within your application.

Andrew Melo is a research professor of physics and a big data application developer at Vanderbilt University. He’s spent the last decade developing and implementing large-scale data workflows for the Large Hadron Collider. Recently his focus has been reimplementing these physics workflows with Apache Spark.

Presentations

Data engineering at the Large Hadron Collider Session

Building a data engineering pipeline for serving segments of a 200 PB dataset to particle physicists around the globe poses many challenges, some unique to high energy physics and some that apply to big science projects across disciplines. Ben Galewsky, Gray Lindsey, and Andrew Melo highlight how much of this work can inform industry data science at scale.

Rashmina Menon has been a senior data engineer at GumGum for over three years. She’s passionate about building distributed and scalable systems and has around 10 years of experience in data engineering.

Presentations

Real-time forecasting at scale using Delta Lake Session

GumGum receives 30 billion programmatic inventory impressions amounting to 25 TB of data per day. By generating near-real-time inventory forecasts subject to campaign-specific targeting rules, GumGum enables users to set up successful future campaigns. Rashmina Menon and Jatinder Assi highlight the architecture enabling forecasting in less than 30 seconds with Delta Lake and Databricks Delta caching.

John Mertic is the director of program management for the Linux Foundation. Under his leadership, he’s helped ASWF, ODPi, Open Mainframe Project, and R Consortium accelerate open source innovation and transform industries. John has an open source career spanning two decades, both as a contributor to projects such as SugarCRM and PHP and in open source leadership roles at SugarCRM, OW2, and OpenSocial. With an extensive open source background, he’s a regular speaker at various Linux Foundation and other industry trade shows each year. John’s also an avid writer who has authored two books, The Definitive Guide to SugarCRM: Better Business Applications and Building on SugarCRM, and has published articles on IBM developerWorks, Apple Developer Connection, and PHP Architect.

Presentations

Creating an ecosystem on data governance in the ODPi Egeria project Session

Building on its success at establishing standards in the Apache Hadoop data platform, the ODPi (Linux Foundation) now turns its focus to the next big data challenge—enabling metadata management and governance at scale across the enterprise. Amanda Chessell and John Mertic discuss how the ODPi's guidance on governance (GoG) aims to create an open data governance ecosystem.

Minal Mishra is an engineering manager at Netflix, where he’s part of an effort to improve the software delivery of Netflix’s streaming player. Previously, he was with Xbox Live ecommerce and music and video services teams at Microsoft. Outside work, he enjoys playing tennis.

Presentations

Data powering frequent updates of Netflix's video player Session

Minal Mishra walks you through Netflix's video player release process, the challenges with deriving time series metrics from a firehose of events, and some of the oddities in running analysis on real-time metrics.

Sanjeev Mohan leads big data research for technical professionals at Gartner, where he researches trends and technologies for relational and NoSQL databases, object stores, and cloud databases. His areas of expertise span the end-to-end data pipeline, including ingestion, persistence, integration, transformation, and advanced analytics. Sanjeev is a well-respected speaker on big data and data governance. His research includes machine learning and the IoT. He also serves on a panel of judges for many Hadoop distribution organizations, such as Cloudera and Hortonworks.

Presentations

Data is leaving your door: Essentials for embarking on a hybrid multicloud journey Session

The accelerating migration of workloads to the cloud isn't a binary journey. Some workloads will still be on-premises, and some will be on multiple cloud providers. Sanjeev Mohan identifies key data and analytics considerations in modern data architectures, including strategies to handle data latency, gravity, ingress transformation, compliance and governance needs, and data orchestration.

Keith Moore is the Director of Product Management at SparkCognition, responsible for the development of both its IoT (SparkPredict) and cybersecurity (SparkSecure) product lines. He specializes in applying advanced data science and natural language processing algorithms to complex datasets. Previously, he worked for test and measurement giant National Instruments as an analog-to-digital converter and vibration software product manager, and before that he developed client software solutions for major oil and gas, aerospace, and semiconductor organizations. He previously served as a board member of Pi Kappa Phi fraternity and still serves as a volunteer on its alumni engagement committee. Keith graduated summa cum laude from the University of Tennessee with a degree in mechanical engineering and currently serves as president of the Austin Volunteers Alumni Club.

Presentations

Neuroevolution-based Automated Model Building: How to Create Better Models Session

AutoML brings acceleration and democratization of data science, but in the game of accuracy and flexibility, the use of predefined blueprints to find adequate algorithms falls short. This session discusses a neuroevolutionary approach to AutoML that custom-builds novel, sophisticated neural networks tailored to the relationships in your dataset.

Philipp Moritz is a PhD candidate in the Electrical Engineering and Computer Sciences (EECS) Department at the University of California, Berkeley, with broad interests in artificial intelligence, machine learning, and distributed systems. He’s a member of the Statistical AI Lab and the RISELab.

Presentations

Using Ray to scale Python, data processing, and machine learning Tutorial

There's no easy way to scale up Python applications to the cloud. Ray is an open source framework for parallel and distributed computing, making it easy to program and analyze data at any scale by providing general-purpose high-performance primitives. Robert Nishihara, Ion Stoica, and Philipp Moritz demonstrate how to use Ray to scale up Python applications, data processing, and machine learning.

Barr Moses is a cofounder and CEO of Monte Carlo Data. Previously, she was VP at Gainsight (an enterprise customer data platform) where she helped scale the company 10x in revenue and worked with hundreds of clients on delivering reliable data, a management consultant at Bain & Company, and a research assistant at the Statistics Department at Stanford University. She also served in the Israeli Air Force as a commander of an intelligence data analyst unit. Barr graduated from Stanford with a BSc in mathematical and computational science.

Presentations

Introducing data downtime: From firefighting to winning Session

Ever had your CEO or customer look at your report and say the numbers look way off? Barr Moses defines data downtime: periods of time when your data is partial, erroneous, missing, or otherwise inaccurate. Data downtime is highly costly for organizations, yet it's often addressed ad hoc. You'll learn why data downtime matters to the data industry and how best-in-class teams address it.

Nisha Muktewar is a research engineer at Cloudera Fast Forward Labs, an applied machine intelligence research and advisory group within Cloudera. She works with organizations to help build data science solutions and spends time researching new tools, techniques, and libraries in this space. Previously, she was a manager in Deloitte’s actuarial, advanced analytics, and modeling practice, leading teams in designing, building, and implementing predictive modeling solutions for pricing, consumer behavior, marketing mix, and customer segmentation use cases for insurance, retail, and consumer businesses.

Presentations

Deep learning for anomaly detection Session

In many business use cases, it's frequently desirable to automatically identify and respond to abnormal data. This process can be challenging, especially when working with high-dimensional, multivariate data. Nisha Muktewar and Victor Dibia explore deep learning approaches (sequence models, VAEs, GANs) for anomaly detection, performance benchmarks, and product possibilities.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Building a cloud data lake: Ingesting, processing and analyzing big data on AWS Session

Data lakes are hot again. With S3 as the data lake storage, the modern data lake architecture separates compute from storage. Companies can choose a variety of elastic, scalable, and cost-efficient technologies when designing a cloud data lake. Tomer Shiran and Jacques Nadeau share best practices for building a data lake on AWS, as well as various services and open source building blocks.

Moin Nadeem is an undergraduate at MIT, where he studies computer science with a minor in negotiations. His research broadly studies applications of natural language processing. Most recently, he performed an extensive study on bias in language models, culminating in the release of the world's largest dataset on bias in NLP. He has publications in recommender systems and automated fact checking.

Previously, he cofounded the Machine Intelligence Community at MIT, which aims to democratize machine learning among undergraduates on campus, and he has received the Best Undergraduate Paper award at MIT.

Presentations

How Biased Is Your Natural Language Model? Assessing Fairness in NLP Session

The real world is highly biased, but we still train AI models on that data. This leads to models that are highly offensive and discriminatory. For instance, models have learned that male engineers are preferable and therefore discriminate when used in hiring. How can we assess the social biases that popular models exhibit, and how can we leverage this assessment to create a fairer model?

Catherine Nelson is a senior data scientist for Concur Labs at SAP Concur, where she explores innovative ways to use machine learning to improve the experience of a business traveller. She’s particularly interested in privacy-preserving ML and applying deep learning to enterprise data. Previously, she was a geophysicist and studied ancient volcanoes and explored for oil in Greenland. Catherine has a PhD in geophysics from Durham University and a master’s of earth sciences from Oxford University.

Presentations

Analyzing and deploying your machine learning model Tutorial

Most deep learning models don’t get analyzed, validated, and deployed. Catherine Nelson and Hannes Hapke explain the necessary steps to release machine learning models for real-world applications. You'll view an example project using the TensorFlow ecosystem, focusing on how to analyze models and deploy them efficiently.

Getting the most out of your AI Projects with Model Feedback Loops Session

Measuring a machine learning model’s performance is key to every successful data science project, and model feedback loops are essential for capturing feedback from users and expanding your model’s training dataset. This talk introduces the concept of model feedback and guides you through a framework for increasing the ROI of your data science project.

Alexander Ng is the director of infrastructure and DevOps at Manifold. Previously, he was an engineer and technical lead doing DevOps at Kyruus and did engineering work for the Navy. He holds a BS in electrical engineering from Boston University.

Presentations

Efficient ML engineering: Tools and best practices Tutorial

Today, ML engineers are working at the intersection of data science and software engineering—that is, MLOps. Sourav Dey and Alex Ng highlight the six steps of the Lean AI process and explain how it helps ML engineers work as an integrated part of development and production teams. You'll go hands-on using real-world data so you can get up and running seamlessly.

Dave Nielsen is the head of community and ecosystem programs at Redis Labs and the cofounder of CloudCamp, a series of unconferences about cloud computing. Over his 19-year career, he’s been a web developer, systems architect, technical trainer, developer evangelist, and startup entrepreneur. Dave resides in Mountain View with his wife, Erika, to whom he proposed in his coauthored book PayPal Hacks.

Presentations

Redis plus Spark Structured Streaming: The perfect way to scale out your continuous app Session

Redis Streams enables you to collect data in time series format while matching the data processing rate of your continuous application. Apache Spark’s Structured Streaming API enables real-time decision making for your continuous data. Dave Nielsen demonstrates how to integrate open source Redis with Apache Spark’s Structured Streaming API using the Spark-Redis library.

Robert Nishihara is a fourth-year PhD student working in the University of California, Berkeley, RISELab with Michael Jordan. He works on machine learning, optimization, and artificial intelligence.

Presentations

Using Ray to scale Python, data processing, and machine learning Tutorial

There's no easy way to scale up Python applications to the cloud. Ray is an open source framework for parallel and distributed computing, making it easy to program and analyze data at any scale by providing general-purpose high-performance primitives. Robert Nishihara, Ion Stoica, and Philipp Moritz demonstrate how to use Ray to scale up Python applications, data processing, and machine learning.

Tim Nugent pretends to be a mobile app developer, game designer, tools builder, researcher, and tech author. When he isn’t busy avoiding being found out as a fraud, Tim spends most of his time designing and creating little apps and games he won’t let anyone see. He also spent a disproportionately long time writing his tiny little bio, most of which was taken up trying to stick a witty sci-fi reference in…before he simply gave up. He’s writing Practical Artificial Intelligence with Swift for O’Reilly and building a game for a power transmission company about a naughty quoll. (A quoll is an Australian animal.)

Presentations

Swift for TensorFlow in 3 hours Tutorial

Mars Geldard, Tim Nugent, and Paris Buttfield-Addison are here to prove Swift isn't just for app developers. Swift for TensorFlow provides the power of TensorFlow with all the advantages of Python (and complete access to Python libraries) and Swift—the safe, fast, incredibly capable open source programming language; Swift for TensorFlow is the perfect way to learn deep learning and Swift.

Kalvin Ogbuefi received his Master of Science in applied statistics from CSU Long Beach and Bachelor of Science in applied mathematics from UC Merced. Prior to joining CHLA, he worked as a project assistant at the USC Stevens Neuroimaging and Informatics Institute, Marina del Rey, on radiology image analysis. His extensive research experience comprises projects in deep learning, statistical modeling, and computer simulation at Lawrence Livermore National Laboratory and other major research institutions.

Presentations

Semi-Supervised AI Approach for Automated Categorization of Medical Images Session

Annotating radiological images by category at scale is a critical step for further analytical ML. However, supervised learning is challenging because image metadata does not reliably identify image content and manual labeling of enough images for AI algorithms is not feasible. Here we present a semi-supervised approach for automated categorization of radiological images based on content category.
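The semi-supervised idea in this abstract can be illustrated generically with pseudo-labeling: train on the few available labels, then fold in unlabeled examples the model is confident about. Here's a toy, stdlib-only sketch using a 1-D nearest-centroid classifier; the data, class names, and confidence threshold are invented for illustration and are not the session's actual radiology pipeline:

```python
# Toy pseudo-labeling loop: a nearest-centroid classifier over 1-D
# "image features". Confident predictions on unlabeled data are
# promoted to labels; ambiguous examples stay unlabeled.

def centroids(labeled):
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}

def predict(cents, x):
    # Return (class, margin): margin is the gap between the nearest
    # and second-nearest centroid, a crude confidence measure.
    dists = sorted((abs(x - c), y) for y, c in cents.items())
    return dists[0][1], dists[1][0] - dists[0][0]

labeled = [(0.1, "xray"), (0.2, "xray"), (0.9, "mri"), (1.0, "mri")]
unlabeled = [0.15, 0.95, 0.55, 0.05]

for _ in range(3):  # a few self-training rounds
    cents = centroids(labeled)
    still_unlabeled = []
    for x in unlabeled:
        y, margin = predict(cents, x)
        if margin > 0.3:            # confidence threshold (invented)
            labeled.append((x, y))  # promote to pseudo-label
        else:
            still_unlabeled.append(x)  # too ambiguous; keep unlabeled
    unlabeled = still_unlabeled

print(predict(centroids(labeled), 0.12)[0])  # prints xray
```

Note that the genuinely ambiguous point (0.55) never gets promoted, which is the behavior that keeps pseudo-labeling from polluting the training set.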

Patryk grew up in Poland where he completed his undergrad education in Communication and Computer Engineering at Warsaw University of Technology. Straight out of college, he was picked up by CERN in Geneva, Switzerland, where he wrote test software for the world’s biggest particle accelerator.

After his work at CERN, he started graduate school at EPFL, Switzerland, where he obtained his MSc degree in information technologies. He wrote his thesis in collaboration with Virgin Hyperloop One, an American company building a new mode of transportation.

Currently, he continues to build software for Hyperloop as a Data Engineer.

Presentations

Flexible and fast Simulation Analytics in a growing Company: a Hyperloop Case Study Data Case Studies

To substantiate the key business and safety propositions necessary to establish a new mode of transportation, Virgin Hyperloop One (VHO) has implemented a complex, large-scale, and highly configurable simulation. Each simulation run needs to be analyzed and assessed on several KPIs. This session highlights how we successfully reduced the time to insight of our analyses from days to hours.

ML architecture using newest tools: Predicting near-future passenger demand for Hyperloop Session

Patryk Oleniuk and Sandhya Raghavan investigate how to use demand data to improve on the design of the fifth mode of transport—Hyperloop. They discuss the passenger demand prediction methods and the tech stack (Spark, Koalas, Keras, MLflow) used to build a deep neural network (DNN)-based near-future demand prediction for simulation purposes.

Richard Ott obtained his PhD in particle physics from the Massachusetts Institute of Technology, followed by postdoctoral research at the University of California, Davis. He then decided to work in industry, taking a role as a data scientist and software engineer at Verizon for two years. When the opportunity to combine his interest in data with his love of teaching arose at The Data Incubator, he joined and has been teaching there ever since.

Presentations

Big data for managers (Day 2) Training Day 2

Rich Ott and Michael Li provide a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Richard Ott is a data scientist in residence at the Data Incubator, where he combines his interest in data with his love of teaching. Previously, he was a data scientist and software engineer at Verizon. Rich holds a PhD in particle physics from the Massachusetts Institute of Technology, which he followed with postdoctoral research at the University of California, Davis.

Presentations

Deep learning with PyTorch 2-Day Training

PyTorch is a machine learning library for Python that allows you to build deep neural networks with great flexibility. Its easy-to-use API and seamless use of GPUs make it a sought-after tool for deep learning. Join in to get the knowledge you need to build deep learning models using real-world datasets and PyTorch with Rich Ott.

Deep learning with PyTorch (Day 2) Training Day 2

PyTorch is a machine learning library for Python that allows you to build deep neural networks with great flexibility. Its easy-to-use API and seamless use of GPUs make it a sought-after tool for deep learning. Join in to get the knowledge you need to build deep learning models using real-world datasets and PyTorch with Rich Ott.

Aliaksandr Padvitselski (he/him) is a machine learning engineer at SurveyMonkey, where he works on building the machine learning platform and helping to integrate machine learning systems to SurveyMonkey’s products. He worked on a variety of projects related to data business and personalization at SurveyMonkey. Previously, he mostly worked in the finance industry contributing to backend services and building a data warehouse for BI systems.

Presentations

Accelerating your organization: Making data optimal for machine learning Session

Every organization leverages ML to increase value to customers and understand their business. You may have created models, but now you need to scale. Shubhankar Jain, Aliaksandr Padvitselski, and Manohar Angani use a case study to teach you how to pinpoint inefficiencies in your ML data flow, how SurveyMonkey tackled this, and how to make your data more usable to accelerate ML model development.

Deepak Pai is a manager of AI machine learning core services at Adobe, where he manages a team of data scientists and engineers developing core ML services. The services are used by various Adobe Sensei Services that are part of Experience Cloud. He holds a master’s and bachelor’s degree in computer science from a leading university in India. He’s published papers in top peer-reviewed conferences and has been granted patents.

Presentations

A graph neural network approach for time evolving fraud networks Data Case Studies

This session covers the development of a fraud detection model using state-of-the-art graph neural networks. The model can be used to detect card testing, trial abuse, seat addition, and more.

A machine learning approach to customer profiling by identifying purchase lifecycle stages Session

Identifying customer stages in a buying cycle enables you to perform personalized targeting depending on the stage. Shankar Venkitachalam, Megahanath Macha Yadagiri, and Deepak Pai identify ML techniques to analyze a customer's clickstream behavior to find the different stages of the buying cycle and quantify the critical click events that help transition a user from one stage to another.

Carlos Pazos is a Product Marketing Manager at SparkCognition responsible for automated model building and natural language processing solutions. He specializes in the real-world implementation of AI-based technologies for the Oil & Gas, Utilities, Aerospace, Finance, and Defense sectors. 
 
Pazos previously worked for National Instruments as an IIoT embedded software and distributed systems product marketing manager. He specialized in real-time systems, heterogeneous computing architectures, industrial communication protocols, and analytics at the edge.

Presentations

Neuroevolution-based Automated Model Building: How to Create Better Models Session

AutoML brings acceleration and democratization of data science, but in the game of accuracy and flexibility, the use of predefined blueprints to find adequate algorithms falls short. This session discusses a neuroevolutionary approach to AutoML that custom-builds novel, sophisticated neural networks tailored to the relationships in your dataset.

Jonathan Peck is developer and technical advocate at Algorithmia. He’s a full-stack developer with two decades of industry experience, now focusing on bringing scalable, discoverable, and secure machine learning microservices to developers across a wide variety of platforms via Algorithmia.com. He’s been a speaker at DeveloperWeek, SeattleJS, the Global Artificial Intelligence Conference, AI NEXTCon, Nordic APIs (keynote), ODSC East and West, API World, the O’Reilly Artificial Intelligence Conference, and the O’Reilly Open Source Software Conference (OSCON).

Presentations

The OS for AI: How serverless computing enables the next gen of machine learning Session

ML has been advancing rapidly, but only a few contributors focus on the infrastructure and scaling challenges that come with it. Jonathan Peck explores why ML is a natural fit for serverless computing, a general architecture for scalable ML, and common issues when implementing on-demand scaling over GPU clusters, providing general solutions and a vision for the future of cloud-based ML.

Nick Pinckernell is a senior research engineer for the applied AI research team at Comcast, where he works on ML platforms for model serving and feature pipelining. He has focused on software development, big data, distributed computing, and research in telecommunications for many years. He’s pursuing his MS in computer science at the University of Illinois at Urbana-Champaign, and when free, he enjoys IoT.

Presentations

Feature engineering pipelines 5 ways with Kafka, Redis, Spark, Dask, Airflow, and more Session

With model serving becoming easier thanks to tools like Kubeflow, the focus is shifting to feature engineering. This session will review five ways to get your raw data into engineered features (and eventually to your model) with open source tools, flexible components, and various architectures.
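Whatever the transport layer (Kafka topics, Redis structures, Spark or Dask jobs, Airflow tasks), the core of a feature engineering stage is composable transforms over records. A framework-free sketch of that idea; the field names and transforms here are invented examples, not from the session:

```python
from functools import reduce

# Each transform takes a record dict and returns an enriched one.
def parse_ts(rec):
    # Derive an "hour of day" feature from an ISO timestamp.
    rec["hour"] = int(rec["timestamp"].split("T")[1][:2])
    return rec

def bucket_duration(rec):
    # Turn a raw duration into a boolean feature.
    rec["long_session"] = rec["duration_s"] > 300
    return rec

def pipeline(*steps):
    # Compose transforms left to right into one callable.
    return lambda rec: reduce(lambda r, step: step(r), steps, rec)

featurize = pipeline(parse_ts, bucket_duration)
raw = {"timestamp": "2020-03-15T09:42:00", "duration_s": 420}
print(featurize(raw))  # adds hour=9 and long_session=True
```

The five architectures the session compares differ mainly in where a function like `featurize` runs and how records reach it, not in the shape of the transforms themselves.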

Arvind Prabhakar is cofounder and CTO of StreamSets, provider of the industry’s first DataOps platform for modern data integration. He’s an Apache Software Foundation member and a PMC member on Flume, Sqoop, Storm, and MetaModel projects. Previously, Arvind held many roles at Cloudera, ranging from software engineer to director of engineering.

Presentations

Deploying DataOps for analytics agility Session

DataOps is the best approach for enterprises to improve business, drive future revenue streams, and build competitive differentiation, which is why so many businesses are rethinking their data strategy. Arvind Prabhakar explains how DataOps solves the problems that come with managing data movement at scale.

Sandhya Raghavan is a senior data engineer at Virgin Hyperloop One, where she helps build the data analytics platform for the organization. She has 13 years of experience working with leading organizations to build scalable data architectures integrating relational and big data technologies, and she has implemented large-scale, distributed machine learning algorithms. Sandhya holds a bachelor’s degree in computer science from Anna University, India. When she's not building data pipelines, you can find her traveling the world with her family or pedaling a bike.

Presentations

Flexible and fast Simulation Analytics in a growing Company: a Hyperloop Case Study Data Case Studies

To substantiate the key business and safety propositions necessary to establish a new mode of transportation, Virgin Hyperloop One (VHO) has implemented a complex, large-scale, and highly configurable simulation. Each simulation run needs to be analyzed and assessed on several KPIs. This session highlights how we successfully reduced the time to insight of our analyses from days to hours.

ML architecture using newest tools: Predicting near-future passenger demand for Hyperloop Session

Patryk Oleniuk and Sandhya Raghavan investigate how to use demand data to improve on the design of the fifth mode of transport—Hyperloop. They discuss the passenger demand prediction methods and the tech stack (Spark, Koalas, Keras, MLflow) used to build a deep neural network (DNN)-based near-future demand prediction for simulation purposes.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); worked briefly on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper. He’s the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin–Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems for each stage of an end-to-end data processing pipeline: messaging, compute, and storage. You'll get an overview of the inception and growth of the serverless paradigm, and they explore Apache Pulsar, which provides native serverless support in the form of Pulsar functions.

Using Apache Pulsar functions for data workflows at Narvar Session

Narvar originally used a large collection of point technologies such as AWS Kinesis, Lambda, and Apache Kafka to satisfy its requirements for pub/sub messaging, message queuing, logging, and processing. Karthik Ramasamy and Anand Madhavan walk you through how Narvar moved away from that slew of technologies and consolidated its use cases on Apache Pulsar.

Anand Rao is a partner in PwC’s Advisory Practice and the innovation lead for the Data and Analytics Group, where he leads the design and deployment of artificial intelligence and other advanced analytical techniques and decision support systems for clients, including natural language processing, text mining, social listening, speech and video analytics, machine learning, deep learning, intelligent agents, and simulation. Anand is also responsible for open source software tools related to Apache Hadoop and packages built on top of Python and R for advanced analytics; research and commercial relationships with academic institutions and startups; and the research, development, and commercialization of innovative AI, big data, and analytics techniques. Previously, Anand was the chief research scientist at the Australian Artificial Intelligence Institute; program director for the Center of Intelligent Decision Systems at the University of Melbourne, Australia; and a student fellow at IBM’s T.J. Watson Research Center. He has held a number of board positions at startups and currently serves as a board member for a not-for-profit industry association. Anand has coedited four books and published over 50 papers in refereed journals and conferences. He was awarded the most influential paper award for the decade in 2007 from Autonomous Agents and Multi-Agent Systems (AAMAS) for his work on intelligent agents. He is a frequent speaker on AI, behavioral economics, autonomous cars and their impact, analytics, and technology topics in academic and trade forums. Anand holds an MSc in computer science from Birla Institute of Technology and Science in India, a PhD in artificial intelligence from the University of Sydney, where he was awarded the university postgraduate research award, and an MBA with distinction from Melbourne Business School.

Presentations

A Practical Guide to Responsible AI: Building robust, secure and safe AI Session

A practitioner’s overview of the risks of AI and a depiction of responsible AI deployment within an organization. How do we ensure the safety, security, standardized testing, and governance of AI systems? How can models be fooled or subverted? We showcase client examples to illustrate how organizations safeguard their AI applications and vendor solutions to mitigate the risks AI may present.

ML Models are not Software: Why organizations need dedicated operations to address the b Session

This session provides enterprise data, data science, and IT leaders with an introduction to the core differences between software and machine learning model lifecycles. We demonstrate how AI’s success can also limit its scale, and we introduce leading practices for establishing AI Ops to overcome those limitations by automating CI/CD, supporting continuous learning, and enabling model safety.

Delip Rao is the vice president of research at the AI Foundation, where he leads speech, language, and vision research efforts for generating and detecting artificial content. Previously, he founded the AI research consulting company Joostware and the Fake News Challenge, an initiative to bring AI researchers across the world together to work on fact checking-related problems, and he was at Google and Twitter. Delip is the author of a recent book on deep learning and natural language processing. His attitude toward production NLP research is shaped by the time he spent at Joostware working for enterprise clients, as the first machine learning researcher on the Twitter antispam team, and as an early researcher at Amazon Alexa.

Presentations

Natural language processing with deep learning 2-Day Training

Delip Rao explores natural language processing (NLP) using a set of machine learning techniques known as deep learning. He walks you through neural network architectures and NLP tasks and teaches you how to apply these architectures for those tasks.

19+ years in the data and technology field, with experience collaborating with business and technology architecture teams and enabling platform capabilities and innovation on enterprise data platforms. Currently manages a team of product owners, data scientists, and BI developers building AI and machine learning products that help solve customer and business problems.

Patents: Co-inventor on a pending patent for the innovative use of machine learning models at Dell Technologies.

Presentations

Data science + Domain Experts = Exponentially better Products Data Case Studies

To deliver best-in-class data science products, solutions must evolve through strong partnerships between data scientists and domain experts. We describe the product lifecycle journey we took as we integrated business expertise with data scientists and technologists, highlighting best practices and pitfalls to avoid when digitally transforming your business through AI and machine learning.

Nancy Rausch is a senior manager at SAS. Nancy has been involved for many years in the design and development of SAS’s data warehouse and data management products, working closely with customers and authoring a number of papers on SAS data management products and best-practice design principles for data management solutions. She holds an MS in computer engineering from Duke University, where she specialized in statistical signal processing, and a BS in electrical engineering from Michigan Technological University. She has recently returned to college and is pursuing an MS in analytics from Capella University.

Presentations

A study of bees: Using AI and art to tell a data story Session

For data to be meaningful, it needs to be presented in a way people can relate to. Nancy Rausch explains how SAS combined AI and art to tell a compelling data story, using streaming data from local beehives to forecast hive health. The company visualized this data in a live-action art sculpture, which helped bring the data to life in a fun and compelling way.

Meghana is a machine learning engineer at SigOpt with a particular focus on novel applications of deep learning across academia and industry. She explores the impact of hyperparameter optimization and other techniques on model performance and evangelizes these practical lessons for the broader machine learning community. Prior to SigOpt, she worked in biotech, employing natural language processing to mine and classify biomedical literature. She holds a BS in bioengineering from UC Berkeley. When she’s not reading papers, developing models and tools, or trying to explain complicated topics, she enjoys doing yoga, traveling, and hunting for the perfect chai latte.

Presentations

Optimized Image Classification on the Cheap Session

We’ll anchor on building an image classifier trained on the Stanford Cars dataset to evaluate fine-tuning and feature extraction and the impact of hyperparameter optimization on these techniques, then tune image transformation parameters to augment the model. Our goal is to answer: How can resource-constrained teams make trade-offs between efficiency and effectiveness using pretrained models?

Sriram Ravindran is a data scientist at Adobe, where he’s building a platform called Fraud AI, a solution designed to meet Adobe’s fraud detection needs. Previously, he was a graduate researcher at the University of California, San Diego, where he worked on applying deep learning to EEG (brain activity) data.

Presentations

A graph neural network approach for time evolving fraud networks Data Case Studies

Developing a fraud detection model using state-of-the-art graph neural networks. This model can be used to detect card testing, trial abuse, seat addition, etc.

Joy Rimchala is a data scientist in Intuit’s Machine Learning Futures Group working on ML problems in limited-label data settings. Joy holds a PhD from MIT, where she spent five years doing biological object tracking experiments and modeling them using Markov decision processes.

Presentations

Explainable AI: Your model is only as good as your explanation Session

Explainable AI (XAI) has gained industry traction, given the importance of explaining ML-assisted decisions in human terms and detecting undesirable ML defects before systems are deployed. Talia Tron and Joy Rimchala delve into XAI techniques, advantages and drawbacks of black box versus glass box models, concept-based diagnostics, and real-world examples using design thinking principles.

Kelley Rivoire is an engineering manager at Stripe, where she leads the data infrastructure group. As an engineer, she built Stripe’s first real-time machine learning evaluation of user risk. Previously, she worked on nanophotonics and 3D imaging as a researcher at HP Labs. She holds a PhD from Stanford.

Presentations

Production ML outside the black box: from repeatable inputs to explainable outputs Session

Tools for training and optimizing models have become more prevalent and easier to use; however, these are insufficient for deploying ML in critical production applications. We’ll discuss how Stripe has approached challenges in developing reliable, accurate, and performant ML applications that affect hundreds of thousands of businesses.

Paige Roberts is an open source relations manager at Vertica, where she promotes understanding of Vertica, MPP data processing, open source, and how the analytics revolution is changing the world. In two decades in the data management industry, she’s worked as an engineer, a trainer, a marketer, a product manager, and a consultant.

Presentations

Architecting production IoT analytics Session

What works in production is the only technology criterion that matters. Companies with successful high-scale production IoT analytics programs like Philips, Anritsu, and OptimalPlus show remarkable similarities. IoT at production scale requires certain technology choices. Paige Roberts drills into the architectures of successful production implementations to identify what works and what doesn’t.

Lisa Joy Rosner is an award-winning and patented executive with over 20 years of experience marketing big data and analytics solutions at both public and start-up technology companies. Currently the CMO at Otonomo, an automotive data services platform, Lisa Joy is driving global development of the company’s marketplace.

Previously, Rosner served as CMO at Neustar, leading a major brand transformation as the company entered the security and marketing data services markets. Prior to that, she launched the social intelligence company NetBase, where she worked with five of the top 10 CPGs as they adopted a new approach to real-time marketing. Additionally, Lisa Joy served as vice president of marketing at MyBuys (sold to Magnetic) and vice president of worldwide marketing at BroadVision Inc. She also held positions at the data warehousing companies Brio (sold to Hyperion) and DecisionPoint (sold to Teradata), and she started her career at Oracle Corporation.

Lisa Joy was named a 2013 “Silicon Valley Woman of Influence” and 2014 B2B Marketer of the Year by the Sage Group and the Wall Street Journal, and she was a Top 100 Women in Marketing honoree by Brand Innovators in 2015. She has been a guest lecturer at the Haas School of Business, the Tuck School of Business, and Stanford University. Lisa Joy has a bachelor’s degree (summa cum laude) in English literature from the University of Maryland. She currently sits on the marketing advisory boards of Mintigo, The Big Flip, Fyber, and PLAE Shoes and on the board of trustees for UC Merced, and she is the mother of four young children.

Presentations

Navigating compliance in the future of the mobility while protecting driver privacy Session

As cars introduce more advanced features, the role of customer privacy and responsible data stewardship has become an important focus for auto manufacturers and drivers. At the Strata Data Conference, Lisa Joy Rosner of the car data services platform Otonomo will discuss the future of connected vehicles, data compliance measures, and the impact of related policies like GDPR and CCPA.

Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join Nikki Rouda to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Building a serverless big data application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join Nikki Rouda to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Rachel Roumeliotis is a strategic content director at O’Reilly, where she leads an editorial team that covers a wide variety of programming topics ranging from full stack to open source in the enterprise to emerging programming languages. Rachel is a programming chair of OSCON and O’Reilly’s Software Architecture Conference. She has been working in technical publishing for 10 years, acquiring content in many areas including mobile programming, UX, computer security, and AI.

Presentations

Tuesday keynotes Keynote

Strata program chairs Rachel Roumeliotis and Alistair Croll welcome you to the first day of keynotes.

Wednesday keynotes Keynote

Strata program chairs Rachel Roumeliotis and Alistair Croll welcome you to the second day of keynotes.

Ebrahim Safavi is a senior data scientist at Juniper, focusing on knowledge discovery from big data using machine learning and large-scale data mining. He developed and implemented several key production components, including the company’s chatbot inference engine and anomaly detection. He won a Microsoft research award for his work on information retrieval and recommendation systems in graph-structured networks. Ebrahim earned a PhD in cognitive learning networks from Stevens Institute of Technology.

Presentations

Scalable and automated pipeline for large-scale neural network training and inference Session

Anomaly detection models are essential to run data-driven businesses intelligently. At Mist Systems, the need for accuracy and the scale of the data impose challenges to build and automate ML pipelines. Ebrahim Safavi and Jisheng Wang explain how recurrent neural networks and novel statistical models allow Mist Systems to build a cloud native solution and automate the anomaly detection workflow.

Guillaume Saint-Jacques is the tech lead of computational social science at LinkedIn. Previously, he was the technical lead of the LinkedIn experimentation science team. He holds a PhD in management research from the MIT Sloan School of Management, a master’s degree in economics from the Paris École Normale Supérieure and the Paris School of Economics, and a master’s degree in entrepreneurship from HEC Paris.

Presentations

Fairness through experimentation at LinkedIn Session

Most companies want to ensure their products and algorithms are fair. Guillaume Saint-Jacques and Meg Garlinghouse share LinkedIn's A/B testing approach to fairness and describe new methods that detect whether an experiment introduces bias or inequality. You'll learn about a scalable implementation on Spark and examples of use cases and impact at LinkedIn.

Mehrnoosh Sameki is a technical program manager at Microsoft, responsible for leading the product efforts on machine learning interpretability within the Azure Machine Learning platform. Previously, she was a data scientist at Rue Gilt Groupe, incorporating data science and machine learning in the retail space to drive revenue and enhance customers’ personalized shopping experiences. She earned her PhD degree in computer science at Boston University.

Presentations

An overview of responsible artificial intelligence Tutorial

Mehrnoosh Sameki and Sarah Bird examine six core principles of responsible AI: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. They focus on transparency, fairness, and privacy, and they cover best practices and state-of-the-art open source toolkits that empower researchers, data scientists, and stakeholders to build trustworthy AI systems.

Aryn Sargent is a data analyst with over six years of experience leading the identification and acceleration of successful solutions for enterprise conversational AI and intelligent virtual assistants. In that time, Aryn has held numerous roles within Verint Next IT, including key positions in product management, product strategy, and data analysis. Today, Aryn leads strategic accounts and clients in identifying and defining IVA understanding and knowledge areas through the use of proprietary AI-powered tools to analyze unstructured conversational data. She’s responsible for clients’ automation strategies and for evaluating, measuring, and growing their success, defining tactical knowledge areas to achieve a long-term vision. When she’s not working with datasets, she’s well known for her green thumb in the garden and her love of dogs, fostering dogs in need until they find a loving forever home.

Presentations

Deploying chatbots and conversational analysis: learn what customers really want to know Data Case Studies

Chatbots are increasingly used in customer service as a first tier of support. Through deep analysis of conversation logs you can learn real user motivations and where company improvements can be made. In this talk, a build or buy comparison is made for deploying self-service bots, motivations and techniques for deep conversational analysis are covered, and real world discoveries are discussed.

Roshan is a product manager who has been involved with artificial intelligence initiatives at DocuSign since their inception. He came to the company through the acquisition of a CLM startup, SpringCM, and worked with product leadership across the organization to formalize an AI vision before beginning to scale out the team. His job for the past six months has been to create a robust, enterprise-grade deep learning platform that enables intelligence and insights across the DocuSign Agreement Cloud. Understandably, many of the use cases center around document understanding and NLP/NLU, but the team has also explored features leveraging CNNs as well as classical machine learning models. One of the major challenges has been working with a bare-metal tech stack while emphasizing the scalability and modularity of DocuSign’s AI services.

Presentations

A Unified CV, OCR & NLP Model Pipeline for Scalable Document Understanding at DocuSign Session

This is a real-world case study of applying state-of-the-art deep learning techniques to a pipeline that combines computer vision, OCR, and natural language processing at DocuSign, the world's largest eSignature provider. We'll also cover how the project delivered on its extreme interpretability, scalability, and compliance requirements.

Danilo Sato is a principal consultant at ThoughtWorks with more than 17 years of experience in many areas of architecture and engineering: software, data, infrastructure, and machine learning. Balancing strategy with execution, Danilo helps clients refine their technology strategy while adopting practices to reduce the time between having an idea, implementing it, and running it in production using the cloud, DevOps, and continuous delivery. He is the author of DevOps in Practice: Reliable and Automated Software Delivery, is a member of ThoughtWorks’ Technology Advisory Board and Office of the CTO, and is an experienced international conference speaker.

Presentations

Continuous delivery for machine learning: Automating the end-to-end lifecycle Tutorial

Danilo Sato leads you through applying continuous delivery (CD) to data science and machine learning (ML). Join in to learn how to make changes to your models while safely integrating and deploying them into production, using testing and automation techniques to release reliably at any time and with high frequency.

Robert Schroll obtained his PhD in physics from the University of Chicago before completing postdocs in Amherst, Massachusetts, and Santiago, Chile. There, he realized that his favorite parts of the job were teaching and analyzing data. He made the switch to data science and has been teaching at the Data Incubator for the past year.

Presentations

Machine learning from scratch in TensorFlow (Day 2) Training Day 2

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Robert Schroll introduces TensorFlow's capabilities in Python, moving from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow with several hands-on applications.

Arpan Shah is the engineering manager of Robinhood’s Data Engineering team.

Presentations

Usability First: the Evolution of Robinhood’s Data Platform Data Case Studies

The data platform at Robinhood has evolved considerably as the scale of our data and the needs of the company have grown. In this talk, we share the stories behind the evolution of our platform, explain how it aligns with our business use cases, and discuss in detail the challenges we encountered and the lessons we learned.

Liqun Shao is a data scientist in the AI Development Acceleration Program at Microsoft. She finished her first rotational project, “Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-Based Platforms,” with a paper published at SoCC 2019, and her second, on Azure Machine Learning text analytics best practices, with a contribution to the public NLP repo. She earned her bachelor’s degree in computer science in China and her doctorate in computer science at the University of Massachusetts. Her research focuses on natural language processing, data mining, and machine learning, especially title generation, summarization, and classification. She joined Microsoft in August 2018.

Presentations

Distributed training in the cloud for production-level NLP models Session

Liqun Shao leads you through material from a new GitHub repository to show how data scientists without NLP knowledge can quickly train, evaluate, and deploy state-of-the-art NLP models. She focuses on two use cases with distributed training on Azure Machine Learning with Horovod: GenSen for sentence similarity and BERT for question answering using Jupyter notebooks for Python.

Mehul Sheth is a senior performance engineer in the performance labs at Druva, where he’s responsible for the performance of the CloudApps product of Druva InSync. He has more than 13 years of experience in development and performance engineering, ensuring the production performance of thousands of applications. Mehul loves to tackle unsolved problems and strives to bring a simple solution to the table rather than trying complex things.

Presentations

Realistic synthetic data at scale: Influenced by, but not production, data Session

Any software product needs to be tested against data, but it's difficult to have a random yet realistic dataset representing production data. Mehul Sheth highlights using production data to generate models; the production data is accessed without exposing it or violating any customer agreements on privacy. The models are then used to generate test data at scale in lower environments.

Tomer Shiran is cofounder and CEO of Dremio, the data lake engine company. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, road map, and new feature development and helped grow the company from 5 employees to over 300 employees and 700 enterprise customers, and he held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of eight US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.

Presentations

Building a cloud data lake: Ingesting, processing and analyzing big data on AWS Session

Data lakes are hot again. With S3 as the data lake storage, the modern data lake architecture separates compute from storage. Companies can choose a variety of elastic, scalable, and cost-efficient technologies when designing a cloud data lake. Tomer Shiran and Jacques Nadeau share best practices for building a data lake on AWS, as well as various services and open source building blocks.

Pramod Singh is a senior machine learning engineer at Walmart Labs. He has extensive hands-on experience in machine learning, deep learning, AI, data engineering, designing algorithms, and application development. He has spent more than 10 years working on multiple data projects at different organizations. He’s the author of three books: Machine Learning with PySpark, Learn PySpark, and Learn TensorFlow 2.0. He’s also a regular speaker at major conferences such as the O’Reilly Strata Data and AI Conferences. Pramod holds a BTech in electrical engineering from BATU and an MBA from Symbiosis University, and he earned a data science certification from IIM Calcutta. He lives in Bangalore with his wife and three-year-old son. In his spare time, he enjoys playing guitar, coding, reading, and watching football.

Presentations

Attention networks all the way to production using Kubeflow Tutorial

Pramod Singh and Akshay Kulkarni walk you through the in-depth process of building a text summarization model with an attention network using TensorFlow (TF) 2.0. You'll gain the practical hands-on knowledge to build and deploy a scalable text summarization model on top of Kubeflow.

Joseph Sirosh is the corporate vice president of the Cloud AI Platform at Microsoft, where he leads the company’s enterprise AI strategy and products such as Azure Machine Learning, Azure Cognitive Services, Azure Search, and Bot Framework. Previously, he was the corporate vice president for Microsoft’s Data Platform; the vice president for Amazon’s Global Inventory Platform, responsible for the science and software behind Amazon’s supply chain and order fulfillment systems, as well as the central Machine Learning Group, which he built and led; and the vice president of research and development at Fair Isaac Corp., where he led R&D projects for DARPA, Homeland Security, and several other government organizations. He’s passionate about machine learning and its applications and has been active in the field since 1990. Joseph holds a PhD in computer science from the University of Texas at Austin and a BTech in computer science and engineering from the Indian Institute of Technology Chennai.

Presentations

Compass uses Amazon to simplify and modernize home search Session

Compass is changing real estate by leveraging its industry-leading software to build search and analytical tools that help real estate professionals find, market, and sell homes. Joseph Sirosh details how Compass leverages AWS services, including Amazon Elasticsearch Service, to deliver a complete, scalable home-search solution.

Divya Sivasankaran is a machine learning scientist at integrate.ai, where she focuses on building out FairML capabilities within its products. Previously, she worked with government organizations (police and healthcare) to build AI capabilities, with good intentions, to bring about positive change. Those experiences also shaped her thinking around the larger ethical implications of AI in the wild and the need for ethical considerations to be brought forward at the design thinking stage (proactive versus reactive).

Presentations

FairML from theory to practice: Lessons drawn from our journey to build a fair product Session

In recent years, there's been a lot of attention on the need for ethical considerations in ML, as well as different ways to address bias in different stages of the ML pipeline. However, there hasn't been a lot of focus on how to bring fairness to ML products. Divya Sivasankaran explores the key challenges (and how to overcome them) in operationalizing fairness and bias in ML products.

Jason “Jay” Smith is a Cloud customer engineer at Google. He spends his day helping enterprises find ways to expand their workload capabilities on Google Cloud. He’s on the Kubeflow go-to-market team and provides code contributions to help people build an ecosystem for their machine learning operations. His passions include big data, ML, and helping organizations find a way to collect, store, and analyze information.

Presentations

Using serverless Spark on Kubernetes for data streaming and analytics Session

Data is a valuable resource, but collecting and analyzing the data can be challenging. Further, the cost of resource allocation often prohibits the speed at which analysis can take place. Jay Smith and Remy Welch break down how serverless architecture can improve the portability and scalability of streaming event-driven Apache Spark jobs and perform ETL tasks using serverless frameworks.

Maulik Soneji is a product engineer at Gojek, where he works with different parts of data pipelines for a hypergrowth startup. Outside of learning about mature data systems, he’s interested in Elasticsearch, Go, and Kubernetes.

Presentations

BEAST: Building an event processing library to handle millions of events Session

Maulik Soneji and Dinesh Kumar explore Gojek's event-processing library, which consumes events from Kafka and pushes them to BigQuery. All of Gojek's services are event sourced, and it sustains a high load of 21K messages per second on a few topics, with hundreds of topics in total.

Ion Stoica is a professor in the Electrical Engineering and Computer Sciences (EECS) Department at the University of California, Berkeley, where he researches cloud computing and networked computer systems. Previously, he worked on dynamic packet state, chord DHT, internet indirection infrastructure (i3), declarative networks, and large-scale systems, including Apache Spark, Apache Mesos, and Alluxio. He’s the cofounder of Databricks—a startup to commercialize Apache Spark—and Conviva—a startup to commercialize technologies for large-scale video distribution. Ion is an ACM fellow and has received numerous awards, including inclusion in the SIGOPS Hall of Fame (2015), the SIGCOMM Test of Time Award (2011), and the ACM doctoral dissertation award (2001).

Presentations

Using Ray to scale Python, data processing, and machine learning Tutorial

There's no easy way to scale up Python applications to the cloud. Ray is an open source framework for parallel and distributed computing, making it easy to program and analyze data at any scale by providing general-purpose high-performance primitives. Robert Nishihara, Ion Stoica, and Philipp Moritz demonstrate how to use Ray to scale up Python applications, data processing, and machine learning.

Dave Stuart is a senior product manager at the US Department of Defense, where he’s leading a large-scale effort to transform the workflows of thousands of enterprise business analysts through Jupyter and Python adoption, making tradecraft more efficient, sharable, and repeatable. Previously, Dave led multiple grassroots technology adoption efforts, developing innovative training methods that tangibly increased the technical proficiency of a large noncoding enterprise workforce.

Presentations

Jupyter as an enterprise DIY analytic platform Session

Dave Stuart takes a look at how the US Intelligence Community (IC) uses Jupyter and Python to harness the subject matter expertise of analysts in a DIY analytic movement. You'll cover the technical and cultural challenges the community encountered in its quest to find success at large scale and the strategies used to mitigate those challenges.

Bargava Subramanian is a cofounder and machine learning engineer of the boutique AI firm Binaize Labs in Bangalore, India. He has 15 years’ experience delivering business analytics and machine learning solutions to B2B companies, and he mentors organizations in their data science journey. He holds a master’s degree from the University of Maryland, College Park. He’s an ardent NBA fan.

Presentations

Deep learning for recommendation systems 2-Day Training

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains and give an end-to-end overview of deep learning-based recommendation and learning-to-rank systems, so you'll understand practical considerations and guidelines for building and deploying recsys.

Deep Learning for Recommendation Systems (Day 2) Training Day 2

Bargava Subramanian and Amit Kapoor provide you with a thorough introduction to the art and science of building recommendation systems and paradigms across domains and give an end-to-end overview of deep learning-based recommendation and learning-to-rank systems, so you'll understand practical considerations and guidelines for building and deploying recsys.

Dev Tagare is an engineering manager at Lyft. He has hands-on experience in building end-to-end data platforms for high-velocity and large data volume use cases. Previously, Dev spent 10 years leading engineering functions for companies including Oracle and Twitter with a focus on areas including open source; big data; low-latency, high-scalability design; data structures; design patterns; and real-time analytics.

Presentations

Reducing data lag from 24+ hours to 5 mins at Lyft scale Session

Mark Grover and Dev Tagare offer you a glimpse at the end-to-end data architecture Lyft uses to reduce data lag appearing in its analytical systems from 24+ hours to under 5 minutes. You'll learn the what and why of tech choices, monitoring, and best practices. They outline the use cases Lyft has enabled, especially in ML model performance and evaluation.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe with Microsoft's Bing Group, and he built and ran distributed teams at Amazon in both Seattle and the UK that helped scale its financial systems. David holds a PhD in computer science and master's degrees in both computer science and business administration.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, and Claudiu Branzan detail the application of the latest advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Model governance: A checklist for getting AI safely to production Session

The industry has about 40 years of experience forming best practices and tools for storing, versioning, collaborating, securing, testing, and building software source code—but only about 4 years doing so for AI models. David Talby catches you up on current best practices and freely available tools so that your team can go beyond experimentation to successfully deploy models.

Wangda Tan is a product management committee (PMC) member of Apache Hadoop and engineering manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-premises use cases of Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He's also led features like resource scheduling, GPU isolation, node labeling, and resource preemption in the Hadoop YARN community. Previously, he worked on the integration of OpenMPI and GraphLab with Hadoop YARN at Pivotal and participated in creating a large-scale machine learning, matrix, and statistics computation program using MapReduce and MPI at Alibaba.

Presentations

It’s 2020 now: Apache Hadoop 3.x state of the union and upgrade guidance Session

It's 2020, and Hadoop is still evolving fast. Wangda Tan and Arpit Agarwal explain the current status of the Apache Hadoop community and the exciting present and future of Hadoop 3.x, covering new features like Hadoop on cloud, GPU support, NameNode federation, Docker, 10x scheduling improvements, and Ozone. They also offer upgrade guidance from 2.x to 3.x.

Cathy Tanimura is the senior director of analytics and data science at Strava. She has a passion for leveraging data in multiple ways: to help people make better decisions, to tell stories about companies and industries, and to develop great product experiences. She previously built and led data teams at several high-growth technology companies including Okta, Zynga, and StubHub.

Presentations

The Power of Visualizing Health, Fitness, and Community Impact Session

Pulling from specific product innovations and applications like relative effort, cumulative stats, Year in Sport, Heatmaps, and Metro, Cathy Tanimura, senior director of analytics at Strava, shares best practices for creating effective data visualizations that help improve the health and fitness of individuals, as well as the well-being of communities.

Fatma Tarlaci is a data science fellow and machine learning engineer at Quansight, where she focuses on creating AI training programs and works closely on scientific computing within the open source community. She received her PhD in the humanities before transitioning into computer science, earning her master's degree at Stanford University. Her work and research specialize in deep learning and data science.

Presentations

Natural language processing with open source Tutorial

Language is at the heart of everything we—humans—do. Natural language processing (NLP) is one of the most challenging tasks of artificial intelligence, mainly due to the difficulty of detecting nuances and common sense reasoning in natural language. Fatma Tarlaci invites you to learn more about NLP and get a complete hands-on implementation of an NLP deep learning model.

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Advanced natural language processing with Spark NLP Tutorial

David Talby, Alex Thomas, and Claudiu Branzan detail the application of the latest advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Sherin is a software engineer at Lyft. In her career spanning 8 years, she has worked on most parts of the tech stack but enjoys the challenges of data science and machine learning the most. Most recently, she has been focused on building products that facilitate advances in artificial intelligence and machine learning through streaming.

She is passionate about getting more people, especially women, interested in the field and shares her work with the community through tech talks and panel discussions. Most recently, she gave talks about machine learning infrastructure and streaming at Beam Summit and at Flink Forward in Berlin.

In her free time, she loves to read and paint. She is also the president of the Russian Hill book club in San Francisco and loves to organize events for her local library.

Presentations

Building a self-service platform for continuous, real-time feature generation Data Case Studies

In the world of ride sharing, decisions such as matching a passenger to the nearest driver, pricing, and ETAs need to be made in real time, which makes it imperative to build the most up-to-date view of the world using data. However, gleaning information from high-volume streaming data is tricky, and solutions are often hard to use. Sherin explains how Lyft has attempted to solve this problem with Flink.

Jameson Toole is the CTO and cofounder of Fritz AI, a company building tools to help developers optimize, deploy, and manage machine learning models on mobile devices. Previously, he built analytics pipelines for Google X’s Project Wing and ran the data science team at Boston technology startup Jana Mobile. He holds undergraduate degrees in physics, economics, and applied mathematics from the University of Michigan and both an MS and PhD in engineering systems from MIT, where he worked on applications of big data and machine learning to urban and transportation planning at the Human Mobility and Networks Lab.

Presentations

Creating smaller, faster, production-worthy mobile machine learning models Session

Getting machine learning models ready for use on-device is a major challenge. Jameson Toole explains optimization, pruning, and compression techniques that keep app sizes small and inference speeds high, and he applies these techniques using mobile machine learning frameworks such as Core ML and TensorFlow Lite.

Talia Tron is a senior data scientist on the ML technologies futures team at Intuit, where she leads the effort on explainable AI. She worked on the security risk and fraud team, where she used ML and AI solutions to detect threats and fraud in Intuit's products. She's the leader of Intuit's innovation catalyst local community, pushing toward customer obsession and design thinking across the Israeli site. Previously, she was a data scientist in Microsoft's advanced threat analytics group (ATA R&D), developed customized elearning tools in the Microsoft Education Group, and cofounded the interdisciplinary psychiatry group, which brings together clinicians, neuroscientists, and data scientists to advance brain-related psychiatric evaluation and treatment. Talia holds a PhD in computational neuroscience from the Hebrew University, where she developed automatic tools for analyzing facial expressions and motor behavior in schizophrenia. She conducted research in collaboration with the Sheba Medical Center Innovation Center, using ML to explore and predict treatment outcomes and develop medical decision support systems.

Presentations

Explainable AI: Your model is only as good as your explanation Session

Explainable AI (XAI) has gained industry traction, given the importance of explaining ML-assisted decisions in human terms and detecting undesirable ML defects before systems are deployed. Talia Tron and Joy Rimchala delve into XAI techniques, advantages and drawbacks of black box versus glass box models, concept-based diagnostics, and real-world examples using design thinking principles.

Teresa Tung is a managing director at Accenture, where she's responsible for taking the best-of-breed next-generation software architecture solutions from industry, startups, and academia and evaluating their impact on Accenture's clients through building experimental prototypes and delivering pioneering pilot engagements. Teresa leads R&D on platform architecture for the internet of things and works on real-time streaming analytics, semantic modeling, data virtualization, and infrastructure automation for Accenture's Applied Intelligence Platform. Teresa is Accenture's most prolific inventor, with 170+ patents and applications. She holds a PhD in electrical engineering and computer science from the University of California, Berkeley.

Presentations

Building the digital twin IoT and unconventional data Session

The digital twin presents a problem of data and models at scale—how to mobilize IT and OT data, AI, and engineering models that work across lines of business and even across partners. Teresa Tung and William Gatehouse share their experience of implementing digital twins use cases that combine IoT, AI models, engineering models, and domain context.

Sandeep Uttamchandani is the hands-on chief data architect and head of data platform engineering at Intuit, where he's leading the cloud transformation of the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep held engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep's experience uniquely combines building enterprise data products and operational expertise in managing petabyte-scale data and analytics platforms in production for IBM's federal and Fortune 100 customers. Sandeep has received several excellence awards. He has over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He blogs on LinkedIn and his personal blog, Wrong Data Fabric. Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.

Presentations

10 lead indicators before data becomes a mess Session

Data quality metrics focus on quantifying whether data is a mess. But you need to identify lead indicators before data becomes a mess. Sandeep U, Giriraj Bagadi, and Sunil Goplani explore developing lead indicators for data quality for Intuit's production data pipelines. You'll learn about the details of lead indicators, optimization tools, and lessons that moved the needle on data quality.

A senior principal test engineer at Dell EMC in Round Rock, TX, with 26+ years of expertise in test engineering, including 7 years in engineering management, test design, and tools and automation. Publications: coauthored an article in CIOReview (November 2018). Patents: co-inventor of four pending patents for innovative use of machine learning models at Dell Technologies and co-inventor of US patent 9050529 for innovative hardware design at Microsoft (2012).

Presentations

Data science + domain experts = exponentially better products Data Case Studies

To deliver best-in-class data science products, solutions must evolve through strong partnerships between data scientists and domain experts. We describe the product lifecycle journey we took as we integrated business expertise with data scientists and technologists, highlighting best practices and pitfalls to avoid when digitally transforming your business through AI and machine learning.

Balaji Varadarajan is a senior software engineer at Uber, where he works on the Hudi project and oversees data engineering broadly across the network performance monitoring domain. Previously, he was one of the lead engineers on LinkedIn’s databus change capture system as well as the Espresso NoSQL store. Balaji’s interests lie in distributed data systems.

Presentations

Bringing stream processing to batch data using Apache Hudi (incubating) Session

Batch processing can benefit immensely from adopting some techniques from the streaming processing world. Balaji Varadarajan shares how Apache Hudi (incubating), an open source project created at Uber and currently incubating with the ASF, can bridge this gap and enable more productive, efficient batch data engineering.

Sundar Varadarajan is an industry expert in the field of analytics, machine learning, and AI, having ideated, architected, and implemented innovative AI solutions across multiple industry verticals. Currently, Sundar is a consulting partner at Wipro on AI/ML and plays an advisory role on edge AI and machine learning solutions. Sundar can be reached at sundar.varadarajan@wipro.com.

Presentations

An approach to automate Time and Motion Analysis Session

Time and motion studies of manufacturing operations on a shop floor are traditionally carried out through manual observation, which is time consuming and subject to human error and limitations. Sundar Varadarajan introduces a new approach combining video analytics with time series analysis to automate activity identification and timing measurements.

Paroma Varma is a cofounder at Snorkel and completed a PhD at Stanford, advised by Professor Christopher Ré and affiliated with the DAWN, SAIL, and StatML groups, where she was supported by the Stanford Graduate Fellowship and the National Science Foundation Graduate Research Fellowship. Her research interests revolve around weak supervision or using high-level knowledge in the form of noisy labeling sources to efficiently label massive datasets required to train machine learning models.

Presentations

Programmatically building and managing training datasets with Snorkel Tutorial

Paroma Varma teaches you how to build and manage training datasets programmatically with Snorkel, an open source framework developed at the Stanford AI lab, and demonstrates how this can lead to more efficiently building and managing machine learning (ML) models in a range of practical settings.

Shankar Venkitachalam is a data scientist on the experience cloud research and sensei team at Adobe. He holds a master’s degree in computer science from the University of Massachusetts Amherst. He’s passionate about machine learning, probabilistic graphical models, and natural language processing.

Presentations

A machine learning approach to customer profiling by identifying purchase lifecycle stages Session

Identifying customer stages in a buying cycle enables you to perform personalized targeting depending on the stage. Shankar Venkitachalam, Megahanath Macha Yadagiri, and Deepak Pai identify ML techniques to analyze a customer's clickstream behavior to find the different stages of the buying cycle and quantify the critical click events that help transition a user from one stage to another.

Sumeet Vij is a Director in the Strategic Innovation Group at Booz Allen Hamilton. Sumeet drives multiple client engagements, research, product development, and strategic partnerships in the field of AI, machine learning, personalization, recommendation systems, chatbots, digital assistants, RPA, and conversational AI. His leadership in operationalizing AI and applying Deep Learning for NLP and Search has helped clients increase citizen engagement, reduce manual content curation, derive deeper insights and drive down costs.

Presentations

Weak Supervision for Stronger Models: Increasing classification strength using noisy data Session

Weak supervision allows the use of noisy or imprecise sources to provide supervision signals for labeling large amounts of training data. Sumeet Vij explains an approach that combines the Snorkel weak supervision framework with denoising labeling functions, a generative model, and AI-powered search to train classifiers leveraging existing enterprise knowledge, without the need for tens of thousands of hand-labeled examples.

Jorge Villamariona is a senior technical marketing engineer on the product marketing team at Qubole. Over the years, Jorge has acquired extensive experience in relational databases, business intelligence, big data engines, ETL, and CRM systems. He enjoys complex data challenges and helping customers gain greater insight and value from their existing data.

Presentations

Data engineering workshop 2-Day Training

Jorge Villamariona outlines how organizations using a single platform for processing all types of big data workloads are able to manage growth and complexity, react faster to customer needs, and improve collaboration—all at the same time. You'll learn to leverage Apache Spark and Hive to build an end-to-end solution to address business challenges common in retail and ecommerce.

Data engineering workshop (Day 2) Training Day 2

Jorge Villamariona outlines how organizations using a single platform for processing all types of big data workloads are able to manage growth and complexity, react faster to customer needs, and improve collaboration—all at the same time. You'll learn to leverage Apache Spark and Hive to build an end-to-end solution to address business challenges common in retail and ecommerce.

Mario Vinasco has over 15 years of progressive experience in data-driven analytics, with an emphasis on database programming and machine learning creatively applied to ecommerce, advertising, customer acquisition and retention, and marketing investment. Mario specializes in developing and applying leading-edge business analytics to complex business problems using big data and predictive modeling platforms.
Mario holds a master's in engineering economics from Stanford University and currently manages a data science team at Uber Technologies responsible for customer management, retention, and prediction. The team conducts advanced segmentation of customers by propensity to act, churn, and open email, and sets up sophisticated experiments to test and validate hypotheses.
Until recently, Mario worked at Facebook as a data scientist in the consumer marketing group, where he was responsible for improving the effectiveness of Facebook's own consumer-facing campaigns. Key projects included ad-effectiveness measurement of Facebook's brand marketing activities and product campaigns for key product priorities using advanced experimentation techniques.
Prior roles included VP of business intelligence at a digital textbook startup, people analytics manager at Google, and senior ecommerce manager at Symantec.

Presentations

Optimization of digital spend using machine learning in PyTorch Session

Uber spends hundreds of millions of dollars in marketing and constantly optimizes the allocation of these budgets. It deploys complex models, using Python and PyTorch, and borrowing from machine learning (ML) to speed up solvers to optimize marketing investment. Mario Vinasco explains the framework of the marketing spend problem and how it was implemented.

Presentations

ML Models are not Software: Why organizations need dedicated operations to address the b Session

This session provides enterprise data, data science, and IT leaders with an introduction to the core differences between software and machine learning model lifecycles. We demonstrate how AI's success will also limit scale and introduce leading practices for establishing AI Ops to overcome those limitations by automating CI/CD, supporting continuous learning, and enabling model safety.

Kshitij Wadhwa is a software engineer at Rockset, where he works on the platform engineering team. Previously, Kshitij was an engineer at NetApp on the filesystem and protocols team in the Cloud Backup Service. Kshitij holds a master's degree in computer science from North Carolina State University.

Presentations

Building live dashboards on Amazon DynamoDB using Rockset Session

Rockset is a serverless search and analytics engine that enables real-time search and analytics on raw data from Amazon DynamoDB—with full featured SQL. Kshitij Wadhwa and Dhruba Borthakur explore how Rockset takes an entirely new approach to loading, analyzing, and serving data so you can run powerful SQL analytics on data from DynamoDB without ETL.

Kai Waehner is a technology evangelist at Confluent. Kai's areas of expertise include big data analytics, machine learning, deep learning, messaging, integration, microservices, the internet of things, stream processing, and blockchain. He's a regular speaker at international conferences such as JavaOne, O'Reilly Software Architecture, and ApacheCon and has written a number of articles for professional journals. Kai also shares his experiences with new technologies on his blog.

Presentations

Streaming microservice architectures with Apache Kafka and Istio service mesh Session

Apache Kafka has become the de facto standard for microservice architectures, but it also introduces new challenges. Kai Waehner explores the problems of distributed microservices communication and how both Kafka and a service mesh like Istio address them. You'll learn some approaches for combining both to build a reliable and scalable microservice architecture with decoupled and secure microservices.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the Lightbend Fast Data Platform project, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean’s the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago.

Presentations

Model governance Tutorial

Machine learning (ML) models are data, which means they require the same data governance considerations as the rest of your data. Boris Lublinsky and Dean Wampler outline metadata management for model serving and explore what information about running systems you need and why it's important. You'll also learn how Apache Atlas can be used for storing and managing this information.

Understanding data governance for machine learning models Session

Production deployment of machine learning (ML) models requires data governance, because models are data. Dean Wampler and Boris Lublinsky justify that claim and explore its implications and techniques for satisfying the requirements. Using motivating examples, you'll explore reproducibility, security, traceability, and auditing, plus some unique characteristics of models in production settings.

Kelly Wan is a senior data scientist at LinkedIn in Sunnyvale. She's a technology and data science evangelist. Previously, Kelly worked in investment banking in New York City for five years before transforming her career toward data science in Silicon Valley. Kelly obtained her master's degree in computer science from Columbia University and her bachelor's degree from Southeast University in China.

Presentations

LinkedIn end-to-end data product to measure customer happiness Session

Studies show that good customer service accelerates customers' cohesion toward a product, which increases product engagement and revenue spending. Traditionally, customer surveys are used to measure how customers feel about services and products. Kelly Zhiling Wan, Chih Hui Wang, and Lili Zhou examine LinkedIn's innovative data product for measuring customer happiness.

Haopei Wang is a research scientist at DataVisor. Previously, he earned his PhD from the Department of Computer Science and Engineering at Texas A&M University. His research includes big data security and system security.

Presentations

Efficient feature engineering from digital identifiers for online fraud detection Session

Haopei Wang details the design and implementation of a system that automatically extracts fraud-related features from digital identifiers commonly collected by online services. You'll learn an approach to real-time feature computation and creating templates for feature generation. The system has been applied successfully to fraud detection as well as good-user analysis.

Harrison Wang is a backend software engineer for LiveRamp and was responsible for coordinating the cloud migration for the activations team.

Presentations

Truth and reality of a cloud migration for large-scale data processing workflows Session

A migration to a new environment is never easy. You'll learn how LiveRamp tackled migrating its large-scale production workflows from its private data center to the cloud while maintaining high uptime. Harrison Wang examines the high-level steps and decisions involved, lessons learned, and what to realistically expect from a migration.

Chih-Hui “Jason” Wang is a data scientist on the global customer operations (GCO) data science team at LinkedIn, where he uses data to advocate the voices of customers and members. Previously, he was a data scientist at LeanTaaS, where he helped transform healthcare operations through data science. He holds a master's degree in statistics from the University of California, Berkeley.

Presentations

LinkedIn end-to-end data product to measure customer happiness Session

Studies show that good customer service accelerates customers' cohesion toward a product, which increases product engagement and revenue spending. Traditionally, customer surveys are used to measure how customers feel about services and products. Kelly Zhiling Wan, Chih Hui Wang, and Lili Zhou examine LinkedIn's innovative data product for measuring customer happiness.

Jiao (Jennie) Wang is a senior software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She's engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.

Presentations

Real-time recommendation using attention network with Analytics Zoo on Apache Spark Session

Luyang Wang and Jennie Wang explain how to build a real-time menu recommendation system that leverages attention networks using Spark, Analytics Zoo, and MXNet in the cloud. They then demonstrate how to deploy the model and serve real-time recommendations using both cloud and on-device infrastructure in Burger King's production environment.

Jisheng Wang is the head of data science at Mist Systems, where he leads the development of Marvis, the first AI-driven virtual network assistant that automates the visibility, troubleshooting, reporting, and maintenance of enterprise networking. He has 10+ years of experience applying state-of-the-art big data and data science technologies to solve challenging enterprise problems including security, networking, and IoT. Previously, Jisheng was the senior director of data science in the CTO office of Aruba, a Hewlett Packard Enterprise company since its acquisition of Niara in February 2017, where he led the overall innovation and development effort in big data infrastructure and data science and invented the industry's first modular and data-agnostic user and entity behavior analytics (UEBA) solution, which is widely deployed today among global enterprises. Before that, he was a technical lead at Cisco responsible for various security products. Jisheng earned his PhD in electrical engineering from Penn State University. He's a frequent speaker at AI and ML conferences, including O'Reilly Strata AI, Frontier AI, Spark Summit, Hadoop Summit, and Black Hat.

Presentations

Scalable and automated pipeline for large-scale neural network training and inference Session

Anomaly detection models are essential to run data-driven businesses intelligently. At Mist Systems, the need for accuracy and the scale of the data impose challenges to build and automate ML pipelines. Ebrahim Safavi and Jisheng Wang explain how recurrent neural networks and novel statistical models allow Mist Systems to build a cloud native solution and automate the anomaly detection workflow.

Luyang Wang is a senior manager on the Burger King guest intelligence team at Restaurant Brands International, where he works on machine learning and big data analytics. He's engaged in developing distributed machine learning applications and real-time web services for the Burger King brand. Before joining RBI, Luyang worked at the Philips Big Data & AI Lab and Office Depot.

Presentations

Real-time recommendation using attention network with Analytics Zoo on Apache Spark Session

Luyang Wang and Jennie Wang explain how to build a real-time menu recommendation system that leverages attention networks using Spark, Analytics Zoo, and MXNet in the cloud. They then demonstrate how to deploy the model and serve real-time recommendations using both cloud and on-device infrastructure in Burger King's production environment.

Prashant Warier is the CEO of Qure.ai and chief data scientist at Fractal Analytics, with 16 years of experience architecting and developing data science solutions. Prashant founded the AI-powered personalized digital marketing firm Imagna Analytics, which was acquired by Fractal in 2015. Earlier, he worked with SAP and was instrumental in building its data science practice. Currently, he heads Qure.ai, a healthcare business that uses deep learning to automatically interpret X-rays, CT scans, and MRIs. He has a PhD and an MS in operations research from the Georgia Institute of Technology and a BTech from IIT Delhi.
He is passionate about using artificial intelligence for global good and, through Qure.ai, is working toward making healthcare accessible and affordable using the power of machine learning and artificial intelligence.

Presentations

AI at The Point of Care Revolutionizes Diagnostics Session

If AI can automate the interpretation of abnormalities at the point of care for at-risk populations, it can eliminate delays in diagnosis, speed time to treatment, and save lives. Prashant Warier details the technology required for this healthcare revolution to become reality, sharing case studies of machine learning deployed in poverty-stricken areas.

Sophie Watson is a senior data scientist in the emerging technology group at Red Hat, where she applies her data science and statistics skills to solving business problems and informing next-generation infrastructure for intelligent application development. She has a background in mathematics and holds a PhD in Bayesian statistics, in which she developed algorithms to estimate intractable quantities quickly and accurately.

Presentations

What nobody told you about machine learning in the hybrid cloud Session

Cloud native infrastructure like Kubernetes has obvious benefits for machine learning systems, allowing you to scale out experiments, train on specialized hardware, and conduct A/B tests. What isn’t obvious are the challenges that come up on day two. Sophie Watson and William Benton share their experience helping end-users navigate these challenges and make the most of new opportunities.

Dennis Wei is a research staff member with IBM Research AI. He holds a PhD degree in electrical engineering from the Massachusetts Institute of Technology (MIT). His recent research interests center around trustworthy machine learning, including explainability and interpretability, fairness, and causality.

Presentations

Introducing the AI Explainability 360 open source toolkit Tutorial

Dennis Wei teaches you to use and contribute to the new open source Python package AI Explainability 360 directly from its creators. Dennis translates new developments from research labs to data science practitioners in industry. You'll get a first look at the first comprehensive toolkit for explainable AI, including eight diverse and state-of-the-art methods from IBM Research.

Josh Weisberg is a senior director on the 3D and computer vision team at Zillow Group. Previously, he led the AI camera and computational photography team at Microsoft Research, spent several years at Apple, and worked at four early-stage startups. He's written four books on imaging and color. Josh studied digital imaging at the Rochester Institute of Technology and holds a bachelor of science degree from the University of San Francisco.

Presentations

Designing a Virtual Tour Application with Computer Vision and Edge Computing Session

Computer vision and deep learning are enabling new technologies to mimic how the human brain interprets images and create interactive shopping experiences. This progress has major implications for businesses providing customers with the information they need to make a purchase decision. In this session, attendees will learn about implementing computer vision to create rich media experiences.

Remy Welch is a data analytics specialist at Google Cloud. She works with enterprises in San Francisco to develop best practices for collecting and analyzing data. Remy has expertise in the gaming industry, helping companies better handle data ingestion, storage, and analytics.

Presentations

Using serverless Spark on Kubernetes for data streaming and analytics Session

Data is a valuable resource, but collecting and analyzing the data can be challenging. Further, the cost of resource allocation often prohibits the speed at which analysis can take place. Jay Smith and Remy Welch break down how serverless architecture can improve the portability and scalability of streaming event-driven Apache Spark jobs and perform ETL tasks using serverless frameworks.

Seth Wiesman is a senior solutions architect at Ververica, consulting with clients to maximize the benefits of real-time data processing for their business. He supports customers in the areas of application design, system integration, and performance tuning.

Presentations

Apache Flink developer training 2-Day Training

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Apache Flink developer training (Day 2) Training Day 2

David Anderson and Seth Wiesman lead a hands-on introduction to Apache Flink for Java and Scala developers who want to learn to build streaming applications. You'll focus on the core concepts of distributed streaming data flows, event time, and key-partitioned state, while looking at runtime, ecosystem, and use cases with exercises to help you understand how the pieces fit together.

Event-driven applications made easy with Apache Flink Tutorial

David Anderson and Seth Wiesman demonstrate how building and managing scalable, stateful, event-driven applications can be easier and more straightforward than you might expect. You'll go hands-on to implement a ride-sharing application together.

Aaron is the VP of community at OmniSci, responsible for OmniSci's developer, user, and open source communities. He comes to OmniSci with more than two decades of success building ecosystems around some of software's most familiar platforms. Most recently, he ran the global community for Mesosphere, including leading the launch and growth of DC/OS as an open source project. Before that, he led the Java Community Process at Sun Microsystems and ecosystem programs at SAP. Aaron has also served as the founding CEO of two startups in the entertainment space. He holds an MS in computer science and a BS in computer engineering from Case Western Reserve University.

Presentations

Using GPU-acceleration to Interact with Open Street Map at Planet-Scale Data Case Studies

In this talk, we'll explore the explosive growth in the quantity of geospatial data and how it is fueling the need to join geospatial data with traditional data more frequently.

Kathy Winger is a business, corporate, real estate, banking, and data security attorney who represents companies and individuals in commercial and corporate transactions as a solo practitioner in Tucson. She has more than 20 years of experience as an attorney in the private sector practicing corporate, business, banking, regulatory, compliance, real estate, and consumer and commercial lending law. Previously, she served as in-house counsel to a national bank and financial services company. Kathy frequently gives presentations addressing cybersecurity issues for businesses and has spoken about the topic to CFOs, financial executives, lawyers, insurance brokers, business owners, and technology professionals and groups such as Financial Executives and Affiliates of Tucson, National Bank of Arizona Women's Financial Group, Automotive Service Association, and Arizona Technology Council, among others. Nationally, Kathy has spoken about cybersecurity and data breaches from a business lawyer's perspective at, most recently, Cyber Security Atlanta (2018), Data Center World (2019), and the Channel Partners Conference & Expo (2019), among many others. Kathy has written articles on cybersecurity, which have been published in the BNA Broker Dealer Compliance Report and BNA Data Security and Privacy Report (Bloomberg), and she's written and been interviewed for articles on banking and business topics that have been published in the Credit Union Times. She's been interviewed about cybersecurity for business owners for articles published in Inside Tucson Business and the Arizona Daily Star, for a radio spot on KQTH in Tucson, and on the Bill Buckmaster Radio Show (KVOI). Kathy is the executive vice president of the board of directors for the Boy Scouts of America Catalina Council and serves on the advisory board for the National Bank of Arizona Women's Financial Group.
She also serves on the board of directors of the Southern Arizona Children’s Advocacy Center and is a member of the Better Business Bureau of Southern Arizona.

Presentations

Cybersecurity and data breaches from a business lawyer's perspective Session

Kathy Winger breaks down what business owners and technology professionals need to know about potential risks in the cybersecurity arena. You'll learn the current legal and data security issues and practices along with what’s happening on the regulatory front. And she'll help you mitigate the risks you face.

Micah Wylde is a software engineer on the streaming compute team at Lyft, focused on the development of Apache Flink and Apache Beam. Previously, he built data infrastructure for fighting internet fraud at SIFT and real-time bidding infrastructure for ads at Quantcast.

Presentations

How Lyft built a streaming data platform on Kubernetes Session

Lyft processes millions of events per second in real time to compute prices, balance marketplace dynamics, and detect fraud, among many other use cases. Micah Wylde showcases how Lyft uses Kubernetes along with Flink, Beam, and Kafka to enable service engineers and data scientists to easily build real-time data applications.

Shuo is a software engineer on Robinhood's data platform team.

Presentations

Usability First: the Evolution of Robinhood’s Data Platform Data Case Studies

The data platform at Robinhood has evolved considerably as the scale of our data and the needs of the company have grown. In this talk, we share the stories behind the evolution of our platform, explain how it aligns with our business use cases, and discuss in detail the challenges we encountered and the lessons we learned.

Huangming Xie is a senior manager of data science at LinkedIn, where he leads the infrastructure data science team to drive resource intelligence, optimize compute and storage efficiency, and automate capacity forecasting for better scalability, as well as improve site availability for a pleasant member and customer experience. Huangming is an expert at converting data into actionable recommendations that impact strategy and generate direct business impact. Previously, he led initiatives to enable data-driven product decisions at scale and build a great product for more than 600 million LinkedIn members worldwide.

Presentations

Get a CLUE: Optimizing big data compute efficiency Session

Compute efficiency optimization is of critical importance in the big data era, as data science and ML algorithms become increasingly complex and data size increases exponentially over time. Opportunities exist throughout the resource use funnel, which Zhe Zhang and Huangming Xie characterize using a CLUE framework.

Tony Xing is a senior product manager on the AI, data, and infrastructure (AIDI) team within Microsoft’s AI and Research Organization. Previously, he was a senior product manager on the Skype data team within Microsoft’s Application and Service Group, where he worked on products for data ingestion, real-time data analytics, and the data quality platform.

Presentations

Introducing a new anomaly detection algorithm inspired by computer vision and RL Session

Anomaly detection may sound old-fashioned, yet it's super important in many industrial applications. Tony Xing outlines a novel anomaly detection algorithm based on spectral residual (SR) and convolutional neural networks (CNNs) and how this novel method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.

Chendi Xue is a software engineer on the data analytics team at Intel. She has more than five years' experience in big data and cloud system optimization, focusing on storage and network software stack performance analysis and optimization. Her development work includes Spark shuffle optimization, Spark SQL columnar-based execution, compute-side cache implementation, and storage benchmark tool implementation. Previously, she worked on Linux device mapper and iSCSI optimization during her master's degree studies.

Presentations

Accelerating Spark-SQL with AVX-supported vectorization Session

Chendi Xue and Jian Zhang explore how Intel accelerated Spark SQL with AVX-supported vectorization technology. They outline the design and evaluation, including how to enable columnar processing in Spark SQL, how to use Arrow as the intermediate data format, how to leverage AVX-enabled Gandiva for data processing, and performance analysis with system metrics and breakdowns.

Megahanath Macha Yadagiri is a graduate research assistant at Carnegie Mellon University.

Presentations

A machine learning approach to customer profiling by identifying purchase lifecycle stages Session

Identifying customer stages in a buying cycle enables you to perform personalized targeting depending on the stage. Shankar Venkitachalam, Megahanath Macha Yadagiri, and Deepak Pai identify ML techniques to analyze a customer's clickstream behavior to find the different stages of the buying cycle and quantify the critical click events that help transition a user from one stage to another.

Giridhar Yasa is a principal architect at Flipkart. He's a technology leader with a consistent track record of leading teams from concept to successful delivery of complex software products, strong team-building and mentoring skills, and multiple peer-reviewed journal and conference papers and patents. His specialties include distributed systems, scalable software system architecture, storage software, networking and internet protocols, mobile communication protocols, system performance, free and open source software, languages and tools, C, C++, Python, Unix-like operating systems, and Debian.

Presentations

Architectural patterns for business continuity and disaster recovery: Applied to Flipkart Session

Utkarsh B and Giridhar Yasa lead a deep dive into architectural patterns and the solutions Flipkart developed to ensure business continuity for millions of online customers, and how it leveraged technology to avert or mitigate risks from catastrophic failures. Solving for business continuity requires investments in applications, data management, and infrastructure.

Wenming Ye is an AI and ML solutions architect at Amazon Web Services, helping researchers and enterprise customers use cloud-based machine learning services to rapidly scale their innovations. Previously, Wenming gained diverse R&D experience at Microsoft Research, on an SQL engineering team, and at successful startups.

Presentations

Put deep learning to work: A practical introduction using Amazon Web Services 2-Day Training

Machine learning (ML) and deep learning (DL) projects are becoming increasingly common at enterprises and startups alike and have been a key innovation engine for Amazon businesses such as Go, Alexa, and Robotics. Wenming Ye demonstrates a practical next step in DL learning with instructions, demos, and hands-on labs.

Put deep learning to work: A practical introduction using Amazon Web Services (Day 2) Training Day 2

Machine learning (ML) and deep learning (DL) projects are becoming increasingly common at enterprises and startups alike and have been a key innovation engine for Amazon businesses such as Go, Alexa, and Robotics. Wenming Ye demonstrates a practical next step in DL learning with instructions, demos, and hands-on labs.

Jia Zhai is a core software engineer at StreamNative and a PMC member of both Apache BookKeeper and Apache Pulsar, contributing to both projects continually.

Presentations

Life beyond pub/sub: How Zhaopin simplifies stream processing using Pulsar Functions and SQL Session

Penghui Li and Jia Zhai walk you through building an event streaming platform based on Apache Pulsar and simplifying a stream processing pipeline with Pulsar Functions, Pulsar Schema, and Pulsar SQL.

Jian Zhang is a senior software engineer manager at Intel, where he and his team primarily focus on open source storage development and optimizations on Intel platforms and build reference solutions for customers. He has 10 years of experience doing performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph and working with Hadoop distributed file system (HDFS) and benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.

Presentations

Accelerating Spark-SQL with AVX-supported vectorization Session

Chendi Xue and Jian Zhang explore how Intel accelerated Spark SQL with AVX-supported vectorization technology. They outline the design and evaluation, including how to enable columnar processing in Spark SQL, how to use Arrow as the intermediate data format, how to leverage AVX-enabled Gandiva for data processing, and performance analysis with system metrics and breakdowns.

Yong Zhang is a software engineer at StreamNative. He's also a Pulsar and BookKeeper contributor, focusing on Pulsar transactions, storage, and tooling.

Presentations

Transactional event streaming with Apache Pulsar Session

Sijie Guo and Yong Zhang lead a deep dive into the details of Pulsar transactions and how they can be used in Pulsar Functions and other processing engines to achieve transactional event streaming.

Zhe Zhang is a senior manager of core big data infrastructure at LinkedIn, where he leads an excellent engineering team to provide big data services (Hadoop distributed file system (HDFS), YARN, Spark, TensorFlow, and beyond) to power LinkedIn’s business intelligence and relevance applications. Zhe’s an Apache Hadoop PMC member; he led the design and development of HDFS Erasure Coding (HDFS-EC).

Presentations

Get a CLUE: Optimizing big data compute efficiency Session

Compute efficiency optimization is of critical importance in the big data era, as data science and ML algorithms become increasingly complex and data size increases exponentially over time. Opportunities exist throughout the resource use funnel, which Zhe Zhang and Huangming Xie characterize using a CLUE framework.

Alice Zhao is a senior data scientist at Metis, where she teaches 12-week data science bootcamps. Previously, she was the first data scientist at Cars.com, where she supported multiple functions from marketing to technology; cofounded Best Fit Analytics Workshop, a data science education startup where she taught weekend courses to professionals at 1871 in Chicago; was an analyst at Redfin; and was a consultant at Accenture. She blogs about analytics and pop culture on A Dash of Data. Her blog post "How Text Messages Change From Dating to Marriage" made it onto the front page of Reddit, gaining over half a million views in the first week. She's passionate about teaching and mentoring, and loves using data to tell fun and compelling stories. She holds an MS in analytics and a BS in electrical engineering, both from Northwestern University.

Presentations

Introduction to natural language processing in Python Tutorial

Data scientists are known to crunch numbers, but you may run into text data. Alice Zhao walks you through the steps to turn text data into a format that a machine can understand, identifies some of the most popular text analytics techniques, and showcases several natural language processing (NLP) libraries in Python including the natural language toolkit (NLTK), TextBlob, spaCy, and gensim.

Alice Zheng is a senior manager of applied science on the machine learning optimization team on Amazon's advertising platform. She specializes in research and development of machine learning methods, tools, and applications, and is the author of Feature Engineering for Machine Learning. Previously, Alice worked at GraphLab, Dato, and Turi, where she led the machine learning toolkits team and spearheaded user outreach, and was a researcher in the machine learning group at Microsoft Research, Redmond. Alice holds PhD and BA degrees in computer science and a BA in mathematics, all from UC Berkeley.

Presentations

Lessons learned from building large ML systems Session

Alice Zheng shares four lessons from building and operating large-scale, production-grade machine learning systems at Amazon, useful for practitioners and would-be practitioners in the field.

Lili Zhou is a manager on the data science team at LinkedIn. Lili has extensive experience in customer operations, billing and collection, risk management, fraud detection, revenue forecasting, and online gaming. She's passionate about leveraging large-scale data analytics and modeling to drive insights and business value.

Presentations

LinkedIn end-to-end data product to measure customer happiness Session

Studies show that good customer service strengthens customers' attachment to a product, which increases product engagement and revenue. Traditionally, customer surveys are used to measure how customers feel about services and products. Kelly Zhiling Wan, Chih Hui Wang, and Lili Zhou examine LinkedIn's innovative data product for measuring customer happiness.

Contact us

confreg@oreilly.com

For conference registration information and customer service

partners@oreilly.com

For more information on community discounts and trade opportunities with O’Reilly conferences

Become a sponsor

For information on exhibiting or sponsoring a conference

pr@oreilly.com

For media/analyst press inquiries