Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Speakers

Hear from innovative CxOs, talented data practitioners, and senior engineers who are leading the data industry. More speakers will be announced; please check back for updates.

Saif Addin Ellafi is a software developer at John Snow Labs, where he’s the main contributor to Spark NLP. A data scientist, forever student, and an extreme sports and gaming enthusiast, Saif has wide experience in problem solving and quality assurance in the banking and finance industry.

Presentations

Spark NLP: How Roche automates knowledge extraction from pathology and radiology reports Session

Yogesh Pandit, Saif Addin Ellafi, and Vishakha Sharma discuss how Roche applies Spark NLP for healthcare to extract clinical facts from pathology and radiology reports. They then detail the design of the deep learning pipelines used to simplify training, optimization, and inference of such domain-specific models at scale.

Sarah Aerni is a director of data science at Salesforce Einstein, where she works to democratize machine learning by building products that allow customers to deploy predictive apps in a few short clicks. Her team builds automated machine learning pipelines and products like Einstein Prediction Builder. Previously, she led teams in healthcare and life sciences at Pivotal building models for customers. She is a recovering academic and entrepreneur, with a PhD in biomedical informatics from Stanford University and a passion for building diverse teams, education, and exploration.

Presentations

Automated machine learning for Agile data science at scale Session

How does Salesforce make data science an Agile partner to over 100,000 customers? Sarah Aerni shares the nuts and bolts of the platform and details the Agile process behind it. From open source autoML library TransmogrifAI and experimentation to deployment and monitoring, Sarah covers the tools that make it possible for data scientists to rapidly iterate and adopt a truly Agile methodology.

Jaipaul Agonus is a director in the Market Regulation Technology Department at FINRA. Jaipaul is a big data engineering leader with nearly 18 years of IT industry experience, specializing in big data analytics and cloud-based solutions. He’s currently involved in building next-generation big data market analytic platforms with machine learning, advanced visualization, and contextual access across applications.

Presentations

Scaling visualization for big data and analytics in the cloud Session

Jaipaul Agonus and Daniel Monteiro do Carmo Rosa detail big data analytics and visualization practices and tools used by FINRA to support machine learning and other surveillance activities that the Market Regulation Department conducts in the AWS cloud.

Shradha Agrawal is a data scientist at Adobe in San Jose. She holds a master’s degree in computer science with a focus on AI and machine learning from the University of California, San Diego. She is the author of a number of papers and patent applications.

Presentations

Efficient multi-armed bandit with Thompson sampling for applications with delayed feedback Session

Decision making often struggles with the exploration-exploitation dilemma. Multi-armed bandits (MAB) are a popular reinforcement learning solution, but increasing the number of decision criteria leads to an exponential blowup in complexity, and observational delays don’t allow for optimal performance. Shradha Agrawal offers an overview of MABs and explains how to overcome these challenges.

Sridhar Alla is cofounder and CTO at BlueWhale, which brings together the worlds of big data and artificial intelligence to provide comprehensive solutions to meet the business needs of organizations of all sizes. He and his team are cloud and tool agnostic and strive to embed themselves into the workstream to provide strategic and technical assistance. Sridhar is also an avid speaker, author, and coach. He lives in southern New Jersey with his wife and daughter.

Presentations

Anomaly detection using deep learning to measure the quality of large datasets Session

Any business, big or small, depends on analytics, whether the goal is revenue generation, churn reduction, or sales and marketing. No matter the algorithm and techniques used, the result depends on the accuracy and consistency of the data being processed. Sridhar Alla and Syed Nasar share techniques used to evaluate the quality of data and the means to detect anomalies in it.

Josh Alwitt is responsible for leadership and culture at Publicis Sapient, where he cocreates a culture that enables people to thrive at work by living the organization’s shared purpose and values. He is also responsible for executive development, leading a neuroscience-based leadership development program, and coaching the C-suite, senior executives, and client account leaders. Previously, Josh created the Talent Development function at Sapient, which included learning and development, performance management, career development, succession planning, leadership development, and integrated talent systems. Before his organizational development work, Josh was a technology consultant for 20 years, leading large custom software implementations in a variety of client industries.

Presentations

Future of the firm: How are executives preparing now? Session

In this panel session, executives will discuss how their companies are adapting to the workforce, business, and economic trends shaping the future of business.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Creating a data engineering culture at USAA Session

What happens when you have a data science organization but no data engineering organization? Jesse Anderson and Thomas Goolsby explain what happened at USAA without data engineering, how they fixed it, and the results since.

Professional Kafka development 2-Day Training

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.

Professional Kafka development (Day 2) Training Day 2

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.

Zachery Anderson is the chief analytics officer and senior vice president at Electronic Arts (EA), the world’s largest video game company, where he’s responsible for leading consumer insights, UX research, data science, studio analytics, and marketing analytics. His team uses in-game behavioral data, traditional consumer research, lab work, and online advertising data to provoke and inspire EA’s development and marketing teams to think and act “player first.” Previously, Zachery was head of consulting and modeling for the PIN Group at J.D. Power and Associates, corporate economist at Nissan North America, and economist at investment company Fremont Group. Zachery’s work has been highlighted in the Harvard Business Review and MIT Sloan Management Review and has won many awards, including the INFORMS Marketing Science Practice Prize. While at Nissan, he was recognized by the US Federal Reserve for the best industry forecast. He’s a member of the University of California Master of Science in Business Analytics Industry Advisory Board. Zachery holds an undergraduate degree in political science and communications from Southern Illinois University and a graduate degree in economics and political science from UCLA, where he studied game theory with Nobel Prize winner Lloyd Shapley.

Presentations

It’s in the game: A rare look into how EA brought data science into the creative process of game design Keynote

Developing games at EA is where creativity meets AI, analytics, and machine learning, combining an understanding of player motivations with the means to improve the game design process. Zachery Anderson leads a tour of EA’s history combining data with development, taking you through the early days of balancing gameplay to the future of personalized games for everyone.

Purchase, play, and upgrade data for video game players Session

Eric Bradlow and Zachery Anderson discuss the Wharton Customer Analytics Initiative research opportunity process and explain how EA solved some of its business problems by sharing its data with 11 teams of researchers from around the world.

Eva Nahari is director of product management at Cloudera, where she helps drive the future of distributed data processing and machine learning applications through Cloudera’s distribution of Hadoop and expedites the next generation of integrated search engines. Eva has been working with Java virtual machine technologies, SOA, cloud, and other enterprise middleware solutions for the past 15+ years. Previously, she was the developer of JRockit (the world’s fastest JVM) and productized Zing (the world’s only pauseless JVM) at Azul Systems. She also pioneered deterministic garbage collection and was awarded two patents on garbage collection heuristics and algorithms. In addition, she’s managed many technical partnerships, among them Sun, Intel, Dell, and Red Hat, as well as multicomponent software integration projects, including JRockit, Coherence, WebLogic, Zing, RHEL, and Cloudera Search. Eva holds an MSc in artificial intelligence and autonomous systems from the Royal Institute of Technology in Stockholm, Sweden.

Presentations

How to survive future data warehousing challenges with the help of a hybrid cloud Session

Michael Kohs, Eva Andreasson, and Mark Brine explain how Cloudera’s Finance Department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. They also share guidelines for deploying modern data warehousing in a hybrid cloud environment, outlining when you should choose a private cloud service over a public one, the available options, and some dos and don’ts.

June Andrews is a principal data scientist at GE, where she’s building a machine learning platform used for monitoring the health of airplanes and power plants around the world. Previously, she spearheaded the Data Trustworthiness and Signals Program at Pinterest aimed at creating a healthy data ecosystem for machine learning and led efforts at LinkedIn on growth, engagement, and social network analysis to increase economic opportunity for professionals. June holds degrees in applied mathematics, computer science, and electrical engineering from UC Berkeley and Cornell.

Presentations

Critical turbine maintenance: Monitoring and diagnosing planes and power plants in real time Session

GE produces a third of the world's power and 60% of its airplane engines—a critical portion of the world's infrastructure that requires meticulous monitoring of the hundreds of sensors streaming data from each turbine. June Andrews and John Rutherford explain how GE's monitoring and diagnostics teams released the first real-time ML systems used to determine turbine health into production.

From data-driven to data competitive Data Case Studies

Companies have adopted data into their DNA using a variety of methods, including data driven, data enabled, and data informed, but many implementations have fallen short of the promised ROI, the result of a gap between the cost of investing in people and infrastructure and the business value delivered. June Andrews investigates the ROI of using data and shows how to become data competitive.

André Araujo is a principal solutions architect at Cloudera. An experienced consultant with a deep understanding of the Hadoop stack and its components and a methodical and keen troubleshooter who loves making things run faster, André is skilled across the entire Hadoop ecosystem and specializes in building high-performance, secure, robust, and scalable architectures to fit customers’ needs.

Presentations

Hands-on with Cloudera SDX: Setting up your own shared data experience Tutorial

Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, André Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX.

Tim Armstrong is an engineer at Cloudera, where he works on making Apache Impala faster and more robust via improvements to query execution and resource management. He holds a PhD focused on the intersection of high-performance computing and programming language implementation.

Presentations

When SQL users run wild: Resource management features and techniques to tame Apache Impala Session

As the popularity and utilization of Apache Impala deployments increases, clusters often become victims of their own success when demand for resources exceeds the supply. Tim Armstrong dives into the latest resource management features in Impala to maintain high cluster availability and optimal performance and provides examples of how to configure them in your Impala deployment.

Shriya Arora works on the data engineering team for personalization at Netflix, which, among other things, delivers recommendations made for each user. The team is responsible for the data that goes into training and scoring of the various machine learning models that power the Netflix home page. They have been working on moving some of the company’s core datasets from being processed in a once-a-day batch ETL to being processed in near real time using Apache Flink. Previously, she helped build and architect the new generation of item setup at Walmart Labs, moving from batch processing to stream. They used Storm and Kafka to enable a microservices architecture that allows products to be updated in near real time, as opposed to once a day on the legacy framework.

Presentations

Taming large state to join datasets for personalization Session

With so much data being generated in real time, what if we could combine all these high-volume data streams and provide near real-time feedback for model training, improving personalization and recommendations and taking the customer experience to a whole new level? Sonali Sharma and Shriya Arora explain how to do exactly that, using Flink's keyed state.

Matvey Arye is a senior software engineer at TimescaleDB, where he works on performance, scalability, and query power. Mat has been working on data infrastructure in both academia and industry. He attended Stuyvesant and the Cooper Union and holds a PhD from Princeton. In his free time, Mat enjoys theater, travel, hiking, and skiing.

Presentations

Performant time series data management and analytics with Postgres Session

Matvey Arye offers an overview of two newly released features of TimescaleDB—automated adaptation of time-partitioning intervals and continuous aggregations in near real time—and discusses how these capabilities ease time series data management. Along the way, he also shares real-world use cases, including TimescaleDB's use with other technologies such as Kafka.

Kirstin Aschbacher is a data scientist, a licensed clinical psychologist with a specialty in behavioral medicine, an associate professor in cardiology at UCSF, and the data team lead on the Health eHeart (HeH)/Eureka Digital Research Platform. One of her passions is to bridge the worlds of behavior change and data science in order to transform health, and she enjoys finding creative ways to take knowledge from psychology, neuroscience, and biology and apply them to discover new insights in large datasets. At UCSF, she builds active partnerships with companies in the behavior change and lifestyle medicine space. Previously, she was a data scientist at Silicon Valley startup Jawbone, where she helped design, test, and analyze mini-interventions to help users make healthier behavior choices and lose weight. When she’s not at work, she enjoys being a mother to her two children, biking and dancing, and learning to speak Mandarin.

Presentations

Machine learning prediction of blood alcohol content: A digital signature of behavior Session

Some people use digital devices to track their blood alcohol content (BAC). A BAC-tracking app that could anticipate when a person is likely to have a high BAC could offer coaching in a time of need. Kirstin Aschbacher shares a machine learning approach that predicts user BAC levels with good precision based on minimal information, thereby enabling targeted interventions.

Jitender Aswani supports the infrastructure and security data engineering teams at Netflix. His team designs, builds, and deploys scalable big data architecture and solutions to enable business and operations teams to achieve consistent capacity, reliability, and security gains. Jitender is a lifelong student of smart data products and data science solutions that push organizations to make data-inspired decisions and adopt analytics-first approaches.

Presentations

How Netflix measures app performance on 250 million unique devices across 190 countries Session

Netflix has over 125 million members spread across 191 countries. Each day its members interact with its client applications on 250 million+ devices under highly variable network conditions. These interactions result in over 200 billion daily data points. Vivek Pasari dives into the data engineering and architecture that enables application performance measurement at this scale.

Scaling data lineage at Netflix to improve data infrastructure reliability and efficiency Session

Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani, Girish Lingappa, and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency.

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Presentations

Automation of root cause analysis for big data stack applications Session

Alkis Simitsis and Shivnath Babu share an automated technique for root cause analysis (RCA) in big data stack applications that uses deep learning, illustrated with Spark and Impala. The concepts they discuss apply generally across the big data stack.

Kamil Bajda-Pawlikowski is cofounder and CTO of enterprise Presto company Starburst. Previously, Kamil was the chief architect at the Teradata Center for Hadoop in Boston, focusing on the open source SQL engine Presto, and the cofounder and chief software architect of Hadapt, the first SQL-on-Hadoop company (acquired by Teradata). Kamil began his journey with Hadoop and modern MPP SQL architectures about 10 years ago during a doctoral program at Yale University, where he co-invented HadoopDB, the original foundation of Hadapt’s technology. He holds an MS in computer science from Wroclaw University of Technology and both an MS and an MPhil in computer science from Yale University.

Presentations

Presto: Tuning performance of SQL-on-anything analytics Session

Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.

Ambal Balakrishnan is a digital segment lead for IBM Analytics within the Data Organization at IBM. A technology-focused product management/marketing leader with a strong track record of leading strategic growth, she leads digital transformation and growth of the Data Organization segment of the IBM Analytics portfolio by creating and bringing to market digital and SaaS product offerings. Previously, she was worldwide content marketing leader for systems hybrid cloud DevOps, where she managed content marketing strategy, digital content design, and production to drive business results, and worked on worldwide marketing and positioning of Cisco’s cloud and data center switching business. Ambal holds a master’s in computer science from Purdue University and an MBA in marketing, strategy, and entrepreneurship from the Wharton School at the University of Pennsylvania.

Presentations

Extracting stories from your data and telling them visually: It can be done; we'll show you how Data Case Studies

Whether you're a tech or business professional, you must master the art of visual storytelling with data. But first, you have to find the story worth telling that's hidden in your data. Join Ambal Balakrishnan to learn how. As with many things in life, visual storytelling with data takes practice, but that doesn't mean you can't accelerate your learning from others' mistakes and successes.

Satheesh Bandaram is an executive director for big data and data virtualization within the R&D organization at IBM. His key offerings include IBM Db2 Big SQL, IBM BigReplicate, and the newly launched data virtualization technology available for private and public cloud deployments using IBM Cloud Private for Data.

Presentations

IBM and Cloudera: Bringing AI and ML to the governed data lake (sponsored by IBM) Session

Satheesh Bandaram and Saumitra Buragohain detail how IBM and Cloudera are advancing AI and ML for their customers with solutions to build on-premises or cloud-based secure governed data lakes.

Burcu Baran is a senior data scientist at LinkedIn. Burcu is passionate about bringing mathematical solutions to business problems using machine learning techniques. Previously, she worked on predictive modeling at a B2B business intelligence company and was a postdoc in the Mathematics Departments at both Stanford and the University of Michigan. Burcu holds a PhD in number theory.

Presentations

Using the full spectrum of data science to drive business decisions Tutorial

Thanks to the rapid growth in data resources, business leaders now appreciate the importance (and the challenge) of mining information from data. Join in as a group of LinkedIn's data scientists share their experiences successfully leveraging emerging techniques to assist in intelligent decision making.

Kathy Baxter is architect for ethical AI practice at Salesforce, where she develops research-informed best practices to educate employees, customers, and the industry on the development of ethical AI. She partners and collaborates with external AI and ethics experts to continuously evolve policies, practices, and products—working to create a more fair, just, and equitable society. You can read about her research on the Salesforce UX Medium channel. Kathy has 20 years of experience in the tech industry, at companies including Google, eBay, and Oracle. She holds an MS in engineering psychology and a BS in applied psychology from the Georgia Institute of Technology. The second edition of her book, Understanding Your Users, was published in May 2015.

Presentations

AI moonshot: Designing AI that creates the world we want to live in Ethics Summit

Kathy Baxter explains how to use AI to address bias in your organization rather than perpetuate it.

Panel: Solutions Ethics Summit

We've looked at some possible solutions. But a more complete perspective includes what's right and where we're making mistakes—from the reproducibility of social sciences to the regulations of governments. We wrap up our look at possible solutions with a group discussion.

Maxime Beauchemin is a senior software engineer at Lyft, where he develops open source products that reduce friction and help generate insights from data. He’s the creator and a lead maintainer of data pipeline workflow engine Apache Airflow (incubating) and data visualization platform Apache Superset (incubating). A recognized thought leader in the data engineering field, Maxime previously worked on the analytics and experimentation products team at Airbnb; at Facebook, where he focused on computation frameworks powering engagement and growth analytics; at Yahoo, where he did clickstream analytics; and at Ubisoft, where he was a data warehouse architect.

Presentations

Apache Superset: An open source data visualization platform Session

Maxime Beauchemin offers an overview of Apache Superset, discussing the project's open source development dynamics, security, architecture, and underlying technologies as well as the key items on its roadmap.

John Bennett leads the data engineering efforts within Netflix’s cloud infrastructure analytics team with a focus on security. For the past three years, he has built large-scale data processing systems that provide anomaly detection, network visibility, and dependency insights. John has been writing code for almost 20 years. His previous roles include stints at Blizzard and IGN. John is currently developing a template-driven platform that enables engineers to rapidly build streaming and batch ETL pipelines for detection purposes.

Presentations

Building and scaling a security detection platform: A Netflix Original Session

Data has become a foundational pillar for security teams operating in organizations of all shapes and sizes. This new norm has created a need for platforms that enable engineers to harness data for various security purposes. John Bennett and Siamac Mirzaie offer an overview of Netflix's internal platform for quickly deploying data-based detection capabilities in the corporate environment.

Till Bergmann is a senior data scientist at Salesforce Einstein, building platforms to make it easier to integrate machine learning into Salesforce products, with a focus on automating many of the laborious steps in the machine learning pipeline. He holds a PhD in cognitive science from the University of California, Merced, where he studied the collaboration patterns of academics using NLP techniques.

Presentations

How to train your model (and catch label leakage) Session

A common problem in predictive modeling is label leakage. At enterprise companies such as Salesforce, this problem takes on monstrous proportions, as the data is populated by diverse business processes, making it hard to distinguish cause from effect. Till Bergmann explains how Salesforce—which needs to churn out thousands of customer-specific models for any given use case—tackled this problem.

Josh Bersin is an analyst, author, educator, and thought leader focusing on the global talent market and the challenges and trends impacting business workforces around the world. He studies the world of work, HR and leadership practices, and the broad talent technology market. He is often cited as one of the leading HR and workplace industry analysts in the world. He founded Bersin & Associates in 2001 to provide research and advisory services focused on corporate learning. Over the next 10 years, he expanded the company’s coverage to encompass HR, talent management, talent acquisition, and leadership and became a recognized expert in the talent market. He sold the company to Deloitte in 2012, when it became known as Bersin by Deloitte. He continues to serve as a senior advisor to Deloitte, advising large organizations and contributing to major research initiatives. He also sits on the board of UC Berkeley Executive Education. Previously, Josh spent 25 years in product development, product management, marketing, and sales of e-learning and other enterprise technologies.

Josh is frequently featured in talent and business publications such as Forbes, Harvard Business Review, HR Executive, FastCompany, the Wall Street Journal, and CLO Magazine. He is a regular keynote speaker at industry events and a popular blogger with more than 700,000 followers on LinkedIn. He is the author of two books—The Blended Learning Handbook and The Training Measurement Book—along with dozens of studies on corporate HR, learning, and talent technologies. His third book is currently under contract with Harvard Business Publishing. Josh holds a BS in engineering from Cornell University, an MS in engineering from Stanford University, and an MBA from the Haas School of Business at the University of California, Berkeley.

Presentations

Future of the firm: How are executives preparing now? Session

In this panel session, executives will discuss how their companies are adapting to the workforce, business, and economic trends shaping the future of business.

The future of the firm: Starting now Session

Josh Bersin explains how firms are transforming for the digital era, covering the death of the traditional organizational hierarchy, new models of leadership and management, changes in the way people learn and progress, new models of pay, and the importance of trust and transparency as a central business value.

Maneesha Bhalla is director of advanced analytics at Office Depot. A thought leader with 16+ years of experience in data analytics and data science, specifically in customer analytics, Maneesha is passionate about data-driven approaches to turn data into insights and has a proven ability to build and lead highly efficient teams to enable the organization’s strategic priorities. Maneesha holds a certificate from the Executive Management Program in Global Business Management at IIM Calcutta and a bachelor’s degree in chemical engineering from Pune University.

Presentations

User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Session

User-based real-time recommendation systems have become an important topic in ecommerce. Lu Wang, Nicole Kong, Guoqiong Song, and Maneesha Bhalla demonstrate how to build deep learning algorithms using Analytics Zoo with BigDL on Apache Spark and create an end-to-end system to serve real-time product recommendations.

Ron Bodkin is a technical director on the applied artificial intelligence team at Google, where he provides leadership for AI success for customers in Google’s Cloud CTO office. Ron engages deeply with Global F500 enterprises to unlock strategic value with AI, acts as executive sponsor with Google product and engineering to deliver value from AI solutions, and leads strategic initiatives working with customers and partners. Previously, Ron was the founding CEO of Think Big Analytics, a company that provides end-to-end support for enterprise big data, including data science, data engineering, advisory, and managed services and frameworks such as Kylo for enterprise data lakes. When Think Big was acquired by Teradata, Ron led global growth, the development of the Kylo open source data lake framework, and the company’s expansion to architecture consulting; he also created Teradata’s artificial intelligence incubator.

Presentations

Applying deep learning at Google for recommendations Session

Google uses deep learning extensively in new and existing products. Join Ron Bodkin to learn how Google has used deep learning for recommendations at YouTube, in the Play store, and for customers in Google Cloud. You'll explore the role of embeddings, recurrent networks, contextual variables, and wide and deep learning and discover how to do candidate generation and ranking with deep learning.

Dhruba Borthakur is cofounder and CTO at Rockset, a company building software to enable data-powered applications. Dhruba was the founding engineer of the open source RocksDB database at Facebook and one of the founding engineers of the Hadoop File System at Yahoo. Dhruba was also an early contributor to the open source Apache HBase project. Previously, he was a senior engineer at Veritas Software, where he was responsible for the development of VxFS and the Veritas SanPointDirect storage system; was the cofounder of Oreceipt.com, an ecommerce startup based in Sunnyvale; and was a senior engineer at IBM-Transarc Labs, where he contributed to the development of Andrew File System (AFS), a part of IBM’s ecommerce initiative, WebSphere. Dhruba holds an MS in computer science from the University of Wisconsin-Madison and a BS in computer science from BITS Pilani, India. He has 25 issued patents.

Presentations

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics Session

Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called Rockset that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.

Katherine Boyle is a principal at General Catalyst, a bicoastal venture capital firm with $5B under management. She focuses on early-stage investments in frontier technology and highly regulated industries, including aerospace, defense, computational biology, robotics, and autonomous mobility. Before becoming an investor, she was a staff reporter at the Washington Post, where her features and investigations appeared in every section of the paper (except Sports). Katherine holds an MBA from the Stanford Graduate School of Business, where she was a research and teaching assistant to Condoleezza Rice and Amy Zegart for their course and newly released book Political Risk: How Businesses and Organizations Can Anticipate Global Insecurity. She’s a graduate of Georgetown University and holds a master’s degree in public advocacy from the National University of Ireland, Galway, where she was a George J. Mitchell Scholar.

Presentations

VC dimension: How and why investors fund AI startups Session

What does it mean to be an AI investor? How is this approach different from traditional venture capital? Ash Fontana and Katherine Boyle share their perspectives on investments in machine intelligence and data science.

Catherine Bracy is a civic technologist and community organizer whose work focuses on the intersection of technology and political and economic inequality. She is the cofounder and executive director of the TechEquity Collaborative, an organization in Oakland, CA, that seeks to build a tech-driven economy in the Bay Area that works for everyone. Previously, she was senior director of partnerships and ecosystem at Code for America, where she grew the Brigade program into a network of over 50,000 civic tech volunteers in 80+ cities across the US. She also founded Code for All, the global network of Code for America-style organizations with partners on six continents. Catherine built Code for America’s civic engagement focus area, creating a framework and best practices for local governments to increase public participation, which has been adopted in cities across the US. During the 2012 election cycle, she was director of Obama for America’s Technology Field Office in San Francisco, the first of its kind in American political history, where she was responsible for organizing technologists to volunteer their skills for the campaign’s technology and digital efforts. Prior to joining the Obama campaign, she ran the Knight Foundation’s 2011 News Challenge and was the administrative director at Harvard’s Berkman Center for Internet & Society. She is on the board of directors at the Data & Society Research Institute and the Public Laboratory.

Presentations

The conscience of a company Session

Tim O'Reilly will be joined by Janet Haven, executive director of Data & Society, and Catherine Bracy, director of the TechEquity Collaborative, to discuss ways in which tech employees are flexing their muscles as the conscience of their companies.

Eric T. Bradlow is chairperson of the Marketing Department, the K.P. Chao Professor, a professor of marketing, statistics, economics, and education, and faculty director of the Wharton Customer Analytics Initiative at the Wharton School of the University of Pennsylvania. His academic research interests include Bayesian modeling, statistical computing, and developing new methodology for unique data structures with application to business problems, education, and psychometrics and health outcomes. Previously, Eric was editor in chief of Marketing Science, the premier academic journal in marketing. He was recently named one of eight inaugural University of Pennsylvania Fellows, a fellow of the American Statistical Association, a fellow of the American Education Research Association, a fellow of the Wharton Risk Center, and a senior fellow of the Leonard Davis Institute for Health Economics. He is a past chair of the American Statistical Association Section on Statistics in Marketing and is a statistical fellow of Bell Labs. He was previously named DuPont Corporation’s best young researcher and has won research awards in marketing, statistics, psychology, education, and medicine. Eric holds a BS in economics from the Wharton School and an AM in mathematical statistics and a PhD in mathematical statistics from Harvard University. His personal interests include his wife, Laura, and his sons, Ethan, Zach, and Ben. He also loves sports and movies.

Presentations

Purchase, play, and upgrade data for video game players Session

Eric Bradlow and Zachery Anderson discuss the Wharton Customer Analytics Initiative research opportunity process and explain how EA solved some of its business problems by sharing its data with 11 teams of researchers from around the world.

Claudiu Branzan is an analytics senior manager in Accenture’s Applied Intelligence Group, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies utilizing big data and advanced analytics to offer solutions for clients in the healthcare, high-tech, telecom, and payments verticals.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

David Talby, Alex Thomas, and Claudiu Branzan lead a hands-on introduction to scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Avner Braverman is cofounder and CEO of Binaris, which is building an application-optimized serverless platform. Avner’s full-stack experience ranges from hardware architecture through kernel design up to JavaScript applications. He’s been working with distributed operating systems since his school days. Previously, he cofounded XIV, a distributed storage company, and Parallel Machines, a high-performance analytics company.

Presentations

Serverless for data and AI Session

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.

Mark Brine is the director of finance at Cloudera.

Presentations

How to survive future data warehousing challenges with the help of a hybrid cloud Session

Michael Kohs, Eva Andreasson, and Mark Brine explain how Cloudera’s finance department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. They also share guidelines for deploying modern data warehousing in a hybrid cloud environment, outlining when you should choose a private cloud service over a public one, the available options, and some dos and don’ts.

Kurt Brown leads the data platform team at Netflix, which architects and manages the technical infrastructure underpinning the company’s analytics, including various big data technologies like Hadoop, Spark, and Presto, machine learning infrastructure for Netflix data scientists, and traditional BI tools including Tableau.

Presentations

The journey toward a self-service data platform at Netflix Session

The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. It includes big data tech (Spark and Flink), enabling services (federated metadata management), and machine learning support. But with power comes complexity. Kurt Brown explains how Netflix is working toward an easier, "self-service" data platform without sacrificing any enabling capabilities.

Stuart Buck is the vice president of research at Arnold Ventures, one of the leading funders of research to inform public policy. He has given advice to DARPA, IARPA (the CIA’s research arm), and the White House Social and Behavioral Sciences Team on rigorous research processes. He has sponsored major efforts showing that even the best scientific research is often irreproducible; this work has been featured in Wired, the Economist, the New York Times, and the Atlantic. He has also published in top journals (such as Science and BMJ) on how to make research more accurate. He holds a PhD in education policy from the University of Arkansas, where he studied econometrics, statistics, and program evaluation; a JD with honors from Harvard Law School, where he was an editor of the Harvard Law Review; and bachelor’s and master’s degrees in music performance from the University of Georgia.

Presentations

Panel: Solutions Ethics Summit

We've looked at some possible solutions. But a more complete perspective includes what's right and where we're making mistakes—from the reproducibility of social sciences to the regulations of governments. We wrap up our look at possible solutions with a group discussion.

What the reproducibility problem means for your business Session

Academic research has been plagued by a reproducibility crisis in fields ranging from medicine to psychology. Stuart Buck explains how to take precautions in your data analysis and experiments so as to avoid those reproducibility problems.

Saumitra Buragohain is a vice president of product management for Hortonworks Data Platform (HDP). During his tenure, Saumitra led the launch of HDP 3.0.0 and the first DataPlane application, Data Lifecycle Manager, from concept to business case to release. Saumitra holds an MBA from Santa Clara University and an MSEE from the University of Southern California.

Presentations

IBM and Cloudera: Bringing AI and ML to the governed data lake (sponsored by IBM) Session

Satheesh Bandaram and Saumitra Buragohain detail how IBM and Cloudera are advancing AI and ML for their customers with solutions to build on-premises or cloud-based secure governed data lakes.

Andrew Burt is chief privacy officer and legal engineer at Immuta, the data management platform for the world’s most secure organizations. He is also a visiting fellow at Yale Law School’s Information Society Project. Previously, Andrew was a special advisor for policy to the head of the FBI Cyber Division, where he served as lead author on the FBI’s after-action report on the 2014 attack on Sony. A leading authority on the intersection of machine learning, regulation, and law, Andrew has published articles on technology, history, and law in the New York Times, the Financial Times, Slate, and the Yale Journal of International Affairs. His book, American Hysteria: The Untold Story of Mass Political Extremism in the United States, was called “a must-read book dealing with a topic few want to tackle” by Nobel laureate Archbishop Emeritus Desmond Tutu. Andrew holds a JD from Yale Law School and a BA from McGill University. He is a term member of the Council on Foreign Relations, a member of the Washington, DC, and Virginia State Bars, and a Global Information Assurance Certified (GIAC) cyber incident response handler.

Presentations

Successfully deploy machine learning while managing its risks Tutorial

As ML becomes increasingly important for businesses and data science teams alike, managing its risks is quickly becoming one of the biggest challenges to the technology’s widespread adoption. Join Andrew Burt, Steven Touw, Richard Geering, Joseph Regensburger, and Alfred Rossi for a hands-on overview of how to train, validate, and audit machine learning (ML) models in practice.

Chris Bush is the head of digital analytics and data science at Levi Strauss & Co., where he established and continues to grow the analytics and data science practice. His responsibilities include increasing data centricity, providing analytical insights, and powering personalization and optimization efforts. Previously, Chris founded the product analytics team at Walmart Labs, where he was responsible for providing product analytics, oversaw the implementation of clickstream analytics on Walmart-owned websites and apps, and drove hypothesis-led inquiries to investigate business-critical issues and prioritize opportunities. In a prior career, he practiced law at Latham & Watkins LLP, where he developed bespoke equity derivatives, quantitative structured products, and novel equity products.

Presentations

Building a data science team at Levi’s (sponsored by Dataiku) Session

Building a data science practice in any environment is difficult. Integrating data science into a long-standing company with established processes, complex business operations, and global scale creates additional layers of complexity that need to be navigated. Chris Bush explains how Levi’s is tackling this challenge and shares the company's continuing evolution to leverage data science.

Igor Canadi is a software engineer at Rockset, where he is developing its data indexing and distributed SQL query engine. Previously, Igor was an engineer at Facebook, working on the database engineering and product infrastructure teams, where he contributed to RocksDB, developed MongoRocks and MongoDB with RocksDB storage engine, drove RocksDB open source initiatives, worked on core GraphQL infrastructure for Facebook’s Android application, and owned GraphQL developer tooling for hundreds of developers. Igor holds a master’s degree in computer science from the University of Wisconsin-Madison and a bachelor’s degree from the University of Zagreb. In his free time, he likes sailing and snowboarding.

Presentations

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics Session

Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called Rockset that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.

Streaming services specialist solutions architect at AWS.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Raghu Chakravarthi is the senior vice president at Actian, where he’s responsible for products, R&D, and support services and leads the teams that make sure that business-critical systems can transact and integrate at their very best through the deployment of highly scalable data management technology, underpinned by a relentless and trusted service commitment, both on-premises and in the cloud. He’s passionate about data and uncovering real insights using analytics and is a seasoned product development executive with experience in commercial enterprise software at both large and in startup companies. Previously, Raghu was the general manager and vice president of engineering at Teradata for the Vantage Platform and served in executive roles at Oracle and Hyperion. He’s active in the Silicon Valley tech corridor with speaking engagements, networking events, and hosting meetups.

Presentations

Uncovering the next generation of data architecture for insights at the speed of thought (sponsored by Actian) Session

Raghu Chakravarthi explores key considerations when building an Agile data warehouse and outlines a reference architecture for hybrid data.

James Cham is a partner at Bloomberg Beta, a seed-stage VC firm investing in startups that make work better. Previously, he was a VC at Bessemer Venture Partners and Trinity Ventures. He started his career as a software developer.

Presentations

Automating yourself out of a job? The problem with knowledge work Session

Missing amid conversations about corporate strategy and innovation is a mostly untapped source of new ideas and efficiency—the people actually doing the work. James Cham explains why this is a problem and suggests some possible solutions.

Jian Chang is a senior algorithm expert at the Alibaba Group, where he is working on cutting-edge applications of AI at the intersection of high-performance databases and the IoT, focusing on unleashing the value of spatiotemporal data. A data science expert and software system architect with expertise in machine learning and big data systems and deep domain knowledge on various vertical use cases (finance, telco, healthcare, etc.), Jian has led innovation projects and R&D activities to promote data science best practices within large organizations. He’s a frequent speaker at technology conferences, such as the O’Reilly Strata and AI Conferences, NVIDIA’s GPU Technology Conference, Hadoop Summit, DataWorks Summit, Amazon re:Invent, Global Big Data Conference, Global AI Conference, World IoT Expo, and Intel Partner Summit, and has published and presented research papers and posters at many top-tier conferences and journals, including: ACM Computing Surveys, ACSAC, CEAS, EuroSec, FGCS, HiCoNS, HSCC, IEEE Systems Journal, MASHUPS, PST, SSS, TRUST, and WiVeC. He’s also served as a reviewer for many highly reputable international journals and conferences. Jian holds a PhD from the Department of Computer and Information Science (CIS) at University of Pennsylvania, under Insup Lee.

Presentations

Building the AI engine for retail in the new era Session

Jian Chang and Sanjian Chen outline the design of the AI engine on Alibaba's TSDB service, which enables fast and complex analytics of large-scale retail data. They then share a successful case study of the Fresh Hema Supermarket, a major “new retail” platform operated by Alibaba Group, highlighting solutions to the major technical challenges in data cleaning, storage, and processing.

Haifeng Chen is a senior software architect at Intel’s Asia Pacific R&D Center. He has more than 12 years’ experience in software design and development, big data, and security, with a particular interest in image processing. Haifeng is the author of image browsing, editing, and processing software ColorStorm.

Presentations

Spark adaptive execution: Unleash the power of Spark SQL Session

Spark SQL is widely used, but it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. Haifeng Chen shares a Spark adaptive execution engine built to address these challenges. It can handle task parallelism, join conversion, and data skew dynamically during runtime, guaranteeing the best plan is chosen using runtime statistics.

Jeff Chen is the chief innovation officer at the US Bureau of Economic Analysis, where he’s responsible for integrating advancements in data science and machine learning to advance the bureau’s capabilities. A statistician and data scientist, Jeff has extensive experience launching and leading data science initiatives in over 40 domains, working with diverse stakeholders such as firefighters, climatologists, and technologists to introduce data science and new technologies that advance their missions. Previously, he was the chief scientist at the US Department of Commerce; a White House Presidential Innovation Fellow with NASA and the White House Office of Science and Technology Policy, focused on data science for the environment; the first director of analytics at the NYC Fire Department, where he engineered pioneering algorithms for fire prediction; and one of the first data scientists at the NYC Mayor’s Office under then-Mayor Mike Bloomberg. Jeff started his career as an econometrician at an international engineering consultancy, where he developed forecasting and prediction models supporting large-scale infrastructure investment projects. In the evenings, he’s an adjunct professor of data science at Georgetown University. He holds a bachelor’s degree in economics from Tufts University and a master’s degree in applied statistics from Columbia University.

Presentations

Deploying data science for national economic statistics Session

Jeff Chen shares strategies for overcoming time series challenges at the intersection of macroeconomics and data science, drawing from machine learning research conducted at the Bureau of Economic Analysis aimed at improving its flagship product, the gross domestic product.

Roger Chen is cofounder and CEO of Computable and program chair for the O’Reilly Artificial Intelligence Conference. Previously, he was a principal at O’Reilly AlphaTech Ventures (OATV), where he invested in and worked with early-stage startups primarily in the realm of data, machine learning, and robotics. Roger has a deep and hands-on history with technology. Before startups and venture capital, he was an engineer at Oracle, EMC, and Vicor. He also developed novel nanoscale and quantum optics technology as a PhD researcher at UC Berkeley. Roger holds a BS from Boston University and a PhD from UC Berkeley, both in electrical engineering.

Presentations

Decentralized governance of data

Data remains a linchpin of success for machine learning yet too often is a scarce resource. And even when data is available, trust issues arise about the quality and ethics of collection. Roger Chen explores new models for generating and governing training data for AI applications.

Sanjian Chen is a data scientist at the Alibaba Group. He has deep knowledge of large-scale machine learning algorithms. Over his career, he’s partnered with and advised leaders at several Fortune 500 companies on making data-driven strategic decisions and provided software-based data analytics consulting service to seven global firms across multiple industries, including financial services, automotive, telecommunications, and retail.

Presentations

Building the AI engine for retail in the new era Session

Jian Chang and Sanjian Chen outline the design of the AI engine on Alibaba's TSDB service, which enables fast and complex analytics of large-scale retail data. They then share a successful case study of the Fresh Hema Supermarket, a major “new retail” platform operated by Alibaba Group, highlighting solutions to the major technical challenges in data cleaning, storage, and processing.

Shouwei Chen is an ECE PhD student at Rutgers University, advised by Ivan Rodero. Shouwei’s research focuses on the codesign of a memory-centric computing framework with an in-memory distributed filesystem.

Presentations

Optimizing computing cluster resource utilization with an in-memory distributed filesystem Session

JD.com recently designed a brand-new architecture to optimize Spark computing clusters. Yue Li and Shouwei Chen detail the problems the team faced when building it and explain how the company benefits from the in-memory distributed filesystem now.

Tim Chen is a software engineer at Cloudera leading cloud initiatives for the company’s enterprise machine learning platform. Previously, he was cofounder and CEO of Hyperpilot, a startup focused on applying machine learning to improve the performance and cost efficiency of container clusters and big data workloads; led containerization development and design at Mesosphere; and worked at VMware and Microsoft. He’s an Apache PMC member and committer on Apache Drill and Apache Mesos, helped initiate the Spark-on-Kubernetes project, and led development of Mesos support for Spark.

Presentations

Cloud native machine learning: Emerging trends and the road ahead Session

Data platforms are being asked to support an ever-increasing range of workloads and compute environments, including machine learning and elastic cloud platforms. Tristan Zajonc and Tim Chen discuss emerging capabilities, including running machine learning and Spark workloads on autoscaling container platforms, and share their vision for the road ahead for ML and AI in the cloud.

Chakri Cherukuri is a senior researcher in the Quantitative Financial Research Group at Bloomberg LP in NYC. His research interests include quantitative portfolio management, algorithmic trading strategies, and applied machine learning. He has extensive experience in scientific computing and software development. Previously, he built analytical tools for the trading desks at Goldman Sachs and Lehman Brothers. He holds an undergraduate degree in mechanical engineering from the Indian Institute of Technology (IIT) Madras, India, a master’s degree in computer science from Arizona State University, and an MS in computational finance from Carnegie Mellon University.

Presentations

Applied machine learning in finance Session

Quantitative finance is a rich field in finance where advanced mathematical and statistical techniques are employed by both sell-side and buy-side institutions. Chakri Cherukuri explains how machine learning and deep learning techniques are being used in quantitative finance and details how these models work under the hood.

Alan Chin is a contributor working on Jupyter Enterprise Gateway and an open source advocate at IBM. His previous roles include build and release engineer and test engineer for DB2 on z/OS. In a previous life, he served as a crew chief with the 15th at Hurlburt Field and the 16th ESOS in Afghanistan. Alan holds a BS in computer science from San Jose State University. He resides in San Francisco with his wife and three furry children (two cats and a dog).

Presentations

Scaling Jupyter with Jupyter Enterprise Gateway Session

Alan Chin and Luciano Resende explain how to introduce Jupyter Enterprise Gateway into new and existing notebook environments to enable a "bring your own notebook" model while simultaneously optimizing resources consumed by the notebook kernels running across managed clusters within the enterprise.

Divya Choudhary is a data scientist at GO-JEK. A computer science engineer turned decision scientist turned data scientist, Divya is known for her business understanding, approach to problem solving, machine learning, NLP, and driving data science problems to the final execution. She has four years’ experience unveiling the wonders of data using data science. Previously, she worked closely with the boards of directors of three startups in India and Indonesia. She’s a yoga lover, painter, poetess, and avid trekker and wanderer who’s best at talking to people and learning about them.

Presentations

From an archived data field to GO-JEK’s world-class product feature for customer experience Session

Divya Choudhary explains how GO-JEK uses the chat messages and local-language notes that customers send their drivers while waiting for a ride to carve out unparalleled information about pickup points and their names (which sometimes even Google Maps has no idea of), helping create a world-class customer pickup experience feature.

Rumman Chowdhury is a senior manager and AI lead at Accenture, where she works on cutting-edge applications of artificial intelligence and leads the company’s responsible and ethical AI initiatives. She also serves on the board of directors for three AI startups. Rumman’s passion lies at the intersection of artificial intelligence and humanity. She comes to data science from a quantitative social science background. She has been interviewed by Software Engineering Daily, the PHDivas podcast, German Public Television, and fashion line MM LaFleur. In 2017, she gave talks at the Global Artificial Intelligence Conference, IIA Symposium, ODSC Masterclass, and the Digital Humanities and Digital Journalism conference, among others. Rumman holds two undergraduate degrees from MIT and a master’s degree in quantitative methods of the social sciences from Columbia University. She is near completion of her PhD from the University of California, San Diego.

Presentations

Panel: Solutions Ethics Summit

We've looked at some possible solutions. But a more complete perspective includes what's right and where we're making mistakes—from the reproducibility of social sciences to the regulations of governments. We wrap up our look at possible solutions with a group discussion.

Eric Colson is chief algorithms officer at Stitch Fix, where he leads a team of 80+ data scientists and is responsible for the multitude of algorithms that are pervasive throughout nearly every function of the company, from merchandise, inventory, and marketing to demand forecasting, operations, and the styling recommender system. He’s also an advisor to several big data startups. Previously, Eric was vice president of data science and engineering at Netflix. He holds a BA in economics from SFSU, an MS in information systems from GGU, and an MS in management science and engineering from Stanford.

Presentations

How to make fewer bad decisions Session

A/B testing has revealed the fallibility in human intuition that typically drives business decisions. Eric Colson and Daragh Sibley describe some types of systematic errors domain experts commit, explain how cognitive biases arise from heuristic reasoning processes, and share several mechanisms to mitigate these human limitations and improve decision making.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, he was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow 2-Day Training

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow (Day 2) Training Day 2

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Will Crichton is a PhD student in computer science at Stanford, advised by Pat Hanrahan. He creates systems that merge research in parallel computing and programming language design to solve impactful problems. Will’s current focus is tools that enable large-scale visual data analysis (processing massive collections of images and videos), work he has published at SIGGRAPH.

Presentations

Scanner: Efficient video analysis at scale Session

Video is now the largest source of data on the internet, so we need tools to make it easier to process and analyze. Alex Poms and Will Crichton offer an overview of Scanner, the first open source distributed system for building large-scale video processing applications, and explore real-world use cases.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Strata Data Ethics Summit welcome Tutorial

Susan Etlinger and Alistair Croll welcome you to the Strata Data Ethics Summit.

The future of data ethics Ethics Summit

Strata Data Ethics Summit cochairs Susan Etlinger and Alistair Croll, along with Tim O’Reilly, lead an interactive discussion format with the summit's speakers, attendees, and guests.

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the first day of keynotes.

Anne Cruz is the manager of merchandising and supply chain technology at Walgreens, where she’s working on the company’s multiyear retail transformation journey, in which data strategy is pivotal to synthesizing business strategy with technology. As the IT partner for supply chain analytics, Anne enables the business to accelerate value by changing operating models, simplifying and modernizing architectures, and collaborating with business units and other IT functions to redesign and optimize business processes.

Presentations

How Walgreens transformed supply chain management with Kyvos, Tableau, and big data Session

Walgreens recently faced the challenge of analyzing 466 billion rows of data from 20,000 suppliers and 9,000 stores, a scale and cardinality that strained its existing systems. Neerav Jain, Vikas Hardia, and Anne Cruz describe how they used Kyvos and Tableau to transform Walgreens's supply chain with instant, interactive analysis across two years of data.

Dillon Cullinan is a data engineering cybersecurity specialist at Accenture Cyber Labs, located in the Washington, DC, area. Dillon focuses on building big data solutions for the cybersecurity realm to enable large-scale analytics and visualizations.

Presentations

Using graph metrics to detect lateral movement in enterprise cybersecurity data Session

Louis DiValentin and Dillon Cullinan explain how Accenture's Cyber Security Lab built security analytics models to detect attempted lateral movement in networks by transforming enterprise-scale security data into a graph format, generating graph analytics for individual users, and building time series detection models that visualize the changing graph metrics for security operators.

Nick Curcuru is vice president of enterprise information management at Mastercard, where he is responsible for leading a team that works with organizations to generate revenue through smart data, architect next-generation technology platforms, and protect data assets from cyberattacks by leveraging Mastercard’s information technology and information security resources and creating peer-to-peer collaboration with their clients. Nick brings over 20 years of global experience successfully delivering large-scale advanced analytics initiatives for such companies as the Walt Disney Company, Capital One, Home Depot, Burlington Northern Railroad, Merrill Lynch, Nordea Bank, and GE. He frequently speaks on big data trends and data security strategy at conferences and symposiums, has published several articles on security, revenue management, and data security, and has contributed to several books on data and analytics.

Presentations

Executive Briefing: Forcing the legal and ethical hands of companies that collect, use, and analyze data Session

Data—in part, harvested personal data—brings industries unprecedented insights about customer behavior. We know more about our customers and neighbors than at any other time in history, but we need to avoid crossing the "creepy" line. Nick Curcuru discusses how ethical behavior drives trust, especially in today's IoT age.

Paul Curtis is a principal solutions engineer at MapR, where he provides pre- and postsales technical support to MapR’s worldwide systems engineering team. Previously, Paul was senior operations engineer for Unami, a startup founded to deliver on the promise of interactive TV for consumers, networks, and advertisers, and a systems manager for Spiral Universe, a company providing school administration software as a service. He also held senior support engineer positions at Sun Microsystems, enterprise account technical management positions for both Netscape and FileNet, and positions in application development at Applix, IBM Service Bureau, and Ticketron. Paul got started in the ancient personal computing days; he began his first full-time programming job on the day the IBM PC was introduced.

Presentations

Clusters in Kubernetes on a cluster: Building a multitenant environment for the field Session

What do you do when your technology doesn’t easily fit on a single laptop and consists of many components? Paul Curtis explains how MapR Technologies rolled out a containerized, scalable, globally available, and easily updatable environment using a combination of Kubernetes to orchestrate, shared data fabric to store and persist, and AppLariat to provide the user interface.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the first day of keynotes.

Sabrina Dahlgren is a director in charge of strategic analysis at Kaiser Permanente. Her expertise ranges from statistics and economics to project management and computer science. Sabrina has 20 years’ total work experience in leadership and analytical roles such as vice president of marketing and product development and CRM manager and customer segmentation in technology companies including Vodafone, among others. Sabrina has twice won the Innovation Award at Kaiser, most recently in the category of broadly applicable technology for big data analytics.

Presentations

AutoML and interpretability in healthcare Data Case Studies

The healthcare industry requires accuracy and highly interpretable models, but the data is usually plagued by missing information and incorrect values. Enter AutoML and auto-model interpretability. Taposh DuttaRoy and Sabrina Dahlgren discuss tools and strategies for AutoML and interpretability and explain how KP uses them to improve time to develop and deploy highly interpretable models.

Jason Dai is a senior principal engineer and chief architect for big data technologies at Intel, where he leads the development of advanced big data analytics, including distributed machine learning and deep learning. Jason is an internationally recognized expert on big data, the cloud, and distributed machine learning; he is the cochair of the Strata Data Conference in Beijing, a committer and PMC member of the Apache Spark project, and the creator of BigDL, a distributed deep learning framework on Apache Spark.

Presentations

Analytics Zoo: Distributed TensorFlow and Keras on Apache Spark Tutorial

Jason Dai, Yuhao Yang, Jennie Wang, and Guoqiong Song explain how to build and productionize deep learning applications for big data with Analytics Zoo—a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline—using real-world use cases from JD.com, MLSListings, the World Bank, Baosight, and Midea/KUKA.

Stephen Dantu is the head of big data capabilities at Marsh, where he leads the strategy, roadmap, and delivery of Marsh’s big data ecosystem to power next-gen data, analytics, and self-service capabilities across the company. He’s also an analytics evangelist, empowering the organization with a variety of self-service utilities and promoting the cause of democratizing data. Stephen is a seasoned, hands-on data and analytics leader with passion and expertise in unleashing the power of data through robust data platforms and cutting-edge analytics. Previously, he led internal analytics at Mastercard, where he was responsible for driving the strategy and execution of the company’s internal analytics BI platforms and providing consultative analytics support on strategic initiatives. He conceived and led the creation of the Centralized Analytics Hub to democratize big data by delivering fast, curated analytics coupled with best-in-class visualization.

Presentations

The new frontier: Marsh’s data voyage into the public cloud (sponsored by Impetus) Session

Stephen Dantu shares insurance broker Marsh’s pioneering journey into the public cloud and explains why this move was necessary to unleash new opportunities and future-proof the company.

Michelle Davenport is one of the most sought-after nutrition experts in food and tech. Currently Michelle consults startups and larger companies on data science in nutrition data analysis and precision-based methods to food and health product research and development. She also serves as a clinical and scientific advisor to startups, including Ritual, a direct-to-consumer supplement company for women. Previously, she was the cofounder and president of Raised Real, where she created a venture-funded, tech-driven, subscription food program for children that targets infant nutritional milestones. As the fastest growing kids’ food brand in the US, Raised Real currently delivers to thousands of families nationwide. Before that, she was the director of nutrition for Zesty (acquired by Square), where she developed the food and nutrition API and a proprietary nutrient analysis program. Michelle and her work have been featured in Fast Company, Forbes, Time, and the Wall Street Journal, among others. She lives in Menlo Park, CA, with her husband, Josh, and daughter, Sophie. Michelle holds a PhD in nutrition from New York University; she did her clinical training as a registered dietitian at the University of California, San Francisco.

Presentations

Nutrition data science Session

Noah Gift and Michelle Davenport explore exciting ideas in nutrition using data science; specifically, they analyze the detrimental relationship between sugar and longevity, obesity, and chronic diseases.

Julien Delange is a staff software engineer at Twitter working on infrastructure services. Previously, he was a senior software engineer at Amazon Web Services, a senior member of the technical staff at Carnegie Mellon University, and a software engineer at the European Space Agency. Julien holds a PhD in computer science from Télécom ParisTech and a master’s degree in computer science from Université Pierre-et-Marie-Curie.

Presentations

Real-time monitoring of Twitter's network infrastructure with Heron Session

Julien Delange and Neng Lu explain how Twitter uses the Heron stream processing engine to monitor and analyze its network infrastructure—implementing a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. Join in to learn the key technologies used, the architecture, and the challenges Twitter faced.

Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Previously, Sourav led teams building data products across the technology stack, from smart thermostats and security cams at Google/Nest to power grid forecasting at AutoGrid to wireless communication chips at Qualcomm. He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He holds PhD, MS, and BS degrees in electrical engineering and computer science from MIT.

Presentations

Applications of mixed effects random forests Session

Clustered data is all around us. The best way to attack it? Mixed effect models. Sourav Dey explains how the mixed effects random forests (MERF) model and Python package marries the world of classical mixed effect modeling with modern machine learning algorithms and shows how it can be extended to be used with other advanced modeling techniques like gradient boosting machines and deep learning.

Streamlining a machine learning project team Tutorial

Many teams are still run as if data science is mainly about experimentation, but those days are over. Now it must offer turnkey solutions to take models into production. Sourav Dey and Alex Ng explain how to streamline an ML project and help your engineers work as an integrated part of your production teams, using a Lean AI process and the Orbyter package for Docker-first data science.

Rohan Dhupelia leads the analytics platform team at Atlassian, which focuses on further democratizing data in the company and providing a world-class, highly innovative data platform. Rohan has spent the last 10+ years of his career in the data space across a variety of industries, including FMCGs, property, and technology, doing everything from BI report development to data warehousing and data engineering.

Presentations

Transforming behavioral analytics at Atlassian Session

Analytics is easy, but good analytics is hard. Atlassian knows this all too well. Rohan Dhupelia and Jimmy Li explain how the company's push to become truly data driven has transformed the way it thinks about behavioral analytics, from how it defined its events to how it ingests and analyzes them.

Renee DiResta is director of research at cybersecurity company New Knowledge and a Mozilla Fellow in Media, Misinformation, and Trust. She investigates the spread of disinformation and malign narratives across social networks and has advised Congress, the State Department, and senior executives on how to understand and respond to influence operations.

Presentations

The brave new world of computational propaganda Session

Renee DiResta, lead author of the US Senate report on Russian disinformation operations, discusses how influence operations are manifesting in 2019 as they move beyond politics.

Louis DiValentin is a security data scientist at Accenture Cyber Labs, located in the Washington, DC, area. His research focuses on security analytics modeling, graph analytics, and big data.

Presentations

Using graph metrics to detect lateral movement in enterprise cybersecurity data Session

Louis DiValentin and Dillon Cullinan explain how Accenture's Cyber Security Lab built security analytics models to detect attempted lateral movement in networks by transforming enterprise-scale security data into a graph format, generating graph analytics for individual users, and building time series detection models that visualize the changing graph metrics for security operators.

Thomas Dobbs is a data science product manager at KIXEYE, where he is responsible for maintaining and building machine learning models from ideation to full implementation. Thomas also wrote the underlying algorithm for KIXEYE’s offer recommendation engine. His experience spans marketing, user acquisition, finance, and product, and he has spent seven years in the gaming industry. He is an MBA candidate at UC Berkeley’s Haas School of Business.

Presentations

Recommendation engines and mobile gaming Session

As a fully closed model economy, games offer a unique opportunity to use analytics to create unique purchase opportunities for customers. Bysshe Easton and Thomas Dobbs explain how KIXEYE uses machine learning to create personalized offer recommendations for its customers, resulting in significantly increased monetization and retention.

Harish Doddi is cofounder and CEO of Datatron Technologies. Previously, he held roles at Oracle; Twitter, where he worked on open source technologies, including Apache Cassandra and Apache Hadoop, and built Blobstore, Twitter’s photo storage platform; Snapchat, where he worked on the backend for Snapchat stories; and Lyft, where he worked on the surge pricing model. Harish holds a master’s degree in computer science from Stanford, where he focused on systems and databases, and an undergraduate degree in computer science from the International Institute of Information Technology in Hyderabad.

Presentations

Model governance in the enterprise Session

Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them.

Xiaojing Dong is head of marketing science and a staff data scientist on the data science team at LinkedIn. She’s also an associate professor of marketing and business analytics at Santa Clara University, where she led the effort to design and launch a popular Master of Science program in business analytics and served as its founding director. She helps translate business and marketing problems into data questions and applies analytical techniques to solve them and inform business decisions.

Presentations

Using the full spectrum of data science to drive business decisions Tutorial

Thanks to the rapid growth in data resources, business leaders now appreciate the importance (and the challenge) of mining information from data. Join in as a group of LinkedIn's data scientists share their experiences successfully leveraging emerging techniques to assist in intelligent decision making.

Mark Donsky leads product management at Okera, a software provider that provides discovery, access control, and governance at scale for today’s modern heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera. Mark has held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive Briefing: Big data in the era of heavy worldwide privacy regulations Session

The implications of new privacy regulations for data management and analytics, such as the General Data Protection Regulation (GDPR) and the upcoming California Consumer Privacy Act (CCPA), can seem complex. Mark Donsky and Nikki Rouda highlight aspects of the rules and outline the approaches that will assist with compliance.

Jed Dougherty leads Dataiku’s Data Science team in North America. He specializes in helping large companies in fields including finance, manufacturing, and medicine spin up and organize data science teams and has helped clients build successful projects in data security, real-time recommendation, predictive maintenance, and other “hot topics.” Previously, he worked at a camel ride, so he’s spent quite a bit of time appreciating the normal versus bimodal distribution of dromedaries and Bactrians. He holds a master’s degree from the QMSS Program at Columbia University.

Presentations

Scoring your business in the AI matrix (sponsored by Dataiku) Keynote

One widely accepted definition of AI is that it means going beyond simple statistics to mimic human skills in perception, learning, interaction, and decision making. Jed Dougherty tightens up this definition by sharing examples on a matrix that breaks down the different parts of that definition and how they might manifest themselves in data science projects at different levels.

Justin Driemeyer is an ML staff engineer at 8×8. Previously, he spent three years at an ML B2B advertising startup (acquired by 8×8) and seven years at Zynga, as it went from a 10-person startup to a 2,000-person public company. He holds a BS in computer engineering from U of I and an MS in CS from Stanford, where he worked on the STAIR project.

Presentations

From Jupyter to production: Accelerating solutions to business problems in production Session

Project Jupyter is very popular for data science, data exploration, and visualization. Manu Mukerji and Justin Driemeyer explain how to use it for AI/ML in a production environment.

Alain Dufaux is head of operations and development for the Metamedia Center at École Polytechnique Fédérale de Lausanne (EPFL), which is responsible for digitization and preservation of the Montreux Jazz Festival archive. Previously, he worked at a company developing ultra-low-power processors and real-time algorithms for hearing aid devices and worked in a research lab at EPFL managing projects and coaching students in the development of image and video processing algorithms applied to vision systems. He holds a PhD from the University of Neuchâtel, Switzerland, where his thesis was dedicated to automatic sound recognition.

Presentations

How EPFL captured the feel of the Montreux Jazz Festival with its immersive 3D VR to three-geo archive Session

The École Polytechnique Fédérale de Lausanne (EPFL) spearheaded the official digital archival of 15,000+ hours of A/V content captured from the Montreux Jazz Festival since 1967. Stefaan Vervaet and Alain Dufaux explain how EPFL created an immersive 3D VR experience. From capture and store to delivery and experience, they detail the evolution of the workflow that made it all possible.

Ted Dunning is chief application architect at MapR. He’s also a board member for the Apache Software Foundation, a PMC member and committer on many Apache projects, and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He has contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Online evaluation of machine learning models Session

Evaluating machine learning models is surprisingly hard, particularly because these systems interact in very subtle ways. Ted Dunning breaks the problem of evaluation apart into operational and function evaluation, demonstrating how to do each without unnecessary pain and suffering. Along the way, he shares exciting visualization techniques that will help make differences strikingly apparent.

Taposh Roy leads the innovation team within the Decision Support Group at Kaiser Permanente. His work focuses on journey analytics, deep learning, data science architecture, and strategy. A consumer-focused machine learning and data science geek, Taposh has a unique combination of product, technology and strategy consulting, data science, and startup experience. Previously, he was head of AD products at Inpowered and Netshelter (acquired by Ziff Davis); senior associate consultant in MIT-based consulting company Sapient; and cofounder of biotech company Bio-Integrated Solutions, where he developed DNA sequencers and liquid handling devices for proteomics.

Presentations

AutoML and interpretability in healthcare Data Case Studies

The healthcare industry requires accuracy and highly interpretable models, but the data is usually plagued by missing information and incorrect values. Enter AutoML and auto-model interpretability. Taposh DuttaRoy and Sabrina Dahlgren discuss tools and strategies for AutoML and interpretability and explain how KP uses them to improve time to develop and deploy highly interpretable models.

Bysshe Easton is the director of analytics at KIXEYE, where he combines data analysis, economics, intuition, and game design sense to solve monetization and content delivery problems in games. He also manages the insights and implementation of the machine learning offer system for War Commander: Rogue Assault.

Presentations

Recommendation engines and mobile gaming Session

As a fully closed model economy, games offer a unique opportunity to use analytics to create unique purchase opportunities for customers. Bysshe Easton and Thomas Dobbs explain how KIXEYE uses machine learning to create personalized offer recommendations for its customers, resulting in significantly increased monetization and retention.

Jana Eggers is CEO of the neuroscience-inspired artificial intelligence platform company Nara Logics. An experienced tech exec focused on inspiring teams to build great products, Jana has started and grown companies and led large organizations at public companies. She’s active in customer-inspired innovation, the artificial intelligence industry, autonomy-mastery-purpose-style leadership, and running and triathlons. Previously, she held technology and executive positions at Intuit, Blackbaud, Los Alamos National Laboratory (computational chemistry and supercomputing), Basis Technology (internationalization technology), Lycos, American Airlines, Spreadshirt (ecommerce), and startups that you’ve never heard of.

Presentations

AI's terrible twos: When AI does what we taught it Ethics Summit

Jana Eggers covers wide-ranging and important themes to think about as you raise your AI systems. You'll also learn how to detect possible "cute" behavior early that develops into bad behavior later.

Panel: Causes Ethics Summit

Following the review of problematic technologies, we'll hold an interactive discussion with speakers and invited guests to dig deeper into neuroscience, analytics, and more.

Susan Etlinger is an industry analyst at Altimeter. Her research focuses on the impact of artificial intelligence, data, and advanced technologies on business and culture and is used in university curricula around the world. Susan’s TED Talk, “What do we do with all this big data?,” has been translated into 25 languages and has been viewed more than 1.2 million times. She’s a sought-after keynote speaker and has been quoted in such media outlets as the Wall Street Journal, the BBC, and the New York Times.

Presentations

Getting real about ethical technology Ethics Summit

During the past year, we’ve seen a lot of focus on ethics in tech. But what does it all mean? And what should business really be doing about technology ethics? Industry analyst Susan Etlinger lays out the progress and issues in the industry, as well as some of the promising approaches coming from business, the public sector, and academia.

Strata Data Ethics Summit welcome Tutorial

Susan Etlinger and Alistair Croll welcome you to the Strata Data Ethics Summit.

The future of data ethics Ethics Summit

Strata Data Ethics Summit cochairs Susan Etlinger and Alistair Croll, along with Tim O’Reilly, lead an interactive discussion format with the summit's speakers, attendees, and guests.

Adam Famularo is CEO of erwin, Inc., maker of the world’s number-one data modeling software as well as other data management solutions that help organizations derive maximum results from their data initiatives and promote strong data governance. Previously Adam oversaw Verizon’s enterprise distribution channels—partners, value-added resellers, and systems integrators—across the world and served as a senior vice president and general manager for CA Technologies’s cloud computing business and storage and data management business units.

Presentations

Solving the enterprise data dilemma (sponsored by erwin) Session

Adam Famularo showcases erwin's combination of data management and data governance to produce actionable insights, and erwin customer Nasdaq shares a real-world use case. You'll learn how to answer tough data questions, maintain a metadata landscape, and use data management and governance to drive actionable insights.

Wenchen Fan is a software engineer at Databricks, working on Spark Core and Spark SQL, as well as a Spark committer and a Spark PMC member. He mainly focuses on the Apache Spark open source community, leading the discussion and reviews of many features and fixes in Spark.

Presentations

Apache Spark 2.4 and beyond Session

Xiao Li and Wenchen Fan offer an overview of the major features and enhancements in Apache Spark 2.4 and give insight into upcoming releases. Then you'll get the chance to ask all your burning Spark questions.

Tao Feng is a software engineer on the data platform team at Lyft and a committer and PMC member on Apache Airflow. Previously, Tao worked on data infrastructure, tooling, and performance at LinkedIn and Oracle.

Presentations

Disrupting data discovery Session

Lyft has reduced the time it takes to discover data by 10x by building its own data portal, Amundsen. Mark Grover and Tao Feng offer a demo of Amundsen and lead a deep dive into its architecture, covering how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. They also explore the future roadmap, unsolved problems, and its collaboration model.

Toby Ferguson is a sales engineer at Cloudera, where he helps partners succeed with the Cloudera platform.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

There are many challenges with moving multidisciplinary big data workloads to the cloud and running them. Jason Wang, Brandon Freeman, Michael Kohs, Akihiro Nishikawa, and Toby Ferguson explore cloud architecture and its challenges and walk you through using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Rustem Feyzkhanov is a machine learning engineer at Instrumental, where he creates analytical models for the manufacturing industry. Rustem is passionate about serverless infrastructure (and AI deployments on it) and has ported several packages to AWS Lambda, from TensorFlow, Keras, and scikit-learn for ML to PhantomJS, Selenium, and WRK for web scraping.

Presentations

Serverless workflows for orchestrating hybrid cluster-based and serverless processing Session

Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture.

Erin Flynn is the chief people officer at Optimizely, where she oversees recruiting, people operations, real estate, and workplace operations. Erin’s more than 20 years of experience includes a decade leading recruiting, talent development, and employee success (global HR) at Salesforce as well as stints with PeopleSoft and Idealab. When she’s not making Optimizely one of the best places to work, she also sits on the board of the Horizons Foundation.

Presentations

Future of the firm: How are executives preparing now? Session

In this panel session, executives will discuss how their companies are adapting to the workforce, business, and economic trends shaping the future of business.

Ash Fontana is managing director of Zetta Venture Partners—the first venture capital fund focused on intelligent systems, which he launched with Mark Gorenberg. The firm has $185M under management and has invested in 21 companies. Ash is a board member of and lead investor in companies such as Kaggle, Invenia, Clearbit, Tractable, and Focal Systems. Previously, Ash started the money side of AngelList, the most successful startup investing platform in the world; he launched online investing, managing $130M over more than 250 funds, creating the first startup index fund, and curating investment opportunities across 500,000 companies. He also ran special projects like AngelList’s expansion into Europe and the UK and simultaneously led syndicates and angel investments in Canva, Mixmax, and others. Ash cofounded Topguest, a Founders Fund-backed company that built customer analytics technology for companies like IHG, United, and Caesars Entertainment. Topguest sold in an eight-figure transaction 18 months after the company was founded.

Presentations

VC dimension: How and why investors fund AI startups Session

What does it mean to be an AI investor? How is this approach different from traditional venture capital? Ash Fontana and Katherine Boyle share their perspectives on investments in machine intelligence and data science.

Jonathan Foster leads the Windows and content intelligence writing team at Microsoft. Their work includes UX writing, designing personality and voice within products and experiences, and authoring and designing conversational interactions for products and experiences. He built the writing team for Microsoft’s digital assistant Cortana for the US and international markets, which focused on the development of Cortana’s personality while crafting fun, challenging dialogue. They’re now expanding upon this knowledge to create a personality catalogue for Microsoft’s Bot Framework and train a deep neural net conversational model to support those personalities. Jonathan started out in film and television writing screenplays and working in development. He was eventually drawn away from Hollywood by the true innovative spirit of the tech industry, starting with an interactive storytelling project that was honored by the Sundance Film Festival.

Presentations

Panel: Causes Ethics Summit

Following the review of problematic technologies, we'll hold an interactive discussion with speakers and invited guests to dig deeper into neuroscience, analytics, and more.

Say what? The ethical challenges of designing for humanlike interaction Ethics Summit

Language shapes our thinking, our relationships, our sense of self. Conversation connects us in powerful, intimate, and often unconscious ways. Jonathan Foster explains why, as we design for natural language interactions and more humanlike digital experiences, language—as design material, conversation, and design canvas—reveals ethical challenges we couldn't encounter with GUI-powered experiences.

Don Fox is a Boston-based data scientist in residence at the Data Incubator. Previously, Don developed numerical models for a geothermal energy startup. Born and raised in South Texas, Don holds a PhD in chemical engineering, for which he researched renewable energy systems and developed computational tools to analyze their performance.

Presentations

Hands-on data science with Python 2-Day Training

Don Fox walks you through developing a machine learning pipeline, from prototyping to production. You'll learn about data cleaning, feature engineering, model building and evaluation, and deployment and then extend these models into two applications from real-world datasets. All work will be done in Python.

Jonathan Francis is vice president of marketing analytics and optimization at Starbucks.

Presentations

Improving AI solutions for personalization with continuous experimentation and learning Data Case Studies

Jon Francis explains how he and Arun Veetill improved the performance of an AI-based personalization solution by 2x through continuous AI-enabled experimentation and learning.

Bill Franks is chief analytics officer at the International Institute for Analytics (IIA). His work has spanned clients in a variety of industries for companies ranging in size from Fortune 100 companies to small nonprofit organizations. Previously, he was chief analytics officer at Teradata. Bill is the author of Taming the Big Data Tidal Wave and The Analytics Revolution. You can learn more on his website.

Presentations

The ethics of analytics Session

Concerns are constantly being raised today about what data is appropriate to collect and how (or if) it should be analyzed. There are many ethical, privacy, and legal issues to consider, and no clear standards exist in many cases as to what is fair and what is foul. Bill Franks explores a variety of dilemmas and provides some guidance on how to approach them.

Brandon Freeman is a Mid-Atlantic region strategic system engineer at Cloudera, specializing in infrastructure, the cloud, and Hadoop. Previously, Brandon was an infrastructure architect at Explorys, working in operations, architecture, and performance optimization for the Cloudera Hadoop environments, where he was responsible for designing, building, and managing many large Hadoop clusters.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

There are many challenges with moving multidisciplinary big data workloads to the cloud and running them. Jason Wang, Brandon Freeman, Michael Kohs, Akihiro Nishikawa, and Toby Ferguson explore cloud architecture and its challenges and walk you through using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Cynthia Freeman is a research engineer at Verint Intelligent Self-Service, a developer of conversational AI systems. She holds an MS in applied mathematics from the University of Washington and a BS in mathematics from Gonzaga University and is currently pursuing her PhD in computer science at the University of New Mexico, where she works on time series analysis and developing new anomaly detection methods.

Presentations

How to determine the optimal anomaly detection method for your application Session

Anomaly detection has many applications, such as tracking business KPIs or fraud spotting in credit card transactions. Unfortunately, there's no one best way to detect anomalies across a variety of domains. Jonathan Merriman and Cynthia Freeman introduce a framework to determine the best anomaly detection method for the application based on time series characteristics.

Matt Fuller is cofounder at Starburst, the Presto company. Matt has held engineering roles in the data warehousing and analytics space for the past 10 years. Previously, he was director of engineering at Teradata, leading engineering teams working on Presto, and was part of the team that led the initiative to bring open source, in particular Presto, to Teradata’s products. Before that, Matt architected and led development efforts for the next-generation distributed SQL engine at Hadapt (acquired by Teradata in 2014) and was an early engineer at Vertica Systems (acquired by HP), where he worked on the Query Optimizer.

Presentations

Learning Presto: SQL on anything Tutorial

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today.

Mei Lin Fung is a technology pioneer working to ensure that technology works for humanity as the next 3.9 billion people come online. In 1989, she was part of the two-person skunkworks team that developed OASIS, the first customer relationship management (CRM) system. She later served as sociotechnical lead for the US Department of Defense’s Federal Health Futures initiative. In 2015, she joined “father of the internet” Vint Cerf to cofound the People-Centered Internet, which maintains a global network of “positive change agents” committed to ensuring that technology is developed with a people-centered focus—increasing access while ensuring equality, protecting the vulnerable, and prioritizing human well-being. She serves on the World Economic Forum Steering committee for Internet for All. She chairs the Industry Connections Social Impact Measurement Pre-standards working group for the Institute for Electrical and Electronic Engineers (IEEE), and is liaison to the Standards Association for the IEEE Humanitarian Activities Committee. She is founder and unit coordinator of the California Health Medical Reserve Corps.

Presentations

Community and regional data sharing policy frameworks: Frontier stories Session

Data sharing requires stakeholders and populations of people to come together to learn the benefits, risks, challenges, and known and unknown "unknowns." Data sharing policies and frameworks require increasing levels of trust, which takes time to build. Join Mei Fung for trailblazing stories from Solano County, California, and ASEAN (SE Asia), which offer important insights.

Krishna Gade is the founder and CEO of Fiddler Labs, an enterprise startup building an explainable AI engine to address problems regarding bias, fairness, and transparency in AI. An entrepreneur and engineering leader with strong technical experience creating scalable platforms and delightful consumer products, Krishna previously held senior engineering leadership roles at Facebook, Pinterest, Twitter, and Microsoft.

Presentations

Challenges in addressing bias, fairness, and transparency in AI Session

Join Krishna Gade to learn how to address engineering and organizational challenges for AI fairness and operationalize these concepts in a production AI system—and crucially, create a culture of trust in AI.

Li Gao is the tech lead for the Cloud Native Spark Compute Initiative at Lyft. Previously, Li held technical leadership positions focusing on cloud native and hybrid cloud data platforms at scale at Salesforce, Fitbit, Marin Software, and a few startups. Besides Spark, Li has scaled and productionized open source projects including Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, and Apache Hive.

Presentations

Scaling Apache Spark on Kubernetes at Lyft Session

Li Gao and Bill Graham discuss the challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale.

Sarah Gates is the product marketer for the SAS Platform. Previously, she spent five years as an analytics advisor at SAS, where she helped government customers on the West Coast leverage their data to better serve their citizens. Prior to joining SAS, she was the vice president of research for the International Institute for Analytics, where she advised clients worldwide in the healthcare, retail, banking, and manufacturing industries on analytic applications and analytic program maturity. She also has 20 years of experience in the government sector in a wide variety of data scientist, policy, and leadership roles. Sarah holds a degree in mathematics and an MBA from Willamette University. She’s passionate about cycling and her two daughters, one of whom she has successfully launched into the real world.

Presentations

From data to discovery: The power of choice and control (sponsored by SAS) Session

SAS empowers you with choice and control, helping you uncover insights from any data for better, faster decisions regardless of language. Sarah Gates shares methods for accelerating the analytics lifecycle, improving data preparation, quality, and governance, automating and speeding up time-consuming tasks, and quickly creating, selecting, and deploying models—be it one or thousands.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Richard Geering is vice president of governance, risk, and compliance at Immuta. He has over 20 years’ experience in the financial services industry, in global leadership roles in risk, sales, and trading in London, New York, and Barbados. Most recently, he was the chief operational risk officer for a global custodian bank. Richard holds a BSc (with honors) in physics from the University of Nottingham.

Presentations

Successfully deploy machine learning while managing its risks Tutorial

As machine learning (ML) becomes increasingly important for businesses and data science teams alike, managing its risks is quickly becoming one of the biggest challenges to the technology’s widespread adoption. Join Andrew Bur, Steven Touw, Richard Geering, Joseph Regensburger, and Alfred Rossi for a hands-on overview of how to train, validate, and audit ML models in practice.

Adem Efe Gencer develops Apache Kafka and the ecosystem around it and supports their operation at LinkedIn. In particular, he works on the design, development, and maintenance of Cruise Control, a system for alleviating the management overhead of large-scale Kafka clusters. He also serves as a reviewer for top-tier journals and conferences. He holds a PhD in computer science from Cornell University, where his research focused on improving the scalability of blockchain technologies. The protocols introduced in his research were adopted by Waves Platform, Aeternity, Cypherium, Enecuum, Ergo Platform, and Legalthings and are actively being developed into other systems. His papers have been cited over 500 times, and he received a best student paper award at the Middleware Conference.

Presentations

Cruise Control: Effortless management of Kafka clusters Session

Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control.

Noah Gift is a lecturer and consultant at both the UC Davis Graduate School of Management MSBA Program and the Graduate Data Science Program at Northwestern, where he designs and teaches graduate machine learning, AI, and data science courses and consults on machine learning and cloud architecture for students and faculty. These responsibilities include leading a multicloud certification initiative for students. As the founder of Pragmatic AI Labs, he also consults with companies on machine learning and cloud architecture. His previous roles have included CTO, general manager, consulting CTO, consulting chief data scientist, and cloud architect at companies such as ABC, Caltech, Sony Imageworks, Disney Feature Animation, Weta Digital, AT&T, Turner Studios, and Linden Lab.

Noah is a Python Software Foundation Fellow, AWS Subject Matter Expert (SME) on machine learning, AWS Certified Solutions Architect and AWS Academy Accredited Instructor, Google Certified Professional Cloud Architect, and Microsoft MTA on Python. He has published close to 100 technical publications, including two books on subjects ranging from cloud machine learning to DevOps, for companies like Forbes, IBM, Red Hat, Microsoft, O’Reilly, and Pearson. He’s also led workshops and talks around the world for organizations including NASA, PayPal, PyCon, Strata, and FooCamp. He holds an MBA from UC Davis, an MS in computer information systems from Cal State Los Angeles, and a BS in nutritional science from Cal Poly San Luis Obispo.

Presentations

Nutrition data science Session

Noah Gift and Michelle Davenport explore exciting ideas in nutrition using data science; specifically, they analyze the detrimental relationship between sugar and longevity, obesity, and chronic diseases.

Benjamin Glicksberg is a postdoctoral scholar in the lab of Atul Butte in the Bakar Computational Health Sciences Institute at the University of California, San Francisco. His work involves utilizing state-of-the-art computational methods, including artificial intelligence algorithms, on bio- and clinical informatics frameworks to make discoveries to push forward precision medicine. His work often ties together multiomic data types ranging from genomics to clinical data in the form of electronic health records (EHR). He’s also built software, tools, and applications for interacting with and visualizing EHR data across patients in the UC Health system, with a particular emphasis on interoperable common data model formats. He holds a PhD from the Icahn School of Medicine at Mount Sinai.

Presentations

Sharing cancer genomic data from clinical sequencing using the blockchain Data Case Studies

Sequencing cancer genomes has transformed how we diagnose and treat the deadliest disease in America: cancer. Benjamin Glicksberg explains how coupling cancer genomic data with treatment data through the blockchain will empower patients and citizen scientists to rapidly advance cancer research.

Sean Glover is a senior software engineer on the Fast Data Platform team at Lightbend, where he specializes in Kubernetes, Apache Kafka, and its ecosystem. Sean enjoys building fast data platforms and reactive distributed systems and contributing to open source projects.

Presentations

Put Kafka in jail with Strimzi Session

The best way to run stateful services with complex operational needs like Kafka is to use the operator pattern. Sean Glover offers an overview of the Strimzi Kafka Operator, a popular new open source operator-based Apache Kafka implementation on Kubernetes.

Sharad Goel is an assistant professor in the Department of Management Science and Engineering at Stanford University and the founder and director of the Stanford Computational Policy Lab. He also holds courtesy appointments in the Computer Science and Sociology Departments and the Law School. Previously, he was a senior researcher at Yahoo and Microsoft in New York City. In his research, Sharad looks at public policy through the lens of computer science, bringing a new, computational perspective to a diverse range of contemporary social issues, including policing, incarceration, and elections. He holds a PhD in applied mathematics from Cornell University.

Presentations

The measure and mismeasure of fairness in machine learning Session

The nascent field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Several formal definitions of fairness have gained prominence, but, as Sharad Goel argues, nearly all of them suffer from significant statistical limitations. Perversely, when used as a design constraint, they can even harm the very groups they were intended to protect.

Shafi Goldwasser is the RSA Professor of Electrical Engineering and Computer Science at MIT and the incoming director of the Simons Institute for the Theory of Computing at UC Berkeley. She is also a professor of computer science and applied mathematics at the Weizmann Institute of Science in Israel. Shafi’s pioneering contributions include the introduction of probabilistic encryption, interactive zero-knowledge protocols, elliptic curve primality testing, hardness of approximation proofs for combinatorial problems, and combinatorial property testing. She was the recipient of the ACM Turing Award for 2012, the Gödel Prize in 1993 and 2001, the ACM Grace Murray Hopper Award in 1996, the RSA Award in Mathematics in 1998, the ACM Athena Award for Women in Computer Science in 2008, the Benjamin Franklin Medal in 2010, the IEEE Emanuel R. Piore Award in 2011, the Simons Foundation Investigator Award in 2012, and the BBVA Foundation Frontiers of Knowledge Award in 2018. She’s a member of the NAS, NAE, AAAS, the Russian Academy of Science, the Israeli Academy of Science, and the London Royal Mathematical Society. She holds honorary degrees from Ben Gurion University, Bar Ilan University, and Haifa University and was recognized with a Berkeley Distinguished Alumnus Award and the Barnard College Medal of Distinction. Shafi holds a BS in applied mathematics from Carnegie Mellon University and an MS and PhD in computer science from the University of California, Berkeley.

Presentations

AI and cryptography: Challenges and opportunities Keynote

Keynote with Shafi Goldwasser

Thomas Goolsby is data and analytics director at USAA, where he focuses on member interaction data and research with universities to help provide new insights and build pipelines of talent.

Presentations

Creating a data engineering culture at USAA Session

What happens when you have a data science organization but no data engineering organization? Jesse Anderson and Thomas Goolsby explain what happened at USAA without data engineering, how they fixed it, and the results since.

Alex Gorbachev is the head of enterprise data science at Pythian. His mission is to help clients around the world build applied AI solutions and democratize data science. Over the course of his 12 years at Pythian, Alex has held many roles, including chief technology officer and chief digital officer. His deep technological roots and industry vision have helped Pythian get to the forefront of the emerging cloud and data markets. Alex is a highly sought-after speaker at industry conferences and user groups around the world. His past accomplishments include achieving the prestigious Oracle ACE Director designation from Oracle and being named “Big Data Champion” by Cloudera.

Presentations

Machine learning for preventive maintenance of mining haul trucks Session

Alex Gorbachev and Paul Spiegelhalter use the example of a mining haul truck to explain how to map preventive maintenance needs to supervised machine learning problems, create labeled datasets, do feature engineering from sensors and alerts data, evaluate models—then convert it all to a complete AI solution on Google Cloud Platform that's integrated with existing on-premises systems.

Martin Gorner is a developer advocate at Google, where he focuses on parallel processing and machine learning. Martin is passionate about science, technology, coding, algorithms, and everything in between. He spent his first engineering years in the Computer Architecture Group of STMicroelectronics, then spent the next 11 years shaping the nascent ebook market at Mobipocket, which later became the software part of the Amazon Kindle and its mobile variants. He’s the author of the successful TensorFlow Without a PhD series. He graduated from Mines Paris Tech.

Presentations

Recurrent neural networks without a PhD Tutorial

Martin Gorner leads a hands-on introduction to recurrent neural networks and TensorFlow. Join in to discover what makes RNNs so powerful for time series analysis.

Denise Gosnell leads a team at DataStax that builds some of the largest distributed graph applications in the world. Her passion centers on examining, applying, and evangelizing the applications of graph data and complex graph problems. An NSF fellow, Denise holds a PhD in computer science from the University of Tennessee, where her research coined the concept of “social fingerprinting” by applying graph algorithms to predict user identity from social media interactions. Since then, she has built, published on, patented, and spoken about dozens of topics related to graph theory, graph algorithms, graph databases, and applications of graph data across all industry verticals.

Presentations

Taking graph applications to production Session

The graph community has spent years defining and describing its passion: applying graph thinking to solve difficult problems. Denise Gosnell leverages years of experience shipping large-scale applications built on graph databases to share practical and tangible decisions that come into play when designing and delivering distributed graph applications...or playing SimCity 2000.

Bill Graham is an architect on the data platform team at Lyft. Bill’s primary area of focus is on data processing applications and analytics infrastructure. Previously, he was a staff engineer on the data platform team at Twitter, where he built streaming compute, interactive query, batch query, ETL, and data management systems; a principal engineer at CBS Interactive and CNET Networks, where he developed ad targeting and content publishing infrastructure; and a senior engineer at Logitech focusing on webcam streaming and messaging applications. He’s contributed to a number of open source projects, including Apache HBase, Apache Hive, and Presto and is an Apache Pig and Apache Heron (incubating) PMC member.

Presentations

Scaling Apache Spark on Kubernetes at Lyft Session

Li Gao and Bill Graham discuss the challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale.

Trevor Grant is an open source technical evangelist at IBM. He’s also a committer on the Apache Mahout project and a contributor to the Apache Streams (incubating), Apache Zeppelin, and Apache Flink projects. In former roles, he called himself a data scientist, but the term is so overused these days that he stopped. Trevor is an organizer of the newly formed Chicago Apache Flink Meetup and has presented at Flink Forward, ApacheCon, Apache Big Data, and other meetups nationwide. Trevor was a combat medic in Afghanistan in 2009 and wrote an award-winning undergraduate thesis between missions. He holds an MS in applied math and an MBA from Illinois State University. He has a dog and a cat and a ’64 Ford, and he loves them all very much.

Presentations

Cross-cloud model training and serving with Kubeflow Tutorial

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud.

Brian Patrick Green is the director of technology ethics at the Markkula Center for Applied Ethics at Santa Clara University. His responsibilities include representing the center at the Partnership on Artificial Intelligence to Benefit People and Society, speaking and publishing on AI ethics as well as various other topics in ethics and technology, and coordinating the center’s partnership with the Tech Museum of Innovation in San Jose. Brian also reviews and evaluates applications to the center’s Hackworth grant program, which awards funding to SCU faculty, staff, and students for work in applied ethics; coordinates the Technology and Ethics Faculty Group; helps coach and coordinate the university’s Ethics Bowl team; works with the center’s Environmental Ethics Fellows; and assists with several other initiatives. In addition, he teaches engineering ethics in the Graduate School of Engineering. Brian’s background includes doctoral and master’s degrees in ethics and social theory from the Graduate Theological Union in Berkeley and an undergraduate degree in genetics from the University of California, Davis. He has conducted molecular biology research in both academic and industrial settings, and between college and graduate school, he served for two years in the Jesuit Volunteers International, teaching high school in the Marshall Islands. His research interests include multiple topics in the ethics of technology, such as AI and ethics, the ethics of space exploration and use, the ethics of technological manipulation of humans, the ethics of mitigation of and adaptation toward risky emerging technologies, and various aspects of the impact of technology and engineering on human life and society, including the relationship of technology and religion (particularly the Catholic Church). Many of his writings can be found at his academia.edu page.

Presentations

An introduction to data ethics Ethics Summit

The term “technology ethics” comes up frequently these days but is not always well understood. In order to consider technology ethics in depth, we need a shared understanding of its content. Irina Raicu and Brian Green explore what ethics is, and more narrowly, the meaning of data ethics.

Michael Gregory leads the field team for machine learning at Cloudera, helping organizations derive business value from machine learning. Michael has more than 20 years of experience building, selling, implementing, and supporting large-scale data management solutions at Sun Microsystems, Oracle, Teradata, and Hortonworks and has seen and evangelized the power of data to transform organizations and industries, from automotive to telco and public sector to manufacturing.

Presentations

Machine learning and GDPR Session

The General Data Protection Regulation (GDPR) enacted by the European Union restricts the use of machine learning practices in many cases. Michael Gregory offers an overview of the regulations, important considerations for both EU and non-EU organizations, and tools and technologies to ensure that you're appropriately using ML applications to drive continued transformation and insights.

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Disrupting data discovery Session

Lyft has reduced the time it takes to discover data by 10x by building its own data portal, Amundsen. Mark Grover and Tao Feng offer a demo of Amundsen and lead a deep dive into its architecture, covering how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. They also explore the future roadmap, unsolved problems, and its collaboration model.

Sijie Guo is the PMC chair of Apache BookKeeper and a PMC member of Apache Pulsar. Previously, he led the messaging team at Twitter. Before Twitter, he worked on Yahoo's push notification infrastructure.

Presentations

How Zhaopin.com built its enterprise event bus using Apache Pulsar Session

Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. Sijie Guo and Penghui Li discuss the event bus requirements for Zhaopin.com, one of China's biggest online recruitment services providers, and explain why the company chose Apache Pulsar.

Bulbul Gupta is the founding advisor of Socos Labs, a think tank designing augmented intelligence to maximize human potential through "mad science research," advancing human-centered policy and ethical AI, and advising startups, companies, and governments on using AI ethically. She is the former head of entrepreneurship and market-based approaches at the Clinton Global Initiative and was a global entrepreneurship policy advisor for the 2016 Clinton campaign. She serves as a family office advisor and as a board and investment committee member for Upaya Social Ventures and Pacific Community Ventures. Bulbul is an adjunct lecturer at Hult International Business School and in the NYU Wagner School's corporate social innovation programs and a speaker and coach at Singularity Ventures, 500 Startups, Village Capital, and Women's Startup Lab. She serves on the MacArthur Foundation's place-based investing task force and the Global Social Good/G8 impact investing task forces. She holds a master's in public policy and economics from the University of Michigan, is a "conscious capitalist" and an immigrant daughter of Indian tech entrepreneurs, and lives in Palo Alto with her Jedi daughters.

Bulbul has spent her 18-year career at the intersection of responsible investing, technology and venture, and public policy, working with globally minded entrepreneurs and conscious CEOs to maximize impact for people and society. She is passionate about helping founders realize their visions for a better world, advancing technology innovation and investment in underestimated people and places, and ensuring inclusion and human-centered design. She has helped launch and operationalize a number of innovation strategies in organizations large and small, including change management, talent development, learning, and strategic resource alignment.

Presentations

AI for Human Potential Ethics Summit

Bulbul Gupta, Socos Labs

Chunky Gupta is a member of the technical staff at Mist Systems, where he works on scaling the company’s cloud infrastructure. Previously, he was a software engineer at Yelp, where he developed an autoscaling engine, FleetMiser, to intelligently autoscale Yelp’s Mesos cluster, saving millions of dollars. He also scaled Yelp’s in-house distributed and reliable task runner, Seagull (which he wrote about for the Yelp engineering blog). Before that, he built a Hadoop-based data warehouse system at Vizury. He gave a talk on FleetMiser at re:Invent 2016. Chunky holds an MS in computer science from Texas A&M University.

Presentations

Live Aggregators: A scalable, cost-effective, and reliable way of aggregating billions of messages in real time Session

Osman Sarood and Chunky Gupta discuss Mist's real-time data pipeline, focusing on Live Aggregators (LA)—a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions because it runs on AWS Spot Instances and maintains 70% CPU utilization.

Kapil Gupta is a data science leader at Airbnb in San Francisco, where he leads the data science team focused on launching new travel verticals like Experiences and establishing Airbnb as an end-to-end travel platform. In his time at the company, he has worked on many challenging machine learning and personalization problems in search, pricing, and risk. Previously, he worked at PayPal and Duff & Phelps. He holds a PhD in operations research from Georgia Tech and a BTech from the Indian Institute of Technology (IIT), Madras.

Presentations

Personalizing the guest-booking experience at Airbnb Session

Kapil Gupta explains how Airbnb approaches the personalization of travelers’ booking experiences using machine learning.

Sonal Gupta is a research scientist at Facebook working on conversational AI systems. Previously, she developed deep learning natural language understanding models for conversational AI systems at Viv, a startup later acquired by Samsung. She holds a PhD from Stanford University, where she focused on weakly supervised and interpretable information extraction, and a master's degree from the University of Texas at Austin, where she worked on combining language and vision for information extraction.

Presentations

Natural language understanding in task-oriented conversational AI Session

Sonal Gupta explores practical systems for building a conversational AI system for task-oriented queries and details a way to do more advanced compositional understanding, which can understand cross-domain queries, using hierarchical representations.

Juan Paulo Gutierrez is a senior software engineer at Rakuten, where he leads data architecture, data engineering, and data visualization teams. Paulo contributes to open source projects through code, documentation, feature requests, and discussions. Previously, he was the product development lead for Media Links’s network management software.

Presentations

Building Rakuten analytics: A story of evolutions Session

Juan Paulo Gutierrez explains how a small team in Tokyo went through several evolutions as it built an analytics service that helps 200+ businesses accelerate their decision making. Join in to hear about the background, challenges, architecture, success stories, and best practices behind building and productionizing Rakuten Analytics.

Barkha Gvalani is a partner at GV (formerly known as Google Ventures), where she works on investing operations, product management, and analytics and helps portfolio companies scale their operations through analytics, data warehousing, and business intelligence. Previously, Barkha helped solve hard data problems for Google’s ads and hardware finance teams and was chief of staff on the team overseeing Google’s financial systems strategy. Before that, she worked at Tata Consultancy Services, where she specialized in the leasing business and consulted for GE Commercial Finance.

Presentations

Executive Briefing: Upskilling your business teams to scale analytics in your organization Session

How do you decide if you should invest in upskilling business teams? The question is no longer "if" but "when" and "how." Barkha Gvalani shares a framework for developing and delivering analytics training to nontechnical users.

John Haddad is vice president at Informatica, where he runs product and technical marketing for the Big Data, Enterprise Data Catalog, and Cloud/Hybrid data management product portfolios. He has over 25 years' experience developing and marketing enterprise software, with a focus on enterprise cloud data management over the last 10 years. Previously, John held various positions in product marketing, R&D, and management at Oracle and Right Hemisphere (acquired by SAP). John holds an AB in applied mathematics from UC Berkeley.

Presentations

Understanding the data universe with a data catalog Session

Just like a powerful space telescope that scans the universe, a data catalog scans the data universe to help data scientists and analysts find, collaborate on, and curate data. John Haddad explains how a data catalog can help you find the data you need and trust for analytic and data governance projects.

Patrick Hall is a senior director for data science products at H2O.ai, where he focuses mainly on model interpretability and model management. Patrick is also currently an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning. Previously, Patrick held global customer-facing and R&D research roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the 11th person worldwide to become a Cloudera Certified Data Scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Practical techniques for interpretable machine learning Tutorial

If machine learning can lead to financial gains for your organization, why isn't everyone doing it? One reason is that training machine learning systems with transparent inner workings and auditable predictions is difficult. Patrick Hall details the good, bad, and downright ugly lessons learned from his years of experience implementing solutions for interpretable machine learning.

Melinda Han Williams is the chief data scientist at Dstillery. Before joining the ad tech industry, Melinda worked as a physicist developing third generation photovoltaics and studying electronic transport in nanostructured graphene devices. Her peer-reviewed journal publications have been cited over 8,000 times. Melinda holds bachelor’s degrees in applied math and engineering physics from the University of California at Berkeley and a PhD in applied physics with distinction from Columbia University, where she held a National Science Foundation Graduate Research Fellowship.

Presentations

Artificial intelligence on human behavior: New insights into customer segmentation Session

Customer segmentation based on coarse survey data is a staple of traditional market research. Melinda Han Williams explains how Dstillery uses neural networks to model the digital pathways of 100M consumers and uses the resulting embedding space to cluster customer populations into fine-grained behavioral segments and inform smarter consumer insights—in the process, creating a map of the internet.

Vikas Hardia is a BI and Hadoop expert at Kyvos, where he helps businesses derive insights from data on big data platforms.

Presentations

How Walgreens transformed supply chain management with Kyvos, Tableau, and big data Session

Walgreens recently faced the challenge of analyzing 466 billion rows of data from 20,000 suppliers and 9,000 stores, which strained its existing systems when dealing with the scale and cardinality of data. Neerav Jain, Vikas Hardia, and Anne Cruz describe how they used Kyvos and Tableau to transform Walgreens's supply chain with instant, interactive analysis on two years of data.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Janet Haven is the executive director of Data & Society. Previously, she was Data & Society’s director of programs and strategy. Prior to Data & Society, Janet spent more than a decade at the Open Society Foundations, where she oversaw funding strategies and grant making related to technology’s role in supporting and advancing civil society, particularly in the areas of human rights and governance and accountability. She started her career in technology startups in Central Europe, participating in several successful acquisitions. She sits on the board of the Public Lab for Open Science and Technology and advises a range of nonprofit organizations. Janet holds an MA from the University of Virginia and a BA from Amherst College.

Presentations

The conscience of a company Session

Tim O'Reilly will be joined by Janet Haven, executive director of Data & Society, and Catherine Bracy, director of the TechEquity Collaborative, to discuss ways in which tech employees are flexing their muscles as the conscience of their companies.

Yaron Haviv is CTO at iguazio. Yaron is a serial entrepreneur with deep technological experience in big data, cloud, storage, and networking. Previously, he was the vice president of data center solutions at Mellanox, where he led technology innovation, software development, and solution integrations, and the CTO and vice president of R&D at Voltaire, a high-performance computing, I/O, and networking company. Yaron is a CNCF member and one of the authors in the CNCF working group.

Presentations

Goodbye, data lake: Why continuous analytics yield higher ROI Session

Faced with the need to handle increasing volumes of data, alternative datasets ("alt data"), and AI, many enterprises are working to design or redesign their big data architectures, but traditional batch platforms fail to generate sufficient ROI. Yaron Haviv shares a continuous analytics approach that yields faster answers for the business while remaining simpler and less expensive for IT.

Terry He is a senior director of engineering at MapR, where he manages MapR’s Hadoop and ecosystem engineering teams and leads the company’s AI/ML initiatives.

Presentations

Persistent storage for machine learning in KubeFlow Session

KubeFlow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by KubeFlow.

Norris Heintzelman is a senior research and data scientist with 19 years’ real-world experience converting data into knowledge—that is, 19 years’ experience in many areas of natural language processing, knowledge systems, cleaning and normalizing messy data, and rigorous accuracy measurement. Norris has published several papers in the fields of health informatics and general knowledge management. She has worked for Lockheed Martin for a very long time, in multiple business areas, from public sector contracts to advanced R&D to internal business process support. An alumna of both Temple University and the University of Pennsylvania, she lives in Wilmington, Delaware, with her husband, two daughters, and two cats. She likes to eat and talk about food.

Presentations

NLP from scratch: Solving the cold start problem for natural language processing Session

How do you train a machine learning model with no training data? Michael Johnson and Norris Heintzelman share their journey implementing multiple solutions to bootstrapping training data in the NLP domain, covering topics including weak supervision, building an active learning framework, and annotation adjudication for named-entity recognition.

Michael Ho is a software engineer at Cloudera. He has worked on various parts of the Apache Impala query execution engine, such as reducing codegen time, overhauling expression evaluation, and most recently, making Impala more scalable. Before Cloudera, Michael built hypervisors and VMMs at VMware.

Presentations

Accelerating analytical antelopes: Integrating Apache Kudu's RPC into Apache Impala Session

In recent years, Apache Impala has been deployed to clusters that are large enough to hit architectural limitations in the stack. Lars Volker and Michael Ho cover the efforts to address the scalability limitations in the now legacy Thrift RPC framework by using Apache Kudu's RPC, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos.

Chris Holdgraf is a data science fellow at the Berkeley Institute for Data Science and a community architect at the Data Science Education Program at UC Berkeley. His background is in cognitive and computational neuroscience, where he used predictive models to understand the auditory system in the human brain. He’s interested in the boundary between technology, open source software, and scientific workflows, as well as creating new pathways for this kind of work in science and the academy. He’s a core member of Project Jupyter, specifically working with JupyterHub and Binder, two open source projects that make it easier for researchers and educators to do their work in the cloud. He works on these core tools, along with research and educational projects that use these tools at Berkeley and in the broader open science community.

Presentations

Jupyter Book: Online interactive books with the Jupyter Notebook Session

Chris Holdgraf shares recent tools from the Jupyter project in partnership with UC Berkeley that facilitate communication with Jupyter and get us closer to displaying notebook-style content in a more discoverable and reader-friendly form—allowing you to turn collections of notebooks into an online book and connect this content with the cloud in order to make your online content interactive.

Robert Horton is a senior data scientist on the Microsoft knowledge graph team, where he analyzes customer data and helps design and evaluate approaches for knowledge extraction. He holds an adjunct faculty appointment in health informatics at the University of San Francisco and has a particular interest in educational simulations.

Presentations

Building high-performance text classifiers on a limited labeling budget Session

Robert Horton, Mario Inchiosa, and Ali Zaidi demonstrate how to use three cutting-edge machine learning techniques—transfer learning from pretrained language models, active learning to make more effective use of a limited labeling budget, and hyperparameter tuning to maximize model performance—to up your modeling game.

Jeremy Howard is an entrepreneur, business strategist, developer, and educator. Jeremy is a founding researcher at fast.ai, a research institute dedicated to making deep learning more accessible. He is also a Distinguished Research Scientist at the University of San Francisco, a faculty member at Singularity University, and a Young Global Leader with the World Economic Forum. Jeremy’s most recent startup, Enlitic, was the first company to apply deep learning to medicine and was selected one of the world’s top 50 smartest companies by MIT Tech Review two years running. Previously, he was the president and chief scientist at the data science platform Kaggle, where he was the top ranked participant in international machine learning competitions two years running; was the founding CEO of successful Australian startups FastMail and Optimal Decisions Group (acquired by Lexis-Nexis); and spent eight years in management consulting at McKinsey & Co. and AT Kearney. Jeremy has invested in, mentored, and advised many startups and contributed to many open source projects. He has made a number of television and video appearances, including as a regular guest on Australia’s highest-rated breakfast news program, a popular talk on TED.com, and data science and web development tutorials and discussions.

Presentations

Deep learning applications for non-engineers Session

Jeremy Howard describes how to leverage the latest research from the deep learning and HCI communities to train neural networks from scratch—without code or preexisting labels. He then shares case studies in fashion, retail and ecommerce, travel, and agriculture where these approaches have been used.

Joel Hron is chief technology officer at ThoughtTrace, where he's responsible for product innovation and strategy, the development and application of AI and machine learning, and full stack software development. His team comprises data scientists, full stack developers, DevOps engineers, quality assurance engineers, and domain experts who all work together to deliver innovative applied AI solutions that function in a highly scalable and reliable way. Previously, he held various engineering and management roles at Anadarko Petroleum Corporation, where he helped develop and shepherd new technologies enabling digital operations, better reservoir characterization, and field development optimization, with a focus on applied data science, artificial intelligence, machine learning, and other big data technologies. He holds a BS in mechanical engineering from TCU and an MS in mechanical engineering from the University of Texas.

Presentations

Applied AI and NLP for enterprise contract intelligence (sponsored by ThoughtTrace) Session

Building a SaaS AI company targeted at enterprise users presents unique challenges, both technical and nontechnical. Joel Hron and Nick Vandivere walk you through ThoughtTrace's journey, highlighting its beginnings as a company and sharing the challenging use cases the company tackled first.

Chenhui Hu is a data scientist in the Cloud and AI Division at Microsoft. His current interests include retail forecasting, inventory optimization, IoT data, and deep learning. He holds a PhD from Harvard University, where his thesis focused on biomedical imaging data mining. He also has research experience in wireless networks and network data analysis. He's a recipient of the third IEEE ComSoc Asia-Pacific Outstanding Paper Award.

Presentations

Dilated neural networks for time series forecasting Session

Dilated neural networks are a class of recently developed neural networks that achieve promising results in time series forecasting. Chenhui Hu discusses representative network architectures of dilated neural networks and demonstrates their advantages in terms of training efficiency and forecast accuracy by applying them to solve sales forecasting and financial time series forecasting problems.

Emily Huang is senior manager of the data science team at LinkedIn. Emily has more than 10 years of experience in security data science, customer operation analytics, and data product development across industries including finance, ecommerce, social networks, and SaaS. She’s passionate about translating business problems into qualitative questions, solving them by synthesizing and mining large-scale data, and driving business decisions. She’s also enthusiastic about developing the data science community via mentoring, volunteering, and being an evangelist of data-informed culture for all organizations.

Presentations

Using the full spectrum of data science to drive business decisions Tutorial

Thanks to the rapid growth in data resources, business leaders now appreciate the importance (and the challenge) of mining information from data. Join in as a group of LinkedIn's data scientists share their experiences successfully leveraging emerging techniques to assist in intelligent decision making.

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of Ververica, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and is currently spending a lot of his time writing a book, Stream Processing with Apache Flink.

Presentations

Flink SQL in action Session

Processing streaming data with SQL is becoming increasingly popular. Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He then shares a selection of common use cases and demonstrates how easily they can be addressed with Flink SQL.

Introduction to Flink via Flink SQL Tutorial

Fabian Hueske offers an overview of Apache Flink via the SQL interface, covering stream processing and Flink's various modes of use. Then you'll use Flink to run SQL queries on data streams and contrast this with the Flink DataStream API.

Cory Ilo is a computer vision engineer in the Automotive Solutions Group at Intel, where he helps prototype and research the feasibility of various computer vision solutions in relation to privacy, ethics, deep learning, and autonomous vehicles. In his spare time, Cory focuses on his passion for fitness, video games, and wanderlust, in addition to finding ways they tie into computer vision.

Presentations

AI privacy and ethical compliance toolkit Tutorial

From healthcare to smart home to autonomous vehicles, new applications of autonomous systems are raising ethical concerns about a host of issues, including bias, transparency, and privacy. Iman Saleh, Cory Ilo, and Cindy Tseng demonstrate tools and capabilities that can help data scientists address these concerns and bridge the gap between ethicists, regulators, and machine learning practitioners.

Mario Inchiosa is a principal software engineer at Microsoft, where he focuses on scalable machine learning and AI. Previously, Mario served as Revolution Analytics’s chief scientist; analytics architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R; US chief scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances; US chief science officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining; and senior scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds bachelor’s, master’s, and PhD degrees in physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning publication of the year and open literature publication excellence awards.

Presentations

Building high-performance text classifiers on a limited labeling budget Session

Robert Horton, Mario Inchiosa, and Ali Zaidi demonstrate how to use three cutting-edge machine learning techniques—transfer learning from pretrained language models, active learning to make more effective use of a limited labeling budget, and hyperparameter tuning to maximize model performance—to up your modeling game.

Alex Ingerman is a product manager at Google AI, focusing on federated learning and other privacy-preserving technologies. His mission is to enable all ML practitioners to protect their users’ privacy by default. Previously, Alex worked on ML-as-a-service platforms for developers, web-scale search, content recommendation systems, and immersive data exploration and visualization. Alex lives in Seattle, where as a frequent bike and occasional kayak commuter, he has fully embraced the rain. Alex holds a BS in computer science and an MS in medical engineering.

Presentations

The future of machine learning is decentralized Session

Federated learning is an approach for training ML models across a fleet of participating devices without collecting their data in a central location. Alex Ingerman offers an overview of federated learning, compares traditional and federated ML workflows, and explores the current and upcoming use cases for decentralized machine learning, with examples from Google's deployment of this technology.

Akihiro Ishikawa is a software engineer at Cloudera.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

There are many challenges with moving multidisciplinary big data workloads to the cloud and running them. Jason Wang, Brandon Freeman, Michael Kohs, Akihiro Nishikawa, and Toby Ferguson explore cloud architecture and its challenges and walk you through using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Maryam Jahanshahi is a research scientist at TapRecruit, a company that is developing software tools to implement evidence-based recruiting. TapRecruit's research program integrates recent advances in NLP, data science, and decision science to identify robust methods to reduce bias in talent decision making and attract more qualified and diverse candidate pools. In a past life, Maryam was a cancer scientist researching how growing organs "know" they've reached the right size. She's originally from Melbourne, Australia.

Presentations

Shortcuts that short-circuit talent pipelines: Data-driven optimization of hiring Session

Hiring teams largely rely on both intuition and experience to scout talent for data science and data engineering roles. Drawing on results from analyzing over 15 million jobs and their outcomes, Maryam Jahanshahi interrogates these “common sense” judgments to determine whether they help or hurt hiring of data scientists and engineers.

Neerav Jain is a technical architect at Walgreens, where he has implemented campaign personalization for all Walgreens customers using Hadoop as well as end-to-end supply chain scorecards with over 120 TB of data using Hadoop, Kyvos, and Tableau. A senior architect with 15+ years of experience, he has architected and operationalized Hadoop across multiple healthcare and retail organizations and has worked on multiple enterprise data warehouses, including some of the largest in the healthcare industry.

Presentations

How Walgreens transformed supply chain management with Kyvos, Tableau, and big data Session

Walgreens recently faced the challenge of analyzing 466 billion rows of data from 20,000 suppliers and 9,000 stores, which strained its existing systems when dealing with the scale and cardinality of data. Neerav Jain, Vikas Hardia, and Anne Cruz describe how they used Kyvos and Tableau to transform Walgreens's supply chain with instant, interactive analysis on two years of data.

Michael Johnson is a senior data scientist at Lockheed Martin. He has done data science and analytics in fields including manufacturing optimization, semiconductor reliability, and human resources-focused time series forecasting and simulation. He has recently been focused on how to apply cutting-edge deep learning algorithms to NLP domains.

Presentations

NLP from scratch: Solving the cold start problem for natural language processing Session

How do you train a machine learning model with no training data? Michael Johnson and Norris Heintzelman share their journey implementing multiple solutions to bootstrapping training data in the NLP domain, covering topics including weak supervision, building an active learning framework, and annotation adjudication for named-entity recognition.

Theresa Johnson is a product manager for metrics and forecasting products at Airbnb. As a data scientist, she was part of the task force and cross-functional hackathon team at Airbnb that worked to develop the framework for the current antidiscrimination efforts. Theresa is a founding board member of Street Code Academy, a nonprofit dedicated to high-touch technical training for inner city youth, and has been featured in TechCrunch for her commitment to helping early-stage founders raise capital. Theresa is passionate about extending technology access for everyone and finding mission-driven companies that can have an outsized impact on the world. She holds a PhD in aeronautics and astronautics and dual undergraduate degrees in science, technology, and society and computer science, all from Stanford University.

Presentations

Forecasting uncertainty at Airbnb Keynote

Airbnb uses AI and machine learning in many parts of its user-facing business. But it's also advancing the state of AI-powered internal tools. Theresa Johnson details the AI powering Airbnb's next-generation end-to-end metrics forecasting platform, which leverages machine learning, Bayesian inference, TensorFlow, Hadoop, and web technology.

Ken Johnston is the principal data science manager for the Microsoft 360 Business Intelligence Group (M360 BIG). In his time at Microsoft, Ken has shipped many products, including Commerce Server, Office 365, Bing Local and Segments, and Windows, and for two and a half years, he was the director of test excellence. A frequent keynote presenter, trainer, blogger, and author, Ken is a coauthor of How We Test Software at Microsoft and contributing author to Experiences of Test Automation: Case Studies of Software Test Automation. He holds an MBA from the University of Washington. Check out his blog posts on data science management on LinkedIn.

Presentations

Executive Briefing: The 6 keys to successful data spelunking Session

At the rate data sources are multiplying, business value can often be developed faster by joining data sources rather than mining a single source to the very end. Ken Johnston and Ankit Srivastava share four years of hands-on practical experience sourcing and integrating massive numbers of data sources to build the Microsoft Business Intelligence Graph (M360 BIG).

Infinite segmentation: Scalable mutual information ranking on real-world graphs Session

Today, normal growth isn't enough—you need hockey-stick levels of growth. Sales and marketing orgs are looking to AI to "growth hack" their way to new markets and segments. Ken Johnston and Ankit Srivastava explain how to use mutual information at scale across massive data sources to help filter out noise and share critical insights with new cohorts of users, businesses, and networks.

Eric Jonas is a postdoc in the new Berkeley Center for Computational Imaging and RISELab at UC Berkeley EECS working with Ben Recht. His research interests include biological signal acquisition, inverse problems, machine learning, heliophysics, neuroscience, and other exciting ways of exploiting scalable computation to understand the world.

Presentations

Cloud programming simplified: A Berkeley view on serverless computing Session

Eric Jonas offers a quick history of cloud computing, including an accounting of the predictions of the 2009 "Berkeley View of Cloud Computing" paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential.

Jowanza Joseph is principal software engineer at One Click Retail. Jowanza’s work is focused on distributed stream processing and distributed data storage.

Presentations

Reducing stream processing complexity using Apache Pulsar Functions Session

After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, Jowanza Joseph and Karthik Ramasamy decided to explore a new platform that would take advantage of Kubernetes and support a simpler data processing DSL. Join in to discover why they chose Apache Pulsar and learn tips and tricks for using Pulsar Functions.

Yiannis Kanellopoulos has spent the better part of two decades analyzing and evaluating software systems in order to help organizations address any potential risks and flaws related to them. (In his experience, these risks or flaws are always due to human involvement.) With Code4Thought, Yiannis is turning his expertise into democratizing technology by rendering algorithms transparent and helping organizations become accountable. Targeted outcomes of his work include building trust between the organization utilizing the algorithms and those affected by its output and rendering the algorithms more persuasive, since their reasoning will be easier to explain. He’s also a founding member of Orange Grove Patras, a business incubator sponsored by the Dutch Embassy in Greece to promote entrepreneurship and counter youth unemployment. Yiannis holds a PhD in computer science from the University of Manchester.

Presentations

On the accountability of black boxes: How to control what you can’t exactly measure Ethics Summit

Black box algorithmic systems make decisions that have a great impact in our lives. Thus, the need for their accountability and transparency is growing. Code4Thought created an evaluation model reflecting the state of practice in several organizations. Yiannis Kanellopoulos explores this model and shares lessons learned from its application at a financial corporation.

Panel: Solutions Ethics Summit

We've looked at some possible solutions. But a more complete perspective includes what's right and where we're making mistakes—from the reproducibility of social sciences to the regulations of governments. We wrap up our look at possible solutions with a group discussion.

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Cross-cloud model training and serving with Kubeflow Tutorial

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud.

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (a.k.a. tuning) or our jobs may be eaten by Cthulhu. Holden Karau and Rachel Warren explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads—including new settings in 2.4.

Alon Kaufman is the cofounder and CEO at Duality Technologies. Previously, he was RSA’s global director of data science and innovation, leading data science across the company’s full portfolio. Alon has over 20 years of experience in technology and innovation management in high-tech companies, dealing with various aspects of artificial intelligence. He holds a PhD in computational neuroscience and machine learning from the Hebrew University and an MBA from Tel Aviv University.

Presentations

Machine learning on encrypted data: Challenges and opportunities Session

Alon Kaufman and Vinod Vaikuntanathan discuss the challenges and opportunities of machine learning on encrypted data and describe the state of the art in this space.

Until recently, Arun Kejariwal was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers and worked on research and development of novel techniques for install and click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns. In addition, his team built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high-performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Architecture and algorithms for end-to-end streaming data processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.

Jinchul Kim is a senior software engineer at SK Telecom, where he leads cloud platform development using Kubernetes, Docker, Apache Druid, and Apache Hadoop and designed and implemented a Dockerized DevOps framework. Previously, he was a senior software engineer at SAP Labs working on the SAP HANA in-memory engine. Jinchul is a committer to the Apache Impala project.

Presentations

Apache Druid autoscale-out/in for streaming data ingestion on Kubernetes Session

Druid supports autoscaling for data ingestion, but it's only available on AWS EC2; you can't rely on the feature in your private cloud. Jinchul Kim demonstrates autoscale-out/in on Kubernetes, details the benefits of this approach, and discusses the development of Druid Helm charts, rolling updates, and custom metric usage for horizontal autoscaling.

Alex Kira is an engineering tech lead at Uber, where he works on the data workflow management team. His team provides a data infrastructure platform for thousands of engineers, data scientists, and city ops, thereby empowering them to own and manage their data pipelines. During his 19-year career, he’s had experience across several software disciplines, including distributed systems, data infrastructure, and full stack development, giving him a holistic systems view of his projects. He holds an undergraduate degree in computer science from the University of Miami and a master’s degree from the Georgia Institute of Technology. In his free time, Alex enjoys hiking around the Bay Area, rock climbing, and traveling internationally.

Presentations

Managing Uber's data workflows at scale Session

Uber operates at scale, with thousands of microservices serving millions of rides a day, leading to 100+ PB of data. Alex Kira details Uber's journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected the system to make it highly available and horizontally scalable.

Tobi Knaup is the cofounder and CTO at Mesosphere, a hybrid cloud platform company that helps companies such as NBCUniversal, Deutsche Telekom, and Royal Caribbean adopt transformative technologies like machine learning and real-time analytics with ease. He was one of the first engineers and tech lead at Airbnb, where he wrote large parts of the company’s infrastructure, including its search and fraud prediction services, and helped scale the site to millions of users and build a world-class engineering team. Tobi is the main author of Marathon, Mesosphere’s container orchestrator.

Presentations

Deep learning beyond the learning Session

There are many great tutorials for training your deep learning models, but training is only a small part of the overall deep learning pipeline. Tobias Knaup and Joerg Schad offer an introduction to building a complete automated deep learning pipeline, starting with exploratory analysis and continuing through training, model storage, model serving, and monitoring.

Michael Kohs is a product manager at Cloudera.

Presentations

How to survive future data warehousing challenges with the help of a hybrid cloud Session

Michael Kohs, Eva Andreasson, and Mark Brine explain how Cloudera’s finance department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. They also share guidelines for deploying modern data warehousing in a hybrid cloud environment, outlining when you should choose a private cloud service over a public one, the available options, and some dos and don'ts.

Running multidisciplinary big data workloads in the cloud Tutorial

There are many challenges with moving multidisciplinary big data workloads to the cloud and running them. Jason Wang, Brandon Freeman, Michael Kohs, Akihiro Nishikawa, and Toby Ferguson explore cloud architecture and its challenges and walk you through using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Jari Koister is vice president of product and technology at FICO. He also teaches in the Data Science Program at UC Berkeley. Previously, he was vice president of technology at AgilOne, leading software engineering, product data science, and technical operations; led the development of Chatter, Salesforce’s social enterprise application and platform; founded and served as CTO at Groupswim.com, an early social enterprise collaboration company (acquired by Salesforce); founded and served as CSO and CTO at Qrodo.com, an elastic platform for broadcasting sports events live on the internet; led the development of CommerceOne’s flagship product MarketSite; and led research in computer languages and distributed computing at Ericsson Labs and Hewlett-Packard Laboratories. Jari holds a PhD in computer science from the Royal Institute of Technology, Stockholm, Sweden.

Presentations

Interpretable and resilient AI for financial services Session

Financial services are increasingly deploying AI services for a wide range of applications, such as identifying fraud and financial crimes. Such deployment requires models to be interpretable, explainable, and resilient to adversarial attacks—regulatory requirements prohibit black-box machine learning models. Jari Koister shares tools and infrastructure FICO has developed to support these needs.

Jing (Nicole) Kong is a data scientist at Office Depot, where she works with big data and transforms data and models into products and services that drive business. She’s experienced with a number of different machine learning and deep learning models.

Presentations

User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Session

User-based real-time recommendation systems have become an important topic in ecommerce. Lu Wang, Nicole Kong, Guoqiong Song, and Maneesha Bhalla demonstrate how to build deep learning algorithms using Analytics Zoo with BigDL on Apache Spark and create an end-to-end system to serve real-time product recommendations.

Gabor Kotalik is a big data project lead at Deutsche Telekom, where he’s responsible for continuous improvement of customer analytics and machine learning solutions for commercial roaming business. He has more than 10 years of experience in business intelligence and advanced analytics focusing on utilization of insights and enabling data-driven business decisions.

Presentations

Data science at Deutsche Telekom: Predicting global travel patterns and network demand Session

Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management.


Presentations

ML and AI at scale at PayPal Session

The PayPal data ecosystem is large, with 250+ PB of data transacting in 200+ countries. Given this massive scale and complexity, discovering and accessing the right datasets in a frictionless environment is a challenge. Subhadra Tatavarti and Chen Kovacs explain how PayPal’s data platform team is helping solve this problem with a combination of self-service integrated and interoperable products.

Chi-Yi Kuan is director of data science at LinkedIn. He has over 15 years of extensive experience applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.

Presentations

Using the full spectrum of data science to drive business decisions Tutorial

Thanks to the rapid growth in data resources, business leaders now appreciate the importance (and the challenge) of mining information from data. Join in as a group of LinkedIn's data scientists share their experiences successfully leveraging emerging techniques to assist in intelligent decision making.

Aleksandra Kudriashova is head of product at Astro Digital, a platform for fast and easy access to satellite imagery. She’s passionate about data pipelines and analytics on top of satellite imagery. Previously, she was a cofounder of ImageAiry, an online marketplace for satellite imaging services.

Presentations

Understanding the world food economy with satellite images and AI Data Case Studies

It has become possible to use satellites to observe food growing at a global scale—using daily satellite images to glean agriculture-specific insights and predict productivity. Alex Kudriashova offers an overview of current publicly available satellite imagery data and explains how to inject it into your data pipeline and train and deploy AI/ML models based on it.

Abhishek Kumar is a senior manager of data science in Sapient’s Bangalore office, where he looks after scaling up the data science practice by applying machine learning and deep learning techniques to domains such as retail, ecommerce, marketing, and operations. Abhishek is an experienced data science professional and technical team lead specializing in building and managing data products from conceptualization to deployment phase and interested in solving challenging machine learning problems. Previously, he worked in the R&D center for the largest power-generation company in India on various machine learning projects involving predictive modeling, forecasting, optimization, and anomaly detection and led the center’s data science team in the development and deployment of data science-related projects in several thermal and solar power plant sites. Abhishek is a technical writer and blogger as well as a Pluralsight author and has created several data science courses. He is also a regular speaker at various national and international conferences and universities. Abhishek holds a master’s degree in information and data science from the University of California, Berkeley.

Presentations

The hitchhiker's guide to deep learning-based recommenders in production Tutorial

Abhishek Kumar and Pramod Singh walk you through deep learning-based recommender and personalization systems they've built for clients. Join in to learn how to use TensorFlow Serving and MLflow for end-to-end productionalization, including model serving, Dockerization, reproducibility, and experimentation, and Kubernetes for deployment and orchestration of ML-based microarchitectures.

Arun Kumar is an assistant professor in the Department of Computer Science and Engineering at the University of California, San Diego. He’s a member of the Database Lab and CNS and an affiliate member of the AI Group. His primary research interests are in data management and systems for machine learning- and artificial intelligence-based data analytics. Systems and ideas based on his research have been released as part of the MADlib open source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, Microsoft, and other companies. He’s a recipient of the ACM SIGMOD 2014 Best Paper Award, the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS, and a 2016 Google Faculty Research Award.

Presentations

Faster ML over joins of tables Session

Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.

Rakesh Kumar is a software engineer on the pricing team at Lyft. He started his career as an embedded software engineer for mobile devices; later he moved to server-side engineer to tackle bigger challenges in distributed systems. His interests include machine learning and streaming systems.

Presentations

The magic behind your Lyft ride prices: A case study on machine learning and streaming Session

Rakesh Kumar and Thomas Weise explore how Lyft dynamically prices its rides with a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability—allowing the pricing system to be more adaptable to real-world changes.

Ram Shankar is a data cowboy on the Azure security data science team at Microsoft, where his team focuses on modeling massive amounts of security logs to surface malicious activity. His work has appeared in industry conferences like DEF CON, BSides, BlueHat, DerbyCon, MIRCon, Infiltrate, and Strata as well as academic conferences like NIPS and ACM-CCS. Ram holds a degree focused on machine learning and security from Carnegie Mellon University. He’s currently an affiliate at the Berkman Klein Center at Harvard, exploring the intersection of machine learning and security.

Presentations

Framework to quantitatively assess ML safety: Technical implementation and best practices Session

How can we guarantee that the ML system we develop is adequately protected from adversarial manipulation? Ram Shankar Kumar shares a framework and corresponding best practices to quantitatively assess the safety of your ML systems.

Santosh Kumar is a senior product manager at Cloudera, where he leads SDX, Cloudera’s shared data experience offering. Previously, he was a data scientist at Facebook and a software engineer at Yahoo and Akamai. He holds a BS in computer science from IIT Kanpur, India, and an MBA from INSEAD, France.

Presentations

Hands-on with Cloudera SDX: Setting up your own shared data experience Tutorial

Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX.

Kumar Sricharan is a principal data scientist responsible for leading Intuit’s Machine Learning (ML) Research Group. His team focuses on cutting-edge ML problems and applications for financial artificial intelligence, combining financial domain knowledge with information in data to produce more accurate and explainable systems. Projects include extraction of information from financial documents, chatbots for financial conversations, mining tax forms to enable compliance, understanding transactions to power financial advice, and financial forecasting. Previously, he spent five years at Xerox PARC as a senior research scientist and program manager for self-learning systems. During his time at PARC, Kumar led a team focused on building learning algorithms that can exploit rich feedback in conjunction with unlabeled data to overcome the need for prohibitively large numbers of labeled examples and make these systems explainable. His other research interests include anomaly detection for large, unstructured data and nonparametric large sample estimation, and his research has resulted in nearly 30 papers in refereed conferences and journals and several accompanying patents. He’s an active member of the academic community and has served as a reviewer for multiple conferences and journals. He has also participated in several DARPA research projects, including the ADAMS program for detecting insider threat and XAI for building explainable AI agents. Kumar holds a PhD in electrical engineering from the University of Michigan.

Presentations

Modern techniques for building robust deep networks Session

Machine learning is delivering immense value across industries. However, in some instances, machine learning models can produce overconfident results—with the potential for catastrophic outcomes. Kumar Sricharan explains how to address this challenge through Bayesian machine learning and highlights real-world examples to illustrate its benefits.

Lauren Kunze is the CEO of Pandorabots, a leading chatbot platform that powers conversational AI software for hundreds of thousands of developers and top global brands. She’s an expert on state-of-the-art solutions for hard AI problems like natural language understanding and generation with respect to the larger AI ecosystem. Lauren built her first chatbot at age 15, and she speaks and writes frequently about chatbots and AI at conferences like Mobile World Congress and South by Southwest and for publications like TechCrunch and Quartz. Pandorabots has been quoted or covered in outlets like the Wall Street Journal, the BBC, the Guardian, Wired, Quartz, RadioLab, TechCrunch, Reuters, the New Yorker, and the New York Times Magazine. Lauren is the author of four novels published by HarperCollins as well as scripts for film and television. She holds a degree from Harvard in literature and language and neuroscience.

Latheef Syed is head of analytics at Verizon. An executive thought leader, Latheef leads the company’s enterprise data science and analytical teams to enable predictive and prescriptive analytics through advanced deep learning techniques both on-premises and in the cloud. He’s adept at delivering industry best practices by building a system-of-insights framework that leverages big data and AI platforms with integrated business solutions. He has 20+ years’ experience leading enterprise analytical application development and systems integration and is an expert at program management at all levels of the business.

Presentations

Transforming AI, ML, and BI on big data at Verizon (sponsored by Kyvos Insights) Session

Verizon wanted to use its BI on Big Data platform to enable real-time artificial intelligence and machine learning to identify friction points, detect anomalies on the fly, and fix issues instantly. Latheef Syed explains how Verizon utilizes Kyvos as a next-generation analytical platform that delivers real-time AI, ML, and BI.

JoLynn Lavin is a manager of decision sciences and analytics at General Mills, where she leads a team of analysts focused on unleashing the power of data to drive consumer-led decision making. Previously, JoLynn was a loyalty marketing consultant helping clients acquire, retain, and build profitable relationships with their customers across virtually every industry. JoLynn holds a master’s degree in agricultural and consumer economics from the University of Illinois at Champaign-Urbana.

Presentations

Voice of the customer: A case study in how machine learning can automate consumer insights Data Case Studies

General Mills engages millions of consumers in conversations every year through traditional 1-800 numbers and text messaging with call center agents, online conversations on social media, comments on its recipe websites, and chatbots. JoLynn Lavin explains how General Mills applies machine learning to listen to the voice of its customers, arguably the most powerful force in today’s market.

Francesca Lazzeri is a machine learning scientist on the cloud advocacy team at Microsoft. An expert in big data technology innovations and the applications of machine learning-based solutions to real-world problems, she has worked with these issues in a wide range of industries, including energy, oil and gas, retail, aerospace, healthcare, and professional services. Previously, she was a research fellow in business economics at Harvard Business School, where she performed statistical and econometric analysis within the Technology and Operations Management Unit and worked on multiple patent data-driven projects to investigate and measure the impact of external knowledge networks on companies’ competitiveness and innovation. Francesca periodically teaches applied analytics and machine learning classes at universities in the US and Europe and is a mentor for PhD and postdoc students at the Massachusetts Institute of Technology. She enjoys speaking at academic and industry conferences to share her knowledge and passion for AI, machine learning, and coding. Francesca holds a PhD in innovation management.

Presentations

Cross-cloud model training and serving with Kubeflow Tutorial

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud.

Forecasting financial time series with deep learning on Azure 2-Day Training

Francesca Lazzeri and Jen Ren walk you through the core steps for using Azure Machine Learning services to train your machine learning models both locally and on remote compute resources.

Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Julien is a principal engineer at WeWork. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

From flat files to deconstructed databases: The evolution and future of the big data ecosystem Session

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.

Mike Lee Williams is a research engineer at Cloudera Fast Forward Labs, where he builds prototypes that bring the latest ideas in machine learning and AI to life and helps Cloudera’s customers understand how to make use of these new technologies. Mike holds a PhD in astrophysics from Oxford.

Presentations

Federated learning Session

Imagine building a model whose training data is collected on edge devices such as cell phones or sensors. Each device collects data unlike any other, and the data cannot leave the device because of privacy concerns or unreliable network access. This challenging situation is known as federated learning. Mike Lee Williams discusses the algorithmic solutions and the product opportunities.

Christopher Lennan is a senior data scientist at idealo.de, where he works on computer vision problems to improve the product search experience. In previous positions, he applied machine learning methods to fMRI and financial data. Christopher holds a master’s degree in statistics from Humboldt Universität Berlin.

Presentations

Using deep learning to automatically rank millions of hotel images Session

Idealo.de recently trained convolutional neural networks (CNN) for aesthetic and technical image quality predictions. Christopher Lennan shares the training approach, along with some practical insights, and sheds light on what the trained models actually learned by visualizing the convolutional filter weights and output nodes of the trained models.

Franck Leveneur is the data service technical lead at Wag!. Franck has spent the last 16 years working as a DBA and data architect across various technology companies, including Davita, TrueCar, Rubicon Project, and ZipRecruiter. He’s an expert in MySQL database design and architecture, performance tuning, capacity planning, and scalability in multitier, high-volume production database ecosystems. He also teaches data management at UCLA Anderson.

Presentations

Break through the limits of your current database (sponsored by MemSQL) Session

MySQL is great but has limits. When you need key-value pair storage with geospatial and JSON support, easy and fast ingestion from various streams, aggregate queries against 100+ million rows in under one second, and more, there's only one solution. Franck Leveneur explains how on-demand dog walking service Wag! uses MemSQL to take its real-time data access and reporting to the next level.

Jimmy Li is a versatile principal developer at Atlassian. He has a breadth of experience from working in a variety of teams and countries, and he’s been part of several key initiatives ranging from single sign-on to segmentation and targeted in-product messaging. Most recently, he was the technical lead on an initiative to change Atlassian into a more data-driven organization by transforming the company’s behavioral analytics solution.

Presentations

Transforming behavioral analytics at Atlassian Session

Analytics is easy, but good analytics is hard. Atlassian knows this all too well. Rohan Dhupelia and Jimmy Li explain how the company's push to become truly data driven has transformed the way it thinks about behavioral analytics, from how it defined its events to how it ingests and analyzes them.

Tianhui Michael Li is the founder and CEO of the Data Incubator. Michael has worked as a lead data scientist at Foursquare, a quant at D.E. Shaw and JPMorgan, and a rocket scientist at NASA. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup that lets him focus on what he really loves. He did his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall Scholar.

Presentations

Big data for managers 2-Day Training

Michael Li and Rich Ott offer a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and utilize their input and analysis for your business’s strategic priorities and decision making.

Executive Briefing: How organizations scale along the data and AI maturity curve Session

As their data and AI teams scale from one to thousands of employees and the maturity of their analytics capabilities evolve, companies find that the analytics journey is not always smooth. Drawing on experiences gleaned from dozens of clients, Michael Li discusses organizational growing pains and the best practices that successful executives have adopted to scale and grow their team.

Penghui Li is a backend engineer at Zhaopin.com, where he leads Apache Pulsar development and contributes heavily to the Pulsar open source project. Penghui has 5+ years of experience developing message queues and microservices.

Presentations

How Zhaopin.com built its enterprise event bus using Apache Pulsar Session

Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. Sijie Guo and Penghui Li discuss the event bus requirements for Zhaopin.com, one of China's biggest online recruitment services providers, and explain why the company chose Apache Pulsar.

Tommy Li is a software developer at IBM focusing on cloud, container, and infrastructure technology. He has worked on various developer journeys, which provide use cases on cloud-computing solutions, such as Kubernetes, microservices, and hybrid cloud deployments. He is passionate about machine learning and big data.

Presentations

Use the Jupyter Notebook to integrate adversarial attacks into a model training pipeline to detect vulnerabilities Session

Animesh Singh and Tommy Li explain how to implement state-of-the-art methods for attacking and defending classifiers using the open source Adversarial Robustness Toolbox. The library provides AI developers with interfaces that support the composition of comprehensive defense systems using individual methods as building blocks.

Xiao Li is a software engineer, Apache Spark committer, and PMC member at Databricks. His main interests are Spark SQL, data replication, and data integration. Previously, he was an IBM master inventor and an expert on asynchronous database replication and consistency verification. He holds a PhD from the University of Florida.

Presentations

Apache Spark 2.4 and beyond Session

Xiao Li and Wenchen Fan offer an overview of the major features and enhancements in Apache Spark 2.4 and give insight into upcoming releases. Then you'll get the chance to ask all your burning Spark questions.

Yang Li is the cofounder and CTO at Kyligence and a cocreator, PMC member, tech lead, and architect for Apache Kylin, focusing on big data analysis, parallel computation, data indices, relational algebra, approximation algorithms, and other technologies. Yang was a senior architect in the Analytic Data Infrastructure Department at eBay; a tech lead for IBM InfoSphere BigInsights, where he was responsible for the Hadoop open source platform and was recognized with the Outstanding Technical Achievement Award; and a vice president at Morgan Stanley, responsible for the global regulatory reporting platform.

Presentations

Augmented OLAP for big data from on-premises to multicloud (sponsored by Kyligence) Session

Augmenting data management and analytics platforms with artificial intelligence and machine learning is game changing for analysts, engineers, and other users. It enables companies to optimize their storage, speed, and spending. Yang Li details the Kyligence platform, which is evolving to the next level with augmented capabilities such as intelligent modeling, smart pushdowns, and more.

Yue Li is a cofounder at MemVerge, where together with his colleagues, he’s developing the company’s core technologies. Previously, he was a senior postdoctoral fellow at the California Institute of Technology. He has extensive research experience on both theoretical and experimental aspects of algorithms for nonvolatile memories. Yue holds a PhD in computer science from Texas A&M University and a bachelor’s degree in computer science from Huazhong University of Science & Technology.

Presentations

Optimizing computing cluster resource utilization with an in-memory distributed filesystem Session

JD.com recently designed a brand-new architecture to optimize Spark computing clusters. Yue Li and Shouwei Chen detail the problems the team faced when building it and explain how the company benefits from the in-memory distributed filesystem now.

Sam Lightstone is CTO for data at IBM, an IBM Fellow, and a Master Inventor in the IBM Data and AI Group. He leads a number of technical teams in product development for relational databases, data warehousing and big data, cloud computing, analytics for IoT, data virtualization, data movement, and machine learning. He cofounded the IEEE Data Engineering Workgroup on Self-Managing Database Systems. Sam has more than 65 patents issued and pending and has authored over 30 papers and four books, which have been translated into Chinese, Japanese, and Spanish. In his spare time, he’s an avid guitar player and fencer.

Presentations

The death of coding: How AI redefines our relationship with computers (sponsored by IBM) Session

Sam Lightstone discusses how AI is fundamentally changing computer science and the practice of coding. Join in to discover what machine learning means today and explore recent advances in hardware and software and breakthrough innovations.

Di Lin is a senior data engineer on the infrastructure and information security team at Netflix, where he focuses on building and scaling complex data systems to help infrastructure teams improve reliability and efficiency. Previously, he was a data engineer at Facebook, where he built company-wide data products related to identity and subscriber growth.

Presentations

Scaling data lineage at Netflix to improve data infrastructure reliability and efficiency Session

Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani, Girish Lingappa, and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency.

Girish Lingappa is a senior software engineer at Netflix.

Presentations

Scaling data lineage at Netflix to improve data infrastructure reliability and efficiency Session

Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani, Girish Lingappa, and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience. He enjoys intelligent design and engaging storytelling and is passionate about data, music, and nature.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Sustaining machine learning in the enterprise Keynote

Keynote with Ben Lorica

Thursday keynote welcome Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynote welcome Keynote

Program chairs Ben Lorica, Alistair Croll, and Doug Cutting welcome you to the first day of keynotes.

Neng Lu is a software engineer at Twitter, where he is the core member of Twitter’s real-time compute team and the core committer to the Apache Heron project (incubating). He has a broad interest in distributed systems and real-time analytics and has worked on Twitter’s key-value storage system, Manhattan; its monitoring system, Cuckoo; and its real-time processing system, Heron. He holds an MS in CS from UCLA and a bachelor’s degree in CS from Zhejiang University.

Presentations

Real-time monitoring of Twitter's network infrastructure with Heron Session

Julien Delange and Neng Lu explain how Twitter uses the Heron stream processing engine to monitor and analyze its network infrastructure—implementing a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. Join in to learn the key technologies used, the architecture, and the challenges Twitter faced.

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Over his career, he has been responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He is also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on machine learning with Kafka-based streaming pipelines Tutorial

Boris Lublinsky and Dean Wampler walk you through using ML in a streaming data pipeline and doing periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.

Patrick Lucey is the vice president of artificial intelligence at STATS, where his goal is to maximize the value of 35+ years’ worth of sports data. His main research interests are in artificial intelligence and interactive machine learning in sporting domains. Previously, Patrick spent five years at Disney Research, where he conducted research into automatic sports broadcasting using large amounts of spatiotemporal tracking data, and was a postdoctoral researcher at the Robotics Institute at Carnegie Mellon University and the Department of Psychology at the University of Pittsburgh, conducting research on automatic facial expression recognition. He holds a BEng(EE) from USQ and a PhD from QUT, Australia. He was a coauthor of the best paper at the 2016 MIT Sloan Sports Analytics Conference and a coauthor of the best paper runner-up at the same conference in 2017 and 2018. Patrick has also won best paper awards at the INTERSPEECH (2007) and WACV (2014) international conferences.

Presentations

Interactive sports analytics Keynote

Patrick Lucey describes methods to find play similarity using multiagent trajectory data and predict fine-grained plays, with examples drawn from STATS SportVU data in basketball and soccer. Patrick then discusses how to go beyond center-of-mass tracking (i.e., dots) and capture body-pose information from broadcast video to take analysis to the next level.

Sports analytics in the wild using player-tracking data Data Case Studies

Patrick Lucey describes methods to find play similarity using multiagent trajectory data and predict fine-grained plays, with examples drawn from STATS SportVU data in basketball and soccer. Patrick then discusses how to go beyond center-of-mass tracking (i.e., dots) and capture body-pose information from broadcast video to take analysis to the next level.

Adrian Lungu is a computer scientist at Adobe working with Audience Manager, a leading solution in the DMP market. He’s been focused on the company’s Cassandra clusters ever since he joined the team, trying to build a scalable architecture that would keep up with the exponential growth of the product. Adrian holds a degree in computer science and engineering from Politehnica University of Bucharest as well as a DataStax Certified Apache Cassandra Professional certification.

Presentations

Database migrations don't have to be painful, but the road will be bumpy Session

Adrian Lungu and Serban Teodorescu explain how—inspired by the blue-green deployment technique—the Adobe Audience Manager team developed an active-passive database migration procedure that allows them to test database clusters in production, minimizing the risks without compromising the innovation.

Zhenxiao Luo is an engineering manager at Uber, where he runs the interactive analytics team. Previously, he led the development and operations of Presto at Netflix and worked on big data and Hadoop-related projects at Facebook, Cloudera, and Vertica. He holds a master’s degree from the University of Wisconsin-Madison and a bachelor’s degree from Fudan University.

Presentations

Real-time analytics at Uber: Bring SQL into everything Session

From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Zhenxiao Luo explains how Uber supports real-time analytics with deep learning on the fly, without any data copying.

Real-time analytics on deep learning: When TensorFlow met Presto at Uber Session

From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Inside Uber, analysts are using deep learning and big data to train models, make predictions, and run analytics in real time. Zhenxiao Luo explains how Uber runs real-time analytics with deep learning.

Mark Madsen is the global head of architecture at Think Big Analytics, where he is responsible for understanding, forecasting, and defining the analytics landscape and architecture. Previously, he was CEO of Third Nature, where he advised companies on data strategy and technology planning and vendors on product management. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Foundations for successful data projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

Scott Mcclellan is CTO at PRGX. A creative, results-driven technology leader, Scott is a change agent and problem solver with a passion for technology. He’s skilled in grasping and explaining the big picture and conceptualizing, developing, and implementing solutions. Scott has substantial experience working with business leaders and C-level executives. Previously, he was chief technologist and VP of engineering for Hewlett-Packard’s cloud services and for scalable computing, where he set technical direction for the company’s scalable computing business and introduced new products focused on cloud service providers and high-performance computing customers.

Presentations

High-performance data lakes for AI workloads using object storage (sponsored by Minio) Session

Recently, Scott Mcclellan's team—which analyzes over six petabytes of data using Hadoop technology—created a high-performance data lake using object storage for consumption by big data workloads. Scott shares his experience deploying object storage for AI workloads.

Prakhar Mehrotra is senior director of machine learning for retail data science at Walmart Labs, where he oversees the research and development of pricing, assortment, replenishment, and planning algorithms to help merchants make smarter decisions. Previously, he was head of data science for finance at Uber, where he led a global team of data scientists and data analysts spread across Amsterdam, Hyderabad, and San Francisco in the research and development of machine learning algorithms related to financial forecasting (supply and demand), budget planning, and economic simulations for autonomous vehicles. He also worked on research and development related to payment analytics and treasury financial simulations. Before that, he was a senior data scientist on the sales and monetization team at Twitter. He holds an advanced engineering degree in aeronautics from the California Institute of Technology (Caltech) and a dual master’s degree in aeronautics and applied mechanics from École Polytechnique, Paris, and Caltech. He is a peer reviewer for CVPR, ICCV, and AAMAS and has given numerous invited talks, including as a keynote speaker at the EARL conference, the Toronto Machine Learning Summit, the Deep Learning Summit, the NYU Center for Data Science, and the Wharton Technology Conference. He also chaired the session on forecasting at a 2017 international symposium on forecasting in Australia and was a judge for risk and intelligence at the European Fintech Awards in Brussels.

Presentations

Walmart's journey from business intelligence to artificial intelligence (sponsored by Walmart Labs) Session

Prakhar Mehrotra shares Walmart’s digital transformation journey and explains how the company is using recent advancements in machine learning to power core retail operations like pricing, assortment, and replenishment. Along the way, Prakhar demonstrates how to leverage human expertise and use it as feedback to improve your algorithms.

Jon Merriman is a senior software engineer and researcher at Verint Intelligent Self-Service, where he works on core natural language understanding capabilities for dialogue systems. His primary focus is on algorithms and machine learning theory for text and speech analysis.

Presentations

How to determine the optimal anomaly detection method for your application Session

Anomaly detection has many applications, such as tracking business KPIs or fraud spotting in credit card transactions. Unfortunately, there's no one best way to detect anomalies across a variety of domains. Jonathan Merriman and Cynthia Freeman introduce a framework to determine the best anomaly detection method for the application based on time series characteristics.

Jake Metcalf is a technology ethics researcher and consultant specializing in data analytics and artificial intelligence. He’s a researcher at Data & Society on an NSF-funded multisite project, Pervasive Data Ethics for Computational Research (PERVADE), where he is studying how data ethics practices are emerging in environments that have not previously grappled with research ethics, such as industry, conference committees, and civil society organizations. He’s also exploring how design practices can successfully integrate ethical values and principles. Jake runs consulting firm Ethical Resolve, which provides clients with a range of ethics services, helping them make well-informed, consistent, actionable, and timely business decisions that reflect their values. He lives among the redwoods of the Santa Cruz mountains.

Presentations

Owning ethics: Doing ethics inside a tech company Ethics Summit

What does it mean for technology companies to “do ethics”? Jake Metcalf and Emanuel Moss discuss how AI ethics figures in longstanding philosophical debates about ethics and human values, contemporary debates about “ordinary ethics,” and how the logics and structures of corporate organizations may be creating pitfalls along the way.

Omoju Miller is a machine learning engineer at GitHub. Omoju has over a decade of experience in computational intelligence. Apart from her work in AI, she has co-led the nonprofit investment in computer science education at Google and served as a volunteer advisor to the Obama administration’s White House Presidential Innovation Fellows. She is a member of the World Economic Forum Expert Network in AI. Omoju holds a PhD from UC Berkeley.

Presentations

Where does Jupyter fit into building end-to-end ML products? Session

GitHub has a relatively nascent ML group. Its major challenge is to integrate ML product building processes into a mature product engineering org. This means that it's responsible for building end-to-end ML products, from ETL to production. Omoju Miller details the many roles Jupyter notebooks play in the building of ML products.

Patrick Miller is a data scientist at Civis Analytics specializing in survey data analysis, causal inference, and production R. Patrick holds a PhD in quantitative psychology, where he studied the application of machine learning to the analysis of psychological and behavioral data. You can usually find him at his desk drinking tea and listening to Sufjan Stevens.

Presentations

Testing ad content with survey experiments Session

Brands that test the content of ads before they are shown to an audience can avoid spending resources on the 11% of ads that cause backlash. Using a survey experiment to choose the best ad typically improves effectiveness of marketing campaigns by 13% on average, and up to 37% for particular demographics. Patrick Miller explores data collection and statistical methods for analysis and reporting.

Siamac Mirzaie is a senior analytics engineer at Netflix, where he builds end-to-end anomaly detection systems for corporate security. Siamac is an applied machine learning practitioner in the security space. Previously, he was a data scientist at Facebook and director of analytics at Everquote. He holds a master’s degree in EECS from Ecole Supérieure d’Electricité and in financial engineering from the University of Michigan.

Presentations

Building and scaling a security detection platform: A Netflix Original Session

Data has become a foundational pillar for security teams operating in organizations of all shapes and sizes. This new norm has created a need for platforms that enable engineers to harness data for various security purposes. John Bennett and Siamac Mirzaie offer an overview of Netflix's internal platform for quickly deploying data-based detection capabilities in the corporate environment.

Piero Molino is a senior research scientist at Uber AI, where he focuses on machine learning for language and dialogue. Previously, he founded QuestionCube, a startup that built a framework for semantic search and QA, and worked on learning to rank at Yahoo Labs in Barcelona, on natural language processing with deep learning at IBM Watson in New York, and on grounded language understanding at Geometric Intelligence (acquired by Uber). Piero holds a PhD on question answering from the University of Bari, Italy.

Presentations

Ludwig, a code-free deep learning toolbox Session

Piero Molino offers an overview of Ludwig, a deep learning toolbox that allows you to train models and use them for prediction without the need to write code. It's unique in its ability to help make deep learning easier to understand for nonexperts and enable faster model improvement iteration cycles for experienced machine learning developers and researchers alike.

Daniel Monteiro works in the Market Regulation Technology Department at FINRA, where he’s the principal developer of the Surveillance Review and Feedback visualization tool. In his 15 years of software development experience, Daniel has developed solutions for various business areas.

Presentations

Scaling visualization for big data and analytics in the cloud Session

Jaipaul Agonus and Daniel Monteiro do Carmo Rosa detail big data analytics and visualization practices and tools used by FINRA to support machine learning and other surveillance activities that the Market Regulation Department conducts in the AWS cloud.

Kevin Moore is a senior data scientist at Salesforce, where he works on automated machine learning pipelines to generate and deploy customized models for a wide variety of customers and use cases. He holds a PhD in astrophysics and previously worked on modeling how stars evolve and eventually explode. When not stirring piles of linear algebra, he can usually be found snowboarding, brewing beer, or gaming.

Presentations

Point, click, predict Session

Kevin Moore walks you through how TransmogrifAI—Salesforce's open source AutoML library built on Spark—automatically generates models customized to a company's dataset and use case and provides insights into why the model is making the predictions it does.

Emanuel Moss is a doctoral candidate in anthropology at the CUNY Graduate Center, where he studies data science, machine learning, and artificial intelligence from an ethnographic perspective, focusing on the role of technologists as producers of knowledge and the ethical, economic, and cultural aspects of their work. He’s also a research analyst at Data & Society, where he studies issues of fairness and accountability in machine learning, and a research assistant on the Pervasive Data Ethics for Computational Research (PERVADE) project. Emanuel holds a BA from the University of Illinois and an MA from Brandeis University. Previously, he was a digital and spatial information specialist for archaeological and environmental projects in the United States and Turkey.

Presentations

Owning ethics: Doing ethics inside a tech company Ethics Summit

What does it mean for technology companies to “do ethics”? Jake Metcalf and Emanuel Moss discuss how AI ethics figures in longstanding philosophical debates about ethics and human values, contemporary debates about “ordinary ethics,” and how the logics and structures of corporate organizations may be creating pitfalls along the way.

Francesco Mucio is a BI architect at Zalando. The first time Francesco met the word data, it was just the plural of datum. Now he’s helping to redraw Zalando’s data architecture. He likes to draw data models and optimize queries. He spends his free time with his daughter, who, for some reason, speaks four languages.

Presentations

Scaling data infrastructure in the fashion world; or, “What is this? Business intelligence for ants?” Session

Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead.

Manu Mukerji is senior director of data, machine learning, and analytics at 8×8. Manu’s background lies in cloud computing and big data, working on systems handling billions of transactions per day in real time. He enjoys building and architecting scalable, highly available data solutions and has extensive experience working in online advertising and social media.

Presentations

From Jupyter to production: Accelerating solutions to business problems in production Session

Project Jupyter is very popular for data science, data exploration, and visualization. Manu Mukerji and Justin Driemeyer explain how to use it for AI/ML in a production environment.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Previously, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive.

Presentations

Loosely coupled data with Apache Arrow Flight Session

Apache Arrow Flight is a new initiative focused on providing high-performance communication within data engineering and data science infrastructure. Jacques Nadeau explains how Flight works and where it has been integrated. He also discusses how Flight can be used to abstract physical data management from logical access and shares benchmarks of workloads that have been improved by Flight.

Syed Nasar is a solutions architect at Cloudera. As a big data and machine learning professional, his expertise extends to artificial intelligence, machine learning, and computer vision, and he has worked with a number of enterprises to bridge big data technologies with advanced statistical analysis, machine learning, and deep learning, creating high-quality data products and intelligent systems that drive strategy and investment decisions. Syed is the founder of the Nashville Artificial Intelligence Society. His research interests include NLP, deep learning (mainly RNNs and GANs), distributed systems, machine learning at scale, and emerging technologies. He holds a master’s degree in interactive intelligence from the Georgia Institute of Technology.

Presentations

Anomaly detection using deep learning to measure the quality of large datasets​ Session

Any business, big or small, depends on analytics, whether the goal is revenue generation, churn reduction, or sales and marketing. No matter the algorithm and the techniques used, the result depends on the accuracy and consistency of the data being processed. Sridhar Alla and Syed Nasar share techniques used to evaluate the quality of data and the means to detect anomalies in the data.

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly Media and director of community evangelism for Apache Spark at Databricks. Paco is the cochair of the Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the "top 30 people in big data and analytics" in 2015 by Innovation Enterprise.

Presentations

Executive Briefing: Overview of data governance Session

Effective data governance is foundational for AI adoption in enterprise, but it's an almost overwhelming topic. Paco Nathan offers an overview of its history, themes, tools, process, standards, and more. Join in to learn what impact machine learning has on data governance and vice versa.

Alexander Ng is a senior data engineer at Manifold. His previous work includes a stint as engineer and technical lead doing DevOps at Kyruus as well as engineering work for the Navy. He holds a BS in electrical engineering from Boston University.

Presentations

Streamlining a machine learning project team Tutorial

Many teams are still run as if data science is mainly about experimentation, but those days are over. Now it must offer turnkey solutions to take models into production. Sourav Dey and Alex Ng explain how to streamline an ML project and help your engineers work as an integrated part of your production teams, using a Lean AI process and the Orbyter package for Docker-first data science.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Dinesh Nirmal is vice president of development for IBM Data and AI. His mission is to empower every organization to transform their industry—whether it’s aerospace, finance, or healthcare—by unlocking the power of their data. Dinesh speaks and writes internationally on operationalizing machine learning and advises business leaders on strategies to ready their enterprises for new technologies. He leads more than a dozen IBM Development Labs globally; recognizing a market need for data science mastery, he launched six Machine Learning Hubs to work face-to-face with clients. Products in his portfolio regularly win major design awards, including two Red Dot Awards and the iF Design Award. Dinesh is a member of the board of the R Consortium and an advisor to Accel.AI. He lives in San Jose with his wife Catherine Plaia, formerly an engineer at Apple, and their two young sons.

Presentations

Streamlining your data assets: A strategy for the journey to AI (sponsored by IBM) Keynote

The journey to AI begins with data and making intelligent use of it. Dinesh Nirmal shares a strategic framework for streamlining your data assets, a framework that takes into account the current state of your existing data structures, the new technologies driving enterprise, the complexities of business processes, and at the foundation, the elements required in an AI-fluent data platform.

Kyungtaak Noh runs the Metatron project team within product development at SK Telecom, South Korea’s largest wireless communications provider, where he’s responsible for managing product development and for visualizing and coordinating big data applications. He has 10+ years of experience as a software engineer working on applied big data platforms for groupware, semiconductor, finance, and telecommunications.

Presentations

When self-service BI meets geospatial analysis Data Case Studies

In the analysis of the mobile world, everyone starts with the question, "Where?" SK Telecom is trying to meet these needs. Kyungtaak Noh explains how the company provides geospatial analysis by processing geospatial data through Druid with Lucene.

Amy O’Connor is a big data evangelist and telecommunications specialist at Cloudera, the leading big data vendor. She advises customers globally as they introduce big data solutions and adopt enterprise-wide big data delivery capabilities. Amy was recently named one of Information Management’s 10 Big Data Experts to Know. Prior to joining Cloudera, Amy built and ran Nokia’s big data team, developing and managing Nokia’s data assets and leading a team of data scientists to drive insights. Previously, Amy was vice president of services marketing and also led strategy for the software and storage business units of Sun Microsystems.

Presentations

The journey to the data-driven enterprise from the edge to AI Keynote

Cloudera “drinks its own champagne”—running Cloudera on Cloudera. The company analyzes data from the edge and runs probabilistic models to tune its business processes with AI, from marketing, sales, and support to strategic planning. Amy O'Connor shares what Cloudera has learned from the edge to AI and explains how it's helping Cloudera and its customers get better at being data-driven.

Tim O’Reilly is the founder and CEO of O’Reilly Media, Inc. His original business plan was simply “interesting work for interesting people,” and that’s worked out pretty well. O’Reilly Media delivers online learning, publishes books, runs conferences, urges companies to create more value than they capture, and tries to change the world by spreading and amplifying the knowledge of innovators. Tim has a history of convening conversations that reshape the computer industry. In 1998, he organized the meeting where the term “open source software” was agreed on and helped the business world understand its importance. In 2004, with the Web 2.0 Summit, he defined how “Web 2.0” represented not only the resurgence of the web after the dot-com bust but a new model for the computer industry based on big data, collective intelligence, and the internet as a platform. In 2009, with his Gov 2.0 Summit, he framed a conversation about the modernization of government technology that has shaped policy and spawned initiatives at the federal, state, and local level and around the world. He has now turned his attention to implications of AI, the on-demand economy, and other technologies that are transforming the nature of work and the future shape of the business world. This is the subject of his book from Harper Business, WTF: What’s the Future and Why It’s Up to Us. In addition to his role at O’Reilly Media, Tim is a partner at early-stage venture firm O’Reilly AlphaTech Ventures (OATV) and serves on the boards of Maker Media (which was spun out from O’Reilly Media in 2012), Code for America, PeerJ, Civis Analytics, and PopVox.

Presentations

The conscience of a company Session

Tim O'Reilly will be joined by Janet Haven, executive director of Data & Society, and Catherine Bracy, director of the TechEquity Collaborative, to discuss ways in which tech employees are flexing their muscles as the conscience of their companies.

The future of data ethics Ethics Summit

Strata Data Ethics Summit cochairs Susan Etlinger and Alistair Croll, along with Tim O’Reilly, lead an interactive discussion with the summit's speakers, attendees, and guests.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Executive Briefing: From the edge to AI—Taking control of your data for fun and profit Session

It's easier than ever to collect data, but managing it securely in compliance with regulations and legal constraints is harder. Mike Olson discusses the risks and the issues that matter most and explains how an enterprise data cloud that embraces your data center and the public cloud in combination can address them, delivering real business results for your organization.

The enterprise data cloud Keynote

Most enterprises want the same flexibility and convenience they get in the public cloud, no matter where their data lives or their applications run. We've reached the point that the "enterprise data cloud" must span the firewall and the services offered by hyperscale vendors. Mike Olson describes the key capabilities that such a system requires and why hybrid and multicloud is the future.

Diego Oppenheimer is the founder and CEO of Algorithmia. An entrepreneur and product developer with extensive background in all things data, Diego has designed, managed, and shipped some of Microsoft’s most used data analysis products, including Excel, Power Pivot, SQL Server, and Power BI. Diego holds a bachelor’s degree in information systems and a master’s degree in business intelligence and data analytics from Carnegie Mellon University.

Presentations

Automating DevOps for machine learning Session

You've invested heavily in cleaning your data, feature engineering, and training and tuning your model—but now you have to deploy it into production, and you discover it's a huge challenge. Diego Oppenheimer shares common architectural patterns and best practices from the most advanced organizations deploying models for scalability and accessibility.

Richard Ott is a data scientist in residence at the Data Incubator, where he gets to combine his interest in data with his love of teaching. Previously, he was a data scientist and software engineer at Verizon. Rich holds a PhD in particle physics from the Massachusetts Institute of Technology, which he followed with postdoctoral research at the University of California, Davis.

Presentations

Big data for managers 2-Day Training

Michael Li and Rich Ott offer a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and utilize their input and analysis for your business’s strategic priorities and decision making.

M Pacer is a Jupyter core developer and a senior notebook engineer at Netflix. Previously, M was a postdoctoral researcher at the Berkeley Institute for Data Science (BIDS), focusing on the intersection between Jupyter and scientific publishing. M holds a PhD from UC Berkeley, where their research used machine learning and human experiments to study causal explanation and causal inference, and a BS from Yale University, where their research focused on the role of causal language in evaluating scientific claims.

Presentations

Talking with Jupyter Session

M Pacer discusses two meanings of "Talking with Jupyter": talking to others with Jupyter notebooks and talking to Jupyter in the language of its standards, formats, and protocols. M describes tools, workflows, and patterns that make both kinds of talking with Jupyter easier while unlocking new ways of interacting with the Jupyter ecosystem.

Yogesh Pandit is a senior software engineer in the Analytics Group at Roche. Currently, he’s leading the NLP efforts to support the company’s NAVIFY platform, which supports oncology care teams as they review, discuss, and align on treatment decisions for the patient. Yogesh is a bioinformatician turned data engineer with experience in biomedical NLP. For the past few years, he’s been building data applications in the life sciences and healthcare space.

Presentations

Spark NLP: How Roche automates knowledge extraction from pathology and radiology reports Session

Yogesh Pandit, Saif Addin Ellafi, and Vishakha Sharma discuss how Roche applies Spark NLP for healthcare to extract clinical facts from pathology and radiology reports. They then detail the design of the deep learning pipelines used to simplify training, optimization, and inference of such domain-specific models at scale.

Marc Paradis is vice president and dean of Data Science University (DSU) at UnitedHealth Group, where he has built out DSU to train the next generation of UnitedHealth Group’s data science and machine learning experts in the tools, techniques, and technologies of the discipline as well as in the architecture, content, and ontology of UnitedHealth Group’s uniquely integrated claims, clinical, and pharmacy data assets. Marc’s career has spanned a variety of healthcare companies in data-related roles, where he continuously seeks to unlock the hidden drivers of profit, efficiency, and value by applying the rigor and discipline of the scientific method to business datasets. He holds an SM in brain and cognitive sciences from the Massachusetts Institute of Technology.

Presentations

Data Science University: Transforming a Fortune 5 workforce Session

Data Science University (DSU) was established to bring analytics education to UnitedHealth Group, the world’s largest healthcare company, with over 270,000 employees. Marc Paradis explains how DSU was built out over time in an era of rapidly changing analytics technology and capabilities in an industry ripe for disruption, covering the challenges faced and lessons learned.

Vivek Pasari is a senior data engineer at Netflix, where he focuses on building high-quality data assets that drive innovation in delivering highly performant, consistent, and reliable user experiences for Netflix members. Previously, he spent several years in financial technology helping build large-scale cloud and data solutions.

Presentations

How Netflix measures app performance on 250 million unique devices across 190 countries Session

Netflix has over 125 million members spread across 191 countries. Each day its members interact with its client applications on 250 million+ devices under highly variable network conditions. These interactions result in over 200 billion daily data points. Vivek Pasari dives into the data engineering and architecture that enables application performance measurement at this scale.

Priyank Patel is the cofounder and chief product officer at Arcadia Data, where he leads the team’s charter in building visually beautiful and highly scalable analytical products and guides customers through their successful adoption. Previously, Priyank was part of the founding engineering team at Aster Data, where he designed core components of the Aster Database. He holds a master’s degree in computer science from Stanford University.

Presentations

Intelligent design patterns for cloud-based analytics and BI (sponsored by Arcadia Data) Session

With cloud object storage, you'd expect business intelligence (BI) applications to benefit from the scale of data and real-time analytics. However, traditional BI in the cloud surfaces non-obvious challenges. Priyank Patel reviews service-oriented cloud design (storage, compute, catalog, security, SQL) and shows how native cloud BI provides analytic depth, low cost, and high performance.

Joshua Patterson is a director of AI infrastructure at NVIDIA leading engineering for RAPIDS.AI. Previously, Josh was a White House Presidential Innovation Fellow and worked with leading experts across public sector, private sector, and academia to build a next-generation cyberdefense platform. His current passions are graph analytics, machine learning, and large-scale system design. Josh loves storytelling with data and creating interactive data visualizations. He holds a BA in economics from the University of North Carolina at Chapel Hill and an MA in economics from the University of South Carolina Moore School of Business.

Presentations

The next step in the evolution of data science with RAPIDS Session

RAPIDS is the next big step in data science, combining the ease of use of common APIs and the power and scalability of GPUs. Bartley Richardson and Joshua Patterson offer an overview of RAPIDS and explore cuDF, cuGraph, and cuML—a trio of RAPIDS tools that enable data scientists to work with data in a familiar interface and apply graph analytics and traditional machine learning techniques.

Ji Peng is a problem solver who has developed the data backbone of Earnin, a high-growth financial company that allows anyone with a job and a bank account to get paid the minute they leave work. He has built machine learning systems that have given payroll flexibility to employees from more than 50,000 employers, guiding his data science team as they create sophisticated models to better understand the Earnin community. Ji holds a PhD from the University of Colorado, Boulder, and a BS from the University of Science and Technology of China (USTC).

Presentations

Applying machine learning in fintech startups: Modeling with sensitive customer datasets Session

As a customer-facing fintech company, Earnin has access to various types of valuable customer data, from bank transactions to GPS location. Ji Peng shares how Earnin uses unique datasets to build machine learning models and navigates the challenges of prioritizing and applying machine learning in the fintech domain.

Thomas Phelan is cofounder and chief architect of BlueData. Previously, Tom was an early employee at VMware; as senior staff engineer, he was a key member of the ESX storage architecture team. During his 10-year stint at VMware, he designed and developed the ESX storage I/O load-balancing subsystem and modular pluggable storage architecture. He went on to lead teams working on many key storage initiatives, such as the cloud storage gateway and vFlash. Earlier, Tom was a member of the original team at Silicon Graphics that designed and implemented XFS, the first commercially available 64-bit filesystem.

Presentations

How to protect big data in a containerized environment Session

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). But TDE is difficult to configure and manage—particularly when run in Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them.

Carole Piovesan is a partner and cofounder at INQ Data Law, where she focuses on data governance, privacy law, cybersecurity, and artificial intelligence. Previously, Carole was a lawyer at a large Canadian law firm, where she served as colead of the firm’s National Cybersecurity, Privacy, and Data Management Group and a lead on artificial intelligence. Carole advises the Canadian government on legal and policy issues related to data and AI and regularly advises companies on issues related to their collection, storage, and use of data. Carole is a recognized expert on legal and policy issues relating to data and AI and is a frequent speaker and author on these topics.

Presentations

Governing with exponential change on the horizon: Law and data in an emerging AI world Ethics Summit

The appetite for data consumption is ever growing as experimentation with data-demanding technologies like AI increases. Carole Piovesan discusses responsible data use and some of the legal issues arising with AI creation and operationalization.

Panel: Solutions Ethics Summit

We've looked at some possible solutions. But a more complete perspective includes what's right and where we're making mistakes—from the reproducibility of social sciences to the regulations of governments. We wrap up our look at possible solutions with a group discussion.

Josh Poduska is the chief data scientist at Domino Data Lab. He has 17 years of experience in analytics. His work experience includes leading the statistical practice at one of Intel’s largest manufacturing sites, working on smarter cities data science projects with IBM, and leading data science teams and strategy with several big data software companies. Josh holds a master’s degree in applied statistics from Cornell University.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders must deliver measurable impact on an increasing share of an enterprise's KPIs. Joshua Poduska, Kimberly Shenk, and Mac Steele explain how leading organizations take a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Gungor Polatkan is a machine learning expert and engineering leader with experience in building massive-scale distributed data pipelines serving personalized content at LinkedIn and Twitter. Most recently, he led the design and implementation of the AI backend for LinkedIn Learning and ramped up the recommendation engine from scratch to hyperpersonalized models learning billions of coefficients for 500M+ users. He deployed some of the first deep ranking models for search verticals at LinkedIn improving talent search. He enjoys leading teams, mentoring engineers, and fostering a culture of technical rigor and craftsmanship while iterating fast. Previously, he worked in several notable applied research groups at Twitter, Princeton, Google, MERL, and UC Berkeley. He has published and refereed papers at top-tier ML and AI venues, such as UAI, ICML, and PAMI.

Presentations

Toward deep and representation learning for talent search at LinkedIn Session

Talent search systems at LinkedIn strive to match the potential candidates to the hiring needs of a recruiter expressed in terms of a search query. Gungor Polatkan shares the results of the company's deployment of deep learning models on a real-world production system serving 500M+ users through LinkedIn Recruiter.

Alex Poms is a CS PhD student at Stanford, where they’re advised by Kayvon Fatahalian, as well as a research contractor for Oculus/Facebook. Alex’s PhD research focuses on designing algorithms and programmable systems for efficiently analyzing video. They have published and presented work at SIGGRAPH and CVPR on systems for large-scale video analysis and efficient 3D reconstruction using deep learning.

Presentations

Scanner: Efficient video analysis at scale Session

Video is now the largest source of data on the internet, so we need tools to make it easier to process and analyze. Alex Poms and Will Crichton offer an overview of Scanner, the first open source distributed system for building large-scale video processing applications, and explore real-world use cases.

Cat Posey is a senior tech director on the leadership team in the Center for Machine Learning at Capital One, which focuses on research and strategic innovation initiatives in AI and ML, including the development of tools, technologies, frameworks, and partnerships with industry and academia. Cat is passionate about finding ways to shift the current dynamics within tech culture so the community can truly be a microcosm of the world and a place where all of its members can thrive. She’s also focused on developing and implementing AI and ML systems responsibly and in ways that put humanity at the forefront. Cat founded the Tech By Superwomen movement to shift the conversation to what works and what matters when it comes to creating a more inclusive tech industry. Previously, she was the head of strategic partnerships and alliances for the United States Digital Service, a tech startup founded by President Obama to change the technology infrastructure in order to better serve citizens. Cat is a recognized authority on issues impacting women in technology and an international speaker who has been featured in various national media publications.

Presentations

A human-centered approach to AI and machine learning Session

Cathryn Posey explains how Capital One—the only bank fully committed to a cloud-based infrastructure—is approaching machine learning with a responsible, human-centered focus. Join in to hear about Capital One's research in areas like explainable AI, how the bank is leveraging the technology, and ways in which it can be used for good.

Greg Quist is the cofounder, president, and CEO of SmartCover Systems, where he leads the strategic direction and operations of the company. Greg is a longtime member of the water community. He was elected to the Rincon del Diablo MWD board of directors in 1990 and for the past 27 years has served in various roles, including president and treasurer. Rincon’s Board appointed Greg to the San Diego County Water Authority Board in 1996, where he served for 12 years, leading a coalition of seven agencies to achieve more than $1M/year in water delivery savings. He is currently the chairman of the Urban Water Institute. With a background in the areas of metamaterials, numerical analysis, signal processing, pattern recognition, wireless communications, and system integration, Greg has worked as a technologist, manager, and executive at Alcoa, McDonnell-Douglas, and SAIC and has founded and successfully spun off several high-tech startups, primarily in real-time detection and water technology. He has held top-level government clearances and holds 14 patents and has several pending. Greg has an undergraduate degree in astrophysics with a minor in economics from Yale, where he played football and baseball, and a PhD in physics from the University of California, Santa Barbara. He currently resides in Escondido, CA. In his rare free time, he enjoys fly fishing, hiking, golf, basketball, and tennis.

Presentations

The collision between AI and underground infrastructure Session

SmartCover Systems has been providing an IoT solution to its customers for 15 years, using techniques honed in defense and remote sensing, gathering more than 200 million hours of sewer data. Greg Quist shares case studies and results from applying the IoT and AI to underground infrastructure.

Mohammad Quraishi is a senior principal technologist within the Data and Analytics Organization at Cigna, where he’s the lead engineer in the Big Data Guild; his primary focus is on Hadoop and streaming architectures. Mohammad has worked in the healthcare industry for 23 years. He holds a BS in computer science and engineering from the University of Connecticut at Storrs.

Presentations

Enabling insights and analytics with data streaming architectures and pipelines using Kafka and Hadoop Session

In a large global health services company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast, from relevant sources, to act on this data quickly and share the insights with consumers with the same speed and urgency. Join Mohammad Quraishi to learn why streaming data architectures are a necessity—Kafka and Hadoop are key.

Irina Raicu is the director of the Internet Ethics Program at the Markkula Center for Applied Ethics at Santa Clara University. Previously, she was an attorney in private practice. Her work addresses a wide variety of issues, ranging from online privacy to net neutrality, from data ethics to social media’s impact on friendship and family, from the digital divide to the ethics of encryption, and from the ethics of artificial intelligence to the right to be forgotten. Her writing has appeared in a variety of publications, including the Atlantic, USA Today, MarketWatch, Slate, HuffPost, the San Jose Mercury News, the San Francisco Chronicle, and Recode. She’s a Certified Information Privacy Professional (US) and is a member of the Partnership on AI’s Working Group on Fair, Transparent, and Accountable AI. In collaboration with the staff of the High Tech Law Institute, Irina manages the ongoing IT, Ethics, and Law lecture series, which has brought to campus speakers such as journalist Julia Angwin, ethicists Luciano Floridi and Patrick Lin, and then-FTC commissioner Julie Brill. She holds a JD from Santa Clara University’s School of Law, a master’s degree in English and American literature from San Jose State University, and a bachelor’s degree in English from UC Berkeley. She tweets at @IEthics and is the primary contributor to the blog Internet Ethics: Views from Silicon Valley. As a teenager, Irina came to the US with her family as a refugee; her background informs her interest in the internet as a tool whose use has profound ethical implications worldwide.

Presentations

An introduction to data ethics Ethics Summit

The term “technology ethics” comes up frequently these days but is not always well understood. In order to consider technology ethics in depth, we need a shared understanding of its content. Irina Raicu and Brian Green explore what ethics is, and more narrowly, the meaning of data ethics.

Ashwin Ramachandran is the product manager for Syncsort’s Integrate portfolio, covering the DMX/DMX-h, DMX DataFunnel, and DMX Change Data Capture products. In his four years at Syncsort, he has had the opportunity to engage with customers at multiple levels, from support to training to leading presales evaluations. Across all of these instances, he has particularly enjoyed the process of creating new ways in which Syncsort software can help customers overcome their pressing business challenges.

Presentations

Strategies for leveraging legacy data for real time, cloud, and cluster (sponsored by Syncsort) Session

"Legacy" data sources like mainframes and data warehouses still power mission-critical applications, holding the historical and transactional insight essential for advanced analytics and real-time applications. Ashwin Ramachandran shares strategies, tools, and techniques for successfully deriving value from these sources using today's modern architectures while future-proofing for what lies ahead.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Architecture and algorithms for end-to-end streaming data processing Tutorial

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.

Reducing stream processing complexity using Apache Pulsar Functions Session

After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, Jowanza Joseph and Karthik Ramasamy decided to explore a new platform that would take advantage of Kubernetes and support a simpler data processing DSL. Join in to discover why they chose Apache Pulsar and learn tips and tricks for using Pulsar Functions.

Sushant Rao works at Cloudera.

Presentations

Journey to the cloud: Architecting for the cloud through customer stories Session

Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms.

Nancy Rausch is a senior manager at SAS Institute. Nancy has been involved for many years in the design and development of SAS’s data warehouse and data management products, working closely with customers and authoring a number of papers on SAS data management products and best practice design principles for data management solutions. She holds an MS in computer engineering from Duke University, where she specialized in statistical signal processing, and a BS in electrical engineering from Michigan Technological University. She has recently returned to college and is pursuing an MS in analytics from Capella University.

Presentations

Bringing data to life: Combining machine learning and art to tell a data story Session

For data to be meaningful, it needs to be presented in a way that people can relate to. Nancy Rausch explains how she combined streaming data from a solar array and machine learning techniques to create a live-action art piece—an approach that helped bring the data to life in a fun and compelling way.

Joseph Regensburger leads the Research Group at Immuta, where he focuses on model risk management and privacy-preserving machine learning. Previously, he was chief scientist at Illumination Works, LLC and principal research scientist at the Battelle Memorial Institute. Joseph has led research efforts characterizing airport security screening devices, engineering image analysis software, and developing machine learning algorithms for biological detection. He received both Battelle’s Technical Achievement Award and Illumination Works’s Innovation Award. He holds a PhD in physics from the Ohio State University, where his research focused on experimental high-energy physics—specifically the detection of rare decays of D0 mesons.

Presentations

Successfully deploy machine learning while managing its risks Tutorial

As machine learning (ML) becomes increasingly important for businesses and data science teams alike, managing its risks is quickly becoming one of the biggest challenges to the technology’s widespread adoption. Join Andrew Bur, Steven Touw, Richard Geering, Joseph Regensburger, and Alfred Rossi for a hands-on overview of how to train, validate, and audit ML models in practice.

Jen Ren is a program manager at Microsoft, focused on creating data wrangling tools for AI datasets. She studied health policy and computer science at Stanford University, where she was also part of the Data Challenge Lab.

Presentations

Forecasting financial time series with deep learning on Azure 2-Day Training

Francesca Lazzeri and Jen Ren walk you through the core steps for using Azure Machine Learning services to train your machine learning models both locally and on remote compute resources.

Luciano Resende is an STSM and open source data science/AI platform architect at IBM CODAIT (formerly the Spark Technology Center). He is a member of the ASF and has been contributing to open source for over 10 years. He’s currently contributing to various big data-related Apache projects in the Apache Spark ecosystem, as well as to Jupyter ecosystem projects, building a scalable, secure, and flexible enterprise data science platform.

Presentations

Scaling Jupyter with Jupyter Enterprise Gateway Session

Alan Chin and Luciano Resende explain how to introduce Jupyter Enterprise Gateway into new and existing notebook environments to enable a "bring your own notebook" model while simultaneously optimizing resources consumed by the notebook kernels running across managed clusters within the enterprise.

Bartley Richardson is a senior data scientist on the AI infrastructure team at NVIDIA. Bartley’s focus at NVIDIA is the research and application of GPU-accelerated methods that can help solve today’s information security and cybersecurity challenges. Previously, Bartley was a technical lead and performer on multiple DARPA research projects, where he applied data science and machine learning algorithms at scale to solve large cybersecurity problems. He was also the principal investigator of an internet of things research project focused on applying machine and deep learning techniques to large amounts of IoT data to provide intelligence value relating to form, function, and pattern of life. His primary research areas involve NLP and sequence-based methods applied to cyber network datasets as well as cross-domain applications of machine and deep learning solutions to tackle the growing number of cybersecurity threats. He loves using data and visualizations to tell stories and help make complex concepts more relatable. Bartley holds a PhD in computer science and engineering from the University of Cincinnati, with a focus on loosely structured and unstructured query optimization, and a BS in computer engineering with a focus on software design and AI.

Presentations

The next step in the evolution of data science with RAPIDS Session

RAPIDS is the next big step in data science, combining the ease of use of common APIs and the power and scalability of GPUs. Bartley Richardson and Joshua Patterson offer an overview of RAPIDS and explore cuDF, cuGraph, and cuML—a trio of RAPIDS tools that enable data scientists to work with data in a familiar interface and apply graph analytics and traditional machine learning techniques.

Brian Rieger is cofounder and COO of Labelbox, the industry-leading training data software that is accelerating global access to artificial intelligence. An accomplished aerospace engineer, data scientist, and software developer turned serial entrepreneur, Brian began his career doing aerodynamics, testing, and flight certification of the Boeing 787 Dreamliner. He then built an aerospace company that put hardware on the International Space Station. Brian was recognized as one of Forbes’s “30 under 30” for transforming enterprise technology with machine intelligence.

Presentations

Is your AI making good decisions? Ethics Summit

Brian Rieger explores considerations and key questions for AI decision making.

Panel: Causes Ethics Summit

Following the review of problematic technologies, we'll hold an interactive discussion with speakers and invited guests to dig deeper into neuroscience, analytics, and more.

Kelley Rivoire is an engineering manager at Stripe, where she leads the data infrastructure group. As an engineer, she built Stripe’s first real-time machine learning evaluation of user risk. Previously, she worked on nanophotonics and 3D imaging as a researcher at HP Labs. She holds a PhD from Stanford.

Presentations

Scaling model training: From flexible training APIs to resource management with Kubernetes Session

Production ML applications benefit from reproducible, automated retraining and deployment of ever-more-predictive models trained on ever-increasing amounts of data. Kelley Rivoire explains how Stripe built a flexible API for training machine learning models that's used to train thousands of models per week on Kubernetes, supporting automated deployment of new models with improved performance.

David Rodriguez is a senior research engineer at Cisco Umbrella (formerly OpenDNS). He has coauthored multiple pending patents with Cisco in distributed machine learning applications centered around deep learning and behavioral analytics. He’s a frequent speaker about machine learning in cybersecurity at conferences including Flink Forward, Black Hat, Flocon, Virus Bulletin, and HitBSEC. David holds an MA in mathematics from San Francisco State University.

Presentations

Masquerading malicious DNS traffic Session

Malicious DNS traffic patterns are inconsistent and typically thwart anomaly detection. David Rodriguez explains how Cisco uses Apache Spark and Stripe’s Bayesian inference software, Rainier, to fit the underlying time series distribution for millions of domains and outlines techniques to identify artificial traffic volumes related to spam, malvertising, and botnets (masquerading traffic).

Pierre Romera is the chief technology officer at the International Consortium of Investigative Journalists (ICIJ), where he manages a team of programmers working on the platforms that enabled more than 300 journalists to collaborate on the Paradise Papers and Panama Papers investigations. Previously, he cofounded Journalism++, the Franco-German data journalism agency behind the Migrant Files, a project that won the European Press Prize in 2015 for Innovation. He is one of the pioneers of data journalism in France.

Presentations

The Paradise Papers and West Africa Leaks: Behind the scenes with the ICIJ Session

The ICIJ was the team behind the Panama Papers and Paradise Papers. Pierre Romera offers a behind-the-scenes look into the ICIJ's process and explores the challenges in handling 1.4 TB of data (in many different formats)—and making it available securely to journalists all over the world.

Alfred Rossi is a theoretical computer scientist and research scientist at Immuta, where his efforts are currently focused on differential privacy and model risk management. His research interests include clustering (especially in alternative settings) and privacy. Alfred holds a PhD in computer science and an MS in physics, both from the Ohio State University.

Presentations

Successfully deploy machine learning while managing its risks Tutorial

As machine learning (ML) becomes increasingly important for businesses and data science teams alike, managing its risks is quickly becoming one of the biggest challenges to the technology’s widespread adoption. Join Andrew Bur, Steven Touw, Richard Geering, Joseph Regensburger, and Alfred Rossi for a hands-on overview of how to train, validate, and audit ML models in practice.

Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.

Presentations

Executive Briefing: Big data in the era of heavy worldwide privacy regulations Session

The implications of new privacy regulations for data management and analytics, such as the General Data Protection Regulation (GDPR) and the upcoming California Consumer Privacy Act (CCPA), can seem complex. Mark Donsky and Nikki Rouda highlight aspects of the rules and outline the approaches that will assist with compliance.

Craig Rowley is a data solutions architect and scientist at Columbia Sportswear. Craig seeks to inform the art of business through data and science, sifting through petabytes of information to find the “little data”—those nuggets of information and insight—that help deliver the right products and experiences to the right consumers at the right time…and more importantly, to do the right thing for consumers by protecting their right to privacy on a global, comprehensive, and even predictive scale. Luck and great opportunities have helped him deliver patents on streaming behavioral data models, adaptive augmented reality applications, location-aware engagement intelligence, in-the-moment predictive algorithms for wearable devices, fraud detection and prediction platforms, and personalized, connection-based recommenders. His specialties include petabyte-scale dataflow execution architectures that enable full-loop data science; exploration and modeling of semistructured, unstructured, and structured data; and the invention of new machine learning algorithms, heuristics, and applications. His interests include science fiction movies and books, video games, and learning.

Presentations

Informing the art of business with data and science Data Case Studies

Few analytics organizations are successfully delivering actionable insights that make it further than a Keynote or PowerPoint presentation. Join Craig Rowley to learn why successful analytics projects must also consider the human element.

John Rutherford builds machine learning applications and helps develop the Wise data science platform at GE. Previously, he was the data scientist for an energy efficiency startup, where he headed algorithm development and exploratory analyses. He holds physics, mathematics, and astrophysics degrees from Stanford and MIT.

Presentations

Critical turbine maintenance: Monitoring and diagnosing planes and power plants in real time Session

GE produces a third of the world's power and 60% of its airplane engines—a critical portion of the world's infrastructure that requires meticulous monitoring of the hundreds of sensors streaming data from each turbine. June Andrews and John Rutherford explain how GE's monitoring and diagnostics teams released into production the first real-time ML systems used to determine turbine health.

Iman Saleh is a research scientist with the Automotive Solutions Group at Intel. Iman has authored 30+ technical publications in the areas of big data, formal data specification, service-oriented computing, and privacy-preserving data mining. Her research interests include ethical AI, machine learning, privacy-preserving solutions, software engineering, data modeling, web services, formal methods, and cryptography. She holds a PhD from the Computer Science Department at Virginia Tech, a master’s degree in computer science from Alexandria University, Egypt, and a master’s degree in software engineering from Virginia Tech.

Presentations

AI privacy and ethical compliance toolkit Tutorial

From healthcare to smart home to autonomous vehicles, new applications of autonomous systems are raising ethical concerns about a host of issues, including bias, transparency, and privacy. Iman Saleh, Cory Ilo, and Cindy Tseng demonstrate tools and capabilities that can help data scientists address these concerns and bridge the gap between ethicists, regulators, and machine learning practitioners.

David E. Sanger is the national security correspondent for the New York Times as well as a national security and political contributor for CNN and a frequent guest on CBS This Morning, Face the Nation, and many PBS shows. David’s years as a foreign correspondent have given him a unique view into the rise of Asia, nuclear proliferation, global competition, and a volatile Middle East. A three-time Pulitzer Prize winner, including as a member of the 2017 Pulitzer Prize-winning team in international reporting, he is one of the nation’s most lucid analysts of geopolitics, globalization, and cyberpower. He’s the author of national best-sellers Confront and Conceal: Obama’s Secret Wars and Surprising Use of American Power, a riveting analysis of the Obama administration’s foreign policy, including its covert reliance on cyberwarfare, drones, and special operations forces that Foreign Affairs called an “astonishingly revealing insider’s account,” and The Inheritance: The World Obama Confronts and the Challenges to American Power, an in-depth examination of American foreign policy successes and failures. His new book, The Perfect Weapon: War, Sabotage, and Fear in the Cyber Age, offers an incisive look into how cyberwarfare is influencing elections, threatening national security, and bringing us to the brink of global war.

A 30-year veteran of the New York Times, Sanger’s coverage of the Iraq and Korea crises took home the Weintal Prize, one of the highest honors for diplomatic reporting. He also won the White House Correspondents’ Association Aldo Beckman prize for his presidential coverage.
Early in his career, Sanger covered technology and economics, before turning to foreign policy. Over the years, he has focused on North Korea’s nuclear weapons program, the rise and fall of Japan, and China’s increasing power and influence. Later, he covered domestic and foreign policy issues as the Times’s White House correspondent from 1999 to 2006. He’s a featured journalist in Alex Gibney’s 2016 docu-thriller Zero Days, based largely on his own investigation of the secret American and Israeli cyberprogram to attack Iran’s nuclear facilities. David teaches national security policy as a visiting scholar and adjunct professor at Harvard University’s Kennedy School of Government.

Presentations

Cyberconflict: A new era of war, sabotage, and fear Keynote

David Sanger explains how the rise of cyberweapons has transformed geopolitics like nothing since the invention of the atomic bomb. From crippling infrastructure to sowing discord and doubt, cyber is now the weapon of choice for democracies, dictators, and terrorists.

Akshai Sarma is a principal software engineer working in big data, ETL, analytics, and distributed computing at Yahoo. He enjoys dealing with problems at scale, decreasing latency, improving quality, and creating systems that handle billions of events and terabytes of data—both streaming and batch.

Presentations

Bullet: Querying streaming data in transit with sketches Session

Akshai Sarma and Nathan Speidel offer an overview of Bullet, a scalable, pluggable, lightweight multitenant query system that runs on any data flowing through a streaming system without storing it. Bullet efficiently supports intractable operations like top K, count distincts, and windowing without any storage, using sketch-based algorithms.

Osman Sarood leads the infrastructure team at Mist Systems, where he helps Mist scale the Mist Cloud in a cost-effective and reliable manner. Osman has published more than 20 research papers in highly rated journals, conferences, and workshops and has presented his research at several academic conferences. He has over 400 citations along with an i10-index and h-index of 12. Previously, he was a software engineer at Yelp, where he prototyped, architected, and implemented several key production systems and architected and authored Yelp’s autoscaled spot infrastructure, fleet_miser. Osman holds a PhD in high-performance computing from the University of Illinois Urbana-Champaign, where he focused on load balancing and fault tolerance.

Presentations

Live Aggregators: A scalable, cost-effective, and reliable way of aggregating billions of messages in real time Session

Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA)—a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions due to running over AWS Spot Instances and having 70% CPU utilization.

Jörg Schad is a machine learning platform engineer at Suki. In a previous life, he worked on distributed systems at Mesosphere, implemented distributed and in-memory databases, and conducted research in the Hadoop and cloud areas. He’s a frequent speaker at meetups, international conferences, and lecture halls.

Presentations

Deep learning beyond the learning Session

There are many great tutorials for training your deep learning models, but training is only a small part of the overall deep learning pipeline. Tobias Knaup and Joerg Schad offer an introduction to building a complete automated deep learning pipeline, covering exploratory analysis, training, model storage, model serving, and monitoring.

Robert Schroll is a data scientist in residence at the Data Incubator. Previously, he held postdocs in Amherst, Massachusetts, and Santiago, Chile, where he realized that his favorite parts of his job were teaching and analyzing data. He made the switch to data science and has been at the Data Incubator since. Robert holds a PhD in physics from the University of Chicago.

Presentations

Machine learning from scratch in TensorFlow 2-Day Training

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Robert Schroll offers an overview of TensorFlow's capabilities in Python, demonstrating how to build machine learning algorithms piece by piece and how to use TensorFlow's Keras API with several hands-on applications.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Foundations for successful data projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

Mehul Shah heads two cloud services at AWS: AWS Lake Formation and AWS Glue. His expertise spans large-scale data management, distributed systems, and energy-efficient computing. His work has been published in top-tier conferences and journals and has won several awards, including a Test of Time award. Previously, he was cofounder and CEO of Amiato, a startup that offered a real-time ETL cloud service, and principal research scientist at HP Labs. He holds a PhD from UC Berkeley, where his work focused on adding fault tolerance and autoscaling in the TelegraphCQ stream processing system, and both an MEng and BS in CS and physics from MIT. He’s currently a member of the Sort Benchmark committee.

Presentations

Serverless analytics in AWS Glue (sponsored by Amazon Web Services) Session

Mehul Shah offers an overview of serverless computing and details AWS Glue's serverless analytics features for data science, data discovery, data cleaning and transformation, and data lake management.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

Cloud native data pipelines with Apache Kafka Session

As microservices, data services, and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. Gwen Shapira discusses how data engineering requirements have changed in a cloud native world and shares architectural patterns that are commonly used to build flexible, scalable, and reliable data pipelines.

Sonali Sharma is a data engineer on the data personalization team at Netflix, which, among other things, delivers recommendations made for each user. The team is responsible for the data that goes into training and scoring of the various machine learning models that power the Netflix home page. They have been working on moving some of the company’s core datasets from being processed in a once-a-day daily batch ETL to being processed in near real time using Apache Flink. A UC Berkeley graduate, Sonali has worked on a variety of problems involving big data. Previously, she worked on the mail monetization and data insights engineering team at Yahoo, where she focused on building great data-driven products to do large-scale unstructured data extractions, recommendation systems, and audience insights for targeting using technologies like Spark, the Hadoop ecosystem (Pig, Hive, MapReduce), Solr, Druid, and Elasticsearch.

Presentations

Taming large state to join datasets for personalization Session

With so much data being generated in real time, what if we could combine all these high-volume data streams and provide near real-time feedback for model training, improving personalization and recommendations and taking the customer experience to a whole new level? Sonali Sharma and Shriya Arora explain how to do exactly that, using Flink's keyed state.

Vishakha Sharma is a data scientist for diagnostic information solutions at Roche, where she leads advanced analytics initiatives such as natural language processing (NLP) and machine learning (ML) to discover key insights improving the NAVIFY product portfolio, leading to better and more efficient patient care. Vishakha has authored 40+ peer-reviewed publications and proceedings and has given 15+ invited talks. She serves on the program committee of the ACM-W, AMIA, and ACM-BCB. Her research work has been funded by the NIH Big Data to Knowledge (BD2K) initiative to build NLP precision medicine software to automate molecular and clinical information extraction, categorization, and ranking of clinical evidence associated with biomarkers that predict response to cancer therapies. She holds a PhD in computer science.

Presentations

Spark NLP: How Roche automates knowledge extraction from pathology and radiology reports Session

Yogesh Pandit, Saif Addin Ellafi, and Vishakha Sharma discuss how Roche applies Spark NLP for healthcare to extract clinical facts from pathology and radiology reports. They then detail the design of the deep learning pipelines used to simplify training, optimization, and inference of such domain-specific models at scale.

Kimberly Shenk is the cofounder of NakedPoppy, an online beauty company that offers personalized clean makeup. Previously, she was the director of data science products at Domino Data Lab, the director of data science at Eventbrite, cofounder of a boutique data science consultancy, a marketing data scientist in retail, and a captain and data scientist in the United States Air Force and at Draper Labs. She serves on the board of the USF Data Science Institute. Kimberly holds an MS from MIT and a BS from the US Air Force Academy.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders must deliver measurable impact on an increasing share of an enterprise's KPIs. Joshua Poduska, Kimberly Shenk, and Mac Steele explain how leading organizations take a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Aashish Sheshadri is a research engineer at PayPal, where he currently ideates and applies deep learning to new avenues and actively contributes to the Jupyter ecosystem and the SEIF Project. He holds an MS in computer science from the University of Texas at Austin, where his research focused on active learning with human-in-the-loop systems.

Presentations

On a deep journey toward five nines Session

Deep learning using sequence-to-sequence networks (Seq2Seq) has demonstrated unparalleled success in neural machine translation. A less explored but highly sought-after area of forecasting can leverage recent gains made in Seq2Seq networks. Aashish Sheshadri explains how PayPal has applied deep networks to monitoring and alerting intelligence.

Daragh Sibley is director of data science at Stitch Fix, where he leads a team of data scientists that use algorithms and the scientific method to optimize the portfolio of products stocked in Stitch Fix’s inventory. Previously, Daragh spent a decade in academia, where he developed neural networks of human language acquisition and tested their predictions with behavioral and neuroimaging experiments.

Presentations

How to make fewer bad decisions Session

A/B testing has revealed the fallibility in human intuition that typically drives business decisions. Eric Colson and Daragh Sibley describe some types of systematic errors domain experts commit, explain how cognitive biases arise from heuristic reasoning processes, and share several mechanisms to mitigate these human limitations and improve decision making.

Alkis Simitsis is a chief scientist for cybersecurity analytics at Micro Focus. Alkis has more than 15 years of experience building innovative information and data management solutions in areas like real-time business intelligence, security, massively parallel processing, systems optimization, data warehousing, graph processing, and web services. He holds 26 US patents and has filed over 50 patent applications in the US and worldwide. He’s published more than 100 papers in refereed international journals and conferences (top publications cited 5,000+ times) and frequently serves in various roles in program committees of top-tier international scientific conferences. He’s also an IEEE senior member and a member of the ACM.

Presentations

Automation of root cause analysis for big data stack applications Session

Alkis Simitsis and Shivnath Babu share an automated technique for root cause analysis (RCA) for big data stack applications based on deep learning techniques, illustrated with Spark and Impala. The concepts they discuss apply generally to the big data stack.

Peter Warren Singer is a strategist at New America and an editor at Popular Science magazine. He has been named by the Smithsonian as one of the nation’s 100 leading innovators, by Defense News as one of the 100 most influential people in defense issues, by Foreign Policy to their “top 100 global thinkers” list, as an official “mad scientist” for the US Army’s Training and Doctrine Command, and by Onalytica social media data analysis as one of the 10 most influential voices in the world on cybersecurity and the 25th most influential in the field of robotics. Peter’s award-winning books include Corporate Warriors: The Rise of the Privatized Military Industry, Children at War, Wired for War: The Robotics Revolution and Conflict in the 21st Century, Cybersecurity and Cyberwar: What Everyone Needs to Know, and Ghost Fleet: A Novel of the Next World War, a technothriller crossed with nonfiction research, which has been endorsed by people who range from the chairman of the Joint Chiefs to the coinventor of the internet to the writer of HBO’s Game of Thrones. His latest book is LikeWar, which explores how social media has changed war and politics and how war and politics have changed social media. It was named an Amazon book of the month and a New York Times “new and notable” selection. In its review, Booklist argued that “LikeWar should be required reading for everyone living in a democracy and all who aspire to.” Peter’s past work includes serving at the Office of the Secretary of Defense, Harvard University, and as the founding director of the Center for 21st Century Security and Intelligence at Brookings, where he was the youngest person named senior fellow in its 100-year history.

Presentations

LikeWar: How social media is changing the world…and how the world is changing social media Keynote

Terrorists live-stream their attacks, “Twitter wars” sell music albums and produce real-world casualties, and viral misinformation alters not just the result of battles but the very fate of nations. The result is that war, tech, and politics have blurred into a new kind of battle space that plays out on our smartphones. P. W. Singer explains.

Animesh Singh is a senior technical staff member (STSM) and lead for IBM Watson and Cloud Platform, where he leads machine learning and deep learning initiatives on IBM Cloud and works with communities and customers to design and implement deep learning, machine learning, and cloud computing frameworks. He has a proven track record of driving design and implementation of private and public cloud solutions from concept to production. In his decade-plus at IBM, Animesh has worked on cutting-edge projects for IBM enterprise customers in the telco, banking, and healthcare industries, particularly focusing on cloud and virtualization technologies, and led the design and development of the first IBM public cloud offering.

Presentations

Use the Jupyter Notebook to integrate adversarial attacks into a model training pipeline to detect vulnerabilities Session

Animesh Singh and Tommy Li explain how to implement state-of-the-art methods for attacking and defending classifiers using the open source Adversarial Robustness Toolbox. The library provides AI developers with interfaces that support the composition of comprehensive defense systems using individual methods as building blocks.
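To make the idea of an evasion attack concrete, here is a minimal sketch of the fast gradient sign method against a toy logistic-regression classifier. This is a generic illustration of the family of attacks the Adversarial Robustness Toolbox packages up, not the library's actual API; the weights, data, and epsilon are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Perturb input x in the direction that increases the logistic loss for label y."""
    p = sigmoid(x @ w + b)      # predicted probability of class 1
    grad_x = (p - y) * w        # d(loss)/dx for binary cross-entropy
    return x + eps * np.sign(grad_x)

# Hypothetical trained model and input
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])        # w.x + b = 1.5, so classified as class 1

x_adv = fgsm(x, y=1.0, w=w, b=b, eps=1.0)

print(sigmoid(x @ w + b) > 0.5)      # True: original input is class 1
print(sigmoid(x_adv @ w + b) > 0.5)  # False: small perturbation flips the prediction
```

A defense built from such building blocks would, for example, augment training data with `x_adv`-style examples or detect inputs whose predictions are unusually sensitive to perturbation.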

Harinder Singh is the global director of data strategy and solution architecture at AB InBev. Harinder is passionate about building data products to unlock the power of information within the organization. Previously, he led the data strategy and analytics program at Walmart.

Presentations

Modernizing AB InBev’s data architecture to improve predictive analytics and forecasting (sponsored by Talend) Session

Harinder Singh explains how, over the course of two years, the world’s largest brewer completely modernized its data architecture and moved it to the cloud. By accelerating data analytics and freeing up the time of its data scientists, AB InBev has been able to better anticipate demand and production, streamline logistics, and develop new beverages that have become best-sellers.

Pramod Singh is a manager for data science at Publicis Sapient and a track lead for a machine learning platform project with Mercedes Benz. He has extensive hands-on experience in machine learning, deep learning, AI, data engineering, programming, and designing algorithms for various business requirements in domains such as retail, telecom, automotive, and consumer goods and has spent the last eight years working on data projects at product and service-based organizations. He’s the author of Machine Learning with PySpark and is also a regular speaker at major conferences and universities. He’s currently writing a couple of books on deep learning and AI techniques for O’Reilly and Apress. Pramod holds a bachelor’s degree in electrical and electronics engineering from Mumbai University, an MBA focused on operations and finance from Symbiosis International University, and a data analytics certification from IIM–Calcutta. He lives in Bangalore with his wife and two-year-old son. In his spare time, he enjoys playing guitar, coding, reading, and watching football.

Presentations

The hitchhiker's guide to deep learning-based recommenders in production Tutorial

Abhishek Kumar and Pramod Singh walk you through deep learning-based recommender and personalization systems they've built for clients. Join in to learn how to use TensorFlow Serving and MLflow for end-to-end productionalization, including model serving, Dockerization, reproducibility, and experimentation, and Kubernetes for deployment and orchestration of ML-based microarchitectures.

Swatee Singh is the first female Distinguished Architect at American Express, where she is spearheading machine learning transformation at the company. Swatee is a proponent of democratizing machine learning by providing the right tools, capabilities, and talent structure to the broader engineering and data science community. The platform her team is building looks to leverage American Express’s closed loop data to enhance its customer experience by combining artificial intelligence, big data, and the cloud, incorporating guiding pillars such as ease of use, reusability, shareability, and discoverability. Swatee also led the American Express Recommendation Engine roadmap and delivery for card-linked merchant offers as well as for personalized merchant recommendations. Over the course of her career, she has applied predictive modeling to a variety of problems ranging from financial services to retailers and even power companies. Previously, Swatee was a consultant at McKinsey & Company and PwC, where she supported leading businesses in retail, banking and financial services, insurance, and manufacturing, and cofounded a medical device startup that used a business card-sized thermoelectric cooling device implanted in an epileptic’s brain as a mechanism to stop seizures. Swatee holds a PhD focusing on machine learning techniques from Duke University.

Presentations

Yay, we are going to deploy an AI/ML-powered app. But wait! Where do I deploy? Session

Organizations developing artificial intelligence and machine learning (AI/ML)-powered applications face two existential questions: Should they consider a fully or partially hybrid cloud environment for AI/ML deployments, and which public cloud will give them the most features and capabilities? Swatee Singh discusses available options for companies facing these challenges.

Guoqiong Song is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She’s engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.

Presentations

Analytics Zoo: Distributed TensorFlow and Keras on Apache Spark Tutorial

Jason Dai, Yuhao Yang, Jennie Wang, and Guoqiong Song explain how to build and productionize deep learning applications for big data with Analytics Zoo—a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline—using real-world use cases from JD.com, MLSListings, the World Bank, Baosight, and Midea/KUKA.

User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Session

User-based real-time recommendation systems have become an important topic in ecommerce. Lu Wang, Nicole Kong, Guoqiong Song, and Maneesha Bhalla demonstrate how to build deep learning algorithms using Analytics Zoo with BigDL on Apache Spark and create an end-to-end system to serve real-time product recommendations.

Nathan Speidel develops novel solutions to big data problems at Yahoo (Verizon Media Group) and works on the Audience Data ETL pipeline. He enjoys leveraging ubiquitous open source tools such as Kafka, Storm, Spark, HDFS, Oozie, and Hive as well as new, cutting-edge open source tools like Bullet to push the limits of streaming data processing, visualization, querying, and transformation.

Presentations

Bullet: Querying streaming data in transit with sketches Session

Akshai Sarma and Nathan Speidel offer an overview of Bullet, a scalable, pluggable, lightweight multitenant query system that runs on any data flowing through a streaming system without storing it. Bullet efficiently supports otherwise intractable operations like top K, count distincts, and windowing without any storage by using sketch-based algorithms.
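To illustrate why sketches make count distincts tractable without storage, here is a toy k-minimum-values (KMV) sketch in pure Python. It is a simplified sketch of the general technique, not Bullet's implementation; the hash construction and choice of k are illustrative, and production systems use tuned libraries such as Apache DataSketches.

```python
import hashlib

def h(item):
    """Hash an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(str(item).encode()).hexdigest()
    return int(digest[:15], 16) / 16**15

class KMV:
    """Keep only the k smallest hash values seen; estimate distincts from them."""
    def __init__(self, k=64):
        self.k = k
        self.mins = []

    def add(self, item):
        v = h(item)
        if v not in self.mins:
            self.mins = sorted(self.mins + [v])[:self.k]

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)            # saw fewer than k distinct items
        return int((self.k - 1) / self.mins[-1])

sketch = KMV(k=64)
for i in range(10_000):
    sketch.add(i % 1000)                     # a stream with 1,000 distinct values

print(sketch.estimate())                     # approximately 1,000, using O(k) memory
```

The key property is that memory stays fixed at k values no matter how long the stream runs, which is what lets a system like Bullet answer such queries in transit.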

Paul Spiegelhalter is a data scientist and deep learning specialist at Pythian, where he’s recognized for his deep expertise in utilizing cutting-edge advances in artificial intelligence and machine learning in order to transform groundbreaking research into usable algorithms. Paul’s experience with predictive analytics and algorithmic modeling runs across a number of industries, including computer vision, predictive maintenance, online advertising and user analysis, medical diagnostics, natural language processing, and anomaly detection. He holds a PhD in mathematics from the University of Illinois at Urbana-Champaign.

Presentations

Machine learning for preventive maintenance of mining haul trucks Session

Alex Gorbachev and Paul Spiegelhalter use the example of a mining haul truck to explain how to map preventive maintenance needs to supervised machine learning problems, create labeled datasets, do feature engineering from sensors and alerts data, evaluate models—then convert it all to a complete AI solution on Google Cloud Platform that's integrated with existing on-premises systems.
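The label-creation step that maps maintenance history onto a supervised problem can be sketched as follows. This is an illustrative example under assumed conventions (a 24-hour prediction horizon, hypothetical timestamps and sensor readings), not the speakers' actual pipeline: each sensor reading is labeled positive if a recorded failure occurs within the horizon after it.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(hours=24)                # predict failures up to 24h ahead

failures = [datetime(2019, 3, 2, 12, 0)]     # known haul-truck failure times

readings = [                                 # (timestamp, engine_temp_c) -- made-up data
    (datetime(2019, 3, 1, 6, 0), 88.0),
    (datetime(2019, 3, 2, 0, 0), 97.5),
    (datetime(2019, 3, 2, 9, 0), 103.2),
]

def label(ts):
    """1 if any failure occurs within HORIZON after this reading, else 0."""
    return int(any(0 <= (f - ts).total_seconds() <= HORIZON.total_seconds()
                   for f in failures))

dataset = [(temp, label(ts)) for ts, temp in readings]
print(dataset)   # [(88.0, 0), (97.5, 1), (103.2, 1)]
```

With labels in hand, feature engineering from sensors and alerts reduces to building the feature columns alongside this target, and any standard classifier can be trained on the result.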

AWS Solutions Architect

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Ankit Srivastava is a senior data scientist on the core data science team for the Azure Cloud + AI Platform Division at Microsoft, where he focuses on commercial and education segment data science projects within the company. Previously, he was a developer on the data integration and insights team. He has built several production-scale ML enrichments that are leveraged for sales compensation and senior leadership team metrics.

Presentations

Executive Briefing: The 6 keys to successful data spelunking Session

At the rate data sources are multiplying, business value can often be developed faster by joining data sources rather than mining a single source to the very end. Ken Johnston and Ankit Srivastava share four years of hands-on practical experience sourcing and integrating massive numbers of data sources to build the Microsoft Business Intelligence Graph (M360 BIG).

Infinite segmentation: Scalable mutual information ranking on real-world graphs Session

Today, normal growth isn't enough—you need hockey-stick levels of growth. Sales and marketing orgs are looking to AI to "growth hack" their way to new markets and segments. Ken Johnston and Ankit Srivastava explain how to use mutual information at scale across massive data sources to help filter out noise and share critical insights with new cohorts of users, businesses, and networks.
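The core computation behind mutual information ranking can be shown in a few lines. This is a bare-bones, small-scale illustration of the idea (real systems compute it distributed, at far larger scale); the feature names and tiny data lists are made up for the example.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information in bits between two paired discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

target = [1, 1, 0, 0, 1, 0, 1, 0]
signal = [1, 1, 0, 0, 1, 0, 1, 0]   # perfectly informative feature (copies the target)
noise  = [0, 1, 0, 1, 0, 1, 0, 1]   # feature unrelated to the target

ranked = sorted(
    [("signal", mutual_information(signal, target)),
     ("noise",  mutual_information(noise,  target))],
    key=lambda kv: kv[1], reverse=True,
)
print(ranked[0][0])   # "signal" ranks above "noise"
```

Ranking candidate signals this way lets noisy or redundant data sources be filtered out before they reach a model or a cohort analysis.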

Mac Steele is the director of product at Domino Data Lab, where he leads strategic development of the company’s data science platform. Based in San Francisco, he works closely with leading financial services, insurance, and technology companies to build a mature data science process across their entire organization. He has extensive experience leading advanced analytical organizations in both finance and tech. Previously, Mac worked in the Research Group at Bridgewater Associates, the world’s largest hedge fund, where he developed quantitative models for the firm’s emerging market portfolio; he also built the core data capability at leading fintech company Funding Circle. Steele holds a degree (summa cum laude) from the Woodrow Wilson School of Public and International Affairs at Princeton University.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders must deliver measurable impact on an increasing share of an enterprise's KPIs. Joshua Poduska, Kimberly Shenk, and Mac Steele explain how leading organizations take a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Wim Stoop is a senior product marketing manager at Cloudera.

Presentations

Hands-on with Cloudera SDX: Setting up your own shared data experience Tutorial

Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX.

Dave Stuart is a senior product manager at the US Department of Defense, where he is leading a large-scale effort to transform the workflows of thousands of enterprise business analysts through Jupyter and Python adoption, making tradecraft more efficient, sharable, and repeatable. Previously, Dave led multiple grass-roots technology adoption efforts, developing innovative training methods that tangibly increased the technical proficiency of a large noncoding enterprise workforce.

Presentations

An alternative approach to adding data science to an organization: Use Jupyter and start with the domain experts Session

Many organizations look to add data science to their skill portfolios through the hiring of data science experts. Dave Stuart shares a complementary way to build a data science-savvy workforce that nets tremendous value: using Jupyter to bring introductory data science practices to domain experts and business analysts.

Patrick Stuedi is a member of the research staff at IBM Research Zurich. His research interests include distributed systems, networking, and operating systems. The general theme of his work is to explore how modern networking and storage hardware can be exploited in distributed systems. Previously, he was a postdoc at Microsoft Research Silicon Valley. Patrick is the creator of several open source projects such as DiSNI (RDMA for Java), DaRPC (low-latency RPC), and Apache Crail (incubating). He holds a PhD from ETH Zurich.

Presentations

Data processing at the speed of 100 Gbps using Apache Crail Session

Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark.

Jagane Sundar is the CTO at WANdisco. Jagane has extensive big data, cloud, virtualization, and networking experience. He joined WANdisco through its acquisition of AltoStor, a Hadoop-as-a-service platform company. Previously, Jagane was founder and CEO of AltoScale, a Hadoop- and HBase-as-a-platform company acquired by VertiCloud. His experience with Hadoop began as director of Hadoop performance and operability at Yahoo. Jagane’s accomplishments include creating Livebackup, an open source project for KVM VM backup, developing a user mode TCP stack for Precision I/O, developing the NFS and PPP clients and parts of the TCP stack for JavaOS for Sun Microsystems, and creating and selling a 32-bit VxD-based TCP stack for Windows 3.1 to NCD Corporation for inclusion in PC-Xware. Jagane is currently a member of the technical advisory board of VertiCloud. He holds a BE in electronics and communications engineering from Anna University.

Presentations

Managing globally distributed data for deep learning using TensorFlow on YARN (sponsored by WANdisco) Session

Jagane Sundar shares a system for replicating data across geographically distributed data centers and discusses the benefits of consistently replicating data that is used by TensorFlow for training.

Václav Surovec comanages the Big Data Department at Deutsche Telekom IT. The department’s more than 45 engineers deliver big data projects to Germany, the Netherlands, and the Czech Republic. Recently, he led the Commercial Roaming project. Previously, he worked at T-Mobile Czech Republic while he was still a student of Czech Technical University in Prague.

Presentations

Data science at Deutsche Telekom: Predicting global travel patterns and network demand Session

Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management.

Elizabeth Svoboda is an award-winning journalist and contributor to MIT Technology Review, Aeon, Sapiens, Psychology Today, the Washington Post, and other publications. She is the author of What Makes a Hero? The Surprising Science of Selflessness as well as the children’s book The Life Heroic (forthcoming from Zest Books in 2019). She’s fascinated with the subtle forces that guide our decisions and actions, sometimes without our knowledge.

Presentations

Hacking the vote: The neuropolitical universe Keynote

Using biosensors and predictive analytics, political campaigns aim to decode your true desires—and influence your vote—without your knowledge. Elizabeth Svoboda explains how these tools work, who's using them, and what they mean for the future of free and fair elections.

Ian Swanson is vice president of product for AI and machine learning at Oracle, where he oversees the product strategy for the company’s AI/ML PaaS offerings. Previously, Ian was founder and CEO of DataScience.com (acquired by Oracle in 2018)—a company that provided an industry-leading enterprise data science platform that combined the tools, libraries, and languages data scientists loved with the infrastructure and workflows their organizations needed. Earlier in his career, he was an executive at American Express and Sprint and CEO of Sometrics, a company that launched the industry’s first global virtual currency platform (acquired by American Express in 2011). That platform, for which he earned a patent, managed more than 3.3 trillion units of virtual currency and served an online audience of 250 million in more than 180 countries. A sought-after speaker and expert on digital transformation, data science, big data, and performance-based analytics, Ian actively advises Fortune 500 companies and invests in leading startups.

Presentations

How to compete in the AI arms race (sponsored by Oracle Cloud Infrastructure) Session

Being an AI-driven enterprise earlier than a competitor is an opportunity within your reach. Join in to find out how, as Ian Swanson dives into problem domains, platform differentiators, ease of use, automation, and scale and shares best practices on quick starts with the right infrastructure choices.

Shubham Tagra is a senior staff engineer at Qubole working on Presto and Hive development and making these solutions cloud ready. Previously, Shubham worked on the storage area network at NetApp. Shubham holds a bachelor’s degree in computer engineering from the National Institute of Technology, Karnataka, India.

Presentations

Cost-effective Presto on AWS with Spot nodes Session

Did you know you can run Presto on AWS at a tenth of the cost by using AWS Spot nodes and just a few architectural enhancements? Shubham Tagra explores the gaps in Presto's architecture, explains how to use Spot nodes, covers the enhancements, and showcases the improvements in reliability and TCO achieved through them.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with Spark NLP Tutorial

David Talby, Alex Thomas, and Claudiu Branzan lead a hands-on introduction to scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Haodong Tang is a big data storage optimization and development engineer at Intel.

Presentations

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric Session

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMOF and explain how it improves Spark analytics performance.

Subhadra Tatavarti leads strategy and product for data platforms and infrastructure at PayPal. Her team manages and propels the data platforms that power PayPal’s core customers, processing over 250 PB of data, and builds products that cater to over 5,000 PayPal developers, analysts, and data scientists—with the goal to not just enable this community but also drive efficiency, reduce friction, and reduce time to market, which in turn drives PayPal’s growth. Subhadra is an experienced leader of large organization-wide transformations that drive innovation and accelerate business delivery.

Presentations

ML and AI at scale at PayPal Session

The PayPal data ecosystem is large, with 250+ PB of data transacting in 200+ countries. Given this massive scale and complexity, discovering and accessing the right datasets in a frictionless environment is a challenge. Subhadra Tatavarti and Chen Kovacs explain how PayPal’s data platform team is helping solve this problem with a combination of self-service, integrated, and interoperable products.

James Taylor is a software engineer in the Data Infrastructure Group at Lyft, where he works on big data systems. Previously, he was an architect at Salesforce, where he founded the Apache Phoenix project and led its development, and worked on federated query processing systems and event-driven programming platforms at BEA Systems.

Presentations

Adaptive ETL to optimize query performance at Lyft Session

James Taylor offers an overview of an automated feedback loop at Lyft to adapt ETL based on the aggregate cost of queries run across the cluster. He also discusses future work to enhance the system through the use of materialized views to reduce the number of ad hoc joins and sorting performed by the most expensive queries by transparently rewriting queries when possible.

Serban Teodorescu is an SRE at Adobe, where he’s part of a small team that manages 20+ Cassandra clusters for Adobe Audience Manager. Previously, he was a Python programmer, and he’s still trying to find out how a developer who preferred SQL databases ended up as an SRE for a Cassandra team. Apart from Cassandra and Python, he’s interested in automating infrastructure provisioning with Terraform.

Presentations

Database migrations don't have to be painful, but the road will be bumpy Session

Adrian Lungu and Serban Teodorescu explain how—inspired by the green-blue deployment technique—the Adobe Audience Manager team developed an active-passive database migration procedure that allows them to test database clusters in production, minimizing the risks without compromising the innovation.

Rhonda Textor is the head of data science at True Fit, a platform dedicated to helping shoppers find clothes and shoes they love and keep. Rhonda is passionate about modeling fit and style elements of both shoppers and garments in order to recommend products to shoppers that they love and that fit and flatter. Previously, Rhonda applied machine learning and data science to problems in remote sensing such as land cover classification of satellite images and problems in national security, such as detecting threats in imagery.

Presentations

Leveraging fashion data to make shopping recommendations Data Case Studies

Fashion recommendation problems are characterized by sparse datasets and large catalogs of styles that have short lifespans—areas traditional transaction-based approaches are not well suited to address. Rhonda Textor explains how to transform raw retail data into scalable recommendations using widely available machine learning libraries.

Yves Thibaudeau is a mathematical statistician and principal researcher at the US Census Bureau. His publications include a book chapter on record linkage and computer matching in the Annals of Applied Statistics. He’s given a number of presentations on statistics and record linkage at conferences since 1988. Yves holds a PhD and an MS in statistics from Carnegie Mellon and a BSc in mathematics from McGill.

Presentations

New directions in record linkage Session

The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, there's been a lot of change in computing capabilities and new techniques, and the Census Bureau is reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications.

Alex Thomas is a data scientist at Indeed. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. An Apache Spark user since version 0.9, he’s also worked with NLP libraries and frameworks, including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

David Talby, Alex Thomas, and Claudiu Branzan lead a hands-on introduction to scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Rachel Thomas was selected by Forbes as one of “20 Incredible Women in AI”, was an early engineer at Uber, and earned her math PhD at Duke. She is co-founder of fast.ai, which created the “Practical Deep Learning for Coders” course that over 200,000 students have taken, and she is also a professor at the University of San Francisco Data Institute. Rachel is a popular writer and keynote speaker on the topics of data science ethics, bias, machine learning, and technical education. Her writing has been read by nearly a million people and has made the front page of Hacker News nine times.

Presentations

Panel: Causes Ethics Summit

Following the review of problematic technologies, we'll hold an interactive discussion with speakers and invited guests to dig deeper into neuroscience, analytics, and more.

Skyler Thomas is an engineer at MapR, where he is designing Kubernetes-based infrastructure to deliver machine learning and big data applications at scale. Previously, Skyler was chief architect for WebSphere user experience at IBM, where he worked with more than a hundred customers to deliver extreme-scaled applications in the healthcare, financial services, and retail industries.

Presentations

Persistent storage for machine learning in KubeFlow Session

KubeFlow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by KubeFlow.

Jordan Tigani was one of the founding engineers on Google BigQuery, wrote the first book on the subject, and now leads its vision and roadmap as director of product management. Previously, Jordan worked at Microsoft Research and on the Windows Kernel team as well as at a number of star-crossed startups. He holds a bachelor’s degree from Harvard and a master’s degree from the University of Washington.

Presentations

Data warehousing is not a use case (sponsored by Google Cloud) Keynote

Modern data analysis requirements have fundamentally redefined what our expectations should be for data warehouses. Join Google BigQuery cocreator Jordan Tigani as he shares his vision for where he sees cloud-scale data analytics heading as well as what technology leaders should be considering as part of their data warehousing roadmap.

Rethinking big data analytics with Google Cloud (sponsored by Google Cloud) Session

Google Cloud Platform combines powerful serverless solutions for enterprise data warehousing, streaming analytics, managed Spark and Hadoop, modern BI, planet-scale data lake, and AI. Jordan Tigani details Google Cloud’s vision and engineering strategy, which can help you move big data analytics solutions to the next level of benefits.

Steve Touw is the cofounder and CTO of Immuta. Steve has a long history of designing large-scale geotemporal analytics across the US intelligence community, including some of the very first Hadoop analytics as well as frameworks to manage complex multitenant data policy controls. He and his cofounders at Immuta drew on this real-world experience to build a software product to make data experimentation easier. Previously, Steve was the CTO of 42Six Solutions (acquired by Computer Sciences Corporation), where he led a large big data services engineering team. Steve holds a BS in geography from the University of Maryland.

Presentations

Successfully deploy machine learning while managing its risks Tutorial

As ML becomes increasingly important for businesses and data science teams alike, managing its risks is quickly becoming one of the biggest challenges to the technology’s widespread adoption. Join Andrew Burt, Steve Touw, Richard Geering, Joseph Regensburger, and Alfred Rossi for a hands-on overview of how to train, validate, and audit machine learning (ML) models in practice.

Martin Traverso is a cofounder of the Presto Software Foundation and one of the original creators of Presto. Previously, he was a software engineer at Facebook, where he led the Presto development team.

Presentations

Presto: Tuning performance of SQL-on-anything analytics Session

Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.

Cindy Tseng is a research scientist with the Applied Research in Automotive Driving Group at Intel, where she has recently been focusing on bias detection in convolutional neural nets. Cindy has also worked in the high-throughput computing and deep learning hardware accelerator spaces. She holds a master’s degree from the Electrical and Computer Engineering Department at Carnegie Mellon University and a bachelor’s degree in electrical engineering and computer science from the University of Michigan-Ann Arbor. Cindy is currently enrolled as a part-time student in the master’s in data science program in computer science at the University of Illinois Urbana-Champaign.

Presentations

AI privacy and ethical compliance toolkit Tutorial

From healthcare to smart home to autonomous vehicles, new applications of autonomous systems are raising ethical concerns about a host of issues, including bias, transparency, and privacy. Iman Saleh, Cory Ilo, and Cindy Tseng demonstrate tools and capabilities that can help data scientists address these concerns and bridge the gap between ethicists, regulators, and machine learning practitioners.

Geoff Tudor is vice president and general manager at Vizion.ai. Geoff has over 22 years of experience in storage, broadband, and networking. Previously, he was chief cloud strategist at Hewlett Packard Enterprise, where he led CxO engagements for Fortune 100 private cloud opportunities, resulting in 10X growth to over $1B in revenues while positioning HPE as the #1 private cloud infrastructure supplier globally. Before that, he cofounded companies and launched award-winning products in cloud storage at Nirvanix (acquired by Oracle), backup and recovery at GNS (acquired by Symantec), and gigabit ethernet last-mile networking at Advent Networks and Tellaire (acquired by MRV Communications). He was nominated for the Ernst and Young Entrepreneur of the Year award and recognized by Discover magazine as an Innovator. Geoff holds an MBA from the University of Texas at Austin and a BA from Tulane University. He holds patents in satellite communications. He resides in Austin, TX, with his wife and two children.

Presentations

Go serverless with Elasticsearch: Eliminate scaling and performance bottlenecks for faster data search (sponsored by Vizion.ai) Session

Elasticsearch is powerful. In its current form, it's also nontrivial and rather expensive to deploy. Not very "elastic." Fortunately, innovations like serverless and microservices are eliminating these barriers, lowering upfront costs, and reducing complexity. Geoff Tudor explains how this is unfolding in the market.

Sandeep Uttamchandani is the hands-on chief data architect and head of data platform engineering at Intuit, where he leads the big data analytics, ML, and transactional platform used by 3M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep held engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep’s experience uniquely combines building enterprise data products and operational expertise in managing petabyte-scale data and analytics platforms in production for IBM’s federal and Fortune 100 customers. Sandeep has received several excellence awards. He has over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.

Presentations

How Intuit reduced time to reliable insights for data pipelines Session

How efficient is your data platform? The single metric Intuit uses is time to reliable insights: the total time spent to ingest, transform, catalog, analyze, and publish. Sandeep Uttamchandani shares three design patterns/frameworks Intuit implemented to deal with three challenges to determining time to reliable insights: time to discover, time to catalog, and time to debug for data quality.

Vinod Vaikuntanathan is an associate professor of computer science within MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and a cofounder of Duality Technologies. His research focuses on lattice-based cryptography and the theory and practice of computing on encrypted data. Vinod holds a PhD in computer science from MIT, where he received the George M. Sprowls Award for the best computer science thesis. His teaching and research in cybersecurity was recently recognized with MIT’s Harold E. Edgerton Faculty Achievement Award, a Sloan Faculty Fellowship, a Microsoft Faculty Fellowship, and a DARPA Young Faculty Award.

Presentations

Machine learning on encrypted data: Challenges and opportunities Session

Alon Kaufman and Vinod Vaikuntanathan discuss the challenges and opportunities of machine learning on encrypted data and describe the state of the art in this space.

Nick Vandivere is CEO at ThoughtTrace, where he has led the company’s transformation into a technology and thought leader for the application of applied artificial intelligence and machine learning. He believes that product innovation, reliability, and an outstanding user experience are the required ingredients for exceptional performance and long-term value. Previously, he was an officer in the US Army and served in an advisory role with the US Department of State. Nick holds a BS in economics from Texas A&M University.

Presentations

Applied AI and NLP for enterprise contract intelligence (sponsored by ThoughtTrace) Session

Building a SaaS AI company targeted at enterprise users presents unique challenges, both technical and nontechnical. Joel Hron and Nick Vandivere walk you through ThoughtTrace's journey, highlighting its beginnings as a company and sharing the challenging use cases the company tackled first.

Stefaan Vervaet is a San Jose-based storage expert and senior director of product marketing and strategic alliances for Western Digital’s Data Center Systems business unit, where he’s responsible for leading marketing and business development efforts to deliver advanced object storage-based systems and emerging storage solutions for today’s at-scale enterprise and cloud workloads, including big data analytics, virtualization, application acceleration, and long-term backup/active archives. He has 15 years of experience as a business-focused technologist in the data storage and backup industry, with an extensive startup background that includes product management and go-to-market positions in the backup space and technical sales and support. An innovator with a proven track record, Stefaan successfully helped build startup companies like DataCenter Technologies, a dedupe technology (acquired by Veritas), and Amplidata, a leading object storage vendor (acquired by HGST, a Western Digital Company), where he established and built the US office running technical sales, support, and operations worldwide. He holds a master’s degree in applied informatics from the University of Ghent, Belgium.

Presentations

How EPFL captured the feel of the Montreux Jazz Festival with its immersive 3D VR to three-geo archive Session

The École Polytechnique Fédérale de Lausanne (EPFL) spearheaded the official digital archival of 15,000+ hours of A/V content captured from the Montreux Jazz Festival since 1967. Stefaan Vervaet and Alain Dufaux explain how EPFL created an immersive 3D VR experience. From capture and store to delivery and experience, they detail the evolution of the workflow that made it all possible.

Nancy Vitale is the CHRO and senior vice president of human resources for Genentech as well as a member of the Genentech Executive Committee. She’s responsible for leading the HR team that’s dedicated to creating a great place for the organization’s 14,000 employees to do their best work in pursuit of the company’s important mission. Nancy is also a board member for the Make-A-Wish Foundation of America, another mission-driven organization. Previously, Nancy was director of HR for the Gillette North American Commercial Division at Procter & Gamble (P&G), where she was integrally involved in the merger of Gillette and P&G; was vice president of HR for CIGNA’s Group Insurance Division; and held other senior HR positions in a variety of industries and companies, including Deloitte Consulting, DANKA Office Imaging, and the Times Publishing Company. Nancy is known for her drive and leadership. She thrives on challenges, particularly those that involve complex organizational dynamics. Her life purpose is to “make meaningful connections,” which she explains as being for both people and ideas. Nancy holds a bachelor’s degree in business administration from the University of Michigan and an MBA from Emory University. In 2015, Nancy was recognized by the San Francisco Business Times as one of the “most influential women in the Bay Area” and in 2016 was named to its “forever influential” honor roll.

Presentations

Future of the firm: How are executives preparing now? Session

In this panel session, executives will discuss how their companies are adapting to the workforce, business, and economic trends shaping the future of business.

Lars Volker is a software engineer at Cloudera. He has worked on various parts of Apache Impala, including crash handling, its Parquet scanners, and scan range scheduling. Most recently, he worked on integrating Kudu’s RPC framework into Impala. Previously, he worked on various databases at SAP.

Presentations

Accelerating analytical antelopes: Integrating Apache Kudu's RPC into Apache Impala Session

In recent years, Apache Impala has been deployed to clusters that are large enough to hit architectural limitations in the stack. Lars Volker and Michael Ho cover the efforts to address the scalability limitations in the now legacy Thrift RPC framework by using Apache Kudu's RPC, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos.

Bradley Voytek is an associate professor in the Department of Cognitive Science at UC San Diego and is the inaugural Halıcıoğlu Data Science Institute Fellow. In addition, he’s an Alfred P. Sloan Neuroscience Research Fellow and National Academies Kavli Fellow as well as a founding faculty member of the UC San Diego Data Science program and the Halıcıoğlu Data Science Institute. His research lab studies the computational role that neural oscillations play in coordinating information transfer in the brain. To do this, they combine large-scale data mining and machine learning techniques with hypothesis-driven experimental research. He’s also known for his zombie brain “research” with his friend and fellow neuroscientist Timothy Verstynen, with whom he has published the book Do Zombies Dream of Undead Sheep?

Presentations

Panel: Causes Ethics Summit

Following the review of problematic technologies, we'll hold an interactive discussion with speakers and invited guests to dig deeper into neuroscience, analytics, and more.

The human side of data and technology Ethics Summit

The vast majority of data scientists and AI/ML researchers are interested in understanding human behavior, so it's critical to consider both the people who generated the data we analyze and the processes that generate it. Join Bradley Voytek to explore the human side of data and technology from a neuroscientific, cognitive, and social context.

Todd Walter is chief technologist and fellow at Teradata, where he helps business leaders, analysts, and technologists better understand all of the astonishing possibilities of big data and analytics in view of emerging and existing capabilities of information infrastructures. Todd has been with Teradata for more than 30 years. He’s a sought-after speaker and educator on analytics strategy, big data architecture, and exposing the virtually limitless business opportunities that can be realized by architecting with the most advanced analytic intelligence platforms and solutions. Todd holds more than a dozen patents.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the Lightbend Fast Data Platform project, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He is a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago.

Presentations

Executive Briefing: What it takes to use machine learning in fast data pipelines Session

Your team is building machine learning capabilities. Dean Wampler demonstrates how to integrate these capabilities in streaming data pipelines so you can leverage the results quickly and update them as needed and covers challenges such as how to build long-running services that are very reliable and scalable and how to combine a spectrum of very different tools, from data science to operations.

Hands-on machine learning with Kafka-based streaming pipelines Tutorial

Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipelines and doing periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.

Jason Wang is a software engineer at Cloudera focusing on the cloud.

Presentations

Journey to the cloud: Architecting for the cloud through customer stories Session

Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms.

Running multidisciplinary big data workloads in the cloud Tutorial

There are many challenges with moving multidisciplinary big data workloads to the cloud and running them. Jason Wang, Brandon Freeman, Michael Kohs, Akihiro Nishikawa, and Toby Ferguson explore cloud architecture and its challenges and walk you through using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Jiao (Jennie) Wang is a deep learning R&D engineer on the big data technology team at Intel, where she works in the area of big data analytics. She’s engaged in developing and optimizing distributed deep learning frameworks on Apache Spark.

Presentations

Analytics Zoo: Distributed TensorFlow and Keras on Apache Spark Tutorial

Jason Dai, Yuhao Yang, Jennie Wang, and Guoqiong Song explain how to build and productionize deep learning applications for big data with Analytics Zoo—a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline—using real-world use cases from JD.com, MLSListings, the World Bank, Baosight, and Midea/KUKA.

Analytics Zoo: Distributed TensorFlow in production on Apache Spark Session

Yuhao Yang and Jennie Wang demonstrate how to run distributed TensorFlow on Apache Spark with the open source software package Analytics Zoo. Compared to other solutions, Analytics Zoo is built for production environments and encourages more industry users to run deep learning applications within big data ecosystems.

Luyang Wang is a data scientist and big data engineer at Office Depot. He has a strong system architecture and software development background.

Presentations

User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Session

User-based real-time recommendation systems have become an important topic in ecommerce. Lu Wang, Nicole Kong, Guoqiong Song, and Maneesha Bhalla demonstrate how to build deep learning algorithms using Analytics Zoo with BigDL on Apache Spark and create an end-to-end system to serve real-time product recommendations.

Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (a.k.a. tuning) or our jobs may be eaten by Cthulhu. Holden Karau and Rachel Warren explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads—including new settings in 2.4.

Robin Way is a faculty member for banking at the International Institute of Analytics and the founder and president of management analytics consultancy Corios. Robin has over 25 years of experience in the design, development, execution, and improvement of applied analytics models for clients in the credit, payments, lending, brokerage, insurance, and energy industries. Previously, Robin was a managing analytics consultant in SAS Institute’s Financial Services Business Unit for 12 years and spent another 10+ years in analytic management roles for several client-side and consulting firms. Robin’s professional passion is devoted to democratizing and demystifying the science of applied analytics. His contributions to the field correspondingly emphasize statistical visualization, analytical data preparation, predictive modeling, time series forecasting, mathematical optimization applied to marketing, and risk management strategies. He is author of Skate Where the Puck’s Headed: A Playbook for Scoring Big with Predictive Analytics. Robin holds an undergraduate degree from the University of California at Berkeley; his subsequent graduate-level coursework emphasized the analytical modeling of human and consumer behavior. He lives in Portland, Oregon, with his wife, Melissa, and two sons, Colin and Liam. In his spare time, Robin plays soccer and holds a black belt in taekwondo.

Presentations

Organic intelligence: Telling a story about the human experience with math Data Case Studies

Why do we call it "artificial" intelligence? Did AI write itself? No, of course it didn't. We invented the math, built the computer technology, and harnessed the data sources. Robin Way argues that we should reposition what we do as "organic intelligence": we apply math and computers to data to tell a story about the human experience. Join in to learn what organic intelligence is all about.

Thomas Weise is a software engineer for the streaming platform at Lyft. He’s also a PMC member for the Apache Apex and Apache Beam projects and has contributed to several more projects within the ASF ecosystem. Thomas is a frequent speaker at international big data conferences and the author of Learning Apache Apex.

Presentations

The magic behind your Lyft ride prices: A case study on machine learning and streaming Session

Rakesh Kumar and Thomas Weise explore how Lyft dynamically prices its rides with a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability—allowing the pricing system to be more adaptable to real-world changes.

Jeffrey Wong is the global chief innovation officer at EY, where he’s challenging everything from the way EY operates internally to how it provides services to its clients. Jeff works across the entire organization to help identify, share, and scale the best ideas and serves the EY Global Innovation team’s remit to research and explore new technologies. He brings deep experience across strategy, investing, and building new ventures globally. Throughout his career, he has built new businesses across various concepts, including local commerce, B2B exchanges, services, mobile, and big data at Boston Consulting Group, JAFCO America Ventures, JP Morgan Partners, and eBay. Jeff sits on the Oxford Foundry Board at Oxford University and the advisory board for AI4All, a nonprofit organization working to increase diversity and inclusion in artificial intelligence. He’s also a member of the World Economic Forum’s Global Future Council on Innovation Ecosystems. Jeff holds an AB in economics, master’s degrees in industrial engineering and engineering management, and an MBA from Stanford University.

Presentations

Digital transformation writ large Session

Jeffrey Wong explains how an old-world firm leveraged technology to transform everything and thrive in our new world of continuous change—anticipating, scaling, and adapting to meet internal needs and client expectations.

Jerry Xu is cofounder and CTO at Datatron Technologies. An innovative software engineer with extensive programming and design experience in storage systems, online services, mobile, distributed systems, virtualization, and OS kernels, Jerry also has a demonstrated ability to direct and motivate a team of software engineers to complete projects meeting specifications and deadlines. Previously, he worked at Zynga, Twitter, Box, and Lyft, where he built the company’s ETA machine learning model. Jerry is the author of open source project LibCrunch. He’s a three-time Microsoft Gold Star Award winner.

Presentations

Model governance in the enterprise Session

Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them.

Boris Yakubchik is a data scientist at Forbes. He creates user-facing products that use machine learning and builds systems from the servers that clean and process data to the frontend that users interact with.

Presentations

Creating a bionic newsroom Session

Boris Yakubchik and Salah Zalatimo offer an overview of Bertie, Forbes's new publishing platform—an AI assistant that learns from writers and suggests improvements—and detail Bertie’s features, architecture, and ultimate goals, paying special attention to how the company implemented an ensemble of machine learning models that, together, make up the AI assistant's skill set and personality.

Yuhao Yang is a senior software engineer on the big data team at Intel, where he focuses on deep learning algorithms and applications—particularly distributed deep learning and machine learning solutions for fraud detection, recommendation, speech recognition, and visual perception. He’s also an active contributor to Apache Spark MLlib.

Presentations

Analytics Zoo: Distributed TensorFlow and Keras on Apache Spark Tutorial

Jason Dai, Yuhao Yang, Jennie Wang, and Guoqiong Song explain how to build and productionize deep learning applications for big data with Analytics Zoo—a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline—using real-world use cases from JD.com, MLSListings, the World Bank, Baosight, and Midea/KUKA.

Analytics Zoo: Distributed TensorFlow in production on Apache Spark Session

Yuhao Yang and Jennie Wang demonstrate how to run distributed TensorFlow on Apache Spark with the open source software package Analytics Zoo. Compared to other solutions, Analytics Zoo is built for production environments and encourages more industry users to run deep learning applications within big data ecosystems.

Jeffrey Yau is the chief data scientist at global asset-management and research firm AllianceBernstein, where he leads all of the data science efforts. Jeffrey has many years of experience in applying a wide range of econometric and machine learning techniques to create analytic solutions for financial institutions, policy institutions, and businesses, and his expertise includes combining high-performance computing and big data technology to generate analytic insights for strategic decision making. Previously, he was the vice president and head of data science at Silicon Valley Data Science, where he led a team of PhD data scientists helping companies transform their businesses using advanced data science techniques and emerging technology; the head of risk analytics at Charles Schwab; director of financial risk management consulting at KPMG; assistant director at Moody’s Analytics; and assistant professor of economics at Virginia Tech. He’s active in the data science community and often speaks at data science conferences and local events. Jeffrey holds a PhD and an MA in economics from the University of Pennsylvania and a BS in mathematics and economics from UCLA.

Presentations

Time series forecasting using statistical and machine learning models: When and how Session

Time series forecasting techniques are applied in a wide range of scientific disciplines, business scenarios, and policy settings. Jeffrey Yau discusses the applications of statistical time series models, such as ARIMA, VAR, and regime-switching models, and machine learning models, such as random forest and neural network-based models, to forecasting problems.

Ting-Fang Yen is director of research at DataVisor, the leading fraud, crime, and abuse detection solution utilizing unsupervised machine learning to detect fraudulent and malicious activity such as fake account registrations, fraudulent transactions, spam, account takeovers, and more. She has over 10 years of experience in applying big data analytics and machine learning to tackle problems in cybersecurity. Ting-Fang holds a PhD in electrical and computer engineering from Carnegie Mellon University.

Presentations

Talking to the machines: Monitoring production machine learning systems Session

Ting-Fang Yen details an approach for monitoring production machine learning systems that handle billions of requests daily by discovering detection anomalies, such as spurious false positives, as well as gradual concept drifts when the model no longer captures the target concept. Join in to explore new tools for detecting undesirable model behaviors early in large-scale online ML systems.

Fang Yu is the cofounder and CTO of DataVisor, where her work focuses on big data for security. Over the past 10 years, Fang has developed algorithms and built systems for identifying various kinds of malicious traffic including worms, spam, bot queries, faked and hijacked account activities, and fraudulent financial transactions. Fang holds a PhD degree from the EECS Department at the University of California, Berkeley.

Presentations

Detecting coordinated fraud attacks using deep learning Session

Online fraud flourishes as online services become ubiquitous in our daily life. Fang Yu explains how DataVisor leverages cutting-edge deep learning technologies to address the challenges in large-scale fraud detection.

Ali Zaidi is a PhD student in statistics at UC Berkeley. Previously, he was a data scientist in Microsoft’s AI and Research Group, where he worked to make distributed computing and machine learning in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. Before that, Ali was a research associate at NERA (National Economic Research Associates), providing statistical expertise on financial risk, securities valuation, and asset pricing. He studied statistics at the University of Toronto and computer science at Stanford University.

Presentations

Building high-performance text classifiers on a limited labeling budget Session

Robert Horton, Mario Inchiosa, and Ali Zaidi demonstrate how to use three cutting-edge machine learning techniques—transfer learning from pretrained language models, active learning to make more effective use of a limited labeling budget, and hyperparameter tuning to maximize model performance—to up your modeling game.

Tristan Zajonc is CTO for machine learning at Cloudera. Previously, Tristan led engineering for Cloudera Data Science Workbench and was the cofounder and CEO of enterprise data science platform Sense (acquired by Cloudera in 2016). He has over 15 years’ experience in applied data science, machine learning, and machine learning systems development across academia and industry and holds a PhD from Harvard University.

Presentations

Cloud native machine learning: Emerging trends and the road ahead Session

Data platforms are being asked to support an ever increasing range of workloads and compute environments, including machine learning and elastic cloud platforms. Tristan Zajonc and Tim Chen discuss emerging capabilities, including running machine learning and Spark workloads on autoscaling container platforms, and share their vision for the road ahead for ML and AI in the cloud.

Salah Zalatimo is the chief digital officer at Forbes, where he leads the product management, design, engineering, audience development, and ecommerce groups. He joined the company in 2015 through the acquisition of his startup, Camerama. He has since led a digital transformation, helping generate record traffic and revenues.

Salah is a leader in the New York tech community, working closely with the NYC Department of Education. He is an active speaker on digital transformation and AI in journalism. He serves on the Media Advisory Board for Quinnipiac University and the board of Haymakers For Hope charity boxing organization.

Salah holds a BA and MBA from Columbia University. He lives with his wife and two sons in Bed Stuy, Brooklyn.

Presentations

Creating a bionic newsroom Session

Boris Yakubchik and Salah Zalatimo offer an overview of Bertie, Forbes's new publishing platform—an AI assistant that learns from writers and suggests improvements—and detail Bertie’s features, architecture, and ultimate goals, paying special attention to how the company implemented an ensemble of machine learning models that, together, make up the AI assistant's skill set and personality.

Jian Zhang is a senior software engineer manager at Intel, where he and his team primarily focus on open source storage development and optimizations on Intel platforms and build reference solutions for customers. He has 10 years of experience doing performance analysis and optimization for open source projects like Xen, KVM, Swift, and Ceph and working with HDFS and benchmarking workloads like SPEC and TPC. Jian holds a master’s degree in computer science and engineering from Shanghai Jiao Tong University.

Presentations

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric Session

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance.

Yongzheng Zhang is a senior manager of data mining at LinkedIn and an active researcher and practitioner of text mining and machine learning. He’s developed many practical and scalable solutions for utilizing unstructured data for ecommerce and social networking applications, including search, merchandising, social commerce, and customer-service excellence. Yongzheng is a highly regarded expert in text mining and has published and presented many papers in top journals and at conferences. He also organizes tutorials and workshops on sentiment analysis at prestigious conferences. He holds a PhD in computer science from Dalhousie University in Canada.

Presentations

Using the full spectrum of data science to drive business decisions Tutorial

Thanks to the rapid growth in data resources, business leaders now appreciate the importance (and the challenge) of mining information from data. Join in as a group of LinkedIn's data scientists share their experiences successfully leveraging emerging techniques to assist in intelligent decision making.

Yuan Zhou is a senior software development engineer in the Software and Service Group at Intel, where he works on the Open Source Technology Center team primarily focused on big data storage software. He has been working in databases, virtualization, and cloud computing for most of his 7+ year career at Intel.

Presentations

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric Session

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance.

Corey Zumar is a software engineer at Databricks, where he’s working on machine learning infrastructure and APIs for model management and production deployment. Corey is also an active contributor to MLflow. He holds a master’s degree in computer science from UC Berkeley. At UC Berkeley’s RISELab, he was one of the lead developers of Clipper, an open source project and research effort focused on high-performance model serving.

Presentations

MLflow: An open platform to simplify the machine learning lifecycle Session

Developing applications that leverage machine learning is difficult. Practitioners need to be able to reproduce their model development pipelines, as well as deploy models and monitor their health in production. Corey Zumar offers an overview of MLflow, which simplifies this process by managing, reproducing, and operationalizing machine learning through a suite of model tracking and deployment APIs.