Sep 23–26, 2019

Speakers

Hear from innovative researchers, talented CxOs, and senior developers who are doing amazing things with data. More speakers will be announced; please check back for updates.


Saif Addin Ellafi is a software developer at John Snow Labs, where he’s the main contributor to Spark NLP. A data scientist, forever student, and extreme sports and gaming enthusiast, Saif has wide experience in problem solving and quality assurance in the banking and finance industry.

Presentations

Feature engineering with Spark NLP to accelerate clinical trial recruitment Session

Recruiting patients for clinical trials is a major challenge in drug development. Saif Addin Ellafi and Scott Hoch explain how Deep 6 uses Spark NLP to scale its training and inference pipelines to millions of patients while achieving state-of-the-art accuracy. They dive into the technical challenges, the architecture of the full solution, and the lessons the company learned.

Natural language understanding at scale with Spark NLP Tutorial

David Talby, Alex Thomas, Saif Addin Ellafi, and Claudiu Branzan walk you through state-of-the-art natural language processing (NLP) using the highly performant, highly scalable open source Spark NLP library. You'll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Sameer Agarwal is an Apache Spark committer and a software engineer at Facebook, where he works on the data warehouse team building distributed systems and databases that scale across clusters of tens of thousands of machines. He received his PhD in databases from UC Berkeley’s AMPLab, where he worked on BlinkDB, an approximate query engine for Spark.

Presentations

Scaling Apache Spark at Facebook Session

Apache Spark is the largest compute engine at Facebook by CPU. Sameer Agarwal dives into the story of how Facebook optimized, tuned, and scaled Apache Spark to run on clusters of tens of thousands of machines, process hundreds of petabytes of data, and serve thousands of data scientists, engineers, and product analysts every day.

Ashish Aggarwal is a principal engineer at Expedia Group, leading Haystack—an open source project that’s rapidly being adopted for distributed tracing in fast-growing ecommerce companies such as Expedia, HomeAway, Hotels.com, Egencia, and SoFi. He’s a full-stack software and large-scale data systems engineer with experience in distributed web applications and data analytics platforms leveraging a multitude of languages and technologies. He has spoken at the Open Source Summit (Linux Foundation) and chaired the OpenTracing meetup in Austin in 2018.

Presentations

Real-time anomaly detection on observability data using neural networks Session

Observability is the key in modern architecture to quickly detect and repair problems in microservices. Modern observability platforms have evolved beyond simple application logs and include distributed tracing systems like Zipkin and Haystack. Keshav Peswani and Ashish Aggarwal explore how combining them with real-time, intelligent alerting mechanisms helps in the automated detection of problems.

Panos Alexopoulos is the head of ontology at Textkernel, where he leads a team of data professionals (linguists, data scientists, and data engineers) in developing and delivering a large cross-lingual knowledge graph in the HR and recruitment domain. Born and raised in Athens, Greece, and living in Amsterdam, Netherlands, he’s been working for more than 12 years at the intersection of data, semantics, language, and software, helping build semantics-powered systems that deliver value to business and society. Previously, he was a semantic applications research manager at Expert System and a semantic solutions architect and ontologist at IMC Technologies. Panos holds a PhD in knowledge engineering and management from the National Technical University of Athens and has published around 60 papers in international conferences, journals, and books. He strives to present his work and experiences in all kinds of venues, trying to bridge the gap between academia and industry so that each can benefit from the other.

Presentations

Mind the semantic gap: How "talking semantics" can help you perform better data science Session

In an era where discussions among data scientists are monopolized by the latest trends in machine learning, the role of semantics in data science is often underplayed. Panos Alexopoulos presents real-world cases where making fine, seemingly pedantic distinctions in the meaning of data science tasks and the related data has significantly improved their effectiveness and value.

Alasdair Allan is a director at Babilim Light Industries and a scientist, author, hacker, maker, and journalist. An expert on the internet of things and sensor systems, he’s famous for hacking hotel radios, deploying mesh networked sensors through the Moscone Center during Google I/O, and for being behind one of the first big mobile privacy scandals when, back in 2011, he revealed that Apple’s iPhone was tracking user location constantly. He has written eight books, and writes regularly for Hackster.io, Hackaday, and other outlets. A former astronomer, he also built a peer-to-peer autonomous telescope network that detected what was, at the time, the most distant object ever discovered.

Presentations

Executive Briefing: Making intelligent insights at the edge—The demise of big data? Session

The arrival of a new generation of smart embedded hardware may cause the demise of large-scale data harvesting. In its place, smart devices will let us process data at the edge and extract insights without storing potentially privacy- and GDPR-infringing data. Join Alasdair Allan to learn why the current age, in which privacy is no longer "a social norm," may not long survive the coming of the IoT.

John Allen is the head of the group-wide data analytics function at Deutsche Bank, supporting both business and infrastructure teams. The function uses data science and advanced big data analytics technology across a variety of use cases, including revenue generation, cost avoidance, and risk reduction. John is responsible for a range of services for the business, including analytics strategy, research and development, shared big data technology platforms, and agile, customer-centric project delivery.

Presentations

How Deutsche Bank industrialized AI and machine learning Session

As an early adopter of data science, machine learning, and AI, Deutsche Bank's analytics function is trailblazing new ways to drive revenues, lower costs, and reduce risk across all areas of the group. John Allen shares how his team combines commercial offerings with open source technologies to revolutionize legacy processes and transform the way the bank uses technology to drive innovation.

Shradha Ambekar is a staff software engineer in the Small Business Data Group at Intuit, where she’s the technical lead for the lineage framework (SuperGLUE) and real-time analytics. She has made several key contributions to solutions built around the data platform and has contributed to spark-cassandra-connector. She has experience with HDFS, Hive, MapReduce, Hadoop, Spark, Kafka, Cassandra, and Vertica. Previously, she was a software engineer at Rearden Commerce. Shradha spoke at the O’Reilly Open Source Conference in 2019. She holds a bachelor’s degree in electronics and communication engineering from NIT Raipur, India.

Presentations

Time travel for data pipelines: Solving the mystery of what changed Session

A business insight shows a sudden spike. It can take hours, or days, to debug data pipelines to find the root cause. Shradha Ambekar, Sunil Goplani, and Sandeep Uttamchandani outline how Intuit built a self-service tool that automatically discovers data pipeline lineage and tracks every change, helping debug the issues in minutes—establishing trust in data while improving developer productivity.

Ajay Anand is the Chief Product Officer at Kyvos Insights. He is an industry veteran with more than 25 years of experience in creating best-for-business solutions for enterprises in the areas of business intelligence, distributed computing and storage. At Kyvos, he leads the Core Product Team and has been instrumental in shaping the product since its inception. Previously he led product management for Hadoop at Yahoo, after which he founded Datameer. Earlier in his career, he led product management teams at SGI and Sun.

Presentations

Transforming Financial Reporting Services with Massively Scalable OLAP (sponsored by Kyvos Insights) Session

Learn how you can overcome the challenges of traditional OLAP solutions and scale BI to deliver quick insights to business users across your enterprise.

Janisha Anand is a senior business development manager for data lakes at AWS, where she focuses on designing, implementing, and architecting large-scale solutions in the areas of data management, data processing, data architecture, and data analytics.

Presentations

Fuzzy matching and deduplicating data: Techniques for advanced data prep Session

Nikki Rouda and Janisha Anand demonstrate how to deduplicate or link records in a dataset, even when the records don’t have a common unique identifier and no fields match exactly. You'll also learn how to link customer records across different databases, match external product lists against your own catalog, and solve tough challenges to prepare and cleanse data for analysis.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He’s taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He’s widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Creating a data engineering culture Session

Jesse Anderson covers the most common reasons data engineering teams fail and how to correct them, including ways to help your management understand that data engineering is genuinely complex and time consuming. It isn’t data warehousing with new names, and a data engineering team can’t be compared to, for example, the web development team.

Professional Kafka development 2-Day Training

Jesse Anderson offers you an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it, as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL.

Professional Kafka development (Day 2) Training Day 2

Jesse Anderson offers you an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it, as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL.

André Araujo is a principal solutions architect at Cloudera. An experienced consultant with a deep understanding of the Hadoop stack and its components and a methodical and keen troubleshooter who loves making things run faster, André is skilled across the entire Hadoop ecosystem and specializes in building high-performance, secure, robust, and scalable architectures to fit customers’ needs.

Presentations

Cloudera Edge Management in the IoT Tutorial

There are too many edge devices and agents, and you need to control and manage them. Purnima Reddy Kuchikulla, Timothy Spann, Abdelkrim Hadjidj, and Andre Araujo walk you through handling the difficulty in collecting real-time data and the trouble with updating a specific set of agents with edge applications. Get your hands dirty with CEM, which addresses these challenges with ease.

Amar Arsikere is the founder and CEO of Infoworks.io. Previously, Amar built several large-scale data systems on Bigtable and Hadoop at Google and Zynga. At Zynga, he also led the design and deployment of the gaming database, the largest in-memory database in the world at the time it was built. At Google, he pioneered the development of a data warehousing platform on Bigtable. Amar is a recipient of the InfoVision award from the IEC and the Jars Top 25 award, and he holds several patents in the field of software and internet technologies.

Presentations

Solving for enterprise scale analytics and agile data operations (sponsored by Infoworks) Session

The breakneck pace of business change, and its insatiable appetite for data and analytics to drive digital transformation, makes agile use of data an imperative.

Amit Assudani is a senior technical architect for cloud and big data analytics at Impetus Technologies. He has more than 12 years of experience working with Fortune 100 enterprises in the financials and risk management domain on their digital transformation journeys. His deep expertise in big data analytics has helped enterprises set up and effectively manage their data and cloud strategies.

Presentations

DevOps in the cloud: Deploy, monitor, manage and automate (sponsored by Impetus) Session

Data lakes and analytical processing on the cloud are a reality. This presents new challenges for DevOps with respect to governance, continuous integration and deployment, and more. This session presents views on how to maintain sanity in your development organization while implementing the many dimensions of building an efficient cloud-based data platform and application development environment.

Peter Bailis is the founder and CEO of Sisu, a data analytics platform that helps users understand the key drivers behind critical business metrics in real time. Peter is also an assistant professor of computer science at Stanford University, where he coleads Stanford DAWN, a research project focused on making it dramatically easier to build machine learning-enabled applications. He holds a PhD from the University of California, Berkeley, for which he was awarded the ACM SIGMOD Jim Gray Doctoral Dissertation Award, and an AB from Harvard College in 2011, both in computer science.

Presentations

Executive Briefing: Usable machine learning—Lessons from Stanford and beyond Session

Despite a meteoric rise in data volumes within modern enterprises, enabling nontechnical users to put this data to work in diagnostic and predictive tasks remains a fundamental challenge. Peter Bailis details the lessons learned in building new systems to help users leverage the data at their disposal, drawing on production experience from Facebook, Microsoft, and the Stanford DAWN project.

Vitaliy Baklikov is a senior vice president at DBS Bank, with over 15 years of experience in advanced analytics and distributed architectures. He leads a team of architects who drive the evolution of the bank’s platform and tackle use cases ranging from batch and streaming big data processing to sophisticated machine learning workloads, and he’s building a next-generation enterprise data platform for the bank that sits across private and public clouds. Previously, he held various roles at startups and financial institutions across the US, UK, and Russia.

Presentations

Enabling big data and AI workloads on the object store at DBS Bank Session

Vitaliy Baklikov and Dipti Borkar explore how DBS Bank built a modern big data analytics stack leveraging an object store even for data-intensive workloads like ATM forecasting and how it uses Alluxio to orchestrate data locality and data access for Spark workloads.

Gowri Balasubramanian is a senior solutions architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on both relational and NoSQL database services, helping them improve the value of their solutions when using AWS.

Presentations

From relational databases to cloud databases: Using the right tool for the right job Tutorial

Enterprises adopt cloud platforms such as AWS for agility, elasticity, and cost savings. Database design and management requires a different mindset in AWS when compared to traditional RDBMS design. Gowrishankar Balasubramanian and Rajeev Srinivasan explore considerations in choosing the right database for your use case and access pattern while migrating or building a new application on the cloud.

Pete Ball is a consultant with Liminal Innovation, a consultancy focused on innovation in the health care industry. Pete applies his experience in data science, software development, intellectual property and business law, research collaborations, joint ventures, and startups to solve problems and exploit opportunities with new technology. Prior to launching Liminal Innovation, Pete worked in technology commercialization at Mayo Clinic Ventures and Johns Hopkins Technology Ventures. Pete serves as co-chair of the IEEE SA P2418.6 Blockchain in Healthcare and Life Sciences IP subgroup. Pete is a registered patent attorney and a member of the Maryland Bar.

Presentations

Semantics and graph data models in the enterprise data fabric (sponsored by Cambridge Semantics) Session

Join industry consultant Peter Ball of Liminal Innovation and Barbara Petrocelli, VP of field operations at Cambridge Semantics, to learn how enterprise data fabrics are reshaping the modern data management landscape.

Dylan Bargteil is a data scientist in residence at the Data Incubator, where he works on research-guided curriculum development and instruction. Previously, he worked with deep learning models to assist surgical robots and was a research and teaching assistant at the University of Maryland, where he developed a new introductory physics curriculum and pedagogy in partnership with the Howard Hughes Medical Institute (HHMI). Dylan studied physics and math at the University of Maryland and holds a PhD in physics from New York University.

Presentations

Machine learning from scratch in TensorFlow 2-Day Training

The TensorFlow library provides for the use of computational graphs with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Dylan Bargteil explores TensorFlow's capabilities in Python, demonstrating how to build machine learning algorithms piece by piece and how to use TensorFlow's Keras API with several hands-on applications.

Machine learning from scratch in TensorFlow (Day 2) Training Day 2

The TensorFlow library provides for the use of computational graphs with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Dylan Bargteil explores TensorFlow's capabilities in Python, demonstrating how to build machine learning algorithms piece by piece and how to use TensorFlow's Keras API with several hands-on applications.

Dan Barker leads RSA Archer as the chief architect in the company’s cloud migration and conversion to SaaS. Previously, he was the chief architect at the National Association of Insurance Commissioners, leading its technical and cultural transformation. Dan spent 12 years in the military as a fighter jet mechanic before transitioning to a career in technology as a software engineer and manager. Dan is an organizer of DevOps KC and the devopsdays KC conference.

Presentations

Creating a data culture at a 150-year-old nonprofit Findata

Sometimes if you build it, no one comes. What do you do if your data and tools aren't being leveraged as expected? It's a massive shift in thinking, but the payoff can be even bigger. Dan Barker explains how he and his team took the National Association of Insurance Commissioners, a 150-year-old nonprofit, into the data age and how you can do it too.

William Benton leads a team of data scientists and engineers at Red Hat, where he has built machine learning systems to solve problems ranging from understanding infrastructure logs at datacenter scale to designing better cycling workouts.

Presentations

Sketching data and other magic tricks Tutorial

Go hands-on with Sophie Watson and William Benton to examine data structures that let you answer interesting queries about massive datasets in fixed amounts of space and constant time. This seems like magic, but they'll explain the key trick that makes it possible and show you how to use these structures for real-world machine learning and data engineering applications.

John Berryman started out in the field of aerospace engineering but soon found that he was more interested in math and software than in satellites and aircraft. He made the leap into software development, specializing in search and recommendation technologies. John’s a senior software engineer at Eventbrite, where he helps build Eventbrite’s event discovery platform. He also recently coauthored a tech book, Relevant Search (Manning). The proceeds from the book have mostly paid for the coffee consumed while writing it.

Presentations

Search logs + machine learning = autotagged inventory Session

Eventbrite is exploring a new machine learning approach that allows it to harvest data from customer search logs and automatically tag events based upon their content. John Berryman dives into the results and how they have allowed the company to provide users with a better inventory-browsing experience.

Alex Beutel is a staff research scientist on the SIR team at Google Brain, leading a team working on ML fairness and researching neural recommendation and ML for systems. He earned a PhD from Carnegie Mellon University’s Computer Science Department and his BS from Duke University in computer science and physics. His PhD thesis on large-scale user behavior modeling, covering recommender systems, fraud detection, and scalable machine learning, was given the SIGKDD 2017 Doctoral Dissertation Award Runner-Up. He received the Best Paper Award at KDD 2016 and ACM GIS 2010, was a finalist for best paper in KDD 2014 and ASONAM 2012, and was awarded the Facebook Fellowship in 2013 and the NSF Graduate Research Fellowship in 2011. More details can be found at alexbeutel.com.

Presentations

War stories from the front lines of ML Session

Machine learning techniques are being deployed across almost every industry and sector. But this adoption comes with real, and oftentimes underestimated, privacy and security risks. Andrew Burt and Brenda Leong convene a panel of experts including David Florsek, Chris Wheeler, and Alex Beutel to detail real-life examples of when ML goes wrong, and the lessons they learned.

Varun Rao Bhamidimarri is an enterprise solution architect at Amazon Web Services helping customers with adoption of cloud-enabled analytics solutions to meet business requirements.

Presentations

Building a recommender system with Amazon ML services Tutorial

Karthik Sonti, Emily Webber, and Varun Rao Bhamidimarri introduce you to the Amazon SageMaker machine learning platform and provide a high-level discussion of recommender systems. You'll dig into different machine learning approaches for recommender systems, including common methods such as matrix factorization as well as newer embedding approaches.

Gayle Bieler is director of the Center for Data Science at RTI International, where her team of 24 data scientists, statisticians, software developers, and artists is busy solving important national problems, improving local communities, and transforming research. A statistician by training, with 31 years’ experience at RTI, she’s also experienced in statistical analysis of complex data from designed experiments and observational studies, and in software development for sample surveys. She’s passionate about building and leading a vibrant data science team that solves complex problems across multiple research domains, that is a hub of innovation and collaboration, and a place where people can thrive and be at their natural best while doing meaningful work to improve the world. The most important things to her are people and impact, in that order. She holds a master’s and a bachelor’s degree in mathematics from Boston University.

Presentations

Executive Briefing: Creating a center for data science from scratch—Lessons from nonprofit research Session

Gayle Bieler explains how she built a thriving center for data science within a large, well-respected nonprofit research institute and shares some of its most impactful projects and best adventures to date, which have solved important national problems, improved local communities, and transformed research.

Albert Bifet is a professor and head of the Data, Intelligence, and Graphs (DIG) Group at Télécom ParisTech and a scientific collaborator at École Polytechnique. A big data scientist with 10+ years of international experience in research, Albert has led new open source software projects for business analytics, data mining, and machine learning at Huawei, Yahoo, the University of Waikato, and UPC. At Yahoo Labs, he cofounded Apache SAMOA (scalable advanced massive online analysis), a distributed streaming machine learning framework that contains a programming abstraction for distributed streaming ML algorithms. At the WEKA Machine Learning Group, he co-led massive online analysis (MOA), the most popular open source framework for data stream mining, with more than 20,000 downloads each year. Albert is the author of Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams and the editor of the Big Data Mining special issue of SIGKDD Explorations. He was cochair of the industrial track at ECML PKDD, BigMine, and the data streams track at ACM SAC. He holds a PhD from BarcelonaTech.

Presentations

Machine learning for streaming data: Practical insights Session

Heitor Murilo Gomes and Albert Bifet introduce you to a machine learning pipeline for streaming data using the streamDM framework. You'll also learn how to use streamDM for supervised and unsupervised learning tasks, see examples of online preprocessing methods, and discover how to expand the framework by adding new learning algorithms or preprocessing methods.

Rossella Blatt Vital is the director of innovation and AI at Wonderlic, where she leads the machine learning and innovation program, establishing a culture driven by creative thinking, collaboration, customer centricity, and a focus on talent development and recognition. Previously, she headed socially impactful machine learning projects in academia, ranging from lung cancer diagnosis to brain-computer interfaces and crime forecasting; led AI and data science initiatives at NewEdge and Societe Generale; and, as director of machine learning at Remitly, designed and executed the machine learning road map and headed a team addressing the detection and prevention of financial fraud along with pricing, compliance, and customer service. Rossella is a passionate and visionary thought leader in the field of artificial intelligence and an accomplished speaker who has been invited to give keynote lectures on AI and innovation at conferences worldwide, and her work has been published in numerous scientific peer-reviewed journals. A native of Italy, Rossella holds an MS and BS in telecommunications engineering from Politecnico di Milano and pursued a PhD in information technology. Rossella loves innovation and how, driven by the right principles and culture, it can help make a difference and create a better tomorrow.

Presentations

Building and leading a successful AI practice for your organization Tutorial

Creating and leading a successful ML strategy is an elegant orchestration of many components: master key ML concepts, operationalize ML workflow, prioritize highest-value projects, build a high-performing team, nurture strategic partnerships, align with the company’s mission, etc. Rossella Blatt Vital details insights and lessons learned in how to create and lead a flourishing ML practice.

Brittany Bogle is a senior data scientist and machine learning engineer on the science elite team at IBM, where she collaborates with clients to execute data science use cases across a range of industries, including healthcare, finance, hospitality, and manufacturing. She also aggregates lessons from these experiences and shares them with executive and technical audiences. Brittany’s expertise in data science spans descriptive analytics, predictive analytics, and decision optimization, with an emphasis on using these tools in healthcare settings. Previously, she completed a fellowship at the University of North Carolina, where she collaborated directly with emergency medicine physicians, cardiologists, and epidemiology researchers. She built predictive models for cardiac arrest, established and implemented natural language processing (NLP) methodology to emulate manual adjudication of myocardial infarction and chronic heart failure, and she’s a coinvestigator on a pilot study to assess the effectiveness of AED-equipped drone delivery and optimize drone placement strategies. She holds a bachelor of science in industrial engineering from the University of Arkansas, an MS and PhD in industrial engineering and management sciences, and a master’s in public health from Northwestern University.

Presentations

Why AI fails: Overcoming AI challenges (sponsored by IBM) Session

AI will be the most disruptive class of technologies over the next decade, fueled by near-endless amounts of data and unprecedented advances in deep learning. Brittany Bogle walks you through how to address some of the major AI challenges, like trust, talent, and data.

Charles Boicey is the chief innovation officer for Clearsense, a healthcare analytics organization specializing in bringing big data technologies to healthcare. Previously, Charles was the enterprise analytics architect for Stony Brook Medicine, where he developed the analytics infrastructure to serve the clinical, operational, quality, and research needs of the organization; was a founding member of the team that developed the Health and Human Services award-winning application NowTrending to assist in the early detection of disease outbreaks by utilizing social media feeds; and is a former president of the American Nursing Informatics Association.

Presentations

Organizing the chaos of healthcare with smart data discovery (sponsored by Io-Tahoe) Session

Healthcare’s reliance on comprehensible data is critical to the mission of providing optimal and affordable care. Charles Boicey takes a deep dive into how the application of technology, such as machine learning, is paramount to the modernization of healthcare that provides its professionals with fully integrated and complete medical records.

Davor Bio

Presentations

Posttransaction processing using Apache Pulsar at Narvar Session

Narvar provides a next-generation posttransaction experience for over 500 retailers. Karthik Ramasamy and Anand Madhavan take you on the journey of how Narvar moved away from using a slew of technologies for its platform and consolidated its use cases using Apache Pulsar.

Dipti Borkar is the vice president of product and marketing at Alluxio with over 15 years of experience in relational and nonrelational data and database technology. Previously, Dipti was vice president of product marketing at Kinetica and Couchbase, where she held several leadership positions, including head of global technical sales and head of product management; she also managed development teams at IBM DB2, where she started her career as a database software engineer. Dipti holds an MS in computer science from the University of California San Diego and an MBA from the Haas School of Business at the University of California, Berkeley.

Presentations

Enabling big data and AI workloads on the object store at DBS Bank Session

Vitaliy Baklikov and Dipti Borkar explore how DBS Bank built a modern big data analytics stack leveraging an object store even for data-intensive workloads like ATM forecasting and how it uses Alluxio to orchestrate data locality and data access for Spark workloads.

Farrah Bostic is the founder of the Difference Engine, which she created based on her belief that deep understanding of customer needs is essential to growing businesses through great products and services. Farrah has honed her customer-centric insights as an advisor to some of the world’s most respected brands, including Apple, Microsoft, Disney, Samsung, and UPS. Previously, she began her career as a creative and then went on to be a strategist at leading agencies, including Wieden+Kennedy, TBWA\Chiat\Day, Mad Dogs & Englishmen, and Digitas, where she was group planning director and mobile strategy lead; she also ran innovation as a partner at Hall & Partners and developed digital tools for online qualitative research as SVP of consumer immersion at OTX.

Presentations

Executive Briefing: Understanding the cult of prediction Session

We're living in a culture obsessed with predictions. In politics and business, we collect data in service of the obsession. But our need for certainty and control leads some organizations to be duped by unproven technology or pseudoscience—often with unforeseen societal consequences. Farrah Bostic looks at historical—and sometimes funny—examples of sacrificing understanding for "data."

David Boyle is passionate about helping businesses to build analytics-driven decision making to help them make quicker, smarter, and bolder decisions. Previously, he built global analytics and insight capabilities for a number of leading global entertainment businesses covering television (the BBC), book publishing (HarperCollins Publishers), and the music industry (EMI Music), helping to drive each organization’s decision making at all levels. He draws on experience building analytics for global retailers as well as political campaigns in the US and UK, in philanthropy, and in strategy consulting.

Presentations

Combining creativity and analytics Session

Companies that harness creativity and data in tandem have growth rates twice as high as companies that don’t. David Boyle shares lessons from his successes and failures in trying to do just that across presidential politics, with pop stars, and with power brands in the world of luxury goods. Join in to find out how analysts can work differently to build these partnerships and unlock this growth.

Data Case Studies welcome Tutorial

Introduction to the Data Case Studies daylong tutorial.

Driving adoption of data DCS

Often, the difference between a successful data initiative and a failed one isn't the data or the technology but its adoption by the wider business. With every business wanting the magic of data but many failing to properly embrace and harness it, our panelists explore the factors that have led to successes and failures in getting companies to use data products.

Bob Bradley is the data solutions manager at Geotab, a global leader in telematics providing open platform fleet management solutions to over 1.2 million connected vehicles worldwide. Bob leads a team responsible for developing data-driven solutions that leverage Geotab’s big data repository of over 3 billion records each day. Previously, Bob spent more than 14 years as the cofounder and vice president of a software development shop (acquired by Geotab in 2016), where he focused on delivering custom business intelligence solutions to companies across Canada.

Presentations

Turning petabytes of data from millions of vehicles into open data with Geotab Session

Geotab is a world-leading asset-tracking company with millions of vehicles under service every day. Felipe Hoffa and Bob Bradley examine the challenges of and solutions for creating an ML- and geographic information system (GIS)-enabled petabyte-scale data warehouse leveraging Google Cloud. They also dive into the process of publishing it as open data, how you can access it, and how cities are using it.

Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies utilizing big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

David Talby, Alex Thomas, Saif Addin Ellafi, and Claudiu Branzan walk you through state-of-the-art natural language processing (NLP) using the highly performant, highly scalable open source Spark NLP library. You'll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Navinder Pal Singh Brar is a senior software engineer at Walmart Labs, where he’s been working with the Kafka ecosystem, especially Kafka Streams, for the last couple of years and created a new platform on top of it to suit the company’s needs to process billions of events per day in real time and trigger models on each event. He’s been active in contributing back to Kafka Streams and has patented a few features. He’s interested in solving complex problems in distributed systems. In his spare time, Navinder likes to spend time in the gym and the boxing ring.

Presentations

Building a multitenant data processing and model inferencing platform with Kafka Streams Session

Each week 275 million people shop at Walmart, generating interaction and transaction data. Navinder Pal Singh Brar explains how the customer backbone team enables extraction, transformation, and storage of customer data to be served to other teams. At 5 billion events per day, the Kafka Streams cluster processes events from various channels and maintains a uniform identity of a customer.

Mikio Braun is a principal engineer for search at Zalando, one of Europe’s biggest fashion platforms. He worked in research for a number of years before becoming interested in putting research results to good use in the industry. Mikio holds a PhD in machine learning.

Presentations

Fair, privacy-preserving, and secure ML Session

With ML becoming more mainstream, the side effects of machine learning and AI on our lives become more visible. You have to take extra measures to make machine learning models fair and unbiased. And awareness for preserving the privacy in ML models is rapidly growing. Mikio Braun explores techniques and concepts around fairness, privacy, and security when it comes to machine learning models.

Andrew Brust is the founder and CEO of Blue Badge Insights, a blogger for ZDNet Big Data, and a data- and analytics-focused analyst for GigaOm. He’s the coauthor of Programming Microsoft SQL Server 2012, a Microsoft tech influencer, and advises data and analytics ISVs on winning in the market, solution providers on their service offerings, and customers on their analytics strategy. Andrew is an entrepreneur, a consulting veteran, a former research director, and a current Microsoft Data Platform MVP.

Presentations

Executive Briefing: Data catalogs—Concepts, capabilities, and key platforms Session

Andrew Brust provides a primer on data catalogs and a review of the major vendors and platforms in the market. He examines the use of data catalogs with classic and newer data repositories, including data warehouses, data lakes, cloud object storage, and even software and applications. You'll learn about AI's role in the data catalog world and get an analysis of data catalog futures.

Andrew Burt, an internationally recognized expert on the intersection of data privacy, security, and AI, leads Immuta’s legal engineering team. The team, composed of lawyers with deep expertise in data science, focuses on automating compliance and oversight activities within the Immuta Automated Data Governance software platform.

Before joining Immuta, Andrew was Special Advisor for Policy to the head of the FBI Cyber Division, where he was the lead author on the FBI’s after action report on the 2014 Sony data breach. Andrew also served as Chief Compliance and Privacy Officer for the Cyber Division, overseeing privacy and compliance policies for sensitive data across the FBI’s 56 field offices.

Andrew is a term member of the Council on Foreign Relations and a visiting fellow at Yale Law School’s Information Society Project. A published author and former journalist, Andrew holds a JD from Yale Law School and a BA with first-class honors from McGill University.

Presentations

Regulations and the future of data Session

From the EU to California and China, more of the world is regulating how data can be used. Andrew Burt and Brenda Leong convene leading experts on law and data science for a deep dive into ways to regulate the use of AI and advanced analytics. Come learn why these laws are being proposed, how they’ll impact data, and what the future has in store.

War stories from the front lines of ML Session

Machine learning techniques are being deployed across almost every industry and sector. But this adoption comes with real, and oftentimes underestimated, privacy and security risks. Andrew Burt and Brenda Leong convene a panel of experts including David Florsek, Chris Wheeler, and Alex Beutel to detail real-life examples of when ML goes wrong, and the lessons they learned.

Gwen Campbell is director of product and data at Revibe Technologies, where she’s leveraging Revibe’s wearable technology to further research and product development in the focus, attention, and movement spaces. She was a part of the team that brought Revibe Classic to market in 2015 and Revibe Connect to market in 2018. She’s also one of the inventors of the Revibe Connect technology. An industrial and systems engineer with a passion for helping others, she enjoys pioneering advanced data analytics in her field.

Presentations

From isolated to connected: The metamorphosis of Revibe DCS

It’s no surprise that Revibe needed to evolve to satisfy today’s data-hungry market. It launched its first hardware-only device in 2015 and quickly learned that to stay alive, the company needed to get its hands into data. Gwen Campbell discusses Revibe's metamorphosis from a hardware company to a data company and shares lessons learned along the way.

Matt Carothers is a security principal at Cox Communications. He holds patents on a weighted data packet communication system, systems and methods of DNS grey listing, and systems and methods of mapped network address translation.

Presentations

Secured computation: Analyzing sensitive data using homomorphic encryption Session

Organizations often work with sensitive information such as social security and credit card numbers. Although this data is stored in encrypted form, most analytical operations require data decryption for computation. This creates unwanted exposures to theft or unauthorized read by undesirables. Matt Carothers, Jignesh Patel, and Harry Tang explain how homomorphic encryption prevents fraud.

Sandra Carrico is the vice president of engineering and chief data scientist at GLYNT, where she leads the software development team, ensuring rapid iteration and releases using agile software development. She invented WattzOn’s GLYNT machine learning project, which extracts data trapped in complex documents, and she invented mixed formal learning, which is used in the GLYNT machine learning product. Previously, Sandra was vice president of engineering at a number of startups and has been in engineering management at AT&T Labs and Aurigin.

Presentations

Data need not be a moat: Mixed formal learning enables zero- and low-shot learning Session

Sandra Carrico explains mixed formal learning and outlines one machine learning example that previously required large numbers of training examples and now learns with either zero or a handful of them. She maps apparently idiosyncratic techniques to mixed formal learning, a general AI architecture that you can use in your own projects.

Rich Caruana is a principal researcher at Microsoft Research. Previously, he was on the faculty in the Computer Science Department at Cornell University, at UCLA’s Medical School, and at Carnegie Mellon University’s Center for Learning and Discovery. Rich received an NSF CAREER Award in 2004 (for meta clustering); best paper awards in 2005 (with Alex Niculescu-Mizil), 2007 (with Daria Sorokina), and 2014 (with Todd Kulesza, Saleema Amershi, Danyel Fisher, and Denis Charles); co-chaired KDD in 2007 (with Xindong Wu); and serves as area chair for Neural Information Processing Systems (NIPS), International Conference on Machine Learning (ICML), and KDD. His research focus is on learning for medical decision making, transparent modeling, deep learning, and computational ecology. He holds a PhD from Carnegie Mellon University, where he worked with Tom Mitchell and Herb Simon. His thesis on multi-task learning helped create interest in a new subfield of machine learning called transfer learning.

Presentations

Unified tooling for machine learning interpretability Session

Understanding decisions made by machine learning systems is critical for sensitive uses, ensuring fairness, and debugging production models. Interpretability presents options for trying to understand model decisions. Harsha Nori, Samuel Jenkins, and Rich Caruana explore the tools Microsoft is releasing to help you train powerful, interpretable models and interpret existing black box systems.

Scott Castle is GM of the Data Business at Sisense and served as VP of Product for Periscope Data prior to its merger with Sisense in 2019. At Periscope Data, Scott oversaw product strategy, planning, design and delivery. He brings over 20 years of experience in software development and product management at leading technology companies including Adobe, Electric Cloud, and FileNet. Scott has a B.A. degree from the University of Massachusetts, Amherst, and a master’s degree from UC Irvine, both in computer science.

Presentations

Lessons learned from scaling the tech stack of a modern analytics platform Session

Scott Castle, general manager at Sisense and former VP of product at Periscope Data, discusses lessons learned from scaling up Periscope Data to support incredibly large volumes of data and queries from its data teams.

Dan Chaffelson is a director of DataFlow field engineering at Cloudera. Previously, Dan was a solutions engineer at Hortonworks. He drives the international practice for enterprise adoption of the HDF product line and maintains a public project for Apache NiFi Python automation (NiPyAPI) on GitHub. Throughout a decade of virtualization and launching two startups, Dan has been nerdy on three continents and in every line of business from UK bulge bracket banking to Australian desert public services. Dan is based in London with his family and pet samoyed; he can be found building an open source baby monitor out of Raspberry Pis while mining cryptocurrency in his shed.

Presentations

Kafka and Streams Messaging Manager (SMM) crash course Tutorial

Kafka is omnipresent and the backbone of streaming analytics applications and data lakes. The challenge is understanding what's going on overall in the Kafka cluster, including performance, issues, and message flows. Purnima Reddy Kuchikulla and Dan Chaffelson walk you through a hands-on experience to visualize the entire Kafka environment end-to-end and simplify Kafka operations via SMM.

Streaming Services Specialist Solution Architect with AWS.

Presentations

SOLD OUT: Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

SOLD OUT: Building a serverless big data application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Badrish Chandramouli is a senior principal researcher in the database group at Microsoft Research. He is interested in creating technologies to perform real-time and offline big and raw data processing, as well as resilient state management for cloud and edge applications. His research work first shipped in 2010 as part of the Microsoft SQL Server StreamInsight engine. Starting 2012, Badrish built Trill, a streaming analytics engine that is widely used at Microsoft, for example, in the Bing ads platform and in the Azure Stream Analytics cloud service. More recently, Badrish built FASTER, a high-performance embedded, resilient, and concurrent state store and cache that supports larger-than-memory data and is optimized for streaming analytics. He has also worked on simplifying distributed computing via frameworks such as Ambrosia and CRA.

Presentations

Trill: The crown jewel of Microsoft’s streaming pipeline explained Session

Trill has been open-sourced, making the streaming engine behind services like the Bing Ads platform available for all to use and extend. James Terwilliger, Badrish Chandramouli, and Jonathan Goldstein dive into the history of and insights from streaming data at Microsoft. They demonstrate how its API can power complex application logic and the performance that gives the engine its name.

Felix Cheung is an engineering manager II at Uber and a PMC and committer for Apache Spark. Felix started his journey in the big data space about five years ago with the then-state-of-the-art MapReduce. Since then, he’s (re-)built Hadoop clusters from metal more times than he would like, created a Hadoop distro from two dozen or so projects, and juggled hundreds to thousands of cores in the cloud or in data centers. He built a few interesting apps with Apache Spark and ended up contributing to the project. In addition to building stuff, he frequently presents at conferences, meetups, and workshops. He was also a teaching assistant for the first set of edX MOOCs on Apache Spark.

Presentations

We run, we improve, we scale: The XGBoost story at Uber Session

XGBoost has been widely deployed in companies across the industry. Nan Zhu and Felix Cheung dive into the internals of distributed training in XGBoost and demonstrate how XGBoost solves business problems at Uber, scaling to thousands of workers and tens of terabytes of training data.

Darren Chinen is a senior director of data science and engineering at Malwarebytes. He began his career in data more than 20 years ago. The early part of his career was focused on analytics and building data warehouses for companies like Legato/EMC, E*TRADE, Lucent Technologies, Peet’s Coffee, and Apple. During his time at Apple, he was initially focused on Apple Online Store analytics, then moved on to help build the data science platform. It was in that transition that he was forced to evolve from relational databases to big data. He has spent the last six years of his career in big data, where he lived through all the big data fads at Apple, Roku, GoPro, and Malwarebytes. It was a charming time when we all thought that MapReduce would save the world, then Hive, then mixed workloads made us invent YARN, Sqoop was a thing, and then came Cassandra, which made us all talk about the CAP theorem. Now we live in a time of streaming data, complex multicloud orchestration, real-time processing, next-gen web applications, and AI, which Darren is convinced as of this writing will actually save the world and end world hunger.

Presentations

Running AI workloads in containers (sponsored by BMC Software)

Developing, deploying, and managing AI and anomaly detection models is tough business. See-Kit Lam details how Malwarebytes has leveraged containerization, scheduling, and orchestration to build a behavioral detection platform and a pipeline to bring models from concept to production.

Anant Chintamaneni is vice president and general manager of BlueData products at HPE (via the BlueData acquisition), where he’s responsible for product and go-to-market (GTM) strategy to help enterprises on their digital transformation journey with next-generation hybrid cloud software. Anant has more than 19 years of experience in enterprise software, business intelligence, and big data and analytics infrastructure. Previously, Anant led product management teams at Pivotal, Dell EMC, and NICE.

Presentations

How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE (BlueData)) Session

Anant Chintamaneni and Matt Maccaux explore whether the combination of containers with large-scale distributed data analytics and machine learning applications is like combining oil and water, or like peanut butter and chocolate.

Praveen Chitrada is an engineering manager at Akamai, where he leads a team responsible for designing and building the next generation of applications. Praveen and members of his technology team love using cutting-edge technologies such as Airflow, Docker, Hadoop, MuleSoft, ODI, Prometheus, Spark, and of course MemSQL. Over nearly a decade at Akamai, Praveen has applied and expanded his extensive knowledge in building complex and scalable applications. His contributions include projects to generate usage metrics for billing, track the cost of goods sold, and implement event monitoring and real-time analytics at Akamai scale. Recently, he played a pivotal role in rearchitecting the network usage statistics application to meet demanding service-level agreements (SLAs) while accommodating and leveraging ongoing, exponential growth in data volumes.

Presentations

Building a fast, scalable, efficient operational analytics and reporting application using MemSQL, Docker, Airflow, and Prometheus (sponsored by MemSQL) Session

Praveen Chitrada walks you through how Akamai uses MemSQL, Docker, Airflow, Prometheus, and other technologies as an enabler to streamline and accelerate data ingestion and calculation to generate usage metrics for billing, reporting, and analytics at massive scale.

Wei-Chiu Chuang is a software engineer at Cloudera, which he joined in 2015, where he’s responsible for the development of Cloudera’s storage systems, mostly the Hadoop Distributed File System (HDFS). He’s an Apache Hadoop committer and Project Management Committee member for his contributions to the open source project, and a cofounder of the Taiwan Data Engineering Association, a nonprofit organization promoting better data engineering technologies and applications in Taiwan. Wei-Chiu holds a PhD in computer science from Purdue University for his research in distributed systems and programming models.

Presentations

Apache Hadoop 3.x state of the union and upgrade guidance Session

Wangda Tan and Wei-Chiu Chuang outline the current status of the Apache Hadoop community and dive into the present and future of Hadoop 3.x. You'll get a peek at new features like erasure coding, GPU support, NameNode federation, Docker, long-running services support, powerful container placement constraints, data node disk balancing, and more. And they walk you through upgrade guidance from 2.x to 3.x.

Moise Convolbo is a data scientist and research scientist at Rakuten, where he’s harnessing the potential of customer data in reaching “zero customer dissatisfaction.” He built a platform called the Rakuten PathFinder, which empowers product stakeholders such as PDMs, managers, and test engineers to focus on specific struggles along the users’ journeys in order to improve the company’s products and measure their business impact. It’s currently used by Rakuten Gora (the #1 golf course reservation site in Japan), Rakuten toto (the #1 lottery betting site), Rakuten O-net (the #1 match-making web service), and Rakuten Keiba (a horse racing betting service). Moise has long experience working with data, the cloud, and geodistributed data centers. He’s always been fascinated by what comes next in terms of utilizing data for strategic, data-informed business decisions. He’s active in academia and has spent time as a reviewer for major big data, cloud optimization, and data science journals from ACM, Elsevier, Springer, and the IEEE.

Presentations

Driving adoption of data DCS

Often, the difference between a successful data initiative and a failed one isn't the data or the technology but its adoption by the wider business. With every business wanting the magic of data but many failing to properly embrace and harness it, our panelists explore the factors that have led to successes and failures in getting companies to use data products.

Gaining new insight into online customer behavior using AI DCS

Customer satisfaction is a key success factor for any business. Moise Convolbo highlights the process to capture relevant customer behavioral data, cluster the user journey by different patterns, and draw conclusions for data-informed business decisions.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, he was a data scientist at TIBCO and a statistical software developer at AMD. Ian is a cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow 2-Day Training

Advancing your career in data science requires learning new languages and frameworks—but you face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by outlining the abstractions common to these systems. You'll work through hands-on exercises to overcome obstacles to getting started using new tools.

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow (Day 2) Training Day 2

Advancing your career in data science requires learning new languages and frameworks—but you face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by outlining the abstractions common to these systems. You'll work through hands-on exercises to overcome obstacles to getting started using new tools.

John Cooper is a technical architect at Bayer.

Presentations

Finding your needle in a haystack Session

As the complexity of data systems has grown at Bayer, so has the difficulty of locating and understanding which datasets are available for consumption. Naghman Waheed and John Cooper outline a custom metadata management tool recently deployed at Bayer. The system is cloud-enabled and uses multiple open source components, including machine learning and natural language processing, to aid searches.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Closing remarks Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll offer closing remarks.

Findata Day welcome Tutorial

Alistair Croll, Findata Day host and the Strata Data Conference program chair, welcomes you to the daylong tutorial.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Michael Cullan is a data scientist in residence at the Data Incubator, where he combines his passions for teaching and statistical programming. He has three years of teaching experience in academic and professional settings and four years of research experience spanning topics in nonparametric statistics, applied mathematics, and artificial intelligence. He holds a master’s degree in statistics.

Presentations

Hands-on data science with Python 2-Day Training

Michael Cullan walks you through developing a machine learning pipeline from prototyping to production. You'll learn about data cleaning, feature engineering, model building and evaluation, and deployment and then extend these models into two applications from real-world datasets. All work will be done in Python.

Hands-on data science with Python (Day 2) Training Day 2

Michael Cullan walks you through developing a machine learning pipeline from prototyping to production. You'll learn about data cleaning, feature engineering, model building and evaluation, and deployment and then extend these models into two applications from real-world datasets. All work will be done in Python.

Jim Cushman is the chief product officer at Collibra. Armed with 20 years of enterprise product leadership experience and a passion for the explosive potential of data, Jim’s charge is to deliver world-class solutions that can empower customers to disrupt and lead their respective markets. Jim brings technical product expertise along with an aptitude for business strategy, design, development, and delivery of data-intensive solutions. Previously, he was general manager and senior vice president of network and open data at Veeva, a leader in cloud-based software for the global life sciences industry; served as president of commercial and products at Novetta, an advanced analytics company; and was general manager and senior vice president of products and services for Initiate Systems, a best-of-breed MDM company acquired by IBM in 2010, where he had also served as vice president of engineering and director of master data management. He holds a BS in management and finance from Purdue.

Presentations

Powering the future with data intelligence (sponsored by Collibra) Session

Transforming data into a trusted business asset that informs decision making requires giving teams access to a powerful platform that makes it easy to harness data across the enterprise. Jim Cushman and Piyush Jain detail how Progressive uses Collibra to transform the way data is managed and used across the organization, driving real business value.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Closing remarks Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll offer closing remarks.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Dan D’Orazio is a skilled problem-solver and data enthusiast with eighteen years of experience working across various industries. Dan is a Solution Architect at Matillion, maker of an ETL tool purpose-built for cloud data warehouses, where he supports customers and internal teams with his technical know-how and expert communication skills. Prior to that, he worked as a data engineer helping solve complex business problems for internal and external stakeholders.

Dan enjoys a wide variety of activities, from working on cars, brewing beer, fishing, carpentry, and just about everything in between. Originally from Buffalo, NY, he now lives in Denver, Colorado, with his family of five.

Presentations

Solve tomorrow’s business challenges with a modern data warehouse (sponsored by Matillion) Session

According to Forrester, insight-driven companies are on pace to make $1.8 trillion annually by 2021. Daniel D'Orazio wants to know how fast your team can collect, process, and analyze data to solve present—and future—business challenges. You'll gain actionable tips and lessons learned from cloud data warehouse modernizations at companies like DocuSign that you can take back to your business.

Brian d’Alessandro is a Sr Director of data science at Capital One (Financial Services). Brian is also an active professor for NYU’s Center for Data Science graduate degree program. Previously, Brian built and led data science programs for several NYC tech startups, including Zocdoc and Dstillery. A veteran data scientist and leader with over 18 years of experience developing machine learning-driven practices and products, Brian holds several patents and has published dozens of peer-reviewed articles on the subjects of causal inference, large-scale machine learning, and data science ethics. When not doing data science, Brian likes to cook, create adventures with his family, and surf in the frigid north Atlantic waters.

Presentations

Improve your data science ROI with a portfolio and risk management lens Session

While data science value is well recognized within tech, experience across industries shows that the ability to realize and measure business impact is not universal. A core issue is that data science programs face unique risks many leaders aren’t trained to hedge against. Brian d’Alessandro addresses these risks and advocates for new ways to think about and manage data science programs.

Jules S. Damji is an Apache Spark community and developer advocate at Databricks. He’s a hands-on developer with over 20 years of experience. Previously, he worked at leading companies such as Sun Microsystems, Netscape, @Home, LoudCloud/Opsware, Verisign, ProQuest, and Hortonworks, building large-scale distributed systems. He holds a BSc and MSc in computer science and MA in political advocacy and communication from Oregon State University, the California State University, and Johns Hopkins University, respectively.

Presentations

SOLD OUT: Managing the complete machine learning lifecycle with MLflow Tutorial

ML development brings many new complexities beyond the software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information. Jules Damji walks you through MLflow, an open source project that simplifies the entire ML lifecycle, to solve this problem.

Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He’s working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.

Presentations

The evolution of metadata: LinkedIn’s story Session

Imagine scaling metadata to an organization of 10,000 employees, 1M+ data assets, and an AI-enabled company that ships code to the site three times a day. Shirshanka Das and Mars Lan dive into LinkedIn’s metadata journey from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. They reveal metadata strategies and the battle scars.

Jeff Davis is the Director of Strategy for ROI Training’s Google Cloud Partnership. He’s also an ex-Googler: as head of customer success at Google, Jeff helped Google’s biggest customers adopt cloud technologies. Jeff has worked on teams delivering highly scaled SaaS products and has built customer success tools on App Engine and BigQuery. He has spent the last two years training Google customer engineers, Google partners, and Google customers around the world as they come up to speed on Google Cloud Platform. Jeff was named 2017 Google Cloud Trainer of the Year and is certified as both a Professional Cloud Architect and Data Engineer.

Presentations

Serverless machine learning with TensorFlow and BigQuery (sponsored by Google Cloud) Training

Jeff Davis provides a hands-on introduction to designing and building machine learning models on structured data on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you'll learn machine learning (ML) concepts and how to implement them using both BigQuery Machine Learning and TensorFlow and Keras.

Serverless machine learning with TensorFlow and BigQuery (sponsored by Google Cloud) (Day 2) Training Day 2

Jeff Davis provides a hands-on introduction to designing and building machine learning models on structured data on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you'll learn machine learning (ML) concepts and how to implement them using both BigQuery Machine Learning and TensorFlow and Keras.

Gerard de Melo is an assistant professor of computer science at Rutgers University, where he heads a team of researchers working on big data analytics, natural language processing, and web mining. Gerard’s research projects include UWN/MENTA, one of the largest multilingual knowledge bases, and Lexvo.org, an important hub in the web of data. Previously, he was a faculty member at Tsinghua University, one of China’s most prestigious universities, where he headed the Web Mining and Language Technology Group, and a visiting scholar at UC Berkeley, where he worked in the ICSI AI Group. He serves as an editorial board member for Computational Intelligence, the Journal of Web Semantics, the Springer Language Resources and Evaluation journal, and the Language Science Press TMNLP book series. Gerard has published over 80 papers, with best paper or demo awards at WWW 2011, CIKM 2010, ICGL 2008, and the NAACL 2015 Workshop on Vector Space Modeling, as well as an ACL 2014 best paper honorable mention, a best student paper award nomination at ESWC 2015, and a thesis award for his work on graph algorithms for knowledge modeling. He holds a PhD in computer science from the Max Planck Institute for Informatics.

Presentations

Toward more fine-grained sentiment and emotion analysis of text Session

Gerard de Melo takes a deep dive into the kinds of sentiment and emotion consumers associate with a text. With new data-driven approaches, organizations can better pay attention to what's being said about them in different markets. And you can consider fonts and palettes best suited to convey specific emotions, so organizations can make informed choices when presenting information to consumers.

Randy DeFauw is a solutions architect at AWS, with over 20 years of experience in enterprise software architecture. He worked heavily in DevOps in the past and now focuses on analytics and machine learning.

Presentations

ML ops: Applying DevOps practices to machine learning workloads Session

As an increasing level of automation becomes available to data science, the balance between automation and quality needs to be maintained. Applying DevOps practices to machine learning workloads brings models to the market faster and maintains the quality and integrity of those models. Sireesha Muppala, Shelbee Eigenbrode, and Randall DeFauw explore applying DevOps practices to ML workloads.

Dan DeMers is the CEO and co-founder of Cinchy, the global leader in enterprise data collaboration. Dan spent over a decade as an IT executive with leading global financial institutions where he was responsible for delivering mission-critical projects, greenfield technologies, and multi-million dollar technology investments. In 2019, he joined the Canadian Council of Innovators, a select group of CEOs from Canada’s most successful technology companies who work with public policy leaders to optimize the growth of Canada’s innovation-based sector.

Presentations

The end of applications: How data collaboration is changing everything (sponsored by Cinchy) Session

After 40 years of apps, enterprise companies now realize that building or buying an application for every use case has become a major threat to their ability to leverage and protect their core data assets. Dan DeMers provides a live demo of Cinchy, the world’s first data collaboration platform.

Matt Derda leads customer marketing at Trifacta, where he focuses on helping customers communicate the value they’re generating from initiatives to modernize data management, analytics and machine learning. Matt’s previous experience as a data wrangler at PepsiCo means that he is well-versed in the challenges of getting data ready for various analytics initiatives. He uses this experience to establish strong relationships with Trifacta customers and help them promote their innovative work to internal and external audiences.

Presentations

Getting clinical trial data ready for analysis: How IQVIA wrangled its way to success (sponsored by Trifacta) Session

Clinical trial data analysis can be a complex process. The data is typically hand-coded and formatted differently and is required to be delivered in an FDA-approved format. Matt Derda and Yogesh Prasad explain how IQVIA built its Clean Patient Tracker and how it enabled agility and flexibility for end users of the platform, from data acquisition to reporting and analytics.

Ifi Derekli is a senior solutions engineer at Cloudera, focusing on helping large enterprises solve big data problems using Hadoop technologies. Her subject-matter expertise is around security and governance, a crucial component of every successful production big data use case. Previously, Ifi was a presales technical consultant at Hewlett Packard Enterprise, where she provided technical expertise for Vertica and IDOL (currently part of Micro Focus). She holds a BS in electrical engineering and computer science from Yale University.

Presentations

Getting ready for CCPA: Securing data lakes for heavy privacy regulation Tutorial

New regulations drive compliance, governance, and security challenges for big data. Infosec and security groups must ensure a secured and governed environment across workloads that span on-premises, private cloud, multicloud, and hybrid cloud. Mark Donsky, Lars George, Michael Ernest, and Ifigeneia Derekli outline hands-on best practices for meeting these challenges with special attention to CCPA.

John Derrico is director of data strategy at Mastercard, where he’s responsible for the enterprise data sourcing strategy, working to create the next generation of products and solutions. John has more than 20 years of experience as a business transformation leader. He’s delivered business intelligence products for startups and Fortune 500 clients in the clinical, retail, financial, payments, and big data industries and led teams to ideate, build, patent, and support enterprise data products. He worked at Mastercard early in his career before serving in operations, product, sales, and strategic roles with ADP, Medco/Express Scripts, Consumer Reports, and multiple startups. In addition, John directed ADP’s data science lab, earning multiple patents while delivering patentable products to the marketplace. John holds a BA from Fordham University. He’s also a Six Sigma Black Belt and a certified snowboard instructor.

Presentations

Mastercard and Pitney Bowes: Creating a data-driven business (sponsored by Pitney Bowes) Session

Mastercard and Pitney Bowes have overcome many challenges on their journey to accelerate innovation, achieve efficiencies, and improve the overall customer experience. Olga Lagunova and John Derrico share lessons learned as the data strategy evolved and highlight pitfalls and solutions from data science projects across several industries, from finance to cross-border shipping logistics.

John DesJardins is vice president of solution architecture and CTO for North America at Hazelcast, where he champions the growth and adoption of its in-memory computing platform. His expertise in large-scale computing spans big data, the internet of things, machine learning, microservices, and the cloud.
John brings over 25 years of experience, including 15 years architecting and implementing global-scale computing solutions for leading financial services organizations at Hazelcast, Cloudera, Software AG, and webMethods, where he helped Global 2000 firms build massively scalable solutions for the ingest, analysis, and storage of real-time data and worked on innovations in areas such as IoT and machine learning. He has helped architect and implement large-scale applications with organizations such as the European Central Bank, Morgan Stanley, JP Morgan Chase, Capital One, Standard Chartered, Visa, Discover, PayPal, Apple, AT&T, Comcast, and Coca-Cola.
He holds a BS in economics from George Mason University, where he first built predictive models, long before that was considered cool.

Presentations

Low-latency computing and stream processing for financial systems (sponsored by Hazelcast) Session

John DesJardins explores the challenges of integrating real-time stream processing and machine learning into banking and capital markets applications.

Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Previously, Sourav led teams building data products across the technology stack, from smart thermostats and security cams at Google Nest to power grid forecasting at AutoGrid to wireless communication chips at Qualcomm. He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He holds PhD, MS, and BS degrees in electrical engineering and computer science from MIT.

Presentations

Efficient ML engineering: Tools and best practices Tutorial

Sourav Dey and Jakov Kucan walk you through the six steps of the Lean AI process and explain how it helps your ML engineers work as an integrated part of your development and production teams. You'll get a hands-on example using real-world data, so you can get up and running with Docker and Orbyter and see firsthand how streamlined they can make your workflow.

Gonzalo Diaz is a data scientist in residence at the Data Incubator, where he teaches the data science fellowship and online courses; he also develops the curriculum to include the latest data science tools and technologies. Previously, he was a web developer at an NGO and a researcher at IBM TJ Watson Research Center. He has a PhD in computer science from the University of Oxford.

Presentations

SOLD OUT: Big data for managers 2-Day Training

Michael Li and Gonzalo Diaz provide a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

SOLD OUT: Big data for managers (Day 2) Training Day 2

Michael Li and Gonzalo Diaz provide a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Victor Dibia is a research engineer at Cloudera’s Fast Forward Labs, where his work focuses on prototyping state-of-the-art machine learning algorithms and advising clients. He’s passionate about community work and serves as a Google Developer Expert in machine learning. Previously, he was a research staff member at the IBM TJ Watson Research Center. His research interests are at the intersection of human-computer interaction, computational social science, and applied AI. He’s a senior member of IEEE and has published research papers at conferences such as the AAAI Conference on Artificial Intelligence and the ACM Conference on Human Factors in Computing Systems. His work has been featured in outlets such as the Wall Street Journal and VentureBeat. He holds an MS from Carnegie Mellon University and a PhD from City University of Hong Kong.

Presentations

Handtrack.js: Building gesture-based interactions in the browser using TensorFlow Session

Recent advances in machine learning frameworks for the browser such as TensorFlow provide the opportunity to craft truly novel experiences within frontend applications. Victor Dibia explores the state of the art for machine learning in the browser using TensorFlow and outlines its use in the design of Handtrack.js, a library for prototyping real-time hand detection in the browser.

Masaru Dobashi is a manager, IT specialist and architect at NTT DATA, where he leads the OSS professional service team and is responsible for introducing Hadoop, Spark, Storm, and other OSS middleware into enterprise systems. Previously, Masaru developed an enterprise Hadoop cluster consisting of over 1,000 nodes—one of the largest Hadoop clusters in Japan—and designed and provisioned several kinds of clusters using non-Hadoop open source software, such as Spark and Storm.

Presentations

Deep learning technologies for giant hogweed eradication Session

Giant hogweed is a highly toxic plant. Naoto Umemori and Masaru Dobashi aim to automate the process of detecting the plant with technologies like drones and image recognition and detection using machine learning. You'll see how they designed the architecture, took advantage of big data and machine and deep learning technologies (e.g., Hadoop, Spark, and TensorFlow), and the lessons they learned.

Harish Doddi is cofounder and CEO of Datatron. Previously, he held roles at Oracle; Twitter, where he worked on open source technologies, including Apache Cassandra and Apache Hadoop, and built Blobstore, Twitter’s photo storage platform; Snap, where he worked on the backend for Snapchat Stories; and Lyft, where he worked on the surge pricing model. Harish holds a master’s degree in computer science from Stanford, where he focused on systems and databases, and an undergraduate degree in computer science from the International Institute of Information Technology in Hyderabad.

Presentations

Challenges faced in machine learning infrastructure in traditional large enterprises Session

Machine learning infrastructure is key to the success of AI at scale in enterprises, with many challenges when you want to bring machine learning models to a production environment, given the legacy of the enterprise environment. Venkata Gunnu and Harish Doddi explore some key insights, what worked, what didn't work, and best practices that helped the data engineering and data science teams.

Mark Donsky leads product management at Okera, a software provider that provides discovery, access control, and governance at scale for today’s modern heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera. Mark has held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive Briefing: Big data in the era of heavy worldwide privacy regulations Session

California is following the EU's GDPR with the California Consumer Privacy Act (CCPA) in 2020. Penalties for noncompliance are significant, yet many companies aren't prepared for this strict regulation. Mark Donsky explores the capabilities your data environment needs in order to simplify CCPA and GDPR compliance, as well as compliance with other regulations.

Getting ready for CCPA: Securing data lakes for heavy privacy regulation Tutorial

New regulations drive compliance, governance, and security challenges for big data. Infosec and security groups must ensure a secured and governed environment across workloads that span on-premises, private cloud, multicloud, and hybrid cloud. Mark Donsky, Lars George, Michael Ernest, and Ifigeneia Derekli outline hands-on best practices for meeting these challenges with special attention to CCPA.

Sangeeta Thirumalai is a software developer at Cloudera, specializing in database technologies. She is currently responsible for architecting workload-level optimization tools for SQL-on-Hadoop workloads.

Presentations

Apache Metron: Open source cybersecurity at scale Tutorial

Bring your laptop, roll up your sleeves, and get ready to crunch some cybersecurity events with Apache Metron, an open source big data cybersecurity platform. Carolyn Duby walks you through how Metron finds actionable events in real time.

Anais Dotis-Georgiou is a developer advocate at InfluxData with a passion for making data beautiful using data analytics, AI, and machine learning. She takes the data that she collects and does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she’s not behind a screen, you can find her outside drawing, stretching, or chasing after a soccer ball.

Presentations

When Holt-Winters is better than machine learning Session

Machine learning (ML) gets a lot of hype, but its classical predecessors are still immensely powerful, especially in the time series space, where classical algorithms can outperform machine learning methods in forecasting. Anais Dotis-Georgiou dives into how she used the Holt-Winters forecasting algorithm to predict water levels in a creek.

Jed Dougherty is a data scientist at Dataiku, where he leads the data scientist team in North America. He specializes in helping large companies in fields including finance, manufacturing, and medicine spin up and organize data science teams and has helped clients build successful projects in data security, real-time recommendation, predictive maintenance, and other “hot topics.” Previously, he worked at a camel ride, so he’s spent quite a bit of time appreciating the normal versus bimodal distribution of dromedaries and Bactrians. He holds a master’s degree from the QMSS program at Columbia University.

Presentations

Data Science Pioneers: Conquering the next frontier, a documentary investigating the future of data science (sponsored by Dataiku) Keynote

Jed Dougherty presents the trailer of the upcoming _Data Science Pioneers_ documentary about the passionate data scientists driving us toward technological revolution. Cut through the hype with _Data Science Pioneers_ and see what it really means to be a data scientist.

So you built a model; now what? (sponsored by Dataiku) Session

Jed Dougherty takes a deep dive into an often overlooked aspect of the data science lifecycle: model deployment. Once they’ve constructed a data science model that does a good job accurately predicting their test set, many data scientists think the job is over. But really, it’s just begun.

Presentations

Running multidisciplinary big data workloads in the cloud with CDP Tutorial

Organizations now run diverse, multidisciplinary, big data workloads that span data engineering, data warehousing, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long running in nature. There are many challenges with moving these workloads to the cloud. In this talk we start off with a technical deep...

Blake DuBois is a Big Data & Analytics Architect within the Google Cloud (GCP) Professional Services Organization (PSO) where he co-leads its Hadoop Migration working group and helps customers migrate and/or build large data platforms. Prior to his current role, he was a Big Data Solution Architect at Microsoft Azure, Hortonworks, and Amazon Web Services (AWS) where he built PB-scale data analytics environments in the Cloud.

Presentations

10 things to know about running and migrating Hadoop to GCP (sponsored by Google Cloud) Session

Taking advantage of cloud infrastructure and analytic services is a must for any digital enterprise. Join Google Cloud as they discuss 10 things you should know about running and migrating on-prem Hadoop deployments to GCP.

Carolyn Duby is a solutions engineer at Cloudera, where she helps customers harness the power of their data with Apache open source platforms. Previously, she was the architect for cybersecurity event correlation at Secureworks. A subject-matter expert in cybersecurity and data science, Carolyn is an active leader in the community and frequent speaker at Future of Data meetups in Boston, MA, and Providence, RI, and at conferences such as the Open Data Science Conference and the Global Data Science Conference. Carolyn holds an ScB (magna cum laude) and ScM from Brown University, both in computer science. She’s a lifelong learner and recently completed the Johns Hopkins University Coursera data science specialization.

Presentations

Apache Metron: Open source cybersecurity at scale Tutorial

Bring your laptop, roll up your sleeves, and get ready to crunch some cybersecurity events with Apache Metron, an open source big data cybersecurity platform. Carolyn Duby walks you through how Metron finds actionable events in real time.

Ted Dunning is the chief technology officer at MapR. He’s also a board member for the Apache Software Foundation; a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects; and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He’s contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Practical feature engineering Session

Feature engineering is generally the section that gets left out of machine learning books, but it's also the most critical part in practice. Ted Dunning explores techniques, a few well known, but some rarely spoken of outside the institutional knowledge of top teams, including how to handle categorical inputs, natural language, transactions, and more in the context of machine learning.

Barbara Eckman is a senior principal software architect at Comcast specializing in big data architecture and governance. She leads data discovery and lineage platform architecture for a division-wide initiative comprising streaming, transforming, storing, and analyzing big data. Barbara is also the lead metadata architect for the Comcast Privacy Program, an initiative tackling the challenge of legislation like the California Consumer Privacy Act. Her prior experience includes scientific data and model integration at the Human Genome Project, Merck, GlaxoSmithKline, and IBM, where she served on the peer-elected IBM Academy of Technology.

Presentations

Postrevolutionary big data: Promoting the general welfare (sponsored by Io-Tahoe) Keynote

Barbara Eckman shares lessons learned from early big data mistakes and the progress her team at Comcast is making toward a postrevolutionary big data vision.

Vlad Eidelman is the vice president of research at FiscalNote, where he leads AI R&D into advanced methods for analyzing, modeling, and extracting knowledge from unstructured data related to government, policy, and law, and he built the first version of the company’s patented technology to help organizations predict and act on policy changes. Previously, he was a researcher in a number of academic and industry settings, completing his PhD in computer science as an NSF and NDSEG Fellow at the University of Maryland, and his BS in computer science and philosophy at Columbia University. His research focused on machine learning algorithms for a broad range of natural language processing applications, including entity extraction, machine translation, text classification, and information retrieval, especially applied to computational social science. His work has led to 10 patent applications, has been published in conferences like the Association for Computational Linguistics (ACL), North American Chapter of the Association for Computational Linguistics (NAACL), and Empirical Methods in Natural Language Processing (EMNLP), and has been covered by media such as Wired, Vice News, Washington Post and Newsweek.

Presentations

What does the public say? A computational analysis of regulatory comments Session

While regulations affect your life every day, and millions of public comments are submitted to regulatory agencies in response to their proposals, analyzing the comments has traditionally been reserved for legal experts. Vlad Eidelman outlines how natural language processing (NLP) and machine learning can be used to automate the process by analyzing over 10 million publicly released comments.

Shelbee Eigenbrode is a solutions architect at Amazon Web Services (AWS). Her current areas of depth include DevOps combined with machine learning and artificial intelligence. She’s been in technology for 22 years, spanning multiple roles and technologies, including 20+ years at IBM. She’s a published author, blogger, and vlogger evangelizing DevOps practices, with a passion for driving rapid innovation and optimization at scale. In 2016, she won the DevOps Dozen blog of the year for a post demonstrating what DevOps is not. With over 26 patents granted across various technology domains, her passion for continuous innovation combined with a love of all things data recently turned her focus to data science. Combining her backgrounds in data, DevOps, and machine learning, her passion is helping customers embrace data science and ensure all data models have a path to production use. She also aims to put ML in the hands of developers and customers who are not classically trained data scientists.

Presentations

Alexa, do men talk too much? Session

Mansplaining. Know it? Hate it? Want to make it go away? Sireesha Muppala, Shelbee Eigenbrode, and Emily Webber tackle the problem of men talking over or down to women and its impact on career progression for women. They also demonstrate an Alexa skill that uses deep learning techniques on incoming audio feeds, examine ownership of the problem for women and men, and suggest helpful strategies.

ML ops: Applying DevOps practices to machine learning workloads Session

As an increasing level of automation becomes available to data science, the balance between automation and quality needs to be maintained. Applying DevOps practices to machine learning workloads brings models to the market faster and maintains the quality and integrity of those models. Sireesha Muppala, Shelbee Eigenbrode, and Randall DeFauw explore applying DevOps practices to ML workloads.

Michael Ernest is a partner solution architect at Dataiku, supporting technical integration with cloud platforms. He previously led field-enablement programming at Cloudera, where he developed training for new and tenured hires in Hadoop operations, application architecture, and full stack security. He’s published four books on Java programming and Sun Solaris administration. Ernest lives in Berkeley, California.

Presentations

Getting ready for CCPA: Securing data lakes for heavy privacy regulation Tutorial

New regulations drive compliance, governance, and security challenges for big data. Infosec and security groups must ensure a secured and governed environment across workloads that span on-premises, private cloud, multicloud, and hybrid cloud. Mark Donsky, Lars George, Michael Ernest, and Ifigeneia Derekli outline hands-on best practices for meeting these challenges with special attention to CCPA.

Richard Evans is a 28-year veteran of Statistics Canada and the Institut national de la statistique et des études économiques (Insee). He’s an expert in high-frequency economic indicators, a transformative leader, and an architect and project executive of the CPI Enhancement Initiative. Richard is passionate about using data science and AI to create user-centric data products from big data sources and is a recruiter of tomorrow’s statistical leaders.

Presentations

Driving adoption of data DCS

Often, the difference between a successful data initiative and a failed one isn't the data or the technology but its adoption by the wider business. With every business wanting the magic of data but many failing to properly embrace and harness it, this panel explores the factors our panelists have seen lead to success and failure in getting companies to use data products.

Implementing ML models into production at Statistics Canada DCS

Join Richard Evans to find out how Statistics Canada, Canada’s national statistical organization, created and put in place its first AI/ML team, applying Lean Startup principles and a proactive hiring strategy that within months was delivering production-ready ML models in a complex and demanding data processing environment.

Stephan Ewen is a cofounder and chief technology officer at Ververica, where he leads the development of the stream processing platform based on open source Apache Flink. He’s also a PMC member and one of the original creators of Apache Flink. Previously, Stephan worked on in-memory databases, query optimization, and distributed systems. He holds a PhD from the Berlin University of Technology.

Presentations

Stream processing beyond streaming data Session

Stephan Ewen details how stream processing is becoming a "grand unifying paradigm" for data processing and the newest developments in Apache Flink to support this trend: new cross-batch-streaming machine learning algorithms, state-of-the-art batch performance, and new building blocks for data-driven applications and application consistency.

Moty Fania is a principal engineer and the CTO of the Advanced Analytics group at Intel, which delivers AI and big data solutions across Intel. Moty has rich experience in ML engineering, analytics, data warehousing, and decision support solutions. He led the architecture work and development of various AI and big data initiatives, such as IoT systems, predictive engines, online inference systems, and more.

Presentations

Building an AI platform: Key principles and lessons learned Session

Moty Fania details Intel IT’s experience implementing a sales AI platform. The platform is based on a streaming, microservices architecture with a message bus backbone. Designed for real-time data extraction and reasoning, it processes millions of website pages and is capable of sifting through millions of tweets per day.

Usama Fayyad is a cofounder and chief technology officer at OODA Health, a VC-funded company founded in 2017 to bring AI and automation to create a retail-like experience in payments and processing to healthcare delivery, and founder and chairman at Open Insights, a technology and strategic consulting firm founded in 2008 to help enterprises deploy data-driven solutions to grow revenue from data assets. In addition to big data strategy and building new business models on data assets, Open Insights deploys data science, AI and ML, and big data solutions for large enterprises. Previously, he served as global chief data officer at Barclays in London after he launched the largest tech startup accelerator in MENA as executive chairman of Oasis500 in Jordan; held chairman and CEO roles at several startups, including Blue Kangaroo, DMX Group, and DigiMine; was the first person to hold the chief data officer title when Yahoo acquired his second startup in 2004, where he built the Strategic Data Solutions Group and founded Yahoo Research Labs; held leadership roles at Microsoft and founded the Machine Learning Systems Group at NASA’s Jet Propulsion Laboratory, where his work on machine learning resulted in the top Excellence in Research award from Caltech and a US government medal from NASA. Usama has published over 100 technical articles on data mining, data science, AI and ML, and databases. He holds over 30 patents and is a fellow of both the AAAI and the ACM. Usama earned his PhD in engineering in AI and machine learning from the University of Michigan-Ann Arbor. He’s edited two influential books on data mining and served as editor-in-chief on two key industry journals. He also served on the boards or advisory boards of several private and public companies including Criteo, InvenSense, RapidMiner, Stella, Virsec, Silniva, Abe AI, NetSeer, ChoiceStream, Medio, and others. 
On the academic front, he’s on the advisory boards of the Data Science Institute at Imperial College, AAI at UTS, and the University of Michigan College of Engineering National Advisory Board.

Presentations

An in-depth look at the data science career: Defining roles, assessing skills Session

If you've ever been confused about what it takes to be a data scientist or curious about how companies recruit, train, and manage analytics resources, Usama Fayyad and Hamit Hamutcu are here to explore insights from the most comprehensive research effort to date on the data analytics profession and propose a framework for the standardization of roles and methods for assessing skills.

David Florsek is the architect of innovation at IDEMIA National Security Solutions (IDEMIA NSS), where he leads efforts to define, develop, and deploy situational awareness platforms that integrate multiple types of intelligence-oriented data, including facial and biometric data, linguistic and lexicological data, historical and contextual data, and other forms of information. David has been responsible for developing and deploying systems as wide-ranging as the FBI’s Integrated Automated Fingerprint Identification System (IAFIS) biometric criminal justice system, military air-to-ground missile-targeting systems for fighter jets, ground-based missile-defense systems, university student financial management systems that help students track their meal plans, and maintenance inventory control systems. Previously, David led a successful entrepreneurial consulting small business; spent many years in software and system development and integration at Deloitte, Lockheed Martin Aeronautics, and Boeing; and worked with the Centers for Disease Control and Prevention (CDC), US Department of Veterans Affairs (VA), Federal Bureau of Investigation (FBI), Department of Homeland Security (DHS), University System of Georgia, and, of course, all branches of the Department of Defense. As a developer, David designed circuitry flying today on the US Air Force’s B-52 bomber as well as software and systems operational across many agencies within the US federal government. The common thread across all of these efforts was the need to find innovative solutions to seemingly intractable problems. David has pioneered efforts in data mining, AI, and DL to extract actionable information from what was previously considered vast quantities of trash data. He specializes in developing algorithms and concepts that bridge traditional boundaries, staying within defined laws and statutes, to solve previously unsolved problems.

Presentations

War stories from the front lines of ML Session

Machine learning techniques are being deployed across almost every industry and sector. But this adoption comes with real, and oftentimes underestimated, privacy and security risks. Andrew Burt and Brenda Leong convene a panel of experts including David Florsek, Chris Wheeler, and Alex Beutel to detail real-life examples of when ML goes wrong, and the lessons they learned.

Ryan Foltz is a data scientist at Exabeam, the Smarter SIEM company, where he applies the latest machine learning approaches to cybersecurity. Previously, he was a data specialist at the Harvard-Smithsonian Center for Astrophysics, and he’s the founder and lead game designer at Epic Banana Studios. He earned his PhD at the University of California, Riverside, for his research on galaxy formation and evolution.

Presentations

Learning asset naming patterns to find risky unmanaged devices Session

Unmanaged and foreign devices in corporate networks pose a security risk, and the first step toward reducing this risk is the ability to identify them. Ryan Foltz walks you through a comprehensive device management model based on deep learning that performs anomaly detection using only device names to flag devices that don't follow naming structures.
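
The core idea of learning naming patterns can be illustrated with a toy sketch (not the deep learning model the session describes): train a character-bigram frequency model on managed device names and score new names by how surprising their bigrams are. All names and data below are made up for illustration.

```python
from collections import Counter
from math import log

def bigrams(name):
    # Pad with start/end markers so position information is captured too.
    s = f"^{name.lower()}$"
    return [s[i:i + 2] for i in range(len(s) - 1)]

def train(names):
    counts = Counter(b for n in names for b in bigrams(n))
    return counts, sum(counts.values())

def score(name, counts, total):
    # Average negative log-likelihood; unseen bigrams get a small floor probability.
    grams = bigrams(name)
    return -sum(log((counts[g] + 1) / (total + 1)) for g in grams) / len(grams)

# Hypothetical managed fleet with a consistent naming convention.
managed = [f"corp-laptop-{i:04d}" for i in range(500)]
counts, total = train(managed)
normal = score("corp-laptop-0731", counts, total)
odd = score("DESKTOP-7G2K1XQ", counts, total)  # unmanaged-looking name scores higher
```

A real system would replace the bigram model with a learned sequence model, but the anomaly-scoring shape is the same: low surprise for names that follow the convention, high surprise for everything else.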

Jonathan Foster is the principal content experiences manager at Microsoft, where he leads the Windows and content intelligence writing team. The team’s work includes UX writing, designing personality and voice within products and experiences, and authoring and designing conversational interactions for products and experiences. He built the writing team for Microsoft’s digital assistant Cortana for the US and international markets, which focused on the development of Cortana’s personality while crafting fun, challenging dialogue. The team is now expanding upon this knowledge to create a personality catalogue for Microsoft’s Bot Framework and train a deep neural net conversational model to support those personalities. Jonathan started out in film and television writing screenplays and working in development. He was eventually drawn away from Hollywood by the true innovative spirit of the tech industry, starting with an interactive storytelling project that was honored by the Sundance Film Festival.

Presentations

Executive Briefing: Say what? The ethical challenges of designing for humanlike interaction Session

Language shapes our thinking, our relationships, our sense of self. Conversation connects us in powerful, intimate, and often unconscious ways. Jonathan Foster explains why, as we design for natural language interactions and more humanlike digital experiences, language—as design material, conversation, and design canvas—reveals ethical challenges we couldn't encounter with GUI-powered experiences.

Say what? The ethical challenges of designing for humanlike interaction Keynote

Language shapes our thinking, our relationships, our sense of self. Conversation connects us in powerful, intimate, and often unconscious ways. Jonathan Foster explains why, as we design for natural language interactions and more humanlike digital experiences, language—as design material, conversation, and design canvas—reveals ethical challenges we couldn't encounter with GUI-powered experiences.

Marcus Fowler is the director of strategic threat at Darktrace. Previously, he spent 15 years at the Central Intelligence Agency developing global cyber operations and technical strategies, led cyber efforts with various US Intelligence Community elements and global partners, has extensive experience advising senior leaders on cyber efforts, and was an officer in the United States Marine Corps. He’s recognized as a leader in developing and deploying innovative cyber solutions. Marcus has an engineering degree from the United States Naval Academy and a master’s degree in international security studies from the Fletcher School. He also completed Harvard Business School’s Executive Education Advanced Management Program.

Presentations

When machines fight machines: Cyberbattles and the new frontier of artificial intelligence Session

Cybersecurity must find what it doesn’t know to look for. AI technologies led to the emergence of self-learning, self-defending networks that achieve this, detecting and autonomously responding to in-progress attacks in real time. Marcus Fowler examines how these cyber-immune systems enable security teams to focus on high-value tasks, counter even machine-speed threats, and work in all environments.

Michael J. Freedman is the cofounder and CTO of TimescaleDB and a full professor of computer science at Princeton University. His work broadly focuses on distributed and storage systems, networking, and security, and his publications have more than 12,000 citations. He developed CoralCDN (a decentralized content distribution network serving millions of daily users) and helped design Ethane (which formed the basis for OpenFlow/software-defined networking). Previously, he cofounded Illuminics Systems (acquired by Quova, now part of Neustar) and serves as a technical advisor to Blockstack. Michael’s honors include a Presidential Early Career Award for Scientists and Engineers (given by President Obama), the SIGCOMM Test of Time Award, a Sloan Fellowship, an NSF CAREER award, the Office of Naval Research Young Investigator award, and support from the DARPA Computer Science Study Group. He did his PhD at NYU and Stanford and his undergraduate and master’s degrees at MIT.

Presentations

Performant time series data management and analytics with PostgreSQL Session

Leveraging polyglot solutions for your time series data can lead to issues including engineering complexity, operational challenges, and even referential integrity concerns. Michael Freedman explains why, by re-engineering PostgreSQL to serve as a general data platform, your high-volume time series workloads will be better streamlined, resulting in more actionable data and greater ease of use.

Brandy Freitas is a principal data scientist at Pitney Bowes, where she works with clients in a wide variety of industries to develop analytical solutions for their business needs. Brandy is a research-physicist-turned-data-scientist based in Boston, Massachusetts. Her academic research focused primarily on protein structure determination, applying machine learning techniques to single-particle cryoelectron microscopy data. Brandy is a National Science Foundation Graduate Research Fellow and a James Mills Pierce Fellow. She holds an undergraduate degree in physics and chemistry from the Rochester Institute of Technology and did her graduate work in biophysics at Harvard University.

Presentations

Harnessing graph-native algorithms to enhance machine learning: A primer Session

Brandy Freitas examines the interplay between graph analytics and machine learning, improved feature engineering with graph-native algorithms, and how to harness the power of graph structure for machine learning through node embedding.
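
As a flavor of what "graph-native algorithms for feature engineering" means, here is a minimal, illustrative PageRank sketch in plain Python (PageRank scores are a classic graph-derived node feature); this is not material from the session, and the tiny edge list is made up.

```python
def pagerank(edges, damping=0.85, iters=50):
    """Compute PageRank over a directed edge list via power iteration."""
    nodes = sorted({n for e in edges for n in e})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Every node keeps the teleport share, then receives shares from in-links.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out[src] or nodes  # dangling nodes spread rank uniformly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new[dst] += share
        rank = new
    return rank

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
top = max(ranks, key=ranks.get)  # "c" is linked from both "b" and "d"
```

In a feature-engineering setting, scores like these (along with degree, centrality, or learned node embeddings) become extra columns in the training table for a downstream model.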

Matt Fuller is cofounder at Starburst, the Presto company. Previously, Matt held engineering roles in the data warehousing and analytics space for 10 years: he was director of engineering at Teradata, where he led engineering teams working on Presto and was part of the team that led the initiative to bring open source software, in particular Presto, to Teradata’s products; he architected and led development efforts for the next-generation distributed SQL engine at Hadapt (acquired by Teradata in 2014); and he was an early engineer at Vertica (acquired by HP), where he worked on the query optimizer.

Presentations

Learning Presto: SQL on anything Tutorial

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today.

Viktor Gamov is a developer advocate at Confluent, the company that makes an event streaming platform based on Apache Kafka. Back in his consultancy days, Viktor developed comprehensive expertise in building enterprise application architectures using open source technologies. He enjoys helping architects and developers design and develop low-latency, scalable, and highly available distributed systems. He’s a professional conference speaker on distributed systems, streaming data, the Java virtual machine (JVM), and DevOps topics, and he’s a regular on events including JavaOne, Devoxx, OSCON, QCon, and others. He co-authored O’Reilly’s Enterprise Web Development. He blogs at gamov.io and cohosts Crazy Russians in Devoops and DevRelRad.io podcasts. Follow Viktor on Twitter as @gamussa, where he posts about gym life, food, open source, and, of course, Kafka and Confluent.

Presentations

Real-time SQL stream processing at scale with Apache Kafka and KSQL Tutorial

Building stream processing applications is certainly one of the hot topics in the IT community. But if you've ever thought you needed to be a programmer to do stream processing and build stream processing data pipelines, think again. Viktor Gamov explores KSQL, the stream processing query engine built on top of Apache Kafka.

Alon Gavra is a platform team lead at AppsFlyer. Originally a backend developer, he transitioned to leading the real-time infrastructure team, taking on the role of bringing some of the most heavily used infrastructure at AppsFlyer to the next level. A strong believer in sleep-driven design, Alon focuses on stability and resiliency in building massive data ingestion and storage solutions.

Presentations

Managing your Kafka in an explosive growth environment Session

Kafka is often just a piece of the production stack that no one wants to touch, because it just works. Alon Gavra outlines how Kafka sits at the core of AppsFlyer's infrastructure, which processes billions of events daily.

Presentations

SOLD OUT: Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

SOLD OUT: Building a serverless big data application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Bas Geerdink is an independent technology lead, focusing on AI and big data. He has worked in several industries on state-of-the-art data platforms and streaming analytics solutions, in the cloud and on prem. Bas has a background in software development, design, and architecture with broad technical experience from C++ to Prolog to Scala. His academic background is in artificial intelligence and informatics. Bas’s research on reference architectures for big data solutions was published at the IEEE conference ICITST 2013. He occasionally teaches programming courses and is a regular speaker at conferences and informal meetings.

Presentations

Fast data with the KISSS stack Session

Streaming analytics (or fast data processing) is the field of making predictions based on real-time data. Bas Geerdink presents a fast data architecture that covers many use cases that follow a "pipes and filters" pattern. This architecture can be used to create enterprise-grade solutions with a diversity of technology options. The stack is Kafka, Ignite, and Spark Structured Streaming (KISSS).
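
The "pipes and filters" pattern the session builds on can be sketched in a few lines of Python using generators, where each filter consumes an upstream iterator and yields transformed events downstream. This is only an analogy for the real stack (Kafka as the pipe, Spark Structured Streaming as the filters); the sensor data below is made up.

```python
def parse(lines):
    # Filter 1: parse raw "device,reading" strings into typed events.
    for line in lines:
        device, reading = line.split(",")
        yield device, float(reading)

def threshold(events, limit):
    # Filter 2: keep only readings above the alert threshold.
    for device, reading in events:
        if reading > limit:
            yield device, reading

def enrich(events):
    # Filter 3: shape surviving events for downstream consumers.
    for device, reading in events:
        yield {"device": device, "reading": reading, "alert": True}

raw = ["sensor-1,0.2", "sensor-2,9.7", "sensor-3,4.1"]
pipeline = enrich(threshold(parse(raw), limit=5.0))
alerts = list(pipeline)  # only sensor-2 exceeds the limit
```

Because each stage only knows its input and output, stages can be swapped, reordered, or scaled independently, which is exactly the property that makes the pattern attractive for enterprise streaming architectures.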

Lars George is the principal solutions architect at Okera. Lars has been involved with Hadoop and HBase since 2007 and became a full HBase committer in 2009. Previously, Lars was the EMEA chief architect at Cloudera, acting as a liaison between the Cloudera professional services team and customers as well as partners in and around Europe, building the next data-driven solutions, and a cofounding partner of OpenCore, a Hadoop and emerging data technologies advisory firm. He has spoken at many Hadoop User Group meetings as well as at conferences such as ApacheCon, FOSDEM, QCon, and Hadoop World and Hadoop Summit. He also started the Munich OpenHUG meetings. He’s the author of HBase: The Definitive Guide from O’Reilly.

Presentations

Getting ready for CCPA: Securing data lakes for heavy privacy regulation Tutorial

New regulations drive compliance, governance, and security challenges for big data. Infosec and security groups must ensure a secured and governed environment across workloads that span on-premises, private cloud, multicloud, and hybrid cloud. Mark Donsky, Lars George, Michael Ernest, and Ifigeneia Derekli outline hands-on best practices for meeting these challenges with special attention to CCPA.

Gidon Gershinsky is a lead architect at IBM Research – Haifa. He works on secure cloud analytics, data-at-rest and data-in-use encryption, and attestation of trusted computing enclaves. Gidon plays a leading role in the Apache Parquet community work on big data encryption and integrity verification technology. He earned his PhD at the Weizmann Institute of Science in Israel and was a postdoctoral fellow at Columbia University.

Presentations

Parquet modular encryption: Confidentiality and integrity of sensitive column data Session

The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off on by the community management committee. Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and a number of use cases.

Debasish Ghosh is principal software engineer at Lightbend. Passionate about technology and open source, he loves functional programming and has been trying to learn math and machine learning. Debasish is an occasional speaker in technology conferences worldwide, including the likes of QCon, Philly ETE, Code Mesh, Scala World, Functional Conf, and GOTO. He’s the author of DSLs In Action and Functional & Reactive Domain Modeling. Debasish is a senior member of ACM. He’s also a father, husband, avid reader, and Seinfeld fanboy who loves spending time with his beautiful family.

Presentations

Online machine learning in streaming applications Session

Stavros Kontopoulos and Debasish Ghosh explore online machine learning algorithm choices for streaming applications, especially those with resource-constrained use cases like IoT and personalization. They dive into Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms from implementation to production deployment, describing the pros and cons of each of them.
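
To give a flavor of one topic the session covers, here is a minimal, illustrative drift detector loosely modeled on the classic DDM (Drift Detection Method), not the presenters' implementation: it watches a stream of 0/1 prediction errors and signals drift when the error rate climbs well above its historical minimum. The error stream below is synthetic.

```python
from math import sqrt

class DDM:
    """Toy DDM-style detector over a stream of errors (1 = wrong, 0 = right)."""

    def __init__(self, warmup=30, drift_level=3.0):
        self.n = 0
        self.p = 0.0                  # running error rate
        self.min_ps = float("inf")    # minimum observed p + s
        self.p_min = self.s_min = 0.0
        self.warmup = warmup
        self.drift_level = drift_level

    def update(self, error):
        """Feed one 0/1 error; return True if drift is detected."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = sqrt(self.p * (1 - self.p) / self.n)
        if self.n < self.warmup:
            return False
        if self.p + s < self.min_ps:  # track the best (lowest) error regime seen
            self.min_ps = self.p + s
            self.p_min, self.s_min = self.p, s
        return self.p + s > self.p_min + self.drift_level * self.s_min

detector = DDM()
# A mostly accurate model (5% errors), then concept drift: errors every step.
stream = ([0] * 19 + [1]) * 10 + [1] * 50
drift_at = next(i for i, e in enumerate(stream) if detector.update(e))
```

Production detectors (including those discussed in the session) add windowing, confidence-interval refinements, and reset logic, but the monitor-against-historical-minimum shape is the same.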

Jonathan Goldstein is a principal researcher at Microsoft Research.

Presentations

Trill: The crown jewel of Microsoft’s streaming pipeline explained Session

Trill has been open-sourced, making the streaming engine behind services like the Bing Ads platform available for all to use and extend. James Terwilliger, Badrish Chandramouli, and Jonathan Goldstein dive into the history of and insights from streaming data at Microsoft. They demonstrate how its API can power complex application logic and the performance that gives the engine its name.

Heitor Murilo Gomes is a researcher at Télécom ParisTech focusing on machine learning—particularly, evolving data streams, concept drift, ensemble methods, and big data streams. He coleads the streamDM open data stream mining project.

Presentations

Machine learning for streaming data: Practical insights Session

Heitor Murilo Gomes and Albert Bifet introduce you to a machine learning pipeline for streaming data using the streamDM framework. You'll also learn how to use streamDM for supervised and unsupervised learning tasks, see examples of online preprocessing methods, and discover how to expand the framework by adding new learning algorithms or preprocessing methods.

Bruno Gonçalves is currently a senior data scientist working at the intersection of data science and finance. Previously, he was a data science fellow at NYU’s Center for Data Science while on leave from a tenured faculty position at Aix-Marseille Université. Since completing his PhD in the physics of complex systems in 2008, he has been pursuing the use of data science and machine learning to study human behavior. Using large datasets from Twitter, Wikipedia, web access logs, and Yahoo! Meme, he studied how we can observe both large-scale and individual human behavior in an unobtrusive and widespread manner. The main applications have been to the study of computational linguistics, information diffusion, behavioral change, and epidemic spreading. In 2015, he was awarded the Complex Systems Society’s Junior Scientific Award for “outstanding contributions in Complex Systems Science,” and in 2018, he was named a science fellow of the Institute for Scientific Interchange in Turin, Italy.

Presentations

Deep learning from scratch Tutorial

You'll go hands-on to learn the theoretical foundations and principal ideas underlying deep learning and neural networks. Bruno Gonçalves provides implementations whose code structure closely resembles the way Keras is structured, so that by the end of the course, you'll be prepared to dive deeper into the deep learning applications of your choice.
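
As a purely hypothetical illustration of what a Keras-like code structure can look like when built from scratch (this is not the tutorial's code, and it implements only the forward pass), consider a tiny Sequential container in plain Python:

```python
class Dense:
    """Fully connected layer with fixed weights (no training in this sketch)."""
    def __init__(self, weights, biases):
        self.weights, self.biases = weights, biases

    def forward(self, x):
        # One output per row of weights: dot product plus bias.
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weights, self.biases)]

class ReLU:
    """Element-wise rectified linear activation."""
    def forward(self, x):
        return [max(0.0, v) for v in x]

class Sequential:
    """Keras-style container: data flows through the layers in order."""
    def __init__(self, layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

model = Sequential([Dense([[1.0, -1.0]], [0.0]), ReLU()])
output = model.forward([3.0, 1.0])  # [2.0]
```

A from-scratch course then fills in the pieces this sketch omits: a `backward` method per layer, loss functions, and an optimizer loop.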

Madhu Gopinathan is the vice president of data science at MakeMyTrip, India’s leading online travel company. He started his career in the San Francisco Bay Area developing large-scale software systems in the telecom industry; spent about 10 years at companies such as Covad Communications and Microsoft; and built a machine learning team at the Infosys Product Incubation Group, followed by a couple of startups. He collaborated with researchers at Microsoft Research, General Motors, and the Indian Institute of Science, leading to publications in prominent computer science conferences. He earned his PhD in computer science from the Indian Institute of Science, working on the mathematical modeling of software systems, and his MS in computer science from the University of Florida, Gainesville. He has extensive experience developing large-scale systems using machine learning and natural language processing and has been granted multiple US patents.

Presentations

Migrating millions of users from voice- and email-based customer support to a chatbot Session

At MakeMyTrip, customers used voice or email to contact agents for postsale support. To improve agent efficiency and the customer experience, MakeMyTrip developed a chatbot, Myra, using some of the latest advances in deep learning. Madhu Gopinathan and Sanjay Mohan explain the high-level architecture and the business impact Myra created.

Sunil Goplani is a group development manager at Intuit, leading the big data platform. Sunil has played key architecture and leadership roles in building solutions around data platforms, big data, BI, data warehousing, and MDM for startups and enterprises. Previously, Sunil served in key engineering positions at Netflix, Chegg, Brand.net, and a few other startups. Sunil has a master’s degree in computer science.

Presentations

Time travel for data pipelines: Solving the mystery of what changed Session

A business insight shows a sudden spike. It can take hours, or days, to debug data pipelines to find the root cause. Shradha Ambekar, Sunil Goplani, and Sandeep Uttamchandani outline how Intuit built a self-service tool that automatically discovers data pipeline lineage and tracks every change, helping debug the issues in minutes—establishing trust in data while improving developer productivity.

Sajan Govindan is a solutions architect on the data analytics technologies team at Intel, focusing on open source technologies for big data analytics and AI solutions. Sajan has been with Intel for more than eighteen years, with many years of experience and expertise in building analytics and AI solutions, working through the advancements in the Hadoop and Spark ecosystem and machine learning and deep learning frameworks in various industry verticals and domains.

Presentations

Deep learning on Apache Spark at CERN’s Large Hadron Collider with Analytics Zoo Session

Sajan Govindan outlines CERN’s research on deep learning in high energy physics experiments as an alternative to customized rule-based methods with an example of topology classification to improve real-time event selection at the Large Hadron Collider. CERN uses deep learning pipelines on Apache Spark using BigDL and Analytics Zoo open source software on Intel Xeon-based clusters.

Sourabh Goyal is a member of the technical staff at Qubole, where he works on the Hadoop team. Sourabh holds a bachelor’s degree in computer engineering from the Netaji Subhas Institute of Technology, University of Delhi.

Presentations

Downscaling: The Achilles heel of autoscaling Spark clusters Session

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, so the overall total cost of ownership (TCO) goes up. Prakhar Jain and Sourabh Goyal examine a new design for efficient downscaling, which helps achieve better resource utilization and lower TCO.

Michael Gregory leads the field team for machine learning at Cloudera, helping organizations derive business value from machine learning. Michael has more than 20 years of experience building, selling, implementing, and supporting large-scale data management solutions at Sun Microsystems, Oracle, Teradata, and Hortonworks, and he has seen and evangelized the power of data to transform organizations and industries, from automotive to telco and public sector to manufacturing.

Presentations

Apache Metron: Open source cybersecurity at scale Tutorial

Bring your laptop, roll up your sleeves, and get ready to crunch some cybersecurity events with Apache Metron, an open source big data cybersecurity platform. Carolyn Duby walks you through how Metron finds actionable events in real time.

An economist turned data scientist, Catherine Gu is pursuing graduate study in computational social science at Stanford University. Her areas of focus include exploratory data analysis, market design, and stochastic modeling. She has spent time with Anchorage and SETL as a blockchain researcher, focusing on staking, yield generation, and protocol design.

Before coming to the US, she was a macro strategist at JPMorgan in London, covering emerging market FX and rates. Prior to that, she worked as a risk and investment analyst at Man Group, gaining four years of quantitative finance experience. Catherine is the cofounder of the Quantess London Group. She holds a bachelor's and a master's degree in economics from the University of Cambridge.

Presentations

The future of stablecoin Findata

With the emergence of the cryptoeconomy, there is real demand for an alternative form of money. Major cryptocurrencies such as Bitcoin and Ethereum have thus far failed to achieve mass adoption. Catherine Gu examines the paradigm of algorithmic design of stablecoins, focusing on incentive structure and decentralized governance, to evaluate the role of stablecoins as a future medium of exchange.

Venkata Gunnu is a senior director of data science at Comcast, where he manages data science and data engineering teams and architects data science projects that process and analyze billions of messages a day and petabytes of data. Venkata is a leader in data science democratization with 15+ years of experience in data science modeling, design, architecture, consulting, entrepreneurship, and development, including 10+ years in data science modeling, big data, and the cloud. He earned a master's in information systems management in project planning and management from Central Queensland University, Australia. He has experience with product evangelization and speaking at conferences and user groups.

Presentations

Challenges faced in machine learning infrastructure in traditional large enterprises Session

Machine learning infrastructure is key to the success of AI at scale in enterprises, with many challenges when you want to bring machine learning models to a production environment, given the legacy of the enterprise environment. Venkata Gunnu and Harish Doddi explore some key insights, what worked, what didn't work, and best practices that helped the data engineering and data science teams.

Chenzhao Guo is a big data software engineer at Intel. He's currently a contributor to Spark and a committer on OAP and HiBench. He graduated from Zhejiang University.

Presentations

Improving Spark by taking advantage of disaggregated architecture Session

Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumption of colocated compute and storage does not always hold in today's data centers. Chenzhao Guo and Carson Wang outline the implementation of a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends, making life easier for customers.

Atul Gupte is a product manager on the product platform team at Uber, where he helps drive product decisions to ensure Uber’s data science teams are able to achieve their full potential by providing access to foundational infrastructure, stable compute resources, and advanced tooling to power Uber’s global ambitions. Previously, he built some of the world’s most beloved games (CityVille, Looney Tunes Dash!, and Words with Friends) and designed mobile advertising systems that serve over 1B ads per day and power revenues of ~$200M per year. Atul is a technologist at heart and firmly believes in the power of technology to deliver experiences that enrich lives and delight millions globally. He’s always found fulfillment by creating such avenues that serve as force multipliers and enable others to achieve their full potential. He holds a BS in computer science from the University of Illinois at Urbana-Champaign.

Presentations

From raw data to informed intelligence: Democratizing data science and ML at Uber Session

Uber is changing the way people think about transportation. As an integral part of the logistical fabric in 65+ countries around the world, it uses ML and advanced data science to power every aspect of the Uber experience—from dispatch to customer support. Atul Gupte and Nikhil Joshi explore how Uber enables teams to transform insights into intelligence and facilitate critical workflows.

Atul Gupte is a product manager on the product platform team at Uber, where he helps drive product decisions to ensure Uber’s data science teams are able to achieve their full potential by providing access to foundational infrastructure, stable compute resources, and advanced tooling to power Uber’s global ambitions. Previously, he built some of the world’s leading social games and helped build out the mobile advertising platform at Zynga. He holds a BS in computer science from the University of Illinois at Urbana-Champaign.

Presentations

Turning big data into knowledge: Managing metadata and data relationships at Uber's scale Session

Uber takes data driven to the next level. A robust system for discovering and managing various entities, from datasets to services to pipelines, and their relevant metadata isn't just nice to have; it's absolutely integral to making data useful. Kaan Onuk, Luyao Li, and Atul Gupte explore the current state of metadata management and end-to-end data flow solutions at Uber, and what's coming next.

Abdelkrim Hadjidj is a senior data streaming specialist at Cloudera with 10 years of experience in several distributed systems (big data, IoT, peer-to-peer, and cloud). Previously, he held several positions, including big data lead, CTO, and software engineer, at several companies. He has spoken at various international conferences and published several scientific papers in well-known IEEE and ACM journals. Abdelkrim holds PhD, MSc, and MSE degrees in computer science.

Presentations

Cloudera Edge Management in the IoT Tutorial

There are too many edge devices and agents, and you need to control and manage them. Purnima Reddy Kuchikulla, Timothy Spann, Abdelkrim Hadjidj, and Andre Araujo walk you through handling the difficulty in collecting real-time data and the trouble with updating a specific set of agents with edge applications. Get your hands dirty with CEM, which addresses these challenges with ease.

Presentations

Running multidisciplinary big data workloads in the cloud with CDP Tutorial

Organizations now run diverse, multidisciplinary, big data workloads that span data engineering, data warehousing, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long running in nature. There are many challenges with moving these workloads to the cloud. In this talk we start off with a technical deep...

Josh Hamilton is a data scientist at Major League Baseball, where he works with the league and its 30 teams to build data pipelines and predictive models. Previously, Josh helped build data infrastructure and recommender systems for a movie streaming company and worked as a product manager for a platform-as-a-service startup. He studied finance and economics and holds an MS in applied statistics from the University of Alabama.

Presentations

Data science and the business of Major League Baseball Session

Using SAS, Python, and AWS SageMaker, Major League Baseball's (MLB's) data science team outlines how it predicts ticket purchasers’ likelihood to purchase again, evaluates prospective season schedules, estimates customer lifetime value, optimizes promotion schedules, quantifies the strength of fan avidity, and monitors the health of monthly subscriptions to its game-streaming service.

Hamit Hamutcu is a cofounder of Analytics Center, a company focused on the use of data and analytics in business, and an advisor or investor in several analytics-related initiatives developing vertical machine learning solutions for industries such as advertising and ecommerce. He has over 20 years of industry and consulting experience in the areas of analytics, customer relationship management, and marketing strategies driven by data. Previously, Hamit was a founding partner of the EMEA offices of Peppers & Rogers Group, where he led the development of the firm in the region by serving clients across the Middle East, Africa, and Europe; was a partner in the firm's US office, heading up its global Analytics Group, where he oversaw the growth of the analytics practice and helped clients develop analytics functions, build data infrastructure, and deploy analytical models to support business goals; and held several positions within FedEx in Memphis in marketing analytics and technology, where he led IT and business teams in leveraging the enormous amount of data the company generated to serve its customers better. Hamit is also a frequent speaker, writer, and board member at various startups and nonprofit organizations. He earned his BSc degree in electronics engineering at Bogazici University in Istanbul and his MBA degree at the University of Florida.

Presentations

An in-depth look at the data science career: Defining roles, assessing skills Session

If you've ever been confused about what it takes to be a data scientist or curious about how companies recruit, train, and manage analytics resources, Usama Fayyad and Hamit Hamutcu are here to explore insights from the most comprehensive research effort to date on the data analytics profession and propose a framework for the standardization of roles and methods for assessing skills.

Janet Haven is the executive director of Data & Society. Previously, she was Data & Society’s director of programs and strategy; spent more than a decade at the Open Society Foundations, where she oversaw funding strategies and grant making related to technology’s role in supporting and advancing civil society, particularly in the areas of human rights and governance and accountability; and started her career in technology startups in central Europe, participating in several successful acquisitions. She sits on the board of the Public Lab for Open Science and Technology and advises a range of nonprofit organizations. Janet holds an MA from the University of Virginia and a BA from Amherst College.

Presentations

Embrace complexity: The new rules of AI Session

Join Data & Society Research Institute Executive Director Janet Haven for a deep dive into research, case studies, and emerging governance approaches to creating the rules of ethical AI.

Amy Heineike is the vice president of product engineering at Primer, where she leads teams to build machines that read and write text, leveraging natural language processing (NLP), natural language generation (NLG), and a host of other algorithms to augment human analysts. Previously, she built out technology for visualizing large document sets as network maps at Quid. A Cambridge mathematician who previously worked in London modeling cities, Amy is fascinated by complex human systems and the algorithms and data that help us understand them.

Presentations

Data science versus engineering: Does it really have to be this way? Session

If, as a data scientist, you've wondered why it takes so long to deploy your model into production or, as an engineer, thought data scientists have no idea what they want, you're not alone. Join a lively discussion with industry veterans Ann Spencer, Paco Nathan, Amy Heineike, and Chris Wiggins to find best practices or insights on increasing collaboration when developing and deploying models.

Daniel G. Hernandez is the vice president of data and AI at IBM, where he's the head of products for the company's hybrid data management, unified governance and integration, and AI businesses. Daniel's team is responsible for the strategy, road map, and overall performance of the IBM Cloud Private for Data, Watson Studio, Watson Machine Learning, Db2, Integrated Analytics System, Informix, Information Server, Data Replication, MDM, Optim, StoredIQ, SPSS, and Decision Optimization offerings. Major releases and partnerships during his tenure include IBM Cloud Private for Data; the Hortonworks, Mongo, and Actifio partnerships; and several offerings that have won Red Dot Awards for Design. Previously, Daniel launched IBM's data and analytics-as-a-service portfolio and led IBM's unified governance and integration business. He made critical contributions to IBM's core franchises such as enterprise content management, helped acquire companies like i2, Curam, and StoredIQ, and launched and scaled several organically grown businesses. Daniel earned a bachelor's degree from the University of Texas at Dallas and a graduate degree from the University of Texas at Austin. He has two kids, lives in Austin with his wife, daughter, and two dogs, and sends a lot of his money to pay for his son's education at Baylor University.

Presentations

The key to climbing the AI ladder (sponsored by IBM) Session

AI isn't magic. It’s still hard work. Daniel Hernandez explains why having the technology alone isn't enough; it requires a thoughtful and well-architected approach.

Unlocking the value of your data (sponsored by IBM) Keynote

Daniel Hernandez takes a deep dive into how, with a unified, prescriptive information architecture, organizations can successfully unlock the value of their data for an AI and multicloud world.

Annette Hester is a senior data visualization initiative lead at the National Energy Board of Canada. She brings innovative approaches to working with data. Through her company, TheHesterView, she assembles leading experts in their fields into teams that deliver excellence in data structuring and data visualization. The quality of the design in her work reflects decades of experience in advisory and strategic policy services. Previously, she was a faculty member of the University of Calgary’s Haskayne Global Energy EMBA, where she was founding director of the university’s Latin American Research Centre; she was a senior adviser to the deputy minister of the government of Alberta, Canada, and part of the energy and environment policy team for the leadership campaign that saw Alison Redford elected leader and premier. Annette has extensive experience as a consultant in the private sector and to governmental agencies in several countries of the Americas, primarily Brazil and Canada.

Presentations

Purposefully designing technology for civic engagement Session

As new digital platforms emerge and governments look at new ways to engage with citizens, there's an increasing awareness of the role these platforms play in shaping public participation and democracy. Audrey Lobo-Pulo, Annette Hester, and Ryan Hum examine the design attributes of civic engagement technologies and their ensuing impacts, along with an NEB Canada case study.

Mark Hinely is a director of regulatory compliance at KirkpatrickPrice and a member of the Florida Bar, with 10 years of experience in data privacy, regulatory affairs, and internal regulatory compliance. His specific experiences include performing mock regulatory audits, creating vendor compliance programs, and providing compliance consulting. He’s also SANS certified in the law of data security and investigations. As GDPR has become a revolutionary data privacy law around the world, Mark has become the resident GDPR expert at KirkpatrickPrice. He has led the GDPR charge through internal training, developing free, educational content, and performing gap analyses, assessments, and consulting services for organizations of all sizes.

Presentations

Are your privacy practices auditor approved? Session

The fear that comes along with new compliance requirements is overwhelming. Organizations don’t know where to start, what to fix, or what an auditor expects to see. Mark Hinely gives you an auditor's perspective on the newest security and privacy regulations, how your business can prepare for compliance, and what the audit looks like to an auditor.

Keegan Hines is the director of machine learning research at Capital One, where he leads development in areas including explainable AI, representation learning, ML on graphs, and computer vision. He's also an adjunct professor at Georgetown University, teaching graduate coursework in statistics and machine learning.

Presentations

Executive Briefing: Lessons from the front lines—Building a responsible AI/ML program in the enterprise Session

Keegan Hines explores some of the philosophy around the concept of explaining a model, given that the colloquial definition is partially recursive. He covers the lens that banking regulation places on this philosophical basis and expands into the techniques used for these well-governed aspects.

Scott Hoch is the founder of Blackbox Engineering.

Presentations

Feature engineering with Spark NLP to accelerate clinical trial recruitment Session

Recruiting patients for clinical trials is a major challenge in drug development. Saif Addin Ellafi and Scott Hoch explain how Deep 6 uses Spark NLP to scale its training and inference pipelines to millions of patients while achieving state-of-the-art accuracy. They dive into the technical challenges, the architecture of the full solution, and the lessons the company learned.

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

Turning petabytes of data from millions of vehicles into open data with Geotab Session

Geotab is a world-leading asset-tracking company with millions of vehicles under service every day. Felipe Hoffa and Bob Bradley examine the challenges and solutions in creating an ML- and geographic information system (GIS)-enabled petabyte-scale data warehouse leveraging Google Cloud. And they dive into the process of publishing the data openly, how you can access it, and how cities are using it.

Garrett Hoffman is a director of data science at StockTwits, where he leads efforts to use data science and machine learning to understand social dynamics and develop research and discovery tools that are used by a network of over one million investors. Garrett has a technical background in math and computer science but gets most excited about approaching data problems from a people-first perspective—using what we know or can learn about complex systems to drive optimal decisions, experiences, and outcomes.

Presentations

Deep learning methods for natural language processing Tutorial

Garrett Hoffman walks you through deep learning methods for natural language processing and natural language understanding tasks, using a live example in Python and TensorFlow with StockTwits data. Methods include Word2Vec, recurrent neural networks (RNNs) and variants (long short-term memory [LSTM] and gated recurrent unit [GRU]), and convolutional neural networks.

Mick leads Cloudera's worldwide marketing efforts, including advertising, brand, communications, demand, partner, solutions, and web. Mick has had a successful 25-year career in enterprise and cloud software. Previously, he was CMO at sales acceleration and machine learning company InsideSales.com, helping the company pioneer a shift to data-driven marketing and sales that has served as a model for organizations around the globe; served as global vice president of marketing and strategy at Citrix, where he led the company's push into the high-growth desktop virtualization market; managed executive marketing at Microsoft; and held numerous leadership positions at IBM Software. Mick is an advisory board member for InsideSales and a contributing author on Inc.com. He's also an accomplished public speaker who has shared his insightful messages about the business impact of technology with audiences around the world. Mick holds a BS in management from the Georgia Institute of Technology.

Presentations

The road to an enterprise cloud Keynote

Learn how IBM and Cloudera are fueling innovation in IoT, streaming, data warehousing, and machine learning, and making their customers' digital transformation journeys easier, faster, and safer.

Matt Horton is the senior director of data science at Major League Baseball (MLB). In his 11+ years at MLB, Matt has developed numerous projects including predicting ticket buyers’ future purchasing behavior to aid teams in prioritizing their marketing efforts and building a framework for predicting and preventing subscriber churn for MLB’s game-streaming service, MLB.TV. Matt’s team is focused on quantifying fans’ relationships with their favorite teams, modeling trends in both team and league-wide attendance, and estimating fans’ future engagement with MLB. Previously, Matt was at Rosetta and Accenture. He has a BS in statistics from the University of Tennessee and a master’s in applied statistics from Cornell University.
In addition to being a huge baseball fan, Matt is an avid fan of sports in general, rooting for teams from his home state of Tennessee, including the Volunteers, Titans, Predators, and Grizzlies.

Presentations

Data science and the business of Major League Baseball Session

Using SAS, Python, and AWS SageMaker, Major League Baseball's (MLB's) data science team outlines how it predicts ticket purchasers’ likelihood to purchase again, evaluates prospective season schedules, estimates customer lifetime value, optimizes promotion schedules, quantifies the strength of fan avidity, and monitors the health of monthly subscriptions to its game-streaming service.

Rick Houlihan is a principal technologist and leads the NoSQL blackbelt team at AWS and has designed hundreds of NoSQL database schemas for some of the largest and most highly scaled applications in the world. Many of Rick’s designs are deployed at the foundation of core Amazon and AWS services such as CloudTrail, IAM, CloudWatch, EC2, Alexa, and a variety of retail internet and fulfillment-center services. Rick brings over 25 years of technology expertise and has authored nine patents across a diverse set of technologies including complex event processing, neural network analysis, microprocessor design, cloud virtualization, and NoSQL technologies. As an innovator in the NoSQL space, Rick has developed a repeatable process for building real-world applications that deliver highly efficient denormalized data models for workloads of any scale, and he regularly delivers highly rated sessions at re:Invent and other AWS conferences on this specific topic.

Presentations

Where's my lookup table? Modeling relational data in a denormalized world Session

Data has always been and will always be relational. NoSQL databases are gaining in popularity, but that doesn't change the fact that the data is still relational; it just changes how we have to model it. Rick Houlihan dives deep into how real entity relationship models can be efficiently modeled in a denormalized manner, using schema examples from real application services.

Shant Hovsepian is a cofounder and CTO of Arcadia Data, where he’s responsible for the company’s long-term innovation and technical direction. Previously, Shant was an early member of the engineering team at Teradata, which he joined through the acquisition of Aster Data, and he interned at Google, where he worked on optimizing the AdWords database. His experience includes everything from Linux kernel programming and database optimization to visualization. He started his first lemonade stand at the age of four and ran a small IT consulting business in high school. Shant studied computer science at UCLA, where he had publications in top-tier computer systems conferences.

Presentations

Intelligent design patterns for cloud-based analytics and BI Session

With cloud object storage (e.g., S3, ADLS) one expects business intelligence (BI) applications to benefit from the scale of data and real-time analytics. However, traditional BI in the cloud surfaces nonobvious challenges. Shant Hovsepian examines service-oriented cloud design (storage, compute, catalog, security, SQL) and how native cloud BI provides analytic depth, low cost, and performance.

Congrui Huang is a senior data scientist on the AI platform team within the Cloud and AI Division at Microsoft.

Presentations

Introducing a new anomaly detection algorithm (SR-CNN) inspired by computer vision Session

Anomaly detection may sound old fashioned, yet it's super important in many industry applications. Tony Xing, Congrui Huang, Qiyang Li, and Wenyi Yang detail a novel anomaly-detection algorithm based on spectral residual (SR) and convolutional neural network (CNN) and how this method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.

Jing Huang is a director of engineering, machine learning, at SurveyMonkey, where she drives the vision and execution of democratizing machine learning. She leads the effort to build the next-generation machine learning platform and oversees all machine learning operation projects. Previously, she was an entrepreneur and devoted her time to building mobile-first solutions and data products for nontech industries and worked at Cisco, where her contribution ranged from security and cloud management to big data infrastructure.

Presentations

Your cloud, your ML, but more and more scale? How SurveyMonkey did it Session

You're a SaaS company operating on a cloud infrastructure built prior to the machine learning (ML) era, and you need to successfully extend your existing infrastructure to leverage the power of ML. Jing Huang and Jessica Mong detail a case study with critical lessons from SurveyMonkey's journey of expanding its ML capabilities with its rich data repository and hybrid cloud infrastructure.

Presentations

Running multidisciplinary big data workloads in the cloud with CDP Tutorial

Organizations now run diverse, multidisciplinary, big data workloads that span data engineering, data warehousing, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long running in nature. There are many challenges with moving these workloads to the cloud. In this talk we start off with a technical deep...

Hillery is the CTO of IBM Cloud, responsible for technical strategy for IBM's cloud-native and infrastructure offerings. Previously, she served as director of accelerated cognitive infrastructure at IBM Research, leading a team doing cross-stack (hardware through software) optimization of AI workloads and producing productivity breakthroughs of 40x and greater that were transferred into IBM product offerings. Her technical interests have always been interdisciplinary, spanning from silicon technology through system software, and she has served in technical and leadership roles in memory technology, systems for AI, and other areas. She's a member of the IBM Academy of Technology and was appointed an IBM Fellow in 2017. Hillery holds BS, MS, and PhD degrees from the University of Illinois at Urbana-Champaign.

Presentations

The road to an enterprise cloud Keynote

Learn how IBM and Cloudera are fueling innovation in IoT, streaming, data warehousing, and machine learning, and making their customers' digital transformation journeys easier, faster, and safer.

Muhammed Y. Idris is a fellow at leading global impact investment firm Capria and the cofounder and lead developer at Edel Technologies, an organization that builds AI-powered tools and products for frontline humanitarian organizations and service providers, including the United Nations Refugee Agency (UNHCR). As an entrepreneur and investor, his mission is to help improve social services delivery for all through collective and artificial intelligence. Inspired by his own personal experiences with refugee resettlement, Muhammed left academia to build the world's first virtual advocate bot, Atar, which empowers refugees and other newcomers with information about what to do, where to go, and what to expect using customized step-by-step guides. Trained as a computational social scientist, he started his career in finance at BlackRock and went on to complete a PhD focused on the research and development of open source tools for leveraging socially generated big data and distilling complex technical information into actionable insights. His work has been presented at numerous academic, policy, and industry conferences, and he has held teaching and research positions at the University of Washington, Penn State, Concordia University in Montreal, and Harvard, where he held a predoctoral fellowship at the Belfer Center for Science and International Affairs while completing his dissertation. A self-taught programmer, Muhammed enjoys teaching statistics and programming in his spare time.

Presentations

Social services 2.0: Atar and the future of (social) work DCS

Muhammed Y. Idris offers an overview of Atar, the first-ever virtual advocate bot, and shares a practical case study of how technology can help streamline and scale service delivery for refugees.

Susan Israel is an experienced privacy attorney who focuses her practice on developing and implementing data privacy policies and programs. She advises companies on how to comply with the California Consumer Privacy Act (CCPA), and on issues that continue to arise with respect to the General Data Protection Regulation (GDPR). She has experience drafting and negotiating data deals that protect privacy, especially in the context of advertising and related technology, as well as providing counsel on public policy issues in relation to both privacy as well as advertising law.
Susan maintains a broad knowledge of media, communications, and advertising businesses and has a pre-law background in broadcast news and publishing. She brings a deep understanding of the most complex and critical data privacy challenges faced by corporations and business entities, particularly within the media and advertising industries.

Presentations

Regulations and the future of data Session

From the EU to California and China, more of the world is regulating how data can be used. Andrew Burt and Brenda Leong convene leading experts on law and data science for a deep dive into ways to regulate the use of AI and advanced analytics. Come learn why these laws are being proposed, how they’ll impact data, and what the future has in store.

Alex Izydorczyk leads the data science team at Coatue, overseeing the engineering and statistical teams' process of integrating "alternative data" into the investment process. The team uses cutting-edge methods from machine learning and statistics to digest and analyze a broad universe of data points to identify market and investing trends. Alex is also involved on the private investment side, particularly on topics of cryptocurrency and data science infrastructure. He graduated from the Wharton School at the University of Pennsylvania in 2015 with a degree in statistics.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders must deliver measurable impact on an increasing share of an enterprise’s KPIs. The speakers explore how leading organizations take a holistic approach to people, process, and technology to build a sustainable advantage.

Piyush Jain is the director of data governance at Progressive. With over 10 years of experience in data governance and data management, Piyush leads the data governance office, providing strategic direction for data governance and data management capabilities. He works closely with business leaders from across Progressive to ensure data governance activities are closely aligned with business priorities and oversees the tactical implementation of the data-governance program across all lines of business. Piyush also serves as the business sponsor for the Collibra implementation, owning the solution road map to ensure business-driven use cases drive the implementation plan. Piyush has been with Progressive since 2007 and holds an MBA from Cleveland State University.

Presentations

Powering the future with data intelligence (sponsored by Collibra) Session

Transforming data into a trusted business asset that informs decision making requires giving teams access to a powerful platform that makes it easy to harness data across the enterprise. Jim Cushman and Piyush Jain detail how Progressive uses Collibra to transform the way data is managed and used across the organization, driving real business value.

Prakhar Jain is a member of the technical staff at Qubole, where he works on Spark. Prakhar holds a bachelor of computer science engineering from the Indian Institute of Technology, Bombay, India.

Presentations

Downscaling: The Achilles heel of autoscaling Spark clusters Session

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, so the overall total cost of ownership (TCO) goes up. Prakhar Jain and Sourabh Goyal examine a new design for efficient downscaling, which helps achieve better resource utilization and a lower TCO.

Jeroen Janssens is the founder, CEO, and an instructor of Data Science Workshops, which provides on-the-job training and coaching in data visualization, machine learning, and programming. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He’s the author of Data Science at the Command Line (O’Reilly). Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

Presentations

Scalable anomaly detection with Spark and SOS Session

Jeroen Janssens dives into stochastic outlier selection (SOS), an unsupervised algorithm for detecting anomalies in large, high-dimensional data. SOS has been implemented in Python, R, and, most recently, Spark. He illustrates the idea and intuition behind SOS, demonstrates the implementation of SOS on top of Spark, and applies SOS to a real-world use case.

Karan Jaswal is the co-founder and CTO of Cinchy, the global leader in enterprise data collaboration. Karan spent over 10 years at leading global financial institutions, including RBC and Citi, where he was responsible for developing and launching leading enterprise solutions. In 2019, he was invited by Canada’s Department of Finance to consult on its open banking consultations, and he’s a regular contributor to the Canadian Council of Innovators, which works with public policy leaders to optimize the growth of Canada’s innovation-based economy.

Presentations

Banking on change: Data collaboration and enterprise financial services Findata

After 40 years of apps, leading financial services providers are realizing that building or buying an application for every use case has become a major threat to their agility, profitability, and data protection. Join Karan Jaswal for an introduction to data collaboration, the radical new paradigm that's the single biggest shift in technology delivery since the 1970s.

Samuel Jenkins is a data scientist at Microsoft, where he works on interpretability for machine learning.

Presentations

Unified tooling for machine learning interpretability Session

Understanding decisions made by machine learning systems is critical for sensitive uses, ensuring fairness, and debugging production models. Interpretability presents options for trying to understand model decisions. Harsha Nori, Samuel Jenkins, and Rich Caruana explore the tools Microsoft is releasing to help you train powerful, interpretable models and interpret existing black box systems.

Clare Jeon is a data scientist at Klick, where she focuses on identifying digital biomarkers for diagnosis, risk assessment of diseases, and prevention of health problems. She also explores the applications of machine learning to optimize clinic performance. Previously, she was involved in working on the systems biology of cancer and the development of the computational pipeline to identify key genomic and clinical signatures for cancer treatment. She holds a PhD in bioinformatics and systems biology.

Presentations

Handling data gaps in time series using imputation Session

Time series forecasts depend on sensors or measurements made in the real, messy world. The sensors flake out, get turned off, disconnect, and otherwise conspire to cause missing signals: signals that may tell you what tomorrow's temperature will be or what your blood glucose level is before bed. Alfred Whitehead and Clare Jeon explore methods for handling data gaps and when to consider each.

Edward Jezierski is a PM at Microsoft bringing reinforcement learning services to market. He works across all Microsoft divisions with the goal of making the power of RL-based AI available to everyone. Previously, he was CEO and CTO at InSTEDD, a Google.org spin-off nonprofit building technologies used in disaster response, public health, and global development; helped start patterns & practices, a group at Microsoft that pioneered open source and Agile product development at Microsoft; and researched model design and robot control with deep neural networks and evolutionary systems.

Presentations

Deliver personalized experiences and content like Xbox with Cognitive Services Personalizer (sponsored by Microsoft Azure) Session

Edward Jezierski and Jackie Nichols demonstrate how Cognitive Services Personalizer works with your content and data, how it autonomously learns to make optimal decisions, how you can add it to your app with two lines of code, and what’s under the hood. Then they share the results Personalizer achieved on the Xbox One home page as well as best practices for applying it in your applications today.

RL in real life: Bringing reinforcement learning to the enterprise (sponsored by Microsoft Azure) Keynote

Microsoft has an ecosystem spanning research, gaming, and the cloud that's advancing reinforcement learning (RL) and putting it into everyday use. Join Edward Jezierski to see where RL is used practically across Microsoft and imagine the opportunities that exist for your business today.

Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Omkar has a keen interest in solving large-scale distributed systems problems. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.

Presentations

How to performance-tune Spark applications in large clusters Session

Omkar Joshi and Bo Yang offer an overview of how Uber’s ingestion (Marmaray) and observability teams improved the performance of Apache Spark applications running on thousands of cluster machines and across hundreds of thousands of applications, and how the teams methodically tackled these issues. They also cover how they used Uber’s open source jvm-profiler to debug issues at scale.

Brindaalakshmi K is a communications professional and researcher working at the intersection of identities, human rights, public health, and technology. She’s authoring the study “Gendering of Development Data in India: Beyond the Binary” for the Centre for Internet and Society, India, as part of the Big Data for Development Network, led by five global south organizations and supported by the International Development Research Centre (IDRC), Canada. She’s a youth leader with citiesRISE, a global platform working on transforming mental health policy and practice. She also provides peer support to members of the LGBTIQA+ community.

Presentations

Looking beyond the binary: How data for development impacts gender justice Session

There's no standard for the collection of gender data. Brindaalakshmi K examines the implications of this gap in the context of a developing country like India, the exclusion of individuals beyond the binary genders of male and female, and how this exclusion permeates beyond the public sector into private sector services.

Swasti Kakker is a software development engineer on the data team at LinkedIn. She’s passionate about improving developer productivity by designing and implementing scalable platforms. In her two-year tenure at LinkedIn, she’s worked on the design and implementation of hosted Jupyter notebooks, collaborating closely with stakeholders to understand the expectations and requirements of a platform that would improve developer productivity. Previously, she worked with the Spark team on making the Spark History Server more scalable so it could handle traffic from Dr. Elephant, and she contributed Spark heuristics to Dr. Elephant based on the needs of stakeholders (mainly Spark developers), which gave her deep knowledge of Spark infrastructure, Spark parameters, and how to tune them efficiently.

Presentations

A productive data science platform: Beyond a hosted-notebooks solution at LinkedIn Session

Join Swasti Kakker, Manu Ram Pandit, and Vidya Ravivarma to explore LinkedIn's flexible, scalable hosted data science platform. It lets developers work seamlessly in multiple languages; enforces developer best practices and governance policies; and supports execution, visualization, efficient knowledge management, and collaboration to improve developer productivity.

Shannon Kalisky is the lead product manager for analytics and data science at Esri, where she works with development and engineering teams to bring spatial data science mainstream. She started her career in geographic information systems (GIS), working for a variety of organizations ranging from government to Fortune 500 companies and leveraging spatial data to uncover patterns and build predictive models with a combination of GIS and Python. Her undergraduate studies were in geography and her graduate education was in community and regional planning. She’s pursuing her MBA in global business. When she’s not working behind a computer, you’re most likely to find Shannon with her hands dirty in a garden or at the local hardware store gathering supplies for her next project.

Presentations

See what others can’t with spatial analysis and data science (sponsored by Esri) Session

Digital location data is a crucial part of data science. The "where" matters as much to an analysis as the "what" and the "why." Shannon Kalisky and Alberto Nieto explore tools that help you apply a range of geospatial techniques in your data science workflows to get deeper insights.

Victoriya Kalmanovich is an R&D group lead at a large maritime corporation in Israel. She specializes in healing work environments by treating them as startup companies, and she promotes and leads innovative, broad processes throughout the organization. Day to day, she handles her group’s technology issues, product management, budgets, and client relationships. She’s an education enthusiast and often uses educational directives as part of her management strategies, especially in guiding group members and leadership. She’s also a firm believer in deploying data science wherever there’s great value in data. She’s organized a successful data science hackathon and is forming a data science community within her organization. She also gives talks about management, leadership, and workplace challenges.

Presentations

Driving adoption of data DCS

Often, the difference between a successful data initiative and a failed one isn't the data or the technology but rather its adoption by the wider business. With every business wanting the magic of data but many failing to properly embrace and harness it, the panelists explore the factors they've seen lead to successes and failures in getting companies to use data products.

Predictive maintenance: How does data science revolutionize the world of machines? DCS

Predictive maintenance predicts the future of machines, using data science to establish the machine’s unique lifecycle and increase efficiency. Using a maritime case study, Victoriya Kalmanovich explains why, in a world full of machines, we need to be the bridge connecting the methods of the past to the opportunities of the future.

Supun Kamburugamuve is a graduate student at Indiana University and a senior software architect at the Digital Science Center of Indiana University, where he researches big data applications and frameworks. He’s working on high-performance enhancements to big data systems with HPC interconnects such as InfiniBand and Omni-Path. Supun is an elected member of the Apache Software Foundation and has contributed to many open source projects, including Apache Web Services projects. Previously, Supun worked on middleware systems and was a key member of the WSO2 enterprise service bus (ESB) team, an open source enterprise integration product widely used by enterprises. He has a PhD in computer science from Indiana University, specializing in high-performance data analytics.

Presentations

Bridging the gap between big data computing and high-performance computing Session

Big data computing and high-performance computing (HPC) evolved over the years as separate paradigms. With the explosion of the data and the demand for machine learning algorithms, these two paradigms increasingly embrace each other for data management and algorithms. Supun Kamburugamuve explores the possibilities and tools available for getting the best of HPC and big data.

Linhong Kang is a manager and staff data scientist at Walmart Labs, where she’s the lead of multiple fraud and abuse detection solutions for Walmart’s various products. She has more than 10 years of experience in data science, business analytics, and risk and fraud management across different industries including business consulting, banks, financial payment, and ecommerce. She’s passionate about translating business problems into qualitative questions, delivering cost savings and helping companies to become more profitable.

Presentations

Machine learning and large-scale data analysis on a centralized platform Session

James Tang, Yiyi Zeng, and Linhong Kang outline how Walmart provides a secure and seamless shopping experience through machine learning and large-scale data analysis on a centralized platform.

Amit Kapoor is a data storyteller at narrativeViz, where he uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Interested in learning and teaching the craft of telling visual stories with data, Amit also teaches storytelling with data for executive courses as a guest faculty member at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. Previously, he gained more than 12 years of management consulting experience with A.T. Kearney in India, Booz & Company in Europe, and startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi, and a PGDM (MBA) from IIM, Ahmedabad.

Presentations

Recommendation systems using deep learning 2-Day Training

Recommendation systems play a significant role—for users, a new world of options; for companies, it drives engagement and satisfaction. Amit Kapoor and Bargava Subramanian walk you through the different paradigms of recommendation systems and introduce you to deep learning-based approaches. You'll gain the practical hands-on knowledge to build, select, deploy, and maintain a recommendation system.

Recommendation systems using deep learning (Day 2) Training Day 2

Recommendation systems play a significant role—for users, a new world of options; for companies, it drives engagement and satisfaction. Amit Kapoor and Bargava Subramanian walk you through the different paradigms of recommendation systems and introduce you to deep learning-based approaches. You'll gain the practical hands-on knowledge to build, select, deploy, and maintain a recommendation system.

Meher Kasam is an iOS software engineer at Square and is a seasoned software developer with apps used by tens of millions of users every day. He’s shipped features for a range of apps from Square’s point of sale to the Bing app. Previously, he worked at Microsoft, where he was the mobile development lead for the Seeing AI app, which has received widespread recognition and awards from Mobile World Congress, CES, FCC, and the American Council of the Blind, to name a few. A hacker at heart with a flair for fast prototyping, he’s won close to two dozen hackathons and converted them to features shipped in widely used products. He also serves as a judge of international competitions including the Global Mobile Awards and the Edison Awards.

Presentations

Deep learning on mobile Session

Over the last few years, convolutional neural networks (CNNs) have risen in popularity, especially in the area of computer vision. Anirudh Koul and Meher Kasam take you through how you can get deep neural nets to run efficiently on mobile devices.

Arun Kejariwal is an independent lead engineer. Previously, he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers working on novel techniques for install-and-click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns; his team also built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Before that, he developed and open-sourced techniques for anomaly detection and breakout detection at Twitter. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems and examine the inception and growth of the serverless paradigm. You'll take a deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and get a bird’s-eye view of the application domains where you can leverage Pulsar functions.

Brian Keng is the chief data scientist at Rubikloud, where he leads a team building out intelligent enterprise solutions for some of the world’s largest retail organizations. Brian is a big fan of Bayesian statistics, but his main professional focus is building out scalable machine learning systems that seamlessly integrate into traditional software solutions. Previously, Brian was at Sysomos, leading a team of data scientists performing large-scale social media analytics, working with datasets such as the Twitter firehose. He earned his PhD in computer engineering from the University of Toronto, during which time he was an early employee of a startup that commercialized some of his research.

Presentations

ML is not enough: Decision automation in the real world Session

Automating decisions requires a system to consider more than just a data-driven prediction. Real-world decisions require additional constraints and fuzzy objectives to ensure they're robust and consistent with business goals. Brian Keng takes a deep dive into how to leverage modern machine learning methods and traditional mathematical optimization techniques for decision automation.

Anurag Khandelwal is a PhD candidate at the RISELab, UC Berkeley, advised by Professor Ion Stoica, and will be joining Yale as an assistant professor in the spring of 2020. His research interests span distributed systems, networking, and algorithms. In particular, his research focuses on addressing core challenges in distributed systems through novel algorithm and data structure design. During his PhD, Anurag built large-scale data-intensive systems such as Succinct and Confluo that led to deployments in several production clusters. Anurag received his bachelor’s degree in computer science from the Indian Institute of Technology Kharagpur in 2013.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems and examine the inception and growth of the serverless paradigm. You'll take a deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and get a bird’s-eye view of the application domains where you can leverage Pulsar functions.

Matt Kirk runs Your Chief Scientist, a firm devoted to training small cohorts of highly motivated engineers to become data science practitioners. He draws on his experience writing Thoughtful Machine Learning with Python (O'Reilly) as well as on clients such as Clickfunnels, Garver, SheerID, SupaDupa, and Madrona Ventures. To find out more, check out https://yourchiefscientist.com/.

Presentations

Machine learning for the enterprise (sponsored by IBM) Training

Note: This free workshop, courtesy of IBM, is open to the first 50 registrants. You'll take a fascinating deep dive into the power and applications of machine learning in the enterprise.

Madhu Kochar is the vice president of product development in the data and AI business unit at IBM. She leads a large portfolio of products, such as IBM Cloud Pak for Data and the Unified Governance and Integration segment. Over the last few years, Madhu has distinguished herself by successfully establishing a world-class service delivery and engineering team that transformed IBM's analytics business on the IBM public and private cloud. She has extensive experience in software development, DevOps, DataOps, hybrid cloud, machine learning, and AI. In 2012, Madhu was named one of the outstanding executive women from Silicon Valley and a recipient of the prestigious Tribute to Women (TWIN) award, a recognition of her role as a woman executive and an inspiration for other women in technology fields. Madhu has also represented IBM on science, technology, engineering, and mathematics (STEM) summit panels and is on the IBM corporate Asian Council. She also leads the local IBM Women Charter, whose goal is to grow the next generation of leaders. Madhu is based in San Jose, California.

Presentations

The future? Data, AI, and multicloud: It’s time to modernize (sponsored by IBM) Session

An economic revolution is underway, driven by advancements in AI and multicloud technologies. Businesses are crafting strategic plans to modernize their data architecture for this emerging reality, and at the top of their wish list is the ability to virtualize all their data regardless of where it lives. Madhu Kochar explores the data advancements on the horizon.

Jari Koister is the vice president of product and technology at FICO. He also teaches in the Data Science Program at UC Berkeley. Previously, he has, among other things, led the development of Chatter, Salesforce’s social enterprise application and platform; founded and served as CTO at Groupswim.com, an early social enterprise collaboration company (acquired by Salesforce); founded and served as CSO and CTO at Qrodo.com, an elastic platform for broadcasting sports events live on the internet; led the development of CommerceOne’s flagship product MarketSite; and led research in computer languages and distributed computing at Ericsson Labs and Hewlett-Packard Laboratories. Jari holds a PhD in computer science from the Royal Institute of Technology, Stockholm, Sweden.

Presentations

How machine learning meets optimization Session

Machine learning and constraint-based optimization are both used to solve critical business problems. They come from distinct research communities and have traditionally been treated separately. But Jari Koister examines how they're similar, how they're different, and how they can be used to solve complex problems with amazing results.

Naren Koneru is an engineering manager at Cloudera, where he leads the Navigator development team. Previously, Naren was at Miti, building enterprise-wide metadata and governance solutions. Before joining Miti, Naren spent over seven years with the platform team at Informatica and was instrumental in making PowerCenter the leading data integration platform. He has a master’s in computer science from East Tennessee State University and a bachelor’s from Osmania University.

Presentations

Running multidisciplinary big data workloads in the cloud with CDP Tutorial

Organizations now run diverse, multidisciplinary, big data workloads that span data engineering, data warehousing, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long running in nature. There are many challenges with moving these workloads to the cloud. In this talk we start off with a technical deep...

Stavros Kontopoulos is a senior engineer on the data systems team at Lightbend, where he implements Lightbend’s fast data strategy. Previously, he built software solutions that scale in different verticals like telecoms and marketing. His interests include distributed systems design, streaming technologies, and NoSQL databases.

Presentations

Online machine learning in streaming applications Session

Stavros Kontopoulos and Debasish Ghosh explore online machine learning algorithm choices for streaming applications, especially those with resource-constrained use cases like IoT and personalization. They dive into Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms from implementation to production deployment, describing the pros and cons of each of them.

James Kotecki is the director of marketing and communications for Infinia ML, a machine learning company helping enterprise firms deploy data science to automate complex challenges.

James is the former Head of Communications for Automated Insights, where he helped an international audience understand the concepts of natural language generation. He later started his own firm to help technology marketers tell customer stories.

As a professional content creator for over a decade, he’s worked with brands including YouTube, Politico, Business Insider, LinkedIn, and DHL. He’s spoken about customer storytelling from North Carolina to South Korea, and you can see his on-camera interviews with leading executives every year at the CES electronics trade show.

Presentations

Communication breakdown: Facing machine learning’s all-too-human failure Session

Miscommunication between business leaders and technical experts can doom even the best data science project. Don’t let it drive you insane! James Kotecki dissects many flavors of communication failure, from goal misalignment to technical misunderstanding, then explores practical ways to bridge these gaps.

Anirudh Koul is a noted AI expert, an O’Reilly author, and a former scientist at Microsoft AI & Research, where he founded Seeing AI, the most used technology among the blind community after the iPhone. Anirudh serves as the head of AI and research at Aira, noted by Time magazine as one of the best inventions of 2018. He’s also the author of the upcoming Practical Deep Learning for Cloud & Mobile. With features shipped to a billion users, he brings over a decade of production-oriented applied research experience on petabyte-scale datasets. He has been developing technologies using AI techniques for augmented reality, robotics, speech, productivity, and accessibility. Some of his recent work, which IEEE has called “life-changing,” has been honored by CES, FCC, Cannes Lions, and the American Council of the Blind; showcased at events by the UN, White House, House of Lords, World Economic Forum, Netflix, and National Geographic; and applauded by world leaders including Justin Trudeau and Theresa May.

Presentations

Deep learning on mobile Session

Over the last few years, convolutional neural networks (CNNs) have risen in popularity, especially in the area of computer vision. Anirudh Koul and Meher Kasam take you through how you can get deep neural nets to run efficiently on mobile devices.

Cassie Kozyrkov is Google Cloud’s chief decision scientist. Cassie is passionate about helping everyone make better decisions through harnessing the beauty and power of data. She speaks at conferences and meets with leadership teams to empower decision makers to transform their industries through AI, machine learning, and analytics. At Google, Cassie has advised more than a hundred teams on statistics and machine learning, working most closely with research and machine intelligence, Google Maps, and ads and commerce. She has also personally trained more than 15,000 Googlers (executives, engineers, scientists, and even nontechnical staff members) in machine learning, statistics, and data-driven decision making. Previously, Cassie spent a decade working as a data scientist and consultant. She’s a leading expert in decision science, with undergraduate studies in statistics and economics at the University of Chicago and graduate studies in statistics, neuroscience, and psychology at Duke University and NCSU. When she’s not working, you’re most likely to find Cassie at the theatre, in an art museum, exploring the world, playing board games, or curled up with a good novel.

Presentations

Staying safe in the AI era Keynote

Machine learning and artificial intelligence are no longer science fiction, so now you have to address what it takes to harness their potential effectively, responsibly, and reliably. Based on lessons learned at Google, Cassie Kozyrkov offers actionable advice to help you find opportunities to take advantage of machine learning, navigate the AI era, and stay safe as you innovate.

Jakov Kucan is a senior architect at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Previously, Jakov was chief architect at Kyruus and director of product strategy at PTC. He’s a skilled architect and engineer, able to see through the details of implementations, keep track of the dependencies within a large design, and communicate the vision and ideas to both technical and nontechnical audiences. Jakov earned his PhD in computer science from MIT and his MA degree in mathematics and BSE degree in computer science and engineering from the University of Pennsylvania. He’s an author of several publications and patent applications.

Presentations

Efficient ML engineering: Tools and best practices Tutorial

Sourav Dey and Jakov Kucan walk you through the six steps of the Lean AI process and explain how it helps your ML engineers work as an integrated part of your development and production teams. You'll get a hands-on example using real-world data, so you can get up and running with Docker and Orbyter and see firsthand how streamlined they can make your workflow.

Purnima Kuchikulla is a solution engineer at Cloudera, where she works with customers on their cloud and big data strategies, and a big data evangelist with 15 years of experience in the industry. Previously, she was at IBM and ADP.

Presentations

Cloudera Edge Management in the IoT Tutorial

There are too many edge devices and agents, and you need to control and manage them. Purnima Reddy Kuchikulla, Timothy Spann, Abdelkrim Hadjidj, and Andre Araujo walk you through handling the difficulty in collecting real-time data and the trouble with updating a specific set of agents with edge applications. Get your hands dirty with CEM, which addresses these challenges with ease.

Kafka and Streams Messaging Manager (SMM) crash course Tutorial

Kafka is omnipresent and the backbone of streaming analytics applications and data lakes. The challenge is understanding what's going on overall in the Kafka cluster, including performance, issues, and message flows. Purnima Reddy Kuchikulla and Dan Chaffelson walk you through a hands-on experience to visualize the entire Kafka environment end-to-end and simplify Kafka operations via SMM.

Karthik Kulkarni is the lead big data solutions and infrastructure architect in the UCS Data Center Solutions Group at Cisco. His main focus is on infrastructure and how it enables digital transformation using emerging trends in big data, AI and ML solutions, and architecture in the data center.

Presentations

Operationalizing AI and ML with Cisco Data Intelligence Platform (sponsored by Cisco) Session

Artificial intelligence and machine learning are well beyond the laboratory exploratory stage of deployment. In fact, the speed of AI and ML deployment has a huge impact on an organization’s financial income. Chiang Yang and Karthik Kulkarni explore how the Cisco Data Intelligence Platform can help bridge the gap between AI and ML and big data.

Ben Lackey leads the Marketplace Partnerships team at Oracle Cloud Infrastructure. The team works with industry-leading ISVs, including Cloudera, Confluent, and DataStax, to build Marketplace listings and Quick Start templates that automate deployment on OCI using Terraform and Packer. Prior to joining Oracle, Ben worked in cloud partnerships at Couchbase and DataStax. His experience in enterprise software includes roles at TIBCO, Terracotta, and Software AG.

Presentations

AI/ML on Oracle Cloud with Kinetica and H2O.ai (sponsored by Oracle Cloud Infrastructure) Session

Learn about running AI/ML solutions like H2O.ai and Kinetica on Oracle Cloud. The session includes a live demo of Terraform, Oracle Cloud Infrastructure, GPUs, and Oracle Marketplace, along with a discussion of other leading data and AI products, including Cloudera, DataStax, and Confluent.

Olga Lagunova is the chief data and analytics officer for Pitney Bowes, where she enables digital transformation of PB businesses through data engineering and data science. Olga drives the strategy and delivery for PB’s platforms and initiatives across client and user experience, design, big data, analytics, the IoT, SaaS, the cloud, and APIs. Olga has more than 25 years of experience. Previously, she was SVP and distinguished engineer at CA Technologies, where she led global engineering teams with a focus on SaaS, analytics, and mobile application development. Her greatest professional asset is her ability to initiate and deliver new, innovative products while building high‐performing development teams that feel valued and integral to the company’s success.

Presentations

Mastercard and Pitney Bowes: Creating a data-driven business (sponsored by Pitney Bowes) Session

Mastercard and Pitney Bowes have overcome many challenges on their journey to accelerate innovation, achieve efficiencies, and improve the overall customer experience. Olga Lagunova and John Derrico share lessons learned as the data strategy evolved and highlight pitfalls and solutions from data science projects across several industries, from finance to cross-border shipping logistics.

See-Kit Lam is a senior software engineer at Malwarebytes. As one of the company's first employees, he sat side by side with CEO and founder Marcin Kleczynski, helping to protect its earliest customers. His deep expertise in software quality elevated him to director of QA at Malwarebytes, where he led a large team of quality engineers around the world. After many years in that role, See-Kit decided to try his hand at development and quickly became a standout on the team. He pioneered the containerized detection workloads that are now the backbone of the company's flagship product. See-Kit holds a master's in computer engineering from Mississippi State University.

Presentations

Running AI workloads in containers (sponsored by BMC Software)

Developing, deploying and managing AI and anomaly detection models is tough business. See-Kit Lam details how Malwarebytes has leveraged containerization, scheduling, and orchestration to build a behavioral detection platform and a pipeline to bring models from concept to production.

Mars Lan is a staff software engineer at LinkedIn, where he's been leading the team to design and implement LinkedIn's metadata infrastructure for the past two years. Previously, he worked on Google Assistant and Google Cloud products at Google. Mars earned his PhD in computer science from UCLA.

Presentations

The evolution of metadata: LinkedIn’s story Session

Imagine scaling metadata to an organization of 10,000 employees, 1M+ data assets, and an AI-enabled company that ships code to the site three times a day. Shirshanka Das and Mars Lan dive into LinkedIn’s metadata journey from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. They reveal metadata strategies and the battle scars.

Chon Yong Lee is a project manager at SK Telecom, where he's designing 5G Infra Visualization, the company's telco network visualization system. He has 10 years of experience in telecom networks. He's demonstrated the 5G Infra Visualization system twice at Mobile World Congress, in 2016 and 2018.

Presentations

SK Telecom's 5G network monitoring and 3D visualization on streaming technologies Session

Jonghyok Lee and Chon Yong Lee discuss T-CORE, SK Telecom’s monitoring and service analytics platform, which collects system and application data from several thousand servers and applications and provides a 3D visualization of the real-time status of the whole network. Join in to hear lessons learned during development.

Jonghyok Lee is an architect at SK Telecom, where he’s designing T-CORE, the company’s monitoring and analytics platform. He has more than 20 years of experience in data processing. Previously, he was a senior architect at IBM and led the architecture design for several enterprise-wide data processing systems at many companies in various industries.

Presentations

SK Telecom's 5G network monitoring and 3D visualization on streaming technologies Session

Jonghyok Lee and Chon Yong Lee discuss T-CORE, SK Telecom’s monitoring and service analytics platform, which collects system and application data from several thousand servers and applications and provides a 3D visualization of the real-time status of the whole network. Join in to hear lessons learned during development.

David Leichner is the chief marketing officer of SQream, where he’s responsible for creating and executing its marketing strategy and managing the global marketing team that forms the foundation of SQream’s product market penetration. He has over 25 years of marketing and sales executive management experience garnered from leading software vendors, including Information Builders, Magic Software, and BluePhoenix Solutions.

Presentations

The future of Hadoop in an era of exponentially growing data (sponsored by SQream) Session

What started as an asset for data scientists and BI professionals has become a poorly performing problem. David Leichner explores the Hadoop ecosystem and relational databases from an analytics perspective—reviewing the current landscape, what Hadoop was designed for, and how a Hadoop-based infrastructure can be improved to support a new era of exponentially growing data.

Brenda Leong is a senior counsel and director of strategy at the Future of Privacy Forum (FPF) and a Certified Information Privacy Professional/United States (CIPP/US). She oversees strategic planning of organizational goals, as well as managing the FPF portfolio on biometrics, particularly facial recognition, along with the ethics and privacy issues associated with artificial intelligence. She works on industry standards and collaboration on privacy concerns, by partnering with stakeholders and advocates to reach practical solutions to the privacy challenges for consumer and commercial data uses. Previously, Brenda served in the US Air Force, including policy and legislative affairs work from the Pentagon and the US Department of State. She’s a graduate of the George Mason University School of Law.

Presentations

Regulations and the future of data Session

From the EU to California and China, more of the world is regulating how data can be used. Andrew Burt and Brenda Leong convene leading experts on law and data science for a deep dive into ways to regulate the use of AI and advanced analytics. Come learn why these laws are being proposed, how they’ll impact data, and what the future has in store.

War stories from the front lines of ML Session

Machine learning techniques are being deployed across almost every industry and sector. But this adoption comes with real, and oftentimes underestimated, privacy and security risks. Andrew Burt and Brenda Leong convene a panel of experts including David Florsek, Chris Wheeler, and Alex Beutel to detail real-life examples of when ML goes wrong, and the lessons they learned.

Tomer Levi is a senior data engineer on the DataOps team at Fundbox, where he helps shape the data platform architecture to drive business goals. Previously, he was a data engineer at Intel’s advanced analytics group, helping to build out the data platform supporting the data storage and analysis needs of Intel Pharma Analytics Platform, an edge-to-cloud artificial intelligence solution for remote monitoring of patients during clinical trials. He’s incredibly passionate about the power of data. Tomer holds a BSc in software engineering.

Presentations

Orchestrating data workflows using a fully serverless architecture Session

Use of data workflows is a fundamental functionality of any data engineering team. Nonetheless, designing an easy-to-use, scalable, and flexible data workflow platform is a complex undertaking. Tomer Levi walks you through how the data engineering team at Fundbox uses AWS serverless technologies to address this problem and how it enables data scientists, BI devs, and engineers to move faster.

Dong Li is a founding member and head of product at Kyligence, leading strategy, delivery, and growth of the enterprise product portfolio based on Apache Kylin. He’s also an Apache Kylin committer and PMC member. Previously, he was a senior software engineer in the Analytics Data Infrastructure Department at eBay and a software engineer in the Cloud and Enterprise Department at Microsoft.

Presentations

Take the bias out of big data insights with augmented analytics (sponsored by Kyligence) Session

Your analytics are biased. Efforts to extract meaning by manually scrubbing, indexing, and parsing big data are limited by time, cost, and human assumptions. Dong Li and Hongbin Ma offer an overview of augmented analytics. It takes OLAP into the future with AI, ensuring objective and unique insights that cover all relevant scenarios found in petabytes of multidimensional and variable data.

Luyao Li is a technical lead manager on the data platform team at Uber, where he manages the data lineage team, which builds systems including end-to-end data flow tracking, latency tracking, and cost attribution and pricing. Previously he built multiple systems spanning from service discovery, configuration management, and ad campaign results tracking and reporting as a software engineer at Electronic Arts. He holds a master’s degree from Duke University.

Presentations

Turning big data into knowledge: Managing metadata and data relationships at Uber's scale Session

Uber takes data driven to the next level. A robust system for discovering and managing various entities, from datasets to services to pipelines, and their relevant metadata isn't just nice to have; it's absolutely integral to making data useful. Kaan Onuk, Luyao Li, and Atul Gupte explore the current state of metadata management, end-to-end data flow solutions at Uber, and what’s coming next.

Tianhui Michael Li is the founder and president of the Data Incubator, a data science training and placement firm. Michael bootstrapped the company and navigated it to a successful sale to the Pragmatic Institute. Previously, he headed monetization data science at Foursquare and has worked at Google, Andreessen Horowitz, J.P. Morgan, and D.E. Shaw. He’s a regular contributor to the Wall Street Journal, TechCrunch, Wired, Fast Company, Harvard Business Review, MIT Sloan Management Review, Entrepreneur, VentureBeat, TechTarget, and O’Reilly. Michael was a postdoc at Cornell Tech, a PhD at Princeton, and a Marshall Scholar in Cambridge.

Presentations

SOLD OUT: Big data for managers 2-Day Training

Michael Li and Gonzalo Diaz provide a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

SOLD OUT: Big data for managers (Day 2) Training Day 2

Michael Li and Gonzalo Diaz provide a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and use their input and analysis for your business’s strategic priorities and decision making.

Qiyang Li is a program manager at Microsoft who has worked on big data, machine learning, and AI-powered products for years. His recent focus is on time series anomaly detection and prediction to empower scenarios like AIOps-related operation metrics detection, business metrics detection, and prediction.

Presentations

Introducing a new anomaly detection algorithm (SR-CNN) inspired by computer vision Session

Anomaly detection may sound old fashioned, yet it's super important in many industry applications. Tony Xing, Congrui Huang, Qiyang Li, and Wenyi Yang detail a novel anomaly-detection algorithm based on spectral residual (SR) and convolutional neural network (CNN) and how this method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.

Audrey Lobo-Pulo is the founder of Phoensight and has a passion for using emerging data technologies to empower individuals, governments, and organizations in creating a better society. Audrey has over 10 years’ experience working with the Australian Treasury in public policy areas including personal taxation, housing, social policy, labor markets, and population demographics. She’s an open government advocate and has a passion for open data and open models. She pioneered the concept of “government open source models,” which are government policy models open to the public to use, modify, and distribute freely. Audrey’s deeply interested in how technology enables citizens to actively participate and engage with their governments in cocreating public policy. She holds a PhD in physics and a master’s in economic policy.

Presentations

Purposefully designing technology for civic engagement Session

As new digital platforms emerge and governments look at new ways to engage with citizens, there's an increasing awareness of the role these platforms play in shaping public participation and democracy. Audrey Lobo-Pulo, Annette Hester, and Ryan Hum examine the design attributes of civic engagement technologies and their ensuing impacts, drawing on an NEB Canada case study.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience. He enjoys intelligent design and engaging storytelling and is passionate about data, music, and nature.

Presentations

SOLD OUT: Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

SOLD OUT: Building a serverless big data application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Ben Lorica is the chief data scientist at O’Reilly. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Closing remarks Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll offer closing remarks.

Recent trends in data and machine learning technologies Keynote

Ben Lorica dives into emerging technologies for building data infrastructures and machine learning platforms.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Viridiana Lourdes is a data scientist at Ayasdi with over 15 years of experience applying advanced statistical methods to challenging problems and implementing solutions using the latest technology. She’s developed statistical models for big data based on the integration of quantitative research with advanced scientific computation and collaborative interdisciplinary applications in many fields, and she has practical experience in econometrics, portfolio management, asset allocation, portfolio construction, optimization, and risk management. Her computer programming experience includes professional-level applications with Python and the statistical software R. She earned her PhD and MS in statistics and decision sciences from Duke University and an MA in finance and a BA in actuarial sciences from ITAM. Her quantitative skills include Bayesian dynamic linear models, econometric models, time series and forecasting, predictive modeling, nonlinear models, generalized linear models, multivariate analysis, multifactor models, and classification trees.

Presentations

Assumed risk versus actual risk: The new world of behavior-based risk modeling Findata

Viridiana Lourdes explains how banks and financial enterprises can adopt and integrate actual risk models with existing systems to enhance the performance and operational efficiency of the financial crimes organization. Join in to learn how actual risk models can reduce segmentation noise, utilize unlabeled transactional data, and spot unusual behavior more effectively.

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Previously, he was responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural road maps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He’s also a cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on machine learning with Kafka-based streaming pipelines Tutorial

Boris Lublinsky and Dean Wampler examine ML use in streaming data pipelines, how to do periodic model retraining, and low-latency scoring in live streams. Learn about Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, metadata tracking, and more.

Patrick Lucey is the chief scientist of artificial intelligence at Stats Perform, where his goal is to maximize the value of the company's 35+ years' worth of sports data. His main research interests are in artificial intelligence and interactive machine learning in sporting domains. Previously, Patrick spent five years at Disney Research, where he conducted research into automatic sports broadcasting using large amounts of spatiotemporal tracking data, and was a postdoctoral researcher at the Robotics Institute at Carnegie Mellon University and the Department of Psychology at the University of Pittsburgh, conducting research on automatic facial expression recognition. He holds a BEng (EE) from the University of Southern Queensland and a PhD from QUT, Australia. He was a coauthor of the best paper at the 2016 MIT Sloan Sports Analytics Conference and coauthored the best paper runner-up at the same conference in 2017 and 2018. Patrick has also won best paper awards at INTERSPEECH (2007) and the Winter Conference on Applications of Computer Vision (WACV; 2014).

Presentations

Interactive sports analytics Keynote

Imagine watching sports and being able to immediately find all plays that are similar to what just happened. Better still, imagine being able to draw a play with the Xs and Os on an interface like a coach draws on a chalkboard, instantaneously find all the similar plays, and conduct analytics on those plays. Join Patrick Lucey to see how this is possible.

Brian Lynch leads the strategic analytics team for the Personal Savings and Investing Division at TD Bank. Brian holds an undergraduate degree in applied econometrics from Western University and a master’s degree in management analytics from Queen’s University.

Presentations

Creating a data-driven team culture Findata

As data has become heavily ingrained in corporate strategy, businesses have been challenged to build and retain teams with data science capabilities. Brian Lynch walks you through some of the materials he's used to grow, build, and maintain data-driven teams with diverse skill sets ranging from reporting and insights to data sciences.

Hongbin Ma is a founding member and vice president at Kyligence, leading the technical innovation of Kyligence’s enterprise data platform, based on Apache Kylin. He’s also an Apache Kylin committer and PMC member and ranked as the number-one code committer for a few years. Previously, he was a key contributor to Microsoft Research Asia’s graph database, Trinity. He’s most interested in query optimization, distributed database, and parallel computing.

Presentations

Take the bias out of big data insights with augmented analytics (sponsored by Kyligence) Session

Your analytics are biased. Efforts to extract meaning by manually scrubbing, indexing, and parsing big data are limited by time, cost, and human assumptions. Dong Li and Hongbin Ma offer an overview of augmented analytics. It takes OLAP into the future with AI, ensuring objective and unique insights that cover all relevant scenarios found in petabytes of multidimensional and variable data.

Matt Maccaux is the global field chief technology officer at BlueData, where he focuses on helping enterprise organizations define and implement their enterprise-wide initiatives for AI, big data, and analytics and works closely with executives at enterprise customers to develop their road maps and strategies for data-driven digital transformation using AI, ML, and advanced analytics. He helps these customers accelerate their time to market with AI, ML, and analytics through an enterprise-wide platform that provides those capabilities as a service to their data science teams. He has worked with leading enterprises across many industries for the past twenty years in a variety of roles at some of the biggest technology companies in the world.

Presentations

How to deploy large-scale distributed data analytics and machine learning on containers (sponsored by HPE (BlueData)) Session

Anant Chintamaneni and Matt Maccaux explore whether the combination of containers with large-scale distributed data analytics and machine learning applications is like combining oil and water, or like peanut butter and chocolate.

Gloria Macia is a data scientist at the Diagnostics Data Science Lab at Roche AG, where she works to bring data-driven solutions to the market. Previously, she was part of the quality and clinical affairs management team at Sonova AG. Her interests include AI, translational science, medical device development, and healthcare regulation. Gloria came to Switzerland in 2015 for a research stay at the Institute of Bioengineering at EPFL and later decided to pursue an MSc in biomedical engineering-bioelectronics at ETH Zurich. Her computer science skills have been awarded several prizes and scholarships from renowned companies such as Intel, Toptal, and Palantir.

Presentations

AI and health: Achieving regulatory compliance DCS

Healthcare is emerging as a prominent area for AI applications, but innovators aiming to seize this chance face one major issue: achieving regulatory compliance. Using a real industry case study, Gloria Macia walks you through the current American and European regulatory frameworks for medical devices and provides a step-by-step guide to market for AI applications.

David Mack is a founder and machine learning engineer at Octavian.ai, exploring new approaches to machine learning on graphs. Previously, he cofounded SketchDeck, a Y Combinator-backed startup providing design as a service. He holds an MSci in mathematics and the foundations of computer science from the University of Oxford and a BA in computer science from the University of Cambridge.

Presentations

An introduction to machine learning on graphs Session

Graphs are a powerful way to represent knowledge. Organizations, in fields such as biosciences and finance, are starting to amass large knowledge graphs, but they lack the machine learning tools to extract insights from them. David Mack offers an overview of what insights are possible and surveys the most popular approaches.

Anand Madhavan is the vice president of engineering at Narvar. Previously, he was head of engineering for the Discover product at Snapchat and director of engineering at Twitter, where he worked on building out the ad serving system for Twitter Ads. He has an MS in computer science from Stanford University.

Presentations

Posttransaction processing using Apache Pulsar at Narvar Session

Narvar provides a next-generation posttransaction experience for over 500 retailers. Karthik Ramasamy and Anand Madhavan take you on the journey of how Narvar moved away from using a slew of technologies for its platform and consolidated its use cases using Apache Pulsar.

Mark Madsen is a fellow at Teradata, where he’s responsible for understanding, forecasting, and defining the analytics ecosystem and architecture. Previously, he was CEO of Third Nature, where he advised companies on data strategy and technology planning and vendors on product management. Mark has designed analysis, machine learning, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Krishna Maheshwari is the director of product management at Cloudera and is responsible for operational databases (HBase, Phoenix, Kudu, and Accumulo). You can find him on LinkedIn.

Presentations

Foundations for successful data projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Ted Malaska and Jonathan Seidman detail guidelines and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

HBase 2.0 and beyond Session

Apache HBase 2.0 comes packed with new functionality: off-heap read paths, a multitier bucket cache, a new finite state machine-based assignment manager, and more. Krishna Maheshwari provides an overview of the major features and enhancements in the HBase 2.0 release, upcoming releases, and the future of HBase. You'll be able to ask him questions at the end.

Deepak Majeti is a systems software engineer at Vertica. He’s also an active contributor to Hadoop’s two most popular file formats: ORC and Parquet. His interests lie in getting the best from high-performance computing (HPC) and big data by building scalable, high-performance, and energy-efficient data analytics tools for modern computer architectures. Deepak holds a PhD in the HPC domain from Rice University.

Presentations

Kubernetes for stateful MPP systems Session

GoodData needed to autorecover from node failures and scale rapidly when workloads spiked on their MPP database in the cloud. Kubernetes could solve it, but it's for stateless microservices, not a stateful MPP database that needs hundreds of containers. Paige Roberts and Deepak Majeti detail the hurdles GoodData needed to overcome in order to merge the power of the database with Kubernetes.

Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache Yarn, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Foundations for successful data projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Ted Malaska and Jonathan Seidman detail guidelines and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

Miguel Maldonado leads machine learning curriculum development for IBM’s Data & AI Learning practice. He has 10+ years of experience in machine learning, data science, and AI. He has worked in R&D developing data mining products and has served in leadership roles as director of data science and VP of analytics across several industries, including retail, banking, and fintech. He holds a BSc in physics from Monterrey Tech and an MS in analytics from NC State’s Institute for Advanced Analytics. In his free time, he enjoys hiking and bouldering. He’s passionate about democratizing data science and actively participates in datathons, local meetups, and events that support NGOs.

Presentations

Machine learning for the enterprise (sponsored by IBM) Training

Note: This free workshop, courtesy of IBM, is open to the first 50 registrants. You'll take a fascinating deep dive into the power and applications of machine learning in the enterprise.

James Malone is a product manager for Google Cloud Platform and manages Cloud Dataproc and Apache Beam (incubating). Previously, James worked at Disney and Amazon. He’s a big fan of open source software because it shows what’s possible when people come together to solve common problems with technology. He also loves data, amateur radio, Disneyland, photography, running, and Legos.

Presentations

Mass migration: Tales of moving on-premises Hadoop to Google Cloud (sponsored by Google Cloud) Session

James Malone takes a deep dive into how customers across the world partner with Google Cloud to reimagine big data processing and data lakes while generating incredible business value.

The future of Google Cloud data processing (sponsored by Google Cloud) Keynote

Open source has always been a core pillar of Google Cloud’s data and analytics strategy. James Malone examines how, as the community continues to set industry standards, the company continues to integrate those standards into its services so organizations around the world can unlock the value of data faster.

Chaithanya Manda is an assistant vice president at EXL, where he’s responsible for building AI-enabled solutions that bring efficiencies to various business processes. He has over 10 years of experience developing advanced analytics solutions across multiple business domains. He holds a bachelor of technology degree from the Indian Institute of Technology Guwahati.

Presentations

Improving OCR quality of documents using generative adversarial networks Session

Every NLP-based document-processing solution depends on converting scanned documents and images to machine-readable text with an OCR solution, and that conversion is limited by the quality of the scanned images. Nagendra Shishodia, Chaithanya Manda, and Solmaz Torabi explore how GANs can bring significant efficiencies to any document-processing solution by enhancing resolution and denoising scanned images.

Rochelle March is a sustainability expert, quantifier of environmental and social impact, and developer of financial products designed to improve the ESG (environmental, social, governance) performance of companies. She’s a senior analyst at Trucost, part of S&P Global, where she manages a portfolio of clients and products, including the Trucost SDG Evaluation product.

Presentations

How S&P’s Trucost empowered analysts with modern, interactive data reporting tools Findata

Trucost's Rochelle March is migrating a product that quantitatively measures company performance on UN Sustainable Development Goals from Excel to Python. Her team cut multiday workflows to a few hours, delivering rich 27-page interactive reports. Join in to learn modern techniques to design, build, and deploy a data visualization and reporting framework in your organization.

Tim McKenzie is general manager of big data solutions at Pitney Bowes, where he leads a global team dedicated to helping clients unlock the value hidden in the massive amounts of data collected about customers, infrastructure, and products. With over 17 years of experience engaging with customers about technology, Tim has a proven track record of delivering value in every engagement.

Presentations

Enabling 5G use cases through location intelligence Session

Tim McKenzie examines why planning 5G network rollout and associated services requires a good understanding of location-based data. Accurate addressing and linking consumers to property or points of interest allow data enrichment with attributes, demographics, and social data. Companies use location to organize and analyze network and customer data to understand where to target new services.

Hamlet Jesse Medina Ruiz is a senior data scientist at Criteo. Previously, he was a control system engineer for Petróleos de Venezuela. Hamlet has finished near the top of multiple data science competitions, including 4th place in predicting return volatility on the New York Stock Exchange, hosted by Collège de France and CFM in 2018, and 25th place in predicting stock returns, hosted by G-Research in 2018. Hamlet holds two master’s degrees, in mathematics and machine learning, from Pierre and Marie Curie University and a PhD in applied mathematics from Paris-Sud University in France, where he focused on statistical signal processing and machine learning.

Presentations

Predicting Criteo’s internet traffic load using Bayesian structural time series models Session

Criteo’s infrastructure provides the capacity and connectivity to host Criteo’s platform and applications. The evolution of this infrastructure is driven by the ability to forecast Criteo’s traffic demand. Hamlet Jesse Medina Ruiz explains how Criteo uses Bayesian dynamic time series models to accurately forecast its traffic load and optimize hardware resources across data centers.
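Criteo’s production models are richer than this, but the backbone shared by Bayesian structural time series models, a local-level state-space model filtered with the Kalman recursions, can be sketched in a few lines of plain Python. This is a minimal illustrative sketch, not Criteo’s implementation; the function name and all numbers below are made up for the example.

```python
def local_level_filter(ys, q, r):
    """Kalman filter for a local-level model:
         level_t = level_{t-1} + w_t,  w_t ~ N(0, q)   (state equation)
         y_t     = level_t + v_t,      v_t ~ N(0, r)   (observation equation)
       Returns the filtered level estimate after each observation."""
    level, p = ys[0], 1.0  # initial state mean and variance
    levels = []
    for y in ys:
        p = p + q                        # predict: uncertainty grows by q
        k = p / (p + r)                  # Kalman gain
        level = level + k * (y - level)  # update toward the new observation
        p = (1 - k) * p
        levels.append(level)
    return levels

# Noisy observations drifting slowly upward (illustrative numbers).
ys = [10.0, 10.4, 9.8, 10.2, 11.0, 10.9, 11.3, 11.1]
levels = local_level_filter(ys, q=0.1, r=0.5)

# For a local-level model, the h-step-ahead point forecast is simply the
# last filtered level; only its uncertainty grows with the horizon.
forecast = levels[-1]
print(f"point forecast: {forecast:.2f}")
```

A full BSTS treatment adds trend, seasonal, and regression components and places priors on the variances q and r, but the filtering step above is the core of how such models turn noisy traffic observations into a smoothed, forecastable level.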

Martin Mendez-Costabel leads the geospatial data asset team for Monsanto’s Products and Engineering organization within the IT Department, where he drives the engineering and adoption of global geospatial data assets for the enterprise. He has more than 12 years of experience in the agricultural sector covering a wide range of precision agriculture-related roles, including data scientist and GIS manager for E&J Gallo Winery in California. Martin holds a BSc in agronomy from the National University of Uruguay and two viticulture degrees: an MSc from the University of California, Davis, and a PhD from the University of Adelaide in Australia.

Presentations

Optimizing the ROI of a geospatial platform in the cloud DCS

Cloud architecture is an extremely flexible environment for deploying solutions. However, the first build of a solution, even using open source software, may quickly exceed initial cost estimates and could outpace ROI if not managed properly. Martin Mendez-Costabel explains how Bayer Crop Science manages its geospatial platform and how it increased ROI.

Sara Menker is the founder and CEO of Gro Intelligence. Gro is AI for agriculture: its platform automatically harvests vast amounts of disparate global agricultural data, transforms it into knowledge, and generates predictions for volatile markets. Prior to founding Gro, Sara was a vice president in Morgan Stanley’s commodities group. She began her career in commodities risk management, where she covered all commodity markets, and subsequently moved to trading, where she managed an options trading portfolio. Sara is a trustee of the Mandela Institute for Development Studies (MINDS) and a trustee of the International Center for Tropical Agriculture (CIAT). She was named a Global Young Leader by the World Economic Forum and is a fellow of the Aspen Institute. Sara received a BA in economics and African studies from Mount Holyoke College and the London School of Economics and an MBA from Columbia University.

Presentations

Subhasish Misra is a staff data scientist at Walmart Labs, where he leads efforts to create scalable machine learning solutions for Walmart’s customer base. He’s also a member of the global data science board at I-COM, a cross-industry global think tank on harnessing data and analytics for better marketing. Previously, Subhasish was at HP, WPP, and Aon and consulted for many Fortune 500 clients across multiple geographies over his 12-year career in advanced analytics. His broad expertise spans marketing analytics, and his current data science interests center on modeling customer behavior and causal inference. He holds an MA in economics from the Delhi School of Economics, where econometrics was one of his focus areas.

Presentations

Causal inference 101: Answering the crucial "why" in your analysis Session

Causal questions are ubiquitous, and randomized tests are considered the gold standard for answering them. However, such tests aren't always feasible, leaving only observational data from which to draw causal insights. Techniques such as matching offer a way to do just that. Subhasish Misra explores these techniques and shares practical tips for inferring causal effects.
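The matching idea the abstract refers to can be illustrated in a few lines of plain Python. This is a hypothetical minimal sketch, not material from the session: the data is synthetic, `att_by_matching` is an illustrative helper name, and real matching work would handle many covariates (e.g., via propensity scores). Each treated unit is paired with the control unit closest on a confounding covariate, and the paired outcome differences estimate the effect.

```python
import random

def att_by_matching(treated, control):
    """Estimate the average treatment effect on the treated (ATT) by
    pairing each treated unit with the control unit whose covariate
    value is closest (1-nearest-neighbor matching, with replacement)."""
    diffs = []
    for x_t, y_t in treated:
        x_c, y_c = min(control, key=lambda unit: abs(unit[0] - x_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

random.seed(0)

# Synthetic observational data: units with a larger covariate x are more
# likely to be treated, and x also drives the outcome, so the naive
# difference in group means is badly confounded.
true_effect = 2.0
treated, control = [], []
for _ in range(2000):
    x = random.uniform(0, 10)
    is_treated = random.random() < x / 10
    y = 3 * x + (true_effect if is_treated else 0.0) + random.gauss(0, 0.5)
    (treated if is_treated else control).append((x, y))

naive = (sum(y for _, y in treated) / len(treated)
         - sum(y for _, y in control) / len(control))
matched = att_by_matching(treated, control)
print(f"naive difference: {naive:.2f}")    # inflated by confounding
print(f"matched estimate: {matched:.2f}")  # close to the true effect of 2.0
```

The naive comparison attributes the confounder's influence to the treatment; matching on the covariate recovers an estimate near the true effect.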

Sanjay Mohan is the group chief technology officer at MakeMyTrip. He leads the overall technology for MakeMyTrip, Goibibo, and redBus and is responsible for developing and executing global technology strategy and planning ongoing technology innovations for continued success. Previously, Sanjay was at Yahoo, IBM, Infosys, Oracle, and Netscape in the US and in India; he was the chief technology officer of MakeMyTrip prior to the merger with ibibo Group. With 25+ years of experience, Sanjay brings significant expertise in product and platform engineering, including architecture, user experience, site operations, product management, and product strategy. Sanjay earned his master’s degree in computer science from the University of Louisiana and his bachelor’s degree in engineering from the Birla Institute of Technology Mesra.

Presentations

Migrating millions of users from voice- and email-based customer support to a chatbot Session

At MakeMyTrip, customers used voice or email to contact agents for postsale support. To improve the efficiency of agents and the customer experience, MakeMyTrip developed a chatbot, Myra, using some of the latest advances in deep learning. Madhu Gopinathan and Sanjay Mohan explain the high-level architecture and the business impact Myra created.

Jessica Egoyibo Mong is an engineering manager on the machine learning engineering (MLE) team at SurveyMonkey, where she leads efforts to rearchitect the online serving ML system. Previously, she was a full stack engineer on the billing and payments team, where she built and maintained software to enable SurveyMonkey’s global financial growth and operations; she also oversaw the technical talks program, jointly managed the engineering internship program, and co-led the Women in Engineering group. She’s a 2014 White House Initiative on HBCUs All-Star and a Hackbright (summer 2013) and CODE2040 (summer 2014) alum. She’s served on the leadership team of the Silicon Valley chapter of the Anita Borg Institute and is a member of /dev/color. Jessica earned a BS in computer engineering from Claflin University in South Carolina. A singer and aspiring drummer, she performs at her church in Livermore, California. In her spare time, she enjoys eating, CrossFit, reading, learning new technologies, and sleeping.

Presentations

Your cloud, your ML, but more and more scale? How SurveyMonkey did it Session

You're a SaaS company operating on a cloud infrastructure built before the machine learning (ML) era, and you need to successfully extend that infrastructure to leverage the power of ML. Jing Huang and Jessica Mong detail a case study with critical lessons from SurveyMonkey’s journey of expanding its ML capabilities with its rich data repository and hybrid cloud infrastructure.

James Morantus is a cloud solutions and customer success engineer at Cloudera. Previously, he was a senior solutions architect in the professional services organization at Cloudera, delivering services both on-premises and in the public cloud.

Presentations

Running multidisciplinary big data workloads in the cloud with CDP Tutorial

Organizations now run diverse, multidisciplinary, big data workloads that span data engineering, data warehousing, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long running in nature. There are many challenges with moving these workloads to the cloud. In this talk we start off with a technical deep...

Tusharadri Mukherjee is head of ecommerce analytics at Lenovo, where he’s one of the driving forces behind the company’s data-driven transformation. He helped the web business cross the $1B mark last year through his team’s continuous endeavors in data-driven merchandising, pricing, and campaign initiatives. Tushar has 12+ years of experience in the tech industry, with expertise in ecommerce, analytics, consulting, and financial management. Previously, he worked for Tata Consultancy Services, where he managed technical delivery and client relationships. Tushar has a BTech in computer science from India and an MBA from Duke University.

Presentations

The attribution problem DCS

Attribution of media spend is a common problem shared by people in many different roles and industries. Many of the solutions that are simplest and easiest to implement don't drive the right behavior. Tushar Mukherjee shares practical lessons learned developing and applying multivariate attribution models at Lenovo.
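As a toy illustration of why the choice of attribution rule matters, here is a minimal sketch in plain Python. It is not Lenovo's model (the session covers far richer multivariate approaches); the channel names and conversion paths are made up. A last-touch rule gives all credit to the final channel, while a linear multitouch rule spreads credit across the path, and the two can tell very different stories about the same data.

```python
from collections import defaultdict

def last_touch(paths):
    """Give each conversion's full credit to the last channel touched."""
    credit = defaultdict(float)
    for path in paths:
        credit[path[-1]] += 1.0
    return dict(credit)

def linear(paths):
    """Split each conversion's credit evenly across every channel touched."""
    credit = defaultdict(float)
    for path in paths:
        share = 1.0 / len(path)
        for channel in path:
            credit[channel] += share
    return dict(credit)

# Each list is the ordered sequence of channels one converting customer touched.
paths = [
    ["display", "email", "search"],
    ["search"],
    ["display", "search"],
    ["email", "search"],
]

print(last_touch(paths))  # search gets credit for all 4 conversions
print(linear(paths))      # display and email now get partial credit
```

Under last-touch, display and email look worthless and budget would shift entirely to search; the linear rule reveals they assisted most conversions, which is exactly the kind of behavioral distortion simple models can create.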

Sireesha Muppala is a solutions architect at Amazon Web Services. Her area of depth is machine learning and artificial intelligence, and she provides guidance to AWS customers on their ML and AI workloads. She led the University of Colorado team that won and successfully completed a two-year research grant from the Air Force Research Lab on autonomous job scheduling in unmanned aerial vehicles. An experienced public speaker, she has presented research papers at international conferences, including "CoSAC: Coordinated Session-Based Admission Control for Multi-Tier Internet Applications" at the IEEE International Conference on Computer Communications and Networks (ICCCN) and "Regression Based Multi-Tier Resource Provisioning for Session Slowdown Guarantees" at the IEEE International Performance, Computing and Communications Conference (IPCCC). Her published articles include "Coordinated Session-Based Admission Control with Statistical Learning for Multi-Tier Internet Applications" in the Journal of Network and Computer Applications (JNCA), "Regression-Based Resource Provisioning for Session Slowdown Guarantee in Multi-Tier Internet Servers," and "Multi-Tier Service Differentiation: Coordinated Resource Provisioning and Admission Control" in the Journal of Parallel and Distributed Computing (JPDC). Sireesha earned her PhD and completed postdoctoral work at the University of Colorado, Colorado Springs, while working full time. Her PhD thesis is "Multi-Tier Internet Service Management Using Statistical Learning Techniques."

Presentations

Alexa, do men talk too much? Session

Mansplaining. Know it? Hate it? Want to make it go away? Sireesha Muppala, Shelbee Eigenbrode, and Emily Webber tackle the problem of men talking over or down to women and its impact on career progression for women. They also demonstrate an Alexa skill that uses deep learning techniques on incoming audio feeds, examine ownership of the problem for women and men, and suggest helpful strategies.

ML ops: Applying DevOps practices to machine learning workloads Session

As an increasing level of automation becomes available to data science, the balance between automation and quality needs to be maintained. Applying DevOps practices to machine learning workloads brings models to the market faster and maintains the quality and integrity of those models. Sireesha Muppala, Shelbee Eigenbrode, and Randall DeFauw explore applying DevOps practices to ML workloads.

Arun C. Murthy was chief product officer at Hortonworks, prior to its merger with Cloudera, leading engineering and R&D efforts across the company’s entire portfolio. He was one of the founders of Hortonworks and has spent well over a decade in the open source big data industry. He has worked on Apache Hadoop since its inception in 2006 and remains an Apache Hadoop Project Management Committee (PMC) member. Prior to cofounding Hortonworks, Arun was one of the original members of the Hadoop team at Yahoo, where he was responsible for all Hadoop/MapReduce code and configuration deployed across the 42,000+ servers and for running Hadoop as a service for the company.

Presentations

Delivering the enterprise data cloud Keynote

In this keynote, we’ll introduce you to the new 100% open source Cloudera Data Platform (CDP), the world’s first enterprise data cloud. CDP is hybrid and multi-cloud, delivering the speed, agility, and scale you need to secure and govern your data anywhere from the edge to AI.

Mikheil Nadareishvili is deputy head of BI at TBC Bank, in charge of the company-wide data science initiative. His main responsibilities include overseeing the development of data science capabilities and embedding them in the business to achieve maximum business value. Previously, Mikheil applied data science to various domains, most notably real estate (to determine housing market trends and predict real estate prices) and education (to determine the factors that influence students’ educational attainment in Georgia).

Presentations

Quadrupling profit through analytics: Data science with business value in mind Findata

TBC Bank recently started doing analytics in a new way: embedding data scientists directly in the businesses they're working for, along with staff dedicated to connecting them to data and business. It also made sure all projects had clear measures of success. Mikheil Nadareishvili explains how this shift unlocked value—in one project, helping improve the profit of a major loan product four-fold.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Building a best-in-class data lake on AWS and Azure Session

Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. Tomer Shiran and Jacques Nadeau explain how you can build best-in-class data lakes in the cloud, leveraging open source technologies and the cloud's elasticity to run and optimize workloads simultaneously.

Securing your cloud data lake with a "defense in depth" approach Session

With cheap and scalable storage services such as S3 and ADLS, it's never been easier to dump data into a cloud data lake. But you still need to secure that data and be sure it doesn't leak. Tomer Shiran and Jacques Nadeau explore capabilities for securing a cloud data lake, including authentication, access control, encryption (in motion and at rest), and auditing, as well as network protections.

Arup Nanda is the managing director of ML, AI, and data platform at Capital One. He’s an Oracle ACE director and has been working in the Oracle technologies field for 25 years, spanning all aspects of data management from modeling to performance tuning and data science. He’s written six books and more than 700 articles and presented more than 500 sessions in conferences in 25 countries. He blogs at arup.blogspot.com.

Presentations

Business transformation with a visitor funnel DCS

Arup Nanda explains how combining multiple data elements, businesses, and systems and charting visitor drop-offs, along with A/B testing—using Cloudera, Airflow, and machine learning with regression and classification models—has allowed Priceline to redefine its business, change product design, and offer incentives to customers.

Executive Briefing: Building a data-assisted organization Session

Every organization wants to use data more effectively and as a weapon, but few succeed. Arup Nanda explores how Priceline started on this journey and how it was successful using different techniques and tools. Join in to learn how to streamline data assets, make it easier for end users, define KPIs, create value from data, and build sponsorships to build a data organization.

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly and director of community evangelism for Apache Spark at Databricks. Paco is the cochair of the Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the "top 30 people in big data and analytics" in 2015 by Innovation Enterprise.

Presentations

Data science versus engineering: Does it really have to be this way? Session

If, as a data scientist, you've wondered why it takes so long to deploy your model into production or, as an engineer, thought data scientists have no idea what they want, you're not alone. Join a lively discussion with industry veterans Ann Spencer, Paco Nathan, Amy Heineike, and Chris Wiggins to find best practices or insights on increasing collaboration when developing and deploying models.

Executive Briefing: Unpacking AutoML Session

Paco Nathan outlines the history and landscape for vendors, open source projects, and research efforts related to AutoML. Starting from the perspective of an AI expert practitioner who speaks business fluently, Paco unpacks the ground truth of AutoML—translating from the hype into business concerns and practices in a vendor-neutral way.

Presentations

Apache Metron: Open source cybersecurity at scale Tutorial

Bring your laptop, roll up your sleeves, and get ready to crunch some cybersecurity events with Apache Metron, an open source big data cybersecurity platform. Carolyn Duby walks you through how Metron finds actionable events in real time.

Max Neunhöffer is a mathematician turned database developer. He’s a senior software developer architect at ArangoDB. In his academic career, he worked for 16 years on the development and implementation of new algorithms in computer algebra, where he juggled with mathematical big data like group orbits containing trillions of points. He recently returned from St. Andrews to Germany, shifted his focus to NoSQL databases, and now helps develop ArangoDB. He’s spoken at international conferences including O’Reilly Software Architecture London, J On The Beach, and MesosCon Seattle.

Presentations

The case for a common metadata layer for machine learning platforms Session

Machine learning platforms are becoming more complex, with different components each producing their own metadata and their own way of storing metadata. Max Neunhöffer and Joerg Schad propose a first draft of a common metadata API and demonstrate a first implementation of this API in Kubeflow using ArangoDB, a native multimodel database.

Jackie Nichols is a member of the applied innovation and incubation team within Microsoft’s Enterprise Commercial Organization, where she partners with Microsoft’s internal product groups (PGs), building long-standing, trusted relationships by combining senior business experience with the deep technical capabilities needed to serve as the critical interface between early adopters, the field, and the PGs. Her responsibilities include business alignment and strategy; customer executive relationships and deliveries; quality improvement through amplifying real-world customer and partner experiences and feedback to the PGs; cross-domain expertise; recruiting and driving key partner relationships; educating the Microsoft field; and creating and contributing to PG IP. Jackie is involved in business and technical risk management as well as the entire implementation lifecycle with customers, with a broad focus on consistent delivery, innovation, and technology alignment. She’s currently partnering with the Azure Cognitive Services Personalizer PG to transform how customers think about and perform personalization of their content through machine learning. Previously, Jackie was the CTO for modern productivity at Microsoft Services, where she led a team of Microsoft’s most senior solution, regional, and industry architects and was responsible for developing strategies to align Microsoft Services delivery capabilities and business goals with Microsoft product and industry trends; providing presales architecture leadership, readiness, IP, and delivery quality oversight; and driving diversity and inclusion. She holds a degree in computer science from Carleton University in Ottawa, Canada. Jackie believes in giving back and spends her spare time working with local charities as well as globally focused organizations that concentrate on world-view issues.

Presentations

Deliver personalized experiences and content like Xbox with Cognitive Services Personalizer (sponsored by Microsoft Azure) Session

Edward Jezierski and Jackie Nichols demonstrate how Cognitive Services Personalizer works with your content and data, how it autonomously learns to make optimal decisions, how you can add it to your app with two lines of code, and what’s under the hood. Then they share the results Personalizer achieved on the Xbox One home page as well as best practices for applying it in your applications today.

Alberto Nieto is a GIS solutions engineer at Esri; he’s a developer with a passion for applying GIS to solve real-world problems. Alberto’s specialties are development in Python and JavaScript for solutions ranging from predictive behavior analysis to frontend dashboards for operational awareness.

Presentations

See what others can’t with spatial analysis and data science (sponsored by Esri) Session

Digital location data is a crucial part of data science. The "where" matters as much to an analysis as the "what" and the "why." Shannon Kalisky and Alberto Nieto explore tools that help you apply a range of geospatial techniques in your data science workflows to get deeper insights.

Michael Noll is a technologist in the office of the CTO at Confluent, the company founded by the creators of Apache Kafka. Previously, Michael was the technical lead of DNS operator Verisign’s big data platform, where he grew the Hadoop, Kafka, and Storm-based infrastructure from zero to petabyte-sized production clusters spanning multiple data centers, one of the largest big data infrastructures in Europe at the time. He’s a well-known tech blogger in the big data community. In his spare time, Michael serves as a technical reviewer for publishers such as Manning and is a frequent speaker at international conferences, including Strata, ApacheCon, and ACM SIGIR. Michael holds a PhD in computer science.

Presentations

Now you see me; now you compute: Building event-driven architectures with Apache Kafka Session

Would you cross the street with traffic information that's a minute old? Certainly not. Modern businesses have the same needs. Michael Noll explores why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, you'll look at Kafka as the storage layer, Kafka Connect for data integration, and Kafka Streams and KSQL as the compute layer.

Harsha Nori is a data scientist at Microsoft. He works on interpretability for machine learning.

Presentations

Unified tooling for machine learning interpretability Session

Understanding decisions made by machine learning systems is critical for sensitive uses, ensuring fairness, and debugging production models. Interpretability presents options for trying to understand model decisions. Harsha Nori, Samuel Jenkins, and Rich Caruana explore the tools Microsoft is releasing to help you train powerful, interpretable models and interpret existing black box systems.

Owen O’Malley is a cofounder and technical fellow at Cloudera, formerly Hortonworks. Cloudera’s software includes Hadoop and the large ecosystem of big data tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006 at Yahoo, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. He was the original architect of MapReduce and of Hadoop security, and he now works on Hive, where he’s driving the development of the ORC file format and adding ACID transactions.

Presentations

Protect your private data in your Hadoop clusters with ORC column encryption Session

Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. Owen O'Malley dives into how column encryption in ORC files enables both fine-grain protection and audits of who accessed the private data.

Tim O’Reilly is the founder and CEO of O’Reilly Media, Inc. His original business plan was simply “interesting work for interesting people,” and that’s worked out pretty well. O’Reilly Media delivers online learning, publishes books, runs conferences, urges companies to create more value than they capture, and tries to change the world by spreading and amplifying the knowledge of innovators. Tim has a history of convening conversations that reshape the computer industry. In 1998, he organized the meeting where the term “open source software” was agreed on and helped the business world understand its importance. In 2004, with the Web 2.0 Summit, he defined how “Web 2.0” represented not only the resurgence of the web after the dot-com bust but a new model for the computer industry based on big data, collective intelligence, and the internet as a platform. In 2009, with his Gov 2.0 Summit, he framed a conversation about the modernization of government technology that has shaped policy and spawned initiatives at the federal, state, and local level and around the world. He has now turned his attention to implications of AI, the on-demand economy, and other technologies that are transforming the nature of work and the future shape of the business world. This is the subject of his book from Harper Business, WTF: What’s the Future and Why It’s Up to Us. In addition to his role at O’Reilly Media, Tim is a partner at early-stage venture firm O’Reilly AlphaTech Ventures (OATV) and serves on the boards of Maker Media (which was spun out from O’Reilly Media in 2012), Code for America, PeerJ, Civis Analytics, and PopVox.

Presentations

AI isn't magic. It’s computer science. Keynote

AI has the potential to add $16 trillion to the global economy by 2030, but adoption has been slow. While we understand the power of AI, many of us aren’t sure how to fully unleash its potential. Join Robert Thomas and Tim O'Reilly to learn why the reality is that AI isn't magic; it’s hard work.

Kaan Onuk is an engineering manager at Uber, where he leads the metadata management team in the big data org. Previously, he was a tech lead at Uber, where he designed and built infrastructure to power data discovery and data privacy, and he helped build data infrastructure from the ground up at Graphiq, a startup acquired by Amazon. Kaan holds a master’s degree in electrical engineering from the University of Southern California.

Presentations

Turning big data into knowledge: Managing metadata and data relationships at Uber's scale Session

Uber takes being data driven to the next level. A robust system for discovering and managing various entities, from datasets to services to pipelines, along with their relevant metadata, isn't just nice to have; it's absolutely integral to making data useful. Kaan Onuk, Luyao Li, and Atul Gupte explore the current state of metadata management, end-to-end data flow solutions at Uber, and what’s coming next.

Diego Oppenheimer, founder and CEO of Algorithmia, is an entrepreneur and product developer with an extensive background in all things data. Previously, he designed, managed, and shipped some of Microsoft’s most-used data analysis products including Excel, Power Pivot, SQL Server, and Power BI. Diego holds a bachelor’s degree in information systems and a master’s degree in business intelligence and data analytics from Carnegie Mellon University.

Presentations

The new SDLC: CI/CD in the age of machine learning Session

Machine learning (ML) will fundamentally change the way we build and maintain applications. Diego Oppenheimer dives into how you can adapt your infrastructure, operations, staffing, and training to meet the challenges of the new software development life cycle (SDLC) without throwing away everything that already works.

Aaron Owen is a data scientist at Major League Baseball, where he leverages his skills to solve business problems for the organization and its 30 teams. Aaron holds an MS and PhD in evolutionary biology and was previously a professor at both the City University of New York and New York University.

Presentations

Data science and the business of Major League Baseball Session

Using SAS, Python, and AWS SageMaker, Major League Baseball's (MLB's) data science team outlines how it predicts ticket purchasers’ likelihood to purchase again, evaluates prospective season schedules, estimates customer lifetime value, optimizes promotion schedules, quantifies the strength of fan avidity, and monitors the health of monthly subscriptions to its game-streaming service.

Jignesh Patel is a principal architect at Cox Communications. He has more than 15 years' experience applying scientific methods and mathematical models to solve industry problems concerning the management of systems, people, machines, materials, and finance. Previously, he was a trusted advisor for a large software company in the northwest, assisting in data center capacity forecasting and providing machine learning capabilities to detect email spam, predict DDoS attacks, and prevent DNS blackholes.

Presentations

Secured computation: Analyzing sensitive data using homomorphic encryption Session

Organizations often work with sensitive information such as social security and credit card numbers. Although this data is stored in encrypted form, most analytical operations require data decryption for computation. This creates unwanted exposure to theft or unauthorized reads. Matt Carothers, Jignesh Patel, and Harry Tang explain how homomorphic encryption prevents fraud.

Nick Pentreath is a principal engineer at the Center for Open Source Data & AI Technologies (CODAIT) at IBM, where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations, and was at Goldman Sachs, Cognitive Match, and Mxit. He’s a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Deploying end-to-end deep learning pipelines with ONNX Session

The common perception of deep learning is that it results in a fully self-contained model. However, in most cases, these models have data preprocessing requirements similar to those of more "traditional" machine learning. Despite this, there are few standard solutions for deploying end-to-end deep learning. Nick Pentreath explores how the ONNX format and ecosystem address this challenge.

Robert Pesch is a senior data scientist and big data engineer at inovex. Robert holds a PhD in bioinformatics and an MSc in computer science. He gets most excited about analyzing large and complex datasets and implementing novel insight-generating data products using advanced mathematical and statistical models.

Presentations

From whiteboard to production: A demand forecasting system for an online grocery shop Session

Data-driven software is revolutionizing the world and enabling intelligent services we interact with daily. Robert Pesch and Robin Senge outline the development process, statistical modeling, data-driven decision making, and components needed for productionizing a fully automated and highly scalable demand forecasting system for an online grocery shop run by a billion-dollar retail group in Europe.

Keshav Peswani is a senior software engineer at Expedia Group, focusing on technology and innovation across various platform initiatives. Keshav is involved in building neural network-based anomaly detection models as part of Expedia's adaptive alerting system, an open source project for anomaly detection. He's also a core contributor to Haystack, Expedia's open source project for distributed tracing, software that facilitates the detection and remediation of problems in service-oriented architectures. Previously, he was at the D. E. Shaw group, and he has since worked on several projects based on deep learning (particularly recurrent neural networks), monolithic systems, distributed systems, and big data processing. Keshav is a fast learner who's passionate about deep learning and event-driven architecture. He's spoken about Haystack at Open Source India, Asia's largest open source conference, and discussed it in Open Source For You (OSFY).

Presentations

Real-time anomaly detection on observability data using neural networks Session

Observability is the key in modern architecture to quickly detect and repair problems in microservices. Modern observability platforms have evolved beyond simple application logs and include distributed tracing systems like Zipkin and Haystack. Keshav Peswani and Ashish Aggarwal explore how combining them with real-time, intelligent alerting mechanisms helps in the automated detection of problems.

Barbara Petrocelli is a vice president at Cambridge Semantics, where she drives efforts to help more organizations harness the power of semantic and graph data models to meet the challenges of today's analytics revolution. Across generations of analytic architectures, from data warehousing and lakes to today's big data fabrics, Barbara has led strategic marketing for software pioneers including Ascential, ProfitLogic, Netezza, and Cambridge Semantics.

Presentations

Semantics and graph data models in the enterprise data fabric (sponsored by Cambridge Semantics) Session

Join industry consultant Peter Ball, of Liminal Innovation, and Barbara Petrocelli, VP Field Operations of Cambridge Semantics, to learn how enterprise data fabrics are reshaping the modern data management landscape.

Nick Pinckernell is a senior research engineer for the applied AI research team at Comcast, where he works on ML platforms for model serving and feature pipelining. He has focused on software development, big data, distributed computing, and research in telecommunications for many years. He’s pursuing his MS in computer science at the University of Illinois at Urbana-Champaign, and when free, he enjoys IoT.

Presentations

Automating ML model training and deployments via metadata-driven data, infrastructure, feature engineering, and model management Session

Mumin Ransom gives an overview of the data management and privacy challenges around automating ML model (re)deployments and stream-based inferencing at scale.

Passionate about using ethical AI to enhance users’ experiences and grow businesses

Presentations

Building and leading a successful AI practice for your organization Tutorial

Creating and leading a successful ML strategy is an elegant orchestration of many components: master key ML concepts, operationalize ML workflow, prioritize highest-value projects, build a high-performing team, nurture strategic partnerships, align with the company’s mission, etc. Rossella Blatt Vital details insights and lessons learned in how to create and lead a flourishing ML practice.

Josh Poduska is the chief data scientist at Domino Data Lab. He has 17 years of experience in analytics. His work experience includes leading the statistical practice at one of Intel’s largest manufacturing sites, working on smarter cities data science projects with IBM, and leading data science teams and strategy with several big data software companies. Josh holds a master’s degree in applied statistics from Cornell University.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders must deliver measurable impact on an increasing share of an enterprise’s KPIs. The speakers explore how leading organizations take a holistic approach to people, process, and technology to build a sustainable advantage.

Kevin Poskitt is a senior director at SAP, where he’s focused on machine learning, data science, and artificial intelligence. He’s responsible for leading SAP’s next-generation projects in unified machine learning. His experience encompasses more than 10 years in various technology companies ranging from small startups to large software vendors, where he’s worked in multiple departments including sales, marketing, finance, and product management. He’s a graduate of the University of Toronto with a specialty in economics and finance, and he holds a bachelor’s of commerce and a diploma in accounts from the University of British Columbia.

Presentations

Bringing together machine and human intelligence in business applications at enterprise scale (sponsored by SAP) Session

Oftentimes there's a fracture between the highly governed data of enterprise IT systems and the comprehensive but often ungoverned world of large-scale data lakes and streams of data from blogs, system logs, sensors, IoT devices, and more. Kevin Poskitt and Andreas Wesselmann walk you through how AI needs to connect to all of this data, as well as image, video, audio, and text data sources.

Yogesh Prasad is platform leader for the clinical data repository at leading human data science company IQVIA. IQVIA leverages human biology and data science to serve the combined industries of clinical research and health information technologies. Yogesh has over 20 years of experience working on data lakes, streaming data solutions, data governance, data quality, master data management, data integration, and data warehousing. He has been instrumental in shaping the data strategy at IQVIA and other companies to deliver transformational projects through the use of data-driven solutions. Yogesh holds an MBA from Duke University and a bachelor’s in engineering from Gujarat, India.

Presentations

Getting clinical trial data ready for analysis: How IQVIA wrangled its way to success (sponsored by Trifacta) Session

Clinical trial data analysis can be a complex process. The data is typically hand-coded and formatted differently and is required to be delivered in an FDA-approved format. Matt Derda and Yogesh Prasad explain how IQVIA built its Clean Patient Tracker and how it enabled agility and flexibility for end users of the platform, from data acquisition to reporting and analytics.

Jeremy D. Rader is the senior director of data-centric solutions in Intel's Data Center Group, Enterprise and Government (E&G) Organization. He leads the group's strategies and efforts to accelerate E&G customers' digital transformation by delivering analytics and AI solutions enabled by partners and scaled through the industry. Jeremy joined the Data Center Group in 2015 after 15 years in Intel's IT organization, culminating in his role as director of business intelligence, where he led Intel's global team responsible for driving advanced analytics and insights across manufacturing, sales and marketing, supply chain, and finance. Earlier in his Intel career, he held positions in the eBusiness group and supply chain organization. He joined Intel in 1997 as a materials services manager in Intel's manufacturing business.

Presentations

Navigating the transition to a data-first enterprise: An Intel perspective (sponsored by Intel) Session

Jeremy Rader reveals firsthand insights from an Intel analytics practitioner, shares Intel IT's own data maturity journey, and provides actionable best known methods (BKMs) for enterprises amid the transformation into an intelligent, data-first business.

Unleash the power of data at scale (sponsored by Intel) Keynote

Data analytics is the long-standing but constantly evolving science that companies leverage for insight, innovation, and competitive advantage. Jeremy Rader explores Intel’s end-to-end data pipeline software strategy designed and optimized for a modern and flexible data-centric infrastructure that allows for the easy deployment of unified advanced analytics and AI solutions at scale.

Akshay Rai is a senior software engineer at LinkedIn, whose primary focus is to reduce the mean time to detect issues and the mean time to resolve issues that arise at LinkedIn. He works on LinkedIn’s next-generation anomaly detection and diagnosis platform. Previously, he actively led the popular Dr. Elephant project at LinkedIn and helped open source it, and he worked on operational intelligence solutions for Hadoop and Spark by building real-time systems that enable monitoring, visualizing, and debugging of big data applications and Hadoop clusters.

Presentations

ThirdEye: LinkedIn’s business-wide monitoring platform Session

Failures or issues in a product or service can negatively affect the business. Detecting issues in advance and recovering from them is crucial to keeping the business alive. Join Akshay Rai to learn more about LinkedIn's next-generation open source monitoring platform, an integrated solution for real-time alerting and collaborative analysis.

Presentations

Foundations for successful data projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Ted Malaska and Jonathan Seidman detail guidelines and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

Manu Ram Pandit is a senior software engineer on the data analytics and infrastructure team at LinkedIn. He has extensive experience in building complex and scalable applications. During his tenure at LinkedIn, he's influenced the design and implementation of hosted notebooks, providing a seamless experience to end users. He works closely with customers, engineers, and product to understand and define the requirements and design of the system. Previously, he was with Paytm, Amadeus, and Samsung, where he built scalable applications for various domains.

Presentations

A productive data science platform: Beyond a hosted-notebooks solution at LinkedIn Session

Join Swasti Kakker, Manu Ram Pandit, and Vidya Ravivarma to explore what's offered by a flexible and scalable hosted data science platform at LinkedIn. It provides features to seamlessly develop in multiple languages, enforce developer best practices and governance policies, execute and visualize solutions, and support efficient knowledge management and collaboration, all to improve developer productivity.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Serverless streaming architectures and algorithms for the enterprise Tutorial

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems and examine the inception and growth of the serverless paradigm. You'll take a deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and get a bird's-eye view of the application domains where you can leverage them.

Mumin Ransom joined Comcast in 2005 and has since worked across the HSD, voice, and video production lines. He currently leads a machine learning platform development team whose system handles billions of events daily. The system is designed to improve the customer experience by predicting service issues and meeting customers digitally with simple resolutions, resulting in less downtime for customers, making them happier, and saving millions in operations costs.

Mumin is also a cofounder of BENgineers, an organization for Black technology professionals at Comcast whose goal is to enhance the tech pipeline and create advocacy and representation for Black tech professionals at the company. The BENgineers have participated in coding events with local children, hosted discussions about Black professionals in technology, participated in Comcast lab weeks, and received a Commerce Impact Award for entrepreneurship and strategy for the creation of the organization.

Presentations

Automating ML model training and deployments via metadata-driven data, infrastructure, feature engineering, and model management Session

Mumin Ransom gives an overview of the data management and privacy challenges around automating ML model (re)deployments and stream-based inferencing at scale.

Sushant Rao is a cloud product marketer at Cloudera.

Presentations

The hitchhiker’s guide to the cloud: Architecting for the cloud through customer stories Session

Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms.

Radhika Ravirala is a solutions architect at Amazon Web Services, where she helps customers craft distributed big data applications on the AWS platform. Previously, she was a software engineer and designer for technology companies in Silicon Valley. She holds an MS in computer science from San Jose State University.

Presentations

Migrating Apache Spark and Hive from on-premises to Amazon EMR (sponsored by Amazon Web Services) Session

Radhika Ravirala explains how to migrate your workloads to Amazon EMR. Join in to learn the key motivations and benefits from a move to the cloud, along with the architectural changes required and best practices you can use right away.

SOLD OUT: Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

SOLD OUT: Building a serverless big data application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Vidya Ravivarma is a senior software engineer on the data analytics and infrastructure team at LinkedIn. She focuses on the design and implementation of a platform that improves developer productivity via hosted notebooks. She contributed to the design and development of a dynamic, unified ACL management system for GDPR enforcement on datasets produced via LinkedIn's metrics platform. She interacts closely with data analysts, scientists, engineers, and stakeholders to understand their requirements and build scalable, flexible solutions and platforms that enhance their productivity. Previously, she was at Yahoo, working mainly in data science and engineering and web development, which provided her with the insights to develop a scalable, productive data science platform.

Presentations

A productive data science platform: Beyond a hosted-notebooks solution at LinkedIn Session

Join Swasti Kakker, Manu Ram Pandit, and Vidya Ravivarma to explore what's offered by a flexible and scalable hosted data science platform at LinkedIn. It provides features to seamlessly develop in multiple languages, enforce developer best practices and governance policies, execute and visualize solutions, and support efficient knowledge management and collaboration, all to improve developer productivity.

Thiago Ribeiro is the server-side product director at Griaule, a software company that develops multi-biometric identification technology to help institutions and companies deploy large-scale biometric identification projects. Thiago has worked in the identity industry since 2015. He graduated in mechatronics engineering from Unicamp, with an exchange at the University of New South Wales.

Presentations

How Brazil deployed a 160 million-person biometric identification system: Challenges, benefits, and lessons learned Session

Brazil deployed a national biometric system to register all Brazilian voters using multiple biometric modalities and to ensure that a person doesn't enroll twice. Thiago Ribeiro highlights how a large-scale biometric system works and the main architecture decisions that must be taken into consideration.

Paige Roberts is an open source relations manager at Vertica, where she promotes understanding of Vertica, MPP data processing, open source, and how the analytics revolution is changing the world. In two decades in the data management industry, she’s worked as an engineer, a trainer, a marketer, a product manager, and a consultant.

Presentations

Kubernetes for stateful MPP systems Session

GoodData needed to autorecover from node failures and scale rapidly when workloads spiked on its MPP database in the cloud. Kubernetes could solve this, but it's built for stateless microservices, not a stateful MPP database that needs hundreds of containers. Paige Roberts and Deepak Majeti detail the hurdles GoodData had to overcome to merge the power of the database with Kubernetes.

Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.

Presentations

Fuzzy matching and deduplicating data: Techniques for advanced data prep Session

Nikki Rouda and Janisha Anand demonstrate how to deduplicate or link records in a dataset, even when the records don’t have a common unique identifier and no fields match exactly. You'll also learn how to link customer records across different databases, match external product lists against your own catalog, and solve tough challenges to prepare and cleanse data for analysis.

SOLD OUT: Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

SOLD OUT: Building a serverless big data application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by the company's data scientists. Previously, he was at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka.

Presentations

The why and how of data lineage Session

Every data team has to build an ecosystem that sustains the data, the users, and the use of the data itself. This data ecosystem comes with its own challenges during the building phase, maintenance, and enhancement. Neelesh Salian dives into the importance of data lineage for an organization. You'll explore how to go about building such a system.

Shioulin Sam is a research engineer at Cloudera Fast Forward Labs, where she bridges academic research in machine learning with industrial applications. Previously, she managed a portfolio of early stage ventures focusing on women-led startups and public market investments and worked in the investment management industry designing quantitative trading strategies. She holds a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Presentations

Learning with limited labeled data Session

Supervised machine learning requires large labeled datasets, a prohibitive limitation in many real-world applications. But this could be avoided if machines could learn from a few labeled examples. Shioulin Sam explores and demonstrates an algorithmic solution that relies on collaboration between human and machine to label smartly, and she outlines product possibilities.

Anjali Samani is a data science manager and leads the predictive modelling team at CircleUp, an innovative fintech company recently honored as one of the World’s Top 10 Most Innovative Companies in Data Science. Anjali has extensive experience in managing and delivering commercial data science projects and has worked with senior decision makers in startups, Financial Times Stock Exchange (FTSE) 100 businesses and public sector organizations in the UK and US to enable them to develop their data strategy and execute data science projects. Her roles bridge technical data science and business to identify and execute innovative solutions that leverage proprietary and open data sources to deliver value and drive growth. In her former life, Anjali was a quantitative analyst in asset management, and she has a background in computer science, economics, and mathematics.

Presentations

Working with time series: Denoising and imputation frameworks to improve data density Session

The application of smoothing and imputation strategies is common practice in predictive modeling and time series analysis. With a technique-agnostic approach, Anjali Samani provides qualitative and quantitative frameworks that address questions related to smoothing and imputation of missing values to improve data density.

Alejandro Saucedo is the chief scientist at the Institute for Ethical AI & Machine Learning, where he leads highly technical research on machine learning explainability, bias evaluation, reproducibility, and responsible design. With over 10 years of software development experience, Alejandro has held technical leadership positions across hypergrowth scale-ups and tech giants including Eigen Technologies, Bloomberg LP, and Hack Partners. He has a strong track record of building departments of machine learning engineers from scratch and leading the delivery of large-scale machine learning systems across the financial, insurance, legal, transport, manufacturing, and construction sectors in Europe, the US, and Latin America.

Presentations

A practical guide to algorithmic bias and explainability in machine learning Session

Alejandro Saucedo demystifies AI explainability through a hands-on case study, where the objective is to automate a loan-approval process by building and evaluating a deep learning model. He introduces motivations through the practical risks that arise with undesired bias and black box models and shows you how to tackle these challenges using tools from the latest research and domain knowledge.

Jörg Schad is the head of machine learning at ArangoDB. In a previous life, he worked on and built machine learning pipelines in healthcare, distributed systems at Mesosphere, and in-memory databases. He received his PhD for research on distributed databases and data analytics. He's a frequent speaker at meetups, international conferences, and lecture halls.

Presentations

The case for a common metadata layer for machine learning platforms Session

Machine learning platforms are becoming more complex, with different components each producing their own metadata and their own way of storing metadata. Max Neunhöffer and Joerg Schad propose a first draft of a common metadata API and demonstrate a first implementation of this API in Kubeflow using ArangoDB, a native multimodel database.

Ross Schalmo is the senior director of data and analytics at GE Aviation. Ross's career has covered everything from network and client security to hybrid cloud implementations and, finally, the data and analytics space. He currently leads an organization that offers a world-class self-service ecosystem, complete with tooling, training, and curated datasets, allowing functional subject matter experts to apply their knowledge in GE's data lake without having to be technical experts. The organization also provides tools and processes that enable data scientists to rapidly build, test, deploy, and orchestrate models in a CI/CD framework. After the business created PoCs and pilots in the self-service ecosystem, Ross's organization also took the lead on 30+ execution scrum teams building enterprise-class data products that cover use cases from fleet segmentation analytics to supply chain optimization. Lastly, the organization is building a data governance practice across the entire aviation business, ensuring quality and compliance for all users, whether self-service, developers, or application consumers. His organization is taking data governance from a protectionist mindset to a culture of data democratization.

Presentations

Executive Briefing: Building a culture of self-service from predeployment to continued engagement Session

Jonathan Tudor and Ross Schalmo explore how GE Aviation made it a mission to implement self-service data. To ensure success beyond initial implementation of tools, the data engineering and analytics teams created initiatives to foster engagement from an ongoing partnership with each part of the business to the gamification of tagging data in a data catalog to forming a published dataset council.

Chad Scherrer is a senior data scientist with Metis, where he trains burgeoning data scientists. In addition to data science education, he has a passion for technology transfer, especially in the area of probabilistic programming. Previously, he led the development of the Haskell-based probabilistic programming language Passage, then joined Galois as technical lead for language evaluation in DARPA's PPAML program before moving to Seattle to join Metis. His blog discusses a variety of topics related to data science, with a particular focus on Bayesian modeling.

Presentations

Soss: Lightweight probabilistic programming in Julia Session

Chad Scherrer explores the basic ideas in Soss, a new probabilistic programming library for Julia. Soss allows a high-level representation of the kinds of models often written in PyMC3 or Stan, and offers a way to programmatically specify and apply model transformations like approximations or reparameterizations.

Presentations

Building and leading a successful AI practice for your organization Tutorial

Creating and leading a successful ML strategy is an elegant orchestration of many components: mastering key ML concepts, operationalizing the ML workflow, prioritizing the highest-value projects, building a high-performing team, nurturing strategic partnerships, aligning with the company's mission, and more. Rossella Blatt Vital details insights and lessons learned in creating and leading a flourishing ML practice.

Matt has been working in the enterprise infrastructure software space for 15 years in various capacities, including product management, sales engineering, and strategic alliances. A veteran of the Hadoop ecosystem since 2010, Matt is currently focused on driving cluster management and workload management technology initiatives at Cloudera. Matt holds a BS in Computer Science from the University of Virginia.

Presentations

Foundations for successful data projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Ted Malaska and Jonathan Seidman detail guidelines and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

Jim Scott is the head of developer relations, data science, at NVIDIA. He’s passionate about building combined big data and blockchain solutions. Over his career, Jim has held positions running operations, engineering, architecture, and QA teams in the financial services, regulatory, digital advertising, IoT, manufacturing, healthcare, chemicals, and geographical management systems industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).

Presentations

Problems taking AI to production and how to fix them Session

Data scientists create and test hundreds or thousands more models than in the past. Models require support from both real-time and static data sources, and as data becomes enriched and parameters are tuned and explored, there's a need for versioning everything, including the data. Jim Scott examines these specific problems and approaches to fixing them.

Paul Scott-Murphy is vice president of product management at WANdisco, where he has overall responsibility for the definition and management of WANdisco’s product strategy, the delivery of product to market, and its success. This includes the direction of the product management team, product strategy, requirements definitions, feature management and prioritization, road maps, coordination of product releases with customer and partner requirements, user testing, and feedback. Paul has built his career on technical leadership, strategy, and consulting roles for major organizations. Previously, he was the regional CTO for TIBCO Software in Asia-Pacific and Japan.

Presentations

Migrating Hadoop analytics to Spark in the cloud without disruption (sponsored by WANdisco) Session

Paul Scott-Murphy dives into the options that exist for cloud migration and their advantages and disadvantages, what cloud vendors do and don't offer to support large-scale migration, the business risks associated with large-scale cloud migration, and how to migrate analytics data at scale for immediate use in Spark without disrupting on-premises operations.

Boris Segalis, vice chair of Cooley’s cyber/data/privacy practice, has focused his practice exclusively on privacy, data protection and cybersecurity for more than 10 years. He counsels clients on a range of privacy, cybersecurity and information management issues in the context of compliance, business strategy, technology transactions, breach preparedness and response, disputes and regulatory investigations and legislative and regulatory strategy. He advises clients on information law issues within data-based products and services, big data programs, smart grid operations, marketing and advertising, corporate transactions (including M&A, private equity, buyouts, public offerings and bankruptcy), state and federal investigations and regulatory actions, cross-border data transfer, vendor management, cloud computing, technology transactions, incident and breach response and pre-response planning.
Boris represents clients in a variety of industries, ranging from startups to Fortune 100 companies. His clients include companies in the consumer products and services areas, online retailers and ecommerce, fintech, blockchain, media and entertainment, pharmaceutical, utilities, travel-related businesses, B2B SaaS technology, payment processing, and non-profit organizations.
For six consecutive years, Boris has been individually recognized by Chambers USA in the Privacy & Data Security category, and he is recognized by Chambers Global for Privacy & Data Security – USA. He was also previously recognized in Crain's NY 40 Under 40. Boris is a Certified Information Privacy Professional (CIPP/US) through the International Association of Privacy Professionals, has served as co-chair of the NYC IAPP KnowledgeNet, and has served on the IAPP's Research Board.
Before joining Cooley, Boris was US co-chair of Norton Rose Fulbright's data protection, privacy and cybersecurity practice group. Prior to Norton Rose Fulbright, he practiced at two national firms before joining InfoLawGroup LLP, where he helped develop the firm into one of the leading privacy and data security practices in the US. Boris began his career in the aerospace industry, where he worked as an engineer on the Space Shuttle and other space programs.
Boris regularly speaks on Fox News Live as an authority on privacy and data security issues. He is a co-author of the Privacy and Data Security Law Deskbook (Aspen Publishers, Wolters Kluwer Law & Business, July 2010).

Presentations

Regulations and the future of data Session

From the EU to California and China, more of the world is regulating how data can be used. Andrew Burt and Brenda Leong convene leading experts on law and data science for a deep dive into ways to regulate the use of AI and advanced analytics. Come learn why these laws are being proposed, how they’ll impact data, and what the future has in store.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Foundations for successful data projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Ted Malaska and Jonathan Seidman detail guidelines and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

Nemo joined Gro after 8+ years in engineering at Google, where he first worked for several years in web search on ranking algorithms. He was then tech lead on Google's Ad Exchange, where he led the engineering team responsible for auction algorithms and mathematical optimization. Most recently, he was tech lead for Contributor by Google, a product that he invented and led from research to launch; Contributor enables a post-ads business model for the web through market-based micropayments. Previously, Nemo was co-founder of Invisible Hand Networks, Inc., which provided real-time auction-based bandwidth exchanges for content providers and ISPs; co-founder of a VoIP startup; and an adjunct professor of electrical engineering at Columbia University.

Nemo is the author of dozens of highly cited research publications and the inventor on nine US patents, with several more pending. He obtained a PhD in electrical engineering from Columbia University, where his dissertation won the Eliahu I. Jury Award. He also holds a master's in engineering from McGill, where he specialized in control theory applied to biomechanical robots, and a BEng (Honours) in electrical engineering and math, where he graduated with distinction and a university scholar award.

Presentations

Robin Senge is a senior big data scientist on an analytics team at inovex, where he applies machine learning to optimize supply chain processes for one of the biggest groups of retailers in Germany. Robin holds an MSc in computer science and a PhD from the University of Marburg, where his research at the Computational Intelligence Lab focused on machine learning and fuzzy systems.

Presentations

From whiteboard to production: A demand forecasting system for an online grocery shop Session

Data-driven software is revolutionizing the world and enables the intelligent services we interact with daily. Robert Pesch and Robin Senge outline the development process, statistical modeling, data-driven decision making, and components needed to productionize a fully automated and highly scalable demand forecasting system for an online grocery shop belonging to a billion-dollar retail group in Europe.

Santanu Sengupta is the managing director of innovation and data science technology at Nuveen, a TIAA company, where he has pioneered the use of advanced technologies and developed innovative platforms for investment research and financial products, helping Nuveen become one of the leaders in financial technology among major asset managers. Previously, Santanu worked in the Quantitative Research Group at ITG and spent significant time at State Street. Santanu holds a BS in EE from the Indian Institute of Engineering, Science, and Technology and an MBA from Northeastern University.

Presentations

How Nuveen rapidly integrated ESG data to advance its platform value (sponsored by Zaloni) Session

Ben Sharma and Santanu Sengupta walk you through how to quickly integrate and accelerate environmental, social, and governance (ESG) data and third-party data into your environment to provide governed, trusted, and traceable data to portfolio managers and analysts in a self-service manner.

Jungwook Seo is a data platform development team leader at SK Holdings, where he spent three years developing AccuInsight+. He has over 20 years of experience as a researcher in a variety of areas, such as distributed systems, cloud computing, and big data, including three big projects in the UK. Previously, he was a researcher in cloud computing and big data for SK Holdings. His PhD focused on the grid project at the University of Manchester, and he successfully executed two more research projects as a postdoctoral researcher at Leeds and Cardiff Universities.

Presentations

Architecting a data analytics service both in the public cloud and in the on-premise private cloud: ETL, BI, and machine learning (sponsored by SK Holdings) Session

Jungwook Seo walks you through AccuInsight+, a cloud data analytics platform announced by SK Holdings in January 2019, which offers eight data analytics services on CloudZ, one of the biggest cloud service providers in Korea.

Shital Shah is a principal research engineer at Microsoft Research AI. His interests include simulation, autonomous vehicles, robotics, deep learning, and reinforcement learning. At Microsoft, he's architected, designed, and developed large-scale distributed machine learning systems. He conceived and led the development of AirSim, a physically and visually realistic cross-platform simulator for AI research. Most recently, he developed TensorWatch, a new system for debugging and visualizing machine learning. He's contributed to research and engineering in various roles at Microsoft, including technical lead, architect, engineering manager, and research engineer. Previously, he founded and led the team that designed and developed a distributed machine-learned clustering platform for web-scale data at Bing.

Presentations

Getting to know the elephant: Real-time debugging and visualization for deep learning Session

Taming massive deep learning models, data, and training times requires a new way of thinking. Shital Shah explores new tools and methods to better understand AI. Explaining the decisions made by AI not only helps us accelerate its development but also makes it safer and more trustworthy.

Nikita Shamgunov is the cofounder and CEO at MemSQL. He’s dedicated his entire career to innovation in data infrastructure. Previously, Nikita was a senior database engineer for Microsoft’s SQL Server and worked on the core data infrastructure of Facebook. He’s been awarded several patents and was a world medalist in ACM programming contests. Nikita holds a BS, MS, and PhD in computer science.

Presentations

It’s not you; it’s your database: How to unlock the full potential of your operational data (sponsored by MemSQL) Keynote

Data is now the world’s most valuable resource, with winners and losers decided every day by how well we collect, analyze, and act on data. However, most companies struggle to unlock the full value of their data, using outdated, outmoded data infrastructure. Nikita Shamgunov examines how businesses use data, the new demands on data infrastructure, and what you should expect from your tools.

Ben Sharma is founder and CEO of Zaloni. Ben is a passionate technologist with experience in solutions architecture and service delivery of big data, analytics, and enterprise infrastructure solutions and expertise ranging from development to production deployment in a wide array of technologies, including Hadoop, HBase, databases, virtualization, and storage. Previously, he held technology leadership positions at NetApp, Fujitsu, and others. Ben is the coauthor of Java in Telecommunications and Architecting Data Lakes. He holds two patents.

Presentations

How Nuveen rapidly integrated ESG data to advance its platform value (sponsored by Zaloni) Session

Ben Sharma and Santanu Sengupta walk you through how to quickly integrate and accelerate environmental, social, and governance (ESG) data and third-party data into your environment to provide governed, trusted, and traceable data to portfolio managers and analysts in a self-service manner.

Diana Shaw is a manager of the Americas artificial intelligence team at SAS. She's an analytics expert and thought leader who heads a team of data scientists creating practical, effective AI applications. She focuses on helping customers apply advanced analytics, machine learning, natural language processing, and forecasting to solve their most complex problems. Over the past 19 years, Diana has honed her skills mining data, troubleshooting problems, and developing technical solutions, including streaming IoT solutions. Diana regularly leads discussions with executives, business analysts, and data scientists around the globe, consulting on analytics strategies and broader technology objectives. She has a bachelor's in metallurgical engineering, an MBA, and a master's in analytics. Her experience spans steelmaking, banking, automotive manufacturing, bridge construction, and building controls. She's passionate about getting more girls to code and pursue careers in STEM.

Presentations

The ugly truth about making analytics actionable (sponsored by SAS) Session

Companies today are working to adopt data-driven mind-sets, strategies, and cultures. Yet the ugly truth is many still struggle to make analytics actionable. Diana Shaw outlines a simple, powerful, and automated solution to operationalize all types of analytics at scale. You'll learn how to put analytics into action while providing model governance and data scalability to drive real results.

Reza Shiftehfar leads the Hadoop platform teams at Uber, which help build and grow Uber's reliable and scalable big data platform, serving petabytes of data using technologies such as Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, and Presto. Reza is one of the founding engineers of Uber's data team and helped scale Uber's data platform from a few terabytes to over 100 petabytes while reducing data latency from 24+ hours to minutes. Reza holds a PhD in computer science from the University of Illinois at Urbana-Champaign, where he focused on building mobile hybrid cloud applications.

Presentations

Creating an extensible 100+ PB real-time big data platform by unifying storage and serving Session

Building a reliable big data platform is extremely challenging when it has to store and serve hundreds of petabytes of data in real time. Reza Shiftehfar reflects on the challenges faced and proposes architectural solutions to scale a big data platform to ingest, store, and serve 100+ PB of data with minute-level latency while efficiently utilizing the hardware and meeting security needs.

Tomer Shiran is cofounder and CEO of Dremio. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, road map, and new feature development. As a member of the executive team, he helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. Previously, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He’s the author of five US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.

Presentations

Building a best-in-class data lake on AWS and Azure Session

Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. Tomer Shiran and Jacques Nadeau explain how you can build best-in-class data lakes in the cloud, leveraging open source technologies and the cloud's elasticity to run and optimize workloads simultaneously.

Securing your cloud data lake with a "defense in depth" approach Session

With cheap and scalable storage services such as S3 and ADLS, it's never been easier to dump data into a cloud data lake. But you still need to secure that data and be sure it doesn't leak. Tomer Shiran and Jacques Nadeau explore capabilities for securing a cloud data lake, including authentication, access control, encryption (in motion and at rest), and auditing, as well as network protections.

Nagendra Shishodia is the head of analytics products for EXL, where he leads the analytics product development initiative and has written thought leadership articles on healthcare clinical solutions and AI. He has over 17 years of experience in developing advanced analytics solutions across business functions. His focus has been on developing solutions that enable better decision making through the use of machine learning, natural language processing, and big data technologies. Nagendra consults with senior executives of global firms across industries including healthcare, insurance, banking, retail, and travel. Nagendra holds an MS degree from Purdue University and a BTech from the Indian Institute of Technology Bombay.

Presentations

Improving OCR quality of documents using generative adversarial networks Session

Every NLP-based document-processing solution depends on converting scanned documents and images to machine-readable text using an OCR solution, and is limited by the quality of the scanned images. Nagendra Shishodia, Chaithanya Manda, and Solmaz Torabi explore how GANs can bring significant efficiencies to any document-processing solution by enhancing resolution and denoising scanned images.

Rosaria Silipo is a principal data scientist at KNIME. She loved data before it was big and learning before it was deep. She’s spent 25+ years in applied AI, predictive analytics, and machine learning at Siemens, Viseca, Nuance Communications, and private consulting. Rosaria shares her practical experience in a broad range of industries and deployments, including IoT, customer intelligence, financial services, and cybersecurity, and through her 50+ technical publications, including her recent ebook, Practicing Data Science: A Collection of Case Studies. Follow her on Twitter, LinkedIn, and the KNIME blog.

Presentations

Practicing data science: A collection of case studies DCS

Rosaria Silipo reviews a number of AI case studies, from classic customer intelligence to the IoT, from sentiment in social media to user graphs, from free text generation to fraud detection, and so on. You'll leave inspired to apply AI in your own domain.

Swatee Singh is the vice president of the big data platform and the first female distinguished architect of the machine learning platform at American Express, where she's spearheading machine learning transformation. Swatee is a proponent of democratizing machine learning by providing the right tools, capabilities, and talent structure to the broader engineering and data science community. The platform her team is building leverages American Express's closed-loop data to enhance its customer experience by combining artificial intelligence, big data, and the cloud, incorporating guiding pillars such as ease of use, reusability, shareability, and discoverability. Swatee also led the American Express recommendation engine roadmap and delivery for card-linked merchant offers as well as for personalized merchant recommendations. Over the course of her career, she's applied predictive modeling to a variety of problems ranging from financial services to retailers and even power companies. Previously, Swatee was a consultant at McKinsey & Company and PwC, where she supported leading businesses in retail, banking and financial services, insurance, and manufacturing, and she cofounded a medical device startup that used a business card-sized thermoelectric cooling device, implanted in the brain of someone with epilepsy, as a mechanism to stop seizures. Swatee holds a PhD focused on machine learning techniques from Duke University.

Presentations

How disruptive tech is reshaping the financial services industry Keynote

The financial services industry is increasingly using disruptive technology—including AI and machine learning, edge computing, blockchain, mobile and mixed reality, virtual assistants, and quantum computing to name a few—to enhance the customer experience and personalize their interactions with customers. Swatee Singh outlines how the same is true at American Express.

Benjamin Singleton is the director of data science and analytics at JetBlue. He leads the design and development of data and analytics solutions for multiple functional areas, establishes JetBlue's strategic roadmap for data governance, and oversees data architecture, data science, and analytics. Previously, he was the director of analytics at the New York Police Department, where he focused on building data platform capabilities and data products to support operational and strategic needs.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders must deliver measurable impact on an increasing share of an enterprise’s KPIs. The speakers explore how leading organizations take a holistic approach to people, process, and technology to build a sustainable advantage.

Siva Sivakumar is a senior director in the Data Center Business Group at Cisco. Siva drives the strategy and execution of data center solutions, which include converged and hyperconverged stacks, enterprise apps, analytics and AI, and cloud solutions.

Presentations

Cisco Data Intelligence Platform (sponsored by Cisco) Keynote

Siva Sivakumar explains the Cisco Data Intelligence Platform (CDIP), a cloud-scale architecture that brings together big data, AI and compute farm, and storage tiers to work as a single entity, while also being able to scale independently to address the IT issues in the modern data center.

Alan Smith is the head of visual and data journalism at the Financial Times. A data visualization specialist, he writes the _FT_’s popular “Chart Doctor” column. Alan is an experienced presenter, having lectured extensively on how to communicate with data. His TEDx talk, “Why you should love statistics,” was a TED.com featured talk in 2017. Previously, he worked at the UK’s Office for National Statistics, where he founded its award-winning Data Visualisation Centre. Alan received a BA in geography from the University of Lancaster and holds an MSc in GIS from Salford University. He was appointed Officer of the Order of the British Empire (OBE) in Queen Elizabeth II’s 2011 Birthday Honours list.

Presentations

Data sonification: Making music from the yield curve Keynote

Based on a critical evaluation of the iconic yield curve chart, Alan Smith argues that combining visualization (data to pixels) with sonification (data to pitch) offers potential to improve not only aesthetic multimedia experiences but also an opportunity to take the presentation of data into the rapidly expanding universe of screenless devices and products.

Karthik Sonti is a partner solution architect at AWS, where he works with GSIs to help accelerate adoption of AWS services with a focus on analytics and machine learning.

Presentations

Building a recommender system with Amazon ML services Tutorial

Karthik Sonti, Emily Webber, and Varun Rao Bhamidimarri introduce you to the Amazon SageMaker machine learning platform and provide a high-level discussion of recommender systems. You'll dig into different machine learning approaches for recommender systems, including common methods such as matrix factorization as well as newer embedding approaches.

Tim Spann is a field engineer on the data in motion team at Cloudera. Previously, he was a senior solutions architect at airisDATA, working with Apache Spark and machine learning; a senior software engineer at SecurityScorecard, helping to build a reactive platform in Java and Scala for monitoring real-time third-party vendor security risk; and a senior field engineer for Pivotal, focusing on Cloud Foundry, HAWQ, and big data. He's an avid blogger and the big data zone leader for DZone. He runs the very successful Future of Data: Princeton meetup, with over 1,192 members. You can find all the source and material behind his talks on his GitHub and community blog: https://github.com/tspannhw and https://community.hortonworks.com/users/9304/tspann.html.

Presentations

Cloudera Edge Management in the IoT Tutorial

There are too many edge devices and agents, and you need to control and manage them. Purnima Reddy Kuchikulla, Timothy Spann, Abdelkrim Hadjidj, and Andre Araujo walk you through handling the difficulty in collecting real-time data and the trouble with updating a specific set of agents with edge applications. Get your hands dirty with CEM, which addresses these challenges with ease.

Ann Spencer is the head of content at Domino. She’s responsible for ensuring Domino’s data science content provides a high degree of value, density, and analytical rigor that sparks respectful candid public discourse from multiple perspectives, discourse that’s anchored in the intention of helping accelerate data science work. Previously, she was the data editor at O’Reilly (2012–2014), focusing on data science and data engineering.

Presentations

Data science versus engineering: Does it really have to be this way? Session

If, as a data scientist, you've wondered why it takes so long to deploy your model into production or, as an engineer, thought data scientists have no idea what they want, you're not alone. Join a lively discussion with industry veterans Ann Spencer, Paco Nathan, Amy Heineike, and Chris Wiggins to find best practices or insights on increasing collaboration when developing and deploying models.

Rajeev Srinivasan is a senior solutions architect at AWS, where he works closely with customers to provide big data and NoSQL solutions leveraging the AWS platform. He enjoys coding, and in his spare time he rides his motorcycle and reads books.

Presentations

From relational databases to cloud databases: Using the right tool for the right job Tutorial

Enterprises adopt cloud platforms such as AWS for agility, elasticity, and cost savings. Database design and management requires a different mindset in AWS when compared to traditional RDBMS design. Gowrishankar Balasubramanian and Rajeev Srinivasan explore considerations in choosing the right database for your use case and access pattern while migrating or building a new application on the cloud.

Michael Stonebraker is the cofounder and chief technology officer at Tamr, an adjunct professor at MIT CSAIL, and a database pioneer who has been involved with PostgreSQL, SciDB, Vertica, VoltDB, Tamr, and other database companies. He coauthored the paper “Data Curation at Scale: The Data Tamer System,” presented at the Conference on Innovative Data Systems Research (CIDR’13).

Presentations

Executive Briefing: Top 10 big data blunders Session

As a steward for your enterprise’s data and digital transformation initiatives, you’re tasked with making the right choice. But before you can make those decisions, it’s important to understand what not to do when planning for your organization’s big data initiatives. Michael Stonebraker shares his top 10 big data blunders.

Wim Stoop is a senior product marketing manager at Cloudera.

Presentations

Sharing is caring: Using Egeria to establish true enterprise metadata governance Session

Establishing enterprise-wide security and governance remains a challenge for most organizations. Integrations and exchanges across the landscape are costly to manage and maintain, and typically work in one direction only. Wim Stoop and Srikanth Venkat explore how ODPi's Egeria standard and framework removes the challenges and is leveraged by Cloudera and partners alike to deliver value.

Bargava Subramanian is a cofounder and machine learning engineer of the boutique AI firm Binaize Labs in Bangalore, India. He has 15 years’ experience delivering business analytics and machine learning solutions to B2B companies, and he mentors organizations in their data science journey. He holds a master’s degree from the University of Maryland, College Park. He’s an ardent NBA fan.

Presentations

Recommendation systems using deep learning 2-Day Training

Recommendation systems play a significant role: for users, they open a new world of options; for companies, they drive engagement and satisfaction. Amit Kapoor and Bargava Subramanian walk you through the different paradigms of recommendation systems and introduce you to deep learning-based approaches. You'll gain the practical hands-on knowledge to build, select, deploy, and maintain a recommendation system.

Recommendation systems using deep learning (Day 2) Training Day 2

Recommendation systems play a significant role: for users, they open a new world of options; for companies, they drive engagement and satisfaction. Amit Kapoor and Bargava Subramanian walk you through the different paradigms of recommendation systems and introduce you to deep learning-based approaches. You'll gain the practical hands-on knowledge to build, select, deploy, and maintain a recommendation system.

Aaron Swanson has more than 15 years of experience leading teams in the software industry. Prior to joining Talend in 2016, he held diverse positions ranging from tech support management to program management. For the past 7 years, Aaron has focused on the enterprise analytics space, helping customers get value from analytics projects and solutions. When he's not improving the life of Talend customers, you can spot Aaron on his bike or coaching his kids' ski racing team in Minnesota.

Presentations

ALDO’s data strategy to create the right customer experience for its consumers (sponsored by Talend) Session

Winning the hearts and minds of millennials and Gen Z is not an easy task. ALDO has devised a data-driven strategy to create the best consumer experience. Today ALDO relies on Talend and AWS. Aaron Swanson explains the choices made for its data architecture and the hurdles the teams had to solve to turn the vision into reality.

Peter Swartz is cofounder and CTO at Altana Trade, an artificial intelligence partnership unlocking the power of global economic data to make trade safer, more efficient, and more profitable. Previously, Peter led the data science team at Panjiva—ranked as one of Fast Company’s most innovative companies in 2018 and acquired by S&P Global.

Presentations

Shared Artificial Intelligence without data sharing: Data privacy in a hub and spoke architecture for modeling and managing global flows Findata

Peter Swartz demonstrates a “hub-and-spoke” model that can facilitate secure and private accessing, linking, and generating insight from data describing global flows. These hub-and-spoke infrastructures benefit from advances in software containerization and deployment across customer-specific virtual private clouds.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe with Microsoft’s Bing Group and built and ran distributed teams that helped scale Amazon’s financial systems with Amazon in both Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby outlines real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with Spark NLP Tutorial

David Talby, Alex Thomas, Saif Addin Ellafi, and Claudiu Branzan walk you through state-of-the-art natural language processing (NLP) using the highly performant, highly scalable open source Spark NLP library. You'll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Wangda Tan is a project management committee (PMC) member of Apache Hadoop and an engineering manager of the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-premises use cases of Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He's also led efforts in the Hadoop YARN community on features like resource scheduling, GPU isolation, node labeling, and resource preemption. Previously, he worked on the integration of OpenMPI and GraphLab with Hadoop YARN at Pivotal and participated in creating a large-scale machine learning, matrix, and statistics computation program using MapReduce and MPI at Alibaba.

Presentations

Apache Hadoop 3.x state of the union and upgrade guidance Session

Wangda Tan and Wei-Chiu Chuang outline the current status of the Apache Hadoop community and dive into the present and future of Hadoop 3.x. You'll get a peek at new features like erasure coding, GPU support, NameNode federation, Docker, long-running services support, powerful container placement constraints, and data node disk balancing, and they'll walk you through upgrade guidance from 2.x to 3.x.

Harry Tang is a senior solutions architect in the network management services and enterprise data support systems organization at Cox Communications. He has more than 20 years of experience in the telecom industry and more than 10 years of experience in large systems architecture and solutions design. His software development expertise spans numerous areas, such as J2EE, SOA, messaging, app server containers, databases (relational database management systems (RDBMS) and columnar NoSQL), Hadoop big data lakes, Kubernetes and Docker containers, and telco OSS. His recent work involves building enterprise data ecosystems in the AWS Cloud. Previously, he was at AT&T, focusing on telecom OSS systems solution design and architecture. He has numerous patents granted by the US Patent and Trademark Office. He holds a PhD in physics and an MS in computer science.

Presentations

Secured computation: Analyzing sensitive data using homomorphic encryption Session

Organizations often work with sensitive information such as social security and credit card numbers. Although this data is stored in encrypted form, most analytical operations require data decryption for computation. This creates unwanted exposure to theft or unauthorized access. Matt Carothers, Jignesh Patel, and Harry Tang explain how homomorphic encryption prevents fraud.

James Tang is a senior director of engineering at Walmart Labs. He’s spent time creating large-scale, resilient, and distributed architectures with high security and high performance for enterprise applications, web applications, online payments, online games, and real-time predictive analytics applications. While enthusiastic about technologies, he enjoys mentoring, training and leading teams to be successful with distributed systems concepts, microservices, DevOps, and cloud native application design.

Presentations

Machine learning and large-scale data analysis on a centralized platform Session

James Tang, Yiyi Zeng, and Linhong Kang outline how Walmart provides a secure and seamless shopping experience through machine learning and large-scale data analysis on a centralized platform.

James Terwilliger is a principal software development engineer at Microsoft, where he’s a 10-year veteran, having spent time on both product and research teams. He began as an intern during the last year of his PhD research at Portland State University. His background is in innovative data query and exploration interfaces and streaming data processing. At Microsoft, he helped develop the PowerQuery extension to Excel that is now the default data tab there, and now works on the Trill temporal data engine. Whatever he works on, he finds a way to add Pivot and Unpivot to it.

Presentations

Trill: The crown jewel of Microsoft’s streaming pipeline explained Session

Trill has been open-sourced, making the streaming engine behind services like the Bing Ads platform available for all to use and extend. James Terwilliger, Badrish Chandramouli, and Jonathan Goldstein dive into the history of and insights from streaming data at Microsoft. They demonstrate how its API can power complex application logic and the performance that gives the engine its name.

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

David Talby, Alex Thomas, Saif Addin Ellafi, and Claudiu Branzan walk you through state-of-the-art natural language processing (NLP) using the highly performant, highly scalable open source Spark NLP library. You'll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Robert D. Thomas is the general manager of IBM Data and Artificial Intelligence. He directs IBM’s product investment strategy, sales and marketing, expert labs, and global software product development. With a portfolio of over 1,000 products, IBM has emerged as a leader in data and AI, spanning databases, data integration and governance, business intelligence, planning, data science and AI tools, and AI applications. Major product brands under Rob’s leadership include Watson, DB2, Netezza, Cognos, SPSS, and InfoSphere. Since joining IBM’s software unit, Rob has held roles of increasing responsibility, including business development, product engineering, sales and marketing, and general management, and he’s overseen four acquisitions by the firm representing over $2.5 billion in transaction value. At IBM, he worked in technology and strategy consulting, first in Atlanta and then in New York; founded IBM’s practice advising companies moving onto the internet, where he executed and advised on strategy and operations at a variety of companies including Motorola, AXA, and Sanmina; joined IBM’s high-technology business to build an engineering services organization focused on semiconductor design, embedded software, and systems development; led business and manufacturing operations in Asia Pacific while living in Tokyo, Japan; and joined IBM’s software business focused on data and analytics, where he held a variety of roles and led IBM’s transition from core databases to delivering broader analytical capabilities and eventually artificial intelligence. Rob’s first book, Big Data Revolution: What Farmers, Doctors, and Insurance Agents Can Teach Us About Patterns in Big Data (Wiley), was published in 2015.
The Financial Times called the book “interesting as a case study of the philosophical assumptions that underpin the growing obsession with data.” His second book, The End of Tech Companies, was published in 2016, educating business leaders on how to navigate digital disruption in every industry. Today, he writes extensively on his blog, Robdthomas.com. Rob serves on the board of Domus (Stamford, CT), which assists underprivileged children in Fairfield County. He lives in New Canaan, Connecticut, with his wife and three children. He was born in Florida and earned a BA in economics from Vanderbilt University. While earning his MBA at the University of Florida, he worked in equity research, learning applied economics, finance, and financial analysis.

Presentations

AI isn't magic. It’s computer science. Keynote

AI has the potential to add $16 trillion to the global economy by 2030, but adoption has been slow. While we understand the power of AI, many of us aren’t sure how to fully unleash its potential. Join Robert Thomas and Tim O'Reilly to learn why AI isn't magic; it’s hard work.

Long Tian is a software engineering manager on the Microsoft big data analytics team, where he focuses on building the developer experience (authoring, debugging, continuous integration, and monitoring) for cloud big data services, including Spark, Hive, and Azure Data Lake.

Presentations

Using Spark to speed up the diagnosis performance for big data applications Session

Ruixin Xu, Long Tian, and Yu Zhou explore an experiment using Spark and Jupyter notebooks as a replacement for existing IDE-based tools for internal DevOps. The Spark-based solution improved diagnosis performance significantly, especially for complex jobs with large profiles, and leveraging Jupyter notebooks brings the benefits of fast iteration and easy knowledge sharing.

Moto Tohda is the vice president of information systems at Tokyo Century (USA), where he oversees all IT for the US operations of Tokyo Century Corporation. Moto wears many hats and always strives to balance business objectives with IT initiatives. His past IT initiatives include security, disaster recovery, and many implementations of ERP-related systems. His newest endeavor is to bring data back to users through data visualization and user-driven predictive analysis models.

Presentations

Democratization of data science: Using machine learning to build credit risk models Findata

Tokyo Century was ready for a change. Credit risk decisions were taking too long and the home office was taking notice. The company needed a full stack data solution to increase the speed of loan authorizations, and it needed it quickly. Moto Tohda explains how Tokyo Century put data at the center of its credit risk decision making and removed institutional knowledge from the process.

Meir Toledano is a data scientist and algorithm engineer at Anodot. He’s an engineer and entrepreneur, having studied and started his career in Paris, France. Previously, he was an aeronautic engineer, developed trading algorithms and risk models in the financial industry, and worked in the internet and high-tech industries.

Presentations

Lightning-fast time series modeling and prediction: (S)ARIMA on steroids Session

ARIMA has been used for time series modeling for decades. In practice, most time series collected from human activities exhibit seasonal patterns, yet efficient estimation of seasonal ARIMA ((S)ARIMA) models remained out of reach for decades. Meir Toledano explains how Anodot was able to apply the technique for forecasting and anomaly detection for millions of time series every day.

Solmaz Torabi is a data scientist at EXL, where she’s responsible for building image and text analytics models using deep learning methods to extract information from images and documents. She holds a PhD in electrical and computer engineering from Drexel University.

Presentations

Improving OCR quality of documents using generative adversarial networks Session

Every NLP-based document-processing solution depends on converting scanned documents and images to machine-readable text with OCR, and is limited by the quality of the scanned images. Nagendra Shishodia, Chaithanya Manda, and Solmaz Torabi explore how GANs can bring significant efficiencies to any document-processing solution by enhancing resolution and denoising scanned images.

Steve Touw is the cofounder and CTO of Immuta. Steve has a long history of designing large-scale geotemporal analytics across the US intelligence community, including some of the very first Hadoop analytics, as well as frameworks to manage complex multitenant data policy controls. He and his cofounders at Immuta drew on this real-world experience to build a software product to make data security and privacy controls easier. Previously, Steve was the CTO of 42six (acquired by Computer Sciences Corporation), where he led a large big data services engineering team. Steve holds a BS in geography from the University of Maryland.

Presentations

Data security and privacy anti-patterns Session

Anti-patterns are behaviors that turn bad problems into even worse solutions. In the world of data security and privacy, they’re everywhere. Over the past four years, data security and privacy anti-patterns have emerged across hundreds of customers and industry verticals, revealing an obvious trend. Steve Touw details five anti-patterns and, more importantly, the solutions for them.

Jonathan Tudor is a director of data and analytics at GE Aviation, where he leads and builds both the self-service data program and the data governance program. He has a background in big data ETL, data architecture, cloud, security, networking, compliance, data warehousing, supply chain operations, engineering IT, business intelligence, and analytics. He’s passionate about self-service data, data governance, big data, digital cultural transformation, business, and leadership.

Presentations

Executive Briefing: Building a culture of self-service from predeployment to continued engagement Session

Jonathan Tudor and Ross Schalmo explore how GE Aviation made it a mission to implement self-service data. To ensure success beyond initial implementation of tools, the data engineering and analytics teams created initiatives to foster engagement from an ongoing partnership with each part of the business to the gamification of tagging data in a data catalog to forming a published dataset council.

Giovanni Tummarello is the founder and chief product officer at Siren. Previously, he was an academic team lead at the National University of Ireland Galway. He has over 100 scholarly works on knowledge graphs, semantic technologies, and information retrieval, and several startups and active open source projects have grown out of his research, including the top-level Apache Any23 project and the companies Spaziodati and Siren. He holds a PhD.

Presentations

Supercharging Elasticsearch for extended Knowledge Graph use cases Session

Elasticsearch (ES) allows extremely quick search and drilldowns on large amounts of semistructured data. Elasticsearch, however, does not have relational join capabilities. Giovanni Tummarello examines a plug-in for ES that adds cluster distributed joins and demonstrates how it enables an exciting array of use cases dealing with interconnected or "Knowledge Graph" enterprise data.

Naoto Umemori is a senior infrastructure engineer and deputy manager at NTT DATA, working in the technology and innovation area. He’s spent around 10 years in the platform and infrastructure field, focusing mainly on the open source software technology stack.

Presentations

Deep learning technologies for giant hogweed eradication Session

Giant hogweed is a highly toxic plant. Naoto Umemori and Masaru Dobashi aim to automate the process of detecting the plant with technologies like drones and image recognition and detection using machine learning. You'll see how they designed the architecture, took advantage of big data and machine and deep learning technologies (e.g., Hadoop, Spark, and TensorFlow), and the lessons they learned.

Sandeep Uttamchandani is a chief data architect at Intuit, where he leads the cloud transformation of the big data analytics, ML, and transactional platform used by 4M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep was cofounder and CEO of a machine learning startup focused on ML for managing enterprise systems and played various engineering roles at VMware and IBM. His experience uniquely combines building enterprise data products with operational expertise in managing petabyte-scale data and analytics platforms in production. He’s received several excellence awards and has over 40 issued patents and 25 publications in key systems conferences such as the International Conference on Very Large Data Bases (VLDB), Special Interest Group on Management of Data (SIGMOD), Conference on Innovative Data Systems Research (CIDR), and USENIX. He’s a regular speaker at academic institutions, guest lectures for university courses, and conducts conference tutorials for data engineers and scientists; he also advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He holds a PhD in computer science from the University of Illinois Urbana-Champaign.

Presentations

Time travel for data pipelines: Solving the mystery of what changed Session

A business insight shows a sudden spike. It can take hours, or days, to debug data pipelines to find the root cause. Shradha Ambekar, Sunil Goplani, and Sandeep Uttamchandani outline how Intuit built a self-service tool that automatically discovers data pipeline lineage and tracks every change, helping debug the issues in minutes—establishing trust in data while improving developer productivity.

Srikanth Venkat is a senior director of product management at Cloudera.

Presentations

Sharing is caring: Using Egeria to establish true enterprise metadata governance Session

Establishing enterprise-wide security and governance remains a challenge for most organizations. Integrations and exchanges across the landscape are costly to manage and maintain, and typically work in one direction only. Wim Stoop and Srikanth Venkat explore how ODPi's Egeria standard and framework removes these challenges and is leveraged by Cloudera and partners alike to deliver value.

Presentations

Running multidisciplinary big data workloads in the cloud with CDP Tutorial

Organizations now run diverse, multidisciplinary, big data workloads that span data engineering, data warehousing, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long running in nature. There are many challenges with moving these workloads to the cloud. In this talk we start off with a technical deep...

Gil Vernik is a researcher in the Storage Clouds, Security, and Analytics Group at IBM, where he works with Apache Spark, Hadoop, object stores, and NoSQL databases. Gil has more than 25 years of experience as a code developer on both the server side and client side and is fluent in Java, Python, Scala, C/C++, and Erlang. He holds a PhD in mathematics from the University of Haifa and held a postdoctoral position in Germany.

Presentations

Your easy move to serverless computing and radically simplified data processing Session

Most analytic flows can benefit from serverless, starting with simple cases and moving to complex data preparations for AI frameworks like TensorFlow. To address the challenge of how to easily integrate serverless without major disruptions to your system, Gil Vernik explores the “push to the cloud” experience, which dramatically simplifies serverless for big data processing frameworks.

Evgeny Vinogradov is the head of data warehouse development at Yandex.Money, where he and his team are responsible for data engineering, antifraud systems development, and business intelligence. Previously, he spent 20 years in IT development in areas ranging from outsourced CAD systems to fintech. He earned his PhD from the Applied Mathematics Department of Saint Petersburg State University.

Presentations

Scaling data engineers Session

With a microservice architecture, a data warehouse is the first place where all the data meets. It's supplied by many different data sources and used for many purposes, from near-online transactional processing (OLTP) to model fitting and real-time classifying. Evgeny Vinogradov details his experience in managing and scaling data in support of 20+ product teams.

Jordan Volz is a senior data scientist at Dataiku, where he helps customers design and implement ML applications. Previously, Jordan specialized in big data technologies as a systems engineer at Cloudera and enterprise search technology as a technical consultant at Autonomy, frequently working with large financial organizations in the US and Canada. He holds degrees from Bard College and the University of Amherst, and he’s academically trained in pure mathematics.

Presentations

Spark on Kubernetes for data science Session

Spark on Kubernetes is a winning combination for data science that stitches together a flexible platform harnessing the best of both worlds. Jordan Volz gives a brief overview of Spark and Kubernetes, the Spark on Kubernetes project, why it’s an ideal fit for data scientists who may have been dissatisfied with other iterations of Spark in the past, and some applications.

Naghman Waheed is the data platforms lead at Bayer Crop Science, where he’s responsible for defining and establishing enterprise architecture and direction for data platforms. Naghman is an experienced IT professional with over 25 years of work devoted to the delivery of data solutions spanning numerous business functions, including supply chain, manufacturing, order to cash, finance, and procurement. Throughout his 20+ year career at Bayer, Naghman has held a variety of positions in the data space, ranging from designing several large-scale data warehouses to defining a data strategy for the company and leading various data teams. His broad range of experience includes managing global IT data projects, establishing enterprise information architecture functions, defining enterprise architecture for SAP systems, and creating numerous information delivery solutions. Naghman holds a BA in computer science from Knox College, a BS in electrical engineering from Washington University, an MS in electrical engineering and computer science from the University of Illinois, and an MBA and a master’s degree in information management, both from Washington University.

Presentations

Finding your needle in a haystack Session

As the complexity of data systems at Bayer has grown, so has the difficulty of locating and understanding what datasets are available for consumption. Naghman Waheed and John Cooper outline a custom metadata management tool recently deployed at Bayer. The system is cloud-enabled and uses multiple open source components, including machine learning and natural language processing, to aid searches.

Todd Walter is chief technologist and fellow at Teradata, where he helps business leaders, analysts, and technologists better understand all of the astonishing possibilities of big data and analytics in view of emerging and existing capabilities of information infrastructures. Todd has been with Teradata for more than 30 years. He’s a sought-after speaker and educator on analytics strategy, big data architecture, and exposing the virtually limitless business opportunities that can be realized by architecting with the most advanced analytic intelligence platforms and solutions. Todd holds more than a dozen patents.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Dean Wampler is an expert in streaming data systems, focusing on applications of ML/AI. Formerly, he was the vice president of fast data engineering at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent Strata speaker, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He holds a PhD in physics from the University of Washington.

Presentations

Executive Briefing: What it takes to use machine learning in fast data pipelines Session

Dean Wampler dives into how (and why) to integrate ML into production streaming data pipelines and to serve results quickly; how to bridge data science and production environments with different tools, techniques, and requirements; how to build reliable and scalable long-running services; and how to update ML models without downtime.

Hands-on machine learning with Kafka-based streaming pipelines Tutorial

Boris Lublinsky and Dean Wampler examine ML use in streaming data pipelines, how to do periodic model retraining, and low-latency scoring in live streams. Learn about Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, metadata tracking, and more.

Carson Wang is a big data software engineer at Intel, where he focuses on developing and improving new big data technologies. He’s an active open source contributor to the Apache Spark and Alluxio projects as well as a core developer and maintainer of HiBench, an open source big data microbenchmark suite. Previously, Carson worked for Microsoft on Windows Azure.

Presentations

Improving Spark by taking advantage of disaggregated architecture Session

Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumptions of collocated storage do not always hold in today’s data centers. Chenzhao Guo and Carson Wang outline the implementation of a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends, making life easier for customers.

Fei Wang is a senior data scientist and statistician at CarGurus. His work primarily involves experimental design and causal inference modeling for online and TV advertising. Fei’s research includes statistical machine learning, matrix factorization, optimization, and high-dimensional data modeling. Fei holds a PhD in biostatistics from the University of Michigan.

Presentations

Building a machine learning framework to measure TV advertising attribution Session

Fei Wang takes a deep dive into a case study for the CarGurus TV Attribution Model. You'll understand how you can leverage the creation of a causal inference model to calculate cost per acquisition (CPA) of TV spend and measure effectiveness when compared to CPA of digital performance marketing spend.

Peter Wang is the cofounder and CTO of Anaconda, where he leads the product engineering team for the Anaconda platform and open source projects including Bokeh and Blaze. Peter’s been developing commercial scientific computing and visualization software for over 15 years and has software design and development experience across a broad variety of areas, including 3-D graphics, geophysics, financial risk modeling, large data simulation and visualization, and medical imaging. As a creator of the PyData conference, he also devotes time and energy to growing the Python data community by advocating, teaching, and speaking about Python at conferences worldwide. Peter holds a BA in physics from Cornell University.

Presentations

Data science isn't just another job (sponsored by Anaconda) Session

Peter Wang explores why data science shouldn’t be seen as merely another technical job within the business and why open source is such a critical aspect of innovation in the field of data science.

Sophie Watson is a senior data scientist in the emerging technology group at Red Hat, where she applies her data science and statistics skills to solving business problems and informing next-generation infrastructure for intelligent application development. She has a background in mathematics and holds a PhD in Bayesian statistics, in which she developed algorithms to estimate intractable quantities quickly and accurately.

Presentations

Sketching data and other magic tricks Tutorial

Go hands-on with Sophie Watson and William Benton to examine data structures that let you answer interesting queries about massive datasets in fixed amounts of space and constant time. This seems like magic, but they'll explain the key trick that makes it possible and show you how to use these structures for real-world machine learning and data engineering applications.

Emily Webber is a machine learning specialist solutions architect at Amazon Web Services (AWS). She guides customers from project ideation to full deployment, focusing on Amazon SageMaker, where her customers are household names across the world, such as T-Mobile. She’s been leading data science projects for many years, piloting the application of machine learning into such diverse areas as social media violence detection, economic policy evaluation, computer vision, reinforcement learning, the IoT, drones, and robotic design. Previously, she was a data scientist at the Federal Reserve Bank of Chicago and a solutions architect for an explainable AI startup in Chicago. Her master’s degree is from the University of Chicago, where she developed new applications of machine learning for public policy research with the Data Science for Social Good Fellowship.

Presentations

Alexa, do men talk too much? Session

Mansplaining. Know it? Hate it? Want to make it go away? Sireesha Muppala, Shelbee Eigenbrode, and Emily Webber tackle the problem of men talking over or down to women and its impact on career progression for women. They also demonstrate an Alexa skill that uses deep learning techniques on incoming audio feeds, examine ownership of the problem for women and men, and suggest helpful strategies.

Building a recommender system with Amazon ML services Tutorial

Karthik Sonti, Emily Webber, and Varun Rao Bhamidimarri introduce you to the Amazon SageMaker machine learning platform and provide a high-level discussion of recommender systems. You'll dig into different machine learning approaches for recommender systems, including common methods such as matrix factorization as well as newer embedding approaches.

Andreas Wesselmann is the senior vice president of SAP products and innovations big data at SAP. He leads the development organization for SAP Data Hub and SAP Data Intelligence. His development leadership roles have included cloud and on-premises integration, access management, and data orchestration topics.

Presentations

Bringing together machine and human intelligence in business applications at enterprise scale (sponsored by SAP) Session

Oftentimes there's a fracture between the highly governed data of enterprise IT systems and the comprehensive but often ungoverned world of large-scale data lakes and streams of data from blogs, system logs, sensors, IoT devices, and more. Kevin Poskitt and Andreas Wesselmann walk you through how AI needs to connect to all of this data, as well as image, video, audio, and text data sources.

Presentations

War stories from the front lines of ML Session

Machine learning techniques are being deployed across almost every industry and sector. But this adoption comes with real, and oftentimes underestimated, privacy and security risks. Andrew Burt and Brenda Leong convene a panel of experts including David Florsek, Chris Wheeler, and Alex Beutel to detail real-life examples of when ML goes wrong, and the lessons they learned.

Alfred Whitehead is the senior vice president of data science at Klick, where he’s responsible for the delivery of data science solutions and oversees a team of data scientists and AI researchers. He brings over 15 years of experience in data science, software development, and high-performance computing to the Klick team, combining his scientific background with an appreciation of the craft of code writing. Previously, he was an information security officer, technology vice president, and acting chief technology officer. He holds two master’s degrees in physical sciences, including thesis work in computational astrophysics, and is also a certified information systems security professional (CISSP).

Presentations

Handling data gaps in time series using imputation Session

Time series forecasts depend on sensors or measurements made in the real, messy world. The sensors flake out, get turned off, disconnect, and otherwise conspire to cause missing signals. Signals that may tell you what tomorrow's temperature will be or what your blood glucose levels are before bed. Alfred Whitehead and Clare Jeon explore methods for handling data gaps and when to consider which.

Chris Wiggins is an associate professor of applied mathematics at Columbia University and the chief data scientist at the New York Times. At Columbia, he's a founding member of the executive committee of the Data Science Institute and of the Department of Systems Biology and is affiliated faculty in statistics.

Presentations

Data science versus engineering: Does it really have to be this way? Session

If, as a data scientist, you've wondered why it takes so long to deploy your model into production or, as an engineer, thought data scientists have no idea what they want, you're not alone. Join a lively discussion with industry veterans Ann Spencer, Paco Nathan, Amy Heineike, and Chris Wiggins to find best practices or insights on increasing collaboration when developing and deploying models.

Paul Wolmering is vice president of worldwide sales engineering for Actian’s hybrid data solution. Paul has over 30 years of experience in the enterprise data ecosystem, including parallel databases, distributed computing, big data, cloud computing, and supporting production systems. Previously, he led field engineering teams for Informix, Netezza, ParAccel, Pivotal, Cazena, and others.

Presentations

Next-generation serverless data architecture for insights at the speed of thought (sponsored by Actian) Session

Paul Wolmering explores the key characteristics for building an Agile data warehouse and defines a reference architecture for hybrid data.

Tony Wu is an engineering manager at Cloudera, where he manages the Altus core engineering team. Previously, Tony was a team lead for the partner engineering team at Cloudera. He’s responsible for Microsoft Azure integration for Cloudera Director.

Presentations

Kafka and Streams Messaging Manager (SMM) crash course Tutorial

Kafka is omnipresent and the backbone of streaming analytics applications and data lakes. The challenge is understanding what's going on overall in the Kafka cluster, including performance, issues, and message flows. Purnima Reddy Kuchikulla and Dan Chaffelson walk you through a hands-on experience to visualize the entire Kafka environment end-to-end and simplify Kafka operations via SMM.

Vincent Xie (谢巍盛) is the chief data scientist and senior director at Orange Financial, where, as head of the AI lab, he built the big data and artificial intelligence team from scratch, established the company's big data and AI infrastructure, and landed numerous businesses on top of it; this thorough data-driven transformation strategy has boosted the company's total revenue many times over. Previously, he worked at Intel for about eight years, mainly on machine learning- and big data-related open source technologies and products.

Presentations

How Orange Financial combats financial fraud over 50M transactions a day using Apache Pulsar Session

Orange Financial, the fintech arm of China Telecom, serves half a billion registered users and 41 million monthly active users, making risk control decision deployment critical to its success. Weisheng Xie and Jia Zhai explore how the company leverages Apache Pulsar to boost the efficiency of its risk control decision development, combating financial fraud across more than 50 million transactions a day.

Tony Xing is a senior product manager on the AI, data, and infrastructure (AIDI) team within Microsoft’s AI and Research Organization. Previously, he was a senior product manager on the Skype data team within Microsoft’s Application and Service Group, where he worked on products for data ingestion, real-time data analytics, and the data quality platform.

Presentations

Introducing a new anomaly detection algorithm (SR-CNN) inspired by computer vision Session

Anomaly detection may sound old-fashioned, yet it's super important in many industry applications. Tony Xing, Congrui Huang, Qiyang Li, and Wenyi Yang detail a novel anomaly-detection algorithm based on spectral residual (SR) and convolutional neural networks (CNNs) and explain how the method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.

Leah Xu is a software engineer at Spotify, where she works on analytics for marketers, real-time streaming infrastructure, and Spotify Wrapped. Previously, Leah worked on data infrastructure and secure deployments in the cloud at Bridgewater and Nest.

Presentations

Driving adoption of data DCS

Often, the difference between a successful data initiative and a failed one isn't the data or the technology but its adoption by the wider business. Every business wants the magic of data, yet many fail to properly embrace and harness it. The panelists explore the factors they've seen lead to success and failure in getting companies to use data products.

Spotify Wrapped: Product, design, and deadlines DCS

Spotify Wrapped is a "year in music" for active consumers and artists. Wrapped surfaces nostalgic insights derived from dozens of petabytes of user listening data. Leah Xu sheds light on how Spotify creates Wrapped for hundreds of millions of users in an ecosystem ingesting millions of events per second and discusses the engineering trade-offs given demanding requirements and stringent deadlines.

Ruixin Xu is a senior program manager on the Azure big data team at Microsoft. Her focus areas include product design and project management, development experience in big data platforms, the software development tool chain, and Software as a Service (SaaS) offerings.

Presentations

Using Spark to speed up the diagnosis performance for big data applications Session

Ruixin Xu, Long Tian, and Yu Zhou explore an experiment using Spark and Jupyter notebooks as a replacement for existing IDE-based tools in internal DevOps. The Spark-based solution improved diagnosis performance significantly, especially for complex jobs with large profiles, and the Jupyter notebooks bring the benefits of fast iteration and easy knowledge sharing.

Presentations

Cloudera Edge Management in the IoT Tutorial

There are too many edge devices and agents, and you need to control and manage them. Purnima Reddy Kuchikulla, Timothy Spann, Abdelkrim Hadjidj, and Andre Araujo walk you through handling the difficulty in collecting real-time data and the trouble with updating a specific set of agents with edge applications. Get your hands dirty with CEM, which addresses these challenges with ease.

Bo Yang is a software engineer at Uber.

Presentations

How to performance-tune Spark applications in large clusters Session

Omkar Joshi and Bo Yang offer an overview of how Uber's ingestion (Marmaray) and observability teams improved the performance of Apache Spark applications running on thousands of cluster machines and across hundreds of thousands of applications, and how the teams methodically tackled these issues. They also cover how they used Uber's open source jvm-profiler to debug issues at scale.

Han Yang is a senior product manager at Cisco, where he drives UCS solutions for artificial intelligence and machine learning. And he’s always enjoyed driving technologies. Previously, Han drove the big data and analytics UCS solutions and the largest switching beta at Cisco with the software virtual switch, Nexus 1000V. Han has a PhD in electrical engineering from Stanford University.

Presentations

Operationalizing AI and ML with Cisco Data Intelligence Platform (sponsored by Cisco) Session

Artificial intelligence and machine learning are well beyond the laboratory exploratory stage of deployment. In fact, the speed of AI and ML deployment has a huge impact on an organization's financial results. Han Yang and Karthik Kulkarni explore how the Cisco Data Intelligence Platform can help bridge the gap between AI and ML and big data.

Jennifer Yang is the head of data management and data governance at Wells Fargo Enterprise Core Services. Previously, she served in various senior leadership roles in risk management and capital management at major financial institutions. Her unique experience allows her to understand data and technology from both the end user's and data management's perspectives. Jennifer is passionate about leveraging the power of new technologies to gain insights from data and develop cost-effective, scalable business solutions. She holds an undergraduate degree in applied chemistry from Beijing University, a master's degree in computer science from the State University of New York at Stony Brook, and an MBA specializing in finance and accounting from New York University's Stern School of Business.

Presentations

Machine learning in data quality management Findata

Jennifer Yang discusses a use case that demonstrates how to use machine learning techniques in the data quality management space in the financial industry. You'll discover the results of applying various machine learning techniques in the four most commonly defined data validation categories and learn approaches to operationalize the machine learning data quality management solution.

Wenyi Yang is a software engineer on the AI platform team at Microsoft.

Presentations

Introducing a new anomaly detection algorithm (SR-CNN) inspired by computer vision Session

Anomaly detection may sound old-fashioned, yet it's super important in many industry applications. Tony Xing, Congrui Huang, Qiyang Li, and Wenyi Yang detail a novel anomaly-detection algorithm based on spectral residual (SR) and convolutional neural networks (CNNs) and explain how the method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.

Chuck Yarbrough is vice president of solutions marketing and management at leading IoT and big data analytics company Hitachi Vantara, where he’s responsible for creating and driving repeatable solutions that leverage Hitachi Vantara’s Pentaho platform, enabling customers to implement IoT and big data solutions that transform companies into data-driven enterprises. Chuck has more than 20 years of experience helping organizations use data to ensure they can run, manage, and transform their business through better use of data. Previously, Chuck held leadership roles at Deloitte, SAP, and Hyperion.

Presentations

Clean the swamp: Gain greater visibility, speed, and governance with data ops (sponsored by Hitachi Vantara) Session

According to Gartner, over 80% of data lake projects were deemed inefficient. Data lakes come and go. Swamps happen. Data agility is fleeting. Chuck Yarbrough walks you through how data ops practices and a modern data architecture bring greater visibility and allow faster data access with proper governance.

Alex Yoon is the principal data analyst at T-Mobile, where he leads the cell phone-based big data crowdsourcing and analytics strategy. He identifies use cases for big data, defines the technical requirements for data collection, crunches the crowdsourced data, and visualizes the results to steer business decisions. He won Innovator of the Year in 2017 for delivering the 5 GHz band utilization analysis for the unlicensed LTE deployment strategy. Alex has 19 years of mobile industry experience spanning radio frequency engineering, product marketing, and big data analytics, all of which he uses to make his analytics create bigger value and practical differences.

Presentations

T-Mobile's journey to turn crowdsourced big data into actionable insights Session

T-Mobile successfully improved voice call quality by analyzing crowdsourced big data from mobile devices. Alex Yoon walks you through how engineers from multiple backgrounds collaborated to achieve a 10% improvement in voice quality and why big data analysis was the key to bringing better voice call service quality to millions of end users.

Petar Zecevic is the chief technology officer of SV Group in Zagreb, Croatia, and is pursuing his PhD at the University of Zagreb. He's collaborating with the astronomy department at the University of Washington on building new methods for processing images and data from future astronomical surveys. Previously, he was a Java developer and worked as a software architect, team leader, and IBM software consultant. After switching to the exciting new field of big data technologies, he wrote Spark in Action (Manning, 2016) and now primarily works on Apache Spark and big data projects.

Presentations

Using Spark for crunching astronomical data on the LSST scale Session

The Large Synoptic Survey Telescope (LSST) is one of the most important future surveys. Its unique design allows it to cover large regions of the sky and obtain images of the faintest objects. After 10 years of operation, it will produce about 80 PB of image and catalog data. Petar Zecevic explains AXS, a system built for fast processing and cross-matching of survey catalog data.

Jeff Zemerick is a software engineer, cloud architect, and consultant at Mountain Fog. He's a committer and PMC member on Apache OpenNLP and currently works on cloud, big data, and NLP projects.

Presentations

Protecting the healthcare enterprise from PHI breaches using streaming and NLP Session

Hospitals small and large are adopting cloud technologies, and many operate in hybrid environments. These distributed environments pose challenges, none more critical than safeguarding protected health information (PHI). Jeff Zemerick explores how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment.

Yiyi Zeng is a senior manager and principal data scientist at Walmart Labs, where she and her team use supervised and unsupervised machine learning techniques to detect fraud including stolen financials, account takeover, identity fraud, promotion and return abuse, and victim scams. She has 12 years of extensive experience in business analytics and intelligence, decision management, fraud detection, credit risk, online payment, and ecommerce across various business domains including both Fortune 500 firms and startups. She’s enthusiastic about mining large-scale data and applying machine learning knowledge to improve business outcomes.

Presentations

Machine learning and large-scale data analysis on a centralized platform Session

James Tang, Yiyi Zeng, and Linhong Kang outline how Walmart provides a secure and seamless shopping experience through machine learning and large-scale data analysis on a centralized platform.

Jia Zhai is a core software engineer at StreamNative and a PMC member of both Apache BookKeeper and Apache Pulsar, contributing to the two projects continually.

Presentations

How Orange Financial combats financial fraud over 50M transactions a day using Apache Pulsar Session

Orange Financial, the fintech arm of China Telecom, serves half a billion registered users and 41 million monthly active users, making risk control decision deployment critical to its success. Weisheng Xie and Jia Zhai explore how the company leverages Apache Pulsar to boost the efficiency of its risk control decision development, combating financial fraud across more than 50 million transactions a day.

Alice Zhao is a senior data scientist at Metis, where she teaches 12-week data science bootcamps. Previously, she was the first data scientist at Cars.com, where she supported multiple functions from marketing to technology; cofounded a data science education startup, Best Fit Analytics Workshop, teaching weekend courses to professionals at 1871 in Chicago; was an analyst at Redfin; and was a consultant at Accenture. She blogs about analytics and pop culture on A Dash of Data; her blog post "How Text Messages Change from Dating to Marriage" made the front page of Reddit, gaining over half a million views in its first week. She's passionate about teaching and mentoring and loves using data to tell fun and compelling stories. She holds an MS in analytics and a BS in electrical engineering, both from Northwestern University.

Presentations

Introduction to natural language processing in Python Tutorial

As data scientists, we're known for crunching numbers, but what do you do when you run into text data? Alice Zhao walks you through the steps to turn text data into a format a machine can understand, explores some of the most popular text analytics techniques, and showcases several natural language processing (NLP) libraries in Python, including NLTK, TextBlob, spaCy, and gensim.

Yu Zhou is a software development engineer on the Azure big data team at Microsoft, where he develops innovative big data solutions, including distributing computing systems and streaming computing. He earned his master of science degree in EE from Beijing University of Posts and Telecommunications and his bachelor of science degree in EE from Hunan University.

Presentations

Using Spark to speed up the diagnosis performance for big data applications Session

Ruixin Xu, Long Tian, and Yu Zhou explore an experiment using Spark and Jupyter notebooks as a replacement for existing IDE-based tools in internal DevOps. The Spark-based solution improved diagnosis performance significantly, especially for complex jobs with large profiles, and the Jupyter notebooks bring the benefits of fast iteration and easy knowledge sharing.

Nan Zhu is a software engineer at Uber, where he works on optimizing Spark for Uber's scenarios and scaling XGBoost in Uber's machine learning platform. Nan has been a committee member of XGBoost since 2016. He started the XGBoost4J-Spark project, which facilitates distributed training in XGBoost, and worked on fast histogram algorithms for distributed training.

Presentations

We run, we improve, we scale: The XGBoost story at Uber Session

XGBoost has been widely deployed in companies across the industry. Nan Zhu and Felix Cheung dive into the internals of distributed training in XGBoost and demonstrate how XGBoost solves business problems at Uber at a scale of thousands of workers and tens of terabytes of training data.

    Contact us

    confreg@oreilly.com

    For conference registration information and customer service

    partners@oreilly.com

    For more information on community discounts and trade opportunities with O’Reilly conferences

    strataconf@oreilly.com

    For information on exhibiting or sponsoring a conference

    pr@oreilly.com

    For media/analyst press inquiries