Sep 23–26, 2019

Speakers

Hear from innovative researchers, talented CxOs, and senior developers who are doing amazing things with data. More speakers will be announced; please check back for updates.


Saif Addin Ellafi is a software developer at John Snow Labs, where he is the main contributor to Spark NLP. A data scientist, forever student, and an extreme sports and gaming enthusiast, Saif has wide experience in problem solving and quality assurance in the banking and finance industry.

Presentations

Feature engineering with Spark NLP to accelerate clinical trial recruitment Session

Recruiting patients for clinical trials is a major challenge in drug development. This talk explains how Deep6 utilizes Spark NLP to scale its training and inference pipelines to millions of patients while achieving state-of-the-art accuracy. It covers the technical challenges, the architecture of the full solution, and lessons learned.

Natural language understanding at scale with Spark NLP Tutorial

This is a hands-on tutorial on state-of-the-art NLP using the highly performant, highly scalable open-source Spark NLP library. You'll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.
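A minimal sketch of the kind of pipeline the tutorial builds, assuming the spark-nlp Python package is installed; the pretrained pipeline name below is one of the library's published pipelines:

```python
# Minimal Spark NLP sketch (assumes the spark-nlp package is installed;
# the pretrained pipeline used here is illustrative).
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a Spark session configured for Spark NLP

# Download a pretrained pipeline and annotate raw text.
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Clinical trial recruitment remains a major bottleneck in drug development.")

print(result["entities"])  # named entities found in the sentence
print(result["pos"])       # part-of-speech tags
```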

Panos Alexopoulos has been working for more than 12 years at the intersection of data, semantics, language, and software, contributing to building semantics-powered systems that deliver value to business and society. Born and raised in Athens, Greece, Panos currently works as Head of Ontology at Textkernel in Amsterdam, Netherlands, where he leads a team of data professionals (linguists, data scientists, and data engineers) in developing and delivering a large cross-lingual knowledge graph for the HR and recruitment domain. Prior to Textkernel, he worked at Expert System Iberia (formerly iSOCO) in Madrid, Spain, as a Semantic Applications Research Manager, and at IMC Technologies in Athens, Greece, as a Semantic Solutions Architect and Ontologist.

Academically, Panos holds a PhD in Knowledge Engineering and Management from the National Technical University of Athens and has published roughly 60 papers in international conferences, journals, and books. He also strives to present his work and experiences in all kinds of venues, trying to bridge the gap between academia and industry so that each can benefit from the other.

Presentations

Mind the Semantic Gap: How “talking semantics” can help you perform better data science Session

In an era when discussions among data scientists are monopolized by the latest trends in machine learning, the role of semantics in data science is often underplayed. In this talk, I present real-world cases where making fine, seemingly pedantic distinctions in the meaning of data science tasks and their related data has significantly improved their effectiveness and value.

Alasdair Allan is a scientist and researcher who has authored over eighty peer-reviewed papers and eight books and has been involved with several standards bodies. Originally an astrophysicist, he now works as a consultant and journalist, focusing on open hardware, machine learning, big data, and emerging technologies, with expertise in electronics (especially wireless devices and distributed sensor networks), mobile computing, and the Internet of Things. He runs a small consulting company and has written for Make: Magazine, Motherboard/VICE, Hackaday, Hackster.io, and the O'Reilly Radar. In the past he has mesh networked the Moscone Center, caused a U.S. Senate hearing, and contributed to the detection of what was, at the time, the most distant object yet discovered.

Presentations

Executive Briefing: Making Intelligent Insights at the Edge — the Demise of Big Data? Session

The arrival of a new generation of smart embedded hardware may cause the demise of large-scale data harvesting. In its place, smart devices will let us process data at the edge, extracting insights without storing potentially privacy- and GDPR-infringing data. The current age in which privacy is no longer "a social norm" may not long survive the coming of the Internet of Things.

Shradha Ambekar is a staff software engineer with the Small Business Data Group at Intuit. She has experience working with Hadoop, Spark, Kafka, Cassandra, and Vertica. She is the technical lead for the lineage framework (SuperGlue) and real-time analytics and has made several key contributions to building solutions around the data platform at Intuit. She is a contributor to spark-cassandra-connector and spoke at the O'Reilly Open Source Conference in 2019. Prior to joining Intuit, she worked as a software engineer at Rearden Commerce. She holds a bachelor's degree in electronics and communication engineering from NIT Raipur, India.

Presentations

Time-travel for Data Pipelines: Solving the mystery of what changed? Session

Imagine a business insight showing a sudden spike. Debugging data pipelines is non-trivial, and finding the root cause can take hours or even days! We'll share how Intuit built a self-serve tool that automatically discovers data pipeline lineage and tracks every change that impacts a pipeline. This helps debug pipeline issues in minutes, establishing trust in data while improving developer productivity.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He’s widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Professional Kafka development 2-Day Training

Jesse Anderson offers an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL.
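For orientation, a minimal Python producer/consumer pair using the kafka-python client (the course itself may use other clients or the Java APIs):

```python
# Minimal Kafka producer/consumer sketch using the kafka-python client
# (illustrative; topic name and broker address are assumptions).
from kafka import KafkaProducer, KafkaConsumer

# Publish a few messages to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("events", key=str(i).encode(), value=f"message-{i}".encode())
producer.flush()

# Consume them back from the beginning of the topic.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for record in consumer:
    print(record.key, record.value)
```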

Professional Kafka development (Day 2) Training Day 2

Jesse Anderson offers an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL.

Vitaliy Baklikov is a data architect at Development Bank of Singapore.

Presentations

Enabling Big Data and AI workloads on the Object Store at DBS Bank Session

In this presentation, Vitaliy Baklikov from DBS Bank and Dipti Borkar from Alluxio will share how DBS Bank has built a modern big data analytics stack leveraging an object store even for data-intensive workloads like ATM forecasting and how it uses Alluxio to orchestrate data locality and data access for Spark workloads.

Gowri Balasubramanian is a senior solutions architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on both relational and NoSQL database services, helping them improve the value of their solutions when using AWS.

Presentations

From relational databases to Cloud databases, using the right tool for the right job. Tutorial

Enterprises adopt cloud platforms such as AWS for agility, elasticity, and cost savings. Database design and management require a different mindset in AWS than traditional RDBMS design. In this session, you will learn the important considerations in choosing the right database based on your use cases and access patterns when migrating an application or building a new one in the cloud.

Dylan Bargteil is a data scientist in residence at the Data Incubator, where he works on research-guided curriculum development and instruction. Previously, he worked with deep learning models to assist surgical robots and was a research and teaching assistant at the University of Maryland, where he developed a new introductory physics curriculum and pedagogy in partnership with HHMI. Dylan studied physics and math at University of Maryland and holds a PhD in physics from New York University.

Presentations

Machine learning from scratch in TensorFlow 2-Day Training

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Dylan Bargteil offers an overview of TensorFlow's capabilities in Python, demonstrating how to build machine learning algorithms piece by piece and how to use TensorFlow's Keras API with several hands-on applications.
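A small example of the kind of Keras model covered, run here on synthetic data:

```python
# A small Keras model of the kind built in the training (data shapes are illustrative).
import numpy as np
import tensorflow as tf

# Toy dataset: 1,000 samples with 20 features, binary labels.
X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
```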

Machine learning from scratch in TensorFlow (Day 2) Training Day 2

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Dylan Bargteil offers an overview of TensorFlow's capabilities in Python, demonstrating how to build machine learning algorithms piece by piece and how to use TensorFlow's Keras API with several hands-on applications.

Dan spent 12 years in the military as a fighter jet mechanic before transitioning to a career in technology as a software engineer and then a manager. He was the Chief Architect at the National Association of Insurance Commissioners leading their technical and cultural transformation. He’s now leading RSA Archer as their Chief Architect in their cloud migration and conversion to SaaS. Dan is also an organizer of DevOps KC and the DevOpsDays KC conference.

Presentations

Creating a data culture at a 150-year-old non-profit Findata

Sometimes if you build it, no one comes. What do you do if your data and tools aren't being leveraged as expected? It's a massive shift in thinking, but the payoff can be even bigger. Let me show you how we took the National Association of Insurance Commissioners, a 150-year-old non-profit, into the data age, and how you can do it too.

William Benton leads a team of data scientists and engineers at Red Hat, where he has also applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy intelligent applications in cloud-native environments, but he has also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.

Presentations

Sketching data and other magic tricks Tutorial

In this hands-on workshop, we’ll introduce several data structures that let you answer interesting queries about massive data sets in fixed amounts of space and constant time. This seems like magic, but we'll explain the key trick that makes it possible and show you how to use these structures for real-world machine learning and data engineering applications.
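One example of such a structure is a count-min sketch, which approximates item counts for an unbounded stream in fixed space; a minimal pure-Python sketch (the width and depth values are illustrative):

```python
# A tiny count-min sketch: approximate counts for a huge stream in fixed space.
import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._hashes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Never undercounts; hash collisions can only inflate the estimate.
        return min(self.table[row][col] for row, col in self._hashes(item))

cms = CountMinSketch()
for word in ["spark", "kafka", "spark", "flink", "spark"]:
    cms.add(word)
print(cms.estimate("spark"))  # approximately 3
```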

John Berryman started out in the field of Aerospace Engineering but soon found that he was more interested in math and software than in satellites and aircraft. He soon made the leap into software development specializing in search and recommendation technologies. John is now a Senior Software Engineer at Eventbrite, where he is helping build Eventbrite’s event discovery platform. He also recently coauthored a tech book, Relevant Search, published by Manning. The proceeds from the book have mostly paid for the coffee consumed while writing it.

Presentations

Search Logs + Machine Learning = Auto-Tagged Inventory Session

Eventbrite is exploring a new machine learning approach that allows us to harvest data from customer search logs and automatically tag events based upon their content. The results have allowed us to provide users with a better inventory browsing experience.

Enterprise Solution Architect helping customers with the adoption of cloud-enabled analytics solutions to meet business requirements.

Presentations

Building a recommender system with Amazon ML Services Tutorial

In this workshop we’ll introduce the Amazon SageMaker machine learning platform, followed by a high level discussion of recommender systems. Next we’ll dig into different machine learning approaches for recommender systems.

Gayle Bieler holds Master’s and Bachelor’s degrees in Mathematics from Boston University. She’s been a statistician at RTI International for the past 31 years and Director of RTI’s Center for Data Science for the past 5 of those years. She is passionate about building and leading a vibrant data science team that solves complex problems across multiple research domains, that is a hub of innovation and collaboration, and a place where people can thrive and be at their natural best while doing meaningful work to improve the world. Ms. Bieler’s team of 24 data scientists, statisticians, software developers, and visual designers is busy solving important national problems, improving our local communities, and transforming research. The most important things to her are 1) people and 2) impact, in that order. As a statistician by training, she is also experienced in statistical analysis of complex data from designed experiments, observational studies, and software development for sample surveys.

Presentations

Executive Briefing: Creating a Center for Data Science From Scratch - Lessons from Nonprofit Research Session

This presentation is about building a thriving Center for Data Science within a large and well-respected non-profit research institute. I'll discuss my transformation from an entrepreneurial statistician to data science leader, as well as some of our most impactful projects and best adventures to date--solving important national problems, improving our local communities, and transforming research.

Albert Bifet is a professor at LTCI and head of the Data, Intelligence, and Graphs (DIG) Group at Télécom ParisTech, and a scientific collaborator at École Polytechnique. A big data scientist with 10+ years of international experience in research, Albert has led new open source software projects for business analytics, data mining, and machine learning at Huawei, Yahoo, the University of Waikato, and UPC. At Yahoo Labs, he cofounded Apache SAMOA (Scalable Advanced Massive Online Analysis), a distributed streaming machine learning framework that contains a programing abstraction for distributed streaming ML algorithms. At the WEKA Machine Learning Group, he co-led MOA (Massive Online Analysis), the most popular open source framework for data stream mining, with more than 20,000 downloads each year. Albert is the author of Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams and the editor of the Big Data Mining special issue of SIGKDD Explorations in 2012. He was cochair of the industrial track at ECML PKDD 2015, BigMine (2014, 2013, 2012), and the data streams track at ACM SAC (2015, 2014, 2013, 2012). He holds a PhD from BarcelonaTech.

Presentations

Machine learning for streaming data: practical insights Session

In this talk, we show how to develop a machine learning pipeline for streaming data using the StreamDM framework (https://github.com/huawei-noah/streamDM). We also introduce how to use StreamDM for supervised and unsupervised learning tasks, show examples of online preprocessing methods, and explain how to extend the framework by adding new learning algorithms or preprocessing methods.

Rossella Blatt Vital, Director of Innovation for Machine Learning at Wonderlic, is a passionate and visionary thought leader in the field of artificial intelligence. She has 15+ years of experience driving machine learning strategies and leading ML teams in enterprise, SMB, and startup environments. Her career began in academia, where she headed machine learning projects ranging from lung cancer diagnosis to brain-computer interfaces and crime forecasting. Rossella then transitioned to the corporate world, leading the AI and data science initiatives at Newedge and Société Générale before moving to Remitly to serve as the Director of Machine Learning.
Rossella is particularly motivated by empowering organizations to build a strong ML capability, generate value through ML and create high performing ML teams. A native of Italy, she holds an M.S. and B.S. in telecommunications engineering from Politecnico di Milano (which is when she discovered and fell in love with ML) and has pursued a Ph.D. (ABD) in Information Technology.
Rossella loves innovation and machine learning and how, when driven by the right principles and culture, they can contribute to make a difference and create a better tomorrow.

Presentations

Building and Leading a Successful AI Practice for your Organization Tutorial

Creating and leading a successful ML strategy is an elegant orchestration of many components: master the key ML concepts, operationalize the ML workflow, prioritize highest value projects, build a high performing team, nurture strategic partnerships, align with the company’s mission, etc. This tutorial aims to share insights and lessons learned in how to create and lead a flourishing ML practice.

Dipti Borkar is the VP of Product & Marketing at Alluxio, with over 15 years of experience in data and database technology across relational and non-relational systems. Prior to Alluxio, Dipti was VP of Product Marketing at Kinetica and Couchbase. At Couchbase she held several leadership positions, including Head of Global Technical Sales and Head of Product Management. Earlier in her career, Dipti managed development teams at IBM DB2, where she started out as a database software engineer. Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from the Haas School of Business at UC Berkeley.

Presentations

Enabling Big Data and AI workloads on the Object Store at DBS Bank Session

In this presentation, Vitaliy Baklikov from DBS Bank and Dipti Borkar from Alluxio will share how DBS Bank has built a modern big data analytics stack leveraging an object store even for data-intensive workloads like ATM forecasting and how it uses Alluxio to orchestrate data locality and data access for Spark workloads.

Bob Bradley is the data solutions manager at Geotab, a global leader in telematics providing open platform fleet management solutions to over 1.2 million connected vehicles worldwide. Bob leads a team that is responsible for developing data-driven solutions that leverage Geotab’s big data repository of over 3 billion records each day. Previously, Bob spent more than 14 years as the vice president and cofounder of a software development shop (acquired by Geotab in 2016), where he focused on delivering custom business intelligence solutions to companies across Canada.

Presentations

Turning petabytes of data from millions of vehicles into open data with Geotab Session

Geotab is a world-leading asset tracking company, with millions of vehicles under service every day. In the first part of this talk, we review the challenges Geotab faced and the solutions it adopted to create an ML- and GIS-enabled, petabyte-scale data warehouse on Google Cloud. We then review the process for publishing that data openly, how to access it, and how cities are using it.

Navinder is a data engineer at Walmart Labs, where he has been working on Kafka and Kafka Streams for over a year. He likes working on distributed systems and lives in Bangalore, India. He has prior experience building web applications and one of the biggest GDS platforms used in the travel industry. He holds a bachelor's degree in computer science.

Presentations

Building a multi-tenant data processing and model inferencing platform with Kafka Streams Session

Each week 275 million people shop at Walmart, generating multiple terabytes of interaction and transaction data. In the Customer Backbone team, we enable the extraction, transformation, and storage of customer data to be served to teams such as Ads and Personalisation. At 5 billion events per day, our Kafka Streams cluster processes events from various channels and maintains a uniform identity for each customer.

Mikio Braun is principal engineer for search at Zalando, one of Europe’s biggest fashion platforms. He worked in research for a number of years before becoming interested in putting research results to good use in the industry. Mikio holds a PhD in machine learning.

Presentations

Fair, privacy preserving, and secure ML Session

With ML becoming more and more mainstream, the side effects of machine learning and AI on our lives are becoming more and more visible. One has to take extra measures to make machine learning models fair and unbiased. In addition, awareness of the need to preserve privacy in ML models is rapidly growing.

Michael leads the data science team at CarGurus, a prominent automotive marketplace. His expertise is in algorithms, data mining, deep learning and machine learning, and statistics. His work focuses on creating solutions to central problems in online advertising and other fundamental aspects of the business. In the past, he worked in the capacities of director of data science, lead data scientist and principal data scientist, applying his expertise to create solutions to prime problems in actuary and healthcare, CRM, restaurant management software, and retail.

Michael is a Hebrew University alumnus (BS, MS), a University of Pennsylvania alumnus (MS, PhD), and a Microsoft Research alumnus.

Presentations

Building a Machine Learning Framework to Measure TV Advertising Attribution Session

This session will present the case study for the CarGurus TV Attribution Model. Attendees will learn how the creation of a causal inference model can be leveraged to calculate cost per acquisition (CPA) of TV spend and measure effectiveness when compared to CPA of Digital Performance Marketing spend.

Andrew Brust is the founder and CEO of Blue Badge Insights. He advises data and analytics ISVs on winning in the market, solution providers on their service offerings, and customers on their analytics strategy. He writes about Big Data for ZDNet and is a data and analytics-focused analyst for Gigaom Research. Andrew is an entrepreneur, a consulting veteran, a former research director, and a current Microsoft Data Platform MVP.

Presentations

Executive Briefing: Data Catalogs - Concepts, Capabilities and Key Platforms Session

A primer on data catalogs and review of the major vendors and platforms in the market. Includes discussion on the use of data catalogs with classic and newer data repositories, including data warehouses, data lakes, cloud object storage and even software/applications. Coverage of AI's role in the data catalog world and analysis of data catalog futures will be provided.

Andrew Burt is chief privacy officer and legal engineer at Immuta, the data management platform for the world’s most secure organizations. He is also a visiting fellow at Yale Law School’s Information Society Project. Previously, Andrew was a special advisor for policy to the head of the FBI Cyber Division, where he served as lead author on the FBI’s after-action report on the 2014 attack on Sony. The leading authority on the intersection between machine learning, regulation and law, Andrew has published articles on technology, history, and law in the New York Times, the Financial Times, Slate, and the Yale Journal of International Affairs. His book, American Hysteria: The Untold Story of Mass Political Extremism in the United States, was called “a must-read book dealing with a topic few want to tackle” by Nobel laureate Archbishop Emeritus Desmond Tutu. Andrew holds a JD from Yale Law School and a BA from McGill University. He is a term-member of the Council on Foreign Relations, a member of the Washington, DC, and Virginia State Bars and a Global Information Assurance Certified (GIAC) cyber incident response handler.

Presentations

Regulations and the Future of Data Session

From the EU to California and China, more and more of the world is regulating how data can be used. In this session, Immuta and the Future of Privacy Forum will convene leading experts on law and data science for a deep dive into ways to regulate the use of AI and advanced analytics. Come learn why these laws are being proposed, how they’ll impact data, and what the future has in store.

War Stories from the Front Lines of ML Session

Machine learning techniques are being deployed across almost every industry and sector. But this adoption comes with real, and oftentimes underestimated, privacy and security risks. In this session, Immuta and the Future of Privacy Forum will convene leading industry representatives and experts to talk about real life examples of when ML goes wrong, and the lessons they learned.

Gwen is an Industrial & Systems Engineer with a passion for helping others. She wanted to find an exciting, innovative company that would allow her to use her skills to change the world, which is how she found her way to a start-up named Revibe Technologies. Gwen is leveraging Revibe’s wearable technology to further research and product development in the focus, attention, and movement spaces. She was a part of the team that brought Revibe Classic to market in 2015 and Revibe Connect to market in 2018. She is also one of the inventors of the Revibe Connect technology. Today, Gwen is the Director of Product and Data where she enjoys pioneering advanced data analytics in her field.

Presentations

From Isolated to Connected: The Metamorphosis of Revibe DCS

It's no surprise that our company needed to learn how to evolve to satiate today's data-hungry market. Revibe launched its first hardware-only device in 2015 and quickly learned that to stay alive, we needed to get our hands into data. We began the metamorphosis from a hardware company to a data company, and this presentation shares that transformation and the lessons learned along the way.

Luca is a data engineer at CERN working on the Hadoop, Spark, streaming, and database services. He has 18+ years of experience architecting, deploying, and supporting enterprise-level database and data services, with a special interest in methods and tools for performance troubleshooting. Luca is involved in developing and supporting data analytics and ML solutions for the CERN community, including the LHC experiments, the accelerator sector, and CERN IT, and he enjoys taking part in the wider data community and sharing results with it.

Presentations

Deep learning on Apache Spark at CERN’s Large Hadron Collider with Analytics Zoo Session

We will show CERN's research on applying deep learning in high energy physics experiments as an alternative to customized rule-based methods, with an example of topology classification to improve real-time event selection at the Large Hadron Collider experiments. CERN implemented deep learning pipelines on Apache Spark using the BigDL and Analytics Zoo open source software on Intel Xeon-based clusters.

Matt is a Security Principal at Cox Communications. He holds patents on a weighted data packet communication system, systems and methods of DNS grey listing, and systems and methods of mapped network address translation.

Presentations

Secured Computation – Analyzing Sensitive Data using Homomorphic Encryption Session

Organizations often work with sensitive information such as Social Security numbers and credit card details. Although this data is stored in encrypted form, most analytical operations, from basic data analysis to advanced machine learning algorithms, require the data to be decrypted for computation. This creates unwanted exposure to theft or unauthorized access.
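For intuition, here is a toy additively homomorphic scheme (textbook Paillier with deliberately tiny, insecure parameters) showing how a sum can be computed over values that stay encrypted; the session's actual approach is not detailed here, and real systems use vetted libraries and much larger keys:

```python
# Toy additively homomorphic encryption (textbook Paillier, insecure parameters).
import math
import random

p, q = 61, 53                     # toy primes; real deployments use ~2048-bit moduli
n = p * q
n_sq = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)              # modular inverse of lambda mod n (valid when g = n + 1)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    l = (pow(c, lam, n_sq) - 1) // n  # the "L" function
    return (l * mu) % n

# Homomorphic property: multiplying ciphertexts adds the underlying plaintexts.
a, b = 120, 45
c_sum = (encrypt(a) * encrypt(b)) % n_sq
print(decrypt(c_sum))  # 165, computed without ever decrypting a or b individually
```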

As Vice President of Engineering, Sandra leads WattzOn’s software development team, ensuring rapid iteration and releases using agile software development. Previously Sandra was VP of Engineering at a number of startups. She has also been in engineering management at AT&T Bell Labs, Aurigin and AT&T Labs.

As Chief Data Scientist, Sandra invented WattzOn’s GLYNT machine learning product, which extracts data trapped in complex documents. She is also the inventor of Mixed Formal Learning, which is used in the GLYNT machine learning product.

Presentations

Data need not be a moat: Mixed Formal Learning enables zero and low shot learning Session

This talk motivates mixed formal learning, explains it and outlines one machine learning example that previously used large numbers of examples and now learns with either zero or a handful of training examples. It maps apparently idiosyncratic techniques to Mixed Formal Learning, a general AI architecture that you can use in your projects.

Rich Caruana is a Principal Researcher at Microsoft Research. Before joining Microsoft, Rich was on the faculty in the Computer Science Department at Cornell University, at UCLA’s Medical School, and at CMU’s Center for Learning and Discovery. Rich’s Ph.D. is from Carnegie Mellon University, where he worked with Tom Mitchell and Herb Simon. His thesis on Multi-Task Learning helped create interest in a new subfield of machine learning called Transfer Learning. Rich received an NSF CAREER Award in 2004 (for Meta Clustering), best paper awards in 2005 (with Alex Niculescu-Mizil), 2007 (with Daria Sorokina), and 2014 (with Todd Kulesza, Saleema Amershi, Danyel Fisher, and Denis Charles), co-chaired KDD in 2007 (with Xindong Wu), and serves as area chair for NIPS, ICML, and KDD. His current research focus is on learning for medical decision making, transparent modeling, deep learning, and computational ecology.

Presentations

Unified Tooling for Machine Learning Interpretability Session

Understanding decisions made by machine learning systems is critical for sensitive uses, ensuring fairness, and debugging production models. Interpretability is a maturing field of research that presents many options for trying to understand model decisions. Microsoft is releasing new tools to help you train powerful, interpretable models and interpret decisions of existing blackbox systems.
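The abstract doesn't name specific packages; as one concrete illustration, Microsoft's open-source InterpretML library exposes glassbox models such as Explainable Boosting Machines. A minimal sketch, assuming the interpret and scikit-learn packages are installed and using a stock dataset:

```python
# Illustrative InterpretML sketch; package choice and dataset are assumptions.
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a glassbox model: an Explainable Boosting Machine.
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

# Inspect what the model learned, globally and for individual predictions.
show(ebm.explain_global())
show(ebm.explain_local(X_test.iloc[:5], y_test.iloc[:5]))
```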

Dave Castillo joined Capital One in 2018 and is a Managing Vice President in the Center for Machine Learning, an in-house center of excellence for machine learning, where he leads AI and machine learning initiatives in research and consulting across lines of business, as well as the development of tools, technologies, frameworks, and partnerships with industry and academia.

Prior to joining Capital One, Dave was Head of Data Science, Data Engineering, and Innovation for Early Warning where he was responsible for data ingestion, data transformation, profile assembly, feature extraction, and the development and deployment of machine learning models. Before Early Warning, Dave was CTO of Voltari, where he oversaw Cloud and Data Center Operations, Media Operations, Technology, Data Management Platform, and the Demand Side Platform (DSP) for real-time media buying and placement using automated self-training machine learning models. He also founded two startups specializing in automated predictive modeling for customer acquisition, retention, retargeting and segmentation in the digital and mobile marketing channels. He also held the position as Chief Software Engineer for Motorola’s IRIDIUM project and he began his career developing AI applications for NASA in vision, robotics, NLP, and case-based reasoning.

Dave holds Ph.D., master's, and bachelor's degrees in engineering from the University of Central Florida, Arizona State University, and the University of Arizona, respectively. He is also an active adjunct professor of computer science for the University of Maryland University College.

Presentations

Executive Briefing: Lessons from the Front Lines - Building a Responsible AI/ML Program in the Enterprise Session

The head of Capital One's Center for Machine Learning will share best practices for building a Responsible AI program in the enterprise, from multidisciplinary internal working groups to research & development.

Throughout a decade of virtualisation and launching two startups, Dan has been nerdy on three continents and in every line of business from UK bulge bracket banking to Australian desert public services.
Joining Hortonworks as a Solutions Engineer in 2016, he swiftly automated a sales manager using Apache NiFi. He now drives the international practice for enterprise adoption and automation of the HDF product line and maintains a public project for Apache NiFi Python automation (NiPyAPI) on GitHub.
Dan is based in London with his family and pet Samoyed. He can most recently be found building an open source baby monitor out of Raspberry Pis while mining cryptocurrency in his shed.

Presentations

Kafka/SMM(Streams Messaging Manager) Crash Course Tutorial

Kafka is omnipresent and is the backbone not only of streaming analytics applications but of data lakes as well. The challenge is understanding what is going on across the Kafka cluster, including performance, issues, and message flows. This session gives you hands-on experience visualizing your entire Kafka environment end to end and shows how SMM simplifies Kafka operations.

Magesh Chandramouli is a Vice President for Brand Expedia Group in Technology, responsible for determining the technology development for areas such as observability, cloud, microservices, real-time, and big data analytics for many of the full-service brands at Expedia Group, Inc., including the Expedia brand globally and the regional brands Travelocity, Orbitz, CheapTickets, ebookers, and Wotif. Chandramouli started writing code in 1991, and his current favorite languages are Scala, Kotlin, JavaScript, and Go. He is also passionate about open source technology, leading and contributing to Haystack, an open source distributed tracing and analysis platform. Prior to Expedia, he worked for Tata Consultancy Services, Netegrity, and JPMorgan Chase.

Presentations

Real time Anomaly detection on observability data using neural networks Session

Observability is key in modern architectures for quickly detecting and repairing problems in microservices. Modern observability platforms have evolved beyond simple application logs and now include distributed tracing systems like Zipkin and Haystack. Combining them with real-time, intelligent alerting that produces accurate alerts helps automate the detection of these problems.

Felix Cheung is an engineer at Uber and a PMC and committer for Apache Spark. Felix started his journey in the big data space about five years ago with the then state-of-the-art MapReduce. Since then, he’s (re-)built Hadoop clusters from metal more times than he would like, created a Hadoop distro from two dozen or so projects, and juggled hundreds to thousands of cores in the cloud or in data centers. He built a few interesting apps with Apache Spark and ended up contributing to the project. In addition to building stuff, he frequently presents at conferences, meetups, and workshops. He was also a teaching assistant for the first set of edX MOOCs on Apache Spark.

Presentations

We run, we improve, we scale - XGBoost story in Uber Session

XGBoost has been widely deployed in companies across the industry. This talk begins by introducing the internals of distributed training in XGBoost and then demonstrates how XGBoost solves business problems at Uber at a scale of thousands of workers and tens of terabytes of training data.
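For orientation, a single-machine XGBoost training sketch with the Python API on synthetic data; the talk itself covers distributed training at much larger scale:

```python
# Single-machine XGBoost sketch; the data here is synthetic.
import numpy as np
import xgboost as xgb

X = np.random.rand(10000, 50)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

dtrain = xgb.DMatrix(X[:8000], label=y[:8000])
dvalid = xgb.DMatrix(X[8000:], label=y[8000:])

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1, "eval_metric": "auc"}
booster = xgb.train(
    params,
    dtrain,
    num_boost_round=200,
    evals=[(dvalid, "valid")],
    early_stopping_rounds=20,  # stop when validation AUC stops improving
)
print(booster.best_iteration, booster.best_score)
```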

Dr. Moise Convolbo has extensive experience working with data, cloud, and geo-distributed datacenters.
He has always been fascinated by what comes next in terms of using data for strategic, data-informed business decisions.
At Rakuten, he works as a data scientist and research scientist, harnessing the potential of customer data in pursuit of "zero customer dissatisfaction."
Dr. Convolbo built a platform called the Rakuten PathFinder. This tool empowers product stakeholders such as PDMs, managers, and test engineers to focus on specific struggles along users' journeys, improve their product, and measure the business impact.
The Rakuten PathFinder is currently used by Rakuten Gora (the #1 golf course reservation site in Japan), Rakuten toto (the #1 lottery betting site), Rakuten O-net (the #1 matchmaking web service), and Rakuten Keiba (a horse racing betting service).
Dr. Convolbo is also active in academia, spending time as a reviewer for major big data, cloud optimization, and data science journals from ACM, Elsevier, Springer, and IEEE.

Presentations

Gaining New Insight into Online Customer Behavior using AI DCS

Customer satisfaction is a key success factor for any business. This session highlights the process of capturing relevant customer behavioral data, clustering user journeys by different patterns, and drawing conclusions for data-informed business decisions.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, he was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow 2-Day Training

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow (Day 2) Training Day 2

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.

Will add later

Presentations

Finding your needle in a Haystack Session

As the complexity of data systems has grown at Bayer, so has the difficulty of locating and understanding what data sets are available for consumption. To address this challenge, a custom metadata management tool was recently deployed as a new capability at Bayer. The system is cloud enabled and uses multiple open source components, including machine learning and natural language processing, to aid search.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Findata Day Welcome Tutorial

Alistair Croll, Findata Host and Strata Data Conference Program Chair, welcomes you to the day-long tutorial.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Michael Cullan holds a Masters in Statistics and has 4 years of research experience spanning topics in nonparametric statistics, applied mathematics, and artificial intelligence. He has 3 years of teaching experience in academic and professional settings. He combines a passion for teaching and statistical programming as a Data Scientist in Residence at The Data Incubator.

Presentations

Hands-on data science with Python 2-Day Training

Michael Cullan walks you through developing a machine learning pipeline, from prototyping to production. You'll learn about data cleaning, feature engineering, model building and evaluation, and deployment and then extend these models into two applications from real-world datasets. All work will be done in Python.
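A compact example of the kind of pipeline built in the training, using scikit-learn (the dataset and pipeline steps are illustrative):

```python
# Illustrative scikit-learn pipeline: preprocessing and a model chained together, then evaluated.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing values
    ("scale", StandardScaler()),                    # feature scaling
    ("model", LogisticRegression(max_iter=1000)),   # estimator
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```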

Hands-on data science with Python (Day 2) Training Day 2

Michael Cullan walks you through developing a machine learning pipeline, from prototyping to production. You'll learn about data cleaning, feature engineering, model building and evaluation, and deployment and then extend these models into two applications from real-world datasets. All work will be done in Python.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Brian Dalessandro is the head of data science at SparkBeyond, a research and consulting platform that accelerates discoveries and insights. Brian is also an active professor for NYU’s Center for Data Science graduate degree program. Prior to SparkBeyond, Brian has built and led data science programs for several NYC tech startups, including Zocdoc and Dstillery. A veteran data scientist and leader with over 15 years of experience developing machine learning-driven practices and products, Brian holds several patents and has published dozens of peer-reviewed articles on the subjects of causal inference, large-scale machine learning, and data science ethics. When not doing Data Science, Brian likes to cook, create adventures with his family and surf in the frigid north Atlantic waters.

Presentations

Improve Your Data Science ROI with a Portfolio and Risk Management Lens Session

While the value of data science is well recognized within tech, our experience with leaders across industries shows that the ability to realize and measure business impact is not universal. A core issue is that DS programs face unique risks that many leaders aren't trained to hedge against. This talk addresses those risks and advocates for new ways to think about and manage data science programs.

Jules S. Damji is an Apache Spark Community and Developer Advocate at Databricks. He is a hands-on developer with over 20 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, @Home, LoudCloud/Opsware, VeriSign, ProQuest, and Hortonworks, building large-scale distributed systems. He holds a B.Sc and M.Sc in Computer Science and MA in Political Advocacy and Communication from Oregon State University, Cal State, and Johns Hopkins University respectively.

Presentations

Managing the Complete Machine Learning Lifecycle with MLflow Tutorial

ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce their work.
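A minimal MLflow tracking sketch showing how parameters, metrics, and a model artifact are logged so runs can be reproduced and compared (the model and values are illustrative):

```python
# Minimal MLflow tracking sketch.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=0)

with mlflow.start_run():
    n_estimators = 100
    model = RandomForestClassifier(n_estimators=n_estimators).fit(X_train, y_train)

    mlflow.log_param("n_estimators", n_estimators)                  # record the configuration
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))  # record the result
    mlflow.sklearn.log_model(model, "model")                         # save the model as a run artifact
```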

Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He is currently working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine, Gobblin, a data lifecycle management platform for Hadoop, WhereHows, a data discovery and lineage platform, and Dali, a data virtualization layer for Hadoop.

Presentations

The Evolution of Metadata: LinkedIn’s story Session

How do you scale metadata to an organization of 10,000 employees and 1M+ data assets, at an AI-enabled company that ships code to the site three times a day? We describe the journey of LinkedIn's metadata from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. Different metadata strategies and our battle scars will be revealed!

Gerard de Melo is an Assistant Professor of computer science at Rutgers University, where he heads the Deep Data Lab, a team of researchers working on big data analytics, natural language processing, and web mining. Gerard’s research projects include UWN/MENTA, one of the largest multilingual knowledge bases, and Lexvo.org, an important hub in the Web of Data. Previously, he was a faculty member at Tsinghua University, one of China’s most prestigious universities, and a visiting scholar at ICSI/UC Berkeley. He serves as an editorial board member for Computational Intelligence, the Journal of Web Semantics, the Springer Language Resources and Evaluation journal, and the Language Science Press TMNLP book series. Gerard has published over 100 papers, with best paper or demo awards at WWW 2011, CIKM 2010, ICGL 2008, and the NAACL 2015 Workshop on Vector Space Modeling, as well as an ACL 2014 best paper honorable mention, a best student paper award nomination at ESWC 2015, and a thesis award for his work on graph algorithms for knowledge modeling. He holds a PhD in computer science from the Max Planck Institute for Informatics.

Presentations

Towards More Fine-Grained Sentiment and Emotion Analysis of Text Session

What kinds of sentiment and emotions do consumers associate with a text? With new data-driven approaches, organizations can better pay attention to what is being said about them in different markets. We can also consider the fonts and color palettes best-suited to convey specific emotions, so that organizations can make informed choices when presenting information to consumers.
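As a point of contrast with the fine-grained approaches discussed here, a coarse lexicon-based sentiment baseline can be run in a few lines; this sketch assumes NLTK and its VADER lexicon are installed and is not the speaker's method:

```python
# Baseline lexicon-based sentiment scoring with NLTK's VADER.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the lexicon
sia = SentimentIntensityAnalyzer()

for text in ["The new release is fantastic!", "Support was slow and unhelpful."]:
    print(text, sia.polarity_scores(text))  # neg/neu/pos proportions plus a compound score
```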

Randy is a solutions architect at AWS, with over 20 years of experience in enterprise software architecture. He’s worked heavily in DevOps in the past, and currently focuses on Analytics and Machine Learning.

Presentations

MLOps – Applying DevOps Practices to Machine Learning Workloads Session

As an increasing level of automation is becoming available to data science, there is a balance between automation and quality that needs to be maintained. Applying DevOps practices to machine learning workloads not only brings models to the market faster but also maintains the quality and integrity of those models. This presentation will focus on applying DevOps practices to ML workloads.

Sourav Dey is CTO at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. Previously, Sourav led teams building data products across the technology stack, from smart thermostats and security cams at Google/Nest to power grid forecasting at AutoGrid to wireless communication chips at Qualcomm. He holds patents for his work, has been published in several IEEE journals, and has won numerous awards. He holds PhD, MS, and BS degrees in electrical engineering and computer science from MIT.

Presentations

Streamlining a Machine Learning Project Team Tutorial

Many teams are still run as if data science is about experimentation, but those days are over. Now it must offer turnkey solutions to take models into production. We'll explain how to streamline a ML project and help your engineers work as an integrated part of production teams, using a Lean AI process and the Orbyter package for Docker-first data science.

Victor Dibia is a Research Engineer with Cloudera’s Fast Forward Labs where his work focuses on prototyping state of the art machine learning algorithms and advising clients. Prior to this, he was a Research Staff Member at the IBM TJ Watson Research Center, New York. His research interests are at the intersection of human computer interaction, computational social science, and applied AI. He holds an M.S. from Carnegie Mellon University and a Ph.D. from City University of Hong Kong.

Presentations

Handtrack.js: Building Gesture Based Interactions in the Browser Using Tensorflow.js Session

Recent advances in machine learning frameworks for the browser, such as TensorFlow.js, provide opportunities to craft truly novel experiences within front-end applications. This talk explores the state of the art for machine learning in the browser using TensorFlow.js and covers its use in the design of Handtrack.js, a library for prototyping real-time hand detection in the browser.

Masaru is a senior IT infrastructure engineer, IT architect, and manager at NTT DATA Corporation. He is responsible for research and development of data processing and analytics platforms built on open source software (e.g., Hadoop, Spark, and Kafka). The clusters he has designed and developed include thousands of nodes. He has presented at Strata Data Conference, Kafka Summit, Spark Summit, DataWorks Summit, and other events.

Presentations

Deep Learning Technologies for Giant Hogweed Eradication Session

Giant hogweed is a highly toxic plant. Our project aims to automate the process of detecting giant hogweed by exploiting technologies like drones and image recognition/detection using machine learning. We show how we designed the architecture, how we took advantage of both big data and machine/deep learning technologies (e.g., Hadoop, Spark, and TensorFlow), and the lessons we learned.

Mark Donsky leads product management at Okera, a software provider that provides discovery, access control, and governance at scale for today’s modern heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera. Mark has held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive briefing: big data in the era of heavy worldwide privacy regulations Session

California is following the EU's GDPR with the California Consumer Privacy Act (CCPA) in 2020. Non-compliance carries penalties, but many companies aren't prepared for this strict regulation. This session will explore the capabilities your data environment needs in order to simplify CCPA and GDPR compliance, as well as compliance with other regulations.

Getting ready for CCPA: securing data lakes for heavy privacy regulation Tutorial

New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to CCPA.

Anais Dotis-Georgiou is a Developer Advocate for InfluxData with a passion for making data beautiful with the use of Data Analytics, AI, and Machine Learning. She takes the data that she collects, does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, or chasing after a soccer ball.

Presentations

When Holt-Winters is better than Machine Learning Session

Did you know that classical algorithms can outperform machine learning methods in time series forecasting? I'll show you how I used the Holt-Winters forecasting algorithm to predict water levels in a creek.
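A minimal sketch of Holt-Winters (triple exponential smoothing) forecasting with statsmodels; the synthetic hourly series stands in for the actual creek-level data:

```python
# Holt-Winters forecast with statsmodels on a synthetic hourly series.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hourly series with a daily (24-step) seasonal cycle plus noise.
idx = pd.date_range("2019-01-01", periods=24 * 30, freq="H")
values = 10 + np.sin(np.arange(len(idx)) * 2 * np.pi / 24) + np.random.normal(0, 0.1, len(idx))
series = pd.Series(values, index=idx)

model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=24).fit()
forecast = model.forecast(48)  # predict the next two days
print(forecast.head())
```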

Chelsea is a Program Manager at Plotly, leading the documentation and support teams. As a polyglot in R, MATLAB, Python, and JavaScript, she leads communication of Plotly capabilities to users across programming languages, academic fields, and industries. Chelsea has an MA in Music Technology from McGill University.

Presentations

How S&P’s Trucost Empowered Their Analysts with Modern and Interactive Data Reporting Tools Findata

S&P’s Trucost Senior Analyst, Rochelle March, is migrating a product that quantitatively measures company performance on UN Sustainable Development Goals from Excel to Python. Her team cut multi-day workflows to a few hours, delivering rich, 27-page interactive reports. Learn about modern techniques to design, build, and deploy a data visualization and reporting framework in your organization.
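The talk doesn't detail the team's exact stack; as a minimal illustration of Python-based interactive reporting with Plotly, assuming plotly and pandas are installed and using made-up SDG scores:

```python
# Minimal Plotly sketch: build an interactive chart and write a self-contained HTML report.
import pandas as pd
import plotly.express as px

# Toy scores for two companies against three UN SDGs (illustrative data).
df = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "B"],
    "sdg": ["SDG 7", "SDG 12", "SDG 13"] * 2,
    "score": [0.62, 0.48, 0.71, 0.55, 0.66, 0.40],
})

fig = px.bar(df, x="sdg", y="score", color="company", barmode="group",
             title="Company performance by Sustainable Development Goal")
fig.write_html("sdg_report.html")  # interactive HTML report, viewable in any browser
```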

Carolyn Duby is a Solutions Engineer and lead cybersecurity SME at Cloudera, where she helps customers harness the power of their data with Apache open source. Previously, she was the architect for cybersecurity event correlation at SecureWorks. A subject-matter expert in cybersecurity and data science, Carolyn is an active leader in the community and a frequent speaker at Future of Data meetups and at conferences such as Strata Data Conference, DataWorks Summit, Open Data Science Conference, and Day of Shecurity. Carolyn holds an ScB (magna cum laude) and an ScM from Brown University, both in computer science. She is a lifelong learner and recently completed the Johns Hopkins University Coursera Data Science Specialization.

Presentations

Apache Metron: Open source cybersecurity at scale Tutorial

Bring your laptop, roll up your sleeves, and get ready to crunch some cyber security events with Apache Metron, an open source big data cyber security platform. Learn how Metron finds actionable events in real time.

Ted Dunning is CTO at MapR. He’s also a board member for the Apache Software Foundation, a PMC member and committer on many Apache projects, and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He has contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Practical Feature Engineering Session

Feature engineering is generally the part that gets left out of machine learning books, but it is also the most critical part in practice. I will cover a variety of techniques, a few well known but some rarely spoken of outside the tribal lore of top teams, including how to handle categorical inputs, natural language, transactions, and more, all in the context of modern machine learning.
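As one example of the categorical-input techniques in this space (not necessarily the speaker's exact recipe), out-of-fold target encoding replaces a high-cardinality category with a smoothed, leakage-aware average of the target; a sketch with pandas and synthetic fraud data:

```python
# Out-of-fold target encoding sketch (column names and data are illustrative).
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "merchant": np.random.choice([f"m{i}" for i in range(50)], size=5000),
    "is_fraud": np.random.binomial(1, 0.05, size=5000),
})

global_rate = df["is_fraud"].mean()
df["merchant_te"] = global_rate  # default value
smoothing = 10.0

for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    stats = df.iloc[train_idx].groupby("merchant")["is_fraud"].agg(["mean", "count"])
    # Shrink rare categories toward the global rate to avoid overfitting.
    encoded = (stats["mean"] * stats["count"] + global_rate * smoothing) / (stats["count"] + smoothing)
    df.loc[df.index[valid_idx], "merchant_te"] = (
        df.iloc[valid_idx]["merchant"].map(encoded).fillna(global_rate).values
    )
```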

Vlad Eidelman is the VP of Research at FiscalNote, where he leads AI R&D into advanced methods for analyzing, modeling, and extracting knowledge from unstructured data related to government, policy, and law, and where he built the first version of the company's patented technology to help organizations predict and act on policy changes. Prior to FiscalNote, he worked as a researcher in a number of academic and industry settings, completing his Ph.D. in CS, as an NSF and NDSEG Fellow, at the University of Maryland and his B.S. in CS and Philosophy at Columbia University. His research focuses on machine learning algorithms for a broad range of natural language processing applications, including entity extraction, machine translation, text classification, and information retrieval, especially as applied to computational social science. His work has led to 10 patent applications, has been published in conferences like ACL, NAACL, and EMNLP, and has been covered by media outlets such as Wired, Vice News, the Washington Post, and Newsweek.

Presentations

What Does The Public Say? A Computational Analysis of Regulatory Comments Session

While regulations affect your life every day, and millions of public comments are submitted to regulatory agencies in response to their proposals, analyzing the comments has traditionally been reserved for legal experts. In this talk, we show how natural language processing and machine learning can be used to automate the process by analyzing over 10 million publicly released comments.

Shelbee Eigenbrode is a Solutions Architect at Amazon Web Services (AWS). Her current areas of depth include DevOps combined with machine learning and artificial intelligence. She has been in technology for 22 years, spanning multiple roles and technologies. She spent 20+ years at IBM and joined AWS in November 2018. A published author and blogger/vlogger evangelizing DevOps practices, she has a passion for driving rapid innovation and optimization at scale. In 2016, she won the DevOps Dozen Blog of the Year (https://devops.com/the-2016-devops-dozen-winners-announced/) for demonstrating what DevOps Is Not. With over 26 patents granted across various technology domains, her passion for continuous innovation combined with a love of all things data has recently turned her focus to the field of data science. Combining her backgrounds in data, DevOps, and machine learning, her current passion is to help customers not only embrace data science but also ensure all data models have a path to being utilized. She also aims to put ML in the hands of developers and customers who are not classically trained data scientists.

Presentations

Alexa, Do Men Talk Too Much? Session

Mansplaining. Know it? Hate it? Want to make it go away? In this session we tackle the chronic problem of men talking over or down to women and its negative impact on career progression for women. We will also demonstrate an Alexa skill that uses deep learning techniques on incoming audio feeds. We discuss ownership of the problem for both women and men, and suggest helpful strategies.

MLOps – Applying DevOps Practices to Machine Learning Workloads Session

As an increasing level of automation is becoming available to data science, there is a balance between automation and quality that needs to be maintained. Applying DevOps practices to machine learning workloads not only brings models to the market faster but also maintains the quality and integrity of those models. This presentation will focus on applying DevOps practices to ML workloads.

Nick Elprin is the cofounder and CEO of Domino Data Lab, a data science platform that accelerates the development and deployment of models while enabling best practices like collaboration and reproducibility. Previously, Nick built tools for quantitative researchers at Bridgewater, one of the world’s largest hedge funds. He has over a decade of experience working with data scientists at advanced enterprises. Nick holds a BA and MS in computer science from Harvard.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders must deliver measurable impact on an increasing share of an enterprise’s KPIs. Attendees will learn how leading organizations take a holistic approach to people, process, and technology to build a sustainable competitive advantage.

28-year veteran at Statistics Canada, Institut national de la statistique et des études économiques (Insee). Expert in high-frequency economic indicators. Transformative Leader. Architect and project executive of the CPI Enhancement Initiative. Passionate about using Data Science and AI to create user-centric data products from Big Data sources. Recruiter of tomorrow’s statistical leaders.

Presentations

Implementing ML Models into Production at Statistics Canada DCS

Statistics Canada aims to be a key player in the Canadian AI ecosystem. Find out how Canada's national statistical organization created and put in place its first AI/ML team, applying Lean Startup principles and a proactive hiring strategy that within months was delivering production-ready ML models in a complex and demanding data processing environment.

Stephan Ewen is CTO and co-founder at Ververica where he leads the development of the stream processing platform based on open source Apache Flink. He is also a PMC member and one of the original creators of Apache Flink. Before working on Apache Flink, Stephan worked on in-memory databases, query optimization, and distributed systems. He holds a Ph.D. from the Berlin University of Technology.

Presentations

Stream Processing beyond Streaming Data Session

The talk discusses how stream processing is becoming a "grand unifying paradigm" for data processing and the newest developments in Apache Flink to support this trend: new cross-batch-streaming machine learning algorithms, state-of-the-art batch performance, and new building blocks for data-driven applications and application consistency.

Moty Fania is a principal engineer for big data analytics at Intel IT and the CTO of the Advanced Analytics Group, which delivers big data and AI solutions across Intel. With over 15 years of experience in analytics, data warehousing, and decision support solutions, Moty leads the development and architecture of various big data and AI initiatives, such as IoT systems, predictive engines, online inference systems, and more. Moty holds a bachelor’s degree in economics and computer science and a master’s degree in business administration from Ben-Gurion University.

Presentations

Building an AI platform – key principles and lessons learned Session

In this session, Moty Fania will share Intel IT's experience implementing a sales AI platform. This platform is based on a streaming, microservices architecture with a message bus backbone. It was designed for real-time data extraction and reasoning. The platform handles processing of millions of website pages and is capable of sifting through millions of tweets per day.

Usama is cofounder and CTO at OODA Health, Inc., a VC-funded company founded in 2017 to bring AI and automation to healthcare delivery, creating a retail-like experience in payments and processing. He is also chairman at Open Insights, a technology and strategic consulting firm he founded in 2008 to help enterprises deploy data-driven solutions that grow revenue from data assets. In addition to big data strategy and building new business models on data assets, the company deploys data science, AI/ML, and big data solutions for large enterprises. From 2013 to 2016, Usama served as global chief data officer at Barclays in London, after launching the largest tech startup accelerator in MENA as executive chairman of Oasis500 in Jordan in 2010. His background includes chairman and CEO roles at several startups, including Blue Kangaroo Corp, DMX Group, and digiMine Inc. He was the first person to hold the chief data officer title, when Yahoo! acquired his second startup in 2004. At Yahoo! he built the Strategic Data Solutions group and founded Yahoo! Research Labs, where much of the early work on big data made it to open source and established the early collaborations that launched Hadoop and other open source contributions. He has held leadership roles at Microsoft (1996-2000) and founded the machine learning systems group at NASA's Jet Propulsion Laboratory (1989-2005), where his work on machine learning resulted in the top Excellence in Research award from Caltech and a U.S. government medal from NASA.

Usama has published over 100 technical articles on data mining, data science, AI/ML, and databases. He holds over 30 patents and is a Fellow of both the AAAI and the ACM. Usama earned his PhD in engineering in AI and machine learning from the University of Michigan, Ann Arbor. He has edited two influential books on data mining and served as editor-in-chief of two key industry journals. He has also served on the boards or advisory boards of several private and public companies, including Criteo, Invensense, RapidMiner, Stella.AI, Virsec, Silniva, Abe.AI, NetSeer, Choicestream, Medio, and others. On the academic front, he is on the advisory boards of the Data Science Institute at Imperial College, AAI at UTS, and the University of Michigan College of Engineering National Advisory Board.

Presentations

An In-Depth Look at the Data Science Career: Defining Roles, Assessing Skills Session

Ever been confused about what it takes to be a data scientist? Or curious about how companies recruit, train, and manage analytics resources? This presentation covers insights from the most comprehensive research effort to date on the data analytics profession, proposes a framework for standardizing roles in the industry, and suggests methods for assessing skills.

Elva Fernandez Sanchez is a Senior Lead Analyst on the Privacy and Information Security team at American Express, where she focuses on ensuring the data security and privacy of millions of American Express card members by applying machine learning algorithms and big data tools. Elva's data experience spans almost a decade, during which she has handled and supported a range of data projects. Elva holds a master's degree from the University of Utah.

Presentations

How to create a Data Security and Privacy Framework that is compliant with the law Findata

The speaker will give an overview of GDPR and the different data laws that are currently in force. The session will help the audience understand and create new data security and privacy frameworks, such as privacy and security by design, and will cover the basic data principles attendees need in order to create and implement their own data principles at their companies.

Ricardo is a Developer Advocate at Confluent, the company founded by the creators of Apache Kafka. He has more than 21 years of experience in software development, specializing in distributed systems architectures such as integration, SOA, NoSQL, messaging, in-memory caching, and cloud computing. Prior to Confluent he worked for other vendors such as Oracle, Red Hat, and IONA Technologies, as well as several consulting firms. When not working, like any good Brazilian, he loves hosting churrascos (Brazilian barbecues) with his friends and family, where he gets the chance to talk about anything that is not geek related. He can be easily found on Twitter @riferrei or via his blog https://riferrei.net.

Presentations

Real-time SQL Stream Processing at Scale with Apache Kafka and KSQL Tutorial

Building stream processing applications is certainly one of the hottest topics in the IT community, yet for all the talk, relatively few teams have actually built one well. This tutorial aims to change that by introducing KSQL, the stream processing query engine built on top of Apache Kafka.
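
As a flavor of what KSQL statements look like, here is a minimal Python sketch that submits two illustrative statements to a KSQL server's REST endpoint; it assumes a locally running KSQL server on its default port (8088), and the pageviews stream and topic names are placeholders rather than material from the tutorial itself.

```python
import json
import requests

# Assumes a KSQL server at localhost:8088; stream and topic names are illustrative.
KSQL_ENDPOINT = "http://localhost:8088/ksql"
HEADERS = {"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"}

statements = """
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
CREATE TABLE pageviews_per_user AS
  SELECT userid, COUNT(*) AS views
  FROM pageviews
  GROUP BY userid;
"""

# Submit the statements; the server registers a continuous query over the topic.
response = requests.post(
    KSQL_ENDPOINT,
    headers=HEADERS,
    data=json.dumps({"ksql": statements, "streamsProperties": {}}),
)
print(response.status_code, response.json())
```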

Justin Fier is the Director for Cyber Intelligence & Analytics at Darktrace, based in Washington D.C. Justin is one of the US’s leading cyber intelligence experts, and his insights have been widely reported in leading media outlets, including Wall Street Journal, CNN, the Washington Post, and VICELAND. With over 10 years of experience in cyber defense, Justin has supported various elements in the US intelligence community, holding mission-critical security roles with Lockheed Martin, Northrop Grumman Mission Systems and Abraxas. Justin is also a highly-skilled technical specialist, and works with Darktrace’s strategic global customers on threat analysis, defensive cyber operations, protecting IoT, and machine learning.

Presentations

When Machines Fight Machines: Cyber Battles & the New Frontier of Artificial Intelligence Session

Cyber security must find what it doesn’t know to look for. AI technologies have led to the emergence of self-learning, self-defending networks that achieve this – detecting and autonomously responding to in-progress attacks in real time. These cyber immune systems enable the security team to focus on high-value tasks, can counter even machine-speed threats, and work in all environments.

Michael J. Freedman is the cofounder and CTO of TimescaleDB, an open source database that scales SQL for time series data, and a professor of computer science at Princeton University, where his research focuses on distributed systems, networking, and security. Previously, Michael developed CoralCDN (a decentralized CDN serving millions of daily users) and Ethane (the basis for OpenFlow and software-defined networking) and cofounded Illuminics Systems (acquired by Quova, now part of Neustar). He is a technical advisor to Blockstack. Michael’s honors include the Presidential Early Career Award for Scientists and Engineers (PECASE, given by President Obama), the SIGCOMM Test of Time Award, the Caspar Bowden Award for Privacy Enhancing Technologies, a Sloan Fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, a DARPA Computer Science Study Group membership, and multiple award publications. He holds a PhD in computer science from NYU’s Courant Institute and bachelor’s and master’s degrees from MIT.

Presentations

Performant time-series data management and analytics with Postgres Session

Leveraging polyglot solutions for your time-series data can lead to a variety of issues including engineering complexity, operational challenges, and even referential integrity concerns. By re-engineering Postgres to serve as a general data platform, your high-volume time-series workloads will be better streamlined, resulting in more actionable data and greater ease of use.
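
To illustrate the kind of query this approach enables, the following is a minimal sketch, assuming a TimescaleDB-enabled Postgres instance with a hypothetical conditions hypertable; it uses the time_bucket function to aggregate readings into five-minute buckets.

```python
import psycopg2

# Assumes a hypothetical hypertable:
#   conditions(time TIMESTAMPTZ, device TEXT, temperature DOUBLE PRECISION)
conn = psycopg2.connect("dbname=metrics user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT time_bucket('5 minutes', time) AS bucket,
               device,
               avg(temperature) AS avg_temp
        FROM conditions
        WHERE time > now() - interval '1 day'
        GROUP BY bucket, device
        ORDER BY bucket DESC;
        """
    )
    for bucket, device, avg_temp in cur.fetchall():
        print(bucket, device, avg_temp)
conn.close()
```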

Brandy Freitas is a principal data scientist at Pitney Bowes, where she works with clients in a wide variety of industries to develop analytical solutions for their business needs. Brandy is a research physicist-turned-data scientist based in Boston, MA. Her academic research focused primarily on protein structure determination, applying machine learning techniques to single-particle cryoelectron microscopy data. Brandy is a National Science Foundation Graduate Research Fellow and a James Mills Pierce Fellow. She holds an undergraduate degree in physics and chemistry from the Rochester Institute of Technology and did her graduate work in biophysics at Harvard University.

Presentations

Harnessing Graph Native Algorithms to Enhance Machine Learning: A Primer Session

In this session, Brandy Freitas from Pitney Bowes will cover the interplay between graph analytics and machine learning, improved feature engineering with graph native algorithms, and harnessing the power of graph structure for machine learning through node embedding.

Matt Fuller is cofounder at Starburst, the Presto company. Matt has held engineering roles in the data warehousing and analytics space for the past 10 years. Previously, he was director of engineering at Teradata, leading engineering teams working on Presto, and was part of the team that led the initiative to bring open source, in particular Presto, to Teradata’s products. Before that, Matt architected and led development efforts for the next-generation distributed SQL engine at Hadapt (acquired by Teradata in 2014) and was an early engineer at Vertica Systems (acquired by HP), where he worked on the Query Optimizer.

Presentations

Learning Presto: SQL on anything Tutorial

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today.
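
As a taste of what querying Presto from code can look like, here is a minimal sketch using the presto-python-client package; the coordinator address, catalog, and table name are assumptions for illustration only.

```python
import prestodb  # from the presto-python-client package

# Assumes a Presto coordinator at localhost:8080 with a Hive catalog;
# the web_events table is hypothetical.
conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()
cur.execute("SELECT event_type, count(*) FROM web_events GROUP BY event_type")
for event_type, n in cur.fetchall():
    print(event_type, n)
```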

Siddha Ganju, who Forbes featured in its 30 Under 30 list, is a self-driving architect at Nvidia. Previously, at Deep Vision, she developed deep learning models for resource-constrained edge devices. A graduate of Carnegie Mellon University, she has worked on projects ranging from visual question answering to generative adversarial networks to gathering insights from CERN's petabyte-scale data, and her work has been published at top-tier conferences including CVPR and NeurIPS. Serving as an AI domain expert, she has guided teams at NASA and served as a jury member in several international tech competitions.

Presentations

Deep Learning on Mobile Session

Optimizing deep neural nets to run efficiently on mobile devices.

Alon Gavra has been with AppsFlyer for the past two years and today serves as the Platform Team Lead. Originally a backend developer, he transitioned to lead the real-time infrastructure team and took on the role of bringing some of the most heavily used infrastructure at AppsFlyer to the next level. A strong believer in sleep-driven design, Alon focuses on stability and resiliency in building massive data ingestion and storage solutions.

Presentations

Managing Your Kafka in an Explosive Growth Environment Session

Kafka is often just a piece of the production stack that no one wants to touch, because it just works. At AppsFlyer, Kafka sits at the core of an infrastructure that processes billions of events daily.

Bas Geerdink is a programmer, scientist, and IT manager at ING, where he is responsible for the fast data systems that process and analyze streaming data. Bas has a background in software development, design, and architecture with broad technical experience from C++ to Prolog to Scala. His academic background is in artificial intelligence and informatics. Bas’s research on reference architectures for big data solutions was published at the IEEE conference ICITST 2013. He occasionally teaches programming courses and is a regular speaker at conferences and informal meetings.

Presentations

Fast Data with the KISSS stack Session

Streaming analytics (or fast data processing) is the field of making predictions on real-time data. In this talk, I'll present a fast data architecture, following a 'pipes and filters' pattern, that covers many use cases. This architecture can be used to create enterprise-grade solutions with a diversity of technology options. The stack is Kafka, Impala, and Spark Structured Streaming (KISSS).

Gidon is a lead architect at the IBM Research – Haifa Laboratory. He works on secure cloud analytics, data-at-rest and data-in-use encryption, and attestation of trusted computing enclaves. Currently, Gidon plays a leading role in the Apache Parquet community's work on protecting sensitive data in analytic workloads. Gidon completed a PhD at the Weizmann Institute of Science in Israel and was a postdoctoral fellow at Columbia University in New York City.

Presentations

Parquet Modular Encryption: Confidentiality and Integrity of Sensitive Column Data Session

The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off by the community management committee. I will present the basics of Parquet encryption technology, its usage model, and a number of use cases.

Debasish Ghosh is principal software engineer at Lightbend. Passionate about technology and open source, he loves functional programming and has been trying to learn math and machine learning. Debasish is an occasional speaker in technology conferences worldwide, including the likes of QCon, Philly ETE, Code Mesh, Scala World, Functional Conf, and GOTO. He is the author of DSLs In Action and Functional & Reactive Domain Modeling. Debasish is a senior member of ACM. He’s also a father, husband, avid reader, and Seinfeld fanboy who loves spending time with his beautiful family.

Presentations

Online Machine Learning in Streaming Applications Session

In this talk, we discuss online machine learning algorithm choices for streaming applications. We motivate the discussion with resource-constrained use cases like IoT and personalization. We cover Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms, all the way from implementation to production deployment, describing the pros and cons of each.
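
To make the "classic sketch data structures" point concrete, below is a minimal count-min sketch in plain Python/NumPy, approximating item counts in bounded memory; it is an illustrative toy, not the implementation discussed in the session.

```python
import numpy as np

class CountMinSketch:
    """Minimal count-min sketch: approximate counts of stream items in bounded memory."""

    def __init__(self, width=2048, depth=4, seed=42):
        self.width = width
        self.depth = depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        rng = np.random.RandomState(seed)
        self.salts = rng.randint(0, 2**31 - 1, size=depth)

    def _index(self, item, row):
        return hash((int(self.salts[row]), item)) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row, self._index(item, row)] += count

    def estimate(self, item):
        # The minimum across rows bounds the overestimation caused by collisions.
        return min(self.table[row, self._index(item, row)] for row in range(self.depth))

sketch = CountMinSketch()
for token in ["user_7", "user_7", "user_3", "user_7"]:
    sketch.add(token)
print(sketch.estimate("user_7"))  # >= 3; exact here unless collisions occur
```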

Heitor Murilo Gomes is a researcher at Télécom ParisTech focusing on machine learning—particularly evolving data streams, concept drift, ensemble methods, and big data streams. He co-leads the StreamDM open data stream mining project.

Presentations

Machine learning for streaming data: practical insights Session

In this talk, we show how to develop a machine learning pipeline for streaming data using the StreamDM framework (https://github.com/huawei-noah/streamDM). We also show how to use StreamDM for supervised and unsupervised learning tasks, give examples of online preprocessing methods, and explain how to extend the framework by adding new learning algorithms or preprocessing methods.

Bruno Gonçalves is currently a Vice President in Data Science and Finance at JPMorgan Chase. Previously, he was a Data Science fellow at NYU's Center for Data Science while on leave from a tenured faculty position at Aix-Marseille Université. Since completing his PhD in the physics of complex systems in 2008, he has been pursuing the use of data science and machine learning to study human behavior. Using large datasets from Twitter, Wikipedia, web access logs, and Yahoo! Meme, he has studied how we can observe both large-scale and individual human behavior in an unobtrusive and widespread manner. The main applications have been to the study of computational linguistics, information diffusion, behavioral change, and epidemic spreading.

Presentations

Deep Learning from Scratch Tutorial

Students will learn, in a hands-on way, the theoretical foundations and principal ideas underlying deep learning and neural networks. The code structure of the implementations provided is meant to closely resemble the way Keras is structured, so that by the end of the course, students will be prepared to dive deeper into the deep learning applications of their choice.
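
As a hint of the style of code involved, here is a minimal NumPy sketch of a fully connected layer with a Keras-like forward/backward interface; it is an illustration in the same spirit, not the course's actual codebase.

```python
import numpy as np

class Dense:
    """A minimal fully connected layer with a Keras-like interface."""

    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_in, n_out) * 0.01
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                      # cache input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out, lr=0.1):
        grad_W = self.x.T @ grad_out
        grad_b = grad_out.sum(axis=0)
        grad_in = grad_out @ self.W.T
        self.W -= lr * grad_W           # plain SGD update
        self.b -= lr * grad_b
        return grad_in

# Tiny usage example: one forward/backward step on random data.
layer = Dense(4, 2)
x = np.random.randn(8, 4)
out = layer.forward(x)
layer.backward(np.ones_like(out))
```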

Madhu Gopinathan is currently Vice President, Data Science at MakeMyTrip, India's leading online travel company. He started his career in the San Francisco Bay Area developing large-scale software systems in the telecom industry. After spending about 10 years in industry at companies such as Covad Communications and Microsoft, he completed his PhD in computer science at the Indian Institute of Science, working on mathematical modeling of software systems. He then returned to industry, first building a machine learning team at the Infosys product incubation group, followed by a couple of startups, before his current stint at MakeMyTrip. He has collaborated with researchers at Microsoft Research, General Motors, and the Indian Institute of Science, leading to publications in prominent computer science conferences. He holds an MS in computer science from the University of Florida, Gainesville. He has extensive experience developing large-scale systems using machine learning and natural language processing and has been granted multiple US patents.

Presentations

Migrating millions of users from voice and email based customer support to a chatbot Session

At MakeMyTrip, India's leading online travel platform, customers were using voice or email to contact agents for post-sale support. To improve the efficiency of agents and the customer experience, MakeMyTrip developed a chatbot, Myra, using some of the latest advances in deep learning. In this talk, we discuss the high-level architecture and the business impact created by Myra.

Sunil Goplani is a Group Development Manager at Intuit. Sunil has played key architecture and leadership roles building solutions around data platforms, big data, BI, data warehousing, and MDM for startups and enterprises. Sunil currently leads the big data platform at Intuit. Prior to Intuit, Sunil served in key engineering positions at Netflix, Chegg, Brand.net, and a few other startups. Sunil holds a master's degree in computer science.

Presentations

Time-travel for Data Pipelines: Solving the mystery of what changed? Session

Imagine a business insight showing a sudden spike. Debugging data pipelines is nontrivial, and finding the root cause can take hours or even days! We'll share how Intuit built a self-serve tool that automatically discovers data pipeline lineage and tracks every change that impacts a pipeline. This helps debug pipeline issues in minutes, establishing trust in data while improving developer productivity.

Sajan Govindan is a Solution Architect on the Data Analytics Technologies team at Intel, focusing on open source technologies for big data analytics and AI solutions. Sajan has been with Intel for more than eighteen years, with many years of experience and expertise in building analytics and AI solutions through the advancements in the Hadoop and Spark ecosystems and machine learning and deep learning frameworks, across various industry verticals and domains.

Presentations

Deep learning on Apache Spark at CERN’s Large Hadron Collider with Analytics Zoo Session

We will show CERN's research on applying deep learning in high energy physics experiments as an alternative to customized rule-based methods, with an example of topology classification to improve real-time event selection at the Large Hadron Collider experiments. CERN implemented deep learning pipelines on Apache Spark using the BigDL and Analytics Zoo open source software on Intel Xeon-based clusters.

Sourabh Goyal is a member of the technical staff at Qubole, where he works on the Hadoop team. Sourabh holds a bachelor's degree in computer engineering from Netaji Subhas Institute of Technology, University of Delhi.

Presentations

Downscaling: The Achilles heel of autoscaling Spark Clusters Session

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, so the overall total cost of ownership (TCO) goes up. We will talk about a new design for efficient downscaling, which helps achieve better resource utilization and thus lower TCO.

An economist turned data scientist, Catherine held a position as an investment research analyst at Man Group, covering global macro systematic hedge fund managers for three years, and then as a macro strategist at J.P. Morgan, covering emerging market FX and rates.

Catherine is currently pursuing graduate study at Stanford University, in Management Science and Engineering. Her areas of focus include market incentive design, computation and stochastic modeling.

Catherine cofounded the Quantess London Group in 2014. Catherine holds a master's in economics from the University of Cambridge.

Pronouns: She/Her

Presentations

The Future of Stablecoin Findata

With the emergence of the cryptoeconomy, there is real demand for an alternative form of money. Major cryptocurrencies such as Bitcoin and Ethereum have thus far failed to achieve mass adoption. In this talk, I will examine the paradigm of algorithmic stablecoin design, focusing on incentive structure and decentralized governance, to evaluate the role of stablecoins as a future medium of exchange.

Chenzhao Guo is a big data engineer at Intel. He graduated from Zhejiang University and joined Intel in 2016. He is currently a contributor to Spark and a committer on OAP and HiBench.

Presentations

Improving Spark by taking advantage of disaggregated architecture Session

Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumption of collocated storage does not always hold in today's data centers. We implemented a new Spark shuffle manager that writes shuffle data to a remote cluster with different storage backends. This makes life easier for customers who want to leverage the latest storage hardware, as well as HPC customers.

Sijie Guo is the PMC chair of Apache BookKeeper and a PMC member of Apache Pulsar. Previously, he worked at Twitter, where he led the messaging team. Prior to Twitter, he worked on Yahoo!'s push notification infrastructure.

Presentations

How China Telecom combats financial fraud across 50M transactions a day using Apache Pulsar Session

As a fintech company of China Telecom with half a billion registered users and 41 million monthly active users, we depend on risk control decision deployment for the success of the business. In this talk we share how we leverage Apache Pulsar to boost the efficiency of our risk control decision development for combating financial fraud across over 50 million transactions a day.

Atul Gupte is a Product Manager on the Product Platform team at Uber. He holds a BS in Computer Science from the University of Illinois at Urbana-Champaign. At Uber, he helps drive product decisions to ensure Uber’s data science teams are able to achieve their full potential, by providing access to foundational infrastructure, stable compute resources & advanced tooling to power Uber’s global ambitions. Previously, at Zynga, he spent time building some of the world’s leading social games and also helped build out the company’s mobile advertising platform.

Presentations

From raw data to informed intelligence: democratizing data science and ML at Uber Session

At Uber, we’re changing the way people think about transportation. As an integral part of the logistical fabric in 65+ countries around the world, we’re using ML and advanced data science to power every aspect of the Uber experience - from dispatch to customer support. In this talk, we’ll explore how we enable teams at Uber to transform insights into intelligence and facilitate critical workflows.

Turning Big Data into Knowledge: Managing metadata and data-relationships at Uber scale Session

At Uber’s scale and pace of growth, a robust system for discovering and managing various entities, from datasets to services to pipelines, and their relevant metadata is not just nice to have: it is absolutely integral to making data useful at Uber. In this talk, we will explore the current state of metadata management and end-to-end data flow solutions at Uber and what’s coming next.

Josh Hamilton is a data scientist at Major League Baseball, where he works with the league and its 30 teams building data pipelines and predictive models. Previously, Josh helped build data infrastructure and recommender systems for a movie streaming company and worked as a product manager for a platform-as-a-service startup. He studied finance and economics and holds an MS in applied statistics from the University of Alabama.

Presentations

Data Science and the Business of Major League Baseball Session

Utilizing SAS, Python, and AWS Sagemaker, MLB’s data science team discusses how it predicts ticket purchasers’ likelihoods to purchase again, evaluates prospective season schedules, estimates customer lifetime value, optimizes promotion schedules, quantifies the strength of fan avidity, and monitors the health of monthly subscriptions to its game-streaming service.

Hamit has over 20 years of industry and consulting experience in the areas of analytics, customer relationship management, and data-driven marketing strategies. He is a cofounder of Analytics Center, a company focused on the use of data and analytics in business, as well as an advisor or investor in several analytics-related initiatives that develop vertical machine learning solutions for industries such as advertising and e-commerce.
Hamit was a founding partner of the EMEA offices of Peppers & Rogers Group, the leading customer-led business strategy consulting firm based in the U.S., and led the development of the firm in the region by serving clients across the Middle East, Africa, and Europe. He also worked as a partner in the firm's US office, heading up its global analytics group. In this capacity he oversaw the growth of the analytics practice and helped his clients develop analytics functions, build data infrastructure, and deploy analytical models to support business goals.
His industry experience includes several positions within Federal Express in Memphis in marketing, analytics, and technology, where he led IT and business teams to leverage the enormous amount of data the company generated to serve its customers better.
Hamit is also a frequent speaker, writer and board member at various start-ups and non-profit organizations. He earned his B.Sc. degree in Electronics Engineering at Bogazici University in Istanbul and his MBA degree at University of Florida.

Presentations

An In-Depth Look at the Data Science Career: Defining Roles, Assessing Skills Session

Ever been confused about what it takes to be a data scientist? Or curious about how companies recruit, train, and manage analytics resources? This presentation covers insights from the most comprehensive research effort to date on the data analytics profession, proposes a framework for standardizing roles in the industry, and suggests methods for assessing skills.

Roy Hasson is a Sr Manager of Global Business Development for Analytics and Data Lakes at Amazon Web Services, where he helps transform organizations using data. Roy serves as an expert resource on big data architectures, data lakes and machine learning. Previously at AWS, Roy served as a Technical Account Manager leading strategy and supporting implementation of data architectures with select customers. Prior to AWS, Roy spent 15 years working with tier 1 service providers to design and deploy large data and telephone network systems.

Presentations

Fuzzy matching and deduplicating data - techniques for advanced data prep Session

Learn how to deduplicate or link records in a dataset, even when the records don’t have a common unique identifier and no fields match exactly. Link customer records across different databases (e.g. different name spelling or address.) Match external product lists against your own catalog, such as lists of hazardous goods. Solve tough challenges to prepare and cleanse data for analysis.
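
To give a sense of the underlying idea, the sketch below scores candidate duplicate pairs with Python's standard-library difflib; the records, field weights, and threshold are hypothetical, and production-grade record linkage typically adds blocking and more sophisticated scoring.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Toy customer records with no shared key and inconsistent spellings.
records = [
    {"id": 1, "name": "Jonathan Smith", "city": "New York"},
    {"id": 2, "name": "Jon Smith",      "city": "NYC"},
    {"id": 3, "name": "Maria Garcia",   "city": "Chicago"},
]

def similarity(a, b):
    """Blend name and city similarity into a single score in [0, 1]."""
    name_score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    city_score = SequenceMatcher(None, a["city"].lower(), b["city"].lower()).ratio()
    return 0.7 * name_score + 0.3 * city_score

# Flag candidate duplicate pairs above a threshold for review or merging.
for left, right in combinations(records, 2):
    score = similarity(left, right)
    if score > 0.6:
        print(f"possible duplicate: {left['id']} <-> {right['id']} (score={score:.2f})")
```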

Janet Haven is the executive director of Data & Society. Previously, she was Data & Society’s director of programs and strategy. Prior to Data & Society, Janet spent more than a decade at the Open Society Foundations, where she oversaw funding strategies and grant making related to technology’s role in supporting and advancing civil society, particularly in the areas of human rights and governance and accountability. She started her career in technology startups in Central Europe, participating in several successful acquisitions. She sits on the board of the Public Lab for Open Science and Technology and advises a range of nonprofit organizations. Janet holds an MA from the University of Virginia and a BA from Amherst College.

Presentations

Session with Janet Haven DCS

Details to come.

Amy Heineike is the VP of Product Engineering at Primer AI, where she leads teams building machines that read and write text, leveraging NLP, NLG, and a host of other algorithms to augment human analysts. Previously she built out technology for visualizing large document sets as network maps at Quid. A Cambridge mathematician who previously worked in London modeling cities, Amy is fascinated by complex human systems and the algorithms and data that help us understand them.

Presentations

Data Science vs Engineering: Does it really have to be this way? Session

Are you a data scientist who has wondered, "Why does it take so long to deploy my model into production?" Are you an engineer who has ever thought, "Data scientists have no idea what they want"? You are not alone. Join us for a lively panel discussion with industry veterans to chat about best practices and insights for increasing collaboration when developing and deploying models.

Annette Hester brings innovative approaches to working with data. Through her company, TheHesterView, she assembles leading experts in their fields into teams that deliver excellence in data structuring and data visualization. The quality of the design in her work reflects decades of experience in advisory and strategic policy services. Until recently, she was a faculty member of the University of Calgary's Haskayne Global Energy EMBA, where she was founding director of the university's Latin American Research Centre. She was also previously a senior adviser to the Deputy Minister of the Government of Alberta, Canada, and part of the energy and environment policy team for the leadership campaign that saw Alison Redford elected leader and premier. Mrs. Hester has extensive experience as a consultant to the private sector and to governmental agencies in several countries of the Americas, primarily Brazil and Canada.

Presentations

Purposefully Designing Technology for Civic Engagement Session

As new digital platforms emerge and governments look at new ways to engage with citizens, there is an increasing awareness of the role these platforms play in shaping public participation and democracy. This talk examines the design attributes of civic engagement technologies, and their ensuing impacts. A framework for better achieving desired outcomes is demonstrated with a NEB Canada case study.

Mark Hinely, Esq., is Director of Regulatory Compliance at KirkpatrickPrice and a member of the Florida Bar, with 10 years of experience in data privacy, regulatory affairs, and internal regulatory compliance. His specific experiences include performing mock regulatory audits, creating vendor compliance programs and providing compliance consulting. He is also SANS certified in the Law of Data Security and Investigations.

As GDPR has become a revolutionary data privacy law around the world, Mark has become the resident GDPR expert at KirkpatrickPrice. He has led the GDPR charge through internal training, developing free, educational content, and performing gap analyses, assessments, and consulting services for organizations of all sizes.

Presentations

Are Your Privacy Practices Auditor-Approved? Session

The fear that comes along with new compliance requirements is overwhelming. Organizations don’t know where to start, what to fix, or what an auditor expects to see. In this session, learn what an auditor’s perspective is on the newest security and privacy regulations, how your business can prepare for compliance, and what the audit looks like from their perspective.

Ana Hocevar is a data scientist in residence at the Data Incubator, where she combines her love for coding and teaching. Ana has more than a decade of experience in physics and neuroscience research and over five years of teaching experience. Previously, she was a postdoctoral fellow at the Rockefeller University, where she worked on developing and implementing an underwater touchscreen for dolphins. She holds a PhD in physics.

Presentations

Big data for managers 2-Day Training

Michael Li and Ana Hocevar offer a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and utilize their input and analysis for your business’s strategic priorities and decision making.

Big data for managers (Day 2) Training Day 2

Michael Li and Ana Hocevar offer a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and utilize their input and analysis for your business’s strategic priorities and decision making.

Scott Hoch is a lead data scientist at Deep6.ai, where he works on matching patients with clinical trials in minutes instead of months. Previously, he was VP of Engineering at Duco, a solutions engineer at Gem HQ, and a data engineer at NationBuilder. He holds a master's degree in physics from Yale University.

Presentations

Feature engineering with Spark NLP to accelerate clinical trial recruitment Session

Recruiting patients for clinical trials is a major challenge in drug development. This talk explains how Deep6 utilizes Spark NLP to scale its training and inference pipelines to millions of patients while achieving state-of-the-art accuracy. It covers the technical challenges, the architecture of the full solution, and lessons learned.

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

Turning petabytes of data from millions of vehicles into open data with Geotab Session

Geotab is a world-leading asset tracking company, with millions of vehicles under service every day. In the first part of this talk we review the challenges Geotab faced and the solutions it built to create an ML- and GIS-enabled, petabyte-scale data warehouse leveraging Google Cloud. We then review the process of publishing the data as open data, how to access it, and how cities are using it.

Garrett Hoffman is director of data science at StockTwits, where he leads efforts to use data science and machine learning to understand social dynamics and develop research and discovery tools that are used by a network of over one million investors. Garrett has a technical background in math and computer science but gets most excited about approaching data problems from a people-first perspective—using what we know or can learn about complex systems to drive optimal decisions, experiences, and outcomes.

Presentations

Deep learning methods for natural language processing Tutorial

Garrett Hoffman walks you through deep learning methods for natural language processing and natural language understanding tasks, using a live example in Python and TensorFlow with StockTwits data. Methods include word2vec, recurrent neural networks and variants (LSTM, GRU), and convolutional neural networks.
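
For orientation, here is a minimal tf.keras sketch of the recurrent-network portion of such a pipeline: a small LSTM classifier over integer-encoded token sequences. The data below is random placeholder input, not StockTwits data, and the architecture is illustrative rather than the tutorial's actual model.

```python
import numpy as np
import tensorflow as tf

# Placeholder data standing in for tokenized messages:
# integer-encoded token sequences and binary sentiment labels.
vocab_size, max_len = 10000, 40
X = np.random.randint(1, vocab_size, size=(512, max_len))
y = np.random.randint(0, 2, size=(512,))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64, input_length=max_len),
    tf.keras.layers.LSTM(64),                        # recurrent encoder
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary sentiment head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=64, validation_split=0.1)
```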

Matt Horton is the Senior Director of Data Science at Major League Baseball (MLB). In his 11+ years at MLB, Matt has developed numerous projects including predicting ticket buyers’ future purchasing behavior to aid teams in prioritizing their marketing efforts and building a framework for predicting and preventing subscriber churn for MLB’s game-streaming service, MLB.TV. Most recently, Matt’s team has been focused on quantifying fans’ relationships with their favorite teams, modeling trends in both team and league-wide attendance, and estimating fans’ future engagement with MLB.
Prior to joining MLB, Matt worked for Rosetta and Accenture. He has a BS in statistics from the University of Tennessee and a master's in applied statistics from Cornell University.
In addition to being a huge baseball fan, Matt is an avid fan of sports in general, rooting for teams from his home state of Tennessee, including the Volunteers, Titans, Predators, and Grizzlies.

Presentations

Data Science and the Business of Major League Baseball Session

Utilizing SAS, Python, and AWS Sagemaker, MLB’s data science team discusses how it predicts ticket purchasers’ likelihoods to purchase again, evaluates prospective season schedules, estimates customer lifetime value, optimizes promotion schedules, quantifies the strength of fan avidity, and monitors the health of monthly subscriptions to its game-streaming service.

Rick leads the NoSQL Blackbelt team at AWS and has designed hundreds of NoSQL database schemas for some of the largest and most highly scaled applications in the world. Many of Rick’s designs are deployed at the foundation of core Amazon and AWS services such as Cloudtrail, IAM, Cloudwatch, EC2, Alexa, and a variety of retail Internet and fulfillment center services. Rick brings over 25 years of technology expertise and has authored 9 patents across a diverse set of technologies including Complex Event Processing, Neural Network Analysis, Microprocessor Design, Cloud Virtualization, and NoSQL Technologies.

As an innovator in the NoSQL space, Rick has over the years developed a repeatable process for building real world applications that delivers highly efficient denormalized data models for workloads of any scale, and he regularly delivers highly rated sessions at re:Invent and other AWS conferences on this specific topic.

Presentations

Where's my lookup table? Modeling relational data in a denormalized world Session

Data has always been relational, and it always will be. NoSQL databases are gaining in popularity, but that does not change the fact that the data they manage is still relational, it just changes how we have to model the data. This session dives deep into how real Entity Relationship Models can be efficiently modeled in a denormalized manner using schema examples from real application services.
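
To make the denormalized access pattern concrete, here is a minimal boto3 sketch querying a hypothetical single-table design in which a customer and its orders share a partition key; the table name, key names, and item attributes are assumptions for illustration.

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical single-table design: customers stored under PK='CUSTOMER#<id>',
# with their orders stored as items whose sort key begins with 'ORDER#'.
table = boto3.resource("dynamodb").Table("app-data")

# One query retrieves a customer's orders without a join or lookup table.
response = table.query(
    KeyConditionExpression=Key("PK").eq("CUSTOMER#123")
    & Key("SK").begins_with("ORDER#")
)
for item in response["Items"]:
    print(item["SK"], item.get("total"))
```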

Shant Hovsepian is a cofounder and CTO of Arcadia Data, where he is responsible for the company’s long-term innovation and technical direction. Previously, Shant was an early member of the engineering team at Teradata, which he joined through the acquisition of Aster Data. Shant interned at Google, where he worked on optimizing the AdWords database, and was a graduate student in computer science at UCLA. He is the coauthor of publications in the areas of modular database design and high-performance storage systems.

Presentations

Intelligent Design Patterns for Cloud-Based Analytics and BI Session

With cloud object storage (e.g., S3, ADLS), one expects business intelligence (BI) applications to benefit from the scale of data and real-time analytics. However, traditional BI in the cloud surfaces non-obvious challenges. This talk reviews service-oriented cloud design (storage, compute, catalog, security, SQL) and shows how native cloud BI provides analytic depth, low cost, and performance.

Congrui Huang is a senior data scientist on the AI platform team of Microsoft's Cloud + AI division.

Presentations

Introducing a new anomaly detection algorithm (SR-CNN) inspired by Computer Vision Session

Anomaly detection may sound old-fashioned, yet it is super important in many industry applications. How about doing it in a computer vision way? Come to our talk to learn about a novel anomaly detection algorithm based on Spectral Residual (SR) and a Convolutional Neural Network (CNN), and how this method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.
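
For intuition, the sketch below computes a spectral-residual saliency map for a 1-D series in NumPy, following the general SR idea from the saliency-detection literature; it is a simplified illustration, not the production SR-CNN implementation described in the session.

```python
import numpy as np

def spectral_residual_saliency(x, window=3):
    """Saliency map for a 1-D series via the spectral residual of its log spectrum."""
    spec = np.fft.fft(x)
    amplitude = np.abs(spec)
    phase = np.angle(spec)
    log_amp = np.log(amplitude + 1e-8)
    # Average-filter the log amplitude; the residual highlights "unexpected" frequencies.
    kernel = np.ones(window) / window
    avg_log_amp = np.convolve(log_amp, kernel, mode="same")
    residual = log_amp - avg_log_amp
    # Reconstruct with the original phase; spikes in the saliency map mark anomalies.
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))

# Toy example: a flat signal with one injected spike.
signal = np.ones(128)
signal[60] = 5.0
saliency = spectral_residual_saliency(signal)
print(int(np.argmax(saliency)))  # index of the most salient (anomalous) point
```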

Jing Huang is director of engineering, machine learning at SurveyMonkey, where she drives the vision and execution of democratizing machine learning. She leads the effort to build the next-gen machine learning platform and oversees all machine learning operations. Previously she was an entrepreneur, devoting her time to building mobile-first solutions and data products for non-tech industries. She also worked at Cisco Systems for six years, where her contributions ranged from security and cloud management to big data infrastructure.

Presentations

Your cloud, your ML, but more and more scale? How SurveyMonkey did it Session

You are a SaaS company whose cloud infrastructure predates the ML era. How do you successfully extend your existing infrastructure to leverage the power of ML? In this case study, you will learn critical lessons from SurveyMonkey's journey of expanding its ML capabilities with its rich data repository and hybrid cloud infrastructure.

Ryan Hum is the Vice President of Data and Information Management, with the National Energy Board. Ryan’s previous post was with Immigration, Refugee and Citizenship Canada, where he was Director of Service Insights and Experimentation, responsible for designing and implementing service improvements for people seeking refuge, immigrating, and becoming citizens of Canada. Ryan has had a long and distinguished career in public service in the departments of Health, the Food Inspection Agency, and Natural Resources, where he was Acting Director of Sustainable Mining Policy, Intergovernmental Affairs and Environmental Assessments. He was also a founding member of the Government of Canada’s Central Innovation Hub (housed at the Privy Council Office), where he served as Chief Designer and Data Scientist leading the design insights and data analytics practice to improve policy, program and service delivery.

Ryan has Bachelor of Science (Honours) and a Bachelor of Chemical Engineering from Queen’s University, a Master’s of Engineering Design from McMaster University and is working toward a PhD in Chemical Engineering from the University of Toronto. He is also an Adjunct Professor at the Ontario College of Art and Design University (OCAD) and has previously taught design, public policy and engineering at Carleton University and the University of Toronto.

Presentations

Purposefully Designing Technology for Civic Engagement Session

As new digital platforms emerge and governments look at new ways to engage with citizens, there is an increasing awareness of the role these platforms play in shaping public participation and democracy. This talk examines the design attributes of civic engagement technologies, and their ensuing impacts. A framework for better achieving desired outcomes is demonstrated with a NEB Canada case study.

Prakhar Jain is a member of the technical staff at Qubole, where he works on the Spark team. Prakhar holds a bachelor's degree in computer science and engineering from the Indian Institute of Technology Bombay, India.

Presentations

Downscaling: The Achilles heel of autoscaling Spark Clusters Session

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, so the overall total cost of ownership (TCO) goes up. We will talk about a new design for efficient downscaling, which helps achieve better resource utilization and thus lower TCO.

Sam is a data scientist at Microsoft. He works on interpretability for machine learning.

Presentations

Unified Tooling for Machine Learning Interpretability Session

Understanding decisions made by machine learning systems is critical for sensitive uses, ensuring fairness, and debugging production models. Interpretability is a maturing field of research that presents many options for trying to understand model decisions. Microsoft is releasing new tools to help you train powerful, interpretable models and interpret decisions of existing blackbox systems.
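
As one generic, model-agnostic example of interpreting a trained model (a common baseline technique, not the Microsoft toolkit itself), the sketch below uses scikit-learn's permutation importance, available in recent scikit-learn versions, to see which features a black-box classifier relies on.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in score;
# larger drops indicate features the model relies on more heavily.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {idx}: importance {result.importances_mean[idx]:.4f}")
```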

Clare is a data scientist at Klick Health, where she focuses on identifying digital biomarkers for diagnosis, risk assessment of diseases, and prevention of health problems. She is also exploring applications of machine learning to optimize clinic performance. Previously she worked on the systems biology of cancer and on developing a computational pipeline to identify key genomic and clinical signatures for cancer treatment. She holds a PhD in bioinformatics and systems biology.

Presentations

Handling Data Gaps in Time Series Using Imputation Session

What will tomorrow’s temperature be? My blood glucose levels tonight before bed? Time series forecasts depend on sensors or measurements made out in the real, messy world. Those sensors flake out, get turned off, disconnect, and otherwise conspire to cause missing data in our signals. We will show a number of methods for handling data gaps and give advice on which to consider and when.
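
To make a few of these options concrete, here is a minimal pandas sketch comparing simple gap-filling strategies on a toy hourly signal; the data is synthetic, and the methods shown are common baselines rather than the speaker's specific recommendations.

```python
import numpy as np
import pandas as pd

# Hourly sensor readings with a few gaps, standing in for a real, messy signal.
idx = pd.date_range("2019-09-01", periods=12, freq="H")
series = pd.Series([21.0, 21.4, np.nan, np.nan, 22.3, 22.1,
                    np.nan, 21.8, 21.5, np.nan, 21.2, 21.0], index=idx)

filled_ffill = series.ffill()                      # carry the last observation forward
filled_interp = series.interpolate(method="time")  # linear in time between known points
filled_rolling = series.fillna(
    series.rolling(window=3, min_periods=1, center=True).mean()  # local mean imputation
)

print(pd.DataFrame({
    "raw": series,
    "ffill": filled_ffill,
    "time_interp": filled_interp,
    "rolling_mean": filled_rolling,
}))
```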

Nikhil is a Product Management lead at Uber. His team manages the big data storage and analytics portfolio at the company.

Before Uber, Nikhil helped customers wrangle data at companies like EMC, Pivotal, and Yahoo.

Presentations

From raw data to informed intelligence: democratizing data science and ML at Uber Session

At Uber, we’re changing the way people think about transportation. As an integral part of the logistical fabric in 65+ countries around the world, we’re using ML and advanced data science to power every aspect of the Uber experience - from dispatch to customer support. In this talk, we’ll explore how we enable teams at Uber to transform insights into intelligence and facilitate critical workflows.

Brindaalakshmi.K is a communication strategist, peer supporter, writer, researcher and activist, working at the intersection of gender, sexuality and technology.
Communication Strategy and Writing:
Journalist: She has been a journalist for over 7 years and has written extensively on business, technology, and gender and sexuality with multiple publications, including YourStory, MediaNama, Hidden Pockets, and Varta, among others.
Brand marketing consultant: She has worked on the brand communication of global brands including Allen Solly, redBus, Adobe, VMWare, Red Hat, PAYBACK, 3M Car Care, 3M Bike Care, among others.
In 2018, she designed and executed an exclusive marketing, public relations, and advocacy campaign, reach OUT, for a queer-friendly health and legal services locator launched by Varta Trust, SAATHII, and Grindr For Equality in 16 Indian states.
Policy Advocacy & Consultations: She has presented at and contributed to several national and state-level policy consultations pertaining to the representation of youth, children, and sexual minorities in different policies, including
National Consultation on the Legal Interventions for Reproductive & Sexual Health (2017),
National Consultation on Economic Inclusion of Transgender Persons in India (2017),
Foresight Forum on National Child Protection Policy (2018)
Consultation on LGBTIQA+ Workplace Inclusion in India (2019)
Research: She is presently working on research on Gendering of Development Data in India: Beyond the Binary for the Centre for Internet & Society, for the Big Data for Development Network backed by the International Development Research Centre (IDRC), Canada.
Peer Support: As a peer supporter, she has supported members of the LGBTIQA+ community on issues ranging from divorce to rape, marital rape, and rescue, among others. She also leads workshops for youth on topics including consent and safer sex.

Brindaa’s preferred pronouns are she/her/they.

Presentations

Looking Beyond the Binary: How the lack of a gender data collection standard impacts users Session

There is no standard for the collection of gender data. This session looks at the implications of that gap in the context of a developing country like India: the exclusion of individuals beyond the binary genders of male and female, and how this exclusion permeates beyond the public sector into private-sector services.

Swasti is a software development engineer on the LinkedIn Data team. Her passion lies in increasing and improving developer productivity by designing and implementing scalable platforms. In her two-year tenure at LinkedIn, she has worked on the design and implementation of Hosted Notebooks at LinkedIn, which provides a hosted solution for Jupyter Notebooks, working closely with stakeholders to understand the expectations and requirements of a platform that would improve developer productivity. Prior to this, she worked closely with the Spark team, discussing how the Spark History Server could be made more scalable to handle traffic from Dr. Elephant. She has also contributed Spark heuristics to Dr. Elephant after understanding the needs of stakeholders (mainly Spark developers), which gives her good knowledge of Spark infrastructure, Spark parameters, and how to tune them efficiently.

Presentations

Productive Data Science Platform - Beyond Hosted notebooks solution at LinkedIn Session

Come hear about the infrastructure and features offered by LinkedIn's flexible and scalable hosted data science platform. The platform lets developers seamlessly work in multiple languages, enforces developer best practices and governance policies, and provides execution, visualization, knowledge management, and collaboration features that improve developer productivity.

Victoriya Kalmanovich is an R&D group lead at a large corporation in Israel.
She specializes in healing work environments by addressing them as failing startup companies. She promotes and leads innovative, broad processes throughout the organization.
In her day-to-day work, she deals with all of her group's technological issues, product management, budgets, and client handling. She is an education enthusiast and often uses educational directives as part of her management strategy, especially in guiding and mentoring group members.
She is a firm believer in deploying data science wherever data offers great value. She has organized a successful data science hackathon and is forming a data science community within the organization.
She gives talks about management, leadership and workplace challenges.

Presentations

Predictive Maintenance - How Does Data Science Revolutionize the World of Machines? DCS

Predictive maintenance predicts the future of machines. Using data science, we establish each machine's unique life cycle and increase efficiency. In a world full of machines, we need to be the bridge connecting the methods of the past to the opportunities of the future.

Supun Kamburugamuve holds a PhD in computer science from Indiana University, specializing in high-performance data analytics. He works as a software architect at the Digital Science Center of Indiana University, where he researches big data applications and frameworks. Recently, he has been working on high-performance enhancements to big data systems with HPC interconnects such as InfiniBand and Omni-Path. Supun is an elected member of the Apache Software Foundation and has contributed to many open source projects, including Apache Web Services projects. Before joining Indiana University, Supun worked on middleware systems and was a key member of the WSO2 ESB team; WSO2 ESB is an open source enterprise integration product widely used by enterprises.

Presentations

Bridging the gap between big data computing and high-performance computing Session

Big data computing and high-performance computing (HPC) have evolved over the years as separate paradigms. With the explosion of data and the demand for machine learning algorithms, these two paradigms are increasingly embracing each other for data management and algorithms. Supun Kamburugamuve explores the possibilities and tools available for getting the best of HPC and big data.

Linhong Kang is a manager/staff data scientist at Wal-Mart Labs. She has more than 10 years of experience in data science, business analytics, and risk/fraud management across different industries, including business consulting, banking, financial payments, and ecommerce. She leads multiple fraud/abuse detection solutions for Wal-Mart's various products. She is passionate about translating business problems into quantitative questions, delivering cost savings, and helping companies become more profitable.

Presentations

Machine Learning and Large Scale Data Analysis On Centralized Platform Session

How the world's number one retailer provides a secure and seamless shopping experience through machine learning and large-scale data analysis on a centralized platform.

Amit Kapoor is interested in learning and teaching the craft of telling visual stories with data. At narrativeVIZ Consulting, Amit uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Amit also teaches storytelling with data for executive courses as a guest faculty member at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. He has more than 12 years of management consulting experience with AT Kearney in India, Booz & Company in Europe, and startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi, and a PGDM (MBA) from IIM, Ahmedabad.

Presentations

Recommendation System using Deep Learning 2-Day Training

In this two-day workshop, you will learn the different paradigms of recommendation systems and be introduced to deep learning-based approaches. By the end of the workshop, you will have enough practical hands-on knowledge to build, select, deploy, and maintain a recommendation system for your problem.

Recommendation System using Deep Learning (Day 2) Training Day 2

In this two-day workshop, you will learn the different paradigms of recommendation systems and be introduced to deep learning-based approaches. By the end of the workshop, you will have enough practical hands-on knowledge to build, select, deploy, and maintain a recommendation system for your problem.

Meher is a seasoned software developer whose apps are used by tens of millions of users every day. Currently at Square, and previously at Microsoft, he has shipped features for a range of apps, from Square's Point of Sale to the Bing app. He was the mobile development lead for Microsoft's Seeing AI app, which has received widespread recognition and awards from Mobile World Congress, CES, the FCC, and the American Council of the Blind, to name a few. A hacker at heart with a flair for fast prototyping, he has won close to two dozen hackathons and converted them into features shipped in widely used products. He also serves as a judge of international competitions, including the Global Mobile Awards and the Edison Awards.

Presentations

Deep Learning on Mobile Session

Optimizing deep neural nets to run efficiently on mobile devices.

Until recently, Arun Kejariwal was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers and worked on research and development of novel techniques for install and click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns. In addition, his team built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Serverless Streaming Architectures & Algorithms for the Enterprise Tutorial

In this tutorial, we walk the audience through the landscape of streaming systems and survey the inception and growth of the serverless paradigm. Next, we present a deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar Functions, and paint a bird's-eye view of the application domains where Pulsar Functions can be leveraged.
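For a flavor of the serverless model the tutorial covers, here is a minimal sketch of a Pulsar function in Python, assuming the Pulsar Functions Python SDK that ships with the pulsar-client package; the function body and topic wiring are illustrative only and are not taken from the tutorial material.

```python
# Minimal sketch of an Apache Pulsar function (assumes the Pulsar Functions
# Python SDK from pulsar-client). The logic is a toy example: it echoes each
# message with an exclamation mark appended.
from pulsar import Function


class ExclamationFunction(Function):
    """Appends an exclamation mark to every message it receives."""

    def process(self, input, context):
        # `input` is the message payload delivered by the Pulsar Functions
        # runtime; the return value is published to the function's output topic.
        context.get_logger().info("processing message: %s", input)
        return input + "!"
```

In the serverless model, the runtime handles subscription, scaling, and delivery of the output, so the developer only supplies the `process` logic; the function is then deployed with the pulsar-admin tooling against chosen input and output topics.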

Brian is the Chief Data Scientist at Rubikloud Technologies, where he leads a team building intelligent enterprise solutions for some of the world's largest retail organizations. Brian is a big fan of Bayesian statistics, but his main professional focus is building scalable machine learning systems that integrate seamlessly into traditional software solutions. Before Rubikloud, Brian worked at Sysomos, leading a team of data scientists performing large-scale social media analytics with datasets such as the Twitter firehose. He earned his PhD in Computer Engineering from the University of Toronto, during which time he was an early employee of a startup that commercialized some of his research.

Presentations

ML is not enough: Decision Automation in the Real World Session

Automating decisions requires a system to consider more than just a data-driven prediction. Real-world decisions require additional constraints and fuzzy objectives to ensure that they are robust and consistent with business goals. This talk will describe how to leverage modern machine learning methods and traditional mathematical optimization techniques for decision automation.

Dr. Jennifer Kloke is the VP of Product Innovation at Ayasdi. For the last three years, she has been responsible for automation and algorithm development across the entire Ayasdi codebase and for cutting-edge analysis techniques utilizing TDA and AI. During that time, she was the principal investigator for a Phase 2 DARPA SBIR developing automation and data fusion capabilities, which have led to breakthroughs in the field and several patents. Jennifer also served five years as a senior data scientist analyzing a wide variety of data, including point cloud, text, and network data, from diverse industries including large military contractors, finance, biotech, and electronics manufacturing. Her work includes developing prediction algorithms for reducing the number of false alarms for a large military jet manufacturer, as well as developing and deploying a predictive program management application at a large government contractor. Jennifer and her team's efforts landed Ayasdi spots on Fast Company's Most Innovative Companies in Big Data and the AIconics, as well as consecutive appearances on the Forbes Fintech 50.

Jennifer received her Ph.D. in Mathematics from Stanford University with an emphasis on topological data analysis. She has collaborated with chemists at Lawrence Berkeley National Laboratory and UC Berkeley to develop topological methods to mine large databases of chemical compounds to identify energy efficient compounds for carbon capture. She also developed a de-noising algorithm to efficiently process high dimensional data and has published in the Journal of Differential Geometry.

Presentations

Assumed Risk vs. Actual Risk: The New World of Behavior-Based Risk Modeling Findata

Jennifer Kloke will discuss how banks and financial enterprises can adopt and integrate actual risk models with existing systems to enhance the performance and operational efficiency of the financial crimes organization. In doing so, she will explain how actual risk models can reduce segmentation noise, utilize unlabeled transactional data, and spot unusual behavior more effectively.

VP of Technology
Jari holds a Ph.D. in Computer Science from the Royal Institute of Technology, Stockholm Sweden. Dr. Koister has also been teaching at the Data Science program at UC Berkeley.

Jari led research at Ericsson Labs (Stockholm and Los Angeles) and Hewlett-Packard Laboratories (Palo Alto) in computer languages and distributed computing. Between 1998 and 2002, Jari led development at CommerceOne of the flagship product MarketSite. After Commerce One, Jari worked as CTO and CEO with investors turning around and growing companies in online gaming and smart metering. In 2006, Jari founded two companies, both in the emerging space of cloud computing. The first, where he served as CSO/CTO, was Qrodo.com (Singapore), which delivers an elastic platform for broadcasting sports events live on the internet. Jari also founded and held the position of CTO at Groupswim.com (San Francisco), an early social enterprise collaboration company. In 2009, Salesforce.com acquired Groupswim.com, at which time Jari moved into the role of leading the development of Chatter, Salesforce.com's social enterprise application and platform. In 2012, Jari took the role of VP Technology at AgilOne, leading Software Engineering, Product Data Science, and Technical Operations.

Presentations

How Machine Learning Meets Optimization Session

Machine Learning and Constraint-based Optimization are both used to solve critical business problems. They come from distinct research communities and have traditionally been treated separately. This talk describes how they are similar, how they differ and how they can be used to solve complex problems with amazing results.

Stavros is a senior engineer on the data systems team at Lightbend, where he helps implement Lightbend's fast data strategy. He has spent several years building software solutions that scale in different verticals such as telecoms and marketing. His interests include distributed systems design, streaming technologies, and NoSQL databases.

Presentations

Online Machine Learning in Streaming Applications Session

In this talk, we discuss online machine learning algorithm choices for streaming applications. We motivate the discussion with resource-constrained use cases like IoT and personalization. We cover Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms, all the way from implementation to production deployment, describing the pros and cons of using each of them.
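As a small illustration of the kind of online learner discussed here, the sketch below trains a Hoeffding Adaptive Tree one record at a time and watches the error stream with an ADWIN drift detector. It assumes the open source river library (recent releases, 0.19 or later) with a synthetic feature stream; this is our illustrative example, not the speakers' code.

```python
# Illustrative sketch: online Hoeffding Adaptive Tree + ADWIN drift detection
# using the river library (API per recent releases). Data is synthetic.
from river import drift, metrics, tree

model = tree.HoeffdingAdaptiveTreeClassifier(seed=42)
detector = drift.ADWIN()
accuracy = metrics.Accuracy()

stream = [
    ({"temp": 21.0, "load": 0.3}, False),
    ({"temp": 35.5, "load": 0.9}, True),
    ({"temp": 22.4, "load": 0.4}, False),
]

for x, y in stream:
    y_pred = model.predict_one(x)           # score with the current model
    if y_pred is not None:
        accuracy.update(y, y_pred)
        detector.update(int(y_pred != y))    # feed the error signal to ADWIN
        if detector.drift_detected:
            print("drift detected; consider resetting or re-weighting the model")
    model.learn_one(x, y)                    # then learn from the labeled record

print("online accuracy:", accuracy)
```

The key property for resource-constrained settings is that both the tree and the detector update incrementally in constant memory per record, so the same loop runs unchanged against an unbounded stream.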

Cassie Kozyrkov is Google Cloud’s chief decision scientist. Cassie is passionate about helping everyone make better decisions through harnessing the beauty and power of data. She speaks at conferences and meets with leadership teams to empower decision makers to transform their industries through AI, machine learning, and analytics. At Google, Cassie has advised more than a hundred teams on statistics and machine learning, working most closely with research and machine intelligence, Google Maps, and ads and commerce. She has also personally trained more than 15,000 Googlers (executives, engineers, scientists, and even nontechnical staff members) in machine learning, statistics, and data-driven decision making. Previously, Cassie spent a decade working as a data scientist and consultant. She is a leading expert in decision science, with undergraduate studies in statistics and economics at the University of Chicago and graduate studies in statistics, neuroscience, and psychology at Duke University and NCSU. When she’s not working, you’re most likely to find Cassie at the theatre, in an art museum, exploring the world, playing board games, or curled up with a good novel.

Presentations

Keynote with Cassie Kozyrkov Keynote

Cassie Kozyrkov, Chief Decision Scientist, Google

Aljoscha Krettek is a cofounder and software engineer at Ververica. Previously, he worked at IBM Germany and at the IBM Almaden Research Center in San Jose. Aljoscha is a PMC member at Apache Beam and Apache Flink, where he mainly works on the Streaming API and also designed and implemented the most recent additions to the Windowing and State APIs. He studied computer science at TU Berlin.

Presentations

Stream Processing beyond Streaming Data Session

The talk discusses how stream processing is becoming a "grand unifying paradigm" for data processing and the newest developments in Apache Flink to support this trend: New cross-batch-streaming Machine Learning algorithms, State-of-the-art batch performance, and new building blocks for data-driven applications and application consistency.

Purnima is a big data evangelist with 15 years of experience in the industry. Purnima came to Cloudera after working with IBM and ADP. She works with customers on their IoT, cloud, and big data strategies. She has previously presented at other conferences, including IBM events and DataWorks Summit in San Jose and Barcelona.

Presentations

IoT - Cloudera Edge Management Tutorial

Too many edge devices and agents: how does one control and manage them? How do we handle the difficulty of collecting real-time data and, most importantly, the trouble of updating specific sets of agents with edge applications? Get your hands dirty with Cloudera Edge Management, which addresses these challenges with ease.

Kafka/SMM(Streams Messaging Manager) Crash Course Tutorial

Kafka is omnipresent and is the backbone of not only streaming analytics applications but data lakes as well. The challenge is understanding what is going on in the Kafka cluster overall, including performance, issues, and message flows. This session gives attendees hands-on experience visualizing their entire Kafka environment end to end and simplifying Kafka operations via SMM.

Mars is currently the technical lead of the metadata team at LinkedIn and has led the team through the design and implementation of LinkedIn's metadata infrastructure for the past two years. Prior to that, he was a software engineer at Google, working on the Google Assistant and Google Cloud products. Mars received his PhD in Computer Science from UCLA.

Presentations

The Evolution of Metadata: LinkedIn’s story Session

How do you scale metadata to an organization of 10,000 employees and 1M+ data assets, in an AI-enabled company that ships code to the site three times a day? We describe the journey of LinkedIn's metadata from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. Different metadata strategies and our battle scars will be revealed!

Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Julien is a principal engineer at WeWork. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

Data platform architecture principles Session

Big data is crucial to organizations: big not only in data volume but also in the multitude of data sources and teams using them. Having central data teams do all the work is outdated; the entire organization becomes an ecosystem and central teams become enablers. We will discuss the principles of a data platform that enables the entire organization to build data-centric products.

Drew Leamon started his career at Microsoft while studying Computer Science at Princeton University. In his studies, he delved into Computer Graphics, Artificial Intelligence and Computational Neurobiology. At Microsoft, he collaborated with Microsoft Research on one of the first commercial implementations of collaborative filtering for e-commerce. This was released as Microsoft Site Server: Commerce Edition.

Graduating into the DotCom boom, Drew caught the entrepreneurial spirit of the time and went on to sell cars on the internet through CarOrder.com, a Trilogy Software spin-off. While there, he created new ways to sell content online through innovative configuration solutions. Next Drew became one of the charter members of AirClic where he helped to create a platform to support workforce automation using wireless technologies. Drew’s work and IP in this space became core to the company’s business value. At Traffic.com / Navteq / Nokia, Drew pioneered the visualization of traffic data collected from highway sensors and digital probe devices.

Moving on to Comcast, Drew has taken his diverse experience and background and now leads part of the Engineering Analysis organization. His team is developing advanced data visualizations for network data. They are building elastically scaling Big Data infrastructure to support Analytic workloads. Simulations of Comcast’s CDNs and platforms, developed by Drew’s team, are leveraging this platform and guiding the business and engineering teams. His team is identifying high ROI opportunities. They apply machine learning to datasets and are currently operationalizing the resulting predictive models to help improve customer experience.

Presentations

Automating ML Model Training & Deployments via Metadata driven Data/Infrastructure/Feature Engineering/Model Management Session

An overview of the data management and privacy challenges around automating ML model (re)deployments and stream-based inferencing at scale.

Chon Yong Lee is a project manager at SK Telecom, where he designs 5G Infra Visualization, the company's telco network visualization system. He has 10 years of experience in the telecom network area and has demonstrated 5G Infra Visualization twice at MWC.

Presentations

SK Telecom's 5G network monitoring and 3D visualization on streaming technologies Session

Architecture and lessons learned from the development of T-CORE, SK Telecom's monitoring and service analytics platform, which collects system and application data from several thousand servers and applications and provides a 3D-visualized, real-time view of the whole network and services for operators, as well as an analytics platform for data scientists, engineers, and developers.

JongHyok Lee is an architect at SK Telecom, where he designs T-CORE, the company's monitoring and analytics platform. He has more than 20 years of experience in the data processing area. Before joining SK Telecom as a senior architect, he worked at IBM, where he led the architecture design of several enterprise-wide data processing systems for companies in various industries.

Presentations

SK Telecom's 5G network monitoring and 3D visualization on streaming technologies Session

Architecture and lessons learned from the development of T-CORE, SK Telecom's monitoring and service analytics platform, which collects system and application data from several thousand servers and applications and provides a 3D-visualized, real-time view of the whole network and services for operators, as well as an analytics platform for data scientists, engineers, and developers.

Brenda Leong, CIPP/US, is Senior Counsel and Director of Strategy at the Future of Privacy Forum. She oversees strategic planning of organizational goals, as well as managing the FPF portfolio on biometrics, particularly facial recognition, along with the ethics and privacy issues associated with artificial intelligence. She works on industry standards and collaboration on privacy concerns, by partnering with stakeholders and advocates to reach practical solutions to the privacy challenges for consumer and commercial data uses. Prior to working at FPF, Brenda served in the U.S. Air Force, including policy and legislative affairs work from the Pentagon and the U.S. Department of State. She is a 2014 graduate of George Mason University School of Law.

Presentations

Regulations and the Future of Data Session

From the EU to California and China, more and more of the world is regulating how data can be used. In this session, Immuta and the Future of Privacy Forum will convene leading experts on law and data science for a deep dive into ways to regulate the use of AI and advanced analytics. Come learn why these laws are being proposed, how they’ll impact data, and what the future has in store.

War Stories from the Front Lines of ML Session

Machine learning techniques are being deployed across almost every industry and sector. But this adoption comes with real, and oftentimes underestimated, privacy and security risks. In this session, Immuta and the Future of Privacy Forum will convene leading industry representatives and experts to talk about real life examples of when ML goes wrong, and the lessons they learned.

Tomer is a senior data engineer on the DataOps team at Fundbox, where he helps shape the data platform architecture to drive business goals.
Previously, he was a data engineer at Intel’s advanced analytics group helping to build out the data platform supporting the data storage and analysis needs of Intel® Pharma Analytics Platform, an edge-to-cloud artificial intelligence solution for remote monitoring of patients during clinical trials.
He is incredibly passionate about the power of data. Tomer holds a BSc in software engineering.

Presentations

Orchestrating Data Workflows Using a Fully Serverless Architecture Session

Data workflows are fundamental to any data engineering team. Nonetheless, designing an easy-to-use, scalable, and flexible data workflow platform is a complex undertaking. In this talk, attendees will learn how the data engineering team at Fundbox uses AWS serverless technologies to address this problem and how it enables data scientists, BI developers, and engineers to move faster.
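To make the serverless orchestration idea concrete, here is a minimal, hypothetical sketch of one workflow step: an AWS Lambda handler (Python, boto3) that reacts to an S3 "object created" event and starts a Step Functions execution. The state machine ARN, bucket layout, and payload shape are our own placeholders, not Fundbox's actual setup.

```python
# Hypothetical sketch of a serverless workflow trigger: a Lambda handler that
# turns each new S3 object into one Step Functions execution. ARNs, bucket
# names, and payload fields are illustrative placeholders only.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]  # set on the Lambda function


def lambda_handler(event, context):
    executions = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Each new file kicks off one workflow run; the state machine then
        # orchestrates the validation, transformation, and load steps.
        response = sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"bucket": bucket, "key": key}),
        )
        executions.append(response["executionArn"])
    return {"started": executions}
```

Because both the trigger and the orchestration are managed services, there are no servers to provision or keep warm, which is the core appeal of the fully serverless approach the talk describes.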

Luyao Li is a senior software engineer at Uber. He is an enthusiast of building reliable, scalable, and performant systems. Prior to Uber, he architected EA's global ad-campaign SaaS suite, including Segmentation Manager and Engagement Manager, on top of data from all franchises. He holds a master's degree from Duke University.

Presentations

Turning Big Data into Knowledge: Managing metadata and data-relationships at Uber scale Session

At Uber’s scale and pace of growth, a robust system for discovering and managing various entities, from datasets to services to pipelines, and their relevant metadata is not just nice to have: it is absolutely integral to making data useful at Uber. In this talk, we will explore the current state of metadata management and end-to-end data flow solutions at Uber and what’s coming next.

Michael Li is the founder of the Data Incubator, an elite fellowship program that trains and places data scientists and quants with advanced degrees (PhD or master's) into industry roles. Previously, Michael was a data science lead with Foursquare and with Andreessen Horowitz. He holds a PhD in math from Princeton University.

Presentations

Big data for managers 2-Day Training

Michael Li and Ana Hocevar offer a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and utilize their input and analysis for your business’s strategic priorities and decision making.

Big data for managers (Day 2) Training Day 2

Michael Li and Ana Hocevar offer a nontechnical overview of AI and data science. Learn common techniques, how to apply them in your organization, and common pitfalls to avoid. You’ll pick up the language and develop a framework to be able to effectively engage with technical experts and utilize their input and analysis for your business’s strategic priorities and decision making.

Derek is a seasoned data scientist passionate about the art of building data-driven defenses against cyber threats and fraud. He also enjoys solving challenging big data problems in enterprise IT operations. His current and prior machine learning research experience includes behavior-based security analytics such as malware detection and insider threat detection, risk-based online banking fraud detection, data loss prevention, voice-biometrics security, and speech and language processing. Derek is an experienced leader in directing teams of data scientists to perform POCs, core research, and product development.

Presentations

Learning asset naming patterns to find risky unmanaged devices Session

Unmanaged and foreign devices in corporate networks pose a security risk, and the first step toward reducing risk from these devices is the ability to identify them. To support a comprehensive device management program, we propose a deep learning model that performs anomaly detection based only on device names, flagging devices that do not follow the organization's naming structures.

Kai Liu is a Senior Program Manager in the AI and Research group at Microsoft. He has seven years of experience in data-driven engineering, big data platforms, and AI infrastructure for the Office product families. He led his team to create a service health portal for SharePoint Online, introduce a distributed log collection and storage system for Exchange Online, publish curated data sets and key business metrics, and enable sub-hour experimentation in Office 365. Currently he is working on AI and deep learning infrastructure for large-scale enterprise data under compliance obligations.

Presentations

Large-scale Deep Learning offline platform: Bing's approach Session

Facilitating many large-scale deep learning projects in parallel requires effort and innovation. Bing now runs a deployment of thousands of servers to address this challenge. We provide training services, offline data processing, vector hosting, and offline inferencing services to help data scientists through all steps of the project life cycle.

Dr. Audrey Lobo-Pulo is the founder of Phoensight, a data and technology startup consultancy, and has a passion for using emerging data technologies to empower individuals, governments and organisations in creating a better society. Audrey holds a PhD in Physics and a Masters in Economic Policy, and has over 10 years experience working with the Australian Treasury in public policy areas including personal taxation, housing, social policy, labour markets and population demographics.

Audrey is an Open Government advocate and has a passion for Open Data and Open Models. She has pioneered the concept of ‘Government Open Source Models’, which are government policy models open to the public to use, modify and distribute freely. Audrey is deeply interested in how technology enables citizens to actively participate and engage with their governments in co-creating public policy.

Presentations

Purposefully Designing Technology for Civic Engagement Session

As new digital platforms emerge and governments look at new ways to engage with citizens, there is an increasing awareness of the role these platforms play in shaping public participation and democracy. This talk examines the design attributes of civic engagement technologies, and their ensuing impacts. A framework for better achieving desired outcomes is demonstrated with a NEB Canada case study.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience. He enjoys intelligent design and engaging storytelling and is passionate about data, music, and nature.

Presentations

Building a serverless big data application on AWS 2-Day Training

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Building a serverless big data application on AWS (Day 2) Training Day 2

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Over his career, he has been responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He is also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on Machine Learning with Kafka-based Streaming Pipelines Tutorial

This hands-on tutorial examines production use of ML in streaming data pipelines: how to do periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, pros and cons of microservices vs. systems like Spark and Flink, tips for TensorFlow and Spark ML, performance considerations, model metadata tracking, and other techniques.
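As a taste of the low-latency scoring pattern, here is a minimal sketch (not the tutorial's actual code) of a Python microservice that consumes feature records from one Kafka topic, scores them with a pre-trained scikit-learn model, and publishes predictions to another topic via kafka-python; the topic names, broker address, and model file are assumptions.

```python
# Minimal sketch of low-latency model scoring over Kafka with kafka-python and
# a pre-trained scikit-learn model. Topic names, broker address, and the model
# file are hypothetical placeholders.
import json
import pickle

from kafka import KafkaConsumer, KafkaProducer

with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # assumed to be a fitted scikit-learn classifier

consumer = KafkaConsumer(
    "features",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    record = message.value  # e.g. {"id": "...", "features": [...]}
    score = model.predict_proba([record["features"]])[0][1]
    producer.send("scores", {"id": record["id"], "score": float(score)})
```

Periodic retraining then amounts to a separate job that publishes a new model artifact; the scoring service reloads it without the stream ever stopping, which is one of the trade-offs (microservice vs. Spark/Flink job) the tutorial weighs.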

Gloria Macia is a data scientist at Roche AG, where she works at the Diagnostics Data Science Lab bringing data-driven solutions to the market. Prior to joining the company, she was part of the Quality and Clinical Affairs Management team at Sonova AG. Her interests include AI, translational science, medical device development, and healthcare regulation. Gloria came to Switzerland in 2015 for a research stay at EPFL's Institute of Bioengineering and later decided to pursue an MSc in Biomedical Engineering – Bioelectronics at ETH Zurich. Her computer science skills have earned her several prizes and scholarships from renowned companies such as Intel, Toptal, and Palantir.

Presentations

AI & Health: achieving regulatory compliance DCS

Healthcare is emerging as a prominent area for AI applications but innovators aiming to seize this chance face one major issue: achieving regulatory compliance. With a real industry case study, this talk will guide the audience through the current American & European regulatory framework for medical devices and provide a step-by-step guide to market for AI applications.

David is a founder and machine learning engineer at Octavian.ai, exploring new approaches to machine learning on graphs.

Previously, he co-founded SketchDeck, a Y Combinator-backed technology startup providing design as a service. He has an MSci in Mathematics and the Foundations of Computer Science from the University of Oxford and a BA in Computer Science from the University of Cambridge.

Presentations

An introduction to machine learning on graphs Session

Graphs are a powerful way to represent knowledge. Organizations (in fields such as bio-sciences and finance) are starting to amass large knowledge graphs, but lack the machine-learning tools to extract the insights they need from them. In this presentation, I’ll give an overview of what insights are possible and survey the most popular approaches.

Anand Madhavan is VP of Engineering at Narvar. Before that, he was the head of engineering for the Discover product at Snapchat, and prior to that he was the Director of Engineering at Twitter, where he worked on building out the ad serving system for Twitter Ads. He has an MS in computer science from Stanford University.

Presentations

Post Transaction Processing Using Apache Pulsar at Narvar Session

Narvar provides a next-generation post-transaction experience for more than 500 retailers. This talk explores Narvar's journey of moving away from a slew of technologies for its platform and consolidating its use cases on Apache Pulsar.

Mark Madsen is a fellow at Teradata, where he is responsible for understanding, forecasting, and defining the analytics ecosystem and architecture. Previously, he was CEO of Third Nature, where he advised companies on data strategy and technology planning and vendors on product management. Mark has designed analysis, machine learning, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Deepak Majeti is a systems software engineer at Vertica. He is also an active contributor to Hadoop’s two most popular file formats: ORC and Parquet. His interests lie in getting the best from high-performance computing (HPC) and big data by building scalable, high-performance, and energy-efficient data analytics tools for modern computer architectures. Deepak holds a PhD in the HPC domain from Rice University.

Presentations

Kubernetes for Stateful MPP systems Session

Analytics experts GoodData needed to auto-recover from node failures and scale rapidly when workloads spike on their MPP database in the cloud. Kubernetes could solve that, but Kubernetes is designed for stateless microservices, not a stateful MPP database that needs hundreds of containers. To merge the power of an MPP database with the flexibility of Kubernetes, a lot of hurdles had to be overcome.

Ted Malaska is currently a Director of Enterprise Architecture at Capital One. Before that, he was the Director of Engineering in Blizzard's Global Insight department, a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Foundations for Successful Data Projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation, we'll provide guidance and best practices, from planning to implementation, based on years of experience working with companies to deliver successful data projects.

Chaithanya is an Assistant Vice President at EXL Service. He has over 10 years of experience developing advanced analytics solutions across multiple business domains and holds a bachelor of technology degree from IIT Guwahati. At EXL, he is responsible for building AI-enabled solutions that bring efficiencies across various business processes.

Presentations

Improving OCR Quality of Documents using Generative Adversarial Networks Session

Every NLP-based document processing solution depends on converting scanned documents and images to machine-readable text using OCR. However, the accuracy of OCR solutions is limited by the quality of the scanned images. We show that generative adversarial networks can bring significant efficiencies to any document processing solution by enhancing resolution and de-noising scanned images.

Rochelle March is a sustainability expert, quantifier of environmental and social impact, and developer of financial products designed to improve the ESG (environmental, social, governance) performance of companies. She is a Senior Analyst at Trucost, part of S&P Global where she manages a portfolio of clients and products, including the Trucost SDG Evaluation product.

Presentations

How S&P’s Trucost Empowered Their Analysts with Modern and Interactive Data Reporting Tools Findata

S&P’s Trucost Senior Analyst, Rochelle March, is migrating a product that quantitatively measures company performance on UN Sustainable Development Goals from Excel to Python. Her team cut multi-day workflows to a few hours, delivering rich, 27-page interactive reports. Learn about modern techniques to design, build, and deploy a data visualization and reporting framework in your organization.

Tim McKenzie is general manager of big data solutions at Pitney Bowes, where he leads a global team dedicated to helping clients unlock the value that is hidden in the massive amounts of data collected about customers, infrastructure, and products. With over 17 years of experience engaging with customers about technology, Tim has a proven track record of delivering value in every engagement.

Presentations

Enabling 5G use cases through Location Intelligence Session

Planning 5G network rollout and associated services requires a good understanding of location based data. Accurate addressing and linking consumers to property parcels or points of interest allows data enrichment with property attributes, demographics and social data. Companies use location to organize and analyze network and customer data in order to understand where to target new services.

Hamlet is a Senior Data Scientist at Criteo. Previously, he worked as a control systems engineer for Petróleos de Venezuela. Hamlet has finished at the top of the rankings in multiple data science competitions, including a prestigious 4th place in a 2018 competition on predicting returns volatility on the NY Stock Exchange hosted by Collège de France and Capital Fund Management, and 25th place in a 2018 competition on predicting stock returns hosted by G-Research. Hamlet holds two master's degrees, in Mathematics and Machine Learning, from Pierre and Marie Curie University (Paris 6), and a PhD in Applied Mathematics from University Paris-Sud (Paris 11) in France, where he focused on statistical signal processing and machine learning.

Presentations

Predicting Criteo’s Internet traffic load using Bayesian structural time series model. Session

Criteo's infrastructure provides the capacity and connectivity to host Criteo's platform and applications. The evolution of our infrastructure is driven by our ability to forecast Criteo's traffic demand. In this talk, we explain how Criteo uses Bayesian dynamic time series models to accurately forecast its traffic load and optimize hardware resources across data centers.
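For readers who want to experiment with the general idea, here is a small illustrative sketch of a structural time series forecast in Python using statsmodels' UnobservedComponents. It is a maximum-likelihood stand-in for the Bayesian dynamic models the talk describes, fitted on a synthetic "traffic load" series with an assumed weekly seasonality; none of it is Criteo's actual model or data.

```python
# Illustrative structural time series sketch with statsmodels: local linear
# trend + weekly seasonality on a synthetic traffic-load series. This is a
# maximum-likelihood stand-in, not Criteo's Bayesian model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
days = pd.date_range("2019-01-01", periods=120, freq="D")
load = (
    1000 + 2.5 * np.arange(120)                          # slow growth trend
    + 80 * np.sin(2 * np.pi * np.arange(120) / 7)        # weekly cycle
    + rng.normal(scale=20, size=120)                     # noise
)
series = pd.Series(load, index=days)

model = sm.tsa.UnobservedComponents(series, level="local linear trend", seasonal=7)
result = model.fit(disp=False)

forecast = result.get_forecast(steps=14)
print(forecast.predicted_mean.round(1))   # point forecast for the next two weeks
print(forecast.conf_int().round(1))       # uncertainty band for capacity planning
```

The uncertainty interval is the part that matters for hardware planning: provisioning against the upper band rather than the point forecast is what turns a forecast into a capacity decision.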

Nitzan Mekel-Bobrov is managing vice president of machine learning and artificial intelligence at Capital One, where he leads a group of scientists, engineers, researchers, and product managers who are driving ML and AI transformation at the company. Nitzan and his team develop enterprise-scale, real-time intelligent solutions for a broad range of business problems, including forecasting, NLP in customer service and conversational AI, anomaly & fraud detection, reinforcement learning and more. Nitzan earned his master’s degree in computer science and his Ph.D. in bioinformatics from the University of Chicago.

Presentations

How Real-Time Machine Learning is Redefining the Customer Experience Findata

Capital One's Card Machine Learning lead will provide a framework for how enterprises can use real-time machine learning to provide long-term, recurring relationships with customers, with a focus on leveraging longitudinal omni-channel context, enhanced capabilities able to infer state from this new scale of data, and real-time personalization based on streaming data ecosystems.

Martin Mendez-Costabel leads the Geospatial Data Asset team for Monsanto’s Products and Engineering organization within the IT department, where he drives the engineering and adoption of global geospatial data assets for the enterprise. He has more than 12 years of experience in the agricultural sector covering a wide range of precision agriculture-related roles, including data scientist and GIS manager for E&J Gallo Winery in California. Martin holds an agronomy degree (BSc) from the National University of Uruguay and two viticulture degrees: an MSc from the University of California, Davis and a PhD from the University of Adelaide in Australia.

Presentations

Optimizing ROI of a Geospatial Platform in Cloud DCS

Cloud architecture is an extremely flexible environment for deploying solutions, and our recent learnings have revealed some interesting insights. One is that the first build of a solution, even using open source software, may quickly exceed initial cost estimates and could outpace ROI if not managed properly. This case study focuses on managing our geospatial platform and how we increased ROI.

Sara Menker is the founder and CEO of Gro Intelligence, a data company dedicated to building products that change the way the world understands agriculture. Prior to founding Gro, Sara was a Vice President in Morgan Stanley’s commodities group. She began her career in commodities risk management, where she covered all commodity markets, and subsequently moved to trading, where she managed an options trading portfolio. Sara is a Trustee of the Mandela Institute For Development Studies (MINDS) and a Trustee of the International Center for Tropical Agriculture (CIAT). Sara was named a Global Young Leader by the World Economic Forum and is a fellow of the African Leadership Initiative of the Aspen Institute. Sara received a B.A. in Economics and African Studies at Mount Holyoke College and the London School of Economics and an M.B.A. from Columbia University.

Presentations

Keynote with Sara Menker Keynote

Sara Menker, CEO, Gro Intelligence

Subhasish Misra is currently a Data Scientist at Walmart Labs where he leads efforts to create scalable machine learning solutions for Walmart’s customer base. Alongside this, he is also a member of the global data science board at i-com, a cross industry global think tank on harnessing data & analytics for better marketing.
Subhasish has previously worked at Hewlett Packard Co, WPP & Aon and consulted for many Fortune 500 clients across multiple geographies in his 12 years of advanced analytics career.
His broad expertise spans a wide spectrum of marketing analytics, and his current data science interests are in modeling customer behavior and causal inference.
Subhasish holds a M.A in Economics from Delhi School of Economics, where econometrics was one of his focus areas.

Presentations

Causal inference 101: Answering the crucial ‘why’ in your analysis. Session

Causal questions are ubiquitous. Randomized tests are considered the gold standard for answering them. However, such tests are not always feasible, and then one has only observational data from which to derive causal insights. Techniques such as matching offer a solution in these cases. This talk offers a take on these aspects and shares practical tips for inferring causal effects.
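To ground the matching idea, here is a compact, illustrative propensity score matching sketch in Python on synthetic data, using scikit-learn logistic regression for the propensity model and greedy nearest-neighbor matching. It is a teaching example under our own assumptions, not the speaker's methodology.

```python
# Illustrative propensity score matching on synthetic observational data:
# model the propensity to be treated, match each treated unit to its nearest
# control on that score, and compare outcomes. Not the speaker's actual code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=(n, 3))                                # observed confounders
propensity_true = 1 / (1 + np.exp(-(x[:, 0] + 0.5 * x[:, 1])))
treated = rng.binomial(1, propensity_true)                 # non-random assignment
outcome = 2.0 * treated + x @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

# 1. Estimate the propensity score from the observed confounders.
ps = LogisticRegression().fit(x, treated).predict_proba(x)[:, 1]

# 2. Greedy 1-nearest-neighbor matching of treated units to controls.
treated_idx = np.where(treated == 1)[0]
control_idx = np.where(treated == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[control_idx].reshape(-1, 1))
_, matches = nn.kneighbors(ps[treated_idx].reshape(-1, 1))
matched_controls = control_idx[matches.ravel()]

# 3. Average treatment effect on the treated from the matched pairs.
att = (outcome[treated_idx] - outcome[matched_controls]).mean()
print(f"estimated ATT ~ {att:.2f} (true effect is 2.0)")
```

A naive difference in means between treated and untreated groups would be biased here because treatment depends on the confounders; matching on the propensity score is what recovers an estimate close to the true effect.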

Sanjay is the Group Chief Technology Officer of MakeMyTrip Ltd. He leads overall technology for MakeMyTrip, GoIbibo, and Redbus. Prior to the merger with the Ibibo Group, he was the CTO of MakeMyTrip.

Sanjay is responsible for developing and executing the global technology strategy for the combined entity and planning ongoing technology innovations for the company's continued success. Sanjay brings extensive product leadership and management experience from marquee companies such as Yahoo, IBM, Infosys, Oracle, and Netscape in the US and India.

With more than 25 years of overall experience, Sanjay brings significant expertise in product and platform engineering, including architecture, user experience, site operations, product management, and product strategy.

Sanjay has a master's in computer science from the University of Louisiana and a bachelor's degree in engineering from the Birla Institute of Technology, Mesra, Ranchi.

Presentations

Migrating millions of users from voice and email based customer support to a chatbot Session

At MakeMyTrip, India's leading online travel platform, customers were using voice or email to contact agents for post-sale support. To improve agent efficiency and the customer experience, MakeMyTrip developed a chatbot, Myra, using some of the latest advances in deep learning. In this talk, we will discuss the high-level architecture and the business impact created by Myra.

Jessica Egoyibo Mong is an engineering manager on the Machine Learning Engineering (MLE) team at SurveyMonkey. She’s currently leading efforts to re-architect online serving ML system. Prior to the MLE team, she worked as a full-stack engineer on the Billing & Payments team, where she built and maintained software to enable SurveyMonkey’s global financial growth and operation. In past years, Jessica oversaw the technical talks program, jointly managed the engineering internship program, and co-led the Women in Engineering group. Jessica received a B.S in Computer Engineering from Claflin University in South Carolina. She is a 2014 White House Initiative on HBCUs All-Star, a Hackbright (Summer 2013) and CODE2040 (Summer 2014) alum. She has served on the leadership team of the Silicon Valley local chapter of the Anita Borg Institute and is a member of /dev/color. Jessica is a singer and upcoming drummer, and sings and drums at her church in Livermore, CA. In her spare time, she enjoys eating, CrossFit, reading, learning new technologies, and sleeping!

Presentations

Your cloud, your ML, but more and more scale? How SurveyMonkey did it Session

You are a SaaS company operating on cloud infrastructure that predates the ML era. How do you successfully extend your existing infrastructure to leverage the power of ML? In this case study, you will learn critical lessons from SurveyMonkey's journey of expanding its ML capabilities with its rich data repo and hybrid cloud infrastructure.

Sireesha is a Solutions Architect at Amazon Web Services (AWS), with an area of depth in machine learning and artificial intelligence. She provides guidance to AWS customers on their ML/AI workloads. While working full-time, Sireesha earned her Ph.D. in May 2013 and a postdoctorate in 2015 from the University of Colorado, Colorado Springs. Her Ph.D. thesis is "Multi-tier Internet Service Management using Statistical Learning Techniques" (https://dspace.library.colostate.edu/bitstream/handle/10976/264/CUCS2013100001ETDSPECS.pdf?sequence=1). She led the University of Colorado team in winning and successfully completing a two-year research grant from the Air Force Research Lab on "Autonomous Job Scheduling in Unmanned Aerial Vehicles." She is an experienced public speaker and has presented research papers at international conferences: "CoSAC: Coordinated Session-Based Admission Control for Multi-Tier Internet Applications" (https://www.researchgate.net/publication/221092402_CoSAC_Coordinated_Session-Based_Admission_Control_for_Multi-Tier_Internet_Applications) at the IEEE Int'l Conf. on Computer Communications and Networks (ICCCN), 2009, and "Regression Based Multi-tier Resource Provisioning for Session Slowdown Guarantees" (https://www.researchgate.net/publication/220780958_Regression_Based_Multi-tier_Resource_Provisioning_for_Session_Slowdown_Guarantees) at the IEEE Int'l Conf. on Performance, Computing and Communications (IPCCC), 2010. She has also published technical articles: "Coordinated session-based admission control with statistical learning for multi-tier internet applications" (https://www.researchgate.net/publication/222549520_Coordinated_session-based_admission_control_with_statistical_learning_for_multi-tier_internet_applications) in the Journal of Network and Computer Applications (JNCA), and "Regression-based resource provisioning for session slowdown guarantee in multi-tier Internet servers" (https://www.researchgate.net/publication/220379377_Regression-based_resource_provisioning_for_session_slowdown_guarantee_in_multi-tier_Internet_servers) and "Multi-tier Service Differentiation: Coordinated Resource Provisioning and Admission Control" (https://www.researchgate.net/publication/260042453_Multi-tier_Service_Differentiation_Coordinated_Resource_Provisioning_and_Admission_Control) in the Journal of Parallel and Distributed Computing (JPDC).

Presentations

Alexa, Do Men Talk Too Much? Session

Mansplaining. Know it? Hate it? Want to make it go away? In this session we tackle the chronic problem of men talking over or down to women and its negative impact on career progression for women. We will also demonstrate an Alexa skill that uses deep learning techniques on incoming audio feeds. We discuss ownership of the problem for both women and men, and suggest helpful strategies.

MLOps – Applying DevOps Practices to Machine Learning Workloads Session

As an increasing level of automation is becoming available to data science, there is a balance between automation and quality that needs to be maintained. Applying DevOps practices to machine learning workloads not only brings models to the market faster but also maintains the quality and integrity of those models. This presentation will focus on applying DevOps practices to ML workloads.

Daniel Musgrave is a Principal Software Development Engineer at Microsoft, where he specializes in massive scale stream processing in Bing Ads.
He built the state-of-the-art near real-time ads processing pipeline that powers the Bing Ads platform and which serves as the model for other streaming workloads at Microsoft. He has a passion for API and language design, which he manages to fit into his day job whenever given the opportunity.

Presentations

Trill: The Crown Jewel of Microsoft’s Streaming Pipeline Explained Session

Trill has been open-sourced, making the streaming engine behind services like the multi-billion-dollar Bing Ads platform available for all to use and extend. We give a brief history of streaming data at Microsoft and lessons learned. We then demonstrate how its API can power complex application logic, and the performance that gives the engine its name: a trillion events per day per node.

Mikheil Nadareishvili is Deputy Head of BI at TBC Bank, in charge of the company-wide data science initiative. His main responsibilities include overseeing the development of data science capability and embedding it in the business to achieve maximum business value.
Prior to TBC, Mikheil worked on applying data science to various domains, most notably to real estate (to determine housing market trends and predict real estate prices), and education (to determine factors that influence students’ educational attainment in Georgia).

Presentations

Quadrupling Profit through Analytics: Data Science with Business Value in Mind Findata

TBC Bank recently shifted to a new way of doing analytics: embedding data scientists directly in the business units they work for, along with staff dedicated to connecting them to data and the business. Moreover, we made sure all projects had clear measures of success. This shift unlocked the value of analytics for us: in one project, we were able to improve the profit of a major loan product fourfold.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Building a best-in-class data lake on AWS and Azure Session

Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. In this talk we describe how companies can build best-in-class data lakes in the cloud, leveraging open source technologies and the cloud's elasticity to run and optimize various workloads simultaneously.

Securing your cloud data lake with a "defense in depth" approach Session

With cheap and infinitely scalable storage services such as S3 and ADLS, it has never been easier to dump data into a cloud data lake. But how do you secure that data and make sure it doesn't leak? In this talk we explore numerous capabilities for securing a cloud data lake, including authentication, access control, encryption (in motion and at rest) and auditing, as well as network protections.

Arup Nanda is head of data at Priceline, the leading travel company.

Presentations

Business Transformation with Visitor Funnel on Cloudera DCS

Customer funnels are nothing new, but combining multiple data elements, businesses, and systems, charting visitor drop-offs, and A/B testing has allowed Priceline to redefine the business, change product design, and offer incentives to customers, built on Cloudera and Airflow with machine learning using regression and classification models.

Paco Nathan is known as a "player/coach," with core expertise in data science, natural language processing, machine learning, and cloud computing. He is the Evil Mad Scientist at Derwen, co-chair of Rev, and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, Data Spartan, and Primer.

Presentations

Data Science vs Engineering: Does it really have to be this way? Session

Are you a data scientist that has wondered "why does it take so long to deploy my model into production?" Are you an engineer that has ever thought "data scientists have no idea what they want"? You are not alone. Join us for a lively discussion panel, with industry veterans, to chat about best practices and insights regarding how to increase collaboration when developing and deploying models.

Max Neunhöffer is a mathematician turned database developer. In his academic career he worked for 16 years on the development and implementation of new algorithms in computer algebra, during which time he worked extensively with mathematical big data such as group orbits containing trillions of points. He has since returned from St. Andrews to Germany, shifted his focus to NoSQL databases, and now helps to develop ArangoDB. He has spoken at international conferences including O'Reilly Software Architecture London, J On The Beach, and MesosCon Seattle.

Presentations

The case for a common Metadata Layer for Machine Learning Platforms Session

Machine learning platforms are becoming more complex, with different components each producing their own metadata. Currently, most components provide their own way of storing that metadata. In this talk, we propose a first draft of a common metadata API and demo a first implementation of this API in Kubeflow using ArangoDB, a native multi-model database.

Alexander Ng is a senior data engineer at Manifold. His previous work includes a stint as an engineer and technical lead doing DevOps at Kryuus, as well as engineering work for the Navy. He holds a BS in electrical engineering from Boston University.

Presentations

Streamlining a Machine Learning Project Team Tutorial

Many teams are still run as if data science were about experimentation, but those days are over. Now it must offer turnkey solutions to take models into production. We'll explain how to streamline an ML project and help your engineers work as an integrated part of production teams, using a Lean AI process and the Orbyter package for Docker-first data science.

Michael Noll is the product manager for stream processing at Confluent, the company founded by the creators of Apache Kafka. His work is focused on Kafka’s Streams API and KSQL, the streaming SQL engine for Kafka. Previously Michael was the technical lead of the Big Data platform of .COM/.NET DNS operator Verisign, where he grew the Hadoop and Kafka based infrastructure from zero to petabyte-sized production clusters spanning multiple data centers – one of the largest Big Data infrastructures operated from Europe at the time. He is a contributor and committer to open source projects such as Apache Kafka and Apache Storm, and writes a well-known blog about Big Data and distributed systems at www.michael-noll.com. Michael has a Ph.D. in computer science and has been a frequent speaker at international conferences such as Kafka Summit, Strata, and ApacheCon.

Presentations

Now You See Me, Now You Compute: Building Event-driven Architectures with Apache Kafka Session

Would you cross the street with traffic information that is a minute old? Certainly not! Modern businesses have the same needs. In this talk we cover why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, we look at Kafka as the storage layer, at Kafka Connect for data integration, and at Kafka Streams and KSQL as the compute layer.

Harsha is a data scientist at Microsoft. He works on interpretability and privacy for Machine Learning.

Presentations

Unified Tooling for Machine Learning Interpretability Session

Understanding decisions made by machine learning systems is critical for sensitive uses, ensuring fairness, and debugging production models. Interpretability is a maturing field of research that presents many options for trying to understand model decisions. Microsoft is releasing new tools to help you train powerful, interpretable models and interpret decisions of existing blackbox systems.

Owen O'Malley is a co-founder and Technical Fellow at Cloudera, formerly Hortonworks. Cloudera's software includes Hadoop and the large ecosystem of big data tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006 at Yahoo, was the first committer added to the project, and used Hadoop to set Gray Sort benchmark records in 2008 and 2009. In the last 10 years, he has been the architect of MapReduce, security, and now Hive. Recently he has been driving the development of the ORC file format and adding ACID transactions to Hive.

Presentations

Protect your Private Data in your Hadoop Clusters with ORC Column Encryption Session

Fine-grained data protection at the column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. This talk describes how column encryption in ORC files enables both fine-grained protection and auditing of who accessed the private data.

As the Co-Founder and CTO of Periscope Data, Tom O’Neill is responsible for overseeing the technology vision for the company and leading the engineering and product teams. Previously he worked on the machine learning algorithms behind the search results at Microsoft Bing. He has a bachelor’s degree in computer science from the University of Rochester.

Presentations

Lessons Learned from Scaling the Tech Stack of a Modern Analytics Platform Session

In this session, CTO Tom O’Neill will discuss lessons learned from scaling up Periscope Data to support incredibly large volumes of data and queries from its 1,000+ data teams. He’ll highlight the process of migrating from Heroku to Kubernetes and discovering new ways to leverage its power, plus other developments that have allowed users to delve deeper into new data science and ML analysis.

Kaan Onuk is an engineering manager on the Data Platform team at Uber. Previously, he worked as a tech lead at Uber, building metadata management infrastructure to transform big data into knowledge. Prior to Uber, he was a founding member of the data infrastructure team at Graphiq, a semantic technology startup that was later acquired by Amazon to help improve Alexa. He holds a master's degree in electrical engineering from the University of Southern California. Kaan can be reached on LinkedIn.

Presentations

Turning Big Data into Knowledge: Managing metadata and data-relationships at Uber scale Session

At Uber’s scale and pace of growth, a robust system for discovering and managing various entities, from datasets to services to pipelines, and their relevant metadata is not just nice to have: it is absolutely integral to making data useful at Uber. In this talk, we will explore the current state of metadata management and end-to-end data flow solutions at Uber and what’s coming next.

Diego Oppenheimer, founder and CEO of Algorithmia, is an entrepreneur and product developer with an extensive background in all things data. Prior to founding Algorithmia he designed, managed and shipped some of Microsoft’s most used data analysis products including Excel, Power Pivot, SQL Server and Power BI.
Diego holds a bachelor's degree in information systems and a master's degree in business intelligence and data analytics from Carnegie Mellon University.

Presentations

The New SDLC: CI/CD in the age of Machine Learning Session

Machine Learning (ML) will fundamentally change the way we build and maintain applications. How can we adapt our infrastructure, operations, staffing, and training to meet the challenges of the new Software Development Life Cycle (SDLC) without throwing away everything that already works?

Aaron Owen is a data scientist at Major League Baseball, where he leverages his skills to solve business problems for the organization and its 30 teams. Aaron holds an MS and PhD in evolutionary biology and was previously a professor at both the City University of New York and New York University.

Presentations

Data Science and the Business of Major League Baseball Session

Utilizing SAS, Python, and AWS Sagemaker, MLB’s data science team discusses how it predicts ticket purchasers’ likelihoods to purchase again, evaluates prospective season schedules, estimates customer lifetime value, optimizes promotion schedules, quantifies the strength of fan avidity, and monitors the health of monthly subscriptions to its game-streaming service.

Jitendra Pandey leads HDFS, Ozone, and HBase engineering at Hortonworks and has been contributing to the Hadoop ecosystem for more than 9 years. Jitendra is a committer and PMC member for Apache Hadoop. He is also a committer for the Apache Ambari and Apache Hive projects. Jitendra's contributions span various areas of Ozone, HDFS, vectorized query processing in Hive, and Hadoop security infrastructure. Prior to Hortonworks, Jitendra worked at Yahoo on big data infrastructure and applications.

Presentations

Apache Hadoop 3.x State of The Union and Upgrade Guidance Session

In this talk, we'll start with the current status of the Apache Hadoop community, then move on to the exciting present and future of Hadoop 3.x. We will cover new features like erasure coding, GPU support, namenode federation, Docker, long-running services support, powerful container placement constraints, and data node disk balancing. We'll also cover upgrade guidance from 2.x to 3.x.

Jignesh is a Principal Architect at Cox Communications. He has more than 15 years' experience applying scientific methods and mathematical models to solve problems concerning the management of systems, people, machines, materials, and finance in industry. Previously he was a trusted advisor for a large software company in the Northwest, assisting with data center capacity forecasting and providing machine learning capabilities to detect email spam, predict DDoS attacks, and prevent DNS blackholing.

Presentations

Secured Computation – Analyzing Sensitive Data using Homomorphic Encryption Session

Organizations often work with sensitive information such as Social Security numbers and credit card data. Although this data is stored in encrypted form, most analytical operations, ranging from data analysis to advanced machine learning algorithms, require decrypting the data for computation. This creates unwanted exposure to theft or unauthorized reads.

Nick Pentreath is a principal engineer in IBM’s Center for Open Source Data & AI Technologies (CODAIT), where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match, and Mxit. He is a committer and PMC member of the Apache Spark project and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Deploying End-to-End Deep Learning Pipelines with ONNX Session

The common perception of deep learning is that it results in a fully self-contained model. However, in most cases these models have similar requirements for data pre-processing as more "traditional" machine learning. Despite this, there are few standard solutions for deploying end-to-end deep learning. In this talk, I show how the ONNX format and ecosystem is addressing this challenge.

Robert Pesch is a senior data scientist and big data engineer at Inovex GmbH. Robert holds a PhD in bioinformatics and an MSc in computer science. He gets most excited about analyzing large, complex data sets and implementing novel insight-generating data products using advanced mathematical and statistical models.

Presentations

From Whiteboard to Production: a Demand Forecasting System for an Online Grocery Shop Session

In this talk, we outline the development process, the statistical modeling, the data-driven decision making, and the components needed for productionizing a fully automated and highly scalable demand forecasting system for an online grocery shop for a billion-dollar retail group in Europe.

Keshav Peswani is a senior software engineer at Expedia Group, focusing on technology and innovation across various platform initiatives. Keshav is building a neural network-based anomaly detection model as part of Expedia's adaptive alerting system, an open source project for anomaly detection. He is also a core contributor to Haystack, Expedia's open source project for distributed tracing, software that facilitates detection and remediation of problems in service-oriented architectures. Keshav started his career at D.E. Shaw & Co. and along the way has worked on several projects involving deep learning (particularly recurrent neural networks), monolithic systems, distributed systems, and big data processing. Keshav is a fast learner and is passionate about deep learning and event-driven architecture.

Keshav has spoken about Haystack at Open Source India, Asia's largest open source conference, and has discussed Haystack in OSFY.

Presentations

Real time Anomaly detection on observability data using neural networks Session

Observability is key in modern architectures for quickly detecting and repairing problems in microservices. Modern observability platforms have evolved beyond simple application logs and now include distributed tracing systems like Zipkin and Haystack. Combining them with real-time, intelligent alerting mechanisms that produce accurate alerts enables automated detection of these problems.

Akshay Rai is a senior software engineer at LinkedIn whose primary focus is reducing the mean time to detect and the mean time to resolve issues that arise at LinkedIn. He is currently working on LinkedIn's next-generation anomaly detection and diagnosis platform. Earlier, he led the popular Dr. Elephant project at LinkedIn and helped open source it. He has also worked on operational intelligence solutions for Hadoop and Spark, building real-time systems that enable monitoring, visualizing, and debugging of big data applications and Hadoop clusters.

Presentations

ThirdEye: LinkedIn’s Business-Wide Monitoring Platform Session

Failures or issues in a product or service can negatively affect the business. Detecting issues in advance and recovering from them is crucial to keeping the business alive. Join us to learn more about ThirdEye, LinkedIn's next-generation open source monitoring platform, an integrated solution for real-time alerting and collaborative analysis.

Manu is a senior software engineer at LinkedIn, where he has spent the last one and a half years on the Data Analytics and Infrastructure team. He has extensive experience building complex, scalable applications. During his tenure at LinkedIn, he has influenced the design and implementation of hosted notebooks, providing a seamless experience to end users. He works closely with customers, engineers, and product managers to understand and define the requirements and design of the system. Prior to joining LinkedIn, he worked at Paytm, Amadeus, and Samsung, building scalable applications across various domains.

Presentations

Productive Data Science Platform - Beyond Hosted notebooks solution at LinkedIn Session

Come hear about the infrastructure and features offered by LinkedIn's flexible, scalable hosted data science platform. The platform lets users develop seamlessly in multiple languages; enforces developer best practices and governance policies; and provides execution, visualization, knowledge management, and collaboration capabilities that improve developer productivity.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Post Transaction Processing Using Apache Pulsar at Narvar Session

Narvar provides a next-generation post-transaction experience for more than 500 retailers. This talk explores Narvar's journey of moving away from a slew of technologies for its platform and consolidating its use cases on Apache Pulsar.

Serverless Streaming Architectures & Algorithms for the Enterprise Tutorial

In this tutorial, we walk the audience through the landscape of streaming systems and give an overview of the inception and growth of the serverless paradigm. Next, we present a deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar Functions, and paint a bird's-eye view of the application domains where Pulsar Functions can be leveraged.

Sushant Rao works at Cloudera.

Presentations

Journey to the cloud - architecting for the cloud through customer stories Session

We'll give you an actionable understanding of cloud architecture and the different approaches customers took on their journey to the cloud. We start with the different ways we've seen customers be successful in the cloud, then dive deep into the decisions they made and how those drove their cloud architectures. Along the way we review problems they overcame, lessons learned, and core cloud paradigms.

Vidya Ravivarma has been a senior software engineer at LinkedIn for the last one and a half years, on the Data Analytics and Infrastructure team. She focuses on the design and implementation of a platform to improve developer productivity via hosted notebooks. Before this, she contributed to the design and development of a dynamic, unified ACL management system for GDPR enforcement on datasets produced via LinkedIn's metrics platform. She interacts closely with data analysts, scientists, engineers, and stakeholders to understand their requirements and build scalable, flexible solutions and platforms that enhance their productivity. Prior to LinkedIn, she worked at Yahoo for three years, mainly in data science, engineering, and web development. This gives her insight into developing a scalable, productive data science platform.

Presentations

Productive Data Science Platform - Beyond Hosted notebooks solution at LinkedIn Session

Come hear about the infrastructure and features offered by LinkedIn's flexible, scalable hosted data science platform. The platform lets users develop seamlessly in multiple languages; enforces developer best practices and governance policies; and provides execution, visualization, knowledge management, and collaboration capabilities that improve developer productivity.

Thiago Ribeiro is the server-side Product Director at Griaule, a software company that develops multi-biometric identification technology to help institutions and companies deploy large-scale biometric identification projects. Thiago has been working in the identity industry since 2015. He graduated from Unicamp University in Mechatronics Engineering, with an exchange at the University of New South Wales.

Presentations

How Brazil deployed a 160-million people biometric identification system: challenges, benefits, and lessons learned Session

Brazil deployed a national biometric system to register all Brazilian voters using multiple biometric modalities and to ensure that a person does not enroll twice. This session highlights how a large-scale biometric system works and the main architecture decisions that one has to take into consideration.

In two decades in the data management industry, I have worked as an engineer, a trainer, a marketer, a product manager, and a consultant. Now, I promote understanding of Vertica, MPP data processing, open source, and how the analytics revolution is changing the world.

Presentations

Kubernetes for Stateful MPP systems Session

Analytics experts GoodData needed to auto-recover from node failures and scale rapidly when workloads spike on their MPP database in the cloud. Kubernetes could solve that, but K8s is designed for stateless microservices, not a stateful MPP database that needs hundreds of containers. To merge the power of an MPP database with the flexibility of Kubernetes, a lot of hurdles had to be overcome.

Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.

Presentations

Fuzzy matching and deduplicating data - techniques for advanced data prep Session

Learn how to deduplicate or link records in a dataset, even when the records don't have a common unique identifier and no fields match exactly. Link customer records across different databases (e.g., different name spellings or addresses). Match external product lists against your own catalog, such as lists of hazardous goods. Solve tough challenges to prepare and cleanse data for analysis.
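
As a rough, hedged illustration of the idea (not code from the session itself), the sketch below scores record similarity with the Python standard library's difflib; the field names, weights, and 0.85 threshold are invented for the example.

    # Minimal fuzzy-matching sketch using only the Python standard library.
    # Record fields, weights, and the 0.85 threshold are illustrative assumptions.
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Return a 0..1 similarity score between two strings."""
        return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

    def is_probable_duplicate(rec1: dict, rec2: dict, threshold: float = 0.85) -> bool:
        """Blend per-field similarities into one score and compare it to a threshold."""
        name_score = similarity(rec1["name"], rec2["name"])
        addr_score = similarity(rec1["address"], rec2["address"])
        return (0.6 * name_score + 0.4 * addr_score) >= threshold

    a = {"name": "Jon Smith", "address": "12 Main St, Springfield"}
    b = {"name": "John Smith", "address": "12 Main Street, Springfield"}
    print(is_probable_duplicate(a, b))  # True for near-identical records

Production systems typically add blocking to avoid comparing every pair of records, but the pairwise scoring idea is the same.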

Neelesh Srinivas Salian is a Software Engineer on the Data Platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists. Previously, he worked at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka.

Presentations

The Why and How of Data Lineage Session

It is important to understand why data lineage is needed in an organization. Once the purpose is defined, we can talk about how to go about building such a system.

Shioulin Sam is a research engineer at Cloudera Fast Forward Labs, where she bridges academic research in machine learning with industrial applications. In her previous life, she managed a portfolio of early-stage ventures focusing on women-led startups and public market investments. She also worked in the investment management industry designing quantitative trading strategies. She holds a Ph.D in Electrical Engineering and Computer Science from Massachusetts Institute of Technology.

Presentations

Learning with Limited Labeled Data Session

Supervised machine learning requires large labeled datasets, a prohibitive limitation in many real-world applications. What if machines could learn with few labeled examples? This talk explores and demonstrates an algorithmic solution that relies on collaboration between humans and machines to label smartly, and discusses product possibilities.
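
For a sense of what such human-machine collaboration can look like (a generic sketch, not the speaker's method), the snippet below shows uncertainty sampling: the model repeatedly asks for labels only on the examples it is least confident about. All data and the scikit-learn model choice are assumptions made for the example.

    # Uncertainty-sampling sketch: label the points the model is least sure about first.
    # Synthetic data throughout; a human labeler would replace the placeholder labels.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_labeled = rng.normal(size=(20, 5))
    y_labeled = np.r_[np.zeros(10, dtype=int), np.ones(10, dtype=int)]  # seed labels
    X_pool = rng.normal(size=(500, 5))                                  # unlabeled pool

    for _ in range(3):                                   # a few labeling rounds
        model = LogisticRegression().fit(X_labeled, y_labeled)
        proba = model.predict_proba(X_pool)[:, 1]
        uncertainty = np.abs(proba - 0.5)                # closest to 0.5 = most uncertain
        ask = np.argsort(uncertainty)[:10]               # 10 examples to send to a human
        new_labels = rng.integers(0, 2, size=10)         # placeholder for human answers
        X_labeled = np.vstack([X_labeled, X_pool[ask]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_pool = np.delete(X_pool, ask, axis=0)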

Anjali Samani leads the Predictive Modelling team at CircleUp, an innovative fintech company recently honored as one of the World’s Top 10 Most Innovative Companies in Data Science.

Anjali has extensive experience managing and delivering commercial data science projects, and has worked with senior decision makers in startups, FTSE 100 businesses, and public sector organisations in the UK and US to enable them to develop their data strategy and execute data science projects. Her roles bridge technical data science and business to identify and execute innovative solutions that leverage proprietary and open data sources to deliver value and drive growth.

In her former life, Anjali was a quantitative analyst in asset management, and she has a background in computer science, economics, and mathematics.

Presentations

Working with Time Series: Denoising & Imputation Frameworks to Improve Data Density Session

The application of smoothing and imputation strategies is common practice in predictive modelling and time series analysis. With a technique-agnostic approach, this session will provide qualitative and quantitative frameworks that address questions related to smoothing and imputation of missing values to improve data density.
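
For a concrete, if simplified, picture of what such strategies look like in code (an illustration under my own assumptions, not material from the session), here is a small pandas sketch of interpolation-based imputation followed by rolling-window smoothing; the window size and interpolation method are arbitrary choices.

    # Simple imputation (time interpolation) and denoising (rolling mean) with pandas.
    # The 7-day window and "time" method are illustrative, not recommendations.
    import numpy as np
    import pandas as pd

    idx = pd.date_range("2019-01-01", periods=60, freq="D")
    values = np.sin(np.linspace(0, 6, 60)) + np.random.normal(0, 0.3, 60)
    series = pd.Series(values, index=idx)
    series.iloc[[10, 11, 25, 40]] = np.nan          # simulate missing observations

    imputed = series.interpolate(method="time")      # fill gaps from neighboring points
    smoothed = imputed.rolling(window=7, center=True, min_periods=1).mean()

    print(pd.DataFrame({"raw": series, "imputed": imputed, "smoothed": smoothed}).head(15))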

Sameer Wadkar is a senior principal architect for machine learning at Comcast NBCUniversal, where he works on operationalizing machine learning models to enable rapid turnaround times from model development to model deployment and oversees data ingestion from data lakes, streaming data transformations, and model deployment in hybrid environments ranging from on-premises deployments to cloud and edge devices. Previously, he developed big data systems capable of handling billions of financial transactions per day arriving out of order for market reconstruction to conduct surveillance of trading activity across multiple markets and implemented natural language processing (NLP) and computer vision-based systems for various public and private sector clients. He is the author of Pro Apache Hadoop and blogs about data architectures and big data.

Presentations

Automating ML Model Training & Deployments via Metadata driven Data/Infrastructure/Feature Engineering/Model Management Session

And overview of the Data Management and privacy challenges around automating ML model (re)deployments and stream based inferencing at scale.

Alejandro is the Chief Scientist at the Institute for Ethical AI & Machine Learning, where he leads highly technical research on machine learning explainability, bias evaluation, reproducibility and responsible design. With over 10 years of software development experience, Alejandro has held technical leadership positions across hyper-growth scale-ups and tech giants including Eigen Tchnologies, Bloomberg LP and Hack Partners. He has a strong track record building departments of machine learning engineers from scratch, and leading the delivery of large-scale machine learning system across the financial, insurance, legal, transport, manufacturing and construction sectors (in Europe, US and Latin America).

Presentations

A practical guide towards algorithmic bias and explainability in machine learning Session

Undesired bias in machine learning has become a worrying topic due to numerous high-profile incidents. In this talk we demystify machine learning bias through a hands-on example. We'll be tasked with automating the loan approval process for a company, and we'll introduce key tools and techniques from the latest research that allow us to assess and mitigate undesired bias in our machine learning models.
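
As a standalone illustration of the kind of check such tools perform (not the talk's own code), the snippet below computes a demographic parity gap for a toy loan-approval outcome; the synthetic data and the metric choice are assumptions, and a single metric is never a complete fairness assessment.

    # Toy demographic-parity check: compare approval rates across a sensitive group.
    # All data here is synthetic; a gap near 0 suggests parity on this one metric only.
    import numpy as np

    rng = np.random.default_rng(42)
    group = rng.integers(0, 2, size=1000)                   # 0/1 sensitive attribute
    approved = (rng.random(1000) < np.where(group == 1, 0.55, 0.45)).astype(int)

    rate_g0 = approved[group == 0].mean()
    rate_g1 = approved[group == 1].mean()
    print(f"approval rate group 0: {rate_g0:.2f}")
    print(f"approval rate group 1: {rate_g1:.2f}")
    print(f"demographic parity gap: {abs(rate_g1 - rate_g0):.2f}")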

Jörg is a machine learning platform engineer at Suki. In his previous life, he worked on distributed systems at Mesosphere, implemented distributed and in-memory databases, and conducted research in the Hadoop and cloud area. His speaking experience includes various meetups, international conferences, and lecture halls.

Presentations

The case for a common Metadata Layer for Machine Learning Platforms Session

Machine learning platforms are becoming more complex, with different components each producing their own metadata. Currently, most components provide their own way of storing that metadata. In this talk, we propose a first draft of a common metadata API and demo a first implementation of this API in Kubeflow using ArangoDB, a native multi-model database.

Ross Schalmo is Director of Data and Analytics at GE Aviation.

As part of the Information Management Leadership Program at General Electric, he has been exposed to many different types of challenges, both in project management and technical expertise. He is experienced in relational database development and administration, Java, XML, XSL, and JavaScript programming, and large-scale IT program rollouts.

Presentations

Executive Briefing: Building a culture of self-service - From pre-deployment to continued engagement Session

GE Aviation has made it a mission to implement self-service data. To ensure success beyond the initial implementation of tools, the Data Engineering and Analytics teams at GE Aviation created initiatives designed to foster engagement, from an ongoing partnership with each part of the business to the gamification of tagging data in a data catalog to forming a Published Dataset Council.

Dr. Chad Scherrer is a Senior Data Scientist with Metis, where he trains burgeoning data scientists. In addition to data science education, he has a passion for technology transfer, especially in the area of probabilistic programming.

His work in probabilistic programming goes back to 2010, when he began leading development of the Haskell-based language Passage. Dr. Scherrer then joined Galois to serve as technical lead for language evaluation in DARPA’s PPAML program, after which he moved to Seattle and joined Metis.

Dr. Scherrer’s blog discusses a variety of topics related to data science, with a particular focus on Bayesian modeling.

Presentations

Soss: Lightweight Probabilistic Programming in Julia Session

This talk will explore the basic ideas in Soss, a new probabilistic programming library for Julia. Soss allows a high-level representation of the kinds of models often written in PyMC3 or Stan, and offers a way to programmatically specify and apply model transformations like approximations or reparameterizations.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. He is passionate about building combined big data and blockchain solutions. Over his career, Jim has held positions running operations, engineering, architecture, and QA teams in the financial services, regulatory, digital advertising, IoT, manufacturing, healthcare, chemicals, and geographical management systems industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).

Presentations

Problems Taking AI to Production, and How to Fix Them! Session

Data scientists are creating and testing hundreds or thousands more models than in the past. Models require support from both real-time and static data sources. As data becomes enriched, and parameters tuned and explored, there is a need for versioning everything, including the data. We will discuss the very specific problems and approaches to fix them.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Foundations for Successful Data Projects Tutorial

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation we'll provide guidance and best practices, from planning to implementation, based on years of experience working with companies to deliver successful data projects.

Robin Senge received his MSc in computer science from the University of Marburg, Germany, in 2006. After graduation, he worked as a software engineer in industry, where he consulted on and developed software for financial applications like trading and portfolio management systems. In 2009, he joined the Computational Intelligence Lab at the University of Marburg as a doctoral student, where his research focused on machine learning and fuzzy systems. After finishing his PhD in 2015, he joined inovex GmbH as a senior data scientist. A data enthusiast, he is currently part of an analytics team applying machine learning to optimize supply chain processes for one of the biggest retail groups in Germany.

Presentations

From Whiteboard to Production: a Demand Forecasting System for an Online Grocery Shop Session

In this talk, we outline the development process, the statistical modeling, the data-driven decision making, and the components needed for productionizing a fully automated and highly scalable demand forecasting system for an online grocery shop for a billion-dollar retail group in Europe.

I work as a technical product manager on Expedia's first open source product. I am happiest when talking about the problems we have solved, the various approaches we tried, and the challenges we overcame. I previously worked as a software developer and tester for about 4 years and understand those perspectives and pains well, having lived them. I have a double master's in computer science from USC, where I graduated in 2015. I have spoken at various conferences, including the Linux Open Source Summit, Open Source India, KubeCon, and Linux Fest, and love sharing our journey and challenges.

Presentations

Finding a needle in a Haystack DCS

Over the last decade, logging has come a long way, from writing files in application logs to using sophisticated tools such as Splunk. As data volumes increased, it became harder to go through the data manually, and a system was needed to automate and standardize this telemetry data. Our talk shows how any company can leverage data to improve developer productivity and customer satisfaction.

Real time Anomaly detection on observability data using neural networks Session

Observability is key in modern architectures for quickly detecting and repairing problems in microservices. Modern observability platforms have evolved beyond simple application logs and now include distributed tracing systems like Zipkin and Haystack. Combining them with real-time, intelligent alerting mechanisms that produce accurate alerts enables automated detection of these problems.

Reza Shiftehfar currently leads Uber’s Hadoop Platform teams. His teams help build and grow Uber’s reliable and scalable Big Data platform that serves petabytes of data utilizing technologies such as Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, and Presto. Reza is one of the founding engineers of Uber’s Data team and helped scale Uber’s data platform from a few terabytes to over 100 petabytes while reducing the data latency from 24+ hours to minutes. Reza holds a Ph.D. in Computer Science from the University of Illinois, Urbana-Champaign with focus on building Mobile Hybrid Cloud applications.

Presentations

Creating an extensible 100+ PB real-time Big Data Platform by unifying storage & serving Session

Building a reliable big data platform is extremely challenging when it has to store and serve hundreds of petabytes of data in real time. This talk reflects on the challenges faced and proposes architectural solutions to scale a big data platform to ingest, store, and serve 100+ PB of data with minute-level latency while efficiently utilizing the hardware and meeting security needs.

Tomer Shiran is cofounder and CEO of Dremio. Previously, Tomer was the vice president of product at MapR, where he was responsible for product strategy, roadmap, and new feature development. As a member of the executive team, he helped grow the company from 5 employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He is the author of five US patents. Tomer holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from the Technion, the Israel Institute of Technology.

Presentations

Building a best-in-class data lake on AWS and Azure Session

Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. In this talk we describe how companies can build best-in-class data lakes in the cloud, leveraging open source technologies and the cloud's elasticity to run and optimize various workloads simultaneously.

Securing your cloud data lake with a "defense in depth" approach Session

With cheap and infinitely scalable storage services such as S3 and ADLS, it has never been easier to dump data into a cloud data lake. But how do you secure that data and make sure it doesn't leak? In this talk we explore numerous capabilities for securing a cloud data lake, including authentication, access control, encryption (in motion and at rest) and auditing, as well as network protections.

Nagendra leads the Analytics Product Development initiative for EXL.  He has over 17 years of experience in developing advanced analytics solutions across business functions.  His focus has been on developing solutions that enable better decision making through the use of Machine Learning, Natural Language Processing and Big Data technologies.  Nagendra consults with senior executives of global firms across industry – including healthcare, insurance, banking, retail, and travel.  Nagendra holds an MS degree from Purdue University, IN and a B.Tech. from IIT Bombay.  At EXL, Nagendra has written thought leadership articles on healthcare clinical solutions and AI.

Presentations

Improving OCR Quality of Documents using Generative Adversarial Networks Session

Every NLP-based document processing solution depends on converting scanned documents and images to machine-readable text using an OCR solution. However, the accuracy of OCR solutions is limited by the quality of the scanned images. We show that generative adversarial networks can be used to bring significant efficiencies to any document processing solution by enhancing resolution and denoising scanned images.

Rosaria Silipo, PhD, principal data scientist at KNIME, loved data before it was big and learning before it was deep. She has spent 25+ years in applied AI, predictive analytics, and machine learning at Siemens, Viseca, Nuance Communications, and in private consulting. Sharing her practical experience in a broad range of industries and deployments, including IoT, customer intelligence, financial services, and cybersecurity, Rosaria has authored 50+ technical publications, including her recent ebook Practicing Data Science: A Collection of Case Studies. Follow Rosaria on Twitter, LinkedIn, and the KNIME blog.

Presentations

Practicing Data Science: A Collection of Case Studies DCS

This is a review of AI case studies: from classic customer intelligence to IoT, from sentiment in social media to user graphs, from free text generation to fraud detection, and so on. The goal of this presentation is to inspire users' creativity in applying AI to their own domains.

Swatee Singh is the first female Distinguished Architect at American Express, where she is spearheading machine learning transformation at the company. Swatee is a proponent of democratizing machine learning by providing the right tools, capabilities, and talent structure to the broader engineering and data science community. The platform her team is building looks to leverage American Express’s closed loop data to enhance its customer experience by combining artificial intelligence, big data, and the cloud, incorporating guiding pillars such as ease of use, reusability, shareability, and discoverability. Swatee also led the American Express Recommendation Engine roadmap and delivery for card-linked merchant offers as well as for personalized merchant recommendations. Over the course of her career, she has applied predictive modeling to a variety of problems ranging from financial services to retailers and even power companies. Previously, Swatee was a consultant at McKinsey & Company and PwC, where she supported leading businesses in retail, banking and financial services, insurance, and manufacturing, and cofounded a medical device startup that used a business card-sized thermoelectric cooling device implanted in an epileptic’s brain as a mechanism to stop seizures. Swatee holds a PhD focusing on machine learning techniques from Duke University.

Presentations

Keynote with Swatee Singh Keynote

Details to come.

As an AWS Partner Solutions Architect, I work with GSIs to help accelerate adoption of AWS services, with a focus on analytics and machine learning.

Presentations

Building a recommender system with Amazon ML Services Tutorial

In this workshop we’ll introduce the Amazon SageMaker machine learning platform, followed by a high level discussion of recommender systems. Next we’ll dig into different machine learning approaches for recommender systems.

Tim Spann was a senior solutions architect at AirisData, working with Apache Spark and machine learning. Previously he was a senior software engineer at SecurityScorecard (http://securityscorecard.com/), helping to build a reactive platform for monitoring real-time third-party vendor security risk in Java and Scala. Before that he was a senior field engineer for Pivotal, focusing on Cloud Foundry, HAWQ, and big data. He is an avid blogger and the Big Data Zone Leader for DZone (https://dzone.com/users/297029/bunkertor.html).

He runs the very successful Future of Data Princeton meetup with over 1192 members at http://www.meetup.com/futureofdata-princeton/.

He is currently a Senior Solutions Engineer at Cloudera in the Princeton New Jersey area.

You can find all the source and material behind his talks at his Github and Community blog:

https://github.com/tspannhw
https://community.hortonworks.com/users/9304/tspann.html

Presentations

IoT - Cloudera Edge Management Tutorial

Too many edge devices and agents. How does one control and manage them? How do we handle the difficulty of collecting real-time data and, most importantly, the trouble of updating a specific set of agents with edge applications? Get your hands dirty with Cloudera Edge Management, which addresses these challenges with ease.

Moderator: Ann Spencer is the Head of Content at Domino Data Lab. She is responsible for ensuring Domino's data science content provides a high degree of value, density, and analytical rigor that sparks respectful, candid public discourse from multiple perspectives, discourse anchored in the intention of helping accelerate data science work. Previously, she was the data editor at O'Reilly Media (2012-2014), focusing on data science and data engineering. It was in this role that she first met and worked with the panelists.

Presentations

Data Science vs Engineering: Does it really have to be this way? Session

Are you a data scientist who has wondered "why does it take so long to deploy my model into production?" Are you an engineer who has ever thought "data scientists have no idea what they want"? You are not alone. Join us for a lively panel discussion with industry veterans about best practices and insights on how to increase collaboration when developing and deploying models.

Rajeev Srinivasan is a senior solutions architect for AWS. He works very closely with customers to provide big data and NoSQL solutions leveraging the AWS platform, and he enjoys coding. In his spare time he enjoys riding his motorcycle and reading books.

Presentations

From relational databases to Cloud databases, using the right tool for the right job. Tutorial

Enterprises adopt cloud platforms such as AWS for agility, elasticity, and cost savings. Database design and management require a different mindset in AWS compared to traditional RDBMS design. In this session, you will learn important considerations in choosing the right database based on your use cases and access patterns when migrating an application or building a new application in the cloud.

Mac Steele is the director of product at Domino Data Lab, where he leads strategic development of the company’s data science platform. Based in San Francisco, he works closely with leading financial services, insurance, and technology companies to build a mature data science process across their entire organization. He has extensive experience leading advanced analytical organizations in both finance and tech. Previously, Mac worked in the Research Group at Bridgewater Associates, the world’s largest hedge fund, where he developed quantitative models for the firm’s emerging market portfolio; he also built the core data capability at leading fintech company Funding Circle. Steele holds a degree (summa cum laude) from the Woodrow Wilson School of Public and International Affairs at Princeton University.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders must deliver measurable impact on an increasing share of an enterprise’s KPIs. Attendees will learn how leading organizations take a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Michael Stonebraker is an adjunct professor at MIT CSAIL and a database pioneer who has been involved with Postgres, SciDB, Vertica, VoltDB, Tamr, and other database companies. He co-authored the paper "Data Curation at Scale: The Data Tamer System," presented at the Conference on Innovative Data Systems Research (CIDR '13).

Presentations

Executive Briefing: Top 10 Big Data Blunders Session

As a steward of your enterprise's data and digital transformation initiatives, you're tasked with making the right choices. But before you can make those decisions, it's important to understand what NOT to do when planning your organization's big data initiatives. Dr. Michael Stonebraker, adjunct professor at MIT and cofounder/CTO of Tamr, will discuss his top 10 big data blunders.

Wim Stoop is a senior product marketing manager at Cloudera.

Presentations

Sharing is caring: using Egeria to establish true enterprise metadata governance Session

Establishing enterprise-wide security and governance remains a challenge for most organisations. Integrations and exchanges across their landscape are costly to manage and maintain, and typically work in one direction only. In this session, we'll discuss how ODPi's Egeria standard and framework removes these challenges and is leveraged by Cloudera and partners alike to deliver value for customers.

Bargava Subramanian is a deep learning engineer and cofounder of Binaize Labs, a boutique AI firm in Bangalore, India. He has 15 years' experience delivering business analytics and machine learning solutions to B2B companies, and he mentors organizations in their data science journeys. He holds a master's degree from the University of Maryland at College Park. He's an ardent NBA fan.

Presentations

Recommendation System using Deep Learning 2-Day Training

In this two-day workshop, you will learn the different paradigms of recommendation systems and get introduced to deep learning-based approaches. By the end of the workshop, you will have enough practical hands-on knowledge to build, select, deploy, and maintain a recommendation system for your problem.

Recommendation System using Deep Learning (Day 2) Training Day 2

In this two-day workshop, you will learn the different paradigms of recommendation systems and get introduced to deep learning-based approaches. By the end of the workshop, you will have enough practical hands-on knowledge to build, select, deploy, and maintain a recommendation system for your problem.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with Spark NLP Tutorial

This is a hands-on tutorial on state-of-the-art NLP using the highly performant, highly scalable open-source Spark NLP library. You'll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Wangda Tan is a Project Management Committee (PMC) member of Apache Hadoop and an engineering manager on the computation platform team at Cloudera. He manages all efforts related to Kubernetes and YARN for both on-cloud and on-prem use cases at Cloudera. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop Submarine project (running deep learning workloads across YARN and Kubernetes). He has also led features such as resource scheduling, GPU isolation, node labeling, and resource preemption in the Hadoop YARN community. Before joining Cloudera, he worked at Pivotal on integrating OpenMPI/GraphLab with Hadoop YARN. Before that, he worked at Alibaba Cloud, where he participated in creating a large-scale machine learning, matrix, and statistics computation platform using MapReduce and MPI.

Presentations

Apache Hadoop 3.x State of The Union and Upgrade Guidance Session

In this talk, we'll start with the current status of the Apache Hadoop community, then move on to the exciting present and future of Hadoop 3.x. We will cover new features like erasure coding, GPU support, namenode federation, Docker, long-running services support, powerful container placement constraints, and data node disk balancing. We'll also cover upgrade guidance from 2.x to 3.x.

James Tang is a senior director of engineering at Wal-Mart Labs. He has spent his career creating large-scale, resilient, distributed architectures with high security and high performance for enterprise applications, web applications, online payments, online games, and real-time predictive analytics applications. While enthusiastic about technologies, he enjoys mentoring, training, and leading teams to be successful with distributed systems concepts, microservices, DevOps, and cloud-native application design.

Presentations

Machine Learning and Large Scale Data Analysis On Centralized Platform Session

How the no. 1 retailer provides a secure and seamless shopping experience through machine learning and large-scale data analysis on a centralized platform.

James is a 10-year veteran at Microsoft, having spent time on both product and research teams. He began as an intern during the last year of his PhD research at Portland State University. His background is in innovative data query and exploration interfaces and streaming data processing.

Presentations

Trill: The Crown Jewel of Microsoft’s Streaming Pipeline Explained Session

Trill has been open-sourced, making the streaming engine behind services like the multi-billion-dollar Bing Ads platform available for all to use and extend. We give a brief history of streaming data at Microsoft and lessons learned. We then demonstrate how its API can power complex application logic, and the performance that gives the engine its name: a trillion events per day per node.

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

This is a hands-on tutorial on state-of-the-art NLP using the highly performant, highly scalable open-source Spark NLP library. You'll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Moto Tohda is the VP of Information Systems at Tokyo Century (USA) Inc., where he oversees all of IT for the US operations of the global company Tokyo Century Corporation. Moto wears many hats; his past IT initiatives span security, disaster recovery, and many ERP-related system implementations, always striving to balance business objectives with IT initiatives. His newest endeavour, begun last year, is to bring data back into users' hands with data visualization and user-driven predictive analysis models.

Presentations

Democratization of Data Science - Using Machine Learning to Build Credit Risk Models Findata

Tokyo Century was ready for a change. Credit risk decisions were taking too long and the home office was taking notice. A full-stack data solution was needed to increase the speed of loan authorizations, and it was needed quickly. In this session you will learn how we put data at the center of our credit risk decision making and removed tribal knowledge from the process.

Solmaz Torabi is a Data Scientist at EXL Service. She holds a PhD in Electrical and Computer Engineering from Drexel University. At EXL, she is responsible for building image and text analytics models using deep learning methods to extract information from images and documents.

Presentations

Improving OCR Quality of Documents using Generative Adversarial Networks Session

Every NLP-based document processing solution depends on converting scanned documents and images to machine-readable text using an OCR solution. However, the accuracy of OCR solutions is limited by the quality of the scanned images. We show that generative adversarial networks can be used to bring significant efficiencies to any document processing solution by enhancing resolution and denoising scanned images.

Steve Touw is the cofounder and CTO of Immuta. Steve has a long history of designing large-scale geotemporal analytics across the US intelligence community, including some of the very first Hadoop analytics as well as frameworks to manage complex multi-tenant data policy controls. He and his cofounders at Immuta drew on this real-world experience to build a software product to make data security and privacy controls easier. Previously, Steve was the CTO of 42Six Solutions (acquired by Computer Sciences Corporation), where he led a large big data services engineering team. Steve holds a BS in geography from the University of Maryland.

Presentations

Data Security & Privacy Anti-Patterns Session

Anti-patterns are behaviors that take bad problems and lead to even worse solutions. In the world of data security and privacy, they're everywhere. Over the past 4 years we've seen data security and privacy anti-patterns consistently emerge across hundreds of customers and industry verticals, and there has been an obvious trend. We'll cover 5 anti-patterns and, more importantly, the solutions for them.

Jonathan Tudor is Sr. Manager for Self-Service Data and Analytics at GE Aviation. Jon has a background in big data ETL, cloud, security, networking, compliance, data warehousing, supply chain operations, engineering IT, business intelligence, and analytics. He is passionate about self-service data, big data, digital cultural transformation, data, business, and leadership.

Presentations

Executive Briefing: Building a culture of self-service - From pre-deployment to continued engagement Session

GE Aviation has made it a mission to implement self-service data. To ensure success beyond the initial implementation of tools, the Data Engineering and Analytics teams at GE Aviation created initiatives designed to foster engagement, from an ongoing partnership with each part of the business to the gamification of tagging data in a data catalog to forming a Published Dataset Council.

Giovanni Tummarello, PhD, is founder and Chief Product Officer at Siren.io. Previously an academic team lead at the National University of Ireland Galway, he has over 100 scholarly works on knowledge graphs, semantic technologies, and information retrieval, as well as several startups and active open source projects that grew out of his research, among them the top-level Apache "any23" project and the companies Spaziodati.eu and Siren.io.

Presentations

Supercharging Elasticsearch for extended Knowledge Graph use cases Session

Elasticsearch allows extremely quick search and drilldowns on large amounts of semistructured data. Elasticsearch, however, does not have relational join capabilities. In this presentation, I'll introduce a plugin for Elasticsearch that adds cluster-distributed joins and demonstrate how it enables an exciting array of use cases dealing with interconnected, or "knowledge graph," enterprise data.

Naoto is a Senior Infrastructure Engineer and Deputy Manager at NTT DATA Corporation, working in the technology and innovation area. He has spent around 10 years in the platform and infrastructure field, focusing mainly on the open source software technology stack.

Presentations

Deep Learning Technologies for Giant Hogweed Eradication Session

Giant hogweed is a highly toxic plant. Our project aims to automate the process of detecting giant hogweed by exploiting technologies such as drones and image recognition/detection using machine learning. We show how we designed the architecture, how we took advantage of both big data and machine/deep learning technologies (e.g., Hadoop, Spark, and TensorFlow), and the lessons we learned.

Sandeep Uttamchandani is the hands-on Chief Data Architect at Intuit. He is currently leading the cloud transformation of the big data analytics, ML, and transactional platform used by 4M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Prior to Intuit, Sandeep held various engineering roles at VMware and IBM and founded a startup focused on ML for managing enterprise systems. Sandeep's experience uniquely combines building enterprise data products with operational expertise in managing petabyte-scale data and analytics platforms in production for IBM's federal and Fortune 100 customers. Sandeep has received several excellence awards and holds over 40 issued patents and 25 publications in key systems conferences such as VLDB, SIGMOD, CIDR, and USENIX. Sandeep is a regular speaker at academic institutions, gives guest lectures for university courses, and conducts conference tutorials for data engineers and scientists. He advises PhD students and startups, serves as a program committee member for systems and data conferences, and is a past associate editor for ACM Transactions on Storage. Sandeep holds a PhD in computer science from the University of Illinois at Urbana-Champaign.

Presentations

Time-travel for Data Pipelines: Solving the mystery of what changed? Session

Imagine a business insight showing a sudden spike. Debugging data pipelines is non-trivial, and finding the root cause can take hours or even days. We'll share how Intuit built a self-serve tool that automatically discovers data pipeline lineage and tracks every change that impacts a pipeline. This helps debug pipeline issues in minutes, establishing trust in data while improving developer productivity.

Vinithra Varadharajan is a senior engineering manager in the cloud organization at Cloudera, where she is responsible for the cloud portfolio products, including Altus Data Engineering, Altus Analytic Database, Altus SDX, and Cloudera Director. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges, from re-architecting to be cloud-native to maintaining data context consistency across workloads that span multiple clusters on-premises and in the cloud. First, we'll cover cloud architecture and its challenges in depth; second, you'll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Gil Vernik is a researcher in IBM’s Storage Clouds, Security, and Analytics group, where he works with Apache Spark, Hadoop, object stores, and NoSQL databases. Gil has more than 25 years of experience as a code developer on both the server side and client side and is fluent in Java, Python, Scala, C/C++, and Erlang. He holds a PhD in mathematics from the University of Haifa and held a postdoctoral position in Germany.

Presentations

Your easy move to serverless computing and radically simplified data processing Session

Most analytic flows can benefit from serverless, from simple cases to complex data preparation for AI frameworks like TensorFlow. To address the challenge of integrating serverless easily, without major disruption to your system, we present a "push to the cloud" experience. This capability dramatically simplifies using serverless with different big data processing frameworks.

A graduate of the Applied Mathematics Department of Saint Petersburg State University, with a PhD. He has spent the last 20 years in IT development in different areas, from outsourced CAD systems to fintech. In 2003 he joined the Yandex.Money team, where he and his team are responsible for data engineering, anti-fraud systems development, and business intelligence.

Presentations

Scaling: data engineers Session

With a microservice architecture, the DWH is the first place where all the data comes together. It is supplied by many different data sources and is used for many purposes, from near-OLTP workloads to model fitting and real-time classification. This talk covers our experience managing and scaling a data engineering team and the infrastructure that supports more than 20 product teams.

Naghman Waheed leads the data platforms team at Bayer, where he is responsible for defining and establishing enterprise architecture and direction for data platforms. Naghman is an experienced IT professional with over 25 years of work devoted to the delivery of data solutions spanning numerous business functions, including supply chain, manufacturing, order to cash, finance, and procurement. Throughout his 20+ year career at Bayer, Naghman has held a variety of positions in the data space, ranging from designing several large-scale data warehouses to defining a data strategy for the company and leading various data teams. His broad range of experience includes managing global IT data projects, establishing enterprise information architecture functions, defining enterprise architecture for SAP systems, and creating numerous information delivery solutions. Naghman holds a BA in computer science from Knox College, a BS in electrical engineering from Washington University, an MS in electrical engineering and computer science from the University of Illinois, and an MBA and a master's degree in information management, both from Washington University.

Presentations

Finding your needle in a Haystack Session

As the complexity of data systems at Bayer has grown, so has the difficulty of locating and understanding which datasets are available for consumption. To address this challenge, Bayer recently deployed a custom metadata management tool as a new capability. The system is cloud-enabled and uses multiple open source components, including machine learning and natural language processing, to aid search.

Optimizing ROI of a Geospatial Platform in Cloud DCS

Cloud architecture is an extremely flexible environment for deploying solutions. Our recent learnings have revealed some interesting insights. One is that the first build of a solution, even one using open source software, may quickly exceed initial cost estimates and could outpace ROI if not managed properly. This case study focuses on managing our geospatial platform and how we increased ROI.

Todd Walter is chief technologist and fellow at Teradata, where he helps business leaders, analysts, and technologists better understand all of the astonishing possibilities of big data and analytics in view of emerging and existing capabilities of information infrastructures. Todd has been with Teradata for more than 30 years. He’s a sought-after speaker and educator on analytics strategy, big data architecture, and exposing the virtually limitless business opportunities that can be realized by architecting with the most advanced analytic intelligence platforms and solutions. Todd holds more than a dozen patents.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Dean Wampler, PhD, is VP of Fast Data Engineering at Lightbend. He leads the team behind the Lightbend Fast Data Platform, a scalable, distributed stream data processing stack using Kubernetes, Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, Second Edition, and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O'Reilly Media. He is a contributor to several open source projects, a frequent Strata speaker, and the co-organizer of several conferences around the world and several user groups in Chicago. Dean yells at clouds on Twitter, @deanwampler.

Presentations

Executive Briefing: What It Takes to Use Machine Learning in Fast Data Pipelines Session

Join me for a discussion of the following problems and their solutions: 1. How (and why) to integrate ML into production streaming data pipelines, to serve results quickly? 2. How to bridge data science and production environments, with different tools, techniques, and requirements? 3. How to build reliable and scalable, long-running services? 4. How to update ML models without downtime?

Hands-on Machine Learning with Kafka-based Streaming Pipelines Tutorial

This hands-on tutorial examines production use of ML in streaming data pipelines: how to do periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.

Carson Wang is a big data software engineer at Intel, where he focuses on developing and improving new big data technologies. He is an active open source contributor to the Apache Spark and Alluxio projects as well as a core developer and maintainer of HiBench, an open source big data microbenchmark suite. Previously, Carson worked for Microsoft on Windows Azure.

Presentations

Improving Spark by taking advantage of disaggregated architecture Session

Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumption of collocated storage does not always hold in today's data centers. We implemented a new Spark shuffle manager that writes shuffle data to a remote cluster with different storage backends. This makes life easier for customers who want to leverage the latest storage hardware, as well as for HPC customers.

Fei Wang is a senior data scientist and statistician at CarGurus. His work primarily involves experimental design and causal inference modeling for online and TV advertising. Fei's research includes statistical machine learning, matrix factorization, optimization, and high-dimensional data modeling. Fei holds a PhD in biostatistics from the University of Michigan.

Presentations

Building a Machine Learning Framework to Measure TV Advertising Attribution Session

This session presents the case study of the CarGurus TV attribution model. Attendees will learn how a causal inference model can be used to calculate the cost per acquisition (CPA) of TV spend and measure its effectiveness compared to the CPA of digital performance marketing spend.

Jason Wang is a software engineer at Cloudera focusing on the cloud.

Presentations

Journey to the cloud - architecting for the cloud through customer stories Session

We'll give you an actionable understanding of cloud architecture and the different approaches customers took in their journey to the cloud. We start with the different ways we've seen customers succeed in the cloud, then dive deep into the decisions they made and how those drove their cloud architecture. Along the way, we review the problems they overcame, lessons learned, and core cloud paradigms.

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges, from re-architecting to be cloud-native to maintaining data context consistency across workloads that span multiple clusters on-premises and in the cloud. First, we'll cover cloud architecture and its challenges in depth; second, you'll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Pete Warden is the technical lead of the TensorFlow mobile and embedded team at Google, working on deep learning. He was formerly the CTO of Jetpac, which was acquired by Google. He is also an Apple alumnus and blogs at petewarden.com.

Presentations

Data Science vs Engineering: Does it really have to be this way? Session

Are you a data scientist who has wondered, "Why does it take so long to deploy my model into production?" Are you an engineer who has ever thought, "Data scientists have no idea what they want"? You are not alone. Join us for a lively panel discussion with industry veterans about best practices and insights for increasing collaboration when developing and deploying models.

Sophie Watson is a software engineer in an Emerging Technology Group at Red Hat, where she applies her data science and statistics skills to solving business problems and informing next-generation infrastructure for intelligent application development. She has a background in mathematics and holds a PhD in Bayesian statistics, in which she developed algorithms to estimate intractable quantities quickly and accurately.

Presentations

Sketching data and other magic tricks Tutorial

In this hands-on workshop, we’ll introduce several data structures that let you answer interesting queries about massive data sets in fixed amounts of space and constant time. This seems like magic, but we'll explain the key trick that makes it possible and show you how to use these structures for real-world machine learning and data engineering applications.
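
As a taste of what "fixed space, constant time" means in practice, here is a minimal, illustrative Python sketch of one such structure, a Count-Min sketch for estimating item frequencies. It is an assumption for exposition, not the presenters' code.

    # Illustrative Count-Min sketch: frequency estimates in fixed space with
    # constant-time updates and queries; estimates can only overcount, never undercount.
    import hashlib

    class CountMinSketch:
        def __init__(self, width=2048, depth=5):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _buckets(self, item):
            for row in range(self.depth):
                digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
                yield row, int(digest, 16) % self.width

        def add(self, item, count=1):
            for row, col in self._buckets(item):
                self.table[row][col] += count

        def estimate(self, item):
            # Take the minimum across rows; hash collisions only inflate counts.
            return min(self.table[row][col] for row, col in self._buckets(item))

    cms = CountMinSketch()
    for word in ["spark", "kafka", "spark", "flink", "spark"]:
        cms.add(word)
    print(cms.estimate("spark"))   # 3 (possibly higher under heavy collisions)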

Emily is a Machine Learning Specialist Solutions Architect at Amazon Web Services (AWS). She has been leading data science projects for many years, piloting the application of machine learning in areas as diverse as social media violence detection, economic policy evaluation, computer vision, reinforcement learning, IoT, drones, and robotic design. Her master's degree is from the University of Chicago, where she developed new applications of machine learning for public policy research with the Data Science for Social Good Fellowship. She has worked as a data scientist at the Federal Reserve Bank of Chicago and as a solutions architect for an explainable AI startup in Chicago. At AWS she guides customers from project ideation to full deployment, focusing on Amazon SageMaker. Her customers are household names across the world, such as T-Mobile.

Presentations

Alexa, Do Men Talk Too Much? Session

Mansplaining. Know it? Hate it? Want to make it go away? In this session we tackle the chronic problem of men talking over or down to women and its negative impact on career progression for women. We will also demonstrate an Alexa skill that uses deep learning techniques on incoming audio feeds. We discuss ownership of the problem for both women and men, and suggest helpful strategies.

Building a recommender system with Amazon ML Services Tutorial

In this workshop we'll introduce the Amazon SageMaker machine learning platform, followed by a high-level discussion of recommender systems. Next, we'll dig into different machine learning approaches for recommender systems.

Alf is responsible for the delivery of data science solutions at Klick Health, where he oversees a team of data scientists and AI researchers. He brings over 15 years of experience in data science, software development, and high-performance computing to the Klick team, combining his scientific background with an appreciation for the craft of code-writing. He has previously served as an information security officer, technology VP, and acting Chief Technology Officer. He holds two master's degrees in the physical sciences, including thesis work in computational astrophysics, and is also a Certified Information Systems Security Professional (CISSP).

Presentations

Handling Data Gaps in Time Series Using Imputation Session

What will tomorrow’s temperature be? My blood glucose levels tonight before bed? Time series forecasts depend on sensors or measurements made out in the real, messy world. Those sensors flake out, get turned off, disconnect, and otherwise conspire to cause missing data in our signals. We will show a number of methods for handling data gaps and give advice on which to consider and when.
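
For a concrete starting point, the sketch below shows a few common gap-filling strategies on a toy temperature series using pandas; the data and column names are assumed for illustration and are not from the talk.

    # Illustrative gap-filling on a sensor series with missing readings.
    import numpy as np
    import pandas as pd

    idx = pd.date_range("2019-09-23", periods=8, freq="H")
    temps = pd.Series([21.0, 21.4, np.nan, np.nan, 23.1, np.nan, 22.0, 21.6], index=idx)

    filled_ffill = temps.ffill()                        # carry last observation forward
    filled_linear = temps.interpolate(method="linear")  # straight line between neighbors
    filled_time = temps.interpolate(method="time")      # weight by actual time gaps

    print(pd.DataFrame({"raw": temps, "ffill": filled_ffill,
                        "linear": filled_linear, "time": filled_time}))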

Tony Wu manages the Altus core engineering team at Cloudera. Previously, Tony was a team lead for the partner engineering team at Cloudera. He is responsible for Microsoft Azure integration for Cloudera Director.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Moving to the cloud poses challenges, from re-architecting to be cloud-native to maintaining data context consistency across workloads that span multiple clusters on-premises and in the cloud. First, we'll cover cloud architecture and its challenges in depth; second, you'll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

Vincent Xie (谢巍盛) is the Chief Scientist and a director at China Telecom BestPay Co., Ltd. He built the company's artificial intelligence group and leads the team in carrying out research related to big data and AI. Previously, he worked for Intel, leading an engineering team working on machine learning and big data-related open source technologies.

Presentations

How China Telecom combats financial fraud across 50M transactions a day using Apache Pulsar Session

BestPay is the fintech arm of China Telecom, with half a billion registered users and 41 million monthly active users, and risk control decision deployment has been critical to the success of the business. In this talk we share how we leverage Apache Pulsar to boost the efficiency of our risk control decision development for combating financial fraud across more than 50 million transactions a day.

Tony Xing is a Principal Product Manager in AI platform team within Microsoft’s Cloud + AI organization. Previously, he was a senior product manager on the AI/Data/Infra team and Skype data team within Microsoft’s Application and Service Group, where he worked on products for data ingestion, real-time data analytics, and the data quality platform.

Presentations

Introducing a new anomaly detection algorithm (SR-CNN) inspired by Computer Vision Session

Anomaly detection may sound old-fashioned, yet it is super important in many industry applications. How about doing it in a computer vision way? Come to our talk to learn about a novel anomaly detection algorithm based on Spectral Residual (SR) and a Convolutional Neural Network (CNN), and how this novel method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.
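
For intuition, here is a minimal NumPy sketch of the Spectral Residual step the method builds on (the CNN that scores the resulting saliency map is omitted). This is an illustrative reading of the published idea, not Microsoft's implementation.

    # Illustrative Spectral Residual: remove the "expected" log-spectrum of a series
    # and transform back; anomalous points stand out in the resulting saliency map.
    import numpy as np

    def spectral_residual_saliency(values, window=3):
        spectrum = np.fft.fft(values)
        log_amp = np.log(np.abs(spectrum) + 1e-8)
        kernel = np.ones(window) / window
        avg_log_amp = np.convolve(log_amp, kernel, mode="same")  # "normal" spectrum
        residual = log_amp - avg_log_amp
        # Re-insert the residual amplitude with the original phase and invert.
        saliency = np.abs(np.fft.ifft(np.exp(residual + 1j * np.angle(spectrum))))
        return saliency

    series = np.sin(np.linspace(0, 20, 200))
    series[120] += 5                       # inject a spike
    saliency = spectral_residual_saliency(series)
    print(int(np.argmax(saliency)))        # the injected spike dominates the saliency map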

Bixiong Xu is the principal dev manager on the AI Platform team at Microsoft Cloud + AI.

Presentations

Introducing a new anomaly detection algorithm (SR-CNN) inspired by Computer Vision Session

Anomaly detection may sound old-fashioned, yet it is super important in many industry applications. How about doing it in a computer vision way? Come to our talk to learn about a novel anomaly detection algorithm based on Spectral Residual (SR) and a Convolutional Neural Network (CNN), and how this novel method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.

Leah Xu is a software engineer at Spotify, where she works on analytics for marketers, real time streaming infrastructure, and Spotify Wrapped. Leah previously worked at Bridgewater and Nest on data infrastructure and secure deployments in the cloud.

Presentations

Spotify Wrapped: Product, Design, and Deadlines DCS

Spotify Wrapped is a year-in-music retrospective for active consumers and artists. Wrapped surfaces nostalgic insights derived from dozens of petabytes of user listening data. This talk sheds light on creating Wrapped for hundreds of millions of users in an ecosystem ingesting millions of events per second. Leah Xu discusses making engineering tradeoffs given demanding requirements and stringent deadlines.

Ruixin Xu is a Senior Program Manager on the Microsoft Azure Big Data Tools team. Her focus areas are product design and project management, the development experience on big data platforms, software development toolchains, and software as a service (SaaS) offerings.

Presentations

Using Spark to speed up the diagnosis performance for big data applications Session

Microsoft's big data team ran an experiment using Spark and Jupyter notebooks as a replacement for existing IDE-based diagnostic tools for internal DevOps. The results indicate that the Spark-based solution improves diagnosis performance significantly, especially for complex jobs with large profiles, and that Jupyter notebooks also bring the benefits of fast iteration and easy knowledge sharing.

Software engineer at Uber

Presentations

How to performance tune Spark applications in large clusters? Session

Omkar Joshi and Bo Yang offer an overview of how Uber's ingestion (Marmaray) and observability team improved the performance of Apache Spark applications running on thousands of cluster machines and across more than 100,000 applications, and how they methodically tackled these issues. They will also cover how they used Uber's open-sourced jvm-profiler to debug issues at scale.

Jennifer Yang is the Head of Data Management and Data Governance at Wells Fargo Enterprise Core Services. Prior to this role, Jennifer served in various senior leadership roles in risk management and capital management at major financial institutions. Jennifer has a unique background that allows her to understand data and technology from both the end users' and data management's perspectives. She is passionate about leveraging the power of new technologies to gain insights from data and develop cost-effective and scalable business solutions. Jennifer holds an undergraduate degree in applied chemistry from Beijing University, a master's degree in computer science from the State University of New York at Stony Brook, and an MBA from New York University Stern School of Business, specializing in finance and accounting.

Presentations

Machine Learning in Data Quality Management Findata

Traditional rule-based data quality management is costly and hard to scale in a big data environment, and it requires subject matter experts across the business, data, and technology domains. A machine learning-based data quality management methodology enables a cost-effective and scalable way to manage data quality for large amounts of data.

Jeffrey is a Distinguished Data Scientist at WalmartLabs, where he leads data science for the store technology department. His prior roles include Chief Data Scientist at AllianceBernstein, a global asset management firm; vice president and head of data science at Silicon Valley Data Science; and senior leadership positions at Charles Schwab Corporation and KPMG. He has also taught econometrics, statistics, and machine learning at UC Berkeley, Cornell, NYU, the University of Pennsylvania, and Virginia Tech. Jeffrey is active in the data science community and often speaks at data science conferences and local events. He has many years of experience applying a wide range of econometric and machine learning techniques to create analytic solutions for financial institutions, businesses, and policy institutions. Jeffrey holds a PhD and an MA in economics from the University of Pennsylvania and a BS in mathematics and economics from UCLA.

Presentations

Time Series Forecasting using Deep Learning with PyTorch Session

Time series forecasting techniques can be applied in a wide range of scientific disciplines, business scenarios, and policy settings. This session discusses the application of deep learning techniques to time series forecasting and compares them to time series statistical models when forecasting time series with trends, multiple seasonality, regime switch, and exogenous series.
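
As a point of reference for the deep learning side, below is a minimal PyTorch sketch of a one-step-ahead LSTM forecaster trained on a toy seasonal series. The model and data are assumptions for illustration, not the models compared in the session.

    # Illustrative one-step-ahead forecaster: an LSTM reads a window of past
    # values and predicts the next one.
    import torch
    import torch.nn as nn

    class LSTMForecaster(nn.Module):
        def __init__(self, hidden_size=32):
            super().__init__()
            self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, 1)

        def forward(self, x):                 # x: (batch, window, 1)
            output, _ = self.lstm(x)
            return self.head(output[:, -1])   # predict the value right after the window

    # Toy seasonal series -> sliding windows of length 24 and their next values.
    series = torch.sin(torch.arange(0, 500, dtype=torch.float32) * 0.1)
    window = 24
    inputs = torch.stack([series[i:i + window] for i in range(len(series) - window)]).unsqueeze(-1)
    targets = series[window:].unsqueeze(-1)

    model = LSTMForecaster()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    for epoch in range(5):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
    print(float(loss))   # training error on the toy series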

Qun Ying is a senior product manager on the AI Platform team in Microsoft's Cloud + AI division.

Presentations

Introducing a new anomaly detection algorithm (SR-CNN) inspired by Computer Vision Session

Anomaly detection may sound old-fashioned, yet it is super important in many industry applications. How about doing it in a computer vision way? Come to our talk to learn about a novel anomaly detection algorithm based on Spectral Residual (SR) and a Convolutional Neural Network (CNN), and how this novel method was applied in the monitoring system supporting Microsoft AIOps and business incident prevention.

Alex leads the cell phone-based big data crowdsourcing and analytics strategy at T-Mobile. He identifies big data use cases, defines the technical requirements for data collection, crunches the crowdsourced data, and visualizes the results to steer business decisions. He won Innovator of the Year in 2017 for delivering the 5 GHz band utilization analysis for the unlicensed LTE deployment strategy.
Alex has 19 years of mobile industry experience in areas ranging from radio frequency engineering to product marketing and big data analytics. He draws on all of that experience to make his big data analytics create bigger value and practical differences.

Presentations

Journey to Turn Crowd Sourced Big Data into Actionable Insights at T-Mobile Session

T-Mobile successfully improved the quality of voice calling by analyzing crowdsourced big data from mobile devices. In this session, you will learn how engineers from multiple backgrounds collaborated to achieve a 10% improvement in voice quality and why big data analysis was the key to bringing better voice call quality to millions of end users.

Petar started out as a Java developer almost 20 years ago and has worked as a software architect, team leader, and IBM software consultant. After switching to the exciting new field of big data technologies, he wrote Spark in Action (Manning, 2016), and these days he primarily works on Apache Spark and big data projects. Today he is CTO of SV Group in Zagreb, Croatia, while also pursuing his PhD at the University of Zagreb. He collaborates with the Astronomy Department at the University of Washington on building new methods for processing images and data from future astronomical surveys.

Presentations

Using Spark for crunching astronomical data on the LSST scale Session

The Large Synoptic Survey Telescope (LSST) is one of the most important future surveys. Its unique design will allow it to cover large regions of the sky and obtain images of the faintest objects. In 10 years of operation it will produce about 80 PB of data, in both images and catalog data. I will present AXS, a system we built for fast processing and cross-matching of survey catalog data.

Jeff is a software engineer and cloud architect. He is a committer and PMC member on Apache OpenNLP. Jeff currently works on cloud, big data, and NLP projects.

Presentations

Protecting the Healthcare Enterprise from PHI Breaches using Streaming and NLP Session

This talk describes how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment.

Yiyi Zeng is a senior manager and principal data scientist at Wal-Mart Labs. She has 12 years of extensive experience in business analytics and intelligence, decision management, fraud detection, credit risk, online payment, and e-commerce across various business domains, including both Fortune 500 firms and startups. She and her team use supervised and unsupervised machine learning techniques to detect fraud, including stolen financials, account takeover, identity fraud, promotion/return abuse, and victim scams. She is enthusiastic about mining large-scale data and applying machine learning knowledge to improve business outcomes.

Presentations

Machine Learning and Large Scale Data Analysis On Centralized Platform Session

How the No. 1 retailer provides a secure and seamless shopping experience through machine learning and large-scale data analysis on a centralized platform.

Alice Zhao is currently a Senior Data Scientist at Metis, where she teaches 12-week data science bootcamps. Previously, she worked at Cars.com, where she started as the company’s first data scientist, supporting multiple functions from Marketing to Technology. During that time, she also co-founded a data science education startup, Best Fit Analytics Workshop, teaching weekend courses to professionals at 1871 in Chicago. Prior to becoming a data scientist, she worked at Redfin as an analyst and at Accenture as a consultant. She has her M.S. in Analytics and B.S. in Electrical Engineering, both from Northwestern University. She blogs about analytics and pop culture on A Dash of Data. Her blog post, “How Text Messages Change From Dating to Marriage” made it onto the front page of Reddit, gaining over half a million views in the first week. She is passionate about teaching and mentoring, and loves using data to tell fun and compelling stories.

Presentations

Introduction to Natural Language Processing in Python Tutorial

As data scientists, we are known for crunching numbers, but what happens when we run into text data? In this tutorial, I will walk through the steps to turn text data into a format that a machine can understand, share some of the most popular text analytics techniques, and showcase several natural language processing (NLP) libraries in Python, including NLTK, TextBlob, spaCy, and gensim.
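
As a preview of the "text into a format a machine can understand" step, here is a small illustrative Python sketch that cleans raw text and builds a document-term matrix. It uses scikit-learn for vectorization (the tutorial itself focuses on NLTK, TextBlob, spaCy, and gensim); the example documents are assumptions, and get_feature_names_out requires scikit-learn 1.0 or later.

    # Illustrative text preprocessing: lowercase, strip punctuation, then turn
    # the cleaned documents into a bag-of-words matrix.
    import re
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "Strata brings data scientists together!",
        "Data scientists crunch numbers and, increasingly, text.",
    ]

    def clean(text):
        text = text.lower()
        return re.sub(r"[^a-z\s]", "", text)   # keep letters and spaces only

    vectorizer = CountVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(clean(d) for d in docs)

    print(vectorizer.get_feature_names_out())   # the learned vocabulary
    print(matrix.toarray())                     # one bag-of-words row per document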

Nan Zhu is a software engineer at Uber. He works on optimizing Apache Spark for Uber's scenarios and scaling XGBoost in Uber's machine learning platform. Nan has been a member of the XGBoost committee since 2016. He started the XGBoost4J-Spark project, integrating XGBoost and Spark, as well as the fast histogram algorithm for distributed training.

Presentations

We run, we improve, we scale - XGBoost story in Uber Session

XGBoost has been widely deployed in companies across the industry. This talk begins by introducing the internals of distributed training in XGBoost and then demonstrates how XGBoost solves business problems at Uber, scaling to thousands of workers and tens of terabytes of training data.
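
For orientation, here is a minimal single-machine Python sketch of the kind of training loop that XGBoost4J-Spark scales out, using the fast histogram algorithm mentioned in the speaker's bio. The synthetic data and parameters are assumptions for illustration, not Uber's setup.

    # Illustrative single-machine XGBoost training on synthetic binary labels.
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(42)
    X = rng.normal(size=(10_000, 20))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic binary labels

    dtrain = xgb.DMatrix(X[:8000], label=y[:8000])
    dvalid = xgb.DMatrix(X[8000:], label=y[8000:])

    params = {
        "objective": "binary:logistic",
        "tree_method": "hist",   # the fast histogram algorithm
        "max_depth": 6,
        "eta": 0.1,
    }
    booster = xgb.train(params, dtrain, num_boost_round=50,
                        evals=[(dvalid, "valid")], verbose_eval=10)
    print(booster.eval(dvalid))  # validation metric after training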

Contact us

confreg@oreilly.com

For conference registration information and customer service

partners@oreilly.com

For more information on community discounts and trade opportunities with O’Reilly conferences

strataconf@oreilly.com

For information on exhibiting or sponsoring a conference

Contact list

View a complete list of Strata Data Conference contacts