Presented by O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Speakers

Hear from innovative data scientists, senior engineers, and leading executives who are doing amazing things with data. More speakers will be announced; please check back for updates.

Ziya Ma is the vice president of architecture, graphics, and software as well as a director of data analytics technologies in system software products at Intel. She’s responsible for optimizing big data solutions on the Intel architecture platform, leading open source efforts in the Apache community, and bringing about optimal big data analytics and AI experiences for customers. Her team works across Intel, the open source community, industry, and academia to further Intel’s leadership in big data analytics. Ziya is a cofounder of the Women in Big Data Forum. At the 2018 Global Women Economic Forum, she was honored as Women of the Decade in Data and Analytics. She holds a master’s degree and PhD in computer science and engineering from Arizona State University.

Presentations

Derive value from analytics and AI at scale (sponsored by Intel) Keynote

Data is the fuel for analytics and AI workloads, but the challenges in using it are constant. Ziya Ma discusses how recent innovations from Intel in high-capacity persistent memory and open source software are accelerating production-scale deployments, delivering breakthrough optimizations and faster insights to a wide range of opportunities in the digital enterprise.

Suraj Acharya is a software engineer on the cloud team at Cloudera.

Presentations

A comparative analysis of the fundamentals of AWS and Azure Session

The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies now face the difficult decision of which cloud to go with. This decision is not just financial and in many cases rests on the underlying infrastructure. Jason Wang, Suraj Acharya, and Tony Wu compare the relative strengths and weaknesses of AWS and Azure.

Running multidisciplinary big data workloads in the cloud Tutorial

Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud, integrate data engineering and data analytics workflows, and explore considerations and best practices for data analytics pipelines in the cloud. Along the way, you'll see how to share metadata across workloads in a big data PaaS.

Viviana Acquaviva is an astrophysicist and associate professor at CUNY, where she uses data science techniques to study the universe.

Presentations

Learning machine learning using astronomy datasets Tutorial

Using interesting, diverse, publicly available datasets and actual problems in astronomy research, Viviana Acquaviva leads an intermediate tutorial on machine learning. You'll learn how to customize algorithms and evaluation metrics required by scientific applications and discover best practices for choosing, developing, and evaluating machine learning algorithms on "real-world" datasets.

Dan Adams is vice president of data product management at Pitney Bowes, where he is developing and evolving Pitney Bowes’s rich portfolio of data products and capabilities, focused on data quality, currency, and usability. With 21 years of experience in the location industry, primarily in the creation of map databases and data products, Dan is well versed in the important role data plays in today’s business landscape. Previously, he was CEO at Maponics (acquired by Pitney Bowes), where he focused on building and leveraging the company’s unique spatial data portfolio to serve customers across real estate, social tech, and mobile; held executive positions at TomTom, including vice president of partner development, vice president of product management, and vice president of geospatial sales; and served on the management teams of GDT and Tele Atlas North America, managing all aspects of data collection and map production and engineering.

Presentations

Data for posterity: Nobody licenses or builds data just to have it. (sponsored by Pitney Bowes) Session

The role of data and the demand to get it right, coupled with competitive pressures to move faster, have dramatically increased. Companies now recognize data as an asset and need to manage it that way. Join Dan Adams for the insights you need to ensure that your data addresses current and future needs and that your organization is set up for success.

Nishith Agarwal is a senior software engineer at Uber, where he works on the Hudi project and the Hadoop platform at large. His interests lie in large-scale distributed and data systems.

Presentations

Hudi: Unifying storage and serving for batch and near-real-time analytics Session

Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

Vijay Srinivas Agneeswaran holds a bachelor's degree in computer science and engineering from SVCE, Madras University (1998), an MS (by research) from IIT Madras (2001), and a PhD from IIT Madras (2008) and completed a postdoctoral research fellowship in the LSIR Labs at the Swiss Federal Institute of Technology, Lausanne (EPFL). He currently heads data sciences R&D at Walmart Labs, India. He has spent the last eighteen years creating intellectual property and building data-based products in industry and academia. In his current role, he heads machine learning platform development and data science foundation teams, which provide platform and intelligent services for Walmart businesses across the world. In the past, he led the team that delivered real-time hyperpersonalization for a global automaker, as well as work for clients across domains such as retail, banking and finance, telecom, and automotive. He built PMML support into Spark and Storm and implemented several machine learning algorithms, such as LDA and random forests, on Spark. He led a team that designed and implemented a big data governance product for role-based fine-grained access control inside Hadoop YARN, and he and his team built the first distributed deep learning framework on Spark. He has been a professional member of the ACM and a senior member of the IEEE for the last 10+ years. He holds five full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, artificial intelligence, big data, and other emerging technologies.

Presentations

Deep learning-based search and recommendation systems using TensorFlow Tutorial

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.

Arpan Agrawal is a software engineer on the analytics platforms and applications team at LinkedIn. He holds a graduate degree in computer science and engineering from IIT Kanpur.

Presentations

TuneIn: How to get your jobs tuned while you are sleeping Session

Have you ever tuned a Spark or MR job? If the answer is yes, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.

Adil Aijaz is cofounder and CEO at Split Software. Adil brings over 10 years of engineering and technical experience in roles such as software engineer and technical specialist at some of the most innovative enterprise companies, including LinkedIn, Yahoo, and RelateIQ (acquired by Salesforce). Adil’s tenure at these companies helped build the foundation for Split Software, giving him the needed experience in solving data-driven challenges and delivering data infrastructure. Adil holds a BS in computer science and engineering from UCLA and an ME in computer science from Cornell University.

Presentations

The lure of "the one metric that matters" Session

Many products, whether data-driven or not, chase “the one metric that matters.” It may be engagement, revenue, or conversion, but the common theme is the pursuit of improvement in one metric. Product development teams should instead focus on designing metrics that measure their goals. Adil Aijaz shares an approach to designing metrics and discusses best practices and common pitfalls.

Sara Alavi is a senior manager of data science on the network big data and AI team at Bell Canada, where she leads the network-analytics-as-a-service team with a primary focus on wireless. The team is championing the use of big data, advanced analytics, artificial intelligence, and machine learning to transform traditional networks into self-operating and self-healing intelligent networks. Sara has more than 11 years of experience in the telecommunications industry with a record of accomplishments in data use for network analytics, strategic planning, engineering, and operations. Previously, she held technical positions at Ericsson and Nortel. Sara holds a BASc in electrical engineering from the University of Ottawa.

Presentations

Use of modern data environments in telecom (sponsored by MicroStrategy) Session

Bell Canada, Canada's largest communications company, leads the industry in providing world-class broadband communications services to consumers and business customers. Join Sara Alavi to learn how the network big data and AI team within Bell is using modern data environments and applying a startup mindset to transform traditional networks into insight-driven intelligent networks.

Amro Alkhatib is a data scientist with the National Health Insurance Company-Daman, a leading health insurance company headquartered in Abu Dhabi, UAE. He focuses on business-driven AI expert systems for health insurance. Amro holds an MSc in quantum computing from Masdar Institute in partnership with MIT and a BSc in computer systems engineering from Birzeit University.

Presentations

Real-time automated claim processing: The surprising utility of NLP methods on non-text data Findata

Processing claims is central to every insurance business. Amro Alkhatib shares a successful business case for automating claims processing, from idea to production. The machine learning-based claim automation model uses NLP methods on non-text data and allows auditable automated claims decisions to be made.

SriSatish Ambati is the cofounder and CEO of H2O.ai, makers of H2O, the leading open source machine learning platform, and Driverless AI, which speeds up data science workflows by automating feature engineering, model tuning, ensembling, and model deployment. Sri is known for envisioning killer apps in fast-evolving spaces and assembling stellar teams toward productizing that vision. A regular speaker on the big data, NoSQL, and Java circuit, Sri leaves a trail @srisatish.

Presentations

GPU-accelerated analytics and machine learning ecosystems (Inception Showcase sponsored by NVIDIA) Session

Explore case studies from Datalogue, FASTDATA.io, and H2O.ai that demonstrate how GPU-accelerated analytics, machine learning, and ETL help companies overcome slow queries and tedious data preparation processes, dynamically correlate data, and enjoy automatic feature engineering.

Anand S is a cofounder of data science company Gramener, where he leads a team of data enthusiasts who tell visual stories of insights from analysis. These stories are built on the Gramener Visualisation Server. Previously, Anand worked at IBM, Infosys, Lehman Brothers, and BCG. He studied at IIT Madras, IIM Bangalore, and LBS.

Presentations

Mapping India Data Case Studies

Answering simple questions about India's geography can be a nightmare. Official shapefiles are not publicly available. Worse, each ministry uses its own maps. But an active group of volunteers is crafting open maps. Anand S explains what it takes for a grassroots initiative to transform a country's data infrastructure.

Archana Anandakrishnan is a senior data scientist in the Decision Science Organization at American Express, where she works on developing data products that accelerate the modeling lifecycle and the adoption of new methods. She is currently a lead developer of and contributor to DataQC Studio. Previously, she was a postdoctoral researcher in particle physics at Cornell University. She is passionate about mentoring and is currently a workplace mentor with Big Brothers Big Sisters, NYC. Archana holds a PhD in physics from the Ohio State University.

Presentations

Let the machines learn to improve data quality Session

Building accurate machine learning models hinges on the quality of the data. Errors and anomalies get in the way of data scientists doing their best work. Archana Anandakrishnan explains how American Express created an automated, scalable system for measurement and management of data quality. The methods are modular and adaptable to any domain where accurate decisions from ML models are critical.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He’s taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He’s widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time systems with Spark Streaming and Kafka 1-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data, and how do you process it? Jesse Anderson explores the latest real-time frameworks and explains how to choose the right one for your company.

Alberto Andreotti is a senior data scientist on the Spark NLP team at John Snow Labs, where he implements state-of-the-art NLP algorithms on top of Spark. He has a decade of experience working for companies including Motorola, Intel, and Samsung and as a consultant, specializing in the field of machine learning. Alberto has written lots of low-level code in C/C++ and was an early Scala enthusiast and developer. A lifelong learner, he holds degrees in engineering and computer science and is working on a third in AI. Alberto was born in Argentina. He enjoys the outdoors, particularly hiking and camping in the mountains of Argentina.

Presentations

Spark NLP in action: How SelectData uses AI to better understand home health patients Session

David Talby, Alberto Andreotti, Stacy Ashworth, and Tawny Nichols outline a question-answering system for accurately extracting facts from free-text patient records and share best practices for training domain-specific deep learning NLP models. The solution is based on Spark NLP, an extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding.

Patrick Angeles is the chief architect for financial services at Cloudera.

Presentations

Too big data to fail: How banks use big data to prevent the next financial crisis Findata

The financial crisis of 2008 exposed systemic issues in the financial system that resulted in the failures of several established institutions and a bailout of the entire industry. Patrick Angeles explains why banks and regulators are turning to big data solutions to avoid a repeat of history.

Julia Angwin is an award-winning investigative journalist at the independent news organization ProPublica. Previously, she was a reporter at the Wall Street Journal, where she led a privacy investigative team that was a finalist for a Pulitzer Prize in Explanatory Reporting in 2011 and won a Gerald Loeb Award in 2010. In 2003, she was on a team of reporters at the Wall Street Journal that was awarded the Pulitzer Prize in Explanatory Reporting for coverage of corporate corruption. She is the author of Dragnet Nation: A Quest for Privacy, Security, and Freedom in a World of Relentless Surveillance and Stealing MySpace: The Battle to Control the Most Popular Website in America. Julia holds a BA in mathematics from the University of Chicago and an MBA from the Graduate School of Business at Columbia University.

Presentations

Quantifying forgiveness Keynote

Algorithms are increasingly arbiters of forgiveness. Julia Angwin discusses what she has learned about forgiveness in her series of articles on algorithmic accountability and the lessons we all need to learn for the coming AI future.

Mauricio Aristizabal is the data pipeline architect at Impact (formerly Impact Radius), a marketing technology company that helps brands grow by optimizing their paid marketing and media spend. Mauricio is responsible for massively scaling and modernizing the company’s analytics capabilities, selecting data stores and processing platforms, and designing many of the jobs that process internally and externally captured data and make it available to report and dashboard users, analytic applications, and machine learning jobs. He also assists the operations team with maintaining and tuning its Hadoop and Kafka clusters.

Presentations

Real-time analytics and BI with data lakes and data warehouses using Kudu, HBase, Spark, and Kafka: Lessons learned Session

Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components.

Sudhanshu Arora is a software engineer at Cloudera, where he leads the development for data management and governance solutions. Previously, Sudhanshu was with the platform team at Informatica, where he helped design and implement its next-generation metadata repository.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud, integrate data engineering and data analytics workflows, and explore considerations and best practices for data analytics pipelines in the cloud. Along the way, you'll see how to share metadata across workloads in a big data PaaS.

David Arpin is a data scientist at Amazon Web Services.

Presentations

Building a large-scale machine learning application using Amazon SageMaker and Spark Tutorial

David Arpin walks you through building a machine learning application, from data manipulation to algorithm training to deployment to a real-time prediction endpoint, using Spark and Amazon SageMaker.

Ahsan Ashraf is a data scientist at Pinterest focusing on recommendations and ranking for the discovery team. Previously, Ahsan worked with personal finance startup wallet.ai as part of an Insight Data Science Fellowship, where he designed and built a recommender system that drew insights into users’ spending habits from their transaction histories. Ahsan holds a PhD in condensed/soft matter physics.

Presentations

Diversification in recommender systems: Using topical variety to increase user satisfaction Session

Online recommender systems often rely heavily on user engagement features. This can cause a bias toward exploitation over exploration, overoptimizing on users' interests. Content diversification is important for user satisfaction, but measuring and evaluating impact is challenging. Ahsan Ashraf outlines techniques used at Pinterest that drove ~2–3% impression gains and a ~1% time-spent gain.

Stacy Ashworth is a registered nurse and chief clinical officer at SelectData. Stacy’s professional interests lie in the use of technology to improve the quality of care through better decision making. An accomplished speaker, she contributed to the healthcare informatics and technology track of the 2016 Business and Health Administration Association meeting, presenting research evaluating glucose monitoring technologies for cost-effective quality control and management of diabetes. She holds a master’s degree in healthcare administration with an emphasis in informatics. Postacute care, geriatrics, and coding may be her passions, but her love is firmly centered on her family of two lively teenagers, a spouse, and a couple of schnauzers to keep things interesting.

Presentations

Spark NLP in action: How SelectData uses AI to better understand home health patients Session

David Talby, Alberto Andreotti, Stacy Ashworth, and Tawny Nichols outline a question-answering system for accurately extracting facts from free-text patient records and share best practices for training domain-specific deep learning NLP models. The solution is based on Spark NLP, an extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding.

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Presentations

Quick, reliable, and cost-effective ways to operationalize big data apps (sponsored by Unravel) Session

Operationalizing big data apps in a quick, reliable, and cost-effective manner remains a daunting task. Shivnath Babu and Madhusudan Tumma outline common problems and their causes and share best practices to find and fix these problems quickly and prevent such problems from happening in the first place.

Tony Baer is the founding principal of dbInsight LLC, which provides independent counsel to data and analytics technology providers navigating a changing world where cloud deployment, artificial intelligence, and data are upending their clients’ expectations. Baer is an authority on how cloud-native architecture can transform traditional on-premises data platforms and how it can be used to break down data and application silos. He coauthored some of the earliest books on the Java and .NET frameworks, including Understanding the .NET Framework and J2EE Technology in Practice. His career began as a journalist with leading publications including Computerworld, Application Development Trends, Computergram, Software Magazine, InformationWeek, and Manufacturing Business Technology.

Presentations

Executive Briefing: Profit from AI and machine learning—The best practices for people and process Session

Tony Baer and Florian Douetteau share the results of research cosponsored by Ovum and Dataiku that surveyed a specially selected sample of chief data officers and data scientists on how to map roles and processes to make success with AI in the business repeatable.

白冰 is a senior big data platform development engineer at JD.com focusing on computation and storage frameworks such as Spark, Hive, Presto, Alluxio, and HDFS. 白冰 is experienced in designing and developing architecture for deploying these frameworks into production on large-scale clusters.

Presentations

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks Session

Tao Huang, Mang Zhang, and 白冰 explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average.
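
To make the pluggable pattern concrete, here is a minimal, hypothetical PySpark sketch of the idea described above: an existing job reads through Alluxio simply by swapping an hdfs:// path for an Alluxio-compatible URL. The cluster addresses and paths are illustrative, not JD.com's configuration, and the Alluxio client library is assumed to be on Spark's classpath.

    # Hypothetical sketch: redirect an existing Spark job through Alluxio
    # by changing only the URL scheme (addresses and paths are made up).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-read").getOrCreate()

    # Before: df = spark.read.parquet("hdfs://namenode:8020/warehouse/events")
    df = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/events")
    df.groupBy("event_type").count().show()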

Michael Balint is a senior manager of applied solutions engineering at NVIDIA. Previously, Michael was a White House Presidential Innovation Fellow, where he brought his technical expertise to projects like Vice President Biden’s Cancer Moonshot program and Code.gov. Michael has had the good fortune of applying software engineering and data science to many interesting problems throughout his career, including tailoring genetic algorithms to optimize air traffic, harnessing NLP to summarize product reviews, and automating the detection of melanoma via machine learning. He is a graduate of Cornell and Johns Hopkins University.

Presentations

Kubernetes on GPUs (sponsored by NVIDIA) Session

Michael Balint explains how NVIDIA employs its own distribution of Kubernetes, in conjunction with DGX hardware, to make the most efficient use of GPU resources and scale its efforts across a cluster, allowing multiple users to run experiments and push their finished work to production.

Roger Barga is general manager and director of development at Amazon Web Services, where he is responsible for Kinesis data streaming services. Previously, Roger was in the Cloud Machine Learning Group at Microsoft, where he was responsible for product management of the Azure Machine Learning service. Roger is also an affiliate professor at the University of Washington, where he is a lecturer in the Data Science and Machine Learning programs. Roger holds a PhD in computer science, has been granted over 30 patents, has published over 100 peer-reviewed technical papers and book chapters, and has authored a book on predictive analytics.

Presentations

Continuous machine learning over streaming data: The story continues. Session

Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams.

Dylan Bargteil is a data scientist in residence at the Data Incubator, where he works on research-guided curriculum development and instruction. Previously, he worked with deep learning models to assist surgical robots and was a research and teaching assistant at the University of Maryland, where he developed a new introductory physics curriculum and pedagogy in partnership with the Howard Hughes Medical Institute (HHMI). Dylan studied physics and math at the University of Maryland and holds a PhD in physics from New York University.

Presentations

Machine learning from scratch in TensorFlow 1-Day Training

The TensorFlow library uses data flow graphs for numerical computation, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms. Dylan Bargteil introduces TensorFlow's capabilities through its Python interface.
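
As a taste of this data flow model, here is a minimal sketch, not taken from the course materials, using the TensorFlow 1.x Python interface: the graph is declared first, and nothing executes until a session runs it.

    # Minimal illustration of TensorFlow's data flow graphs (TF 1.x API).
    import tensorflow as tf

    a = tf.placeholder(tf.float32, name="a")
    b = tf.placeholder(tf.float32, name="b")
    c = a * b  # adds a multiply node to the graph; nothing runs yet

    with tf.Session() as sess:
        # The graph executes only when values are fed in.
        print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))  # prints 12.0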

Bonnie Barrilleaux is a staff data scientist in analytics at LinkedIn, primarily focused on communities and the content ecosystem. She uses data to guide product strategy, performs experiments to understand the ecosystem, and creates metrics to evaluate product performance. Previously, she completed a postdoctoral fellowship in genomics at the University of California, Davis, studying the function of the MYC gene in cancer and stem cells. Bonnie has published peer-reviewed works including 11 journal articles, a book chapter, and a video article and has been awarded multiple grants to create interactive art. She holds a PhD in chemical engineering from Tulane University.

Presentations

Perverse incentives in metrics: Inequality in the like economy Session

As LinkedIn encouraged members to join conversations, it found itself in danger of creating a "rich get richer" economy in which a few creators got an increasing share of all feedback. Bonnie Barrilleaux explains why you must regularly reevaluate metrics to avoid perverse incentives—situations where efforts to increase the metric cause unintended negative side effects.

James Bednar is a senior solutions architect at Anaconda. Previously, Jim was a lecturer and researcher in computational neuroscience at the University of Edinburgh, Scotland, and a software and hardware engineer at National Instruments. He manages the open source Python projects datashader, HoloViews, GeoViews, ImaGen, and Param. He has published more than 50 papers and books about the visual system, data visualization, and software development. Jim holds a PhD in computer science from the University of Texas as well as degrees in electrical engineering and philosophy.

Presentations

Making interactive browser-based visualizations easy in Python Tutorial

Python lets you solve data science problems by stitching together packages from the Python ecosystem, but it can be difficult to assemble the right tools to solve real-world problems. James Bednar walks you through using the 15+ packages covered by the new PyViz.org initiative to make it simple to build interactive plots and dashboards, even for large, streaming, and highly multidimensional data.
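
For a sense of the style of plotting involved, here is a minimal sketch, assuming HoloViews with the Bokeh backend, two of the packages under the PyViz umbrella; the data is illustrative, and this is not an excerpt from the tutorial.

    # Minimal HoloViews + Bokeh sketch: produces an interactive HTML plot.
    import numpy as np
    import holoviews as hv
    hv.extension("bokeh")

    xs = np.linspace(0, 10, 200)
    curve = hv.Curve((xs, np.sin(xs)), "x", "sin(x)")
    hv.save(curve, "curve.html")  # open in a browser to pan and zoom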

William Benton is an engineering manager and senior principal software engineer at Red Hat, where he leads a team of data scientists and engineers. He’s applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His focus is investigating the best ways to build and deploy intelligent applications in cloud native environments, but he’s also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.

Presentations

Why data scientists should love Linux containers Session

Containers are a hot technology for application developers, but they also provide key benefits for data scientists. William Benton details the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively.

Mike Berger is vice president of population health informatics and data science at Mount Sinai Health, where he is delivering on the promise to transform Mount Sinai into the premier population health management system in the New York metro area. His role includes developing and implementing data-driven clinical and actuarial decision support through the use of advanced analytics, machine learning, and timely operational BI. He has over 20 years of experience across a combination of large academic medical centers, payer organizations, entrepreneurial startups, and management consultancies. Mike is the cochair of the HIMSS Clinical and Business Intelligence Community and hosts a webinar series for analytics thought leaders. He was recently named a 2018 “top 50 data and analytics professional in the US and Canada” by Corinium Intelligence. Originally from Huntington Beach, CA, Mike holds a degree in industrial and systems engineering from USC, a healthcare project management certification from Harvard’s Graduate School of Public Health, and a master’s degree from NYU Stern.

Presentations

Decision-centricity: Operationalizing analytics and data science in health systems Data Case Studies

Mount Sinai Health has moved up the analytics maturity chart to deliver business value in new risk models around population health. Mike Berger explains how Mount Sinai designed a team, built a data factory, and generates the analytics to drive decision-centricity and explores examples of mixing Tableau, SQL, Hive, APIs, Python, and R into a cohesive ecosystem supported by a data factory.

Tim Berglund is the senior director of developer experience with Confluent, where he serves as a teacher, author, and technology leader. Tim can frequently be found speaking at conferences internationally and in the United States. He’s the copresenter of various O’Reilly training videos on topics ranging from Git to distributed systems and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at Timberglund.com, and is the cohost of the DevRel Radio podcast. He lives in Littleton, Colorado, with the wife of his youth and their youngest child, the other two having mostly grown up.

Presentations

Stream processing with Kafka and KSQL Tutorial

Tim Berglund leads this solid introduction to Apache Kafka as a streaming data platform. You'll cover the internal architecture, APIs, and platform components like Kafka Connect and Kafka Streams, then finish with an exercise processing streaming data using KSQL, the new SQL-like declarative stream processing language for Kafka.

Partha works on the Solutions Engineering team at Cambridge Semantics.

Presentations

From data lakes to the data fabric: Our vision for digital strategy (sponsored by Cambridge Semantics) Session

Ben Szekely shares a vision for digital innovation: The data fabric connects enterprise data for unprecedented access in an overlay fashion that does not disrupt current investments. Interconnected and reliable data drives business outcomes by automating scalable AI and ML efforts. Graph technology is the way forward to realize this future.

Anya Bida is a senior member of the technical staff (SRE) at Salesforce. She’s also a co-organizer of the SF Big Analytics meetup group and is always looking for ways to make platforms more scalable, cost efficient, and secure. Previously, Anya worked at Alpine Data, where she focused on Spark operations.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.

Albert Bifet is a professor and head of the Data, Intelligence, and Graphs (DIG) Group at Télécom ParisTech and a scientific collaborator at École Polytechnique. A big data scientist with 10+ years of international research experience, Albert has led new open source software projects for business analytics, data mining, and machine learning at Huawei, Yahoo, the University of Waikato, and UPC. At Yahoo Labs, he cofounded Apache scalable advanced massive online analysis (SAMOA), a distributed streaming machine learning framework that contains a programming abstraction for distributed streaming ML algorithms. At the WEKA Machine Learning Group, he co-led massive online analysis (MOA), the most popular open source framework for data stream mining, with more than 20,000 downloads each year. Albert is the author of Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams and the editor of the Big Data Mining special issue of SIGKDD Explorations. He was cochair of the industrial track at ECML PKDD, of BigMine, and of the data streams track at ACM SAC. He holds a PhD from BarcelonaTech.

Presentations

Machine learning for nonstationary streaming data using Structured Streaming and StreamDM Session

The StreamDM library provides the largest collection of data stream mining algorithms for Spark. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specifically for nonstationary streams (i.e., those with concept drift).

Alex Bleakley is the manager of the Machine Learning Solutions Architecture team at Cloudera. Alex combines core machine learning skills with six years of experience implementing practical data solutions across multiple industries to lead a team focused on taking machine learning solutions to production at big data scale.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Ryan Blue is an engineer on Netflix’s big data platform team. Previously, Ryan was responsible for the Avro and Parquet file formats at Cloudera. He is the author of the Analytic Data Storage in Hadoop series of screencasts from O’Reilly.

Presentations

Introducing Iceberg: Tables designed for object stores Session

Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet.

The evolution of Netflix's S3 data warehouse Session

In the last few years, Netflix's data warehouse has grown to more than 100 PB in S3. Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3.

Ashim Bose is the global leader of analytics product management at DXC Analytics, where he focuses on helping clients achieve business outcomes from their data by leveraging DXC Analytics offerings. He has over 20 years of industry experience in automotive, industrial, airline, telecom, and space exploration. Ashim holds a PhD in artificial intelligence and a master’s degree in mechanical engineering.

Presentations

Minimum viable machine learning: The applied data science bootcamp (sponsored by DXC Technology) 1-Day Training

Acquiring machine learning (ML) technology is relatively straightforward, but ML must be applied to be useful. In this one-day boot camp that is equal parts hackathon, presentation, and group participation, Jerry Overton, Ashim Bose, and Samir Sehovic teach you how to apply advanced analytics in ways that reshape the enterprise and improve outcomes.

Bob Bradley is the data solutions manager at Geotab, a global leader in telematics providing open platform fleet management solutions to over 1.2 million connected vehicles worldwide. Bob leads a team responsible for developing data-driven solutions that leverage Geotab’s big data repository of over 3 billion records each day. Previously, Bob spent more than 14 years as the cofounder and vice president of a software development shop (acquired by Geotab in 2016), where he focused on delivering custom business intelligence solutions to companies across Canada.

Presentations

Building the bridge from big data to ML, featuring Geotab (sponsored by Google Cloud) Session

If your company isn’t good at analytics, it’s not ready for AI. Bob Bradley and Chad W. Jennings explain how the right data strategy can set you up for success in machine learning and artificial intelligence—the new ground for gaining competitive edge and creating business value. You'll then see an in-depth demonstration of Google technology from smart cities innovator Geotab.

Claudiu Branzan is an analytics senior manager in the Applied Intelligence Group at Accenture, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies using big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Mikio Braun is a principal engineer for search at Zalando, one of Europe’s biggest fashion platforms. He worked in research for a number of years before becoming interested in putting research results to good use in the industry. Mikio holds a PhD in machine learning.

Presentations

Executive Briefing: From Business to AI—The missing pieces in becoming "AI ready" Session

In order to become "AI ready," an organization not only has to provide the right technical infrastructure for data collection and processing but also must learn new skills. Mikio Braun highlights three pieces companies often miss when trying to become AI ready: making the connection between business problems and AI technology, implementing AI-driven development, and running AI-based projects.

Machine learning for time series: What works and what doesn't Session

Time series data has many applications in industry, from analyzing server metrics to monitoring IoT signals to detecting outliers. Mikio Braun offers an overview of time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations for what works and what doesn’t, and industry use cases.

Lindsay Brin is a data scientist at T4G. A motivated, curious, and analytical data scientist with more than a decade of experience with research methods and the scientific process, Lindsay excels at asking incisive questions and using data to tell compelling stories, from generating testable hypotheses to wrangling imperfect data to finding insights via analytical models. Lindsay is passionate about teaching the skills necessary to analyze data more efficiently and effectively and has developed and taught workshops and online courses at the University of New Brunswick. She is also a Data Carpentry instructor and Ladies Learning Code chapter co-lead. Having recently made a career pivot from biogeochemistry to data science, she is also well-positioned to provide insight into the applicability of academic research and analysis skills to business problems.

Presentations

From theory to data product: Applying data science methods to effect business change Tutorial

Janet Forbes, Danielle Leighton, and Lindsay Brin lead a primer on crafting well-conceived data science projects that uncover valuable business insights. Using case studies and hands-on skills development, Janet, Danielle, and Lindsay walk you through essential techniques for effecting real business change.

Ian Brooks is a solutions engineer at Cloudera. He is incredibly passionate about the power of data. Ian holds a PhD in computer science.

Presentations

Improving patient screening by applying predictive analytics to electronic medical records Session

The power of big data continues to modernize traditional industries, including healthcare. Ian Brooks explains how to implement intelligent preventive screening by applying predictive analytics to electronic medical records (EMRs) via supervised machine learning techniques.

Bruno Faria is a senior EMR solutions architect at Amazon Web Services. He spends his time helping customers understand big data application architectures and integration approaches for running those applications in Amazon Web Services.

Presentations

Best practices for migrating big data workloads to Amazon Web Services (sponsored by Amazon Web Services) Session

Bruno Faria explains how to identify the components and workflows in your current environment and shares best practices to migrate these workloads to AWS.

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez, Radhika Ravirala, Paul Sears, and Bruno Faria walk you through building a big data application using a combination of open source technologies and AWS managed services.

Andrew Brust is founder and CEO of Blue Badge Insights, a blogger for ZDNet Big Data, and a data and analytics-focused analyst for GigaOm. He’s the coauthor of Programming Microsoft SQL Server 2012 and a Microsoft tech influencer, and he advises data and analytics ISVs on winning in the market, solution providers on their service offerings, and customers on their analytics strategy. Andrew is an entrepreneur, a consulting veteran, a former research director, and a current Microsoft Data Platform MVP.

Presentations

Data governance: A big job that's getting bigger Session

Data governance has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging, and new products are rising to meet the challenge. Andrew Brust tracks data governance's past and present and offers a glimpse of the future.

Andrew Burt is a managing partner at bnh.ai, a boutique law firm focused on AI and analytics, and chief legal officer at Immuta. He is also a visiting fellow at Yale Law School’s Information Society Project. Previously, Andrew served as special advisor for policy to the head of the Federal Bureau of Investigation’s Cyber Division, where he was lead author on the FBI’s after-action report for the 2014 attack on Sony.

A leading authority on the intersection between law and technology, Andrew has published articles in The New York Times, The Financial Times, and Harvard Business Review, where he is a regular contributor.

Andrew is a term member of the Council on Foreign Relations, a member of the Washington, D.C., and Virginia State Bars, and a certified cyber incident response handler. He holds a JD from Yale Law School and a BA with first-class honors from McGill University.

Presentations

Beyond explainability: Regulating machine learning in practice Session

Machine learning is becoming prevalent across industries, creating new types of risk. Managing this risk is quickly becoming the central challenge of major organizations, one that strains data science teams, legal personnel, and the C-suite alike. Andrew Burt shares lessons from past regulations focused on similar technology along with a proposal for new ways to manage risk in ML.

Alen Capalik is the founder and CEO of FASTDATA.io, producer of Plasma Engine, the first GPU-native software to fully leverage NVIDIA GPUs and Apache Arrow for real-time processing of infinite data in motion. As the chief architect of a cybersecurity software company, Alen was constantly hindered by CPU-bound data processing Java software like Spark and Hadoop. After exiting cybersecurity, he started experimenting with processing streaming data through GPUs instead of CPUs and saw amazing potential to process data much faster and more efficiently, which inspired him to found FASTDATA.io.

Presentations

GPU-accelerated analytics and machine learning ecosystems (Inception Showcase sponsored by NVIDIA) Session

Explore case studies from Datalogue, FASTDATA.io, and H2O.ai that demonstrate how GPU-accelerated analytics, machine learning, and ETL help companies overcome slow queries and tedious data preparation processes, dynamically correlate data, and enjoy automatic feature engineering.

Michelle Casbon is a senior engineer on the Google Cloud Platform developer relations team, where she focuses on open source contributions and community engagement for machine learning and big data tools. Michelle’s development experience spans more than a decade and has primarily focused on multilingual natural language processing, system architecture and integration, and continuous delivery pipelines for machine learning applications. Previously, she was a senior engineer and director of data science at several San Francisco-based startups, building and shipping machine learning products on distributed platforms using both AWS and GCP. She especially loves working with open source projects and is a contributor to Kubeflow. Michelle holds a master’s degree from the University of Cambridge.

Presentations

Kubeflow explained: Portable machine learning on Kubernetes Session

Michelle Casbon demonstrates how to build a machine learning application with Kubeflow. Kubeflow makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. Join Michelle to find out what Kubeflow currently supports and the long-term vision for the project.

Amber Case studies the interaction between humans and computers and how our relationship with information is changing the way cultures think, act, and understand their worlds. She is currently a fellow at Harvard University’s Berkman Klein Center for Internet and Society and a visiting researcher at the MIT Center for Civic Media. Previously, she was the cofounder and CEO of location-based software company Geoloqi (acquired by Esri in 2012). Amber is the author of Calm Technology: Design for the Next Generation of Devices. She spoke about the future of the interface in SXSW 2012’s keynote address, and her TED talk, “We are all cyborgs now,” has been viewed over a million times. Named one of National Geographic’s “emerging explorers,” she’s also been listed among Inc. magazine’s “30 under 30” and featured among Fast Company’s “most influential women in technology.” You can follow her on Twitter @caseorganic and learn more at Caseorganic.com and Medium.

Presentations

Sound design and the future of experience Keynote

Amber Case outlines several methods that product designers and managers can use to improve everyday interactions through an understanding and application of sound design.

Sarah Catanzaro is a principal at Amplify Partners, where she focuses on investing in high-potential startups that leverage machine intelligence and high-performance computing to solve real-world problems. Previously, Sarah co-led investments in Kinetica, Platform9, and Fluxx at Canvas Ventures. Sarah has several years of experience in developing data acquisition strategies and leading machine and deep learning-enabled product development at organizations of various sizes: As head of data at Mattermark, she led a team to collect and organize information on over one million private companies; as a consultant at Palantir and as an analyst at Cyveillance, she implemented analytics solutions for municipal and federal agencies; and as a program manager at the Center for Advanced Defense Studies, she directed projects on adversary behavioral modeling and Somali pirate network analysis. Sarah holds a BA in international security studies from Stanford University.

Presentations

VC trends in machine learning and data science Session

In this panel discussion, venture capital investors explain how startups can accelerate enterprise adoption of machine learning and explore the new tech trends that will give rise to the next transformation in the big data landscape.

William Chambers is a product manager at Databricks, where he works on Structured Streaming and data science products. He is lead author of Spark: The Definitive Guide, coauthored with Matei Zaharia. Bill also created SparkTutorials.net as a way to teach Apache Spark basics. Bill holds a master’s degree in information management and systems from UC Berkeley’s School of Information. During his time at school, Bill was also creator of the Data Analysis in Python with pandas course for Udemy and cocreator of and first instructor for Python for Data Science, part of UC Berkeley’s master’s in data science program.

Presentations

Streaming big data in the cloud: What to consider and why Session

Streaming big data is a rapidly growing field but currently involves a lot of operational complexity and expertise. Bill Chambers shares a decision making framework for determining the best tools and technologies for successfully deploying and maintaining streaming data pipelines to solve business problems and offers an overview of Apache Spark’s Structured Streaming processing engine.

Mark Chan is a hacker and data scientist at H2O.ai. Previously, he was a quantitative research developer at Thomson Reuters and Nipun Capital and a data scientist at an IoT startup, where he built a web-based machine learning platform and developed predictive models. Mark holds an MS in financial engineering from UCLA and a BS in computer engineering from the University of Illinois Urbana-Champaign. In his spare time, Mark likes competing on Kaggle and cycling.

Presentations

Practical techniques for interpreting machine learning models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. Patrick Hall, Avni Wadhwa, and Mark Chan share practical and productizable approaches for explaining, testing, and visualizing machine learning models using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.
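
As one concrete, hedged example of the kind of technique in scope, the sketch below fits a shallow, human-readable decision tree as a surrogate for a more complex XGBoost model and exports it for GraphViz. The data and parameters are illustrative, not the presenters' actual code.

    # Hypothetical surrogate-model sketch: approximate a complex model with
    # an interpretable one trained on the complex model's own predictions.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier, export_graphviz
    import xgboost as xgb

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    model = xgb.XGBClassifier(n_estimators=100).fit(X, y)

    # A depth-3 tree trained to mimic the XGBoost model's decisions.
    surrogate = DecisionTreeClassifier(max_depth=3).fit(X, model.predict(X))
    export_graphviz(surrogate, out_file="surrogate.dot")  # render via GraphViz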

Samuel Chance retired from the US Navy in 2006 and has since served as a senior systems engineer and architect. He completed his master’s thesis in semantic web technologies in 2003 and has nearly 35 years of experience in electronics, communications, and distributed systems design, development, and lifecycle support, with broad technical expertise in service-oriented architecture, semantics-based computing, intelligent agents, and knowledge management systems. He has keynoted or presented at myriad public US and international engagements, such as the Semantic Technologies Conference, International Standards Organization open forums, the Object Management Group, and the DoDIIS Worldwide Conference. Samuel worked as a technology consultant on various national security programs before joining Cambridge Semantics. He currently works closely with the Cambridge Semantics sales and engineering teams to accurately define and communicate the value of the company’s Anzo platform to the marketplace and its growing roster of customers, while also architecting customized solutions for their environments.

Presentations

From data lakes to the data fabric: Our vision for digital strategy (sponsored by Cambridge Semantics) Session

Ben Szekely shares a vision for digital innovation: The data fabric connects enterprise data for unprecedented access in an overlay fashion that does not disrupt current investments. Interconnected and reliable data drives business outcomes by automating scalable AI and ML efforts. Graph technology is the way forward to realize this future.

Vinoth Chandar is the cocreator of the Hudi project at Uber and a PMC member and lead of Apache Hudi (Incubating). Previously, he was a senior staff engineer at Uber, where he led projects across various technology areas, like data infrastructure, data architecture, and mobile and network performance; was the LinkedIn lead on Voldemort; and worked on Oracle Server’s replication engine, HPC, and stream processing. Vinoth has a keen interest in unified architectures for data analytics and processing.

Presentations

Hudi: Unifying storage and serving for batch and near-real-time analytics Session

Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

Manna Chang is senior data scientist at Optum Enterprise Analytics, where she plays a leading role in providing and developing innovative technologies and methods to meet customer needs and answer healthcare-related challenges. Her experience includes applying machine learning techniques in drug discovery and genomic outcome studies. Manna holds a PhD in biochemistry and an MS in statistics. She loves sci-fi movies and enjoys hiking.

Presentations

Breaking the rules: End-stage renal disease prediction Session

Olga Cuznetova and Manna Chang demonstrate supervised and unsupervised learning methods to work with claims data and explain how the methods complement each other. The supervised method looks at CKD patients at risk of developing end-stage renal disease (ESRD), while the unsupervised approach looks at the classification of patients that tend to develop this disease faster than others.
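
As a rough illustration of how the two approaches can complement each other (a toy sketch on synthetic data, not the Optum pipeline), a supervised model scores individual risk while an unsupervised clustering surfaces patient groups to profile:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(500, 10)       # stand-in for engineered claims features
    y = (X[:, 0] > 0.7).astype(int)   # stand-in label: progressed to ESRD

    risk_model = LogisticRegression().fit(X, y)                # supervised risk
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # patient groups

    # Profile clusters by average predicted risk to spot fast progressors.
    for c in range(3):
        print(c, risk_model.predict_proba(X[clusters == c])[:, 1].mean())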

Andrew Chen is a software engineer at Databricks and an MLflow committer. At Databricks, Andrew works on tools that simplify the end-to-end experience of machine learning, from data ETL to model training and deployment. Before joining Databricks, Andrew received his BS in EECS from UC Berkeley in 2016. While in school, he briefly worked on search quality at Pinterest and search engine marketing at Groupon.

Presentations

MLflow: An open platform to simplify the machine learning lifecycle Session

Successfully building and deploying a machine learning model is difficult to do once. Enabling other data scientists to reproduce your pipeline, compare the results of different versions, track what's running where, and redeploy and rollback updated models is much harder. Mani Parkhe and Andrew Chen offer an overview of MLflow—a new open source project from Databricks that simplifies this process.
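
For context, MLflow's tracking API is small enough to show in a few lines. This minimal sketch (not from the session) logs the parameters, metrics, and artifacts that make a run reproducible and comparable:

    import mlflow

    with mlflow.start_run():
        # Parameters and metrics become searchable, comparable run metadata.
        mlflow.log_param("n_estimators", 100)
        mlflow.log_metric("rmse", 0.73)
        # Artifacts (models, plots, notes) are versioned alongside the run.
        with open("notes.txt", "w") as f:
            f.write("baseline model")
        mlflow.log_artifact("notes.txt")
    # Browse and compare runs locally with: mlflow ui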

Danny Chen is a software engineer on the Hadoop platform team at Uber, where he works on large-scale data ingestion and dispersal pipelines and libraries leveraging Apache Spark. Previously, he was the tech lead at Uber Maps building data pipelines to produce metrics to help analyze the quality of mapping data. Before joining Uber, Danny was at Twitter and an original member of the core team building Manhattan, a key-value store powering Twitter’s use cases. Danny holds a BS in computer science from UCLA and an MS in computer science from USC.

Presentations

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework Session

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores and take a deep dive into the architecture to see how it all works.

Felix Cheung is an engineer at Uber and a PMC and committer for Apache Spark. Felix started his journey in the big data space about five years ago with the then state-of-the-art MapReduce. Since then, he’s (re-)built Hadoop clusters from metal more times than he would like, created a Hadoop distro from two dozen or so projects, and juggled hundreds to thousands of cores in the cloud or in data centers. He built a few interesting apps with Apache Spark and ended up contributing to the project. In addition to building stuff, he frequently presents at conferences, meetups, and workshops. He was also a teaching assistant for the first set of edX MOOCs on Apache Spark.

Presentations

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber Session

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.

Kapil Chhabra is a senior product manager at Amazon Web Services, focusing on real-time machine learning on high-volume and high-velocity data. He also runs the streaming data ingestion business at AWS, Kinesis Data Firehose. Previously, he led the analytics business at Akamai Technologies and launched and scaled multiple new products, including real-time video monitoring services (Media Analytics and QoS Monitor) and the award-winning broadcast operations as a service (BOCC).

Presentations

Continuous machine learning over streaming data: The story continues. Session

Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams.
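
The sliding-window pattern behind RRCF scoring can be sketched with the open source rrcf Python package, used here as a community stand-in for the managed AWS implementation:

    import numpy as np
    import rrcf

    num_trees, tree_size = 40, 256
    forest = [rrcf.RCTree() for _ in range(num_trees)]

    for i, point in enumerate(np.random.randn(1000, 2)):
        score = 0.0
        for tree in forest:
            if len(tree.leaves) > tree_size:
                tree.forget_point(i - tree_size)   # slide the window forward
            tree.insert_point(point, index=i)
            score += tree.codisp(i)                # collusive displacement
        score /= num_trees                         # high score => likely anomaly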

Anant Chintamaneni is vice president of products at BlueData, where he is responsible for product management and focuses on helping enterprises deploy big data technologies such as Hadoop and Spark. Anant has more than 15 years’ experience in business intelligence, advanced analytics, and big data infrastructure. Previously, Anant led the product management team for Pivotal’s big data suite.

Presentations

What's the Hadoop-la about Kubernetes? Session

Kubernetes (K8s)—the open source container orchestration system for modern big data workloads—is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for big data services for K8s.

Erin Coffman is the most tenured data scientist at Airbnb; she currently works for the Human team, which has a mission to house people in need, including evacuees of disasters and refugees. Erin has led data science and analytics initiatives across the company, including work with customer experience, legal, communications, and public policy. In 2016, she cofounded Data University, a company-wide data training program, in which over a quarter of the company has participated. Previously, she worked in education consulting and program management in Washington, DC. Erin holds a PhD in economics from Georgia State University and a BA in mathematics education and economics from Anderson University (IN). Erin is a proud Airbnb superhost, having welcomed nearly 1,000 guests since 2011. In her spare time she enjoys traveling, reading, pub trivia, and golfing.

Presentations

Data University: How Airbnb democratized data Session

Airbnb has open-sourced many high-leverage data tools, including Airflow, Superset, and the Knowledge Repo, but adoption of these tools across the company was relatively low. Erin Coffman offers an overview of Data University, launched to make data more accessible and utilized in decision making at Airbnb.

Selwyn Collaco is chief data officer of the TMX Group, where he is responsible for defining and implementing an enterprise data strategy and the capabilities to leverage data across the enterprise for business enablement and monetization. A results-driven, visionary leader with an entrepreneurial spirit, Selwyn has 20 years of IT delivery experience in capital markets, risk management, cash management, business intelligence, data warehousing, data management, data governance, and CRM. He has worked at BMO, Pepsi, and CIBC, where he was the chief data officer leading the enterprise client data management and strategy portfolio and served within the Capital Markets Technology Division. He holds an MBA from the Richard Ivey School of Business.

Presentations

Governing your cloud-based enterprise data lake (sponsored by Zaloni) Session

Selwyn Collaco and Ben Sharma share insights from their real-world experience, discussing best practices for architecture, technology, data management, and governance to enable centralized data services and explaining how to leverage the Zaloni Data Platform (ZDP), an integrated self-service data platform, to operationalize the enterprise data lake.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, he was a data scientist at TIBCO and a statistical software developer at AMD. Ian is a cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow 1-Day Training

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.
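
The course's central idea, that the abstractions carry over even when syntax differs, shows up clearly when one group-and-aggregate is expressed in two of the covered systems (a toy sketch, not course material):

    import pandas as pd
    from pyspark.sql import SparkSession

    rows = [("a", 1), ("a", 2), ("b", 5)]

    # pandas: group by a key and average a column.
    pdf = pd.DataFrame(rows, columns=["key", "val"])
    print(pdf.groupby("key")["val"].mean())

    # PySpark: the same concept, different surface syntax.
    spark = SparkSession.builder.appName("abstractions").getOrCreate()
    spark.createDataFrame(rows, ["key", "val"]).groupBy("key").avg("val").show()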

Lawrence Cowan is a partner and advanced analytics practice leader at the Cicero Group, where he has spent the last decade building the company’s analytics practice and helping Fortune 500 firms solve real business challenges with data, including attrition, segmentation, sales prioritization, pricing, and customer satisfaction. He leads the firm’s predictive analytics and big data engagements, applying Cicero’s deep expertise in strategy execution to ensure data delivers ROI, and has partnered with companies to shift from reactive to predictive analytics by collecting and analyzing real-time information and distributing it across the organization, allowing management to make better, faster decisions that move the business forward. Lawrence is a frequent speaker and thought leader in the advanced analytics space at events such as Predictive Analytics World for Business and Workforce and the Global Big Data Conference, and he serves as chairperson for the Data Analytics Leaders Event, where data chiefs and BI and analytics function heads come together to explore accelerating the path of data to value. His views and recommendations on big data and advanced analytics have been published in CIO Review and Predictive Analytics Times. Lawrence holds an MS in predictive analytics from Northwestern University, an MBA with an emphasis in business economics from Westminster College, and a BA from Brigham Young University.

Presentations

Realizing the true value in your data: Data-drivenness assessment Session

Firms are struggling to leverage their data. Lawrence Cowan outlines a methodology for assessing four critical areas that firms must consider when looking to make the analytical leap: data strategy, data culture, data analysis and implementation, and data management and architecture.

Brian Coyne is the business owner of PNC’s big data ecosystem leveraging Cloudera’s Hadoop platform. Brian also cochairs PNC’s Analytics Competency Center, whose mission is to maximize the value of PNC’s enterprise data assets through improved and optimized analytical practices. Previously, he managed and delivered a business intelligence solution for retail banking and marketing and managed the delivery of regulatory reporting solutions and wholesale risk rating solutions on the risk technology team. In his 23 years of delivering data solutions, Brian has worked in a wide range of business sectors, including retail, manufacturing, and finance at both the national and international levels. Brian is based out of the Cleveland, OH, area, where he lives with his wife and two children. Outside of work, Brian volunteers on committees for his local church and youth sports associations.

Presentations

The future of data warehousing Keynote

Data volumes don’t translate to business value. What matters is your data platform’s ability to support unprecedented numbers of business users and use cases. Anupam Singh and Brian Coyne look at some of the challenges posed by data-hungry organizations and share new techniques to extract meaningful insights at the speed of today’s modern business.

Dan Crankshaw is a PhD student in the CS Department at UC Berkeley, where he works in the RISELab. After cutting his teeth doing large-scale data analysis on cosmology simulation data and building systems for distributed graph analysis, Dan has turned his attention to machine learning systems. His current research interests include systems and techniques for serving and deploying machine learning, with a particular emphasis on low-latency and interactive applications.

Presentations

Model serving and management at scale using open source tools Tutorial

Dan Crankshaw offers an overview of the current challenges in deploying machine learning applications into production and the current state of prediction-serving infrastructure. He then leads a deep dive into the Clipper serving system and shows you how to get started.
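
Clipper's quickstart gives a sense of the workflow the tutorial digs into. The sketch below is adapted freely from the project's documented admin API, with invented application and model names; it deploys a Python function behind a REST endpoint via local Docker:

    from clipper_admin import ClipperConnection, DockerContainerManager
    from clipper_admin.deployers import python as python_deployer

    conn = ClipperConnection(DockerContainerManager())
    conn.start_clipper()  # launches Clipper's containers via local Docker
    conn.register_application(name="hello", input_type="doubles",
                              default_output="-1.0", slo_micros=100000)

    def predict(inputs):
        # Any Python closure can be packaged and served.
        return [str(sum(x)) for x in inputs]

    python_deployer.deploy_python_closure(conn, name="sum-model", version=1,
                                          input_type="doubles", func=predict)
    conn.link_model_to_app(app_name="hello", model_name="sum-model")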

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Findata welcome Tutorial

Program chairs Alistair Croll and Robert Passarella welcome you to Findata Day.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Umur Cubukcu is cofounder and CEO of Citus Data, a leading Postgres company whose mission is to make it so companies never have to worry about scaling their relational database again. Focusing on both operations and strategy, Umur works directly with technical founders at SaaS companies to help them scale their multitenant applications and with enterprise leaders to power real-time apps that need to handle large-scale data. Umur’s team at Citus Data is active in the Postgres community, sharing expertise and contributing key components and extensions. Citus Data open-sourced its distributed database extension for PostgreSQL in early 2016. Umur has over 15 years of experience driving complex enterprise software, IT, and database initiatives at large enterprises and startups, and he has a deep interest in how scalable systems of record and systems of engagement can help businesses grow. He holds a master’s degree in management science and engineering from Stanford University.

Presentations

The state of Postgres Session

PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you’ll learn how PostgreSQL’s extension APIs are fueling innovations in relational databases.

Nick Curcuru is vice president of enterprise information management at Mastercard, where he’s responsible for leading a team that works with organizations to generate revenue through smart data, architect next-generation technology platforms, and protect data assets from cyberattacks by leveraging Mastercard’s information technology and information security resources and creating peer-to-peer collaboration with their clients. Nick brings over 20 years of global experience successfully delivering large-scale advanced analytics initiatives for such companies as the Walt Disney Company, Capital One, Home Depot, Burlington Northern Railroad, Merrill Lynch, Nordea Bank, and GE. He frequently speaks on big data trends and data security strategy at conferences and symposiums, has published several articles on security, revenue management, and data security, and has contributed to several books on the topic of data and analytics.

Presentations

GDPR and the Australian Privacy Act: Forcing the legal and ethical hands of companies that collect, use, and analyze data Findata

Data—in part, harvested personal data—brings industries unprecedented insights about customer behavior. We know more about our customers and neighbors than at any other time in history, but we need to avoid crossing the "creepy" line. Laura Eisenhardt discusses how ethical behavior drives trust, especially in today's IoT age.

Paul Curtis is a principal solutions architect at Weaveworks. Previously, he was a principal engineer at MapR.

Presentations

Clouds and containers: Case studies for big data Session

Once the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure to provide the business insights? Paul Curtis explores three customer deployments that leverage the best of the private clouds and containers to provide a flexible big data environment.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Olga Cuznetova is a data science team lead at Optum Enterprise Analytics, where she guides junior team members on their projects and helps implement data science solutions that address healthcare business needs. Currently, her projects focus mostly on building disease progression and clinical operations models; examples include predicting high-cost diabetic patients, predicting the progression to end-stage renal disease, implementing a substance abuse disorder model using external client data, and predicting medical prior authorization outcomes. Previously, Olga completed a one-year technology development program focused on the development of essential technical skills, healthcare business acumen, and an analytical skill set, which led her to choose a data science career path. Olga holds a BS in finance from Central Connecticut State University. When she has a spare moment, you can find her traveling both in the United States and abroad.

Presentations

Breaking the rules: End-stage renal disease prediction Session

Olga Cuznetova and Manna Chang demonstrate supervised and unsupervised learning methods to work with claims data and explain how the methods complement each other. The supervised method looks at CKD patients at risk of developing end-stage renal disease (ESRD), while the unsupervised approach looks at the classification of patients that tend to develop this disease faster than others.

Michelangelo D’Agostino is the vice president of data science and engineering at ShopRunner, where he leads a team that develops statistical models and writes software that leverages their unique cross-retailer ecommerce dataset. Previously, Michelangelo led the data science R&D team at Civis Analytics, a Chicago-based data science software and consulting company that spun out of the 2012 Obama reelection campaign, and was a senior analyst in digital analytics with the 2012 Obama reelection campaign, where he helped to optimize the campaign’s email fundraising juggernaut and analyzed social media data. Michelangelo has been a mentor with the Data Science for Social Good Fellowship. He holds a PhD in particle astrophysics from UC Berkeley and got his start in analytics sifting through neutrino data from the IceCube experiment. Accordingly, he spent two glorious months at the South Pole, where he slept in a tent salvaged from the Korean War and enjoyed the twice-weekly shower rationing. He’s also written about science and technology for the Economist.

Presentations

The care and feeding of data scientists: Concrete tips for retaining your data science team Session

Data scientists are hard to hire. But too often, companies struggle to find the right talent only to make avoidable mistakes that cause their best data scientists to leave. From org structure and leadership to tooling, infrastructure, and more, Michelangelo D'Agostino shares concrete (and inexpensive) tips for keeping your data scientists engaged, productive, and adding business value.

Zavain Dar is a principal at Lux Capital. Zavain is driven by smart software, leveraging data and machine intelligence to scale, augment, and balance human intelligence. He invests in companies that are using machine learning and AI to augment or replace physical-world functions, including biology, language, manufacturing, and analysis. He looks for entrepreneurs that can use software and data to hone a philosophical position on where the world is and how to direct it for the better. Zavain has led Lux’s investments in Primer, a machine intelligence startup; Clarifai, which democratizes cutting-edge deep neural networks; Capella, which is developing novel medicines based on computational insight applied to genomic data; Recursion, which uses automation and deep learning to develop drugs for rare diseases; Tempo Automation, which applies software and automation to electronics manufacturing; Rigetti Computing, which is fabricating some of the fastest quantum chips in the world; Visor, which aims to simplify tax preparation; and Blockstack, which builds architectures to decentralize current winner-take-all centralized web components. Previously, Zavain was a founder and computer scientist. At Discovery Engine (acquired by Twitter), he engineered machine learning and AI systems across a proprietary distributed computing framework to build web-scale ranking algorithms. Zavain was also a cofounder of Fountainhop, one of the first hyperlocal social networks. Zavain holds a BS in symbolic systems and an MS in computer science from Stanford, where he was a researcher in Stanford’s AI Lab. He is currently a lecturer at Stanford and has taught quarter-long seminars on cryptocurrencies, artificial intelligence and philosophy, and venture capital.

Presentations

VC trends in machine learning and data science Session

In this panel discussion, venture capital investors explain how startups can accelerate enterprise adoption of machine learning and explore the new tech trends that will give rise to the next transformation in the big data landscape.

Milene Darnis is a data product manager at Uber, focusing on building a world-class experimentation platform. Previously, she was a data engineer at Uber, where she modeled core datasets, and a business intelligence engineer at a mobile gaming company. Milene is passionate about linking data to concrete business problems. She holds a master’s degree in engineering from Telecom ParisTech, France.

Presentations

A/B testing at Uber: How we built a BYOM (bring your own metrics) platform Session

Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis explains how the team built a scalable and self-serve platform that lets users plug in any metric to analyze.
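
One way to picture "bring your own metrics" is a plug-in contract in which the platform owns the experiment mechanics and users supply only a metric function. The sketch below is entirely hypothetical (invented names, with a Welch's t-test standing in as the analysis step) and is not Uber's implementation:

    from scipy import stats

    def analyze(control_rows, treatment_rows, metric):
        a, b = metric(control_rows), metric(treatment_rows)
        _, p = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
        lift = (sum(b) / len(b)) / (sum(a) / len(a)) - 1
        return {"lift": lift, "p_value": p}

    # A user-supplied metric: trips per rider, computed from raw event rows.
    def trips_per_rider(rows):
        return [r["trips"] for r in rows]

    control = [{"trips": t} for t in [2, 3, 1, 4, 2]]
    treatment = [{"trips": t} for t in [3, 4, 2, 5, 3]]
    print(analyze(control, treatment, trips_per_rider))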

Dhritiman Dasgupta (aka DD) is the vice president of marketing at Cisco, where he heads product and solutions marketing for Cisco’s flagship data center platforms, including servers (UCS), switches (Nexus), storage (MDS), and SDN (ACI). DD brings nearly two decades of experience building and leading high-performance product management and marketing teams at high-tech firms in a wide variety of roles, including software development, product management, business development, and corporate marketing. Previously, DD was the vice president of marketing at Avi Networks, where he launched the company out of stealth mode and generated millions of dollars of pipeline revenue from scratch; a vice president of product and technical marketing for data center switching, routing, and security at Juniper Networks; and a software architect at Nortel Networks in Canada. In his spare time he plays the bass guitar for several bands in the Bay Area and is an avid golfer.

Presentations

AI, ML, and the IoT will destroy the data center and the cloud (just not in the way you think) (sponsored by Cisco) Keynote

DD Dasgupta explores the exciting development of the edge-cloud continuum, which is redefining business models and technology strategies while creating a vast array of new applications that will power the digital age. The continuum is also destroying what we know about the centralized data centers and cloud computing infrastructures that were so vital to the success of the previous computing eras.

Kyle Davis is the head of developer advocacy at Redis Labs. Kyle has been writing software since the age of 6 and enjoys coding in Node.js, Rust, and occasionally Python and OCaml. A database enthusiast, he is interested in in-memory databases and compute/storage fusions, and his research time revolves around probabilistic data structures and full-text search. Kyle holds a bachelor’s degree from the University of Southern Indiana and a master’s degree from the University of Central Missouri. He lives in Edmonton, Alberta, Canada.

Presentations

Redis for velocity and volume: Fast data ingest and probabilistic data structures (sponsored by Redis Labs) Session

Kyle Davis explains how Redis can be used for ingesting high-velocity data from large-scale platforms and IoT data collections as well as for storing and querying data using probabilistic data structures that trade some precision for both higher speed and lower storage requirements. Along the way, Kyle shares examples and a demo of the solution.
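
Both ideas are easy to demonstrate with redis-py. The HyperLogLog commands are built into Redis; the Bloom filter commands assume the RedisBloom module is loaded. This is an illustrative sketch, not the talk's demo:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # HyperLogLog: approximate distinct counts in ~12 KB, ~0.81% error.
    r.pfadd("visitors", "u1", "u2", "u3", "u1")
    print(r.pfcount("visitors"))  # approximately 3

    # Bloom filter via RedisBloom: set membership with no false negatives.
    r.execute_command("BF.ADD", "seen-ids", "device-42")
    print(r.execute_command("BF.EXISTS", "seen-ids", "device-42"))  # 1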

Tim Davis is the executive director of the Lab Services Technical Architecture Group for IBM Software Group focused on Information Management (IM), InfoSphere, and Information On Demand (IOD) worldwide. In this position, Tim directs customer-facing enterprise technical architecture engagements and IM product-solution deployments for customers worldwide. It is Tim’s mission to drive rapid product adoption through the advancement of industry-leading architectures and deployment roadmaps in master data management (MDM), enterprise data warehousing (EDW), ERP/SAP, high-performance computing, and enterprise content management. Tim is also one of the leading content providers for all IBM IM-InfoSphere-IOD curriculums, best practices, and methodologies and the founder of the IBM Center of Excellence for Data Integration. He recently led the development and launch of IBM’s Information Grid, MDM Server Rapid Deployment, and IBM’s SAP deployment accelerators. Tim joined IBM via Ascential Software, acquired by IBM in 2005.

Presentations

Guidebook to unwind the enterprise "data hairball" and get ready for AI (sponsored by IBM) Session

Tim Davis discusses key pain points and solutions to problems many enterprises face with data in silos, poor-quality data that cannot always be trusted, and managing and making large volumes of data available to derive more accurate insights and machine learning models.

Bryan Dean is director of business development at Red Hat, where he leads the company’s global initiatives and solutions for the internet of things (IoT) within Red Hat’s Global Partner Solutions Group. As the business development lead for the IoT, Bryan is responsible for solution development, partnerships, and go-to-market initiatives. Working extensively with key strategic partners Cloudera and Eurotech, Bryan has led the effort to develop and promote an end-to-end enterprise architecture for the IoT built on open source technologies; the architecture was awarded IoT Infrastructure of the Year at Computing’s Big Data Awards in May 2018. Previously, Bryan held leadership positions at NetApp and Hewlett-Packard Software. His background includes product management, marketing, alliances, and strategic business planning. Bryan is originally from the San Francisco Bay Area but has called Fort Collins, Colorado, home for the last 20 years.

Presentations

Using machine learning to drive intelligence at the edge Session

The focus on the IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing them back out to the edge. Dave Shuman and Bryan Dean explain how Cloudera and Red Hat executed this architecture at one of Europe's leading manufacturers, along with a demo highlighting this architecture.

Kaushik Deka is a partner and CTO at Novantas, where he is responsible for technology strategy and R&D roadmap of a number of cloud-based platforms. He has more than 15 years’ experience leading large engineering teams to develop scalable, high-performance analytics platforms. Kaushik holds an MS in computer science from the University of Missouri, an MS in engineering from the University of Pennsylvania, and an MS in computational finance from Carnegie Mellon University.

Presentations

Case study: A Spark-based distributed simulation optimization architecture for portfolio optimization in retail banking Session

Kaushik Deka and Ted Gibson share a large-scale optimization architecture in Spark for a consumer product portfolio optimization use case in retail banking. The architecture combines a simulator that distributes computation of complex real-world scenarios and a constraint optimizer that uses business rules as constraints to meet growth targets.
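
The general pattern, distributing many independent scenario simulations and then optimizing over the aggregated results, can be sketched in a few lines of PySpark. The scenario logic and the growth constraint below are placeholders, not Novantas’s models:

    import random
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sim-opt-sketch").getOrCreate()
    sc = spark.sparkContext

    def simulate(rate):
        # Placeholder: expected portfolio growth under a candidate pricing rate.
        random.seed(int(rate * 1000))
        return rate, sum(random.gauss(rate * 100, 5) for _ in range(1000)) / 1000

    candidate_rates = [i / 100 for i in range(1, 51)]
    results = sc.parallelize(candidate_rates).map(simulate).collect()

    # Business rule as a constraint: meet a growth target at the lowest rate.
    feasible = [(rate, growth) for rate, growth in results if growth >= 2.0]
    print(min(feasible) if feasible else "no feasible rate")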

Tim Delisle is the cofounder and CEO of Datalogue, the AI-powered data infrastructure for the enterprise. Tim’s obsession with data stems from too many underslept and overcaffeinated nights working on data problems, inspiring him to cofound Datalogue. Tim holds a BS in cell biology and anatomy from McGill University, an MSc in computer and information science from Cornell Tech, and an MSc in applied information sciences from the Technion-Israel Institute of Technology.

Presentations

GPU-accelerated analytics and machine learning ecosystems (Inception Showcase sponsored by NVIDIA) Session

Explore case studies from Datalogue, FASTDATA.io, and H2O.ai that demonstrate how GPU-accelerated analytics, machine learning, and ETL help companies overcome slow queries and tedious data preparation processes, dynamically correlate data, and benefit from automatic feature engineering.

Ifi Derekli is a senior solutions engineer at Cloudera, focusing on helping large enterprises solve big data problems using Hadoop technologies. Her subject-matter expertise is around security and governance, a crucial component of every successful production big data use case. Previously, Ifi was a presales technical consultant at Hewlett Packard Enterprise, where she provided technical expertise for Vertica and IDOL (currently part of Micro Focus). She holds a BS in electrical engineering and computer science from Yale University.

Presentations

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments, step by step Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR.

Damien Desfontaines protects personal data for a living. He’s a privacy engineer at Google, where he builds scalable anonymization tools, conducts privacy reviews, and translates high-level policies into technical best practices, and a doctoral researcher at ETH Zürich, focusing on differential privacy. He also sometimes popularizes academic definitions of privacy on his blog.

Presentations

Protecting sensitive data in huge datasets: Cloud tools you can use Session

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa and Damien Desfontaines explore how to handle massive public datasets, taking you from theory to real life as they showcase newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm.
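
The two anonymity concepts the session names are simple to state in code. In this toy pandas sketch (invented data, not the presenters' tooling), k-anonymity asks that every quasi-identifier combination be shared by at least k records, and l-diversity additionally asks for at least l distinct sensitive values per group:

    import pandas as pd

    df = pd.DataFrame({
        "zip":       ["10001", "10001", "10001", "94110"],
        "age":       [34, 34, 34, 61],
        "diagnosis": ["A", "B", "A", "C"],   # sensitive attribute
    })

    quasi_identifiers = ["zip", "age"]
    k = df.groupby(quasi_identifiers).size().min()
    print(f"{k}-anonymous")   # k=1 here: the lone 94110 row is re-identifiable

    l = df.groupby(quasi_identifiers)["diagnosis"].nunique().min()
    print(f"{l}-diverse")     # low l warns of homogeneous, leaky groups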

Srikanth “Sri” Desikan is the senior director of product management for the Big Data Cloud platform at Oracle, where he drives product management for a scale-out OLAP BI service running natively on Apache Spark, designed for high-performance data lake-based analytics. Previously, he was the cofounder and CEO of SparklineData (acquired by Oracle), where the technology was originally developed; was vice president of products at IoT analytics company Glassbeam; and headed the analytics infrastructure team for data warehousing and data science at Disney Interactive Media Group. Sri has over 20 years of experience in data management and analytics at companies like Siebel and Agile and has also worked at startups delivering advertising and gaming analytics.

Presentations

Interactive business intelligence and OLAP on big data lakes using a Spark-native fast data mart (sponsored by Oracle + DataScience.com) Session

SparklineData is an in-memory distributed scale-out analytics platform built on Apache Spark to enable enterprises to query on data lakes directly with instant response times. Srikanth Desikan offers an overview of SparklineData and explains how it can enable new analytics use cases working on the most granular data directly on data lakes.

Aatif Din is an architect of data science and engineering at Fanatics. Previously, he worked as an engineer at other retail technology companies, including Groupon and eBay. Aatif holds a degree in mathematics and computing science from the University of Illinois.

Presentations

Leveraging the best of the past to power a better future (sponsored by MemSQL) Keynote

Today’s successful businesses utilize data better than their competitors; however, data sprawl and inefficient data infrastructure restrict what’s possible. Blending the best of the past with the software innovations of today will solve future data challenges. Drew Paroski shares how to develop modern database applications without sacrificing cost savings, data familiarity, and flexibility.

Ding Ding is a senior software engineer on Intel’s big data technology team, where she works on developing and optimizing distributed machine learning and deep learning algorithms on Apache Spark, focusing particularly on large-scale analytical applications and infrastructure on Spark.

Presentations

A deep learning approach for precipitation nowcasting with RNN using Analytics Zoo on BigDL Session

Precipitation nowcasting is used to predict the future rainfall intensity over a relatively short timeframe. The forecasting resolution and time accuracy required are much higher than for other traditional forecasting tasks. Alexander Heye and Ding Ding explain how to build a precipitation nowcasting system with recurrent neural networks using BigDL on Apache Spark.
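
The session builds its network with BigDL on Spark; purely to show the model shape common in nowcasting (a stack that maps a sequence of radar frames to the next frame), here is an illustrative ConvLSTM in Keras:

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Conv2D, ConvLSTM2D

    model = Sequential([
        # Input: 10 time steps of 64x64 single-channel radar frames.
        ConvLSTM2D(32, kernel_size=3, padding="same", return_sequences=False,
                   input_shape=(10, 64, 64, 1)),
        # Predict next-frame rainfall intensity per pixel.
        Conv2D(1, kernel_size=1, padding="same", activation="relu"),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.summary()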

Harish Doddi is cofounder and CEO of Datatron. Previously, he held roles at Oracle; Twitter, where he worked on open source technologies, including Apache Cassandra and Apache Hadoop, and built Blobstore, Twitter’s photo storage platform; Snap, where he worked on the backend for Snapchat Stories; and Lyft, where he worked on the surge pricing model. Harish holds a master’s degree in computer science from Stanford, where he focused on systems and databases, and an undergraduate degree in computer science from the International Institute of Information Technology in Hyderabad.

Presentations

Infrastructure for deploying machine learning to production in large financial institutions: Lessons learned and best practices Session

Large financial institutions have many data science teams (e.g., those for fraud, credit risk, and marketing), each often using a diverse set of tools to build predictive models, and productionizing these predictive AI models poses many challenges. Harish Doddi and Jerry Xu share the challenges and lessons learned deploying AI models to production in large financial institutions.

Mark Donsky is a director of product management at Okera, a software provider that delivers discovery, access control, and governance at scale for today’s heterogeneous data environments. Previously, Mark led data management and governance solutions at Cloudera and held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions by millions of dollars annually. He holds a BS with honors in computer science from Western University in Ontario, Canada.

Presentations

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) went into effect for firms doing business in the EU, but many companies still aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments, step by step Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR.

Joe DosSantos is vice president of enterprise information management in the Enterprise Enabling Technology Services organization at TD Bank, where he is responsible for the strategy, implementation, and support of technology solutions for analytics, big data, data integration, data quality, data warehousing, and data governance. Joe is also responsible for the bank’s enterprise data shared services and the alignment of information management technology across the bank’s lines of business. Joe has over 20 years of experience in large-scale technology program delivery and information management, shaped by a 10-year career at Accenture, a solution development leadership role at master data management startup Siperian (now part of Informatica), and a role as head of EMC’s big data professional services practice. He is a graduate of Georgetown University.

Presentations

TD Bank’s journey to turn its big data environment into a true data lake (sponsored by Talend) Session

TD Bank’s data analytics team has undertaken a multiyear journey to modernize its data infrastructure for today and future needs. Joseph DosSantos explains how the team built a governed data lake foundation, enabling business users to leverage its big data environment to extract analytical insights while minimizing risks.

Florian Douetteau is the CEO of Dataiku, a company democratizing access to data science. After starting to program in early childhood, Florian dropped out of the prestigious École Normale math program to join a startup that later became Exalead, a search engine company in the early days of the French startup community. His interests include data, artificial intelligence, and how tech can improve the daily work life of tech people.

Presentations

Executive Briefing: Profit from AI and machine learning—The best practices for people and process Session

Tony Baer and Florian Douetteau share the results of research cosponsored by Ovum and Dataiku that surveyed a specially selected sample of chief data officers and data scientists on how to map roles and processes to make success with AI in the business repeatable.

James Dreiss is a senior data scientist at Reuters. Previously, he worked at the Metropolitan Museum of Art in New York. He studied at New York University and the London School of Economics.

Presentations

Document vectors in the wild: Building a content recommendation system for Reuters.com Session

James Dreiss discusses the challenges in building a content recommendation system for one of the largest news sites in the world, Reuters.com. The particularities of the system include developing a scrolling newsfeed and the use of document vectors for semantic representation of content.
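
The document-vector approach is well supported in open source; this toy gensim sketch (tiny invented corpus, not the Reuters.com system) embeds articles and recommends by vector similarity:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    articles = {
        "a1": "fed raises interest rates amid inflation concerns",
        "a2": "central bank signals further rate hikes",
        "a3": "soccer team wins championship final",
    }
    corpus = [TaggedDocument(text.split(), [doc_id])
              for doc_id, text in articles.items()]

    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=100)

    # Nearest neighbors in vector space act as semantic recommendations.
    # (Use model.dv instead of model.docvecs on gensim 4+.)
    print(model.docvecs.most_similar("a1"))  # "a2" should outrank "a3"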

Chiny Driscoll is founder and CEO at MetiStream, a provider of real-time integration and analytic services in the big data arena. Chiny has more than 24 years of management and executive leadership experience in the technology industry and has served in a variety of roles with Fortune 500 tech companies. Previously, Chiny was the worldwide executive leader of big data services for IBM’s Information Management Division, where she led all of the professional services that implemented and supported IBM’s big data products and solutions, including streaming, analytics, Hadoop, and data warehouse appliance offerings, across industries such as financial services, communications, the public sector, and retail; was the vice president and general manager of Netezza, a leader in big data warehouse appliances and advanced analytics (acquired by IBM in 2010); held various global and regional leadership roles at TIBCO Software, where her responsibilities included running the presales, services, and sales operations for the Public Sector Division; and served in services leadership roles at EDS and other services and technology companies.

Presentations

Digging for gold: Developing AI in healthcare against unstructured text data Session

Chiny Driscoll and Jawad Khan offer an overview of a solution by Cloudera and MetiStream that lets healthcare providers automate the extraction, processing, and analysis of clinical notes within an electronic health record in batch or real time, improving care, identifying errors, and recognizing efficiencies in billing and diagnoses.

Carolyn Duby is a solutions engineer at Cloudera, where she helps customers harness the power of their data with Apache open source platforms. Previously, she was the architect for cybersecurity event correlation at Secureworks. A subject-matter expert in cybersecurity and data science, Carolyn is an active leader in the community and a frequent speaker at Future of Data meetups in Boston, MA, and Providence, RI, and at conferences such as the Open Data Science Conference and the Global Data Science Conference. Carolyn holds an ScB (magna cum laude) and ScM from Brown University, both in computer science. She’s a lifelong learner and recently completed the Johns Hopkins University Coursera data science specialization.

Presentations

Apache Metron: Open source cybersecurity at scale Tutorial

Carolyn Duby shows you how to find the cybersecurity threat needle in your event haystack using Apache Metron: a real-time, horizontally scalable open source platform. After this interactive overview of the platform's major features, you'll be ready to analyze your own haystack back at the office.

Ted Dunning is the chief technology officer at MapR, an HPE company. He’s also a board member for the Apache Software Foundation, a PMC member, and committer on a number of projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He’s contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Progress for big data in Kubernetes Session

Stateful containers are a well-known anti-pattern, but the standard solution—managing state in a separate storage tier—is costly and complex. Recent developments have changed things dramatically for the better. In particular, you can now manage a high-performance software-defined-storage tier entirely in Kubernetes. Ted Dunning describes what's new and how it makes big data easier on Kubernetes.

The answer to life, the universe, and everything: But can you get that into production? (sponsored by MapR) Keynote

There’s real value in big data and more waiting when you add real-time, but to get the payoff, you need successful deployments of your AI and data-intensive applications. You need to be ready with your current applications in production but must have an architecture and infrastructure that are ready for the next ones as well. Ted Dunning explores how others have fared in this journey.

Brent Dykes is the director of data strategy at Domo. Brent has over 14 years of enterprise analytics experience at Omniture, Adobe, and Domo. He is a regular Forbes contributor on data-related topics and has published two books on digital analytics, including Web Analytics Action Hero. In 2016, Brent received the Most Influential Industry Contributor Award from the Digital Analytics Association (DAA). He is a popular speaker at conferences such as Shop.org, Adtech, Pubcon, and Adobe Summit. Brent holds an MBA from Brigham Young University and a BBA in marketing from Simon Fraser University.

Presentations

Stories beat statistics: How to master the art and science of data storytelling Session

Companies collect all kinds of data and use advanced tools and techniques to find insights, but they often fail in the last mile: communicating insights effectively to drive change. Brent Dykes discusses the power that stories wield over statistics and explores the art and science of data storytelling—an essential skill in today’s data economy.

Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing big data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.

Presentations

Data discovery and lineage: Integrating streaming data in the public cloud with on-prem, classic data stores, and heterogeneous schema types Session

Comcast’s streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. Barbara Eckman explains how Comcast recently integrated on-prem data sources, including traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro.

Jacob Eisinger is the director of data at Talroo, where he is responsible for the Special Projects initiative to pilot and validate high-impact business models and technologies. Previously, Jacob led search, personalization, data warehouse, bot detection, and machine learning at Talroo and worked in the Emerging Technologies Group at IBM, where he worked with technologies like BlueMix, Apache Spark, Apache Kafka, OAuth, and web service standards. Jacob is an accomplished inventor with over 20 patent applications. He holds a bachelor’s degree in computer science from Virginia Tech.

Presentations

Job recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Session

Can the talent industry make the job search/match more relevant and personalized for a candidate by leveraging deep learning techniques? Guoqiong Song, Wenjing Zhan, and Jacob Eisinger demonstrate how to leverage distributed deep learning framework BigDL on Apache Spark to predict a candidate’s probability of applying to specific jobs based on their résumé.

Ward Eldred is a solution architect responsible for assisting customers in tackling complex business problems with deep learning and HPC solutions that leverage NVIDIA technologies. Ward also leads courses as part of NVIDIA’s Deep Learning Institute, which focuses on teaching students the fundamentals of deep learning through seminars and labs. Previously, Ward spent 20 years as a systems engineer at Sun Microsystems, architecting HA and cluster solutions.

Presentations

Deep learning: Assessing analytics project feasibility and requirements (sponsored by NVIDIA) Session

Ward Eldred offers an overview of the types of analytical problems that can be solved using deep learning and shares a set of heuristics that can be used to evaluate the feasibility of analytical AI projects.

Jonathan Ellis is cofounder and CTO at DataStax and the founding project chair of Apache Cassandra. Previously, Jonathan built a multipetabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy.

Presentations

Cassandra versus cloud databases Session

Is open source Apache Cassandra still relevant in an era of hosted cloud databases? Jonathan Ellis discusses Cassandra’s strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner.
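
For readers who haven't touched Cassandra directly, the open source DataStax Python driver makes the operational contrast with hosted databases visible: you manage contact points and replication yourself rather than calling an API endpoint. A minimal sketch against a local node:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])   # your own nodes, not a managed endpoint
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute(
        "CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
    session.execute(
        "INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "ada"))
    print(session.execute("SELECT * FROM demo.users").one())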

Moty Fania is a principal engineer and the CTO of the Advanced Analytics Group at Intel, which delivers AI and big data solutions across Intel. Moty has rich experience in ML engineering, analytics, data warehousing, and decision-support solutions. He led the architecture work and development of various AI and big data initiatives such as IoT systems, predictive engines, online inference systems, and more.

Presentations

A high-performance system for deep learning inference and visual inspection Session

Moty Fania and Sergei Kom share their experience and lessons learned implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time, streaming, and online actuation.

Basil Faruqui is lead solutions manager at BMC, where he leads the development and execution of big data and multicloud strategy for BMC’s Digital Business Automation line of business (Control-M). Basil’s key areas of focus include evangelizing the role automation plays in delivering successful big data projects and advising companies on how to build scalable automation strategies for cloud and big data initiatives. Basil has over 15 years of industry experience in various areas of software research and development, customer support, and knowledge management.

Presentations

Enabling predictive maintenance using automated IoT data pipelines (sponsored by BMC) Session

Basil Faruqui demonstrates how to simplify the automation and orchestration of an IoT-driven data pipeline in a cloud environment where machine learning algorithms predict failures.

Usama Fayyad is a cofounder and chief technology officer at OODA Health, a VC-funded company founded in 2017 to bring AI and automation to create a retail-like experience in payments and processing for healthcare delivery, and founder and chairman at Open Insights, a technology and strategic consulting firm founded in 2008 to help enterprises deploy data-driven solutions to grow revenue from data assets. In addition to big data strategy and building new business models on data assets, Open Insights deploys data science, AI and ML, and big data solutions for large enterprises. Previously, he served as global chief data officer at Barclays in London after launching the largest tech startup accelerator in MENA as executive chairman of Oasis500 in Jordan; held chairman and CEO roles at several startups, including Blue Kangaroo, DMX Group, and DigiMine; was the first person to hold the chief data officer title when Yahoo acquired his second startup in 2004, where he built the Strategic Data Solutions Group and founded Yahoo Research Labs; and held leadership roles at Microsoft and founded the Machine Learning Systems Group at NASA’s Jet Propulsion Laboratory, where his work on machine learning earned the top Excellence in Research award from Caltech and a US government medal from NASA. Usama has published over 100 technical articles on data mining, data science, AI and ML, and databases. He holds over 30 patents and is a fellow of both the AAAI and the ACM. Usama earned his PhD in engineering in AI and machine learning from the University of Michigan-Ann Arbor. He’s edited two influential books on data mining and served as editor-in-chief of two key industry journals. He has also served on the boards or advisory boards of several private and public companies, including Criteo, InvenSense, RapidMiner, Stella, Virsec, Silniva, Abe AI, NetSeer, ChoiceStream, Medio, and others. On the academic front, he’s on the advisory boards of the Data Science Institute at Imperial College, AAI at UTS, and the University of Michigan College of Engineering National Advisory Board.

Presentations

Next-generation cybersecurity via data fusion, AI, and big data: Pragmatic lessons from the front lines in financial services Session

Usama Fayyad and Troels Oerting share outcomes and lessons learned from building and deploying a global data fusion, incident analysis/visualization, and effective cybersecurity defense based on big data and AI at a major EU bank, in collaboration with several financial services institutions.

Stephanie Fischer is the founder of datanizing GmbH. Stephanie has many years of consulting experience in big data, machine learning, and human-centric innovation. As a product owner, she develops services and products based on machine learning and content analytics. She is a frequent speaker at conferences and the author of articles on big data and machine learning.

Presentations

From chaos to insight: Automatically derive value from your user-generated content Data Case Studies

Whether customer emails, product reviews, company wikis, or support communities, user-generated content (UGC) as a form of unstructured text is everywhere, and it’s growing exponentially. Stephanie Fischer explains how to discover meaningful insights from the UGC of a famous New York discussion forum.
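
Topic modeling is one common first pass over user-generated text; the scikit-learn sketch below (toy posts, not the presenter's datanizing toolchain) pulls two rough themes out of four comments:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    posts = [
        "the subway was delayed again this morning",
        "best pizza slice in brooklyn hands down",
        "train service changes ruined my commute",
        "this bakery's bagels are incredible",
    ]
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(posts)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vec.get_feature_names_out()  # get_feature_names() on older sklearn
    for topic in lda.components_:
        print([terms[i] for i in topic.argsort()[-3:]])  # top words per theme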

Brian Foo is a senior software engineer for Google Cloud working on applied artificial intelligence, where he builds demos for Google Cloud’s strategic customers and creates open source tutorials to improve public understanding of AI. Previously, Brian worked at Uber, where he trained machine learning models and built a large-scale training and inference pipeline for mapping and sensing/perception applications using Hadoop and Spark, and headed the real-time bidding optimization team at Rocket Fuel, where he worked on algorithms that determined millions of ads shown every second across many platforms such as web, mobile, and programmatic TV. Brian holds a BS in EECS from UC Berkeley and a PhD in EE telecommunications from UCLA.

Presentations

From training to serving: Deploying TensorFlow models with Kubernetes Tutorial

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Brian Foo, Jay Smith, and Holden Karau explain how to bring deep learning models from training to serving in a cloud production environment. You'll learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
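
For readers who want to see what the export step looks like in practice, here is a minimal sketch, assuming TensorFlow 2.x; the model, directory layout, and Docker invocation are illustrative only, not the presenters' actual code. TensorFlow Serving watches a base directory containing numbered version subdirectories, which is what makes hot-swapping model versions possible.

    import tensorflow as tf

    # A toy Keras model standing in for a real trained model.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # Export in the SavedModel layout; "1" is the model version directory.
    tf.saved_model.save(model, "export/my_model/1")

    # The export can then be mounted into the stock Serving image, e.g.:
    #   docker run -p 8501:8501 \
    #     -v $PWD/export/my_model:/models/my_model \
    #     -e MODEL_NAME=my_model tensorflow/serving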

Janet Forbes is an experienced enterprise, business, and senior systems architect at T4G. With over 25 years of experience, Janet has a deep understanding of data, functional, and technical architecture, with a particular focus on business and data architecture; a proven ability to define, audit, and improve business processes based on best practices; and extensive experience leading multifunctional teams through the planning and delivery of complex solutions. As a trusted advisor, Janet works closely with clients in assessing and shaping their data strategy practices.

Presentations

From theory to data product: Applying data science methods to effect business change Tutorial

Janet Forbes, Danielle Leighton, and Lindsay Brin lead a primer on crafting well-conceived data science projects that uncover valuable business insights. Using case studies and hands-on skills development, Janet, Danielle, and Lindsay walk you through essential techniques for effecting real business change.

Antonio Fragoso is a senior data scientist and technical leader at Globant, where he works with top retail and financial services clients within Globant’s Artificial Intelligence and Big Data Studios. Antonio is responsible for helping clients revamp their analytics cycles and business operations through machine learning and deep learning initiatives.

Presentations

The importance of experimental iteration: A data-centric approach to an AI project (sponsored by Globant) Session

Antonio Fragoso explores the key aspects of implementing a natural language processing project within your organization and reveals the necessary steps for making it a success. Antonio focuses on how to leverage an iterative process that can pave the way toward building a successful product.

Jean-Michel Franco is director of product marketing for Talend’s data governance solutions. He has dedicated his career to developing and broadening the adoption of innovative technologies in companies. He started his career at EDS (now HP) by creating and developing a business intelligence (BI) practice, joined SAP EMEA as director of marketing solutions in France and North Africa, and served as innovation director for Business & Decision. He is the author of four books and regularly publishes articles and presents at events and trade shows.

Presentations

Enacting Data Subject Access Rights for GDPR with data services and data management Session

GDPR is more than another regulation to be handled by your back office. Enacting the GDPR's Data Subject Access Rights (DSAR) requires practical actions. Jean-Michel Franco outlines the practical steps to deploy governed data services.

Bill Franks is chief analytics officer at the International Institute for Analytics (IIA). His work has spanned a variety of industries and clients ranging in size from Fortune 100 companies to small nonprofit organizations. Previously, he was chief analytics officer at Teradata. Bill is the author of Taming the Big Data Tidal Wave and The Analytics Revolution. You can learn more on his website.

Presentations

Analytics maturity: Industry trends and financial impacts Session

Drawing on a recent study of the analytics maturity level of large enterprises by the International Institute for Analytics, Bill Franks discusses how maturity varies by industry, shares key steps organizations can take to move up the maturity scale, and explains how the research correlates analytics maturity with a wide range of success metrics, including financial and reputational measures.

Michael J. Freedman is the cofounder and CTO of TimescaleDB, an open source database that scales SQL for time series data, and a professor of computer science at Princeton University, where his research focuses on distributed systems, networking, and security. Previously, Michael developed CoralCDN (a decentralized CDN serving millions of daily users) and Ethane (the basis for OpenFlow and software-defined networking) and cofounded Illuminics Systems (acquired by Quova, now part of Neustar). He is a technical advisor to Blockstack. Michael’s honors include the Presidential Early Career Award for Scientists and Engineers (PECASE, given by President Obama), the SIGCOMM Test of Time Award, the Caspar Bowden Award for Privacy Enhancing Technologies, a Sloan Fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, a DARPA Computer Science Study Group membership, and multiple award-winning publications. He holds a PhD in computer science from NYU’s Courant Institute and bachelor’s and master’s degrees from MIT.

Presentations

Performant time series data management and analytics with Postgres Session

Michael Freedman explains how to leverage Postgres for high-volume time series workloads using TimescaleDB, an open source time series database built as a Postgres plug-in. Michael covers the general architectural design principles and new time series data management features, including adaptive time partitioning and near-real-time continuous aggregations.
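
As a rough illustration of the plug-in approach, the sketch below creates a hypertable from Python, assuming a running Postgres instance with the TimescaleDB extension installed; the table and column names are hypothetical. create_hypertable() is the call that turns an ordinary Postgres table into a time-partitioned one.

    import psycopg2

    # Assumes a Postgres instance with the TimescaleDB extension available.
    conn = psycopg2.connect("dbname=tsdb user=postgres")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
        cur.execute("""
            CREATE TABLE conditions (
                time        TIMESTAMPTZ NOT NULL,
                device_id   TEXT,
                temperature DOUBLE PRECISION
            );
        """)
        # Convert the plain table into a hypertable partitioned on "time".
        cur.execute("SELECT create_hypertable('conditions', 'time');")
    conn.close()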

Brandon Freeman is a Mid-Atlantic region strategic system engineer at Cloudera, specializing in infrastructure, the cloud, and Hadoop. Previously, Brandon was an infrastructure architect at Explorys, working in operations, architecture, and performance optimization for the Cloudera Hadoop environments, where he was responsible for designing, building, and managing many large Hadoop clusters.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud and integrate data engineering and data analytic workflows and explore considerations and best practices for data analytics pipelines in the cloud. Along the way, you'll see how to share metadata across workloads in a big data PaaS.

Chris Fregly is a senior developer advocate focused on AI and machine learning at Amazon Web Services (AWS). Chris shares knowledge with fellow developers and data scientists through his Advanced Kubeflow AI Meetup and regularly speaks at AI and ML conferences across the globe. Previously, Chris was a founder at PipelineAI, where he worked with many startups and enterprises to deploy machine learning pipelines using many open source and AWS products including Kubeflow, Amazon EKS, and Amazon SageMaker.

Presentations

Building a high-performance model serving engine from scratch using Kubernetes, GPUs, Docker, Istio, and TensorFlow Session

Chris Fregly details a full-featured, open source end-to-end TensorFlow model training and deployment system, using the latest advancements with Kubernetes, TensorFlow, and GPUs.

Brandy Freitas is a principal data scientist at Pitney Bowes, where she works with clients in a wide variety of industries to develop analytical solutions for their business needs. Brandy is a research-physicist-turned-data-scientist based in Boston, Massachusetts. Her academic research focused primarily on protein structure determination, applying machine learning techniques to single-particle cryoelectron microscopy data. Brandy is a National Science Foundation Graduate Research Fellow and a James Mills Pierce Fellow. She holds an undergraduate degree in physics and chemistry from the Rochester Institute of Technology and did her graduate work in biophysics at Harvard University.

Presentations

Executive Briefing: Analytics for executives—Building an approachable language to drive data science in your organization Session

Data science is an approachable field given the right framing. Often, though, practitioners and executives are describing opportunities using completely different languages. Join Brandy Freitas to develop context and vocabulary around data science topics to help build a culture of data within your organization.

JF Gagne is CEO of Element AI. A senior global executive, JF has managed a number of implementation projects for small and large companies and initiated and directed a number of AI, OR, and optimization R&D projects over the past decade. JF has also developed, transferred, and established best practices and cutting-edge technology in many industries, including retail, distribution, manufacturing, call centers, healthcare, airport services, and security. Previous roles include chief innovation and products officer and head of JDA Labs, cofounder and CEO of Planora, and cofounder and director of products for Logiweb.

Presentations

From data governance to AI governance: The CIO's new role Session

JF Gagne explains why the CIO is going to need a broader mandate in the company to better align their AI training and outcomes with business goals and compliance. This mandate should include an AI governance team that is well staffed and deeply established in the company, in order to catch biases that can develop from faulty goals or flawed data.

Ted Gibson is a product management principal at Novantas Solutions, where he is responsible for content product management for the PriceTek suite of products, focusing on business use cases, metrics, models, and calculations for innovative new development. In his more than eight years working on PriceTek, Ted has held various roles across product management, sales, client services, and engineering and has experience in pricing for consumer deposits, home equity, mortgage, auto, and unsecured lending. He holds a BA in applied mathematics from Yale University.

Presentations

Case study: A Spark-based distributed simulation optimization architecture for portfolio optimization in retail banking Session

Kaushik Deka and Ted Gibson share a large-scale optimization architecture in Spark for a consumer product portfolio optimization use case in retail banking. The architecture combines a simulator that distributes computation of complex real-world scenarios and a constraint optimizer that uses business rules as constraints to meet growth targets.

Harry Glaser is cofounder and CEO of Periscope Data. Harry and cofounder Tom O’Neill have grown Periscope Data to serve nearly 1,000 customers. Previously, he worked at Google. Harry holds a bachelor’s degree in computer science from the University of Rochester.

Presentations

An ethical foundation for the AI-driven future Session

What is the moral responsibility of a data team today? As AI and machine learning technologies become part of our everyday life and as data becomes accessible to everyone, CDOs and data teams are taking on a very important moral role as the conscience of the corporation. Harry Glaser highlights the risks companies will face if they don't empower data teams to lead the way for ethical data use.

Zachary Glassman is a data scientist in residence at the Data Incubator. Zachary has a passion for building data tools and teaching others to use Python. He studied physics and mathematics as an undergraduate at Pomona College and holds a master’s degree in atomic physics from the University of Maryland.

Presentations

Hands-on data science with Python 1-Day Training

Zachary Glassman leads a hands-on dive into building intelligent business applications using machine learning, walking you through all the steps of developing a machine learning pipeline. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend these models into two applications using a real-world dataset.

Heitor Murilo Gomes is a researcher at Télécom ParisTech focusing on machine learning, particularly evolving data streams, concept drift, ensemble methods, and big data streams. He coleads StreamDM, an open source data stream mining project.

Presentations

Machine learning for nonstationary streaming data using Structured Streaming and StreamDM Session

The StreamDM library provides the largest collection of data stream mining algorithms for Spark. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specifically for nonstationary streams (i.e., those with concept drift).

Bruno Gonçalves is a chief data scientist at Data For Science, working at the intersection of data science and finance. Previously, he was a data science fellow at NYU’s Center for Data Science while on leave from a tenured faculty position at Aix-Marseille Université. Since completing his PhD in the physics of complex systems in 2008, he’s been pursuing the use of data science and machine learning to study human behavior. Using large datasets from Twitter, Wikipedia, web access logs, and Yahoo! Meme, he has studied how to observe both large-scale and individual human behavior in an unobtrusive and widespread manner. The main applications have been the study of computational linguistics, information diffusion, behavioral change, and epidemic spreading. In 2015, he was awarded the Complex Systems Society’s 2015 Junior Scientific Award for “outstanding contributions in complex systems science,” and in 2018 he was named a science fellow of the Institute for Scientific Interchange in Turin, Italy.

Presentations

Recurrent neural networks for time series analysis Tutorial

Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Join Bruno Gonçalves to learn how to use recurrent neural networks to model and forecast time series and discover the advantages and disadvantages of recurrent neural networks with respect to more traditional approaches.
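
To make the approach concrete, here is a minimal sketch of the kind of model the tutorial covers, assuming Keras/TensorFlow: an LSTM that maps a sliding window of past values to the next value. The window length, layer sizes, and synthetic data are arbitrary illustrative choices.

    import numpy as np
    import tensorflow as tf

    WINDOW = 24  # number of past steps the model sees

    # Synthetic series standing in for real data.
    series = np.sin(np.linspace(0, 100, 1000)).astype("float32")
    X = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
    y = series[WINDOW:]
    X = X[..., None]  # shape: (samples, timesteps, features)

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(WINDOW, 1)),
        tf.keras.layers.Dense(1),  # predict the next value in the series
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=5, batch_size=32)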

Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Big data at speed Session

Many details go into building a big data system for speed, from determining an acceptable latency for data access and where to store the data to solving multiregion problems—or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed.

Near-real-time anomaly detection at Lyft Session

Thomas Weise and Mark Grover explain how Lyft uses its streaming platform to detect and respond to anomalous events, using data science tools for machine learning and a process that allows for fast and predictable deployment.

Sudipto Guha is principal scientist at Amazon Web Services, where he studies the design and implementation of a wide range of computational systems, from resource-constrained devices, such as sensors, to massively parallel and distributed systems. Using an algorithmic framework, Sudipto seeks to design systems that are correct, efficient, and optimized despite their bidirectional asymptotic scale and seeming lack of similarity to human information processes. His recent work focuses on clustering and location theory, statistics and learning theory, database query optimization and mining, approximation algorithms for stochastic control, communication complexity, and data stream algorithms.

Presentations

Continuous machine learning over streaming data: The story continues. Session

Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams.

Sumit Gulwani is a partner research manager at Microsoft, where he leads the PROSE research and engineering team that develops APIs for program synthesis (programming by examples and natural language) and incorporates them into real products. He is the inventor of the popular Flash Fill feature in Microsoft Excel, used by hundreds of millions of people. He has published 120+ peer-reviewed papers in top-tier conferences and journals across multiple computer science areas, delivered 40+ keynotes and invited talks at various forums, and authored 50+ patent applications (granted and pending). Sumit is a recipient of the prestigious ACM SIGPLAN Robin Milner Young Researcher Award, ACM SIGPLAN Outstanding Doctoral Dissertation Award, and the President’s Gold Medal from IIT Kanpur.

Presentations

Programming by input-output examples Session

Programming by input-output examples (PBE) is a new frontier in AI, set to revolutionize the programming experience for the masses. It can enable end users—99% of whom are nonprogrammers—to create small scripts and make data scientists 10–100x more productive for many data wrangling tasks. Sumit Gulwani leads a deep dive into this new programming paradigm and explores the science behind it.
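
For intuition about how PBE differs from ordinary programming, consider the toy synthesizer below, a drastically simplified stand-in for systems like PROSE rather than their actual algorithm: it enumerates a tiny DSL of fixed-position substring programs and keeps only those consistent with every input-output example.

    def candidate_programs(max_len=20):
        """Enumerate a tiny DSL: all fixed (start, end) slice programs."""
        for start in range(max_len):
            for end in range(start + 1, max_len + 1):
                yield (start, end)

    def synthesize(examples):
        """Return every slice program consistent with all the examples."""
        return [(s, e) for (s, e) in candidate_programs()
                if all(inp[s:e] == out for inp, out in examples)]

    # Two examples suffice to pin down "extract the area code":
    programs = synthesize([("212-555-0117", "212"), ("646-555-0199", "646")])
    start, end = programs[0]
    print("917-555-0123"[start:end])  # the learned program generalizes: "917"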

Patrick Hall is principal scientist at bnh.ai, a boutique law firm focused on AI and analytics; a senior director of product at H2O.ai, a leading Silicon Valley machine learning software company; and a lecturer in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning.

At both bnh.ai and H2O.ai, he works to mitigate AI risks and advance the responsible practice of machine learning. Previously, Patrick held global customer-facing and R&D roles at SAS. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the 11th person worldwide to become a Cloudera Certified Data Scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Practical techniques for interpreting machine learning models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. Patrick Hall, Avni Wadhwa, and Mark Chan share practical and productizable approaches for explaining, testing, and visualizing machine learning models using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.

Luke (Qing) Han is a cofounder and CEO of Kyligence, cocreator and PMC chair of Apache Kylin, the leading open source OLAP engine for big data, and a Microsoft regional director and MVP. Luke has 10+ years’ experience in data warehouses, business intelligence, and big data. Previously, he was big data product lead at eBay and chief consultant of Actuate China.

Presentations

Refactor your data warehouse with mobile analytics products (sponsored by Kyligence) Session

When China Construction Bank wanted to migrate 23,000+ reports to mobile, it chose Apache Kylin as the high-performance, high-concurrency platform on which to refactor its data warehouse architecture to serve 400K+ users. Zhi Zhu and Luke Han detail the necessary architecture and best practices for refactoring a data warehouse for mobile analytics.

Zachary Hanif is a director in Capital One’s Center for Machine Learning, where he leads teams focused on applying machine learning to cybersecurity and financial crime. His research interests include applications of machine learning and graph mining within the realm of massive security data and the automation of model validation and governance. Zachary graduated from the Georgia Institute of Technology.

Presentations

Network effects: Working with modern graph analytic systems Session

An understanding of graph-based analytical techniques can be extremely powerful when applied to modern practical problems, and modern frameworks and analytical techniques are making graph analysis methods viable for increasingly large, complex tasks. Zachary Hanif examines three prominent graph analytic methods, including graph convolutional networks, and applies them to concrete use cases.

Dan Harple is the founder and CEO of Context Labs, a leader in delivering at-scale enterprise blockchain-enabled systems and in advising global market segments and countries on the development of highly efficient ecosystems and interoperable standards to accelerate positive change for stakeholders. Recent work at Context Labs has taken blockchain-enabled platforms from the proof-of-concept (POC) stage to at-scale production, with reference deployments in global printing and publishing, global environmental data, and cybersecurity. A technology entrepreneur for more than 25 years, Dan has founded and built technologies, companies, and products that have been used by billions of internet users, merging companies with Netscape Communications and Oracle and leading a joint venture with China’s Sina. Each of Dan’s firms successfully raised multiple rounds of Silicon Valley-based venture capital and had liquidity events at various stages. He has been a founder and CEO of technology companies, held senior executive and CEO roles at three NASDAQ-listed tech companies, and served as an advisor and investor in many others, including acting chief innovation strategy officer at RR Donnelley. He has served as a director and/or advisor for a variety of nonprofits and educational institutions, including the Berklee College of Music, Stichting Nexuslabs Foundation, International School of Amsterdam, Tabor Academy, University of Rhode Island College of Engineering Advisory Board, Friends Academy, Marlboro College, and Harrisburg Academy. Dan has received numerous awards, including Inc. magazine’s Entrepreneur of the Year Award and the NEA President’s Award. With internet pioneer Vint Cerf, he coauthored Disrupting Unemployment, focusing on technology’s impact on employment and the economy, and he has published in a variety of conference proceedings on the application of CAE/CAD/CAM and finite element analysis in distributed computing environments, specifically in the field of mechanical design and ergonomics. Dan holds an MSc from MIT and degrees in mechanical engineering and psychology from the University of Rhode Island. He also attended Marlboro College.

Presentations

Architectural principles for building trusted, real-time, distributed IoT systems Session

Dan Harple explains how distributed systems are being influenced by and are influencing operational, financial, and social impact requirements of a wide range of enterprises and how trust in these distributed systems is being challenged, elevated, and resolved by engineers and architects today.

Patrick Harrison started and leads the data science team at S&P Global Market Intelligence (S&P MI), a business and financial intelligence firm and data provider. The team employs a wide variety of data science tools and techniques, including machine learning, natural language processing, recommender systems, and graph analytics, among others. Patrick is the coauthor of the forthcoming book Deep Learning with Text from O’Reilly Media, along with Matthew Honnibal, creator of spaCy, the industrial-strength natural language processing software library, and is a founding organizer of a machine learning conference in Charlottesville, Virginia. He is actively involved in building both regional and global data science communities. Patrick holds a BA in economics and an MS in systems engineering, both from the University of Virginia. His graduate research focused on complex systems and agent-based modeling.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. Joshua Poduska and Patrick Harrison detail how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Kenji Hayashida is a Japan-based data engineer at Recruit Lifestyle Co., Ltd., part of Recruit Group, where he has worked on projects such as advertising technology, content marketing, and the company’s data pipeline. Kenji started his career as a software engineer at HITECLAB while he was in college. He is the author of a popular data science textbook and holds a master’s degree in information engineering from Osaka University. In his free time, Kenji enjoys programming competitions such as TopCoder, Google Code Jam, and Kaggle.

Presentations

Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform Session

Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices for and lessons learned about topics such as schema evolution and network architecture.

Jeffrey Heer is Trifacta’s chief experience officer and cofounder as well as a professor of computer science at the University of Washington, where he directs the Interactive Data Lab. Jeff’s passion is the design of novel user interfaces for exploring, managing, and communicating data. The data visualization tools developed by his lab (D3.js, Protovis, Prefuse) are used by thousands of data enthusiasts around the world. In 2009, Jeff was named to MIT Technology Review’s list of “top innovators under 35.”

Presentations

The Vega project: Building an ecosystem of tools for interactive visualization Session

Jeffrey Heer offers an overview of Vega and Vega-Lite—high-level declarative languages for interactive visualization that support exploratory data analysis, communication, and the development of new visualization tools.

Sam Helmich is a data scientist in John Deere’s Intelligent Solutions Group. Previously, he worked in applied analytics roles within John Deere Worldwide Parts and Global Order Fulfillment. Sam holds an MS in statistics from Iowa State University.

Presentations

Data science in an Agile environment: Methods and organization for success Data Case Studies

Sam Helmich explains how data science can benefit from borrowing Agile principles. These benefits are compounded by structuring team roles in a way that enables success without relying on full-stack expert “unicorns.”

Alex Heye is a software engineer with the Analytics Group at Cray focused on deep learning technologies. His team works to develop applications for HPC users to readily incorporate data analytics and machine learning tools into their workflow.

Presentations

A deep learning approach for precipitation nowcasting with RNN using Analytics Zoo on BigDL Session

Precipitation nowcasting is used to predict the future rainfall intensity over a relatively short timeframe. The forecasting resolution and time accuracy required are much higher than for other traditional forecasting tasks. Alexander Heye and Ding Ding explain how to build a precipitation nowcasting system with recurrent neural networks using BigDL on Apache Spark.

Camila Hiskey is a senior systems engineer at Cloudera. A hands-on technologist, she architects enterprise data solutions, primarily for large financial services and life sciences organizations, and educates IT and business teams implementing Hadoop, open source software, and big data. Previously, she was an engineer and DBA at IBM, where she worked with operational data stores and analytical databases.

Presentations

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments, step by step Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR.

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

Protecting sensitive data in huge datasets: Cloud tools you can use Session

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa and Damien Desfontaines explore how to handle massive public datasets, taking you from theory to real life as they showcase newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm.
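
For readers new to the terminology, k-anonymity says that every combination of quasi-identifier values must be shared by at least k rows. A small sketch of measuring it with pandas follows; the columns and data are made up for illustration.

    import pandas as pd

    df = pd.DataFrame({
        "zip_code":  ["10001", "10001", "10001", "10002"],
        "age_band":  ["30-39", "30-39", "30-39", "40-49"],
        "diagnosis": ["A", "B", "A", "C"],
    })

    def k_anonymity(frame, quasi_identifiers):
        """Smallest equivalence-class size over the quasi-identifiers."""
        return int(frame.groupby(quasi_identifiers).size().min())

    # k == 1 here: the lone 10002/40-49 row could re-identify someone.
    print(k_anonymity(df, ["zip_code", "age_band"]))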

Garrett Hoffman is a director of data science at StockTwits, where he leads efforts to use data science and machine learning to understand social dynamics and develop research and discovery tools that are used by a network of over one million investors. Garrett has a technical background in math and computer science but gets most excited about approaching data problems from a people-first perspective—using what we know or can learn about complex systems to drive optimal decisions, experiences, and outcomes.

Presentations

Deep learning methods for natural language processing Tutorial

Garrett Hoffman walks you through deep learning methods for natural language processing and natural language understanding tasks, using a live example in Python and TensorFlow with StockTwits data. Methods include word2vec, recurrent neural networks and variants (LSTM, GRU), and convolutional neural networks.
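
As a taste of the first method on that list, here is a minimal word2vec sketch using gensim (4.x API) rather than raw TensorFlow, purely for brevity; the tiny StockTwits-style corpus is fabricated.

    from gensim.models import Word2Vec

    corpus = [
        ["bullish", "on", "this", "stock"],
        ["bearish", "on", "this", "stock"],
        ["earnings", "beat", "sent", "shares", "higher"],
    ]

    # Train 50-dimensional word embeddings on the toy corpus.
    model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1)

    # Nearest neighbors in the learned embedding space.
    print(model.wv.most_similar("stock", topn=3))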

Keqiu Hu is a staff software engineer at LinkedIn, where he’s working on LinkedIn’s big data platforms, primarily focusing on TensorFlow and Hadoop.

Presentations

TonY: Native support of TensorFlow on Hadoop Session

Jonathan Hung, Keqiu Hu, and Zhe Zhang offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. Its native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop.

Crystal Huang is a principal in NEA’s New York offices, where she focuses on enterprise software, infrastructure, and security. Previously, she spent nearly five years at GGV Capital in Silicon Valley, where she worked closely with many of the firm’s portfolio companies, including OpenDoor, Slack, BigCommerce, Tile, HashiCorp, Unravel Data, BrightWheel, NS1, and Wish, and led or sourced the firm’s investments in Bitsight, Headspin, and Aptible, among others. Prior to GGV, Crystal worked in technology investment banking at Blackstone and in sales and trading at Goldman Sachs. She was named to Forbes’s “30 under 30” list for venture capital in 2016. Crystal holds a bachelor’s degree from Harvard University.

Presentations

VC trends in machine learning and data science Session

In this panel discussion, venture capital investors explain how startups can accelerate enterprise adoption of machine learning and explore the new tech trends that will give rise to the next transformation in the big data landscape.

Mark Huang is the director of data engineering at Bell Canada, where he manages BI environments, ETL development, architecture, governance, and technology decisions. Since joining Bell over six years ago, Mark has played a key role in designing the BI roadmap and executing on the strategy to meet current and future business needs.

Presentations

How Bell Canada increased the scale of BI exponentially with OLAP on big data (sponsored by Kyvos Insights) Session

Like all telecommunication giants, Bell Canada relies on huge volumes of data to make accurate business decisions and deliver better services. Mark Huang discusses why Bell Canada chose Kyvos’s OLAP on big data technology to achieve multidimensional analytics and how it helped the company deliver to its growing business reporting demands.

Tao Huang is a big data platform development engineer at JD.com, where he is mainly engaged in the development and maintenance of the company’s big data platform, using open source projects such as Hadoop, Spark, Alluxio, and Kubernetes. He focuses on migrating Hadoop to the Kubernetes cluster, which runs long-running services and batch jobs together, to improve cluster resource utilization.

Presentations

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks Session

Tao Huang, Mang Zhang, and 白冰 explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average.
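
The phrase “Alluxio-compatible HDFS URLs” refers to swapping the storage layer by changing only the URI scheme. A sketch of what that looks like from a PySpark job follows; the master address and path are hypothetical, and the Alluxio client library is assumed to be on the Spark classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-read").getOrCreate()

    # Reading through Alluxio instead of HDFS is a one-line change:
    #   hdfs://namenode:8020/logs/2018-09-11
    #   alluxio://alluxio-master:19998/logs/2018-09-11
    df = spark.read.text("alluxio://alluxio-master:19998/logs/2018-09-11")
    print(df.count())

    spark.stop()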

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of Ververica, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and is currently spending a lot of his time writing a book, Stream Processing with Apache Flink.

Presentations

Why and how to leverage the power and simplicity of SQL on Apache Flink Session

Fabian Hueske discusses why SQL is a great approach to unify batch and stream processing. He gives an update on Apache Flink's SQL support and shares some interesting use cases from large-scale production deployments. Finally, Fabian presents Flink's new query service that enables users and applications to submit streaming and batch SQL queries and retrieve low-latency updated results.

Dave Huh is a data scientist in the Professional Services Group at Hitachi Vantara, where he works with healthcare and insurance companies to provide insights with advanced analytics. Dave is passionate about making analytics technologies accessible to the broader public.

Presentations

From two weeks in Python to two hours in Pentaho: Building modern big data pipelines for machine learning (sponsored by Hitachi Vantara) Session

Data in most organizations today is massive, messy, and often found in silos. With so many sources to analyze, data engineers need to construct robust data pipelines using automation and minimize duplicate processes, as computation is costly for big data. David Huh shares strategies to construct data pipelines for machine learning, including one to reduce time to insight from weeks to hours.

Lars Hulstaert is a data scientist at Microsoft. Previously, he studied machine learning at Cambridge University and Ghent University.

Presentations

Democratizing deep learning with transfer learning Session

Transfer learning allows data scientists to leverage insights from large labeled datasets. The general idea of transfer learning is to use knowledge learned from tasks for which a lot of labeled data is available in settings where little labeled data is available. Lars Hulstaert explains what transfer learning is and how it can boost your NLP or CV pipelines.

Jonathan Hung is a senior software engineer on the Hadoop development team at LinkedIn.

Presentations

TonY: Native support of TensorFlow on Hadoop Session

Jonathan Hung, Keqiu Hu, and Zhe Zhang offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. Its native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop.

Daniel Huss is head of product management for Verus at State Street. Previously, he laid the foundation for the very product he’s building now with Boston Consulting Group’s Digital Ventures. In another life, he would have been a physicist. He tries to fake it by applying as much scientific method as he can to product development and entrepreneurship, valuing uncertainty and ambiguity and having a default setting of “build, measure, learn.”

Presentations

A roadmap for open data science and AI for business: Panel discussion with State Street Session

Bethann Noble, Abhishek Kodi, and Daniel Huss share their experience and best practices for designing and executing on a roadmap for open data science and AI for business.

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He’s a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he’s an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions of Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Scalable machine learning for data cleaning Session

Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.

Maryam Jahanshahi is a research scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She holds a PhD from the Icahn School of Medicine at Mount Sinai, where she studied molecular regulators of organ-size control. Maryam’s long-term research goal is to reduce bias in decision making by using a combination of computation linguistics, machine learning, and behavioral economics methods.

Presentations

"Moneyballing" recruiting: A data-driven approach to battling bottlenecks and biases in hiring Data Case Studies

Hiring teams have long relied on intuition and experience to scout talent. Increased data and data-science techniques give us a chance to test common recruiting wisdom. Drawing on results from her recent behavioral experiments and analyses of over 10 million jobs and their outcomes, Maryam Jahanshahi illustrates how seemingly innocuous recruiting decisions can have dramatic impacts on hiring outcomes.

Ankit Jain is a senior data scientist at Uber AI Labs, the machine learning research arm of Uber. His work primarily involves the application of deep learning methods to a variety of Uber’s problems, ranging from forecasting and food delivery to self-driving cars. Previously, he held a variety of data science roles at Bank of America, Facebook, and other startups. Ankit holds an MFE from UC Berkeley and a BS from IIT Bombay (India). Outside of his job, he likes to mentor students in data science, run marathons, and travel.

Presentations

Achieving personalization with LSTMs Session

Personalization is a common theme in social networks and ecommerce businesses. Personalization at Uber involves an understanding of how each driver and rider is expected to behave on the platform. Ankit Jain explains how Uber employs deep learning using LSTMs and its huge database to understand and predict the behavior of each and every user on the platform.

Jeroen Janssens is the founder, CEO, and an instructor of Data Science Workshops, which provides on-the-job training and coaching in data visualization, machine learning, and programming. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He’s the author of Data Science at the Command Line (O’Reilly). Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

Presentations

50 reasons to learn the shell for doing data science Session

"Anyone who does not have the command line at their beck and call is really missing something," tweeted Tim O'Reilly when Jeroen Janssens's Data Science at the Command Line was recently made available online for free. Join Jeroen to learn what you're missing out on if you're not applying the command line and many of its power tools to typical data science problems.

Data science with Unix power tools Tutorial

The Unix command line remains an amazing environment for efficiently performing tedious but essential data science tasks. By combining small, powerful command-line tools, you can quickly scrub, explore, and model your data as well as hack together prototypes. Join Jeroen Janssens for a hands-on workshop based on his book Data Science at the Command Line.

Chad W. Jennings is a product manager for BigQuery at Google Cloud. Chad has a long history in navigation processing; he came to Google from the startup world, where he worked on navigation algorithms on airplanes, helicopters, and mobile phones, and holds a PhD in aeronautics and astronautics from Stanford University. He is an avid skier and surfer. When he’s not working on big things or playing in nature, he’s at home with his wife and two young children.

Presentations

Building the bridge from big data to ML, featuring Geotab (sponsored by Google Cloud) Session

If your company isn’t good at analytics, it’s not ready for AI. Bob Bradley and Chad W. Jennings explain how the right data strategy can set you up for success in machine learning and artificial intelligence—the new ground for gaining competitive edge and creating business value. You'll then see an in-depth demonstration of Google technology from smart cities innovator Geotab.

Smarter cities through Geotab with BigQuery ML and geospatial analytics (sponsored by Google Cloud) Keynote

Cities all over the world are using data and analytics to optimize infrastructure, but city planners are often held back by outdated data gathering methods and legacy analysis tools. Chad Jennings details how Geotab, a leader in IoT fleet logistics, brought BigQuery's unique machine learning and geospatial capabilities to its existing datasets to deliver a more capable solution to city planners.

Ivan Jibaja is a tech lead for the big data analytics team at Pure Storage. Previously, he was a part of the core development team that built the FlashBlade from the ground up. Ivan holds a PhD in computer science with a focus on systems and compilers from the University of Texas at Austin.

Presentations

How to avoid drowning in logs: Streaming 80 billion events and batch processing 40 TB/hour (sponsored by Pure Storage) Session

Pure Storage runs over 70,000 tests per day. Using Spark’s flexible computing platform, the company can write a single application for both streaming and batch jobs, so its team of triage engineers can understand the state of the continuous integration pipeline. Ivan Jibaja discusses the use case for big data analytics technologies, the architecture of the solution, and lessons learned.

Darrin Johnson is a technologist at NVIDIA working in software and hardware engineering, including system software, scientific, and cloud computing. Darrin has a proven track record of building and driving high-performance global teams and delivering innovative technology, products, and solutions, and has demonstrated a mastery of numerous technologies, including kernels, networking, filesystems, storage, security, and cloud computing. His experience includes product marketing, technical marketing, and product management, both inbound and outbound. Darrin is currently driven to master deep learning as an enabler for innovation.

Presentations

Simplifying AI infrastructure: Lessons in scaling a deep learning enterprise (sponsored by NVIDIA) Session

While every enterprise is on a mission to infuse its business with deep learning, few know how to build the infrastructure to get them there. Darrin Johnson shares insights and best practices learned from NVIDIA's deep learning deployments around the globe that you can leverage to shorten deployment timeframes, improve developer productivity, and streamline operations.

Theresa Johnson is a product manager for metrics and forecasting products at Airbnb. As a data scientist, she was part of the task force and cross-functional hackathon team at Airbnb that worked to develop the framework for the current antidiscrimination efforts. Theresa is a founding board member of Street Code Academy, a nonprofit dedicated to high-touch technical training for inner city youth, and has been featured in TechCrunch for her commitment to helping early-stage founders raise capital. Theresa is passionate about extending technology access for everyone and finding mission-driven companies that can have an outsized impact on the world. She holds a PhD in aeronautics and astronautics and dual undergraduate degrees in science, technology, and society and computer science, all from Stanford University.

Presentations

The revenue forecasting platform at Airbnb Findata

Theresa Johnson explains how Airbnb is building its next-generation end-to-end revenue forecasting platform, leveraging machine learning, Bayesian inference, TensorFlow, Hadoop, and web technology.

Ken Jones is an Apache Spark instructor at Databricks. Ken has thousands of hours of in-class instruction experience presenting classes on Spark, Scala, and other open source technologies to Fortune 500 companies and individual developers worldwide. Previously, Ken was a senior instructor at Twitter, where in his role as coordinator for Twitter’s engineering onboarding program, he taught classes on Scala programming and backend service development in Scala. Ken also spent several years teaching Android application development and Android operating system internals, as well as several programming languages. He is the coauthor of Practical Programming in Tcl and Tk, 4th edition, and Tcl and the Tk Toolkit, 2nd edition. Ken lives in San Diego, CA, with his husband, Dean, and their cat, Jasper. He enjoys traveling extensively for work to accumulate airline miles and hotel points so that he can travel extensively for pleasure. When not in front of a class or wandering about strange cities, he likes to read and watch science fiction and fantasy, listen to jazz and ’80s alternative music, and mix (and drink) cocktails.

Presentations

Apache Spark programming 1-Day Training

Ken Jones walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.
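
A small taste of the DataFrame API the course begins with, assuming a local PySpark installation; the sample data is made up.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("intro").getOrCreate()

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 36), ("carol", 29)],
        ["name", "age"],
    )

    # Filter and aggregate: the bread and butter of the core API.
    df.filter(F.col("age") > 30) \
      .agg(F.avg("age").alias("avg_age")) \
      .show()

    spark.stop()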

Andrew Jorgensen is a senior software engineer at Google primarily focused on building distributed real-time data processing systems for mobile analytics to help give developers deeper insight into their app. Andrew has worked on Fabric Answers since its launch in 2014 and helped it grow to handle over six billion sessions per day.

Presentations

Building Fabric Answers using Apache Heron Session

Streaming systems like Apache Heron are being used for an increasingly broad array of applications. Karthik Ramasamy and Andrew Jorgensen offer an overview of Fabric Answers, which provides real-time insights to mobile developers to improve their product experience at Google Fabric using Apache Heron.

Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Omkar has a keen interest in solving large-scale distributed systems problems. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.

Presentations

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework Session

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores and take a deep dive into the architecture to see how it all works.

Oleksii (Alexey) Kachaiev is the CTO at Attendify, where he spends his days coding in Clojure, Haskell, and Rust. His interests include algebra and protocols. Alexey is the author of the Muse and Fn.py libraries and is an active contributor to Aleph and other open source projects.

Presentations

Managing data chaos in the world of microservices Session

When we talk about microservices, we usually focus on the communication layer. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity, such as structural and semantic changes, knowledge sharing, and data discovery. Join Alexey Kachaiev to explore emerging technologies created to tackle these challenges.

Atul Kale is a software engineer on Airbnb’s machine learning infrastructure team. Previously, Atul worked in finance building machine learning-driven proprietary trading strategies and the data pipelines to support them. He holds a degree in computer engineering from the University of Illinois Urbana-Champaign.

Presentations

Bighead: Airbnb's end-to-end machine learning platform Session

Atul Kale and Xiaohan Zeng offer an overview of Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Built on Python, Spark, and Kubernetes, Bighead integrates popular libraries like TensorFlow, XGBoost, and PyTorch and is designed to be used in modular pieces.

Mirko Kämpf is a solutions architect on the CEMEA team at Cloudera, where he applies tools from the Hadoop ecosystem, such as Spark, HBase, and Solr, to solve customers’ problems and works on graph-based knowledge representation using Apache Jena to enable semantic search at scale. Mirko’s research focuses on time-dependent networks and time series analysis at scale. He loves to deliver data-centric workshops and has spoken at several big data-related conferences and meetups. He holds a PhD in statistical physics.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Daniel Kang is a PhD student in the Stanford InfoLab, where he is supervised by Peter Bailis and Matei Zaharia. Daniel’s research interests lie broadly at the intersection of machine learning and systems. Currently, he is working on deep learning applied to video analysis.

Presentations

BlazeIt: An exploratory video analytics engine Session

Daniel Kang offers an overview of the exploratory video analytics engine BlazeIt, which offers FrameQL, a declarative SQL-like language for querying video, and a query optimizer for executing these queries. You'll see how FrameQL can capture a large set of real-world queries, ranging from aggregation to scrubbing, and how BlazeIt can execute certain queries up to 2,000x faster than a naive approach.

Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

From training to serving: Deploying TensorFlow models with Kubernetes Tutorial

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Brian Foo, Jay Smith, and Holden Karau explain how to bring deep learning models from training to serving in a cloud production environment. You'll learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Session

Apache Spark is an amazing distributed system, but part of the bargain we’ve made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning), or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.
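
For context, the “magic numbers” in question are Spark configuration values like the ones below; this sketch simply sets them by hand, which is the manual step that auto-tuning approaches like those in the session aim to replace. The values shown are arbitrary.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hand-tuned-job")
             .config("spark.executor.memory", "4g")          # per-executor heap
             .config("spark.executor.cores", "2")            # tasks per executor
             .config("spark.sql.shuffle.partitions", "200")  # shuffle parallelism
             .getOrCreate())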

Yasuyuki Kataoka is a data scientist at NTT Innovation Institute, Inc. His primary interest is applied R&D in machine learning applications for time series and heterogeneous data such as vision, audio, text, and IoT sensor signals. This data science work spans various fields including automotive, sports, healthcare, and social media. Other areas of interest include robotics control such as self-driving car and drone systems. When not doing research activities, he likes to participate in hackathons, where he has won prizes in the automotive and healthcare industries. Yasuyuki is a PhD candidate in artificial intelligence at the University of Tokyo and holds an MS and BS in mechanical and system engineering from Tokyo Institute of Technology, where he graduated with valedictorian honors.

Presentations

Real-time machine intelligence in IndyCar and Tour de France Session

One of the challenges of sports data analytics is how to deliver machine intelligence beyond a mere real-time monitoring tool. Yasuyuki Kataoka highlights various real-time machine learning models in both IndyCar and Tour de France, sharing real-time data processing architectures, machine learning models, and demonstrations that deliver meaningful insights for players and fans.

Mubashir Kazia is a principal solutions architect at Cloudera and an SME in Apache Hadoop security in Cloudera’s Professional Services practice, where he helps customers secure their Hadoop clusters and comply with internal security policies. He also helps new customers transition to the Hadoop platform and implement their first few use cases, and he trains and mentors peers in Hadoop and Hadoop security. Mubashir has worked with customers from all verticals, including banking, manufacturing, healthcare, telecom, retail, and gaming. Previously, he worked on developing solutions for leading investment banking firms.

Presentations

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments, step by step Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR.

Arun Kejariwal is an independent lead engineer. Previously, he was a statistical learning principal at Machine Zone (MZ), where he led a team of top-tier researchers working on novel techniques for install-and-click fraud detection, assessing the efficacy of TV campaigns, and optimizing marketing campaigns; his team also built novel methods for bot detection, intrusion detection, and real-time anomaly detection. Before that, he developed and open-sourced techniques for anomaly detection and breakout detection at Twitter. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Correlation analysis on live data streams Session

The rate of growth of data volume and velocity has been accelerating along with increases in the variety of data sources. This poses a significant challenge to extracting actionable insights in a timely fashion. Arun Kejariwal and Francois Orsini explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.

Designing modern streaming data applications Tutorial

Arun Kejariwal and Karthik Ramasamy lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, covering messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. They also share case studies from the IoT, gaming, and healthcare and their experience operating these systems at internet scale.

Paul Kent is vice president of big data initiatives at SAS, where he divides his time between customers, partners, and the research and development teams discussing, evangelizing, and developing software at the confluence of big data and high-performance computing. Previously, Paul was vice president of the Platform R&D Division at SAS, where he led groups responsible for the SAS foundation and mid-tier technologies—teams that develop, maintain, and test Base SAS, as well as related data access, storage, management, presentation, connectivity, and middleware software products. Paul has contributed to the development of SAS software components including PROC SQL, TCP/IP connectivity, the output delivery system (ODS), and more recently the inside-database and high-performance initiatives. A strong customer advocate, Paul is widely recognized within the SAS community for his active participation in the community and at local and international user conferences. Paul holds a bachelor of commerce (with honors) from WITS in South Africa, followed by an almost complete MBA (interrupted to try a North American posting). He got his commercial introduction to using computers to make better business decisions in the Gold Division of Anglo American.

Presentations

Commercial software in an increasingly open source ecosystem (sponsored by SAS) Session

Software is eating the world, and open source is eating the software. Most contemporary analytics shops use a lot of open source software in their analytics platform. So where does commercial software like SAS fit? Paul Kent explains how you can achieve the best of both worlds by combining your favorite open source software with the power of SAS analytics.

Jawad Khan is director of data sciences and knowledge management at Rush University Medical Center, where he leads Rush’s analytics and data strategy, focusing on leveraging data from all sections of the business, including clinical, ERP, security, device sensors, and people/patient-generated data, to provide improved safety, better clinical outcomes, reduced cost, and innovation. Jawad has more than 20 years of experience in analytics, software development, data management, and data security. Previously, he was a lead architect at CenturyLink, where he provided cloud enablement strategies for data and applications to clients like GE Capital, Coca-Cola, Procter & Gamble, and Warner Bros., and a managing director at Opus Capital Markets, where he was responsible for leading analytics, data security and compliance, and software development as well as data center and infrastructure development and operations. He also worked as a software engineer consultant for one of the Big Six consulting firms. Jawad holds a degree in computer engineering from Southern Illinois University. He speaks regularly at professional and community events and is a cricket commentator for Chicago NPR affiliate WBEZ.

Presentations

Digging for gold: Developing AI in healthcare against unstructured text data Session

Chiny Driscoll and Jawad Khan offer an overview of a solution by Cloudera and MetiStream that lets healthcare providers automate the extraction, processing, and analysis of clinical notes within an electronic health record in batch or real time, improving care, identifying errors, and recognizing efficiencies in billing and diagnoses.

Amandeep Khurana is cofounder and CEO at Okera, which he launched in 2016 with CTO and cofounder Nong Li. After witnessing firsthand the challenges companies faced in big data and cloud migration, he built Okera to empower all users with easy access through a unified, secured, and governed platform across heterogeneous data stores. Amandeep is passionate about distributed systems, big data, and everything cloud. Previously, he supported customer cloud initiatives at Cloudera and played an integral role at AWS on the Elastic MapReduce team, where he oversaw some of the industry’s largest big data implementations. As such, he understands that customers need self-serve analytics without trading in governance or security. Amandeep is the coauthor of HBase in Action, a book on building applications with HBase. Amandeep holds an MS in computer science from the University of California, Santa Cruz and a bachelor’s degree in engineering from Thapar Institute of Engineering and Technology.

Presentations

The move to a modern data platform in the cloud: Pitfalls to avoid and best practices to follow Session

Amandeep Khurana shares critical data management practices for easy, unified data access that meets security and regulatory compliance requirements, helping you avoid the pitfalls that can lead to complex, expensive architectures.

Rita Ko is the director of the Hive, the innovation lab at the UN Refugee Agency in the United States (USA for UNHCR), where she heads the application of machine learning and data science to explore new modes of engagement around the global refugee crisis. Her work in data science stems from her election campaign experience in Canada at the Office of the Mayor in the City of Vancouver, where she helped reelect Mayor Gregor Robertson to three consecutive terms, and nationally on three election campaigns that applied predictive modeling.

Presentations

From strategy to implementation: Putting data to work at USA for UNHCR Session

Friederike Schuur and Rita Ko explain how the Hive (an internal group at USA for UNHCR) and Cloudera Fast Forward Labs transformed USA for UNHCR, enabling the agency to use data science and machine learning (DS/ML) to address the refugee crisis. Along the way, they cover the development and implementation of a DS/ML strategy, identify use cases and success metrics, and showcase the value of DS/ML.

Abhishek Kodi is a data engineer on the Verus team at State Street, where he strives to create novel solutions for deterministic and nondeterministic problems that balance business requirements against learning frameworks. His focus is supervised and unsupervised analytical solutions that merge natural language processing, supply chain, and financial data. His integral approach to understanding business needs, designing ML-driven solutions, and implementing them in production systems enables sustainable pathways from ML tools to meaningful business solutions.

Presentations

A roadmap for open data science and AI for business: Panel discussion with State Street Session

Bethann Noble, Abhishek Kodi, and Daniel Huss share their experience and best practices for designing and executing on a roadmap for open data science and AI for business.

Andreas Kohlmaier is head of data engineering at Munich Re, where he leads the team responsible for setting up a group-wide data lake and supporting the transformation of Munich Re to a data-driven organization. Andreas has more than 15 years of experience in IT and data projects, with a focus on microservices, data management, IT architecture, and Agile project management. Previously, he was an IT architect at Munich Re. He holds a master’s degree in computer science.

Presentations

Cataloging the data lake for distributed analytics innovation at Munich Re Findata

Munich Re is increasing client resilience against economic, political, and cyberrisks while setting and shaping trends in the insurance market. Recently, Munich Re successfully launched a data catalog as the driver for analyst adoption of a data lake. Andreas Kohlmaier explains how cataloging new data encouraged users to explore new ideas, develop new business, and enhance customer service.

Sergei Kom is a senior software engineer in Intel’s Advanced Analytics Department. Sergei has extensive experience developing real-time applications using Spark Streaming, Kafka, Kafka Streams, and TensorFlow Serving. He enjoys learning new technologies and implementing them in new projects.

Presentations

A high-performance system for deep learning inference and visual inspection Session

Moty Fania and Sergei Kom share their experience and lessons learned implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time, streaming, and online actuation.

Cassie Kozyrkov is Google Cloud’s chief decision scientist. Cassie is passionate about helping everyone make better decisions through harnessing the beauty and power of data. She speaks at conferences and meets with leadership teams to empower decision makers to transform their industries through AI, machine learning, and analytics. At Google, Cassie has advised more than a hundred teams on statistics and machine learning, working most closely with research and machine intelligence, Google Maps, and ads and commerce. She has also personally trained more than 15,000 Googlers (executives, engineers, scientists, and even nontechnical staff members) in machine learning, statistics, and data-driven decision making. Previously, Cassie spent a decade working as a data scientist and consultant. She’s a leading expert in decision science, with undergraduate studies in statistics and economics at the University of Chicago and graduate studies in statistics, neuroscience, and psychology at Duke University and NCSU. When she’s not working, you’re most likely to find Cassie at the theatre, in an art museum, exploring the world, playing board games, or curled up with a good novel.

Presentations

Executive Briefing: Most data-driven cultures aren’t Session

Many organizations aren’t aware that they have a blind spot with respect to their lack of data effectiveness, and hiring experts doesn’t seem to help. Cassie Kozyrkov examines what it takes to build a truly data-driven organizational culture and highlights a vital yet often neglected job function: the data science manager.

The missing piece Keynote

Why do businesses fail at machine learning despite its tremendous potential and the excitement it generates? Is the answer always in data, algorithms, and infrastructure, or is there a subtler problem? Will things improve in the near future? Let's talk about some lessons learned at Google and what they mean for applied data science.

Andreea Kremm is the founder of Netex Group, an international business service provider with over 700 employees in nine countries. Andreea has over 20 years of experience successfully implementing solutions for international online businesses, drawing on her expertise in computer science and psychology. Andreea holds a master’s degree in psychology from the University of Roehampton in London and is currently a PhD student in psychology at Northcentral University in Arizona.

Presentations

From emotion analysis and topic extraction to narrative modeling Session

Narrative economics studies the impact of popular narratives and stories on economic fluctuations in the context of human interests and emotions. Andreea Kremm and Mohammed Ibraaz Syed describe the use of emotion analysis, entity relationship extraction, and topic modeling in modeling narratives from written human communication.

Jay Kreps is the cofounder and CEO of Confluent, a company focused on Apache Kafka. Previously, Jay was one of the primary architects for LinkedIn, where he focused on data infrastructure and data-driven products. He was among the original authors of a number of open source projects in the scalable data systems space, including Voldemort (a key-value store), Azkaban, Kafka (a distributed messaging system), and Samza (a stream processing system).

Presentations

Apache Kafka and the four challenges of production machine learning systems Session

Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize their products, processes, or customer experience. Jay Kreps explores some of the difficulties of building production machine learning systems and explains how Apache Kafka and stream processing can help.

Ramesh Krishnan is a software engineer at Imco. Ramesh has 15 years’ experience in the IT industry, with extensive experience architecting and implementing IT solutions. He’s handled major telecom, banking, sales, finance, and HCM projects and has a decade of MIS and business information reporting experience. He’s also led onsite and offshore projects. His drive and determination have made him a valuable addition to many projects.

Presentations

Self-service modern analytics on the GovCloud Session

Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from its information assets, the company is constantly exploring ways to enable effective self-service scenarios. Ramesh Krishnan and Steve Morgan discuss Lockheed Martin's journey into modern analytics and explore its analytics platform focused on leveraging AWS GovCloud.

Ajay Kulkarni is the cofounder and CEO of TimescaleDB, developers of an open source time series database optimized for fast ingest and complex queries. Ajay got his first PC when he was four years old and has been playing with computers ever since. Previously, he cofounded communication data analysis company Sensobi (acquired in 2011 by GroupMe, later acquired by Skype while Skype was being acquired by Microsoft) and led the mobile engineering team at GroupMe, which grew to millions of daily users and billions of monthly messages over a short period of time. His experience also includes roles at Citigroup, Microsoft, and several startups. Ajay holds both a bachelor’s and master’s degree in computer science from MIT and an MBA from the MIT Sloan School of Management. He is an avid runner and conscientiously maintains an up-to-date list of the top ice cream shops in NYC.

Presentations

Why the internet of things doesn’t exist but will still reshape your business Session

Ajay Kulkarni explores the underlying changes that are characterizing the next wave of computing and shares several ways in which individual businesses and overall industries will be transformed.

Abhishek Kumar is a senior manager of data science in Publicis Sapient’s India office, where he looks after scaling up the data science practice by applying machine learning and deep learning techniques to domains such as retail, ecommerce, marketing, and operations. Abhishek is an experienced data science professional and technical team lead specializing in building and managing data products from conceptualization to the deployment phase and interested in solving challenging machine learning problems. Previously, he worked in the R&D center for the largest power-generation company in India on various machine learning projects involving predictive modeling, forecasting, optimization, and anomaly detection and led the center’s data science team in the development and deployment of data science-related projects in several thermal and solar power plant sites. Abhishek is a technical writer and blogger as well as a Pluralsight author and has created several data science courses. He’s also a regular speaker at various national and international conferences and universities. Abhishek holds a master’s degree in information and data science from the University of California, Berkeley. Abhishek has spoken at past O’Reilly conferences, including Strata 2019, Strata 2018, and AI 2019.

Presentations

Deep learning-based search and recommendation systems using TensorFlow Tutorial

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning, drawing on a real-world implementation for an ecommerce client.
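
As a taste of the approach (a minimal sketch under invented dimensions, not the presenters’ implementation), a deep recommender can pair user and item embeddings and score their affinity with a dot product:

    # Minimal embedding-based recommender sketch in Keras.
    # Vocabulary sizes and embedding width are invented.
    import tensorflow as tf

    n_users, n_items, dim = 10_000, 5_000, 32

    user_in = tf.keras.Input(shape=(1,), dtype="int32")
    item_in = tf.keras.Input(shape=(1,), dtype="int32")
    emb_u = tf.keras.layers.Embedding(n_users, dim)
    emb_i = tf.keras.layers.Embedding(n_items, dim)
    u = tf.keras.layers.Flatten()(emb_u(user_in))
    i = tf.keras.layers.Flatten()(emb_i(item_in))
    score = tf.keras.layers.Dot(axes=1)([u, i])  # user-item affinity

    model = tf.keras.Model([user_in, item_in], score)
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))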

Manoj Kumar is a senior software engineer on the data team at LinkedIn, where he is currently working on auto-tuning Hadoop jobs. He has more than four years of experience in big data technologies like Hadoop, MapReduce, Spark, HBase, Pig, Hive, Kafka, and Gobblin. Previously, he worked on the data framework for slicing and dicing (30 dimensions, 50 metrics) advertising data at PubMatic and worked at Amazon.

Presentations

TuneIn: How to get your jobs tuned while you are sleeping Session

Have you ever tuned a Spark or MR job? If the answer is yes, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.
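
TuneIn itself is internal to LinkedIn, but the core loop it automates can be sketched abstractly (run_job and its cost metric below are hypothetical stand-ins for a real submit-and-measure cycle):

    # Toy auto-tuning loop: try candidate settings, keep the
    # cheapest successful run. run_job() is a hypothetical
    # stand-in for submitting a job and collecting metrics.
    candidates = [
        {"spark.executor.memory": "2g", "spark.executor.cores": "2"},
        {"spark.executor.memory": "4g", "spark.executor.cores": "2"},
        {"spark.executor.memory": "4g", "spark.executor.cores": "4"},
    ]

    def run_job(params):
        """Hypothetical: submit with params, return observed metrics."""
        return {"succeeded": True, "memory_gb_hours": 12.0}

    best, best_cost = None, float("inf")
    for params in candidates:
        result = run_job(params)
        if result["succeeded"] and result["memory_gb_hours"] < best_cost:
            best, best_cost = params, result["memory_gb_hours"]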

Pralabh Kumar is a senior software engineer on the data team at LinkedIn, where he is working on auto-tuning Spark jobs. He has more than seven years of experience in big data technologies like Spark, Hadoop, MapReduce, Cassandra, Hive, Kafka, and ELK. He contributes to Spark and Livy and has filed a couple of patents. Previously, he worked on the real-time system for unique customer identification at Walmart. He holds a degree from the University of Texas at Dallas.

Presentations

TuneIn: How to get your jobs tuned while you are sleeping Session

Have you ever tuned a Spark or MR job? If the answer is yes, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.

Jared P. Lander is chief data scientist of Lander Analytics, where he oversees the long-term direction of the company and researches the best strategy, models, and algorithms for modern data needs. He specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. In addition to his client-facing consulting and training, Jared is an adjunct professor of statistics at Columbia University and the organizer of the New York Open Statistical Programming Meetup and the New York R Conference. He is the author of R for Everyone, a book about R programming geared toward data scientists and nonstatisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world and was a member of the 2014 Strata New York selection committee. His writings on statistics can be found at Jaredlander.com. He was recently featured in the Wall Street Journal for his work with the Minnesota Vikings during the 2015 NFL Draft. Jared holds a master’s degree in statistics from Columbia University and a bachelor’s degree in mathematics from Muhlenberg College.

Presentations

Modeling time series in R Session

Temporal data is being produced in ever-greater quantity, but fortunately our time series capabilities are keeping pace. Jared Lander explores techniques for modeling time series, from traditional methods such as ARMA to more modern tools such as Prophet and machine learning models like XGBoost and neural nets. Along the way, Jared shares theory and code for training these models.
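
The session is taught in R, but the classical starting point translates directly; as a language-neutral illustration (the series and model order are invented), an ARMA fit in Python’s statsmodels looks like this:

    # ARMA fit sketched in Python for illustration (the session
    # itself uses R). The series and model order are invented.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    y = np.cumsum(rng.normal(size=200))    # toy random-walk series

    fit = ARIMA(y, order=(1, 0, 1)).fit()  # ARMA(1,1), i.e., d=0
    forecast = fit.forecast(steps=10)      # ten steps ahead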

Paul Lashmet is practice lead and advisor for financial services at Arcadia Data, a company that provides visual big data analytics software that empowers business users to glean meaningful and real-time business insights from high-volume and varied data in a timely, secure, and collaborative way. Paul writes extensively about the practical applications of emerging and innovative technologies to regulatory compliance. Previously, he led programs at HSBC, Deutsche Bank, and Fannie Mae.

Presentations

Visualize AI to spot new trading opportunities Findata

Artificial intelligence and deep learning are used to generate and execute trading strategies. Regulators and investors demand transparency into investment decisions, but the decision-making processes of machine learning technologies are opaque. Paul Lashmet explains how these same machines generate data that can be visualized to spot new trading opportunities.

Josh Laurito is the director of analytics at Squarespace. Previously, Josh worked at Gawker Media and Univision Digital and taught data visualization at the City University of New York. He also helped start a data visualization company and cowrote a book on simulation modeling in finance. Long before any of that, he asked for an abacus for his fourth birthday. If you are interested in learning more, please visit his website.

Presentations

Building it beautiful: Analyzing the effectiveness of platform products and marketing at scale Session

Joshua Laurito explores the systems Squarespace built for acquiring data, enforcing consistency on it, and inferring conclusions about the company’s marketing and product initiatives. Joshua discusses the intricacies of gathering and evaluating marketing and user data, from raising awareness to driving purchases, and shares the results of previous analyses.

Francesca Lazzeri is a senior machine learning scientist at Microsoft on the cloud advocacy team and an expert in big data technology innovations and the applications of machine learning-based solutions to real-world problems. Her research has spanned the areas of machine learning, statistical modeling, time series econometrics and forecasting, and a range of industries—energy, oil and gas, retail, aerospace, healthcare, and professional services. Previously, she was a research fellow in business economics at Harvard Business School, where she performed statistical and econometric analysis within the technology and operations management unit. At Harvard, she worked on multiple patent, publication, and social network data-driven projects to investigate and measure the impact of external knowledge networks on companies’ competitiveness and innovation. Francesca periodically teaches applied analytics and machine learning classes at universities and research institutions around the world. She’s a data science mentor for PhD and postdoc students at the Massachusetts Institute of Technology and a speaker at academic and industry conferences, where she shares her knowledge and passion for AI, machine learning, and coding.

Presentations

A day in the life of a data scientist: How do we train our teams to get started with AI? Session

With the growing buzz around data science, many professionals want to learn how to become a data scientist—the role Harvard Business Review called the "sexiest job of the 21st century." Francesca Lazzeri and Jaya Mathew explain what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses.

Julien Le Dem is a principal engineer at WeWork. He’s also the cocreator of Apache Parquet and the PMC chair of the project, and he’s a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

From flat files to deconstructed database: The evolution and future of the big data ecosystem Session

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.

Randy Lea is chief revenue officer at Arcadia Data, where he is charged with leading the company’s sales momentum. Randy is passionate about solving customer problems by leveraging analytics and data. An early participant in the data warehouse and BI analytics market, he has held leadership positions at companies including Aster Data, Think Big Analytics, and Teradata. Randy holds a bachelor’s degree in marketing from California State University, Fullerton.

Presentations

A tale of two BI standards: Data warehouses and data lakes (sponsored by Arcadia Data) Session

The use of data lakes continues to grow, and the right business intelligence (BI) and analytics tools on data lakes are critical to data lake success. Randy Lea explains why existing BI tools work well for data warehouses but not data lakes and why every organization should have two BI standards: one for data warehouses and one for data lakes.

Jonathan Lehr is a cofounder and general partner at Work-Bench, where he focuses on early-stage enterprise technology investments in areas including AI/ML infrastructure and applications, cybersecurity, cloud-native infrastructure, and the future of work. Previously, Jon worked at Morgan Stanley on the Office of the CIO team in IT, where he partnered with internal technology clients to facilitate the selection and onboarding of emerging technology vendors. He has written about enterprise technology trends for publications such as the Wall Street Journal’s CIO Journal and TechCrunch. Jon founded the NY Enterprise Technology meetup in January 2012 and organizes monthly meetups of the 5,000+ person group as a way to promote collaboration for the enterprise tech ecosystem in New York, with a focus on connecting entrepreneurs, Fortune 500 technologists, investors, and graduate students to network and learn from one another. He holds a BSE in bioengineering, with minors in mathematics and economics, from the University of Pennsylvania.

Presentations

VC trends in machine learning and data science Session

In this panel discussion, venture capital investors explain how startups can accelerate enterprise adoption of machine learning and explore the new tech trends that will give rise to the next transformation in the big data landscape.

Danielle Leighton is director of data science at T4G, where she helps clients approach, design, implement, and integrate new insights and advanced analytics data products that align with their business goals. She currently focuses most of her time on data science in the energy sector. She’s passionate about keeping data in context and applying research methods, best practices, and academic algorithms to industry business needs. With a strong background in machine learning, Danielle identifies the math, visualizations, and the business questions and processes necessary to create reliable predictive models and, ultimately, good, data-driven business guidance. Danielle has worked in healthcare, academia, government, retail, gaming, and energy and with quantified selfers, biohackers, hacklabs, and makerspaces. She is notoriously unreadable to GSR wearables. In her previous life, Danielle worked with the world’s most sophisticated wearable to date, the hearing aid.

Presentations

From theory to data product: Applying data science methods to effect business change Tutorial

Janet Forbes, Danielle Leighton, and Lindsay Brin lead a primer on crafting well-conceived data science projects that uncover valuable business insights. Using case studies and hands-on skills development, Janet, Danielle, and Lindsay walk you through essential techniques for effecting real business change.

Bob Levy is CEO of Virtual Cove, Inc., where he and his team leverage virtual and augmented reality to help people see datasets in a completely new light. A veteran tech executive, Bob brings over two decades of product leadership experience at companies such as IBM, Harte Hanks, and MathWorks. In 2001, he served as founding president of the BPMA, a 6,000+ person industry group.

Presentations

Augmented reality: Going beyond plots in 3D Session

Augmented reality opens a completely new lens on your data through which you see and accomplish amazing things. Bob Levy explains how to use simple Python scripts to leverage completely new plot types. You'll explore use cases revealing new insight into financial markets data as well as new ways of interacting with data that build trust in otherwise “black box” machine learning solutions.
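
The “simple Python scripts” feed data into an AR renderer, which is product-specific; the conventional on-screen starting point, a 3D scatter of three data dimensions, can be sketched with matplotlib (random data for illustration):

    # 3D scatter sketch; an AR product would render the same
    # (x, y, z) columns in headset space. Data is random.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x, y, z = rng.normal(size=(3, 500))

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(x, y, z, s=8)
    ax.set_xlabel("x"); ax.set_ylabel("y"); ax.set_zlabel("z")
    plt.show()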

Jennifer Lim leads the enterprise architecture team at Cerner Corporation, a company focused on creating intelligent solutions for the healthcare industry. Enterprise architecture (EA) is committed to driving business value by maximizing technology investments, optimizing business capabilities, and managing risks, ultimately accelerating Cerner’s vision and business outcomes in areas such as data management and governance, data architecture, API life cycle management, engineering practices, user experience, and workforce collaboration. Jennifer has over 18 years of experience in the telecommunications, banking, federal, and healthcare IT industries. She has led both IT and business teams across a variety of functional areas, including data management, finance analytics, marketing, research analytics, IT architecture, and application development. Jennifer holds a BS in management information systems and an MBA in management.

Presentations

Modernizing operational architecture with big data: Creating and implementing a modern data strategy Data Case Studies

The use of data throughout Cerner had taxed the company's legacy operational data store, data warehouse, and enterprise reporting pipeline to the point where it would no longer scale to meet needs. Jennifer Lim explains how Cerner modernized its corporate data platform with the use of a hybrid cloud architecture.

Chang Liu is an applied research scientist at Georgian Partners and a member of the Georgian impact team, where she draws on her in-depth knowledge of mathematical and combinatorial optimization to help Georgian’s portfolio companies. Previously, Chang was a risk analyst at Manulife Bank, where she built models to assess the bank’s risk exposure based on extensive market research, including evaluating and predicting the impact of the 2014 oil price drop on mortgage lending risk in Alberta. Chang holds a master of applied science in operations research from the University of Toronto, where she specialized in combinatorial optimization, and a bachelor’s degree in mathematics from the University of Waterloo.

Presentations

Solving the cold start problem: Data and model aggregation using differential privacy Session

Chang Liu offers an overview of a common problem faced by many software companies, the cold-start problem, and explains how Georgian Partners has been successful at solving this problem by transferring knowledge from existing data through differentially private data aggregation.
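
The mechanism behind “differentially private data aggregation” is simple to sketch: add calibrated noise before sharing an aggregate so no single record’s influence is detectable. A minimal Laplace-mechanism example follows (bounds and epsilon are illustrative; this is not Georgian Partners’ implementation):

    # Laplace mechanism sketch: release a noisy mean so any single
    # record's influence is bounded. Values are illustrative.
    import numpy as np

    def private_mean(values, lower, upper, epsilon):
        clipped = np.clip(values, lower, upper)
        sensitivity = (upper - lower) / len(clipped)  # of the mean
        noise = np.random.laplace(scale=sensitivity / epsilon)
        return clipped.mean() + noise

    data = np.array([4.2, 3.9, 5.1, 4.7])
    print(private_mean(data, lower=0.0, upper=10.0, epsilon=1.0))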

Mathew Lodge is senior vice president of product and marketing at Anaconda. Mathew has well over 20 years’ diverse experience in cloud computing and product leadership. Previously, he was chief operating officer at container and microservices networking and management startup Weaveworks; vice president of VMware’s Cloud Services Group and cofounder of what became VMware’s vCloud Air IaaS service; and senior director of Symantec’s $1B+ Information Management Group. Early in his career, Mathew built compilers and distributed systems for projects like the International Space Station, helped connect six countries to the internet for the first time, managed a $630M router product line at Cisco, and attempted to do SDN 10 years too early at CPlane.

Presentations

Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda) Session

The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Welcome to the future. Containers and Kubernetes make great language-agnostic distributed computing clusters: it's just as easy to deploy Python as it is Java. Mathew Lodge shows you how.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience. He enjoys intelligent design and engaging storytelling and is passionate about data, music, and nature.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez, Radhika Ravirala, Paul Sears, and Bruno Faria walk you through building a big data application using a combination of open source technologies and AWS managed services.

Ben Lorica is the chief data scientist at O’Reilly. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Managing risk in machine learning Keynote

As companies begin adopting machine learning, important considerations, including fairness, transparency, privacy, and security, need to be accounted for. Ben Lorica offers an overview of recent tools for building privacy-preserving and secure machine learning products and services.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Cristobal Lowery is a senior manager and team lead for Baringa Partners’ modeling and machine learning centre of excellence, where he led the creation of Baringa’s data science and analytics team and supports its clients on their journeys to become leaders in artificial intelligence. Previously, he was an independent data science consultant at an investment bank and for a leading Formula 1 team. Cristobal is a passionate advocate of artificial intelligence and its potential to transform businesses. He holds two first-class master’s degrees in quantitative subjects and has published and patented a machine learning system.

Presentations

Predicting residential occupancy and hot water usage from high-frequency, multivector utilities data Session

In EU households, heating and hot water alone account for 80% of energy usage. Cristobal Lowery and Marc Warner explain how future home energy management systems could improve their energy efficiency by predicting resident needs through utilities data, with a particular focus on the key data features, the need for data compression, and the data quality challenges.

Kevin Lu is a software engineer at PayPal developing various Kafka components. He holds a degree in computer science from the University of California, Berkeley. Kevin first discovered his passion for coding in high school, when he developed plug-ins for Minecraft.

Presentations

Kafka at PayPal: Enabling 400 billion messages a day Session

PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing.
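
One theme of the talk, data-loss auditing, starts at the producer: every delivery is confirmed or logged. A minimal sketch with the confluent-kafka Python client (the broker address and topic are placeholders, not PayPal’s setup):

    # Producer-side delivery auditing sketch (confluent-kafka).
    # Broker address and topic are placeholders.
    from confluent_kafka import Producer

    p = Producer({"bootstrap.servers": "broker:9092", "acks": "all"})

    def on_delivery(err, msg):
        # A real auditing pipeline would log these outcomes durably.
        if err is not None:
            print(f"delivery failed: {err}")
        else:
            print(f"delivered to {msg.topic()}[{msg.partition()}]")

    p.produce("payments", b"event-payload", callback=on_delivery)
    p.flush()  # block until outstanding deliveries are confirmed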

Joseph Lubin is a cofounder of blockchain computing platform Ethereum and the founder of Consensus Systems (ConsenSys), a blockchain venture studio. ConsenSys is one of the largest and fastest-growing companies in the blockchain technology space, building developer tools, decentralized applications, and solutions for enterprises and governments that harness the power of Ethereum. Previously, Joseph was a software engineer and consultant, working with eMagine on the Identrus project and cofounding and operating a hedge fund with a partner; director of the New York office of Blacksmith Software Consulting; and vice president of technology in private wealth management at Goldman Sachs, where he focused on the intersection of cryptography, engineering, and finance. Earlier, he worked in the Princeton Robotics Lab; at tomandandy music, where he developed an autonomous music composition tool; and at private research firm Vision Applications Inc., where he built autonomous mobile robots. He also spent time in Kingston, Jamaica, working on projects in the music industry. Joseph holds a degree in electrical engineering and computer science from Princeton University.

Presentations

The power of Ethereum Keynote

Ethereum is a world computer on top of a peer-to-peer network that runs smart contracts: applications that run exactly as programmed without the possibility of censorship, fraud, or third-party interference. Until now, businesses had to build their systems on database technologies that resulted in siloed and redundant information in typically adversarial contexts.

Boris Lublinsky is a principal architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Previously, he was responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural road maps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He’s also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on Kafka streaming microservices with Akka Streams and Kafka Streams Tutorial

Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. You'll also discover a few ML model serving ideas along the way.

Gerard Maas is a senior software engineer at Lightbend, where he contributes to the Fast Data Platform and focuses on the integration of stream processing technologies. Previously, he held leading roles at several startups and large enterprises, building data science governance, cloud-native IoT platforms, and scalable APIs. He is the coauthor of Stream Processing with Apache Spark from O’Reilly. Gerard is a frequent speaker and contributes to small and large open source projects. In his free time, he tinkers with drones and builds personal IoT projects.

Presentations

Processing fast data with Apache Spark: A tale of two APIs Session

Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. Gerard Maas offers a critical overview of their differences with regard to key aspects of a streaming application: API usability, dealing with time, dealing with state and machine learning capabilities, and more. You'll learn when to pick one over the other or combine both to implement resilient streaming pipelines.
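
To ground the comparison, the newer of the two APIs expresses a streaming job as a query over an unbounded table; the stock word-count example (toy socket source on localhost:9999) looks like this:

    # Structured Streaming word count (the stock toy example).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("ss-sketch").getOrCreate()

    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    (counts.writeStream.outputMode("complete")
     .format("console").start().awaitTermination())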

Swetha Machanavajhala is a software engineer for Azure Networking at Microsoft, where she builds tools to help engineers detect and diagnose network issues within seconds. She is very passionate about building products and awareness for people with disabilities and has led several related projects at hackathons, driving them from idea to reality to launching as a beta product and winning multiple awards. Swetha is a co-lead of the Disability Employee Resource Group, where she represents the community of people who are deaf or hard of hearing, and is a part of the ERG chair committee. She is also a frequent speaker at both internal and external events.

Presentations

Deep learning on audio in Azure to detect sounds in real time Session

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people in the world who are deaf or hard of hearing. Swetha Machanavajhala and Xiaoyong Zhu explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure.
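
A common pattern for this kind of sound detection (a generic sketch, not the presenters’ Azure pipeline; the file path and class count are invented) is to turn audio into a mel spectrogram and classify it with a small convolutional network:

    # Sound-classification sketch: mel spectrogram + small CNN.
    # A generic pattern; file path and class count are invented.
    import librosa
    import numpy as np
    import tensorflow as tf

    audio, sr = librosa.load("doorbell.wav", sr=16000)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64)
    x = librosa.power_to_db(mel)[np.newaxis, ..., np.newaxis]

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu",
                               input_shape=x.shape[1:]),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),  # 10 sound classes
    ])
    probs = model(x)  # untrained; shows the shape of the pipeline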

Mark Madsen is a Fellow at Teradata, where he’s responsible for understanding, forecasting, and defining analytics ecosystems and architectures. Previously, he was CEO of Third Nature, where he advised companies on data strategy and technology planning, and vendors on product management. Mark has designed analysis, machine learning, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Michael Mahoney is the worldwide vice president of solution engineering for Kinetica. He has over 20 years of experience leading and developing highly skilled and professional solution engineering teams. Previously, Michael was the vice president of solution engineers at MapR, transforming the team into an enterprise-oriented organization, and spent 10 years at Oracle in various groups within applications and technology, focusing on driving analytic solutions for their strategic customers. Michael has also held various management and individual contributor positions at Transamerica, Witness Systems, and Cognos. He holds a BS in business specializing in management information systems.

Presentations

Speed, scale, smarts: GPU-powered analytics for the extreme data economy (sponsored by Kinetica) Session

Michael Mahoney demonstrates how to leverage the power of GPUs to converge streaming data analysis, location analysis, and streamlined machine learning with a single engine. Along the way, Michael shares real-world case studies on how Kinetica is used to solve complex data challenges.

Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.
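
As one concrete seam in such an architecture (broker and topic names are placeholders), Spark Structured Streaming can read directly from Kafka and treat the stream as a table:

    # Kafka-to-Spark seam sketch; broker and topic are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-sketch").getOrCreate()

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "customer-events")
              .load())

    # Kafka rows arrive as binary key/value; cast before parsing.
    parsed = events.selectExpr("CAST(value AS STRING) AS json")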

Big data at speed Session

Many details go into building a big data system for speed, from determining acceptable latency for data access and where to store the data to solving multiregion problems—or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed.

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Creating a successful big data practice in your organization presents new challenges in managing projects and teams. Ted Malaska and Jonathan Seidman share guidance and best practices to help technical leaders deliver successful projects from planning to implementation.

Hilary Mason is vice president of research at Cloudera Fast Forward Labs and data scientist in residence at Accel Partners. Previously, Hilary was chief scientist at Bitly. She cohosts DataGotham, a conference for New York’s homegrown data community, and cofounded HackNY, a nonprofit that helps engineering students find opportunities in New York’s creative technical economy. She’s on the board of the Anita Borg Institute and an advisor to several companies, including SparkFun Electronics, Wildcard, and Wonder. Hilary served on Mayor Bloomberg’s Technology Advisory Board and is a member of Brooklyn hacker collective NYC Resistor.

Presentations

Practical ML today and tomorrow Keynote

Machine learning and artificial intelligence are exciting technologies, but real value comes from marrying those capabilities with the right business problems. Hilary Mason explores the current state of these technologies, investigates what's coming next in applied machine learning, and explains how to identify and execute on the right business opportunities at the right time.

Jaya Mathew is a senior data scientist on the artificial intelligence and research team at Microsoft, where she focuses on the deployment of AI and ML solutions to solve real business problems for customers in multiple domains. Previously, she worked on analytics and machine learning at Nokia and Hewlett Packard Enterprise. Jaya holds an undergraduate degree in mathematics and a graduate degree in statistics from the University of Texas at Austin.

Presentations

A day in the life of a data scientist: How do we train our teams to get started with AI? Session

With the growing buzz around data science, many professionals want to learn how to become a data scientist—the role Harvard Business Review called the "sexiest job of the 21st century." Francesca Lazzeri and Jaya Mathew explain what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses.

Jim McHugh is vice president and general manager at NVIDIA. He currently leads DGX-1, the world’s first AI supercomputer in a box. Jim focuses on building a vision of organizational success and executing strategies to deliver computing solutions that benefit from GPUs in the data center. With over 25 years of experience as a marketing and business executive with startup, mid-sized, and high-profile companies, Jim has a deep knowledge and understanding of business drivers, market/customer dynamics, technology-centered products, and accelerated solutions. Previously, Jim held leadership positions with Cisco Systems, Sun Microsystems, and Apple, among others.

Presentations

GPU-accelerated analytics and machine learning ecosystems (Inception Showcase sponsored by NVIDIA) Session

Explore case studies from Datalogue, FASTDATA.io, and H2O.ai that demonstrate how GPU-accelerated analytics, machine learning, and ETL help companies overcome slow queries and tedious data preparation processes, dynamically correlate among data, and enjoy automatic feature engineering.

Les McMonagle is vice president of security strategy at BlueTalon. Les has over 20 years’ experience in information security. Previously, he was chief information security officer (CISO) for a credit card company and ILC bank; founded a computer training and IT outsourcing company in Europe; directed the security and network technology practice for Cambridge Technology Partners across Europe; helped several security technology firms develop their initial product strategy; founded and managed Teradata’s Information Security, Data Privacy, and Regulatory Compliance Center of Excellence; and was chief security strategist at Protegrity. Les holds a BS in MIS as well as a number of relevant industry certifications, including CISSP, CISA, and ITIL.

Presentations

Privacy by design: Building in data privacy and protection versus bolting it on later Session

Privacy by design is a fundamentally important approach to achieving compliance with GDPR and other data privacy or data protection regulations. Les McMonagle outlines how organizations can save time and money while improving data security and regulatory compliance and dramatically reduce the risk of a data breach or expensive penalties for noncompliance.

Matteo Merli is a software engineer at Streamlio working on messaging and storage technologies. Previously, he spent several years at Yahoo building database replication systems and multitenant messaging platforms. Matteo was the architect and lead developer for Yahoo Pulsar and a member of the PMC of Apache BookKeeper.

Presentations

High-performance messaging with Apache Pulsar Session

Apache Pulsar is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it's very important to ensure that the system can make use of all the available resources. Karthik Ramasamy and Matteo Merli share insights into the design decisions and the implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.
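
For readers new to Pulsar, the client API is compact; a minimal produce-and-consume sketch with the pulsar-client Python library (service URL, topic, and subscription name are placeholders):

    # Minimal Pulsar produce/consume sketch (pulsar-client).
    # Service URL, topic, and subscription name are placeholders.
    import pulsar

    client = pulsar.Client("pulsar://localhost:6650")

    producer = client.create_producer("ingest-topic")
    producer.send(b"hello")      # persisted durably via BookKeeper

    consumer = client.subscribe("ingest-topic", "my-subscription")
    msg = consumer.receive()
    consumer.acknowledge(msg)    # ack so the subscription cursor advances

    client.close()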

Jeff Miller is vice president of data and analytics at GE, where he oversees the product management and development of a diversified set of analytics products and data science services for GE’s industrial businesses and finance professionals and leads a multidisciplinary team of both technical and functional resources in the development of a suite of persona-aware solutions. Drawing on his unique blend of financial system, data engineering, business intelligence, and product management experience, Jeff provides architectural leadership and oversight across the data lifecycle, from ingestion and modeling to visualization and delivery.

Presentations

Augmented data engineering: Leveraging machine learning in data profiling and discovery (sponsored by Io-Tahoe) Session

Arun Murugan and Jeff Miller detail how complex relationships are discovered and modeled to simplify analytics while keeping an Agile architecture for data acquisition. You’ll see how GE uses machine learning (powered by Io-Tahoe) in data discovery and profiling to support the data engineering behind a standard data model essential to enterprise use cases.

Cory Minton is a staff technologist on the ready solutions team at Dell EMC, where he works hand in hand with clients across the globe to assess and develop big data strategies, architect technology solutions, and ensure successful deployments of these transformational initiatives. A geek, technology evangelist, and business strategist, Cory is focused on finding creative ways for organizations to drive the utmost value from their data while transforming IT’s relevance to the organizations and customers they serve. With a diverse background in IT applications, consulting, data center infrastructure, and the expanding big data ecosystem, Cory brings an interesting perspective to the clients he serves while consistently challenging them to think bigger. Cory holds an undergraduate degree in engineering from Texas A&M University and an MBA from Tennessee Tech University. Cory resides in Birmingham, Alabama, with his beautiful wife and two awesome children.

Presentations

DIY versus designer approaches to deploying data center infrastructure for machine learning and analytics Session

Cory Minton and Colm Moynihan explain how to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble.

Mridul Mishra is an architect at Fidelity Investments, where he is responsible for emerging technology in the Asset Management Group as well as for machine learning and AI projects. Mridul has around 21 years of experience building enterprise software, ranging from core trading software to smart applications using AI/ML capabilities.

Presentations

Explainable artificial intelligence (XAI): Why, when, and how? Findata

Currently, most ML models—and particularly those for deep learning—work like a black box. As a result, a key challenge in their adoption is the need for explainability. Mridul Mishra discusses the need for explainability and its current state. Mridul then provides a framework for considering these needs and offers potential solutions.
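
The talk surveys the space rather than any one tool, but to make “explainability” concrete: feature-attribution libraries such as SHAP (used here purely as an illustration; it is not named by the speaker) decompose a prediction into per-feature contributions:

    # Feature-attribution sketch with SHAP, one approach to XAI.
    # Model and data are invented for illustration.
    import shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X[:10])  # per-feature attributions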

Sanjeev Mohan leads big data research for technical professionals at Gartner, where he researches trends and technologies for relational and NoSQL databases, object stores, and cloud databases. His areas of expertise span the end-to-end data pipeline, including ingestion, persistence, integration, transformation, and advanced analytics. Sanjeev is a well-respected speaker on big data and data governance. His research includes machine learning and the IoT. He also serves on a panel of judges for many Hadoop distribution organizations, such as Cloudera and Hortonworks.

Presentations

Executive Briefing: Enhance your data lake with comprehensive data governance to improve adoption and meet compliance needs Session

If the last few years were spent proving the value of data lakes, the emphasis now is to monetize the big data architecture investments. The rallying cry is to onboard new workloads efficiently. But how do you do so if you don’t know what data is in the lake, the level of its quality, or the trustworthiness of models? Sanjeev Mohan explains why data governance is the linchpin to success.

Andrew Montalenti is the cofounder and CTO of Parse.ly, a widely used real-time web content analytics platform. The product is trusted daily by editors at HuffPost, Time, TechCrunch, Slate, Quartz, the Wall Street Journal, and over 350 other leading digital companies. Andrew is a dedicated Pythonista and has presented his team’s work at the PyCon and PyData conferences. He is also the cohost of the web data and analytics podcast The Center of Attention. For more information, check out Parse.ly’s research on internet attention via @parsely.

Presentations

Applying petabyte-scale analytics and machine learning to billions of news reading sessions Session

What can we learn from a one-billion-person live poll of the internet? Andrew Montalenti explains how Parse.ly has gathered a unique dataset of news reading sessions from billions of devices, peaking at over two million sessions per minute on thousands of high-traffic news and information websites, and how the company uses this data to unearth the secrets behind online content.

Richard Mooney is the lead product manager for SAP’s Predictive Analytics product portfolio, including SAP Analytics Cloud Predictive, Predictive Analytics, and Predictive Analytics Application Edition. Richard has 18 years of experience in the software industry, including roles in development, product management, sales, and marketing. Richard also spent two years as an innovation expert, using techniques like design thinking, ROI analysis, and ideation to drive customer innovation and value. Richard lives in Kilkenny, Ireland, with his wife Anne, two children, and a very energetic border collie.

Presentations

Bringing together machine and human intelligence (sponsored by SAP) Session

Intelligent enterprises—fueled by rapid advances in artificial intelligence (AI), machine learning (ML), and the internet of things (IoT)—promise significant business value. Richard Mooney explains how to achieve the game-changing outcomes of an intelligent enterprise, delivering value across business functions with the synergy of machine and human intelligence.

Steve Morgan is a software engineer and lead architect at Lockheed Martin.

Presentations

Self-service modern analytics on the GovCloud Session

Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from its information assets, the company is constantly exploring ways to enable effective self-service scenarios. Ramesh Krishnan and Steve Morgan discuss Lockheed Martin's journey into modern analytics and explore its analytics platform focused on leveraging AWS GovCloud.

Colm Moynihan is partner presales manager for EMEA at Cloudera, where he helps system integrators, ISVs, hardware and cloud partners, resellers, and distributors drive digital transformation for joint customers. Previously, Colm was director of presales in EMEA at Informatica, working with resellers, OEMs, and GSIs to integrate, master, and cleanse customers’ enterprise data. Colm has over 25 years’ experience in development, consulting, finance and banking, startups, and large multinational software companies. He holds a master’s degree in distributed computing from Trinity College Dublin.

Presentations

DIY versus designer approaches to deploying data center infrastructure for machine learning and analytics Session

Cory Minton and Colm Moynihan explain how to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble.

Francesco Mucio is a data consultant. The first time Francesco met the word data, it was just the plural of datum; now he’s building a small consulting firm to help organizations avoid or solve some of the problems he’s seen in the past. He likes to draw data models and optimize queries. He spends his free time with his daughter, who, for some reason, speaks four languages.

Presentations

Scaling data infrastructure in the fashion world; or, “What is this? Business intelligence for ants?” Session

Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead.

Kurt Muehmel is the vice president of solutions engineering at Dataiku, where he has built analytics and AI solutions for Fortune 100 companies and now leads the company’s solutions engineering capability worldwide. Having worked with dozens of clients of all sizes and across a multitude of sectors, Kurt has developed a deep understanding of the challenges and opportunities for companies looking to increase the value they derive from their data and expand the capabilities of their growing teams of data scientists, engineers, and analysts. In a career that’s spanned several international moves, he’s worked for the United Nations, a Big Four consultancy, and a struggling high school in the Paris suburbs.

Presentations

From analytic silos to analytic democratization: How (and why) companies make the shift (sponsored by Dataiku) Session

By creating a collaborative and interactive analytic environment, a forward-thinking company may harness the best capabilities of its business analysts and data scientists to answer the company’s most pressing business questions. Deborah Reynolds and Kurt Muehmel explain how large enterprises can successfully put data at the core of everyday business decisions.

Ash Munshi is CEO of Pepperdata. Previously, Ash was executive chairman for deep learning startup Marianas Labs (acquired by Askin in 2015); CEO of big data storage startup Graphite Systems (acquired by EMC DSSD in 2015); CTO of Yahoo; and CEO of a number of other public and private companies. He serves on the board of several technology startups.

Presentations

Classifying job execution using deep learning Session

Ash Munshi outlines a technique for labeling applications using runtime measurements of CPU, memory, and network I/O along with a deep neural network. This labeling groups the applications into buckets that have understandable characteristics, which can then be used to reason about the cluster and its performance.
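
To make the idea concrete, here is a minimal sketch, assuming scikit-learn and synthetic data, of bucketing jobs by their runtime profile with a small neural network. The feature set, labels, and model shape are illustrative assumptions, not Pepperdata's actual pipeline.

    # Illustrative only: bucket jobs into workload classes from runtime
    # measurements. Features and labels here are synthetic placeholders.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(42)

    # Hypothetical per-job measurements: [cpu_util_pct, mem_gb, net_mb_s]
    X = rng.random((1000, 3)) * [100, 64, 500]
    # Hypothetical labels: 0 = CPU-bound, 1 = memory-bound, 2 = I/O-bound
    y = np.argmax(X / [100, 64, 500], axis=1)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_train)

    model = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
    model.fit(scaler.transform(X_train), y_train)
    print("holdout accuracy:", model.score(scaler.transform(X_test), y_test))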

Arun Murugan is senior director of data engineering at GE, where he leads the data architecture team at GE Digital responsible for data modeling, data engineering, and the core ETL framework for the finance data lake within GE. He has extensive experience translating complex business requirements into technology solutions, with a specialization in creating data-centric processes involving complex data integration and analytical solutions. Arun has expertise in designing and architecting enterprise data lakes by leveraging big data ecosystems, state-of-the-art tools and technologies, and multiple distributed data processing platforms.

Presentations

Augmented data engineering: Leveraging machine learning in data profiling and discovery (sponsored by Io-Tahoe) Session

Arun Murugan and Jeff Miller detail how complex relationships are discovered and modeled to simplify analytics while keeping an Agile architecture for data acquisition. You’ll see how GE uses machine learning (powered by Io-Tahoe) for data discovery and profiling in data engineering and for developing a standard data model essential to enterprise use cases.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Setting up a lightweight distributed caching layer using Apache Arrow Session

Jacques Nadeau leads a deep dive into a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture—including the cache life cycle, update patterns, cache cohesion, and appropriate use cases—learn how it all works, and see it in action.
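
As background for this session, the sketch below round-trips a record batch through Arrow's IPC stream format with pyarrow. It illustrates the wire format such a cache would serve; it is not the cache project itself.

    # Round-trip a record batch through Arrow's IPC stream format.
    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
        names=["id", "label"],
    )

    # Serialize to an in-memory IPC stream...
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    buf = sink.getvalue()

    # ...and read it back; Arrow readers can consume the buffers zero-copy.
    reader = pa.ipc.open_stream(buf)
    for b in reader:
        print(b.num_rows, b.schema.names)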

Karthikeyan Nagalingam is a big data analytics technical marketing engineer at NetApp. His roles include architecting Hadoop solutions, engineering proofs of concept, presenting Hadoop solutions to customers, field experts, and partners at events such as NetApp Insight, attending meetups at Research Triangle Park and the NetApp Executive Briefing Center, and assisting field engineers and customers with presales and postsales issues. He holds an MS in software systems from Birla Institute of Technology.

Presentations

Accelerate big data analytics and AI with NetApp hybrid cloud architecture (sponsored by NetApp) Session

As the data authority for hybrid cloud for big data analytics and AI, NetApp understands the value of the access, management, and control of data. Karthikeyan Nagalingam discusses the NetApp Data Fabric, which provides a unified data management environment that spans edge devices, data centers, and multiple hyperscale clouds using ONTAP software, all-flash systems, ONTAP Select, and cloud volumes.

Syed Nasar is a solutions architect at Cloudera. As a big data and machine learning professional, his expertise extends to artificial intelligence, machine learning, and computer vision, and he has worked with a number of enterprises to bridge big data technologies with advanced statistical analysis, machine learning, and deep learning, creating high-quality data products and intelligent systems that drive strategy and investment decisions. Syed is the founder of the Nashville Artificial Intelligence Society. His research interests include NLP, deep learning (mainly RNNs and GANs), distributed systems, machine learning at scale, and emerging technologies. He holds a master’s degree in interactive intelligence from the Georgia Institute of Technology.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Paco Nathan is known as a “player/coach” with core expertise in data science, natural language processing, machine learning, and cloud computing. He has 35+ years of experience in the tech industry, at companies ranging from Bell Labs to early-stage startups. His recent roles include director of the Learning Group at O’Reilly and director of community evangelism for Apache Spark at Databricks. Paco is the cochair of the Rev conference and an advisor for Amplify Partners, Deep Learning Analytics, Recognai, and Primer. He was named one of the "top 30 people in big data and analytics" in 2015 by Innovation Enterprise.

Presentations

Data Case Studies welcome Tutorial

Program chair Alistair Croll welcomes you to the Data Case Studies tutorial.

Executive Briefing: Best practices for human in the loop—The business case for active learning Session

Deep learning works well when you have large labeled datasets, but not every team has those assets. Paco Nathan offers an overview of active learning, an ML variant that incorporates human-in-the-loop computing. Active learning focuses input from human experts, leveraging intelligence already in the system, and provides systematic ways to explore and exploit uncertainty in your data.
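
A minimal sketch of the core loop, assuming scikit-learn and a synthetic dataset: train on a small labeled pool, score the unlabeled pool, and route the least-confident examples to human experts. In this toy version the existing labels stand in for the human oracle.

    # Uncertainty sampling: query the examples the model is least sure about.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    labeled = np.arange(50)               # indices with human labels so far
    unlabeled = np.arange(50, len(X))     # indices awaiting labels

    for round_ in range(5):
        model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        probs = model.predict_proba(X[unlabeled])
        # Least-confidence score: low max probability = high uncertainty.
        uncertainty = 1 - probs.max(axis=1)
        query = unlabeled[np.argsort(uncertainty)[-10:]]  # send to experts
        labeled = np.concatenate([labeled, query])        # oracle: reuse y here
        unlabeled = np.setdiff1d(unlabeled, query)
        print(f"round {round_}: {len(labeled)} labels")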

Revant Nayar is CTO of FMI Technologies LLC and a PhD candidate at Princeton University. Revant has authored four academic papers and given talks at conferences.

Presentations

Stochastic field theory for time series Session

Machine learning has so far underperformed in time series prediction (slowness and overfitting), and classical methods are ineffective at capturing nonlinearity. Revant Nayar shares an alternative approach that is faster and more transparent and does not overfit. It can also pick up regime changes in the time series and systematically captures all the nonlinearity of a given dataset.

Kimberly Nevala is a strategic advisor at SAS, where she balances forward thinking with real-world perspectives on business analytics, data governance, analytic cultures, and change management. Kimberly’s current focus is helping customers understand both the business potential and practical implications of artificial intelligence (AI) and machine learning (ML).

Presentations

Rationalizing risk in AI and ML Session

Too often, the discussion of AI and ML includes an expectation—if not a requirement—for infallibility. But as we know, this expectation is not realistic. So what’s a company to do? While risk can’t be eliminated, it can be rationalized. Kimberly Nevala demonstrates how an unflinching risk assessment enables AI/ML adoption and deployment.

Ann Nguyen evangelizes design for impact at Whole Whale, where she leads the tech and design team in building meaningful digital products for nonprofits. She has designed and managed the execution of multiple websites for organizations including the LAMP, Opportunities for a Better Tomorrow, and Breakthrough. Ann is always challenging designs with A/B testing. She bets $1 on every experiment that she runs and to date has accumulated a decent sum. Previously, Ann worked with a wide range of organizations from the Ford Foundation to Bitly. She is Google Analytics and Optimizely Platform certified. Ann is a regular speaker on nonprofit design and strategy and recently presented at the DMA Nonprofit Conference. She has also taught at Sarah Lawrence College. Outside of work, Ann enjoys multisensory art, comedy shows, fitness, and making cocktails, ideally all together.

Presentations

How to be aggressively tone-deaf using data; or, We should all be "for-benefits." Data Case Studies

The for-profit system lacks a conscience and empathetic thinking. Ann Nguyen takes a look at the good, the bad, and the ugly of data culture, explores successes in the nonprofit sector, and shows how all companies can adopt a “for-benefit” mindset, merging their data culture with an empathy economy and using data to create and share value among their core audiences.

Minh Chau Nguyen is a researcher in the smart data platform research department at the Electronics and Telecommunications Research Institute (ETRI). His research interests include big data management, software architecture, and distributed systems.

Presentations

A data marketplace case study with the blockchain and advanced multitenant Hadoop in a smart open data platform Session

Minh Chau Nguyen and Heesun Won explain how to implement analytics services in data marketplace systems on a single Hadoop cluster across distributed data centers. The solution extends the overall architecture of the Hadoop ecosystem with the blockchain so that multiple tenants and authorized third parties can securely access data while still maintaining privacy, scalability, and reliability.

Anna Nicanorova is director of Annalect Labs, a space for experimentation and rapid prototyping within Annalect. During her time at Annalect, she has worked on numerous data-marketing solutions, including attribution, optimizers, quantification of content, and image recognition technology. She was part of the Annalect team that won the 2015 I-Com Data Science Hackathon. Anna is cofounder of the Books+Whiskey meetup and a coding volunteer teacher with ScriptEd. She holds an MBA from the Wharton School at the University of Pennsylvania and a BA from Hogeschool van Utrecht.

Presentations

Data visualization in mixed reality with Python Session

Data visualization is supposed to be our map to information. However, contemporary charting techniques have a few shortcomings, including loss of context, numbers that are hard to grasp intuitively, and perceptual dehumanization. Anna Nicanorova explains how augmented reality can solve these issues by presenting an intuitive and interactive environment for data exploration.

Tawny Nichols is chief information officer at SelectData, where she is responsible for new product development, clinical tools, and all technology-related needs. She also leads SelectData’s innovation of data-driven business models. Tawny has over 15 years’ experience supporting the homecare industry. She is currently pursuing an MS in healthcare informatics at the University of San Diego.

Presentations

Spark NLP in action: How SelectData uses AI to better understand home health patients Session

David Talby, Alberto Andreotti, Stacy Ashworth, and Tawny Nichols outline a question-answering system for accurately extracting facts from free-text patient records and share best practices for training domain-specific deep learning NLP models. The solution is based on Spark NLP, an extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding.

Aileen Nielsen works at an early-stage NYC startup that has something to do with time series data and neural networks, and she’s the author of Practical Time Series Analysis (2019) and an upcoming book, Practical Fairness (summer 2020). Previously, Aileen worked at corporate law firms, physics research labs, a variety of NYC tech startups, the mobile health platform One Drop, and on Hillary Clinton’s presidential campaign. Aileen is the chair of the NYC Bar’s Science and Law Committee and a fellow in law and tech at ETH Zurich. She is a frequent speaker at machine learning conferences on both technical and legal subjects.

Presentations

How to be fair: A tutorial for beginners Tutorial

There is mounting evidence that the widespread deployment of machine learning and artificial intelligence in business and government applications is reproducing or even amplifying existing prejudices and social inequalities. Aileen Nielsen demonstrates how to identify and avoid bias and other unfairness in your analyses.

Dinesh Nirmal is vice president of development for data and AI at IBM. His mission is to empower every organization to transform their industry—whether it’s aerospace, finance, or healthcare—by unlocking the power of their data. Dinesh speaks and writes internationally on operationalizing machine learning and advises business leaders on strategies to ready their enterprises for new technologies. He leads more than a dozen IBM Development Labs globally; recognizing a market need for data science mastery, he launched six machine learning hubs to work face-to-face with clients. Products in his portfolio regularly win major design awards, including two Red Dot Awards and the iF Design Award. Dinesh is a member of the board of the R Consortium and an advisor to Accel.AI. He lives in San Jose with his wife Catherine Plaia, formerly an engineer at Apple, and their two young sons.

Presentations

Wait…pizza is a vegetable? Decoding regulations using machine learning (sponsored by IBM) Keynote

Dinesh Nirmal of IBM Analytics tackles school lunch and the struggle to keep ahead of regulations. With AI technologies like deep learning and NLG, supplying meals to California’s kids leaps from enriching metadata for compliance to actionable insights for the business.

Bethann Noble is a director of product marketing at Cloudera, responsible for driving marketing and strategy initiatives in support of Cloudera machine learning solutions. Previously, Bethann held roles in developer and product marketing, technical sales, and software engineering at IBM, with several years’ experience in high-performance computing and big data and analytics technologies. She holds a bachelor’s degree in mathematics from the University of Texas at Austin.

Presentations

A roadmap for open data science and AI for business: Panel discussion with State Street Session

Bethann Noble, Abhishek Kodi, and Daniel Huss share their experience and best practices for designing and executing on a roadmap for open data science and AI for business.

Patrick Nussbaumer is the director of product marketing at Alteryx, where he is focused on helping users and organizations leverage analytics. As part of his responsibilities, Patrick led the development of the Udacity Predictive Analytics for Business Nanodegree that is focused on empowering business users with the skills to solve more advanced business problems. Patrick spent the previous 20 years in a variety of roles focused on data, analytics, and self-service visual analytics in the semiconductor, telecommunications, defense, and financial services industries.

Presentations

Getting the most out of advanced analytics with people (sponsored by Alteryx) Session

There is a lot of buzz around data science and machine learning in the world today. Unfortunately, to truly innovate with data and advanced capabilities, organizations need to expand their focus beyond just a few specialists. Patrick Nussbaumer details how focusing on people can help improve analytic value and drive innovation.

Owen O’Malley is a cofounder and technical fellow at Cloudera, formerly Hortonworks. Cloudera’s software includes Hadoop and the large ecosystem of big data tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006 at Yahoo, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. He was the architect of MapReduce and Hadoop security and now works on Hive, where he’s driving the development of the ORC file format and adding ACID transactions.

Presentations

Introducing Iceberg: Tables designed for object stores Session

Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout with properties specifically designed for cloud object stores such as S3. Iceberg provides a common set of capabilities, including partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet.
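
A short sketch of what this looks like from PySpark, assuming a Spark build with the Iceberg runtime on the classpath and a catalog (here called s3_catalog) configured against an object store; all names and schemas are placeholders.

    # Sketch only: create, write to, and evolve an Iceberg table via Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()

    spark.sql("""
        CREATE TABLE IF NOT EXISTS s3_catalog.db.events (
            ts TIMESTAMP, user_id BIGINT, action STRING
        )
        USING iceberg
        PARTITIONED BY (days(ts))  -- hidden partitioning enables pruning
    """)

    spark.sql("INSERT INTO s3_catalog.db.events VALUES (current_timestamp(), 1, 'click')")

    # Schema evolution is a metadata-only operation; no data files are rewritten.
    spark.sql("ALTER TABLE s3_catalog.db.events ADD COLUMN referrer STRING")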

Brian O’Neill is the founder and consulting product designer at Designing for Analytics, where he focuses on helping companies design indispensable data products that customers love. Brian’s clients and past employers include Dell EMC, NetApp, TripAdvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JPMorgan Chase, the Future of Music Coalition, and E*TRADE, among others; over his career, he has worked on award-winning storage industry software for Akorri and Infinio. Brian has been designing useful, usable, and beautiful products for the web since 1996 and has brought over 20 years of design experience to various podcasts, meetups, and conferences such as the O’Reilly Strata Conference in New York City and London, England. He is the author of the Designing for Analytics Self-Assessment Guide for Non-Designers as well as numerous articles on design strategy, user experience, and business related to analytics. Brian is also an expert advisor on the topics of design and user experience for the International Institute for Analytics. When he is not manning his Big Green Egg at a BBQ or mixing a classic tiki cocktail, Brian can be found on stage performing as a professional percussionist and drummer. He leads the acclaimed dual-ensemble Mr. Ho’s Orchestrotica, which the Washington Post called “anything but straightforward,” and has performed at Carnegie Hall, the Kennedy Center, and the Montreal Jazz Festival. If you’re at a conference, just look for the only guy with a stylish orange leather messenger bag.

Presentations

UX strategies for underperforming analytics services and data products Session

Gartner says 85%+ of big data projects will fail, despite the fact that your company may have invested millions in engineering implementation. Why are customers and employees not engaging with these products and services? Brian O'Neill explains why a "people first, technology second" mission—a design strategy, in other words—enables the best UX and business outcomes possible.

Troels Oerting is a globally recognized cybersecurity expert. He serves on a number of corporate boards, including as nonexecutive director in key companies and in high-profile advisory roles. Troels has been working on the cybersecurity frontline for the last 38 years and has held a number of significant posts both nationally and internationally. Previously, Troels was group chief information security officer (CISO) and group chief security officer at Barclays, where he had end-to-end responsibility for all security in Barclays Group, leading the more than 3,000 security experts worldwide who protect the bank’s 50 million customers and 140,000 employees. Before joining Barclays, Troels was director of the European Cybercrime Centre (EC3), an EU-wide center located in Europol’s HQ tasked with assisting law enforcement agencies in protecting 500 million citizens in the 28 EU member states from cybercrime or loss of privacy. In this role, he also initiated the establishment of the international Joint Cybercrime Action Task Force (J-CAT), comprising leading global law enforcement agencies, prosecutors, and Interpol’s Global Centre of Innovation. The J-CAT has since been recognized as the leading international response to the increasing threat from organized cybercriminal networks. An expert in cybersecurity, Troels has constantly been looking for new legislative, technical, or cooperation opportunities to efficiently protect privacy and security for internet users, and he has been pioneering new methodologies to prevent crime in cyberspace and protect innocent users from losing their digital identity, assets, or privacy online. Troels was cyber advisor for the EU Commission and Parliament and has been a permanent delegate in many governance organizations, including ICANN, ITU, and the Council of Europe. He has also served as an advisor to several governments and organizations for cyber-related questions. Troels established a vast global outreach program that brought together law enforcement, NGOs, key tech companies, industry leaders, and academic research institutes to establish a multifaceted global coalition against cybercriminal syndicates and networks, with the aim of enhancing online security without harming privacy and inventing new ways of protecting internet users. Earlier in his career, Troels was assistant director for Europol’s Organised Crime Department and the Counterterrorist Department, as well as director of operations for the Danish Security Intelligence Service and director of the Danish Serious Organised Crime Agency (SOCA). Troels is an extern lecturer in cybercrime at a number of universities and business schools and has been recognized several times by global law enforcement agencies for his international leadership in fighting cyber and organized crime. He is the author of a political thriller published in Danish, Operation Gamma.

Presentations

Next-generation cybersecurity via data fusion, AI, and big data: Pragmatic lessons from the front lines in financial services Session

Usama Fayyad and Troels Oerting share outcomes and lessons learned from building and deploying a global data fusion, incident analysis/visualization, and effective cybersecurity defense based on big data and AI at a major EU bank, in collaboration with several financial services institutions.

Diego Oppenheimer is the founder and CEO of Algorithmia. An entrepreneur and product developer with extensive background in all things data, Diego has designed, managed, and shipped some of Microsoft’s most used data analysis products, including Excel, Power Pivot, SQL Server, and Power BI. Diego holds a bachelor’s degree in information systems and a master’s degree in business intelligence and data analytics from Carnegie Mellon University.

Presentations

Deploying machine learning models in the enterprise Session

After big investments in collecting and cleaning data and building machine learning (ML) models, enterprises face big challenges in deploying models to production and managing a growing portfolio of ML models. Diego Oppenheimer covers the strategic and technical hurdles each company must overcome and the best practices developed while deploying over 4,000 ML models for 70,000 engineers.

Francois Orsini is the chief technology officer for MZ’s Satori business unit. Previously, he served as vice president of platform engineering and chief architect, bringing his expertise in building server-side architecture and implementation for a next-gen social and server platform; was a database architect and evangelist at Sun Microsystems; and worked in OLTP database systems, middleware, and real-time infrastructure development at companies like Oracle, Sybase, and Cloudscape. Francois has extensive experience working with database and infrastructure development, honing his expertise in distributed data management systems, scalability, security, resource management, HA cluster solutions, and soft real-time and connectivity services. He also collaborated with Visa International and Visa USA to implement the first Visa Cash Virtual ATM for the internet and founded a VC-backed startup called Unikala in 1999. Francois holds a bachelor’s degree in civil engineering and computer sciences from the Paris Institute of Technology.

Presentations

Correlation analysis on live data streams Session

The rate of growth of data volume and velocity has been accelerating along with increases in the variety of data sources. This poses a significant challenge to extracting actionable insights in a timely fashion. Arun Kejariwal and Francois Orsini explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.
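
A toy illustration of the pairing the speakers describe, using pandas: track a rolling correlation between two related streams and flag windows where it collapses. A production system would compute this incrementally over live data rather than in batch.

    # Rolling correlation plus a simple anomaly flag on synthetic streams.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(7)
    n = 500
    a = rng.normal(size=n).cumsum()
    b = a + rng.normal(scale=0.3, size=n)       # normally tracks `a` closely
    b[400:430] += rng.normal(scale=5, size=30)  # injected disturbance

    s = pd.DataFrame({"a": a, "b": b})
    corr = s["a"].rolling(50).corr(s["b"])

    # Flag points where the rolling correlation drops well below its baseline.
    threshold = corr.mean() - 3 * corr.std()
    print(s.index[corr < threshold].tolist())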

Occhio Orsini is senior data science architect at Aetna, where he leads the solution engineering and architecture efforts to build Aetna’s Data Fabric, which supports the company’s advanced analytics initiatives across the organization. Occhio has over 25 years’ experience building data and analytics technology platforms. He started his career in application development, then spent time developing database engine and internet search technology for heritage Ascential Software (acquired by IBM); played a central role in the creation of the IBM Information Server Suite; and worked on the strategy and adoption of data analytics and data governance platforms for Aetna’s Enterprise Architecture Group.

Presentations

Aetna's advanced analytics platform, Data Fabric Session

Occhio Orsini offers an overview of Aetna's Data Fabric platform. Join in to learn the needs and desires that led to the creation of the advanced analytics platform, explore the platform's architecture, technology, and capabilities, and understand the key technologies and capabilities that made it possible to build a hybrid solution across on-premises and cloud-hosted data centers.

Steve Otto is the associate director of the enterprise architecture team at Navistar, where he helps shape the technology strategy and architecture to drive business goals. Previously, he was the manager of the information management team at Navistar. Steve started his career as a developer in the management consulting practice at Ernst & Young and has held a variety of roles over his IT career, including the planning, design, build, operation, and support functions for IT projects in the consumer products, retail, aerospace and defense, healthcare, manufacturing, and higher education markets.

Presentations

Driving predictive analytics for the IoT and connected vehicles Data Case Studies

Navistar built an IoT-enabled remote diagnostics platform, OnCommand Connection, to bring together data from 375,000+ vehicles in real time, in order to drive predictive analytics. This service is now being offered to fleet owners, who can monitor the health and performance of their trucks from smartphones or tablets. Join Steve Otto to learn more about Navistar's IoT and data journey.

Jerry Overton is a data scientist and distinguished technologist in DXC’s Analytics Group, where he is the principal data scientist for industrial machine learning, a strategic alliance between DXC and Microsoft comprising enterprise-scale applications across six different industries: banking and capital markets, energy and technology, insurance, manufacturing, healthcare, and retail. Jerry is the author of Going Pro in Data Science: What It Takes to Succeed as a Professional Data Scientist (O’Reilly) and teaches the O’Reilly training course Mastering Data Science at Enterprise Scale. In his blog, Doing Data Science, Jerry shares his experience leading open research and transforming organizations using data science.

Presentations

Minimum viable machine learning: The applied data science bootcamp (sponsored by DXC Technology) 1-Day Training

Acquiring machine learning (ML) technology is relatively straightforward, but ML must be applied to be useful. In this one-day boot camp that is equal parts hackathon, presentation, and group participation, Jerry Overton, Ashim Bose, and Samir Sehovic teach you how to apply advanced analytics in ways that reshape the enterprise and improve outcomes.

Shravan (Sean) Pabba is a principal systems engineer at Cloudera, where he helps customers and prospects adopt, architect, and build applications using the Cloudera platform. His current area of focus is Cloudera Altus. Previously, Sean was a solutions architect at companies including GigaSpaces and IBM, where he was involved in the architecture, design, and development of distributed and mainframe applications.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud and integrate data engineering and data analytic workflows and explore considerations and best practices for data analytics pipelines in the cloud. Along the way, you'll see how to share metadata across workloads in a big data PaaS.

Mani Parkhe is an ML and AI platform engineer at Databricks, where he works on various customer-facing and open source platform initiatives to enable data discovery, training, experimentation, and deployment of ML models in the cloud. Mani is a lifelong student and coding geek with a passion for elegance in design. Previously, he spent 15 years building software for semiconductor chip CAD before transitioning to building big data infrastructure, distributed systems and web services, and machine learning. He also worked on various data-intensive batch and stream processing problems at LinkedIn and Uber. Mani holds a master’s degree in CS from the University of Florida. He lives in Almaden Valley with his wife and three amazing kids.

Presentations

MLflow: An open platform to simplify the machine learning lifecycle Session

Successfully building and deploying a machine learning model is difficult to do once. Enabling other data scientists to reproduce your pipeline, compare the results of different versions, track what's running where, and redeploy and rollback updated models is much harder. Mani Parkhe and Andrew Chen offer an overview of MLflow—a new open source project from Databricks that simplifies this process.
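
For a taste of the workflow, here is a minimal example using MLflow's tracking API to record the parameters, metrics, and model artifact of a run; it assumes mlflow and scikit-learn are installed and logs to a local ./mlruns directory.

    # Log a run so a colleague can reproduce, compare, or redeploy it later.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor

    X, y = load_diabetes(return_X_y=True)

    with mlflow.start_run():
        params = {"n_estimators": 100, "max_depth": 5}
        model = RandomForestRegressor(**params, random_state=0).fit(X, y)

        mlflow.log_params(params)                  # hyperparameters
        mlflow.log_metric("train_r2", model.score(X, y))
        mlflow.sklearn.log_model(model, "model")   # versioned model artifact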

Drew Paroski is vice president of engineering at MemSQL, where he oversees engineering development and operations. Previously, he was an architect and manager leading development of the new native SQL compilation architecture in MemSQL 5; worked at Facebook, where he cofounded HHVM, a JIT compiler for PHP used by a number of popular web properties, including Facebook, Wikipedia, and Baidu; and worked on the .NET compiler and runtime at Microsoft. Drew holds a bachelor’s and master’s degree in computer science from State University of New York at Binghamton.

Presentations

Leveraging the best of the past to power a better future (sponsored by MemSQL) Keynote

Today’s successful businesses utilize data better than their competitors; however, data sprawl and inefficient data infrastructure restrict what’s possible. Blending the best of the past with the software innovations of today will solve future data challenges. Drew Paroski shares how to develop modern database applications without sacrificing cost savings, data familiarity, and flexibility.

Robert Passarella evaluates AI and machine learning investment managers for Alpha Features. Rob has spent over 20 years on Wall Street in the gray zone between business and technology, focusing on using technology and innovative information sources to empower novel ideas in research and the investment process. A veteran of Morgan Stanley, JPMorgan, Bear Stearns, Dow Jones, and Bloomberg, he has seen the transformational challenges firsthand, up close and personal. Always intrigued by the consumption and use of information for investment analysis, Rob is passionate about leveraging alternative and unstructured data for use with machine learning techniques. Rob holds an MBA from the Columbia Business School.

Presentations

Findata welcome Tutorial

Program chairs Alistair Croll and Robert Passarella welcome you to Findata Day.

Joshua Patterson is a director of AI infrastructure at NVIDIA leading engineering for RAPIDS.AI. Previously, Josh was a White House Presidential Innovation Fellow and worked with leading experts across public sector, private sector, and academia to build a next-generation cyberdefense platform. His current passions are graph analytics, machine learning, and large-scale system design. Josh loves storytelling with data and creating interactive data visualizations. He holds a BA in economics from the University of North Carolina at Chapel Hill and an MA in economics from the University of South Carolina Moore School of Business.

Presentations

Accelerating financial data science workflows with GPUs Session

GPUs have allowed financial firms to accelerate their computationally demanding workloads. Today, the bottleneck has moved completely to ETL. The GPU Open Analytics Initiative (GoAi) is helping accelerate ETL while keeping the entire workflow on GPUs. Joshua Patterson and Onur Yilmaz discuss several GPU-accelerated data science tools and libraries.
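
A small sketch of GPU-resident ETL with cuDF, one of the GoAi libraries: the dataframe stays in GPU memory across parse, filter, and aggregate steps. It assumes an NVIDIA GPU with cudf installed; the file and column names are placeholders.

    # GPU-resident ETL: no CPU round-trips between steps.
    import cudf

    trades = cudf.read_csv("trades.csv")           # parsed directly on the GPU
    trades = trades[trades["price"] > 0]           # filter without leaving the GPU
    by_symbol = trades.groupby("symbol").agg({"price": "mean", "size": "sum"})
    print(by_symbol.head())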

Josh Poduska is the chief data scientist at Domino Data Lab. He has 17 years of experience in analytics. His work experience includes leading the statistical practice at one of Intel’s largest manufacturing sites, working on smarter cities data science projects with IBM, and leading data science teams and strategy with several big data software companies. Josh holds a master’s degree in applied statistics from Cornell University.

Presentations

Managing data science in the enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. Joshua Poduska and Patrick Harrison detail how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Jennifer Prendki is the vice president of machine learning at Figure Eight, the essential human-in-the-loop AI platform for data science and machine learning teams. She has spent most of her career creating a data-driven culture wherever she went, succeeding in sometimes highly skeptical environments. She is particularly skilled at building and scaling high-performance machine learning teams and is known for enjoying a good challenge. Trained as a particle physicist (she holds a PhD in particle physics from Sorbonne University), she likes to use her analytical mind not only when building complex models but also as part of her leadership philosophy. She is pragmatic yet detail oriented. Jennifer also takes great pleasure in addressing both technical and nontechnical audiences alike at conferences and seminars and is passionate about attracting more women to careers in STEM.

Presentations

Agile for data science teams Session

Agile methodologies have been widely successful for software engineering teams but seem inappropriate for data science teams, because data science is part engineering, part research. Jennifer Prendki demonstrates how, with a minimum amount of tweaking, data science managers can adapt Agile techniques and establish best practices to make their teams more efficient.

James Psota is the cofounder and CTO of Panjiva, which helps companies engaged in global trade make informed decisions. James is responsible for the company’s technical and product direction and leads the data science, engineering, and product teams. James led the creation of Panjiva from the ground up, culminating in a successful acquisition by S&P Global in 2018. Data from the Panjiva Supply Chain Graph has appeared in publications including the New York Times, Wall Street Journal, Forbes, and the Financial Times. Panjiva was named one of the top 10 most innovative data science companies in the world by Fast Company in 2018. James has spoken about artificial intelligence, open data, and entrepreneurship at Harvard Business School, MIT, and the White House, as well as numerous industry and academic conferences. James studied computer science at Cornell and MIT.

Presentations

Avoiding data disillusionment: Three things to get right when building data products Findata

Businesses are pouring massive amounts of money into data science projects, and expectations are sky-high. But how many of those projects will deliver real value to customers? The history of other hyped technologies predicts that many will fail, leaving a sense of disillusionment in their wake. James Psota explains what organizations must get right to avoid that fate.

Amanda C. Pustilnik is a professor of law at the University of Maryland School of Law and on the permanent faculty at the Center for Law, Brain & Behavior at Massachusetts General Hospital. Her work focuses on the intersections of law, science, and culture, with a particular emphasis on neuroscience and neurotechnologies. In 2015, she served as Harvard Law School’s first senior fellow on law and applied neuroscience, where she focused on the neuroimaging of pain in itself and as a model for imaging subjective states relevant to law. Her collaborations with scientists on pain-related brain imaging and her expertise in criminal law led to her recent work on the opioid crisis on behalf of the Aspen Institute. She also writes and teaches in the areas of scientific and forensic evidence, on which she helps train federal and state judges. Prior to entering the academy, Amanda practiced litigation at Covington & Burling and at Sullivan & Cromwell, clerked on the Second Circuit Court of Appeals, and worked as a management consultant at McKinsey & Co. She is a graduate of Harvard College and Yale Law School and completed a fellowship at the University of Cambridge, where she studied history and philosophy of science. Her work has been published in numerous law reviews and peer-reviewed scientific journals, including Nature.

Presentations

Brain-based human-machine interfaces: New developments, legal and ethical issues, and potential uses Keynote

Have you ever dreamed you could read minds? Do telekinesis? Maybe fly a magic carpet by thought alone? Until now, these powers have existed only in the realm of imagination or, more recently, video, AR, and VR games. Join Amanda Pustilnik to learn how brain-based human-machine interfaces are beginning to offer these powers in near-commercially-viable forms.

Greg Quist is the cofounder, president, and CEO of SmartCover Systems, where he leads the strategic direction and operations of the company. Greg is a longtime member of the water community. He was elected to the Rincon del Diablo MWD board of directors in 1990 and for the past 27 years has served in various roles, including president and treasurer. Rincon’s Board appointed Greg to the San Diego County Water Authority Board in 1996, where he served for 12 years, leading a coalition of seven agencies to achieve more than $1M/year in water delivery savings. He is currently the chairman of the Urban Water Institute. With a background in the areas of metamaterials, numerical analysis, signal processing, pattern recognition, wireless communications, and system integration, Greg has worked as a technologist, manager, and executive at Alcoa, McDonnell-Douglas, and SAIC and has founded and successfully spun off several high-tech startups, primarily in real-time detection and water technology. He has held top-level government clearances and holds 14 patents and has several pending. Greg has an undergraduate degree in astrophysics with a minor in economics from Yale, where he played football and baseball, and a PhD in physics from the University of California, Santa Barbara. He currently resides in Escondido, CA. In his rare free time, he enjoys fly fishing, hiking, golf, basketball, and tennis.

Presentations

Sewers can talk: Understanding the language of sewers Data Case Studies

Sewers can talk. Water levels in sewers have a signature, analogous to a human EKG. Greg Quist explains how this signature can be analyzed in real time, using pattern recognition techniques, revealing distressed pipelines and allowing users of this technology to take appropriate steps for maintenance and repair.
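
As a toy stand-in for the production pattern-recognition methods, the sketch below flags readings that deviate sharply from the recent pattern of a simulated level signal using a rolling z-score; it is illustrative only, not SmartCover's algorithm.

    # Flag anomalous readings in a simulated daily sewer-level signal.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    t = np.arange(24 * 60)  # one day of minute-level readings
    level = 10 + 3 * np.sin(2 * np.pi * t / (24 * 60)) + rng.normal(scale=0.2, size=t.size)
    level[900:920] += 6     # simulated partial blockage

    s = pd.Series(level)
    z = (s - s.rolling(120).mean()) / s.rolling(120).std()
    alerts = s.index[z.abs() > 4]
    print(f"{len(alerts)} anomalous readings, first at minute {alerts.min()}")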

Syed Rafice is a principal system engineer at Cloudera, specializing in big data on Hadoop technologies as well as platform and cybersecurity. He is responsible for designing, building, developing, and assuring a number of enterprise-level big data platforms using the Cloudera distribution. Syed has worked across multiple sectors, including government, telecoms, media, utilities, financial services, and transport.

Presentations

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments, step by step Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR.

Greg Rahn is director of product management at Cloudera, where he’s responsible for driving SQL product strategy as part of the company’s data warehouse product team, including working directly with Impala. For over 20 years, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently product management, providing a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Optimizing Apache Impala for a cloud-based data warehouse Session

Cloud object stores are becoming the bedrock of cloud data warehouses for modern data-driven enterprises, and it's become a necessity for data teams to have the ability to directly query data stored in S3 or ADLS. Greg Rahn and Mostafa Mokhtar discuss optimal end-to-end workflows and technical considerations for using Apache Impala over object stores for your cloud data warehouse.
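
A brief sketch of the pattern from Python, using the impyla client to define and query an external Parquet table whose data lives in S3. The host, table, and bucket names are placeholders, and the cluster is assumed to be configured with object store credentials.

    # Query S3-backed data through Impala via the impyla DB-API client.
    from impala.dbapi import connect

    conn = connect(host="impala-coordinator.example.com", port=21050)
    cur = conn.cursor()

    # External table whose data lives entirely in the object store.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales (
            id BIGINT, amount DECIMAL(10,2), sale_date STRING
        )
        STORED AS PARQUET
        LOCATION 's3a://example-bucket/warehouse/sales/'
    """)

    cur.execute("SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date")
    for row in cur.fetchall():
        print(row)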

Anand Raman is the chief of staff for the AI CTO office at Microsoft. Previously, he was the chief of staff for the Microsoft Azure Data Group, covering data platforms and machine learning; ran the company’s product management and development teams for Azure Data Services as well as the Visual Studio and Windows Server user experience teams; and worked for several years as a researcher before joining Microsoft. Anand holds a PhD in computational fluid mechanics.

Presentations

A developer's guide to building AI applications (sponsored by Microsoft) Session

Anand Raman and Wee Hyong Tok walk you through applying AI technologies in the cloud. You'll learn how to add prebuilt AI capabilities like object detection, face understanding, translation, and speech to applications, build cognitive search applications that understand deep content in images, text, and other data, use the Azure platform to accelerate machine learning, and more.

Anand Raman is vice president of sales and account management at Impetus Technologies, where he is responsible for business growth, from sales strategy ideation to execution of big data strategy for Fortune 500 companies. With over 20 years of experience in the IT industry, Anand has a successful track record in working with large clients to improve business strategy and strengthen the creation of engagement models. Previously, Anand led Impetus’s business development and global outsourcing and built the IT sales team and big data practice from scratch. He began his career as a programmer, developing enterprise applications for large manufacturing organizations. Anand holds an MBA and a bachelor’s degree in mathematics. He currently resides in Silicon Valley with his wife and children. When he is not spending time with his family, Anand enjoys adventure sports, squash, and classical music.

Presentations

Keys to operationalize enterprise 360 (sponsored by Impetus) Session

Is a single source of truth across the enterprise possible, or is it just an expensive myth? Anand Raman explains why you need a holistic decision framework that addresses multiple facets from platform to processes. Join in to explore EDW modernization strategies, self-service analytics, and interactive insights on big data and discover a process to get to a unified data model.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); worked briefly on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper. He’s the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin–Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Building Fabric Answers using Apache Heron Session

Streaming systems like Apache Heron are being used for an increasingly broad array of applications. Karthik Ramasamy and Andrew Jorgensen offer an overview of Fabric Answers, which provides real-time insights to mobile developers to improve their product experience at Google Fabric using Apache Heron.

Designing modern streaming data applications Tutorial

Arun Kejariwal and Karthik Ramasamy lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, covering messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. They also share case studies from the IoT, gaming, and healthcare and their experience operating these systems at internet scale.

High-performance messaging with Apache Pulsar Session

Apache Pulsar is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it's very important to ensure that the system can make use of all the available resources. Karthik Ramasamy and Matteo Merli share insights into the design decisions and the implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.
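
For orientation, here is a minimal producer and consumer using Pulsar's official Python client (pip install pulsar-client); the broker URL and topic are placeholders.

    # Produce and consume one message with the Pulsar Python client.
    import pulsar

    client = pulsar.Client("pulsar://localhost:6650")

    producer = client.create_producer("persistent://public/default/demo")
    producer.send("hello pulsar".encode("utf-8"))  # acked once durably written

    consumer = client.subscribe("persistent://public/default/demo", "demo-sub")
    msg = consumer.receive()
    print(msg.data().decode("utf-8"))
    consumer.acknowledge(msg)  # lets the broker delete the message

    client.close()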

Arakere Ramesh leads an engineering team at Intel to enable an ecosystem of database and analytics ISVs to run best on Intel Data Center platforms. He’s led engineering efforts across Intel businesses for decades, working with ISVs globally.

Presentations

How the blurring of memory and storage is revolutionizing the data era (sponsored by Intel) Session

Persistent memory accelerates analytics, database, and storage workloads across a variety of use cases, bringing new levels of speed and efficiency to the data center and to in-memory computing. Arakere Ramesh and Bharath Yadla offer an overview of the newly announced Intel Optane data center persistent memory and share the exciting potential of this technology in analytics solutions.

Jun Rao is the cofounder of Confluent, a company that provides a streaming data platform on top of Apache Kafka. Previously, Jun was a senior staff engineer at LinkedIn, where he led the development of Kafka, and a researcher at IBM’s Almaden Research Center, where he conducted research on databases and distributed systems. Jun is the PMC chair of Apache Kafka and a committer on Apache Cassandra.

Presentations

A deep dive into Kafka controller Session

The controller is the brain of Apache Kafka and is responsible for maintaining the consistency of the replicas. Jun Rao outlines the main data flow in the controller, then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.

Radhika Ravirala is a solutions architect at Amazon Web Services, where she helps customers craft distributed big data applications on the AWS platform. Previously, she was a software engineer and designer for technology companies in Silicon Valley. She holds an MS in computer science from San Jose State University.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez, Radhika Ravirala, Paul Sears, and Bruno Faria walk you through building a big data application using a combination of open source technologies and AWS managed services.

LaVonne Reimer is the founder of Lumenous. A lawyer-turned-entrepreneur with decades of experience building digital platforms for markets with identity and data privacy sensitivities, LaVonne was previously founder and CEO of Cenquest, a venture-backed startup that provided the technology backbone for graduate schools such as NYU’s Stern School, the London School of Economics, and UT Austin to offer branded degree programs online. More recently, she led a program to foster entrepreneurship in open source together with Open Source Development Labs (Linux), IBM, and Intel. The Open Authorization Protocol, initiated by members of this community, inspired her to begin work on governance and trust assurance for free-flowing data.

Presentations

Balancing stakeholder interests in personal data governance technology Session

GDPR asks us to rethink personal data systems—viewing UI/UX, consent management, and value-add data services through the eyes of subjects of the data. LaVonne Reimer explains why the opportunity in the $150B credit and risk industry is to deploy data governance technologies that balance the interests of individuals to control their own data with requirements for trusted data.

Deborah Reynolds is vice president of data and analytic innovation at Pfizer, where she is responsible for the identification, development, and deployment of analytic tools and capabilities that improve the speed, creativity, and impact of analytics on business decisions. She’s also responsible for the company’s data acquisition strategy, including uses of new and alternative data sources for business analysis. Debbie has been with Pfizer since 1992 with roles in finance, operations, and analytics. She holds an MBA in finance and accounting from Columbia Business School and a BA in economics from Cornell University.

Presentations

From analytic silos to analytic democratization: How (and why) companies make the shift (sponsored by Dataiku) Session

By creating a collaborative and interactive analytic environment, a forward-thinking company may harness the best capabilities of its business analysts and data scientists to answer the company’s most pressing business questions. Deborah Reynolds and Kurt Muehmel explain how large enterprises can successfully put data at the core of everyday business decisions.

Emily Riederer is an analytics manager at Capital One, where she focuses on building opinionated data products to promote scalable and reproducible business analysis. At Capital One, she has worked across acquisitions and CRM credit strategy and led consulting initiatives for retail partners.

Outside of work, Emily is an active member of the #rstats community. Most recently, she has reviewed packages for rOpenSci and helped to co-organizer the first Chicago R unconference and the inaugural satRday Chicago conference.

Previously, Emily earned degrees Mathematics and Statistics / OR at the University of North Carolina at Chapel Hill. During her studies, she focused on healthcare analytics as a research assistant in emergency department discrete event simulation and a student consultant for a large managed healthcare provider.

Presentations

InnerSource for reproducible and extensible business analysis Session

Emily Riederer explains how best practices from data science, open source, and open science can solve common business pain points. Using a case example from Capital One, Emily illustrates how designing empathetic analytical tools and fostering a vibrant InnerSource community are keys to developing reproducible and extensible business analysis.

Steve Ross is the director of product management at Cloudera, where he focuses on security across the big data ecosystem, balancing the interests of citizens, data scientists, and IT teams working to get the most out of their data while preserving privacy and complying with the demands of information security and regulations. Previously, at RSA Security and Voltage Security, Steve managed product portfolios now in use by the largest global companies and hundreds of millions of users.

Presentations

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) went into effect for firms doing business in the EU, but many companies still aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Nuria Ruiz is a full stack engineer on the analytics team at the Wikimedia Foundation. Before being part of the awesome project that is Wikipedia, she spent time working in JavaScript, performance, mobile apps, and web frameworks. Most of her experience deploying large applications comes from the seven years she worked at Amazon. Nuria is a physicist by trade and started writing software 15 years ago in a physical oceanography lab in Seattle.

Presentations

Data and privacy at scale at Wikipedia Session

The Wikipedia community feels strongly that you shouldn’t have to provide personal information to participate in the free knowledge movement. Nuria Ruiz discusses the challenges that this strong privacy stance poses for the Wikimedia Foundation, including how it affects data collection, and details some creative workarounds that allow WMF to calculate metrics in a privacy-conscious way.

Patty Ryan is an applied data scientist at Microsoft, where she codes with the company’s partners and customers to tackle tough problems using machine learning approaches with sensor, text, and vision data. She’s a graduate of the University of Michigan.

Presentations

When Tiramisu meets online fashion retail Session

Large online fashion retailers must efficiently maintain catalogues of millions of items. Due to human error, it's not unusual that some items have duplicate entries. Since manually trawling such a large catalogue is next to impossible, how can you find these entries? Patty Ryan, CY Yam, and Elena Terenzi explain how they applied deep learning for image segmentation and background removal.
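
As a simplified illustration of the duplicate-detection half of this problem (not the presenters' Tiramisu-based pipeline), candidate duplicates can be surfaced by embedding catalogue images with a pretrained CNN and comparing cosine similarity; the file names and threshold below are assumptions.

```python
# Simplified sketch (not the presenters' pipeline): flag likely duplicate
# catalogue images by comparing embeddings from a pretrained CNN.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def embed(path):
    # Load one catalogue image and return its 2048-dimensional embedding.
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical file names; a real catalogue would batch this over millions of items.
if cosine(embed("item_001.jpg"), embed("item_002.jpg")) > 0.95:  # tunable threshold
    print("Possible duplicate entry")
```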

Stefan Salandy is a systems engineer at Cloudera.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud and integrate data engineering and data analytic workflows and explore considerations and best practices for data analytics pipelines in the cloud. Along the way, you'll see how to share metadata across workloads in a big data PaaS.

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists. Previously, he was at Cloudera, where he worked with Apache projects like YARN, Spark, and Kafka.

Presentations

Tracking data lineage at Stitch Fix Session

Neelesh Srinivas Salian explains how Stitch Fix built a service to better understand the movement and evolution of data within the company's data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh covers why and how Stitch Fix built the service and details some use cases.

Shioulin Sam is a research engineer at Cloudera Fast Forward Labs, where she bridges academic research in machine learning with industrial applications. Previously, she managed a portfolio of early stage ventures focusing on women-led startups and public market investments and worked in the investment management industry designing quantitative trading strategies. She holds a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Presentations

Semantic recommendations Session

Recent advances in deep learning allow us to use the semantic content of items in recommendation systems, addressing a weakness of traditional methods. Shioulin Sam explores the limitations of classical approaches and explains how using the content of items can help solve common recommendation pitfalls, such as the cold start problem, and open up new product possibilities.

Osman Sarood leads the infrastructure team at Mist Systems, where he helps Mist scale the Mist Cloud in a cost-effective and reliable manner. Osman has published more than 20 research papers in highly rated journals, conferences, and workshops and has presented his research at several academic conferences. He has over 400 citations along with an i10-index and h-index of 12. Previously, he was a software engineer at Yelp, where he prototyped, architected, and implemented several key production systems and architected and authored Yelp’s autoscaled spot infrastructure, fleet_miser. Osman holds a PhD in high-performance computing from the University of Illinois Urbana-Champaign, where he focused on load balancing and fault tolerance.

Presentations

How to cost-effectively and reliably build infrastructure for machine learning Session

Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million.

Toru Sasaki is a system infrastructure engineer and leads the OSS professional services team at NTT Data Corporation. He is interested in open source distributed computing systems, such as Apache Hadoop, Apache Spark, and Apache Kafka. Over his career, Toru has designed and developed many clusters utilizing these products to solve his customers’ problems. He is a coauthor of one of the most popular Apache Spark books written in Japanese.

Presentations

Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform Session

Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices for and lessons learned about topics such as schema evolution and network architecture.

Eric Sayle is a senior software engineer at Uber, where he works with the large volume of geospatial data helping people move in countries around the world. Eric has worked in the data space for the past 10 years, starting with call center performance analytics at Merced Systems.

Presentations

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework Session

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores and take a deep dive into the architecture to see how it all works.

Friederike Schüür is a research engineer at Cloudera Fast Forward Labs, where she imagines what machine learning (ML) in industry will look like in the near-term future. She dives into new ML capabilities, builds prototypes that showcase state-of-the-art technology applied to real use cases, and advises clients on how to make use of new ML capabilities, from strategy to hands-on collaboration with in-house technical teams. Friederike is an advisor to a healthcare startup (in stealth mode) and a data science for social good volunteer with DataKind. She holds a PhD in cognitive neuroscience from University College London.

Presentations

From strategy to implementation: Putting data to work at USA for UNHCR Session

Friederike Schüür and Rita Ko explain how the Hive (an internal group at USA for UNHCR) and Cloudera Fast Forward Labs transformed USA for UNHCR, enabling the agency to use data science and machine learning (DS/ML) to address the refugee crisis. Along the way, they cover the development and implementation of a DS/ML strategy, identify use cases and success metrics, and showcase the value of DS/ML.

Jim Scott is the head of developer relations, data science, at NVIDIA. He’s passionate about building combined big data and blockchain solutions. Over his career, Jim has held positions running operations, engineering, architecture, and QA teams in the financial services, regulatory, digital advertising, IoT, manufacturing, healthcare, chemicals, and geographical management systems industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).

Presentations

Data operations problems created by deep learning and how to fix them (sponsored by MapR) Session

Drawing on his experience working with customers across many industries, including chemical sciences, healthcare, and oil and gas, Jim Scott details the major impediments to successful completion of deep learning projects and solutions while walking you through a customer use case.

Using the blockchain in the enterprise Session

Jim Scott details relevant use cases for blockchain-based solutions across a variety of industries, focusing on a suggested architecture to achieve high-transaction-rate private blockchains and decentralized applications backed by a blockchain. Along the way, Jim compares public and private blockchain architectures.

Paul Scott-Murphy is vice president of product management at WANdisco, where he has overall responsibility for the definition and management of WANdisco’s product strategy, the delivery of product to market, and its success. This includes the direction of the product management team, product strategy, requirements definitions, feature management and prioritization, road maps, coordination of product releases with customer and partner requirements, user testing, and feedback. Paul has built his career on technical leadership, strategy, and consulting roles for major organizations. Previously, he was the regional CTO for TIBCO Software in Asia-Pacific and Japan.

Presentations

Hadoop-compatible filesystems: The limits of "compatible" (sponsored by WANdisco) Session

Every organization is considering its storage options, with an eye toward the cloud. Paul Scott-Murphy explores what makes different large-scale storage systems and services unique, their clear (and unexpected) differences, the options you have to use them, and the surprises you can expect along the way.

Paul Sears is a solutions architect supporting AWS partners in the big data space.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez, Radhika Ravirala, Paul Sears, and Bruno Faria walk you through building a big data application using a combination of open source technologies and AWS managed services.

Presentations

Minimum viable machine learning: The applied data science bootcamp (sponsored by DXC Technology) 1-Day Training

Acquiring machine learning (ML) technology is relatively straightforward, but ML must be applied to be useful. In this one-day boot camp that is equal parts hackathon, presentation, and group participation, Jerry Overton, Ashim Bose, and Samir Sehovic teach you how to apply advanced analytics in ways that reshape the enterprise and improve outcomes.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Creating a successful big data practice in your organization presents new challenges in managing projects and teams. Ted Malaska and Jonathan Seidman share guidance and best practices to help technical leaders deliver successful projects from planning to implementation.

Rama Sekhar is a partner at Norwest Venture Partners, where he focuses on early- to late-stage venture investments in enterprise and infrastructure, including the cloud, big data, DevOps, cybersecurity, and networking. Rama’s current investments include Agari, Bitglass, and Qubole. Previously, Rama was an investor in Morta Security (acquired by Palo Alto Networks), Pertino Networks (acquired by Cradlepoint), and Exablox (acquired by StorageCraft). Before joining Norwest, Rama was with Comcast Ventures; a product manager at Cisco Systems, where he defined product strategy for the GSR 12000 Series and CRS-1 routers—$1B+ networking products in the carrier and data center markets; and a sales engineer at Cisco Systems, where he sold networking and security products to AT&T. Rama holds an MBA from the Wharton School of the University of Pennsylvania with a double major in finance and entrepreneurial management and a BS in electrical and computer engineering, with high honors, from Rutgers University.

Presentations

VC trends in machine learning and data science Session

In this panel discussion, venture capital investors explain how startups can accelerate enterprise adoption of machine learning and explore the new tech trends that will give rise to the next transformation in the big data landscape.

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time reliable data processing pipelines using Apache Kafka. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

The future of ETL isn’t what it used to be. Session

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering. You'll learn how modern engineering organizations use Apache Kafka, microservices, and event streams to efficiently build data pipelines that are scalable, reliable, and built to evolve.
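
For flavor, here is a minimal sketch of the event-stream pattern described above, using the kafka-python client; the broker address, topic names, and payload fields are illustrative assumptions, not the presenter's design.

```python
# Minimal Kafka event-stream sketch: a tiny transform service that reads raw
# events, enriches them, and publishes the result to a downstream topic.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders.raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for msg in consumer:
    event = msg.value
    event["amount_usd"] = round(event["amount_cents"] / 100.0, 2)  # enrich
    producer.send("orders.enriched", event)
```

Each such microservice owns one small step, so the pipeline scales and evolves by adding or replacing services rather than rewriting a monolithic ETL job.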

Ben Sharma is CEO and cofounder of Zaloni. Ben is a passionate technologist with experience in solutions architecture and service delivery of big data, analytics, and enterprise infrastructure solutions and expertise ranging from development to production deployment in a wide array of technologies, including Hadoop, HBase, databases, virtualization, and storage. He has held technology leadership positions for NetApp, Fujitsu, and others. Ben is the coauthor of Java in Telecommunications and Architecting Data Lakes. He holds two patents.

Presentations

Governing your cloud-based enterprise data lake (sponsored by Zaloni) Session

Selwyn Collaco and Ben Sharma share insights from their real-world experience and discuss best practices for architecture, technology, data management, and governance to enable centralized data services, and they explain how to leverage the Zaloni Data Platform (ZDP), an integrated self-service data platform, to operationalize the enterprise data lake.

The data imperative (sponsored by Zaloni) Keynote

Once, a company could expect to spend 60–70 years on the S&P 500; now the average is just 15. If companies were people, this would be an epidemic on par with the Black Plague. But the same things that dragged humanity out of that dark age can drag companies out of this one.

Jennifer Shin is the founder of data science, analytics, and technology company 8 Path Solutions and an adjunct professor at New York University’s Stern School of Business. An experienced data scientist and management consultant, Jennifer has led complex, large-scale, and high-profile projects as a product director at NBCUniversal, director of data science at Comcast, senior principal data scientist at The Nielsen Company, and management consultant at GE Capital, the Carlyle Group, Fortress Investment Group, the City of New York, and Columbia University. Previously, Jennifer taught courses in statistics, data science, and business at UC Berkeley, the Columbia Business School, and the City University of New York. She is internationally recognized as a thought leader, influencer, and expert in data science, business, and technology by governments, corporations, and academic institutions. Jennifer has several patents and trademarks related to data science, machine learning, and AI, has published research in peer-reviewed journals, and has been featured in news publications, press conferences, and on billboards in Times Square and the Vegas Strip. She serves on the data science committee for the Grace Hopper Conference, the advisory board for the data science graduate program at City University of New York, and the advisory board for up-and-coming startups. Jennifer holds an undergraduate degree in economics, mathematics, and creative writing and a graduate degree in statistics, both from Columbia University.

Presentations

Assumptions, constraints, and risks: How the wrong assumptions can jeopardize any model (sponsored by IBM) Session

Common wisdom dictates that we should never make assumptions, but assumptions are essential in the creation of statistical models. Jennifer Shin explores how assumptions fit into the creation of a statistical model, the pitfalls of applying a model to data without taking the underlying assumptions into account, and how to identify datasets where the model and its assumptions are applicable.

Dave Shuman is the industry lead for the IoT and manufacturing at Cloudera. Dave has an extensive background in big data analytics, business intelligence applications, database architecture, logical and physical database design, and data warehousing. Previously, Dave held a number of roles at Vision Chain, a leading demand signal repository provider enabling retailer and manufacturer collaboration, including chief operations officer, vice president of field operations responsible for customer success and user adoption, vice president of product responsible for product strategy and messaging, and director of services. He also served at such top CG companies as Kraft Foods, PepsiCo, and General Mills, where he was responsible for implementations; was vice president of operations for enews, an ecommerce company acquired by Barnes and Noble; was executive vice president of management information systems, where he managed software development, operations, and retail analytics; and developed ecommerce applications and business processes used by Barnesandnoble.com, Yahoo, and Excite and pioneered an innovative process for affiliate commerce. He holds an MBA with a concentration in information systems from Temple University and a BA from Earlham College.

Presentations

Using machine learning to drive intelligence at the edge Session

The focus on the IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing them back out to the edge. Dave Shuman and Bryan Dean explain how Cloudera and Red Hat executed this architecture at one of Europe's leading manufacturers, along with a demo highlighting this architecture.

Alan Silva is a solutions architect and data scientist for Latin America (LATAM) at Cloudera, where he is focused on developing new solutions using machine learning algorithms and solutions, using Marvin as a workflow to support data science and machine learning projects. Alan has experience with a wide range of security systems and network technologies; his technical background includes cryptography, mathematics, network protocols, distributed systems, operational systems, application security, and secure software development. He holds an MSc in computer science from University Federal of São Carlos (UFSCAR), a postgraduate degree in cryptography and network security from University Federal Fluminense (UFF), and a BS in mathematics.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Kamil Sindi is a principal engineer at JW Player, where he works on productionizing machine learning algorithms and scaling distributed systems. He holds a bachelor’s degree in mathematics with computer science from the Massachusetts Institute of Technology.

Presentations

Building turnkey recommendations for 5% of internet video Session

JW Player—the world’s largest network-independent video platform, representing 5% of global internet video—provides on-demand recommendations as a service to thousands of media publishers. Nir Yungster and Kamil Sindi explain how the company is systematically improving model performance while navigating the many engineering challenges and unique needs of the diverse publishers it serves.

Anupam Singh is the general manager of analytics at Cloudera. Previously, Anupam was the cofounder and CEO of Xplain.io (acquired by Cloudera in 2015), a company whose technology accelerates self-service BI with a massively scalable SQL workload analyzer; was the cofounder and CTO at Joviandata (acquired by Marketshare), a pioneer in combining the power of Hadoop and the cloud; worked at Marketshare, where he led the effort to combine machine learning techniques with Hadoop-based data warehousing; and built his database expertise on the SQL Query Optimizer teams at Oracle, Sybase (now SAP), and Informix (now IBM). He graduated from Pune University in India and holds patents in the areas of automatic SQL performance tuning, object databases, and resilient query execution.

Presentations

The future of data warehousing Keynote

Data volumes don’t translate to business value. What matters is your data platform’s ability to support unprecedented numbers of business users and use cases. Anupam Singh and Brian Coyne look at some of the challenges posed by data-hungry organizations and share new techniques to extract meaningful insights at the speed of today’s modern business.

Swatee Singh is the vice president of the big data platform and the first female distinguished architect of the machine learning platform at American Express, where she’s spearheading machine learning transformation. Swatee’s a proponent of democratizing machine learning by providing the right tools, capabilities, and talent structure to the broader engineering and data science community. The platform her team is building looks to leverage American Express’s closed loop data to enhance its customer experience by combining artificial intelligence, big data, and the cloud, incorporating guiding pillars such as ease of use, reusability, shareability, and discoverability. Swatee also led the American Express recommendation engine road map and delivery for card-linked merchant offers as well as for personalized merchant recommendations. Over the course of her career, she’s applied predictive modeling to a variety of problems ranging from financial services to retailers and even power companies. Previously, Swatee was a consultant at McKinsey & Company and PwC, where she supported leading businesses in retail, banking and financial services, insurance, and manufacturing, and cofounded a medical device startup that used a business card-sized thermoelectric cooling device implanted in the brain of someone with epilepsy as a mechanism to stop seizures. Swatee holds a PhD focused on machine learning techniques from Duke University.

Presentations

Democratizing artificial intelligence: Lessons from the real world Findata

Artificial intelligence (AI) is now being adopted in the financial world at an unprecedented scale. Swatee Singh discusses the need to “democratize” AI in the company beyond the purview of "unicorn" data scientists and offers a framework to do this by stitching AI with the cloud and big data at its backend.

Vartika Singh is a field data science architect at Cloudera. Previously, Vartika was a data scientist applying machine learning algorithms to real-world use cases ranging from clickstream to image processing. She has 12 years of experience designing and developing solutions and frameworks utilizing machine learning techniques.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Jason “Jay” Smith is a cloud customer engineer at Google, where he spends his days helping enterprises expand their workload capabilities on Google Cloud. He’s on the Kubeflow go-to-market team and contributes code to help people build an ecosystem for their machine learning operations. His passions include big data, ML, and helping organizations find a way to collect, store, and analyze information.

Presentations

From training to serving: Deploying TensorFlow models with Kubernetes Tutorial

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Brian Foo, Jay Smith, and Holden Karau explain how to bring deep learning models from training to serving in a cloud production environment. You'll learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
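
As a taste of the export-and-serve steps, the sketch below saves a small Keras model in SavedModel format and queries a TensorFlow Serving REST endpoint; the paths, port, and model name are assumptions for illustration, not the presenters' exact setup.

```python
# Sketch: export a Keras model as a SavedModel, then query TensorFlow Serving.
import json
import requests
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# TensorFlow Serving expects numbered version directories under the model root.
tf.saved_model.save(model, "/models/iris/1")

# Assuming the tensorflow/serving container is running with
# --model_name=iris --model_base_path=/models/iris:
resp = requests.post(
    "http://localhost:8501/v1/models/iris:predict",
    data=json.dumps({"instances": [[5.1, 3.5, 1.4, 0.2]]}),
)
print(resp.json()["predictions"])
```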

Peet Smith is the lead director for the PwC Data Engineering Practice, based in Tampa, FL. Together with his team, he has been a key player in laying the foundation for the firm’s data acquisition technologies and data analytics platform. Peet is a Chartered Accountant CA (SA) and CISA by profession and a technologist at heart; he started his career with PwC South Africa in the accounting and auditing field, but his passion for challenging the status quo has led him to pursue a career in data engineering.

Presentations

Using modern database and open source tools to accelerate client service delivery (sponsored by MemSQL) Session

Peet Smith explains how PwC is using modern database tools with a combination of open source technologies to automate and scale data ingestion and transformation to get data to engagement teams to help them streamline and accelerate client service delivery.

Guoqiong Song is a senior deep learning software engineer on the big data technology team at Intel. She’s interested in developing and optimizing distributed deep learning algorithms on Spark. She holds a PhD in atmospheric and oceanic sciences with a focus on numerical modeling and optimization from UCLA.

Guoqiong Song is a senior deep learning software engineer on Intel’s big data technology team. She holds a PhD in atmospheric and oceanic sciences from UCLA, specializing in numerical modeling and optimization. Her current research interest is developing and optimizing distributed deep learning algorithms.

Presentations

Job recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Session

Can the talent industry make the job search/match more relevant and personalized for a candidate by leveraging deep learning techniques? Guoqiong Song, Wenjing Zhan, and Jacob Eisinger demonstrate how to leverage distributed deep learning framework BigDL on Apache Spark to predict a candidate’s probability of applying to specific jobs based on their résumé.
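
The session itself uses BigDL on Spark; as a simplified stand-in, the sketch below frames the same apply/not-apply prediction with Spark ML's logistic regression, using hypothetical resume/job features.

```python
# Simplified stand-in (Spark ML, not BigDL) for predicting whether a candidate
# applies to a job; feature and label columns are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("job-match-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0.82, 3.0, 1.0, 1), (0.31, 0.0, 0.0, 0)],
    ["title_similarity", "years_overlap", "same_city", "applied"],
)
features = VectorAssembler(
    inputCols=["title_similarity", "years_overlap", "same_city"],
    outputCol="features",
).transform(df)

model = LogisticRegression(labelCol="applied").fit(features)
model.transform(features).select("probability").show(truncate=False)
```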

Tim Spann is a senior solutions engineer at Cloudera, where he works with Apache NiFi, MiniFi, Kafka, MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, DataWorks Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.

Presentations

IoT edge processing with Apache NiFi, Apache MiniFi, and multiple deep learning libraries Session

Timothy Spann leads a hands-on deep dive into using Apache MiniFi with Apache MXNet and other deep learning libraries on edge devices.
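
One common wiring for this kind of flow (an assumption, not necessarily the session's exact setup) is a small inference script on the device that MiniFi invokes and whose JSON output it routes upstream, as sketched below with MXNet's Gluon model zoo.

```python
# Sketch of an edge inference script a MiniFi flow could invoke; the model
# choice, file path, and JSON contract are illustrative assumptions.
import json
import mxnet as mx
from mxnet.gluon.model_zoo import vision

net = vision.mobilenet_v2_1_0(pretrained=True)  # small model suited to edge devices

# Load and resize one camera frame (path is illustrative); a production script
# would also apply the model zoo's ImageNet normalization.
img = mx.image.imread("frame.jpg")
img = mx.image.imresize(img, 224, 224).astype("float32") / 255.0
img = mx.nd.transpose(img, (2, 0, 1)).expand_dims(axis=0)

scores = net(img).softmax()
top = int(scores.argmax(axis=1).asscalar())
# Emit JSON on stdout for the flow to pick up and ship upstream.
print(json.dumps({"class_id": top, "confidence": float(scores[0, top].asscalar())}))
```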

Mark Stange-Tregear is the vice president of analytics at Ebates, a role that also covers Ebates’s ShopStyle subsidiary. He leads analytics for Ebates, including the product management of the company’s centralized data warehouse and enterprise business intelligence platforms, and leads a team of analysts and data scientists supporting functional teams throughout Ebates. Previously, Mark worked in the real estate, video game, and nonprofit sectors.

Presentations

Feet on the ground, head in the clouds (sponsored by AtScale) Session

Interested in how Ebates is using a hybrid on-premises and cloud implementation to scale out its centralized business intelligence and data hub? Mark Stange-Tregear shares the history, business context, and technical plan around Ebates’s hybrid Hadoop-AWS cloud approach.

Chris Stirrat is a digital transformation engineering leader at Eagle Investment Systems. Chris has more than 25 years of technology engineering and product management experience. Previously, he held executive-level roles in software companies including Microsoft and Caradigm, a leading provider of population healthcare solutions. At Microsoft, Chris led product and engineering teams that delivered key strategic solutions in their extremely successful cloud-based technologies; at Caradigm, he led new product offerings and product transformations using cloud technologies.

Presentations

The big data makeover: 10 months from ideation to enterprise-scale solution (sponsored by Infoworks) Session

Eagle Investment Systems, a leading provider of financial services technology, is building a new Hadoop and cloud-based data management solution. Chris Stirrat explains how Eagle went from incubation to an enterprise-scale solution in just 10 months, using a Hadoop-based big data stack and multitenant architecture, transforming software creation, delivery, quality, technology, and culture.

Ian Swanson is vice president of product for AI and machine learning at Oracle, where he oversees the product strategy for the company’s AI/ML PaaS offerings. Previously, Ian was founder and CEO of DataScience.com (acquired by Oracle in 2018)—a company that provided an industry-leading enterprise data science platform that combined the tools, libraries, and languages data scientists loved with the infrastructure and workflows their organizations needed. Earlier in his career, he was an executive at American Express and Sprint and CEO of Sometrics, a company that launched the industry’s first global virtual currency platform (acquired by American Express in 2011). That platform, for which he earned a patent, managed more than 3.3 trillion units of virtual currency and served an online audience of 250 million in more than 180 countries. A sought-after speaker and expert on digital transformation, data science, big data, and performance-based analytics, Ian actively advises Fortune 500 companies and invests in leading startups.

Presentations

On the road to digital transformation, AI is a team sport (sponsored by Oracle + DataScience.com) Session

Ian Swanson explores why and how data scientists and line-of-business leaders must treat AI as a team sport and explains what tools are needed to deploy models and applications that truly inform decision making.

Mohammed Ibraaz Syed recently completed his master’s degree in applied economics at UCLA, where he focused on utilizing data science and machine learning techniques to solve economic problems. One of his primary research interests is applying artificial intelligence (AI) algorithms to extract narratives from a corpus of text. Previously, Ibraaz worked at the World Bank, providing analysis of the bank’s existing work and developing databases that have been used to draw inferences and implications for improving the bank’s activities. Ibraaz holds a BA in economics and a BSc in mathematics from the University of Maryland, College Park.

Presentations

From emotion analysis and topic extraction to narrative modeling Session

Narrative economics studies the impact of popular narratives and stories on economic fluctuations in the context of human interests and emotions. Andreea Kremm and Mohammed Ibraaz Syed describe the use of emotion analysis, entity relationship extraction, and topic modeling in modeling narratives from written human communication.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, agile, distributed teams. Previously, he led business operations for Bing Shopping in the US and Europe with Microsoft’s Bing Group and built and ran distributed teams that helped scale Amazon’s financial systems with Amazon in both Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with Spark NLP Tutorial

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.
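
For a sense of the library's surface area, the snippet below runs one of Spark NLP's published pretrained pipelines over raw text; treat the pipeline name and output keys as assumptions that may vary across releases.

```python
# A small taste of the Spark NLP API: annotate raw text with a pretrained pipeline.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # starts a Spark session with Spark NLP on the classpath

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("John Snow Labs builds the Spark NLP library.")

print(result["entities"])  # named entities found by the NER stage
print(result["lemma"])     # lemmatized tokens
```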

Spark NLP in action: How SelectData uses AI to better understand home health patients Session

David Talby, Alberto Andreotti, Stacy Ashworth, and Tawny Nichols outline a question-answering system for accurately extracting facts from free-text patient records and share best practices for training domain-specific deep learning NLP models. The solution is based on Spark NLP, an extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding.

Wangda Tan is a product management committee (PMC) member of Apache Hadoop and engineering manager of the computation platform team at Cloudera, where he manages all efforts related to Kubernetes and YARN for both on-cloud and on-premises use cases. His primary areas of interest are the YuniKorn scheduler (scheduling containers across YARN and Kubernetes) and the Hadoop submarine project (running deep learning workloads across YARN and Kubernetes). He has also led efforts in the Hadoop YARN community on features such as resource scheduling, GPU isolation, node labeling, and resource preemption. Previously, he worked on the integration of OpenMPI and GraphLab with Hadoop YARN at Pivotal and helped create a large-scale machine learning, matrix, and statistics computation program using MapReduce and MPI at Alibaba.

Presentations

Deep learning on YARN: Running distributed TensorFlow, MXNet, Caffe, and XGBoost on Hadoop clusters Session

In order to train deep learning and machine learning models, you must leverage applications such as TensorFlow, MXNet, Caffe, and XGBoost. Wangda Tan discusses new features in Apache Hadoop 3.x to better support deep learning workloads and demonstrates how to run these applications on YARN.

Elena Terenzi is a software development engineer at Microsoft, where she brings business intelligence solutions to Microsoft’s enterprise customers and advocates for business analytics and big data solutions in the Western European manufacturing sector, helping big automotive customers implement telemetry analytics solutions with an IoT flavor. She started her career with data as a database administrator and data analyst for an investment bank in Italy. Elena holds a master’s degree in AI and NLP from the University of Illinois at Chicago.

Presentations

When Tiramisu meets online fashion retail Session

Large online fashion retailers must efficiently maintain catalogues of millions of items. Due to human error, it's not unusual that some items have duplicate entries. Since manually trawling such a large catalogue is next to impossible, how can you find these entries? Patty Ryan, CY Yam, and Elena Terenzi explain how they applied deep learning for image segmentation and background removal.

Shawn Terry is lead architect for Joy Global Analytics.

Presentations

How Komatsu is improving mining efficiencies using the IoT and machine learning Session

Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment to ultimately improve mine performance and efficiencies. Shawn Terry details the company's data journey and explains how it is using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment.

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with Spark NLP Tutorial

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.

Skyler Thomas is an engineer at MapR, where he is designing Kubernetes-based infrastructure to deliver machine learning and big data applications at scale. Previously, Skyler was chief architect for WebSphere user experience at IBM, where he worked with more than a hundred customers to deliver extreme-scaled applications in the healthcare, financial services, and retail industries.

Presentations

Kubernetes plays Cupid for data scientists and IT (sponsored by MapR) Session

In the past, there have been major challenges in quickly creating machine learning training environments and deploying trained models into production. Skyler Thomas details how Kubernetes helps data scientists and IT work in concert to speed model training and time-to-value.

John Thuma is director of marketing data at Arcadia Data, where he assists clients in developing solutions that have measurable business impact. With 25 years of field experience, John has developed solutions for multiple vertical industries, including banking and financial services, retail, life sciences, and others.

Presentations

If you thought politics was dirty, you should see the analytics behind it. Session

Forget about the fake news; data and analytics in politics is what drives elections. John Thuma shares ethical dilemmas he faced while proposing analytical solutions to the RNC and DNC. Not only did he help causes he disagreed with, but he also armed politicians with real-time data to manipulate voters.

Yaroslav Tkachenko is a software engineer interested in distributed systems, microservices, functional programming, modern cloud infrastructure, and DevOps practices. Currently, Yaroslav is a software architect at Activision Blizzard, working on a data platform for Activision games, including the Call of Duty franchise.

Prior to joining Activision, Yaroslav held various leadership roles in multiple startups, where he was responsible for designing, developing, delivering, and maintaining platform services and cloud infrastructure for mission-critical systems.

Presentations

Lessons learned building a scalable and extendable data pipeline for Call of Duty Session

What's easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse...wait, this looks like a lot of things. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision.
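
To make the routing step concrete, here is a minimal sketch of a stream router in the spirit of the pipeline described above; the topic names, routing field, and dead-letter convention are assumptions, not Activision's actual design.

```python
# Sketch of a stream-routing stage: read a mixed telemetry stream and fan
# events out to per-game topics, with a dead-letter topic for unknown events.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "telemetry.ingest",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for msg in consumer:
    event = msg.value
    if "game_id" in event:
        producer.send(f"telemetry.{event['game_id']}", event)
    else:
        producer.send("telemetry.dlq", event)  # quarantine malformed events
```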

Wee Hyong Tok is a principal data science manager with the AI CTO Office at Microsoft, where he leads the engineering and data science team for the AI for Earth program. Wee Hyong has worn many hats in his career, including developer, program and product manager, data scientist, researcher, and strategist, and his track record of leading successful engineering and data science teams has given him unique superpowers to be a trusted AI advisor to customers. Wee Hyong coauthored several books on artificial intelligence, including Predictive Analytics Using Azure Machine Learning and Doing Data Science with SQL Server. Wee Hyong holds a PhD in computer science from the National University of Singapore.

Presentations

A developer's guide to building AI applications (sponsored by Microsoft) Session

Anand Raman and Wee Hyong Tok walk you through applying AI technologies in the cloud. You'll learn how to add prebuilt AI capabilities like object detection, face understanding, translation, and speech to applications, build cognitive search applications that understand deep content in images, text, and other data, use the Azure platform to accelerate machine learning, and more.

Steven Totman is the financial services industry lead for Cloudera’s Field Technology Office, where he helps companies monetize their big data assets using Cloudera’s Enterprise Data Hub. Prior to Cloudera, Steve ran strategy for a mainframe-to-Hadoop company and drove product strategy at IBM for DataStage and Information Server after joining with the Ascential acquisition. He architected IBM’s Infosphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents for data-integration and governance/metadata-related designs.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Jane Tran is vice president of strategy and presales at Unqork, where she works directly with clients to set the direction for the Unqork platform in both user experience and functionality. Jane has been helping leaders in financial services assess and implement new business strategies since the start of her career. Previously, she worked at internal strategy teams for C-suites at JPMorgan Chase, Marsh, and MetLife and advised a portfolio of startups for Techstars Connection in partnership with AB InBev. Jane holds a BA in economics and policy studies from Syracuse University.

Presentations

The balancing act: Building business-relevant data solutions for the front line Findata

Data’s role in financial services has been elevated. However, often the rollout of data solutions fails when an organization’s existing culture is misaligned with its capabilities. Unqork is increasing adoption by honoring existing capabilities. Jane Tran explores methods to finally implement data solutions through both qualitative and quantitative discoveries.

Madhu Tumma is director of IT engineering at TIAA, where he focuses on database core services, engineering, infrastructure, operations, and strategy. Madhu has 25 years of experience working on various platforms and data technologies. Over his career, he has held senior IT positions at JPMorgan, Bear Stearns, AboveNet, and Merrill Lynch’s DAF Group (BOA). Madhu is the author/coauthor of three Oracle database management-related books as well as a speaker and subject-matter expert in cloud analytics, data privacy, server engineering, and database management.

Presentations

Quick, reliable, and cost-effective ways to operationalize big data apps (sponsored by Unravel) Session

Operationalizing big data apps in a quick, reliable, and cost-effective manner remains a daunting task. Shivnath Babu and Madhusudan Tumma outline common problems and their causes and share best practices to find and fix these problems quickly and prevent such problems from happening in the first place.

Mike Tung is the CEO of Diffbot, an adviser at the Stanford StartX accelerator, and the leader of Stanford’s entry in the DARPA Robotics Challenge. In a previous life, he was a patent lawyer, a grad student in the Stanford AI lab, and a software engineer at eBay, Yahoo, and Microsoft. Mike studied electrical engineering and computer science at UC Berkeley and artificial intelligence at Stanford.

Presentations

Automating business processes with large-scale knowledge graphs Session

Mike Tung offers an overview of available open source and commercial knowledge graphs and explains how consumer and business applications are already taking advantage of them to provide intelligent experiences and enhanced business efficiency. Mike then discusses what's coming in the future.

Michelle Ufford leads the data platform architecture core team at Netflix, which focuses on platform innovation and usability. Previously, she led the data management team at GoDaddy, where she built data engineering solutions for personalization and helped pioneer Hadoop data warehousing techniques. Michelle is a published author, patented developer, award-winning open source contributor, and Most Valuable Professional (MVP) for Microsoft Data Platform. You can find her on Twitter at @MichelleUfford.

Presentations

Data at Netflix: See what’s next Session

Michelle Ufford shares some of the cool things Netflix is doing with data and the big bets the company is making on data infrastructure, covering workflow orchestration, machine learning, interactive notebooks, centralized alerting, event-based processing, platform intelligence, and more.

Sandeep Uttamchandani is a chief data architect at Intuit, where he leads the cloud transformation of the big data analytics, ML, and transactional platform used by 4M+ small business users for financial accounting, payroll, and billions of dollars in daily payments. Previously, Sandeep was cofounder and CEO of a machine learning startup focused on ML for managing enterprise systems and played various engineering roles at VMware and IBM. His experience uniquely combines building enterprise data products and operational expertise in managing petabyte-scale data and analytics platforms in production. He’s received several excellence awards and has over 40 issued patents and 25 publications in key systems conferences such as the International Conference on Very Large Data Bases (VLDB), Special Interest Group on Management of Data (SIGMOD), Conference on Innovative Data Systems Research (CIDR), and USENIX. He’s a regular speaker at academic institutions, guest-lectures for university courses, and conducts conference tutorials for data engineers and scientists; he also advises PhD students and startups, serves as a program committee member for systems and data conferences, and was an associate editor for ACM Transactions on Storage. He holds a PhD in computer science from the University of Illinois Urbana-Champaign.

Presentations

Circuit breakers to safeguard for garbage in, garbage out Session

Do your analysts always trust the insights generated by your data platform? Ensuring insights are always reliable is critical for use cases in the financial sector. Sandeep Uttamchandani outlines a circuit breaker pattern developed for data pipelines, similar to the common design pattern used in service architectures, that detects and corrects problems and ensures always reliable insights.
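
A minimal sketch of the pattern, with illustrative names and thresholds: a breaker sits between the pipeline's quality checks and its publication step, and trips after repeated failures so consumers keep seeing the last good state instead of garbage.

```python
# Minimal data-pipeline circuit breaker sketch (names/thresholds are assumptions).
class PipelineCircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def record(self, checks_passed):
        # Reset on success; accumulate consecutive failures otherwise.
        self.failures = 0 if checks_passed else self.failures + 1

def push_to_serving_layer(batch):
    print(f"published {len(batch)} rows")  # hypothetical downstream publish

breaker = PipelineCircuitBreaker()

def publish_batch(batch, quality_checks_passed):
    breaker.record(quality_checks_passed)
    if breaker.is_open:
        # Downstream keeps the last good data; on-call investigates upstream.
        raise RuntimeError("circuit open: batch quarantined, alert raised")
    push_to_serving_layer(batch)
```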

Balaji Varadarajan is a senior software engineer at Uber, where he works on the Hudi project and oversees data engineering broadly across the network performance monitoring domain. Previously, he was one of the lead engineers on LinkedIn’s databus change capture system as well as the Espresso NoSQL store. Balaji’s interests lie in distributed data systems.

Presentations

Hudi: Unifying storage and serving for batch and near-real-time analytics Session

Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.
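
Hudi is consumed from Spark as a datasource; the hedged sketch below shows an upsert write, where the option keys follow Hudi's documented configuration (earlier pre-Apache releases used the com.uber.hoodie format name) and the table, fields, and path are made up.

```python
# Hedged sketch of an upsert into a Hudi dataset via its Spark datasource.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-sketch").getOrCreate()
df = spark.createDataFrame(
    [("trip-1", "2018-09-12 10:00:00", 9.5)],
    ["trip_id", "ts", "fare"],
)

(df.write.format("org.apache.hudi")                           # format name varies by release
   .option("hoodie.table.name", "trips")
   .option("hoodie.datasource.write.recordkey.field", "trip_id")
   .option("hoodie.datasource.write.precombine.field", "ts")  # latest ts wins on upsert
   .option("hoodie.datasource.write.operation", "upsert")
   .mode("append")
   .save("/data/hudi/trips"))
```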

Maulin Vasavada is a software developer and an architect on the Kafka team at PayPal, building a suite of components for Kafka as a service. He has strong experience building large-scale financial systems, shipping and logistics software, and software release management systems. Previously, he worked for eBay and as a consultant for Sun Microsystems.

Presentations

Kafka at PayPal: Enabling 400 billion messages a day Session

PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing.

Nanda Vijaydev is the lead data scientist and head of solutions at BlueData (now HPE), where she leverages technologies like TensorFlow, H2O, and Spark to build solutions for enterprise machine learning and deep learning use cases. Nanda has more than 10 years of experience in data science and data management. Previously, she worked on data science projects in multiple industries as a principal solutions architect at Silicon Valley Data Science and served as director of solutions engineering at Karmasphere.

Presentations

What's the Hadoop-la about Kubernetes? Session

Kubernetes (K8s)—the open source container orchestration system for modern big data workloads—is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for big data services for K8s.

Avni Wadhwa is an analytics and marketing hacker at H2O.ai, where she does a mix of marketing and sales engineering. She holds a BS in management science from the University of California, San Diego.

Presentations

Practical techniques for interpreting machine learning models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. Patrick Hall, Avni Wadhwa, and Mark Chan share practical and productizable approaches for explaining, testing, and visualizing machine learning models using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.
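
To give a flavor of one technique in this toolkit, the sketch below trains an XGBoost model constrained to be monotonic in chosen features, then inspects gain-based importance; the synthetic data and feature names are assumptions, not the presenters' exercise.

```python
# Sketch: monotonic-constrained XGBoost plus gain-based feature importance.
import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(500, 3)  # pretend columns: income, debt_ratio, age
y = (X[:, 0] - X[:, 1] + 0.1 * rng.randn(500) > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y, feature_names=["income", "debt_ratio", "age"])
params = {
    "objective": "binary:logistic",
    "monotone_constraints": "(1,-1,0)",  # +1 increasing, -1 decreasing, 0 unconstrained
    "max_depth": 3,
}
model = xgb.train(params, dtrain, num_boost_round=50)
print(model.get_score(importance_type="gain"))
```

Constraining monotonicity trades a little accuracy for behavior that can be explained and audited, which is often the point for business adoption.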

Tim Walpole is a cognitive architect at BJSS, where he designs complex, vendor-agnostic, multilingual cloud-based chatbot solutions for a range of clients. Previously, he was head of mobile at BJSS. Tim began his 21-year career as an IT consultant with ICL (and then Fujitsu). Over his career, he has worked at Microsoft, at the European Commission in Luxembourg, and HP. Tim is passionate about systems integration and is always looking for clever and innovative ways to connect systems together. Tim has two grown-up daughters and lives in Newington Green in London. He enjoys choral singing and is currently working on renovating and extending his property, built in the 1850s.

Presentations

Using big data to unlock the delivery of personalized, multilingual real-time chat services for global financial service organizations Session

Financial service clients demand increased data-driven personalization, faster insight-based decisions, and multichannel real-time access. Tim Walpole details how organizations can deliver real-time, vendor-agnostic, personalized chat services and explores issues around security, privacy, legal sign-off, data compliance, and how the internet of things can be used as a delivery platform.

Todd Walter is chief technologist and fellow at Teradata, where he helps business leaders, analysts, and technologists better understand all of the astonishing possibilities of big data and analytics in view of emerging and existing capabilities of information infrastructures. Todd has been with Teradata for more than 30 years. He’s a sought-after speaker and educator on analytics strategy, big data architecture, and exposing the virtually limitless business opportunities that can be realized by architecting with the most advanced analytic intelligence platforms and solutions. Todd holds more than a dozen patents.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Dean Wampler is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He’s head of developer relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he’s the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He earned his PhD in physics from the University of Washington.

Presentations

Executive Briefing: What you need to know about fast data Session

Streaming data systems, so-called "fast data," promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. Like microservices, they force architecture changes to meet new demands for reliability and dynamic scalability. Dean Wampler shares what you need to know to exploit fast data successfully.

Hands-on Kafka streaming microservices with Akka Streams and Kafka Streams Tutorial

Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. You'll also discover a few ML model serving ideas along the way.

Jason Wang is a software engineer at Cloudera focusing on the cloud.

Presentations

A comparative analysis of the fundamentals of AWS and Azure Session

The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies now face the difficult decision of which cloud to go with. This decision is not just financial and in many cases rests on the underlying infrastructure. Jason Wang, Suraj Acharya, and Tony Wu compare the relative strengths and weaknesses of AWS and Azure.

Running multidisciplinary big data workloads in the cloud Tutorial

Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud, integrate data engineering and data analytics workflows, and explore considerations and best practices for cloud-based data analytics pipelines. Along the way, you'll see how to share metadata across workloads in a big data PaaS.

Jacob Ward is a science and technology correspondent for CNN, Al Jazeera, and PBS. The former editor-in-chief of Popular Science magazine, Jacob writes for The New Yorker, Wired, and Men’s Health. His 10-episode Audible podcast, Complicated, discusses humanity’s most difficult problems, and he’s the host of an upcoming four-hour public television series, Hacking Your Mind, about human decision making and irrationality. Jacob is developing a CNN original series about the unintended consequences of big ideas. He is a 2018–2019 Berggruen Fellow at Stanford University’s Center for Advanced Study in the Behavioral Sciences, where he’s writing a book, due for publication by Hachette Books in 2020, about how artificial intelligence will amplify good and bad human instincts.

Presentations

Black box: How AI will amplify the best and worst of humanity Keynote

For most of us, our own mind is a black box—an all-powerful and utterly mysterious device that runs our lives for us, using rules and shortcuts of which we aren’t even aware. Jacob Ward reveals the relationship between the unconscious habits of our minds and the way that AI is poised to amplify them, alter them, maybe even reprogram them.

Marc Warner is the cofounder and CEO of ASI Data Science. He founded ASI in the belief that the benefits of AI should extend to everyone and has shaped the company to support organizations of all shapes and sizes in taking advantage of rapid advances in the field. In the two years since founding ASI, Marc has overseen its growth to more than 50 employees and expanded its scope from a small fellowship scheme to a cutting-edge range of software, training, project, and advisory services. He has led over 50 data science projects for clients ranging from multinational companies like EasyJet and Siemens to the UK government and the NHS. His work has been covered by the BBC, the Telegraph, the Independent, and many more. Previously, Marc was the Marie Curie Fellow of Physics at Harvard University, studying quantum metrology and quantum computing. His PhD research, in the field of quantum computing, was awarded the Stoneham Prize, published in Nature, and covered in the New York Times.

Presentations

Predicting residential occupancy and hot water usage from high-frequency, multivector utilities data Session

In EU households, heating and hot water alone account for 80% of energy usage. Cristobal Lowery and Marc Warner explain how future home energy management systems could improve their energy efficiency by predicting resident needs through utilities data, with a particular focus on the key data features, the need for data compression, and the data quality challenges.

Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am Session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.
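
To make the "magic numbers" concrete, here is a hand-tuned PySpark session; every value below is illustrative rather than a recommendation, and these are exactly the knobs an auto-tuner would search over per workload and cluster:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hand-tuned-job")
        .config("spark.executor.memory", "8g")          # guess too low: OOM at 2:00am
        .config("spark.executor.cores", "4")
        .config("spark.executor.instances", "20")
        .config("spark.sql.shuffle.partitions", "400")  # the default (200) is often wrong
        .config("spark.memory.fraction", "0.6")
        .getOrCreate()
    )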

Katharina Warzel is head of data analytics and performance marketing at EveryMundo, where she helps airlines around the world innovate based on actionable insights from analytics. She is responsible for defining data standards, specifying data collection systems, implementing the tracking environment, analyzing data quality, and supporting airlines' digital strategies by leveraging automation and data science. Katharina has a passion for travel and trying new things, which has taken her from living in four countries to overnighting in the Moroccan desert, hiking to Machu Picchu, volunteering in Kenya, and becoming a certified yoga teacher in Bali. Everything she does is multilingual, as she speaks German, French, Spanish, and English. Her long-term intention is to join the Data for Good movement, which encourages using data in meaningful ways to solve humanitarian issues around poverty, health, human rights, education, and the environment.

Presentations

Self-reliant, secure, end-to-end data, activity, and revenue analytics: A roadmap for the airline industry Data Case Studies

Airlines want to know what happens after a user interacts with their websites. Do they convert? Do they close the browser and come back later? Airlines traditionally have depended on analytics tools to prove value. Katharina Warzel explores how to implement a client-independent end-to-end tracking system.

Sophie Watson is a senior data scientist at Red Hat, where she helps customers use machine learning to solve business problems in the hybrid cloud. She’s a frequent public speaker on topics including machine learning workflows on Kubernetes, recommendation engines, and machine learning for search. Sophie earned her PhD in Bayesian statistics.

Presentations

Building a recommendation engine Session

Recommender systems enhance user experience and business revenue every day. Sophie Watson demonstrates how to develop a robust recommendation engine using a microservice architecture.
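
For intuition, here is a self-contained sketch of item-based collaborative filtering, the kind of scoring logic one microservice in such an architecture might serve (the interaction matrix is toy data, not Red Hat's):

    import numpy as np

    # Rows are users, columns are items; entries are implicit feedback counts.
    interactions = np.array([
        [3, 0, 1, 0],
        [2, 1, 0, 0],
        [0, 0, 4, 1],
        [0, 2, 3, 0],
    ], dtype=float)

    # Cosine similarity between item columns.
    normed = interactions / np.clip(
        np.linalg.norm(interactions, axis=0, keepdims=True), 1e-9, None)
    item_sim = normed.T @ normed

    def recommend(user, k=2):
        """Score unseen items by similarity to the user's history."""
        scores = item_sim @ interactions[user]
        scores[interactions[user] > 0] = -np.inf  # mask already-seen items
        return np.argsort(scores)[::-1][:k]

    print(recommend(user=0))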

Robin Way is a faculty member for banking at the International Institute of Analytics and the founder and president of management analytics consultancy Corios. Robin has over 25 years of experience in the design, development, execution, and improvement of applied analytics models for clients in the credit, payments, lending, brokerage, insurance, and energy industries. Previously, Robin was a managing analytics consultant in SAS Institute’s Financial Services Business Unit for 12 years and spent another 10+ years in analytic management roles for several client-side and consulting firms. Robin’s professional passion is devoted to democratizing and demystifying the science of applied analytics. His contributions to the field correspondingly emphasize statistical visualization, analytical data preparation, predictive modeling, time series forecasting, mathematical optimization applied to marketing, and risk management strategies. He is author of Skate Where the Puck’s Headed: A Playbook for Scoring Big with Predictive Analytics. Robin holds an undergraduate degree from the University of California at Berkeley; his subsequent graduate-level coursework emphasized the analytical modeling of human and consumer behavior. He lives in Portland, Oregon, with his wife, Melissa, and two sons, Colin and Liam. In his spare time, Robin plays soccer and holds a black belt in taekwondo.

Presentations

CANCELED: Leading next-best offer strategies for financial services Findata

Robin Way shares case study examples of next-best offer strategies, predictive customer journey analytics, and behavior-driven time-to-event targeting for mathematically optimal customer messaging that drives incremental margins.

Jeffrey Wecker is chief data officer at Goldman Sachs.

Presentations

Von Neumann to deep learning: Data revolutionizing the future Keynote

Jeffrey Wecker leads a deep dive on data in financial services, with perspectives on the evolving landscape of data science, the advent of alternative data, the importance of data centricity, and the future for machine learning and AI.

Daniel Weeks manages the big data compute team at Netflix and is a Parquet committer. Previously, Daniel focused on research in big data solutions and distributed systems.

Presentations

The evolution of Netflix's S3 data warehouse Session

In the last few years, Netflix's data warehouse has grown to more than 100 PB in S3. Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3.

Thomas Weise is a software engineer for the streaming platform at Lyft. He’s also a PMC member for the Apache Apex and Apache Beam projects and has contributed to several more projects within the ASF ecosystem. Thomas is a frequent speaker at international big data conferences and the author of Learning Apache Apex.

Presentations

Near-real-time anomaly detection at Lyft Session

Thomas Weise and Mark Grover explain how Lyft uses its streaming platform to detect and respond to anomalous events, using data science tools for machine learning and a process that allows for fast and predictable deployment.
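
The following is an illustrative sketch of the core idea, not Lyft's implementation: keep a running estimate of a metric's normal behavior and flag points that deviate sharply (all parameters here are assumptions):

    class EwmaAnomalyDetector:
        """Exponentially weighted mean/variance with a z-score threshold."""

        def __init__(self, alpha=0.1, threshold=5.0):
            self.alpha, self.threshold = alpha, threshold
            self.mean, self.var = None, 0.0

        def observe(self, x):
            if self.mean is None:  # first observation seeds the estimate
                self.mean = x
                return False
            diff = x - self.mean
            is_anomaly = self.var > 0 and abs(diff) / self.var ** 0.5 > self.threshold
            # Update the estimates after scoring the point.
            self.mean += self.alpha * diff
            self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
            return is_anomaly

    detector = EwmaAnomalyDetector()
    for value in [10, 11, 9, 10, 12, 10, 45, 10]:
        if detector.observe(value):
            print("anomalous reading:", value)  # flags 45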

Masha Westerlund is director of the data science team at Investopedia, where she works to answer questions such as “What can Investopedia’s readership tell us about current market sentiment?” and “What financial concepts are most interesting to American investors, from Wall Street to Silicon Valley?” Masha holds a PhD in cognitive science from New York University.

Presentations

Anxiety at scale: How Investopedia used readership data to track market volatility Session

Businesses rely on user data to power their sites, products, and sales. Can we give back by sharing those insights with users? Masha Westerlund explains how Investopedia harnessed reader data to build an index that tracks market anxiety and moves with the VIX, a proprietary measure of market volatility. You'll see how thinking outside the box helps turn data into tools for users, not stakeholders.
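
As a toy illustration of the mechanics (the term list and weighting here are invented, not Investopedia's), such an index can be built by standardizing daily pageviews of fear-related terms and averaging them:

    import pandas as pd

    views = pd.DataFrame({
        "bear_market": [900, 950, 1000, 2400, 2600],
        "recession":   [400, 420, 410, 1100, 1300],
        "volatility":  [700, 690, 720, 1500, 1800],
    }, index=pd.date_range("2018-02-01", periods=5))

    zscores = (views - views.mean()) / views.std()  # standardize each term
    anxiety_index = zscores.mean(axis=1)            # one reading per day
    print(anxiety_index)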

Chris Wojdak is a senior program and managing architect at Symcor, where he leads innovation, digital transformation strategies, and next-generation analytics. Chris has more than 20 years of experience designing and implementing modern, secure, advanced solutions for the financial services industry. He is currently focused on helping customers leverage machine learning, the IoT, and next-generation analytics to detect and prevent fraud as customers move to digital channels. He has received innovation awards for next-generation analytics from Informatica. Chris holds a bachelor's degree in technology from the Faculty of Engineering at McMaster University.

Presentations

Preventing more fraud in less time with machine learning-driven data management (sponsored by Informatica) Session

Chris Wojdak explains how Symcor has transformed its big data architecture using Informatica’s comprehensive machine learning-based solutions for data integration, data quality, data cataloging, and data governance.

Heesun Won is a principal researcher at the Electronics and Telecommunications Research Institute (ETRI), where she has been developing an open data reference model and a data distribution system with a semantic data map: SODAS (Smart Open Data as a System). Her research interests include software architecture for big data processing in cloud environments.

Presentations

A data marketplace case study with the blockchain and advanced multitenant Hadoop in a smart open data platform Session

Minh Chau Nguyen and Heesun Won explain how to implement analytics services in data marketplace systems on a single Hadoop cluster across distributed data centers. The solution extends the overall architecture of the Hadoop ecosystem with the blockchain so that multiple tenants and authorized third parties can securely access data while still maintaining privacy, scalability, and reliability.

Brian Wu is an engineer on the AppNexus optimization team, where he has worked closely with budgeting, valuation, and allocation systems and has seen great changes and great mistakes. Coming from a pure mathematics background, Brian enjoys working on algorithm, logic, and streaming data problems with his team. In addition to control systems, data technologies, and real-time applications, Brian loves talking about process, teamwork, management, sequencers, synthesizers, and the NYC music scene.

Presentations

AppNexus's stream-based control system for automated buying of digital ads Session

Automating the success of digital ad campaigns is complicated and comes with the risk of wasting the advertiser's budget or a trader's margin and time. Brian Wu describes the evolution of Inventory Discovery, a streaming control system for eligibility, prioritization, and real-time evaluation that helps digital advertisers hit their performance goals with AppNexus.

Tony Wu is an engineering manager at Cloudera, where he manages the Altus core engineering team. Previously, Tony was a team lead for the partner engineering team at Cloudera. He’s responsible for Microsoft Azure integration for Cloudera Director.

Presentations

A comparative analysis of the fundamentals of AWS and Azure Session

The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies now face the difficult decision of which cloud to go with. This decision is not just financial and in many cases rests on the underlying infrastructure. Jason Wang, Suraj Acharya, and Tony Wu compare the relative strengths and weaknesses of AWS and Azure.

Jerry Xu is cofounder and CTO at Datatron Technologies. A software engineer with extensive programming and design experience in storage systems, online services, mobile, distributed systems, virtualization, and OS kernels, Jerry has a demonstrated ability to direct and motivate a team of software engineers to complete projects meeting specifications and deadlines. Previously, he worked at Zynga, Twitter, Box, and Lyft, where he built the company's ETA machine learning model. Jerry is the author of the open source project LibCrunch and a three-time Microsoft Gold Star Award winner.

Presentations

Infrastructure for deploying machine learning to production in large financial institutions: Lessons learned and best practices Session

Large financial institutions have many data science teams (e.g., those for fraud, credit risk, and marketing), each often using a diverse set of tools to build predictive models. Productionizing these predictive AI models involves many challenges. Harish Doddi and Jerry Xu share the challenges and lessons learned in deploying AI models to production at large financial institutions.

Bharath Yadla is vice president of product strategy for ecosystems at Aerospike, focusing on Aerospike connectors, cloud, clients, and tooling. Bharath is a serial entrepreneur and seasoned executive with more than 20 years of experience building disruptive products and solutions and leading product marketing and customer success teams at both startups and high-growth companies. He serves as an adviser to startups and incubation accelerators. Previously, he was vice president of strategy and solutions at Persistent Systems and associate vice president and founding member of the Digital Transformation Unit at HCL, where he drove sales, solutions, and strategy.

Presentations

How the blurring of memory and storage is revolutionizing the data era (sponsored by Intel) Session

Persistent memory accelerates analytics, database, and storage workloads across a variety of use cases, bringing new levels of speed and efficiency to the data center and to in-memory computing. Arakere Ramesh and Bharath Yadla offer an overview of the newly announced Intel Optane data center persistent memory and share the exciting potential of this technology in analytics solutions.

CY Yam is a data scientist at Microsoft, where she applies machine learning techniques to solving various problems in daily life. Previously, CY invented new ways to recognize people by the way they move.

Presentations

When Tiramisu meets online fashion retail Session

Large online fashion retailers must efficiently maintain catalogues of millions of items. Due to human error, it's not unusual that some items have duplicate entries. Since manually trawling such a large catalogue is next to impossible, how can you find these entries? Patty Ryan, CY Yam, and Elena Terenzi explain how they applied deep learning for image segmentation and background removal.
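
One hedged sketch of the duplicate-finding step (the segmentation and background-removal stages are omitted, and file names are placeholders): embed each catalogue image with a pretrained CNN and flag pairs whose embeddings nearly coincide:

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.resnet50(pretrained=True)
    model.fc = torch.nn.Identity()  # keep the 2048-d pooled features
    model.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def embed(path):
        with torch.no_grad():
            x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            return torch.nn.functional.normalize(model(x), dim=1)

    a, b = embed("item_001.jpg"), embed("item_047.jpg")
    similarity = (a @ b.T).item()  # cosine similarity in [-1, 1]
    print("likely duplicates" if similarity > 0.95 else "distinct items")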

Han Yang is a senior product manager at Cisco, where he drives UCS solutions for artificial intelligence and machine learning; he has always enjoyed driving new technologies. Previously, Han drove Cisco's big data and analytics UCS solutions as well as the company's largest switching beta, for the software virtual switch Nexus 1000V. Han has a PhD in electrical engineering from Stanford University.

Presentations

Ubiquitous machine learning (sponsored by Cisco) Session

Data is the lifeblood of an enterprise, and it's being generated everywhere. To overcome the challenges of data gravity, data analytics, including machine learning, is best done where the data is located: ubiquitous machine learning. Han Yang explains how to overcome the challenges of doing machine learning everywhere.

Longqi Yang is a PhD candidate in computer science at Cornell Tech and Cornell University, where he is advised by Deborah Estrin and is a member of the Connected Experiences Lab and the Small Data Lab. His current research focuses on user modeling, recommendation systems, and recommendation for social good. His work has been published and presented at top academic conferences, such as WWW, WSDM, RecSys, and CIKM. He co-organized workshops at the NYC Media Lab annual summit 2017 and KDD 2018.

Presentations

Harnessing and customizing state-of-the-art recommendation solutions with OpenRec Session

State-of-the-art recommendation algorithms are increasingly complex and no longer one size fits all. Current monolithic development practice poses significant challenges to rapid, iterative, and systematic experimentation. Longqi Yang explains how to use OpenRec to easily customize state-of-the-art solutions for diverse scenarios.

Na Yang is a software engineer at PayPal, where she focuses on building a scalable streaming infrastructure platform. Previously, she built various big data and distributed systems at MapR and Quova. Outside of work, she likes to spend time hiking with her kids.

Presentations

Kafka at PayPal: Enabling 400 billion messages a day Session

PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing.
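
One of the auditing ideas is easy to sketch (illustrative kafka-python code, not PayPal's tooling; it assumes a single-partition topic so ordering holds): tag each message with a producer-side sequence number, and have an auditor flag gaps:

    import json
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode())
    for seq in range(1000):
        producer.send("payments", {"seq": seq, "payload": "..."})
    producer.flush()

    # Auditor: a sequence gap means a lost (or late) message.
    consumer = KafkaConsumer(
        "payments", bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest", consumer_timeout_ms=5000)
    expected = 0
    for record in consumer:
        seq = json.loads(record.value)["seq"]
        if seq != expected:
            print(f"gap detected: expected {expected}, got {seq}")
        expected = seq + 1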

Renee Yao is a senior product marketing manager at NVIDIA, focusing on deep learning and analytics solutions on AI systems. She graduated from the Haas School of Business at the University of California, Berkeley, and was named one of the top 50 B2B product marketers to watch by Kapost. In her free time, she enjoys competitive Latin dancing, horseback riding, golfing, and sculpting.

Presentations

Accelerate AI with synthetic data using generative adversarial networks (GAN) (sponsored by NVIDIA) Session

Renee Yao explains how generative adversarial networks (GANs) are successfully used to improve data generation and explores specific real-world examples where customers have deployed GANs to solve challenges in the healthcare, space, transportation, and retail industries.
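
The adversarial recipe itself fits in a few lines. Below is a minimal sketch on toy 1-D data (production synthetic-data GANs are far more elaborate): the generator learns to mimic a Gaussian "real data" distribution by fooling the discriminator:

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
    D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCELoss()

    for step in range(2000):
        real = torch.randn(64, 1) * 0.5 + 3.0  # "real" data: N(3, 0.5)
        fake = G(torch.randn(64, 8))

        # Discriminator: push real toward 1, fake toward 0.
        d_loss = (bce(D(real), torch.ones(64, 1)) +
                  bce(D(fake.detach()), torch.zeros(64, 1)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator: make the discriminator call fakes real.
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    samples = G(torch.randn(1000, 8))
    print(samples.mean().item(), samples.std().item())  # approaches 3.0, 0.5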

Onur Yilmaz is a deep learning solution architect at NVIDIA, where he works on deep learning use cases for finance and helps researchers and data scientists adopt deep learning and GPU technology. Onur holds a PhD in computer engineering from the New Jersey Institute of Technology; his dissertation focused on traditional machine learning and high-performance signal processing for finance.

Presentations

Accelerating financial data science workflows with GPUs Session

GPUs have allowed financial firms to accelerate their computationally demanding workloads. Today, the bottleneck has moved completely to ETL. The GPU Open Analytics Initiative (GoAi) is helping accelerate ETL while keeping the entire workflow on GPUs. Joshua Patterson and Onur Yilmaz discuss several GPU-accelerated data science tools and libraries.
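
A hedged taste of what keeping ETL on the GPU looks like with cuDF, one of the GoAi libraries (the file and columns are placeholders, and the API has evolved since this talk):

    import cudf

    df = cudf.read_csv("trades.csv")  # parsed directly on the GPU
    df = df[df["price"] > 0]          # filtered without leaving GPU memory
    daily = df.groupby("symbol").agg({"price": "mean"})
    print(daily.head())               # hand off to ML with no CPU round-trip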

Nir Yungster leads the data science team at JW Player, focusing on providing recommendations as a service to thousands of online video publishers. Nir studied aerospace engineering at Princeton University and holds a master’s degree in applied mathematics from Northwestern University.

Presentations

Building turnkey recommendations for 5% of internet video Session

JW Player—the world’s largest network-independent video platform, representing 5% of global internet video—provides on-demand recommendations as a service to thousands of media publishers. Nir Yungster and Kamil Sindi explain how the company is systematically improving model performance while navigating the many engineering challenges and unique needs of the diverse publishers it serves.

Varant Zanoyan is a software engineer on the ML Infrastructure team at Airbnb, where he works on tools and frameworks for building and productionizing ML models. Previously, he solved data infrastructure problems at Palantir Technologies.

Presentations

Zipline: Airbnb's data management platform for machine learning Session

Zipline is Airbnb's soon-to-be-open-sourced data management platform specifically designed for ML use cases. It has cut the time required for feature generation from months to days, and it offers features that support end-to-end data management for machine learning. Varant Zanoyan covers Zipline's architecture and dives into how it solves ML-specific problems.

Xiaohan Zeng is a software engineer on the machine learning infrastructure team at Airbnb. Previously, he worked on the machine learning platform team at Groupon. He holds degrees in chemical engineering from Tsinghua University and Northwestern University but began pursuing a career in software engineering and machine learning after doing research in data science. Outside work, he enjoys reading, writing, traveling, movies, and trying to follow his daughter around when she suddenly decides to practice walking.

Presentations

Bighead: Airbnb's end-to-end machine learning platform Session

Atul Kale and Xiaohan Zeng offer an overview of Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Built on Python, Spark, and Kubernetes, Bighead integrates popular libraries like TensorFlow, XGBoost, and PyTorch and is designed to be used in modular pieces.

Wenjing Zhan is a data scientist at Talroo, where she is in charge of predictive machine learning. Previously, Wenjing worked on search relevance through classification modeling and did data engineering with Apache Spark and machine learning in Scala, R, and Python. She holds a master's degree in statistics from the University of Texas at Austin.

Presentations

Job recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL Session

Can the talent industry make the job search/match more relevant and personalized for a candidate by leveraging deep learning techniques? Guoqiong Song, Wenjing Zhan, and Jacob Eisinger demonstrate how to leverage the distributed deep learning framework BigDL on Apache Spark to predict a candidate's probability of applying to specific jobs based on their résumé.

Mang Zhang is a big data platform development engineer at JD.com, where he is mainly engaged in the construction and development of the company's big data platform using open source projects such as Hadoop, Spark, Hive, Alluxio, and Presto. He focuses on the big data ecosystem and is an open source developer who has contributed to Alluxio, Hadoop, Hive, and Presto.

Presentations

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks Session

Tao Huang, Mang Zhang, and 白冰 explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, the JDPresto framework has seen a 10x performance improvement on average.
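
The "pluggable" aspect is visible in a minimal PySpark sketch (hosts and paths are placeholders): because Alluxio exposes an HDFS-compatible API, switching a job from HDFS to Alluxio is often just a URL change:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-demo").getOrCreate()

    # Before: read directly from HDFS.
    logs = spark.read.parquet("hdfs://namenode:9000/warehouse/logs")

    # After: the same job, with hot data served from Alluxio's memory tier.
    logs = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/logs")
    print(logs.count())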

Zhe Zhang is a senior manager of core big data infrastructure at LinkedIn, where he leads the engineering team that provides big data services (Hadoop Distributed File System (HDFS), YARN, Spark, TensorFlow, and beyond) to power LinkedIn's business intelligence and relevance applications. Zhe's an Apache Hadoop PMC member; he led the design and development of HDFS Erasure Coding (HDFS-EC).

Presentations

TonY: Native support of TensorFlow on Hadoop Session

Jonathan Hung, Keqiu Hu, and Zhe Zhang offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. Its native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop.

Xiaoyong Zhu is a senior data scientist at Microsoft, where he focuses on distributed machine learning and its applications.

Presentations

Deep learning on audio in Azure to detect sounds in real time Session

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people in the world who are deaf or hard of hearing. Swetha Machanavajhala and Xiaoyong Zhu explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure.
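
As a hint of the mechanics (a generic sketch, not the presenters' Azure pipeline; the file name is a placeholder), the standard first step is converting a waveform into a log-mel spectrogram that a CNN can then classify like an image:

    import librosa

    waveform, sr = librosa.load("doorbell.wav", sr=16000)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel)
    print(log_mel.shape)  # (64 mel bands, time frames) -> CNN input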

Zhi Zhu is vice director of technology management at CCB, where he manages the bank's big data platform planning and technology assets. Zhi has 15 years of experience in bank IT management, analytics platforms, data warehouses, governance, and architecture. He has led many nationwide projects at CCB, including creating its data warehouse and next-generation analytics platform.

Presentations

Refactor your data warehouse with mobile analytics products (sponsored by Kyligence) Session

When China Construction Bank wanted to migrate 23,000+ reports to mobile, it chose Apache Kylin as the high-performance, high-concurrency platform on which to refactor its data warehouse architecture to serve 400K+ users. Zhi Zhu and Luke Han detail the necessary architecture and best practices for refactoring a data warehouse for mobile analytics.