Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Speakers

Hear from innovative data scientists, senior engineers, and leading executives who are doing amazing things with data. More speakers will be announced; please check back for updates.

Bill Chambers is a product manager at Databricks, where he works on Structured Streaming and data science products. He is lead author of Spark: The Definitive Guide, coauthored with Matei Zaharia. Bill also created SparkTutorials.net as a way to teach Apache Spark basics. Bill holds a master's degree in information management and systems from UC Berkeley's School of Information. During his time at school, Bill created the Data Analysis in Python with pandas course for Udemy and was cocreator of and first instructor for Python for Data Science, part of UC Berkeley's master's in data science program.

Presentations

Streaming big data in the cloud: What to consider and why 40-minute session

Streaming big data is a rapidly growing field, and one that currently involves a lot of operational complexity and expertise. Bill Chambers presents a decision-making framework to help attendees reason about the tools and technologies with which they can successfully deploy and maintain streaming data pipelines that solve business problems.

I am an astrophysicist using data science techniques to study the Universe.

Presentations

Learning Machine Learning using Astronomy data sets Tutorial

We present an intermediate machine learning tutorial based on actual problems in astronomy research. Our strengths: we use interesting, diverse, publicly available datasets; we shape the content around students' feedback on what worked best and worst; we focus on the customization of algorithms and evaluation metrics required by scientific applications; and we propose open problems to our participants.
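As a rough illustration of what customizing an evaluation metric for a scientific application can look like, here is a minimal scikit-learn sketch; the photo-z-style outlier metric, synthetic dataset, and threshold are illustrative assumptions, not the tutorial's actual materials.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Hypothetical example: photometric-redshift style data, where astronomers
# often score models by the fraction of "catastrophic outliers" rather than MSE.
def outlier_fraction(y_true, y_pred, threshold=0.15):
    # Normalized residual, a common convention in photo-z studies
    dz = np.abs(y_pred - y_true) / (1.0 + y_true)
    return np.mean(dz > threshold)

# greater_is_better=False tells scikit-learn to minimize this score
scorer = make_scorer(outlier_fraction, greater_is_better=False)

rng = np.random.RandomState(0)
X = rng.rand(500, 5)                 # stand-in for survey features (colors, magnitudes)
y = X[:, 0] + 0.1 * rng.randn(500)   # stand-in for redshift

model = RandomForestRegressor(n_estimators=50, random_state=0)
print(cross_val_score(model, X, y, scoring=scorer, cv=5))
```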

Nishith works on the Hudi project and the broader Hadoop platform at Uber. His interests lie in large-scale distributed systems and data systems.

Presentations

Hudi: Unifying storage and serving for batch and near-real-time analytics 40-minute session

Uber needs to provide faster, fresher data to data consumers and products, which run hundreds of thousands of analytical queries every day. Uber engineers share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

Vijay Srinivas Agneeswaran is a senior director of technology at SapientRazorfish. Vijay has spent the last 10 years creating intellectual property and building products in the big data area at Oracle, Cognizant, and Impetus, including building PMML support into Spark/Storm and implementing several machine learning algorithms, such as LDA and random forests, over Spark. He also led a team that built a big data governance product for role-based, fine-grained access control inside of Hadoop YARN and built the first distributed deep learning framework on Spark. Earlier in his career, Vijay was a postdoctoral research fellow at the LSIR Labs within the Swiss Federal Institute of Technology, Lausanne (EPFL). He is a senior member of the IEEE and a professional member of the ACM. He holds four full US patents and has published in leading journals and conferences, including IEEE Transactions. His research interests include distributed systems, cloud, grid, peer-to-peer computing, machine learning for big data, and other emerging technologies. Vijay holds a bachelor's degree in computer science and engineering from SVCE, Madras University, an MS (by research) from IIT Madras, and a PhD from IIT Madras.

Presentations

Deep learning-based search and recommendation systems using TensorFlow Tutorial

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system for intent prediction using deep learning, drawing on a real-world implementation for an ecommerce client.

Arpan is a software engineer at LinkedIn on the Analytics Platforms and Applications team. He holds a graduate degree in computer science and engineering from IIT Kanpur.

Presentations

TuneIn: How to get your jobs tuned while you are sleeping 40-minute session

Have you ever tuned a Spark or MR job? If the answer is yes, then you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. With Dr. Elephant we introduced heuristic-based tuning recommendations. Now we introduce TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.
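For context, below is a minimal PySpark sketch of the kind of knobs such tools tune automatically; the values are illustrative, not recommendations.

```python
from pyspark.sql import SparkSession

# A few of the many parameters that auto-tuning tools like Dr. Elephant and
# TuneIn aim to set for you; the values below are placeholders.
spark = (
    SparkSession.builder
    .appName("manually-tuned-job")
    .config("spark.executor.memory", "4g")          # per-executor heap
    .config("spark.executor.cores", "4")            # concurrent tasks per executor
    .config("spark.executor.instances", "10")       # cluster footprint
    .config("spark.sql.shuffle.partitions", "200")  # shuffle parallelism
    .config("spark.memory.fraction", "0.6")         # execution/storage share of heap
    .getOrCreate()
)

# Each combination changes both runtime and resource usage; exploring this
# space by hand across hundreds of jobs is what motivates auto-tuning.
```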

Adil Aijaz is CEO and cofounder at Split Software. Adil brings over ten years of engineering and technical experience, having worked as a software engineer and technical specialist at some of the most innovative enterprise companies, such as LinkedIn, Yahoo, and most recently RelateIQ (acquired by Salesforce). His tenure at these companies gave him experience in solving data-driven challenges and delivering data infrastructure, laying the foundation for founding Split in 2015. Adil holds a bachelor of science in computer science and engineering from UCLA and a master of engineering in computer science from Cornell University.

Presentations

The Lure of "One Metric That Matters" 40-minute session

Many products, whether data driven or not, chase "the one metric that matters." It may be engagement, revenue, or conversion, but the common theme is the pursuit of improvement in one metric. Product development teams should instead focus on the design of metrics that measure their goals. Adil will present an approach to designing metrics and discuss best practices and common pitfalls you may run into.

Amro is a data scientist with National Health Insurance Company – Daman, a leading health insurance company headquartered in Abu Dhabi, UAE. His focus is on business-driven AI expert systems for health insurance. He holds an MSc in quantum computing from Masdar Institute, in partnership with MIT. He received his BSc in computer systems engineering from Birzeit University in 2009.

Presentations

Real-time automated claim processing: the surprising utility of NLP methods on non-text data Findata

Processing claims is central to every insurance business. We present a successful business case for automating claims processing, from idea to production. The machine learning-based claim automation model uses NLP methods on non-text data and allows auditable automated claims decisions to be made.
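The abstract doesn't detail the method, but a common way to apply NLP machinery to non-text data is to treat categorical event codes as words; here is a minimal sketch under that assumption (the codes, labels, and pipeline below are hypothetical, not the presenters' model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data: each claim is a "sentence" of procedure/diagnosis codes.
claims = [
    "E11.9 99213 80053",   # codes joined as whitespace-separated tokens
    "I10 99214 93000",
    "E11.9 99215 82947",
    "I10 99213 80061",
]
labels = [1, 0, 1, 0]  # e.g., approve automatically vs. route to an auditor

# TF-IDF treats each code like a word, so standard text classifiers apply.
pipeline = make_pipeline(
    TfidfVectorizer(token_pattern=r"\S+"),  # keep codes like "E11.9" intact
    LogisticRegression(),
)
pipeline.fit(claims, labels)
print(pipeline.predict(["E11.9 99213 80053"]))
```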

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Real-time systems with Spark Streaming and Kafka 1-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data, and how do you process it? Jesse Anderson explores the latest real-time frameworks and explains how to choose the right one for your company.

André Araujo is a Senior Solutions Architect at Cloudera. An experienced consultant with a deep understanding of the Hadoop stack and its components, André is skilled across the entire Hadoop ecosystem and specializes in building high-performance, secure, robust, and scalable architectures to fit customers’ needs. André is a methodical and keen troubleshooter who loves making things run faster.

Presentations

Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.

Mauricio Aristizabal is the data pipeline architect at Impact (formerly Impact Radius), a marketing technology company that helps brands grow by optimizing their paid marketing and media spend. At Impact, Mauricio has been responsible for massively scaling and modernizing the company's analytics capabilities, selecting datastores and processing platforms, and designing many of the jobs that process internally and externally captured data and make it available to report and dashboard users, analytic applications, and machine learning jobs. He has also assisted the operations team with maintaining and tuning the Hadoop and Kafka clusters.

Presentations

Real-time analytics and BI with a data lake and data warehouse using Kudu, HBase, Spark, and Kafka: Lessons learned 40-minute session

Lessons learned from migrating Impact's traditional ETL platform to real-time processing on Hadoop (leveraging the full Cloudera EDH stack). A data lake in HBase, Spark Streaming jobs (with Spark SQL), Kudu for "fast data" BI queries, and a Kafka data bus for loose coupling between components are some of the topics we'll explore in detail.

Presentations

From Training to Serving: Deploying Tensorflow Models with Kubernetes Tutorial

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Join Ron Bodkin and Brian Foo to learn how to bring deep learning models from training to serving in a cloud production environment. You will learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
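As a rough sketch of the "export" step in that workflow, here is a minimal example using the TensorFlow 1.x API current at the time; the toy model and paths are assumptions, but the versioned SavedModel layout is what TensorFlow Serving expects:

```python
import tensorflow as tf

# Build a trivial model: y = Wx + b
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
w = tf.Variable(tf.ones([3, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.add(tf.matmul(x, w), b, name="y")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Write a versioned SavedModel directory that tensorflow_model_server
    # (e.g., in a TensorFlow Serving container on Kubernetes) can load directly.
    tf.saved_model.simple_save(
        sess,
        export_dir="models/demo/1",  # trailing "1" is the model version
        inputs={"x": x},
        outputs={"y": y},
    )
```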

TBD

Presentations

Building A Large-Scale Machine Learning Application Using Amazon SageMaker and Spark Tutorial

Outline:
- What is Amazon SageMaker? A quick product overview of AWS's newest ML platform
- Create a Spark EMR cluster
- Integrate SageMaker algorithms into Spark pipelines (see the sketch below)
- Ensemble multiple models for a real-time prediction task
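A sketch of what that integration pattern can look like with the open source sagemaker_pyspark library; the role ARN, instance types, and data path are placeholders, and exact parameter names may vary by library version:

```python
from pyspark.sql import SparkSession
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Attach the SageMaker Spark JARs to the session (e.g., on an EMR cluster).
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", ":".join(classpath_jars()))
         .getOrCreate())

# The estimator plugs into a Spark pipeline but trains and hosts on SageMaker.
estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),  # placeholder
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1)
estimator.setK(10)
estimator.setFeatureDim(784)

training_df = spark.read.format("libsvm").load("s3://my-bucket/train")  # placeholder path
model = estimator.fit(training_df)          # launches a SageMaker training job
predictions = model.transform(training_df)  # calls the hosted SageMaker endpoint
```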

Ahsan Ashraf is a data scientist at Pinterest, focusing on recommendations and ranking for the Discovery team. Previously, Ahsan worked with personal finance startup wallet.ai as part of an Insight Data Science Fellowship, where he designed and built a recommender system that drew insights into users' spending habits from their transaction history. Ahsan holds a PhD in condensed/soft matter physics.

Presentations

Diversification in recommender systems: Using topical variety to increase user satisfaction 40-minute session

Online recommender systems often rely heavily on user engagement features. This can cause a bias toward exploitation over exploration, over-optimizing on users' existing interests. Content diversification is important for user satisfaction; however, measuring and evaluating its impact is challenging. This work outlines techniques used at Pinterest that drove ~2-3% impression gains and a ~1% time spent gain.
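One common way to quantify the topical variety of a recommendation slate is the mean pairwise distance between item embeddings; this sketch is illustrative and not necessarily the metric used at Pinterest:

```python
import numpy as np

def intra_list_diversity(item_embeddings):
    """Mean pairwise cosine distance among recommended items.

    Higher values mean a more topically diverse slate. This is one common
    diversity measure; the talk's exact metrics may differ.
    """
    X = np.asarray(item_embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sims = X @ X.T                                    # cosine similarities
    n = len(X)
    upper = sims[np.triu_indices(n, k=1)]             # distinct pairs only
    return float(np.mean(1.0 - upper))

# Ten recommendations drawn from one tight topic cluster vs. spread out
narrow = np.random.randn(10, 64) * 0.05 + 1.0   # tightly clustered embeddings
broad = np.random.randn(10, 64)                 # spread-out embeddings
print(intra_list_diversity(narrow), intra_list_diversity(broad))
```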

Tony Baer leads Ovum’s research in Big Data, middleware, and the management of embedded software development in the product lifecycle. Tony has defined the architecture, use cases, and market outlook for Big Data and led the industry’s first global enterprise survey on Big Data adoption.

Tony has been a noted authority on data management, integration architecture, and software development platforms for nearly 20 years. Prior to joining Ovum, he was an independent analyst whose company, onStrategies, provided software development and integration tool vendors with technology assessment and market positioning services.

He coauthored some of the earliest books on the Java and .NET frameworks, including Understanding the .NET Framework and J2EE Technology in Practice.

His career began as a journalist with leading publications including Computerworld, Application Development Trends, Computergram, Software Magazine, InformationWeek, and Manufacturing Business Technology.

Presentations

Executive Briefing: Profit from AI and Machine Learning – The best practices for people & process 40-minute session

Ovum will present the results of research cosponsored by Dataiku, surveying a specially selected sample of chief data officers and data scientists, on how to map roles and processes to make success with AI in the business repeatable.

Marton Balassi is a solutions architect at Cloudera, where he focuses on data science and stream processing with big data tools. Marton is a PMC member at Apache Flink and a regular contributor to open source. He is a frequent speaker at big data-related conferences and meetups, including Hadoop Summit, Spark Summit, and Apache Big Data.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Dylan Bargteil is a data scientist in residence at the Data Incubator, where he works on research-guided curriculum development and instruction. Previously, he worked with deep learning models to assist surgical robots and was a research and teaching assistant at the University of Maryland, where he developed a new introductory physics curriculum and pedagogy in partnership with HHMI. Dylan studied physics and math at University of Maryland and holds a PhD in physics from New York University.

Presentations

Machine Learning from Scratch in TensorFlow 1-Day Training

The TensorFlow library uses data flow graphs for numerical computation, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms. This training will introduce TensorFlow's capabilities through its Python interface.
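A minimal sketch of the graph-then-run model in the TensorFlow 1.x Python interface the description refers to (the toy values are assumptions):

```python
import tensorflow as tf

# Define a data flow graph: nodes are operations, edges are tensors.
# Nothing is computed until the graph runs in a session, which lets
# TensorFlow parallelize independent operations across CPUs/GPUs.
a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
total = a + b
scaled = total * 3.0

with tf.Session() as sess:
    print(sess.run(scaled, feed_dict={a: 1.0, b: 2.0}))  # 9.0
```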

Bonnie Barrilleaux is a staff data scientist in analytics at LinkedIn, who primarily focuses on communities and the content ecosystem. She uses data to guide product strategy, performs experiments to understand the ecosystem, and creates metrics to evaluate product performance. Previously, she completed a postdoctoral fellowship in genomics at the University of California, Davis, studying the function of the Myc gene in cancer and stem cells. She holds a PhD in Chemical Engineering from Tulane University; has published peer-reviewed works including 11 journal articles, a book chapter, and a video article; and has been awarded multiple grants to create interactive art.

Presentations

Perverse incentives in metrics: inequality in the like economy 40-minute session

Following metrics blindly leads to unintended negative side effects. At LinkedIn, as we encouraged members to join conversations, we found ourselves in danger of creating a "rich get richer" economy in which a few creators got an increasing share of all feedback. This example reminds us to regularly reevaluate metrics, because creating value for users is more important than driving any metric.
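One illustrative way to track such concentration is the Gini coefficient over per-creator feedback counts; this is a generic measure, not necessarily what LinkedIn used:

```python
import numpy as np

def gini(feedback_counts):
    """Gini coefficient: 0 = feedback evenly spread, 1 = one creator gets it all."""
    x = np.sort(np.asarray(feedback_counts, dtype=float))  # ascending order
    n = len(x)
    ranks = np.arange(1, n + 1)
    # Standard closed form over sorted values
    return float((2 * ranks - n - 1) @ x / (n * x.sum()))

print(gini([10, 10, 10, 10]))  # 0.0: perfectly equal feedback
print(gini([1, 2, 5, 1000]))   # ~0.74: a few creators dominate
```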

James Bednar is a senior solutions architect at Anaconda. Previously, Jim was a lecturer and researcher in computational neuroscience at the University of Edinburgh, Scotland, and a software and hardware engineer at National Instruments. He manages the open source Python projects datashader, HoloViews, GeoViews, ImaGen, and Param. He has published more than 50 papers and books about the visual system, data visualization, and software development. Jim holds a PhD in computer science from the University of Texas as well as degrees in electrical engineering and philosophy.

Presentations

Making interactive browser-based visualizations easy in Python Tutorial

Python lets you solve data-science problems by stitching together packages from the Python ecosystem, but it can be difficult to assemble the right tools to solve real-world problems. Here we show how to use the 15+ packages covered by the new PyViz.org initiative to make it simple to build interactive plots and dashboards, even for large, streaming, and highly multidimensional data.
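As a small taste of the declarative style these PyViz tools encourage, here is a minimal HoloViews/Bokeh sketch (the dataset and output filename are made up):

```python
import numpy as np
import pandas as pd
import holoviews as hv

hv.extension('bokeh')  # select the interactive Bokeh plotting backend

# A toy dataset; with datashader and streamz from the same PyViz ecosystem,
# similar declarative code scales to large and streaming data.
df = pd.DataFrame({
    'x': np.random.randn(10000),
    'y': np.random.randn(10000),
    'category': np.random.choice(['a', 'b'], 10000),
})

# Declare what the data is, not how to draw it; HoloViews renders it
# as an interactive Bokeh plot with pan/zoom for free.
points = hv.Points(df, kdims=['x', 'y'], vdims=['category'])
hv.renderer('bokeh').save(points, 'scatter')  # writes scatter.html
```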

William Benton leads a team of data scientists and engineers at Red Hat, where he has applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy intelligent applications in cloud-native environments, but he has also conducted research and development in the areas of static program analysis, managed language runtimes, logic databases, cluster configuration management, and music technology.

Presentations

Why data scientists should love Linux containers 40-minute session

Containers are a hot technology for application developers, but they also provide key benefits for data scientists. In this talk, you'll learn about the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively.

Mike Berger joined Mount Sinai's new population health initiative, Mount Sinai Health Partners, at the end of 2015 and is now the VP of data science and informatics, accountable for developing its data-driven decision making (D3M) strategy. That requires building a high-performing data science group with a mix of advanced analytics, storytelling visualizations, and population health domain expertise; aligning the team's skills and tools with the business need to better manage the health and wellness of a population of over 350K under financial risk; and finding innovative ways to deliver these insights to physicians, care management teams, and executives. The right data on the right device at the right time means better decisions for the patient and the business.

Mike has over twenty years of experience across a combination of large health systems and payer organizations as well as entrepreneurial startups and management consulting. Mike also acts as the committee co-chair for the HIMSS Clinical & Business Intelligence community and is an active member of multiple healthcare data warehousing and informatics associations.

Originally from Huntington Beach, CA, Mike has an industrial and systems engineering degree from USC and a healthcare project management certification from Harvard's Graduate School of Public Health, and is completing his master's from NYU Stern in May 2018.

Presentations

Decision-Centricity: Operationalizing Analytics and Data Science in Health Systems Data Case Studies

Hear how Mount Sinai Health has moved up the analytics maturity chart to deliver business value in new risk models around population health. Learn how to design a team, build a data factory, and generate the analytics to drive decision-centricity. See examples of mixing Tableau, SQL, Hive, APIs, Python, and R into a cohesive ecosystem supported by the data factory.

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer experience. Tim can frequently be found speaking at conferences internationally and in the United States. He is the copresenter of various O'Reilly training videos on topics ranging from Git to distributed systems and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at Timberglund.com, and is the cohost of the DevRel Radio Podcast. He lives in Littleton, Colorado, with the wife of his youth and their youngest child, the other two having mostly grown up.

Presentations

Stream Processing with Kafka and KSQL Tutorial

A solid introduction to Apache Kafka as a streaming data platform. We'll cover its internal architecture, APIs, and platform components like Kafka Connect and Kafka Streams—then finish with an exercise processing streaming data using KSQL, the new SQL-like declarative stream processing language for Kafka.
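KSQL itself is written as SQL statements, but the producer/consumer model the tutorial starts from looks roughly like this Python sketch using the confluent-kafka client (the broker address and topic are placeholders):

```python
from confluent_kafka import Producer, Consumer

# Produce a few events to a topic (assumes a broker on localhost:9092).
producer = Producer({'bootstrap.servers': 'localhost:9092'})
for i in range(3):
    producer.produce('pageviews', key=str(i), value='{"page": "/home"}')
producer.flush()

# Consume them back; consumer groups let many instances share partitions.
consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'demo-group',
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['pageviews'])
for _ in range(3):
    msg = consumer.poll(timeout=5.0)
    if msg is not None and msg.error() is None:
        print(msg.key(), msg.value())
consumer.close()
```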

Anya Bida is a senior member of the technical staff (SRE) at Salesforce. She’s also a co-organizer of the SF Big Analytics meetup group and is always looking for ways to make platforms more scalable, cost efficient, and secure. Previously, Anya worked at Alpine Data, where she focused on Spark operations.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am 40-minute session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache BEAM, Mahout, and internal Spark ML jobs as workloads.

Albert Bifet is a professor at LTCI, Telecom ParisTech, head of the Data, Intelligence and Graphs (DIG) group at Telecom ParisTech, and a scientific collaborator at Ecole Polytechnique. A big data scientist with 10+ years of international research experience, Albert has led new open source software projects for business analytics, data mining, and machine learning at Huawei, Yahoo, the University of Waikato, and UPC. At Yahoo Labs, he cofounded Apache SAMOA (Scalable Advanced Massive Online Analysis), a distributed streaming machine learning framework that contains a programming abstraction for distributed streaming ML algorithms. At the WEKA Machine Learning Group, he co-led MOA (Massive Online Analysis), the most popular open source framework for data stream mining, with more than 20,000 downloads each year. Albert is the author of Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams and the editor of the "Big Data Mining" special issue of SIGKDD Explorations in 2012. He was cochair of the industrial track at ECML PKDD 2015, of BigMine (2014, 2013, 2012), and of the data streams track at ACM SAC (2015, 2014, 2013, 2012). He holds a PhD from BarcelonaTech.

Presentations

Machine learning for non-stationary streaming data using Structured Streaming and StreamDM 40-minute session

The StreamDM library provides the largest collection of data stream mining algorithms for Spark. This talk will cover how StreamDM can be used alongside Structured Streaming to build incremental models, especially for non-stationary streams (i.e., those with concept drift). Concretely, we will cover how to develop, apply, and evaluate learning models using StreamDM and Structured Streaming.

Ryan Blue is an engineer on Netflix’s Big Data Platform team. Before Netflix, Ryan was responsible for the Avro and Parquet file formats at Cloudera. He is also the author of the Analytic Data Storage in Hadoop series of screencasts from O’Reilly.

Presentations

Introducing Iceberg: Tables Designed for Object Stores 40-minute session

Iceberg is a new open source project that defines a new table layout with properties specifically designed for cloud object stores such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet.

The evolution of Netflix's S3 data warehouse 40-minute session

In the last few years, Netflix's data warehouse has grown to more than 100PB in S3. This talk will summarize what we've learned, the tools we currently use and those we've retired, as well as the improvements we are rolling out, including Iceberg, a new table format for S3.

Ron Bodkin is technical director for applied artificial intelligence at Google, where he helps Global Fortune 500 enterprises unlock strategic value with AI, acts as executive sponsor for Google product and engineering teams to deliver value from AI solutions, and leads strategic initiatives working with customers and partners. Previously, Ron was vice president and general manager of artificial intelligence at Teradata; the founding CEO of Think Big Analytics (acquired by Teradata in 2014), which provides end-to-end support for enterprise big data, including data science, data engineering, advisory and managed services, and frameworks such as Kylo for enterprise data lakes; vice president of engineering at Quantcast, where he led the data science and engineering teams that pioneered the use of Hadoop and NoSQL for batch and real-time decision making; founder of enterprise consulting firm New Aspects; and cofounder and CTO of B2B applications provider C-Bridge. Ron holds a BS in math and computer science with honors from McGill University and a master's degree in computer science from MIT.

Presentations

From Training to Serving: Deploying Tensorflow Models with Kubernetes Tutorial

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Join Ron Bodkin and Brian Foo to learn how to bring deep learning models from training to serving in a cloud production environment. You will learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

Matt leads the machine learning product team at Cloudera, guiding the platform experience for data scientists and data engineers, including products like Cloudera Data Science Workbench. Before that, he led Cloudera’s product marketing team for three years, with roles spanning product, solution, and partner marketing. Prior to Cloudera, he built enterprise search and data discovery products at Endeca/Oracle. Matt holds degrees in Computer Science and Mathematics from the University of Massachusetts Amherst.

Presentations

A roadmap for open data science 40-minute session

An overview of considerations and tradeoffs for choosing an open approach to enterprise data science. In this talk we’ll share a model to help organizations begin the journey, build momentum, and reduce reliance on legacy software. This includes such things as executive leadership, cost transparency, and clear metrics of user adoption and success with open data science tools.

Claudiu Branzan is the VP of data science and engineering at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
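For a flavor of the spaCy annotation pipeline the tutorial builds on, here is a minimal sketch (the sample sentence is arbitrary):

```python
import spacy

# Load a pretrained English pipeline (install first with:
#   python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

doc = nlp("Databricks was founded in 2013 by the creators of Apache Spark.")

# One pass through the annotation pipeline populates tokens,
# part-of-speech tags, dependencies, and named entities.
for token in doc[:5]:
    print(token.text, token.pos_, token.dep_)
for ent in doc.ents:
    print(ent.text, ent.label_)
```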

Mikio Braun is principal engineer for search at Zalando, one of the biggest European fashion platforms. Mikio holds a PhD in machine learning and worked in research for a number of years before becoming interested in putting research results to good use in the industry.

Presentations

Executive Briefing: from Business to AI - missing pieces in becoming "AI ready" 40-minute session

To become "AI ready," an organization not only has to provide the right technical infrastructure for data collection and processing but also has to learn new skills. In this talk I will highlight three such missing pieces: making the connection between business problems and AI technology, AI-driven development, and how to run AI-based projects.

Machine learning for time series: What works and what doesn't 40-minute session

Time series data has many applications in industry, from analyzing server metrics to monitoring IoT signals and outlier detection. Mikio Braun offers an overview of time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations for what works and what doesn’t, and industry use cases.

Lindsay is a motivated, curious, and analytical data scientist with more than a decade of experience with research methods and the scientific process. From generating testable hypotheses, through wrangling imperfect data, to finding insights via analytical models, she excels at asking incisive questions and using data to tell compelling stories.
Lindsay is passionate about teaching the skills necessary to analyze data more efficiently and effectively. Through this work, she has developed and taught workshops and online courses at the University of New Brunswick, and is a Data Carpentry instructor and Ladies Learning Code chapter co-lead. Having recently made a career pivot from biogeochemistry to data science, she is also well-positioned to provide insight into the applicability of academic research and analysis skills to business problems.

Presentations

From Theory to Data Product - Applying Data Science Methods to Effect Business Change Tutorial

This tutorial is a primer on crafting well-conceived data science projects on course toward uncovering valuable business insights. Using case studies and hands-on skills development, we will teach techniques that are essential for a variety of audiences invested in effecting real business change.

Founder/CEO, Blue Badge Insights; ZDNet Big Data blogger; Gigaom analyst; Microsoft tech influencer.

Presentations

Data Governance: A Big Job That's Getting Bigger 40-minute session

Data governance is a product category that has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging and new products are rising to meet the challenge. This session tracks data governance's past, present, and future.

Andrew Burt is chief privacy officer and legal engineer at Immuta, the data management platform for the world's most secure organizations. He is also a visiting fellow at Yale Law School's Information Society Project. Previously, Andrew was a special advisor for policy to the head of the FBI Cyber Division, where he served as lead author on the FBI's after-action report on the 2014 attack on Sony. A leading authority on the intersection of machine learning, regulation, and law, Andrew has published articles on technology, history, and law in the New York Times, the Financial Times, Slate, and the Yale Journal of International Affairs. His book, American Hysteria: The Untold Story of Mass Political Extremism in the United States, was called "a must-read book dealing with a topic few want to tackle" by Nobel laureate Archbishop Emeritus Desmond Tutu. Andrew holds a JD from Yale Law School and a BA from McGill University. He is a term member of the Council on Foreign Relations, a member of the Washington, DC, and Virginia State Bars, and a Global Information Assurance Certified (GIAC) cyber incident response handler.

Presentations

Beyond Explainability: Regulating Machine Learning In Practice 40-minute session

Machine learning is becoming prevalent across industries, creating new types of risk. Managing this risk is quickly becoming the central challenge of major organizations, one that strains data science teams, legal personnel, and the C-suite alike. This talk will highlight lessons from past regulations focused on similar technology and conclude with a proposal for new ways to manage risk in ML.

Michelle Casbon is a senior engineer on the Google Cloud Platform developer relations team, where she focuses on open source contributions and community engagement for machine learning and big data tools. Michelle’s development experience spans more than a decade and has primarily focused on multilingual natural language processing, system architecture and integration, and continuous delivery pipelines for machine learning applications. Previously, she was a senior engineer and director of data science at several San Francisco-based startups, building and shipping machine learning products on distributed platforms using both AWS and GCP. She especially loves working with open source projects and is a contributor to Kubeflow. Michelle holds a master’s degree from the University of Cambridge.

Presentations

Kubeflow explained: Portable Machine Learning on Kubernetes 40-minute session

Learn how to build a Machine Learning application with Kubeflow, which makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere. Kubeflow supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. Find out what Kubeflow currently supports and the long-term vision for the project, presented by a project contributor.

Mark is a hacker at H2O. He was previously in the finance world as a quantitative research developer at Thomson Reuters and Nipun Capital. He also worked as a data scientist at an IoT startup, where he built a web-based machine learning platform and developed predictive models.

Mark has an MS in financial engineering from UCLA and a BS in computer engineering from the University of Illinois Urbana-Champaign. In his spare time, Mark likes competing on Kaggle and cycling.

Presentations

Practical Techniques for Interpreting Machine Learning Models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. This technical tutorial will share practical and productizable approaches for explaining, testing, and visualizing machine learning models through a series of publicly available examples using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.
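One approach in that family is a global surrogate: a simple, inspectable model fit to a complex model's predictions. Here is a hedged sketch with XGBoost and scikit-learn (synthetic data; not the tutorial's exact materials):

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_graphviz

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

# A complex, hard-to-interpret model...
booster = xgb.XGBRegressor(n_estimators=200, max_depth=5)
booster.fit(X, y)

# ...approximated by a shallow "surrogate" decision tree trained on the
# booster's own predictions, giving a human-readable global summary.
surrogate = DecisionTreeRegressor(max_depth=3)
surrogate.fit(X, booster.predict(X))

print("surrogate fidelity (R^2 vs. booster):",
      surrogate.score(X, booster.predict(X)))
export_graphviz(surrogate, out_file='surrogate.dot')  # render with GraphViz
```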

Vinoth Chandar works on data infrastructure at Uber, with a focus on Hadoop and Spark. Vinoth has keen interest in unified architectures for data analytics and processing. Previously, Vinoth was the LinkedIn lead on Voldemort and worked on Oracle server’s replication engine, HPC, and stream processing.

Presentations

Hudi: Unifying storage and serving for batch and near-real-time analytics 40-minute session

Uber needs to provide faster, fresher data to data consumers and products, which run hundreds of thousands of analytical queries every day. Uber engineers share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

Manna Chang is a senior data scientist in Optum Enterprise Analytics, where she plays a leading role in providing and developing innovative technologies and methods to meet customer needs and answer healthcare-related challenges. She holds a PhD in biochemistry and an MS in statistics. Her past experience applying machine learning techniques in drug discovery and genomic outcome studies led to her current role in data science. In her spare time, she loves sci-fi movies and enjoys hiking.

Presentations

Breaking the rules: End Stage Renal Disease Prediction 40-minute session

This presentation will show both supervised and unsupervised learning methods for working with claims data and how they can complement each other. A supervised method will look at CKD patients at risk of developing ESRD, and an unsupervised approach will classify patients who tend to develop the disease faster than others.

I am currently a software engineer at Uber, where I am a member of the Hadoop Platform team, working on large-scale data ingestion and dispersal pipelines and libraries leveraging Apache Spark. I was previously the tech lead on the metrics team at Uber Maps, building data pipelines to produce metrics that help analyze the quality of our mapping data. Before joining Uber, I worked at Twitter as an original member of the Core Storage team building Manhattan, a key/value store powering Twitter's use cases. I love learning anything about storage, data platforms, and distributed systems at scale.

See links:
https://www.wired.com/2014/04/twitter-manhattan/
https://blog.twitter.com/engineering/en_us/a/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale.html

I hold a BS in computer science from UCLA and an MS in computer science from USC. In my spare time I like to dabble in building Alexa and iPhone projects for personal use and training my 2 kids for a future as professional basketball players in the NBA :)

Presentations

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework 40-minute session

Marmaray is a generic Hadoop ingestion and dispersal framework recently released to production at Uber. We will introduce the main features of Marmaray and the business needs it meets, share how Marmaray can help a team's data needs by ensuring data can be reliably ingested into Hive or dispersed into online data stores, and give a deep dive into the architecture to show you how it all works.

Felix, a PMC member and committer of Apache Spark, started his journey in the big data space about five years ago with the then state-of-the-art MapReduce. Since then, he has (re)built Hadoop clusters from bare metal more times than he would like, created a Hadoop "distro" from two dozen or so projects, and juggled hundreds to thousands of cores in the cloud or in data centers. He built a few interesting apps with Apache Spark over 3.5 years and has contributed to the project for more than three years. In addition to building stuff, he frequently presents at conferences, meetups, and workshops. He was also a teaching assistant for the first set of edX MOOCs on Apache Spark.

Presentations

Your 5 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber 40-minute session

Do you know how your Uber rides are powered by Apache Spark? Come to this talk to learn how Uber builds its data platform with Apache Spark at enormous scale and about the unique challenges the team faces and overcomes.

Anant Chintamaneni is VP of products at BlueData. Anant has more than 15 years of experience in business intelligence, advanced analytics, and big data infrastructure. He is currently responsible for product management at BlueData, where he focuses on helping enterprises deploy big data technologies including Hadoop and Spark. Prior to BlueData, Anant led the product management team for Pivotal's big data suite.

Presentations

What's the Hadoop-la about Kubernetes? 40-minute session

There is increased interest in using Kubernetes (K8s), the open source container orchestration system, for modern big data workloads. The promised land is a unified platform for cloud-native stateless and stateful data services. However, stateful, multiservice big data cluster orchestration brings unique challenges. This session will delve into the considerations for running big data services on K8s.

Erin joined Airbnb in 2011 and is the company's most tenured data scientist. She has led data science and analytics initiatives across the company, including work with Customer Experience, Legal, Communications, and Public Policy. Currently, she is the data scientist for Airbnb's Human team, which has a mission to house people in need, including evacuees of disasters and refugees. In 2016, she co-founded Data University, a company-wide data training program in which over a quarter of the company has participated. Prior to Airbnb, she worked in education consulting and program management in Washington, DC. Erin received a PhD in Economics from Georgia State University and a BA in Mathematics Education and Economics from Anderson University (IN). Erin is a proud Airbnb Superhost, having welcomed nearly 1,000 guests since 2011. In her spare time she enjoys traveling, reading, pub trivia, and golfing.

Presentations

Data University: How Airbnb Democratized Data 40-minute session

Airbnb has open-sourced many high-leverage data tools: Airflow, Superset, and the Knowledge Repo. However, adoption of these tools across Airbnb was relatively low. To make data more accessible and utilized in decision-making, Airbnb launched Data University in early 2017. Since the launch, over a quarter of the company has participated in the program, and data tool utilization rates have doubled.

Mark has been building web applications since his first image map for his band's page in 1995 and was working with computers long before that. He currently works at Viacom with talented engineers on data and machine learning. He also helped initiate Viacom's open source program.

Presentations

Agility to Data Product Development: Plug and Play Data Architecture 40-minute session

Data products, distinct from data-driven products, are finding their own place in organizational data-driven decision making. Shifting the focus to "data" opens up new opportunities. The presentation, with case studies, dives deeper into a layered implementation architecture and provides intuitive learnings and solutions that allow for more agile, reusable data modules for a data product team.

Ian Cook is a data scientist at Cloudera and the author of several R packages, including implyr. Previously, Ian was a data scientist at TIBCO and a statistical software developer at Advanced Micro Devices. Ian is cofounder of Research Triangle Analysts, the largest data science meetup group in the Raleigh, North Carolina, area, where he lives with his wife and two young children. He holds an MS in statistics from Lehigh University and a BS in applied mathematics from Stony Brook University.

Presentations

Expand Your Data Science and Machine Learning Skills (Python, R, SQL, Spark, TensorFlow) 1-Day Training

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, with different syntaxes, conventions, and terminology. The instructor will simplify the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, participants will overcome obstacles to getting started using new tools.

Lawrence is a Partner and Advanced Analytics Practice Leader with Cicero Group. Lawrence has spent the last decade building Cicero’s analytics practice where he has experience helping Fortune 500 firms solve real business challenges with data, including attrition, segmentation, sales prioritization, pricing, and customer satisfaction. He also leads the firm in predictive analytics and Big Data related engagements, applying Cicero’s deep expertise in strategy execution to ensure data delivers ROI. He has partnered with companies to help them to shift from reactive to predictive analytics by collecting and analyzing real-time information and distributing it across the organization— allowing management to make better, faster decisions that move the business forward.

Lawrence is a frequent speaker and thought leader in the advanced analytics space, speaking at events such as Predictive Analytics World for Business and Workforce and the Global Big Data Conference, as well as serving as chairperson for the Data Analytics Leaders Event, where data chiefs and BI and analytics function heads come together to explore accelerating the path of data to value. His views and recommendations on big data and advanced analytics have been published in CIO Review and Predictive Analytics Times.

Lawrence holds a Master’s of Science in Predictive Analytics from Northwestern University, an MBA with an emphasis in Business Economics from Westminster College, and a BA from Brigham Young University.

Presentations

Realizing the true value in your data: Data-drivenness Assessment 40-minute session

We've worked with many firms and seen over and over that they struggle to leverage their data. We've developed a methodology for assessing four critical areas that firms must consider when looking to make the analytical leap: data strategy; data culture; data analysis and implementation; and data management and architecture.

Dan Crankshaw is a PhD student in the CS Department at UC Berkeley, where he works in the RISELab. After cutting his teeth doing large-scale data analysis on cosmology simulation data and building systems for distributed graph analysis, Dan has turned his attention to machine learning systems. His current research interests include systems and techniques for serving and deploying machine learning, with a particular emphasis on low-latency and interactive applications.

Presentations

Model serving and management at scale using open-source tools Tutorial

This tutorial consists of three parts. First, I will present an overview of the current challenges in deploying machine learning applications into production and provide a survey of the current state of prediction-serving infrastructure. Next, I will provide a deep dive on the Clipper serving system. Finally, I will run a hands-on workshop on getting started with Clipper.
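A hedged sketch of what getting started with Clipper looks like, based on the project's public quickstart; the application and model names are arbitrary, and the API may differ between versions:

```python
from clipper_admin import ClipperConnection, DockerContainerManager
from clipper_admin.deployers import python as python_deployer

# Start a local Clipper cluster using Docker.
clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.start_clipper()

# Register a REST application with a 100ms latency objective.
clipper_conn.register_application(
    name="hello-world", input_type="doubles",
    default_output="-1.0", slo_micros=100000)

def feature_sum(xs):
    # Trivial "model": sum each input vector
    return [str(sum(x)) for x in xs]

# Package the closure in a model container and deploy it behind the endpoint.
python_deployer.deploy_python_closure(
    clipper_conn, name="sum-model", version=1,
    input_type="doubles", func=feature_sum)
clipper_conn.link_model_to_app(app_name="hello-world", model_name="sum-model")
```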

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Findata welcome Tutorial

Program Chair, Alistair Croll, welcomes you to Findata Day.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Umur is co-founder and CEO of Citus Data, a leading Postgres company whose mission is to make it so companies never have to worry about scaling their relational database again.

Umur has over 15 years of experience driving complex enterprise software, IT, and database initiatives at large enterprises and at different startups—and he earned a master’s in Management Science & Engineering from Stanford University.

As CEO of Citus Data, Umur wears both operational and strategic hats. Umur works directly with technical founders at SaaS companies to help them scale their multi-tenant applications, and with enterprise architects to power real-time analytics apps that need to handle large-scale data.

Umur's team at Citus Data is active in the Postgres community, sharing expertise and contributing key components and extensions. Umur's company open-sourced its distributed database extension for PostgreSQL in early 2016.

Umur has a deep interest in how scalable systems of record and systems of engagement can help businesses grow—and is excited about the past, present, and future state of Postgres.

Presentations

The state of Postgres 40-minute session

PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you’ll learn how PostgreSQL’s extension APIs are fueling innovations in relational databases.
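For readers unfamiliar with the extension mechanism mentioned, installing and listing extensions looks roughly like this from Python (connection details are placeholders, and the extensions must already be installed on the server):

```python
import psycopg2

# Extensions are how Postgres grows new capabilities without forking the core.
conn = psycopg2.connect("dbname=appdb user=postgres")  # placeholder credentials
conn.autocommit = True
cur = conn.cursor()

# Each CREATE EXTENSION call loads functionality built on Postgres's
# extension APIs; postgis (geospatial) and citus (distributed tables)
# are two well-known examples.
cur.execute("CREATE EXTENSION IF NOT EXISTS postgis;")
cur.execute("SELECT extname, extversion FROM pg_extension;")
for name, version in cur.fetchall():
    print(name, version)
```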

Paul Curtis is a principal solutions engineer at MapR, where he provides pre- and postsales technical support to MapR’s worldwide systems engineering team. Previously, Paul served as senior operations engineer for Unami, a startup founded to deliver on the promise of interactive TV for consumers, networks, and advertisers; systems manager for Spiral Universe, a company providing school administration software as a service; senior support engineer positions at Sun Microsystems; enterprise account technical management positions for both Netscape and FileNet; and roles in application development at Applix, IBM Service Bureau, and Ticketron. Paul got started in the ancient personal computing days; he began his first full-time programming job on the day the IBM PC was introduced.

Presentations

Clouds and Containers: Case Studies for Big Data 40-minute session

Now that the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure that provides the business insights? This discussion explores three customer deployments that leverage the best of private clouds and containers to provide a flexible big data environment.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Olga Cuznetova is a data science team lead in Optum Enterprise Analytics, where she guides junior team members on their projects and helps implement data science solutions that address healthcare business needs. Currently, her projects focus mostly on building disease progression and clinical operations models; examples include predicting high-cost diabetic patients, predicting progression to end-stage renal disease, implementing a substance abuse disorder model using external clients' data, and predicting medical prior authorization outcomes. Prior to joining the Optum Enterprise Analytics team, Olga completed a one-year Technology Development Program focused on building essential technical skills, healthcare business acumen, and an analytical skill set, which led her to choose a data science career path. Olga holds a BS in finance from Central Connecticut State University. When Olga has a spare moment, you can find her traveling both in the United States and abroad.

Presentations

Breaking the rules: End Stage Renal Disease Prediction 40-minute session

This presentation will show both supervised and unsupervised learning methods for working with claims data and how they can complement each other. A supervised method will look at CKD patients at risk of developing ESRD, and an unsupervised approach will classify patients who tend to develop the disease faster than others.

Michelangelo D’Agostino is the senior director of data science at ShopRunner, where he leads a team that develops statistical models and writes software that leverages their unique cross-retailer e-commerce dataset. Michelangelo came to ShopRunner from Civis Analytics, a Chicago-based data science software and consulting company that spun out of the 2012 Obama re-election campaign. At Civis, he led the data science R&D team. Prior to that, he was a senior analyst in digital analytics with the 2012 Obama re-election campaign, where he helped to optimize the campaign’s email fundraising juggernaut and analyzed social media data. Michelangelo has been a mentor with the Data Science for Social Good Fellowship. He holds a PhD in particle astrophysics from UC Berkeley and got his start in analytics sifting through neutrino data from the IceCube experiment. Accordingly, he spent two glorious months at the South Pole, where he slept in a tent salvaged from the Korean War and enjoyed the twice-weekly shower rationing. He’s also written about science and technology for the Economist.

Presentations

The Care and Feeding of Data Scientists: Concrete Tips for Retaining Your Data Science Team 40-minute session

Data scientists are hard to hire. But too often, companies struggle to find the right talent only to make avoidable mistakes that cause their best data scientists to leave. From org structure and leadership to tooling and infrastructure to continuing education, this talk will offer concrete (and inexpensive) tips for keeping your data scientists engaged, productive, and adding business value.

Milene Darnis is a Data Product Manager at Uber, focusing on building a world-class experimentation platform. From her role as a Product Manager and her previous experience at Uber as a Data Engineer, she has developed a passion for linking data to concrete business problems.
Previously, Milene was a Business Intelligence Engineer at a mobile gaming company. She holds a Master’s Degree in Engineering from Telecom ParisTech, France.

Presentations

A/B testing at Uber: how we built a BYOM (Bring Your Own Metrics) platform 40-minute session

Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis, who leads product management for the experimentation platform, will talk about how the team built a scalable, self-serve platform that lets users plug in any metric to analyze.

Kaushik is a partner/CTO at Novantas, where he is responsible for the technology strategy and R&D roadmap of a number of cloud-based platforms. He has 15+ years of experience leading large engineering teams to develop scalable, high-performance analytics platforms. He holds an MS in engineering from the University of Pennsylvania, an MS in computer science from the University of Missouri, and an MS in computational finance from Carnegie Mellon University.

Presentations

Case Study: A Spark-based Distributed Simulation Optimization Architecture for Portfolio Optimization in Retail Banking 40-minute session

We discuss a large-scale optimization architecture in Spark for a consumer product portfolio optimization case study in retail banking. It combines a simulator, which distributes computation of complex real-world scenarios given varying macroeconomic factors, consumer behavior, and competitor landscape, with a constraint optimizer that uses business rules as constraints to meet growth targets.
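A minimal sketch of the distribution pattern described; the toy scenario grid and balance model below are assumptions for illustration, not Novantas's actual system:

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scenario-simulation").getOrCreate()
sc = spark.sparkContext

# Hypothetical scenario grid: candidate product rates crossed with
# macroeconomic shocks. Real systems would use far richer models.
scenarios = [(rate, shock) for rate in (0.01, 0.015, 0.02)
                           for shock in (-0.01, 0.0, 0.01)]

def simulate(scenario):
    rate, shock = scenario
    random.seed(hash(scenario))  # deterministic per scenario
    # Toy balance-growth simulation standing in for a consumer-behavior model
    balance = 1_000_000.0
    for _ in range(12):
        balance *= 1 + rate + shock + random.gauss(0, 0.002)
    return (scenario, balance)

# Spark distributes the independent scenario simulations across the cluster.
results = sc.parallelize(scenarios).map(simulate).collect()
# A constraint optimizer would then pick the scenario meeting growth targets;
# here we just take the best simulated outcome.
print(max(results, key=lambda kv: kv[1]))
```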

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

Executive Briefing: Getting Your Data Ready for Heavy EU Privacy Regulations (GDPR) 40-minute session

The General Data Protection Regulation (GDPR) goes into effect in May 2018 for firms doing any business in the EU. However, many companies aren't prepared for its strict requirements or the fines for noncompliance (up to €20 million or 4% of global annual revenue). This session will explore the capabilities your data environment needs in order to simplify GDPR compliance, as well as future regulations.

Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.

Florian Douetteau is the CEO of Dataiku, a company democratizing access to data science.
Having started programming in early childhood, he dropped the prestigious École Normale math courses to start working at 20 at a startup that later became Exalead, a search engine company in the early days of the French startup community. His interests include data, artificial intelligence, and how tech can improve the daily work life of tech people.

Presentations

Executive Briefing: Profit from AI and Machine Learning – The best practices for people & process 40-minute session

Ovum will present the results of research cosponsored by Dataiku, surveying a specially selected sample of chief data officers and data scientists, on how to map roles and processes to make success with AI in the business repeatable.

James Dreiss is a Senior Data Scientist at Reuters. He studied at New York University and the London School of Economics, and previously worked at the Metropolitan Museum of Art in New York.

Presentations

Document Vectors in the Wild: Building a Content Recommendation System for Reuters.com 40-minute session

A discussion of the challenges in building a content recommendation system for one of the largest news sites in the world, Reuters.com. The particularities of the system include developing a scrolling newsfeed and the use of document vectors for semantic representation of content.
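
For readers unfamiliar with document vectors, here is a minimal sketch of the general approach using gensim's Doc2Vec; the articles, tags, and parameters are invented for illustration and do not reflect Reuters' actual pipeline.

```python
# Hedged sketch: learn document vectors and query nearest neighbors for
# recommendation. Corpus and hyperparameters are toy values (gensim 4.x API).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

articles = {
    "art-1": "fed raises interest rates amid inflation concerns",
    "art-2": "central bank signals further rate hikes this year",
    "art-3": "new smartphone lineup unveiled at trade show",
}

corpus = [TaggedDocument(words=text.split(), tags=[art_id])
          for art_id, text in articles.items()]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Recommend the documents whose vectors are closest to a given article.
print(model.dv.most_similar("art-1", topn=2))
```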

Chiny Driscoll is MetiStream’s founder and CEO. MetiStream is a provider of real-time integration and analytic services in the Big Data arena.

Chiny has more than 24 years of management and executive leadership experience in the technology industry, having served in a variety of roles with Fortune 500 tech companies.

Prior to founding MetiStream, Chiny was the Worldwide Executive Leader of Big Data Services for IBM’s Information Management division. There, she led all of the professional services that implemented and supported IBM’s big data products and solutions across industries such as financial services, communications, the public sector, and retail. Key solutions included streaming, analytics, Hadoop, and DW appliance-related solutions.

Before IBM, Chiny was the VP and General Manager of Netezza, a leader in Big Data warehouse appliances and advanced analytics which was acquired by IBM in 2010. Preceding her work building Netezza’s services and education organization, Chiny held various global and regional leadership roles at TIBCO Software. Chiny’s last position at TIBCO was running the pre-sales, services and sales operations for the Public Sector division. Prior to TIBCO she served in services leadership roles at EDS and other services and technology companies.

Presentations

Digging for Gold: Developing AI in healthcare against unstructured text data - exploring the opportunities and challenges 40-minute session

This Cloudera/MetiStream solution lets healthcare providers automate the extraction, processing and analysis of clinical notes within the Electronic Health Record in batch or real-time. Improve care, identify errors, and recognize efficiencies in billing and diagnoses by leveraging NLP capabilities to conduct fast analytics in a distributed environment. Use case by Rush University Medical Center.

Carolyn Duby, a Hortonworks solutions engineer, is dedicated to helping her customers harness the power of their data with Apache open source platforms. A subject-matter expert in cybersecurity and data science, Carolyn is an active leader in the community and a frequent speaker at Future of Data meetups in Boston, MA, and Providence, RI, and at conferences such as the Open Data Science Conference and the Global Data Science Conference. Prior to joining Hortonworks, she was the architect for cybersecurity event correlation at SecureWorks. Ms. Duby earned an ScB magna cum laude and an ScM in computer science from Brown University. She is a lifelong learner and recently completed the Johns Hopkins University Data Science Specialization on Coursera.

Presentations

Apache Metron: Open Source Cyber Security at Scale Tutorial

Learn how to find the cybersecurity threat needle in your event haystack using Apache Metron: a real-time, horizontally scalable, open-source platform. After this interactive overview of the platform's major features, you will be ready to analyze your own haystack back at the office.

Ted Dunning is chief application architect at MapR Technologies. He’s also a board member for the Apache Software Foundation, a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects, and a mentor for various incubator projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He has contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library and designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Progress for Big Data in Kubernetes 40-minute session

Stateful containers are a well-known antipattern, but the standard answer of managing state in a separate storage tier is costly and complex. Recent developments have changed things dramatically for the better. In particular, you can now manage a high-performance software defined storage tier entirely in Kubernetes. I will describe what's new and how it makes big data easier on Kubernetes.

Brent is the Director of Data Strategy at Domo and has over 14 years of enterprise analytics experience at Omniture, Adobe, and Domo. He is a regular Forbes contributor on data-related topics and has published two books on digital analytics, including Web Analytics Action Hero. In 2016, Brent received the Most Influential Industry Contributor Award from the Digital Analytics Association (DAA). He has been a popular presenter at multiple conferences such as Shop.org, Adtech, Pubcon, and Adobe Summit. Brent earned his MBA from Brigham Young University and his BBA (Marketing) degree from Simon Fraser University. Follow him on Twitter @analyticshero.

Presentations

Stories Beat Statistics: How to Master the Art and Science of Data Storytelling 40-minute session

Companies are collecting all kinds of data and using advanced tools and techniques to find insights, yet they often fail in the last mile: communicating insights effectively to drive change. This session will look at the power that stories wield over statistics and explore the art and science of data storytelling, an essential skill that everyone must have in today's data economy.

Barbara Eckman is a Principal Data Architect at Comcast. She leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing Big Data. Barbara is a recognized technical innovator in Big Data architecture and governance, as well as scientific data and model integration. Her experience includes technical leadership positions at a Human Genome Project Center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.

Presentations

Data Discovery and Lineage: Integrating streaming data in the public cloud with on-prem, classic datastores and heterogeneous schema types 40-minute session

Comcast’s Streaming Data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. We were recently challenged to integrate on-prem datasources, including traditional data warehouses and RDBMSs. Our data governance strategy must now include relational and JSON schemas in addition to Apache Avro. Here’s how we did it!

Laura Eisenhardt is EVP at iKnow Solutions Europe and the founder of DigitalConscience.org, a CSR platform designed to create opportunities for technical resources (specifically expats) to give back to communities with their unique skills while making a huge impact locally. Laura has led massive programs for the World Health Organization across Africa, collecting big data in over 165 languages, and specializes in data quality and consistency. Laura is also COO of the American Institute of Minimally Invasive Heart Surgery (AIMHS.org), a nonprofit designed to educate the public and heart surgeons worldwide on how to do open heart surgery without splitting open the chest. Why? People who have complex heart surgery as a minimally invasive procedure return to work in two weeks versus 9–12 months, which has a substantial impact on society, family finances, depression, and cost for all.

Presentations

GDPR and the Australian Privacy Act – Forcing the Legal and Ethical Hands of How Companies Collect, Use and Analyze Data Findata

Data brings unprecedented insights to industries about customer behavior, and personal data is being harvested. We know more about our customers and neighbors than at any other time in history, but we need to avoid "crossing the creepy line." Governance and security experts from Cloudera, Mastercard, and iKnow Solutions discuss how ethical behavior drives trust, especially in today's IoT age.

Jonathan is CTO and co-founder at DataStax and the founding project chair of Apache Cassandra. Previously, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy.

Presentations

Cassandra vs Cloud Databases 40-minute session

Is open-source Apache Cassandra still relevant in an era of hosted cloud databases? DataStax CTO Jonathan Ellis will discuss Cassandra’s strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner.

Nick Elprin is the cofounder and CEO of Domino Data Lab, a data science platform that accelerates the development and deployment of models while enabling best practices like collaboration and reproducibility. Previously, Nick built tools for quantitative researchers at Bridgewater, one of the world’s largest hedge funds. He has over a decade of experience working with data scientists at advanced enterprises. Nick holds a BA and MS in computer science from Harvard.

Presentations

Managing Data Science in the Enterprise Tutorial

The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Moty Fania is a principal engineer for big data analytics at Intel IT and the CTO of the Advanced Analytics Group, which delivers big data and AI solutions across Intel. With over 15 years of experience in analytics, data warehousing, and decision support solutions, Moty leads the development and architecture of various big data and AI initiatives, such as IoT systems, predictive engines, online inference systems, and more. Moty holds a bachelor’s degree in economics and computer science and a master’s degree in business administration from Ben-Gurion University.

Presentations

A high-performance system for deep learning inference and visual inspection 40-minute session

In this session, Moty Fania will share Intel IT’s experience implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time streaming and online actuation. This session highlights the key learnings from this work, with a thorough review of the platform’s architecture.

Usama is founder/CEO of Open Insights, where he has worked with large and small enterprises on AI and big data strategy and on launching new business models, most recently serving as interim CTO for Stella.AI, a VC-funded startup in AI for HR/recruiting, and interim CTO of MTN2.0, helping develop new revenue streams in mobile payments/MFS at MTN, Africa’s largest mobile operator. Usama was the first Global Chief Data Officer at Barclays in London (2013–2014), after he launched the largest tech startup accelerator in MENA (2010–2013) as Executive Chairman of Oasis500 in Jordan. His background includes chairman/CEO roles at several startups, including Blue Kangaroo Corp, DMX Group, and digiMine (Audience Science). He was the first person ever to hold the Chief Data Officer title, when Yahoo! acquired his second startup in 2004. He held leadership roles at Microsoft (1996–2000) and founded the machine learning systems group at NASA’s Jet Propulsion Laboratory (1989–1995), where his work on machine learning resulted in the top Excellence in Research award from Caltech and a US government medal from NASA. Usama has published over 100 technical articles on data mining, data science, AI/ML, and databases. He holds over 30 patents and is a Fellow of the Association for the Advancement of Artificial Intelligence and a Fellow of the Association for Computing Machinery. Usama earned his PhD in engineering in AI/machine learning from the University of Michigan. He also holds two BSE degrees in engineering, an MSE in computer engineering, and an MSc in mathematics.

Presentations

Next Generation Cybersecurity via Data Fusion, AI and BigData: Pragmatic Lessons from the Front Lines in Financial Services 40-minute session

This presentation will share the main outcomes and learnings from building and deploying a global data fusion, incident analysis/visualization, and effective cybersecurity defense based on BigData and AI at a major EU bank, in collaboration with several financial services institutions. The focus is on the learnings and breakthroughs gleaned from making these systems work.

Dr. William “Bill” Fehlman is a Data Scientist Lead at USAA, where he provides machine learning tools and guidance that create efficiencies and greater effectiveness in contact center operations through actionable insights generated from analytics. Bill came to USAA after working as a Senior Analytics Consultant with Clarity Solution Group. Prior to working with Clarity, he served as an Automation & Robotics Systems Engineer at NASA Langley Research Center. Before that, he served 23 years in the US Army in numerous leadership and operations research roles, including Assistant Professor and Director of the Differential Calculus Program at the US Military Academy. Bill’s academic credentials include a PhD in Applied Science with a concentration in machine learning from the College of William & Mary, an MS in Applied Mathematics from Rensselaer Polytechnic Institute, and a BS in Mathematics from SUNY Fredonia.

Presentations

An Intuitive Explanation for Approaching Topic Modeling 40-minute session

We compare topic modeling algorithms used to identify latent topics in large volumes of text data and present coherence scores that identify the method showing the highest consistency with human judgments of topic quality. We then discuss the importance of coherence scores in choosing the topic modeling algorithm that best supports different use cases.
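
For concreteness, the sketch below scores candidate gensim LDA models with the c_v coherence measure, one common proxy for human judgments of topic quality; the toy corpus, topic counts, and settings are invented and are not the presenter's code.

```python
# Hedged sketch: compare topic counts by coherence score with gensim.
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

docs = [["loan", "rate", "mortgage", "payment"],
        ["claim", "policy", "coverage", "deductible"],
        ["loan", "payment", "credit", "score"],
        ["policy", "premium", "claim", "coverage"]]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# Train candidate models and pick the topic count with the best coherence.
for k in (2, 3):
    lda = LdaModel(bow, num_topics=k, id2word=dictionary, random_state=0)
    cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                        coherence="c_v")
    print(k, cm.get_coherence())
```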

Stephanie Fischer has many years of consulting experience in big data, machine learning, and human-centric innovation. As a product owner, she develops services and products based on machine learning and content analytics. She speaks at conferences, writes articles on big data and machine learning, and is the founder of datanizing GmbH.

Presentations

From chaos to insight: Automatically derive value from your user-generated content Data Case Studies

Users generate text all over the internet, 24 hours a day, 7 days a week. This text often contains complaints, wishes, and clever ideas. Using both unsupervised and supervised machine learning, we show what insights can be derived from 100,000 user comments related to New York. We will uncover the most exciting trends and sentiments with interactive visualisations.
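
A minimal sketch of the unsupervised half of such an analysis, assuming scikit-learn is available: TF-IDF vectors clustered with k-means. The comments and cluster count are made up for the example.

```python
# Illustrative pass over user comments: TF-IDF features + k-means clusters.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "the subway is always late in the morning",
    "love the new bike lanes downtown",
    "trains delayed again, fix the signals",
    "more bike parking near the park please",
]

X = TfidfVectorizer(stop_words="english").fit_transform(comments)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inspect which comments landed in which cluster.
for label, comment in zip(km.labels_, comments):
    print(label, comment)
```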

Brian Foo is a senior software engineer in Google Cloud working on applied artificial intelligence, where he builds demos for Google Cloud’s strategic customers as well as open source tutorials to improve public understanding of AI. Brian previously worked at Uber, where he trained machine learning models and built large-scale training and inference pipelines for mapping and sensing/perception applications using Hadoop/Spark. Prior to that, Brian headed the real-time bidding optimization team at Rocket Fuel, where he worked on algorithms that determined millions of ads shown every second across platforms such as web, mobile, and programmatic TV. Brian received a BS in EECS from Berkeley and a PhD in EE Telecommunications from UCLA.

Presentations

From Training to Serving: Deploying Tensorflow Models with Kubernetes Tutorial

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Join Ron Bodkin and Brian Foo to learn how to bring deep learning models from training to serving in a cloud production environment. You will learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.

Janet Forbes is an experienced enterprise, business, and senior systems architect with a deep understanding of data, functional, and technical architecture and a proven ability to define, audit, and improve business processes based on best practices. She has extensive experience leading multi-functional teams through the planning and delivery of complex solutions.
With over 25 years of experience across various roles and organizations, Janet has proven capability in enterprise, functional, and technical architecture, with a specific focus on business and data architecture. As a trusted advisor, Janet works closely with clients in assessing and shaping their data strategy practices.

Presentations

From Theory to Data Product - Applying Data Science Methods to Effect Business Change Tutorial

This tutorial is a primer on crafting well-conceived data science projects on course toward uncovering valuable business insights. Using case studies and hands-on skills development, we will teach techniques that are essential for a variety of audiences invested in effecting real business change.

Jean-Michel Franco is Director of Product Marketing for Talend’s data governance solutions. He has dedicated his career to developing and broadening the adoption of innovative technologies in companies. Prior to joining Talend, he started out at EDS (now HP), where he created and developed a business intelligence (BI) practice, then joined SAP EMEA as Director of Marketing Solutions in France and North Africa, and later Business & Decision as Innovation Director. He has authored four books and regularly publishes articles and presents at events and tradeshows.

Presentations

Enacting The Data Subject Access Rights For GDPR With Data Services And Data Management 40-minute session

GDPR is more than another regulation to be handled by your back office. Enacting the Data Subject Access Rights (DSAR) requires practical actions. In this session, we will discuss the practical steps to deploy governed data services.

Bill Franks is Chief Analytics Officer for The International Institute For Analytics (IIA). Franks is also the author of Taming The Big Data Tidal Wave and The Analytics Revolution. His work has spanned clients in a variety of industries, ranging in size from Fortune 100 companies to small nonprofit organizations. You can learn more at http://www.bill-franks.com.

Presentations

Analytics Maturity: Industry Trends And Financial Impacts 40-minute session

The International Institute For Analytics studied the analytics maturity level of large enterprises. The talk will cover how maturity varies by industry and some of the key steps organizations can take to move up the maturity scale. The research also correlates analytics maturity with a wide range of corporate success metrics including financial and reputational measures.

Eugene Fratkin is a director of engineering at Cloudera, where he leads the company’s cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data and the workloads themselves can be transient or long-running in nature. One of the challenges we will explore is keeping the data context consistent across these various workloads.

Michael J. Freedman is the cofounder and CTO of TimescaleDB, an open source database that scales SQL for time-series data, and Professor of Computer Science at Princeton University, where his research focuses on distributed systems, networking, and security.

Previously, Michael developed CoralCDN (a decentralized CDN serving millions of daily users) and Ethane (the basis for OpenFlow and software-defined networking) and cofounded Illuminics Systems (acquired by Quova, now part of Neustar). He is a technical advisor to Blockstack.

Michael’s honors include the Presidential Early Career Award for Scientists and Engineers (PECASE, given by President Obama), the SIGCOMM Test of Time Award, the Caspar Bowden Award for Privacy Enhancing Technologies, a Sloan Fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, a DARPA Computer Science Study Group membership, and multiple award publications. He holds a PhD in computer science from NYU’s Courant Institute and bachelor’s and master’s degrees from MIT.

Presentations

Performant time-series data management and analytics with Postgres 40-minute session

I describe how to leverage Postgres even for high-volume time-series workloads using TimescaleDB, an open-source time-series database built as a Postgres plugin. I explain its general architectural design principles, as well as new time-series data management features including adaptive time partitioning and near-real-time continuous aggregations.
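
As a small illustration of the workflow, the sketch below drives TimescaleDB from Python via psycopg2; it assumes a reachable Postgres instance with the timescaledb extension available, and the connection string and table schema are invented for the example.

```python
# Hedged sketch: create a table and convert it into a time-partitioned
# hypertable. Connection details and schema are illustrative only.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ NOT NULL,
        device_id   TEXT,
        temperature DOUBLE PRECISION
    );
""")
# The core TimescaleDB step: partition the table by time under the hood
# while keeping it queryable as ordinary SQL.
cur.execute(
    "SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")

conn.commit()
cur.close()
conn.close()
```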

Brandon Freeman is a Mid-Atlantic region strategic systems engineer at Cloudera, specializing in infrastructure, cloud, and Hadoop. Brandon previously worked at Explorys in operations, architecture, and performance optimization for its Cloudera Hadoop environments, serving as the infrastructure architect responsible for designing, building, and managing many large Hadoop clusters.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data and the workloads themselves can be transient or long-running in nature. One of the challenges we will explore is keeping the data context consistent across these various workloads.

Chris Fregly is founder and research engineer at PipelineAI, a San Francisco-based streaming machine learning and artificial intelligence startup. Previously, Chris was a distributed systems engineer at Netflix, a data solutions engineer at Databricks, and a founding member of the IBM Spark Technology Center in San Francisco. Chris is a regular speaker at conferences and meetups throughout the world. He’s also an Apache Spark contributor, a Netflix Open Source committer, founder of the Global Advanced Spark and TensorFlow meetup, author of the upcoming book Advanced Spark, and creator of the O’Reilly video series Deploying and Scaling Distributed TensorFlow in Production.

Presentations

Building a High Performance Model Serving Engine from Scratch using Kubernetes, GPUs, Docker, Istio, and TensorFlow 40-minute session

Applying my Netflix experience to a real-world problem in the ML and AI world, I will demonstrate a full-featured, open-source, end-to-end TensorFlow Model Training and Deployment System using the latest advancements with Kubernetes, TensorFlow, and GPUs.

Brandy Freitas is a research physicist-turned-data scientist based in Boston, MA. Her academic research focused primarily on protein structure determination, applying machine learning techniques to single-particle cryo-electron microscopy data. She is currently a Principal Data Scientist at Pitney Bowes, where she works with clients in a wide variety of industries to develop analytical solutions for their business needs. Brandy is a National Science Foundation Graduate Research Fellow, a James Mills Pierce Fellow, and holds an SM in Biophysics from Harvard University.

Presentations

Executive Briefing: Analytics for Executives - Building an Approachable Language to Drive Data Science in Your Organization 40-minute session

Data science is an approachable field given the right framing. Often, though, practitioners and executives are describing opportunities using completely different languages. In this session, Harvard Biophysicist-turned-Data Scientist, Brandy Freitas, will work with participants to develop context and vocabulary around data science topics to help build a culture of data within their organization.

I am a senior global executive and have managed a number of implementation projects for small and large companies over the past decade-plus. I have also initiated and directed a number of AI, OR, and optimization R&D projects. I have developed, transferred, and established best practices and cutting-edge technology in many industries, including retail, distribution, manufacturing, call centers, healthcare, airport services, and security.

- Current CEO of Element AI
- Chief Innovation/Products Officer and Head of JDA Labs, a position I assumed following the successful sale of my company, Planora, in July 2012
- CEO and co-founder of Planora, which brought to market a highly disruptive SaaS solution in the extremely competitive workforce management space by combining a unique blend of AI, machine learning, operations research, and user experience
- Co-founder and Director of Products for Logiweb, a leading custom web development firm focusing on web-based analytics and decision support tools, acquired by Innobec

Presentations

From Data Governance to AI Governance: The CIO's new role 40-minute session

The CIO is going to need a broader mandate in the company to better align AI training and outcomes with business goals and compliance. This mandate should include an AI governance team that is well staffed and deeply established in the company, in order to catch biases that can develop from faulty goals or flawed data.

Navdeep is a hacker scientist at H2O.ai. He graduated from California State University, East Bay, with an MS in computational statistics, a BS in statistics, and a BA in psychology (minor in mathematics). During his education he developed interests in machine learning, time series analysis, statistical computing, data mining, and data visualization.

Prior to H2O.ai, he worked at a couple of startups and at Cisco Systems, focusing on data science, software development, and marketing research. Before that, he was a consultant at FICO, working with small to midsize banks in the US and South America on risk management across different bank portfolios (car loans, home mortgages, and credit cards). Before stepping into industry, he worked as a researcher/analyst in various neuroscience labs at institutions such as UC Berkeley, UCSF, and the Smith-Kettlewell Eye Research Institute, where his work spanned behavioral, electrophysiology, and functional magnetic resonance imaging research.

In his spare time Navdeep enjoys watching documentaries, reading (mostly non-fiction or academic), and working out.

Presentations

Practical Techniques for Interpreting Machine Learning Models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. This technical tutorial will share practical and productizable approaches for explaining, testing, and visualizing machine learning models through a series of publicly available examples using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.
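
One technique in this family, sketched below with scikit-learn rather than the tutorial's exact toolchain: fit a shallow, human-readable surrogate decision tree to a black-box model's scores. The data, models, and depth are invented for illustration.

```python
# Hedged sketch: a global surrogate model for interpretability.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# "Black-box" model whose behavior we want to approximate and explain.
blackbox = GradientBoostingClassifier(random_state=0).fit(X, y)
scores = blackbox.predict_proba(X)[:, 1]

# Interpretable surrogate trained on the black-box scores, not the labels.
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, scores)
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(5)]))
```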

Harry Glaser founded Periscope Data in 2012 with co-founder Tom O’Neill. The two have grown Periscope Data to serve nearly 1,000 customers. Glaser was previously at Google and graduated from the University of Rochester with a bachelor’s degree in computer science.

Presentations

An ethical foundation for the AI-driven future 40-minute session

What is the moral responsibility of a data team today? As AI & machine learning technologies become part of our everyday life, and as data becomes accessible to everyone, CDOs and data teams are taking on a very important moral role as the conscience of the corporation. This session will highlight the risks companies will face if they don't empower data teams to lead the way for ethical data use.

Zachary Glassman is a data scientist in residence at the Data Incubator. Zachary has a passion for building data tools and teaching others to use Python. He studied physics and mathematics as an undergraduate at Pomona College and holds a master’s degree in atomic physics from the University of Maryland.

Presentations

Hands-On Data Science with Python 1-Day Training

The Data Incubator offers a foundation in building intelligent business applications using machine learning. We will walk through all the steps, from prototyping to production, of developing a machine learning pipeline. We’ll look at data cleaning, feature engineering, model building/evaluation, and deployment. Students will extend these models into an application using a real-world dataset.

Heitor Murilo Gomes is a researcher at Télécom ParisTech focusing on machine learning—particularly evolving data streams, concept drift, ensemble methods, and big data streams. He co-leads the StreamDM open data stream mining project.

Presentations

Machine learning for non-stationary streaming data using Structured Streaming and StreamDM 40-minute session

The StreamDM library provides the largest collection of data stream mining algorithms for Spark. This talk will cover how StreamDM can be used alongside Structured Streaming to build incremental models, especially for non-stationary streams (i.e., those with concept drift). Concretely, we will cover how to develop, apply, and evaluate learning models using StreamDM and Structured Streaming.

Bruno Gonçalves is a Moore-Sloan fellow at NYU’s Center for Data Science. With a background in physics and computer science, Bruno has spent his career exploring the use of datasets from sources as diverse as Apache web logs, Wikipedia edits, Twitter posts, epidemiological reports, and census data to analyze and model human behavior and mobility. More recently, he has been focusing on the application of machine learning and neural network techniques to analyze large geolocated datasets.

Presentations

Recurrent Neural Networks for timeseries analysis Tutorial

The world is ever changing. As a result, many of the systems and phenomena we are interested in evolve over time, resulting in time-evolving datasets. Timeseries often display many interesting properties and levels of correlation. In this tutorial, we will introduce students to the use of Recurrent Neural Networks and LSTMs to model and forecast different kinds of timeseries.
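
A minimal sketch of the tutorial's theme, assuming TensorFlow/Keras: an LSTM trained to forecast the next step of a synthetic sine-wave series. The window size, layer sizes, and epoch count are arbitrary toy values.

```python
# Hedged sketch: one-step-ahead forecasting with an LSTM on synthetic data.
import numpy as np
from tensorflow import keras

series = np.sin(np.linspace(0, 20 * np.pi, 2000))
window = 30

# Build (window -> next value) training pairs from the series.
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]  # shape: (samples, timesteps, features)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(window, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=64, verbose=0)

print(model.predict(X[:1]))  # forecast the next value of the series
```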

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Near-real time Anomaly Detection at Lyft 40-minute session

Consumer-facing real-time processing poses a number of challenges in protecting against fraudulent transactions and other risks. The streaming platform at Lyft seeks to support this with an architecture that brings together a data-science-friendly programming environment and a deployment stack for the reliability, scalability, and other SLA requirements of a mission-critical stream processing system.

Sumit Gulwani is a Partner Research manager at Microsoft, leading the PROSE research and engineering team that develops APIs for program synthesis (programming by examples and natural language) and incorporates them into real products. He is the inventor of the popular Flash Fill feature in Microsoft Excel used by hundreds of millions of people. He has published 120+ peer-reviewed papers in top-tier conferences/journals across multiple computer science areas, delivered 40+ keynotes/invited talks at various forums, and authored 50+ patent applications (granted and pending). He is a recipient of the prestigious ACM SIGPLAN Robin Milner Young Researcher Award, ACM SIGPLAN Outstanding Doctoral Dissertation Award, and the President’s Gold Medal from IIT Kanpur.

Presentations

Programming by input-output Examples 40-minute session

Programming by input-output examples (PBE) is a new frontier in AI, set to revolutionize the programming experience for the masses. It can enable end users (99% of whom are non-programmers) to create small scripts, and make data scientists 10-100x more productive for many data wrangling tasks. Come learn about this new programming paradigm: its applications, form factors, the science behind it.
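
To make the idea concrete, here is a deliberately tiny, hypothetical illustration of programming by example in Python: enumerate a small program space and return a program consistent with the user's input-output pairs. Real systems such as PROSE use far richer program languages and ranking functions; nothing below reflects their actual API.

```python
# Toy PBE: search (delimiter, field index) programs against the examples.
examples = [("Jane Smith", "Smith"), ("Ada Lovelace", "Lovelace")]

def synthesize(examples):
    for delim in (" ", ",", "-"):
        for idx in (0, 1, -1):
            prog = lambda s, d=delim, i=idx: s.split(d)[i]
            # Keep the first program consistent with every example.
            try:
                if all(prog(inp) == out for inp, out in examples):
                    return prog
            except IndexError:
                continue
    return None

prog = synthesize(examples)
print(prog("Grace Hopper"))  # generalizes to unseen input -> "Hopper"
```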

Patrick Hall is a senior director for data science products at H2O.ai, where he focuses mainly on model interpretability. Patrick is also currently an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning.

Previously, Patrick held global customer-facing and R&D research roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the eleventh person worldwide to become a Cloudera Certified Data Scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Practical Techniques for Interpreting Machine Learning Models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. This technical tutorial will share practical and productizable approaches for explaining, testing, and visualizing machine learning models through a series of publicly available examples using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.

Zachary Hanif is a director in Capital One’s Center for Machine Learning, where he leads teams focused on applying machine learning to cybersecurity and financial crime. His research interests revolve around applications of machine learning and graph mining within the realm of massive security data and the automation of model validation and governance. Zachary graduated from the Georgia Institute of Technology.

Presentations

Network Effects: Working with Modern Graph Analytic Systems 40-minute session

Modern frameworks and analytical techniques are making graph analysis methods viable for increasingly large complex tasks, while an understanding of graph based analytical techniques can be extremely powerful when applied to modern practical problems. This talk examines three prominent graph analytic methods, including graph convolutional networks, and applies them to concrete use cases.

Kenji Hayashida is a data engineer at Recruit Lifestyle Co., Ltd. in Japan.
He holds a master’s degree in information engineering from Osaka University.

Kenji started his career as a software engineer at HITECLAB while he was in college.
After joining Recruit Group, Kenji has been involved in many projects, such as advertising technology, content marketing, and data pipelines.

In his spare time, Kenji enjoys programming competitions such as TopCoder, Google Code Jam, and Kaggle.
He is also the author of a popular data science textbook.

Presentations

Best Practices to Develop an Enterprise Datahub to Collect and Analyze 1TB/day Data from a Lot of Services with Apache Kafka and Google Cloud Platform in Production 40-minute session

Recruit Group and NTT DATA Corporation developed a platform based on a “datahub” utilizing Apache Kafka. The platform handles around 1 TB/day of application logs generated by many services in Recruit Group. This session explains some of the best practices and know-how, such as schema evolution and network architecture, learned during this project.

Jeff is Trifacta’s Chief Experience Officer, Co-founder and a Professor of Computer Science at the University of Washington, where he directs the Interactive Data Lab. Jeff’s passion is the design of novel user interfaces for exploring, managing and communicating data. The data visualization tools developed by his lab (D3.js, Protovis, Prefuse) are used by thousands of data enthusiasts around the world. In 2009, Jeff was named to MIT Technology Review’s list of “Top Innovators under 35”.

Presentations

The Vega Project: Building an Ecosystem of Tools for Interactive Visualization 40-minute session

This session introduces Vega and Vega-Lite, high-level declarative languages for interactive visualization that support exploratory data analysis, communication, and the development of new visualization tools.

Sam Helmich is a data scientist in John Deere’s Intelligent Solutions Group. He has worked in applied analytics roles within John Deere Worldwide Parts and Global Order Fulfillment and has an MS in statistics from Iowa State University.

Presentations

Data Science in an Agile Environment: Methods and Organization for Success Data Case Studies

Data science can benefit from borrowing some principles of Agile. These benefits can be compounded by structuring team roles in such a manner as to enable success without relying on full-stack expert “unicorns”.

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage Google Cloud Platform tools to analyze and understand their data in ways they never could before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

Protecting sensitive data in huge datasets: Cloud tools you can use 40-minute session

Before releasing a public dataset, practitioners need to strike a balance between utility and the protection of individuals. In this talk we’ll move from theory to real life while handling massive public datasets. We’ll showcase newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity into a practical realm.
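
As a hedged illustration of those concepts, the pandas sketch below computes k and l for a toy table: k-anonymity is the smallest group size over the quasi-identifier columns, and l-diversity is the smallest number of distinct sensitive values within any such group. The columns and records are invented.

```python
# Toy k-anonymity / l-diversity check with pandas.
import pandas as pd

df = pd.DataFrame({
    "zip":       ["10001", "10001", "10001", "10002", "10002"],
    "age":       ["30-40", "30-40", "30-40", "20-30", "20-30"],
    "diagnosis": ["flu", "cold", "flu", "flu", "cold"],
})

quasi_identifiers = ["zip", "age"]

# k: size of the smallest equivalence class over the quasi-identifiers.
k = df.groupby(quasi_identifiers).size().min()
print(f"dataset satisfies {k}-anonymity over {quasi_identifiers}")

# l: minimum count of distinct sensitive values within any class.
l = df.groupby(quasi_identifiers)["diagnosis"].nunique().min()
print(f"dataset satisfies {l}-diversity for 'diagnosis'")
```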

Garrett Hoffman is a Senior Data Scientist at StockTwits, where he leads efforts to use data science and machine learning to understand social dynamics and develop research and discovery tools that are used by a network of over one million investors. Garrett has a technical background in math and computer science but gets most excited about approaching data problems from a people-first perspective, using what we know or can learn about complex systems to drive optimal decisions, experiences, and outcomes.

Presentations

Deep Learning Methods for Natural Language Processing Tutorial

This workshop will review deep learning methods used for natural language processing and natural language understanding tasks while working on a live example with StockTwits data using Python and TensorFlow. Methods we review include Word2Vec, Recurrent Neural Networks and variants (LSTM, GRU), and Convolutional Neural Networks.
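
A minimal sketch of the first method on that list, assuming gensim is available: train Word2Vec on a few toy finance-flavored sentences (not StockTwits data) and query nearest neighbors in the embedding space.

```python
# Hedged sketch: word embeddings on a toy corpus (gensim 4.x API).
from gensim.models import Word2Vec

sentences = [
    ["bullish", "on", "aapl", "earnings"],
    ["bearish", "on", "tsla", "guidance"],
    ["aapl", "earnings", "beat", "estimates"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Words that co-occur in similar contexts end up close in the vector space.
print(model.wv.most_similar("aapl", topn=2))
```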

Keqiu is a staff software engineer at LinkedIn. He previously worked on mobile infrastructure at LinkedIn and recently moved to big data platforms.

Presentations

TonY: Native support of TensorFlow on Hadoop 40-minute session

We have developed TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. TonY's native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop including MapReduce and Spark.

A data science and machine learning consultant at Microsoft. Previously a machine learning student at Cambridge and an engineering student in Ghent.

Presentations

Democratising deep learning with transfer learning 40-minute session

Transfer learning allows data scientists to leverage insights from large labelled data sets. The general idea of transfer learning is to use knowledge learned from tasks for which a lot of labelled data is available in settings where only little labelled data is available. In this talk, you’ll learn what transfer learning is and how it can boost your NLP or CV pipelines.
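
A hedged Keras sketch of the standard recipe: reuse a backbone pretrained on a large labelled dataset, freeze it, and train only a small head on the little labelled data available. The input shape, backbone choice, and class count are illustrative.

```python
# Hedged sketch: transfer learning by freezing a pretrained backbone.
from tensorflow import keras

base = keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the knowledge learned on the large dataset

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(3, activation="softmax"),  # small target task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()  # only the head's weights are trainable
```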

Sr. Software Engineer at LinkedIn on the Hadoop development team.

Presentations

TonY: Native support of TensorFlow on Hadoop 40-minute session

We have developed TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. TonY's native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop including MapReduce and Spark.

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions of Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Scalable Machine Learning for Data Cleaning 40-minute session

Machine-learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.

Maryam is a research scientist at TapRecruit, a platform that uses AI and automation tools to bring efficiency and fairness to the recruiting process. She obtained her PhD from the Icahn School of Medicine at Mount Sinai (New York) for her studies on molecular regulators of organ size control. Maryam’s long-term research goal is to reduce bias in decision-making by using a combination of computational linguistics, machine learning, and behavioral economics methods.

Presentations

‘Moneyballing’ Recruiting: A Data-Driven Approach to Battling Bottlenecks and Biases in Hiring Data Case Studies

Hiring teams have long relied on intuition and experience to scout talent. Increased data and data science techniques give us a chance to test common recruiting wisdom. Maryam will draw on results from her recent behavioral experiments and analyses of over 10 million jobs and their outcomes to illustrate how often innocuous recruiting decisions have dramatic impacts on hiring outcomes.

Ankit currently works as a Data Scientist at Uber, where his primary focus is on forecasting using deep learning methods and on business problems for self-driving cars. Prior to that, he worked in a variety of data science roles at Runnr, Facebook, BofA, and ClearSlide. Ankit holds a master’s degree from UC Berkeley and a BS from IIT Bombay (India).

Presentations

Achieving Personalization with LSTMs 40-minute session

Personalization is a common theme in social networks and e-commerce businesses. However, personalization at Uber involves understanding how each driver/rider is expected to behave on the platform. In this talk, we will focus on how deep learning (LSTMs) and Uber’s huge database can be used to understand and predict the future behavior of each and every user on the platform.
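
As an illustrative sketch only (random placeholder data, an invented action vocabulary, and Keras assumed), sequence models of this kind are often framed as next-action prediction over a user's recent history:

```python
# Hedged sketch: LSTM over action sequences; all data here is random noise
# standing in for real user histories.
import numpy as np
from tensorflow import keras

n_actions = 10   # hypothetical vocabulary: open app, request ride, rate, ...
seq_len = 20
X = np.random.randint(0, n_actions, size=(1000, seq_len))  # recent actions
y = np.random.randint(0, n_actions, size=(1000,))          # next action

model = keras.Sequential([
    keras.layers.Embedding(input_dim=n_actions, output_dim=16),
    keras.layers.LSTM(32),
    keras.layers.Dense(n_actions, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```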

Jeroen Janssens is the founder and CEO of Data Science Workshops, which provides on-the-job training and coaching in data visualization, machine learning, and programming. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and startups YPlan and Outbrain in New York City. He is the author of Data Science at the Command Line, published by O’Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

Presentations

50 reasons to learn the shell for doing data science 40-minute session

"Anyone who does not have the command line at their beck and call is really missing something," tweeted Tim O'Reilly when Jeroen Janssens's Data Science at the Command Line was recently made available online for free. Join Jeroen to learn what you're missing out on if you're not applying the command line and many of its power tools to typical data science problems.

Data Science with Unix Power Tools Tutorial

The Unix command line remains an amazing environment for efficiently performing tedious but essential data science tasks. By combining small, powerful command-line tools, you can quickly scrub, explore, and model your data, as well as hack together prototypes. This hands-on workshop is based on the O’Reilly book Data Science at the Command Line, written by instructor Jeroen Janssens.

My interest is in designing and building distributed systems, and I love to experiment with new technology. I have worked on a lot of open source projects, including Hadoop (YARN, HDFS), Apache Spark, NFS-Ganesha, RocksDB, and OpenStack Swift, and have built systems from the ground up. Currently at Uber, I am designing and building a unified data ingestion platform using Apache Spark.
In the past, I have worked with Hedvig Inc., Hortonworks, and VMware Inc.

Presentations

Marmaray – A generic, scalable, and pluggable Hadoop data ingestion & dispersal framework 40-minute session

Marmaray is a generic Hadoop ingestion and dispersal framework recently released to production at Uber. We will introduce the main features of Marmaray and the business needs it meets, share how Marmaray can serve a team’s data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores, and give a deep dive into the architecture to show you how it all works.

All-day-long-coding CTO at Attendify. Clojure, Haskell, Rust. Fields of interest: algebras and protocols. Author of the Muse and Fn.py libraries. An active contributor to Aleph and other open source projects.

Presentations

Managing Data Chaos in the World of Microservices 40-minute session

When we talk about microservices we usually focus on the communication layer and rarely on the data. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity: structural and semantic changes, knowledge sharing and data discovery. We'll discuss emerging technologies created to tackle these challenges.

Atul Kale is a software engineer on Airbnb’s Machine Learning Infrastructure team. He majored in computer engineering at the University of Illinois Urbana-Champaign; prior to joining Airbnb, he worked in finance, building and deploying machine-learning-driven proprietary trading strategies as well as the data pipelines to support them.

Presentations

Bighead: Airbnb's End-to-End Machine Learning Platform 40-minute session

We introduce Bighead, Airbnb’s user-friendly and scalable end-to-end machine learning framework that powers Airbnb’s data-driven products. Bighead integrates popular libraries including TensorFlow, XGBoost, and PyTorch. It is built on Python, Spark, and Kubernetes, and is designed to be used in modular pieces. It has reduced overall model development time from many months to days at Airbnb.

Daniel Kang is a PhD student in the Stanford InfoLab, where he is supervised by Peter Bailis and Matei Zaharia. Daniel’s research interests lie broadly at the intersection of machine learning and systems. Currently, he is working on deep learning applied to video analysis.

Presentations

BlazeIt: An Exploratory Video Analytics Engine 40-minute session

As video volumes grow, automatic methods are required to prioritize human attention. However, these methods do not scale and are cumbersome to deploy. In response, we introduce BlazeIt, an exploratory video analytics engine. We show our declarative language, FrameQL, can capture a range of real-world queries and BlazeIt's optimizer can execute these queries over 2000x faster than naive approaches.

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She is a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work she enjoys playing with fire, riding scooters, and dancing.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am 40-minute session

Apache Spark is an amazing distributed system, but part of the bargain we’ve made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning), or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.

Yasuyuki Kataoka is a data scientist at NTT Innovation Institute Inc. His primary interest is applied R&D in machine learning applications for time-series and heterogeneous data such as vision, audio, text, and IoT sensor signals. This data science work spans various fields, including automotive, sports, healthcare, and social media. Other areas of interest include robotics control, such as self-driving car and drone systems. When not doing research, he likes to participate in hackathons, where he has won prizes in the automotive and healthcare industries. He earned his MS and BS in mechanical and systems engineering from the Tokyo Institute of Technology with valedictorian honors. While a full-time employee, he is also a PhD candidate in artificial intelligence at the University of Tokyo.

Presentations

Real-time machine intelligence in IndyCar and Tour de France 40-minute session

One of the challenges of sports data analytics is how to deliver machine intelligence beyond a mere real-time monitoring tool. This session highlights various real-time machine learning models in both IndyCar and the Tour de France, encompassing the real-time data processing architecture, the machine learning models, and a demonstration that delivers meaningful insights for players and fans.

Mubashir Kazia is a principal solutions architect at Cloudera and an SME in Apache Hadoop security in Cloudera’s Professional Services practice, where he helps customers secure their Hadoop clusters and comply with internal security policies. He also helps new customers transition to the Hadoop platform and implement their first few use cases, and he trains and mentors peers in Hadoop and Hadoop security. Mubashir has worked with customers from all verticals, including banking, manufacturing, healthcare, telecom, retail, and gaming. Previously, he worked on developing solutions for leading investment banking firms.

Presentations

Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.

Arun Kejariwal is a statistical learning principal at Machine Zone (MZ), where he leads a team of top-tier researchers and works on research and development of novel techniques for install and click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns. In addition, his team is building novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high-performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Correlation analysis on live data streams 40-minute session

Anomaly detection is a necessary but insufficient step, as anomaly detection over a set of live data streams may result in anomaly fatigue, limiting effective decision making. We will walk the audience through how marrying correlation analysis with anomaly detection can help, and share techniques to guide effective decision making.
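
A small pandas sketch of the idea, using invented data and thresholds: flag anomalies on one stream with a rolling z-score, then use the rolling correlation between streams as context for whether a relationship has broken down.

```python
# Hedged sketch: anomaly flags plus correlation context on two toy streams.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = pd.Series(rng.normal(size=n)).cumsum()
b = a + rng.normal(scale=0.5, size=n)          # stream correlated with a
b.iloc[400:] += rng.normal(scale=5, size=100)  # simulated regime change

# Anomaly detection: rolling z-score on stream b.
z = (b - b.rolling(50).mean()) / b.rolling(50).std()
anomalies = z.abs() > 3

# Correlation analysis: has the relationship between the streams degraded?
corr = a.rolling(50).corr(b)
print(anomalies.sum(), "anomalous points;",
      "min rolling correlation:", round(corr.min(), 2))
```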

Designing Modern Streaming Data Applications Tutorial

In this tutorial, we will walk the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline: messaging frameworks, streaming computing frameworks, and storage frameworks for real-time data. We will also walk through case studies from IoT, gaming, and healthcare, and share our experiences operating these systems at internet scale.

Jawad Khan is Director of Data Sciences and Knowledge Management at Rush University Medical Center. In this role, he leverages his extensive experience to lead Rush’s analytics and data strategy.

Jawad is passionate about leading the analytics program at Rush. He focuses on leveraging data from all sections of the business, including clinical, ERP, security, device sensor, and people/patient-generated data. In turn, the integrated data analytics will provide improved safety and better clinical outcomes, reduce cost, and drive innovation.

Jawad brings to Rush more than 20 years of experience in analytics, software development, data management, and data security. Prior to joining Rush, Jawad provided cloud enablement strategies for data and applications to clients like GE Capital, Coke, Procter & Gamble, and Warner Bros. while working as a lead architect at CenturyLink.

Prior to that, Jawad served as a Managing Director at Opus Capital Markets where he was responsible for leading analytics, data security and compliance, and software development. At Opus, Jawad was also responsible for data center and infrastructure development and operations.

Jawad graduated from Southern Illinois University with a degree in computer engineering and went on to work as a software engineering consultant for one of the Big Six consulting firms. He also speaks regularly at professional and community events and is a cricket commentator for WBEZ, the NPR-affiliate public radio station in Chicago.

Presentations

Digging for Gold: Developing AI in healthcare against unstructured text data - exploring the opportunities and challenges 40-minute session

This Cloudera/MetiStream solution lets healthcare providers automate the extraction, processing, and analysis of clinical notes within the electronic health record in batch or real time. By leveraging NLP capabilities to conduct fast analytics in a distributed environment, providers can improve care, identify errors, and recognize efficiencies in billing and diagnoses. Features a use case from Rush University Medical Center.

Amandeep Khurana is Chief Executive Officer and Co-founder at Cerebro Data, which he launched in 2016 with CTO and co-founder Nong Li. After witnessing first-hand the challenges companies faced in big data and cloud migration, he built Cerebro Data to empower all users with easy access through a unified, secured, and governed platform across heterogeneous data stores.

While supporting customer cloud initiatives at Cloudera and playing an integral role at AWS on the Elastic MapReduce team, Amandeep oversaw some of the industry's largest big data implementations. As such, he understands that customers need self-serve analytics without trading away governance or security. Amandeep is the co-author of HBase in Action, a book on building applications with HBase, and is passionate about distributed systems, big data, and everything cloud.

Amandeep received his MS in computer science from the University of California, Santa Cruz, and a bachelor's degree in engineering from Thapar Institute of Engineering and Technology.

Presentations

The Move to a Modern Data Platform in the Cloud. Pitfalls to Avoid and Best Practices to Follow 40-minute session

Critical data management practices for easy, unified data access that meets security and regulatory compliance requirements

James Kirkland is the advocate for Red Hat's initiatives and solutions for the internet of things (IoT) and is the architect of Red Hat's strategy for IoT deployments. This open source architecture combines data acquisition, integration, and rules activation with command and control data flows among devices, gateways, and the cloud to connect customers' operational technology environments with information technology infrastructure and provide agile IoT integration. James serves as the head subject-matter expert and global team leader of system architects responsible for accelerating IoT implementations for customers worldwide. Through his collaboration with customers, partners, and systems integrators, Red Hat has grown its IoT ecosystem, expanding its presence in industries including transportation, logistics, and retail and accelerating adoption of IoT in large enterprises. James has deep knowledge of Unix and Linux variants spanning his 20-year career at Red Hat, Racemi, and Hewlett-Packard. He is a steering committee member of the IoT working group for Eclipse.org, a member of the IIC, and a frequent public speaker and author on a wide range of technical topics.

Presentations

Using Machine Learning to Drive Intelligence at the Edge 40-minute session

The focus on IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing those learnings back out to the edge. Join Cloudera and Red Hat as they showcase how they executed this architecture at one of the world's leading manufacturers in Europe, including a demo highlighting this architecture.

Spencer Kirn is a PhD student in the Applied Science Department at the College of William & Mary. His research interests include applications of topic modeling, deep learning, and other AI methods to real-world problems, as well as understanding what happens inside these complicated algorithms in order to fully understand their predictions. Spencer graduated from the College of Wooster in 2016 with a BA in physics before arriving at the College of William & Mary.

Presentations

An Intuitive Explanation for Approaching Topic Modeling 40-minute session

We provide a comparison of topic modeling algorithms used to identify latent topics in large volumes of text data and then present coherence scores illustrating which method shows the highest consistency with human judgments of topic quality. We then discuss the importance of coherence scores in choosing the topic modeling algorithm that best supports a given use case.
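
As a small, hedged taste of the kind of scoring compared in the session, the snippet below computes a "c_v" coherence score for a tiny LDA model with gensim; the toy corpus and the library choice are ours, not the session's.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    docs = [
        ["solar", "telescope", "galaxy", "survey"],
        ["galaxy", "redshift", "survey", "spectra"],
        ["loan", "credit", "bank", "risk"],
        ["bank", "mortgage", "risk", "rate"],
    ]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

    # "c_v" is one of the coherence measures that tracks human judgments well.
    cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
    print(cm.get_coherence())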

Rita Ko is the Director of The Hive, the innovation lab at the UN Refugee Agency in the United States (USA for UNHCR). She heads the application of machine learning and data science to explore new modes of engagement around the global refugee crisis. Her work in data science stems from her election campaign experience in Canada at the Office of the Mayor in the City of Vancouver, where she helped re-elect Mayor Gregor Robertson for three consecutive terms, and she has worked on three national election campaigns applying predictive modeling. Rita earned her MBA from Cornell University.

Presentations

From Strategy to Implementation — Putting Data to Work at USA for UNHCR 40-minute session

The Hive and Cloudera Fast Forward Labs share how they transformed USA for UNHCR (the UN Refugee Agency in the US) to use data science and machine learning (DS/ML) to address the refugee crisis. From identifying use cases and success metrics to showcasing the value of DS/ML, we cover the development and implementation of a DS/ML strategy, hoping to inspire other organizations looking to derive value from data.

As Google Cloud’s Chief Decision Scientist, Cassie Kozyrkov is passionate about helping everyone – Google, its customers, the world! – make better decisions through harnessing the beauty and power of data. She speaks at conferences and meets with leadership teams to empower decision-makers to transform their industries through AI, machine learning, and analytics.

At Google, Cassie has advised more than a hundred teams on statistics and machine learning, working most closely with Research & Machine Intelligence, Google Maps, and Ads & Commerce. She has also personally trained more than 15,000 Googlers (executives, engineers, scientists, and even non-technical staff members) in machine learning, statistics, and data-driven decision-making.

Prior to joining Google, Cassie spent a decade working as a data scientist and consultant. She is a leading expert in decision science, with undergraduate studies in statistics and economics (University of Chicago) and graduate studies in statistics, neuroscience, and psychology (Duke University and NCSU).

When she’s not working, you’re most likely to find Cassie at the theatre, in an art museum, exploring the world, playing board games, or curled up with a good novel.

Presentations

Executive Briefing: Most Data-Driven Cultures, Aren’t 40-minute session

Many organizations aren't aware that they have a blind spot with respect to their lack of data effectiveness, and hiring experts doesn't seem to help. This session examines what it takes to build a truly data-driven organizational culture and highlights a vital, yet often neglected, job function: the data science manager.

Jay Kreps is the co-founder and CEO of Confluent, the company behind the popular Apache Kafka streaming platform. Previously, Jay was one of the primary architects for LinkedIn, where he focused on data infrastructure and data-driven products. He was among the original authors of a number of open source projects in the scalable data systems space, including Voldemort (a key-value store), Azkaban, Kafka (a distributed streaming platform), and Samza (a stream processing system).

Presentations

Apache Kafka and the four challenges of production machine learning systems 40-minute session

Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes or customer experience. This talk will explain some of the difficulties of building production machine learning systems and talk about how Apache Kafka and stream processing can help.

Abhishek Kumar is a manager of data science in Sapient’s Bangalore office, where he looks after scaling up the data science practice by applying machine learning and deep learning techniques to domains such as retail, ecommerce, marketing, and operations. Abhishek is an experienced data science professional and technical team lead specializing in building and managing data products from conceptualization to deployment phase and interested in solving challenging machine learning problems. Previously, he worked in the R&D center for the largest power-generation company in India on various machine learning projects involving predictive modeling, forecasting, optimization, and anomaly detection and led the center’s data science team in the development and deployment of data science-related projects in several thermal and solar power plant sites. Abhishek is a technical writer and blogger as well as a Pluralsight author and has created several data science courses. He is also a regular speaker at various national and international conferences and universities. Abhishek holds a master’s degree in information and data science from the University of California, Berkeley.

Presentations

Deep learning-based search and recommendation systems using TensorFlow Tutorial

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.

Manoj Kumar is a senior software engineer on the data team at LinkedIn, where he is currently working on autotuning Hadoop jobs. He has more than four years of experience with big data technologies such as Hadoop, MapReduce, Spark, HBase, Pig, Hive, Kafka, and Gobblin. Before joining LinkedIn, he worked at PubMatic for more than four years on a data framework for slicing and dicing advertising data (30 dimensions, 50 metrics) that receives more than 20 TB of data every day. Before that, he worked at Amazon for more than 18 months.

Presentations

TuneIn: How to get your jobs tuned while you are sleeping 40-minute session

Have you ever tuned a Spark or MapReduce job? If so, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. With Dr. Elephant, we introduced heuristic-based tuning recommendations. Now we introduce TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.
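
TuneIn itself is an internal LinkedIn tool whose details are not reproduced here; purely to illustrate the tune-measure-keep-best loop such a system automates, here is a toy random search over a hypothetical Spark configuration space (run_job and its cost formula are fabricated stand-ins for actually submitting a job and measuring its resource usage).

    import random

    # Hypothetical search space; real jobs expose many more knobs.
    SEARCH_SPACE = {
        "spark.executor.memory.gb": [2, 4, 8],
        "spark.executor.cores": [2, 4],
        "spark.sql.shuffle.partitions": [100, 200, 400],
    }

    def run_job(config):
        """Stand-in: a real tuner would launch the job with `config` and measure usage."""
        return (config["spark.executor.memory.gb"] * config["spark.executor.cores"]
                + 5000 / config["spark.sql.shuffle.partitions"])

    def tune(iterations=20):
        best_config, best_cost = None, float("inf")
        for _ in range(iterations):
            candidate = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
            cost = run_job(candidate)
            if cost < best_cost:
                best_config, best_cost = candidate, cost
        return best_config, best_cost

    print(tune())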

Data Scientist at H2O.ai

Presentations

Practical Techniques for Interpreting Machine Learning Models Tutorial

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. This technical tutorial will share practical and productizable approaches for explaining, testing, and visualizing machine learning models through a series of publicly available examples using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.
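
In the spirit of the tutorial, though with an illustrative dataset and model rather than its actual notebooks, one might probe an XGBoost model with global feature importances plus a hand-rolled partial dependence curve:

    import numpy as np
    import xgboost as xgb
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=500, n_features=5, random_state=0)
    model = xgb.XGBRegressor(n_estimators=50, max_depth=3)
    model.fit(X, y)

    print(model.feature_importances_)  # global importance of each feature

    def partial_dependence(model, X, feature, grid_points=20):
        """Average prediction as one feature sweeps its range, others held as observed."""
        grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_points)
        averages = []
        for value in grid:
            X_mod = X.copy()
            X_mod[:, feature] = value
            averages.append(model.predict(X_mod).mean())
        return grid, np.array(averages)

    grid, pd_curve = partial_dependence(model, X, feature=0)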

Jared P. Lander is chief data scientist of Lander Analytics, where he oversees the long-term direction of the company and researches the best strategy, models, and algorithms for modern data needs. He specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. In addition to his client-facing consulting and training, Jared is an adjunct professor of statistics at Columbia University and the organizer of the New York Open Statistical Programming Meetup and the New York R Conference. He is the author of R for Everyone, a book about R programming geared toward data scientists and nonstatisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world and was a member of the 2014 Strata New York selection committee. His writings on statistics can be found at Jaredlander.com. He was recently featured in the Wall Street Journal for his work with the Minnesota Vikings during the 2015 NFL Draft. Jared holds a master's degree in statistics from Columbia University and a bachelor's degree in mathematics from Muhlenberg College.

Presentations

Modeling Time Series in R 40-minute session

Temporal data is being produced in ever greater quantities, and fortunately our time series capabilities are keeping pace. We look at a number of techniques for modeling time series, starting with traditional methods such as ARMA and moving on to more modern tools such as Prophet and machine learning models like XGBoost and neural nets. Along the way, we look at a bit of theory and code for training these models.
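
The session itself works in R; purely as a language-neutral taste of the ARMA portion, here is the same idea sketched in Python with statsmodels on a simulated AR(1) series.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    y = np.zeros(200)
    for t in range(1, 200):           # simulate y_t = 0.7 * y_{t-1} + noise
        y[t] = 0.7 * y[t - 1] + rng.normal()

    result = ARIMA(y, order=(1, 0, 0)).fit()  # AR(1), i.e., ARMA(1, 0)
    print(result.params)        # estimated constant and AR coefficient
    print(result.forecast(10))  # 10-step-ahead forecast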

Paul Lashmet is practice lead and advisor for financial services at Arcadia Data, a company that provides visual big data analytics software that empowers business users to glean meaningful and real-time business insights from high volume and varied data in a timely, secure, and collaborative way. Paul writes extensively about the practical applications of emerging and innovative technologies to regulatory compliance. Previously, he led programs at HSBC, Deutsche Bank, and Fannie Mae.

Presentations

Visualize AI to Spot New Trading Opportunities Findata

Artificial intelligence and deep learning are used to generate and execute trading strategies today. Meanwhile, regulators and investors demand transparency into investment decisions. The challenge is that the decision-making processes of machine learning technologies are opaque. The opportunity is that these same machines generate data that can be visualized to spot new trading opportunities.

Francesca Lazzeri is a data scientist at Microsoft, where she is part of the algorithms and data science team. Francesca is passionate about innovations in big data technologies and the applications of advanced analytics to real-world problems. Her work focuses on the deployment of machine learning algorithms and web service-based solutions to solve real business problems for customers in the energy, retail, and HR analytics sectors. Previously, she was a research fellow in business economics at Harvard Business School. She holds a PhD in innovation management.

Presentations

A Day in the Life of a Data Scientist: how do we train our teams to get started with AI? 40-minute session

What profession did Harvard Business Review call the Sexiest Job of the 21st Century? With the growing buzz around data science, several professionals have approached us at various events to learn more about how to become a data scientist. This session aims to raise awareness of what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses.

Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC Member on Apache Pig, Apache Arrow and a few others. Julien is a Principal Engineer at WeWork and was previously Architect at Dremio and tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

From flat files to deconstructed database: the evolution and future of the Big Data ecosystem. 40-minute session

Over the past 10 years, big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem that is turning into a fully deconstructed, open source database with reusable components. We started with a system good at looking for a needle in a haystack using snowplows: plenty of horsepower and scalability, but lacking the efficiency of relational databases.

Danielle helps clients approach, design, implement, and integrate new insights and advanced analytics data products that align with their business goals. She's passionate about keeping data in context and applying research methods, best practices, and academic algorithms to industry business needs. With a strong background in machine learning, Danielle identifies the math, visualizations, and the business questions and processes necessary to create reliable predictive models and, ultimately, good, data-driven business guidance. Danielle has worked in healthcare, academia, government, retail, gaming, and energy, and with quantified selfers, biohackers, hacklabs, and makerspaces. She is notoriously unreadable to GSR wearables. In her previous life, Danielle worked with the world's most sophisticated wearable to date, the hearing aid. Currently, she focuses most of her time on data science in the energy sector.

Presentations

From Theory to Data Product - Applying Data Science Methods to Effect Business Change Tutorial

This tutorial is a primer on crafting well-conceived data science projects on course toward uncovering valuable business insights. Using case studies and hands-on skills development, we will teach techniques that are essential for a variety of audiences invested in effecting real business change.

Bob Levy is CEO of Virtual Cove, Inc., where his team leverages virtual and augmented reality to help people see datasets in a completely new light. A veteran tech executive, Mr. Levy brings over two decades of product leadership experience, including roles at IBM, Harte Hanks, and MathWorks. He also served as founding president of the BPMA in 2001, a 6,000+ person industry group.

Presentations

Augmented Reality: Going Beyond Plots in 3D 40-minute session

Augmented reality opens a completely new lens on your data through which you can see and accomplish amazing things. Learn how simple Python scripts can leverage completely new plot types. See use cases revealing new insight into financial markets data. Explore new ways of seeing and interacting with data to shed light on, and build trust in, otherwise "black box" machine learning solutions.

Jennifer Lim is the data and integration enterprise architect for Cerner Corporation, a company focused on creating intelligent solutions for the health care industry. Jennifer is responsible for the strategic planning, leadership, facilitation, analysis, and design tasks associated with the integration of internal Cerner applications. Her areas of focus include data management and governance, data architecture, API life cycle management, and services architecture.

Jennifer has over 18 years of experience in the telecommunications, banking and federal, and healthcare IT industries. She has filled roles in data analysis, data architecture, and application design having both built and used data warehouses, data marts, operational data stores, data lakes, API management platforms, and even the occasional application database. Jennifer holds a BS in management information systems and an MBA in management.

Presentations

Modernizing Operational Architecture with Big Data — Creating and Implementing a Modern Data Strategy Data Case Studies

Big data expectations can no longer be treated as technical requirements managed with bubble systems. They impact our entire architecture, including operational assets in areas like HR, finance, marketing, and service management. Share in the approaches we used to create our modern architecture strategy, realigning big data expectations with our business goals to increase efficiency and innovation.

Chang Liu is an Applied Research Scientist and a member of the Georgian Impact team. She brings her in-depth knowledge of mathematical and combinatorial optimization to helping Georgian's portfolio companies. Prior to joining Georgian Partners, Chang was a risk analyst at Manulife Bank, where she built models to assess the bank's risk exposure based on extensive market research, including evaluating and predicting the impact of the 2014 oil price drop on mortgage lending risk in Alberta.

Chang holds a master's degree in applied science in operations research from the University of Toronto, where she specialized in combinatorial optimization, and a bachelor's degree in mathematics from the University of Waterloo.

Presentations

Solving the Cold Start Problem: Data and Model Aggregation Using Differential Privacy 40-minute session

This talk outlines a common problem faced by many software companies, the cold-start problem, and how Georgian Partners has been successful at solving this problem by transferring knowledge from existing data through differentially private data aggregation.
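
As a toy illustration of the underlying mechanism (the Laplace mechanism for a bounded mean; Georgian's production approach is necessarily more involved), each data holder could release a privatized aggregate instead of raw records:

    import numpy as np

    def dp_mean(values, lower, upper, epsilon):
        """Release the mean of bounded values with epsilon-differential privacy."""
        clipped = np.clip(values, lower, upper)
        sensitivity = (upper - lower) / len(clipped)  # max effect of one record on the mean
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return clipped.mean() + noise

    # Each tenant contributes a noisy aggregate, which can then seed models
    # for new customers that have no data of their own yet.
    tenant_stats = [dp_mean(np.random.rand(1000), 0.0, 1.0, epsilon=0.5) for _ in range(5)]
    print(np.mean(tenant_stats))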

Changshu Liu is a software engineer at Pinterest, where, as one of the founding members of the data engineering team, he built big data infrastructure such as workflow engines and query processing engines. Before joining Pinterest, he worked at Facebook and Microsoft Research Asia on search infrastructure and big data processing systems.

Presentations

Scaling Pinterest’s data lake in cloud using Hive, Presto and Spark SQL 40-minute session

At Pinterest, we build our data lake on S3, where we read and write data directly. As the business expands rapidly, the data lake's footprint now exceeds 100 PB, making Pinterest one of the biggest customers in AWS. This massive scale and the nature of S3 bring a lot of technical challenges to processing engines like Hive, Presto, and Spark SQL.

Ingrid Liu is a senior software engineer (big data) at Novantas with an economics degree from Princeton University. Passionate about software and machine learning, she is a specialist in building Spark-based analytical platforms and enjoys innovating solutions in Fintech.

Presentations

Case Study : A Spark-based Distributed Simulation Optimization Architecture for Portfolio Optimization in Retail Banking 40-minute session

We discuss a large-scale optimization architecture in Spark for a consumer product portfolio optimization case study in retail banking—which combines a simulator that distributes computation of complex real-world scenarios given varying macro-economic factors, consumer behavior and competitor landscape, and a constraint optimizer that uses business rules as constraints to meet growth targets.

Jorge A. Lopez works in big data solutions at Amazon Web Services. Jorge has more than 15 years of business intelligence and DI experience and enjoys intelligent design and engaging storytelling. He is passionate about data, music, and nature.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

I am a passionate advocate of artificial intelligence and how it will transform businesses. Over the past two and a half years I have led the creation of Baringa's Data Science and Analytics team and supported our clients in their journeys to become leaders in artificial intelligence within their respective industries. Prior to this role, I worked as an independent data science consultant, in an investment bank, and for the leading Formula 1 team. I have two first-class master's degrees in quantitative subjects and have published and patented a machine learning system.

Presentations

Predicting residential occupancy and hot water usage from high frequency, multi-vector utilities data 40-minute session

Future Home Energy Management Systems could improve their energy efficiency by predicting resident needs through utilities data. This session discusses the opportunity with a particular focus on the key data features, the need for data compression and the data quality challenges.

Boris Lublinsky is a software architect at Lightbend, where he specializes in big data, stream processing, and services. Boris has over 30 years’ experience in enterprise architecture. Over his career, he has been responsible for setting architectural direction, conducting architecture assessments, and creating and executing architectural roadmaps in fields such as big data (Hadoop-based) solutions, service-oriented architecture (SOA), business process management (BPM), and enterprise application integration (EAI). Boris is the coauthor of Applied SOA: Service-Oriented Architecture and Design Strategies, Professional Hadoop Solutions, and Serving Machine Learning Models. He is also cofounder of and frequent speaker at several Chicago user groups.

Presentations

Hands-on Kafka Streaming Microservices with Akka Streams and Kafka Streams Tutorial

This hands-on tutorial builds streaming apps as microservices using Kafka with Akka Streams and Kafka Streams. We'll assess the strengths and weaknesses of each tool for particular needs, so you'll be better informed when choosing tools for your needs. We'll contrast them with Spark Streaming and Flink, including when to choose them instead. The sample apps demonstrate ML model serving ideas.
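
The tutorial's sample apps are JVM-based; purely as a language-agnostic sketch of the consume-transform-produce pattern at the heart of such microservices, the same shape looks like this in Python with kafka-python (topic names and broker address are placeholders):

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("orders",                     # input topic (placeholder)
                             bootstrap_servers="localhost:9092",
                             group_id="enricher")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    for record in consumer:                         # poll loop: one event at a time
        enriched = record.value.upper()             # stand-in for real business logic
        producer.send("orders-enriched", enriched)  # emit to the output topic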

Gerard Maas is a senior software engineer at Lightbend, where he contributes to the Fast Data Platform and focuses on the integration of stream processing technologies. Previously, he held leading roles at several startups and large enterprises, building data science governance, cloud-native IoT platforms, and scalable APIs. He is the coauthor of Stream Processing with Apache Spark from O’Reilly. Gerard is a frequent speaker and contributes to small and large open source projects. In his free time, he tinkers with drones and builds personal IoT projects.

Presentations

Processing Fast Data with Apache Spark: The Tale of Two APIs 40-minute session

Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. We will provide a critical view of their differences in key aspects of a streaming application: API usability, dealing with time, dealing with state, and machine learning capabilities. We will wrap up with practical guidance on picking one, or combining both, to implement resilient streaming pipelines.
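
For readers who have only seen the older DStream API, a minimal Structured Streaming job looks like the sketch below; the built-in rate source is used so it runs without any external infrastructure.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("two-apis-demo").getOrCreate()

    # The rate source emits (timestamp, value) rows; count them per 10-second window.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

    query = (counts.writeStream
             .outputMode("complete")  # windowed aggregates re-emitted in full each trigger
             .format("console")
             .start())
    query.awaitTermination()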

Swetha Machanavajhala works as a software engineer in Azure Networking, building tools to help engineers detect and diagnose network issues within seconds. She is very passionate about building products and awareness for people with disabilities, a passion that has led her to drive several hackathon projects from idea to reality to launch as beta products, winning multiple awards. She is also a co-lead of the disability Employee Resource Group, where she represents the community of people who are deaf or hard of hearing, and is a part of the ERG chair committee. Swetha is also a public speaker and has given several talks internally at Microsoft as well as at external events.

Presentations

Deep Learning on audio in Azure to detect sounds in real-time 40-minute session

In this auditory world, the human brain effortlessly processes and reacts to a variety of sounds (a dog barking, alarms, people calling from behind, etc.). While most of us take this for granted, there are over 360 million people in this world who are deaf or hard of hearing. How can we make the auditory world inclusive, and meet the great demand in other sectors, by applying deep learning to audio in Azure?
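
A common first step for deep learning on audio, whatever the platform, is converting waveforms into log-mel spectrograms that a CNN can treat like images. Here is a small sketch with librosa (our illustrative choice; the session itself centers on Azure):

    import numpy as np
    import librosa

    # One second of a synthetic 440 Hz tone standing in for a real recording.
    sr = 22050
    t = np.linspace(0, 1, sr, endpoint=False)
    y = 0.5 * np.sin(2 * np.pi * 440 * t)

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel)  # decibel scale is friendlier for networks
    print(log_mel.shape)  # (n_mels, frames): image-like input for a CNN classifier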

Mark Madsen is the global head of architecture at Think Big Analytics, where he is responsible for understanding, forecasting, and defining the analytics landscape and architecture. Prior to this, he was CEO of Third Nature, where he advised companies on data strategy and technology planning. Mark has designed analysis, data collection, and data management infrastructure for companies worldwide.

Presentations

Architecting a data platform for enterprise use Tutorial

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. We will explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and HearthStone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache Yarn, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the Internet of Things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.

Executive Briefing: Managing successful data projects - technology selection and team building 40-minute session

Creating a successful big data practice in your organization presents new challenges in managing projects and teams. In this session we'll provide guidance and best practices to help technical leaders deliver successful projects from planning to implementation.

Shankar has 17+ years of experience building distributed systems and productivity tools. He started out building a highly successful distributed test automation system for Windows and Bing at Microsoft, then spent eight years helping build a middle-tier platform that powered most of the online services forming the backbone of Bing and Microsoft Ads. He currently leads the grid productivity team in Bangalore, empowering Hadoop developers at LinkedIn to be more productive with their time and cluster resources.

Presentations

TuneIn: How to get your jobs tuned while you are sleeping 40-minute session

Have you ever tuned a Spark or MapReduce job? If so, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. With Dr. Elephant, we introduced heuristic-based tuning recommendations. Now we introduce TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.

Jaya Mathew is a senior data scientist on the artificial intelligence and research team at Microsoft, where she focuses on the deployment of AI and ML solutions to solve real business problems for customers in multiple domains. Previously, she worked on analytics and machine learning at Nokia and Hewlett-Packard. Jaya holds an undergraduate degree in mathematics and a graduate degree in statistics from the University of Texas at Austin.

Presentations

A Day in the Life of a Data Scientist: how do we train our teams to get started with AI? 40-minute session

What profession did Harvard Business Review call the Sexiest Job of the 21st Century? With the growing buzz around data science, several professionals have approached us at various events to learn more about how to become a data scientist. This session aims to raise awareness of what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses.

Les McMonagle (CISSP, CISA, ITIL) – VP of Security Strategy, BlueTalon Inc.

Les has over twenty years’ experience in information security. He has held the position of Chief Information Security Officer (CISO) for a credit card company and ILC bank, founded a computer training and IT outsourcing company in Europe, directed the security and network technology practice for Cambridge Technology Partners across Europe and helped several security technology firms develop their initial product strategy. Les founded and managed Teradata’s Information Security, Data Privacy and Regulatory Compliance Center of Excellence, was Chief Security Strategist at Protegrity and is currently Vice President of Security Strategy at BlueTalon.

Les holds a BS in MIS along with CISSP, CISA, ITIL, and other relevant industry certifications.

Presentations

Privacy by Design – Building data privacy and protection in, versus bolting it on later 40-minute session

"Privacy by Design" is a fundamentally important approach to achieving compliance with GDPR and other data privacy or data protection regulations. This session will outline how organizations can save time and money while improving data security and regulatory compliance and dramatically reduce the risk of a data breach or expensive penalties for non-compliance.

Matteo Merli is a software engineer at Streamlio working on messaging and storage technologies. Previously, he spent several years at Yahoo building database replication systems and multitenant messaging platforms. Matteo was the architect and lead developer for Yahoo Pulsar and a member of the PMC of Apache BookKeeper.

Presentations

High Performance Messaging with Apache Pulsar 40-minute session

Apache Pulsar, a messaging system, is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it is very important to ensure that the system can make use of all the available resources. This talk will provide insight into the design decisions and the implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.
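
For orientation, here is a minimal produce/consume round trip with Pulsar's Python client; the broker URL and topic are placeholders.

    import pulsar

    client = pulsar.Client("pulsar://localhost:6650")  # placeholder broker address

    # Subscribe first so the subscription sees the message published below.
    consumer = client.subscribe("persistent://public/default/demo-topic",
                                subscription_name="demo-sub")

    producer = client.create_producer("persistent://public/default/demo-topic")
    producer.send(b"hello pulsar")  # blocks until the broker acknowledges durability

    msg = consumer.receive()
    print(msg.data())
    consumer.acknowledge(msg)  # acked messages can be deleted for this subscription

    client.close()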

Cory Minton is a Staff Technologist for Dell EMC in their Ready Solutions team, where he works hand in hand with clients across the globe to assess and develop big data strategies, architect technology solutions, and ensure successful deployments of these transformational initiatives. A geek, technology evangelist, and business strategist, Cory is focused on finding creative ways for organizations to drive the utmost value from their data while transforming IT's relevance to the organizations and customers they serve. With a diverse background in IT applications, consulting, data center infrastructure, and the expanding big data ecosystem, Cory brings an interesting perspective to the clients he serves while consistently challenging them to think bigger. Cory holds an undergraduate degree in engineering from Texas A&M University and an MBA from Tennessee Tech University. Cory resides in Birmingham, Alabama, with his beautiful wife and two awesome children.

Presentations

DIY vs. designer approaches to deploying data center infrastructure for machine learning and analytics 40-minute session

How to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble

Mridul Mishra is responsible for emerging technology in the Asset Management group at Fidelity Investments, where he leads machine learning/AI projects and has been involved in the approach to putting these projects into production. Mridul has around 21 years of experience building enterprise software, ranging from core trading software to smart applications using AI/ML capabilities.

Presentations

Explainable Artificial Intelligence (XAI): Why, When, and How? Findata

Currently, most ML models (deep learning models in particular) work like a black box, and a key challenge to their adoption is the need for explainability. In this talk, we will explore why explainability is needed, survey the current state, and provide a framework for thinking about these needs and the potential solution options.

Sanjeev Mohan leads big data research at Gartner for Technical Professionals. His areas of expertise span end-to-end data pipelines, including ingestion, persistence, integration, transformation, and advanced analytics. He researches trends and technologies for relational and NoSQL databases as well as object stores and cloud databases, and his research includes machine learning and IoT. A well-respected speaker on big data and data governance, he is also on the panel of judges for Hadoop distribution organizations such as Cloudera and Hortonworks.

Presentations

Executive Briefing: Enhance your Data Lake with comprehensive Data Governance to improve adoption and meet compliance needs 40-minute session

If the last few years were spent proving the value of data lakes, the emphasis now is on monetizing the big data architecture investments. The rallying cry is to onboard new workloads efficiently. However, how does one do so without knowing what data is in the lake, the level of its quality, and the trustworthiness of models? This is why data governance becomes the linchpin to the success of data lakes.

Mostafa Mokhtar is a performance engineer at Cloudera. Previously, he held similar roles at Hortonworks and on the SQL Server team at Microsoft.

Presentations

Optimizing Apache Impala for a cloud-based data warehouse 40-minute session

Cloud object stores are becoming the bedrock of a cloud data warehouse for modern data-driven enterprises. Given today's data sizes, it's become a necessity for data teams to have the ability to directly query data stored in S3 or ADLS. In this talk, we'll discuss the optimal end-to-end workflows and technical considerations of using Apache Impala over object stores for your cloud data warehouse.

Andrew Montalenti is the co-founder and CTO of Parse.ly, a widely-used real-time web content analytics platform. The product is trusted daily by editors at HuffPost, TIME, TechCrunch, Slate, Quartz, The Wall Street Journal, and over 350 other leading digital companies. Andrew is a dedicated Pythonista who has presented his team’s work at the PyCon and PyData conferences. He is also the co-host of the web data/analytics podcast, The Center of Attention. Follow him on Twitter via @amontalenti and check out Parse.ly’s research on Internet attention via @parsely.

Presentations

Applying petabyte-scale analytics and machine learning to billions of news reading sessions 40-minute session

What can we learn from a one-billion-person live poll of the internet? Parse.ly has gathered a unique dataset of news reading sessions from billions of devices, peaking at over 2 million sessions per minute on thousands of high-traffic news and information websites. Our team of data scientists and machine learning engineers has used this data to unearth the secrets behind online content.

The first time I met the word data it was just the plural of datum.

I am a BI Architect at Zalando, where I am redesigning the current Data Infrastructure. I like to solve problems and to learn new things.

I like to draw data models and optimize queries.

In my free time I have a daughter; for some reason she speaks four languages (but not my dialect).

Presentations

Scaling Data Infrastructure in the fashion world or “What is this? Business Intelligence for ants?” 40-minute session

The story of how Zalando went from old-school BI to an AI-driven company built on a solid data platform, what we learned in the process, and the challenges we still see in front of us.

Ash Munshi is CEO of Pepperdata. Previously, Ash was executive chairman for deep learning startup Marianas Labs (acquired by Askin in 2015); CEO of big data storage startup Graphite Systems (acquired by EMC DSSD in 2015); CTO of Yahoo; and CEO of a number of other public and private companies. He serves on the board of several technology startups.

Presentations

Classifying Job Execution Using Deep Learning 40-minute session

In this talk, we will describe a technique for labeling applications using a deep neural network over runtime measurements of CPU, memory, I/O, and network. This labeling groups the applications into buckets with understandable characteristics, which can then be used to reason about the cluster and its performance.
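
The speakers' model and features are their own; as a schematic stand-in, a small network mapping per-job resource summaries to class buckets could look like this (Keras, with random placeholder data):

    import numpy as np
    from tensorflow import keras

    # Feature vector per job: e.g., summary statistics of CPU, memory, I/O, network.
    X = np.random.rand(1000, 12).astype("float32")  # placeholder measurements
    y = np.random.randint(0, 4, size=1000)          # placeholder job-type labels

    model = keras.Sequential([
        keras.layers.Dense(32, activation="relu", input_shape=(12,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(4, activation="softmax"),  # one output per job bucket
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)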

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and cofounder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (ADBE), and aQuantive (MSFT).

Presentations

Setting Up a Lightweight Distributed Caching Layer using Apache Arrow 40-minute session

This talk will deep-dive on a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. We'll start with an overview of the system design and deployment architecture. This includes coverage of cache lifecycle, update patterns, cache cohesion and appropriate use cases.
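
The cache itself is not public code; the pyarrow sketch below simply shows the Arrow IPC round trip such a cache serves, where a table serialized once can be handed to many consumers and read back with minimal copying.

    import pyarrow as pa

    table = pa.table({"id": [1, 2, 3], "score": [0.1, 0.7, 0.3]})

    # Write the table as an Arrow IPC stream into a buffer (stand-in for cache memory).
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()

    # A consumer reconstructs the table from the same bytes.
    restored = pa.ipc.open_stream(buf).read_all()
    print(restored.equals(table))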

Niraj Nagrani joined Ancestry, the global leader in family history and consumer genomics, in September 2017 as Senior Vice President of Data and Cloud Platform, responsible for the company's data and cloud platforms. Prior to joining Ancestry, Niraj was Global Head of Platforms for Cloud, Data, Analytics, Frameworks, Products and Engineering at American Express, where he was responsible for the consumer, merchant, and corporate application, cloud, and data science platforms. Earlier, Niraj was VP of Engineering and Product at the A16Z-funded SnapLogic and General Manager of Microsoft Azure Cloud and O365 Engineering at Microsoft, in addition to prior senior executive product and engineering leadership roles at Oracle, Interwoven, and Cap Gemini.

Presentations

Using Data Science & Technology to Deliver Personalized Insights from Genome at Scale: An Ancestry.com Case Study 40-minute session

Ancestry has more than 10 petabytes of structured and unstructured data. Ancestry's SVP of platform, Niraj Nagrani, will discuss how companies can build a data platform that uses cloud computing, data science, artificial intelligence, and machine learning to analyze complex data sets at scale and provide personalized insights and relationship graphs to consumers.

As the Director of Business Strategies for SAS Best Practices, Kimberly balances forward thinking with real-world perspectives on business analytics, data governance, analytic cultures, and change management.

Kimberly’s current focus is helping customers understand both the business potential and practical implications of artificial intelligence (AI) and machine learning (ML).

Presentations

Rationalizing Risk in AI/ML 40-minute session

Too often, the discussion of AI and ML includes an expectation, if not a requirement, of infallibility. But as we know, this expectation is not realistic. So what's a company to do? While risk can't be eliminated, it can be rationalized. This session will demonstrate how an unflinching risk assessment enables AI/ML adoption and deployment.

Ann Nguyen evangelizes design for impact at Whole Whale, where she leads the tech and design team in building meaningful digital products for nonprofits. She has designed and managed the execution of multiple websites for organizations including the LAMP, Opportunities for a Better Tomorrow, and Breakthrough. Ann is always challenging designs with A/B testing. She bets $1 on every experiment that she runs and to date has accumulated a decent sum. Previously, Ann worked with a wide range of organizations from the Ford Foundation to Bitly. She is Google Analytics and Optimizely Platform certified. Ann is a regular speaker on nonprofit design and strategy and recently presented at the DMA Nonprofit Conference. She has also taught at Sarah Lawrence College. Outside of work, Ann enjoys multisensory art, comedy shows, fitness, and making cocktails, ideally all together.

Presentations

How to Be Aggressively Tone-Deaf Using Data (or, We Should All Be For-Benefits) Data Case Studies

Google returns 97,900,000 results for "data-driven business." Innovation is the key to survival, and data, combined with design thinking and iteration, is a proven path. The problem is that this system lacks a conscience; it lacks empathy thinking.

Minh Chau Nguyen is a researcher in the Big Data Software Platform Research department at the Electronics and Telecommunications Research Institute (ETRI), one of the largest government-funded research institutes in Korea. His research interests include big data management, software architecture, and distributed systems.

Presentations

A Data Marketplace Case Study with Blockchain and Advanced Multitenant Hadoop in a Smart Open Data Platform 40-minute session

This session will address how analytics services in data marketplace systems can be performed on a single Hadoop cluster across distributed data centers. We extend the overall architecture of the Hadoop ecosystem with blockchain so that multiple tenants and authorized third parties can securely access data to perform various analytics while still maintaining privacy, scalability, and reliability.

Anna Nicanorova is Director of Annalect Labs, a space for experimentation and rapid prototyping within Annalect. During her time at Annalect she has worked on numerous data-marketing solutions: attribution, optimizers, quantification of content, and image recognition technology. In 2015, Anna was part of the Annalect team that won the I-Com Data Science Hackathon.

Anna is co-founder of the Books+Whiskey meetup and a volunteer coding teacher with ScriptEd. She holds an MBA from the University of Pennsylvania's Wharton School and a BA from Hogeschool van Utrecht.

Presentations

Data Visualization in Mixed Reality with Python 40-minute session

Data visualization is supposed to be our map to information. However, contemporary charting techniques have a few shortcomings: context reduction, hard numeric grasp, and perceptual dehumanization. Augmented reality can potentially solve all of these issues by presenting an intuitive and interactive environment for data exploration.

Aileen Nielsen is a software engineer at One Drop, a company working on diabetes-management products. Aileen has worked in corporate law, physics research laboratories, and, most recently, NYC startups oriented toward improving daily life for underserved populations—particularly groups who have yet to fully enjoy the benefits of mobile technology. Her interests range from defensive software engineering to UX designs for reducing cognitive load to the interplay between law and technology. She currently serves as a member of the New York City Bar Association’s Science and Law committee, where she chairs a subcommittee devoted to exploring and advocating for scientifically driven regulation (and deregulation) of new and existing technologies. Aileen holds degrees in anthropology, law, and physics from Princeton, Yale, and Columbia respectively.

Presentations

How to be fair: a tutorial for beginners Tutorial

There is mounting evidence that the widespread deployment of machine learning and artificial intelligence in business and government applications is likely reproducing or even amplifying existing prejudices and social inequalities. This tutorial is designed to give knowledge and tools to data scientists so they can identify and avoid bias and other unfairness in their analyses.
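
One of the simplest tools in this space, shown here as a hedged toy example, is the disparate impact ratio behind the well-known "80% rule": compare positive-outcome rates across groups and flag large gaps.

    import numpy as np

    group = np.array(["a", "a", "b", "b", "b", "a"])  # protected attribute
    decision = np.array([1, 0, 1, 1, 1, 1])           # model decisions

    rate_a = decision[group == "a"].mean()
    rate_b = decision[group == "b"].mean()
    ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
    print(ratio)  # below 0.8 is a common red flag for disparate impact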

Ryan Nienhuis is a senior technical product manager on the Amazon Kinesis team, where he defines products and features that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Owen O’Malley is a software architect on Hadoop working for Hortonworks, a startup focusing on Hadoop development. Prior to cofounding Hortonworks, Owen and the rest of the Hortonworks team worked at Yahoo developing Hadoop. He has been contributing patches to Hadoop since before it was separated from Nutch and was the original chair of the Hadoop PMC. Before working on Hadoop, he worked on Yahoo Search’s WebMap project, which builds a graph of the known web and applies many heuristics to the entire graph that control search. Prior to Yahoo, Owen wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), and software model checking (NASA). He holds a PhD in software engineering from the University of California, Irvine.

Presentations

Introducing Iceberg: Tables Designed for Object Stores 40-minute session

Iceberg is a new open source project that defines a new table layout with properties specifically designed for cloud object stores such as S3. It provides a common set of capabilities, including partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet.
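
Iceberg's actual metadata format is far richer than this, but the core pruning idea can be illustrated in a few lines: track per-file partition values in a manifest and skip files whose metadata cannot match the query predicate.

    from dataclasses import dataclass

    @dataclass
    class DataFile:
        path: str
        partition: dict  # e.g., {"event_date": "2018-09-12"}

    manifest = [
        DataFile("s3://bucket/tbl/f1.parquet", {"event_date": "2018-09-11"}),
        DataFile("s3://bucket/tbl/f2.parquet", {"event_date": "2018-09-12"}),
    ]

    def prune(manifest, column, value):
        """Keep only the files whose partition metadata can satisfy the predicate."""
        return [f for f in manifest if f.partition.get(column) == value]

    print(prune(manifest, "event_date", "2018-09-12"))  # f1 is skipped entirely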

Brian O’Neill is a product designer and founder of the consulting firm Designing for Analytics, which helps companies design indispensable data products and analytics solutions that customers love. His clients and past employers include DELL/EMC, NetApp, Tripadvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JP Morgan Chase, the Future of Music Coalition, and ETrade, among others, and he has worked on award-winning IT/storage industry software for Akorri and Infinio. Brian has also brought over 20 years of design experience to various podcasts, meetups, and conferences such as the O’Reilly Strata Conference in New York City and London, England. He has also authored the Designing for Analytics Self-Assessment Guide for Non-Designers as well as numerous articles on design strategy, user experience, and business related to analytics. Brian is also an expert advisor on the topics of design and user experience for the International Institute for Analytics.

When he is not manning his Big Green Egg at a BBQ or mixing a classic tiki cocktail, Brian can be found on stage performing as a professional percussionist and drummer. He leads the acclaimed dual-ensemble, Mr. Ho’s Orchestrotica that is “anything but straightforward” (Washington Post) and has performed at Carnegie Hall, the Kennedy Center, and the Montreal Jazz Festival.

If you’re at a conference, just look for the only guy with a stylish orange leather messenger bag.

Presentations

UX Strategies for Underperforming Analytics Services and Data Products 40-minute session

Gartner says 85%+ of big data projects will fail, despite the fact that your company may have invested millions in engineering implementation. Why are customers and employees not engaging with these products and services? CDOs, CIOs, product managers, and analytics leaders with a "people first, technology second" mission, a design strategy, will realize the best UX and business outcomes possible. #design

Troels Oerting is a globally recognized cybersecurity expert who now holds a number of board posts as a non-executive director in key companies, as well as high-profile advisory roles. He has been working on the "first line" of cybersecurity for the last 38 years and has held a number of significant posts both nationally and internationally.
Troels joined Barclays as Group Chief Information Security Officer (CISO) in early 2015, reporting to the Group Chief Operations and Technology Officer. He was appointed Group Chief Security Officer in 2016 with end-to-end responsibility for all security in Barclays Group, leading more than 3,000 security experts worldwide protecting the bank's 50 million customers and 140,000 employees.
Before joining Barclays, Troels was Director of the European Cybercrime Centre (EC3), an EU-wide centre located in Europol's HQ tasked with assisting law enforcement agencies in protecting 500 million citizens in the 28 EU member states from cybercrime and loss of privacy.
As an expert in cybersecurity, Troels has constantly looked for new legislative, technical, or cooperation opportunities to efficiently protect the privacy and security of internet users. He has pioneered new methodologies to prevent crime in cyberspace and protect innocent users from losing their digital identity, assets, or privacy online. As Director of EC3, he initiated the establishment of the international Joint Cybercrime Action Task Force (J-CAT), including leading global law enforcement agencies, prosecutors, and Interpol's Global Centre of Innovation; the J-CAT has since been recognized as the leading international response to the increasing threat from organized cybercriminal networks.
He has been a cyber advisor for the EU Commission and Parliament, has served as a permanent delegate in many governance organisations (e.g., ICANN, ITU, and the Council of Europe), and has advised several governments and organisations on cyber-related questions. He also established a vast global outreach programme including law enforcement, NGOs, key tech companies, and industry, who together with academic research institutes formed a multifaceted global coalition against cybercriminal syndicates and networks, with the aim of enhancing online security without harming privacy and inventing new ways of protecting internet users. Before joining Europol as Director of EC3, Troels was Assistant Director of Europol's Organised Crime department and of its Counter Terrorist department, and he also held positions as Director of Operations in the Danish Security Intelligence Service and Director of the Danish Serious Organised Crime Agency (SOCA).
Troels is also an external lecturer on cybercrime at a number of universities and business schools and has been internationally awarded several times by global law enforcement agencies for his leadership in fighting cyber- and organised crime. He is the author of a political thriller published in Danish, Operation Gamma.

Presentations

Next Generation Cybersecurity via Data Fusion, AI, and Big Data: Pragmatic Lessons from the Front Lines in Financial Services 40-minute session

This presentation will share the main outcomes and learnings from building and deploying global data fusion, incident analysis/visualization, and effective cybersecurity defenses based on big data and AI at a major EU bank, in collaboration with several financial services institutions. The focus is on learnings and breakthroughs gleaned from making the systems work.

Diego Oppenheimer, founder and CEO of Algorithmia, is an entrepreneur and product developer with an extensive background in all things data. Prior to founding Algorithmia, he designed, managed, and shipped some of Microsoft's most used data analysis products, including Excel, Power Pivot, SQL Server, and Power BI.
Diego holds a bachelor's degree in information systems and a master's degree in business intelligence and data analytics from Carnegie Mellon University.

Presentations

Deploying ML Models in the Enterprise 40-minute session

After big investments in collecting & cleaning data, and building Machine Learning models, enterprises discover the big challenges in deploying models to production and managing a growing portfolio of ML models. This talk covers the strategic and technical hurdles each company must overcome and the best practices we've developed while deploying over 4,000 ML models for 70,000 engineers.

Francois Orsini is the chief technology officer for MZ’s Satori business unit. Previously, he served as vice president of platform engineering and chief architect, bringing his expertise in building server-side architecture and implementation for a next-gen social and server platform; was a database architect and evangelist at Sun Microsystems; and worked in OLTP database systems, middleware, and real-time infrastructure development at companies like Oracle, Sybase, and Cloudscape. Francois has extensive experience working with database and infrastructure development, honing his expertise in distributed data management systems, scalability, security, resource management, HA cluster solutions, soft real-time and connectivity services. He also collaborated with Visa International and Visa USA to implement the first Visa Cash Virtual ATM for the internet and founded a VC-backed startup called Unikala in 1999. Francois holds a bachelor’s degree in civil engineering and computer sciences from the Paris Institute of Technology.

Presentations

Correlation analysis on live data streams 40-minute session

Anomaly detection is a necessary but insufficient step: applied over a set of live data streams, it can result in anomaly fatigue, limiting effective decision making. We will walk the audience through how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.
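
As a sketch of the idea (an assumed illustration, not the presenters' implementation), the following flags anomalies per stream with a rolling z-score and then correlates the streams over the anomalous window, so highly correlated alerts can be grouped rather than surfaced individually:

```python
import numpy as np
import pandas as pd

def rolling_zscore_anomalies(series: pd.Series, window: int = 60, threshold: float = 3.0) -> pd.Series:
    """Mark points more than `threshold` rolling stds from the rolling mean."""
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    return (series - mean).abs() > threshold * std

# Hypothetical metrics: one DataFrame column per live stream.
rng = np.random.default_rng(0)
streams = pd.DataFrame({
    "api_latency": rng.normal(100, 5, 1000),
    "error_rate": rng.normal(1.0, 0.2, 1000),
})
streams.loc[500:520, ["api_latency", "error_rate"]] += [40, 2.0]  # injected incident

anomalies = streams.apply(rolling_zscore_anomalies)

# Correlate streams over the anomalous window: strongly correlated anomalies
# likely share a root cause and can be collapsed into a single alert.
incident = streams[anomalies.any(axis=1)]
print(incident.corr())
```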

Occhio Orsini has over 25 years of experience with data and analytics technology platforms. He started his career in application development, then spent time developing database engine technology and internet search technology for heritage Ascential Software. After IBM acquired Ascential, Occhio played a central role in the creation of the IBM Information Server suite. Wanting to improve the adoption of these technologies, he took a position in Aetna's enterprise architecture group, where he worked on the strategy and adoption of data analytics and data governance platforms. As big data and data science became the new direction for analytics, Occhio led the solution engineering and architecture efforts to build Aetna's Data Fabric, which supports the company's advanced analytics initiatives across the organization.

Presentations

Aetna's Advanced Analytics Platform (Data Fabric) 40-minute session

Aetna's Data Fabric platform is based on the Hadoop technology stack but has integrated many different technologies to create a robust data lake and advanced analytics platform that meets the needs of Aetna's data scientists and analytics practitioners.

Steve Otto is the Associate Director of the Enterprise Architecture team at Navistar and helps shape the technology strategy and architecture to drive business goals. He was formerly the manager of the Information Management team at Navistar.

Mr. Otto started his career as a developer in the management consulting practice at Ernst & Young and has held a variety of roles in his IT career. He has worked in a number of different capacities and has had direct responsibility for a wide range of activities, including the planning, design, build, operation, and support functions for IT projects in the consumer products, retail, aerospace and defense, healthcare, manufacturing, and higher education markets.

Presentations

Driving Predictive Analytics for IoT & Connected Vehicles Data Case Studies

Navistar built an IoT-enabled remote diagnostics platform, called OnCommand® Connection, to bring together data from 375,000+ vehicles in real time to drive predictive analytics. This service is offered to fleet owners, who can now monitor the health and performance of their trucks from smartphones or tablets. Join Steve Otto from Navistar to learn more about the company's IoT and data journey.

Jerry Overton is a data scientist and distinguished technologist in DXC’s Analytics group, where he is the principal data scientist for industrial machine learning, a strategic alliance between DXC and Microsoft comprising enterprise-scale applications across six different industries: banking and capital markets, energy and technology, insurance, manufacturing, healthcare, and retail. Jerry is the author of Going Pro in Data Science: What It Takes to Succeed as a Professional Data Scientist (O’Reilly) and teaches the Safari training course Mastering Data Science at Enterprise Scale. In his blog, Doing Data Science, Jerry shares his experiences leading open research and transforming organizations using data science.

Presentations

Minimum-Viable Machine Learning: The Applied Data Science Bootcamp (Sponsored by DXC Technology) 1-Day Training

Acquiring machine-learning (ML) technology is relatively straightforward, but ML must be applied to be useful. In this one-day boot camp, we teach students how to apply advanced analytics in ways that reshape the enterprise and improve outcomes. This training is equal parts hackathon, presentation, and group participation.

Joshua Patterson is the director of applied solutions engineering at NVIDIA. Previously, Josh worked with leading experts across the public and private sectors and academia to build a next-generation cyberdefense platform. He was also a White House Presidential Innovation Fellow. His current passions are graph analytics, machine learning, and GPU data acceleration. Josh also loves storytelling with data and creating interactive data visualizations. He holds a BA in economics from the University of North Carolina at Chapel Hill and an MA in economics from the University of South Carolina’s Moore School of Business.

Presentations

Accelerating Financial Data Science Workflows With GPU 40-minute session

GPUs have allowed financial firms to run complex simulations, train myriads of models, and data mine at unparalleled speeds. Today, the bottleneck has moved completely to ETL. With the GPU Open Analytics Initiative (GoAi), we’re accelerating ETL and keeping the entire workflow on GPUs. We’ll discuss real-world examples, benchmarks, and how we’re accelerating our largest FS customers.

Natasha Pongonis
Co-Owner & Partner, Nativa, Inc.
Co-Founder and CEO, OYE! Business Intelligence

Natasha Pongonis is the co-owner of Nativa, a multicultural marketing agency with offices in Columbus, Ohio, and Phoenix, Arizona, and the CEO of OYE! Business Intelligence, a solution that provides data analytics based on user demographics.

Natasha is a business communications expert with extensive marketing experience. She has developed key digital strategies for many organizations and government agencies, engaging diverse Hispanic audiences through her understanding of communication across cultures, traditions, and regional variations of Spanish.

A native of Argentina, she has worked with companies in Europe and in North and South America, developing a strong sense of clients' needs and creating culturally relevant online presences.

In 2017, Natasha was invited to participate in a roundtable at the White House to discuss priorities for Latina business owners in America. In 2016, she was accepted into the prestigious Stanford Latino Entrepreneur Leaders Program to focus on strategic scalability and economic growth, received the Women in Business and Leadership Award from the US Hispanic Chamber of Commerce Foundation, and was named one of twelve "Women Welding the Way 2016" by WELD, earning recognition from the U.S. Congress.

Natasha started her architectural studies while living in Belgium and continued her degree at the Catholic University of Cordoba, Argentina. She concluded her thesis studies in architecture and urban planning at the University of Venice, Italy, where she found her passion for marketing communication while working for a European firm. Being fluent in Spanish, English, French, and Italian has enabled her to reach out and connect with a diverse digital audience across the globe.

Presentations

Data-Driven Insights: The future of Multicultural Marketing 40-minute session

As America grows more diverse, minority groups are becoming the "super consumers": the fastest-growing segment, transforming the US mainstream and driving buying power. Using big data analytics to uncover demographic and psychographic insights is critical to understanding consumer behavior in the fastest-growing segment of the US consumer economy.

Jennifer Prendki is the head of data science at Atlassian, where she leads all search and machine learning initiatives and is in charge of leveraging the massive amount of data collected by the company to load the suite of Atlassian products with smart features. Jennifer has worked as a data scientist in many different industries. Previously, she was a senior data science manager on the search team at Walmart eCommerce. Jennifer enjoys addressing both technical and nontechnical audiences at conferences and sharing her knowledge and experience with aspiring data scientists. She holds a PhD in particle physics from UPMC-La Sorbonne.

Presentations

Executive Briefing: Agile for Data Science teams 40-minute session

The Agile Methodology has been widely successful for Software Engineering teams, but seems inappropriate for Data Science teams. This is because Data Science is part-engineering, part-research. In this talk, I will show how, with a minimum amount of tweaking, Data Science managers can adapt the techniques used in Agile and establish best practices to make their teams more efficient.

Gregory M. Quist, Ph.D.
President & CEO
Greg is the co-founder, President and CEO of SmartCover Systems, leading the strategic direction and operations of the Company. Greg is a long-time member of the water community, elected to the Rincon del Diablo MWD Board of Directors in 1990, where he has served for the past 27 years in various roles including President and Treasurer. Rincon’s Board appointed Greg to the San Diego County Water Authority Board in 1996 for 12 years where he led a coalition of seven agencies to achieve more than $1M/year in water delivery savings. He is currently the Chairman of the Urban Water Institute. With a background in the areas of metamaterials, numerical analysis, signal processing, pattern recognition, wireless communications, and system integration, Greg has worked as a technologist, manager and executive at Alcoa, McDonnell-Douglas, and SAIC and has founded and successfully spun off several high technology start-up companies, primarily in real-time detection and water technology. He holds 14 patents and has several pending. Greg received his undergraduate degree in astrophysics with a minor concentration in economics from Yale College where he played football and baseball and his Ph.D. in physics from the University of California, Santa Barbara. He has held top-level government clearances and currently resides in Escondido, CA. In his rare free time he enjoys fly fishing, hiking, golf, basketball, and tennis.

Presentations

Sewers can Talk – Understanding the Language of Sewers Data Case Studies

The first step in solving this crisis is knowing the extent and severity of the problem. Water levels in sewers have a signature, analogous to a human EKG. This signature can be analyzed in real time using pattern recognition techniques, revealing distressed pipelines and allowing users of this technology to take appropriate steps for maintenance and repair. Sewers can talk!
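
As a rough illustration of the signature-matching idea (assumed, not SmartCover's actual system), one can slide a window of water-level readings across a known distress template using normalized cross-correlation, much as one would match a waveform in an EKG:

```python
import numpy as np

def normalized_xcorr(window: np.ndarray, template: np.ndarray) -> float:
    """Correlation in [-1, 1] between a signal window and a template."""
    w = (window - window.mean()) / (window.std() + 1e-9)
    t = (template - template.mean()) / (template.std() + 1e-9)
    return float(np.dot(w, t) / len(t))

# Hypothetical template: a steady rise in water level typical of a blockage.
template = np.linspace(0.0, 1.0, 48)

def scan(levels: np.ndarray, threshold: float = 0.8):
    """Yield start indices where the live signal resembles the template."""
    n = len(template)
    for i in range(len(levels) - n + 1):
        if normalized_xcorr(levels[i:i + n], template) > threshold:
            yield i

levels = np.concatenate([0.1 * np.random.rand(200), template])
print(list(scan(levels)))  # indices at or near 200
```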

Syed Rafice is a principal system engineer at Cloudera, where he specializes in big data on Hadoop technologies and is responsible for designing, building, developing, and assuring a number of enterprise-level big data platforms using the Cloudera distribution. Syed also focuses on both platform and cybersecurity. He has worked across multiple sectors, including government, telecoms, media, utilities, financial services, and transport.

Presentations

Getting ready for GDPR: securing and governing hybrid, cloud and on-prem big data deployments Tutorial

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.

Greg Rahn is a director of product management at Cloudera, where he is responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over his 20-year career, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently, product management, to provide a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Optimizing Apache Impala for a cloud-based data warehouse 40-minute session

Cloud object stores are becoming the bedrock of a cloud data warehouse for modern data-driven enterprises. Given today's data sizes, it's become a necessity for data teams to have the ability to directly query data stored in S3 or ADLS. In this talk, we'll discuss the optimal end-to-end workflows and technical considerations of using Apache Impala over object stores for your cloud data warehouse.
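
As a sketch of this pattern (hypothetical host, bucket, and table names), an external Impala table can point directly at Parquet data in S3 via the impyla client, with no load step required:

```python
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# An external table maps the object-store path; the data stays in S3.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_events (
        event_id BIGINT,
        amount   DOUBLE,
        ts       TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3a://example-bucket/warehouse/sales_events/'
""")

cur.execute("SELECT COUNT(*) FROM sales_events")
print(cur.fetchall())
```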

Mala Ramakrishnan heads product initiatives for Cloudera Altus, a big data platform as a service. She has 17+ years of experience in product management, marketing, and software development at organizations of varied sizes delivering middleware, software security, network optimization, and mobile computing. She holds a master's degree in computer science from Stanford University.

Presentations

Comparative Analysis of the Fundamentals of AWS and Azure 40-minute session

The largest infrastructure paradigm change of the 21st Century is the shift to the cloud. Companies are faced with the difficult and daunting decision of which cloud to go with. This decision is not just financial and in many cases rests on the underlying infrastructure. In this talk we use our experience from building production services on AWS and Azure to compare their strengths and weaknesses.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Designing Modern Streaming Data Applications Tutorial

In this tutorial, we will walk the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline: messaging frameworks, stream computing frameworks, and storage frameworks for real-time data. We will also walk through case studies from IoT, gaming, and healthcare and share our experiences operating these systems at internet scale.

High Performance Messaging with Apache Pulsar 40-minute session

Apache Pulsar, a messaging system, is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it is very important to ensure that the system can make use of all the available resources. This talk provides insight into the design decisions and implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.
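
As a small illustration using the official pulsar-client Python library (assuming a broker at the default local address), producing and consuming a single message looks like this:

```python
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Subscribe first so the new subscription sees the message published below.
consumer = client.subscribe("telemetry-topic", subscription_name="analytics")

# Pulsar acknowledges a send only after the message is durably written,
# which underpins the durability guarantees discussed in the talk.
producer = client.create_producer("telemetry-topic")
producer.send(b"sensor-reading-42")

msg = consumer.receive()
print(msg.data())          # b'sensor-reading-42'
consumer.acknowledge(msg)  # advance the subscription cursor server-side

client.close()
```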

Suyash Ramineni is a presales engineer at Cloudera, where he focuses on helping customers solve business problems at scale. He is part of Cloudera's field IoT and manufacturing specialization team and is usually occupied guiding platform teams in managing and deploying data analytics and data science workloads on large clusters. Previously, Suyash worked as a software engineer at Intel and a few startups, focusing on data-driven approaches to solving problems. He has eight years of experience working with customers and partners to solve business problems.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Delip Rao is the founder of Joostware AI Research Corp., which specializes in consulting and building IP in natural language processing and deep learning. Delip is a well-cited researcher in natural language processing and machine learning and has worked at Google Research, Twitter, and Amazon (Echo) on various NLP problems. He is interested in building cost-effective, state-of-the-art AI solutions that scale well. Delip has an upcoming book on NLP and deep learning from O’Reilly.

Presentations

Machine Learning with PyTorch 1-Day Training

Explore machine learning and deep learning with PyTorch, and learn how to build effective models for real-world data.
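
As a taste of what the training covers (a toy model and synthetic data, not the course material itself), a minimal PyTorch training loop looks like this:

```python
import torch
import torch.nn as nn

# Toy regression data: y = 3x + noise.
X = torch.randn(256, 1)
y = 3 * X + 0.1 * torch.randn(256, 1)

model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()          # clear accumulated gradients
    loss = loss_fn(model(X), y)    # forward pass
    loss.backward()                # backpropagate
    optimizer.step()               # update weights

print(f"final loss: {loss.item():.4f}")
```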

Jun Rao is the cofounder of Confluent, a company that provides a streaming data platform on top of Apache Kafka. Previously, Jun was a senior staff engineer at LinkedIn, where he led the development of Kafka, and a researcher at IBM’s Almaden research data center, where he conducted research on database and distributed systems. Jun is the PMC chair of Apache Kafka and a committer of Apache Cassandra.

Presentations

A deep dive into Kafka controller 40-minute session

The controller is the brain of Apache Kafka and is responsible for maintaining the consistency of the replicas. We will first describe the main data flow in the controller, then describe some of the recent improvements to the controller that handle certain edge cases correctly and allow for more partitions in a Kafka cluster.

Radhika Ravirala is a solutions architect at Amazon Web Services, where she helps customers craft distributed, robust cloud applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. Radhika enjoys spending time with her family, walking her dog, doing Warrior X-Fit, and playing an occasional hand at Smash Bros.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Anthony Reid is the senior manager of analytics at Komatsu and an automation technology specialist with broad-ranging experience in machine perception and control, distributed systems, and data analytics. Anthony is interested in integrating automation and perception technologies to enhance interconnectivity and intelligence in machines and currently works to drive analytics and intelligence for IoT and connected mining equipment at Komatsu.

Prior to working at Komatsu, Anthony was an engineering lead at the University of Queensland, developing technologies to improve the safety and efficiency of mining operations. He was also responsible for commissioning and testing the P&H payload system on two shovels in Australia and the US and for the various programming tasks needed to complete this work.

Anthony holds a PhD in Mechanical Engineering from the University of Queensland and currently resides in Milwaukee, Wisconsin.

Presentations

How Komatsu is Improving Mining Efficiencies using IoT and Machine Learning 40-minute session

Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment in order to improve mine performance and efficiency. Join Shawn Terry and Anthony Reid to learn more about their data journey and how they are using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment.

The system featured in this presentation is the invention of LaVonne Reimer, a lawyer-turned-entrepreneur with decades of experience building digital platforms for markets with identity and data privacy sensitivities. LaVonne was Founder and CEO of Cenquest, a venture-backed startup that provided the technology backbone for graduate schools such as NYU Stern School, London School of Economics, and UT Austin to offer branded degree programs online. More recently, she led a program to foster entrepreneurship in open source together with Open Source Development Labs (Linux), IBM, and Intel. The Open Authorization Protocol, initiated by members of this community, inspired her to begin work on governance and trust assurance for free-flowing data.

Presentations

Balancing stakeholder interests in personal data governance technology 40-minute session

GDPR asks us to rethink personal data systems, viewing UI/UX, consent management, and value-added data services through the eyes of the subjects of the data. The opportunity in the $150B credit and risk industry is to deploy data governance technologies that balance individuals' interest in controlling their own data with requirements for trusted data.

Randy Ridgley is a solutions architect on the Amazon Web Services public sector team. Previously, Randy was the principal application architect for Walt Disney World's MagicBand platform in Orlando, improving guest experience and cast coordination by building big data solutions based on AWS services. He has over 15 years of experience building real-time streaming and big data analytics applications in media and entertainment, casino gaming, and publishing.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Steve Ross is the director of product management at Cloudera, where he focuses on security across the big data ecosystem, balancing the interests of citizens, data scientists, and IT teams working to get the most out of their data while preserving privacy and complying with the demands of information security and regulations. Previously, at RSA Security and Voltage Security, Steve managed product portfolios now in use by the largest global companies and hundreds of millions of users.

Presentations

Executive Briefing: Getting Your Data Ready for Heavy EU Privacy Regulations (GDPR) 40-minute session

The General Data Protection Regulation (GDPR) goes into effect in May 2018 for firms doing any business in the EU. However, many companies aren't prepared for its strict regulations or the fines for noncompliance (up to €20 million or 4% of global annual revenue). This session explores the capabilities your data environment needs in order to simplify GDPR compliance, as well as compliance with future regulations.

Nikki Rouda is the cloud and core platform director at Cloudera. Nik has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their IT challenges. His career spans big data, analytics, machine learning, AI, storage, networking, security, and the IoT. Nik holds an MBA from Cambridge and an ScB in geophysics and math from Brown.

Presentations

DIY vs. designer approaches to deploying data center infrastructure for machine learning and analytics 40-minute session

How to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble.

Nuria Ruiz (@pantojacoder)
Nuria began working on the Wikimedia Foundation analytics team in December 2013. Before becoming part of the awesome project that Wikipedia is, she spent time working on JavaScript, performance, mobile apps, and web frameworks in the retail and social space. Most of her experience deploying large applications comes from the seven years she worked at Amazon.com. A physicist by training, she started writing software in a physical oceanography lab in Seattle, a long time ago, when big data was just called "science."

Presentations

Data and Privacy at Scale at Wikipedia 40-minute session

The Wikipedia community feels strongly that you shouldn't have to provide personal information to participate in the free knowledge movement. In this talk we will go into the challenges that this strong privacy stance poses for the Wikimedia Foundation, including how it affects data collection, and some creative workarounds that allow WMF to calculate metrics in a privacy-conscious way.

Patty Ryan is an applied data scientist at Microsoft. She codes with Microsoft's partners and customers to tackle tough problems using machine learning approaches with sensor, text, and vision data. She's a graduate of the University of Michigan. On Twitter: @singingdata.

Presentations

When Tiramisu Meets Online Fashion Retail 40-minute session

Large online fashion retailers face the problem of efficiently maintaining a catalogue of millions of items. Due to human error, it is not unusual for some items to have duplicate entries. Trawling through such a large catalogue manually is near impossible. How would you prevent such errors? Find out how we applied deep learning as part of the solution.
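
One plausible approach, sketched below with random stand-in vectors (the talk's actual method may differ), is to represent each product image as a deep-learning embedding and flag pairs whose cosine similarity is close to 1 as likely duplicates:

```python
import numpy as np

# Stand-in embeddings: in a real system these would come from a pretrained
# CNN applied to each product image.
rng = np.random.default_rng(1)
vectors = {f"item-{i}": rng.normal(size=128) for i in range(100)}
vectors["item-99"] = vectors["item-0"] + 0.01 * rng.normal(size=128)  # near-duplicate

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ids = list(vectors)
pairs = [
    (ids[i], ids[j])
    for i in range(len(ids))
    for j in range(i + 1, len(ids))
    if cosine(vectors[ids[i]], vectors[ids[j]]) > 0.95
]
print(pairs)  # expect [('item-0', 'item-99')]
```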

Anand is a co-founder of Gramener, a data science company. He leads a team of data enthusiasts who tell visual stories of insights from analysis. These are built on the Gramener Visualisation Server.

He studied at IIT Madras, IIM Bangalore and LBS, and worked at IBM, Infosys, Lehman Brothers and BCG.

Profile: https://www.linkedin.com/in/sanand0/

Presentations

Mapping India Data Case Studies

Answering simple questions about India's geography can be a nightmare. What is the boundary of a postal code? Or a census block? Or even a constituency? The official answer resides in a set of manually drawn PDFs. But an active group of volunteers is crafting open maps, and their coverage and quality are such that they may enable the largest census exercise in the world in 2020.

Cloudera Systems Engineer

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data and the workloads themselves can be transient or long-running in nature. One of the challenges we will explore is keeping the data context consistent across these various workloads.

Neelesh Srinivas Salian is a Software Engineer on the Data Platform team at Stitch Fix, where he works on the compute infrastructure used by data scientists. He helps build services that are part of Stitch Fix’s Data Warehouse ecosystem.

Presentations

Tracking Data Lineage at Stitch Fix 40-minute session

This talk explains how we at Stitch Fix built a service to better understand the movement and evolution of data within the data warehouse, from initial ingestion from outside sources through all of our ETLs. We talk about why we built the service, how we built it, and the use cases that benefit from it.

Shioulin Sam is a research engineer at Cloudera Fast Forward Labs. In her previous life, she was an angel investor focusing on women-led startups. She also worked in the investment management industry, designing quantitative trading strategies. She holds a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology.

Presentations

Semantic Recommendations 40-minute session

Recent advances in deep learning allow us to use the semantic content of items in recommendation systems, addressing a weakness of traditional methods. In this talk we explore limitations of classical approaches and look at how using the content of items can help solve common recommendation pitfalls such as the cold start problem, and open up new product possibilities.
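
As a sketch of the content-based idea (toy items, with random stand-in vectors in place of real description embeddings from a language model), items are compared by the semantics of their content rather than by interaction history, so a brand-new item can be recommended immediately, which is exactly where collaborative filtering's cold-start problem bites:

```python
import numpy as np

items = ["wool winter coat", "down parka", "linen summer dress", "beach sandals"]
rng = np.random.default_rng(7)

# Stand-in for semantic embeddings of the item descriptions, L2-normalized
# so a dot product equals cosine similarity.
embeddings = rng.normal(size=(len(items), 64))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def recommend(query_idx: int, k: int = 2) -> list:
    """Return the k items whose embeddings are closest to the query item."""
    sims = embeddings @ embeddings[query_idx]
    sims[query_idx] = -np.inf  # exclude the item itself
    return [items[i] for i in np.argsort(-sims)[:k]]

print(recommend(0))  # with real embeddings, expect coats before sandals
```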

Osman Sarood received his PhD in high-performance computing from the computer science department at the University of Illinois Urbana-Champaign in December 2013, where he focused on load balancing and fault tolerance. Dr. Sarood has published more than 20 research papers in highly rated journals, conferences, and workshops; has presented his research at several academic conferences; and has over 400 citations along with an i10-index and h-index of 12. He worked at Yelp from 2014 to 2016 as a software engineer, where he prototyped, architected, and implemented several key production systems that have been presented at various high-profile conferences. He presented his work, Seagull, at AWS re:Invent 2015 and architected and authored Yelp's autoscaled spot infrastructure, fleet_miser, which was presented at AWS re:Invent 2016. Dr. Sarood started working at Mist in 2016 and leads the infrastructure team, helping Mist scale the Mist Cloud in a cost-effective and reliable manner.

Presentations

How to Cost Effectively and Reliably Build Infrastructure for Machine Learning 40-minute session

Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, we saw 10x infrastructure growth. Learn how we are reliably running 75% of our production infrastructure on AWS EC2 spot instances, which has kept our annual AWS cost at $1 million instead of $3 million (a 66% reduction).

Toru Sasaki is a system infrastructure engineer and leads the OSS professional services team at NTT DATA Corporation. He is interested in open source distributed computing systems such as Apache Hadoop, Apache Spark, and Apache Kafka and has designed and developed many clusters utilizing these products to solve his customers' problems. He is a coauthor of a well-known Apache Spark book written in Japanese.

Presentations

Best Practices to Develop an Enterprise Datahub to Collect and Analyze 1TB/day Data from a Lot of Services with Apache Kafka and Google Cloud Platform in Production 40-minute session

Recruit Group and NTT DATA Corporation developed a platform based on a "datahub" utilizing Apache Kafka. The platform must handle around 1 TB/day of application logs generated by a large number of services in Recruit Group. This session explains some of the best practices and know-how learned during the project, such as schema evolution and network architecture.

Eric has worked in the data space for the past 10 years, starting with call center performance analytics at Merced Systems. He currently works at Uber with its large volume of geospatial data, helping people move in countries around the world.

Presentations

Marmaray – A generic, scalable, and pluggable Hadoop data ingestion & dispersal framework 40-minute session

Marmaray is a generic Hadoop ingestion and dispersal framework recently released to production at Uber. We will introduce Marmaray's main features and the business needs it meets, share how Marmaray can support a team's data needs by ensuring data can be reliably ingested into Hive or dispersed into online data stores, and give a deep dive into the architecture to show you how it all works.

Friederike Schüür is a research engineer at Cloudera Fast Forward Labs, where she imagines what applied machine learning in industry will look like in two years' time, a time horizon that fosters ambition and yet provides grounding. She dives into new machine learning capabilities and builds fully functioning prototypes that showcase state-of-the-art technology applied to real use cases. She advises clients on how to make use of new machine learning capabilities, from strategy advising to hands-on collaboration with in-house technical teams. She earned a PhD in cognitive neuroscience from University College London and is a longtime data science for social good volunteer with DataKind.

Presentations

From Strategy to Implementation — Putting Data to Work at USA for UNHCR 40-minute session

The Hive and Cloudera Fast Forward Labs share how they transformed USA for UNHCR (UN Refugee Agency) to use data science and machine learning (DS/ML) to address the refugee crisis. From identifying use cases and success metrics to showcasing the value of DS/ML, we cover the development and implementation of a DS/ML strategy hoping to inspire other organizations looking to derive value from data.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. He is passionate about building combined big data and blockchain solutions. Over his career, Jim has held positions running operations, engineering, architecture, and QA teams in the financial services, regulatory, digital advertising, IoT, manufacturing, healthcare, chemicals and geographical management systems industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).

Presentations

Using Blockchain in the Enterprise 40-minute session

Relevant use cases for blockchain-based solutions across a variety of industries will be explained. A focus will be given to the suggested architecture for achieving high-transaction-rate private blockchains, in addition to building decentralized applications backed by a blockchain. Comparisons between public and private blockchain architectures will also be discussed.
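
As an illustration of the data structure underlying any blockchain, public or private (a concept sketch only; real systems add consensus, signatures, and networking), blocks are chained by hashes, so tampering with a past transaction invalidates every later block:

```python
import hashlib
import json
import time

def make_block(transactions: list, prev_hash: str) -> dict:
    """Build a block whose hash covers its contents and its predecessor's hash."""
    block = {
        "timestamp": time.time(),
        "transactions": transactions,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

genesis = make_block([], prev_hash="0" * 64)
block1 = make_block([{"from": "a", "to": "b", "amount": 10}], genesis["hash"])

def verify(chain: list) -> bool:
    """Recompute each block's hash and check it points at its predecessor."""
    for prev, block in zip(chain, chain[1:]):
        body = {k: block[k] for k in ("timestamp", "transactions", "prev_hash")}
        payload = json.dumps(body, sort_keys=True).encode()
        if block["prev_hash"] != prev["hash"]:
            return False
        if block["hash"] != hashlib.sha256(payload).hexdigest():
            return False
    return True

print(verify([genesis, block1]))  # True; edit any transaction and it fails
```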

I am a Solutions Architect supporting AWS partners in the Big Data space.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.

Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the Internet of Things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.

Executive Briefing: Managing successful data projects - technology selection and team building 40-minute session

Creating a successful big data practice in your organization presents new challenges in managing projects and teams. In this session we'll provide guidance and best practices to help technical leaders deliver successful projects from planning to implementation.

Dave Shuman is the Industry lead for IoT & manufacturing at Cloudera. Dave has an extensive background in big data analytics, business intelligence applications, database architecture, logical and physical database design, and data warehousing. Previously, Dave held a number of roles at Vision Chain, a leading demand signal repository provider enabling retailer and manufacturer collaboration, including chief operations officer, vice president of field operations responsible for customer success and user adoption, vice president of product responsible for product strategy and messaging, and director of services. He also served at such top CG companies as Kraft Foods, PepsiCo, and General Mills, where he was responsible for implementations; was vice president of operations for enews, an e-commerce company acquired by Barnes and Noble; was executive vice president of management information systems, where he managed software development, operations, and retail analytics; and developed e-commerce applications and business processes used by Barnesandnoble.com, Yahoo, and Excite, and pioneered an innovative process for affiliate commerce. He holds an MBA with a concentration in information systems from Temple University and a BA from Earlham College.

Presentations

Using Machine Learning to Drive Intelligence at the Edge 40-minute session

The focus of IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing those learnings back out to the edge. Join Cloudera and Red Hat as they showcase how they executed this architecture at one of the world's leading manufacturers in Europe, including a demo highlighting the architecture.

Kamil Sindi is a principal engineer working on productionizing machine learning algorithms and scaling distributed systems. He received his bachelor's degree in mathematics with computer science from the Massachusetts Institute of Technology.

Presentations

Building Turn-key Recommendations for 5% of Internet Video 40-minute session

Building a video recommendation model that serves millions of monthly visitors is a challenge in itself. At JW Player, we face the challenge of providing on-demand recommendations as a service to thousands of media publishers. We focus on how to systematically improve model performance while navigating the many engineering challenges and unique needs of the diverse publishers we serve.

Vartika Singh is a field data science architect at Cloudera. Previously, Vartika was a data scientist applying machine learning algorithms to real-world use cases, ranging from clickstream to image processing. She has 12 years of experience designing and developing solutions and frameworks utilizing machine learning techniques.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Jason ‘Jay’ Smith is a cloud customer engineer at Google. He spends his days helping enterprises find ways to expand their workload capabilities on Google Cloud. He is currently on the Kubeflow go-to-market team, helping people containerize machine learning to improve portability and scalability. He has been building container-based solutions since the early days of Docker and enjoys finding new ways to implement the technology.

Presentations

From Training to Serving: Deploying Tensorflow Models with Kubernetes Tutorial

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Join Ron Bodkin and Brian Foo to learn how to bring deep learning models from training to serving in a cloud production environment. You will learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
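
As a sketch of the export step (TensorFlow 2.x API, with a hypothetical model and paths), a Keras model is written in the SavedModel layout that TensorFlow Serving loads, with the version number as the final directory:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# TensorFlow Serving watches /models/<name>/ and serves the highest version.
tf.saved_model.save(model, "/tmp/models/demo/1")

# A typical way to serve it with Docker (run outside Python):
#   docker run -p 8501:8501 \
#     -v /tmp/models/demo:/models/demo \
#     -e MODEL_NAME=demo tensorflow/serving
```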

Tim Spann has over a decade of experience with IoT, Big Data, Distributed Computing, Streaming technologies and Java programming. He has a BS and MS in Computer Science. He has been a Senior Solutions Architect at AirisData working with Spark and Machine Learning. Before that he was a Senior Field Engineer for Pivotal. He blogs for DZone where he is the Big Data Zone Leader. He runs a popular meetup in Princeton on Big Data, IoT, Deep Learning, Streaming, NiFi, Blockchain and Spark. He currently is a Solutions Engineer II at Hortonworks working with Apache Spark, Big Data, IoT, Machine Learning and Deep Learning. He is speaking at http://iotfusion.net/ and Data Works Summit Berlin this year. He has spoken at DataWorks Summit Sydney and Oracle Code NYC last year.

https://dzone.com/refcardz/introduction-to-tensorflow
http://www.meetup.com/futureofdata-princeton/
https://community.hortonworks.com/users/9304/tspann.html
https://dzone.com/users/297029/bunkertor.html
https://github.com/tspannhw

Presentations

IoT Edge Processing with Apache NiFi and MiniFi and Multiple Deep Learning Libraries 40-minute session

A hands-on deep dive into using Apache MiniFi with Apache MXNet and other deep learning libraries on edge devices.

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it 40-minute session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
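
As a small taste of the spaCy side (assuming the en_core_web_sm model has been downloaded), a single pipeline call yields tokens, part-of-speech tags, dependencies, and named entities:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is hiring data scientists in New York for $150,000.")

# Token-level annotations: text, part of speech, dependency relation.
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities recognized by the pretrained model.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Apple" ORG, "New York" GPE
```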

Spark NLP in Action: How SelectData Uses AI to Better Understand Home Health Patients 40-minute session

This case study describes a question answering system for accurately extracting facts from free-text patient records. The solution is based on Spark NLP, an open source extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding. We'll share best practices for training the domain-specific deep learning NLP models that such problems usually require.

Wangda Tan is a Project Management Committee (PMC) member of Apache Hadoop and a staff software engineer at Hortonworks. His major focus is Hadoop YARN GPU isolation and the resource scheduler, where he has contributed features such as node labeling, resource preemption, and container resizing. Before joining Hortonworks, he worked at Pivotal on integrating OpenMPI/GraphLab with Hadoop YARN. Before that, he worked at Alibaba Cloud Computing, where he helped create a large-scale machine learning, matrix, and statistics computation platform using MapReduce and MPI.

Presentations

Deep learning on YARN - Running distributed Tensorflow / MXNet / Caffe / XGBoost on Hadoop clusters 40-minute session

Applications such as TensorFlow, MXNet, Caffe, and XGBoost can be leveraged to train deep learning/machine learning models, and we introduced new features in Apache Hadoop 3.x, such as GPU isolation and Docker support, to better support these workloads. In this talk we will take a closer look at these improvements and show, with demos, how to run these applications on YARN.

Elena Terenzi is a software development engineer at Microsoft, where she brings business intelligence solutions to Microsoft Enterprise customers and advocates for business analytics and big data solutions for the manufacturing sector in Western Europe, such as helping big automotive customers implement telemetry analytics solutions with IoT flavor in their enterprises. She started her career with data as a database administrator and data analyst for an investment bank in Italy. Elena holds a master’s degree in AI and NLP from the University of Illinois at Chicago.

Presentations

When Tiramisu Meets Online Fashion Retail 40-minute session

Large online fashion retailers face the problem of efficiently maintaining a catalogue of millions of items. Due to human error, it is not unusual for some items to have duplicate entries. Trawling through such a large catalogue manually is near impossible. How would you prevent such errors? Find out how we applied deep learning as part of the solution.

Shawn Terry (Edmonton, AB, Canada) is the lead systems architect for Komatsu Mining's analytics platform. With more than 20 years of experience as a software developer, consultant, and architect, Shawn has spent the last 10 years working to design, develop, deploy, and evolve Komatsu Mining's data analytics platform. In 2016 he helped lead an effort to transform fragile, fragmented legacy systems into a truly scalable distributed solution built on open source, centered around Cloudera CDH, and deployed in Microsoft Azure. The project was completed in under a year with a handful of developers and no increase in budget, and it enables Komatsu Mining to successfully partner with customers in solving mining's toughest challenges with smart, data-driven solutions.

Presentations

How Komatsu is Improving Mining Efficiencies using IoT and Machine Learning 40-minute session

Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment in order to improve mine performance and efficiency. Join Shawn Terry and Anthony Reid to learn more about their data journey and how they are using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment.

Dr. Theresa Johnson is a product manager for metrics and forecasting products at Airbnb. As a data scientist, she worked on the task force and cross-functional hackathon team at Airbnb that developed the framework for the company's current antidiscrimination efforts. Theresa joined Airbnb after earning a PhD in aeronautics and astronautics from Stanford University.

She is a founding board member of Street Code Academy, a non-profit dedicated to high touch technical training for inner city youth, and has been featured in TechCrunch for her commitment to helping early-stage founders raise capital. Her lifelong fascination with the capacity for technology to change lives led her to Stanford University, where she earned dual undergraduate degrees in Science, Technology and Society and Computer Science. Theresa is passionate about extending technology access for everyone and finding mission driven companies that can have an outsized impact on the world.

Presentations

Revenue Forecasting Platform at Airbnb Findata

How Airbnb builds its next-generation, end-to-end revenue forecasting platform, leveraging machine learning, Bayesian inference, TensorFlow, Hadoop, and web technology.

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9, and has worked with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with spaCy and Spark NLP Tutorial

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, with Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

I succeed when assisting clients in developing solutions that have measurable business impact. With 25 years of field experience, I have developed solutions for multiple vertical industries, including banking and financial services, retail, life sciences, and others.

Presentations

If You Thought Politics Was Dirty, You Should See the Analytics Behind It 40-minute session

Forget about the fake news: data and analytics are what drive elections in politics. While proposing analytical solutions to the RNC and DNC, I faced ethical dilemmas. Not only did I help causes I disagreed with, but I also armed politicians with "real-time" data to manipulate voters. Politics is a business, and today's modern data infrastructure optimizes campaign funds more effectively than ever.

Yaroslav Tkachenko is a software engineer interested in distributed systems, microservices, functional programming, modern cloud infrastructure and DevOps practices. Currently Yaroslav is a Senior Data Engineer at Activision, working on a large-scale data pipeline.

Prior to joining Activision, Yaroslav held various leadership roles in multiple startups, where he was responsible for designing, developing, delivering, and maintaining platform services and cloud infrastructure for mission-critical systems.

Presentations

Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned 40-minute session

What can be easier than building a data pipeline? You add a few Apache Kafka clusters, some way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse... wait, it does start to look like A LOT of things, doesn't it? Join this talk to learn about the best practices we've been using for all the above.
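
As a sketch of one early stage of such a pipeline (hypothetical topic and broker), the kafka-python client can ingest telemetry events as JSON, keyed by player so each player's events stay ordered on one partition:

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for full in-sync replication before acknowledging
)

event = {"player_id": "p-123", "type": "match_end", "score": 4200}

# Keying by player_id routes all of a player's events to one partition,
# preserving per-player ordering downstream.
producer.send("telemetry-events", key=event["player_id"], value=event)
producer.flush()
```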

Steven Totman is the financial services industry lead for Cloudera’s Field Technology Office, where he helps companies monetize their big data assets using Cloudera’s Enterprise Data Hub. Prior to Cloudera, Steve ran strategy for a mainframe-to-Hadoop company and drove product strategy at IBM for DataStage and Information Server after joining with the Ascential acquisition. He architected IBM’s Infosphere product suite and led the design and creation of governance and metadata products like Business Glossary and Metadata Workbench. Steve holds several patents for data-integration and governance/metadata-related designs.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

As VP of Strategy, Jane works directly with clients to set the direction for the Unqork platform in both user experience and functionality.

She has been helping leaders in financial services assess and implement new business strategies since the start of her career, working on internal strategy teams for C-suites at JPMorgan Chase, Marsh, and MetLife. Most recently, she advised a portfolio of startups for Techstars Connection in partnership with AB InBev.

She holds a B.A. in economics and policy studies from Syracuse University.

Presentations

The balancing act: Building business relevant data solutions for the front line Findata

Data's role in financial services has been elevated. However, rollouts of data solutions often fail when an organization's existing culture is misaligned with its capabilities. With Unqork, we're increasing adoption by honoring existing capabilities. This discussion will explore methods to finally implement data solutions through both qualitative and quantitative discoveries.

Michelle Ufford leads the Data Platform Architecture Core team at Netflix, which focuses on platform innovation and usability. Previously, she led the data management team at GoDaddy, where she built data engineering solutions for personalization and helped pioneer Hadoop data warehousing techniques. Michelle is a published author, patented developer, award-winning open source contributor, and Most Valuable Professional (MVP) for Microsoft Data Platform. You can find her on Twitter at @MichelleUfford.

Presentations

Data @ Netflix: See What’s Next 40-minute session

In this talk, Michelle Ufford will share some cool things Netflix is doing with data and the big bets we’re making on data infrastructure. Topics will include workflow orchestration, machine learning, interactive notebooks, centralized alerting, event-based processing, and platform intelligence.

Sandeep Uttamchandani is a Distinguished Engineer at Intuit, focussing on platforms for storage, databases, analytics, and machine learning. Prior to Intuit, Sandeep was co-founder and CEO of a machine learning startup focussed on finding security vulnerabilities in Cloud Native deployment stacks. Sandeep has nearly two decades of experience in storage and data platforms, and has held various technical leadership roles at VMware and IBM. Over his career, Sandeep has contributed to multiple enterprise products, and holds 35+ issued patents, 20+ conference and journal publications, and regularly blogs on All-things-Enterprise-Data. He has a Ph.D. from University of Illinois at Urbana-Champaign.

Presentations

Circuit-breakers to safeguard against garbage in, garbage out 40-minute session

Do your analysts always trust the insights generated by your data platform? Ensuring insights are always reliable is critical for use cases in the financial sector. Similar to the circuit-breaker design pattern used in service architectures, this talk describes a circuit-breaker pattern we developed for data pipelines: we detect and correct problems to ensure that insights are always reliable.
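
As a sketch of the pattern (hypothetical checks and pipeline stage, not Intuit's implementation), a breaker runs quality checks on each incoming batch; when checks fail repeatedly it opens, and downstream consumers keep serving the last known-good dataset instead of ingesting garbage. Real checks would cover row counts, null rates, and distribution drift:

```python
from dataclasses import dataclass

@dataclass
class DataCircuitBreaker:
    failure_threshold: int = 3
    failures: int = 0
    is_open: bool = False
    last_good: object = None

    def submit(self, batch, checks):
        """Run quality checks; publish the batch only if all of them pass."""
        if all(check(batch) for check in checks):
            self.failures, self.is_open = 0, False
            self.last_good = batch   # publish: this becomes the served dataset
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.is_open = True  # trip the breaker: stop ingesting, alert
        return self.last_good        # consumers always see known-good data

def non_empty(batch):
    return len(batch) > 0

breaker = DataCircuitBreaker()
print(breaker.submit([1, 2, 3], [non_empty]))  # [1, 2, 3]
print(breaker.submit([], [non_empty]))         # still [1, 2, 3]; bad batch rejected
```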

Preeti Vaidya is a data scientist/big data engineer at Viacom's Data Strategy HQ. She leads analytic process simplification and applies novel computing technologies to data product development.

She holds an MS in computer science with a focus on machine learning from Columbia SEAS ('15), and her research interests include big data, machine learning, graph analytics, parallel and distributed computing systems, and image processing. Her work has been published in IEEE Xplore as well as other technical journals, and she has been invited to present her work at technical conferences and hackathons.

Presentations

Agility to Data Product Development: Plug and Play Data Architecture 40-minute session

Data products, distinct from data-driven products, are finding their own place in organizational data-driven decision making. Shifting the focus to "data" opens up new opportunities. This presentation, with case studies, dives deeper into a layered implementation architecture and provides intuitive learnings and solutions that allow for more agile, reusable data modules for a data product team.

Vinithra Varadharajan is an engineering manager in the cloud organization at Cloudera, where she is responsible for products such as Cloudera Director and Cloudera’s usage-based billing service. Previously, Vinithra was a software engineer at Cloudera, working on Cloudera Director and Cloudera Manager with a focus on automating Hadoop lifecycle management.

Presentations

Running multidisciplinary big data workloads in the cloud Tutorial

Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data and the workloads themselves can be transient or long-running in nature. One of the challenges we will explore is keeping the data context consistent across these various workloads.

Abhi leads Applied Technology Innovation at Novartis and is responsible for experimenting, incubating and rapidly scaling digital platforms and services across the enterprise.

Over the years as a serial intrapreneur he has established platforms and services in Real World Evidence, Robotic & Cognitive Automation, Advanced Analytics, Data Platforms, Standards and wearables in clinical trials.

Abhi is passionate about and driven by harnessing the intersection of science, technology, data, and people to improve patients' lives. He has also been involved in deploying technology for public health projects in underdeveloped regions.

Passionate about cycling, comics, science fiction, and history, Abhi holds an MBA from the Indian School of Business and a master of science in pharmaceutical medicine, and is an engineer by training. He currently lives in Basel, Switzerland.

Presentations

Crossing the chasm: Case study of realizing a big data strategy Data Case Studies

A case study on how a transformational business opportunity was realized on the foundation of an integrated data, process, culture, organization, and technology strategy.

Tim began his 21-year career as an IT consultant with ICL (and then Fujitsu). He spent time working in Seattle for Microsoft, three years at the European Commission in Luxembourg, and three years at HP. He now works for BJSS as a hands-on Cognitive Architect.

Tim has always had a passion for systems integration and is always looking for clever and innovative ways to connect systems together.

After joining BJSS, Tim became Head of Mobile, where he showed his passion for the design, development, and delivery of quality mobile applications.

Since then, chatbots, artificial intelligence, and machine learning have sparked Tim's interest, and he has focused his attention on this exciting area. As a Cognitive Architect, he is designing complex, vendor-agnostic, multilingual, cloud-based chatbot solutions for a range of clients.

Tim has two grown-up daughters and lives in Newington Green in London. He enjoys choral singing and is currently renovating and extending his 1850s property.

Presentations

Using big data to unlock the delivery of personalized, multi-lingual real-time chat services for global financial service organizations 40-minute session

Financial services clients demand increased data-driven personalization, faster insight-based decisions, and multichannel, real-time access. BJSS discusses how organizations can deliver real-time, vendor-agnostic, personalized chat services, covering the issues around security, privacy, legal sign-off, and data compliance, as well as how the Internet of Things can be used as a delivery platform.

Dean Wampler, Ph.D., is the VP of Fast Data Engineering at Lightbend. He leads the development of Lightbend Fast Data Platform, a distribution of scalable, distributed stream processing tools including Spark, Flink, Kafka, and Akka, with machine learning and management tools. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly Media. He is a contributor to several open source projects, a frequent Strata speaker, and the co-organizer of several conferences around the world and several user groups in Chicago. Dean lurks on Twitter as @deanwampler.

Presentations

Executive Briefing: What You Need to Know About Fast Data 40-minute session

Streaming data systems, so-called "Fast Data," promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just "faster" versions of Big Data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. This talk tells you what you need to know to exploit Fast Data successfully.

Hands-on Kafka Streaming Microservices with Akka Streams and Kafka Streams Tutorial

This hands-on tutorial builds streaming apps as microservices using Kafka with Akka Streams and Kafka Streams. We'll assess the strengths and weaknesses of each tool for particular needs, so you'll be better informed when choosing between them. We'll also contrast them with Spark Streaming and Flink, including when to choose those instead. The sample apps demonstrate ML model-serving ideas.
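
The tutorial's code uses Akka Streams and Kafka Streams on the JVM; purely as a sketch of the model-serving pattern it demonstrates (score records from one Kafka topic, emit results to another), here is the same shape in Python with the kafka-python client. The topic names and scoring function are placeholders, not the tutorial's sample apps.

    # The stream-based model-serving pattern, sketched with kafka-python
    # rather than Akka Streams/Kafka Streams. Topics and score() are
    # placeholders.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    def score(features):
        # Stand-in for a real ML model's prediction.
        return sum(features.values()) > 1.0

    consumer = KafkaConsumer(
        "input-records",                      # hypothetical topic
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for msg in consumer:                      # one scored record out per record in
        record = msg.value
        record["prediction"] = score(record["features"])
        producer.send("scored-records", value=record)   # hypothetical topic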

Jason is a software engineer at Cloudera focusing on the cloud.

Presentations

Comparative Analysis of the Fundamentals of AWS and Azure 40-minute session

The largest infrastructure paradigm shift of the 21st century is the move to the cloud. Companies face the difficult and daunting decision of which cloud to go with, a decision that is not just financial and in many cases rests on the underlying infrastructure. In this talk we use our experience building production services on AWS and Azure to compare their strengths and weaknesses.

Jacob Ward is a science and technology correspondent for CNN, Al Jazeera, and PBS. The former editor-in-chief of Popular Science magazine, Ward writes for The New Yorker, Wired, and Men’s Health. His ten-episode Audible podcast, Complicated, discusses humanity’s most difficult problems, and he’s the host of an upcoming four-hour public television series, Hacking Your Mind, about human decision making and irrationality. Ward is developing a CNN original series about the unintended consequences of big ideas, and is a 2018-2019 Berggruen Fellow at Stanford University’s Center for Advanced Study in the Behavioral Sciences, where he’s writing a book, due for publication by Hachette Books in 2020, about how artificial intelligence will amplify good and bad human instincts.

Presentations

How AI Will Amplify the Best and Worst of Humanity Keynote

For most of us, our own mind is a black box: an all-powerful and utterly mysterious device that runs our lives for us. Not only do we humans barely understand how it works; science is now revealing that it makes most of our decisions for us, using rules and shortcuts of which we aren't even aware.

I have a research fellowship in physics at Harvard University, studying quantum metrology and quantum computing. My PhD research, in the field of quantum computing, was published in Nature and covered in the New York Times. I moved into artificial intelligence as the CEO of ASI Data Science because I believe it is the most exciting, and important, field of our time. ASI's thesis is that the way to build the most valuable company, and add the most value to humanity, is to bring artificial intelligence to the real world: to schools, governments, businesses, and hospitals. We intend to pursue this mission for many years.

Presentations

Predicting residential occupancy and hot water usage from high frequency, multi-vector utilities data 40-minute session

Future home energy management systems could improve their energy efficiency by predicting residents' needs from utilities data. This session discusses the opportunity, with a particular focus on the key data features, the need for data compression, and the data quality challenges.

Rachel Warren is a software engineer and data scientist for Salesforce Einstein, where she is working on scaling and productionizing auto ML on Spark. Previously, Rachel was a machine learning engineer for Alpine Data, where she helped build a Spark auto-tuner to automatically configure Spark applications in new environments. A Spark enthusiast, she is the coauthor of High Performance Spark. Rachel is a climber, frisbee player, cyclist, and adventurer. Last year, she and her partner completed a thousand-mile off-road unassisted bicycle tour of Patagonia.

Presentations

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am 40-minute session

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning), or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.
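
The "magic numbers" in question are Spark configuration values. As a minimal PySpark illustration, these are the kinds of settings an auto-tuner would pick for you; the values shown are placeholders, not recommendations from the speakers.

    # The kind of "magic numbers" an auto-tuner chooses for you. The values
    # below are illustrative placeholders, not tuning advice from this talk.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("tuned-job")
        .config("spark.executor.memory", "8g")           # sized to the workload
        .config("spark.executor.cores", "4")
        .config("spark.sql.shuffle.partitions", "400")   # the 200 default is often wrong at scale
        .config("spark.memory.fraction", "0.6")          # execution/storage split
        .getOrCreate()
    )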

Katharina has a passion for travel and trying out new things that has taken her from living in four countries to overnighting in the Moroccan desert, hiking to Machu Picchu, volunteering in Kenya, and becoming a certified yoga teacher in Bali. As Head of Data Analytics and Performance Marketing at EveryMundo, Katharina works with airlines around the world on innovating based on actionable insights from analytics. She is responsible for defining data standards, specifying data collection systems, implementing the tracking environment, analyzing data quality, and supporting airlines' digital strategies leveraging automation and data science. Everything she does is multilingual, as she speaks German, French, Spanish, and English. Her long-term intention is to join the Data for Good movement, which encourages using data in meaningful ways to solve humanitarian issues around poverty, health, human rights, education, and the environment.

Presentations

Self-reliant, secure, end-to-end data, activity, and revenue analytics - Roadmap for the airline industry Data Case Studies

Airlines want to know what happens after a user interacts with EveryMundo's technology on their websites. Do they convert? Do they close the browser and come back later? Previously dependent on airlines' analytics tools to prove value, Katharina explores how to implement a client-independent end-to-end tracking system.

Robin Way is a faculty member for banking at the International Institute of Analytics and the founder and president of the management analytics consultancy Corios. He has over 25 years of experience in the design, development, execution, and improvement of applied analytics models for clients in the credit, payments, lending, brokerage, insurance, and energy industries. Robin previously spent 12 years as a managing analytics consultant in SAS Institute's Financial Services Business Unit, in addition to another 10+ years in analytics management roles at several client-side and consulting firms.

Robin is author of Skate Where The Puck’s Headed: A Playbook for Scoring Big with Predictive Analytics. He lives in Portland, Oregon with his wife, Melissa and two sons, Colin and Liam. In his spare time, Robin plays soccer and holds a black belt in taekwondo.
Robin’s professional passion is devoted to democratizing and demystifying the science of applied analytics. His contributions to the field correspondingly emphasize statistical visualization, analytical data preparation, predictive modeling, time series forecasting, mathematical optimization applied to marketing, and risk management strategies. Robin’s undergraduate degree from the University of California at Berkeley and his subsequent graduate-level coursework emphasized the analytical modeling of human and consumer behavior.

Presentations

Leading Next Best Offer Strategies for Financial Services Findata

This session presents case study examples of next-best-offer strategies, predictive customer journey analytics, and behavior-driven time-to-event targeting for mathematically optimal customer messaging that drives incremental margins.

Daniel Weeks manages the Big Data Compute team at Netflix and is a Parquet committer. Prior to joining Netflix, Daniel focused on research in big data solutions and distributed systems.

Presentations

The evolution of Netflix's S3 data warehouse 40-minute session

In the last few years, Netflix's data warehouse has grown to more than 100PB in S3. This talk will summarize what we've learned, the tools we currently use and those we've retired, as well as the improvements we are rolling out, including Iceberg, a new table format for S3.

Thomas is a software engineer on the streaming platform team at Lyft and the Apache Apex PMC chair. Earlier he worked at a number of other technology companies in the San Francisco Bay Area, including DataTorrent, where he was a cofounder of the Apex project. Thomas is also a committer on Apache Beam and has contributed to several other ecosystem projects. He has worked on distributed systems for over 20 years, is a speaker at international big data conferences, and is the author of the book Learning Apache Apex.

Presentations

Near-real time Anomaly Detection at Lyft 40-minute session

Consumer-facing real-time processing poses a number of challenges in protecting against fraudulent transactions and other risks. The streaming platform at Lyft supports this with an architecture that brings together a data science-friendly programming environment and a deployment stack that meets the reliability, scalability, and other SLA requirements of a mission-critical stream processing system.

Mike Wendt is an Engineering Manager in the AI Infrastructure group at NVIDIA. His research work has focused on leveraging GPUs for big data analytics, data visualizations, and stream processing. Prior to joining NVIDIA, Mike led engineering work on big data technologies like Hadoop, Datastax Cassandra, Storm, Spark, and others. In addition, Mike has focused on developing new ways of visualizing data and the scalable architectures to support them. Mike holds a BS in computer engineering from the University of Maryland.

Presentations

Accelerating Financial Data Science Workflows With GPU 40-minute session

GPUs have allowed financial firms to run complex simulations, train myriad models, and mine data at unparalleled speeds. Today, the bottleneck has moved almost entirely to ETL. With the GPU Open Analytics Initiative (GoAi), we're accelerating ETL and keeping the entire workflow on GPUs. We'll discuss real-world examples, benchmarks, and how we're accelerating our largest financial services customers.
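
GoAi's GPU dataframe work surfaces in libraries such as cudf. Assuming a recent cudf build, a minimal sketch of keeping an ETL step GPU-resident might look like the following; the column names and aggregation are illustrative, not from the talk.

    # Minimal sketch of GPU-resident ETL with cudf (GoAi ecosystem), assuming
    # a recent cudf build; columns and the aggregation are illustrative.
    import cudf

    gdf = cudf.DataFrame(
        {"ticker": ["A", "B", "A", "B"], "price": [10.0, 20.0, 11.0, 19.0]}
    )
    gdf["fee"] = gdf["price"] * 0.01        # columnar transform runs on the GPU
    means = gdf.groupby("ticker").mean()    # aggregation also stays on the GPU
    print(means)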

Masha completed her PhD in Cognitive Science at New York University in 2015. She is now Director of the Investopedia Data Science team, where she works to answer questions such as “What can Investopedia’s readership tell us about current market sentiment?” and “What financial concepts are most interesting to American investors, from Wall Street to Silicon Valley?”

Presentations

Anxiety at scale: How Investopedia used readership data to track market volatility 40-minute session

As our businesses rely more heavily on user data to power our sites, products, and sales, can we give back by sharing those insights with users? Learn how Investopedia harnessed reader data to build an index that tracks market anxiety and moves with the VIX, a proprietary measure of market volatility. We’ll focus on thinking outside the box to turn data into tools for users, not just stakeholders.

Hee Sun Won is a principal researcher at the Electronic and Telecommunications Research Institute (ETRI) and leads the Collaborative Analytics Platform for BDaaS (big data as a service) and analytics for the Network Management System (NFV/SDN/cloud). Her research interests include multitenant systems, cloud resource management, and big data analysis.

Presentations

A Data Marketplace Case Study with Blockchain and Advanced Multitenant Hadoop in a Smart Open Data Platform 40-minute session

This session addresses how analytics services in data marketplace systems can run on a single Hadoop cluster spanning distributed data centers. We extend the Hadoop ecosystem's architecture with blockchain so that multiple tenants and authorized third parties can securely access data and perform various analytics while maintaining privacy, scalability, and reliability.

Brian has been an engineer on the AppNexus optimization team for five years. During his tenure at AppNexus, he has worked closely with budgeting, valuation, and allocation systems and has seen great changes and great mistakes. Coming from a pure mathematics background, Brian enjoys working on algorithmic, logic, and streaming-data problems with his team. In addition to control systems, data technologies, and real-time applications, Brian loves talking about process, teamwork, management, sequencers, synthesizers, and the NYC music scene.

Presentations

AppNexus's Stream-based Control System for Automated Buying of Digital Ads 40-minute session

Automating the success of digital ad campaigns is complicated and comes with the risk of wasting the advertiser's budget or a trader's margin and time. This talk describes the evolution of Inventory Discovery, a streaming control system for eligibility, prioritization, and real-time evaluation that helps digital advertisers hit their performance goals with AppNexus.

Tony Wu manages the Altus core engineering team at Cloudera. Previously, Tony was a team lead on the partner engineering team at Cloudera, where he was responsible for Microsoft Azure integration for Cloudera Director.

Presentations

Comparative Analysis of the Fundamentals of AWS and Azure 40-minute session

The largest infrastructure paradigm shift of the 21st century is the move to the cloud. Companies face the difficult and daunting decision of which cloud to go with, a decision that is not just financial and in many cases rests on the underlying infrastructure. In this talk we use our experience building production services on AWS and Azure to compare their strengths and weaknesses.

In the early days, he invented new ways to recognize people by the way they move; his main focus now is applying machine learning techniques to solve various problems in daily life.

Presentations

When Tiramisu Meets Online Fashion Retail 40-minute session

Large online fashion retailers face the problem of efficiently maintaining a catalogue of millions of items. Due to human error, it is not unusual for some items to have duplicate entries. Trawling through such a large catalogue manually is next to impossible. How would you prevent such errors? Find out how we applied deep learning as part of the solution.
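
The abstract doesn't spell out the method ("Tiramisu" names a deep-learning architecture), but one common approach to catalogue deduplication, sketched here as an assumption rather than the speakers' solution, is to embed product images with a pretrained CNN and flag near-identical pairs.

    # Hypothetical duplicate-detection sketch (not necessarily the speakers'
    # method): embed product images with a pretrained CNN, flag similar pairs.
    import numpy as np
    from tensorflow.keras.applications import ResNet50
    from tensorflow.keras.applications.resnet50 import preprocess_input
    from tensorflow.keras.preprocessing import image

    model = ResNet50(weights="imagenet", include_top=False, pooling="avg")

    def embed(path):
        # Load a product photo and produce a 2048-dim embedding.
        img = image.load_img(path, target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        return model.predict(x)[0]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # File names and the 0.95 threshold are placeholders to be tuned.
    e1, e2 = embed("item_001.jpg"), embed("item_002.jpg")
    print("possible duplicate" if cosine(e1, e2) > 0.95 else "distinct items")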

Juan Yu is a software engineer at Cloudera working on the Impala project, where she helps customers investigate, troubleshoot, and resolve escalations and analyzes performance issues to identify bottlenecks, failure points, and security holes. Juan also implements enhancements in Impala to improve customer experience. Previously, Juan was a software engineer at Interactive Intelligence and held developer positions at Bluestreak, Gameloft, and Engenuity.

Presentations

Leveraging Spark and deep learning frameworks to understand data at scale Tutorial

Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.
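
One pattern within the tutorial's scope is distributing model inference over a Spark dataset so that each executor loads its model once. The sketch below uses a stub model and hypothetical paths, since the tutorial's actual frameworks and datasets aren't specified here.

    # Sketch of one pattern in scope: distributing inference with
    # mapPartitions so the model is loaded once per partition, not per row.
    # StubModel, load_model(), and the file paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dl-inference").getOrCreate()

    class StubModel:
        def predict(self, features):
            return float(sum(features))   # stand-in for a real DL framework call

    def load_model(path):
        return StubModel()                # hypothetical loader

    def predict_partition(rows):
        model = load_model("/models/classifier")   # once per partition
        for row in rows:
            yield (row["id"], model.predict(row["features"]))

    df = spark.read.parquet("/data/features")      # hypothetical input
    preds = df.rdd.mapPartitions(predict_partition)
    preds.toDF(["id", "prediction"]).write.parquet("/data/predictions")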

Nir Yungster leads the Data Science team at JW Player, where the team focuses on building recommendation engines as a service to online video publishers. He received his Bachelor’s Degree in Aerospace Engineering from Princeton and his Master’s in Applied Mathematics from Northwestern University.

Presentations

Building Turn-key Recommendations for 5% of Internet Video 40-minute session

Building a video recommendation model that serves millions of monthly visitors is a challenge in itself. At JW Player, we face the challenge of providing on-demand recommendations as a service to thousands of media publishers. We focus on how to systematically improve model performance while navigating the many engineering challenges and unique needs of the diverse publishers we serve.

Varant Zanoyan is a software engineer on the ML Infrastructure team at Airbnb where he works on tools and frameworks for building and productionizing ML models. Before Airbnb, he worked on solving data infrastructure problems at Palantir Technologies.

Presentations

Zipline - Airbnb's Data Management Platform for Machine Learning 40-minute session

Zipline is Airbnb's data management platform specifically designed for ML use cases, soon to be open sourced. It has cut the time for feature generation from months to days, and it offers features to support end-to-end data management for machine learning. This talk covers Zipline's architecture and dives into how it solves ML-specific problems.

Xiaohan Zeng is a Software Engineer on the Machine Learning Infrastructure team at Airbnb. He majored in Chemical Engineering at Tsinghua University and Northwestern University, but started to pursue a career in software engineering and machine learning after doing research in data science. Prior to joining Airbnb, he worked on the Machine Learning Platform team at Groupon for 3 years. Outside work, he enjoys reading, writing, traveling, movies, and trying to follow his daughter around when she suddenly decides to practice walking.

Presentations

Bighead: Airbnb's End-to-End Machine Learning Platform 40-minute session

We introduce Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Bighead integrates popular libraries including TensorFlow, XGBoost, and PyTorch. It is built on Python, Spark, and Kubernetes and is designed to be used in modular pieces. It has reduced overall model development time from many months to days at Airbnb.

Currently leading an excellent engineering team that provides big data services (HDFS, YARN, Spark, TensorFlow, and beyond) to power LinkedIn's business intelligence and relevance applications.

Apache Hadoop PMC member; led the design and development of HDFS Erasure Coding (HDFS-EC).

Presentations

TonY -- Native support of TensorFlow on Hadoop 40-minute session

We have developed TensorFlow on YARN (TonY), a framework to run TensorFlow natively on Hadoop. TonY enables running distributed TensorFlow training as a new type of Hadoop application. Its native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop, such as MapReduce and Spark.
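
Under the hood, distributed TensorFlow tasks discover their peers through a TF_CONFIG cluster spec, and a framework like TonY derives that spec from the YARN containers it negotiates. A sketch of what each task container ends up with follows; the hosts and ports are hypothetical.

    # What a framework like TonY arranges for each task container: a TF_CONFIG
    # cluster spec telling distributed TensorFlow who its peers are. Hosts and
    # ports below are hypothetical placeholders.
    import json
    import os

    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "ps":     ["host1:2222"],
            "worker": ["host2:2222", "host3:2222"],
        },
        "task": {"type": "worker", "index": 0},   # this container's role
    })
    # TensorFlow's distributed runtime (e.g., estimator-based training)
    # reads TF_CONFIG on startup to join the cluster.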

Xiaoyong Zhu is a program manager in the Microsoft Cloud AI group. His current focus is building scalable deep learning algorithms.

Presentations

Deep Learning on audio in Azure to detect sounds in real-time 40-minute session

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds: a dog barking, alarms, people calling from behind, and so on. Most of us take this for granted, yet there are over 360 million people in the world who are deaf or hard of hearing. How can we make the auditory world inclusive, and meet the great demand in other sectors, by applying deep learning on audio in Azure?
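
Sound-detection pipelines of this kind commonly convert raw audio into log-mel spectrograms before feeding a convolutional classifier. The sketch below uses librosa as a generic illustration, not necessarily the Azure pipeline presented in this session; the clip name is a placeholder.

    # Generic audio front end for sound detection (not necessarily this
    # session's Azure pipeline): turn a clip into a log-mel spectrogram
    # suitable as 2-D input to a convolutional classifier.
    import librosa
    import numpy as np

    y, sr = librosa.load("doorbell.wav", sr=16000)   # hypothetical clip
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    log_mel = librosa.power_to_db(mel, ref=np.max)   # shape: (64, n_frames)
    print(log_mel.shape)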